Yes, latency is the problem. It can kinda be done if you're singing a part or maybe playing a non-percussive instrument.
When I work with Pro Tools at home I switch to "low-latency mode", which lets me monitor my input directly instead of hearing the sound after it has made the round trip through analog-to-digital conversion and back to analog. That round trip only takes a few milliseconds, but it does make a difference when trying to play exactly with a drummer.
Now consider Zoom converting analog to digital, then sending the audio through your modem and across the internet (Speedtest for me shows a 30ms "ping" delay), then through your friend's modem (with their ping delay) and Zoom's digital-to-analog conversion on their end. Now try to clap your hands exactly in time together.
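To put rough numbers on that chain, here is a back-of-the-envelope sketch that just adds up the stages. The only figure taken from my own setup is the 30 ms ping; the 3 ms conversion times and my friend's 30 ms ping are guesses for illustration, and the function name is made up. It also ignores whatever buffering Zoom adds, so the real delay is probably worse.

```python
def one_way_delay_ms(adc_ms, my_ping_ms, friend_ping_ms, dac_ms):
    """Add up each stage a sound passes through on its way to the other player."""
    return adc_ms + my_ping_ms + friend_ping_ms + dac_ms


# 30 ms is the Speedtest ping quoted above; every other number is an assumption.
delay = one_way_delay_ms(adc_ms=3, my_ping_ms=30, friend_ping_ms=30, dac_ms=3)

# I hear my friend's response only after my sound reaches them and theirs comes back.
round_trip = 2 * delay

# For scale: the length of a sixteenth note at 120 beats per minute.
sixteenth_at_120_bpm = 60_000 / 120 / 4

print(f"one way: {delay} ms, there and back: {round_trip} ms")
print(f"sixteenth note at 120 bpm: {sixteenth_at_120_bpm} ms")
```

With those guesses the there-and-back delay comes out longer than a sixteenth note at 120 bpm, which is why the hand clap never lines up.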
As Richard said above, any "split-screen" style musical collaborations that you see must have been assembled after the fact. Somebody had to go first. The others record themselves playing along and then it's all assembled in post-production.
There's still some very creative music to be made this way but sadly it's not a real-time jam...
Jimmy J