Network bandwidth vs. latency

I ran across an ad for a device (maybe includes software) that claims to reduce latency so you can play music with other people in real time over the Internet.

https://www.sweetwater.com/insync/realtime-audio-buying-guide/

I am skeptical.

I am a software guy but not a network guy. The way I explain latency and bandwidth to people is: imagine a long water hose. When you turn the water on, it takes a minute for the water to get from the spigot to the other end of the hose. Maybe you can fill a bucket in 30 seconds. If you make the hose bigger, you can push more water through it, and maybe fill your bucket in 15 seconds. But it still takes a minute for the water to get from the spigot to the other end of the hose. If you have a gigabit network connection, you can download files lightning fast, but it doesn't decrease latency.

  1. Do I know what I’m talking about?
  2. Can a product like this that is just an endpoint device really do anything about network latency?

I'm wondering if, rather than reducing latency, it synchronizes the feeds so that everyone has the same latency?

Latency wrecks trying to play together. Everyone hears everyone else playing behind them, so you really do need to get it down. Digital audio has always had a latency problem, simply because samples occur at a non-zero interval and delivery pipelines incur delays measured in whole samples. The device linked claims latency between 8 ms and 55 ms. It is worth remembering that sound propagates about one foot every millisecond. And the precedence (aka Haas) effect tells us that sounds arriving within about 10 milliseconds of one another are perceived as the same sound, while sounds further apart than that are perceived as separate. You really want to get the latency down under 10 ms.
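To put those numbers in perspective, a quick back-of-the-envelope conversion (using only the figures quoted above):

```python
# Latency expressed as an equivalent acoustic distance:
# sound travels roughly one foot per millisecond (~343 m/s).
SPEED_OF_SOUND_FT_PER_MS = 1.13

for latency_ms in (8, 10, 55):
    feet = latency_ms * SPEED_OF_SOUND_FT_PER_MS
    print(f"{latency_ms} ms of latency ~ standing {feet:.0f} ft apart")
```

So the low end of the claimed range is like playing across a small stage, while the high end is like trying to play with someone about 60 feet away.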

Latency in a network is dominated by the network protocols, not the medium: the signals themselves propagate at a good fraction of the speed of light. Where you get into trouble is when audio samples are aggregated into packets, and when the network infrastructure queues those packets. Sending audio over TCP with a packet size of 1500 bytes (which might seem the simple answer to most things) means your stereo 48 kHz, 16-bit sampled audio takes about 8 ms just to fill one packet. You have blown your latency budget before the packet even leaves your machine.
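The arithmetic behind that buffer-fill claim, as a sketch:

```python
# Time for stereo 48 kHz, 16-bit audio to fill one 1500-byte packet.
sample_rate = 48_000      # samples per second
channels = 2              # stereo
bytes_per_sample = 2      # 16-bit

bytes_per_second = sample_rate * channels * bytes_per_sample  # 192,000
fill_time_ms = 1500 / bytes_per_second * 1000
print(f"{fill_time_ms:.1f} ms just to fill the packet")  # ~7.8 ms
```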

There are existing protocols that attempt to address this. RTP (used, for instance, by WebRTC) attacks latency directly. First up, don't bother with TCP: if a packet is lost you don't want to wait for a retransmission, that audio is gone forever. Next, use small transmit packets. But you don't want to end up with juddering audio, so you want to manage packet loss, and do so seamlessly. You will need a tiny bit of buffering to avoid really bad problems with jitter, but only as much as you need. It may even be possible to use AI-based tricks to smooth over glitches in the delivered audio, in a manner that allows aggressive reduction in buffering.
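A minimal sketch of the "small packets over UDP, never wait for a lost one" idea (this is not RTP itself; the peer address is a placeholder and the frame size is my own choice):

```python
import socket
import struct

FRAME_SAMPLES = 96   # 2 ms of mono 48 kHz audio, 192 bytes per packet
PEER = ("peer.example.com", 5004)   # placeholder address

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)  # UDP, not TCP
seq = 0

def send_frame(samples_16bit: bytes) -> None:
    """Fire and forget: the sequence number lets the receiver detect
    loss and conceal it; there is no retransmission to wait for."""
    global seq
    sock.sendto(struct.pack("!I", seq) + samples_16bit, PEER)
    seq += 1
```

A real implementation would carry proper RTP timestamps as well, and the receiver would hold a small jitter buffer, just deep enough to absorb arrival-time variation.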

Latency is a hard problem, and is the often glossed over price we pay when using any digital audio (or video) system. If everything was still done in the analogue domain latency would not be an issue.

There is a persistent joke in the live mixing domain: the "suck" button, a secret button on a mixing desk that a mixing engineer can engage to make a band that has pissed him off sound terrible. Such a button is easy. Just send the foldback through a short delay. The band will be incapable of playing in time, and will truly suck.
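For the curious, the whole "button" is one delay buffer. A toy version (the delay time is my guess at what would be suitably evil):

```python
import numpy as np

def suck_button(foldback: np.ndarray, sample_rate: int = 48_000,
                delay_ms: int = 40) -> np.ndarray:
    """Delay the monitor feed by a few tens of ms, enough that the
    band can no longer play in time with what they hear."""
    pad = np.zeros(sample_rate * delay_ms // 1000, dtype=foldback.dtype)
    return np.concatenate([pad, foldback])[: len(foldback)]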

I recall an exhibit in the Science Center in Toronto that basically consisted of a microphone and headphones (state of the art, 1970s). You talked, and the headphones played your speech back delayed by a second or two. This was so disconcerting that it was impossible to keep speaking coherently; hearing yourself delayed disrupted your ability to compose words.

Then the internet, covid, and zoom came along and we all learned to live with that. (Although the early days of video conferencing with the 2-second delays, resulting in everyone pausing then speaking over each other, stopping and starting - that was good training.)

Francis makes some good points. There are specific protocols for VoIP calls that tell the networks "this traffic needs priority for timely delivery" (as opposed to data like web pages, which can stand a delay, or streaming video, where the risk of delay is absorbed by buffering). But if the connection itself sucks, there's not much you can do to improve it.
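For what it's worth, requesting that priority is a one-line socket option. A sketch (Linux; DSCP 46, "Expedited Forwarding", is the usual VoIP marking, though routers along the path are free to ignore it):

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
# Mark packets as Expedited Forwarding (DSCP 46). The DSCP value
# occupies the top six bits of the old TOS byte, hence the shift.
sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, 46 << 2)
```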

That and the “COFFEE” speech synthesizer were two of my favourites as a kid!

A classic Far Side, though it’s possible it goes back further:

There can be some tradeoffs in router setup if you want good (i.e., low) latency for your live audio and your gaming. For example, on this page

there is a table comparing a default router setup versus turning on “smart queue management”. When the regular connection is heavily loaded with downloads and uploads, buffering results in a 20–30 ms increase in ping times over a base of 12 ms, a huge amount. With queue management turned on, bandwidth is throttled to 90% but the ping time is kept down to 12 ms.
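The mechanism behind those loaded-link numbers is simple queueing arithmetic. A rough sketch (the buffer and link figures here are made up, but typical):

```python
# A ping packet has to wait behind whatever bulk traffic is already
# queued in the router: queue delay = queued bytes / link rate.
link_mbit = 100      # downstream link speed (illustrative)
queued_kb = 300      # router buffer full of download traffic (illustrative)

delay_ms = queued_kb * 8 / link_mbit   # kbit / (Mbit/s) comes out in ms
print(f"~{delay_ms:.0f} ms of added latency")   # ~24 ms
```

Smart queue management keeps that queue short by throttling slightly below the link rate, which is why it gives up about 10% of the bandwidth to hold the ping at 12 ms.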

I suspect the cartoon is the origin. However, the delay trick is a real thing; I have heard first-hand of its use.

Someone posted in another thread a video demonstration of a DIY version:

And a calculator! It could do square roots too, but when you got to square root of a square root of a square root… it started to slow down on the calculations if you kept hitting that button.

A walk through corridor where foam walls killed all ambient noise. A display where you twiddled knobs for R, G, and B (technically, Magenta, Cyan and Yellow) to try to match a given colour. And a lathe shaping chunks of aluminum using a classic coke bottle as a guide, so they were making aluminum coke bottle sculptures.

The future’s so bright, I gotta wear shades…♫

…but then I went through there 25 years later and nothing had changed, thanks to budget cuts… except some displays were broken.

I’m not even sure what that means. If you could theoretically synchronize latency (I have no idea how) I suppose that means that the latency from any point to any other point is the same. If you do that, then a sound played from point A will arrive at all the other points at the same time. But it will still be late. This doesn’t matter for one-way transmission. But if you send a sound back from point B when you hear the received sound, there is still a big delay at point A between sending the original sound, and receiving the sound played at point B. Too big a delay for musicians at points A and B to play together.
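A trivial worked example of why equalized latency doesn't save you (the 30 ms is just an illustrative figure):

```python
# Even with identical one-way latency d between all points, a musical
# call-and-response is still a full round trip for the caller.
one_way_ms = 30    # illustrative
print(f"A hears B's answer {2 * one_way_ms} ms after the original note")
```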

One tangential point I raised was that a lot of people think if they get the biggest bandwidth available then they can play music together. My understanding is that it doesn’t help, because bandwidth and latency are unrelated characteristics. Is that correct?

My thought (which is clearly wrong and was just a WAG) was that it looked at the latency of each stream and applied a delay to all but the laggiest so they were the same. While technically easy, that wouldn’t work for someone trying to play along with it.

You could agree on one person as the timekeeper (who doesn’t hear anyone else), and then everyone else (who hears that musician, but not each other) stays in synch with the timekeeper (as they hear them). Then, after the session, the software takes all of the recordings and synchs them together based on the timekeeper.
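The after-the-fact synching step is quite doable. One hedged sketch using cross-correlation (this assumes every recording captured the timekeeper's feed, e.g. as headphone bleed or a click track):

```python
import numpy as np

def lag_in_samples(timekeeper: np.ndarray, track: np.ndarray) -> int:
    """How many samples `track` lags the timekeeper, estimated from
    the peak of their cross-correlation."""
    corr = np.correlate(track, timekeeper, mode="full")
    return int(np.argmax(corr)) - (len(timekeeper) - 1)

# Shift each recording back by its lag before mixing them together.
```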

You could also take the audio streams and seriously degrade the quality in every way except timing, so that there was less data. It’d sound bad to all the bandmates, but they’d still be able to (try to) synch with it, and again for the final product you’d use the undegraded streams. With much less data to transmit (due to the degradation), you’d decrease some of the sources of latency (though not all, of course: There’s nothing you can do about the speed of light).
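Rough numbers on how much that degradation could buy (the formats are my own illustrative picks):

```python
full = 48_000 * 2 * 2    # 48 kHz stereo 16-bit: bytes per second
rough = 8_000 * 1 * 1    # 8 kHz mono 8-bit "practice feed"
print(f"{full // rough}x less data to move")   # 24x
```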

I’ve seen some cooperative multiplayer games where, in some situations, both players need to do something very close to simultaneously (like, within a fraction of a second). Lag could very easily be longer than that, so the game provided a countdown tool: Either player could press a button that would show a “3…2…1…DING” on both screens, timed so that if both players did their thing right at the DING, then with the current lag, both signals would reach the server at the same time.
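The countdown trick amounts to starting each player's timer early by their own one-way delay to the server. A sketch (the function and parameter names are mine):

```python
def extra_wait_ms(my_rtt_ms: float, max_rtt_ms: float) -> float:
    """How much later than the laggiest player this client should show
    the DING, so both button presses reach the server together."""
    return (max_rtt_ms - my_rtt_ms) / 2   # difference in one-way delay

# A player with 40 ms RTT, paired with one at 120 ms, waits 40 ms extra.
print(extra_wait_ms(40, 120))   # 40.0
```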

You are correct. People think they need loads of bandwidth, when they probably don’t. At work, our VPN setups typically mean that most of our laptops have max 3-4Mb/sec throughput, but we are in international video meetings in Teams and nobody notices any issues. In contrast, my home network has gigabit Internet, but it doesn’t mean that the YouTube videos I watch are any better than when I watch them on my laptop (channeled through that tiny VPN pipe). I would need a full house of teenagers streaming video to saturate that connection.

Most of the issues we deal with at work are related to latency, not bandwidth. Something like Microsoft's SMB protocol for file shares suffers terribly with increased latency: in the same data center a 20 MB file transfers almost instantly, while a user in Europe downloading the same file from NY will find it takes a very long time.

This is because SMB is very chatty, sending data in 64 KB blocks and waiting for the response, so your maximum throughput is 64 KB divided by the RTT in seconds. A person in the UK might see 75 ms of RTT, so the best possible SMB transfer would be around 800 KB per second. A person in the same datacenter might have 250 microseconds of latency, for a maximum of around 250 MB per second. In either case, bandwidth is not the limiting factor.
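That rule of thumb in code, using the same figures as above:

```python
BLOCK = 64 * 1024   # SMB sends one block, then waits for the response

for who, rtt_s in [("UK user", 0.075), ("same datacenter", 0.000250)]:
    max_rate = BLOCK / rtt_s                   # bytes per second
    print(f"{who}: ~{max_rate / 1e6:.1f} MB/s ceiling")
# UK user: ~0.9 MB/s; same datacenter: ~262 MB/s
```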

I cherry-picked a protocol that performs terribly over a WAN, but in our environments it is most common for application performance to be limited by latency.

To complicate things, corporate environments have all kinds of load balancers, edge routers, firewalls, stateful filters, VPN tunnels, and other things that hold up packets along the way, making performance troubleshooting a challenge. For example, a firewall doing stateful filtering will likely wait for all packets of an HTTP response to arrive so it can reassemble the full response and apply filtering rules, which can make packets appear to be held up for hundreds of milliseconds.

Back in the COVID days we had fleeting thoughts of band practice over Zoom, and instantly dismissed it as a total non-starter, since the latency is so far outside everyone's control. Maybe if you had a drummer who sets the groove without listening to the band, so everyone can play along as they hear the drum beats, it might be bearable.

This would be similar to how marching bands produce a coherent performance while spread out over hundreds of feet: the beats might come from far away, but they only need to play in time with the music as they hear it, ignoring any visual cues–often the beats come from a hundred feet behind them. In the end, multiple layers of musicians produce a wave of music that is perfectly in time as it reaches the reviewing stand.