what is acceptable jitter for voip and videoconferencing?

Dear nanog-ers:

I go back many, many years as to baseline numbers for managing voip networks, including things like CISCO LLQ, diffserv, fqm prioritizing vlans, and running voip networks entirely separately… I worked on codecs, such as oslec, and early sip stacks, but that was over 20 years ago.

The thing is, I have been unable to find much research (as yet) as to why my number exists. Over here I am taking a poll as to what number is most correct (10ms, 30ms, 100ms, 200ms),

https://www.linkedin.com/feed/update/urn:li:ugcPost:7110029608753713152/

but I am even more interested in finding cites to support various viewpoints, including mine, and learning how SLAs are met to deliver it.

Hi Dave,

I don't know your use case but bear in mind that jitter impacts gaming
as well, and not necessarily in the same way it impacts voip and video
conferencing. Voip can have the luxury of dynamically growing the
jitter buffer. Gaming... often does not.

Just mentioning it so you don't get blind-sided.

Regards,
Bill Herrin

I go back many, many years as to baseline numbers for managing voip networks, including things like CISCO LLQ, diffserv, fqm prioritizing vlans, and running
voip networks entirely separately... I worked on codecs, such as oslec, and early sip stacks, but that was over 20 years ago.

I don't believe LLQ has utility in hardware-based routers; packets
stay inside hardware-based routers for single-digit microseconds with
nanoseconds of jitter. For software-based devices, I'm sure the
situation is different.
Practical example: a tier-1 network running three vendors, with no LLQ,
can carry traffic across the globe with lower jitter (microseconds) than
I can ping 127.0.0.1 on my M1 laptop, because I have to do context
switches and the network does not. This is in the BE queue, measured in
real operation over long periods, without any engineering effort to
achieve low jitter.
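
To make the loopback comparison concrete, here is a minimal sketch (assuming Python 3; it applies the RFC 3550 smoothing formula to round-trip samples rather than one-way RTP timestamps, which is a simplification):

import socket, time

# Bounce small datagrams off the loopback interface and keep the
# RFC 3550-style smoothed jitter estimate J = J + (|D| - J)/16.
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("127.0.0.1", 0))            # ephemeral port on loopback
addr = sock.getsockname()

jitter = 0.0                            # smoothed estimate, seconds
prev_rtt = None
for _ in range(1000):
    t0 = time.perf_counter()
    sock.sendto(b"x" * 172, addr)       # roughly a 20 ms G.711 frame plus header
    sock.recvfrom(2048)                 # the datagram comes straight back to us
    rtt = time.perf_counter() - t0
    if prev_rtt is not None:
        d = abs(rtt - prev_rtt)
        jitter += (d - jitter) / 16.0
    prev_rtt = rtt
    time.sleep(0.02)                    # pace it like a 20 ms voice stream

print("loopback jitter estimate: %.1f microseconds" % (jitter * 1e6))

Even with nothing but the kernel in the path, the context switches alone show up in that estimate.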

The thing is, I have been unable to find much research (as yet) as to why my number exists. Over here I am taking a poll as to what number is most correct (10ms, 30ms, 100ms, 200ms),

I know there are academic papers as well as vendor graphs showing the
impact of jitter on quality. Here is one:
https://scholarworks.gsu.edu/cgi/viewcontent.cgi?article=1043&context=cs_theses
- this appears to roughly say that 20ms with G.711 is fine. But I'm sure
this is actually very complex to answer well, and I'm sure the choice of
codec greatly impacts the answer; WhatsApp uses Opus, Skype uses SILK
(maybe Teams too?). And there are many more rare/exotic codecs
optimised for very specific scenarios, like massive packet loss.

I think it all goes back to the earliest MOS tests (“Hold up the number of fingers for how good the sound is”) and every once in a while somebody actually does some testing to look for correlations.

Though it’s 15 years old, I like this thesis for the writer’s reporting: https://scholarworks.gsu.edu/cgi/viewcontent.cgi?article=1043&context=cs_theses

In particular, this table shows the correlation, and is consistent with what I would expect.

Lee

We run Teams Telephony in $DAYJOB, and it does use SILK.

Looks like codecs are still rapidly evolving in walled gardens. I just
learned about 'Satin'.

[screenshot of the Teams admin portal, hosted at ImgBB] - notice the
'payload description' field. So at least in some cases Teams switches
from SILK to Satin; the wiki suggests 1:1 calls only, but I can't
confirm or deny this.

My understanding has always been that 30ms was set based on human perceptibility. 30ms was the average point at which the average person could start to detect artifacts in the audio.

Artifacts in audio are a product of packet loss or jitter resulting in codec issues that lead to humanly perceptible audio anomalies, not so much latency by itself. Two-way voice is remarkably NOT terrible on a 495ms-RTT geostationary satellite connection as long as there is little or no packet loss.

Thank you all for your answers here, on the poll itself, and for papers like this one. The consensus seems to be settling around 30ms for VOIP with a few interesting outliers and viewpoints.

https://scholarworks.gsu.edu/cgi/viewcontent.cgi?article=1043&context=cs_theses

Something that came up in reading that, which I half remember from my early days of working with VoIP (on Asterisk), was that silence suppression that did not send any packets was in general worse than sending silence (or comfort noise), for two reasons: one was NAT bindings closing,
and the other was that the steady stream also helped control congestion and had smaller jitter swings.

So in the deployments I was doing at the time, I universally disabled this feature on the phones I was using.

In my mind, this thesis demonstrated that point (to an extreme!), particularly in a network that is packet- (not byte-) buffer limited:

https://www.duo.uio.no/bitstream/handle/10852/45274/1/thesis.pdf

But my question now is: are we still doing silence suppression (not sending packets) on VoIP nowadays?
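
To illustrate the NAT half of that argument, here is a toy model (my own sketch, with an assumed 30-second UDP binding timeout and a made-up talk/silence pattern, nothing measured) of the longest packet gap a receiver sees with and without comfort-noise frames during silence:

import random
random.seed(7)

NAT_UDP_TIMEOUT_S = 30.0     # assumed consumer-CPE UDP binding timeout
FRAME_S = 0.020              # one voice frame every 20 ms when we do transmit

# A crude one-hour conversation: short talk spurts, silences that are
# sometimes long (one party just listening, or on mute).
periods, t = [], 0.0
while t < 3600:
    talk = random.uniform(1, 8)
    silence = random.choice([random.uniform(0.2, 3), random.uniform(20, 120)])
    periods += [("talk", talk), ("silence", silence)]
    t += talk + silence

def longest_gap(send_comfort_noise):
    gap = 0.0
    for kind, duration in periods:
        if kind == "silence" and not send_comfort_noise:
            gap = max(gap, duration)     # silence suppression: nothing on the wire
        else:
            gap = max(gap, FRAME_S)      # steady 20 ms stream keeps the binding warm
    return gap

for cn in (True, False):
    gap = longest_gap(cn)
    verdict = "fine" if gap < NAT_UDP_TIMEOUT_S else "NAT binding can expire"
    print("comfort noise=%s: longest gap %.1fs -> %s" % (cn, gap, verdict))

The jitter/congestion half of the argument is the same pacing effect discussed further down the thread.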

Hi Tom,

Jitter doesn't necessarily cause artifacts in the audio. Modern
applications implement what's called a "jitter buffer." As the name
implies, the buffer collects and delays audio for a brief time before
playing it for the user. This allows time for the packets which have
been delayed a little longer (jitter) to catch up with the earlier
ones before they have to be played for the user. Smart implementations
can adjust the size of the jitter buffer to match the observed
variation in delay so that sound quality remains the same regardless
of jitter.
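
To make that concrete, here is a toy play-out simulation (my own sketch, not any particular product's algorithm, with made-up delay numbers) showing how late frames fall off as the buffer grows, and how an adaptive buffer might size itself from observed delays:

import random, statistics
random.seed(1)

FRAME_MS = 20
# One-way delay per 20 ms frame: 40 ms base plus up to 30 ms of jitter.
delays = [40 + random.uniform(0, 30) for _ in range(500)]

def late_frames(buffer_ms):
    """Frames that miss their play-out deadline with a fixed-depth buffer."""
    late = 0
    playout_offset = delays[0] + buffer_ms   # schedule anchored on the first frame
    for seq, delay in enumerate(delays):
        arrival = seq * FRAME_MS + delay
        deadline = seq * FRAME_MS + playout_offset
        if arrival > deadline:
            late += 1                        # concealed or dropped -> audible artifact
    return late

for buf in (0, 10, 20, 30):
    print("%2d ms buffer -> %d late frames of %d" % (buf, late_frames(buf), len(delays)))

# An adaptive buffer would instead track something like a high percentile
# of recently observed delay variation and resize itself toward that:
recent = delays[:100]
target = statistics.quantiles(recent, n=50)[-1] - min(recent)
print("adaptive target: about %.0f ms above the minimum delay" % target)

Real implementations typically resize at talk-spurt boundaries so the listener doesn't hear the adjustment.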

Indeed, on Zoom I barely noticed audio artifacts for a friend who was
experiencing 800ms jitter. Yes, really, 800ms. We had to quit our
gaming session because it caused his character actions to be utterly
spastic, but his audio came through okay.

The problem, of course, is that instead of the audio delay being the
average packet delay, it becomes the maximum packet delay. You start
to have problems with people talking over each other because when they
start they can't yet hear the other person talking. "Sorry, go ahead.
No, you go ahead."

Regards,
Bill Herrin


The problem, of course, is that instead of the audio delay being the
average packet delay, it becomes the maximum packet delay.

Yes. I talked to this point in my apnic session here: https://blog.apnic.net/2020/01/22/bufferbloat-may-be-solved-but-its-not-over-yet/

I called it “riding the TCP sawtooth” - the compensating VoIP delay becomes equal to the maximum size of the buffer, and thus controls the jitter that way. Sometimes to unreasonable extents, like 800ms in your example.

When I wrote my first implementation of telnet ages ago, I was both amused and annoyed by the go-ahead option. Obviously patterned after audio meat-space protocols, but I was never convinced it wasn't a solution in search of a problem. I wonder if CDMA was really an outgrowth of those protocols?

But it's my impression that gaming is by far more affected by latency, and thus by jitter buffers, than voice is. Don't some ISPs even cater to gamers about latency?

Mike

Hi Dave,

You did not say: is it interactive? Because we could use big buffers and convert jitter to latency (some STBs have sub-second buffers).

Then jitter effectively becomes zero (more precisely: not a problem), and we deal only with the latency consequences.

Hence, your question is not about jitter, it is about latency.

By all 5 (or 6?) senses, the human is a 25ms-resolution machine (limited by the animal part of our brain: the limbic system). Anything faster is “real-time”. Even echo cancellation is not needed – we hear the echo but cannot separate the signals.

A dog has 2x better resolution, a cat 3x. They probably hate the picture on cheap monitors (PAL/SECAM had 50Hz, NTSC had 60Hz).

The 25ms is for everything, round trip. 8ms is spent just on visualization, even on the best screens (120Hz).

The typical budget left for the networking part (speed of light in the fiber) is about 5ms one way (1000km, or do you prefer miles?).

Maybe worse, depending on rendering in the GPU (3ms?), processing in the app (3ms?), the sensor capturing the initial signal (1ms?), and so on.

The worst problem is that the jitter buffer is subtracted from the same 25ms budget.

Hence, it is easy for the jitter buffer to consume the 10ms that we typically have for networking and end up in a situation where we are left with just 1ms, which pushes us to install MEC (distributed servers in every municipality).

Accounting for the jitter buffer, it is pretty hard to be “real-time” for humans.
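
The arithmetic, using the rough numbers above (assumptions, not measurements):

HUMAN_BUDGET_MS = 25.0            # "real-time" threshold claimed above

local_costs_ms = {
    "display refresh (120 Hz)": 8.0,
    "GPU rendering": 3.0,
    "application processing": 3.0,
    "input sensor": 1.0,
}
local = sum(local_costs_ms.values())
network_rtt = HUMAN_BUDGET_MS - local
print("local pipeline %.0f ms, leaving %.0f ms RTT (about %.0f ms one way, ~1000 km of fiber)"
      % (local, network_rtt, network_rtt / 2))

for jitter_buffer_ms in (0, 5, 9):
    print("with a %d ms jitter buffer: %.0f ms RTT left for the network"
          % (jitter_buffer_ms, network_rtt - jitter_buffer_ms))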

Hint: “pacing” is the solution. The application should send packets at equal intervals. It has been very much adopted by all the OTTs.

By the way, “pacing” has many other positive effects on networking.
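
A minimal sketch of what pacing means at the application layer (my illustration only; the destination address is a placeholder, and real stacks typically pace inside the transport or the OS):

import socket, time

DEST = ("192.0.2.10", 4000)      # placeholder destination (TEST-NET-1)
FRAME_INTERVAL = 0.020           # one voice frame every 20 ms

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
next_deadline = time.perf_counter()

for seq in range(500):
    payload = seq.to_bytes(4, "big") + b"\x00" * 160   # fake 20 ms G.711 frame
    sock.sendto(payload, DEST)
    # Sleep to an absolute schedule rather than for a fixed interval, so local
    # scheduling hiccups do not accumulate into bursts (and thus into jitter).
    next_deadline += FRAME_INTERVAL
    remaining = next_deadline - time.perf_counter()
    if remaining > 0:
        time.sleep(remaining)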

The next level is about our reaction time (the ability to click). That is 150ms for some people, 250ms on average.

Hence, gaming is noticeably affected by 50ms of one-way latency, because 2*50ms becomes comparable to 150ms – it affects the gaming experience. In addition to seeing the delay, we lose time – the enemy would shoot us first.

The next level (for non-interactive applications) is limited only by the memory that you could devote to the jitter buffer.

The cinema would be fine even with a 5s jitter buffer. Except for zapping time, but that is a different story.

Eduard

When I wrote my first implementation of telnet ages ago, I was both amused and annoyed by the go-ahead option. Obviously patterned after audio meat-space protocols, but I was never convinced it wasn't a solution in search of a problem. I wonder if CDMA was really an outgrowth of those protocols?

Typically seen with half-duplex implementations, like "Over" in two-way radio. Still used in TTY/TDD as "GA".

But it's my impression that gaming is by far more affected by latency, and thus by jitter buffers, than voice is. Don't some ISPs even cater to gamers about latency?

Yep. Dilithium crystal futures are up due to gaming industry demand. ;-)

Did that ever actually occur over the internet such that telnet would need it? Half duplex seems to be pretty clearly an L1/L2 problem. IIRC, it was something of a pain to implement.

Mike

Telnet sessions were often initiated from half-duplex terminals. Pushing that flow control across the network helped those users.

I'm still confused. Did it require the telnet users to actually take action? Like they'd manually need to enter the GA option? It's very possible that I didn't fully implement it if so since I didn't realize that.

Mike

The implementation would look at the terminal characteristics and enable it as required.