Forward Erasure Correction (FEC) and network performance

Hello;

I work with FEC in various ways, mostly to protect video streams against packet loss, including as co-chair
of the IETF FECFRAME WG and in the Video Services Forum. Most FEC use is driven by congestion at the edge, RF issues on wireless LANs, etc., but there is always
the chance of loss in transit over the wider network. In many important cases, in fact (e.g., transfer of video from a content creator to
an IPTV service provider, or enterprise-to-enterprise video conferencing), the loss at the edges can be controlled, leaving only network transit to
worry about.

This question has thus come up from time to time, and I was hoping that the assembled NANOG might be able
to either answer it or provide pointers to the literature:

What level of packet loss would trigger a response from network operators? How bad does sustained packet loss need
to be before it is viewed as a problem to be fixed? Conversely, what is a typical packet loss fraction during periods
of good network performance?

If there is some consensus around this, it would effectively set an upper bound for the need for FEC in network transit.

I would be glad to accept replies in confidence off list if people don't want their
networks to be identified.

(To be clear, I am aware that many ISPs offer some sort of MPLS service with a packet loss SLA for video carriage. I am really asking about
Internet transport here, although I would be pleased to learn of MPLS statistics if anyone wants to provide them.)

Regards
Marshall Eubanks

There will be two consensuses (consensai?).

People who _use_ the network will tell you that a network provider will fix a network when they complain, and never before. You have 50% packet loss? Trying to shove 40 Gbps down a GigE? Provider doesn't care, or notice.

People who _run_ the network will tell you: "No packet loss is acceptable!" They'll explain how they constantly monitor their network, have SLAs, give you a tour of their show-NOC, etc. But when you read the SLA, you realize they are measuring packet loss between their core routers in city pairs. And frequently they don't even notice when those hit 2 or 3% packet loss.

If you try to send a packet anywhere other than between those cities and those routers, say down your own transit link, or to a peer, or to another customer, well, that's not monitored. And packet loss on those links is not covered by the SLA in many cases. Even if it is covered, it will only be covered from the time you open a ticket. (See the point about the provider not caring until you complain.)

There are a few networks that try harder than others, but no network is perfect. And although you did not say it, I gather that you are not planning to use just one of the better networks; you need to use them all.

In summary: How much packet loss is typical? Truthfully, 0% most (i.e., > 50%) of the time. The rest of the time, it varies between a fraction of a percent and double-digit percentages. Good luck figuring out a global average.

My personal opinion is that the 10^-12 per-link BER requirement in Ethernet sets an upper bound on what can be required of ISPs. Given that a full-sized Ethernet packet is ~10 kbits, that gives us roughly a 10^-8 packet loss upper bound per link. Say your average packet traverses 10 links; that gives you about 10^-7 packet loss when the network is behaving as the standards require (I'm aware that most networks behave much better than this when they're behaving optimally).
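
For concreteness, here is that back-of-the-envelope calculation spelled out as a short Python sketch; the BER, packet size and hop count are the assumed figures from the paragraph above, not measurements:

    ber = 1e-12                 # Ethernet per-link BER requirement
    bits_per_packet = 1e4       # ~10 kbits for a full-sized frame
    hops = 10                   # assumed average path length

    # probability that at least one bit of the packet is corrupted on one link
    p_link = 1 - (1 - ber) ** bits_per_packet    # ~= bits_per_packet * ber ~= 1e-8
    # probability of losing the packet somewhere along the whole path
    p_path = 1 - (1 - p_link) ** hops            # ~= hops * p_link ~= 1e-7

    print(p_link, p_path)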

Personally, I'd start investigating at packet loss rates somewhere around 10^-5 and worse. That is definitely a network problem, whatever is causing it. A well-designed, non-congesting core network should easily be much better than 10^-5 packet loss.

Now, considering your video case: I think most problems in the core are caused not by actual link BER but by other events, such as re-routes, congested links, etc. There you don't get single packet drops here and there in the video stream; instead you'll see very bursty loss.

Now, I've been in a lot of SLA discussions with customers who asked why a 10^-3 packet loss SLA wasn't good enough and wanted to tighten it to 10^-4 or 10^-5. The question to ask then is: "When is the network so bad that you want us to tear it to pieces (bringing the packet loss to 100% if needed) to fix the problem?" That quickly brings the figures back to 10^-3 or even worse, because most applications will still be bearable at those levels. If you're going to design a video codec to handle packet loss, I'd say it should behave without serious visual impairment (i.e., the huge blocky artefacts travelling across the screen for 300 ms) even if two packets in a row are lost and the jitter is hundreds of ms. It should also be able to handle re-ordering (this is not common, but it happens, especially in a re-route case with micro-loops).
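
One common way to make a simple FEC scheme survive two or more packets lost in a row is to interleave packets before transmission, so a burst on the wire turns into isolated losses within each FEC block. A minimal Python sketch of a block interleaver (the block shape and helper names are illustrative, not something from this thread):

    def interleave(packets, rows, cols):
        """Write rows*cols packets row by row, send them column by column."""
        assert len(packets) == rows * cols
        return [packets[r * cols + c] for c in range(cols) for r in range(rows)]

    def deinterleave(packets, rows, cols):
        """Inverse permutation applied by the receiver (None marks a loss)."""
        assert len(packets) == rows * cols
        out = [None] * (rows * cols)
        for i, p in enumerate(packets):
            c, r = divmod(i, rows)
            out[r * cols + c] = p
        return out

    if __name__ == "__main__":
        pkts = list(range(12))                  # 12 numbered packets
        wire = interleave(pkts, rows=3, cols=4)
        wire[4:7] = [None, None, None]          # burst of 3 consecutive losses
        print(deinterleave(wire, 3, 4))         # one loss per row of 4

With a 3x4 block, the burst of three consecutive wire losses lands as one loss per row, which a single per-row repair packet can fix.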

Look at Skype: they're able to adapt to all kinds of network impairments and still deliver service; they degrade nicely. They don't have the classic telco attitude of "we need 0 packet loss and less than 40 ms of jitter, because that's how we designed it, because we're used to SDH/SONET".

What level of packet loss would trigger a response from network operators? How bad does sustained packet loss need
to be before it is viewed as a problem to be fixed? Conversely, what is a typical packet loss fraction during periods
of good network performance?

It really depends on a lot of parameters, which is why I think this approach is no longer very relevant now that we have IP-centric solutions.
In the past, some people said that if you lose less than 0.1% of packets, all is good.
Now, you can lose 1% of packets and still achieve something that works for the end user with Flash 10 technology, and ... conversely, you can lose only 0.01% of packets and see a lot of defects with HD / 8 Mbps / H.264 encoding. (We presented this with Cisco last year at IBC Amsterdam.)

Fortunately, if you think in terms of IP-centric solutions, you can put enough intelligence in the set-top box, for example, or on the PC client side, in order to:
  - re-request packets, and/or
  - repair missing ones (FEC).
Both of these approaches are in operation today and allow quite decent IPTV over long DSL lines at many operators that I know of.
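
As an illustration of the second bullet, here is a minimal FEC repair sketch in Python: one XOR parity packet per block of equal-length media packets lets the receiver rebuild any single lost packet in that block. The helper names and block size are illustrative only:

    from functools import reduce

    def xor_bytes(a, b):
        return bytes(x ^ y for x, y in zip(a, b))

    def make_parity(block):
        """Build one repair packet as the XOR of k equal-length source packets."""
        return reduce(xor_bytes, block)

    def recover(block_with_gap, parity):
        """Rebuild the single missing packet (marked None) from the survivors."""
        survivors = [p for p in block_with_gap if p is not None]
        missing = reduce(xor_bytes, survivors, parity)
        return [p if p is not None else missing for p in block_with_gap]

    if __name__ == "__main__":
        block = [b"pkt1----", b"pkt2----", b"pkt3----", b"pkt4----"]
        parity = make_parity(block)
        damaged = [block[0], None, block[2], block[3]]   # packet 2 was lost
        print(recover(damaged, parity))                  # packet 2 is rebuilt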

(To be clear, I am aware that many ISPs offer some sort of MPLS service with a packet loss SLA for video carriage. I am really asking about
Internet transport here, although I would be pleased to learn of MPLS statistics if anyone wants to provide them.)

You can ask whatever you want of your ISP, but the reality is: anything can happen, and raw packet loss and jitter figures are not the relevant metrics at all!
The solution is not to ask your ISP for a 100% SLA (unless you find someone crazy enough to offer it with real penalties), but to take care of your own service, with real end-to-end monitoring. There is no longer a simple correlation between backbone-level artefacts and the artefacts a human viewer perceives. The best approach is user-centric, top-down monitoring that can tell whether everything is good at the service / application / usage level, in order to track the real artefacts that matter (blockiness, jerkiness, blurriness, and availability of image and sound). It exists, you can believe me :wink:

Marshall Eubanks wrote:

If there is some consensus around this, it would effectively set an upper bound for the need for FEC in network transit.

The bit error rate of copper is better than 1 error in 10^9 bits. The bit error rate of fiber is better than 1 error in 10^12 bits. So the packet loss rate of the transport media is approximately zero.*

Thus any packet loss you see is congestion. If you see packet loss, you should SLOW DOWN, not just keep sending and use more and more FEC to get recoverable video out the far end. (And by "SLOW DOWN" I mean in a way that is TCP-friendly: backing off any less will starve out TCP flows unfairly, and backing off any more will itself be starved out by TCP.)
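
A rough way to quantify "TCP-friendly" is the simplified steady-state TCP throughput model of Mathis et al.: a flow with segment size MSS, round-trip time RTT and loss probability p gets about (MSS/RTT) * 1.22/sqrt(p), so a rate-adaptive video sender that wants to share fairly would stay at or below that rate. A small Python sketch with illustrative numbers:

    from math import sqrt

    def tcp_friendly_rate(mss_bytes, rtt_s, loss_prob):
        """Approximate fair-share throughput in bits per second (Mathis model)."""
        return (mss_bytes * 8 / rtt_s) * 1.22 / sqrt(loss_prob)

    if __name__ == "__main__":
        # 1460-byte segments, 50 ms RTT, 0.1% loss -> roughly 9 Mbit/s
        print(tcp_friendly_rate(1460, 0.050, 1e-3) / 1e6, "Mbit/s")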

Matthew Kaufman

* The bit error rate of RF-based connections like Wi-Fi is higher *but* because they need to transport TCP, and TCP interprets loss as congestion, they implement link-level ARQ in order to emulate the bit-error-rate performance of wire as best they can.
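
To illustrate the footnote's point, here is a toy Python sketch of what link-level ARQ buys: retransmitting each frame up to a fixed retry budget drives the residual loss that TCP sees far below the raw radio frame error rate. The channel model and retry limit are made up for illustration:

    import random

    def send_over_lossy_link(frame, frame_error_rate=0.1, max_attempts=7):
        """Return the attempt number that succeeded, or None if the frame is dropped."""
        for attempt in range(1, max_attempts + 1):
            if random.random() > frame_error_rate:   # frame survived this try
                return attempt
            # otherwise no ACK arrives; the link retransmits
        return None                                   # give up after the budget

    if __name__ == "__main__":
        random.seed(1)
        results = [send_over_lossy_link(b"data") for _ in range(100_000)]
        # a raw 10% frame error rate becomes ~0.1^7 = 1e-7 residual loss
        print("residual loss:", results.count(None) / len(results))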

This sounds pretty good, until you realize that it means you can expect 36
errors in 10 hours on a 100% utilized gigabit fiber link.
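
The arithmetic behind that figure, as a quick check (assuming a saturated 1 Gbit/s link and the 10^-12 fiber BER bound from the post above):

    bit_rate = 1e9                      # bits per second on a saturated GigE link
    seconds = 10 * 3600                 # 10 hours
    ber = 1e-12                         # fiber BER bound

    print(bit_rate * seconds * ber)     # -> 36.0 expected bit errors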

Lamar Owen wrote:

Well, it means this is still OK according to the standard. In real life, if you engineer your network so that links are within the optical levels they should be, you get GigE links that, at worst, have single-digit CRC errors per month and, at best, have zero CRC errors month after month even with a lot of traffic.

BER starts climbing very steeply as you approach the optical limit; it might go from 10^-20 to 10^-14 with just a few dB of change in optical level.