Do you care about "gray" failures? Can we (network academics) help? A 10-min survey

Dear NANOG,

Detecting whole-link and node failures is relatively easy nowadays (e.g., using BFD). But what about detecting gray failures that only affect a subset of the traffic, e.g., a router randomly dropping 0.1% of the packets? Does your network often experience these gray failures? Are they problematic? Do you care? And can we (network researchers) do anything about it?

Please help us find out by answering our short (<10 minutes) anonymous survey.

Survey URL: https://forms.gle/v99mBNEPrLjcFCEu8

Context:

When we think about network failures, we often think about a link or a network device going down. These failures are “obvious” in that all the traffic crossing the corresponding resource is dropped. But network failures can also be more subtle and only affect a subset of the traffic (e.g. 0.01% of the packets crossing a link/router). These failures are commonly referred to as “gray” failures. Because they don’t drop all the traffic, gray failures are much harder to detect.
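
To give a rough sense of why: assuming independent, uniformly random per-packet loss, the number of probes you need to send before such a low loss rate is even likely to show up once is already in the thousands. A back-of-the-envelope sketch (illustrative only):

# Back-of-the-envelope: probes needed to see at least one drop with a given
# confidence, assuming independent, uniformly random per-packet loss.
# P(>=1 drop in n probes) = 1 - (1 - p)^n  =>  n = log(1 - c) / log(1 - p)
import math

def probes_needed(p, confidence=0.99):
    return math.ceil(math.log(1 - confidence) / math.log(1 - p))

for loss_rate in (1e-3, 1e-4):   # 0.1% and 0.01% loss
    print(f"{loss_rate:.4%} loss: ~{probes_needed(loss_rate)} probes for 99% confidence")

And that assumes the loss is random; drops that only hit specific flows or prefixes may never be seen by generic probes at all.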

Many studies have revealed that cloud and datacenter networks routinely suffer from gray failures and, as such, many techniques exist to track them down in these environments (see, e.g., this study from Microsoft Azure: https://www.microsoft.com/en-us/research/wp-content/uploads/2017/06/paper-1.pdf). What is less known, though, is how much gray failures affect other types of networks such as Internet Service Providers (ISPs), Wide Area Networks (WANs), or enterprise networks. While the bug reports submitted to popular routing vendors (Cisco, Juniper, etc.) suggest that gray failures are pervasive and hard to catch for all networks, we would love to know more about first-hand experiences.

About the survey:

The questionnaire is intended for network operators. It has a total of 15 questions and should take at most 10 minutes to complete. The survey and the collected data are totally anonymous (so please do not include information that may help to identify you or your organization). All questions are optional, so if you don’t like a question or don’t know the answer, just skip it.

Thank you so much in advance; we look forward to reading your responses!

Laurent Vanbever, ETH Zurich

PS: Of course, we would be extremely grateful if you could forward this email to any operator you might know who may not read NANOG (assuming those even exist? :-))!

Networks experience gray failures all the time, and I almost never care unless a customer does. If there is a network which does not experience these, then it's likely due to lack of visibility rather than the issues not existing.

Fixing these can take months of working with vendors, and attempts to remedy them will usually cause planned or unplanned outages. So it rarely makes sense to try to fix them, as they usually impact a trivial amount of traffic.

Networks also routinely mangle packets in memory, which is not visible to the FCS check.

I was going to say the exact same thing.

+1.

It's all par for the course, which is why we get up every day :-).

I'm currently dealing with an issue that will forward a customer's traffic to/from one /24, but not the rest of their IPv4 space, including the larger allocation from which the /24 is born. It was a gray issue while the customer partially activated, and then caused us to care when they tried to fully swing over.

We've had an issue that has lasted over a year but only manifested recently, where someone wrote a static route pointing to an indirect next-hop, mistakenly. The router ended up resolving it and forwarding traffic, but in the process, was spiking CPU in a manner that was not immediately evident from the NMS. Fixing the next-hop resolved the issue, as would improving service provisioning and troubleshooting manuals :-).

Like Saku says, there's always something, and attention to it will be granted depending on how much visible pain it causes.

Mark.

> Detecting whole-link and node failures is relatively easy nowadays (e.g., using BFD). But what about detecting gray failures that only affect a *subset* of the traffic, e.g. a router randomly dropping 0.1% of the packets? Does your network often experience these gray failures? Are they problematic? Do you care? And can we (network researchers) do anything about it?

> Networks experience gray failures all the time, and I almost never care unless a customer does. If there is a network which does not experience these, then it's likely due to lack of visibility rather than the issues not existing.
>
> Fixing these can take months of working with vendors, and attempts to remedy them will usually cause planned or unplanned outages. So it rarely makes sense to try to fix them, as they usually impact a trivial amount of traffic.

Thanks for chiming in. That's also my feeling: a *lot* of gray failures routinely happen, a small percentage of which end up being really damaging (the ones hitting customer traffic, as you pointed out). For this small percentage though, I can imagine that being able to detect / locate them rapidly (i.e., before the customer submits a ticket) would be interesting? Even if fixing the root cause might take months (since it is up to the vendors), one could still hope to remediate the situation transiently by rerouting traffic, combined with the traditional rebooting of the affected resources?

> Networks also routinely mangle packets in memory, which is not visible to the FCS check.

Added to the list... Thanks!

Best,
Laurent

> Networks experience gray failures all the time, and I almost never care unless a customer does. If there is a network which does not experience these, then it's likely due to lack of visibility rather than the issues not existing.
>
> Fixing these can take months of working with vendors, and attempts to remedy them will usually cause planned or unplanned outages. So it rarely makes sense to try to fix them, as they usually impact a trivial amount of traffic.
>
> Networks also routinely mangle packets in memory, which is not visible to the FCS check.

> I was going to say the exact same thing.
>
> +1.
>
> It's all par for the course, which is why we get up every day :-).

:-)

> I'm currently dealing with an issue that will forward a customer's traffic to/from one /24, but not the rest of their IPv4 space, including the larger allocation from which the /24 is born. It was a gray issue while the customer partially activated, and then caused us to care when they tried to fully swing over.

Did you folks manage to understand what was causing the gray issue in the first place?

> We've had an issue that has lasted over a year but only manifested recently, where someone wrote a static route pointing to an indirect next-hop, mistakenly. The router ended up resolving it and forwarding traffic, but in the process, was spiking CPU in a manner that was not immediately evident from the NMS. Fixing the next-hop resolved the issue, as would improving service provisioning and troubleshooting manuals :-).

Interesting. I can see how hard this one is to debug, as even a relatively small amount of traffic pointing at the static route would be enough to cause the CPU spikes.

> Like Saku says, there's always something, and attention to it will be granted depending on how much visible pain it causes.

Yep. Makes absolute sense.

Best,
Laurent

UUCP over TCP does work to overcome packet-size problems; it saw limited usage, but it did work in the past.

Col

Nope, still chasing it. We suspect a FIB issue on a transit device, but we are currently building a test to confirm.

Mark.

> Thanks for chiming in. That's also my feeling: a *lot* of gray failures routinely happen, a small percentage of which end up being really damaging (the ones hitting customer traffic, as you pointed out). For this small percentage though, I can imagine that being able to detect / locate them rapidly (i.e., before the customer submits a ticket) would be interesting? Even if fixing the root cause might take months (since it is up to the vendors), one could still hope to remediate the situation transiently by rerouting traffic, combined with the traditional rebooting of the affected resources?

One method is collecting lookup exceptions. We scrape these:

npu_triton_trapstats.py: command = "start shell sh command \"for fpc in $(cli -c 'show chassis fpc' | grep Online | awk '{print $1;}'); do echo FPC$fpc; vty -c 'show cda trapstats' fpc$fpc; done\""
ptx1k_trapstats.py: command = "start shell sh command \"for fpc in $(cli -c 'show chassis fpc' | grep Online | awk '{print $1;}'); do echo FPC$fpc; vty -c 'show pechip trapstats' fpc$fpc; done\""
asr9k_npu_counters.py: command = "show controllers np counters all"
junos_trio_exceptions.py: command = "show pfe statistics exceptions"

No need for ML or AI: trivial algorithms like 'which counter is incrementing here but isn't incrementing elsewhere' or 'which counter is not incrementing here but is incrementing elsewhere' show a lot of real problems, and capturing those exceptions and reviewing them confirms it.
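
Roughly, the comparison is nothing fancier than this (a toy sketch; it assumes the scraped output has already been parsed into per-FPC counter deltas, and the counter names, values and thresholds are made up for the example):

# Toy comparison across FPCs: flag counters whose increment pattern is an
# outlier. Assumes 'deltas' maps FPC name -> {counter name -> increment}.
def find_outliers(deltas):
    findings = []
    counters = {c for per_fpc in deltas.values() for c in per_fpc}
    for counter in sorted(counters):
        incrementing = [fpc for fpc, d in deltas.items() if d.get(counter, 0) > 0]
        quiet = [fpc for fpc in deltas if fpc not in incrementing]
        if len(incrementing) == 1 and len(quiet) >= 2:
            findings.append(f"{counter}: incrementing only on {incrementing[0]}")
        elif len(quiet) == 1 and len(incrementing) >= 2:
            findings.append(f"{counter}: not incrementing on {quiet[0]} but is elsewhere")
    return findings

# Example counter names and values, purely for illustration.
print("\n".join(find_outliers({
    "FPC0": {"ttl_expired": 120, "bad_ipv4_hdr": 0},
    "FPC1": {"ttl_expired": 98, "bad_ipv4_hdr": 0},
    "FPC2": {"ttl_expired": 101, "bad_ipv4_hdr": 4132},
})))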

We do not use these to proactively find problems, as it would yield poorer overall availability. But we regularly use them to expedite time to resolution.

Very recently we had Tomahawk (EZchip) reset the whole linecard, and looking at the counters and identifying one which was incrementing but likely should not have been yielded the problem. The customer was sending us IP packets where the Ethernet header and the IP header up to the total-length field were missing on the wire; this accidentally fuzzed the NPU ucode, periodically triggering an NPU bug which causes a total LC reload when it happens often enough.

> Networks also routinely mangle packets in memory, which is not visible to the FCS check.
>
> Added to the list... Thanks!

The only way I know of to try to find these memory corruptions is to look at the egress PE device's backbone-facing interface and see if there are IP checksum errors.
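
For illustration, this is the check such a counter is effectively doing (a minimal sketch of RFC 1071 ones'-complement verification; the sample header bytes are fabricated for the example):

# Minimal RFC 1071 ones'-complement verification of an IPv4 header. Corrupting
# any header byte in memory after the checksum was computed makes this fail,
# which is what the interface's "IP checksum error" counter catches.
import struct

def ipv4_header_ok(header: bytes) -> bool:
    if len(header) % 2:
        header += b"\x00"
    total = sum(struct.unpack(f"!{len(header) // 2}H", header))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total == 0xFFFF   # a valid header sums to all ones

# Fabricated but internally consistent 20-byte header (checksum 0x7f41).
hdr = bytes.fromhex("450000543a14400040017f41c0a80001c0a80002")
print(ipv4_header_ok(hdr))                 # True
print(ipv4_header_ok(b"\x00" + hdr[1:]))   # False after corrupting the first byte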

We have a similar gray issue, where switches in a virtual-chassis configuration with a layer-3 configuration seem to randomly lose transit ICMP messages like echo or echo-reply. We once estimated it at around 0.00012% (give or take variance and measurement error).

We noticed this when we replaced Nagios with some burstier, more trigger-happy monitoring software a few years back. Since then, it has been reporting false positives from time to time, and this can become annoying.

Despite spending a lot of time debugging this, we never had a breakthrough in finding the root cause; we are just looking to replace the hardware next year.

> If there is a network which does not experience these, then it's likely due to lack of visibility rather than the issues not existing.

This. Full stop.

I believe there are very few, if any, production networks in existence which have a 0% rate of drops or 'weird shit'.

Monitoring for said drops and weird shit is important, and knowing your traffic profiles is also important, so that when there is an intersection of 'stuff' and 'stuff that noticeably impacts traffic', you can get to the bottom of it quickly and figure out what to do.

Hi Jörg,

Thanks for sharing your gray failure! With a few years of lifespan, it might well be the oldest gray failure ever monitored continuously :-). I'm pretty sure you guys have exhausted all options already, but... did you check for micro-bursts that may cause sudden buffer overflows? Or perhaps your probing traffic is already high priority?

Best,
Laurent

> One method is collecting lookup exceptions. We scrape these:
>
> npu_triton_trapstats.py: command = "start shell sh command \"for fpc in $(cli -c 'show chassis fpc' | grep Online | awk '{print $1;}'); do echo FPC$fpc; vty -c 'show cda trapstats' fpc$fpc; done\""
> ptx1k_trapstats.py: command = "start shell sh command \"for fpc in $(cli -c 'show chassis fpc' | grep Online | awk '{print $1;}'); do echo FPC$fpc; vty -c 'show pechip trapstats' fpc$fpc; done\""
> asr9k_npu_counters.py: command = "show controllers np counters all"
> junos_trio_exceptions.py: command = "show pfe statistics exceptions"
>
> No need for ML or AI: trivial algorithms like 'which counter is incrementing here but isn't incrementing elsewhere' or 'which counter is not incrementing here but is incrementing elsewhere' show a lot of real problems, and capturing those exceptions and reviewing them confirms it.
>
> We do not use these to proactively find problems, as it would yield poorer overall availability. But we regularly use them to expedite time to resolution.

Thanks for sharing! I guess this process working means the counters are "standard" / close enough across vendors to allow for comparisons?

Not at all, I'm afraid. They are not intended for user consumption, so they are generally not available via SNMP or streaming telemetry.

Hello,

There is a large eyeball ASN in Southern Europe, single-homed to a Tier 1 running under the same corporate umbrella, which for about a decade suffered from periodic blackholing of specific src/dst tuples. The issue occurred every 6-18 months, completely breaking specific production traffic for multiple days (think dead, mission-critical IPsec VPNs, for example). It was never acknowledged on the record; some say it was about stalled 100G cards. I believe at this point the HW has been phased out, but this was one of the rather infuriating experiences…

More generally speaking, single-link overloads causing packet loss or even full blackholing on single links (and therefore, in a load-balanced environment, for specific tuples) are something that is very frustrating to troubleshoot, and it happens quite a lot in the DFZ. It doesn't show on monitoring systems, and it is difficult to get past first-level support in bigger networks, because load-balancing decisions and hashing are difficult concepts for the uninitiated and they will generally refuse to escalate issues they are unable to reproduce from their specific system (WORKSFORME). At some point I had a router with an entire /24 configured on a loopback, just to ping destinations from the same device with different source IPs, to establish whether there was a load-balancing-induced issue with packet loss, latency, or full blackholing towards a particular destination.
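
The same idea can be scripted from a Linux probe host (a sketch only; it assumes the extra source addresses are already configured locally, e.g. on a loopback, and that iputils ping is available, with -I selecting the source address):

# Probe one destination from many source addresses so each probe is likely to
# hash onto a different load-balanced path. Destination and source ranges are
# example (documentation) addresses.
import subprocess

DEST = "192.0.2.1"
SOURCES = [f"198.51.100.{i}" for i in range(1, 11)]

for src in SOURCES:
    result = subprocess.run(
        ["ping", "-c", "20", "-W", "1", "-I", src, DEST],
        capture_output=True, text=True)
    # A per-source difference in loss hints at a problem on one hashed path.
    loss_line = next((l for l in result.stdout.splitlines() if "packet loss" in l), "no output")
    print(f"{src} -> {DEST}: {loss_line.strip()}")

Varying the source (or the source port) changes the hash input, so a destination that is only broken via one hashed path shows loss for some sources and none for others.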

Tooling (for troubleshooting), monitoring, and education are lacking in this regard, unfortunately.

  • lukas

Ask your vendor to implement RFC 5837, so that in addition to the bundle interface having the L3 address, traceroute also returns the actual physical interface that received the packet. This would expedite troubleshooting issues where elephant flows congest specific links.

Juniper and Nokia support adaptive load balancing, dynamically adjusting the hash=>interface mapping table to deal with elephant flows without congesting one link.
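
Conceptually it is just a bucket table that gets re-pointed (a toy model, not any vendor's actual implementation; bucket count, thresholds and interface names are made up):

# Toy model of adaptive load balancing over a LAG/ECMP group. Flows hash to
# buckets, buckets map to member links, and buckets are re-pointed away from
# an overloaded member.
import zlib
from collections import Counter

N_BUCKETS = 256
MEMBERS = ["et-0/0/0", "et-0/0/1", "et-0/0/2", "et-0/0/3"]
bucket_to_member = {b: MEMBERS[b % len(MEMBERS)] for b in range(N_BUCKETS)}

def member_for_flow(five_tuple):
    # The flow always hashes to the same bucket, hence the same member link.
    bucket = zlib.crc32(repr(five_tuple).encode()) % N_BUCKETS
    return bucket_to_member[bucket]

def rebalance(load_bps, threshold_bps):
    # Move one bucket per pass off any member exceeding the threshold,
    # onto the least-loaded member, to limit flow reordering.
    coolest = min(MEMBERS, key=lambda m: load_bps[m])
    for bucket, member in bucket_to_member.items():
        if member != coolest and load_bps[member] > threshold_bps:
            bucket_to_member[bucket] = coolest
            break

print("flow on:", member_for_flow(("198.51.100.7", "203.0.113.9", 6, 51512, 443)))
print("before:", Counter(bucket_to_member.values()))
rebalance({"et-0/0/0": 95e9, "et-0/0/1": 20e9, "et-0/0/2": 30e9, "et-0/0/3": 25e9},
          threshold_bps=80e9)
print("after: ", Counter(bucket_to_member.values()))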

> Detecting whole-link and node failures is relatively easy nowadays (e.g., using BFD). But what about detecting gray failures that only affect a *subset* of the traffic, e.g. a router randomly dropping 0.1% of the packets? Does your network often experience these gray failures? Are they problematic? Do you care? And can we (network researchers) do anything about it?

> Networks experience gray failures all the time, and I almost never care unless a customer does. If there is a network which does not experience these, then it's likely due to lack of visibility rather than the issues not existing.

I think that some of it depends on the type of failure -- for example, some devices hash packets across an internal switch fabric, and so the failure manifests as persistent issues for a specific 5-tuple (or between a pair of 5-tuples). If this affects one in a thousand flows, it is likely more annoying than one in a thousand random packets being dropped.
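
A toy model of why that pins specific flows (the hash function and link count are arbitrary; real fabrics differ):

# Toy model: per-flow hashing onto internal fabric paths. A single bad path
# breaks the *same* flows every time, rather than a random fraction of packets.
import zlib

N_FABRIC_LINKS = 8
BAD_LINK = 2   # hypothetical faulty fabric path

def fabric_link(src, dst, proto, sport, dport):
    key = f"{src}|{dst}|{proto}|{sport}|{dport}".encode()
    return zlib.crc32(key) % N_FABRIC_LINKS

for i in range(10):
    flow = ("10.0.0.1", "10.0.1.1", 6, 40000 + i, 443)
    link = fabric_link(*flow)
    status = "breaks on every packet" if link == BAD_LINK else "fine"
    print(flow, "-> fabric link", link, status)

Probing with a single 5-tuple would therefore either always see the problem or never see it.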

But, yes, all networks drop some set of packets some percentage of the
time (cue the "SEU caused by cosmic rays" response :-))

W

We had a line card that would drop any IPv6 packet with bit #65 in the destination address set to 1. It turns out that only a few hosts have this bit set to 1 in their address, so nobody noticed until some Debian mirrors started to become unreachable. Also, web browsers are very good at switching to IPv4 in case of IPv6 timeouts, so nobody would notice web hosts with the problem. And then we had to live with the problem for a while, because the device was out of warranty and marked to be replaced, but you do not just replace a router in a hurry unless you absolutely need to.
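
If you want to check whether an address would have tripped it (a sketch; I am assuming here that bits are numbered 1..128 from the most significant bit, so bit #65 is the first bit of the interface identifier):

# Check whether a given bit of an IPv6 address is set, assuming bits are
# numbered 1..128 starting from the most significant bit.
import ipaddress

def bit_set(addr, bit):
    value = int(ipaddress.IPv6Address(addr))
    return (value >> (128 - bit)) & 1

# Low, manually assigned interface IDs keep bit #65 at 0 (which is why hardly
# anybody noticed); randomly generated or EUI-64 interface IDs may set it.
for a in ("2001:db8::1", "2001:db8::8000:0:0:1", "2001:db8::a4f3:9dff:fe12:3456"):
    print(a, "-> bit 65 =", bit_set(a, 65))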

You do not expect this kind of issue, and a lot of time was spent trying to find an alternative explanation for the problem.

Regards,

Baldur

Greetings,

I would suggest that your customer does care, but as there is no
simple test to demonstrate gray failures, your customer rarely makes
it past first tier support to bring the issue to your attention and
gives up trying. Indeed, name the networks with the worst reputations
around here and many of them have those reputations because of a
routine, uncorrected state of gray failure.

To answer Laurent 's question:

Yes, gray failures are a regular problem. Yes, most of us care. And
for the most part we don't have particularly good ways to detect and
isolate the problems, let alone fix them. When it's not a clean
failure we really are driven by: customer says blank is broken, often
followed by grueling manual effort just to duplicate the problem
within our view.

Can network researchers do anything about it? Maybe. Because of the end-to-end principle, only the endpoints understand the state of the connection, and they don't know the difference between capacity and error. They mostly process that information locally, sharing only limited information with the other endpoint. Which means there's not much passing over the wire for the middle to examine and learn that there's a problem... and when there is, it often takes correlating multiple packets to understand that a problem exists, which, in the stateless middle with asymmetric routing, is not usable. The middle can only look at its immediate link stats which, when there's a bug, are misleading.

What would you change to dig us out of this hole?

Regards,
Bill Herrin

Most don't. Somewhat recently we were dropping a non-trivial amount of packets from a well-known book store due to DMAC failure. This was unexpected, considering it was an L3-to-L3 connection. This was a LACP bundle with a large number of interfaces, and the issue affected just one interface in the bundle. After we informed the customer about the problem, while it was still occurring, they could not observe it: they looked at their stats, and whatever we were dropping was drowned in the noise; it was not an actionable signal to them. The customer wasn't willing to remove the broken interface from the bundle, as they could not observe the problem.

We did migrate that port to a working port, and after 3 months we agreed with the vendor to stop troubleshooting it; the vendor can see that they had misprogrammed their hardware, but they were not able to figure out why, and therefore it is not fixed. A very large amount of cycles was spent by the vendor and the operator, and a small amount of work (checking TCP resends etc.) by the customer, trying to solve it.

The reason we contacted the customer is that we were dropping quite a large number of packets; I can easily and immediately find 100 real but smaller problems in our network.

The customer was /not/ wrong; the customer did the exact right thing. There are a lot of problems, and you can go deep down the rabbit hole trying to fix problems which are real but don't affect a sufficient amount of packets to have a meaningful impact on product quality.