A survey about networking incidents

Hi NANOG,

We all know that networks are at the heart of many of the systems we use today. When these systems break, the underlying network is often the first suspect. Networks are hard to diagnose, so they tend to be blamed for problems even when they are completely healthy. As network engineers, we have all seen cases where another part of the system was causing an issue but the network remained the suspect until the problem was resolved.

We are researchers from Harvard and the University of Pennsylvania who want to better understand this problem and its impact in order to build a solution. Our goal is to quickly rule out the network as the root cause of an incident, which should speed up diagnosis and improve operator efficiency. Specifically, we would like to know: How often do you see problems where the network is blamed, but after investigating you find that the problem was caused by some other part of the system? How often have you had incidents where the cause was outside the boundary of your organization? How much do you think fixing this problem would help you and your organization diagnose problems more quickly?

We have created a very short survey to get an operator’s perspective on these questions. It should take less than 15 minutes to finish. The findings should help us, as well as the research community at large, build a solution that improves diagnosis for networks of all types and sizes. The survey is anonymous, and we will present the results in a scientific article later this year. We will report back on our research once it’s finished.

Survey URL: https://docs.google.com/forms/d/e/1FAIpQLScx-U54eQFQi5AdBCOOucMaI6BVmLwcMFiZl2HVZ9bHi1q8bA/viewform

We would greatly appreciate it if you could help us with this research. Please feel free to forward this survey to other operators you know. Thank you!

Minlan Yu

It seems that this is even harder in a MEF/SP-type Layer 2 emulated network of E-Line, E-LAN, E-Tree type things…

Yeah, it seems that you have to generate synthetic-type traffic and insert it into the data path to have something to measure…

Isn’t CFM/Ethernet OAM supposed to segment the network into management domains of responsibility with MIPs/MEPs, etc., so that you can monitor your system in real time and others can monitor theirs… I have not set this up, but I thought that was one way of knowing the ongoing state of the network, link by link and endpoint to endpoint… I think CCMs (continuity check messages) flow continuously to give you an idea of the extent to which links and services are good or not good.

Perhaps that’s the proof you could point at for anyone trying to blame the network.

I’m sure there are other ways… like Cisco’s IP SLA… Accedian’s PAA, TWAMP (I just remembered about TWAMP, and I think that’s perhaps an IP-layer version of what CFM/OAM is at the Ethernet layer, I could be wrong…), but as I think about it, I recall MPLS OAM, perhaps others too.
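
To make the synthetic-traffic idea concrete, here is a rough Python sketch of active probing: send sequenced UDP probes to a reflector, time the replies, and boil the results down to loss and latency numbers you could point at. This is not TWAMP, IP SLA, or CFM, and it does not speak any of their wire formats. The function names (`run_echo_responder`, `probe`, `summarize`, `loopback_demo`) and the 4-byte sequence payload are made up purely for illustration.

```python
import socket
import statistics
import threading
import time


def run_echo_responder(sock, count):
    """Hypothetical reflector: echo `count` datagrams back to the sender."""
    for _ in range(count):
        data, addr = sock.recvfrom(2048)
        sock.sendto(data, addr)


def probe(target, count=5, timeout=0.5):
    """Send `count` sequenced UDP probes; return RTTs in ms (None = lost)."""
    rtts = []
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(timeout)
        for seq in range(count):
            payload = seq.to_bytes(4, "big")  # illustrative payload, not a standard format
            start = time.perf_counter()
            s.sendto(payload, target)
            try:
                data, _ = s.recvfrom(2048)
                ok = data == payload  # only count a reply that matches this probe
            except socket.timeout:
                ok = False
            rtts.append((time.perf_counter() - start) * 1000 if ok else None)
    return rtts


def summarize(rtts):
    """Reduce raw probe results to loss percentage and latency figures."""
    answered = [r for r in rtts if r is not None]
    loss = 100.0 * (len(rtts) - len(answered)) / len(rtts) if rtts else 0.0
    return {
        "sent": len(rtts),
        "loss_pct": loss,
        "median_ms": statistics.median(answered) if answered else None,
        "max_ms": max(answered) if answered else None,
    }


def loopback_demo(count=5):
    """Probe a throwaway reflector on 127.0.0.1 and return the summary."""
    reflector = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    reflector.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    worker = threading.Thread(
        target=run_echo_responder, args=(reflector, count), daemon=True
    )
    worker.start()
    try:
        return summarize(probe(reflector.getsockname(), count=count))
    finally:
        worker.join(timeout=2)
        reflector.close()
```

Calling `loopback_demo()` exercises the whole path on one machine. In a real network the reflector would sit at the far end of the service being measured, and the summaries would be collected over time so there is a history to show when the network gets blamed.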

Yes, as network engineers, I/we continually have to clear our name (clear the network) of blame.


P.S. I’ll try to look at the survey later.