Refusing Pings on Core Routers??? A new trend?

Deepak_Jain · October 19, 2006, 11:18pm

This week, at least a dozen "troublesome" or "problematic" routes our NOC has investigated due to customer complaints all have double or triple digit latencies, jitter and/or packet loss.

Not really that surprising... a normal week when you are looking at the entire Internet.

What is interesting about *this* dozen that I am talking about is that the problematic hops (or the hops just before them) have all had various levels of filtering (trace, icmp trace, icmp everything, etc.) that prevent (or at least make more difficult) real corroboration of the "spot" samples that a traceroute gives.

1 NOC (that will remain nameless even though they should really be shamed) said the following in response to the question -- when we were trying to diagnose +50ms jumps in their latency within a single POP.

Q: "As part of this, can you tell me why your router is prohibiting packets
being sent to our interface?"

A:" The reason you cannot hit your interface is it is blocked for
security reasons."

This is what we saw (below) en route to our interface moving ~500mb/s from outside of our network (you know, that pesky symmetrical problem resolution). We were investigating the hop immediately after this hop.

9 bcs1-so-3-1-0.y.x.z (x.x.x.x) 12.395 ms !N 11.824 ms !N 14.162 ms !N

[Clever folks will know by the interface naming who I am talking about].

What the heck is going on lately? Have we returned to the time where we've started trying to hide lacks of capacity instead of fixing them??

Jeremy_Chadwick · October 20, 2006, 12:20am

1 NOC (that will remain nameless even though they should really be
shamed) said the following in response to the question -- when we were
trying to diagnose +50ms jumps in their latency within a single POP.

Q: "As part of this, can you tell me why your router is prohibiting packets
being sent to our interface?"

A:" The reason you cannot hit your interface is it is blocked for
security reasons."

I've heard this response before, albeit not from the company you're
referring to. The most common response -- which is at this point a
template response -- I hear is "Well, you can't rely on traceroute
because of ICMP prioritisation". When you start to explain how
traceroute actually works (both ICMP-based and UDP-based (which
still relies on ICMP responses, of course!)), and that ICMP prio
should only affect the IP on which the router listens (and not
hops beyond or at the dest), most NOCs fire back with another
template of their choice ("We're not aware of any issues", "No that's
incorrect", "I'll check with engineers", or the ever-so-amusing
"traceroute and ping aren't reliable, you need to use a different
method of testing") -- but the most common is: "can you send us
traceroutes of what you're seeing?"

"But! You just said..... argh!!"

I happen to work in a NOC, and I have never -- nor will I ever --
spout off that template response. When a client or customer calls
about something, I give them the benefit of the doubt. If it
turns out they're wrong later, they at least (hopefully) learned
something. I just happen to believe in getting things done,
rather than arguing against doing investigative work. When an
issue occurs, look at it quickly, not 24-48 hours later.

I am absolutely fine with ICMP being prioritised last, but those
scenarios induce more questions; "so ICMP is prio'd last, which
would mean the router is busy processing other packets, which could
mean your router is over-utilised either CPU-wise or iface-wise
since we're seeing 250ms at your hop and beyond". 48 hours later,
a network technician looks at the router and either finds absolutely
nothing ("It must've gone away on its own") or finds something
conclusive (but only when the issue re-occurs, is still occurring,
or if they keep historic data).

Did I miss the conspiracy?? I know my membership dues are all paid up.
If this has been going on a while, I apologize I guess I've just noticed
the trend in our shift reports.

Yes, this has been going on for a while. Well, not ICMP_UNREACH_NET
(from your example) but general ICMP prioritisation or the explicit
dropping of either ICMP_TIMXCEED (traceroute) or ICMP_ECHOREPLY (ping).

A real-life example, from my own (residential) ISP. Try to imagine
reporting an issue at hop 6 to a technician (who will always insist
the problem is somewhere prior). Here's an example of a working
network (no sarcasm; I'm serious!):

   Host            Loss%  Snt  Rcv  Last   Avg  Best  Wrst
1. 192.168.1.1      0.0%   30   30   0.5   0.5   0.5   0.6
2. ???            100.0%   30    0   0.0   0.0   0.0   0.0
3. 68.87.198.129    0.0%   30   30   8.5   9.9   7.5  21.1
4. 68.87.192.34    30.0%   30   21   9.2  11.9   9.2  20.5
5. 68.87.226.134   66.7%   30   10  10.9  12.9  10.2  25.9
6. 12.116.188.13    0.0%   30   30  10.6  12.6  10.1  25.0
7. 12.123.12.126    0.0%   30   30  12.9  12.3  10.2  15.3

And an example of when things are broken:

   Host            Loss%  Snt  Rcv  Last   Avg  Best  Wrst
1. 192.168.1.1      0.0%   30   30   0.5   0.5   0.4   0.6
2. ???            100.0%   30    0   0.0   0.0   0.0   0.0
3. 68.87.198.129    0.0%   30   30  12.7  12.7   8.1  28.4
4. 68.87.192.34    20.0%   30   24  13.4  11.5   9.5  14.5
5. 68.87.226.134   96.7%   30    1  12.6  12.6  12.6  12.6
6. 12.116.188.13   50.0%   30   15  15.1  11.8  10.3  15.1
7. 12.123.12.122   50.0%   30   15  11.6  17.5  11.1  60.5
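
One hedged way to read the two reports above: probes answered by a later hop had to be forwarded through every earlier hop, so the loss that forwarded traffic can be suffering by a given hop is at most the lowest loss reported at that hop or any hop after it. The Python sketch below applies that rule to the loss percentages for hops 1 and 3-7 (hop 2 never answers and is left out). The function name is made up for illustration, and the rule ignores sampling noise and asymmetric return paths.

```python
def forwarding_loss_ceiling(losses):
    """Upper bound on forwarding loss per hop.

    The bound for a hop is the lowest reply loss seen at that hop or at
    any later hop, i.e. a running minimum taken from the last hop backwards.
    """
    ceiling = []
    lowest = 100.0
    for loss in reversed(losses):
        lowest = min(lowest, loss)
        ceiling.append(lowest)
    return list(reversed(ceiling))


# Loss percentages for hops 1, 3, 4, 5, 6, 7 from the two reports.
working = [0.0, 0.0, 30.0, 66.7, 0.0, 0.0]
broken = [0.0, 0.0, 20.0, 96.7, 50.0, 50.0]

print(forwarding_loss_ceiling(working))
print(forwarding_loss_ceiling(broken))
```

For the working report the ceiling is 0% at every hop, so the 30% and 66.7% at hops 4 and 5 look like ICMP deprioritisation only. For the broken report the ceiling is 0, 0, 20, 50, 50, 50: the loss is real, but because hops 4 and 5 drop replies even on a good day, the report alone cannot say whether it starts at hop 4, 5, or 6.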

Since I'm not a network administrator, I'll ask point blank: why
exactly do your netadmins filter and rate-limit ICMP like this, and what
are you gaining from it? Most kiddies stick with pure TCP or UDP
these days -- the goal is to saturate the pipe, not cause a literal
service DoS (e.g. crashing Apache, etc.)

Additionally, I'll ask another question: exactly what tool are
NOCs (or even network administrators) supposed to use to diagnose
network path problems via layer 3 and 4?

bensons · October 20, 2006, 12:53am

Q: "As part of this, can you tell me why your router is prohibiting
packets being sent to our interface?"

A:" The reason you cannot hit your interface is it is blocked for
security reasons."

[...]

What the heck is going on lately? Have we returned to the time where
we've started trying to hide lacks of capacity instead of fixing
them??

You would be mistaken to think that a router's lack of responsiveness to
your queries is indicative of forwarding capacity issues.

To ask your question from the opposite point of view, are there any
operators of large networks today that don't filter and police traffic
destined for the control/management plane of their routers?

Anticipating the answer to that question: I think it is only reasonable
to limit the impact that random strangers can have on my network's
stability. Your ability to traceroute is valuable, but not more valuable
than my network's uptime.

Cheers,
-Benson

Rubens_Kuhl · October 20, 2006, 2:14am

template response -- I hear is "Well, you can't rely on traceroute
because of ICMP prioritisation". When you start to explain how
traceroute actually works (both ICMP-based and UDP-based (which
still relies on ICMP responses, of course!)), and that ICMP prio
should only affect the IP on which the router listens (and not
hops beyond or at the dest), most NOCs fire back with another

If I recall well, Cisco GSRs impose low priority and/or limits for all
ICMP traffic flowing thru the box, not just packets to/from router
itself, and there's not a knob to adjust that.

Also of note is that packets whose TTL expires need some kind of
slow-path processing, and will be subject to increased latency or loss
compared to normal ones, and this affects every tool I've seen for
tracing packets thru the network.

ianai · October 20, 2006, 2:19am

You don't recall well.

Although there is a knob if you want to tweak it. But there's a knob for just about everything - it's just not tweaked by default.

Eric_Spaeth · October 20, 2006, 4:37am

Rubens Kuhl Jr. wrote:

If I recall well, Cisco GSRs impose low priority and/or limits for all
ICMP traffic flowing thru the box, not just packets to/from router
itself, and there's not a knob to adjust that.

There'd be no reason to limit ICMP globally -- for traffic through a router it's all IP; it doesn't really matter what the sub-protocol is. The forwarding process on the router is the same for all IP traffic, the simple breakdown being:

1) Take the source and destination IP and hash them to get an index value
2) Look up the destination prefix in the forwarding table (the CEF table on Cisco hardware)
3) Match the hashed index value in the CEF table with an outbound interface
4) Puke the packet out the destination interface.

All of these tasks are easily done in hardware ASICs because they are just doing simple hashing and bit comparisons. If the destination prefix is already populated in the CEF table then there is no CPU/software involved in the process. The hashing is to keep traffic from source to destination on a single interface to reduce out-of-order delivery.
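
The steps above can be sketched in a few lines of Python. This is an illustration of the idea only, not how CEF is implemented: the table contents and interface names are made up, CRC32 stands in for whatever hash the hardware uses, and a real router does the lookup in ASICs rather than by scanning a dictionary.

```python
import ipaddress
import zlib

# Hypothetical forwarding table: prefix -> equal-cost outbound interfaces.
FIB = {
    ipaddress.ip_network("0.0.0.0/0"): ["if0"],
    ipaddress.ip_network("192.0.2.0/24"): ["if1", "if2"],
}


def lookup(dst):
    """Longest-prefix match of dst against the table."""
    addr = ipaddress.ip_address(dst)
    matches = [net for net in FIB if addr in net]
    return max(matches, key=lambda net: net.prefixlen)


def forward(src, dst):
    """Pick the outbound interface for one packet."""
    interfaces = FIB[lookup(dst)]
    # Hash the address pair so every packet of a flow takes the same link.
    index = zlib.crc32(f"{src}>{dst}".encode()) % len(interfaces)
    return interfaces[index]


print(forward("198.51.100.7", "192.0.2.9"))
print(forward("198.51.100.7", "203.0.113.1"))
```

Nothing in `forward` looks at the protocol field, which is the point being made: an ICMP packet passing through takes the same path as any other IP packet with the same addresses.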

To respond to ICMP, however, the packet needs to be routed up to the CPU to be handled. There the packet must be inspected, and an entirely new packet must be created to be sent back. While individually these responses take a negligible amount of CPU time, if you get enough devices flooding you with ICMP requests it starts to add up. Since processor time is used for other semi-important tasks like maintaining BGP peering, it is often prudent to rate-limit ICMP handling by the router.
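
A common way to express this kind of limit is a token bucket. The sketch below is a generic Python version under assumed numbers (10 replies per second, burst of 5); it is not any vendor's implementation, and the class and parameter names are illustrative.

```python
class TokenBucket:
    """Allow at most `rate` events per second, with bursts up to `burst`."""

    def __init__(self, rate, burst):
        self.rate = rate      # tokens added per second
        self.burst = burst    # bucket capacity
        self.tokens = burst   # start with a full bucket
        self.last = 0.0       # time of the previous call, in seconds

    def allow(self, now):
        """Return True if a reply may be generated at time `now`."""
        elapsed = now - self.last
        self.last = now
        # Refill for the time that has passed, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


bucket = TokenBucket(rate=10, burst=5)
# Thirty echo requests hitting the router CPU in the same instant.
replies = sum(bucket.allow(now=0.0) for _ in range(30))
print(replies, "replies,", 30 - replies, "dropped")
```

Thirty requests arriving at once get five replies and 25 drops, even though the router may be forwarding transit traffic without any loss at all.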

Overall this is a bigger issue with IOS devices; Juniper has a whole architecture built into JunOS to protect the CPU so they can often get by without end-user configuration to limit impact.

-Eric

Mikael_Abrahamsson · October 20, 2006, 7:00am

No, that is not necessarily true. I know of at least one vendor that punts all ICMP (on certain versions of their HW) to CPU, and the CPU is normally not otherwise involved in packet forwarding, so seeing latencies on ICMP (perhaps due to some housekeeping going on) doesn't at all mean other packets are being delayed. It might, of course.

Then we also have the problem of people not understanding how traceroute works, i.e. sending UDP with low TTL going one way, then getting ICMP back, with the router expiring the TTL and generating the ICMP TTL-exceeded message, perhaps having to punt this to CPU first, or perhaps processing it on a linecard. To understand what the output you're seeing really means, you have to know all platforms involved in all hops both going there and back (return path perhaps being asymmetrical to the path you're seeing in traceroute).
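
A toy model of that mechanism, in Python, assuming a symmetric return path (which, as noted, real paths often are not). The hop names and delays are invented: `r2` forwards as fast as its neighbours but takes 50 ms to build a TTL-exceeded reply on its CPU.

```python
# Hypothetical path: (name, forwarding delay in ms, ICMP generation delay in ms).
PATH = [("r1", 1.0, 0.5), ("r2", 1.0, 50.0), ("r3", 1.0, 0.5)]


def probe(ttl):
    """Return (responder, rtt_ms) for one probe sent with the given TTL."""
    one_way = 0.0
    for name, forward_ms, icmp_ms in PATH:
        one_way += forward_ms
        ttl -= 1
        if ttl == 0:
            # This router expires the packet and generates the reply.
            # The reply is assumed to retrace the forward path.
            return name, one_way + icmp_ms + one_way
    return None  # TTL larger than the modelled path


for ttl in (1, 2, 3):
    print(ttl, probe(ttl))
```

The probes report 2.5 ms, 54.0 ms and 6.5 ms for `r1`, `r2` and `r3`. The spike at `r2` does not carry over to `r3`, so in this model it says nothing about delay for forwarded traffic.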

But to your question regarding filtering, I'd venture to guess that more and more people are going to filter access attempts to their infrastructure to hinder DoS attacks. If I were to build a brand new network today I'd use loopbacks and link addresses that I'd either filter at the border or not announce on the internet at all. Not announcing them at all of course brings the problem of people using "uRPF loose" and dropping these packets, which will break traceroute and other tools. Better might of course be to rate-limit traffic to the infrastructure addresses to a very low number, let's say 2 megabit/s or so. This will limit DoS attacks and break diagnostics during an attack, but will make traceroute work properly during normal conditions. I guess everybody has to make up their own mind about these priorities when designing their networks. It's important to be aware of all aspects of course, and it's good we have these discussions so more people understand the ramifications.
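
To make the loose uRPF problem concrete, here is a minimal Python sketch. The route list is hypothetical and uses documentation prefixes, with no default route; in loose mode a packet is accepted when any route covers its source address, whichever interface it arrived on.

```python
import ipaddress

# Hypothetical routing table of a remote network running loose uRPF.
ROUTES = [
    ipaddress.ip_network("198.51.100.0/24"),
    ipaddress.ip_network("203.0.113.0/24"),
]


def urpf_loose_accepts(src):
    """Loose mode: accept if any route covers the source address."""
    addr = ipaddress.ip_address(src)
    return any(addr in net for net in ROUTES)


print(urpf_loose_accepts("203.0.113.9"))  # source inside an announced prefix
print(urpf_loose_accepts("192.0.2.1"))    # unannounced link address
```

A TTL-exceeded reply sourced from an unannounced link address (192.0.2.1 here) fails the check and is dropped, so that hop shows up as a timeout in the remote user's traceroute.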

Anyone know of a document that describes this operationally? This would be a good thing to include in an "ISP essentials" type of document.