best practice for advertising peering fabric routes

I think we're looking at two different aspects of the same issue.

I believe you're coming at it from the angle of 'for all users of the Internet, what's the chance they have connectivity that does not break PMTU-D?' That's an important group to study, particularly for those DSL users still left with < 1500 byte MTU's. And you're right, for those users IXP's are the least of their worries; mostly it's poor content-side networking, like load balancers and firewalls that don't work correctly.

I am approaching it from a different perspective, 'where is PMTU-D broken for people who want to use 1500-9K frames end to end?' I'm the network guy who wants to buy transit in the US, and transit in Germany and run a tunnel of 1500 byte packets end to end, necessitating a ~1540 byte packet. Finding transit providers who will configure jumbo frames is trivial these days, and most backbones are jumbo frame clean (at least to 4470, but many to 9K). There's probably about a 25% chance private peerings are also jumbo clean. Pretty much the only thing broken for this use case is IXP's. Only a few have a second VLAN for 9K peerings, and most participants don't use it for a host of reasons, including PMTU-D problems.
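To make the tunnel arithmetic concrete, here is a minimal sketch. The 40-byte overhead is simply what the ~1540 figure implies; real overhead depends on the encapsulation chosen. The hop names and MTU values are hypothetical, picked to mirror the 1500 / 4470 / 9K cases mentioned above.

```python
# Hypothetical path: (hop name, MTU in bytes). Illustrative values only,
# not measurements of any real network.
PATH = [
    ("transit-us", 9000),
    ("backbone", 4470),
    ("ixp-lan", 1500),
    ("transit-de", 9000),
]


def tunnel_packet_size(inner_size: int, overhead: int = 40) -> int:
    """Size of the encapsulated packet; the 40-byte default is an assumption."""
    return inner_size + overhead


def blocking_hops(path, packet_size):
    """Return the hops whose MTU is too small to carry packet_size unfragmented."""
    return [name for name, mtu in path if mtu < packet_size]


outer = tunnel_packet_size(1500)
print(outer, blocking_hops(PATH, outer))
```

With these made-up numbers the only hop that cannot carry the tunneled packet is the 1500-byte exchange LAN, which is the case being described.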

I'm an oddball. I think MPLS VPN's are a terrible idea for the consumer, locking them into a single provider in the vast majority of cases. Consumers would be better served by having a tunnel box (IPSec maybe?) at their edge and running their own tunnel over IP provider-independently, if they could get > 1500B MTU at the edge, and move those packets end to end. While I've always thought that, in the post-Snowden world I think I seem a little less crazy: rather than relying on your provider to keep your "VPN" traffic secret, customers should be encrypting it in a device they own.

But hey, I get why ISP's don't want to offer 9K MTU clean paths end to end. Customers could then buy a VPN appliance and manage their own VPN's with no vendor lock-in. MPLS VPN revenues would tumble, and customers would move more fluidly between providers. That's terrible if you're an ISP.

I understand that perspective, absolutely.

But what I'm saying is that whether or not they want to use jumbo frames for Internet traffic doesn't matter, because PMTU-D is likely to be broken at the place where the traffic is initiated, the place where it is received, or both. So any nonsense in the middle, IXP networks included, isn't really a significant issue in and of itself.

If we could get things optimized and remediated to the point where potential PMTU-D breakage in IXP networks were a significant issue in and of itself, the Internet would be much improved. But I don't see any likelihood of that happening anytime soon.

Hi Patrick,

I have to disagree with you. If it appears in a traceroute to
somewhere else, I'd like to be able to ping and traceroute directly to
it. When I can't, that impairs my ability to troubleshoot the all too
common can't-get-there-from-here problems. The more you hide the
infrastructure, the more intractable problems become for your
customers.

The IXP LAN should be reachable from every device on the ASes which
connect to it, not just the immediate router.

Regards,
Bill Herrin

Your assertion does not match my deployment experience.

When I have deployed endpoints that have working PMTU-D, I have 99.999% success with the ISP's in the middle having working PMTU-D. It even works fine for 9K providers connected to 1500B exchange points, because the packet-too-big typically originates from the input side of the router (the backbone link to the IXP router). Indeed, the only place I've seen it broken is where the ISP 9K peers at an exchange, and the "far end" ISP runs a < 9K backbone (like 4470), so the far end IXP-router does the packet-too-big, and originates it from the exchange LAN, which because it's no longer in the table fails to pass uRPF.
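The uRPF failure mode can be sketched roughly as follows. This models a loose-mode check only (is there any route covering the source?), which is an assumption on my part; strict mode also checks the arriving interface. The prefixes are RFC 5737 documentation space standing in for a real exchange LAN.

```python
import ipaddress


def urpf_loose_pass(source: str, routing_table) -> bool:
    """Loose-mode uRPF: accept only if some route covers the source address."""
    addr = ipaddress.ip_address(source)
    return any(addr in ipaddress.ip_network(prefix) for prefix in routing_table)


# Hypothetical prefixes (documentation space), not real IXP assignments.
IXP_LAN = "192.0.2.0/24"
table_without_ixp = ["198.51.100.0/24", "203.0.113.0/24"]
table_with_ixp = table_without_ixp + [IXP_LAN]

# The far-end router sources its packet-too-big from its exchange LAN address.
icmp_source = "192.0.2.17"

print(urpf_loose_pass(icmp_source, table_without_ixp))  # False: ICMP is dropped
print(urpf_loose_pass(icmp_source, table_with_ixp))     # True: ICMP gets through
```

When the exchange prefix is absent from the table, the packet-too-big is discarded before the sender ever sees it, and PMTU-D silently fails.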

(Business class) ISP's don't break PMTU-D, end users break it with the equipment they connect. So a smart user connecting equipment that is properly configured should be able to expect it to work properly.

We disagree.

Plus, you really can't type "ping" on the router connected to the IXP?

_If_ you can guarantee your network has zero bots, abusable [DNS|NTP|etc.] servers, all your downstreams are perfectly clean, etc., etc., then maybe I could see you carrying it in your IGP.

As I know 100% of ISPs (to at least one decimal place) cannot make such a guarantee, then doing so puts the IXP and all other members - whether peers of yours or not - at risk. Putting others at risk because you are lazy or because it makes your life easier is .. I believe I called it bad manners before.

But let's take the philosophical out of this. The prefix in question is owned by the IXP. I said in an earlier post that if you carry a prefix I own, did not announce to you, and make it very clear I specifically do not want you to carry, I will ask you to stop or face possible disconnection. You may claim convergence (a bit of BS), troubleshooting (non-issue, IMO), or even "but I waaaaaaaaaaaant to!!1!1!" (whatever). Doesn't matter. That's not your prefix, you were not given it and told not to carry it, so Do Not Carry It.

Ask your IXP if they mind whether you carry the prefix. See what they say.

(Business class) ISP's don't break PMTU-D, end users break it with the equipment they connect.

Concur 100%. That's my point.

So a smart user connecting equipment that is properly configured should be able to expect it to work properly.

In my deployment experience, many (most?) end-user organizations break PMTU-D to/through their LANs outside of their IDCs, much less to the Internet, for themselves, and for everyone who wishes to communicate with them across the Internet.

UUnet once advertised the /24 for MAE-East to me (well, Net99), and because I also had it in my IGP, my network was using UUnet's backbone for west-to-east coast traffic for a couple of days until I noticed and fixed it (with next-hop-self).

I agree 100% with Patrick and others on this point. No good can come from propagating IXP address space any further than is absolutely necessary. Best not to propagate it at all.

Dave

I have to disagree with you. If it appears in a traceroute to
somewhere else, I'd like to be able to ping and traceroute directly to
it. When I can't, that impairs my ability to troubleshoot the all too
common can't-get-there-from-here problems. The more you hide the
infrastructure, the more intractable problems become for your
customers.

The IXP LAN should be reachable from every device on the ASes which
connect to it, not just the immediate router.

We disagree.

Plus, you really can't type "ping" on the router connected to the IXP?

Not when I'm the downstream customer, no. It's jolly good that *you*
can test, but before the rest of us can get through the layers of
support which insulate you, we have to be able to convincingly test
too.

As I know 100% of ISPs (to at least one decimal place) cannot
make such a guarantee, then doing so puts the IXP and all other
members - whether peers of yours or not - at risk. Putting others
at risk because you are lazy or because it makes your life easier
is .. I believe I called it bad manners before.

That makes no sense. The IXP is at no more or less risk from your
customers than any other connection you have for Internet carriage.
Risk which you are responsible for managing either way.

I said in an earlier post that if you carry a prefix I own,
did not announce to you, and make it very clear I
specifically do not want you to carry, I will ask you to
stop or face possible disconnection. [...] That's not your prefix,
you were not given it and told not to carry it, so Do Not Carry It.

Well yes, of course. If you participate in an IXP you follow the rules
of the IXP. I respectfully question the wisdom of such a rule and the
IXPs I deal with only ask that you not announce the IXP prefix
externally. But it's not OK to unilaterally break the IXP's rules,
however poorly conceived.

Regards,
Bill Herrin

So ... RFC1918 addresses for the IXP fabric, then?

(Half kidding, but still ....)

Jim Shankland

I repeat: NEVER EVER EVER put an IX prefix into BGP, IGP, or even static route. An IXP LAN should not be reachable from any device except those directly attached to that LAN. Period.

So ... RFC1918 addresses for the IXP fabric, then?

I've heard apparently non-drunk people suggest IPv6 link-local addresses as BGP endpoints across exchanges, too.

(Half kidding, but still ....)

RFC 6752.

One observation on this thread: some networks have customers who react badly to unusual things seen in traceroute. Sometimes the margin on an individual customer is low enough that one support call displaces any profit you were going to make off them this month.

It's understandable to me that such network operators would choose to carry IXP routes internally in order to avoid that potential support burden.

I don't pretend to have any universal good/bad answer to the original question, though. I don't think the world is that simple.

Joe

* nanog@shankland.org (Jim Shankland) [Wed 15 Jan 2014, 18:04 CET]:

So ... RFC1918 addresses for the IXP fabric, then?

(Half kidding, but still ....)

They need to be globally unique.

  -- Niels.

* patrick@ianai.net (Patrick W. Gilmore) [Wed 15 Jan 2014, 04:36 CET]:
[..]

NEVER EVER EVER put an IX prefix into BGP, IGP, or even static route. An IXP LAN should not be reachable from any device not directly attached to that LAN. Period.

This is correct, and protects both your (ISP) infrastructure and the IXP's. All major European IXPs revisited their policy after the giant DDoS attack on CloudFlare, and the above was pretty much the outcome.

  -- Niels.

do they? :)

also... there is/was an exchange in south america (Colombia maybe?
it's been a while since I saw this in configs) that used
192.168.0.0/16 space for their exchange.

Hi Niels,

Actually, they don't. To meet the basic definition of working, they
just have to be able to originate ICMP destination unreachable packets
with a reasonable expectation that the recipient will receive those
packets. Global uniqueness is not required for that. However, RFC1918
addresses don't meet the requirement for a different reason: they're
routinely dropped at AS borders, thus don't have an expectation of
reaching the external destination.

Of course working, monitorable and testable are three different
things. If my NMS can't reach the IXP's addresses, my view of the IXP
is impaired. And "the Internet is broken" is not a trouble report that
leads to a successful outcome with customer support... it helps to be
able to pin things down with some specificity.

Regards,
Bill Herrin


Using RFC1918 would incur the assumption that one will need to use a
unique router or routing instance for every exchange connected to
since exchanges are likely to have overlapping space at that point
(RFC1918 IXP registry anyone?). I don't think it'd be a good idea to
go down that path..
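As a rough illustration of the overlap problem, assuming each exchange picked its RFC1918 block independently (the exchange names and prefixes below are made up):

```python
import ipaddress
from itertools import combinations

# Hypothetical exchanges, each choosing private space without coordination.
EXCHANGES = {
    "ix-a": "192.168.0.0/16",
    "ix-b": "192.168.10.0/24",
    "ix-c": "10.20.0.0/22",
}


def overlapping_pairs(exchanges):
    """Return pairs of exchange names whose LAN prefixes overlap."""
    nets = {name: ipaddress.ip_network(p) for name, p in exchanges.items()}
    return [
        (a, b)
        for a, b in combinations(sorted(nets), 2)
        if nets[a].overlaps(nets[b])
    ]


print(overlapping_pairs(EXCHANGES))
```

Any pair that shows up here could not share a routing table on a router attached to both exchanges, hence the separate router or routing instance.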

Also mentioned in a past nanog was the idea of potentially getting
someone like team cymru to setup all exchange prefixes in a special
bogon list and you could null route on your edge all those prefixes..
I inquired to team cymru about this back when originally discussed but
never got anywhere with them.

This approach concerns me for a number of reasons.

First, having your NMS ping your upstream’s IXP peers probably doesn’t scale. If I’m a peer of a reasonably large provider, I’m pretty sure I don’t want all their customers hammering my management plane. Even if you’re the only one doing it, you also don’t know if I’m rate-limiting pings for that or any other reason.

Second, what information do you get that you didn’t already have? If you saw the IP in a traceroute then you know it exists, is alive, is in the path, and a rough estimation of the latency. Pinging it may even give you negative information. Platforms vary and all, but in my experience pinging a router, especially a potentially busy one peering at an IXP, shows notably worse performance than “real” traffic experiences (admittedly somewhat true of TTL Expired responses, but less so in my experience). Now you’re potentially seeing high latency and packet loss which in reality might not even be there at all.

Third, you don’t know that your ping to the peering IP is even taking the same path as the packets addressed to the real destination. MTR for example looks nice, but it would probably be more accurate if it simply ran the traceroute over and over instead of pinging each hop directly. You would also detect path changes for the real destination that pinging intermediate hops wouldn’t show you.

While I appreciate the desire to be able to do as much of your own detective work as possible, I can also see where you’re now shifting workload onto someone else’s support organization when they’re not necessarily the problem either (“Hey, my NMS says your peering router is causing latency and packet loss, fix it!”).

I’m also not saying there isn’t a troubleshooting gap caused by this. I’m just not sure being able to ping the IXP hop solves that problem either.

Semi-related tangent: Working in an IXP setting I have seen weird corner cases cause issues in conjunction with the IXP subnet existing in BGP. Say someone’s got proxy ARP enabled on their router (sadly, more common than it should be, and not just from noobs at startups). Now say your IXP is growing and you expand the subnet. No matter how much you harp on the customers to make the change, they don’t all do it at once. Someone announces the new, larger subnet in BGP. Now when anyone ARPs for IPs in the new part of the range, proxy ARP guy (still on the smaller subnet) says “hey I have a route for that, send it here”. That was fun to troubleshoot. :)
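A simplified model of that corner case, for anyone who wants to see the decision spelled out. Real router behaviour varies by platform, so treat this as a sketch; the addresses are documentation prefixes, with a /25 growing to a /24 standing in for the subnet expansion.

```python
import ipaddress


def answers_arp(target, own_ip, interface_net, routes, proxy_arp):
    """Simplified model of whether a router replies to an ARP request for target."""
    addr = ipaddress.ip_address(target)
    if addr == ipaddress.ip_address(own_ip):
        return True   # its own address: always answers
    if addr in ipaddress.ip_network(interface_net):
        return False  # on-link as far as this router knows: the owner answers
    if not proxy_arp:
        return False
    # Proxy ARP: answer for off-link targets the router has a route to.
    return any(addr in ipaddress.ip_network(r) for r in routes)


# Hypothetical straggler still configured with the old, smaller mask.
OLD_LAN = "198.51.100.0/25"
NEW_LAN = "198.51.100.0/24"   # expanded subnet, now also announced in BGP
straggler_ip = "198.51.100.10"
new_member_ip = "198.51.100.200"  # lives in the newly added half

print(answers_arp(new_member_ip, straggler_ip, OLD_LAN, [NEW_LAN], True))   # True
print(answers_arp(new_member_ip, straggler_ip, OLD_LAN, [NEW_LAN], False))  # False
print(answers_arp(new_member_ip, straggler_ip, OLD_LAN, [], True))          # False
```

The hijack needs both ingredients at once: proxy ARP on the straggler and a route for the larger subnet in its table.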

-c

* clay@bloomcounty.org (Clay Fiske) [Wed 15 Jan 2014, 20:34 CET]:

Semi-related tangent: Working in an IXP setting I have seen weird corner cases cause issues in conjunction with the IXP subnet existing in BGP. Say someone’s got proxy ARP enabled on their router (sadly, more common than it should be, and not just from noobs at startups). Now say your IXP is growing and you expand the subnet. No matter how much you harp on the customers to make the change, they don’t all do it at once. Someone announces the new, larger subnet in BGP. Now when anyone ARPs for IPs in the new part of the range, proxy ARP guy (still on the smaller subnet) says “hey I have a route for that, send it here”. That was fun to troubleshoot. :)

Properly run IXPs pay engineers to hunt down people with Proxy ARP enabled on their peering interfaces.

  -- Niels.

* bill@herrin.us (William Herrin) [Wed 15 Jan 2014, 19:27 CET]:

Yes, yes, I expected a smug reply like this. I just didn’t expect it to take so long.

But how can I detect proxy ARP when detecting proxy ARP was patented in 1996?

http://www.google.com/patents/US5708654

Seriously though, it’s not so simple. You only get replies if the IP you ARP for is in the offender’s route table (or they have a default route). I’ve seen different routers respond depending on which non-local IP was ARPed for. And while using something like 8.8.8.8 might be an obvious choice, I don’t care to hose up everyone’s connectivity to it just to find local proxy ARP offenders on my network.

-c

* clay@bloomcounty.org (Clay Fiske) [Thu 16 Jan 2014, 00:35 CET]:
[...]

Seriously though, it’s not so simple. You only get replies if the IP you ARP for is in the offender’s route table (or they have a default route). I’ve seen different routers respond depending on which non-local IP was ARPed for. And while using something like 8.8.8.8 might be an obvious choice, I don’t care to hose up everyone’s connectivity to it just to find local proxy ARP offenders on my network.

You'll never be entirely sure but obviously you're not limited to sending only one ARP request - this isn't the movie The Hunt for Red October. We're talking a common misconfiguration here in this thread - or at least you were, two mails upthread.

How will checking for Proxy ARP possibly hose up anybody's connectivity? You realise that ARP replies are unicast, right? And that IXPs generally have dedicated servers for monitoring from which they can source packets?
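For what it's worth, the bookkeeping side of such a check might look like the sketch below. It assumes a monitoring host has already sent ARP requests for several off-LAN addresses and recorded which MACs replied; the probing itself is not shown, and every address and MAC here is invented (documentation ranges).

```python
def proxy_arp_suspects(replies, assignments):
    """Map each MAC to the probed IPs it answered for but is not assigned."""
    suspects = {}
    for target_ip, macs in replies.items():
        for mac in macs:
            if assignments.get(target_ip) != mac:
                suspects.setdefault(mac, []).append(target_ip)
    return suspects


# Hypothetical member database: exchange LAN IP -> registered MAC.
ASSIGNMENTS = {
    "192.0.2.10": "00:00:5e:00:53:01",
    "192.0.2.11": "00:00:5e:00:53:02",
}

# Hypothetical probe results: probed IP -> set of MACs that replied.
REPLIES = {
    "192.0.2.10": {"00:00:5e:00:53:01"},    # legitimate owner answered
    "203.0.113.5": {"00:00:5e:00:53:02"},   # off-LAN target answered: suspect
    "198.51.100.9": {"00:00:5e:00:53:02"},  # second off-LAN probe, same MAC
    "203.0.113.77": set(),                  # nobody answered
}

print(proxy_arp_suspects(REPLIES, ASSIGNMENTS))
```

Probing several targets covers the case where a router only answers for addresses it happens to have a route to.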

  -- Niels.