best practice for advertising peering fabric routes

I have a connection to a peering fabric and I'm not distributing the peering fabric routes into my network.

I see three options
1. redistribute into my igp (OSPF)

2. configure ibgp and route them within that infrastructure. All the default routes go out through the POPs so iBGP would see packets destined for the peering fabric and route it that-a-way

3. leave it "as is", and let the outbound traffic go out my upstreams and the inbound traffic come back through the peering fabric

Advantages and disadvantages, pros and cons? Recommendations? Experiences, good and bad?

I have 5 POPs, 2 OSPF areas, and have not brought iBGP up between the POPs yet. That's another issue completely from a planning perspective.

thanks
Eric


I like no-export

> I have a connection to a peering fabric and I'm not distributing the
> peering fabric routes into my network.

good plan.

> I see three options
> 1. redistribute into my igp (OSPF)
>
> 2. configure ibgp and route them within that infrastructure. All the
> default routes go out through the POPs so iBGP would see packets destined
> for the peering fabric and route it that-a-way
>
> 3. leave it "as is", and let the outbound traffic go out my upstreams and
> the inbound traffic come back through the peering fabric

4. all peering-fabric routes get next-hop-self on your peering router
before going into ibgp...
all the rest of your network sees the peering router's loopback as the
next hop and things just work.
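
To make the "things just work" part concrete, here is a minimal Python sketch of next-hop resolution. It is only a model, not router configuration: the addresses (192.0.2.0/24 as the exchange LAN, 10.255.0.1 as the peering router's loopback) and the helper names are assumptions made up for illustration.

```python
import ipaddress

# Hypothetical addressing, for illustration only.
IXP_LAN = ipaddress.ip_network("192.0.2.0/24")        # exchange LAN, NOT carried in the IGP
IGP_ROUTES = [ipaddress.ip_network("10.255.0.1/32")]  # peering router loopback, carried in OSPF
PEERING_ROUTER_LOOPBACK = ipaddress.ip_address("10.255.0.1")


def next_hop_self(route, loopback):
    """Return a copy of the route with the next hop rewritten to our loopback."""
    return {**route, "next_hop": loopback}


def resolves(route, igp_routes):
    """An iBGP route is usable only if its next hop is covered by an IGP route."""
    return any(route["next_hop"] in net for net in igp_routes)


# Route as learned from a peer across the exchange: next hop sits on the IXP LAN.
learned = {
    "prefix": ipaddress.ip_network("198.51.100.0/24"),
    "next_hop": ipaddress.ip_address("192.0.2.10"),
}
advertised = next_hop_self(learned, PEERING_ROUTER_LOOPBACK)

print(resolves(learned, IGP_ROUTES))     # False: the IXP LAN is not in the IGP
print(resolves(advertised, IGP_ROUTES))  # True: the loopback is
```

Internal routers never need a route to the exchange LAN; they only need to reach the peering router's loopback.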

Pardon the top post, but I really don't have anything to comment below other than to agree with Chris and say RFC 5963 is broken.

NEVER EVER EVER put an IX prefix into BGP, IGP, or even a static route. An IXP LAN should not be reachable from any device not directly attached to that LAN. Period.

Doing so endangers your peers & the IX itself. It is on the order of not implementing BCP38, except no one has the (lame, ridiculous, idiotic, and pure cost-shifting BS) excuse that they "can't" do this.


+1. RFC 5963 needs to update that guidance. Set next-hop-self with loopback0 and done.

CB

--
TTFN,
patrick


There's a two part problem lurking.

Problem #1 is how you handle your internal routing. Most of the "big boys" will next-hop-self in iBGP all external routes. However, depending on the size and configuration of your network, there may be advantages to not using next-hop-self, or to just putting the exchange LAN in your IGP. Basically, you should be doing the same thing you do for a /30 from a peer or transit provider in your network. There is one thing special about an exchange point, though: for security reasons you probably want to add it to your "never accept" routing filter from peers/customers/transit providers. You don't need someone injecting a couple of more-specifics to mess with your routing.
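
As a rough illustration of the "never accept" filter, the Python sketch below rejects the exchange LAN and anything more specific inside it. The prefix 192.0.2.0/24 and the function name are hypothetical; a real filter would live in your router's prefix-list or policy language.

```python
import ipaddress

IXP_LAN = ipaddress.ip_network("192.0.2.0/24")  # hypothetical exchange LAN


def accept_from_neighbor(prefix, never_accept=(IXP_LAN,)):
    """Reject the exchange LAN itself and any more-specific prefix inside it."""
    net = ipaddress.ip_network(prefix)
    return not any(
        net.version == bad.version and net.subnet_of(bad) for bad in never_accept
    )


print(accept_from_neighbor("192.0.2.0/24"))     # False: the exchange LAN itself
print(accept_from_neighbor("192.0.2.128/25"))   # False: a more-specific inside it
print(accept_from_neighbor("198.51.100.0/24"))  # True: unrelated prefix
```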

Problem #2 is your customers. If you have customers that may operate default free, and they use one of the traceroute tools that not only finds the route, but then continues to probe it (like MTR, or Visual Traceroute) there can be an issue. The initial traceroute probe may return an IP on the exchange of your peer's router, but then when they subsequently source ICMP Ping to that IP there will be no route in their network, and it will simply never respond. Some call this a feature, some call this a problem. There is also an extremely rare problem where the far end of the peering exchange steps down MTU, and thus PMTU discovery is invoked, but your customers use Unicast RPF. Since the exchange LAN isn't in their table, Unicast RPF may drop the PMTU packet-too-big message, causing a timeout.
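
The Unicast RPF case can be sketched as below. This is a deliberately simplified model (it only asks whether any route covers the packet's source, and ignores platform-specific handling of default routes and strict mode); the prefixes and the assumption that the packet-too-big message is sourced from the peer's exchange-LAN address are mine.

```python
import ipaddress


def urpf_accepts(source, fib):
    """Simplified check: accept only if some route in the table covers the source."""
    addr = ipaddress.ip_address(source)
    return any(addr in ipaddress.ip_network(prefix) for prefix in fib)


# Hypothetical: the peer router's address on the exchange LAN.
ICMP_SOURCE = "192.0.2.10"

# Customer running default free, with a table that lacks the exchange LAN.
default_free_fib = ["198.51.100.0/24", "203.0.113.0/24"]
# Same customer with a default route added.
fib_with_default = default_free_fib + ["0.0.0.0/0"]

print(urpf_accepts(ICMP_SOURCE, default_free_fib))  # False: packet-too-big is dropped
print(urpf_accepts(ICMP_SOURCE, fib_with_default))  # True
```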

If your customers have a default to you, all is well. However, if they have a default to someone else and take a table from you to selectively override it, the same problem can occur for any routes they select through you that also traverse the exchange.

IMHO the best fix for #2 is that the exchange have an ASN, and announce the exchange LAN from that ASN, typically via the route server. You should then peer with the route server to pick up that network. That makes the announcement consistent, and makes it clear who operates that network, and your customers can then access it. Many exchanges do not do this, and then the next best solution might be to originate it from your ASN and announce it to your customers only, with no-export set on the way out.
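
A rough Python model of the second option (originate the exchange LAN from your ASN, announce it to customers only, tagged no-export) might look like this. The prefix, the neighbor-type labels, and the function names are hypothetical; no-export itself is the well-known BGP community from RFC 1997.

```python
import ipaddress

IXP_LAN = ipaddress.ip_network("192.0.2.0/24")  # hypothetical exchange LAN
NO_EXPORT = "no-export"                          # well-known community (RFC 1997)


def announce_ixp_lan(neighbor_type):
    """Announcement for the exchange LAN toward a neighbor, or None to suppress it."""
    if neighbor_type != "customer":
        return None  # never sent to peers or transit providers
    return {"prefix": IXP_LAN, "communities": {NO_EXPORT}}


def may_send_to_ebgp(route):
    """A receiver that honors no-export keeps the route inside its own AS."""
    return NO_EXPORT not in route["communities"]


to_customer = announce_ixp_lan("customer")
print(announce_ixp_lan("peer"))       # None
print(may_send_to_ebgp(to_customer))  # False
```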

Various people will no doubt chime in and tell you the last two suggestions are either excellent and wonderful or the worst idea ever. Safe to say I know of networks doing both and the world has not ended. YMMV, some assembly required, batteries not included, actual conditions may affect product performance, do not taunt the happy fun ball, and consult a doctor if your network is up for more than four hours.

I've known Leo for .. well, let's just say a long time. And I have great respect for his networking abilities. But I fall into the second camp. As someone who owns & operates an IXP, and is on the board of a couple more, and helped start even more, I'm going to stick to my guns here.

As for knowing networks that do both, blah, blah, blah. I know lots of networks that allow spam, don't configure BCP38, have abusable name or NTP servers, etc. and the world has not come to an end. Doesn't mean you should. Lame excuse, Leo, and beneath you to even go there.

NEVER EVER EVER put an IX prefix into BGP, IGP, or even a static route. An IXP LAN should not be reachable from any device not directly attached to that LAN. Period.

If for no better reason, how about because it is not your prefix, and chances are the IXP does not want you to use the prefix. In fact, I challenge you to find a major IXP route server which is announcing the IXP block.

But because this is a teaching list, let's go through the problems Leo mentions. Anyone who steps down MTU on an IXP is far too broken to worry about your customer having RPF and not getting PMTU. Again, I challenge you to find someone doing this today, their network would be close to unusable. As for traceroute .... Seriously? You want to increase breakage on the Internet because it might cause 3 stars in a traceroute? Puh-LEEEZE. Sorry, neither of those pass the sniff test, IMHO.

So Just Don't Do It. Setting next-hop-self is not just for "big guys", the crappiest, tiniest router that can do peering at an IXP has the same ability. Use it. Stop putting me and every one of your peers in danger because you are lazy.

I'm going to have to disagree here with Patrick, because this is security through obscurity, and that doesn't work well.

For some history about why people like Patrick take the position he did, read: http://blog.cloudflare.com/the-ddos-that-almost-broke-the-internet

Exchange points got attacked, so people yanked them from the routing table hoping to prevent attacks. If you're on this list it should take you all of about 3 seconds to realize the attackers could do a traceroute, and attack the IP one hop on the far side of the exchange for a few dozen providers and still cause all sorts of havoc, or do any of another half dozen things I won't mention to cause problems. The effect would be nearly, if not perfectly identical, since that traffic still has to cross the exchange.

I'll point out the MTU step-down issue is real, and it's part of why we can't have 9K MTU exchanges be the default on the Internet, which would really make things better for a significant number of users. I think Patrick is a bit quick to dismiss some of the potential issues.

Every link on every router is subject to attack. Exchange point LANs really aren't special in that regard. If anything, the only thing that makes them slightly special is that they may in fact be more oversubscribed than most links. Where a backbone might have a router with 20x10GE, so attackers could in theory try to drive 190GE out a 10GE, an exchange point may have 100 people with 20x10GE coming in. An alternate view is that mega-exchange points are massively oversubscribed potential single points of failure, and perhaps network operators should consider that. While a DDoS taking an exchange down for half a day is bad, imagine if there was a more sinister attack, taking out the physical infrastructure of an exchange. That can't be "fixed" with a routing advertisement.

> I'm going to have to disagree here with Patrick, because this is security through obscurity, and that doesn't work well.

Leo, each of your points below is incorrect. I'm happy to discuss off-list if you'd like.

> For some history about why people like Patrick take the position he did, read: http://blog.cloudflare.com/the-ddos-that-almost-broke-the-internet
>
> Exchange points got attacked, so people yanked them from the routing table hoping to prevent attacks. If you're on this list it should take you all of about 3 seconds to realize the attackers could do a traceroute, and attack the IP one hop on the far side of the exchange for a few dozen providers and still cause all sorts of havoc, or do any of another half dozen things I won't mention to cause problems. The effect would be nearly, if not perfectly identical, since that traffic still has to cross the exchange.

Let's take just the incident mentioned in the blog post above (which is pretty broken itself, but hey, who said the CEO of a CDN had to know anything about networking... ? :).

To where would the attacker traceroute -to-? Somewhere inside Cloudflare? Other LINX members? Remember, most of the attack was sourced from networks which were not attached to the LINX. If the source network or the source network's upstreams are not LINX members, there is probably _no_ path that goes through LINX. Even if they are members, lots of networks have alternative paths (other IXPs, private interconnections, etc.). For instance, sources in Germany may well flow over DE-CIX even if there is a peering session at LINX. Etc. There is no single or set of IP addresses that will guarantee even a majority of packets traverse a specific IXP except the IXP LAN.

Also, the attack was reflected DNS, so the attacker couldn't actually perform the traceroutes you suggest from each source as he did not control the sources. He _might_ be able to find _some_ of the paths with a lot of sleuthing through route & traceroute servers, but that would make things massively more difficult, as well as massively cut the number of servers he can abuse to the same effect. Both of which are huge wins for the good guys.

Pulling the IXP prefix has enormous benefits and essentially no downside. I know literally hundreds of ISPs large & small who do not carry the IXP prefix, and none have seen any significant issues (most have seen zero, a few get asked about 3 stars, but as I said before, puh-leeeeze).

I'm a bit surprised you even tried to bring this up. I know you well enough to know you would have realized all of the above if you had thought about it for a while (or just asked).

> I'll point out the MTU step-down issue is real, and it's part of why we can't have 9K MTU exchanges be the default on the Internet, which would really make things better for a significant number of users. I think Patrick is a bit quick to dismiss some of the potential issues.

MTU step-down is a real issue, and it's real enough whether IXP LANs are in the DFZ or not. Let's solve the overarching problem before doing something which has real, proven harm and leaves the root cause in place.

Besides, the two VLAN method already exists in multiple places and it hasn't helped adoption of 9K packets. Unless you are talking about letting some people attach with 1500 MTU and others with 9000 MTU? 'Cause if that's what you meant, then I'm just going to call you loony and ask what you're smoking.

> Every link on every router is subject to attack. Exchange point LANs really aren't special in that regard. If anything, the only thing that makes them slightly special is that they may in fact be more oversubscribed than most links. Where a backbone might have a router with 20x10GE, so attackers could in theory try to drive 190GE out a 10GE, an exchange point may have 100 people with 20x10GE coming in. An alternate view is that mega-exchange points are massively oversubscribed potential single points of failure, and perhaps network operators should consider that. While a DDoS taking an exchange down for half a day is bad, imagine if there was a more sinister attack, taking out the physical infrastructure of an exchange. That can't be "fixed" with a routing advertisement.

IXPs are more special because they are shared. Other links are between you & one other network not hundreds of other networks, some of whom have no relationship with you.

If you don't like the rules of IXPs, don't join one. But hooking up to one and deciding "I'm going to carry this prefix" even when told not to is .. well, let's call it bad manners.

As for the rest, nothing is a silver bullet. Claiming "this doesn't solve every possible problem so we shouldn't do it" is even more lame than your first excuse that the world hasn't ended. This solves lots of real, provable problems. It is trivial to implement. There is no network which peers at an IXP and cannot implement it. It _has_ been implemented 1000s of times without the harm you mention. In short, it should be done.

I repeat: NEVER EVER EVER put an IX prefix into BGP, IGP, or even a static route. An IXP LAN should not be reachable from any device except those directly attached to that LAN. Period.

If you join one of my IXPs and I find you are carrying a prefix I own, did not advertise to you, and specifically told you not to carry, I'm going to ask you to stop immediately or face possible disconnection. The other members of my IXP should not be endangered because you don't like to follow the rules.

What's more, I get a lot more people thanking me for doing that than complaining about it.

+1

Again, folks, this isn't theoretical. When the particular attacks cited in this thread were taking place, I was astonished that the IXP infrastructure routes were even being advertised outside of the IXP network, because of these very issues.

IXPs are not the problem when it comes to breaking PMTU-D. The problem is largely with enterprise networks, and with 'security' vendors who've propagated the myth that simply blocking all ICMP somehow increases 'security'.

Thank you - I will heed the warning. I want to be a good community member and make sure we're maintaining the agreed-upon practices (I'll re-read/review my agreement with the IXP)

So if that is the case, I have to rely on the peering fabric to just return traffic, since the rest of my network (save the directly connected router) will not know about those routes outbound? And what about my customers who are counting on me routing their office traffic through my network into the peering fabric to their properties? (I have one specifically who is eventually looking for that capability) Do I have to provide them some sort of VPN to make that happen across my network to the peering fabric router?

Never mind, I just carefully re-read the point. Right, I'll filter the prefix(es) of the IXP LAN(s) that I'm connected to and not let THAT get out, no reason to advertise it since no traffic ever goes to it. That still has me asking to how best to advertise the rest of the public prefixes coming from the other fabric members.

> So if that is the case, I have to rely on the peering fabric to just return traffic, since the rest of my network (save the directly connected router) will not know about those routes outbound? And what about my customers who are counting on me routing their office traffic through my network into the peering fabric to their properties? (I have one specifically who is eventually looking for that capability) Do I have to provide them some sort of VPN to make that happen across my network to the peering fabric router?

perhaps I'm confused, but you have sort of this situation:
  ixp-participants -> ixp -> your-router -> your-network -> your-customer

you get routes for ixp-participants from 'ixp'
you send to the 'ixp' (and on to 'ixp-participants') routes for
'your-customer' and 'your-network'

right?

then so long as you send 'your-customer' the routes you learn from
'ixp' (on which you set 'next-hop-self' in ibgp from 'your-router' to
'your-network', in the ibgp-mesh that you will set up) ... everything
just works.

All routers behind 'your-router' in 'your-network' see
'ixp-participants' with a next-hop of 'your-router' who still knows
'send to ixp!' for the route(s) in question.

> Never mind, I just carefully re-read the point. Right, I'll filter the prefix(es) of the IXP LAN(s) that I'm connected to and not let THAT get out, no reason to advertise it since no traffic ever goes to it. That still has me asking to how best to advertise the rest of the public prefixes coming from the other fabric members.

on your ibgp peers on 'your-router' you'd have something like:
  match community <community-added-for-all-ixp-participant-routes>
  set next-hop-self
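
The intent of that policy can be modeled in a few lines of Python. This is only a sketch of the logic, not vendor syntax: the community value 64500:100, the loopback address, and the function name are all made up for illustration.

```python
IXP_COMMUNITY = "64500:100"  # hypothetical tag added to all routes learned at the exchange
MY_LOOPBACK = "10.255.0.1"   # hypothetical loopback of 'your-router'


def ibgp_export(route):
    """Rewrite the next hop only for routes carrying the exchange community."""
    if IXP_COMMUNITY in route["communities"]:
        return {**route, "next_hop": MY_LOOPBACK}
    return route


ixp_route = {
    "prefix": "198.51.100.0/24",
    "next_hop": "192.0.2.10",  # peer's address on the exchange LAN
    "communities": {IXP_COMMUNITY},
}
other_route = {"prefix": "203.0.113.0/24", "next_hop": "10.0.0.2", "communities": set()}

print(ibgp_export(ixp_route)["next_hop"])    # 10.255.0.1
print(ibgp_export(other_route)["next_hop"])  # 10.0.0.2
```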

See Cisco's "Border Gateway Protocol Frequently Asked Questions"

for one vendor's view of the situation... and there is a link to:
  Cisco's "Technologies - Support Documentation" page

that's worth a read.

Ok, so the right way to do it is in iBGP. That pretty much answers the question - don't redistribute those ixp-participant prefixes into my IGP.

I have a lot of iBGP homework to do, to make it work with the 5 POPs that are all taking full route feeds. I tried once and couldn't get the BGP tables working correctly with a full mesh of the 5 routers, so it looks like time to try it again, this time with a route reflector.

On 15/01/2014 07:59, Eric A Louie wrote:

> Ok, so the right way to do it is in iBGP. That pretty much answers the question - don't redistribute those ixp-participant prefixes into my IGP.

Yes, using next-hop self (rather than importing IXP routes) as pointed
out earlier in this thread.

> I have a lot of iBGP homework to do, to make it work with the 5 POPs that are all taking full route feeds. I tried once and couldn't get the BGP tables working correctly with a full mesh of the 5 routers, so it looks like time to try it again, this time with a route reflector.

I don't think you need route-reflection in a 5 node iBGP. What do you
mean by "couldn't get the BGP tables working correctly"?
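
For a sense of scale, here is a small Python sketch of the session counts involved. The function names are just for illustration, and the reflector case assumes every client peers with every reflector and the reflectors are meshed among themselves.

```python
def full_mesh_sessions(routers):
    """Number of iBGP sessions when every router peers with every other router."""
    return routers * (routers - 1) // 2


def route_reflector_sessions(routers, reflectors=1):
    """Sessions when each client peers with every reflector and reflectors are fully meshed."""
    clients = routers - reflectors
    return clients * reflectors + full_mesh_sessions(reflectors)


print(full_mesh_sessions(5))        # 10
print(route_reflector_sessions(5))  # 4
```

Ten sessions across 5 routers is still quite manageable by hand.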

Cheers,

mh

I'm for doing it now and not worrying about it later.

Also, don't originate your routes from your peering router

Mark.

> Again, folks, this isn't theoretical. When the particular attacks cited in this thread were taking place, I was astonished that the IXP infrastructure routes were even being advertised outside of the IXP network, because of these very issues.

I know a lot of people push next-hop-self, and if you're a large ISP with thousands of BGP customers it is pretty much required to scale.

However, a good engineer would know there are drawbacks to next-hop-self, in particular it slows convergence in a number of situations. There are networks where fast convergence is more important than route scaling, and thus the traditional design of BGP next-hops being edge interfaces, and edge interfaces in the IGP performs better.

By attempting to force IX participants to not put the route in IGP, those IX participants are collectively deciding on a slower converging network for everyone. I don't like a world where connecting to an exchange point forces a particular network design on participants.

> IXPs are not the problem when it comes to breaking PMTU-D. The problem is largely with enterprise networks, and with 'security' vendors who've propagated the myth that simply blocking all ICMP somehow increases 'security'.

That's some circular reasoning.

Networks won't 9K peer at exchange points for a number of reasons, including PMTU-D issues.

Since there is virtually no 9K peering at exchange points, PMTU-D is a non-issue.

Maybe if IXP design didn't break PMTU-D it would help attract more 9K peers, or there might even be a future where 9K peering was required?

This whole problem smacks to me of exchange points that are "too big to fail". Since some of these exchanges are so big, everyone else must bend to their needs. I think the world would be a better place if some of these were broken up into smaller exchanges and they imposed fewer restrictions on their participants.

> However, a good engineer would know there are drawbacks to next-hop-self, in particular it slows convergence in a number of situations. There are networks where fast convergence is more important than route scaling, and thus the traditional design of BGP next-hops being edge interfaces, and edge interfaces in the IGP performs better.

A good engineer also knows that there are huge drawbacks to having a peer's network infrastructure DDoSed, routes flapping, core bandwidth consumed by tens and hundreds of Gb/sec of attack traffic, et al., too.

;>

> By attempting to force IX participants to not put the route in IGP, those IX participants are collectively deciding on a slower converging network for everyone. I don't like a world where connecting to an exchange point forces a particular network design on participants.

Concur. But that's the world we live in, unfortunately.

It's just another example of the huge, concentric nature of the collateral damage arising from DDoS attacks, both from the attacks themselves, and from the compromises folks have to make in order to increase resilience against such attacks.

> That's some circular reasoning.

Not really. What I'm saying is that since PMTU-D is already broken on so many endpoint networks (i.e., where traffic originates and where it terminates), any issues arising from PMTU-D irregularities in IXP networks are trivial by comparison.

It's actually the polar opposite. If you are small, there are no compelling
reasons to put the IXP LAN in your IGP.
If you are large, you may wish to have different IGP metrics to two egress
points on the same peering router. In that case you should at the very least
have an IP ACL on the IXP interface which only allows LAN-to-LAN traffic.
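
One way to read the LAN-to-LAN ACL, sketched in Python. This is an interpretation, not a vendor ACL: it assumes the intent is that packets addressed to the exchange LAN are only permitted when they also come from the exchange LAN, while ordinary transit traffic crossing the interface is left alone. The prefix and function name are hypothetical.

```python
import ipaddress

IXP_LAN = ipaddress.ip_network("192.0.2.0/24")  # hypothetical exchange LAN


def ixp_acl_permits(src, dst, lan=IXP_LAN):
    """Permit traffic to exchange-LAN addresses only from exchange-LAN sources."""
    src_on_lan = ipaddress.ip_address(src) in lan
    dst_on_lan = ipaddress.ip_address(dst) in lan
    # Transit traffic (destination not on the LAN) is not touched by this rule.
    return src_on_lan or not dst_on_lan


print(ixp_acl_permits("192.0.2.1", "192.0.2.10"))   # True: LAN to LAN (e.g. BGP sessions)
print(ixp_acl_permits("10.1.1.1", "192.0.2.10"))    # False: internal host to a peer's LAN address
print(ixp_acl_permits("10.1.1.1", "198.51.100.5"))  # True: ordinary transit traffic
```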