I would like to enquire about the pros and cons of running a full internet
routing table in a VRF, and the potential challenges of operating it in a VPN
across a large network that does peering and provides transit.
I am not a fan of running it in a VRF.
I am looking for a list of operational and technical challenges
specifically around
1) control plane (route reflectors )
2) forward plane (recursive lookup issues)
3) Operational
4) DDOS
5) BCPs and RFCs that would break, e.g. "BGP-SEC in today's draft does not
support checking prefixes within the VPN"
6) Vendor specifics
We decided against deploying our internet routes via vpnvX. Two major
holdups for us were:
Each route inside a vpnv4 table will consume more CAM (96 bits versus 32,
because the 64-bit RD is prepended to the IPv4 prefix), which adds up when
taking full routes.
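To put the 96-versus-32 figure in perspective, here is a rough calculation. It takes the figure above at face value, i.e. it assumes the hardware really stores the full VPNv4 key (64-bit RD plus 32-bit prefix), which will not be true of every platform. The table size `ASSUMED_ROUTES` and the helper `key_bits` are illustrative assumptions, not vendor numbers.

```python
# Back-of-the-envelope key-width comparison; every input here is an assumption.
RD_BITS = 64      # route distinguisher prepended to each VPNv4 prefix
IPV4_BITS = 32    # plain IPv4 prefix key


def key_bits(routes: int, in_vrf: bool) -> int:
    """Total lookup-key bits needed for `routes` prefixes."""
    per_route = IPV4_BITS + (RD_BITS if in_vrf else 0)
    return routes * per_route


ASSUMED_ROUTES = 500_000  # hypothetical full-table size, not from the thread

global_bits = key_bits(ASSUMED_ROUTES, in_vrf=False)
vpn_bits = key_bits(ASSUMED_ROUTES, in_vrf=True)

print(f"global table: {global_bits / 8 / 2**20:.1f} MiB of key space")
print(f"vpnv4 table:  {vpn_bits / 8 / 2**20:.1f} MiB of key space")
print(f"ratio:        {vpn_bits // global_bits}x")
```

This only counts key width. Real consumption also depends on how the platform carves its tables and whether it keys on a VRF ID rather than the RD.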
Brocade XMR does not support distributing routes via vpnv6, or it did not
when we were designing our MPLS network.
One of the benefits of distributing internet routes inside a VRF is that it
logically separates those routes from your IGP routing tables (your P
routers don't see internet routes). Keeping internet routes inside your
default VRF may lead to unexpectedly leaking IGP routes out to your BGP
sessions, so BGP filters are important, as is using unique (RIR-assigned)
addresses inside your MPLS mesh.
I've done this on multiple vendor platforms, including full routes, and
haven't had any issues. Resource consumption varies by vendor and
implementation, but I've observed that it's not as punitive as I thought it
would be, due to various optimizations. Granted, in most of my cases, it
was in a VRF, but I was not running MPLS.
1) control plane (route reflectors )
- you can either run a separate control plane infrastructure for the inet vrf,
or you can use common RRs, depending on your hardware capabilities (or you
can run a separate BGP process for reflecting the inet vrf).
- no need to worry about the data plane, as VPN routes are not installed into
the FIB on RRs.
- as already mentioned, carrying inet prefixes in the VPN table increases
control-plane demands.
2) forward plane (recursive lookup issues)
- for the inet vrf I'd recommend per-CE/next-hop labels instead of per-prefix
labels, to save label space.
- a per-next-hop label still points directly to the outgoing interface, so no
recursive lookups.
- recursive lookups are only needed with a per-VRF label, but I would not
recommend that, as it could introduce loops when PIC is used in some
scenarios.
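The label-space trade-off between the three allocation modes can be sketched as a simple count. This is a toy model only: the `Route` record, the `labels_needed` helper, and the sample prefixes and CE names are all made up for illustration.

```python
from collections import namedtuple

# Hypothetical route record: a prefix and the CE (next hop) it was learned from.
Route = namedtuple("Route", ["prefix", "ce"])


def labels_needed(routes, mode):
    """Count the VPN labels a PE would allocate for one VRF under each mode."""
    if mode == "per-prefix":
        return len({r.prefix for r in routes})  # one label per prefix
    if mode == "per-ce":
        return len({r.ce for r in routes})      # one label per CE / next hop
    if mode == "per-vrf":
        return 1 if routes else 0               # one label, IP lookup on egress
    raise ValueError(f"unknown mode: {mode}")


# Toy inet VRF: six prefixes learned from two upstream CEs (made-up data).
routes = [
    Route(f"198.51.100.{i * 32}/27", "ce-a" if i % 2 else "ce-b")
    for i in range(6)
]

for mode in ("per-prefix", "per-ce", "per-vrf"):
    print(f"{mode:10s} -> {labels_needed(routes, mode)} label(s)")
```

With a full table behind a handful of upstreams, per-CE stays at a handful of labels while per-prefix grows with the table, which is the saving the recommendation above is after.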
3) Operational
- I find it operationally complex to keep inet table on the P-Core
boxes/vrf-default.
4) DDOS
- as I mentioned you can run a separate infrastructure for inet vrf i.e.
dedicated box or SDR for inet PEs and inet RRs.
- or you can use separate BGP processes so in case some university decides
to test some special attribute on their BGP advertisements it will not
reload your VPN BGP process.
- or you can deploy enhanced BGP error handling on the edges and hope for
the best (actually, this is what should be implemented first).
Internet in a vrf is doable on most platforms and definitely adds a lot of flexibility.
1) control plane (route reflectors )
This is really dependent on your platform and whether you are using multiple RDs or not. If you divide your transit into regions and filter based on RT, you can tier your route reflectors to get plenty of scalability.
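A minimal sketch of the RT-based tiering idea: a regional reflector keeps only the VPN routes carrying its region's route target, while a top-tier reflector accepts all of them. The `reflect` helper, the prefixes, and the RT values such as `65000:101` are hypothetical and only stand in for whatever your RT plan looks like.

```python
def reflect(routes, accepted_rts):
    """Return the routes a reflector keeps: those carrying at least one accepted RT."""
    return [r for r in routes if r["rts"] & accepted_rts]


# Made-up VPN routes; in this toy plan "65000:101" is east and "65000:102" is west.
routes = [
    {"prefix": "203.0.113.0/24", "rts": {"65000:101"}},
    {"prefix": "198.51.100.0/24", "rts": {"65000:102"}},
    {"prefix": "192.0.2.0/24", "rts": {"65000:101", "65000:102"}},
]

east_rr = reflect(routes, {"65000:101"})              # regional tier
top_rr = reflect(routes, {"65000:101", "65000:102"})  # top tier sees everything

print("east RR holds:", [r["prefix"] for r in east_rr])
print("top RR holds: ", [r["prefix"] for r in top_rr])
```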
2) forward plane (recursive lookup issues)
Most platforms program prefixes with associated labels more slowly, so your base convergence will suffer. In addition, if you want to run PIC you will likely be left with a bit of custom engineering to make it work. VPNs hide the next hop behind the loopback of the PE, so next-hop failure awareness of an edge tie will be lost. If you can stomach the double lookup, you can run per-VRF labels (per-prefix isn't feasible on most platforms), weight up your edge ties, and force a bounce back to another PE; otherwise you will be stuck with BGP control-plane-based convergence with per-CE labels.
3) Operational
It's definitely harder to train operations people on how to look in a VRF.
4) DDOS
It's actually much easier to design a DDoS filtering system if everything is in VRFs. If you create separate VRFs for transit and subscription, you can have extreme flexibility in DDoS filtering. The import/export flexibility allows for injection of /32s or /128s into your transit VRF, and you can simply hang your DDoS mitigation systems between the transit and subscription VRFs.
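The /32 injection works purely through longest-prefix match, which can be sketched as follows. The table contents, the next-hop names (`subscription-vrf`, `scrubber`), and the `lookup` helper are assumptions for illustration; a real deployment would do this with RT import/export policy, not a Python dict.

```python
import ipaddress


def lookup(table, dst):
    """Longest-prefix match: return the next hop of the most specific covering route."""
    addr = ipaddress.ip_address(dst)
    matches = [(net, nh) for net, nh in table.items() if addr in net]
    if not matches:
        return None
    return max(matches, key=lambda m: m[0].prefixlen)[1]


# Transit VRF normally hands customer traffic straight to the subscription VRF.
transit_vrf = {ipaddress.ip_network("192.0.2.0/24"): "subscription-vrf"}
before = lookup(transit_vrf, "192.0.2.10")

# Under attack: inject a /32 for the victim that points at the mitigation system.
transit_vrf[ipaddress.ip_network("192.0.2.10/32")] = "scrubber"
victim = lookup(transit_vrf, "192.0.2.10")
neighbour = lookup(transit_vrf, "192.0.2.11")

print(before, victim, neighbour)
```

Only the victim address is diverted; every other host in the covering prefix keeps its normal path.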
5) BCPs and RFCs that would break, e.g. "BGP-SEC in today's draft does not support checking prefixes within the VPN"
We haven't found any significant functionality we would want to use, other than PIC, that it would break, and there was a workaround for that.
6) Vendor specifics
You are probably ok with most vendors but a few still have issues with table carving, and a few don't support 6VPE.
2) forward plane (recursive lookup issues)
Most platforms program prefixes with associated labels more slowly, so your base convergence will suffer.
Do you have any reference you could share? What level of penalty per prefix
have you observed in each platform tested?
In addition, if you want to run PIC you will likely be left with a bit of custom engineering to make it work. VPNs hide the next hop behind the loopback of the PE, so next-hop failure awareness of an edge tie will be lost. If you can stomach the double lookup, you can run per-VRF labels (per-prefix isn't feasible on most platforms), weight up your edge ties, and force a bounce back to another PE; otherwise you will be stuck with BGP control-plane-based convergence with per-CE labels.
PIC is about converging every prefix at the same time. It makes no
statement about where the next hop points, whether to loop0 (next-hop-self
in INET) or to the edge CE.
If your IGP carries all edge links, and you don't run next-hop-self, the far
end PE can converge faster in the INET scenario. But current efforts are not
aimed at fixing this; they aim to make the local PE do hitless repair when
an arriving frame points to a dead edge interface.
It seems to be very rare to run INET in this way; the majority don't carry
edge links in the IGP and do run next-hop-self.
If you run PIC and hide the next hop information behind a loopback, which is what will happen in a VPN environment, you will lose awareness of the failure of an edge link on a remote PE. The remote PE will continue to send traffic to the PE with the failed link until it has completely converged, both in the control plane and in what is written to the FIB. If the remote PE has PIC running, it can bounce that traffic back to its backup path via another PE. Some percentage of your traffic will then form a transient micro-loop, though, because that remote PE will have its primary path through the failed link (due to shortest AS-path length etc.), will not yet have converged around the failure on the remote PE, and has no awareness of the failure. One possible solution is to guarantee that a PE will never use another PE for a primary transit route. This can be accomplished via metrics such as weight. Again, one of the downsides is that you need to run per-VRF labels so that a local IP lookup can be done on the PE with the failed link and it can execute a local repair when it sees the link drop.
If you run PIC and hide the next hop information behind a loopback, which is what will happen in a VPN environment
A typical SP network has next-hop-self in INET BGP, and does not carry
edge links in the IGP. You don't want to have a lot of prefixes in the IGP.
If the remote PE has PIC running, it can bounce that traffic back to its backup path via another PE.
PIC merely makes sure that the FIB is hierarchical, and it guarantees that
all prefixes sharing a next hop converge at the same time.
Local repair can be done with or without PIC, as it just means you have
local information on how to deliver a frame to an alternate destination
without waiting for convergence.
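The "hierarchical FIB" point can be illustrated with a toy model: in a flat FIB each prefix carries its own copy of the next hop, so a failure costs one rewrite per prefix, while in a hierarchical FIB all prefixes point at one shared object and a single write moves them together. The `PathList` class, the prefix list, and the PE names are invented for this sketch and do not describe any vendor's data structures.

```python
class PathList:
    """Shared forwarding object: a primary next hop plus a pre-installed backup."""

    def __init__(self, primary, backup):
        self.primary = primary
        self.backup = backup
        self.active = primary

    def fail_primary(self):
        # One write, immediately visible to every prefix pointing at this object.
        self.active = self.backup


prefixes = [f"10.{i}.0.0/16" for i in range(200)]

# Flat FIB: every prefix holds its own copy of the next hop.
flat_fib = {p: "pe1" for p in prefixes}

# Hierarchical FIB: every prefix points at the same PathList object.
shared = PathList(primary="pe1", backup="pe2")
hier_fib = {p: shared for p in prefixes}

# Next hop pe1 goes away.
flat_writes = 0
for p in flat_fib:
    if flat_fib[p] == "pe1":
        flat_fib[p] = "pe2"   # per-prefix rewrite, convergence scales with table size
        flat_writes += 1

shared.fail_primary()
hier_writes = 1

print(f"flat FIB rewrites: {flat_writes}, hierarchical FIB rewrites: {hier_writes}")
```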
Some percentage of your traffic will then form a transient micro-loop, though, because that remote PE will have its primary path through the failed link (due to shortest AS-path length etc.)
Only if the egress PE does an IP lookup, which it typically does not do
(per-prefix or per-CE labels, the default config in 7600, JunOS, IOS-XR), as
the egress PE's label adjacency entry has the egress rewrite information.
The faulted edge PE can local-repair and get the frame delivered without
having to wait for BGP to converge for the customer. A transient loop can
occur if both of the edges have faulted.
Unfortunately Cisco made things confusing by naming their "BGP FRR" feature "BGP PIC Edge."
There's some fundamental misunderstanding here.
By default, with the vpnv4 and vpnv6 address families, next-hop-self is set
by the PE.
Local repair and label retention were around many years before PIC came
along.
It worked nicely with eibgp multipath and allowed the primary PE to work
around the failed PE-CE link and send traffic to alternate PE that
advertised the same prefix.
The added value with PIC is that you don't need equal attributes in order
to have an alternate path installed into the FIB.
There are no micro-loops involved on an alternate PE.
During normal operation, a packet arriving on the primary PE would be
label-switched based on the per-prefix or per-CE label via the PE-CE link, as
directed by the L2 rewrite in the FIB.
In case of a local PE-CE link failure, PIC or local repair will just swap
the incoming label for the label advertised by the alternate PE.
Once the alternate PE receives the labeled packet it will just label-switch
it out the PE-CE link.
During normal operation or during failure there is no recursive lookup done,
just label switching.
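The label-switching sequence described above can be sketched as a small simulation. The LFIB layout, the label values 100 and 200, the link names, and the `forward` helper are all hypothetical; the point is only that neither PE performs an IP lookup in either the normal or the repair case.

```python
def forward(pe, label, link_up):
    """Return the action a PE takes for an incoming VPN label (no IP lookup involved)."""
    entry = pe["lfib"][label]
    if link_up[entry["out_link"]]:
        return ("pop-and-send", entry["out_link"])
    if entry.get("backup") is not None:
        backup_pe, backup_label = entry["backup"]
        # Local repair: swap to the label the alternate PE advertised for this path.
        return ("swap-and-tunnel", backup_pe, backup_label)
    return ("drop",)


# Per-CE labels: PE1 advertised 100, PE2 advertised 200 for the same customer.
pe1 = {"lfib": {100: {"out_link": "pe1-ce", "backup": ("pe2", 200)}}}
pe2 = {"lfib": {200: {"out_link": "pe2-ce", "backup": ("pe1", 100)}}}
link_up = {"pe1-ce": True, "pe2-ce": True}

normal = forward(pe1, 100, link_up)        # steady state on the primary PE

link_up["pe1-ce"] = False                  # local PE-CE link fails
repair = forward(pe1, 100, link_up)        # primary PE swaps to PE2's label
final = forward(pe2, repair[2], link_up)   # alternate PE label-switches out its link

print(normal, repair, final, sep="\n")
```

In this toy model, taking down `pe2-ce` as well makes the two PEs hand the packet back and forth, which matches the earlier remark that a transient loop needs both edges to have faulted.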
As Ytti pointed out already, you don't want the PE-CE links to be carried by
the IGP, as you can fast-reroute around their failure and perform a
"local repair" until BGP converges and the ingress PE starts forwarding
traffic to the alternate PE/NH.
The only case when you experience an excessive loss of connectivity is when
the egress PE fails. In that case you need to rely on the speed of IGP
convergence to inform the ingress PE to switch to a preprogrammed backup
path/NH (PIC Core).
There are already some RFCs that propose having the P-core fast-reroute to an
alternate PE in case the primary PE fails - can't wait :).