Has anyone put in place a method to identify if one of their BGP peers suddenly withdraws X% of their prefixes?
e.g. I should expect ~420k prefixes in a "complete"[1] routing table from a transit peer today. If suddenly I'm only getting 390k prefixes, I'd guess a major network was depeered or similar.
If so, how are people doing this? SNMP MIB, screen scrape?
And there are about 10-20 emails per day, even when looking only at rather
'coarse' changes.
But to be honest, I almost never peek at the folder where I get these. I'm
probably moving the output to an IRC channel, as I've found it a superior way to
keep track of alarms compared to emails for my workflow.
Well, if I had to do it, I think I'd just point munin at the router, yes,
using SNMP, and put the prefix count graph up on the Big Wall, as a filled
curve. That thing jumps around, someone will likely notice.
This assumes a staffed NOC, of course, but it still gives you something to
look at historically if you note a problem in an unstaffed situation as well.
I have cacti graph the number of prefixes announced and withdrawn from a BGP peer on each BGP router.
+1
Note that not all router OSs support fetching data like that via SNMP.
We use a custom-built thing internally that does this too, which we then tack an alert threshold onto. So if a downstream peer sends us less than that, we get an alert. Handy for those times when they call and ask us what we did to their network.
Prior to that, we had a script which would log in, munge the 'show ip bgp summary' table output, figure out the deltas and graph or report as needed on a particularly troublesome peer.
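For anyone wanting to try the screen-scrape approach, here is a minimal sketch of the "munge the summary table and figure out the deltas" step. It is not the script described above. The sample text, the column layout, the helper names (parse_prefix_counts, percent_drop, check_peers) and the 5% default are all assumptions for illustration; real 'show ip bgp summary' output varies by platform and software version, and fetching the text from the router is left out.

```python
import re

# Illustrative sample only; real output varies by platform and version.
SAMPLE = """\
Neighbor        V    AS MsgRcvd MsgSent   TblVer  InQ OutQ Up/Down  State/PfxRcd
192.0.2.1       4 64500 9182736   45012  8812345    0    0 3w2d         390112
192.0.2.9       4 64501      12      15        0    0    0 00:03:10 Active
"""

def parse_prefix_counts(summary_text):
    """Return {neighbor: prefixes_received} for established sessions."""
    counts = {}
    for line in summary_text.splitlines():
        fields = line.split()
        # Assumed layout: neighbor rows have 10 columns and start with an IPv4 address.
        if len(fields) != 10 or not re.match(r"^\d+\.\d+\.\d+\.\d+$", fields[0]):
            continue
        # The last column is a number only when the session is established.
        if fields[-1].isdigit():
            counts[fields[0]] = int(fields[-1])
    return counts

def percent_drop(baseline, current):
    """Percentage by which current falls below baseline (0 if not lower)."""
    if baseline <= 0 or current >= baseline:
        return 0.0
    return 100.0 * (baseline - current) / baseline

def check_peers(summary_text, baselines, max_drop_pct=5.0):
    """Return (neighbor, baseline, current, drop_pct) for peers needing an alert."""
    counts = parse_prefix_counts(summary_text)
    alerts = []
    for neighbor, baseline in baselines.items():
        current = counts.get(neighbor, 0)  # a missing or down session counts as 0
        drop = percent_drop(baseline, current)
        if drop >= max_drop_pct:
            alerts.append((neighbor, baseline, current, round(drop, 1)))
    return alerts
```

The baselines dict is whatever you consider "complete" per peer, e.g. check_peers(SAMPLE, {"192.0.2.1": 420000}). A peer that is not established is treated as having zero prefixes, so it trips the same alert.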
From: ML [mailto:ml@kenweb.org]
Sent: Tuesday, October 02, 2012 11:43 PM
To: North American Networking and Offtopic Gripes List
Subject: Internet routing table "completeness" monitoring?
Has anyone put in place a method to identify if one of their BGP peers suddenly withdraws X% of their prefixes?
e.g. I should expect ~420k prefixes in a "complete"[1] routing table from a transit peer today. If suddenly I'm only getting 390k prefixes, I'd guess a major network was depeered or similar.
If so, how are people doing this? SNMP MIB, screen scrape?
is a threshold helpful here? (well, it's helpful to a point at least)
what if your neighbour starts deaggregating (or sending you their
internal deaggregates) in place of 50k real routes? no alarm, no
'change' from a numbers perspective, but certainly a traffic shift and
reachability change
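One way to catch the case described above, where the count stays flat but reachability changes, is to track how much address space the received prefixes cover rather than how many prefixes there are. A rough sketch using Python's standard ipaddress module follows; the function name covered_addresses and the example prefixes (documentation ranges) are made up for illustration, and getting the prefix list off the router is assumed to happen elsewhere.

```python
import ipaddress

def covered_addresses(prefixes):
    """Count distinct addresses covered by a list of prefix strings.

    All prefixes must be the same IP version; collapse_addresses
    raises TypeError on a mix of IPv4 and IPv6.
    """
    networks = [ipaddress.ip_network(p) for p in prefixes]
    # collapse_addresses merges overlapping and adjacent networks, so
    # more-specifics inside an aggregate are not counted twice.
    return sum(n.num_addresses for n in ipaddress.collapse_addresses(networks))

# Same number of prefixes before and after, very different coverage.
before = ["192.0.2.0/24", "198.51.100.0/24", "203.0.113.0/24"]
after = ["198.51.100.0/24", "198.51.100.0/25", "198.51.100.128/25"]

print(len(before), covered_addresses(before))  # 3 768
print(len(after), covered_addresses(after))    # 3 256
```

Coverage alone won't tell you which destinations went away, but a flat prefix count with a big drop in covered space is a hint that real routes were swapped for deaggregates.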
Isn't a speed-of-change threshold also interesting here?
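A speed-of-change check could look something like the sketch below: instead of comparing against a fixed floor, compare consecutive samples and flag any step where the count moves faster than some limit. The function name rate_alerts, the sample format and the 1000 prefixes/minute limit are assumptions for illustration only.

```python
def rate_alerts(samples, max_change_per_min):
    """Flag steps where the prefix count changes too quickly.

    samples: list of (unix_time, prefix_count), oldest first.
    Returns (time, change_per_min) for each step exceeding the limit.
    """
    alerts = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        minutes = (t1 - t0) / 60.0
        if minutes <= 0:
            continue  # skip duplicate or out-of-order samples
        rate = (c1 - c0) / minutes
        if abs(rate) > max_change_per_min:
            alerts.append((t1, rate))
    return alerts

# Five-minute polls; the second step loses 29,900 prefixes.
samples = [(0, 420000), (300, 419900), (600, 390000), (900, 390050)]
print(rate_alerts(samples, max_change_per_min=1000))  # [(600, -5980.0)]
```

Using abs() means a sudden jump upward (say, a leak of more-specifics) trips it as well as a sudden withdrawal.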
[1] Varying levels of completeness apply.
wfms
So, there is something called the BGP Monitoring Protocol (BMP):
is a threshold helpful here? (well, it's helpful to a point at least)
what if your neighbour starts deaggregating (or sending you their
internal deaggregates) in place of 50k real routes? no alarm, no
'change' from a numbers perspective, but certainly a traffic shift and
reachability change
As long as you have some control over the number of polling intervals between the detection of a noteworthy change and sending an alarm. Otherwise, there is a real danger of your NOC having to investigate a lot of noisy alerts. If that persists for too long, the NOC will grow tired of responding to these alerts, and send them all to the bit bucket, or implement their own polling thresholds that meet their needs more effectively.
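To make the "number of polling intervals" knob concrete, here is a small sketch of an alarm that only fires after the count has been below the threshold for several consecutive polls, and stays quiet until the condition clears. The class name DebouncedAlarm and the default of 3 polls are assumptions for illustration, not anything a particular NMS provides.

```python
class DebouncedAlarm:
    """Fire only after N consecutive polls below the threshold."""

    def __init__(self, threshold, polls_required=3):
        self.threshold = threshold
        self.polls_required = polls_required
        self.breaches = 0    # consecutive polls below threshold
        self.active = False  # True once the alarm has fired

    def update(self, prefix_count):
        """Feed one poll result; return True only when the alarm first fires."""
        if prefix_count < self.threshold:
            self.breaches += 1
        else:
            # Condition cleared: reset so a later drop can fire again.
            self.breaches = 0
            self.active = False
        if self.breaches >= self.polls_required and not self.active:
            self.active = True
            return True
        return False

alarm = DebouncedAlarm(threshold=400000, polls_required=3)
polls = [420000, 390000, 420000, 390000, 390000, 390000, 390000, 420000]
print([alarm.update(c) for c in polls])
# [False, False, False, False, False, True, False, False]
```

The single-poll dip is ignored, the sustained drop raises one alert on the third consecutive low poll, and no repeat alerts are sent while it persists.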
If a network that you have no business relationship with, and that is several AS hops away from you, goes away, how much effort do you want to expend investigating that? That probably depends on your customers. If you see a few hundred routes disappear and determine them to be for an ISP on the other side of the planet, that's one thing. If your view of something like Google or Facebook suddenly disappears, that could be another thing entirely.
Isn't a speed-of-change threshold also interesting here?