Hi, all. BFD is well known for what it brings to the table for improving link failure detection; however, even at a reasonably athletic 300ms Control rate, you’re not going to catch a significant percentage of brownout situations where you have packet loss but not a full outage. I’m trying to:
BFD is binary. Service OAM (IEEE 802.1ag / ITU-T Y.1731) generates time-series data that speaks to service reliability and SLAs. OAM offers interface shutdown and fault propagation as well, which means it's both an observability tool and an operational one. BFD is simply not the tool for measuring the reliability of network services.
I agree with what Jason wrote, that this is not what BFD was designed for.
In SONET/SDH, and even WAN-PHY, you could declare the interface down if BER went beyond the threshold you consider acceptable. For more modern interfaces your best bet is RS-FEC and the pre-FEC error rate as a predictor, possibly as a multi-metric decision that also includes DDM data and projections.
To my knowledge vendors currently don't have software support to assert RFI based on pre-FEC counters; in fact, last time I looked you couldn't even SNMP GET the FEC counters, for which I opened Enhancement Requests with vendors. So today you'd need to do this with screenscraping and manually downing the interface, which is a much bigger hammer than RFI assertion.
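As a rough illustration of what that screenscraping ends up looking like, here is a minimal Python sketch that pulls a pre-FEC BER figure out of scraped CLI text and compares it to a threshold. The regex, the sample output line, and the 1e-5 threshold are illustrative assumptions only; real output formats and acceptable BER floors are platform-specific.

import re

# Assumed "link is degrading" threshold; pick per platform and optics.
PRE_FEC_BER_THRESHOLD = 1e-5

def pre_fec_ber_from_cli(cli_output: str) -> float | None:
    """Pull a 'Pre-FEC BER: 3.1e-04'-style figure out of scraped CLI text."""
    match = re.search(r"Pre-FEC BER\s*:\s*([0-9.eE+-]+)", cli_output)
    return float(match.group(1)) if match else None

def should_shut(cli_output: str) -> bool:
    """Decide whether the (much bigger) hammer of interface-down is warranted."""
    ber = pre_fec_ber_from_cli(cli_output)
    return ber is not None and ber > PRE_FEC_BER_THRESHOLD

# Made-up output line, purely for illustration:
sample = "Interface et-0/0/0  Pre-FEC BER: 3.1e-04  Post-FEC errors: 0"
print(should_shut(sample))   # True -> operator or automation downs the port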
* David Zimmerman
Hi, all. BFD is well known for what it brings to the table for improving link failure detection; however, even at a reasonably athletic 300ms Control rate, you're not going to catch a significant percentage of brownout situations where you have packet loss but not a full outage. I'm trying to:
1. find any formal or semi-formal writing about quantification of
BFD's effectiveness. For example, my mental picture is a 3D graph
where, for a given Control rate and corresponding Detection Time,
the X axis is percentage of packet loss, the Y axis is the
Control/Detection timer tuple, and the Z axis is the likelihood
that BFD will fully engage (i.e., missing all three Control
packets). Beyond what I believe is a visualization complexity
needing some single malt scotch nearby, letting even a single
Control packet through resets your Detection timer (a rough probability sketch follows this message).
2. ask if folks in the Real World use BFD towards this end, or have
other mechanisms as a data plane loss instrumentation vehicle.
For example, in my wanderings, I've found an environment that
offloads the diagnostic load to adjacent compute nodes, but they
reach out to orchestration to trigger further router actions in a
full-circle cycle measured in /minutes/. Short of that, really
aggressive timers (solving through brute force) on BFD quickly hit
platform limits for scale unless perhaps you can offboard the BFD
to something inline (e.g. the Ciena 5170 can be dialed down to a
3.3ms Control timer).
Any thoughts appreciated. I'm also pursuing ways of having my internal "customer" signal me upon their own packet loss observation (e.g. 1% loss for most folks is a TCP retransmission, but 1% loss for them is crying eyeballs and an escalation).
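To put rough numbers on the 3D picture in item 1: if you assume independent per-packet loss at rate p and the default detect multiplier of 3, "BFD fully engages" means at least one run of three consecutive lost Control packets somewhere in the observation window. A minimal Python sketch of that calculation follows; the one-minute brownout window and the i.i.d. loss model are simplifying assumptions, not anything from the thread.

def p_bfd_engages(loss_rate: float, detect_mult: int, window_packets: int) -> float:
    """P(at least one run of detect_mult consecutive Control packet losses
    among window_packets transmissions), assuming independent loss."""
    # state[i] = probability of currently sitting at i consecutive losses
    # without ever having reached detect_mult (i.e. BFD has not engaged yet)
    state = [1.0] + [0.0] * (detect_mult - 1)
    engaged = 0.0
    for _ in range(window_packets):
        new_state = [0.0] * detect_mult
        for run_len, prob in enumerate(state):
            # Control packet arrives: the loss run (and Detection timer) resets.
            new_state[0] += prob * (1.0 - loss_rate)
            # Control packet lost: the run grows; hitting detect_mult = detection.
            if run_len + 1 == detect_mult:
                engaged += prob * loss_rate
            else:
                new_state[run_len + 1] += prob * loss_rate
        state = new_state
    return engaged

if __name__ == "__main__":
    window = 60_000 // 300   # one-minute brownout at a 300ms Control rate
    for loss in (0.01, 0.05, 0.10, 0.20, 0.40):
        print(f"{loss:>4.0%} loss -> P(BFD engages) = {p_bfd_engages(loss, 3, window):.4f}")

Under those assumptions a one-minute brownout at 1% loss gives BFD only a fraction-of-a-percent chance of engaging, while 40% loss makes detection essentially certain, which matches the "misses most brownouts" behaviour described above.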
Hi David.
We're simply monitoring the error counters on our interfaces, as brownout packet loss due to an unhealthy link usually appears as receive errors starting to tick up. If these exceed a certain percentage of the total pps on the link, we automatically apply the BGP Graceful Shutdown community to the BGP sessions running on that link, so that it is automatically drained of production traffic (assuming that healthier paths remain that the traffic can potentially move to).
This obviously works best if you control and monitor both ends of the link, and ensure that you always have enough bandwidth on the redundant path to handle the full load.
Tore
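A minimal sketch of the drain logic Tore describes, under stated assumptions: the counter polling and the config push (tagging the link's BGP sessions with the GRACEFUL_SHUTDOWN community, 65535:0, RFC 8326) are supplied as callbacks by whatever SNMP/gNMI and provisioning tooling is already in place, and the 0.1% error ratio and 30-second poll interval are placeholder values, not Tore's actual numbers.

import time
from typing import Callable, Dict

ERROR_RATIO_THRESHOLD = 0.001   # drain if >0.1% of received frames are errored
POLL_INTERVAL_S = 30

def error_ratio(prev: Dict[str, int], curr: Dict[str, int]) -> float:
    """Errored fraction of frames received between two counter snapshots."""
    d_err = curr["in_errors"] - prev["in_errors"]
    d_pkts = curr["in_packets"] - prev["in_packets"]
    return d_err / d_pkts if d_pkts > 0 else 0.0

def watch_link(poll_counters: Callable[[], Dict[str, int]],
               drain: Callable[[], None]) -> None:
    """Poll a link's receive counters and drain it when errors cross the threshold."""
    prev = poll_counters()
    while True:
        time.sleep(POLL_INTERVAL_S)
        curr = poll_counters()
        if error_ratio(prev, curr) > ERROR_RATIO_THRESHOLD:
            drain()   # e.g. add GRACEFUL_SHUTDOWN (65535:0) to the link's sessions
            return
        prev = curr

This only moves traffic off the link; as noted, it presumes both ends are monitored and the redundant path can absorb the full load.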
Hi, all. BFD is well known for what it brings to the table for improving link failure detection; however, even at a reasonably athletic 300ms Control rate, you’re not going to catch a significant percentage of brownout situations where you have packet loss but not a full outage. I’m trying to:
BFD doesn't improve link failure detection. It's the exact opposite: it's there to detect reachability failure faster than the protocols themselves would in those cases where link failure does NOT occur (a link failure would otherwise accomplish the same thing).
Beyond that, I agree with Jason and Saku that BFD is not the correct tool for what you're trying to achieve anyway. Aside from monitoring interface counters, there are software options out there to detect loss percentage on explicit paths that would suit your need much better.
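For illustration, the active probing those tools perform boils down to something like the sketch below: send sequenced probes along the path of interest and count what comes back. The reflector address is a placeholder for a host that echoes UDP datagrams back, and real products also timestamp the probes for delay and jitter.

import socket
import time

REFLECTOR = ("198.51.100.7", 8888)   # placeholder address of a UDP echo responder
PROBES = 100
TIMEOUT_S = 0.2

def measure_loss() -> float:
    """Send sequenced UDP probes and return the observed loss fraction."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(TIMEOUT_S)
    received = 0
    for seq in range(PROBES):
        sock.sendto(seq.to_bytes(4, "big"), REFLECTOR)
        try:
            data, _ = sock.recvfrom(64)
            if int.from_bytes(data[:4], "big") == seq:
                received += 1
        except socket.timeout:
            pass
        time.sleep(0.05)   # pace the probes
    return 1.0 - received / PROBES

if __name__ == "__main__":
    print(f"observed loss: {measure_loss():.1%}")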
it’s there to detect reachability failure faster than protocols themselves would do so
Exactly this - we have some type 2 fiber transit circuits which are presumably connected to some sort of re-encoder or something, as we have had a few scenarios where the router at the remote end died but we maintained light and Ethernet. BFD helps us greatly here when we don't lose light or link.
I haven't yet experienced it in a brownout condition, though, say where I'm facing 40% packet loss broadly to all ISPs; generally everything of that nature for us has been much further in the core and limited to specific routes or peers (a bad LAG member on the peering interconnect, or on the backbone between our upstream and whoever on the DFZ; AS7018 seems notorious for leaving peering links in a shitty state).
Thanks for the feedback, Jason, Saku, Tore, Tom, and Alex. Agreed that trying to effectively brute force (mis)use of BFD as I described is misdirected. To some degree I’m trying to reinforce a “why this doesn’t work” argument internally as part of a larger narrative.
Thanks specifically to Jason for reminding me about 802.1ag and Y.1731 — OAM is where I’ll spend some time digging if there’s any benefit. I’m pretty ignorant of that (pretty big) space.
Towards Saku’s, Tore’s, and Tom’s comments about watching error counters, I’ll keep that in mind, though I expect I’ll want to cover situations where frames are simply lost rather than errored. For example (tapping into Alex’s point) on an L2VPN circuit with carrier underlay congestion where the last-mile circuits are otherwise clean.
-dp
Explain how this could happen?
If we are thinking of a scenario where the far end didn't even send the frame because the link is full, then we will of course see the link being full (if we adjust SNMP stats to L1 speed, we will know whether it is full or not).
If we are thinking of a situation where the far end didn't even send it and the link isn't full, we are getting into the weeds and we probably shouldn't try to optimise for that, as trying to solve it systematically may cause more problems than addressing each case individually.
If the far end did send a frame but the frame didn't arrive, and we're talking about a point-to-point link, there is no reason to optimise for cases where this isn't visible.
If the far end did send a frame but the frame didn't arrive, and we're talking about OEO transport devices between us, it may not be a sufficiently common scenario to optimise for. There is a specific exception here, and it is RFI assertion: OEO transport may lose the RFI assertion instead of tunneling it, making interface-down detection slow, and it is impossible to prove it works now (without bringing the service down), even if it worked during provisioning of the service. But this should be well covered by BFD/OAM.
Saku speaks from the privileged position of an infrastructure owner. That view assumes interface connectivity is provided by L1 links, with OEO in transit nodes as the worst case.
But the clever, budget-conscious among us have deployed router links over provided MPLS-based L2 services as critical infrastructure. We have an invisible WAN. In the absence of L1 PM statistics, how do we validate service over other networks? 802.1ag and Y.1731 attempt to answer that question.
Of course the real answer is buy a wire.
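For a concrete sense of what those standards actually measure, Y.1731 single-ended loss measurement (ETH-LM) derives far-end and near-end frame loss from the frame counters carried in consecutive LMM/LMR exchanges. A minimal sketch of that arithmetic follows; the counter names track the spec, while the surrounding data handling and the example numbers are illustrative only.

from dataclasses import dataclass

@dataclass
class LmSample:
    tx_fcf: int   # TxFCf: frames we transmitted towards the peer (carried in LMM)
    rx_fcf: int   # RxFCf: frames the peer received from us (returned in LMR)
    tx_fcb: int   # TxFCb: frames the peer transmitted towards us (returned in LMR)
    rx_fcl: int   # RxFCl: frames we received from the peer (read locally at LMR rx)

def frame_loss(prev: LmSample, curr: LmSample) -> tuple[int, int]:
    """Return (far_end_loss, near_end_loss) between two consecutive LM samples."""
    far_end = abs(curr.tx_fcf - prev.tx_fcf) - abs(curr.rx_fcf - prev.rx_fcf)
    near_end = abs(curr.tx_fcb - prev.tx_fcb) - abs(curr.rx_fcl - prev.rx_fcl)
    return far_end, near_end

# Illustrative numbers: 1000 frames sent each way, 3 lost towards the peer, 1 lost back.
prev = LmSample(tx_fcf=10_000, rx_fcf=10_000, tx_fcb=20_000, rx_fcl=20_000)
curr = LmSample(tx_fcf=11_000, rx_fcf=10_997, tx_fcb=21_000, rx_fcl=20_999)
print(frame_loss(prev, curr))   # (3, 1)

Dividing those counts by the transmitted-frame deltas gives the frame loss ratio that feeds SLA reporting, which is the kind of time-series data Jason pointed at earlier in the thread.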
Fair, yes, OAM is likely the least bad solution here. Or vendor-proprietary solutions like Cisco IP SLA or Juniper RPM.