Telia Not Withdrawing v6 Routes

Has anyone else experienced issues where Telia won't withdraw prefixes (though will happily accept an overriding announcement), for at least the past week?

e.g. 2620:6e:a003::/48 was a test prefix and should no longer appear in any DFZ; it hasn't been announced for at least a few days, but it still shows up in Telia's LG and in RIPE RIS as transiting Telia. Telia's LG traceroute, of course, doesn't go anywhere; traces die immediately after a hop or with a !N.
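For anyone who wants to reproduce the RIS check themselves, RIPEstat exposes the collector data publicly. A minimal sketch (Python; the response field names like "rrcs"/"peers" are assumptions based on the current API docs, so treat it as illustrative only):

```python
#!/usr/bin/env python3
"""Quick check of whether a prefix is still visible in RIPE RIS."""
import json
import urllib.request

PREFIX = "2620:6e:a003::/48"   # the stale test prefix in question
URL = f"https://stat.ripe.net/data/looking-glass/data.json?resource={PREFIX}"

with urllib.request.urlopen(URL, timeout=30) as resp:
    data = json.load(resp)

# Walk whatever collectors still report the prefix and print the AS paths seen.
for rrc in data.get("data", {}).get("rrcs", []):
    for peer in rrc.get("peers", []):
        print(rrc.get("rrc"), rrc.get("location"), peer.get("as_path"))
```

If nothing prints, the prefix is gone from RIS; any output shows which collectors (and via which AS paths) still carry it.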

Wouldn't be a problem, except that I needed to withdraw another route due to a separate issue, and it wouldn't budge out of Telia's tables until it was replaced with something else of higher preference.

Matt

This same issue happened in Los Angeles a number of years ago, but for both IPv4 and v6. They need to set up sane BGP timers, and/or advocate the use of BFD for BGP sessions, both customer-facing and internal.
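For a sense of scale: a dead session detected only by BGP hold timers can keep stale routes around for minutes, while BFD detects a dead path in well under a second. A rough back-of-the-envelope in Python (the 180 s hold time is the RFC 4271 suggested default and the BFD numbers are just example values; vendor defaults vary):

```python
# Rough comparison of failure-detection times (illustrative numbers only).
bgp_hold_time_s = 180          # RFC 4271 suggested default; many deployments use 90 or 180
bgp_keepalive_s = bgp_hold_time_s / 3

bfd_tx_interval_ms = 300       # example BFD minimum TX/RX interval
bfd_multiplier = 3             # example detect multiplier
bfd_detect_time_s = bfd_tx_interval_ms * bfd_multiplier / 1000

print(f"BGP hold-timer detection: up to {bgp_hold_time_s} s")
print(f"BFD detection: ~{bfd_detect_time_s} s "
      f"({bfd_tx_interval_ms} ms x {bfd_multiplier})")
```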

Ryan

Probably a ghost route. Such things happen :frowning:

https://labs.ripe.net/Members/romain_fontugne/bgp-zombies

Their (nice) LG shows that it's still advertised from a router of theirs in Frankfurt (iBGP next hop ::ffff:2.255.251.224 – so by the way they use 6PE).
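(Side note: the IPv4-mapped IPv6 next hop is exactly what gives 6PE away; it embeds what is presumably the IPv4 loopback of the egress PE. Python's `ipaddress` module will decode it, as a quick illustration:)

```python
import ipaddress

# The iBGP next hop seen in Telia's looking glass.
nh = ipaddress.ip_address("::ffff:2.255.251.224")

# For an IPv4-mapped address, .ipv4_mapped returns the embedded IPv4 address.
print(nh.ipv4_mapped)   # -> 2.255.251.224
```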

Your best option would probably be to re-advertise the exact same prefix, then re-withdraw it, then yell at Telia's NOC if it fails...
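If you'd rather not touch production routers for the announce/withdraw dance, something like ExaBGP makes it easy to script. A rough sketch of a helper process (the BGP session itself would live in exabgp's own config; the next-hop value below is purely a placeholder):

```python
#!/usr/bin/env python3
"""Re-announce and then re-withdraw a prefix via ExaBGP's process API.

Minimal sketch: meant to be run *by* exabgp as an API process, with the
peering defined in exabgp.conf. Prefix and next hop are placeholders.
"""
import sys
import time

PREFIX = "2620:6e:a003::/48"
NEXT_HOP = "2001:db8::1"        # placeholder next hop

def send(cmd: str) -> None:
    # exabgp reads API commands from this process's stdout.
    sys.stdout.write(cmd + "\n")
    sys.stdout.flush()

send(f"announce route {PREFIX} next-hop {NEXT_HOP}")
time.sleep(120)                  # give it time to propagate
send(f"withdraw route {PREFIX}")

# Keep the process alive so exabgp doesn't tear the session down immediately.
while True:
    time.sleep(60)
```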

Some years ago we experienced something similar (a TI Sparkle router in Asia was still advertising a prefix of ours to their clients, which they had previously been receiving from our former transit GTT – we were advertising it in Europe...).

Yeah, I did try that on the test prefix, but it just stuck around anyway. I don't care too much; it's just some stale test prefix.

Sadly, I now see it again with 2620:6e:a002::/48, which, somewhat more impressively, is now generating a routing loop between Ashburn and NYC. That prefix has always been announced from other places, and was dropped/re-announced as well.
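For what it's worth, spotting the loop is easy enough to script. A throwaway check (Python; assumes a Unix-style `traceroute` binary in $PATH and parses its default output very loosely) that flags any hop address that repeats:

```python
#!/usr/bin/env python3
"""Flag repeated hops in a traceroute as a crude forwarding-loop check."""
import re
import subprocess
import sys

target = sys.argv[1] if len(sys.argv) > 1 else "2620:6e:a002::1"

out = subprocess.run(
    ["traceroute", "-6", "-n", "-m", "30", target],
    capture_output=True, text=True, check=False,
).stdout

seen = {}
for line in out.splitlines():
    m = re.match(r"\s*(\d+)\s+([0-9a-fA-F:.]+)", line)
    if not m:
        continue
    ttl, hop = int(m.group(1)), m.group(2)
    if hop in seen:
        print(f"possible loop: {hop} at TTL {seen[hop]} and again at TTL {ttl}")
    else:
        seen[hop] = ttl
```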

Must just be something with my particular prefixes, oh well.

Matt

Maybe one of the routers on the path doesn't like the large community inside those routes? :slight_smile:
By the way, we currently see 2620:6e:a002::/48 at LINX LON1 from Choopa and HE...
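If anyone wants to sanity-check what's actually attached to the route, RFC 8092 large communities are just three unsigned 32-bit integers. A trivial parser/validator (Python; the example values are made up, not Telia's actual community scheme):

```python
# RFC 8092 large communities are three unsigned 32-bit integers,
# conventionally written "GlobalAdmin:LocalData1:LocalData2".
def parse_large_community(text):
    parts = [int(p) for p in text.split(":")]
    if len(parts) != 3 or any(not 0 <= p <= 0xFFFFFFFF for p in parts):
        raise ValueError(f"not a valid large community: {text!r}")
    return tuple(parts)

# Made-up example values.
print(parse_large_community("1299:100:200"))
```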

Maybe? It's never been an issue before. In this case the route does have a depref community on Telia, which is why one wouldn't expect it via the same path, but the other ghost route in question never had anything similar.

Matt

For those curious, Johan indicated on Twitter this was a JunOS bug.

https://twitter.com/gustawsson/status/1328298914785730561

Matt

I have seen issues like this in a network that I operated. In that particular case, it was an internal IPv4 10/8 route which was withdrawn, along with a few hundred other routes. The withdrawal was initiated on a DC exit router, in a Clos network with leaf, spine, and superspine layers. On the spine layer, I observed that BGP withdrawals, although being received, were not processed by the control plane.

Further investigation, and working with the vendor's TAC, revealed that on that particular platform the BGP process would stop processing withdrawals due to a very nasty race condition that was very difficult to reproduce.

This was the first (and so far only) time in my 20+ years of working with BGP that I've observed such a weird bug. Since I operated the entire network, it was fairly easy to find the culprit. The why took some more time.

If I were in your shoes, I'd ping Telia's NOC to see what's going on. I would not be surprised if they were hitting a similar issue.

Thanks,

Sabri

See my latest response from this morning. Telia's "Head of Network Engineering & Architecture" confirmed on Twitter this was due to a (now-worked-around) bug in JunOS.

https://twitter.com/gustawsson/status/1328298914785730561

Matt

Hi,

> See my latest response from this morning. Telia's "Head of Network Engineering & Architecture" confirmed on Twitter this was due to a (now-worked-around) bug in JunOS.
>
> https://twitter.com/gustawsson/status/1328298914785730561

Interesting. A long time ago, in a galaxy far far away, when I was a JTAC engineer, the policy was that once a PR was hit in the field, it would be marked public.

Also, in the case that I described it wasn't a Junos device. Makes me wonder how bugs like that get introduced. One would expect that after 20+ years of writing BGP code, handling a withdrawal would be easy-peasy.

Thanks,

Sabri

Handling a withdrawal is easy.

Handling one correctly, without race conditions, when you're seeing withdrawals and additions from multiple BGP sessions concurrently, while also maintaining RIB and FIB consistency and keeping customer packets forwarding, is a little bit harder.

Surely they can just put them in an array.

:wink:
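To labour the joke a bit: even a toy model needs per-peer Adj-RIB-In state plus a best-path re-run on every withdrawal, or you end up advertising exactly the kind of ghost route this thread started with. A deliberately naive sketch (Python, single-threaded, no policy, purely illustrative):

```python
# Toy per-peer RIBs and best-path selection, just to show that a
# withdrawal must trigger re-selection of the best path.
from dataclasses import dataclass

@dataclass
class Path:
    peer: str
    local_pref: int
    as_path_len: int

class ToyRib:
    def __init__(self):
        self.adj_rib_in = {}   # (peer, prefix) -> Path
        self.loc_rib = {}      # prefix -> best Path

    def update(self, prefix, path):
        self.adj_rib_in[(path.peer, prefix)] = path
        self._select(prefix)

    def withdraw(self, peer, prefix):
        # If this removal is lost or never processed (the race described
        # above), loc_rib keeps advertising a ghost route forever.
        self.adj_rib_in.pop((peer, prefix), None)
        self._select(prefix)

    def _select(self, prefix):
        candidates = [p for (peer, pfx), p in self.adj_rib_in.items() if pfx == prefix]
        if candidates:
            # Prefer higher local-pref, then shorter AS path.
            self.loc_rib[prefix] = max(candidates,
                                       key=lambda p: (p.local_pref, -p.as_path_len))
        else:
            self.loc_rib.pop(prefix, None)

rib = ToyRib()
rib.update("2620:6e:a003::/48", Path("peerA", 100, 3))
rib.withdraw("peerA", "2620:6e:a003::/48")
print(rib.loc_rib)   # {} -- but only if the withdraw was actually processed
```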

New code, new features, new problems. E.g. public PR1323306 describes a BGP stuck situation. (And the fixed code should also address a - hidden - PR which causes down/stale sessions, leading to stuck routes even without a both-side GRES event.) All very, very special cases... but some of us will find / get hit by them (unfortunately).

Markus

Hey Sabri,

> Also, in the case that I described it wasn't a Junos device. Makes me wonder how bugs like that get introduced. One would expect that after 20+ years of writing BGP code, handling a withdrawal would be easy-peasy.

I don't think this is related to skill, or that there was some hard programming problem that DE couldn't solve. These are honest mistakes.
In my tenure I've not seen the frequency of these bugs change at all; NOS bugs are as common now as they were in the '90s.

I put most of the blame on the market: we've modelled the commercial router market so that a poor-quality NOS is good for business and a good-quality NOS is bad for business. I don't think this is in anyone's formal business plan, or that companies even realise they are not really trying to make a good NOS; I think it's emergent behaviour due to the market, and people follow that market demand unknowingly.
If we suddenly had one commercial NOS which was 100% bug free, many of its customers would stop buying support and would rely on spare HW and Internet forums for configuration help. A lot of us only need contracts to deal with the novel bugs all of us find on a regular basis, so a good NOS would immediately reduce revenue. For some reason Windows, macOS or Linux almost never have novel bugs that the end user finds, and when one is found, it's big news. Meanwhile we don't go a month without hitting a novel bug in one of our NOSes, and no one cares about it; it's business as usual.

I also put a lot of blame on C. It was a terrific language when compiling had to be fast; basically a macro assembler. Now the utility of being 'close to HW' is gone, as the CPU does so much the C compiler has no control over that it's not really even executing the code as written anymore. MSFT estimated >70% of their bugs are related to memory safety. We could accomplish significant improvements in software quality if we ditched C, allowed the computer to do more formal correctness checks at compile time, and designed languages which lend themselves to this.

We constantly misattribute problems (like in this post) to config or HW, while the most common reasons for outages are pilot error and SW defects, and very little engineering time is spent on those. And often the time spent improving the first two increases the risk of the latter two, reducing mean availability over time.

Not to mention that many of us would not need to be around to babysit all this dodgy software.

Definitely bad for business :-).

Mark.

On Behalf Of Mark Tinka
Sent: Tuesday, November 17, 2020 4:32 PM

> Not to mention that many of us would not need to be around to babysit all this dodgy software.
>
> Definitely bad for business :-).

We're already being obsoleted by "self-driving networks"; there's no limit to what one can automate...
But then one needs someone to babysit all the automation systems.

adam

Saku Ytti
Sent: Tuesday, November 17, 2020 6:55 AM

> We constantly misattribute problems (like in this post) to config or HW, while the most common reasons for outages are pilot error and SW defects, and very little engineering time is spent on those. And often the time spent improving the first two increases the risk of the latter two, reducing mean availability over time.

I agree with everything but the last statement.

From my experience, most SPs spend considerable time testing for SW defects on the features (and combinations of features) that will be used, at the intended scale; that's how you identify most of the bugs. What you're left with afterwards are special packets of death or some slow memory leaks (basically the more exotic stuff).

adam

> I also put a lot of blame on C. It was a terrific language when compiling had to be fast; basically a macro assembler. Now the utility of being 'close to HW' is gone, as the CPU does so much the C compiler has no control over that it's not really even executing the code as written anymore. MSFT estimated >70% of their bugs are related to memory safety. We could accomplish significant improvements in software quality if we ditched C, allowed the computer to do more formal correctness checks at compile time, and designed languages which lend themselves to this.

Agree 1000%. I think this is greatly compounded by current generations of programmers who come out of school without having had much experience with low level memory management, having mostly worked in more modern languages that handle such things in a much better way. Moving from college Python to mature C code with a hellscape of pointers must be a pretty jarring transition. :slight_smile:

> From my experience, most SPs spend considerable time testing for SW defects on the features (and combinations of features) that will be used, at the intended scale,

I'm not so sure about that, actually.

I'd say there are some ISPs that spend some (or a considerable) amount of time testing for software defects.

My anecdotal experience is that most ISPs have neither the time, tools nor resources to do significant testing of software. More like, "is the version anything after R1, has it been around long enough, has it been recommended by TAC, are the -nsp lists raving about it, is it a maintenance release, is the caveat list too long, does my vendor SE approve", type-thing.

> that's how you identify most of the bugs. What you're left with afterwards are special packets of death or some slow memory leaks (basically the more exotic stuff).

Which the majority of ISPs likely will never test for.

Mark.