CloudFlare issues?

Hello, are there any issues with CloudFlare services right now?

Dmitry Sherman
dmitry@interhost.net
Interhost Networks Ltd
Web: http://www.interhost.co.il
fb: https://www.facebook.com/InterhostIL
Office: (+972)-(0)74-7029881 Fax: (+972)-(0)53-7976157

Yes, traffic from Greek networks is routed through NYC (alter.net), and earlier it had 60% packet loss. It is still going via NYC, but with no packet loss now. This is happening at GR-IX Athens, not GR-IX Thessaloniki, but the problem definitely exists.

Antonis

From https://www.cloudflarestatus.com/:

Identified - We have identified a possible route leak impacting some Cloudflare IP ranges and are working with the network involved to resolve this.
Jun 24, 11:36 UTC

Seeing issues in Australia too for some sites that are routing through Cloudflare.



Jaden Roberts
Senior Network Engineer
4 Amy Close, Wyong, NSW 2259
Need assistance? We are here 24/7 +61 2 8115 8888

It seems Verizon has stopped filtering a downstream customer, or filtering broke.

Time to implement peer-locking path filters for those using VZ as a paid peer.

   Network          Next Hop            Metric LocPrf Weight Path
*  2.18.64.0/24     137.39.3.55                         0 701 396531 33154 174 6057 i
*  2.19.251.0/24    137.39.3.55                         0 701 396531 33154 174 6057 i
*  2.22.24.0/23     137.39.3.55                         0 701 396531 33154 174 6057 i
*  2.22.26.0/23     137.39.3.55                         0 701 396531 33154 174 6057 i
*  2.22.28.0/24     137.39.3.55                         0 701 396531 33154 174 6057 i
*  2.24.0.0/16      137.39.3.55                         0 701 396531 33154 3356 12576 i
*                   202.232.0.2                         0 2497 701 396531 33154 3356 12576 i
*  2.24.0.0/13      202.232.0.2                         0 2497 701 396531 33154 3356 12576 i
*  2.25.0.0/16      137.39.3.55                         0 701 396531 33154 3356 12576 i
*                   202.232.0.2                         0 2497 701 396531 33154 3356 12576 i
*  2.26.0.0/16      137.39.3.55                         0 701 396531 33154 3356 12576 i
*                   202.232.0.2                         0 2497 701 396531 33154 3356 12576 i
*  2.27.0.0/16      137.39.3.55                         0 701 396531 33154 3356 12576 i
*                   202.232.0.2                         0 2497 701 396531 33154 3356 12576 i
*  2.28.0.0/16      137.39.3.55                         0 701 396531 33154 3356 12576 i
*                   202.232.0.2                         0 2497 701 396531 33154 3356 12576 i
*  2.29.0.0/16      137.39.3.55                         0 701 396531 33154 3356 12576 i
*                   202.232.0.2                         0 2497 701 396531 33154 3356 12576 i
*  2.30.0.0/16      137.39.3.55                         0 701 396531 33154 3356 12576 i
*                   202.232.0.2                         0 2497 701 396531 33154 3356 12576 i
*  2.31.0.0/16      137.39.3.55                         0 701 396531 33154 3356 12576 i
*                   202.232.0.2                         0 2497 701 396531 33154 3356 12576 i
*  2.56.16.0/22     137.39.3.55                         0 701 396531 33154 1239 9009 i
*  2.56.150.0/24    137.39.3.55                         0 701 396531 33154 1239 9009 i
*  2.57.48.0/22     137.39.3.55                         0 701 396531 33154 174 50782 i
*  2.58.47.0/24     137.39.3.55                         0 701 396531 33154 1239 9009 i
*  2.59.0.0/23      137.39.3.55                         0 701 396531 33154 1239 9009 i
*  2.59.244.0/22    137.39.3.55                         0 701 396531 33154 3356 29119 i
*  2.148.0.0/14     137.39.3.55                         0 701 396531 33154 3356 2119 i
*  3.5.128.0/24     137.39.3.55                         0 701 396531 33154 3356 16509 i
*  3.5.128.0/22     137.39.3.55                         0 701 396531 33154 3356 16509 i
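
For illustration, here is a minimal sketch of the peer-lock idea applied to received AS paths; the locked ASNs and allowed adjacencies below are made-up examples, not anyone's actual policy. The rule: reject a route if a "locked" major network shows up in the path behind a neighbor that should not be carrying it.

    # Minimal sketch of the "peer lock" idea: reject a route if a locked
    # (major / Tier-1) ASN appears in the AS path behind a neighbor that is
    # not expected to carry it.  The ASN sets below are illustrative only.

    LOCKED_ASNS = {3356, 2914, 1239, 174, 6453}            # networks we "lock"
    ALLOWED_UPSTREAMS = {                                   # who may appear directly
        3356: {701, 2914, 1239, 174, 6453, 3356},           # in front of each locked ASN
        2914: {701, 3356, 1239, 174, 6453, 2914},
        1239: {701, 3356, 2914, 174, 6453, 1239},
        174:  {701, 3356, 2914, 1239, 6453, 174},
        6453: {701, 3356, 2914, 1239, 174, 6453},
    }

    def peer_lock_ok(as_path: list) -> bool:
        """Return False if a locked ASN appears behind an unexpected neighbor."""
        for i, asn in enumerate(as_path):
            if asn in LOCKED_ASNS and i > 0:
                neighbor = as_path[i - 1]      # the AS announcing it towards us
                if neighbor not in ALLOWED_UPSTREAMS[asn]:
                    return False
        return True

    # One of the leaked paths from the table above: 3356 sits behind AS33154,
    # which is not an expected adjacency, so the route would be dropped.
    print(peer_lock_ok([701, 396531, 33154, 3356, 12576]))   # False
    # A normal-looking path straight through the locked network is accepted.
    print(peer_lock_ok([701, 3356, 12576]))                   # True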

From John Graham-Cumming, CTO of Cloudflare, on Hacker News right now:

This appears to be a routing problem with Level3. All our systems are running normally but traffic isn’t getting to us for a portion of our domains.

1128 UTC update Looks like we’re dealing with a route leak and we’re talking directly with the leaker and Level3 at the moment.

1131 UTC update Just to be clear this isn’t affecting all our traffic or all our domains or all countries. A portion of traffic isn’t hitting Cloudflare. Looks to be about an aggregate 10% drop in traffic to us.

1134 UTC update We are now certain we are dealing with a route leak.

We are seeing issues as well getting to HE. The traffic is going via Alter.

1147 UTC update Staring at internal graphs, looks like global traffic is now at 97% of expected, so impact is lessening.

A Verizon downstream BGP customer is leaking the full table, and some more specifics from us and many other providers.

1204 UTC update This leak is more widespread than just Cloudflare.

1208 UTC update Amazon Web Services now reporting external networking problem

This is my final update, I’m going back to bed, wake me up when the internet is working again.

https://news.ycombinator.com/item?id=20262316

A Verizon downstream BGP customer is leaking the full table, and some more specifics from us and many other providers.

It appears that one of the implicated ASNs, AS 33154 "DQE Communications LLC", is listed as a customer on Noction's website.

I suspect AS 33154's customer AS 396531 turned up a new circuit with
Verizon, but didn't have routing policies to prevent sending routes from
33154 to 701 and vice versa, or their router didn't have support for RFC
8212.
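
To make that concrete, here is a minimal sketch of the export decision a dual-homed edge network like AS396531 should be making, including the RFC 8212 default-deny behaviour; the relationship labels and prefixes are illustrative assumptions, not the actual configuration of any of the networks involved.

    # Minimal sketch of the export decision a dual-homed edge network should
    # make towards a transit provider.  Values here are illustrative only.

    OWN_AND_CUSTOMER_ROUTES = {"192.0.2.0/24"}     # routes we (or our customers) originate

    def export_to_provider(prefix: str, learned_from: str, has_explicit_policy: bool) -> bool:
        """Decide whether a route may be announced to a transit provider."""
        if not has_explicit_policy:
            # RFC 8212 behaviour: with no explicit policy configured,
            # an EBGP speaker advertises nothing at all.
            return False
        if learned_from == "provider":
            # Never re-export one provider's routes to another provider --
            # doing so is exactly the leak seen here (33154 -> 396531 -> 701).
            return False
        return prefix in OWN_AND_CUSTOMER_ROUTES

    # Routes learned from provider AS33154 must not be sent to provider AS701:
    print(export_to_provider("2.24.0.0/13", learned_from="provider",
                             has_explicit_policy=True))     # False
    # The network's own prefix is fine to announce upstream:
    print(export_to_provider("192.0.2.0/24", learned_from="self",
                             has_explicit_policy=True))     # True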

This is what looks to have happened:

There was a large-scale BGP leak incident causing about 20k prefixes for 2,400 networks (ASNs) to be rerouted through AS396531 (a steel plant) and then on to its transit provider, Verizon (AS701).

Start time: 10:34:21 (UTC)
End time: 12:37 (UTC)

All AS paths had the following in common:
701 396531 33154

33154 (DQECOM) is an ISP providing transit to 396531.
396531 is, by the looks of it, a steel plant, dual-homed to 701 and 33154.
701 is Verizon, which by the looks of it accepted all BGP announcements from 396531.

What appears to have happened is that routes from 33154 were propagated to 396531, which then sent them to Verizon, and voila… there is the full leak at work. (DQECOM runs a BGP optimizer; thanks Job for pointing that out, more below.)

As a result, traffic for 20k or so prefixes was rerouted through Verizon and 396531 (the steel plant). We've seen numerous incidents like this in the past.

Lessons learned:
1) If you do use a BGP optimizer, please FILTER!
2) Verizon… filter your customers, please!

Since the BGP optimizer introduces new more-specific routes, a lot of traffic for high-traffic destinations would have been rerouted through that path, which would have been congested, causing the outages.

There were many Cloudflare prefixes affected, but also folks like Amazon, Akamai, Facebook, Apple, Linode, etc. Here's one example for Amazon - CloudFront: 52.84.32.0/22. Normally announced as 52.84.32.0/21, but during the incident as a /22 (remember, more specifics always win).

RPKI would have worked here (assuming you're strict with the max length)!

Cheers
Andree
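
To illustrate Andree's RPKI point, here is a minimal sketch of route-origin validation with a strict maxLength; the ROA contents and origin ASN are assumed for the example rather than taken from the real registered objects.

    # Minimal sketch of RPKI route-origin validation with a strict maxLength.
    # The ROA below is an assumption for illustration, not the real object.
    import ipaddress

    ROAS = [
        # (authorized prefix, maxLength, authorized origin ASN)
        (ipaddress.ip_network("52.84.32.0/21"), 21, 16509),
    ]

    def rpki_validate(prefix: str, origin_asn: int) -> str:
        """Return 'valid', 'invalid', or 'not-found' for an announcement."""
        net = ipaddress.ip_network(prefix)
        covered = False
        for roa_net, max_len, roa_asn in ROAS:
            if net.subnet_of(roa_net):
                covered = True
                if origin_asn == roa_asn and net.prefixlen <= max_len:
                    return "valid"
        return "invalid" if covered else "not-found"

    # The normal /21 announcement validates; the leaked more-specific /22
    # exceeds maxLength and becomes RPKI-invalid, so strict validators drop it.
    print(rpki_validate("52.84.32.0/21", 16509))   # valid
    print(rpki_validate("52.84.32.0/22", 16509))   # invalid
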
My secret spy satellite informs me that Dmitry Sherman wrote On 2019-06-24, 3:55 AM:

Hi All,

here in Ukraine we were impacted as well!

Have two questions:

1. Why did Cloudflare not immediately announce all their address space as /24s? That could have brought the service back up almost instantly in nearly all places.

2. Why did almost all carriers not filter the leak on their side, instead waiting several hours for "better weather on Mars"?

On 24.06.19 13:55, Dmitry Sherman wrote:

Verizon is the one who should've noticed something was amiss and dropped their customer's BGP session.
They also should have had filters and prefix count limits in place, which would have prevented this whole disaster.

As to why any of that didn't happen, who actually knows.

Regards,
Filip
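
As a rough illustration of the prefix-count limit Filip mentions (the maximum-prefix knob on most routers), here is a minimal sketch; the numbers are illustrative assumptions, not anyone's real configuration.

    # Minimal sketch of a prefix-count limit: if a customer session that
    # normally sends a handful of routes suddenly sends tens of thousands,
    # tear the session down instead of propagating the leak.

    MAX_PREFIXES = 500          # expected ceiling for this customer session

    def check_session(received_prefix_count: int) -> str:
        """Return the action a router would take for this session."""
        if received_prefix_count > MAX_PREFIXES:
            return "tear down session (prefix limit exceeded)"
        return "keep session up"

    print(check_session(40))       # a normal day for a small dual-homed customer
    print(check_session(20_000))   # the leak: the session would have been shut down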

(Updating subject line to be accurate)

Hi All,

here in Ukraine we were impacted as well!

Have two questions:

1. Why did Cloudflare not immediately announce all their address space as /24s? That could have brought the service back up almost instantly in nearly all places.

They may not want to pollute the global routing table with these entries. It has a cost for everyone. If we all did this, the table would be a mess.
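
As a quick back-of-the-envelope on that cost: deaggregating even one moderately sized block into /24s multiplies the number of routes every default-free router has to carry. The block sizes below are just examples.

    # Quick arithmetic behind "it has a cost for everyone".

    def count_24s(prefix_len: int) -> int:
        """Number of /24s inside one IPv4 prefix of the given length."""
        return 2 ** (24 - prefix_len)

    for plen in (20, 16, 12):
        print(f"one /{plen} -> {count_24s(plen)} /24 routes")
    # one /20 -> 16 /24 routes
    # one /16 -> 256 /24 routes
    # one /12 -> 4096 /24 routes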

2. Why did almost all carriers not filter the leak on their side, instead waiting several hours for "better weather on Mars"?

There are several major issues here:

- Verizon accepted garbage from their customer
- Other networks accepted the garbage from Verizon (e.g., Cogent)
- Known best practices from over a decade ago are not applied

I’m sure reporters will be reaching out to Verizon about this and their response time should be noted.

It impacted many networks. You should filter your transits to prevent impact from these more-specifics.

- Jared

https://twitter.com/jaredmauch/status/1143163212822720513
https://twitter.com/JobSnijders/status/1143163271693963266
https://puck.nether.net/~jared/blog/?p=208

$MAJORNET filters between peers make sense but what can a transit customer do to prevent being affected by leaks like this one?

Block routes from 3356 (for example) that don’t go 701_3356_ 701_2914_ 701_1239_

etc. (if 701 is your transit and you are multi-homed)

Then you won’t accept the more-specifics.

If you point default, it may not be of any help.

- Jared
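
Here is a minimal sketch of the AS-path filter Jared describes, using the same underscore-delimited regex convention as router AS-path access-lists; the patterns are an illustrative assumption, not a complete policy.

    # Minimal sketch of the AS-path filter idea: if 701 is your transit, only
    # accept routes containing AS3356 (or 2914, 1239) when that AS is directly
    # adjacent to 701, so a leaked path like "701 396531 33154 3356 ..." is rejected.
    import re

    ALLOWED = [re.compile(p) for p in (r"_701_3356_", r"_701_2914_", r"_701_1239_")]
    WATCHED = re.compile(r"_(3356|2914|1239)_")

    def accept_path(as_path: str) -> bool:
        """as_path is the space-separated AS path as received from the transit."""
        padded = "_" + as_path.replace(" ", "_") + "_"
        if not WATCHED.search(padded):
            return True                     # path doesn't involve the big networks at all
        return any(p.search(padded) for p in ALLOWED)

    print(accept_path("701 3356 12576"))                 # True  - normal path via Level3
    print(accept_path("701 396531 33154 3356 12576"))    # False - the leak's more-specifics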

On 24.06.19 17:44, Jared Mauch wrote:

1. Why did Cloudflare not immediately announce all their address space as /24s? That could have brought the service back up almost instantly in nearly all places.

They may not want to pollute the global routing table with these entries. It has a cost for everyone. If we all did this, the table would be a mess.

Yes, it is. But it is a working, quick, and temporary fix for the problem.

2. Why did almost all carriers not filter the leak on their side, instead waiting several hours for "better weather on Mars"?

There are several major issues here:

- Verizon accepted garbage from their customer
- Other networks accepted the garbage from Verizon (e.g., Cogent)
- Known best practices from over a decade ago are not applied

That's it.

We have several IXes connected; all of them had a correct aggregated route to CF. And there was one upstream that distributed the leaked more-specifics.

I think 30 minutes at most is enough to find the problem and filter out its source on their side. Almost nobody did it. Why?

Oddly, VZ used to be quite good about filtering customer sessions :(
There ARE cases where "customer says they may announce X" and that
doesn't happen along the expected path :( For instance, they end up
announcing a path through their other transit to a prefix in the
permitted list on the VZ side :( It doesn't seem plausible that that
is what was happening here though; I don't expect the Duquesne folk to
have customer paths to (for instance) Savi Moebel in Germany...

there are some pretty fun as-paths in the set of ~25k prefixes leaked
(that routeviews saw).

On 24.06.19 17:44, Jared Mauch wrote:

1. Why did Cloudflare not immediately announce all their address space as /24s? That could have brought the service back up almost instantly in nearly all places.

They may not want to pollute the global routing table with these entries. It has a cost for everyone. If we all did this, the table would be a mess.

Yes, it is. But it is a working, quick, and temporary fix for the problem.

Like many things (e.g., AT&T had similar issues with 12.0.0.0/8), now there are a bunch of /9s in the table that will likely never go away.

2. Why did almost all carriers not filter the leak on their side, instead waiting several hours for "better weather on Mars"?

There are several major issues here:
- Verizon accepted garbage from their customer
- Other networks accepted the garbage from Verizon (e.g., Cogent)
- Known best practices from over a decade ago are not applied

That's it.

We have several IXes connected; all of them had a correct aggregated route to CF. And there was one upstream that distributed the leaked more-specifics.

I think 30 minutes at most is enough to find the problem and filter out its source on their side. Almost nobody did it. Why?

I have heard people say “we don’t look for problems”. This is often the case; there is a lack of monitoring and awareness. I had several systems detect the problem, and things like BGPmon also saw it.

My guess is that the people who passed this on weren’t monitoring either. It’s often manual procedures vs. automated scripts watching things. Instrumenting your network elements tends to be done by a small set of people who invest in it. You tend to need some scale for it to make sense, and it also requires people who understand the underlying data well enough to know what is “odd”.

This is why I’ve had my monitoring system up for the past 12+ years. It’s super simple (dumb) and catches a lot of issues. I implemented it again on top of the RIPE RIS Live service, but haven’t cut it over to be the primary (realtime) monitoring method vs. watching route-views.

I think it’s time to do that.

- Jared
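
Here is a minimal sketch of the kind of simple, "dumb" monitor Jared describes, built on the RIPE RIS Live WebSocket stream: subscribe to BGP updates covering a prefix you care about and flag any unexpected more-specific or unexpected origin. It assumes the third-party websockets package, and the watched prefix and expected origin are placeholders, not anyone's real policy.

    # Minimal sketch of a route-leak monitor on the RIPE RIS Live stream.
    # Watched prefix and expected origin below are placeholders.
    import asyncio
    import ipaddress
    import json

    import websockets  # pip install websockets

    RIS_LIVE = "wss://ris-live.ripe.net/v1/ws/?client=leak-monitor-sketch"
    WATCHED = ipaddress.ip_network("192.0.2.0/22")   # placeholder: your aggregate
    EXPECTED_ORIGIN = 64500                          # placeholder: your ASN

    async def monitor() -> None:
        async with websockets.connect(RIS_LIVE) as ws:
            # Ask RIS Live for updates covering the watched prefix,
            # including more-specifics of it.
            await ws.send(json.dumps({
                "type": "ris_subscribe",
                "data": {"prefix": str(WATCHED), "moreSpecific": True},
            }))
            async for raw in ws:
                msg = json.loads(raw).get("data", {})
                path = msg.get("path") or []
                origin = path[-1] if path else None
                for ann in msg.get("announcements", []):
                    for pfx in ann.get("prefixes", []):
                        net = ipaddress.ip_network(pfx)
                        # Flag more-specifics of our aggregate or an unexpected origin AS.
                        if net.prefixlen > WATCHED.prefixlen or origin != EXPECTED_ORIGIN:
                            print(f"ALERT: {pfx} seen with path {path}")

    if __name__ == "__main__":
        asyncio.run(monitor())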