AS4788 Telecom Malaysia major route leak?

This is the official feedback:

Level 3's network, alongside some other ISP's, experienced service disruptions affecting customers in Europe, Asia and multiple other markets. IP, Voice and Content Delivery Network (CDN) services were affected for Level 3. The root cause of the issue was isolated to a third party Internet Service Provider in Asia that leaked internet routes resulting in traffic being sent to a destination that could not route them, which affected IP, Voice and CDN services in multiple markets. The issue has been resolved, but the provider continues working to determine the specific root cause of the incident. At this time, customer services are restored with the exception of any that pose any possible risk to the Level 3 network. Maintaining a reliable, high-performing network for our customers is our top priority. Level 3 will continue to work with the provider to prevent a recurrence.

Jürgen Jaritsch
Head of Network & Infrastructure

ANEXIA Internetdienstleistungs GmbH

Phone: +43-5-0556-300
Fax: +43-5-0556-500

E-mail: jj@anexia.at
Web: http://www.anexia.at/

Headquarters address, Klagenfurt: Feldkirchnerstraße 140, 9020 Klagenfurt
Managing Director: Alexander Windbichler
Commercial register: FN 289918a | Place of jurisdiction: Klagenfurt | VAT ID: AT U63216601

While I agree that TM needs to look into its operational procedures, I
think Level(3) needs to shoulder more of the blame, and not simply pass
the buck to TM.

TM has several more upstreams other than Level(3). Assuming their issue
affected all their border routers, we did not see an issue via their
other upstreams other than Level(3) - although this is conjecture on my
part.

Level(3) should have filtered at the time they were turning up TM.
Simple as that.

We all know we should never trust customers. So...

Mark.

While I agree that TM needs to look into its operational procedures, I
think Level(3) needs to shoulder more of the blame, and not simply pass
the buck to TM.

if you localpref your customer up, you should probably not be willing to
accept the whole internet from them.
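
One common safeguard behind the point above is a maximum-prefix limit on the customer session: if the customer suddenly sends far more routes than expected, the session is torn down instead of the routes being preferred. The following is a minimal Python sketch of that idea only, not any vendor's or Level(3)'s actual implementation. `CustomerSession`, `max_prefixes`, and the limit used are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class CustomerSession:
    """Toy model of a BGP session with a customer (all names hypothetical)."""
    name: str
    max_prefixes: int                      # assumed limit agreed at turn-up
    accepted: set = field(default_factory=set)
    shut_down: bool = False

    def receive(self, prefix: str) -> bool:
        """Accept a prefix, or trip the session once the limit is exceeded."""
        if self.shut_down:
            return False                   # session is down, nothing is accepted
        if prefix in self.accepted:
            return True                    # re-announcement of a known prefix
        if len(self.accepted) >= self.max_prefixes:
            # Limit exceeded: drop the session and everything learned over it.
            self.shut_down = True
            self.accepted.clear()
            return False
        self.accepted.add(prefix)
        return True
```

A session created with `max_prefixes=3` accepts the first three distinct prefixes and shuts down on the fourth, so a leaked full table never gets far enough to be preferred over peer or transit routes.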

TM has several more upstreams other than Level(3). Assuming their issue
affected all their border routers, we did not see an issue via their
other upstreams other than Level(3) - although this is conjecture on my
part.

they also have ~180 ASNs in their downstream cone who presumably get a
full table and have the export policy that did the business in this case
applied all the time.

For completeness' sake, here is what Telekom Malaysia published about the
issue:

    Telekom Malaysia Berhad (TM) wishes to update on the service related
    issue detected yesterday, 12 June 2015 affecting a number of our
    Internet services customers that caused a deterioration in
    connection performance.
     
    We identified the root cause and our network team immediately took
    steps to optimise traffic flows, while we worked to restore
    connectivity to its expected level of performance. The services were
    restored at 6.30pm on the same day.
     
    We would like to clarify that during a network reconfiguration
    exercise, we had unintentionally updated traffic routing information
    which caused congestion and packet loss to our international
    connectivity. This had affected the internet traffic flow for some
    of our customers and some international traffic routes.
     
    We apologise for any inconvenience caused by the service disruption
    and would like to assure customers that we are undertaking all the
    necessary measures to ensure customers continue to experience
    uninterrupted services.
     
    Meanwhile, customers who have any enquiry or require further
    assistance can email us at help@tm.com.my or tweet to us via
    @tmconnects on Twitter.

    source: https://www.tm.com.my/OnlineHelp/Announcement/Pages/INTERNET-SERVICES-DISRUPTION-12-June-2015.aspx

Kind regards,

Job

Hai!

Wow! This is what they came up with?!

Hopefully Level3 will take appropriate measures. It's amazing. Really.

'Some international routes'

Have they any idea what they did at all?

It's amazing that with parties like that the internet still works as is <tm> ...

Thanks,
Raymond Dijkxhoorn

I wouldn't be as hard. Stuff happens - and as they said, during a
maintenance activity, they boo-boo'ed.

Are Level(3) going to own up and say they should have had filters in
place? I certainly hope they do.

But more importantly, are Level(3) going to implement the filters
against TM's circuit? Are they going to run around the network looking
for any additional customer circuits that need plugging? That's my
concern...

Mark.

Hai!

Mark, mistakes and oopses happen. No problem at all. I understand that completely. There is human failure and this happens.

A simple 'sorry' would have done. Yet their whole message says 'we did OK'. In my very limited view they did NOT do OK. Did I misread?

I am also very much looking forward to seeing how Level3 is going to prevent things like this. But from our own experience, they will not. We have seen before that they implemented filtering based on customer lists, but not as a per-customer filter. They did this globally, so any L3 customer can announce the routes of another L3 customer. While this can be changed, this outage shows there is certainly room for improvement.
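
To make the difference between a global customer-list filter and a per-customer filter concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not a description of how Level3 actually builds its filters: the `REGISTERED` table, the customer names, and both function names are hypothetical, and the prefixes are documentation ranges. Real filters usually also cap the accepted prefix length, which this sketch omits.

```python
import ipaddress

# Hypothetical registered prefixes per customer (documentation ranges).
REGISTERED = {
    "customer-a": ["192.0.2.0/24"],
    "customer-b": ["198.51.100.0/24", "203.0.113.0/24"],
}


def _covers(allowed, prefix):
    """True if prefix equals, or is a subnet of, any allowed prefix."""
    net = ipaddress.ip_network(prefix)
    return any(net.subnet_of(ipaddress.ip_network(a)) for a in allowed)


def global_filter(customer, prefix):
    """Accept anything registered to ANY customer; the sender is ignored."""
    everything = [p for plist in REGISTERED.values() for p in plist]
    return _covers(everything, prefix)


def per_customer_filter(customer, prefix):
    """Accept only prefixes registered to the customer announcing them."""
    return _covers(REGISTERED.get(customer, []), prefix)
```

With the global variant, `customer-a` announcing `198.51.100.0/24` (registered to `customer-b`) is accepted. With the per-customer variant it is rejected, which is the behaviour the paragraph above is asking for.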

I hope people will learn from what happened and implement proper filtering. That's even more important than a message from an operator that didn't even fully understand what they caused to the internet globally.

Thanks,
Raymond Dijkxhoorn

Raymond,

They provided a "simple sorry":

    "We apologise for any inconvenience caused by the service disruption."

It doesn't get much more simple than that.

-mel beckman

Hello Mel,

Must just be me then.

I was most likely expecting a more in-depth report. Strange things happened. Perhaps they could post a 'what exactly happened', since this wasn't an average route leak.

Thanks,
Raymond Dijkxhoorn

Raymond,

But you said "A simple 'sorry' would have done." Now you're asking for lots more detail. Why the change?

-mel beckman

Does anyone know if there's an official "ruling" as to who gets to pay for
the SLA breaches?

SLAs are part of a contract, and thus only apply to the parties of the contract. There are no payments due to other parties. The Internet is a "best effort" network, with zero guarantees.

-mel beckman

Ok, I'll bite: my $dayjob is a Level 3 client that was directly affected by
lack of availability due to the recovery attempt Level 3 tried in our region.
Where can $dayjob collect $ for this incident?

Rubens

keep in mind their target audience with that message is probably local
Malaysian customers, not the world.

I get that much, just wondering if Level3 would have to pay an SLA breach
to its customers given the mess started with TM (even though it could have
been avoided). And I am guessing if they do, they wouldn't be able to
recover anything from TM.

In addition to that, losing face in SE Asia is "not done".

what i have yet to understand (probably my fault) is how L(3) propagated
the disease or, more correctly, what has happened over there that they
did not stop the propagation? the crew that went there from mci ran a
very tight ship and L(3) has always had pretty rigid filters. what
happened? and i mean that in the sense of how can i not make a similar
mistake?

randy

Hi Rafael,

I doubt L3 has to pay anything to its customers in terms of SLA breach;
it's best effort. Are you aware of any such agreement which suggests
otherwise? That would be interesting.

Well, I was wondering the same. I am guessing it depends on the SLA
contract since they are all very unique and specific. I assume they would
have to, granted the issue lasted for a couple hours. Now, it depends on
how they define the outage. A fiber cut that renders a customer's service
unusable would be an easy SLA breach. Their legal team most likely removed
any liability due to someone else's negligence, although you could argue
they were negligent as well. So in this case they can claim the whole "best
effort" thing and get away with it. I am not a L3 customer, so was just
wondering out of curiosity.

I'm going to bet that aside from a few one-off cases the SLA in
question talks about maintaining reachability inside L3's network, or
maybe even 'is your link up and can you ping the L3 gateway router you
connect to?'

SLAs aren't meant to actually get paid out...