TATA problems?

Todd_S · November 7, 2011, 3:00pm

We seem to be having some problems with our tata links - first seen in EU
about 45 minutes ago, now we're seeing problems in NA. I'm focused on DNS,
so I'm seeing a lot of timeouts/servfails, but our networking folks are
talking about links dropping.

Anyone else seeing oddness on the NA Internet right now?

http://downrightnow.com/ confirms - something is up.

cheers,

t.

_Stephane_Bortzmeyer · November 7, 2011, 3:05pm

Todd_S wrote:

We seem to be having some problems with our tata links

They probably use Juniper routers

Tim_Vollebregt · November 7, 2011, 3:06pm

Hi,

This issue seems to be much bigger, we lost about 20 Level3 and some TATA sessions.
Also we lost about 15% of our total traffic.

On #IX there are rumours about routers running Junos version 10.3R2.11 core dumping and rebooting, which makes sense.

Currently traffic is restored.

Tim

Tom_Hill · November 7, 2011, 3:08pm

There are widespread issues across the Internet; certain versions of
Juniper firmware have core dumped after seeing a particular BGP 'UPDATE'
message.

(That's the running theory at least).

It's affected multiple service providers, globally, not just those
connected to TATA.

Tom

Jared_Mauch · November 7, 2011, 3:31pm

Pretty much any major BGP event will impact multiple providers.

One gauge of general instability (which I find valuable, and you may as well) is Route Views data.

If you look at the BGP UPDATES archive sizes, you can see when something happens, e.g.:

http://archive.routeviews.org/bgpdata/2011.11/UPDATES/

Take a look at the size of the updates.20111107.1400.bz2 file and the 1415 file. They are abnormally large compared to a normal period. That means a lot of updates were being processed, which serves as a rough indicator of the level of instability.
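The size comparison above can be sketched in a few lines of Python. This is only an illustration: the byte counts below are made up (they are not the real archive sizes), and the `factor` cutoff of 2x the median is an arbitrary assumption, not a Route Views convention.

```python
from statistics import median

def flag_large_archives(sizes, factor=2.0):
    """Return the archive names whose size exceeds factor * median size."""
    baseline = median(sizes.values())
    return sorted(name for name, size in sizes.items() if size > factor * baseline)

# Hypothetical byte counts for illustration only.
sizes = {
    "updates.20111107.1330.bz2": 610_000,
    "updates.20111107.1345.bz2": 650_000,
    "updates.20111107.1400.bz2": 2_400_000,
    "updates.20111107.1415.bz2": 1_900_000,
    "updates.20111107.1430.bz2": 700_000,
}

print(flag_large_archives(sizes))
```

Using the median rather than the mean keeps one or two huge files from dragging the baseline up with them.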

If you are not feeding route views or similar community projects, please consider doing so. It helps paint the view for those doing analysis.

- Jared

Pierre-Yves_Maunier · November 7, 2011, 3:33pm

On our side all our 10.3R2.11 routers core dumped, which made all our
interfaces flap.
I've been told 10.4R1.9 is affected too.

Leigh_Porter · November 7, 2011, 3:45pm

My 10.4R1.9 boxes died also, but I saw interfaces go down whilst bgpd seemed stable.

Kelly_Kane · November 7, 2011, 3:55pm

Perhaps related to Juniper PSN-2011-08-327? Did the whole router
reboot, or just the service module?

We saw one TATA session, and one Abovenet session flap.

Kelly

Dan · November 7, 2011, 4:08pm

We got a panic message about the PFE, which core'd, and it looks like it restarted our FPCs.

JUNOS 10.2R2.11

-Dan

Todd_Snyder · November 7, 2011, 4:09pm

Can anyone point to any authoritative updates about this?

Hammer · November 7, 2011, 4:14pm

I'm struggling to do the same. All the various "Internet Health" sites show(ed) some upticks in negative performance, but I don't have any specifics. We are a Gomez customer, and Gomez is showing issues in St. Louis (SAVVIS) and Philly (L3) that specifically impact the availability of our applications, but the underlying reason isn't clear. I'm giving cautious updates to management because, even though it's obvious something is going on, I don't have anything official except random email threads. Looking for more insight before misinforming management.

-Hammer-

"I was a normal American nerd"
-Jack Herer

Richard1 · November 7, 2011, 4:27pm

I think Jared's suggestion was about as close as you're going to get for
right now. Look at the size of the files he mentioned as compared to the
average size of the others.
Hopefully someone will come forth with an authoritative answer later
today.
Richard Golodner

Jared_Mauch · November 7, 2011, 4:37pm

One can do some analysis of the files to determine what prefixes and autonomous
system neighbors were impacted.

I can do some of this as I have some other tools that quickly process this data
if people are interested. Please send those replies/votes off list to me directly.
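For anyone who wants to try that kind of analysis themselves, here is a rough sketch. It assumes the MRT archive has already been converted to pipe-delimited text, one update per line, in the style of `bgpdump -m` output (type, timestamp, A/W flag, peer IP, peer AS, prefix, ...). That field layout is an assumption to check against your own tool's output, and the sample lines use documentation addresses and ASNs, not real data.

```python
from collections import Counter

def summarize_updates(lines):
    """Count announcements/withdrawals per peer AS and per prefix."""
    peers = Counter()
    prefixes = Counter()
    for line in lines:
        fields = line.strip().split("|")
        # Skip anything that is not an announce (A) or withdraw (W) record.
        if len(fields) < 6 or fields[2] not in ("A", "W"):
            continue
        peers[fields[4]] += 1      # assumed position of the peer AS
        prefixes[fields[5]] += 1   # assumed position of the prefix
    return peers, prefixes

# Made-up sample records (RFC 5737 prefixes, RFC 5398 ASNs).
sample = [
    "BGP4MP|1320674400|A|192.0.2.1|64496|198.51.100.0/24|64496 64500|IGP",
    "BGP4MP|1320674401|W|192.0.2.1|64496|198.51.100.0/24",
    "BGP4MP|1320674402|A|192.0.2.2|64497|203.0.113.0/24|64497 64501|IGP",
]

peers, prefixes = summarize_updates(sample)
print(peers.most_common(5))
print(prefixes.most_common(5))
```

The peers and prefixes at the top of each list are the ones generating the most churn in that interval.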

- Jared

Todd_Snyder · November 7, 2011, 4:40pm

Management don't understand or care about BGP updates; they just want to
know if the problem is ours, and if it's not, who to blame.

thank goodness for NANOG - updates here have been helpful explaining things
to management.

t.

Hammer · November 7, 2011, 4:41pm

So the file size being 30% higher implies that the number of updates is larger, and therefore that there is instability? I see the logic, but if you scroll through that page (the whole month of November) there are tons of >1M files. Trying to see what is different about today....

-Hammer-

"I was a normal American nerd"
-Jack Herer

Pierre-Yves_Maunier · November 7, 2011, 4:43pm

On our side we did not have any reboot, just a core dump generated and all
interfaces flapped.

Leigh_Porter · November 7, 2011, 4:43pm

Just blame Shub Internet..

Oh no, I've said it now!

Jared_Mauch · November 7, 2011, 4:45pm

This is an easy benchmark to gauge overall stability. Large files mean something was unstable. Then you need to actually look at them to see *why*. Also, since the files are compressed, you lose some visibility into what is really in them.
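To illustrate the compression caveat, here is a minimal sketch using Python's standard `bz2` module. The update lines are entirely synthetic and the line format is invented for the example. Both sets hold the same number of messages, yet they compress to very different sizes, so file size is a hint rather than a message count.

```python
import bz2

def compressed_size(lines):
    """Size in bytes of the bz2-compressed text built from lines."""
    return len(bz2.compress("\n".join(lines).encode("ascii")))

# Synthetic lines in a made-up format; same message count in both sets,
# only the variety of the content differs.
repetitive = ["A|64496|198.51.100.0/24|64496 64500"] * 5000
varied = [
    f"A|{64496 + i % 16}|10.{i // 256 % 256}.{i % 256}.0/24|"
    f"{64496 + i % 16} {64512 + i % 997}"
    for i in range(5000)
]

small, large = compressed_size(repetitive), compressed_size(varied)
print(small, large)
```

One prefix flapping thousands of times can therefore yield a smaller archive than a broader event touching many prefixes, which is why the files still need to be opened and read.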

- Jared

Hammer · November 7, 2011, 4:50pm

Thank you. This is somewhat of a learning opportunity for me. I hit all the generic Internet health sites and I understand that there IS an issue. Now I'm getting to learn how you guys attempt to understand WHY we had an issue.

But my point is the same. If this is the case, then the entire month of November reflects "instability" where I see transitions from 600k to 1M between updates. Yet we didn't experience the same negative customer experience for those. So how do you see the difference with today's events? Digging into files now.

-Hammer-

"I was a normal American nerd"
-Jack Herer

Joel_Jaeggli · November 7, 2011, 4:52pm

According to my Peakflow, the Level3 update spike ran from ~1408 UTC to
~1424 UTC.
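A spike window like that can also be estimated from update timestamps alone. The sketch below is not how Peakflow works; it simply buckets Unix timestamps per minute and reports the first and last minute above a threshold. The timestamps are synthetic, shaped to mimic the window reported above, and the threshold of 10 updates per minute is an arbitrary assumption.

```python
from collections import Counter
from datetime import datetime, timezone

def spike_window(timestamps, threshold):
    """Return (first, last) busy minute as 'HHMM' UTC strings, or None.

    A minute is busy when it holds more than `threshold` updates. The
    result spans the first to the last busy minute; quiet gaps in
    between are not checked.
    """
    per_minute = Counter(ts - ts % 60 for ts in timestamps)
    busy = sorted(m for m, n in per_minute.items() if n > threshold)
    if not busy:
        return None

    def hhmm(epoch):
        return datetime.fromtimestamp(epoch, tz=timezone.utc).strftime("%H%M")

    return hhmm(busy[0]), hhmm(busy[-1])

# Synthetic data: 2 updates per minute as background, with a burst of
# 52 per minute from minute 8 through minute 24.
base = 1320674400  # 2011-11-07 14:00:00 UTC
timestamps = []
for minute in range(30):
    count = 52 if 8 <= minute <= 24 else 2
    timestamps.extend([base + minute * 60] * count)

print(spike_window(timestamps, threshold=10))
```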