Routing issues to AWS environment.

Hi, are there any AWS engineers out there? We are seeing routing problems between NTT and AWS in Ashburn, VA, and would like to find out which side has the problem.

Thanks,
Curt

I was just about to email the group for a related issue.

We are also seeing some funky routing/peering within the AWS network.

We primarily communicate with Verizon Media/Oath (AS10310). Verizon Media has a presence in Singapore, where it's peered locally with AWS AS38895; we normally see 8ms latency. Verizon Media also peers with AWS AS16509 in Japan, but for Singapore traffic Verizon Media sends a lower MED, so AWS Singapore should prefer that route/peer. That isn't working properly on the AWS side: all of our traffic is going to Japan. This started early this morning.

I had Verizon Media investigate and gave them our AWS Singapore IP addresses; they confirmed that they are not receiving those prefixes/announcements from AWS Singapore (AS38895).

So something is broken… hopefully someone from AWS is reading this and can escalate.

In my case, the AWS Singapore IP ranges in question are 46.51.216.0/21 and 52.74.0.0/16.
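
If anyone wants to check this from the outside, something like the sketch below against the RIPEstat looking-glass data call should show which origin AS the route collectors currently see for these prefixes. The endpoint and JSON field names are my best guess from the public docs, so treat it as a rough starting point rather than a finished tool.

#!/usr/bin/env python3
# Rough sketch: ask the RIPEstat looking-glass data call what the RIS route
# collectors currently see for each prefix, and print the AS paths. The
# endpoint name and JSON field names ("rrcs", "peers", "as_path") are my
# assumptions from the public RIPEstat documentation.
import json
import urllib.request

PREFIXES = ["46.51.216.0/21", "52.74.0.0/16"]
API = "https://stat.ripe.net/data/looking-glass/data.json?resource={}"

for prefix in PREFIXES:
    with urllib.request.urlopen(API.format(prefix), timeout=30) as resp:
        data = json.load(resp).get("data", {})
    print(f"=== {prefix} ===")
    for rrc in data.get("rrcs", []):
        for peer in rrc.get("peers", []):
            path = str(peer.get("as_path", ""))
            # Flag anything originated by the Singapore ASN so it stands out.
            flag = "  <-- origin AS38895" if path.split()[-1:] == ["38895"] else ""
            print(f"{rrc.get('rrc', '?'):8} {path}{flag}")

If nothing shows up originated by AS38895, that would line up with what Verizon Media told us.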

-John

Are you sure the problem isn’t NTT? My buddy’s WISP peers with Spirit and had a boatload of problems with random packet loss affecting initially just SIP and RTP (both UDP). Spirit was blaming NTT. Problems went away when Spirit stopped peering with NTT yesterday. Path is through Telia now to their main SIP trunk provider.

chuck

Interesting. At my 9-5 we use NTT exclusively for SIP traffic and it has been flawless. If there are any tests you want me to run over NTT via their POP at 111 8th, let me know.

Hi Chuck,

Are you sure the problem isn’t NTT? My buddy’s WISP peers with Spirit
and had a boatload of problems with random packet loss affecting
initially just SIP and RTP (both UDP). Spirit was blaming NTT.
Problems went away when Spirit stopped peering with NTT yesterday.
Path is through Telia now to their main SIP trunk provider.

I don't know the specifics of what you reference, but in a large,
geographically dispersed network like NTT's backbone, I can assure you
there will always be something down somewhere. Issues can take on many
forms: sometimes it is a customer-specific issue related to a single
interface, sometimes something larger is going on.

It is quite rare that the whole network is on fire, so in the general
case it is good to investigate and consider each and every report about
potential issues separately.

The excellent people at the NTT NOC are always available at noc@ntt.net
or the phone numbers listed in PeeringDB.

Kind regards,

Job

Job,
We have had a lot of dialog with the excellent people at the NTT NOC this week, easily over a couple of hours in total. We were told to talk to AWS directly and to have our customers talk to AWS. Basically, an "it's not us" response. So we reached out to our buddies in NANOG. We have no way to get AWS to communicate with us; we don't peer with them directly the way we do with many other cloud providers out of the Equinix IX.

We have a workaround: we broke up some of our Ashburn /21 advertisements into /23 and /24 advertisements covering the ranges that include our customer IP assignments. Pushing a more specific route out our Ashburn peers, rather than our out-of-area peers such as Chicago, is helping. That has resolved our direct customer issues, but it leads us to believe that where we have BGP peering in regions outside of Ashburn, VA, AWS isn't honoring our AS prepending.
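
For reference, the shape of that change is easy to reproduce with Python's ipaddress module; the prefix below is a stand-in, not our actual block.

#!/usr/bin/env python3
# Toy illustration of the workaround: split a /21 into its covering /23s and
# /24s so that the announcement made in Ashburn wins longest-prefix match over
# the aggregate heard elsewhere. 198.18.0.0/21 is a stand-in, not our block.
import ipaddress

aggregate = ipaddress.ip_network("198.18.0.0/21")
print("covering /23s:", [str(n) for n in aggregate.subnets(new_prefix=23)])
print("covering /24s:", [str(n) for n in aggregate.subnets(new_prefix=24)])

# Longest-prefix match: with both routes in the table, a router forwards on
# the most specific one, regardless of the aggregate's AS path.
routes = {
    ipaddress.ip_network("198.18.0.0/21"): "aggregate, heard out of region (Chicago)",
    ipaddress.ip_network("198.18.2.0/24"): "more specific, heard in Ashburn",
}
dst = ipaddress.ip_address("198.18.2.10")
best = max((n for n in routes if dst in n), key=lambda n: n.prefixlen)
print(f"{dst} -> {routes[best]}")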

The original issue is that our local customers in the DC region get routed from our AS over NTT into AWS in Ashburn for AWS East region environments, but AWS is sending the return traffic to Chicago, to one of our other upstream peers. For a few select customers this breaks their applications completely (they cannot connect at all) or severely disrupts performance, bringing the applications to a crawl. Yet we can push iperf traffic to our own AWS instances with zero packet loss and no perceivable issue other than the asymmetric routing, which adds around 30ms to the return latency versus the typical 2ms to 3ms. We do have Layer 2 between our POPs.
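
For anyone who wants to reproduce the latency comparison, a rough sketch like the one below samples TCP handshake RTT; the host and port are placeholders, not our actual instances.

#!/usr/bin/env python3
# Rough sketch for putting a number on the return-path penalty: sample TCP
# handshake RTT to an instance a few times and report the median. Host and
# port are placeholders, not our actual instances.
import socket
import statistics
import time

HOST = "198.51.100.10"   # placeholder: an instance in the affected range
PORT = 22                # placeholder: any TCP port that answers
SAMPLES = 10

rtts = []
for _ in range(SAMPLES):
    start = time.monotonic()
    try:
        with socket.create_connection((HOST, PORT), timeout=5):
            rtts.append((time.monotonic() - start) * 1000.0)
    except OSError:
        pass
    time.sleep(0.5)

print(f"{len(rtts)}/{SAMPLES} connects succeeded")
if rtts:
    print(f"median {statistics.median(rtts):.1f} ms, max {max(rtts):.1f} ms")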

Is ignoring AS prepending common? Given my example issue, what direction would you normally take?

Sincerely,
Nick Ellermann

Dear Nick,

I sympathize with your plight; network debugging can be quite a test of
character at times.

I am snipping some text as I can't comment on specific details in
this case, but you do raise two excellent questions which I can maybe
help with.

Is ignoring AS prepending common?

It is not common, but yes, it does happen. Some cloud providers and CDNs
have broken away from the traditional BGP best path selection and use
SDN controllers to steer traffic. I don't know whether that is in play here or not.
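
To make the mechanics concrete, here is a toy model of a simplified best path selection order; the ASNs and values are made up, and it is not a claim about what any particular provider's controller does.

#!/usr/bin/env python3
# Toy model of a simplified BGP decision process (local-pref, then AS path
# length, then origin, then MED). It only illustrates why prepending normally
# works and why it stops mattering once something injects a higher local-pref.

ORIGIN_RANK = {"igp": 0, "egp": 1, "incomplete": 2}

def best_path_key(route):
    # Earlier attributes dominate later ones; min() picks the best route.
    return (
        -route["local_pref"],          # higher local-pref wins
        len(route["as_path"]),         # then shorter AS path
        ORIGIN_RANK[route["origin"]],  # then lower origin code
        route["med"],                  # then lower MED
    )

candidates = [
    # Heard in Ashburn, deliberately left unprepended.
    {"name": "via Ashburn", "local_pref": 100, "as_path": [64500], "origin": "igp", "med": 0},
    # Heard in Chicago, prepended three times to discourage its use.
    {"name": "via Chicago", "local_pref": 100, "as_path": [64500] * 4, "origin": "igp", "med": 0},
]
print("normal selection:", min(candidates, key=best_path_key)["name"])

# If a controller pins traffic by raising local-pref on the Chicago route,
# the prepending never even gets compared.
candidates[1]["local_pref"] = 200
print("with controller override:", min(candidates, key=best_path_key)["name"])

The point being: prepending only influences the decision if nothing earlier in that list has already made it for you.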

Given my example issue, what direction would you normally take?

Your issue reminds me of one I encountered some years ago. A member
of the Dutch community reported that seemingly random pairs of IP
addresses could not reach each other across an Internet Exchange fabric.
It drove this person crazy because none of the involved parties could
find anything wrong within their own domain. The debugging process was hard
because the person had to ask others for ping sweeps and traceroutes, would
get information back without timestamps, and didn't have the ability to alter
source and destination ports on the packets sent for debugging.
It turned out to be a faulty linecard that, under specific circumstances,
would hash traffic into a blackhole. It took WEEKS to find this.

So I identified a need for a more advanced debugging platform, one
that wouldn't require human-to-human interaction to help operators debug
things. In other words, it seemed to make sense to stand up Linux shell
servers in lots of networks and share access with each other. This
project is the NLNOG RING, and I'd recommend you participate.

An introduction can be found here:
https://www.youtube.com/watch?v=TlElSBBVFLw and a nice use-case video is
available here: https://www.youtube.com/watch?v=mDIq8xc2QcQ

NTT, Amazon, and many others are part of it. I assume that you have
SSH access to the problematic destination, so I hope you can use tcpdump
there to verify whether you can or can't receive packets coming from NLNOG
RING nodes.
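
If running tcpdump on that box is awkward, even a minimal listener is enough for the "does anything arrive at all" question; the port below is an arbitrary choice, and you would need to open it in the security group first.

#!/usr/bin/env python3
# Minimal alternative to tcpdump for the "does anything arrive at all" check:
# listen on an arbitrary UDP port and print the source of whatever shows up.
import socket

PORT = 33434  # arbitrary choice; open it in the security group first

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("0.0.0.0", PORT))
print(f"listening on udp/{PORT}; send probes from the RING nodes now")
while True:
    data, (src, sport) = sock.recvfrom(2048)
    print(f"got {len(data)} bytes from {src}:{sport}")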

You mentioned that altering your announcements (deaggregating,
prepending) resolves the issue. This strongly suggests that something
somewhere is broken, and it is a matter of triangulating until you've
found the shortest path that exhibits the problem. Perhaps you can narrow
it down to something like "Between these two nodes, when I use source port X,
protocol Y, destination port Z, traffic doesn't arrive."
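
A sketch of what that triangulation could look like in practice, sweeping source ports toward a fixed destination; the address and ports below are placeholders, and you would run it from a RING node or any other vantage point toward something you control, ideally with a capture running on the far end.

#!/usr/bin/env python3
# Sweep source ports toward one destination and record which 5-tuples never
# complete a TCP handshake. Destination address and ports are placeholders.
import socket

DST = "198.51.100.10"        # placeholder: a destination you control
DPORT = 22                   # placeholder: a TCP port that answers there
SOURCE_PORTS = range(40000, 40064)

failed = []
for sport in SOURCE_PORTS:
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.settimeout(3)
    try:
        s.bind(("", sport))
        s.connect((DST, DPORT))
    except OSError:
        failed.append(sport)
    finally:
        s.close()

print(f"{len(failed)}/{len(SOURCE_PORTS)} source ports never completed: {failed}")

Anything that consistently fails for particular source ports while its neighbours work is a strong hint that something on the path is hashing those flows into a hole.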

    Website: https://ring.nlnog.net/

There is also an IRC channel where people can perhaps help you make the
best use of this tool.

Kind regards,

Job