Anyone seen this kind of problem? SIP traffic not getting to destination but traceroute does

We ran into a strange situation yesterday that I am still trying to
figure out. We have many VoIP customers, but yesterday a select few of
them suddenly couldn't reach the SIP provider's network from our
network.

I could traceroute to the SIP provider's server from the affected
clients' IPs just fine. I confirmed that the SIP traffic was leaving
our network out the interface to the upstream provider, but the SIP
provider said they couldn't see the SIP traffic come into their border
router.
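For reference, that on-the-wire check can be done with tcpdump or with a short script. A rough sketch in Python with scapy is below; the interface name and SIP server address are placeholders, not our real ones:

    from scapy.all import sniff, IP, UDP

    UPSTREAM_IFACE = "eth1"        # placeholder: upstream-facing interface
    SIP_SERVER = "203.0.113.10"    # placeholder: the provider's SIP server

    def show(pkt):
        if IP in pkt and UDP in pkt:
            print("%s -> %s dport %d" % (pkt[IP].src, pkt[IP].dst, pkt[UDP].dport))

    # Watch the egress interface for SIP traffic headed to the provider.
    sniff(iface=UPSTREAM_IFACE,
          filter="udp port 5060 and dst host " + SIP_SERVER,
          prn=show, count=10, timeout=30)

Seeing packets here only proves they leave our edge; it says nothing about what the transit providers do with them afterwards.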

SIP traffic coming from the SIP provider to the affected customers came
through fine. It was only us -> SIP server that was a problem.

I thought there might be some strange BGP issue going on, but we had
other customers within the same /24 as the affected customers, and they
were connecting fine.

The traffic at the time traversed

Our network -> Qwest/CenturyLink -> Level 3 -> SIP provider

I changed the routing around so it would go through our other
upstream, AT&T, and it started working. With AT&T, the route was

Our network -> AT&T -> Level 3 -> SIP provider

So my question is: is it possible there is some kind of filter at
Qwest or Level 3 that is dropping traffic only for UDP 5060 for a
select few IPs? That's the only explanation I can come up with, other
than the possibility that the whole Juniper BGP issue two days ago left
something in between in a strange state. I read the post about XO doing
filtering on transit traffic, but I haven't seen anyone say Level 3 or
Qwest is doing the same.
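One way to narrow down where a port-specific drop happens is a traceroute-style probe that actually uses UDP 5060 as the destination port, rather than the high ports ordinary traceroute defaults to. A rough sketch (Python with scapy, run as root; the target address and SIP-ish payload are placeholders):

    from scapy.all import IP, UDP, Raw, sr1

    TARGET = "203.0.113.10"   # placeholder: the SIP provider's server
    PAYLOAD = "OPTIONS sip:probe@example.invalid SIP/2.0\r\n\r\n"   # dummy payload

    for ttl in range(1, 21):
        pkt = IP(dst=TARGET, ttl=ttl) / UDP(sport=5060, dport=5060) / Raw(load=PAYLOAD)
        ans = sr1(pkt, timeout=2, verbose=0)
        if ans is None:
            print(ttl, "*")
        else:
            print(ttl, ans[IP].src)
            if ans[IP].src == TARGET:
                break

If the ICMP time-exceeded replies stop at a particular hop for dport 5060 but the same loop with a different destination port goes further, that points at a port-specific filter somewhere around that hop.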

I can't say I have a specific answer to your question, but yesterday I was seeing major packet loss on outbound audio from all my VoIP customers using Qwest and going into servers on L3. It's entirely possible that SIP was also being lost, but the audio was the more notable and pressing issue. It seems to be resolved at this point, but we have not yet heard from Qwest what the actual problem was.

This was with sites in Northeast Ohio and the Chicago area connecting to servers in New York and LA for what it's worth.

What was the timeframe for your issues? Just curious since we saw some strangeness last night.

Preston

It started sometime Tuesday morning. I have yet to set the route back
to Qwest. I am going to do that tonight and test it.

I saw the problems starting around 09:30 Eastern and continuing past 17:00. Looking through ticket notes I had missed when writing my previous reply, it seems that a fix was confirmed around 22:30, which involved replacing a faulty piece of equipment. I don't have specifics on what went wrong or when it was actually fixed, though.

Yes!

Yesterday, from 9AM-10AM PST, I had a Qwest client transiting Level3 where traceroutes were working, but SIP registrations were not. They were leaving fine, but not being received on the destination side.

Then, from 10AM-2PM PST, with the same client, registrations and invites were working, but "180 RINGING" responses were being eaten. Things worked fully at 2PM. We only contacted Level3, and they didn't see any issues at around 1:45PM PST.
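A lightweight way to test that kind of one-way SIP blackholing, independent of the PBX or registration state, is to fire a bare OPTIONS request from the affected source address and see whether anything comes back. A rough sketch with plain Python sockets (addresses are placeholders, and the bind will fail if a real SIP stack already owns 5060 on that host):

    import socket, uuid

    LOCAL_IP = "198.51.100.5"              # placeholder: affected source IP
    SIP_SERVER = ("203.0.113.10", 5060)    # placeholder: provider's SIP server

    tag = uuid.uuid4().hex[:8]
    msg = (
        "OPTIONS sip:%s SIP/2.0\r\n"
        "Via: SIP/2.0/UDP %s:5060;branch=z9hG4bK%s\r\n"
        "Max-Forwards: 70\r\n"
        "From: <sip:probe@%s>;tag=%s\r\n"
        "To: <sip:%s>\r\n"
        "Call-ID: %s@%s\r\n"
        "CSeq: 1 OPTIONS\r\n"
        "Content-Length: 0\r\n\r\n"
    ) % (SIP_SERVER[0], LOCAL_IP, tag, LOCAL_IP, tag,
         SIP_SERVER[0], uuid.uuid4().hex, LOCAL_IP)

    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.bind((LOCAL_IP, 5060))   # fails if a SIP stack is already using 5060
    s.settimeout(3)
    s.sendto(msg.encode(), SIP_SERVER)
    try:
        data, addr = s.recvfrom(2048)
        print("Reply from %s: %s" % (addr[0], data.decode(errors="replace").splitlines()[0]))
    except socket.timeout:
        print("No SIP reply - request or response lost in transit")

If the probe gets no answer from the affected address but works from an unaffected one, that points at something in the transit path rather than the SIP server.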

Regards,
Owen

I ran into exactly this problem last week with Rogers. We could receive
all traffic from the client except udp/5060, which was blocked. We
tested other IP addresses on our (provider) side and did not find any
blocking there, so we assigned a new IP to the SIP gateway. I find it
hard to believe this was an ordinary malfunction, but good luck getting
a phone company to troubleshoot a problem with their subscribers using
mobile data to connect to a third-party voice gateway...
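That per-address testing is easy to automate. A small sketch (Python with scapy, placeholder addresses) that probes a handful of our-side IPs on both 5060 and an arbitrary control port makes address- or port-specific blocking obvious at a glance:

    from scapy.all import IP, UDP, Raw, sr1

    # Placeholders: candidate addresses on our (provider) side to test
    TARGETS = ["203.0.113.10", "203.0.113.11", "203.0.113.12"]
    PORTS = [5060, 4060]    # the SIP port plus an arbitrary control port

    for dst in TARGETS:
        for dport in PORTS:
            probe = IP(dst=dst) / UDP(sport=5060, dport=dport) / \
                    Raw(load="OPTIONS sip:probe@example.invalid SIP/2.0\r\n\r\n")
            ans = sr1(probe, timeout=2, verbose=0)
            print("%-15s %5d  %s" % (dst, dport,
                  "no reply" if ans is None else ans.summary()))

Whether the control port draws an ICMP port-unreachable or nothing at all depends on the far end, so compare the pattern across rows rather than reading any single line in isolation.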

Well, just a couple of days ago, we discussed that XO does this kind of
rifle-bullet filtering in certain circumstances; is any party getting their
connectivity from them?

Cheers,
-- jra

I've seen UDP/5060 be intercepted or blocked by various providers. This
is common in international markets. If you are doing VoIP over the public
internet, it may be worthwhile to invest in software or hardware that
can VPN either 'back' or 'out' to the internet. I have a PPTP VPN
solution I use to escape various hotel networks. You can even do an
install on a Linux box with the poptop/pptpd solution. (Having an
SSH server on tcp/80 and tcp/443 can also help, and is part of 'being
prepared'.)

- Jared

Jay Nakamura wrote the following on 11/9/2011 12:47 PM:

...
So my question is: is it possible there is some kind of filter at
Qwest or Level 3 that is dropping traffic only for UDP 5060 for a
select few IPs?

I've found tools like tcptraceroute (the name is deceiving, UDP is the default) and hping to be invaluable in tracking down issues like these that are obviously above the routing and into the transport layer.

I'm not sure how an IP transit provider (who should be providing routing/switching) screws up transport layer connections - looks like they are arbitrarily "managing" client data. Just my $0.02.

--Blake

It may also be related to QoS policy inside the carriers.
Some time ago I saw exactly the same symptoms with Verizon when SIP signaling
was sent marked as EF. Remarking it down to CS1 or CS3 (I don't remember exactly which) solved the problem.
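For anyone who wants to test whether the DSCP marking is the trigger, the marking can be set per-socket before sending a probe. A rough sketch (Python on Linux, placeholder server address; the TOS byte carries the DSCP in its top six bits, so TOS = DSCP << 2):

    import socket

    SIP_SERVER = ("203.0.113.10", 5060)   # placeholder: provider's SIP server
    PROBE = b"OPTIONS sip:probe@example.invalid SIP/2.0\r\n\r\n"

    # DSCP values of interest: EF = 46, CS3 = 24, CS1 = 8, best-effort = 0
    for name, dscp in (("EF", 46), ("CS3", 24), ("CS1", 8), ("BE", 0)):
        s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        s.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, dscp << 2)
        s.settimeout(3)
        s.sendto(PROBE, SIP_SERVER)
        try:
            s.recvfrom(2048)
            result = "reply received"
        except socket.timeout:
            result = "no reply"
        print(name, result)
        s.close()

If the EF-marked probe dies while the CS1/CS3/BE ones get answered, that matches the Verizon behavior described above.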

Michael

I just removed the route to our other provider and traffic is going
out Qwest again. The problem seems to be gone now. As others had
similar problems during the same period using Qwest, it must have been
some strange issue with Qwest.

With today's routers, all sorts of weird things can go wrong, especially if it's a hardware failure.

I had an I/O FE go out on a 7200 (which is as software-based as you get), which contributed to a lot of weirdness. It started when the IGP updated state information on the I/O card's FE, which shut down MPLS switching on the router, but the LSP itself was still considered up. It then showed up by freaking out the neighbor 7206 when we rebooted the failing one (we could no longer ping the loopback of the neighbor router, with or without using the LSP, but all IGP was up and you could ping/telnet/ssh to any other IP). Finally, the reboot itself revealed the true issue (it required multiple power cycles and a reset of the ATA card to even load IOS in an unstable state).

I don't even want to think what happens when a high end router's linecard starts to fail.

Jack