Over the last 48 hours we have been getting a lot of alerts about customers' phones losing their registrations to us. All the complaints are coming from customers on VZ Fios in the NYC area. Is anyone else seeing anything strange going on?
While you are diagnosing, you might also check that SIP ALG is disabled on all of their routers.
Thought of that. Customers have their own CPE. So far the only thing in common is that it’s NTT -> VZ. Here is what I have found so far looking at two Polycom phones using non-standard ports (i.e. not 5060):
- PhoneA tries to register multiple extensions, and for each request we send a 401. We expect to get back a REGISTER request with the nonce, but we don’t. This happens for a while and then magically it starts working.
- PhoneB tries to register at the same time as PhoneA and has no issues.
At first I thought it might be something to do with the SIP Call-ID, but I ruled that out, since within the same SIP dialog it was not working and then it started. The problem also seems to be per phone: each phone is behind NAT, and its traffic is coming from a different NAT’d port. It seems like some device in the middle is randomly dropping traffic on specific sessions.
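For anyone who wants to check their own captures for the same pattern, here is a rough sketch of the check described above: find each 401 we sent that was never followed by a REGISTER carrying credentials on the same Call-ID. The `SipEvent` record, the field names, and the 5-second window are all assumptions made for illustration; you would have to fill the events in from your own pcap or SIP logs.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class SipEvent:
    """One SIP message seen on the registrar side (hypothetical record)."""
    ts: float              # capture timestamp, seconds
    peer: str              # phone's NAT'd "ip:port" as seen by the registrar
    call_id: str           # SIP Call-ID header value
    kind: str              # "REGISTER" (from phone) or "401" (sent to phone)
    has_auth: bool = False # REGISTER carried an Authorization header


def unanswered_challenges(events: List[SipEvent], window: float = 5.0) -> List[SipEvent]:
    """Return 401s not followed within `window` seconds by an
    authenticated REGISTER on the same Call-ID."""
    ordered = sorted(events, key=lambda e: e.ts)
    missing = []
    for i, ev in enumerate(ordered):
        if ev.kind != "401":
            continue
        answered = any(
            later.kind == "REGISTER"
            and later.has_auth
            and later.call_id == ev.call_id
            and later.ts - ev.ts <= window
            for later in ordered[i + 1:]
        )
        if not answered:
            missing.append(ev)
    return missing


def unanswered_by_peer(events: List[SipEvent], window: float = 5.0) -> Dict[str, int]:
    """Count unanswered 401s per NAT'd address:port."""
    return dict(Counter(ev.peer for ev in unanswered_challenges(events, window)))


# Made-up sample data: PhoneA misses one challenge, PhoneB does not.
sample = [
    SipEvent(0.0, "198.51.100.10:1024", "a1", "REGISTER"),
    SipEvent(0.1, "198.51.100.10:1024", "a1", "401"),
    SipEvent(30.0, "198.51.100.10:1024", "a1", "REGISTER"),
    SipEvent(30.1, "198.51.100.10:1024", "a1", "401"),
    SipEvent(30.2, "198.51.100.10:1024", "a1", "REGISTER", has_auth=True),
    SipEvent(0.0, "198.51.100.20:2048", "b1", "REGISTER"),
    SipEvent(0.1, "198.51.100.20:2048", "b1", "401"),
    SipEvent(0.2, "198.51.100.20:2048", "b1", "REGISTER", has_auth=True),
]
print(unanswered_by_peer(sample))
```

If the counts cluster on particular NAT'd ports rather than on whole customer IPs, that would be consistent with something in the path dropping traffic per session rather than per host.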
Are you using TLS encrypted SIP or just plain ol' cleartext?
If it's encrypted, I'd look at the possibility of an MTU/MSS issue somewhere along the path.
I’m seeing the same sort of thing. Polycom phones. Multiple customers getting to me from Verizon in NYC area. I’m seeing phones register for a while, then drop off, then I see them trying to re-reg resulting in your 401 below.
Call me. 212 497 8015. Let’s look at this.
FYI: More than one person reached out to me off-list. The issue is clearly with VZ. The others ran traces and NTT was not in the mix. The only common denominator was 401 SIP packets hitting VZ Fios IPs in the NY area.
This matches my experience with running SIP on networks. Slowly over the years it became more unreliable as more “helper” ALGs appeared in the path.
Eventually we moved some devices off 5060 to alleviate the problem.
In our case we are not using 5060. The issue seems exclusive to VZ.
Can someone try to recreate the problem with TCP/5060? Or run an iperf
test on equivalent ports with both UDP and TCP, to determine whether the
problem is specific to UDP.
Most networks apply some form of limits even to transit traffic, and UDP
is the L4 protocol most commonly subject to policers.
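If iperf is not handy on both ends, a minimal sketch of the same comparison in Python could look like the following. It assumes you control a host on the far side that runs the small UDP echo loop and has a TCP listener on the port you care about; the function names, probe counts, and timeouts are all made up for illustration, and the loopback demo only proves that the probes themselves work.

```python
import socket
import threading
from typing import Tuple


def udp_echo_server(sock: socket.socket, stop: threading.Event) -> None:
    """Echo every datagram back to its sender until `stop` is set."""
    sock.settimeout(0.2)
    while not stop.is_set():
        try:
            data, addr = sock.recvfrom(2048)
        except socket.timeout:
            continue
        sock.sendto(data, addr)


def udp_probe(host: str, port: int, count: int = 20, timeout: float = 1.0) -> int:
    """Send `count` numbered datagrams; return how many echoes came back."""
    received = 0
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(timeout)
        for seq in range(count):
            payload = f"probe-{seq}".encode()
            s.sendto(payload, (host, port))
            try:
                while True:  # skip stale echoes from earlier probes
                    data, _ = s.recvfrom(2048)
                    if data == payload:
                        received += 1
                        break
            except socket.timeout:
                pass
    return received


def tcp_probe(host: str, port: int, count: int = 20, timeout: float = 1.0) -> int:
    """Attempt `count` TCP connects; return how many succeeded."""
    ok = 0
    for _ in range(count):
        try:
            with socket.create_connection((host, port), timeout=timeout):
                ok += 1
        except OSError:
            pass
    return ok


def loopback_demo(count: int = 5) -> Tuple[int, int]:
    """Run both probes against local listeners; returns (udp_ok, tcp_ok)."""
    stop = threading.Event()
    udp_srv = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    udp_srv.bind(("127.0.0.1", 0))
    worker = threading.Thread(target=udp_echo_server, args=(udp_srv, stop), daemon=True)
    worker.start()
    tcp_srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    tcp_srv.bind(("127.0.0.1", 0))
    tcp_srv.listen(16)
    try:
        udp_ok = udp_probe("127.0.0.1", udp_srv.getsockname()[1], count)
        tcp_ok = tcp_probe("127.0.0.1", tcp_srv.getsockname()[1], count)
    finally:
        stop.set()
        worker.join()
        udp_srv.close()
        tcp_srv.close()
    return udp_ok, tcp_ok
```

Run the probes from behind an affected Fios connection and from a known-good one, against the same far-end host and port. If UDP replies go missing while TCP connects all succeed, that points at UDP handling; if both lose traffic, it is probably not protocol specific.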
It’s not strictly UDP. I spoke with someone yesterday who was reproducing it with curl.