Network Connectivity... Dealing with Providers

Kuechel, Mark wrote:

Sounds like you are troubleshooting a VoIP issue several networks
removed from the actual user. First step is to get into their network
via telnet and start from there. Is this a jitter issue on some or all
calls? Has the customer done a traffic study on their own LAN to see if
there is some sort of congestion there? Pings from afar are not much use
for troubleshooting issues in depth; there are lots of postings on this.
Has the client's bandwidth utilization to their provider been looked at?
Give us more.

Pings and traceroutes weren't the only tests I've done. Here is my capacity when dealing with this client:

When something happens and I need to do some VoIP-related work (extension changes, etc.), I mainly log in via SSH from one of four points: a DSL connection (CTTel), Level3, GBLX, and Verio. When my lab's CTTel DSL connection fails I jump on a DS3 (GBLX); when that fails, I jump on to a machine in Texas, and most of the time one of them is going to let me in. Now, I have had failures from as few as two points to all four at sporadic times. So I do the obvious traceroutes, pings, etc. Now a provider can be quick to tell me "check your line" but come on now... four different lines are failing to connect here. (This doesn't include the fact that if I can't get in... what makes you think voice data is getting in?)

So, for my testing, I'm doing a functional (it's fugly) test from all four locations to my client, and from my client to all four. My data is going to be a collection of ping tests, traceroute tests (tcptraceroute), bing tests, etc. I was hoping to get feedback on other tools. I have Radarping as well but don't feel like using it. I want to be able to leave something running 24x7 until Friday. I'd like it to be open source so the provider doesn't cry "your network voodoo tools don't count!". I want to be able to go back and say "Listen, these tools are industry standard tools from CAIDA (or elsewhere), and they're used by engineers all across the board. I've done a fair test and it's obviously coming from your network."

So to answer your bandwidth question, bandwidth (according to the provider) is under 50% capacity with "sporadic spikes", as their engineers have seen while I was on the phone with them. Sporadic means nothing to me. I have 63% packet loss, which means even if I was equipped with an OC768, the bandwidth means nothing if the packets aren't going through. "Here's your Lamborghini Murcielago, Sir. It does 200mph. Although from time to time you'll only do 126mph..." For internal traffic, I've put on QoS maps, but the same errors occur with or without them. It's not an issue of echoes; it's more a matter of calls to specific DIDs dropping or not going through, or the caller can hear but the receiver can't. All the while some lines work, others don't. Couple this with my Nagios test going bonkers: I configured Nagios to monitor from my client to Google, Yahoo, MSN and I can see loss from there to the outside world, so it's twofold. Short of my client running me over with his FX45, I'm even running out of patience with my client's provider.
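
For anyone wanting to log loss figures like the one above over several days, here is a minimal Python sketch that pulls the loss percentage out of a ping run. It assumes Linux iputils-style summary lines; the function name parse_ping_loss and the sample text are made up for illustration and are not output from the tests described here.

```python
import re

# Matches the summary line printed by iputils ping, for example:
# "100 packets transmitted, 37 received, 63% packet loss, time 99123ms"
_SUMMARY = re.compile(
    r"(\d+) packets transmitted, (\d+) (?:packets )?received,"
    r".*?([\d.]+)% packet loss"
)


def parse_ping_loss(output):
    """Return (sent, received, loss_percent) from ping output, or None
    if no summary line is found."""
    m = _SUMMARY.search(output)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2)), float(m.group(3))


# Hypothetical sample, only to show the expected input shape.
sample = (
    "--- 192.0.2.1 ping statistics ---\n"
    "100 packets transmitted, 37 received, 63% packet loss, time 99123ms\n"
)
```

Appending each result to a file with a timestamp and the name of the source connection would give a per-path loss history to show the provider. Other ping implementations print a different summary line, so the pattern would need adjusting.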

If you have Cisco routers on either end, use the built in SLA capability.
It will give you ongoing ability to trace latency, loss, jitter. It won't
tell you bandwidth, but will give you a set of metrics for traffic quality.
Do a full mesh between all your edge devices and it might help track where
in the middle your issues reside. The SLA tools are pretty standard to
Cisco devices and so should give you an edge in getting people to listen to
you.
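
To make the three metrics concrete, here is a rough Python sketch of how latency, loss and jitter can be derived from one run of probes. This is not how Cisco computes its SLA figures; it is a simplified illustration, and the function name summarize_probes and the jitter definition (mean absolute difference between consecutive replies, ignoring lost probes) are assumptions of this sketch.

```python
from statistics import mean


def summarize_probes(rtts_ms):
    """Summarize one probe run.

    rtts_ms holds one round-trip time in milliseconds per probe,
    or None where the probe got no reply.
    """
    if not rtts_ms:
        raise ValueError("need at least one probe")
    replies = [r for r in rtts_ms if r is not None]
    loss_pct = 100.0 * (len(rtts_ms) - len(replies)) / len(rtts_ms)
    latency = mean(replies) if replies else None
    # Jitter taken here as the mean absolute difference between
    # consecutive replies (a simplification; lost probes are skipped).
    diffs = [abs(b - a) for a, b in zip(replies, replies[1:])]
    jitter = mean(diffs) if diffs else None
    return {"loss_pct": loss_pct, "latency_ms": latency, "jitter_ms": jitter}
```

Running the same summary on every path of a full mesh, at the same interval, is what lets you compare paths and narrow down which segment the loss sits in.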

Ray,

   Do you have an example of accessing the SLA data via SNMP? I've just got interested in those things, I've found the OIDs required, but it's all a bit of a maze ... I could really use some jitter information in a couple of places right about now ...

                                                Neal
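
As a starting point for the SNMP question, a minimal Python sketch that wraps the net-snmp snmpget command. The OID is my assumption of rttMonLatestRttOperCompletionTime from CISCO-RTTMON-MIB, indexed by the SLA operation number; please verify it against the MIB for your IOS version. The function names, host, community and operation number are all hypothetical. The jitter values live in a separate table of the same MIB and are not shown here.

```python
import subprocess

# Assumed OID: rttMonLatestRttOperCompletionTime (CISCO-RTTMON-MIB),
# indexed by the SLA operation number. Verify before relying on it.
LATEST_RTT_OID = "1.3.6.1.4.1.9.9.42.1.2.10.1.1"


def snmpget_cmd(host, community, oper_index):
    """Build a net-snmp snmpget command line for one SLA operation."""
    oid = "%s.%d" % (LATEST_RTT_OID, oper_index)
    # -Oqv asks snmpget to print only the value, without OID or type.
    return ["snmpget", "-v2c", "-c", community, "-Oqv", host, oid]


def latest_rtt_ms(host, community, oper_index):
    """Run snmpget and return the latest completion time as an integer."""
    result = subprocess.run(
        snmpget_cmd(host, community, oper_index),
        capture_output=True, text=True, check=True,
    )
    return int(result.stdout.strip())
```

Calling latest_rtt_ms("192.0.2.1", "public", 10) would poll operation 10 on a router at that (placeholder) address, assuming net-snmp is installed and the router allows SNMP v2c reads.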

Ray Burkholder wrote:

I've been using Cricket along with GenDevConfig_2_0 from
http://www.acktomic.com/cricket/cricket-genRtrConfig.htm to collect and plot
Cisco SLA status. I have had to make some changes to their scripts to
accept some of Cisco's recent changes. I can get the changes posted in the
next day or two.

nealr wrote:

Ray,

  Do you have an example of accessing the SLA data via SNMP? I've just
got interested in those things, I've found the OIDs required, but it's
all a bit of a maze ... I could really use some jitter information in a
couple of places right about now ...

I seem to remember the thread
http://forums.cacti.net/about4136-0-asc-30.html
as being useful if you use cacti.

Ray Burkholder wrote:

If you have Cisco routers on either end, use the built in SLA capability.
It will give you ongoing ability to trace latency, loss, jitter. It won't
tell you bandwidth, but will give you a set of metrics for traffic quality.
Do a full mesh between all your edge devices and it might help track where
in the middle your issues reside. The SLA tools are pretty standard to
Cisco devices and so should give you an edge in getting people to listen to
you.
  

Thanks for all the responses. I wish I had Cisco on both ends; I would have configured auto-qos, but
I'm stuck on Adtran (client) and I believe Juniper (provider). Anyhow, for those who enquired, this is
what I am currently doing for my connectivity testing: (M = my connections, C = client)

M(GBLX) --> tcptraceroute && iperf && ping --> Client
M(LVLT) --> same as above --> Client
M(DSL) --> same as above --> Client
M(Verio) --> same as above --> Client

C --> bing (Google, MSN, *PROVIDER*) && tcptraceroute --> M(GBLX)
C --> tcptraceroute --> M(LVLT)
C --> tcptraceroute --> M(DSL)
C --> tcptraceroute --> M(Verio)

So far I have come across the following oddity I can't put my finger on:

# bing -P -D -c 25 -e 3 xxx.xxx.1.177 xxx.xxx.1.182

bing: packet (72 bytes) from unexpected host xxx.xxx.24.36
bing: packet (136 bytes) from unexpected host xxx.xxx.24.36
bing: packet (72 bytes) from unexpected host xxx.xxx.24.36

See a problem? xxx.xxx.24.36 is the provider's router two hops before the
CPE. I'm thinking, filtering? Maybe, I have no idea why xxx.xxx.24.36 is
getting in the mix of my packets. I have this scenario running every 15
minutes from all locations.
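
Since this runs every 15 minutes from several places, it may help to tally which hosts show up in those "unexpected host" lines per run. A minimal Python sketch, assuming the message format is exactly as in the bing output above; the function name count_unexpected is made up for this example.

```python
import re
from collections import Counter

# Line format copied from the bing output quoted above.
_UNEXPECTED = re.compile(
    r"^bing: packet \((\d+) bytes\) from unexpected host (\S+)$"
)


def count_unexpected(output):
    """Count 'unexpected host' lines per source address."""
    hits = Counter()
    for line in output.splitlines():
        m = _UNEXPECTED.match(line.strip())
        if m:
            hits[m.group(2)] += 1
    return hits


log = """bing: packet (72 bytes) from unexpected host xxx.xxx.24.36
bing: packet (136 bytes) from unexpected host xxx.xxx.24.36
bing: packet (72 bytes) from unexpected host xxx.xxx.24.36"""
```

Keeping these counts next to the loss figures for the same time slot would show whether the stray replies line up with the bad periods.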

Ray,

   Do you have an example of accessing the SLA data via SNMP?
I've just got interested in those things, I've found the OIDs
required, but it's all a bit of a maze ... I could really use
some jitter information in a couple of places right about now ...

A number of people have asked how I did the Cricket/SLA thing. I have a
description of the configuration at:

http://www.oneunified.net/blog/OpenSource/Debian/Monitoring/Cricket/installandconfig.article

On one of the systems I'm getting a Cricket error of:
"illegal attempt to update using time 1163791808 when last update time is
1163791808 (minimum one second step)"

I'm not sure if it affects other systems. I have to check. Anyway, once I
get this thing fixed, I think everything should be good to go.

Let me know if you have similar problems.

Ray
http://www.oneunified.net/blog/
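
For what it's worth, the error quoted above is the RRD update rule: a new sample needs a timestamp at least one second later than the previous update, and here both times are 1163791808. That could happen, for instance, if the same target gets updated twice within one second, though that is only a guess about this case. A minimal Python sketch of the rule, with hypothetical names (can_update, UpdateGuard) that are not part of Cricket or RRDtool:

```python
def can_update(last_update, new_time):
    """RRD-style rule: a new sample needs a timestamp at least one
    second after the previous update."""
    return new_time >= last_update + 1


class UpdateGuard:
    """Remember the last accepted timestamp per target and reject
    samples that would violate the one-second minimum step."""

    def __init__(self):
        self.last = {}

    def accept(self, target, ts):
        prev = self.last.get(target)
        if prev is not None and not can_update(prev, ts):
            return False  # same second or older: skip this sample
        self.last[target] = ts
        return True
```

A collector could call accept() before each update and drop (or log) rejected samples instead of letting the update fail.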