Phone networks struggle in Hurricane Katrina's wake

Fergie · August 30, 2005, 11:23pm

Me? I personally would never trade my POTS for VoIP...

- ferg

"In this age of cheap commoditized consumer electronics and
advanced mobile technology, why can't all the people of a city make
contact during an emergency?

Simple: it's too expensive.

Keep this in mind when trading in your POTS service for VoIP service
over the internet. Discounting the local loop which is often the same
in both cases, POTS is extremely reliable while VoIP over the public
internet, well, isn't. But apparently people that switch to VoIP
don't mind the reduced likelihood of being able to make calls during
the next large scale emergency.

Mark_Foster · August 31, 2005, 6:33am

Telecom New Zealand announced the other day their intention to do precisely this.

"In relatively short order we will replace the entire PSTN and be delivering all our services for customers over the IP network. That has the potential to reduce costs for customers and put a lot more control and flexibility in customers' hands, wherever they are - at home, at work or on the move."

From http://www.telecom-media.co.nz/releases_detail.asp?id=3223&page=index

I have to say I would usually agree with you - but it looks like I may not have a choice, going forward... The whole country to be migrated by 2012.

The whole idea of not having POTS to fall back on doesn't sit well with me - As part of AREC we prepare for a situation where all other means have failed. Suddenly it seems so much more likely... ?

Mark.

Mark_Foster · August 31, 2005, 8:04am

At the risk of replying to myself,

The below article is about the core, not the edge....
There's another article on Telecom's site relating to trials for edge IP
equipment. So my take on the NZ situation was a bit warped.

I do see a risk in the move toward IP systems at the edge. The core is a different story, at least to some degree.
'Twas also pointed out that British Telecom are heading down the same track as Telecom NZ, and their rollout should be completed earlier. I trust therefore that it has all been thought out in terms of robustness and the like.

As was pointed out to me offlist, when the PSTN falls over,
alternate-network based IP systems do have their merits - but I've always favoured the simple over the complex from a view of resilience. IP stuff has that many more layers to break?

Operationally, natural disasters and the like do reveal our
reliance on increasingly complex systems, with x number of additional
dependencies that can take the service down.

Of course, events like Katrina are fairly extreme, but in general, people should have some sort of fallback position. It's not a bad general rule.

Mark.

Iljitsch_van_Beijnum · August 31, 2005, 10:27am

There are two types of VoIP: voice over a private, tightly controlled IP network, and voice over the public internet. Now obviously the latter is a risky proposition, as it imports all the limitations of the internet into the voice service. Apart from the fact that many parts of the internet aren't all that robust (but some are), this is a problem because voice and IP react differently to congestion collapse, which invariably happens to some degree in big emergencies. With IP, delays and packet loss build up, slowing everything down, but allowing many protocols to continue to work at a reduced rate. With PSTN, initiating calls starts failing more and more, but when you get through, you generally get to talk because you get a reserved piece of the scarce bandwidth. With VoIP, packet loss and delay eventually make the service useless. So VoIP fails harder than either traditional IP apps or PSTN.
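The contrast between blocked call attempts on the PSTN and shared degradation on a packet network can be illustrated with a rough sketch. It assumes the classic Erlang B model for the circuit-switched side and a deliberately naive equal-share model for the packet side; the trunk size, the 64 kbit/s per-call rate, and the function names are illustrative assumptions, not figures from this thread.

```python
def erlang_b(offered_erlangs: float, circuits: int) -> float:
    """Probability that a new call attempt is blocked (Erlang B recursion)."""
    blocking = 1.0  # with zero circuits every attempt is blocked
    for m in range(1, circuits + 1):
        blocking = (offered_erlangs * blocking) / (m + offered_erlangs * blocking)
    return blocking


def packet_share(capacity_kbps: float, calls: int, rate_kbps: float) -> float:
    """Fraction of its required rate each call gets if all calls share the link equally.

    Naive assumption: no admission control and no QoS, so every call degrades together.
    """
    if calls <= 0:
        return 1.0
    return min(1.0, capacity_kbps / (calls * rate_kbps))


if __name__ == "__main__":
    circuits = 10               # hypothetical trunk group size
    capacity = circuits * 64.0  # the same capacity viewed as a 640 kbit/s link
    for load in (5, 10, 20, 40):
        # Circuit side: some attempts fail, established calls keep full quality.
        # Packet side: every call is admitted, but each gets a shrinking share.
        print(load, round(erlang_b(load, circuits), 3),
              round(packet_share(capacity, load, 64.0), 3))
```

In this toy model the circuit-switched side rejects a growing fraction of attempts while completed calls keep their reserved capacity, whereas the packet side admits everyone and starves all calls at once once demand exceeds the link.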

However, voice over a private network isn't entirely trouble-free, even though the private network can be designed such that congestion is a less fatal problem. And it does have the advantage that it allows IP routing protocols to route ongoing calls around failed parts of the network. On the other hand, in a circuit switched network you can do all kinds of interesting stuff (such as restarting all your control software) without breaking your sessions. We're only now seeing this in IP, and I think it's not really possible to reach the same levels with IP routing even in the long run. And then there is all this SIP stuff, which I'm (thankfully) only superficially familiar with, but never seemed particularly robust to me.

And voice over any kind of packet infrastructure introduces significant additional delays.

I think in 10 years or so we'll realize that TDM isn't so bad after all.

Chris_Gilbert · August 31, 2005, 11:51am

[snip]
The Telecommunications Industry Association (TIA) -- the people who
brought you the CAT standards for unshielded twisted pair cabling --
recently undertook a vast challenge to publish a definitive document
encompassing best practices and design considerations for every single
aspect of the modern data center.

The standard, entitled Telecommunications Infrastructure Standard for
Data Centers, TIA-942, weighs in at 148 pages, and covers everything
from site selection to rack mounting methods.
[/snip]

Link:
http://searchdatacenter.techtarget.com/originalContent/0,289142,sid80_gci1120625,00.html

Also:
http://www.tiaonline.org/media/press_releases/index.cfm?parelease=05-46

I seem to remember some folks asking questions about such a thing here
in the past... so I hope this isn't a duplicate of an old thread.

In any case, has anyone here looked over the documents and/or have any
comments on them?

It seems to me (however I have not yet read it) that something such as
this could be quite useful to IT students and others who don't have the
field experience.

michael.dillon1 · August 31, 2005, 1:03pm

With VoIP, packet loss and delay
eventually make the service useless. So VoIP fails harder than either
traditional IP apps or PSTN.

That is only in theory. In practice, during times of
impending congestion collapse, IP network operators
reconfigure the network to cope. For instance when
DDoS is detected, people set up ACLs and trigger black
hole routes. I think that it is possible for network
operators to define an analogous action plan to stave
off congestion collapse in an emergency situation.

I'm not sure exactly what that action plan would look
like, but I'm sure other list members will have plenty
of good ideas. If you'll recall, just a few days ago
people were talking about how they informally identified
IP connectivity to emergency response sites so that those
sites could be given priority in restoring service.

We just need to sit down and talk these things over with
our local emergency response organizations and learn
where network operators can become part of the solution.

On the other hand, in a circuit switched
network you can do all kinds of interesting stuff (such as restarting
all your control software) without breaking your sessions. We're only
now seeing this in IP, and I think it's not really possible to reach
the same levels with IP routing even in the long run.

MPLS may have the edge here because you can have backup paths
and fast reroute to keep traffic flowing if you have an
orderly plan for rebooting routers.

And voice over any kind of packet infrastructure introduces
significant additional delays.

Experience with the Inter-NOC phone system
http://www.pch.net/inoc-dba/
seems to suggest otherwise. Some kinds of packet
infrastructure only introduce insignificant delays.
It would be interesting to know if any of the academics
among us have studied the behavior of a SIP-based
VoIP network during various types of failure and
congestion scenarios. I suspect that problems will
be mostly found under certain specific sets of conditions
and if we know what those conditions are and how they
impact voice services, then we can plan actions to
mitigate the problems. One thing that IP network operators
can do is throw bandwidth at a problem by "shedding load",
i.e. killing traffic that is deemed non-essential. This
would free bandwidth for traffic that is deemed "important".
This has nothing to do with QoS per se because it can be
implemented in many ways up to and including unplugging
sites that generate non-essential traffic.
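The "shedding load" idea above can be sketched as a simple priority-ordered admission pass. This is only an illustration under stated assumptions: the `Flow` record, the priority scale (lower number means more important), and the example traffic names are hypothetical, and a real operator would act through ACLs, routing changes, or physically disconnecting sites rather than a script like this.

```python
from dataclasses import dataclass


@dataclass
class Flow:
    name: str
    priority: int       # hypothetical scale: lower number = more important
    demand_mbps: float


def shed_load(flows, capacity_mbps):
    """Walk flows in priority order and keep each one that still fits.

    Returns (kept, shed). Flows that do not fit in the remaining
    capacity are shed; later, smaller flows may still be admitted.
    """
    kept, shed = [], []
    used = 0.0
    for flow in sorted(flows, key=lambda f: f.priority):
        if used + flow.demand_mbps <= capacity_mbps:
            kept.append(flow)
            used += flow.demand_mbps
        else:
            shed.append(flow)
    return kept, shed


if __name__ == "__main__":
    traffic = [
        Flow("emergency-voice", 0, 20.0),
        Flow("bulk-transfer", 2, 90.0),
        Flow("general-web", 1, 50.0),
    ]
    kept, shed = shed_load(traffic, capacity_mbps=100.0)
    print("kept:", [f.name for f in kept])
    print("shed:", [f.name for f in shed])
```

The hard part in practice is the classification, that is, agreeing ahead of time with emergency response organizations on what counts as essential.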

All indications are that the next few decades will see
an increased number of emergency situations like the
tsunami, terror attacks in major cities, hurricanes,
earthquakes. We have gotten very good at running the
network through normal times, maybe we should now focus
on how to keep it running through times of extreme stress.

--Michael Dillon

Andy_Davidson · August 31, 2005, 7:19pm

Iljitsch van Beijnum wrote:

There are two types of VoIP: voice over a private, tightly controlled IP network, and voice over the public internet. Now obviously the latter is a risky proposition, as it imports all the limitations of the internet into the voice service.

I'm not so sure; someone cuts an ISDN-30 into our building and the sky falls down. Someone cuts some fibre carrying IP and life (and communications) carry on ..

Perhaps you've made a fair and good comment on the maturity of most off-the-shelf VoIP products or implementations. But the key, in my mind, is that VoIP across the internet, when done well, imports all of the opportunities of internet routing into voice service.

-a

Michael_Loftis · August 31, 2005, 7:42pm

<...>

On the other hand, in a circuit switched
network you can do all kinds of interesting stuff (such as restarting
all your control software) without breaking your sessions. We're only
now seeing this in IP, and I think it's not really possible to reach
the same levels with IP routing even in the long run.

MPLS may have the edge here because you can have backup paths
and fast reroute to keep traffic flowing if you have an
orderly plan for rebooting routers.

Which does us no good in the case that we're "close" to the edge device and need to reboot the control plane of a nearby router. To me it seems Juniper and Cisco are both making huge steps in understanding this is necessary technology they can 'borrow' from telcos. You've a highly intelligent, but fairly decoupled control plane, with a fairly dumb, but largely automatic 'forwarding' or 'circuit fabric' plane being directed by the control plane. If the control plane takes a nap, the bottom end continues what it was doing until something (control plane coming back online, backup control plane doing takeover) tells it otherwise. No, this isn't easily possible in most instances, even with just bare IP, and with NAT it becomes really difficult because of the large amount of intelligence (relatively speaking) required to handle NAT. I should clarify that when I say NAT I mean PNAT and application/protocol specific NAT that requires more than just simple packet mangling.
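The decoupling described above, where the forwarding plane keeps doing what it was last told while the control plane restarts, can be shown with a toy model. This is a minimal sketch and not any vendor's implementation: the class names are hypothetical, and the lookup is an exact-match dictionary rather than a real longest-prefix match.

```python
class ForwardingPlane:
    """Dumb but automatic: forwards using whatever table was last installed."""

    def __init__(self):
        self.table = {}

    def install(self, prefix, next_hop):
        self.table[prefix] = next_hop

    def forward(self, prefix):
        # Exact-match lookup for simplicity; returns None if no entry exists.
        return self.table.get(prefix)


class ControlPlane:
    """Decides on routes and pushes them down; can be stopped independently."""

    def __init__(self, forwarding_plane):
        self.forwarding_plane = forwarding_plane
        self.running = True

    def learn_route(self, prefix, next_hop):
        if not self.running:
            raise RuntimeError("control plane is down")
        self.forwarding_plane.install(prefix, next_hop)

    def stop(self):
        self.running = False

    def start(self):
        self.running = True


if __name__ == "__main__":
    fp = ForwardingPlane()
    cp = ControlPlane(fp)
    cp.learn_route("192.0.2.0/24", "peer-a")
    cp.stop()                           # control plane "takes a nap"
    print(fp.forward("192.0.2.0/24"))   # forwarding state is still in place
    cp.start()
    cp.learn_route("192.0.2.0/24", "peer-b")
    print(fp.forward("192.0.2.0/24"))
```

The stateful NAT case mentioned above is harder precisely because the per-flow state lives with the intelligence, so it does not fit this clean split.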

I think though, that eventually this will be commonplace, certainly in the core, and even really close to the edges. The M10i's approach this sort of resiliency. The T series and the larger M series also work like this... I think that the ONS' also are pushing on this (though admittedly aren't exactly IP...)

Anyway, the point is that if you're right up close to the edge, MPLS may not matter; towards the core, sure, where you're away from actual end connections and there's redundancy around you when you need to do a control plane restart.

There will always be upgrades. Further, there will always be other issues; however, in my mind at least, today's networks are far more resilient and faster to heal than they've been in the past, at least in IP.... PSTN...well...They're the reliability king, until something unexpected happens. There were even reports, on here I believe, of call routing issues during this outage: not capacity-type issues, but a simple lack of the system's ability to reconfigure and cope with loss of connectivity.

There are places for both PSTN and IP though.

Valdis_Kletnieks · August 31, 2005, 7:47pm

The crucial point being that "when done well" is something that you usually
can't evaluate until it's too late. And there's maturity level for more than
just products and implementations.

It's clearly possible to find telco engineers with 5/10/15 years experience in
running PSTN (might even find somebody with 40-50 years? :). It's possible to
find network engineers with lots of BGP experience. Where do you find a senior
engineer with 5+ years experience in enterprise-scale VoIP deployment?

Iljitsch_van_Beijnum · August 31, 2005, 9:07pm

There are two types of VoIP: voice over a private, tightly controlled IP network, and voice over the public internet. Now obviously the latter is a risky proposition, as it imports all the limitations of the internet into the voice service.

I'm not so sure; someone cuts an ISDN-30 into our building and the sky falls down.

Yes, single homing sucks.

Someone cuts some fibre carrying IP and life (and communications) carry on ..

You can get your ISDN 30 over redundant fibers too, that's not the problem.

Perhaps you've made a fair and good comment on the maturity of most off-the-shelf VoIP products or implementations. But the key, in my mind, is that VoIP across the internet, when done well, imports all of the opportunities of internet routing into voice service.

You say that as if it's a good thing.

I think in the long run, it makes sense to have end-to-end IP calls over the internet. However, this is not going to be as reliable as the PSTN for many years to come, because there is no inter-AS QoS deployment, routing protocols take their sweet time (180 seconds BGP timeout anyone?) and the internet is becoming fairly non-transparent because of all the goo people keep pouring into the machinery in the name of security and the like.
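To put the 180-second BGP timeout in voice terms, here is a back-of-the-envelope sketch. It assumes a silent failure that is only noticed when the hold timer expires, a keepalive interval of one third of the hold time, and a 20 ms voice packetization interval; the last two are assumptions chosen for illustration, not values given in this thread.

```python
def keepalive_interval_s(hold_time_s: float = 180.0) -> float:
    """Keepalive interval, assuming the common one-third-of-hold-time ratio."""
    return hold_time_s / 3


def packets_lost_during_hold(hold_time_s: float = 180.0,
                             packet_interval_ms: float = 20.0) -> int:
    """Voice packets per direction sent into a black hole while waiting for
    the hold timer to expire (worst case, no faster failure detection)."""
    return int(hold_time_s * 1000 / packet_interval_ms)


if __name__ == "__main__":
    print("keepalive every", keepalive_interval_s(), "s")
    print("packets lost per call direction:", packets_lost_during_hold())
```

Link-down events are usually detected far faster than this; the worst case applies when the peer disappears without the interface going down.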

However, using the public internet as a local loop is bad. Here in the Netherlands, the incumbent telco isn't allowed to lower its prices, but everyone (including the incumbent telco) can sell voice minutes to PSTN destinations over an IP "local" loop for any price they want. So basically they're forced to kill off the local leg of the PSTN to be able to compete on medium/long distance. This is not good. Not so long ago, when there was a failure in the long distance infrastructure, you could still make local calls. With the current "intelligent" networks that's not always the case anymore, but if the emergency number stuff is done properly, you can still call 911/112 when the long distance stuff is down. With inet local loop that will no longer be the case in most cities.

But then, people don't really care about this, as cell is in the exact same boat and huge numbers of people rely on just their cell phone and no longer have a fixed line (in Europe at least).

Deepak_Jain · September 1, 2005, 2:20am

Eesh... I grabbed a copy of this thing. In a cursory over-read... I am afraid if people (people defined by lim(clue) -> 0) start implementing datacenters by this guide. This would be a BRILLIANT document as the reading material for a college-level course. However, I'd be concerned if a CxO reads this and assumes they are great if the document has no conflicts with their implementation and they think they are in good shape.

Before I comment publicly on the issues I think I have with it, I want to verify that the points I raise aren't covered in some sort of disclaimer about being "out of scope" etc. Essentially 90% of the conversations folks have on nanog about datacenter designs are outside of what this advocates building (in a very cursory overread).

DJ

Robert_Boyle1 · September 1, 2005, 2:35am

We have already been asked about where our datacenters fit in with the TIA942 spec in several RFPs! It does cover some good topics, but it also leaves out the design and structure of many things which are far more likely to cause an outage than the copper and fiber physical plants.

-R

Tellurian Networks - The Ultimate Internet Connection
http://www.tellurian.com | 888-TELLURIAN | 973-300-9211
"Well done is better than well said." - Benjamin Franklin

Deepak_Jain · September 1, 2005, 3:07am

We have already been asked about where our datacenters fit in with the TIA942 spec in several RFPs! It does cover some good topics, but it also leaves out the design and structure of many things which are far more likely to cause an outage than the copper and fiber physical plants.

Yeah... and it introduces/codifies the concept of "tiers" of datacenters... Yet it's possible to have "tier 4" access to telecommunications while being a "tier 1" datacenter to operate those telecommunications, or vice versa.

What bothers me as significantly as this tier stuff is that redundancies, procedures, staffing, testing, policies are only mentioned, but not actually discussed (such as the whys, or how to test for the condition). They refer to specific technologies... like "RAID" as an application for a "tier 4" facility. They mention colocation and internet data centers, but don't discuss or even address how your facility's survivability is not fundamentally affected by non-carrier grade equipment being installed by customers -- yet, not surprisingly, the "tier 4" definition specifically talks about all the equipment installed in the datacenter.

There is lots of hand waving... like "beware the EPO".

And yet, it doesn't discuss how facilities like Exodus's NJ facility that had all the power outages, or Equinix/Ashburn and Equinix/Chicago, which presumably meet at least the Tier-3 specifications by design, still fail when they are implemented poorly. Availability of 99.99% and above has more to do with maintenance and procedures than with the equipment you installed initially.
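For a sense of scale, an availability percentage translates into a yearly downtime budget with simple arithmetic. The sketch below assumes a 365-day year and makes no claim about what any particular tier in the standard requires.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a 365-day year


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Minutes of downtime per year permitted by a given availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100.0)


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999):
        print(target, round(allowed_downtime_minutes(target), 2), "minutes/year")
```

At 99.99% the budget is under an hour a year, so a single botched maintenance window or mishandled procedure can consume it regardless of how redundant the installed equipment is.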

It's more of a document I'd expect to spend a ridiculous sum of money to have a consultant produce, not someone who should know better. Great college guide book to discuss "issues" though.

Deepak Jain
AiNET

Petri_Helenius · September 1, 2005, 5:48am

Deployable enterprise VoIP products existed in 1998. So it would be somebody who was there doing it back then? Goes 5+ with a margin.

Pete

michael.dillon1 · September 1, 2005, 9:58am

But then, people don't really care about this, as cell is in the
exact same boat and huge numbers of people rely on just their cell
phone and no longer have a fixed line (in Europe at least).

I have read accounts that suggest that cellphone subscribers
from New Orleans only have one way service. In other words,
if you left New Orleans with your cellphone then you can
make outgoing calls but no-one can call you. I don't know
how widespread this is, but knowing that there has to
be an SS7 switch in New Orleans directing those incoming
calls to your new location, I can imagine that loss of
such a switch would create problems.

A similar problem would be created if a web server relied
on DNS that was only hosted on servers in New Orleans.

--Michael Dillon

NANOG_Mail_List_Comm · September 1, 2005, 10:54am

Are there other documents/books that people would instead recommend? I
found this recently but I've not started reading it (it's on my safari
bookshelf):

Build the Best Data Center Facility for Your Business
By Douglas Alger
ISBN: 1-58705-182-6

also "The Practice of System and Network Administration" by Limoncelli and
Hogan has a few pointers as well but on a smaller scale and it feels a
little old at times.

JC_Dill2 · September 1, 2005, 4:41pm

It is sometimes the case in disasters that people inside can call out but people outside can't call in, because the circuits into the disaster area become overloaded. This holds especially when many people in the disaster area have no access to working phones, so those with working phones can easily get a free outbound circuit. Meanwhile, frantic friends and family clog the incoming circuits trying to reach phones that are out of service, or people who simply aren't near the phone and can't answer, and each attempted call still ties up a circuit.

I've had several reports that cell phone users who can't make *or* receive calls are successfully sending *and* receiving SMS. It could be that the problem is one of not enough cell channels and working phone circuits for all the phone calls people want to make, but that the SMS channel is not overloaded and thus SMS traffic can zip on thru (when the cell has power and can reach a working cell tower).

jc

Valdis_Kletnieks · September 1, 2005, 5:37pm

Yes, but I hear that both of the guys who actually *DEPLOYED* anything
enterprise-wide in 1998 are happily employed and not available.

Jay_Ashworth · September 2, 2005, 2:43am

Tom tells me he's prepping a second edition for late 96; anyone who
has comments on the first edition should codify them and ship them to
him now.

Cheers,
-- jra

michael.dillon1 · September 2, 2005, 10:57am

I've had several reports that cell phone users who can't make *or*
receive calls are successfully sending *and* receiving SMS. It could be
that the problem is one of not enough cell channels and working phone
circuits for all the phone calls people want to make, but that the SMS
channel is not overloaded and thus SMS traffic can zip on thru (when the
cell has power and can reach a working cell tower).

This was my personal experience during the July 7th terrorist
attacks in London. I couldn't make or receive voice calls
but SMS did get through both incoming and outgoing. However
the delivery of SMS messages was sometimes delayed by as
much as an hour.

SMS takes far less network bandwidth than voice calls.
Originally it was implemented as part of the control
network of GSM (rather like SS7) but I believe that most
carriers now simply use IP networks to carry their SMS
traffic.
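The bandwidth gap between SMS and voice can be made concrete with rough arithmetic. The sketch assumes the 140-byte SMS payload limit and the 13 kbit/s GSM full-rate codec, counts one direction of speech only, and ignores all signalling and framing overhead, so the result is an order-of-magnitude illustration rather than a measurement.

```python
SMS_PAYLOAD_BYTES = 140    # maximum user payload of a single SMS
GSM_FR_CODEC_BPS = 13_000  # GSM full-rate speech codec, one direction


def voice_call_bytes(seconds: float, codec_bps: int = GSM_FR_CODEC_BPS) -> float:
    """Speech payload carried in one direction of a call of the given length."""
    return seconds * codec_bps / 8


def sms_equivalents(seconds: float) -> float:
    """How many full-length SMS payloads fit in the same number of bytes."""
    return voice_call_bytes(seconds) / SMS_PAYLOAD_BYTES


if __name__ == "__main__":
    print(voice_call_bytes(60), "bytes for one minute of one-way speech")
    print(round(sms_equivalents(60)), "SMS payloads of equivalent size")
```

Unlike a voice call, a text message also tolerates being queued, which is consistent with the delayed but successful deliveries described above.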

By now it has become clear that the response to the
New Orleans disaster has been completely screwed up
because of lack of reliable communications in and out
of the city. There are tons of food, water, medical
supplies and personnel hung up on the edge of the city
because no-one seems to know what is needed, where
it is needed, how to get it there, etc.

--Michael Dillon