Bottlenecks and link upgrades

At what point do commercial ISPs upgrade links in their backbone as well as peering and transit links that are congested? At 80% capacity? 90%? 95%?

Thanks,
Hank

Caveat: The views expressed above are solely my own and do not express the views or opinions of my employer

I've worked for employers where the policy has been anywhere from 50%
to 80%, and I know that isn't the complete range. Most do not subscribe
to any single simple rule but act more tactically.

Personally, if the link is in a growth market, you should upgrade
really early; 50% already seems late, and the cost is negligible if you
anticipate growth to continue. If it's not a growth market, the cost
may no longer be negligible.

Sometimes networks congest their edge interfaces strategically due to
poor incentives: a wholesale arm with comparatively irrelevant revenue
might see some benefit from strategic congestion while significantly
hurting the money-printing mobile arm, reducing the company-wide bottom
line even as it improves the wholesale arm's bottom line.

We start the process at 50% utilization, and work toward completing the upgrade by 70% utilization. The period between 50% and 70% is just internal paperwork.

Mark.

Personally, if the link is in a growth market, you should upgrade
really early; 50% already seems late, and the cost is negligible if you
anticipate growth to continue. If it's not a growth market, the cost
may no longer be negligible.

The problem you have is defining "what is a growth market", especially
over time as it stabilizes and sees new entrants, while growth moves
into a phase where you need massive scale to keep playing.

You then shift from a "sales are guaranteed Day 1" model to a "build it
and hope for the best" one. Many commercial teams get fearful at that
point, because of the temptation to link capacity to guaranteed sales.

Sometimes networks congest their edge interfaces strategically due to
poor incentives: a wholesale arm with comparatively irrelevant revenue
might see some benefit from strategic congestion while significantly
hurting the money-printing mobile arm, reducing the company-wide bottom
line even as it improves the wholesale arm's bottom line.

I know a few :-).

Mark.

Just my curiosity: may I ask how we can measure link capacity loading? What is meant by a 50%, 70%, or 90% capacity loading? Is the load sampled and measured instantaneously, or averaged over a certain period of time (granularity)?

These questions have bothered me for a long time, and I don't know if it's appropriate to ask them here. I take care of radio access network performance at work, and have found many unknowns in the transport network.

Thanks and best regards,
Taichi

When I worked for an ISP, it was about 70%; not sure if that is the case with others.

For this, we look at simple 5-minute based SNMP data over the period.
Nothing too fancy. It's stable.

Mark.

Why upgrade when you can legislate the problem instead.

Charter tries to convince FCC that broadband customers want data caps.

https://arstechnica.com/tech-policy/2020/08/charter-tries-to-convince-fcc-that-broadband-customers-want-data-caps/

Ted

Taichi writes:

Just my curiosity: may I ask how we can measure link capacity loading?
What is meant by a 50%, 70%, or 90% capacity loading? Is the load
sampled and measured instantaneously, or averaged over a certain period
of time (granularity)?

Very good question!

With tongue in cheek, one could say that measured instantaneously, the
load on a link is always either zero or 100% link rate...

ISPs typically sample link load in 5-minute intervals and look at graphs
that show load (at this 5-minute sampling resolution) over ~24 hours, or
longer-term graphs where the resolution has been "downsampled", where
downsampling usually smoothes out short-term peaks.
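
For context, here is a minimal sketch of how such a 5-minute utilization sample is typically derived from raw SNMP interface octet counters; the counter names follow the standard IF-MIB, but the polling itself and all the numbers below are illustrative only.

```python
# Illustrative only: derive one 5-minute utilization sample from two
# successive SNMP octet-counter readings (IF-MIB ifHCInOctets/ifHCOutOctets).
# How the readings are polled is left to whatever NMS is in use.

def utilization_pct(prev_octets: int, curr_octets: int,
                    interval_s: float, link_bps: float) -> float:
    """Average utilization (in %) over one polling interval."""
    delta = curr_octets - prev_octets        # assumes no counter wrap
    return 100.0 * (delta * 8) / (interval_s * link_bps)

# Example: a 10 Gb/s link that moved ~150 GB in 5 minutes
prev, curr = 1_000_000_000_000, 1_150_000_000_000    # octets
print(round(utilization_pct(prev, curr, 300, 10e9), 1))   # -> 40.0 (%)
```

The point to note is that each sample is an average over the whole interval, which is exactly why short bursts disappear in it.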

From my own experience, upgrade decisions are made by looking at those
graphs and checking whether peak traffic (possibly ignoring "spikes" :-))
crosses the threshold repeatedly.

At some places this might be codified in terms of percentiles, e.g. "the
Nth percentile of the M-minute utilization samples exceeds X% of link
capacity over a Y-day period". I doubt that anyone uses such rules to
automatically issue upgrade orders, but maybe to generate alerts like
"please check this link, we might want to upgrade it".

I'd be curious whether other operators have such alert rules, and what
N/M/X/Y they use - might well be different values for different kinds of
links.

We use alerts to tell us about links that hit a threshold, in our NMS.
But yes, this is based on 5-minute samples, not percentile data.

The alerts are somewhat redundant for any long-term planning. They are
more useful when problems happen out of the blue.

Mark.

With tongue in cheek, one could say that measured instantaneously, the
load on a link is always either zero or 100% link rate…

Actually, that's a first-class observation!

At what point do commercial ISPs upgrade links in their backbone as
well as peering and transit links that are congested? At 80%
capacity? 90%? 95%?

Hi,

Wouldn't it be better to measure basic performance indicators like
packet drop rates and queue sizes?

These days live video is needed and these parameters are essential to
the quality.

Queues are building up in milliseconds and people are averaging over
minutes to estimate quality.

If you are measuring queue delay with high-frequency one-way-delay
measurements, you would then be able to advise better on what the
consequences of a highly loaded link are.
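
To make the idea concrete, here is a minimal sketch of one common way to estimate queueing delay from high-frequency one-way-delay samples, namely as the excess over the minimum (baseline) delay in a window; the method Olav's project actually uses may well differ, and the numbers are made up.

```python
# Illustrative only: per-sample queueing delay estimated as one-way delay
# minus the minimum delay seen in the window (the propagation/serialization
# baseline). Real systems also have to deal with clock synchronization.

def queueing_delay_ms(owd_samples_ms: list[float]) -> list[float]:
    baseline = min(owd_samples_ms)
    return [round(d - baseline, 1) for d in owd_samples_ms]

owd = [12.1, 12.0, 12.3, 54.8, 433.0, 13.0]    # ms, made-up samples
print(queueing_delay_ms(owd))
# -> [0.1, 0.0, 0.3, 42.8, 421.0, 1.0]   (a 421 ms spike stands out clearly)
```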

We are running a research project on end-to-end quality, and the
enclosed image is yesterday's report on queue size (h_ddelay) in ms. It
shows stats on delays between some peers.

I would have looked at the trends on the involved links to see if an
upgrade is necessary; 421 ms might be too much if it happens often.

Best regards

Olav Kvittem

I'm confident everyone (even the cheapest CFO) knows the consequences of
congesting a link and choosing not to upgrade it.

Optical issues, dirty patch cords, faulty line cards, and wrong
configurations will very likely lead to packet loss. Link congestion
due to insufficient bandwidth will most certainly lead to packet loss.

It's great to monitor packet loss, latency, pps, etc. But packet loss
at 10% link utilization is not a foreign occurrence. No amount of
bandwidth upgrades will fix that.

Mark.

You could easily have 10% utilization and see packet loss due to insufficient bandwidth if you have egress << ingress and proportionally low buffering, e.g. UDP or iSCSI from a 40G/100G port with egress to a low-buffer 1G port.

This sort of thing is less likely in the imix world, but it can easily happen with high capacity CDN nodes injecting content where the receiving port is small and subject to bursty traffic.
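
As a rough back-of-the-envelope illustration of that mismatch (the buffer size and numbers below are made up, not any particular platform):

```python
# Illustrative arithmetic: a burst arriving at 100 Gb/s toward a 1 Gb/s
# egress port with a modest buffer. All numbers are made up.

ingress_bps = 100e9
egress_bps = 1e9
buffer_bytes = 12e6            # assumed ~12 MB of usable egress buffer

fill_rate_bps = ingress_bps - egress_bps        # net rate the queue grows at
time_to_fill_s = buffer_bytes * 8 / fill_rate_bps
print(f"buffer full after ~{time_to_fill_s * 1e3:.2f} ms of line-rate burst")
# -> roughly 0.97 ms; anything longer gets tail-dropped, even though the
#    1 Gb/s port's 5-minute average may sit at 10% utilization.
```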

Nick

Indeed.

The smaller the capacity gets toward egress, the closer you are getting
to an end-user, in most cases.

End-user link upgrades will always be the weakest link in the chain, as
the incentive is more on their side than on yours as their provider.
Your final egress port buffer sizing notwithstanding, of course.

Mark.

Hi Mark,

Just comments on your points below.

Wouldn't it be better to measure basic performance indicators like
packet drop rates and queue sizes?

These days live video is needed and these parameters are essential to
the quality.

Queues are building up in milliseconds and people are averaging over
minutes to estimate quality.

If you are measuring queue delay with high-frequency one-way-delay
measurements, you would then be able to advise better on what the
consequences of a highly loaded link are.

We are running a research project on end-to-end quality, and the
enclosed image is yesterday's report on queue size (h_ddelay) in ms. It
shows stats on delays between some peers.

I would have looked at the trends on the involved links to see if an
upgrade is necessary; 421 ms might be too much if it happens often.

I'm confident everyone (even the cheapest CFO) knows the consequences of
congesting a link and choosing not to upgrade it.

Optical issues, dirty patch cords, faulty line cards, and wrong
configurations will very likely lead to packet loss. Link congestion
due to insufficient bandwidth will most certainly lead to packet loss.

Sure, but I guess the loss rate depends on the nature of the traffic.

It's great to monitor packet loss, latency, pps, etc. But packet loss
at 10% link utilization is not a foreign occurrence. No amount of
bandwidth upgrades will fix that.

I guess that having more reports would support the judgements better.

A basic question is: what is the effect on the perceived quality for
the customers?

And the relation between that and the 5-minute load is not known to me.

Actually, one good indicator of the congestion loss rate is of course
the SNMP OutputDiscards counter.

Curves for queueing delay, link load and discard rate are surprisingly
different.
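
For what it's worth, here is a minimal sketch of turning that counter into a discard-rate curve (ifOutDiscards is the standard IF-MIB object; the polling itself and the example numbers are illustrative only):

```python
# Illustrative only: convert successive IF-MIB ifOutDiscards readings into a
# discards-per-second figure that can be plotted next to load and queue delay.

def discard_rate_pps(prev_discards: int, curr_discards: int,
                     interval_s: float) -> float:
    """Discarded packets per second over one polling interval."""
    return (curr_discards - prev_discards) / interval_s  # ignores counter wrap

# Example: 1,200 additional output discards over a 300-second (5-minute) interval
print(discard_rate_pps(10_000, 11_200, 300))   # -> 4.0 packets per second
```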

regards

Olav

Sure, but I guess the loss rate depends on the nature of the traffic.

Packet loss is packet loss.

Some applications are more sensitive to it (live video, live voice, for
example), while others are less so. However, packet loss always
manifests badly if left unchecked.

I guess that having more reports would support the judgements better.

For sure, yes. Any decent NMS can provide a number of data points so you
aren't shooting in the dark.

A basic question is: what is the effect on the perceived quality for
the customers?

Depends on the application.

Gamers tend to complain the most, so that's a great indicator.

Some customers that think bandwidth solves all problems will perceive
their inability to attain their advertised contract as a problem, if
packet loss is in the way.

Generally, other bad things, including unruly human beings :-).

And the relation between that and the 5-minute load is not known to me.

For troubleshooting, being able to have a tighter resolution is more
important. 5-minute averages are for day-to-day operations, and
long-term planning.

Actually, one good indicator of the congestion loss rate is of course
the SNMP OutputDiscards counter.

Curves for queueing delay, link load and discard rate are surprisingly
different.

Yes, that then gets into the guts of the router hardware and its design.

In such cases, your 100Gbps link is peaking and causing packet loss
because nobody realized that the forwarding chip behind it is only good
for 60Gbps, for example.

Mark.

Is it possible to monitor metrics such as maximum queue length in 5-minute intervals, and is anyone doing so? It might be a better metric than average load over 5-minute intervals.

Regards

Baldur

I suppose it would depend on whether your hardware has an OID for what you want to monitor.