Intradomain Traffic Engineering

Hi All,
I’m a PhD student currently studying intra-domain traffic engineering, and I have two questions on which I would really like to hear opinions from network operators.
I’m experimenting with a prediction-based intra-domain traffic engineering technique. The technique uses historically observed traffic demand matrices to predict future traffic demands, and computes a routing that minimizes maximum link utilization (MLU) for those predicted demands.
I evaluate the performance of the technique using Abilene traffic traces collected at 5-minute intervals. The results show that when the model predicts the real traffic matrix accurately, the technique achieves close-to-optimal MLU. However, when the model mispredicts, the technique suffers very high MLU (as high as 140%).
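For concreteness, MLU can be sketched as follows: a toy 3-node topology with fixed single-path routing. All names and numbers here are made up for illustration, not taken from the Abilene traces.

```python
# Minimal sketch of maximum link utilization (MLU), assuming a toy
# 3-node topology and fixed single-path routing. Numbers are made up.

# Demand matrix: demand[(src, dst)] = Mb/s offered between node pairs.
demand = {("A", "B"): 400, ("A", "C"): 300, ("B", "C"): 200}

# Routing: each demand is pinned to one path, given as a list of links.
routes = {
    ("A", "B"): [("A", "B")],
    ("A", "C"): [("A", "B"), ("B", "C")],   # A reaches C via B
    ("B", "C"): [("B", "C")],
}

capacity = {("A", "B"): 1000, ("B", "C"): 1000}  # link capacities, Mb/s

# Accumulate load per link, then take the worst-case ratio.
load = {link: 0.0 for link in capacity}
for pair, mbps in demand.items():
    for link in routes[pair]:
        load[link] += mbps

mlu = max(load[link] / capacity[link] for link in capacity)
print(f"MLU = {mlu:.0%}")  # A-B carries 700, B-C carries 500 -> 70%
```

An MLU above 100%, as in the 140% case mentioned above, simply means the routing sends more traffic to some link than its capacity.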
Basically, I have the following two questions:

  1. In the traces I have, there exist several intervals with a huge, sudden increase of traffic on some links. The prediction model I use cannot predict those ‘big spikes’. Do these ‘big spikes’ really happen in operational networks, or are they merely measurement errors? If they really happen, is there a gradual ramp-up of traffic at a smaller time scale, say, on the order of tens of seconds? Or do these ‘big spikes’ occur very quickly, say, in a few seconds?
  2. I have the option to trade off average-case performance against a worst-case performance guarantee, but I don’t know which you consider more important. Are ISP networks currently optimized for worst-case or average-case performance? Is the trade-off between the two an appealing idea, or are ISP networks already doing it?
I really appreciate any feedback on the above two questions, and your help will be acknowledged in any publication about this work.

Thanks,
Edgar

(snip)
wrong prediction, the technique suffers very high MLU (as high as 140%).
    Basically, I have the following two questions:
    1. In the traces I have, there exist several intervals with a huge, sudden increase of traffic on some links. The prediction model I use cannot predict those 'big spikes'. Do these 'big spikes' really happen in operational networks, or are they merely measurement errors? If they really happen, is there a gradual ramp-up of traffic at a smaller time scale, say, on the order of tens of seconds? Or do these 'big spikes' occur very quickly, say, in a few seconds?

Nobody can predict them, so you build your network with excess capacity, both in aggregate headroom and on individual links. Here are several reasons for variation and unpredictability. This is not a comprehensive list, and I'm sure others will add to it.

- CNN or other major network coverage, including major advertising events: the Super Bowl, the Victoria's Secret show, etc. (10s of seconds)
- SQL Slammer / Code Red / Nimda or other fast-moving outbreaks (10s of seconds, maybe less; we saw SQL Slammer spread to many unmanaged colo customer machines within 2 seconds)
- depeering of two or more large networks, routing mistakes, or route flapping and the resulting dampening (a few seconds to 10s of seconds to hours)
- a major provider outage that moves flows to other paths (a few seconds to 10s of seconds)
- fiber cuts / regional power outages (a few seconds to 10s of seconds)
- significant events such as 9/11 and Katrina (a few seconds to many hours)

    2. I have the option to trade off average-case performance against a worst-case performance guarantee, but I don't know which you consider more important. Are ISP networks currently optimized for worst-case or average-case performance? Is the trade-off between the two an appealing idea, or are ISP networks already doing it?

Each ISP makes its own decisions based on its business needs, budget, and the SLAs promised to customers.

-Robert

Tellurian Networks - The Ultimate Internet Connection
http://www.tellurian.com | 888-TELLURIAN | 973-300-9211
"Well done is better than well said." - Benjamin Franklin

    1. In the traces I have, there exist several intervals with a huge, sudden increase of traffic on some links. The prediction model I use cannot predict those 'big spikes'. Do these 'big spikes' really happen in operational networks, or are they merely measurement errors? If they really happen, is there a gradual ramp-up of traffic at a smaller time scale, say, on the order of tens of seconds? Or do these 'big spikes' occur very quickly, say, in a few seconds?
    2. I have the option to trade off average-case performance against a worst-case performance guarantee, but I don't know which you consider more important. Are ISP networks currently optimized for worst-case or average-case performance? Is the trade-off between the two an appealing idea, or are ISP networks already doing it?

This email covers a lot of issues, perhaps it'll start a discussion.

I think the answer depends on how big a core you are talking about. Excluding local effects (the operator of the network bounces a link or loses a router, etc.), if you have a significantly large network I doubt you see many effects that shift traffic faster than tens of seconds (the upper bound on this statement is ~30 seconds).

For example, if you "lose" a BGP session, it may take more than 30 seconds for the router to notice. Once it realizes the session is gone, it may re-route traffic very rapidly, but it would still take a while (at least a few seconds for a local link, more for a backbone link) before that traffic really renormalizes. This has more to do with TCP noticing packet loss, backing off [only for the traffic that has been affected], and starting back up. It takes up to half a second just to *establish* a single TCP session on an average-latency link.
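A back-of-envelope sketch of why re-routed TCP traffic takes seconds to renormalize; the RTT, MSS, and target rate below are assumed round numbers, not measurements.

```python
import math

# Rough sketch: after loss or a re-route, TCP slow start roughly
# doubles the congestion window each RTT, so getting back to a target
# rate takes about log2(target window) round trips. All numbers here
# are assumptions for illustration only.
rtt_s = 0.1                 # assumed 100 ms round trip
mss_bytes = 1460            # typical maximum segment size
target_rate_bps = 100e6     # flow trying to get back to 100 Mb/s

# Window (in segments) needed to sustain the target rate:
#   rate = window * MSS * 8 / RTT
target_window = target_rate_bps * rtt_s / (mss_bytes * 8)

# Slow start doubles the window every RTT, starting from ~1 segment.
rounds = math.ceil(math.log2(target_window))
print(f"~{rounds} RTTs (~{rounds * rtt_s:.1f} s) to ramp back up")
```

Even under these optimistic assumptions (no further loss, no receive-window limit), renormalization is measured in seconds, which matches the "a few seconds" estimates above.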

So, the trick would be to discover that the traffic has gone away or gone wonky before the BGP session is dropped. This would allow your algorithm to back off until a new /normal/ has been established.
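One cheap way to notice traffic going wonky ahead of the routing protocol is a deviation check against an exponentially weighted moving average of per-link samples. The smoothing factor and the 3x threshold below are arbitrary assumptions, not operational values.

```python
# Sketch: flag samples that deviate sharply from an EWMA baseline.
# Alpha and the 3x threshold are arbitrary illustrative choices.
def spikes(samples, alpha=0.3, factor=3.0):
    """Return indices where a sample exceeds factor * EWMA-so-far."""
    ewma = samples[0]
    flagged = []
    for i, x in enumerate(samples[1:], start=1):
        if x > factor * ewma:
            flagged.append(i)   # spike: don't fold it into the baseline
        else:
            ewma = alpha * x + (1 - alpha) * ewma
    return flagged

# Toy link-utilization series (arbitrary units) with one abrupt spike.
series = [10, 11, 9, 12, 10, 55, 11, 10]
print(spikes(series))  # -> [5]
```

Note that the spike sample is deliberately excluded from the baseline update, so a sustained shift would keep being flagged until an operator (or the algorithm) accepts the new /normal/.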

However, talk of traffic engineering and maximum utilization always comes into vogue when folks want to squeeze more utilization out of their networks without really spending more money. IMO, the best time to use TE is when customer links to your network approach your maximum core speed [relative here... there is the /core/ in your datacenter/POP, and there is the /core/ that is your network up to the point where packets get handed off, on average]. Often this limit on the operator's core is technology-imposed (though budgetary concerns get in there too).

I don't think the technology really exists at a scalable level to operate for the worst-case scenario, despite what some people may say. Our traffic and link measurement tools almost all report averages... and "spot" checks are of only marginal value. I would suggest that this is because of the nature of TCP. If the Internet were UDP-based, there would be a *lot* more flash-traffic problems. So those who carry a high proportion of UDP traffic (media streamers, DNS hosts, etc.) will have a very different experience.

I'm not the first person to say it, and I can't remember where I first heard it... but I'd suggest that the core is not where TE has the best benefits. Cores by their nature need to be overengineered. You have very little flexibility because the demands on them are wide [they need to handle UDP and TCP, and both latency-sensitive and latency-tolerant applications, with aplomb].

TE belongs to the customer or the non-backbone ISP. If one were to start an ISP where all residential customer connections were 1 Gb/s, it could conceivably have thousands of customers operating without needing 200 Gb/s of uplink [assuming that were really feasible for a network with very little traffic terminating on it]. By using TE I could shape my peak traffic needs (MLU) to approach my average, which would make me a much more desirable customer to sell transit to.
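The shape-peak-toward-average point can be sketched as a simple deferred-excess shaper. The sample series and the 120%-of-average cap are made-up numbers; a real shaper would queue or drop excess with bounded delay rather than defer it indefinitely.

```python
# Sketch: cap transmission at 120% of the series average; excess is
# deferred to later intervals. Series and cap are toy numbers.
samples = [20, 30, 25, 90, 22, 28, 24, 95]   # Mb/s per interval (toy)

avg = sum(samples) / len(samples)
cap = 1.2 * avg                              # shape to 120% of average

shaped = []
backlog = 0.0
for x in samples:
    sendable = min(x + backlog, cap)         # excess deferred, not lost
    backlog = x + backlog - sendable
    shaped.append(sendable)

print(f"unshaped peak/avg = {max(samples) / avg:.2f}")
print(f"shaped   peak/avg = {max(shaped) / avg:.2f}")
```

The unshaped series peaks at more than twice its average, while the shaped series never exceeds the cap; that flattened profile is exactly what makes a customer cheap to provision for.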

TE, MLU, and other such concerns, while best understood by core operators, aren't understood by customers. Core operators may eventually need to push these concerns onto customers if backbone link speeds do not stay far above end-user connection speeds. [On an ICB basis, they already are: whenever you want to buy a few OC-48s in a single POP, or an OC-192 customer connection, someone is always going to ask you what your traffic looks like and when.] This would be easiest to push through by providing differential pricing. Enforcement and analysis of *what* constitutes a desirable traffic pattern, and what financial value it provides, is where we are largely lacking today. Since customers know their traffic and their needs better than a core operator does, they would be much better at enforcing traffic flows/engineering. This is better than a core that optimizes for its own link utilization instead of one that just tries to stay as empty as possible for as long as possible.

This is way early in the day for me, so this may not make any sense.

YMMV,

Deepak Jain
AiNET

Hi Deepak,
Thanks a lot for your opinions!
In particular, your idea that TE may be more appropriate for stub networks is very interesting. Is it common practice for large transit ISPs to determine pricing based on ‘what the customer’s traffic looks like and when’? And do transit ISPs actively use traffic regulation (rate control and/or traffic shaping) to control incoming traffic, or do they simply use a pricing scheme based on the traffic pattern, such as 95th-percentile charging?
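For reference, the 95th-percentile ("burstable") charging mentioned above is typically computed like this; the series here is a toy example, while real billing uses a month of 5-minute samples.

```python
# Common 95th-percentile billing: collect 5-minute usage samples,
# sort them, discard the top 5%, and bill at the highest remaining
# sample. The sample series below is a toy example.
def percentile_95(samples_mbps):
    ordered = sorted(samples_mbps)
    # Index of the 95th-percentile sample (top 5% of samples ignored).
    k = max(0, int(len(ordered) * 0.95) - 1)
    return ordered[k]

# 20 toy samples: mostly ~10 Mb/s, with one short burst to 400 Mb/s.
samples = [10] * 18 + [50, 400]
print(percentile_95(samples))  # the 400 Mb/s burst is not billed
```

Because short bursts fall into the discarded top 5%, this scheme prices sustained peaks rather than momentary spikes, which is one reason it coexists with, rather than replaces, active shaping.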

Thanks,
Hao