mSQL Attack/Peering/OBGP/Optical exchange

My posted comment concerned whether this technology of layer-3 to
layer-1 integration/communication would have exacerbated the mSQL worm,
since the worm might have had more ability to grab larger peering pipes.

Were that to have been the case, it would probably also have
been responsible for some op-ex budgets being blown over the weekend,
both as a result of capacity that would otherwise have been
constrained automagically reprovisioning itself upward (ratcheting up
the capacity comes at a price, right?), and as a result of accounting
departments arguing over "you used it" versus "an attack caused an
automatic system to provision bandwidth that I didn't really want so I
don't want to pay for it."

It's not hard to imagine a lot of edge customers infected with the
latest flavor-of-the-week worm having conversations with their
upstream providers about 95th percentile billing real soon
now. Picture this aspect of the 1434/udp worm:

    It hits late on a Friday 1/24 (PST), in theory after lots of
    end-user IT shops have gone home for the weekend.

    January is a 31-day month - the 5% of samples tossed in a 95th
    percentile calculation represent a little over 37 hours of usage.

    Those IT shops have 37 hours to patch their systems, until Sunday
    (1/26) afternoon, and prevent their bill for January from being
    defined by 1434/udp worm usage. Oops, Sunday (1/26) was the Super
    Bowl. Missed the window. Systems get patched Monday (1/27).

    On Monday (2/3), lots of bills for January usage are going to be
    calculated. How many surprises will there be, and how much time in
    February will be devoted to Customer X disputing their January
    bill with Vendor Y?
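
    To make the arithmetic concrete, here is a rough back-of-the-envelope
    sketch in Python; the 5-minute sample interval and the traffic figures
    are illustrative assumptions, not anyone's actual billing data:

        # Back-of-the-envelope: 95th percentile billing for January (31 days)
        # with 5-minute samples. All numbers below are illustrative assumptions.

        SAMPLE_MINUTES = 5
        samples_per_month = 31 * 24 * 60 // SAMPLE_MINUTES        # 8928 samples
        discarded = int(samples_per_month * 0.05)                  # 446 samples
        print(discarded * SAMPLE_MINUTES / 60, "hours discarded")  # ~37.2 hours

        def percentile_95(samples_mbps):
            """Sort descending, discard the top 5%, bill at the next sample."""
            ordered = sorted(samples_mbps, reverse=True)
            return ordered[int(len(ordered) * 0.05)]

        # A boring month: flat 50 Mbps.
        baseline = [50.0] * samples_per_month

        # Same month, but the worm pins the pipe at 150 Mbps for 60 hours
        # (late Friday until Monday's patching) - longer than the 37-hour window.
        worm_samples = 60 * 60 // SAMPLE_MINUTES
        infected = [150.0] * worm_samples + [50.0] * (samples_per_month - worm_samples)

        print("billed without worm:", percentile_95(baseline), "Mbps")   # 50.0
        print("billed with worm:   ", percentile_95(infected), "Mbps")   # 150.0

    Sixty-odd hours of saturation overruns the 37-hour discard window by a
    wide margin, so it's the worm, not the normal load, that defines the
    January bill.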

Auto-provisioning technology is quite exciting, being able to
implement sweeping changes in many powered devices simultaneously with
one point'n'click. In the interior of a network, the spending
decisions that back the execution of that point'n'click are at least
all within one organization. In a customer/vendor relationship, I can
easily imagine the vendor wanting the customer to be able to run the
dollar meter faster with the greatest of ease (and possibly associate
some minimums with those increases, so that the click-down can't
follow the click-up too closely), and any billing disputes mercifully
only involve two parties.

In an exchange point scenario, though, where two networks presumably
have independently agreed to pay money to a third to connect via this
optical switch, we now have the case where one can affect the monthly
bill of the other by a point'n'click (again, I am making the
assumption that the additional value represented by increased capacity
will cause additional charges to be incurred - to two parties, now
that we're in an exchange point scenario). The kind of policies that
the control system now needs to implement undergo a dramatic shift in
order to implement the business rules of an exchange point - from
network R's perspective:

    network S may have a specific cap on how much additional
    capacity it can cause network R to buy from exchange point E

    network T may have priority over network S when contending for
    limited headroom (without E revealing to S that T has priority)

    a total cap for monthly spending of $N with exchange point E may
    be set, after which all requests for additional capacity will be
    denied

    (and just for humor value) auto-provisioned capacity can only be
    added in response to legitimate traffic increases
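
To give a feel for what those rules look like when reduced to code, here
is a minimal sketch of exchange point E evaluating a capacity request
from network S against network R's policy. Every name, cap, and dollar
figure below is invented for illustration, and the "legitimate traffic
only" rule is omitted, since nobody knows how to write it:

    # Hypothetical sketch: exchange point E evaluating a request from network S
    # for more capacity toward network R, against R's policy. All names, caps,
    # and dollar figures are invented for illustration.

    from dataclasses import dataclass

    @dataclass
    class PeerPolicy:
        cap_mbps: int    # most capacity this peer may cause R to buy
        priority: int    # lower number wins contention; never revealed to peers

    r_policy = {
        "S": PeerPolicy(cap_mbps=200, priority=2),
        "T": PeerPolicy(cap_mbps=500, priority=1),   # T beats S for headroom
    }
    r_monthly_spend_cap = 10_000   # dollars with exchange point E
    r_monthly_spend = 0
    price_per_mbps = 5             # dollars per provisioned Mbps per month

    def evaluate_request(peer, current_mbps, requested_mbps,
                         headroom_mbps, contenders):
        policy = r_policy[peer]
        cost = (requested_mbps - current_mbps) * price_per_mbps
        if requested_mbps > policy.cap_mbps:
            return False    # per-peer cap on what S can make R buy
        if r_monthly_spend + cost > r_monthly_spend_cap:
            return False    # total monthly spending cap with E
        if requested_mbps - current_mbps > headroom_mbps:
            return False    # no physical headroom left
        # Contention: any waiting peer with better priority wins, and E never
        # tells S that this is why the request was denied.
        if any(r_policy[c].priority < policy.priority for c in contenders):
            return False
        return True

    print(evaluate_request("S", 100, 150, headroom_mbps=100, contenders=[]))     # True
    print(evaluate_request("S", 100, 150, headroom_mbps=100, contenders=["T"]))  # False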

Billing disputes in the exchange point now involve three parties, and
become more complex as a result - this, in theory, results in the
technology not reducing op-ex but shifting it from the operations
department to the accounting and legal departments.

I get the picture that the control software can organize views
hierarchically. Exchange points aren't organized hierarchically,
though (well, the non-bell-shaped ones aren't), they're organized as a
mesh. The nice thing about Ethernet-based exchanges is that:

    they allow the structure of the network to mirror the structure of
    the organization (as networks have a habit of doing) easily

    the use of VLAN tags allows backplane slot capacity to be divided
    between peers without the hard boundaries per-peer that slicing
    and dicing SONET imposes but still within an overall cap that a)
    sets a boundary on the traffic engineering problem space on the
    interior side of the connected router and b) can be periodically
    reviewed

    the business rules that the technology has to implement are
    relatively clean, easy to understand, free of dependencies between
    customers of the exchange (beyond their initial agreement to
    exchange traffic with each other).
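
To put the soft-versus-hard partitioning point in toy form, a small
sketch; the port size, peer names, and traffic numbers are made up:

    # Toy model of hard SONET-style slices versus soft VLAN shares on one
    # Ethernet port. Port size, peer names, and traffic numbers are made up.

    PORT_CAP_MBPS = 1000
    hard_slices = {"peerA": 155, "peerB": 155, "peerC": 155}   # fixed circuits

    offered = {"peerA": 450, "peerB": 100, "peerC": 200}       # instantaneous load

    def carried_hard(offered):
        # Each peer is clipped to its own slice, idle capacity notwithstanding.
        return {p: min(load, hard_slices[p]) for p, load in offered.items()}

    def carried_soft(offered):
        # Everyone shares the one port cap; scale back proportionally only
        # when the aggregate exceeds it (one simple policy among many).
        total = sum(offered.values())
        if total <= PORT_CAP_MBPS:
            return dict(offered)
        scale = PORT_CAP_MBPS / total
        return {p: round(load * scale) for p, load in offered.items()}

    print(carried_hard(offered))   # peerA clipped to 155 despite an idle port
    print(carried_soft(offered))   # peerA carries all 450 under the shared cap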

Optical switch technology, and the control systems that cause the
technology to implement the business rules of an exchange point, have
a ways to go before they're ready for prime-time.

Stephen
VP, Eng.
PAIX

Stephen Stuart <stuart@tech.org> writes:

Optical switch technology, and the control systems that cause the
technology to implement the business rules of an exchange point, have
a ways to go before they're ready for prime-time.

We don't know anything we could do with 50ms provisioning without
making a disaster (c) smd 2001.

/vijay

Billing disputes in the exchange point now involve three parties, and
become more complex as a result - this, in theory, results in the
technology not reducing op-ex but shifting it from the operations
department to the accounting and legal departments.

If a proper rule-based system were implemented, wouldn't this account for the
issues? For example, implementation of an increase is only allowed by peer E
if the traffic has been a gradual increase and X throughput has been met for
T amount of time. Peer E would also have specific caps allotted for peer S
and T along with priority in granting the increases. In the case of the
worm, it is important to have a good traffic analyzer to recognize that the
increase in bandwidth has been too drastic to constitute a valid need. Of
course, traffic patterns do vary a bit over short periods of time, but the
average sustained throughput and the average peak do not increase rapidly.
What was seen with Sapphire should never be confused with normal traffic and
requests for bandwidth increments should be ignored by any automated system.
Of course, I realize that to implement the necessary rules would add a
complexity that could cost large sums of money due to mistakes.
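
A rough sketch of such a check; the window lengths and growth thresholds
below are placeholders, not recommendations:

    # Sketch of the "gradual and sustained" test described above. Window
    # lengths and thresholds are placeholders, not recommendations.

    def upgrade_allowed(history_mbps, threshold_mbps, sustain_samples,
                        max_growth_per_sample=0.10):
        """history_mbps: per-interval utilization samples, oldest first.
        Allow an automatic upgrade only if utilization stayed above
        threshold_mbps for the last sustain_samples samples AND no single
        sample-to-sample jump exceeded max_growth_per_sample (10% here)."""
        recent = history_mbps[-sustain_samples:]
        if len(recent) < sustain_samples:
            return False
        if any(s < threshold_mbps for s in recent):
            return False      # not sustained long enough at the new level
        for prev, cur in zip(recent, recent[1:]):
            if cur > prev * (1 + max_growth_per_sample):
                return False  # too drastic a jump - treat it as suspect
        return True

    # Gradual organic growth above the threshold: allowed.
    print(upgrade_allowed([82, 84, 85, 87, 88, 90], 80, 6))     # True
    # Sapphire-style step function: denied despite the sustained level.
    print(upgrade_allowed([85, 86, 85, 400, 410, 405], 80, 6))  # False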

-Jack

We don't know anything we could do with 50ms provisioning without
making a disaster (c) smd 2001.

indeed. but i sure would like one or two day provisioning, as
opposed to 18 months.

randy

If a proper rule-based system were implemented, wouldn't this account for the
issues? For example, implementation of an increase is only allowed by peer E
if the traffic has been a gradual increase and X throughput has been met for
T amount of time. Peer E would also have specific caps allotted for peer S
and T along with priority in granting the increases. In the case of the
worm, it is important to have a good traffic analyzer to recognize that the
increase in bandwidth has been too drastic to constitute a valid need.

If my regular Saturday morning traffic is 50 Mbps and a worm generates
another 100, then 150 Mbps is a valid need: being limited to my usual
50 Mbps would mean 67% packet loss, TCP sessions go into hibernation and
I end up with 49.9 Mbps of worm traffic.
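
Back-of-the-envelope for that, assuming the worm's UDP flood doesn't back
off under loss the way TCP does:

    # 50 Mbps of legitimate traffic plus 100 Mbps of worm, squeezed into a
    # 50 Mbps pipe. The worm's UDP flood is assumed not to back off; TCP does.
    legit, worm, pipe = 50.0, 100.0, 50.0
    loss = 1 - pipe / (legit + worm)
    print(f"loss if capped at {pipe:.0f} Mbps: {loss:.0%}")   # 67%
    # TCP retreats under that loss, the worm doesn't, so the pipe ends up
    # carrying almost nothing but worm traffic.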

Of course, traffic patterns do vary a bit over short periods of time, but the
average sustained throughput and the average peak do not increase rapidly.

Sometimes they do: Starr report, Mars probe, that kind of thing...

What was seen with Sapphire should never be confused with normal traffic and
requests for bandwidth increments should be ignored by any automated system.

So you're proposing the traffic is inspected very closely, and then
either it's rate-limited/priority-queued or more bandwidth is provisioned
automatically? That sure adds a lot of complexity, but I guess this is
the only way to do it right.

Of course, I realize that to implement the necessary rules would add a
complexity that could cost large sums of money due to mistakes.

Right.

If my regular Saturday morning traffic is 50 Mbps and a worm generates
another 100, then 150 Mbps is a valid need: being limited to my usual
50 Mbps would mean 67% packet loss, TCP sessions go into hibernation and
I end up with 49.9 Mbps of worm traffic.

But a ruleset should allow you as a business to make that decision.
Do you allow the worm's traffic increase to grow your circuit and cost
you money, or do you limit it as suspected illegitimate traffic?

> Of course, traffic patterns do vary a bit over short periods of time, but
> the average sustained throughput and the average peak do not increase
> rapidly.

Sometimes they do: Starr report, Mars probe, that kind of thing...

And what do you do to handle traffic bursts now? Do you immediately jump up
and scream, "I need a bigger pipe now! Step on it!"? You plan for what your
maximum capacity needs to be. The proposed system would still allow for
maximum caps, but there are times when that amount of bandwidth is
unnecessary for your particular network while another may need it at that
time. For planned bursts in throughput, you can increase the amount
manually. The automated system, however, should be configurable per peer to
allow for what the business wants to spend. If a business doesn't want
100 Mbps surprises, then they should be able to avoid them. On the flip side,
if the business does, then they can allow for it.

> What was seen with Sapphire should never be confused with normal traffic
> and requests for bandwidth increments should be ignored by any automated
> system.

So you're proposing the traffic is inspected very closely, and then
either its rate limited/priority queued or more bandwidth is provisioned
automatically? That sure adds a lot of complexity but I guess this is
the only way to do it right.

Traffic doesn't have to be inspected more closely than it is today. The
system just needs to keep historical records and averages. It knows what the
current utilization is and can quickly calculate the rate of increase. As
stated above, it should be the right of each peer to decide what they
consider an acceptable rate of increase before allowing an automatic upgrade
that will cost them money.

> Of course, I realize that to implement the necessary rules would add a
> complexity that could cost large sums of money due to mistakes.

Right.

Automation is rarely a simple process when it includes increasing
expenditures. The factors involved in the automation process would also have
to be worked into peering agreements, as both sides of a peering session
would have to agree on what they find acceptable between them.

-Jack