Your opinion on network analysis in the presence of uncertain events

Hi NANOG,

Networks evolve in uncertain environments. Links and devices randomly fail; external BGP announcements unpredictably appear/disappear leading to unforeseen traffic shifts; traffic demands vary, etc. Reasoning about network behaviors under such uncertainties is hard and yet essential to ensure Service Level Agreements.

We’re reaching out to the NANOG community as we (researchers) are trying to better understand the practical requirements behind “probabilistic” network reasoning. Some of our questions include: Are uncertain behaviors problematic? Do you care about such things at all? Are you already using tools to ensure the compliance of your network design under uncertainty? Are there any good?

We designed a short anonymous survey to collect operators’ answers. It is composed of 14 optional questions, most of which (13/14) are closed-ended. It should take less than 10 minutes to complete. We expect the findings to help the research community design more powerful network analysis tools. Among other things, we intend to present the aggregate results in a scientific article later this year.

It would be terrific if you could help us out!

Survey URL: https://goo.gl/forms/HdYNp3DkKkeEcexs2

Thanks much!

Laurent Vanbever, ETH Zürich

PS: It goes without saying that we would also be extremely grateful if you could forward this email to any operator you know and who may not read NANOG.

I took the survey. It’s short and sweet — well done!

I do have a question. You ask “Are there any good?” Any good what?

-mel

> I took the survey. It’s short and sweet — well done!

Thanks a lot, Mel! Highly appreciated!

> I do have a question. You ask “Are there any good?” Any good what?

I just meant whether existing network analysis tools were any good (or good enough) at reasoning about probabilistic behaviors that people care about (if any).

All the best,
Laurent

I know of none that take probabilities as inputs. Traditional network simulators, such as GNS3, let you model various failure modes, but probability seems squishy enough that I don’t see how it can be accurate, and thus helpful. It’s like that Dilbert cartoon where the pointy-haired boss asks for a schedule of all future unplanned outages :-)

https://dilbert.com/strip/1997-01-29

-mel

My understanding was that the tool would combine historic data with the MTBF data points from all components involved in a given link to estimate the likelihood of a link failure.

Heck, I imagine if one streamed a heap of data at an ML algorithm it might draw some very interesting conclusions indeed, i.e. surface unforeseen patterns across huge datasets while trying to understand the overall system (network) behaviour. Such a tool might teach us something new about our networks.

Next level would be recommendations on how to best address some of the potential pitfalls it found.

Maybe in closed systems like IP networks, with streaming telemetry from SFPs/NPUs/LC-CPUs/protocols/etc., we’ll be able to feed the analytics tool enough data to let it make fairly accurate predictions (unlike in weather or market prediction tools, where the dataset, or search space, is virtually endless and not all attributes are equally relevant).

adam

MTBF can’t be used alone to predict failure probability, because product mortality follows the infamous “bathtub curve”. Products are as likely to fail early in their lives as later in their lives. MTBF as a scalar value is just an average.
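A toy numeric illustration (all numbers invented): two components with the same 50,000-hour MTBF can have wildly different first-year failure probabilities depending on the shape of their lifetime distribution. A Weibull with shape k < 1 models infant mortality, k = 1 the constant-rate assumption an MTBF implies, and k > 1 wear-out:

```python
import math

# Weibull lifetime distributions with the same MTBF (mean life) but
# different shapes. Mean of Weibull(k, lam) is lam * Gamma(1 + 1/k),
# so pick the scale lam to hit the target MTBF for each shape.
MTBF = 50_000.0  # hours (hypothetical)

def weibull_cdf(t, k, lam):
    """P(failure by time t) for a Weibull(shape=k, scale=lam)."""
    return 1.0 - math.exp(-((t / lam) ** k))

def scale_for_mtbf(k, mtbf):
    """Scale parameter lam such that the mean lifetime equals mtbf."""
    return mtbf / math.gamma(1.0 + 1.0 / k)

one_year = 24.0 * 365  # 8760 hours of continuous operation

for k in (0.5, 1.0, 3.0):  # infant mortality, constant rate, wear-out
    lam = scale_for_mtbf(k, MTBF)
    p = weibull_cdf(one_year, k, lam)
    print(f"shape k={k}: P(fail in year 1) = {p:.1%}")
```

Same scalar MTBF in all three cases, yet the year-one failure probability ranges from well under 1% (wear-out) to over 40% (infant mortality) — which is exactly why the average alone can’t predict failure probability.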

-mel via cell

Hi Adam/Mel,

Thanks for chiming in!

> My understanding was that the tool would combine historic data with the MTBF data points from all components involved in a given link to estimate the likelihood of a link failure.

Yep, that could indeed be one way. The likelihood could also take the form of an interval in which you expect the true value to lie (again, based on historical data). This could be done both for link/device failures and for external inputs such as BGP announcements (to capture the likelihood that you receive a route for X in, say, NEWY). The tool would then run the deterministic routing protocols (setting aside ‘features’ such as prefer-oldest-route for a sec.) on these probabilistic inputs so as to infer the different possible forwarding outcomes and their relative probabilities. That’s roughly what we have in mind for now.

One can of course make the model more and more complex, e.g. by also taking into account data-plane status (to model gray failures). Intuitively though, the more complex the model, the harder the inference becomes.
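A minimal sketch of that inference loop (the topology, link costs, and failure probabilities below are all hypothetical): enumerate link up/down scenarios, run deterministic shortest-path routing in each, and sum scenario probabilities per resulting forwarding outcome:

```python
import heapq
from itertools import product
from collections import defaultdict

# Hypothetical topology: each link has a cost and an independent
# probability of being down. We exhaustively enumerate up/down
# scenarios, route deterministically in each, and aggregate.
LINKS = {  # (u, v): (cost, P(link is down))
    ("A", "B"): (1, 0.01),
    ("B", "D"): (1, 0.01),
    ("A", "C"): (2, 0.001),
    ("C", "D"): (2, 0.001),
}

def shortest_path(up_links, src, dst):
    """Dijkstra over the links that are up; returns a path tuple or None."""
    adj = defaultdict(list)
    for (u, v), cost in up_links.items():
        adj[u].append((v, cost))
        adj[v].append((u, cost))
    pq, seen = [(0, src, (src,))], set()
    while pq:
        dist, node, path = heapq.heappop(pq)
        if node == dst:
            return path
        if node in seen:
            continue
        seen.add(node)
        for nxt, cost in adj[node]:
            if nxt not in seen:
                heapq.heappush(pq, (dist + cost, nxt, path + (nxt,)))
    return None  # dst unreachable in this scenario

def forwarding_distribution(src, dst):
    """P(each forwarding outcome), by exhaustive scenario enumeration."""
    outcome_prob = defaultdict(float)
    links = list(LINKS)
    for states in product([True, False], repeat=len(links)):
        prob, up = 1.0, {}
        for link, is_up in zip(links, states):
            cost, p_down = LINKS[link]
            prob *= (1 - p_down) if is_up else p_down
            if is_up:
                up[link] = cost
        outcome_prob[shortest_path(up, src, dst)] += prob
    return dict(outcome_prob)

for path, p in sorted(forwarding_distribution("A", "D").items(),
                      key=lambda kv: -kv[1]):
    print(path, f"{p:.6f}")
```

Exhaustive enumeration obviously explodes with the number of uncertain links; part of the research question is doing this inference without enumerating every scenario.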

> Heck, I imagine if one streamed a heap of data at an ML algorithm it might draw some very interesting conclusions indeed, i.e. surface unforeseen patterns across huge datasets while trying to understand the overall system (network) behaviour. Such a tool might teach us something new about our networks.
> Next level would be recommendations on how to best address some of the potential pitfalls it found.

Yes. I believe some variants of this exist already. I’m not sure how much they are used in practice though. AFAICT, false positives/negatives are still a big problem. A non-trivial recommendation system will require a model of the network behavior that can somehow be inverted easily, which is probably something academics should spend some time on :-)

> Maybe in closed systems like IP networks, with streaming telemetry from SFPs/NPUs/LC-CPUs/protocols/etc., we’ll be able to feed the analytics tool enough data to let it make fairly accurate predictions (unlike in weather or market prediction tools, where the dataset, or search space, is virtually endless and not all attributes are equally relevant).

I’m with you. I also believe that better (even programmable) telemetry will unlock powerful analysis tools.

Best,
Laurent

PS: Thanks a lot to those who have already answered our survey! For those who haven’t yet: https://goo.gl/forms/HdYNp3DkKkeEcexs2 (it only takes a couple of minutes).

Hi Laurent,

I have filled out the survey; however, I would just like to request
that in the future you don't use a URL shortener like goo.gl. Many
people don't like those because we can't see where you're sending us
until we click the link. Some people also block them because they are
a security issue (our corporate proxy does; I have to drop off the VPN
or use a URL expander to retrieve the original URL).

Also, have you seen Batfish? It looks like you guys want to write a
tool that has some overlap with Batfish. Batfish can ingest the
configs and answer questions like "can host A reach host B?", "will
prefix advertisement P from host A be filtered/accepted by host B?",
"if I ping from this source IP, who has a return route and can
respond?", etc.

Kind regards,
James.

From: Mel Beckman <mel@beckman.org>
Sent: Wednesday, January 16, 2019 9:21 PM

> MTBF can’t be used alone to predict failure probability, because product
> mortality follows the infamous “bathtub curve”. Products are as likely to fail
> early in their lives as later in their lives. MTBF as a scalar value is just an
> average.

Yes, very good point. However, that's where the historical data should come to the rescue, helping bend the flat MTBF line into the expected "bathtub curve".
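One hedged sketch of that idea (the failure records below are entirely invented): bucket historical component ages at failure against the units still at risk in each age range, which yields an empirical, age-dependent failure rate instead of the single scalar an MTBF gives.

```python
from collections import Counter

# Hypothetical field data: age (months) of each unit when it failed,
# plus units still in service (right-censored) with their current ages.
failure_ages = [1, 1, 2, 2, 3, 14, 15, 36, 37, 38, 38, 39]
censored_ages = [24] * 30  # 30 units still alive at 24 months

BUCKET = 12  # months per age bucket

def empirical_hazard(failures, censored, bucket):
    """Failures per unit-at-risk in each age bucket (crude life table)."""
    fail_by_bucket = Counter(a // bucket for a in failures)
    rates = {}
    for b in range(max(fail_by_bucket) + 1):
        # Units at risk: anything (failed or still alive) that survived
        # into this bucket, i.e. whose age reached the bucket start.
        at_risk = sum(1 for a in failures + censored if a >= b * bucket)
        if at_risk:
            rates[b] = fail_by_bucket.get(b, 0) / at_risk
    return rates

for b, r in empirical_hazard(failure_ages, censored_ages, BUCKET).items():
    print(f"age {b * BUCKET}-{(b + 1) * BUCKET - 1} months: "
          f"{r:.3f} failures/unit")
```

With data shaped like the above, the per-bucket rate comes out high in the first year, low in the middle, and high again at end of life, i.e. the bathtub shape that a scalar MTBF averages away.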

adam