If you are running BIND 9.8 there is really no reason not to turn on
DNSSEC validation; then you won't have to worry about anycast routes
leaking from behind the great firewall.
dnssec-validation auto;
dnssec-lookaside auto;
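A quick way to confirm validation is actually happening once that is in
named.conf (a rough sketch assuming the dnspython library and a resolver
listening on 127.0.0.1; the test name is just an example of a signed zone):

  import dns.flags
  import dns.resolver

  # Ask the local resolver for a name in a signed zone and check the AD
  # (authenticated data) flag; it should be set once validation is on.
  res = dns.resolver.Resolver(configure=False)
  res.nameservers = ["127.0.0.1"]
  res.use_edns(0, dns.flags.DO, 1232)
  answer = res.resolve("isc.org", "A")
  print("validated:", bool(answer.response.flags & dns.flags.AD))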
Tony.
User Exercise: What happens when you enable integrity checking in an
application (e.g., 'dnssec-validation auto') and datapath manipulation
persists? Bonus points for analysis of implementation and deployment
behaviors and resulting systemic effects.
Network layer integrity techniques and secure routing infrastructure are
all that's going to fix this. In the interim, the ability to detect such
incidents at some rate faster than the speed of mailing lists would be
ideal.
-danny
User Exercise: What happens when you enable integrity checking in an
application (e.g., 'dnssec-validation auto') and datapath manipulation
persists? Bonus points for analysis of implementation and deployment
behaviors and resulting systemic effects.
i agree with danny here.
ignoring randy (and others) off-topic comments about hypocrisy, this
situation is fundamentally a situation of bad (or different) network
policy being applied outside of its scope. i would prefer that china
not censor the internet, sure. but i really require that china not
censor *my* internet when i'm not in china.
t
well, not to disagree - BUT.... the sole reason we have
BGP and use ASNs the way we do is to ensure/enforce local
policy. It is, after all, an AUTONOMOUS SYSTEM number.
One sets policy at its boundaries on what/how to accept/reject/modify
traffic crossing the boundary.
If you don't -like- the ASN policy - then don't use/traverse that
ASN.
and rPKI has the same problems as DNSSEC. lack of uniform use/implementation
is going to be a huge party - full of fun & games.
/bill
We used DNSMON data to analyse this event, and found an earlier leak on
29 and 30 September:
https://labs.ripe.net/Members/emileaben/f-root-route-leak-the-dnsmon-view
best regards,
Emile Aben
RIPE NCC
ignoring randy (and others) off-topic comments about hypocrisy
actually, if you had followed the thread in its sad detail, by that
point they were well into jingoism.
this situation is fundamentally a situation of bad (or different)
network policy being applied outside of its scope.
kink is gonna leak. rfc1918 is gonna leak. ula-foo is gonna leak.
pakistani kink is gonna leak. anycast 'local' cones are gonna leak.
chinese kink is gonna leak. american kink is gonna leak.
s/are gonna/has already/g
are people gonna stop doing kink? sadly, not likely. so all we are
left with is
Danny McPherson wrote:
Network layer integrity techniques and secure routing infrastructure
are all that's going to fix this.
and
Danny McPherson wrote:
In the interim, the ability to detect such incidents at some rate
faster than the speed of mailing lists would be ideal.
is not a lot of good unless you insert "and fix." watching train wrecks
is about as fun as reading pontification on nanog. qed
randy
In a message written on Mon, Oct 03, 2011 at 09:27:46AM -0400, Danny McPherson wrote:
User Exercise: What happens when you enable integrity checking in an
application (e.g., 'dnssec-validation auto') and datapath manipulation
persists? Bonus points for analysis of implementation and deployment
behaviors and resulting systemic effects.
I think this is a (to some on the list) cryptic way of asking "If
all your routes to the server go to someone masquerading, what
happens when you try to validate that data?" The question being
if you configure your nameserver to validate the root, but don't
get signed answers back, will your nameserver refuse to serve up any
data, effectively taking you and your users offline?
The answer should be no. This is part of why there are 13 root
servers. If a nameserver is told the root is signed and it gets
unsigned answers from one of the 13, it should ignore them and move
on. I do not off the top of my head know all the timeouts and
implementation-dependent behaviors, but also remember that an up
caching resolver will make approximately 1 query to the root per
day for _valid_ names, but many queries per day for invalid names.
Thus the impact to valid names should be minimal, even in the face
of longer timeouts.
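A rough sketch of that "ignore the bogus answer and move on" behavior,
assuming the dnspython library (the handful of root addresses below are
illustrative, not the full set of 13):

  import dns.exception
  import dns.message
  import dns.query
  import dns.rdatatype

  ROOTS = ["198.41.0.4", "192.33.4.12", "199.7.83.42"]  # a, c, l roots

  def signed_soa_from_root():
      q = dns.message.make_query(".", dns.rdatatype.SOA, want_dnssec=True)
      for addr in ROOTS:
          try:
              resp = dns.query.udp(q, addr, timeout=2)
          except dns.exception.Timeout:
              continue  # unreachable instance: try the next server
          if any(rr.rdtype == dns.rdatatype.RRSIG for rr in resp.answer):
              return addr, resp  # this root answered with signatures
          # unsigned answer from a "root": treat it as bogus and move on
      return None, None

A real validator does full signature validation rather than just checking
for RRSIGs, but the fallback pattern is the same: an extra query or two,
not an outage.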
Is there enough operational experience with DNSSEC? No. Can we
fix that by saying it's not good enough yet? No. Run it. The
people who write nameserver software are committed to fixing any
issues as quickly as possible, because it is our best way to secure
DNS.
Network layer integrity techniques and secure routing infrastructure are
all that's going to fix this. In the interim, the ability to detect such
incidents at some rate faster than the speed of mailing lists would be
ideal.
Network layer integrity and secure routing don't help the majority of
end users. At my house I can choose Comcast or AT&T service. They will
not run BGP with me, I could not apply RPKI, secure BGP, or any other
method to the connections. They may well do NXDOMAIN remapping on their
resolvers, or even try and transparently rewrite DNS answers. Indeed
some ISPs have even experimented with injecting data into port 80
traffic transparently!
Secure networks only help if the users have a choice, and choose to not
use "bad" networks. If you want to be able to connect at Starbucks, or
the airport, or even the conference room Wifi on a client's site you need
to assume it's a rogue network in the middle.
The only way for a user to know what they are getting is end to end
crypto. Period.
As for the speed of detection, it's either instantaneous (DNSSEC
validation fails), or it doesn't matter how long it is (minutes,
hours, days). The real problem is the time to resolve. It doesn't
matter if we can detect in seconds or minutes when it may take hours
to get the right people on the phone and resolve it. Consider this
weekend's activity; it happened on a weekend for both an operator
based in the US and a provider based in China, so you're dealing
with weekend staff and a 12 hour time difference.
If you want to ensure accuracy of data, you need DNSSEC, period.
If you want to ensure low latency access to the root, you need
multiple Anycasted instances because at any one point in time a
particular one may be "bad" (node near you down for maintenance,
routing issue, who knows) which is part of why there are 13 root
servers. Those two things together can make for resilience,
security and high performance.
Thus the impact to valid names should be minimal, even in the face
of longer timeouts.
If you're performing validation on a recursive name server (or
similar resolution process) expecting a signed response yet the
response you receive is either unsigned or doesn't validate
(i.e., bogus) you have to:
1) ask other authorities? how many? how frequently? impact?
2) consider implications on the _entire_ chain of trust?
3) tell the client something?
4) cache what (e.g., zone cut from who you asked)? how long?
5) other?
"minimal" is not what I was thinking..
Network layer integrity and secure routing don't help the majority of
end users. At my house I can choose Comcast or AT&T service. They will
not run BGP with me, I could not apply RPKI, secure BGP, or any other
method to the connections. They may well do NXDOMAIN remapping on their
resolvers, or even try and transparently rewrite DNS answers. Indeed
some ISPs have even experimented with injecting data into port 80
traffic transparently!
Secure networks only help if the users have a choice, and choose to not
use "bad" networks. If you want to be able to connect at Starbucks, or
the airport, or even the conference room Wifi on a client's site you need
to assume it's a rogue network in the middle.
The only way for a user to know what they are getting is end to end
crypto. Period.
I'm not sure how "end to end" crypto helps end users in the advent
of connectivity and *availability* issues resulting from routing
brokenness in an upstream network which they do not control.
"crypto", OTOH, depending on what it is and where in the stack it's
applied, might well align with my "network layer integrity"
assertion.
As for the speed of detection, it's either instantaneous (DNSSEC
validation fails), or it doesn't matter how long it is (minutes,
hours, days). The real problem is the time to resolve. It doesn't
matter if we can detect in seconds or minutes when it may take hours
to get the right people on the phone and resolve it. Consider this
weekend's activity; it happened on a weekend for both an operator
based in the US and a provider based in China, so you're dealing
with weekend staff and a 12 hour time difference.
If you want to ensure accuracy of data, you need DNSSEC, period.
If you want to ensure low latency access to the root, you need
multiple Anycasted instances because at any one point in time a
particular one may be "bad" (node near you down for maintenance,
routing issue, who knows) which is part of why there are 13 root
servers. Those two things together can make for resilience,
security and high performance.
You miss the point here Leo. If the operator of a network service
can't detect issues *when they occur* in the current system in some
automated manner, whether unintentional or malicious, they won't be
alerted, they certainly can't "fix" the problem, and the potential
exposure window can be significant.
Ideally, the trigger for the alert and detection function is more
mechanized than "notification by services consumer", and the network
service operators or other network operators aware of the issue have
some ability to institute reactive controls to surgically deal with
that particular issue, rather than being captive to the [s]lowest
common denominator of all involved parties, and dealing with
additional non-deterministic failures or exposure in the interim.
Back to my earlier point, for *resilience* network layer integrity
techniques and secure routing infrastructure are the only preventative
controls here, and necessarily to augment DNSSEC's authentication and
integrity functions at the application layer. Absent these, rapid
detection enabling reactive controls that mitigate the issue is
necessary.
-danny
Does ISC (or any other anycast root/*tld provider) have external
polling methods that can reliably tell when, as was in this case,
local-anycast-instances are made global? (or when the cone of silence
widens?)
Given that in the ISC case the hostname.bind query can tell you at
least the region + instance#, it seems plausible that some system of
systems could track current/changes in the mappings, no? and either
auto-action some 'fix' (SHUT DOWN THE IAD INSTANCE IT's ROGUE!) or at
least log and notify a hi-priority operations fixer.
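Something along these lines, perhaps (a sketch assuming dnspython; the
F-Root IPv4 address is the real one, but the polling interval and the
plain print() are just placeholders for real alerting):

  import time
  import dns.message
  import dns.query
  import dns.rdataclass
  import dns.rdatatype

  F_ROOT = "192.5.5.241"  # f.root-servers.net; poll the v6 address too

  def hostname_bind(addr):
      q = dns.message.make_query("hostname.bind", dns.rdatatype.TXT,
                                 dns.rdataclass.CH)
      resp = dns.query.udp(q, addr, timeout=2)
      for rrset in resp.answer:
          return rrset[0].strings[0].decode()
      return None

  last = None
  while True:
      try:
          ident = hostname_bind(F_ROOT)
      except Exception:
          ident = None  # treat a timeout as "unknown", not as a change
      if ident and last and ident != last:
          print("anycast instance changed: %s -> %s" % (last, ident))
      if ident:
          last = ident
      time.sleep(60)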
Given something like the unique-as work Verisign has been behind you'd
think monitoring route origins and logging 'interesting' changes could
accomplish this as well?
(I suppose i'm not prescribing solutions above, just wondering if
something like these is/could-be done feasibly)
-chris
In a message written on Mon, Oct 03, 2011 at 12:38:25PM -0400, Danny McPherson wrote:
1) ask other authorities? how many? how frequently? impact?
2) consider implications on the _entire_ chain of trust?
3) tell the client something?
4) cache what (e.g., zone cut from who you asked)? how long?
5) other?"minimal" is not what I was thinking..
I'm asking the BIND team for a better answer; however, my best
understanding is this will query a second root server (typically
next best by RTT) when it gets a non-validating answer, and assuming
the second best one validates just fine there are no further follow
on effects. So you're talking one extra query when a caching
resolver hits the root. We can argue if that is minimal or not,
but I suspect most end users behind that resolver would never notice.
You miss the point here Leo. If the operator of a network service
can't detect issues *when they occur* in the current system in some
automated manner, whether unintentional or malicious, they won't be
alerted, they certainly can't "fix" the problem, and the potential
exposure window can be significant.
In a message written on Mon, Oct 03, 2011 at 01:09:17PM -0400, Christopher Morrow wrote:
Does ISC (or any other anycast root/*tld provider) have external
polling methods that can reliably tell when, as was in this case,
local-anycast-instances are made global? (or when the cone of silence
widens?)
Could ISC (or any other root operator) do more monitoring? I'm sure,
but let's scope the problem first. We're dealing here with a relatively
widespread leak, but that is in fact the rare case.
There are 39,000 ASNs active in the routing system. Each one of those
ASNs can affect its path to the root server by:
1) Bringing up an internal instance of a root server, injecting it into
its IGP, and "hijacking" the route.
2) Turning up or down a peer that hosts a root server.
3) Turning up or down a transit provider.
4) Adding or removing links internal to their network that change their
internal selection to use a different external route.
The only way to make sure a route was correct, everywhere, would
be to have 39,000+ probes, one on every ASN, and check the path to
the root server. Even if you had that, how do you define when any
of the changes in 1-4 are legitimate? You could DNSSEC verify to
rule out #1, but #2-4 are local decisions made by the ASN (or one
of its upstreams).
I suppose, if someone had all 39,000+ probes, we could attempt to
write algorithms that determined if too much "change" was happening
at once; but I'm reminded of events like the earthquake that took
out many asian cables a few years back. There's a very real danger
in such a system shutting down a large number of nodes during such
an event due to the magnitude of changes which I'd suggest is the
exact opposite of what the Internet needs to have happen in that
event.
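For what it's worth, the heuristic could be as dumb as counting how many
probes changed their observed instance in a window, with a second,
deliberately conservative threshold above which the system stands down
and pages humans instead of auto-withdrawing anything (the thresholds
below are illustrative guesses, not tuned values):

  def classify_change(changed_probes, total_probes,
                      alert_frac=0.02, mass_event_frac=0.30):
      frac = changed_probes / float(total_probes)
      if frac >= mass_event_frac:
          # earthquake/fiber-cut scale: automation should not react
          return "mass-event"
      if frac >= alert_frac:
          # localized change, plausible leak or hijack: alert a human
          return "alert"
      return "ok"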
(I suppose i'm not prescribing solutions above, just wondering if
something like these is/could-be done feasibly)
Not really. Look, I chase down several dozen F-Root leaks a year.
You never hear about them on NANOG. Why? Well, it's some small
ISP in the middle of nowhere leaking to a peer who believes them,
and thus they get a 40ms response time when they should have a 20ms
response time by believing the wrong route. Basically, almost no
one cares; generally it takes some uber-DNS nerd at a remote site
to figure this out and contact us for help.
This has taught me that viewpoints are key. You have to be inside a
network to detect that it has hijacked all 13 root servers; you
can't probe that from the outside. You also have to be on the right
network to see you're getting the F-Root 1000 miles away rather
than the one 500. Those 39,000 ASNs are providing a moving playing
field, with relationships changing quite literally every day, and
every one of them may be a "leak".
This one caught attention not because it was a bad leak. It was
IPv6 only. Our monitoring suggests this entire leak siphoned away
40 queries per second, at its peak, across all of F-Root. In terms
of a percentage of queries it doesn't even show visually on any of
our graphs. No, it drew attention for totally non-technical reasons,
US users panicking that the Chinese government was hijacking the
Internet which is just laughable in this context.
There really is nothing to see here. DNSSEC fixes any security
implications from these events. My fat fingers have dropped more
than 40qps on the floor more than once this year, and you didn't
notice. Bad events (like earthquakes and fiber cuts) have taken
any number of servers from any number of operators multiple times
this year. Were it not for the fact that someone posted to NANOG, I bet
most of the people here would have never noticed their 99.999% working
system kept working just fine.
I think all the root ops can do better, use more monitoring services,
detect more route hijacks faster, but none of us will ever get 100%.
None will ever be instantaneous. Don't make that the goal, make the
system robust in the face of that reality.
My own resolution is better IPv6 monitoring for F-root.
Given that in the ISC case the hostname.bind query can tell you at
least the region + instance#, it seems plausible that some system of
systems could track current/changes in the mappings, no? and either
auto-action some 'fix' (SHUT DOWN THE IAD INSTANCE IT's ROGUE!) or at
least log and notify a hi-priority operations fixer.
That sort of capability at the application layer certainly seems
prudent to me, noting that it does assume you have a measurement
node within the catchment in question and are measuring at a high
enough frequency to detect objective incidents.
Given something like the unique-as work Verisign has been behind you'd
think monitoring route origins and logging 'interesting' changes could
accomplish this as well?
I'm a fan of both routing system && consumer-esque monitoring, and
do believe that a discriminator in the routing system associated with
globally anycasted prefixes makes this simpler - for both detection,
and possibly even reactive or preventative controls IF necessary. A
unique origin AS is not the only place you can do this in the routing
system, as I'm sure some will observe, but it seems an ideal location
to me.
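To make the routing-system side concrete, a check can be as small as
comparing observed origins against the expected set for the monitored
anycast prefixes (sketch only; the prefixes and ASNs below are made up,
and where the (prefix, origin) observations come from - RIS, a BMP feed,
your own collectors - is left open):

  # expected origin AS(es) per monitored anycast prefix (hypothetical)
  EXPECTED_ORIGINS = {
      "192.0.2.0/24": {64496},
      "2001:db8::/32": {64496, 64497},
  }

  def check_origins(observations):
      """observations: iterable of (prefix, origin_asn, collector)."""
      alerts = []
      for prefix, origin, collector in observations:
          expected = EXPECTED_ORIGINS.get(prefix)
          if expected is not None and origin not in expected:
              alerts.append((prefix, origin, collector))
      return alerts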
-danny
I'm not talking "one extra query", and it's not simply about
subsequent transaction attempts either - so conjecture aiming
to marginalize the impact isn't particularly helpful.
I.e., have that look, get back to us...
-danny
Leo,
The only way to make sure a route was correct, everywhere, would
be to have 39,000+ probes, one on every ASN, and check the path to
the root server. Even if you had that, how do you define when any
of the changes in 1-4 are legitimate? You could DNSSEC verify to
rule out #1, but #2-4 are local decisions made by the ASN (or one
of its upstreams).
I suppose, if someone had all 39,000+ probes, we could attempt to
write algorithms that determined if too much "change" was happening
at once; but I'm reminded of events like the earthquake that took
out many asian cables a few years back. There's a very real danger
in such a system shutting down a large number of nodes during such
an event due to the magnitude of changes which I'd suggest is the
exact opposite of what the Internet needs to have happen in that
event.
This sounds an awful lot like the notary concept:
- http://perspectives-project.org/
- http://convergence.io/
Furthermore, changing network paths used to reach information probably
should not be reason to shut down a service, in general. More
interesting than which path is used, I suppose, is whether or not the
data being returned has been changed in some unexpected/undesired way.
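In DNS terms a notary-style check is not much code: ask the same question
from a few vantage points and compare what comes back (sketch assuming
dnspython; the resolver addresses are placeholders, and a real notary
would use genuinely diverse network locations):

  import dns.resolver

  VANTAGES = ["192.0.2.1", "198.51.100.1", "203.0.113.1"]  # placeholders

  def answers_agree(qname):
      seen = set()
      for addr in VANTAGES:
          res = dns.resolver.Resolver(configure=False)
          res.nameservers = [addr]
          try:
              rdatas = frozenset(r.to_text() for r in res.resolve(qname, "A"))
          except Exception:
              continue  # one unreachable vantage should not decide this
          seen.add(rdatas)
      return len(seen) <= 1  # all reachable vantages returned the same set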
Regards,
Martin
Given that in the ISC case the hostname.bind query can tell you at
least the region + instance#, it seems plausible that some system of
systems could track current/changes in the mappings, no? and either
auto-action some 'fix' (SHUT DOWN THE IAD INSTANCE IT's ROGUE!) or at
least log and notify a hi-priority operations fixer.
That sort of capability at the application layer certainly seems
prudent to me, noting that it does assume you have a measurement
node within the catchment in question and are measuring at a high
enough frequency to detect objective incidents.
In principle there seems to be no reason that a DNS client sending queries to authority-only servers couldn't decide to include the NSID option and log changes in declared server identity between subsequent queries (or take some other configured action).
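A sketch of what that client-side check might look like with dnspython
(option handling differs a little between library versions, hence the
defensive attribute lookup; the query name and type are arbitrary):

  import dns.edns
  import dns.message
  import dns.query

  def nsid_of(addr, qname="."):
      # attach an empty NSID option (EDNS option code 3, RFC 5001) so the
      # server can declare which instance answered
      opt = dns.edns.GenericOption(dns.edns.NSID, b"")
      q = dns.message.make_query(qname, "SOA", use_edns=0, options=[opt])
      resp = dns.query.udp(q, addr, timeout=2)
      for option in resp.options:
          if option.otype == dns.edns.NSID:
              return getattr(option, "data", None) or getattr(option, "nsid", b"")
      return None

Comparing successive return values from the same server address gives the
"log changes in declared server identity" behavior in a single query, with
no separate hostname.bind lookup.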
We support 5001 on L-Root (which runs NSD), for what that's worth, as well as HOSTNAME.BIND/CH/TXT, VERSION.BIND/CH/TXT, ID.SERVER/CH/TXT and VERSION.SERVER/CH/TXT, but those require separate queries. I appreciate NSID support is not universal, but perhaps that's ok in the sense of "better than nothing".
I'm a fan of both routing system && consumer-esque monitoring, and
do believe that a discriminator in the routing system associated with
globally anycasted prefixes makes this simpler - for both detection,
and possibly even reactive or preventative controls IF necessary. A
unique origin AS is not the only place you can do this in the routing
system, as I'm sure some will observe, but it seems an ideal location
to me.
Whether it's the right-most entry in the AS_PATH or a bigger substring, you still need more measurement points than you have if you want to catch every leak.
Joe
Furthermore, changing network paths used to reach information probably
should not be reason to shut down a service, in general.
cool. then we can get rid of dynamic routing. it always has been a
pain in the ass.
randy
In a message written on Tue, Oct 04, 2011 at 07:00:52AM +0900, Randy Bush wrote:
cool. then we can get rid of dynamic routing. it always has been a
pain in the ass.
If we went back to hosts.txt this pesky DNS infrastructure would
be totally unnecessary.
You're just saying that because you're hoping your employer will
get to sell bandwidth to SRI-NIC.ARPA
Most if not all European operators today force rewriting or blocking of DNS lookups. Belgium added a fairly large site today. There is virtually no way that this can be contained just inside a country. This problem is waaaay beyond root-servers, China, etc. Filtering on the net is becoming common, and was pushed quite hard for at the Internet Governance Forum last week. By Interpol and MPAA.
Best regards,
- kurtis -
Leo,
<snip>
This sounds an awfully lot like the notary concept:
- http://perspectives-project.org/
- http://convergence.io/
Furthermore, changing network paths used to reach information probably
should not be reason to shut down a service, in general. More
interesting than which path is used, I suppose, is whether or not the
data being returned has been changed in some unexpected/undesired way.
Actually, some other related work that's been around for 3-6 years includes:
- http://vantage-points.org/
- http://secspider.cs.ucla.edu/
The former has a tech report (listed on its page, http://techreports.verisignlabs.com/tr-lookup.cgi?trid=1110001 ) that presents candidate closed form analysis of how much faith you can gain using network path diversity.
Eric