DNS attacks evolve

It's usually interesting to be proven wrong, but perhaps not in this case.

I was among the first to point out that the 11-second DNS poisoning claim
made by Vixie only worked out to about a week of concentrated attack after
the patch. This was a number I extrapolated purely from Paul's 11-second
figure and the roughly 65,000x factor introduced by source port randomization.
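
(A quick back-of-envelope check of that extrapolation, for reference; the
65,536 multiplier is an approximation of the randomized source-port pool:)

  # Rough check of the "about a week" figure derived from Vixie's 11 seconds.
  # Assumes port randomization multiplies the search space by ~65,536; the
  # usable pool is somewhat smaller in practice, so this is only an estimate.
  base_attack_seconds = 11              # pre-patch poisoning time claimed
  port_randomization_factor = 65536     # added source-port entropy (approx.)

  total = base_attack_seconds * port_randomization_factor
  print(total / 86400.0)                # ~8.3 days of sustained attack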

I am very, very, very disheartened to be shown to be wrong. As if 8 days
wasn't bad enough, a concentrated attack has been shown to be effective in
10 hours. See http://www.nytimes.com/2008/08/09/technology/09flaw.html

With modern data rates being what they are, I believe that this is still a
severe operational hazard, and would like to suggest a discussion of further
mitigation strategies.

On my list of concepts:

1) Use of multiple IP addresses for queries (reduces the success rate
   somewhat; see the sketch after this list)

2) Rate-limiting of query traffic, since I really doubt many sites actually
   have recursors that need to be able to spike to many times their normal
   traffic.

3) Forwarding of failed queries (which I believe BIND doesn't currently
   allow) to a "backup" server (which would seem to be interesting in
   combination with 2)

4) I wonder if it wouldn't make sense to change the advice for large-scale
   recursors to run multiple instances of BIND and internally distribute the
   requests (random pf/ipfw load balancing), presenting a version of 1) that
   would leave only a smaller segment of the user base vulnerable in the
   event of a success. It would mean more memory, more CPU, and more
   requests, but a smaller victory for attackers.

5) Modify BIND to report mismatched QIDs. Not a log entry per hit, but some
   reasonable strategy. Make the default installation instructions include
   a script to scan for these - often - and mail hostmaster.

6) Have someone explain to me the reasoning behind allowing the corruption
   of in-cache data, even if the data would otherwise be in-bailiwick. I'm
   not sure I quite get why this has to be. It would seem to me to be safer
   to discard the data. (This does not eliminate the problem, but it would
   seem to reduce it.)

7) Have someone explain to me the repeated claims I've seen that djbdns and
   Nominum's server are not vulnerable to this, and why that is.
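
As a rough illustration of what 1) and 4) buy you (the specific counts below
are illustrative assumptions, not measurements):

  # Each independent value the attacker must guess multiplies the search
  # space (idea 1); splitting users across instances shrinks the blast
  # radius of any single success (idea 4). Numbers are assumptions.
  qids         = 65536       # 16-bit query ID
  source_ports = 64000       # approximate usable randomized port range
  source_ips   = 4           # idea 1: query from several addresses

  combinations = qids * source_ports * source_ips
  print("guesses for roughly a 50%% chance: about %d" % (combinations // 2))

  instances = 4              # idea 4: independent internal caches
  print("share of users hit by one successful poisoning: 1/%d" % instances)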

It would seem that the floor is wide open to a large number of possibilities
for mitigating this beyond the patch.

... JG

jgreco@ns.sol.net (Joe Greco) writes:

> I am very, very, very disheartened to be shown to be wrong. As if 8 days
> wasn't bad enough, a concentrated attack has been shown to be effective in
> 10 hours. See http://www.nytimes.com/2008/08/09/technology/09flaw.html

that's what theory predicted. guessing a 30-or-so-bit number isn't "hard."

> With modern data rates being what they are, I believe that this is still a
> severe operational hazard, and would like to suggest a discussion of further
> mitigation strategies.
> ...
...

i have two gripes here. first, can we please NOT use the nanog@ mailing
list as a workshop for discussing possible DNS spoofing mitigation
strategies? namedroppers@ops.ietf.org already has a running gun battle
on that topic, and dns-operations@lists.oarci.net would be appropriate.

but unless we're going to talk about deploying BCP38, which would be the
mother of all mitigations for DNS spoofing attacks, it's offtopic on nanog@.

second, please think carefully about the word "severe". any time someone
can cheerfully hammer you at full-GigE speed for 10 hours, you've got some
trouble, and you'll need to monitor for those troubles. 11 seconds of
10MBit/sec fits my definition of "severe". 10 hours at 1000MBit/sec doesn't.

I think what we're seeing here is the realization that DNS hosting, like web hosting, is no longer something that can simply be done by tossing a machine on the internet and leaving it there; it needs professional management, monitoring and updates. That's always a hard transition for some people to make, but it's one that has to be made; that's the world we live in.

Kee Hinckley
CEO/CTO Somewhere Inc.
Somewhere: http://www.somewhere.com/
TechnoSocial: http://xrl.us/bh35i
I'm not sure which upsets me more; that people are so unwilling to accept responsibility for their own actions, or that they are so eager to regulate those of everybody else.

* Joe Greco:

> I am very, very, very disheartened to be shown to be wrong. As if 8 days
> wasn't bad enough, a concentrated attack has been shown to be effective in
> 10 hours. See http://www.nytimes.com/2008/08/09/technology/09flaw.html

Note that the actual bandwidth utilization on that GE link should be
somewhere between 10% and 20% if you send minimally sized replies during
spoofing. In fact, the theoretically predicted time to reach 50% success
probability with a 100 Mbps attack is below one day.
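
A crude model of where such predictions come from (the forged-reply size,
the port-pool size, and the assumption that the attacker always has a
spoofable query outstanding are all simplifications):

  # Estimate time to a 50% chance of successful poisoning after the patch.
  # Treats every forged reply as an independent guess at QID + source port;
  # packet size and pool sizes below are assumptions, not measured values.
  import math

  search_space = 65536 * 64000        # QIDs x randomized source ports
  reply_bits   = 120 * 8              # ~120-byte forged reply on the wire
  link_bps     = 100e6                # 100 Mbps of attack traffic

  packets_per_second = link_bps / reply_bits
  packets_for_50pct  = search_space * math.log(2)
  print("about %.1f hours" % (packets_for_50pct / packets_per_second / 3600))
  # -> on the order of 8 hours at 100 Mbps, i.e. "below one day"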

This also matches the numbers Evgeniy Polyakov posted from his experiment.

> 1) Use of multiple IP addresses for queries (reduces the success rate somewhat)

You must implement this carefully. Just using a load-balanced DNS setup
doesn't work, for instance. The attacker could trigger the cache misses
through a CNAME he controls, so he'd know which instance to attack in
each round.

> 2) Rate-limiting of query traffic, since I really doubt many sites actually
>    have recursors that need to be able to spike to many times their normal
>    traffic.

The problem with that is that 130,000 queries over a 10-hour period (as
in Evgeniy's experiment) are often lost in the noise. The attacker only
benefits from high query rates if the authoritative servers are RTT-wise
close to your recursor.
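
The arithmetic behind "lost in the noise" (using the query count and window
quoted above):

  # 130,000 attacker-triggered queries spread over a 10-hour window
  extra_queries  = 130000
  window_seconds = 10 * 3600
  print("%.1f extra queries per second" % (extra_queries / float(window_seconds)))
  # ~3.6 qps -- well inside normal variation on most recursive servers,
  # so a simple per-client rate limit is unlikely to trip.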

> 3) Forwarding of failed queries (which I believe BIND doesn't currently
>    allow) to a "backup" server (which would seem to be interesting in
>    combination with 2)

I don't think any queries fail in this scenario.

> 4) I wonder if it wouldn't make sense to change the advice for large-scale
>    recursors to run multiple instances of BIND and internally distribute the
>    requests (random pf/ipfw load balancing), presenting a version of 1) that
>    would leave only a smaller segment of the user base vulnerable in the
>    event of a success. It would mean more memory, more CPU, and more
>    requests, but a smaller victory for attackers.

User-specific DNS caches are interesting from a privacy perspective,
too. But I don't think they'll work, except when the cache is in the
CPE.

> 5) Modify BIND to report mismatched QIDs. Not a log entry per hit, but some
>    reasonable strategy. Make the default installation instructions include
>    a script to scan for these - often - and mail hostmaster.

Yes, better monitoring is crucial. Recent BIND 9.5 has a counter for
mismatched replies, which should provide at least one indicator. Due to
the diversity of potential attacks, it's very difficult to set up
generic monitoring.
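
In that spirit, here is a minimal sketch of the kind of cron-driven check
point 5) asks for. The statistics file path, the counter label matched by
the regex, and the mail addresses are assumptions -- recent BIND appends
resolver counters to its statistics file on "rndc stats", but the exact
wording varies by version, so adjust the pattern to your named.stats:

  # Hypothetical periodic check: dump BIND statistics, find the mismatched-
  # reply counter, and mail hostmaster if it has grown since the last run.
  import re, subprocess, smtplib
  from email.mime.text import MIMEText

  STATS_FILE = "/var/named/data/named_stats.txt"   # assumed dump location
  STATE_FILE = "/var/tmp/qid_mismatch.last"
  PATTERN    = re.compile(r"(\d+)\s+.*mismatch", re.IGNORECASE)

  subprocess.call(["rndc", "stats"])       # append a fresh statistics dump

  current = 0
  for line in open(STATS_FILE):
      m = PATTERN.search(line)
      if m:
          current = int(m.group(1))        # keep the value from the newest dump

  try:
      previous = int(open(STATE_FILE).read())
  except (IOError, ValueError):
      previous = 0

  if current > previous:
      msg = MIMEText("mismatch counter rose from %d to %d" % (previous, current))
      msg["Subject"] = "possible DNS spoofing attempts"
      msg["From"]    = "named@localhost"
      msg["To"]      = "hostmaster@localhost"
      s = smtplib.SMTP("localhost")
      s.sendmail(msg["From"], [msg["To"]], msg.as_string())
      s.quit()

  open(STATE_FILE, "w").write(str(current))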

> 6) Have someone explain to me the reasoning behind allowing the corruption
>    of in-cache data, even if the data would otherwise be in-bailiwick. I'm
>    not sure I quite get why this has to be. It would seem to me to be safer
>    to discard the data. (This does not eliminate the problem, but it would
>    seem to reduce it.)

The idea is that the delegated zone can introduce additional servers not
listed in the delegation. (It's one thing that gets you a bit of IPv6
traffic.) Unfortunately, it's likely that performance would suffer for
some sites if resolvers simply discarded such data.

> 7) Have someone explain to me the repeated claims I've seen that djbdns and
>    Nominum's server are not vulnerable to this, and why that is.

For DJBDNS, see: <http://article.gmane.org/gmane.network.djbdns/13371>

Nominum has published a few bits about their secret sauce:

  <http://nominum.com/news_events/security_vulnerability_update.php>

TCP fallback on detected attack attempts is expected to be sufficiently
effective so that you can get away with a smaller source port pool.
Even if it's not, on some platforms, a smallish pool is the only way to
cope with the existing load until you can bring in more servers, so it's
better than nothing.

The TCP fallback idea was posted to namedroppers in 2006, in response to
one of Bert's early drafts of what evolved into the forgery resilience
document, so it should not be encumbered. The heuristics for when to
trigger the fallback could be, though.

Joe Greco wrote:

> 6) Have someone explain to me the reasoning behind allowing the corruption
>    of in-cache data, even if the data would otherwise be in-bailiwick. I'm
>    not sure I quite get why this has to be. It would seem to me to be safer
>    to discard the data. (This does not eliminate the problem, but it would
>    seem to reduce it.)

I had this question in my post weeks ago; no one bothered to reply. Older poisoning attacks are the reason authority data must be within the same zone to be cached, but apparently no one questioned the wisdom of altering existing cache data.

Wish they'd just fix the fault in the logic and move on. Talking until everyone is blue in the face about protocol changes and encryption doesn't serve operations. There are recursive resolvers that work just fine without the issues some standard resolvers have. The protocol seems to work; some vendors just need to change how they use it and tighten up on cache integrity.

> 7) Have someone explain to me the repeated claims I've seen that djbdns and
>    Nominum's server are not vulnerable to this, and why that is.

PowerDNS has this to say about their non-vulnerability status:

http://mailman.powerdns.com/pipermail/pdns-users/2008-July/005536.html

I know some very happy providers that haven't had to patch. I hope to be one of them on the next round.

Jack

In a message written on Mon, Aug 11, 2008 at 09:41:54AM -0500, Jack Bates wrote:

>> 7) Have someone explain to me the repeated claims I've seen that djbdns and
>>    Nominum's server are not vulnerable to this, and why that is.

> PowerDNS has this to say about their non-vulnerability status:
>
> http://mailman.powerdns.com/pipermail/pdns-users/2008-July/005536.html

> I know some very happy providers that haven't had to patch. I hope to be
> one of them on the next round.

It's not that they are immune to the attack, and I think a few
people deserve to be smacked around for the language they use.....

Let's be perfectly clear: without DNSSEC or an alteration to the
DNS protocol THERE IS NO WAY TO PREVENT THIS ATTACK. There are
only ways to make the attack harder.

So what PowerDNS, DJB and others are telling you is not that you
are immune, it is that you're not the low hanging fruit. A more
direct way of stating their press releases would be:

  Everyone else figured out it took 3 minutes to hack their servers
  and implemented patches to make it take 2 hours. Our server always
  had the logic to make it take 2 hours, so we were ahead of the game.

Great.

If your vendor told you that you are not at risk, they are wrong,
and need to go re-read the Kaminsky paper. EVERYONE is vulnerable;
the only question is whether the attack takes 1 second, 1 minute,
1 hour or 1 day. While possibly interesting for short-term problem
management, none of those are long-term fixes. I'm not sure your
customers care whether it took the attacker 1 second or 1 day when
.COM is poisoned.

Leo Bicknell wrote:

> If your vendor told you that you are not at risk, they are wrong,
> and need to go re-read the Kaminsky paper. EVERYONE is vulnerable;
> the only question is whether the attack takes 1 second, 1 minute,
> 1 hour or 1 day. While possibly interesting for short-term problem
> management, none of those are long-term fixes. I'm not sure your
> customers care whether it took the attacker 1 second or 1 day when
> .COM is poisoned.

EVERYONE with a CACHE MIGHT be vulnerable. Have studies been done to determine if existing cached records will be overwritten on ALL caching resolvers?

Poisoning has always been and will always be possible until DNSSEC, but the question isn't whether you can poison a few off-the-wall records, it's whether you can poison the resolver in any meaningful way. If the cache isn't passively overwritten, then the only records you could poison would be records that aren't cached.

The operational impact would be much smaller in scope. .COM will be cached constantly, and to poison it the attacker would have to forge the reply in the small window between cache expiry and renewal.
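
As a rough illustration of how much that shrinks the exposure (treating each
expiry as a single race, and using the published two-day TTL on the .COM NS
set as the example record):

  # With no in-cache overwrites, the attacker only gets a race when the
  # record genuinely expires, rather than on demand via forced cache misses.
  ttl_seconds    = 172800                       # .COM NS TTL: 2 days
  races_per_year = 365 * 86400 // ttl_seconds
  print("%d poisoning windows per resolver per year" % races_per_year)   # ~182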

This can be mitigated even more if sites include authority data in negative responses, which means that for that specific domain the attacker gets one shot to spoof before the authority info is cached. Obviously there is a downside to sending larger packets, but that is a decision for the domain holder.

I'll be happy to add DNSSEC to my operational list as soon as it's actually useful (other people can argue over who signs what).

Jack