NTP Issues Today

Van_Wolfe · November 19, 2012, 11:21pm

Hello,

Did anyone else experience issues with NTP today? We had our server
times update to the year 2000 at around 3:30 MT, then revert back to 2012.

Thanks,
Van

Mark_Andrews2 · November 20, 2012, 1:41am

NTP should be immune from this sort of behaviour unless you did an
ntpdate at the wrong moment. The clocks should have been marked
as insane.

Mark

Wallace_Keith · November 20, 2012, 2:08am

Just got paged with a PBX alarm that had 1970 as the year. By the time I logged in, it was showing 2012. Using GPS for time and date.

George_Herbert · November 20, 2012, 3:28am

crossreplying to outages list.

Is anyone ELSE seeing GPS issues? This could well have been an
unrelated issue on that particular PBX.

If this was real, then the mother of all infrastructure attacks might
be underway...

One glitch on tick and tock and one malfunctioning PBX is not
sufficient evidence of pattern - much less hostile activity - to
induce panic, but it would perhaps be a wise time to check
time-related logs?

-george

Sid_Rao · November 20, 2012, 3:58am

We had multiple servers synchronized with Windows/MS time change their clock to the year 2000 today. It broke many things, including AD authentication.

These servers had been properly synchronized for years.

They were synchronized with Microsoft and NIST NTP servers.

This may not be isolated.

Sid Rao | CTI Group | +1 (317) 262-4677

Mike_Lyon · November 20, 2012, 4:17am

Anyone check out the NIST GPS Archive?

http://www.nist.gov/pml/div688/grp40/gpsarchive.cfm

-Mike

Leo_Bicknell1 · November 20, 2012, 4:38pm

In a message written on Mon, Nov 19, 2012 at 04:21:55PM -0700, Van Wolfe wrote:

Did anyone else experience issues with NTP today? We had our server
times update to the year 2000 at around 3:30 MT, then revert back to 2012.

I'm surprised the various time geeks aren't all posting their logs, so
I'll kick off:

/tmp/parse-peerstats.pl peerstats.20121119
56250 76367.354 192.5.41.41 91b4 -378691200.312258363 0.088274002 0.014835425 0.263515353
56250 77391.354 192.5.41.41 91b4 -378691200.312258363 0.088274002 0.018668790 0.263749719
56250 78204.354 192.5.41.40 90b4 -378691200.785377324 0.088179350 0.014812585 0.263668835
56250 78416.355 192.5.41.41 91b4 -378691200.785974681 0.088312507 0.014832943 0.209966600
56250 79229.355 192.5.41.40 90b4 -378691200.785377324 0.088179350 0.018668723 378691200.785523713
56250 79442.355 192.5.41.41 91b4 -378691200.785974681 0.088312507 0.018689918 378691200.786114931

Or in more human readable form:
/tmp/parse-peerstats.pl peerstats.20121119
192.5.41.41 off by -378691200.312258363
192.5.41.41 off by -378691200.312258363
192.5.41.40 off by -378691200.785377324
192.5.41.41 off by -378691200.785974681
192.5.41.40 off by -378691200.785377324
192.5.41.41 off by -378691200.785974681

The script, if you want to run against your own stats:

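#!/usr/bin/perl
use strict;
use warnings;

# Print every peerstats line whose offset exceeds 10 seconds either way.
while (<>) {
  chomp;
  my ($day, $second, $addr, $status, $offset, $delay, $disp, $skew) = split;
  next unless defined $offset;   # skip blank or short lines
  if (($offset > 10) || ($offset < -10)) {
#   print "$addr off by $offset\n"; # More human friendly
    print "$_\n"; # Full details
  }
}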

It just looks for servers off by more than 10 seconds and then prints
the line. 378691200 seconds is ~12 years, which lines up with the
year 2000 dates some are reporting.
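
For anyone who wants to double-check that arithmetic, here is a minimal sketch in Python (my choice of language, not the Perl above; the function names are made up for illustration). It assumes the first two peerstats fields are the Modified Julian Date and the seconds past midnight UTC, and it shows where a clock would land if it applied the reported offset.

```python
from datetime import datetime, timedelta, timezone

# Modified Julian Date 0 is 1858-11-17 00:00 UTC.
MJD_EPOCH = datetime(1858, 11, 17, tzinfo=timezone.utc)


def peerstats_time(mjd, seconds):
    """Convert the first two peerstats fields to a UTC datetime."""
    return MJD_EPOCH + timedelta(days=int(mjd), seconds=float(seconds))


def clock_after_offset(line):
    """Return (sample time, time a clock would show after applying the offset)."""
    fields = line.split()
    when = peerstats_time(fields[0], fields[1])
    offset = float(fields[4])  # fifth field is the offset in seconds
    return when, when + timedelta(seconds=offset)


# First line of the peerstats output posted above.
sample = ("56250 76367.354 192.5.41.41 91b4 -378691200.312258363 "
          "0.088274002 0.014835425 0.263515353")
when, stepped = clock_after_offset(sample)
print(when.date(), "->", stepped.date())
print(378691200 / 86400, "days")
```

On that sample line the sample date is 2012-11-19 and the stepped date is 2000-11-19. 378691200 seconds is exactly 4383 days, which is 12 years including three leap days.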

The IPs are tick.usno.navy.mil and tock.usno.navy.mil.

I can confirm from my vantage point that tick and tock both went about
12 years wrong on Nov 19th for a bit. I can also report that my NTP
server, with sufficient sources, correctly determined they were haywire
and ignored them.

If your machines switched dates yesterday it probably means your
NTP infrastructure is insufficiently peered and diversified.

Colin_Johnston1 · November 20, 2012, 5:02pm

from firewall ntp logs
Nov 19 09:58:06 [192.168.0.1.128.176] 2012:11:19-09:58:06 ntpd[21385]: ntpd exiting on signal 15
Nov 19 09:58:19 [192.168.0.1.128.176] 2012:11:19-09:58:19 selfmonng[3503]: W check Failed increment ntpd_running counter 3 - 3
Nov 19 09:58:22 [192.168.0.1.128.176] 2012:11:19-09:58:22 selfmonng[3503]: W NOTIFYEVENT Name=ntpd_running Level=INFO Id=147 sent
Nov 19 09:58:22 [192.168.0.1.128.176] 2012:11:19-09:58:22 selfmonng[3503]: W triggerAction: 'cmd'
Nov 19 09:58:22 [192.168.0.1.128.176] 2012:11:19-09:58:22 selfmonng[3503]: W actionCmd(+): '/var/mdw/scripts/ntp restart'
Nov 19 09:58:25 [192.168.0.1.128.176] 2012:11:19-09:58:25 ntpd[24120]: ntpd 4.2.4p8@1.1612-o Tue Feb 2 21:46:54 UTC 2010 (1)
Nov 19 09:58:25 [192.168.0.1.128.176] 2012:11:19-09:58:25 selfmonng[3503]: W child returned status: exit='0' signal='0'
Nov 19 09:58:35 [192.168.0.1.128.176] 2012:11:19-09:58:35 ntpd[24121]: kernel time sync status change 0001

was sync'd to 84.25.175.98, stratum 2 at the time I believe

Colin

Steve_Meuse3 · November 20, 2012, 5:02pm

If you take anything away from this thread, this is it....

-Steve

Colin_Johnston1 · November 20, 2012, 5:10pm

no idea, re sigterm cause
checked firewall system logs and could not see cause from that either
times are GMT

Colin

Seth_Mattinen · November 20, 2012, 5:58pm

I use GPS for my NTP server and didn't notice anything, but it's PPS
disciplined after initial sync so it doesn't matter as long as the pulse
keeps going.

ntp0# ntpq -pn
remote refid st t when poll reach delay offset jitter

Leo_Bicknell1 · November 20, 2012, 7:00pm

After some private replies, I'm going to reply to my own post with
some information here.

It appears many people don't understand how the NTP protocol works.
I suspect many people have configured a "primary" and a "backup"
NTP server on many of their devices. It turns out this is the
_WORST_ possible configuration if you want accurate time:

http://support.ntp.org/bin/view/Support/SelectingOffsiteNTPServers#Section_5.3.3.

To protect against two falseticking servers (tick and tock, as we saw on
the 19th) you need _FIVE_ servers minimum configured if they are both in
the list. More importantly, if you want to protect against a source
(GPS, CDMA, IRIG, WWVB, ACTS, etc) false ticking, you need a minimum of
_FOUR_ different source technologies in the list as well.
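
To make the counting concrete, here is a toy sketch in Python. It is a simple majority vote over offsets, which is my simplification and NOT the selection algorithm ntpd actually runs; the function name, the tolerance, and the "good" offsets are all invented for illustration. The two bad offsets are the tick and tock values from the logs.

```python
def agreeing_majority(offsets, tolerance=1.0):
    """Toy falseticker check; NOT ntpd's real selection algorithm.

    Returns the largest group of offsets lying within `tolerance` seconds
    of one of the samples, but only if that group is a strict majority.
    Returns None when no majority agrees.
    """
    best = []
    for candidate in offsets:
        group = [o for o in offsets if abs(o - candidate) <= tolerance]
        if len(group) > len(best):
            best = group
    return best if 2 * len(best) > len(offsets) else None


GOOD = [0.001, -0.002, 0.003]            # made-up sane offsets (seconds)
BAD = [-378691200.312, -378691200.785]   # tick and tock on the 19th

print(agreeing_majority(GOOD[:1] + BAD[:1]))  # "primary" plus "backup"
print(agreeing_majority(GOOD[:2] + BAD))      # four servers, two bad
print(agreeing_majority(GOOD + BAD))          # five servers, two bad
```

Under this simplified model the first two calls find no majority, so nothing can outvote the bad clocks. Only the five-server case returns the three sane offsets.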

It's not hard, my box that I posted the logs from peers with 18 servers
using 8 source technologies, all freely available on the Internet...

Jay_Ashworth · November 20, 2012, 7:28pm

I'm curious, Leo, what your internal setup looks like. Do you have an
internal pair of masters, all slaved to those externals and one another,
with your machines homed to them? Full mesh? Or something else?

In my last big gig, it was recommended to me that I have all the machines
which had to speak to my DBMS NTP *to it*, and have only it connect to the
rest of my NTP infrastructure. It coming unstuck was of less operational
impact than *pieces of it* going out of sync with one another...

Cheers,
-- jra

Jared_Mauch · November 20, 2012, 7:39pm

here's a sample ntp config from one of my systems.

-- snip --
# Use public servers from the pool.ntp.org project.
# Please consider joining the pool (http://www.pool.ntp.org/join.html).
server 0.fedora.pool.ntp.org
server 1.fedora.pool.ntp.org
server 2.fedora.pool.ntp.org
server 3.fedora.pool.ntp.org

Leo_Bicknell1 · November 20, 2012, 8:15pm

In a message written on Tue, Nov 20, 2012 at 02:28:19PM -0500, Jay Ashworth wrote:

I'm curious, Leo, what your internal setup looks like. Do you have an
internal pair of masters, all slaved to those externals and one another,
with your machines homed to them? Full mesh? Or something else?

My particular internal setup is a tad weird, and so rather than
answer your question, I'm going to answer with some generalities.
The right answer of course depends a lot on how important it is
that boxes have the right time.

If you have 4 or more physical sites, I believe the right answer
is to have on the order of 8 NTP servers. 2 each in 4 sites reaches
the minimum nicely with redundancy. These boxes can have GPS, CDMA
or other technologies if you want, but MUST peer with at least 10
stratum-1 sources outside of your network. Of course if you have
more sites, one server in each of 8 sites is peachy. Those on a
budget could probably get by with 4 servers total, but never less!

All "critical" devices should then be synced to the full set of
internal servers. 4 boxes minimum, 8-10 preferred. NTP will only
use the 10 best servers in its calculations, so there is a steep
dropoff of diminishing returns beyond 10. For most ISPs I would
include all routers in this list.

For the "non-critical" devices? Well, there it gets more complex.
For most I would only configure one server, their default gateway
router. Of course, pushing out a set of 4+ to them if that is
easy is a great thing to do.

The interesting thing here is that no devices except for your NTP
servers should ever peer with anything outside of your network.
Why? Let's say your NTP servers all go crazy together. The outside
world is cut off, GPS is spoofed, the world is ending. All that
you have left is that all of your devices are in time with each
other... so at least your logs still correlate and such. So having
every device under your master set of NTP servers is important.
One guy with an external peer may choose to use that, and leave the
hive mind, so to speak.

Small players, with fewer than 4 sites, should typically just use the NTP
pool servers, configuring 4 per box minimum. If you want the same
protection I just outlined in the paragraph before, make 4 of your
servers talk to the outside world, and make everything else talk
to those. Want to give back to the community? Get a GPS/CDMA/Whatever
box and make it part of the NTP pool. Want to step up your game (which
is what I do), reach out to various Stratum-1's on the net (or find
free, open ones) and peer up 8-20 of them.

In my last big gig, it was recommended to me that I have all the machines
which had to speak to my DBMS NTP *to it*, and have only it connect to the
rest of my NTP infrastructure. It coming unstuck was of less operational
impact than *pieces of it* going out of sync with one another...

Yep, a prime example of the scenario I described above. Depending on
your level of network redundancy, number of NTP servers, and so on, this
is a fine solution. With one NTP server (the DBMS) the downstream will
always use it, and stay in sync. It's a valid and good config in many
situations.

George_Herbert · November 20, 2012, 8:52pm

I've also been looking at an item like this:

http://www.netburnerstore.com/ProductDetails.asp?ProductCode=PK70EX-NTP

which is about $300 + misc parts.

Should be well worth it to avoid a 'major outage' that some folks had with needing to reboot their servers, etc.

- Jared

Caution - that Netburner device is just GPS synced, so if GPS ever does go insane you're out of luck. It doesn't list a precision internal clock part.

I am not sure what all is in the dev kit version, but I know the company owner and can ask if anyone cares.

George William Herbert

Darius_Jahandarie · November 20, 2012, 9:00pm

Choosing the first four servers is usually pretty straightforward:
*.CC.pool.ntp.org

But beyond that, I'm honestly rather curious what server selections
are a good idea. A first thought would be an adjacent country, but
maybe there is a benefit to picking things outside of the pool.ntp.org
selection entirely?

I see that Jared used *.fedora.pool.ntp.org -- I wonder if there was a
specific reason for that or if my questions are even worth thinking
about at all :-).

Happy to hear thoughts.

Mike_Lyon · November 20, 2012, 9:04pm

I usually use time.nist.gov.

R_Benjamin_Kessler · November 20, 2012, 9:07pm

Logs from a Juniper router in a customer network - we had hundreds of these affected. They all synchronize to internal hosts (172.20.167.251 and .252) which are configured to get time from NIST and USNO

CORP-NTP-01#sh ntp as

address ref clock st when poll reach delay offset disp
*~192.5.41.41 .IRIG. 1 354 512 377 34.2 0.36 1.4
+~132.163.4.101 .ACTS. 1 336 512 377 35.0 -2.54 18.7
~127.127.7.1 127.127.7.1 10 59 64 377 0.0 0.00 0.0
* master (synced), # master (unsynced), + selected, - candidate, ~ configured

CORP-NTP-02#sh ntp as

address ref clock st when poll reach delay offset disp
*~192.5.41.41 .IRIG. 1 65 512 377 36.5 0.91 0.6
+~132.163.4.101 .ACTS. 1 95 512 377 34.3 -1.31 22.8
~127.127.7.1 127.127.7.1 10 44 64 377 0.0 0.00 0.0
* master (synced), # master (unsynced), + selected, - candidate, ~ configured

Here are the logs from one of the Junipers:

Nov 19 14:24:48 XXXX xntpd[912]: kernel time sync enabled 2001
Nov 19 15:50:11 XXXX xntpd[912]: synchronized to 172.20.167.252, stratum=2
Nov 19 16:41:23 XXXX xntpd[912]: no servers reachable
Nov 19 16:44:24 XXXX xntpd[912]: synchronized to 172.20.167.251, stratum=2
Nov 19 16:44:24 XXXX xntpd[912]: time correction of -378691200 seconds exceeds sanity limit (1000); set clock manually to the correct UTC time.
Nov 19 16:44:24 XXXX init: ntp (PID 912) exited with status=255
Nov 19 16:44:24 XXXX init: ntp (PID 70200) started
Nov 19 16:44:24 XXXX xntpd[70200]: ntpd 4.2.0-a Sat Apr 10 00:32:46 UTC 2010 (1)
Nov 19 16:44:24 XXXX xntpd[70200]: mlockall(): Resource temporarily unavailable
Nov 19 16:44:24 XXXX xntpd[70200]: precision = 0.582 usec
Nov 19 16:44:24 XXXX xntpd[70200]: Listening on interface ggsn_vpn, 128.0.0.1#123
Nov 19 16:44:24 XXXX xntpd[70200]: kernel time sync status 2040
Nov 19 16:44:24 XXXX xntpd[70200]: frequency initialized -64.931 PPM from /var/db/ntp.drift
Nov 19 16:44:24 XXXX xntpd[70200]: Configuring iburst flag for server
Nov 19 16:44:24 XXXX xntpd[70200]: Configuring iburst flag for server
Nov 19 16:44:33 XXXX xntpd[70200]: synchronized to 172.20.167.251, stratum=2
Nov 19 16:44:32 XXXX xntpd[70200]: time reset -378691200.411331 s
Nov 19 16:44:32 XXXX xntpd[70200]: kernel time sync disabled 2041
Nov 19 16:45:44 XXXX xntpd[70200]: synchronized to 172.20.167.251, stratum=2
Nov 19 16:45:51 XXXX xntpd[70200]: kernel time sync enabled 2001
Nov 19 16:45:56 XXXX xntpd[70200]: NTP Server Unreachable
Nov 19 16:53:25 XXXX xntpd[70200]: no servers reachable
Nov 19 17:03:09 XXXX xntpd[70200]: NTP Server Unreachable
Nov 19 17:13:00 XXXX xntpd[70200]: NTP Server Unreachable
Nov 19 17:20:27 XXXX xntpd[70200]: synchronized to 172.20.167.252, stratum=2
Nov 19 17:20:27 XXXX xntpd[70200]: time correction of 378691200 seconds exceeds sanity limit (1000); set clock manually to the correct UTC time.
Nov 19 17:20:27 XXXX init: ntp (PID 70200) exited with status=255
Nov 19 17:20:27 XXXX init: ntp (PID 70766) started
Nov 19 17:20:27 XXXX xntpd[70766]: ntpd 4.2.0-a Sat Apr 10 00:32:46 UTC 2010 (1)
Nov 19 17:20:27 XXXX xntpd[70766]: mlockall(): Resource temporarily unavailable
Nov 19 17:20:27 XXXX xntpd[70766]: precision = 0.570 usec
Nov 19 17:20:27 XXXX xntpd[70766]: Listening on interface ggsn_vpn, 128.0.0.1#123
Nov 19 17:20:27 XXXX xntpd[70766]: kernel time sync status 2040
Nov 19 17:20:27 XXXX xntpd[70766]: frequency initialized -64.931 PPM from /var/db/ntp.drift
Nov 19 17:20:27 XXXX xntpd[70766]: Configuring iburst flag for server
Nov 19 17:20:27 XXXX xntpd[70766]: Configuring iburst flag for server
Nov 19 17:20:35 XXXX xntpd[70766]: synchronized to 172.20.167.252, stratum=2
Nov 19 17:20:36 XXXX xntpd[70766]: time reset +378691200.387434 s
Nov 19 17:20:36 XXXX xntpd[70766]: kernel time sync disabled 6041
Nov 19 17:21:48 XXXX xntpd[70766]: synchronized to 172.20.167.252, stratum=2
Nov 19 17:21:48 XXXX xntpd[70766]: kernel time sync disabled 2041
Nov 19 17:21:52 XXXX xntpd[70766]: kernel time sync enabled 2001
Nov 20 00:02:29 XXXX xntpd[70766]: synchronized to 172.20.167.251, stratum=2
Nov 20 01:44:56 XXXX xntpd[70766]: kernel time sync enabled 6001
Nov 20 02:19:03 XXXX xntpd[70766]: kernel time sync enabled 2001
Nov 20 02:53:12 XXXX xntpd[70766]: kernel time sync enabled 6001
Nov 20 03:44:26 XXXX xntpd[70766]: kernel time sync enabled 2001
Nov 20 05:26:58 XXXX xntpd[70766]: kernel time sync enabled 6001
Nov 20 05:44:02 XXXX xntpd[70766]: kernel time sync enabled 2001
Nov 20 07:43:35 XXXX xntpd[70766]: kernel time sync enabled 6001
Nov 20 08:00:39 XXXX xntpd[70766]: kernel time sync enabled 2001
Nov 20 08:34:48 XXXX xntpd[70766]: kernel time sync enabled 6001
Nov 20 08:51:54 XXXX xntpd[70766]: kernel time sync enabled 2001
Nov 20 10:34:22 XXXX xntpd[70766]: synchronized to 172.20.167.252, stratum=2
Nov 20 11:25:16 XXXX xntpd[70766]: synchronized to 172.20.167.251, stratum=2
Nov 20 12:33:56 XXXX xntpd[70766]: synchronized to 172.20.167.252, stratum=2
Nov 20 14:16:05 XXXX xntpd[70766]: kernel time sync enabled 6001
Nov 20 14:33:10 XXXX xntpd[70766]: kernel time sync enabled 2001
Nov 20 15:07:19 XXXX xntpd[70766]: synchronized to 172.20.167.251, stratum=2
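
For anyone wanting to check their own syslog for the same thing, here is a minimal Python sketch that picks out large clock steps. The two patterns are copied from the xntpd messages in the excerpt above; the function name and the 1000-second threshold (borrowed from the "sanity limit" in those messages) are my own choices, and other ntpd versions may word these lines differently.

```python
import re

# Patterns taken from the xntpd messages in the log excerpt above.
STEP_PATTERNS = [
    re.compile(r"time reset ([+-]?\d+(?:\.\d+)?) s"),
    re.compile(r"time correction of ([+-]?\d+(?:\.\d+)?) seconds"),
]


def large_steps(lines, threshold=1000.0):
    """Yield (seconds, line) for each logged step larger than threshold."""
    for line in lines:
        for pattern in STEP_PATTERNS:
            match = pattern.search(line)
            if match and abs(float(match.group(1))) > threshold:
                yield float(match.group(1)), line.rstrip("\n")
                break


# Three lines lifted from the Juniper log above.
sample_log = [
    "Nov 19 16:44:24 XXXX xntpd[912]: synchronized to 172.20.167.251, stratum=2",
    "Nov 19 16:44:24 XXXX xntpd[912]: time correction of -378691200 seconds "
    "exceeds sanity limit (1000); set clock manually to the correct UTC time.",
    "Nov 19 17:20:36 XXXX xntpd[70766]: time reset +378691200.387434 s",
]
for seconds, line in large_steps(sample_log):
    print(f"{seconds:+.0f}s  {line[:15]}")
```

To run it against a real file, pass an open file handle instead of `sample_log`, since the function only needs an iterable of lines.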

Jared_Mauch · November 20, 2012, 9:21pm

I'm by no means a time geek, but I have some ideas about what you want and can tell you why I picked the settings I did…

1) The 129.250 ones are my employer-run clocks. It is a good idea to know how accurate they are.

2) The pool ones, some were default (e.g.: fedora) from my OS distro on the machine I took the example from. You will see freebsd, CentOS and others based on your settings. You may even see time.apple.com if you are on macOS.

3) CC ntp pool were selected to provide additional clock diversity.

4) You want low jitter to your clocks. This will allow you to have an accurate timing source. This means don't congest that path. If you want something very reliable, don't run it on a server with the other "misc" functions you need (e.g.: DNS, etc). If it's important, dedicate some hardware to it. If it is of passing importance, use a fair number of peers.

I was playing with the OWAMP software. Having consistent clocks is important for that, (even if they are all off by a few ms). It can be fun to play with and measure things… http://www.internet2.edu/performance/owamp/index.html

5) Monitor your NTP setup periodically. You may see clocks be rejected or outliers. Depending on how close your clocks are, you may see a fair number be unusable. Take this output:

nat:~$ ntpq -n -p -c ass
remote refid st t when poll reach delay offset jitter