Need trusted NTP Sources

TGLASSEY · February 7, 2014, 11:38pm

Raspberry Pi

Bryan_Seitz · February 8, 2014, 4:35am

The SureGPS is decent fun, but I've had this device lose sync / crap out randomly as well.
I am using the Garmin 18X-LVC + a low power server with pretty good success.

(Requires PPS soldering + USB pigtail for power, pretty easy mod)

[seitz@ntp-gps ~]$ ntpq -p
remote refid st t when poll reach delay offset jitter

Majdi_Abbas2 · February 8, 2014, 8:53pm

The Netburner NTP sample app works well enough for basic home
use, although I get better timing performance out of a fleet of hand
modified Soekrii.

I've been modifying NET4801s to include internal Motorola Oncore
timing receivers (this is a tight fit, but doable, in the factory
cases), or to break out their second serial port for connections to
external reference clocks. (I have one connected to a TrueTime TL-3 to
use WWV as a backup to GPS, but it can also be a travelling GPS NTP
server with, say, a Garmin GPS18lvc connected.)

You can make your own sub-$150 NTP server -- I'll spare the
list the details, but those that are interested should see:

http://puck.nether.net/~majdi/ntp/

Feedback is appreciated -- I've only spent about an hour on
this doc, and it assumes a lot of familiarity with FreeBSD. I will
try to flesh it out more as I have time.

Cheers,

--msa

Jay_Ashworth · February 9, 2014, 12:43am

Fair point.

In practice, it never bit me because nearly everything that wanted NTP
would only accept one server name (being windows) and the things that
*did* take more than one, I generally pointed to both internals, and
something outside the firewall as well.

In the architecture I described, though, is it really true that the odds
of the common types of failure are higher than with only one?

Cheers,
-- jra

Jay_Ashworth · February 9, 2014, 12:46am

My two internal servers were my two uplink firewalls, and were pretty
thoroughly monitored. Had NTP gone insane, I'd have heard about it.

Remember that 3 of the 8 peers on each machine were pool.ntp.org machines,
so the cluster, as a cluster, actually had *nine* external peers, each
machine having 3 in common, and three which were not (each machine was
a DNS resolver, so they didn't share a name cache on "*.us.pool.ntp.org").

Cheers,
-- jra

Jay_Ashworth · February 9, 2014, 12:48am

As I've noted, I had *nine* external peers; 3 shared by both machines
(commercial and NIST strat-1's), and 3 each from us.pool, which were
generally different servers; I did keep an eye on that.

And the NTP servers were monitored.

I'm stupid, but I'm not crazy.

Cheers,
-- jra

Saku_Ytti1 · February 9, 2014, 8:03am

I think so. Let's assume, arbitrarily, that the probability of an NTP server
not starting to give incorrect time is 99% over one year.
Then the probability that neither of two servers gives incorrect time is
0.99**2, i.e. about 98%, so with two NTP servers it is about 1 percentage
point more likely that one of them gives incorrect time than with one server,
over one year.
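
The arithmetic above can be sketched in a few lines of Python. This is only an illustration: it assumes the servers fail independently and reuses the arbitrary 99% figure from the post; the function name `p_any_bad` is made up for the example.

```python
def p_any_bad(p_good: float, n: int) -> float:
    """Probability that at least one of n independent servers serves bad time.

    p_good is the (assumed) probability that a single server stays correct
    over the period of interest.
    """
    return 1.0 - p_good ** n


# Compare one, two, three and four servers using the arbitrary 99% figure.
for n in (1, 2, 3, 4):
    print(n, round(p_any_bad(0.99, n), 4))
```

Note that this only counts how often some server goes bad; it says nothing about whether the client can tell which one it is, which is the point of the rest of the thread.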

Obviously the chance of working is more than 99%; maybe it's something like
99.999%? And is that really the typical failure mode, or is the typical
failure mode complete loss of connectivity? Two NTP servers would protect
against this, a single one would not.
However, loss of connectivity has a minor impact on clients, while wrong time
has a major impact on clients.
Maybe if loss of connectivity is typically fixed in a fairly short period of
time, a single NTP server always wins; if it is typically fixed only after a
very long period of time, a single NTP server loses.

I don't really have exact data, but best practice is >2. Matthew said 4, which
gives the advantage that after a single failure you are still operating
redundantly and have no urgency to fix it; with 3, after a single failure
another failure must not occur before the first is fixed.
I think 3 is enough: networks are typically designed to handle 1 arbitrary
failure at a time, and 2 arbitrary failures in most networks, when
chosen correctly, will cause SLA-breaking faults (it is cheaper to pay SLA
compensation than to recover from any 2 failures).
But NTP servers are cheap, so if you want to be robust and recover from n
falsetickers, have 3+n.

Andriy_Bilous · February 9, 2014, 8:08pm

Best practice is five. =) I don't remember if it's in the FAQ on ntp.org or in
David Mills' book. Your local clock is kind of a gullible "push-over" which
will "vote" for the "party" providing the most reasonable data. The algorithm
filters out insane sources which run too far from the rest and then
groups sane sources into 2 "parties" - your clock will follow the one whose
runners are closer to each other. That is why an odd number of trustworthy
sources is required, at least at the start. With 2 sources you will blindly
follow the one which is closer to your own clock. You also run the
risk of degrading into this situation when you lose 1 out of 3 sources.
Four is again 2:2, and only with five do you have a good chance to start
disciplining your clock in the right direction at the right pace, so when
1 source is lost you (most probably) won't run into insanity.
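
As a rough illustration of the "voting" idea, here is a simplified Marzullo-style interval intersection in Python. It is a sketch only, not ntpd's actual selection code: each source is treated as an interval of offset plus or minus an error bound, and the function finds the region that the largest number of sources agree on. The name `best_overlap` and all the sample numbers are made up for the example.

```python
from typing import List, Tuple


def best_overlap(sources: List[Tuple[float, float]]) -> Tuple[int, float, float]:
    """Return (count, lo, hi): the region covered by the most source intervals.

    Each source is (offset_ms, error_bound_ms), i.e. it claims the true
    offset lies in [offset - error_bound, offset + error_bound].
    """
    edges = []
    for offset, err in sources:
        edges.append((offset - err, -1))  # interval start
        edges.append((offset + err, +1))  # interval end
    # Sort by position; at equal positions starts (-1) come before ends (+1),
    # so intervals that merely touch still count as overlapping.
    edges.sort()

    best = count = 0
    best_lo = best_hi = 0.0
    for i, (pos, kind) in enumerate(edges):
        count -= kind  # a start raises the count, an end lowers it
        if count > best:
            best = count
            best_lo = pos
            best_hi = edges[i + 1][0]  # region runs until the next edge
    return best, best_lo, best_hi


# Three sources roughly agree, one is 250 ms off (hypothetical values).
print(best_overlap([(10.0, 5.0), (12.0, 5.0), (8.0, 5.0), (250.0, 5.0)]))
# Two sources that disagree: no region is backed by more than one source.
print(best_overlap([(0.0, 5.0), (100.0, 5.0)]))
```

In the first call three of the four sources share a common region and the outlier is left out. In the second call the best count is 1, so the intervals alone give no majority for either source.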

Jay_Ashworth · February 9, 2014, 8:16pm

That's only true if the two devices have common failure modes, though,
is it not?

Saku_Ytti1 · February 9, 2014, 8:30pm

No, we can assume arbitrary fault which causes NTP to output bad time. With
two NTP servers it's more likely that any one of them will start doing that
than with one alone. And if any of the two start doing it, you don't know
which one.

Jay_Ashworth · February 9, 2014, 8:45pm

Hey, waitaminnit! I saw you palm that card.

If I'm locked to 2 coherent upstreams and one goes insane, I'm going to
know which one it is, because the other one will still match what I already
have running, no?

Or do I understand NTP less well than I think?

Cheers,
-- jra

Saku_Ytti1 · February 9, 2014, 8:56pm

I don't think you can reasonably tell which of the two is the falseticker.
Andriy says your PC would blindly follow the one that is in closer agreement
with your local clock, and PCs have terrible oscillators (I don't know why;
5 EUR would buy a LOT better oscillator).

James_Hess · February 9, 2014, 9:00pm

[snip]

If I'm locked to 2 coherent upstreams and one goes insane, I'm going to
know which one it is, because the other one will still match what I already
have running, no?

The question should be how assured is the reliability of the clocks of the
2 upstream servers. I think I am pretty happy with the concept of
having two local centralized NTP servers, used by various servers in an
environment ---- some SNTP some NTP, each of the local centralized NTP
servers using 5 external time sources.

These external time sources need to be periodically checked, to ensure the
central NTP servers continue to synchronize with them, and that they
continue to be accurate.

So the pair of NTP servers is not redundant in the sense that the time is
allowed to be wrong, but they are resilient in the sense of being
configured, so their own clock should always be correct, unless there
is a once in 100 years failure scenario.

Each of the local servers, then, has two NTP peers as time sources, plus the
local clock discipline, except for virtual machines, which should use
just the two NTP servers.

A local pair of NTP servers is not "redundant" in the sense of being able
to survive a catastrophic software bug in NTP; the local time sources
should be redundant in order to survive the more frequent condition of
temporary total failure of a local NTP server.

Saku_Ytti1 · February 9, 2014, 9:03pm

I'm having a bit of difficulty understanding the issue with 4.

Is the implication that you have two groups whose members agree with each
other reasonably well, but the groups do not agree with each other? That would
mean that 4 cannot handle the situation where 2 develop a problem such that
they agree with each other but are wrong.
But even in that case, you'd still recover from 1 of them being wrong. So

3 = correct time, no redundancy
4 = correct time, 1 can fail
5 = correct time, 2 can fail
and so forth?

But I'm not sure here, just stabbing in the dark. For the fun of it, I sent an
email to Mills; if he replies, I'll post it back here.

Lyle · February 9, 2014, 9:19pm

Look back in the archives and see the problems that erupted when one of the big guys rebooted and came online with bad time (tock.usno.navy.mil in Nov of 2012). It was talked about in Outages and other lists at the time it happened.

Brett_Frankenberger1 · February 9, 2014, 9:36pm

If it suddenly goes insane as a step function? Sure. But if the one
you've selected for synchronization starts drifting off true time very
slowly, it will take your clock with it, and then ultimately the other
one (that is actually the good clock) will appear to be the insane clock.

-- Brett
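
The slow-drift scenario can be shown with a toy model in Python. It is deliberately idealised and makes a strong assumption: the client tracks its selected source exactly at every poll, which a real clock discipline loop does not do. The function name and the numbers are invented for the illustration.

```python
def apparent_offsets(drift_per_poll_ms: float, polls: int):
    """Return (bad_source_offset, good_source_offset) as the client sees them, in ms."""
    bad_err = 0.0     # drifting source minus true time
    client_err = 0.0  # client clock minus true time
    good_err = 0.0    # the good source stays on true time throughout
    for _ in range(polls):
        bad_err += drift_per_poll_ms
        # Assumption: the client follows its selected (drifting) source exactly.
        client_err = bad_err
    # An offset, from the client's point of view, is source time minus client time.
    return bad_err - client_err, good_err - client_err


# 0.125 ms of drift per poll over 800 polls (hypothetical figures).
print(apparent_offsets(0.125, 800))
```

From the client's side the drifting source still shows zero offset, while the good source now looks 100 ms off, so with only these two sources the good one is the one that looks insane.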

Andriy_Bilous · February 9, 2014, 9:41pm

Unfortunately I don't have the book handy. Maybe I am wrong too. I just
checked, and 4 looks to be a valid solution for 1 falseticker according to
the Byzantine Generals' Problem.
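
For reference, the classic Byzantine Generals bound says that tolerating f arbitrarily faulty participants needs at least 3f + 1 participants in total. A minimal Python sketch of that bound follows. Whether ntpd's selection behaves exactly according to it is an assumption the thread does not settle, and the function names are made up for the example.

```python
def max_falsetickers(n_sources: int) -> int:
    """Largest f such that n_sources >= 3*f + 1 (classic Byzantine bound)."""
    return max(0, (n_sources - 1) // 3)


def min_sources(falsetickers: int) -> int:
    """Smallest number of sources that satisfies the 3*f + 1 bound."""
    return 3 * falsetickers + 1


# Tabulate the bound for small source counts.
for n in range(1, 8):
    print(n, "sources ->", max_falsetickers(n), "falseticker(s) tolerated")
```

By this bound 4 sources cover 1 falseticker, as noted above, and it takes 7 sources to cover 2.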