Debugging NTP again (part 1)

Xen servers are well known for having time synchronization problems. I have a few here, too: two datacenters, each with three multicast NTP servers and two Xen servers. In each datacenter, one of the Xen servers works like a charm, while the other's offset oscillates with growing amplitude until ntpd resets and the cycle starts again. The known workarounds just made things worse.

Having a Xen server's clock suddenly go backwards from time to time isn't any good. This really calls for a deep debug. Debugging needs data. To collect data that shows the problem, it's enough to activate a few statistics in ntp.conf. Namely, you need to add at least the following lines:

# Enable this if you want statistics to be logged.
statsdir /var/log/ntpstats/

statistics loopstats peerstats clockstats
filegen loopstats file loopstats type day enable
filegen peerstats file peerstats type day enable

Once you restart ntpd, it starts collecting information in /var/log/ntpstats/. You may want to check ntpd's documentation to understand what the collected information is and how these files are named and formatted.
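As a rough orientation, here is a minimal Python sketch of how one loopstats line could be split into named fields. It assumes the seven-column layout described in ntpd's monitoring documentation (day, seconds past midnight, offset, frequency, jitter, wander, time constant); the `LoopSample` and `parse_loopstats_line` names and the sample line are made up for illustration, so check the layout against your own ntpd version.

```python
from typing import NamedTuple


class LoopSample(NamedTuple):
    """One loopstats record (field layout is an assumption, see above)."""
    mjd: int            # Modified Julian Day
    seconds: float      # seconds past midnight (UTC)
    offset: float       # clock offset, in seconds
    freq_ppm: float     # frequency offset, in parts per million
    jitter: float       # RMS jitter, in seconds
    wander_ppm: float   # RMS frequency wander, in ppm
    time_const: int     # clock discipline time constant (log2)


def parse_loopstats_line(line: str) -> LoopSample:
    """Split a whitespace-separated loopstats line into typed fields."""
    f = line.split()
    return LoopSample(int(f[0]), float(f[1]), float(f[2]), float(f[3]),
                      float(f[4]), float(f[5]), int(f[6]))


# Made-up sample line, not real measurement data.
sample = parse_loopstats_line("55416 3600.123 0.001234 12.345 0.000567 0.001234 6")
print(sample.seconds, sample.offset)
```

Columns 2 and 3 (`seconds` and `offset`) are the ones that matter most when looking for an oscillating clock.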

To plot the graphs I use gnuplot:

plot "./server1/loopstats" using 2:3 with linespoints, "./server2/loopstats" using 2:3 with linespoints

This opens a window with one curve per server, plotting the offset (column 3) against the time of day (column 2), so you can see how the offset evolves on server1 and server2.
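If you'd rather get a list of the worst samples than eyeball the graph, something like the following Python sketch might help. It assumes the same loopstats columns as the gnuplot command (column 2 is seconds past midnight, column 3 is the offset in seconds) and uses 0.128 s as the cutoff, which is ntpd's documented default step threshold; the `large_offsets` helper and the sample lines are hypothetical.

```python
STEP_THRESHOLD = 0.128  # seconds; ntpd's default step threshold (assumed unchanged)


def large_offsets(lines, threshold=STEP_THRESHOLD):
    """Return (seconds_past_midnight, offset) for samples beyond the threshold."""
    hits = []
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or truncated lines
        seconds, offset = float(fields[1]), float(fields[2])
        if abs(offset) > threshold:
            hits.append((seconds, offset))
    return hits


# Made-up loopstats lines, for illustration only.
demo = [
    "55416 100.0 0.010 1.0 0.001 0.001 6",
    "55416 200.0 -0.200 1.0 0.001 0.001 6",
    "55416 300.0 0.150 1.0 0.001 0.001 6",
]
for seconds, offset in large_offsets(demo):
    print(seconds, offset)
```

With a real file you would pass an open file object, for example `large_offsets(open("./server1/loopstats"))`.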

One of the problems seemed to be that one of the servers was, from time to time, sending NTP packets with a different source address, confusing the "client". To verify this, a simple pipeline is enough:

zcat peerstats.20100808.gz | awk '{ print $3 }' | sort | uniq -c | sort -rn

If something is wrong in one server's reachability, you'll see it towards the bottom of the list, with a count significantly different from the rest of the lot.
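The same count can be done in Python if you want to post-process the result. This is only a sketch mirroring the pipeline above: it assumes the peer address is the third whitespace-separated field of each peerstats line, exactly as the awk command does. The function names and the sample lines (using documentation-range addresses) are made up.

```python
import gzip
from collections import Counter


def count_peer_addresses(lines):
    """Count how often each peer address (third field) appears, most frequent first."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) >= 3:
            counts[fields[2]] += 1
    return counts.most_common()


def count_peers_in_file(path):
    """Same count, reading a gzip-compressed peerstats file as text."""
    with gzip.open(path, "rt") as fh:
        return count_peer_addresses(fh)


# Made-up peerstats lines, for illustration only.
demo = [
    "55416 10.0 192.0.2.1 9614 0.001 0.002 0.003 0.0001",
    "55416 20.0 192.0.2.1 9614 0.001 0.002 0.003 0.0001",
    "55416 30.0 192.0.2.1 9614 0.001 0.002 0.003 0.0001",
    "55416 40.0 192.0.2.99 9014 0.050 0.002 0.003 0.0001",
]
for address, count in count_peer_addresses(demo):
    print(count, address)
```

An address that shows up only a handful of times, like the last one in the sample, is the kind of outlier to look for.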

Restricting the spurious address in ntp.conf had no effect:

restrict N.N.N.N ignore

while adding an explicit route for multicast packets fixed the problem at the source:

route add -net netmask dev eth0

but unfortunately with no benefit on the Xen server clock side 😦

I still couldn't get the rebel machine right, and I'm now attempting a different solution. Stay tuned for part 2 😉

