The leap second is finally behind us, and for the first time it was turned into a media event. That had the unfortunate consequence that many channels where useful information had flowed during previous events were now flooded with bullshit. But it’s over. A giant army of idiots has finally stopped asking “what will you do with your extra second?”, and a smaller but still noticeable army of inaccurate writers and journalists will, at least for a while, stop writing that atomic clocks need to be stopped for a second to realign with the Earth (?!). We can now sit back and save some takeaways for the next edition of the event.
- Don’t let the clients step the clock back, but allow the servers to do so if you can afford it. Our initial plan was to avoid backward steps on both our servers and our clients; it worked well in the simulations, and we usually had convergence in 17-48 hours, depending on how good the upstream servers were. However, it turned out that if you can afford to let your servers step while still not allowing the clients to, convergence comes much faster. Our final configuration had stepping servers and non-stepping clients with small poll intervals (64 or 128 seconds). That gave us convergence in 6-8 hours, something to keep in mind for the next leap second.
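A client-side setup along those lines might look like the following ntp.conf sketch. The hostnames are illustrative, not our actual upstreams; the directives themselves (`minpoll`/`maxpoll` and `tinker step`) are standard ntpd configuration.

```
# Client-side ntp.conf sketch (hostnames are illustrative).
# minpoll 6 / maxpoll 7 keep the poll interval between 2^6 = 64 s
# and 2^7 = 128 s.
server ntp1.example.com iburst minpoll 6 maxpoll 7
server ntp2.example.com iburst minpoll 6 maxpoll 7

# Never step the clock: a step threshold of 0 disables stepping,
# so ntpd always slews toward the correct time.
tinker step 0
```

The servers simply omit `tinker step 0`, so they keep ntpd's default behavior of stepping when the offset exceeds the step threshold (128 ms by default).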
- Disable the kernel discipline. It wasn’t necessary this time because there were no kernel bugs (by the way, congratulations to the Linux kernel developers: good job!). Bugs may well show up again, though, so it’s better to switch the kernel discipline off for good measure.
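In ntpd this is a one-line configuration change: with the kernel discipline disabled, ntpd adjusts the clock itself instead of handing the correction (and the kernel's own leap second handling) over to the kernel.

```
# ntp.conf: use ntpd's daemon discipline instead of the kernel discipline,
# keeping the kernel's leap second code out of the picture.
disable kernel
```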
- You can’t assume that your upstream servers are reliable, not even when they are public stratum 1 servers. A few hours before the leap second insertion, we noticed that in two locations our servers hadn’t received a proper leap second warning from their upstreams. To make our servers handle the leap second properly, we had to quickly add a leap second file to the ntpd configuration. This is something to do by default next time.
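The relevant ntpd directive is `leapfile`; the file path below is illustrative, as distributions ship the leap-seconds list in different locations (it can also be downloaded from IERS or NIST).

```
# ntp.conf: read leap second information from a local leap-seconds file,
# so the daemon announces the leap correctly even when the upstream
# servers fail to set the leap warning bits.
leapfile /etc/ntp/leap-seconds.list
```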
- Use configuration management. The simulations go a long way toward providing data, selecting the most effective strategies and discarding the ineffective ones. However, more is learned once the final strategy is implemented, and on-the-fly configuration changes will likely be required. A configuration management tool is key to deploying configuration changes quickly and reliably, and to restoring a standard configuration once the leap second has been handled. In our case, CFEngine helped a lot. For example, a few days before July 1st we decided to switch from non-stepping servers to stepping servers: a simple search and replace with Perl modified the classification of the NTP servers, one command deployed the new information, and all the servers reconfigured themselves within 5 minutes. Another example is the configuration change that added the leap second file to the NTP servers: with only a few hours left, it was important to deploy the change reliably.
- Have some monitoring in place to observe the evolution of the offset on the servers and, possibly, on some key clients. In our case, Munin and a custom “NTP dashboard” helped.
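A minimal building block for such a dashboard is extracting the per-peer offset from `ntpq -pn` output. The sample output below is illustrative; in real use you would pipe `ntpq -pn` straight into the awk filter and feed the result to your graphing tool.

```shell
# Illustrative ntpq -pn output (two fake stratum 1 peers).
sample='     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*192.0.2.10      .PPS.            1 u   33   64  377    0.421   -0.012   0.004
+192.0.2.11      .GPS.            1 u   12   64  377    1.207    0.034   0.011'

# Skip the two header lines; field 1 is the peer, field 9 the offset in ms.
printf '%s\n' "$sample" | awk 'NR > 2 {print $1, $9}'
```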
- Run the simulations in good time and on an appropriate number of machines. To run proper simulations of our standard configuration we need at least 3 months and 9 machines: 4 machines to impersonate the upstream servers, 4 machines to impersonate our servers, and one machine to impersonate a client.
- Test leap smear/leap slew strategies. I learned from Miroslav Lichvar’s blog post that chrony implements both a leap second slew and a leap second smear strategy, and I read that ntpd implemented a leap smear strategy in version 4.2.8p3 final. Definitely something to test next time.
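For reference, and as a starting point for those future tests, the relevant directives look roughly like this. These are from the chrony and ntpd documentation, not configurations we have tested ourselves.

```
# chrony.conf: slew the correction after the leap instead of stepping.
leapsecmode slew

# chrony.conf: smear the leap into the time served to clients
# (arguments: max frequency offset in ppm, max wander, leap-only mode).
smoothtime 400 0.001 leaponly

# ntp.conf (4.2.8p3 and later): smear the leap second over the given
# interval, in seconds.
leapsmearinterval 86400
```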
And now let’s keep our eyes open for the next Bulletin C, which will tell us whether we’ll have a leap second on New Year’s Day. Until then… enjoy!