Detecting runaway processes

A simple one-liner I used to take a stroll through a few systems and see where I had a runaway backend process. It uses ps's extended output options (-o); I almost never use those, so I thought it was better to make a note to myself 🙂
That sort command could be crafted better, but it did its job in this case.

$ for SERVER in x{1..6} ; do echo "$SERVER" ; ssh -n "$SERVER" 'ps -C backend -o bsdtime,pid,comm | grep -v TIME | sort -rn | head -n 3' ; done
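As a sketch of how the sort could be crafted better: bsdtime prints accumulated CPU time as MMM:SS, and a plain sort -rn only compares the leading minutes numerically. Splitting on the colon lets sort compare minutes first and seconds second. The function names top_cpu and scan_servers are made up for this note, and the empty-header form (-o bsdtime= and friends) stands in for the grep -v TIME; take it as a variation I'd try, not as what I actually ran.

```shell
# Sort "MMM:SS PID COMMAND" lines (ps bsdtime format) by CPU time, highest first.
# $1 is how many lines to keep (default 3).
top_cpu() {
    sort -t: -k1,1nr -k2,2nr | head -n "${1:-3}"
}

# Hypothetical wrapper around the original loop; nothing runs until it is called.
scan_servers() {
    for server in "$@"; do
        echo "$server"
        # Empty headers (bsdtime= etc.) make ps skip the header line.
        ssh -n "$server" 'ps -C backend -o bsdtime= -o pid= -o comm=' | top_cpu 3
    done
}

# Example: scan_servers x1 x2 x3 x4 x5 x6
```

The sorting happens on the local side here, so the remote command stays as short as possible.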

I'd like to investigate how to manage such a situation automatically using CFEngine. Will try that sooner or later 😉

cfengine vs vpn: 1-0

I had quite an annoying problem on my laptop, which I solved using cfengine.

When the VPN software runs, it creates a virtual tun0 interface and changes a few things in the network configuration (e.g. routes, /etc/resolv.conf, …). A problem arises when the DHCP lease is renewed on the physical interface (eth0 or wlan0): resolv.conf gets rewritten, and I can't resolve internal network addresses any more until I put a valid resolv.conf back in place.

A few days ago, while on vacation, I finally adapted my existing policies to run on my laptop. One of the policies keeps an eye on resolv.conf while I am on the VPN, and rewrites it if dhclient plays the smartass. I am testing it today for the first time, and I am really pleased to find this message in my mailbox:

Subject: community [cooper/192.168.0.5]
Date: Thu, 19 Jul 2012 20:46:34 +0200
From: cfengine@localhost
To: bronto@localhost

R: Repaired resolver configuration in /etc/resolv.conf

So I'm pretty safe: if dhclient messes with my resolver, cfengine will set it back in less than 5 minutes. Isn't that nice? 😉
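The policy itself isn't shown here. Just to illustrate the idea, this is a rough shell equivalent, not the CFEngine policy: compare resolv.conf with a known-good copy while the VPN interface exists, and put the good copy back if they differ. The function names, the /etc/resolv.conf.vpn path and the check via /sys/class/net are all assumptions made for the example; the message simply mimics the one in the mail above.

```shell
# True if the given network interface exists (Linux-specific check; default tun0).
vpn_is_up() {
    [ -e "/sys/class/net/${1:-tun0}" ]
}

# Copy the known-good file ($1) over the live one ($2) only if they differ.
restore_if_changed() {
    good=$1
    live=$2
    # Nothing to do when the live file already matches.
    cmp -s "$good" "$live" && return 0
    cp "$good" "$live" && echo "R: Repaired resolver configuration in $live"
}

# Possible use from cron every 5 minutes (hypothetical paths):
#   vpn_is_up tun0 && restore_if_changed /etc/resolv.conf.vpn /etc/resolv.conf
```

Unlike cfengine, a cron job like this knows nothing about classes or locations, so it only covers the simplest case.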

Oh, and of course it does more than that. Depending on the location I am in, and whether I am on the VPN or not, it reconfigures ntpd and restarts it, so that I always use the best configuration. But I don't want to bother you with the gory details, so I'll stop here 😉

The hectic week of the leap second

Last week was a hectic period:

  • we’ve been hit by the leap second announcement bug
  • others around the internet have been, as well
  • …not to mention those hit by the leap second itself
  • Wired put my name on an article twice, which started a “citation spree” around the globe
  • I’ve finally acted on my resolution to join Twitter (@brontolinux)
  • I’ve been appointed as one of the CFEngine champions 2012

I’ll try to sum up, and conclude with a take-away lesson for the next leap second.

Who’s my cfengine policy hub?

This could be either a trivial question or an arcane mystery. For me it was the latter, until I did some research and found out that the answer was quite easy 🙂

So, how does cfengine know what your policy hub is, once you have bootstrapped a client?

The surprisingly simple answer is: by reading the contents of $(sys.workdir)/policy_server.dat, usually /var/cfengine/policy_server.dat

Good to know, when you are planning to point a client to a different hub for any reason 🙂
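For a quick check from the shell, something like this should do. It is only a sketch: policy_hub is a name made up here, and /var/cfengine is assumed as the work directory unless you pass a different one.

```shell
# Print the policy hub recorded at bootstrap time.
# $1 is the CFEngine work directory (assumed default: /var/cfengine).
policy_hub() {
    workdir=${1:-/var/cfengine}
    file=$workdir/policy_server.dat
    if [ -r "$file" ]; then
        cat "$file"
    else
        echo "no readable $file: is this client bootstrapped?" >&2
        return 1
    fi
}

# Example: policy_hub            (uses /var/cfengine)
#          policy_hub /opt/cfe   (hypothetical alternative work directory)
```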

Why I gave up puppet and chose cfengine 3

But before even beginning to tell the story, I'd like to say that I have nothing against the puppet community, which is in fact a great one: be it on IRC or in a mailing list, they were always helpful, never rude; yes, I hated when they said "well, puppet is not designed to do that", but at least they were honest 😉

The fact is: I never fell in love with puppet. I've always been frustrated by its unpredictability, badly implemented features (fileserver anyone?), bad design decisions (e.g.: why use the hostname as a key in the hosts file, where all systems use the IP as the key?), and no scalability out-of-the-box (e.g.: no provisions for hierarchical puppetmaster architectures, need for external resources like an nginx reverse proxy to make it scale…), to mention just the first ones that come to mind.

With all these drawbacks, how did I end up trapped in puppet?

It's a long story, so take your time and relax. …

An humble attempt to work around the leap second

Note: this article is now obsolete, please have a look at A humble attempt to work around the leap second, 2015 edition. Thanks.


Some background
Back in March, I talked about the experiments I was conducting to manage the leap second coming at the end of June 30th, 2012. Despite the fact that the leap second was first introduced in the early 70s, and that we have never had a negative leap second to date, a number of applications and systems still rely on some wrong assumptions, namely:

  • every minute always lasts 60 seconds
  • time read from the system clock is monotonic
  • two consecutive reads of a UNIX timestamp, taken at least one second apart, will result in the second timestamp being bigger than the first one (a rephrasing of the previous point in the UNIX/POSIX world)

It's too bad that, exactly forty years after the first leap second, systems and applications still rely on these assumptions and can crash badly when, during a leap second insertion, they find themselves in a situation they didn’t expect.

David Mills, the inventor of NTP, in his document “The NTP Timescale and Leap Seconds” suggests how the leap second should be implemented on all systems that always assume 60-second minutes. If that were correctly implemented in, e.g., the Linux kernel, we’d have no need to work around any issue, as time would still be monotonic during the leap second transition. Unfortunately, that is not the case, and Linux will suddenly step back one second when the clock reaches July 1st, 2012, 00:00:00.
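To picture what that step does to the third assumption above, here is a small sketch that reads UNIX timestamps, one per line, and flags every reading that is not bigger than the previous one. The function name is made up, and the sample values in the usage note are computed from the date (1341100800 is 2012-07-01 00:00:00 UTC), not captured from a real system.

```shell
# Read one UNIX timestamp per line and report every reading that is
# not strictly bigger than the one before it.
find_backward_steps() {
    awk 'NR > 1 && $1 <= prev { printf "line %d: %s follows %s\n", NR, $1, prev }
         { prev = $1 }'
}
```

Feeding it readings taken once per second around the leap second, e.g. `printf '%s\n' 1341100798 1341100799 1341100799 1341100800 | find_backward_steps`, reports the repeated 1341100799 on line 3: two reads one second apart, same timestamp.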

The procedure described below will help you avoid the step, and recover from the extra second the clock will have compared to its time sources. However, this procedure is far from ideal in a number of situations, and if you decide to apply it on your systems, you do so at your own risk. My advice is: go for this procedure only where the risk of a system crash due to a leap second is higher than the risk of misbehavior due to two systems having an offset of some tenths of a second; and do that only after some testing. …