Detecting runaway processes

July 27th, 2012 - 12:07 / May 20th, 2013 - 15:33 / bronto / 1 Comment

A simple one-liner I used to take a stroll through a few systems and see where I had a runaway backend process. This uses ps's user-defined output format option (-o); I almost never use it, so I thought it was better to make a note to myself 🙂
That sort command could be crafted better, but it did its job in this case.

$ for SERVER in x{1..6} ; do echo $SERVER ; ssh -n $SERVER 'ps -C backend -o bsdtime,pid,comm | grep -v TIME | sort -rn | head -n 3' ; done
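As a possible refinement of the sort step: sort -rn only reads the leading minutes figure of bsdtime numerically, so one option is to convert the MMM:SS value to seconds first. The sketch below is only an illustration; the function name by_cpu_seconds is made up for this note, and it assumes bsdtime is printed as MMM:SS in the first column.

```shell
# Hypothetical helper (not part of the original one-liner):
# reads "ps -o bsdtime,pid,comm" output on stdin, prefixes each line with
# the CPU time converted to seconds, and prints the N busiest processes.
by_cpu_seconds() {
  awk '$1 ~ /^[0-9]+:[0-9][0-9]$/ {      # skips the TIME header line
         split($1, t, ":")               # t[1] = minutes, t[2] = seconds
         print t[1] * 60 + t[2], $0      # seconds first, then the original line
       }' | sort -k1,1nr | head -n "${1:-3}"
}
```

In the loop above, the remote command would then end in something like `ps -C backend -o bsdtime,pid,comm | by_cpu_seconds 3`, with the function defined on the remote side or the filtering done locally on the ssh output.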

I'd like to investigate how to manage such a situation automatically using CFEngine. Will try that sooner or later 😉

Happy Sysadmin day!

July 27th, 2012 - 12:07 / May 20th, 2013 - 15:33 / bronto

From http://t.co/JkbyVYiQ

cfengine vs vpn: 1-0

July 19th, 2012 - 22:07 / May 20th, 2013 - 15:33 / bronto / 3 Comments

I had quite an annoying problem on my laptop, which I solved using cfengine.

When the VPN software runs, it creates a virtual tun0 interface and changes a few things in the network configuration (e.g.: routes, /etc/resolv.conf,…). A problem arises when the DHCP lease is renewed on the physical interface, eth0 or wlan0: in fact, resolv.conf gets rewritten, and I can't resolve internal network addresses any more until I put a valid resolv.conf back in place.

A few days ago, while on vacation, I finally adapted my existing policies to run on my laptop. One of the policies keeps an eye on resolv.conf while I am on VPN, and rewrites it if dhclient does the smartass. I am testing it today for the first time, and I am really pleased to find this message in my mailbox:

Subject: community [cooper/192.168.0.5]
Date: Thu, 19 Jul 2012 20:46:34 +0200
From: cfengine@localhost
To: bronto@localhost

R: Repaired resolver configuration in /etc/resolv.conf

So I'm pretty safe: if dhclient messes with my resolver, cfengine will set it back in less than 5 minutes. Isn't that nice? 😉
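For readers who don't run cfengine, the same check can be roughed out in plain shell. This is not the policy described above, just a sketch under assumptions: the function name repair_resolver, the reference copy /etc/resolv.conf.vpn and the fourth argument (there only to make testing easier) are all invented for illustration.

```shell
# Hypothetical sketch: put a known-good resolv.conf back while the VPN is up.
# Arguments (all assumptions): target file, known-good copy, VPN interface,
# directory where the kernel lists network interfaces.
repair_resolver() {
  target=${1:-/etc/resolv.conf}
  reference=${2:-/etc/resolv.conf.vpn}
  iface=${3:-tun0}
  sysnet=${4:-/sys/class/net}

  # Do nothing unless the VPN interface exists
  [ -e "$sysnet/$iface" ] || return 0

  # Do nothing if the resolver configuration is already the expected one
  cmp -s "$reference" "$target" && return 0

  # Otherwise restore the known-good copy and report the repair
  cp "$reference" "$target" &&
    echo "Repaired resolver configuration in $target"
}
```

Run from cron every few minutes, this would give roughly the same "set it back" behaviour, minus everything else a real policy takes care of.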

Oh, and of course it does more than that. Depending on the location I am in, and whether I am in VPN or not, it reconfigures ntpd and restarts it, so that I always use the best configuration. But I don't want to bother you with the gory details, so I'll stop here 😉

The hectic week of the leap second

July 5th, 2012 - 16:07 / April 16th, 2015 - 19:23 / bronto

Last week was a hectic period:

we’ve been hit by the leap second announcement bug
others around the internet have been, as well
…not to mention those hit by the leap second itself
Wired put my name on an article twice, which started a “citation spree” around the globe
I’ve finally followed through on my resolution to join Twitter (@brontolinux)
I’ve been appointed as one of the CFEngine champions 2012

I’ll try to sum up, and conclude with a take-away lesson for the next leap second.

Continue reading →

Who’s my cfengine policy hub?

June 19th, 2012 - 17:06 / May 20th, 2013 - 15:33 / bronto

This could be either a trivial question or an arcane mystery. For me it was the latter, until I did some research and found out that the answer was quite easy 🙂

So, how does cfengine know which host is your policy hub once you have bootstrapped a client?

The surprisingly simple answer is: by reading the contents of $(sys.workdir)/policy_server.dat, usually /var/cfengine/policy_server.dat

Good to know, when you are planning to point a client to a different hub for any reason 🙂
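A quick way to check this on a client, as a minimal sketch: the function name policy_hub is made up here, and the default work directory is the usual /var/cfengine mentioned above.

```shell
# Hypothetical helper: print the policy hub recorded at bootstrap time.
# The optional argument is the CFEngine work directory (assumed default:
# /var/cfengine).
policy_hub() {
  workdir=${1:-/var/cfengine}
  cat "$workdir/policy_server.dat"
}
```

Calling `policy_hub` with no arguments prints whatever was stored in /var/cfengine/policy_server.dat when the client was bootstrapped.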

Why I gave up puppet and chose cfengine 3

June 17th, 2012 - 12:06 / May 20th, 2013 - 16:11 / bronto / 9 Comments

But before even beginning to tell the story, I'd like to say that I have nothing against the puppet community, which is in fact a great one: be it on IRC or in a mailing list, they were always helpful, never rude; yes, I hated when they said "well, puppet is not designed to do that", but at least they were honest 😉

The fact is: I never fell in love with puppet. I've always been frustrated by its unpredictability, badly implemented functionalities (fileserver anyone?), bad design decisions (e.g.: why use the hostname as a key in the hosts file, where all systems use the IP as the key?), and no scalability out-of-the-box (e.g.: no provisions for hierarchical puppetmaster architectures, need for external resources like an nginx reverse proxy to make it scale…) to mention the first ones that come to mind.

With all this bad, how did it happen that I got trapped into puppet?

It's a long story, so take your time and relax. … Continue reading →

Scapegoats

June 6th, 2012 - 10:06 / May 20th, 2013 - 15:33 / bronto

We tried to make a wrapper around cpan2rpm at one point. It kinda sorta worked once in a while when the moon was full and we sacrificed a goat first.
— Jesse Becker

😆

For those who don’t know what to expect during the leap second…

June 5th, 2012 - 10:06 / May 20th, 2013 - 15:33 / bronto

…my friend and ex-colleague Giovanni pointed me to an interesting post about an Oracle crash on the occasion of the leap second we had in 2008. Enjoy the post.

An humble attempt to work around the leap second

June 1st, 2012 - 14:06 / June 4th, 2015 - 17:20 / bronto / 6 Comments

Note: this article is now obsolete, please have a look at A humble attempt to work around the leap second, 2015 edition. Thanks.

Some background
Back in March, I talked about the experiments I was conducting to manage the leap second coming at the end of June 30th, 2012. Although the leap second was first introduced in the early 70s, and we have never had a negative leap second to date, a number of applications and systems still rely on some wrong assumptions, namely:

every minute always lasts 60 seconds
time read from the system clock is monotonic
two consecutive reads of a UNIX timestamp, happening at least one second after the other, will result in the second timestamp being bigger than the first one (rephrase of the previous point in the UNIX/POSIX world)

It is a pity that, exactly forty years after the first leap second, systems and applications still rely on these assumptions and can crash badly when, during a leap second insertion, they find themselves in a situation they didn’t expect.

David Mills, the inventor of NTP, in his document “The NTP Timescale and Leap Seconds” suggests how the leap second should be implemented on all systems that always assume 60-second minutes. If that were correctly implemented in, e.g., the Linux kernel, we’d have no need to work around any issue, as time would still be monotonic during the leap second transition. Unfortunately, that is not the case, and Linux will suddenly step back one second when the clock reaches July 1st, 2012, 00:00:00.

The procedure described below will help you avoid the step, and recover from the one-second excess the clock will then have relative to its time sources. However, this procedure is far from ideal in a number of situations, and if you decide to apply it on your systems you do so at your own risk. My advice is: go for this procedure only where the risk of a system crash due to a leap second is higher than the risk of a misbehavior due to two systems having an offset of some tenths of a second; and do that only after some testing. … Continue reading →

The cafe may well be closed at 9PM…

April 30th, 2012 - 23:04 / May 20th, 2013 - 15:33 / bronto

…but cafe’s wireless is open 24/7 :p

(Sorry for the poor quality of the photo, but it was taken with a phone, in the evening, and from a long distance. Anyway, I think the subject is quite clear 🙂)

A sysadmin's logbook

in every challenge there is an opportunity
