Following the inheritance mess I talked about earlier, I had to work hard to make the structure sane. Of course, it is quite difficult to implement such a change in steps; rather, you have to change everything in a single step. Nonetheless, it is possible to go the cautious way. …
I already had a branch in my puppet repository, and it was logical to keep working on that branch. There, I prototyped the class changes semi-visually using graphviz's dot.
Once I was happy enough with the result, I started changing the class manifests, revising the structure each time I stumbled on an unexpected hurdle. At every step, I updated the dot file of the new structure.
At the end of this process, a simple diff between the dot file of the old structure and the new dot file revealed the final version of the changes. It was time to modify the node manifests to reflect the change in the class structure. This, in turn, revealed that a few more classes were not needed any more, and I purged them, getting a neater class graph 🙂
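A plain diff works, but it gets noisy when lines in the dot file are merely reordered. Here is a minimal sketch of an edge-only comparison, assuming the graph is written with one "a -> b" edge per line. The function names dot_edges and dot_changes are made up for illustration, not something from my repository.

```shell
# dot_edges FILE: print the edge lines ("a -> b") of a dot file,
# stripped of indentation and trailing semicolons, sorted and de-duplicated.
dot_edges() {
    grep -e '->' "$1" \
        | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*;*[[:space:]]*$//' \
        | sort -u
}

# dot_changes OLD NEW: list edges that exist only in OLD (prefixed "- ")
# and edges that exist only in NEW (prefixed "+ ").
dot_changes() {
    old_edges=$(mktemp) && new_edges=$(mktemp) || return 1
    dot_edges "$1" > "$old_edges"
    dot_edges "$2" > "$new_edges"
    comm -23 "$old_edges" "$new_edges" | sed 's/^/- /'
    comm -13 "$old_edges" "$new_edges" | sed 's/^/+ /'
    rm -f "$old_edges" "$new_edges"
}
```

With two hypothetical files, the call would be: dot_changes classes-old.dot classes-new.dot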
Once this was finished, it was time to test the new stuff in the test environment. A few more small adjustments, and we were ready for release. Uhm… were we?
OK, let's be overcautious. With a first cycle over all the impacted nodes, I disabled puppet on all of them (puppetd --disable, as you probably know). Then I propagated the changes on the first "distribution server", and everything went well. OK.
I kept changing a few nodes by hand, each time running
puppetd --enable && puppetd --test ; puppetd --disable
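If the list of nodes to try gets long, the same sequence could be wrapped in a small loop. This is only a sketch of the idea, not what I actually ran: canary_run and the RUNNER variable are names invented for illustration, and it assumes root ssh access to the nodes as in the rest of this post.

```shell
# canary_run NODE...: on each node, enable puppet, do one test run,
# then disable puppet again whether or not the run succeeded.
# RUNNER is a hypothetical hook: it defaults to "ssh -n";
# set RUNNER=echo to print what would be executed instead of running it.
canary_run() {
    for node in "$@"; do
        ${RUNNER:-ssh -n} "root@$node" \
            'puppetd --enable && puppetd --test ; puppetd --disable'
    done
}
```

A dry run on two placeholder nodes would be: RUNNER=echo canary_run web1 web2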
After testing the changes on a few nodes per class (and adding a few more minor changes) I was happy enough, and I re-enabled puppet on all nodes so that they could sync. That was yesterday, late afternoon. How could I check this morning whether all of them were actually happy with the change? Let's see!
When something goes wrong with the catalog (which is the problem I was monitoring), you'll get a message like the following one in the log:
Feb 10 14:20:59 mynode puppetd: Could not retrieve catalog; skipping run
So, did we have any yesterday? Let's see:
$ cat * | while read NODE ; do echo -n "$NODE: " ; ssh -n root@$NODE "awk '\$5~/^puppetd/ && /skipping run/' /var/log/syslog.1 | wc -l" ; done 2> /dev/null
(I am not going to cover why that 2> was needed, sorry)
As you can easily guess, I had the node names in some files in the current directory, and I am counting the occurrences of "skipping run" strings issued by puppetd. Luckily, only a few nodes showed a count other than 0. I visited those nodes individually to confirm that they were the ones that had shown minor problems during yesterday's manual tests. A few similar checks for other critical conditions showed that no problems had appeared after applying the changes.
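With many nodes, the interesting lines are the ones whose count is not 0. Below is a minimal sketch of the same awk filter wrapped in a function that can be tried on a local copy of a log, plus a filter for the "node: count" lines the loop prints. Both function names are made up for illustration.

```shell
# count_skipped LOGFILE: count the "skipping run" lines logged by puppetd.
# Field 5 of a syslog line is the program name, e.g. "puppetd:" or "puppetd[1234]:".
count_skipped() {
    awk '$5 ~ /^puppetd/ && /skipping run/ { n++ } END { print n + 0 }' "$1"
}

# nonzero_only: read "node: count" lines on stdin and keep those whose last
# field is not 0. A line with a missing count (e.g. ssh failed) is kept too,
# so unreachable nodes do not silently disappear from the report.
nonzero_only() {
    awk '$NF != 0'
}
```

The second function can simply be appended to the loop above as a final "| nonzero_only".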
Done! The new class hierarchy is now in production!
Oh, by the way: are any of the nodes still in a disabled state?
$ cat * | while read NODE ; do echo -n "$NODE: " ; ssh -n root@$NODE ls /var/lib/puppet/state/puppetdlock || echo ; done 2> /dev/null
No, none of them is 😉
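The ls trick prints either the path or an empty line, which is easy to misread in a long list. A small sketch of a more explicit check follows, using the same lock file path as above. The function name lock_state and its optional directory argument are invented here, mostly so the check can be tried on a scratch directory.

```shell
# lock_state [STATEDIR]: print "locked" if the puppetd lock file exists in
# STATEDIR (default: /var/lib/puppet/state), otherwise print "unlocked".
lock_state() {
    statedir=${1:-/var/lib/puppet/state}
    if [ -e "$statedir/puppetdlock" ]; then
        echo locked
    else
        echo unlocked
    fi
}
```

Over ssh, the same test fits on one line: test -e /var/lib/puppet/state/puppetdlock && echo locked || echo unlocked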