How we upgraded a CFEngine cluster

This post describes how we performed, on a cluster of ours, an upgrade of both the CFEngine policies and the CFEngine software itself, from 3.3.5 to 3.4.4. The tools we used for the job were an editor and meld, to edit the policies and merge changes from different branches, and CFEngine itself (in particular cf-runagent) and clusterssh, to apply the changes. I don’t claim this is a perfect procedure, and it wouldn’t scale to very big clusters; but in our case the cluster was small enough to allow for some manual operations, and we took the opportunity to do things step by step, ensuring that nothing bad could happen to the production nodes at any time. On bigger clusters, with a better test environment and more “instrumentation” to monitor the changes, handwork could be reduced further and the whole update would take less time. In our case, the bulk of the work was done between the start of the workday and the lunch break (about 3 hours).

The cluster

The cluster where we started using CFEngine (let’s label it “A”) was running 3.3.5. The cluster is subdivided into two segments, each one with its own policy hub; we’ll label these two segments A1 and A2.

When we later used CFEngine on a new cluster (let’s call it “B”), we used the then-latest CFEngine version, 3.4.4. We liked it a lot; in particular, we liked that in 3.4 the policy files have moved from a flat directory structure to a hierarchical one, where libraries and user policies live in their own directories, and where promises.cf has been split so that each service (cf-serverd, cf-agent and so forth) has its own configuration/policy file. The experience with implementing 3.4.4 was good, so we decided it was the right time both to restructure A’s policy hierarchy (as we had previously planned) and to upgrade CFEngine to 3.4.4.

The preparations

The first step was to branch the git repository, split promises.cf and reorganize the files. That was done quickly and easily enough.

The second step was to integrate some of the developments from cluster B into A. That required some work and more testing. I must say that using meld to selectively merge the changes made our job significantly easier and faster.

Finally, we needed to ensure that the policies worked with both CFEngine 3.3.5 and 3.4.4. They generally did, but there were some “glitches” that needed to be adjusted so that an agent run didn’t trigger a ton of warnings. That took some more time of debugging and testing, but in the end everything worked.

Throughout the three steps described above, we needed to integrate into this new branch the changes that were being applied to the original branch. That was a rather tedious process, which boosted our will to make the transition happen as soon as possible.

When everything was proven to work in our testing environments on both 3.3.5 and 3.4.4, we did one more sanity check to ensure that all directives that were in promises.cf had been ported correctly to the split files. We found a few more glitches, which were promptly corrected, and everything was tested again, successfully.

The plan, executed

It was finally time to initiate the change. We discussed a few options, and finally settled on the following checklist (the reasoning behind each step is explained below the checklist):

  1. change cf-runagent’s configuration, so that it would run the agent with the no-lock option (-K);
  2. use cf-runagent itself to make all CFEngine’s services restart on all nodes, to ensure that the configuration change was received;
  3. disable the agent on all nodes in segment A1, again by using cf-runagent (what does “disable CFEngine” mean? read on, I’ll explain it later);
  4. deploy the new policies on the policy hub, have it run the policies, check for errors and fix if any;
  5. upgrade CFEngine to 3.4.4, run the policies again, check for errors and fix if any; re-enable CFEngine on the node when done;
  6. when the policy hub is OK, apply the same cycle on one non-production node: first update and run the new policies, check for errors and fix if any; then upgrade, run again, check and fix, and re-enable;
  7. when done with all non-production nodes, apply the same cycle on one role-specific node. If everything works, update all nodes with the same role. Re-enable the agent on every node that was updated successfully;
  8. iterate the previous step for all roles until exhaustion;
  9. at this point, it is reasonable to assume that the policies are OK to be applied on the A2 segment in big chunks;
  10. when finished, change cf-runagent’s configuration again to run the agent without the -K option.

The aim of step 1 was to have the agent run unconditionally when invoked through cf-runagent; without it, the agent would honour all possible locks, like those set by the ifelapsed clause, and might refuse to run when requested if the previous run happened less than one minute before. It is a rather dangerous option to set, as it leaves room for remote denial-of-service attacks and breaks any action that relies on a persistent lock. That’s why we disable it again at step 10, when finished.
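In CFEngine, the command that cf-serverd runs on behalf of cf-runagent is set by the cfruncommand attribute in body server control; step 1 boils down to a change along these lines (the binary path shown is an assumption, not our exact configuration):

```cf3
# Sketch: make agent runs triggered via cf-runagent ignore locks.
# The path to cf-agent is an assumption.
body server control
{
  # cf-serverd executes this command when cf-runagent hails the node;
  # -K (--no-lock) tells the agent to ignore locking constraints.
  cfruncommand => "/var/cfengine/bin/cf-agent -I -K";
}
```

At step 10, we would simply drop the -K option again and let the change propagate.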

At step 2 we use a feature in our cf-runagent configuration: when invoked with the restart_cfdaemons class set, it will force a restart of all CFEngine daemons (you bet!). In version 3.4, this doesn’t seem to be necessary, but 3.3 was not smart enough to detect a configuration change and reload by itself.
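We haven’t reproduced our actual restart policy here, but the idea can be sketched as a commands promise guarded by the restart_cfdaemons class, which cf-runagent can send to the remote agent with its -D (--define-class) option; the bundle name and init script path below are hypothetical:

```cf3
# Hypothetical sketch of the restart logic; bundle name and paths
# may differ from our real policy.
bundle agent cfe_restart
{
  commands:
    restart_cfdaemons::
      # Restart all CFEngine daemons when the class is defined
      # remotely, e.g. via: cf-runagent -D restart_cfdaemons
      "/etc/init.d/cfengine3 restart"
        comment => "Forced restart of cf-serverd, cf-execd and cf-monitord";
}
```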

At step 3 we use yet another feature in our policies: at a very early stage of an agent run, we check if a flag file is present. If it is, the agent sets a class that will make the agent run abort (see the abortclasses directive in the reference for more details). However, if we set the force_run class, the flag file is ignored and the agent runs as normal. With CFEngine disabled, a node won’t update its policies, unless an agent run is forced, or a failsafe run is executed.

At steps 4 and 5, we start the procedure by upgrading the policy hub; that is an obvious choice, as the new policies will be “irradiated” from that node.

At step 6, we start upgrading non-production nodes. The reason for doing this at this stage is as obvious as the previous one: we want to detect and fix any flaw that survived the testing phase before it hits the production nodes.

At step 7, we start upgrading production nodes by role. Each production node is assigned one or more roles, depending on the tasks it is responsible for. For a given role, we first run the upgrade on one node only, to be sure that it won’t break the functionality of any node with the same role. Once we get a green light on this role-specific node, we are reasonably safe to upgrade all the nodes sharing that role. We use cf-runagent to run the policy update on all such nodes in parallel, and clusterssh to upgrade the CFEngine package in batches. We do this for each role until exhaustion (step 8).
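The hosts that cf-runagent contacts by default are listed in body runagent control; a minimal sketch, with placeholder hostnames, could look like this:

```cf3
body runagent control
{
  # Default list of hosts to hail; these names are placeholders.
  hosts => { "node01.example.com", "node02.example.com" };
}
```

With this in place, an unqualified cf-runagent invocation hails the listed nodes in turn, triggering a policy run on each.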

When everything’s done on A1, we are reasonably safe to apply the same changes in the segment A2 (step 9), first on the policy hub, then on the other nodes. This time we’ll do that in bigger chunks, as we have already tested everything on A1. Again, we use cf-runagent to force the policy updates, and clusterssh to upgrade the nodes in batches.

Disabling an agent

I’ve mentioned that we have a policy, run at a very early stage, that we use to disable an agent when we need to. Let’s see it:

bundle agent cfe_disabled
{
  classes:
    !force_run::
      "skip_run"    expression => fileexists("$(cfe_runcontrol.flag)") ;
}

This works together with this directive in controls/cf_agent.cf:

  # The following works together with cfengine runcontrol (services/cfengine.cf)
  abortclasses => { "skip_run" } ;

The bundle is quite simple. The path of the flag file is defined in a vars promise in the cfe_runcontrol bundle. If force_run is not defined, we check for the existence of the flag file. If it exists, the skip_run class is defined, which causes the agent to abort as soon as this bundle has completed.
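For completeness, the flag variable referenced by the bundle comes from a vars promise along these lines (the bundle shape matches what the code above expects, but the flag path shown is an assumption):

```cf3
bundle common cfe_runcontrol
{
  vars:
    # Touch this file on a node to disable normal agent runs there;
    # the actual path in our policies may differ.
    "flag" string => "/etc/cfengine_disabled";
}
```

Disabling a node is then just a matter of creating the flag file, and re-enabling it a matter of removing the file.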

Conclusion

In this post I’ve summarized how we used a few tools, including CFEngine itself, to upgrade CFEngine and its policies. I’ve also shown how we use CFEngine and a flag file to temporarily disable the agent, and how we can override this lock to force an isolated run. The upgrade was planned carefully and executed slowly, to ensure that no bad surprises could happen. This procedure is suitable for clusters of reasonable size, but not for bigger clusters, where different approaches are needed in order to execute the operation in a sensible amount of time.
