The classification problem: challenges and solutions

Update March 1st, 2015: the latest version of the code for hENC is now on github

It’s been about a month since I came back from FOSDEM and cfgmgmtcamp, a month in which I gradually recovered from the backlog both in the office and at home. It’s been a wonderful experience, especially at cfgmgmtcamp, and I really want to thank all those who helped make it special — more details at the end of this article.

But promise is debt (no pun intended with promise theory here), and I promised to write a long blog post with some (or all) the details from my talks. It’s time to keep that promise. So, without any further ado…

Part 1: the problem

Configuration management was the theme of the devroom where I presented at FOSDEM, and the theme of the whole cfgmgmtcamp. One would expect that every single attendee knew the definition of it. But maybe talking about “the” definition is a bit too much.

In fact, there are so many tools today that claim to do configuration management, built around such diverse philosophies (CFEngine, Puppet, Chef, Ansible, SaltStack, BCFG2, LCFG…), that if you tried to infer the definition of configuration management from what your tool of choice does or does not do, you’d get very different definitions. What is configuration management then?

I looked for a definition by Mark Burgess (who else?!), only to find that he, in turn, refers to the IEEE Glossary of Software Engineering Terminology (Standard 729-1983):

“Configuration Management is the process of identifying and defining the items in the system, controlling the change of these items throughout their life-cycle, recording and reporting the status of items and change requests, and verifying the completeness and correctness of items”.

Each part of this definition could be “exploded” into details, and we could probably elaborate for hours about how each of the tools mentioned above addresses (or doesn’t address) each part of the process. Instead, I’ll concentrate on the first part, namely: the process of identifying and defining the items in the system.

It’s no accident that that part comes first. In fact, doing any of the things that come after it would be anywhere from difficult to impossible if the systems were not identified beforehand. The process of identifying the systems by capturing their characteristics (e.g.: location; role in the infrastructure; networks they belong to; processes they run, must run or must not run; filesystems they have access to and how much free space those filesystems must have as a minimum…) is often referred to as classification.

When we classify systems, we define their common traits (the classes). Classes collect nodes together in broadly defined sets: within each class most of the nodes share exactly the same traits, and different settings are applied to different classes of nodes. Only after correctly classifying a node is it possible to decide which sets of configurations should be applied to it or, if you prefer, which state a system should converge to. In fact, in configuration management you don’t apply a configuration to a system; rather, you apply it to classes of systems! A node can be effectively managed only if it can be tied to the classes that describe its features, which makes classification a critical process. Like all critical processes, classification either scales with the infrastructure, or it becomes a hindrance to its growth.

Part 2: the challenges

Settings are supposed to be homogeneous across all systems in a certain class. However, there are nodes that are part of a class, and yet need some special settings that are different from all (or most) other nodes in the same class. One can well say that exceptions are actually the rule. Let me show what I mean with a concrete example.


In your infrastructure, you probably have a few settings that apply to all machines, whatever they do and wherever they are. Then, you probably have some settings that are location-specific, and are supposed to be the same for all machines in the same location. But you have exceptions:

  • all machines in a given location use the same two DNS servers, but you may prefer that each one of those two DNS servers points to itself on the loopback interface as a first choice;
  • all machines share the same NTP configuration, all but the four servers that will point to other servers upstream and will have different restrictions and options;
  • all machines will share the same rsyslog configuration, all but the server/relay in that location;
  • SSH settings may differ for some subtle details: some may need to have root logins enabled (for as bad as it is), some may need password authentication to be enabled; some may want to use a different location for the authorized_keys file, for some reason…
  • …and there may be more and more…;
  • …and other locations may have similar exception cases…
  • …and other locations may have even more complicated cases, for example: both public and private addresses in use, with machines using either of them or both, and each case needing a different set of exceptions to be applied

If classification is critical, then being able to cope with exceptions efficiently in the classification process is crucial.

It’s easy to understand that a simple, “internal” node classification done in the tools’ configuration files doesn’t scale; let’s see a few reasons why:

  • coarse-grained classes can be defined with simple expressions and are OK to stay in configuration files, but exceptions may well start a “definitions explosion”, where, in the best case, you’ll want to apply some rules to class1&!exception1 and others to exception1 separately. You’ll begin to throw bigger and bigger rules and expressions into your policies, which will quickly grow into an inextricable mess, be slower for the interpreter to parse, and be difficult for humans to understand;
  • it may be problematic to report on which settings are applied to a node when all the class information lies inside manifests/policies/recipes/…
  • «The business should define what systems belong in which classes.  The Cfengine administrator should build policy.  Once the Cfengine administrator is left to manually defining classes within policy, you become a bottleneck» (M.Svoboda, more on this later)

Let’s see these three main points, one by one.

Definitions explosion


Every time you add an exception to a class, you are creating at least two cases: a system is either part of the exception, or it is not. If you add a new separate exception (that is: an exception that doesn’t intersect with the previous one) things don’t get much more complicated: you are just adding another case.

When you have intersecting exceptions, you either need to treat each resulting subset separately, or you still need to sort out priorities: which exception is applied in the intersection, and when? When a node is part of both sets of exceptions, which exception has the highest priority? (Which rule will have the last word?) Will the same priority apply across the whole set of policies, or will different priorities be applied depending on the managed resource? E.g.: will exception 2 have the highest priority when deciding which configuration files should be selected and copied on the system, but the lowest priority when deciding which processes must run on this node?

You see how adding just a few exceptions results in a mess. With just three intersecting exceptions we have 2³ = 8 different subsets: we have to anticipate seven different subcases for nodes that would otherwise be “normal” class1 members. No matter how you try to sort this thing out, it’s a mess. And it’s only three exceptions!

Do you really think that managing all of these in policies (both the classification and the rules resulting from it) is what you want to do?

Difficult reporting


When you grow over a certain size it may be very difficult to understand at a glance what the role of a certain node in the infrastructure is by just reading the policies. It means that when you deploy a change, you don’t know exactly where changes are going to happen. If you change the configurations applied to a class, and you want to know which nodes are impacted, you need some form of reporting.

You may think that reporting could be easily done by going through each machine with SSH in a loop. But if the size of your infrastructure is big enough, that is definitely not an option. Say you’re managing some 30000 machines in different datacenters, and say that collecting the information you need from each node in that way takes three seconds on average. If you do that serially and there are no problems during data collection, then it takes 90000 seconds (more than 24 hours) just to collect the data. If you use, say, 10 parallel processes to collect the data, it still takes 2.5 hours. And you haven’t even started to analyse it or make it available to management in a readable form, for example. Besides, whatever the reason why you collected the data in the first place, you need an up-to-date inventory of your machines to produce accurate reports if you have to poll them one by one.
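The arithmetic above is worth a quick sanity check (the figures are the ones from the text):

```shell
# Back-of-the-envelope check: serial vs. parallel collection time.
nodes=30000
secs_per_node=3
workers=10
serial=$(( nodes * secs_per_node ))   # 90000 seconds: just over 25 hours
parallel=$(( serial / workers ))      # 9000 seconds: 2.5 hours
echo "serial: ${serial}s; with ${workers} workers: ${parallel}s"
```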

But this was the ideal case. In real life, problems will always happen during data collection, and you’ll need to manage them… you’ll spend a lot of time just to find out what’s happening! What if you could instead query a database for which nodes are in a certain class?

The human bottleneck


In short, what Mike Svoboda means is: your task is building policies, not defining classes. That may sound a bit extreme, so let’s try to clarify.

The more an infrastructure grows, the less likely it is that a single person has full knowledge and understanding of it. Even if you are the person, or one of the people, who writes the policies that manage your infrastructure, your policies need some external input to match the infrastructure and its configuration. You could implement others’ knowledge yourself, but that may quickly make you a bottleneck; or you can enable others to leverage your work by themselves.

If they are technical people, you may hope that they learn configuration management and start writing policies themselves. That’s the ideal world, and it happens sometimes, but it’s difficult to reach good numbers, and having many people working on the same set of policies creates other problems, both in collaboration and in opportunities to break something.

More often, your “customers” are other technical people that are too busy with their stuff to take yet another challenge (learning your configuration management tool of choice), or they are non-technical decision makers.


If you can enable these people to instil their knowledge into the infrastructure without you being in the middle, you stop being a bottleneck: you’re ensuring that your policies help the infrastructure to scale, rather than standing in the way.

And again, this calls for a “classification database” that lives outside of your policies and manifests, but can be used by them reliably. That’s what Mike and his team did at LinkedIn, and we’ll see how in a few minutes.

Some solutions

OK, so you need a configuration database. You may ask yourself: how sophisticated should it be? What would be the right interface to enable collaboration from your customers? How feature-rich should it be? Simplifying a lot, it all depends on two key factors: the complexity of your infrastructure and the technical skills of your customers.


Depending on who your customers are and how complex your infrastructure is, your ideal solution fits in one of four quadrants (infrastructure complexity on one axis, customers’ technical skill on the other). To get an idea of the complexity of your infrastructure, ask yourself: “how many special cases would I need to take into account if I had to write all the details of my infrastructure in policies/manifests?”; for the technical level of your customers, ask yourself: “are these people comfortable with defining infrastructure information in plain text files or using database queries?”

And now you have your configuration database. How do you plug it into your configuration management tool of choice? Let’s start with Puppet. It deserves a place of honour here, since it has explicitly supported external node classification as a first-class citizen for ages (from my research, the first implementation of ENC in Puppet dates back to 2009; please correct me if I am wrong).

Puppet and ENC

An external node classifier is an arbitrary script or application which can tell Puppet which classes a node should have […] that can be called by puppet master; it doesn’t have to be written in Ruby. Its only argument is the name of the node to be classified, and it returns a YAML document describing the node. […] To tell puppet master to use an ENC, you need to set two configuration options: node_terminus has to be set to “exec”, and external_nodes should have the path to the executable.
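Taken literally, such a classifier can be tiny. Here is a minimal sketch in shell; the class names and the parameter are invented for illustration:

```shell
# A minimal ENC sketch: given a node name, print the YAML document
# Puppet expects. Classes and parameters below are made up.
enc() {
  case "$1" in
    web*)
      printf '%s\n' '---' 'classes:' '  - base' '  - webserver' \
                    'parameters:' '  datacenter: oslo'
      ;;
    *)
      printf '%s\n' '---' 'classes:' '  - base'
      ;;
  esac
}
enc "web01.example.com"
```

Saved as an executable script, this is the kind of thing that external_nodes would point at (with node_terminus set to “exec”): puppet master calls it with the node name as its only argument and reads the YAML from its output.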

One can choose to make their own ENC, or use some popular classifiers out there, like hiera (a hierarchical database that is able to merge different settings into a single set, specific for the node that’s requesting it) or The Foreman, that offers a web UI. Depending on what problem you’re trying to solve, how much time you have to solve it, and for how long the solution should last, you may go for any of these ways, or explore other ones. What’s the counterpart in CFEngine land?

ENC in CFEngine means…

In general, and quite differently from Puppet, CFEngine abstracts much less from the underlying system, and provides no ENC mechanism out of the box, nor has a de facto standard emerged for the classification engine. In a way, it seems that anyone who needs ENC ends up implementing one on their own. However, one thing seems to be a constant: the use of modules to raise/cancel classes and set variables that represent the configuration of the system. CFEngine modules are actually a handy and powerful instrument to implement ENC, and we’ll take a few minutes soon to give a few more details about them. Now let’s see a couple of ENC solutions for CFEngine: the top-notch one from LinkedIn and, in detail, the plain simple one from Opera.

LinkedIn’s approach to ENC

How did they solve this problem at LinkedIn? They created a system based on Range, an open-source tool by Yahoo!. Range aggregates information coming from various sources, some managed by System Operations people, some by Engineering, and other self-standing ones like an inventory system and the load balancers. Every node queries a Range cache, asking for any information regarding the node itself. The result is then stored locally in JSON format, and the local cache is used every time the remote cache is not available. Scripts in Python and Bash glue everything together, raising the appropriate classes on each node.

The LinkedIn approach is damn cool, scalable and sophisticated, and something I would like to try one day. But it was also way too much for my needs.

Our implementation: hENC

ENC, in my case, had two purposes: first, to allow us to scale better (a common trait of all ENC mechanisms) by making it easier to handle exceptions in general; second, to take as much configuration information as possible out of the policies and into plain text files, to lower the access barrier for the tech-savvy, but non-CFEngine-savvy, people in the Sysadmin group. This approach is actually closer to Neil Watson’s and EvolveThinking’s CFEngine library (based on CSV files) than to the otherwise wonderful LinkedIn approach.

The final result is, in my opinion, the best combination of power and simplicity. We use plain text files, CFEngine’s own module protocol, and a Perl script for hierarchical node classification. We’ll conclude this post by looking into the implementation of this solution in detail. You’ll see that there is no rocket science here, and yet this tool allowed us to cope with a sudden growth in managed nodes at Opera during the last months of 2013. We initially used CFEngine just to apply some security hardening measures, and we took it from there to implement new functionalities, such as a centralised syslog architecture across many datacenters around the world, based on rsyslog, with log relays in each datacenter. Other projects and features are in the pipeline or in active development.

CFEngine’s module protocol


A module is a program, written in any language, that can define classes and variables for the agent. Modules have existed since at least CFEngine 2, which means well before 2009. However, they aren’t strictly an ENC system, so Puppet still wins the race here.

A module can do whatever it likes to collect information, as long as it hands the results back to CFEngine through standard output, using some very simple formatting:

  • printing out +activated_class will activate a global class activated_class;
  • printing out -cancelled_class will cancel a global class cancelled_class;
  • printing out =my_var=my_value will set a variable my_var with value my_value in a context (think of a namespace in a programming language) named after the program that generated the variable (e.g.: if the module executable is called my_module, the fully qualified name of the variable will be my_module.my_var);
  • printing out @my_list={"list","of","4","values"} will set a list called my_list;
  • of course, you can also set array values: =my_array[element]=value is a perfectly valid syntax
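Putting the protocol lines above together, a complete module can be just a few lines of shell. The class names, variable names and values below are invented for illustration:

```shell
# A minimal module sketch: print module-protocol lines for a node.
# All names and values here are made up.
classify() {
  case "$1" in
    web*) echo "+webserver" ;;   # raise a global class
    db*)  echo "+dbserver"  ;;   # raise a different class
  esac
  echo "-legacy_setup"                              # cancel a class
  echo "=syslog_server=loghost.example.com"         # a variable, namespaced
                                                    # after the module's name
  echo "@dns_servers={\"10.0.0.1\",\"10.0.0.2\"}"   # a list
}
classify "web01.example.com"
```

A module like this is run from a commands promise with the attribute module => "true", and the agent parses its output as it comes in.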

In CFEngine 3.5.x and later, an extension mechanism has been added to the protocol; it’s not covered here because we are using 3.4.x, but it would be really trivial (and very convenient, too!) to improve our ENC to include extensions. See the documentation if you want to know more.

As you can see, setting classes and variables is trivial with modules, which makes them the perfect candidate to implement external node classifiers. That’s what I did.

Start with plain text files…

Imagine for a second that you had a text file formatted according to the module protocol, something like in the slide image above. If we had such a file, we could define classes and variables in our agent by just using the cat command from inside the agent as a module. That would be something already, but it comes with the limitation of not allowing any additional human-readable information (like, for example, comments).

But that’s very easy to fix! If we write a shell wrapper around grep as a module, we can put both CFEngine information and human-readable information in the file, and ensure that any extra information will be skipped:

/bin/egrep -h '^[=@+-]' "$@" 2> /dev/null

Such a wrapper will filter out all non-module information from the files it gets on the command line. That’s something more, but it still has a limitation: it’s not hierarchical, and it doesn’t merge the information from the different files: whatever definition is in the files is just thrown at the agent as-is. What happens if a class is activated in one file and cancelled in another? What if the same name is used to set both a variable and a list? What would be the outcome, what would the agent do?
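Here is the wrapper idea in action on an annotated file; the file name and its contents are invented, and grep -E is used instead of /bin/egrep so the sketch stays portable:

```shell
# Build a sample ENC file mixing protocol lines and free text…
cat > /tmp/henc-demo <<'EOF'
ENC settings for the Oslo location -- free text like this is skipped.
+oslo
=ntp_server=ntp1.example.com
EOF
# …then filter it the way the wrapper would:
grep -Eh '^[=@+-]' /tmp/henc-demo 2> /dev/null
```

Only the two module-protocol lines (+oslo and =ntp_server=ntp1.example.com) survive the filter; the free-text note is dropped.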

That’s where hierarchical, merged node classification is needed: if we have a list of files where a class is both set and cancelled, the final status will be the one found in the last file that said something about that class; if a variable is set many times, the last definition wins.
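The “last file wins” merge can be sketched in a few lines of awk before we look at the real script; the file names and contents below are invented for illustration, and this is not the actual henc implementation:

```shell
# Two layers of the hierarchy: general defaults, then Oslo overrides.
cat > /tmp/defaults.conf <<'EOF'
+use_internal_dns
=ntp_server=ntp.example.com
EOF
cat > /tmp/oslo.conf <<'EOF'
-use_internal_dns
=ntp_server=ntp.oslo.example.com
EOF

merge() {
  awk '
    /^[+-]/ { cls[substr($0, 2)] = substr($0, 1, 1); next }  # last state wins
    /^=/    { name = substr($0, 2); sub(/=.*/, "", name)     # variable name
              var[name] = $0; next }                         # last value wins
    END     { for (c in cls) print cls[c] c
              for (v in var) print var[v] }
  ' "$@"
}
merge /tmp/defaults.conf /tmp/oslo.conf
```

The Oslo file comes last, so the class ends up cancelled (-use_internal_dns) and the NTP server that survives is the Oslo one.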

…and get hierarchical

How difficult is it to implement? Judging by the size of the following Perl script, called henc (for hierarchical ENC), it’s not difficult at all:


#!/usr/bin/perl

use strict ;
use warnings ;

my %class ;    # classes container
my %variable ; # variables container

# Silence errors (e.g.: missing files)
close STDERR ;

while (my $line = <>) {
    chomp $line ;
    my ($setting,$id) = ( $line =~ m{^\s*([-=\@/+_])(.+?)\s*$} ) ;
    next if not defined $setting ; # line didn't match the module protocol

    # add a class
    if ($setting eq '+') {
        # $id is a class name, or should be.
        $class{$id} = 1 ;
    }

    # undefine a class
    if ($setting eq '-') {
        # $id is a class name, or should be.
        $class{$id} = -1 ;
    }

    # reset the status of a class
    if ($setting eq '_') {
        # $id is a class name, or should be.
        delete $class{$id} if exists $class{$id} ;
    }

    # define a variable/list
    if ($setting eq '=' or $setting eq '@') {
        # $id is "variable=something", or should be
        my ($varname) = ( $id =~ m{^(.+?)=} ) ;
        $variable{$varname} = $line if defined $varname ;
    }

    # reset a variable/list
    if ($setting eq '/') {
        # $id is a variable name, or should be
        delete $variable{$id} if exists $variable{$id} ;
    }

    # discard the rest
}

# print out classes
foreach my $classname (keys %class) {
    print "+$classname\n" if $class{$classname} > 0 ;
    print "-$classname\n" if $class{$classname} < 0 ;
}

# print variable/list assignments, the last one wins
foreach my $assignment (values %variable) {
    print "$assignment\n" ;
}
Let me explain briefly how this script works: it reads the lines of the files passed on the command line, one by one, looking for module-like lines (but hey, what are those “_” and “/” symbols that are not part of the module protocol? Hold on!). If a line appears to be setting/cancelling a class, it will set a hash key/value for that class; if it looks like a variable or a list, it will extract the variable name and save the latest definition for it in another hash. If a class is set twice, the latest definition overwrites the earlier one; if a variable and a list with the same name are defined, the latest definition is kept and the earlier ones are all discarded. When there are no more lines to be read, the module prints all this merged information, with no duplicates.

But then you hit another limitation: once a class is defined/cancelled in a file, or a variable is set, you can change the setting, but you can’t make it disappear (as if it had never happened). That’s why we added two more symbols that are not part of the module protocol: with the “_” symbol we ask hENC to forget about a class and start afresh with it. In the same way, the “/” symbol tells hENC to forget (slash) a variable from its cache and start afresh with it.
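For example, a file late in the hierarchy could undo what earlier files did; the class and variable names below are invented for illustration:

```
Free text like this line is simply ignored by henc.

_use_internal_dns
/ntp_server
+use_local_resolver
=dns_domain=oslo.example.com
```

After reading this file, henc behaves as if use_internal_dns and ntp_server had never been mentioned, while use_local_resolver is raised and dns_domain is (re)defined.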

Using hENC

We have seen the engine; we now need to feed it with fuel (a list of files to parse) and wheels (a policy to leverage the whole lot). How do we do that? What decides which files should be checked? Once again, there are endless possibilities, from the simplest (hardwiring a file list into a policy) to the most sophisticated (getting the list itself from some external source). We settled for something in the middle but, once again, simple:

  • first, we want to read some defaults, generally valid for all possible locations;
  • then, we want to read defaults for the location the node is in, overriding general defaults if needed;
  • then, we want to read defaults that depend on other information, like the environment the node is in (e.g.: it has global connectivity, like a public IPv4 or a global IPv6 address, or it has only private addresses);
  • finally, we want to read special settings that apply to this node only.
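For an Oslo node with public connectivity, the resulting file list could look something like this (the exact file names are hypothetical, apart from the pub/priv split that appears in the policy excerpt):

```
ENC/defaults                          general defaults for all locations
ENC/oslo                              defaults for the Oslo location
ENC/pub                               settings for nodes with public connectivity
ENC/example.com/node1.example.com     settings for this node only
```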

We define this hierarchy as a list of files, where the elements of the list depend on certain classes being set; the last file to say anything about a class or variable wins. I’ll show how it works using a simplified excerpt from our real policies.

In our policies we set some global classes that tell us the location of the node (e.g.: a node running in Oslo will have the oslo class set); plus, we set classes like on_oslo_private_net_only or oslo_public, depending on the connectivity of the node itself. And we have the following vars promises:

bundle common site
. . .
  vars:
      "enc_basedir" string => "ENC" ;

      # several attempts to work out the node's domain
      # (promiser names here are illustrative; class guards omitted)
      "domain"
          string => "$(sys.domain)" ;

      "domain"
          string => "$(matched_domain[1])" ;

      "domain"
          string => "$(def.domain)" ;

      # public or private subdirectory, depending on the node's connectivity
      "enc_pubpriv"
          policy => "overridable",
          string => "$(enc_basedir)/pub" ;

      "enc_pubpriv"
          policy => "overridable",
          string => "$(enc_basedir)/priv" ;

      # the ordered list of ENC files to read
      "henclist"
          policy => "overridable",
          slist => {
              # the actual file list is elided here
          } ;
. . .
As you can see, the general defaults for all locations are listed first, then we read the defaults for Oslo, then the defaults for Oslo for public or private nodes, and finally the special settings for the node. The node-specific files are stored in a subdirectory named after the domain name of the node (which we try to guess in all conceivable ways!), so that we don’t clutter a single directory by throwing all the node files there.

This mechanism is very flexible: one can add many levels before reaching the node-specific parts, and all of those levels depend on coarse-grained classes – the fine-grained definitions belong in the files. Note also that any of these files could be missing (e.g.: the node-specific file), and the mechanism still works: if a file doesn’t exist, henc will just ignore it.

Finally, note that we could extend the list with a local file that is not pulled from the policy hub (e.g.: /etc/cfengine/local-node.conf), and that would also work and would allow settings defined centrally to be overridden by local settings; whether or not this is a good idea is left to the reader as an exercise 🙂

henclist is then passed to a bundle via a method call:

bundle agent main {
. . .
  methods:
      # the promiser string here is illustrative
      "any"
          comment   => "External node classification",
          usebundle => henc("site.henclist") ;
. . .

Once the bundle henc is processed, the classes will be set/cancelled globally, and the variables will be usable in other parts of the policy like, for example:

          usebundle => motd("$(henc.motd_file)",
                            "$(henc.motd_tmpl_dst)") ;

Yes, it’s as simple as that! With such a system in place, how hard is it to set exceptions to the sshd configuration for a handful of nodes? How hard is it to set the resolv.conf file differently on two DNS servers? How hard is it to configure all machines as rsyslog clients, a small subset of them as relays, and one as the central collector of all logs?

The bundle that puts everything together

The last piece missing is the CFEngine bundle that reads the files from the given list and invokes the module. We’ll see the code first, and then explain how it works (we use some additional bodies from a library of ours that are not shown here, but their names are informative enough that it’s easy to understand what they do).

bundle agent henc(enclist_name) {
  vars:
    henc_has_list::
      "enclist"      slist  => { "@($(enclist_name))" } ;
      "enc_fullpath" slist  => maplist("$(site.inputs)/$(this)","enclist") ;
      "encargs"      string => join(" ","enc_fullpath") ;

  classes:
      "henc_has_list" expression => isvariable("enclist_name") ;
      "henc_has_args" expression => isvariable("encargs") ;
      "henc_can_classify"    and => { "henc_has_list","henc_has_args" } ;

  files:
      # the promisers in this section are illustrative; the original
      # local paths were lost
      "$(site.inputs)/henc"
          comment   => "Copy/update hierarchical merger",
          copy_from => digest_cp("$(site.modules)/henc"),
          perms     => mog("0755","root","root") ;

      "$(site.inputs)/$(enclist)"
          comment   => "Cache henc files locally",
          copy_from => digest_cp("$(site.masterfiles)/$(enclist)") ;

  commands:
    henc_can_classify::
      "$(site.inputs)/henc"
          comment    => "Hierarchical classification for $(sys.fqhost)",
          args       => "$(encargs)",
          classes    => always("henc_classes_activated"),
          module     => "true" ;
}

Just recall that the agent goes through each bundle three times to ensure convergence, and we are ready to go.

At the first pass, the vars promises are not evaluated, because the henc_has_list class has not been set yet. Then the classes promises are evaluated: if the bundle was passed a valid argument, the henc_has_list class gets defined; let’s suppose that we are in that case. henc_has_args won’t be defined, as the variable encargs hasn’t been defined yet; this implies that henc_can_classify will also be false. All of the files promises are evaluated, thus making a local copy of the files used in the classification process, including the henc script. If any of the files used by henc doesn’t exist on the policy hub, that’s not a problem: the worst that can happen is that CFEngine will tell us that it couldn’t find them. Finally, the commands promises are skipped, as the class condition evaluates to false.

At the second pass, the vars promises will be evaluated, and all the variables will be defined. enclist will be a copy of the list whose name was passed to the bundle as parameter; enc_fullpath will be the same list with all paths prefixed with the name of the inputs directory (we define it in a common bundle named site), and encargs will be a string that contains all the files in enc_fullpath, joined with spaces.

When we get to the classes promises, all of the classes will now be defined. As a consequence, the commands promise will now run, the node will be classified, and the class henc_classes_activated will be set.

At the third and last pass, all the promises have already been evaluated, so CFEngine just skips them all. Job done!

Final take-aways


I tried to show why classification is crucial in configuration management, and must be supported by an adequate tool, depending on your needs and the complexity of your infrastructure. Sometimes, as I have just described, even a simple approach with nothing fancier than text files may be good enough!

Other shops have implemented ENC in CFEngine differently, as you can read from the help-cfengine forum: LinkedIn, MailOnline, Normation all have their own implementation. Our ENC is probably the least sophisticated of the pool, but it is already helping us to scale faster, and we are confident that it will help us to scale to much larger numbers than we have now!


I want to truly thank all the people who made my experience at FOSDEM and cfgmgmtcamp a memorable one.

Thanks to Khushil Dep for pushing me to submit the proposal to both conferences, and to all those who further encouraged me to do so. Without your pressing I would have never submitted my proposals.

Special thanks go to Bas van der Vlies and to Ed Daniel. With Bas, we shared the same hotels (hey, not the same room, you evil people!), walked kilometers to and from the conference venues, and talked about many things, both technical and non-technical. Without him, I would have spent more time alone and would not have enjoyed this experience as much as I did. With Ed, we had some interesting chats during cfgmgmtcamp, and a chaotic last day when we hardly managed to meet for dinner. But we made it, and spent good hours talking, walking, and almost getting lost in Gent.

I am grateful to all the people I met who granted me their friendship and esteem; thanks in particular to all the people from CFEngine (the company) and Normation — sorry, but I am not going to make myself miserable by misspelling your names and accents; you know who you are! 🙂

Thanks to all the people from the CFEngine community that I had the pleasure to meet in person: it was great to spend some time with you and to attach a real person to that email address on the mailing list!

I am grateful to all the people who attended my talks, each one of you. In particular, thanks to all those who were not part of the CFEngine community and yet decided to attend my seminar — and even liked it. Even more thanks to those who took some time to let me know they liked it.

Special thanks to David Lutterkort: it was an honour to have you among the public, and to receive your comments and criticism; I am humbled.


