In this post I’ll describe how I put together a number of pieces of information about AWS features to experiment with an idea. It’s nothing advanced; rather, it’s what happens when you are studying something and you start seeing the possibilities. Don’t expect rocket science, then: it’s more like a handful of notes I made in the hope they may be useful to more people than just myself.
Since this was an experiment where I was supposed to learn how to do things, it’s a manual set-up. Automation will follow, and in my case it will be Terraform, but not in this post.
AWS features in this article
The first feature that will be part of this article is spot instances. Spot instances are virtual machines that utilise unused capacity in AWS’ Elastic Compute Cloud (EC2). They cost a fraction of the price for a normal on-demand instance, up to a 90% discount, but there is a “but”, of course: AWS can terminate those instances on short notice to reclaim that capacity when they need it.
If you think that no “durable” application can run on spot instances, you should think twice. In fact, there are distributed applications designed to be so resilient that they can actually run on spot instances, which saves their owners a lot of money. The possibility of running a resilient distributed application on spot instances was an interesting feature that I wanted to explore.
And then I was studying autoscaling. Autoscaling is a functionality in AWS (and, I guess, in other cloud providers’ offerings, too) that can be used to ensure adequate capacity for a service. This often means creating additional instances during request peaks (scale-out), and reducing the number of instances when the capacity exceeds the demand by a certain threshold (scale-in). A “degenerate” case of autoscaling is to set the minimum and maximum number of instances to the same value, to ensure that the capacity is kept constant in the event of a failure of the service on one or more instances: in that case, autoscaling will replace dead instances with new ones to keep the capacity at the desired level.
To test all these things together I decided to build a redundant CFEngine policy server, running on spot instances for economy. cf-serverd is the CFEngine process delegated to providing CFEngine clients with the policies they should apply locally.
Connections between clients and cf-serverd are secured with public key encryption. When a client “registers” to a CFEngine policy server, the policy server will save the client’s public key locally and, conversely, the client will save the server’s public key. Because of this key exchange, the policy server keys must be the same across the different instances in a distributed set-up, and they must not change if a policy server is replaced: if the key changes, the client will not trust the new policy server and won’t accept the server’s policies. At the same time, it’s important that the public keys collected by the server are also preserved if an instance dies, otherwise a new server won’t be able to recognise a client and will likely refuse to serve the policies to it.
To share this information across running policy server instances and future instances that will replace them, a shared filesystem for the keys is necessary. Besides, a shared filesystem for the policies, although not required, is very convenient. AWS comes to the rescue with EFS, its managed NFS service.
Next, we want to present a single interface for the distributed service we are building. Here, the obvious choice is a Network Load Balancer from AWS’ Elastic Load Balancing service (which we’ll abbreviate as NLB). The NLB will take incoming connections to port 5308 and distribute them across the instances. This requires that the policy server instances are registered as targets for the load balancer (they will form a target group). However, spot instances will be recycled every now and then and the target group must be kept up to date. That’s not a problem: autoscaling will keep the target group up to date to ensure that the load balancer always directs traffic to the live instances.
Finally, we’ll use an additional, on-demand (non-spot) instance, which we’ll call the “deployer node”, to manage the policy files distributed by the policy servers. In theory, we could do this from one of the policy servers, but we’d run the risk that the instance is terminated while we are working on policy files, leaving the policies in an inconsistent state. Since this node is stable, we could place on it anything that the policy servers may want to refer to, e.g. a syslog server to collect CFEngine logs, but that’s just an example (we could use CloudWatch logs for that). We are not going to talk about the deployer node here, but there is one consideration to make: this node will be more exposed than the rest of the system (it’s more or less a bastion host for the service), so care must be taken to keep it as secure as possible.
Another thing that we are not going to deal with is DNS. I haven’t registered a domain in Route 53 (AWS’ DNS service) and machines will be changing IPs and hostnames every time they are recycled. The name of the NLB itself will be a horrible, generic blob like
cfengine-hub-nlb-12ab34c56d7890ef.elb.eu-west-1.amazonaws.com. To have more manageable names you should have a domain registered in Route 53 and do a bit more magic to register manageable names. That is also not covered here.
I’ll assume you already have a VPC ready. In the VPC where you want to deploy this configuration, both “DNS resolution” and “DNS hostnames” must be set to “yes”. You will also need at least one public subnet and two or three private subnets in distinct availability zones.
Create a NAT gateway (if you don’t have one already) in the public subnet, and ensure that the route tables used in your private subnets have a default route through it.
Set up a security group to be used with the policy hubs: you can call it
cfengine-policy-hub. In the security group, allow access to port 5308/TCP from any address (you will place the policy hubs in the private subnets anyway and they won’t be reachable on that port unless they are contacted through the NLB). For debugging purposes, it may turn out to be useful to have rules that allow SSH access from some trusted IP.
Set up another security group that will allow access to the deployer node only from trusted IP addresses and name it appropriately. E.g., let’s say that you’ll allow SSH connections to the deployer node from your home’s IP address: then you can call the security group
Set up a security group to be used on the EFS mount targets, so that they accept only NFS requests from the members of the cfengine-policy-hub and access-from-home security groups. You can call it
As said in the preface, you’ll need two shared filesystems: one for the masterfiles, that will be mounted under
/var/cfengine/masterfiles, and one for the keys, that will be mounted under
/var/cfengine/ppkeys. For each of these filesystems, create a mount target in each of the private subnets where you will deploy policy hubs and guard those mount targets via the
cfengine-policy-hub-efs security group.
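The console steps for the EFS security group and the mount targets could also be sketched with the AWS CLI, roughly as below. Take it as a sketch under assumptions, not as the procedure used for this experiment: every ID is a made-up placeholder, the function names are hypothetical, and by default the script only prints the commands (DRY_RUN=1) so that you can review them before running anything.

```shell
#!/bin/sh
# Sketch only: all IDs below are hypothetical placeholders.
EFS_SG="${EFS_SG:-sg-0123456789abcdef0}"     # cfengine-policy-hub-efs
HUB_SG="${HUB_SG:-sg-0fedcba9876543210}"     # cfengine-policy-hub
HOME_SG="${HOME_SG:-sg-00112233445566778}"   # access-from-home
PRIVATE_SUBNETS="${PRIVATE_SUBNETS:-subnet-aaa111 subnet-bbb222 subnet-ccc333}"

# Print commands by default; export DRY_RUN=0 to really call the AWS CLI.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Allow NFS (2049/TCP) into the mount targets from one source security group.
allow_nfs_from() {
    run aws ec2 authorize-security-group-ingress \
        --group-id "$EFS_SG" --protocol tcp --port 2049 --source-group "$1"
}

# Create one mount target per private subnet for the given filesystem ID.
create_mount_targets() {
    for subnet in $PRIVATE_SUBNETS; do
        run aws efs create-mount-target \
            --file-system-id "$1" --subnet-id "$subnet" --security-groups "$EFS_SG"
    done
}

allow_nfs_from "$HUB_SG"
allow_nfs_from "$HOME_SG"
create_mount_targets fs-a012bc3d   # repeat with the ID of the second filesystem
```

Run as is, it prints five aws commands; nothing is created until DRY_RUN is set to 0.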
Policy hub AMI set-up
While AWS prepares your mount targets, create an instance in the public subnet that you’ll use as a template for the hub AMI. Allow access to this node only through the
access-from-home security group you created earlier. That will also give this instance access to the EFS filesystems through the
I’ll use Debian as the basis for this AMI. A list of the AMI identifiers for the official Debian Linux 10 “buster” images in each region is available here.
You will have to run a few commands on the instance once it’s ready. Log in to the machine, get superuser privileges through “
sudo -i” and run the following commands:
apt-get update
apt-get upgrade -y
apt-get install -y gnupg2 nfs-common
mkdir -m 700 -p /var/cfengine/masterfiles /var/cfengine/ppkeys
By now the EFS mount targets should be ready. From the AWS console, you will be able to access the command line to mount each of the two filesystems. You can use it as a guideline to build the lines in
/etc/fstab used to mount these filesystems automatically at system’s boot. The suggested command line will look something like:
sudo mount -t nfs4 -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport fs-a012bc3d.efs.eu-west-1.amazonaws.com:/ efs
The one for the other EFS filesystem will have a different mount target (the fs-xxxxxxxx part in the hostname above).
The information in the command line above translates to the following line in
/etc/fstab (here we suppose that the NFS filesystem mentioned in the command line is the one for the masterfiles):
fs-a012bc3d.efs.eu-west-1.amazonaws.com:/ /var/cfengine/masterfiles nfs4 nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport,_netdev 0 2
Notice how in the line for fstab we add the “
_netdev” option: this is to instruct the system that it will be dealing with a network filesystem and should hold the mount until the network stack is up and ready.
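Since the two fstab lines differ only in the filesystem ID and the mount point, a tiny helper can build them and spare you a copy-paste mistake. This is a minimal sketch: efs_fstab_line is a hypothetical name, and the second filesystem ID is invented for the example.

```shell
#!/bin/sh
# efs_fstab_line FS_ID REGION MOUNT_POINT
# Prints an /etc/fstab entry with the mount options suggested by the console,
# plus _netdev, and the same trailing "0 2" fields used in this post.
efs_fstab_line() {
    opts="nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport,_netdev"
    printf '%s.efs.%s.amazonaws.com:/ %s nfs4 %s 0 2\n' "$1" "$2" "$3" "$opts"
}

efs_fstab_line fs-a012bc3d eu-west-1 /var/cfengine/masterfiles
efs_fstab_line fs-b345cd6e eu-west-1 /var/cfengine/ppkeys   # hypothetical second ID
```

The script only prints the two lines; once they look right, append them to /etc/fstab by hand or with a redirection.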
If the lines in fstab are correct, you will be able to mount the (still empty) EFS filesystems by running:
Now you can install CFEngine. At the time of writing, the latest LTS (long-term support) version is 3.12.2 and the corresponding Debian package version is 3.12.2-2. To install that version and not the absolute latest (a 3.14.x as of today), run the following commands:
wget -qO- https://cfengine-package-repos.s3.amazonaws.com/pub/gpg.key | apt-key add -
echo "deb https://cfengine-package-repos.s3.amazonaws.com/pub/apt/packages stable main" > /etc/apt/sources.list.d/cfengine-community.list
apt-get update
apt-get install -y cfengine-community=3.12.2-2
echo cfengine-community hold | sudo dpkg --set-selections
The installation will have copied the masterfiles in
/var/cfengine/masterfiles and created a key pair in
/var/cfengine/ppkeys. That’s all you need.
Now, use this instance to create an AMI from the AWS console. When the AMI is ready, you can destroy this instance as you’ll have no further use for it (the hub instances will be created in the private network).
Create a target group
In the EC2 service, go to load balancing -> target groups, create a target group and call it
cfengine-hub-tg. As protocol, select TCP, with port 5308, the port used by cf-serverd to serve policies to clients. Make sure to select the VPC where you’ll be deploying your policy servers.
Create an NLB
In the EC2 service, go to load balancing -> load balancer and create a network load balancer; call it
cfengine-hub-nlb. The protocol will be again TCP, port 5308. Check that the VPC you select is the right one. Select the
cfengine-hub-tg target group and don’t register any target: autoscaling will take care of it for you.
Create the autoscaling group
This is where the real fun starts. In the EC2 service, go to Auto Scaling -> Launch Configurations and start creating a new one. You will have the choice to create a launch configuration or a launch template. The latter seemed to be a relatively new feature and I didn’t want to spend time learning about it right then: the most important thing for the moment was to see if my experiment worked, so I created a new launch configuration and not a template.
In the launch configuration, I set:
- the AMI to use when launching a new instance: it will be the AMI I’ve created a few steps back;
- instance type: I selected a t2.micro; use whatever you see fit;
- you can call this configuration CFEngine hub v3.12.2 – LC v1;
- request spot instances; you will be asked for the maximum price you are willing to pay to run a spot instance; at the time I was working on this experiment, the current price was $0.0038; I decided to set the limit to $0.005 to see what happens;
- in the advanced details, set this user data: you want the policy servers to bootstrap themselves against themselves, and this small shell snippet will do:
cf-agent --bootstrap $( curl -s http://169.254.169.254/latest/meta-data/local-ipv4 )
- IP address type: I’ve decided to assign private addresses only, since I’ll want these machines to run in a private network, access the Internet only through the NAT gateway, and be accessed only through the NLB;
- leave the storage setting as default
- security group: use
cfengine-policy-hub, because that’s what we created it for.
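The one-line user data above can be made a bit more defensive. The sketch below assumes the image processes user data with cloud-init, which expects a script to start with a "#!" line; the function names are hypothetical, and the address check is deliberately loose (it only guards against an empty or HTML answer from the metadata service).

```shell
#!/bin/sh
# Sketch of a user-data script: bootstrap the hub against its own private IP.
METADATA_URL="http://169.254.169.254/latest/meta-data/local-ipv4"

# Loose sanity check: four dot-separated groups of one to three digits.
is_ipv4() {
    printf '%s\n' "$1" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}$'
}

# Ask the metadata service for the private address, then bootstrap against it.
bootstrap_self() {
    ip=$(curl -s --max-time 5 "$METADATA_URL") || return 1
    if ! is_ipv4 "$ip"; then
        echo "unexpected answer from the metadata service: '$ip'" >&2
        return 1
    fi
    cf-agent --bootstrap "$ip"
}

# Bootstrap only where CFEngine is installed (i.e. on the hub AMI).
if command -v cf-agent >/dev/null 2>&1; then
    bootstrap_self
fi
```

If the metadata service does not answer within five seconds, the script gives up instead of bootstrapping against an empty address.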
Proceed with the creation of the launch configuration. Then, switch to the Auto Scaling Groups item in the menu and start creating a new one:
- select “launch configuration”, and then the one you just created;
- name the group
- I was going to deploy on 3 separate availability zones, so I set group size to 3;
- select the right VPC and the private subnets for each AZ you want these instances to be deployed to;
- choose to keep this group at its initial size: this way, the ASG will keep the configuration stable (remember that the goal is not to scale up or down, just to have an instance replaced when AWS reclaims the capacity);
- you can choose to be notified via email when the autoscaling kicks in; I decided to go for it because I want to be notified when autoscaling reacts to the deletion of one or more spot instances;
- add tags to the instances, if you like.
Once the autoscaling group is created, edit it in the console:
- for target groups, select
cfengine-hub-tg; autoscaling will keep this target group up to date, adding new instances to it and removing the ones that die;
- for health checks, select ELB: that means that, when the load balancer marks an instance as unhealthy (e.g. cf-serverd died for any reason), the ASG will replace the instance with a new one; the alternative is EC2, which would work only if the instance itself dies, but won’t replace it if the service is non-functional on a functional instance;
- health check grace period: I selected 60 seconds; that may be too short in a production environment, but it’s OK for an experiment;
- termination policies: the default is fine;
- cooldown period: 300 seconds is OK
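For reference, the settings chosen in the console for the autoscaling group (fixed size of 3, ELB health checks, 60 seconds of grace period, 300 seconds of cooldown) map onto a single AWS CLI call. What follows is a sketch under assumptions: the group name, the subnet IDs and the target group ARN are made-up placeholders, and the command is only printed unless DRY_RUN is set to 0.

```shell
#!/bin/sh
# Sketch only: the group name, subnet IDs and ARN are hypothetical placeholders.
ASG_NAME="${ASG_NAME:-cfengine-hub-asg}"
LC_NAME="${LC_NAME:-CFEngine hub v3.12.2 – LC v1}"
SUBNETS="${SUBNETS:-subnet-aaa111,subnet-bbb222,subnet-ccc333}"
TARGET_GROUP_ARN="${TARGET_GROUP_ARN:-arn:aws:elasticloadbalancing:eu-west-1:123456789012:targetgroup/cfengine-hub-tg/0123456789abcdef}"

# Print commands by default; export DRY_RUN=0 to really call the AWS CLI.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Fixed-size group: min = max = desired, so instances are only ever replaced.
create_hub_asg() {
    run aws autoscaling create-auto-scaling-group \
        --auto-scaling-group-name "$ASG_NAME" \
        --launch-configuration-name "$LC_NAME" \
        --min-size 3 --max-size 3 --desired-capacity 3 \
        --vpc-zone-identifier "$SUBNETS" \
        --target-group-arns "$TARGET_GROUP_ARN" \
        --health-check-type ELB \
        --health-check-grace-period 60 \
        --default-cooldown 300
}

create_hub_asg
```

Setting minimum, maximum and desired capacity to the same value is what turns autoscaling into a plain "replace what dies" mechanism.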
Let the show begin
Once the ASG is created, it will start to evaluate the situation, see that the service has no active instance and, eventually, start to create them. You can follow up on what’s happening by selecting your newly made autoscaling group and checking the Activity History and Instances tabs.
When all of the instances are ready, try terminating one instance and see what happens. Then, log in to one instance, stop the cf-serverd service with
systemctl stop cfengine3 and then wait a while to see what happens.
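While running these tests, it can be handy to watch the target group from a terminal instead of refreshing the console. The sketch below assumes a configured AWS CLI; the ARN is a placeholder and the function names are hypothetical.

```shell
#!/bin/sh
# Hypothetical placeholder: use the ARN of your own cfengine-hub-tg target group.
TARGET_GROUP_ARN="${TARGET_GROUP_ARN:-arn:aws:elasticloadbalancing:eu-west-1:123456789012:targetgroup/cfengine-hub-tg/0123456789abcdef}"

# Print "instance-id state" pairs, one per line, for every registered target.
target_states() {
    aws elbv2 describe-target-health \
        --target-group-arn "$TARGET_GROUP_ARN" \
        --query 'TargetHealthDescriptions[].[Target.Id,TargetHealth.State]' \
        --output text
}

# Count the lines on stdin whose second field is exactly "healthy".
count_healthy() {
    awk '$2 == "healthy" { n++ } END { print n + 0 }'
}

# Poll every 15 seconds until interrupted with Ctrl-C.
watch_targets() {
    while :; do
        states=$(target_states)
        printf '%s healthy: %s\n' "$(date +%T)" "$(printf '%s\n' "$states" | count_healthy)"
        printf '%s\n' "$states"
        sleep 15
    done
}
```

Load the functions in your shell and run watch_targets: after you stop cf-serverd on one instance you should see its state change, and later a new instance ID take its place.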
With these checks you will build some confidence that, the moment AWS reclaims the capacity you are using, your service will be restored in a reasonable time. Now, if you have a service of your own that can run distributed and can withstand the failure of one or more instances, you can try to balance the number of instances against their cost, and that will go a long way towards making your service highly available without spending a fortune.
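To get a feeling for the numbers, here is a back-of-the-envelope calculation of the instance cost alone, using the prices mentioned earlier. It is a rough sketch: it assumes a 30-day month and a constant price, and it ignores the NAT gateway, EFS, the NLB and the deployer node; monthly_cost is a hypothetical helper name.

```shell
#!/bin/sh
# monthly_cost PRICE_PER_HOUR INSTANCES [HOURS]
# Compute-only estimate in USD; HOURS defaults to 720 (a 30-day month).
monthly_cost() {
    awk -v p="$1" -v n="$2" -v h="${3:-720}" 'BEGIN { printf "%.2f\n", p * n * h }'
}

monthly_cost 0.0038 3   # spot price seen during the experiment: prints 8.21
monthly_cost 0.005 3    # worst case at the price limit: prints 10.80
```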
Update: in the article How cheap is cheap? I go through the actual cost of this solution in its first 20 days of operations. If you’re after numbers check that article, too!
If you are not into CFEngine you can stop reading here. Enjoy! Otherwise, read on.
Additional info for CFEngineers
The experiment was not really about cf-serverd; cf-serverd was just a nice candidate for this test. However, I discovered in the process that placing an NLB in front of cf-serverd has no downsides. Normally, it’s not recommended to place between the clients and the policy server a proxy that alters the source address of the clients (e.g. a NAT). If you do so, the server will see all clients as a single client presenting itself with different keys all the time, and it will become very difficult for the server to infer any statistics about the connecting clients. However, that’s not the case with the NLB: source addresses are preserved, which makes this configuration suitable for ensuring the high availability of the cf-serverd process! It provides scalability as well, but that has never really been a problem for cf-serverd 😉