Building a simple, resilient, cheap network service with AWS

In this post I’ll describe how I put together a number of pieces of information about AWS features to experiment with an idea. It’s nothing advanced; rather, it’s what happens when you are studying something and you start seeing the possibilities. Don’t expect rocket science, then: it’s more like a handful of notes I made in the hope they may be useful to more people than just myself.

Being an experiment where I was supposed to learn how to do things, it’s a manual set-up. Automation will follow, and in my case it will be Terraform, but not in this post.

AWS features in this article

The first feature that will be part of this article is spot instances. Spot instances are virtual machines that utilise unused capacity in AWS’ Elastic Compute Cloud (EC2). They cost a fraction of the price for a normal on-demand instance, up to a 90% discount, but there is a “but”, of course: AWS can terminate those instances on short notice to reclaim that capacity when they need it.

If you think that no “durable” application can run on spot instances, you should think twice. In fact, there are distributed applications designed to be so resilient that they can actually run on spot instances, which saves their owners a lot of money. The possibility of running a resilient distributed application on spot instances was an interesting feature that I wanted to explore.

And then I was studying autoscaling. Autoscaling is a functionality in AWS (and, I guess, in other cloud providers’ offerings, too) that can be used to ensure adequate capacity for a service. This often means creating additional instances during request peaks (scale-out), and reducing the number of instances when the capacity exceeds the demand by a certain threshold (scale-in). A “degenerate” case of autoscaling is to set the minimum and maximum number of instances to the same value, so that the capacity is kept constant when the service fails on one or more instances: in that case, autoscaling will replace dead instances with new ones to keep the capacity at the desired level.

Diagram of the set-up

In this diagram we illustrate only the components discussed in this article: the large grey box is the VPC, and the three smaller boxes are the availability zones; in each AZ, the box at the top is a public subnet and the lower box is a private subnet; there is one spot instance in each of the private subnets, and they are all part of an autoscaling group (green box) and access the Internet through a NAT gateway in one of the public subnets; connections to the load balancer from the Internet are dispatched to the three instances; the deployer node is also accessible from the Internet; both the spot instances and the deployer node have access to the EFS filesystems.

To test all these things together I decided to build a redundant CFEngine policy server, running on spot instances for economy. cf-serverd is the CFEngine process delegated to providing CFEngine clients with the policies they should apply locally.

Connections between clients and cf-serverd are secured with public key encryption. When a client “registers” to a CFEngine policy server, the policy server will save the client’s public key locally and, conversely, the client will save the server’s public key. Because of this key exchange, the policy server keys must be the same across the different instances in a distributed set-up, and they must not change if a policy server is replaced: if the key changes, the client will not trust the new policy server and won’t accept the server’s policies. At the same time, it’s important that the public keys collected by the server are also preserved if an instance dies, otherwise a new server won’t be able to recognise a client and will likely refuse to serve the policies to it.

To share this information across running policy server instances and the future instances that will replace them, a shared filesystem for the keys is necessary. Besides, a shared filesystem for the policies, although not required, is very convenient. AWS comes to the rescue with EFS, its managed NFS service.

Finally, we want to present a single interface for the distributed service we are building. Here, the obvious choice is a Network Load Balancer from AWS’ Elastic Load Balancing service (we’ll abbreviate it as NLB). The NLB will take incoming connections to port 5308 and distribute them to the instances. This requires that the policy server instances are registered as targets for the load balancer (they will form a target group). However, spot instances will be recycled every now and then, and the target group must follow. That’s not a problem: autoscaling will keep the target group up to date, so that the load balancer always directs traffic to the live instances.

Lastly, we’ll use an additional, on-demand (non-spot) instance, which we’ll call the “deployer node”, to manage the policy files distributed by the policy servers. In theory, we could do this from one of the policy servers, but we’d run the risk that the instance is terminated while we are working on policy files, leaving the policies in an inconsistent state. Since this node is stable, we could place on it anything that the policy servers may want to refer to, e.g. a syslog server to collect CFEngine logs, but that’s just an example (we could use CloudWatch logs for that). We are not going to talk about the deployer node here, but there is one consideration to make: this node will be more exposed than the rest of the system (it’s more or less a bastion host for the service), so care must be taken to keep it as secure as possible.

Another thing that we are not going to deal with is DNS. I haven’t registered a domain in Route 53 (AWS’ DNS service), and machines will change IPs and hostnames every time they are recycled. The name of the NLB itself will be a horrible, generic blob. To have more manageable names you should have a domain registered in Route 53 and do a bit more magic to register them. That is also not covered here.

VPC set-up

I’ll assume you already have at least one VPC ready. In the VPC where you want to deploy this configuration, you must set both “DNS resolution” and “DNS hostnames” to “yes”. You will also need at least one public subnet and two or three private subnets in distinct availability zones.
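If you prefer the command line to the console, the two DNS attributes can also be switched on with the AWS CLI. This is only a minimal sketch, assuming the CLI is installed and configured with suitable credentials; the function name and the VPC ID in the usage line are placeholders of mine, not part of the set-up described here.

```shell
# Enable "DNS resolution" and "DNS hostnames" on a VPC.
# enable_vpc_dns is a hypothetical helper name; $1 is the VPC ID.
enable_vpc_dns() {
    # modify-vpc-attribute accepts only one attribute per call,
    # hence the two separate invocations
    aws ec2 modify-vpc-attribute --vpc-id "$1" --enable-dns-support '{"Value":true}'
    aws ec2 modify-vpc-attribute --vpc-id "$1" --enable-dns-hostnames '{"Value":true}'
}

# Usage (placeholder ID, replace with your own):
#   enable_vpc_dns vpc-0123456789abcdef0
```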

Create a NAT gateway in the public subnet (if you don’t have one already), and ensure that the routing tables used in your private subnets have a default route through it.

Set up a security group to be used with the policy hubs: you can call it cfengine-policy-hub. In the security group, allow access to port 5308/TCP from any address (you will place the policy hubs in the private subnets anyway, and they won’t be reachable on that port unless they are contacted through the NLB). For debugging purposes, it may turn out useful to have rules that allow SSH access from some trusted IP.

Set up another security group that will allow access to the deployer node only from trusted IP addresses, and name it appropriately. For example, let’s say that you’ll allow SSH connections to the deployer node from your home’s IP address: then you can call the security group access-from-home.

Set up a security group to be used on the EFS mount targets, so that they accept only NFS requests from the members of the cfengine-policy-hub and access-from-home security groups. You can call it cfengine-policy-hub-efs.
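The three security groups above could be sketched with the AWS CLI roughly as follows. The helper function names, the descriptions and the trusted address in the usage comments are hypothetical; the group names are the ones used in this article, and 2049/TCP is the standard NFS port.

```shell
# Create a security group in a VPC and print its ID.
# $1 = VPC ID, $2 = group name, $3 = description
create_sg() {
    aws ec2 create-security-group --vpc-id "$1" --group-name "$2" \
        --description "$3" --query GroupId --output text
}

# Allow inbound TCP on port $2 into group $1 from the CIDR block $3.
allow_tcp_from_cidr() {
    aws ec2 authorize-security-group-ingress --group-id "$1" \
        --protocol tcp --port "$2" --cidr "$3"
}

# Allow inbound TCP on port $2 into group $1 from members of group $3.
allow_tcp_from_group() {
    aws ec2 authorize-security-group-ingress --group-id "$1" \
        --protocol tcp --port "$2" --source-group "$3"
}

# Usage (VPC ID and home address are placeholders):
#   hub=$(create_sg "$VPC_ID" cfengine-policy-hub "CFEngine policy hubs")
#   home=$(create_sg "$VPC_ID" access-from-home "SSH from trusted addresses")
#   efs=$(create_sg "$VPC_ID" cfengine-policy-hub-efs "NFS for the hubs")
#   allow_tcp_from_cidr  "$hub"  5308 0.0.0.0/0
#   allow_tcp_from_cidr  "$home" 22   203.0.113.10/32
#   allow_tcp_from_group "$efs"  2049 "$hub"
#   allow_tcp_from_group "$efs"  2049 "$home"
```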

EFS set-up

As mentioned earlier, you’ll need two shared filesystems: one for the masterfiles, which will be mounted under /var/cfengine/masterfiles, and one for the keys, which will be mounted under /var/cfengine/ppkeys. For each of these filesystems, create a mount target in each of the private subnets where you will deploy policy hubs, and guard those mount targets with the cfengine-policy-hub-efs security group.
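For reference, here is a rough AWS CLI equivalent of the EFS steps. It is a sketch under assumptions: the function names and the creation tokens in the usage comments are made up for illustration, and the subnet and security group IDs are whatever you created earlier. Keep in mind that a filesystem must reach the “available” state before mount targets can be added to it.

```shell
# Create an EFS filesystem and print its ID.
# $1 = creation token (an arbitrary string that makes the request idempotent)
create_efs() {
    aws efs create-file-system --creation-token "$1" \
        --query FileSystemId --output text
}

# Create one mount target per subnet for a filesystem.
# $1 = filesystem ID, $2 = security group ID, remaining args = subnet IDs
add_mount_targets() {
    fs_id=$1
    sg_id=$2
    shift 2
    for subnet in "$@"; do
        aws efs create-mount-target --file-system-id "$fs_id" \
            --subnet-id "$subnet" --security-groups "$sg_id"
    done
}

# Usage (tokens and IDs are placeholders):
#   masterfiles_fs=$(create_efs cfengine-masterfiles)
#   ppkeys_fs=$(create_efs cfengine-ppkeys)
#   add_mount_targets "$masterfiles_fs" "$EFS_SG" "$SUBNET_A" "$SUBNET_B" "$SUBNET_C"
#   add_mount_targets "$ppkeys_fs"      "$EFS_SG" "$SUBNET_A" "$SUBNET_B" "$SUBNET_C"
```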

Policy hub AMI set-up

While AWS prepares your mount targets, create an instance in the public subnet that you’ll use as a template for the hub AMI. Allow access to this node only through the access-from-home security group you created earlier. That will also give this instance access to the EFS filesystems through the cfengine-policy-hub-efs group.

I’ll use Debian as the basis for this AMI. A list of the AMI identifiers for the official Debian Linux 10 “buster” images in each region is available here.

You will have to run a few commands on the instance once it’s ready. Log in to the machine, get superuser privileges through “sudo -i” and run the following commands:

apt-get update
apt-get upgrade -y
apt-get install -y gnupg2 nfs-common
mkdir -m 700 -p /var/cfengine/masterfiles /var/cfengine/ppkeys

By now the EFS mount targets should be ready. From the AWS console, you will be able to access the command line to mount each of the two filesystems. You can use it as a guideline to build the lines in /etc/fstab used to mount these filesystems automatically at system’s boot. The suggested command line will look something like:

sudo mount -t nfs4 -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport efs

The one for the other EFS filesystem will have a different mount target (the fs-xxxxxxxx part in the hostname above).

The information in the command line above translates to the following line in /etc/fstab (here we suppose that the NFS filesystem mentioned in the command line is the one for the masterfiles):    /var/cfengine/masterfiles    nfs4 nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport,_netdev    0 2

Notice how in the line for fstab we add the “_netdev” option: this is to instruct the system that it will be dealing with a network filesystem and should hold the mount until the network stack is up and ready.
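Since the two fstab lines differ only in the filesystem and the mount point, a tiny helper can generate both. This is just a sketch: the function name is mine, and the first argument is whatever the console’s suggested mount command uses as the filesystem to mount (the value in the usage comment is a placeholder, not a real mount target). The options and the trailing fields mirror the line shown above.

```shell
# Print an /etc/fstab line for an EFS filesystem.
# $1 = NFS source as given in the console's mount command, $2 = mount point
efs_fstab_line() {
    # same options as the suggested mount command, plus _netdev
    opts="nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport,_netdev"
    printf '%s\t%s\tnfs4\t%s\t0 2\n' "$1" "$2" "$opts"
}

# Usage (placeholder source; append the output to /etc/fstab as root):
#   efs_fstab_line "efs.example.invalid:/" /var/cfengine/ppkeys >> /etc/fstab
```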

If the lines in fstab are correct, you will be able to mount the (still empty) EFS filesystems by running:

mount /var/cfengine/masterfiles
mount /var/cfengine/ppkeys

Now you can install CFEngine. At the time of writing, the latest LTS (long-term support) version is 3.12.2 and the corresponding Debian package version is 3.12.2-2. To install that version and not the absolute latest (a 3.14.x as of today), run the following commands:

wget -qO- | apt-key add -
echo "deb stable main" > /etc/apt/sources.list.d/cfengine-community.list
apt-get update
apt-get install -y cfengine-community=3.12.2-2
echo cfengine-community hold | sudo dpkg --set-selections

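The installation will have copied the masterfiles to /var/cfengine/masterfiles and created a key pair in /var/cfengine/ppkeys. Since both directories are EFS mounts, the policies and the keys now live on the shared filesystems, and that’s all you need.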

Now, use this instance to create an AMI from the AWS console. When the AMI is ready, you can destroy this instance as you’ll have no further use for it (the hub instances will be created in the private network).
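The AMI can also be created from the command line. The sketch below assumes a configured AWS CLI; the function name and the AMI name in the usage comment are hypothetical. Note that create-image reboots the instance by default, so that the filesystem is captured in a consistent state.

```shell
# Create an AMI from an instance, wait until it is usable, print its ID.
# $1 = instance ID, $2 = name for the new AMI
create_hub_ami() {
    ami_id=$(aws ec2 create-image --instance-id "$1" --name "$2" \
        --query ImageId --output text) || return 1
    # blocks until the image reaches the "available" state
    aws ec2 wait image-available --image-ids "$ami_id" || return 1
    printf '%s\n' "$ami_id"
}

# Usage (placeholder instance ID and AMI name):
#   ami=$(create_hub_ami i-0123456789abcdef0 cfengine-hub-3.12.2)
```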

Create a target group

In the EC2 service, go to load balancing -> target groups, create a target group and call it cfengine-hub-tg. As protocol, select TCP with port 5308, the port used by cf-serverd to serve policies to clients. Make sure to select the VPC where you’ll be deploying your policy servers.
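A possible AWS CLI equivalent of this step, as a sketch: the function name is hypothetical, while the target group name, protocol and port are the ones chosen above.

```shell
# Create the TCP/5308 target group in a VPC and print its ARN.
# $1 = VPC ID
create_hub_target_group() {
    aws elbv2 create-target-group --name cfengine-hub-tg \
        --protocol TCP --port 5308 --vpc-id "$1" \
        --query 'TargetGroups[0].TargetGroupArn' --output text
}

# Usage (placeholder VPC ID):
#   tg_arn=$(create_hub_target_group vpc-0123456789abcdef0)
```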

Create a NLB

In the EC2 service, go to load balancing -> load balancer and create a network load balancer; call it cfengine-hub-nlb. The protocol will be again TCP, port 5308. Check that the VPC you select is the right one. Select the cfengine-hub-tg target group and don’t register any target: autoscaling will take care of it for you.
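On the command line, the load balancer and its listener are two separate calls. The following is a sketch, not the exact procedure I followed: the function names are mine, and which subnets you pass is your choice (an Internet-facing load balancer needs public ones).

```shell
# Create the network load balancer and print its ARN.
# All arguments are subnet IDs.
create_hub_nlb() {
    aws elbv2 create-load-balancer --name cfengine-hub-nlb --type network \
        --subnets "$@" \
        --query 'LoadBalancers[0].LoadBalancerArn' --output text
}

# Forward TCP/5308 on the load balancer to a target group.
# $1 = load balancer ARN, $2 = target group ARN
add_hub_listener() {
    aws elbv2 create-listener --load-balancer-arn "$1" \
        --protocol TCP --port 5308 \
        --default-actions "Type=forward,TargetGroupArn=$2"
}

# Usage (placeholder subnet variables; $tg_arn is the cfengine-hub-tg ARN):
#   nlb_arn=$(create_hub_nlb "$PUBLIC_SUBNET_A" "$PUBLIC_SUBNET_B" "$PUBLIC_SUBNET_C")
#   add_hub_listener "$nlb_arn" "$tg_arn"
```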

Create the autoscaling group

This is where the real fun starts. In the EC2 service, go to Auto Scaling -> Launch Configurations and start creating a new one. You will have the choice to create a launch configuration or a launch template: the latter seemed a relatively new feature and I didn’t want to spend time learning about it just then. The most important thing for the moment was to see if my experiment worked, so I created a new launch configuration and not a template.

In the launch configuration, I set:

  • the AMI to use when launching a new instance: it will be the AMI I’ve created a few steps back;
  • instance type: I selected a t2.micro, use what you see fit;
  • you can call this configuration CFEngine hub v3.12.2 – LC v1
  • request spot instances; you will be asked for the maximum price you are willing to pay to run a spot instance; at the time I was working on this experiment, the current price was $0.0038; I decided to set the limit to $0.005 to see what would happen;
  • in the advanced details, set this user data: you want the policy servers to bootstrap themselves against themselves, and this small shell snippet will do:
    cf-agent --bootstrap $( curl -s )
  • IP address type: I’ve decided to assign private addresses only, since I’ll want these machines to run in a private network, access the Internet only through the NAT gateway, and be accessed only through the NLB;
  • leave the storage setting as default
  • security group: use cfengine-policy-hub, because that’s what we created it for.
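The settings in the list above map fairly directly onto a single AWS CLI call. Take this as a sketch under assumptions: the function name is hypothetical, the user data is read from a local file that you would fill with the bootstrap snippet, and the instance type and price limit are the ones I picked. AWS has since been steering users towards launch templates, so check that launch configurations are still available to your account.

```shell
# Create a launch configuration for spot policy hubs.
# $1 = launch configuration name, $2 = AMI ID,
# $3 = security group ID, $4 = path to a file holding the user data
create_hub_launch_config() {
    # --spot-price is the maximum hourly price we are willing to pay;
    # --no-associate-public-ip-address keeps the hubs on private addresses
    aws autoscaling create-launch-configuration \
        --launch-configuration-name "$1" \
        --image-id "$2" --instance-type t2.micro \
        --spot-price 0.005 \
        --security-groups "$3" \
        --no-associate-public-ip-address \
        --user-data "file://$4"
}

# Usage (placeholder IDs; cloud-init normally expects a user-data
# script to start with a "#!" line):
#   create_hub_launch_config "cfengine-hub-lc-v1" "$AMI_ID" "$HUB_SG" userdata.sh
```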

Proceed with the creation of the launch configuration. Then, switch to the Auto Scaling Groups item in the menu and start creating a new one:

  • select “launch configuration”, and then the one you just created;
  • name the group cfengine-hub-v3.12.2-asg;
  • I was going to deploy on 3 separate availability zones, so I set group size to 3;
  • select the right VPC and the private subnets for each AZ you want these instances to be deployed to;
  • choose to keep this group at its initial size: this way, the ASG will keep the configuration stable (remember that the goal is not to scale up or down, just to have an instance replaced when AWS reclaims the capacity back);
  • you can choose to be notified via email when the autoscaling kicks in; I decided to go for it because I want to be notified when autoscaling reacts to the deletion of one or more spot instances;
  • add tags to the instances, if you like.

Once the autoscaling group is created, edit it in the console:

  • for target groups, select cfengine-hub-tg; autoscaling will keep this target group up to date, adding new instances to it and removing the ones that die;
  • for health checks, select ELB: that means that when the load balancer marks an instance as unhealthy (e.g. cf-serverd died for any reason), the ASG will replace the instance with a new one; the alternative is EC2, which would work only if the instance itself dies, but won’t replace it if the service is non-functional on a functional instance;
  • health check grace period, I selected 60 seconds; may be too short in a production environment, but it’s OK for an experiment;
  • termination policies: the default is fine;
  • cooldown period: 300 seconds is OK
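With the AWS CLI, the creation and the later edits of the autoscaling group can be collapsed into one call. This is a sketch with a hypothetical function name; the group name, size, health check settings and cooldown are the ones listed above. Setting the minimum, maximum and desired sizes to the same value is what keeps the group at a constant size.

```shell
# Create a fixed-size autoscaling group attached to a target group.
# $1 = launch configuration name, $2 = target group ARN,
# $3 = comma-separated list of private subnet IDs
create_hub_asg() {
    aws autoscaling create-auto-scaling-group \
        --auto-scaling-group-name cfengine-hub-v3.12.2-asg \
        --launch-configuration-name "$1" \
        --min-size 3 --max-size 3 --desired-capacity 3 \
        --vpc-zone-identifier "$3" \
        --target-group-arns "$2" \
        --health-check-type ELB --health-check-grace-period 60 \
        --default-cooldown 300
}

# Usage (placeholder values):
#   create_hub_asg "cfengine-hub-lc-v1" "$tg_arn" "subnet-aaa,subnet-bbb,subnet-ccc"
```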

Let the show begin

Once the ASG is created, it will start to evaluate the situation, see that the service has no active instance and, eventually, start to create them. You can follow up on what’s happening by selecting your newly made autoscaling group and checking the Activity History and Instances tabs.

When all of the instances are ready, try terminating one instance and see what happens. Then, try logging in to one instance, stop the cf-serverd service with systemctl stop cfengine3, and wait a while to see what happens.
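While you run these tests, you can watch the reaction from a terminal instead of the console. The two helpers below are a sketch (the function names are mine): the first lists recent activities of the autoscaling group, the second shows how the load balancer currently sees each target.

```shell
# Show the most recent scaling activities of an autoscaling group.
# $1 = autoscaling group name
show_scaling_activities() {
    aws autoscaling describe-scaling-activities \
        --auto-scaling-group-name "$1" --max-items 5 \
        --query 'Activities[].[StartTime,StatusCode,Description]' --output text
}

# Show the health state of every target in a target group.
# $1 = target group ARN
show_target_health() {
    aws elbv2 describe-target-health --target-group-arn "$1" \
        --query 'TargetHealthDescriptions[].[Target.Id,TargetHealth.State]' \
        --output text
}

# Usage ($tg_arn is the cfengine-hub-tg ARN):
#   show_scaling_activities cfengine-hub-v3.12.2-asg
#   show_target_health "$tg_arn"
```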

With these checks you will build some confidence that, the moment AWS reclaims the capacity you are using, your service will be restored in a reasonable time. Now, if you have a service of your own that can run distributed and can withstand the failure of one or more instances, you can try to balance the number of instances against their cost, and that will go a long way towards making your service highly available without spending a fortune.

Update: in the article How cheap is cheap? I go through the actual cost of this solution in its first 20 days of operations. If you’re after numbers check that article, too!

If you are not into CFEngine you can stop reading here. Enjoy! Otherwise, read on.

Additional info for CFEngineers

The experiment was not really about cf-serverd; cf-serverd was just a nice candidate for this test. However, I discovered in the process that placing an NLB in front of cf-serverd has no downsides. Normally, it’s not recommended to place between the clients and the policy server a proxy that alters the source address of the clients (e.g. a NAT). If you do so, the server will see all clients as a single client presenting itself with different keys all the time, and it will become very difficult from the server’s perspective to infer any statistic on the connecting clients. However, that’s not the case with the NLB: source addresses are preserved, which makes this configuration suitable to ensure the high availability of the cf-serverd process! It helps with scalability as well, but that has never really been a problem for cf-serverd 😉

