What's wrong with this image?

What's wrong with this image?

Is this a familiar nightmare? “The server you built yesterday is great,” your boss enthuses. “Now we need a hundred more just like it. And when you’ve done that, could you take a look at my laptop before lunch? I can’t seem to get into my email.”

It’s common nowadays for sysadmins to manage very large numbers of servers - often virtual servers created on-demand in cloud platforms like Amazon EC2. How do you build so many servers and keep them updated - automatically, reliably, and quickly?

The problem

One common strategy is the golden image pattern: set up the server the way you want it, and then copy the disk image and use that to boot as many clone machines as you need. Another is the automation pattern: start with a generic server, and then apply automatic configuration tools such as Puppet to bring the machine into a specified state where it is ready for production.

The golden image

The golden image is attractive to many because it’s easy and cheap. It’s quick to create images, and you just need a big disk to store them.

Imaged machines also have low latency: they’re ready for production in just a few minutes from startup. Finally, it’s platform agnostic: you can image virtually any kind of machine, even Windows (which most config automation tools can’t handle very well).


So what’s the problem with the golden image? Luke Kanies, creator of Puppet, prefers to call it the “foil ball” approach: your images can quickly become an unmanageable, crumpled mess. Firstly, how many images do you need? Well, as many as you have different types of machines: web servers, databases, load balancers, DNS servers, and so on.

The number of images multiplies again when you take history into account. What happens when you need to recreate the state of the database server last month to troubleshoot a problem? You need to keep old images around, which means a new image every time you make a change.

And here’s a thing: if you do need to make a change (for example a security upgrade for a critical package), you’ll have to rebuild the golden image and then re-image and reboot all affected machines - which means lots of bandwidth and possibly downtime. And, unfortunately, even imaged machines usually need some final tweaking (eg IP addresses) before they’re ready for use.

Automation tools

By comparison, building your machines with an automatic configuration tool such as Puppet, Chef or even cfengine gives you many advantages. The configuration is self-documenting; it’s easy to see what should be on each machine, and more importantly, why it’s there. It will also stay in sync with reality.

Although your server types will be different, they will also likely have a lot in common. With images, these supposedly common settings could easily drift apart over time. With automation tools, you can put these settings into modules or classes, just as a software developer re-uses a module or library in another project.

In fact, all the advantages of software modules are available: you can share code with others, download public recipes and manifests, and maintain a complete version history of your network.

Automated configuration is intrinsically multi-platform. Puppet and Chef can abstract away some of the differences between operating systems, and you can add minor tweaks of your own. For example, you might need a slightly different MySQL config setting for MySQL 4.x as opposed to 5.x. A one-line switch in the configuration can address this problem, instead of creating two disk images.

Of course, everything comes with a downside. It is more difficult and usually more time-consuming to build an automated configuration, especially from scratch and when you’re still learning to use the tools. And some things are hard to automate. Some software just wasn’t written with automated installation and management in mind.

Also, not all software will be packaged for your platform (or you need special compilation options). In these cases you will need to package the software yourself.

Finally, configuration management doesn’t always mix well with human intervention. In particular, someone may make a config change on the machine which is then automatically reversed by Puppet. At best this can be a nasty surprise, at worst it can result in a site outage. Managing desktops automatically can create similar problems. This needs to be addressed at a business level with a proper change control process and making it clear to everyone where the boundaries of automatic management lie.

Combined strategies

So both imaging and automation have their place; neither is a magic bullet. The best strategy for scaling your deployments is often a judicious combination of these two approaches. You can, for example, start with a ‘stem cell’ image containing an approved base build, security fixes and settings which don’t often change, which is then configured by Puppet or Chef. Or you could use provisioning tools such as Kickstart and Cobbler and then hand over to Puppet. Or even build the machine with Puppet and then create a golden image from this.

One thing is certain: ignoring the problem is not an option. Moving to an automated network might seem like a big project, but the key is to start small. Set aside a block of time every week to spend on improving your infrastructure, and automate one small thing at a time as part of a gradual continuous process. The more you invest in this, the more benefits you’ll see - and you can bet your competitors will be doing the same.

Scaling Puppet with Git

Scaling Puppet with Git

Deciding which Docker tag to use

Deciding which Docker tag to use