Scaling Puppet with Git

This post describes a simple approach to 'serverless Puppet' using Git instead of a centralised Puppet server. You can read more about this idea in John Arundel's bestselling Puppet Beginner's Guide, which describes a complete, production-ready Puppet infrastructure you can use for your own projects, with sample code on GitHub. If you'd like help with Puppet, or infrastructure in general, see About Bitfield Consulting for details.


More and more people are turning to systems automation tools like Puppet and Chef to get the most out of their environments, and to create time to focus on delivering business benefits. Puppet is most commonly deployed in client/server mode: every client is issued with an SSL certificate, clients talk to a central server over HTTPS, and manifests and assets are served over the network and applied by a locally running Puppet daemon. However, is there a better way? We present an alternative to the traditional Puppetmaster solution which we like to call 'Git Puppet'.

Guest article by Stephen Nelson-Smith

Drawbacks of puppetmaster / puppetd

Although this is the most popular deployment strategy, there are a few drawbacks to this method.

Fiddly to set up

Firstly, it's a bit fiddly to set up. A quick glance at the Puppet mailing list or the IRC channel shows a slight tendency for new users to get a bit muddled when setting up the puppetmaster and sorting out the SSL certificates. It isn't especially difficult, but it's not brain-deadeningly, works-automatically simple.

WEBrick doesn’t scale

Puppet, out of the box, uses the WEBrick HTTP server. This is a simple HTTP server from the Ruby standard library, best known as the built-in Ruby on Rails development webserver, and designed primarily for local development work. It doesn't scale very well - my experience has been that once you hit about thirty clients, it really begins to struggle. It's not especially fast, either. None of this is a surprise - it's a development webserver - but it does mean that in order to scale we need to set up a grown-up webserver.

Mongrel / Passenger not ideal (esp. for auto setup)

The most popular approach to scaling Puppet is to use either Mongrel, with a proxy server such as Nginx, or to use Passenger (mod_rails) plugged into Apache or Nginx. Now again, this isn’t especially difficult, but it certainly falls into the category of non-trivial. Again, a quick survey of the mailing list and the IRC channel will indicate that one of the most popular support requests is getting Puppet to play nicely with Passenger. Also, from the perspective of an engineer who wants to automate and simplify their infrastructure, Passenger isn’t an ideal choice - installation requires manual steps, and, last time I looked, on CentOS the preferred installation method was to use either gems or a tarball.

Puppetd memory issue

So much for the server side - what about the client? Well, it's not all good news here either. Many users have experienced problems with running the native Puppet daemon on their Puppet clients. Things may have improved in the most recent releases, but my experience has been that it exhibits behaviour consistent with that of an application with a memory leak. I ended up running puppetd out of cron, with a random delay to prevent overloading the puppetmaster with too many simultaneous connections. Several Puppet experts I know have done similarly.
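For example, a crontab entry along these lines (a sketch - the interval and delay are arbitrary, and $RANDOM assumes a bash cron shell) staggers the client runs:

SHELL=/bin/bash
# Run puppetd hourly, sleeping up to ten minutes first so that clients
# don't all hit the puppetmaster at once
0 * * * * sleep $((RANDOM % 600)); puppetd --onetime --no-daemonize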

Multi-site setup

A final drawback is that frequently an organisation's systems infrastructure is dispersed over a number of physical (and network) locations. It's not uncommon to find a company with a production network at one hosting provider, a disaster recovery setup at another, a staging environment in their own server room, and a number of local developer networks of VMs, all requiring their configurations to be managed. Quite aside from the volume of machines and connections that this can generate, some of these networks have fairly conservative firewall policies, and the distances between sites can be significant. Even with the improved performance of the latest Puppet release, administrators start to feel that it makes more sense to run local instances of puppetmasterd - and then have to engineer some way of keeping them in sync.

Note that none of these drawbacks are show-stoppers - this is by far the most popular way to deploy Puppet, and all of these problems can be solved. I’m just pointing out that these drawbacks are present, and if we can find a simpler and more elegant solution, we should give it serious consideration.

Our vision/requirements

So - what’s our vision for an improved solution? I think it needs to encompass the following requirements:

  • There should be no need to set up complicated server environments involving third-party modules and proxy servers

  • There should be no fussing about with SSL certificates

  • It should comfortably scale to hundreds of servers without needing to resort to setting up an enterprise web application stack

  • It should be fast

The plan

A useful feature of Puppet is that it ships with a standalone execution script - that is, we can write a Puppet manifest and apply it directly from the command line, with no server involved. Here's an example:

package { "screen":
  ensure => installed,
}

Simply save this as something.pp and run it with Puppet:

[root@gruyere ~]# puppet -v something.pp 
info: Applying configuration version '1264090058'
notice: //Package[screen]/ensure: created

All of Puppet's power can be harnessed in this way. A common way to organise Puppet config is into modules - for example, you might have a module called 'sshd' which has Puppet install sshd. Modules are a feature, available since 0.22.2, which makes it easy to gather files, resources, Puppet classes, templates and definitions into a related bundle with a shared namespace. For example, we could write a Puppet mysql module that contains the packages and the my.cnf, and defines a service together with its relationships to the config and package. These modules can be packaged and reused - and, to make this possible, Puppet includes the config directive modulepath. Any config directive can be passed to Puppet on the command line, so we can include our modules like this:

# puppet -v --modulepath=/srv/puppet/modules something.pp

We can then put any modules in /srv/puppet/modules, and Puppet will be able to use them.

Best practice is to keep Puppet manifests (code) in a version control system. Given that it is possible to execute Puppet manifests locally with the standalone script, can we combine these two? Well, yes - we can. We keep the Puppet config in a version control system, check it out on the local machine, and run Puppet against it locally.

You have two major choices for version control - do you go for a traditional, central repository such as Subversion, or even CVS? Or do you go for a distributed version control system such as Git, Mercurial or Bazaar?

Well, let’s look at our requirements list. Both Subversion and Git are very simple to set up - Git slightly simpler. There’s no need for SSL certificates - we can run everything over SSH, if we decide we want encryption. Git is definitely faster than Subversion in a lot of ways (by design, all commits are local), but I’ve not done a test to quantify it. So at this stage, either would do the trick.

However, let’s think a bit more about the environment in which we’re going to work. We’re going to have a bunch of servers whose config we’re going to manage via Puppet. Wouldn’t it be really handy to be able to manage this via push rather than pull? That leaves us feeling more in control, and we can rate-limit clients hitting any shared resource. From a central hub we could say: ok - push this config out to all staging servers. Or all webservers. Or just one server. It makes sense from a security perspective too - the central Puppet server is a valuable and vulnerable machine. If we push out to servers, we need only allow SSH out - we don’t need to open port 8140 to a DMZ or public hosts.

A further consideration is the possibility that we might have a number of sysadmins or developers (or devops) working on Puppet manifests. It would be awesome to make it very easy for them to branch, play, test, pull in stuff from the main production repo, and share independently with each other. This is exactly the sort of situation for which Git and other distributed version control systems were built.

So - let’s set up Git repos in such a way as to permit us to push to each machine or groups of machines (which I will call ‘spokes’) from a central machine (which I will henceforth refer to as the ‘hub’).

Bootstrap the Git-Puppet infrastructure

Let's start on your workstation, where we will begin writing Puppet manifests. You're going to need to install Git. Packages are available for most OSes and distros - I won't attempt to second-guess your environment. My workstation is a MacBook Pro, with a CentOS VM providing userland tools. Pick your poison. So, install Git, create a simple directory structure for the Puppet manifests, and then create a Git repo:

# mkdir -p puppet/manifests puppet/modules
# cd puppet
# git init

We also need an SSH key for Git to use:

# ssh-keygen -f git.rsa

Install Puppet and Git

OK, we're going to need to install Puppet and Git on all our machines. Most of my clients run CentOS or Red Hat, so I provide a small tarball which contains the latest Puppet RPMs and their dependencies. I then do a yum localinstall of these, which provides me with the local Puppet interpreter. I then write (or simply download) a bootstrap Puppet manifest to provide a custom repo, a user, key-based authentication and a bare Git repo.
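The local install step looks something like this (a sketch - the tarball and RPM names are illustrative, and will vary with your distro and Puppet version):

# tar xzf puppet-rpms.tar.gz
# yum -y localinstall puppet-rpms/*.rpm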

Here's an example bootstrap manifest:

user { "git":
  ensure => "present",
  home => "/var/git",
}

file {
  "/var/git":
    ensure  => directory,
    owner   => git,
    require => User["git"];

  "/var/git/puppet":
    ensure  => directory,
    owner   => git,
    require => [User["git"], File["/var/git"]];
}

ssh_authorized_key { "git":
  ensure => present,
  key => "INSERT PUBLIC KEY HERE",
  name => "git@atalanta-systems.com",
  target => "/var/git/.ssh/authorized_keys",
  type => rsa,
  require => File["/var/git"],
}

yumrepo { "example":
  baseurl => "http://packages.example.com",
  descr => "Example Package Repository",
  enabled => 1,
  gpgcheck => 0,
  name => "example",
}

package { "git":
  ensure => installed,
  require => Yumrepo["example"],
}

exec { "Create puppet Git repo":
  cwd => "/var/git/puppet",
  user => "git",
  command => "/usr/bin/git init --bare",
  creates => "/var/git/puppet/HEAD",
  require => [File["/var/git/puppet"], Package["git"], User["git"]],
}

Copy this to every machine, install Puppet, and run puppet -v this_file.pp. The manifest sets up the git user and a bare repo under /var/git/puppet, so that we can push the manifests to each machine over SSH as the git user. (Note that this example manifest doesn't show how to manage a Unix group using Puppet - we'll assume here that the user provider automatically creates the required group.)
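A minimal way to do that by hand, assuming root SSH access to each spoke (the hostnames here are illustrative):

# for host in web1 web2 db1; do
    scp bootstrap.pp root@$host:/tmp/bootstrap.pp
    ssh root@$host "puppet -v /tmp/bootstrap.pp"
  done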

I suggest installing the private key on your workstation and on your hub server - this allows you to connect over SSH to all your spoke machines, as the Git user.

Note that if you get an error like "Puppet ssh_authorized_key no such file or directory /home .ssh", you may need to create the .ssh directory within the git user's home directory yourself.
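On the affected spoke, as root, you can create it by hand before re-running the manifest (or add an equivalent file resource to the bootstrap manifest):

# mkdir /var/git/.ssh
# chown git:git /var/git/.ssh
# chmod 700 /var/git/.ssh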

So, now we need to push a copy of our Puppet code from our workstation to the hub:

# git remote add hub ssh://git@hubserver/var/git/puppet

Now let’s create some content - I recommend creating a module such as sudo:

# cd puppet/modules
# mkdir -p sudo/manifests sudo/files

In the modules/sudo/manifests directory, create an init.pp which simply contains the line:

import "*"

This will import any other files in the directory.

Now let’s have an install.pp:

# cat <<EOF > sudo/manifests/install.pp
class sudo::install {
  package { "sudo":
    ensure => installed,
  }
}
EOF

and also a sudoers.pp:

class sudo::sudoers {

  file { "/tmp/sudoers":
    mode => 440,
    source => "puppet:///modules/sudo/sudoers",
    notify => Exec["check-sudoers"],
  }

  exec { "check-sudoers":
    command => "/usr/sbin/visudo -cf /tmp/sudoers && cp /tmp/sudoers /etc/sudoers",
    refreshonly => true,
  }

}

Now copy your /etc/sudoers file (with or without modification - this is for illustrative purposes only) to files/sudoers.
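Assuming you're still in the puppet/modules directory, for example:

# cp /etc/sudoers sudo/files/sudoers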

Great - now we have the elements of managing sudoers. We just need to create a node definition to include it:

# cd puppet/manifests
# mkdir nodes

# cat <<EOF > nodes/yarg.pp
node "yarg" {
  include sudo::install
  include sudo::sudoers
}
EOF

Finally, we need a site.pp:

import "nodes/*.pp"

Add it:

# cd puppet
# git add . 
# git commit -m "Initial import"

Now, use ssh-agent to stash the ssh key:

# ssh-agent bash
# ssh-add /path/to/git.rsa
# git push hub master

So now we have a central repo on our hub server. We can clone this to any other workstations as required, and push updates from our local repos to the Puppet hub.

We now want to be able to push out Puppet config from the hub.

The principle is very simple - we simply add remotes to the repo on the hub. For example:

# git remote add web1 ssh://git@web1.mydomain.com/var/git/puppet

Where it gets clever is that we can specify multiple URLs for each remote. In the Git config (/var/git/puppet/config in our example), we just add a stanza such as the following:

[remote "webs"]
        url = ssh://git@matisse/var/git/puppet
        url = ssh://git@picasso/var/git/puppet
        url = ssh://git@klee/var/git/puppet

Now, provided we stash the ssh key using ssh-agent, or craft an SSH config file which specifies it, we can push from the hub to any machine, or any group of machines.
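For example, a stanza like this in the hub's ~/.ssh/config (using the hostnames from the 'webs' remote above) means the key is picked up automatically:

Host matisse picasso klee
    User git
    IdentityFile /path/to/git.rsa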

We now need to arrange for the Puppet code to appear on the spoke machines. To achieve this we use a Git post-receive hook:

#!/bin/sh
git archive --format=tar HEAD | (cd /etc/puppet && tar xf -)

Save the above as /var/git/puppet/hooks/post-receive on each spoke machine, and chmod +x it.

We could have the post-receive hook actually run Puppet if we wanted, or we could run Puppet out of cron, or we could run Puppet manually - I leave the choice up to the reader. Either way, you’ll need to point Puppet to the path where your modules have been pushed, for example:

puppet -v --modulepath=/etc/puppet/modules /etc/puppet/manifests/site.pp
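If you do want the hook itself to apply the config, a sketch along these lines would do it - though note that the hook runs as the git user, so in practice the apply step needs root rights (via sudo, for instance):

#!/bin/sh
# Deploy the pushed manifests, then apply them immediately.
# Assumes the git user may run puppet via sudo without a password.
git archive --format=tar HEAD | (cd /etc/puppet && tar xf -)
sudo puppet -v --modulepath=/etc/puppet/modules /etc/puppet/manifests/site.pp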

Ability to pull

Although our push model seems effective and, in many ways, ideal, it would be good to open the possibility of being able to pull from one of the spoke servers. The kind of situation I’m imagining here would be a developer who has a handful of test VMs on their workstation, which they want to be built with Puppet, but which are out of the reach of the Puppet server.

There are a couple of ways to achieve this. From this firewalled-off machine, we can clone the repo, as long as we have access to any of the machines to which we push Puppet, including the hub. We could set up an SSH key, and allow SSH from the IP to which the VM is NAT'd. We could fire up git-daemon on any machine with a repo, and clone it that way. Or we could run a webserver on any of the machines, with a virtual host pointing to the Git repo. The only question to consider is how to tighten up security enough to allow the NAT'd IP access to the Puppet source code. Perhaps the simplest approach, if one of the devops folk has a laptop or workstation on the same network, and has their own copy of the repo, is to make a bare clone of their repo with git clone --bare, and serve it via git-daemon temporarily. There are many ways to skin this cat!
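As a sketch of that last option (hostnames illustrative), on the colleague's workstation:

# git clone --bare ~/puppet /tmp/puppet.git
# git daemon --base-path=/tmp --export-all

and then from the firewalled-off VM:

# git clone git://workstation.local/puppet.git

Bear in mind that git-daemon serves the repo unauthenticated, so this is strictly a temporary, trusted-network measure.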

Further benefits

A full discussion of the merits of distributed version control is clearly outside the remit of this article, but having a Puppet infrastructure powered by Git opens up a whole world of collaborative development opportunities. One particularly attractive possibility is how easy it becomes to work on a branch and carry out test-driven Puppet development. We can push a branch (or a tag) to a test machine, verify that it works, and merge it back into the main production branch, without any headache at all. We could build a simple system around, for example, Rake and cucumber-puppet (a tool for behavioural testing of Puppet manifests) to move towards a genuine test-first approach.
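For instance, with a test spoke set up as a remote in the same way as before (the remote and branch names here are hypothetical), pushing a topic branch into the spoke's master triggers the same post-receive deployment:

# git checkout -b sudoers-tweak
# git commit -am "Tighten up sudoers"
# git push testbox sudoers-tweak:master

Once it checks out, merge the branch back into master and push to the hub as usual.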

A further idea would be to configure the post-receive hook to behave differently based upon the branch or tag to which the code was pushed. The post-receive hook has access to the ref name, which has the form refs/heads/branchname (or refs/tags/tagname). This would make it possible to have tags or branches corresponding to push-only, run-in-dry-run-mode, or make-live behaviour.
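A post-receive hook receives one line per updated ref on standard input, of the form '<old-sha> <new-sha> <refname>', so a branch-aware hook might look something like this (a sketch - the branch names and policies are illustrative):

#!/bin/sh
while read oldrev newrev refname; do
  case $refname in
    refs/heads/production)
      # make-live: deploy and apply
      git archive --format=tar $newrev | (cd /etc/puppet && tar xf -)
      puppet -v --modulepath=/etc/puppet/modules /etc/puppet/manifests/site.pp
      ;;
    refs/heads/staging)
      # dry-run: deploy, but only show what would change
      git archive --format=tar $newrev | (cd /etc/puppet && tar xf -)
      puppet -v --noop --modulepath=/etc/puppet/modules /etc/puppet/manifests/site.pp
      ;;
    *)
      # push-only: accept the ref but deploy nothing
      ;;
  esac
done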

One final idea worth mentioning would be to have the whole infrastructure managed by Hudson, giving us a simple GUI to drive our Puppet deployment.

Git Puppet: conclusion

Moving away, conceptually, from running Puppet in client/server mode frees us up to deliver a Puppet environment which is faster and more scalable, and opens up interesting creative opportunities. Linking this into a distributed version control system seems a comfortable fit. I'm running a number of environments in this fashion, and recommend you give it a go.

About the author

Stephen Nelson-Smith is an e-business consultant and Technical Manager, with a strong background in Linux, UNIX, Python and Ruby. He is currently one of the technical leads and manages the operations team for the UK Government’s National Strategies website - one of the largest Drupal deployments in Europe. He specialises in system monitoring and automation, and delivering value by mentoring and training in lean and agile principles.
