General sysadmin topics.
The ever-amusing Scott Adams with his take on how to boost one’s professional credibility.
“We’re having a problem sending email out of the department.”
“What’s the problem?” I asked.
“We can’t send mail more than 500 miles,” the chairman explained.
I choked on my latte. “Come again?”
“We can’t send mail farther than 500 miles from here,” he repeated. “A little bit more, actually. Call it 520 miles. But no farther.”
“Um… Email really doesn’t work that way, generally,” I said, trying to keep panic out of my voice.
More from Agile Testing’s Grig Gheorghiu on setting up Opscode Chef, creating your own cookbook, modifying an existing cookbook, creating a role and adding a client machine to that role.
Excellent introduction to Puppet for cloud management by Jeff Wallace, featuring EC2 and Rackspace Cloud integration, Puppet classes, using Ruby logic in ERB templates, and inheriting node definitions to create identical configurations.
cucumber-puppet is the glue between cucumber and Puppet, allowing you to write behavioural tests, or features as cucumber calls it, for your Puppet manifest.
It’s been said that deploying Java apps is hard for Linux packages, but Puppet makes it very easy. This is only the tip of the iceberg—you can use the same tool to deploy mailservers and databases as well as appservers. It fits in well whether you have 20 machines or thousands. It’s agnostic to cloud vs physical hardware, and plays nicely in all places. The example that follows below is designed to be executed locally, though in a typical deployment, you’ll host it on a central server, called a puppetmaster, and then roll your configuration out to the nodes using puppet.
Video of Stephen Nelson-Smith speaking about Git for sysadmins - how it works, how to use it, how to support it, and how it compares to other version control systems.
Gizzard is a Scala framework that makes it easy to create custom fault-tolerant, distributed databases. At a high level, Gizzard is a middleware networking service that manages partitioning data across arbitrary backend datastores (e.g., SQL databases, Lucene, etc.). Nick Kallen from Twitter Engineering outlines what Gizzard is and how it works.
pigz, which stands for parallel implementation of gzip, is a fully functional replacement for gzip that exploits multiple processors and multiple cores to the hilt when compressing data. pigz was written by Mark Adler, and uses the zlib and pthread libraries. John Allspaw notes that “When you’ve got massive files, this can be a pretty big advantage, especially when you’ve got lots of cores sitting around.”
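The general idea of compressing independent blocks on several cores can be sketched in a few lines of Python. This is not how pigz itself is implemented; it is a simplified illustration that relies on the fact that concatenated gzip members form a valid gzip stream. The function name `parallel_gzip`, the block size, and the worker count are all assumptions chosen for the example.

```python
import gzip
from concurrent.futures import ThreadPoolExecutor

# Hypothetical block size for this sketch; tune it for your data.
CHUNK_SIZE = 128 * 1024


def parallel_gzip(data: bytes, workers: int = 4, chunk_size: int = CHUNK_SIZE) -> bytes:
    """Compress fixed-size chunks concurrently and join them as gzip members."""
    # Split the input into chunks; an empty input still yields one empty member.
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)] or [b""]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in input order, so members stay in sequence.
        members = list(pool.map(gzip.compress, chunks))
    # A sequence of gzip members is itself a valid gzip stream.
    return b"".join(members)
```

The output can be read back with `gzip.decompress()` or the ordinary `gunzip` tool. Compressing chunks independently usually costs a little compression ratio, because each chunk starts with an empty dictionary.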
Twitter recently switched from an Apache/Mongrel web infrastructure to one based on Unicorn, a high-performance load balancing application server, and Stormcloud, a Unicorn monitoring and management system. Ben Sandofsky and John Adams from Twitter Engineering explain the benefits.
Agile Testing’s Grig Gheorghiu provides a simple, step-by-step tutorial for installing Opscode Chef and getting a Chef client to talk to a Chef server.
Vagrant is a tool for building and distributing virtualized development environments, providing easy to configure, lightweight, reproducible, and portable virtual machines targeted at development environments. Vagrant includes automated virtual machine creation using Oracle’s VirtualBox, automated provisioning of virtual environments using Opscode Chef, port forwarding to the host machine, full SSH access to created environments, shared folders, packaging environments into distributable boxes, and easy teardown and rebuild of your environments.
Poul-Henning Kamp is the author of Varnish, an open-source HTTP accelerator used by Facebook, Slashdot, Wikia and many other sites to speed up slow Web servers. In this article for ACM Queue he describes how he applied his experience as a lead kernel developer on FreeBSD to developing a new, ultra-high-performance algorithm for cache expiry, and discusses how to optimise traditional algorithms for virtual-memory execution environments.
We have been doing various experiments in our EC2 web serving cluster to serve maximum traffic at minimum cost. I thought our experience would be useful to many other people using EC2.
Applying agile and lean methodologies to system administration.
For me, what is really exciting about DevOps is the notion that software development, infrastructure engineering and operational automation can and should be done simultaneously and collaboratively. DevOps doesn’t invalidate ITIL, nor does it mean unbridled application deployment.
The way I see it we already have some great examples of DevOps in action; it’s just that the term hasn’t been applied to them yet.
Can you remember the last time you had to apply patches or config file changes to a system? And did you have that fingers-crossed feeling? Wouldn’t it be great if you could install a patch and run a series of tests to see if everything behaved the way it should?
I’ve been managing systems teams in an Agile environment for a number of years, and after thought and experimentation, I can recommend using an approach borrowed from Lean systems management, called Kanban.
Here is a detailed example of a fairly typical 2-tier Kanban board, for teams that know the basics of Kanban and are taking their first steps towards implementing it in practice.
Agile training expert Robert Dempsey examines the application of Kanban techniques to system administration work, with a devops flavour.
Technologies and architectures for data storage and retrieval.
Cassandra is a highly scalable NoSQL database solution. Jonathan Ellis demolishes some misconceptions about Cassandra with a look at its replication capabilities, reliability, and interoperability with Hadoop and similar tools.
A review of the various HA solutions available for MySQL.
Handy notes and tips on squeezing more performance out of your MySQL server.
Hive is a data warehouse infrastructure built over Hadoop. It provides tools to enable easy data ETL, a mechanism to put structures on the data, and the capability to query and analyze large data sets stored in Hadoop files.
David Mytton of BoxedIce explains how his MongoDB infrastructure works and gives some tips and tricks based on experience running MongoDB in production for several months.
SQL databases are fundamentally non-scalable, and there is no magical pixie dust that we, or anyone, can sprinkle on them to suddenly make them scale.
In this FutureRuby talk, Ilya Grigorik explores Tokyo Cabinet’s features such as the key-value store, ordered traversal, attribute search, schemaless data structures, indexing, and scripting with Lua.
Deployment of applications, version control, and continuous integration and nightly build resources.
The two most important parts of this talk are the observation that Dev, QA and Operations teams have to blend into each other slightly to achieve deployments at such a velocity, and the fact that they are not afraid to break the website by deploying code from trunk.
How to set up continuous integration testing for Puppet manifests.
It’s important to realize that the tools you use are largely independent of the integration strategy you use. Although many people associate DVCSs with feature branching, they can be used with CI. All you need to do is mark one branch on one repository as the mainline. If everyone pulls and pushes to that every day, then you have a CI mainline. Indeed with a disciplined team, I would usually prefer to use a DVCS on a CI project than a centralized one.
Collaboration and mutual respect between developers and sysadmins.
Deployment is definitely one of the places where the rubber meets the road. In some organizations, deployment of new code can be one of the most stressful and divisive parts of their work.
Devopsdays was a small conference about a couple of emerging themes combining Development and Operations.
Hilarious Monty-Python-mashup video explaining the devops movement.
Ten characteristics of devops-flavoured sysadmin practices, as outlined by Dmitriy Samovskiy, including coding skills, protection of business revenues, a focus on stability and uptime, and distributed or hyper-distributed applications.
If your dev team is fixing servers, they are not programming. If they are not programming, they may not be making you money. Many developers needlessly use versions of software that are not part of the standard OS. Web developers often lack the experience to know the accepted standards. As a result, you may end up with “odd” deployment scenarios which can complicate management.
General resources on infrastructure and network architecture.
Using these tools should help you build more reliable and less noisy cron jobs, which makes your systems more reliable and your pager more quiet.
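One technique commonly used to make cron jobs less noisy is to wrap the job so that its output is emitted only when it fails; cron then sends mail only for real problems. The excerpt does not name the tools it refers to, so the following is a minimal sketch of that general idea rather than a description of any of them. The function name `run_quietly` is hypothetical.

```python
import subprocess
import sys


def run_quietly(command: list[str]) -> int:
    """Run a command, staying silent on success and reporting output on failure."""
    # Capture stdout and stderr together so the failure report keeps their order.
    result = subprocess.run(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
    )
    if result.returncode != 0:
        # Only a failing job produces output, so only failures generate cron mail.
        sys.stderr.write(f"{command[0]} exited with status {result.returncode}\n")
        sys.stderr.write(result.stdout)
    return result.returncode
```

A small script could call `sys.exit(run_quietly(sys.argv[1:]))` and be placed in front of the real command in the crontab entry.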
A “design pattern” is a little bit of knowledge that is worth sharing, and useful for repeating. For example, the design pattern of “a network” or “active-active redundancy” or “RAID-0” or “helpdesk”. Ever visit an IT shop where there was no way for people to get help? I have. So I explained the design pattern of a “helpdesk” and without having to re-invent the wheel, they were able to implement one.
When planning the storage for a system one thing you need to know above all other information is how that data is going to be accessed. I’m talking about 95% reads 5% writes, 1.2Gb/minute average transfer, highly latency sensitive. You can make some assumptions based on the applications that’ll be accessing the storage, but if you really need to know, the only way to find out is measuring.
Once you know how that data is going to be accessed, you can build or provision its storage accordingly. Knowing how likely the dataset is to grow is also something you need to know, but that’s a luxury we often don’t get. And for the love of performance metrics, don’t forget peak loading and behavior under fault conditions.
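The access-pattern figures quoted above can be turned into per-second read and write rates with some simple arithmetic. The sketch below assumes that “Gb” means gigabits and that 1 Gb = 1000 Mb; the excerpt does not say which unit is intended, so treat the conversion as an illustration only. The helper name `split_throughput` is hypothetical.

```python
def split_throughput(total_per_minute: float, read_fraction: float) -> tuple[float, float]:
    """Return (reads_per_second, writes_per_second) in the same unit as the input."""
    if not 0.0 <= read_fraction <= 1.0:
        raise ValueError("read_fraction must be between 0 and 1")
    per_second = total_per_minute / 60.0
    return per_second * read_fraction, per_second * (1.0 - read_fraction)


# Figures from the text: 1.2 Gb/minute (assumed to be 1200 Mb/minute), 95% reads.
reads, writes = split_throughput(1200.0, 0.95)
print(f"reads ~{reads:.1f} Mb/s, writes ~{writes:.1f} Mb/s")
```

These are averages only. As the text points out, peak load and behaviour under fault conditions still have to be measured.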
Cloud computing and cloud-based infrastructures.
Scalr is an auto-scaling solution for EC2, a bit like RightScale, but cheaper. They give you some nice infrastructure like a MySQL master/slave replication setup, and distributed cron jobs.
We switched to Amazon EBS a few months ago. Since then, we have been able to get a good night’s sleep without worrying about our database disappearing. Here’s a quick guide.
Automation is the underlying foundation of the revolution (some say disruption) in Information Technology that is currently happening and has been branded as cloud computing.
After you rebundle a running instance to create a new image, you can then run new EC2 instances of that image.
At OpenX we recently completed a large-scale deployment of one of our server farms to Amazon EC2. Here are some lessons learned from that experience.
Heroku is the instant ruby platform. Deploy any ruby app instantly with a simple and familiar git push. Take advantage of advanced features like HTTP caching, memcached, rack middleware, and instant scaling built into every app. Never think about hosting or servers again.
Many people know that Amazon Web Services is one of the big players in the cloud computing business, and that its Infrastructure as a Service offering, EC2, is becoming increasingly popular. Few people know that EC2 is probably one of the biggest Xen installations deployed. But how many know how EC2 actually works and how the underlying architecture is constructed?
Resilience, uptime and high-demand strategies.
Migrating Drupal sites without downtime using an ingenious combination of DNS switchover and proxying.
The idea is to allow only that much traffic through to your system which your system can handle successfully.
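One common way to admit only as much traffic as a system can handle is a token bucket. The linked article may use a different mechanism, so the following is a minimal sketch of the general idea, with a hypothetical class name and parameters.

```python
import time


class TokenBucket:
    """Admit at most `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum number of stored tokens
        self.clock = clock        # injectable so the bucket can be tested
        self.tokens = capacity    # start full, allowing an initial burst
        self.updated = clock()

    def allow(self) -> bool:
        """Return True if a request may proceed, consuming one token."""
        now = self.clock()
        # Refill in proportion to the elapsed time, without exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A front end would call `allow()` for each incoming request and reject or queue the request when it returns `False`, so that the backend never receives more than it can serve.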
Everyone knows that problems always seem to happen when you are asleep, on holiday or away from your computer.
That awesome load-balanced, redundant, no-single-point-of-failure stack you’ve built? Yeah, doesn’t do you much good when the lights go out. In my experience, the worst, most sustained downtime has always been caused by power issues.
Why Github uses ldirectord for high-demand load-balancing.
Preparing your infrastructure for unexpected traffic spikes and ‘Slashdotting’, with caching, proxying and denormalisation of data.
Tools and techniques for monitoring your systems.
Flapjack is a scalable and distributed monitoring system. It natively talks the Nagios plugin format, and can easily be scaled from 1 server to 1000. Flapjack aims to be simple to set up, configure, and maintain.
We’ve created a website monitor, in plain English, which genuinely checks the behaviour of a website, and which returns data in the form of a standard Nagios plugin. Suddenly the barrier to doing truly intelligent website monitoring has been reduced, very significantly.
This is a detailed howto article, but you can easily skip the parts that you don’t need, and it will get you up and running with an enviable Nagios Drupal Monitoring station.
Monitoring “is it down?” is reactionary. It is better than no monitoring at all, but all it tells you is that there is already a problem. Monitoring is better when it predicts the future and prevents problems.
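As an illustration of monitoring that predicts rather than reacts, a check can extrapolate a resource's growth and warn before it runs out. The sketch below fits a straight line to disk usage samples and estimates the time until the disk is full. The function name, the sample format, and the choice of a linear model are assumptions made for the example.

```python
def seconds_until_full(samples, capacity):
    """Estimate seconds until `capacity` is reached.

    `samples` is a list of (timestamp_seconds, used_bytes) pairs, oldest first.
    Returns None when there is too little data or usage is not growing.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None
    # Least-squares slope: growth in bytes per second.
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / var_t
    if slope <= 0:
        return None
    # Project forward from the most recent measurement.
    _, latest_used = samples[-1]
    return max(0.0, (capacity - latest_used) / slope)
```

A check built on this could alert when the estimate drops below a threshold such as 24 hours, well before an “is it down?” probe would notice anything.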
Non-functional requirements are essentially the often unspoken desires of a product owner or user about how a piece of software works, rather than what it does. We don’t really mean how the software itself behaves, as this can still be captured in behaviour-driven design and acceptance tests; rather we mean aspects of its behaviour over the lifecycle of the application.
When monitoring a Web site, you need to look at it both from a ‘micro’ perspective (i.e. are the individual devices and servers in your infrastructure running smoothly?) and from a ‘macro’ perspective (i.e. do your customers have a pleasant experience when accessing and using your site, and can they use your site’s functionality to conduct their business online?).
Performance tuning and high-volume techniques.
Masterzen explains how to offload the job of Puppet static file serving onto Nginx, a small and performant web server, and also how to have Nginx cache the files. He also gives a recipe for configuring Nginx to cache the compiled catalogs for each machine, reducing some of the compute load on the Puppet server.
Autobench is a simple Perl script for automating the process of benchmarking a web server (or for conducting a comparative test of two different web servers).
There are many ways to scale modern web applications. What I will be describing here is the method that we chose. This should by no means be considered the only way to scale an application. Consider it a case study of what worked for us given our unique requirements.
EXT3, EXT4, XFS, ReiserFS, and Btrfs benchmarked.
How to use vmstat, iostat, and top to understand what part of your system is the bottleneck.
Insight into the process of building scalable internet based services.
iperf is a great little tool for quickly measuring the network bandwidth between two machines.
Provisioning and building new systems.
When looking at bootstrapping an infrastructure, people opt either for taking or creating full images of an existing environment, or for “scripting” automated installations. Both have their advantages and disadvantages.
The core definition of “fully automated provisioning”: the ability to deploy, update, and repair your application infrastructure using only pre-defined automated procedures.
A checklist for security and performance tweaks you’ll want to apply when building a new Ubuntu box to host Drupal sites. Most of these will apply to CentOS as well, but the package names may be slightly different.
Virtual machines and VM hosting.
While a lot of effort is spent on automating the installation of the machine OS and its application, I see that the provisioning of a virtual machine is often still done through the GUI. So why not automate that step too?
Linux tuning information is scattered among many hundreds of sites, each with a little bit of knowledge. Virtual machine tuning information is equally scattered about. This is my attempt at indexing all of it.
While new technologies and delivery models have made it much simpler to manage the infrastructure, this is not where our core inefficiencies lie. Virtualization principles must be extended to higher levels of the application stack, to make it easier for all of us to manage, tune and integrate applications.
How to become a better sysadmin.
Version control and other development best practices which sysadmins should use.
For years, IT pros have heard that they must do more with less, as staffing is cut and outsourced, while demands to better serve the business and adopt new technologies continually increase. This is how it’s always been, but it doesn’t have to be how it always will be.
Let’s face it, System Administrators get no respect 364 days a year. This is the day that all fellow System Administrators across the globe, will be showered with expensive sports cars and large piles of cash in appreciation of their diligent work.
Many of my former Go students headed to Berkeley and Stanford, chose a computer science major, and became software engineers. At a minimum, Go will calm a person down and teach him/her to concentrate for a long period of time.
Lock-picking will teach you to be humble and not focus on what others can do better, but to take an honest inventory of your skills and your progress.
Hardening your systems, and deterring and detecting intruders.
Here are a few things you need to tweak in order to improve OpenSSH server security.
Tools, software and applications for sysadmins.
DNS Knife is a good tool for checking whether your DNS setup is OK. It checks the parent servers, whether your nameservers are listed on the parent server, and whether all your nameservers are reachable and authoritative, among other things.
A simple command-line tool to be able to grab real-time stats from memcache.
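The linked tool is not named here, but the underlying mechanism is simple: memcached's text protocol answers the `stats` command with lines of the form `STAT <name> <value>`, terminated by `END`. The following is a minimal sketch of such a client; the function names, host, and port defaults are assumptions for illustration.

```python
import socket


def parse_stats(raw: bytes) -> dict[str, str]:
    """Parse a memcached `stats` response into a name -> value mapping."""
    stats = {}
    for line in raw.decode("ascii", errors="replace").splitlines():
        if line == "END":
            break
        parts = line.split(" ", 2)
        if len(parts) == 3 and parts[0] == "STAT":
            stats[parts[1]] = parts[2]
    return stats


def fetch_stats(host: str = "127.0.0.1", port: int = 11211, timeout: float = 2.0) -> dict[str, str]:
    """Send `stats` to a memcached server and return the parsed response."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(b"stats\r\n")
        raw = b""
        # Read until the terminating END line or until the server closes.
        while not raw.endswith(b"END\r\n"):
            chunk = conn.recv(4096)
            if not chunk:
                break
            raw += chunk
    return parse_stats(raw)
```

Calling `fetch_stats()` in a loop and printing selected counters such as `curr_connections` or `get_hits` gives a rough real-time view.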
A large crowdsourced list of sysadmin diagnostic tools available on Linux.