General resources on infrastructure and network architecture.
More from Agile Testing’s Grig Gheorghiu on setting up Opscode Chef, creating your own cookbook, modifying an existing cookbook, creating a role and adding a client machine to that role.
Excellent introduction to Puppet for cloud management by Jeff Wallace, featuring EC2 and Rackspace Cloud integration, Puppet classes, using Ruby logic in ERB templates, and inheriting node definitions to create identical configurations.
Using these tools should help you build more reliable and less noisy cron jobs, which makes your systems more reliable and your pager quieter.
A “design pattern” is a little bit of knowledge that is worth sharing, and useful for repeating. Examples include the design patterns of “a network”, “active-active redundancy”, “RAID-0” and “helpdesk”. Ever visit an IT shop where there was no way for people to get help? I have. So I explained the design pattern of a “helpdesk” and, without having to re-invent the wheel, they were able to implement one.
When planning the storage for a system, the one thing you need to know above all else is how that data is going to be accessed. I’m talking about 95% reads 5% writes, 1.2Gb/minute average transfer, highly latency sensitive. You can make some assumptions based on the applications that’ll be accessing the storage, but if you really need to know, the only way to find out is to measure.
Once you know how that data is going to be accessed, you can build or provision its storage accordingly. How likely the dataset is to grow is also something you need to know, but that’s a luxury we often don’t get. And for the love of performance metrics, don’t forget peak loading and behavior under fault conditions.
Twitter recently switched from an Apache/Mongrel web infrastructure to one based on Unicorn, a high-performance load balancing application server, and Stormcloud, a Unicorn monitoring and management system. Ben Sandofsky and John Adams from Twitter Engineering explain the benefits.
Julian Simpson (aka @builddoctor) is a well-known authority on software build and deployment issues, including configuration management, continuous integration, continuous deployment, and testing. His blog covers all of these topics, featuring talks, interviews, links, and patterns and practices.
Poul-Henning Kamp is the author of Varnish, an open-source HTTP accelerator used by Facebook, Slashdot, Wikia and many other sites to speed up slow Web servers. In this article for ACM Queue he describes how he applied his experience as a lead kernel developer on FreeBSD to developing a new, ultra-high-performance algorithm for cache expiry, and discusses how to optimise traditional algorithms for virtual-memory execution environments.
“Wouldn’t it be nice if we could abstract some of the low-level details of different socket types, connection handling, framing, or even routing? This is exactly where the ZeroMQ (ØMQ/ZMQ) networking library comes in: ‘it gives you sockets that carry whole messages across various transports like inproc, IPC, TCP, and multicast; you can connect sockets N-to-N with patterns like fanout, pubsub, task distribution, and request-reply’.” Ilya Grigorik explains the architecture of ZeroMQ with some code examples and diagrams.
Cloud computing and cloud-based infrastructures.
Scalr is an auto-scaling solution for EC2, a bit like RightScale, but cheaper. They give you some nice infrastructure like a MySQL master/slave replication setup, and distributed cron jobs.
We’ve been using Amazon EBS for a few months now, and we have been able to get a good night’s sleep without worrying about our database disappearing. Here’s a quick guide.
Automation is the underlying foundation of the revolution (some say disruption) in Information Technology that is currently happening and has been branded as cloud computing.
After you rebundle a running instance to create a new image, you can then run new EC2 instances of that image.
At OpenX we recently completed a large-scale deployment of one of our server farms to Amazon EC2. Here are some lessons learned from that experience.
Heroku is the instant Ruby platform. Deploy any Ruby app instantly with a simple and familiar git push. Take advantage of advanced features like HTTP caching, memcached, Rack middleware, and instant scaling built into every app. Never think about hosting or servers again.
Many people know that Amazon Web Services is one of the big players in the cloud computing business, and its Infrastructure as a Service offering, EC2, is becoming increasingly popular. Few people know that EC2 is probably one of the biggest Xen installations deployed. But how many know how EC2 actually works and how the underlying architecture is constructed?
We have been doing various experiments in our EC2 web serving cluster to serve maximum traffic at minimum cost. I thought our experience would be useful to many other people using EC2.
Resilience, uptime and high-demand strategies.
Migrating Drupal sites without downtime using an ingenious combination of DNS switchover and proxying.
A review of the various HA solutions available for MySQL.
The idea is to let through only as much traffic as your system can handle successfully.
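One common way to cap admitted traffic is a token bucket. The sketch below is a minimal illustration and not the linked article's implementation; the `TokenBucket` class, its `rate` and `capacity` parameters, and the injectable `clock` are all assumptions made for this example.

```python
import time


class TokenBucket:
    """Admit roughly `rate` requests per second, with bursts up to `capacity`.

    Hypothetical helper for illustration; names and parameters are assumptions.
    """

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = float(rate)          # tokens added per second
        self.capacity = float(capacity)  # maximum tokens the bucket can hold
        self.clock = clock               # injectable so the logic can be tested
        self.tokens = float(capacity)    # start with a full bucket
        self.last = clock()

    def allow(self):
        """Return True if the request may proceed, False if it should be shed."""
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A front end would call `allow()` once per incoming request and return an error page or a "try again later" response whenever it yields `False`, so that the back end only ever sees load it can serve.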
Everyone knows that problems always seem to happen when you are asleep, on holiday or away from your computer.
That awesome load-balanced, redundant, no-single-point-of-failure stack you’ve built? Yeah, doesn’t do you much good when the lights go out. In my experience, the worst, most sustained downtime has always been caused by power issues.
Why GitHub uses ldirectord for high-demand load-balancing.
Preparing your infrastructure for unexpected traffic spikes and ‘Slashdotting’, with caching, proxying and denormalisation of data.
Tools and techniques for monitoring your systems.
Flapjack is a scalable and distributed monitoring system. It natively talks the Nagios plugin format, aims to be simple to set up, configure, and maintain, and can easily be scaled from 1 server to 1000.
We’ve created a website monitor, in plain English, which genuinely checks the behaviour of a website, and which returns data in the form of a standard Nagios plugin. Suddenly the barrier to doing truly intelligent website monitoring has been lowered very significantly.
This is a detailed howto article, but you can easily skip the parts that you don’t need, and it will get you up and running with an enviable Nagios Drupal Monitoring station.
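For readers unfamiliar with the Nagios plugin convention mentioned here, a plugin reports its result through an exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) plus one line of output, optionally followed by performance data after a `|`. The sketch below shows only that result-building step; it is not the linked tool, and the `evaluate` function, the `WEBSITE` label, and the thresholds are assumptions for illustration.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3  # UNKNOWN is reserved for plugin errors
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate(status, body, expected_text, elapsed, warn_secs=2.0, crit_secs=5.0):
    """Turn one page fetch into a (exit_code, output_line) pair.

    Hypothetical helper: the caller is assumed to have fetched the page and
    measured `elapsed` (seconds) itself.
    """
    if status != 200:
        code, detail = CRITICAL, f"unexpected HTTP status {status}"
    elif expected_text not in body:
        code, detail = CRITICAL, f"expected text {expected_text!r} not found"
    elif elapsed >= crit_secs:
        code, detail = CRITICAL, f"page took {elapsed:.2f}s"
    elif elapsed >= warn_secs:
        code, detail = WARNING, f"page took {elapsed:.2f}s"
    else:
        code, detail = OK, f"page served in {elapsed:.2f}s"
    # Performance data: label=value[unit];warn;crit
    perfdata = f"time={elapsed:.3f}s;{warn_secs};{crit_secs}"
    return code, f"WEBSITE {LABELS[code]} - {detail} | {perfdata}"
```

A real plugin would print the returned line and exit with the returned code, which is all Nagios needs to schedule alerts and graph the timing.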
Monitoring “is it down?” is reactive. It is better than no monitoring at all, but all it tells you is that there is already a problem. Monitoring is better when it predicts the future and prevents problems.
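As a small illustration of predictive monitoring, the sketch below fits a straight line to recent disk-usage samples and estimates how long until the disk fills. It is a deliberately naive example and not taken from the linked article; the `hours_until_full` function and its sample format are assumptions.

```python
def hours_until_full(samples, capacity):
    """Estimate hours from the last sample until usage reaches `capacity`.

    `samples` is a list of (hour, used) pairs in the same unit as `capacity`.
    Returns None when there is too little data or usage is not growing.
    Hypothetical helper using an ordinary least-squares line fit.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None  # all samples taken at the same instant
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / var_t
    if slope <= 0:
        return None  # flat or shrinking usage: no predicted fill time
    intercept = mean_u - slope * mean_t
    t_full = (capacity - intercept) / slope
    return max(0.0, t_full - samples[-1][0])
```

Alerting when this estimate drops below, say, a few days gives someone time to act before the "is it down?" check ever fires.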
Non-functional requirements are essentially the often unspoken desires of a product owner or user about how a piece of software works, rather than what it does. We don’t really mean how the software itself behaves, as this can still be captured in behaviour-driven design and acceptance tests; rather we mean aspects of its behaviour over the lifecycle of the application.
When monitoring a Web site, you need to look at it both from a ‘micro’ perspective (i.e. are the individual devices and servers in your infrastructure running smoothly?) and from a ‘macro’ perspective (i.e. do your customers have a pleasant experience when accessing and using your site, and can they use your site’s functionality to conduct their business online?).
Performance tuning and high-volume techniques.
Masterzen explains how to offload the job of Puppet static file serving onto Nginx, a small and performant web server, and also how to have Nginx cache the files. He also gives a recipe for configuring Nginx to cache the compiled catalogs for each machine, reducing some of the compute load on the Puppet server.
Autobench is a simple Perl script for automating the process of benchmarking a web server (or for conducting a comparative test of two different web servers).
There are many ways to scale modern web applications. What I will be describing here is the method that we chose. This should by no means be considered the only way to scale an application. Consider it a case study of what worked for us given our unique requirements.
EXT3, EXT4, XFS, ReiserFS, and Btrfs benchmarked.
How to use vmstat, iostat, and top to understand what part of your system is the bottleneck.
pigz, which stands for parallel implementation of gzip, is a fully functional replacement for gzip that exploits multiple processors and multiple cores to the hilt when compressing data. pigz was written by Mark Adler, and uses the zlib and pthread libraries. John Allspaw notes that “When you’ve got massive files, this can be a pretty big advantage, especially when you’ve got lots of cores sitting around.”
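The general idea of compressing chunks in parallel can be sketched in a few lines. This is not how pigz itself is implemented (pigz is written in C and its approach is more refined); the example simply relies on the fact that concatenated gzip members form a valid gzip stream. The `parallel_gzip` function, chunk size, and worker count are assumptions for illustration.

```python
import gzip
from concurrent.futures import ThreadPoolExecutor


def parallel_gzip(data, chunk_size=1 << 20, workers=4):
    """Compress `data` (bytes) as a series of independently gzipped chunks.

    Hypothetical helper. Each chunk becomes its own gzip member, so the
    result is slightly larger than a single-stream compression would be.
    """
    # Split into fixed-size chunks; keep one empty chunk for empty input.
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)] or [b""]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order, so members are joined in sequence.
        members = list(pool.map(gzip.compress, chunks))
    return b"".join(members)
```

The output can be read back with `gzip.decompress` or the ordinary `gunzip` command. How much speed-up threads give in practice depends on the Python version and chunk size, so treat this as a demonstration of the technique rather than a replacement for pigz.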
Insight into the process of building scalable Internet-based services.
iperf is a great little tool for quickly measuring the network bandwidth between two machines.
Provisioning and building new systems.
When looking at bootstrapping an infrastructure, people opt either for taking or creating full images of an existing environment, or for “scripting” automated installations. Both have their advantages and disadvantages.
The core definition of “fully automated provisioning”: the ability to deploy, update, and repair your application infrastructure using only pre-defined automated procedures.
A checklist for security and performance tweaks you’ll want to apply when building a new Ubuntu box to host Drupal sites. Most of these will apply to CentOS as well, but the package names may be slightly different.
Agile Testing’s Grig Gheorghiu provides a simple, step-by-step tutorial for installing Opscode Chef and getting a Chef client to talk to a Chef server.
Virtual machines and VM hosting.
While a lot of effort is spent on automating the installation of the machine OS and its applications, I see that the provisioning of a virtual machine is often still done through the GUI. So why not automate that step too?
Linux tuning information is scattered among many hundreds of sites, each with a little bit of knowledge. Virtual machine tuning information is equally scattered about. This is my attempt at indexing all of it.
While new technologies and delivery models have made it much simpler to manage the infrastructure, this is not where our core inefficiencies lie. Virtualization principles must be extended to higher levels of the application stack, to make it easier for all of us to manage, tune and integrate applications.
Vagrant is a tool for building and distributing virtualized development environments, providing easy to configure, lightweight, reproducible, and portable virtual machines targeted at development environments. Vagrant includes automated virtual machine creation using Oracle’s VirtualBox, automated provisioning of virtual environments using Opscode Chef, port forwarding to the host machine, full SSH access to created environments, shared folders, packaging environments into distributable boxes, and easy teardown and rebuild of your environments.