Bitfield Consulting

What “uptime” even means anymore

"What is uptime?" sounds like a silly question. Isn't it obvious? Either your website or service is up, or it's not.

The reality isn't so clear-cut, according to Uptime.com's resident SRE expert, John Arundel of Bitfield Consulting. In his book Cloud Native DevOps with Kubernetes, he observes:

We’re used to measuring the resilience and availability of our applications in uptime, usually measured as a percentage. For example, an application with 99% uptime was unavailable for no more than 1% of the relevant time period. 99.9% uptime, referred to as 'three nines', translates to about nine hours downtime a year. The more nines, the better, you might think. But this misses an important point: Nines don't matter if users aren't happy.
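To make the arithmetic in that passage concrete, here is a minimal Python sketch that converts an uptime percentage into a yearly downtime allowance. The function name `downtime_hours_per_year` and the 365-day year are assumptions made for illustration; they are not taken from the book.

```python
# Hours in a non-leap year (an assumption; leap years add 24 hours).
HOURS_PER_YEAR = 365 * 24


def downtime_hours_per_year(uptime_percent: float) -> float:
    """Return the yearly downtime, in hours, that an uptime percentage allows."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    return HOURS_PER_YEAR * (100 - uptime_percent) / 100


# Compare two, three, and four nines.
for pct in (99.0, 99.9, 99.99):
    hours = downtime_hours_per_year(pct)
    print(f"{pct}% uptime allows about {hours:.2f} hours of downtime a year")
```

For 'three nines' this works out to about 8.76 hours, which matches the "about nine hours" in the quote. Note that the number says nothing about whether users were happy during the other 99.9% of the year.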

If your site is flat-out unavailable (it never loads, or you get a browser error), that's clearly a downtime situation, Arundel points out. But there are many ways a service can make users unhappy, even if it's nominally 'up'.

Your site could be slow to load, driving impatient users to your competitors. Or the homepage might be speedy, but it takes users 20 seconds to log in (Amazon, we're looking at you). Searches might not be returning any results. Items might appear and disappear from users' shopping carts. Payments might not go through properly, or, worse, customers might be charged twice.

You get the point. 'Up' means happy users, not numbers in a spreadsheet. But what gets measured gets maximized, so you'd better be careful what you measure. If you're focused on traditional 'site uptime' numbers, you could be missing serious problems that are making customers angry. Best case, they'll call you out on social media. Worst case, they'll silently leave and never come back.

The answer, according to Arundel, is to 'uptime' your monitoring game. "More sophisticated ways of instrumenting your site, such as synthetic monitoring and RUM (Real User Monitoring), get you closer to measuring customer happiness. Synthetic monitoring checks simulate what customers really do on your site: logging in, searching for products, reading reviews, adding items to the cart, even making payments. By contrast, RUM metrics show you what real customers are doing on your site right now, and what kind of experience they're having."
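The idea behind a synthetic check can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not how Uptime.com or any particular tool implements it: the names `run_journey` and `StepResult` are hypothetical, and the three steps are stand-in functions where a real check would drive a browser or make HTTP requests against your site.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class StepResult:
    name: str
    ok: bool
    seconds: float
    detail: str = ""


def run_journey(steps: List[Tuple[str, Callable[[], None]]],
                max_seconds: float) -> List[StepResult]:
    """Run each step of a simulated user journey in order, timing every step.

    A step fails if it raises an exception or takes longer than max_seconds.
    The journey stops at the first exception, because later steps usually
    depend on earlier ones; a slow step is recorded as failed but the
    journey continues.
    """
    results = []
    for name, action in steps:
        start = time.perf_counter()
        try:
            action()
        except Exception as exc:
            elapsed = time.perf_counter() - start
            results.append(StepResult(name, False, elapsed, f"error: {exc}"))
            break
        elapsed = time.perf_counter() - start
        if elapsed > max_seconds:
            results.append(StepResult(name, False, elapsed, "too slow"))
        else:
            results.append(StepResult(name, True, elapsed))
    return results


# Hypothetical stand-ins for real user actions.
def log_in() -> None:
    pass  # pretend the login succeeded


def search_products() -> None:
    found = []  # pretend the search came back empty
    if not found:
        raise RuntimeError("search returned no results")


def add_to_cart() -> None:
    pass  # never reached in this demo, because the search step fails


results = run_journey(
    [("log in", log_in), ("search", search_products), ("add to cart", add_to_cart)],
    max_seconds=2.0,
)
for r in results:
    status = "OK" if r.ok else "FAIL"
    print(f"{status:4} {r.name} ({r.seconds:.3f}s) {r.detail}")
```

In this demo the site would count as 'up' by a simple ping, yet the journey fails at the search step, which is exactly the kind of problem a plain uptime number hides.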

Distributed systems such as cloud native applications are never one hundred percent 'up'; they always exist in a state of partially degraded service. The cloud is dark and full of terrors; to navigate it successfully, Arundel recommends you find out what makes your customers happy, and use modern, user-centric monitoring tools to make sure you keep them that way.