Slow, flaky, and failing
If you find yourself working on a project with quite a few broken windows, it’s all too easy to slip into the mindset of “All the rest of this code is crap, I’ll just follow suit.”
—David Thomas & Andrew Hunt, “The Pragmatic Programmer: Your Journey to Mastery”
It’s one minute to ship time, and you hit “push” on the very last commit. There it goes: the build is running. Every second counts now, and you watch the test output with increasing impatience. Why do the tests take so darned long?
And then, to your horror, the first red lights start to appear. “But these were passing before, and I haven’t touched that code!” you wail. It’s no good: your co-workers are already giving you the stink eye for breaking the build and holding them up.
You’re not going to ship today, for one simple reason: your tests are slow, flaky, and failing. So what the hell?
Flaky tests
Flaky tests sometimes fail, sometimes pass, regardless of whether the system is correct. There are many reasons for flaky tests, so let’s look at a couple of them, with some possible solutions.
Timing issues can be a source of flakiness, as you probably know from experience. In particular, fixed sleeps in tests are a bad idea (see the next section for more about these). Eliminate these wherever possible and replace them with code that only waits as long as strictly necessary.
When you need to test timing itself, use the shortest possible interval. For example, don’t test a timer function with a one-second duration when one millisecond would work just as well.
In some tests, as crazy as it sounds, the time of day can affect the test. One way to eliminate this cause of flakiness is by turning time into data, and if necessary injecting a fake Now() function to return a canned time of day.
Flakiness can also sometimes arise from ordering issues. Some data structures in Go are inherently unordered: maps, for example. Comparing these needs special care.
For example, iterating over a map comparing its elements is not good enough: the iteration order of maps is unspecified in Go. Instead, we can use the maps.Equal function to compare maps regardless of iteration order:
func TestMapsHaveSameContents(t *testing.T) {
	t.Parallel()
	want := map[int]bool{1: true, 2: false}
	got := map[int]bool{2: false, 1: true}
	if !maps.Equal(want, got) {
		t.Errorf("want %v, got %v", want, got)
	}
}
// pass
On the other hand, slices are inherently ordered, and so slices.Equal requires this:
func TestSlicesHaveSameElementsInSameOrder(t *testing.T) {
	t.Parallel()
	want := []int{1, 2, 3}
	got := []int{3, 2, 1}
	if !slices.Equal(want, got) {
		t.Errorf("want %v, got %v", want, got)
	}
}
// fail:
// want [1 2 3], got [3 2 1]
But sometimes we don’t actually care about the order. Maybe we get these results from some concurrent computations, and we don’t know what order they will show up in. We just want to know that we have the right results.
To compare two slices for equal elements, then, regardless of order, we can use slices.Sort to sort them both before the comparison:
func TestSlicesHaveSameElementsInAnyOrder(t *testing.T) {
	t.Parallel()
	want := []int{1, 2, 3}
	got := []int{3, 2, 1}
	slices.Sort(want)
	slices.Sort(got)
	if !slices.Equal(want, got) {
		t.Errorf("want %v, got %v", want, got)
	}
}
// pass
Whatever the cause of a flaky test suite, it’s a serious problem. Left untreated, it will continuously erode value from the tests, until eventually they become useless and ignored by all. It should be a red flag to hear something like “Oh yeah, that test just fails sometimes.”
As soon as you hear that, you know that the test has become useless. Delete it, if the flakiness really can’t be fixed. Thou shalt not suffer a flaky test to live. As soon as it starts flaking, it stops being a useful source of feedback, and bad tests are worse than no tests.
A brittle test is not the same thing as a flaky test: a brittle test fails when you change something unrelated, whereas a flaky test fails when it feels like it. Fixing brittle tests is usually a matter of decoupling entangled components, or simply reducing the scope (and thus sharpening the focus) of the test.
On the other hand, flaky tests can require some time and effort to find the underlying cause and address it. Only do this if the test is really worth it; if not, just delete it.
Failing tests
What if some tests aren’t just flaky, but fail all the time, because bugs aren’t being fixed? This is a very dangerous situation, and without prompt action the tests will rapidly become completely useless.
Why? Because if tests are allowed to fail for a while without being fixed, people soon stop trusting them, or indeed paying any attention to them: “Oh yeah, that test always fails.”
We can never have any failing tests, just as we can never have any bugs:
Don’t fix bugs later; fix them now.
—Steve Maguire, “Writing Solid Code: Development Philosophies for Writing Bug-Free Programs”
As soon as any test starts failing, fixing it should be everyone’s top priority. No one is allowed to deploy any code change that’s not about fixing this bug. Once you let one failing test slip through the net, all the other tests become worthless.
This so-called zero-defects methodology sounds radical, but it really isn’t. After all, what’s the alternative?
The very first version of Microsoft Word for Windows was considered a “death march” project. Managers were so insistent on keeping to the schedule that programmers simply rushed through the coding process, writing extremely bad code, because bug-fixing was not a part of the formal schedule.
Indeed, the schedule became merely a checklist of features waiting to be turned into bugs. In the post-mortem, this was referred to as “infinite defects methodology”.
—Joel Spolsky, “The Joel Test: 12 Steps to Better Code”
Fixing bugs now is cheaper, quicker, and makes more business sense than fixing them later. The product should be ready to ship at all times, without bugs.
If you already have a large backlog of bugs, or failing tests, but the company’s still in business, then maybe those bugs aren’t really that critical after all. The best way out may be to declare voluntary bug bankruptcy: just close all old bugs, or delete all failing tests. Bugs that people do care about will pretty soon be re-opened.
Slow tests
My book The Power of Go: Tests is all about how to write meaningful tests: not just box-ticking exercises to satisfy some bureaucratic manager, but tests that really add value to the code, and make your work easier and more enjoyable.
Even the world’s greatest test suite does us no good, though, if it takes too long to run. How long is too long? Well, if we’re running tests every few minutes, clearly even a few minutes is too long. We simply won’t run the tests often enough to get the fast feedback we need from them.
By running the test suite frequently, at least several times a day, you’re able to detect bugs soon after they are introduced, so you can just look in the recent changes, which makes it much easier to find them.
—Martin Fowler, “Self-Testing Code”
One way or the other, then, we don’t want to be more than about five minutes away from passing tests. So, again, how long is too long for a test suite to run?
Kent Beck suggests that ten minutes is a psychologically significant length of time:
The equivalent of 9.8 m/s² is the ten-minute test suite. Suites that take longer than ten minutes inevitably get trimmed, or the application tuned up, so the suite takes ten minutes again.
—Kent Beck, “Test-Driven Development by Example”
We may perhaps call this psychological limit the Beck time. Beyond the ten-minute mark, the problem is so obvious to everybody that people are willing to put effort into speeding up the test suite. Below that time, people will probably grumble but put up with it.
That certainly doesn’t mean that a ten-minute test suite is okay: it’s not, for the reasons we’ve discussed. Let’s look at a few simple ways to reduce the overall run-time of the test suite to something more manageable.
Parallel tests. The inability to run certain tests in parallel is usually a design smell. Refactor so that each test has its own world, touches no global state, and can thus run in parallel. Adding parallelism to a suite that doesn’t have it should speed it up by about an order of magnitude.
Eliminate unnecessary I/O. Once you go off the chip, things get slow. Do everything on the chip as far as possible, avoiding I/O operations such as network calls or accessing disk files. For example, you could use an fstest.MapFS as an in-memory filesystem, and memory-backed io.Readers and io.Writers instead of real files.
No remote network calls. Instead of calling some remote API, call a local fake instead. Local networking happens right in the kernel, and while it’s still not fast, it’s a lot faster than actually going out onto the wire.
Share fixtures between tests. Any time you have some expensive fixture setup to do, such as loading data into a database, try to share its cost between as many tests as possible, so that they can all use it. If necessary, do the setup in a single test and then run a bunch of subtests against it.
However, we need to be careful that the tests don’t then become flaky as a result of too much fixture sharing. A flaky test is worse than a slow test.
No fixed sleeps. A test that can’t proceed until some concurrent operation has completed should use the “wait for success” pattern (loop and retry, with a tiny delay, until the operation has completed). This minimises wasted time, whereas a long fixed sleep maximises it (or causes flaky tests, which is also bad).
Throw hardware at the problem. When you’ve made the test suite as fast as it can go and it’s still slow, just run it on a faster computer. If the tests are mostly CPU-bound, rent a 256-core cloud machine and have it pull and run the tests on demand. CPU time costs a lot less than programmer time, especially since hiring cheap programmers is a false economy.
Run slow tests nightly. This is a last resort, but it might come to that. If you have a few tests that simply can’t be speeded up any more, and they’re dragging down the rest of the suite, extract them to a separate “slow test” suite, and run it on a schedule. Every night, perhaps; certainly no less frequently than that. Even nightly isn’t great, but it’s better than not running tests at all.