Undertesting and overtesting
It is tempting to test that the code does what it should do and leave it at that, but it’s arguably even more important to test that it doesn’t do what it shouldn’t do.
—Mike Bland, “Goto Fail, Heartbleed, and Unit Testing Culture”
What is well-tested code? We tend to assume it’s something to do with test coverage: that is, if a test exists for each of your functions, they must be well-tested, right?
Right?
Not necessarily. Even when a certain piece of code is nominally covered by some test, that doesn’t mean it’s really tested In other words, just because a test causes some function to be executed, that doesn’t necessarily prove that the function does the right thing.
In Testing legacy code, we talked about what to do if your codebase has no tests at all, or not enough tests. Today, let’s look at the tests you do have, and what still might be wrong with them.
Optimistic tests
An optimistic test is designed only to confirm that the system works, not to prove that it doesn’t.
Surprisingly, one problem that optimistic tests can miss is when the system does nothing at all.
For example, consider a test for some function
user.Create
, like this:
func TestCreateCreatesGivenUser(t *testing.T) {
.Parallel()
t.Create("Alice")
userif !user.Exists("Alice") {
.Error("Alice not created")
t}
}
At first glance, this is plausible. We create a test user, Alice, and check if she subsequently exists. If not, the test fails. That part is fine. So, can you see what’s missing? If not, you might like to take a minute to think about it before reading on.
It’s always worth asking of any test whether it rigorously checks its preconditions as well as its postconditions. The developer’s focus, naturally enough, tends to be on the state of the world after the operation under test, but that can result in some risky assumptions about its prior state.
Suppose Create
actually does nothing. How could this
test pass? Only when Alice already exists. And would we know if that was
the case?
Our test checks that Alice exists after calling
Create
, but what’s missing is a check that she doesn’t
exist beforehand.
In other words, suppose we don’t clean up the database after each
test run, so if Alice was ever created, she’ll still be there.
And suppose someone later introduces a bug into Create
that
prevents it from actually creating users.
To put it another way, here’s an obviously incorrect implementation that nevertheless passes this test:
type User struct {
string
Name }
var (
= new(sync.Mutex)
m = map[string]*User{
users "Alice": {
: "Alice",
Name},
}
)
func Create(name string) {}
func Exists(name string) bool {
.Lock()
mdefer m.Unlock()
, ok := users[name]
_return ok
}
We thought we were testing Create
, but we really aren’t,
because Create
does nothing at all, yet the test doesn’t
detect that. Alice always exists, so the test always passes.
This kind of mirage test is especially dangerous, because it
looks like you have a test, but you don’t.
It turns out that Create
needs not just to leave the
world in a state where Alice exists. What’s important about
Create
is that it changes the world from a state
where Alice doesn’t exist to one where she does.
You might think that goes without saying, but we’ve just proved that it doesn’t. We need to pay attention to preconditions as well as postconditions, according to the contract that the system under test is supposed to fulfil.
Let’s write a test that would catch this bug, then:
func TestCreateCreatesGivenUser(t *testing.T) {
.Parallel()
tif user.Exists("Alice") {
.Fatal("Alice unexpectedly exists")
t}
.Create("Alice")
userif !user.Exists("Alice") {
.Error("Alice not created")
t}
}
The difference is very simple, but important: we check our preconditions.
What about this test, then?
func TestDeleteDeletesGivenUser(t *testing.T) {
.Parallel()
t.Create("Alice")
user.Delete("Alice")
userif user.Exists("Alice") {
.Error("Alice still exists after delete")
t}
}
Again, this looks reasonable on a cursory inspection. It creates
Alice, deletes her (sorry, Alice), and then ensures that she no longer
exists. What could be wrong with Delete
that this test
wouldn’t catch?
Well, what if both Create
and
Delete
do nothing at all? That seems like a pretty major
bug, yet this test doesn’t detect it. There are no preconditions, so the
outcome of the test is the same whether Create
and
Delete
actually have any effect or not. The test isn’t
wrong, as far as it goes: it just doesn’t go far enough.
There’s a pretty big loophole in it.
This kind of bug isn’t as unlikely as you might think, either. I’ve
made this exact mistake in the past: I stubbed out Create
and Delete
methods with placeholders, then forgot that I
hadn’t finished them, because the test was passing. It’s easy to do.
What we’re missing here, in fact, is another precondition: that the user does exist before we try to delete them.
func TestDeleteDeletesGivenUser(t *testing.T) {
.Parallel()
t.Create("Alice")
userif !user.Exists("Alice") {
.Error("Alice not created")
t}
.Delete("Alice")
userif user.Exists("Alice") {
.Error("Alice still exists after delete")
t}
}
If Create
doesn’t do anything, this test will fail at
the first check. If Delete
doesn’t do anything, it’ll fail
at the second.
In any non-trivial codebase, you’re pretty much guaranteed to find at least a few tests that are optimistically feeble in this way. Look for any test that doesn’t properly establish its preconditions, and fix it. This will add a lot of value to the test suite overall.
Another example of this kind of problem is when the test fails to
check some important but implicit postconditions. For example,
in TestDelete
, the explicit postcondition here is that
Alice shouldn’t exist after deletion, so what are we implicitly missing?
What else could a reasonable person ask for from a Delete
function?
As usual, a productive way to answer that is to think about possible
bugs in Delete
. Suppose, for example, that
Delete
mistakenly deletes not only Alice, but all
users in the database. That kind of thing is surprisingly easy to do,
especially with SQL queries (omitting a WHERE
clause, for
example).
If calling Delete
on a single user instead nukes the
whole database, that’s a pretty major bug, wouldn’t you say? This test
doesn’t detect it, because it focuses only on what should
happen, and ignores what shouldn’t.
How could we detect such a bug, then? Quite easily, it turns out.
Here’s what we do. We create two users in the test, but delete only one of them. Then we check that the one we deleted doesn’t exist, and the one we didn’t delete still exists:
func TestDeleteDeletesGivenUserOnly(t *testing.T) {
.Parallel()
t.Create("Alice")
userif !user.Exists("Alice") {
.Error("Alice not created")
t}
.Create("Bob")
userif !user.Exists("Bob") {
.Error("Bob not created")
t}
.Delete("Alice")
userif user.Exists("Alice") {
.Error("Alice still exists after delete")
t}
if !user.Exists("Bob") {
.Error("Bob was unexpectedly deleted")
t}
}
This test has accumulated a bit of paperwork, so let’s refactor that out into a helper function:
func TestDeleteDeletesGivenUserOnly(t *testing.T) {
.Parallel()
t(t, "Alice")
createUserOrFail(t, "Bob")
createUserOrFail.Delete("Alice")
userif user.Exists("Alice") {
.Error("Alice still exists after delete")
t}
if !user.Exists("Bob") {
.Error("Bob was unexpectedly deleted")
t}
}
func createUserOrFail(t *testing.T, name string) {
.Helper()
t.Create(name)
userif !user.Exists(name) {
.Errorf("%s not created", name)
t}
}
Who would have thought that there were so much scope for things to go
wrong with a seemingly simple Delete
function? Well,
you would, or at least you will now.
If you find yourself, as a result, becoming thoroughly sceptical about the idea that anything works the way it’s supposed to, congratulations: you’re thinking like a tester.
Persnickety tests
Sometimes, though not often, people can take testing a bit too much to heart, and test more than strictly necessary. As we’ve seen, it’s much easier to err the other way, and leave out important things such as preconditions and implicit postconditions, like not deleting all the users in the database. But overtesting does afflict some test suites.
It’s important to keep tests focused on only the part of the system they’re supposed to care about, and on only the behaviour that matters. They should avoid checking for irrelevant things.
Beware also of simply comparing too much. Sometimes it can be convenient to compare a function’s entire result against an expected value, rather than individually checking each of its fields, but it’s not always a good idea.
Comparing the whole struct makes sense when all the fields are affected by the behaviour under test. But when only some fields are important to this test, checking irrelevant fields makes the test brittle, and obscures its real purpose. The same applies to checking entire strings or output files, when only certain parts of the data are actually important.
The easiest way to avoid brittle tests is to check only the properties you care about. Be selective in your assertions. Don’t check for exact string matches, for example, but look for relevant substrings that will remain unchanged as the program evolves.
—Alan Donovan & Brian Kernighan, “The Go Programming Language”
Watch out for tests that lazily compare output against a golden file, for example, when the behaviour they’re testing is only about a small subset of that file. Similarly, a test should not assert the exact value of an error, but only that there is some error, when there’s supposed to be.
Some exposure to the idea of property-based testing, as commonly used in fuzz testing, can also be helpful for constructing robust tests. For example, what’s important about a result is often not its exact value, but some property of the value, especially an invariant property.
I recently reviewed a program that needed to create a “fingerprint” of a piece of data, for deduplication purposes. In case the same data was submitted to the system later, the fingerprint would enable the system to recognise it without actually having to store all the data, which could be very large.
A cryptographic digest, or hash value, is an obvious way to do this, so the program had a test something like this:
func TestHashReturnsMD5HashOfGivenData(t *testing.T) {
.Parallel()
t:= []byte("These pretzels are making me thirsty.")
data := md5.Sum(data)
want := fingerprint.Hash(data)
got if want != got {
.Errorf("want %v, got %v", want, got)
t}
}
The implementation of Hash
doesn’t matter, but let’s
assume it’s something like this:
func Hash(data []byte) [md5.Size]byte {
return md5.Sum(data)
}
Fine. But MD5 is insecure, so I suggested using a SHA-256 hash instead:
func Hash(data []byte) [sha256.Size]byte {
return sha256.Sum256(data)
}
This broke the test, which makes no sense, because Hash
works. So what’s the problem?
Well, what are we really testing here? Not that Hash
produces an MD5 hash, specifically; that’s incidental. What
matters is that the same data should always hash to the same value,
whatever that value actually is. And, no less importantly, that
different data should hash to different values.
So we ended up with a test something like this instead:
func TestHashGivesSameUniqueHashForSameData(t *testing.T) {
.Parallel()
t:= []byte("These pretzels are making me thirsty.")
data := fingerprint.Hash(data)
orig := fingerprint.Hash(data)
same := fingerprint.Hash([]byte("Hello, Newman"))
different if orig != same {
.Error("same data produced different hash")
t}
if orig == different {
.Error("different data produced same hash")
t}
}
This test is better, because it doesn’t care about the implementation
details of Hash
, such as which algorithm is used. What it
cares about is that the same input to Hash
always gives the
same result, so it catches bugs like an unstable hash algorithm or even
a random result.
And, because a maliciously lazy implementation of Hash
might simply always return the same fixed value, the test also
requires that different data hashes to different values.
This isn’t completely bug-proof, of course. It’s possible, for
example, that Hash
always produces the same value unless
the input is exactly “Hello, Newman”. We could use fuzz testing to
tackle this, or just a table test with a bunch of different inputs.
But you get the point. We made the test both less brittle and less feeble without adding much extra code; all it took was a little extra thinking.
In the next and final post in this series, we’ll look at speed and reliability of tests. If the test suite as a whole is too slow, it simply won’t be run, so it’s no use. On the other hand, if the tests are unreliable, and often fail even when the code is correct, they’re also useless in a different way. Let’s talk about how to fix that situation next time.