claudia's blog

Niche Bug Report: Nondeterministic Test Ordering

Here’s a bug I ran into a while ago: we had a flaky integration test that would fail in our CI maybe once every 50 runs. The failure would clear on a re-run, but the full test suite in this repo takes around 30 minutes, so when it failed pre-release one day I decided to look into it.

Usually when our tests are flaky it’s a DB connection issue, which is fairly easy to explain. This test was interesting because it would fail on an assertion:

FAILED /test.py::test_1 - assert 8 == 4

A genuinely useful feature of LLMs these days is that you can paste in a large amount of log output and have it analyzed far faster than you could scan it by eye. I pasted a sample of failed and successful run logs into Claude and found that this test (let’s call it test_1) was only failing if it ran after (but not necessarily immediately after) another test (call it test_2). test_2 was inserting records into the same table that test_1 was using, but never cleaning up. When test_1 made an assertion on the number of rows in the table (assert len(df) == 4), it was also counting the rows that test_2 had inserted.
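To make the interaction concrete, here’s a stripped-down sketch of the two tests. The table name, schema, row counts, and the in-memory sqlite connection are all stand-ins I’ve made up (the real tests hit a shared integration database), but the shape is the same: test_1 asserts on a total row count while test_2 writes to the same table and never cleans up.

```python
import sqlite3

import pandas as pd
import pytest


@pytest.fixture(scope="session")
def conn():
    # A single shared connection stands in for the shared integration DB.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
    yield conn
    conn.close()


def test_1(conn):
    # Insert four rows, then assert on the table's *total* row count...
    conn.executemany("INSERT INTO orders (amount) VALUES (?)", [(1,), (2,), (3,), (4,)])
    df = pd.read_sql("SELECT * FROM orders", conn)
    # ...which only holds if nothing else has written to the same table.
    assert len(df) == 4


def test_2(conn):
    # Writes to the same table and never deletes its rows.
    conn.executemany("INSERT INTO orders (amount) VALUES (?)", [(5,), (6,), (7,), (8,)])
    assert conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0] >= 4
```

If test_2 happens to run before test_1 against the same database, test_1 counts eight rows instead of four, which lines up with the assert 8 == 4 from the CI log (assuming test_2 also wrote four rows; the real numbers could differ).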

This was easy enough to fix: just have test_2 use a different table, or clean up its records properly after each run.
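For the cleanup route, one minimal sketch is an autouse fixture in conftest.py that wipes the shared table after every test. This reuses the made-up conn fixture and orders table from the sketch above (and assumes conn also lives in conftest.py); the real fix just has to make sure test_2’s rows don’t outlive the test that wrote them.

```python
# conftest.py
import pytest


@pytest.fixture(autouse=True)
def clean_orders_table(conn):
    # Let the test run, then delete whatever it wrote, so no rows
    # leak into whichever test the scheduler happens to run next.
    yield
    conn.execute("DELETE FROM orders")
    conn.commit()
```

But what was causing the test order to be nondeterministic in the first place?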

It turns out we were using the pytest plugin pytest-xdist, which lets you run tests in parallel. It has a --dist flag that controls how tests are distributed between workers. We had this set to worksteal, which, from the docs:

Initially, tests are distributed evenly among all available workers. When a worker completes most of its assigned tests and doesn’t have enough tests to continue (currently, every worker needs at least two tests in its queue), an attempt is made to reassign (“steal”) a portion of tests from some other worker’s queue.
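For context, worksteal is just one of the scheduling modes the --dist flag accepts. Roughly how our setup looks, assuming the flag is wired up through addopts (it could just as easily be passed on the CI command line):

```ini
# pytest.ini (a sketch; the exact config file and worker count are assumptions)
[pytest]
addopts = -n auto --dist worksteal
```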

So in the ~2% of runs where this bug appeared, a worker was “stealing” test_2 while test_1 was still sitting in another worker’s queue. test_2 would then run and finish before test_1 started, which meant test_1 saw those extra db records.

The lesson, imo, is that bugs like this, with a high diagnostic cost but a low-effort fix, are great LLM candidates. This was a niche failure mode that would have taken much longer to identify without pasting the log history into an LLM. The logs were noisy enough that the pattern (the test-ordering correlation) wasn’t obvious to the human eye, but it was trivial for a model to spot across a batch of examples.