Serverless is amazing. Companies offer you databases, queues, buses, and compute services that scale up and down without you having to lift a finger. These fully-managed services let you compose resilient, complex, event-driven architectures on demand. For those new to this kind of architecture, however, testing them can be a challenge. When tests don't account for Eventual Consistency, they can become flaky/non-deterministic and this can ruin your test suite. This article looks at some of the ways I've done testing wrong in the hope that you can start out in the right direction.
Eventual Consistency (EC)
Self-scaling data stores (like AWS DynamoDB) offer amazing scalability and availability, but they achieve those attributes by sacrificing consistency.
Eventual consistency is a theoretical guarantee that, provided no new updates to an entity are made, all reads of the entity will eventually return that last updated value.[1]
This "eventualness" is true whether DynamoDB is in the middle of copying your data to all its partitions or whether an event bus (e.g., AWS SNS or AWS EventBridge) is in the midst of propagating your messages. Most humans don't notice the delay, but your automated tests will fall right into the EC void. What to do?
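To make the "EC void" concrete, here is a toy in-memory store (all names and timings invented for illustration) where a read issued immediately after a write may still return the old value, but reads converge on the last write once propagation finishes:

```javascript
// Minimal sketch of eventual consistency from the caller's point of view.
// A written value only becomes readable after a propagation delay.
class EventuallyConsistentStore {
  constructor(propagationMs) {
    this.propagationMs = propagationMs;
    this.visible = new Map(); // what a read returns right now
  }
  put(key, value) {
    // The write is accepted immediately, but only becomes visible later.
    setTimeout(() => this.visible.set(key, value), this.propagationMs);
  }
  get(key) {
    return this.visible.get(key); // may be stale or undefined
  }
}

const store = new EventuallyConsistentStore(100);
store.put('answer', 42);
console.log(store.get('answer')); // likely undefined: not propagated yet
setTimeout(() => console.log(store.get('answer')), 200); // 42, once propagated
```

A human clicking through a UI never notices a 100 ms window like this; an automated test almost always does.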
Bad Approach #1
The first time I ran into eventual consistency issues with my tests was when I was newly out of college and the best[2] coder on the company's Quality Assurance team. The approach I used was, wait for it... Actually, that's it. I waited for it.
test('it should return the right number, eventually', async () => {
  sut.generateTheAnswer();
  // This will take, what, a minute or so?
  await sleep(60000);
  const result = fetchAnswer();
  expect(result).toEqual(42);
});
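Note that `sleep` is not built into JavaScript; the test above assumes a small helper, something like this minimal promise-based version:

```javascript
// Resolve after the given number of milliseconds; must be awaited
// inside an async function to actually pause execution.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Usage inside an async test body:
// await sleep(60000);
```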
First, we have to acknowledge the positives - this can work. But I think the trade-offs are not worth it. Let's look at the downsides:
The test will always take 60 seconds, even if the answer is generated in 5.
If the test occasionally takes longer than 60 seconds, you will have to bump up the sleep time to compensate.
If done naively (like I did), everything blocks while this test is waiting, increasing the overall time of your automated test suite.
Rapid feedback is essential to productive, iterative development. While the “waiting” approach can work, we can do better.
Bad Approach #2
When I first started working with DynamoDB, I wrote tests like this:
test('The repository can get the record', async () => {
  // ARRANGE
  const myRecord = generateTestRecord();
  await dynamoDbClient.Put(myRecord); // pseudocode!
  // ACT
  const result = await repository.get(myRecord.id);
  // ASSERT
  expect(result).toEqual(myRecord);
});
And, quite often, they worked. But as my test suite grew, and more and more tests like this executed on every build, things started to break down. Why?
On a write, DynamoDB will write the data to the node you are talking to, and then propagate that write to a majority of its nodes. In the simplest case, think of a three-node cluster: DynamoDB doesn't report "success" to you until the data is written to two of those nodes. What happens when you then try to read that record from DynamoDB[3]?
Continuing our example from above, on a read, DynamoDB will route you to one of its three nodes. So, if you are fast enough and unlucky enough, you may be reading from the node that - as of the moment you call it - does not have the data you seek. This kind of error will drive you nuts as, very often, a "re-run" of the test will pass. This gives a clue to the next approach we can try.
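Worth noting: for this specific put-then-get case, DynamoDB lets you opt out of eventual consistency. `GetItem` accepts a `ConsistentRead` flag, at the cost of double the read capacity and with the limitation that it doesn't work on global secondary indexes. A sketch of the request shape (the table name and key are made up for illustration):

```javascript
// Request parameters for a strongly consistent GetItem call.
// 'my-test-table' and the key value are assumptions, not real resources.
const params = {
  TableName: 'my-test-table',
  Key: { id: 'record-123' },
  ConsistentRead: true, // ask DynamoDB to reflect all acknowledged writes
};

// Passed to the SDK's GetItem/get operation, e.g.:
// const { Item } = await docClient.get(params).promise();
```

That solves reads against the base table, but it does nothing for the event-bus case, so the retry techniques below are still worth having.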
Bad Approach #3
"Well," you might say at this point, "if I can re-run the test and it works, can't I re-run the test preemptively?" As in approach #1, this often yields good results, for a while. I call this the "Naive Retry."
The idea is, on test failure, to re-run the test so that your test-runner doesn't count the failure. Let's put a wild-guess percentage on the problem outlined above and say "That test fails only 2% of the time". That means a few retries will really stack the deck in my favor.
const retry = require('async-retry');
test('The repository can get the record', async () => {
  await retry(async () => {
    // ARRANGE
    const myRecord = generateTestRecord();
    await dynamoDbClient.Put(myRecord); // pseudocode!
    // ACT
    const result = await repository.get(myRecord.id);
    // ASSERT
    expect(result).toEqual(myRecord);
  }, { retries: 3 });
});
That's it! Should this test fail, it will re-run itself up to 3 times. On success, it will cease the retry loop and exit. For the types of problems I mentioned above (DynamoDB put/get shard-propagation), this test appears to work just fine. Mostly.
You see, even with a 98% chance of success (times four!), there is still a window for failing. With enough of these types of tests, you will see build failures creep in that never seem to be in the same place twice.
Further, for other situations (like our event bus), the odds start to change. Let's look at the difference and I'll try to not get too bogged down in numbers. (Caveat - all these statistics are made up, but they are close enough to portray the real problems I want to solve.)
Let's say DynamoDB is always consistent within a second. It is sometimes still mid-propagation at around 100ms, but it always gets there quickly. Also, assume that EventBridge is consistent within a few seconds, and it always takes at least a second to propagate a message through. Given this, you can see why our naive retry may work for DynamoDB, but it will consistently fail for EventBridge. Now what?
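To see why, here is a toy simulation (every name and timing here is invented) of a store that becomes readable only 1500 ms after a write. The naive approach re-runs the whole test, so each attempt re-writes and then immediately re-reads, restarting the propagation clock every time; retrying only the read lets elapsed time accumulate past the delay:

```javascript
const sleep = (ms) => new Promise((r) => setTimeout(r, ms));

// Toy store: a written value becomes visible 1500 ms after the write.
function makeStore(propagationMs = 1500) {
  let value;
  let visibleAt = Infinity;
  return {
    put(v) { value = v; visibleAt = Date.now() + propagationMs; },
    get() { return Date.now() >= visibleAt ? value : undefined; },
  };
}

// Naive retry: the write sits inside the retried block, so each attempt
// resets the propagation clock and reads ~0 ms after its own put.
async function naiveRetry(store, attempts = 4) {
  for (let i = 0; i < attempts; i++) {
    store.put(42);
    if (store.get() === 42) return true; // always stale at this point
    await sleep(200);
  }
  return false;
}

// Validation-only retry: write once, then retry just the read, so the
// time since the write keeps growing until propagation completes.
async function validationRetry(store, attempts = 8) {
  store.put(42);
  for (let i = 0; i < attempts; i++) {
    if (store.get() === 42) return true;
    await sleep(300);
  }
  return false;
}
```

With these numbers, `naiveRetry` never succeeds no matter how many attempts you give it, while `validationRetry` passes once roughly 1.5 seconds have elapsed.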
Successful Retries
The first thing we are going to do is divide our test into two parts: the "initiation" and the "validation". We want to retry only the validation section. This allows us to gradually increase the elapsed time between the initial event and the state we wish to test. Taking the Repository test from above, let's divide it into sections.
test('The repository can get the record', async () => {
  // ARRANGE
  const myRecord = generateTestRecord();
  await dynamoDbClient.Put(myRecord); // pseudocode!
  // ----------------------
  // ∧∧ INITIATION above ∧∧
  // ∨∨ VALIDATION below ∨∨
  // ----------------------
  // ACT
  const result = await repository.get(myRecord.id);
  // ASSERT
  expect(result).toEqual(myRecord);
});
Given that division, let's look at what a better retry implementation looks like.
const retry = require('async-retry');
test('The repository can get the record', async () => {
  // ARRANGE
  const myRecord = generateTestRecord();
  await dynamoDbClient.Put(myRecord); // pseudocode!
  await retry(async () => {
    // ACT
    const result = await repository.get(myRecord.id);
    // ASSERT
    expect(result).toEqual(myRecord);
  }, { retries: 3 });
});
Notice that we now retry only the fetch-and-check section of the test - the validation part. If the record is not yet there, the initiation step stays untouched and we simply wait a little longer before checking for it again.
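For slower propagation like EventBridge, a fixed number of back-to-back retries may still not span the delay. async-retry supports exponential backoff via its options object; the option names below are the library's real ones, but the specific numbers are assumptions you would tune for your own system:

```javascript
// Backoff configuration for async-retry's second argument.
const retryOptions = {
  retries: 5,       // up to 5 retries after the first attempt
  minTimeout: 500,  // first wait: 500 ms
  factor: 2,        // double the wait each time: 500, 1000, 2000, ...
  maxTimeout: 8000, // cap any single wait at 8 seconds
  randomize: true,  // add jitter so parallel tests don't retry in lockstep
};

// Usage, retrying only the validation section as above:
// await retry(async () => { /* fetch and assert */ }, retryOptions);
```

The early retries keep fast systems (DynamoDB) fast, while the later, longer waits give slow systems (EventBridge) the seconds they need.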
Wrap-Up
The hardest part of testing eventually consistent systems is not the tests themselves - they are fairly straightforward to write with good libraries like async-retry - but the recognition of eventual consistency in the first place. I've hinted at how both DynamoDB and EventBridge are eventually consistent; review your system architecture and get ahead of your testing failures.
Do your developers know what parts of your system are or are not immediately consistent? How do they find out? As a team-lead/architect, it likely falls on you to start the education process. Once you have identified the systems that need special handling, make sure your tests are solid and deterministic.
References
Eradicating Non-Determinism in Tests by Martin Fowler
Yubl's road to Serverless - Part 2, Testing and CI/CD by Yan Cui
12 Important Lessons from The DynamoDB Book by Jeremy Daly
The DynamoDB Book by Alex DeBrie
CAP Theorem in Wikipedia
[2] I was the only coder on the QA team. Take that for what it's worth.