Tests as Contracts

Author
Nathan Ahn
Tags
Engineering
Published
Mar 24, 2025
I’ve invested a lot of time and money into testing. More than any startup frankly should. There have been times when our testing budget made up the majority of our overall engineering spend, significantly eclipsing the cost of actually serving our users. This strange bet has been critical to ensuring that no matter how large our product has grown, we’ve always been able to chug along shipping new features.
Tests are a contract between your present self and your future self. Each test you write is a declaration that a given behavior must remain true for the rest of time (or, at least, until the contract is up for renegotiation). Your future self is going to sabotage you: they’re going to rewrite implementations, add new features, and deprecate various services. As long as that contract between your present and future self is in place, you can stay confident that your future self is only making changes for the better, rather than moving backwards.
None of this is to say that these contracts must be set in stone. Contracts can always be renegotiated! Business use-cases evolve, and if they don’t, then your product isn’t growing. If we as developers had enough foresight to know exactly where the business would evolve, we would just make those future decisions now. But alas, we’re not blessed with omniscience. Instead, we must rely on our future selves to make smart, explicit renegotiations. Even when a contract changes, the new behavior must still be written down explicitly, forming a new contract that solidifies that behavior until the next renegotiation. I don’t trust my future self not to accidentally break something, but I do have to trust them to renegotiate wisely, and as long as those rewrites are explicit, that trust is well placed.
What does this mean tangibly? When you think about a test as a contract, you can write that test once and trust that the feature or behavior will work for the rest of time. Almost exactly a year ago, I did a major rewrite to make authentication the responsibility of our GraphQL API. These authentication operations require a one-time password (OTP) exchange for phone number and email. If a user doesn’t receive their OTP, they can request a new one. Built into this are safeguards to ensure a user can’t abuse the system and request thousands of OTPs. How do I know this still works a year later? Tests! I’ve refactored various underlying components and upgraded our packages dozens of times since then, yet I have the exact same confidence now as before that this throttling behavior works exactly as intended. As much as I love dogfooding, there are many features like this that will never be used by the team. Hell, there are many features like this that won’t be used by real users frequently enough for issues to be caught. On top of that, we ship faster than human testing can naturally keep up. To maintain rapid shipping cadences without compromising functionality, we need to write and maintain these contracts.
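As a sketch of what that throttling contract can look like in code (the helpers `createTestAccount` and `requestOtp` are hypothetical stand-ins for internal test utilities, and the exact limits here are made up):

```ts
import { describe, expect, it } from "vitest";

// Hypothetical helpers; stand-ins for whatever your test infrastructure provides.
import { createTestAccount, requestOtp } from "./helpers";

describe("OTP request throttling", () => {
  it("rejects rapid-fire OTP requests after the limit", async () => {
    const account = await createTestAccount();

    // Assume the contract allows 3 requests in quick succession.
    for (let i = 0; i < 3; i++) {
      await expect(requestOtp(account.phoneNumber)).resolves.toMatchObject({
        ok: true,
      });
    }

    // The 4th request within the window must be throttled. If a future
    // refactor of the auth flow breaks this, CI fails and the contract
    // has to be explicitly renegotiated.
    await expect(requestOtp(account.phoneNumber)).rejects.toThrow(
      /too many requests/i
    );
  });
});
```

Once this is checked in, every future refactor of the auth flow has to either keep the behavior or consciously rewrite the test.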

Some Practical Tips

Only write contracts for the negotiables

Perhaps this is where the contract analogy falls apart, since we now live in a world in which every bottle of bleach comes attached with a contract warning against consumption. There are some things we need to be able to assume; otherwise we’d spend all day drafting contracts. If I had to wake up every morning and check whether gravity was working, I would go insane. Tests can assume trivial non-negotiables. Blue is always going to be blue. We don’t need to write a contract that agrees blue is going to be blue. Similarly, some behaviors are enforced by contracts beyond our tests. I don’t need to test whether DynamoDB access controls work because I have a contract with AWS for how their service works (ok, perhaps not an explicit contract, but somewhere in the terms of service we can probably sue if they start leaking our data through bad access controls). Large-scale service providers tend to be good about warning and deprecating when they change their known behaviors, especially when that would cause breaking changes.
If you can’t imagine a scenario in which a test could fail, it’s probably not the most helpful test. This intuition comes with experience in unexpected bugs. Since it’s so hard to predict what could go wrong, I’d always recommend erring on the safer side.

Write explicit contracts

Writing tests that are explicit ensures that any future renegotiation explicitly nullifies those behaviors. For example, let’s say I have a test that asserts a user should only be able to retrieve their own phone number and that other users cannot retrieve it. If I explicitly write that into a test, then the future version of me needs to explicitly remove that check, thereby stating “We now allow users to retrieve others’ phone numbers in x, y, and z scenarios”. This then forms a new explicit contract.
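Sketched as a test (again with hypothetical helpers; `getUserPhoneNumber` and its `viewer` parameter are illustrative, not our real API), that contract might look like this:

```ts
import { describe, expect, it } from "vitest";

// Hypothetical helpers standing in for real test infrastructure.
import { createTestAccount, getUserPhoneNumber } from "./helpers";

describe("phone number access control", () => {
  it("lets a user read their own phone number, and nobody else's", async () => {
    const alice = await createTestAccount();
    const bob = await createTestAccount();

    // Alice can read her own number.
    await expect(
      getUserPhoneNumber({ viewer: alice, userId: alice.id })
    ).resolves.toBe(alice.phoneNumber);

    // Bob cannot read Alice's number. Deleting this assertion is an
    // explicit renegotiation: "we now allow this in x, y, z scenarios".
    await expect(
      getUserPhoneNumber({ viewer: bob, userId: alice.id })
    ).rejects.toThrow(/forbidden|unauthorized/i);
  });
});
```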
This guideline of explicit contracts is exactly why I’m extremely wary of automated AI testing. There’s been a recent trend, with the advent of LLMs, around the idea of fully automated QA. I see this level of automagical AI as a spectrum of how explicit vs. implicit you’re willing to be. At the most explicit end of the spectrum sits a test file with a list of expects asserting correct variable values. These take time to write, which means we need to move up the spectrum in certain scenarios. I love AI screenshot testing, a special type of test that I haven’t seen popularized (except by Maestro a few weeks after me), which takes a screenshot of a webpage and asks an LLM if it matches a certain prompt (see the sketch below). This requires extremely strict prompting on undeniable truths. No matter which LLM we’re using, I can be pretty confident that a prompt asking “Does this website have a blue background?” will fail for a website with a pink background (though of course, some real-world testing and nuance is required; perhaps the next OpenAI model thinks the screenshot doesn’t look like a website at all and instead a mobile application).

On the most implicit end of the spectrum lie fully automated solutions. I think these are a VC trap. The magical solution sounds incredible, until you realize that your automagical solution decided it’s now ok for your website to display in Spanish for all users. My hunch is that the end destination of testing is AI-accelerated, developer-first test development. A developer pings an AI bot in their PR, which writes a suite of explicit tests in code and creates a branching PR for review that automatically runs the new tests against your code. These tests maintain all the speed and explicit contracts of regular testing while saving developer time writing contracts. Plus, a sprinkling of AI screenshot testing mixed in (I might be biased here). If you’re reading this in the future and have created/found a good solution here (assuming I haven’t built one yet), please, please, please email me at nathan@joinswsh.com.
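To make the AI screenshot testing idea concrete, here’s a minimal sketch using Playwright and the OpenAI SDK. The model name, the yes/no prompt protocol, and the example URL are my assumptions; any vision-capable model you trust should slot in:

```ts
import { test, expect } from "@playwright/test";
import OpenAI from "openai";

const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Ask the model a strict yes/no question about a screenshot.
async function screenshotMatches(png: Buffer, prompt: string): Promise<boolean> {
  const res = await client.chat.completions.create({
    model: "gpt-4o", // assumption: any vision-capable model works here
    messages: [
      {
        role: "user",
        content: [
          { type: "text", text: `${prompt} Answer with exactly "yes" or "no".` },
          {
            type: "image_url",
            image_url: { url: `data:image/png;base64,${png.toString("base64")}` },
          },
        ],
      },
    ],
  });
  return res.choices[0].message.content?.trim().toLowerCase() === "yes";
}

test("landing page has a blue background", async ({ page }) => {
  await page.goto("https://example.com"); // stand-in URL
  const png = await page.screenshot();

  // Prompt only on undeniable truths; anything fuzzier invites flakiness.
  expect(
    await screenshotMatches(png, "Does this website have a blue background?")
  ).toBe(true);
});
```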

Make testing the easiest solution for building

While I’m nowhere near an expert on developer buy-in (we’re just not yet at that scale of team), I’ve been able to build infrastructure and tooling to self-service my own buy-in. I’ve found that I tend to write tests when that’s the easiest route to writing a feature. Any developer writing a new feature needs to try the feature out to ensure their implementation works. For backend features especially, this is a pain in the ass. Nobody likes writing scripts, figuring out the authentication and syntax to curl your new endpoint, or building a mock website page with 3 unstyled HTML buttons just to make HTTP requests. You should build strong test infrastructure to ensure it’s easier to write a test than not to. For example, strong helper functions around test account setup. If I just need to call a helper function to create a new test account, then it’s easy for me to test with a fresh account environment every time (P.S. you totally should create new accounts for each test suite. Trying to share the same account is a recipe for disaster as you try to write future tests that don’t conflict with old ones). Add other easy helper functions for resource creation, and now I effectively have a reusable script which provisions all its required resources in a consistent manner and tests my new backend code. This is easier than writing a one-time-use throwaway script, and it also comes with the benefit of the long-term contract with my future self once it’s checked in for CI/CD.
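As a sketch of what a “strong helper function” can mean here (everything below is hypothetical, including the dev API endpoint; the point is that one call gets you a fresh, authenticated account):

```ts
// helpers.ts — hypothetical test infrastructure; adjust to your own auth stack.
import { randomUUID } from "node:crypto";

export interface TestAccount {
  id: string;
  phoneNumber: string;
  authToken: string;
}

// One call provisions a brand-new account against the dev environment,
// so every suite starts from a clean slate and never collides with
// state left behind by older tests.
export async function createTestAccount(): Promise<TestAccount> {
  const phoneNumber = `+1555${Math.floor(Math.random() * 1e7)
    .toString()
    .padStart(7, "0")}`;

  // Assumption: the dev API exposes a signup flow we can drive directly.
  const res = await fetch("https://api.dev.example.com/test/accounts", {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify({ phoneNumber, nonce: randomUUID() }),
  });
  const { id, authToken } = await res.json();
  return { id, phoneNumber, authToken };
}
```

With this in place, the “throwaway script” for trying out a new backend feature is just a test file that happens to stay useful forever.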
This unfortunately tends to be less true for frontend features. Automated testing just isn’t a good solution for testing interactivity or styling. Storybook (which we use!) is close, but I haven’t yet cracked the code for making Storybook development easier than local dev server development in most cases.
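For reference, a minimal Storybook story in CSF format, which pins down named component states without spinning up the full app (the `Button` component and its props here are placeholders):

```ts
// Button.stories.ts — a minimal CSF story; `Button` is a placeholder component.
import type { Meta, StoryObj } from "@storybook/react";
import { Button } from "./Button";

const meta: Meta<typeof Button> = {
  component: Button,
};
export default meta;

type Story = StoryObj<typeof Button>;

// Each story is a reproducible component state you can eyeball
// (or screenshot-test) without clicking through the real app.
export const Primary: Story = {
  args: { variant: "primary", children: "Join waitlist" },
};

export const Disabled: Story = {
  args: { variant: "primary", disabled: true, children: "Join waitlist" },
};
```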

Make testing cheap

If you’re ever concerned about writing a test due to the long-term costs of maintaining it, then you need to fix your testing infrastructure. Developers shouldn’t need to worry about costs when it comes to tests (within reason; ignoring load testing). Here are a few practical tips that have worked for us to bring our testing budget down significantly while growing the number of test cases we have:
  1. Don’t use GitHub Actions infrastructure
    1. GitHub-hosted infrastructure is just insanely expensive
    2. Instead, we use UbiCloud for trivial action steps and WarpBuild for integration and E2E tests (the majority of our costs)
  2. Self-host authentication
    1. This is in line with the suggestion to make test account creation trivial
    2. Within the past week of writing, we created over 34k test accounts on our dev environment alone. The costs here would be absurd if we were creating test accounts in Firebase or Supabase
  3. Be mindful of Mac runner costs
    1. Even at a modest scale of iOS testing / builds, it’s cheaper to just buy a Mac Mini and spend a few hours setting up self-hosting
  4. Plan for parallelism from the start
    1. Ensure your test suite units can run in parallel
    2. No matter how many tests you add, you should be able to allocate more compute to ensure CI/CD finishes in the same amount of time
    3. Plan for parallelism across multiple machines (see the sketch after this list)
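For that last parallelism point, here’s a minimal sketch of deterministically splitting test files across CI machines. The `SHARD_INDEX`/`SHARD_TOTAL` env vars and the vitest invocation are assumptions; your CI and runner will differ:

```ts
// shard-tests.ts — split test suites across N CI machines.
// SHARD_INDEX and SHARD_TOTAL are hypothetical env vars your CI sets per runner.
import { execSync } from "node:child_process";
import { globSync } from "glob";

const index = Number(process.env.SHARD_INDEX ?? 0);
const total = Number(process.env.SHARD_TOTAL ?? 1);

// Deterministic round-robin partition: every runner computes the same
// ordering, so each test file lands on exactly one machine.
const files = globSync("tests/**/*.test.ts").sort();
const mine = files.filter((_, i) => i % total === index);

if (mine.length > 0) {
  // Assumes a vitest-style runner that accepts file paths; swap in your own.
  execSync(`npx vitest run ${mine.join(" ")}`, { stdio: "inherit" });
}
```

Because each suite provisions its own accounts and resources, shards never fight over shared state, and adding machines keeps wall-clock time flat as the test count grows.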
On top of this, we have a significant amount of testing infrastructure dedicated to caching test results and ensuring that only the relevant tests run for each commit.
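One way to cache results (a sketch, not our exact setup; `resultStore` is a hypothetical KV store backed by S3, Redis, or similar): hash everything a suite depends on, and skip the suite when that exact hash already passed.

```ts
// Skip a suite when the content hash of its inputs matches a green run.
import { createHash } from "node:crypto";
import { readFileSync } from "node:fs";

function suiteCacheKey(dependencyFiles: string[]): string {
  const hash = createHash("sha256");
  for (const file of [...dependencyFiles].sort()) {
    hash.update(file);
    hash.update(readFileSync(file));
  }
  return hash.digest("hex");
}

async function shouldRunSuite(
  dependencyFiles: string[],
  resultStore: { has(key: string): Promise<boolean> } // hypothetical KV store
): Promise<boolean> {
  // If this exact input set already passed, the contract is still intact;
  // rerunning the suite buys nothing.
  return !(await resultStore.has(suiteCacheKey(dependencyFiles)));
}
```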

Notes

  • This entire blog post has little to do with contract testing, which is more relevant to microservices. We’re just not yet at the scale where we need contract testing.