Or, How To Deal With Flaky Friends
Have you ever had that one friend? You know the one. You love ’em, but their carefree relationship with plans can really get on your nerves from time to time.
Let’s say you decide to organize a nice picnic gathering in the park weeks in advance. Upon texting the invitation, your friend responds vaguely, “That sounds like a great idea!”
Doubt creeps in. You’ve seen this response before.
On the day of the picnic, they suddenly find themselves taking care of their sister’s neighbor’s 4-month-old goldendoodle puppies and can’t possibly make it all the way across town in time to hang out. Honestly, who among us can compete with puppies?
Nobody likes being flaked on.
In the world of browser testing, a test is considered to be flaky when it can pass and fail across multiple retry attempts without any code changes. Flaky tests cause all sorts of friction in software development and deployment lifecycles.
The occasional flaky friend might be an unavoidable fact of life, but here at Benchling we set out to eliminate and inoculate ourselves against all kinds of flakiness in our browser testing suite.
This is the story of how Benchling implemented our own element selection interface on top of Cypress and systematically drove our flake rate down to essentially zero.
Benchling is a cloud platform for life sciences research. Our customers work on a vast range of life-improving biological and pharmaceutical products, and it’s our mission to make that research and development more effective and efficient. If Benchling as a whole or any one of our major features goes down, it can grind our customers’ workflows to a screeching halt.
Functional end-to-end (E2E) testing is a critical component of establishing and maintaining a reputation for quality and stability with our customers. It is also key to giving our developers the confidence to swiftly take on broad and impactful changes without fear of introducing regressions.
For Benchling to be successful we need to move fast and not break things.
Enforcing software quality requires a gatekeeper in the deploy process. Test flakes introduce friction to the entire software development cycle because a CI system is necessarily sensitive to every test failure.
If the flake rate is especially bad, it could take multiple retries for the build to pass. Friction in our deploy process doesn’t just slow down the development cycle, but can also delay the release of critical features or hot-fixes.
Before switching to Cypress, Benchling wrote browser tests in Selenium. Our flake rate was so high that we used to consider a test passing if it succeeded one out of three times. In other words, we tolerated up to a whopping 66% flake rate!
When we started piloting a migration of our Selenium tests to Cypress our flake rate dropped significantly, but was still hovering in the single digit percentages.
We put our Cypress rollout on hold to better understand why we were seeing flakes.
There’s a saying that goes,
“Trust takes years to build, seconds to break, and forever to repair”
We knew that if our tests were not consistently and overwhelmingly stable from the beginning, it could undermine confidence in the suite in an irreparable way.
Without confidence in our browser tests, failures get shrugged off as “probably just a flake,” retries become routine, and real regressions start slipping through, which erodes confidence in the suite even further.

This vicious cycle was one we were intent on avoiding.
At Benchling, we like to approach problems scientifically. We set out to disprove the running theory that browser test flakes were just a part of life by rigorously studying and addressing the underlying causes of flakiness.
Because of the way websites work, all browser-based testing more or less follows the same high level pattern of interactivity and verification.
Browser-based end-to-end tests are focused on functionality and behavior from the user’s perspective, so the mental model for writing a test resembles how users interface with the page.
We call this pattern the Select → Act → Wait → Verify Loop.
For us, this looks like:

1. Select: query the DOM for the element we care about.
2. Act: interact with it (click, type, drag, and so on).
3. Wait: give the application time to respond and re-render.
4. Verify: assert that the UI reached the expected state.

Cypress largely handles the Wait step for us by automatically retrying commands and assertions until they succeed or time out.
Rinse and repeat until a user story is covered and you have a test case.
So where do we begin?
As with any structurally sound engineering project, we’d want to start by surveying our surroundings and establishing a stable foundation to build upon.
The first thing to note here is that the Benchling client application is big. At the time of writing, we’re at 1.1 million lines of code and growing rapidly!
To facilitate modularity and good patterns for reuse, our browser test code would benefit from a logical structure that could scale alongside our application as it evolved in functionality and complexity over time.
The second thing to note is that we expected both quality engineers and application engineers to be writing E2E tests in our Cypress suite.
This meant that for our engineers, the closer the test code structure mirrored the client code, the less context switching would be required to write tests.
We decided to lean into an established pattern of organization and reuse: a hierarchy of interconnected page objects that mirrored our frontend component structure.
An early version of a Benchling page object looked something like this:
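The original snippet isn’t reproduced here, so the following is a minimal reconstruction rather than Benchling’s exact code. The `Chain` type is a stand-in for Cypress’s real command chain so the sketch runs outside a Cypress test; in the real suite, `getRoot` would return a `Cypress.Chainable`.

```typescript
// Stand-in for a Cypress command chain: it only records the selector path.
type Chain = {
  selector: string;
  find(child: string): Chain;
};

const makeChain = (selector: string): Chain => ({
  selector,
  find: (child) => makeChain(`${selector} ${child}`),
});

// Stand-in for Cypress's global `cy`.
const cy = { get: makeChain };

class Region {
  constructor(
    private readonly rootSelector: string,
    private readonly parent?: Region,
  ) {}

  // Returns a command chain scoped to this region's root element,
  // nesting under the parent region's root when one exists.
  getRoot(): Chain {
    return this.parent
      ? this.parent.getRoot().find(this.rootSelector)
      : cy.get(this.rootSelector);
  }
}
```

Each region knows only its own root selector and its parent, mirroring how the frontend components nest.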
Individual page objects would extend the Region base class. Let’s see how this plays out with an example.
For the sake of the following demonstration, imagine a hypothetical Benchling lab notebook. This notebook has a text area for the main content and also displays user activity as a simple stream of .log-entry elements off to the side.
Over the course of a test we might see the following entries appear in our notebook’s activity log:
We might represent this notebook component using the following page object:
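A hedged sketch of that page object. `assertText` and `typeContent` are assumed helper names rather than Benchling’s exact API, and the `Chain` stand-in again replaces Cypress’s command chain so the example is self-contained:

```typescript
// Stand-in for a Cypress command chain that records the selector path.
type Chain = {
  selector: string;
  find(child: string): Chain;
  last(): Chain;
  assertText(expected: string): void;
};

const makeChain = (selector: string): Chain => ({
  selector,
  find: (child) => makeChain(`${selector} ${child}`),
  last: () => makeChain(`${selector}:last`),
  assertText: () => {
    // In real Cypress this would be something like .should('have.text', expected).
  },
});

const cy = { get: makeChain };

class Region {
  constructor(private readonly rootSelector: string) {}
  getRoot(): Chain {
    return cy.get(this.rootSelector);
  }
}

class Notebook extends Region {
  constructor() {
    super('#notebook');
  }

  typeContent(_text: string): void {
    // In Cypress: this.getRoot().find('textarea').type(text);
  }

  // The flawed version: the selection is split across several chained
  // commands, so Cypress would only retry the final one.
  getLastLogEntry(): Chain {
    return this.getRoot().find('.log-entry').last();
  }
}
```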
The goal is for Cypress test files to be written in the “language of user interaction” and for page objects to encapsulate all of the messy details to make it happen.
Here’s what using the page object in a Cypress test would look like:
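A hypothetical spec along those lines. The stubs below are condensed stand-ins so the sketch runs outside Cypress, and the method names are assumptions rather than Benchling’s exact API:

```typescript
// Condensed stand-in for the Notebook page object described above.
const notebook = {
  typeContent(_text: string): void {
    // Act: type into the notebook body (a cy command in a real test).
  },
  getLastLogEntry() {
    return {
      selector: '#notebook .log-entry:last',
      assertText(_expected: string): void {
        // Verify: in Cypress, only this final assertion is retried. The
        // earlier .find('.log-entry') is not, which is where the flake hides.
      },
    };
  },
};

notebook.typeContent('Hello, Benchling!');      // Act
notebook.getLastLogEntry().assertText('Saved'); // Select + Verify, with no safe Wait
```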
What’s the problem here? The code seems harmless and intuitive enough.
We said earlier that Cypress handles the Wait step for us through automatic retries. There’s a crucial caveat to this property: only the last command is retried.
If you aren’t aware of this rule, or don’t respect its implications, it can become a major problem, because the DOM is inherently unstable.
In the code above we accidentally made two mistakes:

1. We split the selection into a chain of several commands, so Cypress will only retry the final one.
2. We asserted on text whose presence depends on those earlier, non-retried commands.

Because of the way this page works, when we look for '.log-entry' elements we’ll find some in the DOM without incident. But when we go on to call assertText, the 'Saved' text might not necessarily be one of those entries!
It depends on the state of the DOM at the moment the (N-2)th command, the .find('.log-entry'), succeeded. Depending on when that command ran relative to when the UI was updated, the last entry may be 'Began Editing' or it might be 'Saved'.
The failure mode plays out like this: if .find('.log-entry') resolves while the newest entry is still 'Began Editing', the final text assertion keeps retrying against that stale selection until it times out, because the .find itself is never re-run.
We can already see with just one page object how easy it is to fall into flaky traps. You could imagine that if we have an extensive hierarchy of page objects, the chain of commands becomes more and more difficult to track.
Indeed, this goes all the way to the root of the situation (ba-dum-tss 🙄).
Since we are using the root chain as the basis for composition and extension, it ends up being the furthest from the final Nth command in the chain. This means that the more we extend, the less likely it is that the root command will be retried if actually needed.
Hopefully we can find something workable from one of these techniques!
One way to solve the flake problem above would be to inject an assertion that guarantees that we don’t advance until conditions are right for our final assertion.
Command retry logic is relative to the position of assertions in the chain — in other words, we can define a “new” N in the middle of the chain.
By adding the assertion .should('have.length.greaterThan', 2) to our command chain, we can force Cypress to retry the .find('.log-entry') command until we have the correct list size of 3. Nice!
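A sketch of the guarded chain. The `cy` stand-in below only records selectors and assertions so the example runs outside Cypress; in real Cypress, the `.should()` forces the preceding query command to retry until the assertion passes:

```typescript
// Stand-in chain that records the selector path and any assertions applied.
type Chain = {
  selector: string;
  assertions: string[];
  find(child: string): Chain;
  should(...args: (string | number)[]): Chain;
  last(): Chain;
};

const makeChain = (selector: string, assertions: string[] = []): Chain => ({
  selector,
  assertions,
  find: (child) => makeChain(`${selector} ${child}`, assertions),
  should: (...args) => makeChain(selector, [...assertions, args.join(' ')]),
  last: () => makeChain(`${selector}:last`, assertions),
});

const cy = { get: (selector: string) => makeChain(selector) };

// Guarded version: Cypress retries .find('.log-entry') until there are 3
// entries, so by the time we take .last(), the 'Saved' entry has arrived.
function getLastLogEntry(): Chain {
  return cy
    .get('#notebook')
    .find('.log-entry')
    .should('have.length.greaterThan', 2)
    .last();
}
```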
The problem with this approach is that it’s not especially reusable. Every additional check in the chain constrains generality. For example, in a test that edits the notebook more than once, we wouldn’t be able to rely on this safety mechanism, since we’d expect to find a longer list of log entries.
With this technique it’s difficult to know where exactly to add your assertions or what they should assert. Regrettably, these concerns grow as our network of page objects grows.
This brings us to our second approach: merging queries.
For better or for worse, Cypress is fundamentally tied to jQuery. When we talk about querying for elements in Cypress, the engine that powers that selection query is jQuery at the end of the day.
The principle of query merging draws from the power of jQuery selector strings.
If we write our getLastLogEntry function as a single connected chain we get:
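A sketch of that chain alongside its query-merged equivalent (the `cy` stand-in only records the selector string so the example runs here; in real Cypress these calls return command chains):

```typescript
// Stand-in for a Cypress command chain that records the selector path.
type Chain = { selector: string; find(child: string): Chain; last(): Chain };
const makeChain = (selector: string): Chain => ({
  selector,
  find: (child) => makeChain(`${selector} ${child}`),
  last: () => makeChain(`${selector}:last`),
});
const cy = { get: makeChain };

// As a single connected chain: three commands, but only the last is retried.
function getLastLogEntry(): Chain {
  return cy.get('#notebook').find('.log-entry').last();
}

// Merged into one jQuery selector string: a single command, so Cypress
// retries the entire selection. (Note that :last is jQuery syntax, not
// standard CSS.)
function getLastLogEntryMerged(): Chain {
  return cy.get('#notebook .log-entry:last');
}
```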
Since Cypress is built on jQuery, we can map chains of Cypress selection commands into jQuery selector string syntax.
Now why would we want to do this?
Remember — Cypress only automatically retries the last command. If we collapse our chain of selection into a single selection command, Cypress will retry the entire selection for verification.
In this example, Cypress will continue to look at the last .log-entry element until it finds 'Saved' or we time out.
Great! That’s exactly what we want, right?
Unfortunately, we lost some important qualities in translation: encapsulation and reusability. The whole point of using the page object pattern was so that we can abstract away details like the location of the root element of a region.
It would be tedious to unravel every Cypress selection chain into its corresponding jQuery string. If the location of the element changes in the DOM, we’d need to identify every instance where a corresponding selector was used and change in lock-step to avoid regressions in behavior.
We have unintentionally violated the Single Responsibility Principle (SRP).
How do we get ourselves out of this predicament?
Well, it turns out strings are quite amenable to composition. Let’s start to leverage this property by storing a selector string in our page object classes instead of a Cypress chain.
Now our subclass can look something like this:
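A reconstruction of that string-storing version (the real code likely differed in detail, and `getLastLogEntrySelector` is an assumed name):

```typescript
// Regions now store a plain selector string instead of a Cypress chain.
class Region {
  constructor(protected readonly rootSelector: string) {}
}

class Notebook extends Region {
  constructor() {
    super('#notebook');
  }

  // Returns a plain jQuery selector string. A spec passes it to a single
  // cy.get(...) call, so Cypress retries the entire selection on failure.
  getLastLogEntrySelector(): string {
    return `${this.rootSelector} .log-entry:last`;
  }
}
```

A spec would then call something like `cy.get(new Notebook().getLastLogEntrySelector())`, which fixes the retry problem, but stitching strings together by hand is exactly the clunkiness described next.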
We’ve started to walk back our SRP violation, but the interface is clunky.
The trick is to chain Selectors instead of chaining Cypress commands. This is why Benchling built our own Selector class that allows us to chain together pieces of selection strings.
Here’s what it looks like in practice:
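A minimal sketch of such a Selector class, supporting only a small subset of jQuery’s syntax (the real class is richer, and the method names here are assumptions). It has the two properties described below: chain calls return an immutable copy, and instances always convert to selector strings:

```typescript
class Selector {
  private constructor(private readonly parts: readonly string[]) {}

  static css(selector: string): Selector {
    return new Selector([selector]);
  }

  // Each chain call returns a new Selector; the receiver is never mutated.
  find(child: string): Selector {
    return new Selector([...this.parts, child]);
  }

  last(): Selector {
    const parts = [...this.parts];
    parts[parts.length - 1] += ':last';
    return new Selector(parts);
  }

  // A Selector is always convertible to a jQuery selector string.
  toString(): string {
    return this.parts.join(' ');
  }
}

// In practice, a page object chains Selectors instead of Cypress commands:
const notebookRoot = Selector.css('#notebook');
const lastLogEntry = notebookRoot.find('.log-entry').last();
// String(lastLogEntry) === '#notebook .log-entry:last'
```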
We sprinkled in a pinch of syntactic sugar by extending Cypress with a .grab command that takes in a Selector object and does the final .toString() conversion for us.
Finally, we disallow returning Cypress.Chainable in page objects with a lint rule to avoid the distribution of command chains that got us into trouble earlier.
Instead, a page object method may return a Selector object, a Region object, or return void by terminating with an interaction or assertion.
The end result looks something like this:
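A condensed, self-contained sketch of the whole pattern. The `Selector` class here is trimmed down, and `cy.grab` is stubbed to return the final selector string so the example runs outside Cypress; in a real test, `cy.grab(selector)` performs a single retried query:

```typescript
// Trimmed-down Selector: immutable chaining, always convertible to a string.
class Selector {
  constructor(private readonly path: string) {}
  find(child: string): Selector {
    return new Selector(`${this.path} ${child}`);
  }
  last(): Selector {
    return new Selector(`${this.path}:last`);
  }
  toString(): string {
    return this.path;
  }
}

class Region {
  constructor(protected readonly rootSelector: Selector) {}
}

class Notebook extends Region {
  constructor() {
    super(new Selector('#notebook'));
  }
  // Page objects return Selectors (or Regions, or void) — never command chains.
  getLastLogEntry(): Selector {
    return this.rootSelector.find('.log-entry').last();
  }
}

// Stand-in for the custom cy.grab command, which does the final
// .toString() conversion for us.
const cy = {
  grab(selector: Selector): string {
    return selector.toString();
  },
};

// In a spec:
const grabbed = cy.grab(new Notebook().getLastLogEntry());
// grabbed === '#notebook .log-entry:last'; a real spec would continue with
// an assertion, e.g. cy.grab(...).should('have.text', 'Saved')
```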
Benchling’s Selector class is the glue that binds our Cypress mirror world together. Practically any valid jQuery selector string can be expressed as an equivalent Selector chain.
Whenever we make a chain call on a Selector instance we return an immutable copy with the modifications made. Selector instances are always convertible to string primitives.
This allows us to safely store and chain selectors however and wherever we like. We can build them up, compose with them, and pass them around without needing to worry about the nuances of Cypress’s retry behavior until we call cy.grab.
Benchling’s Region class, which is the base class for all page object classes, is a place to store a rootSelector and related functionality.
A page object may:

- return a Selector object,
- return a Region object, or
- return void, terminating with an interaction or assertion.

A page object may not:

- return a Cypress.Chainable or otherwise expose a raw Cypress command chain.
Following these rules and enforcing them with linters allows us to scale our test suite along with the growing size and complexity of our client application.
There’s one last piece of the Benchling Selection puzzle that we haven’t mentioned yet: Data Test Attributes.
Selectors based on CSS classes are not a stable way to select elements in the DOM. The reason is once again our old friend SRP.
CSS classes are concerned with the style of the element — we expect them to change when visual requirements change.
However, we shouldn’t need to update our Cypress selection when the color of a button changes from green to blue, so long as the behavior of the button is the same as before.
Injecting data-testid attributes onto elements is a common solution to the above problem. We decided to take data-testid a few steps further, and the result is Data Test Attributes (DTA).
DataTestAttrs is a TypeScript interface that describes an object which may contain component, element, index, and key fields.
Here’s what they look like in client code:
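A sketch of what the client-side helper might look like. The exact attribute names and API are assumptions; the only grounded detail is that getTestProps translates the object into data-test-X attributes for each of the four fields:

```typescript
// The four DTA fields, all optional.
interface DataTestAttrs {
  component?: string;
  element?: string;
  index?: number;
  key?: string;
}

// Translates a DataTestAttrs object into data-test-X attributes that
// appear in the DOM, where X is one of the four fields above.
function getTestProps(attrs: DataTestAttrs): Record<string, string> {
  const props: Record<string, string> = {};
  for (const [field, value] of Object.entries(attrs)) {
    if (value !== undefined) props[`data-test-${field}`] = String(value);
  }
  return props;
}

// In a React component, the result is spread onto the element:
// <button {...getTestProps({ component: 'Notebook', element: 'SaveButton' })}>Save</button>
```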
And in Cypress:
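On the Cypress side, a hypothetical helper (the name and signature are assumptions) would build a jQuery attribute selector from the same fields, ready to drop into a Selector chain:

```typescript
// Builds a jQuery attribute selector like
// [data-test-component="Notebook"][data-test-element="SaveButton"]
// from the same DTA fields used in client code.
function testAttrSelector(attrs: {
  component?: string;
  element?: string;
  index?: number;
  key?: string;
}): string {
  return Object.entries(attrs)
    .filter(([, value]) => value !== undefined)
    .map(([field, value]) => `[data-test-${field}="${value}"]`)
    .join('');
}

// e.g. cy.grab(rootSelector.find(testAttrSelector({ component: 'Notebook', element: 'SaveButton' })))
```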
We made helpers for injecting as React props and Cypress selection using our Selector class. The getTestProps function translates our object into a number of data-test-X attributes that appear in the DOM, where X is one of the four fields above.
We think of DTA as markers or tags that help us consistently find elements in the DOM. The net result of using DTA is that Selector chains are ultimately shorter, simpler, and more stable as our application evolves over time.
Using these new tools and best practices we systematically migrated the bulk of our Selenium tests and modernized our initial Cypress pilot tests. Our flake rate dropped steadily even as we continued authoring new tests to fill out our functional coverage.
At Benchling we also built a Buildkite pipeline we call Flake Finder that we regularly run to check the overall health of our suite of browser tests. Every night the tool runs each test 200 times and reports the results.
In the last 3 months our overall flake rate has hovered between 0.000% and 0.003%. We regularly see many nightly runs in a row pass with zero failures. If a new flake arises, a developer is notified to investigate and correct it.
Before introducing new test cases to our suite, test authors run a Flake Finder build on just that test as part of the verification process. Ensuring we aren’t introducing new flaky tests protects overall stability and maintains the quality of the test suite over time.
When flakes do arise, they are more often environmental or network-related than DOM-related. For example, they might involve a Docker or Buildkite configuration setting that needs tuning rather than the test code itself.
A highly reliable browser test suite gives our developers the confidence to ship quickly as we continue to grow in size and complexity.
Because functional tests are written in the “language of user behavior”, we can leverage Cypress test results to establish a chain of trust with our customers for critical workflows.
We have equipped ourselves to preserve and extend these benefits with every new test we write.
If you found this sort of problem engaging, there are plenty more to tackle as Benchling continues on our journey to revolutionize the life sciences industry.
If you found the patterns and solutions useful to your own browser testing journey, we’re considering open sourcing some of our tooling.
Reach out to us at [email protected] with thoughts and feedback so we can gauge interest!