
Fix Flaky Tests: 2026 Masterclass

Prasandeep

8 min read · Flaky Tests

A flaky test is nondeterministic with respect to inputs you believe you control: same commit, same test binary, different outcome across runs. That is not “CI being moody”—it is almost always hidden variables (time, ordering, shared mutable state, environment, or unmocked externals) interacting with assertions that assume more than the system guarantees.

This masterclass walks through how nondeterminism enters test systems, how to localize it, and how to fix it without papering over product bugs. It assumes you run automated checks in CI, possibly in parallel, and care about signal quality on pull requests and release branches.

For a shorter, gate-focused playbook on the same topic, see No-BS Playbook: Fix Flaky Tests Without Slowing Releases. For framework trade-offs that affect stability, see Playwright vs Selenium vs Cypress: 2026 Comparison.

Formal definition and why it matters

Treat a test case as a function T over a tuple of inputs: the product under test P, the test code C, the initial environment E (OS, clock, locale, feature flags, data volume), and the schedule S (CPU speed, network latency, thread interleaving, job parallelism).

If T(P, C, E, S) maps to pass on some draws of (E, S) and to fail on others while P and C are fixed, the test is flaky relative to your declared preconditions. Engineering work is either:

  1. Narrow (E, S) so the test’s assumptions hold (containers, pinned TZ, single worker, seeded RNG), or
  2. Broaden the test so it only asserts properties that hold for the whole family of (E, S) you officially support.

Skipping that analysis is how teams end up with infinite reruns and eroded trust.
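
A minimal sketch of option 1 in Playwright config terms, assuming the suite officially supports only UTC and en-US; the specific values are illustrative, not a recommendation:

// Sketch: narrow (E, S) explicitly instead of inheriting the CI host's defaults.
import { defineConfig } from "@playwright/test";

export default defineConfig({
  workers: 1,            // no cross-test interleaving from parallelism
  use: {
    timezoneId: "UTC",   // pin the browser clock's zone
    locale: "en-US",     // pin formatting and collation
  },
});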

Where nondeterminism actually comes from

Async completion without synchronization

UI and service stacks are event-driven. A test that clicks as soon as an element appears in the DOM may still race layout, hydration, focus, or network-driven content. Fixed sleep() is a statistical guess: under load, completion time shifts right; on a fast machine, you waste time and still miss rare races.

Technical fix: drive assertions off observable readiness, such as network idle where appropriate, response bodies for API-driven UI, or framework primitives that poll assertions with a timeout (Playwright’s auto-waiting expect, Cypress’s retrying assertions, Selenium’s WebDriverWait with expected conditions). Reserve sleep for cases with no observable signal, and then treat that gap as a design smell in the app or test harness.

Concurrency and shared mutable state

Parallel CI runs multiple tests or workers at once. If two tests share a database schema, Redis key namespace, filesystem directory, static singleton, or global config mutation, you get order-dependent and timing-dependent failures.

Technical fixes:

  • Process-level isolation: unique DB per worker (workerIndex in Playwright test config, pytest-xdist worker id), ephemeral databases (throwaway schema per test class), or transaction rollbacks per test where the stack supports it (see the sketch after this list).
  • Key namespacing: prefix cache keys and queue names with testRunId + testName.
  • Immutable fixtures: builders that create rows with UUIDs instead of assuming id = 1 is free.
  • Test order randomization on a single worker to catch hidden coupling before parallel CI does.
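
A minimal sketch of the first bullet as a Playwright worker fixture; provisionDb and dropDb are hypothetical helpers standing in for whatever your stack uses to create and destroy a throwaway database:

// Sketch: one throwaway database per Playwright worker; provisionDb and
// dropDb are hypothetical stand-ins for your create/migrate/drop logic.
import { test as base } from "@playwright/test";

async function provisionDb(name: string): Promise<string> {
  // e.g. CREATE DATABASE + run migrations; elided here.
  return `postgres://localhost:5432/${name}`;
}

async function dropDb(name: string): Promise<void> {
  // e.g. DROP DATABASE; elided here.
}

export const test = base.extend<{}, { dbUrl: string }>({
  dbUrl: [
    async ({}, use, workerInfo) => {
      const name = `app_test_w${workerInfo.workerIndex}`; // unique per worker
      const url = await provisionDb(name);
      await use(url);     // every test in this worker sees the same isolated DB
      await dropDb(name); // teardown when the worker shuts down
    },
    { scope: "worker" },
  ],
});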

Clock, time zone, and locale

Assertions on “today,” formatted money, or sorted strings break across TZ, LANG, and daylight-saving boundaries.

Technical fix: inject a fake clock in unit and integration layers; in E2E, pin TZ (for example UTC) in CI job env and document supported locales. Never assert on full timestamps unless you control the clock source.
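
A minimal sketch of clock injection in TypeScript; the Clock interface and isOverdue function are illustrative, not a specific library API:

// Sketch: inject the clock instead of calling Date.now() in product code.
interface Clock {
  now(): Date;
}

// Production wiring uses the real clock.
const systemClock: Clock = { now: () => new Date() };

class FixedClock implements Clock {
  constructor(private readonly at: Date) {}
  now(): Date {
    return this.at;
  }
}

// Product code depends on Clock, so tests control "today".
function isOverdue(dueAt: Date, clock: Clock): boolean {
  return clock.now().getTime() > dueAt.getTime();
}

// Test: a pinned instant, so the assertion cannot drift across TZ or midnight.
const clock = new FixedClock(new Date("2026-01-15T12:00:00Z"));
console.assert(isOverdue(new Date("2026-01-14T00:00:00Z"), clock));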

Resource limits and throttling

CI VMs share CPUs. Tests that assume sub-second SLA for local services fail when the host is noisy. Browser tests that open many tabs or skip viewport emulation behave differently under memory pressure.

Technical fix: explicit timeouts sized for p95 CI, not laptop best case; split heavy suites; use smaller fixtures; assert on functional outcomes rather than latency unless performance is the SUT.
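
One way to express that in Playwright config; the numbers are illustrative and should come from your own CI p95 data:

// Sketch: budgets sized for CI p95 rather than laptop best case.
import { defineConfig } from "@playwright/test";

export default defineConfig({
  timeout: process.env.CI ? 60_000 : 30_000,             // per-test ceiling
  expect: { timeout: process.env.CI ? 10_000 : 5_000 },  // assertion polling window
});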

Brittle locators and DOM churn

Selectors tied to hashed CSS-module classes, deep XPath, or positional nth-child break when markup reflows, even when behavior is correct.

Technical fix: contract with frontend on data-testid (or role + accessible name) for critical flows; prefer user-facing queries (getByRole, getByLabelText) so refactors preserve semantics.
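
A small Playwright sketch of the contrast; the route, button name, and test id are hypothetical:

import { test, expect } from "@playwright/test";

test("save billing details", async ({ page }) => {
  await page.goto("/billing"); // illustrative route
  // Brittle (avoid): hashed CSS-module class plus positional selector, e.g.
  //   page.locator(".Btn_x91 > span:nth-child(2)")
  // Resilient: role + accessible name survive markup refactors.
  await page.getByRole("button", { name: "Save" }).click();
  // data-testid as the fallback contract for non-semantic elements.
  await expect(page.getByTestId("billing-status")).toHaveText(/saved/i);
});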

External systems and flapping dependencies

Sandboxes, rate limits, CDN variance, and third-party OAuth flows inject real-world noise.

Technical fix: WireMock, MockServer, or recorded fixtures for CI; contract tests against a stable API surface; separate “full stack with real externals” into a non-blocking scheduled job until its reliability matches PR gates.
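
A minimal sketch of stubbing a flapping dependency at the network layer with Playwright’s page.route; the endpoint and payload are invented for illustration:

import { test, expect } from "@playwright/test";

test("checkout renders with the FX service stubbed", async ({ page }) => {
  // Intercept the flaky third party before any navigation triggers the call.
  await page.route("**/api/fx/rates*", (route) =>
    route.fulfill({ json: { USD: 1, EUR: 0.91 } })
  );
  await page.goto("/checkout");
  await expect(page.getByText("EUR")).toBeVisible();
});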

Layered view: unit vs integration vs E2E

| Layer | Typical hidden variables | Hardening direction |
| --- | --- | --- |
| Unit | global mocks, static time, RNG | inject dependencies; no real I/O |
| Integration | DB migration order, pool sizing, async workers | real DB with isolation; await job completion |
| E2E | browser, network, layout, third parties | trace-first debugging; fewer, higher-value tests |

Rule of thumb: push determinism down the pyramid. Flaky E2E often points to missing integration coverage of the same invariant.

Detection: from anecdote to data

Rerun-on-failure tagging

Configure your runner to rerun only failures a bounded number of times (for example two). Classify outcomes:

  • Pass, pass → healthy
  • Fail, pass, pass → classic flake candidate
  • Fail, fail → likely real regression or consistently broken env

Store (testId, commit, outcome vector, worker, duration) in your CI telemetry or a simple warehouse table.
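
A minimal sketch of that classification over the stored outcome vector; the bucket names are illustrative:

// Sketch: turn a bounded rerun vector into the three buckets above.
type Outcome = "pass" | "fail";

function classify(runs: Outcome[]): "healthy" | "flake_candidate" | "regression" {
  if (runs[0] === "pass") return "healthy";                     // pass → healthy
  if (runs.slice(1).includes("pass")) return "flake_candidate"; // fail, then a pass
  return "regression";                                          // fail, fail → investigate
}

console.log(classify(["fail", "pass", "pass"])); // "flake_candidate"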

Binomial intuition

If a test passes spuriously with probability p per run (independently across runs), the probability it goes green at least once in n attempts is 1 - (1 - p)^n. Example: p = 0.2 and n = 3 gives roughly a 49% chance of an eventual pass, so a green result after retries does not mean the test is fine. That math is why unbounded retries are dangerous for quality and cost.
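
A two-line check of that arithmetic, with p read as the spurious per-run pass probability:

// P(at least one pass in n tries) = 1 - (1 - p)^n
const pEventualPass = (p: number, n: number): number => 1 - (1 - p) ** n;
console.log(pEventualPass(0.2, 3)); // ≈ 0.488, the ~49% quoted above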

Stress runs

For a suspect test, loop locally or in CI:

# Example: run one Playwright spec 50 times serially
for i in $(seq 1 50); do
  npx playwright test path/to/spec.ts || exit 1
done

If it fails on iteration 37, capture trace + video + stderr for that iteration only.

Localization playbook (when you have a single failure artifact)

  1. Freeze variables: same Node/Java version as CI, same HEADLESS flag, same TZ.
  2. Serial vs parallel: workers: 1 vs default; pytest without -n.
  3. Shrink data: smallest dataset that still hits the code path.
  4. Bisect timing: temporarily increase timeouts—if flakiness disappears, you likely have a slow waiter or resource starvation, not a wrong assertion.
  5. Binary search the suite: half the file, half again, until one test proves order sensitivity.

Fixes by pattern (concrete)

Replace sleeps with condition waits

Playwright (assertion polling is built in):

await expect(page.getByRole("button", { name: "Submit" })).toBeEnabled();
await page.getByRole("button", { name: "Submit" }).click();

Selenium (Python) with explicit wait:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# The explicit wait returns the element once clickable, so we don't re-find it.
wait = WebDriverWait(driver, 20)
submit = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "[data-testid='submit']")))
submit.click()

Eliminate cross-test leakage

  • Truncate or migrate DB in beforeEach only if fast enough; otherwise transaction per test or template DB clone.
  • Clear localStorage / sessionStorage / cookies between E2E cases unless the scenario explicitly needs persistence (sketch after this list).
  • Reset feature-flag overrides in teardown, not only setup.
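
Playwright already gives each test a fresh browser context, so the storage bullet matters mainly when a suite deliberately reuses state (a shared storageState or a long-lived context). A sketch of an explicit reset under that assumption:

import { test } from "@playwright/test";

test.beforeEach(async ({ context, page }) => {
  await context.clearCookies(); // drop any session carried over
  await page.goto("/");         // storage APIs need an origin first
  await page.evaluate(() => {
    localStorage.clear();
    sessionStorage.clear();
  });
});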

Parallel-safe ports and hosts

Binding a test server to hardcoded port 3000 collides across workers. Use port 0 (OS-assigned) or a port allocator from your test runner, and pass the resulting URL to the app under test via env.
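
A minimal Node sketch of the OS-assigned approach; BASE_URL is an illustrative variable name:

import http from "node:http";
import type { AddressInfo } from "node:net";

const server = http.createServer((_req, res) => res.end("ok"));
server.listen(0, () => { // port 0 = let the OS pick a free port
  const { port } = server.address() as AddressInfo;
  // Hand the resulting URL to the app/tests via env instead of assuming 3000.
  process.env.BASE_URL = `http://127.0.0.1:${port}`;
  console.log(`test server on ${process.env.BASE_URL}`);
});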

Wait for side effects, not only UI

After “Save,” assert via API or DB that the write landed when the UI is optimistic or eventually consistent:

await expect.poll(async () => fetchJson(`/api/items/${id}`)).toMatchObject({ status: "saved" });

That pattern removes a whole class of “clicked success toast but replication lagged” flakes.

Retries: engineering policy, not a vibe

Retries are legitimate mitigation while you fix root cause, not a substitute for fixing.

Policy template:

  • PR-blocking suites: at most 1 automatic retry on failure, plus mandatory ticket if a test needed it.
  • Nightly: higher rerun budget acceptable; results never block merge without human review.
  • Never retry without logging retry count and final vs first outcome.

Implement retries in the runner (test framework config) rather than sprinkling try/except with loops inside tests—centralized policy is auditable.
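
In Playwright terms, that policy is a few lines of config rather than per-test loops; the reporter choice is illustrative:

// Sketch: centralized, auditable retry policy.
import { defineConfig } from "@playwright/test";

export default defineConfig({
  retries: process.env.CI ? 1 : 0, // at most one automatic retry on PR gates
  // The JSON report records per-test retry counts, so retried passes stay visible.
  reporter: [["list"], ["json", { outputFile: "test-results.json" }]],
});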

Instrumentation you should already be using

  • Playwright: trace on first retry, zip artifacts on failure (config sketch after this list).
  • Cypress: screenshots + video on failure; DEBUG logs for plugin issues.
  • JUnit / Allure: attach stdout and timing per test.
  • Backend: structured logs with correlation id propagated from test client so one failure ties UI → API → worker.
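
A minimal Playwright block matching the first bullet; the other bullets are framework- or stack-specific:

import { defineConfig } from "@playwright/test";

export default defineConfig({
  use: {
    trace: "on-first-retry",        // full trace only when a retry happens
    screenshot: "only-on-failure",
    video: "retain-on-failure",
  },
});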

Organizational guardrails

  • Quarantine: move chronically flaky tests out of the merge gate into a named job with an SLA (fix or delete within N days).
  • Ownership: every package in the test tree has a CODEOWNER; flakes route there automatically.
  • No “test-only” production shortcuts that mock away the path users take—those create false greens worse than flakes.

Conclusion

Flaky tests are a systems problem: concurrency, time, I/O, and incomplete specifications show up as random-looking failures. You fix them by making hidden variables explicit—isolation, observability, condition-based synchronization, and metrics on flip-flop rates—then by narrowing assertions to what the platform truly guarantees under your supported environments.

Ship when signal is trustworthy: bounded retries, traced failures, and a workflow that turns intermittent red into a repeatable defect report. Anything else is just gambling with your release train.