
AI Test Hallucinations: Detection + Fixes

Prasandeep

7 min read · AI

Large language models complete plausible-looking test code from partial context. When that completion drifts from your real DOM, OpenAPI, or business rules, you get test hallucinations: artifacts that compile or almost run but encode facts that are not true in your system. That is different from a flaky test (nondeterminism) or a wrong product bug—it is specification debt encoded as automation.

This article is technical: a taxonomy of hallucination modes, why they arise from model behavior and workflow design, detection layers (static, schema-bound, runtime), and fixes you can implement in prompts, repos, and CI. For prompt patterns that reduce ambiguity, see Prompt Engineering for Test Automation. For agent-style loops and tooling, see Agentic AI Testing for Software Test Engineers. For tool landscape context, see Top 12 AI Testing Tools 2026.

Definitions: hallucination vs other failures

| Phenomenon | Typical cause | Example signal |
| --- | --- | --- |
| Hallucination | Model invents structure not in app/API | Selector never matches any node; field not in schema |
| Flaky test | Timing, shared state, parallelism | Pass/fail toggles without code change — see Fix Flaky Tests masterclass |
| Stale test | Product changed, test did not | Once-green assertion now fails every run |
| Weak assertion | Test passes but checks little | expect(true).toBe(true) or overly broad toContain |

Hallucinations are dangerous when they produce weak assertions or wrong green: the suite looks healthy while coverage of real behavior is illusory.

Taxonomy: where hallucinations appear

  1. Locator hallucination — invented #id, XPath, or getByText('Welcome back, Alex') when copy and roles differ.
  2. API hallucination — wrong path (/api/user vs /api/v1/users/me), verb, or JSON field names.
  3. Assertion hallucination — expected UI copy, HTTP status, or side effect that the product never defined.
  4. Flow hallucination — plausible user journey that skips auth redirects, feature flags, or BFF hops.
  5. Fixture / import hallucination — import { foo } from '@/test/helpers/foo' where foo does not exist; factory names that match “common patterns” only.
  6. Framework hallucination — API from wrong Playwright/Cypress version; deprecated page.click patterns mixed with Test API.

Once classified, each class maps to a different detector (DOM registry vs OpenAPI diff vs typecheck).

Why models hallucinate tests (mechanism, not mysticism)

LLMs do not execute your app. They approximate the probability of the next token given context over training data and your prompt. Under incomplete context, high-probability generic completions dominate: “login” flows often get #submit, REST responses often get data.items, dashboards often get a “Welcome” heading.

Contributing factors:

  • Context window limits — entire repo + design doc rarely fits; the model fills gaps with priors.
  • Ambiguous prompts — “test checkout” without routes, selectors, or API samples maximizes invention surface.
  • Stale or synthetic training priors — popular frameworks’ average patterns may not match your routing or component library.
  • Screenshot-only or prose-only input — OCR and layout inference are lossy; easy to misread text or miss data-testid.

So hallucination is often a context-binding problem, not “the model is lazy.”

Detection layer 0 — prompt and contract hygiene

Before any code runs, treat generation inputs as contracts:

  • Pin framework + version (e.g. @playwright/test 1.49+) and language (TypeScript).
  • Attach trimmed DOM (HTML snippet or React tree for one screen), OpenAPI fragment, or HAR redacted for secrets.
  • Require negative capability: “If a selector or field is not in the provided markup/schema, output TODO: unknown instead of inventing.”
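A minimal sketch of that kind of prompt contract, written as a TypeScript helper; the wording, placeholders, and function name are illustrative rather than a fixed API:

Typescript
// buildTestGenPrompt is a hypothetical helper: it pins framework and language,
// attaches real DOM and schema fragments, and spells out the negative-capability rule.
export function buildTestGenPrompt(domSnippet: string, openApiFragment: string): string {
  return [
    "Target: @playwright/test 1.49+, TypeScript.",
    "Use only selectors present in the DOM below and fields present in the schema below.",
    "If a selector or field is not in the provided markup/schema, output 'TODO: unknown' instead of inventing it.",
    "",
    "DOM:",
    domSnippet,
    "",
    "OpenAPI fragment:",
    openApiFragment,
  ].join("\n");
}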

That aligns with prompt engineering for test automation: same rigor as acceptance criteria.

Detection layer 1 — static analysis and allowlists

Run fast gates on AI output before merge:

Bash
# Example: fail if generated tests import non-existent helpers
npx tsc --noEmit
  • TypeScript / ESLint — catch impossible imports, wrong types for APIRequestContext, unused symbols.
  • Import allowlist — generated files may only import from @/fixtures, @/pages, @playwright/test; flag ../../mystery-helper.
  • AST grep for smell patterns — e.g. ban page.locator('xpath=//div[1]/div[2]') in generated paths unless exempted.
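A minimal sketch of the allowlist and smell gate as one Node script (TypeScript, Node 20+ for recursive readdirSync); it uses regexes as a simple stand-in for real AST checks, and the directory, allowlist entries, and banned pattern are assumptions to adapt to your repo:

Typescript
// check-generated-tests.ts — flag disallowed imports and raw XPath in AI-generated specs.
import { readFileSync, readdirSync } from "node:fs";
import { join } from "node:path";

const GENERATED_DIR = "tests/generated"; // assumed location of AI-authored specs
const ALLOWED_IMPORT_PREFIXES = ["@playwright/test", "@/fixtures", "@/pages"];

const files = (readdirSync(GENERATED_DIR, { recursive: true }) as string[])
  .filter((f) => f.endsWith(".spec.ts"))
  .map((f) => join(GENERATED_DIR, f));

let failures = 0;
for (const file of files) {
  const source = readFileSync(file, "utf8");

  // Import allowlist: anything outside the approved prefixes (including ../../mystery-helper) fails.
  for (const match of source.matchAll(/from\s+['"]([^'"]+)['"]/g)) {
    const specifier = match[1];
    if (!ALLOWED_IMPORT_PREFIXES.some((prefix) => specifier.startsWith(prefix))) {
      console.error(`${file}: disallowed import "${specifier}"`);
      failures++;
    }
  }

  // Smell pattern: raw XPath locators in generated code.
  if (/locator\(\s*['"`]xpath=/.test(source)) {
    console.error(`${file}: raw XPath locator`);
    failures++;
  }
}
process.exit(failures > 0 ? 1 : 0);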

Static checks catch fixture/import and some framework hallucinations before CI spends minutes on browsers.

Detection layer 2 — schema- and contract-bound checks

For API-level tests:

  • Diff referenced paths and bodies against OpenAPI (or protobuf descriptors) with tooling or a small script in CI (a rough sketch follows this list).
  • For consumer-driven setups, align with contract testing: if the test asserts response.userId but the schema exposes id, fail the PR at review or at the codegen validation step.
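A rough sketch of that path diff, assuming an openapi.json at the repo root and one generated spec file; the regex only covers Playwright-style request.get/post/... calls, and the matching is naive (templated paths like /users/{id} would need real matching):

Typescript
// openapi-path-check.ts — flag request paths in a generated API spec that are not in the OpenAPI document.
import { readFileSync } from "node:fs";

const spec = JSON.parse(readFileSync("openapi.json", "utf8"));       // assumed spec location
const knownPaths = new Set(Object.keys(spec.paths ?? {}));

const source = readFileSync("tests/generated/api.spec.ts", "utf8");  // assumed generated file
const referenced = [...source.matchAll(/request\.(?:get|post|put|patch|delete)\(\s*['"`]([^'"`]+)/g)]
  .map((m) => m[1].replace(/^https?:\/\/[^/]+/, ""));                // strip any base URL

const unknown = referenced.filter((path) => !knownPaths.has(path));
if (unknown.length > 0) {
  console.error("Paths not found in OpenAPI:", unknown);
  process.exit(1);
}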

For UI:

  • Maintain a selector registry (YAML/JSON): login.submit → getByTestId('login-submit'). Generated code must reference keys, not raw strings, so CI can verify keys exist.
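A sketch of that CI check, assuming a JSON registry and generated specs that index into a selectors object by key; the file paths and access pattern are illustrative:

Typescript
// selector-registry-check.ts — generated tests may only use known registry keys, never raw locators.
import { readFileSync } from "node:fs";

// e.g. { "login.submit": "getByTestId('login-submit')", ... }
const registry: Record<string, string> = JSON.parse(
  readFileSync("selectors.registry.json", "utf8"),
);

const source = readFileSync("tests/generated/login.spec.ts", "utf8"); // assumed generated file
const usedKeys = [...source.matchAll(/selectors\[['"]([^'"]+)['"]\]/g)].map((m) => m[1]);

const unknownKeys = usedKeys.filter((key) => !(key in registry));
const rawLocatorCalls = source.match(/page\.locator\(/g) ?? [];

if (unknownKeys.length > 0 || rawLocatorCalls.length > 0) {
  console.error("Unknown registry keys:", unknownKeys);
  console.error(`Raw page.locator(...) calls: ${rawLocatorCalls.length}`);
  process.exit(1);
}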

Detection layer 3 — dry run and “locator resolution” probes

Short Playwright smoke that only resolves locators without full business logic:

Typescript
import { test, expect } from "@playwright/test";
import { selectors } from "./selectors.generated";

test("generated selectors resolve", async ({ page }) => {
  await page.goto(process.env.BASE_URL + "/login");
  for (const s of Object.values(selectors)) {
    await expect(page.locator(s)).toHaveCount(1, { timeout: 5_000 });
  }
});

If the AI invented any of those selectors, resolution timeouts isolate locator hallucinations before you merge a 400-line spec.

Detection layer 4 — runtime and observability

Use first-run evidence:

  • Trace Viewer (Playwright debugging) — wrong navigation order shows up immediately.
  • Network tab in trace — calls to nonexistent hosts or paths.
  • Strict mode violations — multiple matches often mean a vague locator the model “guessed.”

Distinguish product regression vs hallucination: hallucination often fails on first action with “strict mode violation” or 0 matches; product bugs more often fail mid-flow after successful navigation.

Detection layer 5 — assertion strength scoring

Heuristics for reviewers or linters:

  • Flag expect(true), empty test.skip, or assertions only on URL contains without state change checks.
  • Prefer web-first assertions (toBeVisible, toHaveURL) tied to observable outcomes (Playwright vs Selenium vs Cypress).

Optional: a simple AST visitor that scores tests: +1 for role/testid locators, −1 for long XPath, −2 for string literals that do not appear in checked-in strings.json from i18n extract.
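A minimal sketch of such a visitor using the TypeScript compiler API; it covers the role/testid and XPath heuristics plus the expect(true) flag from above, while the i18n string-literal check would additionally need your strings.json:

Typescript
// assertion-score.ts — a toy scorer for generated specs, not a full linter.
import ts from "typescript";
import { readFileSync } from "node:fs";

export function scoreSpec(filePath: string): number {
  const text = readFileSync(filePath, "utf8");
  const sourceFile = ts.createSourceFile(filePath, text, ts.ScriptTarget.Latest, true);
  let score = 0;

  const visit = (node: ts.Node): void => {
    if (ts.isCallExpression(node)) {
      const callee = node.expression;
      const firstArg = node.arguments[0];

      // +1: semantic locators such as page.getByRole(...) / page.getByTestId(...)
      if (ts.isPropertyAccessExpression(callee) && /^getBy(Role|TestId)$/.test(callee.name.text)) {
        score += 1;
      }
      // -1: raw XPath passed to locator(...)
      if (
        ts.isPropertyAccessExpression(callee) &&
        callee.name.text === "locator" &&
        firstArg !== undefined &&
        ts.isStringLiteralLike(firstArg) &&
        firstArg.text.startsWith("xpath=")
      ) {
        score -= 1;
      }
      // -2: vacuous expect(true)
      if (ts.isIdentifier(callee) && callee.text === "expect" && firstArg?.kind === ts.SyntaxKind.TrueKeyword) {
        score -= 2;
      }
    }
    ts.forEachChild(node, visit);
  };
  visit(sourceFile);
  return score;
}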

Fixes — ground truth sources

Single sources of truth the model (or codegen tool) must read:

| Source | Use for |
| --- | --- |
| OpenAPI / GraphQL schema | Paths, methods, field names |
| data-testid map | Stable UI binding |
| Recorded HAR (sanitized) | Realistic status codes and payloads |
| Page objects in repo | Allowed locator surface |

RAG over your repo (chunked by route and component) reduces open-ended guessing—keep embeddings fresh on each release branch.

Fixes — structured generation

Instead of “output a full spec file,” split:

  1. Plan (JSON): steps, locators chosen from registry, API calls with operationId.
  2. Code generated only from validated JSON.

If step 1 references an unknown operationId, reject before step 2. That pattern is how many internal “AI SDET” tools avoid free-form invention.
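A sketch of that gate between step 1 and step 2, assuming a plan.json whose steps carry selectorKey and operationId fields; the shapes and file names are illustrative:

Typescript
// validate-plan.ts — reject a generated plan before codegen if it references unknown IDs or keys.
import { readFileSync } from "node:fs";

type PlanStep = { action: string; selectorKey?: string; operationId?: string };
type Plan = { steps: PlanStep[] };

const plan: Plan = JSON.parse(readFileSync("plan.json", "utf8"));
const spec = JSON.parse(readFileSync("openapi.json", "utf8"));
const registry = JSON.parse(readFileSync("selectors.registry.json", "utf8"));

// Collect every operationId declared in the OpenAPI document.
const knownOperations = new Set<string>();
for (const methods of Object.values<any>(spec.paths ?? {})) {
  for (const op of Object.values<any>(methods)) {
    if (op?.operationId) knownOperations.add(op.operationId);
  }
}

const errors = plan.steps.flatMap((step, i) => {
  const stepErrors: string[] = [];
  if (step.operationId && !knownOperations.has(step.operationId)) {
    stepErrors.push(`step ${i}: unknown operationId "${step.operationId}"`);
  }
  if (step.selectorKey && !(step.selectorKey in registry)) {
    stepErrors.push(`step ${i}: unknown selector key "${step.selectorKey}"`);
  }
  return stepErrors;
});

if (errors.length > 0) {
  console.error(errors.join("\n"));
  process.exit(1); // step 2 (codegen) never runs for an invalid plan
}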

Fixes — human-in-the-loop with explicit gates

Minimal merge policy for AI-authored tests:

  1. Author (human or bot) opens PR with ai-generated label.
  2. Reviewer checks mapping to ticket + selector registry + OpenAPI.
  3. CI runs tsc, contract diff, locator probe job, then full suite.
  4. Owner merges; flaky ownership filed if new instability appears.

For release-critical paths (payments, auth), ban fully automated merge of AI-only diffs—same discipline as modern test pyramid risk tiers.

Fixes — CI integration

Hook validation into the same pipeline as GitHub Actions + Playwright:

Yaml
- name: Validate generated tests
  run: |
    node scripts/validate-ai-tests.mjs
    npx playwright test tests/generated/probe.spec.ts

validate-ai-tests.mjs can: parse imports, load OpenAPI + selector map, exit non-zero on drift.
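One way to structure that script is as a thin orchestrator over the gates sketched earlier; shown here in TypeScript for consistency with the other snippets, with tsx assumed as the runner and the gate file names hypothetical:

Typescript
// validate-ai-tests.ts — run each gate, surface its output, and fail the CI step on any drift.
import { spawnSync } from "node:child_process";

const gates = [
  "scripts/check-generated-tests.ts",
  "scripts/openapi-path-check.ts",
  "scripts/selector-registry-check.ts",
];

let failed = false;
for (const gate of gates) {
  const result = spawnSync("npx", ["tsx", gate], { stdio: "inherit" });
  if (result.status !== 0) failed = true;
}
process.exit(failed ? 1 : 0);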

Example: API hallucination caught by schema

Generated: expect(json).toHaveProperty('userTier')
OpenAPI: field is subscription.tier

Fix: codegen step runs JSON Schema validation from OpenAPI component schema against recorded fixture; mismatch fails before human review.
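A sketch of that validation with Ajv, assuming the response schema lives under components.schemas and the recorded fixture is checked in; names and paths are illustrative:

Typescript
// fixture-schema-check.ts — validate a recorded fixture against the OpenAPI component schema.
import Ajv from "ajv";
import { readFileSync } from "node:fs";

const spec = JSON.parse(readFileSync("openapi.json", "utf8"));
const fixture = JSON.parse(readFileSync("fixtures/me.json", "utf8"));

// e.g. components.schemas.UserProfile defines subscription.tier, not userTier
const schema = spec.components.schemas.UserProfile;

const ajv = new Ajv({ strict: false }); // OpenAPI schemas use keywords Ajv's strict mode rejects
const validate = ajv.compile(schema);

if (!validate(fixture)) {
  console.error(validate.errors);
  process.exit(1);
}

// With additionalProperties: false on the schema, a fixture fabricated with userTier fails here
// before any reviewer reads the generated assertion.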

Example: locator hallucination caught by registry

Generated: page.locator('#login-submit')
Registry: login.submit → getByTestId('auth-login-submit')

Fix: linter replaces free page.locator in tests/ai/** with registry indirection; unknown keys fail CI.

When AI is lower risk vs higher risk

Lower risk — boilerplate from stable templates, table-driven cases from CSV, refactor of existing tests with diff-only review.
Higher risk — new E2E from screenshot alone, cross-app flows, auth with MFA, financial calculations, anything under strict compliance or safety.

Summary

AI test hallucinations are structured errors: the model completes plausible automation that is not bound to your app’s ground truth. Defense is layered: better prompts and contracts, static checks, schema alignment, locator probes, runtime traces, and merge policy. Speed and safety both improve when you treat AI output as untrusted input until proven against DOM, API, and product facts.

Takeaways

  • Separate hallucination from flakiness and staleness; fix each with different tools.
  • Bind generation to OpenAPI, selector registries, and typed helpers—never prose-only for risky flows.
  • Add CI validation scripts plus thin probe specs before full suites absorb bad code.
  • Keep humans on the critical path for high-risk journeys; use AI to draft, not to certify.

For broader AI + QA strategy, revisit agentic AI testing and keep prompts as explicit as your best tickets (prompt engineering).