AI Test Hallucinations: Detection + Fixes

Large language models complete plausible-looking test code from partial context. When that completion drifts from your real DOM, OpenAPI, or business rules, you get test hallucinations: artifacts that compile or almost run but encode facts that are not true in your system. That is different from a flaky test (nondeterminism) or a wrong product bug—it is specification debt encoded as automation.
This article is technical: a taxonomy of hallucination modes, why they arise from model behavior and workflow design, detection layers (static, schema-bound, runtime), and fixes you can implement in prompts, repos, and CI. For prompt patterns that reduce ambiguity, see Prompt Engineering for Test Automation. For agent-style loops and tooling, see Agentic AI Testing for Software Test Engineers. For tool landscape context, see Top 12 AI Testing Tools 2026.
Definitions: hallucination vs other failures
| Phenomenon | Typical cause | Example signal |
|---|---|---|
| Hallucination | Model invents structure not in app/API | Selector never matches any node; field not in schema |
| Flaky test | Timing, shared state, parallelism | Pass/fail toggles without code change — see Fix Flaky Tests masterclass |
| Stale test | Product changed, test did not | Once-green assertion now fails every run |
| Weak assertion | Test passes but checks little | `expect(true).toBe(true)` or overly broad `toContain` |
Hallucinations are dangerous when they produce weak assertions or wrong green: the suite looks healthy while coverage of real behavior is illusory.
Taxonomy: where hallucinations appear
- Locator hallucination — invented `#id`, XPath, or `getByText('Welcome back, Alex')` when copy and roles differ.
- API hallucination — wrong path (`/api/user` vs `/api/v1/users/me`), verb, or JSON field names.
- Assertion hallucination — expected UI copy, HTTP status, or side effect that the product never defined.
- Flow hallucination — plausible user journey that skips auth redirects, feature flags, or BFF hops.
- Fixture / import hallucination — `import { foo } from '@/test/helpers/foo'` where `foo` does not exist; factory names that match "common patterns" only.
- Framework hallucination — API from the wrong Playwright/Cypress version; deprecated `page.click` patterns mixed with the modern Test API.
Once classified, each class maps to a different detector (DOM registry vs OpenAPI diff vs typecheck).
Why models hallucinate tests (mechanism, not mysticism)
LLMs do not execute your app. They approximate the probability of the next token given context over training data and your prompt. Under incomplete context, high-probability generic completions dominate: "login" flows often get `#submit`, REST responses often get `data.items`, dashboards often get a "Welcome" heading.
Contributing factors:
- Context window limits — entire repo + design doc rarely fits; the model fills gaps with priors.
- Ambiguous prompts — “test checkout” without routes, selectors, or API samples maximizes invention surface.
- Stale or synthetic training priors — popular frameworks’ average patterns may not match your routing or component library.
- Screenshot-only or prose-only input — OCR and layout inference are lossy; it is easy to misread text or miss `data-testid` attributes.
So hallucination is often a context-binding problem, not “the model is lazy.”
Detection layer 0 — prompt and contract hygiene
Before any code runs, treat generation inputs as contracts:
- Pin framework + version (e.g. `@playwright/test` 1.49+) and language (TypeScript).
- Attach a trimmed DOM (HTML snippet or React tree for one screen), an OpenAPI fragment, or a HAR redacted for secrets.
- Require negative capability: "If a selector or field is not in the provided markup/schema, output `TODO: unknown` instead of inventing."
That aligns with prompt engineering for test automation: same rigor as acceptance criteria.
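As a concrete sketch, the contract can be assembled programmatically before every generation call. Everything here is illustrative: the `data-testid` values and the prompt wording are assumptions, not a standard template.

```typescript
// Hypothetical prompt contract: pinned framework, attached ground truth,
// and an explicit negative-capability clause. All names are illustrative.
const domSnippet = `<form data-testid="auth-login-form">
  <input data-testid="auth-login-email" type="email" />
  <button data-testid="auth-login-submit">Sign in</button>
</form>`;

const prompt = [
  "Target: @playwright/test 1.49+, TypeScript.",
  "Use ONLY selectors present in the markup below.",
  "If a selector or field is not in the provided markup, output TODO: unknown instead of inventing one.",
  "",
  "Markup:",
  domSnippet,
].join("\n");

console.log(prompt.includes("TODO: unknown")); // → true
```

Keeping the contract in code (rather than ad-hoc chat messages) makes it reviewable and versionable alongside the tests it produces.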
Detection layer 1 — static analysis and allowlists
Run fast gates on AI output before merge:
```bash
# Example: fail if generated tests import non-existent helpers
npx tsc --noEmit
```

- TypeScript / ESLint — catch impossible imports, wrong types for `APIRequestContext`, unused symbols.
- Import allowlist — generated files may only import from `@/fixtures`, `@/pages`, `@playwright/test`; flag `../../mystery-helper`.
- AST grep for smell patterns — e.g. ban `page.locator('xpath=//div[1]/div[2]')` in generated paths unless exempted.
Static checks catch fixture/import and some framework hallucinations before CI spends minutes on browsers.
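The import allowlist can be a very small script. The sketch below is regex-based for brevity (a production version might use the TypeScript compiler API); the allowed prefixes mirror the list above.

```typescript
// Hypothetical allowlist check: scan generated test source for import
// specifiers outside the approved set.
const ALLOWED_PREFIXES = ["@/fixtures", "@/pages", "@playwright/test"];

function findForbiddenImports(source: string): string[] {
  const importRe = /import\s+(?:[\s\S]*?)\s+from\s+["']([^"']+)["']/g;
  const bad: string[] = [];
  for (const match of source.matchAll(importRe)) {
    const specifier = match[1];
    if (!ALLOWED_PREFIXES.some((p) => specifier.startsWith(p))) {
      bad.push(specifier);
    }
  }
  return bad;
}

const generated = `
import { test, expect } from "@playwright/test";
import { loginPage } from "@/pages/login";
import { helper } from "../../mystery-helper";
`;

console.log(findForbiddenImports(generated)); // → ["../../mystery-helper"]
```

Wiring this into a pre-merge job means a hallucinated helper fails in seconds instead of after a browser spin-up.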
Detection layer 2 — schema- and contract-bound checks
For API-level tests:
- Diff referenced paths and bodies against OpenAPI (or protobuf descriptors) with tooling or a small script in CI.
- For consumer-driven setups, align with contract testing: if the test asserts `response.userId` but the schema exposes `id`, fail the PR at review or at the codegen validation step.
For UI:
- Maintain a selector registry (YAML/JSON): `login.submit` → `getByTestId('login-submit')`. Generated code must reference keys, not raw strings, so CI can verify the keys exist.
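A registry check is a few lines once the YAML/JSON is parsed. The sketch below assumes generated specs go through a hypothetical `sel("key")` helper rather than raw locator strings; the key names are illustrative.

```typescript
// Registry assumed already parsed from YAML/JSON into a flat map.
const registry: Record<string, string> = {
  "login.submit": 'getByTestId("auth-login-submit")',
  "login.email": 'getByTestId("auth-login-email")',
};

// Generated specs reference keys via a sel() helper; CI verifies keys exist.
function findUnknownKeys(source: string): string[] {
  const keyRe = /sel\(["']([^"']+)["']\)/g;
  return [...source.matchAll(keyRe)]
    .map((m) => m[1])
    .filter((key) => !(key in registry));
}

const spec = `await page.locator(sel("login.submit")).click();
await page.locator(sel("login.rememberMe")).check();`;

console.log(findUnknownKeys(spec)); // → ["login.rememberMe"]
```

The indirection also gives you one place to update when a `data-testid` changes, instead of a repo-wide search.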
Detection layer 3 — dry run and “locator resolution” probes
A short Playwright smoke test that only resolves locators, without full business logic:

```ts
import { test, expect } from "@playwright/test";
import { selectors } from "./selectors.generated";

test("generated selectors resolve", async ({ page }) => {
  await page.goto(process.env.BASE_URL + "/login");
  for (const s of Object.values(selectors)) {
    await expect(page.locator(s)).toHaveCount(1, { timeout: 5_000 });
  }
});
```

If the AI invented a selector, resolution timeouts isolate locator hallucinations before you merge a 400-line spec.
Detection layer 4 — runtime and observability
Use first-run evidence:
- Trace Viewer (Playwright debugging) — wrong navigation order shows up immediately.
- Network tab in trace — calls to nonexistent hosts or paths.
- Strict mode violations — multiple matches often mean a vague locator the model “guessed.”
Distinguish product regression vs hallucination: hallucination often fails on first action with “strict mode violation” or 0 matches; product bugs more often fail mid-flow after successful navigation.
Detection layer 5 — assertion strength scoring
Heuristics for reviewers or linters:
- Flag `expect(true)`, empty `test.skip`, or assertions only on URL `contains` without state-change checks.
- Prefer web-first assertions (`toBeVisible`, `toHaveURL`) tied to observable outcomes (Playwright vs Selenium vs Cypress).
Optional: a simple AST visitor that scores tests: +1 for role/testid locators, −1 for long XPath, −2 for string literals that do not appear in checked-in strings.json from i18n extract.
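A regex-based stand-in for that AST visitor is enough to demonstrate the scoring idea. The weights and patterns below are illustrative, not a standard; a real linter would walk the syntax tree instead of matching text.

```typescript
// Heuristic assertion-strength scorer (sketch): rewards role/testid locators,
// penalizes long XPath and trivially true assertions. Weights are arbitrary.
function scoreTest(source: string): number {
  let score = 0;
  score += (source.match(/getBy(Role|TestId)\(/g) ?? []).length; // +1 each
  score -= (source.match(/xpath=\/\/(\w+\/){3,}/g) ?? []).length; // -1 per long XPath
  if (/expect\(true\)/.test(source)) score -= 2; // trivially true assertion
  return score;
}

const weak = `expect(true).toBe(true);`;
const strong = `await expect(page.getByRole("heading")).toBeVisible();
await page.getByTestId("save").click();`;

console.log(scoreTest(weak), scoreTest(strong)); // → -2 2
```

A score threshold per file (or a trend report per PR) turns reviewer intuition about "weak tests" into a gate.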
Fixes — ground truth sources
Single sources of truth the model (or codegen tool) must read:
| Source | Use for |
|---|---|
| OpenAPI / GraphQL schema | Paths, methods, field names |
| `data-testid` map | Stable UI binding |
| Recorded HAR (sanitized) | Realistic status codes and payloads |
| Page objects in repo | Allowed locator surface |
RAG over your repo (chunked by route and component) reduces open-ended guessing—keep embeddings fresh on each release branch.
Fixes — structured generation
Instead of “output a full spec file,” split:
- Plan (JSON): steps, locators chosen from registry, API calls with operationId.
- Code generated only from validated JSON.
If step 1 references an unknown operationId, reject before step 2. That pattern is how many internal “AI SDET” tools avoid free-form invention.
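The two-step pattern can be enforced with a small validator between plan and codegen. Everything below is a sketch: the `PlanStep` shape, the known-set sources, and the error format are assumptions, not a specific tool's API.

```typescript
// Step 1 emits a JSON plan; reject it before codegen if it references
// locator keys or operationIds outside the known ground truth.
interface PlanStep {
  action: string;
  locatorKey?: string;
  operationId?: string;
}

// In practice these sets come from the selector registry and OpenAPI spec.
const knownLocators = new Set(["login.submit", "login.email"]);
const knownOperations = new Set(["loginUser", "getCurrentUser"]);

function validatePlan(steps: PlanStep[]): string[] {
  const errors: string[] = [];
  for (const step of steps) {
    if (step.locatorKey && !knownLocators.has(step.locatorKey))
      errors.push(`unknown locator key: ${step.locatorKey}`);
    if (step.operationId && !knownOperations.has(step.operationId))
      errors.push(`unknown operationId: ${step.operationId}`);
  }
  return errors;
}

const plan: PlanStep[] = [
  { action: "fill", locatorKey: "login.email" },
  { action: "call", operationId: "getUserProfile" }, // invented by the model
];

console.log(validatePlan(plan)); // → ["unknown operationId: getUserProfile"]
```

Rejecting at the plan stage is cheap: the model gets a short, structured error to retry against instead of a failed browser run.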
Fixes — human-in-the-loop with explicit gates
Minimal merge policy for AI-authored tests:
- Author (human or bot) opens a PR with an `ai-generated` label.
- Reviewer checks mapping to ticket + selector registry + OpenAPI.
- CI runs `tsc`, contract diff, the locator probe job, then the full suite.
- Owner merges; flaky ownership is filed if new instability appears.
For release-critical paths (payments, auth), ban fully automated merge of AI-only diffs—same discipline as modern test pyramid risk tiers.
Fixes — CI integration
Hook validation into the same pipeline as GitHub Actions + Playwright:
```yaml
- name: Validate generated tests
  run: |
    node scripts/validate-ai-tests.mjs
    npx playwright test tests/generated/probe.spec.ts
```

`validate-ai-tests.mjs` can parse imports, load the OpenAPI spec and selector map, and exit non-zero on drift.
Example: API hallucination caught by schema
Generated: `expect(json).toHaveProperty('userTier')`
OpenAPI: field is `subscription.tier`
Fix: the codegen step runs JSON Schema validation from the OpenAPI component schema against a recorded fixture; the mismatch fails before human review.
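A minimal version of that check flattens the component schema's property paths and flags `toHaveProperty` assertions on paths the schema does not define. The `Schema` type here is a simplified stand-in for an OpenAPI component, and the field names are illustrative.

```typescript
// Simplified OpenAPI-like component schema: nested property maps only.
type Schema = { properties?: Record<string, Schema> };

// Flatten to dotted paths: { subscription: { tier } } → "subscription.tier".
function flattenPaths(schema: Schema, prefix = ""): string[] {
  if (!schema.properties) return prefix ? [prefix] : [];
  return Object.entries(schema.properties).flatMap(([name, child]) =>
    flattenPaths(child, prefix ? `${prefix}.${name}` : name)
  );
}

const userSchema: Schema = {
  properties: {
    id: {},
    subscription: { properties: { tier: {} } },
  },
};

// Flag asserted property paths that the schema does not define.
function findUnknownAssertions(source: string, schema: Schema): string[] {
  const known = new Set(flattenPaths(schema));
  const re = /toHaveProperty\(["']([^"']+)["']\)/g;
  return [...source.matchAll(re)].map((m) => m[1]).filter((p) => !known.has(p));
}

const spec = `expect(json).toHaveProperty("subscription.tier");
expect(json).toHaveProperty("userTier");`;

console.log(findUnknownAssertions(spec, userSchema)); // → ["userTier"]
```

In a real pipeline the schema would be loaded from the checked-in OpenAPI document rather than hard-coded, so the check tracks the contract automatically.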
Example: locator hallucination caught by registry
Generated: `page.locator('#login-submit')`
Registry: `login.submit` → `getByTestId('auth-login-submit')`
Fix: a linter replaces free `page.locator` calls in `tests/ai/**` with registry indirection; unknown keys fail CI.
When AI is lower risk vs higher risk
Lower risk — boilerplate from stable templates, table-driven cases from CSV, refactor of existing tests with diff-only review.
Higher risk — new E2E from screenshot alone, cross-app flows, auth with MFA, financial calculations, anything under strict compliance or safety.
Summary
AI test hallucinations are structured errors: the model completes plausible automation that is not bound to your app’s ground truth. Defense is layered: better prompts and contracts, static checks, schema alignment, locator probes, runtime traces, and merge policy. Speed and safety both improve when you treat AI output as untrusted input until proven against DOM, API, and product facts.
Takeaways
- Separate hallucination from flakiness and staleness; fix each with different tools.
- Bind generation to OpenAPI, selector registries, and typed helpers—never prose-only for risky flows.
- Add CI validation scripts plus thin probe specs before full suites absorb bad code.
- Keep humans on the critical path for high-risk journeys; use AI to draft, not to certify.
For broader AI + QA strategy, revisit agentic AI testing and keep prompts as explicit as your best tickets (prompt engineering).