Build a LangChain Test Agent: Full Code Repo Walkthrough (2026)
Written by Kajal · Reviewed and published by Prasandeep

A LangChain-powered test agent can sit inside your QA workflow like a smart, junior SDET: it reads failing tests, understands your code and logs, and proposes fixes or new tests—all driven by an LLM plus tools you control. LangChain agents combine a chat model with callable tools so the model decides what to invoke, with what arguments, and in what order.
This guide is a from-zero-to-production walkthrough of a Python repository you can clone, run locally, and later hook into CI. You will design the architecture, implement tools (pytest, file I/O, patches), wire LangChain’s tool-calling agent, add evaluation hooks, and harden for real teams.
For the bigger picture on agentic QA, see Agentic AI Testing for Software Test Engineers. For prompt discipline and review gates, see Prompt Engineering for Test Automation and AI Test Hallucinations: Detection and Fixes. For flaky-suite triage before you automate fixes, see Fix Flaky Tests: 2026 Masterclass. For CI placement, see GitHub Actions + Playwright CI/CD pipeline.
What is a LangChain test agent?
In LangChain, an agent is an LLM orchestrated with tools—functions the model can call with structured arguments. A test agent specializes in QA workflows:
- Reads failing test logs or CI output
- Locates relevant test files and source code
- Diagnoses likely root causes (wrong assertion, API drift, timing, bad data)
- Proposes patches and new tests
- Optionally re-runs tests and summarizes what changed for human review
This is the same pattern described in recent agentic test pipelines that use LangChain (sometimes with CrewAI or custom controllers) to diagnose and fix flaky or broken tests—always with humans owning merge and release decisions.
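To make the "model decides, tools execute" loop concrete, here is a framework-free sketch. It is not LangChain's implementation: `ToolCall`, `scripted_model`, and the two toy tools are hypothetical stand-ins, and the scripted decisions take the place of a real LLM.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ToolCall:
    name: str
    args: dict


def read_log(path: str) -> str:
    # Toy tool: a real one would read CI output from disk.
    return "FAILED tests/test_calculator.py::test_add_zero - assert 5 == 4"


def locate_test(node_id: str) -> str:
    # Toy tool: maps a pytest node ID to the file that contains it.
    return node_id.split("::")[0]


TOOLS: dict[str, Callable[..., str]] = {
    "read_log": read_log,
    "locate_test": locate_test,
}


def scripted_model(observations: list[str]) -> Optional[ToolCall]:
    """Hypothetical stand-in for the LLM: chooses the next tool from what it has seen."""
    if not observations:
        return ToolCall("read_log", {"path": "tmp/last_run.txt"})
    if len(observations) == 1:
        node_id = observations[0].split()[1]
        return ToolCall("locate_test", {"node_id": node_id})
    return None  # nothing left to do: stop and answer


def run_agent(max_steps: int = 5) -> list[str]:
    observations: list[str] = []
    for _ in range(max_steps):  # hard cap so a confused model cannot loop forever
        call = scripted_model(observations)
        if call is None:
            break
        observations.append(TOOLS[call.name](**call.args))
    return observations


trajectory = run_agent()
```

The point of the sketch is the division of labor: the model only names a tool and its arguments, while your code owns execution and the step limit.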
Repository structure
A layout that stays maintainable as you add tools and evals:

    langchain-test-agent/
    ├── agent/
    │   ├── __init__.py
    │   ├── config.py
    │   ├── tools.py
    │   ├── prompts.py
    │   └── agent.py
    ├── tests/
    │   ├── sample_project/
    │   │   ├── src/
    │   │   │   └── calculator.py
    │   │   └── tests/
    │   │       └── test_calculator.py
    │   ├── unit/
    │   │   └── test_agent_core.py
    │   └── integration/
    │       └── test_agent_on_sample_project.py
    ├── scripts/
    │   └── run_agent_cli.py
    ├── tmp/
    ├── requirements.txt
    ├── pyproject.toml
    └── README.md

- agent/: core logic (tools, prompts, executors)
- tests/: unit tests for the agent plus a sample project the agent practices on
- scripts/: CLI entry point for engineers
- tmp/: captured pytest output for analyze-failure runs
Step 1 — Setting up environment and dependencies
Install core packages:

    pip install "langchain>=0.2.0,<1.0" "langchain-openai" "langchain-community" \
        pytest rich python-dotenv

| Package | Role |
|---|---|
| langchain, langchain-openai | Agents, tool-calling, model clients |
| pytest | Test runner exposed as a tool |
| rich | Readable CLI output |
| python-dotenv | Load OPENAI_API_KEY (or other provider keys) from .env |
Configure at least one LLM provider (OpenAI, Anthropic, Groq, etc.) and point LangChain’s ChatOpenAI (or equivalent) at it. Keep keys out of prompts and logs—load from environment only.
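A minimal sketch of the "environment only" rule, using only the standard library. The helper names `require_api_key` and `mask` are illustrative, not part of LangChain or the repo: the first fails fast when the key is missing, the second produces a log-safe form of it.

```python
import os


def require_api_key(name: str = "OPENAI_API_KEY") -> str:
    """Return the key from the environment or fail fast with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; export it or add it to .env")
    return value


def mask(secret: str, visible: int = 4) -> str:
    """Keep only the last few characters so logs never contain the full key."""
    if len(secret) <= visible:
        return "*" * len(secret)
    return "*" * (len(secret) - visible) + secret[-visible:]
```

Calling `require_api_key()` once at startup surfaces a missing key before the agent makes its first model call, and `mask(...)` is the only form of the key that should ever reach a log line.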
Create .env at the repo root (never commit it):

    OPENAI_API_KEY=sk-...

Step 2 — Defining v1 use cases
Scope the first version so behavior stays testable:
- Analyze a failing test run. Input: pytest output. Output: root-cause hypothesis, suggested fix, affected files.
- Generate new tests from requirements. Input: short spec or story. Output: pytest modules in the right tests/ path.
- Refactor brittle tests (optional v1.1). Input: flaky test file. Output: improved fixtures, waits, or data setup.
Defer auto-opening PRs and full CI integration until tools, prompts, and evals are stable—then add post-failure hooks similar to other agentic QA pipelines.
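For the first use case, it helps to extract the failing test IDs before the model sees anything, so "affected files" starts from data rather than a guess. The sketch below assumes pytest's short test summary section is present in the captured output. `Failure`, `parse_failures`, and the `SAMPLE` text are illustrative, and node IDs containing spaces are not handled.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Failure:
    file: str
    test: str
    message: str


# Matches summary lines such as:
# FAILED tests/test_calculator.py::test_add_zero - assert 5 == 4
_FAILED_LINE = re.compile(r"^FAILED (?P<node>\S+)(?: - (?P<msg>.*))?$", re.MULTILINE)


def parse_failures(pytest_output: str) -> list[Failure]:
    """Collect one Failure per FAILED line in the short test summary."""
    failures = []
    for match in _FAILED_LINE.finditer(pytest_output):
        file, _, test = match.group("node").partition("::")
        failures.append(Failure(file, test, match.group("msg") or ""))
    return failures


# Illustrative output, trimmed to the lines the parser cares about.
SAMPLE = """\
..F                                                                      [100%]
=========================== short test summary info ============================
FAILED tests/test_calculator.py::test_add_zero - assert 5 == 4
1 failed, 2 passed in 0.02s
"""

failures = parse_failures(SAMPLE)
```

An empty result is itself a useful signal: the run either passed or crashed before collection, and the agent should say so instead of inventing a root cause.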
Step 3 — Modeling the agent’s tools
Tools are plain functions with docstrings the model uses to choose arguments. Create agent/tools.py:

    # agent/tools.py
    import shlex
    import subprocess
    import sys
    from pathlib import Path

    from langchain_core.tools import tool

    # Repo root: agent/tools.py -> parents[1]
    PROJECT_ROOT = Path(__file__).resolve().parents[1]
    SAMPLE_PROJECT = PROJECT_ROOT / "tests" / "sample_project"


    @tool("run_pytest", return_direct=False)
    def run_pytest(args: str = "") -> str:
        """
        Run pytest on the sample project.
        args: Optional extra CLI args, e.g. `-q tests/test_calculator.py::test_add_zero`.
        Returns captured stdout and stderr combined.
        """
        # `python -m pytest` adds the working directory to sys.path, so the
        # sample tests can `from src.calculator import add`.
        cmd = [sys.executable, "-m", "pytest", *shlex.split(args or "")]
        proc = subprocess.run(
            cmd,
            cwd=SAMPLE_PROJECT,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # merge stderr into stdout
            text=True,
            check=False,  # failing tests are expected input, not an error
        )
        return proc.stdout


    @tool("read_file", return_direct=False)
    def read_file(path: str) -> str:
        """
        Read a text file relative to the repository root.
        Returns file contents or a not-found message.
        """
        file_path = (PROJECT_ROOT / path).resolve()
        # Reject anything that resolves outside the repo (e.g. via "..").
        if PROJECT_ROOT not in file_path.parents and file_path != PROJECT_ROOT:
            return f"Path not allowed: {path}"
        if not file_path.is_file():
            return f"File not found: {path}"
        return file_path.read_text(encoding="utf-8")


    @tool("write_patch", return_direct=False)
    def write_patch(path: str, new_content: str) -> str:
        """
        Overwrite a file at path with new_content (relative to repo root).
        Use for fixes or generated tests. Returns a confirmation message.
        """
        file_path = (PROJECT_ROOT / path).resolve()
        # Only files strictly inside the repo root may be written.
        if PROJECT_ROOT not in file_path.parents:
            return f"Path not allowed: {path}"
        file_path.parent.mkdir(parents=True, exist_ok=True)
        file_path.write_text(new_content, encoding="utf-8")
        return f"Wrote updated content to {path}"

Extensions for v2: search_code(pattern) via ripgrep, git_diff(), format_code(path) with Ruff or Black. The path guard on read_file and write_patch is a minimal sandbox; tighten allowlists (e.g. only under tests/) before production.
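One way to tighten the write guard is an explicit allowlist of subdirectories. This is a sketch under assumptions: `is_write_allowed` and its `allowed_dirs` default are hypothetical, and the repo root is passed in as a parameter so the check can be exercised against a temporary directory.

```python
from pathlib import Path


def is_write_allowed(
    repo_root: Path,
    relative_path: str,
    allowed_dirs: tuple[str, ...] = ("tests",),
) -> bool:
    """Return True only if the resolved target sits under an allowed subdirectory."""
    root = repo_root.resolve()
    target = (root / relative_path).resolve()  # collapses any ".." segments
    for name in allowed_dirs:
        allowed = (root / name).resolve()
        if allowed in target.parents:
            return True
    return False
```

Resolving before comparing is the important step: a path such as `tests/../agent/tools.py` begins with an allowed prefix but resolves outside it, so a plain string prefix check would let it through.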
Step 4 — Designing system and task prompts
Explicit instructions reduce hallucinated tool use. Create agent/prompts.py:

    # agent/prompts.py
    from langchain_core.prompts import ChatPromptTemplate

    BASE_SYSTEM_PROMPT = """You are a senior software engineer acting as a Test Agent.
    You help with:
    - Investigating failing tests
    - Explaining root causes
    - Proposing minimal, high-quality fixes
    - Writing new pytest tests
    Constraints:
    - Prefer small, targeted code changes over massive refactors.
    - Preserve existing behavior unless clearly wrong.
    - Use idiomatic Python, pytest, and clean naming.
    - When unsure, explain trade-offs; do not invent APIs that do not exist.
    """

    FAILURE_ANALYSIS_PROMPT = ChatPromptTemplate.from_messages([
        ("system", BASE_SYSTEM_PROMPT),
        ("human", """You are given pytest output from a failing run and tools:
    - run_pytest
    - read_file
    - write_patch
    Goal:
    1. Identify the root cause of the failure.
    2. Read relevant files.
    3. Propose a minimal fix and apply it via write_patch when appropriate.
    4. Re-run tests with run_pytest and summarize the result.
    Pytest output:
    ```text
    {pytest_output}
    ```
    Think step-by-step, call tools as needed, and end with a short summary of what changed."""),
        # Required by create_tool_calling_agent: carries earlier tool calls and results.
        ("placeholder", "{agent_scratchpad}"),
    ])

    TEST_GENERATION_PROMPT = ChatPromptTemplate.from_messages([
        ("system", BASE_SYSTEM_PROMPT),
        ("human", """Generate pytest tests for the following requirement.
    Requirement:
    {requirement}
    Project structure:
    {project_tree}
    Write tests into the appropriate tests/ path using write_patch. Cover happy paths and key edge cases. Then run_pytest to validate."""),
        # Required by create_tool_calling_agent: carries earlier tool calls and results.
        ("placeholder", "{agent_scratchpad}"),
    ])

Separate prompts per task keep trajectories easier to evaluate than one mega-prompt.
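Step 5 — Building the agent with LangChain
Wire LLM, tools, and prompts. Create `agent/config.py`:
```python
# agent/config.py
import os

from dotenv import load_dotenv
from langchain_openai import ChatOpenAI

# Load variables from a .env file if one is found; without this call
# python-dotenv is installed but never used.
load_dotenv()


def get_llm() -> ChatOpenAI:
    return ChatOpenAI(
        model="gpt-4o",
        temperature=0.1,
        api_key=os.environ.get("OPENAI_API_KEY"),
    )
```
Create agent/agent.py: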

    # agent/agent.py
    from typing import Any

    from langchain.agents import AgentExecutor, create_tool_calling_agent

    from .config import get_llm
    from .prompts import FAILURE_ANALYSIS_PROMPT, TEST_GENERATION_PROMPT
    from .tools import run_pytest, read_file, write_patch

    TOOLS = [run_pytest, read_file, write_patch]


    def build_failure_analysis_agent() -> AgentExecutor:
        llm = get_llm()
        agent_runnable = create_tool_calling_agent(
            llm=llm,
            tools=TOOLS,
            prompt=FAILURE_ANALYSIS_PROMPT,
        )
        return AgentExecutor(agent=agent_runnable, tools=TOOLS, verbose=True)


    def build_test_generation_agent() -> AgentExecutor:
        llm = get_llm()
        agent_runnable = create_tool_calling_agent(
            llm=llm,
            tools=TOOLS,
            prompt=TEST_GENERATION_PROMPT,
        )
        return AgentExecutor(agent=agent_runnable, tools=TOOLS, verbose=True)


    def run_failure_analysis(pytest_output: str) -> dict[str, Any]:
        agent = build_failure_analysis_agent()
        return agent.invoke({"pytest_output": pytest_output})


    def run_test_generation(requirement: str, project_tree: str) -> dict[str, Any]:
        agent = build_test_generation_agent()
        return agent.invoke({"requirement": requirement, "project_tree": project_tree})

This follows LangChain’s tool-calling agent pattern: the model decides when to call run_pytest, read_file, or write_patch. Pin LangChain versions in requirements.txt and re-run integration tests when upgrading, because agent APIs evolve. The imports here match the 0.2/0.3 release line, and newer major versions may relocate them.
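Pinning in requirements.txt can be paired with a fail-fast check at startup. The sketch below uses only the standard library; the `SUPPORTED` range and the helper names are assumptions to adjust to whatever you actually pin.

```python
from importlib import metadata
from typing import Optional

# Assumed supported range: >= 0.2 and < 1.0. Adjust to match your pins.
SUPPORTED = ((0, 2), (1, 0))


def parse_major_minor(version: str) -> tuple[int, int]:
    """Turn '0.3.27' into (0, 3); later components are ignored."""
    major, minor, *_ = version.split(".")
    return int(major), int(minor)


def is_supported(version: str) -> bool:
    lower, upper = SUPPORTED
    return lower <= parse_major_minor(version) < upper


def installed_version(package: str = "langchain") -> Optional[str]:
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None
```

A CLI entry point could call `installed_version()` and exit with a readable message when `is_supported(...)` is false, which is easier to act on than an ImportError deep inside an agent run.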
Step 6 — Adding a CLI for local use
Engineers should run the agent from the terminal. Create scripts/run_agent_cli.py:

    # scripts/run_agent_cli.py
    import argparse
    import subprocess
    import sys
    from pathlib import Path

    from rich.console import Console

    # Allow `python scripts/run_agent_cli.py` from repo root
    REPO_ROOT = Path(__file__).resolve().parents[1]
    sys.path.insert(0, str(REPO_ROOT))

    from agent.agent import run_failure_analysis, run_test_generation  # noqa: E402

    console = Console()
    SAMPLE_PROJECT = REPO_ROOT / "tests" / "sample_project"


    def get_project_tree() -> str:
        try:
            proc = subprocess.run(
                ["tree", "-L", "3", "."],
                cwd=SAMPLE_PROJECT,
                capture_output=True,
                text=True,
                check=False,
            )
            return proc.stdout or proc.stderr or "(tree produced no output)"
        except FileNotFoundError:
            # Raised when the `tree` binary is not on PATH.
            return "Install `tree` or paste project layout manually."


    def main() -> None:
        parser = argparse.ArgumentParser(description="LangChain Test Agent CLI")
        subparsers = parser.add_subparsers(dest="command", required=True)
        p_fail = subparsers.add_parser("analyze-failure")
        p_fail.add_argument("--pytest-output-file", required=True)
        p_gen = subparsers.add_parser("generate-tests")
        p_gen.add_argument("--requirement-file", required=True)
        args = parser.parse_args()

        if args.command == "analyze-failure":
            output_path = REPO_ROOT / args.pytest_output_file
            pytest_output = output_path.read_text(encoding="utf-8")
            console.rule("[bold cyan]Analyzing failure with Test Agent")
            result = run_failure_analysis(pytest_output)
            console.print(result.get("output", result))
        elif args.command == "generate-tests":
            req_path = REPO_ROOT / args.requirement_file
            requirement = req_path.read_text(encoding="utf-8")
            tree = get_project_tree()
            console.rule("[bold cyan]Generating tests with Test Agent")
            result = run_test_generation(requirement, tree)
            console.print(result.get("output", result))


    if __name__ == "__main__":
        main()

Example usage from the repo root:

    python scripts/run_agent_cli.py analyze-failure --pytest-output-file tmp/last_run.txt
    python scripts/run_agent_cli.py generate-tests --requirement-file docs/feature_x.md

Step 7 — Sample project and intentional failure
Under tests/sample_project, add a minimal app and a failing test:

    # tests/sample_project/src/calculator.py
    def add(a: int, b: int) -> int:
        return a + b

    # tests/sample_project/tests/test_calculator.py
    from src.calculator import add


    def test_add_positive_numbers():
        assert add(2, 3) == 5


    def test_add_negative_numbers():
        assert add(-2, -3) == -5


    def test_add_zero():
        # Intentional bug: wrong expectation
        assert add(0, 5) == 4

Capture output for the agent. Running pytest as python -m pytest adds the current directory to sys.path, which is what lets the test import src.calculator:

    mkdir -p tmp
    cd tests/sample_project
    python -m pytest -q > ../../tmp/last_run.txt 2>&1
    cd ../..
    python scripts/run_agent_cli.py analyze-failure --pytest-output-file tmp/last_run.txt

Step 8 — Walking through an end-to-end failure fix
When you run analyze-failure, a typical trajectory is:
- Parse pytest output (failure in test_add_zero).
- read_file on tests/sample_project/tests/test_calculator.py and optionally tests/sample_project/src/calculator.py (paths are relative to the repo root).
- Conclude the expectation add(0, 5) == 4 is wrong.
- write_patch to set the assertion to == 5.
- run_pytest to verify green.
- Summarize the diff and outcome.
In your team’s workflow, treat that summary as input to code review, not an auto-merge. Show before/after diffs and tool-call logs in the blog or internal runbooks—the same evidence you would want for AI hallucination guardrails.
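Because write_patch overwrites whole files, a reviewer-friendly diff has to be produced separately. A minimal sketch with the standard library's difflib follows; `review_diff` is a hypothetical helper, and the before/after strings are trimmed to the one test that changes.

```python
import difflib


def review_diff(before: str, after: str, path: str) -> str:
    """Render a unified diff suitable for a PR comment or runbook."""
    lines = difflib.unified_diff(
        before.splitlines(keepends=True),
        after.splitlines(keepends=True),
        fromfile=f"a/{path}",
        tofile=f"b/{path}",
    )
    return "".join(lines)


BEFORE = "def test_add_zero():\n    assert add(0, 5) == 4\n"
AFTER = "def test_add_zero():\n    assert add(0, 5) == 5\n"

diff_text = review_diff(BEFORE, AFTER, "tests/sample_project/tests/test_calculator.py")
```

Capturing the file contents before the agent runs and diffing afterwards keeps the evidence independent of whatever the model claims it changed.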
Step 9 — Evaluating the test agent itself
LangChain’s testing guidance applies to your agent too: unit tests, integration tests, and trajectory evals.
Unit tests
In tests/unit/test_agent_core.py:
- Mock the LLM with LangChain mock chat models where possible.
- Assert tools reject paths outside the repo root.
- Assert graceful handling of missing files or empty pytest output.
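The path and missing-file checks can be tested without any model call. In this sketch, `safe_read` is a stand-in that mirrors the read_file guard but takes the root as a parameter so tests can point it at a temporary directory; the real tool would need a similar seam (or a patched PROJECT_ROOT) to be tested the same way.

```python
import tempfile
from pathlib import Path


def safe_read(root: Path, relative_path: str) -> str:
    """Stand-in for read_file with an injectable root directory."""
    root = root.resolve()
    target = (root / relative_path).resolve()
    if root not in target.parents:
        return f"Path not allowed: {relative_path}"
    if not target.is_file():
        return f"File not found: {relative_path}"
    return target.read_text(encoding="utf-8")


def test_rejects_path_outside_root() -> None:
    with tempfile.TemporaryDirectory() as tmp:
        assert safe_read(Path(tmp), "../secrets.txt").startswith("Path not allowed")


def test_reports_missing_file() -> None:
    with tempfile.TemporaryDirectory() as tmp:
        assert safe_read(Path(tmp), "nope.py") == "File not found: nope.py"


def test_reads_existing_file() -> None:
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "ok.txt").write_text("hello", encoding="utf-8")
        assert safe_read(Path(tmp), "ok.txt") == "hello"
```

The functions follow pytest's `test_*` naming, so they are collected automatically when placed in tests/unit/test_agent_core.py.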
Integration tests
In tests/integration/test_agent_on_sample_project.py, run the real agent against the sample project with a real API key—often in a nightly job, not every PR, to control cost.
Trajectory and eval tests
Record golden runs and compare new agent versions:
- Does it still identify the correct failing test?
- Does it propose a minimal fix?
- Does it avoid editing production code when only the test is wrong?
Use deterministic checks plus optional LLM-as-judge evaluators; see LangChain testing docs for patterns.
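The deterministic part of a trajectory eval can be a handful of rules over the recorded tool calls. The sketch below assumes you log each call as a tool name plus an optional path. `Step`, `check_trajectory`, and the specific rules are illustrative rather than a LangChain API.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Step:
    tool: str
    path: str = ""


def check_trajectory(steps: list[Step]) -> list[str]:
    """Return rule violations; an empty list means the run passes."""
    problems = []
    seen_reads = set()
    for step in steps:
        if step.tool == "read_file":
            seen_reads.add(step.path)
        elif step.tool == "write_patch":
            if step.path not in seen_reads:
                problems.append(f"wrote {step.path} without reading it first")
            # Assumed convention: only files named test_*.py count as tests.
            if not Path(step.path).name.startswith("test_"):
                problems.append(f"edited non-test file {step.path}")
    if not steps or steps[-1].tool != "run_pytest":
        problems.append("did not end with run_pytest")
    return problems


TEST_FILE = "tests/sample_project/tests/test_calculator.py"
GOLDEN = [Step("read_file", TEST_FILE), Step("write_patch", TEST_FILE), Step("run_pytest")]
```

Rules like these catch regressions cheaply on every change, leaving LLM-as-judge scoring for the qualitative questions such as whether the fix is minimal.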
Step 10 — Hardening for production
| Area | Practice |
|---|---|
| Guardrails | Allowlist paths for write_patch; require human review before commit |
| Cost | Smaller models for log parsing; larger models only for patch generation; cache repeated failures |
| Observability | Structured logs of tool name, args (redacted), and outcomes; track accept vs reject rate |
| Security | Sandbox run_pytest and file writes; never pass secrets into prompts |
| Trust | Ban auto-merge on high-risk suites; align with risk-based testing tiers |
These mirror recommendations for production LangGraph and LangChain deployments.
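As one example from the table, structured tool logs with redacted arguments might look like the sketch below. The secret patterns are assumptions (OpenAI-style `sk-` keys and `token=`/`password=` pairs) and should be extended for your stack; `redact` and `tool_log_record` are hypothetical helper names.

```python
import json
import re

# Assumed patterns; extend for the providers and CLIs you actually use.
_SECRET_PATTERNS = [
    re.compile(r"\bsk-[A-Za-z0-9_-]{8,}"),
    re.compile(r"(?i)(api[_-]?key|token|password)=\S+"),
]


def redact(text: str) -> str:
    """Replace anything that looks like a secret with a fixed marker."""
    for pattern in _SECRET_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text


def tool_log_record(tool: str, args: dict, outcome: str) -> str:
    """Serialize one tool call as a JSON line with argument values masked."""
    safe_args = {key: redact(str(value)) for key, value in args.items()}
    return json.dumps({"tool": tool, "args": safe_args, "outcome": outcome}, sort_keys=True)
```

One JSON object per line keeps the log easy to aggregate, which is what makes an accept-versus-reject rate straightforward to compute later.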
Extending the agent (v2 roadmap)
Ideas that extend the same repo without rewriting the core:
- Multi-agent split — “Failure detector”, “Fix proposer”, and “Validator” as separate agents under a controller (similar to LangChain + CrewAI pipelines for flaky tests).
- Docs-to-tests — Parse Markdown or OpenAPI; generate API or BDD-style tests.
- Synthetic user testing — Exercise prompts and tools with simulated inputs before CI exposure.
- CI/CD hooks — GitHub Actions step on test failure: run agent, open draft PR with patch suggestions.
Each item is a follow-on post or README section once v1 evals are green.
README highlights
Document in README.md:
- What the test agent does and what it does not do (no unsupervised release)
- Setup: Python version, pip install, .env keys
- CLI examples and sample failure walkthrough
- Evaluation and safety notes
- Roadmap: PR integration, multi-agent, Playwright or Java test runners as additional tools
Conclusion
A LangChain test agent will not replace human QA engineers. It can automate tedious work: reading logs, locating files, suggesting patches, and scaffolding tests—with re-runs and evals closing the loop. The repository structure and patterns above give you a concrete foundation to diagnose failures, generate tests from requirements, and evolve toward richer agentic QA on the same LangChain abstractions.
Use it as an assistant with clear boundaries: humans own risk, data, and merge decisions—the same bar set in Agentic AI Testing for Software Test Engineers.