Build a LangChain Test Agent: Full Code Repo Walkthrough (2026)

Written by Kajal · Reviewed and published by Prasandeep

May 20, 2026 · 10 min read · AI

A LangChain-powered test agent can sit inside your QA workflow like a smart, junior SDET: it reads failing tests, understands your code and logs, and proposes fixes or new tests—all driven by an LLM plus tools you control. LangChain agents combine a chat model with callable tools so the model decides what to invoke, with what arguments, and in what order.

This guide is a from-zero-to-production walkthrough of a Python repository you can clone, run locally, and later hook into CI. You will design the architecture, implement tools (pytest, file I/O, patches), wire LangChain’s tool-calling agent, add evaluation hooks, and harden for real teams.

For the bigger picture on agentic QA, see Agentic AI Testing for Software Test Engineers. For prompt discipline and review gates, see Prompt Engineering for Test Automation and AI Test Hallucinations: Detection and Fixes. For flaky-suite triage before you automate fixes, see Fix Flaky Tests: 2026 Masterclass. For CI placement, see GitHub Actions + Playwright CI/CD pipeline.

What is a LangChain test agent?

In LangChain, an agent is an LLM orchestrated with tools—functions the model can call with structured arguments. A test agent specializes in QA workflows:

Reads failing test logs or CI output
Locates relevant test files and source code
Diagnoses likely root causes (wrong assertion, API drift, timing, bad data)
Proposes patches and new tests
Optionally re-runs tests and summarizes what changed for human review

This is the same pattern described in recent agentic test pipelines that use LangChain (sometimes with CrewAI or custom controllers) to diagnose and fix flaky or broken tests—always with humans owning merge and release decisions.
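To make the "model decides what to invoke, and in what order" idea concrete, here is a minimal sketch of the loop an agent executor runs, written without any LLM. The `scripted_model` function and the two fake tools are hypothetical stand-ins, not LangChain APIs; a real agent would get each decision from the chat model instead of a fixed script.

```python
from typing import Callable


# Hypothetical stand-ins for real tools; they return canned strings.
def fake_run_pytest(arg: str) -> str:
    return "1 failed, 2 passed"


def fake_read_file(arg: str) -> str:
    return f"<contents of {arg}>"


TOOLS: dict[str, Callable[[str], str]] = {
    "run_pytest": fake_run_pytest,
    "read_file": fake_read_file,
}


def scripted_model(history: list[dict]) -> dict:
    """Stand-in for the LLM: choose the next step from how many tool results exist."""
    script = [
        {"tool": "run_pytest", "arg": "-q"},
        {"tool": "read_file", "arg": "tests/test_calculator.py"},
        {"final": "test_add_zero asserts the wrong value"},
    ]
    return script[len(history)]


def run_agent_loop(max_steps: int = 5) -> tuple[str, list[dict]]:
    """Ask the model for a decision, run the chosen tool, feed the result back."""
    history: list[dict] = []
    for _ in range(max_steps):
        decision = scripted_model(history)
        if "final" in decision:
            return decision["final"], history
        result = TOOLS[decision["tool"]](decision["arg"])
        history.append({**decision, "result": result})
    return "step limit reached", history
```

The step limit matters in practice: it is the simplest guard against an agent that keeps calling tools without converging.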

Repository structure

A layout that stays maintainable as you add tools and evals:

Text

langchain-test-agent/
├── agent/
│   ├── __init__.py
│   ├── config.py
│   ├── tools.py
│   ├── prompts.py
│   └── agent.py
├── tests/
│   ├── sample_project/
│   │   ├── src/
│   │   │   └── calculator.py
│   │   └── tests/
│   │       └── test_calculator.py
│   ├── unit/
│   │   └── test_agent_core.py
│   └── integration/
│       └── test_agent_on_sample_project.py
├── scripts/
│   └── run_agent_cli.py
├── tmp/
├── requirements.txt
├── pyproject.toml
└── README.md

agent/ — core logic (tools, prompts, executors)
tests/ — unit tests for the agent plus a sample project the agent practices on
scripts/ — CLI entry point for engineers
tmp/ — captured pytest output for analyze-failure runs

Step 1 — Setting up environment and dependencies

Install core packages:

Bash

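# The agent code below imports AgentExecutor and create_tool_calling_agent from
# langchain.agents; that import path is assumed to need a pre-1.0 release.
pip install "langchain>=0.2.0,<1.0" "langchain-openai" "langchain-community" \
            pytest rich python-dotenv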

Package	Role
`langchain`, `langchain-openai`	Agents, tool-calling, model clients
`pytest`	Test runner exposed as a tool
`rich`	Readable CLI output
`python-dotenv`	Load `OPENAI_API_KEY` (or other provider keys) from `.env`

Configure at least one LLM provider (OpenAI, Anthropic, Groq, etc.) and point LangChain’s ChatOpenAI (or equivalent) at it. Keep keys out of prompts and logs—load from environment only.

Create .env at the repo root (never commit it):

Bash

OPENAI_API_KEY=sk-...

Step 2 — Defining v1 use cases

Scope the first version so behavior stays testable:

Analyze a failing test run — Input: pytest output. Output: root-cause hypothesis, suggested fix, affected files.
Generate new tests from requirements — Input: short spec or story. Output: pytest modules in the right tests/ path.
Refactor brittle tests (optional v1.1) — Input: flaky test file. Output: improved fixtures, waits, or data setup.

Defer auto-opening PRs and full CI integration until tools, prompts, and evals are stable—then add post-failure hooks similar to other agentic QA pipelines.
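For the first use case, the "affected files" part of the output can be seeded deterministically before the LLM sees anything. The sketch below assumes the captured output contains pytest's short test summary lines in the form `FAILED path::test - message`; the function name `extract_failures` is hypothetical, and parametrized IDs containing spaces are not handled.

```python
import re

# Matches short-summary lines such as:
# FAILED tests/test_calculator.py::test_add_zero - assert 5 == 4
SUMMARY_LINE = re.compile(r"^(FAILED|ERROR) (\S+)(?: - (.*))?$", re.MULTILINE)


def extract_failures(pytest_output: str) -> list[dict[str, str]]:
    """Return one dict per failing node found in the short test summary."""
    failures = []
    for status, node_id, message in SUMMARY_LINE.findall(pytest_output):
        # A node ID looks like "path/to/test_file.py::test_name"
        file_path, _, test_name = node_id.partition("::")
        failures.append(
            {
                "status": status,
                "file": file_path,
                "test": test_name,
                "message": message,
            }
        )
    return failures
```

A list like this is also a convenient deterministic check in evals: the agent's claimed failing test should appear in it.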

Step 3 — Modeling the agent’s tools

Tools are plain functions with docstrings the model uses to choose arguments. Create agent/tools.py:

Python

# agent/tools.py
import shlex
import subprocess
import sys
from pathlib import Path

from langchain_core.tools import tool

# Repo root: agent/tools.py -> parents[1]
PROJECT_ROOT = Path(__file__).resolve().parents[1]
SAMPLE_PROJECT = PROJECT_ROOT / "tests" / "sample_project"


@tool("run_pytest", return_direct=False)
def run_pytest(args: str = "") -> str:
    """
    Run pytest on the sample project.

    args: Optional extra CLI args, e.g. `-q tests/test_calculator.py::test_add_zero`.
    Returns captured stdout and stderr combined.
    """
    # `python -m pytest` adds the working directory to sys.path, so
    # `from src.calculator import add` resolves; shlex keeps quoted args intact.
    cmd = [sys.executable, "-m", "pytest", *shlex.split(args or "")]
    try:
        proc = subprocess.run(
            cmd,
            cwd=SAMPLE_PROJECT,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True,
            timeout=300,  # assumed cap so a hung test cannot block the agent
        )
    except subprocess.TimeoutExpired:
        return "pytest timed out after 300 seconds"
    return proc.stdout


@tool("read_file", return_direct=False)
def read_file(path: str) -> str:
    """
    Read a text file relative to the repository root.
    Returns file contents or a not-found message.
    """
    file_path = (PROJECT_ROOT / path).resolve()
    if PROJECT_ROOT not in file_path.parents:
        return f"Path not allowed: {path}"
    # is_file() also rejects directories, which read_text() cannot open
    if not file_path.is_file():
        return f"File not found: {path}"
    return file_path.read_text(encoding="utf-8")


@tool("write_patch", return_direct=False)
def write_patch(path: str, new_content: str) -> str:
    """
    Overwrite a file at path with new_content (relative to repo root).
    Use for fixes or generated tests. Returns a confirmation message.
    """
    file_path = (PROJECT_ROOT / path).resolve()
    if PROJECT_ROOT not in file_path.parents:
        return f"Path not allowed: {path}"
    file_path.parent.mkdir(parents=True, exist_ok=True)
    file_path.write_text(new_content, encoding="utf-8")
    return f"Wrote updated content to {path}"

Extensions for v2: search_code(pattern) via ripgrep, git_diff(), format_code(path) with Ruff or Black. The path guard on read_file / write_patch is a minimal sandbox—tighten allowlists (e.g. only under tests/) before production.
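One way to tighten the guard is an explicit allowlist of writable subtrees. This is a minimal sketch, assuming writes should be limited to directories named in `allowed_dirs`; the helper `is_write_allowed` is hypothetical and not part of the repo above.

```python
from pathlib import Path


def is_write_allowed(
    repo_root: Path,
    relative_path: str,
    allowed_dirs: tuple[str, ...] = ("tests",),
) -> bool:
    """Return True only if relative_path resolves inside an allowed subtree."""
    root = repo_root.resolve()
    # resolve() collapses ".." segments, so traversal attempts are normalized first
    target = (root / relative_path).resolve()
    for name in allowed_dirs:
        allowed = (root / name).resolve()
        # The allowed directory must be a strict ancestor of the target file
        if allowed in target.parents:
            return True
    return False
```

write_patch could call a check like this before touching the disk and return the same "Path not allowed" message when it fails.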

Step 4 — Designing system and task prompts

Explicit instructions reduce hallucinated tool use. Create agent/prompts.py:

Python

# agent/prompts.py
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder

BASE_SYSTEM_PROMPT = """You are a senior software engineer acting as a Test Agent.
You help with:
- Investigating failing tests
- Explaining root causes
- Proposing minimal, high-quality fixes
- Writing new pytest tests

Constraints:
- Prefer small, targeted code changes over massive refactors.
- Preserve existing behavior unless clearly wrong.
- Use idiomatic Python, pytest, and clean naming.
- When unsure, explain trade-offs; do not invent APIs that do not exist.
"""

FAILURE_ANALYSIS_PROMPT = ChatPromptTemplate.from_messages([
    ("system", BASE_SYSTEM_PROMPT),
    ("human", """You are given pytest output from a failing run and tools:
- run_pytest
- read_file
- write_patch

Goal:
1. Identify the root cause of the failure.
2. Read relevant files.
3. Propose a minimal fix and apply it via write_patch when appropriate.
4. Re-run tests with run_pytest and summarize the result.

Pytest output:

```text
{pytest_output}
```

Think step-by-step, call tools as needed, and end with a short summary of what changed."""),
    # create_tool_calling_agent expects an agent_scratchpad placeholder for tool-call messages
    MessagesPlaceholder(variable_name="agent_scratchpad"),
])

TEST_GENERATION_PROMPT = ChatPromptTemplate.from_messages([
    ("system", BASE_SYSTEM_PROMPT),
    ("human", """Generate pytest tests for the following requirement.

Requirement:

```text
{requirement}
```

Project structure:

```text
{project_tree}
```

Write tests into the appropriate tests/ path using write_patch. Cover happy paths and key edge cases. Then run_pytest to validate."""),
    MessagesPlaceholder(variable_name="agent_scratchpad"),
])

Separate prompts per task keep trajectories easier to evaluate than one mega-prompt.

Step 5 — Building the agent with LangChain

Wire LLM, tools, and prompts. Create agent/config.py:

Python

# agent/config.py
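import os

from dotenv import load_dotenv
from langchain_openai import ChatOpenAI

# Load OPENAI_API_KEY from the .env file created in Step 1
load_dotenv()


def get_llm():
    return ChatOpenAI(
        model="gpt-4o",
        temperature=0.1,
        api_key=os.environ.get("OPENAI_API_KEY"),
    )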

Create agent/agent.py:

Python

# agent/agent.py
from typing import Any

from langchain.agents import AgentExecutor, create_tool_calling_agent

from .config import get_llm
from .prompts import FAILURE_ANALYSIS_PROMPT, TEST_GENERATION_PROMPT
from .tools import run_pytest, read_file, write_patch

TOOLS = [run_pytest, read_file, write_patch]


def build_failure_analysis_agent() -> AgentExecutor:
    llm = get_llm()
    agent_runnable = create_tool_calling_agent(
        llm=llm,
        tools=TOOLS,
        prompt=FAILURE_ANALYSIS_PROMPT,
    )
    return AgentExecutor(agent=agent_runnable, tools=TOOLS, verbose=True)


def build_test_generation_agent() -> AgentExecutor:
    llm = get_llm()
    agent_runnable = create_tool_calling_agent(
        llm=llm,
        tools=TOOLS,
        prompt=TEST_GENERATION_PROMPT,
    )
    return AgentExecutor(agent=agent_runnable, tools=TOOLS, verbose=True)


def run_failure_analysis(pytest_output: str) -> dict[str, Any]:
    agent = build_failure_analysis_agent()
    return agent.invoke({"pytest_output": pytest_output})


def run_test_generation(requirement: str, project_tree: str) -> dict[str, Any]:
    agent = build_test_generation_agent()
    return agent.invoke({"requirement": requirement, "project_tree": project_tree})

This follows LangChain’s tool-calling agent pattern: the model decides when to call run_pytest, read_file, or write_patch. Pin LangChain versions in requirements.txt and re-run integration tests when upgrading—agent APIs evolve.

Step 6 — Adding a CLI for local use

Engineers should run the agent from the terminal. Create scripts/run_agent_cli.py:

Python

# scripts/run_agent_cli.py
import argparse
import subprocess
import sys
from pathlib import Path

from rich.console import Console

# Allow `python scripts/run_agent_cli.py` from repo root
REPO_ROOT = Path(__file__).resolve().parents[1]
sys.path.insert(0, str(REPO_ROOT))

from agent.agent import run_failure_analysis, run_test_generation

console = Console()
SAMPLE_PROJECT = REPO_ROOT / "tests" / "sample_project"


def get_project_tree() -> str:
    try:
        proc = subprocess.run(
            ["tree", "-L", "3", "."],
            cwd=SAMPLE_PROJECT,
            capture_output=True,
            text=True,
            check=False,
        )
        return proc.stdout or proc.stderr or "(tree not installed)"
    except FileNotFoundError:
        return "Install `tree` or paste project layout manually."


def main() -> None:
    parser = argparse.ArgumentParser(description="LangChain Test Agent CLI")
    subparsers = parser.add_subparsers(dest="command", required=True)

    p_fail = subparsers.add_parser("analyze-failure")
    p_fail.add_argument("--pytest-output-file", required=True)

    p_gen = subparsers.add_parser("generate-tests")
    p_gen.add_argument("--requirement-file", required=True)

    args = parser.parse_args()

    if args.command == "analyze-failure":
        output_path = REPO_ROOT / args.pytest_output_file
        pytest_output = output_path.read_text(encoding="utf-8")
        console.rule("[bold cyan]Analyzing failure with Test Agent")
        result = run_failure_analysis(pytest_output)
        console.print(result.get("output", result))

    elif args.command == "generate-tests":
        req_path = REPO_ROOT / args.requirement_file
        requirement = req_path.read_text(encoding="utf-8")
        tree = get_project_tree()
        console.rule("[bold cyan]Generating tests with Test Agent")
        result = run_test_generation(requirement, tree)
        console.print(result.get("output", result))


if __name__ == "__main__":
    main()

Example usage from the repo root:

Bash

python scripts/run_agent_cli.py analyze-failure --pytest-output-file tmp/last_run.txt
python scripts/run_agent_cli.py generate-tests --requirement-file docs/feature_x.md

Step 7 — Sample project and intentional failure

Under tests/sample_project, add a minimal app and a failing test:

Python

# tests/sample_project/src/calculator.py
def add(a: int, b: int) -> int:
    return a + b

Python

# tests/sample_project/tests/test_calculator.py
from src.calculator import add

def test_add_positive_numbers():
    assert add(2, 3) == 5

def test_add_negative_numbers():
    assert add(-2, -3) == -5

def test_add_zero():
    # Intentional bug: wrong expectation
    assert add(0, 5) == 4

Capture output for the agent:

Bash

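mkdir -p tmp
cd tests/sample_project
# `python -m pytest` adds the current directory to sys.path so `from src.calculator import add` resolves
python -m pytest -q > ../../tmp/last_run.txt 2>&1
cd ../..
python scripts/run_agent_cli.py analyze-failure --pytest-output-file tmp/last_run.txt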

Step 8 — Walking through an end-to-end failure fix

When you run analyze-failure, a typical trajectory is:

Parse pytest output (failure in test_add_zero).
read_file on tests/sample_project/tests/test_calculator.py and optionally src/calculator.py.
Conclude the expectation add(0, 5) == 4 is wrong.
write_patch to set the assertion to == 5.
run_pytest to verify green.
Summarize the diff and outcome.

In your team’s workflow, treat that summary as input to code review, not an auto-merge. Keep before/after diffs and tool-call logs in the write-up or internal runbook; this is the same evidence you would want for AI hallucination guardrails.

Step 9 — Evaluating the test agent itself

LangChain’s testing guidance applies to your agent too: unit tests, integration tests, and trajectory evals.

Unit tests

In tests/unit/test_agent_core.py:

Mock the LLM with LangChain mock chat models where possible.
Assert tools reject paths outside the repo root.
Assert graceful handling of missing files or empty pytest output.
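The path and missing-file assertions need no LLM at all. The sketch below tests a stand-in function, `safe_read`, that mirrors the read_file guard with the root passed in explicitly. That parameter is an assumption made so the tests can run against a temporary directory; in the real repo you would import the tool from agent/tools.py instead.

```python
import tempfile
from pathlib import Path


def safe_read(root: Path, relative_path: str) -> str:
    """Stand-in mirroring the read_file guard; not the real tool."""
    root = root.resolve()
    target = (root / relative_path).resolve()
    if root not in target.parents:
        return f"Path not allowed: {relative_path}"
    if not target.is_file():
        return f"File not found: {relative_path}"
    return target.read_text(encoding="utf-8")


def test_rejects_path_outside_root() -> None:
    with tempfile.TemporaryDirectory() as tmp:
        result = safe_read(Path(tmp), "../secrets.txt")
        assert result.startswith("Path not allowed")


def test_reports_missing_file() -> None:
    with tempfile.TemporaryDirectory() as tmp:
        assert safe_read(Path(tmp), "nope.py") == "File not found: nope.py"


def test_reads_file_inside_root() -> None:
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "ok.txt").write_text("hello", encoding="utf-8")
        assert safe_read(Path(tmp), "ok.txt") == "hello"
```

pytest collects the three `test_` functions as written; they run in milliseconds and cost nothing, so they belong on every PR.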

Integration tests

In tests/integration/test_agent_on_sample_project.py, run the real agent against the sample project with a real API key—often in a nightly job, not every PR, to control cost.

Trajectory and eval tests

Record golden runs and compare new agent versions:

Does it still identify the correct failing test?
Does it propose a minimal fix?
Does it avoid editing production code when only the test is wrong?

Use deterministic checks plus optional LLM-as-judge evaluators; see LangChain testing docs for patterns.
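A deterministic trajectory check can be a plain function over recorded tool calls. The record shape below (a list of dicts with `tool` and `args` keys) is an assumption for illustration, not a LangChain format, and `protected_prefixes` is a hypothetical parameter naming the code the agent should not edit when only a test is wrong.

```python
def check_trajectory(
    tool_calls: list[dict],
    protected_prefixes: tuple[str, ...] = ("tests/sample_project/src/",),
) -> list[str]:
    """Return violated expectations; an empty list means the run passes."""
    problems: list[str] = []
    names = [call["tool"] for call in tool_calls]

    # Rule 1: no patch may touch protected (production) code
    for call in tool_calls:
        if call["tool"] != "write_patch":
            continue
        path = call["args"]["path"]
        if path.startswith(protected_prefixes):
            problems.append(f"edited protected file: {path}")

    # Rule 2: every run that patches must re-run tests after the last patch
    patch_positions = [i for i, name in enumerate(names) if name == "write_patch"]
    if patch_positions:
        after_last_patch = names[patch_positions[-1] + 1:]
        if "run_pytest" not in after_last_patch:
            problems.append("no run_pytest after the last write_patch")

    return problems
```

Checks like these catch regressions between agent versions without spending tokens; an LLM-as-judge evaluator can then focus on the quality of the explanation.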

Step 10 — Hardening for production

Area	Practice
Guardrails	Allowlist paths for `write_patch`; require human review before commit
Cost	Smaller models for log parsing; larger models only for patch generation; cache repeated failures
Observability	Structured logs of tool name, args (redacted), and outcomes; track accept vs reject rate
Security	Sandbox `run_pytest` and file writes; never pass secrets into prompts
Trust	Ban auto-merge on high-risk suites; align with risk-based testing tiers

These mirror recommendations for production LangGraph and LangChain deployments.
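For the observability row, a structured log line per tool call can be sketched as follows. The secret pattern (`sk-` followed by at least eight key characters) and the list of sensitive key names are assumptions to adapt to your providers; the helpers `redact` and `log_tool_call` are hypothetical, and values nested inside lists are left untouched here.

```python
import json
import re

# Assumed key shape; extend for other providers' token formats.
SECRET_PATTERN = re.compile(r"sk-[A-Za-z0-9_-]{8,}")
SENSITIVE_KEYS = {"api_key", "token", "password"}


def redact(value):
    """Mask sensitive dict keys and secret-looking substrings in strings."""
    if isinstance(value, dict):
        return {
            key: "[REDACTED]" if key.lower() in SENSITIVE_KEYS else redact(item)
            for key, item in value.items()
        }
    if isinstance(value, str):
        return SECRET_PATTERN.sub("[REDACTED]", value)
    return value


def log_tool_call(tool: str, args: dict, outcome: str) -> str:
    """Build one JSON log line describing a tool call with redacted arguments."""
    record = {"tool": tool, "args": redact(args), "outcome": outcome}
    return json.dumps(record, sort_keys=True)
```

Emitting one JSON object per call keeps the logs easy to aggregate when you later track accept versus reject rates.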

Extending the agent (v2 roadmap)

Ideas that extend the same repo without rewriting the core:

Multi-agent split — “Failure detector”, “Fix proposer”, and “Validator” as separate agents under a controller (similar to LangChain + CrewAI pipelines for flaky tests).
Docs-to-tests — Parse Markdown or OpenAPI; generate API or BDD-style tests.
Synthetic user testing — Exercise prompts and tools with simulated inputs before CI exposure.
CI/CD hooks — GitHub Actions step on test failure: run agent, open draft PR with patch suggestions.

Each item is a follow-on post or README section once v1 evals are green.

README highlights

Document in README.md:

What the test agent does and what it does not do (no unsupervised release)
Setup: Python version, pip install, .env keys
CLI examples and sample failure walkthrough
Evaluation and safety notes
Roadmap: PR integration, multi-agent, Playwright or Java test runners as additional tools

Conclusion

A LangChain test agent will not replace human QA engineers. It can automate tedious work: reading logs, locating files, suggesting patches, and scaffolding tests—with re-runs and evals closing the loop. The repository structure and patterns above give you a concrete foundation to diagnose failures, generate tests from requirements, and evolve toward richer agentic QA on the same LangChain abstractions.

Use it as an assistant with clear boundaries: humans own risk, data, and merge decisions—the same bar set in Agentic AI Testing for Software Test Engineers.