EvalView

Behavior regression gate for AI agents — snapshot behavior, diff tool calls, and catch silent regressions, exposed to Claude Code over MCP.

Unverified

stdio (local)

No auth

Python

Add to your client

Copy the config for your MCP client and paste it into its config file.

Install / run

pip install evalview

Paste into ~/Library/Application Support/Claude/claude_desktop_config.json (the Claude Desktop config file on macOS)

{
  "mcpServers": {
    "evalview": {
      "command": "evalview",
      "args": [
        "mcp",
        "serve"
      ]
    }
  }
}
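
If the config file already lists other servers, pasting the block wholesale would overwrite them. The sketch below shows one way to merge the entry instead. The helper names `add_evalview` and `update_config_file` are hypothetical and not part of EvalView; only the `command` and `args` values come from the config above.

```python
import json
from pathlib import Path


def add_evalview(config: dict) -> dict:
    """Return a copy of `config` with the evalview server added.

    Existing servers and unrelated top-level settings are preserved.
    """
    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))
    # Same entry as the JSON snippet above.
    servers["evalview"] = {"command": "evalview", "args": ["mcp", "serve"]}
    merged["mcpServers"] = servers
    return merged


def update_config_file(path: Path) -> None:
    """Read the client config at `path` (if it exists), merge, and write it back."""
    existing = json.loads(path.read_text()) if path.exists() else {}
    path.write_text(json.dumps(add_evalview(existing), indent=2) + "\n")
```

Calling `update_config_file(Path.home() / "Library/Application Support/Claude/claude_desktop_config.json")` would apply it to the path named above. Back up the file first, since the sketch rewrites it in place.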

Before you start

Python with `pip` (install the `evalview` package: `pip install evalview`)
An AI agent / test suite for EvalView to snapshot and check (run `evalview init` to auto-detect and scaffold one)
Optional: an LLM provider API key (e.g. OPENAI_API_KEY or ANTHROPIC_API_KEY) only if you enable the semantic-similarity or LLM-as-judge scoring layers
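
A quick way to confirm these prerequisites is a small preflight script. This is a minimal sketch: the function name `preflight` is hypothetical, and it only checks that an `evalview` executable is on PATH and whether either of the two example key variables named above is set. Other providers may use different variable names.

```python
import os
import shutil

# Provider keys named in the list above; other providers may differ.
PROVIDER_KEYS = ("OPENAI_API_KEY", "ANTHROPIC_API_KEY")


def preflight(env=None, which=shutil.which) -> dict:
    """Report whether the CLI is installed and an optional provider key is set.

    `env` and `which` are injectable so the check can be exercised in tests.
    """
    env = os.environ if env is None else env
    return {
        # True if an `evalview` executable can be found on PATH.
        "evalview_on_path": which("evalview") is not None,
        # Only relevant for the optional semantic / LLM-as-judge layers.
        "provider_key_set": any(env.get(key) for key in PROVIDER_KEYS),
    }
```

A missing provider key is not an error here, since the key is only needed for the optional scoring layers.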

About EvalView

EvalView catches silent behavior regressions in AI agents that traditional health checks miss: a model or provider update can change tool choice, skip a clarification, or degrade output without breaking your code or returning a non-200 response. It snapshots agent behavior as golden baselines, diffs tool calls, parameters, call sequence, and output on every change, grades drift with a confidence level rather than raising a binary alarm, and can auto-heal flaky failures. The MCP server lets Claude Code drive that loop conversationally: install with `pip install evalview`, then run `claude mcp add --transport stdio evalview -- evalview mcp serve`. Scoring spans four layers. Tool-call and sequence diffs and code-based checks are free and run offline; semantic similarity and LLM-as-judge are optional, paid layers.
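
To make the deterministic layer concrete, here is a minimal sketch of diffing a current trace against a golden baseline. It illustrates the general idea only and is not EvalView's implementation: the trace shape (a list of `(tool_name, params)` tuples), the function name `diff_tool_calls`, and the example tool names are all assumptions.

```python
from difflib import SequenceMatcher


def diff_tool_calls(baseline, current):
    """Compare two traces, each a list of (tool_name, params) tuples.

    Returns a list of human-readable changes; an empty list means no drift.
    """
    changes = []
    base_names = [name for name, _ in baseline]
    cur_names = [name for name, _ in current]
    # Align the two tool-name sequences so added or dropped calls are detected.
    matcher = SequenceMatcher(a=base_names, b=cur_names, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Same tools in the same order: compare their parameters.
            for (name, old), (_, new) in zip(baseline[i1:i2], current[j1:j2]):
                if old != new:
                    changes.append(f"params changed for {name}: {old} -> {new}")
        else:
            if tag in ("delete", "replace"):
                changes.append(f"removed calls: {base_names[i1:i2]}")
            if tag in ("insert", "replace"):
                changes.append(f"added calls: {cur_names[j1:j2]}")
    return changes


# Hypothetical traces: the agent stopped asking for clarification.
baseline = [("search", {"q": "refund"}), ("clarify", {}), ("issue_refund", {"id": 7})]
current = [("search", {"q": "refunds"}), ("issue_refund", {"id": 7})]
for change in diff_tool_calls(baseline, current):
    print(change)
```

A diff like this needs no API key, which is why a purely structural comparison can run offline. Judging whether a changed output is still acceptable is where the optional semantic layers would come in.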

Tools & capabilities (8)

create_test

Create an EvalView test case for an agent behavior.

run_snapshot

Run tests and save the resulting traces as golden baselines.

run_check

Replay tests, diff against baselines, and report regressions/changes with the ship verdict.

list_tests

List the EvalView tests defined in the workspace.

validate_skill

Validate a skill (e.g. for Claude Code / Codex / OpenClaw) against EvalView's expectations.

generate_skill_tests

Generate EvalView tests for a skill.

run_skill_test

Run an EvalView test against a skill.

generate_visual_report

Generate a visual (HTML) report of EvalView results.
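
MCP clients invoke these tools with JSON-RPC 2.0 `tools/call` requests. The sketch below builds such a request for one of the tool names listed above. The helper name `tools_call` is hypothetical, and the empty `arguments` object is a placeholder: this listing does not document each tool's argument schema, so a real client should read it from the server's `tools/list` response and complete the `initialize` handshake before calling any tool.

```python
import json
from itertools import count
from typing import Optional

_ids = count(1)  # each JSON-RPC request in a session needs its own id


def tools_call(name: str, arguments: Optional[dict] = None) -> str:
    """Build a JSON-RPC 2.0 `tools/call` request for an MCP server."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": next(_ids),
        "method": "tools/call",
        # Placeholder arguments; the real schema comes from `tools/list`.
        "params": {"name": name, "arguments": arguments or {}},
    })


print(tools_call("run_check"))
```

In practice Claude Code or another MCP client constructs these messages for you; the sketch only shows what travels over the stdio transport.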

When to use it

Let Claude Code answer "did my refactor break anything?" by running a behavior regression check inline before you commit
Snapshot an agent's tool-calling behavior as a baseline and detect silent drift after a model or provider update
Generate and run regression tests for Claude Code / Codex / OpenClaw skills
Produce a visual report of which agent tests passed, changed, or regressed

Security notes

Your data stays local by default — nothing leaves your machine unless you opt in to cloud sync via `evalview login`. The deterministic tool + sequence diff runs without any API key; semantic similarity and LLM-as-judge layers are optional and require an OpenAI/Anthropic (or other provider) API key when enabled.

EvalView FAQ

How do I connect EvalView to Claude Code?

Install the package with `pip install evalview`, then run `claude mcp add --transport stdio evalview -- evalview mcp serve`. Optionally copy `CLAUDE.md.example` to `CLAUDE.md` to make Claude Code proactively run checks.

What tools does the MCP server expose?

Eight tools: create_test, run_snapshot, run_check, list_tests, validate_skill, generate_skill_tests, run_skill_test, and generate_visual_report.

Does it require an API key?

Not for the core regression gate. The deterministic tool + sequence diff and code-based checks run offline with no API key, and your data stays local by default. An LLM provider API key is needed only if you opt into the semantic-similarity or LLM-as-judge scoring layers.

Is EvalView only an MCP server?

No. It is primarily an `evalview` CLI and Python library (`gate()` / `gate_async()`), with a pytest plugin and GitHub Action; the MCP server is one integration that surfaces its core commands to Claude Code.

#agent-evaluation #regression-testing #ai-agents #testing #ci-cd #golden-baselines #tool-calling #claude-code

Alternatives to EvalView

Playwright MCP Server

Developer Tools

24k

Microsoft's official browser-automation MCP using Playwright's accessibility tree (no vision model).

Featured

Verified

stdio (local)

No auth

TypeScript

12 tools

Updated 13 days ago

Context7 MCP Server

Developer Tools

28k

Up-to-date, version-specific library documentation injected into your coding agent.

Verified

stdio (local)

API key

TypeScript

2 tools

Updated 17 days ago

Serena

Developer Tools

12k

LSP-powered coding agent toolkit: semantic symbol search, references and structural edits.

Verified

stdio (local)

No auth

Python

11 tools

Updated 15 days ago
