
EvalView
Behavior regression gate for AI agents — snapshot behavior, diff tool calls, and catch silent regressions, exposed to Claude Code over MCP.
Add to your client
Copy the config for your MCP client and paste it into its config file.
pip install evalview
Paste into ~/Library/Application Support/Claude/claude_desktop_config.json:
{
  "mcpServers": {
    "evalview": {
      "command": "evalview",
      "args": [
        "mcp",
        "serve"
      ]
    }
  }
}
Step-by-step guides: Add to Claude Desktop · Add to Cursor · Add to Windsurf
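If you would rather script the change than paste it by hand, the following is a minimal sketch that merges the `evalview` entry into an existing client config without overwriting other servers. The helper name `add_evalview_server` is hypothetical and not part of EvalView; only the JSON shape is taken from the config above.

```python
import json
from pathlib import Path


def add_evalview_server(config_path: Path) -> dict:
    """Merge the evalview entry into an MCP client config file.

    Hypothetical helper for illustration; existing servers are preserved.
    """
    config = {}
    if config_path.exists():
        text = config_path.read_text(encoding="utf-8")
        if text.strip():
            config = json.loads(text)

    # Same entry as the config shown above.
    servers = config.setdefault("mcpServers", {})
    servers["evalview"] = {"command": "evalview", "args": ["mcp", "serve"]}

    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

On macOS with Claude Desktop, this could be called as `add_evalview_server(Path.home() / "Library/Application Support/Claude/claude_desktop_config.json")`. Back up the file first, since the sketch rewrites it in place.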
Before you start
- Python with `pip` (install the `evalview` package: `pip install evalview`)
- An AI agent / test suite for EvalView to snapshot and check (run `evalview init` to auto-detect and scaffold one)
- Optional: an LLM provider API key (e.g. OPENAI_API_KEY or ANTHROPIC_API_KEY) only if you enable the semantic-similarity or LLM-as-judge scoring layers
About EvalView
EvalView catches silent behavior regressions in AI agents that traditional health checks miss: a model or provider update can change tool choice, skip a clarification, or degrade output without breaking your code or returning a non-200 response. It snapshots agent behavior as golden baselines, diffs tool calls, parameters, sequence, and output on every change, grades drift with a confidence level rather than a binary alarm, and can auto-heal flaky failures. The MCP server lets Claude Code drive that loop conversationally: install with `pip install evalview`, then run `claude mcp add --transport stdio evalview -- evalview mcp serve`. Scoring spans four layers: tool calls + sequence and code-based checks are free and run offline, while semantic similarity and LLM-as-judge are optional, paid layers.
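To make the idea of diffing tool calls against a baseline concrete, here is a small conceptual sketch using Python's standard `difflib`. It is not EvalView's implementation; the trace format (a list of dicts with `tool` and `params` keys) and the function name `diff_tool_calls` are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def diff_tool_calls(baseline, current):
    """Report added, removed, and parameter-changed tool calls.

    Both arguments are lists of {"tool": str, "params": dict} (assumed format).
    """
    base_names = [call["tool"] for call in baseline]
    cur_names = [call["tool"] for call in current]
    changes = []

    matcher = SequenceMatcher(a=base_names, b=cur_names, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Same tool in the same relative position: compare parameters.
            for bi, ci in zip(range(i1, i2), range(j1, j2)):
                if baseline[bi]["params"] != current[ci]["params"]:
                    changes.append(("params_changed", base_names[bi]))
        else:
            # "delete", "insert", or "replace": the sequence itself drifted.
            changes.extend(("removed", name) for name in base_names[i1:i2])
            changes.extend(("added", name) for name in cur_names[j1:j2])
    return changes


baseline = [
    {"tool": "search", "params": {"q": "refund policy"}},
    {"tool": "clarify", "params": {}},
    {"tool": "answer", "params": {}},
]
current = [
    {"tool": "search", "params": {"q": "refunds"}},
    {"tool": "answer", "params": {}},
]
print(diff_tool_calls(baseline, current))
# → [('params_changed', 'search'), ('removed', 'clarify')]
```

In this example the agent still returns an answer, so a status-code health check would pass, yet the diff shows a changed search query and a skipped clarification step.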
Tools & capabilities (8)
- `create_test`: Create an EvalView test case for an agent behavior.
- `run_snapshot`: Run tests and save the resulting traces as golden baselines.
- `run_check`: Replay tests, diff against baselines, and report regressions/changes with the ship verdict.
- `list_tests`: List the EvalView tests defined in the workspace.
- `validate_skill`: Validate a skill (e.g. for Claude Code / Codex / OpenClaw) against EvalView's expectations.
- `generate_skill_tests`: Generate EvalView tests for a skill.
- `run_skill_test`: Run an EvalView test against a skill.
- `generate_visual_report`: Generate a visual (HTML) report of EvalView results.
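An MCP client invokes these tools with JSON-RPC `tools/call` requests. The sketch below builds such a request for one of the eight tools listed above. The helper `build_tool_call` is hypothetical, and the empty `arguments` object is a placeholder: this page does not document each tool's input schema, which a real client would obtain from a `tools/list` response after the MCP initialization handshake.

```python
import json

# Tool names as listed on this page.
EVALVIEW_TOOLS = (
    "create_test",
    "run_snapshot",
    "run_check",
    "list_tests",
    "validate_skill",
    "generate_skill_tests",
    "run_skill_test",
    "generate_visual_report",
)


def build_tool_call(name, arguments=None, request_id=1):
    """Return a single-line JSON-RPC 2.0 `tools/call` request as a string."""
    if name not in EVALVIEW_TOOLS:
        raise ValueError(f"unknown EvalView tool: {name}")
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        # Placeholder arguments; the real schema comes from tools/list.
        "params": {"name": name, "arguments": arguments or {}},
    }
    return json.dumps(message)


print(build_tool_call("run_check"))
```

In practice Claude Code constructs and sends these messages itself; the sketch only shows what travels over the stdio transport when it runs a check.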
When to use it
- Let Claude Code answer "did my refactor break anything?" by running a behavior regression check inline before you commit
- Snapshot an agent's tool-calling behavior as a baseline and detect silent drift after a model or provider update
- Generate and run regression tests for Claude Code / Codex / OpenClaw skills
- Produce a visual report of which agent tests passed, changed, or regressed
Security notes
Your data stays local by default — nothing leaves your machine unless you opt in to cloud sync via `evalview login`. The deterministic tool + sequence diff runs without any API key; semantic similarity and LLM-as-judge layers are optional and require an OpenAI/Anthropic (or other provider) API key when enabled.
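As a rough pre-flight check, the split between offline and key-dependent layers can be sketched as follows. This is not EvalView's own logic: the function `usable_layers` is hypothetical, and it assumes a provider key is read from the `OPENAI_API_KEY` or `ANTHROPIC_API_KEY` environment variables mentioned on this page.

```python
import os

# Layer names as described on this page.
OFFLINE_LAYERS = ["tool calls + sequence", "code-based checks"]
KEYED_LAYERS = ["semantic similarity", "LLM-as-judge"]
KEY_VARS = ("OPENAI_API_KEY", "ANTHROPIC_API_KEY")


def usable_layers(env=None):
    """List the scoring layers that could run with the given environment."""
    env = os.environ if env is None else env
    has_key = any(env.get(var) for var in KEY_VARS)
    # Offline layers are always available; keyed layers need a provider key.
    return OFFLINE_LAYERS + (KEYED_LAYERS if has_key else [])


print(usable_layers({}))
# → ['tool calls + sequence', 'code-based checks']
```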
EvalView FAQ
How do I connect EvalView to Claude Code?
Install the package with `pip install evalview`, then run `claude mcp add --transport stdio evalview -- evalview mcp serve`. Optionally copy `CLAUDE.md.example` to `CLAUDE.md` to make Claude Code proactively run checks.
What tools does the MCP server expose?
Eight tools: create_test, run_snapshot, run_check, list_tests, validate_skill, generate_skill_tests, run_skill_test, and generate_visual_report.
Does it require an API key?
Not for the core regression gate: the deterministic tool + sequence diff and code-based checks run offline with no API key, and your data stays local by default. An LLM provider API key is needed only if you opt into the semantic-similarity or LLM-as-judge scoring layers.
Is EvalView only an MCP server?
No. It is primarily an `evalview` CLI and Python library (`gate()` / `gate_async()`), with a pytest plugin and GitHub Action; the MCP server is one integration that surfaces its core commands to Claude Code.
Alternatives to EvalView
- Microsoft's official browser-automation MCP using Playwright's accessibility tree (no vision model).
- Up-to-date, version-specific library documentation injected into your coding agent.
- LSP-powered coding agent toolkit: semantic symbol search, references and structural edits.