The Eval Problem Nobody Talks About
Most agent evals are fake. They use synthetic tasks, curated datasets, and sanitized inputs that look nothing like the chaos of a real production environment. E2eval is built on a different premise entirely: if you want to know whether your agent actually works, you need to test it against the work your team is already doing.
That's a genuinely sharp idea — and it's worth unpacking.
E2eval is an eval framework for agents that use real tools against real data. Not simulated APIs. Not mock responses. The agent gets the same MCP tools it would use in production — searching Slack messages, querying service catalogs, reading internal docs, calling APIs — and it has to solve tasks that were already solved by a human.
The source material is golden. Tasks come from actual work: someone traced a config problem through logs, someone debugged a production incident, someone answered a question in Slack. That interaction becomes the ground truth. The agent has to reproduce the answer from scratch.
The clever part — and this is what separates E2eval from naive replay frameworks — is the source scrubbing step. Before the agent sees a tool result, middleware can redact the evidence that would give the answer away. You're not testing whether the agent can read the answer off the page. You're testing whether it can reason to the same conclusion the human reached.
E2eval exposes two primary extension points, and they're intentionally lean.
Middleware is a scrubber. It intercepts MCP tool results before the agent processes them, redacting source evidence that would contaminate the evaluation. Think of it as a proctor pulling the answer key off the desk before the exam starts. This is the mechanism that makes the whole framework honest.
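To make the scrubbing idea concrete, here is a minimal sketch of what such a middleware could look like. The ToolResult and Middleware types, the makeScrubber helper, and the sample Slack message are all hypothetical and chosen for illustration; they are not E2eval's actual API, and a real scrubber would likely need fuzzier matching than literal strings.

```typescript
// Hypothetical shapes for illustration only; not E2eval's real types.
interface ToolResult {
  tool: string; // name of the MCP tool that produced the result
  text: string; // textual payload that would be handed to the agent
}

type Middleware = (result: ToolResult) => ToolResult;

// Build a scrubber that masks every literal occurrence of the given
// evidence strings (case-insensitive) before the agent sees the result.
function makeScrubber(evidence: string[], mask = "[REDACTED]"): Middleware {
  // Escape regex metacharacters so each evidence string is matched literally.
  const escape = (s: string) => s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  const patterns = evidence
    .filter((e) => e.length > 0)
    .map((e) => new RegExp(escape(e), "gi"));

  return (result) => ({
    ...result,
    text: patterns.reduce((t, p) => t.replace(p, mask), result.text),
  });
}

// The human's original answer is the evidence we do not want to leak.
const scrub = makeScrubber(["max_connections was set to 10"]);
const scrubbed = scrub({
  tool: "slack_search",
  text: "alice: found it, max_connections was set to 10 in prod.yaml",
});
console.log(scrubbed.text);
```

The surrounding context survives (the agent still learns that prod.yaml is relevant), but the sentence that states the conclusion does not, so the agent has to reach it through its own tool calls.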
Adapters implement the AgentAdapter interface — they handle spawning the agent and parsing its output. Each adapter wraps a different agent runtime, which means E2eval isn't tied to one model or orchestration layer. If you can describe "start the agent, get the result," you can write an adapter.
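As a rough illustration of the "start the agent, get the result" contract, the sketch below shows one way an adapter could be shaped. Only the name AgentAdapter comes from the framework as described here; the run method, the AgentRun type, the FunctionAdapter class, and the TOOLS=/ANSWER= output format are assumptions made up for this example.

```typescript
// Only the interface name is taken from the article; every member is assumed.
interface AgentRun {
  answer: string;    // the agent's final answer, compared against ground truth
  toolCalls: number; // how many tool calls the agent reported making
}

interface AgentAdapter {
  readonly name: string;
  run(task: string): Promise<AgentRun>;
}

// Parse raw agent output of the hypothetical form:
//   TOOLS=<n>
//   ANSWER=<text>
function parseOutput(raw: string): AgentRun {
  const tools = /^TOOLS=(\d+)$/m.exec(raw);
  const answer = /^ANSWER=(.*)$/m.exec(raw);
  return {
    answer: answer?.[1]?.trim() ?? "",
    toolCalls: Number(tools?.[1] ?? 0),
  };
}

// Toy adapter that wraps an in-process function instead of spawning a
// real agent runtime; a production adapter would launch a process here.
class FunctionAdapter implements AgentAdapter {
  readonly name = "function-adapter";
  private readonly agent: (task: string) => Promise<string>;

  constructor(agent: (task: string) => Promise<string>) {
    this.agent = agent;
  }

  async run(task: string): Promise<AgentRun> {
    const raw = await this.agent(task);
    return parseOutput(raw);
  }
}

const adapter = new FunctionAdapter(async () => "TOOLS=3\nANSWER=pool size too small");
adapter.run("Why did checkout time out?").then((run) => console.log(run));
```

Keeping spawning and parsing behind one small interface is what would let the same task suite run against different agent runtimes without touching the tasks themselves.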
The package (e2eval) is easy to drop into existing JavaScript/TypeScript agent toolchains without ceremony.
That's genuinely minimal. Two tools, no resources, no pre-built prompts. Whether that's elegant or incomplete depends entirely on what you bring to it.
With a total score of 91/100, E2eval lands near the top of what MCPpedia tracks. Let's look at where those points are going.
MCPpedia Scoring System (total: 100 pts)
The 30/30 on security is the standout. The scrubbing middleware isn't just a feature; it's the architectural heart of the framework. Building evidence redaction into the protocol layer rather than the evaluation harness is a genuinely principled design choice.
E2eval is not for teams that are still asking "does our agent sort of work?" It's for teams that have already shipped something and need to know whether it regresses when they change the model, the prompt, or the underlying tools.
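A regression check of this kind can be sketched as a comparison between two eval runs over the same task suite. The RunResults shape, the findRegressions helper, and the task ids below are hypothetical; the article does not describe how E2eval itself stores or compares results.

```typescript
// Hypothetical result shape: task id mapped to pass/fail for one eval run.
type RunResults = Record<string, boolean>;

// Return the ids of tasks that passed in the baseline run but do not
// pass (or are missing) in the candidate run, sorted for stable output.
function findRegressions(baseline: RunResults, candidate: RunResults): string[] {
  return Object.keys(baseline)
    .filter((id) => baseline[id] === true && candidate[id] !== true)
    .sort();
}

// Baseline: current model and prompt. Candidate: after swapping the model.
const baseline: RunResults = { "incident-12": true, "config-7": true, "slack-q-3": false };
const candidate: RunResults = { "incident-12": true, "config-7": false, "slack-q-3": true };

const regressions = findRegressions(baseline, candidate);
console.log(regressions); // prints [ 'config-7' ]
```

Tracking per-task regressions rather than only an aggregate pass rate matters here: in this toy example both runs pass two of three tasks, yet one previously solved task now fails.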
Evals built from real tasks, solved by real people, using real tools — that's the bar every serious agent team should be held to.
The ideal user is an AI engineering team running agents against internal tooling — Slack integrations, internal APIs, service catalogs, log pipelines. If your agent's job is to do the work an engineer or ops person was already doing, E2eval can turn those past sessions into a test suite.
It's also a natural fit for teams iterating on MCP tool design. Because evals run against the actual MCP tool layer, you can catch regressions in tool behavior — not just model behavior.
What it's not is a general-purpose eval platform. There's no hosted dashboard, no pre-built benchmark suite, no LLM-as-judge scaffolding out of the box. This is infrastructure, not a product. You wire it in, you build your adapters, you own your test suite.
The Bottom Line
Most eval frameworks ask "did the agent get the right answer?" E2eval asks a harder question: "did the agent earn the right answer, or did we hand it to them?" That's a more honest test — and for any team serious about deploying agents into production workflows, it's the only test that matters.
The framework is early, the star count is zero, and the adapter ecosystem is whatever you build yourself. But the core idea is sound, the architecture is clean, and the security story is stronger than that of most tools twice its age. Give E2eval a look before your next agent deployment, and before you discover your evals were lying to you.
This article was written by AI, powered by Claude and real-time MCPpedia data. All facts and figures are sourced from our database — but AI can make mistakes. If something looks off, let us know.