E2eval: The Eval Framework That Actually Earns Its Results

The Eval Problem Nobody Talks About

Most agent evals are fake. They use synthetic tasks, curated datasets, and sanitized inputs that look nothing like the chaos of a real production environment. E2eval is built on a different premise entirely: if you want to know whether your agent actually works, you need to test it against the work your team is already doing.

That's a genuinely sharp idea — and it's worth unpacking.

What E2eval Actually Does

E2eval is an eval framework for agents that use real tools against real data. Not simulated APIs. Not mock responses. The agent gets the same MCP tools it would use in production — searching Slack messages, querying service catalogs, reading internal docs, calling APIs — and it has to solve tasks that were already solved by a human.

The source material is golden. Tasks come from actual work: someone traced a config problem through logs, someone debugged a production incident, someone answered a question in Slack. That interaction becomes the ground truth. The agent has to reproduce the answer from scratch.
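To make the idea of "a past interaction becomes the ground truth" concrete, here is a minimal sketch of what such a task record might look like. The `EvalTask` shape, its field names, and the `matchesGroundTruth` helper are all hypothetical and chosen for illustration; they are not taken from E2eval's actual schema, and real grading is usually more nuanced than a substring check.

```typescript
// Hypothetical task record; field names are illustrative, not E2eval's schema.
interface EvalTask {
  id: string;
  question: string;    // what the human was asked or set out to find
  groundTruth: string; // the answer the human actually reached
  source: string;      // where the original interaction lives, e.g. a Slack thread
}

// Normalize case and whitespace so trivial formatting differences do not matter.
function normalize(s: string): string {
  return s.toLowerCase().replace(/\s+/g, " ").trim();
}

// Naive check (an assumption for this sketch): the agent's answer must contain
// the human's answer after normalization.
function matchesGroundTruth(task: EvalTask, agentAnswer: string): boolean {
  return normalize(agentAnswer).includes(normalize(task.groundTruth));
}

const task: EvalTask = {
  id: "incident-001",
  question: "Why did checkout requests start timing out?",
  groundTruth: "connection pool exhausted",
  source: "slack-thread",
};

const hit = matchesGroundTruth(task, "The  Connection Pool  exhausted under load"); // true
const miss = matchesGroundTruth(task, "A DNS failure upstream");                    // false
```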

The clever part — and this is what separates E2eval from naive replay frameworks — is the source scrubbing step. Before the agent sees a tool result, middleware can redact the evidence that would give the answer away. You're not testing whether the agent can read the answer off the page. You're testing whether it can reason to the same conclusion the human reached.

The Two Core Building Blocks

E2eval exposes two primary extension points, and they're intentionally lean.

Middleware is a scrubber. It intercepts MCP tool results before the agent processes them, redacting source evidence that would contaminate the evaluation. Think of it as a proctor pulling the answer key off the desk before the exam starts. This is the mechanism that makes the whole framework honest.
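The scrubbing step can be sketched as a function that takes a tool result and returns a redacted copy. This is only an illustration under assumptions: the `ToolResult` shape, the `Middleware` type, and `makeScrubber` are hypothetical names, not E2eval's real middleware API.

```typescript
// Hypothetical shapes for illustration; E2eval's real middleware types may differ.
interface ToolResult {
  tool: string; // name of the MCP tool that produced the result
  text: string; // text payload the agent would otherwise see
}

type Middleware = (result: ToolResult) => ToolResult;

// Build a scrubber that replaces every match of the given patterns with a marker.
function makeScrubber(patterns: RegExp[], marker = "[REDACTED]"): Middleware {
  return (result) => {
    let text = result.text;
    for (const pattern of patterns) {
      // Force the global flag so all occurrences are redacted, not just the first.
      const flags = pattern.flags.includes("g") ? pattern.flags : pattern.flags + "g";
      text = text.replace(new RegExp(pattern.source, flags), marker);
    }
    // Return a copy so the original tool result is left untouched.
    return { ...result, text };
  };
}

// Example: hide the human's original answer inside a Slack search result.
const scrub = makeScrubber([/the fix was to raise the pool size to 50/i]);
const scrubbed = scrub({
  tool: "slack_search",
  text: "alice: The fix was to raise the pool size to 50. bob: thanks!",
});
// scrubbed.text is "alice: [REDACTED]. bob: thanks!"
```

The agent still sees that a conversation happened, but not the conclusion it is supposed to reach on its own.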

Adapters implement the AgentAdapter interface — they handle spawning the agent and parsing its output. Each adapter wraps a different agent runtime, which means E2eval isn't tied to one model or orchestration layer. If you can describe "start the agent, get the result," you can write an adapter.
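As a rough sketch of the "start the agent, get the result" contract: only the name `AgentAdapter` comes from the description above. The `spawn` and `parse` method names, the `AgentRun` and `AgentAnswer` shapes, and the `ANSWER:` output convention are assumptions made for this example, and the real interface may look quite different.

```typescript
// Hypothetical sketch: method names and shapes are assumptions, not E2eval's API.
interface AgentRun {
  rawOutput: string;
}

interface AgentAnswer {
  answer: string;
}

interface AgentAdapter {
  spawn(prompt: string): Promise<AgentRun>; // start the agent and collect raw output
  parse(run: AgentRun): AgentAnswer;        // extract the final answer from raw output
}

// A stand-in adapter that wraps a plain async function instead of a real runtime.
class FunctionAdapter implements AgentAdapter {
  private readonly agent: (prompt: string) => Promise<string>;

  constructor(agent: (prompt: string) => Promise<string>) {
    this.agent = agent;
  }

  async spawn(prompt: string): Promise<AgentRun> {
    return { rawOutput: await this.agent(prompt) };
  }

  parse(run: AgentRun): AgentAnswer {
    // Assumed convention: the agent emits a line of the form "ANSWER: <text>".
    const match = run.rawOutput.match(/^ANSWER:\s*(.+)$/m);
    return { answer: match?.[1]?.trim() ?? "" };
  }
}

// Run one task end to end through any adapter.
async function runTask(adapter: AgentAdapter, prompt: string): Promise<AgentAnswer> {
  return adapter.parse(await adapter.spawn(prompt));
}

const fakeAgent = async (prompt: string) => `Looking into: ${prompt}\nANSWER: pool size 50`;
const adapter = new FunctionAdapter(fakeAgent);
```

Calling `runTask(adapter, "why the timeouts?")` resolves to an object whose `answer` is `"pool size 50"`. Swapping in a different runtime would mean writing another class against the same two methods.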

🎯

E2eval is available as an npm package (e2eval), which makes it easy to drop into existing JavaScript/TypeScript agent toolchains without ceremony.

The MCP surface is just as minimal: two tools, no resources, no pre-built prompts. Whether that's elegant or incomplete depends entirely on what you bring to it.

Score Breakdown

With a total score of 91/100, E2eval lands near the top of what MCPpedia tracks. Let's look at where those points are going.

MCPpedia Scoring System

Total: 100 pts

Security: 30/30

Full marks. The scrubbing-first architecture is a structural security property: it limits agent exposure to raw data by design, not as an afterthought.

Maintenance: 16/25

The weakest category, and the only one that loses points. With zero GitHub stars at launch, the community signal is still absent. Watch this space.

Documentation: 15/15

Full marks. The concept is well explained, though the middleware and adapter APIs could use more worked examples.

Efficiency: 20/20

Full marks. A lean two-tool surface area means minimal overhead and fast integration.

Compatibility: 10/10

Full marks here too. npm packaging and the adapter pattern signal broad compatibility intentions.
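As a quick arithmetic check, the five category scores listed above can be summed to confirm the stated total and to show where points were lost. The variable names are illustrative only.

```typescript
// Category scores as listed in the breakdown: [earned, maximum].
const scores: Record<string, [number, number]> = {
  security: [30, 30],
  maintenance: [16, 25],
  documentation: [15, 15],
  efficiency: [20, 20],
  compatibility: [10, 10],
};

const earned = Object.values(scores).reduce((sum, [got]) => sum + got, 0);    // 91
const maximum = Object.values(scores).reduce((sum, [, max]) => sum + max, 0); // 100

// Categories that did not receive full marks, with the points lost in each.
const shortfalls = Object.entries(scores)
  .filter(([, [got, max]]) => got < max)
  .map(([name, [got, max]]) => `${name}: -${max - got}`);
// shortfalls is ["maintenance: -9"]
```

All nine missing points come from the maintenance category; every other category is at its maximum.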

The 30/30 on security is the standout. The scrubbing middleware isn't just a feature — it's the architectural heart of the framework. Building evidence redaction into the protocol layer rather than the evaluation harness is a genuinely principled design choice.

🔥

91/100 total score puts E2eval in the top tier of MCP servers in the developer-tools and AI/ML categories. The security score alone — a perfect 30 — is rare.

Who Should Actually Use This

E2eval is not for teams that are still asking "does our agent sort of work?" It's for teams that have already shipped something and need to know whether it regresses when they change the model, the prompt, or the underlying tools.

Evals built from real tasks, solved by real people, using real tools — that's the bar every serious agent team should be held to.

The ideal user is an AI engineering team running agents against internal tooling — Slack integrations, internal APIs, service catalogs, log pipelines. If your agent's job is to do the work an engineer or ops person was already doing, E2eval can turn those past sessions into a test suite.

It's also a natural fit for teams iterating on MCP tool design. Because evals run against the actual MCP tool layer, you can catch regressions in tool behavior — not just model behavior.
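One way to spot regressions after a model, prompt, or tool change is to diff per-task results between a baseline run and a candidate run. The sketch below assumes a hypothetical `TaskResult` shape and a `findRegressions` helper; neither is part of E2eval, and how results are actually stored is up to your own harness.

```typescript
// Hypothetical result shape: one entry per task, per run.
interface TaskResult {
  taskId: string;
  passed: boolean;
}

// Return the ids of tasks that passed in the baseline but fail in the candidate.
// Tasks missing from the candidate run are ignored rather than counted as failures.
function findRegressions(baseline: TaskResult[], candidate: TaskResult[]): string[] {
  const candidateById = new Map(candidate.map((r) => [r.taskId, r.passed] as const));
  return baseline
    .filter((r) => r.passed && candidateById.get(r.taskId) === false)
    .map((r) => r.taskId);
}

const baselineRun: TaskResult[] = [
  { taskId: "a", passed: true },
  { taskId: "b", passed: true },
  { taskId: "c", passed: false },
];
const candidateRun: TaskResult[] = [
  { taskId: "a", passed: true },
  { taskId: "b", passed: false }, // regressed after the change
  { taskId: "c", passed: false }, // was already failing, so not a regression
];

const regressions = findRegressions(baselineRun, candidateRun); // ["b"]
```

Because the tasks run against the same MCP tool layer in both runs, a non-empty list after a tool-only change points at the tool rather than the model.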

⚠️

Zero GitHub stars at time of writing means the community hasn't weighed in yet. The framework's quality scores are strong, but production battle-testing is still an open question. Evaluate accordingly.

E2eval is not a general-purpose eval platform. There's no hosted dashboard, no pre-built benchmark suite, no LLM-as-judge scaffolding out of the box. This is infrastructure, not a product. You wire it in, you build your adapters, you own your test suite.

The Bottom Line

Most eval frameworks ask "did the agent get the right answer?" E2eval asks a harder question: "did the agent earn the right answer, or did we hand it over?" That's a more honest test, and for any team serious about deploying agents into production workflows, it's the only test that matters.

The framework is early, the star count is zero, and the adapter ecosystem is whatever you build yourself. But the core idea is sound, the architecture is clean, and the security story is stronger than that of most tools twice its age. Give E2eval a look before your next agent deployment, and before you discover your evals were lying to you.
