Did Your Agent Actually Get Better, or Does It Just Feel Better?

2026-05-22

Last updated 2026-05-22

Good agent evals turn vague feelings ("the agent feels worse this week") into numbers a team can act on. Anthropic's engineering team published a practical guide, "Demystifying evals for AI agents," with lessons from shipping agents like Claude Code. This post shares the parts that stuck after a careful read.

Why agent evaluation is harder than it looks

A single-turn LLM eval is a clean object: prompt in, response out, grader scores it. Agents break that simplicity. They use tools across many turns, modify state in their environment, and adapt as they go. Mistakes propagate, side effects compound, and even when the trajectory looks ugly the final outcome can still be correct (or vice versa).

Take Memex's own knowledge-base agent. When you record something, the agent runs a multi-step loop with tools like LS, Grep, Read, Edit, Write, plus higher-level moves like updating insights or skipping organization entirely. Each step touches the file system. Some outcomes are objectively verifiable (was the file written? does the link resolve?), others are inherently subjective (is this the right slice of the file? is the insight worth surfacing?).

Why agent evaluation is hard

Memex's knowledge-base agent as a worked example

Toolset

Read, Write, Edit, Grep, LS, Move, Update insights, Skip organize

Knowledge-base agent · multi-turn

LOOP

LS the knowledge base, decide where this fact belongs

Grep historical records to discover relevant prior entries

Read the target file and plan the edit

Edit the entry, hit a "file is too large" limit, adjust strategy, and split the file

Write split files and update insights → ✓ done

Every step mutates the knowledge base file system.

Objectively verifiable

Did the right files get written?
Are fact_id links intact?
Does the structure follow P.A.R.A.?

Hard to judge objectively

Is the insight any good?
Is the categorization sensible?
Is the split granularity right?

Agents call tools across many turns to reach a goal, mutating environment state and adapting along the way. Final outputs include both objectively checkable facts and subjectively judged qualities, and evals need to grade both kinds.

The vocabulary of an agent evaluation framework

Anthropic's post defines a small but precise vocabulary that makes the rest of the conversation tractable:

Task: a single test case with inputs and a success definition.
Trial: one attempt at a task. Multiple trials are run because outputs vary.
Grader: logic that scores some aspect of the output. A task can have several.
Transcript: the complete record of a trial, including reasoning and tool calls.
Outcome: the final environment state at the end of the trial.
Eval harness: the runner that orchestrates trials, records steps, and aggregates.
Eval suite: a collection of related tasks measuring a capability or behavior.
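
The vocabulary above maps naturally onto a handful of types. The following is a minimal sketch, not Anthropic's or Memex's actual schema: the class names follow the terms in the list, while every field name (`task_id`, `inputs`, `graders`, and so on) and the sample data are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class Trial:
    """One attempt at a task."""
    transcript: list[str]        # complete record: reasoning and tool calls
    outcome: dict[str, Any]      # final environment state


# A grader scores one aspect of a trial; a task can have several.
Grader = Callable[[Trial], float]


@dataclass
class Task:
    """A single test case with inputs and a success definition."""
    task_id: str
    inputs: dict[str, Any]
    graders: dict[str, Grader] = field(default_factory=dict)


@dataclass
class EvalSuite:
    """A collection of related tasks measuring one capability or behavior."""
    name: str
    tasks: list[Task] = field(default_factory=list)


# Hypothetical example: one task whose success definition is "the file exists".
task = Task(
    task_id="record-001",
    inputs={"text": "example note"},
    graders={"file_written": lambda t: float("Projects/example.md" in t.outcome["files"])},
)
trial = Trial(
    transcript=["LS /", "Write Projects/example.md"],
    outcome={"files": ["Projects/example.md"]},
)
suite = EvalSuite(name="pkm-basics", tasks=[task])
score = task.graders["file_written"](trial)
```

Keeping the transcript and the outcome as separate fields mirrors the distinction in the list: graders can score how the agent got there, where it ended up, or both.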

Anthropic's evaluation framework

The nesting of harness, suite, task, graders, metrics, and trials

Eval harness

Eval suite

Task

Graders

Deterministic tests, LLM rubric, State check, tool_calls

Tracked metrics

n_turns, n_toolcalls, tokens, latency

Trials

Trial #1

transcript + final state

Trial #2

transcript + final state

...

Agent runtime environment

Final outcome

environment end-state

Grader scoring

transcript + outcome → score

Hierarchy at a glance

Eval suite contains N tasks (each with inputs + success criteria)

Task contains N graders

Task is run as N trials

Trial produces one transcript + one final state

The eval harness provides the runtime, defines the agent, ships the suite, runs tasks concurrently, records every step, applies graders, and aggregates results.
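
To make the harness's job concrete, here is a deliberately small sketch of the loop it runs: trials per task, graders per trial, then aggregation. It is an illustration under assumptions, not a real framework. `run_suite`, the shape of the `tasks` dictionary, and `echo_agent` are all hypothetical names, and the loop runs sequentially where a real harness would run tasks concurrently.

```python
from statistics import mean
from typing import Any, Callable

Transcript = list[str]
Outcome = dict[str, Any]
Agent = Callable[[dict[str, Any]], tuple[Transcript, Outcome]]
Grader = Callable[[Transcript, Outcome], float]


def run_suite(
    agent: Agent,
    tasks: dict[str, tuple[dict[str, Any], dict[str, Grader]]],
    n_trials: int = 3,
) -> dict[str, Any]:
    """Run every task n_trials times, apply its graders, and aggregate."""
    report: dict[str, Any] = {}
    for task_id, (inputs, graders) in tasks.items():
        trials = []
        for _ in range(n_trials):
            transcript, outcome = agent(inputs)  # one trial
            scores = {name: g(transcript, outcome) for name, g in graders.items()}
            trials.append({"n_turns": len(transcript), "scores": scores})
        report[task_id] = {
            "trials": trials,
            "mean_scores": {
                name: mean(t["scores"][name] for t in trials) for name in graders
            },
        }
    return report


def echo_agent(inputs: dict[str, Any]) -> tuple[Transcript, Outcome]:
    """Stand-in agent (hypothetical): 'writes' the input text to the given path."""
    transcript = [f"Write {inputs['path']}"]
    outcome = {"files": {inputs["path"]: inputs["text"]}}
    return transcript, outcome


tasks = {
    "t1": (
        {"path": "Projects/example.md", "text": "hello"},
        {"file_written": lambda tr, out: float("Projects/example.md" in out["files"])},
    )
}
report = run_suite(echo_agent, tasks, n_trials=2)
```

The per-trial records are kept alongside the aggregate so that a surprising mean can be traced back to individual transcripts.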

Three kinds of graders for agent evals

Most useful eval setups combine three grader types, picking the right tool for each dimension of the task.

Code-based graders

String matching, fuzzy similarity, tool-call verification, and pass/fail unit tests. The graders themselves are deterministic code: fast, cheap, reproducible, and easy to debug. They struggle with valid variations and any kind of subjective quality, so use them for outcomes you can verify mechanically.
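
As a rough illustration of the checks named above, the sketch below implements exact matching, fuzzy similarity, and tool-call verification with the Python standard library. The function names, the 0.8 threshold, and the sample strings are assumptions chosen for the example, not values from the Anthropic guide.

```python
from difflib import SequenceMatcher


def exact_match(output: str, expected: str) -> bool:
    """String matching, ignoring leading and trailing whitespace."""
    return output.strip() == expected.strip()


def fuzzy_match(output: str, expected: str, threshold: float = 0.8) -> bool:
    """Fuzzy similarity: tolerate small wording differences."""
    return SequenceMatcher(None, output, expected).ratio() >= threshold


def called_tools_in_order(tool_calls: list[str], required: list[str]) -> bool:
    """Tool-call verification: `required` must appear as a subsequence of the calls."""
    remaining = iter(tool_calls)
    # `tool in remaining` consumes the iterator up to the first match,
    # so each required tool has to occur after the previous one.
    return all(tool in remaining for tool in required)


exact_ok = exact_match("done\n", "done")
fuzzy_ok = fuzzy_match(
    "Moved note to Projects/alpha.md.", "Moved note to Projects/alpha.md"
)
order_ok = called_tools_in_order(["LS", "Grep", "Read", "Edit"], ["LS", "Read", "Edit"])
order_bad = called_tools_in_order(["Read", "LS"], ["LS", "Read"])
```

The subsequence check allows extra tool calls between the required ones, which is one way to reduce brittleness to valid variations; a stricter grader could demand an exact sequence instead.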

Model-based graders (LLM-as-judge)

An LLM-as-judge applies a rubric written in natural language. Reference solutions or a panel of judges raise confidence. The tradeoff: you are running a stochastic grader on a stochastic agent, so model-based graders need calibration against human ratings to stay trustworthy.

Human graders

SME review, crowd judgment, A/B testing, and inter-annotator agreement. Slow and expensive, but the gold standard for ambiguous quality. Their main job is calibrating model-based graders and resolving the edges where automated scoring drifts.
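
One way to calibrate a model-based grader is to have humans and the LLM judge label the same trials and then measure how often they agree beyond chance. The sketch below uses Cohen's kappa for that. This statistic is a common choice for inter-annotator agreement but is an assumption here, since the guide does not prescribe a specific measure, and the labels are made-up sample data.

```python
from collections import Counter


def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Chance-corrected agreement between two raters over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected if both raters labeled at random with their own base rates.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels for four trials.
human = ["pass", "pass", "fail", "fail"]
judge = ["pass", "pass", "fail", "pass"]
kappa = cohens_kappa(human, judge)  # observed 0.75, expected 0.5, kappa 0.5
```

A kappa near 1 suggests the judge can stand in for humans on this rubric; a low value is a signal to tighten the rubric or the judge prompt before trusting its scores.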

How to grade an AI agent

Agent evals usually combine three kinds of graders

Code-based

String matching, fuzzy similarity, tool-call verification, and pass/fail unit tests. The graders themselves are deterministic code.

Pros: Fast, cheap, objective, reproducible, easy to debug.

Cons: Brittle to valid variations that don't match expected patterns; weak on subjective qualities.

Fit: Agents whose outputs can be objectively verified.

Model-based

An LLM-as-judge applies a rubric written in natural language. You can also use reference solutions or a panel of judges to raise confidence.

Pros: Flexible, scalable, captures nuance, handles open-ended outputs.

Cons: Non-deterministic and more expensive. You are using a stochastic grader on a stochastic agent, so calibration with humans is essential.

Fit: Agents whose outputs are hard to verify with code alone.

Human

SME review, crowd judgment, A/B testing, and inter-annotator agreement. Slow, but the gold standard for ambiguous quality.

Pros: Authoritative; captures expert taste.

Cons: Most expensive and slowest; access to experts can be the real bottleneck.

Fit: Calibrating model-based graders and resolving subjective edges.

Non-determinism: pass@k vs pass^k

Agent behavior varies between runs. A task that passed yesterday might fail today. Two metrics from the literature help reason about this.

pass@k: at least one success in k attempts

pass@k is the probability that at least one of k attempts succeeds. As k grows, this trends up: more shots on goal mean better odds of at least one win. pass@1 is what you usually care about for first-try success, since end users rarely retry.

pass^k: all k attempts succeed

pass^k is the probability that all k attempts succeed. This trends down with k: demanding consistency across more trials is a higher bar. Customer-facing agents that must "just work every time" live or die by this number.
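
Both definitions reduce to one-line formulas if we assume trials are independent and share a fixed per-trial success rate p: pass@k = 1 - (1 - p)^k and pass^k = p^k. The sketch below encodes them, plus the unbiased pass@k estimator used in code-generation benchmarks for the case where you have n recorded trials with c successes. The function names are ours, and the independence assumption is a simplification that real agents may violate.

```python
from math import comb


def pass_at_k(p: float, k: int) -> float:
    """Probability that at least one of k independent attempts succeeds."""
    return 1 - (1 - p) ** k


def pass_hat_k(p: float, k: int) -> float:
    """Probability that all k independent attempts succeed (pass^k)."""
    return p ** k


def estimate_pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n trials with c successes: 1 - C(n-c, k) / C(n, k)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k trials failed, giving an estimate of 1.
    return 1 - comb(n - c, k) / comb(n, k)


at_1, hat_1 = pass_at_k(0.5, 1), pass_hat_k(0.5, 1)
at_10, hat_10 = pass_at_k(0.5, 10), pass_hat_k(0.5, 10)
```

With p = 0.5, both metrics equal 0.5 at k = 1, while at k = 10 pass@k is about 0.999 and pass^k is about 0.001.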

Capturing non-determinism in agent evals

Anthropic recommends two complementary metrics: pass@k and pass^k

pass@k

The fewer attempts an agent needs to succeed, the better. We want pass@1 as close to 100% as possible (succeed on the first try).

pass^k

Can the agent succeed on every one of k trials? The larger the k at which pass^k stays high, the more reliably the agent works in production.

pass@k: ≥1 success

pass^k: all succeed

At k=1 the two metrics are identical (both equal the per-trial success rate). As k grows they diverge: with a 50% per-trial success rate and independent trials, pass@10 is 1 - 0.5^10 ≈ 99.9% while pass^10 is 0.5^10 ≈ 0.1%. Pick the one that matches the user expectation you actually need to meet.

Recommendations: why, when, and how to start agent evals

The tail of the original guide is full of practical advice. Here is the condensed version we keep coming back to.

Anthropic's evaluation recommendations

Why · when · how to start

Why?

Effective evals help teams ship agents with confidence. Whether you tweaked a prompt, changed the runtime, or upgraded the model, evals answer the question "is the agent actually better?"

When?

Evals are valuable at every stage. Early on they force you to define what success means. Later they keep quality from drifting. The sooner you start, the cheaper they are to build.

How?

Mix grader types deliberately. Decide how scores combine: weighted, all-pass, or hybrid?

Start with 20–50 simple tasks drawn from real failures. Small samples deliver outsized iteration value (the 80/20 rule).

Write unambiguous tasks that two domain experts would grade identically. A 0% pass rate across many trials (for example, 0% pass@100) usually points to a broken task, not an incapable agent.

Cover both should and should-not cases. One-sided suites incentivize one-sided behavior.

Isolate the eval environment. Shared state, leftover files, or resource pressure ruin trial independence.

Treat the eval suite as a living artifact. Read transcripts. Promote stable capability tasks into a regression suite. Audit graders when scores look surprising.

Open contributions to domain experts and product teams. They are closest to the user and the requirements.
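
Of the recommendations above, environment isolation is the most mechanical to implement. A minimal sketch, assuming the agent's environment is a directory of files: each trial gets a fresh temporary directory seeded with the same starting files, and the final state is snapshotted before the directory is deleted. `run_isolated_trial`, `append_agent`, and the file names are hypothetical.

```python
import tempfile
from pathlib import Path
from typing import Callable


def run_isolated_trial(
    agent: Callable[[Path], None], seed_files: dict[str, str]
) -> dict[str, str]:
    """Run one trial in a fresh directory and return the final file contents."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        for rel_path, text in seed_files.items():
            path = root / rel_path
            path.parent.mkdir(parents=True, exist_ok=True)
            path.write_text(text, encoding="utf-8")
        agent(root)  # the agent may read and mutate anything under root
        # Snapshot the outcome before the directory is cleaned up.
        return {
            p.relative_to(root).as_posix(): p.read_text(encoding="utf-8")
            for p in sorted(root.rglob("*"))
            if p.is_file()
        }


def append_agent(root: Path) -> None:
    """Stand-in agent: appends to a log, so leftover state would be visible."""
    log = root / "log.md"
    previous = log.read_text(encoding="utf-8") if log.exists() else ""
    log.write_text(previous + "entry\n", encoding="utf-8")


seed = {"Projects/alpha.md": "seed note"}
first = run_isolated_trial(append_agent, seed)
second = run_isolated_trial(append_agent, seed)
```

Because nothing survives between calls, the second trial sees exactly what the first one saw. Directory isolation alone does not address resource pressure such as CPU or memory contention, which needs separate handling.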

The original guide also mentions frameworks like Harbor, Braintrust, LangSmith, Arize, and Langfuse. We'll share our experience with them as we use them more in Memex's agent harness.

Why agent evals matter for Memex

Memex runs a multi-agent system on your phone: Card Agent, PKM Agent, Comment Agent, Insight Agent, plus any custom agents you build. Each one is autonomous, uses tools, and modifies state in your knowledge base. The Anthropic vocabulary maps cleanly onto how we think about quality: tasks for each kind of record, mixed graders for outcomes we can verify and outputs we can only judge, and pass^k-style reliability targets for the experiences users see every day.

Evals are the conversation we have with ourselves about what "better" means. The earlier you start that conversation, the less your agent feels like a slot machine and the more it feels like a teammate that improves every release.

For the full text, see the original Anthropic engineering post. If you want to see how we put this into practice, the Memex source is open and the engineering behind Memex covers the agent architecture in more depth.

FAQ: AI agent evaluations

What is an AI agent evaluation?

An agent evaluation is a programmatic test for an AI agent: give it an input task, let it run with its tools, and apply graders to its transcript and final state to measure success. Unlike single-turn LLM tests, agent evals have to handle multi-step tool use, state changes, and partial credit.

What is the difference between pass@k and pass^k?

pass@k measures the probability that at least one of k attempts succeeds; it goes up as k grows. pass^k measures the probability that all k attempts succeed; it goes down as k grows. pass@1 is what you usually optimize for first-try success, while pass^k matters when an agent must work reliably every time for end users.

Should I use code-based, model-based, or human graders?

Use all three, picking the best fit for each dimension. Code-based graders for outcomes you can verify deterministically. Model-based graders (LLM-as-judge) for open-ended outputs and tone. Human graders for calibrating model graders and resolving subjective edges. Anthropic recommends layering them rather than choosing one.
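
Layering graders also means deciding how their scores combine into one verdict: weighted, all-pass, or hybrid. The sketch below shows one plausible reading of each option. The grader names, weights, and thresholds are assumptions for illustration, and the hybrid rule (hard gates on objective checks, a weighted threshold on the rest) is our interpretation rather than a definition from the guide.

```python
def weighted(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average over the graders named in `weights`."""
    total = sum(weights.values())
    return sum(scores[name] * w for name, w in weights.items()) / total


def all_pass(scores: dict[str, float], threshold: float = 1.0) -> bool:
    """Every grader must reach the threshold."""
    return all(score >= threshold for score in scores.values())


def hybrid(
    scores: dict[str, float],
    must_pass: list[str],
    weights: dict[str, float],
    threshold: float = 0.7,
) -> bool:
    """Hard gates on `must_pass` graders, weighted threshold on the others."""
    if any(scores[name] < 1.0 for name in must_pass):
        return False
    return weighted(scores, weights) >= threshold


# Hypothetical scores: two code-based checks and one LLM-judged quality score.
scores = {"file_written": 1.0, "links_intact": 1.0, "insight_quality": 0.6}
avg = weighted(scores, {"file_written": 1, "links_intact": 1, "insight_quality": 2})
strict = all_pass(scores)
gated = hybrid(
    scores, ["file_written", "links_intact"], {"insight_quality": 1.0}, threshold=0.5
)
```

Here the weighted average is 0.8, the all-pass rule fails because the quality score is below 1.0, and the hybrid rule passes because both objective gates hold and the judged score clears its threshold.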

How many tasks do I need to start an agent eval suite?

20 to 50 tasks drawn from real failures is enough to start. Anthropic emphasizes the 80/20 rule here: a small, well-chosen set produces most of the early-iteration value. You can grow the suite as the agent matures, then promote stable tasks into a regression suite.

Why is pass@100 = 0% usually a sign of a broken task, not a weak model?

If a frontier model cannot solve a task in 100 tries, the most likely cause is that the task is ambiguous, the grader is too strict, or the harness blocks valid solutions. Anthropic recommends double-checking task specifications and graders before concluding the agent itself is at fault.