dart_agent_core Evals: Agent Evaluation for Dart and Flutter

2026-06-01

Anthropic's Claude engineering post "Demystifying evals for AI agents" gave the agent ecosystem a useful vocabulary: task, trial, grader, transcript, outcome, evaluation harness, agent harness, and evaluation suite. Those are not just words in a blog post for us. The new eval subsystem in dart_agent_core implements that Claude-style evaluation model directly in Dart.

For teams already using dart_agent_core, the practical point is that evals are now part of the same Dart package as the production agent scaffold. You can keep Claude's evaluation language while testing the Flutter and Dart runtime your users actually touch.

The goal is simple: if you are building a Flutter agent, you should be able to ask the same questions Claude's team asks about Claude Code. Did the agent complete the task, or did it merely say it did? Did a prompt change improve capability while hurting reliability? Did a model upgrade reduce tool calls, increase cost, or create a regression in a narrow bucket of tasks?

Why evals belong inside the Dart agent stack

Most agent evaluation tools assume a Python or TypeScript service. That makes sense for backend agents, but it is awkward for mobile-first systems. Memex agents run in a Flutter app, share Dart tools with the product, and mutate local files and local app state. Sending that through a separate eval framework would mean rebuilding the agent harness outside the environment users actually run.

package:dart_agent_core/eval.dart exists so the evaluation harness can sit next to the agent harness. The same StatefulAgent, Tool, AgentController, LLMClient, and local services used in production can be used in evals. That matches one of the most important Claude blog lessons: when you evaluate "an agent", you are evaluating the model and the scaffold together.

Claude's vocabulary, mapped to dart_agent_core

The eval subsystem intentionally uses the same mental model from the Claude article. The API names are boring on purpose:

Task maps to EvalTask: one input, metadata, success criteria, optional reference solution, and the graders that define passing.
Trial maps to Trial: one attempt at one task. A task can run multiple trials because agent behavior is non-deterministic.
Grader maps to Grader, with CodeGrader, ModelGrader, and HumanGrader base classes.
Transcript maps to Transcript: messages, tool calls, reasoning events, turn counts, token counts, and timing metrics captured from the shared controller.
Outcome maps to Outcome: the final environment state, workspace diff, and artifacts that prove what the agent actually changed.
Evaluation harness maps to EvalRunner: concurrency, setup, teardown, grading, aggregation, reporting, and trace export.
Agent harness maps to AgentHarnessFactory and AgentHarnessSession: your adapter from an eval task to the real agent under test.
Evaluation suite maps to EvalSuite: a capability, regression, or mixed set of tasks for one agent.

How to run your first dart_agent_core eval

A minimal eval follows the same shape as the Claude article: define the task, run multiple trials through the agent harness, grade the transcript and outcome, then aggregate the suite.

Create an EvalTask with the user input, metadata, and success criteria.
Build an EvalEnvironment that prepares the workspace or app state for each trial.
Implement an AgentHarnessFactory that starts the real Dart agent under test.
Attach CodeGrader, ModelGrader, or HumanGrader instances depending on what must be verified.
Run the EvalSuite with EvalRunner and compare pass@k, pass^k, cost, latency, and trace output.

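import 'package:dart_agent_core/eval.dart';

// SleepNoteOutcomeGrader, PkmEvalEnvironment, and PkmAgentHarnessFactory are
// assumed to be app-side classes; workspaceDir is defined elsewhere.
// The await below must run inside an async function.

// One task: the input, what success means, and the graders that decide it.
final task = EvalTask(
  id: 'create_sleep_note',
  input: 'Create a sleep note from this journal entry.',
  successCriteria: ['A note exists', 'The note is filed under Health'],
  graders: [SleepNoteOutcomeGrader()],
);

// Three trials per task, because agent behavior is non-deterministic.
final suite = EvalSuite(
  id: 'pkm_agent_smoke',
  tasks: [task],
  trialCount: 3,
);

// The runner prepares the environment, drives the real agent, and grades.
final report = await EvalRunner(
  environment: PkmEvalEnvironment(workspaceDir),
  harnessFactory: const PkmAgentHarnessFactory(),
).runSuite(runName: 'pkm_agent_smoke', suite: suite);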

Outcome first, transcript when it matters

The Claude blog draws a sharp line between what the agent says and what the world looks like after the trial. dart_agent_core keeps that line explicit. If a PKM agent says it filed a note under Areas/Health/Sleep.md, the grader should not trust the sentence. The harness should inspect the workspace, capture the updated files, extract the relevant fact_id, and put those facts into Outcome.environmentState.

The transcript is still first-class. Tool-use graders can inspect whether the agent called Read before Edit. Latency and token metrics can be tracked from the same trial. But the default bias is Claude's recommendation: grade the final state for final-state claims, and use the transcript for process claims.

final runner = EvalRunner(
  environment: CardAgentEvalEnvironment(suiteDir: suiteDir),
  harnessFactory: const CardAgentHarnessFactory(),
  // Write one trace per trial as JSONL and persist reports for later inspection.
  exporters: [JsonlTraceExporter(tracesFile)],
  reportStore: FileReportStore(reportsDir),
);

final report = await runner.runSuite(
  // A timestamped run name keeps reports from different runs apart.
  runName: 'card_agent_${DateTime.now().millisecondsSinceEpoch}',
  suite: suite,
  concurrency: 6,
);
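
The split between final-state claims and process claims can be sketched without the package at all. The example below is a minimal illustration in plain Dart: ToolCall, noteWasFiled, and readBeforeEdit are hypothetical names, not dart_agent_core APIs, and a real grader would presumably read the same facts from Outcome and Transcript rather than from the disk directly.

```dart
import 'dart:io';

/// Hypothetical tool-call record for this sketch; not a dart_agent_core type.
class ToolCall {
  final String name;
  final String path;
  const ToolCall(this.name, this.path);
}

/// Outcome check: inspect the workspace instead of trusting the agent's reply.
bool noteWasFiled(Directory workspace, String relativePath) {
  final file = File('${workspace.path}/$relativePath');
  return file.existsSync() && file.readAsStringSync().trim().isNotEmpty;
}

/// Process check: every Edit of a path must follow a Read of that same path.
bool readBeforeEdit(List<ToolCall> calls) {
  final seenReads = <String>{};
  for (final call in calls) {
    if (call.name == 'Read') seenReads.add(call.path);
    if (call.name == 'Edit' && !seenReads.contains(call.path)) return false;
  }
  return true;
}

void main() {
  final workspace = Directory.systemTemp.createTempSync('eval_sketch_');
  try {
    // Simulate the side effect the agent claims to have produced.
    File('${workspace.path}/Areas/Health/Sleep.md')
      ..createSync(recursive: true)
      ..writeAsStringSync('# Sleep\n- Slept 7h\n');

    print(noteWasFiled(workspace, 'Areas/Health/Sleep.md')); // true
    print(noteWasFiled(workspace, 'Areas/Health/Diet.md')); // false
    print(readBeforeEdit(
        const [ToolCall('Read', 'a.md'), ToolCall('Edit', 'a.md')])); // true
    print(readBeforeEdit(const [ToolCall('Edit', 'a.md')])); // false
  } finally {
    workspace.deleteSync(recursive: true);
  }
}
```

The first function answers a final-state question and never looks at what the agent said; the second answers a process question and never looks at the files. Keeping them separate makes it obvious which kind of claim each grader is entitled to judge.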

Three grader styles, not one magic score

Another Claude concept we copied directly is the separation between grader types. dart_agent_core supports all three:

Code-based graders for deterministic checks: exact outputs, file diffs, required tool calls, schema validity, state checks, and unit-test-like assertions.
Model-based graders for subjective dimensions: rubric scoring, quality checks, coherence, tone, and open-ended answers. These are LLM-as-judge graders, with an explicit Unknown escape hatch so the judge can decline to score rather than guess.
Human graders for calibration and expert review, because Claude's blog is right that LLM judges need to be checked against people.

Scores are not limited to pass/fail. A Score can carry a 0 to 1 value, sub-assertions, rationale, and metadata. That lets an eval distinguish "wrong file" from "right area, wrong specific page", which is exactly the partial-credit pattern Anthropic recommends for agent tasks with multiple components.
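
One way to picture partial credit is a grader that scores a filed path component by component. This is a sketch under stated assumptions: SketchScore and gradeFiledPath are hypothetical names invented for illustration, not the package's Score class or any bundled grader, and the scoring rule (fraction of leading path components that match) is just one reasonable choice.

```dart
/// Hypothetical score shape for this sketch; not the package's Score class.
class SketchScore {
  final double value; // 0.0 to 1.0
  final Map<String, bool> assertions;
  final String rationale;
  const SketchScore(this.value, this.assertions, this.rationale);
}

/// Scores where a note was filed, one sub-assertion per expected path part.
/// Sketch only: extra trailing components in [actual] are ignored.
SketchScore gradeFiledPath(String expected, String actual) {
  final want = expected.split('/');
  final got = actual.split('/');
  final assertions = <String, bool>{};
  var matchedSoFar = true;
  for (var i = 0; i < want.length; i++) {
    // A component only counts if every component before it also matched.
    matchedSoFar = matchedSoFar && i < got.length && got[i] == want[i];
    assertions['component ${i + 1} is ${want[i]}'] = matchedSoFar;
  }
  final matched = assertions.values.where((ok) => ok).length;
  return SketchScore(
    matched / want.length,
    assertions,
    'matched $matched of ${want.length} leading path components',
  );
}

void main() {
  const expected = 'Areas/Health/Sleep.md';
  for (final actual in [expected, 'Areas/Health/Diet.md', 'Projects/Trip.md']) {
    final score = gradeFiledPath(expected, actual);
    print('$actual -> ${score.value.toStringAsFixed(2)} (${score.rationale})');
  }
  // Areas/Health/Sleep.md -> 1.00 (matched 3 of 3 leading path components)
  // Areas/Health/Diet.md -> 0.67 (matched 2 of 3 leading path components)
  // Projects/Trip.md -> 0.00 (matched 0 of 3 leading path components)
}
```

The sub-assertions are what make the number useful: a 0.67 with "component 3" failing tells you the agent found the right area and picked the wrong page, which a bare pass/fail would hide.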

Capability suites and regression suites

The Claude article separates capability evals from regression evals. dart_agent_core makes that distinction part of EvalSuite. A capability suite asks whether the agent can solve a class of tasks at all, so a task passes if at least one trial passes. A regression suite asks whether the agent is still reliable, so every trial must pass.

This matters in practice. Memex uses capability suites for Card Agent and PKM Agent work because they reveal which failure buckets need better prompts, tools, or harnesses. Once a behavior stabilizes, the same kind of task can graduate into a regression suite that blocks obvious backsliding.
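
The two pass rules are small enough to write down directly. The sketch below uses hypothetical names (SuiteMode, taskPasses, suitePassRate) that are not taken from dart_agent_core; it only illustrates how the same trial results produce different task verdicts depending on the kind of suite.

```dart
/// Hypothetical enum for this sketch; not a dart_agent_core type.
enum SuiteMode { capability, regression }

/// Applies the suite's pass rule to one task's trial results.
bool taskPasses(SuiteMode mode, List<bool> trialResults) {
  if (trialResults.isEmpty) return false; // no trials means no evidence
  return mode == SuiteMode.capability
      ? trialResults.any((passed) => passed) // at least one trial passed
      : trialResults.every((passed) => passed); // every trial passed
}

/// Fraction of tasks that pass under the given mode.
double suitePassRate(SuiteMode mode, Map<String, List<bool>> trialsByTask) {
  if (trialsByTask.isEmpty) return 0;
  final passing =
      trialsByTask.values.where((trials) => taskPasses(mode, trials)).length;
  return passing / trialsByTask.length;
}

void main() {
  const results = {
    'create_sleep_note': [true, true, true],
    'merge_duplicate_cards': [true, false, true],
  };
  print(suitePassRate(SuiteMode.capability, results)); // 1.0
  print(suitePassRate(SuiteMode.regression, results)); // 0.5
}
```

A task like the flaky second one is a reasonable capability result and a regression failure at the same time, which is why the same task definition can live in either kind of suite at different points in its life.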

Non-determinism: pass@k and pass^k

Claude's eval blog spends real time on non-determinism, especially pass@k and pass^k. The framework reports both. pass@k answers: did at least one of the trials succeed? pass^k answers: did every trial succeed? Capability work often starts with pass@k. User-facing reliability eventually needs pass^k.

The difference is not academic. A mobile agent that correctly organizes a record one out of two times is promising during development, but unacceptable as a daily product behavior. Reporting both metrics keeps the team honest about whether it is measuring "can do it once" or "can do it every time".
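
The arithmetic behind that gap is worth seeing once. If each trial succeeds independently with probability p (an assumption; real trials may be correlated), then the chance that at least one of k trials succeeds is 1 - (1 - p)^k, and the chance that all k succeed is p^k. The function names below are hypothetical and are not the package's reporting API.

```dart
import 'dart:math';

/// Probability that at least one of k independent trials succeeds.
double passAtK(double p, int k) => 1 - pow(1 - p, k).toDouble();

/// Probability that all k independent trials succeed.
double passHatK(double p, int k) => pow(p, k).toDouble();

void main() {
  const p = 0.5; // the "one out of two times" agent
  for (final k in [1, 2, 3]) {
    print('k=$k  pass@k=${passAtK(p, k).toStringAsFixed(3)}  '
        'pass^k=${passHatK(p, k).toStringAsFixed(3)}');
  }
  // k=1  pass@k=0.500  pass^k=0.500
  // k=2  pass@k=0.750  pass^k=0.250
  // k=3  pass@k=0.875  pass^k=0.125
}
```

The two metrics start equal at k = 1 and move in opposite directions as k grows, so more trials make a capability number look better and a reliability number look worse for the same underlying agent.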

Record, replay, trace, inspect

Agent evals get expensive and noisy if every CI run calls a live model. dart_agent_core includes RecordingLLMClient and ReplayLLMClient, backed by request hashing and a file recording store. In record mode, the suite calls the real model and stores request/response pairs. In replay mode, it runs without model calls and fails on cache misses.
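
The record and replay idea can be sketched in a few lines. Everything here is a stand-in and an assumption: ModelCall, requestKey, and RecordReplay are hypothetical names, the store is an in-memory map rather than a file store, and the hash is a simple 32-bit FNV-1a chosen to keep the example free of dependencies. It does not describe how RecordingLLMClient or ReplayLLMClient are actually implemented.

```dart
import 'dart:convert';

/// Hypothetical stand-in for a model call; not the package's LLMClient.
typedef ModelCall = Future<String> Function(Map<String, Object?> request);

/// 32-bit FNV-1a over the UTF-8 bytes of the JSON-encoded request.
/// A real store would likely use a stronger hash and a canonical key order.
String requestKey(Map<String, Object?> request) {
  var hash = 0x811c9dc5;
  for (final byte in utf8.encode(jsonEncode(request))) {
    hash ^= byte;
    hash = (hash * 0x01000193) & 0xffffffff;
  }
  return hash.toRadixString(16).padLeft(8, '0');
}

class RecordReplay {
  /// In-memory stand-in for a file-backed recording store.
  final Map<String, String> store = {};

  /// Record mode: call the live model and remember the response.
  Future<String> record(ModelCall live, Map<String, Object?> request) async {
    final response = await live(request);
    store[requestKey(request)] = response;
    return response;
  }

  /// Replay mode: never call a model; a missing recording is an error.
  String replay(Map<String, Object?> request) {
    final key = requestKey(request);
    final hit = store[key];
    if (hit == null) throw StateError('No recording for request $key');
    return hit;
  }
}

Future<void> main() async {
  final cache = RecordReplay();
  final request = {'model': 'demo', 'prompt': 'File this note'};
  await cache.record((_) async => 'filed under Health', request);

  print(cache.replay(request)); // filed under Health
  try {
    cache.replay({'model': 'demo', 'prompt': 'Something new'});
  } on StateError catch (e) {
    print('cache miss: ${e.message}');
  }
}
```

Failing loudly on a miss is the important design choice: a replay run that silently fell back to a live model would reintroduce the cost and noise the recording was meant to remove.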

Traces can be written as JSONL, exported to Langfuse, or inspected later with the bundled transcript CLI:

dart run dart_agent_core:transcripts list --store .eval_reports
dart run dart_agent_core:transcripts show --store .eval_reports \
  --trial demo_run_1/task_multi_step#0
dart run dart_agent_core:transcripts diff --store .eval_reports \
  --task task_multi_step --runs demo_run_1,demo_run_2

This is another place where the implementation follows the Claude blog closely: do not trust aggregate scores alone. Read transcripts. If a trial failed, the transcript should make the failure feel fair.

Suite health and judge calibration

Claude's article warns that eval suites can saturate, drift, or accidentally measure the wrong thing. The framework includes SuiteHealthAnalyzer for cross-run health: graduation candidates, broken-task detection, and signals that a capability suite has saturated and no longer shows measurable improvement from run to run.

For LLM-as-judge workflows, JudgeCalibrator compares judge scores against a human-labeled golden set and reports Spearman and Pearson correlation along with mean absolute error (MAE). The point is not to pretend model graders are perfect. The point is to make the gap between Claude-style model judging and human judgment visible enough to manage.
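
For readers who want to see what those three numbers measure, here is a minimal, dependency-free sketch. The function names and the sample scores are made up for illustration; this is not JudgeCalibrator's implementation, and it assumes both lists have the same length and are not constant.

```dart
import 'dart:math';

double mean(List<double> xs) => xs.reduce((a, b) => a + b) / xs.length;

/// Pearson correlation: linear agreement between two score lists.
double pearson(List<double> x, List<double> y) {
  final mx = mean(x), my = mean(y);
  var cov = 0.0, vx = 0.0, vy = 0.0;
  for (var i = 0; i < x.length; i++) {
    cov += (x[i] - mx) * (y[i] - my);
    vx += (x[i] - mx) * (x[i] - mx);
    vy += (y[i] - my) * (y[i] - my);
  }
  return cov / sqrt(vx * vy);
}

/// 1-based ranks; tied values share the average of their positions.
List<double> ranks(List<double> xs) {
  final order = List<int>.generate(xs.length, (i) => i)
    ..sort((a, b) => xs[a].compareTo(xs[b]));
  final result = List<double>.filled(xs.length, 0.0);
  var i = 0;
  while (i < order.length) {
    var j = i;
    while (j + 1 < order.length && xs[order[j + 1]] == xs[order[i]]) {
      j++;
    }
    final averageRank = (i + j) / 2 + 1;
    for (var k = i; k <= j; k++) {
      result[order[k]] = averageRank;
    }
    i = j + 1;
  }
  return result;
}

/// Spearman correlation: Pearson correlation computed on the ranks.
double spearman(List<double> x, List<double> y) =>
    pearson(ranks(x), ranks(y));

/// Mean absolute error between judge and human scores.
double mae(List<double> x, List<double> y) {
  var total = 0.0;
  for (var i = 0; i < x.length; i++) {
    total += (x[i] - y[i]).abs();
  }
  return total / x.length;
}

void main() {
  // Made-up scores for illustration only.
  const human = [0.0, 0.5, 1.0, 1.0];
  const judge = [0.1, 0.4, 0.8, 1.0];
  print('pearson  ${pearson(judge, human).toStringAsFixed(3)}');
  print('spearman ${spearman(judge, human).toStringAsFixed(3)}');
  print('mae      ${mae(judge, human).toStringAsFixed(3)}'); // mae      0.100
}
```

The three metrics answer different questions: Spearman asks whether the judge orders items the way humans do, Pearson asks whether the scores move together linearly, and MAE asks how far off the judge is on average in absolute terms. A judge can rank well and still be systematically harsh, which only MAE would show.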

What this gives Flutter agent teams

dart_agent_core started as the agent loop Memex needed on mobile. The eval subsystem is the same bet applied to quality: mobile and local-first agents need production-shaped evals in the same language as the product.

The design is openly indebted to Anthropic's Claude eval writing. We did not try to invent new terminology. We took the Claude blog's concepts, made them concrete as Dart APIs, and added the pieces Flutter teams need: local workspaces, controller-based transcript recording, record/replay, JSON file-tree suites, Langfuse export, suite health, and judge calibration.

If you are building a Dart or Flutter agent, the package is on pub.dev, and the eval guide is in the dart_agent_core repository. For the conceptual source material, read Anthropic's original Claude agent eval guide and our earlier notes on why agent evals are hard.