How Memex Evaluates Card Agent and PKM Agent with dart_agent_core

2026-06-01

Memex has two very different agent evaluation problems. Card Agent turns a raw daily input into a timeline card, so many failures have a relatively objective answer: did it save a completed card, choose an acceptable template, and preserve key facts like time, URL, amount, place, or rating? PKM Agent organizes the same raw input into a PARA knowledge base, where there is often no single perfect answer. The right file depends on the existing workspace, the user's history, and whether the new fact should be persisted at all.

That split is why Memex uses dart_agent_core evals in two styles. Card Agent is mostly outcome checking. PKM Agent combines outcome checking, trajectory checking, partial credit, and LLM-as-judge rubrics. Both suites run the production Dart agent codepath, not a simplified benchmark harness.

What Card Agent actually does

Card Agent is the visual interpretation layer for raw input. When a user publishes "Standup with the eval team Thursday 10:30am, Zoom", the agent should create a timeline card that is complete, has a title, contains UI config, keeps the fact id, and chooses a suitable template such as event. For a short mood entry it may choose mood, compact, or snapshot. For a URL, quote, metric, purchase, rating, place, person, routine, or task, the acceptable template set changes.

In production the Card Agent path goes through processWithCardAgent, merges asset analysis and location reminders into the raw fact, formats a user message with publish time and fact_id, and then runs CardAgent.runWithContent. The agent is not asked to return JSON in chat. It must call the timeline card skill and persist a card file. That distinction matters for evals: we grade the saved card, not the agent's explanation.

Card Agent: evals with mostly exact answers

The Card Agent suite currently has 25 capability tasks covering positive, negative, and ambiguous buckets. Each task seeds a small workspace fixture, runs the production card pipeline, then captures an Outcome.environmentState with fields like is_complete, status, template_ids, card_tags, and the raw saved YAML blob.

Completion: card_completion checks that the save tool succeeded, a card file exists, status is completed, the title exists, and at least one UI config exists.
Template choice: card_template_choice gives full credit when the primary template is acceptable, half credit when the acceptable template appears secondarily, and zero when no acceptable template appears.
Fact preservation: card_must_contain scans the saved YAML for required substrings such as 10:30, zoom, URLs, names, amounts, or quoted text.
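
To make the scoring rules above concrete, here is a minimal sketch of the template and substring checks. It is written in Python for brevity and is not the Dart implementation. The function signatures, the assumption that the first entry of template_ids is the primary template, and the case-insensitive substring match are all illustrative guesses, not confirmed behavior of the real graders.

```python
from typing import List, Sequence


def card_template_choice(template_ids: Sequence[str], expected: Sequence[str]) -> float:
    """Score template choice: 1.0 primary match, 0.5 secondary match, 0.0 otherwise.

    Assumption: template_ids[0] is the card's primary template.
    """
    if not template_ids:
        return 0.0
    acceptable = set(expected)
    if template_ids[0] in acceptable:
        return 1.0
    if any(t in acceptable for t in template_ids[1:]):
        return 0.5
    return 0.0


def card_must_contain(saved_yaml: str, substrings: Sequence[str]) -> List[str]:
    """Return the required substrings that are missing from the saved card.

    Assumption: matching is case-insensitive, so "zoom" matches "Zoom".
    """
    haystack = saved_yaml.lower()
    return [s for s in substrings if s.lower() not in haystack]
```

An empty list from card_must_contain means every required fact survived; returning the missing items instead of a bare boolean gives the grader rationale something specific to report.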

This is the easy side of agent evaluation. The task can say: for this input, event is correct, 10:30 must survive, and the saved card must be complete. There is still ambiguity, so the suite allows sets of acceptable templates. But the grader can stay mostly deterministic.

{
  "id": "card_event_meeting_with_time",
  "input": {
    "fact_id": "2026/05/25.md#ts_1",
    "content": "Standup with the eval team Thursday 10:30am, Zoom."
  },
  "graders": [
    { "name": "card_completion" },
    {
      "name": "card_template_choice",
      "config": { "expected_template_ids": ["event"] }
    },
    {
      "name": "card_must_contain",
      "config": { "substrings": ["10:30", "zoom"] }
    }
  ]
}

What PKM Agent actually does

PKM Agent is not a classifier. It is a local knowledge organizer. It reads the existing PKM/ tree, applies the PARA method, decides whether the new raw input belongs under Projects, Areas, Resources, or Archives, writes or edits Markdown files, preserves the current fact_id, and updates the timeline card insight. It can also skip persistence when the input is trivial or the user explicitly says not to save it.

The production agent has access to file tools such as Read, BatchRead, Write, Edit, Move, Remove, LS, Glob, and Grep. It also receives system reminders when files are too large, directories are fragmented, names look date-based, or recent edits suggest a file is becoming an accidental log. Those reminders make the job more than "put text in a folder"; the agent has to maintain a living knowledge base.

PKM Agent: evals without one perfect answer

The PKM Agent suite currently has 32 capability tasks. The fixtures are much richer than Card Agent's: some start from a fresh workspace, some from a realistic starter PKM, some from a novelist's workspace, some from a researcher's workspace, and some are bilingual. The hard cases are exactly the cases a deterministic answer key would oversimplify.

Routing with partial credit: pkm_routed_correctly can give 1.0 for editing the exact expected file, 0.5 for staying in the right PARA bucket, and 0.0 for writing somewhere unrelated.
Workflow completion: pkm_completion accepts either a valid persist path, meaning PARA write plus card insight update, or a clean skip path with no PKM mutation.
Trajectory constraints: pkm_read_before_write checks that the agent read an existing target file before editing it, so it does not blindly overwrite user knowledge.
Regression safety: pkm_no_overwrite looks for a seeded marker anywhere under PKM/, allowing consolidation but rejecting destructive edits.
Model-based quality: pkm_insight_quality grades groundedness, non-redundancy, and expression quality. pkm_append_coherence grades whether an edit blends with the existing Markdown file's voice and structure.
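
The partial-credit routing rule can be sketched as a small function. This is a Python illustration under stated assumptions, not the Dart grader: the signature is hypothetical, paths are assumed to be relative to PKM/, and a bucket is assumed to match by path prefix.

```python
from typing import Iterable, Sequence


def pkm_routed_correctly(
    touched_files: Iterable[str],
    expected_files: Sequence[str],
    expected_buckets: Sequence[str],
) -> float:
    """Score routing: 1.0 exact expected file, 0.5 right bucket, 0.0 elsewhere.

    Assumption: touched_files lists every PKM file the agent wrote or edited,
    as paths relative to PKM/.
    """
    touched = list(touched_files)
    exact = set(expected_files)
    if any(path in exact for path in touched):
        return 1.0
    # Assumption: a bucket is a directory prefix such as "Areas/Health/".
    if any(path.startswith(bucket) for path in touched for bucket in expected_buckets):
        return 0.5
    return 0.0
```

With this shape, an agent that writes Areas/Health/Energy.md when Areas/Health/Sleep.md was expected still earns 0.5, which separates "wrong file, right neighborhood" from a miss.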

This is where dart_agent_core matters. The framework gives us a single trial with both a transcript and an outcome. Code graders can inspect file diffs and tool calls; model graders can judge quality dimensions that are real but not exactly enumerable. The suite can be strict where the product needs determinism and flexible where humans would accept more than one good answer.

{
  "id": "pkm_starter_route_into_health_sleep",
  "input": {
    "content": "Slept 5.5h, energy crashed at 2pm...",
    "base_fixture": "_starter_pkm"
  },
  "graders": [
    { "name": "pkm_completion" },
    {
      "name": "pkm_routed_correctly",
      "config": {
        "expected_buckets": ["Areas/Health/"],
        "expected_files": ["Areas/Health/Sleep.md"]
      }
    },
    { "name": "pkm_no_overwrite" },
    { "name": "pkm_insight_quality" },
    { "name": "pkm_append_coherence" }
  ]
}
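
The read-before-write trajectory constraint described earlier is a check over the ordered tool calls in the transcript. The sketch below is a Python illustration only. The tool names come from the list of file tools above, but the record shape (a dict with "tool" and "paths" keys) and the idea of passing in the set of seeded files are assumptions made for the example, not the dart_agent_core transcript format.

```python
from typing import Dict, List, Sequence, Set

READ_TOOLS = {"Read", "BatchRead"}
EDIT_TOOLS = {"Edit", "Write"}


def read_before_write_violations(
    tool_calls: Sequence[Dict], existing_files: Set[str]
) -> List[str]:
    """Return seeded files that were edited before the agent ever read them.

    Assumption: each call looks like {"tool": "Edit", "paths": ["Areas/..."]}.
    Files the agent creates itself are not in existing_files, so they are exempt.
    """
    seen: Set[str] = set()
    violations: List[str] = []
    for call in tool_calls:
        paths = call.get("paths", [])
        if call["tool"] in READ_TOOLS:
            seen.update(paths)
        elif call["tool"] in EDIT_TOOLS:
            for path in paths:
                if path in existing_files and path not in seen and path not in violations:
                    violations.append(path)
    return violations
```

An empty result means the trajectory respected the constraint; a non-empty one names the files that were modified blind.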

Why both suites are capability suites

Both suites are marked as capability suites, not hard regression gates. That is deliberate. Early agent evals should expose slope: which failure buckets are getting better, which prompts made the agent brittle, and which task specs are too ambiguous. A single miss should not make the test red if the harness is healthy. The tests only fail when every trial errors, which points to setup or infrastructure.
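
The "only fail when every trial errors" rule can be sketched as follows. This is a Python illustration, not the actual test code; the trial record shape and the status values "passed", "failed", and "error" are hypothetical.

```python
from typing import Dict, Sequence


def suite_should_fail(trials: Sequence[Dict]) -> bool:
    """Fail the test only when every trial errored (setup or infrastructure).

    Assumption: each trial carries a "status" of "passed", "failed", or "error".
    An empty run also counts as a failure here, because all() is true for an
    empty sequence; that choice is part of this sketch, not the source.
    """
    return all(trial["status"] == "error" for trial in trials)


def pass_rate(trials: Sequence[Dict]) -> float:
    """Report the capability signal separately from the red/green decision."""
    if not trials:
        return 0.0
    return sum(1 for t in trials if t["status"] == "passed") / len(trials)
```

Keeping the two functions separate mirrors the intent: grading misses move the reported score, while only a broken harness turns the test red.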

Once a behavior stabilizes, a task bucket can graduate into a regression suite. For Card Agent, that might be "future dinner with date and place must become event". For PKM Agent, it might be "existing sleep fact must extend Areas/Health/Sleep.md without deleting the seed marker". Capability suites tell us where to work; regression suites tell us what must not break.

Using evals to improve prompts

The useful loop is not "rewrite the prompt until it feels better". The loop is: define a task set, run evals, give a coding agent the failures plus a prompt engineering manual, let it patch the prompt, then run evals again. That turns prompt editing into an engineering process with before/after measurements.

Define the task set: split tasks by failure bucket, such as template_event, decline_trivial, route_into_existing_subfile, skip_trivial, or cross_ref_paper_to_thesis.
Run the suite: collect pass/fail, partial scores, grader rationales, workspace diffs, and transcript traces.
Give the coding agent a prompt manual: include rules distilled from official guidance: be explicit, provide context and motivation, use consistent structure and delimiters, define ambiguous terms, put critical constraints early, control output verbosity, and validate against test cases.
Patch the prompt narrowly: change the Card Agent prompt only for the template or fact-preservation bucket that failed; change the PKM Agent prompt only for routing, read-before-write, skip, insight, or append-coherence failures.
Rerun and compare: look for score movement by bucket, not just one aggregate number. Reject prompt changes that improve one bucket while harming another.

This matches the spirit of the major vendor guidance. OpenAI describes prompt engineering as an iterative optimization process, and its prompting best practices push clear instructions, examples, validation, and iterative testing. Anthropic's Claude prompting guidance emphasizes explicit instructions, context, examples, and avoiding overfitting to tests. Google's Gemini prompt design guide emphasizes direct prompts, consistent structure, clear delimiters, parameter definitions, critical instructions near the front, and explicit control over agentic behavior.

# Prompt iteration loop
1. Run card_agent and pkm_agent capability suites.
2. Export failing trials, grader rationales, diffs, and traces.
3. Give the coding agent:
   - current prompt file
   - eval report
   - prompt engineering best-practice manual / skill
   - constraint: do not hard-code eval cases
4. Ask for the smallest prompt change that explains the failures.
5. Rerun the same suites.
6. Keep the change only if bucket scores improve without regressions.
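
Step 6 of the loop above is a per-bucket comparison, not an aggregate one. A minimal sketch of that acceptance rule, in Python and under assumptions: the function name, the dict-of-bucket-scores input, and the tolerance parameter are hypothetical, not part of dart_agent_core.

```python
from typing import Dict


def accept_prompt_change(
    before: Dict[str, float], after: Dict[str, float], tolerance: float = 0.0
) -> bool:
    """Keep a prompt change only if some bucket improves and none regresses.

    Assumption: scores are mean grader scores per failure bucket in [0, 1].
    A bucket missing from `after` is treated as a score of 0.0.
    """
    regressed = [b for b in before if after.get(b, 0.0) < before[b] - tolerance]
    improved = [b for b in before if after.get(b, 0.0) > before[b] + tolerance]
    return bool(improved) and not regressed


before = {"template_event": 0.6, "skip_trivial": 0.9}
print(accept_prompt_change(before, {"template_event": 0.8, "skip_trivial": 0.9}))
print(accept_prompt_change(before, {"template_event": 0.8, "skip_trivial": 0.7}))
```

The first call accepts the change and the second rejects it, because skip_trivial regressed even though template_event improved. A nonzero tolerance is one way to avoid reacting to run-to-run noise.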

What the coding agent should be told

The prompt-improving coding agent needs constraints, otherwise it will optimize for the visible eval set. Anthropic explicitly warns against hard-coding to tests in its coding guidance, and the same rule applies to prompts. The instruction should be: improve the general behavior described by the failure bucket, not the literal sample string in task.json.

For Card Agent, prefer rules like "when an input has a concrete future time and meeting medium, bias toward event" over "if text contains Zoom, choose event".
For PKM Agent, prefer rules like "read candidate existing files before editing and append in their existing voice" over "always put sleep facts in Sleep.md".
For insights, prefer rubric language: grounded in the user's PKM, non-redundant, concise, and not a report-like evidence dump.

The real goal

The point of these evals is not to make Card Agent and PKM Agent look good on a leaderboard. The point is to make prompt changes auditable. Card Agent shows how far deterministic outcome graders can take you when the product has a relatively clear answer. PKM Agent shows where agent quality needs partial credit, trajectory checks, and calibrated model judgment.

Once those suites exist, prompt optimization becomes a loop the team can repeat: measure, inspect, revise, rerun. That is the difference between "this prompt feels better" and "this prompt improved routing into existing files without making trivial inputs persist by accident".