How Memex evaluates the Card Agent and PKM Agent with dart_agent_core

2026-06-01

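Memex has two Agent evaluations of different character. The Card Agent turns daily raw input into timeline cards, so most of its failures can be judged fairly clearly: was a completed card saved, is the template appropriate, and did key details such as times, URLs, amounts, places, and ratings survive? The PKM Agent, by contrast, organizes the same raw input into a PARA knowledge base. That is hard to reduce to a single correct answer, because the right file depends on the existing workspace, the user's history, and whether the input should be persisted at all.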

For that reason, Memex uses dart_agent_core evals in two styles. The Card Agent suite looks mainly at outcomes. The PKM Agent suite combines outcomes, transcripts, partial credit, and LLM-as-judge rubrics. Both suites run the production Dart agent codepath.

What the Card Agent actually does

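The Card Agent is the visual interpretation layer for raw input. For an input that contains a meeting time and Zoom, it has to create a complete timeline card, preserve the fact_id, give the card a title and UI config, and choose an appropriate template such as event. For moods, URLs, quotes, metrics, purchases, ratings, and similar inputs, the set of acceptable templates changes.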

Card Agent: an eval where the right answer is relatively easy to pin down

The Card Agent suite has 25 capability tasks covering positive, negative, and ambiguous buckets. Each task sets up a workspace fixture, runs the production card pipeline, and records is_complete, status, template_ids, card_tags, and the saved YAML blob in Outcome.environmentState.

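card_completion checks the save tool, the card file, the completed status, the title, and the UI config.
card_template_choice scores template selection: full marks if the primary template matches, half if an acceptable template appears as a secondary, and zero otherwise.
card_must_contain inspects the saved YAML to confirm that key details such as times, URLs, names, amounts, and quotes were not dropped.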

{
  "id": "card_event_meeting_with_time",
  "input": {
    "fact_id": "2026/05/25.md#ts_1",
    "content": "Standup with the eval team Thursday 10:30am, Zoom."
  },
  "graders": [
    { "name": "card_completion" },
    {
      "name": "card_template_choice",
      "config": { "expected_template_ids": ["event"] }
    },
    {
      "name": "card_must_contain",
      "config": { "substrings": ["10:30", "zoom"] }
    }
  ]
}
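
The scoring rules for card_template_choice and card_must_contain can be sketched as follows. This is a minimal Python illustration, not the dart_agent_core implementation: the function names are hypothetical, treating the first entry of template_ids as the primary template is an assumption, and so is the case-insensitive substring match (suggested, but not confirmed, by the lowercase "zoom" in the task above).

```python
from typing import Iterable, Sequence


def score_template_choice(chosen: Sequence[str], expected: Iterable[str]) -> float:
    """Score a card's template ids against the accepted set.

    Assumption: chosen[0] is the primary template, the rest are secondaries.
    """
    accepted = set(expected)
    if not chosen:
        return 0.0
    if chosen[0] in accepted:
        return 1.0  # primary template matches: full marks
    if any(template in accepted for template in chosen[1:]):
        return 0.5  # accepted template only appears as a secondary: half
    return 0.0


def missing_substrings(saved_yaml: str, substrings: Iterable[str]) -> list[str]:
    """Return the required substrings that are absent from the saved YAML.

    Assumption: matching is case-insensitive.
    """
    haystack = saved_yaml.lower()
    return [s for s in substrings if s.lower() not in haystack]
```

Under these assumptions, a card saved with templates ["note", "event"] would score 0.5 against the task above, and card_must_contain would pass only when missing_substrings returns an empty list.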

What the PKM Agent actually does

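The PKM Agent is not a classifier; it is a local knowledge organizer. It reads the existing PKM/ tree, decides under the PARA method whether the input belongs in Projects, Areas, Resources, or Archives, writes or edits Markdown, preserves the fact_id, and updates the timeline card insight. Inputs that are too slight, or that the user asked not to save, have to be skipped cleanly.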

PKM Agent: an eval without a single right answer

The PKM Agent suite has 32 capability tasks. Its fixtures are much richer than the Card Agent's and include a fresh workspace, a starter PKM, a novelist's workspace, a researcher's workspace, and bilingual cases. Forcing these into a single answer key would distort the metric rather than sharpen it.

pkm_routed_correctly gives partial credit for routing: 1.0 for an expected file, 0.5 for the correct PARA bucket, and 0 for an unrelated location.
pkm_completion accepts either the persist path (a PARA write plus a card insight update) or the clean skip path with no mutation.
pkm_read_before_write checks that the agent read an existing file before editing it.
pkm_no_overwrite looks for seed markers, allowing reorganization while rejecting destructive edits.
pkm_insight_quality and pkm_append_coherence use an LLM-as-judge to assess the quality of the insight and of the append.

{
  "id": "pkm_starter_route_into_health_sleep",
  "input": {
    "content": "Slept 5.5h, energy crashed at 2pm...",
    "base_fixture": "_starter_pkm"
  },
  "graders": [
    { "name": "pkm_completion" },
    {
      "name": "pkm_routed_correctly",
      "config": {
        "expected_buckets": ["Areas/Health/"],
        "expected_files": ["Areas/Health/Sleep.md"]
      }
    },
    { "name": "pkm_no_overwrite" },
    { "name": "pkm_insight_quality" },
    { "name": "pkm_append_coherence" }
  ]
}
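
The partial-credit routing rule and the read-before-write check can be sketched like this. It is a hedged Python illustration rather than the actual graders: the function names, the prefix-based bucket match, the best-of aggregation across several written files, and the "read" / "edit" tool names in the transcript are all assumptions.

```python
from typing import Iterable, Sequence


def score_routing(written: Iterable[str], expected_files: Iterable[str],
                  expected_buckets: Iterable[str]) -> float:
    """Return 1.0 for an expected file, 0.5 for an expected bucket, else 0.0.

    Assumptions: a bucket matches by path prefix, and the best-scoring
    written path determines the score.
    """
    files = set(expected_files)
    buckets = tuple(expected_buckets)
    best = 0.0
    for path in written:
        if path in files:
            return 1.0
        if buckets and path.startswith(buckets):
            best = 0.5
    return best


def edits_without_prior_read(tool_calls: Sequence[tuple[str, str]]) -> list[str]:
    """Return paths that were edited before being read.

    tool_calls is an ordered list of (tool_name, path) pairs taken from the
    transcript; the tool names "read" and "edit" are hypothetical.
    """
    seen: set[str] = set()
    violations: list[str] = []
    for tool, path in tool_calls:
        if tool == "read":
            seen.add(path)
        elif tool == "edit" and path not in seen:
            violations.append(path)
    return violations
```

With the task above, writing to Areas/Health/Energy.md would earn 0.5 under this sketch, while a write under Resources/ would earn 0. The second function illustrates why this check needs the transcript and not just the final workspace state: the ordering of tool calls is not visible in the outcome.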

Why both are capability suites

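Both suites start out as capability suites. In early evals, the slope matters more than red or green: the point is to see which failure buckets are improving, which prompts made the agent brittle, and which task specs are too ambiguous. Buckets that stabilize can later be promoted to a regression suite.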

Improving prompts with evals

The useful loop is not rewriting the prompt by feel. It is defining a task set, running the eval, handing the failures and a prompt engineering manual to a coding agent, making the smallest prompt change, and running the same eval again.

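Split the task set by failure bucket.
Run the suite and collect pass/fail results, partial scores, grader rationales, workspace diffs, and transcript traces.
Give the coding agent a prompt manual that covers clear instructions, sufficient context, stable structure and delimiters, term definitions, key constraints placed up front, and test cases for verification.
Keep prompt changes narrow. For the Card Agent, focus on templates and fact preservation; for the PKM Agent, focus on routing, read-before-write, skip, insight, and append coherence.
Rerun and compare bucket by bucket. Do not accept a change that improves one bucket while breaking another.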

OpenAI prompting best practices · Claude prompting guidance · Gemini prompt design guide

# Prompt iteration loop
1. Run card_agent and pkm_agent capability suites.
2. Export failing trials, grader rationales, diffs, and traces.
3. Give the coding agent:
   - current prompt file
   - eval report
   - prompt engineering best-practice manual / skill
   - constraint: do not hard-code eval cases
4. Ask for the smallest prompt change that explains the failures.
5. Rerun the same suites.
6. Keep the change only if bucket scores improve without regressions.
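
Step 6 of the loop above can be expressed as a small acceptance rule. The sketch below is an assumption about how such a gate might look, not part of dart_agent_core: the function name, the per-bucket score dictionaries, and the tolerance parameter are all hypothetical.

```python
from typing import Mapping


def keep_change(before: Mapping[str, float], after: Mapping[str, float],
                tolerance: float = 0.0) -> bool:
    """Accept a prompt change only if some bucket improves and none regresses.

    before/after map failure-bucket names to mean scores from the same suite.
    tolerance is a hypothetical noise margin; score differences within it
    count as unchanged.
    """
    if set(before) != set(after):
        raise ValueError("both runs must report the same buckets")
    deltas = {bucket: after[bucket] - before[bucket] for bucket in before}
    improved = any(delta > tolerance for delta in deltas.values())
    regressed = any(delta < -tolerance for delta in deltas.values())
    return improved and not regressed
```

Comparing per bucket rather than on one aggregate score is the design choice that matters here: an aggregate can rise while a single bucket quietly gets worse.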

Constraints to give the coding agent

The coding agent that fixes the prompt needs constraints. Without them, it overfits to the visible eval set. What to ask for is not hard-coding strings from task.json but improving the general behavior that a failure bucket points to.

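For the Card Agent, teach the rule that an input with a concrete future time and a meeting medium should lean toward event, not a rule keyed on the string Zoom.
For the PKM Agent, teach it to read the candidate files and then append in the existing voice, not to always put sleep facts into Sleep.md.
For the insight prompt, have it use the rubric language: grounded, non-redundant, concise.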
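
One cheap guard against hard-coding is to scan the revised prompt for verbatim eval inputs. The sketch below is a hypothetical helper, not something the source describes Memex doing; the function name, the whitespace and case normalization, and the min_len threshold are assumptions. It only catches literal copies and says nothing about subtler overfitting.

```python
from typing import Iterable


def leaked_eval_strings(prompt_text: str, task_inputs: Iterable[str],
                        min_len: int = 20) -> list[str]:
    """Return eval task inputs that appear verbatim in the prompt.

    Comparison ignores case and collapses whitespace. Inputs shorter than
    min_len characters are skipped to avoid flagging common short phrases.
    """
    prompt = " ".join(prompt_text.lower().split())
    leaks: list[str] = []
    for content in task_inputs:
        normalized = " ".join(content.lower().split())
        if len(normalized) >= min_len and normalized in prompt:
            leaks.append(content)
    return leaks
```

A non-empty result would be a reason to reject the prompt change before rerunning the suites.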

The real goal

These evals are not for a leaderboard. They exist to make prompt changes auditable. The Card Agent suite shows how far deterministic outcome graders can go, and the PKM Agent suite marks the boundary where partial credit, trajectory checks, and calibrated model judgment become necessary.