How Memex Uses dart_agent_core to Evaluate the Card Agent and the PKM Agent

2026-06-01

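Memex has two very different agent evaluation problems. The Card Agent turns a raw record into a timeline card, so many of its failures can be judged against a fairly definite answer: Was the complete card saved? Was the right template chosen? Were key details such as time, URL, amount, location, and rating preserved? The PKM Agent handles a different kind of problem. It has to organize the same raw record into a PARA knowledge base, where there is often no single correct answer.

So Memex built two kinds of eval with dart_agent_core. The Card Agent eval looks mainly at outcome; the PKM Agent eval looks at outcome, transcript, partial credit, and an LLM-as-judge rubric together. Both suites run the real production code path rather than a simplified harness written just for the benchmark.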

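What the Card Agent actually does

The Card Agent is the visual interpretation layer for raw input. When a user posts a meeting note that includes a time and Zoom, it should create a complete timeline card, preserve the fact_id, generate a title and UI config, and choose a suitable template such as event. Different input types have different sets of acceptable templates.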

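Card Agent: evals with mostly definite answers

The Card Agent suite currently has 25 capability tasks covering positive, negative, and ambiguous buckets. Each task prepares a workspace fixture, runs the production card pipeline, and writes is_complete, status, template_ids, card_tags, and the saved YAML blob into Outcome.environmentState.

card_completion checks that the save tool succeeded, the card file exists, status is completed, and both title and UI config are present.
card_template_choice checks the template selection: a hit on the primary template earns full credit, an acceptable template that appears only in a secondary position earns half credit, and a complete miss earns zero.
card_must_contain scans the saved YAML to confirm that key details such as time, location, URL, names, amounts, or quoted text were not lost.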

{
  "id": "card_event_meeting_with_time",
  "input": {
    "fact_id": "2026/05/25.md#ts_1",
    "content": "Standup with the eval team Thursday 10:30am, Zoom."
  },
  "graders": [
    { "name": "card_completion" },
    {
      "name": "card_template_choice",
      "config": { "expected_template_ids": ["event"] }
    },
    {
      "name": "card_must_contain",
      "config": { "substrings": ["10:30", "zoom"] }
    }
  ]
}
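
The scoring rules for card_template_choice and card_must_contain can be sketched in a few lines. This is a minimal Python illustration, not the dart_agent_core API or the Memex implementation: the function signatures are hypothetical, the sketch assumes template_ids is ordered with the primary template first, and it assumes substring matching is case-insensitive (the task above expects "zoom" while the input says "Zoom").

```python
from typing import Sequence


def card_template_choice(template_ids: Sequence[str],
                         expected_template_ids: Sequence[str]) -> float:
    """Return 1.0 for a primary hit, 0.5 for a secondary-only hit, else 0.0.

    Assumption: template_ids[0] is the primary template.
    """
    if not template_ids:
        return 0.0
    accepted = set(expected_template_ids)
    if template_ids[0] in accepted:
        return 1.0
    if any(t in accepted for t in template_ids[1:]):
        return 0.5
    return 0.0


def card_must_contain(saved_yaml: str, substrings: Sequence[str]) -> list[str]:
    """Return the required substrings missing from the saved YAML.

    An empty list means the grader passes.
    Assumption: matching is case-insensitive.
    """
    haystack = saved_yaml.casefold()
    return [s for s in substrings if s.casefold() not in haystack]
```

Returning the list of missing substrings instead of a bare boolean keeps the grader rationale readable when a trial fails.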

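What the PKM Agent actually does

The PKM Agent is not a classifier; it is a local knowledge organizer. It reads the existing PKM/ directory, uses the PARA method to decide whether new information belongs in Projects, Areas, Resources, or Archives, then writes or edits Markdown, preserves the fact_id, and updates the timeline card's insight. When the input is too slight, or the user explicitly says not to save it, the agent should skip persistence cleanly.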

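PKM Agent: evals without a single correct answer

The PKM Agent suite currently has 32 capability tasks, with fixtures far more complex than the Card Agent's: a brand-new workspace, a starter PKM, a novelist's knowledge base, a researcher's knowledge base, and bilingual scenarios. Forcing these cases into a single expected answer would distort the metrics.

pkm_routed_correctly gives partial credit for routing: 1.0 for hitting the specific file, 0.5 for staying in the correct PARA bucket, and 0 for writing to an unrelated location.
pkm_completion accepts two completion paths: valid persistence or a clean skip.
pkm_read_before_write checks whether the agent read the candidate file before editing it, so that blind writes do not overwrite the user's knowledge.
pkm_no_overwrite uses seed markers to check that fixture content is still present; merging and reorganizing are allowed, destructive edits are rejected.
pkm_insight_quality and pkm_append_coherence use LLM-as-judge to evaluate insight quality and whether appended content fits the style of the original file.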
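
The two non-LLM checks in this list, pkm_read_before_write and pkm_no_overwrite, lend themselves to a short sketch. The Python below is illustrative only and makes several assumptions that do not come from dart_agent_core: the transcript is modeled as a list of tool calls with hypothetical tool names (read_file, write_file, edit_file), only files that already exist in the fixture require a prior read, and a seed marker counts as preserved if it survives anywhere in the workspace.

```python
from typing import Iterable, Mapping, Sequence

# Hypothetical tool names; the real transcript schema may differ.
READ_TOOLS = {"read_file"}
WRITE_TOOLS = {"write_file", "edit_file"}


def read_before_write(tool_calls: Sequence[Mapping[str, str]],
                      existing_files: Iterable[str]) -> list[str]:
    """Return pre-existing paths that were written before being read."""
    existing = set(existing_files)
    seen_reads: set[str] = set()
    blind_writes: list[str] = []
    for call in tool_calls:
        path = call["path"]
        if call["tool"] in READ_TOOLS:
            seen_reads.add(path)
        elif call["tool"] in WRITE_TOOLS:
            # New files have nothing to read, so only existing ones count.
            if path in existing and path not in seen_reads:
                if path not in blind_writes:
                    blind_writes.append(path)
    return blind_writes


def no_overwrite(seed_markers: Sequence[str],
                 workspace_after: Mapping[str, str]) -> list[str]:
    """Return seed markers that no longer appear anywhere in the workspace.

    Searching every file (an assumption) lets content move during a merge
    without being flagged; only disappearance counts as destructive.
    """
    all_text = "\n".join(workspace_after.values())
    return [m for m in seed_markers if m not in all_text]
```

Both functions return the offending items, so an empty list is a pass and a non-empty one doubles as the grader rationale.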

{
  "id": "pkm_starter_route_into_health_sleep",
  "input": {
    "content": "Slept 5.5h, energy crashed at 2pm...",
    "base_fixture": "_starter_pkm"
  },
  "graders": [
    { "name": "pkm_completion" },
    {
      "name": "pkm_routed_correctly",
      "config": {
        "expected_buckets": ["Areas/Health/"],
        "expected_files": ["Areas/Health/Sleep.md"]
      }
    },
    { "name": "pkm_no_overwrite" },
    { "name": "pkm_insight_quality" },
    { "name": "pkm_append_coherence" }
  ]
}
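
The routing partial credit driven by expected_files and expected_buckets in the task above can be sketched as follows. This is a hedged Python illustration, not the actual grader: the function signature is hypothetical, buckets are treated as path prefixes, and the score takes the best result across everything the agent wrote, which is an assumption.

```python
from typing import Sequence


def pkm_routed_correctly(written_paths: Sequence[str],
                         expected_files: Sequence[str],
                         expected_buckets: Sequence[str]) -> float:
    """Return 1.0 for an exact file hit, 0.5 for a bucket hit, else 0.0."""
    targets = set(expected_files)
    if any(path in targets for path in written_paths):
        return 1.0
    # Buckets such as "Areas/Health/" are matched as path prefixes.
    if any(path.startswith(bucket)
           for path in written_paths
           for bucket in expected_buckets):
        return 0.5
    return 0.0


# Usage with the config from the task above.
score = pkm_routed_correctly(
    ["Areas/Health/Energy.md"],
    expected_files=["Areas/Health/Sleep.md"],
    expected_buckets=["Areas/Health/"],
)
```

Here the agent stayed inside Areas/Health/ but picked a different file, so the call falls through to the bucket branch instead of failing outright.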

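Why both are capability suites

Both suites start as capability suites rather than hard regression gates. In early evals the slope matters more: which failure buckets improved, which prompts made the agent more brittle, and which task specs are themselves too vague. Once behavior stabilizes, the key buckets get promoted into a regression suite.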

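Using evals to optimize prompts

The useful loop is not "edit until it feels better". It is: define a task set, run the eval, hand the failing samples and a prompt engineering handbook to a coding agent, have it modify the prompt in small steps following best practices, and then keep running the same eval.

Define the task set by failure bucket.
Run the suite and collect pass/fail, partial scores, grader rationales, workspace diffs, and transcript traces.
Give the coding agent a prompt handbook: make instructions explicit, provide enough context and motivation, keep structure and delimiters stable, define terms, and put key constraints first.
Make only narrow changes: for the Card Agent, target template or fact-preservation problems; for the PKM Agent, target routing, read-before-write, skip, insight, or append coherence.
Rerun the eval and compare scores by bucket; a prompt that improves one bucket while hurting another should not be merged.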

OpenAI prompting best practices · Claude prompting guidance · Gemini prompt design guide

# Prompt iteration loop
1. Run card_agent and pkm_agent capability suites.
2. Export failing trials, grader rationales, diffs, and traces.
3. Give the coding agent:
   - current prompt file
   - eval report
   - prompt engineering best-practice manual / skill
   - constraint: do not hard-code eval cases
4. Ask for the smallest prompt change that explains the failures.
5. Rerun the same suites.
6. Keep the change only if bucket scores improve without regressions.
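
Step 6 amounts to a per-bucket comparison between the baseline run and the candidate run. The Python below is a minimal sketch of that gate, under stated assumptions: the function names and the tolerance parameter are hypothetical, and each run is summarized as a mapping from bucket name to mean score.

```python
from typing import Mapping


def bucket_deltas(baseline: Mapping[str, float],
                  candidate: Mapping[str, float]) -> dict[str, float]:
    """Return candidate minus baseline for every bucket."""
    if baseline.keys() != candidate.keys():
        raise ValueError("both runs must cover the same buckets")
    return {bucket: candidate[bucket] - baseline[bucket] for bucket in baseline}


def should_keep(baseline: Mapping[str, float],
                candidate: Mapping[str, float],
                tolerance: float = 0.0) -> bool:
    """Keep a prompt change only if some bucket improves and none regress.

    tolerance is a hypothetical noise margin: deltas within +/- tolerance
    count as unchanged.
    """
    deltas = bucket_deltas(baseline, candidate)
    improved = any(d > tolerance for d in deltas.values())
    regressed = any(d < -tolerance for d in deltas.values())
    return improved and not regressed


# Usage: the candidate lifts "ambiguous" but hurts "negative", so it is rejected.
keep = should_keep(
    {"positive": 0.75, "negative": 1.0, "ambiguous": 0.5},
    {"positive": 0.75, "negative": 0.5, "ambiguous": 1.0},
)
```

Comparing per bucket rather than on the overall mean is the point: in the usage example the overall average is unchanged, yet the change is still rejected because one bucket regressed.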

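Constraints for the coding agent

The coding agent that edits prompts has to be constrained, otherwise it will readily optimize for the visible eval set. The right requirement is to improve the general behavior a failure bucket represents, not to hard-code sample strings from task.json.

The Card Agent should learn to prefer event when there is an explicit future time and a meeting medium, not to pick event whenever it sees Zoom.
The PKM Agent should learn to read the candidate file before editing and to append in the original file's voice, not to stuff every sleep record into Sleep.md.
The insight prompt should use the rubric's language (grounded, non-redundant, concise) and should not turn into a list of evidence.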
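
The "do not hard-code eval cases" constraint can be partly enforced mechanically by scanning a proposed prompt for text copied from task inputs. The Python below is one possible guard, not something the source describes: the function name, the five-word window, and the reliance on the id and input.content fields from the task examples are all assumptions.

```python
from typing import Iterable, Mapping


def leaked_task_ids(prompt: str,
                    tasks: Iterable[Mapping],
                    window: int = 5) -> list[str]:
    """Return ids of tasks whose input text appears verbatim in the prompt.

    A task counts as leaked if any run of `window` consecutive words from
    its input content occurs in the prompt (case-insensitive, whitespace
    normalized). Inputs shorter than the window are matched whole.
    """
    prompt_norm = " ".join(prompt.casefold().split())
    leaked: list[str] = []
    for task in tasks:
        words = task["input"]["content"].casefold().split()
        if not words:
            continue
        if len(words) < window:
            grams = [" ".join(words)]
        else:
            grams = [" ".join(words[i:i + window])
                     for i in range(len(words) - window + 1)]
        if any(gram in prompt_norm for gram in grams):
            leaked.append(task["id"])
    return leaked
```

A check like this only catches verbatim copying. A prompt that paraphrases an eval case still has to be caught in review, or by held-out tasks the coding agent never sees.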

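The real goal

These evals are not meant to make the Card Agent and PKM Agent look good on a leaderboard; they exist so that prompt changes can be audited. The Card Agent shows how far deterministic outcome graders can go; the PKM Agent shows when partial credit, trajectory checks, and calibrated model judgment become necessary.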