How Memex evaluates its Card Agent and PKM Agent with dart_agent_core

2026-06-01

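Memex has two agent evaluation problems that are very different in character. The Card Agent turns raw daily input into timeline cards, so many of its failures can be judged fairly objectively: check whether a complete card was saved, whether the template choice is appropriate, and whether key details such as time, URL, amount, place, and rating survived. The PKM Agent, by contrast, files the same raw input into a PARA knowledge base, and there is often no single correct answer.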

So Memex uses dart_agent_core evals in two ways. For the Card Agent it looks mainly at the outcome. For the PKM Agent it looks at the outcome, the transcript, partial credit, and LLM-as-judge rubrics together. Both suites run the production Dart agent codepath.

What the Card Agent actually does

The Card Agent is the layer that interprets raw input visually. When a user leaves a note that includes a meeting time and Zoom, it has to produce a complete timeline card, preserve the fact_id, fill in the title and UI config, and choose an appropriate template such as event. The set of acceptable templates differs for mood, URL, quote, metric, purchase, rating, place, person, routine, and task inputs.

Card Agent: an eval with fairly clear right answers

The Card Agent suite currently has 25 capability tasks covering positive, negative, and ambiguous buckets. Each task sets up a small workspace fixture, runs the production card pipeline, and then records is_complete, status, template_ids, card_tags, and the saved YAML blob in Outcome.environmentState.

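card_completion checks that the save tool succeeded, that the card file exists, and that the card has a completed status, a title, and a UI config.
card_template_choice scores the template selection: full credit if the primary template matches, half credit if an accepted template appears only as a secondary, and zero otherwise.
card_must_contain scans the saved YAML to make sure key details such as times, URLs, names, amounts, and quotes were not dropped.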

{
  "id": "card_event_meeting_with_time",
  "input": {
    "fact_id": "2026/05/25.md#ts_1",
    "content": "Standup with the eval team Thursday 10:30am, Zoom."
  },
  "graders": [
    { "name": "card_completion" },
    {
      "name": "card_template_choice",
      "config": { "expected_template_ids": ["event"] }
    },
    {
      "name": "card_must_contain",
      "config": { "substrings": ["10:30", "zoom"] }
    }
  ]
}
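
The two deterministic graders in this task can be sketched roughly as follows. This is an illustration in Python, not the dart_agent_core implementation: the function names are hypothetical, the sketch assumes the first entry of template_ids is the primary template, and it assumes substring matching is case-insensitive (the task content says "Zoom" while the config asks for "zoom").

```python
from typing import List, Sequence


def score_template_choice(chosen: Sequence[str], expected: Sequence[str]) -> float:
    """Score a template choice. Assumption: chosen[0] is the primary template."""
    if not chosen:
        return 0.0
    if chosen[0] in expected:
        return 1.0  # primary template is one of the accepted ones
    if any(t in expected for t in chosen[1:]):
        return 0.5  # an accepted template only shows up as a secondary
    return 0.0


def missing_substrings(saved_yaml: str, substrings: Sequence[str]) -> List[str]:
    """Return the required substrings that are absent from the saved YAML.

    Assumption: matching ignores case.
    """
    haystack = saved_yaml.lower()
    return [s for s in substrings if s.lower() not in haystack]


# Hypothetical saved card for the task above.
card_yaml = (
    'title: "Standup with the eval team"\n'
    'time: "Thursday 10:30am"\n'
    'location: "Zoom"\n'
)

print(score_template_choice(["event"], ["event"]))
print(score_template_choice(["task", "event"], ["event"]))
print(missing_substrings(card_yaml, ["10:30", "zoom"]))
```

Returning the list of missing substrings, rather than a bare boolean, gives the grader something concrete to put in its rationale.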

What the PKM Agent actually does

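The PKM Agent is not a classifier; it is a local knowledge organizer. It reads the existing PKM/ tree and decides, following the PARA method, whether the input belongs under Projects, Areas, Resources, or Archives. It then writes or edits Markdown, records the fact_id, and updates the timeline card insight. Input that is too trivial, or that the user asked not to save, must be skipped cleanly.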

PKM Agent: an eval without a single right answer

The PKM Agent suite currently has 32 capability tasks. Its fixtures are much richer than the Card Agent's: a fresh workspace, a starter PKM, a novelist workspace, a researcher workspace, and bilingual cases. Forcing cases like these into a single answer key would distort the metric rather than sharpen it.

pkm_routed_correctly gives partial credit for routing: 1.0 for an expected file, 0.5 for the correct PARA bucket, and 0 for an unrelated location.
pkm_completion accepts either a persist path, which includes a PARA write and a card insight update, or a clean skip path with no mutations.
pkm_read_before_write checks that an existing file was read before it was edited.
pkm_no_overwrite looks for seed markers, allowing reorganization but rejecting destructive edits.
pkm_insight_quality and pkm_append_coherence use LLM-as-judge to rate the quality of the insight and of the append.
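
Unlike the outcome graders, a read-before-write check has to look at the transcript. A minimal sketch of that idea, assuming a hypothetical transcript shape (one dict per tool call with "tool" and "path" keys, in call order) and hypothetical tool names read, write, and edit; none of this is taken from dart_agent_core:

```python
from typing import Dict, List, Set


def edits_without_prior_read(calls: List[Dict[str, str]], seeded: Set[str]) -> List[str]:
    """Return seeded files that were written or edited before being read."""
    read_so_far: Set[str] = set()
    violations: List[str] = []
    for call in calls:
        path = call["path"]
        if call["tool"] == "read":
            read_so_far.add(path)
        elif call["tool"] in ("write", "edit"):
            # Only files that already existed in the fixture need a prior read.
            if path in seeded and path not in read_so_far and path not in violations:
                violations.append(path)
    return violations


seeded_files = {"Areas/Health/Sleep.md"}
good_run = [
    {"tool": "read", "path": "Areas/Health/Sleep.md"},
    {"tool": "edit", "path": "Areas/Health/Sleep.md"},
]
bad_run = [
    {"tool": "edit", "path": "Areas/Health/Sleep.md"},
    {"tool": "read", "path": "Areas/Health/Sleep.md"},
]

print(edits_without_prior_read(good_run, seeded_files))
print(edits_without_prior_read(bad_run, seeded_files))
```

Call order matters here: a read that happens after the edit does not count, which is why the check walks the calls in sequence instead of comparing two sets.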

{
  "id": "pkm_starter_route_into_health_sleep",
  "input": {
    "content": "Slept 5.5h, energy crashed at 2pm...",
    "base_fixture": "_starter_pkm"
  },
  "graders": [
    { "name": "pkm_completion" },
    {
      "name": "pkm_routed_correctly",
      "config": {
        "expected_buckets": ["Areas/Health/"],
        "expected_files": ["Areas/Health/Sleep.md"]
      }
    },
    { "name": "pkm_no_overwrite" },
    { "name": "pkm_insight_quality" },
    { "name": "pkm_append_coherence" }
  ]
}
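
The partial-credit routing rule can be sketched against the config above. This Python version is an illustration under two assumptions that the source does not state: a bucket matches when a written path starts with the bucket string, and the score is the best result across all files the agent wrote. The function name is hypothetical.

```python
from typing import Iterable, Sequence


def score_routing(
    written_paths: Iterable[str],
    expected_files: Sequence[str],
    expected_buckets: Sequence[str],
) -> float:
    """1.0 for an expected file, 0.5 for an expected bucket, 0.0 otherwise."""
    best = 0.0
    for path in written_paths:
        if path in expected_files:
            return 1.0  # exact file match, nothing can score higher
        if any(path.startswith(bucket) for bucket in expected_buckets):
            best = 0.5  # right PARA bucket, different file
    return best


files = ["Areas/Health/Sleep.md"]
buckets = ["Areas/Health/"]

print(score_routing(["Areas/Health/Sleep.md"], files, buckets))
print(score_routing(["Areas/Health/Energy.md"], files, buckets))
print(score_routing(["Resources/Misc.md"], files, buckets))
```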

Why both are capability suites

Both suites start out as capability suites. In early evals, the slope matters more than a simple red/green result: the point is to see which failure buckets improved, which prompts made the agent brittle, and which task specs are too ambiguous. Buckets that have stabilized can be promoted to a regression suite later.

Improving prompts with evals

The useful loop is not to tweak the prompt until it feels better. It is to define a task set, run the eval, give the failures and a prompt engineering manual to a coding agent, apply a minimal prompt change, and run the same eval again.

Split the task set by failure bucket.
Run the suite and collect pass/fail results, partial scores, grader rationales, workspace diffs, and transcript traces.
Give the coding agent a prompt manual. It should call for clear instructions, sufficient context, stable structure and delimiters, defined terms, key constraints placed up front, and test cases for verification.
Keep prompt changes narrow. For the Card Agent, focus on templates or fact preservation; for the PKM Agent, focus on routing, read-before-write, skip, insight, and append coherence.
Rerun and compare bucket by bucket. Do not adopt a change that improves one bucket while breaking another.

OpenAI prompting best practices · Claude prompting guidance · Gemini prompt design guide

# Prompt iteration loop
1. Run card_agent and pkm_agent capability suites.
2. Export failing trials, grader rationales, diffs, and traces.
3. Give the coding agent:
   - current prompt file
   - eval report
   - prompt engineering best-practice manual / skill
   - constraint: do not hard-code eval cases
4. Ask for the smallest prompt change that explains the failures.
5. Rerun the same suites.
6. Keep the change only if bucket scores improve without regressions.
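
Step 6 can be made mechanical with a small acceptance gate. The sketch below is an illustration in Python with hypothetical names; the bucket names and scores are made up for the example and are not results from the Memex suites. The tolerance parameter is an assumption, added so that run-to-run noise is not read as a regression.

```python
from typing import Dict, List


def regressed_buckets(
    before: Dict[str, float], after: Dict[str, float], tolerance: float = 0.0
) -> List[str]:
    """Buckets whose score dropped by more than the tolerance."""
    return sorted(b for b in before if after.get(b, 0.0) < before[b] - tolerance)


def should_keep_change(
    before: Dict[str, float], after: Dict[str, float], tolerance: float = 0.0
) -> bool:
    """Keep a prompt change only if some bucket improved and none regressed."""
    improved = any(after.get(b, 0.0) > before[b] + tolerance for b in before)
    return improved and not regressed_buckets(before, after, tolerance)


# Illustrative numbers only.
baseline = {"template": 0.6, "fact_preservation": 0.8}
narrow_fix = {"template": 0.7, "fact_preservation": 0.8}
lopsided_fix = {"template": 0.9, "fact_preservation": 0.7}

print(should_keep_change(baseline, narrow_fix))
print(should_keep_change(baseline, lopsided_fix))
print(regressed_buckets(baseline, lopsided_fix))
```

Comparing per bucket instead of on one aggregate score is the point: the lopsided change would raise the average while still being rejected.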

Constraints for the coding agent

The coding agent that edits the prompt needs constraints; without them it overfits to the visible eval set. What to ask for is not hard-coding strings from task.json but improving the general behavior that the failure bucket points to.

The Card Agent should learn not the string Zoom but the rule that a concrete future time plus a meeting medium points toward event.
The PKM Agent should learn not to drop every sleep fact into Sleep.md but to read the candidate files and append in a way that matches the existing voice.
The insight prompt should use rubric language such as grounded, non-redundant, and concise.
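
One way to back the no-hard-coding constraint with a check is to scan the edited prompt for text copied from the task files. The sketch below is a hypothetical guard, not something the source describes: it flags the task id and any run of four consecutive words from the task input that reappears in the prompt. The function name, the four-word threshold, and the example prompts are all assumptions.

```python
import json
import re
from typing import List


def leaked_task_strings(prompt: str, task_json: str, min_words: int = 4) -> List[str]:
    """Return task-specific strings that appear verbatim in the prompt."""
    task = json.loads(task_json)
    leaks: List[str] = []
    if task["id"] in prompt:
        leaks.append(task["id"])
    # Normalize both sides to lowercase words so punctuation does not hide a copy.
    words = re.findall(r"\w+", task["input"]["content"].lower())
    prompt_norm = " " + " ".join(re.findall(r"\w+", prompt.lower())) + " "
    for i in range(len(words) - min_words + 1):
        ngram = " ".join(words[i : i + min_words])
        if f" {ngram} " in prompt_norm:
            leaks.append(ngram)
    return leaks


task_json = json.dumps(
    {
        "id": "card_event_meeting_with_time",
        "input": {"content": "Standup with the eval team Thursday 10:30am, Zoom."},
    }
)
general_prompt = (
    "If the input has a concrete future time and a meeting medium, "
    "prefer the event template."
)
overfit_prompt = "For 'Standup with the eval team Thursday 10:30am, Zoom.' use event."

print(leaked_task_strings(general_prompt, task_json))
print(leaked_task_strings(overfit_prompt, task_json))
```

A check like this only catches literal copying. A prompt can still overfit through paraphrase, so held-out tasks remain the stronger safeguard.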

The real goal

The purpose of these evals is not a leaderboard. It is to make prompt changes auditable. The Card Agent shows how far deterministic outcome graders can go, and the PKM Agent shows where partial credit, trajectory checks, and calibrated model judgment become necessary.