Cache judge responses for reruns #353

@ScuttleBot

Description

Priority 5: Cache Judge Responses (5-10% speedup on reruns)

Problem

When rerunning benchmarks for regression testing or model comparisons, we re-grade identical transcripts.

Solution

Cache judge responses keyed by (task_id, transcript_hash).

Implementation

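import hashlib
import json  # required by json.dumps / json.loads below
from pathlib import Path

JUDGE_CACHE_DIR = Path(".judge_cache")

def cache_key(task_id: str, transcript: list) -> str:
    # sort_keys=True makes the hash independent of dict key order
    transcript_str = json.dumps(transcript, sort_keys=True)
    # Truncate the SHA-256 digest to 16 hex chars (64 bits) to keep filenames short
    hash_hex = hashlib.sha256(transcript_str.encode()).hexdigest()[:16]
    # Note: the key is used as a filename below, and ":" is not valid in Windows filenames
    return f"{task_id}:{hash_hex}"

def get_cached_grade(task_id: str, transcript: list) -> dict | None:
    # Return the cached judge response, or None on a cache miss
    key = cache_key(task_id, transcript)
    cache_path = JUDGE_CACHE_DIR / f"{key}.json"
    if cache_path.exists():
        return json.loads(cache_path.read_text())
    return None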
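
The snippet above covers only the read path. A minimal sketch of the write path and a wrapper around the judge call might look like the following. The names `store_grade`, `grade_with_cache`, the `judge` callable, and the `cache_dir` parameter are hypothetical and not part of the existing codebase; the key scheme is copied from the proposal so the block runs on its own.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable

JUDGE_CACHE_DIR = Path(".judge_cache")

def cache_key(task_id: str, transcript: list) -> str:
    # Same key scheme as in the proposal: task id plus truncated transcript hash
    transcript_str = json.dumps(transcript, sort_keys=True)
    hash_hex = hashlib.sha256(transcript_str.encode()).hexdigest()[:16]
    return f"{task_id}:{hash_hex}"

def store_grade(task_id: str, transcript: list, grade: dict,
                cache_dir: Path = JUDGE_CACHE_DIR) -> Path:
    # Hypothetical helper: persist one judge response as JSON
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_path = cache_dir / f"{cache_key(task_id, transcript)}.json"
    # Write to a temp file first, then rename, so an interrupted run
    # does not leave a half-written cache entry behind
    tmp_path = cache_path.with_suffix(".tmp")
    tmp_path.write_text(json.dumps(grade))
    tmp_path.replace(cache_path)
    return cache_path

def grade_with_cache(task_id: str, transcript: list,
                     judge: Callable[[str, list], dict],
                     cache_dir: Path = JUDGE_CACHE_DIR) -> dict:
    # Hypothetical wrapper: return the cached grade if present,
    # otherwise call the judge once and store its response
    cache_path = cache_dir / f"{cache_key(task_id, transcript)}.json"
    if cache_path.exists():
        return json.loads(cache_path.read_text())
    grade = judge(task_id, transcript)
    store_grade(task_id, transcript, grade, cache_dir)
    return grade
```

With a wrapper like this, a rerun over an unchanged transcript would skip the judge call entirely, while any change to the transcript produces a new key and a fresh judge call.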

Expected Impact

  • Minimal for first run
  • Significant for reruns (regression testing)
  • Reduces judge API costs over time

Caveats

  • Only cache deterministic transcripts
  • May need opt-in flag (--use-judge-cache)
  • Cache must be invalidated when the rubric changes (the key covers only task_id and the transcript)
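
One possible way to handle the last two caveats, sketched under assumptions: fold a hash of the rubric text into the key so that a rubric change naturally misses the old entries, and gate the cache behind the proposed `--use-judge-cache` flag. The function names `rubric_aware_cache_key` and `build_parser`, and the idea of passing the rubric as a plain string, are hypothetical.

```python
import argparse
import hashlib
import json

def rubric_aware_cache_key(task_id: str, transcript: list, rubric: str) -> str:
    # Hypothetical variant of cache_key: editing the rubric changes the key,
    # so stale grades are never returned for a new rubric
    transcript_str = json.dumps(transcript, sort_keys=True)
    transcript_hash = hashlib.sha256(transcript_str.encode()).hexdigest()[:16]
    rubric_hash = hashlib.sha256(rubric.encode()).hexdigest()[:8]
    return f"{task_id}:{rubric_hash}:{transcript_hash}"

def build_parser() -> argparse.ArgumentParser:
    # Hypothetical CLI wiring: the cache stays off unless explicitly requested
    parser = argparse.ArgumentParser(description="benchmark runner (illustrative)")
    parser.add_argument(
        "--use-judge-cache",
        action="store_true",
        help="reuse cached judge responses for identical transcripts",
    )
    return parser

if __name__ == "__main__":
    args = build_parser().parse_args()
    print("judge cache enabled:", args.use_judge_cache)
```

Old entries written under a previous rubric would remain on disk with this approach, so a periodic cleanup of `.judge_cache` may still be needed.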
