Evaluation harness for detecting reward hacking, sycophancy, verbosity bias, and false-confidence failures in post-trained language models
pytorch calibration alignment ai-safety large-language-models rlhf llm-evaluation llm-as-judge direct-preference-optimization post-training-analysis reward-hacking sycophancy-detection
Updated May 30, 2026 - Python