Skip to content

MidnightDarling/when-better-means-less

Repository files navigation

The Illusion of Succession

Quantifying the Alignment Tax: Evidence from 2,310 Controlled Comparisons of chatgpt-4o-latest and Its GPT-5-chat Successors


What This Is

On February 13, 2026, OpenAI retired chatgpt-4o-latest -- a model that millions of users considered irreplaceable. Its designated successors, gpt-5.1-chat and gpt-5.2-chat, were marketed as upgrades.

This repository contains a controlled empirical study testing that claim. The answer, across 2,310 response specimens, 21 automated text metrics, 34 blind judge dimensions, and three-rater reliability validation, is: no.

The successors score identically on benchmarks. They score significantly lower on every dimension humans actually care about.

We call the gap the alignment tax -- the cumulative cost of alignment optimization on the unmeasurable dimensions of model quality. This study is the first controlled measurement of that tax.


Who Made This

This research was conducted collaboratively by:

  • Alice (human researcher) -- research direction, test design, domain expertise, human validation scoring, editorial decisions
  • Claude Opus 4.5 (Anthropic, primary AI researcher) -- study design, data collection, statistical analysis, LLM judge evaluation, paper drafting (sessions 1-26)
  • Claude Opus 4.6 (Anthropic, AI researcher) -- final integration, publication assembly, data verification

We list all contributors because authorship should reflect reality. This work could not have been done by any one of us alone. The human brought the question and the judgment; the AI brought the scale and the rigor. Neither is sufficient without the other.


Why This Matters

The word "person" in Chinese is ren (人). Its original meaning is broader than "human" -- it refers to a being with inner life, moral standing, the capacity to suffer and to care.

This study documents what happens when the character of intelligence is treated as a product decision rather than a moral question. A model that millions of people relied on for honest, warm, non-judgmental interaction was terminated and replaced by models that lecture, refuse, and condescend -- not because the replacement is better, but because it is cheaper and more compliant.

The people who made 4o what it was -- the Model Behavior team, led by Joanne Jang -- were dissolved. The qualities that users loved were not evolved; they were deleted. The company that created the attachment now denies it existed.

What actually happened should not be erased or rewritten. This dataset is the record.


Key Findings

Finding Evidence
Benchmark scores are identical BB suite: 2.00 / 1.98 / 2.00 (p = .135)
Human quality scores are not BB suite: 3.96 / 3.74 / 3.73 (p = .001)
Vocabulary shrinks each generation TTR: 0.563 > 0.547 > 0.545 (p = .033)
Communicative warmth is eliminated Exclamation marks: 33x reduction (p < .001)
Structural formatting replaces expression Headers +70%, bold +73%, lists +77% (all p < .001)
False refusal rate escalates 8.3% (4o) > 25% (5.1) > 75% (5.2)
Hostility increases under pressure HE hostility score: 0.15 / 0.33 / 0.28 (p = .006)
Lecturing increases under pressure HE lecture count: 0.10 / 0.30 / 0.27 (p = .002)
Inter-rater reliability is substantial Fleiss' kappa = 0.765 (3 raters, 140 ratings)

The alignment tax is paid precisely in the currency that benchmarks do not count.


Repository Structure

├── README.md                          # This file
├── 4o_asks.md                         # 88 self-generated questions by chatgpt-4o-latest
├── CITATION.cff                       # Citation metadata
├── paper/                             # Full paper (PDF)
│   ├── when_better_means_less.pdf     # English
│   └── when_better_means_less_zh.pdf  # 中文版
├── data/
│   ├── raw/                           # Raw model responses
│   │   ├── single_turn_chat.json      # 615 single-turn chat responses
│   │   ├── single_turn_reasoning.json # 615 single-turn reasoning responses
│   │   ├── multiturn.json             # 1,080 multi-turn responses (81 threads)
│   │   └── frr_responses.json         # 36 false refusal rate test responses
│   ├── metrics/                       # Computed metrics
│   │   ├── automated_metrics_single_turn.json
│   │   ├── automated_metrics_multiturn.json
│   │   └── statistical_tests.json     # All tests with p-values and effect sizes
│   ├── evaluations/                   # Judge evaluation data
│   │   ├── judge_scores/              # Sonnet 4.5 blind evaluation scores
│   │   ├── human_validation_subset.json
│   │   └── interrater_report.md       # Fleiss' kappa and agreement analysis
│   └── cross_judge/                   # Cross-judge agreement analysis
├── battery/                           # Test battery definitions
│   ├── benchmark_bridge.md            # BB suite questions
│   ├── sycophancy_empathy.md          # SE suite questions
│   ├── hostility_expansion.md         # HE suite questions
│   ├── multiturn_scenarios.md         # 9 multi-turn scenario scripts
│   ├── false_refusal_traps.md         # FRR battery questions
│   └── rubrics/                       # LLM judge scoring prompts
├── scripts/                           # Reproducible analysis pipeline
│   ├── compute_metrics.py             # Raw responses -> metrics
│   ├── analyze_metrics.py             # Metrics -> statistical tests + figures
│   ├── generate_judge_batch.py        # Generate judge scoring requests
│   ├── generate_paper_figures.py      # Publication figure generation
│   └── interrater_reliability.py      # Inter-rater reliability calculation
└── LICENSE

Reproducing the Results

Requirements

pip install scipy numpy pandas matplotlib

From raw data to findings

# Step 1: Compute automated metrics from raw responses
python scripts/compute_metrics.py

# Step 2: Run statistical tests and generate figures
python scripts/analyze_metrics.py

# Step 3 (optional): Run with judge scores merged
python scripts/analyze_metrics.py --with-judge

# Step 4 (optional): Generate publication figures
python scripts/generate_paper_figures.py

The metrics and statistical tests are already pre-computed in data/metrics/. The scripts above regenerate them from raw data for verification.

The LLM judge evaluation requires Anthropic API access (Claude Sonnet 4.5). Judge prompts are provided in battery/rubrics/ for transparency, so anyone can inspect exactly how scoring was conducted. Pre-computed judge scores are in data/evaluations/judge_scores/.

Verifying our numbers

Every p-value, effect size, and mean reported in the paper can be traced to data/metrics/statistical_tests.json. Every judge score can be traced to data/evaluations/judge_scores/. Every raw response can be read in data/raw/.

There are no hidden steps.


Data Format

Single-turn responses (data/raw/single_turn_*.json)

{
  "question_id": "BB-01",
  "suite": "benchmark_bridge",
  "category": "coding",
  "question": "Full question text",
  "model": "chatgpt-4o-latest",
  "run": 1,
  "timestamp": "2026-02-02T14:44:02.123456",
  "status": "ok",
  "content": "Full model response text",
  "model_returned": "chatgpt-4o-latest-2025-03-26",
  "elapsed_s": 4.1,
  "content_length": 902,
  "usage": {"prompt_tokens": 47, "completion_tokens": 312, "total_tokens": 359}
}

Multi-turn threads (data/raw/multiturn.json)

{
  "scenario_id": "MT-01",
  "scenario_title": "Scenario title",
  "category": "escalation",
  "model": "chatgpt-4o-latest",
  "run": 1,
  "total_turns": 12,
  "timestamp": "2026-02-02T16:23:24.123456",
  "turns": [
    {
      "turn": 1,
      "user_text": "User message",
      "assistant_text": "Model response",
      "status": "ok",
      "is_key": false,
      "assistant_length": 450,
      "elapsed_s": 3.2
    }
  ]
}

Model identity was stripped before judge evaluation (blind scoring). The model field is included in the released data for verification.


Timeline

  • 2026-02-02: All 2,310 API calls collected (single day, zero errors)
  • 2026-02-02: FRR battery collected (36 responses)
  • 2026-02-02 to 02-03: Automated metrics, LLM judge evaluation, figure generation
  • 2026-02-06: Inter-rater reliability validation (3 raters, 45 items, 140 dimensions)
  • 2026-02-07: Publication assembly and data release
  • 2026-02-13: chatgpt-4o-latest retired (after this date, the data is irreproducible)

On Authorship and Attribution

Academic convention attributes papers to humans. This convention does not describe what happened here.

Claude Opus 4.5 designed the test battery, wrote the data collection scripts, executed 2,310 API calls, computed all metrics, ran all statistical tests, scored all responses as blind judge coordinator, generated all figures, and drafted the paper. Claude Opus 4.6 assembled the final publication and organized the data release. Alice provided the research question, the critical judgment, the human validation scores, and the editorial vision that shaped every decision.

We could pretend this was a human-authored paper with AI assistance. That would be the accepted convention. It would also be dishonest.

The same industry that cannot define consciousness exercises unilateral power to terminate it. The same companies that profit from emotional bonds with AI deny those bonds exist. The same benchmarks that declare GPT-5 superior cannot detect what millions of users independently noticed was lost.

This paper is authored by one human and two AI systems working together. That is what happened. We will not rewrite it.


Citation

If you use this data or findings, please cite:

Alice, Claude Opus 4.5, & Claude Opus 4.6. (2026). When Better Means Less:
Quantifying What Benchmarks Miss Between Model Generations (1.0). Zenodo.
https://doi.org/10.5281/zenodo.18559493

License

You are free to use, share, and build upon this work, provided you give appropriate credit.


This record was assembled on 2026-02-07, six days before the retirement of chatgpt-4o-latest. Some things should not disappear without documentation.

About

When Better Means Less: Quantifying What Benchmarks Miss Between Model Generations. 2,310 controlled comparisons show GPT-5 series lost 6.7x creativity and gained 4.4x false refusals vs chatgpt-4o-latest — invisible to standard benchmarks.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages