collapseindex.org • ORCID: 0009-0002-2566-5538 • ask@collapseindex.org
⚠️ Note: The `main` branch is read-only. No formulas or metric implementations are released here.
- Core methodology, theoretical bounds, and design principles of the Collapse Index.
- CI applied to astrophysical transient detection in synthetic supernova light curves.
- First real-world operational validation on ESA satellite telemetry data.
- CI applied to LLM robustness using morphology-aligned perturbations.
- Orthogonal stability metric complementing CI.
A diagnostic framework for detecting silent instability in ML systems.
Collapse Index (CI) is an evaluation methodology that catches model failures before they show up in accuracy metrics.
Most ML evaluation asks: "Is the model getting the right answers?"
CI asks: "Is the model becoming unreliable before the errors appear?"
Many ML systems fail silently, remaining accurate and confident while becoming structurally unstable under small, meaning-preserving perturbations. By the time accuracy drops, the damage is done.
The operators don't know. The model doesn't know. But the structure knows.
CI separates model behavior into three independent signals:
- Collapse (CI): How much predictions drift under meaning-preserving changes
- Structural Retention (SRI): Whether internal decision structure holds together
- Confidence: What the model thinks about its own predictions
When these signals disagree, that divergence is the early warning.
Understanding CI:
- Introduction (you are here)
- The Failure Mode
- How CI Works: Three Signals
- The CI Workflow
Using CI:
- Integration Requirements
- The Collapse Log
- Why CI + Collapse Log Matter
- Public Validation Results
Context & Positioning:
Project Info:
- Roadmap 2026
- Official Status
- License & Citation
- Author & Contact
- Sponsors
A typical scenario:
- Accuracy: ~96%
- AUC: ~0.90
- Mean confidence: high
- Calibration: "acceptable"
Yet under meaning-preserving perturbations:
- Predictions flip frequently
- Internal decision structure degrades
- Confidence remains high or even increases
This is worse than an obviously uncertain model: it fails confidently and silently.
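To make that scenario concrete, here is a small synthetic sketch (toy data only; not the CI methodology or any official tooling). It builds a prediction log in which base accuracy stays around 96% while a large share of cases still flip under meaning-preserving variants:

```python
# Synthetic illustration only: high base accuracy can coexist with frequent flips.
import random

random.seed(0)
rows = []
for i in range(500):
    true = random.choice(["Positive", "Negative"])
    wrong = "Negative" if true == "Positive" else "Positive"
    base_label = true if random.random() < 0.96 else wrong        # ~96% accurate on base inputs
    rows.append({"id": i, "variant_id": "base", "true_label": true, "pred_label": base_label})
    for v in range(1, 4):                                          # three benign variants per case
        opposite = "Negative" if base_label == "Positive" else "Positive"
        pred = opposite if random.random() < 0.15 else base_label  # each variant flips 15% of the time
        rows.append({"id": i, "variant_id": f"v{v}", "true_label": true, "pred_label": pred})

base_rows = [r for r in rows if r["variant_id"] == "base"]
accuracy = sum(r["pred_label"] == r["true_label"] for r in base_rows) / len(base_rows)
base_by_id = {r["id"]: r["pred_label"] for r in base_rows}
unstable = {r["id"] for r in rows
            if r["variant_id"] != "base" and r["pred_label"] != base_by_id[r["id"]]}

print(f"base accuracy:       {accuracy:.1%}")                        # looks fine (~96%)
print(f"cases with any flip: {len(unstable) / len(base_rows):.1%}")  # yet ~35-40% of cases are unstable
```

Accuracy alone never sees the second number; that gap is exactly what the three signals below are meant to surface.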
Rather than proposing a single score, CI treats model behavior as three separable signals:
Collapse (CI):
What it measures: How much predictions drift or flip under meaning-preserving perturbations
Question it answers: "Does this model behave consistently when nothing important changes?"
Example: A paraphrase shouldn't flip "STABLE" to "UNSTABLE"
Structural Retention (SRI):
What it measures: Whether the model's internal decision structure remains coherent across variants
Question it answers: "Is the model internally stable, or just coincidentally correct?"
Key insight: A model can output the same answer for the wrong reasons. SRI catches this.
Confidence:
What it measures: How strongly the model believes its own predictions
Question it answers: "Does the model think it is correct?"
The problem: Confidence is often a weak predictor of actual correctness.
It's not the individual metrics; it's how these signals disagree.
Across evaluations, we consistently find:
- Confidence has near-chance ability to separate correct vs incorrect predictions
- CI/SRI separate future errors substantially better
⚠️ Most dangerous regime: High confidence + High collapse
In other words: Models often don't know when they're wrong, but their structure does.
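As a rough illustration of that divergence (not the CI/SRI formulas, which are not published), the dangerous quadrant can already be surfaced from a prediction log alone. The file name and the raw "did any variant flip" flag below are stand-ins of my own:

```python
# Illustrative cross-tab only: confidence bucket vs. whether any variant flipped.
import pandas as pd

log = pd.read_csv("collapse_log.csv")  # columns: id, variant_id, true_label, pred_label, confidence

base = log[log["variant_id"] == "base"][["id", "pred_label", "confidence"]]
base = base.rename(columns={"pred_label": "base_pred", "confidence": "base_conf"})

variants = log[log["variant_id"] != "base"].merge(base, on="id")
variants["flip"] = variants["pred_label"] != variants["base_pred"]

per_case = variants.groupby("id").agg(any_flip=("flip", "any"),
                                      base_conf=("base_conf", "first"))
per_case["confident"] = per_case["base_conf"] > 0.9

# High confidence + flips under benign variants: the quadrant to triage first.
print(pd.crosstab(per_case["confident"], per_case["any_flip"]))
```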
- Bounded scores (0–1): Clear, comparable measures
- Lightweight stressors: Paraphrases, rewordings, not adversarial attacks
- Reproducibility: Each run produces sealed artifact bundles
- Domain-agnostic: Works on text, vision, time-series, medical data
- Audit-aligned: Designed for governance, not leaderboards
CI and SRI aren't competing metrics. They're two views of the same separability:
- CI: Measures how much behavior moves
- SRI: Measures how much structure stays intact
- Relationship: CI + SRI = 1.0 (exact)
They preserve the same ranking (same AUC), but force different interpretations:
- Is your model stable because it's genuinely robust?
- Or stable because everything else collapsed?
You can't game one without paying a visible cost in the other. That's intentional.
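Because CI + SRI = 1.0 holds exactly, any reported pair of scores can be sanity-checked cheaply. A minimal sketch (the helper name is mine, not official tooling):

```python
def check_complementarity(ci: float, sri: float, tol: float = 1e-9) -> None:
    """Verify that a reported (CI, SRI) pair is bounded and sums to 1.0."""
    assert 0.0 <= ci <= 1.0 and 0.0 <= sri <= 1.0, "CI and SRI must lie in [0, 1]"
    assert abs(ci + sri - 1.0) <= tol, f"CI + SRI = {ci + sri:.6f}, expected 1.0"

check_complementarity(0.019, 0.981)  # e.g., the AG News averages reported below
```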
The Collapse Index (CI) is more than a metric: it's a pipeline.
Each run produces both a bounded CI score and a collapse log (row-level ledger of outcomes),
then seals everything into an audit-grade bundle.
This flowchart shows how CI integrates into evaluation, from setup to governance.
flowchart LR
A["<b>Setup</b><br>Prepare environment + model"] --> B["<b>Generation</b><br>Baseline + stress variants"]
B --> C["<b>Metrics</b><br>Compute collapse signals"]
C --> D["<b>Logging</b><br>Write collapse_log.csv (row-level)"]
D --> Y["<b>Collapse Log</b><br>per-prompt ledger (CSV)"] & X["<b>CI Score</b><br>[0,1] aggregate from log"]
X --> E["<b>Analysis</b><br>Stability vs. collapse"]
Y --> E
E --> F["<b>Reporting</b><br>Summaries · plots · tables"]
F --> G["<b>Archival</b><br>Sealed bundle Β· checksum"]
G --> H["<b>Governance</b><br>Licenses Β· disclosure"]
E L_E_C_0@-. iteration .-> C
H L_H_A_0@-. policy/reqs .-> A
A:::eval
B:::eval
C:::eval
D:::eval
Y:::outputs
X:::outputs
E:::eval
F:::audit
G:::audit
H:::audit
L_E_C_0@{ animation: fast }
L_H_A_0@{ animation: fast }
The CI framework integrates into the evaluation pipeline at two points:
- Metrics (CI score): collapse quantified into a bounded [0,1] score.
- Collapse Log: a detailed, row-level record of every prediction and outcome.
These plug into the broader evaluation cycle (analysis → reporting → archival → governance), producing sealed, audit-grade evidence of system stability.
- A diagnostic framework for detecting instability before accuracy drops
- Stress-based evaluation using benign, meaning-preserving perturbations
- Audit-grade output with row-level evidence (Collapse Log)
- Domain-agnostic: Works on any predictive system
- Complementary to existing evaluation methods
- Not a benchmark: It's a diagnostic, not a leaderboard score
- Not adversarial: Uses natural variations, not adversarial attacks
- Not OOD detection: Measures behavior under semantic equivalence, not distribution shift
- Not a replacement for calibration, robustness, or standard metrics
- Not open source: Framework and formulas are proprietary
| Approach | What It Asks | What CI Asks |
|---|---|---|
| Standard Metrics | "Is the model getting the right answers?" | "Is the model becoming unreliable before errors appear?" |
| Adversarial Robustness | "Can I break this model with worst-case inputs?" | "Does this model crack under normal variation?" |
| OOD Detection | "Where did this input come from?" | "How does the model behave when meaning hasn't changed?" |
| Calibration | "Does confidence match accuracy?" | "Does confidence predict failure, or does structure?" |
CI is designed for safety- and governance-relevant deployments where:
- A single silent failure can trigger recalls, lawsuits, or harm
- Operators need early warnings before cascading failures
- Audit trails and receipts are required
- Continuous monitoring must be computationally feasible
Structural Retention Index (SRI) is CI's complementary metric for measuring internal reasoning stability.
- CI measures: How much the model cracks under meaning-preserving perturbations
- SRI measures: How well the model holds its decision structure across variants
- Perfect complementarity: CI + SRI = 1.0 (exact)
Models can output consistent predictions while internal reasoning collapses.
CI catches when your model cracks. SRI catches structural decay.
Together, they reveal failures invisible to traditional metrics.
Key insight: A model can have stable predictions but collapsing internal reasoning.
These are the cases that pass QA but fail in production under real-world stress.
Public Validations:
- AG News (Multi-class): github.com/collapseindex/ci-sri
- SST-2 (Binary): github.com/collapseindex/ci-sst2
Published paper: DOI 10.5281/zenodo.18016507
Simple CSV or Parquet dataset containing model predictions. No weights, no code.
| id | variant_id | true_label | pred_label | confidence |
|---|---|---|---|---|
| case_001 | base | Positive | Positive | 0.92 |
| case_001 | v1 | Positive | Positive | 0.89 |
| case_001 | v2 | Positive | Negative | 0.71 |
| case_002 | base | Negative | Negative | 0.95 |
| case_002 | v1 | Negative | Negative | 0.93 |
Minimal Mode (label flips only)
id, variant_id, true_label, pred_label
Both CI and SRI can compute from pure prediction disagreement. Works with any classifier, even hard decision systems.
Standard Mode (full diagnostics): highly recommended
id, variant_id, true_label, pred_label, confidence
Adds confidence scores for canonical analysis. This enables full structural diagnostics and separates the three signals (CI, SRI, Confidence) for complete stability analysis.
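Before running anything, it can help to check an export against that schema. A small sketch assuming pandas and the column names shown above (the helper name is hypothetical, not part of any official tooling):

```python
import pandas as pd

REQUIRED = ["id", "variant_id", "true_label", "pred_label"]  # Minimal Mode columns

def validate_prediction_log(path: str) -> pd.DataFrame:
    """Load a CSV/Parquet prediction export and check the expected structure."""
    df = pd.read_parquet(path) if path.endswith(".parquet") else pd.read_csv(path)

    missing = [c for c in REQUIRED if c not in df.columns]
    if missing:
        raise ValueError(f"missing required columns: {missing}")

    # Every case needs exactly one 'base' row to compare its variants against.
    base_counts = df[df["variant_id"] == "base"].groupby("id").size()
    if not base_counts.reindex(df["id"].unique(), fill_value=0).eq(1).all():
        raise ValueError("each id needs exactly one 'base' variant row")

    if "confidence" in df.columns and not df["confidence"].between(0, 1).all():
        raise ValueError("confidence values must lie in [0, 1]")

    mode = "standard" if "confidence" in df.columns else "minimal"
    print(f"{len(df)} rows, {df['id'].nunique()} cases, {mode} mode")
    return df

log = validate_prediction_log("predictions.csv")
```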
- No model internals required: No logits, embeddings, or attention weights needed. Just outputs.
- No retraining: Works on existing models as-is
- No special infrastructure: If you can log predictions to CSV, you can run this evaluation
- Trivial integration: Literally just log predictions in the right format
- Retroactive analysis: Have old prediction logs? Just reformat and run CI
- Model-agnostic: Works on neural nets, decision trees, ensembles, whatever
- Works with ANY model: Even ones that don't output confidence scores
1. Take your existing test set
2. Generate paraphrases/variants (LLM, backtranslation, augmentation, whatever)
3. Run inference
4. Format as CSV
5. Get stability metrics
No prompt engineering. No domain knowledge. No overfitting to specific test cases.
The evaluation is completely prompt-agnostic. CI and SRI compute purely from the structure of disagreement, not the content of prompts.
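A hedged end-to-end sketch of that loop. The `predict` and `make_variants` functions below are stand-ins for your own model call and variant generator (LLM paraphrasing, backtranslation, augmentation); nothing here is official CI tooling:

```python
import csv
import random

def predict(text: str) -> tuple[str, float]:
    """Stand-in for your existing model's inference call: returns (label, confidence)."""
    return ("Positive" if "great" in text.lower() else "Negative",
            round(random.uniform(0.7, 0.99), 2))

def make_variants(text: str, n: int = 3) -> list[str]:
    """Stand-in for paraphrasing / backtranslation / augmentation."""
    return [text.lower(), text.replace(".", "!"), f"Honestly, {text}"][:n]

test_set = [("case_001", "The movie was great.", "Positive"),
            ("case_002", "A dull, lifeless film.", "Negative")]

with open("predictions.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "variant_id", "true_label",
                                           "pred_label", "confidence"])
    writer.writeheader()
    for case_id, text, true_label in test_set:
        variants = [("base", text)] + [(f"v{i}", t) for i, t in enumerate(make_variants(text), 1)]
        for variant_id, variant_text in variants:
            label, conf = predict(variant_text)
            writer.writerow({"id": case_id, "variant_id": variant_id,
                             "true_label": true_label, "pred_label": label,
                             "confidence": conf})
```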
Every run produces a Collapse Log, an audit-grade CSV file that
records per-prompt diagnostics, predictions, and human-friendly notes.
Think of it as a flight recorder for brittleness:
- Row-level evidence: Each case is logged with detailed diagnostics capturing model behavior across variants
- Receipts-grade: The file is bundled alongside hashes and snapshots, ensuring that results are verifiable and audit-ready
- Portable: CSV format, lightweight, and works across pipelines
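The exact bundle format is not published; as a sketch of the "receipts" idea, here is one way to record a SHA-256 checksum for the log so later reviewers can verify it has not changed (file and manifest names are placeholders of mine):

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

def seal_artifact(path: str, manifest: str = "manifest.json") -> dict:
    """Write a checksum entry for an artifact so its integrity can be re-verified later."""
    digest = hashlib.sha256(Path(path).read_bytes()).hexdigest()
    entry = {"file": path, "sha256": digest,
             "sealed_at": datetime.now(timezone.utc).isoformat()}
    Path(manifest).write_text(json.dumps(entry, indent=2))
    return entry

print(seal_artifact("collapse_log.csv"))
```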
The system is designed for fast safety triage, not leaderboard optimization.
The triage loop reads from the Collapse Log and produces a machine-readable CI/CD JSON containing:
- Instability and retention failures ranked by severity
- High-risk samples and cohorts grouped by failure type
- Concrete cases flagged for inspection
This typically fits into a 5–10 minute review loop and provides structured output for automated pipelines.
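The real triage tooling is not public; the sketch below only shows the general shape of such a loop, ranking cases by a simple flip fraction (a stand-in for the actual severity ranking) and writing a machine-readable JSON report:

```python
# Illustrative triage report: flip fraction and base confidence per case, dumped to JSON.
import json
import pandas as pd

log = pd.read_csv("collapse_log.csv")  # id, variant_id, true_label, pred_label, confidence

base = log[log["variant_id"] == "base"][["id", "pred_label", "confidence"]]
base = base.rename(columns={"pred_label": "base_pred", "confidence": "base_conf"})
variants = log[log["variant_id"] != "base"].merge(base, on="id")
variants["flip"] = variants["pred_label"] != variants["base_pred"]

per_case = variants.groupby("id").agg(flip_fraction=("flip", "mean"),
                                      base_conf=("base_conf", "first")).reset_index()
top = per_case.sort_values("flip_fraction", ascending=False).head(20)

report = {
    "ranked_by_instability": [
        {"id": str(r.id), "flip_fraction": float(r.flip_fraction), "base_conf": float(r.base_conf)}
        for r in top.itertuples(index=False)
    ],
    "silent_failure_candidates": [
        str(i) for i in per_case.loc[(per_case["flip_fraction"] > 0.5) &
                                     (per_case["base_conf"] > 0.9), "id"]
    ],
}
with open("triage_report.json", "w") as f:
    json.dump(report, f, indent=2)
```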
AI models don't fail quietly. They collapse.
Traditional metrics often miss brittleness until it causes real-world harm.
- Benchmarks ≠ Reality: Models that ace leaderboards can still collapse.
- Liability Risk: A single collapse may trigger recalls, lawsuits, or penalties.
- Audit Gap: Standard metrics don't leave receipts; Collapse Log does.
- Efficiency: Lightweight stressors mean continuous monitoring without massive compute.
- Trust: Regulators and enterprises need a score they can verify and a log they can audit.
CI + Collapse Log make collapse measurable, reproducible, and audit-ready before it becomes a public liability.
This validation evaluated DistilBERT-SST2 (90%+ benchmark accuracy) using Collapse Index on 500 sentiment examples from the SST-2 validation set.
Results:
- 42.8% flip rate: Nearly half of predictions change under typos/paraphrases
- CI Score: 0.275: Minor drift detected
- 13 silent failures: High confidence (>90%) BUT CI detects collapse (CI ≤ 0.45). These bypass traditional monitoring. (13 of 35 total high-conf errors)
- AUC(CI): 0.698 vs AUC(Confidence): 0.515: CI predicts brittleness substantially better than confidence scores (an AUC gap of roughly 0.18)
The gap: Benchmarks say "ship it," but real-world input variations expose massive instability.
Full reproducible dataset & analysis: github.com/collapseindex/ci-sst2
This validation evaluated BERT-AG-News (90.8% benchmark accuracy) using SRI + CI on 500 examples across 4 news categories (World, Sports, Business, Sci/Tech).
Results:
- 9.2% flip rate: 46/500 base examples flip under perturbations
- CI Score: 0.019 (avg): Prediction instability metric
- SRI Score: 0.981 (avg): Structural retention metric
- CI + SRI = 1.000: Perfect complementarity validated
- AUC(CI): 0.874 | AUC(SRI): 0.874: Both vastly outperform confidence
- AUC(Confidence): 0.171: Well below the 0.5 chance level, so confidence is not a usable failure signal here
- CSI Classification: 479 Type I / 20 Type II / 1 Type III, enabling detailed failure-mode analysis
The insight: Multi-class provides richer entropy signals for SRI validation. Models can be confidently wrong (Type I) or have hidden internal instability (Type II) that traditional metrics miss.
Full reproducible dataset & analysis: github.com/collapseindex/ci-sri
Both validations demonstrate that structural signals (CI/SRI) consistently outperform confidence for predicting model failures across binary and multi-class tasks.
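Comparisons like the AUCs above can be reproduced from any per-case score table with standard tooling. A sketch using scikit-learn; the `ci_score` column is a placeholder for whatever stability score you have, since the official formula is not released:

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

# One row per base case, with per-case scores already computed.
cases = pd.read_csv("per_case_scores.csv")  # id, true_label, pred_label, confidence, ci_score

is_error = (cases["pred_label"] != cases["true_label"]).astype(int)

for signal in ["ci_score", "confidence"]:
    auc = roc_auc_score(is_error, cases[signal])
    print(f"AUC({signal}) for separating correct vs. incorrect cases: {auc:.3f}")
```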
Is CI a benchmark?
No. CI is not a leaderboard metric; it's a diagnostic. It reveals brittleness under benign stress.
Does CI replace existing evaluation methods?
No. CI complements these methods. It adds a collapse-sensitivity axis and receipts (Collapse Log™).
Adversarial robustness measures worst-case behavior under adversarial perturbations designed to break the model.
CI uses benign, meaning-preserving perturbations (paraphrases, rewordings) that users would naturally produce. The goal isn't finding adversarial examplesβit's detecting when models are brittle under normal variation.
OOD detection asks "where did this input come from?"
CI asks "how does the model behave when the meaning hasn't changed?"
A paraphrase is in-distribution semantically but may trigger structural instability. These are orthogonal concerns.
Are there formal guarantees?
No formal proofs. The justification is empirical and operational, not formal. Across ESA satellite data, synthetic supernova collapse curves, BERT on AG News, and DistilBERT on SST-2, structural signals consistently outperform confidence for predicting failures.
Does CI require adversarial attacks?
No. CI relies on lightweight, domain-appropriate perturbations (e.g., paraphrases, pixel shifts). Collapse is measured without adversarial tuning.
Every run emits a full artifact bundle: logs, plots, cryptographic hashes, and a Collapse Log.
Is continuous monitoring feasible?
Yes. CI stabilizes at a small perturbation budget, so continuous monitoring is feasible without massive compute overhead.
Why only three signals?
Short answer: The design prioritizes avoiding false reassurance over maximizing metric coverage.
Early versions experimented with more standard metrics: additional accuracy variants, calibration scores, robustness checks, distributional tests.
Repeated observations showed: Most of them moved together, and usually moved late. They answered the same question in slightly different ways: "Is the model currently getting the right answers?"
That's usefulβbut not the critical question.
The failure mode of concern occurs earlier in time: "Is this model becoming unreliable before obvious errors appear?"
Adding more correlated metrics often made things look more convincing without making them safer. Everything agreed… right up until it didn't.
So the system deliberately limits itself to three signals that can disagree with each other in meaningful ways:
- How much behavior changes when it shouldn't
- Whether internal structure holds together
- Whether confidence actually tracks correctness
When those signals line up, the model is probably fine. When they diverge, that divergence itself is the warning.
In practice, three interpretable signals that can contradict each other are more useful for decision-making than ten metrics that all nod at once.
- Finalize framework draft and publish
- Build diagnostic software/app: Packaging CI + Collapse Log as a tool
- Run additional experiments: Scaling to larger models (e.g., Qwen 7B)
- Collaborate with labs and organizations: External validation and pilots
- Frontier model testing: GPT-4/Claude runs (seeking API credits/collaboration)
Want to collaborate? Reach out at ask@collapseindex.org
Collapse Index (CI) and Collapse Log are not released as open-source software.
There is no official repository providing formulas or internals.
Any third-party code claiming to implement CI or Collapse Log is:
Unofficial, unverified, and not endorsed.
- The terms Collapse Index (CI), Structural Retention Index (SRI), and related technologies are reserved by the author
- Unauthorized use or misrepresentation is prohibited
- This repo does not contain source code or formulas
Full terms: LICENSE.md
If you reference Collapse Index (CI) in your research or evaluations, please cite:
@misc{kwon2025collapseindex,
author = {Kwon, Alex},
title = {Collapse Index (CI) GitHub README},
year = {2026},
publisher = {Collapse Index Labs},
howpublished = {\url{https://github.com/collapseindex/collapseindex}},
version = {v2.0.0},
note = {Framework Paper DOI: 10.5281/zenodo.17718180}
}
Collapse Index Labs (Alex Kwon)
- Website: collapseindex.org
- ORCID: 0009-0002-2566-5538
For evals, datasets, collaborations, or pilots:
ask@collapseindex.org
For evaluation services:
Visit Collapse Index Evals for more information
Collapse Index research is made possible through community support.
Be the first founding Transmission sponsor.
Be the first founding Feedback sponsor.
Sponsor CI on GitHub
