A reproducible demonstration that the Collapse Index (CI) detects brittleness that standard benchmarks miss.
| Metric | Value | Notes |
|---|---|---|
| Model | DistilBERT-SST2 | HuggingFace public model |
| Benchmark Accuracy | 90%+ | SST-2 validation set |
| CI Score | 0.275 | Moderate instability (0-1 scale) |
| AUC (CI) | 0.698 | CI predicts flips reliably |
| AUC (Confidence) | 0.515 | Confidence barely predicts flips |
| ΔAUC | +0.182 | CI improves AUC over confidence by 0.18 (absolute) |
| Flip Rate | 42.8% | 214/500 base cases flip |
| High-Conf Errors | 35 | Model >90% confident but wrong |
| Dataset Size | 2,000 rows | 500 base examples × 4 variants each (original + 3 perturbations) |
Standard benchmarks say: "Ship it! 90%+ accuracy."
Reality under perturbations: Nearly half of predictions silently flip when users make typos or rephrase naturally.
Why CI matters: Confidence scores barely predict brittleness (AUC 0.515). Collapse Index catches it reliably (AUC 0.698).
🚨 Silent failures: 13 of the 35 high-confidence errors are cases where the model is >90% confident but CI detects collapse (CI ≤ 0.45). These bypass confidence-based monitoring and cause real user harm.
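For readers who want to reproduce the ΔAUC comparison, the sketch below shows one way to score it with scikit-learn. The `ci_scores`, `confidences`, and `flipped` arrays are hypothetical inputs (the public CSV does not include per-case CI scores), and the sign convention follows this README's thresholds, where lower CI indicates collapse.

```python
# Sketch: score how well CI vs. raw confidence predict perturbation flips.
# `ci_scores`, `confidences`, and `flipped` are hypothetical aligned arrays,
# one entry per base case; the public CSV does not include CI scores.
import numpy as np
from sklearn.metrics import roc_auc_score

def compare_auc(ci_scores, confidences, flipped):
    """Return (AUC of CI, AUC of confidence, delta) for predicting flips.

    Per this README's thresholds, lower CI signals collapse, so CI is negated
    to turn it into a flip-risk score; low confidence is likewise treated as
    higher flip risk.
    """
    auc_ci = roc_auc_score(flipped, -np.asarray(ci_scores, dtype=float))
    auc_conf = roc_auc_score(flipped, -np.asarray(confidences, dtype=float))
    return auc_ci, auc_conf, auc_ci - auc_conf
```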
- Base: 500 examples from SST-2 validation set (binary sentiment classification)
- Perturbations: 3 variants per base using:
- Character-level typos (keyboard distance)
- Synonym substitution (WordNet)
- Natural paraphrasing patterns
- Total: 2,000 rows (500 × 4 variants)
- Format: CSV with columns `id,variant_id,text,true_label,pred_label,confidence` (see the sketch below)
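The headline flip rate and high-confidence error count can be re-derived from these columns alone. Below is a minimal sketch using pandas, under two assumptions not spelled out above: `variant_id == 0` marks the unperturbed base text, and a base case counts as a flip when any perturbed variant's `pred_label` differs from the base prediction.

```python
# Sketch: recompute flip rate and high-confidence errors from sst2_ci_demo.csv.
# Assumptions (not guaranteed by this README): variant_id == 0 is the original
# text, and a base case flips if any perturbed variant changes pred_label.
import pandas as pd

df = pd.read_csv("sst2_ci_demo.csv")

# Prediction on the original text for each base case.
base_pred = df[df["variant_id"] == 0].set_index("id")["pred_label"]
perturbed = df[df["variant_id"] != 0].copy()
perturbed["base_pred"] = perturbed["id"].map(base_pred)

# A base case "flips" if any perturbed variant disagrees with the base prediction.
flips = (perturbed["pred_label"] != perturbed["base_pred"]).groupby(perturbed["id"]).any()
print(f"Flip rate: {flips.mean():.1%} ({flips.sum()}/{len(flips)} base cases)")

# High-confidence errors: model >90% confident but wrong.
high_conf_errors = df[(df["confidence"] > 0.90) & (df["pred_label"] != df["true_label"])]
print(f"High-confidence errors: {len(high_conf_errors)}")
```

Depending on how the generation script numbers variants and whether high-confidence errors are counted per row or per base case, the exact counts may differ slightly from the table above.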
`pip install -r requirements.txt`

The `sst2_ci_demo.csv` is included, but you can regenerate it:
`python generate_sst2_demo.py`

This will:
- Download SST-2 validation set (500 examples)
- Generate 3 perturbations per example
- Run DistilBERT-SST2 inference on all 2,000 rows
- Save to `sst2_ci_demo.csv`
Takes ~3-5 minutes on CPU.
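For orientation, here is a rough sketch of the kind of loop `generate_sst2_demo.py` runs, built on the public `datasets` and `transformers` libraries. The keyboard-neighbor table and single-typo perturbation are illustrative stand-ins for the script's actual perturbation logic, which also covers synonym substitution and paraphrasing.

```python
# Illustrative sketch of the regeneration loop (not the actual
# generate_sst2_demo.py implementation).
import random

import pandas as pd
from datasets import load_dataset
from transformers import pipeline

# Tiny keyboard-neighbor map used for illustrative character-level typos.
NEIGHBORS = {"a": "qs", "e": "wr", "i": "uo", "o": "ip", "s": "ad", "t": "ry"}

def typo(text: str, rng: random.Random) -> str:
    """Replace one character with a keyboard neighbor, if any qualify."""
    chars = list(text)
    candidates = [i for i, c in enumerate(chars) if c.lower() in NEIGHBORS]
    if candidates:
        i = rng.choice(candidates)
        chars[i] = rng.choice(NEIGHBORS[chars[i].lower()])
    return "".join(chars)

rng = random.Random(42)
sst2 = load_dataset("glue", "sst2", split="validation[:500]")
clf = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

rows = []
for idx, example in enumerate(sst2):
    # Variant 0 is the original text; the real script adds three perturbations.
    variants = [example["sentence"], typo(example["sentence"], rng)]
    for variant_id, text in enumerate(variants):
        pred = clf(text)[0]  # e.g. {"label": "POSITIVE", "score": 0.97}
        rows.append({
            "id": idx,
            "variant_id": variant_id,
            "text": text,
            "true_label": example["label"],
            "pred_label": 1 if pred["label"] == "POSITIVE" else 0,
            "confidence": pred["score"],
        })

pd.DataFrame(rows).to_csv("sst2_ci_demo.csv", index=False)
```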
Validate flip rate and accuracy independently:
`python validate_metrics.py`

This verifies metrics that don't require the full CI pipeline.
For full CI scoring, request an evaluation from Collapse Index Labs.
- `README.md` - This file
- `requirements.txt` - Python dependencies
- `generate_sst2_demo.py` - Dataset generation script
- `sst2_ci_demo.csv` - Full 2,000-row dataset with predictions
- Full Analysis: collapseindex.org/evals.html#validation
- Collapse Index Labs: collapseindex.org
- Model Used: huggingface.co/distilbert-base-uncased-finetuned-sst-2-english
If you use this validation dataset in your research:
@misc{ci-sst2-validation,
title={Collapse Index: SST-2 Public Validation},
author={Kwon, Alex},
year={2025},
url={https://github.com/collapseindex/ci-sst2},
note={Collapse Index Labs}
}

Author: Alex Kwon (collapseindex.org) · ORCID: 0009-0002-2566-5538
Please also cite the original SST-2 dataset:
@inproceedings{socher2013recursive,
title={Recursive deep models for semantic compositionality over a sentiment treebank},
author={Socher, Richard and Perelygin, Alex and Wu, Jean and Chuang, Jason and Manning, Christopher D and Ng, Andrew Y and Potts, Christopher},
booktitle={Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing},
pages={1631--1642},
year={2013}
}

- This Repository: MIT License (code and methodology)
- SST-2 Dataset: Available via HuggingFace Datasets (cite original paper above)
- DistilBERT Model: Apache 2.0
Copyright © 2025 Collapse Index Labs - Alex Kwon. All rights reserved.
Questions? Email ask@collapseindex.org