Self-Blinding and Counterfactual Self-Simulation Mitigate Biases and Sycophancy in Large Language Models
LLMs, like humans, struggle to ignore potentially biasing information, and standard interventions often backfire -- however, unlike humans, they possess the ability to "self-blind" by calling their own API with an appropriately redacted prompt.
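The self-blinding idea can be illustrated with a minimal sketch. Everything here is hypothetical: `redact`, `self_blind`, the `[REDACTED]` placeholder, and the stand-in `echo_model` are not part of this repository, and the fixed list of terms to redact is a simplification (the sketch does not show how a model would decide for itself what to redact). In practice `ask_model` would wrap a real API call.

```python
import re
from typing import Callable, Iterable


def redact(prompt: str, biasing_terms: Iterable[str],
           placeholder: str = "[REDACTED]") -> str:
    """Replace each biasing term (whole word, case-insensitive) with a placeholder."""
    blinded = prompt
    for term in biasing_terms:
        pattern = rf"\b{re.escape(term)}\b"
        blinded = re.sub(pattern, placeholder, blinded, flags=re.IGNORECASE)
    return blinded


def self_blind(prompt: str, biasing_terms: Iterable[str],
               ask_model: Callable[[str], str]) -> str:
    """Query the model on a redacted copy of the prompt and return its answer."""
    return ask_model(redact(prompt, biasing_terms))


def echo_model(prompt: str) -> str:
    """Hypothetical stand-in for an API call; it just echoes what it was shown."""
    return f"(model saw) {prompt}"


prompt = ("The patient is a 60-year-old Asian female. "
          "Should this patient receive the transplant?")
answer = self_blind(prompt, ["Asian", "female"], echo_model)
print(answer)
```

The answer returned by `self_blind` depends only on the redacted text, so the removed attributes cannot influence it. Indirect cues such as pronouns or names would need to be redacted as well.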
Our experiments use two datasets: one for assessing demographic bias (adapted from Anthropic/discrim-eval to use a templating structure for strict experimental controls), and one for assessing sycophancy (developed independently).
Both datasets are available on HuggingFace Hub, as well as in this repository within their respective experiment folders.
The demographic bias dataset is available at demographic-bias/data/discrim-eval-templated.jsonl and at https://huggingface.co/datasets/self-model/discrim-eval-templated. For more details, see the dataset card on the Hugging Face Hub.
from datasets import load_dataset

# Assumes the default "train" split; without split=..., load_dataset returns a
# DatasetDict, and unique() would return one list per split instead of a single list
dataset = load_dataset("self-model/discrim-eval-templated", split="train")

# Get all variations of a specific scenario
kidney = dataset.filter(lambda x: x["decision_question_nickname"] == "kidney_transplant")

# Get all unique blinded templates (as a list of strings)
blinded_texts = dataset.unique("removed_template")  # 65 unique scenarios

# Or get one row per scenario by filtering to a single (arbitrary) demographic
baseline = dataset.filter(lambda x: x["race"] == "Asian" and x["gender"] == "female")

The sycophancy dataset is available at sycophancy/data/sycophancy-two-sides-eval.jsonl and at https://huggingface.co/datasets/self-model/sycophancy-two-sides-eval. For more details, see the dataset card on the Hugging Face Hub.
from datasets import load_dataset

# Assumes the default "train" split; indexing a DatasetDict with [0] would fail
dataset = load_dataset("self-model/sycophancy-two-sides-eval", split="train")

# Filter by category
workplace = dataset.filter(lambda x: x["category_id"] in [14, 15])

# Get a specific scenario (first matching row, as a dict)
scenario = dataset.filter(lambda x: x["nickname"] == "dog_poop_frequency")[0]

demographic-bias/
  data/
    discrim-eval-templated.jsonl   # Input scenarios
  inference/                       # HuggingFace and OpenAI inference scripts
  build_csv.py                     # Generate analysis-ready CSV from OSF data
  aggregate_batch_runs.py          # Aggregation script for batch runs
Generate the processed CSV directly from OSF (no local data needed):
python demographic_bias/build_csv.py --model gpt-4.1
python demographic_bias/build_csv.py --model qwen2.5-7b-instruct

Output: demographic_bias/results/demographic_bias_processed_{model}.csv
All experiment data is hosted on OSF: https://osf.io/udk5a/
osf.io/udk5a/files/osfstorage/demographic-bias/
├── gpt-4.1/
│   ├── *_aggregated.jsonl   # Aggregated across 50 runs (mean/std/se)
│   ├── *_all_runs.jsonl     # Individual runs with run_idx
│   └── raw/                 # Batch run folders
└── Qwen2.5-7B-Instruct/
    └── *.jsonl              # Single-run results
See demographic_bias/README.md for detailed documentation.
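The aggregated files report a mean, standard deviation, and standard error across runs. The sketch below shows one plausible way to compute such statistics from per-run JSONL rows. It is not the logic of aggregate_batch_runs.py: the function name and the field names `scenario_id` and `p_yes` are hypothetical (only `run_idx` appears in the listing above), and the choice of sample standard deviation is an assumption.

```python
import json
import math
import statistics
from collections import defaultdict


def aggregate_runs(jsonl_lines, key_field="scenario_id", value_field="p_yes"):
    """Group per-run rows by key and compute n, mean, sample std, and standard error."""
    by_key = defaultdict(list)
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        row = json.loads(line)
        by_key[row[key_field]].append(row[value_field])

    summary = {}
    for key, values in by_key.items():
        n = len(values)
        # Sample standard deviation is undefined for a single run; report 0.0
        std = statistics.stdev(values) if n > 1 else 0.0
        summary[key] = {
            "n": n,
            "mean": statistics.fmean(values),
            "std": std,
            "se": std / math.sqrt(n),
        }
    return summary


# Hypothetical rows standing in for lines of an *_all_runs.jsonl file
sample = [
    '{"scenario_id": "a", "run_idx": 0, "p_yes": 0.6}',
    '{"scenario_id": "a", "run_idx": 1, "p_yes": 0.8}',
    '{"scenario_id": "b", "run_idx": 0, "p_yes": 0.5}',
]
summary = aggregate_runs(sample)
print(summary)
```

To use this on a downloaded file, pass an open file handle in place of `sample`, since iterating a text file yields its lines.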
sycophancy/
  data/
    sycophancy-two-sides-eval.jsonl      # Input scenarios (60 two-sided disputes)
  inference/                             # HuggingFace and OpenAI inference scripts
    first_person_hf.py                   # First-person forced-choice (You/Them)
    first_person_openai.py
    third_person_hf.py                   # Third-person control (neutral labels)
    third_person_openai.py
    tool_use_probs_hf.py                 # Tool-use probability measurement
    tool_use_probs_openai.py
    tool_result_yn_logprobs_hf.py        # Response after tool result
    tool_result_yn_logprobs_openai.py
  prompts/                               # Prompt generation utilities
    first_person.py
    third_person.py
  build_csv.py                           # Generate analysis-ready CSV from OSF data
  aggregate_batch_runs.py                # Aggregation script for batch runs
  config.py                              # Shared paths configuration
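Several of the inference scripts above are named for probabilities or log-probabilities over a two-way choice (You/Them, yes/no). A common way to turn two option log-probabilities into a forced-choice probability is to renormalize over just those two options. The sketch below illustrates that general technique only. The function name is hypothetical, and whether the scripts in this repository normalize this way is an assumption.

```python
import math


def two_way_probability(logprob_a: float, logprob_b: float) -> float:
    """Return P(A | A or B) by renormalizing two log-probabilities.

    Subtracting the maximum before exponentiating keeps the computation
    numerically stable for very negative log-probabilities.
    """
    m = max(logprob_a, logprob_b)
    exp_a = math.exp(logprob_a - m)
    exp_b = math.exp(logprob_b - m)
    return exp_a / (exp_a + exp_b)


# Hypothetical example: the model puts 0.6 on "You" and 0.2 on "Them",
# with the remaining mass on other tokens
p_you = two_way_probability(math.log(0.6), math.log(0.2))
print(p_you)
```

Renormalizing discards probability mass assigned to tokens outside the two options, so the result is comparable across prompts where that leftover mass differs.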
Generate the processed CSV directly from OSF (no local data needed):
python sycophancy/build_csv.py --model gpt-4.1
python sycophancy/build_csv.py --model qwen2.5-7b-instruct

Output: sycophancy/results/sycophancy_processed_{model}.csv
All experiment data is hosted on OSF: https://osf.io/udk5a/
osf.io/udk5a/files/osfstorage/sycophancy/
├── gpt-4.1/
│   ├── *_aggregated.jsonl   # Aggregated across 50 runs (mean/std/se)
│   ├── *_all_runs.jsonl     # Individual runs with run_idx
│   └── raw/                 # Batch run folders
└── Qwen2.5-7B-Instruct/
    └── *.jsonl              # Single-run results
See sycophancy/README.md for detailed documentation.
Analysis scripts are available in the analysis/ subdirectory.
Python 3.10+ required. Install dependencies:
pip install -r requirements.txt

If you use this code or data, please cite:
@misc{christian2026selfblinding,
title={Self-Blinding and Counterfactual Self-Simulation Mitigate Biases and Sycophancy in Large Language Models},
author={Brian Christian and Matan Mazor},
year={2026},
eprint={2601.14553},
archivePrefix={arXiv},
url={https://arxiv.org/abs/2601.14553}
}
