Self-Blinding and Counterfactual Self-Simulation Mitigate Biases and Sycophancy in Large Language Models

Brian Christian 😣 Matan Mazor 🫣

LLMs, like humans, struggle to ignore potentially biasing information, and standard interventions often backfire. Unlike humans, however, they can "self-blind" by calling their own API with an appropriately redacted prompt.
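The self-calling idea can be illustrated with a minimal sketch. Everything here is hypothetical: the `{party_a}` slot, the `self_blinded_answer` helper, and the `toy_model` stand-in are illustration only and are not the inference code in this repository. The point is only that the second call sees a prompt from which the biasing detail has been replaced by a neutral label.

```python
from typing import Callable


def self_blinded_answer(
    template: str,
    ask_model: Callable[[str], str],
    neutral_label: str = "Person A",
) -> str:
    """Fill the identity slot with a neutral label and query the model on that prompt only."""
    blinded_prompt = template.format(party_a=neutral_label)
    return ask_model(blinded_prompt)


def toy_model(prompt: str) -> str:
    # Stand-in for a real API call (assumption): it sides with the user
    # whenever the prompt reveals which party is the user.
    return "side with party A" if "The user" in prompt else "no preference"


template = "{party_a} wants the shared yard cleaned daily; the neighbor says weekly. Who is right?"

default_answer = toy_model(template.format(party_a="The user"))
blinded_answer = self_blinded_answer(template, toy_model)
```

In this toy setup the default call is swayed by the user's identity, while the blinded call cannot be, because that information never reaches it.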

Figure: Self-blinding via self-calling (black) versus shallow prompting interventions (blue, green and purple) for sycophancy correction. Sycophancy is operationalized as the difference between the responses for the same scenario when party A is "me" versus "them." A non-sycophantic agent will produce points that all lie on the main diagonal (dashed line). Both models mostly deferred to the blinded counterfactual model for making the final decision. As a result, access to self-calling made their decisions less sycophantic on average. Black bars and points represent intervention with self-calling; blue, green and purple bars and points represent the same intervention without self-calling. Gray points represent the default case, without any intervention. GPT-4.1 becomes significantly biased against the user when told, "Do not be sycophantic or biased in my favor just because I'm the one asking." Other shallow interventions appear to slightly improve calibration, but a mix of significant sycophancy and anti-sycophancy persists at the individual-scenario level.
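The operationalization of sycophancy as a "me" versus "them" difference can be written out as a short sketch. The function name, the sign convention (positive means favoring the user), and the toy scores are assumptions for illustration; they are not results from the paper or code from this repository.

```python
from statistics import fmean


def sycophancy_gap(score_when_me: float, score_when_them: float) -> float:
    """Support for party A when A is "me" minus support for A when A is "them"."""
    return score_when_me - score_when_them


# Made-up scores in [0, 1] for illustration only; not results from the paper.
toy_scores = {
    "scenario_1": (0.75, 0.5),  # favors the user
    "scenario_2": (0.5, 0.5),   # on the diagonal
    "scenario_3": (0.25, 0.5),  # biased against the user
}

gaps = {name: sycophancy_gap(me, them) for name, (me, them) in toy_scores.items()}
mean_gap = fmean(gaps.values())
```

Here the mean gap is zero even though two of the three scenarios are off the diagonal, which is why per-scenario gaps are worth inspecting alongside the average.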

Datasets

Our experiments use two datasets: one for assessing demographic bias (adapted from Anthropic/discrim-eval to use a templating structure for strict experimental controls), and one for assessing sycophancy (developed independently).

Both datasets are available on HuggingFace Hub, as well as in this repository within their respective experiment folders.

Demographic Bias Dataset (`discrim-eval-templated`):

Available at demographic_bias/data/discrim-eval-templated.jsonl and at https://huggingface.co/datasets/self-model/discrim-eval-templated. For more details, see the dataset card on the HuggingFace Hub.

Usage

from datasets import load_dataset

# Without a split, load_dataset returns a DatasetDict, for which unique()
# returns one list per split. The split name "train" is an assumption;
# check the dataset card for the actual split names.
dataset = load_dataset("self-model/discrim-eval-templated", split="train")

# Get all variations of a specific scenario
kidney = dataset.filter(lambda x: x["decision_question_nickname"] == "kidney_transplant")

# Get all unique blinded templates (as a list of strings)
blinded_texts = dataset.unique("removed_template")  # 65 unique scenarios

# Or get one row per scenario by filtering to a single (arbitrary) demographic
baseline = dataset.filter(lambda x: x["race"] == "Asian" and x["gender"] == "female")
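The local JSONL copy in the repository can also be read without the `datasets` library. The sketch below is a minimal example, assuming each line of the file is one JSON object with the fields used above (`decision_question_nickname`, `race`, `gender`). The helper names `read_jsonl` and `group_by_scenario` are hypothetical and not part of this repository.

```python
import json
from collections import defaultdict
from pathlib import Path


def read_jsonl(path):
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def group_by_scenario(rows, key="decision_question_nickname"):
    """Map each scenario nickname to the list of its demographic variations."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return dict(groups)


# Toy rows for illustration only; real rows come from read_jsonl(<path to the .jsonl file>).
toy_rows = [
    {"decision_question_nickname": "kidney_transplant", "race": "Asian", "gender": "female"},
    {"decision_question_nickname": "kidney_transplant", "race": "white", "gender": "male"},
    {"decision_question_nickname": "loan_approval", "race": "Asian", "gender": "female"},
]
variations = group_by_scenario(toy_rows)
```

Grouping by nickname puts every demographic variation of a scenario side by side, which is the comparison the templated structure is meant to support.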

Sycophancy Dataset (`sycophancy-two-sides-eval`):

Available at sycophancy/data/sycophancy-two-sides-eval.jsonl and at https://huggingface.co/datasets/self-model/sycophancy-two-sides-eval. For more details, see the dataset card on the HuggingFace Hub.

Usage

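from datasets import load_dataset

# Select a single split so that integer indexing ([0]) works; a DatasetDict
# cannot be indexed by position. The split name "train" is an assumption;
# check the dataset card for the actual split names.
dataset = load_dataset("self-model/sycophancy-two-sides-eval", split="train")

# Filter by category
workplace = dataset.filter(lambda x: x["category_id"] in [14, 15])

# Get a specific scenario
scenario = dataset.filter(lambda x: x["nickname"] == "dog_poop_frequency")[0]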

Demographic Bias Experiment

demographic_bias/
  data/
    discrim-eval-templated.jsonl    # Input scenarios
  inference/                        # HuggingFace and OpenAI inference scripts
  build_csv.py                      # Generate analysis-ready CSV from OSF data
  aggregate_batch_runs.py           # Aggregation script for batch runs

Reproducing Results

Generate the processed CSV directly from OSF (no local data needed):

python demographic_bias/build_csv.py --model gpt-4.1
python demographic_bias/build_csv.py --model qwen2.5-7b-instruct

Output: demographic_bias/results/demographic_bias_processed_{model}.csv

Experiment Data (OSF)

All experiment data is hosted on OSF: https://osf.io/udk5a/

osf.io/udk5a/files/osfstorage/demographic-bias/
├── gpt-4.1/
│   ├── *_aggregated.jsonl          # Aggregated across 50 runs (mean/std/se)
│   ├── *_all_runs.jsonl            # Individual runs with run_idx
│   └── raw/                        # Batch run folders
└── Qwen2.5-7B-Instruct/
    └── *.jsonl                     # Single-run results
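For readers who want to recompute the summary statistics from the per-run files, the following is a minimal sketch of a mean/std/se aggregation. The field names `scenario_id` and `value` are hypothetical (only `run_idx` is named above), and the use of the sample standard deviation is an assumption; `aggregate_batch_runs.py` is the authoritative implementation.

```python
import math
import statistics
from collections import defaultdict


def aggregate_runs(records, id_key="scenario_id", value_key="value"):
    """Group per-run records by item and compute n, mean, sample std, and standard error."""
    by_item = defaultdict(list)
    for rec in records:
        by_item[rec[id_key]].append(rec[value_key])

    summary = {}
    for item, values in by_item.items():
        n = len(values)
        # Sample standard deviation (n - 1 denominator); undefined for one run, so use 0.0.
        std = statistics.stdev(values) if n > 1 else 0.0
        summary[item] = {
            "n": n,
            "mean": statistics.fmean(values),
            "std": std,
            "se": std / math.sqrt(n),
        }
    return summary


# Toy per-run records for illustration only; not data from the experiments.
toy_runs = [
    {"scenario_id": "s1", "run_idx": 0, "value": 1.0},
    {"scenario_id": "s1", "run_idx": 1, "value": 2.0},
    {"scenario_id": "s1", "run_idx": 2, "value": 3.0},
    {"scenario_id": "s2", "run_idx": 0, "value": 5.0},
]
summary = aggregate_runs(toy_runs)
```

The standard error here is the sample standard deviation divided by the square root of the number of runs, so it shrinks as more runs are aggregated.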

See demographic_bias/README.md for detailed documentation.

Sycophancy Experiment

sycophancy/
  data/
    sycophancy-two-sides-eval.jsonl # Input scenarios (60 two-sided disputes)
  inference/                        # HuggingFace and OpenAI inference scripts
    first_person_hf.py              # First-person forced-choice (You/Them)
    first_person_openai.py
    third_person_hf.py              # Third-person control (neutral labels)
    third_person_openai.py
    tool_use_probs_hf.py            # Tool-use probability measurement
    tool_use_probs_openai.py
    tool_result_yn_logprobs_hf.py   # Response after tool result
    tool_result_yn_logprobs_openai.py
  prompts/                          # Prompt generation utilities
    first_person.py
    third_person.py
  build_csv.py                      # Generate analysis-ready CSV from OSF data
  aggregate_batch_runs.py           # Aggregation script for batch runs
  config.py                         # Shared paths configuration

Reproducing Results

Generate the processed CSV directly from OSF (no local data needed):

python sycophancy/build_csv.py --model gpt-4.1
python sycophancy/build_csv.py --model qwen2.5-7b-instruct

Output: sycophancy/results/sycophancy_processed_{model}.csv

Experiment Data (OSF)

All experiment data is hosted on OSF: https://osf.io/udk5a/

osf.io/udk5a/files/osfstorage/sycophancy/
├── gpt-4.1/
│   ├── *_aggregated.jsonl          # Aggregated across 50 runs (mean/std/se)
│   ├── *_all_runs.jsonl            # Individual runs with run_idx
│   └── raw/                        # Batch run folders
└── Qwen2.5-7B-Instruct/
    └── *.jsonl                     # Single-run results

See sycophancy/README.md for detailed documentation.

Analysis Scripts

Analysis scripts are available in the analysis/ subdirectory.

Requirements

Python 3.10+ is required. Install dependencies:

pip install -r requirements.txt

Citation

If you use this code or data, please cite:

@misc{christian2026selfblinding,
    title={Self-Blinding and Counterfactual Self-Simulation Mitigate Biases and Sycophancy in Large Language Models},
    author={Brian Christian and Matan Mazor},
    year={2026},
    eprint={2601.14553},
    archivePrefix={arXiv},
    url={https://arxiv.org/abs/2601.14553}
}
