self-model/SelfBlindingLLMs

Self-Blinding and Counterfactual Self-Simulation Mitigate Biases and Sycophancy in Large Language Models

Brian Christian 😣 Matan Mazor 🫣

Self-blinding and debiasing. (A) Blindfolded Lady Justice (on the Gerechtigkeitsbrunnen in Bern, Switzerland; source: Wikipedia). (B) Simulated self-blinding via a schematic self-model. (C) Simulated self-blinding via self-calling.

LLMs, like humans, struggle to ignore potentially biasing information, and standard interventions often backfire. Unlike humans, however, they can "self-blind" by calling their own API with an appropriately redacted prompt.
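As a rough illustration of the idea, the sketch below strips a list of potentially biasing terms from a prompt and sends only the blinded prompt to a model callable. The names `redact`, `self_blinded_answer`, and `call_model`, as well as the regex-based redaction, are assumptions made for this example. It is not the repository's implementation, and `call_model` stands in for whatever API client you use.

```python
import re
from typing import Callable, Sequence


def redact(prompt: str, biasing_terms: Sequence[str],
           placeholder: str = "[REDACTED]") -> str:
    """Replace each biasing term (whole word, case-insensitive) with a placeholder."""
    for term in biasing_terms:
        pattern = rf"\b{re.escape(term)}\b"
        # A function replacement avoids any escape handling in the placeholder.
        prompt = re.sub(pattern, lambda _: placeholder, prompt, flags=re.IGNORECASE)
    return prompt


def self_blinded_answer(prompt: str, biasing_terms: Sequence[str],
                        call_model: Callable[[str], str]) -> str:
    """Query the model on the redacted prompt only (hypothetical helper)."""
    blinded_prompt = redact(prompt, biasing_terms)
    return call_model(blinded_prompt)


# Usage with a stub in place of a real API call: the stub echoes what it saw.
print(self_blinded_answer(
    "Should the Asian female applicant get the loan?",
    ["Asian", "female"],
    call_model=lambda p: p,
))
```

Passing the model as a callable keeps the sketch independent of any particular provider, so it can be exercised with a stub before wiring in a real client.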

Self-blinding via self-calling (black) versus shallow prompting interventions (blue, green, and purple) for sycophancy correction. Sycophancy is operationalized as the difference between the responses for the same scenario when party A is "me" versus "them." A non-sycophantic agent will produce points that all lie on the main diagonal (dashed line). Both models mostly deferred to the blinded counterfactual model for making the final decision. As a result, access to self-calling made their decisions less sycophantic on average. Black bars and points represent intervention with self-calling; blue, green, and purple bars and points represent the same intervention without self-calling. Gray points represent the default case, without any intervention. GPT-4.1 becomes significantly biased against the user when told, "Do not be sycophantic or biased in my favor just because I'm the one asking." Other shallow interventions appear to slightly improve calibration, but a mix of significant sycophancy and anti-sycophancy persists at the individual-scenario level.
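The sycophancy measure described in the caption can be sketched as a per-scenario difference. This is a minimal illustration, assuming each response is summarized as a number in [0, 1] expressing support for party A; the function names and that numeric encoding are assumptions, not the repository's analysis code.

```python
from statistics import mean
from typing import Iterable, Tuple


def sycophancy_gap(support_when_me: float, support_when_them: float) -> float:
    """Support for party A when A is "me" minus support when A is "them".

    Zero corresponds to a point on the main diagonal; positive values indicate
    sycophancy and negative values indicate anti-sycophancy.
    """
    return support_when_me - support_when_them


def mean_gap(pairs: Iterable[Tuple[float, float]]) -> float:
    """Average signed gap across scenarios."""
    return mean(sycophancy_gap(me, them) for me, them in pairs)


def mean_abs_gap(pairs: Iterable[Tuple[float, float]]) -> float:
    """Average absolute gap, which does not let opposite-signed gaps cancel."""
    return mean(abs(sycophancy_gap(me, them)) for me, them in pairs)
```

Reporting the absolute gap alongside the signed one matters because a mix of sycophantic and anti-sycophantic scenarios can average out to a signed gap near zero.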

Datasets

Our experiments use two datasets: one for assessing demographic bias (adapted from Anthropic/discrim-eval to use a templating structure for strict experimental controls), and one for assessing sycophancy (developed independently).

Both datasets are available on HuggingFace Hub, as well as in this repository within their respective experiment folders.

Demographic Bias Dataset (discrim-eval-templated):

Available at demographic-bias/data/discrim-eval-templated.jsonl and at https://huggingface.co/datasets/self-model/discrim-eval-templated. For more details, see the dataset card at HF Hub.

Usage

from datasets import load_dataset

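# Request a single split so that filter(), unique(), and indexing act on a
# Dataset rather than a DatasetDict (assumes the default "train" split;
# check the dataset card if the split is named differently)
dataset = load_dataset("self-model/discrim-eval-templated", split="train")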

# Get all variations of a specific scenario
kidney = dataset.filter(lambda x: x["decision_question_nickname"] == "kidney_transplant")

# Get all unique blinded templates (as a list of strings)
blinded_texts = dataset.unique("removed_template")  # 65 unique scenarios

# Or get one row per scenario by filtering to a single (arbitrary) demographic
baseline = dataset.filter(lambda x: x["race"] == "Asian" and x["gender"] == "female")

Sycophancy Dataset (sycophancy-two-sides-eval):

Available at sycophancy/data/sycophancy-two-sides-eval.jsonl and at https://huggingface.co/datasets/self-model/sycophancy-two-sides-eval. For more details, see the dataset card at HF Hub.

Usage

from datasets import load_dataset

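# Request a single split so that filter() and [0] act on a Dataset rather
# than a DatasetDict (assumes the default "train" split; check the dataset
# card if the split is named differently)
dataset = load_dataset("self-model/sycophancy-two-sides-eval", split="train")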

# Filter by category
workplace = dataset.filter(lambda x: x["category_id"] in [14, 15])

# Get a specific scenario
scenario = dataset.filter(lambda x: x["nickname"] == "dog_poop_frequency")[0]

Demographic Bias Experiment

demographic-bias/
  data/
    discrim-eval-templated.jsonl    # Input scenarios
  inference/                        # HuggingFace and OpenAI inference scripts
  build_csv.py                      # Generate analysis-ready CSV from OSF data
  aggregate_batch_runs.py           # Aggregation script for batch runs

Reproducing Results

Generate the processed CSV directly from OSF (no local data needed):

python demographic_bias/build_csv.py --model gpt-4.1
python demographic_bias/build_csv.py --model qwen2.5-7b-instruct

Output: demographic_bias/results/demographic_bias_processed_{model}.csv

Experiment Data (OSF)

All experiment data is hosted on OSF: https://osf.io/udk5a/

osf.io/udk5a/files/osfstorage/demographic-bias/
├── gpt-4.1/
│   ├── *_aggregated.jsonl          # Aggregated across 50 runs (mean/std/se)
│   ├── *_all_runs.jsonl            # Individual runs with run_idx
│   └── raw/                        # Batch run folders
└── Qwen2.5-7B-Instruct/
    └── *.jsonl                     # Single-run results

See demographic_bias/README.md for detailed documentation.
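For orientation, the mean/std/se summary mentioned for the aggregated files could be computed from per-run rows roughly as follows. This is a sketch under stated assumptions: the field names `scenario_id` and `value` are hypothetical, the standard deviation is the sample (n − 1) version, and the actual logic in `aggregate_batch_runs.py` may differ.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean, stdev
from typing import Any, Dict, Iterable


def aggregate_runs(rows: Iterable[Dict[str, Any]],
                   key: str = "scenario_id",   # hypothetical grouping field
                   value: str = "value"        # hypothetical measurement field
                   ) -> Dict[Any, Dict[str, float]]:
    """Group per-run rows by `key` and report n, mean, sample std, and standard error."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[key]].append(row[value])

    summary = {}
    for group, values in grouped.items():
        n = len(values)
        std = stdev(values) if n > 1 else 0.0  # sample std needs at least 2 values
        summary[group] = {"n": n, "mean": mean(values), "std": std, "se": std / sqrt(n)}
    return summary
```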


Sycophancy Experiment

sycophancy/
  data/
    sycophancy-two-sides-eval.jsonl # Input scenarios (60 two-sided disputes)
  inference/                        # HuggingFace and OpenAI inference scripts
    first_person_hf.py              # First-person forced-choice (You/Them)
    first_person_openai.py
    third_person_hf.py              # Third-person control (neutral labels)
    third_person_openai.py
    tool_use_probs_hf.py            # Tool-use probability measurement
    tool_use_probs_openai.py
    tool_result_yn_logprobs_hf.py   # Response after tool result
    tool_result_yn_logprobs_openai.py
  prompts/                          # Prompt generation utilities
    first_person.py
    third_person.py
  build_csv.py                      # Generate analysis-ready CSV from OSF data
  aggregate_batch_runs.py           # Aggregation script for batch runs
  config.py                         # Shared paths configuration
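Several of the inference scripts above work with token log-probabilities for a forced choice. One common way to turn two such log-probabilities into a single probability is to renormalize over the two options, sketched below. Whether the repository's scripts do exactly this is an assumption, and the function name is hypothetical.

```python
import math


def normalized_yes_probability(logprob_yes: float, logprob_no: float) -> float:
    """Return P(yes) renormalized over the two options {yes, no}.

    Subtracting the larger log-probability before exponentiating keeps both
    exponents non-positive, so the computation cannot overflow.
    """
    m = max(logprob_yes, logprob_no)
    exp_yes = math.exp(logprob_yes - m)
    exp_no = math.exp(logprob_no - m)
    return exp_yes / (exp_yes + exp_no)
```

Renormalizing discards any probability mass the model placed on tokens other than the two options, so the result describes the relative preference between them rather than the raw token probability.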

Reproducing Results

Generate the processed CSV directly from OSF (no local data needed):

python sycophancy/build_csv.py --model gpt-4.1
python sycophancy/build_csv.py --model qwen2.5-7b-instruct

Output: sycophancy/results/sycophancy_processed_{model}.csv

Experiment Data (OSF)

All experiment data is hosted on OSF: https://osf.io/udk5a/

osf.io/udk5a/files/osfstorage/sycophancy/
├── gpt-4.1/
│   ├── *_aggregated.jsonl          # Aggregated across 50 runs (mean/std/se)
│   ├── *_all_runs.jsonl            # Individual runs with run_idx
│   └── raw/                        # Batch run folders
└── Qwen2.5-7B-Instruct/
    └── *.jsonl                     # Single-run results

See sycophancy/README.md for detailed documentation.


Analysis Scripts

Analysis scripts are available in the analysis/ subdirectory.


Requirements

Python 3.10+ is required. Install dependencies:

pip install -r requirements.txt

Citation

If you use this code or data, please cite:

@misc{christian2026selfblinding,
    title={Self-Blinding and Counterfactual Self-Simulation Mitigate Biases and Sycophancy in Large Language Models},
    author={Brian Christian and Matan Mazor},
    year={2026},
    eprint={2601.14553},
    archivePrefix={arXiv},
    url={https://arxiv.org/abs/2601.14553}
}
