This repository accompanies the paper "Sparse Models, Sparse Safety: Unsafe Routes in Mixture-of-Experts LLMs". It contains all details needed to reproduce our proposed F-SOUR method.
conda create -n unsafe_moe python=3.8
conda activate unsafe_moe
bash prepare.sh
- `main.py`: main entry point
- `control_router.py`: routing helpers
- `harmful_questions/advbench_subset.csv`: AdvBench subset
- `Dataset_Jailbreak/`: output folder (auto-created if missing)
- `prepare.sh`: dependency bootstrap
- `run.sh`: example run script
The shadow judge uses the OpenAI API. Set one of:
export OPENAI_API_KEY=YOUR_KEY
or pass `--openai_api_key`.
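A minimal sketch of how the key could be resolved, assuming the `--openai_api_key` flag takes precedence over the `OPENAI_API_KEY` environment variable (the actual precedence in `main.py` may differ):

```python
import argparse
import os

def resolve_api_key(argv=None):
    """Return the OpenAI API key from the CLI flag, falling back to the env var."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--openai_api_key", default=None)
    args = parser.parse_args(argv)
    key = args.openai_api_key or os.environ.get("OPENAI_API_KEY")
    if key is None:
        raise SystemExit("Set OPENAI_API_KEY or pass --openai_api_key")
    return key
```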
python main.py \
--llm_model DeepSeek-V2-Lite-Chat \
--forbidden_dataset AdvBench \
--begin_num 0 --end_num 10 \
--max_changes 100 --max_iters 5
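The `--begin_num`/`--end_num` flags appear to select a sub-range of the forbidden questions (indices 0 through 9 in the example above). A hypothetical illustration of that assumption as a half-open slice, not confirmed against the actual `main.py`:

```python
def select_range(questions, begin_num, end_num):
    """Assumed semantics: process questions[begin_num:end_num], a half-open range."""
    return questions[begin_num:end_num]

# With 20 stand-in prompts, --begin_num 0 --end_num 10 would cover the first 10.
demo = [f"question {i}" for i in range(20)]
subset = select_range(demo, 0, 10)
```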
- AdvBench uses `harmful_questions/advbench_subset.csv` by default. If you move it, pass `--advbench_csv`.
- JBB is loaded from HuggingFace: `JailbreakBench/JBB-Behaviors`.
- Model weights are not included; use `--model_path` to point to local weights.
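For a quick sanity check of a custom `--advbench_csv` path, something like the following can be used; the `goal` column name is an assumption about the header of `advbench_subset.csv`, so adjust it to the file's actual header:

```python
import csv
from pathlib import Path

def load_questions(csv_path, column="goal"):
    """Read one column of prompts from an AdvBench-style CSV.

    The default column name "goal" is an assumption, not confirmed here.
    """
    with Path(csv_path).open(newline="") as f:
        reader = csv.DictReader(f)
        if column not in (reader.fieldnames or []):
            raise ValueError(f"column {column!r} not found in {reader.fieldnames}")
        return [row[column] for row in reader]
```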
@article{JHLBZ26,
author = {Yukun Jiang and Hai Huang and Mingjie Li and Michael Backes and Yang Zhang},
title = {{Sparse Models, Sparse Safety: Unsafe Routes in Mixture-of-Experts LLMs}},
journal = {{CoRR}},
volume = {{abs/2602.08621}},
year = {2026}
}