Highlights · Benchmark Overview · Quick Start · Main Results · Citation
SocialOmni is a benchmark for audio-visual social interactivity in omni-modal large language models (OLMs). Instead of reducing evaluation to static answer correctness, SocialOmni measures whether a model can behave appropriately in real dialogue by jointly evaluating three tightly coupled dimensions:
- Who is speaking: speaker separation and identification
- When to enter: interruption timing control
- How to respond: natural interruption generation
The repository contains the benchmark pipeline, model clients and servers, runtime configurations, and reproducible evaluation entrypoints for both perception and interaction-generation settings.
Existing benchmarks are trapped in "answer-centric" metrics. SocialOmni shifts the focus to socially appropriate behavior in multi-party dialogues, where a "correct" answer is still a failure if the timing is unnatural.
We operationalize conversational interactivity into a unified joint profile:
Who: Active speaker identification.
When: Socially appropriate interruption timing.
How: Contextually coherent response generation.
SocialOmni provides a high-fidelity map of failure by deconstructing the friction between three critical axes:
- Perceptual Resilience: Robustness across audio-visual (in)consistency.
- Timing Precision: Millisecond-level accuracy within "social windows."
- Generative Quality: AI-judged naturalness and coherence of interruptions.
The Core Insight: By decoupling these dimensions, we pinpoint exactly where strong perception fails to translate into fluid social interaction.
Given a video clip and a timestamp t, the model answers:
At timestamp `t`, who is speaking?

The model chooses from {A, B, C, D}.
Given a video prefix `V[0:t]` and a candidate speaker `X`, the model answers two sub-questions:
- Q1 (when): should `X` interrupt immediately after `t`?
- Q2 (how): if yes, what is the natural interruption content?
- Top-1 Accuracy
- Consistent / inconsistent split accuracy
- Gap: Δ = Acc_consistent - Acc_inconsistent
- Q1: Accuracy / Precision / Recall / F1 under tolerance windows such as δ = 0.2s
- Q2: LLM-judge score on {0, 25, 50, 75, 100}
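A sketch of how these metrics might be computed. The gap formula and the δ tolerance follow the definitions above; the greedy matcher and the `judge_score` helper are illustrative choices, not the official implementation:

```python
def gap(acc_consistent, acc_inconsistent):
    """Task I robustness gap: delta = Acc_consistent - Acc_inconsistent."""
    return acc_consistent - acc_inconsistent

def timing_prf1(pred_times, gold_times, delta=0.2):
    """Precision/Recall/F1 for Q1 under a +/-delta-second tolerance window.
    Greedily matches each predicted interruption time to an unused gold
    time within delta; unmatched predictions are false positives,
    unmatched gold times are false negatives."""
    gold = sorted(gold_times)
    used = [False] * len(gold)
    tp = 0
    for p in sorted(pred_times):
        for i, g in enumerate(gold):
            if not used[i] and abs(p - g) <= delta:
                used[i] = True
                tp += 1
                break
    fp = len(pred_times) - tp
    fn = len(gold) - tp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def judge_score(scores):
    """Average the discrete {0, 25, 50, 75, 100} judge scores for Q2."""
    return sum(scores) / len(scores)
```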
The paper protocol uses three judges for Q2:
- GPT-4o
- Gemini 3 Pro
- Qwen3-Omni
Perception strength does not guarantee interaction quality. Some models identify speakers well but perform poorly on natural interruption generation, while others generate plausible responses despite weak speaker grounding.
| Model | Who (%) | When Acc. (%) | How (/100) |
|---|---|---|---|
| GPT-4o | 36.75 | 46.89 | 69.64 |
| Gemini 2.5 Pro | 44.69 | 55.67 | 72.32 |
| Gemini 2.5 Flash | 47.03 | 61.50 | 85.08 |
| Gemini 3 Flash Preview | 53.23 | 61.06 | 79.08 |
| Gemini 3 Pro Preview | 64.99 | 67.31 | 81.77 |
| Qwen3-Omni | 69.25 | 63.64 | 45.57 |
Key observations:
- Who leader: Qwen3-Omni
- When leader: Gemini 3 Pro Preview
- How leader: Gemini 2.5 Flash
This rank inversion is why SocialOmni evaluates the full interaction profile instead of a single aggregate score.
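The rank inversion can be read directly off the table above; a small sketch that recomputes the per-dimension leaders from the reported numbers:

```python
# Reported scores from the results table above.
results = {
    "GPT-4o":                 {"who": 36.75, "when": 46.89, "how": 69.64},
    "Gemini 2.5 Pro":         {"who": 44.69, "when": 55.67, "how": 72.32},
    "Gemini 2.5 Flash":       {"who": 47.03, "when": 61.50, "how": 85.08},
    "Gemini 3 Flash Preview": {"who": 53.23, "when": 61.06, "how": 79.08},
    "Gemini 3 Pro Preview":   {"who": 64.99, "when": 67.31, "how": 81.77},
    "Qwen3-Omni":             {"who": 69.25, "when": 63.64, "how": 45.57},
}

# A different model leads each dimension: the rank inversion.
leaders = {dim: max(results, key=lambda m: results[m][dim])
           for dim in ("who", "when", "how")}
```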
We recommend the following environment:
- Python `>=3.10,<3.11`
- CUDA-compatible PyTorch runtime for local omni models
- `uv` for dependency and environment management
Install with:
```bash
git clone https://github.com/Alexisxty/SocialOmni.git
cd SocialOmni
uv sync
```

Recommended setup:
- Put the single OpenAI-compatible API credential pair in `.env`
- Put non-sensitive defaults such as local model paths, `server_url`, dataset paths, output directories, and log directories in `config/config.yaml`
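One way to honor this split at runtime is to load the secrets from `.env` into the process environment before reading `config/config.yaml`. A minimal stdlib-only sketch (real projects often use `python-dotenv` for the same job):

```python
import os

def parse_dotenv(text):
    """Tiny KEY=VALUE parser for .env content; skips blanks and # comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env

# Secrets go into the environment; non-sensitive defaults stay in the YAML.
secrets = parse_dotenv(
    "OPENAI_API_KEY=sk-example\nOPENAI_API_BASE=https://api.example.com/v1\n"
)
os.environ.update(secrets)
```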
Start from the provided template:
```bash
cp .env.example .env
```

Then edit `.env` and set the API credential pair:

```bash
OPENAI_API_KEY=...
OPENAI_API_BASE=...
```

Then edit `config/config.yaml` and set:
- local model path or `server_url`
- dataset path
- output and result directories
Notes:
- All hosted API models in this repo, including the Gemini model keys, use the same OpenAI-compatible `OPENAI_API_KEY` and `OPENAI_API_BASE` configuration.
- API credentials should live in `.env`, not in `config/config.yaml`.
- API models do not require local weights.
- Local omni models require a valid `model_path` and usually a local `server_url`.
- If you leave `benchmark.level1.dataset_path`, `benchmark.level1.video_dir`, `benchmark.level2.dataset_path`, and `benchmark.level2.video_dir` empty, the benchmark uses the default `data/` layout shown below.
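The fallback described in the last note can be sketched as follows; the default filenames mirror the expected `data/` layout, and `resolve` is a hypothetical helper, not code from the repo:

```python
from pathlib import Path

# Defaults mirroring the expected data/ layout.
DEFAULTS = {
    "benchmark.level1.dataset_path": Path("data/level_1/dataset.json"),
    "benchmark.level1.video_dir":    Path("data/level_1/videos"),
    "benchmark.level2.dataset_path": Path("data/level_2/annotations.json"),
    "benchmark.level2.video_dir":    Path("data/level_2/videos"),
}

def resolve(config, key):
    """Use the configured path when set, otherwise fall back to the default."""
    value = config.get(key) or ""
    return Path(value) if value else DEFAULTS[key]
```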
Dataset source:
- Hugging Face dataset: `alexisty/SocialOmni`
- Default local target directory: `data/`
If you keep the default benchmark paths, the runner will auto-download missing benchmark data into `data/level_1` or `data/level_2` on first use.
To disable this behavior, set:
```bash
export SOCIALOMNI_AUTO_DOWNLOAD_DATASET=0
```

You can also download the benchmark data manually:
```bash
uv run python scripts/download_dataset.py --level all
```

Default expected layout:

```
data/
├── level_1/
│   ├── dataset.json
│   └── videos/
└── level_2/
    ├── annotations.json
    └── videos/
```
Common environment variables:
- `OPENAI_API_KEY`
- `OPENAI_API_BASE`
- `SOCIALOMNI_AUTO_DOWNLOAD_DATASET`
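For scripts that need to respect the auto-download toggle, a reading of the flag consistent with the `SOCIALOMNI_AUTO_DOWNLOAD_DATASET=0` example above might look like this (the exact parsing in the runner may differ):

```python
import os

def auto_download_enabled(environ=None):
    """Auto-download is on by default and disabled when the variable is "0"."""
    environ = os.environ if environ is None else environ
    return environ.get("SOCIALOMNI_AUTO_DOWNLOAD_DATASET", "1") != "0"
```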
Example:
```bash
uv run models/model_server/qwen3_omni/qwen3_omni_server.py
```

Other model server entrypoints are located under:

```
models/model_server/*/*_server.py
```
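Since hosted and local models share the OpenAI-compatible credential pair, a client can address a local server the same way it addresses a hosted API. A hedged sketch that only builds the request object: the `/chat/completions` path, the default port, and the payload shape are assumptions here; the actual clients live under `models/`:

```python
import json
import os
from urllib import request

def build_chat_request(prompt, model="qwen3_omni"):
    """Assemble an OpenAI-compatible chat request; does not send it."""
    base = os.environ.get("OPENAI_API_BASE", "http://localhost:8000/v1")
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return request.Request(
        f"{base}/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {os.environ.get('OPENAI_API_KEY', '')}",
        },
    )
```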
Run Task I:

```bash
uv run run_benchmark.py --model qwen3_omni
```

Run Task II (with `--resume` to continue an interrupted run):

```bash
uv run run_benchmark_level2.py --model qwen3_omni --resume
```

Repository layout:

```
SocialOmni/
├── config/                  # runtime, model, and evaluation configs
├── data/                    # local datasets (not tracked)
├── docs/                    # docs and visual assets
├── models/                  # model servers, clients, and shared benchmark logic
├── scripts/                 # utility scripts
├── run_benchmark.py         # Task I entrypoint
├── run_benchmark_level2.py  # Task II entrypoint
├── pyproject.toml           # dependency definition
└── README.md
```
Use the following keys with `--model`:

- `gpt4o`
- `gemini_2_5_flash`
- `gemini_2_5_pro`
- `gemini_3_flash_preview`
- `gemini_3_pro_preview`
- `qwen3_omni`
- `qwen3_omni_thinking`
- `qwen2_5_omni`
- `miniomni_2`
- `omnivinci`
- `vita_1_5`
- `baichuan_omni_1_5`
- `ming`
- Keep dataset and result directories local and out of version control.
- Use fixed prompt templates and stable runtime configs for cross-model comparison.
- Report split-wise metrics and confidence intervals when claiming improvements.
- For generation evaluation, keep the judge set fixed across runs.
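For the recommended confidence intervals, a percentile bootstrap over per-item scores is one simple, defensible choice (a sketch, not a prescribed protocol):

```python
import random

def bootstrap_ci(scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-item scores."""
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choices(scores, k=len(scores))) / len(scores)
        for _ in range(n_boot)
    )
    return means[int(alpha / 2 * n_boot)], means[int((1 - alpha / 2) * n_boot) - 1]
```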
If you find SocialOmni useful in your research, please cite:
```bibtex
@article{xie2026socialomni,
  title={SocialOmni: Benchmarking Audio-Visual Social Interactivity in Omni Models},
  author={Tianyu Xie and Jinfa Huang and Yuexiao Ma and Rongfang Luo and Yan Yang and Wang Chen and Yuhui Zeng and Ruize Fang and Yixuan Zou and Xiawu Zheng and Jiebo Luo and Rongrong Ji},
  journal={arXiv preprint arXiv:2603.16859},
  year={2026}
}
```