Dartmouth College
Paper | Dataset | Results | Training
Your model gets the right answer. But does it actually reason?
Current vision-language benchmarks only check final answers. This means a model that guesses correctly scores the same as one that reasons correctly. CRYSTAL exposes this blind spot.
Q: "Which of the 3 objects is the smallest?" — A model answers correctly (C: middle console), but its reasoning states the middle console is larger than the others while claiming it's the smallest. Traditional benchmarks give full credit for this lucky guess. CRYSTAL catches it (Match F1: 0.15).
Clear Reasoning via Yielded Steps, Traceability and Logic — a diagnostic benchmark with 6,372 instances that evaluates multimodal reasoning through verifiable intermediate steps.
Instead of just asking "is the answer correct?", CRYSTAL asks:
- Did the model identify the right visual elements?
- Did it apply the correct logical steps?
- Are the steps in a coherent order?
Multi-agent reasoning pipeline. Four independent MLLMs generate candidate steps, which are semantically clustered into consensus sequences, validated by a fifth agent, and verified by human annotators.
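For intuition, here is a minimal sketch of that consensus construction. It assumes a generic `query_mllm` wrapper (hypothetical, standing in for whichever MLLM APIs are used), agglomerative clustering, and a 0.4 cosine-distance threshold; none of these details are taken from the paper.

```python
# Illustrative sketch of the consensus-step pipeline; `query_mllm` is a
# hypothetical helper and the clustering settings are assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering

encoder = SentenceTransformer("all-distilroberta-v1")  # same encoder family used for evaluation

def query_mllm(model, image, question, steps=None):
    """Hypothetical wrapper: prompt one MLLM and return a list of step strings."""
    raise NotImplementedError

def consensus_steps(image, question, generators, validator, min_agents=3):
    # 1) Each generator MLLM proposes its own candidate step sequence.
    candidates, owners = [], []
    for i, agent in enumerate(generators):
        for step in query_mllm(agent, image, question):
            candidates.append(step)
            owners.append(i)

    # 2) Cluster semantically similar steps across agents.
    emb = encoder.encode(candidates, normalize_embeddings=True)
    labels = AgglomerativeClustering(
        n_clusters=None, distance_threshold=0.4,
        metric="cosine", linkage="average").fit_predict(emb)

    # 3) Keep clusters supported by enough distinct agents (consensus),
    #    one representative step per cluster, roughly in original order.
    clusters = []
    for c in sorted(set(labels)):
        idx = [k for k, l in enumerate(labels) if l == c]
        if len({owners[k] for k in idx}) >= min_agents:
            clusters.append((np.mean(idx), candidates[idx[0]]))
    consensus = [step for _, step in sorted(clusters)]

    # 4) A fifth agent validates the merged sequence before human verification.
    return query_mllm(validator, image, question, steps=consensus)
```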
| Statistic | Value |
|---|---|
| Total instances | 6,372 |
| Avg. reasoning steps per question | 11.6 |
| Step range | 3 – 42 |
| Difficulty split | Easy 48.5% / Medium 44.1% / Hard 7.4% |
| Sources | MathVision, ScienceQA, RealWorldQA, MMVP, PlotQA |
| Metric | What it measures |
|---|---|
| Match F1 | Step-level precision & recall via semantic similarity matching |
| Ordered Match F1 | Penalizes disordered reasoning chains (via longest-increasing-subsequence ratio) |
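A rough picture of how these metrics can be computed: embed both step lists, greedily match each predicted step to an unused reference step above a similarity threshold, compute precision/recall/F1 over the matches, and scale F1 by the longest-increasing-subsequence (LIS) ratio for the ordered variant. The greedy matcher, the 0.5 threshold, and the exact LIS scaling are assumptions for illustration; this is not the official evaluation script.

```python
# Sketch only: greedy one-to-one matching at a 0.5 cosine threshold and
# F1 * LIS-ratio for the ordered variant are assumptions, not the official code.
from bisect import bisect_left
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-distilroberta-v1")

def lis_length(seq):
    """Length of the longest increasing subsequence (patience sorting)."""
    tails = []
    for x in seq:
        i = bisect_left(tails, x)
        if i == len(tails):
            tails.append(x)
        else:
            tails[i] = x
    return len(tails)

def step_metrics(pred_steps, ref_steps, threshold=0.5):
    sim = util.cos_sim(encoder.encode(pred_steps),
                       encoder.encode(ref_steps)).numpy()

    # Greedily match each predicted step to its best unused reference step.
    matches, used = [], set()
    for p in range(len(pred_steps)):
        order = sim[p].argsort()[::-1]
        best = next((int(r) for r in order if r not in used), None)
        if best is not None and sim[p, best] >= threshold:
            matches.append((p, best))
            used.add(best)

    precision = len(matches) / max(len(pred_steps), 1)
    recall = len(matches) / max(len(ref_steps), 1)
    f1 = 2 * precision * recall / (precision + recall) if matches else 0.0

    # Ordered variant: penalize out-of-order matches via the LIS ratio of the
    # matched reference indices, taken in prediction order.
    ref_order = [r for _, r in matches]
    lis_ratio = lis_length(ref_order) / len(ref_order) if ref_order else 0.0
    return {"match_f1": f1, "ordered_match_f1": f1 * lis_ratio}
```

Under this formulation, a prediction that covers every reference step but in scrambled order keeps its Match F1 while its Ordered Match F1 drops with the LIS ratio.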
CRYSTAL spans diverse reasoning scenarios — from spatial understanding to scientific reasoning and mathematical problem-solving:
See a full reasoning trace example
Image: Egg placement puzzle (2x3 grid + chicken)
Question: "How many eggs can you place so no two eggs are in the same row or column?"
Baseline model (Qwen2.5-VL-3B):
"The answer is 3." — Correct answer, but states the goal without explaining how.
After CPR-Curriculum training:
"Step 1: The grid is 2 rows x 3 columns. Step 2: Apply checkerboard placement constraint. Step 3: Place eggs at positions (1,1), (2,2) avoiding shared rows/columns. Step 4: Maximum is 2 per constraint... but wait, with 2 rows we can place at most 2. Re-examine: 3 columns allow diagonal placement. Answer: 3." — Structured reasoning with explicit strategy.
We evaluated 20 MLLMs (16 open-source from 1B to 38B, plus GPT-5, GPT-5-mini, GPT-5.2 Instant, and Gemini 2.5 Flash):
19 out of 20 models produce high-precision but low-recall reasoning — they say a few correct things but skip most of the required steps.
| Model | Params | Accuracy | Match F1 | Precision | Recall |
|---|---|---|---|---|---|
| Commercial | | | | | |
| GPT-5 | n/a | 57.99% | 0.612 | 0.925 | 0.479 |
| GPT-5-mini | n/a | 55.59% | 0.773 | 0.978 | 0.669 |
| Gemini 2.5 Flash | n/a | 53.95% | 0.673 | 0.701 | 0.765 |
| GPT-5.2 Instant | n/a | 47.35% | 0.564 | 0.974 | 0.416 |
| Qwen Family | | | | | |
| Qwen3-VL-8B | 8B | 57.66% | 0.659 | 0.827 | 0.590 |
| Qwen3-VL-32B | 32B | 49.22% | 0.718 | 0.819 | 0.704 |
| Qwen2.5-VL-32B | 32B | 47.63% | 0.653 | 0.943 | 0.524 |
| Qwen2.5-VL-3B | 3B | 39.85% | 0.480 | 0.898 | 0.347 |
| Qwen3-VL-2B | 2B | 34.15% | 0.595 | 0.726 | 0.535 |
| Qwen2.5-VL-7B | 7B | 30.43% | 0.475 | 0.765 | 0.365 |
| InternVL Family | | | | | |
| InternVL3.5-8B | 8B | 51.98% | 0.530 | 0.882 | 0.416 |
| InternVL3.5-38B | 38B | 51.21% | 0.612 | 0.892 | 0.498 |
| InternVL3.5-4B | 4B | 37.61% | 0.432 | 0.895 | 0.325 |
| InternVL3.5-2B | 2B | 33.02% | 0.469 | 0.725 | 0.371 |
| InternVL3.5-1B | 1B | 30.13% | 0.330 | 0.616 | 0.243 |
| Other Open-Source | | | | | |
| Gemma3-12B | 12B | 33.83% | 0.605 | 0.838 | 0.499 |
| Gemma3-4B | 4B | 28.65% | 0.618 | 0.878 | 0.506 |
| Llama 3.2-11B | 11B | 24.83% | 0.471 | 0.713 | 0.379 |
| LLaVA-v1.6-7B | 7B | 24.66% | 0.512 | 0.961 | 0.370 |
| MiniCPM-v2.6-8B | 8B | 25.54% | 0.215 | 0.709 | 0.134 |
Even GPT-5 (best accuracy: 57.99%) recovers only 47.9% of reference steps. Models say the right things but omit most of the reasoning.
GPT-5 leads accuracy but ranks 8th in Match F1. GPT-5-mini leads F1 at lower accuracy. Gemma3-4B outperforms InternVL3.5-38B in reasoning despite 9.5x fewer parameters.
Among competitive models, no model preserves more than 60% of matched steps in the correct order. Models retrieve relevant reasoning steps but fail to organize them coherently.
Bigger models don't uniformly improve: Qwen3-VL-32B achieves better F1 (0.718) but lower accuracy (49.22%) than Qwen3-VL-8B (0.659 F1, 57.66% accuracy). Scale can improve answer extraction while suppressing reasoning coverage.
CRYSTAL isn't just a benchmark — it enables a new training paradigm. We propose Causal Process Reward (CPR), a multiplicative reward that couples answer correctness with step-level alignment:
Unlike additive rewards (which let models maximize accuracy by guessing while ignoring reasoning), CPR requires both correct answers and faithful reasoning.
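The contrast can be sketched in a few lines. The additive weighting `alpha` and the use of step-level Match F1 as the alignment term are illustrative assumptions; the paper's exact CPR formulation is not reproduced here.

```python
# Illustrative reward shaping only; the exact CPR weighting is not reproduced here.
def composite_reward(answer_correct: bool, step_f1: float, alpha: float = 0.5) -> float:
    # Additive shaping: a lucky guess (correct answer, no usable steps) still
    # collects the alpha term, so accuracy can be optimized while reasoning is ignored.
    return alpha * float(answer_correct) + (1.0 - alpha) * step_f1

def cpr_reward(answer_correct: bool, step_f1: float) -> float:
    # Multiplicative coupling: reward is non-zero only when the answer is
    # correct AND the generated steps align with the reference chain.
    return float(answer_correct) * step_f1
```

With the additive form, `composite_reward(True, 0.0)` still pays out `alpha` for a bare guess; `cpr_reward(True, 0.0)` pays nothing, so the policy only gains by producing steps that align with the reference chain.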
| Strategy | Accuracy | Match F1 | Precision | Recall | Ordered F1 |
|---|---|---|---|---|---|
| Baseline (Qwen2.5-VL-3B) | 39.85% | 0.480 | 0.898 | 0.347 | 0.434 |
| Composite (additive) | 44.92% | 0.426 | 0.983 | 0.284 | 0.392 |
| Answer-Only | 44.30% | 0.429 | 0.803 | 0.308 | 0.380 |
| CPR | 41.40% | 0.633 | 0.975 | 0.489 | 0.560 |
| CPR-Curriculum | 47.52% | 0.633 | 0.963 | 0.493 | 0.560 |
Additive strategies collapse at step 600 and diverge to NaN by step 1,500. CPR trains stably through 2,800 steps.
Left: Match F1 trajectory. Right: Accuracy trajectory. Composite collapses at step 600; CPR variants train stably through 2,800 steps.
Same two-phase protocol, no architecture-specific tuning. Recall increases 2.5x (0.325 → 0.811):
| Configuration | Accuracy | Match F1 | Precision | Recall | Ordered F1 |
|---|---|---|---|---|---|
| Baseline (InternVL3.5-4B) | 37.61% | 0.432 | 0.895 | 0.325 | 0.387 |
| CPR-Curriculum | 45.76% | 0.833 | 0.903 | 0.811 | 0.719 |
| Δ | +8.15 pp | +0.401 | +0.008 | +0.486 | +0.332 |
CPR-Curriculum generalizes across architectures: +93% Match F1 on InternVL3.5-4B without any model-specific tuning.
We tested 4 sentence encoders across 5 thresholds (100 experiments). DistilRoBERTa-v1 dominates with 4-8pp advantages across all model families. Model rankings remain stable across all configurations.
Encoder comparison across thresholds. DistilRoBERTa-v1 consistently outperforms alternatives. Threshold variation yields only 2-3pp swings.
On 100 adversarially sampled step pairs, the encoder achieves 84% agreement with a human annotator, with 100% agreement below the threshold (zero false matches on semantically unrelated steps).
| Similarity Band | Pairs | Agreement | Note |
|---|---|---|---|
| < 0.20 | 33 | 100.0% | No false matches |
| [0.20, 0.35) | 14 | 100.0% | No false matches |
| [0.35, 0.50) | 19 | 68.4% | Borderline zone |
| [0.50, 0.70) | 25 | 64.0% | Borderline zone |
| >= 0.70 | 9 | 88.9% | Clear matches |
| Overall | 100 | 84.0% | Cohen's kappa = 0.534 |
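A rough version of this agreement check is sketched below. It assumes step pairs with binary human "same step" labels and a 0.5 similarity threshold for the encoder's match decision; both are assumptions, and the band edges simply mirror the table above.

```python
# Sketch of the human-agreement analysis; the 0.5 match threshold and the
# data format are assumptions, not taken from the paper.
import numpy as np
from sklearn.metrics import cohen_kappa_score
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-distilroberta-v1")

def agreement_by_band(pairs, human_labels, match_threshold=0.5):
    """pairs: list of (step_a, step_b); human_labels: list of bools ("same step")."""
    a, b = zip(*pairs)
    sims = util.cos_sim(encoder.encode(list(a)), encoder.encode(list(b))).diagonal().numpy()
    encoder_labels = sims >= match_threshold
    human = np.asarray(human_labels)

    # Agreement within the similarity bands reported above.
    bands = [(0.0, 0.20), (0.20, 0.35), (0.35, 0.50), (0.50, 0.70), (0.70, 1.01)]
    for lo, hi in bands:
        mask = (sims >= lo) & (sims < hi)
        if mask.any():
            agree = np.mean(encoder_labels[mask] == human[mask])
            print(f"[{lo:.2f}, {hi:.2f}): n={int(mask.sum()):3d}  agreement={agree:.1%}")

    print(f"Overall Cohen's kappa: {cohen_kappa_score(human, encoder_labels):.3f}")
```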
CRYSTAL Benchmark Example (MathVision)
Q: "Anna starts in the direction of the arrow. At each crossing she turns either right or left. At the first crossing she turns right, at the next left, then left again, then right, then left and left again. What will she find at the next crossing?"
Ground Truth: A
Reference Reasoning Steps (12 steps):
- Anna starts in the direction of an arrow
- She turns at each crossing either right or left
- First crossing: turns right
- Second crossing: left
- Third crossing: left again
- Fourth crossing: right
- Fifth crossing: left
- Sixth crossing: left again
- Question asks what she finds at the next crossing
- Answer format is multiple-choice (A, B, C, D, E)
- [2 more steps for spatial tracing]
GRPO Qualitative Improvement
Q: "What color is the traffic light?" GT: Green
Baseline (2 steps, 33% coverage):
1. Noted the traffic light is green.
2. Identified the traffic light as green.

Cherry-picks high-confidence steps only.
After GRPO (4 steps, 67% coverage):
1. The traffic lights in the scene are green.
2. The traffic lights are positioned at an intersection.
3. The traffic lights are part of the traffic control system.
4. The traffic lights are visible from a distance.

More comprehensive reasoning (+34pp coverage).
The CRYSTAL benchmark dataset and evaluation code will be available soon on HuggingFace:
Coming soon on HuggingFace Hub — dataset, evaluation scripts, and pre-trained CPR-Curriculum checkpoints.
| If you are... | CRYSTAL helps you... |
|---|---|
| Building MLLMs | Diagnose where reasoning fails (perception vs. logic) |
| Evaluating models | Go beyond accuracy to measure reasoning transparency |
| Training with RL | Use CPR rewards to improve reasoning without manual step annotations |
| Researching scaling | Understand non-monotonic trade-offs between accuracy and reasoning |
| Deploying AI systems | Identify models that guess vs. models that actually reason |
@misc{barrios2026crystal,
title = {Beyond Final Answers: CRYSTAL Benchmark for Transparent
Multimodal Reasoning Evaluation},
author = {Wayner Barrios and SouYoung Jin},
year = {2026},
eprint = {2603.13099},
archivePrefix = {arXiv},
primaryClass = {cs.AI},
url = {https://arxiv.org/abs/2603.13099}
}

This project is released under the MIT License.





