A geometric benchmark for measuring LLM attention through deterministic spatial reasoning tasks.
ZelusBench isolates attention — selective focus, sustained tracking, and adaptive updating — from raw reasoning ability. The geometric operations at each step are simple; the challenge is maintaining an accurate internal representation across long, noisy, and disrupted sequences.
- Core Idea
- Generative Scenario Construction
- Geometric Primitives
- Events (Transformations)
- Benchmark Tasks
- Scoring & Metrics
- Project Structure
- Running on Kaggle
A scenario is a sequence of statements and queries presented to the model in natural language. Statements either define points (relative to other points) or apply transformations to the space. Queries ask the model to report computed positions, distances, or relationships.
Because every point is defined relative to others via explicit geometric operations, the ground-truth answer is always deterministically computable. No LLM judge is needed — we compare numerical outputs against exact solutions (within a tolerance).
The key insight: the math at each step is trivial (vector addition, rotation matrices, midpoints). What makes the task hard is:
- Tracking a growing graph of dependencies across a long prompt
- Ignoring irrelevant points and relationships
- Updating the entire representation when a transformation occurs
- Maintaining accuracy as dimensionality and chain depth increase
This means the benchmark measures attention, not mathematical ability.
Scenarios are built by a single generative loop, not pre-planned upfront. At each step, the generator either:
- Adds a point — defined relative to an existing point (or set of points via midpoint/centroid)
- Applies a transform — rotates, translates, reflects, or scales existing points
- Emits a query — asks about a position, distance, or boolean relationship
Every point is real — there is no separate "distractor" concept. Points that don't matter for a query are simply points the model must track but doesn't need for the answer.
Three continuous knobs control difficulty:
| Parameter | Range | Effect |
|---|---|---|
leaf_bias |
0.0 – 1.0 | Controls graph topology. 1.0 = always extend leaf nodes (linear chains). 0.0 = branch from any point (bushy/diamond graphs). |
num_points |
4 – 40+ | Total points generated. More points = more irrelevant information to filter through. |
transform_prob |
0.0 – 1.0 | Probability each step triggers a transform instead of adding a point. 0.0 = static space. 0.2 = ~20% of steps are transforms. |
Additional config: dim (2D/3D), min/max_chain_depth, transform_types, query_types, point_def_types, coordinate/magnitude ranges.
Phase 1 (depth ramp): Force linear growth until min_chain_depth is reached.
Phase 2 (generative): Add points (via leaf_bias) and transforms (via transform_prob).
Phase 3 (queries): Emit all remaining queries targeting deep/specific points.
Process the following 3D spatial reasoning scenario. Statements are chronological — propagate all transformations before answering.
Format: [Answer q_ID] value — e.g. [Answer q_001] (0.0, 0.0, 0.0) or [Answer q_002] 5.385 or [Answer q_003] B
---
Point A is 1.8 units from Point O at angle 267 degrees.
Point B is 1.7 units from Point A at angle 267 degrees.
Point C is at offset (-3.1, -1.9, 0.0) from Point B.
Point D is 5.5 units from Point C at angle 196 degrees.
Point E is 3.4 units from Point A in the direction (-0.3, -0.9, 1.4).
Rotate Point C, Point E by 120 degrees around center (0.0, 0.0, 0.0) around the axis (1.0, 0.0, 0.0).
Point F is at offset (0.3, 2.2, 0.8) from Point D.
[Query q_000] Distance from C to B?
Point G is 7.9 units from Point E at angle 308 degrees.
Point H is 2.1 units from Point G at angle 128 degrees.
[Query q_001] Distance from H to F?
Point I is the weighted centroid of Point D (weight 0.3), Point B (weight 0.9), Point H (weight 0.7), Point F (weight 0.3).
Point J is 3.8 units from Point A in the direction (2.1, 0.8, -1.6).
[Query q_002] Position of F? (x, y, z)
All scenarios operate in Cartesian space of configurable dimensionality (2D or 3D). The origin O is always implicitly defined at the zero vector. Angles use degrees in natural-language prompts; the engine works in radians internally.
Every point (except the origin) is defined relative to one or more existing points. A PointDefinition is an object that, given the current state of the world, deterministically resolves to an absolute coordinate.
| Type | Natural Language Example | Parametric Form |
|---|---|---|
| Cartesian offset | "Point B is at offset (3, -1, 2) from Point A." | target = anchor + offset_vector |
| Magnitude + direction vector | "Point B is 5.0 units from A in direction [1, 1, 0]." | target = anchor + magnitude * normalize(direction) |
| Magnitude + spherical angles (3D) | "Point C is 3.2 units from A at θ=45°, φ=30°." | target = anchor + magnitude * spherical_to_cartesian(θ, φ) |
| Magnitude + polar angle (2D) | "Point C is 4.0 units from A at angle 60°." | target = anchor + magnitude * [cos(θ), sin(θ)] |
| Midpoint | "Point D is the midpoint of A, B, and C." | target = mean([A, B, C]) |
| Weighted centroid | "Point E is the centroid of A (weight 2), B (weight 1)." | target = weighted_mean(points, weights) |
| Projection | "Point F is the projection of C onto line AB." | target = A + dot(C-A, B-A)/dot(B-A, B-A) * (B-A) |
Events mutate the geometric state mid-scenario. They are the primary mechanism for testing attention updating — the model must propagate changes through its internal representation.
| Event | Description | Parameters |
|---|---|---|
| Rotation | Rotate a subset of points around an axis through a given center. | points, axis, center, angle |
| Translation | Shift a subset of points by a displacement vector. | points, displacement |
| Reflection | Reflect a subset of points across a plane (3D) or line (2D). | points, plane_normal, plane_point |
| Scaling | Scale distances of a subset of points from a center. | points, center, factor |
Transforms are generated probabilistically during the generative loop (controlled by transform_prob). They only target points that already exist — no forward references.
When a point is transformed, all points defined relative to it must also update. The engine tracks a dependency DAG and propagates changes correctly. The model is expected to do the same.
ZelusBench is a suite of 18 Kaggle benchmark tasks, each a standalone notebook in tasks/. Tasks are split into two categories: isolated (vary one axis, randomize everything else) and combined (all axes at a fixed difficulty tier).
Each isolated benchmark varies only its target variable while randomizing all other conditions via ScenarioConfig.randomize_except(). This isolates the causal effect of each attention axis across diverse backgrounds.
Queries are depth-targeted: when testing sustained attention, every query targets points at exactly the specified chain depth.
Backgrounds are deterministic across levels: within a category, the same seed index produces the same randomized background regardless of the target level.
Does accuracy degrade as dependency chains grow longer? Pins leaf_bias=1.0 and query_target_depth for exact depth targeting.
| Task | Notebook | Depths | Seeds | Scenarios |
|---|---|---|---|---|
attn_sustained_short |
attn_sustained_short.ipynb |
2, 3, 4 | 10 each | 30 |
attn_sustained_medium |
attn_sustained_medium.ipynb |
8, 9, 10 | 10 each | 30 |
attn_sustained_long |
attn_sustained_long.ipynb |
16, 18, 20 | 10 each | 30 |
Can the model filter irrelevant points? Pins num_points — more points means more noise.
| Task | Notebook | num_points |
Seeds | Scenarios |
|---|---|---|---|---|
attn_selective_clean |
attn_selective_clean.ipynb |
4 (minimal) | 15 | 15 |
attn_selective_noisy |
attn_selective_noisy.ipynb |
15 (moderate) | 15 | 15 |
attn_selective_saturated |
attn_selective_saturated.ipynb |
40 (dense) | 15 | 15 |
Can the model propagate transforms through its representation? Pins transform_prob.
| Task | Notebook | transform_prob |
Seeds | Scenarios |
|---|---|---|---|---|
attn_updating_static |
attn_updating_static.ipynb |
0.0 (no transforms) | 15 | 15 |
attn_updating_light |
attn_updating_light.ipynb |
0.1 (~10% steps) | 15 | 15 |
attn_updating_heavy |
attn_updating_heavy.ipynb |
0.25 (~25% steps) | 15 | 15 |
How does dependency graph topology affect accuracy? Pins leaf_bias.
| Task | Notebook | leaf_bias |
Topology | Seeds | Scenarios |
|---|---|---|---|---|---|
attn_structural_linear |
attn_structural_linear.ipynb |
1.0 | Pure chain extension | 15 | 15 |
attn_structural_branching |
attn_structural_branching.ipynb |
0.5 | Even mix | 15 | 15 |
attn_structural_merging |
attn_structural_merging.ipynb |
0.25 | Mostly internal anchors | 15 | 15 |
attn_structural_diamond |
attn_structural_diamond.ipynb |
0.0 | Pure random anchor (bushy) | 15 | 15 |
Can the model maintain higher-dimensional state? Pins dim.
| Task | Notebook | Dim | Seeds | Scenarios |
|---|---|---|---|---|
attn_dim_2 |
attn_dim_2.ipynb |
2D | 15 | 15 |
attn_dim_3 |
attn_dim_3.ipynb |
3D | 15 | 15 |
All knobs set to the same difficulty tier simultaneously.
| Task | Notebook | Depth | leaf_bias |
num_points |
transform_prob |
Dim | Seeds | Queries |
|---|---|---|---|---|---|---|---|---|
attn_simple |
attn_simple.ipynb |
2–3 | 1.0 | 4 | 0.0 | 2D | 15 | 45 |
attn_medium |
attn_medium.ipynb |
5–8 | 0.5 | 12 | 0.1 | 3D | 15 | 45 |
attn_complex |
attn_complex.ipynb |
16–32 | 0.25 | 30 | 0.2 | 3D | 15 | 75 |
| Category | Tasks | Scenarios | Queries/Scenario | Est. Queries |
|---|---|---|---|---|
| Sustained Attention | 3 | 90 | 3 | ~270 |
| Selective Attention | 3 | 45 | 3 | ~135 |
| Attention Updating | 3 | 45 | 3 | ~135 |
| Structural Attention | 4 | 60 | 3 | ~180 |
| Dimensionality | 2 | 30 | 3 | ~90 |
| Combined (simple/medium/complex) | 3 | 45 | 3–5 | ~165 |
| Total | 18 | 315 | — | ~975 |
Each query has a deterministic ground-truth answer (a coordinate vector or scalar distance). Scoring uses relative error with tiered thresholds:
| Tier | Condition | Score |
|---|---|---|
| Exact | Relative error < 1% | 1.0 |
| Close | Relative error < 5% | 0.7 |
| Approximate | Relative error < 15% | 0.3 |
| Wrong | Relative error ≥ 15% | 0.0 |
| Refused / Unparseable | Model doesn't produce a coordinate | 0.0 |
Relative error: ‖predicted - truth‖ / max(‖truth‖, ε) where ε avoids division by zero near the origin.
Models are instructed to wrap each answer with a structured tag:
[Answer q_000] (3.0, -1.5, 2.0)
[Answer q_001] 5.385
[Answer q_002] B
The parser tries [Answer q_ID] first, falls back to [Query q_ID] splitting, then full-text extraction (using the last match to skip reasoning chain-of-thought).
The benchmark produces diagnostic profiles, not a single leaderboard number:
- Attention Decay Curve — accuracy vs. chain depth (sustained attention)
- Density Robustness — accuracy vs. num_points (selective attention)
- Transform Adaptation — accuracy vs. transform_prob (attention updating)
- Topology Sensitivity — accuracy vs. leaf_bias (structural attention)
- Dimensionality Gap — 2D vs. 3D accuracy delta
- Combined Difficulty Curve — simple → medium → complex progression
zelusbench/
├── README.md
├── pyproject.toml
│
├── zelusbench/ # Core library (uploaded as Kaggle dataset)
│ ├── __init__.py
│ │
│ ├── geometry/ # Geometric engine (source of truth)
│ │ ├── __init__.py
│ │ ├── point.py # PointDefinition: 7 definition types
│ │ ├── space.py # Space: dependency DAG, position resolution
│ │ ├── transforms.py # Rotation, translation, reflection, scaling
│ │ └── vectors.py # Vector math utilities
│ │
│ ├── scenarios/ # Generative scenario construction
│ │ ├── __init__.py
│ │ ├── config.py # ScenarioConfig: leaf_bias, num_points, transform_prob
│ │ ├── generator.py # ScenarioGenerator: step-by-step generative loop
│ │ └── templates.py # Natural language templates for statements/queries
│ │
│ └── evaluation/ # Scoring & metrics
│ ├── __init__.py
│ ├── parser.py # Extract coordinates/distances from model output
│ ├── scorer.py # Tiered scoring (EXACT/CLOSE/APPROXIMATE/WRONG)
│ └── reports.py # Diagnostic profiles across dimensions
│
├── tasks/ # Kaggle benchmark notebooks (18 tasks, one per notebook)
│ │
│ │ # Sustained Attention (chain depth)
│ ├── attn_sustained_short.ipynb # Depths 2, 3, 4
│ ├── attn_sustained_medium.ipynb # Depths 8, 9, 10
│ ├── attn_sustained_long.ipynb # Depths 16, 18, 20
│ │
│ │ # Selective Attention (num_points)
│ ├── attn_selective_clean.ipynb # 4 points
│ ├── attn_selective_noisy.ipynb # 15 points
│ ├── attn_selective_saturated.ipynb # 40 points
│ │
│ │ # Attention Updating (transform_prob)
│ ├── attn_updating_static.ipynb # 0.0 (no transforms)
│ ├── attn_updating_light.ipynb # 0.1
│ ├── attn_updating_heavy.ipynb # 0.25
│ │
│ │ # Structural Attention (leaf_bias)
│ ├── attn_structural_linear.ipynb # 1.0 (linear)
│ ├── attn_structural_branching.ipynb# 0.5 (mixed)
│ ├── attn_structural_merging.ipynb # 0.25 (bushy)
│ ├── attn_structural_diamond.ipynb # 0.0 (random anchor)
│ │
│ │ # Dimensionality
│ ├── attn_dim_2.ipynb # 2D
│ ├── attn_dim_3.ipynb # 3D
│ │
│ │ # Combined (all axes at one tier)
│ ├── attn_simple.ipynb # All easy (baseline)
│ ├── attn_medium.ipynb # All moderate
│ └── attn_complex.ipynb # All hard (stress test)
│
└── tests/ # Unit tests (79 tests)
├── test_point.py # 21 tests: all point definition types
├── test_space.py # 12 tests: DAG, propagation, serialization
├── test_transforms.py # 13 tests: all transform types
└── test_scenarios.py # 33 tests: generation, parsing, scoring, depth targeting
Spaceis the single source of truth. It holds a DAG ofPointDefinitionobjects and resolves any point to an absolute coordinate. All transforms mutate theSpace. Engine correctness is verified with unit tests — if the engine is correct, ground truth is correct.- Single generative loop. Scenarios are built step-by-step — each point, transform, and query is generated inline using only points that already exist. No forward references, no separate "distractor" concept.
- Continuous difficulty knobs. Topology (
leaf_bias), density (num_points), and dynamism (transform_prob) are continuous parameters, not discrete categories. This enables smooth difficulty curves. - Deterministic generation from seed. Every scenario is fully reproducible from a
ScenarioConfig+ seed. No randomness at evaluation time. - Depth-targeted queries. When
query_target_depthis set, all queries target points at exactly the specified chain depth. - Randomized backgrounds.
ScenarioConfig.randomize_except()generates diverse scenarios while pinning only the variable under test.
Each notebook in tasks/ is a standalone Kaggle Benchmark task using the kaggle_benchmarks (kbench) library.
- Upload the
zelusbench/package as a Kaggle dataset - Each task notebook imports from the dataset and uses
@kbench.task+llm.prompt() - Each notebook ends with
%choose <task_name>to register with the benchmark
@kbench.task(name="zelusbench_attn_sustained_short")
def zelusbench_attn_sustained_short(llm) -> tuple[float, float]:
for depth in CHAIN_DEPTHS: # e.g., [2, 3, 4]
for i in range(SEEDS):
bg_rng = random.Random(i * 7919) # deterministic background
cfg = ScenarioConfig.randomize_except(bg_rng, pinned={
"min_chain_depth": depth, "max_chain_depth": depth,
"leaf_bias": 1.0,
"query_target_depth": depth, "seed": depth * 1000 + i,
})
scenario = ScenarioGenerator(cfg).generate(scenario_id)
response = llm.prompt(scenario.prompt)
scores = score_response(response, scenario)
return overall_accuracy, std_dev
zelusbench_attn_sustained_short.run(llm=kbench.llm)Scoring is deterministic — no LLM judge needed. Each query is scored by relative error against the geometric ground truth:
| Tier | Relative Error | Score |
|---|---|---|
| Exact | < 1% | 1.0 |
| Close | < 5% | 0.7 |
| Approximate | < 15% | 0.3 |
| Wrong | ≥ 15% | 0.0 |
| Refused | Unparseable | 0.0 |