This repository studies the geometry of reasoning in google/gemma-3-4b-it by measuring intrinsic dimension in hidden-state space and by asking how many principal components are required to decode different properties from activations. Across the completed runs, layer 17 emerges as a useful middle-layer probe: it carries a relatively high-dimensional representation of open-ended reasoning, yet the geometry contracts sharply when semantic content is held fixed under paraphrase, and it separates lexical features far more efficiently than it separates reasoning from non-reasoning discourse.
The central picture is therefore not that reasoning lives on a single tiny manifold, nor that it simply fills the ambient residual space. Instead, reasoning appears to occupy a structured intermediate regime: broad enough to support multiple cognitive sub-behaviors, but still organized enough that meaning-preserving variation stays confined to a much smaller region. The new pooled-segment results sharpen that picture further: reasoning as a whole is not obviously more expansive than its constituent sub-behaviors, suggesting that the global reasoning manifold is assembled from several partially overlapping local modes rather than from a single uniformly broad cloud.
```shell
conda create -n rdim python=3.14 -y
conda activate rdim
pip install -e .
(cd data; bash download_dataset.sh)
```

The implementation uses `torch`, `transformers`, `scikit-dimension`, `scikit-learn`, `datasets`, `ijson`, `jaxtyping`, and `einops`. All experiment outputs are stored under `results/`.
Experiments 1 to 3 use `data/annotated_dataset.json`, which contains long reasoning traces from Gemini and DeepSeek together with sentence-level annotations. The six behavior labels are initializing, deduction, adding-knowledge, example-testing, uncertainty-estimation, and backtracking.
Experiments 4 and 5 use `mattwesney/General_Inquiry_Thinking-Chain-Of-Thought`. Reasoning text is drawn from `meta.reasoning`, non-reasoning text from `answer`, and `<think>`-style tags are stripped before chunking.
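The tag-stripping and chunking step can be sketched as below. This is a minimal illustration: the `strip_think_tags` and `chunk_words` helpers are hypothetical names, and whitespace tokens stand in for the model tokenizer that the real pipeline would use.

```python
import re

def strip_think_tags(text: str) -> str:
    """Remove <think>...</think> wrappers, keeping the inner text."""
    return re.sub(r"</?think>", "", text).strip()

def chunk_words(text: str, chunk_size: int = 32) -> list[str]:
    """Split text into fixed-size chunks. Whitespace tokens are used
    here for self-containment; the actual pipeline would chunk with
    the Gemma tokenizer instead."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)]

cleaned = strip_think_tags("<think>First, consider the premise.</think>")
print(cleaned)  # First, consider the premise.
print(len(chunk_words(" ".join(["tok"] * 70), 32)))  # 3 chunks: 32 + 32 + 6
```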
Intrinsic dimension is estimated with `skdim.id.MLE` on pooled residual activations. Experiments 1 and 2 now use the same merged annotated segments, filtered to a minimum length of 8 tokens; Experiment 1 pools them across categories with balanced sampling, while Experiment 2 estimates each category separately. Classification experiments apply PCA to pooled activations and train a `LinearSVC`. Multiple layers were evaluated, but this report focuses on layer 17.
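For intuition, the Levina-Bickel maximum-likelihood estimator that `skdim.id.MLE` implements can be sketched directly with scikit-learn's nearest-neighbor search. This is a simplified stand-in for the library call, not the repository's code, and it checks itself on data with a known 2-dimensional structure:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def mle_intrinsic_dim(X: np.ndarray, k: int = 20) -> float:
    """Levina-Bickel MLE of intrinsic dimension (simplified sketch
    of what skdim.id.MLE computes)."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dists, _ = nn.kneighbors(X)
    dists = dists[:, 1:]  # drop the self-distance in column 0
    # per-point estimate: inverse mean log-ratio of the k-th neighbor
    # distance to each closer neighbor distance
    log_ratios = np.log(dists[:, -1:] / dists[:, :-1])
    m = (k - 1) / log_ratios.sum(axis=1)
    return float(m.mean())

# sanity check: Gaussian points on a 2-D subspace of a 10-D ambient space
rng = np.random.default_rng(0)
Z = rng.standard_normal((2000, 2)) @ rng.standard_normal((2, 10))
print(mle_intrinsic_dim(Z))  # close to 2, the true intrinsic dimension
```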
At layer 17, pooled merged reasoning segments from the annotated corpus have intrinsic dimension 15.37 over 10,000 units. This is the baseline scale of the broader reasoning manifold: not so low as to look like a narrow code, but far smaller than the 2560-dimensional ambient residual space. Larger 20,000-unit pooled runs were unstable for MLE, so the stable 10,000-unit setting is the reported result.
When the same corpus is partitioned by reasoning behavior, intrinsic dimension is comparable to or higher than reasoning as a whole. `initializing` reaches 14.99, `example-testing` 14.83, and `deduction` 15.29, all close to the pooled value of 15.37; `adding-knowledge` rises to 17.06, `backtracking` to 18.23, and `uncertainty-estimation` to 20.29. The most expansive geometry therefore appears not in setup or routine deduction, but in behaviors tied to revision, branching, and uncertainty management.
This changes the interpretation of reasoning. If global reasoning were a single broad process that strictly contained its parts, then each sub-behavior should have looked substantially lower-dimensional than the pooled manifold. It does not. Instead, the results suggest that reasoning is better described as a coordinated family of behavior-specific manifolds. The pooled reasoning state is not broader simply because it mixes more text; rather, it seems to reuse overlapping directions across behaviors. In that picture, deduction, initialization, and example testing occupy nearby or partially shared subspaces, while uncertainty estimation and backtracking open up additional directions associated with revising hypotheses, tracking alternatives, and recovering from mistaken partial states.
The paraphrase experiment points in the opposite direction. Using sentence-preserving 128-token source units, 200 paraphrases per unit, and default per-unit aggregation, the layer-17 paraphrase manifold has mean intrinsic dimension 5.17 with standard deviation 0.81 across 5 source units. This is far below the 15.37 observed for the pooled reasoning corpus. Meaning-preserving variation therefore occupies a much tighter region than cross-problem reasoning variation. In geometric terms, the model allows substantial movement when the chain of thought itself changes, but only a narrow band of movement when the underlying reasoning state is held fixed and only the phrasing is altered.
The most revealing comparison is between Experiments 4 and 5 under a shared representation:
- layer `17`
- `32`-token chunks
- max activation length `128`
- max pooling
- balanced classes with `1000` examples per class
- identical PCA grid and `25%` test split
Under this fair comparison, detecting the presence of the word `wait` is easier than separating reasoning from non-reasoning text at every tested PCA setting. This matters because it shows that the reasoning classifier is not recovering a cheap lexical marker: with the same architecture, same layer, same pooling rule, and same chunk size, a local lexical feature is decoded with fewer principal components and higher accuracy than the broader semantic distinction.
Layer-17 accuracies are:
| PCs | Experiment 4: Reasoning vs Non-Reasoning | Experiment 5: `wait` Presence |
|---|---|---|
| 1 | 0.592 | 0.608 |
| 2 | 0.760 | 0.852 |
| 3 | 0.822 | 0.854 |
| 4 | 0.826 | 0.888 |
| 5 | 0.840 | 0.908 |
| 6 | 0.834 | 0.908 |
| 7 | 0.854 | 0.910 |
| 8 | 0.856 | 0.920 |
| 12 | 0.880 | 0.930 |
| 16 | 0.882 | 0.930 |
| 24 | 0.922 | 0.946 |
| 32 | 0.934 | 0.954 |
| 64 | 0.940 | 0.962 |
| 128 | 0.922 | 0.954 |
Two consequences stand out. First, Experiment 5 has higher accuracy than Experiment 4 at every setting in the sweep, not merely at the optimum. Second, the route to strong performance is much shorter for the lexical task: to reach 90% accuracy, Experiment 4 needs 24 principal components, whereas Experiment 5 needs only 5. This gap is a compact way of saying that the geometry relevant to single-word detection is much more concentrated than the geometry relevant to reasoning-vs-answer discrimination.
That asymmetry is informative about language-model reasoning. If reasoning and non-reasoning were separated by a simple low-dimensional marker, then the reasoning classifier should have saturated at very small PCA budgets. It does not. Instead, the need for many more components suggests that reasoning status is distributed across a broader subspace, consistent with a representation that integrates discourse structure, latent problem state, and intermediate inferential commitments rather than a single trigger feature.
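The shared decoding pipeline behind Experiments 4 and 5 (PCA to a fixed budget of components, then a linear SVM) can be sketched as follows. Random data stands in for pooled layer-17 activations, and the planted mean shift is purely illustrative; none of this reproduces the reported numbers.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
# synthetic stand-in for max-pooled activations: 2000 texts x 128 dims
X = rng.standard_normal((2000, 128))
y = rng.integers(0, 2, size=2000)
X[y == 1, :3] += 3.0  # plant an illustrative class signal in 3 directions

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

accs = {}
for n_pcs in (1, 2, 4, 8, 16):
    # fit PCA on the training split only, then a linear classifier on top
    pca = PCA(n_components=n_pcs).fit(X_tr)
    clf = LinearSVC().fit(pca.transform(X_tr), y_tr)
    accs[n_pcs] = accuracy_score(y_te, clf.predict(pca.transform(X_te)))
    print(f"{n_pcs:3d} PCs: {accs[n_pcs]:.3f}")
```

Sweeping the component budget this way is what makes the "how many PCs to reach 90%" comparison between the two tasks meaningful.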
The completed runs summarized above used the following configurations.
Experiment 1 (pooled reasoning intrinsic dimension):

- dataset: `data/annotated_dataset.json`
- sources: `gemini`, `deepseek`
- units: merged annotated reasoning segments, balanced across categories
- sample limit: `1000`
- max units: `10000`
- minimum segment length: `8` tokens
- exact-text deduplication before pooling
- estimator: `MLE` with `n_neighbors=20`
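The unit-preparation steps in this configuration, exact-text deduplication followed by balanced sampling across categories, can be sketched as below. The `balanced_dedup` helper is hypothetical, not the repository's API.

```python
import random
from collections import defaultdict

def balanced_dedup(units, per_category, seed=0):
    """Exact-text deduplication, then balanced per-category sampling.
    `units` is an iterable of (category, text) pairs."""
    seen, by_cat = set(), defaultdict(list)
    for cat, text in units:
        if text not in seen:  # drop exact duplicates before pooling
            seen.add(text)
            by_cat[cat].append(text)
    rng = random.Random(seed)
    sample = {}
    for cat in sorted(by_cat):
        texts = by_cat[cat]
        sample[cat] = rng.sample(texts, min(per_category, len(texts)))
    return sample

units = [("deduction", "a"), ("deduction", "a"), ("deduction", "b"),
         ("backtracking", "c"), ("backtracking", "d"), ("backtracking", "e")]
picked = balanced_dedup(units, per_category=2)
print({k: len(v) for k, v in picked.items()})  # 2 units per category
```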
Experiment 2 (per-category intrinsic dimension):

- dataset: `data/annotated_dataset.json`
- sources: `gemini`, `deepseek`
- sample limit: `1000`
- max units per category: `10000`
- minimum segment length: `8` tokens
- exact-text deduplication within category
- estimator: `MLE` with `n_neighbors=20`
Experiment 3 (paraphrase manifold):

- dataset: `data/annotated_dataset.json`
- sources: `gemini`, `deepseek`
- paragraph bound: `128` tokens
- sample limit: `5`
- source units: `5`
- paraphrases per unit: `200`
- manifold mode: `per-unit`
- total texts evaluated: `983`
Experiment 4 (reasoning vs non-reasoning classification):

- dataset: `mattwesney/General_Inquiry_Thinking-Chain-Of-Thought`
- positive class: `meta.reasoning`
- negative class: `answer`
- chunk size: `32` tokens
- pooling: `max`
- activation max length: `128`
- sample limit: `8000`
- balanced dataset: `1000` texts per class
- PCA grid: `1, 2, 3, 4, 5, 6, 7, 8, 12, 16, 24, 32, 64, 128`
- train/test split: `1500 / 500`
Experiment 5 (`wait`-presence classification):

- dataset: `mattwesney/General_Inquiry_Thinking-Chain-Of-Thought`
- mixed sources: reasoning plus answer text
- target word: `wait`
- chunk size: `32` tokens
- pooling: `max`
- activation max length: `128`
- sample limit: `8000`
- mixed-source pool size: `100000`
- balanced dataset: `1000` positive and `1000` negative texts
- PCA grid: `1, 2, 3, 4, 5, 6, 7, 8, 12, 16, 24, 32, 64, 128`
- train/test split: `1500 / 500`
Each experiment run is stored under `results/<experiment_name>/<timestamp>/`. The JSON files there are the source of truth for configuration and metrics:
- `summary.json` for configuration and aggregate results
- `metrics.json` for PCA sweeps in Experiments 4 and 5
- `units.json`, `source_units.json`, `paraphrases.json`, or `texts.json` for the evaluated text units
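A minimal sketch for reading the most recent run's `summary.json`, assuming only the `results/<experiment_name>/<timestamp>/` layout described above; the experiment name in the demo is hypothetical, and the demo builds a throwaway tree rather than touching a real `results/` directory:

```python
import json
import tempfile
from pathlib import Path

def latest_summary(experiment: str, root="results") -> dict:
    """Load summary.json from the most recent timestamped run,
    assuming sortable timestamp directory names."""
    runs = sorted(p for p in Path(root, experiment).iterdir() if p.is_dir())
    return json.loads((runs[-1] / "summary.json").read_text())

# demo against a throwaway directory tree with two fake runs
root = Path(tempfile.mkdtemp())
for ts in ("2026-01-01_120000", "2026-01-02_090000"):
    run = root / "exp1_pooled_id" / ts  # hypothetical experiment name
    run.mkdir(parents=True)
    (run / "summary.json").write_text(json.dumps({"timestamp": ts}))
print(latest_summary("exp1_pooled_id", root)["timestamp"])  # 2026-01-02_090000
```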
Taken together, the five experiments suggest that reasoning in Gemma is geometrically structured at an intermediate scale. Across many problems, layer 17 supports a manifold of moderate intrinsic dimension. But that global manifold is not simply wider than its parts: several sub-behaviors are comparably dimensional, and some are clearly more expansive, especially uncertainty estimation and backtracking. Under paraphrase, the geometry contracts sharply. And when compared against a matched lexical decoding task, reasoning remains the more distributed property: the model exposes word presence in a few principal directions, but reasoning status is spread across many more.
This is a useful way to think about language-model reasoning. The activation geometry is neither flat noise nor a single compressed code. It looks more like a family of coordinated submanifolds: broad enough to encode diverse inferential states, partially shared across routine behaviors, and stretched furthest when the model has to manage uncertainty or revise its own trajectory. That is exactly the sort of structure one would expect from a system whose reasoning is organized around evolving internal states rather than around a fixed set of surface templates.
If you use this code or findings in your research, please cite:
```bibtex
@article{ma2026falsifying,
  title={{Falsifying Sparse Autoencoder Reasoning Features in Language Models}},
  author={Ma, George and Liang, Zhongyuan and Chen, Irene Y. and Sojoudi, Somayeh},
  journal={arXiv preprint arXiv:2601.05679},
  year={2026}
}
```