This repository studies the geometry of reasoning in google/gemma-3-4b-it by measuring intrinsic dimension in hidden-state space and by asking how many principal components are required to decode different properties from activations. Across the completed runs, layer 17 emerges as a useful middle-layer probe: it carries a relatively high-dimensional representation of open-ended reasoning, yet the geometry contracts sharply when semantic content is held fixed under paraphrase, and it separates lexical features far more efficiently than it separates reasoning from non-reasoning discourse.
The central picture is therefore not that reasoning lives on a single tiny manifold, nor that it simply fills the ambient residual space. Instead, reasoning appears to occupy a structured intermediate regime: broad enough to support multiple cognitive sub-behaviors, but still organized enough that meaning-preserving variation stays confined to a much smaller region. The new pooled-segment results sharpen that picture further: reasoning as a whole is not obviously more expansive than its constituent sub-behaviors, suggesting that the global reasoning manifold is assembled from several partially overlapping local modes rather than from a single uniformly broad cloud.
```shell
conda create -n rdim python=3.14 -y
conda activate rdim
pip install -e .
(cd data; bash download_dataset.sh)
```

The implementation uses `torch`, `transformers`, `scikit-dimension`, `scikit-learn`, `datasets`, `ijson`, `jaxtyping`, and `einops`. All experiment outputs are stored under `results/`.
Experiments 1 to 3 use `data/annotated_dataset.json`, which contains long reasoning traces from Gemini and DeepSeek together with sentence-level annotations. The six behavior labels are initializing, deduction, adding-knowledge, example-testing, uncertainty-estimation, and backtracking.
Experiments 4 and 5 use `mattwesney/General_Inquiry_Thinking-Chain-Of-Thought`. Reasoning text is drawn from `meta.reasoning`, non-reasoning text from `answer`, and `<think>`-style tags are stripped before chunking.
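The tag-stripping and chunking step can be sketched as below. This is a minimal illustration: the `strip_think_tags` and `chunk_words` helpers are hypothetical names, and whitespace tokens stand in for the model tokenizer that the real pipeline would use.

```python
import re

def strip_think_tags(text: str) -> str:
    """Remove <think>...</think> wrappers, keeping the inner text."""
    return re.sub(r"</?think>", "", text).strip()

def chunk_words(text: str, chunk_size: int = 32) -> list[str]:
    """Split text into fixed-size chunks. Whitespace tokens are used
    here for self-containment; the actual pipeline would chunk with
    the Gemma tokenizer instead."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)]

cleaned = strip_think_tags("<think>First, consider the premise.</think>")
print(cleaned)  # First, consider the premise.
print(len(chunk_words(" ".join(["tok"] * 70), 32)))  # 3 chunks: 32 + 32 + 6
```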
Intrinsic dimension is estimated with `skdim.id.MLE` on pooled residual activations. Experiments 1 and 2 now use the same merged annotated segments, filtered to a minimum length of 8 tokens; Experiment 1 pools them across categories with balanced sampling, while Experiment 2 estimates each category separately. Classification experiments apply PCA to pooled activations and train a `LinearSVC`. Multiple layers were evaluated, but this report focuses on layer 17.
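For intuition, the Levina-Bickel maximum-likelihood estimator that `skdim.id.MLE` implements can be sketched directly with scikit-learn's nearest-neighbor search. This is a simplified stand-in for the library call, not the repository's code, and it checks itself on data with a known 2-dimensional structure:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def mle_intrinsic_dim(X: np.ndarray, k: int = 20) -> float:
    """Levina-Bickel MLE of intrinsic dimension (simplified sketch
    of what skdim.id.MLE computes)."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dists, _ = nn.kneighbors(X)
    dists = dists[:, 1:]  # drop the self-distance in column 0
    # per-point estimate: inverse mean log-ratio of the k-th neighbor
    # distance to each closer neighbor distance
    log_ratios = np.log(dists[:, -1:] / dists[:, :-1])
    m = (k - 1) / log_ratios.sum(axis=1)
    return float(m.mean())

# sanity check: Gaussian points on a 2-D subspace of a 10-D ambient space
rng = np.random.default_rng(0)
Z = rng.standard_normal((2000, 2)) @ rng.standard_normal((2, 10))
print(mle_intrinsic_dim(Z))  # close to 2, the true intrinsic dimension
```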
At layer 17, pooled merged reasoning segments from the annotated corpus have intrinsic dimension 15.37 over 10,000 units. This is the baseline scale of the broader reasoning manifold: not so low as to look like a narrow code, but far smaller than the 2560-dimensional ambient residual space. Larger 20,000-unit pooled runs were unstable for MLE, so the stable 10,000-unit setting is the reported result.
When the same corpus is partitioned by reasoning behavior, intrinsic dimension is comparable to or higher than reasoning as a whole. `initializing` reaches 14.99, `example-testing` 14.83, and `deduction` 15.29, all close to the pooled value of 15.37; `adding-knowledge` rises to 17.06, `backtracking` to 18.23, and `uncertainty-estimation` to 20.29. The most expansive geometry therefore appears not in setup or routine deduction, but in behaviors tied to revision, branching, and uncertainty management.
This changes the interpretation of reasoning. If global reasoning were a single broad process that strictly contained its parts, then each sub-behavior should have looked substantially lower-dimensional than the pooled manifold. It does not. Instead, the results suggest that reasoning is better described as a coordinated family of behavior-specific manifolds. The pooled reasoning state is not broader simply because it mixes more text; rather, it seems to reuse overlapping directions across behaviors. In that picture, deduction, initialization, and example testing occupy nearby or partially shared subspaces, while uncertainty estimation and backtracking open up additional directions associated with revising hypotheses, tracking alternatives, and recovering from mistaken partial states.
The paraphrase experiment points in the opposite direction. Using sentence-preserving 128-token source units, 200 paraphrases per unit, and default per-unit aggregation, the layer-17 paraphrase manifold has mean intrinsic dimension 5.17 with standard deviation 0.81 across 5 source units. This is far below the 15.37 observed for the pooled reasoning corpus. Meaning-preserving variation therefore occupies a much tighter region than cross-problem reasoning variation. In geometric terms, the model allows substantial movement when the chain of thought itself changes, but only a narrow band of movement when the underlying reasoning state is held fixed and only the phrasing is altered.
The most revealing comparison is between Experiments 4 and 5 under a shared representation:
- layer `17`
- `32`-token chunks
- max activation length `128`
- max pooling
- balanced classes with `1000` examples per class
- identical PCA grid and `25%` test split
Under this fair comparison, detecting the presence of the word `wait` is easier than separating reasoning from non-reasoning text at every tested PCA setting. This matters because it shows that the reasoning classifier is not recovering a cheap lexical marker: with the same architecture, same layer, same pooling rule, and same chunk size, a local lexical feature is decoded with fewer principal components and higher accuracy than the broader semantic distinction.
Layer-17 accuracies are:
| PCs | Experiment 4: Reasoning vs Non-Reasoning | Experiment 5: `wait` Presence |
|---|---|---|
| 1 | 0.592 | 0.608 |
| 2 | 0.760 | 0.852 |
| 3 | 0.822 | 0.854 |
| 4 | 0.826 | 0.888 |
| 5 | 0.840 | 0.908 |
| 6 | 0.834 | 0.908 |
| 7 | 0.854 | 0.910 |
| 8 | 0.856 | 0.920 |
| 12 | 0.880 | 0.930 |
| 16 | 0.882 | 0.930 |
| 24 | 0.922 | 0.946 |
| 32 | 0.934 | 0.954 |
| 64 | 0.940 | 0.962 |
| 128 | 0.922 | 0.954 |
Two consequences stand out. First, Experiment 5 has higher accuracy than Experiment 4 at every setting in the sweep, not merely at the optimum. Second, the route to strong performance is much shorter for the lexical task: to reach 90% accuracy, Experiment 4 needs 24 principal components, whereas Experiment 5 needs only 5. This gap is a compact way of saying that the geometry relevant to single-word detection is much more concentrated than the geometry relevant to reasoning-vs-answer discrimination.
That asymmetry is informative about language-model reasoning. If reasoning and non-reasoning were separated by a simple low-dimensional marker, then the reasoning classifier should have saturated at very small PCA budgets. It does not. Instead, the need for many more components suggests that reasoning status is distributed across a broader subspace, consistent with a representation that integrates discourse structure, latent problem state, and intermediate inferential commitments rather than a single trigger feature.
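The shared decoding pipeline behind Experiments 4 and 5 (PCA to a fixed budget of components, then a linear SVM) can be sketched as follows. Random data stands in for pooled layer-17 activations, and the planted mean shift is purely illustrative; none of this reproduces the reported numbers.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
# synthetic stand-in for max-pooled activations: 2000 texts x 128 dims
X = rng.standard_normal((2000, 128))
y = rng.integers(0, 2, size=2000)
X[y == 1, :3] += 3.0  # plant an illustrative class signal in 3 directions

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

accs = {}
for n_pcs in (1, 2, 4, 8, 16):
    # fit PCA on the training split only, then a linear classifier on top
    pca = PCA(n_components=n_pcs).fit(X_tr)
    clf = LinearSVC().fit(pca.transform(X_tr), y_tr)
    accs[n_pcs] = accuracy_score(y_te, clf.predict(pca.transform(X_te)))
    print(f"{n_pcs:3d} PCs: {accs[n_pcs]:.3f}")
```

Sweeping the component budget this way is what makes the "how many PCs to reach 90%" comparison between the two tasks meaningful.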
The completed runs summarized above used the following configurations.
Experiment 1 (pooled reasoning intrinsic dimension):

- dataset: `data/annotated_dataset.json`
- sources: `gemini`, `deepseek`
- units: merged annotated reasoning segments, balanced across categories
- sample limit: `1000`
- max units: `10000`
- minimum segment length: `8` tokens
- exact-text deduplication before pooling
- estimator: `MLE` with `n_neighbors=20`
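The unit-preparation steps in this configuration, exact-text deduplication followed by balanced sampling across categories, can be sketched as below. The `balanced_dedup` helper is hypothetical, not the repository's API.

```python
import random
from collections import defaultdict

def balanced_dedup(units, per_category, seed=0):
    """Exact-text deduplication, then balanced per-category sampling.
    `units` is an iterable of (category, text) pairs."""
    seen, by_cat = set(), defaultdict(list)
    for cat, text in units:
        if text not in seen:  # drop exact duplicates before pooling
            seen.add(text)
            by_cat[cat].append(text)
    rng = random.Random(seed)
    sample = {}
    for cat in sorted(by_cat):
        texts = by_cat[cat]
        sample[cat] = rng.sample(texts, min(per_category, len(texts)))
    return sample

units = [("deduction", "a"), ("deduction", "a"), ("deduction", "b"),
         ("backtracking", "c"), ("backtracking", "d"), ("backtracking", "e")]
picked = balanced_dedup(units, per_category=2)
print({k: len(v) for k, v in picked.items()})  # 2 units per category
```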
Experiment 2 (per-category intrinsic dimension):

- dataset: `data/annotated_dataset.json`
- sources: `gemini`, `deepseek`
- sample limit: `1000`
- max units per category: `10000`
- minimum segment length: `8` tokens
- exact-text deduplication within category
- estimator: `MLE` with `n_neighbors=20`
Experiment 3 (paraphrase manifold):

- dataset: `data/annotated_dataset.json`
- sources: `gemini`, `deepseek`
- paragraph bound: `128` tokens
- sample limit: `5`
- source units: `5`
- paraphrases per unit: `200`
- manifold mode: `per-unit`
- total texts evaluated: `983`
Experiment 4 (reasoning vs non-reasoning classification):

- dataset: `mattwesney/General_Inquiry_Thinking-Chain-Of-Thought`
- positive class: `meta.reasoning`
- negative class: `answer`
- chunk size: `32` tokens
- pooling: `max`
- activation max length: `128`
- sample limit: `8000`
- balanced dataset: `1000` texts per class
- PCA grid: `1, 2, 3, 4, 5, 6, 7, 8, 12, 16, 24, 32, 64, 128`
- train/test split: `1500 / 500`
Experiment 5 (`wait`-presence classification):

- dataset: `mattwesney/General_Inquiry_Thinking-Chain-Of-Thought`
- mixed sources: reasoning plus answer text
- target word: `wait`
- chunk size: `32` tokens
- pooling: `max`
- activation max length: `128`
- sample limit: `8000`
- mixed-source pool size: `100000`
- balanced dataset: `1000` positive and `1000` negative texts
- PCA grid: `1, 2, 3, 4, 5, 6, 7, 8, 12, 16, 24, 32, 64, 128`
- train/test split: `1500 / 500`
Each experiment run is stored under `results/<experiment_name>/<timestamp>/`. The JSON files there are the source of truth for configuration and metrics:
- `summary.json` for configuration and aggregate results
- `metrics.json` for PCA sweeps in Experiments 4 and 5
- `units.json`, `source_units.json`, `paraphrases.json`, or `texts.json` for the evaluated text units
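A minimal sketch for reading the most recent run's `summary.json`, assuming only the `results/<experiment_name>/<timestamp>/` layout described above; the experiment name in the demo is hypothetical, and the demo builds a throwaway tree rather than touching a real `results/` directory:

```python
import json
import tempfile
from pathlib import Path

def latest_summary(experiment: str, root="results") -> dict:
    """Load summary.json from the most recent timestamped run,
    assuming sortable timestamp directory names."""
    runs = sorted(p for p in Path(root, experiment).iterdir() if p.is_dir())
    return json.loads((runs[-1] / "summary.json").read_text())

# demo against a throwaway directory tree with two fake runs
root = Path(tempfile.mkdtemp())
for ts in ("2026-01-01_120000", "2026-01-02_090000"):
    run = root / "exp1_pooled_id" / ts  # hypothetical experiment name
    run.mkdir(parents=True)
    (run / "summary.json").write_text(json.dumps({"timestamp": ts}))
print(latest_summary("exp1_pooled_id", root)["timestamp"])  # 2026-01-02_090000
```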
Taken together, the five experiments suggest that reasoning in Gemma is geometrically structured at an intermediate scale. Across many problems, layer 17 supports a manifold of moderate intrinsic dimension. But that global manifold is not simply wider than its parts: several sub-behaviors are comparably dimensional, and some are clearly more expansive, especially uncertainty estimation and backtracking. Under paraphrase, the geometry contracts sharply. And when compared against a matched lexical decoding task, reasoning remains the more distributed property: the model exposes word presence in a few principal directions, but reasoning status is spread across many more.
This is a useful way to think about language-model reasoning. The activation geometry is neither flat noise nor a single compressed code. It looks more like a family of coordinated submanifolds: broad enough to encode diverse inferential states, partially shared across routine behaviors, and stretched furthest when the model has to manage uncertainty or revise its own trajectory. That is exactly the sort of structure one would expect from a system whose reasoning is organized around evolving internal states rather than around a fixed set of surface templates.
If you use this code or findings in your research, please cite:
```bibtex
@article{ma2026falsifying,
  title={{Falsifying Sparse Autoencoder Reasoning Features in Language Models}},
  author={Ma, George and Liang, Zhongyuan and Chen, Irene Y. and Sojoudi, Somayeh},
  journal={arXiv preprint arXiv:2601.05679},
  year={2026}
}
```