Morphological-Semantic Complexity Analysis

This is a study of Turkish, Russian, and English testing the Low-Entropy Conjecture using morphological analysis and semantic networks using Wikipedia articles.

Overview

In this project we are looking at whether languages balance morphological complexity with semantic organization (measured different ways at different stages of the research). We test three hypotheses on Russian, English and Turkish:

H1 (Trade-off): As a language’s inflectional system gets richer (more forms per lemma or higher morphological complexity), the semantic network density decreases.
H2 (Compensation): When a language has a higher I-complexity/entropy, meaning its morphology is less predictable, the semantic network shows higher clustering.
H3 (Dimensionality): Languages with more complex morphology need more dimensions to capture semantics, so their embeddings require more principal components to reach the same variance threshold.

Data Sources

Wikipedia Corpora (October 2018)

Required for Stage 1 preprocessing - download separately:

Turkish: trwiki-20181001-corpus.xml.bz2
Russian: ruwiki-20181001-corpus.xml.bz2
English: enwiki-20181001-corpus.xml.bz2

Download from: Linguatools Wikipedia Corpus

Create the following directory structure:

data/
└── raw_corpora/
    ├── trwiki-20181001-corpus.xml.bz2
    ├── ruwiki-20181001-corpus.xml.bz2
    └── enwiki-20181001-corpus.xml.bz2

The contents of the data/ folder are too large to push them to this repo, but can be reproduced using the notebooks. Here is how to use them:

git clone <repository-url>
cd morphological-semantic-complexity

pip install -r requirements.txt

# Download corpora (see Data Sources above)
# Run notebooks sequentially (01 -> 02 -> 03 -> 04)

Project Structure

.
├── LICENSE
├── [README.md](http://README.md)
├── data/
│   ├── processed/
│   │   ├── clean_text/
│   │   ├── lemmatized/
│   │   └── raw_text/
│   └── raw_corpora/
├── notebooks/
│   ├── 01_data_preprocessing.ipynb
│   ├── 02_morphological_baselines.ipynb
│   ├── 03_complexity_networks.ipynb
│   └── 04_correlation_analysis.ipynb
├── outputs/
│   ├── figures/
│   └── reports/
├── pearl_script.txt
├── requirements.txt
└── results/
    ├── phase1/
    │   ├── semantic_baseline/
    │   └── summary_*.csv
    ├── phase2/
    │   ├── comprehensive_analysis.*
    │   ├── *_metrics.csv
    │   └── temp_files/
    └── phase3/
        ├── embeddings/
        ├── graphs/
        │   ├── original/
        │   └── replicated/
        │       ├── English/
        │       ├── Russian/
        │       └── Turkish/
        ├── pca_analysis/
        ├── regression_models/
        └── visualizations/

Research flow

Stage 1: Data Preprocessing

At this stage we extracted and cleaned Wikipedia articles for English, Russian and Turkish (350k articles/language)
With an equal number of articles for each langauge we got the following token counts EN: 369M tokens, RU: 164M tokens, TR: 76M tokens
We then saved the clean corpora and Type-Token Ratios for each: (EN: 0.215, RU: 0.307, TR: 0.309).

Stage 2: Baseline Analysis

We built semantic networks of surface forms for each langague using cosine similarity treshold We took top 1000 most frequent words + similarity > 0.5
We ended up with large densities and avg degrees for English and Turkish EN 0.149, RU 0.710, TR 0.750; Avg degree: EN 148, RU 709, TR 749

Stage 3: Phase 2 - Enhanced Analysis

Here we took a more principled approach and used 10000 lemmas instead of 1000 surface forms and built kNN graphs with k=15. We also measured morphological complexity with better established metrics such as E/I-complexity.

Key Metrics

Language	E-complexity	I-complexity	Density	Clustering
English	1.34	3.45	0.0023	0.298
Russian	2.21	6.66	0.0024	0.349
Turkish	2.40	5.01	0.0026	0.306

This gave us much more realistic measurements.

Stage 4: Hypothesis Testing

After obtaining reliable metrics we could proceed with the hypotheses testing. Here is what we found:

H1 (Trade-off): ρ = +1.000, p < 0.001 -> rejected (positive not negative!)
H2 (Compensation): ρ = +1.000, p < 0.001 -> supprted (I-complexity predicts clustering -- our main finding)
H3 (Dimensionality): ρ = +0.500, p = 0.667 -> weak

Citation

Research conducted for Data Science for Linguists course, University of Tübingen, Summer 2025.

References

Ackerman, F., & Malouf, R. (2013). Morphological organization: The low entropy conjecture.
Cotterell et al. (2019). On the Complexity and Typology of Inflectional Morphological Systems.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Morphological-Semantic Complexity Analysis

Overview

Data Sources

Wikipedia Corpora (October 2018)

Project Structure

Research flow

Stage 1: Data Preprocessing

Stage 2: Baseline Analysis

Stage 3: Phase 2 - Enhanced Analysis

Key Metrics

Stage 4: Hypothesis Testing

Citation

References

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
notebooks		notebooks
outputs/figures		outputs/figures
results		results
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
pearl_script.txt		pearl_script.txt
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

Morphological-Semantic Complexity Analysis

Overview

Data Sources

Wikipedia Corpora (October 2018)

Project Structure

Research flow

Stage 1: Data Preprocessing

Stage 2: Baseline Analysis

Stage 3: Phase 2 - Enhanced Analysis

Key Metrics

Stage 4: Hypothesis Testing

Citation

References

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages