Unsupervised Morpheme Discovery for Uralic Languages
SampoNLP is a high-performance library for unsupervised morpheme discovery from raw text corpora. It implements the Iterative Morpheme Decomposition with Positional Priors (IMDP) algorithm, specifically designed for morphologically rich languages such as Finnish, Estonian, and Hungarian.
The library uses a Rust-accelerated core for efficient computation, wrapped in a user-friendly Python API.
- Unsupervised Learning: No annotated data required
- High Performance: Rust-powered core with Python bindings via PyO3
- Linguistically Motivated: Incorporates positional priors for roots vs. affixes
- Multi-Language Support: Pre-configured for Finnish, Estonian, Hungarian, and general Uralic languages
- Automatic Thresholding: Uses Otsu's method for intelligent morpheme filtering
- Iterative Refinement: Converges to stable morpheme representations
# Install from PyPI
pip install samponlp

# Or build from source
git clone https://github.com/AragonerUA/samponlp.git
cd samponlp
pip install maturin
maturin develop --release

from samponlp import MorphemeCleaner
# Initialize the cleaner for Estonian
cleaner = MorphemeCleaner(
    language='estonian',
    min_length=1,
    min_type_support=3,
    max_iterations=100,
    convergence_threshold=1e-7
)
# Process morphemes from a file
results = cleaner.process(
    raw_morphemes_path='data/estonian_morphemes.txt',
    output_dir='results/estonian_output'
)
print(f"Found {results.morpheme_count} atomic morphemes")
print(f"Discarded {len(results.discarded)} tokens")

# Access cleaned morphemes
for morpheme in results.morphemes[:10]:
    print(morpheme)

# Check discarded tokens with reasons
for token, reason in results.discarded[:5]:
    print(f"{token}: {reason}")

# Examine final scores
print(results.final_scores['ház'])  # 0.334

SampoNLP comes with pre-configured settings for:
- Finnish (language='finnish')
- Estonian (language='estonian')
- Hungarian (language='hungarian')
Each language has customized:
- Alphabet validation patterns
- Single-character morpheme whitelists
- Language-specific filtering rules
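As a rough illustration of what such per-language settings might look like, the sketch below models them as a small dataclass. The `LanguageConfig` class, its fields, the alphabet string, and the whitelist entries are all hypothetical and chosen for illustration only; they are not SampoNLP's actual configuration.

```python
import re
from dataclasses import dataclass, field


@dataclass(frozen=True)
class LanguageConfig:
    """Hypothetical per-language settings (not SampoNLP's real config)."""
    name: str
    alphabet: str  # characters allowed inside a morpheme
    single_char_whitelist: frozenset = field(default_factory=frozenset)

    def is_valid(self, token: str, min_length: int = 1) -> bool:
        """Return True if the token passes the length and alphabet checks."""
        if len(token) < min_length:
            return False
        if re.fullmatch(f"[{re.escape(self.alphabet)}]+", token) is None:
            return False
        # Single characters are kept only if explicitly whitelisted.
        if len(token) == 1:
            return token in self.single_char_whitelist
        return True


# Illustrative values only; the real alphabet and whitelist may differ.
ESTONIAN = LanguageConfig(
    name="estonian",
    alphabet="abdeghijklmnoprstuvõäöüšž",
    single_char_whitelist=frozenset({"a", "e", "i", "t", "d"}),
)

candidates = ["maja", "d", "x", "maja2", "öö"]
kept = [tok for tok in candidates if ESTONIAN.is_valid(tok)]
print(kept)
```

Keeping the alphabet and the whitelist as data rather than code means a new language would only need a new `LanguageConfig` value, under the assumptions above.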
SampoNLP implements the IMDP (Iterative Morpheme Decomposition with Positional Priors) algorithm:
- Initial Filtering: Removes noise based on alphabet, type-support, and heuristics
- Iterative Scoring: Uses dynamic programming to find optimal morpheme decompositions
- Positional Priors: Applies different rules for roots (can split anywhere) vs. affixes (edge-only splits)
- Automatic Thresholding: Employs Otsu's method to separate atomic from composite morphemes
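To make the dynamic-programming step more concrete, here is a minimal sketch of a generic best-split search over a scored morpheme inventory. It is not the IMDP implementation: it leaves out the positional priors and the iterative re-scoring, and the function name `best_decomposition` and the toy weights are assumptions made for this example.

```python
import math


def best_decomposition(token, scores):
    """Split `token` into known morphemes, maximizing the sum of log scores.

    `scores` maps each candidate morpheme to a positive weight. Returns
    (parts, log_score), or (None, -inf) if no complete split exists.
    """
    n = len(token)
    # best[i] = best log score for token[:i]; back[i] = start of last piece.
    best = [-math.inf] * (n + 1)
    back = [-1] * (n + 1)
    best[0] = 0.0

    for end in range(1, n + 1):
        for start in range(end):
            piece = token[start:end]
            if piece not in scores or best[start] == -math.inf:
                continue
            cand = best[start] + math.log(scores[piece])
            if cand > best[end]:
                best[end] = cand
                back[end] = start

    if best[n] == -math.inf:
        return None, -math.inf

    # Walk the back-pointers to recover the pieces in order.
    parts, end = [], n
    while end > 0:
        start = back[end]
        parts.append(token[start:end])
        end = start
    return parts[::-1], best[n]


# Toy weights invented for illustration (Finnish "taloissa", "in the houses").
toy_scores = {"talo": 0.5, "i": 0.2, "ssa": 0.3, "taloi": 0.01, "issa": 0.02}
parts, score = best_decomposition("taloissa", toy_scores)
print(parts)
```

With these toy weights the split `talo` + `i` + `ssa` wins over `taloi` + `ssa` and `talo` + `issa`, because the product of its weights is the largest. A positional prior could be layered on top by restricting which `start` positions are allowed for a given piece type, though how IMDP does this is not specified here.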
For a detailed description of the algorithm, see our paper (link coming soon).
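Until the paper is available, the thresholding idea can be illustrated on its own. The sketch below applies Otsu's criterion (maximize the between-class variance) directly to a list of scores, without building a histogram. The function name `otsu_threshold`, the toy scores, and the assumption that higher scores mean "more atomic" are all illustrative and may not match what SampoNLP does internally.

```python
def otsu_threshold(values):
    """Return the cut that maximizes the between-class variance of `values`."""
    xs = sorted(values)
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two values")
    total = sum(xs)
    best_var, best_cut = -1.0, xs[0]
    left_sum = 0.0
    for k in range(1, n):
        left_sum += xs[k - 1]
        if xs[k] == xs[k - 1]:
            continue  # no cut is possible between equal values
        w0, w1 = k / n, (n - k) / n              # class weights
        mu0 = left_sum / k                       # mean of the lower class
        mu1 = (total - left_sum) / (n - k)       # mean of the upper class
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var = var_between
            best_cut = (xs[k - 1] + xs[k]) / 2   # midpoint between neighbours
    return best_cut


# Toy scores, invented for illustration; assumes higher means more atomic.
toy_scores = {
    "talo": 0.91, "ssa": 0.88, "i": 0.85,
    "taloissa": 0.12, "taloi": 0.15, "issa": 0.20,
}
cut = otsu_threshold(toy_scores.values())
atomic = sorted(m for m, s in toy_scores.items() if s >= cut)
print(cut, atomic)
```

Because the scores fall into two well-separated groups, the cut lands in the gap between 0.20 and 0.85, so no hand-tuned cutoff is needed for this toy data.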
- Contributing Guide: How to contribute
# Install Rust (if not already installed)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
# Clone the repository
git clone https://github.com/AragonerUA/samponlp.git
cd samponlp
# Build with maturin
pip install maturin
maturin develop --release
# Run tests
pytest tests/

python run_pipeline.py

SampoNLP is released under the Apache 2.0 License.
Contributions are welcome! Please see our Contributing Guide for details.
If you find SampoNLP useful, please consider:
- Starring the repository
- Sharing it with colleagues
- Providing feedback via issues
- Sponsoring the project
This project was inspired by morphological analysis needs in computational linguistics research for Uralic languages.
Made with ❤️ for the Uralic NLP community