SampoNLP

Unsupervised Morpheme Discovery for Uralic Languages

SampoNLP is a high-performance library for unsupervised morpheme discovery from raw text corpora. It implements the Iterative Morpheme Decomposition with Positional Priors (IMDP) algorithm, specifically designed for morphologically rich languages such as Finnish, Estonian, and Hungarian.

The library uses a Rust-accelerated core for efficient computation, wrapped in a user-friendly Python API.

🌟 Features

✨ Unsupervised Learning: No annotated data required
🚀 High Performance: Rust-powered core with Python bindings via PyO3
🔬 Linguistically Motivated: Incorporates positional priors for roots vs. affixes
🌍 Multi-Language Support: Pre-configured for Finnish, Estonian, Hungarian, and general Uralic languages
📊 Automatic Thresholding: Uses Otsu's method for intelligent morpheme filtering
🔄 Iterative Refinement: Converges to stable morpheme representations

📦 Installation

From PyPI (recommended)

pip install samponlp

From source

git clone https://github.com/AragonerUA/samponlp.git
cd samponlp
pip install maturin
maturin develop --release

🚀 Quick Start

Basic Usage

from samponlp import MorphemeCleaner

# Initialize the cleaner for Estonian
cleaner = MorphemeCleaner(
    language='estonian',
    min_length=1,
    min_type_support=3,
    max_iterations=100,
    convergence_threshold=1e-7
)

# Process morphemes from a file
results = cleaner.process(
    raw_morphemes_path='data/estonian_morphemes.txt',
    output_dir='results/estonian_output'
)

print(f"Found {results.morpheme_count} atomic morphemes")
print(f"Discarded {len(results.discarded)} tokens")
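
The `max_iterations` and `convergence_threshold` arguments suggest a loop that stops once scores stop changing between passes. The sketch below illustrates that general pattern only. The `iterate_until_stable` helper, the `update` callback, and the max-absolute-change stopping rule are assumptions made for illustration, not SampoNLP's internals.

```python
from typing import Callable, Dict, Tuple

Scores = Dict[str, float]


def iterate_until_stable(
    scores: Scores,
    update: Callable[[Scores], Scores],
    max_iterations: int = 100,
    convergence_threshold: float = 1e-7,
) -> Tuple[Scores, int]:
    """Apply `update` until the largest per-key score change is below the threshold."""
    for iteration in range(1, max_iterations + 1):
        new_scores = update(scores)
        # Largest absolute change across all keys in this pass
        delta = max(abs(new_scores[key] - scores[key]) for key in scores)
        scores = new_scores
        if delta < convergence_threshold:
            return scores, iteration
    # Budget exhausted without meeting the threshold
    return scores, max_iterations


def halve_gap(scores: Scores) -> Scores:
    """Toy update rule (hypothetical): move every score halfway toward 1.0."""
    return {key: value + 0.5 * (1.0 - value) for key, value in scores.items()}


final_scores, iterations_used = iterate_until_stable({"maja": 0.0, "d": 0.5}, halve_gap)
print(iterations_used, final_scores)
```

A tighter `convergence_threshold` trades extra passes for more stable scores, while `max_iterations` caps the cost if the scores never settle.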

Analyzing Results

# Access cleaned morphemes
for morpheme in results.morphemes[:10]:
    print(morpheme)

# Check discarded tokens with reasons
for token, reason in results.discarded[:5]:
    print(f"{token}: {reason}")

# Examine final scores
print(results.final_scores['ház'])  # e.g. 0.334; 'ház' is a Hungarian key, so substitute a morpheme from your own run
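
To see which filter removes the most tokens, the `(token, reason)` pairs can be tallied by reason. This is a minimal sketch on hand-written sample data: the reason strings below are invented for illustration, and the real values in `results.discarded` may be worded differently.

```python
from collections import Counter

# Hypothetical sample in the (token, reason) shape shown above;
# these reason labels are assumptions, not SampoNLP's actual strings.
discarded = [
    ("xq", "invalid_alphabet"),
    ("123", "invalid_alphabet"),
    ("zz", "low_type_support"),
]

# Count how many tokens each reason accounts for
reason_counts = Counter(reason for _, reason in discarded)

for reason, count in reason_counts.most_common():
    print(f"{reason}: {count}")
```

With a real run, replace the sample list with `results.discarded`.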

📚 Supported Languages

SampoNLP comes with pre-configured settings for:

🇫🇮 Finnish (language='finnish')
🇪🇪 Estonian (language='estonian')
🇭🇺 Hungarian (language='hungarian')

Each language has customized:

Alphabet validation patterns
Single-character morpheme whitelists
Language-specific filtering rules
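
As a rough illustration of what alphabet validation and a single-character whitelist can look like, here is a minimal sketch. The character classes, the whitelist entries, and the `is_candidate` helper are all assumptions chosen for the example; the settings SampoNLP ships with may differ.

```python
import re

# Illustrative alphabets only (assumption): lowercase letters commonly used in each language.
ALPHABETS = {
    "finnish": re.compile(r"[a-zåäö]+"),
    "estonian": re.compile(r"[a-zšžõäöü]+"),
    "hungarian": re.compile(r"[a-záéíóöőúüű]+"),
}

# Hypothetical whitelists of one-letter morphemes, e.g. the Estonian plural marker -d.
SINGLE_CHAR_WHITELIST = {
    "finnish": {"a", "n", "t"},
    "estonian": {"d", "t"},
    "hungarian": {"a", "k", "t"},
}


def is_candidate(token: str, language: str) -> bool:
    """Keep a token only if it fits the alphabet and, when one letter long, is whitelisted."""
    token = token.lower()
    if not ALPHABETS[language].fullmatch(token):
        return False
    if len(token) == 1:
        return token in SINGLE_CHAR_WHITELIST[language]
    return True


print(is_candidate("ház", "hungarian"))  # accepted: all letters are in the Hungarian class
print(is_candidate("ház", "finnish"))    # rejected: 'á' is not in the Finnish class
print(is_candidate("x", "estonian"))     # rejected: single letter not on the whitelist
```

The whitelist matters because one-letter strings are usually noise, yet a few of them are genuine suffixes in these languages.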

🔬 Algorithm Overview

SampoNLP implements the IMDP (Iterative Morpheme Decomposition with Positional Priors) algorithm:

1. Initial Filtering: Removes noise based on alphabet, type-support, and heuristics
2. Iterative Scoring: Uses dynamic programming to find optimal morpheme decompositions
3. Positional Priors: Applies different rules for roots (can split anywhere) vs. affixes (edge-only splits)
4. Automatic Thresholding: Employs Otsu's method to separate atomic from composite morphemes
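
The dynamic-programming idea behind the scoring step can be sketched with a generic best-segmentation search. This is not the IMDP implementation: the `best_decomposition` helper, the log-score objective, and the toy scores are assumptions, and the root/affix positional rules are deliberately left out.

```python
import math
from typing import Dict, List, Optional, Tuple

Segmentation = Tuple[float, List[str]]


def best_decomposition(token: str, scores: Dict[str, float]) -> Optional[Segmentation]:
    """Return (log-score, pieces) for the best split of `token` into two or more known pieces."""
    n = len(token)
    # best[i] holds the best segmentation of token[:i], or None if none exists
    best: List[Optional[Segmentation]] = [None] * (n + 1)
    best[0] = (0.0, [])
    for end in range(1, n + 1):
        for start in range(end):
            piece = token[start:end]
            prefix = best[start]
            # Skip unreachable prefixes, the unsplit token itself, and unknown pieces
            if prefix is None or piece == token or scores.get(piece, 0.0) <= 0.0:
                continue
            candidate = prefix[0] + math.log(scores[piece])
            current = best[end]
            if current is None or candidate > current[0]:
                best[end] = (candidate, prefix[1] + [piece])
    return best[n]


# Toy scores (invented) for the Finnish word "talossani"
toy_scores = {"talo": 0.9, "ssa": 0.8, "ni": 0.7, "ta": 0.2, "lo": 0.1}
result = best_decomposition("talossani", toy_scores)
print(result)
```

A token with no valid split returns `None`, which is one plausible signal that it is atomic rather than composite.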

For a detailed description of the algorithm, see our paper (link coming soon).
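
The automatic thresholding step relies on Otsu's method, which picks the cut that maximizes the between-class variance of a set of values. The sketch below applies it directly to a small list of scores. The `otsu_threshold` helper and the sample scores are assumptions for illustration; the library may bin scores into a histogram first, and which side of the cut counts as atomic is not specified here.

```python
from typing import List


def otsu_threshold(values: List[float]) -> float:
    """Return the cut point that maximizes between-class variance of `values`."""
    xs = sorted(values)
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two values")
    total = sum(xs)
    best_var, best_cut = -1.0, xs[0]
    left_sum = 0.0
    for i in range(1, n):  # left class = xs[:i], right class = xs[i:]
        left_sum += xs[i - 1]
        if xs[i] == xs[i - 1]:
            continue  # no cut fits between two equal values
        w_left, w_right = i / n, (n - i) / n
        mean_left = left_sum / i
        mean_right = (total - left_sum) / (n - i)
        var_between = w_left * w_right * (mean_left - mean_right) ** 2
        if var_between > best_var:
            # Place the cut midway between the two neighbouring values
            best_var, best_cut = var_between, (xs[i - 1] + xs[i]) / 2
    return best_cut


# Invented scores forming two visible clusters
sample_scores = [0.05, 0.08, 0.10, 0.12, 0.80, 0.85, 0.90]
cut = otsu_threshold(sample_scores)
print(f"threshold = {cut:.2f}")
```

Because the threshold is derived from the score distribution itself, no per-language cutoff has to be tuned by hand.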

📖 Documentation

Contributing Guide - How to contribute

🛠️ Development

Building from Source

# Install Rust (if not already installed)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# Clone the repository
git clone https://github.com/AragonerUA/samponlp.git
cd samponlp

# Build with maturin
pip install maturin
maturin develop --release

# Run tests
pytest tests/

Running the Pipeline

python run_pipeline.py

📄 License

SampoNLP is released under the Apache 2.0 License.

🤝 Contributing

Contributions are welcome! Please see our Contributing Guide for details.

💖 Support

If you find SampoNLP useful, please consider:

⭐ Starring the repository
📢 Sharing it with colleagues
💬 Providing feedback via issues
🙏 Sponsoring the project

🙏 Acknowledgments

This project grew out of the need for morphological analysis tools in computational linguistics research on Uralic languages.

Made with ❤️ for the Uralic NLP community
