SampoNLP

Unsupervised Morpheme Discovery for Uralic Languages

PyPI version Downloads License Python 3.8+

SampoNLP is a high-performance library for unsupervised morpheme discovery from raw text corpora. It implements the Iterative Morpheme Decomposition with Positional Priors (IMDP) algorithm, specifically designed for morphologically rich languages such as Finnish, Estonian, and Hungarian.

The library uses a Rust-accelerated core for efficient computation, wrapped in a user-friendly Python API.

🌟 Features

  • ✨ Unsupervised Learning: No annotated data required
  • 🚀 High Performance: Rust-powered core with Python bindings via PyO3
  • 🔬 Linguistically Motivated: Incorporates positional priors for roots vs. affixes
  • 🌍 Multi-Language Support: Pre-configured for Finnish, Estonian, Hungarian, and general Uralic languages
  • 📊 Automatic Thresholding: Uses Otsu's method for intelligent morpheme filtering
  • 🔄 Iterative Refinement: Converges to stable morpheme representations

📦 Installation

From PyPI (recommended)

pip install samponlp

From source

git clone https://github.com/AragonerUA/samponlp.git
cd samponlp
pip install maturin
maturin develop --release

🚀 Quick Start

Basic Usage

from samponlp import MorphemeCleaner

# Initialize the cleaner for Estonian
cleaner = MorphemeCleaner(
    language='estonian',
    min_length=1,
    min_type_support=3,
    max_iterations=100,
    convergence_threshold=1e-7
)

# Process morphemes from a file
results = cleaner.process(
    raw_morphemes_path='data/estonian_morphemes.txt',
    output_dir='results/estonian_output'
)

print(f"Found {results.morpheme_count} atomic morphemes")
print(f"Discarded {len(results.discarded)} tokens")
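The `max_iterations` and `convergence_threshold` arguments suggest a loop that stops once scores change by less than a tolerance. The sketch below shows that general pattern only. The `iterate_until_converged` helper and the toy `halfway_to_half` update are assumptions made for illustration; the actual IMDP update rule is not reproduced here.

```python
def iterate_until_converged(scores, update, max_iterations=100,
                            convergence_threshold=1e-7):
    """Apply `update` repeatedly until the largest score change is tiny.

    Returns (scores, iterations_used, converged).
    """
    for iteration in range(1, max_iterations + 1):
        new_scores = update(scores)
        # Largest absolute change of any single score in this round.
        delta = max(abs(new_scores[key] - scores[key]) for key in scores)
        scores = new_scores
        if delta < convergence_threshold:
            return scores, iteration, True
    return scores, max_iterations, False


def halfway_to_half(scores):
    """Toy update (hypothetical): move every score halfway towards 0.5."""
    return {key: 0.5 * (value + 0.5) for key, value in scores.items()}


final, used, converged = iterate_until_converged(
    {"a": 0.0, "b": 1.0}, halfway_to_half
)
print(used, converged)
```

A tighter threshold costs more iterations, and `max_iterations` caps the work if the scores never settle.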

Analyzing Results

# Access cleaned morphemes
for morpheme in results.morphemes[:10]:
    print(morpheme)

# Check discarded tokens with reasons
for token, reason in results.discarded[:5]:
    print(f"{token}: {reason}")

# Examine final scores
print(results.final_scores['ház'])  # 0.334

📚 Supported Languages

SampoNLP comes with pre-configured settings for:

  • 🇫🇮 Finnish (language='finnish')
  • 🇪🇪 Estonian (language='estonian')
  • 🇭🇺 Hungarian (language='hungarian')

Each language has customized:

  • Alphabet validation patterns
  • Single-character morpheme whitelists
  • Language-specific filtering rules
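As a rough illustration of how such per-language settings could drive filtering, the sketch below checks candidates against an alphabet pattern, a single-character whitelist, and a minimum type support. The `LanguageConfig` class, the character set and whitelist shown for Estonian, the reason strings, and the `filter_candidates` helper are all assumptions made for this example; they are not SampoNLP's actual configuration or API.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class LanguageConfig:
    """Hypothetical per-language settings (not SampoNLP's real config)."""
    alphabet: re.Pattern              # full-match pattern for valid tokens
    single_char_whitelist: frozenset  # one-letter morphemes that are kept


# Assumed values, chosen only for illustration.
ESTONIAN = LanguageConfig(
    alphabet=re.compile(r"[a-zäöüõšž]+"),
    single_char_whitelist=frozenset({"a", "e", "i", "u", "d", "t", "s"}),
)


def filter_candidates(type_support, config, min_type_support=3):
    """Split candidates into kept tokens and (token, reason) rejections.

    `type_support` maps each candidate to the number of distinct word
    types it was observed in.
    """
    kept, discarded = [], []
    for token, support in type_support.items():
        if not config.alphabet.fullmatch(token):
            discarded.append((token, "invalid_alphabet"))
        elif len(token) == 1 and token not in config.single_char_whitelist:
            discarded.append((token, "single_char_not_whitelisted"))
        elif support < min_type_support:
            discarded.append((token, "low_type_support"))
        else:
            kept.append(token)
    return kept, discarded


candidates = {"maja": 12, "des": 40, "x": 9, "d": 55, "k0d": 7, "lumi": 2}
kept, discarded = filter_candidates(candidates, ESTONIAN)
print(kept)
print(discarded)
```

Returning `(token, reason)` pairs mirrors the shape that `results.discarded` has in the Quick Start example.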

🔬 Algorithm Overview

SampoNLP implements the IMDP (Iterative Morpheme Decomposition with Positional Priors) algorithm:

  1. Initial Filtering: Removes noise based on alphabet, type-support, and heuristics
  2. Iterative Scoring: Uses dynamic programming to find optimal morpheme decompositions
  3. Positional Priors: Applies different rules for roots (can split anywhere) vs. affixes (edge-only splits)
  4. Automatic Thresholding: Employs Otsu's method to separate atomic from composite morphemes
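To make step 2 more concrete, here is a minimal sketch of a dynamic program that finds the highest-scoring way to split a token into smaller known morphemes. It is a generic segmentation DP, not the IMDP implementation: the `best_decomposition` function, the log-sum objective, and the example scores are assumptions, and the positional priors from step 3 are deliberately left out.

```python
import math


def best_decomposition(token, scores):
    """Best split of `token` into two or more known morphemes.

    `scores` maps morpheme -> score in (0, 1]. Returns (log_score, parts),
    or None when the token cannot be built from smaller known morphemes.
    """
    n = len(token)
    # best[i] holds the best (log_score, parts) covering token[:i].
    best = [None] * (n + 1)
    best[0] = (0.0, [])
    for end in range(1, n + 1):
        for start in range(end):
            if start == 0 and end == n:
                continue  # the whole token is not a decomposition of itself
            piece = token[start:end]
            if best[start] is None or piece not in scores:
                continue
            candidate = best[start][0] + math.log(scores[piece])
            if best[end] is None or candidate > best[end][0]:
                best[end] = (candidate, best[start][1] + [piece])
    return best[n]


# Made-up scores for illustration only.
scores = {"maja": 0.5, "de": 0.2, "s": 0.3, "des": 0.4, "majades": 0.1}
print(best_decomposition("majades", scores))
print(best_decomposition("maja", scores))
```

Under these toy scores `majades` splits as `maja` + `des`, while `maja` has no decomposition and would be a candidate atomic morpheme.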

For a detailed description of the algorithm, see our paper (link coming soon).
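Otsu's method, named in step 4, picks the cut that maximizes the between-class variance of a histogram. The pure-Python sketch below applies it to a one-dimensional list of scores. The `otsu_threshold` function, the bin count, and the sample scores are assumptions for illustration, and how SampoNLP bins its scores or which side of the cut it treats as atomic is not specified here.

```python
def otsu_threshold(values, bins=64):
    """Return the cut that maximizes between-class variance of `values`."""
    lo, hi = min(values), max(values)
    if lo == hi:
        return lo  # all values identical: nothing to separate
    width = (hi - lo) / bins
    hist = [0] * bins
    for value in values:
        index = min(int((value - lo) / width), bins - 1)
        hist[index] += 1
    centers = [lo + (i + 0.5) * width for i in range(bins)]
    total = len(values)
    total_sum = sum(c * h for c, h in zip(centers, hist))

    best_variance, best_cut = -1.0, lo
    weight_low, sum_low = 0, 0.0
    for i in range(bins - 1):
        weight_low += hist[i]
        sum_low += centers[i] * hist[i]
        weight_high = total - weight_low
        if weight_low == 0 or weight_high == 0:
            continue
        mean_low = sum_low / weight_low
        mean_high = (total_sum - sum_low) / weight_high
        # Between-class variance, up to a constant factor of 1 / total**2.
        variance = weight_low * weight_high * (mean_low - mean_high) ** 2
        if variance > best_variance:
            best_variance = variance
            best_cut = lo + (i + 1) * width  # upper edge of bin i
    return best_cut


# Made-up bimodal scores for illustration only.
sample = [0.05, 0.08, 0.10, 0.12, 0.15, 0.80, 0.85, 0.90, 0.92, 0.95]
cut = otsu_threshold(sample)
low_group = [s for s in sample if s < cut]
high_group = [s for s in sample if s >= cut]
print(cut, low_group, high_group)
```

Because the cut is derived from the score distribution itself, no threshold has to be tuned by hand for each language or corpus.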

📖 Documentation

🛠️ Development

Building from Source

# Install Rust (if not already installed)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# Clone the repository
git clone https://github.com/AragonerUA/samponlp.git
cd samponlp

# Build with maturin
pip install maturin
maturin develop --release

# Run tests
pytest tests/

Running the Pipeline

python run_pipeline.py

📄 License

SampoNLP is released under the Apache 2.0 License.

🤝 Contributing

Contributions are welcome! Please see our Contributing Guide for details.

💖 Support

If you find SampoNLP useful, please consider:

  • ⭐ Starring the repository
  • 📢 Sharing it with colleagues
  • 💬 Providing feedback via issues
  • 🙏 Sponsoring the project

🙏 Acknowledgments

This project grew out of the need for morphological analysis tools in computational linguistics research on Uralic languages.


Made with ❤️ for the Uralic NLP community

About

Unsupervised morphological discovery and tokenizer evaluation for Uralic languages
