This project implements a Bigram (2-gram) Language Model using Natural Language Processing. It supports two smoothing techniques, Laplace Smoothing and Good-Turing Smoothing, to estimate probabilities for seen and unseen word pairs. The model is interactive and runs in a Streamlit web app.
## Features
- Upload your own `.txt` corpus
- Generate bigrams and count their frequencies (see the sketch after this list)
- Apply Laplace and Good-Turing smoothing
- Enter custom bigrams to check their probabilities
- Compare probabilities side-by-side
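For the curious, bigram counting itself boils down to pairing adjacent tokens. A minimal sketch (illustrative only; the app's actual preprocessing lives in `ngram_utils.py`):

```python
from collections import Counter

def count_bigrams(tokens):
    """Count adjacent word pairs, e.g. ['a', 'b', 'c'] -> {('a','b'): 1, ('b','c'): 1}."""
    return Counter(zip(tokens, tokens[1:]))

print(count_bigrams("language models and language models".split()))
# Counter({('language', 'models'): 2, ('models', 'and'): 1, ('and', 'language'): 1})
```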
## Getting Started

Install the required Python packages:

```bash
pip install -r requirements.txt
```

Then launch the Streamlit app:

```bash
streamlit run app.py
```
## Project Structure

```
ngram_language_model/
├── app.py                  # Main Streamlit application
├── ngram_utils.py          # Utility functions for preprocessing and probability calculation
├── requirements.txt        # Python dependencies
└── long_sample_corpus.txt  # Example text corpus
```
## Smoothing Techniques

- **Laplace Smoothing**: adds 1 to every bigram count so no bigram has zero probability: P(w2 | w1) = (C(w1, w2) + 1) / (C(w1) + V), where V is the vocabulary size.
- **Good-Turing Smoothing**: re-estimates a raw count c from the "frequency of frequencies": c* = (c + 1) * N(c+1) / N(c), where N(c) is the number of bigrams seen exactly c times; the probability mass N(1) / N is reserved for unseen bigrams.
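A minimal sketch of both estimators (illustrative; the repository's `ngram_utils.py` may implement them differently):

```python
from collections import Counter

def laplace_prob(bigram_counts, unigram_counts, vocab_size, w1, w2):
    """Add-one smoothed P(w2 | w1) = (C(w1, w2) + 1) / (C(w1) + V)."""
    return (bigram_counts[(w1, w2)] + 1) / (unigram_counts[w1] + vocab_size)

def good_turing_prob(bigram_counts, w1, w2):
    """Naive Good-Turing estimate of a bigram's probability mass.

    Uses c* = (c + 1) * N(c+1) / N(c); unseen bigrams share N(1) / N.
    Real implementations smooth N(c), since N(c+1) can be zero for large c.
    """
    total = sum(bigram_counts.values())            # N: total bigram tokens
    n_c = Counter(bigram_counts.values())          # N(c): bigrams seen exactly c times
    c = bigram_counts[(w1, w2)]
    if c == 0:
        return n_c[1] / total                      # mass reserved for unseen events
    return (c + 1) * n_c[c + 1] / (n_c[c] * total) # adjusted count c* over N
```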
("language", "models") ("speech", "recognition") ("deep", "learning") (Unseen bigram)
This project is great for:

- Understanding the mechanics of N-gram models
- Seeing the effects of smoothing
- Hands-on exploration of NLP probability models