
N-Gram Language Model with Smoothing (Streamlit App)

This project implements a Bigram (2-gram) Language Model using Natural Language Processing. It supports two smoothing techniques, Laplace Smoothing and Good-Turing Smoothing, to estimate probabilities for seen and unseen word pairs. The model is interactive and runs in a Streamlit web app.

🚀 Features

  • 📝 Upload your own .txt corpus
  • 🧠 Generate bigrams and count their frequencies (see the sketch after this list)
  • 📊 Apply Laplace and Good-Turing smoothing
  • 🔍 Enter custom bigrams to check their probabilities
  • 📈 Compare probabilities side-by-side
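
Roughly, the bigram-counting step boils down to the sketch below. This is a minimal illustration assuming simple whitespace tokenization; `build_bigram_counts` is a hypothetical name, not the actual ngram_utils.py API.

```python
from collections import Counter

def build_bigram_counts(text):
    # Whitespace tokenization is an assumption; ngram_utils.py may
    # normalize and tokenize the corpus differently.
    tokens = text.lower().split()
    bigrams = list(zip(tokens, tokens[1:]))
    return Counter(tokens), Counter(bigrams)

unigram_counts, bigram_counts = build_bigram_counts(
    "language models help with speech recognition and deep learning"
)
print(bigram_counts.most_common(3))
```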

🛠️ Requirements

Install the required Python packages:

pip install -r requirements.txt
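
The exact dependency list ships with the repo; at a minimum it has to include Streamlit, so a minimal requirements.txt would contain at least:

```
streamlit
```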

▶️ How to Run

streamlit run app.py

📂 File Structure

ngram_language_model/
│
├── app.py                  # Main Streamlit application
├── ngram_utils.py          # Utility functions for preprocessing and probability calculation
├── requirements.txt        # Python dependencies
└── long_sample_corpus.txt  # Example text corpus

📚 Smoothing Techniques

  • Laplace Smoothing: adds 1 to every bigram count so that no bigram has zero probability.
  • Good-Turing Smoothing: re-estimates probabilities from frequency-of-frequency counts, reserving probability mass for unseen bigrams.
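
For a bigram (w1, w2), the two estimates can be sketched as follows. The function names are hypothetical, and the fallback used when no higher frequency-of-frequency count exists is a simplifying assumption; the app's actual logic lives in ngram_utils.py.

```python
from collections import Counter

def laplace_prob(w1, w2, unigram_counts, bigram_counts, vocab_size):
    # Add-one estimate: P(w2 | w1) = (C(w1, w2) + 1) / (C(w1) + V)
    return (bigram_counts[(w1, w2)] + 1) / (unigram_counts[w1] + vocab_size)

def good_turing_prob(w1, w2, bigram_counts):
    # N_c = number of distinct bigrams observed exactly c times
    freq_of_freq = Counter(bigram_counts.values())
    total = sum(bigram_counts.values())
    c = bigram_counts[(w1, w2)]
    if c == 0:
        # Total probability mass reserved for unseen bigrams: N_1 / N
        # (in practice this mass is shared across all unseen bigrams)
        return freq_of_freq[1] / total
    # Adjusted count c* = (c + 1) * N_{c+1} / N_c; falling back to the
    # raw count when N_{c+1} is zero is an assumption made here.
    c_star = (c + 1) * freq_of_freq[c + 1] / freq_of_freq[c] if freq_of_freq[c + 1] else c
    return c_star / total
```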

🧪 Example Test Bigrams

("language", "models") ("speech", "recognition") ("deep", "learning") (Unseen bigram)

🧠 Educational Goals

This project is great for:

  • Understanding the mechanics of N-gram models
  • Seeing the effects of smoothing
  • Hands-on exploration of NLP probability models
