README: Slot Labeling

Overview

This project implements and improves a slot labeling model, a crucial component of task-oriented dialogue systems such as virtual assistants (e.g., Siri, Alexa). The primary focus is to enhance the accuracy of slot labeling by experimenting with different approaches, including logistic regression and BIO tagging.

Project Structure

|-- data/
    |-- training_data.json   # Training data
    |-- validation_data.json # Validation data
    |-- test_data.json       # Test data
    |-- ontology.json        # Ontology describing intents and slots
|-- src/
    |-- label_slots.py       # Main executable script
    |-- utils.py             # Helper functions for preprocessing and evaluation
|-- README.md                # Documentation

Requirements

Python 3.x
spacy
scikit-learn

Setup Environment

conda create -n anlp_hw2 python=3.x
conda activate anlp_hw2
conda install -c conda-forge spacy
conda install scikit-learn
python -m spacy download en_core_web_sm

Running the Code

Baseline Model (Most Frequent Tag)

Run the baseline model that uses the most frequent tag for each word:

python label_slots.py

Logistic Regression Model with Word Embeddings

Run the logistic regression-based model:

python label_slots.py -m logistic_regression

BIO Tagging with Constraints

To enforce correct tagging structure using BIO constraints:

python label_slots.py -m logistic_regression -p bio_tags

Custom Model (with Feature Engineering)

To run your improved model after implementing additional features:

python label_slots.py -m my_model -p bio_tags

Explanation of Components

1. Baseline Model

Tags each word based on the most frequent label in the training data.
Utilizes the BIO tagging scheme.

2. Logistic Regression Model

Conditions the tag prediction on word embeddings instead of individual words.
Addresses issues with unseen words by leveraging semantic similarities in embeddings.

3. BIO Tagging

Enforces valid sequences (e.g., I-<tag> must follow B-<tag>).
Improves consistency of labeled spans.

4. Error Analysis and Feature Engineering

Identify common errors by examining mispredicted spans.
Experiment with linguistic features such as:
- Part-of-speech (POS) tags
- Dependency relations
- Neighboring word embeddings
Implement improvements in train_my_model and rerun tests.

Evaluation Metrics

Precision, Recall, F1-score: Evaluated for each slot type and overall.
Micro-averaged and macro-averaged F1-scores are reported for performance comparison.

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
assignment_questions		assignment_questions
data		data
output_files		output_files
report		report
src		src
test_model_files		test_model_files
tex		tex
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

README: Slot Labeling

Overview

Project Structure

Requirements

Setup Environment

Running the Code

Baseline Model (Most Frequent Tag)

Logistic Regression Model with Word Embeddings

BIO Tagging with Constraints

Custom Model (with Feature Engineering)

Explanation of Components

1. Baseline Model

2. Logistic Regression Model

3. BIO Tagging

4. Error Analysis and Feature Engineering

Evaluation Metrics

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

README: Slot Labeling

Overview

Project Structure

Requirements

Setup Environment

Running the Code

Baseline Model (Most Frequent Tag)

Logistic Regression Model with Word Embeddings

BIO Tagging with Constraints

Custom Model (with Feature Engineering)

Explanation of Components

1. Baseline Model

2. Logistic Regression Model

3. BIO Tagging

4. Error Analysis and Feature Engineering

Evaluation Metrics

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages