ltpisme/Test-AutoML

Fact Checking Auto-ML

This project is a small AutoML-style NLP pipeline for binary fact checking. It trains classifiers that decide whether a claim is supported by a piece of evidence.

The current dataset is intentionally small and local. data.csv contains 30 rows with balanced labels:

  • 1: supported / true claim-evidence pair
  • 0: unsupported / false claim-evidence pair

Project Structure

.
├── data.csv              # Input dataset with claim, evidence, and label columns
├── features.py           # Text feature extractors
├── models.py             # Model factory for supported classifiers
├── run.py                # Training, Optuna search, evaluation, and experiment saving
├── requirements.txt      # Python dependencies
├── Makefile              # Setup and run shortcuts
└── experiments/          # Saved trial configs and F1 scores

Pipeline

run.py performs the full experiment workflow:

  1. Load data.csv.
  2. Split the data into train and test sets (80/20).
  3. Use Optuna to run 10 trials.
  4. For each trial, choose one feature extractor:
    • v1: TfidfVectorizer
    • v2: CountVectorizer with unigram and bigram features
  5. For each trial, choose one model:
    • logreg: LogisticRegression
    • rf: RandomForestClassifier
  6. Train the selected model and evaluate it with F1 score.
  7. Save each trial under experiments/exp_<trial_id>/.

Each experiment folder contains:

  • config.json: feature choice, model choice, and hyperparameters
  • score.json: F1 score for that trial
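A minimal sketch of how such a folder could be written and read back, using only the standard library. The helper names save_trial and best_trial and the "f1" key inside score.json are hypothetical; the actual key names used by run.py are not documented here.

```python
import json
import tempfile
from pathlib import Path


def save_trial(root, trial_id, config, f1):
    """Write config.json and score.json under <root>/exp_<trial_id>/."""
    exp_dir = Path(root) / f"exp_{trial_id}"
    exp_dir.mkdir(parents=True, exist_ok=True)
    (exp_dir / "config.json").write_text(json.dumps(config, indent=2))
    # Assumption: the score is stored under an "f1" key.
    (exp_dir / "score.json").write_text(json.dumps({"f1": f1}, indent=2))
    return exp_dir


def best_trial(root):
    """Return (folder name, F1) for the highest-scoring saved trial."""
    scores = {
        path.parent.name: json.loads(path.read_text())["f1"]
        for path in Path(root).glob("exp_*/score.json")
    }
    return max(scores.items(), key=lambda item: item[1])


# Demonstration in a temporary directory so nothing in experiments/ is touched.
with tempfile.TemporaryDirectory() as tmp:
    save_trial(tmp, 0, {"feature": "v1", "model": "logreg"}, 0.5)
    save_trial(tmp, 1, {"feature": "v2", "model": "rf"}, 0.75)
    saved_config = json.loads((Path(tmp) / "exp_0" / "config.json").read_text())
    best = best_trial(tmp)

print(best)
```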

The recorded experiments currently report an F1 score of 0.8 for all 10 trials.
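For reference, F1 is the harmonic mean of precision and recall. The sketch below computes it from raw counts. The counts are hypothetical and are not taken from the recorded runs; they only show one way a very small test set can yield 0.8.

```python
def f1_from_counts(tp, fp, fn):
    """Compute F1 from true positives, false positives, and false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 2 true positives, 1 false positive, 0 false negatives.
score = f1_from_counts(tp=2, fp=1, fn=0)
print(round(score, 3))
```

With only a handful of test rows, F1 can take just a few distinct values, so identical scores across trials do not necessarily mean the configurations behave identically.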

Setup

Create a virtual environment and install dependencies:

make setup

This runs:

python3 -m venv .venv
.venv/bin/pip install --upgrade pip
.venv/bin/pip install -r requirements.txt

Run

Run the AutoML search:

make run

Or run the script directly:

.venv/bin/python run.py

The script prints the label distribution, runs the Optuna study, saves trial outputs to experiments/, and prints the best parameter set and best F1 score.

Data Format

data.csv must contain these columns:

claim,evidence,label

Example:

"Paris is capital of France","Paris is the capital city of France",1
"Apple is a fruit","Apple Inc makes phones",0

The script expects at least two labels and at least two samples per class.
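These requirements can be checked before training. The following is a minimal sketch using only the standard library; validate_rows is a hypothetical helper, not a function from run.py, and the sample rows simply repeat the example above.

```python
import csv
import io
from collections import Counter

REQUIRED_COLUMNS = ("claim", "evidence", "label")


def validate_rows(csv_file):
    """Check columns and class counts; return the label distribution."""
    reader = csv.DictReader(csv_file)
    header = reader.fieldnames or []
    missing = [name for name in REQUIRED_COLUMNS if name not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    counts = Counter(row["label"] for row in reader)
    if len(counts) < 2:
        raise ValueError("need at least two distinct labels")
    too_small = [label for label, n in counts.items() if n < 2]
    if too_small:
        raise ValueError(f"labels with fewer than two samples: {too_small}")
    return dict(counts)


sample = io.StringIO(
    "claim,evidence,label\n"
    '"Paris is capital of France","Paris is the capital city of France",1\n'
    '"Apple is a fruit","Apple Inc makes phones",0\n'
    '"Paris is capital of France","Paris is the capital city of France",1\n'
    '"Apple is a fruit","Apple Inc makes phones",0\n'
)
label_counts = validate_rows(sample)
print(label_counts)
```

To check the real file, pass an open file handle instead, for example `validate_rows(open("data.csv", newline=""))`.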

Dependencies

Main libraries:

  • pandas
  • scikit-learn
  • optuna
  • numpy

mlflow and joblib are listed in requirements.txt, but the current scripts do not use them yet.

Notes

  • The dataset is very small: with 30 rows and an 80/20 split, the test set holds only 6 rows, so the recorded scores should be treated as a demonstration rather than a reliable benchmark.
  • train_test_split uses random_state=42, but it does not currently stratify by label.
  • New feature extractors can be added in features.py and registered in the FEATURES dictionary in run.py.
  • New models can be added in models.py and included in the Optuna objective in run.py.
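On the stratification note above: passing the labels through the stratify argument of train_test_split keeps the class balance the same in both splits. The sketch below uses 30 synthetic, balanced rows as a stand-in for data.csv; it illustrates the option and is not a change already made in run.py.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 30 balanced rows in data.csv.
texts = [f"pair {i}" for i in range(30)]
labels = [i % 2 for i in range(30)]

X_train, X_test, y_train, y_test = train_test_split(
    texts,
    labels,
    test_size=0.2,
    random_state=42,
    stratify=labels,  # preserve the 50/50 label ratio in both splits
)

test_counts = Counter(y_test)
print(len(X_train), len(X_test), dict(test_counts))
```

With stratification, the 6-row test split contains three rows of each class, which makes F1 on such a small set less sensitive to an unlucky draw.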
