Industry-ready ML system for detecting fraudulent job postings using classical ML and Transformer models.
Features a premium Next.js Dashboard for interactive analysis and real-time monitoring. Built with Python, scikit-learn, XGBoost, LightGBM, BERT (Transformers), FastAPI, and Next.js.
All 6 models were trained and evaluated on the HuggingFace Fake Job Posting dataset (17,880 records).
Evaluation was performed on a held-out 15% stratified test set.
| Model | Accuracy | Precision | Recall | F1 Score | ROC-AUC |
|---|---|---|---|---|---|
| Baseline (DummyClassifier) | 95.2% | — | — | — | 0.50 |
| Logistic Regression | 96.5% | 0.59 | 0.89 | 0.71 | 0.99 |
| Linear SVM | 98.2% | 0.80 | 0.83 | 0.82 | 0.98 |
| Random Forest | 97.7% | 0.99 | 0.53 | 0.69 | 0.98 |
| XGBoost | 97.8% | 0.78 | 0.77 | 0.77 | 0.98 |
| LightGBM | 98.1% | 0.86 | 0.74 | 0.79 | 0.98 |
- Best overall (F1): Linear SVM, 0.82 F1 at 98.2% accuracy
- Best recall (catch fraud): Logistic Regression, 0.89 recall (misses the fewest fake posts)
- Best precision (fewest false alarms): Random Forest, 0.99 precision
- Priority metrics: F1 score and recall, since minimizing missed fraud is critical
Fake-Job-Post-Prediction/
│
├── client/                      # Next.js Frontend Dashboard
│   ├── app/                     # App router pages (Console, Dataset, etc.)
│   ├── components/              # Premium UI components
│   └── public/                  # Static assets
│
├── data/
│   ├── raw/
│   │   └── huggingface_dataset/ # Cached raw dataset from HF
│   ├── processed/
│   │   ├── train.csv            # 70% stratified train split
│   │   ├── val.csv              # 15% validation split
│   │   └── test.csv             # 15% test split
│   └── external/                # Optional augmentation data
│
├── notebooks/
│   ├── 01_eda.ipynb             # Exploratory Data Analysis
│   ├── 02_preprocessing.ipynb
│   ├── 03_feature_engineering.ipynb
│   └── 04_baseline_models.ipynb
│
├── src/
│   ├── __init__.py
│   ├── config.py                # Centralized hyperparameters & paths
│   │
│   ├── data/
│   │   ├── dataset.py           # HuggingFace dataset loader + local cache
│   │   ├── preprocess.py        # HTML/emoji/URL removal, stopwords, fraud indicators
│   │   ├── split.py             # Stratified train/val/test splitting
│   │   └── augment.py           # SMOTE oversampling for class imbalance
│   │
│   ├── features/
│   │   ├── featurize.py         # TF-IDF + metadata ColumnTransformer
│   │   └── utils.py             # Feature utility functions
│   │
│   ├── models/
│   │   ├── baseline.py          # DummyClassifier (majority class)
│   │   ├── ml_models.py         # Model registry: LR, SVM, RF, XGBoost, LightGBM
│   │   └── transformer.py       # BERT fine-tuning wrapper (train/predict/save/load)
│   │
│   ├── training/
│   │   ├── train.py             # Main training script (--all, --smote, --full-features)
│   │   ├── evaluate.py          # Evaluation metrics (Accuracy, F1, ROC-AUC, PR-AUC)
│   │   └── callbacks.py         # Early stopping callback
│   │
│   ├── inference/
│   │   ├── predict.py           # Single + batch prediction with saved models
│   │   └── explain.py           # SHAP & LIME explainability
│   │
│   ├── api/
│   │   ├── app.py               # FastAPI app (4 endpoints)
│   │   └── schemas.py           # Pydantic request/response schemas
│   │
│   ├── utils/
│   │   ├── helpers.py           # Text combination, pattern matching
│   │   ├── metrics.py           # Comprehensive metric computation
│   │   └── logger.py            # Centralized logging
│   │
│   └── visualization/
│       └── plots.py             # Confusion matrix, ROC, PR curves, model comparison
│
├── models/                      # Saved model artifacts (.joblib)
│   ├── baseline.joblib
│   ├── logistic_regression.joblib
│   ├── svm.joblib
│   ├── random_forest.joblib
│   ├── xgboost.joblib
│   ├── lightgbm.joblib
│   └── comparison.csv           # Model comparison results
│
├── requirements.txt
├── Dockerfile
├── .gitignore
├── README.md
└── LICENSE
git clone https://github.com/ByteNinjaSmit/Fake-Job-Post-Prediction.git
cd Fake-Job-Post-Prediction
# Create virtual environment
python -m venv venv
# Activate (Windows)
.\venv\Scripts\activate
# Activate (Linux/Mac)
source venv/bin/activate
# Install dependencies
pip install -r requirements.txt

# Train a single model (with the virtual environment activated)
python src/training/train.py --model logistic_regression

# Train ALL models and generate the comparison table
python src/training/train.py --all

# Train with SMOTE oversampling (handles class imbalance)
python src/training/train.py --model xgboost --smote

# Train with full features (TF-IDF + metadata + engineered features)
python src/training/train.py --model xgboost --full-features

Available models: baseline, logistic_regression, svm, random_forest, xgboost, lightgbm
from src.inference.predict import Predictor
predictor = Predictor("logistic_regression")
result = predictor.predict_single(
"Earn $5000/week from home! No experience needed. Contact us on WhatsApp."
)
print(result)
# {'prediction': 'Fraudulent', 'label': 1, 'probability_fraudulent': 0.92, ...}

uvicorn src.api.app:app --reload

Then visit http://localhost:8000/docs for interactive Swagger documentation.
The project includes a premium, high-performance dashboard built with Next.js, Framer Motion, and Lucide React.
- Interactive Console: Real-time prediction gateway with high-fidelity UI.
- Batch Scan: Process bulk job data and visualize aggregate threat patterns.
- Deep Explain: Transparent AI with LIME-powered feature weight visualization.
- Corpus Explorer: Live dataset analytics, department density, and geographic threat nodes.
cd client
npm install
npm run dev

Then visit http://localhost:3000
| Route | Method | Description |
|---|---|---|
| /health | GET | Health check + model status |
| /predict | POST | Classify a single job posting |
| /batch | POST | Classify multiple job postings |
| /explain | POST | Classify + LIME feature explanation |
curl -X POST http://localhost:8000/predict \
-H "Content-Type: application/json" \
-d '{
"title": "Marketing Intern",
"description": "Earn money fast from home!",
"company_profile": "",
"requirements": "No experience needed"
}'

Response:

{
  "prediction": "Fraudulent",
  "confidence": 0.92,
  "fraudulent_score": 0.92
}

| Model | Library | Strategy |
|---|---|---|
| Logistic Regression | scikit-learn | class_weight='balanced', max_iter=1000 |
| Linear SVM | scikit-learn | class_weight='balanced' |
| Model | Library | Strategy |
|---|---|---|
| Random Forest | scikit-learn | 200 estimators, class_weight='balanced' |
| XGBoost | XGBoost | 200 estimators, scale_pos_weight=10 |
| LightGBM | LightGBM | 200 estimators, class_weight='balanced' |
| Model | Library | Strategy |
|---|---|---|
| BERT | Hugging Face Transformers | bert-base-uncased, lr=2e-5, 4 epochs, AdamW |
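As a rough illustration, the classical configurations in the tables above could be instantiated as below. This is a hedged sketch, not the project's actual registry in src/models/ml_models.py; the function name `build_models` is made up here, and the XGBoost/LightGBM entries are left as comments since they need their own libraries.

```python
# Illustrative sketch of the classical model configurations listed above.
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier

def build_models():
    """Return the scikit-learn models keyed by the CLI model names."""
    return {
        "logistic_regression": LogisticRegression(class_weight="balanced", max_iter=1000),
        "svm": LinearSVC(class_weight="balanced"),
        "random_forest": RandomForestClassifier(n_estimators=200, class_weight="balanced"),
        # "xgboost":  XGBClassifier(n_estimators=200, scale_pos_weight=10)
        # "lightgbm": LGBMClassifier(n_estimators=200, class_weight="balanced")
    }
```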
The dataset is highly imbalanced (~5% fraud). We address this through:

- Class weights: balanced weighting in all classical models
- Scale pos weight: XGBoost positive-class weighting (scale_pos_weight=10)
- SMOTE: synthetic minority oversampling (optional via the --smote flag)
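To make the two weighting schemes concrete, here is a small illustrative computation (not project code): scikit-learn's balanced class weights follow n_samples / (n_classes * class_count), while scale_pos_weight is conventionally n_negative / n_positive (the project pins it at 10 rather than computing it from the data).

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array([0] * 95 + [1] * 5)  # ~5% fraud, like the dataset

# class_weight='balanced' gives the minority class a proportionally larger weight
weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=y)
print(dict(zip([0, 1], weights)))  # class 1 gets weight 10.0 here

# XGBoost's scale_pos_weight is typically n_negative / n_positive
scale_pos_weight = (y == 0).sum() / (y == 1).sum()
print(scale_pos_weight)  # 19.0
```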
- TF-IDF vectors: up to 5,000 features, bigrams, sublinear TF
- Combined text from: title + company_profile + description + requirements + benefits
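The TF-IDF setup described above can be sketched as follows. This is a minimal stand-alone example; the real pipeline lives in src/features/featurize.py and additionally encodes metadata through a ColumnTransformer.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

texts = [
    "Earn money fast from home, no experience needed",
    "Senior software engineer with five years of Python experience",
]

vectorizer = TfidfVectorizer(
    max_features=5000,    # cap vocabulary at 5,000 features
    ngram_range=(1, 2),   # unigrams + bigrams
    sublinear_tf=True,    # 1 + log(tf) scaling
    stop_words="english",
)
X = vectorizer.fit_transform(texts)
print(X.shape)  # (2, n_features) with n_features <= 5000
```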
| Feature | Rationale |
|---|---|
| email_count | Fake posts often include personal emails |
| url_count | External link redirection |
| exclamation_count | Emotional manipulation ("Earn $$$!!!") |
| upper_ratio | ALL CAPS usage |
| word_count | Unusually short or long descriptions |
| company_profile_len | Fake companies have short/empty profiles |
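A hedged sketch of how the fraud-indicator features in the table above might be computed; the project's actual implementation lives in src/data/preprocess.py and may use different regexes and names.

```python
import re

def fraud_indicators(text: str, company_profile: str = "") -> dict:
    """Illustrative versions of the engineered features (not the project's code)."""
    words = text.split()
    return {
        "email_count": len(re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", text)),
        "url_count": len(re.findall(r"https?://\S+|www\.\S+", text)),
        "exclamation_count": text.count("!"),
        "upper_ratio": sum(w.isupper() for w in words) / max(len(words), 1),
        "word_count": len(words),
        "company_profile_len": len(company_profile),
    }

print(fraud_indicators("EARN $$$ FAST!!! Email us at jobs@scam.biz"))
```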
Categorical metadata: employment_type, required_experience, required_education, industry, function
Binary flags: telecommuting, has_company_logo, has_questions
| Metric | Description | Priority |
|---|---|---|
| F1 Score | Harmonic mean of precision & recall | ⭐ Primary |
| Recall | Fraction of actual fraud detected | ⭐ Primary |
| Precision | Fraction of predicted fraud that is real | Secondary |
| ROC-AUC | Overall discrimination ability | Secondary |
| PR-AUC | Precision-Recall area under curve | Secondary |
| Accuracy | Overall correctness | Baseline |
Priority: F1 and recall. In fraud detection, missing a fake job post (a false negative) is worse than a false alarm.
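These metrics can all be computed with scikit-learn; the project's own version is in src/utils/metrics.py. A small worked example on made-up labels:

```python
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score

y_true  = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred  = [0, 0, 0, 1, 1, 1, 1, 0]   # one false positive, one false negative
y_score = [0.1, 0.2, 0.3, 0.6, 0.7, 0.8, 0.9, 0.4]

print(recall_score(y_true, y_pred))     # 0.75: one fraud in four was missed
print(precision_score(y_true, y_pred))  # 0.75
print(f1_score(y_true, y_pred))         # 0.75
print(roc_auc_score(y_true, y_score))   # ranking quality of the scores
```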
HuggingFace Dataset (17,880 records)
        ↓
Text Cleaning (HTML, emoji, URL, stopword removal)
        ↓
Fraud Indicator Feature Engineering
        ↓
Stratified Split (70% train / 15% val / 15% test)
        ↓
TF-IDF Vectorization + Metadata Encoding
        ↓
Model Training & Evaluation
        ↓
Model Comparison Table (models/comparison.csv)
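The 70/15/15 stratified split step above can be sketched with two calls to train_test_split, each stratified so the ~5% fraud rate is preserved in every partition. This is an illustration on synthetic data, not the code in src/data/split.py.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(200).reshape(-1, 1)
y = np.array([0] * 190 + [1] * 10)  # ~5% positive class, like the real data

# First split off 30%, then halve it into validation and test sets
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.50, stratify=y_tmp, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 140 30 30
```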
# Start the API server
docker compose up api

# Access API
curl http://localhost:8000/health

# Run all model training inside a container
docker compose --profile train up trainer

Models and data are mounted as volumes, so trained models persist on your host machine.
# Start API + Prometheus + Grafana
docker compose --profile monitoring up

- API: http://localhost:8000
- Prometheus: http://localhost:9090
- Grafana: http://localhost:3000 (admin/admin)
# Build image
docker build -t fake-job-api .
# Run container
docker run -p 8000:8000 -v ./models:/app/models fake-job-api
# Access API
curl http://localhost:8000/health

| Service | Port | Profile | Description |
|---|---|---|---|
| api | 8000 | default | FastAPI prediction server |
| trainer | — | train | One-off model training |
| prometheus | 9090 | monitoring | Metrics collection |
| grafana | 3000 | monitoring | Dashboards |
LIME:
- Explains individual predictions by highlighting contributing words
- Integrated into the /explain API endpoint

SHAP:
- Global feature importance for ML models
- Available via src/inference/explain.py
| Category | Libraries |
|---|---|
| Data | pandas, numpy, datasets (HuggingFace) |
| ML | scikit-learn, XGBoost, LightGBM, imbalanced-learn |
| Deep Learning | PyTorch, Transformers (HuggingFace) |
| NLP | NLTK, BeautifulSoup4 |
| API | FastAPI, Uvicorn, Pydantic |
| Explainability | SHAP, LIME |
| Visualization | Matplotlib, Seaborn |
| Testing | pytest, httpx |
Source: victor/real-or-fake-fake-jobposting-prediction
| Field | Type | Description |
|---|---|---|
| title | text | Job title |
| company_profile | text | Company description |
| description | text | Job description |
| requirements | text | Job requirements |
| benefits | text | Job benefits |
| telecommuting | binary | Remote work flag |
| has_company_logo | binary | Logo presence |
| has_questions | binary | Screening questions |
| employment_type | categorical | Full-time, Part-time, etc. |
| required_experience | categorical | Entry, Mid, Senior, etc. |
| required_education | categorical | Bachelor's, Master's, etc. |
| industry | categorical | Industry sector |
| fraudulent | binary | Target: 0 (Real) / 1 (Fake) |
- ✅ Clean, documented codebase (30+ source files)
- ✅ Reproducible training scripts with CLI arguments
- ✅ 6 trained models with a comparison table
- ✅ Production-ready FastAPI service with 4 endpoints
- ✅ Premium Next.js frontend dashboard
- ✅ Batch Scan & Deep Explain (LIME) visualizations
- ✅ Live Corpus Explorer with geographic threat mapping
- ✅ Dockerized deployment (API + monitoring)
- ✅ Comprehensive README with results
This project is licensed under the MIT License β see LICENSE for details.