
🚀 Log Classification System – Hybrid AI Framework

A production-ready hybrid log classification system that intelligently combines Regex rules, Machine Learning (Sentence Transformers + Logistic Regression), and LLM fallback to classify system logs with high accuracy and adaptability.

Designed to handle:

  • ⚡ Simple structured logs
  • 🤖 Complex semantic logs
  • 🧠 Unseen or ambiguous patterns

🧠 Overview

Modern applications generate massive volumes of logs that are difficult to analyze manually. This system automates log classification using a multi-stage intelligent pipeline that dynamically selects the best method for each log.


πŸ—οΈ Architecture

⚡ This hybrid pipeline balances speed (Regex), accuracy (ML), and flexibility (LLM).

```mermaid
flowchart TD
    A[API Layer - FastAPI] --> B[Incoming Log]
    B --> C[Regex Engine]

    C -->|Regex Match| F1[Final Output]
    C -->|No Regex Match| D[ML Model\nSentence Transformer + LR]

    D -->|Confidence >=\n0.75| F1
    D -->|Confidence <\n0.75| E[LLM Classifier]

    E --> F1

    F1 --> F2[Label + Method + Confidence]
```

βš™οΈ Classification Strategy

🔹 1. Regex-Based Classification

  • Handles predictable log patterns
  • Ultra-fast pattern matching using predefined rules
  • Example:
    • "System reboot initiated" → System Notification
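
A minimal sketch of this stage, with illustrative patterns (the real rules live in processor_regex.py and will differ):

```python
import re

# Illustrative pattern-to-label rules; processor_regex.py defines its own set.
REGEX_RULES = {
    r"system reboot initiated": "System Notification",
    r"user .* logged (in|out)": "User Action",
    r"backup (started|completed)": "System Notification",
}

def regex_classify(log_message: str):
    """Return a label on the first matching rule, else None."""
    for pattern, label in REGEX_RULES.items():
        if re.search(pattern, log_message, flags=re.IGNORECASE):
            return label
    return None
```

Returning None on no match is what lets the pipeline fall through to the ML stage.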

🔹 2. ML-Based Classification

  • Uses embeddings from Sentence Transformers
  • Applies Logistic Regression for classification
  • Works best with sufficient labeled data
  • Returns:
    • Predicted label
    • Confidence score

🔹 3. LLM-Based Classification

  • Used when:
    • ML confidence is low
    • Log is complex or unseen
  • Uses LLM via Groq API
  • Ensures robustness for real-world logs
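
A sketch of what this fallback can look like with the official `groq` Python client. The category list, model name, and prompt wording here are assumptions for illustration; processor_llm.py defines its own:

```python
def build_prompt(log_message: str) -> str:
    """Construct a constrained classification prompt for the LLM."""
    return (
        "Classify the following log message into exactly one category "
        "(e.g. Security Alert, Workflow Error, System Notification, "
        "Resource Usage).\n"
        f"Log: {log_message}\n"
        "Answer with the category name only."
    )

def llm_classify(log_message: str, model: str = "llama-3.1-8b-instant") -> str:
    """Send the prompt to the Groq API and return the raw label text."""
    from groq import Groq  # lazy import; reads GROQ_API_KEY from the environment
    client = Groq()
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": build_prompt(log_message)}],
    )
    return response.choices[0].message.content.strip()
```

Constraining the answer to a category name keeps the response cheap to parse and comparable to the regex and ML labels.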

βš™οΈ Decision Logic

```python
# Pseudocode for the routing logic
label = regex_classify(log)
if label is not None:
    return label

label, prob = ml_classify(log)
if prob >= 0.75:
    return label

return llm_classify(log)
```
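
Run end-to-end, the same routing can be exercised with toy stand-ins for the three stages (the stubs below are illustrative, not the project's actual processors):

```python
# Toy stand-ins for processor_regex / processor_bert / processor_llm:
def regex_classify(msg):
    return "System Notification" if "reboot" in msg.lower() else None

def ml_classify(msg):
    # Pretend the model is confident about "error" logs only.
    return ("Workflow Error", 0.9) if "error" in msg.lower() else ("Unknown", 0.4)

def llm_classify(msg):
    return "Security Alert"

def classify_log(log_message, threshold=0.75):
    """Route a log through regex -> ML -> LLM with confidence gating."""
    label = regex_classify(log_message)
    if label is not None:
        return {"label": label, "method": "Regex", "confidence": 1.0}

    label, prob = ml_classify(log_message)
    if prob >= threshold:
        return {"label": label, "method": "ML", "confidence": prob}

    return {"label": llm_classify(log_message), "method": "LLM", "confidence": prob}
```

Each result carries the method that produced it, which is what the output CSV's `method_used` column reports.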

πŸ“ Project Structure

```
Log-Classification-System/
│
├── models/
│   └── log_classifier.joblib
│
├── resources/
│   ├── test.csv
│   └── output.csv
│
├── training/
│   ├── dataset/
│   └── log-classification.ipynb
│
├── classify.py
├── processor_regex.py
├── processor_bert.py
├── processor_llm.py
├── server.py
├── requirements.txt
└── .env
```

⚡ Features

  • Hybrid classification (Regex + ML + LLM)
  • Confidence-based intelligent routing
  • FastAPI-powered backend API
  • CSV upload & batch classification
  • Modular and scalable architecture
  • Model persistence using joblib
  • Handles real-world log patterns

πŸ› οΈ Tech Stack

  • Backend: FastAPI
  • ML: scikit-learn
  • Embeddings: SentenceTransformers
  • LLM: Groq API (LLaMA models)
  • Data: Pandas, NumPy

βš™οΈ Setup Instructions

1️⃣ Clone Repository

git clone https://github.com/SwedeshnaMishra/Log-Classification-System.git
cd Log-Classification-System

2️⃣ Install Dependencies

```bash
pip install -r requirements.txt
```

3️⃣ Set Up Environment Variables

Create a `.env` file:

```
GROQ_API_KEY=your_api_key_here
```

4️⃣ Run the Server

```bash
uvicorn server:app --reload
```

🌐 API Endpoints

| Endpoint | Method | Description |
|---|---|---|
| `/` | GET | Health check |
| `/classify/` | POST | Upload a CSV file for batch log classification |
| `/classify-single/` | POST | Classify a single log message |
| `/docs` | GET | Swagger UI |
| `/redoc` | GET | ReDoc API documentation |

📥 Input Format

The CSV must contain the following columns:

```csv
source,log_message
ModernCRM,User login failed
BillingSystem,Transaction timeout error
System,CPU usage exceeded threshold
```

📤 Output Format

```csv
source,log_message,target_label,method_used,confidence
ModernCRM,User login failed,Security Alert,ML,0.91
BillingSystem,Transaction timeout error,Workflow Error,Regex,0.99
System,CPU usage exceeded threshold,Resource Usage,LLM,0.87
```
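
The transformation from input rows to output rows can be sketched with the standard csv module; the `classify` stub below stands in for the real hybrid pipeline in classify.py:

```python
import csv
import io

def classify(msg):
    # Placeholder for the real regex -> ML -> LLM pipeline in classify.py.
    return ("Security Alert", "ML", 0.91)

raw = "source,log_message\nModernCRM,User login failed\n"
reader = csv.DictReader(io.StringIO(raw))

out = io.StringIO()
fields = ["source", "log_message", "target_label", "method_used", "confidence"]
writer = csv.DictWriter(out, fieldnames=fields)
writer.writeheader()
for row in reader:
    label, method, conf = classify(row["log_message"])
    writer.writerow({**row, "target_label": label,
                     "method_used": method, "confidence": conf})
result = out.getvalue()
```

The batch endpoint does the same thing per uploaded file, appending the three result columns to each input row.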

📊 Model Performance

  • Accuracy: ~99%
  • F1 Score: 0.98+
  • Dataset Size: 1900+ logs
  • Embedding Dimension: 384

🧪 Training Pipeline

Located in `training/log-classification.ipynb`.

Steps:

  • Load dataset
  • Generate embeddings using Sentence Transformers
  • Train Logistic Regression classifier
  • Evaluate model
  • Save model using joblib
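
The steps above can be condensed into a runnable sketch. Random 384-dimensional vectors stand in for SentenceTransformer embeddings here (384 matches models such as all-MiniLM-L6-v2, though the notebook's exact model choice is an assumption):

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in for SentenceTransformer embeddings: random 384-dim vectors,
# shifted so the two synthetic classes are linearly separable.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 384))
y = np.array([0] * 100 + [1] * 100)
X[y == 1] += 1.5

clf = LogisticRegression(max_iter=1000)
clf.fit(X, y)

# Persist and reload, mirroring models/log_classifier.joblib.
path = os.path.join(tempfile.gettempdir(), "log_classifier.joblib")
joblib.dump(clf, path)
loaded = joblib.load(path)
probs = loaded.predict_proba(X[:1])[0]  # per-class confidence used for routing
```

`predict_proba` is what supplies the confidence score the router compares against the 0.75 threshold.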

💡 Why a Hybrid Approach?

| Method | Strength | Limitation |
|---|---|---|
| Regex | Fast, deterministic | Limited flexibility |
| ML | Accurate, scalable | Needs labeled data |
| LLM | Flexible, intelligent | Higher latency & cost |

Combining all three ensures:

  • Speed ⚡
  • Accuracy 🎯
  • Robustness 🧠

🚀 Future Improvements

  • 📊 Streamlit dashboard for visualization
  • 📡 Real-time log streaming support
  • 🐳 Docker containerization
  • ☁️ Cloud deployment (AWS / Render)
  • 🔍 Explainable AI (prediction reasoning)

💼 Use Cases

  • DevOps monitoring
  • Security threat detection
  • System observability
  • Log anomaly detection
  • Automated incident classification

🤝 Contributing

If you want to contribute to this project, please follow these steps:

  • Fork the repository.
  • Create a new branch (`git checkout -b feature/your-feature-name`).
  • Make your changes and commit them (`git commit -m "Add some feature"`).
  • Push to the branch (`git push origin feature/your-feature-name`).
  • Open a pull request.

Project Maintainer

GitHub: Swedeshna Mishra
