
🚀 Log Classification System – Hybrid AI Framework

A production-ready hybrid log classification system that intelligently combines Regex rules, Machine Learning (Sentence Transformers + Logistic Regression), and LLM fallback to classify system logs with high accuracy and adaptability.

Designed to handle:

  • ⚡ Simple structured logs
  • 🤖 Complex semantic logs
  • 🧠 Unseen or ambiguous patterns

🧠 Overview

Modern applications generate massive volumes of logs that are difficult to analyze manually. This system automates log classification using a multi-stage intelligent pipeline that dynamically selects the best method for each log.


πŸ—οΈ Architecture

⚡ This hybrid pipeline balances speed (Regex), accuracy (ML), and flexibility (LLM).

```mermaid
flowchart TD
    A[API Layer - FastAPI] --> B[Incoming Log]
    B --> C[Regex Engine]

    C -->|Regex Match| F1[Final Output]
    C -->|No Regex Match| D[ML Model\nSentence Transformer + LR]

    D -->|Confidence >=\n0.75| F1
    D -->|Confidence <\n0.75| E[LLM Classifier]

    E --> F1

    F1 --> F2[Label + Method + Confidence]
```

βš™οΈ Classification Strategy

🔹 1. Regex-Based Classification

  • Handles predictable log patterns
  • Ultra-fast pattern matching using predefined rules
  • Example:
    • "System reboot initiated" → System Notification
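
A minimal sketch of this stage, with illustrative patterns (the real rules live in processor_regex.py and will differ):

```python
import re

# Illustrative pattern-to-label rules; processor_regex.py defines its own set.
REGEX_RULES = {
    r"system reboot initiated": "System Notification",
    r"user .* logged (in|out)": "User Action",
    r"backup (started|completed)": "System Notification",
}

def regex_classify(log_message: str):
    """Return a label on the first matching rule, else None."""
    for pattern, label in REGEX_RULES.items():
        if re.search(pattern, log_message, flags=re.IGNORECASE):
            return label
    return None
```

Returning None on no match is what lets the pipeline fall through to the ML stage.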

🔹 2. ML-Based Classification

  • Uses embeddings from Sentence Transformers
  • Applies Logistic Regression for classification
  • Works best with sufficient labeled data
  • Returns:
    • Predicted label
    • Confidence score

🔹 3. LLM-Based Classification

  • Used when:
    • ML confidence is low
    • Log is complex or unseen
  • Uses LLM via Groq API
  • Ensures robustness for real-world logs
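
A sketch of what this fallback can look like with the official `groq` Python client. The category list, model name, and prompt wording here are assumptions for illustration; processor_llm.py defines its own:

```python
def build_prompt(log_message: str) -> str:
    """Construct a constrained classification prompt for the LLM."""
    return (
        "Classify the following log message into exactly one category "
        "(e.g. Security Alert, Workflow Error, System Notification, "
        "Resource Usage).\n"
        f"Log: {log_message}\n"
        "Answer with the category name only."
    )

def llm_classify(log_message: str, model: str = "llama-3.1-8b-instant") -> str:
    """Send the prompt to the Groq API and return the raw label text."""
    from groq import Groq  # lazy import; reads GROQ_API_KEY from the environment
    client = Groq()
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": build_prompt(log_message)}],
    )
    return response.choices[0].message.content.strip()
```

Constraining the answer to a category name keeps the response cheap to parse and comparable to the regex and ML labels.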

βš™οΈ Decision Logic

```python
# Pseudocode for the routing logic
label = regex_classify(log)
if label is not None:
    return label

label, prob = ml_classify(log)
if prob >= 0.75:
    return label

return llm_classify(log)
```
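
Run end-to-end, the same routing can be exercised with toy stand-ins for the three stages (the stubs below are illustrative, not the project's actual processors):

```python
# Toy stand-ins for processor_regex / processor_bert / processor_llm:
def regex_classify(msg):
    return "System Notification" if "reboot" in msg.lower() else None

def ml_classify(msg):
    # Pretend the model is confident about "error" logs only.
    return ("Workflow Error", 0.9) if "error" in msg.lower() else ("Unknown", 0.4)

def llm_classify(msg):
    return "Security Alert"

def classify_log(log_message, threshold=0.75):
    """Route a log through regex -> ML -> LLM with confidence gating."""
    label = regex_classify(log_message)
    if label is not None:
        return {"label": label, "method": "Regex", "confidence": 1.0}

    label, prob = ml_classify(log_message)
    if prob >= threshold:
        return {"label": label, "method": "ML", "confidence": prob}

    return {"label": llm_classify(log_message), "method": "LLM", "confidence": prob}
```

Each result carries the method that produced it, which is what the output CSV's `method_used` column reports.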

πŸ“ Project Structure

```
Log-Classification-System/
│
├── models/
│   └── log_classifier.joblib
│
├── resources/
│   ├── test.csv
│   └── output.csv
│
├── training/
│   ├── dataset/
│   └── log-classification.ipynb
│
├── classify.py
├── processor_regex.py
├── processor_bert.py
├── processor_llm.py
├── server.py
├── requirements.txt
└── .env
```

⚡ Features

  • Hybrid classification (Regex + ML + LLM)
  • Confidence-based intelligent routing
  • FastAPI-powered backend API
  • CSV upload & batch classification
  • Modular and scalable architecture
  • Model persistence using joblib
  • Handles real-world log patterns

πŸ› οΈ Tech Stack

  • Backend: FastAPI
  • ML: scikit-learn
  • Embeddings: SentenceTransformers
  • LLM: Groq API (LLaMA models)
  • Data: Pandas, NumPy

βš™οΈ Setup Instructions

1️⃣ Clone Repository

git clone https://github.com/SwedeshnaMishra/Log-Classification-System.git
cd Log-Classification-System

2️⃣ Install Dependencies

```bash
pip install -r requirements.txt
```

3️⃣ Set Up Environment Variables

Create a `.env` file:

```
GROQ_API_KEY=your_api_key_here
```

4️⃣ Run the Server

```bash
uvicorn server:app --reload
```

🌐 API Endpoints

| Endpoint | Method | Description |
|---|---|---|
| `/` | GET | Health check |
| `/classify/` | POST | Upload a CSV file for batch log classification |
| `/classify-single/` | POST | Classify a single log message |
| `/docs` | GET | Swagger UI |
| `/redoc` | GET | ReDoc API documentation |

📥 Input Format

The CSV must contain the following columns:

```csv
source,log_message
ModernCRM,User login failed
BillingSystem,Transaction timeout error
System,CPU usage exceeded threshold
```

📤 Output Format

```csv
source,log_message,target_label,method_used,confidence
ModernCRM,User login failed,Security Alert,ML,0.91
BillingSystem,Transaction timeout error,Workflow Error,Regex,0.99
System,CPU usage exceeded threshold,Resource Usage,LLM,0.87
```
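
The transformation from input rows to output rows can be sketched with the standard csv module; the `classify` stub below stands in for the real hybrid pipeline in classify.py:

```python
import csv
import io

def classify(msg):
    # Placeholder for the real regex -> ML -> LLM pipeline in classify.py.
    return ("Security Alert", "ML", 0.91)

raw = "source,log_message\nModernCRM,User login failed\n"
reader = csv.DictReader(io.StringIO(raw))

out = io.StringIO()
fields = ["source", "log_message", "target_label", "method_used", "confidence"]
writer = csv.DictWriter(out, fieldnames=fields)
writer.writeheader()
for row in reader:
    label, method, conf = classify(row["log_message"])
    writer.writerow({**row, "target_label": label,
                     "method_used": method, "confidence": conf})
result = out.getvalue()
```

The batch endpoint does the same thing per uploaded file, appending the three result columns to each input row.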

📊 Model Performance

  • Accuracy: ~99%
  • F1 Score: 0.98+
  • Dataset Size: 1900+ logs
  • Embedding Dimension: 384

🧪 Training Pipeline

Located in `training/log-classification.ipynb`.

Steps:

  • Load dataset
  • Generate embeddings using Sentence Transformers
  • Train Logistic Regression classifier
  • Evaluate model
  • Save model using joblib
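
The steps above can be condensed into a runnable sketch. Random 384-dimensional vectors stand in for SentenceTransformer embeddings here (384 matches models such as all-MiniLM-L6-v2, though the notebook's exact model choice is an assumption):

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in for SentenceTransformer embeddings: random 384-dim vectors,
# shifted so the two synthetic classes are linearly separable.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 384))
y = np.array([0] * 100 + [1] * 100)
X[y == 1] += 1.5

clf = LogisticRegression(max_iter=1000)
clf.fit(X, y)

# Persist and reload, mirroring models/log_classifier.joblib.
path = os.path.join(tempfile.gettempdir(), "log_classifier.joblib")
joblib.dump(clf, path)
loaded = joblib.load(path)
probs = loaded.predict_proba(X[:1])[0]  # per-class confidence used for routing
```

`predict_proba` is what supplies the confidence score the router compares against the 0.75 threshold.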

💡 Why a Hybrid Approach?

| Method | Strength | Limitation |
|---|---|---|
| Regex | Fast, deterministic | Limited flexibility |
| ML | Accurate, scalable | Needs labeled data |
| LLM | Flexible, intelligent | Higher latency & cost |

Combining all three ensures:

  • Speed ⚡
  • Accuracy 🎯
  • Robustness 🧠

🚀 Future Improvements

  • 📊 Streamlit dashboard for visualization
  • 📡 Real-time log streaming support
  • 🐳 Docker containerization
  • ☁️ Cloud deployment (AWS / Render)
  • 🔍 Explainable AI (prediction reasoning)

💼 Use Cases

  • DevOps monitoring
  • Security threat detection
  • System observability
  • Log anomaly detection
  • Automated incident classification

🤝 Contributing

If you want to contribute to this project, please follow these steps:

  • Fork the repository.
  • Create a new branch (`git checkout -b feature/your-feature-name`).
  • Make your changes and commit them (`git commit -m "Add some feature"`).
  • Push to the branch (`git push origin feature/your-feature-name`).
  • Open a pull request.

Project Maintainer

GitHub: Swedeshna Mishra
