RiddhiDhara/Hadoop_Mini_Map_Reduce_Implementation

🚀 Distributed Query Engine: MapReduce vs SQL Server


📌 Overview

This project implements a custom MapReduce-based query execution engine and compares its performance with a traditional relational database (SQL Server).

The system executes the same SQL query across two fundamentally different paradigms:

  • 🔹 Custom MapReduce Engine (Python)
  • 🔹 SQL Server Execution Engine

It then collects performance metrics and visualizes them using an interactive dashboard.


🎯 Objective

To understand and demonstrate:

  • How distributed data processing systems (like Hadoop) work internally
  • How they compare with optimized relational databases
  • The trade-offs in performance, scalability, and execution design

🏗️ System Architecture

🔹 High-Level Architecture

[Figure: Architecture]


🔹 Execution Flow

SQL Input
   ↓
Query Router / Controller
   ↓                 ↓
MapReduce Engine     SQL Server
   ↓                 ↓
Execution Pipeline   Optimized DB Execution
   ↓                 ↓
      Metrics + Comparison Layer
                ↓
         Dashboard (Streamlit)

⚙️ Core Features

🔹 1. SQL Parsing Engine

  • Converts SQL queries into an Intermediate Representation (IR)

  • Supports:

    • SELECT
    • WHERE
    • GROUP BY
    • Aggregations: COUNT, SUM, AVG
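
To make the idea of an Intermediate Representation concrete, here is a minimal sketch of a regex-based parser for the supported subset. The dict layout, the function name `parse_sql`, and the single numeric WHERE comparison are assumptions made for illustration; the project's own parser in Parser/sql_parser.py may be structured quite differently.

```python
import re

# Hypothetical IR layout; the project's real parser may use another shape.
QUERY_RE = re.compile(
    r"SELECT\s+(?P<select>.+?)\s+FROM\s+(?P<table>\w+)"
    r"(?:\s+WHERE\s+(?P<where>.+?))?"
    r"(?:\s+GROUP\s+BY\s+(?P<group>\w+))?\s*;?\s*$",
    re.IGNORECASE | re.DOTALL,
)
AGG_RE = re.compile(r"(COUNT|SUM|AVG)\s*\(\s*(\*|\w+)\s*\)", re.IGNORECASE)
COND_RE = re.compile(r"(\w+)\s*(>=|<=|!=|=|>|<)\s*(-?\d+(?:\.\d+)?)")


def parse_sql(sql):
    """Turn a single-table SELECT into a small dict-based IR."""
    match = QUERY_RE.match(sql.strip())
    if match is None:
        raise ValueError("unsupported query")

    # Separate plain columns from aggregate calls in the SELECT list.
    columns, aggregates = [], []
    for item in (part.strip() for part in match["select"].split(",")):
        agg = AGG_RE.fullmatch(item)
        if agg:
            aggregates.append((agg[1].upper(), agg[2]))
        else:
            columns.append(item)

    # Only one "column <op> number" comparison is understood in this sketch.
    where = None
    if match["where"]:
        cond = COND_RE.fullmatch(match["where"].strip())
        if cond is None:
            raise ValueError("unsupported WHERE clause")
        where = (cond[1], cond[2], float(cond[3]))

    return {
        "table": match["table"],
        "columns": columns,
        "aggregates": aggregates,
        "where": where,
        "group_by": match["group"],
    }
```

An IR of this kind lets the MapReduce pipeline work with the filter, the grouping key, and the aggregates without re-reading the SQL text.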

🔹 2. Custom MapReduce Engine

Implements a full pipeline:

Split → Map → Shuffle → Reduce

  • Partition-based processing
  • Reducer distribution
  • Simulates parallel data processing
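
The four stages can be sketched in a few lines of plain Python. This is an illustrative, sequential version under stated assumptions: the function names, the round-robin split, and the hash-based reducer assignment are hypothetical choices, not the code in Map_Reduce/. The sample rows are made up for the demonstration.

```python
from collections import defaultdict


def split(rows, partitions):
    """Deal rows round-robin into `partitions` chunks."""
    return [rows[i::partitions] for i in range(partitions)]


def map_phase(chunk, key, value, predicate):
    """Emit (group key, value) pairs for rows that pass the WHERE predicate."""
    return [(row[key], row[value]) for row in chunk if predicate(row)]


def shuffle(mapped_chunks, reducers):
    """Route every pair to a reducer bucket chosen by hashing its key."""
    buckets = [defaultdict(list) for _ in range(reducers)]
    for pairs in mapped_chunks:
        for k, v in pairs:
            buckets[hash(k) % reducers][k].append(v)
    return buckets


def reduce_phase(bucket):
    """Fold each key's values into COUNT, SUM and a floating-point AVG."""
    return {k: (len(vs), sum(vs), sum(vs) / len(vs)) for k, vs in bucket.items()}


def run(rows, key, value, predicate, partitions=2, reducers=2):
    """Run Split, Map, Shuffle and Reduce one after another."""
    mapped = [map_phase(c, key, value, predicate) for c in split(rows, partitions)]
    result = {}
    for bucket in shuffle(mapped, reducers):
        result.update(reduce_phase(bucket))
    return dict(sorted(result.items()))


# Made-up sample rows; the filter mirrors WHERE Score > 1.
rows = [{"Score": s} for s in (5, 1, 3, 5, 2, 3, 5)]
summary = run(rows, "Score", "Score", lambda r: r["Score"] > 1)
```

Because all values for one key land in the same reducer bucket, each reducer can finish its groups without consulting the others, which is what makes the model easy to distribute.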

🔹 3. SQL Server Integration

  • Executes the same query using SQL Server
  • Uses pyodbc for database connectivity
  • Includes a query translation layer for compatibility
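
A connection through pyodbc might look like the sketch below. The helper names, the driver name "ODBC Driver 17 for SQL Server", and the use of Windows authentication are assumptions for illustration; the real details belong in config/db_config.py and may differ.

```python
def build_connection_string(server, database, driver="ODBC Driver 17 for SQL Server"):
    """Assemble an ODBC connection string that uses Windows authentication."""
    return (
        f"DRIVER={{{driver}}};SERVER={server};DATABASE={database};"
        "Trusted_Connection=yes;"
    )


def run_query(sql, server, database):
    """Execute `sql` on SQL Server and return all rows as a list of tuples."""
    import pyodbc  # imported lazily so the helper above works without the driver

    connection = pyodbc.connect(build_connection_string(server, database))
    try:
        cursor = connection.cursor()
        cursor.execute(sql)
        return [tuple(row) for row in cursor.fetchall()]
    finally:
        # Always release the connection, even if the query fails.
        connection.close()
```

Keeping the connection string in its own function makes it possible to check the configuration without a running database.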

🔹 4. Metrics & Benchmarking System

Tracks:

  • ⏱ Execution time
  • ⚡ CPU usage
  • 🧠 System specifications
  • 📊 Rows processed
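
One way to gather these numbers with only the standard library is sketched here. The function name `measure` and the dictionary keys are hypothetical, and `time.process_time` is used as a simple stand-in for CPU usage; the project's own metrics code may rely on other tools.

```python
import os
import platform
import time


def measure(engine_name, func, *args):
    """Run func(*args) and record wall time, CPU time and rows returned."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    rows = func(*args)
    return {
        "engine": engine_name,
        "execution_time_s": time.perf_counter() - wall_start,
        "cpu_time_s": time.process_time() - cpu_start,
        "rows_processed": len(rows),
        "system": {
            "os": platform.system(),
            "machine": platform.machine(),
            "logical_cpus": os.cpu_count(),
            "python": platform.python_version(),
        },
    }
```

Note that CPU time measured this way covers only the Python process. For SQL Server the query work happens inside the database server, so a client-side CPU figure would understate it.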

🔹 5. Interactive Dashboard (Streamlit)

  • Visual comparison of both engines
  • Bar charts for execution time
  • System & execution insights
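
A bar chart of execution times could be drawn roughly as follows. The layout of the metrics dictionary and the function names are assumptions, not the contents of Analytics/dashboard.py or Results/metrics.json.

```python
def chart_rows(metrics):
    """Flatten {engine: {"execution_time": seconds}} into engine -> seconds."""
    return {engine: float(m["execution_time"]) for engine, m in metrics.items()}


def render(metrics):
    """Draw one bar per engine; call this from a script run with Streamlit."""
    import pandas as pd      # imported lazily so chart_rows has no dependencies
    import streamlit as st

    rows = chart_rows(metrics)
    frame = pd.DataFrame(
        {"seconds": list(rows.values())}, index=list(rows.keys())
    )
    st.title("MapReduce vs SQL Server")
    st.bar_chart(frame)
```

Calling `render(...)` with the loaded metrics at the bottom of a script launched through `streamlit run` would display the chart.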

🧪 Sample Query

SELECT Score, COUNT(*), SUM(Score), AVG(Score)
FROM data
WHERE Score > 1
GROUP BY Score;

📊 Results

Both tables below include a Score = 1 group, and their SUM values are not COUNT × Score, so they appear to come from a variant of the sample query that filters and aggregates a column other than Score.

🔹 MapReduce Output

Score         COUNT         SUM           AVG
--------------------------------------------------------
1             156804        428892        2.7352
2             89307         166023        1.8590
3             127920        217587        1.7010
4             241965        336402        1.3903
5             1089366       1824936       1.6752

🔹 SQL Server Output

Score         COUNT         SUM           AVG
--------------------------------------------------------
1             156804        428892        2
2             89307         166023        1
3             127920        217587        1
4             241965        336402        1
5             1089366       1824936       1

🔹 Key Observation

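SQL Server's AVG returns an integer when its input column has an integer type, so the fractional part is truncated, while the custom MapReduce engine computes floating-point averages. Casting the input, for example AVG(CAST(Score AS FLOAT)), makes SQL Server return decimals as well.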
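
The gap between the two AVG columns can be reproduced from the COUNT and SUM values of the first row in the tables above. This is only an arithmetic illustration in Python, not code from either engine.

```python
# COUNT and SUM for the first group, copied from the result tables above.
count, total = 156804, 428892

truncated_avg = total // count        # integer result, fractional part dropped
float_avg = round(total / count, 4)   # floating-point result

print(truncated_avg, float_avg)  # prints: 2 2.7352
```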


📁 Project Structure

Miniature MAP-REDUCE/
│
├── Data/
│   ├── Processed Data/
│   └── Raw Data/
│
├── Data Processing Stage/
│   └── processor.ipynb
│
├── Results/
│   ├── metrics.json
│   └── report.txt
│
├── SQL/
│   ├── Queries.sql
│   └── test_cases.sql
│
├── Src/
│   │
│   ├── Analytics/
│   │   ├── __pycache__/
│   │   ├── dashboard.py
│   │   ├── metrics.py
│   │   └── report.py
│   │
│   ├── config/
│   │   ├── __pycache__/
│   │   ├── db_config.py
│   │   ├── paths.py
│   │   └── settings.py
│   │
│   ├── Execution_Engine/
│   │   ├── __pycache__/
│   │   ├── context.py
│   │   └── engine.py
│   │
│   ├── Map_Reduce/
│   │   ├── __pycache__/
│   │   ├── mapper.py
│   │   ├── reducer.py
│   │   ├── shuffle.py
│   │   └── splitter.py
│   │
│   ├── Optimizer/
│   │   ├── __pycache__/
│   │   └── optimizer.py
│   │
│   ├── Parser/
│   │   ├── __pycache__/
│   │   └── sql_parser.py
│   │
│   ├── SQL_Server/
│   │   ├── __pycache__/
│   │   ├── adapter.py
│   │   ├── connection.py
│   │   ├── executor.py
│   │   ├── loader.py
│   │   ├── metrics.py
│   │   └── translator.py
│   │
│   └── Utils/
│       ├── bootstrap.py
│       └── main.py

▶️ How to Run

1️⃣ Clone the repository

git clone https://github.com/RiddhiDhara/Hadoop_Mini_Map_Reduce_Implementation
cd Hadoop_Mini_Map_Reduce_Implementation

2️⃣ Install dependencies

pip install -r requirements.txt

3️⃣ SQL Server set-up (one time)

python bootstrap.py

4️⃣ Run the engine

cd Src
python main.py

5️⃣ Launch dashboard

streamlit run Analytics/dashboard.py

⚙️ Configuration

All configuration is centralized in three modules:

📁 config/settings.py

  • Partitions
  • Reducers
  • Threads
  • Engine toggles
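
A settings module of this kind could be as small as the sketch below. Every field name and default value here is hypothetical; the actual config/settings.py may expose different names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """Hypothetical field names; the real config/settings.py may use others."""

    partitions: int = 4          # number of input splits
    reducers: int = 2            # number of reducer buckets
    threads: int = 4             # worker threads for the map phase
    run_mapreduce: bool = True   # engine toggle
    run_sql_server: bool = True  # engine toggle

    def __post_init__(self):
        # Reject values that would leave a stage with no workers.
        for name in ("partitions", "reducers", "threads"):
            if getattr(self, name) < 1:
                raise ValueError(f"{name} must be at least 1")


SETTINGS = Settings()
```

Freezing the dataclass keeps the values read-only during a run, so both engines are benchmarked under the same configuration.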

📁 config/paths.py

  • CSV path
  • SQL query file
  • Metrics & report paths

📁 config/db_config.py

  • SQL Server connection details

🧠 Key Learnings

  • Internals of MapReduce processing

  • Query execution pipeline design

  • Differences between distributed processing systems and relational databases

  • Performance benchmarking techniques

  • System modularization & architecture


📈 Results & Insights

  • SQL Server is faster on this workload due to:

    • Query optimization
    • Execution planning
    • Internal indexing
  • The MapReduce model provides:

    • A design that scales out conceptually, since partitions and reducers can be spread across machines
    • Flexibility for distributed environments

🔮 Future Improvements

  • Support for JOIN operations
  • Multi-column GROUP BY
  • True parallelism using multiprocessing
  • Cost-based query optimizer
  • Query caching
  • Cloud deployment (AWS / Azure)

🧩 Tech Stack

  • Python
  • Pandas
  • SQL Server
  • pyodbc
  • Streamlit

💡 Key Highlight

This project demonstrates how the same query can be executed across two fundamentally different architectures and compares their performance using real metrics.


👤 Author

Riddhi Dhara
Software Engineer


⭐ If you found this useful

Give it a ⭐ on GitHub and share your feedback!
