This project implements a custom MapReduce-based query execution engine and compares its performance with a traditional relational database (SQL Server).
The system executes the same SQL query across two fundamentally different paradigms:
- Custom MapReduce Engine (Python)
- SQL Server Execution Engine
It then collects performance metrics and visualizes them using an interactive dashboard.
To understand and demonstrate:
- How distributed data processing systems (like Hadoop) work internally
- How they compare with optimized relational databases
- The trade-offs in performance, scalability, and execution design
                 SQL Input
                     ↓
         Query Router / Controller
            ↓                  ↓
    MapReduce Engine       SQL Server
            ↓                  ↓
   Execution Pipeline   Optimized DB Execution
            ↓                  ↓
         Metrics + Comparison Layer
                     ↓
            Dashboard (Streamlit)
- Converts SQL queries into an Intermediate Representation (IR)
- Supports:
  - SELECT
  - WHERE
  - GROUP BY
  - Aggregations: COUNT, SUM, AVG
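To make the SQL-to-IR step concrete, here is a minimal sketch of a parser for the supported subset. The IR field names, the `parse_to_ir` function, and the restriction to a single numeric WHERE comparison are assumptions for illustration; the project's `sql_parser.py` may be structured differently.

```python
import re

# Hypothetical IR layout; the project's real IR may use different field names.
_QUERY_RE = re.compile(
    r"SELECT\s+(?P<select>.+?)\s+FROM\s+(?P<table>\w+)"
    r"(?:\s+WHERE\s+(?P<where>.+?))?"
    r"(?:\s+GROUP\s+BY\s+(?P<group>\w+))?\s*;?\s*$",
    re.IGNORECASE | re.DOTALL,
)
_AGG_RE = re.compile(r"(COUNT|SUM|AVG)\s*\(\s*(\*|\w+)\s*\)", re.IGNORECASE)
_COND_RE = re.compile(r"(\w+)\s*(>=|<=|!=|=|>|<)\s*(-?\d+(?:\.\d+)?)")


def parse_to_ir(sql: str) -> dict:
    """Parse a SELECT / WHERE / GROUP BY query into a plain-dict IR."""
    match = _QUERY_RE.match(sql.strip())
    if match is None:
        raise ValueError("unsupported query shape")

    columns, aggregates = [], []
    for item in match["select"].split(","):
        item = item.strip()
        agg = _AGG_RE.fullmatch(item)
        if agg:
            aggregates.append({"func": agg[1].upper(), "column": agg[2]})
        else:
            columns.append(item)

    where = None
    if match["where"]:
        # Assumption: only one "column <op> number" comparison is supported.
        cond = _COND_RE.fullmatch(match["where"].strip())
        if cond is None:
            raise ValueError("unsupported WHERE clause")
        where = {"column": cond[1], "op": cond[2], "value": float(cond[3])}

    return {
        "table": match["table"],
        "columns": columns,
        "aggregates": aggregates,
        "where": where,
        "group_by": match["group"],
    }
```

Keeping the IR as a plain dictionary means both the MapReduce engine and a SQL Server translation layer could consume the same structure without depending on the parser.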
Implements a full pipeline:
Split → Map → Shuffle → Reduce
- Partition-based processing
- Reducer distribution
- Simulates parallel data processing
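The four stages can be sketched in a few functions. This is a simplified, single-process illustration and not the project's actual code: the function names, the round-robin split, and the hash-based reducer assignment are all assumptions.

```python
from collections import defaultdict


def split(rows, num_partitions):
    """Split: deal rows round-robin into num_partitions chunks."""
    return [rows[i::num_partitions] for i in range(num_partitions)]


def map_phase(partition, key_col, value_col, predicate):
    """Map: apply the WHERE predicate and emit (group key, value) pairs."""
    return [(row[key_col], row[value_col]) for row in partition if predicate(row)]


def shuffle(mapped_partitions, num_reducers):
    """Shuffle: route every key to one reducer bucket and group its values."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for pairs in mapped_partitions:
        for key, value in pairs:
            buckets[hash(key) % num_reducers][key].append(value)
    return buckets


def reduce_phase(bucket):
    """Reduce: compute COUNT, SUM and a floating-point AVG per key."""
    return {
        key: {"COUNT": len(vals), "SUM": sum(vals), "AVG": sum(vals) / len(vals)}
        for key, vals in bucket.items()
    }


def run(rows, key_col, value_col, predicate, num_partitions=4, num_reducers=2):
    """Run Split -> Map -> Shuffle -> Reduce sequentially."""
    partitions = split(rows, num_partitions)
    # Mappers run one after another here; a real engine would run them in parallel.
    mapped = [map_phase(p, key_col, value_col, predicate) for p in partitions]
    result = {}
    for bucket in shuffle(mapped, num_reducers):
        result.update(reduce_phase(bucket))
    return result
```

Because each key lands in exactly one reducer bucket, the per-bucket results can be merged with a plain `update` without any conflict handling.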
- Executes the same query using SQL Server
- Uses pyodbc for database connectivity
- Includes a query translation layer for compatibility
Tracks:
- Execution time
- CPU usage
- System specifications
- Rows processed
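A minimal sketch of how such metrics could be collected around any engine call, using only the standard library. The `measure` helper and the metric key names are hypothetical, and CPU time stands in for the CPU usage figure; the project's `metrics.py` may rely on other tools.

```python
import os
import platform
import time


def measure(label, func, *args, **kwargs):
    """Run func and return (result, metrics) for one engine execution."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    result = func(*args, **kwargs)
    wall_elapsed = time.perf_counter() - wall_start
    cpu_elapsed = time.process_time() - cpu_start

    # Assumption: "rows processed" is approximated by the size of the result.
    try:
        rows = len(result)
    except TypeError:
        rows = None

    metrics = {
        "engine": label,
        "execution_time_s": wall_elapsed,
        # CPU time of this Python process only; work done inside an external
        # database server is not included in this figure.
        "cpu_time_s": cpu_elapsed,
        "rows_processed": rows,
        "system": {
            "platform": platform.platform(),
            "python": platform.python_version(),
            "cpu_count": os.cpu_count(),
        },
    }
    return result, metrics
```

Wrapping both engines with the same helper keeps the timing method identical, which matters more for a fair comparison than the absolute numbers.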
- Visual comparison of both engines
- Bar charts for execution time
- System & execution insights
SELECT Score, COUNT(*), SUM(Score), AVG(Score)
FROM data
WHERE Score > 1
GROUP BY Score;

MapReduce engine output:

Score    COUNT      SUM        AVG
--------------------------------------------------------
1        156804     428892     2.7352
2        89307      166023     1.8590
3        127920     217587     1.7010
4        241965     336402     1.3903
5        1089366    1824936    1.6752

SQL Server output:

Score    COUNT      SUM        AVG
--------------------------------------------------------
1        156804     428892     2
2        89307      166023     1
3        127920     217587     1
4        241965     336402     1
5        1089366    1824936    1
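SQL Server's AVG returns an integer when the aggregated column has an integer type, so the fractional part is truncated, while the custom MapReduce engine computes floating-point averages. Casting the column, for example AVG(CAST(Score AS FLOAT)), makes the two outputs directly comparable.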
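The difference between the two AVG columns comes down to one division. The sketch below reproduces both behaviours from the first result row; the function names are illustrative only.

```python
def avg_float(total, count):
    """Floating-point average, as the MapReduce engine reports it."""
    return total / count


def avg_truncated(total, count):
    """Integer average with the fraction dropped (non-negative totals only)."""
    return total // count


# First row of the results: SUM = 428892, COUNT = 156804.
float_avg = round(avg_float(428892, 156804), 4)
int_avg = avg_truncated(428892, 156804)
```

Here `float_avg` corresponds to the value in the MapReduce output and `int_avg` to the value in the SQL Server output.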
Miniature MAP-REDUCE/
│
├── Data/
│   ├── Processed Data/
│   └── Raw Data/
│
├── Data Processing Stage/
│   └── processor.ipynb
│
├── Results/
│   ├── metrics.json
│   └── report.txt
│
├── SQL/
│   ├── Queries.sql
│   └── test_cases.sql
│
├── Src/
│   │
│   ├── Analytics/
│   │   ├── __pycache__/
│   │   ├── dashboard.py
│   │   ├── metrics.py
│   │   └── report.py
│   │
│   ├── config/
│   │   ├── __pycache__/
│   │   ├── db_config.py
│   │   ├── paths.py
│   │   └── settings.py
│   │
│   ├── Execution_Engine/
│   │   ├── __pycache__/
│   │   ├── context.py
│   │   └── engine.py
│   │
│   ├── Map_Reduce/
│   │   ├── __pycache__/
│   │   ├── mapper.py
│   │   ├── reducer.py
│   │   ├── shuffle.py
│   │   └── splitter.py
│   │
│   ├── Optimizer/
│   │   ├── __pycache__/
│   │   └── optimizer.py
│   │
│   ├── Parser/
│   │   ├── __pycache__/
│   │   └── sql_parser.py
│   │
│   ├── SQL_Server/
│   │   ├── __pycache__/
│   │   ├── adapter.py
│   │   ├── connection.py
│   │   ├── executor.py
│   │   ├── loader.py
│   │   ├── metrics.py
│   │   └── translator.py
│   │
│   └── Utils/
├── bootstrap.py
└── main.py
git clone https://github.com/RiddhiDhara/Hadoop_Mini_Map_Reduce_Implementation
cd "Miniature MAP-REDUCE"
pip install -r requirements.txt
python bootstrap.py
cd Src
python main.py
streamlit run Analytics/dashboard.py

All configuration is centralized:
- Partitions
- Reducers
- Threads
- Engine toggles
- CSV path
- SQL query file
- Metrics & report paths
- SQL Server connection details
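One way to keep these settings in a single place is a frozen dataclass, sketched below. Every field name and default value here is an assumption for illustration (including the CSV file name, driver, server, and database); the actual files under `Src/config/` may be organized differently.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Settings:
    # MapReduce tuning (hypothetical defaults)
    num_partitions: int = 4
    num_reducers: int = 2
    num_threads: int = 4

    # Engine toggles
    run_mapreduce: bool = True
    run_sql_server: bool = True

    # File locations; "data.csv" is a placeholder name
    csv_path: Path = Path("Data/Processed Data/data.csv")
    query_file: Path = Path("SQL/Queries.sql")
    metrics_path: Path = Path("Results/metrics.json")
    report_path: Path = Path("Results/report.txt")

    # SQL Server connection details (placeholder values)
    odbc_driver: str = "ODBC Driver 17 for SQL Server"
    server: str = "localhost"
    database: str = "benchmark"

    def connection_string(self) -> str:
        """Build an ODBC connection string using Windows authentication."""
        return (
            f"DRIVER={{{self.odbc_driver}}};"
            f"SERVER={self.server};DATABASE={self.database};"
            "Trusted_Connection=yes;"
        )
```

A string built this way could be passed to `pyodbc.connect`, and freezing the dataclass prevents a run from changing its own configuration halfway through a benchmark.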
- Internals of MapReduce processing
- Query execution pipeline design
- Difference between:
  - Distributed systems
  - Relational databases
- Performance benchmarking techniques
- System modularization & architecture
- SQL Server is faster due to:
  - Query optimization
  - Execution planning
  - Internal indexing
- MapReduce provides:
  - Better conceptual scalability
  - Flexibility for distributed environments

Planned improvements:
- Support for JOIN operations
- Multi-column GROUP BY
- True parallelism using multiprocessing
- Cost-based query optimizer
- Query caching
- Cloud deployment (AWS / Azure)

Built with:
- Python
- Pandas
- SQL Server
- pyodbc
- Streamlit

This project demonstrates how the same query can be executed across two fundamentally different architectures and compares their performance using real metrics.
Riddhi Dhara
Software Engineer
Give it a ⭐ on GitHub and share your feedback!
