The architecture diagram above illustrates the complete workflow of the Adaptive CSV Cache Engine.
The system begins by loading a raw CSV file and generating:
- file-level hashes
- row-level hashes
These hashes are compared against previously stored metadata to detect:
- updated rows
- new rows
- deleted rows
The Change Detector sends the computed change metrics to the Adaptive Decision Engine.
The Decision Engine dynamically decides whether to:
1. Run Incremental Update
OR
2. Run Full Recompute
based on:
- percentage of changed rows
- current adaptive threshold
- historical runtime performance
If incremental update is selected:
- only affected partitions are processed
- partitions are updated in parallel
- unnecessary recomputation is avoided
If full recompute is selected:
- all partitions are rebuilt
- metadata is refreshed completely
The Threshold Tuner adjusts the threshold based on real execution performance, allowing the system to adapt automatically to changing workloads.
All execution metadata is tracked using:
- run.json
- history.json
- performance.json
- partitions.json
- file_state.json
The Streamlit Dashboard provides:
- real-time monitoring
- execution analytics
- threshold trends
- performance comparison
- historical insights
This architecture transforms traditional CSV processing into an adaptive partition-aware incremental processing engine optimized for performance and scalability.
Adaptive CSV Cache Engine is a high-performance intelligent CSV processing system designed to optimize repeated CSV operations using:
- Incremental Updates
- Dynamic Threshold Tuning
- Partitioned Storage
- Parallel Partition Execution
- File Hashing
- Row Hashing
- Historical Analytics
- Performance Monitoring Dashboard
Instead of recomputing an entire CSV file after every small modification, the system intelligently detects changes and updates only the affected partitions.
Traditional CSV processing systems usually:
- Reprocess the entire file
- Consume unnecessary I/O
- Waste CPU resources
- Scale poorly for repeated small updates
This project solves that problem by introducing:
Adaptive Incremental CSV Processing
The system dynamically decides whether to:
1. Run Full Recompute
OR
2. Run Incremental Partition Updates
based on real runtime performance.
File-level hashing detects whether the CSV file changed.
Detects:
- Updated rows
- New rows
- Deleted rows
using row-level MD5 hashing.
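A minimal sketch of row-level change detection, assuming rows are keyed by an identifier column such as `PassengerId` and that stored hashes live in a plain dictionary. The function names `row_hash` and `detect_changes` are illustrative, not taken from the project source.

```python
import hashlib

def row_hash(row: dict) -> str:
    """Return the MD5 of a row's values, joined in sorted-column order."""
    joined = "|".join(str(row[col]) for col in sorted(row))
    return hashlib.md5(joined.encode("utf-8")).hexdigest()

def detect_changes(old_hashes: dict, rows: list, id_column: str = "PassengerId") -> dict:
    """Compare stored row hashes with freshly computed ones, keyed by row id."""
    new_hashes = {str(r[id_column]): row_hash(r) for r in rows}
    updated = sorted(k for k in new_hashes
                     if k in old_hashes and new_hashes[k] != old_hashes[k])
    added = sorted(k for k in new_hashes if k not in old_hashes)
    deleted = sorted(k for k in old_hashes if k not in new_hashes)
    return {"updated": updated, "new": added, "deleted": deleted, "hashes": new_hashes}

# Hypothetical data: row 2 is edited, row 3 is removed, row 4 is added.
previous = [{"PassengerId": 1, "Name": "A"},
            {"PassengerId": 2, "Name": "B"},
            {"PassengerId": 3, "Name": "C"}]
current = [{"PassengerId": 1, "Name": "A"},
           {"PassengerId": 2, "Name": "B2"},
           {"PassengerId": 4, "Name": "D"}]

stored = detect_changes({}, previous)["hashes"]
changes = detect_changes(stored, current)
print(changes["updated"], changes["new"], changes["deleted"])
# prints: ['2'] ['4'] ['3']
```

Joining values with `|` is a simplification; a value that itself contains `|` could collide with a different row, so a real implementation may prefer a safer serialization.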
Dynamically decides:
Incremental Update
OR
Full Recompute
based on:
- percentage of changed rows
- adaptive threshold
- historical performance
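The decision rule can be sketched as below. This assumes `rows_percentage_change` is a fraction between 0 and 1 (consistent with a default threshold of 0.2) and that a change ratio equal to the threshold still counts as incremental; both are assumptions. The function `decide` is a hypothetical name, and historical performance is not modelled here because it enters through the threshold itself.

```python
def decide(changed_rows: int, total_rows: int, current_threshold: float) -> dict:
    """Pick an execution mode from the fraction of changed rows."""
    # Assumption: an empty file is treated as fully changed.
    rows_percentage_change = changed_rows / total_rows if total_rows else 1.0
    if rows_percentage_change <= current_threshold:
        decision = "incremental_update"
    else:
        decision = "full_recompute"
    return {
        "decision": decision,
        "rows_percentage_change": rows_percentage_change,
        "threshold_used": current_threshold,
    }

print(decide(2, 100, 0.2)["decision"])   # prints: incremental_update
print(decide(50, 100, 0.2)["decision"])  # prints: full_recompute
```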
The threshold automatically adjusts based on actual execution performance.
Example:
- If incremental becomes slower, the threshold decreases.
- If incremental becomes faster, the threshold increases.
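One way to express this tuning rule, assuming "slower" and "faster" mean a comparison between the last incremental time and the last full recompute time, and that the threshold moves by `step_size` within `[min_threshold, max_threshold]`. The function `tune_threshold` is a hypothetical name; the config keys match the default `config.json` values listed in the reset instructions.

```python
from typing import Optional

def tune_threshold(config: dict,
                   incremental_time: Optional[float],
                   full_time: Optional[float]) -> float:
    """Nudge the threshold up or down by one step, clamped to its bounds."""
    threshold = config["current_threshold"]
    if incremental_time is None or full_time is None:
        return threshold  # not enough timing data to compare yet
    if incremental_time < full_time:
        threshold += config["step_size"]   # incremental is faster: allow more of it
    elif incremental_time > full_time:
        threshold -= config["step_size"]   # incremental is slower: allow less of it
    threshold = min(config["max_threshold"], max(config["min_threshold"], threshold))
    return round(threshold, 4)

CONFIG = {"current_threshold": 0.2, "min_threshold": 0.05,
          "max_threshold": 0.9, "step_size": 0.05}

print(tune_threshold(CONFIG, 0.084, 0.095))  # prints: 0.25
```

Clamping keeps a run of unusual timings from pushing the threshold to a value where one mode is never chosen again.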
The CSV is split into multiple partition files:
part_0.csv
part_1.csv
part_2.csv
...
using hash-based partitioning.
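Hash-based partitioning might look like the sketch below, assuming the partition index is derived from the MD5 of the row identifier and that there are four partitions. Both the partition count and the function names are illustrative; the project may derive the index differently.

```python
import hashlib

def partition_for(row_id, num_partitions: int = 4) -> int:
    """Map a row identifier to a stable partition index."""
    digest = hashlib.md5(str(row_id).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_partitions

def partition_file(row_id, num_partitions: int = 4) -> str:
    """Return the partition file name that holds the given row."""
    return f"part_{partition_for(row_id, num_partitions)}.csv"

# Changed row ids map to the set of partitions that need updating.
changed_ids = ["2", "3"]
affected = {partition_for(row_id) for row_id in changed_ids}
print(sorted(affected))
```

A digest is used instead of Python's built-in `hash()` because string hashes are randomized per process by default, which would send the same row to different partitions on different runs.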
Affected partitions are updated in parallel using:
ThreadPoolExecutor
which significantly improves performance for multi-partition workloads.
All executions are tracked using:
history.json
allowing trend analysis.
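Appending a run to the history file could be done roughly as follows. The `{"history": [...]}` layout follows the reset template for `history.json`; the helper name `append_history` and the way missing or empty files are handled are assumptions.

```python
import json
from pathlib import Path

def append_history(history_path: str, run_record: dict) -> int:
    """Append one run record to history.json and return the new run count."""
    path = Path(history_path)
    text = path.read_text().strip() if path.exists() else ""
    data = json.loads(text) if text else {"history": []}
    data.setdefault("history", []).append(run_record)
    path.write_text(json.dumps(data, indent=2))
    return len(data["history"])
```

A call such as `append_history("metadata/history.json", {"decision": "incremental_update", "execution_time": 0.084})` would then add one entry per run, which is what makes trend charts possible later.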
A professional Streamlit dashboard visualizes:
- execution metrics
- threshold evolution
- performance comparison
- partition metadata
- historical trends
- incremental vs full recompute comparison
flowchart TD
A[Raw CSV File] --> B[File Hashing]
B --> C[Row Hashing]
C --> D[Change Detector]
D --> E[Decision Engine]
E -->|Few Changes| F[Incremental Update]
E -->|Many Changes| G[Full Recompute]
F --> H[Find Affected Partitions]
H --> I[Parallel Partition Workers]
I --> J[Load Partition]
J --> K[Delete Rows]
K --> L[Update Rows]
L --> M[Add New Rows]
M --> N[Save Partition]
G --> O[Recreate All Partitions]
N --> P[Performance Tracker]
O --> P
P --> Q[Threshold Tuner]
Q --> R[Update Config]
P --> S[History Logger]
S --> T[Dashboard Analytics]
The system loads the raw CSV file.
An MD5 hash is generated for the entire file.
Purpose:
Detect whether the file changed.
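A small sketch of the file-level check, assuming the previous digest is available as a string (for example from `last_file_hash` in `file_state.json`). The function names are illustrative. Reading in chunks keeps memory use flat for large CSV files.

```python
import hashlib

def file_md5(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the MD5 hex digest of a file, read in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def file_changed(path: str, last_file_hash: str) -> bool:
    """True when the file's current digest differs from the stored one."""
    return file_md5(path) != last_file_hash
```

When `file_changed` returns `False`, the more expensive row-level comparison can be skipped entirely.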
Each row receives its own MD5 hash.
Purpose:
Detect row-level changes.
The engine identifies:
- updated rows
- new rows
- deleted rows
The engine calculates:
rows_percentage_change
and compares it with:
current_threshold
Then decides:
incremental_update
OR
full_recompute
During an incremental update, only affected partitions are updated.
During a full recompute, all partitions are recreated.
Affected partitions are processed in parallel.
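The per-partition steps from the flowchart (load, delete rows, update rows, add new rows, save) can be sketched as below. The project itself uses Pandas; this version uses the standard `csv` module to stay dependency-free, and `update_partition_file` is a hypothetical name. It assumes every changed row carries exactly the partition's columns.

```python
import csv

def update_partition_file(path: str, id_column: str,
                          deleted_ids: list, changed_rows: list) -> int:
    """Apply deletions, updates, and additions to one partition file."""
    # Load the partition, keyed by row identifier.
    with open(path, newline="") as fh:
        reader = csv.DictReader(fh)
        fieldnames = reader.fieldnames
        rows = {r[id_column]: r for r in reader}
    # Delete rows that no longer exist in the source file.
    for row_id in deleted_ids:
        rows.pop(str(row_id), None)
    # Overwrite updated rows in place and append new ones.
    for row in changed_rows:
        rows[str(row[id_column])] = row
    # Save the partition back to disk.
    with open(path, "w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows.values())
    return len(rows)
```

Because updated and new rows are both written by identifier, the same loop covers the "Update Rows" and "Add New Rows" steps.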
Execution timings are recorded.
The threshold adapts automatically based on actual execution performance.
All runs are stored in:
history.json
The dashboard displays:
- trends
- comparisons
- performance metrics
- analytics
csv-cache-system/
│
├── data/
│ ├── raw/
│ │ └── Titanic-Dataset.csv
│ │
│ └── processed/
│ └── partitions/
│ ├── part_0.csv
│ ├── part_1.csv
│ ├── part_2.csv
│ └── ...
│
├── metadata/
│ ├── config.json
│ ├── file_state.json
│ ├── history.json
│ ├── performance.json
│ ├── partitions.json
│ └── run.json
│
├── src/
│ ├── cache/
│ ├── dashboard/
│ ├── decision/
│ ├── hashing/
│ ├── pipeline/
│ ├── utils/
│ └── main.py
│
└── README.md
config.json stores:
- current threshold
- min threshold
- max threshold
- tuning step size
file_state.json stores:
- last file hash
- row hashes
- processed path
- identifier columns
performance.json stores:
- last incremental execution time
- last full recompute execution time
history.json stores the complete execution history.
partitions.json stores:
- partition metadata
- partition strategy
- partition folder path
- partition files
The engine uses:
ThreadPoolExecutor
to process partitions in parallel.
Example:
Affected Partitions:
{0, 2, 3}
Instead of:
Sequential:
part_0 → part_2 → part_3
it runs:
Parallel:
part_0
part_2
part_3
simultaneously.
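The parallel dispatch for the example above might be written as follows. The worker `update_partition` is a placeholder standing in for the real partition update, and the worker count is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor

def update_partition(partition_id: int) -> str:
    """Placeholder worker: the real one would load, modify, and save part_<id>.csv."""
    return f"Partition {partition_id} updated successfully."

affected_partitions = {0, 2, 3}

# One worker per affected partition; map() returns results in input order.
with ThreadPoolExecutor(max_workers=len(affected_partitions)) as pool:
    messages = list(pool.map(update_partition, sorted(affected_partitions)))

for message in messages:
    print(message)
# prints:
# Partition 0 updated successfully.
# Partition 2 updated successfully.
# Partition 3 updated successfully.
```

Threads help most when the work is dominated by file I/O; CPU-bound work in pure Python remains limited by the global interpreter lock.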
Displays:
- execution type
- execution time
- changed rows
- threshold used
Displays:
- incremental update time
- full recompute time
- speedup factor
Visualizes:
- execution trends
- threshold evolution
- execution type distribution
- rows changed trends
Displays:
- total partitions
- partition strategy
- partition files
Decision: incremental_update
Updated Rows: ['2', '3']
Affected Partitions: {0, 3}
Running Parallel Partitioned Incremental Update...
Partition 0 updated successfully.
Partition 3 updated successfully.
The engine achieved:
Incremental Update Time:
0.084 sec
Full Recompute Time:
0.095 sec
Meaning:
Incremental processing became faster than full recomputation.
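As a worked check of these numbers, assuming the speedup factor is defined as full recompute time divided by incremental time:

```python
incremental_time = 0.084      # seconds, from the run above
full_recompute_time = 0.095   # seconds, from the run above

speedup = full_recompute_time / incremental_time
saved = (full_recompute_time - incremental_time) / full_recompute_time

print(f"speedup factor: {speedup:.2f}x")  # prints: speedup factor: 1.13x
print(f"time saved: {saved:.1%}")         # prints: time saved: 11.6%
```

On this small dataset the gain is modest, roughly 13% faster.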
- Incremental Processing
- Partitioned Storage
- Parallel Execution
- Metadata Management
- Adaptive Systems
- Hash Partitioning
- Workload Distribution
- Parallel Workers
- Execution Optimization
- Dynamic Threshold Tuning
- Runtime Optimization
- Incremental Processing
- Adaptive Decision Systems
- Python
- Pandas
- Streamlit
- JSON
- ThreadPoolExecutor
- MD5 Hashing
- Go to the link below
- Download the dataset into the data/raw directory (the project tree expects data/raw/Titanic-Dataset.csv)
git clone https://github.com/RiddhiDhara/csv-cache-system
cd csv-cache-system
Create a requirements.txt file:
pandas
streamlit
Then install dependencies:
pip install -r requirements.txt
If you want to restart the engine from a clean state, reset the following files inside the metadata/ folder.
config.json
{
  "current_threshold": 0.2,
  "min_threshold": 0.05,
  "max_threshold": 0.9,
  "step_size": 0.05
}
performance.json
{
  "last_incremental_time": null,
  "last_full_recompute_time": null
}
history.json
{
  "history": []
}
run.json
{
  "decision": null,
  "actual_execution": null,
  "rows_percentage_change": null,
  "changed_rows": null,
  "total_rows": null,
  "execution_time": null,
  "threshold_used": null,
  "timestamp": null
}
partitions.json
{}
file_state.json
{
  "input_file_path": "",
  "last_file_hash": "",
  "processed_file_path": "",
  "row_identifier_columns": ["PassengerId"],
  "row_hashes": {}
}
Delete all files inside:
data/processed/partitions/
Example:
part_0.csv
part_1.csv
part_2.csv
...
The system will automatically recreate them during the next full recompute.
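Clearing the partition folder can be scripted. The sketch below assumes the default folder path shown above and the `part_*.csv` naming; `clear_partitions` is a hypothetical helper, not part of the project.

```python
from pathlib import Path

def clear_partitions(partition_dir: str = "data/processed/partitions") -> int:
    """Delete every part_*.csv file in the folder and return how many were removed."""
    removed = 0
    for part in Path(partition_dir).glob("part_*.csv"):
        part.unlink()
        removed += 1
    return removed
```

Only files matching `part_*.csv` are removed, so anything else kept in the folder is left untouched.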
Running the project regenerates:
- partitions
- metadata
- performance metrics
- history tracking
pip install pandas streamlit
python -m src.main
streamlit run src/dashboard/dashboard.py
Adaptive CSV Cache Engine successfully demonstrates:
- intelligent incremental processing
- partition-aware execution
- dynamic performance tuning
- parallel partition updates
- historical execution analytics
The project evolves beyond simple CSV processing into an adaptive, distributed-style incremental processing engine capable of optimizing workloads based on real runtime behavior.
