Production-oriented electricity load forecasting system for smart grids using a data pipeline + offline-trained models + JSON forecast artifacts + dashboard frontend.
- Project Overview
- Problem Statement
- What This Repository Delivers
- Repository Architecture
- System Flow (Mermaid)
- Tech Stack
- Data Sources and Data Contracts
- Detailed Module and Function Reference
- Runbook (End-to-End)
- Forecast JSON Contract
- Modeling Strategy and Metrics
- Model Evaluation & Comparison
- Observability and Logs
- Research and Design Synthesis
- External Research Notes (Web + Context7)
- Known Constraints
- Roadmap
- References
This project builds an AI-assisted forecasting workflow for short-term electricity demand prediction in a smart-grid setting (NRLDC-focused workflow in current implementation). It provides:
- Automated historical forecast-file collection from NRLDC web portal
- Structured extraction and cleaning pipeline for 15-minute demand series
- Feature-engineered local training with seasonal holdout validation
- JSON forecast publishing for 24h, 48h, and 72h horizons
- Precomputed residual heatmap for operational reliability analysis
- Next.js dashboard UI that reads static forecast JSON files and displays KPIs, forecasts, and residual patterns
The current serving constraint is file-based rather than API-based: the model is trained on your machine, forecast outputs are materialized as JSON, and the frontend consumes those JSON files directly.
The system is designed as a practical bridge between research-grade forecasting and operations-grade deployment patterns.
Modern grids must continuously balance supply and demand under volatility from:
- Intraday consumption pattern shifts
- Renewable generation variability
- Operational constraints during peak windows
- Data quality issues (spikes, missingness, inconsistent records)
Static or weak forecasting increases risk of:
- Dispatch inefficiency
- Higher operating cost
- Peak handling stress
- Reduced resilience and reliability
This repository addresses that gap with a reproducible ML pipeline and lightweight serving layer.
- Forecast horizon: 24, 48, and 72 hours ahead at 15-minute granularity (
96,192, and288steps) - Model family: XGBoost and LSTM artifacts trained through the project pipeline (season-aware split, autoregressive inference)
- Data cadence: 15-minute load points
- Pipeline stages: scrape → extract/merge → clean → train → generate JSON forecasts → sync → visualize
- Model artifacting:
joblibmodel + JSON metadata/buffer + forecast JSON files + metrics/residuals JSON files - Operational diagnostics: residual heatmap over day-of-week × hour-of-day, published as JSON for the dashboard
requirements.txt
README.md
data/
|-- raw/ # downloaded NRLDC files by year/month
| |-- 2024/
| | |-- <month folders by year>
| |-- 2025/
| | |-- <month folders by year>
| `-- 2026/
| `-- <month folders by year>
|-- extracted/ # merged extracted parquet
|-- cleaned/ # cleaned parquet used for training
|-- model/
| |-- xgboost/
| | |-- xgboost_model.joblib
| | `-- buffer.json
| `-- lstm/
| |-- 24h.keras
| |-- 48h.keras
| |-- 72h.keras
| `-- buffer.json
`-- public/
`-- data/
|-- xgboost/
| |-- forecast_24h.json
| |-- forecast_48h.json
| |-- forecast_72h.json
| |-- metrics.json
| `-- residuals.json
`-- lstm/
|-- forecast_24h.json
|-- forecast_48h.json
|-- forecast_72h.json
|-- metrics.json
`-- residuals.json
docs/
|-- design/
| |-- architecture.md
| `-- system_design.md
|-- overview/
| |-- problem_statement.md
| `-- project_overview.md
|-- research/
| |-- base_papers.md
| `-- model_analysis.md
`-- synopsis/
`-- Synopsis Report-NavGati Updated.pdf
logs/
|-- scrap_excel.log
`-- data_merger.log
notebooks/
|-- 01_eda.ipynb
|-- 02_baseline_models.ipynb
|-- 03_model_comparison.ipynb
|-- 04_xgboost_forecasting.ipynb
|-- 05_lstm_forecasting.ipynb
`-- 06_lstm_model_saving.ipynb
src/
|-- scrapping/
| `-- scrap_excel.py
|-- ingestion/
| |-- data_merger.py
| `-- data_cleaning.py
`-- pipeline/
|-- pre_generate.py
`-- xgboost/
`-- train_and_save_xgboost.py
gridcast-react/
|-- app/
|-- components/
|-- lib/
|-- public/
| `-- data/
`-- scripts/
`-- sync-real-data.mjs
flowchart TD
A[NRLDC Portal] --> B[scrap_excel.py\nDownload monthly files]
B --> C[data/raw]
C --> D[data_merger.py\nExtract + normalize rows]
D --> E[data/extracted/nrldc_extracted.parquet]
E --> F[data_cleaning.py\nSpike/range detection + interpolation]
F --> G[data/cleaned/nrldc_cleaned.parquet]
G --> H[train_and_save_xgboost.py\nModel training + buffer.json]
H --> I[data/model/xgboost/xgboost_model.joblib]
H --> J[data/model/xgboost/buffer.json]
I --> K[pre_generate.py\nGenerate 24h/48h/72h JSON forecasts]
J --> K
K --> L[data/public/data/*\nforecast_*.json + metrics.json + residuals.json]
L --> M[sync-real-data.mjs\nMirror artifacts to frontend]
M --> N[gridcast-react\nNext.js dashboard]
sequenceDiagram
participant UI as Dashboard
participant F as Static JSON files
participant M as Model artifacts
UI->>F: GET /data/xgboost/forecast_24h.json
UI->>F: GET /data/xgboost/forecast_48h.json
UI->>F: GET /data/xgboost/forecast_72h.json
UI->>F: GET /data/xgboost/metrics.json
UI->>F: GET /data/xgboost/residuals.json
F-->>UI: forecast rows + metrics + residual heatmap
M-->>F: pre_generate.py materializes JSON from trained model
flowchart LR
X[Raw extracted demand series] --> Y{Rule checks}
Y --> Y1[Step-change > 5000 MW]
Y --> Y2[Outside 25000-95000 MW]
Y --> Y3[Neighbor expansion for >10000 MW spikes]
Y1 --> Z[Flag as NaN]
Y2 --> Z
Y3 --> Z
Z --> W[Time interpolation]
W --> V[Verified cleaned series]
- Language: Python 3.x, JavaScript, HTML/CSS
- Data: pandas, pyarrow, fastparquet
- ML: xgboost, scikit-learn, numpy, joblib
- Serving: JSON artifact publishing plus static file delivery through the frontend
- Frontend: Next.js, React, TypeScript, Tailwind CSS
- Automation: selenium, webdriver-manager
- Notebook analysis: Jupyter notebooks under
notebooks/
- NRLDC intraday forecast portal (
https://nrldc.in/forecast/intra-day-forecast) - Downloaded files are grouped as
data/raw/<year>/<month>/...xlsx
- UCI ElectricityLoadDiagrams20112014 (15-minute consumption benchmark dataset)
- Extracted parquet (
data/extracted/nrldc_extracted.parquet)- columns:
date,timestamp,actual_demand_mw
- columns:
- Cleaned parquet (
data/cleaned/nrldc_cleaned.parquet)- datetime index
- column:
actual_demand_mw
- Model metadata (
data/model/xgboost/buffer.json,data/model/lstm/buffer.json)feature_cols,test_metrics, rollingbuffer, residual heatmap artifacts
- Forecast JSON artifacts (
data/public/data/<model>/forecast_24h.json,forecast_48h.json,forecast_72h.json)generated_at,model,data_end,trained_at,horizon,steps,forecast
- Diagnostics JSON artifacts (
data/public/data/<model>/metrics.json,residuals.json)horizon_metrics,heatmap_matrix,heatmap_flat,heatmap_info
Automates NRLDC SPA navigation, iterates year/month folders, downloads files with pagination, and logs run summary.
BASE_URL: target portalBASE_DOWNLOAD_DIR: local raw-data rootTABLE_ID: DataTable id (operationsTable)MONTH_MAP: month name to integer mapping- Runtime counters:
_total_files_downloaded,_total_files_checked,_total_files_skipped
| Function | Inputs | Returns | What it does | Why it matters |
|---|---|---|---|---|
get_driver(download_dir) |
download directory | configured webdriver.Chrome |
Creates Chrome driver with download prefs and stability flags | Ensures unattended, deterministic downloads |
set_download_dir(driver, directory) |
driver, target dir | None | Switches active browser download path via CDP | Supports year/month folder routing |
wait_for_downloads(directory, timeout=90) |
folder, timeout | bool | Waits until .crdownload/.tmp files disappear |
Prevents partial-file processing |
count_files(directory) |
folder path | int | Counts files in directory | Utility for summary checks |
get_existing_files(directory) |
folder path | set[str] |
Snapshot of existing filenames | Enables skip/new download calculation |
wait_table_loaded(wait) |
WebDriverWait |
None | Waits for DataTable processing spinner to clear | Reduces race conditions with SPA updates |
click_btn(driver, el) |
driver, element | None | Scrolls into view + JS click | Improves click reliability in dynamic UI |
get_folder_buttons(driver) |
driver | element list | Returns sidebar folder buttons | Base primitive for year/month traversal |
find_btn_by_fid(driver, fid) |
driver, folder id | element or None | Finds folder button by data-folderid |
Stable selector in changing DOM layouts |
table_has_no_data(driver) |
driver | bool | Detects “no data/no record” table states | Avoids empty-page download loops |
try_set_50_rows(driver) |
driver | bool | Sets DataTable page size to 50 using robust selectors | Reduces pagination overhead |
get_download_links(driver) |
driver | (links, strategy) |
Finds file download anchors using icon/href fallbacks | Handles icon or link-type differences |
_find_link_by_href(driver, href, timeout=4) |
driver, href | element or None | Re-locates an anchor after table re-render | Mitigates stale element references |
click_download_links(driver, links, existing_files) |
driver, link elements, file set | bool | Bulk-clicks links using re-query + JS fallback | Maximizes download success on dynamic pages |
month_is_future(year, month_name) |
year, month text | bool | Filters out future months relative to current date | Prevents invalid scraping targets |
open_year(driver, wait, year_fid) |
driver, wait, year folder id | bool | Fresh page load then opens a year accordion | Maintains clean state per year |
- Initializes logger and webdriver
- Discovers available year folders
- Iterates years → months (with future-month filter)
- Downloads each page’s files and paginates until exhaustion
- Writes aggregated run summary to
logs/scrap_excel.log - Exits non-zero on fatal failure (
sys.exit(1))
Parses raw Excel files into a single normalized parquet file for downstream cleaning and model training.
| Function | Inputs | Returns | What it does |
|---|---|---|---|
extract_date_from_filename(file_path) |
file path | pd.Timestamp |
Extracts date token from filename patterns DD-MM-YYYY / DD_MM_YYYY |
extract_nrldc_data(file_path) |
file path | pd.DataFrame |
Reads one Excel file, fixes headers, selects useful columns, numeric-casts demand, drops non-data rows |
main() |
none | None | Scans all .xlsx recursively, processes each file, concatenates, de-duplicates, sorts, saves to parquet |
- Writes
data/extracted/nrldc_extracted.parquet - Final columns:
date,timestamp,actual_demand_mw
Applies physically grounded anomaly detection and interpolation repair to demand time series.
STEP_THRESHOLD = 5000MW (15-min step-change)LOW_BOUND = 25000MWHIGH_BOUND = 95000MWLARGE_SPIKE = 10000MW (neighbor-expansion trigger)
| Function | Inputs | Returns | What it does |
|---|---|---|---|
load_dataset(path) |
parquet path | DataFrame (datetime index) | Creates datetime index from date + timestamp, sorts, deduplicates index |
detect_bad_points(series) |
load series | (mask, n_step, n_range, n_neighbour) |
Flags impossible jumps/range breaches and expands neighbors for large spikes |
repair(df, bad_mask) |
dataframe, bool mask | (clean_df, remaining_nulls) |
Nulls flagged points and fills with time interpolation |
verify(df_clean) |
cleaned dataframe | None | Re-runs checks and prints post-clean status |
- Loads extracted parquet
- Detects anomalies
- Repairs series
- Verifies residual anomalies
- Saves cleaned output to
data/cleaned/nrldc_cleaned.parquet
4) Training and Forecast Publishing (src/pipeline/xgboost/train_and_save_xgboost.py, src/pipeline/pre_generate.py)
Builds model-ready features, performs season-aware split, trains XGBoost, evaluates, computes residual heatmap, and then publishes JSON forecast artifacts for the frontend.
BUFFER_LEN = 672(1 week of 15-min history)HORIZONS = {"24h": 96, "48h": 192, "72h": 288}HOLDOUT_MONTHS = 3
| Function | Inputs | Returns | What it does |
|---|---|---|---|
add_features(df_part, full_series) |
partition df, full target series | engineered df | Adds lag features, rolling statistics, and calendar features |
make_split(df) |
full df | (train_mask, test_mask, cutoff) |
Time/season-aware split with last 3 months as holdout |
evaluate(y_true, y_pred, label='') |
true and predicted arrays | metrics dict | Computes MAE, RMSE, MAPE |
forecast_from(cutoff_idx, series, model, feature_cols) |
cutoff index, series, model, feature order | (pred_index, preds, actuals) |
96-step autoregressive forecasting loop from a single cutoff |
compute_residual_heatmap(series, model, feature_cols, test_start_idx) |
series, model, features, test start index | (matrix, flat_list) |
Runs repeated cutoffs over test period and computes mean APE by day/hour bucket |
run_xgb_forecast(horizon, model, buffer_meta) |
horizon, model, buffer metadata | prediction list | Generates 24h/48h/72h autoregressive forecasts |
save_json(preds, horizon, model_name, buffer_meta) |
forecast list, horizon, model name, metadata | JSON payload dict | Shapes the prediction payload for file-based publishing |
data/model/xgboost/xgboost_model.joblibdata/model/xgboost/buffer.jsoncontaining:- model metadata (
trained_at,data_end,feature_cols) test_metrics- last 672 observed loads (
buffer) - residual heatmap (
heatmap_matrix,heatmap_flat,heatmap_info)
- model metadata (
data/public/data/xgboost/forecast_24h.json,forecast_48h.json,forecast_72h.jsondata/public/data/xgboost/metrics.jsondata/public/data/xgboost/residuals.json
The LSTM pipeline follows the same publishing contract under data/public/data/lstm/.
The forecast is not served by a live inference API in the current setup. Instead, src/pipeline/pre_generate.py loads the saved model artifacts, generates 24h/48h/72h predictions, and writes ready-to-consume JSON files for the frontend.
| File | Purpose |
|---|---|
forecast_24h.json |
96-step forecast for the next 24 hours |
forecast_48h.json |
192-step forecast for the next 48 hours |
forecast_72h.json |
288-step forecast for the next 72 hours |
metrics.json |
Horizon metrics and training timestamps |
residuals.json |
7×24 residual heatmap data |
- Files are written to
data/public/data/<model>/and mirrored intogridcast-react/public/data/<model>/ - The frontend reads these files directly, so there is no mandatory always-on Flask service
gridcast-react/scripts/sync-real-data.mjsvalidates that the required JSON artifacts exist after sync
Next.js dashboard that reads the generated JSON artifacts from public/data, renders operational forecast views, KPIs, model comparison, residual heatmap, and CSV export.
| Function | Inputs | Returns | Responsibility |
|---|---|---|---|
fmt(n) |
number | localized string | Integer formatting for display |
fmtMW(n) |
number | string | Adds MW unit suffix |
mapeClass(v) |
number | class token | Maps MAPE to good/warn/bad styling |
nowIST() |
none | string | Current IST time string |
loadForecast() |
none | Promise | Fetches public/data/<model>/forecast_*.json, metrics.json, and residuals.json and triggers render |
refreshForecast() |
none | None | Manual refresh hook |
renderAll(health, data) |
JSON payloads | None | Updates all dashboard sections from latest data |
drawForecastChart(fc, peak, peakIdx) |
forecast list + peak metadata | None | Draws SVG line + confidence band + tooltip behaviors |
buildHeatmap(matrix) |
7×24 matrix or null | None | Renders real residual heatmap or static fallback |
exportCSV() |
none | None | Exports forecast rows as CSV file |
setupNavTabs() |
none | None | Handles tab switching between Forecast/Analysis/Models/Reports |
- On load:
setupNavTabs(); loadForecast(); - Error handling: Shows a data-loading banner when the JSON artifacts are missing or stale
- Progressive enhancement: if
/residuals.jsonis unavailable, uses fallback heatmap
pip install -r requirements.txt
cd gridcast-react
npm installpython src/scrapping/scrap_excel.py
python src/ingestion/data_merger.py
python src/ingestion/data_cleaning.pypython src/pipeline/xgboost/train_and_save_xgboost.pypython src/pipeline/pre_generate.pycd gridcast-react
npm run sync:datacd gridcast-react
npm run dev:realOpen http://localhost:3000 in your browser.
Each forecast artifact contains:
generated_atmodeldata_endtrained_athorizonhorizon_hstepshorizon_metricsforecastarray with timestamped values:datetime(string)load_mw(number)
Contains:
trained_atdata_endhorizon_metrics
heatmap_matrix(7x24, day-of-week × hour)heatmap_flat(168 values)heatmap_infotrained_atdata_end
- Feature-rich autoregressive regression with XGBoost
- Season-aware holdout (last 3 months) to test temporal generalization
- Sliding-cutoff residual analysis for operational reliability map
- MAE (absolute error magnitude)
- RMSE (penalizes larger misses)
- MAPE (relative error)
- Fast inference
- Explainable lag/calendar feature set
- Artifact-based serving (no online retrain dependency)
- File-based publishing avoids keeping a prediction server online
- Data Duration: April 2024 to March 2026 (~59,000 rows, 15-minute intervals)
- Features Used: 7 cyclic features (load + time-based: hour, day of week, month)
- LSTM Sequence Length: 192 (past 48 hours of historical data)
| Horizon | Avg MAPE | Status |
|---|---|---|
| 24h | 2.30% | Strong and stable |
| 48h | 2.31% | Slight degradation |
| 72h | 2.99% | Noticeable drop |
Training Insights:
- Early stopping triggered between epochs 6–9
- Learning rate reduced multiple times, indicating quick convergence
- Model showed signs of learning strong seasonal patterns with limited model complexity
| Horizon | XGBoost | LSTM v2 | Delta | Winner |
|---|---|---|---|---|
| 24h | 1.69% | 2.30% | +0.61% | XGBoost |
| 48h | 2.11% | 2.31% | +0.20% | XGBoost |
| 72h | 2.74% | 2.99% | +0.25% | XGBoost |
1. XGBoost Superiority
- Consistently outperforms LSTM across all forecast horizons
- Well-suited for structured/tabular data with engineered features
- Maintains stability and accuracy even at 72-hour horizon
2. LSTM Limitations
- Only basic time-based features utilized
- Missing important external variables: weather data, holidays, demand anomalies
- Prevents effective exploitation of temporal dependencies
3. Long-Horizon Forecast Challenge (72h)
- Significant error increase observed across both models
- Instability in later backtest windows
- Indicates inherent difficulty in extended forecasting
Primary Model: XGBoost
- Production deployment due to superior accuracy and stability
- Easier maintenance and explainability
- Consistent performance across all horizons
- Error remains under 3%, meeting industry-standard requirements
Secondary Use: LSTM
- Suitable for research and experimentation
- Foundation for future hybrid or ensemble models
- Opportunity for enhancement with external feature integration
logs/scrap_excel.log: scraping execution summary and failureslogs/data_merger.log: extraction/merge completion summaries- Training and JSON generation scripts print model metadata, training time, and data end timestamp
- The frontend sync script fails fast if a required JSON artifact is missing
Suggested production add-ons:
- Structured JSON logging
- Request/latency metrics for API endpoints
- Scheduled retraining and drift alerting
Based on docs/overview/project_overview.md, docs/design/system_design.md, and docs/research/model_analysis.md, the project direction combines:
- Time-series forecasting for grid stability
- Data engineering rigor (validation, cleaning, feature pipelines)
- ML/DL comparative approach (Linear/GRU/LSTM/XGBoost references)
- Operational objective alignment (peak-risk mitigation, reliability, sustainability)
Design patterns reflected in code:
- Modular pipeline stages
- Artifact-based model lifecycle
- Monitoring-oriented residual diagnostics
- Static artifacts + dashboard for applied decision support
Note:
docs/synopsis/Synopsis Report-NavGati Updated.pdfis included in repository documentation scope but is not machine-parsed in this README generation flow.
- IEA report emphasis: grids are central to secure clean-energy transitions, and planning/management upgrades are critical to avoid grid bottlenecks.
- U.S. DOE grid modernization framing: smarter, data-enabled grids improve resilience, outage response, renewable integration, and operational efficiency.
- UCI electricity load benchmark confirms broad use of 15-minute cadence load data for forecasting tasks.
- XGBoost (
/dmlc/xgboost): strong fit for tabular time-series feature sets, scikit-learn styleXGBRegressor, robust parameterization and importance analysis. - Next.js / static hosting: standard public-asset patterns for serving forecast JSON files directly to the dashboard.
- Pandas (
/pandas-dev/pandas): DatetimeIndex-first operations andinterpolate(method='time')align with this project’s cleaning path.
- Scraper relies on target portal DOM stability; UI changes may require selector updates.
- No formal test suite is currently present.
- Current pipeline is single-region operationalized (North) though docs discuss multi-region ambitions.
- The dashboard consumes static artifacts; online updates require explicit retraining and re-syncing of JSON files.
- Add weather/exogenous features for peak-window robustness.
- Add model registry and retraining scheduler.
- Add CI tests for pipeline invariants and forecast JSON contracts.
- Add region abstraction for North/South/East/West scaling.
- Integrate LSTM/GRU pipeline as optional model backend.
docs/overview/project_overview.mddocs/overview/problem_statement.mddocs/design/architecture.mddocs/design/system_design.mddocs/research/base_papers.mddocs/research/model_analysis.mddocs/synopsis/Synopsis Report-NavGati Updated.pdf
- IEA — Electricity Grids and Secure Energy Transitions: https://www.iea.org/reports/electricity-grids-and-secure-energy-transitions
- U.S. DOE — Grid Modernization and the Smart Grid: https://www.energy.gov/oe/activities/technology-development/grid-modernization-and-smart-grid
- UCI ML Repository — ElectricityLoadDiagrams20112014: https://archive.ics.uci.edu/dataset/321/electricityloaddiagrams20112014
- XGBoost docs (Context7 index):
/dmlc/xgboost - Flask docs (Context7 index):
/pallets/flask - Pandas docs (Context7 index):
/pandas-dev/pandas
pip install -r requirements.txt
python src/scrapping/scrap_excel.py
python src/ingestion/data_merger.py
python src/ingestion/data_cleaning.py
python src/pipeline/xgboost/train_and_save_xgboost.py
python src/pipeline/pre_generate.py
cd gridcast-react
npm install
npm run sync:data
npm run dev:realThen open dashboard:
http://localhost:3000