From 8c24204848ca83e0b3a47e17b5d67a19cbf7f4c4 Mon Sep 17 00:00:00 2001 From: Prateek Date: Thu, 7 May 2026 09:21:39 +0530 Subject: [PATCH] docs: update README to reflect changes in model architecture and JSON artifact publishing Co-authored-by: Copilot --- README.md | 313 +++++++++++++++++++++++++++++++++++------------------- 1 file changed, 202 insertions(+), 111 deletions(-) diff --git a/README.md b/README.md index 6ede190..ed5dab2 100644 --- a/README.md +++ b/README.md @@ -1,6 +1,6 @@ # GridCast — Smart Grid Load Prediction (Team NavGati) -Production-oriented electricity load forecasting system for smart grids using a **data pipeline + XGBoost model + Flask API + dashboard frontend**. +Production-oriented electricity load forecasting system for smart grids using a **data pipeline + offline-trained models + JSON forecast artifacts + dashboard frontend**. --- image @@ -19,11 +19,11 @@ Production-oriented electricity load forecasting system for smart grids using a - [Scraper (`src/scrapping/scrap_excel.py`)](#1-scraper-srcscrappingscrap_excelpy) - [Merger (`src/ingestion/data_merger.py`)](#2-merger-srcingestiondata_mergerpy) - [Cleaner (`src/ingestion/data_cleaning.py`)](#3-cleaner-srcingestiondata_cleaningpy) - - [Training (`src/pipeline/train_and_save.py`)](#4-training-srcpipelinetrain_and_savepy) - - [API (`src/api/app.py`)](#5-api-srcapiapppy) - - [Frontend (`src/Frontend/dashboard.html`)](#6-frontend-srcfrontenddashboardhtml) + - [Training and Forecast Publishing (`src/pipeline/xgboost/train_and_save_xgboost.py`, `src/pipeline/pre_generate.py`)](#4-training-and-forecast-publishing-srcpipelinexgboosttrain_and_save_xgboostpy-srcpipelinepre_generatepy) + - [Forecast JSON Artifacts](#5-forecast-json-artifacts) + - [Frontend (`gridcast-react/`)](#6-frontend-gridcast-react) 9. [Runbook (End-to-End)](#runbook-end-to-end) -10. [API Contract](#api-contract) +10. [Forecast JSON Contract](#forecast-json-contract) 11. [Modeling Strategy and Metrics](#modeling-strategy-and-metrics) 12. [Observability and Logs](#observability-and-logs) 13. [Research and Design Synthesis](#research-and-design-synthesis) @@ -40,10 +40,12 @@ This project builds an AI-assisted forecasting workflow for **short-term electri - Automated historical forecast-file collection from NRLDC web portal - Structured extraction and cleaning pipeline for 15-minute demand series -- Feature-engineered XGBoost training with seasonal holdout validation +- Feature-engineered local training with seasonal holdout validation +- JSON forecast publishing for 24h, 48h, and 72h horizons - Precomputed residual heatmap for operational reliability analysis -- Flask API serving 24-hour (96-step) forecasts -- Dashboard UI for operations teams to view forecast, KPIs, and residual patterns +- Next.js dashboard UI that reads static forecast JSON files and displays KPIs, forecasts, and residual patterns + +The current serving constraint is file-based rather than API-based: the model is trained on your machine, forecast outputs are materialized as JSON, and the frontend consumes those JSON files directly. The system is designed as a practical bridge between research-grade forecasting and operations-grade deployment patterns. @@ -71,12 +73,12 @@ This repository addresses that gap with a reproducible ML pipeline and lightweig ## What This Repository Delivers -- **Forecast horizon:** 24 hours ahead at 15-minute granularity (`96` steps) -- **Model family:** XGBoost (season-aware split, autoregressive inference) +- **Forecast horizon:** 24, 48, and 72 hours ahead at 15-minute granularity (`96`, `192`, and `288` steps) +- **Model family:** XGBoost and LSTM artifacts trained through the project pipeline (season-aware split, autoregressive inference) - **Data cadence:** 15-minute load points -- **Pipeline stages:** scrape → extract/merge → clean → train → serve → visualize -- **Model artifacting:** `joblib` model + JSON metadata/buffer -- **Operational diagnostics:** residual heatmap over day-of-week × hour-of-day +- **Pipeline stages:** scrape → extract/merge → clean → train → generate JSON forecasts → sync → visualize +- **Model artifacting:** `joblib` model + JSON metadata/buffer + forecast JSON files + metrics/residuals JSON files +- **Operational diagnostics:** residual heatmap over day-of-week × hour-of-day, published as JSON for the dashboard --- @@ -87,39 +89,83 @@ requirements.txt README.md data/ - raw/ # downloaded NRLDC files by year/month - extracted/ # merged extracted parquet - cleaned/ # cleaned parquet used for training - model/ - xgboost_model.joblib - buffer.json +|-- raw/ # downloaded NRLDC files by year/month +| |-- 2024/ +| | |-- +| |-- 2025/ +| | |-- +| `-- 2026/ +| `-- +|-- extracted/ # merged extracted parquet +|-- cleaned/ # cleaned parquet used for training +|-- model/ +| |-- xgboost/ +| | |-- xgboost_model.joblib +| | `-- buffer.json +| `-- lstm/ +| |-- 24h.keras +| |-- 48h.keras +| |-- 72h.keras +| `-- buffer.json +`-- public/ + `-- data/ + |-- xgboost/ + | |-- forecast_24h.json + | |-- forecast_48h.json + | |-- forecast_72h.json + | |-- metrics.json + | `-- residuals.json + `-- lstm/ + |-- forecast_24h.json + |-- forecast_48h.json + |-- forecast_72h.json + |-- metrics.json + `-- residuals.json docs/ - research.md - system_design.md - synopsis/ - Synopsis Report-NavGati Updated.pdf +|-- design/ +| |-- architecture.md +| `-- system_design.md +|-- overview/ +| |-- problem_statement.md +| `-- project_overview.md +|-- research/ +| |-- base_papers.md +| `-- model_analysis.md +`-- synopsis/ + `-- Synopsis Report-NavGati Updated.pdf logs/ - scrap_excel.log - data_merger.log +|-- scrap_excel.log +`-- data_merger.log notebooks/ - 01_eda.ipynb - 02_Baseline_model.ipynb +|-- 01_eda.ipynb +|-- 02_baseline_models.ipynb +|-- 03_model_comparison.ipynb +|-- 04_xgboost_forecasting.ipynb +|-- 05_lstm_forecasting.ipynb +`-- 06_lstm_model_saving.ipynb src/ - scrapping/ - scrap_excel.py - ingestion/ - data_merger.py - data_cleaning.py - pipeline/ - train_and_save.py - api/ - app.py - Frontend/ - dashboard.html +|-- scrapping/ +| `-- scrap_excel.py +|-- ingestion/ +| |-- data_merger.py +| `-- data_cleaning.py +`-- pipeline/ + |-- pre_generate.py + `-- xgboost/ + `-- train_and_save_xgboost.py + +gridcast-react/ +|-- app/ +|-- components/ +|-- lib/ +|-- public/ +| `-- data/ +`-- scripts/ + `-- sync-real-data.mjs ``` --- @@ -136,12 +182,14 @@ flowchart TD D --> E[data/extracted/nrldc_extracted.parquet] E --> F[data_cleaning.py\nSpike/range detection + interpolation] F --> G[data/cleaned/nrldc_cleaned.parquet] - G --> H[train_and_save.py\nFeature engineering + XGBoost] - H --> I[data/model/xgboost_model.joblib] - H --> J[data/model/buffer.json] - I --> K[app.py\nFlask inference service] + G --> H[train_and_save_xgboost.py\nModel training + buffer.json] + H --> I[data/model/xgboost/xgboost_model.joblib] + H --> J[data/model/xgboost/buffer.json] + I --> K[pre_generate.py\nGenerate 24h/48h/72h JSON forecasts] J --> K - K --> L[dashboard.html\nForecast + KPIs + heatmap] + K --> L[data/public/data/*\nforecast_*.json + metrics.json + residuals.json] + L --> M[sync-real-data.mjs\nMirror artifacts to frontend] + M --> N[gridcast-react\nNext.js dashboard] ``` ### 2) Forecast Serving Sequence @@ -149,20 +197,16 @@ flowchart TD ```mermaid sequenceDiagram participant UI as Dashboard - participant API as Flask API - participant M as xgboost_model.joblib - participant B as buffer.json - - UI->>API: GET /health - API-->>UI: model metadata + metrics - - UI->>API: GET /forecast - API->>B: Load latest buffer in memory - API->>M: Autoregressive step prediction (96x) - API-->>UI: forecast array [{datetime, load_mw}] - - UI->>API: GET /residuals - API-->>UI: heatmap_matrix (7x24 APE) + participant F as Static JSON files + participant M as Model artifacts + + UI->>F: GET /data/xgboost/forecast_24h.json + UI->>F: GET /data/xgboost/forecast_48h.json + UI->>F: GET /data/xgboost/forecast_72h.json + UI->>F: GET /data/xgboost/metrics.json + UI->>F: GET /data/xgboost/residuals.json + F-->>UI: forecast rows + metrics + residual heatmap + M-->>F: pre_generate.py materializes JSON from trained model ``` ### 3) Data Quality Logic @@ -187,7 +231,8 @@ flowchart LR - **Language:** Python 3.x, JavaScript, HTML/CSS - **Data:** pandas, pyarrow, fastparquet - **ML:** xgboost, scikit-learn, numpy, joblib -- **API:** Flask, Flask-CORS +- **Serving:** JSON artifact publishing plus static file delivery through the frontend +- **Frontend:** Next.js, React, TypeScript, Tailwind CSS - **Automation:** selenium, webdriver-manager - **Notebook analysis:** Jupyter notebooks under `notebooks/` @@ -211,8 +256,12 @@ flowchart LR 2. **Cleaned parquet** (`data/cleaned/nrldc_cleaned.parquet`) - datetime index - column: `actual_demand_mw` -3. **Model metadata** (`data/model/buffer.json`) +3. **Model metadata** (`data/model/xgboost/buffer.json`, `data/model/lstm/buffer.json`) - `feature_cols`, `test_metrics`, rolling `buffer`, residual heatmap artifacts +4. **Forecast JSON artifacts** (`data/public/data//forecast_24h.json`, `forecast_48h.json`, `forecast_72h.json`) + - `generated_at`, `model`, `data_end`, `trained_at`, `horizon`, `steps`, `forecast` +5. **Diagnostics JSON artifacts** (`data/public/data//metrics.json`, `residuals.json`) + - `horizon_metrics`, `heatmap_matrix`, `heatmap_flat`, `heatmap_info` --- @@ -314,15 +363,15 @@ Applies physically grounded anomaly detection and interpolation repair to demand --- -## 4) Training (`src/pipeline/train_and_save.py`) +## 4) Training and Forecast Publishing (`src/pipeline/xgboost/train_and_save_xgboost.py`, `src/pipeline/pre_generate.py`) ### Purpose -Builds model-ready features, performs season-aware split, trains XGBoost, evaluates, computes residual heatmap, and persists artifacts. +Builds model-ready features, performs season-aware split, trains XGBoost, evaluates, computes residual heatmap, and then publishes JSON forecast artifacts for the frontend. ### Key constants - `BUFFER_LEN = 672` (1 week of 15-min history) -- `STEPS = 96` (24-hour horizon) +- `HORIZONS = {"24h": 96, "48h": 192, "72h": 288}` - `HOLDOUT_MONTHS = 3` ### Function-by-function details @@ -334,44 +383,52 @@ Builds model-ready features, performs season-aware split, trains XGBoost, evalua | `evaluate(y_true, y_pred, label='')` | true and predicted arrays | metrics dict | Computes MAE, RMSE, MAPE | | `forecast_from(cutoff_idx, series, model, feature_cols)` | cutoff index, series, model, feature order | `(pred_index, preds, actuals)` | 96-step autoregressive forecasting loop from a single cutoff | | `compute_residual_heatmap(series, model, feature_cols, test_start_idx)` | series, model, features, test start index | `(matrix, flat_list)` | Runs repeated cutoffs over test period and computes mean APE by day/hour bucket | +| `run_xgb_forecast(horizon, model, buffer_meta)` | horizon, model, buffer metadata | prediction list | Generates 24h/48h/72h autoregressive forecasts | +| `save_json(preds, horizon, model_name, buffer_meta)` | forecast list, horizon, model name, metadata | JSON payload dict | Shapes the prediction payload for file-based publishing | ### Artifact outputs -- `data/model/xgboost_model.joblib` -- `data/model/buffer.json` containing: +- `data/model/xgboost/xgboost_model.joblib` +- `data/model/xgboost/buffer.json` containing: - model metadata (`trained_at`, `data_end`, `feature_cols`) - `test_metrics` - last 672 observed loads (`buffer`) - residual heatmap (`heatmap_matrix`, `heatmap_flat`, `heatmap_info`) +- `data/public/data/xgboost/forecast_24h.json`, `forecast_48h.json`, `forecast_72h.json` +- `data/public/data/xgboost/metrics.json` +- `data/public/data/xgboost/residuals.json` + +The LSTM pipeline follows the same publishing contract under `data/public/data/lstm/`. --- -## 5) API (`src/api/app.py`) +## 5) Forecast JSON Artifacts ### Purpose -Loads model + metadata once at startup and serves low-latency forecast and diagnostics endpoints. +The forecast is not served by a live inference API in the current setup. Instead, `src/pipeline/pre_generate.py` loads the saved model artifacts, generates 24h/48h/72h predictions, and writes ready-to-consume JSON files for the frontend. -### Function-by-function details +### Generated files -| Function | Route | Method | Output | -|---|---|---|---| -| `run_forecast()` | internal | internal | Generates 96-step forecast using autoregressive feature updates | -| `health()` | `/health` | `GET` | API status, model metadata, test metrics | -| `residuals()` | `/residuals` | `GET` | 7×24 residual heatmap data from training artifact | -| `forecast()` | `/forecast` | `GET` | full forecast payload + metadata | +| File | Purpose | +|---|---| +| `forecast_24h.json` | 96-step forecast for the next 24 hours | +| `forecast_48h.json` | 192-step forecast for the next 48 hours | +| `forecast_72h.json` | 288-step forecast for the next 72 hours | +| `metrics.json` | Horizon metrics and training timestamps | +| `residuals.json` | 7×24 residual heatmap data | ### Runtime notes -- Uses `Flask` + `Flask-CORS` -- Uses `use_reloader=False` to avoid duplicate startup loads -- Exposes service on `0.0.0.0:5000` +- Files are written to `data/public/data//` and mirrored into `gridcast-react/public/data//` +- The frontend reads these files directly, so there is no mandatory always-on Flask service +- `gridcast-react/scripts/sync-real-data.mjs` validates that the required JSON artifacts exist after sync --- -## 6) Frontend (`src/Frontend/dashboard.html`) +## 6) Frontend (`gridcast-react/`) ### Purpose -Single-file dashboard that fetches API data and renders operational forecast views, KPIs, model comparison, residual heatmap, and CSV export. +Next.js dashboard that reads the generated JSON artifacts from `public/data`, renders operational forecast views, KPIs, model comparison, residual heatmap, and CSV export. ### JavaScript function reference @@ -381,9 +438,9 @@ Single-file dashboard that fetches API data and renders operational forecast vie | `fmtMW(n)` | number | string | Adds `MW` unit suffix | | `mapeClass(v)` | number | class token | Maps MAPE to `good/warn/bad` styling | | `nowIST()` | none | string | Current IST time string | -| `loadForecast()` | none | Promise | Fetches `/health`, `/forecast`, `/residuals` and triggers render | +| `loadForecast()` | none | Promise | Fetches `public/data//forecast_*.json`, `metrics.json`, and `residuals.json` and triggers render | | `refreshForecast()` | none | None | Manual refresh hook | -| `renderAll(health, data)` | API payloads | None | Updates all dashboard sections from latest data | +| `renderAll(health, data)` | JSON payloads | None | Updates all dashboard sections from latest data | | `drawForecastChart(fc, peak, peakIdx)` | forecast list + peak metadata | None | Draws SVG line + confidence band + tooltip behaviors | | `buildHeatmap(matrix)` | 7×24 matrix or null | None | Renders real residual heatmap or static fallback | | `exportCSV()` | none | None | Exports forecast rows as CSV file | @@ -392,8 +449,8 @@ Single-file dashboard that fetches API data and renders operational forecast vie ### UI interaction model - On load: `setupNavTabs(); loadForecast();` -- Error handling: Shows API connectivity banner when fetch fails -- Progressive enhancement: if `/residuals` unavailable, uses fallback heatmap +- Error handling: Shows a data-loading banner when the JSON artifacts are missing or stale +- Progressive enhancement: if `/residuals.json` is unavailable, uses fallback heatmap --- @@ -403,6 +460,8 @@ Single-file dashboard that fetches API data and renders operational forecast vie ```bash pip install -r requirements.txt +cd gridcast-react +npm install ``` ## 2) Run data ingestion @@ -413,48 +472,69 @@ python src/ingestion/data_merger.py python src/ingestion/data_cleaning.py ``` -## 3) Train and persist model +## 3) Train the model on your machine ```bash -python src/pipeline/train_and_save.py +python src/pipeline/xgboost/train_and_save_xgboost.py ``` -## 4) Start API +## 4) Generate forecast JSON artifacts ```bash -python src/api/app.py +python src/pipeline/pre_generate.py ``` -## 5) Open frontend +## 5) Sync artifacts to the frontend -Open `src/Frontend/dashboard.html` in browser (ensure API is running on `http://localhost:5000`). +```bash +cd gridcast-react +npm run sync:data +``` ---- +## 6) Open frontend + +```bash +cd gridcast-react +npm run dev:real +``` -## API Contract +Open `http://localhost:3000` in your browser. -### `GET /health` +--- -Returns service and model metadata. +## Forecast JSON Contract -### `GET /forecast` +### `forecast_24h.json`, `forecast_48h.json`, `forecast_72h.json` -Returns: +Each forecast artifact contains: -- generation timestamp -- model details -- horizon metadata -- `forecast` array with 96 points: +- `generated_at` +- `model` +- `data_end` +- `trained_at` +- `horizon` +- `horizon_h` +- `steps` +- `horizon_metrics` +- `forecast` array with timestamped values: - `datetime` (string) - `load_mw` (number) -### `GET /residuals` +### `metrics.json` + +Contains: -Returns: +- `trained_at` +- `data_end` +- `horizon_metrics` + +### `residuals.json` - `heatmap_matrix` (`7x24`, day-of-week × hour) - `heatmap_flat` (168 values) - `heatmap_info` +- `trained_at` +- `data_end` --- @@ -477,6 +557,7 @@ Returns: - Fast inference - Explainable lag/calendar feature set - Artifact-based serving (no online retrain dependency) +- File-based publishing avoids keeping a prediction server online --- @@ -484,7 +565,8 @@ Returns: - `logs/scrap_excel.log`: scraping execution summary and failures - `logs/data_merger.log`: extraction/merge completion summaries -- API startup prints model metadata, training time, and data end timestamp +- Training and JSON generation scripts print model metadata, training time, and data end timestamp +- The frontend sync script fails fast if a required JSON artifact is missing Suggested production add-ons: @@ -496,7 +578,7 @@ Suggested production add-ons: ## Research and Design Synthesis -Based on `docs/research.md` and `docs/system_design.md`, the project direction combines: +Based on `docs/overview/project_overview.md`, `docs/design/system_design.md`, and `docs/research/model_analysis.md`, the project direction combines: - Time-series forecasting for grid stability - Data engineering rigor (validation, cleaning, feature pipelines) @@ -508,7 +590,7 @@ Design patterns reflected in code: - Modular pipeline stages - Artifact-based model lifecycle - Monitoring-oriented residual diagnostics -- API + dashboard for applied decision support +- Static artifacts + dashboard for applied decision support > Note: `docs/synopsis/Synopsis Report-NavGati Updated.pdf` is included in repository documentation scope but is not machine-parsed in this README generation flow. @@ -525,7 +607,7 @@ Design patterns reflected in code: ### Library guidance (Context7) - **XGBoost (`/dmlc/xgboost`)**: strong fit for tabular time-series feature sets, scikit-learn style `XGBRegressor`, robust parameterization and importance analysis. -- **Flask (`/pallets/flask`)**: standard route and JSON serving patterns, stable local server model. +- **Next.js / static hosting**: standard public-asset patterns for serving forecast JSON files directly to the dashboard. - **Pandas (`/pandas-dev/pandas`)**: DatetimeIndex-first operations and `interpolate(method='time')` align with this project’s cleaning path. --- @@ -535,7 +617,7 @@ Design patterns reflected in code: - Scraper relies on target portal DOM stability; UI changes may require selector updates. - No formal test suite is currently present. - Current pipeline is single-region operationalized (North) though docs discuss multi-region ambitions. -- API currently serves from static artifacts; online updates require explicit retraining. +- The dashboard consumes static artifacts; online updates require explicit retraining and re-syncing of JSON files. --- @@ -543,7 +625,7 @@ Design patterns reflected in code: 1. Add weather/exogenous features for peak-window robustness. 2. Add model registry and retraining scheduler. -3. Add CI tests for pipeline invariants and API contracts. +3. Add CI tests for pipeline invariants and forecast JSON contracts. 4. Add region abstraction for North/South/East/West scaling. 5. Integrate LSTM/GRU pipeline as optional model backend. @@ -553,8 +635,12 @@ Design patterns reflected in code: ### Repository documentation -- `docs/research.md` -- `docs/system_design.md` +- `docs/overview/project_overview.md` +- `docs/overview/problem_statement.md` +- `docs/design/architecture.md` +- `docs/design/system_design.md` +- `docs/research/base_papers.md` +- `docs/research/model_analysis.md` - `docs/synopsis/Synopsis Report-NavGati Updated.pdf` ### External @@ -572,15 +658,20 @@ Design patterns reflected in code: ```bash pip install -r requirements.txt +python src/scrapping/scrap_excel.py python src/ingestion/data_merger.py python src/ingestion/data_cleaning.py -python src/pipeline/train_and_save.py -python src/api/app.py +python src/pipeline/xgboost/train_and_save_xgboost.py +python src/pipeline/pre_generate.py +cd gridcast-react +npm install +npm run sync:data +npm run dev:real ``` Then open dashboard: -- `src/Frontend/dashboard.html` +- `http://localhost:3000` ---