GitHub - saikumar2608/brfss-risk-engine-v2

Risk Engine V2

A Multi-Year Chronic Disease Risk Prediction Application using BRFSS Data

Risk Engine V2 is an upgraded machine learning–powered application designed to predict chronic disease risks (Diabetes, Hypertension, Heart Disease, and Obesity) based on individual lifestyle, demographic, and health factors from the CDC’s Behavioral Risk Factor Surveillance System (BRFSS). It delivers percentage-based risk estimates through a user-friendly Streamlit interface.

Project Overview

Models: XGBoost with class-weighted training for imbalanced targets

Targets: Diabetes, Hypertension, Heart Disease, Obesity

Data Sources: BRFSS 2019, 2020, 2021 (CDC)

Interface: Streamlit web app with dynamic inputs and contextual “About” section

Key Features:

Multi-year model training with cross-year validation

Feature importance via SHAP for transparency

Probability-based outputs for more informative feedback

Built-in definitions to ensure layperson understanding

Key Improvements in V2

Multi-year integration — Supports training on combined BRFSS years for better generalization.

Feature consistency — Unified feature naming (rename_dict) across datasets.

Enhanced interpretability — SHAP visualizations to explain predictions.

User experience upgrades — Clean, human-readable inputs and explanations in the UI.

Resilient deployment — Stable Streamlit Cloud release with pinned dependencies and asset checks.

Challenges & How We Solved Them

Age group bins problem

Issue: The _AGEG5YR bins were misaligned, producing illogical age–risk patterns.

Fix: Centralized mapping dictionary, rebuilt using pd.cut with precise boundaries, and validated with asserts on counts and min/max bins.
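A minimal sketch of the pd.cut rebuild described above, assuming the app collects age in years and must reproduce the 13 five-year groups of the BRFSS codebook (code 1 = 18-24 through code 13 = 80 and older). The names AGE_EDGES, AGE_CODES, and age_to_ageg5yr are hypothetical, not taken from the repository.

```python
import numpy as np
import pandas as pd

# Lower edge of each group; the final edge is open-ended for "80 or older".
AGE_EDGES = [18, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, np.inf]
AGE_CODES = list(range(1, 14))  # codes 1..13, one per interval


def age_to_ageg5yr(age: pd.Series) -> pd.Series:
    """Map age in years to _AGEG5YR-style codes (hypothetical helper)."""
    # right=False makes each bin [lower, upper), so 25 lands in code 2, not code 1.
    codes = pd.cut(age, bins=AGE_EDGES, labels=AGE_CODES, right=False)
    # Ages below 18 fall outside every bin and come back as NaN.
    assert codes.dropna().astype(int).between(1, 13).all()
    return codes


ages = pd.Series([18, 24, 25, 64, 65, 79, 80, 99])
codes = age_to_ageg5yr(ages).astype(int).tolist()
print(codes)
```

Keeping the edges in a single list means the training pipeline and the UI cannot drift apart on where a boundary such as 25 or 65 falls.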

General health flipped

Issue: The BRFSS GENHLTH scale (1 = Excellent, 5 = Poor) was being read in reverse.

Fix: Explicit mapping so higher values indicate worse health; verified with correlation checks.
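One way to make the direction explicit is a single shared table, sketched below. The 1-5 labels and the 7/9 non-answer codes follow the BRFSS codebook; the helper names are hypothetical.

```python
# GENHLTH codes from the BRFSS codebook; higher code = worse self-rated health.
GENHLTH_LABELS = {1: "Excellent", 2: "Very good", 3: "Good", 4: "Fair", 5: "Poor"}


def clean_genhlth(code):
    """Keep valid 1-5 codes as-is; 7 (don't know) and 9 (refused) become None."""
    return code if code in GENHLTH_LABELS else None


def label_to_code(label):
    """Convert a UI label back to its code through the same table."""
    inverse = {name: code for code, name in GENHLTH_LABELS.items()}
    return inverse[label]


print(label_to_code("Poor"), clean_genhlth(7))
```

Because both directions go through GENHLTH_LABELS, the UI and the model cannot disagree on which end of the scale means poor health.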

Smoking yes/no flipped

Issue: SMOKE100 (1=Yes, 2=No) was mistakenly inverted.

Fix: Boolean recode (1 → True, 2 → False) with sanity checks on prevalence rates.
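The recode and its prevalence check might look like the sketch below. It assumes the non-answer codes 7 and 9 are treated as missing; the function names and sample values are illustrative only.

```python
# SMOKE100: 1 = Yes, 2 = No. Anything else (7, 9, blank) is treated as missing.
SMOKE100_MAP = {1: True, 2: False}


def recode_smoke100(codes):
    """Return True/False for answered rows and None for non-answers."""
    return [SMOKE100_MAP.get(code) for code in codes]


def prevalence(flags):
    """Share of True among answered rows; an inverted recode yields the complement."""
    answered = [flag for flag in flags if flag is not None]
    return sum(answered) / len(answered)


flags = recode_smoke100([1, 2, 2, 7, 1, 2, 9, 2])  # made-up sample codes
rate = prevalence(flags)
print(flags, rate)
```

Comparing the computed rate against a published smoking prevalence figure is a cheap way to notice a flipped encoding, since the inverted value would be the complement.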

Logic problems with BMI

Issue: Mismatch between dataset-calculated BMI and manual BMI from inputs (unit inconsistencies, off-by-one category edges).

Fix: Standardized formula and category definitions to match BRFSS exactly.
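As an illustration of the standardized formula, the sketch below computes BMI from pounds and inches and classifies it on integer hundredths, mirroring the two implied decimals of the BRFSS _BMI5 variable and the _BMI5CAT cut points (18.50, 25.00, 30.00). Imperial inputs and rounding to the nearest hundredth are assumptions here; the function names are hypothetical.

```python
LB_TO_KG = 0.45359237  # exact by definition
IN_TO_M = 0.0254       # exact by definition


def bmi5_from_imperial(weight_lb, height_in):
    """BMI times 100 as an integer, e.g. 2342 means a BMI of 23.42."""
    kg = weight_lb * LB_TO_KG
    m = height_in * IN_TO_M
    return round(kg / (m * m) * 100)


def bmi_category(bmi5):
    """_BMI5CAT-style codes: 1 underweight, 2 normal, 3 overweight, 4 obese."""
    if bmi5 < 1850:
        return 1
    if bmi5 < 2500:
        return 2
    if bmi5 < 3000:
        return 3
    return 4


print(bmi_category(bmi5_from_imperial(154, 68)))
```

Comparing integers avoids floating-point surprises at the category edges, which is where off-by-one mismatches tend to appear.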

Abrupt discontinuities in SHAP outputs

Issue: Partial dependence sweeps had abrupt jumps.

Fix: Standardized categorical encodings, smoothed bins, validated on controlled synthetic sweeps.
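A controlled sweep of the kind mentioned above can be sketched without any model-specific code: vary one input over a grid, hold the rest fixed, and look at the largest step between neighbouring predictions. Here toy_predict is a stand-in for a trained model, and all names are hypothetical.

```python
def sweep_feature(predict, base_row, feature, grid):
    """Predict once per grid value, changing only `feature` in base_row."""
    return [predict({**base_row, feature: value}) for value in grid]


def largest_jump(values):
    """Largest absolute change between consecutive sweep points."""
    return max(abs(b - a) for a, b in zip(values, values[1:]))


def toy_predict(row):
    # Stand-in for a real model: risk rises smoothly with the age group code.
    return min(1.0, 0.01 * row["age_group"] + (0.2 if row["smoker"] else 0.0))


base = {"age_group": 1, "smoker": False}
risks = sweep_feature(toy_predict, base, "age_group", range(1, 14))
print(largest_jump(risks))
```

A jump far larger than its neighbours at a single grid point usually points to an encoding or binning problem rather than a real effect.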

Retraining the BMI and hypertension (HTN) models during UI fixes

Issue: Encoder/category order changes in UI broke old models.

Fix: Retrained BMI & HTN with unified preprocessing pipelines and locked category order.
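Locking category order can be done with a pandas CategoricalDtype that is defined once and reused at training and inference time. This is a sketch under that assumption, not the repository's pipeline; the general-health labels serve only as an example feature.

```python
import pandas as pd

# Order fixed once; integer codes are positions in this list (hypothetical constant).
GENHLTH_ORDER = ["Excellent", "Very good", "Good", "Fair", "Poor"]
GENHLTH_DTYPE = pd.CategoricalDtype(categories=GENHLTH_ORDER, ordered=True)


def encode(values):
    """Encode labels with the locked order; fail loudly on unknown labels."""
    cat = pd.Series(values).astype(GENHLTH_DTYPE)
    if cat.isna().any():
        raise ValueError("unknown category in input")
    return cat.cat.codes.tolist()


train_codes = encode(["Poor", "Good", "Excellent"])
ui_codes = encode(["Excellent", "Poor"])
print(train_codes, ui_codes)
```

Since the codes depend only on GENHLTH_ORDER, reordering or relabelling options in the UI cannot silently change what a saved model receives.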

Hypertension label gap across years

Issue: No consistent hypertension target in BRFSS 2020.

Fix: HTN trained on 2019 + 2021, with cross-year validation checks.

Cross-dataset comparability

Issue: CDC/WHO external features weren’t directly comparable to BRFSS definitions.

Fix: Stuck to BRFSS-native features for accuracy and comparability.

Model version & compatibility issues

Issue: XGBoost pickle load warnings, sklearn InconsistentVersionWarning.

Fix: Used save_model() for XGBoost, pinned dependency versions, rebuilt under consistent environment.
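The pinning half of this fix can be guarded at startup by comparing installed versions with the pinned ones. The sketch below uses only the standard library's importlib.metadata; check_pins is a hypothetical helper, and the version strings in the usage note are placeholders rather than the project's actual pins.

```python
from importlib.metadata import PackageNotFoundError, version


def check_pins(pins, get_version=version):
    """Return {package: (pinned, installed)} for every package that differs.

    `installed` is None when the package is not installed at all.
    """
    mismatches = {}
    for name, pinned in pins.items():
        try:
            installed = get_version(name)
        except PackageNotFoundError:
            installed = None
        if installed != pinned:
            mismatches[name] = (pinned, installed)
    return mismatches
```

For example, check_pins({"xgboost": "2.0.3"}) returns an empty dict only when exactly that version is installed. Failing fast on a mismatch surfaces the problem before a model file is loaded under a different library version.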

UI/UX cleanup

Issue: Raw category codes such as yes(1) appeared in the UI.

Fix: Applied rename_dict and separated display labels from model codes.
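Separating labels from codes can be as small as one lookup table per input, as sketched below. SMOKE100_DISPLAY and format_option are hypothetical names, and the contents of the project's rename_dict are not assumed here.

```python
# Model code -> text shown to the user (illustrative table).
SMOKE100_DISPLAY = {1: "Yes", 2: "No"}


def format_option(code, labels=SMOKE100_DISPLAY):
    """Return the human-readable label for a model code."""
    return labels[code]


options = list(SMOKE100_DISPLAY)                # what the model receives
shown = [format_option(c) for c in options]     # what the user reads
print(options, shown)
```

In Streamlit, passing the codes as options together with format_func=format_option to st.selectbox displays the labels while the widget returns the underlying code.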

Generalization across years

Issue: Performance dropped when predicting on a different BRFSS year.

Fix: Combined-year training for better distribution coverage, monitored calibration on holdout years.
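Holdout-year monitoring can be sketched as a leave-one-year-out split plus a calibration metric. The Brier score is used here as an assumption; the README does not say which metric the project tracks. Both function names are hypothetical.

```python
def leave_one_year_out(years):
    """Yield (held_out_year, train_indices, test_indices) for each survey year."""
    for held_out in sorted(set(years)):
        train_idx = [i for i, y in enumerate(years) if y != held_out]
        test_idx = [i for i, y in enumerate(years) if y == held_out]
        yield held_out, train_idx, test_idx


def brier_score(y_true, y_prob):
    """Mean squared gap between predicted probability and the 0/1 outcome."""
    return sum((p - t) ** 2 for t, p in zip(y_true, y_prob)) / len(y_true)


row_years = [2019, 2019, 2020, 2021]  # one entry per row, made-up sample
splits = list(leave_one_year_out(row_years))
print(splits[0], brier_score([1, 0], [0.8, 0.2]))
```

A lower Brier score on the held-out year indicates that the predicted percentages track observed outcomes more closely for data the model has not seen.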

Deployment friction

Issue: Local app worked; Streamlit Cloud failed due to missing files/dependency mismatches.

Fix: Added pre-deploy asset checks, pinned requirements.txt, and ensured all files were bundled.
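A pre-deploy asset check can be a few lines of pathlib, as in the sketch below. The function name and the file path models/diabetes.json are hypothetical examples, not the repository's actual asset list.

```python
import tempfile
from pathlib import Path


def missing_assets(root, required):
    """Return the required relative paths that are not files under root."""
    root = Path(root)
    return [rel for rel in required if not (root / rel).is_file()]


# Demonstration in a throwaway directory holding only requirements.txt.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "requirements.txt").write_text("streamlit\n")
    gaps = missing_assets(tmp, ["requirements.txt", "models/diabetes.json"])

print(gaps)
```

Running such a check before deployment, and again when the app starts, turns a missing file into an explicit list instead of a runtime failure on Streamlit Cloud.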

Tech Stack

Language: Python 3.10+

Libraries:

Data Processing: Pandas, NumPy

Modeling: XGBoost, scikit-learn

Interpretability: SHAP

UI: Streamlit

Deployment: Streamlit Cloud

