NBA Analytics Data Lake

Project Overview

Objective

Build an automated NBA analytics pipeline that:

  1. Ingests raw player data via API
  2. Stores it in a cloud-native data lake
  3. Enables SQL analytics without data movement
  4. Serves as a foundation for sports betting/analytics applications

Features

  • Serverless Infrastructure: Zero servers to manage (S3 + Glue + Athena)
  • Automated Schema Discovery: Auto-catalog JSON data with AWS Glue crawlers
  • Cost-Efficient Queries: $5/TB scanned via Amazon Athena
  • Scalable Storage: Handle 10,000+ player records with S3
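The "$5/TB scanned" figure above is the us-east-1 list price; Athena also bills a 10 MB minimum per query, so the cost of a query can be estimated with a small helper (a back-of-envelope sketch, not an official pricing calculator; check your region's rate):

```python
PRICE_PER_TB = 5.00  # USD per TB scanned (us-east-1 list price; varies by region)
TB = 1024 ** 4

def query_cost(bytes_scanned: int) -> float:
    """Estimated cost in USD for a single Athena query."""
    billed = max(bytes_scanned, 10 * 1024 ** 2)  # Athena bills a 10 MB minimum
    return billed / TB * PRICE_PER_TB
```

Scanning a few megabytes of raw player JSON therefore costs fractions of a cent, which is why converting to Parquet (see Future Enhancements) matters only as the data grows.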

Architecture

[Architecture diagram]

System Design

  1. Data Source: SportsDataIO API for NBA player statistics.
  2. Data Ingestion: Python script (boto3) for API integration and S3 uploads.
  3. Storage Layer: AWS S3 bucket for raw JSON data and query results.
  4. Metadata Catalog: AWS Glue for schema discovery and table creation.
  5. Query Layer: Amazon Athena for serverless SQL analytics.

Workflow

  1. Python script fetches NBA player data from SportsDataIO API.
  2. Raw JSON is uploaded to an S3 bucket (s3://<bucket>/raw-data/).
  3. AWS Glue crawler auto-discovers schema and creates metadata tables.
  4. Analysts run SQL queries directly on S3 data via Athena.
  [Workflow diagram]
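Steps 1-2 of the workflow can be sketched roughly as follows. This is an illustration, not the repository's actual setup_nba_data_lake.py: the bucket name is a placeholder, and it assumes SportsDataIO reads the API key from the Ocp-Apim-Subscription-Key header. Writing one JSON record per line is a common convention because Glue's JSON classifier handles newline-delimited records well:

```python
import json
import os
import urllib.request

def fetch_players(endpoint: str, api_key: str) -> list[dict]:
    """Fetch raw NBA player records from the SportsDataIO API."""
    req = urllib.request.Request(
        endpoint, headers={"Ocp-Apim-Subscription-Key": api_key}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def to_json_lines(records: list[dict]) -> str:
    """Serialize records as newline-delimited JSON for the Glue crawler."""
    return "\n".join(json.dumps(r) for r in records)

def upload_raw(bucket: str, body: str,
               key: str = "raw-data/nba_player_data.json") -> None:
    import boto3  # imported here so the pure helpers run without the AWS SDK
    boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=body.encode("utf-8"))

# Typical run (requires AWS credentials and the .env values described below):
#   data = fetch_players(os.environ["NBA_ENDPOINT"], os.environ["SPORTS_DATA_API_KEY"])
#   upload_raw("<bucket>", to_json_lines(data))
```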

Technologies Used

Category          Technologies
----------------  -----------------
Data Source       SportsDataIO API
Cloud Storage     AWS S3
Data Catalog      AWS Glue
Query Engine      Amazon Athena
Execution         AWS CloudShell
Automation        Python, Boto3 SDK
Environment Mgmt  Python-dotenv

Project Structure

nba-analytics-data-lake/  
β”œβ”€β”€ src/  
β”‚   β”œβ”€β”€ setup_nba_data_lake.py  # Infrastructure automation  
β”‚   └── delete_resources.py     # Cleanup script  
β”œβ”€β”€ .env                        # API credentials  
└── docs/                       # Architecture diagrams  

Prerequisites

  1. AWS Account with permissions to:
    • Create/delete S3 buckets
    • Manage Glue databases
    • Run Athena queries
  2. SportsDataIO API Key (Free Tier)

How to Setup

Create IAM Policy

  1. Log in to the AWS Management Console.
  2. Navigate to IAM: in the search bar, type IAM and select IAM from the results.
  3. Create a new policy:
  • In the IAM dashboard, click Policies in the left-hand menu.
  • Click the Create policy button.
  • On the Create policy page, select the JSON tab.
  • Copy the provided JSON policy and paste it into the editor.
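The README refers to "the provided JSON policy" without reproducing it. A minimal least-privilege sketch consistent with the Prerequisites section might look like the following; the wildcard resources are placeholders you should narrow to your own bucket and database ARNs:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DataLakeS3",
      "Effect": "Allow",
      "Action": ["s3:CreateBucket", "s3:DeleteBucket", "s3:PutObject",
                 "s3:GetObject", "s3:ListBucket"],
      "Resource": "*"
    },
    {
      "Sid": "GlueCatalog",
      "Effect": "Allow",
      "Action": ["glue:CreateDatabase", "glue:DeleteDatabase",
                 "glue:CreateTable", "glue:GetTable", "glue:GetDatabase"],
      "Resource": "*"
    },
    {
      "Sid": "AthenaQueries",
      "Effect": "Allow",
      "Action": ["athena:StartQueryExecution", "athena:GetQueryExecution",
                 "athena:GetQueryResults"],
      "Resource": "*"
    }
  ]
}
```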

Launch CloudShell

  1. Sign into AWS Console β†’ Click >_ (CloudShell icon)

Configure Environment

nano .env
  1. Paste:
SPORTS_DATA_API_KEY=your_actual_key_here
NBA_ENDPOINT=https://api.sportsdata.io/v3/nba/scores/json/Players
  2. Save (Ctrl+O, Enter) and exit (Ctrl+X).

Create Script File

nano setup_nba_data_lake.py
  1. Paste the script content.
  2. Save (Ctrl+O, Enter) and exit (Ctrl+X).

Deployment

Install Dependencies

pip install -r requirements.txt

Run Script

python3 setup_nba_data_lake.py

Successful Output:

S3 Bucket Created: s3://<bucket_name>  
Glue Database 'nba_analytics' Ready  
Athena Query Interface Activated!  
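The setup script itself is not reproduced in this README. A minimal sketch of its core, assuming it provisions the S3 bucket and the 'nba_analytics' Glue database reported in the output above (names and structure are illustrative, not the repository's actual code):

```python
import re
import uuid

def unique_bucket_name(prefix: str = "nba-analytics") -> str:
    """S3 bucket names are global, so append a random suffix to avoid the
    BucketAlreadyExists error (3-63 chars, lowercase, no underscores)."""
    name = f"{prefix}-{uuid.uuid4().hex[:12]}"
    assert re.fullmatch(r"[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]", name)
    return name

def provision(bucket: str, database: str = "nba_analytics") -> None:
    import boto3  # lazy import: the name helper above needs no AWS SDK
    boto3.client("s3").create_bucket(Bucket=bucket)
    boto3.client("glue").create_database(DatabaseInput={"Name": database})
```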

Query Demo

Run in Athena Query Editor:

-- Total points per team
SELECT Team, SUM(Points) AS TotalTeamPoints
FROM "glue-nba-data-lake"."nba_players"
GROUP BY Team
ORDER BY TotalTeamPoints DESC
LIMIT 5;

[Athena query results]

Run further queries to explore individual players and teams.
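Queries can also be run programmatically with boto3's Athena client instead of the console editor. A hedged sketch (database and output-location values are placeholders; Athena writes results to the S3 location you supply):

```python
import time

# The demo query above, as a constant so it can be reused programmatically.
TOP_TEAMS_SQL = """
SELECT Team, SUM(Points) AS TotalTeamPoints
FROM nba_players
GROUP BY Team
ORDER BY TotalTeamPoints DESC
LIMIT 5
"""

def run_athena_query(sql: str, database: str, output_s3: str) -> list[list[str]]:
    import boto3  # lazy import so the module loads without the AWS SDK
    athena = boto3.client("athena")
    qid = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_s3},
    )["QueryExecutionId"]
    while True:  # poll until the query reaches a terminal state
        state = athena.get_query_execution(
            QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)
    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query ended in state {state}")
    rows = athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]
    return [[c.get("VarCharValue", "") for c in r["Data"]] for r in rows]
```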

Validation

  1. Verify S3 Data:
    • Navigate to S3 β†’ Check raw-data/nba_player_data.json
  2. Check Glue Catalog:
    • AWS Glue β†’ Tables β†’ nba_players schema

Security Considerations

  • IAM Roles: Least privilege access for S3/Glue/Athena
  • API Key Protection: Stored in .env (not committed to Git)
  • Encryption: S3 server-side encryption enabled

Troubleshooting

Issue                 Resolution
--------------------  ----------------------------------------
BucketAlreadyExists   Use a globally unique bucket name
AccessDenied in Glue  Verify IAM permissions
No data in Athena     Wait 2-3 minutes after the Glue crawl

Future Enhancements

  1. Automated daily sync with EventBridge
  2. Data transformation to Parquet format
  3. Cost monitoring dashboard

Blog 🔗

To visit the blog, click here.

Contributing

  1. Fork the repository
  2. Submit PRs from a feature branch

License

MIT License - Full Text

About

A sports analytics data lake leveraging AWS S3 for storage, AWS Glue for data cataloging, and Amazon Athena for querying. Python scripts handle data ingestion and manage the infrastructure.
