NBA Analytics Data Lake

Project Overview

Objective

Build an automated NBA analytics pipeline that:

  1. Ingests raw player data via API
  2. Stores it in a cloud-native data lake
  3. Enables SQL analytics without data movement
  4. Serves as a foundation for sports betting/analytics applications

Features

  • Serverless Infrastructure: Zero servers to manage (S3 + Glue + Athena)
  • Automated Schema Discovery: Auto-catalog JSON data with AWS Glue crawlers
  • Cost-Efficient Queries: $5/TB scanned via Amazon Athena
  • Scalable Storage: Handle 10,000+ player records with S3
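The "$5/TB scanned" figure above is the us-east-1 list price; Athena also bills a 10 MB minimum per query, so the cost of a query can be estimated with a small helper (a back-of-envelope sketch, not an official pricing calculator; check your region's rate):

```python
PRICE_PER_TB = 5.00  # USD per TB scanned (us-east-1 list price; varies by region)
TB = 1024 ** 4

def query_cost(bytes_scanned: int) -> float:
    """Estimated cost in USD for a single Athena query."""
    billed = max(bytes_scanned, 10 * 1024 ** 2)  # Athena bills a 10 MB minimum
    return billed / TB * PRICE_PER_TB
```

Scanning a few megabytes of raw player JSON therefore costs fractions of a cent, which is why converting to Parquet (see Future Enhancements) matters only as the data grows.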

Architecture

[Architecture diagram]

System Design

  1. Data Source: SportsDataIO API for NBA player statistics.
  2. Data Ingestion: Python script (boto3) for API integration and S3 uploads.
  3. Storage Layer: AWS S3 bucket for raw JSON data and query results.
  4. Metadata Catalog: AWS Glue for schema discovery and table creation.
  5. Query Layer: Amazon Athena for serverless SQL analytics.

Workflow

  1. Python script fetches NBA player data from SportsDataIO API.
  2. Raw JSON is uploaded to an S3 bucket (s3://<bucket>/raw-data/).
  3. AWS Glue crawler auto-discovers schema and creates metadata tables.
  4. Analysts run SQL queries directly on S3 data via Athena.
  [Workflow diagram]
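Steps 1-2 of the workflow can be sketched roughly as follows. This is an illustration, not the repository's actual setup_nba_data_lake.py: the bucket name is a placeholder, and it assumes SportsDataIO reads the API key from the Ocp-Apim-Subscription-Key header. Writing one JSON record per line is a common convention because Glue's JSON classifier handles newline-delimited records well:

```python
import json
import os
import urllib.request

def fetch_players(endpoint: str, api_key: str) -> list[dict]:
    """Fetch raw NBA player records from the SportsDataIO API."""
    req = urllib.request.Request(
        endpoint, headers={"Ocp-Apim-Subscription-Key": api_key}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def to_json_lines(records: list[dict]) -> str:
    """Serialize records as newline-delimited JSON for the Glue crawler."""
    return "\n".join(json.dumps(r) for r in records)

def upload_raw(bucket: str, body: str,
               key: str = "raw-data/nba_player_data.json") -> None:
    import boto3  # imported here so the pure helpers run without the AWS SDK
    boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=body.encode("utf-8"))

# Typical run (requires AWS credentials and the .env values described below):
#   data = fetch_players(os.environ["NBA_ENDPOINT"], os.environ["SPORTS_DATA_API_KEY"])
#   upload_raw("<bucket>", to_json_lines(data))
```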

Technologies Used

Category          Technologies
----------------  -----------------
Data Source       SportsDataIO API
Cloud Storage     AWS S3
Data Catalog      AWS Glue
Query Engine      Amazon Athena
Execution         AWS CloudShell
Automation        Python, Boto3 SDK
Environment Mgmt  Python-dotenv

Project Structure

nba-analytics-data-lake/  
β”œβ”€β”€ src/  
β”‚   β”œβ”€β”€ setup_nba_data_lake.py  # Infrastructure automation  
β”‚   └── delete_resources.py     # Cleanup script  
β”œβ”€β”€ .env                        # API credentials  
└── docs/                       # Architecture diagrams  

Prerequisites

  1. AWS Account with permissions to:
    • Create/delete S3 buckets
    • Manage Glue databases
    • Run Athena queries
  2. SportsDataIO API Key (Free Tier)

How to Setup

Create IAM Policy

  1. Log in to the AWS Management Console.
  2. Navigate to IAM: in the search bar, type IAM and select IAM from the results.
  3. Create a new policy:
  • In the IAM dashboard, click Policies in the left-hand menu.
  • Click the Create policy button.
  • On the Create policy page, select the JSON tab.
  • Copy the provided JSON policy and paste it into the editor.
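The README refers to "the provided JSON policy" without reproducing it. A minimal least-privilege sketch consistent with the Prerequisites section might look like the following; the wildcard resources are placeholders you should narrow to your own bucket and database ARNs:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DataLakeS3",
      "Effect": "Allow",
      "Action": ["s3:CreateBucket", "s3:DeleteBucket", "s3:PutObject",
                 "s3:GetObject", "s3:ListBucket"],
      "Resource": "*"
    },
    {
      "Sid": "GlueCatalog",
      "Effect": "Allow",
      "Action": ["glue:CreateDatabase", "glue:DeleteDatabase",
                 "glue:CreateTable", "glue:GetTable", "glue:GetDatabase"],
      "Resource": "*"
    },
    {
      "Sid": "AthenaQueries",
      "Effect": "Allow",
      "Action": ["athena:StartQueryExecution", "athena:GetQueryExecution",
                 "athena:GetQueryResults"],
      "Resource": "*"
    }
  ]
}
```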

Launch CloudShell

  1. Sign into AWS Console β†’ Click >_ (CloudShell icon)

Configure Environment

nano .env
  1. Paste:
SPORTS_DATA_API_KEY=your_actual_key_here
NBA_ENDPOINT=https://api.sportsdata.io/v3/nba/scores/json/Players
  2. Save (Ctrl+O, Enter) and exit (Ctrl+X).

Create Script File

nano setup_nba_data_lake.py
  1. Paste the script content.
  2. Save (Ctrl+O, Enter) and exit (Ctrl+X).

Deployment

Install Dependencies

pip install -r requirements.txt

Run Script

python3 setup_nba_data_lake.py

Successful Output:

S3 Bucket Created: s3://<bucket_name>  
Glue Database 'nba_analytics' Ready  
Athena Query Interface Activated!  
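The setup script itself is not reproduced in this README. A minimal sketch of its core, assuming it provisions the S3 bucket and the 'nba_analytics' Glue database reported in the output above (names and structure are illustrative, not the repository's actual code):

```python
import re
import uuid

def unique_bucket_name(prefix: str = "nba-analytics") -> str:
    """S3 bucket names are global, so append a random suffix to avoid the
    BucketAlreadyExists error (3-63 chars, lowercase, no underscores)."""
    name = f"{prefix}-{uuid.uuid4().hex[:12]}"
    assert re.fullmatch(r"[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]", name)
    return name

def provision(bucket: str, database: str = "nba_analytics") -> None:
    import boto3  # lazy import: the name helper above needs no AWS SDK
    boto3.client("s3").create_bucket(Bucket=bucket)
    boto3.client("glue").create_database(DatabaseInput={"Name": database})
```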

Query Demo

Run in Athena Query Editor:

-- Total points per team
SELECT Team, SUM(Points) AS TotalTeamPoints
FROM "glue-nba-data-lake"."nba_players"
GROUP BY Team
ORDER BY TotalTeamPoints DESC
LIMIT 5;

[Athena query results]

Run further queries to explore individual players and teams.
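Queries can also be run programmatically with boto3's Athena client instead of the console editor. A hedged sketch (database and output-location values are placeholders; Athena writes results to the S3 location you supply):

```python
import time

# The demo query above, as a constant so it can be reused programmatically.
TOP_TEAMS_SQL = """
SELECT Team, SUM(Points) AS TotalTeamPoints
FROM nba_players
GROUP BY Team
ORDER BY TotalTeamPoints DESC
LIMIT 5
"""

def run_athena_query(sql: str, database: str, output_s3: str) -> list[list[str]]:
    import boto3  # lazy import so the module loads without the AWS SDK
    athena = boto3.client("athena")
    qid = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_s3},
    )["QueryExecutionId"]
    while True:  # poll until the query reaches a terminal state
        state = athena.get_query_execution(
            QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)
    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query ended in state {state}")
    rows = athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]
    return [[c.get("VarCharValue", "") for c in r["Data"]] for r in rows]
```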

Validation

  1. Verify S3 Data:
    • Navigate to S3 β†’ Check raw-data/nba_player_data.json
  2. Check Glue Catalog:
    • AWS Glue β†’ Tables β†’ nba_players schema

Security Considerations

  • IAM Roles: Least privilege access for S3/Glue/Athena
  • API Key Protection: Stored in .env (not committed to Git)
  • Encryption: S3 server-side encryption enabled

Troubleshooting

Issue                 Resolution
--------------------  ----------------------------------------
BucketAlreadyExists   Use a globally unique bucket name
AccessDenied in Glue  Verify IAM permissions
No data in Athena     Wait 2-3 minutes after the Glue crawl

Future Enhancements

  1. Automated daily sync with EventBridge
  2. Data transformation to Parquet format
  3. Cost monitoring dashboard

Blog 🔗

To visit the blog, click here.

Contributing

  1. Fork the repository
  2. Submit PRs from a feature branch

License

MIT License - Full Text

About

A sports analytics data lake leveraging AWS S3 for storage, AWS Glue for data cataloging, and Amazon Athena for querying. Python scripts handle data ingestion and manage the infrastructure.
