Skip to content

NithinRams/automated_data_cleaning_project_sql

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

5 Commits
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

Automated Data Cleaning in SQL

πŸ“˜ Project Overview This project demonstrates how to automate a full data cleaning pipeline using Stored Procedures, Triggers, and Events. The goal is to simulate a real-world scenario where incoming raw data is automatically cleaned, standardized, and stored in a separate table β€” ensuring data quality, consistency, and automation without manual intervention.

βš™οΈ Technologies & Skills Used β€’ SQL β€’ Stored Procedures β€’ Events β€’ Triggers β€’ Data Cleaning & Transformation β€’ Timestamp tracking for data lineage

🧩 Project Structure β€’ US_HOUSEHOLD_INCOME β€” Source raw dataset containing U.S. household income data. β€’ US_HOUSEHOLD_INCOME_CLEANED β€” Destination table to store cleaned and standardized records. β€’ Stored Procedure β€” Main logic that copies raw data, cleans it, and records timestamps. β€’ Event β€” Automates execution of the procedure every 2 minutes. β€’ Trigger β€” Automatically calls the cleaning procedure whenever new data is inserted.

🧠 Key Concepts Implemented Stored Procedure: Copy_and_Clean_Data() Creates or replaces the cleaned table, copies all data from the raw table with a timestamp, removes duplicates, and standardizes column values.

⏰ Event Scheduler Runs automatically every 30 DAYS, triggering the stored procedure:

CREATE EVENT run_data_cleaning ON SCHEDULE EVERY 30 DAYS DO CALL PRACTICE_DATABASE.PUBLIC.Copy_and_Clean_Data();

πŸ”„ Trigger Executes the cleaning process immediately after a new record is inserted:

CREATE TRIGGER transfer_clean_data AFTER INSERT ON PRACTICE_DATABASE.PUBLIC.US_HOUSEHOLD_INCOME FOR EACH ROW BEGIN CALL PRACTICE_DATABASE.PUBLIC.Copy_and_Clean_Data(); END;

🧹 Data Cleaning Highlights β€’ Duplicate Removal β€” Uses ROW_NUMBER() to identify and delete duplicates by ID. β€’ Text Standardization β€” Ensures consistent capitalization and naming conventions. β€’ Error Correction β€” Fixes known typos and category mismatches. β€’ Timestamp Tracking β€” Adds a timestamp for each cleaning run for monitoring and data versioning.

πŸ—‚οΈ File Summary πŸ“¦ automated-data-cleaning/

┣ πŸ“œ automated_cleaning.sql
┣ πŸ“„ README.md
β”— πŸ“Š sample_data.csv

πŸš€ How to Run the Project

  1. Create the Raw Table
  2. Execute the Script
  3. Verify Cleaned Data
  4. Observe Automation (event + trigger)

🧭 Key Learnings βœ… How to design modular SQL automation pipelines βœ… How to implement self-maintaining data cleaning systems βœ… How to integrate Snowflake-specific features (procedures, events, triggers) βœ… Improved understanding of data quality management in real-time pipelines.

πŸ‘¨β€πŸ’» Author Nithin Ramayanam Data Analyst | Aspiring Data Scientist Experience: SQL, Python, Tableau, Databricks, Snowflake AWS, Azure

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors