π Project Overview This project demonstrates how to automate a full data cleaning pipeline using Stored Procedures, Triggers, and Events. The goal is to simulate a real-world scenario where incoming raw data is automatically cleaned, standardized, and stored in a separate table β ensuring data quality, consistency, and automation without manual intervention.
βοΈ Technologies & Skills Used β’ SQL β’ Stored Procedures β’ Events β’ Triggers β’ Data Cleaning & Transformation β’ Timestamp tracking for data lineage
π§© Project Structure β’ US_HOUSEHOLD_INCOME β Source raw dataset containing U.S. household income data. β’ US_HOUSEHOLD_INCOME_CLEANED β Destination table to store cleaned and standardized records. β’ Stored Procedure β Main logic that copies raw data, cleans it, and records timestamps. β’ Event β Automates execution of the procedure every 2 minutes. β’ Trigger β Automatically calls the cleaning procedure whenever new data is inserted.
π§ Key Concepts Implemented Stored Procedure: Copy_and_Clean_Data() Creates or replaces the cleaned table, copies all data from the raw table with a timestamp, removes duplicates, and standardizes column values.
β° Event Scheduler Runs automatically every 30 DAYS, triggering the stored procedure:
CREATE EVENT run_data_cleaning ON SCHEDULE EVERY 30 DAYS DO CALL PRACTICE_DATABASE.PUBLIC.Copy_and_Clean_Data();
π Trigger Executes the cleaning process immediately after a new record is inserted:
CREATE TRIGGER transfer_clean_data AFTER INSERT ON PRACTICE_DATABASE.PUBLIC.US_HOUSEHOLD_INCOME FOR EACH ROW BEGIN CALL PRACTICE_DATABASE.PUBLIC.Copy_and_Clean_Data(); END;
π§Ή Data Cleaning Highlights β’ Duplicate Removal β Uses ROW_NUMBER() to identify and delete duplicates by ID. β’ Text Standardization β Ensures consistent capitalization and naming conventions. β’ Error Correction β Fixes known typos and category mismatches. β’ Timestamp Tracking β Adds a timestamp for each cleaning run for monitoring and data versioning.
ποΈ File Summary π¦ automated-data-cleaning/
β£ π automated_cleaning.sql
β£ π README.md
β π sample_data.csv
π How to Run the Project
- Create the Raw Table
- Execute the Script
- Verify Cleaned Data
- Observe Automation (event + trigger)
π§ Key Learnings β How to design modular SQL automation pipelines β How to implement self-maintaining data cleaning systems β How to integrate Snowflake-specific features (procedures, events, triggers) β Improved understanding of data quality management in real-time pipelines.
π¨βπ» Author Nithin Ramayanam Data Analyst | Aspiring Data Scientist Experience: SQL, Python, Tableau, Databricks, Snowflake AWS, Azure