Data Engineer and AI Data Engineer based in Dallas, TX
I build production-grade data systems where AI meets data engineering. My work spans real-time streaming pipelines, RAG systems, agent-orchestrated observability, analytics engineering, and cloud data platforms on AWS.
Currently at Fifth Third Bank, building ELT platforms on AWS and Snowflake that process 2B+ daily transactions.
| Area | What I Work On |
|---|---|
| AI Data Engineering | RAG pipelines, LLM agents, tool calling, hallucination detection |
| Data Engineering | Kafka streaming, ETL/ELT, cloud pipelines, AWS S3 and EC2 |
| Analytics Engineering | dbt Core, dimensional modeling, KPI marts, data contracts |
| Data Quality | Observability pipelines, validation frameworks, health scoring |
| BI and Reporting | Power BI, DAX, KPI dashboards, executive reporting |
Natural-language question answering over real SEC 10-K filings from JPMorgan, Goldman Sachs, and Apple. Deployed live on AWS EC2, with source citations and hallucination detection on every answer.
Stack: Python, AWS S3, AWS EC2, ChromaDB, sentence-transformers, Claude API
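To illustrate the retrieval-with-citations idea, here is a minimal sketch rather than the project's code. It assumes a toy bag-of-words similarity in place of sentence-transformers embeddings and a plain list in place of ChromaDB. The function names, source labels, and chunk text are all hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words vector (assumption: stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question, chunks, k=2):
    """Return up to k matching chunks, each carrying its source label for citation."""
    q = embed(question)
    scored = [(cosine(q, embed(c["text"])), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [
        {"source": c["source"], "text": c["text"], "score": round(s, 3)}
        for s, c in scored[:k]
        if s > 0  # drop chunks with no overlap so they are never cited
    ]


# Hypothetical chunks for illustration only; not real filing text.
CHUNKS = [
    {"source": "filing_a.txt#chunk-1", "text": "Net revenue increased due to higher trading activity."},
    {"source": "filing_b.txt#chunk-7", "text": "Risk factors include changes in interest rates."},
    {"source": "filing_c.txt#chunk-3", "text": "Research and development expense grew year over year."},
]

hits = retrieve("What are the interest rate risk factors?", CHUNKS, k=1)
for hit in hits:
    print(hit["source"], hit["score"])
```

Keeping the source label attached to every retrieved chunk is what lets a later answer-generation step cite where each statement came from.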
Agent-orchestrated pipeline that uses Claude API tool calling to autonomously profile and validate data, explain anomalies in plain English, and generate HTML quality reports. Tracks a data health score over time in DuckDB.
Stack: Python, Claude API (Haiku), DuckDB, Great Expectations, matplotlib
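One way a data health score could be computed is sketched below. The scoring rule (percentage of row-level checks that pass) is an assumption, not the project's actual formula. The history is kept in a plain list here, where the project persists it in DuckDB. All names and rows are hypothetical.

```python
def health_score(rows, checks):
    """Percentage of (row, check) pairs that pass. Assumed scoring rule."""
    results = [check(row) for row in rows for check in checks]
    return round(100.0 * sum(results) / len(results), 1) if results else 100.0


# Hypothetical validation checks on an 'amount' column.
CHECKS = [
    lambda row: row["amount"] is not None,                        # completeness
    lambda row: row["amount"] is not None and row["amount"] >= 0,  # validity
]

# Hypothetical batch of rows.
rows = [
    {"id": 1, "amount": 10.0},
    {"id": 2, "amount": None},
    {"id": 3, "amount": -5.0},
    {"id": 4, "amount": 20.0},
]

history = []  # stand-in for a persisted score table
score = health_score(rows, CHECKS)
history.append({"run": len(history) + 1, "score": score})
print(history)
```

Appending one record per run gives a simple time series that can be charted to show whether data quality is improving or degrading.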
End-to-end streaming pipeline ingesting retail transaction events through Confluent Cloud Kafka, transforming in Python, storing results in SQLite, and visualizing live KPIs on a Streamlit dashboard that auto-refreshes every 5 seconds.
Stack: Python, Confluent Cloud Kafka, SQLite, Streamlit
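The transform-and-store step of such a pipeline can be sketched as follows. This is an illustration under assumptions: a Python list stands in for messages polled from Kafka, and the event fields (`order_id`, `store`, `qty`, `unit_price`) and table layout are hypothetical.

```python
import json
import sqlite3


def transform(raw):
    """Parse one JSON event and derive revenue for the row to store."""
    event = json.loads(raw)
    revenue = round(event["qty"] * event["unit_price"], 2)
    return (event["order_id"], event["store"], revenue)


def ingest(conn, messages):
    """Write a batch of events; keyed on order_id so redelivery does not double count."""
    rows = [transform(m) for m in messages]
    with conn:  # commit the batch as one transaction
        conn.executemany("INSERT OR REPLACE INTO sales VALUES (?, ?, ?)", rows)


def kpis(conn):
    """Per-store order count and revenue, as a dashboard query might read them."""
    query = (
        "SELECT store, COUNT(*), ROUND(SUM(revenue), 2) "
        "FROM sales GROUP BY store ORDER BY store"
    )
    return conn.execute(query).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (order_id TEXT PRIMARY KEY, store TEXT, revenue REAL)")

# Hypothetical events; the third is a redelivered duplicate of the first.
messages = [
    '{"order_id": "o1", "store": "north", "qty": 2, "unit_price": 9.5}',
    '{"order_id": "o2", "store": "south", "qty": 1, "unit_price": 4.0}',
    '{"order_id": "o1", "store": "north", "qty": 2, "unit_price": 9.5}',
    '{"order_id": "o3", "store": "north", "qty": 3, "unit_price": 1.0}',
]
ingest(conn, messages)
print(kpis(conn))
```

Using the order ID as the primary key makes the write idempotent, which matters because streaming consumers commonly see the same message more than once.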
Production analytics engineering pipeline with dbt Core, star schema dimensional modeling, KPI definitions, dbt tests, and documentation. Built on a 100k+ row ecommerce dataset.
Stack: dbt Core, DuckDB, SQL, dimensional modeling
LLM-powered benchmarking system comparing the Claude API against a rule-based baseline on 2,000 financial transactions. The LLM achieved 95.3% accuracy vs 58.7% for the baseline, measured with a structured evaluation framework covering per-category precision, recall, F1, confidence scoring, and cost analysis at $0.0003 per transaction.
Stack: Python, Claude API, pandas, scikit-learn, evaluation framework
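For readers unfamiliar with per-category metrics, the sketch below computes precision, recall, and F1 per label by hand. It is dependency-free for illustration; the project itself lists scikit-learn, which provides equivalent metrics. The labels and predictions are made-up toy data, not the benchmark's results.

```python
def per_category_metrics(y_true, y_pred):
    """Precision, recall, and F1 for each label, treating it as one-vs-rest."""
    metrics = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        metrics[label] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Toy labels for illustration only.
y_true = ["grocery", "grocery", "travel", "travel", "dining", "dining"]
y_pred = ["grocery", "dining", "travel", "travel", "dining", "dining"]

metrics = per_category_metrics(y_true, y_pred)
for label, m in metrics.items():
    print(label, {k: round(v, 3) for k, v in m.items()})
print("accuracy", round(accuracy(y_true, y_pred), 3))
```

Per-category numbers expose weaknesses that a single accuracy figure hides: in this toy data, overall accuracy is high while recall on one category is only 0.5.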
End-to-end margin analysis pipeline identifying revenue leakage across 9,994 transactions using SQL, Python, and DuckDB. Includes a Tableau dashboard with KPI drill-down.
Stack: Python, SQL, DuckDB, Tableau
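A margin roll-up of this kind might look like the sketch below. It assumes that "leakage" means transactions whose profit falls below a floor, which is an assumption rather than the project's stated definition. The field names and rows are hypothetical.

```python
def margin_report(transactions, floor=0.0):
    """Per-category margin percentage and count of transactions below the profit floor."""
    totals = {}
    for t in transactions:
        agg = totals.setdefault(t["category"], {"sales": 0.0, "profit": 0.0, "leaking": 0})
        agg["sales"] += t["sales"]
        agg["profit"] += t["profit"]
        if t["profit"] < floor:  # assumed leakage rule
            agg["leaking"] += 1
    return {
        category: {
            "margin_pct": round(100 * a["profit"] / a["sales"], 1) if a["sales"] else 0.0,
            "leaking": a["leaking"],
        }
        for category, a in totals.items()
    }


# Hypothetical transactions for illustration only.
transactions = [
    {"category": "furniture", "sales": 100.0, "profit": -10.0},
    {"category": "furniture", "sales": 200.0, "profit": 40.0},
    {"category": "office", "sales": 50.0, "profit": 20.0},
]

report = margin_report(transactions)
print(report)
```

Reporting the count of loss-making transactions alongside the aggregate margin helps separate a category that is uniformly thin from one dragged down by a few bad orders.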
Languages: Python, SQL, Scala, Bash, DAX
Data Engineering: Apache Spark, Kafka, Airflow, dbt Core, AWS Glue, Azure Data Factory, Databricks, Fivetran, Delta Lake
Cloud: AWS (S3, EC2, Glue, Redshift, EMR, Lambda), Azure (ADF, Synapse, ADLS), GCP (BigQuery)
Databases: Snowflake, PostgreSQL, DuckDB, SQL Server, Redshift, Hive, MongoDB
AI and LLMs: Claude API, RAG pipelines, LLM agents, tool calling, ChromaDB, sentence-transformers, prompt engineering, LLM evaluation
BI: Power BI, Tableau, Looker, Streamlit, DAX, SSRS
DevOps: Docker, Kubernetes, Terraform, GitHub Actions, CI/CD
Banking and Financial Services, Retail and eCommerce, Healthcare, Insurance, HR Tech, Supply Chain, Procurement
Email: buildwithtarun@gmail.com
LinkedIn: linkedin.com/in/tarun-b-k
Location: Dallas, TX
Open to: Data Engineer, AI Data Engineer, Analytics Engineer, Senior Data Engineer roles