A quick overview of every problem in this repo. Use the Category and Topics columns to filter by what you want to practice. Each row links to the problem statement and the reference solution.
This file is generated by `scripts/build_index.py` from the frontmatter in each problem's `question.md`. Do not edit by hand.
| # | Problem | Category | Topics | Difficulty | Question | Solution |
|---|---|---|---|---|---|---|
| 1 | Log File Error Analysis | Logs and Monitoring | file streaming, counters, top-N, IoT logs | Easy | Question | Solution |
| 2 | Rolling Average of Sensor Readings | Streaming | rolling window, deque, IoT sensors, real-time | Easy | Question | Solution |
| 3 | Transform and Clean Raw Data for Analytics | Data Cleaning | CSV, validation, regex, date checks | Medium | Question | Solution |
| 4 | Schema Evolution and Validation for Streaming Events | Schema Validation | JSON, schema evolution, type coercion, pydantic | Medium | Question | Solution |
| 5 | Merging Messy CSVs from Multiple Partners | Data Integration | CSV, column mapping, date parsing, file walk | Medium | Question | Solution |
| 6 | Partitioning vs Clustering in BigQuery | Fundamentals | BigQuery, partitioning, clustering, cost | Easy | Question | Solution |
| 7 | ETL vs ELT and Why ELT Won | Fundamentals | ETL, ELT, dbt, warehouse | Easy | Question | Solution |
| 8 | OLTP vs OLAP | Fundamentals | OLTP, OLAP, column store, row store | Easy | Question | Solution |
| 9 | Idempotency in Data Pipelines | Fundamentals | idempotency, retries, MERGE, partitions | Medium | Question | Solution |
| 10 | Slowly Changing Dimensions | Fundamentals | SCD, dimensions, history, dbt snapshot | Medium | Question | Solution |
| 11 | Data Contracts in Plain Words | Fundamentals | data contracts, schema registry, ownership | Medium | Question | Solution |
| 12 | Parquet vs CSV vs JSON | Fundamentals | Parquet, CSV, JSON, columnar storage | Easy | Question | Solution |
| 13 | Data Lake vs Warehouse vs Lakehouse | Fundamentals | lake, warehouse, lakehouse, Iceberg, Delta | Medium | Question | Solution |
| 14 | Exactly Once Delivery | Fundamentals | exactly once, idempotency, Kafka, streaming | Medium | Question | Solution |
| 15 | Teaching SQL Performance to a Junior | SQL Thinking | EXPLAIN, performance, mentoring, optimization | Medium | Question | Solution |
| 16 | SELECT DISTINCT Hiding Join Bugs | SQL Thinking | DISTINCT, joins, grain, semi-join | Medium | Question | Solution |
| 17 | Reading an EXPLAIN Plan | SQL Thinking | EXPLAIN, query plan, joins, sort spill | Medium | Question | Solution |
| 18 | CTE vs Subquery | SQL Thinking | CTE, subquery, materialization, recursion | Medium | Question | Solution |
| 19 | Same Query Different Answers | SQL Thinking | time zones, RLS, session settings, debugging | Medium | Question | Solution |
| 20 | Window Functions vs GROUP BY | SQL Thinking | window functions, GROUP BY, running totals, ranking | Medium | Question | Solution |
| 21 | Data Platform for an Electricity Retailer | System Design | smart meter, IoT, warehouse, batch | Hard | Question | Solution |
| 22 | Banking App Monthly Spending Widget | System Design | streaming, CDC, serving store, low latency | Hard | Question | Solution |
| 23 | Ride Hailing Surge Pricing | System Design | streaming, H3, real-time, pricing | Hard | Question | Solution |
| 24 | Spotify Minutes Listened This Week | System Design | streaming aggregation, KV store, watermarks | Hard | Question | Solution |
| 25 | Smart Meter to Monthly Bill PDF | System Design | billing, SCD2, idempotency, audit | Hard | Question | Solution |
| 26 | Delivery Idle Driver Tracking | System Design | streaming, H3, TTL, geospatial | Hard | Question | Solution |
| 27 | Year in Review Recap | System Design | batch, KV store, CDN, image render | Medium | Question | Solution |
| 28 | Low Balance Notification Pipeline | System Design | batch, idempotency, time zones, notifications | Medium | Question | Solution |
| 29 | Daily Report Quietly Wrong for Two Weeks | Scenarios | incident, postmortem, comms, data quality | Medium | Question | Solution |
| 30 | Warehouse Cost Doubled in Two Months | Scenarios | cost, governance, comms, INFORMATION_SCHEMA | Medium | Question | Solution |
| 31 | The Dashboard is Wrong | Scenarios | trust, comms, vague reports | Easy | Question | Solution |
| 32 | Inheriting a Pipeline No One Owns | Scenarios | ownership, judgement, rewrite-or-not | Medium | Question | Solution |
| 33 | Executive Needs a Number Tomorrow | Scenarios | comms, exec, caveats, prioritization | Medium | Question | Solution |
| 34 | Three Days of Data Lost | Scenarios | Kafka retention, replay, recovery, postmortem | Hard | Question | Solution |
| 35 | Lambda vs Cloud Function vs Cloud Run | Cloud Decisions | serverless, AWS, GCP, runtime limits | Medium | Question | Solution |
| 36 | Scheduled Pipeline Pay Only When Run | Cloud Decisions | scheduled jobs, Cloud Run Jobs, AWS Batch | Easy | Question | Solution |
| 37 | BigQuery vs Snowflake for New Team | Cloud Decisions | BigQuery, Snowflake, pricing model | Medium | Question | Solution |
| 38 | Store Partner Files in S3 or Warehouse | Cloud Decisions | S3, raw layer, audit, schema evolution | Easy | Question | Solution |
| 39 | Managed Airflow vs Self Hosted | Cloud Decisions | Airflow, MWAA, Composer, Astronomer, Dagster | Medium | Question | Solution |
| 40 | BigQuery Access Control for 50 Person Company | Cloud Decisions | IAM, datasets, groups, RLS, audit | Medium | Question | Solution |
| 41 | Tables for an Airbnb Like App | Data Modeling | star schema, SCD2, multi-currency, reviews | Medium | Question | Solution |
| 42 | Tracking Subscription Plan History | Data Modeling | history, valid_from/to, billing, SCD2 | Medium | Question | Solution |
| 43 | Mixing Facts and Dimensions | Data Modeling | star schema, SCD2, views, history | Medium | Question | Solution |
| 44 | Explaining Fact Table Grain | Data Modeling | grain, facts, dimensions, aggregations | Easy | Question | Solution |
| 45 | Current State and Full History | Data Modeling | event sourcing, projections, MV, audit | Medium | Question | Solution |
| 46 | Region Suddenly Shows Zero Revenue | Debugging | dashboard, joins, SCD, time zones | Medium | Question | Solution |
| 47 | Airflow Green but Output Empty | Debugging | silent success, idempotency, anomaly checks | Medium | Question | Solution |
| 48 | Query Suddenly 80x Slower | Debugging | EXPLAIN, statistics, plan flip, join strategy | Medium | Question | Solution |
| 49 | User Says Data Is Wrong | Debugging | comms, vague reports, triage | Easy | Question | Solution |
| 50 | Partition Always Ten Percent Smaller | Debugging | anomaly, baselines, patterns, judgement | Medium | Question | Solution |
| 51 | BigQuery Bill Eight Times Higher | Cost & Performance | INFORMATION_SCHEMA, top queries, slot reservation | Medium | Question | Solution |
| 52 | Four Hour Spark Job Under One Hour | Cost & Performance | Spark UI, skew, AQE, broadcast joins | Medium | Question | Solution |
| 53 | Hourly Scan on Daily Data | Cost & Performance | summary tables, MV, refresh, BI tool | Easy | Question | Solution |
| 54 | Just Throw More Memory At It | Cost & Performance | upsize, plan inspection, optimization | Medium | Question | Solution |
| 55 | Partitioning Clustering Materialized Views | Cost & Performance | partitioning, clustering, MV, BigQuery | Easy | Question | Solution |
| 56 | Watermarks in Plain Words | Streaming | watermarks, event time, allowed lateness | Medium | Question | Solution |
| 57 | Kafka Ordering Guarantee | Streaming | Kafka, partition key, ordering, idempotent producer | Medium | Question | Solution |
| 58 | Streaming Consumer Lag Diagnosis | Streaming | lag, back-pressure, skew, Flink UI | Medium | Question | Solution |
| 59 | Onboarding a New Analyst | People & Process | onboarding, mentoring, pairing | Easy | Question | Solution |
| 60 | Metric by Tomorrow vs Doing It Right | People & Process | comms, prioritization, metrics | Easy | Question | Solution |
| 61 | Two Teams Disagree on Active User | People & Process | metric ownership, comms, metrics layer | Medium | Question | Solution |
| 62 | Postmortem After a Bad Day | People & Process | postmortem, blameless, action items | Medium | Question | Solution |
| 63 | Inherited Pipeline No Docs No Tests | People & Process | ownership, docs, tests, expectations | Medium | Question | Solution |
| 64 | Breaking Change in dbt Model 200 Consumers | People & Process | dbt, deprecation, comms, rollout | Medium | Question | Solution |
| 65 | 4000 DAG Airflow at 90 Percent CPU | People & Process | Airflow, scheduler, parsing, scale-out | Medium | Question | Solution |
| 66 | Indexes When to Add and When They Hurt | Databases | indexes, B-tree, write cost, EXPLAIN | Easy | Question | Solution |
| 67 | Transactions and ACID | Databases | transactions, ACID, durability, atomicity | Easy | Question | Solution |
| 68 | Isolation Levels in Plain Words | Databases | isolation, snapshot, anomalies, MVCC | Medium | Question | Solution |
| 69 | Normalization and When to Denormalize | Databases | normalization, 3NF, denormalization, star schema | Medium | Question | Solution |
| 70 | B-Tree vs Hash vs LSM Tree | Databases | B-tree, hash, LSM, storage engines | Medium | Question | Solution |
| 71 | Read Replicas and Replication Lag | Databases | replicas, replication lag, read after write | Medium | Question | Solution |
| 72 | Sharding and Picking a Shard Key | Databases | sharding, shard key, hot shards, hash | Hard | Question | Solution |
| 73 | Database Connection Pooling | Databases | connection pool, PgBouncer, sizing, Postgres | Medium | Question | Solution |
| 74 | Deadlocks and Lock Escalation | Databases | deadlocks, locks, retries, lock escalation | Medium | Question | Solution |
| 75 | SQL vs NoSQL | Databases | SQL, NoSQL, KV, document, wide column, graph | Medium | Question | Solution |
| Category | What you practice |
|---|---|
| Logs and Monitoring | Parsing and analyzing large log files, counting events, and ranking the most frequent ones |
| Streaming | Continuous data, rolling stats, watermarks, ordering, lag |
| Data Cleaning | Validating, normalizing and rejecting bad rows from raw files |
| Schema Validation | Handling evolving JSON or Avro schemas without breaking consumers |
| Data Integration | Combining data from many sources with different shapes and conventions |
| Fundamentals | Core concepts every data engineer should be able to explain plainly |
| SQL Thinking | Writing, reading and reasoning about SQL like a senior engineer |
| System Design | End-to-end pipelines for real consumer and energy-sector products |
| Scenarios | Tricky real-life situations that test judgement and communication |
| Cloud Decisions | Choosing between AWS and GCP services with clear trade-offs |
| Data Modeling | Star schemas, history tracking, grain, dimensions |
| Debugging | Step-by-step investigation of "the number is wrong" style problems |
| Cost & Performance | Finding waste in queries, jobs, and infrastructure |
| People & Process | Mentoring, comms, postmortems, ownership, rollouts |
- Easy — A focused warm-up. Solvable or explainable in under an hour.
- Medium — Realistic interview question. Has edge cases that matter.
- Hard — Multi-step or system-design heavy. Closer to a take-home task.
New problems are added regularly. If you want to contribute, see the Contribution Guide.