Problems Index

A quick overview of every problem in this repo. Use the Category and Topics columns to filter by what you want to practice. Each row links to the problem statement and the reference solution.

This file is generated by scripts/build_index.py from the frontmatter in each problem's question.md. Do not edit by hand.
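
The generation step could look roughly like the sketch below. It is not the actual scripts/build_index.py, which is not shown here. The frontmatter layout (a block between two `---` lines), the field names `title`, `category`, `topics` and `difficulty`, and the helper names `parse_frontmatter` and `index_row` are all assumptions made for illustration.

```python
from typing import Dict


def parse_frontmatter(text: str) -> Dict[str, str]:
    """Parse simple `key: value` pairs between the first two `---` lines.

    Assumes a flat, YAML-like block; nested values are not handled.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}  # no frontmatter block at the top of the file
    fields: Dict[str, str] = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break  # end of the frontmatter block
        key, sep, value = line.partition(":")
        if sep:  # skip lines that are not `key: value`
            fields[key.strip()] = value.strip()
    return fields


def index_row(number: int, fields: Dict[str, str]) -> str:
    """Render one index row; the column order mirrors the table in this file."""
    columns = [
        str(number),
        fields.get("title", ""),
        fields.get("category", ""),
        fields.get("topics", ""),
        fields.get("difficulty", ""),
    ]
    return " | ".join(columns)


# Hypothetical question.md content, modeled on problem 2 in the table.
sample = """---
title: Rolling Average of Sensor Readings
category: Streaming
topics: rolling window, deque, IoT sensors, real-time
difficulty: Easy
---
Problem statement goes here.
"""

row = index_row(2, parse_frontmatter(sample))
print(row)
```

A real script would also walk the problem directories and add the Question and Solution links. Those paths are not stated in this file, so the sketch leaves them out.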

| # | Problem | Category | Topics | Difficulty | Question | Solution |
|---|---------|----------|--------|------------|----------|----------|
| 1 | Log File Error Analysis | Logs and Monitoring | file streaming, counters, top-N, IoT logs | Easy | Question | Solution |
| 2 | Rolling Average of Sensor Readings | Streaming | rolling window, deque, IoT sensors, real-time | Easy | Question | Solution |
| 3 | Transform and Clean Raw Data for Analytics | Data Cleaning | CSV, validation, regex, date checks | Medium | Question | Solution |
| 4 | Schema Evolution and Validation for Streaming Events | Schema Validation | JSON, schema evolution, type coercion, pydantic | Medium | Question | Solution |
| 5 | Merging Messy CSVs from Multiple Partners | Data Integration | CSV, column mapping, date parsing, file walk | Medium | Question | Solution |
| 6 | Partitioning vs Clustering in BigQuery | Fundamentals | BigQuery, partitioning, clustering, cost | Easy | Question | Solution |
| 7 | ETL vs ELT and Why ELT Won | Fundamentals | ETL, ELT, dbt, warehouse | Easy | Question | Solution |
| 8 | OLTP vs OLAP | Fundamentals | OLTP, OLAP, column store, row store | Easy | Question | Solution |
| 9 | Idempotency in Data Pipelines | Fundamentals | idempotency, retries, MERGE, partitions | Medium | Question | Solution |
| 10 | Slowly Changing Dimensions | Fundamentals | SCD, dimensions, history, dbt snapshot | Medium | Question | Solution |
| 11 | Data Contracts in Plain Words | Fundamentals | data contracts, schema registry, ownership | Medium | Question | Solution |
| 12 | Parquet vs CSV vs JSON | Fundamentals | Parquet, CSV, JSON, columnar storage | Easy | Question | Solution |
| 13 | Data Lake vs Warehouse vs Lakehouse | Fundamentals | lake, warehouse, lakehouse, Iceberg, Delta | Medium | Question | Solution |
| 14 | Exactly Once Delivery | Fundamentals | exactly once, idempotency, Kafka, streaming | Medium | Question | Solution |
| 15 | Teaching SQL Performance to a Junior | SQL Thinking | EXPLAIN, performance, mentoring, optimization | Medium | Question | Solution |
| 16 | SELECT DISTINCT Hiding Join Bugs | SQL Thinking | DISTINCT, joins, grain, semi-join | Medium | Question | Solution |
| 17 | Reading an EXPLAIN Plan | SQL Thinking | EXPLAIN, query plan, joins, sort spill | Medium | Question | Solution |
| 18 | CTE vs Subquery | SQL Thinking | CTE, subquery, materialization, recursion | Medium | Question | Solution |
| 19 | Same Query Different Answers | SQL Thinking | time zones, RLS, session settings, debugging | Medium | Question | Solution |
| 20 | Window Functions vs GROUP BY | SQL Thinking | window functions, GROUP BY, running totals, ranking | Medium | Question | Solution |
| 21 | Data Platform for an Electricity Retailer | System Design | smart meter, IoT, warehouse, batch | Hard | Question | Solution |
| 22 | Banking App Monthly Spending Widget | System Design | streaming, CDC, serving store, low latency | Hard | Question | Solution |
| 23 | Ride Hailing Surge Pricing | System Design | streaming, H3, real-time, pricing | Hard | Question | Solution |
| 24 | Spotify Minutes Listened This Week | System Design | streaming aggregation, KV store, watermarks | Hard | Question | Solution |
| 25 | Smart Meter to Monthly Bill PDF | System Design | billing, SCD2, idempotency, audit | Hard | Question | Solution |
| 26 | Delivery Idle Driver Tracking | System Design | streaming, H3, TTL, geospatial | Hard | Question | Solution |
| 27 | Year in Review Recap | System Design | batch, KV store, CDN, image render | Medium | Question | Solution |
| 28 | Low Balance Notification Pipeline | System Design | batch, idempotency, time zones, notifications | Medium | Question | Solution |
| 29 | Daily Report Quietly Wrong for Two Weeks | Scenarios | incident, postmortem, comms, data quality | Medium | Question | Solution |
| 30 | Warehouse Cost Doubled in Two Months | Scenarios | cost, governance, comms, INFORMATION_SCHEMA | Medium | Question | Solution |
| 31 | The Dashboard is Wrong | Scenarios | trust, comms, vague reports | Easy | Question | Solution |
| 32 | Inheriting a Pipeline No One Owns | Scenarios | ownership, judgement, rewrite-or-not | Medium | Question | Solution |
| 33 | Executive Needs a Number Tomorrow | Scenarios | comms, exec, caveats, prioritization | Medium | Question | Solution |
| 34 | Three Days of Data Lost | Scenarios | Kafka retention, replay, recovery, postmortem | Hard | Question | Solution |
| 35 | Lambda vs Cloud Function vs Cloud Run | Cloud Decisions | serverless, AWS, GCP, runtime limits | Medium | Question | Solution |
| 36 | Scheduled Pipeline Pay Only When Run | Cloud Decisions | scheduled jobs, Cloud Run Jobs, AWS Batch | Easy | Question | Solution |
| 37 | BigQuery vs Snowflake for New Team | Cloud Decisions | BigQuery, Snowflake, pricing model | Medium | Question | Solution |
| 38 | Store Partner Files in S3 or Warehouse | Cloud Decisions | S3, raw layer, audit, schema evolution | Easy | Question | Solution |
| 39 | Managed Airflow vs Self Hosted | Cloud Decisions | Airflow, MWAA, Composer, Astronomer, Dagster | Medium | Question | Solution |
| 40 | BigQuery Access Control for 50 Person Company | Cloud Decisions | IAM, datasets, groups, RLS, audit | Medium | Question | Solution |
| 41 | Tables for an Airbnb Like App | Data Modeling | star schema, SCD2, multi-currency, reviews | Medium | Question | Solution |
| 42 | Tracking Subscription Plan History | Data Modeling | history, valid_from/to, billing, SCD2 | Medium | Question | Solution |
| 43 | Mixing Facts and Dimensions | Data Modeling | star schema, SCD2, views, history | Medium | Question | Solution |
| 44 | Explaining Fact Table Grain | Data Modeling | grain, facts, dimensions, aggregations | Easy | Question | Solution |
| 45 | Current State and Full History | Data Modeling | event sourcing, projections, MV, audit | Medium | Question | Solution |
| 46 | Region Suddenly Shows Zero Revenue | Debugging | dashboard, joins, SCD, time zones | Medium | Question | Solution |
| 47 | Airflow Green but Output Empty | Debugging | silent success, idempotency, anomaly checks | Medium | Question | Solution |
| 48 | Query Suddenly 80x Slower | Debugging | EXPLAIN, statistics, plan flip, join strategy | Medium | Question | Solution |
| 49 | User Says Data Is Wrong | Debugging | comms, vague reports, triage | Easy | Question | Solution |
| 50 | Partition Always Ten Percent Smaller | Debugging | anomaly, baselines, patterns, judgement | Medium | Question | Solution |
| 51 | BigQuery Bill Eight Times Higher | Cost & Performance | INFORMATION_SCHEMA, top queries, slot reservation | Medium | Question | Solution |
| 52 | Four Hour Spark Job Under One Hour | Cost & Performance | Spark UI, skew, AQE, broadcast joins | Medium | Question | Solution |
| 53 | Hourly Scan on Daily Data | Cost & Performance | summary tables, MV, refresh, BI tool | Easy | Question | Solution |
| 54 | Just Throw More Memory At It | Cost & Performance | upsize, plan inspection, optimization | Medium | Question | Solution |
| 55 | Partitioning Clustering Materialized Views | Cost & Performance | partitioning, clustering, MV, BigQuery | Easy | Question | Solution |
| 56 | Watermarks in Plain Words | Streaming | watermarks, event time, allowed lateness | Medium | Question | Solution |
| 57 | Kafka Ordering Guarantee | Streaming | Kafka, partition key, ordering, idempotent producer | Medium | Question | Solution |
| 58 | Streaming Consumer Lag Diagnosis | Streaming | lag, back-pressure, skew, Flink UI | Medium | Question | Solution |
| 59 | Onboarding a New Analyst | People & Process | onboarding, mentoring, pairing | Easy | Question | Solution |
| 60 | Metric by Tomorrow vs Doing It Right | People & Process | comms, prioritization, metrics | Easy | Question | Solution |
| 61 | Two Teams Disagree on Active User | People & Process | metric ownership, comms, metrics layer | Medium | Question | Solution |
| 62 | Postmortem After a Bad Day | People & Process | postmortem, blameless, action items | Medium | Question | Solution |
| 63 | Inherited Pipeline No Docs No Tests | People & Process | ownership, docs, tests, expectations | Medium | Question | Solution |
| 64 | Breaking Change in dbt Model 200 Consumers | People & Process | dbt, deprecation, comms, rollout | Medium | Question | Solution |
| 65 | 4000 DAG Airflow at 90 Percent CPU | People & Process | Airflow, scheduler, parsing, scale-out | Medium | Question | Solution |
| 66 | Indexes When to Add and When They Hurt | Databases | indexes, B-tree, write cost, EXPLAIN | Easy | Question | Solution |
| 67 | Transactions and ACID | Databases | transactions, ACID, durability, atomicity | Easy | Question | Solution |
| 68 | Isolation Levels in Plain Words | Databases | isolation, snapshot, anomalies, MVCC | Medium | Question | Solution |
| 69 | Normalization and When to Denormalize | Databases | normalization, 3NF, denormalization, star schema | Medium | Question | Solution |
| 70 | B-Tree vs Hash vs LSM Tree | Databases | B-tree, hash, LSM, storage engines | Medium | Question | Solution |
| 71 | Read Replicas and Replication Lag | Databases | replicas, replication lag, read after write | Medium | Question | Solution |
| 72 | Sharding and Picking a Shard Key | Databases | sharding, shard key, hot shards, hash | Hard | Question | Solution |
| 73 | Database Connection Pooling | Databases | connection pool, PgBouncer, sizing, Postgres | Medium | Question | Solution |
| 74 | Deadlocks and Lock Escalation | Databases | deadlocks, locks, retries, lock escalation | Medium | Question | Solution |
| 75 | SQL vs NoSQL | Databases | SQL, NoSQL, KV, document, wide column, graph | Medium | Question | Solution |

Category Legend

| Category | What you practice |
|----------|-------------------|
| Logs and Monitoring | Parsing and analyzing large log files, counting events, ranking |
| Streaming | Continuous data, rolling stats, watermarks, ordering, lag |
| Data Cleaning | Validating, normalizing and rejecting bad rows from raw files |
| Schema Validation | Handling evolving JSON or Avro schemas without breaking consumers |
| Data Integration | Combining data from many sources with different shapes and conventions |
| Fundamentals | Core concepts every data engineer should be able to explain plainly |
| SQL Thinking | Writing, reading and reasoning about SQL like a senior engineer |
| System Design | End-to-end pipelines for real consumer and energy-sector products |
| Scenarios | Tricky real-life situations that test judgement and communication |
| Cloud Decisions | Picking between AWS / GCP services with clear trade-offs |
| Data Modeling | Star schemas, history tracking, grain, dimensions |
| Debugging | Step-by-step investigation of "the number is wrong" style problems |
| Cost & Performance | Finding waste in queries, jobs, and infrastructure |
| People & Process | Mentoring, comms, postmortems, ownership, rollouts |
| Databases | Indexes, transactions, isolation levels, replication, sharding, connection pooling |

Difficulty Guide

  • Easy — A focused warm-up. Solvable or explainable in under an hour.
  • Medium — Realistic interview question. Has edge cases that matter.
  • Hard — Multi-step or system-design heavy. Closer to a take-home task.

New problems are added regularly. If you want to contribute, see the Contribution Guide.