The Data Foundry

Broken Pipeline Lab

Fix SQL bugs, PySpark mistakes, Airflow failures, and production data issues before interviews test you.

Practice Data Engineering the way it actually breaks in production: diagnose, attempt, evaluate, reveal, and explain.

Total labs

132

Free labs

Premium labs

113

Completed

Start here

Choose the path closest to your current goal.

These shortcuts set the filters for you so the library feels like a guided practice plan, not a wall of cards.

Scenario cards

Showing 132 of 132 labs. Attempted 0.

SQL | Beginner | Free

Wrong GROUP BY Grain Causing Customer Revenue Inflation

You will practice

Build a customer-level revenue result with exactly one row per customer. Include only completed orders, return customer_id, customer_name, and completed_revenue, and make sure duplicate status rows cannot inflate the dashboard.

The query groups by customer and order status, but the dashboard expects one row per customer. When downstream users sum the status-level rows again, cancelled and completed order rows are mixed into the customer metric.

Type

Broken SQL Fix

Time

18 min

SQL | Grain | Revenue | Data Quality

Start Lab
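A minimal sketch of one way to reach the customer grain, using Python's built-in sqlite3 module. The table layouts, column names, and sample rows are assumptions made for illustration, and the lab's actual schema may differ. It also assumes that customers with no completed orders should be left out of the result.

```python
import sqlite3

# Hypothetical schema and sample data, invented for this sketch.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, customer_name TEXT);
CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY,
    customer_id INTEGER,
    status TEXT,
    amount REAL
);
INSERT INTO customers VALUES (1, 'Asha'), (2, 'Ben');
INSERT INTO orders VALUES
    (10, 1, 'completed', 100.0),
    (11, 1, 'completed', 50.0),
    (12, 1, 'cancelled', 70.0),
    (13, 2, 'cancelled', 30.0);
""")

# Filter to completed orders BEFORE aggregating, and group only by the
# customer columns so the output grain is one row per customer.
CUSTOMER_GRAIN_SQL = """
SELECT c.customer_id,
       c.customer_name,
       SUM(o.amount) AS completed_revenue
FROM customers AS c
JOIN orders AS o ON o.customer_id = c.customer_id
WHERE o.status = 'completed'
GROUP BY c.customer_id, c.customer_name
ORDER BY c.customer_id
"""

rows = conn.execute(CUSTOMER_GRAIN_SQL).fetchall()
# rows == [(1, 'Asha', 150.0)]
```

Because order status is no longer part of the GROUP BY, downstream users cannot accidentally re-sum status-level rows.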

SQL | Beginner | Free

LEFT JOIN Turned Into INNER JOIN by WHERE Filter

You will practice

Return every active customer. For customers who clicked campaign SPRING_26, show their latest click timestamp. For customers with no click, keep the customer row and return NULL for last_click_at.

The query uses a LEFT JOIN, but a filter on the campaign table is placed in the WHERE clause. That removes NULL right-side rows and silently turns the result into an inner join for this campaign.

Type

Broken SQL Fix

Time

16 min

SQL | Joins | NULLs | Retention

Start Lab
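The difference between filtering in WHERE and filtering in ON can be seen in a small sqlite3 sketch. The `customers` and `clicks` tables, their columns, and the sample rows are assumptions for illustration only.

```python
import sqlite3

# Hypothetical schema and sample data, invented for this sketch.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, is_active INTEGER);
CREATE TABLE clicks (customer_id INTEGER, campaign_id TEXT, clicked_at TEXT);
INSERT INTO customers VALUES (1, 1), (2, 1), (3, 0), (4, 1);
INSERT INTO clicks VALUES
    (1, 'SPRING_26', '2026-03-01 10:00:00'),
    (1, 'SPRING_26', '2026-03-05 09:30:00'),
    (2, 'WINTER_25', '2025-12-01 08:00:00'),
    (3, 'SPRING_26', '2026-03-02 12:00:00');
""")

# Broken: the campaign filter in WHERE discards rows where k.* is NULL,
# so customers without a SPRING_26 click disappear.
BROKEN_SQL = """
SELECT c.customer_id, MAX(k.clicked_at) AS last_click_at
FROM customers AS c
LEFT JOIN clicks AS k ON k.customer_id = c.customer_id
WHERE c.is_active = 1 AND k.campaign_id = 'SPRING_26'
GROUP BY c.customer_id
ORDER BY c.customer_id
"""

# Fixed: the right-side filter moves into ON, so unmatched customers
# survive with NULL; the left-side filter stays in WHERE.
FIXED_SQL = """
SELECT c.customer_id, MAX(k.clicked_at) AS last_click_at
FROM customers AS c
LEFT JOIN clicks AS k
       ON k.customer_id = c.customer_id
      AND k.campaign_id = 'SPRING_26'
WHERE c.is_active = 1
GROUP BY c.customer_id
ORDER BY c.customer_id
"""

broken_rows = conn.execute(BROKEN_SQL).fetchall()
fixed_rows = conn.execute(FIXED_SQL).fetchall()
# broken_rows == [(1, '2026-03-05 09:30:00')]
# fixed_rows  == [(1, '2026-03-05 09:30:00'), (2, None), (4, None)]
```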

SQL | Intermediate | Free

Duplicate Revenue from Joining Orders to Multiple Payments and Refunds

You will practice

Return one row per order with paid_amount, refunded_amount, and net_revenue. Aggregate each child table to order_id before joining so payment and refund rows cannot multiply each other.

The mart joins orders directly to payments and refunds at row level. Because both child tables can have multiple rows per order, the join multiplies records before aggregation.

Type

Output Mismatch Debugging

Time

22 min

SQL | Join Explosion | Revenue | Output Mismatch

Start Lab
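The fan-out described above, and the pre-aggregation that avoids it, can be sketched with sqlite3. Table names, columns, and amounts are hypothetical and chosen only to make the multiplication visible.

```python
import sqlite3

# Hypothetical schema and sample data, invented for this sketch.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id INTEGER PRIMARY KEY);
CREATE TABLE payments (order_id INTEGER, amount REAL);
CREATE TABLE refunds (order_id INTEGER, amount REAL);
INSERT INTO orders VALUES (1), (2);
INSERT INTO payments VALUES (1, 60.0), (1, 40.0), (2, 20.0);
INSERT INTO refunds VALUES (1, 10.0), (1, 5.0);
""")

# Broken: order 1 has 2 payments and 2 refunds, so the row-level join
# yields 2 x 2 = 4 rows and every amount is counted twice.
BROKEN_SQL = """
SELECT o.order_id, SUM(p.amount) AS paid_amount, SUM(r.amount) AS refunded_amount
FROM orders AS o
LEFT JOIN payments AS p ON p.order_id = o.order_id
LEFT JOIN refunds AS r ON r.order_id = o.order_id
GROUP BY o.order_id
ORDER BY o.order_id
"""

# Fixed: collapse each child table to one row per order_id first.
FIXED_SQL = """
WITH paid AS (
    SELECT order_id, SUM(amount) AS paid_amount
    FROM payments GROUP BY order_id
),
refunded AS (
    SELECT order_id, SUM(amount) AS refunded_amount
    FROM refunds GROUP BY order_id
)
SELECT o.order_id,
       COALESCE(p.paid_amount, 0) AS paid_amount,
       COALESCE(r.refunded_amount, 0) AS refunded_amount,
       COALESCE(p.paid_amount, 0) - COALESCE(r.refunded_amount, 0) AS net_revenue
FROM orders AS o
LEFT JOIN paid AS p ON p.order_id = o.order_id
LEFT JOIN refunded AS r ON r.order_id = o.order_id
ORDER BY o.order_id
"""

broken_rows = conn.execute(BROKEN_SQL).fetchall()
fixed_rows = conn.execute(FIXED_SQL).fetchall()
# Order 1 in broken_rows reports 200.0 paid and 30.0 refunded.
# Order 1 in fixed_rows reports 100.0 paid, 15.0 refunded, 85.0 net.
```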

PySpark | Intermediate | Free

Append Mode Created Duplicate Daily Loads

You will practice

Make the daily write idempotent for order_date.

The PySpark job writes in append mode for a deterministic daily partition, so retries and reruns duplicate the same day.

Type

Broken PySpark Fix

Time

20 min

PySpark | Idempotency | Daily Loads | Lakehouse

Start Lab
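The idea behind an idempotent daily write is that a rerun replaces the day's partition rather than adding to it. The sketch below models that with plain files instead of Spark; the function names, directory layout, and sample rows are hypothetical. In Spark itself, a comparable effect is often obtained by overwriting only the target partition (for example with `spark.sql.sources.partitionOverwriteMode` set to `dynamic`), though which approach fits depends on the table format in use.

```python
import json
import shutil
import tempfile
from pathlib import Path


def write_daily_partition(base_dir: Path, order_date: str, rows: list) -> Path:
    """Replace the partition for order_date instead of appending to it."""
    partition = base_dir / f"order_date={order_date}"
    # Write to a staging directory first so a failed write leaves the
    # existing partition untouched.
    staging = Path(tempfile.mkdtemp(dir=base_dir, prefix="_staging_"))
    lines = "\n".join(json.dumps(row) for row in rows)
    (staging / "part-00000.json").write_text(lines)
    # Swap the staged data in. This delete-then-rename is not atomic;
    # it is only a model of the overwrite semantics.
    if partition.exists():
        shutil.rmtree(partition)
    staging.rename(partition)
    return partition


def read_partition(base_dir: Path, order_date: str) -> list:
    """Read back every row stored for one order_date."""
    partition = base_dir / f"order_date={order_date}"
    rows = []
    for part_file in sorted(partition.glob("part-*.json")):
        rows.extend(json.loads(line) for line in part_file.read_text().splitlines() if line)
    return rows


base = Path(tempfile.mkdtemp())
day_rows = [{"order_id": 1, "amount": 10.0}, {"order_id": 2, "amount": 5.0}]
write_daily_partition(base, "2026-03-01", day_rows)
write_daily_partition(base, "2026-03-01", day_rows)  # simulated retry
rows_after_retry = read_partition(base, "2026-03-01")
# The retry leaves 2 rows in the partition, not 4.
```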

PySpark | Intermediate | Free

The Endless Final Stage

You will practice

Fix the PySpark code so the pipeline is correct, scalable, and safe to rerun.

The job reaches 99 percent quickly and 199 of 200 tasks finish, but one task runs for hours with high GC and 100 percent CPU before failing with OutOfMemoryError.

Type

Broken PySpark Fix

Time

24 min

PySpark | Broken PySpark | Spark Performance and Debugging | Spark job stuck at 99% because of data skew

Start Lab
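The skew behind a single long-running task can be illustrated without a cluster. The following plain-Python sketch models hash partitioning and shows how salting a hot key spreads its rows across partitions. The partition count, salt bucket count, key names, and row volumes are all assumed values. In a real join, the other side would also need to be replicated once per salt value so that rows still match.

```python
import hashlib
import random
from collections import Counter

NUM_PARTITIONS = 8   # assumed shuffle partition count
SALT_BUCKETS = 16    # assumed number of salt values for the hot key


def partition_for(key: str) -> int:
    """Deterministic stand-in for a hash partitioner."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS


def partition_loads(keys) -> list:
    """Number of rows landing in each partition."""
    loads = Counter(partition_for(k) for k in keys)
    return [loads.get(p, 0) for p in range(NUM_PARTITIONS)]


# One hot key owns 90 percent of the rows (hypothetical data).
events = ["cust_hot"] * 9000 + [f"cust_{i}" for i in range(1000)]

# Salting appends a random bucket number, turning one hot key
# into up to SALT_BUCKETS distinct shuffle keys.
rng = random.Random(0)
salted = [f"{key}#{rng.randrange(SALT_BUCKETS)}" for key in events]

plain_max = max(partition_loads(events))
salted_max = max(partition_loads(salted))
# Without salting, one partition holds at least the 9000 hot rows.
# With salting, the largest partition is smaller.
```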

PySpark | Intermediate | Free

The Executor Graveyard

You will practice

Fix the PySpark code so the pipeline is correct, scalable, and safe to rerun.

The job does not freeze at one task; instead executors keep dying and getting replaced. Retries happen repeatedly and the stage eventually fails.

Type

Broken PySpark Fix

Time

24 min

PySpark | Broken PySpark | Spark Performance and Debugging | Repeated executor deaths after a wide join

Start Lab

PySpark | Intermediate | Free

The Shuffle Storm

You will practice

Fix the PySpark code so the pipeline is correct, scalable, and safe to rerun.

A new version replaced reduceByKey-like logic with groupByKey and several downstream transformations. Runtime doubled and shuffle traffic exploded.

Type

Broken PySpark Fix

Time

24 min

PySpark | Broken PySpark | Spark Performance and Debugging | Massive shuffle caused by bad aggregation strategy

Start Lab
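Why a groupByKey-style aggregation moves more data than a reduceByKey-style one can be shown with a toy model in plain Python. This is not Spark code: the two functions below are hypothetical stand-ins that only count how many records would cross the shuffle under each strategy.

```python
from collections import defaultdict


def shuffle_group_by_key(partitions):
    """Model of groupByKey: every record is shuffled, then summed."""
    shuffled = [record for part in partitions for record in part]
    totals = defaultdict(int)
    for key, value in shuffled:
        totals[key] += value
    return dict(totals), len(shuffled)


def shuffle_reduce_by_key(partitions):
    """Model of reduceByKey: combine inside each partition first, so at
    most one record per key per partition is shuffled."""
    shuffled = []
    for part in partitions:
        local = defaultdict(int)
        for key, value in part:
            local[key] += value
        shuffled.extend(local.items())
    totals = defaultdict(int)
    for key, value in shuffled:
        totals[key] += value
    return dict(totals), len(shuffled)


# Two input partitions of (key, value) records (hypothetical data).
partitions = [
    [("a", 1), ("a", 2), ("b", 3)],
    [("a", 4), ("b", 5), ("b", 6)],
]

group_totals, group_shuffled = shuffle_group_by_key(partitions)
reduce_totals, reduce_shuffled = shuffle_reduce_by_key(partitions)
# Both give {"a": 7, "b": 14}; groupByKey shuffles 6 records, reduceByKey 4.
```

The gap widens as the number of records per key grows, since the pre-combined volume depends on distinct keys per partition rather than on row count.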

PySpark | Beginner | Free

The Small Files Avalanche

You will practice

Fix the PySpark code so the pipeline is correct, scalable, and safe to rerun.

The job technically succeeds, but downstream reads get slower over time and cloud object store listings become painfully expensive. Each partition contains hundreds or thousands of tiny files.

Type

Broken PySpark Fix

Time

24 min

PySpark | Broken PySpark | Spark Performance and Debugging | Spark write path creates too many tiny files

Start Lab

PySpark | Beginner | Free

The Silent UDF Tax

You will practice

Fix the PySpark code so the pipeline is correct, scalable, and safe to rerun.

The job still succeeds, but runtime tripled and CPU utilization looks poor despite no major shuffle increase.

Type

Broken PySpark Fix

Time

24 min

PySpark | Broken PySpark | Spark Performance and Debugging | Python UDF makes a pipeline unexpectedly slow

Start Lab

PySpark | Intermediate | Free

The Broadcast Betrayal

You will practice

Fix the PySpark code so the pipeline is correct, scalable, and safe to rerun.

Someone forced a broadcast hint. It worked in test and smaller markets, but production now fails intermittently with broadcast timeout or executor memory pressure.

Type

Broken PySpark Fix

Time

24 min

PySpark | Broken PySpark | Spark Performance and Debugging | Wrong broadcast choice causes timeout or memory issues

Start Lab

PySpark | Beginner | Free

The Cache Everything Trap

You will practice

Fix the PySpark code so the pipeline is correct, scalable, and safe to rerun.

Production jobs show higher memory pressure, more spills, and worse overall runtime than before caching was added.

Type

Broken PySpark Fix

Time

24 min

PySpark | Broken PySpark | Spark Performance and Debugging | Over-caching makes the cluster slower

Start Lab

PySpark | Intermediate | Free

The Backfill Explosion

You will practice

Fix the PySpark code so the pipeline is correct, scalable, and safe to rerun.

The same code that works daily now runs for many hours, overloads the cluster, and causes repeated failures.

Type

Broken PySpark Fix

Time

24 min

PySpark | Broken PySpark | Spark Performance and Debugging | Daily logic fails when rerun for months of history

Start Lab

PySpark | Beginner | Free

The Union of Doom

You will practice

Fix the PySpark code so the pipeline is correct, scalable, and safe to rerun.

After onboarding a new market, some columns shifted position and downstream consumers started seeing nulls or incorrect values even though the job itself succeeded.

Type

Broken PySpark Fix

Time

24 min

PySpark | Broken PySpark | Spark Performance and Debugging | Schema mismatch during union creates silent data corruption risk

Start Lab
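The silent corruption comes from combining rows by column position when the column order differs between sources. The plain-Python model below contrasts positional and name-based unions; the column names and rows are invented for illustration. In PySpark, `DataFrame.union` resolves columns by position while `DataFrame.unionByName` resolves them by name, which is the distinction this sketch imitates.

```python
def union_by_position(cols_a, rows_a, cols_b, rows_b):
    """Stack rows as-is, trusting that both sides share a column order."""
    return cols_a, rows_a + rows_b


def union_by_name(cols_a, rows_a, cols_b, rows_b):
    """Reorder the second side to match the first side's column names."""
    if set(cols_a) != set(cols_b):
        raise ValueError(f"column sets differ: {cols_a} vs {cols_b}")
    order = [cols_b.index(col) for col in cols_a]
    realigned = [tuple(row[i] for i in order) for row in rows_b]
    return cols_a, rows_a + realigned


# The existing market and the new market disagree on column order
# (hypothetical data).
existing_cols = ["order_id", "amount", "currency"]
existing_rows = [(1, 10.0, "USD")]
new_cols = ["order_id", "currency", "amount"]
new_rows = [(2, "EUR", 20.0)]

_, positional = union_by_position(existing_cols, existing_rows, new_cols, new_rows)
_, by_name = union_by_name(existing_cols, existing_rows, new_cols, new_rows)
# positional[1] == (2, "EUR", 20.0): "EUR" now sits in the amount column.
# by_name[1] == (2, 20.0, "EUR"): values line up with their column names.
```

Raising on mismatched column sets turns a silent data problem into a visible failure, which is usually the safer default for a new source.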

PySpark | Advanced | Free

Checkpoint Chaos

You will practice

Fix the PySpark code so the pipeline is correct, scalable, and safe to rerun.

After a deployment and an infrastructure interruption, the stream restarts but reprocesses events, produces duplicates, or struggles with state-store recovery.

Type

Broken PySpark Fix

Time

24 min

PySpark | Broken PySpark | Spark Performance and Debugging | Structured Streaming restarts cause duplicates or state issues

Start Lab

PySpark | Intermediate | Free

The AQE Surprise

You will practice

Fix the PySpark code so the pipeline is correct, scalable, and safe to rerun.

Production performance becomes inconsistent. On some days the job gets faster, but on others one stage becomes very heavy, task counts collapse, and the downstream write becomes slower than before.

Type

Broken PySpark Fix

Time

24 min

PySpark | Broken PySpark | Spark Performance and Debugging | Adaptive Query Execution changes join or partition strategy in unexpected ways

Start Lab

PySpark | Intermediate | Free

The Window Function Blowup

You will practice

Fix the PySpark code so the pipeline is correct, scalable, and safe to rerun.

The job does not fail immediately, but stages with sort and window execution become extremely slow, spill heavily to disk, and sometimes time out under peak loads.

Type

Broken PySpark Fix

Time

24 min

PySpark | Broken PySpark | Spark Performance and Debugging | Large window operations create spill, skew, or sort pressure

Start Lab

PySpark | Beginner | Free

The Null Key Funnel

You will practice

Fix the PySpark code so the pipeline is correct, scalable, and safe to rerun.

Runtime suddenly deteriorates after a source issue causes a large share of merchant_id values to arrive as null or a default placeholder, and one stage becomes badly imbalanced.

Type

Broken PySpark Fix

Time

24 min

PySpark | Broken PySpark | Spark Performance and Debugging | Null or default join keys collapse data into a pathological partition

Start Lab
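One common way to keep null or placeholder keys out of the join is to set those rows aside, join only the rows that can match, and append the rest afterwards. The plain-Python sketch below models that flow with left-join semantics. The placeholder value `"UNKNOWN"`, the field names, and the sample rows are assumptions, not details taken from the lab.

```python
# Keys that can never match a real merchant (the placeholder is hypothetical).
PLACEHOLDER_KEYS = {None, "UNKNOWN"}


def join_with_null_isolation(transactions, merchants):
    """Left-join transactions to merchants, bypassing unjoinable keys."""
    joinable = [t for t in transactions if t["merchant_id"] not in PLACEHOLDER_KEYS]
    unjoinable = [t for t in transactions if t["merchant_id"] in PLACEHOLDER_KEYS]

    # Only rows with a usable key take part in the join, so null and
    # placeholder keys cannot pile up in one shuffle partition.
    joined = [
        {**t, "merchant_name": merchants.get(t["merchant_id"])}
        for t in joinable
    ]
    # Rows that cannot match keep a NULL dimension value, exactly as
    # a left join would have produced for them.
    passthrough = [{**t, "merchant_name": None} for t in unjoinable]
    return joined + passthrough


merchants = {"m1": "Corner Cafe"}
transactions = [
    {"txn_id": 1, "merchant_id": "m1"},
    {"txn_id": 2, "merchant_id": None},
    {"txn_id": 3, "merchant_id": "UNKNOWN"},
]

result = join_with_null_isolation(transactions, merchants)
# All 3 transactions are kept; only txn_id 1 gets a merchant_name.
```

Row counts before and after should match, which makes this pattern easy to validate in a pipeline check.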

PySpark | Intermediate | Free

The Explode Cascade

You will practice

Fix the PySpark code so the pipeline is correct, scalable, and safe to rerun.

After the change, row counts increase by orders of magnitude, a formerly healthy job now spills heavily, and downstream tables become far larger than expected.

Type

Broken PySpark Fix

Time

24 min

PySpark | Broken PySpark | Spark Performance and Debugging | Flattening nested arrays multiplies row counts and destabilizes the plan

Start Lab

PySpark | Beginner | Free

The Driver Memory Trap

You will practice

Fix the PySpark code so the pipeline is correct, scalable, and safe to rerun.

The executors look fine, but the application driver becomes unstable, crashes intermittently, or hangs during peak days.

Type

Broken PySpark Fix

Time

24 min

PySpark | Broken PySpark | Spark Performance and Debugging | Work is accidentally pulled back to the driver and causes instability

Start Lab

PySpark | Intermediate | Premium

Spark Join Slowed Down Due to Skewed Customer Key

You will practice

Choose the likely root cause.

The join key has one customer_id that owns a massive share of events, causing one reducer partition to process most rows.

Type

MCQ Diagnosis

Time

12 min

PySpark | Skew | Join | Performance

Scenario cards

Wrong GROUP BY Grain Causing Customer Revenue Inflation

LEFT JOIN Turned Into INNER JOIN by WHERE Filter

Duplicate Revenue from Joining Orders to Multiple Payments and Refunds

Append Mode Created Duplicate Daily Loads

The Endless Final Stage

The Executor Graveyard

The Shuffle Storm

The Small Files Avalanche

The Silent UDF Tax

The Broadcast Betrayal

The Cache Everything Trap

The Backfill Explosion

The Union of Doom

Checkpoint Chaos

The AQE Surprise

The Window Function Blowup

The Null Key Funnel

The Explode Cascade

The Driver Memory Trap

Spark Join Slowed Down Due to Skewed Customer Key

Too Many Small Files from Hourly Writes

DAG Green but Dashboard Wrong

Airflow Retry Reprocessed Same File and Created Duplicates

Revenue Dropped 30% After New SUCCESSFUL Status

UTC to Local Timezone Boundary Broke Daily Dashboard

Bad Partition Strategy by customer_id

Late Arriving Records Changed Previous Revenue Partitions

The Delta Merge Meltdown

The Speculation Confusion

The File Listing Bottleneck

The Partition Pruning Mirage

The Serializer Mismatch

The Duplicate Customer Nightmare

The NULL Trap

The History Table Dilemma

The Double Counting Report

The Query That Became Slow Overnight

The Merge Contention Problem

The Late Dimension Problem

The Snapshot Consistency Debate

The Incremental Load Gone Wrong

The Trusted Summary Table

The Top-N Tie Trap

The Fact Correction Dilemma

The FX Conversion Mismatch

The JSON Fanout Query

The MERGE with Duplicate Matches

The Timezone Boundary Bug

The Snapshot Join Trap

The Chasm Trap

The Distinct Count at Scale

The Restatement vs Close Debate

The Green DAG with Bad Data

The Backfill Hell

The Sensor Gridlock

The Dynamic Task Explosion

The Retry Illusion

The Secret Rotation Outage

The Idempotency Question

The SLA vs Throughput Trade-off

The Catchup Stampede

The Zombie Task Mystery

The XCom Bloat Problem

The DAG Parse Bottleneck

The Pagination Retry Loop

The Premature Publish Race

The Ownership Vacuum

The Cross-DAG Dependency Trap

The Metadata DB Pressure Wave

The Environment Drift Outage

The Run Config Chaos

The Maintenance Window Replay

The Partitioning Disaster

The Schema Evolution Shock

The Retention Regret

The Metadata Swamp