The Data Foundry

Built by Data with Pranjal

Broken Pipeline Lab

Fix SQL bugs, PySpark mistakes, Airflow failures, and production data issues before interviews test you.

Practice Data Engineering the way it actually breaks in production: diagnose, attempt, evaluate, reveal, and explain.

Total labs: 132
Free labs: 19
Premium labs: 113

Scenario cards

SQL | Beginner | Free

Wrong GROUP BY Grain Causing Customer Revenue Inflation

You will practice

Build a customer-level revenue result with exactly one row per customer. Include only completed orders, return customer_id, customer_name, and completed_revenue, and make sure duplicate status rows cannot inflate the dashboard.

The query groups by customer and order status, but the dashboard expects one row per customer. When downstream users sum the status-level rows again, cancelled and completed order rows are mixed into the customer metric.
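A minimal sketch of the intended grain, using Python's built-in sqlite3 module. The `customers` and `orders` tables, their columns, and the sample rows are hypothetical and exist only to illustrate the idea: filter to completed orders first, then group at customer level only.

```python
import sqlite3

# Hypothetical schema and rows, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, customer_name TEXT);
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, status TEXT, amount INTEGER);
    INSERT INTO customers VALUES (1, 'Asha'), (2, 'Ravi');
    INSERT INTO orders VALUES
        (10, 1, 'completed', 100),
        (11, 1, 'completed', 50),
        (12, 1, 'cancelled', 70),
        (13, 2, 'completed', 30);
""")

# Status is a filter, not a grouping column, so the result has one row per customer.
rows = conn.execute("""
    SELECT c.customer_id, c.customer_name, SUM(o.amount) AS completed_revenue
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.customer_id
    WHERE o.status = 'completed'
    GROUP BY c.customer_id, c.customer_name
    ORDER BY c.customer_id
""").fetchall()
print(rows)
```

The inner join also drops customers with no completed orders; whether those should appear with zero revenue depends on the dashboard's definition.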

Type

Broken SQL Fix

Time

18 min

Progress

Not started

Tags: SQL | Grain | Revenue | Data Quality

SQL | Beginner | Free

LEFT JOIN Turned Into INNER JOIN by WHERE Filter

You will practice

Return every active customer. For customers who clicked campaign SPRING_26, show their latest click timestamp. For customers with no click, keep the customer row and return NULL for last_click_at.

The query uses a LEFT JOIN, but a filter on the campaign table is placed in the WHERE clause. That removes NULL right-side rows and silently turns the result into an inner join for this campaign.
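The effect can be reproduced on a tiny dataset with Python's sqlite3 module. This is only a sketch: the table names (`customers`, `campaign_clicks`), column names, and rows are assumptions, not the lab's real schema. The broken query filters the right-side table in WHERE; the fixed one moves that condition into the ON clause.

```python
import sqlite3

# Hypothetical tables and rows for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER, is_active INTEGER);
    CREATE TABLE campaign_clicks (customer_id INTEGER, campaign_code TEXT, clicked_at TEXT);
    INSERT INTO customers VALUES (1, 1), (2, 1), (3, 0);
    INSERT INTO campaign_clicks VALUES
        (1, 'SPRING_26', '2026-03-01 10:00:00'),
        (1, 'SPRING_26', '2026-03-02 09:00:00'),
        (2, 'WINTER_25', '2025-12-01 08:00:00');
""")

# Broken: the WHERE filter on the right-side table discards the NULL-extended rows.
broken = conn.execute("""
    SELECT c.customer_id, MAX(k.clicked_at) AS last_click_at
    FROM customers AS c
    LEFT JOIN campaign_clicks AS k ON k.customer_id = c.customer_id
    WHERE c.is_active = 1 AND k.campaign_code = 'SPRING_26'
    GROUP BY c.customer_id
    ORDER BY c.customer_id
""").fetchall()

# Fixed: the campaign condition is part of the join, so unmatched customers survive.
fixed = conn.execute("""
    SELECT c.customer_id, MAX(k.clicked_at) AS last_click_at
    FROM customers AS c
    LEFT JOIN campaign_clicks AS k
        ON k.customer_id = c.customer_id AND k.campaign_code = 'SPRING_26'
    WHERE c.is_active = 1
    GROUP BY c.customer_id
    ORDER BY c.customer_id
""").fetchall()

print(broken)  # customer 2 is missing
print(fixed)   # customer 2 is kept with a NULL last_click_at
```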

Type

Broken SQL Fix

Time

16 min

Progress

Not started

Tags: SQL | Joins | NULLs | Retention

SQL | Intermediate | Free

Duplicate Revenue from Joining Orders to Multiple Payments and Refunds

You will practice

Return one row per order with paid_amount, refunded_amount, and net_revenue. Aggregate each child table to order_id before joining so payment and refund rows cannot multiply each other.

The mart joins orders directly to payments and refunds at row level. Because both child tables can have multiple rows per order, the join multiplies records before aggregation.
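The multiplication is easy to see on a small example. The sketch below uses Python's sqlite3 module with hypothetical `orders`, `payments`, and `refunds` tables (names, columns, and rows are assumptions). Order 1 has two payments and two refunds, so the row-level join produces four rows before aggregation.

```python
import sqlite3

# Hypothetical tables and rows for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER);
    CREATE TABLE payments (order_id INTEGER, amount INTEGER);
    CREATE TABLE refunds (order_id INTEGER, amount INTEGER);
    INSERT INTO orders VALUES (1), (2);
    INSERT INTO payments VALUES (1, 60), (1, 40), (2, 50);
    INSERT INTO refunds VALUES (1, 10), (1, 5);
""")

# Broken: both child tables are joined at row level, then summed.
broken = conn.execute("""
    SELECT o.order_id, SUM(p.amount), SUM(r.amount)
    FROM orders AS o
    LEFT JOIN payments AS p ON p.order_id = o.order_id
    LEFT JOIN refunds AS r ON r.order_id = o.order_id
    GROUP BY o.order_id
    ORDER BY o.order_id
""").fetchall()

# Fixed: each child table is reduced to one row per order_id before the join.
fixed = conn.execute("""
    WITH p AS (SELECT order_id, SUM(amount) AS paid_amount
               FROM payments GROUP BY order_id),
         r AS (SELECT order_id, SUM(amount) AS refunded_amount
               FROM refunds GROUP BY order_id)
    SELECT o.order_id,
           COALESCE(p.paid_amount, 0) AS paid_amount,
           COALESCE(r.refunded_amount, 0) AS refunded_amount,
           COALESCE(p.paid_amount, 0) - COALESCE(r.refunded_amount, 0) AS net_revenue
    FROM orders AS o
    LEFT JOIN p ON p.order_id = o.order_id
    LEFT JOIN r ON r.order_id = o.order_id
    ORDER BY o.order_id
""").fetchall()

print(broken)  # order 1 shows doubled payments and refunds
print(fixed)
```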

Type

Output Mismatch Debugging

Time

22 min

Progress

Not started

Tags: SQL | Join Explosion | Revenue | Output Mismatch

PySpark | Intermediate | Free

Append Mode Created Duplicate Daily Loads

You will practice

Make the daily write idempotent for order_date.

The PySpark job writes in append mode for a deterministic daily partition, so retries and reruns duplicate the same day.

Type

Broken PySpark Fix

Time

20 min

Progress

Not started

Tags: PySpark | Idempotency | Daily Loads | Lakehouse

PySpark | Intermediate | Free

The Endless Final Stage

You will practice

Fix the PySpark code so the pipeline is correct, scalable, and safe to rerun.

The job reaches 99 percent quickly: 199 of 200 tasks finish, but one task runs for hours with high GC and 100 percent CPU before failing with OutOfMemoryError.

Type

Broken PySpark Fix

Time

24 min

Progress

Not started

Tags: PySpark | Broken PySpark | Spark Performance and Debugging | Spark job stuck at 99% because of data skew

PySpark | Intermediate | Free

The Executor Graveyard

You will practice

Fix the PySpark code so the pipeline is correct, scalable, and safe to rerun.

The job does not freeze at one task; instead executors keep dying and getting replaced. Retries happen repeatedly and the stage eventually fails.

Type

Broken PySpark Fix

Time

24 min

Progress

Not started

Tags: PySpark | Broken PySpark | Spark Performance and Debugging | Repeated executor deaths after a wide join

PySpark | Intermediate | Free

The Shuffle Storm

You will practice

Fix the PySpark code so the pipeline is correct, scalable, and safe to rerun.

A new version replaced reduceByKey-like logic with groupByKey and several downstream transformations. Runtime doubled and shuffle traffic exploded.

Type

Broken PySpark Fix

Time

24 min

Progress

Not started

Tags: PySpark | Broken PySpark | Spark Performance and Debugging | Massive shuffle caused by bad aggregation strategy

PySpark | Beginner | Free

The Small Files Avalanche

You will practice

Fix the PySpark code so the pipeline is correct, scalable, and safe to rerun.

The job technically succeeds, but downstream reads get slower over time and cloud object store listings become painfully expensive. Each partition contains hundreds or thousands of tiny files.

Type

Broken PySpark Fix

Time

24 min

Progress

Not started

Tags: PySpark | Broken PySpark | Spark Performance and Debugging | Spark write path creates too many tiny files

PySpark | Beginner | Free

The Silent UDF Tax

You will practice

Fix the PySpark code so the pipeline is correct, scalable, and safe to rerun.

The job still succeeds, but runtime tripled and CPU utilization looks poor despite no major shuffle increase.

Type

Broken PySpark Fix

Time

24 min

Progress

Not started

Tags: PySpark | Broken PySpark | Spark Performance and Debugging | Python UDF makes a pipeline unexpectedly slow

PySpark | Intermediate | Free

The Broadcast Betrayal

You will practice

Fix the PySpark code so the pipeline is correct, scalable, and safe to rerun.

Someone forced a broadcast hint. It worked in test and smaller markets, but production now fails intermittently with broadcast timeout or executor memory pressure.

Type

Broken PySpark Fix

Time

24 min

Progress

Not started

Tags: PySpark | Broken PySpark | Spark Performance and Debugging | Wrong broadcast choice causes timeout or memory issues

PySpark | Beginner | Free

The Cache Everything Trap

You will practice

Fix the PySpark code so the pipeline is correct, scalable, and safe to rerun.

Production jobs show higher memory pressure, more spills, and worse overall runtime than before caching was added.

Type

Broken PySpark Fix

Time

24 min

Progress

Not started

Tags: PySpark | Broken PySpark | Spark Performance and Debugging | Over-caching makes the cluster slower

PySpark | Intermediate | Free

The Backfill Explosion

You will practice

Fix the PySpark code so the pipeline is correct, scalable, and safe to rerun.

The same code that works daily now runs for many hours, overloads the cluster, and causes repeated failures.

Type

Broken PySpark Fix

Time

24 min

Progress

Not started

Tags: PySpark | Broken PySpark | Spark Performance and Debugging | Daily logic fails when rerun for months of history

PySpark | Beginner | Free

The Union of Doom

You will practice

Fix the PySpark code so the pipeline is correct, scalable, and safe to rerun.

After onboarding a new market, some columns shifted position and downstream consumers started seeing nulls or incorrect values even though the job itself succeeded.

Type

Broken PySpark Fix

Time

24 min

Progress

Not started

Tags: PySpark | Broken PySpark | Spark Performance and Debugging | Schema mismatch during union creates silent data corruption risk

PySpark | Advanced | Free

Checkpoint Chaos

You will practice

Fix the PySpark code so the pipeline is correct, scalable, and safe to rerun.

After a deployment and infrastructure interruption, the stream restarts but either reprocesses events, produces duplicates, or struggles with state-store recovery.

Type

Broken PySpark Fix

Time

24 min

Progress

Not started

Tags: PySpark | Broken PySpark | Spark Performance and Debugging | Structured Streaming restarts cause duplicates or state issues

PySpark | Intermediate | Free

The AQE Surprise

You will practice

Fix the PySpark code so the pipeline is correct, scalable, and safe to rerun.

Production performance becomes inconsistent. On some days the job gets faster, but on others one stage becomes very heavy, task counts collapse, and the downstream write becomes slower than before.

Type

Broken PySpark Fix

Time

24 min

Progress

Not started

Tags: PySpark | Broken PySpark | Spark Performance and Debugging | Adaptive Query Execution changes join or partition strategy in unexpected ways

PySpark | Intermediate | Free

The Window Function Blowup

You will practice

Fix the PySpark code so the pipeline is correct, scalable, and safe to rerun.

The job does not fail immediately, but stages with sort and window execution become extremely slow, spill heavily to disk, and sometimes time out under peak loads.

Type

Broken PySpark Fix

Time

24 min

Progress

Not started

Tags: PySpark | Broken PySpark | Spark Performance and Debugging | Large window operations create spill, skew, or sort pressure

PySpark | Beginner | Free

The Null Key Funnel

You will practice

Fix the PySpark code so the pipeline is correct, scalable, and safe to rerun.

Runtime suddenly deteriorates after a source issue causes a large share of merchant_id values to arrive as null or a default placeholder, and one stage becomes badly imbalanced.

Type

Broken PySpark Fix

Time

24 min

Progress

Not started

Tags: PySpark | Broken PySpark | Spark Performance and Debugging | Null or default join keys collapse data into a pathological partition

PySpark | Intermediate | Free

The Explode Cascade

You will practice

Fix the PySpark code so the pipeline is correct, scalable, and safe to rerun.

After the change, row counts increase by orders of magnitude, a formerly healthy job now spills heavily, and downstream tables become far larger than expected.

Type

Broken PySpark Fix

Time

24 min

Progress

Not started

Tags: PySpark | Broken PySpark | Spark Performance and Debugging | Flattening nested arrays multiplies row counts and destabilizes the plan

PySpark | Beginner | Free

The Driver Memory Trap

You will practice

Fix the PySpark code so the pipeline is correct, scalable, and safe to rerun.

The executors look fine, but the application driver becomes unstable, crashes intermittently, or hangs during peak days.

Type

Broken PySpark Fix

Time

24 min

Progress

Not started

Tags: PySpark | Broken PySpark | Spark Performance and Debugging | Work is accidentally pulled back to the driver and causes instability

PySpark | Intermediate | Premium

Spark Join Slowed Down Due to Skewed Customer Key

You will practice

Choose the likely root cause.

The join key has one customer_id that owns a massive share of events, causing one reducer partition to process most rows.
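One quick way to confirm this kind of skew is to measure how much of the data the hottest key owns. The sketch below does that in plain Python on a made-up list of keys; `top_key_share` and the sample values are hypothetical, and on a real dataset the same idea would be a group-by count on the join key.

```python
from collections import Counter

def top_key_share(keys):
    """Return the most frequent key and the fraction of rows it owns."""
    counts = Counter(keys)
    key, n = counts.most_common(1)[0]
    return key, n / len(keys)

# Hypothetical event stream: one customer_id dominates the join key.
event_customer_ids = ["c1"] * 90 + ["c2"] * 5 + ["c3"] * 5
hot_key, share = top_key_share(event_customer_ids)
print(hot_key, share)
```

A single key owning most rows means a single shuffle partition receives most of the work, regardless of how many partitions are configured.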

Type

MCQ Diagnosis

Time

12 min

Progress

Not started

Tags: PySpark | Skew | Join | Performance

PySpark | Intermediate | Premium

Too Many Small Files from Hourly Writes

You will practice

Diagnose the performance issue from logs.

Each hourly job writes many tiny files into the same date partition. Metadata overhead dominates scan time.

Type

Log / Error Analysis

Time

17 min

Progress

Not started

Tags: PySpark | Small Files | Compaction | Lakehouse

Airflow | Intermediate | Premium

DAG Green but Dashboard Wrong

You will practice

Find why green status is misleading.

The DAG only checks task completion, not data freshness or row-count expectations.

Type

Log / Error Analysis

Time

18 min

Progress

Not started

Tags: Airflow | Monitoring | Data Quality | Incident

Airflow | Intermediate | Premium

Airflow Retry Reprocessed Same File and Created Duplicates

You will practice

Explain the retry/idempotency bug.

Retry behavior is not idempotent. The pipeline lacks file-level checkpointing and deduplication keys.
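The two missing safeguards can be sketched in plain Python. Everything here is an assumption for illustration: `load_file`, the `event_id` key, and the in-memory dict and set that stand in for a target table and a processed-files ledger. In a real pipeline the ledger update and the data write would need to commit together.

```python
def load_file(file_name, records, target, processed_files):
    """Load one file so that retrying the same call changes nothing."""
    # File-level checkpoint: a file that was already loaded is skipped outright.
    if file_name in processed_files:
        return 0
    inserted = 0
    for rec in records:
        key = rec["event_id"]  # hypothetical deduplication key
        # Record-level dedup: a key already in the target is not inserted again.
        if key not in target:
            target[key] = rec
            inserted += 1
    processed_files.add(file_name)
    return inserted

target, processed = {}, set()
batch = [{"event_id": "e1", "amount": 10}, {"event_id": "e2", "amount": 20}]

first = load_file("orders_2026_01_01.csv", batch, target, processed)
retry = load_file("orders_2026_01_01.csv", batch, target, processed)
# A different file that repeats e2 only contributes the new record e3.
overlap = load_file(
    "orders_2026_01_02.csv",
    [{"event_id": "e2", "amount": 20}, {"event_id": "e3", "amount": 5}],
    target,
    processed,
)
print(first, retry, overlap, len(target))
```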

Type

Mixed Lab

Time

22 min

Progress

Not started

Tags: Airflow | Retries | Idempotency | Duplicates

Data Quality | Beginner | Premium

Revenue Dropped 30% After New SUCCESSFUL Status

You will practice

Return one row per order_date with revenue from every provider status mapped to paid_success. Use the mapping table rather than hardcoding provider-specific values.

The revenue query only accepts status = 'SUCCESS'. The new provider sends 'SUCCESSFUL'.
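A minimal sketch of the mapping-table approach with Python's sqlite3 module. The `payments` and `status_mapping` tables, their column names, and the rows are hypothetical; only the statuses 'SUCCESS' and 'SUCCESSFUL' and the target value paid_success come from the scenario.

```python
import sqlite3

# Hypothetical tables and rows for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE payments (order_date TEXT, provider_status TEXT, amount INTEGER);
    CREATE TABLE status_mapping (provider_status TEXT PRIMARY KEY, canonical_status TEXT);
    INSERT INTO status_mapping VALUES
        ('SUCCESS', 'paid_success'),
        ('SUCCESSFUL', 'paid_success'),
        ('FAILED', 'paid_failed');
    INSERT INTO payments VALUES
        ('2026-01-01', 'SUCCESS', 100),
        ('2026-01-01', 'SUCCESSFUL', 30),
        ('2026-01-01', 'FAILED', 20),
        ('2026-01-02', 'SUCCESSFUL', 70);
""")

# The query filters on the canonical status, so a new provider value
# only needs a new row in status_mapping, not a query change.
revenue = conn.execute("""
    SELECT p.order_date, SUM(p.amount) AS revenue
    FROM payments AS p
    JOIN status_mapping AS m ON m.provider_status = p.provider_status
    WHERE m.canonical_status = 'paid_success'
    GROUP BY p.order_date
    ORDER BY p.order_date
""").fetchall()
print(revenue)
```

An inner join silently drops statuses that have no mapping row, so a separate check for unmapped provider statuses is worth considering alongside it.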

Type

Output Mismatch Debugging

Time

15 min

Progress

Not started

Tags: Data Quality | Status Mapping | Revenue | Monitoring

Data Quality | Intermediate | Premium

UTC to Local Timezone Boundary Broke Daily Dashboard

You will practice

Convert event_ts_utc to India business time before deriving the reporting date. Return business_date and revenue so late-night UTC events are counted on the correct India date.

The report groups by the UTC calendar date. Orders between 18:30 and 23:59 UTC belong to the next calendar day in Asia/Kolkata, so the dashboard shifts late-night revenue into the wrong business date.
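The boundary can be checked with a few timestamps in plain Python. This sketch assumes a fixed UTC+05:30 offset for Asia/Kolkata, which has no daylight saving time; the sample events and amounts are made up. The key point is that the conversion happens before the date is derived.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

# Assumption: Asia/Kolkata modeled as a fixed UTC+05:30 offset (no DST).
IST = timezone(timedelta(hours=5, minutes=30))

# Hypothetical (event_ts_utc, revenue) pairs around the 18:30 UTC boundary.
events = [
    ("2026-01-01T18:29:00+00:00", 100),  # 23:59 on Jan 1 in India
    ("2026-01-01T18:30:00+00:00", 40),   # 00:00 on Jan 2 in India
    ("2026-01-01T23:59:00+00:00", 60),   # 05:29 on Jan 2 in India
]

revenue_by_business_date = defaultdict(int)
for event_ts_utc, amount in events:
    # Convert to India time first, then take the calendar date.
    local_ts = datetime.fromisoformat(event_ts_utc).astimezone(IST)
    revenue_by_business_date[local_ts.date().isoformat()] += amount

print(dict(revenue_by_business_date))
```

Grouping the same events by UTC date would put all 200 units of revenue on January 1.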

Type

Broken SQL Fix

Time

19 min

Progress

Not started

Tags: SQL | Timezone | Reporting | Data Quality

AWS / Data Lake | Intermediate | Premium

Bad Partition Strategy by customer_id

You will practice

Choose the safest partition strategy.

customer_id is high-cardinality and query filters mostly use order_date. The lake now has millions of tiny partitions.

Type

MCQ Diagnosis

Time

10 min

Progress

Not started

Tags: AWS | S3 | Partitioning | Lakehouse

Mixed | Advanced | Premium

Late Arriving Records Changed Previous Revenue Partitions

You will practice

Diagnose the late-arrival issue.

The pipeline assumes closed daily partitions are final. It has no late-arrival correction window or restatement process.

Type

Mixed Lab

Time

25 min

Progress

Not started

Tags: Watermark | Late Data | SQL | Airflow

PySpark | Advanced | Premium

The Delta Merge Meltdown

You will practice

Fix the PySpark code so the pipeline is correct, scalable, and safe to rerun.

The merge now scans huge portions of the target table, rewrites many files, and regularly collides with maintenance tasks or downstream readers.

Type

Broken PySpark Fix

Time

24 min

Progress

Not started

Tags: PySpark | Broken PySpark | Spark Performance and Debugging | Large merge statements rewrite too much data and make jobs unstable

PySpark | Intermediate | Premium

The Speculation Confusion

You will practice

Fix the PySpark code so the pipeline is correct, scalable, and safe to rerun.

Runtime does not improve meaningfully, and in a few edge cases downstream side effects or writes become harder to reason about.

Type

Broken PySpark Fix

Time

24 min

Progress

Not started

Tags: PySpark | Broken PySpark | Spark Performance and Debugging | Speculative execution is misunderstood as a cure for skew or bad logic

PySpark | Intermediate | Premium

The File Listing Bottleneck

You will practice

Fix the PySpark code so the pipeline is correct, scalable, and safe to rerun.

Teams notice that the application spends a long time before meaningful task execution begins, and the cost of simply planning the read is becoming painful.

Type

Broken PySpark Fix

Time

24 min

Progress

Not started

Tags: PySpark | Broken PySpark | Spark Performance and Debugging | Reading huge numbers of files is slow before compute even starts

PySpark | Beginner | Premium

The Partition Pruning Mirage

You will practice

Fix the PySpark code so the pipeline is correct, scalable, and safe to rerun.

Even though the code appears selective, the scan bytes remain huge and runtime barely changes compared with a full-table read.

Type

Broken PySpark Fix

Time

24 min

Progress

Not started

Tags: PySpark | Broken PySpark | Spark Performance and Debugging | Filters look selective in code but do not actually prune the dataset

PySpark | Intermediate | Premium

The Serializer Mismatch

You will practice

Fix the PySpark code so the pipeline is correct, scalable, and safe to rerun.

The job still works, but task deserialization time, network overhead, and CPU cost increase noticeably, especially around shuffle-heavy stages.

Type

Broken PySpark Fix

Time

24 min

Progress

Not started

Tags: PySpark | Broken PySpark | Spark Performance and Debugging | Serialization choices and object-heavy code paths degrade performance

SQL | Beginner | Premium

The Duplicate Customer Nightmare

You will practice

Diagnose and fix the issue: Latest-record deduplication in SQL

Business users report duplicate customers in reporting and the metrics team is getting inconsistent counts.

Type

Broken SQL Fix

Time

20 min

Progress

Not started

Tags: SQL | Broken SQL | SQL and Warehousing Scenarios | Latest-record deduplication in SQL

SQL | Beginner | Premium

The NULL Trap

You will practice

Diagnose and fix the issue: NOT IN returns unexpected results because of NULL handling

The result set comes back empty or much smaller than expected, even though everyone knows there are customers without orders.
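The behaviour is reproducible with three customers and one NULL in the subquery. The sketch uses Python's sqlite3 module; the `customers` and `orders` tables and their rows are hypothetical. A single NULL customer_id in `orders` makes every NOT IN comparison evaluate to unknown, while NOT EXISTS is unaffected.

```python
import sqlite3

# Hypothetical tables and rows for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER);
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER);
    INSERT INTO customers VALUES (1), (2), (3);
    INSERT INTO orders VALUES (100, 1), (101, NULL);
""")

# Broken: `2 NOT IN (1, NULL)` is unknown, so no row passes the filter.
not_in = conn.execute("""
    SELECT customer_id FROM customers
    WHERE customer_id NOT IN (SELECT customer_id FROM orders)
    ORDER BY customer_id
""").fetchall()

# Fixed: NOT EXISTS only asks whether a matching order row exists.
not_exists = conn.execute("""
    SELECT c.customer_id FROM customers AS c
    WHERE NOT EXISTS (
        SELECT 1 FROM orders AS o WHERE o.customer_id = c.customer_id
    )
    ORDER BY c.customer_id
""").fetchall()

print(not_in)      # empty
print(not_exists)  # customers 2 and 3
```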

Type

Broken SQL Fix

Time

20 min

Progress

Not started

Tags: SQL | Log Analysis | SQL and Warehousing Scenarios | NOT IN returns unexpected results because of NULL handling

SQL | Intermediate | Premium

The History Table Dilemma

You will practice

Diagnose and fix the issue: Designing SCD Type 2 correctly

The interviewer asks how you would design the dimension table so analysts can answer 'what did we know at that point in time?'

Type

Log / Error Analysis

Time

20 min

Progress

Not started

Tags: SQL | Log Analysis | SQL and Warehousing Scenarios | Designing SCD Type 2 correctly

SQL | Intermediate | Premium

The Double Counting Report

You will practice

Diagnose and fix the issue: Join pattern causes duplicated fact rows

The SQL runs fine, but totals are inflated because the join grain is misunderstood.

Type

Broken SQL Fix

Time

20 min

Progress

Not started

Tags: SQL | Broken SQL | SQL and Warehousing Scenarios | Join pattern causes duplicated fact rows

SQL | Intermediate | Premium

The Query That Became Slow Overnight

You will practice

Diagnose and fix the issue: A previously fast SQL query regresses badly

The business did not change the SQL, but performance deteriorated sharply.

Type

Log / Error Analysis

Time

20 min

Progress

Not started

Tags: SQL | Broken SQL | SQL and Warehousing Scenarios | A previously fast SQL query regresses badly

SQL | Advanced | Premium

The Merge Contention Problem

You will practice

Diagnose and fix the issue: Concurrent upserts create lock contention or deadlocks

During peak hours, merges slow down, block each other, or deadlock.

Type

Log / Error Analysis

Time

20 min

Progress

Not started

Tags: SQL | Log Analysis | SQL and Warehousing Scenarios | Concurrent upserts create lock contention or deadlocks

SQL | Intermediate | Premium

The Late Dimension Problem

You will practice

Diagnose and fix the issue: Facts arrive before dimension rows

Fact rows fail foreign-key checks or end up with missing enrichments, and analysts complain about 'unknown' categories.

Type

Log / Error Analysis

Time

20 min

Progress

Not started

Tags: SQL | Log Analysis | SQL and Warehousing Scenarios | Facts arrive before dimension rows

SQL | Advanced | Premium

The Snapshot Consistency Debate

You will practice

Diagnose and fix the issue: Reading from mutable source tables creates inconsistent extracts

Counts between related tables do not reconcile because the source is changing during extraction.

Type

Log / Error Analysis

Time

20 min

Progress

Not started

Tags: SQL | Log Analysis | SQL and Warehousing Scenarios | Reading from mutable source tables creates inconsistent extracts

SQL | Intermediate | Premium

The Incremental Load Gone Wrong

You will practice

Diagnose and fix the issue: Watermark logic misses or duplicates records

Users notice some records are missing while others are loaded twice after retries and timezone inconsistencies.

Type

Log / Error Analysis

Time

20 min

Progress

Not started

Tags: SQL | Log Analysis | SQL and Warehousing Scenarios | Watermark logic misses or duplicates records

SQL | Intermediate | Premium

The Trusted Summary Table

You will practice

Diagnose and fix the issue: Materialized summary vs querying raw detail

The interviewer asks whether you would keep querying raw data or build summary tables.

Type

Log / Error Analysis

Time

20 min

Progress

Not started

Tags: SQL | Log Analysis | SQL and Warehousing Scenarios | Materialized summary vs querying raw detail

SQL | Beginner | Premium

The Top-N Tie Trap

You will practice

Diagnose and fix the issue: Window-function ranking returns unstable or duplicated Top-N results

Different runs or engine migrations produce slightly different results whenever ties occur, and stakeholders question why the dashboard is not stable.
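The instability comes from an ORDER BY that does not fully determine row order. The sketch below uses Python's sqlite3 module and assumes the bundled SQLite supports window functions (version 3.25 or later); the `product_revenue` table and its rows are hypothetical. It contrasts RANK, which returns every tied row, with ROW_NUMBER plus an explicit tiebreaker column.

```python
import sqlite3

# Hypothetical table and rows: B and C tie for second place.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product_revenue (product_id TEXT, revenue INTEGER);
    INSERT INTO product_revenue VALUES ('A', 100), ('B', 90), ('C', 90), ('D', 80);
""")

# RANK keeps ties together, so a "top 2" can return three rows.
top2_rank = conn.execute("""
    SELECT product_id, revenue FROM (
        SELECT product_id, revenue,
               RANK() OVER (ORDER BY revenue DESC) AS rk
        FROM product_revenue
    ) AS ranked
    WHERE rk <= 2
    ORDER BY rk, product_id
""").fetchall()

# ROW_NUMBER with a tiebreaker always returns exactly two rows, in a fixed order.
top2_row_number = conn.execute("""
    SELECT product_id, revenue FROM (
        SELECT product_id, revenue,
               ROW_NUMBER() OVER (ORDER BY revenue DESC, product_id ASC) AS rn
        FROM product_revenue
    ) AS ranked
    WHERE rn <= 2
    ORDER BY rn
""").fetchall()

print(top2_rank)
print(top2_row_number)
```

Which behaviour is correct is a business decision; the point is to make the tie rule explicit rather than leave it to the engine.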

Type

Log / Error Analysis

Time

20 min

Progress

Not started

Tags: SQL | Log Analysis | SQL and Warehousing Scenarios | Window-function ranking returns unstable or duplicated Top-N results

SQL | Intermediate | Premium

The Fact Correction Dilemma

You will practice

Diagnose and fix the issue: Correcting fact-table errors without breaking auditability

The team debates whether to overwrite the old fact row, insert a corrected version, or maintain separate adjustment records.

Type

Log / Error Analysis

Time

20 min

Progress

Not started

Tags: SQL | Log Analysis | SQL and Warehousing Scenarios | Correcting fact-table errors without breaking auditability

SQL | Intermediate | Premium

The FX Conversion Mismatch

You will practice

Diagnose and fix the issue: Currency conversion logic creates inconsistent financial reporting

Different teams get different totals because some queries use transaction-date rates, others use month-end rates, and refunds are handled inconsistently.

Type

Log / Error Analysis

Time

20 min

Progress

Not started

Tags: SQL | Log Analysis | SQL and Warehousing Scenarios | Currency conversion logic creates inconsistent financial reporting

SQL | Intermediate | Premium

The JSON Fanout Query

You will practice

Diagnose and fix the issue: Flattening semi-structured fields in SQL creates double counting and high cost

A new query flattens multiple arrays out of the JSON and joins them back to orders, but revenue and counts become inflated while runtime increases sharply.

Type

Log / Error Analysis

Time

20 min

Progress

Not started

Tags: SQL | Log Analysis | SQL and Warehousing Scenarios | Flattening semi-structured fields in SQL creates double counting and high cost

SQL | Intermediate | Premium

The MERGE with Duplicate Matches

You will practice

Diagnose and fix the issue: MERGE fails or behaves unpredictably because source keys are not unique

Some loads fail with multiple-match errors, while others succeed but produce hard-to-explain outcomes because the source contains duplicate keys in the same batch.
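One common remedy is to reduce the source batch to a single row per key before the MERGE runs. The sketch below shows that step in plain Python; `latest_per_key`, the `customer_id` key, and the `updated_at` ordering column are assumptions for illustration, and in SQL the same step is often written as a ROW_NUMBER filter over the staging data.

```python
def latest_per_key(batch, key="customer_id", order_by="updated_at"):
    """Keep only the most recent row for each key in a source batch."""
    latest = {}
    for row in batch:
        current = latest.get(row[key])
        # ISO-8601 timestamp strings compare correctly as plain strings.
        if current is None or row[order_by] > current[order_by]:
            latest[row[key]] = row
    return list(latest.values())

# Hypothetical batch in which customer 1 appears twice.
batch = [
    {"customer_id": 1, "email": "old@example.com", "updated_at": "2026-01-01T09:00:00"},
    {"customer_id": 2, "email": "b@example.com", "updated_at": "2026-01-01T10:00:00"},
    {"customer_id": 1, "email": "new@example.com", "updated_at": "2026-01-01T11:00:00"},
]

deduped = sorted(latest_per_key(batch), key=lambda r: r["customer_id"])
print(deduped)
```

If two rows for the same key share the same `updated_at`, this sketch keeps the first one seen, so a second tiebreaker column may be needed in practice.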

Type

Log / Error Analysis

Time

20 min

Progress

Not started

Tags: SQL | Log Analysis | SQL and Warehousing Scenarios | MERGE fails or behaves unpredictably because source keys are not unique

SQL | Beginner | Premium

The Timezone Boundary Bug

You will practice

Diagnose and fix the issue: Date-based reports are wrong because local and UTC boundaries are mixed

Counts near midnight look wrong in several countries, and month-end totals differ between dashboards built by different teams.

Type

Log / Error Analysis

Time

20 min

Progress

Not started

Tags: SQL | Log Analysis | SQL and Warehousing Scenarios | Date-based reports are wrong because local and UTC boundaries are mixed

SQL | Intermediate | Premium

The Snapshot Join Trap

You will practice

Diagnose and fix the issue: Joining periodic snapshots to transactions creates misleading metrics

The resulting query runs, but inventory metrics are overstated or misaligned because the snapshot grain and transaction grain do not line up cleanly.

Type

Log / Error Analysis

Time

20 min

Progress

Not started

Tags: SQL | Log Analysis | SQL and Warehousing Scenarios | Joining periodic snapshots to transactions creates misleading metrics

SQL | Advanced | Premium

The Chasm Trap

You will practice

Diagnose and fix the issue: Joining multiple fact tables through shared dimensions produces fanout

The team joins three fact tables through common dimensions and produces impressive-looking dashboards, but conversion and revenue numbers are inflated in subtle ways.

Type

Log / Error Analysis

Time

20 min

Progress

Not started

Tags: SQL | Log Analysis | SQL and Warehousing Scenarios | Joining multiple fact tables through shared dimensions produces fanout

SQL | Intermediate | Premium

The Distinct Count at Scale

You will practice

Diagnose and fix the issue: High-cardinality distinct counting becomes expensive and inconsistent

Different teams compute the metrics with different SQL patterns, runtimes are high, and slight logic differences create disagreement about the official number.

Type

Log / Error Analysis

Time

20 min

Progress

Not started

Tags: SQL | Log Analysis | SQL and Warehousing Scenarios | High-cardinality distinct counting becomes expensive and inconsistent

SQL | Advanced | Premium

The Restatement vs Close Debate

You will practice

Diagnose and fix the issue: Closed financial periods conflict with late-arriving changes

Business teams disagree on whether historical dashboards should change, whether the closed month should stay frozen, and how to reconcile operational truth with published finance truth.

Type

Log / Error Analysis

Time

20 min

Progress

Not started

Tags: SQL | Log Analysis | SQL and Warehousing Scenarios | Closed financial periods conflict with late-arriving changes

Airflow | Beginner | Premium

The Green DAG with Bad Data

You will practice

Diagnose and fix the issue: Tasks succeed but output is incomplete

All tasks are green, but finance reports missing data because the input file was truncated and the transformation script never validated record completeness.

Type

Log / Error Analysis

Time

20 min

Progress

Not started

Tags: Airflow | Log Analysis | Airflow and Reliability Scenarios | Tasks succeed but output is incomplete

Airflow | Intermediate | Premium

The Backfill Hell

You will practice

Diagnose and fix the issue: Historical reruns create duplicates and dependency chaos

When the team reruns historical dates manually, some datasets duplicate and downstream DAGs process mixed old and new data.

Type

Log / Error Analysis

Time

20 min

Progress

Not started

Tags: Airflow | Log Analysis | Airflow and Reliability Scenarios | Historical reruns create duplicates and dependency chaos

Airflow | Beginner | Premium

The Sensor Gridlock

You will practice

Diagnose and fix the issue: Late upstream files clog worker slots and scheduler throughput

Simple poke sensors occupy many worker slots for hours, delaying unrelated pipelines and making the scheduler look unhealthy.

Type

Log / Error Analysis

Time

20 min

Progress

Not started

Tags: Airflow | Log Analysis | Airflow and Reliability Scenarios | Late upstream files clog worker slots and scheduler throughput

Airflow | Intermediate | Premium

The Dynamic Task Explosion

You will practice

Diagnose and fix the issue: Task generation scales beyond scheduler comfort

The scheduler becomes slow, the UI becomes noisy, and task state management itself becomes the bottleneck.

Type

Log / Error Analysis

Time

20 min

Progress

Not started

Tags: Airflow | Log Analysis | Airflow and Reliability Scenarios | Task generation scales beyond scheduler comfort

Airflow | Beginner | Premium

The Retry Illusion

You will practice

Diagnose and fix the issue: Automatic retries hide a deterministic data bug

One transform task intermittently succeeds on the third or fourth retry, but the resulting data is inconsistent and the root cause remains unresolved.

Type

Log / Error Analysis

Time

20 min

Progress

Not started

Tags: Airflow | Log Analysis | Airflow and Reliability Scenarios | Automatic retries hide a deterministic data bug

Airflow | Intermediate | Premium

The Secret Rotation Outage

You will practice

Diagnose and fix the issue: Credential change breaks many DAGs at once

Some tasks still use old environment variables, some use Airflow connections, and no one is sure which pipelines are impacted.

Type

Log / Error Analysis

Time

20 min

Progress

Not started

Tags: Airflow | Log Analysis | Airflow and Reliability Scenarios | Credential change breaks many DAGs at once

Airflow | Intermediate | Premium

The Idempotency Question

You will practice

Diagnose and fix the issue: How do you rerun a failed DAG safely?

The question is broad, but the interviewer wants to know whether you have practical patterns for safe re-execution.

Type

Log / Error Analysis

Time

20 min

Progress

Not started

Tags: Airflow | Log Analysis | Airflow and Reliability Scenarios | How do you rerun a failed DAG safely?

Airflow | Intermediate | Premium

The SLA vs Throughput Trade-off

You will practice

Diagnose and fix the issue: A DAG misses SLA under shared-cluster pressure

On busy days the report misses its 7 AM SLA because the downstream Spark tasks wait for cluster capacity.

Type

Mixed Lab

Time

20 min

Progress

Not started

Tags: Airflow | Mixed Lab | Airflow and Reliability Scenarios | A DAG misses SLA under shared-cluster pressure

Airflow | Beginner | Premium

The Catchup Stampede

You will practice

Diagnose and fix the issue: Enabling catchup unleashes too many historical runs at once

Worker slots fill up, recent daily jobs queue behind old runs, and platform stability degrades because the DAG suddenly behaves like a backfill workload.

Type

Log / Error Analysis

Time

20 min

Progress

Not started

Tags: Airflow | Log Analysis | Airflow and Reliability Scenarios | Enabling catchup unleashes too many historical runs at once

Airflow | Intermediate | Premium

The Zombie Task Mystery

You will practice

Diagnose and fix the issue: Worker death leaves tasks stuck in misleading running states

Operations sees tasks that remain in running or uncertain states even though the actual process is gone, causing confusion about whether to retry, clear, or wait.

Type

Log / Error Analysis

Time

20 min

Progress

Not started

Tags: Airflow | Log Analysis | Airflow and Reliability Scenarios | Worker death leaves tasks stuck in misleading running states

Airflow | Beginner | Premium

The XCom Bloat Problem

You will practice

Diagnose and fix the issue: Teams misuse XCom for large payloads and degrade Airflow itself

The DAG still functions for a while, but the metadata database grows, UI pages become sluggish, and scheduler behavior becomes less healthy.

Type

Mixed Lab

Time

20 min

Progress

Not started

Tags: Airflow | Mixed Lab | Airflow and Reliability Scenarios | Teams misuse XCom for large payloads and degrade Airflow itself

Airflow | Intermediate | Premium

The DAG Parse Bottleneck

You will practice

Diagnose and fix the issue: Heavy top-level code makes the scheduler slow before tasks even run

The scheduler begins lagging even when worker capacity is fine, and new DAGs or task changes take too long to appear.

Type

Log / Error Analysis

Time

20 min

Progress

Not started

Tags: Airflow | Log Analysis | Airflow and Reliability Scenarios | Heavy top-level code makes the scheduler slow before tasks even run

Airflow | Intermediate | Premium

The Pagination Retry Loop

You will practice

Diagnose and fix the issue: External API ingestion retries create duplicates or inconsistent slices

Retries make some runs succeed eventually, but duplicate pages, missing final pages, or inconsistent cursor state begin appearing in the landed data.

Type

Log / Error Analysis

Time

20 min

Progress

Not started

Tags: Airflow | Log Analysis | Airflow and Reliability Scenarios | External API ingestion retries create duplicates or inconsistent slices

Airflow | Intermediate | Premium

The Premature Publish Race

You will practice

Diagnose and fix the issue: Downstream data is published before all upstream slices are truly ready

Occasionally the publish step runs after only part of the data is truly ready, and downstream dashboards momentarily show mixed-day or partially refreshed results.

Type

Log / Error Analysis

Time

20 min

Progress

Not started

Tags: Airflow | Log Analysis | Airflow and Reliability Scenarios | Downstream data is published before all upstream slices are truly ready

Airflow | Beginner | Premium

The Ownership Vacuum

You will practice

Diagnose and fix the issue: No clear owner responds when DAGs fail repeatedly

Operational noise grows, SLAs are missed, and incident resolution is slow because the technical problem is compounded by missing ownership and escalation rules.

Type

Log / Error Analysis

Time

20 min

Progress

Not started

Tags: Airflow | Log Analysis | Airflow and Reliability Scenarios | No clear owner responds when DAGs fail repeatedly

Airflow | Intermediate | Premium

The Cross-DAG Dependency Trap

You will practice

Diagnose and fix the issue: External DAG dependencies become brittle as schedules evolve

What used to be a simple dependency now causes missed runs, false waiting, or accidental deadlocks because the cross-DAG contract is no longer explicit.

Type

Log / Error Analysis

Time

20 min

Progress

Not started

Tags: Airflow | Log Analysis | Airflow and Reliability Scenarios | External DAG dependencies become brittle as schedules evolve

Airflow | Advanced | Premium

The Metadata DB Pressure Wave

You will practice

Diagnose and fix the issue: Airflow's metadata database becomes the bottleneck

UI pages load slowly, scheduling becomes delayed, and operational tasks such as clearing runs or browsing logs feel increasingly painful.

Type

Log / Error Analysis

Time

20 min

Progress

Not started

Airflow | Log Analysis | Airflow and Reliability Scenarios | Airflow's metadata database becomes the bottleneck
Airflow | Intermediate | Premium

The Environment Drift Outage

You will practice

Diagnose and fix the issue: Same DAG behaves differently across dev, staging, and prod

Teams discover differences in library versions, Airflow connections, environment variables, or executor configuration that changed task behavior in subtle ways.

Type

Mixed Lab

Time

20 min

Progress

Not started

Airflow | Mixed Lab | Airflow and Reliability Scenarios | Same DAG behaves differently across dev, staging, and prod
Airflow | Intermediate | Premium

The Run Config Chaos

You will practice

Diagnose and fix the issue: Manual triggers with custom params create irreproducible outputs

The flexibility helps in emergencies, but over time the team loses confidence in what different runs actually did because parameters were inconsistent and not governed.

Type

Log / Error Analysis

Time

20 min

Progress

Not started

Airflow | Log Analysis | Airflow and Reliability Scenarios | Manual triggers with custom params create irreproducible outputs
Airflow | Intermediate | Premium

The Maintenance Window Replay

You will practice

Diagnose and fix the issue: Platform upgrades or pauses create a messy restart backlog

After the platform returns, some teams want immediate replay, others want to skip non-critical intervals, and several DAGs collide as they all try to recover at once.

Type

Log / Error Analysis

Time

20 min

Progress

Not started

Airflow | Log Analysis | Airflow and Reliability Scenarios | Platform upgrades or pauses create a messy restart backlog
Mixed | Beginner | Premium

The Partitioning Disaster

You will practice

Diagnose and fix the issue: High-cardinality partition column destroys performance

After a few months, the table has millions of partitions and both writes and reads become painful.

Type

Mixed Lab

Time

20 min

Progress

Not started

Mixed | Mixed Lab | Data Lake and Lakehouse Scenarios | High-cardinality partition column destroys performance
Kafka | Intermediate | Premium

The Schema Evolution Shock

You will practice

Diagnose and fix the issue: New fields or changed types break downstream jobs

A silver pipeline either starts failing on new schema versions or silently drops important fields, depending on configuration.

Type

Log / Error Analysis

Time

20 min

Progress

Not started

Kafka | Log Analysis | Data Lake and Lakehouse Scenarios | New fields or changed types break downstream jobs
Mixed | Intermediate | Premium

The Retention Regret

You will practice

Diagnose and fix the issue: Old data needed for rerun was vacuumed or expired

An audit later requires replaying or verifying historical states, but the supporting files are gone.

Type

Log / Error Analysis

Time

20 min

Progress

Not started

Mixed | Log Analysis | Data Lake and Lakehouse Scenarios | Old data needed for rerun was vacuumed or expired
Mixed | Intermediate | Premium

The Metadata Swamp

You will practice

Diagnose and fix the issue: Too many partitions and files make the table hard to plan

Even simple queries spend a long time in planning before actual compute begins.

Type

Log / Error Analysis

Time

20 min

Progress

Not started

Mixed | Log Analysis | Data Lake and Lakehouse Scenarios | Too many partitions and files make the table hard to plan
Mixed | Advanced | Premium

The Delete Semantics Problem

You will practice

Diagnose and fix the issue: How to model deletes in CDC pipelines

The team handles inserts and updates, but deleted records keep appearing in downstream current-state tables because delete events were ignored or treated inconsistently.

Type

Mixed Lab

Time

20 min

Progress

Not started

Mixed | Mixed Lab | Data Lake and Lakehouse Scenarios | How to model deletes in CDC pipelines
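The delete problem above usually comes down to the apply step treating delete events as a no-op. A minimal sketch of a current-state apply loop that honors deletes is shown here; the `"c"`, `"u"`, `"d"` operation codes and the tuple layout are assumptions for illustration, not a specific connector's format.

```python
from typing import Dict, Iterable, Tuple

# (op, key, row); op codes "c" (create), "u" (update), "d" (delete) are assumed
Change = Tuple[str, str, dict]


def apply_changes(state: Dict[str, dict], changes: Iterable[Change]) -> Dict[str, dict]:
    """Apply ordered change events to a current-state table held as a dict."""
    for op, key, row in changes:
        if op == "d":
            # Remove the key; tolerate replayed deletes for an absent key.
            state.pop(key, None)
        elif op in ("c", "u"):
            # Upsert: creates and updates both set the latest row image.
            state[key] = row
        else:
            # Fail loudly instead of silently ignoring unknown operations.
            raise ValueError(f"unknown op {op!r}")
    return state
```

Raising on unknown operations is a deliberate choice in this sketch: silently skipping them is how ignored deletes tend to go unnoticed.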
Mixed | Intermediate | Premium

The Compaction Strategy Interview

You will practice

Diagnose and fix the issue: How do you choose file size and compaction cadence?

The interviewer wants more than 'I would optimize the table'.

Type

Mixed Lab

Time

20 min

Progress

Not started

Mixed | Mixed Lab | Data Lake and Lakehouse Scenarios | How do you choose file size and compaction cadence?
Mixed | Advanced | Premium

The Multi-Writer Collision

You will practice

Diagnose and fix the issue: Multiple jobs write the same table inconsistently

The table sometimes shows missing rows, overwritten partitions, or transaction conflicts depending on timing.

Type

Mixed Lab

Time

20 min

Progress

Not started

Mixed | Mixed Lab | Data Lake and Lakehouse Scenarios | Multiple jobs write the same table inconsistently
Mixed | Beginner | Premium

The Format Decision

You will practice

Diagnose and fix the issue: Choosing Parquet vs Avro vs JSON for different layers

The interviewer wants to know whether you understand why one format is not best for every layer.

Type

Mixed Lab

Time

20 min

Progress

Not started

Mixed | Mixed Lab | Data Lake and Lakehouse Scenarios | Choosing Parquet vs Avro vs JSON for different layers
SQL | Beginner | Premium

The Time Travel Misread

You will practice

Diagnose and fix the issue: Analysts query the wrong table version and draw false conclusions

Confusion grows because people compare the wrong version timestamps, assume current schemas apply to older versions, and sometimes publish conclusions from a snapshot that was never the official business close.

Type

Log / Error Analysis

Time

20 min

Progress

Not started

SQL | Broken SQL | Data Lake and Lakehouse Scenarios | Analysts query the wrong table version and draw false conclusions
Kafka | Advanced | Premium

The Vacuum vs Streaming Conflict

You will practice

Diagnose and fix the issue: Aggressive cleanup breaks slow readers or long-running streaming jobs

After cleanup is made more aggressive, some readers fail because files referenced by their transaction state are no longer available, and teams argue about whether the retention policy or the consumers are at fault.

Type

Log / Error Analysis

Time

20 min

Progress

Not started

Kafka | Log Analysis | Data Lake and Lakehouse Scenarios | Aggressive cleanup breaks slow readers or long-running streaming jobs
Mixed | Intermediate | Premium

The Dynamic Overwrite Blast Radius

You will practice

Diagnose and fix the issue: Partition overwrite logic replaces more data than intended

After a code or schema change, the job deletes or replaces a wider slice than intended, and downstream teams discover that previously healthy partitions have disappeared.

Type

Log / Error Analysis

Time

20 min

Progress

Not started

Mixed | Log Analysis | Data Lake and Lakehouse Scenarios | Partition overwrite logic replaces more data than intended
Mixed | Intermediate | Premium

The Rewrite Amplification Problem

You will practice

Diagnose and fix the issue: Small logical updates cause large physical file rewrites

In practice, the update workflow rewrites many files, consumes large compute budgets, and slows both writers and readers.

Type

Log / Error Analysis

Time

20 min

Progress

Not started

Mixed | Log Analysis | Data Lake and Lakehouse Scenarios | Small logical updates cause large physical file rewrites
Mixed | Intermediate | Premium

The Data Skipping Disappointment

You will practice

Diagnose and fix the issue: Table statistics exist, but queries still scan far more than expected

Teams then discover that some important queries still scan heavily, leading to disappointment and confusion about why the advertised optimization is not working well.

Type

Mixed Lab

Time

20 min

Progress

Not started

Mixed | Mixed Lab | Data Lake and Lakehouse Scenarios | Table statistics exist, but queries still scan far more than expected
Mixed | Beginner | Premium

The Clone vs Copy Decision

You will practice

Diagnose and fix the issue: Teams need safe experimentation without corrupting production tables

Some engineers propose copying the entire table to a new location, while others suggest cloning, snapshotting, or creating a shallow environment-specific branch depending on platform capabilities.

Type

Log / Error Analysis

Time

20 min

Progress

Not started

Mixed | Log Analysis | Data Lake and Lakehouse Scenarios | Teams need safe experimentation without corrupting production tables
Kafka | Intermediate | Premium

The Decimal Precision Shock

You will practice

Diagnose and fix the issue: Schema evolution changes numeric precision and silently alters downstream behavior

Some downstream jobs fail, while others continue but round values differently, causing subtle mismatches in finance and reconciliation reports.

Type

Mixed Lab

Time

20 min

Progress

Not started

Kafka | Mixed Lab | Data Lake and Lakehouse Scenarios | Schema evolution changes numeric precision and silently alters downstream behavior
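To see how a scale change can shift reconciliation totals without any job failing, the sketch below compares rounding each amount to two decimal places before summing against summing first and rounding once. The amounts and the `rescale` helper are illustrative assumptions; the rounding mode a real engine applies may differ.

```python
from decimal import Decimal, ROUND_HALF_UP


def rescale(value: Decimal, scale: int) -> Decimal:
    """Round a value to `scale` decimal places using half-up rounding."""
    quantum = Decimal(1).scaleb(-scale)  # scale=2 gives Decimal("0.01")
    return value.quantize(quantum, rounding=ROUND_HALF_UP)


# Four line items stored with three decimal places (illustrative data).
amounts = [Decimal("0.125")] * 4

# Upstream keeps scale 3 and the report rounds once at the end.
total_then_round = rescale(sum(amounts, Decimal(0)), 2)

# After the schema change each row is narrowed to scale 2 before summing.
round_then_total = sum((rescale(a, 2) for a in amounts), Decimal(0))

print(total_then_round)  # 0.50
print(round_then_total)  # 0.52
```

The two-cent gap comes purely from where the rounding happens, which is why such mismatches tend to surface in finance reports rather than in job logs.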
SQL | Advanced | Premium

The Manifest Consistency Gap

You will practice

Diagnose and fix the issue: External query engines see stale or inconsistent table state

Writes succeed in the source engine, but one or more external consumers see stale files, partial state, or delayed visibility, creating cross-tool inconsistency.

Type

Log / Error Analysis

Time

20 min

Progress

Not started

SQL | Log Analysis | Data Lake and Lakehouse Scenarios | External query engines see stale or inconsistent table state
Mixed | Advanced | Premium

The Erasure Request Challenge

You will practice

Diagnose and fix the issue: Personal-data deletion requests collide with append-only historical design

The data exists across raw, refined, and serving layers, and some datasets are intentionally append-only for audit or replay. Teams struggle to honor deletion without breaking lineage or creating hidden remnants.

Type

Log / Error Analysis

Time

20 min

Progress

Not started

Mixed | Log Analysis | Data Lake and Lakehouse Scenarios | Personal-data deletion requests collide with append-only historical design
Mixed | Intermediate | Premium

The CDC Bootstrap Merge

You will practice

Diagnose and fix the issue: Combining initial snapshot load with ongoing CDC creates duplicates or ordering issues

After cutover, duplicate keys and inconsistent ordering appear because the snapshot and change stream overlap imperfectly.

Type

Mixed Lab

Time

20 min

Progress

Not started

Mixed | Mixed Lab | Data Lake and Lakehouse Scenarios | Combining initial snapshot load with ongoing CDC creates duplicates or ordering issues
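One way to reason about the snapshot and change-stream overlap above is to tag every row with the source log position it reflects and let a change win only if its position is strictly newer. The sketch below assumes such a monotonically increasing position is available for both the snapshot and the stream; the tuple layouts are hypothetical, and deletes are left out for brevity.

```python
from typing import Dict, Iterable, Tuple

Versioned = Tuple[int, dict]       # (log_position, payload)
Change = Tuple[str, int, dict]     # (key, log_position, payload)


def merge_snapshot_and_changes(
    snapshot: Dict[str, Versioned],
    changes: Iterable[Change],
) -> Dict[str, Versioned]:
    """Apply changes over a snapshot, keeping the newest position per key."""
    state = dict(snapshot)
    for key, position, payload in changes:
        current = state.get(key)
        # Skip changes the snapshot already reflects, plus stale replays.
        if current is None or position > current[0]:
            state[key] = (position, payload)
    return state
```

With this rule, replaying the overlapping part of the stream is harmless: changes at or below the snapshot position are dropped instead of creating duplicate keys.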
Mixed | Intermediate | Premium

The Orphan File Incident

You will practice

Diagnose and fix the issue: Failed writes or manual operations leave files not tracked by the table state

Confusion follows because some engineers want to clean them manually, others worry about breaking recovery, and storage costs slowly creep upward from unmanaged remnants.

Type

Log / Error Analysis

Time

20 min

Progress

Not started

Mixed | Log Analysis | Data Lake and Lakehouse Scenarios | Failed writes or manual operations leave files not tracked by the table state
Mixed | Intermediate | Premium

The Catalog Drift Problem

You will practice

Diagnose and fix the issue: Table metadata in the catalog diverges from actual storage or contract reality

Eventually consumers see schema mismatches, stale locations, or conflicting definitions for what they believe is the same dataset.

Type

Log / Error Analysis

Time

20 min

Progress

Not started

Mixed | Log Analysis | Data Lake and Lakehouse Scenarios | Table metadata in the catalog diverges from actual storage or contract reality
Kafka | Intermediate | Premium

The Consumer Lag Crisis

You will practice

Diagnose and fix the issue: Kafka consumers fall behind during traffic spikes

During campaigns and product launches, consumer lag spikes badly and downstream freshness SLAs are missed.

Type

Log / Error Analysis

Time

20 min

Progress

Not started

Kafka | Log Analysis | Streaming and Kafka Scenarios | Kafka consumers fall behind during traffic spikes
Kafka | Advanced | Premium

The Exactly-Once Myth

You will practice

Diagnose and fix the issue: Candidate confuses source offsets with end-to-end exactly once

The interviewer pushes back and asks whether duplicates are still possible in the target warehouse or lakehouse.

Type

Log / Error Analysis

Time

20 min

Progress

Not started

Kafka | Log Analysis | Streaming and Kafka Scenarios | Candidate confuses source offsets with end-to-end exactly once
Kafka | Intermediate | Premium

The Out-of-Order Event Puzzle

You will practice

Diagnose and fix the issue: Late and out-of-order events distort aggregates

Users in unstable network conditions send events late, so dashboards change retroactively or show inconsistent counts.

Type

Mixed Lab

Time

20 min

Progress

Not started

Kafka | Mixed Lab | Streaming and Kafka Scenarios | Late and out-of-order events distort aggregates
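The retroactive dashboard changes above are usually handled with a watermark: a window accepts events until the watermark passes its end, and anything later is routed to a correction path. The sketch below is a simplified, single-threaded illustration with tumbling windows; the integer timestamps and the `allowed_lateness` parameter are assumptions, and real stream processors track watermarks per partition.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple


def windowed_counts(
    events: Iterable[int],
    window: int,
    allowed_lateness: int,
) -> Tuple[Dict[int, int], List[int]]:
    """Count event times per tumbling window; return (counts, late_events)."""
    counts: Dict[int, int] = defaultdict(int)
    late: List[int] = []
    watermark: Optional[int] = None
    for event_time in events:  # iterated in arrival order
        # Watermark trails the largest event time seen by allowed_lateness.
        candidate = event_time - allowed_lateness
        watermark = candidate if watermark is None else max(watermark, candidate)
        window_start = event_time - event_time % window
        if window_start + window <= watermark:
            late.append(event_time)  # window already closed
        else:
            counts[window_start] += 1
    return dict(counts), late
```

For arrival order `[1, 12, 3, 25, 4]` with `window=10` and `allowed_lateness=10`, the event at time 3 is still counted, while the event at time 4 arrives after the watermark has reached 15 and is returned as late.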
Kafka | Intermediate | Premium

The Hot Partition Problem

You will practice

Diagnose and fix the issue: One Kafka partition carries disproportionate traffic

A few enterprise customers generate far more traffic than others, causing one or two partitions to carry most of the load while others remain underutilized.

Type

Log / Error Analysis

Time

20 min

Progress

Not started

Kafka | Log Analysis | Streaming and Kafka Scenarios | One Kafka partition carries disproportionate traffic
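A frequently discussed mitigation for the skew above is key salting: appending a small rotating suffix to a heavy key so its events spread over several partitions. The sketch below only illustrates the idea; the MD5-based hash is a stand-in and not the hash Kafka's default partitioner uses, and `salt_buckets` and `sequence` are hypothetical parameters.

```python
import hashlib


def partition_for(
    key: str,
    num_partitions: int,
    salt_buckets: int = 1,
    sequence: int = 0,
) -> int:
    """Map a key to a partition, optionally spreading it over salt buckets."""
    # With salt_buckets=1 the suffix is always 0, so the key stays on one partition.
    salted = f"{key}#{sequence % salt_buckets}"
    digest = hashlib.md5(salted.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions
```

The trade-off is that salting gives up per-key ordering across the salted partitions, so it fits aggregations that can be recombined downstream better than flows that depend on strict ordering for a customer.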
Kafka | Advanced | Premium

The Replay Without Downtime Challenge

You will practice

Diagnose and fix the issue: Need to reprocess historical events while live stream continues

The interviewer asks how you would replay history without corrupting current processing.

Type

Log / Error Analysis

Time

20 min

Progress

Not started

Kafka | Log Analysis | Streaming and Kafka Scenarios | Need to reprocess historical events while live stream continues
Kafka | Intermediate | Premium

The CDC Duplication Debate

You will practice

Diagnose and fix the issue: Database CDC emits duplicates or out-of-order updates

The downstream team sees what look like duplicate updates for the same key and is unsure whether the connector or the source is broken.

Type

Log / Error Analysis

Time

20 min

Progress

Not started

Kafka | Log Analysis | Streaming and Kafka Scenarios | Database CDC emits duplicates or out-of-order updates
Kafka | Intermediate | Premium

The Poison Pill Event

You will practice

Diagnose and fix the issue: A malformed event repeatedly breaks the stream

Because offsets are not advancing past that record, the pipeline enters a fail-restart-fail loop and freshness collapses.

Type

Log / Error Analysis

Time

20 min

Progress

Not started

Kafka | Log Analysis | Streaming and Kafka Scenarios | A malformed event repeatedly breaks the stream
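The fail-restart-fail loop above is typically broken by catching the bad record, parking it in a dead-letter store, and still advancing the offset. The sketch below simulates that with plain Python lists rather than a real Kafka client; `handle`, `dead_letters`, and the JSON payload format are assumptions for illustration.

```python
import json
from typing import Callable, Iterable, List, Tuple


def consume(
    records: Iterable[Tuple[int, bytes]],
    handle: Callable[[dict], None],
    dead_letters: List[Tuple[int, bytes, str]],
) -> int:
    """Process (offset, raw) records; return the last offset safe to commit."""
    committed = -1
    for offset, raw in records:
        try:
            handle(json.loads(raw))
        except (ValueError, KeyError) as exc:
            # Malformed JSON raises ValueError; a missing field raises KeyError.
            # Park the record with its error instead of crashing the consumer.
            dead_letters.append((offset, raw, repr(exc)))
        # Advance past the record whether it succeeded or was dead-lettered.
        committed = offset
    return committed
```

Only parse and validation errors are caught here on purpose: an outage in a downstream system should still stop the consumer rather than push healthy records to the dead-letter store.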
Kafka | Intermediate | Premium

The Rebalance Storm

You will practice

Diagnose and fix the issue: Frequent consumer-group rebalances destroy throughput

Freshness collapses even though raw incoming throughput has not changed much, because consumers spend too much time rejoining rather than processing.

Type

Mixed Lab

Time

20 min

Progress

Not started

Kafka | Mixed Lab | Streaming and Kafka Scenarios | Frequent consumer-group rebalances destroy throughput
Kafka | Intermediate | Premium

The Schema Registry Compatibility Break

You will practice

Diagnose and fix the issue: Producer schema evolution breaks downstream consumers

A producer deploys a new schema version that passes its own tests, but one or more consumers fail, mis-parse records, or begin dropping fields after the rollout.

Type

Mixed Lab

Time

20 min

Progress

Not started

Kafka | Mixed Lab | Streaming and Kafka Scenarios | Producer schema evolution breaks downstream consumers
Kafka | Beginner | Premium

The Producer Retry Duplicate

You will practice

Diagnose and fix the issue: Upstream retries create duplicate events that downstreams must tolerate

Downstream consumers begin seeing duplicate events for the same logical business action, and some pipelines amplify the issue because they assumed each event would appear exactly once.

Type

Log / Error Analysis

Time

20 min

Progress

Not started

Kafka | Log Analysis | Streaming and Kafka Scenarios | Upstream retries create duplicate events that downstreams must tolerate
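Tolerating the duplicates above generally means deduplicating on a stable business identifier rather than on arrival. The sketch below assumes each event carries an `event_id` field that the producer reuses on retry; that field name and the in-memory `seen_ids` set are hypothetical, and a production pipeline would bound or persist that state.

```python
from typing import Iterable, List, Set


def dedupe(events: Iterable[dict], seen_ids: Set[str]) -> List[dict]:
    """Return only events whose event_id has not been seen before."""
    fresh: List[dict] = []
    for event in events:
        event_id = event["event_id"]
        if event_id in seen_ids:
            continue  # retry of an event that was already processed
        seen_ids.add(event_id)
        fresh.append(event)
    return fresh
```

Passing the same `seen_ids` set across batches is what lets a retry that lands in a later batch be recognized as a duplicate.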
Kafka | Intermediate | Premium

The Log Compaction Misunderstanding

You will practice

Diagnose and fix the issue: Teams misuse compacted topics because they confuse history with current state

Later, another team uses the same topic for audit-style analytics and is surprised that older change history is incomplete or no longer retained as they expected.

Type

Log / Error Analysis

Time

20 min

Progress

Not started

Kafka | Log Analysis | Streaming and Kafka Scenarios | Teams misuse compacted topics because they confuse history with current state
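The confusion above is easier to see with a toy model of what a fully compacted topic retains: the latest value per key, with a null value (a tombstone) removing the key. The sketch below is a simplification for illustration; real compaction runs asynchronously, so recent records may still be present for a while.

```python
from typing import Dict, Iterable, Optional, Tuple


def compact(log: Iterable[Tuple[str, Optional[str]]]) -> Dict[str, str]:
    """Return the per-key state a fully compacted log would retain."""
    latest: Dict[str, str] = {}
    for key, value in log:
        if value is None:
            latest.pop(key, None)  # tombstone: the key is eventually removed
        else:
            latest[key] = value    # older values for this key are superseded
    return latest
```

For the log `[("u1", "bronze"), ("u2", "gold"), ("u1", "silver"), ("u2", None)]` the result is `{"u1": "silver"}`: the earlier `bronze` tier and every trace of `u2` are gone, which is why audit-style history needs a separate, non-compacted source.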
Kafka | Intermediate | Premium

The Event Version Drift

You will practice

Diagnose and fix the issue: Different event versions coexist and consumers handle them inconsistently

Some consumers cope well, but others silently ignore new fields, mis-handle old versions, or break on unexpected combinations when replaying older data.

Type

Log / Error Analysis

Time

20 min

Progress

Not started

Kafka | Log Analysis | Streaming and Kafka Scenarios | Different event versions coexist and consumers handle them inconsistently
Kafka | Intermediate | Premium

The DLQ Black Hole

You will practice

Diagnose and fix the issue: Dead-letter queues collect bad events but no one actually resolves them

At first this improves uptime, but later the DLQ becomes a graveyard of unresolved records, and business users realize some important events never made it back into the main datasets.

Type

Log / Error Analysis

Time

20 min

Progress

Not started

Kafka | Log Analysis | Streaming and Kafka Scenarios | Dead-letter queues collect bad events but no one actually resolves them
Kafka | Advanced | Premium

The Stream-Table Join State Blowup

You will practice

Diagnose and fix the issue: Joining events with changing reference data causes large or unstable state

As the joined state grows, checkpoint size increases, recovery slows, and the team struggles to balance freshness, memory, and correctness.

Type

Mixed Lab

Time

20 min

Progress

Not started

Kafka | Mixed Lab | Streaming and Kafka Scenarios | Joining events with changing reference data causes large or unstable state
Kafka | Advanced | Premium

The Historical Replay into a Live Topic

You will practice

Diagnose and fix the issue: Backfilling older events into an active topic disrupts consumers

Some consumers handle the replay badly, lag grows sharply, and ordering assumptions break because old events are now mixed into the live stream.

Type

Log / Error Analysis

Time

20 min

Progress

Not started

Kafka | Log Analysis | Streaming and Kafka Scenarios | Backfilling older events into an active topic disrupts consumers
Kafka | Intermediate | Premium

The Clock Skew Event-Time Bug

You will practice

Diagnose and fix the issue: Producer clocks distort event-time processing and watermark logic

Watermarks and windowing behave strangely because some events appear to come from the future or far from the expected event-time distribution.

Type

Log / Error Analysis

Time

20 min

Progress

Not started

Kafka | Log Analysis | Streaming and Kafka Scenarios | Producer clocks distort event-time processing and watermark logic
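One defensive option for the future-dated events above is to cap how far ahead of ingestion time an event timestamp is trusted, so a single skewed producer cannot drag the watermark forward. The sketch below assumes timestamps in seconds and a hypothetical `max_future_skew` tolerance; whether to clamp, drop, or quarantine such events is a design decision the lab leaves open.

```python
def effective_event_time(
    event_time: float,
    ingest_time: float,
    max_future_skew: float,
) -> float:
    """Clamp event times that lie implausibly far in the future."""
    if event_time > ingest_time + max_future_skew:
        # Fall back to ingestion time rather than trusting the producer clock.
        return ingest_time
    # Past and near-future timestamps are kept; lateness is handled elsewhere.
    return event_time
```

Clamping keeps the event in the pipeline but changes its timestamp, so it is worth recording the original value alongside it for later investigation.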
Kafka | Advanced | Premium

The External API Sink Illusion

You will practice

Diagnose and fix the issue: Teams expect exactly-once outcomes while writing stream results to an external API

Leadership asks whether the pipeline is exactly once end to end, but engineers know retries, timeouts, and uncertain acknowledgements make that claim shaky.

Type

Log / Error Analysis

Time

20 min

Progress

Not started

Kafka | Log Analysis | Streaming and Kafka Scenarios | Teams expect exactly-once outcomes while writing stream results to an external API
Kafka | Intermediate | Premium

The Retention vs Replay Trade-off

You will practice

Diagnose and fix the issue: Topic retention is too short for real operational replay needs

Then an important downstream system falls behind for longer than expected, or a bug requires replaying data older than the topic still retains.

Type

Mixed Lab

Time

20 min

Progress

Not started

Kafka | Mixed Lab | Streaming and Kafka Scenarios | Topic retention is too short for real operational replay needs
Kafka | Advanced | Premium

The Multi-Topic Correlation Delay

You will practice

Diagnose and fix the issue: Joining multiple topics with different latency patterns creates inconsistent outputs

One topic is fast and clean, another is bursty and late, and the correlated output oscillates or requires frequent correction as related events arrive on different timelines.

Type

Log / Error Analysis

Time

20 min

Progress

Not started

Kafka | Log Analysis | Streaming and Kafka Scenarios | Joining multiple topics with different latency patterns creates inconsistent outputs
Kafka | Advanced | Premium

The Transactional Producer Overclaim

You will practice

Diagnose and fix the issue: Teams misuse producer transactions and exaggerate what they guarantee

Later, consumers still encounter duplicates at the business level, and stakeholders feel misled because they assumed the producer feature solved end-to-end correctness automatically.

Type

Log / Error Analysis

Time

20 min

Progress

Not started

Kafka | Log Analysis | Streaming and Kafka Scenarios | Teams misuse producer transactions and exaggerate what they guarantee
Kafka | Advanced | Premium

The Unified Customer 360 Design

You will practice

Diagnose and fix the issue: Designing batch and streaming together

The interviewer wants to see whether you can balance raw ingestion, curated models, low-latency serving, and historical correctness in one architecture.

Type

Log / Error Analysis

Time

20 min

Progress

Not started

Kafka | Log Analysis | Architecture, Cloud, and Governance Scenarios | Designing batch and streaming together
Mixed | Intermediate | Premium

The Cloud Migration without a Big Bang

You will practice

Diagnose and fix the issue: Moving on-prem ETL to cloud safely

The interviewer asks how you would migrate while controlling risk.

Type

Log / Error Analysis

Time

20 min

Progress

Not started

Mixed | Log Analysis | Architecture, Cloud, and Governance Scenarios | Moving on-prem ETL to cloud safely
Mixed | Intermediate | Premium

The Cost Spike Mystery

You will practice

Diagnose and fix the issue: Cloud data platform spend increases unexpectedly

You are asked to investigate where the money is going without damaging delivery SLAs.

Type

Mixed Lab

Time

20 min

Progress

Not started

Mixed | Mixed Lab | Architecture, Cloud, and Governance Scenarios | Cloud data platform spend increases unexpectedly
Mixed | Intermediate | Premium

The Data Quality Incident

You will practice

Diagnose and fix the issue: Business discovers wrong numbers after release

Leadership wants both a recovery plan and a prevention plan.

Type

Log / Error Analysis

Time

20 min

Progress

Not started

Mixed | Log Analysis | Architecture, Cloud, and Governance Scenarios | Business discovers wrong numbers after release
Mixed | Advanced | Premium

The PII Exposure Risk

You will practice

Diagnose and fix the issue: Sensitive data lands in the lake without proper controls

No breach is confirmed, but governance controls are clearly inadequate.

Type

Log / Error Analysis

Time

20 min

Progress

Not started

Mixed | Log Analysis | Architecture, Cloud, and Governance Scenarios | Sensitive data lands in the lake without proper controls
Mixed | Intermediate | Premium

The Observability Black Hole

You will practice

Diagnose and fix the issue: Pipeline fails but logs and metrics are insufficient

The interviewer asks how you would improve observability so incidents become faster to diagnose.

Type

Log / Error Analysis

Time

20 min

Progress

Not started

Mixed | Log Analysis | Architecture, Cloud, and Governance Scenarios | Pipeline fails but logs and metrics are insufficient
Data Quality | Advanced | Premium

The End-to-End Sales Pipeline Design

You will practice

Diagnose and fix the issue: Comprehensive scenario covering batch, late data, idempotency, and quality

The interviewer wants to hear how you would build the system and what trade-offs you would make.

Type

Mixed Lab

Time

20 min

Progress

Not started

Data Quality | Mixed Lab | Architecture, Cloud, and Governance Scenarios | Comprehensive scenario covering batch, late data, idempotency, and quality
Mixed | Intermediate | Premium

The Medallion by Dogma

You will practice

Diagnose and fix the issue: Too many mandatory layers slow delivery without adding value

Over time some small, stable datasets move through unnecessary layers, delivery slows, and teams begin copying data mainly to satisfy architecture doctrine rather than a real need.

Type

Log / Error Analysis

Time

20 min

Progress

Not started

Mixed | Log Analysis | Architecture, Cloud, and Governance Scenarios | Too many mandatory layers slow delivery without adding value
Mixed | Intermediate | Premium

The Build vs Buy Workflow Debate

You will practice

Diagnose and fix the issue: Choosing managed services versus custom platform components

Debates become ideological: one group argues managed services are always faster, while another argues custom control is essential for scale and long-term flexibility.

Type

Log / Error Analysis

Time

20 min

Progress

Not started

Mixed | Log Analysis | Architecture, Cloud, and Governance Scenarios | Choosing managed services versus custom platform components
Mixed | Advanced | Premium

The Domain Ownership Collision

You will practice

Diagnose and fix the issue: Central platform standards conflict with domain-team autonomy

Friction emerges because domain teams want speed and freedom, while the platform team wants consistent security, lineage, and operational controls.

Type

Log / Error Analysis

Time

20 min

Progress

Not started

Mixed | Log Analysis | Architecture, Cloud, and Governance Scenarios | Central platform standards conflict with domain-team autonomy
Mixed | Advanced | Premium

The Data Residency Split

You will practice

Diagnose and fix the issue: Global architecture must respect regional data residency and access rules

Business leaders still want global reporting and machine learning, so the architecture must balance regional isolation with cross-region insight.

Type

Log / Error Analysis

Time

20 min

Progress

Not started

Mixed | Log Analysis | Architecture, Cloud, and Governance Scenarios | Global architecture must respect regional data residency and access rules
Mixed | Intermediate | Premium

The Chargeback Blind Spot

You will practice

Diagnose and fix the issue: Cloud cost rises, but teams cannot see which workloads are responsible

Engineers can guess broadly, but the platform lacks reliable workload tagging, cost attribution, or standardized lineage between spend and value.

Type

Mixed Lab

Time

20 min

Progress

Not started

Mixed | Mixed Lab | Architecture, Cloud, and Governance Scenarios | Cloud cost rises, but teams cannot see which workloads are responsible
Mixed | Advanced | Premium

The Features vs BI Truth Split

You will practice

Diagnose and fix the issue: Machine-learning features and BI metrics drift apart from the same source data

Eventually the product and analytics teams compare outputs and realize the 'same' concept means different things in models versus dashboards.

Type

Log / Error Analysis

Time

20 min

Progress

Not started

Mixed | Log Analysis | Architecture, Cloud, and Governance Scenarios | Machine-learning features and BI metrics drift apart from the same source data
Mixed | Intermediate | Premium

The Bad Data Incident Command

You will practice

Diagnose and fix the issue: Wrong numbers reach business users and the response is unstructured

Engineers start debugging immediately, but the response is chaotic: no one knows whether to pull the report, page consumers, freeze downstream jobs, or keep the platform running while the root cause is investigated.

Type

Mixed Lab

Time

20 min

Progress

Not started

Mixed | Mixed Lab | Architecture, Cloud, and Governance Scenarios | Wrong numbers reach business users and the response is unstructured
Mixed | Intermediate | Premium

The Data Contract Rollout

You will practice

Diagnose and fix the issue: Producer-consumer contracts exist in theory but are hard to enforce in practice

Everyone agrees in principle, but adoption stalls because producer teams see contracts as bureaucracy and consumers still lack confidence that changes will be caught early.

Type

Log / Error Analysis

Time

20 min

Progress

Not started

Mixed | Log Analysis | Architecture, Cloud, and Governance Scenarios | Producer-consumer contracts exist in theory but are hard to enforce in practice
Mixed | Advanced | Premium

The Platform Disaster Recovery Plan

You will practice

Diagnose and fix the issue: Data platform recovery design is vague until a real outage occurs

Leadership assumes the platform is resilient, yet engineers realize the actual recovery sequence, RTO, and data-loss boundaries are poorly defined.

Type

Log / Error Analysis

Time

20 min

Progress

Not started

Mixed | Log Analysis | Architecture, Cloud, and Governance Scenarios | Data platform recovery design is vague until a real outage occurs
Mixed | Intermediate | Premium

The Self-Serve Guardrail Balance

You will practice

Diagnose and fix the issue: Making the platform self-serve without opening the door to chaos

At the same time, security, governance, and reliability teams worry that broad self-service will create an uncontrolled sprawl of low-quality data products and risky permissions.

Type

Log / Error Analysis

Time

20 min

Progress

Not started

Mixed | Log Analysis | Architecture, Cloud, and Governance Scenarios | Making the platform self-serve without opening the door to chaos
Mixed | Intermediate | Premium

The Backup vs Replay vs Recompute Choice

You will practice

Diagnose and fix the issue: Teams do not know whether to restore, replay, or rebuild after data loss

Some want to restore from backup, others want to replay from raw events, and others argue the dataset should simply be recomputed from upstream sources, but no one has a framework for choosing.

Type

Mixed Lab

Time

20 min

Progress

Not started

Mixed | Mixed Lab | Architecture, Cloud, and Governance Scenarios | Teams do not know whether to restore, replay, or rebuild after data loss
Mixed | Intermediate | Premium

The Noisy Neighbor Platform

You will practice

Diagnose and fix the issue: One team's heavy workload degrades the shared platform for everyone else

At peak times a few large jobs monopolize resources, and unrelated teams experience slower queries, missed SLAs, or higher costs even though their own usage has not changed.

Type

Mixed Lab

Time

20 min

Progress

Not started

Mixed | Mixed Lab | Architecture, Cloud, and Governance Scenarios | One team's heavy workload degrades the shared platform for everyone else
Mixed | Advanced | Premium

The Executive KPI Mismatch

You will practice

Diagnose and fix the issue: Different teams publish different versions of the same top-line metric

The platform itself may be running fine, but organizational trust drops because no one can say which number is canonical or why the definitions diverged.

Type

Log / Error Analysis

Time

20 min

Progress

Not started

Mixed | Log Analysis | Architecture, Cloud, and Governance Scenarios | Different teams publish different versions of the same top-line metric