SQL · Beginner · Free
Wrong GROUP BY Grain Causing Customer Revenue Inflation
You will practice
Build a customer-level revenue result with exactly one row per customer. Include only completed orders, return customer_id, customer_name, and completed_revenue, and make sure duplicate status rows cannot inflate the dashboard.
The query groups by customer and order status, but the dashboard expects one row per customer. When downstream users sum the status-level rows again, cancelled and completed order rows are mixed into the customer metric.
SQL · Grain · Revenue · Data Quality
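A minimal sketch of the grain fix, run through Python's built-in `sqlite3` so it is self-contained. The table layouts, sample rows, and the `amount` column are assumptions for illustration; only `customer_id`, `customer_name`, and `completed_revenue` come from the exercise text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, customer_name TEXT);
CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, status TEXT, amount REAL);
INSERT INTO customers VALUES (1, 'Asha'), (2, 'Ben');
INSERT INTO orders VALUES
  (10, 1, 'completed', 100.0),
  (11, 1, 'cancelled', 40.0),
  (12, 1, 'completed', 60.0),
  (13, 2, 'cancelled', 25.0);
""")

# Wrong grain: one row per (customer, status), so a downstream SUM mixes statuses.
status_grain = conn.execute("""
SELECT c.customer_id, o.status, SUM(o.amount)
FROM customers AS c
JOIN orders AS o ON o.customer_id = c.customer_id
GROUP BY c.customer_id, o.status
""").fetchall()

# Customer grain: filter to completed orders first, then group by customer only.
customer_grain = conn.execute("""
SELECT c.customer_id, c.customer_name, SUM(o.amount) AS completed_revenue
FROM customers AS c
JOIN orders AS o ON o.customer_id = c.customer_id
WHERE o.status = 'completed'
GROUP BY c.customer_id, c.customer_name
ORDER BY c.customer_id
""").fetchall()

print(len(status_grain), customer_grain)
```

Whether customers with no completed orders should still appear (with zero revenue) is a requirement to confirm; this sketch leaves them out.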
SQL · Beginner · Free
LEFT JOIN Turned Into INNER JOIN by WHERE Filter
You will practice
Return every active customer. For customers who clicked campaign SPRING_26, show their latest click timestamp. For customers with no click, keep the customer row and return NULL for last_click_at.
The query uses a LEFT JOIN, but a filter on the campaign table is placed in the WHERE clause. That removes NULL right-side rows and silently turns the result into an inner join for this campaign.
SQL · Joins · NULLs · Retention
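The difference can be sketched with `sqlite3`: moving the campaign predicate from `WHERE` into the `ON` clause keeps customers who have no matching click. The table names, the `is_active` flag, and the sample rows are assumptions; `SPRING_26` and `last_click_at` come from the exercise.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, is_active INTEGER);
CREATE TABLE clicks (customer_id INTEGER, campaign TEXT, clicked_at TEXT);
INSERT INTO customers VALUES (1, 1), (2, 1), (3, 0);
INSERT INTO clicks VALUES
  (1, 'SPRING_26', '2026-03-01 10:00:00'),
  (1, 'SPRING_26', '2026-03-05 09:00:00'),
  (2, 'WINTER_25', '2025-12-01 08:00:00');
""")

# Filter on the right-side table in WHERE: NULL-extended rows are discarded.
broken = conn.execute("""
SELECT c.customer_id, MAX(k.clicked_at) AS last_click_at
FROM customers AS c
LEFT JOIN clicks AS k ON k.customer_id = c.customer_id
WHERE c.is_active = 1 AND k.campaign = 'SPRING_26'
GROUP BY c.customer_id
ORDER BY c.customer_id
""").fetchall()

# Same filter in ON: it limits which clicks match, not which customers survive.
fixed = conn.execute("""
SELECT c.customer_id, MAX(k.clicked_at) AS last_click_at
FROM customers AS c
LEFT JOIN clicks AS k
  ON k.customer_id = c.customer_id AND k.campaign = 'SPRING_26'
WHERE c.is_active = 1
GROUP BY c.customer_id
ORDER BY c.customer_id
""").fetchall()

print(broken)
print(fixed)
```

The filter on `customers` stays in `WHERE` because it applies to the preserved (left) side of the join.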
SQL · Intermediate · Free
Duplicate Revenue from Joining Orders to Multiple Payments and Refunds
You will practice
Return one row per order with paid_amount, refunded_amount, and net_revenue. Aggregate each child table to order_id before joining so payment and refund rows cannot multiply each other.
The mart joins orders directly to payments and refunds at row level. Because both child tables can have multiple rows per order, the join multiplies records before aggregation.
Type: Output Mismatch Debugging
SQL · Join Explosion · Revenue · Output Mismatch
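A small `sqlite3` sketch of why the row-level join inflates totals and how pre-aggregating each child table avoids it. The schemas and amounts (stored as integer cents) are assumptions; `paid_amount`, `refunded_amount`, and `net_revenue` are the columns the exercise asks for.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id INTEGER PRIMARY KEY);
CREATE TABLE payments (order_id INTEGER, amount INTEGER);
CREATE TABLE refunds (order_id INTEGER, amount INTEGER);
INSERT INTO orders VALUES (1), (2);
INSERT INTO payments VALUES (1, 50), (1, 50), (2, 30);
INSERT INTO refunds VALUES (1, 10), (1, 5);
""")

# Row-level join: order 1 has 2 payments x 2 refunds = 4 rows before SUM.
exploded = conn.execute("""
SELECT o.order_id, SUM(p.amount), SUM(r.amount)
FROM orders AS o
LEFT JOIN payments AS p ON p.order_id = o.order_id
LEFT JOIN refunds AS r ON r.order_id = o.order_id
GROUP BY o.order_id
ORDER BY o.order_id
""").fetchall()

# Aggregate each child to one row per order_id, then join at order grain.
fixed = conn.execute("""
SELECT o.order_id,
       COALESCE(p.paid_amount, 0) AS paid_amount,
       COALESCE(r.refunded_amount, 0) AS refunded_amount,
       COALESCE(p.paid_amount, 0) - COALESCE(r.refunded_amount, 0) AS net_revenue
FROM orders AS o
LEFT JOIN (SELECT order_id, SUM(amount) AS paid_amount
           FROM payments GROUP BY order_id) AS p ON p.order_id = o.order_id
LEFT JOIN (SELECT order_id, SUM(amount) AS refunded_amount
           FROM refunds GROUP BY order_id) AS r ON r.order_id = o.order_id
ORDER BY o.order_id
""").fetchall()

print(exploded)
print(fixed)
```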
PySpark · Intermediate · Free
Append Mode Created Duplicate Daily Loads
You will practice
Make the daily write idempotent for order_date.
The PySpark job writes in append mode for a deterministic daily partition, so retries and reruns duplicate the same day.
PySpark · Idempotency · Daily Loads · Lakehouse
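The idea of replacing a day's slice instead of appending to it can be sketched without a Spark cluster. The example below uses `sqlite3` as a stand-in, so it is not the PySpark write path itself; the `daily_orders` table and the `load_day` helper are hypothetical. In Spark, the analogous options are typically dynamic partition overwrite (`spark.sql.sources.partitionOverwriteMode=dynamic`) or, on Delta tables, a `replaceWhere` overwrite, depending on the table format in use.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE daily_orders (order_date TEXT, order_id INTEGER, amount INTEGER)"
)

def load_day(conn, order_date, rows):
    """Replace the whole order_date slice in one transaction so reruns converge."""
    with conn:  # commits on success, rolls back if any statement raises
        conn.execute("DELETE FROM daily_orders WHERE order_date = ?", (order_date,))
        conn.executemany(
            "INSERT INTO daily_orders VALUES (?, ?, ?)",
            [(order_date, order_id, amount) for order_id, amount in rows],
        )

batch = [(1, 100), (2, 250)]
load_day(conn, "2026-03-01", batch)
load_day(conn, "2026-03-01", batch)  # simulated retry of the same day

row_count = conn.execute("SELECT COUNT(*) FROM daily_orders").fetchone()[0]
print(row_count)
```

An append-style load would leave four rows after the retry; the replace-by-partition load leaves two regardless of how many times the day is rerun.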
PySpark · Intermediate · Free
The Endless Final Stage
You will practice
Fix the PySpark code so the pipeline is correct, scalable, and safe to rerun.
The job reaches 99 percent quickly, 199 out of 200 tasks finish, but one task runs for hours with high GC and 100 percent CPU before ending with OutOfMemoryError.
PySpark · Broken PySpark · Spark Performance and Debugging · Spark job stuck at 99% because of data skew
PySpark · Intermediate · Free
The Executor Graveyard
You will practice
Fix the PySpark code so the pipeline is correct, scalable, and safe to rerun.
The job does not freeze at one task; instead executors keep dying and getting replaced. Retries happen repeatedly and the stage eventually fails.
PySpark · Broken PySpark · Spark Performance and Debugging · Repeated executor deaths after a wide join
PySpark · Intermediate · Free
The Shuffle Storm
You will practice
Fix the PySpark code so the pipeline is correct, scalable, and safe to rerun.
A new version replaced reduceByKey-like logic with groupByKey and several downstream transformations. Runtime doubled and shuffle traffic exploded.
PySpark · Broken PySpark · Spark Performance and Debugging · Massive shuffle caused by bad aggregation strategy
PySpark · Beginner · Free
The Small Files Avalanche
You will practice
Fix the PySpark code so the pipeline is correct, scalable, and safe to rerun.
The job technically succeeds, but downstream reads get slower over time and cloud object store listings become painfully expensive. Each partition contains hundreds or thousands of tiny files.
PySpark · Broken PySpark · Spark Performance and Debugging · Spark write path creates too many tiny files
PySpark · Beginner · Free
The Silent UDF Tax
You will practice
Fix the PySpark code so the pipeline is correct, scalable, and safe to rerun.
The job still succeeds, but runtime tripled and CPU utilization looks poor despite no major shuffle increase.
PySpark · Broken PySpark · Spark Performance and Debugging · Python UDF makes a pipeline unexpectedly slow
PySpark · Intermediate · Free
The Broadcast Betrayal
You will practice
Fix the PySpark code so the pipeline is correct, scalable, and safe to rerun.
Someone forced a broadcast hint. It worked in test and smaller markets, but production now fails intermittently with broadcast timeout or executor memory pressure.
PySpark · Broken PySpark · Spark Performance and Debugging · Wrong broadcast choice causes timeout or memory issues
PySpark · Beginner · Free
The Cache Everything Trap
You will practice
Fix the PySpark code so the pipeline is correct, scalable, and safe to rerun.
Production jobs show higher memory pressure, more spills, and worse overall runtime than before caching was added.
PySpark · Broken PySpark · Spark Performance and Debugging · Over-caching makes the cluster slower
PySpark · Intermediate · Free
The Backfill Explosion
You will practice
Fix the PySpark code so the pipeline is correct, scalable, and safe to rerun.
The same code that works daily now runs for many hours, overloads the cluster, and causes repeated failures.
PySpark · Broken PySpark · Spark Performance and Debugging · Daily logic fails when rerun for months of history
PySpark · Beginner · Free
The Union of Doom
You will practice
Fix the PySpark code so the pipeline is correct, scalable, and safe to rerun.
After onboarding a new market, some columns shifted position and downstream consumers started seeing nulls or incorrect values even though the job itself succeeded.
PySpark · Broken PySpark · Spark Performance and Debugging · Schema mismatch during union creates silent data corruption risk
PySpark · Advanced · Free
Checkpoint Chaos
You will practice
Fix the PySpark code so the pipeline is correct, scalable, and safe to rerun.
After a deployment and infrastructure interruption, the stream restarts but either reprocesses events, produces duplicates, or struggles with state-store recovery.
PySpark · Broken PySpark · Spark Performance and Debugging · Structured Streaming restarts cause duplicates or state issues
PySpark · Intermediate · Free
The AQE Surprise
You will practice
Fix the PySpark code so the pipeline is correct, scalable, and safe to rerun.
Production performance becomes inconsistent. On some days the job gets faster, but on others one stage becomes very heavy, task counts collapse, and the downstream write becomes slower than before.
PySpark · Broken PySpark · Spark Performance and Debugging · Adaptive Query Execution changes join or partition strategy in unexpected ways
PySpark · Intermediate · Free
The Window Function Blowup
You will practice
Fix the PySpark code so the pipeline is correct, scalable, and safe to rerun.
The job does not fail immediately, but stages with sort and window execution become extremely slow, spill heavily to disk, and sometimes time out under peak loads.
PySpark · Broken PySpark · Spark Performance and Debugging · Large window operations create spill, skew, or sort pressure
PySpark · Beginner · Free
The Null Key Funnel
You will practice
Fix the PySpark code so the pipeline is correct, scalable, and safe to rerun.
Runtime suddenly deteriorates after a source issue causes a large share of merchant_id values to arrive as null or a default placeholder, and one stage becomes badly imbalanced.
PySpark · Broken PySpark · Spark Performance and Debugging · Null or default join keys collapse data into a pathological partition
PySpark · Intermediate · Free
The Explode Cascade
You will practice
Fix the PySpark code so the pipeline is correct, scalable, and safe to rerun.
After the change, row counts increase by orders of magnitude, a formerly healthy job now spills heavily, and downstream tables become far larger than expected.
PySpark · Broken PySpark · Spark Performance and Debugging · Flattening nested arrays multiplies row counts and destabilizes the plan
PySpark · Beginner · Free
The Driver Memory Trap
You will practice
Fix the PySpark code so the pipeline is correct, scalable, and safe to rerun.
The executors look fine, but the application driver becomes unstable, crashes intermittently, or hangs during peak days.
PySpark · Broken PySpark · Spark Performance and Debugging · Work is accidentally pulled back to the driver and causes instability
PySpark · Intermediate · Premium
Spark Join Slowed Down Due to Skewed Customer Key
You will practice
Choose the likely root cause.
The join key has one customer_id that owns a massive share of events, causing one reducer partition to process most rows.
PySpark · Skew · Join · Performance
PySpark · Intermediate · Premium
Too Many Small Files from Hourly Writes
You will practice
Diagnose the performance issue from logs.
Each hourly job writes many tiny files into the same date partition. Metadata overhead dominates scan time.
PySpark · Small Files · Compaction · Lakehouse
Airflow · Intermediate · Premium
DAG Green but Dashboard Wrong
You will practice
Find why green status is misleading.
The DAG only checks task completion, not data freshness or row-count expectations.
Airflow · Monitoring · Data Quality · Incident
Airflow · Intermediate · Premium
Airflow Retry Reprocessed Same File and Created Duplicates
You will practice
Explain the retry/idempotency bug.
Retry behavior is not idempotent. The pipeline lacks file-level checkpointing and deduplication keys.
Airflow · Retries · Idempotency · Duplicates
Data Quality · Beginner · Premium
Revenue Dropped 30% After New SUCCESSFUL Status
You will practice
Return one row per order_date with revenue from every provider status mapped to paid_success. Use the mapping table rather than hardcoding provider-specific values.
The revenue query only accepts status = 'SUCCESS'. The new provider sends 'SUCCESSFUL'.
Type: Output Mismatch Debugging
Data Quality · Status Mapping · Revenue · Monitoring
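A sketch of the mapping-table approach using `sqlite3`. The table names (`payments`, `status_map`), their columns, and the sample rows are assumptions; the `SUCCESS` and `SUCCESSFUL` values and the `paid_success` target come from the exercise.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE payments (order_date TEXT, provider TEXT, status TEXT, amount INTEGER);
CREATE TABLE status_map (provider TEXT, provider_status TEXT, canonical_status TEXT);
INSERT INTO payments VALUES
  ('2026-03-01', 'A', 'SUCCESS', 100),
  ('2026-03-01', 'B', 'SUCCESSFUL', 50),
  ('2026-03-01', 'A', 'FAILED', 70),
  ('2026-03-02', 'B', 'SUCCESSFUL', 30);
INSERT INTO status_map VALUES
  ('A', 'SUCCESS', 'paid_success'),
  ('B', 'SUCCESSFUL', 'paid_success'),
  ('A', 'FAILED', 'paid_failed');
""")

# Hardcoded filter: silently drops the new provider's rows.
hardcoded = conn.execute("""
SELECT order_date, SUM(amount) FROM payments
WHERE status = 'SUCCESS'
GROUP BY order_date ORDER BY order_date
""").fetchall()

# Mapping table: each provider status resolves to a canonical status first.
mapped = conn.execute("""
SELECT p.order_date, SUM(p.amount) AS revenue
FROM payments AS p
JOIN status_map AS m
  ON m.provider = p.provider AND m.provider_status = p.status
WHERE m.canonical_status = 'paid_success'
GROUP BY p.order_date
ORDER BY p.order_date
""").fetchall()

print(hardcoded)
print(mapped)
```

An inner join also hides statuses that have no mapping row yet, so a separate check for unmapped statuses is worth adding to monitoring.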
Data Quality · Intermediate · Premium
UTC to Local Timezone Boundary Broke Daily Dashboard
You will practice
Convert event_ts_utc to India business time before deriving the reporting date. Return business_date and revenue so late-night UTC events are counted on the correct India date.
The report groups by the UTC calendar date. Orders between 18:30 and 23:59 UTC belong to the next calendar day in Asia/Kolkata, so the dashboard shifts late-night revenue into the wrong business date.
SQL · Timezone · Reporting · Data Quality
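The boundary arithmetic can be checked in plain Python. This sketch uses a fixed +05:30 offset, which is sufficient here because Asia/Kolkata does not observe daylight saving time; the sample events and revenue values are made up for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

# Fixed offset for India Standard Time (UTC+05:30, no DST).
IST = timezone(timedelta(hours=5, minutes=30))

# (event_ts_utc, revenue) pairs around the 18:30 UTC boundary.
events = [
    ("2026-03-01T18:29:00+00:00", 100),
    ("2026-03-01T18:30:00+00:00", 40),
    ("2026-03-01T23:59:00+00:00", 60),
]

revenue_by_business_date = defaultdict(int)
for event_ts_utc, revenue in events:
    ts = datetime.fromisoformat(event_ts_utc)
    # Convert to India time BEFORE truncating to a date.
    business_date = ts.astimezone(IST).date().isoformat()
    revenue_by_business_date[business_date] += revenue

result = dict(revenue_by_business_date)
print(result)
```

Grouping by the UTC date would place all three events on 2026-03-01; converting first moves the two events at or after 18:30 UTC to 2026-03-02.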
AWS / Data Lake · Intermediate · Premium
Bad Partition Strategy by customer_id
You will practice
Choose the safest partition strategy.
customer_id is high-cardinality and query filters mostly use order_date. The lake now has millions of tiny partitions.
AWS · S3 · Partitioning · Lakehouse
Mixed · Advanced · Premium
Late Arriving Records Changed Previous Revenue Partitions
You will practice
Diagnose the late-arrival issue.
The pipeline assumes closed daily partitions are final. It has no late-arrival correction window or restatement process.
Watermark · Late Data · SQL · Airflow
PySpark · Advanced · Premium
The Delta Merge Meltdown
You will practice
Fix the PySpark code so the pipeline is correct, scalable, and safe to rerun.
The merge now scans huge portions of the target table, rewrites many files, and regularly collides with maintenance tasks or downstream readers.
PySpark · Broken PySpark · Spark Performance and Debugging · Large merge statements rewrite too much data and make jobs unstable
PySpark · Intermediate · Premium
The Speculation Confusion
You will practice
Fix the PySpark code so the pipeline is correct, scalable, and safe to rerun.
Runtime does not improve meaningfully, and in a few edge cases downstream side effects or writes become harder to reason about.
PySpark · Broken PySpark · Spark Performance and Debugging · Speculative execution is misunderstood as a cure for skew or bad logic
PySpark · Intermediate · Premium
The File Listing Bottleneck
You will practice
Fix the PySpark code so the pipeline is correct, scalable, and safe to rerun.
Teams notice that the application spends a long time before meaningful task execution begins, and the cost of simply planning the read is becoming painful.
PySpark · Broken PySpark · Spark Performance and Debugging · Reading huge numbers of files is slow before compute even starts
PySpark · Beginner · Premium
The Partition Pruning Mirage
You will practice
Fix the PySpark code so the pipeline is correct, scalable, and safe to rerun.
Even though the code appears selective, the scan bytes remain huge and runtime barely changes compared with a full-table read.
PySpark · Broken PySpark · Spark Performance and Debugging · Filters look selective in code but do not actually prune the dataset
PySpark · Intermediate · Premium
The Serializer Mismatch
You will practice
Fix the PySpark code so the pipeline is correct, scalable, and safe to rerun.
The job still works, but task deserialization time, network overhead, and CPU cost increase noticeably, especially around shuffle-heavy stages.
PySpark · Broken PySpark · Spark Performance and Debugging · Serialization choices and object-heavy code paths degrade performance
SQL · Beginner · Premium
The Duplicate Customer Nightmare
You will practice
Diagnose and fix the issue: Latest-record deduplication in SQL
Business users report duplicate customers in reporting and the metrics team is getting inconsistent counts.
SQL · Broken SQL · SQL and Warehousing Scenarios · Latest-record deduplication in SQL
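One common pattern for latest-record deduplication is `ROW_NUMBER()` partitioned by the business key. The sketch below assumes a hypothetical `customer_records` table with an `updated_at` column, and needs an SQLite build with window-function support (3.25 or later).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer_records (customer_id INTEGER, email TEXT, updated_at TEXT);
INSERT INTO customer_records VALUES
  (1, 'old@example.test', '2026-01-01'),
  (1, 'new@example.test', '2026-02-01'),
  (2, 'b@example.test',   '2026-01-15');
""")

# Rank each customer's versions newest-first and keep only the first one.
latest = conn.execute("""
SELECT customer_id, email
FROM (
  SELECT customer_id, email,
         ROW_NUMBER() OVER (
           PARTITION BY customer_id
           ORDER BY updated_at DESC
         ) AS rn
  FROM customer_records
)
WHERE rn = 1
ORDER BY customer_id
""").fetchall()

print(latest)
```

If two versions can share the same `updated_at`, add a second ordering column so the chosen row is deterministic.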
SQL · Beginner · Premium
The NULL Trap
You will practice
Diagnose and fix the issue: NOT IN returns unexpected results because of NULL handling
The result set comes back empty or much smaller than expected, even though everyone knows there are customers without orders.
SQL · Log Analysis · SQL and Warehousing Scenarios · NOT IN returns unexpected results because of NULL handling
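The empty result is reproducible in a few lines of `sqlite3`. The schemas and rows are assumptions; the key ingredient is a single NULL `customer_id` in the subquery, which makes `NOT IN` evaluate to NULL rather than true for every non-matching customer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY);
CREATE TABLE orders (order_id INTEGER, customer_id INTEGER);
INSERT INTO customers VALUES (1), (2), (3);
INSERT INTO orders VALUES (10, 1), (11, NULL);
""")

# NOT IN against a set containing NULL never evaluates to true.
not_in = conn.execute("""
SELECT customer_id FROM customers
WHERE customer_id NOT IN (SELECT customer_id FROM orders)
ORDER BY customer_id
""").fetchall()

# NOT EXISTS compares row by row, so the NULL order simply never matches.
not_exists = conn.execute("""
SELECT c.customer_id FROM customers AS c
WHERE NOT EXISTS (
  SELECT 1 FROM orders AS o WHERE o.customer_id = c.customer_id
)
ORDER BY c.customer_id
""").fetchall()

print(not_in)
print(not_exists)
```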
SQL · Intermediate · Premium
The History Table Dilemma
You will practice
Diagnose and fix the issue: Designing SCD Type 2 correctly
The interviewer asks how you would design the dimension table so analysts can answer 'what did we know at that point in time?'
SQL · Log Analysis · SQL and Warehousing Scenarios · Designing SCD Type 2 correctly
SQL · Intermediate · Premium
The Double Counting Report
You will practice
Diagnose and fix the issue: Join pattern causes duplicated fact rows
The SQL runs fine, but totals are inflated because the join grain is misunderstood.
SQL · Broken SQL · SQL and Warehousing Scenarios · Join pattern causes duplicated fact rows
SQL · Intermediate · Premium
The Query That Became Slow Overnight
You will practice
Diagnose and fix the issue: A previously fast SQL query regresses badly
The business did not change the SQL, but performance deteriorated sharply.
SQL · Broken SQL · SQL and Warehousing Scenarios · A previously fast SQL query regresses badly
SQL · Advanced · Premium
The Merge Contention Problem
You will practice
Diagnose and fix the issue: Concurrent upserts create lock contention or deadlocks
During peak hours, merges slow down, block each other, or deadlock.
SQL · Log Analysis · SQL and Warehousing Scenarios · Concurrent upserts create lock contention or deadlocks
SQL · Intermediate · Premium
The Late Dimension Problem
You will practice
Diagnose and fix the issue: Facts arrive before dimension rows
Fact rows fail foreign-key checks or end up with missing enrichments, and analysts complain about 'unknown' categories.
SQL · Log Analysis · SQL and Warehousing Scenarios · Facts arrive before dimension rows
SQL · Advanced · Premium
The Snapshot Consistency Debate
You will practice
Diagnose and fix the issue: Reading from mutable source tables creates inconsistent extracts
Counts between related tables do not reconcile because the source is changing during extraction.
SQL · Log Analysis · SQL and Warehousing Scenarios · Reading from mutable source tables creates inconsistent extracts
SQL · Intermediate · Premium
The Incremental Load Gone Wrong
You will practice
Diagnose and fix the issue: Watermark logic misses or duplicates records
Users notice some records are missing while others are loaded twice after retries and timezone inconsistencies.
SQL · Log Analysis · SQL and Warehousing Scenarios · Watermark logic misses or duplicates records
SQL · Intermediate · Premium
The Trusted Summary Table
You will practice
Diagnose and fix the issue: Materialized summary vs querying raw detail
The interviewer asks whether you would keep querying raw data or build summary tables.
SQL · Log Analysis · SQL and Warehousing Scenarios · Materialized summary vs querying raw detail
SQL · Beginner · Premium
The Top-N Tie Trap
You will practice
Diagnose and fix the issue: Window-function ranking returns unstable or duplicated Top-N results
Different runs or engine migrations produce slightly different results whenever ties occur, and stakeholders question why the dashboard is not stable.
SQL · Log Analysis · SQL and Warehousing Scenarios · Window-function ranking returns unstable or duplicated Top-N results
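Two ways to make a Top-N result stable under ties, sketched with `sqlite3` (window functions require SQLite 3.25 or later). The `product_sales` table and its rows are hypothetical. Which variant is right depends on whether the dashboard wants exactly N rows or every row tied within the top N ranks.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product_sales (product TEXT, sales INTEGER);
INSERT INTO product_sales VALUES ('a', 100), ('b', 90), ('c', 90), ('d', 80);
""")

# Exactly 2 rows: break ties with an explicit, deterministic second sort key.
top2_stable = [row[0] for row in conn.execute("""
SELECT product FROM (
  SELECT product,
         ROW_NUMBER() OVER (ORDER BY sales DESC, product ASC) AS rn
  FROM product_sales
)
WHERE rn <= 2
ORDER BY rn
""")]

# All rows within the top 2 ranks: ties are kept, so 3 rows come back here.
top2_with_ties = [row[0] for row in conn.execute("""
SELECT product FROM (
  SELECT product, sales, RANK() OVER (ORDER BY sales DESC) AS rk
  FROM product_sales
)
WHERE rk <= 2
ORDER BY sales DESC, product ASC
""")]

print(top2_stable, top2_with_ties)
```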
SQL · Intermediate · Premium
The Fact Correction Dilemma
You will practice
Diagnose and fix the issue: Correcting fact-table errors without breaking auditability
The team debates whether to overwrite the old fact row, insert a corrected version, or maintain separate adjustment records.
SQL · Log Analysis · SQL and Warehousing Scenarios · Correcting fact-table errors without breaking auditability
SQL · Intermediate · Premium
The FX Conversion Mismatch
You will practice
Diagnose and fix the issue: Currency conversion logic creates inconsistent financial reporting
Different teams get different totals because some queries use transaction-date rates, others use month-end rates, and refunds are handled inconsistently.
SQL · Log Analysis · SQL and Warehousing Scenarios · Currency conversion logic creates inconsistent financial reporting
SQL · Intermediate · Premium
The JSON Fanout Query
You will practice
Diagnose and fix the issue: Flattening semi-structured fields in SQL creates double counting and high cost
A new query flattens multiple arrays out of the JSON and joins them back to orders, but revenue and counts become inflated while runtime increases sharply.
SQL · Log Analysis · SQL and Warehousing Scenarios · Flattening semi-structured fields in SQL creates double counting and high cost
SQL · Intermediate · Premium
The MERGE with Duplicate Matches
You will practice
Diagnose and fix the issue: MERGE fails or behaves unpredictably because source keys are not unique
Some loads fail with multiple-match errors, while others succeed but produce hard-to-explain outcomes because the source contains duplicate keys in the same batch.
SQL · Log Analysis · SQL and Warehousing Scenarios · MERGE fails or behaves unpredictably because source keys are not unique
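A sketch of one mitigation: reduce the source batch to one row per key before the upsert. SQLite has no `MERGE` statement, so this example substitutes `INSERT ... ON CONFLICT DO UPDATE` (available in SQLite 3.24 or later) to show the same idea; the `target` and `staging` tables, their columns, and the "latest `updated_at` wins" rule are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE target (id INTEGER PRIMARY KEY, value TEXT, updated_at TEXT);
CREATE TABLE staging (id INTEGER, value TEXT, updated_at TEXT);
INSERT INTO target VALUES (1, 'stale', '2026-01-01');
INSERT INTO staging VALUES
  (1, 'v1', '2026-02-01'),
  (1, 'v2', '2026-02-02'),
  (2, 'x',  '2026-02-01');
""")

with conn:
    # Keep only the newest staging row per id, then upsert that single row.
    conn.execute("""
    INSERT INTO target (id, value, updated_at)
    SELECT id, value, updated_at
    FROM (
      SELECT id, value, updated_at,
             ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) AS rn
      FROM staging
    )
    WHERE rn = 1
    ON CONFLICT(id) DO UPDATE SET
      value = excluded.value,
      updated_at = excluded.updated_at
    """)

merged = conn.execute("SELECT id, value, updated_at FROM target ORDER BY id").fetchall()
print(merged)
```

On engines with a real `MERGE`, the same deduplicated subquery would typically serve as the `USING` source.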
SQL · Beginner · Premium
The Timezone Boundary Bug
You will practice
Diagnose and fix the issue: Date-based reports are wrong because local and UTC boundaries are mixed
Counts near midnight look wrong in several countries, and month-end totals differ between dashboards built by different teams.
SQL · Log Analysis · SQL and Warehousing Scenarios · Date-based reports are wrong because local and UTC boundaries are mixed
SQL · Intermediate · Premium
The Snapshot Join Trap
You will practice
Diagnose and fix the issue: Joining periodic snapshots to transactions creates misleading metrics
The resulting query runs, but inventory metrics are overstated or misaligned because the snapshot grain and transaction grain do not line up cleanly.
SQL · Log Analysis · SQL and Warehousing Scenarios · Joining periodic snapshots to transactions creates misleading metrics
SQL · Advanced · Premium
The Chasm Trap
You will practice
Diagnose and fix the issue: Joining multiple fact tables through shared dimensions produces fanout
They join three fact tables through common dimensions and produce impressive-looking dashboards, but conversion and revenue numbers are inflated in subtle ways.
SQL · Log Analysis · SQL and Warehousing Scenarios · Joining multiple fact tables through shared dimensions produces fanout
SQL · Intermediate · Premium
The Distinct Count at Scale
You will practice
Diagnose and fix the issue: High-cardinality distinct counting becomes expensive and inconsistent
Different teams compute the metrics with different SQL patterns, runtimes are high, and slight logic differences create disagreement about the official number.
SQL · Log Analysis · SQL and Warehousing Scenarios · High-cardinality distinct counting becomes expensive and inconsistent
SQL · Advanced · Premium
The Restatement vs Close Debate
You will practice
Diagnose and fix the issue: Closed financial periods conflict with late-arriving changes
Business teams disagree on whether historical dashboards should change, whether the closed month should stay frozen, and how to reconcile operational truth with published finance truth.
SQL · Log Analysis · SQL and Warehousing Scenarios · Closed financial periods conflict with late-arriving changes
Airflow · Beginner · Premium
The Green DAG with Bad Data
You will practice
Diagnose and fix the issue: Tasks succeed but output is incomplete
All tasks are green, but finance reports missing data because the input file was truncated and the transformation script never validated record completeness.
Airflow · Log Analysis · Airflow and Reliability Scenarios · Tasks succeed but output is incomplete
Airflow · Intermediate · Premium
The Backfill Hell
You will practice
Diagnose and fix the issue: Historical reruns create duplicates and dependency chaos
When the team reruns historical dates manually, some datasets duplicate and downstream DAGs process mixed old and new data.
Airflow · Log Analysis · Airflow and Reliability Scenarios · Historical reruns create duplicates and dependency chaos
Airflow · Beginner · Premium
The Sensor Gridlock
You will practice
Diagnose and fix the issue: Late upstream files clog worker slots and scheduler throughput
Simple poke sensors occupy many worker slots for hours, delaying unrelated pipelines and making the scheduler look unhealthy.
Airflow · Log Analysis · Airflow and Reliability Scenarios · Late upstream files clog worker slots and scheduler throughput
Airflow · Intermediate · Premium
The Dynamic Task Explosion
You will practice
Diagnose and fix the issue: Task generation scales beyond scheduler comfort
The scheduler becomes slow, the UI becomes noisy, and task state management itself becomes the bottleneck.
Airflow · Log Analysis · Airflow and Reliability Scenarios · Task generation scales beyond scheduler comfort
Airflow · Beginner · Premium
The Retry Illusion
You will practice
Diagnose and fix the issue: Automatic retries hide a deterministic data bug
One transform task intermittently succeeds on the third or fourth retry, but the resulting data is inconsistent and the root cause remains unresolved.
Airflow · Log Analysis · Airflow and Reliability Scenarios · Automatic retries hide a deterministic data bug
Airflow · Intermediate · Premium
The Secret Rotation Outage
You will practice
Diagnose and fix the issue: Credential change breaks many DAGs at once
Some tasks still use old environment variables, some use Airflow connections, and no one is sure which pipelines are impacted.
Airflow · Log Analysis · Airflow and Reliability Scenarios · Credential change breaks many DAGs at once
Airflow · Intermediate · Premium
The Idempotency Question
You will practice
Diagnose and fix the issue: How do you rerun a failed DAG safely?
The question is broad, but they want to know whether you have practical patterns for safe re-execution.
Airflow · Log Analysis · Airflow and Reliability Scenarios · How do you rerun a failed DAG safely?
Airflow · Intermediate · Premium
The SLA vs Throughput Trade-off
You will practice
Diagnose and fix the issue: A DAG misses SLA under shared-cluster pressure
On busy days the report misses its 7 AM SLA because the downstream Spark tasks wait for cluster capacity.
Airflow · Mixed Lab · Airflow and Reliability Scenarios · A DAG misses SLA under shared-cluster pressure
Airflow · Beginner · Premium
The Catchup Stampede
You will practice
Diagnose and fix the issue: Enabling catchup unleashes too many historical runs at once
Worker slots fill up, recent daily jobs queue behind old runs, and platform stability degrades because the DAG suddenly behaves like a backfill workload.
Airflow · Log Analysis · Airflow and Reliability Scenarios · Enabling catchup unleashes too many historical runs at once
Airflow · Intermediate · Premium
The Zombie Task Mystery
You will practice
Diagnose and fix the issue: Worker death leaves tasks stuck in misleading running states
Operations sees tasks that remain in running or uncertain states even though the actual process is gone, causing confusion about whether to retry, clear, or wait.
Airflow · Log Analysis · Airflow and Reliability Scenarios · Worker death leaves tasks stuck in misleading running states
Airflow · Beginner · Premium
The XCom Bloat Problem
You will practice
Diagnose and fix the issue: Teams misuse XCom for large payloads and degrade Airflow itself
The DAG still functions for a while, but the metadata database grows, UI pages become sluggish, and scheduler behavior becomes less healthy.
Airflow · Mixed Lab · Airflow and Reliability Scenarios · Teams misuse XCom for large payloads and degrade Airflow itself
Airflow · Intermediate · Premium
The DAG Parse Bottleneck
You will practice
Diagnose and fix the issue: Heavy top-level code makes the scheduler slow before tasks even run
The scheduler begins lagging even when worker capacity is fine, and new DAGs or task changes take too long to appear.
Airflow · Log Analysis · Airflow and Reliability Scenarios · Heavy top-level code makes the scheduler slow before tasks even run
Airflow · Intermediate · Premium
The Pagination Retry Loop
You will practice
Diagnose and fix the issue: External API ingestion retries create duplicates or inconsistent slices
Retries make some runs succeed eventually, but duplicate pages, missing final pages, or inconsistent cursor state begin appearing in the landed data.
Airflow · Log Analysis · Airflow and Reliability Scenarios · External API ingestion retries create duplicates or inconsistent slices
Airflow · Intermediate · Premium
The Premature Publish Race
You will practice
Diagnose and fix the issue: Downstream data is published before all upstream slices are truly ready
Occasionally the publish step runs after only part of the data is truly ready, and downstream dashboards momentarily show mixed-day or partially refreshed results.
Airflow · Log Analysis · Airflow and Reliability Scenarios · Downstream data is published before all upstream slices are truly ready
Airflow · Beginner · Premium
The Ownership Vacuum
You will practice
Diagnose and fix the issue: No clear owner responds when DAGs fail repeatedly
Operational noise grows, SLAs are missed, and incident resolution is slow because the technical problem is compounded by missing ownership and escalation rules.
Airflow · Log Analysis · Airflow and Reliability Scenarios · No clear owner responds when DAGs fail repeatedly
Airflow · Intermediate · Premium
The Cross-DAG Dependency Trap
You will practice
Diagnose and fix the issue: External DAG dependencies become brittle as schedules evolve
What used to be a simple dependency now causes missed runs, false waiting, or accidental deadlocks because the cross-DAG contract is no longer explicit.
Airflow · Log Analysis · Airflow and Reliability Scenarios · External DAG dependencies become brittle as schedules evolve
Airflow · Advanced · Premium
The Metadata DB Pressure Wave
You will practice
Diagnose and fix the issue: Airflow's metadata database becomes the bottleneck
UI pages load slowly, scheduling becomes delayed, and operational tasks such as clearing runs or browsing logs feel increasingly painful.
Airflow · Log Analysis · Airflow and Reliability Scenarios · Airflow's metadata database becomes the bottleneck
Airflow · Intermediate · Premium
The Environment Drift Outage
You will practice
Diagnose and fix the issue: Same DAG behaves differently across dev, staging, and prod
Teams discover differences in library versions, Airflow connections, environment variables, or executor configuration that changed task behavior in subtle ways.
Airflow · Mixed Lab · Airflow and Reliability Scenarios · Same DAG behaves differently across dev, staging, and prod
Airflow · Intermediate · Premium
The Run Config Chaos
You will practice
Diagnose and fix the issue: Manual triggers with custom params create irreproducible outputs
The flexibility helps in emergencies, but over time the team loses confidence in what different runs actually did because parameters were inconsistent and not governed.
Airflow · Log Analysis · Airflow and Reliability Scenarios · Manual triggers with custom params create irreproducible outputs
Airflow · Intermediate · Premium
The Maintenance Window Replay
You will practice
Diagnose and fix the issue: Platform upgrades or pauses create a messy restart backlog
After the platform returns, some teams want immediate replay, others want to skip non-critical intervals, and several DAGs collide as they all try to recover at once.
Airflow · Log Analysis · Airflow and Reliability Scenarios · Platform upgrades or pauses create a messy restart backlog
Mixed · Beginner · Premium
The Partitioning Disaster
You will practice
Diagnose and fix the issue: High-cardinality partition column destroys performance
After a few months, the table has millions of partitions and both writes and reads become painful.
Mixed · Mixed Lab · Data Lake and Lakehouse Scenarios · High-cardinality partition column destroys performance
Kafka · Intermediate · Premium
The Schema Evolution Shock
You will practice
Diagnose and fix the issue: New fields or changed types break downstream jobs
A silver pipeline either starts failing on new schema versions or silently drops important fields, depending on configuration.
Kafka · Log Analysis · Data Lake and Lakehouse Scenarios · New fields or changed types break downstream jobs
Mixed · Intermediate · Premium
The Retention Regret
You will practice
Diagnose and fix the issue: Old data needed for rerun was vacuumed or expired
Later, an audit requires replaying or verifying historical states, but the supporting files are gone.
Mixed · Log Analysis · Data Lake and Lakehouse Scenarios · Old data needed for rerun was vacuumed or expired
Mixed · Intermediate · Premium
The Metadata Swamp
You will practice
Diagnose and fix the issue: Too many partitions and files make the table hard to plan
Even simple queries spend a long time in planning before actual compute begins.
Mixed · Log Analysis · Data Lake and Lakehouse Scenarios · Too many partitions and files make the table hard to plan
Mixed · Advanced · Premium
The Delete Semantics Problem
You will practice
Diagnose and fix the issue: How to model deletes in CDC pipelines
The team handles inserts and updates, but deleted records keep appearing in downstream current-state tables because delete events were ignored or treated inconsistently.
Mixed · Mixed Lab · Data Lake and Lakehouse Scenarios · How to model deletes in CDC pipelines
Mixed · Intermediate · Premium
The Compaction Strategy Interview
You will practice
Diagnose and fix the issue: How do you choose file size and compaction cadence?
They want more than 'I would optimize the table'.
Mixed · Mixed Lab · Data Lake and Lakehouse Scenarios · How do you choose file size and compaction cadence?
Mixed · Advanced · Premium
The Multi-Writer Collision
You will practice
Diagnose and fix the issue: Multiple jobs write the same table inconsistently
The table sometimes shows missing rows, overwritten partitions, or transaction conflicts depending on timing.
Mixed · Mixed Lab · Data Lake and Lakehouse Scenarios · Multiple jobs write the same table inconsistently
Mixed · Beginner · Premium
The Format Decision
You will practice
Diagnose and fix the issue: Choosing Parquet vs Avro vs JSON for different layers
The interviewer wants to know whether you understand why one format is not best for every layer.
Mixed · Mixed Lab · Data Lake and Lakehouse Scenarios · Choosing Parquet vs Avro vs JSON for different layers
SQL · Beginner · Premium
The Time Travel Misread
You will practice
Diagnose and fix the issue: Analysts query the wrong table version and draw false conclusions
Confusion grows because people compare the wrong version timestamps, assume current schemas apply to older versions, and sometimes publish conclusions from a snapshot that was never the official business close.
SQL · Broken SQL · Data Lake and Lakehouse Scenarios · Analysts query the wrong table version and draw false conclusions
Kafka · Advanced · Premium
The Vacuum vs Streaming Conflict
You will practice
Diagnose and fix the issue: Aggressive cleanup breaks slow readers or long-running streaming jobs
After the change, some readers fail because files referenced by their transaction state are no longer available, and teams argue about whether the retention policy or the consumers are at fault.
Kafka · Log Analysis · Data Lake and Lakehouse Scenarios · Aggressive cleanup breaks slow readers or long-running streaming jobs
Mixed · Intermediate · Premium
The Dynamic Overwrite Blast Radius
You will practice
Diagnose and fix the issue: Partition overwrite logic replaces more data than intended
After a code or schema change, the job deletes or replaces a wider slice than intended, and downstream teams discover that previously healthy partitions have disappeared.
MixedLog AnalysisData Lake and Lakehouse ScenariosPartition overwrite logic replaces more data than intended
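One way to limit the blast radius described in this scenario is to check which partitions a batch actually touches before any overwrite runs. The sketch below is only an illustration: `validate_overwrite_scope` and its arguments are hypothetical names, and it assumes the job can list the partition values present in the batch up front.

```python
def validate_overwrite_scope(batch_partitions, allowed_partitions):
    """Return the sorted partitions to overwrite, or raise if the batch touches others.

    batch_partitions: partition values found in the data about to be written.
    allowed_partitions: partition values this run is expected to replace.
    """
    touched = set(batch_partitions)
    unexpected = touched - set(allowed_partitions)
    if unexpected:
        # Fail before writing so healthy partitions are never replaced by accident.
        raise ValueError(
            f"refusing overwrite; unexpected partitions: {sorted(unexpected)}"
        )
    return sorted(touched)
```

Calling this guard before the write turns a silent wide overwrite into an explicit failure that names the offending partitions.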
MixedIntermediatePremium
The Rewrite Amplification Problem
You will practice
Diagnose and fix the issue: Small logical updates cause large physical file rewrites
In practice, the update workflow rewrites many files, consumes large compute budgets, and slows both writers and readers.
MixedLog AnalysisData Lake and Lakehouse ScenariosSmall logical updates cause large physical file rewrites
MixedIntermediatePremium
The Data Skipping Disappointment
You will practice
Diagnose and fix the issue: Table statistics exist, but queries still scan far more than expected
Teams then discover that some important queries still scan heavily, leading to disappointment and confusion about why the advertised optimization is not working well.
MixedMixed LabData Lake and Lakehouse ScenariosTable statistics exist, but queries still scan far more than expected
MixedBeginnerPremium
The Clone vs Copy Decision
You will practice
Diagnose and fix the issue: Teams need safe experimentation without corrupting production tables
Some engineers propose copying the entire table to a new location, while others suggest cloning, snapshotting, or creating a shallow environment-specific branch depending on platform capabilities.
MixedLog AnalysisData Lake and Lakehouse ScenariosTeams need safe experimentation without corrupting production tables
KafkaIntermediatePremium
The Decimal Precision Shock
You will practice
Diagnose and fix the issue: Schema evolution changes numeric precision and silently alters downstream behavior
Some downstream jobs fail, while others continue but round values differently, causing subtle mismatches in finance and reconciliation reports.
KafkaMixed LabData Lake and Lakehouse ScenariosSchema evolution changes numeric precision and silently alters downstream behavior
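The rounding mismatch in this scenario can be reproduced with Python's `decimal` module. This is a small sketch with made-up amounts: it assumes one job rounds each value to two decimal places before summing, while another sums at full precision and rounds once at the end.

```python
from decimal import Decimal, ROUND_HALF_UP


def to_scale(value: str, scale: int) -> Decimal:
    """Round a decimal string to `scale` fractional digits, half up."""
    quantum = Decimal(1).scaleb(-scale)  # scale=2 gives a quantum of 0.01
    return Decimal(value).quantize(quantum, rounding=ROUND_HALF_UP)


# Illustrative amounts stored with three fractional digits.
amounts = ["10.125", "20.125", "30.125"]

# Job A: keep full precision, round only the total.
sum_then_round = to_scale(str(sum(Decimal(a) for a in amounts)), 2)

# Job B: column was narrowed to scale 2, so each row is rounded first.
round_then_sum = sum(to_scale(a, 2) for a in amounts)

print(sum_then_round)  # 60.38
print(round_then_sum)  # 60.39
```

The one-cent gap from three rows shows how a precision change can pass schema checks and still break reconciliation.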
SQLAdvancedPremium
The Manifest Consistency Gap
You will practice
Diagnose and fix the issue: External query engines see stale or inconsistent table state
Writes succeed in the source engine, but one or more external consumers see stale files, partial state, or delayed visibility, creating cross-tool inconsistency.
SqlLog AnalysisData Lake and Lakehouse ScenariosExternal query engines see stale or inconsistent table state
MixedAdvancedPremium
The Erasure Request Challenge
You will practice
Diagnose and fix the issue: Personal-data deletion requests collide with append-only historical design
The data exists across raw, refined, and serving layers, and some datasets are intentionally append-only for audit or replay. Teams struggle to honor deletion without breaking lineage or creating hidden remnants.
MixedLog AnalysisData Lake and Lakehouse ScenariosPersonal-data deletion requests collide with append-only historical design
MixedIntermediatePremium
The CDC Bootstrap Merge
You will practice
Diagnose and fix the issue: Combining initial snapshot load with ongoing CDC creates duplicates or ordering issues
After cutover, duplicate keys and inconsistent ordering appear because the snapshot and change stream overlap imperfectly.
MixedMixed LabData Lake and Lakehouse ScenariosCombining initial snapshot load with ongoing CDC creates duplicates or ordering issues
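A common way to reason about the overlap is to record the log position at which the snapshot was consistent and apply only changes after it, in log order. The sketch below assumes each change carries a monotonically increasing `lsn` field and that the snapshot position is known; the field names and the function are hypothetical.

```python
def merge_snapshot_and_cdc(snapshot_rows, snapshot_lsn, cdc_events):
    """Build current state from a snapshot plus an overlapping change stream.

    snapshot_rows: iterable of dicts with an "id" field.
    snapshot_lsn: log position at which the snapshot was consistent (assumed known).
    cdc_events: iterable of dicts with "id", "lsn", "op" ("upsert" or "delete"), "row".
    """
    # Track the log position that produced each row so stale changes can be skipped.
    state = {row["id"]: (snapshot_lsn, row) for row in snapshot_rows}
    # Apply in log order so out-of-order delivery cannot reorder updates.
    for event in sorted(cdc_events, key=lambda e: e["lsn"]):
        if event["lsn"] <= snapshot_lsn:
            continue  # already reflected in the snapshot
        current = state.get(event["id"])
        if current is not None and event["lsn"] <= current[0]:
            continue  # duplicate delivery of a change already applied
        if event["op"] == "delete":
            state.pop(event["id"], None)
        else:
            state[event["id"]] = (event["lsn"], event["row"])
    return {key: row for key, (_, row) in state.items()}
```

The same idea carries over to a table `MERGE` keyed on the primary key with a guard on the ordering column, if the target engine supports it.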
MixedIntermediatePremium
The Orphan File Incident
You will practice
Diagnose and fix the issue: Failed writes or manual operations leave files not tracked by the table state
Confusion follows because some engineers want to clean them manually, others worry about breaking recovery, and storage costs slowly creep upward from unmanaged remnants.
MixedLog AnalysisData Lake and Lakehouse ScenariosFailed writes or manual operations leave files not tracked by the table state
MixedIntermediatePremium
The Catalog Drift Problem
You will practice
Diagnose and fix the issue: Table metadata in the catalog diverges from actual storage or contract reality
Eventually consumers see schema mismatches, stale locations, or conflicting definitions for what they believe is the same dataset.
MixedLog AnalysisData Lake and Lakehouse ScenariosTable metadata in the catalog diverges from actual storage or contract reality
KafkaIntermediatePremium
The Consumer Lag Crisis
You will practice
Diagnose and fix the issue: Kafka consumers fall behind during traffic spikes
During campaigns and product launches, consumer lag spikes badly and downstream freshness SLAs are missed.
KafkaLog AnalysisStreaming and Kafka ScenariosKafka consumers fall behind during traffic spikes
KafkaAdvancedPremium
The Exactly-Once Myth
You will practice
Diagnose and fix the issue: Candidate confuses source offsets with end-to-end exactly once
The interviewer pushes back and asks whether duplicates are still possible in the target warehouse or lakehouse.
KafkaLog AnalysisStreaming and Kafka ScenariosCandidate confuses source offsets with end-to-end exactly once
KafkaIntermediatePremium
The Out-of-Order Event Puzzle
You will practice
Diagnose and fix the issue: Late and out-of-order events distort aggregates
Users in unstable network conditions send events late, so dashboards change retroactively or show inconsistent counts.
KafkaMixed LabStreaming and Kafka ScenariosLate and out-of-order events distort aggregates
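The trade-off behind this scenario can be sketched as a tumbling-window count with a watermark and an allowed-lateness budget. This is a simplified model, not any engine's implementation: the watermark is assumed to be the largest event time seen so far minus `allowed_lateness_s`, and events whose window closed before the watermark are set aside rather than counted.

```python
from collections import defaultdict


def windowed_counts(event_times, window_s=60, allowed_lateness_s=120):
    """Count events per tumbling window, in arrival order.

    event_times: event timestamps in seconds, listed in the order they arrive.
    Returns (counts keyed by window start, list of events that arrived too late).
    """
    counts = defaultdict(int)
    too_late = []
    max_event_time = float("-inf")
    for event_time in event_times:
        max_event_time = max(max_event_time, event_time)
        watermark = max_event_time - allowed_lateness_s
        window_start = (event_time // window_s) * window_s
        window_end = window_start + window_s
        if window_end <= watermark:
            # The window is already final; route the event to a correction path.
            too_late.append(event_time)
            continue
        counts[window_start] += 1
    return dict(counts), too_late


counts, too_late = windowed_counts([10, 70, 20, 400, 30])
print(counts)    # {0: 2, 60: 1, 360: 1}
print(too_late)  # [30]
```

A larger lateness budget keeps more late events in their windows but delays when a window can be treated as final.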
KafkaIntermediatePremium
The Hot Partition Problem
You will practice
Diagnose and fix the issue: One Kafka partition carries disproportionate traffic
A few enterprise customers generate far more traffic than others, causing one or two partitions to carry most of the load while others remain underutilized.
KafkaLog AnalysisStreaming and Kafka ScenariosOne Kafka partition carries disproportionate traffic
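One mitigation often discussed for this scenario is salting the keys of known heavy customers so their traffic spreads over several partitions. The sketch below is illustrative only: it uses an MD5-based hash rather than Kafka's default partitioner, and `hot_keys` and `salt_buckets` are hypothetical parameters.

```python
import hashlib


def stable_hash(value: str) -> int:
    """Deterministic hash, unlike Python's built-in hash() which varies per process."""
    return int.from_bytes(hashlib.md5(value.encode("utf-8")).digest()[:8], "big")


def choose_partition(key, sequence, num_partitions, hot_keys, salt_buckets=4):
    """Pick a partition, spreading known hot keys across up to `salt_buckets` partitions.

    sequence: any per-record counter; it only rotates the salt for hot keys.
    """
    if key in hot_keys:
        salted = f"{key}#{sequence % salt_buckets}"
        return stable_hash(salted) % num_partitions
    # Normal keys keep a single partition, so per-key ordering is preserved.
    return stable_hash(key) % num_partitions
```

Salting gives up strict per-key ordering for the hot keys, so consumers that depend on order for those customers need a sequence number or a downstream re-sort.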
KafkaAdvancedPremium
The Replay Without Downtime Challenge
You will practice
Diagnose and fix the issue: Need to reprocess historical events while live stream continues
The interviewer asks how you would replay history without corrupting current processing.
KafkaLog AnalysisStreaming and Kafka ScenariosNeed to reprocess historical events while live stream continues
KafkaIntermediatePremium
The CDC Duplication Debate
You will practice
Diagnose and fix the issue: Database CDC emits duplicates or out-of-order updates
The downstream team sees what look like duplicate updates for the same key and is unsure whether the connector or the source is broken.
KafkaLog AnalysisStreaming and Kafka ScenariosDatabase CDC emits duplicates or out-of-order updates
KafkaIntermediatePremium
The Poison Pill Event
You will practice
Diagnose and fix the issue: A malformed event repeatedly breaks the stream
Because offsets are not advancing past that record, the pipeline enters a fail-restart-fail loop and freshness collapses.
KafkaLog AnalysisStreaming and Kafka ScenariosA malformed event repeatedly breaks the stream
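The fail-restart-fail loop can be broken by bounding retries and routing the bad record to a dead-letter list so the offset can move on. The sketch below models this without a Kafka client: `records` is a plain list of `(offset, raw)` pairs, and `consume` and `handle` are hypothetical names.

```python
import json


def consume(records, handle, max_attempts=3):
    """Process records in order; park records that keep failing instead of blocking.

    records: list of (offset, raw_json_string) pairs.
    handle: callable applied to each decoded payload.
    Returns (last committed offset, list of dead-lettered records).
    """
    dead_letters = []
    committed = None
    for offset, raw in records:
        last_error = None
        for _ in range(max_attempts):
            try:
                handle(json.loads(raw))
                break
            except (KeyError, ValueError) as exc:  # JSONDecodeError is a ValueError
                last_error = exc
        else:
            # All attempts failed: keep the evidence, then let the stream continue.
            dead_letters.append(
                {"offset": offset, "raw": raw, "error": repr(last_error)}
            )
        committed = offset  # advance past the record either way
    return committed, dead_letters
```

Parked records still need an owner and a replay path, otherwise the dead-letter store quietly becomes data loss.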
KafkaIntermediatePremium
The Rebalance Storm
You will practice
Diagnose and fix the issue: Frequent consumer-group rebalances destroy throughput
Freshness collapses even though raw incoming throughput has not changed much, because consumers spend too much time rejoining rather than processing.
KafkaMixed LabStreaming and Kafka ScenariosFrequent consumer-group rebalances destroy throughput
KafkaIntermediatePremium
The Schema Registry Compatibility Break
You will practice
Diagnose and fix the issue: Producer schema evolution breaks downstream consumers
A producer deploys a new schema version that passes its own tests, but one or more consumers fail, mis-parse records, or begin dropping fields after the rollout.
KafkaMixed LabStreaming and Kafka ScenariosProducer schema evolution breaks downstream consumers
KafkaBeginnerPremium
The Producer Retry Duplicate
You will practice
Diagnose and fix the issue: Upstream retries create duplicate events that downstreams must tolerate
Downstream consumers begin seeing duplicate events for the same logical business action, and some pipelines amplify the issue because they assumed each event would appear exactly once.
KafkaLog AnalysisStreaming and Kafka ScenariosUpstream retries create duplicate events that downstreams must tolerate
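A consumer can tolerate retried events by keying its side effects on a stable business identifier. The sketch below assumes every event carries an `event_id` assigned once by the producer and reused on retry; the in-memory set stands in for whatever durable, bounded store a real pipeline would need.

```python
def apply_once(events, seen_ids, apply):
    """Apply each logical event at most once, based on its event_id.

    events: iterable of dicts with an "event_id" field (assumed stable across retries).
    seen_ids: mutable set of ids already applied.
    apply: callable performing the side effect.
    Returns the number of events actually applied.
    """
    applied = 0
    for event in events:
        event_id = event["event_id"]
        if event_id in seen_ids:
            continue  # duplicate produced by an upstream retry
        apply(event)
        seen_ids.add(event_id)
        applied += 1
    return applied
```

If the process can crash between `apply` and recording the id, the side effect and the id need to be written in one transaction or the side effect must itself be idempotent.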
KafkaIntermediatePremium
The Log Compaction Misunderstanding
You will practice
Diagnose and fix the issue: Teams misuse compacted topics because they confuse history with current state
Later, another team uses the same topic for audit-style analytics and is surprised that older change history is incomplete or no longer retained as they expected.
KafkaLog AnalysisStreaming and Kafka ScenariosTeams misuse compacted topics because they confuse history with current state
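Why audit-style history disappears can be shown with a toy model of compaction: only the newest value per key survives, and a `None` value acts as a tombstone. This is a deliberately simplified sketch; a real broker compacts segment by segment and keeps tombstones for a configurable period.

```python
def compact(log):
    """Return the records a fully compacted topic would retain.

    log: list of (key, value) pairs in offset order; value None is a tombstone.
    """
    latest = {}
    for offset, (key, value) in enumerate(log):
        latest[key] = (offset, value)  # later offsets replace earlier ones
    survivors = sorted(latest.items(), key=lambda item: item[1][0])
    return [(key, value) for key, (_, value) in survivors if value is not None]


log = [("a", 1), ("b", 2), ("a", 3), ("b", None), ("c", 4)]
print(compact(log))  # [('a', 3), ('c', 4)]
```

The earlier value of `a` and every trace of `b` are gone, which is fine for current state and wrong for an audit trail.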
KafkaIntermediatePremium
The Event Version Drift
You will practice
Diagnose and fix the issue: Different event versions coexist and consumers handle them inconsistently
Some consumers cope well, but others silently ignore new fields, mis-handle old versions, or break on unexpected combinations when replaying older data.
KafkaLog AnalysisStreaming and Kafka ScenariosDifferent event versions coexist and consumers handle them inconsistently
KafkaIntermediatePremium
The DLQ Black Hole
You will practice
Diagnose and fix the issue: Dead-letter queues collect bad events but no one actually resolves them
At first this improves uptime, but later the DLQ becomes a graveyard of unresolved records, and business users realize some important events never made it back into the main datasets.
KafkaLog AnalysisStreaming and Kafka ScenariosDead-letter queues collect bad events but no one actually resolves them
KafkaAdvancedPremium
The Stream-Table Join State Blowup
You will practice
Diagnose and fix the issue: Joining events with changing reference data causes large or unstable state
As the joined state grows, checkpoint size increases, recovery slows, and the team struggles to balance freshness, memory, and correctness.
KafkaMixed LabStreaming and Kafka ScenariosJoining events with changing reference data causes large or unstable state
KafkaAdvancedPremium
The Historical Replay into a Live Topic
You will practice
Diagnose and fix the issue: Backfilling older events into an active topic disrupts consumers
Some consumers handle the replay badly, lag grows sharply, and ordering assumptions break because old events are now mixed into the live stream.
KafkaLog AnalysisStreaming and Kafka ScenariosBackfilling older events into an active topic disrupts consumers
KafkaIntermediatePremium
The Clock Skew Event-Time Bug
You will practice
Diagnose and fix the issue: Producer clocks distort event-time processing and watermark logic
Watermarks and windowing behave strangely because some events appear to come from the future or far from the expected event-time distribution.
KafkaLog AnalysisStreaming and Kafka ScenariosProducer clocks distort event-time processing and watermark logic
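One defensive pattern for this scenario is to compare the producer's event time with the ingestion time and refuse to let far-future timestamps drive the watermark. The sketch below is an assumption-laden illustration: `max_future_skew_s` is a hypothetical tolerance, and falling back to ingestion time is only one possible policy.

```python
def effective_event_time(event_time_s, ingest_time_s, max_future_skew_s=300):
    """Return (timestamp to use, whether the producer timestamp was overridden).

    Timestamps further in the future than the tolerance are replaced with the
    ingestion time so a single skewed clock cannot advance the watermark.
    """
    if event_time_s > ingest_time_s + max_future_skew_s:
        return ingest_time_s, True
    return event_time_s, False
```

Counting how often the override flag is set per producer gives a direct signal of which clocks are drifting.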
KafkaAdvancedPremium
The External API Sink Illusion
You will practice
Diagnose and fix the issue: Teams expect exactly-once outcomes while writing stream results to an external API
Leadership asks whether the pipeline is exactly once end to end, but engineers know retries, timeouts, and uncertain acknowledgements make that claim shaky.
KafkaLog AnalysisStreaming and Kafka ScenariosTeams expect exactly-once outcomes while writing stream results to an external API
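When the sink is an external API, the realistic goal is usually effectively-once through idempotent requests rather than a true exactly-once guarantee. The sketch below derives a deterministic key from the record's position and uses a stand-in `FakeApi` class; it assumes the real API accepts and deduplicates on such a key, which has to be verified for any actual service.

```python
import hashlib


def idempotency_key(topic, partition, offset):
    """Derive the same key on every retry of the same source record."""
    raw = f"{topic}:{partition}:{offset}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()


class FakeApi:
    """Stand-in for an external service that deduplicates on an idempotency key."""

    def __init__(self):
        self.applied = {}

    def post(self, key, payload):
        if key in self.applied:
            return "duplicate"  # retry after a timeout: no second side effect
        self.applied[key] = payload
        return "created"


api = FakeApi()
key = idempotency_key("orders", 3, 42)
print(api.post(key, {"order_id": 7}))  # created
print(api.post(key, {"order_id": 7}))  # duplicate
```

Without server-side deduplication, a timeout leaves the caller unable to tell whether the first request landed, which is why the exactly-once claim is shaky.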
KafkaIntermediatePremium
The Retention vs Replay Trade-off
You will practice
Diagnose and fix the issue: Topic retention is too short for real operational replay needs
Then an important downstream system falls behind for longer than expected, or a bug requires replaying data older than the topic still retains.
KafkaMixed LabStreaming and Kafka ScenariosTopic retention is too short for real operational replay needs
KafkaAdvancedPremium
The Multi-Topic Correlation Delay
You will practice
Diagnose and fix the issue: Joining multiple topics with different latency patterns creates inconsistent outputs
One topic is fast and clean, another is bursty and late, and the correlated output oscillates or requires frequent correction as related events arrive on different timelines.
KafkaLog AnalysisStreaming and Kafka ScenariosJoining multiple topics with different latency patterns creates inconsistent outputs
KafkaAdvancedPremium
The Transactional Producer Overclaim
You will practice
Diagnose and fix the issue: Teams misuse producer transactions and exaggerate what they guarantee
Later, consumers still encounter duplicates at the business level, and stakeholders feel misled because they assumed the producer feature solved end-to-end correctness automatically.
KafkaLog AnalysisStreaming and Kafka ScenariosTeams misuse producer transactions and exaggerate what they guarantee
KafkaAdvancedPremium
The Unified Customer 360 Design
You will practice
Diagnose and fix the issue: Designing batch and streaming together
The interviewer wants to see whether you can balance raw ingestion, curated models, low-latency serving, and historical correctness in one architecture.
KafkaLog AnalysisArchitecture, Cloud, and Governance ScenariosDesigning batch and streaming together
MixedIntermediatePremium
The Cloud Migration without a Big Bang
You will practice
Diagnose and fix the issue: Moving on-prem ETL to cloud safely
The interviewer asks how you would migrate while controlling risk.
MixedLog AnalysisArchitecture, Cloud, and Governance ScenariosMoving on-prem ETL to cloud safely
MixedIntermediatePremium
The Cost Spike Mystery
You will practice
Diagnose and fix the issue: Cloud data platform spend increases unexpectedly
You are asked to investigate where the money is going without damaging delivery SLAs.
MixedMixed LabArchitecture, Cloud, and Governance ScenariosCloud data platform spend increases unexpectedly
MixedIntermediatePremium
The Data Quality Incident
You will practice
Diagnose and fix the issue: Business discovers wrong numbers after release
Leadership wants both a recovery plan and a prevention plan.
MixedLog AnalysisArchitecture, Cloud, and Governance ScenariosBusiness discovers wrong numbers after release
MixedAdvancedPremium
The PII Exposure Risk
You will practice
Diagnose and fix the issue: Sensitive data lands in the lake without proper controls
No breach is confirmed, but governance controls are clearly inadequate.
MixedLog AnalysisArchitecture, Cloud, and Governance ScenariosSensitive data lands in the lake without proper controls
MixedIntermediatePremium
The Observability Black Hole
You will practice
Diagnose and fix the issue: Pipeline fails but logs and metrics are insufficient
The interviewer asks how you would improve observability so incidents become faster to diagnose.
MixedLog AnalysisArchitecture, Cloud, and Governance ScenariosPipeline fails but logs and metrics are insufficient
Data QualityAdvancedPremium
The End-to-End Sales Pipeline Design
You will practice
Diagnose and fix the issue: Comprehensive scenario covering batch, late data, idempotency, and quality
The interviewer wants to hear how you would build the system and what trade-offs you would make.
Data QualityMixed LabArchitecture, Cloud, and Governance ScenariosComprehensive scenario covering batch, late data, idempotency, and quality
MixedIntermediatePremium
The Medallion by Dogma
You will practice
Diagnose and fix the issue: Too many mandatory layers slow delivery without adding value
Over time some small, stable datasets move through unnecessary layers, delivery slows, and teams begin copying data mainly to satisfy architecture doctrine rather than a real need.
MixedLog AnalysisArchitecture, Cloud, and Governance ScenariosToo many mandatory layers slow delivery without adding value
MixedIntermediatePremium
The Build vs Buy Workflow Debate
You will practice
Diagnose and fix the issue: Choosing managed services versus custom platform components
Debates become ideological: one group argues managed services are always faster, while another argues custom control is essential for scale and long-term flexibility.
MixedLog AnalysisArchitecture, Cloud, and Governance ScenariosChoosing managed services versus custom platform components
MixedAdvancedPremium
The Domain Ownership Collision
You will practice
Diagnose and fix the issue: Central platform standards conflict with domain-team autonomy
Friction emerges because domain teams want speed and freedom, while the platform team wants consistent security, lineage, and operational controls.
MixedLog AnalysisArchitecture, Cloud, and Governance ScenariosCentral platform standards conflict with domain-team autonomy
MixedAdvancedPremium
The Data Residency Split
You will practice
Diagnose and fix the issue: Global architecture must respect regional data residency and access rules
Business leaders still want global reporting and machine learning, so the architecture must balance regional isolation with cross-region insight.
MixedLog AnalysisArchitecture, Cloud, and Governance ScenariosGlobal architecture must respect regional data residency and access rules
MixedIntermediatePremium
The Chargeback Blind Spot
You will practice
Diagnose and fix the issue: Cloud cost rises, but teams cannot see which workloads are responsible
Engineers can guess broadly, but the platform lacks reliable workload tagging, cost attribution, or standardized lineage between spend and value.
MixedMixed LabArchitecture, Cloud, and Governance ScenariosCloud cost rises, but teams cannot see which workloads are responsible
MixedAdvancedPremium
The Features vs BI Truth Split
You will practice
Diagnose and fix the issue: Machine-learning features and BI metrics drift apart from the same source data
Eventually the product and analytics teams compare outputs and realize the 'same' concept means different things in models versus dashboards.
MixedLog AnalysisArchitecture, Cloud, and Governance ScenariosMachine-learning features and BI metrics drift apart from the same source data
MixedIntermediatePremium
The Bad Data Incident Command
You will practice
Diagnose and fix the issue: Wrong numbers reach business users and the response is unstructured
Engineers start debugging immediately, but the response is chaotic: no one knows whether to pull the report, page consumers, freeze downstream jobs, or keep the platform running while the root cause is investigated.
MixedMixed LabArchitecture, Cloud, and Governance ScenariosWrong numbers reach business users and the response is unstructured
MixedIntermediatePremium
The Data Contract Rollout
You will practice
Diagnose and fix the issue: Producer-consumer contracts exist in theory but are hard to enforce in practice
Everyone agrees in principle, but adoption stalls because producer teams see contracts as bureaucracy and consumers still lack confidence that changes will be caught early.
MixedLog AnalysisArchitecture, Cloud, and Governance ScenariosProducer-consumer contracts exist in theory but are hard to enforce in practice
MixedAdvancedPremium
The Platform Disaster Recovery Plan
You will practice
Diagnose and fix the issue: Data platform recovery design is vague until a real outage occurs
Leadership assumes the platform is resilient, yet engineers realize the actual recovery sequence, RTO, and data-loss boundaries are poorly defined.
MixedLog AnalysisArchitecture, Cloud, and Governance ScenariosData platform recovery design is vague until a real outage occurs
MixedIntermediatePremium
The Self-Serve Guardrail Balance
You will practice
Diagnose and fix the issue: Making the platform self-serve without opening the door to chaos
At the same time, security, governance, and reliability teams worry that broad self-service will create an uncontrolled sprawl of low-quality data products and risky permissions.
MixedLog AnalysisArchitecture, Cloud, and Governance ScenariosMaking the platform self-serve without opening the door to chaos
MixedIntermediatePremium
The Backup vs Replay vs Recompute Choice
You will practice
Diagnose and fix the issue: Teams do not know whether to restore, replay, or rebuild after data loss
Some want to restore from backup, others want to replay from raw events, and others argue the dataset should simply be recomputed from upstream sources - but no one has a framework for choosing.
MixedMixed LabArchitecture, Cloud, and Governance ScenariosTeams do not know whether to restore, replay, or rebuild after data loss
MixedIntermediatePremium
The Noisy Neighbor Platform
You will practice
Diagnose and fix the issue: One team's heavy workload degrades the shared platform for everyone else
At peak times a few large jobs monopolize resources, and unrelated teams experience slower queries, missed SLAs, or higher costs even though their own usage has not changed.
MixedMixed LabArchitecture, Cloud, and Governance ScenariosOne team's heavy workload degrades the shared platform for everyone else
MixedAdvancedPremium
The Executive KPI Mismatch
You will practice
Diagnose and fix the issue: Different teams publish different versions of the same top-line metric
The platform itself may be running fine, but organizational trust drops because no one can say which number is canonical or why the definitions diverged.
MixedLog AnalysisArchitecture, Cloud, and Governance ScenariosDifferent teams publish different versions of the same top-line metric