The Data Foundry

Built by Data with Pranjal

PySparkAdvancedBroken PySpark FixFree

Checkpoint Chaos

A real-time fraud detection stream reads Kafka events, enriches them, and writes to a Delta sink with checkpointing.

Scenario context

After a deployment and infrastructure interruption, the stream restarts but either reprocesses events, produces duplicates, or struggles with state-store recovery.

Business requirement

Fix the PySpark code so the pipeline is correct, scalable, and safe to rerun.

Schema

DataFrames depend on the scenario. Assume large production-scale inputs, skewed keys, retries, and partitioned lake storage.

Sample input

Use the code comments and logs to infer the input shape. Focus on the production failure mode, not local toy execution.

Broken logic / code

from pyspark.sql import functions as F

source_df = spark.read.parquet(source_path)

# Broken: this code is functionally plausible but unsafe for production scale/reruns.
result_df = (source_df
  .join(reference_df, on='id', how='left')
  .groupBy('id')
  .agg(F.count('*').alias('record_count')))

result_df.write.mode('append').parquet(output_path)

Logs / error

[Spark] Checkpoint Chaos
Stage progress: most tasks finished, one or more tasks are long-running.
Metrics to inspect: shuffle read, spill, skew ratio, executor lost count, file count.
After a deployment and infrastructure interruption, the stream restarts but either reprocesses
events, produces duplicates, or struggles with state-store recovery.

Expected output / expected logic

Corrected PySpark code or approach should reduce the failure mode, preserve correctness, and include validation/monitoring.

Your attempt

Write the corrected PySpark approach

Think before revealing the answer. A partial but honest attempt is better practice than reading the model solution first.

Saved locally

Interview-style explanation

Now explain your solution as if you are in an interview: symptom, root cause, fix, edge cases, trade-offs, monitoring, and prevention.