The Endless Final Stage
You own a daily Spark ETL that joins a very large clickstream_events table with a user_profiles table to build a personalization layer for dashboards.
Scenario context
The job reaches 99 percent quickly, 199 out of 200 tasks finish, but one task runs for hours with high GC and 100 percent CPU before ending with OutOfMemoryError.
Business requirement
Fix the PySpark code so the pipeline is correct, scalable, and safe to rerun.
Schema
DataFrames depend on the scenario. Assume large production-scale inputs, skewed keys, retries, and partitioned lake storage.Sample input
Use the code comments and logs to infer the input shape. Focus on the production failure mode, not local toy execution.Broken logic / code
from pyspark.sql import functions as F
events = spark.read.parquet(clickstream_path)
profiles = spark.read.parquet(user_profiles_path)
# Broken: one hot/null user_id can push most rows into one shuffle task.
result = (events
.join(profiles, on='user_id', how='left')
.groupBy('user_id')
.agg(F.count('*').alias('event_count')))
result.write.mode('overwrite').parquet(output_path)Logs / error
[Spark] The Endless Final Stage
Stage progress: most tasks finished, one or more tasks are long-running.
Metrics to inspect: shuffle read, spill, skew ratio, executor lost count, file count.
The job reaches 99 percent quickly, 199 out of 200 tasks finish, but one task runs for hours with
high GC and 100 percent CPU before ending with OutOfMemoryError.Expected output / expected logic
Corrected PySpark code or approach should reduce the failure mode, preserve correctness, and include validation/monitoring.Your attempt
Write the corrected PySpark approach
Think before revealing the answer. A partial but honest attempt is better practice than reading the model solution first.
Saved locally
Interview-style explanation
Now explain your solution as if you are in an interview: symptom, root cause, fix, edge cases, trade-offs, monitoring, and prevention.