PySparkIntermediateBroken PySpark FixFree

Append Mode Created Duplicate Daily Loads

A daily orders job was rerun after a cluster failure. The DAG succeeded, but the dashboard shows exactly 2x orders for the rerun date.

Scenario context

The PySpark job writes in append mode for a deterministic daily partition, so retries and reruns duplicate the same day.

Business requirement

Make the daily write idempotent for order_date.

Sample production data

Use these small tables to reason about the bug before writing the fix. The browser checker seeds this data when you click Check Answer.

orders_df

order_id	customer_id	order_date	updated_at	amount
9001	101	2026-05-02	2026-05-02 09:10:00	120
9002	102	2026-05-02	2026-05-02 09:12:00	80
9001	101	2026-05-02	2026-05-02 09:10:00	120

gold_orders_partition_after_retry

Broken logic / code

orders_df
  .filter(F.col('order_date') == run_date)
  .write
  .mode('append')
  .partitionBy('order_date')
  .parquet(gold_orders_path)

Actual output

order_date=2026-05-02 has 2 copies of the same order_id values after rerun.

Expected output / expected logic

Rerunning the same date replaces or merges that date without duplicates.

Your attempt

Write the corrected PySpark approach

Think before revealing the answer. A partial but honest attempt is better practice than reading the model solution first.

Saved locally

Now explain your solution as if you are in an interview: symptom, root cause, fix, edge cases, trade-offs, monitoring, and prevention.