Append Mode Created Duplicate Daily Loads
A daily orders job was rerun after a cluster failure. The DAG succeeded, but the dashboard shows exactly 2x orders for the rerun date.
Scenario context
The PySpark job writes in append mode for a deterministic daily partition, so retries and reruns duplicate the same day.
Business requirement
Make the daily write idempotent for order_date.
Sample production data
Use these small tables to reason about the bug before writing the fix. The browser checker seeds this data when you click Check Answer.
orders_df
| order_id | customer_id | order_date | updated_at | amount |
|---|---|---|---|---|
| 9001 | 101 | 2026-05-02 | 2026-05-02 09:10:00 | 120 |
| 9002 | 102 | 2026-05-02 | 2026-05-02 09:12:00 | 80 |
| 9001 | 101 | 2026-05-02 | 2026-05-02 09:10:00 | 120 |
gold_orders_partition_after_retry
| order_id | order_date | load_attempt |
|---|---|---|
| 9001 | 2026-05-02 | 1 |
| 9002 | 2026-05-02 | 1 |
| 9001 | 2026-05-02 | 2 |
| 9002 | 2026-05-02 | 2 |
Broken logic / code
orders_df
.filter(F.col('order_date') == run_date)
.write
.mode('append')
.partitionBy('order_date')
.parquet(gold_orders_path)Actual output
order_date=2026-05-02 has 2 copies of the same order_id values after rerun.Expected output / expected logic
Rerunning the same date replaces or merges that date without duplicates.Your attempt
Write the corrected PySpark approach
Think before revealing the answer. A partial but honest attempt is better practice than reading the model solution first.
Saved locally
Interview-style explanation
Now explain your solution as if you are in an interview: symptom, root cause, fix, edge cases, trade-offs, monitoring, and prevention.