The Data Foundry

Built by Data with Pranjal

PySparkIntermediateBroken PySpark FixFree

Append Mode Created Duplicate Daily Loads

A daily orders job was rerun after a cluster failure. The DAG succeeded, but the dashboard shows exactly 2x orders for the rerun date.

Scenario context

The PySpark job writes in append mode for a deterministic daily partition, so retries and reruns duplicate the same day.

Business requirement

Make the daily write idempotent for order_date.

Sample production data

Use these small tables to reason about the bug before writing the fix. The browser checker seeds this data when you click Check Answer.

orders_df

order_idcustomer_idorder_dateupdated_atamount
90011012026-05-022026-05-02 09:10:00120
90021022026-05-022026-05-02 09:12:0080
90011012026-05-022026-05-02 09:10:00120

gold_orders_partition_after_retry

order_idorder_dateload_attempt
90012026-05-021
90022026-05-021
90012026-05-022
90022026-05-022

Broken logic / code

orders_df
  .filter(F.col('order_date') == run_date)
  .write
  .mode('append')
  .partitionBy('order_date')
  .parquet(gold_orders_path)

Actual output

order_date=2026-05-02 has 2 copies of the same order_id values after rerun.

Expected output / expected logic

Rerunning the same date replaces or merges that date without duplicates.

Your attempt

Write the corrected PySpark approach

Think before revealing the answer. A partial but honest attempt is better practice than reading the model solution first.

Saved locally

Interview-style explanation

Now explain your solution as if you are in an interview: symptom, root cause, fix, edge cases, trade-offs, monitoring, and prevention.