Practice Spark production fixes without paying for a cluster: inspect the data, repair the code, run a concept check, then compare with the model answer.
Labs: 20 | Free: 5 | Runtime: Code review
beginner, PySpark, Deduplication, Idempotency
PySpark 1: Daily Rerun Created Duplicate Orders
The daily orders job failed after writing half the partition. The Airflow retry then processed the same input again, so some orders were written twice and the revenue dashboard double-counted them.
Data engineer task
Rewrite the broken PySpark write so rerunning the same daily file does not duplicate records.
Expected outcome
A safe PySpark fix should deduplicate by order_id or event_id, write only the affected partition, and avoid blind append for reruns.
Sample production data
raw_orders
order_id  order_date  amount  ingest_run_id
101       2026-05-30  599     run_17
102       2026-05-30  1299    run_17
101       2026-05-30  599     run_retry_17
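To make the size of the error concrete, the three sample rows can be summed with and without the retry duplicate. This is a plain-Python sketch for checking the arithmetic, not part of the Spark job; the variable names are illustrative.

```python
# The three sample rows from raw_orders:
# (order_id, order_date, amount, ingest_run_id)
rows = [
    (101, "2026-05-30", 599, "run_17"),
    (102, "2026-05-30", 1299, "run_17"),
    (101, "2026-05-30", 599, "run_retry_17"),
]

# Blind append: every row counts, so order 101 is counted twice.
naive_revenue = sum(amount for _, _, amount, _ in rows)

# Deduplicated: keep the first amount seen for each order_id.
amount_by_order = {}
for order_id, _, amount, _ in rows:
    amount_by_order.setdefault(order_id, amount)
deduped_revenue = sum(amount_by_order.values())

print(naive_revenue, deduped_revenue)  # 2497 1898
```

The difference of 599 is exactly the amount of order 101, which was written once by run_17 and again by run_retry_17.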
Fix workspace
Fix the PySpark code or write the production-safe approach. The browser checks the concepts, APIs, and trade-offs.