The Data Foundry

Built by Data with Pranjal

PySpark Lab

Browser-based data engineering practice.

Practice Spark production fixes without paying for a cluster: inspect the data, repair the code, run a concept check, then compare with the model answer.

Labs: 20

Free: 5

Runtime: Code review

beginner · PySpark · Deduplication · Idempotency

PySpark 1: Daily Rerun Created Duplicate Orders

The daily orders job failed after writing half the partition. The Airflow retry then processed the same input again, and the revenue dashboard double-counted a few orders.

Data engineer task

Rewrite the broken PySpark write so rerunning the same daily file does not duplicate records.

Expected outcome

A safe PySpark fix should deduplicate by order_id or event_id, write only the affected partition, and avoid blind append for reruns.
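
One possible shape for such a fix is sketched below. It is a hedged illustration, not the lab's model answer: the function name write_orders_idempotently, the target_path argument, and the choice of Parquet partitioned by order_date are all assumptions made for the example. It relies on the DataFrame method dropDuplicates and on the DataFrameWriter option partitionOverwriteMode set to dynamic, which asks Spark to replace only the partitions present in the DataFrame being written.

```python
def write_orders_idempotently(orders_df, target_path):
    """Rewrite only the partitions present in orders_df, one row per order_id.

    orders_df is assumed to be a pyspark.sql.DataFrame holding one daily file.
    target_path is a hypothetical Parquet location partitioned by order_date.
    """
    # Collapse rows that share an order_id. dropDuplicates keeps an arbitrary
    # row per key, which is acceptable here only because the duplicates differ
    # in nothing but ingest_run_id.
    deduped = orders_df.dropDuplicates(["order_id"])

    (
        deduped.write
        .mode("overwrite")  # replace instead of append, so reruns converge
        .option("partitionOverwriteMode", "dynamic")  # touch only partitions in deduped
        .partitionBy("order_date")
        .parquet(target_path)
    )
```

Because the write replaces the day's partition rather than adding to it, running the job twice on the same daily file should leave the same rows behind as running it once. If duplicate rows could carry different amounts, a deterministic tie-break (for example a window ordered by an ingestion timestamp) would be needed instead of dropDuplicates.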

Sample production data

raw_orders

order_id  order_date  amount  ingest_run_id
101       2026-05-30  599     run_17
102       2026-05-30  1299    run_17
101       2026-05-30  599     run_retry_17
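
To see how the retry inflates revenue, the three sample rows can be modelled in plain Python (no Spark needed). This is only an illustrative sketch: the variable names are made up for the example, and keeping the first row seen per order_id is an assumed tie-break, not a rule stated by the lab.

```python
# The three sample rows from raw_orders, as plain dicts.
raw_orders = [
    {"order_id": 101, "order_date": "2026-05-30", "amount": 599, "ingest_run_id": "run_17"},
    {"order_id": 102, "order_date": "2026-05-30", "amount": 1299, "ingest_run_id": "run_17"},
    {"order_id": 101, "order_date": "2026-05-30", "amount": 599, "ingest_run_id": "run_retry_17"},
]

# What a dashboard sums when the retry's rows are blindly appended.
naive_revenue = sum(row["amount"] for row in raw_orders)

# Keep the first row seen for each order_id (illustrative tie-break).
first_by_order_id = {}
for row in raw_orders:
    first_by_order_id.setdefault(row["order_id"], row)

deduped_revenue = sum(row["amount"] for row in first_by_order_id.values())

print(naive_revenue, deduped_revenue)  # 2497 1898
```

Order 101 is counted twice in the naive sum, so the day's revenue reads 2497 instead of 1898.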

Fix workspace

Fix the PySpark code or write the production-safe approach. The browser checks the concepts, APIs, and trade-offs.
