PySpark Lab

Browser-based data engineering practice.

Practice Spark production fixes without paying for a cluster: inspect the data, repair the code, run a concept check, then compare with the model answer.

Beginner, PySpark, Deduplication, Idempotency

PySpark 1: Daily Rerun Created Duplicate Orders

The daily orders job failed after writing half the partition. The Airflow retry then processed the same input again, and the revenue dashboard double-counted a few orders.

Data engineer task

Rewrite the broken PySpark write so rerunning the same daily file does not duplicate records.

Expected outcome

A safe PySpark fix should deduplicate on a stable key (order_id or event_id), write only the affected partition, and avoid a blind append on reruns.

Sample production data

raw_orders

order_id	order_date	amount	ingest_run_id
101	2026-05-30	599	run_17
102	2026-05-30	1299	run_17
101	2026-05-30	599	run_retry_17
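
To make the symptom concrete, the short plain-Python sketch below reproduces the arithmetic on the three sample rows: order 101 appears once from run_17 and once from run_retry_17, so a naive sum counts its amount twice. The helper names revenue and dedupe_by_order_id are hypothetical and exist only for this illustration.

```python
# The three rows from the raw_orders sample.
raw_orders = [
    {"order_id": 101, "order_date": "2026-05-30", "amount": 599, "ingest_run_id": "run_17"},
    {"order_id": 102, "order_date": "2026-05-30", "amount": 1299, "ingest_run_id": "run_17"},
    {"order_id": 101, "order_date": "2026-05-30", "amount": 599, "ingest_run_id": "run_retry_17"},
]


def revenue(rows):
    """Sum the amount column."""
    return sum(row["amount"] for row in rows)


def dedupe_by_order_id(rows):
    """Keep the first row seen for each order_id."""
    first_seen = {}
    for row in rows:
        first_seen.setdefault(row["order_id"], row)
    return list(first_seen.values())


naive_revenue = revenue(raw_orders)                         # 599 + 1299 + 599 = 2497
deduped_revenue = revenue(dedupe_by_order_id(raw_orders))   # 599 + 1299 = 1898
print(naive_revenue, deduped_revenue)
```

The gap between the two totals is exactly the 599 from the retried copy of order 101, which is the over-count the dashboard showed.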

Fix workspace

Fix the PySpark code or write the production-safe approach. The browser checks the concepts, APIs, and trade-offs.
