The Data Foundry

Built by Data with Pranjal

Platform roadmap

Use The Data Foundry in the right order.

This is not a seven-day checklist or a race through random questions. It is one practical route from SQL fundamentals to production debugging and system design. Move forward when you can demonstrate the skill.

01

Practice until you have evidence

Complete a target number of labs instead of merely opening one question.

02

Build in layers

Use each skill in production scenarios before moving to architecture.

03

Review weak signals

Let your dashboard and failed attempts decide what you practice next.

Your learning sequence

Eight stages, one coherent journey

The targets are guidance, not gates. If you already know a skill, use the checkpoints to verify it and continue.

Stage 1: SQL

Build SQL foundations

Start with browser-based SQL practice so joins, aggregations, NULL handling, windows, and output grain become reliable.

Practice target

Complete 8-10 SQL labs before moving forward.

Ready to move on when you can

  • Filter and aggregate data at the correct grain
  • Use joins without duplicating or silently dropping rows
  • Handle NULL values, rankings, and deduplication
  • Explain why your output matches the business requirement
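
As a rough illustration of the join and grain checkpoints above, the sketch below uses Python's built-in sqlite3 module. The orders and payments tables and their values are hypothetical and exist only for this example. It shows how a join at the wrong grain can inflate a total while also dropping a row, and one possible way to avoid both.

```python
import sqlite3

# Hypothetical tables: one row per order, zero or more payments per order.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    CREATE TABLE payments (payment_id INTEGER PRIMARY KEY, order_id INTEGER, paid REAL);
    INSERT INTO orders VALUES (1, 10, 100.0), (2, 10, 50.0), (3, 20, 75.0);
    -- Order 1 is paid in two instalments; order 3 has no payment yet.
    INSERT INTO payments VALUES (1, 1, 60.0), (2, 1, 40.0), (3, 2, 50.0);
""")

# Naive: the join output is at payment grain, so order 1 is counted twice,
# and the inner join silently drops the unpaid order 3.
naive_total = conn.execute("""
    SELECT SUM(o.amount)
    FROM orders o
    JOIN payments p ON p.order_id = o.order_id
""").fetchone()[0]

# Safer: aggregate payments to one row per order, then LEFT JOIN so every
# order survives and the output stays at order grain.
safe_rows = conn.execute("""
    SELECT o.order_id, o.amount, COALESCE(p.total_paid, 0) AS total_paid
    FROM orders o
    LEFT JOIN (
        SELECT order_id, SUM(paid) AS total_paid
        FROM payments
        GROUP BY order_id
    ) p ON p.order_id = o.order_id
    ORDER BY o.order_id
""").fetchall()
safe_total = sum(amount for _, amount, _ in safe_rows)

print(naive_total, safe_total)
```

With this sample data the naive query reports 250.0 while the true order total is 225.0. Comparing row counts before and after a join is one simple way to catch this kind of fan-out.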

Stage 2: Python

Practice Python for data work

Use Python to transform records, handle files and nested data, and write clear logic that survives realistic edge cases.

Practice target

Complete 5-8 Python labs with all sample cases passing.

Ready to move on when you can

  • Work confidently with lists, dictionaries, strings, and records
  • Process JSON, CSV, and file-like data safely
  • Test empty, duplicate, malformed, and missing-value cases
  • Write readable functions instead of one-off scripts
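
A minimal sketch of what the edge-case checkpoints above might look like in practice. The function name parse_events, the id and amount fields, and the rule that a missing amount becomes 0.0 are all assumptions made for this example, not part of any lab.

```python
import json


def parse_events(lines):
    """Parse JSON-lines text into unique records keyed by 'id'.

    Returns (records, rejected). records keeps the first occurrence of each
    id; rejected collects (line_number, reason) for every skipped line.
    Assumes ids are hashable scalars and amounts are numeric when present.
    """
    records, rejected, seen = [], [], set()
    for line_number, raw in enumerate(lines, start=1):
        text = raw.strip()
        if not text:
            continue  # blank lines are ignored, not treated as errors
        try:
            event = json.loads(text)
        except json.JSONDecodeError:
            rejected.append((line_number, "malformed JSON"))
            continue
        if not isinstance(event, dict) or event.get("id") is None:
            rejected.append((line_number, "missing id"))
            continue
        if event["id"] in seen:
            rejected.append((line_number, "duplicate id"))
            continue
        seen.add(event["id"])
        # Assumed rule for this sketch: a missing or null amount becomes 0.0.
        amount = float(event.get("amount") or 0.0)
        records.append({"id": event["id"], "amount": amount})
    return records, rejected


sample = [
    '{"id": 1, "amount": 9.5}',
    "",
    '{"id": 1, "amount": 3}',
    "{bad json",
    '{"amount": 2}',
    '{"id": 2, "amount": null}',
]
records, rejected = parse_events(sample)
```

Returning the rejected lines with a reason, rather than dropping them silently, keeps the function testable against empty, duplicate, malformed, and missing-value inputs.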

Stage 3: PySpark

Debug PySpark pipelines

Move from syntax to production reasoning by fixing DataFrame logic, skew, rerun duplication, partitions, and small-file problems.

Practice target

Complete at least 5 PySpark production labs.

Ready to move on when you can

  • Use built-in DataFrame functions before Python UDFs
  • Reason about joins, shuffle, skew, and partition counts
  • Make reruns idempotent and safe
  • Explain the operational trade-off behind your fix
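
The idempotent-rerun checkpoint above can be sketched without a cluster. The example below is a plain-Python stand-in, not PySpark code: a dictionary plays the role of a partitioned table, and the function names and partition key are hypothetical.

```python
def append_run(table, partition_key, rows):
    """Non-idempotent write: every rerun appends the same rows again."""
    table.setdefault(partition_key, []).extend(rows)


def overwrite_partition(table, partition_key, rows):
    """Idempotent write: a rerun replaces only the partition it owns."""
    table[partition_key] = list(rows)


daily_rows = [{"order_id": 1}, {"order_id": 2}]

appended, overwritten = {}, {}
for _ in range(2):  # simulate the original run plus one retry
    append_run(appended, "2024-01-01", daily_rows)
    overwrite_partition(overwritten, "2024-01-01", daily_rows)

print(len(appended["2024-01-01"]), len(overwritten["2024-01-01"]))
```

After the retry, the appended table holds four rows for the day and the overwritten table still holds two. In a real Spark job, a similar effect is often achieved by overwriting only the affected partitions or by merging on a business key; which one fits depends on the table format and on late-arriving data.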

Stage 4: Airflow

Operate Airflow workflows

Learn to classify scheduling delays, retries, sensor pressure, backfill failures, and orchestration anti-patterns from real evidence.

Practice target

Complete 5 Airflow incident labs across different failure classes.

Ready to move on when you can

  • Separate scheduler, executor, worker, and downstream failures
  • Design retries, sensors, pools, and backfills safely
  • Keep heavy compute and large payloads outside Airflow
  • Explain idempotency, observability, and operational trade-offs

Stage 5: AWS

Make AWS platform decisions

Choose storage, compute, streaming, security, governance, and serving services from workload evidence rather than memorized definitions.

Practice target

Complete 6 AWS incidents covering at least 4 service areas.

Ready to move on when you can

  • Start with workload, scale, latency, and operating constraints
  • Compare the chosen service with its nearest alternative
  • Include IAM, networking, encryption, cost, and failure handling
  • Name monitoring signals that prove the design works

Stage 6: Scenario

Enter the Broken Pipeline Lab

Apply SQL, PySpark, orchestration, and data-quality skills to incidents that look like a data engineer's daily production work.

Practice target

Solve 6 production scenarios across at least 3 topics.

Ready to move on when you can

  • Attempt the diagnosis before opening hints
  • Run the corrected query or logic when execution is available
  • State the root cause and business impact clearly
  • Add monitoring, reconciliation, or prevention steps
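
One way the reconciliation step above could look is sketched below. The reconcile function, the order_id key, and the amount measure are hypothetical names chosen for this example; a real check would run against the actual source and target tables.

```python
def reconcile(source_rows, target_rows, key="order_id", measure="amount"):
    """Compare source and target on row count, measure total, and key coverage."""
    source_keys = [row[key] for row in source_rows]
    target_keys = [row[key] for row in target_rows]
    source_total = sum(row[measure] for row in source_rows)
    target_total = sum(row[measure] for row in target_rows)
    return {
        "row_count_diff": len(target_rows) - len(source_rows),
        "amount_diff": round(target_total - source_total, 2),
        "missing_keys": sorted(set(source_keys) - set(target_keys)),
        "duplicate_keys": sorted({k for k in target_keys if target_keys.count(k) > 1}),
    }


source = [
    {"order_id": 1, "amount": 10.0},
    {"order_id": 2, "amount": 20.0},
    {"order_id": 3, "amount": 5.0},
]
# The target lost order 3 and loaded order 2 twice, so the row counts still match.
target = [
    {"order_id": 1, "amount": 10.0},
    {"order_id": 2, "amount": 20.0},
    {"order_id": 2, "amount": 20.0},
]
report = reconcile(source, target)
```

In this sample the row counts agree, yet the amount total and the key checks both expose the problem, which is why a row count on its own is usually not enough evidence that a fix worked.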

Stage 7: System design

Develop system design judgment

Practice turning requirements into a dependable data platform and defending architecture choices instead of memorizing diagrams.

Practice target

Complete 3 system design cases and explain each aloud.

Ready to move on when you can

  • Clarify scale, latency, consumers, and data contracts
  • Choose batch, streaming, storage, and serving layers deliberately
  • Discuss cost, reliability, consistency, and complexity trade-offs
  • Include observability, replay, security, and failure recovery

Stage 8: Revision

Use feedback to close weak areas

Return to the dashboard, review weak skills and incomplete attempts, then repeat the platform loop with harder material.

Practice target

Review your dashboard after every 5 completed practices.

Ready to move on when you can

  • Re-attempt scenarios marked Weak or Okay
  • Compare scores and identify recurring gaps
  • Explain completed fixes in interview-ready language
  • Choose the next lab from evidence, not random browsing

Need help choosing your first lab?

Onboarding uses your current stage, goal, available time, and interview timeline to recommend where you should enter this roadmap.
