Welcome to the bootcamp
Over the next 18 sections you'll go from "I've heard of Spark" to confidently designing, building, tuning, and deploying real Spark pipelines that process data at massive scale. This is a hands-on, Python-first course built on Apache Spark 4.0 and PySpark — every concept is reinforced with runnable code, a guided exercise, or a lab, and the back half is built around full projects you can put on a résumé or GitHub.
You don't need any prior distributed-systems or Spark experience. If you're comfortable with Python and can read a SQL SELECT with a JOIN and a GROUP BY, you're ready.
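As a rough self-check of that SQL bar, here is a small sketch using Python's built-in sqlite3 module (not Spark). The `customers` and `orders` tables and their rows are made up purely for illustration; if you can predict what the query returns before running it, you're at the right level.

```python
import sqlite3

# In-memory database with two tiny, hypothetical tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, country TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
INSERT INTO customers VALUES (1, 'DE'), (2, 'US'), (3, 'US');
INSERT INTO orders VALUES (1, 1, 20.0), (2, 2, 5.0), (3, 3, 7.5), (4, 2, 10.0);
""")

# A SELECT with a JOIN and a GROUP BY: order count and revenue per country.
query = """
SELECT c.country, COUNT(*) AS n_orders, SUM(o.amount) AS revenue
FROM orders AS o
JOIN customers AS c ON c.id = o.customer_id
GROUP BY c.country
ORDER BY c.country
"""
rows = conn.execute(query).fetchall()
print(rows)  # [('DE', 1, 20.0), ('US', 3, 22.5)]
```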
The path you'll walk
The course is a ladder — each rung reuses and extends the skills from the last.
- Foundations — why Big Data exists, how clusters split work, and why Spark beats classic MapReduce.
- Core APIs — RDDs for low-level control, then the DataFrame API and Spark SQL for real analytics.
- Scale & speed — every major file format, joins and shuffles, performance tuning, and Structured Streaming.
- Production — the Delta lakehouse, MLlib, cloud and Kubernetes deployment, Kafka, orchestration, and five capstone projects.
How the course is organized
The roadmap moves deliberately from fundamentals to full applications:
| Phase | Sections | What you'll be able to do |
|---|---|---|
| Foundations | 1–4 | Reason about distributed systems and Spark's architecture, and stand up a Spark 4.0 environment |
| Core APIs | 5–8 | Build pipelines with RDDs, DataFrames, and Spark SQL |
| Data & performance | 9–13 | Read every major file format, tune joins and shuffles, stream data, and test pipelines |
| Production | 14–18 | Deploy to cloud/K8s, build a lakehouse, train ML models, and ship capstones |
What you'll need
You'll set all of this up in Section 4 — nothing to install yet, just know what's coming:
- Java 17 — Spark 4.0 runs on the JVM and requires a modern JDK.
- Python 3.11+ and PySpark 4 in a virtual environment.
- Docker, for a reproducible local Spark cluster you'll reuse throughout the course.
- A code editor such as VS Code, or Jupyter for interactive work.
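If you're curious how much of this is already on your machine, the sketch below is one way to look. It is not part of the course materials: the `check_prerequisites` helper is a hypothetical name, it only checks that `java` and `docker` are on your PATH and that a `pyspark` package is importable, and it does not verify the Java or PySpark version.

```python
import importlib.util
import shutil
import sys


def check_prerequisites(min_python=(3, 11)):
    """Return a dict mapping each prerequisite name to True (found) or False."""
    return {
        # Interpreter running this script meets the minimum version.
        "python": sys.version_info[:2] >= min_python,
        # Executables discoverable on PATH (presence only, not version).
        "java": shutil.which("java") is not None,
        "docker": shutil.which("docker") is not None,
        # PySpark importable in the current environment.
        "pyspark": importlib.util.find_spec("pyspark") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name:8} {'found' if ok else 'missing'}")
```

Anything reported as missing is expected at this point, since the actual setup happens in Section 4.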