₹199 ₹499 60% off
incl. GST

7-day money-back guarantee

Bestseller · Recently updated · Technology · Big Data

Apache Spark & Big Data Bootcamp: Process Data at Massive Scale with PySpark

Go from zero distributed-systems experience to building, optimizing, and deploying production Spark pipelines — RDDs, DataFrames, Spark SQL, Structured Streaming, the lakehouse, ML, and cloud, all on Spark 4.0 with PySpark.

4.2 (48 ratings) · Created by Ananya Iyer
Intermediate · 177 lessons · 40h 9m · Updated Jun 2026 · English

What you'll learn

  • Explain how Spark executes work across a cluster: driver, executors, jobs, stages, tasks, and the Spark Connect client/server model
  • Build data pipelines with both the RDD API and the DataFrame API, and know when to use each
  • Write production Spark SQL and read query plans to understand the Catalyst optimizer and ANSI mode
  • Read and write CSV, JSON, Parquet, ORC, Avro, JDBC, and semi-structured VARIANT data at scale
  • Master joins, shuffles, partitioning, bucketing, and data-skew handling, the core of Spark performance
  • Build fault-tolerant Structured Streaming pipelines with watermarking, windows, and exactly-once guarantees
  • Diagnose and fix performance problems with the Spark UI, explain plans, caching, and Adaptive Query Execution
  • Build ACID lakehouse tables with Delta Lake, train ML pipelines with MLlib, and deploy Spark to AWS, Azure, GCP, and Kubernetes

This course includes

  • 40h 9m of on-demand content
  • 177 lessons across 18 sections
  • Access on mobile and desktop
  • Certificate of completion
  • Lifetime access

Course content

18 sections · 177 lessons · 40h 9m

Section 1 · Course Introduction & the Big-Data Landscape
Welcome to the Bootcamp (preview available) 10 min
Why Apache Spark Wins (preview available) 14 min
The Spark 4.0 Ecosystem 14 min
Big Data Career Paths 12 min
Section Quiz · Course Introduction & the Big-Data Landscape 6 min

Section 2 · Foundations of Distributed Computing
The Scaling Problem 13 min
Distributed Systems Basics 14 min
Parallel Computing Concepts 14 min
MapReduce Fundamentals 15 min
Distributed Storage: HDFS vs Object Storage 14 min
The CAP Theorem 13 min
Batch vs Streaming Processing 14 min
Section Quiz · Foundations of Distributed Computing 6 min

Section 3 · Apache Spark Architecture
The Spark Ecosystem Overview 13 min
Cluster Architecture: Driver, Executors & Cluster Manager 16 min
Execution Flow: Jobs, Stages & Tasks 16 min
Spark Connect: The Client/Server Model 15 min
Spark Deployment Modes 14 min
Spark Memory Management 14 min
Fault Tolerance & Lineage 13 min
Section Quiz · Apache Spark Architecture 6 min

Section 4 · Environment Setup & Your First Job
Installing Java 17 10 min
Installing Spark 4.0 12 min
Python & PySpark Setup 12 min
Spark Local Setup & the Shell 13 min
Jupyter Notebook Setup 12 min
VS Code Setup 10 min
Dockerized Spark Environment 16 min
Spark Connect & Running Your First Job 18 min
Section Quiz · Environment Setup & Your First Job 6 min

Section 5 · PySpark Fundamentals
Introduction to PySpark 12 min
Creating a SparkSession 14 min
SparkContext & Entry Points 12 min
Reading Data 14 min
Writing Data 13 min
Basic Transformations 14 min
Actions 13 min
Lazy Evaluation & the DAG 15 min
Hands-On: Read → Transform → Write Pipeline 22 min
Section Quiz · PySpark Fundamentals 6 min

Section 6 · Resilient Distributed Datasets (RDDs)
What is an RDD? 13 min
Creating RDDs 13 min
Parallel Collections & Partitions 13 min
RDD Transformations 15 min
RDD Actions 13 min
Key-Value (Pair) RDDs 14 min
Pair RDD Operations 15 min
Partitioning & Persistence 15 min
When to Use RDDs vs DataFrames 13 min
Project: RDD Word Count 25 min
Section Quiz · Resilient Distributed Datasets (RDDs) 6 min

Section 7 · Spark DataFrames
Why DataFrames? 13 min
Schema Fundamentals 14 min
Creating DataFrames 13 min
DataFrame Operations 14 min
Filtering & Sorting 13 min
Aggregations 14 min
Window Functions 16 min
Handling Nulls (ANSI-Aware) 14 min
UDFs, Pandas UDFs & Arrow 16 min
Built-in Functions & Pandas API on Spark 14 min
Project: Analytics Notebook 28 min
Section Quiz · Spark DataFrames 6 min

Section 8 · Spark SQL Deep Dive
Spark SQL Architecture 13 min
Temporary & Global Views 14 min
Running SQL Queries 13 min
Joins in SQL 15 min
Subqueries 13 min
Window Analytics in SQL 14 min
ANSI SQL Mode (Default in 4.0) 15 min
The Catalyst Optimizer & Query Plans 16 min
Project: SQL Analytics 28 min
Section Quiz · Spark SQL Deep Dive 6 min

Section 9 · Working with Data Sources
CSV Files 13 min
JSON & Nested Data 14 min
Parquet: The Default Format 14 min
ORC & Avro 13 min
Semi-Structured Data & the VARIANT Type 15 min
JDBC Sources 14 min
The Python Data Source API 15 min
Partitioned Data & the Small-File Problem 15 min
Incremental Loads 14 min
Project: Multi-Format Ingestion Pipeline 28 min
Section Quiz · Working with Data Sources 6 min

Section 10 · Advanced Transformations: Joins, Shuffles & Skew
Joins Deep Dive 15 min
Broadcast Joins 14 min
Shuffle Operations 15 min
Repartition vs Coalesce 14 min
Bucketing 14 min
Skew Handling: Salting & AQE 16 min
Complex Aggregations: Rollup, Cube & Pivot 14 min
Project: Multi-Stage ETL Pipeline 28 min
Section Quiz · Advanced Transformations: Joins, Shuffles & Skew 6 min

Section 11 · Structured Streaming
Streaming Concepts & Event Time 14 min
From DStreams to Structured Streaming 13 min
The Micro-Batch Architecture 15 min
Streaming Sources 14 min
Sinks & foreachBatch 14 min
Watermarking & Late Data 15 min
Windowed Aggregations 15 min
Stateful Processing with transformWithState 16 min
Checkpointing & Exactly-Once 15 min
Project: Real-Time Dashboard 28 min
Section Quiz · Structured Streaming 6 min

Section 12 · Performance Tuning & Monitoring
Spark Performance Principles 13 min
Reading Execution Plans with explain 15 min
Caching & Persistence Levels 15 min
Shuffle Optimization 15 min
Broadcast Strategies 13 min
Adaptive Query Execution (AQE) 16 min
Resource Allocation 15 min
The Spark UI 15 min
Executor Monitoring & Structured Logging 15 min
Diagnosing Failures: OOM, GC & Shuffle 16 min
Project: Tune a Slow Job 28 min
Section Quiz · Performance Tuning & Monitoring 6 min

Section 13 · Testing & Data Quality
Why Test Data Pipelines? 13 min
Unit Testing PySpark with pytest 16 min
DataFrame Assertions: assertDataFrameEqual & chispa 15 min
Data Quality Checks 14 min
In-Pipeline Validation & Quarantine 15 min
Section Quiz · Testing & Data Quality 5 min

Section 14 · Running Spark on the Cloud & Kubernetes
Spark on AWS & EMR 15 min
Spark on Azure & Databricks 15 min
Spark on GCP & Dataproc 14 min
Object Storage Integration (S3 / ADLS / GCS) 15 min
Cloud Cost Optimization 15 min
Kubernetes Fundamentals for Spark 14 min
The Spark Operator 14 min
Deploying Spark on Kubernetes 16 min
Scaling & Monitoring on Kubernetes 14 min
Production Best Practices 14 min
Section Quiz · Running Spark on the Cloud & Kubernetes 6 min

Section 15 · Data Lakehouse & Open Table Formats
Warehouses, Lakes & the Lakehouse 14 min
Delta Lake Fundamentals 15 min
ACID via the Transaction Log 15 min
Time Travel 13 min
MERGE & Upserts 15 min
Slowly Changing Dimensions 15 min
Apache Iceberg vs Hudi vs Delta 15 min
Medallion Architecture: Bronze / Silver / Gold 14 min
Project: Versioned Lakehouse Table 28 min
Section Quiz · Data Lakehouse & Open Table Formats 6 min

Section 16 · Machine Learning & Graph Analytics with Spark
MLlib Overview 13 min
Feature Engineering 15 min
ML Pipelines 15 min
Classification 15 min
Regression 14 min
Clustering 14 min
Model Evaluation 14 min
Hyperparameter Tuning 15 min
Project: End-to-End ML Pipeline 26 min
Graph Analytics with GraphFrames 16 min
Section Quiz · Machine Learning & Graph Analytics with Spark 6 min

Section 17 · Kafka Integration, Orchestration & CI/CD
Kafka Fundamentals 14 min
Producers & Consumers 15 min
Kafka + Spark Architecture 14 min
Streaming Integration with Kafka 15 min
Real-Time ETL from Kafka 15 min
Why Orchestration? 13 min
Airflow DAGs for Spark 16 min
Packaging & spark-submit 15 min
CI/CD for Data Pipelines 15 min
Secrets & Security Basics 14 min
Section Quiz · Kafka Integration, Orchestration & CI/CD 6 min

Section 18 · Capstone Projects, Interview Prep & Career
Capstone 1: E-Commerce Analytics Platform 20 min
Capstone 2: Real-Time Fraud Detection 20 min
Capstone 3: Log Analytics Pipeline 20 min
Capstone 4: Data Lakehouse Implementation 20 min
Capstone 5: Kafka + Spark End-to-End 20 min
Interview Prep: Concepts & Architecture 16 min
Interview Prep: SQL, DataFrames & Tuning 16 min
Building a Data Engineering Portfolio 14 min
Spark Certifications & the 2026+ Roadmap 14 min
Course Wrap-Up & Congratulations 10 min
Section Quiz · Capstone Projects, Interview Prep & Career 6 min

Requirements

  • Comfort with Python — functions, comprehensions, and basic object-oriented code
  • Working knowledge of SQL (SELECT, JOIN, GROUP BY)
  • Basic command-line familiarity; no prior distributed-systems or Spark experience required
  • A computer that can run Java 17 and Docker (for the local Spark environment)

Description

Master Apache Spark from the ground up and learn how modern organizations process data at petabyte scale. This is a hands-on, Python-first bootcamp built on Apache Spark 4.0 and PySpark — every concept is reinforced with runnable code, a guided exercise, or a lab, and the back half is built around full projects you can put on a résumé or GitHub.

You'll start with the why — what Big Data actually is, how distributed computing works, and why Spark beats classic MapReduce — then build steadily through Spark's core abstractions and into production-grade data engineering.

What you'll learn

  • Foundations — distributed computing, clusters, partitioning, fault tolerance, and batch-vs-streaming trade-offs
  • Spark architecture — driver, executors, cluster managers, jobs/stages/tasks, and the modern Spark Connect client/server model
  • The core APIs — RDDs for low-level control, then the DataFrame API and Spark SQL for real analytics
  • Data sources — CSV, JSON, Parquet, ORC, Avro, JDBC, and semi-structured VARIANT data
  • Performance — joins, shuffles, partitioning, bucketing, skew handling, caching, and Adaptive Query Execution
  • Structured Streaming — event-time processing, watermarking, windows, stateful processing, and exactly-once guarantees
  • The lakehouse — Delta Lake ACID transactions, time travel, upserts, and the Bronze/Silver/Gold pattern
  • ML & graphs — MLlib pipelines and GraphFrames analytics
  • Production — testing and data quality, cloud deployment on AWS/Azure/GCP, Spark on Kubernetes, Kafka integration, and orchestration with Airflow + CI/CD

What you'll build

Section projects run throughout the course (RDD word count, analytics notebooks, multi-format ingestion, ETL pipelines, a real-time dashboard, a tuned job, a versioned Delta table, an ML pipeline), capped by five capstone walkthroughs: an e-commerce analytics platform, real-time fraud detection, a log-analytics pipeline, a full data lakehouse, and an end-to-end Kafka + Spark pipeline.

Who it's for

Aspiring and working Data Engineers, Analysts, Data Scientists, and ML Engineers who want real, production Spark skills. You need comfort with Python and basic SQL; no prior distributed-systems or Spark experience is required. You'll finish ready to design, build, test, optimize, deploy, and operate scalable Spark applications, and prepared for data-engineering interviews and certifications.

Your instructor

Ananya Iyer

Python & Big Data Engineer · 10 yrs · Data Engineering Lead, Tessellate

4.2 course rating · 3 courses

Ananya has built data platforms in Python for a decade, wrangling everything from gnarly ETL jobs to petabyte-scale Spark pipelines. She loves the craft of clean, testable Python and teaching the data-engineering fundamentals that survive whichever framework is trendy this year.

4.2 course rating · 48 ratings

Darrion Keeling
1 month ago

Solid introduction with great real-world framing. I learned plenty and finished motivated.

Stevie Robel
8 months ago

Top notch content and a wonderful teaching style. I would happily pay double for this.

Lonie Douglas
1 year ago

Loved every lesson. Concise, practical, and immediately applicable to my day-to-day work.

Xander Daniel Sr.
4 months ago

Phenomenal from start to finish. Practical, well-paced, and packed with real-world examples I could use immediately.

Frequently asked questions

Lifetime access: once you enroll, the course is yours to revisit forever. New revisions and bonus lessons are added at no extra cost.

Finish every lesson and you'll unlock a shareable certificate you can post on LinkedIn or include with job applications.

If the course isn't a fit, request a refund within 7 days of purchase — no questions asked.

Code, slides, and worksheets are downloadable on each lesson page. Videos stream from our CDN so you can watch on any device.

Each course states its level at the top of its page (this one is Intermediate). If you're comfortable with the prerequisites listed, you're ready to start.
