The Modern Data Stack · Data Engineering Bootcamp

🎬 The scene

Walk up to any data team's whiteboard and you'll see roughly the same drawing: a handful of boxes connected by arrows, data flowing left to right. The tool names in each box change from company to company, but the boxes themselves almost never do. Learn this one diagram and suddenly every job ad, every architecture review, and every "where does this fit?" conversation clicks into place.

Five boxes, one flow

A "data stack" sounds intimidating, but it's just the chain of systems data passes through on its way from being created to being useful. Strip away the brand names and almost every stack is the same five layers in the same order.

Sources. Where data is born: the app and website, third-party tools (payment processors, ad platforms), and uploaded files. ShopStream's app, marketing platform, and supplier spreadsheets all live here.
Operational databases. The live system that runs the product. When a customer places an order, it's written here right now, optimized for fast single-record reads and writes — not for big analytical questions.
ETL / ELT pipelines. The movers. They extract data from the sources, optionally transform (clean, reshape) it, and load it into the warehouse — on a schedule, automatically. This is the heart of the data engineer's job.
Data warehouse. A separate database built for analysis, not for running the app. It holds history from all sources together and is tuned to crunch millions of rows for one big question.
BI & analytics tools. The dashboards, reports, and notebooks where humans finally read the data and make decisions. The CEO's revenue chart lives here.

Data flows left to right through these boxes, and your career follows the same line. This course teaches the boxes roughly in order: SQL and the database (Sections 4–6), warehousing (Section 10), pipelines and ETL/ELT (Sections 8–9), and analytics on top (Sections 11–12). When you can point at any tool — Fivetran, Airflow, dbt, Snowflake, Power BI — and say which box it lives in, you've understood the stack better than most beginners ever do.

The modern data stack — sources → databases → ETL/ELT → warehouse → BI. The tool names change; the boxes rarely do.

🧠 Think of it like…

A restaurant supply chain. Sources are the farms. The operational database is the delivery truck handling today's orders. ETL/ELT is the loading dock that receives, inspects, and stocks everything. The warehouse is the walk-in cold room where all ingredients live together. And BI tools are the plated dishes the diners actually see. Same ingredients, transformed step by step into something you can consume.

Database vs. warehouse — the one distinction beginners miss

This trips up nearly everyone, so let's nail it now. Both a database and a warehouse store tables of data, so why have two? Because they're tuned for opposite jobs. An operational database (sometimes called OLTP — online transaction processing) is built to handle thousands of tiny, fast operations: "insert this one order," "update this customer's email." It must never slow down the live app.

A data warehouse (OLAP — online analytical processing) is built for the opposite: a few enormous questions that scan millions of rows, like "total revenue by country by month for the last three years." If you ran that query directly against the live app's database, you could slow checkout to a crawl for real customers. So we copy the data into a separate warehouse and analyze it there, safely. That copy-and-separate step is exactly what pipelines do — and it's why the data engineer sits squarely in the middle of the stack. We'll go deep on this contrast in Section 10; for now, just remember: database runs the business, warehouse analyzes the business.

Walkthrough: ShopStream

Let's map ShopStream directly onto the five boxes. The very same order from Lesson 1.1 passes through every layer. Here's a peek at what that looks like as data moves from the live app database into the warehouse.

-- BOX 2: in ShopStream's operational database, optimized for the live app.
-- A single, fast lookup of one order:
SELECT order_id, customer_id, status
FROM orders
WHERE order_id = 90118;

-- BOX 4: in the warehouse, the same data is reshaped for analysis.
-- A big rollup across millions of rows that we'd never run on the live app:
SELECT c.country,
       COUNT(*)                              AS num_orders,
       SUM(oi.quantity * oi.unit_price)      AS revenue
FROM orders o
JOIN customers   c  ON c.customer_id = o.customer_id
JOIN order_items oi ON oi.order_id   = o.order_id
WHERE o.status = 'completed'
GROUP BY c.country
ORDER BY revenue DESC;

The first query is a quick, surgical lookup — the kind the live app fires constantly. The second scans and groups everything — the kind a dashboard needs but a live app must be protected from. A pipeline (Box 3) is what carried the data from the first world to the second. The result a BI tool (Box 5) would chart:

Warehouse rollup that feeds the ShopStream revenue dashboard
country	num_orders	revenue
USA	4,180	612,440
Germany	2,905	388,210
India	2,613	301,775
Nigeria	1,402	176,900

⌨️ Hands-on Exercise

On paper (or in a note), draw ShopStream's data flow as five boxes with arrows. For each box, label both the layer name and which course section teaches it.

Expected: five boxes left to right — Sources → Database → ETL/ELT → Warehouse → BI — each tagged with a real ShopStream example and a section number.

Show solution

Sources              →  Database (OLTP)   →  ETL/ELT pipeline  →  Warehouse (OLAP)  →  BI / dashboard
ShopStream app,         live orders &        loads & cleans       all history,         CEO revenue
marketing, supplier     customers table      on a schedule        analysis-ready       chart
CSVs
[Section 2 — data    [Sections 4–6 — SQL] [Sections 8–9 —      [Section 10 —        [Sections 11–12 —
 types & sources]                          ETL/ELT & pipelines] warehousing]         analytics & BI]

Exact section numbers don't have to be perfect — the win is seeing every tool you'll ever meet snap into one of these five boxes.

🌍 In the wild

A Netflix-scale company and a ten-person startup run the same five boxes — they just swap the tools. The startup might use Postgres, a few Python scripts, and a Google Sheet; the giant uses Kafka, Spark, Snowflake, and Tableau. Recognizing the shared shape is why a data engineer can move between companies: the stack is familiar even when the logos are new.

⚠️ Common traps

Thinking you need every fancy tool. The boxes matter, not the brand names. A solid stack can be plain PostgreSQL and a scheduled Python script.
Conflating "database" and "warehouse." Same idea (tables of data), opposite jobs — one runs the app, the other analyzes it. (We return to this in Section 10.)
Picturing data flowing in random directions. In a healthy stack it flows one way: sources → BI. When it loops back chaotically, that's usually a design smell.