How This Course Works + Lab Setup · Database Internals & Query Optimization: 100x Faster Queries

Stop guessing. Start measuring.

You've probably added an index because "indexes make queries faster," or rewrote a subquery as a JOIN because someone said that's faster, and then … you weren't sure. Maybe it was a bit quicker? Maybe the same? This course is built around a different approach: you measure before you change anything, and you measure after. Everything else is a hypothesis.

The good news is that PostgreSQL and MySQL both ship with world-class instrumentation built right in. EXPLAIN ANALYZE gives you actual row counts, actual timing per operator, and actual I/O statistics. pgbench and sysbench drive repeatable load so you can benchmark, not just time. Together these tools make the scientific method practical on a laptop.

Key idea

Every lab in this course follows the same four steps: baseline → diagnose → change one thing → re-measure. If you only remember one thing from this lesson, let it be those four steps — in that exact order.

The two sample datasets

All labs use one of two datasets, chosen for what they stress-test. You'll meet them in Lesson 2 and use them throughout the course.

Dataset A — Small OLTP schema (`shopdb`)

A normalized e-commerce database: customers, orders, order_items, and products. It's small enough that you can reason about every row, which makes it perfect for checking correctness — confirming that a plan change returns the same results, understanding how MVCC snapshot isolation affects concurrent reads, or tracing what VACUUM actually cleaned up.

Dataset B — Large, skewed dataset (`eventsdb`)

One hundred million event rows, with a deliberately skewed distribution on the user_id and event_type columns. A handful of users generate 40% of the events; most generate none. This skew is intentional — it's the kind of real-world distribution that breaks the optimizer's independence assumption and produces wildly wrong cardinality estimates. You'll use this dataset whenever you need to benchmark something at scale or deliberately mislead the query planner.

Note

Loading scripts for both datasets are provided in the course repository. They run on both PostgreSQL 16 and MySQL 8 and take roughly 3–8 minutes on a laptop depending on disk speed. You do not need to load them before this lesson's lab — that's part of the lab itself.

The standard optimization workflow

Every optimization you do in this course — and every one you'll do in production — follows the same cycle. Here it is visually:

The four-step optimization cycle used in every lab. The dashed feedback loop reminds you to iterate — each pass reveals the next bottleneck.

The most violated rule in this diagram is step 3: "change one thing." When you change two things at once and performance improves, you don't know which change helped. Worse, one change may have helped and the other hurt — and you've lost both signals. Discipline here pays for itself every time.

Gotcha

Always warm the database's buffer pool between runs and always measure the same cache state. A cold-cache run compared to a warm-cache run can differ by 100×. That's not an optimization — it's a measurement error. In the labs, you'll run each query at least five times after a warm-up pass and record p50/p95/p99, not the average.

Performance-measurement tools

EXPLAIN — the estimated plan

EXPLAIN shows you what the optimizer plans to do, without executing the query. It's fast and safe on any table size. Use it when you want to see whether the planner is choosing the right index or the right join strategy — before you commit to running a long query.

EXPLAIN ANALYZE — the actual plan

EXPLAIN ANALYZE runs the query and annotates the plan with actual row counts and timing per operator. The gap between "rows estimated" and "rows actual" is often the most important number on the page — a large discrepancy tells you that statistics are stale or the distribution is skewed in a way the optimizer didn't account for. In PostgreSQL you'll add BUFFERS to also see page reads and cache hits.

-- PostgreSQL
-- Estimated plan only (safe on any table size)
EXPLAIN
SELECT o.id, c.name, o.total
FROM orders o
JOIN customers c ON c.id = o.customer_id
WHERE o.status = 'shipped'
  AND o.created_at >= '2024-01-01';

-- Actual plan with timing and I/O statistics
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT o.id, c.name, o.total
FROM orders o
JOIN customers c ON c.id = o.customer_id
WHERE o.status = 'shipped'
  AND o.created_at >= '2024-01-01';

-- MySQL 8
-- Estimated plan (classic tabular format)
EXPLAIN
SELECT o.id, c.name, o.total
FROM orders o
JOIN customers c ON c.id = o.customer_id
WHERE o.status = 'shipped'
  AND o.created_at >= '2024-01-01';

-- Actual plan with timing (MySQL 8.0.18+)
EXPLAIN ANALYZE
SELECT o.id, c.name, o.total
FROM orders o
JOIN customers c ON c.id = o.customer_id
WHERE o.status = 'shipped'
  AND o.created_at >= '2024-01-01';

Note

MySQL's classic EXPLAIN output is a table with columns like type, key, rows, and Extra. PostgreSQL's output is a nested tree. MySQL 8 also supports FORMAT=JSON and FORMAT=TREE for a tree view closer to PostgreSQL's style. Neither format is better — you'll get comfortable with both in Section 6.

pgbench — repeatable load for PostgreSQL

pgbench is bundled with every PostgreSQL installation. It drives a TPC-B-like workload (or any SQL you write) at a specified number of clients and transactions, then reports transactions per second (TPS) and latency percentiles. You'll use it whenever you need to compare two configurations under sustained load rather than a single-query timer.

# pgbench — initialize and run
# Initialize the built-in benchmark schema at scale factor 10 (~1.5 million rows)
pgbench -i -s 10 shopdb

# Run for 30 seconds with 4 clients, 2 threads — report TPS and latency
pgbench -c 4 -j 2 -T 30 -P 5 shopdb

# Run a custom script instead of the built-in workload
pgbench -c 4 -j 2 -T 30 -f my_query.sql shopdb

sysbench — repeatable load for MySQL (and PostgreSQL)

sysbench is the standard benchmarking tool for InnoDB. It ships with a rich set of OLTP scripts — point lookups, range scans, read/write mixes — and can be pointed at PostgreSQL too. You'll use it in the MySQL-specific labs to run a controlled workload and measure the impact of buffer-pool sizing, index changes, and schema design choices.

# sysbench — initialize and run (MySQL)
# Prepare: create the sysbench tables (1 million rows)
sysbench oltp_read_write \
  --db-driver=mysql \
  --mysql-host=127.0.0.1 \
  --mysql-user=root \
  --mysql-password=secret \
  --mysql-db=eventsdb \
  --tables=4 \
  --table-size=1000000 \
  prepare

# Run: 60-second mixed read/write with 8 threads
sysbench oltp_read_write \
  --db-driver=mysql \
  --mysql-host=127.0.0.1 \
  --mysql-user=root \
  --mysql-password=secret \
  --mysql-db=eventsdb \
  --tables=4 \
  --table-size=1000000 \
  --threads=8 \
  --time=60 \
  run

Reading execution plans — the absolute basics

You'll spend a full section (Section 6) becoming fluent in plan reading. For now, you need just enough to get through the early labs. Here are the three numbers that matter most on every plan node.

1. Estimated rows vs actual rows

Every plan node shows rows=N (estimated) and, after ANALYZE, actual rows=M. When N and M differ by more than 10×, the optimizer was working with bad information — stale statistics, correlated columns, or a skewed distribution. That bad estimate ripples up the tree and can cause the optimizer to choose the wrong join algorithm or the wrong index.

2. Cost

PostgreSQL shows costs as cost=startup..total in abstract units (roughly 8 KB page reads). The optimizer picks the plan with the lowest total cost. The absolute number isn't useful on its own — comparing two plans' costs for the same query is what matters.

3. Actual time

After ANALYZE, each node shows actual time=first_row..last_row ms. The most expensive node (highest time) is your starting point. Everything else is secondary.

Analogy

Think of an execution plan like a receipt from a complicated restaurant order. The total at the bottom is what you care about — but when it's unexpectedly high, you scan line by line to find the one item that cost ten times more than it should have. Finding the one outrageously expensive node in a plan is exactly the same process.

In PostgreSQL, plan nodes are nested and executed from the innermost outward. The bottom node runs first, feeds its output upward, and the topmost node returns your result set. In MySQL's tabular output, rows are listed in roughly the order the optimizer processes them — the row with id=1 drives the join; higher IDs are inner tables.

Lab setup: Docker (recommended)

The fastest path to a working lab is Docker. The commands below give you both databases running on their default ports in under five minutes, with data directories mounted on the host so they survive container restarts.

# PostgreSQL 16 — Docker
# Pull and start PostgreSQL 16
docker run -d \
  --name pg16 \
  -e POSTGRES_PASSWORD=secret \
  -e POSTGRES_DB=shopdb \
  -p 5432:5432 \
  -v pg16_data:/var/lib/postgresql/data \
  postgres:16

# Verify it's running
docker exec -it pg16 psql -U postgres -c "SELECT version();"

# Connect from the host (requires psql installed locally)
psql -h 127.0.0.1 -U postgres -d shopdb

# MySQL 8 — Docker
# Pull and start MySQL 8
docker run -d \
  --name mysql8 \
  -e MYSQL_ROOT_PASSWORD=secret \
  -e MYSQL_DATABASE=shopdb \
  -p 3306:3306 \
  -v mysql8_data:/var/lib/mysql \
  mysql:8

# Verify it's running
docker exec -it mysql8 mysql -uroot -psecret \
  -e "SELECT version();"

# Connect from the host (requires mysql client locally)
mysql -h 127.0.0.1 -uroot -psecret shopdb

Native installation (alternative)

If you prefer a native install: PostgreSQL packages are at postgresql.org/download and MySQL at dev.mysql.com/downloads. Both offer installers for macOS, Windows, and all major Linux distributions. Accept the defaults — you can tune configuration later in Section 3 and Section 14.

Loading the sample datasets

The course repository includes two loading scripts: load-shopdb.sql (small OLTP) and load-eventsdb.sql (large, skewed). Both are idempotent — running them twice won't break anything. After loading, run the verification queries below to confirm the row counts are correct.

# PostgreSQL — load & verify
# Load both datasets
psql -h 127.0.0.1 -U postgres -d shopdb \
  -f load-shopdb.sql

psql -h 127.0.0.1 -U postgres -d eventsdb \
  -f load-eventsdb.sql

# MySQL — load & verify
# Load both datasets
mysql -h 127.0.0.1 -uroot -psecret shopdb \
  < load-shopdb.sql

mysql -h 127.0.0.1 -uroot -psecret eventsdb \
  < load-eventsdb.sql

-- Verify row counts (both engines)
-- shopdb: small OLTP
SELECT 'customers'  AS tbl, COUNT(*) FROM customers
UNION ALL
SELECT 'orders',          COUNT(*) FROM orders
UNION ALL
SELECT 'order_items',     COUNT(*) FROM order_items
UNION ALL
SELECT 'products',        COUNT(*) FROM products;

-- eventsdb: large, skewed
SELECT COUNT(*) FROM events;   -- expect ~100 000 000

Your first EXPLAIN ANALYZE — both engines side by side

Once the datasets are loaded, run your first annotated plan on both engines. Notice what's the same (node types, the concept of estimated vs actual rows) and what's different (output format, I/O statistics, default join strategy). You'll spend Section 6 decoding every line — for now, just observe.

-- PostgreSQL 16
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT c.name,
       COUNT(o.id)        AS order_count,
       SUM(o.total)       AS revenue
FROM customers c
JOIN orders o ON o.customer_id = c.id
WHERE o.created_at >= '2024-01-01'
GROUP BY c.id, c.name
ORDER BY revenue DESC
LIMIT 10;

-- MySQL 8
EXPLAIN ANALYZE
SELECT c.name,
       COUNT(o.id)        AS order_count,
       SUM(o.total)       AS revenue
FROM customers c
JOIN orders o ON o.customer_id = c.id
WHERE o.created_at >= '2024-01-01'
GROUP BY c.id, c.name
ORDER BY revenue DESC
LIMIT 10;

Copy the output of each into a text file — that's your baseline. Every optimization you make in this course will be compared against a baseline exactly like this one.

Key idea

A baseline is a contract with your future self. When you come back after making three changes, a recorded baseline is the only way to know whether you actually improved anything — or just got lucky with a warmer cache.

How the rest of the course is structured

Every lesson follows the same shape you're experiencing right now: a hook, a conceptual explanation, at least one dual-engine SQL example, a diagram or interactive widget, and a lab that ends with a number you record. The labs are cumulative — the schema you set up today is the one you'll be tuning in Sections 4, 6, 7, and beyond.

The course is organized into nine parts, following the exact path a query travels through the database engine. You start with storage bytes (Part II), climb through indexing and query execution (Parts III–IV), tackle joins and advanced optimization (Part V), descend into transactions and durability (Part VI), scale up (Part VII), and finish with engine-specific tuning and a capstone project (Parts VIII–IX). The bonus covers modern engines — LSM-trees, columnar stores, and cloud-native architectures — for those who want to go further.

You don't need to read sections in strict order, but the arc is deliberate. Concepts build on each other: understanding buffer pools (Section 3) makes index efficiency (Section 4) intuitive; understanding cardinality estimation (Section 7) makes the execution-plan labs (Section 6) much more productive.

Exercise

Install PostgreSQL 16 and MySQL 8 using Docker (or a native installer). Confirm both are running by executing SELECT version(); on each. Record the exact version strings — you'll want them when reading documentation.
Load both sample datasets (shopdb and eventsdb) on both engines. Run the verification queries to confirm row counts match expectations. If anything is off, re-run the loading script — they're idempotent.
Run your first EXPLAIN ANALYZE on the top-10-customers-by-revenue query shown above, on both PostgreSQL and MySQL. Save the full output as baseline-pg.txt and baseline-mysql.txt. Identify: (a) which node has the highest actual time, and (b) whether estimated rows and actual rows match on every node.
Compare the output formats. PostgreSQL shows a tree of operators; MySQL's classic EXPLAIN shows a table, and EXPLAIN ANALYZE shows a tree. Find the equivalent of PostgreSQL's actual rows in MySQL's output. Write one sentence describing the biggest structural difference you notice.
Run the query five times after a warm-up pass and record the execution time each run (use \timing in psql or SET profiling=1; in MySQL). Calculate p50 and p99 from your five samples. Notice whether they differ — and why they would.

Stop guessing. Start measuring.

The two sample datasets

Dataset A — Small OLTP schema (shopdb)

Dataset B — Large, skewed dataset (eventsdb)

The standard optimization workflow

Performance-measurement tools

EXPLAIN — the estimated plan

EXPLAIN ANALYZE — the actual plan

pgbench — repeatable load for PostgreSQL

sysbench — repeatable load for MySQL (and PostgreSQL)

Reading execution plans — the absolute basics

1. Estimated rows vs actual rows

2. Cost

3. Actual time

Lab setup: Docker (recommended)

Native installation (alternative)

Loading the sample datasets

Your first EXPLAIN ANALYZE — both engines side by side

How the rest of the course is structured

Dataset A — Small OLTP schema (`shopdb`)

Dataset B — Large, skewed dataset (`eventsdb`)