Introduction to Database Performance · Database Internals & Query Optimization: 100x Faster Queries

The query that took 30 seconds

Imagine this: it's 9 AM on a Monday, your on-call pager fires, and your company's dashboard has been timing out for twenty minutes. Half the customer-support team can't load their queue. You SSH into the database server, run SHOW PROCESSLIST, and find a single query — a straightforward SELECT with two joins — that has been running for 31 seconds. On a table with 4 million rows.

You add one index. The same query now takes 48 milliseconds. The dashboard comes back. The pager stops. Total fix time: four minutes.

That 640× improvement wasn't magic. It was the predictable result of understanding how the database physically finds data. This course teaches you exactly that understanding — so the next time it happens, you're the one who fixes it in four minutes.

Key idea

Performance problems in databases are almost never random. They follow from the physical structure of how data is stored and accessed. Once you understand that structure, the fix is usually obvious — and measurable before you even run it.

To make this concrete, here is the before-and-after benchmark from that scenario. The widget below lets you see the scale of the improvement at a glance:

Why queries become slow

A query becomes slow for one of a small number of root causes. Understanding this list is the foundation of all systematic diagnosis — and diagnosis is the skill this entire course builds.

The database has to read too much data

The most common cause of slow queries is also the simplest: the engine is reading far more data than it needs to answer the question. Without an appropriate index, finding rows that match a predicate requires reading every single page in the table — a sequential scan. On a 4-million-row table with 8 KB pages, that can easily mean reading 500 MB off disk just to return 12 rows.

Analogy

Imagine you're looking for a specific recipe in a cookbook with no index and no table of contents. You have to read every page from the beginning until you find it. A database without the right index does exactly that — every time. An index is the table of contents. It tells the engine which page to open directly.

The database is doing unnecessary work

Sometimes the query itself asks for more than it needs: SELECT * when you need two columns, an outer join where an inner join would do, or a correlated subquery that re-executes for every outer row. The engine executes exactly what you write — it cannot read your mind.

Resource contention is serializing work

Even a perfectly tuned query runs slowly when it has to wait for a lock held by another transaction, when memory is exhausted and intermediate results spill to disk, or when the CPU is saturated by dozens of simultaneous queries. Contention turns a fast individual operation into a slow queued one.

Bad cardinality estimates send the optimizer the wrong way

The query optimizer picks an execution plan based on statistics — its best guess about how much data each operation will touch. When those statistics are stale or misleading (skewed data, correlated columns), the optimizer may pick a nested-loop join over a hash join for a million-row table. The plan looks reasonable on paper and is catastrophic in practice.

Latency vs throughput — and why they're not the same thing

These two words appear constantly in performance discussions, and they mean different things. Confusing them leads to optimizing the wrong dimension — sometimes making one dramatically worse while improving the other.

Latency is how long a single operation takes from start to finish. When your API endpoint returns a user's order history, the latency is the wall-clock time between "query sent" and "last row received." Users feel latency directly. It is measured in milliseconds (or, for bad queries, in seconds).

Throughput is how many operations the system can complete per unit of time — queries per second, rows inserted per minute, transactions per hour. High throughput means the system can keep up with demand. It is a measure of capacity, not speed for any individual request.

Gotcha

Optimizing for throughput can increase latency. If you batch 1,000 small writes into a single transaction to maximize insert throughput, the first write in the batch waits until all 1,000 are ready — latency goes up even though throughput does too. Always know which dimension your users or your SLA actually cares about before you start tuning.

The relationship between latency and throughput is governed by Little's Law: the average number of requests in the system equals the arrival rate multiplied by the average time each spends in the system. Double the latency, and at the same arrival rate, you double the queued-up requests — which eventually exhausts threads, connections, and memory.

The four bottlenecks: CPU, memory, disk, network

Every slow database operation is ultimately bottlenecked on one of four physical resources. Identifying which one is the first step of any real diagnosis.

The four hardware bottleneck types, their typical symptoms, and first-line remedies. Most slow queries are disk I/O–bound — which is why indexes matter so much.

The critical insight is that these bottlenecks form a hierarchy of cost. RAM access is roughly 100,000× faster than a spinning-disk seek. An SSD narrows that gap significantly but is still 100–1,000× slower than RAM. A single cache miss that forces a disk read can cost more time than running a tight CPU loop for an entire millisecond. This is the fundamental reason that getting data into RAM — and keeping it there — is the single biggest lever in database performance.

OLTP vs OLAP — two completely different problems

The optimal database configuration for an e-commerce checkout flow is almost the opposite of the optimal configuration for a business-intelligence dashboard. Understanding this split is essential: a technique that helps one workload can actively hurt the other.

OLTP — Online Transaction Processing

OLTP workloads are characterized by many small, short-lived transactions that touch a handful of rows each. Think: place an order, update a user's address, look up a product by SKU. The key properties are:

High concurrency — hundreds or thousands of sessions running simultaneously
Point lookups and narrow ranges — you need row 4,827,391, not a summary of all rows
Low latency is the critical metric — users are waiting in real time
Write-heavy — lots of INSERTs, UPDATEs, and DELETEs alongside reads

OLTP optimization focuses on: efficient indexes for point lookups, minimal lock contention, fast commit paths (WAL tuning), connection pooling, and keeping frequently-accessed data in the buffer pool.

OLAP — Online Analytical Processing

OLAP workloads are the opposite: a few complex queries that aggregate millions or billions of rows. Think: "total revenue by product category for Q3 across all regions." The key properties are:

Low concurrency — a handful of analysts, not thousands of customers
Full scans and large aggregations — you need to touch most of the table
Throughput over latency — a query taking 10 seconds is acceptable if it was 5 minutes before
Read-heavy — rarely writes, frequently reads enormous datasets

OLAP optimization focuses on: columnar storage (reading only the columns you need), vectorized execution (processing thousands of values at once), partitioning (skipping irrelevant data entirely), and parallel query execution.

Note

Most production databases serve a mixed workload — OLTP for the application and a growing OLAP demand from analytics. This is called HTAP (Hybrid Transactional/Analytical Processing). The practical answer is usually to separate the workloads: a replica or a dedicated analytical store for reporting, keeping OLTP performance clean. Sections 13 and 15 cover this in depth.

Measuring performance correctly

You cannot improve what you do not measure. But measuring badly is almost worse than not measuring at all — it gives you false confidence. Two measurement traps destroy more optimization efforts than almost anything else.

The warm vs cold cache trap

Run a query for the first time and the database reads data from disk — every page is a cache miss. Run it immediately again and the data is in the buffer pool — nearly every read is a cache hit. The two runs can differ by 10× or more, and they tell you completely different things.

When you're tuning a query that runs repeatedly in production (the common case), you want to measure with a warm cache: prime the buffer pool with a few dummy runs, then measure. When you're stress-testing worst-case cold-start behavior, you want to flush the cache first. Know which you need before you start.

-- PostgreSQL — flush OS cache & reset shared_buffers
-- Discard the OS page cache (run as root in the shell, not SQL):
-- echo 3 > /proc/sys/vm/drop_caches

-- In psql: discard the shared_buffers for a specific table
-- (no built-in flush — restart pg with small shared_buffers for cold test)

-- For warm-cache timing, run the query 3+ times first, then measure:
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT customer_id, SUM(amount)
FROM   orders
WHERE  created_at >= '2024-01-01'
GROUP  BY customer_id;

-- MySQL — flush tables & reset buffer pool
-- Flush table data from buffer pool (simulates cold cache):
FLUSH TABLES orders;
-- or globally: SET GLOBAL innodb_buffer_pool_dump_now = ON;

-- For warm-cache timing: run query 3+ times, then time it:
EXPLAIN ANALYZE
SELECT customer_id, SUM(amount)
FROM   orders
WHERE  created_at >= '2024-01-01'
GROUP  BY customer_id;

The average lies to you

The arithmetic mean is a terrible statistic for query latency. Here's why: imagine 99 queries each complete in 50 ms, and one query — the one that hit a cold page and waited for a lock — takes 5,000 ms. The average is 99.5 ms. You'd conclude your queries are fast. Meanwhile, 1% of your users are waiting 5 full seconds.

The right answer is percentiles:

p50 (median) — the typical experience: half of requests are faster, half are slower
p95 — 95% of requests are faster than this; the "high end of normal"
p99 — 99% of requests are faster than this; this is what your slowest 1% of users experience
p99.9 — often called the "tail latency"; critical for SLA-bound services

Key idea

Report p50, p95, and p99. Report them separately for warm-cache and cold-cache conditions. If your p99 is more than 5–10× your p50, you have tail latency hiding behind a deceptively good average — and your worst-affected users will notice before your metrics do.

Collecting percentile data in both engines

-- PostgreSQL — pg_stat_statements percentile query
-- Enable pg_stat_statements in postgresql.conf:
-- shared_preload_libraries = 'pg_stat_statements'

-- After load, inspect per-query statistics:
SELECT
  LEFT(query, 80)                          AS query_snippet,
  calls,
  ROUND(mean_exec_time::numeric, 2)        AS avg_ms,
  ROUND(min_exec_time::numeric, 2)         AS min_ms,
  ROUND(max_exec_time::numeric, 2)         AS max_ms,
  ROUND(stddev_exec_time::numeric, 2)      AS stddev_ms
FROM   pg_stat_statements
ORDER  BY mean_exec_time DESC
LIMIT  10;

-- MySQL — performance_schema query digest
-- Enable the statement summary (on by default in MySQL 8):
-- [mysqld] performance_schema = ON

-- Inspect the slowest queries by average latency:
SELECT
  LEFT(DIGEST_TEXT, 80)                    AS query_snippet,
  COUNT_STAR                               AS calls,
  ROUND(AVG_TIMER_WAIT / 1e9, 2)          AS avg_ms,
  ROUND(MIN_TIMER_WAIT / 1e9, 2)          AS min_ms,
  ROUND(MAX_TIMER_WAIT / 1e9, 2)          AS max_ms
FROM   performance_schema.events_statements_summary_by_digest
ORDER  BY AVG_TIMER_WAIT DESC
LIMIT  10;

Notice that both engines give you min, max, and average — but not percentiles directly. For true percentile measurement, use a benchmarking tool like pgbench (PostgreSQL) or sysbench (MySQL) and collect the histogram output they produce. Lesson 1.3 covers the full toolchain.

Real-world performance incidents — what actually goes wrong

Theory is more useful when anchored to the kinds of failures that appear in real production systems. Here are four patterns you will almost certainly encounter — and that this course equips you to diagnose and fix.

The missing index at 3 AM

A new feature ships with a query against a column that was never indexed. At low traffic, it's unnoticeable — 50 ms on a table with 100K rows. Six months later, the table has 10 million rows, and the same sequential scan takes 8 seconds. This is the most common production incident, and it's entirely preventable with systematic index design (Sections 4 and 6).

The optimizer chose wrong

Statistics grew stale after a large data migration. The optimizer thinks a table has 50,000 rows; it actually has 50 million. It chooses a nested-loop join, which is catastrophic at that scale. The fix: update statistics and understand cardinality estimation (Section 7).

Lock contention cascade

A long-running report query holds a lock on a table. Application code tries to write to the same table — and waits. Those writers pile up. Within 30 seconds, connection limits are hit, and the application reports database unavailability even though no query is individually broken. The fix: isolation levels, lock granularity, and MVCC (Sections 11–12).

The working set outgrew RAM

A database ran fine for years. Then a new client segment started querying historical data that had always been cold. The buffer pool now couldn't fit the working set, hit rates dropped from 99% to 60%, and every query slowed down. The fix: understand buffer pool sizing and eviction (Section 3).

What "100× faster" actually means

When someone says they made a query "100× faster," it sounds like marketing — but in databases, these numbers are real and achievable because the range of possible implementations spans many orders of magnitude.

Consider the data access costs at each layer of the storage hierarchy. A CPU register access takes fractions of a nanosecond. An L1 cache hit takes ~1 ns. An L3 cache hit takes ~10 ns. A RAM access takes ~100 ns. An SSD read takes ~100 µs (100,000 ns). A spinning disk seek takes ~10 ms (10,000,000 ns). That last number is 10 million times the first.

A query that does 1,000 disk seeks instead of 10 is 100× slower — not because of bad code, but because of bad data access patterns. Replacing a full table scan (1 million page reads) with an index seek (3 page reads) is a 330,000× reduction in I/O. The practical speedup is limited by other factors, but 100× improvements are commonplace when moving from the wrong data structure to the right one.

You will build your own latency intuition throughout this course. Section 3 starts with the storage hierarchy in depth and includes the latency ladder exercise that makes these numbers visceral.

Exercise

The plan for this lesson: run the same query cold and warm; record p50/p95/p99 across many runs and explain why the average lied. Complete these steps on your local PostgreSQL and MySQL instances.

Baseline cold run. On the sample OLTP schema, identify a table with at least 100,000 rows and a column that has no index. Flush the buffer pool (restart the server or use the flush commands above). Run a range query against the unindexed column. Record the execution time with EXPLAIN ANALYZE. Note "Buffers: shared hit=0 read=X" in PostgreSQL, or the handler_read_rnd_next counter in MySQL.
Warm cache run. Without flushing, run the exact same query 10 more times. Record the timing for each run. Calculate the average. Then calculate p50, p95, and p99 by sorting the 10 timings. If you see the first run as an outlier, explain in one sentence why.
Expose the lying average. Add 2–3 deliberately slow runs to your dataset (for example, run while another session is doing a large sequential scan to compete for I/O). Recalculate average vs p99. Write down by how much the average underrepresents the worst-case experience.
Add an index and re-measure. Create an index on the column you queried. Flush the cache. Re-run the same query cold. Record the new timing. Compute the speedup ratio: old_time / new_time. Also compare "Buffers: read" before and after to see how many fewer disk pages were touched.
Characterize your workload. Using pg_stat_statements or MySQL's performance_schema, run 5 minutes of mixed application load against the sample database and identify: (a) the query with the highest total time, (b) the query with the highest average time, and (c) whether those are the same query. Write a one-paragraph summary of what each metric tells you and which one you would prioritize fixing first.