Inside the Database Engine · Database Internals & Query Optimization: 100x Faster Queries

Why You Need This Mental Model

Imagine calling a mechanic because your car is "slow" — then being handed a parts diagram you've never seen before. You'd have no idea where to look first. That's exactly the situation most developers are in when they open a slow-query log for the first time. They see a number (say, 4.2 seconds) but have no mental map of where inside the engine that time went.

Before you can speed anything up, you need to know what you're speeding up. This lesson gives you that parts diagram. Every performance tool, every EXPLAIN output, and every tuning knob in the rest of this course will map back to one of the five subsystems you'll see here.

Key idea

A database engine is not a single monolithic program. It is a carefully layered stack of specialized subsystems, each with its own job, its own failure modes, and its own performance levers. Knowing the layers is the prerequisite to everything else.

The Five-Subsystem Architecture

Regardless of whether you're using PostgreSQL, MySQL/InnoDB, SQL Server, or Oracle, every mature relational engine contains the same five conceptual layers. The boundaries are drawn slightly differently per engine, but the responsibilities are universal.

Figure 1 — The five subsystems of a relational database engine. A client SQL statement enters at the top and must pass through all five layers before data is safely returned (and durably written).

Subsystem 1 — The Query Processor

The query processor is the engine's "brain." It receives your raw SQL text and turns it into an optimized execution plan. It has three distinct stages inside it.

Parser

The parser applies SQL grammar rules to your text and builds an abstract syntax tree (AST). At this stage, the engine doesn't know whether your tables even exist — it's only checking grammar. A missing parenthesis or a misspelled keyword dies here.

Rewriter / Binder

After parsing, the engine resolves names against the catalog (does this table exist? does this user have permission?), expands view definitions inline, and rewrites certain constructs (for example, some engines rewrite NOT IN as an anti-join). In PostgreSQL this stage is called the rewriter; MySQL calls it the preprocessor.

Planner / Optimizer

This is where most of the interesting engineering happens. The optimizer considers many candidate plans (which indexes to use? which join algorithm? in what order to join tables?) and estimates the cost of each using table statistics. It picks the plan with the lowest estimated cost and hands it to the execution engine. Entire course sections — Sections 5, 6, and 7 — are devoted to understanding and influencing what happens here.

Execution Engine

The execution engine runs the chosen plan. Classically it uses the Volcano / iterator model: each operator exposes a simple next() interface and pulls rows from its children one at a time. Modern engines (DuckDB, ClickHouse, later versions of MySQL) use vectorized execution — processing a batch of rows per call to amortize function-call overhead.

Analogy

Think of the query processor as a GPS navigation system. You type a destination (your SQL). The parser checks that the address is syntactically valid. The planner calculates several possible routes and picks the fastest one based on current traffic data (statistics). The execution engine drives the chosen route turn by turn.

Subsystem 2 — The Transaction Manager

The transaction manager enforces the ACID guarantees — Atomicity, Consistency, Isolation, and Durability — for every operation that touches data. It coordinates two internal services: the lock manager and the MVCC concurrency controller.

The lock manager grants and tracks locks (shared, exclusive, intent locks) so that concurrent queries don't corrupt each other. MVCC (Multi-Version Concurrency Control) lets readers see a consistent snapshot of the database without blocking writers and vice versa — a crucial insight that underpins the performance of every modern database.

Gotcha

PostgreSQL and MySQL both use MVCC, but they implement it very differently — and that difference is one of the most important architectural facts in this course. PostgreSQL stores multiple versions of a row inside the same heap table (old tuple versions are cleaned up by VACUUM). MySQL/InnoDB stores one current version on the clustered B-tree leaf and keeps old versions in a separate undo log. Both achieve snapshot isolation, but the cleanup mechanics, bloat behavior, and operational implications are very different.

Subsystem 3 — The Buffer Manager

Disk I/O is the single biggest cost in most database workloads. The buffer manager is the subsystem that tries to eliminate it. Its job is to maintain a pool of 8 KB pages in RAM so that the execution engine can read and write data without going to disk if at all possible.

When a page is needed, the buffer manager checks whether it's already in the pool (a cache hit). If not, it reads the page from disk, evicts another page if the pool is full, and places the new page in the pool (a cache miss). When a page is modified, it becomes dirty; dirty pages are eventually written back to disk (flushed) by the background writer or at checkpoint time.

Note — PostgreSQL vs MySQL

PostgreSQL's buffer manager manages shared_buffers (default 128 MB, typically tuned to 25% of RAM). Pages that don't fit there can still be served by the operating system's page cache, giving Postgres an effective two-level cache. MySQL/InnoDB has a single unified innodb_buffer_pool (default 128 MB, should be 60–80% of RAM for a dedicated database server) and does not rely on the OS page cache in the same way. Both use clock-sweep-like eviction, but InnoDB adds a midpoint insertion policy to protect hot data from full-table-scan pollution.

Subsystem 4 — The Storage Engine

The storage engine is responsible for how rows are physically organized on disk and how they are read or written. It is the layer closest to raw bytes.

This is one of the starkest differences between PostgreSQL and MySQL:

PostgreSQL uses a heap table: rows are written wherever there is free space on a page; the physical order of rows has no relation to any index. Row versions are stored in the heap itself. A row is located by its physical address, called a ctid (e.g., (42, 3) means page 42, tuple slot 3).
MySQL/InnoDB uses a clustered primary key: the table's data pages are the primary key's B-tree leaf nodes, ordered by the primary key value. A row is located by its primary key. Secondary indexes store the primary key value as the row pointer instead of a physical address — which means a secondary-index lookup always does two B-tree traversals (secondary index → primary key → row).

Key idea

The storage engine choice shapes everything above it. Because InnoDB's table is a B-tree, range scans on the primary key are blazing fast (sequential disk reads). Because PostgreSQL uses a heap, every index is a secondary index — but updates never have to reorganize rows to maintain order. Neither architecture is universally better; understanding both lets you pick the right tool and predict how each will behave.

Subsystem 5 — The Logging Subsystem

Databases must survive crashes. If the server loses power mid-transaction, no committed data should be lost and no partial writes should appear. The logging subsystem makes that guarantee by implementing Write-Ahead Logging (WAL): every change is written to the log before it is applied to the data pages on disk. If the server crashes, the engine replays the log on restart to reconstruct any changes that hadn't yet reached the data files.

PostgreSQL WAL vs InnoDB Redo + Undo Logs

PostgreSQL uses a single WAL file stream for both redo and MVCC history. WAL records describe physical page changes ("at offset X in page Y, write bytes Z"). The MVCC old versions live in the heap and are reclaimed by VACUUM — they are not in the WAL.

MySQL/InnoDB splits the work across two log structures: the redo log records changes for crash recovery (like WAL), while the undo log stores the before-images of rows needed for MVCC snapshots and transaction rollback. The undo log lives in the shared tablespace (or a separate undo tablespace), not in the data pages themselves.

Tracing a SELECT End-to-End

Now let's put it all together. Consider this query on an orders table with a few million rows:

-- PostgreSQL
-- Find all high-value orders placed this month
SELECT order_id, customer_id, total_amount
FROM   orders
WHERE  total_amount > 500
  AND  created_at >= date_trunc('month', now())
ORDER  BY total_amount DESC
LIMIT  20;

-- MySQL
-- Equivalent query on MySQL
SELECT order_id, customer_id, total_amount
FROM   orders
WHERE  total_amount > 500
  AND  created_at >= DATE_FORMAT(NOW(), '%Y-%m-01')
ORDER  BY total_amount DESC
LIMIT  20;

Here is the journey that query makes, one subsystem at a time:

Step 1 — Network & Parser

Your client sends the SQL text over a TCP connection (port 5432 for PostgreSQL, 3306 for MySQL). The parser tokenizes the text, checks grammar, and builds an AST. If you mistype FORM instead of FROM, the journey ends here.

Step 2 — Rewriter / Binder

The engine resolves orders against the catalog, checks that total_amount and created_at actually exist, verifies that your role has SELECT privilege, and expands any view definitions if orders were a view.

Step 3 — Planner / Optimizer

The optimizer looks at statistics for orders. How many rows? What is the distribution of total_amount? Is there an index on created_at or total_amount? It estimates how many rows will survive the WHERE clause and generates candidate plans (seq scan, index scan, bitmap index scan). It picks the plan with the lowest estimated cost — say, a bitmap index scan on a composite index followed by a sort — and produces a physical plan tree.

Step 4 — Transaction Manager checks snapshot

Before any data is read, the transaction manager establishes a snapshot: a point-in-time view of which row versions are visible to this transaction. In PostgreSQL, this means noting the current transaction ID; in InnoDB, it means recording the read-view from the undo log. Any row versions committed after this point are invisible to the query.

Step 5 — Buffer Manager serves pages

The execution engine, running the plan tree, asks the buffer manager for each needed page. Pages already in the buffer pool (shared_buffers / buffer pool) are returned immediately — a cache hit takes roughly 100 nanoseconds. Pages not in the pool require a disk read — which can take 100–500 microseconds on an SSD, or up to 10 milliseconds on a spinning disk. The ratio of hits to misses is the single biggest driver of query latency.

Step 6 — Storage Engine reads rows

On each page, the storage engine iterates over row slots (PostgreSQL heap item pointers, InnoDB page directory entries), applies the visibility check (is this tuple version visible to my snapshot?), and passes matching rows up to the filter operator.

Step 7 — Logging Subsystem (if writing)

A pure SELECT doesn't write data, so the WAL/redo log is not exercised for the data rows themselves. However, PostgreSQL may still write WAL for hint bits (a minor bookkeeping detail updated during visibility checks), and InnoDB will record undo information if the query runs under a transaction with higher isolation levels. For any DML statement (INSERT, UPDATE, DELETE), this step is critical: the change is logged before it touches the data page, ensuring crash safety.

Step 8 — Results flow back to the client

The execution engine assembles the result rows, applies the ORDER BY sort and LIMIT 20, serializes the output into the wire protocol format, and streams it back to your client. The journey is complete.

Observing the Architecture Yourself

You don't have to take this on faith. Both engines expose commands that let you see what the planner decided and what the execution engine actually did:

-- PostgreSQL — EXPLAIN ANALYZE BUFFERS
-- Shows: plan tree (planner), actual rows + time (execution engine),
-- and buffer hits/misses (buffer manager)
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT order_id, customer_id, total_amount
FROM   orders
WHERE  total_amount > 500
  AND  created_at >= date_trunc('month', now())
ORDER  BY total_amount DESC
LIMIT  20;

-- MySQL — EXPLAIN ANALYZE
-- Shows: plan tree with estimated AND actual rows + loop time
-- (MySQL 8.0.18+)
EXPLAIN ANALYZE
SELECT order_id, customer_id, total_amount
FROM   orders
WHERE  total_amount > 500
  AND  created_at >= DATE_FORMAT(NOW(), '%Y-%m-01')
ORDER  BY total_amount DESC
LIMIT  20;

In the output you'll see nodes labeled Seq Scan, Index Scan, Sort, Limit — each one is an execution operator in the plan tree the query processor built. The Buffers: shared hit=N, read=M line in PostgreSQL tells you exactly how many pages came from the buffer manager's cache vs disk. That single line often tells you more about query performance than the timing does.

Note — Section 6 goes deep

We'll spend all of Section 6 learning to read EXPLAIN output like a professional. For now, just recognise that the output is a direct window into the query processor and buffer manager — two of the five subsystems you just learned.

Exercise

The plan for this lesson is: trace a query end-to-end on paper, naming which subsystem handles each stage. Here are five concrete steps to make that hands-on.

Draw the subsystem stack from memory. Close this lesson. On paper or a whiteboard, sketch all five subsystems (query processor, transaction manager, buffer manager, storage engine, logging subsystem) in order, and write one sentence of responsibility for each. Check your drawing against Figure 1 above.
Trace a SELECT on paper. Write out the query from the "Tracing a SELECT" section on paper. Beside it, list the eight steps (parser → rewriter → planner → transaction snapshot → buffer manager → storage engine → WAL check → result). For each step, note: which subsystem, what it checks or produces, and what failure at this step would look like (e.g., parser failure = syntax error message to the client).
Run EXPLAIN (ANALYZE, BUFFERS) on PostgreSQL. Set up the orders table with at least 100,000 rows (or use the course sample dataset). Run the PostgreSQL EXPLAIN command above. Find the line that says Buffers: shared hit=X read=Y. Record X and Y. Run the query a second time — how did X and Y change, and why?
Run EXPLAIN ANALYZE on MySQL. Run the MySQL version of the same query. In the output, find the actual rows vs rows (estimated) figures on the innermost scan node. Are they close? If they're wildly different, which subsystem is responsible for that mismatch — and what might fix it?
Identify the bottleneck subsystem. Based on what you observed: was the slow part in the buffer manager (many page reads), the storage engine (many rows examined but few returned), or the query processor (a bad plan)? Write a one-paragraph diagnosis, citing the specific numbers from the EXPLAIN output.