Scale Dimensions and the CAP Tradeoff · The System Design Masterclass

Scale is a vector, not a number

When someone says "we need to scale," the first useful question is: scale in which direction? "Scale" is shorthand for at least four very different problems, each of which stresses different parts of the system.

Users. The count of distinct identities the system must keep state for.
Requests. The rate of operations per second flowing through the system.
Data. The volume of bytes the system must store, index, and retrieve.
Geography. The physical distance between where data lives and where it's used.

A system that handles 100M users but a quiet 5k QPS is utterly unlike a system that handles 1M users at 200k QPS. The architectures look nothing alike. So whenever you hear "scale," ask which axes are growing and at what rate. The answers carve up the design space.

Each axis stresses a different part of the architecture. They are not interchangeable.

Why this matters: bottlenecks move

A system that's user-bound (Facebook's "graph") will hit storage and fanout limits first. A system that's request-bound (an ad bidder) will hit CPU and tail latency first. A system that's data-bound (a data warehouse) will hit IO and shuffle limits first. A system that's geography-bound (a multinational app) will hit the speed of light first. Most architectures are bad at one of these and adequate at the others — that's the whole game.

Numbers you should know in your bones

Senior engineers carry a small set of mental rulers. Internalize these — they will save you from making confident-sounding nonsense estimates:

Operation	Approx. latency	Mental model
L1 cache reference	0.5 ns	Free
Main memory reference	100 ns	Very cheap
Send 1 MB sequentially over network	~10 ms	Network is slow
SSD random read	~150 µs	300× slower than RAM
Round trip within a datacenter	~0.5 ms	Same building
Round trip US east ↔ US west	~70 ms	Speed of light, basically
Round trip US ↔ EU	~80–100 ms	A whole human reaction
Round trip US ↔ Singapore	~180 ms	Painful for sync UIs

Calibration drill

The single most useful calibration: a cross-region round trip is 50–200ms. If your design requires N synchronous cross-region hops on the critical path, your p99 is at least N × 100ms. Many "real-time" systems quietly violate this. Notice when yours does.

CAP: what it actually says (and what it doesn't)

CAP theorem is the most-cited and most-misunderstood result in distributed systems. The pop-culture version — "pick two of consistency, availability, partition tolerance" — is technically true but practically useless, because partition tolerance is not optional in any system that runs on more than one machine.

Here's the version that actually helps:

In a distributed system, when the network partitions — and it will partition — you must choose: respond with possibly stale data, or refuse to respond.

That's it. CAP isn't about picking two-of-three; it's about what the system does during the rare moments when nodes can't reach each other. The choice is between:

CP — consistency preferred. During a partition, refuse writes (or refuse reads) rather than risk divergence. Examples: classic relational DBs in primary-failover mode, ZooKeeper, etcd, Spanner.
AP — availability preferred. During a partition, keep accepting reads and writes; reconcile when the partition heals. Examples: Cassandra in its loose configuration, DynamoDB (configurable), eventually-consistent caches.

CAP only kicks in during a partition. The rest of the time, the choice doesn't apply.

The PACELC sharpening

The reason CAP feels incomplete is that it ignores what happens when there isn't a partition — which is, you know, 99.99% of the time. PACELC extends it:

If there's a Partition: choose Availability or Consistency.
Else: choose Latency or Consistency.

That second clause is the one that actually shapes most architectures. Even with no partition, every consistency guarantee costs latency, because consistency requires coordination, and coordination requires waiting for someone. When you see a system advertised as "strongly consistent and low latency," ask which one bends under load — it's almost always latency, via a long tail.

What CAP forces you to decide

When you encounter a real subsystem, CAP doesn't tell you what to build — it tells you which questions to answer:

What does this data mean if it's stale? A like count being a few seconds behind is fine. An account balance being a few seconds behind is a regulatory incident.
What's the read-after-write contract? "My own profile change should be visible immediately to me" is read-your-writes consistency, which is much weaker (and cheaper) than full linearizability.
What happens during reconciliation? If two partitioned writes conflict, who wins? Last-write-wins is the default but is often wrong (it silently loses data). CRDTs, vector clocks, and merge functions are all answers to this.
How frequent are partitions, really? Within a datacenter: rare and brief. Across regions: depressingly common. Your CAP choice should match your blast radius.

Practical heuristic

For each data type your system stores, write one sentence: "If this is stale by N seconds, the worst thing that happens is X." That sentence is your real consistency requirement. Most architects skip it; that's why so many systems are over-coordinated and slow.

Tying it back: scale forces CAP

On one machine there is no CAP. The choice only appears when you cross a network — and the more you scale (especially geographically), the more often you cross networks, the more often you partition, and the more loudly the CAP choice screams. Scale doesn't create the tradeoff; it just exposes it.

Key takeaways

"Scale" is a four-dimensional vector: users, requests, data, geography. Different directions break different parts of the system.
Memorize the order-of-magnitude numbers (RAM, disk, intra-DC, cross-region). They calibrate every estimate.
CAP is about behavior during partitions, not a buffet. The real choice is CP or AP.
PACELC is the version that matters in practice: consistency costs latency even with no partition.
The right consistency level is the weakest one your domain can tolerate. Coordination is expensive.

Exercise

For three systems you know — a chat app, a ride-hailing app, and a banking app — answer for each: which axis of scale is the dominant constraint? Which subsystems would you make CP and which AP, and why? Write three sentences per system. The act of forcing a choice is the lesson.