Scale is a vector, not a number
When someone says "we need to scale," the first useful question is: scale in which direction? "Scale" is shorthand for at least four very different problems, each of which stresses different parts of the system.
- Users. The count of distinct identities the system must keep state for.
- Requests. The rate of operations per second flowing through the system.
- Data. The volume of bytes the system must store, index, and retrieve.
- Geography. The physical distance between where data lives and where it's used.
A system that handles 100M users but a quiet 5k QPS is utterly unlike a system that handles 1M users at 200k QPS. The architectures look nothing alike. So whenever you hear "scale," ask which axes are growing and at what rate. The answers carve up the design space.
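To make that contrast concrete, here is a back-of-envelope sketch that converts the two example systems into average requests per user per day. `ScaleProfile` and its fields are illustrative names, not part of any library, and the calculation assumes load is spread evenly across users, which real traffic never is.

```python
from dataclasses import dataclass

SECONDS_PER_DAY = 86_400


@dataclass(frozen=True)
class ScaleProfile:
    """Hypothetical container for two of the four axes: users and requests."""
    name: str
    users: int  # distinct identities the system keeps state for
    qps: int    # operations per second

    def requests_per_user_per_day(self) -> float:
        # Average daily load contributed by a single user.
        return self.qps * SECONDS_PER_DAY / self.users


wide_and_quiet = ScaleProfile("100M users, 5k QPS", users=100_000_000, qps=5_000)
narrow_and_hot = ScaleProfile("1M users, 200k QPS", users=1_000_000, qps=200_000)

for profile in (wide_and_quiet, narrow_and_hot):
    per_user = profile.requests_per_user_per_day()
    print(f"{profile.name}: {per_user:,.2f} requests/user/day")
```

The first system averages about 4 requests per user per day, the second about 17,000. That is a 4,000× difference in per-user intensity, which is why the two designs share so little.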
Why this matters: bottlenecks move
A system that's user-bound (a social graph like Facebook's) will hit storage and fanout limits first. A system that's request-bound (an ad bidder) will hit CPU and tail latency first. A system that's data-bound (a data warehouse) will hit IO and shuffle limits first. A system that's geography-bound (a multinational app) will hit the speed of light first. Most architectures are bad at one of these and adequate at the others. That's the whole game.
Numbers you should know in your bones
Senior engineers carry a small set of mental rulers. Internalize these — they will save you from making confident-sounding nonsense estimates:
| Operation | Approx. latency | Mental model |
|---|---|---|
| L1 cache reference | 0.5 ns | Free |
| Main memory reference | 100 ns | Very cheap |
| SSD random read | ~150 µs | ~1,500× slower than RAM |
| Round trip within a datacenter | ~0.5 ms | Same building |
| Send 1 MB over a 1 Gbps network | ~10 ms | Network is slow |
| Round trip US east ↔ US west | ~70 ms | Speed of light, basically |
| Round trip US ↔ EU | ~80–100 ms | About the limit of feeling instant |
| Round trip US ↔ Singapore | ~180 ms | Painful for sync UIs |
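One way to put these rulers to work is to encode them as constants and ask simple questions of them. The sketch below is only that: the dictionary keys and both helper functions are made-up names, and the values are the approximate figures from the table, not measurements.

```python
# Approximate latencies from the table above, expressed in nanoseconds.
NS = 1
US = 1_000 * NS
MS = 1_000 * US

LATENCY_NS = {
    "l1_cache": 0.5 * NS,
    "main_memory": 100 * NS,
    "ssd_random_read": 150 * US,
    "intra_dc_round_trip": 0.5 * MS,
    "send_1mb_network": 10 * MS,
    "us_east_west_round_trip": 70 * MS,
    "us_singapore_round_trip": 180 * MS,
}


def ratio(slow: str, fast: str) -> float:
    """How many times slower `slow` is than `fast`."""
    return LATENCY_NS[slow] / LATENCY_NS[fast]


def sequential_calls_in_budget(operation: str, budget_ms: float) -> int:
    """How many back-to-back calls of `operation` fit in a latency budget."""
    return int(budget_ms * MS // LATENCY_NS[operation])


print(ratio("intra_dc_round_trip", "main_memory"))                 # 5000.0
print(sequential_calls_in_budget("intra_dc_round_trip", 200))      # 400
print(sequential_calls_in_budget("us_singapore_round_trip", 200))  # 1
```

With a 200 ms budget you can afford hundreds of sequential hops inside a datacenter but only a single round trip to Singapore, which is the kind of estimate these numbers exist to make quickly.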
CAP: what it actually says (and what it doesn't)
The CAP theorem is the most cited and most misunderstood result in distributed systems. The pop-culture version ("pick two of consistency, availability, partition tolerance") is technically true but practically useless, because partition tolerance is not optional in any system that runs on more than one machine.
Here's the version that actually helps:
In a distributed system, when the network partitions — and it will partition — you must choose: respond with possibly stale data, or refuse to respond.
That's it. CAP isn't about picking two-of-three; it's about what the system does during the rare moments when nodes can't reach each other. The choice is between:
- CP — consistency preferred. During a partition, refuse writes (or refuse reads) rather than risk divergence. Examples: classic relational DBs in primary-failover mode, ZooKeeper, etcd, Spanner.
- AP: availability preferred. During a partition, keep accepting reads and writes; reconcile when the partition heals. Examples: Cassandra at low consistency levels (such as ONE), DynamoDB with its default eventually consistent reads, eventually-consistent caches.
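The difference between the two modes can be shown with a toy replica. This is a deliberately minimal sketch: `Replica`, `Mode`, and `Unavailable` are hypothetical names, the partition is simulated by flipping a flag by hand, and no real database behaves this simply.

```python
from enum import Enum


class Mode(Enum):
    CP = "consistency preferred"
    AP = "availability preferred"


class Unavailable(Exception):
    """Raised when a CP replica refuses to answer during a partition."""


class Replica:
    """Toy replica holding one value; `partitioned` is set manually."""

    def __init__(self, mode: Mode, value: str) -> None:
        self.mode = mode
        self.value = value
        self.partitioned = False

    def read(self) -> tuple[str, bool]:
        """Return (value, possibly_stale)."""
        if not self.partitioned:
            return self.value, False
        if self.mode is Mode.CP:
            # CP: refusing is safer than returning a value that may be outdated.
            raise Unavailable("cannot confirm latest value during partition")
        # AP: answer anyway, but flag that the value may be stale.
        return self.value, True


cp, ap = Replica(Mode.CP, "v1"), Replica(Mode.AP, "v1")
cp.partitioned = ap.partitioned = True

print(ap.read())  # ('v1', True)
try:
    cp.read()
except Unavailable as exc:
    print(f"CP replica refused: {exc}")
```

Both replicas hold the same data; the only difference is what they do when they cannot verify it.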
The PACELC sharpening
The reason CAP feels incomplete is that it ignores what happens when there isn't a partition — which is, you know, 99.99% of the time. PACELC extends it:
If there's a Partition: choose Availability or Consistency.
Else: choose Latency or Consistency.
That second clause is the one that actually shapes most architectures. Even with no partition, every consistency guarantee costs latency, because consistency requires coordination, and coordination requires waiting for someone. When you see a system advertised as "strongly consistent and low latency," ask which one bends under load — it's almost always latency, via a long tail.
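The latency cost of coordination can be estimated with a small model. The sketch below assumes a write is sent to all replicas in parallel and completes once a chosen number have acknowledged it; `write_latency_ms` is a hypothetical helper, and the round-trip times are the rough figures from the latency table rather than measurements.

```python
def write_latency_ms(replica_rtts_ms: list[float], acks_required: int) -> float:
    """Latency until `acks_required` replicas have acknowledged a write.

    Assumes requests go out in parallel, so the wait equals the k-th
    smallest round-trip time.
    """
    if not 1 <= acks_required <= len(replica_rtts_ms):
        raise ValueError("acks_required must be between 1 and the replica count")
    return sorted(replica_rtts_ms)[acks_required - 1]


# Hypothetical placement: one local replica, one cross-country, one in the EU.
rtts = [0.5, 70.0, 90.0]
for acks in (1, 2, 3):
    print(f"wait for {acks} ack(s): {write_latency_ms(rtts, acks)} ms")
```

Under these assumptions, acknowledging locally costs 0.5 ms, a majority costs 70 ms, and waiting for every replica costs 90 ms. No partition is involved; the stronger guarantee simply waits for a farther machine.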
What CAP forces you to decide
When you encounter a real subsystem, CAP doesn't tell you what to build — it tells you which questions to answer:
- What does this data mean if it's stale? A like count being a few seconds behind is fine. An account balance being a few seconds behind is a regulatory incident.
- What's the read-after-write contract? "My own profile change should be visible immediately to me" is read-your-writes consistency, which is much weaker (and cheaper) than full linearizability.
- What happens during reconciliation? If two partitioned writes conflict, who wins? Last-write-wins is a common default but is often wrong (it silently discards the losing write). Vector clocks detect conflicting writes; CRDTs and application-level merge functions resolve them.
- How frequent are partitions, really? Within a datacenter: rare and brief. Across regions: depressingly common. Your CAP choice should match how often, and how widely, you expect to be partitioned.
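The reconciliation question is easiest to see with the like-count example. The sketch below compares last-write-wins with a grow-only counter, one of the simplest CRDTs. Function names, node names, and timestamps are all invented for illustration, and the counter here supports increments only.

```python
def lww_merge(a: tuple[float, int], b: tuple[float, int]) -> tuple[float, int]:
    """Last-write-wins on (timestamp, value): the later timestamp survives."""
    return a if a[0] >= b[0] else b


def gcounter_merge(a: dict[str, int], b: dict[str, int]) -> dict[str, int]:
    """Grow-only counter CRDT: take the per-node maximum, so no increment is lost."""
    return {node: max(a.get(node, 0), b.get(node, 0)) for node in a.keys() | b.keys()}


# Both sides start from a like count of 10, then diverge during a partition.

# LWW: each side stores (timestamp, total).
left_lww = (100.0, 10 + 3)   # 3 likes recorded on the left
right_lww = (101.0, 10 + 2)  # 2 likes recorded on the right, slightly later
print(lww_merge(left_lww, right_lww)[1])  # 12: the left side's 3 likes are gone

# G-Counter: each node tracks only the increments it accepted itself.
left_gc = {"left": 7 + 3, "right": 3}
right_gc = {"left": 7, "right": 3 + 2}
print(sum(gcounter_merge(left_gc, right_gc).values()))  # 15: all 5 likes kept
```

The counter merge is commutative and idempotent, so replicas can exchange state in any order and still converge. The price is a data model restricted to operations that merge cleanly.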
Tying it back: scale forces CAP
On one machine there is no CAP. The choice only appears when you cross a network — and the more you scale (especially geographically), the more often you cross networks, the more often you partition, and the more loudly the CAP choice screams. Scale doesn't create the tradeoff; it just exposes it.
Key takeaways
- "Scale" is a four-dimensional vector: users, requests, data, geography. Different directions break different parts of the system.
- Memorize the order-of-magnitude numbers (RAM, SSD, intra-DC, cross-region). They calibrate every estimate.
- CAP is about behavior during partitions, not a buffet. The real choice is CP or AP.
- PACELC is the version that matters in practice: consistency costs latency even with no partition.
- The right consistency level is the weakest one your domain can tolerate. Coordination is expensive.