Two kinds of requirement, two kinds of conversation
Every system has two layers of requirements. The first describes what the system does: the verbs, the user-visible behaviors, the features. The second describes how well it must do them: the constraints under which those behaviors must hold up. The two map to two very different conversations.
- Functional requirements (FRs). "A user can post a message. A driver can be matched to a rider. An order can be refunded." These are the behaviors a product manager writes on a whiteboard.
- Non-functional requirements (NFRs). "p99 message-post latency < 200ms at 50k QPS. Match a driver within 5s in 95% of cases. Refunds reconcile within 24h with zero double-credits." These are the constraints that decide whether the same FR set produces Twitter or a wedding-photo site.
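One way to keep NFRs honest is to store them as structured records instead of prose. The sketch below uses a hypothetical `NFR` record; the class and field names are illustrative and do not come from any library. The record cannot be created without a number, a unit, and a condition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NFR:
    """A hypothetical record for one non-functional requirement."""

    metric: str       # what is measured, e.g. "p99 message-post latency"
    threshold: float  # the number the system must stay under
    unit: str         # e.g. "ms", "s", "h"
    condition: str    # load or scope under which it must hold, e.g. "at 50k QPS"

    def __post_init__(self) -> None:
        # A requirement with no unit or no condition is a wish, not an NFR.
        if not self.unit or not self.condition:
            raise ValueError(f"not a measurable NFR: {self.metric!r}")

    def describe(self) -> str:
        return f"{self.metric} < {self.threshold:g}{self.unit} {self.condition}"


post_latency = NFR("p99 message-post latency", 200, "ms", "at 50k QPS")
print(post_latency.describe())  # p99 message-post latency < 200ms at 50k QPS
```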
Most teams have rigorous FR conversations and impoverished NFR conversations. That asymmetry is the single biggest predictor of an architecture that surprises everyone in year two.
Why NFRs drive architecture, not FRs
Consider two systems with identical functional requirements: "users can post short text messages and see other people's posts." That FR is satisfied by:
- A single Postgres instance with a Flask app. Costs $50/mo. Serves 1,000 users.
- A globally-distributed, multi-region, fanout-on-write timeline system with edge caching, push notifications, and 14 microservices. Costs $50M/mo. Serves Twitter.
Both satisfy the FR. The difference between them is entirely the NFRs — peak throughput, p99 latency, availability target, geographic distribution, durability of reposts. FRs decide the shape of your data model; NFRs decide the shape of your architecture.
The five categories of NFR worth tracking
NFRs sprawl. To keep them organized, sort each one into one of five buckets. Most systems have at least one NFR worth writing down in each:
1. Performance
Latency, throughput, concurrency. Always specify with a percentile and a load condition. "Fast" is not an NFR. "p99 latency < 200ms at 50k QPS sustained" is.
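To check a percentile NFR against measurements, both halves of the spec matter: the percentile, and the load it was measured under. The following is a minimal sketch. It assumes you already have per-request latency samples from a load test run at a known sustained QPS. It uses the nearest-rank definition of percentile, and monitoring systems may interpolate differently. The function names are hypothetical.

```python
import math


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest sample with at least p% of samples <= it."""
    if not samples or not 0 < p <= 100:
        raise ValueError("need at least one sample and 0 < p <= 100")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    rank = min(max(rank, 1), len(ordered))    # guard against float rounding
    return ordered[rank - 1]


def meets_latency_nfr(samples_ms: list[float], p: float, limit_ms: float,
                      measured_qps: float, required_qps: float) -> bool:
    """True if the p-th percentile is under the limit at the required load."""
    if measured_qps < required_qps:
        # A fast p99 at half the load says nothing about the NFR.
        raise ValueError("samples were not collected at the required load")
    return percentile(samples_ms, p) < limit_ms


# Hypothetical load-test result: 100 requests taking 1..100 ms at 50k QPS.
samples = [float(ms) for ms in range(1, 101)]
print(percentile(samples, 99))                               # 99.0
print(meets_latency_nfr(samples, 99, 200, 50_000, 50_000))  # True
```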
2. Reliability
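Availability, durability, fault tolerance, RPO/RTO (recovery point objective: the maximum tolerable data loss; recovery time objective: the maximum tolerable time to restore service). "Highly available" is not an NFR. "99.99% availability measured monthly, RPO < 30s, RTO < 5 min" is.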
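Availability targets are easier to argue about once you convert them into a downtime budget. The small sketch below assumes a 30-day month. Calendar months vary, and some SLAs count only business hours.

```python
def downtime_budget_minutes(availability_pct: float, days: float = 30) -> float:
    """Allowed downtime in minutes over a window of `days` at a given availability."""
    window_minutes = days * 24 * 60
    return window_minutes * (1 - availability_pct / 100)


for target in (99.9, 99.95, 99.99):
    print(f"{target}% -> {downtime_budget_minutes(target):.2f} min/month")
# 99.9% -> 43.20 min/month
# 99.95% -> 21.60 min/month
# 99.99% -> 4.32 min/month
```

The arithmetic also exposes a tension inside the example NFR. At 99.99%, the monthly budget is about 4.3 minutes, so a single incident that uses the full 5-minute RTO already exceeds it.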
3. Security & compliance
AuthN/Z model, encryption, audit, data residency, regulatory and industry standards (PCI DSS, HIPAA, SOC 2, GDPR). These are often binding in ways engineers don't see until late: "we need PCI" can rewrite an architecture.
4. Operability
Deploy frequency, rollback time, observability, configurability, on-call burden. Operational NFRs are routinely under-specified and create slow-burn pain. "We must be able to roll back any deploy in <5 min" has dramatic implications.
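To see why a 5-minute rollback target has dramatic implications, model a rollback as redeploying the previous version to the fleet in sequential batches. The model and every number in it are hypothetical. Real rollbacks also pay for health checks, connection draining, and cache warm-up.

```python
import math


def rollback_seconds(hosts: int, batch_size: int, seconds_per_batch: float) -> float:
    """Wall-clock rollback time when batches are rolled back one after another."""
    return math.ceil(hosts / batch_size) * seconds_per_batch


def min_batch_size(hosts: int, budget_seconds: float, seconds_per_batch: float) -> int:
    """Smallest batch size that fits the whole rollback inside the budget."""
    max_batches = math.floor(budget_seconds / seconds_per_batch)
    if max_batches < 1:
        raise ValueError("a single batch already exceeds the budget")
    return math.ceil(hosts / max_batches)


# Hypothetical fleet: 200 hosts, 45 s per batch.
print(rollback_seconds(200, 10, 45))  # 900, i.e. 15 minutes: misses a 5-minute target
print(min_batch_size(200, 300, 45))   # 34 hosts per batch to fit in 300 s
```

Suppose the hosts in a batch are out of rotation while they restart. Then rolling back 34 of 200 at once removes roughly a sixth of capacity per batch, and the rollback NFR quietly becomes a capacity-headroom NFR.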
5. Cost
The honest NFR most slides omit. "Total infrastructure spend < $X/month at Y users" is a real constraint. A 10× cheaper architecture that meets every other NFR is usually the right answer, even if a "better" one exists on paper.
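Treating cost as an NFR turns architecture choice into a filter followed by a minimum. First drop every candidate that misses a constraint, then take the cheapest survivor. The sketch below uses hypothetical candidates and numbers. In practice the per-candidate figures come from load tests and vendor quotes, not constants.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    name: str
    monthly_cost: float      # USD per month at the target user count
    p99_ms: float            # measured or estimated p99 latency at target load
    availability_pct: float  # expected monthly availability


def cheapest_acceptable(candidates: list[Candidate], max_p99_ms: float,
                        min_availability_pct: float,
                        max_monthly_cost: float) -> Optional[Candidate]:
    """Cheapest candidate that meets every NFR, or None if none qualifies."""
    acceptable = [
        c for c in candidates
        if c.p99_ms < max_p99_ms
        and c.availability_pct >= min_availability_pct
        and c.monthly_cost <= max_monthly_cost
    ]
    return min(acceptable, key=lambda c: c.monthly_cost, default=None)


options = [
    Candidate("single-region monolith", 5_000, 150, 99.95),
    Candidate("multi-region services", 50_000, 80, 99.995),
]
print(cheapest_acceptable(options, 200, 99.9, 60_000).name)   # single-region monolith
print(cheapest_acceptable(options, 200, 99.99, 60_000).name)  # multi-region services
```

In this example, tightening only the availability floor flips the answer to an architecture ten times more expensive. That is exactly the kind of trade a written-down cost NFR forces into the open.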
How to extract NFRs from a vague stakeholder
Stakeholders rarely volunteer NFRs. They will tell you "we need it to be fast" and consider their job done. Your job is to translate. A short script that almost always works:
- "What's the worst thing that could happen?" Their answers are durability and security requirements in disguise. "We can't lose any payment" → strong durability. "Users can't see each other's data" → tenant isolation.
- "What does the user experience if it's slow?" Their answers are latency requirements. "They'll bounce" → strict p99. "We have a loading spinner" → relaxed.
- "How many users at peak?" Plus "When is peak?" Their answers are throughput and burst-handling requirements.
- "What if it's down for an hour?" Their answers are availability and DR requirements. "We'd lose $X" sharpens it instantly.
- "What are we not allowed to do with this data?" Their answers are compliance and residency requirements.
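Back-of-envelope arithmetic turns answers to the peak questions into a throughput number. The sketch below assumes requests are spread over a day. It also assumes the stakeholder's answer to "When is peak?" can be turned into a peak-to-average ratio. That ratio is an estimate you must supply; the formula does not derive it.

```python
SECONDS_PER_DAY = 86_400


def peak_qps(daily_active_users: int, requests_per_user_per_day: float,
             peak_to_average: float) -> float:
    """Estimated peak queries per second from daily usage and a peak factor."""
    average_qps = daily_active_users * requests_per_user_per_day / SECONDS_PER_DAY
    return average_qps * peak_to_average


# Hypothetical answers: 2M daily users, 50 requests each, evening peak ~3x average.
print(round(peak_qps(2_000_000, 50, 3)))  # 3472
```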
The binding NFR is usually one or two
A typical system has a dozen NFRs on paper. Only one or two of them, occasionally three, are actually binding: they determine the shape of the architecture. The rest fall out for free or can be traded away. Identifying which is which is much of the work.
A useful test: if I tightened this NFR by 10×, would the architecture have to change? If yes, it's binding. If no, it's noise. For a typical OLTP system, latency-at-percentile and availability are binding; durability is easy if your DB is competent; cost binds at a different scale than the others.
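You can write the 10× test down once each NFR is expressed as a budget where a smaller number is stricter (p99 latency, unavailability, RPO). You also need an estimate of the best the current architecture can achieve. The `Budget` type and the numbers below are hypothetical. In practice the hard part is the "achievable" estimate, which this sketch simply takes as given.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Budget:
    name: str
    target: float      # required value; smaller is stricter
    achievable: float  # best the current architecture can deliver (estimated)


def is_binding(budget: Budget, factor: float = 10.0) -> bool:
    """Binding if a target `factor` times stricter is beyond the current architecture."""
    return budget.achievable > budget.target / factor


nfrs = [
    Budget("p99 latency (ms)", target=200, achievable=60),
    Budget("monthly unavailability (%)", target=0.01, achievable=0.005),
    Budget("RPO (s)", target=30, achievable=0),  # synchronous replication assumed
]
for b in nfrs:
    print(b.name, "binding" if is_binding(b) else "noise")
# p99 latency (ms) binding
# monthly unavailability (%) binding
# RPO (s) noise
```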
NFRs are why architectures get rewritten
The most common story in systems engineering: the FRs are stable for years, but one NFR changes (10× more users, a regulator showing up, a new latency target from a competitor) and a working architecture suddenly stops working. The system did not break — its environment did. This is why "architecture is evolved, not chosen" (Lesson 1.1). The trigger is almost always an NFR shift.
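Load growth is one NFR shift you can partly anticipate. Assuming steady compound growth, the time until demand outgrows the architecture's capacity is a logarithm. Real growth is lumpier than this. The function name and figures are illustrative.

```python
import math


def months_until_exceeded(current_load: float, capacity: float,
                          monthly_growth: float) -> float:
    """Months until load growing at `monthly_growth` (0.10 = 10%) exceeds capacity."""
    if current_load >= capacity:
        return 0.0
    if monthly_growth <= 0:
        return math.inf  # flat or shrinking load never reaches capacity
    return math.log(capacity / current_load) / math.log(1 + monthly_growth)


# Hypothetical: 5k QPS today, architecture tops out near 50k QPS, 10% monthly growth.
print(f"{months_until_exceeded(5_000, 50_000, 0.10):.1f}")  # 24.2
```

At that growth rate, ten times the headroom buys only about two years.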
Key takeaways
- FRs describe what the system does. NFRs describe how well it must do them under stress.
- NFRs — not FRs — drive every interesting architectural decision.
- Group NFRs into five categories: performance, reliability, security/compliance, operability, cost.
- An NFR without a number is a wish. Force a number with a unit and a condition.
- Find the one or two binding NFRs; the rest fall out for free.
- Architectures get rewritten because NFRs shift — anticipate which one might next.