What makes data "big"?
A 50 MB spreadsheet isn't Big Data, even if it feels big to you. Big Data is defined by what it does to your tools. When a dataset no longer fits in one machine's memory, arrives faster than one machine can process it, or comes in shapes your database can't model cleanly, you've crossed into Big Data territory. A common way to characterize it is the 5 Vs.
The 5 Vs
| V | What it means | Example |
|---|---|---|
| Volume | Sheer size — terabytes to petabytes | Years of clickstream logs from millions of users |
| Velocity | How fast data arrives and must be processed | Payment events streaming in thousands per second |
| Variety | Structured, semi-structured, and unstructured shapes | SQL tables + JSON + images + free text together |
| Veracity | How trustworthy and clean the data is | Duplicate, missing, or malformed records |
| Value | The business insight worth extracting | A churn model that saves millions in revenue |
Why one machine isn't enough
Imagine you're asked to count word frequencies across 10 TB of text. A laptop with 16 GB of RAM can hold well under 1% of that in memory at any one time, and just reading the data from a single disk would take many hours (about 14 hours at a sustained 200 MB/s, for example). You have two choices: buy a single, monstrously expensive machine (scale up), or split the work across many ordinary machines that each handle a slice (scale out). Big Data tools, Spark included, are built on the second idea. We'll dig into that trade-off in the next section.
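
To make the scale-out idea concrete, here is a minimal single-process sketch in plain Python. It is not Spark code and not how any particular tool implements it: the "machines" are simulated by ordinary function calls, and the function names (`split_into_slices`, `count_words`, `merge_counts`, `hours_to_read`) and the 200 MB/s disk speed are illustrative assumptions. It shows the pattern described above: each worker counts words in its own slice, and the partial counts are merged at the end.

```python
from collections import Counter


def split_into_slices(lines, n_workers):
    """Deal lines round-robin into n_workers slices (a stand-in for data partitions)."""
    return [lines[i::n_workers] for i in range(n_workers)]


def count_words(slice_of_lines):
    """Work one 'machine' does on its own: count words in its slice only."""
    counts = Counter()
    for line in slice_of_lines:
        counts.update(line.lower().split())
    return counts


def merge_counts(partial_counts):
    """Combine the per-slice counts into one global count."""
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total


def scale_out_word_count(lines, n_workers=3):
    """Split, count each slice independently, then merge.

    Here the slices are processed one after another in a single process;
    on a real cluster each slice would go to a different machine.
    """
    slices = split_into_slices(lines, n_workers)
    partials = [count_words(s) for s in slices]
    return merge_counts(partials)


def hours_to_read(total_bytes, bytes_per_second):
    """Back-of-the-envelope time to read a dataset sequentially from one disk."""
    return total_bytes / bytes_per_second / 3600


if __name__ == "__main__":
    sample = [
        "big data is big",
        "data arrives fast",
        "big tools for big data",
    ]
    print(scale_out_word_count(sample).most_common(2))  # [('big', 4), ('data', 3)]

    # 10 TB at an assumed 200 MB/s sequential read speed
    print(round(hours_to_read(10e12, 200e6), 1))  # 13.9
```

The key property is that `count_words` never needs to see the whole dataset, so no single worker has to hold 10 TB. Only the much smaller partial counts travel to the merge step.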