Chapter 1 · Lesson 2

Understanding Big Data: The 5 Vs

"Big Data" isn't just "a lot of data" — it's data whose volume, velocity, and variety break the tools that work fine on a single machine.

What makes data "big"?

A 50 MB spreadsheet isn't Big Data, even if it feels big to you. Big Data is defined by what it does to your tools: when a dataset no longer fits in one machine's memory, or arrives faster than one machine can process it, or comes in shapes your database can't model cleanly — you've crossed into Big Data territory. The classic way to characterize it is the 5 Vs.
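The rule of thumb above can be written down as a quick back-of-the-envelope check. This is only a sketch: the function name `outgrows_one_machine`, the 16 GB laptop, and the event rates are illustrative assumptions, and the third criterion (shapes your database can't model cleanly) is a judgment call that a size comparison cannot capture.

```python
MB = 10**6
GB = 10**9
TB = 10**12


def outgrows_one_machine(dataset_bytes, ram_bytes, arrival_per_sec, capacity_per_sec):
    """Return the reasons (if any) a workload exceeds a single machine.

    Hypothetical helper: it checks only the two measurable criteria,
    memory fit and arrival rate versus processing rate.
    """
    reasons = []
    if dataset_bytes > ram_bytes:
        reasons.append("does not fit in memory")
    if arrival_per_sec > capacity_per_sec:
        reasons.append("arrives faster than it can be processed")
    return reasons


LAPTOP_RAM = 16 * GB  # assumed machine size, for illustration only

# A 50 MB spreadsheet with no live feed: no reasons are returned.
spreadsheet = outgrows_one_machine(50 * MB, LAPTOP_RAM, 0, 1)

# 10 TB of logs arriving at an assumed 5,000 events/s on a machine
# that can process an assumed 1,000 events/s: both checks fail.
clickstream = outgrows_one_machine(10 * TB, LAPTOP_RAM, 5000, 1000)

print(spreadsheet)
print(clickstream)
```

An empty list means the data is comfortably "small" by these two measures; any entry in the list is a signal that single-machine tools are starting to strain.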

[Figure: The 5 Vs of Big Data. Volume (terabytes+), Velocity (speed of arrival), Variety (many formats), Veracity (trustworthiness), Value (business worth).]

The 5 Vs

V        | What it means                                        | Example
Volume   | Sheer size: terabytes to petabytes                   | Years of clickstream logs from millions of users
Velocity | How fast data arrives and must be processed          | Payment events streaming in thousands per second
Variety  | Structured, semi-structured, and unstructured shapes | SQL tables + JSON + images + free text together
Veracity | How trustworthy and clean the data is                | Duplicate, missing, or malformed records
Value    | The business insight worth extracting                | A churn model that saves millions in revenue
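To make the Veracity row concrete, here is a minimal sketch that sorts incoming records into the three problem types named there: duplicate, missing, and malformed. The record layout (`order_id`, `amount`), the function name `audit_records`, and the sample lines are all invented for illustration; a real pipeline would apply its own rules.

```python
import json


def audit_records(raw_lines, id_field="order_id", required_fields=("order_id", "amount")):
    """Count clean and problematic records in a list of JSON strings."""
    summary = {"ok": 0, "duplicate": 0, "missing": 0, "malformed": 0}
    seen_ids = set()
    for line in raw_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            # The line is not valid JSON at all.
            summary["malformed"] += 1
            continue
        if not isinstance(record, dict):
            # Valid JSON, but not an object with named fields.
            summary["malformed"] += 1
            continue
        if any(record.get(field) is None for field in required_fields):
            summary["missing"] += 1
            continue
        if record[id_field] in seen_ids:
            summary["duplicate"] += 1
            continue
        seen_ids.add(record[id_field])
        summary["ok"] += 1
    return summary


# Made-up sample data covering each case once.
raw_lines = [
    '{"order_id": 1, "amount": 19.99}',
    '{"order_id": 1, "amount": 19.99}',  # duplicate of the first record
    '{"order_id": 2}',                   # missing the amount field
    '{"order_id": 3, "amount": ',        # truncated, so malformed JSON
    '{"order_id": 4, "amount": 5.00}',
]
report = audit_records(raw_lines)
print(report)
```

Note that the set of seen IDs lives in one process's memory, which is itself a single-machine assumption; at Big Data scale, even deduplication has to be spread across machines.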

Why one machine isn't enough

Imagine you're asked to count word frequencies across 10 TB of text. A laptop with 16 GB of RAM can hold less than 0.2% of that in memory at once, and reading it all from a single disk at, say, 200 MB/s would take roughly 14 hours. You have two choices: buy a single, monstrously expensive machine (scale up), or split the work across many ordinary machines that each handle a slice (scale out). Big Data tools, Spark included, are built on the second idea. We'll dig into that trade-off in the next section.
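The scale-out idea can be sketched on a toy version of the word-count problem. In this sketch, threads stand in for separate machines and three short strings stand in for 10 TB of text; the function names are invented for illustration, and this is not how Spark itself is implemented. The shape is what matters: split the input into slices, count each slice independently, then merge the partial counts.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_slices(lines, num_slices):
    """Deal lines round-robin into num_slices roughly equal slices."""
    return [lines[i::num_slices] for i in range(num_slices)]


def count_words(slice_of_lines):
    """Per-worker step: count words in one slice, ignoring all other slices."""
    counts = Counter()
    for line in slice_of_lines:
        counts.update(line.lower().split())
    return counts


def merge_counts(partial_counts):
    """Combine step: add the per-slice counts into one total."""
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total


def scale_out_word_count(lines, num_workers=4):
    slices = split_into_slices(lines, num_workers)
    # Threads are a stand-in for machines; on a real cluster each
    # slice would live on, and be counted by, a different node.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(count_words, slices))
    return merge_counts(partials)


sample_lines = [
    "big data is not just a lot of data",
    "spark splits big jobs across many machines",
    "each machine handles a slice of the data",
]
word_counts = scale_out_word_count(sample_lines, num_workers=3)
print(word_counts.most_common(3))
```

Because each worker only ever sees its own slice, no single worker needs to hold the whole dataset, and adding workers shrinks the size of each slice. The merged result is the same as a single pass over all the lines would give.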

Real-world scenario

A retailer wants to recommend products in real time. Their data is huge (years of orders), fast (live browsing events), varied (orders, page views, images, reviews), messy (returns, bots, test accounts), and valuable (better recommendations = more revenue). That's all five Vs at once, a textbook job for Spark.

Lesson Summary

Big Data is defined by what it does to your tools, not by a fixed size. The 5 Vs (Volume, Velocity, Variety, Veracity, and Value) capture why single-machine tools break down and why we need to spread work across a cluster.