OLTP vs OLAP
| Property | Transaction processing systems (OLTP) | Analytic systems (OLAP) |
|---|---|---|
| Main read pattern | Small number of records per query, fetched by key | Aggregate over large number of records |
| Main write pattern | Random-access, low-latency writes from user input | Bulk import (ETL) or event stream |
| Primarily used by | End user/customer, via web application | Internal analyst, for decision support |
| What data represents | Latest state of data (current point in time) | History of events that happened over time |
| Dataset size | Gigabytes to terabytes | Terabytes to petabytes |
Data Warehouse
A data warehouse, is a separate database that analysts can query without affecting OLTP operations. The data warehouse contains a read-only copy of the data in all the various OLTP systems in the company.
Columnar Storage
Row vs. Column
Row-Oriented: Stores all data for a single record (e.g., Name, Age, Salary) together. This is ideal for OLTP (Online Transaction Processing) where you frequently read or write specific individual records (e.g., adding a new user).
Column-Oriented: Stores all data for a single attribute (e.g., all "Salaries") together. This is ideal for OLAP (Online Analytical Processing) where you want to perform calculations on a specific metric across millions of records (e.g., "What is the average salary?").
Let's consider a simple Employees table with the following schema:
id (int), age (int), salary (int), dept (string)
| id | age | salary | dept |
|---|---|---|---|
| 0 | 25 | 50000 | HR |
| 1 | 30 | 60000 | Tech |
| 2 | 45 | 90000 | Sales |
1. Row-Oriented Storage
In a row store, data is written row by row. The storage on disk looks like a long continuous stream of records:
0, 25, 50000, HR; 1, 30, 60000, Tech; 2, 45, 90000, Sales
If you want to calculate the Average Salary:
- The generic "Row" engine must read the entire row (including id, age, and dept) just to get to the salary.
- This wastes I/O bandwidth reading data you don't need.
2. Column-Oriented Storage
In a column store, data is written column by column.
0, 1, 2 (IDs)
25, 30, 45 (Ages)
50000, 60000, 90000 (Salaries)
HR, Tech, Sales (Depts)
If you want to calculate the Average Salary:
- The engine only reads the Salaries block: 50000, 60000, 90000.
- It ignores IDs, Ages, and Depts entirely.
- This is significantly faster for aggregations (OLAP).
Why Columnar Storage is better for OLAP
- I/O Efficiency: When performing an aggregation, you only need to read the columns that are relevant to the query. This reduces the amount of data that needs to be read from disk.
- Compression: Since all data in a column is of the same type, it can be compressed much more effectively than row-oriented storage.
- Cache Efficiency: Since related data is stored together, it is more likely to be found in the CPU cache, which reduces the time it takes to access the data.