OLTP vs OLAP

Property	Transaction processing systems (OLTP)	Analytic systems (OLAP)
Main read pattern	Small number of records per query, fetched by key	Aggregate over large number of records
Main write pattern	Random-access, low-latency writes from user input	Bulk import (ETL) or event stream
Primarily used by	End user/customer, via web application	Internal analyst, for decision support
What data represents	Latest state of data (current point in time)	History of events that happened over time
Dataset size	Gigabytes to terabytes	Terabytes to petabytes

Data Warehouse

A data warehouse, is a separate database that analysts can query without affecting OLTP operations. The data warehouse contains a read-only copy of the data in all the various OLTP systems in the company.

Columnar Storage

Row vs. Column

Row-Oriented: Stores all data for a single record (e.g., Name, Age, Salary) together. This is ideal for OLTP (Online Transaction Processing) where you frequently read or write specific individual records (e.g., adding a new user).

Column-Oriented: Stores all data for a single attribute (e.g., all "Salaries") together. This is ideal for OLAP (Online Analytical Processing) where you want to perform calculations on a specific metric across millions of records (e.g., "What is the average salary?").

Let's consider a simple Employees table with the following schema: id (int), age (int), salary (int), dept (string)

id	age	salary	dept
0	25	50000	HR
1	30	60000	Tech
2	45	90000	Sales

1. Row-Oriented Storage

In a row store, data is written row by row. The storage on disk looks like a long continuous stream of records:

0, 25, 50000, HR; 1, 30, 60000, Tech; 2, 45, 90000, Sales

If you want to calculate the Average Salary:
- The generic "Row" engine must read the entire row (including id, age, and dept) just to get to the salary.
- This wastes I/O bandwidth reading data you don't need.

2. Column-Oriented Storage

In a column store, data is written column by column.

0, 1, 2 (IDs) 25, 30, 45 (Ages) 50000, 60000, 90000 (Salaries) HR, Tech, Sales (Depts)

If you want to calculate the Average Salary: - The engine only reads the Salaries block: 50000, 60000, 90000.
- It ignores IDs, Ages, and Depts entirely.
- This is significantly faster for aggregations (OLAP).

Why Columnar Storage is better for OLAP

I/O Efficiency: When performing an aggregation, you only need to read the columns that are relevant to the query. This reduces the amount of data that needs to be read from disk.
Compression: Since all data in a column is of the same type, it can be compressed much more effectively than row-oriented storage.
Cache Efficiency: Since related data is stored together, it is more likely to be found in the CPU cache, which reduces the time it takes to access the data.