Quick recap
- Data Engineer: skills and responsibilities
- Data Engineering lifecycle
Agenda
- Storage engines
- Log-based storage
- Hash indexes
Why study storage internals?
- To pick the right tool for your use case
- To tune a storage engine's configuration to your requirements
Storage engine overview
| Property | Transaction processing systems (OLTP) | Analytics processing (OLAP) |
|---|---|---|
| Primarily used for | End user/customer, via web application | Internal analyst, for decision support |
| What data represents | Latest state of data (current point in time) | History of events that happened over time |
| Dataset size | Gigabytes to terabytes | Terabytes to petabytes |
| Main write pattern | Random-access, low-latency writes from user input | Bulk import (ETL) or event stream |
| Main read pattern | Small number of records per query, fetched by key | Aggregate over large number of records |
| Implementation types | Log-structured databases, B-tree-based databases | Columnar storage |
Log-based storage
Refer to the code below.
A log is an append-only sequence of records. It doesn’t have to be human-readable; it might be binary and intended only for other programs to read.
```bash
#!/bin/bash

# db_set KEY VALUE: append a key,value record to the end of the log
db_set () {
    echo "$1,$2" >> database
}

# db_get KEY: scan the whole log and return the most recent value for KEY
db_get () {
    grep "^$1," database | sed -e "s/^$1,//" | tail -n 1
}
```
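For example, after sourcing these functions in a shell (the key and value here are arbitrary):

```bash
$ db_set 123456 '{"name":"London","attractions":["Big Ben"]}'
$ db_get 123456
{"name":"London","attractions":["Big Ben"]}
```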
| Pros | Cons |
|---|---|
| Fast writes: appending to a file is a cheap, sequential operation | Slow reads: each lookup is O(n), since it has to scan the entire file |
Log-based storage + hash indexes
Refer to the code below.
Key idea: along with appending to the log file, maintain an in-memory hash map that maps each key to the byte offset of its most recent record in the file; the map acts as a signpost to the actual data.
Image credits: Designing Data-Intensive Applications
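A minimal sketch of this idea, extending the earlier bash example (assumes bash 4+ for associative arrays; the index lives only in the current shell session):

```bash
#!/bin/bash
touch database
declare -A index   # in-memory hash map: key -> byte offset of its latest record

db_set () {
    index["$1"]=$(wc -c < database)   # current end of file = where this record starts
    echo "$1,$2" >> database          # append the record to the log
}

db_get () {
    local offset=${index["$1"]}
    [ -z "$offset" ] && return 1      # key not present in the index
    # seek straight to the record instead of scanning the whole file
    tail -c +"$((offset + 1))" database | head -n 1 | sed -e "s/^$1,//"
}
```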
Implementation details:
- How do we avoid running out of disk space? Via compaction: break the log into segments and periodically throw away duplicate keys in each segment, keeping only the most recent value for each key (see the compaction sketch after this list).
- File format: CSV is not the best format for a log. It’s faster and simpler to use a binary format that first encodes the length of a string in bytes, followed by the raw string (without need for escaping).
- How do we delete a record? If you want to delete a key and its associated value, you have to append a special deletion record to the data file (sometimes called a tombstone). When log segments are merged, the tombstone tells the merging process to discard any previous values for the deleted key (see the db_del sketch after this list).
- Crash recovery: the in-memory hash map is lost on restart. Two options:
  - Read all segments and regenerate the hash map on service restart (painful if we have large segment files; see the rebuild sketch after this list)
  - Periodically snapshot the hash map to disk as well, and read it back into memory on restart
- Concurrency: one writer thread appends to the log; multiple reader threads can read concurrently, since records are never modified once written
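A toy compaction pass over our single log file, keeping only the latest record per key (real engines such as Bitcask compact sealed, read-only segments in the background while new writes go to a fresh segment):

```bash
# compact: rewrite the log keeping only the latest record for each key.
# tac reverses the file so the newest record per key comes first, awk keeps
# only the first occurrence of each key, and the final tac restores order.
compact () {
    tac database | awk -F, '!seen[$1]++' | tac > database.compacted
    mv database.compacted database
    # NOTE: byte offsets change, so the in-memory index must be rebuilt
    # (a fuller version would also drop tombstoned keys at this point).
}
```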
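Deletion via a tombstone, continuing the same sketch; the sentinel value `__tombstone__` is an arbitrary choice for illustration:

```bash
TOMBSTONE="__tombstone__"   # sentinel value marking a deleted key

# db_del KEY: append a tombstone record for KEY
db_del () {
    db_set "$1" "$TOMBSTONE"
}

# db_get KEY: as in the original scan-based version, but a tombstone means "not found"
db_get () {
    local value
    value=$(grep "^$1," database | sed -e "s/^$1,//" | tail -n 1)
    [ "$value" = "$TOMBSTONE" ] && return 1
    echo "$value"
}
```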
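And a sketch of the first crash-recovery option, rebuilding the index with one full scan; this scan is exactly what becomes painful on large files, which is why systems like Bitcask also persist the index to disk:

```bash
# rebuild_index: scan the whole log once and repopulate the byte offsets
rebuild_index () {
    local LC_ALL=C                         # count bytes, not multibyte characters
    local offset=0 line
    while IFS= read -r line; do
        index["${line%%,*}"]=$offset       # key = everything before the first comma
        offset=$((offset + ${#line} + 1))  # +1 for the trailing newline
    done < database
}
```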
Tradeoffs
Pros:
- Appending and segment merging are sequential write operations, which are generally much faster than random writes, especially on magnetic spinning-disk hard drives.
- Concurrency and crash recovery are much simpler if segment files are append-only or immutable.
- Merging old segments avoids the problem of data files getting fragmented over time.
Cons:
- The hash table must fit in main memory, so if you have a very large number of keys, you're out of luck.
- Range queries are not efficient. For example, you cannot easily scan over all keys between kitty00000 and kitty99999; you'd have to look up each key individually in the hash maps.
References
- Chapter 1, Computer Systems: A Programmer's Perspective (CSAPP)
- Chapter 3, Designing Data-Intensive Applications
- Bitcask: A Log-Structured Hash Table for Fast Key/Value Data