• As a dev, you should have some idea of what dbs are doing under the hood in order to select the correct ones for your workloads and desired characteristics
    • Big diff between dbs optimized for transactional processing and analytics
  • Log structured storage engines vs page oriented storage engines (e.g. B-trees)
  • Simplest db - append to a text file in key, value format, retrieve last value matching key
    • This is a "log", an append-only data file
      • Many real dbs use these, albeit with additional handling:
        • Concurrency controls
        • Reclaiming disk space so log doesn’t grow forever
        • Handling errors + partial writes
    • Writing is pretty good - just append to a file
    • Reading is bad as the db grows - have to scan the entire file every time, O(n)
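    • A minimal Python sketch of this simplest db (the file name and comma-separated line format are just illustrative):

```python
# Simplest possible "database": set appends a line, get scans the whole file
# and keeps the last value seen for the key (cheap appends, O(n) reads).

DB_FILE = "database.txt"   # illustrative file name

def db_set(key, value):
    # Write = append one "key,value" line to the end of the log.
    with open(DB_FILE, "a") as f:
        f.write(f"{key},{value}\n")

def db_get(key):
    # Read = scan every record; the last match wins (most recent write).
    result = None
    try:
        with open(DB_FILE) as f:
            for line in f:
                k, _, v = line.rstrip("\n").partition(",")
                if k == key:
                    result = v
    except FileNotFoundError:
        pass
    return result
```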
  • Index - additional metadata that helps you find what you seek faster
    • An additional structure (extra data) that affects query time but not the underlying data itself
    • Each index incurs write overhead
  • Hash index (hashmap = O(1) lookup)
    • keep key -> byteOffset hashmap in memory. On read, seek to byteOffset from hashmap and read value
    • hashmap in memory, so it must fit in RAM (values live on disk)
    • this is what Bitcask does
    • suitable when there's a reasonable number of distinct keys but values are updated often (many writes per key, and all keys still fit in memory)
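    • A rough sketch of the Bitcask-style idea (a toy version, not Bitcask's actual on-disk format; names are made up):

```python
import os

DB_FILE = "database.log"   # illustrative

# In-memory hash index: key -> byte offset of the latest record for that key.
# Only keys + offsets must fit in RAM; the values themselves stay on disk.
index = {}

def db_set(key, value):
    with open(DB_FILE, "ab") as f:
        f.seek(0, os.SEEK_END)
        offset = f.tell()                       # where this record will start
        f.write(f"{key},{value}\n".encode())
        index[key] = offset

def db_get(key):
    offset = index.get(key)
    if offset is None:
        return None                             # key was never written
    with open(DB_FILE, "rb") as f:
        f.seek(offset)                          # jump straight to the record, no scan
        line = f.readline().decode().rstrip("\n")
        _, _, value = line.partition(",")
        return value
```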
  • How to avoid running out of disk space with append-only log
    • Break log into segments, where new segment starts once some size limit is reached
    • Compaction - get rid of duplicate keys and keep only most recent value. Merge process async in a background thread, writing to a new segment file. When done, switch read requests to point to new merged file and delete old segment files
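    • A sketch of merging + compacting segments, assuming the simple key,value line format from above:

```python
def compact_and_merge(segment_paths, output_path):
    """Merge segments (passed oldest -> newest), keeping only the newest value per key."""
    latest = {}
    for path in segment_paths:              # later files overwrite earlier entries
        with open(path) as f:
            for line in f:
                key, _, value = line.rstrip("\n").partition(",")
                latest[key] = value
    with open(output_path, "w") as out:     # write a fresh, deduplicated segment
        for key, value in latest.items():
            out.write(f"{key},{value}\n")
    # Caller then switches reads over to output_path and deletes the old segment files.
```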

  • Other details for hash index, log file key-value store:

    • Each segment has its own hash table as an index - queries check each segment’s index until they find the key they seek
    • Binary format is faster and simpler than CSV (encode the length of the string in bytes, followed by the raw, unescaped string)
    • Deletion requires tombstones that are detected on segment merge
    • Crash recovery sped up by writing copy of hashmaps to disk and reading those into memory on restart
    • Can use checksums to ignore partially written log data post-crash
    • Concurrency - append only log and otherwise immutable, so one writer thread. Multiple reader threads ok
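    • A sketch of a length-prefixed binary record with a checksum, covering the binary-format, tombstone, and partial-write points above (the layout is made up, not any real db's format):

```python
import struct
import zlib

TOMBSTONE = b"__tombstone__"   # made-up deletion marker; real engines pick their own

def encode_record(key: bytes, value: bytes) -> bytes:
    # Layout (made up): [crc32][key_len][value_len][key][value], lengths as u32 big-endian.
    body = struct.pack(">II", len(key), len(value)) + key + value
    return struct.pack(">I", zlib.crc32(body)) + body

def decode_record(buf: bytes):
    if len(buf) < 12:
        return None                # not even a full header - partial write
    crc, key_len, value_len = struct.unpack_from(">III", buf)
    body = buf[4:12 + key_len + value_len]
    if len(body) < 8 + key_len + value_len or zlib.crc32(body) != crc:
        return None                # partial or corrupt write - skip it after a crash
    key = buf[12:12 + key_len]
    value = buf[12 + key_len:12 + key_len + value_len]
    return key, value              # a value equal to TOMBSTONE means "deleted"
```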
  • Why append only and not updating the file in place?

    • Sequential writes (appending and segment merging) good for magnetic spinning disk drives and SSDs
    • Crash recovery is simpler - overwriting in place risks a crash mid-update leaving a value that's part old, part new; appended records that were only partially written can be detected and discarded
    • Merging old segments avoids files getting fragmented over time (data stays sequentially laid out on disk)
  • Limitations of hash table index:

    • Hash table must fit in memory (can't have too many keys)
    • Range queries, e.g. all keys between 1 and 1000, aren't efficient - each key has to be looked up individually in the hashmap
  • SSTables and LSM Tree Indexes:

    • Sorted string table - all items in the log (segment files) are sorted by key

    • Don’t need to have every key in index - can use the knowledge that it’s sorted to go to a nearby offset for a key that isn’t in the index and scan from there
      • Each key in index can point to the start of a compressed block - compression saves disk space and reduces I/O bandwidth use
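    • A sketch of the sparse-index lookup, assuming an in-memory list of (key, byte offset) pairs for every Nth key of the sorted segment (example data is made up):

```python
import bisect

# Sparse in-memory index over a sorted segment: one (key, byte offset) entry
# per block of the file, not one per key.
sparse_index = [("apple", 0), ("grape", 4096), ("melon", 8192)]

def candidate_offset(search_key):
    """Offset of the closest indexed key <= search_key, or None if it sorts before all of them."""
    keys = [k for k, _ in sparse_index]
    i = bisect.bisect_right(keys, search_key) - 1
    if i < 0:
        return None
    return sparse_index[i][1]

# From that offset, scan forward through the sorted segment (or decompress just
# that one block) until the key is found, or a larger key shows up - which
# proves the key isn't in this segment at all.
```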

    • How to sort data by key?
      • Could use a sorted structure on disk (e.g. B-trees)
      • Easier to maintain in memory tree data structure where you insert data in unsorted order and get it back out in sorted order (e.g. Red-Black tree or AVL tree)
      • In memory structure sometimes called memtable
    • Example process
      • When writes come in, add to current in memory memtable structure (red-black tree)
      • When size(memtable) > threshold, write it to disk as SSTable. New file becomes most recent segment. New writes go to new memtable.
      • For reads, look in current memtable, then in newest to oldest segment files
      • Periodically run background merge + compaction process on segment files to discard overwritten + deleted values
    • Crash recovery - memtable lost from memory, most recent values lost! So maintain an append-only log on disk that's unsorted. Used to restore memtable on restart. Can be discarded on segment write out.
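    • A toy sketch of this write/read path (a plain dict stands in for the sorted memtable, in-memory dicts stand in for segment files, and the flush threshold is arbitrary):

```python
class TinyLSM:
    """Toy LSM write/read path: an in-memory memtable plus a list of sorted segments."""

    MEMTABLE_LIMIT = 4          # arbitrary flush threshold for the sketch

    def __init__(self):
        self.memtable = {}      # a real memtable is a sorted tree/skip list, not a dict
        self.segments = []      # each flushed segment is a sorted mapping; newest last

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.MEMTABLE_LIMIT:
            # Flush: write keys out in sorted order as a new SSTable-like segment.
            self.segments.append(dict(sorted(self.memtable.items())))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:                    # 1. current memtable
            return self.memtable[key]
        for segment in reversed(self.segments):     # 2. newest -> oldest segment
            if key in segment:
                return segment[key]
        return None                                 # not found anywhere
```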
    • LevelDB and RocksDB use SSTables + memtables as above^. These are key-value stores designed to be embedded in other applications. Google’s Bigtable paper introduced the SSTable and memtable terminology, and also inspired Cassandra and HBase
    • An SSTable is one implementation of the on-disk component of an LSM-tree ('log-structured merge-tree')
      • LSM-trees: keep a cascade of SSTables that are merged in the background
      • LSM storage engines = storage engine based on merging + compacting sorted files
      • Lucene, indexing engine for full-text search used by Elasticsearch & Solr, uses similar methods
    • Optimizations:
      • Reading a key that doesn't exist has to check the memtable + every segment (slow!). Can use a bloom filter - a memory-efficient data structure for approximating a set - to tell if the key is definitely absent in the first place (see the sketch after this list)
      • Compaction methods (size-tiered vs leveled)
        • size-tiered: smaller newer SSTables merged into larger and older ones
        • leveled: key range split into smaller SSTables and older data moved to separate levels (more incremental, less disk space)
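      • A toy Bloom filter to make the "definitely absent" check concrete (bit count and hash scheme are arbitrary choices, not what any particular engine uses):

```python
import hashlib

class BloomFilter:
    """Approximate set: never says "absent" for a present key, may say "maybe" for an absent one."""

    def __init__(self, num_bits=1024, num_hashes=3):    # arbitrary sizes for the sketch
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # False -> definitely not stored, so skip the memtable/segment lookups entirely.
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))
```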
    • Benefits:
      • Data stored in sorted order so range queries very efficient
      • High write throughput due to sequential writes
  • B-tree Indexes - the most common of all

    • main similarity to SSTables: key value pairs are sorted by key, allowing efficient lookups and range queries
    • breaks db into fixed size blocks/pages, typically 4kB in size
    • read & write one page at a time
    • each page has an address - like a pointer on disk. Can therefore construct tree of pages

    • page with individual (sequential) keys is a “leaf page” - contains value itself or final ref to value

    • number of refs to other pages in a single page is the "branching factor". In practice it depends on how much space the page references and key boundaries take, but usually several hundred

    • updating a value involves finding the leaf page containing it and overwriting that page in place - references to the page stay valid

      • very different to LSM trees, which append only (or delete old) but don't modify in place
    • adding a new key involves finding the page whose range covers it and adding the key-value pair to that page

      • if the page doesn’t have enough space for the new key, split it into two half-full pages and update the parent page’s references

    • tree is always balanced, not too deep (log(#keys) deep), usually ~4 deep with branching factor 500 or so
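    • A sketch of the lookup walk down a tree of pages, with in-memory objects standing in for fixed-size disk pages (a real B-tree would read each page from disk by its address):

```python
import bisect

class Page:
    """Stand-in for a fixed-size on-disk page; children/values would be disk addresses."""

    def __init__(self, keys, children=None, values=None):
        self.keys = keys            # sorted keys (range boundaries for interior pages)
        self.children = children    # child page refs, len(keys) + 1 of them (interior only)
        self.values = values        # values aligned with keys (leaf pages only)

    @property
    def is_leaf(self):
        return self.children is None

def btree_lookup(root, key):
    page = root
    while not page.is_leaf:
        # Descend into the child whose key range contains `key`.
        page = page.children[bisect.bisect_right(page.keys, key)]
    i = bisect.bisect_left(page.keys, key)
    if i < len(page.keys) and page.keys[i] == key:
        return page.values[i]
    return None

# Tiny example: a root with one boundary key and two leaf pages.
leaf_a = Page(keys=["apple", "banana"], values=["v1", "v2"])
leaf_b = Page(keys=["cherry", "date"], values=["v3", "v4"])
root = Page(keys=["cherry"], children=[leaf_a, leaf_b])
assert btree_lookup(root, "date") == "v4"
assert btree_lookup(root, "fig") is None
```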

    • B-tree reliability:

      • updating and/or splitting pages is a dangerous operation - writing the two new child pages and overwriting the parent. If a crash occurs mid-way, it can corrupt the index.
      • to mitigate this risk, common to keep a write-ahead log (WAL) on disk: an append-only file to which every B-tree modification must be written before it's applied to the pages themselves. Can replay this to recover from a crash (see the sketch after this list)
      • in order to only see tree in consistent state, must control concurrency/multiple threads accessing pages - protect with latches (lightweight locks). More complex than log-structured approaches - LSM trees merge in the background async then atomically update segments without interfering with incoming queries
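      • A minimal sketch of the WAL discipline - append and fsync the intended change before touching any page (format and function names are made up):

```python
import json
import os

WAL_PATH = "btree.wal"   # illustrative

def apply_with_wal(operation, apply_to_pages):
    """Make the operation durable in the WAL first, only then modify the B-tree pages."""
    with open(WAL_PATH, "a") as wal:
        wal.write(json.dumps(operation) + "\n")
        wal.flush()
        os.fsync(wal.fileno())      # the intent is on disk before any page is touched
    apply_to_pages(operation)       # e.g. overwrite a page, or split it and update the parent

def recover(replay):
    """After a crash, re-apply logged operations to bring the tree back to a consistent state."""
    if not os.path.exists(WAL_PATH):
        return
    with open(WAL_PATH) as wal:
        for line in wal:
            # Real engines make replay idempotent / track which operations already
            # reached the pages; this sketch just replays everything.
            replay(json.loads(line))
```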
    • some optimizations:

      • copy-on-write approach: a modified page is written to a new location and the parent's reference is updated to point at it (useful for concurrency control as well)
      • abbreviated keys for higher branching factor
      • attempts to store pages sequentially on disk (this is easier in LSM-trees)
      • additional leaf page pointers like pointers to left/right sibling pages
      • variants like “fractal trees” reduce disk seeks
  • B-trees vs LSM-trees

  • Other indexing structures

  • Transaction Processing or Analytics?

  • Column-Oriented Storage

  • Aggregation: Data Cubes and Materialized Views