- Other details for hash index, log file key-value store (see the sketch after this list):
- Each segment has its own hash table as an index - queries check each segment’s index until they find the key they seek
- Binary format faster and simpler than CSV (encode the length of the string in bytes, followed by the raw (unescaped) string)
- Deletion requires tombstones that are detected on segment merge
- Crash recovery sped up by writing copy of hashmaps to disk and reading those into memory on restart
- Can use checksums to ignore partially written log data post-crash
- Concurrency - the log is append-only and otherwise immutable, so a single writer thread suffices; multiple reader threads are fine
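A minimal sketch of the idea in Python, assuming a single segment with length-prefixed binary records and a plain dict as the hash index; the class and method names (HashIndexedLog, put, get) are made up for illustration, and segment merging, tombstones, and checksums are left out:

```python
import os
import struct

class HashIndexedLog:
    """Sketch of one segment: an append-only log file plus an in-memory
    hash index mapping key -> byte offset of the latest record."""

    def __init__(self, path):
        self.index = {}                  # key (bytes) -> offset of newest record
        self.f = open(path, "a+b")       # append-only writes; reads can still seek

    def put(self, key: bytes, value: bytes):
        self.f.seek(0, os.SEEK_END)
        offset = self.f.tell()
        # Binary format: 4-byte key length, 4-byte value length, raw bytes (no escaping)
        self.f.write(struct.pack(">II", len(key), len(value)) + key + value)
        self.f.flush()
        self.index[key] = offset         # newest write wins

    def get(self, key: bytes):
        offset = self.index.get(key)
        if offset is None:
            return None                  # a real store would check older segments next
        self.f.seek(offset)
        klen, vlen = struct.unpack(">II", self.f.read(8))
        return self.f.read(klen + vlen)[klen:]
```

A lookup is one dict probe plus one disk seek and read, which is why this design works well as long as all keys fit in memory.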
- Why append only and not updating the file in place?
- Sequential writes (appending and segment merging) are much faster than random writes on magnetic spinning disks, and preferable on SSDs too
- Partial writes are easier to handle - an in-place update interrupted by a crash could leave a value that is part old data, part new
- Merging segments avoids file fragmentation (data stays in large sequential files on disk)
- Limitations of hash table index:
- Hash table must fit in memory (can't have too many keys)
- Range queries, e.g. all keys between 1 and 1000, aren't efficient - each key in the range needs its own lookup to find its byte offset
- SSTables and LSM Tree Indexes:
- Sorted string table (SSTable) - all items in the log (segment files) are sorted by key

- Don’t need to have every key in index - can use the knowledge that it’s sorted to go to a nearby offset for a key that isn’t in the index and scan from there
- Each key in index can point to the start of a compressed block - compression saves disk space and reduces I/O bandwidth use
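For example, a sketch of the sparse-index lookup under the assumption that the index is kept as two parallel sorted Python lists (the names here are hypothetical):

```python
import bisect

# Sparse index: one entry per block, not per key. index_keys[i] is the first
# key in the block that starts at byte offset index_offsets[i].
def find_block_offset(index_keys, index_offsets, key):
    i = bisect.bisect_right(index_keys, key) - 1   # last indexed key <= key
    if i < 0:
        return None                                # key sorts before the whole table
    return index_offsets[i]

# The caller then reads (and decompresses, if blocks are compressed) the block
# at that offset and scans forward until it finds `key` or passes where it would be.
```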

- How to sort data by key?
- Could use a sorted structure on disk (e.g. B-trees)
- Easier to maintain in memory tree data structure where you insert data in unsorted order and get it back out in sorted order (e.g. Red-Black tree or AVL tree)
- In memory structure sometimes called memtable
- Example process
- When writes come in, add to current in memory memtable structure (red-black tree)
- When size(memtable) > threshold, write it to disk as SSTable. New file becomes most recent segment. New writes go to new memtable.
- For reads, look in current memtable, then in newest to oldest segment files
- Periodically run background merge + compaction process on segment files to discard overwritten + deleted values
- Crash recovery - memtable lost from memory, most recent values lost! So maintain an append-only log on disk that's unsorted. Used to restore memtable on restart. Can be discarded on segment write out.
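A toy sketch of that write/read path; the names (TinyLSM), file format (JSON lines), and tiny flush threshold are all made up for illustration, and a real memtable would be a balanced tree rather than a dict sorted at flush time:

```python
import json, os

FLUSH_THRESHOLD = 4            # tiny, purely for illustration

class TinyLSM:
    """Toy memtable + SSTable write path. A real engine keeps the memtable in a
    red-black/AVL tree and binary-searches sorted segments; this sketch sorts a
    dict at flush time and scans segments linearly to stay short."""

    def __init__(self, dirpath):
        os.makedirs(dirpath, exist_ok=True)
        self.dir = dirpath
        self.memtable = {}
        self.segments = []                          # segment paths, oldest -> newest
        self.wal = open(os.path.join(dirpath, "wal.log"), "a")

    def put(self, key, value):
        self.wal.write(json.dumps([key, value]) + "\n")   # unsorted crash-recovery log
        self.wal.flush()
        self.memtable[key] = value
        if len(self.memtable) >= FLUSH_THRESHOLD:
            self._flush_memtable()

    def _flush_memtable(self):
        path = os.path.join(self.dir, f"segment-{len(self.segments):06d}.sst")
        with open(path, "w") as f:
            for k in sorted(self.memtable):         # SSTable: written in key order
                f.write(json.dumps([k, self.memtable[k]]) + "\n")
        self.segments.append(path)
        self.memtable = {}
        self.wal.truncate(0)                        # log no longer needed for these keys

    def get(self, key):
        if key in self.memtable:                    # 1) current memtable
            return self.memtable[key]
        for path in reversed(self.segments):        # 2) newest -> oldest segment
            with open(path) as f:
                for line in f:
                    k, v = json.loads(line)
                    if k == key:
                        return v
        return None
```

On restart, replaying wal.log into a fresh memtable would restore whatever hadn't been flushed yet; background merging + compaction of the segment files is also omitted here.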
- LevelDB and RocksDB use SSTables + memtables as above. These are key-value stores designed to be embedded in other applications. Google’s Bigtable paper introduced the SSTable and memtable terminology, and also inspired Cassandra and HBase
- An SSTable is one implementation of the disk component of an LSM-tree ('log-structured merge-tree')
- LSM-trees: keep a cascade of SSTables that are merged in the background
- LSM storage engines = storage engine based on merging + compacting sorted files
- Lucene, indexing engine for full-text search used by Elasticsearch & Solr, uses similar methods
- Optimizations:
- Reading a key that doesn't exist has to check the memtable + every segment (slow!). Can use a Bloom filter - a memory-efficient data structure for approximating the contents of a set - to tell whether the key is absent in the first place (see the Bloom filter sketch at the end of this section)
- Compaction methods (size-tiered vs leveled)
- size-tiered: smaller newer SSTables merged into larger and older ones
- leveled: key range split into smaller SSTables and older data moved to separate levels (more incremental, less disk space)
- Benefits:
- Data stored in sorted order so range queries very efficient
- High write throughput due to sequential writes
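Sketch of the Bloom filter idea mentioned in the optimizations above; the bit-array size and hashing scheme here are arbitrary choices for illustration, not what LevelDB/RocksDB actually use:

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter sketch: answers 'definitely absent' or 'maybe present'."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, key: bytes):
        # Derive num_hashes bit positions from salted SHA-256 digests of the key
        for i in range(self.num_hashes):
            h = hashlib.sha256(bytes([i]) + key).digest()
            yield int.from_bytes(h[:8], "big") % self.num_bits

    def add(self, key: bytes):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: bytes) -> bool:
        # False means the key was never added; True means it *might* have been
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))
```

A False from might_contain() is definitive (the key was never added), so the read path can skip the memtable and every segment; a True only means "maybe present".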
- B-tree Indexes - the most common of all
- main similarity to SSTables: key value pairs are sorted by key, allowing efficient lookups and range queries
- breaks db into fixed size blocks/pages, typically 4kB in size
- read & write one page at a time
- each page has an address - like a pointer on disk. Can therefore construct tree of pages

- page with individual (sequential) keys is a “leaf page” - contains value itself or final ref to value
- number of refs to other pages in a single page is the "branching factor" (6 in the book's example figure). In practice it depends on size constraints, but usually several hundred
- updating a value involves updating the page containing it and overwriting that page in place; references to the page stay valid
- very different to LSM trees, which append only (or delete old) but don't modify in place
- adding a new key involves adding the new key-val pair to the page whose range covers it
- if the page doesn’t have enough space for the new key, split the page into two and update the parent's references

- tree is always balanced, not too deep (log(#keys) deep), usually ~4 deep with a branching factor of ~500
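Sketch of a lookup that follows page references from the root down to a leaf, using a made-up in-memory model of pages; a real B-tree reads fixed-size pages from disk by page number instead of holding Python objects:

```python
import bisect

class LeafPage:
    def __init__(self, items):            # items: sorted list of (key, value)
        self.items = items

class InteriorPage:
    def __init__(self, keys, children):   # len(children) == len(keys) + 1
        self.keys = keys
        self.children = children          # in a real engine: on-disk page numbers

def btree_lookup(page, key):
    """Follow page references from the root down to a leaf."""
    while isinstance(page, InteriorPage):
        i = bisect.bisect_right(page.keys, key)   # which child covers `key`
        page = page.children[i]
    for k, v in page.items:                       # leaf: scan (or binary search)
        if k == key:
            return v
    return None

# Example tree with branching factor 3
root = InteriorPage(
    keys=[100, 200],
    children=[
        LeafPage([(5, "a"), (42, "b")]),
        LeafPage([(100, "c"), (150, "d")]),
        LeafPage([(200, "e"), (250, "f")]),
    ],
)
assert btree_lookup(root, 150) == "d"
```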
- B-tree reliability:
- updating and/or splitting pages is a dangerous operation - a split means writing the two new child pages and overwriting the parent. If a crash occurs partway through, the index can be corrupted (e.g. an orphan page with no parent)
- to mitigate this risk, common to keep a write-ahead log (WAL) on disk: an append-only file to which every B-tree modification must be written before being applied to the page(s) themselves. Can replay it to recover from a crash (sketch below)
- in order to only ever see the tree in a consistent state, concurrent access to pages must be controlled - protect them with latches (lightweight locks). More complex than log-structured approaches, where merging happens asynchronously in the background and segments are swapped in atomically without interfering with incoming queries
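A sketch of the WAL idea with hypothetical helpers (not any real engine's log format): append and fsync the intended page image before overwriting the page in place, so recovery can replay the log.

```python
import json, os

PAGE_SIZE = 4096

def write_page(db_file, wal_file, page_no, page_bytes):
    """Append the intended page image to the WAL (and fsync it) *before*
    overwriting the page in place; db_file is assumed open in 'r+b' mode."""
    wal_file.write(json.dumps({"page": page_no, "data": page_bytes.hex()}) + "\n")
    wal_file.flush()
    os.fsync(wal_file.fileno())          # the log record is durable first
    db_file.seek(page_no * PAGE_SIZE)
    db_file.write(page_bytes)            # now the in-place overwrite can be attempted

def recover(db_file, wal_path):
    """After a crash, replay every logged page image; replaying is idempotent."""
    with open(wal_path) as wal:
        for line in wal:
            rec = json.loads(line)
            db_file.seek(rec["page"] * PAGE_SIZE)
            db_file.write(bytes.fromhex(rec["data"]))
    db_file.flush()
```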
- some optimizations:
- copy-on-write approach: instead of overwriting a page, the modified page is written to a new location and the parent's reference is updated to point at it (also useful for concurrency control)
- abbreviated keys for higher branching factor
- attempts to store pages sequentially on disk (this is easier in LSM-trees)
- additional leaf page pointers like pointers to left/right sibling pages
- variants like “fractal trees” reduce disk seeks
- B-trees vs LSM-trees
- Other indexing structures
- Transaction Processing or Analytics?
- Column-Oriented Storage
- Aggregation: Data Cubes and Materialized Views