• As a dev, you should have some idea of what dbs are doing under the hood in order to select the correct ones for your workloads and desired characteristics
    • Big diff between dbs optimized for transactional processing and analytics
  • Log structured storage engines vs page oriented storage engines (e.g. B-trees)
  • Simplest db - append to a text file in key, value format, retrieve last value matching key
    • This is a "log", an append-only data file
      • Many real dbs use these, albeit with additional handling:
        • Concurrency controls
        • Reclaiming disk space so log doesn’t grow forever
        • Handling errors + partial writes
    • Writing is pretty good - just append to a file
    • Reading is bad as the db grows - have to scan the entire file every time, O(n)
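    • A minimal Python sketch of this simplest db (the file name and comma-separated line format are just illustrative):

```python
# Simplest possible "database": set appends a line, get scans the whole file
# and keeps the last value seen for the key (cheap appends, O(n) reads).

DB_FILE = "database.txt"   # illustrative file name

def db_set(key, value):
    # Write = append one "key,value" line to the end of the log.
    with open(DB_FILE, "a") as f:
        f.write(f"{key},{value}\n")

def db_get(key):
    # Read = scan every record; the last match wins (most recent write).
    result = None
    try:
        with open(DB_FILE) as f:
            for line in f:
                k, _, v = line.rstrip("\n").partition(",")
                if k == key:
                    result = v
    except FileNotFoundError:
        pass
    return result
```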
  • Index - additional metadata that helps you find what you seek faster
    • An additional structure (extra data) that affects query time but not the underlying data itself
    • Each index incurs write overhead
  • Hash index (hashmap = O(1) lookup)
    • keep key -> byteOffset hashmap in memory. On read, seek to byteOffset from hashmap and read value
    • hashmap in memory, so it must fit in RAM (values live on disk)
    • this is what Bitcask does
    • suitable when there's a reasonable number of distinct keys but values are updated often (many writes per key, and all keys still fit in memory)
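    • A rough sketch of the Bitcask-style idea (a toy version, not Bitcask's actual on-disk format; names are made up):

```python
import os

DB_FILE = "database.log"   # illustrative

# In-memory hash index: key -> byte offset of the latest record for that key.
# Only keys + offsets must fit in RAM; the values themselves stay on disk.
index = {}

def db_set(key, value):
    with open(DB_FILE, "ab") as f:
        f.seek(0, os.SEEK_END)
        offset = f.tell()                       # where this record will start
        f.write(f"{key},{value}\n".encode())
        index[key] = offset

def db_get(key):
    offset = index.get(key)
    if offset is None:
        return None                             # key was never written
    with open(DB_FILE, "rb") as f:
        f.seek(offset)                          # jump straight to the record, no scan
        line = f.readline().decode().rstrip("\n")
        _, _, value = line.partition(",")
        return value
```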
  • How to avoid running out of disk space with append-only log
    • Break log into segments, where new segment starts once some size limit is reached
    • Compaction - get rid of duplicate keys and keep only most recent value. Merge process async in a background thread, writing to a new segment file. When done, switch read requests to point to new merged file and delete old segment files
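    • A sketch of merging + compacting segments, assuming the simple key,value line format from above:

```python
def compact_and_merge(segment_paths, output_path):
    """Merge segments (passed oldest -> newest), keeping only the newest value per key."""
    latest = {}
    for path in segment_paths:              # later files overwrite earlier entries
        with open(path) as f:
            for line in f:
                key, _, value = line.rstrip("\n").partition(",")
                latest[key] = value
    with open(output_path, "w") as out:     # write a fresh, deduplicated segment
        for key, value in latest.items():
            out.write(f"{key},{value}\n")
    # Caller then switches reads over to output_path and deletes the old segment files.
```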

  • Other details for hash index, log file key-value store:

    • Each segment has its own hash table as an index - queries check each segment’s index until they find the key they seek
    • Binary format is faster and simpler than CSV (encode the length of the string in bytes, followed by the raw, unescaped string)
    • Deletion requires tombstones that are detected on segment merge
    • Crash recovery sped up by writing copy of hashmaps to disk and reading those into memory on restart
    • Can use checksums to ignore partially written log data post-crash
    • Concurrency - append only log and otherwise immutable, so one writer thread. Multiple reader threads ok
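    • A sketch of a length-prefixed binary record with a checksum, covering the binary-format, tombstone, and partial-write points above (the layout is made up, not any real db's format):

```python
import struct
import zlib

TOMBSTONE = b"__tombstone__"   # made-up deletion marker; real engines pick their own

def encode_record(key: bytes, value: bytes) -> bytes:
    # Layout (made up): [crc32][key_len][value_len][key][value], lengths as u32 big-endian.
    body = struct.pack(">II", len(key), len(value)) + key + value
    return struct.pack(">I", zlib.crc32(body)) + body

def decode_record(buf: bytes):
    if len(buf) < 12:
        return None                # not even a full header - partial write
    crc, key_len, value_len = struct.unpack_from(">III", buf)
    body = buf[4:12 + key_len + value_len]
    if len(body) < 8 + key_len + value_len or zlib.crc32(body) != crc:
        return None                # partial or corrupt write - skip it after a crash
    key = buf[12:12 + key_len]
    value = buf[12 + key_len:12 + key_len + value_len]
    return key, value              # a value equal to TOMBSTONE means "deleted"
```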
  • Why append only and not updating the file in place?

    • Sequential writes (appending and segment merging) good for magnetic spinning disk drives and SSDs
    • Crash recovery is simpler - overwriting in place risks a crash mid-update leaving a value that's part old, part new; appended records that were only partially written can be detected and discarded
    • Merging old segments avoids files getting fragmented over time (data stays sequentially laid out on disk)
  • Limitations of hash table index:

    • Hash table must fit in memory (can't have too many keys)
    • Range queries, e.g. all keys between 1 and 1000, aren't efficient - each key has to be looked up individually in the hashmap
  • SSTables and LSM Tree Indexes:

    • Sorted string table - all items in the log (segment files) are sorted by key

    • Don’t need to have every key in index - can use the knowledge that it’s sorted to go to a nearby offset for a key that isn’t in the index and scan from there
      • Each key in index can point to the start of a compressed block - compression saves disk space and reduces I/O bandwidth use
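    • A sketch of the sparse-index lookup, assuming an in-memory list of (key, byte offset) pairs for every Nth key of the sorted segment (example data is made up):

```python
import bisect

# Sparse in-memory index over a sorted segment: one (key, byte offset) entry
# per block of the file, not one per key.
sparse_index = [("apple", 0), ("grape", 4096), ("melon", 8192)]

def candidate_offset(search_key):
    """Offset of the closest indexed key <= search_key, or None if it sorts before all of them."""
    keys = [k for k, _ in sparse_index]
    i = bisect.bisect_right(keys, search_key) - 1
    if i < 0:
        return None
    return sparse_index[i][1]

# From that offset, scan forward through the sorted segment (or decompress just
# that one block) until the key is found, or a larger key shows up - which
# proves the key isn't in this segment at all.
```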

    • How to sort data by key?
      • Could use a sorted structure on disk (e.g. B-trees)
      • Easier to maintain in memory tree data structure where you insert data in unsorted order and get it back out in sorted order (e.g. Red-Black tree or AVL tree)
      • In memory structure sometimes called memtable
    • Example process
      • When writes come in, add to current in memory memtable structure (red-black tree)
      • When size(memtable) > threshold, write it to disk as SSTable. New file becomes most recent segment. New writes go to new memtable.
      • For reads, look in current memtable, then in newest to oldest segment files
      • Periodically run background merge + compaction process on segment files to discard overwritten + deleted values
    • Crash recovery - memtable lost from memory, most recent values lost! So maintain an append-only log on disk that's unsorted. Used to restore memtable on restart. Can be discarded on segment write out.
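    • A toy sketch of this write/read path (a plain dict stands in for the sorted memtable, in-memory dicts stand in for segment files, and the flush threshold is arbitrary):

```python
class TinyLSM:
    """Toy LSM write/read path: an in-memory memtable plus a list of sorted segments."""

    MEMTABLE_LIMIT = 4          # arbitrary flush threshold for the sketch

    def __init__(self):
        self.memtable = {}      # a real memtable is a sorted tree/skip list, not a dict
        self.segments = []      # each flushed segment is a sorted mapping; newest last

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.MEMTABLE_LIMIT:
            # Flush: write keys out in sorted order as a new SSTable-like segment.
            self.segments.append(dict(sorted(self.memtable.items())))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:                    # 1. current memtable
            return self.memtable[key]
        for segment in reversed(self.segments):     # 2. newest -> oldest segment
            if key in segment:
                return segment[key]
        return None                                 # not found anywhere
```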
    • LevelDB and RocksDB use SSTables + memtables as above^. These are key-value stores designed to be embedded in other applications. Google’s Bigtable paper introduced the SSTable and memtable terminology, and also inspired Cassandra and HBase
    • An SSTable is one implementation of the on-disk component of an LSM-tree ('log-structured merge-tree')
      • LSM-trees: keep a cascade of SSTables that are merged in the background
      • LSM storage engines = storage engine based on merging + compacting sorted files
      • Lucene, indexing engine for full-text search used by Elasticsearch & Solr, uses similar methods
    • Optimizations:
      • Reading a key that doesn't exist has to check the memtable + every segment (slow!). Can use a bloom filter - a memory-efficient data structure for approximating a set - to tell if the key is definitely absent in the first place (see the sketch after this list)
      • Compaction methods (size-tiered vs leveled)
        • size-tiered: smaller newer SSTables merged into larger and older ones
        • leveled: key range split into smaller SSTables and older data moved to separate levels (more incremental, less disk space)
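      • A toy Bloom filter to make the "definitely absent" check concrete (bit count and hash scheme are arbitrary choices, not what any particular engine uses):

```python
import hashlib

class BloomFilter:
    """Approximate set: never says "absent" for a present key, may say "maybe" for an absent one."""

    def __init__(self, num_bits=1024, num_hashes=3):    # arbitrary sizes for the sketch
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # False -> definitely not stored, so skip the memtable/segment lookups entirely.
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))
```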
    • Benefits:
      • Data stored in sorted order so range queries very efficient
      • High write throughput due to sequential writes
  • B-tree Indexes - the most common of all

    • main similarity to SSTables: key value pairs are sorted by key, allowing efficient lookups and range queries
    • breaks db into fixed size blocks/pages, typically 4kB in size
    • read & write one page at a time
    • each page has an address - like a pointer on disk. Can therefore construct tree of pages

    • page with individual (sequential) keys is a “leaf page” - contains value itself or final ref to value

    • number of refs to other pages in a single page is the "branching factor". In practice it depends on how much space the page references and key boundaries take, but usually several hundred

    • updating a value involves finding the leaf page containing it and overwriting that page in place - references to the page stay valid

      • very different to LSM trees, which append only (or delete old) but don't modify in place
    • adding a new key involves finding the page whose range covers it and adding the key-value pair to that page

      • if the page doesn’t have enough space for the new key, split it into two half-full pages and update the parent page’s references

    • tree is always balanced, not too deep (log(#keys) deep), usually ~4 deep with branching factor 500 or so
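    • A sketch of the lookup walk down a tree of pages, with in-memory objects standing in for fixed-size disk pages (a real B-tree would read each page from disk by its address):

```python
import bisect

class Page:
    """Stand-in for a fixed-size on-disk page; children/values would be disk addresses."""

    def __init__(self, keys, children=None, values=None):
        self.keys = keys            # sorted keys (range boundaries for interior pages)
        self.children = children    # child page refs, len(keys) + 1 of them (interior only)
        self.values = values        # values aligned with keys (leaf pages only)

    @property
    def is_leaf(self):
        return self.children is None

def btree_lookup(root, key):
    page = root
    while not page.is_leaf:
        # Descend into the child whose key range contains `key`.
        page = page.children[bisect.bisect_right(page.keys, key)]
    i = bisect.bisect_left(page.keys, key)
    if i < len(page.keys) and page.keys[i] == key:
        return page.values[i]
    return None

# Tiny example: a root with one boundary key and two leaf pages.
leaf_a = Page(keys=["apple", "banana"], values=["v1", "v2"])
leaf_b = Page(keys=["cherry", "date"], values=["v3", "v4"])
root = Page(keys=["cherry"], children=[leaf_a, leaf_b])
assert btree_lookup(root, "date") == "v4"
assert btree_lookup(root, "fig") is None
```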

    • B-tree reliability:

      • updating and/or splitting pages is a dangerous operation - writing the two new child pages and overwriting the parent. If a crash occurs mid-way, it can corrupt the index.
      • to mitigate this risk, common to keep a write-ahead log (WAL) on disk: an append-only file to which every B-tree modification must be written before it's applied to the pages themselves. Can replay this to recover from a crash (see the sketch after this list)
      • in order to only see tree in consistent state, must control concurrency/multiple threads accessing pages - protect with latches (lightweight locks). More complex than log-structured approaches - LSM trees merge in the background async then atomically update segments without interfering with incoming queries
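      • A minimal sketch of the WAL discipline - append and fsync the intended change before touching any page (format and function names are made up):

```python
import json
import os

WAL_PATH = "btree.wal"   # illustrative

def apply_with_wal(operation, apply_to_pages):
    """Make the operation durable in the WAL first, only then modify the B-tree pages."""
    with open(WAL_PATH, "a") as wal:
        wal.write(json.dumps(operation) + "\n")
        wal.flush()
        os.fsync(wal.fileno())      # the intent is on disk before any page is touched
    apply_to_pages(operation)       # e.g. overwrite a page, or split it and update the parent

def recover(replay):
    """After a crash, re-apply logged operations to bring the tree back to a consistent state."""
    if not os.path.exists(WAL_PATH):
        return
    with open(WAL_PATH) as wal:
        for line in wal:
            # Real engines make replay idempotent / track which operations already
            # reached the pages; this sketch just replays everything.
            replay(json.loads(line))
```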
    • some optimizations:

      • copy-on-write approach: a modified page is written to a new location and the parent's reference is updated to point at it (useful for concurrency control as well)
      • abbreviated keys for higher branching factor
      • attempts to store pages sequentially on disk (this is easier in LSM-trees)
      • additional leaf page pointers like pointers to left/right sibling pages
      • variants like “fractal trees” reduce disk seeks
  • B-trees vs LSM-trees

  • Other indexing structures

  • Transaction Processing or Analytics?

  • Column-Oriented Storage

  • Aggregation: Data Cubes and Materialized Views