Goal: spread data and query load evenly across nodes (avoid skew and hot spots)
If you assigned records to nodes randomly, data would be evenly distributed, but you'd have to query all nodes in parallel for every read, since there'd be no deterministic way to know which partition holds a given key
Partitioning by Key Range
Partitioning by Hash of Key
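A minimal sketch contrasting the two schemes; the boundary values, partition count, and use of md5 here are illustrative assumptions, not any particular database's implementation:

```python
import bisect
import hashlib

# --- Key-range partitioning: each partition owns a contiguous range of keys.
# Boundaries are illustrative; real systems choose them to balance data volume.
RANGE_BOUNDARIES = ["g", "n", "t"]  # partitions: [..g), [g..n), [n..t), [t..]

def range_partition(key: str) -> int:
    """Return the index of the partition whose key range contains `key`."""
    return bisect.bisect_right(RANGE_BOUNDARIES, key)

# --- Hash partitioning: a hash of the key picks the partition,
# destroying key ordering but spreading skewed keys evenly.
NUM_PARTITIONS = 4

def hash_partition(key: str) -> int:
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

for key in ["apple", "banana", "melon", "zebra"]:
    print(key, "-> range:", range_partition(key), "hash:", hash_partition(key))
```

Key-range keeps adjacent keys together (efficient range scans, but skewed access patterns create hot spots); hashing spreads load evenly but loses ordering, so range queries must hit all partitions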
Secondary indexes: used not for finding a particular record by key (key → value), but for finding occurrences of a value (e.g. all actions by user 123)
They're super important for relational & document DBs, but don't map cleanly to partitioning
Partitioning Secondary Indexes by Document
Partitioning Secondary Indexes by Term
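A rough sketch of the document-partitioned (local) approach, with the term-partitioned (global) alternative noted at the end; the `Partition` class and all names are hypothetical:

```python
from collections import defaultdict

class Partition:
    def __init__(self):
        self.docs = {}                       # primary key -> document
        self.local_index = defaultdict(set)  # (field, value) -> doc keys

    def put(self, key, doc):
        self.docs[key] = doc
        for field, value in doc.items():
            self.local_index[(field, value)].add(key)

partitions = [Partition() for _ in range(4)]

def put(key, doc):
    # Documents are placed by hash of primary key; with a *document-partitioned*
    # (local) index, each partition indexes only its own documents.
    partitions[hash(key) % 4].put(key, doc)

def query_by_document_index(field, value):
    # Local indexes force scatter/gather: ask every partition, merge results.
    results = []
    for p in partitions:
        results.extend(p.local_index.get((field, value), set()))
    return results

# A *term-partitioned* (global) index instead assigns each indexed term to one
# partition, so a read hits a single partition -- but a write may have to
# update index entries on other partitions (often asynchronously).
def index_partition_for_term(field, value):
    return hash((field, value)) % 4

put("u1", {"color": "red"})
put("u2", {"color": "red"})
print(query_by_document_index("color", "red"))  # gathered from all partitions
```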
Do not do hash mod N - if the number of nodes (N) changes, most keys' hash-mod-N assignments change, so nearly the whole dataset has to move between nodes! Makes rebalancing unnecessarily expensive (demonstrated below)
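A quick demonstration of why, assuming a uniform md5 hash:

```python
import hashlib

def node_for(key: str, n: int) -> int:
    # Naive hash mod N placement: the assignment depends directly on N.
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % n

keys = [f"user:{i}" for i in range(10_000)]
moved = sum(1 for k in keys if node_for(k, 10) != node_for(k, 11))
print(f"{moved / len(keys):.0%} of keys move when going from 10 to 11 nodes")
# Prints roughly 91% -- almost the entire dataset has to be reshuffled
# just to add one node.
```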
Fixed # of partitions - create many more partitions than nodes and assign several partitions to each node; as nodes are added, they take over whole partitions from existing nodes (see the sketch after this list)
Dynamic partitioning - split a partition when it grows past a configured size, merge neighbors when it shrinks (e.g. HBase)
Partitioning proportional to nodes - keep a fixed # of partitions per node, so partition size grows with the dataset until new nodes are added (e.g. Cassandra)
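A sketch of the fixed-partitions approach, assuming a naive round-robin assignment (real systems move as few partitions as possible when membership changes):

```python
import hashlib

NUM_PARTITIONS = 1024  # fixed for the life of the cluster; chosen generously

def partition_for(key: str) -> int:
    # The key -> partition mapping never changes, even as nodes come and go.
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % NUM_PARTITIONS

def assign_partitions(num_nodes: int) -> dict[int, int]:
    # Illustrative round-robin partition -> node assignment.
    return {p: p % num_nodes for p in range(NUM_PARTITIONS)}

assignment = assign_partitions(3)
key = "user:123"
p = partition_for(key)
print(f"{key} lives in partition {p} on node {assignment[p]}")

# Adding a node changes only the partition -> node assignment, never the
# key -> partition mapping, so rebalancing moves whole partitions.
assignment = assign_partitions(4)
print(f"after adding a node, partition {p} is on node {assignment[p]}")
```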
General problem is "service discovery" - in this case: if I want to read "key:foo", which node/partition do I go to?
Approaches: (1) clients contact any node, which forwards the request if needed; (2) a dedicated routing tier; (3) partition-aware clients that connect to the right node directly
Key issue: how does the component knowledgeable of node → partition assignments (the nodes themselves, the routing tier, or the clients) learn about changes?
ZooKeeper is one system that can keep an authoritative record of the partitions → nodes mapping
Can also use a "gossip protocol" to disseminate changes in cluster state among the nodes themselves (e.g. Cassandra)
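A minimal sketch of a routing tier consulting an in-process assignment map; in a real deployment the map would live in a coordination service like ZooKeeper, and `on_assignment_change` would be triggered by its change notifications (both names here are hypothetical):

```python
import hashlib

NUM_PARTITIONS = 16

# Stand-in for the authoritative partition -> node mapping a coordination
# service would hold; routers keep a subscribed copy.
partition_to_node = {p: f"node-{p % 3}" for p in range(NUM_PARTITIONS)}

def route(key: str) -> str:
    """Routing tier: map key -> partition -> currently responsible node."""
    p = int(hashlib.md5(key.encode()).hexdigest(), 16) % NUM_PARTITIONS
    return partition_to_node[p]

def on_assignment_change(partition: int, new_node: str):
    # Invoked when the coordination service reports a rebalance;
    # subsequent requests are routed to the new owner immediately.
    partition_to_node[partition] = new_node

p = int(hashlib.md5(b"key:foo").hexdigest(), 16) % NUM_PARTITIONS
print(route("key:foo"))            # routed to the current owner of partition p
on_assignment_change(p, "node-9")  # simulate a rebalance notification
print(route("key:foo"))            # now routed to node-9
```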