-
Replication = keeping copies of data on multiple networked machines. Reasons:
- Keep data close to users
- Allow partial failures
- Scale out the number of machines that can serve read queries, to increase read throughput
-
3 popular algorithms for replicating changes between nodes:
- Single-leader
- Multi-leader
- Leaderless
-
Tradeoffs/configuration options for databases
- Sync/async replication
- How to handle failed replicas
-
Leader-based replication (aka active/passive, master/slave) is the most common method for ensuring all writes reach all replicas

- One replica is designated leader/master/primary, which all writes go to
- Other replicas designated followers (read replicas, slaves, secondaries, hot/warm standbys)
- On a write, the leader writes the change to its local storage and also sends the change data to its followers via a replication log/change stream
- All followers consume the log entries and apply them locally in the same order as the leader, potentially sending confirmation of each change back to the leader (toy sketch after this list)
- Writes only go to leader but reads can happen from any replica
- Lots of systems optionally do leader-based replication: Postgres, MySQL, MongoDB, Kafka, RabbitMQ, even some network filesystems
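
A toy sketch of these mechanics (in-memory and single-threaded; all names are made up, not any real database’s API):

```python
# Toy leader-based replication: the leader applies writes locally, appends
# them to an append-only log, and ships each entry to every follower.

class Follower:
    def __init__(self):
        self.data = {}
        self.applied = -1                 # last log sequence number applied

    def replicate(self, entry):
        seq, key, value = entry
        assert seq == self.applied + 1    # apply in the leader's order
        self.data[key] = value
        self.applied = seq

    def read(self, key):                  # reads may hit any replica
        return self.data.get(key)

class Leader(Follower):
    def __init__(self, followers):
        super().__init__()
        self.log = []                     # replication log / change stream
        self.followers = followers

    def write(self, key, value):          # all writes go through the leader
        entry = (len(self.log), key, value)
        self.data[key] = value            # apply to local storage
        self.log.append(entry)
        for f in self.followers:          # ship the change downstream
            f.replicate(entry)

leader = Leader([Follower(), Follower()])
leader.write("x", 1)
assert all(f.read("x") == 1 for f in leader.followers)
```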
-
Sync/async replication

- Configurable in many dbs
- Synchronous: leader waits for confirmation from follower(s) before confirming change with client
- Pro: guaranteed that follower(s) are in sync with leader
- Con: if the synchronous follower doesn’t confirm the change (crash, network fault, slowness), the leader blocks all other writes while it waits
- Asynchronous: leader does not wait for confirmation from follower(s) before confirming change with client (weakens durability)
- Pro: leader can continue processing writes even if all followers are behind
- Con: if the leader fails, all unreplicated writes may be lost
- Enabling sync usually means one follower is sync and the rest are async, so that the leader doesn’t block on every follower - “semi-synchronous”
- If the sync follower is slow or down, another follower is promoted to be the sync follower
- Guarantees an up-to-date copy of the data on at least 2 nodes (the leader and the sync follower); see the sketch after this list
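
A toy sketch of the semi-synchronous commit path; all class and method names are made up (in Postgres, for example, the real knob for designating sync standbys is the synchronous_standby_names setting):

```python
# Toy semi-synchronous commit: the client's write is confirmed once the
# leader AND at least one follower have it. In a real system the non-sync
# followers would receive the entry asynchronously.

class Follower:
    def __init__(self, up=True):
        self.log = []
        self.up = up                  # up=False simulates a slow/dead node

    def replicate(self, entry):
        if not self.up:
            return False              # no confirmation comes back in time
        self.log.append(entry)
        return True                   # ack to the leader

class SemiSyncLeader:
    def __init__(self, followers):
        self.log = []
        self.followers = followers    # followers[0] is the sync follower

    def write(self, entry):
        self.log.append(entry)        # applied/durable on the leader first
        for i, f in enumerate(self.followers):
            if f.replicate(entry):    # one confirmation is enough
                # whoever acked becomes (or stays) the sync follower
                self.followers[0], self.followers[i] = f, self.followers[0]
                return "ok"           # write now exists on >= 2 nodes
        raise RuntimeError("no follower confirmed; can't ack the client")

leader = SemiSyncLeader([Follower(up=False), Follower()])  # sync node down
assert leader.write(("set", "x", 1)) == "ok"               # next one acked
```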
-
Setting up new followers
- Copying data files from one node to the new one isn’t sufficient on its own: clients keep writing to the leader during the copy, so those new writes would be missed
- Could lock out writes from the db while new node is set up, but reduces availability
- No downtime setup:
- Snapshot the leader’s data at a point in time tied to an exact position in the replication log (a log sequence number in Postgres, binlog coordinates in MySQL), without locking the DB
- Copy snapshot to new follower node
- Follower node connects to leader and requests all changes since snapshot
- Follower catches up by processing all those changes, then continues processing data changes from the leader as they happen (toy walkthrough below)
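
A runnable toy of the four steps; integer sequence numbers stand in for Postgres LSNs / MySQL binlog coordinates:

```python
# Toy no-downtime follower setup. The "leader" is a dict plus an append-only
# log of (seq, key, value) entries. Writes keep flowing during the copy,
# and nothing is lost.

leader_data = {"a": 1, "b": 2}
leader_log = [(0, "a", 1), (1, "b", 2)]

# 1. consistent snapshot at a known log position, without locking writes
snapshot, snapshot_seq = dict(leader_data), len(leader_log) - 1

# ...writes continue on the leader while the snapshot is being copied...
leader_data["c"] = 3
leader_log.append((2, "c", 3))

# 2. copy the snapshot onto the new follower node
follower_data, follower_applied = dict(snapshot), snapshot_seq

# 3. follower asks the leader for every change since the snapshot position
for seq, key, value in leader_log[follower_applied + 1:]:
    follower_data[key] = value
    follower_applied = seq

# 4. caught up: from here the follower tails the live change stream
assert follower_data == leader_data
```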
-
Handling node outages
- Could occur due to faults or admin actions like upgrades
- Follower failure: Catch-up recovery
- A freshly recovered follower knows, from its local log, the last transaction it processed
- It connects to the leader, requests all the changes since then, and then continues processing new changes as before
- Leader failure: Failover
- A follower needs to be promoted to be the new leader - “failover”
- Automatic failover starts with determining the leader has failed, usually via a timeout
- Then a new leader is chosen from the followers, either through election/consensus or by a “controller node”. Usually the follower with the most up-to-date data is chosen, to minimize data loss
- Clients now need to send writes to the new leader, and the old leader needs to know that it is now a follower
- Potential issues:
- If replication is async, the old leader may have writes that never reached the new leader - what to do with them? Discarding them is common, but decreases durability
- Especially bad if other systems depend on the DB’s contents: in one GitHub incident, an out-of-date MySQL follower was promoted and reused auto-incrementing primary keys that had already been handed out and were also referenced in Redis, exposing data to the wrong users (TODO: add github incident to Voltus notes)
- If more than one node believes it is the leader (“split brain”), data loss or corruption can occur if there’s no process for resolving conflicting writes. Some systems auto-kill one of the leader nodes, but a badly designed mechanism can end up killing both!
- Choosing the right timeout is tough: a failover triggered by a temporary load spike or latency blip will likely make things worse, but too long a timeout means unnecessarily slow recovery (sketch of the basic mechanics below)
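
A toy sketch of those mechanics, with made-up names and structures:

```python
import time

# Toy automatic failover. Liveness is judged by heartbeat age; the most
# up-to-date follower is promoted to minimize how many unreplicated writes
# get discarded.

FAILOVER_TIMEOUT = 30.0   # too short: spurious failovers under load spikes
                          # too long: slow recovery after a real crash

def leader_is_dead(last_heartbeat_at):
    return time.time() - last_heartbeat_at > FAILOVER_TIMEOUT

def choose_new_leader(followers):
    # pick the follower that has applied the most of the old leader's log
    return max(followers, key=lambda f: f["applied_seq"])

followers = [{"name": "f1", "applied_seq": 41},
             {"name": "f2", "applied_seq": 42}]

if leader_is_dead(last_heartbeat_at=time.time() - 60):
    new_leader = choose_new_leader(followers)   # f2 is least behind
    # not modeled here: repoint client writes to new_leader, and demote the
    # old leader to a follower if it comes back; writes only it had
    # (seq > 42) are commonly discarded
    print("promoting", new_leader["name"])
```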
-
Implementation of Replication Logs
- Statement based replication: leader logs every write request (INSERT/UPDATE/DELETE) and sends it to all followers to execute verbatim. Potential issues:
- NOW(), RAND(), other non-deterministic functions will have different results on different nodes
- Auto-incrementing columns, statements that depend on existing data (UPDATE … WHERE), etc. must be executed in exactly the same order on each replica, which limits concurrency
- Side effects like triggers, stored procedures, UDFs, etc. may be non-deterministic on each replica
- The leader could replace all non-deterministic stuff with their results before forwarding to followers, but there are many edge cases. Could also just require transactions to be deterministic
- Was the default in MySQL before version 5.1; MySQL now switches to row-based replication when a statement is non-deterministic
- Write-ahead log (WAL) shipping
- Append only sequence of bytes containing all writes to the db
- Leader writes WAL to disk and also sends it out to all followers to process
- Used in Postgres
- WAL contains details of which bytes were changed in which disk blocks - very low level descriptions
- Tightly couples replication to the storage engine - a big deal: if the log format changes between database versions, the leader and followers typically can’t run different software versions, which blocks zero-downtime upgrades
- Logical (row-based) log replication
- Using a different format for the log than the storage engine uses on disk decouples the replication log from the engine’s physical data representation, which enables backward compatibility
- With backward compatibility, the leader and followers can run different software versions: upgrade the followers first, then perform a failover so an upgraded node becomes the leader (zero-downtime upgrade)
- Logical rather than physical representation
- Usually a sequence of records describing writes to the database at the granularity of rows (made-up example records after this list)
- Transactions affecting multiple rows generates multiple log entries
- Inserted row → log contains new values of all cols
- Deleted row → enough info to id deleted row
- Updated row → enough info to identify and update row
- Also good for change data capture (sending contents of a db to an external system like data warehouse)
- Trigger-based replication
- If you need to replicate only a subset of the data, replicate across different DB systems, or apply custom conflict-resolution logic, you may need to move replication up into the application layer rather than the DB layer
- Triggers and stored procedures available in many relational dbs - allow you to register custom code automatically executed on writes
- Generally greater overhead, more limited, and bug prone, but also more flexible
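
For the row-based format above, a transaction’s log entries might conceptually look like the following; all field names are invented for illustration:

```python
# Made-up logical log records for one transaction that inserts, updates,
# and deletes a row. Note they describe rows, not SQL statements and not
# bytes in disk blocks.

txn = [
    {"op": "insert", "table": "users",
     "row": {"id": 7, "name": "Ada", "email": "ada@example.com"}},  # all new column values
    {"op": "update", "table": "users",
     "key": {"id": 3},                      # enough info to find the row
     "set": {"email": "ada@new.example"}},  # plus the new values
    {"op": "delete", "table": "sessions",
     "key": {"id": 99}},                    # enough info to identify it
]

# A follower (or a CDC consumer, e.g. a data warehouse loader) only needs
# to understand this record format, not the engine's on-disk layout.
def apply(tables, entry):
    rows = tables.setdefault(entry["table"], {})
    if entry["op"] == "insert":
        rows[entry["row"]["id"]] = dict(entry["row"])
    elif entry["op"] == "update":
        rows[entry["key"]["id"]].update(entry["set"])
    elif entry["op"] == "delete":
        del rows[entry["key"]["id"]]

tables = {"users": {3: {"id": 3, "email": "old@example.com"}},
          "sessions": {99: {"id": 99}}}
for entry in txn:
    apply(tables, entry)
assert tables["users"][3]["email"] == "ada@new.example"
```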
-
Problems with Replication Lag
-
With a high read:write ratio, can scale read throughput just by adding more followers
-
Realistically this only works with async replication: if all followers were synchronous, a single node failure or network hiccup would block every write
-
Async replication implies eventual consistency: reads from a lagging follower can return stale data, but with no new writes the followers eventually converge with the leader
-
Reading your own writes
- App where a user submits data and then views what they submitted - the write goes to the leader, but the read may hit a lagging follower, making the user’s own write appear lost

- Implementing read-after-write (“read-your-writes”) consistency:
- Read from the leader when reading something the user may have modified (e.g. read the user’s own profile from the leader; other users’ profiles, which they can’t edit, can come from followers)
- Track the time of each user’s last update and read from the leader if they wrote recently, e.g. within the last minute (sketch after this list)
- Prevent queries on followers with high replication lag
- Clients remember the (logical) timestamp of their most recent write; a follower serves the read only once its data reflects that timestamp (otherwise it waits, or the query is handed to a caught-up replica)
- Could also want “cross-device” read-after-write consistency (e.g. same user edits on desktop then views on mobile) - requires the last-update metadata to be centralized rather than stored per device
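
A sketch of the track-the-last-update strategy above; the 60-second lag bound and all names are assumptions:

```python
import time

# Remember each user's last write server-side and pin their reads to the
# leader for a while afterwards (60s = assumed bound on replication lag).

LAG_BOUND_SECS = 60.0
last_write_at = {}   # user_id -> timestamp of that user's last write

def record_write(user_id):
    last_write_at[user_id] = time.time()

def pick_replica(user_id, leader, followers):
    wrote_recently = time.time() - last_write_at.get(user_id, 0.0) < LAG_BOUND_SECS
    # recent writers read from the leader; everyone else may hit a follower
    return leader if wrote_recently else followers[0]

# Because the timestamps are keyed by user (not by device), this also gives
# cross-device read-after-write consistency.
record_write("alice")
assert pick_replica("alice", "leader", ["follower1"]) == "leader"
assert pick_replica("bob", "leader", ["follower1"]) == "follower1"
```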
-
Monotonic Reads
-
Consistent Prefix Reads
-
Solutions for replication lag
-
Multi-Leader Replication (master-master, active/active)
-
Handling write conflicts
-
Multi-Leader Replication Topologies
-
Leaderless Replication
-
Limitations of Quorum Consistency