-
Linearizability: once a write completes, it appears to have taken effect instantaneously, and all later reads (of that single object) return the written value - “atomic consistency”
-
vs Serializability: guarantee about transactions - appears as if transactions occurred in some serialized order, even if some/all were actually concurrent
-
Examples of algorithms and protocols for building fault-tolerant distributed systems
-
Packets can be lost, reordered, duplicated, or delayed, clocks are approx, nodes can pause or crash
-
Consensus allows applications to rely on the fact that nodes can all agree on things
-
Consistency Guarantees
- e.g. eventual consistency - all nodes will eventually return the same value if writes are paused
- kinda like transaction isolation levels seen prior, but that’s about avoiding race conditions due to concurrent transactions
- distributed consistency is about coordinating replica state in presence of delays and faults
-
Linearizability
- a distributed db gives the illusion that there’s only one copy - all reads provide the same data, no worrying about replication lag
- also acts like all operations are atomic
- application never has to think about replicas - appears not distributed
- this is called linearizability, aka atomic, strong, immediate, or external consistency

-
if one person has already read the new value, a later query by someone else must not return the old value - if it does, that is a violation of linearizability
-
What Makes a System Linearizable?

- if reads that are concurrent with a write can return either the old or the new value, then readers could see a value flip back and forth between the old and the new value several times while a write is going on. That is not what we expect of a system that emulates a single copy of the data.


- It is possible (though computationally expensive) to test whether a system’s behavior is linearizable by recording the timings of all requests and responses, and checking whether they can be arranged into a valid sequential order
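The brute-force test above can be sketched for a single register: record each operation’s start/end times, then search for a sequential order that respects real-time precedence and register semantics. A minimal illustration (all names and the history data are hypothetical):

```python
from itertools import permutations

# Each operation: (start_time, end_time, kind, value);
# "w" writes value, "r" is a read that returned value.
history = [
    (0, 3, "w", 1),   # write x = 1
    (1, 4, "r", 1),   # read concurrent with the write, returned 1
    (5, 6, "r", 1),   # later read, returned 1
]

def is_linearizable(history):
    """Try every total order; accept one that (a) keeps op A before op B
    whenever A finished before B started, and (b) replays correctly
    against a single register (each read returns the latest write)."""
    for order in permutations(history):
        pos = {op: i for i, op in enumerate(order)}
        if any(pos[a] > pos[b] for a in history for b in history if a[1] < b[0]):
            continue  # violates real-time precedence
        reg, ok = None, True
        for _, _, kind, value in order:
            if kind == "w":
                reg = value
            elif reg != value:
                ok = False  # read returned something other than the latest write
                break
        if ok:
            return True
    return False

print(is_linearizable(history))                           # True
print(is_linearizable([(0, 1, "w", 1), (2, 3, "r", 0)]))  # False: stale read
```

Trying every permutation is why this is computationally expensive: the search space grows factorially with the number of operations.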
-
Relying on Linearizability
- When do we need linearizability?
-
Locking and leader election - all nodes must agree on which node owns the lock (and therefore is leader)
-
Constraints and uniqueness guarantees
- like acquiring a lock on a unique username, or a compare-and-set atomic op
- bank accounts never going negative, not selling more stock than you have, etc.
- this all requires a single up-to-date value that all nodes agree on
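The single up-to-date value can be pictured as an atomic compare-and-set against one copy of the data. A minimal single-process sketch (the `UsernameRegistry` class and `claim` method are hypothetical names; in a real distributed system the single copy would be provided by a linearizable store such as etcd or ZooKeeper rather than an in-process lock):

```python
import threading

class UsernameRegistry:
    """Toy sketch: a single-copy registry where claiming a name is an
    atomic compare-and-set. The lock makes the check-then-claim step
    atomic, which is what yields the uniqueness guarantee."""

    def __init__(self):
        self._lock = threading.Lock()
        self._taken = set()

    def claim(self, username):
        # compare-and-set: succeed only if the name is still unclaimed
        with self._lock:
            if username in self._taken:
                return False
            self._taken.add(username)
            return True

registry = UsernameRegistry()
print(registry.claim("alice"))  # True: first claim wins
print(registry.claim("alice"))  # False: uniqueness preserved
```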
-
Cross-channel timing deps

- the book’s example: a web server stores a full-size image in file storage, then sends a resize instruction through a message queue; the resizer fetches the image from storage
- steps 3 & 4 (sending and receiving the queue message) must not together be faster than the internal replication in the file storage, or the resizer reads a stale or missing image
- the root problem: two different communication channels - file storage replication and the message queue - with no ordering guarantee between them
- other approaches (like reading your own writes) can work, but linearizability is less complex to reason about
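The two-channel race can be sketched in a few lines (all names are hypothetical; the dicts stand in for file-storage replicas and `queue.Queue` for the message queue):

```python
import queue

# Toy sketch: the web server writes an image to replicated file storage,
# then sends a resize job through a message queue. The queue can outrun
# the storage's internal replication, so the resizer reads a stale follower.

primary, follower = {}, {}   # two file-storage replicas
jobs = queue.Queue()         # the second channel

# web server: store the photo on the primary, enqueue the resize job
primary["photo.jpg"] = "full-size bytes"
jobs.put("photo.jpg")

# resizer: receives the job before replication reaches the follower
path = jobs.get()
stale = follower.get(path)
print(stale)                 # None - the image "isn't there yet"

follower.update(primary)     # replication catches up, but too late
print(follower[path])        # the full-size bytes, now visible
```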
-
Implementing Linearizable Systems
-
the simplest answer would be to really only use a single copy of the data
- would not be able to tolerate faults
-
making different replication methods linearizable:
- Single-leader replication (potentially linearizable)
- If you make reads from the leader, or from synchronously updated followers, they have the potential to be linearizable. However, not every single-leader database is actually linearizable, either by design (e.g., because it uses snapshot isolation) or due to concurrency bugs
- relies on the assumption that you know for sure who the leader is
- Consensus algorithms (linearizable)
- consensus protocols contain measures to prevent split brain and stale replicas
- This is how ZooKeeper and etcd work
- Multi-leader replication (not linearizable)
- they concurrently process writes on multiple nodes and asynchronously replicate them to other nodes
- they can produce conflicting writes that require resolution - an artifact of the lack of a single copy of the data
- Leaderless replication (probably not linearizable)
- people sometimes claim that you can obtain “strong consistency” by requiring quorum reads and writes (w + r > n) - not quite true
- “Last write wins” conflict resolution methods based on time-of-day clocks are nonlinearizable, because clock timestamps cannot be guaranteed to be consistent with actual event ordering due to clock skew
- Sloppy quorums not linearizable
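Why last-write-wins breaks ordering can be shown in a few lines (a sketch with hypothetical values; `lww_merge` is an illustrative helper, not a library API):

```python
# Node B's clock runs ~100 ms slow, so a write that happened *after*
# node A's write carries an earlier timestamp and loses the merge.

def lww_merge(versions):
    # last write wins: keep whichever version has the highest timestamp
    return max(versions, key=lambda v: v["ts"])

write_a = {"value": "x=1", "ts": 1000.000}  # node A, well-synced clock
write_b = {"value": "x=2", "ts": 999.900}   # node B, clock 100 ms behind,
                                            # but causally the LATER write

print(lww_merge([write_a, write_b])["value"])  # "x=1": the newer write is lost
```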
-
Linearizability and quorums
- when we have variable network delays, it is possible to have race conditions

- quorum condition is met (w=3 + r=2 > n=3), but this execution is nevertheless not linearizable
- B’s request begins after A’s request completes, but B returns the old value
- it is safest to assume that a leaderless system with Dynamo-style replication does not provide linearizability
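The race can be made concrete with a small simulation (a sketch under assumed behavior; replica state and helper names are hypothetical):

```python
# n=3, w=3, r=2, so w + r > n holds. The writer updates replicas one at
# a time; while the write is in flight, two quorum reads can disagree
# even though the second read starts after the first one finished.

replicas = [{"x": 0, "ver": 0} for _ in range(3)]

def write(value, version, upto):
    # partial write: so far it has reached only replicas[0:upto]
    for r in replicas[:upto]:
        r.update(x=value, ver=version)

def quorum_read(indexes):
    # return the value with the highest version among the quorum
    return max((replicas[i] for i in indexes), key=lambda r: r["ver"])["x"]

write(1, 1, upto=1)        # write of x=1 has reached replica 0 only
a = quorum_read([0, 1])    # reader A's quorum includes replica 0 -> sees 1
b = quorum_read([1, 2])    # reader B starts later, but its quorum holds
                           # only stale replicas               -> sees 0
print(a, b)  # 1 0 - B began after A finished, yet read the older value
```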
-
The Cost of Linearizability

- If the network between datacenters is interrupted in a single-leader setup, clients connected to follower datacenters cannot contact the leader, so they cannot make any writes to the database, nor any linearizable reads. They can still make reads from the follower, but they might be stale (nonlinearizable). If the application requires linearizable reads and writes, the network interruption causes the application to become unavailable in the datacenters that cannot contact the leader.
- If clients can connect directly to the leader datacenter, this is not a problem, since the application continues to work normally there. But clients that can only reach a follower datacenter will experience an outage until the network link is repaired.
- The CAP theorem
- If your application requires linearizability, and some replicas are disconnected from the other replicas due to a network problem, then some replicas cannot process requests while they are disconnected: they must either wait until the network problem is fixed, or return an error (either way, they become unavailable).
- If your application does not require linearizability, then it can be written in a way that each replica can process requests independently, even if it is disconnected from other replicas (e.g., multi-leader). In this case, the application can remain available in the face of a network problem, but its behavior is not linearizable.
- Thus, applications that don’t require linearizability can be more tolerant of network problems - CAP theorem
- These two choices are sometimes known as CP (consistent but not available under network partitions) and AP (available but not consistent under network partitions)
- CAP is sometimes presented as Consistency, Availability, Partition tolerance: pick 2 out of 3. Unfortunately, putting it this way is misleading because network partitions are a kind of fault, so they aren’t something about which you have a choice: they will happen whether you like it or not
- a better way of phrasing CAP would be either Consistent or Available when Partitioned
- there is a lot of misunderstanding and confusion around CAP, and it does not help us understand systems better, so CAP is best avoided
- Linearizability and network delays
- surprisingly few systems are actually linearizable in practice. For example, even RAM on a modern multi-core CPU is not linearizable
- The reason for dropping linearizability is performance, not fault tolerance (CPU cache, async writes to memory)
- Linearizability is slow—and this is true all the time, not only during a network fault.
- Ordering Guarantees
- past examples of ordering
- the main purpose of the leader in single-leader replication is to determine the order of writes in the replication log - that is, the order in which followers apply those writes
- Serializability is about ensuring that transactions behave as if they were executed in some sequential order
- the use of timestamps and clocks in distributed systems, e.g. to determine which of two writes happened later
- Ordering and Causality
- causal dependency between a question and the answer
- a row must first be created before it can be updated
- either A happened before B, or B happened before A, or A and B are concurrent
- snapshot isolation - a transaction reads from a consistent snapshot. But what does “consistent” mean in this context? It means consistent with causality: if the snapshot contains an answer, it must also contain the question being answered
- Serializable snapshot isolation detects write skew by tracking the causal dependencies between transactions
- Alice’s exclamation is causally dependent on the announcement of the score, so Bob should also be able to see the score after hearing Alice
- If a system obeys the ordering imposed by causality, we say that it is causally consistent
- The causal order is not a total order
- A total order allows any two elements to be compared, so if you have two elements, you can always say which one is greater and which one is smaller. For example, natural numbers
- Sets are partially ordered
- Linearizability = total order
- Causality = partial order (incomparable if concurrent)
- Linearizability is stronger than causal consistency
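The partial order can be made concrete with version vectors: comparing two of them yields before, after, equal, or concurrent (incomparable). A sketch; the `compare` helper is illustrative, not a library API:

```python
# Causality is a partial order: two version vectors that each contain
# an entry the other hasn't seen are concurrent - no order exists.

def compare(a, b):
    keys = set(a) | set(b)
    a_le_b = all(a.get(k, 0) <= b.get(k, 0) for k in keys)
    b_le_a = all(b.get(k, 0) <= a.get(k, 0) for k in keys)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"       # a happened before b
    if b_le_a:
        return "after"        # b happened before a
    return "concurrent"       # neither saw the other

print(compare({"n1": 1}, {"n1": 2}))                    # before
print(compare({"n1": 2, "n2": 0}, {"n1": 1, "n2": 3}))  # concurrent
```

Linearizability, by contrast, would force even the concurrent pair into one agreed order.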
- Capturing causal dependencies
-
Sequence Number Ordering
-
Total Order Broadcast (atomic broadcast, total order multicast)
-
Distributed Transactions and Consensus