• Great Carly Rae Jepsen “I just met you” parody quote to start the chapter
  • Cloud Computing and Supercomputing
    • Different computing environments with faults handled differently at each:
      • Vertically scaled computations - supercomputer with thousands of CPUs
        • Checkpoint to disk periodically. On failure, stop whole process, fix things, then restart at last checkpoint
        • More like a single-node computer than distributed
        • Built with high quality components, shared memory, RDMA
        • Specific mesh topologies rather than IP/Ethernet
      • Horizontally scaled internet services - cloud computing (multi-tenant data centers, etc)
        • Online (not batch) jobs with low latency requirements - can’t just turn off and fix
        • Built from commodity (lower quality) machines
        • IP & Ethernet based
        • Much higher volume of machines (something is always broken)
        • Kill and request a new one, rolling upgrades, etc
        • Geographically distributed, so comms over the internet
        • Build reliable system from unreliable components
      • Something in between - enterprise data centers
    • Reasons for distributed systems
      • Low latency (geographically distributed machines close to users)
      • Fault tolerance (some nodes can go offline while system still works)
  • Unreliable Networks
    • “Shared nothing” systems = everything over an unreliable network, nodes have own memory and disk

    • Network requests can fail in many ways, and the client can’t tell why a response hasn’t arrived:

      • Request lost (e.g. cable unplugged)
      • Waiting in a queue
      • Remote node failed/delayed (e.g. stuck in GC)
      • Response lost in network
      • Response processed but delayed

    • Network faults are surprisingly common and difficult to eliminate completely, so must be planned for
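
    • A minimal sketch of what this looks like from the client’s side (host, port, and payload below are hypothetical): all of the failure modes above collapse into the same observable outcome, a missing response.

      ```python
      import socket

      def send_request(host: str, port: int, payload: bytes, timeout_s: float = 5.0):
          """Send a request and wait for a reply, or give up after a timeout."""
          try:
              with socket.create_connection((host, port), timeout=timeout_s) as sock:
                  sock.settimeout(timeout_s)
                  sock.sendall(payload)
                  return sock.recv(4096)      # may block until the timeout expires
          except socket.timeout:
              # Was the request lost? Did the node crash? Was the response lost?
              # Is it still sitting in a queue somewhere? No way to tell.
              return None
      ```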

  • Detecting Faults
    • Systems often need to autodetect failed nodes
      • e.g. a load balancer needs to stop sending requests to a dead node
      • e.g. distributed database with single-leader replication needs to detect if leader fails
    • Sometimes you can get some feedback about functionality
      • RST or FIN packet in reply from the OS if nothing is listening on the port
      • If node process crashes or is killed but OS is still running, a script can notify other nodes
      • Hardware-level link failure detection within a data center (e.g. via the management interface of network switches)
      • ICMP Destination Unreachable reply if a router decides the destination IP is unreachable
    • Node can crash mid-request, then you don’t know how much was processed
    • Even if TCP says delivered, app could have crashed before processing it - need confirmation from the app layer
    • Usually you want to just retry with timeouts
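    • A sketch of that retry-with-timeouts approach (names and constants are illustrative): a connection refused by the OS is fast, explicit feedback, whereas a timeout tells you nothing, so the client simply retries, which is only safe if the request is idempotent or deduplicated by the application.

      ```python
      import socket
      import time

      def request_with_retries(host: str, port: int, payload: bytes,
                               attempts: int = 3, timeout_s: float = 2.0) -> bytes:
          for attempt in range(attempts):
              try:
                  with socket.create_connection((host, port), timeout=timeout_s) as sock:
                      sock.settimeout(timeout_s)
                      sock.sendall(payload)
                      reply = sock.recv(4096)
                      if reply:                   # confirmation from the app layer, not just TCP delivery
                          return reply
              except ConnectionRefusedError:
                  raise                           # RST/FIN from the OS: nothing is listening, fail fast
              except socket.timeout:
                  time.sleep(0.1 * 2 ** attempt)  # back off before retrying
          raise TimeoutError(f"no response from {host}:{port} after {attempts} attempts")
      ```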
  • Timeouts and Unbounded Delays
    • No simple answer to picking timeout lengths

    • Don’t want false positives (a shorter timeout risks declaring healthy-but-slow nodes dead) or false negatives (a longer timeout means waiting a long time before an actually failed node is declared dead)

    • In particular, don’t make things worse when nodes are slow because the system is already under high load - a premature failover adds even more load and can cascade

    • In reality you don’t know the max time (upper bound) for delays - our systems have unbounded delays

    • Network congestion and queueing

      • Variability in network performance most often due to queueing

      • Can be due to network queuing (busy network link), OS queuing (busy CPU), virtual machine queueing, TCP flow control queueing at the sender
      • TCP timeouts and retries also look like delays to the client
      • Batch workloads can saturate queues (noisy neighbor)
    • When you have a known system, you can measure the distribution of delays over time and pick timeouts based on it

      • Even better, your system could automatically measure this and adjust timeouts accordingly
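
      • One way such automatic adjustment could look (a sketch using an exponentially weighted mean and deviation, similar in spirit to how TCP derives its retransmission timeout; the constants are illustrative):

        ```python
        class AdaptiveTimeout:
            """Derive a timeout from observed round-trip times instead of a fixed constant."""

            def __init__(self, initial_s: float = 1.0):
                self.srtt = initial_s          # smoothed round-trip time estimate
                self.rttvar = initial_s / 2    # smoothed deviation of the round-trip time

            def observe(self, rtt_s: float) -> None:
                # Fold each measured round trip into the running estimates.
                self.rttvar = 0.75 * self.rttvar + 0.25 * abs(self.srtt - rtt_s)
                self.srtt = 0.875 * self.srtt + 0.125 * rtt_s

            def timeout(self) -> float:
                # Allow several deviations above the mean before declaring a node dead.
                return self.srtt + 4 * self.rttvar
        ```

      • observe() would be fed the measured round trip of each successful request, and timeout() used for the next request (or for deciding when to declare the node dead)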
  • Synchronous vs Asynchronous Networks
    • Telephone networks set up a circuit for each call, reserving a fixed amount of bandwidth along the route for its duration, establishing a bounded delay
    • TCP very different - circuit is a fixed amount of reserved bandwidth, whereas TCP opportunistically uses whatever network bandwidth is available
      • When TCP conn is idle, doesn’t use any bandwidth
    • Packet-switching (Ethernet, IP, TCP) vs circuit-switching (telephone)
    • Networks use packet-switching because it is optimized for bursty traffic (“just deliver the data as quickly as possible”)
    • Some attempts to get circuit-switching characteristics in packet-switching (bounded delays), but not popular
  • Unreliable Clocks
    • Apps use clocks to measure both durations and points in time:
      • If request timed out
      • Service response time percentiles and other stats
      • When an email should be sent
      • Cache entry expiration
      • Log file timestamps
    • Each machine in a distributed system has its own clock (physical quartz crystal oscillator)
    • Network Time Protocol (NTP) is one attempt to synchronize clocks across machines - sync clock with a set of machines getting their time from e.g. a satellite
    • Time-of-day clocks:
      • Returns current date and time relative to some calendar (wall-clock time)
      • e.g. epoch seconds
      • Usually sync’d via NTP - can jump backwards in time if too far out of sync
        • Bad for measuring durations
      • Often not very granular, e.g. 10ms increments, but less of a problem today
    • Monotonic clocks:
      • Measures durations, e.g. response time or timeout
      • Guaranteed to always move forward
      • Difference between values is meaningful, but not the absolute value at any given time
      • Makes no sense to compare across computers
      • Value might be e.g. nanoseconds since the computer started, and may even come from a separate timer per CPU - the OS presents a monotonic view to application threads even across CPUs
      • NTP may slightly adjust the rate of increase if too slow/fast (0.05%), but can’t jump forwards or backwards
      • Usually ok to use in a distributed system to measure durations
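    • A small sketch contrasting the two clock types in Python: time.time() reads the time-of-day clock (it can jump after an NTP correction, so a computed duration can be wrong or even negative), while time.monotonic() only moves forward and is the right choice for measuring elapsed time on one machine.

      ```python
      import time

      start_wall = time.time()        # seconds since the Unix epoch; may jump on an NTP sync
      start_mono = time.monotonic()   # only meaningful as a difference; not comparable across machines

      time.sleep(0.1)                 # stand-in for the operation being timed

      elapsed_wall = time.time() - start_wall        # risky: skewed if the clock was adjusted mid-measurement
      elapsed_mono = time.monotonic() - start_mono   # safe way to measure the duration
      ```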
    • Other trickiness:
    • Clock sync can be made quite accurate with enough effort (e.g. where regulation demands it, such as for financial high-frequency trading)
  • Relying on Synchronized Clocks
  • Knowledge, Truth and Lies
  • System Model and Reality