• Great Carly Rae Jepsen “I just met you” parody quote to start the chapter
  • Cloud Computing and Supercomputing
    • Different computing environments with faults handled differently at each:
      • Vertically scaled computations - supercomputer with thousands of CPUs
        • Checkpoint to disk periodically. On failure, stop whole process, fix things, then restart at last checkpoint
        • More like a single-node computer than distributed
        • Built with high quality components, shared memory, RDMA
        • Specific mesh topologies rather than IP/Ethernet
      • Horizontally scaled internet services - cloud computing (multi-tenant data centers, etc)
        • Online (not batch) jobs with low latency requirements - can’t just turn off and fix
        • Built from commodity (lower quality) machines
        • IP & Ethernet based
        • Much higher volume of machines (something is always broken)
        • Kill and request a new one, rolling upgrades, etc
        • Geographically distributed, so comms over the internet
        • Build reliable system from unreliable components
      • Something in between - enterprise data centers
    • Reasons for distributed systems
      • Low latency (geographically distributed machines close to users)
      • Fault tolerance (some nodes can go offline while system still works)
  • Unreliable Networks
    • “Shared nothing” systems = everything over an unreliable network, nodes have own memory and disk

    • Network requests can fail in many ways, and the client can’t tell why a response hasn’t arrived:

      • Request lost (e.g. cable unplugged)
      • Waiting in a queue
      • Remote node failed/delayed (e.g. stuck in GC)
      • Response lost in network
      • Response processed but delayed

    • Network faults are surprisingly common and difficult to eliminate completely, so must be planned for
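
    • A minimal sketch of what this looks like from the client’s side (host, port, and payload below are hypothetical): all of the failure modes above collapse into the same observable outcome, a missing response.

      ```python
      import socket

      def send_request(host: str, port: int, payload: bytes, timeout_s: float = 5.0):
          """Send a request and wait for a reply, or give up after a timeout."""
          try:
              with socket.create_connection((host, port), timeout=timeout_s) as sock:
                  sock.settimeout(timeout_s)
                  sock.sendall(payload)
                  return sock.recv(4096)      # may block until the timeout expires
          except socket.timeout:
              # Was the request lost? Did the node crash? Was the response lost?
              # Is it still sitting in a queue somewhere? No way to tell.
              return None
      ```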

  • Detecting Faults
    • Systems often need to autodetect failed nodes
      • e.g. a load balancer needs to stop sending requests to a dead node
      • e.g. distributed database with single-leader replication needs to detect if leader fails
    • Sometimes you can get some feedback about functionality
      • RST or FIN packet in reply from the OS if nothing is listening on the port
      • If node process crashes or is killed but OS is still running, a script can notify other nodes
      • Hardware-level link failure detection within a data center (e.g. via the management interface of network switches)
      • ICMP Destination Unreachable reply if a router decides the destination IP is unreachable
    • Node can crash mid-request, then you don’t know how much was processed
    • Even if TCP says delivered, app could have crashed before processing it - need confirmation from the app layer
    • Usually you want to just retry with timeouts
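    • A sketch of that retry-with-timeouts approach (names and constants are illustrative): a connection refused by the OS is fast, explicit feedback, whereas a timeout tells you nothing, so the client simply retries, which is only safe if the request is idempotent or deduplicated by the application.

      ```python
      import socket
      import time

      def request_with_retries(host: str, port: int, payload: bytes,
                               attempts: int = 3, timeout_s: float = 2.0) -> bytes:
          for attempt in range(attempts):
              try:
                  with socket.create_connection((host, port), timeout=timeout_s) as sock:
                      sock.settimeout(timeout_s)
                      sock.sendall(payload)
                      reply = sock.recv(4096)
                      if reply:                   # confirmation from the app layer, not just TCP delivery
                          return reply
              except ConnectionRefusedError:
                  raise                           # RST/FIN from the OS: nothing is listening, fail fast
              except socket.timeout:
                  time.sleep(0.1 * 2 ** attempt)  # back off before retrying
          raise TimeoutError(f"no response from {host}:{port} after {attempts} attempts")
      ```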
  • Timeouts and Unbounded Delays
    • No simple answer to picking timeout lengths

    • Don’t want false positives (a shorter timeout risks declaring healthy-but-slow nodes dead) or false negatives (a longer timeout means waiting a long time before an actually failed node is declared dead)

    • In particular, don’t make things worse when nodes are slow because the system is already under high load - a premature failover adds even more load and can cascade

    • In reality you don’t know the max time (upper bound) for delays - our systems have unbounded delays

    • Network congestion and queueing

      • Variability in network performance most often due to queueing

      • Can be due to network queuing (busy network link), OS queuing (busy CPU), virtual machine queueing, TCP flow control queueing at the sender
      • TCP timeouts and retries also look like delays to the client
      • Batch workloads can saturate queues (noisy neighbor)
    • When you have a known system, you can measure the distribution of delays over time and pick timeouts based on it

      • Even better, your system could automatically measure this and adjust timeouts accordingly
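
      • One way such automatic adjustment could look (a sketch using an exponentially weighted mean and deviation, similar in spirit to how TCP derives its retransmission timeout; the constants are illustrative):

        ```python
        class AdaptiveTimeout:
            """Derive a timeout from observed round-trip times instead of a fixed constant."""

            def __init__(self, initial_s: float = 1.0):
                self.srtt = initial_s          # smoothed round-trip time estimate
                self.rttvar = initial_s / 2    # smoothed deviation of the round-trip time

            def observe(self, rtt_s: float) -> None:
                # Fold each measured round trip into the running estimates.
                self.rttvar = 0.75 * self.rttvar + 0.25 * abs(self.srtt - rtt_s)
                self.srtt = 0.875 * self.srtt + 0.125 * rtt_s

            def timeout(self) -> float:
                # Allow several deviations above the mean before declaring a node dead.
                return self.srtt + 4 * self.rttvar
        ```

      • observe() would be fed the measured round trip of each successful request, and timeout() used for the next request (or for deciding when to declare the node dead)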
  • Synchronous vs Asynchronous Networks
    • Telephone networks set up a circuit for each call, reserving a fixed amount of bandwidth along the route for its duration, establishing a bounded delay
    • TCP very different - circuit is a fixed amount of reserved bandwidth, whereas TCP opportunistically uses whatever network bandwidth is available
      • When TCP conn is idle, doesn’t use any bandwidth
    • Packet-switching (Ethernet, IP, TCP) vs circuit-switching (telephone)
    • Networks use packet-switching because it is optimized for bursty traffic (“just deliver the data as quickly as possible”)
    • Some attempts to get circuit-switching characteristics in packet-switching (bounded delays), but not popular
  • Unreliable Clocks
    • Apps use clocks to measure both durations and points in time:
      • If request timed out
      • Service response time percentiles and other stats
      • When an email should be sent
      • Cache entry expiration
      • Log file timestamps
    • Each machine in a distributed system has its own clock (physical quartz crystal oscillator)
    • Network Time Protocol (NTP) is one attempt to synchronize clocks across machines - sync clock with a set of machines getting their time from e.g. a satellite
    • Time-of-day clocks:
      • Returns current date and time relative to some calendar (wall-clock time)
      • e.g. epoch seconds
      • Usually sync’d via NTP - can jump backwards in time if too far out of sync
        • Bad for measuring durations
      • Often not very granular, e.g. 10ms increments, but less of a problem today
    • Monotonic clocks:
      • Measures durations, e.g. response time or timeout
      • Guaranteed to always move forward
      • Difference between values is meaningful, but not the absolute value at any given time
      • Makes no sense to compare across computers
      • Value might be e.g. nanoseconds since the computer started, and may even come from a separate timer per CPU - the OS presents a monotonic view to application threads even across CPUs
      • NTP may slightly adjust the rate of increase if too slow/fast (0.05%), but can’t jump forwards or backwards
      • Usually ok to use in a distributed system to measure durations
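    • A small sketch contrasting the two clock types in Python: time.time() reads the time-of-day clock (it can jump after an NTP correction, so a computed duration can be wrong or even negative), while time.monotonic() only moves forward and is the right choice for measuring elapsed time on one machine.

      ```python
      import time

      start_wall = time.time()        # seconds since the Unix epoch; may jump on an NTP sync
      start_mono = time.monotonic()   # only meaningful as a difference; not comparable across machines

      time.sleep(0.1)                 # stand-in for the operation being timed

      elapsed_wall = time.time() - start_wall        # risky: skewed if the clock was adjusted mid-measurement
      elapsed_mono = time.monotonic() - start_mono   # safe way to measure the duration
      ```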
    • Other trickiness:
    • Clock sync can be made quite accurate with enough effort (e.g. where regulation demands it, such as for financial high-frequency trading)
  • Relying on Synchronized Clocks
  • Knowledge, Truth and Lies
  • System Model and Reality