“Shared nothing” systems = everything over an unreliable network, nodes have own memory and disk
Network requests can have many issues, but clients can’t tell why response isn’t there
Network faults are surprisingly common and difficult to eliminate completely, so must be planned for
No simple answer to picking timeout lengths
Don’t want false positives (kill healthy nodes with shorter timeout) or false negatives (wait too long before killing unhealthy nodes with longer timeout)
In particular, don’t want to exacerbate things when things are running slow due to already high load (don’t want to add to load with a failover for example)
In reality you don’t know the max time (upper bound) for delays - our systems have unbounded delays
Network congestion and queueing
When you have a known system, you can get a distribution of delays over time and pick timeouts based off of it