• Stream processing: send a message to another process to be handled asynchronously
  • Reliable, scalable, maintainable data systems
  • DB, Message Queue, Caches etc. are all different. Why call them all "Data Systems?"
    • These days there are databases that also act as message queues (Redis) and message queues with database-like durability (Kafka)
    • Applications are now responsible for stitching together and synchronizing tools that each do one thing well

  • Three main concerns when designing a data system:
    1. Reliability: system works correctly, even in the face of adversity
    2. Scalability: reasonable ways of dealing with growth as it occurs
    3. Maintainability: many people can work on, change and improve the system over time productively
  • Reliability
    • My paraphrased definition: correctness over time
    • Things go wrong = faults, tolerance of faults = resilience
    • Fault = one component deviating from its spec; failure = the system as a whole stops providing its service to the user
    • Can build reliable systems from unreliable parts, e.g. by masking transient faults with retries (see the sketch after this list)
    • Faults:
      • Hardware faults: disk crash, faulty RAM, unplug the power. Set up redundancy, backups, software to tolerate machine loss
        • Uncorrelated, i.e. usually don’t happen all at once
      • Software faults: bugs, runaway process eating CPU/Memory/Disk/Network bandwidth, faulty dependent service, cascading failures
        • Often correlated, i.e. all happen at once
      • Avoid human error: encapsulation + good APIs, sandbox environments, testing, easy-to-roll-back config/deploys, rolling deploys, tooling to recompute data, telemetry (perf metrics, error rates, etc.)
    • Be conscious of where you cut corners
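    • A minimal sketch of one such fault-tolerance building block: wrapping an unreliable call in retries with exponential backoff and jitter. Names like TransientError and call_with_retries are illustrative, not from the book.

```python
import random
import time

class TransientError(Exception):
    """Stand-in for a timeout / connection reset from an unreliable component."""

def call_with_retries(operation, max_attempts=5, base_delay=0.1):
    """Mask transient faults by retrying with exponential backoff plus jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # fault is no longer masked; it surfaces as a failure
            # jitter keeps retries from many clients from arriving in correlated bursts
            time.sleep(base_delay * 2 ** attempt * random.uniform(0.5, 1.5))
```

    • Usage would look like call_with_retries(lambda: flaky_write(record)), where flaky_write is whatever timeout-prone call is being wrapped (hypothetical name)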
  • Scalability
    • Not a binary - more a question of what the options are for coping with growth
    • Load parameters - metrics that describe the current load on a system
      • Requests/sec to web server
      • Reads:Writes ratio for DB
      • Active users
      • Cache hit rate
    • Could be the average val of a given load param or the extreme val you care about
    • Very cool real world twitter example
      • Approach 1: a follower requests their timeline, and a series of joins at read time gathers all tweets from everyone they follow
      • Approach 2: when a user tweets, Twitter updates the cached home timeline of each of their followers
      • Switched to a hybrid, since approach 2 is a lot of work when high-follower-count users (celebrities) tweet (e.g. ~31M cached timelines to update for a single tweet), so celebrity tweets are handled approach-1 style and merged in at read time; see the sketch below
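      • A toy sketch of the two approaches, with in-memory dicts standing in for the tweets table and the timeline cache (all names are illustrative):

```python
from collections import defaultdict

tweets = defaultdict(list)      # user -> their tweets (stand-in for the tweets table)
followers = defaultdict(set)    # user -> who follows them
timelines = defaultdict(list)   # user -> precomputed home timeline (stand-in for the cache)

# Approach 1: cheap write, expensive read (gather at request time)
def post_tweet_v1(user, text):
    tweets[user].append((user, text))

def home_timeline_v1(following):
    return [t for followee in following for t in tweets[followee]]

# Approach 2: expensive write, cheap read (fan out to every follower's cache)
def post_tweet_v2(user, text):
    tweets[user].append((user, text))
    for follower in followers[user]:        # ~31M iterations when a celebrity tweets
        timelines[follower].append((user, text))

def home_timeline_v2(user):
    return timelines[user]                  # just read the precomputed list
```

      • The hybrid reads the cached timeline (approach 2) and merges in tweets from followed celebrities fetched approach-1 style at read time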
    • Performance:
      • If load param increases, how much does perf degrade?
      • If load param increases, how much do resources have to increase to match same perf as before?
    • Throughput: # records processed/sec, or time to run on given dataset
    • Response time: time b/n sending request & receiving response
      • Networking delays, queueing delays, latency, service time
      • Latency = time request is waiting to be handled (latent)
      • Response time is a distribution, not a single #
      • Best to look at response time by percentile, not by mean
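      • A small sketch of reading percentiles off measured response times (nearest-rank method; the sample numbers are made up):

```python
import math

def percentile(response_times_ms, p):
    """p-th percentile of observed response times (nearest-rank method)."""
    ranked = sorted(response_times_ms)
    idx = math.ceil(p / 100 * len(ranked)) - 1
    return ranked[idx]

times = [22, 25, 31, 30, 24, 980, 27, 29, 26, 23]        # one slow outlier
mean = sum(times) / len(times)                            # ~122 ms: describes no real request
p50, p99 = percentile(times, 50), percentile(times, 99)   # 26 ms typical, 980 ms tail
```

      • The single outlier drags the mean to ~122 ms, a value no real request experienced, while p50 and p99 describe the typical case and the tail separately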
    • Percentiles often used in SLOs (service level objectives, i.e. goals) and SLAs (service level agreements, i.e. contracts, agreements with clients for what defines a service being up)
    • Head of line blocking: a few slow requests at the front of a queue hold up all the requests behind them
    • Measure response time on the client side to include queueing times
      • Does Datadog measure on client side?
    • Important that a load-testing client keeps sending requests on its intended schedule and doesn't wait for the previous response before sending the next, otherwise it keeps queues artificially short; a sketch follows below
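    • A sketch of such an open-loop load generator, timed on the client side so queueing delay is included. It assumes the aiohttp client library and an arbitrary target URL:

```python
import asyncio
import time
import aiohttp   # assumed HTTP client; any async client works

async def timed_get(session, url, timings):
    start = time.perf_counter()                    # clock starts when the request is sent,
    async with session.get(url) as resp:           # so queueing on the server is included
        await resp.read()
    timings.append(time.perf_counter() - start)

async def open_loop_load(url, rate_per_sec=50, duration_sec=10):
    """Fire requests on a fixed schedule, independent of earlier responses."""
    timings, tasks = [], []
    async with aiohttp.ClientSession() as session:
        for _ in range(rate_per_sec * duration_sec):
            tasks.append(asyncio.create_task(timed_get(session, url, timings)))
            await asyncio.sleep(1 / rate_per_sec)  # don't wait for responses before sending more
        await asyncio.gather(*tasks, return_exceptions=True)
    return timings

# timings = asyncio.run(open_loop_load("http://localhost:8080/health"))  # illustrative URL
```

    • Because each request is timed from the moment it is sent, a backlog on the server shows up in the measurements instead of being hidden by the client politely waiting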
    • Even when a request's backend calls are made in parallel, the end-user response time is gated by the slowest of those calls
    • Tail latency amplification: the more backend calls made per request, the greater the odds of hitting the slow tail of the distribution; e.g. with 100 parallel calls that are each slow 1% of the time, ~63% of requests (1 - 0.99^100) hit at least one slow call
    • Do not average percentiles! Add histograms, then recalculate
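    • A sketch of what "add histograms, then recalculate" can look like: each host buckets its own response times using shared bucket boundaries, the per-host counts are summed bucket by bucket, and the percentile is read off the merged histogram (bucket bounds and names are illustrative):

```python
import bisect
from collections import Counter

BUCKET_UPPER_BOUNDS_MS = [5, 10, 25, 50, 100, 250, 500, 1000]   # shared across every host

def to_histogram(response_times_ms):
    """Bucket raw response times into counts (what each host would report)."""
    hist = Counter()
    for t in response_times_ms:
        hist[bisect.bisect_left(BUCKET_UPPER_BOUNDS_MS, t)] += 1
    return hist

def merged_percentile(per_host_histograms, p):
    """Sum bucket counts across hosts, then read the percentile off the merged histogram."""
    merged = sum(per_host_histograms, Counter())
    threshold = p / 100 * sum(merged.values())
    seen = 0
    for bucket in sorted(merged):
        seen += merged[bucket]
        if seen >= threshold:
            # report the bucket's upper bound (overflow bucket clamps to the last bound)
            return BUCKET_UPPER_BOUNDS_MS[min(bucket, len(BUCKET_UPPER_BOUNDS_MS) - 1)]
```

    • The wrong version (computing a p99 on each host and averaging those numbers) weights a lightly loaded host the same as a heavily loaded one and doesn't correspond to any percentile of the real distribution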
    • On a fast growing service, will likely rethink architecture every order of mag growth in data (or more often)
    • Scale up (vertical: a more powerful machine) vs. scale out (horizontal: load spread across more, smaller machines, i.e. shared-nothing architecture)
      • Pragmatic mixture is often good (see Stack Overflow's architecture)
    • Elastic systems scale up/down automatically - good if load is unpredictable
      • Distributed systems have higher complexity, however
    • Manually scaled systems have fewer surprises
    • Scaling is use-case specific
      • reads/writes
      • data volume
      • data complexity
      • response time reqs
      • access patterns
    • In a startup, more important to iterate quickly than it is to scale to a hypothetical future load
  • Maintainability
    • “Legacy software” is no fun, but often necessary; we should try to design systems so our own work doesn’t turn into it
    • Fixing bugs, keeping systems on, investigating failures, adapting to new platforms, new use cases, repaying tech debt, new feature dev
    • Main principles:
      • Operability: make it easy for ops to keep things running smoothly
        • Health monitoring, and restoring service quickly after faults/failures (see the health-endpoint sketch at the end of this Operability list)
        • Tracking down causes of problems
        • Updates and security patches
        • Keep tabs on dependencies
        • Capacity planning
        • Deployment, configuration tooling
        • Security
        • Process development
        • Knowledge organization
        • For data:
          • Visibility + monitoring
          • Automation and integration tooling
          • Redundancy across machines
          • Good docs and a clear operational model
          • Good defaults + customizability
          • Minimize surprises
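        • A minimal sketch of the health-monitoring piece: an endpoint that reports liveness plus a couple of basic metrics, using only the standard library. The port, path, and in-memory counters are illustrative; a real system would use a metrics library.

```python
import json
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

START = time.time()
REQUESTS = {"total": 0, "errors": 0}   # illustrative in-memory counters

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            body = json.dumps({
                "status": "ok",
                "uptime_sec": round(time.time() - START),
                "error_rate": REQUESTS["errors"] / max(REQUESTS["total"], 1),
            }).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

# HTTPServer(("0.0.0.0", 9090), HealthHandler).serve_forever()  # illustrative port
```

        • A monitoring agent or load balancer polling /health gives ops the visibility to notice faults early and a signal for automated restarts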
      • Simplicity: remove complexity from the system
        • App can have complex functionality, but complexity that comes from the implementation rather than the problem itself (accidental complexity) is undesirable
        • Good abstractions + encapsulation
      • Evolvability: make it changeable
        • Agile practices, TDD, refactoring
  • Systems have functional reqs (does what it says) and nonfunctional reqs (security, reliability, compliance, scalability, compatibility, maintainability)