• Stream processing: send a message to another process to be handled asynchronously
  • Reliable, scalable, maintainable data systems
  • DB, Message Queue, Caches etc. are all different. Why call them all "Data Systems?"
    • These days there are databases that also act as message queues (Redis) and message queues with database-like durability (Kafka)
    • Applications are now responsible for stitching together and synchronizing tools that each do one thing well

  • Three main concerns when designing a data system:
    1. Reliability: system works correctly, even in the face of adversity
    2. Scalability: reasonable ways of dealing with growth as it occurs
    3. Maintainability: many people can work on, change and improve the system over time productively
  • Reliability
    • My paraphrased definition: correctness over time
    • Things go wrong = faults, tolerance of faults = resilience
    • Fault = one component deviating from its spec; failure = the system as a whole stops providing its service to the user
    • Can build reliable systems from unreliable parts, e.g. by masking transient faults with retries (see the sketch after this list)
    • Faults:
      • Hardware faults: disk crash, faulty RAM, unplug the power. Set up redundancy, backups, software to tolerate machine loss
        • Uncorrelated, i.e. usually don’t happen all at once
      • Software faults: bugs, runaway process eating CPU/Memory/Disk/Network bandwidth, faulty dependent service, cascading failures
        • Often correlated, i.e. all happen at once
      • Avoid human error: encapsulation + good APIs, sandbox environments, testing, easy-to-roll-back config/deploys, rolling deploys, tooling to recompute data, telemetry (perf metrics, error rates, etc.)
    • Be conscious of where you cut corners
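    • A minimal sketch of one such fault-tolerance building block: wrapping an unreliable call in retries with exponential backoff and jitter. Names like TransientError and call_with_retries are illustrative, not from the book.

```python
import random
import time

class TransientError(Exception):
    """Stand-in for a timeout / connection reset from an unreliable component."""

def call_with_retries(operation, max_attempts=5, base_delay=0.1):
    """Mask transient faults by retrying with exponential backoff plus jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # fault is no longer masked; it surfaces as a failure
            # jitter keeps retries from many clients from arriving in correlated bursts
            time.sleep(base_delay * 2 ** attempt * random.uniform(0.5, 1.5))
```

    • Usage would look like call_with_retries(lambda: flaky_write(record)), where flaky_write is whatever timeout-prone call is being wrapped (hypothetical name)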
  • Scalability
    • Not a binary - more a question of what the options are for coping with growth
    • Load parameters - metrics that describe the current load on a system
      • Requests/sec to web server
      • Reads:Writes ratio for DB
      • Active users
      • Cache hit rate
    • Could be the average val of a given load param or the extreme val you care about
    • Very cool real world twitter example
      • Approach 1: a follower requests their timeline, and a series of joins at read time gathers all tweets from everyone they follow
      • Approach 2: when a user tweets, Twitter updates the cached home timeline of each of their followers
      • Switched to a hybrid, since approach 2 is a lot of work when high-follower-count users (celebrities) tweet (e.g. ~31M cached timelines to update for a single tweet), so celebrity tweets are handled approach-1 style and merged in at read time; see the sketch below
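      • A toy sketch of the two approaches, with in-memory dicts standing in for the tweets table and the timeline cache (all names are illustrative):

```python
from collections import defaultdict

tweets = defaultdict(list)      # user -> their tweets (stand-in for the tweets table)
followers = defaultdict(set)    # user -> who follows them
timelines = defaultdict(list)   # user -> precomputed home timeline (stand-in for the cache)

# Approach 1: cheap write, expensive read (gather at request time)
def post_tweet_v1(user, text):
    tweets[user].append((user, text))

def home_timeline_v1(following):
    return [t for followee in following for t in tweets[followee]]

# Approach 2: expensive write, cheap read (fan out to every follower's cache)
def post_tweet_v2(user, text):
    tweets[user].append((user, text))
    for follower in followers[user]:        # ~31M iterations when a celebrity tweets
        timelines[follower].append((user, text))

def home_timeline_v2(user):
    return timelines[user]                  # just read the precomputed list
```

      • The hybrid reads the cached timeline (approach 2) and merges in tweets from followed celebrities fetched approach-1 style at read time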
    • Performance:
      • If load param increases, how much does perf degrade?
      • If load param increases, how much do resources have to increase to match same perf as before?
    • Throughput: # records processed/sec, or time to run on given dataset
    • Response time: time b/n sending request & receiving response
      • Networking delays, queueing delays, latency, service time
      • Latency = time request is waiting to be handled (latent)
      • Response time is a distribution, not a single #
      • Best to look at response time by percentile, not by mean
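      • A small sketch of reading percentiles off measured response times (nearest-rank method; the sample numbers are made up):

```python
import math

def percentile(response_times_ms, p):
    """p-th percentile of observed response times (nearest-rank method)."""
    ranked = sorted(response_times_ms)
    idx = math.ceil(p / 100 * len(ranked)) - 1
    return ranked[idx]

times = [22, 25, 31, 30, 24, 980, 27, 29, 26, 23]        # one slow outlier
mean = sum(times) / len(times)                            # ~122 ms: describes no real request
p50, p99 = percentile(times, 50), percentile(times, 99)   # 26 ms typical, 980 ms tail
```

      • The single outlier drags the mean to ~122 ms, a value no real request experienced, while p50 and p99 describe the typical case and the tail separately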
    • Percentiles often used in SLOs (service level objectives, i.e. goals) and SLAs (service level agreements, i.e. contracts, agreements with clients for what defines a service being up)
    • Head of line blocking: a few slow requests at the front of a queue hold up all the requests behind them
    • Measure response time on the client side to include queueing times
      • Does Datadog measure on client side?
    • Important that a load-testing client keeps sending requests on its intended schedule and doesn't wait for the previous response before sending the next, otherwise it keeps queues artificially short; a sketch follows below
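    • A sketch of such an open-loop load generator, timed on the client side so queueing delay is included. It assumes the aiohttp client library and an arbitrary target URL:

```python
import asyncio
import time
import aiohttp   # assumed HTTP client; any async client works

async def timed_get(session, url, timings):
    start = time.perf_counter()                    # clock starts when the request is sent,
    async with session.get(url) as resp:           # so queueing on the server is included
        await resp.read()
    timings.append(time.perf_counter() - start)

async def open_loop_load(url, rate_per_sec=50, duration_sec=10):
    """Fire requests on a fixed schedule, independent of earlier responses."""
    timings, tasks = [], []
    async with aiohttp.ClientSession() as session:
        for _ in range(rate_per_sec * duration_sec):
            tasks.append(asyncio.create_task(timed_get(session, url, timings)))
            await asyncio.sleep(1 / rate_per_sec)  # don't wait for responses before sending more
        await asyncio.gather(*tasks, return_exceptions=True)
    return timings

# timings = asyncio.run(open_loop_load("http://localhost:8080/health"))  # illustrative URL
```

    • Because each request is timed from the moment it is sent, a backlog on the server shows up in the measurements instead of being hidden by the client politely waiting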
    • Even when a request's backend calls are made in parallel, the end-user response time is gated by the slowest of those calls
    • Tail latency amplification: the more backend calls made per request, the greater the odds of hitting the slow tail of the distribution; e.g. with 100 parallel calls that are each slow 1% of the time, ~63% of requests (1 - 0.99^100) hit at least one slow call
    • Do not average percentiles! Add histograms, then recalculate
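    • A sketch of what "add histograms, then recalculate" can look like: each host buckets its own response times using shared bucket boundaries, the per-host counts are summed bucket by bucket, and the percentile is read off the merged histogram (bucket bounds and names are illustrative):

```python
import bisect
from collections import Counter

BUCKET_UPPER_BOUNDS_MS = [5, 10, 25, 50, 100, 250, 500, 1000]   # shared across every host

def to_histogram(response_times_ms):
    """Bucket raw response times into counts (what each host would report)."""
    hist = Counter()
    for t in response_times_ms:
        hist[bisect.bisect_left(BUCKET_UPPER_BOUNDS_MS, t)] += 1
    return hist

def merged_percentile(per_host_histograms, p):
    """Sum bucket counts across hosts, then read the percentile off the merged histogram."""
    merged = sum(per_host_histograms, Counter())
    threshold = p / 100 * sum(merged.values())
    seen = 0
    for bucket in sorted(merged):
        seen += merged[bucket]
        if seen >= threshold:
            # report the bucket's upper bound (overflow bucket clamps to the last bound)
            return BUCKET_UPPER_BOUNDS_MS[min(bucket, len(BUCKET_UPPER_BOUNDS_MS) - 1)]
```

    • The wrong version (computing a p99 on each host and averaging those numbers) weights a lightly loaded host the same as a heavily loaded one and doesn't correspond to any percentile of the real distribution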
    • On a fast growing service, will likely rethink architecture every order of mag growth in data (or more often)
    • Scale up (vertical: a more powerful machine) vs. scale out (horizontal: load spread across more, smaller machines, i.e. shared-nothing architecture)
      • Pragmatic mixture is often good (see Stack Overflow's architecture)
    • Elastic systems scale up/down automatically - good if load is unpredictable
      • Distributed systems have higher complexity, however
    • Manually scaled systems have fewer surprises
    • Scaling is use-case specific
      • reads/writes
      • data volume
      • data complexity
      • response time reqs
      • access patterns
    • In a startup, more important to iterate quickly than it is to scale to a hypothetical future load
  • Maintainability
    • “Legacy software” is no fun, but often necessary; we should try to design systems so our own work doesn’t turn into it
    • Fixing bugs, keeping systems on, investigating failures, adapting to new platforms, new use cases, repaying tech debt, new feature dev
    • Main principles:
      • Operability: make it easy for ops to keep things running smoothly
        • Health monitoring, and restoring service quickly after faults/failures (see the health-endpoint sketch at the end of this Operability list)
        • Tracking down causes of problems
        • Updates and security patches
        • Keep tabs on dependencies
        • Capacity planning
        • Deployment, configuration tooling
        • Security
        • Process development
        • Knowledge organization
        • For data:
          • Visibility + monitoring
          • Automation and integration tooling
          • Redundancy across machines
          • Good docs and a clear operational model
          • Good defaults + customizability
          • Minimize surprises
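        • A minimal sketch of the health-monitoring piece: an endpoint that reports liveness plus a couple of basic metrics, using only the standard library. The port, path, and in-memory counters are illustrative; a real system would use a metrics library.

```python
import json
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

START = time.time()
REQUESTS = {"total": 0, "errors": 0}   # illustrative in-memory counters

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            body = json.dumps({
                "status": "ok",
                "uptime_sec": round(time.time() - START),
                "error_rate": REQUESTS["errors"] / max(REQUESTS["total"], 1),
            }).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

# HTTPServer(("0.0.0.0", 9090), HealthHandler).serve_forever()  # illustrative port
```

        • A monitoring agent or load balancer polling /health gives ops the visibility to notice faults early and a signal for automated restarts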
      • Simplicity: remove complexity from the system
        • App can have complex functionality, but complexity that comes from the implementation rather than the problem itself (accidental complexity) is undesirable
        • Good abstractions + encapsulation
      • Evolvability: make it changeable
        • Agile practices, TDD, refactoring
  • Systems have functional reqs (does what it says) and nonfunctional reqs (security, reliability, compliance, scalability, compatibility, maintainability)