- Stream processing: send a message to another process to be handled asynchronously
- Reliable, scalable, maintainable data systems
- DBs, message queues, caches, etc. are all different. Why call them all "data systems"?
- These days there are databases that also act as message queues (Redis), and message queues with database-like durability guarantees (Kafka)
- Applications are now responsible for stitching together and synchronizing tools that each do their own thing well

- Three main concerns for a data system:
- Reliability: system works correctly, even in the face of adversity
- Scalability: reasonable ways of dealing with growth as it occurs
- Maintainability: many people can work on, change and improve the system over time productively
- Reliability
- My paraphrased definition: correctness over time
- Things go wrong = faults, tolerance of faults = resilience
- Fault = one component going out of spec; failure = the system as a whole stops providing its service to the user
- Can build reliable systems from unreliable parts
- Faults:
- Hardware faults: disk crash, faulty RAM, unplug the power. Set up redundancy, backups, software to tolerate machine loss
- Uncorrelated, i.e. usually don’t happen all at once
- Software faults: bugs, runaway process eating CPU/Memory/Disk/Network bandwidth, faulty dependent service, cascading failures
- Often correlated, i.e. all happen at once
- Avoid human error: encapsulation + good APIs, sandbox environments, testing, easy-to-roll-back config/deploys, rolling deploys, tooling to recompute data, telemetry (perf metrics, error rates, etc.)
- Be conscious of where you cut corners
- Scalability
- Not a binary - more a question of what the options are for coping with growth
- Load parameters - metrics describing the load on a system
- Requests/sec to web server
- Reads:Writes ratio for DB
- # of active users
- Cache hit rate
- Could be the average value of a given load param, or the extreme value you care about
- Very cool real world twitter example
- Approach 1 (fan-out on read): follower requests their timeline; a series of joins fetches all tweets from everyone they follow
- Approach 2 (fan-out on write): when a user tweets, Twitter updates the cached news feed of every follower accordingly
- Switched to a hybrid approach, since 2 is a lot of work when high-follower-count users (celebs) tweet (e.g. 31MM caches to update per tweet), so 1 is used for celebrity tweets (see the sketch below)
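- A minimal sketch of the two approaches combined (hypothetical in-memory stores and a made-up celebrity threshold; not Twitter's actual code):

```python
from collections import defaultdict

FANOUT_THRESHOLD = 1_000_000  # hypothetical cutoff for "celebrity" accounts

tweets = defaultdict(list)          # user_id -> tweets posted
followers = defaultdict(set)        # user_id -> who follows them
followees = defaultdict(set)        # user_id -> who they follow
timeline_cache = defaultdict(list)  # user_id -> precomputed home timeline

def post_tweet(user_id, text):
    tweets[user_id].append(text)
    # Approach 2 (fan-out on write): push the tweet into every follower's
    # cached timeline -- skipped for celebrities, where one tweet would
    # mean millions of cache writes.
    if len(followers[user_id]) < FANOUT_THRESHOLD:
        for f in followers[user_id]:
            timeline_cache[f].append((user_id, text))

def read_timeline(user_id):
    feed = list(timeline_cache[user_id])
    # Approach 1 (fan-out on read), applied only to celebrity followees:
    # fetch their tweets and merge them in at read time.
    for followee in followees[user_id]:
        if len(followers[followee]) >= FANOUT_THRESHOLD:
            feed.extend((followee, t) for t in tweets[followee])
    return feed
```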
- Performance:
- If load param increases, how much does perf degrade?
- If load param increases, how much do resources have to increase to match same perf as before?
- Throughput: # records processed/sec, or time to run on given dataset
- Response time: time b/n sending request & receiving response
- Networking delays, queueing delays, latency, service time
- Latency = time request is waiting to be handled (latent)
- Response time is a distribution, not a single #
- Best to look at response time by percentile, not by mean
- Percentiles often used in SLOs (service level objectives, i.e. goals) and SLAs (service level agreements, i.e. contracts, agreements with clients for what defines a service being up)
- Head-of-line blocking: a few slow requests at the front of a queue hold up all the rest behind them
- Measure response time on the client side to include queueing times
- Does Datadog measure on client side?
- Important that load testing actually builds up a queue and doesn't wait for the first request to return before sending the next (open-loop vs. closed-loop sketch below)
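- Sketch of the difference, with a fake request standing in for a real HTTP call (names and timings made up):

```python
import asyncio
import random

async def fake_request():
    # Stand-in for a real backend call: fast 95% of the time, slow 5%.
    await asyncio.sleep(1.0 if random.random() < 0.05 else 0.01)

async def closed_loop(n):
    # Waits for each response before sending the next request: a slow
    # server silently throttles the test, so no queue ever builds up.
    for _ in range(n):
        await fake_request()

async def open_loop(n, rate_per_sec):
    # Fires requests on a fixed schedule, independent of responses, so
    # a queue really does build up when the server falls behind.
    tasks = []
    for _ in range(n):
        tasks.append(asyncio.create_task(fake_request()))
        await asyncio.sleep(1 / rate_per_sec)
    await asyncio.gather(*tasks)

asyncio.run(open_loop(200, rate_per_sec=100))
```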
- The response time of a request is gated by the slowest backend call it makes, even when those calls run in parallel
- Tail latency amplification: the more backend calls made per request, the greater the odds of hitting the slow part of the distribution and getting an overall slow response (quick calculation below)
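- Back-of-envelope version: if each backend call independently lands in the slow tail with probability p, a request fanning out to n calls is slow with probability 1 - (1 - p)^n:

```python
p = 0.01  # chance a single backend call hits the slowest 1%
for n in (1, 10, 100):
    print(f"{n:>3} calls -> P(at least one slow call) = {1 - (1 - p) ** n:.2f}")
# 1 call -> 0.01, 10 calls -> ~0.10, 100 calls -> ~0.63
```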
- Do not average percentiles! Add the histograms, then recalculate the percentile (sketch below)
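- Quick demo on synthetic data: the p99 of the pooled data is not the mean of per-host p99s:

```python
import numpy as np

rng = np.random.default_rng(0)
host_a = rng.exponential(100, 10_000)  # synthetic response times (ms)
host_b = rng.exponential(300, 10_000)  # a slower host

wrong = (np.percentile(host_a, 99) + np.percentile(host_b, 99)) / 2
# Correct: pool the underlying data (in practice, merge the histograms),
# then recompute the percentile over the combined distribution.
right = np.percentile(np.concatenate([host_a, host_b]), 99)
print(wrong, right)  # the two disagree substantially
```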
- On a fast-growing service, you'll likely rethink the architecture at every order of magnitude of data growth (or more often)
- Scale up (vertical): a better machine, vs. scale out (horizontal): work spread across more machines (i.e. shared-nothing architecture)
- Pragmatic mixture is often good (see Stack Overflow's architecture)
- Elastic systems scale up/down automatically - good if load is unpredictable
- Distributed systems have higher complexity, however
- Manually scaled systems have fewer surprises
- Scaling is use-case specific
- # of reads/writes
- data volume
- data complexity
- response time reqs
- access patterns
- In a startup, more important to iterate quickly than it is to scale to a hypothetical future load
- Maintainability
- “Legacy software” is no fun, but often necessary, and we should try to design systems so our own work doesn’t turn into it
- Fixing bugs, keeping systems on, investigating failures, adapting to new platforms, new use cases, repaying tech debt, new feature dev
- Main principles:
- Operability: make it easy for ops to keep things running smoothly
- Health monitoring, restoring faults/failures
- Tracking down causes of problems
- Updates and security patches
- Keep tabs on dependencies
- Capacity planning
- Deployment, configuration tooling
- Security
- Process development
- Knowledge organization
- For data:
- Visibility + monitoring
- Automation and integration tooling
- Redundancy across machines
- Good docs and a clear operational model
- Good defaults + customizability
- Minimize surprises
- Simplicity: remove complexity from the system
- App can have complex functionality, but any complexity that comes from the implementation rather than the functionality (accidental complexity) is undesirable
- Good abstractions + encapsulation
- Evolvability: make it changeable
- Systems have functional reqs (does what it says) and nonfunctional reqs (security, reliability, compliance, scalability, compatibility, maintainability)