• A lot of data is “unbounded” (never actually complete)

  • Stream processing: process every event as it happens

  • Transmitting Event Streams

    • Parse binary into sequence of records/events
    • Events generated once by producer, processed by one or more consumers
    • Related events grouped into a topic/stream
    • Relational dbs have triggers, but they are more limited
  • Messaging systems

    • What if producers make too many records for consumers to handle? 3 options:

      • drop messages
      • buffer messages in queue
      • backpressure: prevent the producer from sending more messages (flow control)
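
The three overload policies above can be sketched with a bounded queue (a toy Python illustration using the standard library, not any particular broker's API):

```python
import queue

# A bounded buffer: when it is full, the producer's choice of call
# decides the policy.
q = queue.Queue(maxsize=2)

q.put("a")  # accepted
q.put("b")  # accepted; queue is now full

# Policy 1: drop the message when the buffer is full.
dropped = False
try:
    q.put_nowait("c")
except queue.Full:
    dropped = True

# Policy 3: backpressure - a blocking q.put("c") would stall the
# producer until a consumer calls q.get(), slowing the producer down.
consumed = q.get()
q.put("c")  # succeeds now that a slot has freed up
```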
    • What happens if queues fill up? System dependent

    • What happens if nodes crash/go offline?

      • Tradeoff between throughput/latency and accepting loss of some records
    • Direct messaging from producers to consumers

      • UDP multicast for stock markets
      • Brokerless libraries like ZeroMQ
      • StatsD uses UDP
      • HTTP/RPC/Web Hooks to produce messages
      • Require application code to consider data loss
      • Retries tricky
    • Message brokers/queues

      • Question of durability moved to broker
      • Usually queue messages - consumption happens asynchronously
      • Comparison to DBs
        • Can do 2 phase commit
        • Delete data after consumed
        • Secondary indexes sort of like topic subscriptions
        • Don’t support arbitrary queries, just notify-on-changes
    • Multiple consumers, 2 patterns:

      • Load balancing: consumers share work of processing messages in topic
      • Fan out: all messages delivered to all consumers

      • Consumer groups combine these - each group receives all messages (fan out) but within a group only one consumer receives each message (load balance)
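
The consumer-group combination above can be sketched as a toy broker (the `Broker` class and round-robin dispatch are illustrative assumptions, not a real broker's API):

```python
from collections import defaultdict
from itertools import cycle

class Broker:
    """Toy broker: fan-out across consumer groups, load-balance within a group."""
    def __init__(self):
        self.groups = {}  # group name -> list of consumer callbacks
        self._rr = {}     # group name -> round-robin iterator

    def subscribe(self, group, consumer):
        self.groups.setdefault(group, []).append(consumer)
        self._rr[group] = cycle(self.groups[group])

    def publish(self, message):
        # Every group sees every message (fan-out) ...
        for group in self.groups:
            # ... but only one consumer in the group handles it (load balancing).
            next(self._rr[group])(message)

received = defaultdict(list)
broker = Broker()
broker.subscribe("analytics", lambda m: received["analytics-0"].append(m))
broker.subscribe("billing", lambda m: received["billing-0"].append(m))
broker.subscribe("billing", lambda m: received["billing-1"].append(m))

for i in range(4):
    broker.publish(i)
# analytics-0 sees all four messages; billing-0 and billing-1 split them
```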
    • Ack and redelivery/retries

      • Consumers tell brokers when they have consumed something
      • Retries/redelivery can change message order when combined with load balancing

  • Partitioned Logs

    • Unlike batch processing, you can’t rerun the same consumer and get the same results - typical brokers delete records once they’re consumed
    • Log-based message brokers combine some idea of storage with brokers
    • Kinda like tail -f
    • Partition log across multiple machines
    • Topic = group of partitions with records of same type
    • Monotonically increasing offset assigned to each message within a partition

    • Kafka/Kinesis/DistributedLog like this
    • Fault tolerance by replicating messages
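
A minimal in-memory sketch of the partitioned-log idea (the `PartitionedLog` class and hash-based partition choice are illustrative, loosely modeled on Kafka-style behavior):

```python
class PartitionedLog:
    """Toy log-based broker: each message is appended to one partition and
    gets a monotonically increasing offset *within that partition*."""
    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key, value):
        p = hash(key) % len(self.partitions)  # same key -> same partition
        self.partitions[p].append(value)
        return p, len(self.partitions[p]) - 1  # (partition, offset)

    def read(self, partition, offset):
        # Reading does not delete - any consumer can re-read from any offset.
        return self.partitions[partition][offset:]

log = PartitionedLog(2)
p_a, off_a = log.append(0, "a")   # integer keys keep the hash deterministic
p_b, off_b = log.append(0, "b")   # same key, same partition, next offset
p_c, off_c = log.append(1, "c")   # different key may land elsewhere
```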
  • Logs vs messaging

    • Fan out easy, reading message doesn’t delete it from log
    • Load balance by assigning partitions to consumers in group
      • Load balancing limited by # partitions
      • Single message slow to process will block that partition
  • Consumer offsets

    • Broker just tracks consumer offsets rather than ACKs for every record
    • Similar to log sequence number - message broker like leader, consumer like follower
    • Consumer fails, another consumer takes over, potentially reconsuming messages
  • Disk space usage - logs divided into segments and old ones deleted or archived

  • Can monitor how far behind consumer offsets are

  • Replaying old messages

    • Consumer can control its offset - rewinding and replaying makes stream processing more like batch processing
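
The offset-tracking and replay ideas can be sketched together (the `LogConsumer` class is a hypothetical illustration, not a real client API):

```python
class LogConsumer:
    """Toy consumer: the broker only needs to remember one offset per
    consumer, instead of an ack for every record."""
    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self):
        records = self.log[self.offset:]
        self.offset = len(self.log)  # advance the offset instead of deleting
        return records

    def seek(self, offset):
        self.offset = offset  # rewind to replay old messages

log = ["e1", "e2", "e3"]
c = LogConsumer(log)
first = c.poll()      # all three records
c.seek(1)             # rewind past-consumed records
replayed = c.poll()   # same records again, batch-style repeatability
```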
  • Databases and Streams

    • Events can also be db writes (replication log)
  • Syncing Systems

    • Dual writes (app writes to both systems directly) sometimes used when full db dumps are too slow
    • Race conditions possible

  • Change Data Capture

    • Implementing CDC
      • Derived data systems consume records
      • Make one db leader, other followers
      • Can use db triggers
      • Async, replication lag can occur
    • Initial snapshot
      • Can replicate db by replaying the log
        • To delete old log messages safely, take a db snapshot that corresponds to a known position in the change log
      • Log compaction can help
    • API support for change streams
      • eg Kafka Connect
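
A toy CDC sketch: replaying an ordered change log to keep a derived store in sync (the log format and key names here are invented for illustration):

```python
# The db's change log is the source of truth; a derived store (here, a
# dict standing in for a cache or search index) is kept in sync by
# applying changes in log order.
change_log = [
    ("set", "user:1", "Alice"),
    ("set", "user:2", "Bob"),
    ("set", "user:1", "Alicia"),   # later write wins - order matters
    ("delete", "user:2", None),
]

derived = {}
for op, key, value in change_log:
    if op == "set":
        derived[key] = value
    elif op == "delete":
        derived.pop(key, None)
# derived now reflects the db's final state
```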
  • Event Sourcing

    • Store all changes to app state as log of events
    • Immutable - records actions/intent at the application level, rather than low-level effects on the db (unlike CDC)
    • Can replay log to get current state
      • Log compaction trickier, usually snapshot state at checkpoints
    • Command = user request, which may still fail validation
    • Event = a command that has been validated and accepted
    • Once generated, an event is a fact - durable and immutable
    • Downstream consumers can’t reject events; only the initial command can be rejected
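
A minimal event-sourcing sketch: fold over immutable action events to rebuild current state (the shopping-cart event names are illustrative):

```python
# Events record *actions* ("item added"), not effects ("cart row updated").
events = [
    {"type": "item_added", "item": "book"},
    {"type": "item_added", "item": "pen"},
    {"type": "item_removed", "item": "book"},
]

def apply(state, event):
    """Pure function: old state + event -> new state."""
    cart = list(state)
    if event["type"] == "item_added":
        cart.append(event["item"])
    elif event["type"] == "item_removed":
        cart.remove(event["item"])
    return cart

state = []
for e in events:
    state = apply(state, e)
# Snapshotting `state` at a checkpoint avoids replaying the full log
# every time the app restarts.
```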
  • State, Streams and Immutability

    • Have permanent system of record/ledger
    • Can derive multiple views from same events
    • Don’t need to decide up front how data will be indexed/queried - new views can be derived from the events later
    • Concurrency control tricky - requires transactions and/or partitioning
    • Immutability has limits - storage grows without bound, and sometimes data must be deleted (e.g. PII/privacy regulations)
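
Deriving multiple independent views from one immutable event log, as a tiny sketch (the bank-ledger events are an invented example):

```python
# One permanent ledger of events ...
events = [("deposit", 100), ("withdraw", 30), ("deposit", 50)]

# ... and two read-optimized views derived from it, each of which can
# be rebuilt from scratch at any time.
balance = sum(amt if op == "deposit" else -amt for op, amt in events)  # view 1
history = [f"{op} {amt}" for op, amt in events]                        # view 2
```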
  • Processing Streams

  • Fault Tolerance