-
A lot of data is “unbounded” (never actually complete)
-
Stream processing: process every event as it happens
-
Transmitting Event Streams
- Parse binary into sequence of records/events
- Events generated once by producer, processed by one or more consumers
- Related events grouped into a topic/stream
- Relational dbs have triggers, but they are more limited
-
Messaging systems
-
What if producers send messages faster than consumers can process them? 3 options:
- drop messages
- buffer messages in queue
- backpressure, prevent producer producing more messages
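- A toy sketch of the three options in Python, assuming a single in-process producer/consumer pair (queue names are made up):

    import queue

    unbounded = queue.Queue()            # option 2: buffer without limit
    bounded = queue.Queue(maxsize=1000)  # options 1 and 3: bounded buffer

    def produce_drop(msg):
        try:
            bounded.put_nowait(msg)      # option 1: drop if the consumer is behind
        except queue.Full:
            pass                         # message silently discarded

    def produce_buffer(msg):
        unbounded.put(msg)               # option 2: queue grows until memory/disk runs out

    def produce_backpressure(msg):
        bounded.put(msg)                 # option 3: block the producer until space frees up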
-
What happens if queues fill up? System dependent
-
What happens if nodes crash/go offline?
- Tradeoff between throughput/latency and accepting loss of some records
-
Direct messaging from producers to consumers
- UDP multicast for stock markets
- Brokerless libraries like ZeroMQ
- StatsD uses UDP
- HTTP/RPC/Web Hooks to produce messages
- Require application code to handle possible message loss
- Retries are tricky (a crashed producer may lose the messages it was buffering)
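- Rough sketch of StatsD-style fire-and-forget metrics over UDP (host, port and metric name are made up); if a packet is lost or the receiver is down, the data is simply gone:

    import socket

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(b"page.views:1|c", ("localhost", 8125))  # increment a counter, no ack, no retry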
-
Message brokers/queues
- Question of durability moved to broker
- Usually queue messages - consumption happens asynchronously
- Comparison to DBs
- Some brokers can participate in two-phase commit (XA)
- Delete data after consumed
- Secondary indexes sort of like topic subscriptions
- Don’t support arbitrary queries, just notify-on-changes
-
Multiple consumers, 2 patterns:
- Load balancing: consumers share work of processing messages in topic
- Fan out: all messages delivered to all consumers

- Consumer groups combine these - each group receives all messages (fan out) but within a group only one consumer receives each message (load balance)
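- Sketch using the kafka-python client (broker address, topic and group names are assumptions): consumers sharing a group_id load-balance the topic's messages, while a consumer with a different group_id gets its own copy of every message (fan out across groups):

    from kafka import KafkaConsumer

    consumer = KafkaConsumer(
        "orders",
        group_id="billing",                   # all "billing" consumers share the work
        bootstrap_servers="localhost:9092",
    )
    for record in consumer:
        print(record.partition, record.offset, record.value)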
-
Ack and redelivery/retries
- Consumers tell brokers when they have consumed something
- Redelivery after a missing ack can reorder messages when combined with load balancing

-
Partitioned Logs
- With traditional brokers you can't run the same consumer again and get the same results, unlike batch processing - records are deleted once consumed
- Log-based message brokers combine some idea of storage with brokers
- Kinda like tail -f
- Partition log across multiple machines
- Topic = group of partitions with records of same type
- Monotonically increasing offset assigned to each message within a partition

- Kafka/Kinesis/DistributedLog like this
- Fault tolerance by replicating messages
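- Toy in-memory version of a partitioned log (names are made up, just to show the idea): records are hashed to a partition, appended, and given a monotonically increasing offset; reading never deletes anything:

    import hashlib

    class PartitionedLog:
        def __init__(self, num_partitions=4):
            self.partitions = [[] for _ in range(num_partitions)]

        def append(self, key, value):
            p = int(hashlib.md5(key).hexdigest(), 16) % len(self.partitions)
            self.partitions[p].append(value)
            return p, len(self.partitions[p]) - 1        # (partition, offset)

        def read(self, partition, offset):
            return self.partitions[partition][offset:]   # reading does not remove records

    log = PartitionedLog()
    print(log.append(b"user-42", b"clicked"))            # e.g. (1, 0)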
-
Logs vs messaging
- Fan out easy, reading message doesn’t delete it from log
- Load balance by assigning partitions to consumers in group
- Load balancing limited by # partitions (no more active consumers in a group than partitions)
- Single message slow to process will block that partition
-
Consumer offsets
- Broker just tracks consumer offsets rather than ACKs for every record
- Similar to log sequence number - message broker like leader, consumer like follower
- Consumer fails, another consumer takes over, potentially reconsuming messages
-
Disk space usage - logs divided into segments and old ones deleted or archived
-
Can monitor how far behind consumer offsets are
-
Replaying old messages
- Can control the consumer offset, which makes stream processing repeatable, more like batch processing
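- Sketch of rewinding a consumer with kafka-python (topic, partition and broker address are assumptions): seeking back to an old offset replays the log, much like re-running a batch job:

    from kafka import KafkaConsumer, TopicPartition

    consumer = KafkaConsumer(bootstrap_servers="localhost:9092",
                             enable_auto_commit=False)
    tp = TopicPartition("orders", 0)
    consumer.assign([tp])
    consumer.seek(tp, 0)                        # start again from the beginning of partition 0
    for record in consumer:
        print(record.offset, record.value)      # reprocess each message in order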
-
Databases and Streams
- Events can also be db writes (replication log)
-
Syncing Systems
- Dual writes (application writes to each system directly) sometimes used when full db dumps are too slow
- Race conditions possible - the systems can apply concurrent writes in different orders and silently diverge
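- Tiny illustration of the dual-write race (in-memory dicts stand in for the real systems): two clients' writes arrive in different orders at each system, so they diverge permanently:

    db, search_index = {}, {}

    db["x"] = "A"            # client A's write reaches the database first
    db["x"] = "B"            # ...then client B's
    search_index["x"] = "B"  # but B's write reaches the search index first
    search_index["x"] = "A"  # ...then A's
    assert db["x"] != search_index["x"]   # db says "B", index says "A" - inconsistent forever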

-
Change Data Capture

- Implementing CDC
- Derived data systems consume records
- Make the db of record the leader, derived data systems the followers
- Can use db triggers
- Async, replication lag can occur
- Initial snapshot
- Can replicate db by replaying the log
- In order to delete old messages, need a db snapshot that corresponds to a known position in the change log
- Log compaction can help (keep only the latest record for each key)
- API support for change streams
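- Toy sketch of a derived system consuming a compacted change log (keys and values are made up): keeping only the latest record per key, plus tombstones for deletes, is enough to rebuild a full copy of the table:

    changelog = [
        ("user:1", b"alice"),
        ("user:2", b"bob"),
        ("user:1", b"alicia"),   # later update to the same key
        ("user:2", None),        # tombstone: key was deleted
    ]

    def compact(log):
        latest = {}
        for key, value in log:   # later entries win
            latest[key] = value
        return [(k, v) for k, v in latest.items() if v is not None]

    derived = dict(compact(changelog))
    print(derived)               # {'user:1': b'alicia'} - same end state as replaying everything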
-
Event Sourcing
- Store all changes to app state as log of events
- Events are immutable and recorded at the application level - they capture actions/intent rather than their effects on state (unlike CDC, which captures low-level writes)
- Can replay log to get current state
- Log compaction trickier, usually snapshot state at checkpoints
- Command = user request, which may still be rejected by validation
- Event = a command that has been validated and accepted; once written it is an immutable fact
- Consumers can't reject events - by the time they see one, it is already a fact
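- Minimal event-sourcing sketch (the account example is made up): a command is validated against current state, becomes an immutable event, and current state is always derived by replaying the log:

    events = []   # append-only log of facts

    def replay(log):
        balance = 0
        for e in log:
            balance += e["amount"] if e["type"] == "deposit" else -e["amount"]
        return balance

    def handle_command(cmd):
        if cmd["type"] == "withdraw" and replay(events) < cmd["amount"]:
            raise ValueError("insufficient funds")   # a command can still be rejected here
        events.append(dict(cmd))                     # ...but once appended it is a fact

    handle_command({"type": "deposit", "amount": 100})
    handle_command({"type": "withdraw", "amount": 30})
    print(replay(events))   # 70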
-
State, Streams and Immutability

- Have permanent system of record/ledger
- Can derive multiple views from same events
- Don't need to decide up front how data will be indexed/queried - read views can be derived from the log later
- Concurrency control tricky - requires transactions and/or partitioning
- Immutability limited, you run out of storage or need to delete PII
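- Sketch of deriving two different read views from the same immutable event log (the events are made up); adding a new view later is just another replay:

    events = [
        {"user": "alice", "item": "book"},
        {"user": "bob", "item": "pen"},
        {"user": "alice", "item": "lamp"},
    ]

    orders_by_user = {}    # view 1: orders grouped by user
    item_counts = {}       # view 2: popularity of each item
    for e in events:
        orders_by_user.setdefault(e["user"], []).append(e["item"])
        item_counts[e["item"]] = item_counts.get(e["item"], 0) + 1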
-
Processing Streams
-
Fault Tolerance