• A lot of data is “unbounded” (never actually complete)

  • Stream processing: process every event as it happens

  • Transmitting Event Streams

    • Parse binary into sequence of records/events
    • Events generated once by producer, processed by one or more consumers
    • Related events grouped into a topic/stream
    • Relational dbs have triggers, but they are more limited
  • Messaging systems

    • What if producers make too many records for consumers to handle? 3 options:

      • drop messages
      • buffer messages in queue
      • backpressure: prevent the producer from sending more messages (flow control)
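
The three overload policies above can be sketched with a bounded queue (a toy Python illustration using the standard library, not any particular broker's API):

```python
import queue

# A bounded buffer: when it is full, the producer's choice of call
# decides the policy.
q = queue.Queue(maxsize=2)

q.put("a")  # accepted
q.put("b")  # accepted; queue is now full

# Policy 1: drop the message when the buffer is full.
dropped = False
try:
    q.put_nowait("c")
except queue.Full:
    dropped = True

# Policy 3: backpressure - a blocking q.put("c") would stall the
# producer until a consumer calls q.get(), slowing the producer down.
consumed = q.get()
q.put("c")  # succeeds now that a slot has freed up
```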
    • What happens if queues fill up? System dependent

    • What happens if nodes crash/go offline?

      • Tradeoff between throughput/latency and accepting loss of some records
    • Direct messaging from producers to consumers

      • UDP multicast for stock markets
      • Brokerless libraries like ZeroMQ
      • StatsD uses UDP
      • HTTP/RPC/Web Hooks to produce messages
      • Require application code to consider data loss
      • Retries tricky
    • Message brokers/queues

      • Question of durability moved to broker
      • Usually queue messages - consumption happens asynchronously
      • Comparison to DBs
        • Can do 2 phase commit
        • Delete data after consumed
        • Secondary indexes sort of like topic subscriptions
        • Don’t support arbitrary queries, just notify-on-changes
    • Multiple consumers, 2 patterns:

      • Load balancing: consumers share work of processing messages in topic
      • Fan out: all messages delivered to all consumers

      • Consumer groups combine these - each group receives all messages (fan out) but within a group only one consumer receives each message (load balance)
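
The consumer-group combination above can be sketched as a toy broker (the `Broker` class and round-robin dispatch are illustrative assumptions, not a real broker's API):

```python
from collections import defaultdict
from itertools import cycle

class Broker:
    """Toy broker: fan-out across consumer groups, load-balance within a group."""
    def __init__(self):
        self.groups = {}  # group name -> list of consumer callbacks
        self._rr = {}     # group name -> round-robin iterator

    def subscribe(self, group, consumer):
        self.groups.setdefault(group, []).append(consumer)
        self._rr[group] = cycle(self.groups[group])

    def publish(self, message):
        # Every group sees every message (fan-out) ...
        for group in self.groups:
            # ... but only one consumer in the group handles it (load balancing).
            next(self._rr[group])(message)

received = defaultdict(list)
broker = Broker()
broker.subscribe("analytics", lambda m: received["analytics-0"].append(m))
broker.subscribe("billing", lambda m: received["billing-0"].append(m))
broker.subscribe("billing", lambda m: received["billing-1"].append(m))

for i in range(4):
    broker.publish(i)
# analytics-0 sees all four messages; billing-0 and billing-1 split them
```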
    • Ack and redelivery/retries

      • Consumers tell brokers when they have consumed something
      • Retries/redelivery can change message order when combined with load balancing

  • Partitioned Logs

    • Unlike batch processing, you can’t rerun the same consumer and get the same results - typical brokers delete records once they’re consumed
    • Log-based message brokers combine some idea of storage with brokers
    • Kinda like tail -f
    • Partition log across multiple machines
    • Topic = group of partitions with records of same type
    • Monotonically increasing offset assigned to each message within a partition

    • Kafka/Kinesis/DistributedLog like this
    • Fault tolerance by replicating messages
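
A minimal in-memory sketch of the partitioned-log idea (the `PartitionedLog` class and hash-based partition choice are illustrative, loosely modeled on Kafka-style behavior):

```python
class PartitionedLog:
    """Toy log-based broker: each message is appended to one partition and
    gets a monotonically increasing offset *within that partition*."""
    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key, value):
        p = hash(key) % len(self.partitions)  # same key -> same partition
        self.partitions[p].append(value)
        return p, len(self.partitions[p]) - 1  # (partition, offset)

    def read(self, partition, offset):
        # Reading does not delete - any consumer can re-read from any offset.
        return self.partitions[partition][offset:]

log = PartitionedLog(2)
p_a, off_a = log.append(0, "a")   # integer keys keep the hash deterministic
p_b, off_b = log.append(0, "b")   # same key, same partition, next offset
p_c, off_c = log.append(1, "c")   # different key may land elsewhere
```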
  • Logs vs messaging

    • Fan out easy, reading message doesn’t delete it from log
    • Load balance by assigning partitions to consumers in group
      • Load balancing limited by # partitions
      • Single message slow to process will block that partition
  • Consumer offsets

    • Broker just tracks consumer offsets rather than ACKs for every record
    • Similar to log sequence number - message broker like leader, consumer like follower
    • Consumer fails, another consumer takes over, potentially reconsuming messages
  • Disk space usage - logs divided into segments and old ones deleted or archived

  • Can monitor how far behind consumer offsets are

  • Replaying old messages

    • Consumer can control its offset - rewinding and replaying makes stream processing more like batch processing
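
The offset-tracking and replay ideas can be sketched together (the `LogConsumer` class is a hypothetical illustration, not a real client API):

```python
class LogConsumer:
    """Toy consumer: the broker only needs to remember one offset per
    consumer, instead of an ack for every record."""
    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self):
        records = self.log[self.offset:]
        self.offset = len(self.log)  # advance the offset instead of deleting
        return records

    def seek(self, offset):
        self.offset = offset  # rewind to replay old messages

log = ["e1", "e2", "e3"]
c = LogConsumer(log)
first = c.poll()      # all three records
c.seek(1)             # rewind past-consumed records
replayed = c.poll()   # same records again, batch-style repeatability
```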
  • Databases and Streams

    • Events can also be db writes (replication log)
  • Syncing Systems

    • Dual writes (app writes to both systems directly) sometimes used when full db dumps are too slow
    • Race conditions possible

  • Change Data Capture

    • Implementing CDC
      • Derived data systems consume records
      • Make one db leader, other followers
      • Can use db triggers
      • Async, replication lag can occur
    • Initial snapshot
      • Can replicate db by replaying the log
        • To delete old log messages safely, take a db snapshot that corresponds to a known position in the change log
      • Log compaction can help
    • API support for change streams
      • eg Kafka Connect
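
A toy CDC sketch: replaying an ordered change log to keep a derived store in sync (the log format and key names here are invented for illustration):

```python
# The db's change log is the source of truth; a derived store (here, a
# dict standing in for a cache or search index) is kept in sync by
# applying changes in log order.
change_log = [
    ("set", "user:1", "Alice"),
    ("set", "user:2", "Bob"),
    ("set", "user:1", "Alicia"),   # later write wins - order matters
    ("delete", "user:2", None),
]

derived = {}
for op, key, value in change_log:
    if op == "set":
        derived[key] = value
    elif op == "delete":
        derived.pop(key, None)
# derived now reflects the db's final state
```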
  • Event Sourcing

    • Store all changes to app state as log of events
    • Immutable - records actions/intent at the application level, rather than low-level effects on the db (unlike CDC)
    • Can replay log to get current state
      • Log compaction trickier, usually snapshot state at checkpoints
    • Command = user request, which may still fail validation
    • Event = a command that has been validated and accepted
    • Once generated, an event is a fact - durable and immutable
    • Downstream consumers can’t reject events; only the initial command can be rejected
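
A minimal event-sourcing sketch: fold over immutable action events to rebuild current state (the shopping-cart event names are illustrative):

```python
# Events record *actions* ("item added"), not effects ("cart row updated").
events = [
    {"type": "item_added", "item": "book"},
    {"type": "item_added", "item": "pen"},
    {"type": "item_removed", "item": "book"},
]

def apply(state, event):
    """Pure function: old state + event -> new state."""
    cart = list(state)
    if event["type"] == "item_added":
        cart.append(event["item"])
    elif event["type"] == "item_removed":
        cart.remove(event["item"])
    return cart

state = []
for e in events:
    state = apply(state, e)
# Snapshotting `state` at a checkpoint avoids replaying the full log
# every time the app restarts.
```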
  • State, Streams and Immutability

    • Have permanent system of record/ledger
    • Can derive multiple views from same events
    • Don’t need to decide up front how data will be indexed/queried - new views can be derived from the events later
    • Concurrency control tricky - requires transactions and/or partitioning
    • Immutability has limits - storage grows without bound, and sometimes data must be deleted (e.g. PII/privacy regulations)
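
Deriving multiple independent views from one immutable event log, as a tiny sketch (the bank-ledger events are an invented example):

```python
# One permanent ledger of events ...
events = [("deposit", 100), ("withdraw", 30), ("deposit", 50)]

# ... and two read-optimized views derived from it, each of which can
# be rebuilt from scratch at any time.
balance = sum(amt if op == "deposit" else -amt for op, amt in events)  # view 1
history = [f"{op} {amt}" for op, amt in events]                        # view 2
```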
  • Processing Streams

  • Fault Tolerance