• Often, changing an app's features means changing the data it stores

  • When data formats or schemas change, application code usually has to change as well

  • Old and new versions of data and code will exist at the same time

    • Server-side apps can do rolling upgrades, updating one node at a time
    • Client-side apps depend on users upgrading themselves - no guarantee of when old versions (and the data they wrote) will go away
  • Maintaining functionality = maintaining compatibility

    • Backward compatibility = new code can read data written by older code (easier, know the past)
    • Forward compatibility = old code can read data written by new code (trickier, “predict” the future)
  • Formats for encoding data

    • In memory, data structures are optimized for efficient access and manipulation by the CPU (usually using pointers)
    • On disk or over the network, data must be encoded as a self-contained sequence of bytes - quite different from the in-memory representation
    • ENCODING (serialization, marshalling) = in memory → bytes
    • DECODING (parsing, deserialization, unmarshalling) = bytes → in memory
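
    A minimal Python illustration of the two directions (json here is just a stand-in byte format):

    ```python
    import json

    # In-memory representation: structures optimized for CPU access
    user = {"name": "Martin", "interests": ["daydreaming", "hacking"]}

    # Encoding (serialization, marshalling): in-memory -> self-contained bytes
    data = json.dumps(user).encode("utf-8")

    # Decoding (parsing, deserialization, unmarshalling): bytes -> in-memory
    assert json.loads(data.decode("utf-8")) == user
    ```
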
  • Language specific formats

    • Java has java.io.Serializable, Ruby has Marshal, Python has pickle, etc
    • These promote language lock-in
    • Security issues - potential for executing arbitrary code during decoding
    • Often not forwards and backwards compatible
    • Often not performant
    • Generally avoid these except for very transient purposes (see the pickle sketch below)
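
    For example, Python's pickle round-trips arbitrary objects, but decoding untrusted bytes can execute attacker-chosen code:

    ```python
    import pickle

    record = {"user": "Martin", "interests": ["daydreaming", "hacking"]}

    data = pickle.dumps(record)          # Python-only byte format (lock-in)
    assert pickle.loads(data) == record  # round-trips fine...

    # ...but pickle.loads() on attacker-controlled bytes can run arbitrary
    # code (objects can specify their own reconstruction via __reduce__),
    # so never unpickle untrusted input.
    ```
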
  • JSON, XML, and Binary variants

    • Very popular, lots of criticisms
    • Human readable
    • Ambiguity around numbers: XML and CSV can't distinguish a number from a string of digits
      • JSON distinguishes numbers from strings, but not integers from floats, and specifies no precision - integers above 2^53 get mangled by languages that parse every number as an IEEE 754 double (e.g. JavaScript)
    • Good support for Unicode text, but none for binary strings. People work around this with base64 encoding, but base64 carries only 6 bits per character rather than 8 bits per byte, so encoded data is ~33% larger (both pitfalls demonstrated below)
    • JSON & XML have optional schema support, but it’s complicated and not used everywhere
      • CSV is vaguer still - schemaless, and escaping rules aren't consistently implemented
    • Good for inter-organizational transport (e.g. emailing data around)
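
    Both pitfalls in Python (Python ints are arbitrary-precision, so the float() cast stands in for languages like JavaScript that parse every number as a double):

    ```python
    import base64
    import json

    # Precision: 2**53 + 1 is perfectly legal JSON, but it is not
    # representable as an IEEE 754 double
    n = 2**53 + 1
    assert json.loads(json.dumps(n)) == n  # Python keeps exact integers
    assert float(n) == float(2**53)        # a double silently drops the +1

    # Binary data: base64 carries 6 bits per character, so size grows by 4/3
    blob = bytes(300)
    assert len(base64.b64encode(blob)) == 400
    ```
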
  • Binary encoding

    • Good for within an org and larger datasets
    • There are binary formats for JSON & XML (MessagePack, BSON, WBXML, etc), some of which add typing
    • Still no schema, so the object's field names have to be encoded as strings in the data itself
  • Binary encoding example - JSON is 81 bytes

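    The running example record (reconstructed from the book as I noted it; 81 bytes assumes whitespace-stripped JSON):

    ```python
    import json

    record = {
        "userName": "Martin",
        "favoriteNumber": 1337,
        "interests": ["daydreaming", "hacking"],
    }

    encoded = json.dumps(record, separators=(",", ":")).encode("utf-8")
    assert len(encoded) == 81
    ```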

    • MessagePack - 66 bytes

      • some bytes represent type and length, others encode the value

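      A sketch using the msgpack pip package (assumed) on the same record - field names still travel as strings, which is why the savings are modest:

      ```python
      import msgpack  # pip install msgpack

      record = {
          "userName": "Martin",
          "favoriteNumber": 1337,
          "interests": ["daydreaming", "hacking"],
      }

      encoded = msgpack.packb(record)
      # First bytes: 0x83 = map with 3 entries, 0xa8 = 8-byte string follows
      assert encoded[:2] == b"\x83\xa8"
      assert len(encoded) == 66
      assert msgpack.unpackb(encoded) == record
      ```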

    • Thrift & Protocol Buffers (protobufs)

      • Require a schema
        • required/optional markers don't affect the binary encoding - required only enables a runtime check that the field is set
      • Come with code generation tools that take in schemas and produce classes implementing the schema in different programming languages
      • Thrift’s BinaryProtocol - 59 bytes
        • No field names - just field tags referencing the schema

      (figure: byte-by-byte breakdown of the 59-byte BinaryProtocol encoding)

      • Thrift’s CompactProtocol - 34 bytes
        • Encodes the type and field tag into one byte
        • Uses variable-length integers (bigger numbers take more bytes; sketched below)

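      A sketch of the variable-length integer idea (base-128 varints, as in CompactProtocol and protobufs; CompactProtocol additionally zigzag-encodes signed values first so small negatives stay small):

      ```python
      def encode_varint(n: int) -> bytes:
          """Base-128 varint: 7 payload bits per byte, high bit = more bytes follow."""
          out = bytearray()
          while True:
              byte = n & 0x7F
              n >>= 7
              if n:
                  out.append(byte | 0x80)  # continuation bit set
              else:
                  out.append(byte)
                  return bytes(out)

      def zigzag64(n: int) -> int:
          """Map signed ints to unsigned: 0, -1, 1, -2, ... -> 0, 1, 2, 3, ..."""
          return (n << 1) ^ (n >> 63)

      assert encode_varint(1) == b"\x01"                    # small value: 1 byte
      assert encode_varint(1337) == b"\xb9\x0a"             # protobuf int64 style
      assert encode_varint(zigzag64(1337)) == b"\xf2\x14"   # CompactProtocol style
      ```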

      • Protobufs encode quite similarly to CompactProtocol - 33 bytes (codegen sketch below)

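      A hedged sketch of the codegen workflow (the .proto mirrors the book's example; person_pb2 is the module name protoc would generate):

      ```python
      # person.proto (proto2 syntax):
      #   message Person {
      #     required string user_name       = 1;
      #     optional int64  favorite_number = 2;
      #     repeated string interests       = 3;
      #   }
      # Generated with: protoc --python_out=. person.proto

      import person_pb2  # assumed generated module

      p = person_pb2.Person(
          user_name="Martin",
          favorite_number=1337,
          interests=["daydreaming", "hacking"],
      )
      data = p.SerializeToString()  # the 33-byte encoding noted above

      q = person_pb2.Person()
      q.ParseFromString(data)       # decoding uses the same schema-derived class
      assert q.user_name == "Martin"
      ```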

  • Binary Protocol Field and Schema Evolution

    • Field tags are critical to the meaning of encoded data in thrift and protobufs
    • Can change the name of the field in the schema, but not its tag
    • Can add new fields (and associated field tags) to the schema
      • old code reading new data (forwards compat) can simply skip unknown fields, since the wire type/length says how many bytes to skip (sketched after this list)
    • To maintain backwards compat, new fields cannot be marked required (must be optional or have default)
    • You can remove an optional field, making sure you never reuse that field tag number
    • Type evolution is possible as long as you don’t cause truncation by precision reduction (e.g. 64 bit to 32 bit ints)
    • Lists:
      • protobufs don't have a list datatype, just a repeated marker on fields - an optional field can later be changed to repeated
      • Thrift has a dedicated list datatype, which supports nested lists (but not that optional → repeated evolution)
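
    A sketch of why the forward-compat skip works at the wire level, using protobuf's framing (each field key is a varint of field_tag << 3 | wire_type, and the wire type alone tells old code how far to skip):

    ```python
    def read_varint(buf: bytes, i: int) -> tuple[int, int]:
        """Decode a base-128 varint at buf[i]; return (value, next index)."""
        value, shift = 0, 0
        while True:
            b = buf[i]
            i += 1
            value |= (b & 0x7F) << shift
            if not b & 0x80:
                return value, i
            shift += 7

    def skip_unknown_field(buf: bytes, i: int, wire_type: int) -> int:
        """Skip a field old code has never heard of, without a schema for it."""
        if wire_type == 0:                    # varint
            _, i = read_varint(buf, i)
        elif wire_type == 1:                  # fixed 64-bit
            i += 8
        elif wire_type == 2:                  # length-delimited (strings, messages)
            length, i = read_varint(buf, i)
            i += length
        elif wire_type == 5:                  # fixed 32-bit
            i += 4
        return i
    ```
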
  • Apache Avro - 32 bytes

    • Another binary encoding format, started for Hadoop
    • Two schema languages: Avro IDL for human editing, and a JSON-based one that's more machine-readable
    • No tag numbers in the schema and no type info in the encoding (a string is just a length prefix followed by UTF-8 bytes)
    • Decoding requires going through fields in order in the schema to determine datatypes
      • Need exact same schema with which data was encoded to properly decode

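    A sketch with the fastavro pip package (assumed), using a schema shaped like the book's; schemaless_writer emits just the datum, which should land on the 32 bytes above:

    ```python
    import io

    import fastavro  # pip install fastavro

    schema = {
        "type": "record",
        "name": "Person",
        "fields": [
            {"name": "userName", "type": "string"},
            {"name": "favoriteNumber", "type": ["null", "long"], "default": None},
            {"name": "interests", "type": {"type": "array", "items": "string"}},
        ],
    }
    record = {
        "userName": "Martin",
        "favoriteNumber": 1337,
        "interests": ["daydreaming", "hacking"],
    }

    buf = io.BytesIO()
    fastavro.schemaless_writer(buf, schema, record)
    assert len(buf.getvalue()) == 32  # no field names, tags, or types in the bytes

    buf.seek(0)  # decoding requires the writer's exact schema
    assert fastavro.schemaless_reader(buf, schema) == record
    ```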

    • Schema evolution: writer’s schema and reader’s schema compatibility

      • When data decoded, avro compares and reconciles writer’s and reader’s schemas

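      Continuing the fastavro sketch (self-contained again): a reader whose schema adds a defaulted field can still decode old data, because Avro resolves the two schemas at read time:

      ```python
      import io

      import fastavro  # pip install fastavro

      writer_schema = {  # what the data was encoded with
          "type": "record", "name": "Person",
          "fields": [{"name": "userName", "type": "string"}],
      }
      reader_schema = {  # newer code added a field, with a default
          "type": "record", "name": "Person",
          "fields": [
              {"name": "userName", "type": "string"},
              {"name": "photoUrl", "type": ["null", "string"], "default": None},
          ],
      }

      buf = io.BytesIO()
      fastavro.schemaless_writer(buf, writer_schema, {"userName": "Martin"})
      buf.seek(0)

      decoded = fastavro.schemaless_reader(buf, writer_schema, reader_schema)
      assert decoded["photoUrl"] is None  # gap filled from the reader's default
      ```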

      • To maintain forwards/backwards compat, can only add/remove fields with default vals
      • Readers get the writer’s schema either from the start of a large file, from a database via a version number, or negotiated at the start of a network connection
    • Avro is friendlier to dynamically generated schemas - on every schema change you can simply regenerate the Avro schema, with no field tag numbers to keep consistent the way protobufs/thrift require

  • Many databases have their own binary protocol and driver to encode/decode (e.g. ODBC, ClickHouse Native protocol)

  • Schemas are nice

    • Allows for code generation for statically typed languages
    • Can omit field names from data (more compact)
    • Schema is valuable documentation
    • Can check forwards/backwards compat of a schema change before deploying anything, using just a database of schema versions
  • Dataflow through Databases

    • With a single process writing and reading, it's like sending a message to your future self - needs backward compat
    • Forwards compat is also required when multiple processes access the db, e.g. during a rolling upgrade, when older code reads data written by newer code
    • “Data outlives code”
      • Migrations are always possible but expensive; avoiding rewrites is preferred - e.g. add a nullable column with a default and don't backfill existing rows
      • When archiving/snapshotting data, can encode everything with the latest schema
    • Danger: an old process can read a record, ignore the new fields it doesn't know about, and silently drop them when writing the record back (sketched below)

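    A sketch of the hazard with a JSON row and hypothetical field names - old code keeps only the fields it knows, so a read-modify-write cycle drops the new one:

    ```python
    import json

    # Row written by NEW code, including a field old code doesn't know about
    row = '{"user_name": "Martin", "favorite_number": 1337, "photo_url": "..."}'

    KNOWN_FIELDS = {"user_name", "favorite_number"}  # old code's model

    record = {k: v for k, v in json.loads(row).items() if k in KNOWN_FIELDS}
    record["favorite_number"] += 1                   # the intended update
    print(json.dumps(record))                        # photo_url silently lost
    ```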

  • Dataflow through Services: REST & RPC

  • Web service

  • Problems with Remote Procedure Calls (RPCs)

  • Current direction for RPC

  • Data encoding + evolution for RPC

  • Message passing dataflow