• Often, changing an app's features means changing the data it stores

  • When data formats or schemas change, application code usually has to change as well

  • Old and new versions of data and code will exist at the same time

    • Server-side apps can do rolling upgrades, updating one node at a time
    • Client-side apps depend on users upgrading themselves - no guarantee of when old versions (and the data they wrote) will go away
  • Maintaining functionality = maintaining compatibility

    • Backward compatibility = new code can read data written by older code (easier, know the past)
    • Forward compatibility = old code can read data written by new code (trickier, “predict” the future)
  • Formats for encoding data

    • In memory, data structures are optimized for efficient access and manipulation by the CPU (usually using pointers)
    • On disk or over the network, data must be encoded as a self-contained sequence of bytes - quite different from the in-memory representation
    • ENCODING (serialization, marshalling) = in memory → bytes
    • DECODING (parsing, deserialization, unmarshalling) = bytes → in memory
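
    A minimal Python illustration of the two directions (json here is just a stand-in byte format):

    ```python
    import json

    # In-memory representation: structures optimized for CPU access
    user = {"name": "Martin", "interests": ["daydreaming", "hacking"]}

    # Encoding (serialization, marshalling): in-memory -> self-contained bytes
    data = json.dumps(user).encode("utf-8")

    # Decoding (parsing, deserialization, unmarshalling): bytes -> in-memory
    assert json.loads(data.decode("utf-8")) == user
    ```
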
  • Language specific formats

    • Java has java.io.Serializable, Ruby has Marshal, Python has pickle, etc
    • These promote language lock-in
    • Security issues - potential for executing arbitrary code during decoding
    • Often not forwards and backwards compatible
    • Often not performant
    • Generally avoid these except for very transient purposes (see the pickle sketch below)
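
    For example, Python's pickle round-trips arbitrary objects, but decoding untrusted bytes can execute attacker-chosen code:

    ```python
    import pickle

    record = {"user": "Martin", "interests": ["daydreaming", "hacking"]}

    data = pickle.dumps(record)          # Python-only byte format (lock-in)
    assert pickle.loads(data) == record  # round-trips fine...

    # ...but pickle.loads() on attacker-controlled bytes can run arbitrary
    # code (objects can specify their own reconstruction via __reduce__),
    # so never unpickle untrusted input.
    ```
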
  • JSON, XML, and Binary variants

    • Very popular, lots of criticisms
    • Human readable
    • Ambiguity around numbers: XML and CSV can't distinguish a number from a string of digits
      • JSON distinguishes numbers from strings, but not integers from floats, and specifies no precision - integers above 2^53 get mangled by languages that parse every number as an IEEE 754 double (e.g. JavaScript)
    • Good support for Unicode text, but none for binary strings. People work around this with base64 encoding, but base64 carries only 6 bits per character rather than 8 bits per byte, so encoded data is ~33% larger (both pitfalls demonstrated below)
    • JSON & XML have optional schema support, but it’s complicated and not used everywhere
      • CSV is vaguer still - schemaless, and escaping rules aren't consistently implemented
    • Good for inter-organizational transport (e.g. emailing data around)
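
    Both pitfalls in Python (Python ints are arbitrary-precision, so the float() cast stands in for languages like JavaScript that parse every number as a double):

    ```python
    import base64
    import json

    # Precision: 2**53 + 1 is perfectly legal JSON, but it is not
    # representable as an IEEE 754 double
    n = 2**53 + 1
    assert json.loads(json.dumps(n)) == n  # Python keeps exact integers
    assert float(n) == float(2**53)        # a double silently drops the +1

    # Binary data: base64 carries 6 bits per character, so size grows by 4/3
    blob = bytes(300)
    assert len(base64.b64encode(blob)) == 400
    ```
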
  • Binary encoding

    • Good for within an org and larger datasets
    • There are binary formats for JSON & XML (MessagePack, BSON, WBXML, etc), some of which add typing
    • Still no schema, so the object's field names have to be encoded as strings in the data itself
  • Binary encoding example - JSON is 81 bytes

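    The running example record (reconstructed from the book as I noted it; 81 bytes assumes whitespace-stripped JSON):

    ```python
    import json

    record = {
        "userName": "Martin",
        "favoriteNumber": 1337,
        "interests": ["daydreaming", "hacking"],
    }

    encoded = json.dumps(record, separators=(",", ":")).encode("utf-8")
    assert len(encoded) == 81
    ```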

    • MessagePack - 66 bytes

      • some bytes represent type and length, others encode the value

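      A sketch using the msgpack pip package (assumed) on the same record - field names still travel as strings, which is why the savings are modest:

      ```python
      import msgpack  # pip install msgpack

      record = {
          "userName": "Martin",
          "favoriteNumber": 1337,
          "interests": ["daydreaming", "hacking"],
      }

      encoded = msgpack.packb(record)
      # First bytes: 0x83 = map with 3 entries, 0xa8 = 8-byte string follows
      assert encoded[:2] == b"\x83\xa8"
      assert len(encoded) == 66
      assert msgpack.unpackb(encoded) == record
      ```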

    • Thrift & Protocol Buffers (protobufs)

      • Require a schema
        • required/optional markers don't affect the binary encoding - required only enables a runtime check that the field is set
      • Come with code generation tools that take in schemas and produce classes implementing the schema in different programming languages
      • Thrift’s BinaryProtocol - 59 bytes
        • No field names - just field tags referencing the schema

      (figure: byte-by-byte breakdown of the 59-byte BinaryProtocol encoding)

      • Thrift’s CompactProtocol - 34 bytes
        • Encodes the type and field tag into one byte
        • Uses variable-length integers (bigger numbers take more bytes; sketched below)

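      A sketch of the variable-length integer idea (base-128 varints, as in CompactProtocol and protobufs; CompactProtocol additionally zigzag-encodes signed values first so small negatives stay small):

      ```python
      def encode_varint(n: int) -> bytes:
          """Base-128 varint: 7 payload bits per byte, high bit = more bytes follow."""
          out = bytearray()
          while True:
              byte = n & 0x7F
              n >>= 7
              if n:
                  out.append(byte | 0x80)  # continuation bit set
              else:
                  out.append(byte)
                  return bytes(out)

      def zigzag64(n: int) -> int:
          """Map signed ints to unsigned: 0, -1, 1, -2, ... -> 0, 1, 2, 3, ..."""
          return (n << 1) ^ (n >> 63)

      assert encode_varint(1) == b"\x01"                    # small value: 1 byte
      assert encode_varint(1337) == b"\xb9\x0a"             # protobuf int64 style
      assert encode_varint(zigzag64(1337)) == b"\xf2\x14"   # CompactProtocol style
      ```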

      • Protobufs encode quite similarly to CompactProtocol - 33 bytes (codegen sketch below)

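      A hedged sketch of the codegen workflow (the .proto mirrors the book's example; person_pb2 is the module name protoc would generate):

      ```python
      # person.proto (proto2 syntax):
      #   message Person {
      #     required string user_name       = 1;
      #     optional int64  favorite_number = 2;
      #     repeated string interests       = 3;
      #   }
      # Generated with: protoc --python_out=. person.proto

      import person_pb2  # assumed generated module

      p = person_pb2.Person(
          user_name="Martin",
          favorite_number=1337,
          interests=["daydreaming", "hacking"],
      )
      data = p.SerializeToString()  # the 33-byte encoding noted above

      q = person_pb2.Person()
      q.ParseFromString(data)       # decoding uses the same schema-derived class
      assert q.user_name == "Martin"
      ```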

  • Binary Protocol Field and Schema Evolution

    • Field tags are critical to the meaning of encoded data in thrift and protobufs
    • Can change the name of the field in the schema, but not its tag
    • Can add new fields (and associated field tags) to the schema
      • old code reading new data (forwards compat) can simply skip unknown fields, since the wire type/length says how many bytes to skip (sketched after this list)
    • To maintain backwards compat, new fields cannot be marked required (must be optional or have default)
    • You can remove an optional field, making sure you never reuse that field tag number
    • Type evolution is possible as long as you don’t cause truncation by precision reduction (e.g. 64 bit to 32 bit ints)
    • Lists:
      • protobufs don't have a list datatype, just a repeated marker on fields - an optional field can later be changed to repeated
      • Thrift has a dedicated list datatype, which supports nested lists (but not that optional → repeated evolution)
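
    A sketch of why the forward-compat skip works at the wire level, using protobuf's framing (each field key is a varint of field_tag << 3 | wire_type, and the wire type alone tells old code how far to skip):

    ```python
    def read_varint(buf: bytes, i: int) -> tuple[int, int]:
        """Decode a base-128 varint at buf[i]; return (value, next index)."""
        value, shift = 0, 0
        while True:
            b = buf[i]
            i += 1
            value |= (b & 0x7F) << shift
            if not b & 0x80:
                return value, i
            shift += 7

    def skip_unknown_field(buf: bytes, i: int, wire_type: int) -> int:
        """Skip a field old code has never heard of, without a schema for it."""
        if wire_type == 0:                    # varint
            _, i = read_varint(buf, i)
        elif wire_type == 1:                  # fixed 64-bit
            i += 8
        elif wire_type == 2:                  # length-delimited (strings, messages)
            length, i = read_varint(buf, i)
            i += length
        elif wire_type == 5:                  # fixed 32-bit
            i += 4
        return i
    ```
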
  • Apache Avro - 32 bytes

    • Another binary encoding format, started for Hadoop
    • Two schema languages: Avro IDL for human editing, and a JSON-based one that's more machine-readable
    • No tag numbers in the schema and no type info in the encoding (a string is just a length prefix followed by UTF-8 bytes)
    • Decoding requires going through fields in order in the schema to determine datatypes
      • Need exact same schema with which data was encoded to properly decode

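    A sketch with the fastavro pip package (assumed), using a schema shaped like the book's; schemaless_writer emits just the datum, which should land on the 32 bytes above:

    ```python
    import io

    import fastavro  # pip install fastavro

    schema = {
        "type": "record",
        "name": "Person",
        "fields": [
            {"name": "userName", "type": "string"},
            {"name": "favoriteNumber", "type": ["null", "long"], "default": None},
            {"name": "interests", "type": {"type": "array", "items": "string"}},
        ],
    }
    record = {
        "userName": "Martin",
        "favoriteNumber": 1337,
        "interests": ["daydreaming", "hacking"],
    }

    buf = io.BytesIO()
    fastavro.schemaless_writer(buf, schema, record)
    assert len(buf.getvalue()) == 32  # no field names, tags, or types in the bytes

    buf.seek(0)  # decoding requires the writer's exact schema
    assert fastavro.schemaless_reader(buf, schema) == record
    ```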

    • Schema evolution: writer’s schema and reader’s schema compatibility

      • When data decoded, avro compares and reconciles writer’s and reader’s schemas

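      Continuing the fastavro sketch (self-contained again): a reader whose schema adds a defaulted field can still decode old data, because Avro resolves the two schemas at read time:

      ```python
      import io

      import fastavro  # pip install fastavro

      writer_schema = {  # what the data was encoded with
          "type": "record", "name": "Person",
          "fields": [{"name": "userName", "type": "string"}],
      }
      reader_schema = {  # newer code added a field, with a default
          "type": "record", "name": "Person",
          "fields": [
              {"name": "userName", "type": "string"},
              {"name": "photoUrl", "type": ["null", "string"], "default": None},
          ],
      }

      buf = io.BytesIO()
      fastavro.schemaless_writer(buf, writer_schema, {"userName": "Martin"})
      buf.seek(0)

      decoded = fastavro.schemaless_reader(buf, writer_schema, reader_schema)
      assert decoded["photoUrl"] is None  # gap filled from the reader's default
      ```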

      • To maintain forwards/backwards compat, can only add/remove fields with default vals
      • Readers get the writer’s schema either from the start of a large file, from a database via a version number, or negotiated at the start of a network connection
    • Avro is friendlier to dynamically generated schemas - on every schema change you can simply regenerate the Avro schema, with no field tag numbers to keep consistent the way protobufs/thrift require

  • Many databases have their own binary protocol and driver to encode/decode (e.g. ODBC, ClickHouse Native protocol)

  • Schemas are nice

    • Allows for code generation for statically typed languages
    • Can omit field names from data (more compact)
    • Schema is valuable documentation
    • Can check forwards/backwards compat of a schema change before deploying anything, using just a database of schema versions
  • Dataflow through Databases

    • With a single process writing and reading, it's like sending a message to your future self - needs backward compat
    • Forwards compat is also required when multiple processes access the db, e.g. during a rolling upgrade, when older code reads data written by newer code
    • “Data outlives code”
      • Migrations are always possible but expensive; avoiding rewrites is preferred - e.g. add a nullable column with a default and don't backfill existing rows
      • When archiving/snapshotting data, can encode everything with the latest schema
    • Danger: an old process can read a record, ignore the new fields it doesn't know about, and silently drop them when writing the record back (sketched below)

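    A sketch of the hazard with a JSON row and hypothetical field names - old code keeps only the fields it knows, so a read-modify-write cycle drops the new one:

    ```python
    import json

    # Row written by NEW code, including a field old code doesn't know about
    row = '{"user_name": "Martin", "favorite_number": 1337, "photo_url": "..."}'

    KNOWN_FIELDS = {"user_name", "favorite_number"}  # old code's model

    record = {k: v for k, v in json.loads(row).items() if k in KNOWN_FIELDS}
    record["favorite_number"] += 1                   # the intended update
    print(json.dumps(record))                        # photo_url silently lost
    ```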

  • Dataflow through Services: REST & RPC

  • Web service

  • Problems with Remote Procedure Calls (RPCs)

  • Current direction for RPC

  • Data encoding + evolution for RPC

  • Message passing dataflow