-
Often, changing an app's features means changing the data it stores
-
When data format or schemas change, application code has to as well
-
Old and new versions of data and code will exist at the same time
- Server-side apps perform rolling upgrades, updating one node at a time
- Client-side apps depend on users upgrading themselves - no guarantee of when old versions (and their data) go away
-
Maintaining functionality = maintaining compatibility
- Backward compatibility = new code can read data written by older code (easier, know the past)
- Forward compatibility = old code can read data written by new code (trickier, “predict” the future)
-
Formats for encoding data
- In memory, data structures are optimized for efficient access and manipulation by the CPU (usually using pointers)
- On disk or over networks, data must be encoded as a self-contained sequence of bytes - very different from the in-memory representation
- ENCODING (serialization, marshalling) = in memory → bytes
- DECODING (parsing, deserialization, unmarshalling) = bytes → in memory
-
Language specific formats
- Java has Serializable, Ruby has Marshal, Python has pickle, etc
- These promote language lock-in
- Security issues - potential for executing arbitrary code during decoding
- Often not forwards and backwards compatible
- Often not performant
- Generally avoid except for super transient work
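A minimal sketch of the lock-in and security points above, using Python's pickle (the specific payload class here is just for illustration):

```python
import os
import pickle

# Round-trips fine within Python, but the bytes are meaningless to other languages.
data = {"userName": "Martin", "favoriteNumber": 1337}
blob = pickle.dumps(data)
assert pickle.loads(blob) == data

# The security issue: unpickling can execute arbitrary code. __reduce__ tells
# pickle to call any callable during loads - here a harmless os.getcwd, but an
# attacker can substitute anything.
class Payload:
    def __reduce__(self):
        return (os.getcwd, ())

print(pickle.loads(pickle.dumps(Payload())))  # runs os.getcwd() during decoding
```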
-
JSON, XML, and Binary variants
- Very popular, lots of criticisms
- Human readable
- Hard to distinguish between numbers and strings (XML & CSV can't without an external schema)
- No distinction between ints and precise floats/doubles in JSON - leads to problems parsing big numbers in languages that use floating point numbers (e.g. JavaScript)
- Good support for Unicode, but none for binary strings. People work around it by base64-encoding data, but base64 encodes only 6 bits per character vs. 8 bits per byte, so ~33% bigger (see sketch after this list)
- JSON & XML have optional schema support, but it’s complicated and not used everywhere
- CSV very vague - schemaless & escaping rules not standardized
- Good for inter-organizational transport (e.g. emailing data around)
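A quick Python illustration of the two pain points above (large numbers and binary data); the specific values are just for demonstration:

```python
import base64
import json

# Integers above 2**53 can't be represented exactly as IEEE-754 doubles, which
# is how JavaScript (and any float-based parser) interprets JSON numbers.
big = 2**53 + 1
assert float(big) == float(2**53)       # precision silently lost by a float-based reader
print(json.dumps({"id": big}))          # the JSON text is fine; the reader is the problem

# Binary data has to be smuggled through as base64 text: 6 bits per character
# instead of 8 bits per byte, so roughly 33% larger.
raw = bytes(range(256)) * 4             # 1024 bytes of arbitrary binary data
encoded = base64.b64encode(raw)
print(len(raw), len(encoded))           # 1024 -> 1368
```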
-
Binary encoding
- Good for within an org and larger datasets
- There are binary encodings of JSON & XML (MessagePack, BSON, WBXML, etc), some add typing
- Without a schema, field names (keys) still have to be encoded in the data as strings
-
Binary encoding example - JSON is 81 bytes
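For reference, a record along the lines of the running example; encoded as whitespace-free JSON it comes to 81 bytes:

```python
import json

record = {
    "userName": "Martin",
    "favoriteNumber": 1337,
    "interests": ["daydreaming", "hacking"],
}
compact = json.dumps(record, separators=(",", ":"))  # no whitespace
print(len(compact.encode("utf-8")))                  # 81 bytes
```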

-
Binary Protocol Field and Schema Evolution
- Field tags are critical to the meaning of encoded data in thrift and protobufs
- Can change the name of the field in the schema, but not its tag
- Can add new fields (and associated field tags) to the schema
- old code reading new data (forwards compat) can simply skip that bit of data using the length (see sketch after this list)
- To maintain backwards compat, new fields cannot be marked required (must be optional or have default)
- You can remove an optional field, making sure you never reuse that field tag number
- Type evolution is possible as long as you don’t cause truncation by precision reduction (e.g. 64 bit to 32 bit ints)
- Lists:
- protobufs don’t have arrays, but “repeated”. Can change optional fields to repeated fields
- thrift has dedicated list datatype, which allows support for nested lists
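Not the actual Thrift/protobuf wire format, but a minimal Python sketch of the field-tag idea above: each field carries a tag and a byte length, so an old reader can skip fields whose tags it doesn't know:

```python
import struct

def encode(fields: dict[int, bytes]) -> bytes:
    """Encode each field as (1-byte tag, 4-byte length, payload bytes)."""
    out = b""
    for tag, payload in fields.items():
        out += struct.pack(">BI", tag, len(payload)) + payload
    return out

def decode(buf: bytes, known_tags: set[int]) -> dict[int, bytes]:
    result, i = {}, 0
    while i < len(buf):
        tag, length = struct.unpack_from(">BI", buf, i)
        i += 5
        if tag in known_tags:            # known field: keep it
            result[tag] = buf[i:i + length]
        # unknown field: the length tells us how far to skip (forwards compat)
        i += length
    return result

# New writer adds field tag 3; an old reader that only knows tags 1 and 2 still works.
encoded = encode({1: b"Martin", 2: (1337).to_bytes(8, "big"), 3: b"new field"})
print(decode(encoded, known_tags={1, 2}))
```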
-
Apache Avro - same example record is 32 bytes
- Another binary encoding format, started for Hadoop
- Two schema languages: Avro IDL for human editing, and a JSON-based one that is more machine-readable
- No tag numbers in the schema, no type data in the encoding (a string is just a length prefix + UTF-8 bytes)
- Decoding requires going through fields in order in the schema to determine datatypes
- Need exact same schema with which data was encoded to properly decode
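A sketch of the same record encoded with Avro, assuming the third-party fastavro library: the output contains no field names, tags, or type bytes, just the values in schema field order:

```python
from io import BytesIO

import fastavro  # assumed: pip install fastavro

schema = fastavro.parse_schema({
    "type": "record",
    "name": "Person",
    "fields": [
        {"name": "userName", "type": "string"},
        {"name": "favoriteNumber", "type": ["null", "long"], "default": None},
        {"name": "interests", "type": {"type": "array", "items": "string"}},
    ],
})
record = {"userName": "Martin", "favoriteNumber": 1337,
          "interests": ["daydreaming", "hacking"]}

buf = BytesIO()
fastavro.schemaless_writer(buf, schema, record)  # values only, in schema field order
print(len(buf.getvalue()))                       # 32 bytes for this record
```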

-
Schema evolution: writer’s schema and reader’s schema compatibility
- When data decoded, avro compares and reconciles writer’s and reader’s schemas

- To maintain forwards/backwards compat, can only add/remove fields with default vals
- Readers get the writer’s schema either from the start of a large file, from a database via a version number, or negotiated at the start of a network connection
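A sketch of writer/reader schema reconciliation, again assuming fastavro: the reader's schema adds a field, which only works because the new field has a default value:

```python
from io import BytesIO

import fastavro  # assumed: pip install fastavro

writer_schema = fastavro.parse_schema({
    "type": "record", "name": "Person",
    "fields": [{"name": "userName", "type": "string"}],
})

# Reader's schema has an extra field; legal because it carries a default value.
reader_schema = fastavro.parse_schema({
    "type": "record", "name": "Person",
    "fields": [
        {"name": "userName", "type": "string"},
        {"name": "favoriteNumber", "type": ["null", "long"], "default": None},
    ],
})

buf = BytesIO()
fastavro.schemaless_writer(buf, writer_schema, {"userName": "Martin"})
buf.seek(0)

# Decoding needs the exact schema the data was written with; the reader's
# schema is then reconciled against it (missing field filled from its default).
print(fastavro.schemaless_reader(buf, writer_schema, reader_schema))
# {'userName': 'Martin', 'favoriteNumber': None}
```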
-
Avro is friendlier to dynamically generated schemas - on every schema change you can just regenerate the Avro schema, without having to track previous field tag numbers the way protobufs/thrift require
-
Many databases have their own binary protocol and driver to encode/decode (e.g. ODBC, ClickHouse Native protocol)
-
Schemas are nice
- Allows for code generation for statically typed languages
- Can omit field names from data (more compact)
- Schema is valuable documentation
- Can check forwards/backwards compat against a database of schema versions alone, before deploying anything
-
Dataflow through Databases
- If a single process is writing and reading, it's like sending a message to your future self. Need backwards compat
- Forwards compat required when multiple processes accessing db, e.g. rolling upgrade
- “Data outlives code”
- Migrations are always possible but expensive; keeping schema changes simple - e.g. adding a new nullable field without backfilling old rows - is preferred
- When archiving/snapshotting data, can encode everything with the latest schema
- Danger: old processes ignore and overwrite new fields they are not aware of
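A sketch of that overwrite danger, with made-up names: old code maps a row onto a model object that doesn't know about a new column, then writes the whole object back:

```python
from dataclasses import dataclass, asdict

@dataclass
class User:                 # old code's model - predates the photo_url column
    id: int
    name: str

# Row written by newer code that added photo_url.
row = {"id": 1, "name": "Martin", "photo_url": "http://example.com/me.jpg"}

user = User(id=row["id"], name=row["name"])   # unknown column silently dropped here
user.name = "Martin K."
print(asdict(user))   # {'id': 1, 'name': 'Martin K.'} - photo_url lost on write-back
```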

-
Dataflow through Services: REST & RPC
-
Web service
-
Problems with Remote Procedure Calls (RPCs)
-
Current direction for RPC
-
Data encoding + evolution for RPC
-
Message passing dataflow