Chapter 2: Data Models and Query Languages

Apps are built by layering data models on top of each other (real-world objects → JSON/tables/graphs → bytes → electrical signals)
The choice of data model determines what a client interacting with it can do (efficiently, or at all)
Each layer hides the complexity of the layers below it by providing a clean data model
3 data models discussed: relational, document, graph
Relational (SQL):
- Data organized in relations (i.e. tables) where each relation is unordered collection of tuples (i.e. rows)
- Precursor DBs forced devs to consider DB implementation details
- Survived through many cycles of hyped up alternatives
- Polyglot persistence - relational dbs will be used alongside other db models
- “NoSQL”: retroactively read as “Not Only SQL”, doesn’t refer to any particular tech, driven by
  - Need for greater scalability, large datasets/throughput
  - Preference for OSS vs closed software
  - Specialized query ops not supported by relational model
  - Desire for dynamic and expressive schemas
- NoSQL (as covered in this chapter) ≈ document + graph dbs
Object-relational mismatch: relations/rows in a db are a different model than application objects ("impedance mismatch")
- ORM (object-relational mapping) frameworks, e.g. Rails' ActiveRecord, help to bridge this
LinkedIn profile example using relations (normalized, links to values through foreign keys):
- Normalized = deduplicated, denormalized = duplicated/redundant data for performance
- Note - linking through foreign keys (IDs) is kinda like an enum: allows easy updating, standardization, localization (i18n), better search
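The "easy updating" point can be sketched with Python's built-in `sqlite3` module. This is only an illustration: the table layout, the second user, and the region names are assumptions, not something these notes specify. The region name is stored once and referenced by ID, so a rename touches a single row.

```python
import sqlite3

# In-memory database; schema and sample rows are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE regions (id TEXT PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE users (
        user_id    INTEGER PRIMARY KEY,
        first_name TEXT,
        region_id  TEXT REFERENCES regions(id)  -- foreign key, not a copied string
    );
    INSERT INTO regions VALUES ('us:91', 'Greater Seattle Area');
    INSERT INTO users VALUES (251, 'Bill', 'us:91');
    INSERT INTO users VALUES (252, 'Melinda', 'us:91');
""")

def users_with_region(conn):
    """Resolve each user's region_id to a human-readable name via a join."""
    cursor = conn.execute("""
        SELECT u.first_name, r.name
        FROM users AS u
        JOIN regions AS r ON r.id = u.region_id
        ORDER BY u.user_id
    """)
    return cursor.fetchall()

before = users_with_region(conn)

# Normalized: the name lives in exactly one row, so one UPDATE fixes every profile.
conn.execute("UPDATE regions SET name = 'Seattle Metro' WHERE id = 'us:91'")
after = users_with_region(conn)
```

In a denormalized layout the region string would be copied into every user row, and the same rename would have to find and rewrite each copy.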

Could also represent as JSON document:

{
	"user_id": 251,
	"first_name": "Bill",
	"last_name": "Gates",
	"summary": "Co-chair of the Bill & Melinda Gates... Active blogger.",
	"region_id": "us:91",
	"industry_id": 131,
	"photo_url": "/p/7/000/253/05b/308dd6e.jpg",
	"positions": [
		{"job_title": "Co-chair", "organization": "Bill & Melinda Gates Foundation"},
		{"job_title": "Co-founder, Chairman", "organization": "Microsoft"}
	],
	"education": [
		{"school_name": "Harvard University", "start": 1973, "end": 1975},
		{"school_name": "Lakeside School, Seattle", "start": null, "end": null}
	],
	"contact_info": {
		"blog": "http://thegatesnotes.com",
		"twitter": "http://twitter.com/BillGates"
	}
}

JSON representation:
- Reduces the impedance mismatch between app code and the storage layer
- Schemaless
- Better "locality" - all info in one place, one query enough
One-to-many structure of the LinkedIn profile (one user, many positions/education entries) ⇒ tree, which JSON makes explicit
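To make the tree shape concrete, here is a minimal sketch that splits a trimmed copy of the profile document into the parent row and child rows a relational schema would need. The `shred` helper is hypothetical, written only for this illustration.

```python
# Trimmed copy of the profile document; only the one-to-many parts are kept.
profile = {
    "user_id": 251,
    "positions": [
        {"job_title": "Co-chair", "organization": "Bill & Melinda Gates Foundation"},
        {"job_title": "Co-founder, Chairman", "organization": "Microsoft"},
    ],
    "education": [
        {"school_name": "Harvard University", "start": 1973, "end": 1975},
        {"school_name": "Lakeside School, Seattle", "start": None, "end": None},
    ],
}

def shred(doc, child_keys):
    """Split one document into a parent row plus child rows keyed by user_id."""
    parent = {key: value for key, value in doc.items() if key not in child_keys}
    children = {
        key: [{"user_id": doc["user_id"], **item} for item in doc.get(key, [])]
        for key in child_keys
    }
    return parent, children

parent, children = shred(profile, ["positions", "education"])
```

Each child row carries a `user_id` pointing back at its single parent, which is the one-to-many relationship; the document form keeps the same information nested in place, so one read returns the whole tree.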

Issue with document DBs: joins are hard/unsupported. Not as good for many-to-one or many-to-many relationships.
- many-to-one: many X map to one particular Y (e.g. many people live in one region)
- normalizing data requires many-to-one relationships, which is why normalizing a document db is difficult
Weak join support in document dbs shifts join logic to app code, which has to emulate the join with multiple queries
Data tends to become more interconnected over time, leading to more join-like requirements
- models that are initially join-free tend to evolve to places where joins would be convenient
- strings become references to entities
- updates to a sub-entity (e.g. the profile picture of a user who wrote a recommendation on another user’s profile) need to propagate everywhere it is shown
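A minimal sketch of what a join emulated in app code looks like. Plain dictionaries stand in for two document collections, and `find_by_id`, the `org:` IDs, and the `organization_ids` field are all hypothetical names chosen for illustration; no real database driver is used.

```python
# Two hypothetical collections standing in for a document database.
USERS = {
    251: {"name": "Bill Gates", "organization_ids": ["org:1", "org:2"]},
}
ORGANIZATIONS = {
    "org:1": {"name": "Bill & Melinda Gates Foundation"},
    "org:2": {"name": "Microsoft"},
}

query_log = []  # records every simulated round trip to the database

def find_by_id(collection_name, collection, doc_id):
    """Stand-in for a single lookup query against one collection."""
    query_log.append((collection_name, doc_id))
    return collection[doc_id]

def load_profile(user_id):
    """Emulate a join: fetch the user, then fetch each referenced organization."""
    user = find_by_id("users", USERS, user_id)
    organizations = [
        find_by_id("organizations", ORGANIZATIONS, org_id)
        for org_id in user["organization_ids"]
    ]
    return {
        "name": user["name"],
        "organizations": [org["name"] for org in organizations],
    }

loaded = load_profile(251)
```

Here one logical read costs three lookups (one user plus two organizations); a relational db would express the same thing as a single query with a join.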

IBM’s Information Management System (IMS) used the hierarchical model, similar to a document db: data represented as a tree of records nested within records
Network vs Relational models (1960s/70s):
- Great debate over how to solve the limitations of the hierarchical model (bad for joins + many → one, many → many relationships)
- Network/CODASYL model
  - allowed tree nodes to have many parents (hierarchical model only one parent)
  - records reached by following an "access path": a chain of links from a root record, traversed like a linked list
  - Had to track these paths in application code, inflexible, but good for memory limits of the time
- Relational came along and laid all data out in the open: query any table with whatever filters and joins you want, and the implementation details are hidden. The query optimizer was hard to build, but it only had to be written once and then every app benefited.
  - App devs can add indices to change the "access path" chosen by the query optimizer, without changing their queries
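The index point can be observed with `sqlite3` and SQLite's `EXPLAIN QUERY PLAN`. A minimal sketch, assuming a made-up `users` table and synthetic rows; the exact plan wording differs between SQLite versions, so the code only checks whether the index name shows up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (user_id INTEGER PRIMARY KEY, name TEXT, region_id TEXT)"
)
# Synthetic rows, purely for illustration.
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [(i, f"user{i}", f"us:{i % 100}") for i in range(1000)],
)

QUERY = "SELECT name FROM users WHERE region_id = ?"

def query_plan(conn):
    """Return the optimizer's plan for QUERY as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + QUERY, ("us:91",)).fetchall()
    return " | ".join(row[-1] for row in rows)  # last column is the plan detail

def run_query(conn):
    return sorted(name for (name,) in conn.execute(QUERY, ("us:91",)))

plan_before = query_plan(conn)
rows_before = run_query(conn)

# Declare an index; the query text itself is not touched.
conn.execute("CREATE INDEX idx_users_region ON users (region_id)")

plan_after = query_plan(conn)
rows_after = run_query(conn)

print("before:", plan_before)
print("after: ", plan_after)
```

The query string and its results stay the same; only the access path picked by the optimizer changes. In the network model, the equivalent change would have meant rewriting the traversal code in the application.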
Relational vs Document:
- Foreign keys == document references (same thing diff name)
- If the data is a one-to-many tree that is usually loaded all at once, the document model is a good fit
The main arguments in favor of the document data model are schema flexibility, better performance due to locality, and that for some applications it is closer to the data structures used by the application. The relational model counters by providing better support for joins, and many-to-one and many-to-many relationships.
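Schema flexibility means the database does not enforce a structure, so application code interprets whatever shape it reads. A minimal sketch, assuming a hypothetical change from a single `name` field to a separate `first_name` field; the sample documents and the `read_user` helper are invented for illustration.

```python
def read_user(doc):
    """Return a copy of the document with first_name filled in if it is missing."""
    user = dict(doc)  # leave the stored document untouched
    if "name" in user and "first_name" not in user:
        # Old-format document: derive the new field at read time.
        user["first_name"] = user["name"].split(" ")[0]
    return user

# Hypothetical documents written before and after the format change.
old_doc = {"user_id": 1, "name": "Ada Lovelace"}
new_doc = {"user_id": 2, "name": "Grace Hopper", "first_name": "Grace"}

users = [read_user(old_doc), read_user(new_doc)]
```

With a schema enforced on write, the equivalent change would typically be a migration (for example an `ALTER TABLE` followed by an `UPDATE`) rather than a branch in the read path.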
- Many-to-many relationships make document dbs less appealing
- Document = 'schema on read' vs relational = 'schema on write'; both have advantages/disadvantages
- Relational and document dbs are converging somewhat
  - Relational dbs can provide locality by interleaving (nesting) a table's rows within a parent table
  - Documents are adding some join functionality
- After a format change in a document db, app code has to handle both old and new document shapes when reading
- Document good for:
  - many types of objects and it’s not practicable to make a table for each
  - object data structure is determined by external systems which you have no control over
- Better data locality on disk for document model as long as you only want one document and you want all of it at once (not just a bit)
- SQL is declarative - specify what data you want, not how to get it
  - Lends itself to parallel compute
  - Can improve the db query engine without having to change queries
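The declarative point can be illustrated by writing the same filter twice: once as an imperative loop and once as SQL via `sqlite3`. The `animals` table and its rows are illustrative assumptions, not data from these notes.

```python
import sqlite3

# Illustrative sample data: (name, family) pairs.
animals = [
    ("Great white", "Sharks"),
    ("Lion", "Felidae"),
    ("Hammerhead", "Sharks"),
]

def sharks_imperative(rows):
    """Imperative: spell out the iteration order and each step ourselves."""
    result = []
    for name, family in rows:
        if family == "Sharks":
            result.append(name)
    return result

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE animals (name TEXT, family TEXT)")
conn.executemany("INSERT INTO animals VALUES (?, ?)", animals)

# Declarative: state the condition and ordering; the engine decides how to
# scan, which index to use (if any), and in what order to visit rows.
sharks_declarative = [
    name
    for (name,) in conn.execute(
        "SELECT name FROM animals WHERE family = 'Sharks' ORDER BY name"
    )
]

imperative_sorted = sorted(sharks_imperative(animals))
```

The loop fixes one execution strategy in app code, whereas the SQL version leaves the strategy to the engine, which is what leaves room for parallel execution or optimizer improvements without touching the query.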