
Database Internals: A very short conspect

About a year ago I listened to a Software Engineering Radio podcast episode with Alex Petrov, where Alex was discussing his new book, "Database Internals: A Deep Dive Into How Distributed Data Systems Work". At the time I had already read the (absolutely wonderful and highly recommended) "Designing Data-Intensive Applications" (by Martin Kleppmann), and Alex's book seemed like an excellent "even deeper dive" into the more specific topic of databases.

Finally, this year I read the book, and it was a total joy!

It is very engagingly written, clear and concise. Making a summary of the book wasn't that easy, actually - each sentence is so precise that I didn't feel I could shorten it while preserving most of the information (whereas lots of motivational/non-fiction books can be distilled into 5-10 pages if you remove all the filler).

However, I was making some notes while reading it, and I now think these might be helpful for someone, maybe to "refresh" what was read, or to locate a specific topic inside the book. So these notes can be seen as an "expanded table of contents", and not the contents themselves.

Disclaimer: I do not own any rights to the book itself. Most of the sentences are rephrased from the book, though some short excerpts might be copied verbatim, as there was no way to rephrase them without losing the meaning. I really encourage you to buy the book wherever works best for you - you won't regret reading it.

If there are errors in the notes, it was probably I who introduced them, not the author of the book.


With that said, here are my short notes:


Part 1: Storage Engines


B-tree basics (Ch. 2)

Binary Search Tree (BST) - low fanout (2), high height, lots of disk seeks - not suitable for disk storage

B-Tree: High fanout (many child nodes from parent), low height. More than 1 key in a node. Keys stored in sorted order. Number of keys ~ 10 - 100s. Each node has N keys, N+1 pointers to child nodes

B+-Tree: stores values only at the leaf level. Intermediate nodes store only "separator" keys to guide searches. Hence intermediate nodes need to be updated only on merge / split, not on each insert. Nowadays commonly referred to simply as B-Trees.

Constructed bottom-up. Tree height changes when the root is split, or when two nodes are combined to form a new root

The first pointer points to the subtree with keys less than the first key of the current node; the last pointer points to the subtree with keys greater than or equal to the last key. All keys stored in internal nodes are separator keys.


File Formats (Ch. 3)

Flags / boolean params (e.g. 0100 to represent 4 boolean flags). To set a flag: bitwise OR (|). To unset a flag: bitwise AND (&) with the negated mask (~). To check a flag: bitwise AND (&) and compare to 0.
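A minimal Python sketch of these bit tricks (the flag names here are invented purely for illustration):

```python
# Hypothetical page flags packed into one byte (names invented for illustration).
IS_LEAF      = 0b0001
IS_ROOT      = 0b0010
HAS_OVERFLOW = 0b0100
IS_DIRTY     = 0b1000

flags = 0b0000
flags |= HAS_OVERFLOW              # set a flag with bitwise OR            -> 0b0100
flags &= ~HAS_OVERFLOW             # unset it with AND of the negated mask -> 0b0000
is_leaf = (flags & IS_LEAF) != 0   # check a flag: AND with the mask, compare to 0
```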

Storing a file: it starts with a fixed-size header, the rest is split into pages, and it might end with a fixed-size "trailer". Pages are read/written in full. Append-only storage waits for a page to fill up, then flushes it to disk. Having a schema for the data helps to reduce size on disk - there is no need to store the structure for each row. We can store all fixed-size fields, followed by the lengths of variable-size fields, then the fields themselves.
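As a rough sketch of that idea (the exact row layout below is my own, not the book's format): fixed-size fields are packed first, then the lengths of the variable-size fields, then their bytes.

```python
import struct

def encode_row(user_id: int, age: int, name: str, email: str) -> bytes:
    """Fixed-size fields first, then lengths of variable-size fields, then their bytes."""
    name_b, email_b = name.encode("utf-8"), email.encode("utf-8")
    # '<qBHH' = little-endian: int64 user_id, uint8 age, two uint16 lengths.
    header = struct.pack("<qBHH", user_id, age, len(name_b), len(email_b))
    return header + name_b + email_b

def decode_row(buf: bytes):
    user_id, age, name_len, email_len = struct.unpack_from("<qBHH", buf, 0)
    offset = struct.calcsize("<qBHH")
    name = buf[offset : offset + name_len].decode("utf-8")
    email = buf[offset + name_len : offset + name_len + email_len].decode("utf-8")
    return user_id, age, name, email

row = encode_row(42, 30, "Alice", "alice@example.com")
assert decode_row(row) == (42, 30, "Alice", "alice@example.com")
```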

Slotted page / slot directory - stores variable-size records. The page is split into pointers and cells, located on different sides of the page. Pointers hold only the location / length of a cell, not its content. We can assume that each node occupies one page. Cells hold (separator) keys and pointers to the pages representing child nodes. Page checksums are computed and placed in the page header.
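A toy slotted-page sketch, assuming (for brevity) that the slot directory lives in a plain Python list rather than inside the page bytes themselves:

```python
PAGE_SIZE = 4096

class SlottedPage:
    """Toy slotted page: slot pointers grow from the front, cells grow from the back."""
    def __init__(self):
        self.data = bytearray(PAGE_SIZE)
        self.slots = []                # slot directory: list of (offset, length) pairs
        self.free_end = PAGE_SIZE      # cells are appended from the end of the page

    def insert(self, record: bytes) -> int:
        offset = self.free_end - len(record)
        self.data[offset:self.free_end] = record
        self.free_end = offset
        self.slots.append((offset, len(record)))
        return len(self.slots) - 1     # slot id stays stable even if cells move later

    def read(self, slot_id: int) -> bytes:
        offset, length = self.slots[slot_id]
        return bytes(self.data[offset:offset + length])

page = SlottedPage()
sid = page.insert(b"variable-size record")
assert page.read(sid) == b"variable-size record"
```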


Implementing B-Trees (Ch. 4)

Page headers can be used to store metadata, magic numbers to validate that this is start of the page, sibling links (pointers to other nodes on the same level).

Some Nodes store High Key in the right-most cell - that's highest value possible in the children of that node.

Keys in a node are searched using binary search. A non-negative result means the key was found at that position; a negative value means the key was not found, and encodes the position where it must be inserted to preserve ordering.
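A minimal sketch of that search convention (my own code, not the book's):

```python
def search_key(keys, key):
    """Binary search over the sorted keys of a node.

    Returns a non-negative index if the key is found; otherwise returns
    -(insertion_point + 1), encoding where the key should be inserted
    to keep the node sorted.
    """
    low, high = 0, len(keys) - 1
    while low <= high:
        mid = (low + high) // 2
        if keys[mid] < key:
            low = mid + 1
        elif keys[mid] > key:
            high = mid - 1
        else:
            return mid
    return -(low + 1)

assert search_key([10, 20, 30], 20) == 1
assert search_key([10, 20, 30], 25) == -3   # not found; insert at index 2
```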

During traversal to leaf node, followed path (pointers to nodes, cell indices) is stored in stack (BTStack) to propagate changes up in case of split/merge. This path is called "breadcrumbs"

Tree may move (balance) data between nodes to avoid splits / merges. During such operations min / max values of nodes are updated.

Trees use fast-path optimizations for constant rightmost insertions (when the PK is a monotonically increasing value), keeping an in-memory pointer to the rightmost node

Compression is usually done either page-wise (without knowing data), or row-wise / column wise (requires knowing the data structure).


Transaction Processing (Ch. 5)

Transaction manager coordinates and tracks transactions. Lock manager guards access to resources. Log manager holds the history of operations on cached pages that were not yet persisted.

Pages are fetched into memory when accessed for the first time. All work / modifications are done on the in-memory page. If a page is modified, it is marked as "dirty", so its contents will be flushed to disk. Some DBs have background processes that flush dirty pages which are likely to be evicted soon.

Nodes closer to the root are expected to be more popular. Highly-used nodes can be "pinned" in cache, so they won't get evicted.

Page eviction algorithms: FIFO (not optimal), LRU (least-recently used), improved by 2Q LRU and LRU-K. CLOCK-sweep - circular, marks pages as unused, then evicts them on the second pass. LFU - tracks page references instead of page-ins. Improvement: TinyLFU
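A minimal LRU sketch (far simpler than 2Q or LRU-K, just to show the eviction idea; the class and method names are mine):

```python
from collections import OrderedDict

class LRUPageCache:
    """Toy LRU page cache: evicts the least-recently-used page when full."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.pages = OrderedDict()     # page_id -> page bytes, ordered by recency

    def get(self, page_id):
        if page_id not in self.pages:
            return None
        self.pages.move_to_end(page_id)   # mark as most recently used
        return self.pages[page_id]

    def put(self, page_id, page):
        if page_id in self.pages:
            self.pages.move_to_end(page_id)
        self.pages[page_id] = page
        if len(self.pages) > self.capacity:
            # Evict the LRU victim; a real engine would flush it first if it is dirty.
            self.pages.popitem(last=False)

cache = LRUPageCache(capacity=2)
cache.put(1, b"page-1")
cache.put(2, b"page-2")
cache.get(1)               # touch page 1
cache.put(3, b"page-3")    # evicts page 2, the least recently used
assert cache.get(2) is None
```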

Write-ahead log (WAL) is an append-only disk structure used for crash recovery. It usually stores data while it is still only in the cache, not yet flushed to disk. WAL records also indicate transaction completion.

Data is usually trimmed from the log by a separate process that ensures the data has been written to "main storage". The process that flushes all dirty pages to disk is usually called a "sync checkpoint".

Logs can be physical (storing byte changes) or logical (storing operations), or some combination of both. Pages can be flushed to disk even before transaction commits, but then we need to keep a log of undo operations to be able to revert the transaction. ARIES - steal / no-force recovery algorithm.

Types of concurrency control:

Optimistic: transactions execute reads / writes, then transaction histories are checked for conflicts; if a conflict is found, one of the transactions is aborted

Multi-version (MVCC). Multiple versions can coexist, guaranteeing a consistent view that existed at some point

Pessimistic: locking or non-locking. Non-locking version has operation lists.

Read anomalies with concurrent transactions: dirty reads (reading data from an uncommitted transaction), non-repeatable reads (different results from querying the same row) and phantom reads (same as non-repeatable, but for range queries).

Write anomalies: lost update (overwrite by another transaction), dirty write (write based on a read from another uncommitted transaction), write skew.

Isolation levels: read uncommitted, read committed, repeatable read, serializable

2-phase locking (2PL) - not to be confused with 2-phase commit. Separates locking into growth (acquiring necessary locks) and shrinking (releasing locks) phases. No lock can be released before all locks are acquired.

Latches protect the "physical" representation of the tree - latches are acquired on individual nodes, and protect readers from seeing the tree in an inconsistent state during a split / merge


B-Tree Variants (Ch. 6)

Copy-on-write B-Trees are structured like B-Trees, but their nodes are immutable and are not updated in place. Instead, pages are copied, updated, and written to new locations. Such trees do not require latching, as readers read old versions while writers create a new one. After readers finish, pointers are switched to the new version.

Lazy B-Trees reduce the number of I/O requests from subsequent same-node writes by buffering updates to nodes.

FD-Trees (similar to LSM Trees) - buffer updates in a small B-Tree. As soon as this tree fills up, its contents are written into an immutable run. Updates propagate between levels of immutable runs in a cascading manner, from higher levels to lower ones.

Bw-Trees separate B-Tree nodes into several smaller parts that are written in an append-only manner. This reduces costs of small writes by batching updates to the different nodes together. Updates are stored in a linked list, tail of the list is a "base version". All updates are "applied" on read.

Cache-oblivious B-Trees allow treating on-disk data structures in a way that is very similar to how we build in-memory ones.


Log-Structured Storage (Ch.7)

B-Trees - mutable structure, read-optimized.

Log-Structured Merge Trees (LSM Trees) - immutable structure, append-only storage, write-optimized. Reads and writes don't intersect by design, so locks or latches are not necessary.

Ordered LSM Trees buffer writes in memory, reorder entries when the buffer size exceeds a threshold (or periodically), and flush a single file to disk.

To read a multi-component LSM Tree, a multiway merge-sort is used (with an in-memory min-heap). When same-key records from different tables are read into the heap, their values are "reconciled" and a single result value per key is returned (see the sketch below).

The compaction process takes multiple on-disk tables, "reconciles" their records and writes the resulting single table to disk, deleting the old ones.

RUM (Read, Update, Memory) Conjecture - reducing two of the overheads changes the third one for the worse.
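A toy illustration of the multiway merge and reconciliation, using Python's heapq; the priority-tagging trick is my own simplification, not the book's code:

```python
import heapq

def merge_tables(*tables):
    """Multiway merge of sorted (key, value) runs, newest table given first.

    heapq.merge keeps only one element per run in memory (a min-heap),
    which mirrors how multi-component LSM reads are performed.
    """
    merged = {}
    # Tag each entry with the table's priority so newer values win on key conflicts.
    tagged = (
        ((key, priority, value) for key, value in table)
        for priority, table in enumerate(tables)
    )
    for key, priority, value in heapq.merge(*tagged):
        # The first occurrence of a key comes from the newest table;
        # later duplicates are ignored (reconciliation).
        merged.setdefault(key, value)
    return sorted(merged.items())

memtable = [("a", 1), ("c", 30)]
sstable  = [("a", 0), ("b", 2), ("c", 3)]
assert merge_tables(memtable, sstable) == [("a", 1), ("b", 2), ("c", 30)]
```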

Disk-resident tables are implemented using Sorted String Tables (SSTables).

Bloom filter - a probabilistic structure that can tell whether an element might be in the table, or is definitely NOT in the table. Implementation: a large bit array plus running the key through multiple hash functions.
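A toy Bloom filter sketch, deriving the k hash functions from sha256 with different salts (one of several possible constructions, not the book's):

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: k hash functions derived from sha256 with different salts."""
    def __init__(self, size_bits: int = 1024, num_hashes: int = 3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, key: str):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key: str):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: str) -> bool:
        # False means "definitely not present"; True means "possibly present".
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))

bf = BloomFilter()
bf.add("key-1")
print(bf.might_contain("key-1"))   # True: possibly present (no false negatives)
print(bf.might_contain("key-2"))   # almost certainly False: definitely not present
```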

Skiplist - used to keep sorted data in memory. List with multiple pointers from each node (depending on randomly-assigned "level"), enabling "fast search" of values.


Part 2: Distributed Systems


Introduction and Overview (Ch. 8)

Common abstractions: links.

Fair-loss link. A message can be lost, but if retransmitted an infinite number of times, it will eventually be delivered. Acknowledgement: until an ACK is received, it is unknown whether the message was delivered or lost. After the ACK, it was definitely delivered.

Stubborn link - retransmitting message indefinitely.

Perfect link: every message sent will eventually be delivered. No message is delivered more than once. Only messages that were actually sent are delivered.

Two Generals' Problem: two parties cannot reliably agree on something, as each is always waiting for the next ack from the other party.

Consensus protocol - describes a system that, given multiple processes starting at its initial state, brings all of the processes to the decision state

Correct consensus protocol has properties: Agreement (decision is unanimous), Validity (agreed value was proposed by one of the parties), Termination (all processes reached final state).

FLP impossibility states that under certain assumptions (fully asynchronous system, no time-outs or any notion of time) correct consensus protocol cannot exist.

Consensus can be achieved in synchronous or partially-synchronous systems.

Failure models: crash-stop (process doesn't have to recover), crash-recovery (introduces recovery protocol), omission fault (certain algorithm steps are skipped, or results of execution are not visible), arbitrary faults (process continues executing steps, but in a way that contradicts the algorithm)


Failure Detection (Ch. 9)

There's always a trade-off between wrongly suspecting alive processes as dead (producing false-positives), and delaying marking an unresponsive process as dead (producing false-negatives)

Liveness is a property that guarantees that a specific intended event must occur. Safety guarantees that unintended events will not occur.

Failure detection algorithms have a trade-off between efficiency (how fast dead process was identified) and accuracy (whether process failure was detected).

Heartbeat - a process proactively notifies its neighbors that it is alive. Ping - a process sends a message to its neighbors and expects a response to verify that they are alive. Both rely on message frequency and timeouts.

Timeout-free failure detectors:

Heartbeat (propagation of messages to neighbors, adding self to the path) - allows to deal with faulty 1-1 links.

Outsourced heartbeats: allowing process to ask other processes to contact unresponsive neighbor (P1 pings P2, P2 doesn't answer, P1 asks P3 and P4 to ping P2).

Phi-accrual: replaces dead/alive binary decision with a spectrum of probability, by accumulating response stats in certain window

Gossip: each node maintains counters for itself and its neighbors, and updates them when receiving messages from them.

FUSE: node stops responding, when it notices another node in same group that is unresponsive. This way unavailability propagates, making whole group unresponsive.


Leader Election (Ch. 10)

Bully algorithm: each process gets a rank, during election highest-ranking node wins. This algorithm is susceptible to split-brain.

Next-in-line failover: the leader sends a list of next candidates. If the leader fails, the node that detected the failure notifies the next in line to take the position, without election rounds.

Candidate / ordinary optimization: nodes are split into 2 sets. Only candidate nodes take part in the election. Election is started by node from ordinary set.

Invitation algorithm: each process starts as a leader, and contacts other nodes to join their group. If the contacted node is also a leader, the two groups merge. If not, it responds with the ID of its leader.

Ring algorithm: each node knows the ring topology and contacts its next node to select a leader, adding itself to the "live list" in the message. When the message goes full circle, it has accumulated all currently live nodes; the highest-ranking of them becomes the leader.


Replication and Consistency (Ch. 11)

CAP theorem: it is impossible to create a system that guarantees both availability and consistency in the presence of network partitions. We can guarantee one property, and provide best effort on another.

CP systems prefer failing requests to serving potentially inconsistent data.

AP systems loosen the consistency requirement and allow serving potentially inconsistent values during the request.

"Tunable" options regarding CA - harvest (request can return 99 rows when 100 were requested, if some node is unavailable) and yield (number of successful request compared to attemepted). We can attempt tuning those parameters per request (serve user request only if user data is on available nodes, not serve critical request if not all data is available).

Types of registers (units of storage):

safe - a read concurrent with a write can return an arbitrary value within the register's range

regular - read can return only value written by last completed write, or by write that is concurrent with read.

atomic - guarantee linearizability: there is a single moment before which every read returns the old value, and after which every read returns the new value.

Consistency models:

Strict consistency - every write is immediately available to read. Theoretical / unachievable in practice.

Linearizability - a read after a write's completion sees the written value. Stale reads are prohibited.

Sequential consistency - operations have some order, operations of one process have sequential order. All processes observe operations in same order. Stale reads are possible.

Causal consistency - each process specifies the transaction "cause" that its transaction follows on.

Eventually consistent systems usually implement tunable consistency: replication factor (N, the number of nodes that will store the data), write consistency (W, the number of nodes that must ack a write for it to be successful), read consistency (R, the number of nodes that must respond to a read for it to be successful). If R + W > N, then returning the latest write is guaranteed.
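The quorum-overlap rule is simple arithmetic; a tiny sanity check to make it concrete (function name is mine):

```python
def guarantees_latest_read(n: int, w: int, r: int) -> bool:
    """Read and write quorums overlap in at least one replica when R + W > N."""
    return r + w > n

# Common configuration: N=3, W=2, R=2 -> quorums overlap, the latest write is seen.
assert guarantees_latest_read(n=3, w=2, r=2)
# N=3, W=1, R=1 -> no overlap guarantee, stale reads are possible.
assert not guarantees_latest_read(n=3, w=1, r=1)
```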

CRDT - Conflict-Free Replicated Data Types. Allow conflict resolution on read, reconstructing system state from separate local instances.


Anti-Entropy and Dissemination (Ch. 12)

Propagation of cluster-wide metadata is more important than propagation of data records.

Usual communication patterns:

Broadcast - from one process to all others. Simple, may be expensive and unreliable in big cluster. Requires all nodes to be up when broadcasting.

Anti-entropy - pair-wise connections between peers. Helps to update nodes after recovery / primary delivery failure.

Gossip - each recipient becomes a broadcaster, and sends a message further.

Background anti-entropy mechanisms may use Merkle trees or update logs to identify divergence. Foreground anti-entropy mechanisms piggyback on read / write requests.

Read Repair - coordinator requests data from nodes, if returned data differs - sends missing updates to divergent nodes

Digest reads - the coordinator issues a "full read" to one node, and requests digests (hashes) of the data from the others; if the digests match, the nodes are in sync.

Hinted handoff - if node did not acknowledge write, hint is saved for write to be replayed when node comes back online

Merkle tree - a tree of hashes of the data. The lowest level contains direct hashes of data ranges; going up, each level hashes the level below. Two trees can be compared to find the range where a discrepancy occurs, by descending from the root. If the roots are the same - there is no discrepancy. A change in the data requires recalculating hashes up to the root.
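A toy Merkle-tree comparison over hypothetical data ranges, just to show how a root mismatch narrows down to the divergent range (my own sketch, using sha256):

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_levels(chunks):
    """Build all levels of a Merkle tree bottom-up; levels[-1][0] is the root hash."""
    level = [h(c) for c in chunks]
    levels = [level]
    while len(level) > 1:
        pairs = [level[i:i + 2] for i in range(0, len(level), 2)]
        level = [h(b"".join(pair)) for pair in pairs]
        levels.append(level)
    return levels

replica_a = [b"range-0", b"range-1", b"range-2", b"range-3"]
replica_b = [b"range-0", b"range-1", b"DIVERGED", b"range-3"]

roots_equal = merkle_levels(replica_a)[-1] == merkle_levels(replica_b)[-1]
assert not roots_equal   # roots differ, so we descend level by level

# Comparing leaf hashes pinpoints the divergent range without shipping the data itself:
diverged = [i for i, (x, y) in enumerate(zip(merkle_levels(replica_a)[0],
                                             merkle_levels(replica_b)[0])) if x != y]
assert diverged == [2]
```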

Gossip mechanisms are useful in systems where nodes are often added and removed, and in mesh networks. To ensure nodes do not retransmit messages indefinitely, a "loss of interest" function is used. At the start, a process chooses F peers at random. When a node receives new "hot" information, it will try distributing it; there is always some redundancy. Stopping the gossip is called convergence; the period needed to reach convergence is called latency. Interest loss can be computed probabilistically or by a threshold (do not retransmit after receiving already-known info for the N-th time).

Hybrid gossip (push / lazy push) - a node sends the full message to some peers, and only the message id to others. If the other node has not seen a message with that ID, it contacts its peers to get it.


Distributed Transactions (Ch. 13)

Atomic commitment - class of algorithms to make multiple operations appear atomic. Examples: 2-phase commit, 3-phase commit.

2PC: First phase (prepare): the value is distributed, votes are collected. Second phase (commit / abort): flip the pointer to make the result visible.

Cohort failures: if a propose fails, the coordinator cannot proceed, so availability of all nodes is required. If a node accepts and then fails, it needs to read the last committed state from the coordinator.

Coordinator failures: if the coordinator fails after collecting votes, but before committing / aborting, the cluster is left in a blocked, undecided state. A new coordinator has to collect votes again and decide.
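A toy 2PC coordinator sketch; the cohort interface (prepare/commit/abort) is invented for illustration and ignores timeouts and recovery:

```python
def two_phase_commit(value, cohorts):
    """Toy 2PC: phase one collects votes, phase two commits or aborts everywhere."""
    # Phase 1 (prepare): distribute the value and collect votes.
    votes = [cohort.prepare(value) for cohort in cohorts]
    # Phase 2 (commit/abort): a single "no" vote aborts the transaction on all cohorts.
    if all(votes):
        for cohort in cohorts:
            cohort.commit()
        return "committed"
    for cohort in cohorts:
        cohort.abort()
    return "aborted"

class Cohort:
    """Hypothetical participant that stages a value and makes it visible on commit."""
    def __init__(self, accepts=True):
        self.accepts, self.staged, self.state = accepts, None, None
    def prepare(self, value):
        self.staged = value
        return self.accepts
    def commit(self):
        self.state = self.staged   # "flip the pointer": staged value becomes visible
    def abort(self):
        self.staged = None

assert two_phase_commit("v1", [Cohort(), Cohort()]) == "committed"
assert two_phase_commit("v2", [Cohort(), Cohort(accepts=False)]) == "aborted"
```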

3PC: Propose (send value, collect votes) -> Prepare (notify nodes of the vote result; aborted on timeout) -> Commit (actual replacement; can be committed on timeout). The worst failure case in 3PC is a potential network split: some nodes execute prepare while others never get the message and abort the transaction


Calvin - fast distributed transaction protocol. Used in FaunaDB.

Sequencer - determines global order of transaction execution. Splits timeline into epochs (microbatches). When transaction is replicated, it is forwarded to scheduler.

Scheduler - orchestrates transaction execution. Scheduler can execute part of transaction in parallel. No communication to sequencer is required.

Transaction is split into read set and write set.

Spanner - distributed transaction protocol / manager. Approaches inspired by it are used in CockroachDB / YugaByteDB. Uses TrueTime - a high-precision wall-clock API that exposes uncertainty boundaries.

Three operation types: read-write transactions (locks, leader replica required), read-only transactions (no locks, any replica), snapshot reads.

Writes need to go through leader (each leader holds lock table), reads can be served from any replica. Multi-shard transactions coordinate through leaders (leaders have Transaction Managers for that) and use 2PC.

Sharding - splitting keys into ranges, with each node storing only a certain range. The key used to find the correct node is called the routing key.

The most straightforward way to implement sharding is taking KEY % Number_of_nodes. Problem - full reshuffle on Number_of_nodes change.

The fix for that: consistent hashing. Values returned by the previous function are mapped onto a ring, and each node manages a certain segment of the ring. Hence adding a node only reshuffles keys of its neighbours (see the sketch below).
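A minimal consistent-hash ring sketch (no virtual nodes), assuming sha256 ring positions; only keys falling on the new node's arc move when it joins:

```python
import bisect
import hashlib

def ring_position(key: str) -> int:
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

class ConsistentHashRing:
    """Toy consistent-hash ring: each node owns the arc ending at its position."""
    def __init__(self, nodes):
        self.ring = sorted((ring_position(node), node) for node in nodes)

    def node_for(self, routing_key: str) -> str:
        pos = ring_position(routing_key)
        positions = [p for p, _ in self.ring]
        idx = bisect.bisect_right(positions, pos) % len(self.ring)  # wrap around the ring
        return self.ring[idx][1]

ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
before = {key: ring.node_for(key) for key in ("k1", "k2", "k3", "k4")}

bigger_ring = ConsistentHashRing(["node-a", "node-b", "node-c", "node-d"])
after = {key: bigger_ring.node_for(key) for key in before}

moved = [k for k in before if before[k] != after[k]]
# Only keys in node-d's new arc move; with KEY % N almost every key would move.
print(moved)
```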

Snapshot isolation: a transaction reads only values that were committed at the transaction's start, so reads are repeatable and values are consistent. Still, write skew is possible if conflicting transactions are writing values for different keys (no conflict on write, but an inconsistency as a result).

Coordination Avoidance. Coordination can be avoided if operations are invariant confluent. Invariant Confluence (I-Confluence) is a property that ensures that two invariant-valid but diverged database states can be merged into a single valid, final state.


Consensus (Ch. 14)

Consensus algorithm should have 3 properties:

Agreement (value is same for all correct processes)

Validity (value was proposed by one of the processes)

Termination (all valid processes eventually reach the decision)

Best effort broadcast - sender is responsible to deliver the message. If sender fails, no receiver tries to re-broadcast. Reliable broadcast - on sender failure, each receiver does a re-broadcast.

To deliver messages in order, we need Atomic Broadcast (aka Total Order Multicast).

Properties:

Atomicity (processes agree on the set of messages: either all of the set are delivered, or none),

Order (messages delivered in same order)

ZAB - implementation used in Zookeeper. Phases:

Discovery: the leader proposes a new epoch. After that no follower will accept events from previous epochs.

Synchronization: catch-up phase. The leader proposes itself for election, and delivers all messages from previous epochs that some followers might have missed.

Broadcast: leader receives messages, establishes order, broadcasts to followers.

PAXOS. Consensus algorithm. Proposal - unique monotonically increasing number. Participant can take any roles, and collocate them. Three roles:

Proposers - receive value from clients, create proposal, collect votes.

Acceptors - vote for accepting or rejecting proposal

Learners - replicas, storing outcomes of accepted proposals.

Two phases: voting / propose (proposers compete to establish their leadership) and replication (proposer distributes the value to the acceptors).

Multi-Paxos adds a leader - a distinguished proposer. The leader sends leases to proposers, notifying them that it is alive. Proposers don't accept proposals from other potential leaders.

Flexible Paxos. Requires only that the groups of nodes participating in Propose and Accept have a non-empty overlap.

Generalised Paxos: each server contains registers. A register is written only once, and can be unwritten, hold a value, or hold nil (a special value). Registers with the same index on different nodes form a "register set". A quorum can be undecided (Any or Maybe_v) or decided (None or Decided_v).

RAFT. Leader establishes global message order. Time divided into terms / epochs, each epoch has 1 leader. Roles of participants:

Candidate - attempts to collect votes to become the leader.

Leader - the current leader, handles requests and interacts with the replicated state machine. A term is an arbitrary period of time, identified by a monotonically increasing id.

Follower - passive, persists log entries, responds to requests. Similar to Acceptor AND Learner. Every process starts as follower.

When a node sees a higher term, it updates its own term and stops accepting messages from previous terms. When a node suspects a leader crash, it becomes a candidate and attempts to gather votes.

A leader can be elected only from nodes whose logs contain all committed entries. A follower rejects a higher-numbered entry if the index and term of the entry that immediately precedes it, as sent by the leader, do not match the latest entry in its own log (see the sketch below).
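A toy version of that consistency check on AppendEntries (log entries modeled as (term, command) tuples, 1-indexed as in the Raft paper; the function name and interface are mine):

```python
def accept_append_entries(log, prev_log_index, prev_log_term, entries):
    """Toy sketch of Raft's log-consistency check on AppendEntries.

    The follower rejects the entries unless its log contains an entry at
    prev_log_index whose term matches prev_log_term.
    """
    if prev_log_index > 0:
        if len(log) < prev_log_index or log[prev_log_index - 1][0] != prev_log_term:
            return False, log          # reject: the leader will retry with an earlier index
    # Accept: drop any conflicting suffix and append the leader's entries.
    return True, log[:prev_log_index] + entries

follower_log = [(1, "x=1"), (1, "x=2")]
ok, _ = accept_append_entries(follower_log, prev_log_index=2, prev_log_term=2,
                              entries=[(2, "x=3")])
assert not ok                           # term mismatch at index 2 -> rejected
ok, new_log = accept_append_entries(follower_log, prev_log_index=2, prev_log_term=1,
                                    entries=[(2, "x=3")])
assert ok and new_log == [(1, "x=1"), (1, "x=2"), (2, "x=3")]
```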

PBFT. Consensus algorithm with Byzantine failure protection (when certain node can be maliciously overtaken). Participants cross-validate responses.


Leo Chashnikov, 2022
