LSM-Tree Complete Technical Guide: From Theory to Industrial Practice
A comprehensive technical guide covering 30 years of LSM-Tree evolution from 1996 theory to 2026 industrial practice, including LevelDB, WiscKey, PebblesDB, and RocksDB optimizations
Author: Compiled from the classic LSM-Tree papers and modern optimization literature
Version: 1.0
Date: 2026-03-06
Description: This document systematically organizes the complete technical evolution of LSM-Tree from its theoretical foundation in 1996 to industrial practice in the 2020s
Table of Contents
- Overview
- LSM-Tree Theoretical Foundation (1996)
- LevelDB Engineering Implementation (2011)
- Comparison Between LevelDB and LSM-Tree Paper
- WiscKey Key-Value Separation Optimization (2016)
- PebblesDB FLSM Optimization (2017)
- RocksDB and Industrial Practice
- LSM-Tree Optimization Technology Map
- Performance Comparison Summary
- References and Further Reading
1. Overview
1.1 What is LSM-Tree
LSM-Tree (Log-Structured Merge-Tree) is a write-optimized persistent data structure, specifically designed for high write throughput scenarios. It significantly improves write performance by transforming random writes into sequential writes.
1.2 Core Design Philosophy
The design of LSM-Tree revolves around four core principles:
1. Defer and Batch Processing
- Write operations are first recorded in memory structures (MemTable), avoiding immediate disk I/O
- When MemTable reaches a certain threshold, it is batch-flushed to disk to form SSTable
- Background Compaction merges and maintains data ordering
- This batch processing amortizes the disk overhead of single writes
2. Sequential Write Optimization
- Once SSTable is written, it cannot be modified; only new files can be appended
- All disk operations are sequential writes, fully utilizing disk bandwidth (HDD sequential write can reach 100MB/s+, while random write is only 1MB/s)
- Sacrifice some read performance (need to check multiple levels) in exchange for extreme write performance
- Compensate read performance through Bloom Filter, caching, and other technologies
3. Multi-Level Hierarchical Structure
- L0 (Level 0): Memory MemTable + recently flushed immutable files (overlapping allowed)
- L1-LN (Level 1-N): Disk files, each level size increases by a fixed ratio (usually 10 times)
- Data flows from upper levels to lower levels; colder data resides in deeper levels
- Upper levels have small data volume and fast queries; lower levels have large data volume and slower queries but higher storage efficiency
4. Trade Space for Time
- Allow data redundancy: multiple versions of the same key may coexist in different levels
- Use write amplification (rewriting data) in exchange for high throughput of sequential I/O
- Use space amplification (temporarily retaining invalid data) in exchange for efficient delete operations (only need to mark Tombstone)
1.3 Technology Evolution Timeline
| Year | Milestone | Core Contribution |
|---|---|---|
| 1996 | LSM-Tree Paper Published | Theoretical foundation, proposed Rolling Merge, multi-component architecture, cost model |
| 2004 | Google Bigtable | Distributed LSM practice, proving LSM feasibility in large-scale scenarios |
| 2006 | Apache HBase | Open source Bigtable implementation, promoting LSM adoption in Hadoop ecosystem |
| 2008 | Facebook Cassandra | LSM + distributed + decentralized design, supporting multi-data centers |
| 2011 | Google LevelDB | Embedded LSM engine, engineering simplification (SkipList, SSTable), approximately 20K lines of code |
| 2012 | Facebook RocksDB | LevelDB optimization branch, multi-threading, rich features, becoming industrial standard |
| 2014 | MongoDB WiredTiger | Supporting LSM storage engine, providing alternative to B-Tree |
| 2016 | WiscKey Paper | Key-value separation architecture, write amplification reduced to near 1×, suitable for large value scenarios |
| 2017 | PebblesDB Paper | FLSM data structure, reducing write amplification by 4.8× through Guards mechanism |
| 2018 | Titan Open Source | TiKV’s key-value separation engine, industrial-grade implementation of WiscKey ideas |
| 2020 | RocksDB Integrated BlobDB | Key-value separation functionality matured, becoming officially recommended solution |
| 2021+ | ZNS SSD / PMem Adaptation | LSM optimizations for new hardware (Zoned Namespace SSD, Persistent Memory) |
| 2024-2026 | RocksDB v10-v11 | HyperClockCache, Interpolation Search, Wide-Column support, and other cutting-edge features |
2. LSM-Tree Theoretical Foundation (1996)
2.1 Core Contributions of the Paper
Paper: The Log-Structured Merge-Tree (LSM-Tree)
Authors: Patrick O’Neil, Edward Cheng, Dieter Gawlick, Elizabeth O’Neil
Published: Acta Informatica, 1996
Problems Solved
Problems of traditional B-Tree in high-write scenarios:
- Random Writes: Each insertion requires random disk I/O
- Index Maintenance Cost: Real-time indexing doubles transaction I/O cost
- Write Amplification: B-Tree updates may require multiple disk writes
Core Innovations
Two-Component Architecture: LSM-Tree divides storage into memory component C0 and disk component C1.
- C0 resides entirely in memory, supporting fast insertion and query
- C1 resides on disk, organized in multi-page blocks, supporting sequential read/write
- Data migrates progressively from C0 to C1 through Rolling Merge
Rolling Merge Mechanism: A gradual data migration process similar to merge sort.
- Batch read C1’s multi-page blocks into memory buffer
- Read a continuous segment of entries from C0, merge with data in C1 blocks
- Create new C1 blocks and write to new locations (without overwriting old blocks)
- Cycle repeatedly to achieve continuous data flushing
Cost Model: The cost advantage of LSM-Tree over B-Tree stems from differences in disk I/O patterns.
- B-Tree: Random page I/O, each update may trigger multiple disk seeks
- LSM: Multi-page block sequential I/O, batch writes amortize seek costs
- Cost ratio ≈ (Sequential I/O cost / Random I/O cost) × (1 / Merge batch size)
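The cost model can be made concrete with back-of-envelope arithmetic. The numbers below are illustrative assumptions (not from the paper), chosen only to show how batching and sequential I/O compound:

```python
# Illustrative cost-model arithmetic; all numbers are hypothetical.
# A B-Tree insert pays roughly one random page I/O; an LSM insert pays a
# share of one sequential multi-page write, amortized over the merge batch.

random_io_cost = 10.0        # ms per random page I/O (seek-dominated HDD)
sequential_io_cost = 1.0     # ms per page written inside a multi-page block
entries_per_page_merge = 50  # entries migrated per page of merge I/O

btree_cost_per_insert = random_io_cost
lsm_cost_per_insert = sequential_io_cost / entries_per_page_merge

cost_ratio = lsm_cost_per_insert / btree_cost_per_insert
print(f"LSM/B-Tree cost ratio = {cost_ratio}")  # 0.002, i.e. ~500x cheaper
```

With these assumptions each LSM insert costs 1/500 of a B-Tree insert, which is the shape of the advantage the paper's cost model formalizes.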
Two-Component Architecture Diagram:
LSM-Tree Two-Component Architecture
┌─────────────────────────────────────────────────────┐
│ │
│ C0 Component (Memory) C1 Component (Disk) │
│ ┌──────────────┐ ┌──────────────┐ │
│ │ Index Structure │ │ Multi-page │ │
│ │ (B-Tree) │ │ Block 1 │ │
│ ├──────────────┤ Rolling ├──────────────┤ │
│ │ Key1: Val1 │ Merge │ Multi-page │ │
│ │ Key2: Val2 │ ═════════▶ │ Block 2 │ │
│ │ Key3: Val3 │ │ Multi-page │ │
│ │ ... │ │ Block 3 │ │
│ └──────────────┘ │ ... │ │
│ └──────────────┘ │
│ │
│ Characteristics: Characteristics: │
│ - Completely memory-resident - Disk sequential │
│ - Fast read/write support - Append-write opt │
│ - Capacity limited - Large capacity │
│ │
└─────────────────────────────────────────────────────┘
2.2 Rolling Merge Mechanism Detailed Explanation
Rolling Merge is the core mechanism of LSM-Tree, responsible for migrating data from memory component to disk component.
Rolling Merge Process Diagram:
Step 1: Read C1 Multi-page Block Step 2: Merge C0 Entries
┌─────────────────────┐ ┌─────────────────────┐
│ C0: [a,b,c,d,e] │ │ C0: [d,e] │
│ ↑ │ │ (Merged) │
│ Merge Cursor │ │ │
│ │ │ C1 Buffer: │
│ C1 Buffer: │ │ [a,b,c] + [d,e] │
│ [Block1: x,y,z] │ ════════▶ │ = [a,b,c,d,e] │
│ │ │ │
└─────────────────────┘ └─────────────────────┘
Step 3: Write New C1 Block Step 4: Advance Cursor
┌─────────────────────┐ ┌─────────────────────┐
│ C0: [d,e,f,g] │ │ C0: [f,g,h,i,j] │
│ (Next Batch) │ │ ↑ │
│ │ │ New Cursor │
│ New C1 Block: │ ════════▶ │ │
│ [a,b,c,d,e] │ │ C1 Buffer: │
│ (Write to new │ │ [Block2: m,n,o] │
│ disk location) │ │ │
└─────────────────────┘ └─────────────────────┘
Workflow:
- Initialization Phase
- Determine the range of multi-page blocks in C1 to be merged (usually contiguous disk blocks)
- Read these blocks into memory buffer
- Merge Phase
- Read a continuous segment of entries from C0 (called merge cursor)
- Sort and merge C0 entries with existing data in C1 blocks by key
- Handle updates (overwrite old values) and deletes (generate Tombstone)
- Write Phase
- Write merged data to new location in C1
- Use append-write instead of overwrite, ensuring crash safety
- Update index to point to new block
- Advance Phase
- Move merge cursor to next segment of C0
- Loop execution until all C0 data is migrated
Key Characteristics:
- Gradualness: Not a one-time full merge, but performed in batches, controlling single I/O overhead
- Concurrency Safety: Through “Filling Block” and “Emptying Block” design, supports concurrent read/write
- Crash Recovery: Using WAL and checkpoints, can resume from interruption after failure
Difference from LevelDB Compaction:
- Rolling Merge: Continuous, fixed-rhythm merge
- LevelDB Compaction: On-demand triggered, batch-processing background task
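The merge phase of the workflow above can be reduced to a toy function. This is a minimal sketch: the real Rolling Merge streams multi-page blocks through a buffer and handles tombstones and cursors, all of which are elided here, and `rolling_merge_pass` is an illustrative name:

```python
def rolling_merge_pass(c0_segment, c1_block):
    """Merge two key-sorted (key, value) lists; C0 (newer) wins on duplicates.

    The result models a *new* C1 block written to a fresh disk location,
    never an in-place overwrite of the old block.
    """
    merged = dict(c1_block)           # older C1 data first
    merged.update(dict(c0_segment))   # newer C0 entries overwrite duplicates
    return sorted(merged.items())     # new block, sorted by key

new_block = rolling_merge_pass(
    c0_segment=[("b", 2), ("x", 10)],          # fresh entries from memory
    c1_block=[("a", 1), ("c", 3), ("x", 9)])   # existing disk block
print(new_block)  # [('a', 1), ('b', 2), ('c', 3), ('x', 10)]
```

Note that `x` takes the newer value from C0, which is exactly the "updates overwrite old values" rule of the merge phase.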
2.3 Multi-Component Architecture
When the memory cost of two-component (C0+C1) architecture is too high, the paper proposes extending to multi-component architecture.
Architecture Form:
LSM-Tree Multi-Component Architecture
┌────────────────────────────────────────┐
│ │
│ C0 (Memory) │
│ ┌────────────────┐ │
│ │ Key range: │ ← New writes │
│ │ [latest keys] │ │
│ └───────┬────────┘ │
│ │ Rolling Merge │
│ ▼ │
│ C1 (Disk) │
│ ┌────────────────┐ Size: 10× C0 │
│ │ Key range: │ ← Older data │
│ │ [older keys] │ │
│ └───────┬────────┘ │
│ │ Rolling Merge │
│ ▼ │
│ C2 (Disk) │
│ ┌────────────────┐ Size: 10× C1 │
│ │ Key range: │ ← Oldest data │
│ │ [oldest keys] │ │
│ └───────┬────────┘ │
│ │ │
│ ▼ │
│ C3, C4, ... (More Disk Levels) │
│ │
│ Each level capacity increases │
│ (usually 10 times) │
│ Colder data resides in deeper levels │
│ │
└────────────────────────────────────────┘
Component Size Design:
- Optimal size ratio: r = S_i / S_(i-1) = (S_K / S_0)^(1/K)
- S_i: size of the i-th component
- S_0: smallest component (usually C0)
- S_K: largest component
- K: number of components
Practical Experience:
- Usually using 3 components (C0, C1, C2) can satisfy most scenarios
- Benefits diminish beyond 3 components, but management complexity increases significantly
- C0 is completely memory-resident, C1 and C2 are disk-resident
Merge Strategy:
- C0→C1: Most frequent merge, keeping C0 size controllable
- C1→C2: Less frequent, triggered when C1 reaches certain threshold
- Asynchronous execution: Merges at each level can proceed in parallel
This multi-component design reduces memory costs significantly while maintaining write performance. The multi-level Compaction strategy used in actual systems (like LevelDB, RocksDB) is the engineering implementation of this idea.
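The size-ratio formula above is easy to evaluate. The sketch below uses hypothetical capacities (64 MB memory component, 640 GB largest disk component) and an illustrative helper name:

```python
def optimal_ratio(s0_bytes, sk_bytes, k):
    """Geometric size ratio r = (S_K / S_0)^(1/K) between adjacent components."""
    return (sk_bytes / s0_bytes) ** (1.0 / k)

s0 = 64 * 2**20    # C0: 64 MB in memory (hypothetical)
sk = 640 * 2**30   # C_K: 640 GB on disk (hypothetical)
for k in (2, 3, 4):
    print(f"K={k}: r = {optimal_ratio(s0, sk, k):.1f}")
```

For this configuration, K=4 yields r of roughly 10, which matches the "each level is about 10 times larger" convention that LevelDB and RocksDB adopted.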
2.4 Operation Semantics
Delete Operation
LSM-Tree deletion is logical deletion rather than physical deletion:
Tombstone Mechanism Diagram:
Step 1: Insert Tombstone in C0
┌─────────────────────────────────────────┐
│ C0 (Memory) │
│ ┌──────────────────────────────────┐ │
│ │ key=X: [Tombstone] ← New write │ │
│ │ key=A: value1 │ │
│ │ key=B: value2 │ │
│ └──────────────────────────────────┘ │
│ │
│ C1 (Disk) │
│ ┌──────────────────────────────────┐ │
│ │ key=X: old_value ← To be cleaned│ │
│ │ key=C: value3 │ │
│ └──────────────────────────────────┘ │
└─────────────────────────────────────────┘
Step 2: Rolling Merge Propagation
┌─────────────────────────────────────────┐
│ C0 (Memory) │
│ ┌──────────────────────────────────┐ │
│ │ key=A: value1 │ │
│ │ key=B: value2 │ │
│ └──────────────────────────────────┘ │
│ │ │
│ ▼ Rolling Merge │
│ │
│ C1 (Disk) │
│ ┌──────────────────────────────────┐ │
│ │ key=X: [Tombstone] ← Propagated │ │
│ │ key=X: old_value ← Marked for │ │
│ │ cleaning │ │
│ │ key=C: value3 │ │
│ └──────────────────────────────────┘ │
└─────────────────────────────────────────┘
Step 3: Annihilation During Merge
┌─────────────────────────────────────────┐
│ C1 (New Block Written) │
│ ┌──────────────────────────────────┐ │
│ │ key=A: value1 │ │
│ │ key=B: value2 │ │
│ │ key=C: value3 │ │
│ │ (key=X cleaned, no record) │ │
│ └──────────────────────────────────┘ │
└─────────────────────────────────────────┘
Mechanism Explanation:
- Tombstone Generation
- Insert special “delete marker” entry (Tombstone) in C0
- Marker contains deleted key and deletion timestamp
- Original value position no longer matters, only need to mark key as deleted
- Delete Propagation
- Tombstone propagates from C0 to C1, C2… along with Rolling Merge
- During merge, Tombstone “annihilates” when meeting corresponding actual entry
- Eventually completely cleaned, releasing space
- Query Processing
- When querying, scan components from new to old in order
- When encountering Tombstone, immediately stop search, return key does not exist
- If actual entry is encountered first, return that value (indicating version written before deletion)
Advantage: Delete operations only require memory writes, no disk random access, extremely high performance.
Predicate Delete
Supports batch deletion based on conditions (such as “delete data older than 20 days”):
- Check each entry during Rolling Merge process
- Entries meeting conditions are directly discarded, not written to new blocks
- No need for separate scan and marking, utilize existing merge process to complete cleanup
Application Scenarios: Time-series data expiration, log cleanup, privacy data deletion, etc.
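Both the Tombstone annihilation and predicate-delete paths happen inside the same merge loop, which can be sketched as follows. This is a simplified in-memory model; `merge_with_deletes` and `TOMBSTONE` are illustrative names, and real systems encode tombstones as typed entries rather than sentinel values:

```python
TOMBSTONE = object()  # sentinel standing in for a delete marker

def merge_with_deletes(newer, older, predicate=None, is_last_level=True):
    """newer/older: dicts key -> value-or-TOMBSTONE; newer shadows older.

    predicate(key, value) -> True drops an entry during the merge
    (e.g. TTL expiry), with no separate scan-and-mark pass needed.
    """
    merged = dict(older)
    merged.update(newer)   # a tombstone "annihilates" the value it shadows
    out = {}
    for key, value in merged.items():
        if value is TOMBSTONE:
            if not is_last_level:
                out[key] = value   # keep propagating the tombstone downward
            continue               # at the last level the tombstone is dropped
        if predicate and predicate(key, value):
            continue               # predicate delete: discard while merging
        out[key] = value
    return out

older = {"x": "old", "log:1": "day30", "a": "v1"}
newer = {"x": TOMBSTONE, "b": "v2"}
result = merge_with_deletes(newer, older,
                            predicate=lambda k, v: k.startswith("log:"))
print(sorted(result))  # ['a', 'b'] : x annihilated, log:1 predicate-deleted
```

The `is_last_level` flag captures why tombstones must survive intermediate merges: a deeper component may still hold an older version of the key.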
2.5 Concurrency and Recovery
Concurrency Control Mechanism
Component-Level Locking:
Component-Level Read-Write Lock Mechanism
┌────────────────────────────────────────┐
│ │
│ Read Operations (Multiple Concurrent) │
│ ┌─────┐ ┌─────┐ ┌─────┐ │
│ │ R1 │ │ R2 │ │ R3 │ │
│ └──┬──┘ └──┬──┘ └──┬──┘ │
│ └───────┼───────┘ │
│ ▼ │
│ C0: Read Lock (Shared) ────────┐ │
│ ┌──────────────────────────────┴─┐ │
│ │ C0 Memory Component │ │
│ └────────────────────────────────┘ │
│ │
│ Write Operation (Exclusive) │
│ ┌─────┐ │
│ │ W │ ──▶ C0: Write Lock (Exclusive)│
│ └─────┘ │
│ │
│ Rolling Merge (Requires Two-Level Lock)│
│ ┌─────┐ │
│ │ RM │ ──▶ Holds C0+C1 Write Locks │
│ └─────┘ │
│ │
└────────────────────────────────────────┘
Cursor Crossing Handling:
Filling Block and Emptying Block Design
┌────────────────────────────────────────┐
│ │
│ Rolling Merge Cursor Position │
│ C0: [a,b,c,d,e,f,g,h] │
│ ↑ │
│ Cursor (scanned c) │
│ │
│ Concurrent write key=c (new data) │
│ ┌────────────────┐ │
│ │ Filling Block │ ◀── Temp new write│
│ │ key=c: new_val │ │
│ └────────────────┘ │
│ │
│ C1 Block to be merged │
│ ┌────────────────┐ │
│ │ Emptying Block │ ◀── Original data │
│ │ [x,y,z,c_old] │ (Locked) │
│ └────────────────┘ │
│ │
│ Merge Result │
│ ┌────────────────┐ │
│ │ [x,y,z,c_new] │ ◀── Filling block │
│ └────────────────┘ merged in │
│ │
└────────────────────────────────────────┘
- Rolling Merge uses cursor to traverse C0 and C1
- Concurrent writes may insert at positions already scanned by cursor
- Solution: “Filling Block” and “Emptying Block” design
- Lock C1’s pending block before merge (Emptying Block)
- Concurrent writes temporarily stored in special area (Filling Block)
- After merge completes, Filling Block data is merged into result
Recovery Mechanism
Checkpoint and WAL Mechanism Diagram:
Checkpoint and WAL Collaborative Work
┌────────────────────────────────────────┐
│ │
│ Time ─────────────────────────────▶ │
│ │
│ ┌──────────┐ ┌──────────┐ │
│ │ Write 1 │ │ Write 2 │ │
│ │ (WAL 1) │ │ (WAL 2) │ │
│ └────┬─────┘ └────┬─────┘ │
│ │ │ │
│ ┌────▼───────────────▼─────┐ │
│ │ WAL Log File │ │
│ │ [W1][W2][W3][W4][W5]... │ │
│ └──────────────────────────┘ │
│ ↑ │
│ ┌────┴────┐ ┌──────────┐ │
│ │Checkpoint│◀───│ Write 3 │ │
│ │ (t=N) │ │ (WAL 3) │ │
│ └────┬────┘ └──────────┘ │
│ │ │
│ ┌────▼──────────────┐ │
│ │ Checkpoint File │ │
│ │ C0 State Snapshot │ │
│ │ C1 Boundary Key │ │
│ └───────────────────┘ │
│ │
│ Recovery After Crash: │
│ 1. Load checkpoint (t=N state) │
│ 2. Replay WAL (W3, W4, W5...) │
│ 3. Restore to latest state │
│ │
└────────────────────────────────────────┘
Checkpoint:
- Periodically flush all or part of C0 content to disk
- Record current state of each component (size, boundary key, etc.)
- Recover from checkpoint after crash, reducing WAL replay volume
WAL (Write-Ahead Log):
- Each write operation is first recorded to sequential log
- Log records contain key, value, operation type, timestamp
- After crash, scan WAL to replay un-persisted operations
Recovery Process:
- Load most recent checkpoint, restore C0 and C1 state
- Scan WAL, replay all operations after checkpoint
- Verify consistency of each component (checksum, size, etc.)
- Recovery complete, system available
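The recovery process above can be condensed into a small replay function. In-memory dicts and tuples stand in for the checkpoint file and WAL records; the record format `(seq, op, key, value)` is an illustrative assumption:

```python
def recover(checkpoint_state, checkpoint_seq, wal_records):
    """Rebuild state: load the checkpoint, then replay newer WAL records.

    checkpoint_state: dict snapshot taken at sequence checkpoint_seq.
    wal_records: list of (seq, op, key, value) with op in {'put', 'del'}.
    """
    state = dict(checkpoint_state)
    for seq, op, key, value in wal_records:
        if seq <= checkpoint_seq:
            continue                # already covered by the checkpoint
        if op == "put":
            state[key] = value
        elif op == "del":
            state.pop(key, None)
    return state

checkpoint = {"a": 1, "b": 2}       # snapshot written at seq = 2
wal = [(1, "put", "a", 1), (2, "put", "b", 2),     # before the checkpoint
       (3, "put", "c", 3), (4, "del", "a", None)]  # replayed after it
print(recover(checkpoint, checkpoint_seq=2, wal_records=wal))
# {'b': 2, 'c': 3}
```

The checkpoint's only job here is to bound how much of the WAL must be replayed, which is exactly the trade-off the diagram describes.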
3. LevelDB Engineering Implementation (2011)
3.1 Architecture Overview
LevelDB adopts the classic LSM-Tree architecture, divided into memory components and disk components.
LevelDB Overall Architecture Diagram:
LevelDB Architecture
┌─────────────────────────────────────────────────────────────┐
│ │
│ Write Path: │
│ ┌──────────┐ ┌──────────┐ ┌──────────────────────┐ │
│ │ Put(key) │───▶│ WAL │───▶│ MemTable │ │
│ └──────────┘ │ (log) │ │ (SkipList) │ │
│ └──────────┘ └──────────┬───────────┘ │
│ │ │
│ Flush (Background): ▼ │
│ ┌──────────────────────────┐ │
│ │ Immutable MemTable │ │
│ └──────────┬───────────────┘ │
│ │ │
│ ▼ │
│ SSTable Files: │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │Level 0 │ │Level 1 │ │Level 2 │ │Level 3 │ │
│ │~4MB │ │~10MB │ │~100MB │ │~1GB │ │
│ │(Overlapping)│ │(Non-overlap)│ │(Non-overlap)│ │(Non-overlap)│ │
│ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │
│ ▲ │
│ │ Major Compaction (Background Merge) │
│ └───────────────────────────────────────────────┘
│ │
│ MANIFEST File: (Version Change Log) │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ Records all SSTable file metadata for crash recovery │ │
│ └──────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
Level Size Configuration:
| Level | Target Size | File Size | Characteristics |
|---|---|---|---|
| Level 0 | 4 MB | ~2 MB | Compaction triggered at 4 files; keys may overlap between files |
| Level 1 | 10 MB | ~2 MB | Keys do not overlap between files |
| Level 2 | 100 MB | ~2 MB | Size is 10× L1 |
| Level 3 | 1 GB | ~2 MB | And so on… |
| Level 4 | 10 GB | ~2 MB | |
| Level 5 | 100 GB | ~2 MB | |
| Level 6 | 1 TB | ~2 MB | |
Important Threshold Parameters:
- kL0_SlowdownWritesTrigger = 8: Write slowdown when L0 file count reaches 8
- kL0_StopWritesTrigger = 12: Write stop when L0 file count reaches 12
- kTargetFileSize = 2 MB: SSTable target file size
Write Path:
- User calls Put(key, value)
- First write sequentially to WAL (Write-Ahead Log) to ensure durability
- Then insert into memory MemTable (SkipList implementation)
- Return write success (delayed flush)
MemTable Flush (Minor Compaction):
- When MemTable reaches threshold (default 4MB), convert to Immutable MemTable
- Background thread writes Immutable MemTable to disk as Level 0 SSTable
- After write completes, release Immutable MemTable, reset WAL
Background Merge (Major Compaction):
- Level 0 SSTables may have overlapping key ranges
- Triggered when Level 0 file count exceeds threshold, or lower level size exceeds limit
- Merge Level N SSTable with overlapping files in Level N+1
- Generate new sorted SSTables into lower level after merge
Read Path:
- First search in MemTable
- Then search in Immutable MemTable
- Scan SSTables at each level from new to old
- Use Bloom Filter for quick exclusion at each level
- Locate specific Data Block through index
- Binary search target key within Data Block
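The read-path ordering above is what makes newer versions shadow older ones. A minimal sketch (Bloom-filter and block-index steps are elided; dicts stand in for the MemTables and per-level SSTables, and `get` is an illustrative helper):

```python
def get(key, memtable, imm_memtable, levels):
    """Search newest-to-oldest; a None value models a tombstone."""
    for source in [memtable, imm_memtable, *levels]:
        if key in source:
            value = source[key]
            # A tombstone hides any older value in deeper sources.
            return None if value is None else value
    return None  # key not present anywhere

memtable = {"a": "new"}
imm = {"b": None}                    # b deleted; tombstone not yet compacted
levels = [{"b": "old", "c": "v3"}]   # index 0 = newest on-disk level

print(get("a", memtable, imm, levels))  # 'new' (MemTable wins)
print(get("b", memtable, imm, levels))  # None (tombstone shadows L0's value)
print(get("c", memtable, imm, levels))  # 'v3'
```

Stopping at the first hit is safe precisely because sources are consulted from newest to oldest.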
3.2 Core Data Structures
MemTable: SkipList
LevelDB chooses SkipList over B-Tree as MemTable implementation:
SkipList Structure Diagram:
SkipList Multi-Level Index Structure
┌────────────────────────────────────────────────────────────┐
│ │
│ Level 3 (Highest): head ──────────────────────▶ [70] │
│ │ │
│ Level 2: head ─────▶ [30] ─────────▶ [70] │
│ │ │ │ │
│ Level 1: head ─────▶ [30] ──▶ [50] ─▶ [70] │
│ │ │ │ │ │
│ Level 0 (Bottom): head ──▶ [10]─▶ [30]─▶ [50]─▶ [70]─▶ [90] │
│ data data data data │
│ │
│ Search [50]: Start from Level 3, descend level by level │
│ - L3: head ──▶ 70 (>50), descend to L2 │
│ - L2: head ──▶ 30 (<50), forward ──▶ 70 (>50), descend to L1│
│ - L1: 30 ──▶ 50 (Hit!) │
│ │
│ Time Complexity: O(log n), avg levels ≈ log₁/ₚ(n) │
│ │
└────────────────────────────────────────────────────────────┘
SkipList Core Characteristics:
- Simple Implementation: ~400 lines of code, no complex balancing operations
- Concurrency Friendly: Supports lock-free reads, implemented via atomic operations
- Memory Efficient: Uses Arena allocator for batch memory allocation, reduces fragmentation
- Range Query: Ordered structure, supports efficient range scans
Node Height Generation:
- Each node’s height is randomly generated with probability p = 0.25
- Probability of height k: p^(k-1) * (1-p)
- Expected height: 1/(1-p) = 1.33 (average ~1.33 pointers per node)
Time Complexity:
- Lookup/Insert/Delete: O(log n)
- Space Complexity: O(n), average 1/(1-p) pointers per node
Comparison with B-Tree:
| Characteristic | B-Tree | SkipList | LevelDB Benefit |
|---|---|---|---|
| Implementation Code | ~2000+ lines | ~400 lines | 80% maintenance cost reduction |
| Concurrent Read | Requires lock | Lock-free | 2-3× read performance improvement |
| Memory Allocation | Frequent alloc/free | Arena batch allocation | Reduced memory fragmentation |
| Implementation Complexity | High (requires balancing) | Low (probabilistic balancing) | Lower bug rate |
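A toy Python SkipList shows how little machinery the structure needs. This is a sketch mirroring the diagram above, not LevelDB's implementation (which is lock-free C++ over an Arena allocator); it keeps only the probabilistic-height core with p = 0.25:

```python
import random

class Node:
    def __init__(self, key, value, height):
        self.key, self.value = key, value
        self.next = [None] * height   # one forward pointer per level

class SkipList:
    MAX_HEIGHT, P = 12, 0.25          # LevelDB also uses 12 and 1/4

    def __init__(self):
        self.head = Node(None, None, self.MAX_HEIGHT)

    def _random_height(self):
        height = 1                    # P(height = k) = P^(k-1) * (1-P)
        while height < self.MAX_HEIGHT and random.random() < self.P:
            height += 1
        return height

    def insert(self, key, value):
        update, node = [self.head] * self.MAX_HEIGHT, self.head
        for level in range(self.MAX_HEIGHT - 1, -1, -1):
            while node.next[level] and node.next[level].key < key:
                node = node.next[level]
            update[level] = node      # last node before key, per level
        new = Node(key, value, self._random_height())
        for level in range(len(new.next)):
            new.next[level] = update[level].next[level]
            update[level].next[level] = new

    def get(self, key):
        node = self.head
        for level in range(self.MAX_HEIGHT - 1, -1, -1):
            while node.next[level] and node.next[level].key < key:
                node = node.next[level]
        node = node.next[0]
        return node.value if node and node.key == key else None

sl = SkipList()
for k, v in [(30, "c"), (10, "a"), (70, "e"), (50, "d")]:
    sl.insert(k, v)
print(sl.get(50), sl.get(40))  # d None
```

No rebalancing code exists anywhere: the random heights alone keep expected search cost at O(log n), which is the "probabilistic balancing" the table refers to.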
SSTable File Format Detailed Explanation
SSTable (Sorted String Table) is LevelDB’s core disk file format, designed to efficiently store ordered key-value pairs and support fast queries.
Overall Structure
SSTable files are divided into four regions from top to bottom:
Data Region (Data Blocks):
- Stores actual key-value pair records
- Default each Data Block is 4KB (before compression)
- Records within block are ordered by key
- Uses prefix compression (shared prefixes) to reduce space
Metadata Region (Meta Blocks):
- Filter Block: Stores Bloom Filter data, accelerates queries for non-existent keys
- Meta Index Block: Index pointing to Filter Block
Index Region (Index Block):
- Each Data Block corresponds to one index entry
- Index entry key: Maximum key of Data Block
- Index entry value: Offset and size of Data Block
Footer:
- Fixed 48 bytes
- Contains location information of Meta Index Block and Index Block
- Magic Number for file format verification
Data Block Internal Structure
Data Block is the smallest unit storing actual key-value pairs in SSTable, default size is 4KB (before compression).
Structure Composition:
- Key-Value Record Sequence: Actual stored data
- Restart Point Array: Used to accelerate in-block lookups
Prefix Compression Mechanism:
- Records within block are sorted by key, adjacent keys usually share common prefixes
- Each record only stores the difference from previous key
- Set a Restart Point every 16 records, storing complete key at that point
- During lookup, decode from nearest Restart Point to reduce computation
Record Format:
- Shared Length: Length of prefix shared with previous key
- Non-shared Length: Length of non-shared portion of this key
- Value Length: Value length
- Key Delta: Non-shared key bytes
- Value: Actual value
Lookup Process:
- Binary search in Restart Point array to determine target record range
- Decode sequentially from starting Restart Point of that range
- Compare keys until found or out of range
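The prefix-compression scheme above can be sketched end to end. Records are modeled as Python tuples instead of the varint byte encoding, and `encode_block`/`decode_block` are illustrative names; the restart interval of 16 matches LevelDB's default:

```python
RESTART_INTERVAL = 16

def encode_block(sorted_kv):
    """Return (records, restarts).

    Each record is (shared_len, key_delta, value); restarts lists the record
    indices where a complete key is stored.
    """
    records, restarts, prev_key = [], [], ""
    for i, (key, value) in enumerate(sorted_kv):
        if i % RESTART_INTERVAL == 0:
            restarts.append(i)
            shared = 0                      # restart point: store the full key
        else:
            shared = 0
            while (shared < min(len(prev_key), len(key))
                   and prev_key[shared] == key[shared]):
                shared += 1                 # length of the common prefix
        records.append((shared, key[shared:], value))
        prev_key = key
    return records, restarts

def decode_block(records):
    """Sequentially rebuild full keys from a restart point onward."""
    out, prev_key = [], ""
    for shared, delta, value in records:
        key = prev_key[:shared] + delta
        out.append((key, value))
        prev_key = key
    return out

kv = [("app", 1), ("apple", 2), ("apply", 3), ("banana", 4)]
records, restarts = encode_block(kv)
assert decode_block(records) == kv
print(records[1])  # (3, 'le', 2): "apple" shares "app" with the previous key
```

Because decoding is sequential, a lookup must start from a restart point, which is why the restart array exists and why it is binary-searched first.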
Record Encoding Details
LevelDB uses varint (variable-length integer) encoding to reduce storage space for small values:
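A minimal Python rendering of the varint32 scheme (7 payload bits per byte, continuation flag in the high bit; the helper names are illustrative):

```python
def encode_varint(n):
    """Encode a non-negative integer as a LevelDB-style varint."""
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        n >>= 7
    out.append(n)                      # final byte has the high bit clear
    return bytes(out)

def decode_varint(data, pos=0):
    """Decode a varint starting at pos; return (value, next_pos)."""
    result = shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7

assert encode_varint(1) == b"\x01"        # 1 byte instead of a fixed 4
assert encode_varint(300) == b"\xac\x02"  # 2 bytes for values < 16384
assert decode_varint(encode_varint(123456789))[0] == 123456789
print(len(encode_varint(300)))  # 2
```

Since most shared lengths, key deltas, and value lengths are small, they usually fit in one byte, which is where the space saving comes from.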
Filter Block Structure
Filter Block stores SSTable’s Bloom Filter data, used to quickly exclude Data Blocks not containing target key.
Filter Block Layout:
┌─────────────────────────────────────────────────────────────┐
│ Filter Block Structure │
├─────────────────────────────────────────────────────────────┤
│ │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ Filter Data 1 (covers file offsets 0-2KB) │ │
│ │ [Bloom Filter bits] │ │
│ │ One filter is built per 2KB of file data │ │
│ └───────────────────────────────────────────────────────┘ │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ Filter Data 2 (covers file offsets 2KB-4KB) │ │
│ │ [Bloom Filter bits] │ │
│ └───────────────────────────────────────────────────────┘ │
│ ... │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ Filter Metadata (offset array) │ │
│ │ [offset_0: 4 bytes] ← Filter Data 0 offset │ │
│ │ [offset_1: 4 bytes] ← Filter Data 1 offset │ │
│ │ ... │ │
│ │ [base_lg: 1 byte] ← log2(block size), default 11 │ │
│ │ (2KB) │ │
│ └───────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
Bloom Filter Parameters:
- bits_per_key: Default 10 bits
- k (number of hash functions): k ≈ ln 2 × bits_per_key ≈ 0.69 × bits_per_key; LevelDB truncates, giving k = 6 for bits_per_key = 10
- False Positive Rate: ~1% (when bits_per_key = 10)
False Positive Rate Calculation:
P(false positive) ≈ (1 - e^(-k*n/m))^k
Where:
- n = number of keys
- m = total number of bits
- k = number of hash functions
| bits_per_key | False Positive Rate | Memory Overhead (per million keys) |
|---|---|---|
| 5 | ~9.2% | 0.625 MB |
| 8 | ~2.2% | 1.0 MB |
| 10 | ~0.8% | 1.25 MB |
| 15 | ~0.07% | 1.875 MB |
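These rates follow directly from the formula above; a sketch that evaluates it, with `bloom_fp_rate` as an illustrative helper and k truncated the way LevelDB does:

```python
import math

def bloom_fp_rate(bits_per_key):
    """P(false positive) ~ (1 - e^(-k*n/m))^k with m/n = bits_per_key."""
    k = max(1, int(bits_per_key * 0.69))  # LevelDB truncates, clamps to >= 1
    return (1 - math.exp(-k / bits_per_key)) ** k

for b in (5, 8, 10, 15):
    print(f"bits_per_key={b:2d}  k={max(1, int(b * 0.69))}  "
          f"FP rate = {bloom_fp_rate(b):.2%}")
```

Doubling bits_per_key roughly squares the false-positive rate downward, which is why 10 bits per key (about 1.25 MB per million keys) is the common sweet spot.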
Index Block Structure
Index Block stores index information for each Data Block, used to quickly locate Data Blocks.
Index Block Layout:
┌─────────────────────────────────────────────────────────────┐
│ Index Block Structure │
├─────────────────────────────────────────────────────────────┤
│ │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ Index Entry 1 (corresponds to Data Block 1) │ │
│ │ ┌─────────────────────────────────────────────────┐ │ │
│ │ │ Key: Maximum key of Data Block 1 │ │ │
│ │ │ Value: [offset: varint32][size: varint32] │ │ │
│ │ └─────────────────────────────────────────────────┘ │ │
│ └───────────────────────────────────────────────────────┘ │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ Index Entry 2 (corresponds to Data Block 2) │ │
│ │ ┌─────────────────────────────────────────────────┐ │ │
│ │ │ Key: Maximum key of Data Block 2 │ │ │
│ │ │ Value: [offset: varint32][size: varint32] │ │ │
│ │ └─────────────────────────────────────────────────┘ │ │
│ └───────────────────────────────────────────────────────┘ │
│ ... │
│ │
│ Lookup Process: │
│ 1. Binary search for target key in Index Block │
│ 2. Find first Entry where max_key >= target_key │
│ 3. Read Data Block according to offset and size in Entry │
│ │
└─────────────────────────────────────────────────────────────┘
Footer Detailed Format (48 bytes)
Footer is the tail of SSTable file, fixed 48 bytes, containing pointers to critical metadata blocks.
Footer Layout:
┌─────────────────────────────────────────────────────────────┐
│ Footer (48 bytes) │
├─────────────────────────────────────────────────────────────┤
│ │
│ Offset 0-39: Block Handles + Padding (40 bytes) │
│ ├─ Meta Index Block Handle: │
│ │ offset: varint64 (max 10 bytes) │
│ │ size: varint64 (max 10 bytes) │
│ ├─ Index Block Handle: │
│ │ offset: varint64 (max 10 bytes) │
│ │ size: varint64 (max 10 bytes) │
│ └─ Zero padding up to the 40-byte boundary │
│ │
│ Offset 40-47: Magic Number (8 bytes) │
│ Fixed value: 0xdb4775248b80fb57 │
│ (little-endian) │
│ Used to verify file format integrity │
│ │
└─────────────────────────────────────────────────────────────┘
Why 48 bytes?
- Each BlockHandle stores an offset and a size as varint64 values, at most 10 bytes each, so the two handles need at most 40 bytes
- Padding the handle area to a fixed 40 bytes gives the Footer a constant length, so it can be located with a single positioned read from the end of the file
- 40 bytes of handles and padding + 8-byte Magic Number = 48 bytes
Internal Key Format Detailed Explanation
LevelDB adds metadata when storing keys to support multi-versioning, delete markers, and transactions. Internal Key is the core mechanism for LSM-Tree to implement MVCC (Multi-Version Concurrency Control).
Internal Key Structure (Total length = user_key_length + 8 bytes):
Sequence Number
Sequence Number Characteristics:
- Globally Incrementing: Each write operation (Put/Delete/Merge) increments sequence number by 1
- Monotonicity Guarantee: Protected by DBImpl mutex lock, ensuring thread safety
- Snapshot Isolation: Record current sequence number when reading, only read data ≤ that sequence number
- Descending Order: Same User Key ordered by sequence number descending, ensuring latest version returned first
Type (Type Byte)
Type Descriptions:
| Type | Value | Purpose | Cleanup Timing |
|---|---|---|---|
| kTypeDeletion | 0x00 | Mark key as deleted | Can be cleaned after Compaction when no snapshot reference |
| kTypeValue | 0x01 | Normal key-value pair | Can be cleaned when overwritten by new version and no snapshot reference |
| kTypeMerge | 0x02 | Merge operation (incremental update); RocksDB extension, not in LevelDB | Generate new value after merging with base value |
| kTypeSingleDelete | 0x07 | Single-delete optimization; RocksDB extension, not in LevelDB | Annihilates exactly one older Put when they meet in Compaction (caller must guarantee the key was written at most once) |
Internal Key Encoding Details
Encoding Layout (Little-Endian):
Sorting Rules Detailed Explanation
Internal Key comparator implementation:
Sorting Example:
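A minimal Python stand-in for the C++ InternalKeyComparator, covering both the encoding and the ordering. As LevelDB does, the 7-byte sequence number and 1-byte type are packed into a single little-endian uint64:

```python
import functools
import struct

kTypeDeletion, kTypeValue = 0x0, 0x1

def make_internal_key(user_key: bytes, seq: int, type_byte: int) -> bytes:
    # LevelDB appends (sequence << 8) | type as a little-endian uint64.
    return user_key + struct.pack("<Q", (seq << 8) | type_byte)

def internal_compare(a: bytes, b: bytes) -> int:
    """User key ascending, then packed (sequence, type) descending."""
    ua, ub = a[:-8], b[:-8]
    if ua != ub:
        return -1 if ua < ub else 1
    ta, = struct.unpack("<Q", a[-8:])
    tb, = struct.unpack("<Q", b[-8:])
    return (ta < tb) - (ta > tb)       # larger sequence sorts first

keys = [make_internal_key(b"apple", 100, kTypeValue),
        make_internal_key(b"apple", 200, kTypeDeletion),
        make_internal_key(b"banana", 50, kTypeValue)]
ordered = sorted(keys, key=functools.cmp_to_key(internal_compare))
print([(k[:-8], struct.unpack("<Q", k[-8:])[0] >> 8) for k in ordered])
# [(b'apple', 200), (b'apple', 100), (b'banana', 50)]
```

The delete at seq=200 sorts before the older Put at seq=100, so a scan over the same user key sees the newest version (here, the deletion) first.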
Version Control and Garbage Collection
Version Visibility Rules:
- An entry is visible to a snapshot if its sequence number is ≤ the snapshot's sequence number
- Among visible versions of the same key, only the newest (highest sequence number) is returned
- Versions older than the newest version visible to the oldest live snapshot are garbage
Garbage Collection Timing:
- Compaction: Main cleanup mechanism, checks snapshot references
- Snapshot Release: When oldest snapshot advances, old versions can be cleaned
- Periodic Cleanup: Background thread checks expired data
Internal Key Layout Diagram:
┌─────────────────────────────────────────────────────────────┐
│ Internal Key Structure │
├─────────────────────────────────────────────────────────────┤
│ │
│ [User Key (variable)] [Sequence Number (7 bytes)] [Type (1 byte)]│
│ │
│ Example: key="hello", seq=100, type=kTypeValue │
│ ┌──────────┬──────────────────┬──────────┐ │
│ │ "hello" │ 0x000000000064 │ 0x01 │ │
│ │ (5B) │ (7B) │ (1B) │ │
│ └──────────┴──────────────────┴──────────┘ │
│ │
│ Sorting Rule: User Key ascending → Sequence Number descending│
│ Effect: Latest version of same key appears first │
│ │
└─────────────────────────────────────────────────────────────┘
Lookup Key Format
Temporary Key format used during reads:
Lookup Key Characteristics:
- Contains snapshot sequence number
- Uses kValueTypeForSeek (= kTypeValue, the largest valid type value) so a seek at (user_key, snapshot_seq) is positioned before every entry visible to that snapshot
- Used to locate starting position in MemTable/SSTable
SSTable Lookup Process
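The two-level lookup (binary search in the Index Block, then within one Data Block) can be sketched with in-memory lists standing in for on-disk blocks; `sstable_get` is an illustrative helper:

```python
import bisect

def sstable_get(index, data_blocks, target):
    """index: sorted list of each block's maximum key.
    data_blocks[i]: the sorted (key, value) list covered by index[i]."""
    i = bisect.bisect_left(index, target)  # first block with max_key >= target
    if i == len(index):
        return None                        # target is past the last block
    block = data_blocks[i]
    j = bisect.bisect_left(block, (target,))
    if j < len(block) and block[j][0] == target:
        return block[j][1]
    return None                            # block checked, key absent

index = ["cat", "fox"]                     # max key of each Data Block
blocks = [[("ant", 1), ("bee", 2), ("cat", 3)],
          [("dog", 4), ("fox", 5)]]
print(sstable_get(index, blocks, "dog"))   # 4
print(sstable_get(index, blocks, "cow"))   # None (one block read, no match)
```

Note that "cow" still costs one block probe; this is the case the Bloom Filter exists to short-circuit before any Data Block is read.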
SSTable Build Process
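Building mirrors lookup in reverse: fill Data Blocks in key order, record one index entry (max key) per finished block, and write the index and footer last. A sketch with Python structures instead of serialized bytes; `build_sstable` and `BLOCK_TARGET` are illustrative, and the tiny block size is for demonstration only (real blocks target ~4KB):

```python
BLOCK_TARGET = 3  # entries per Data Block (tiny, for illustration)

def build_sstable(sorted_kv):
    """Assemble data blocks, an index (max key -> block id), and a footer."""
    data_blocks, index, current = [], [], []
    for key, value in sorted_kv:
        current.append((key, value))
        if len(current) >= BLOCK_TARGET:          # block full: seal it
            index.append((current[-1][0], len(data_blocks)))
            data_blocks.append(current)
            current = []
    if current:                                   # seal the final partial block
        index.append((current[-1][0], len(data_blocks)))
        data_blocks.append(current)
    footer = {"index_handle": "index", "magic": 0xDB4775248B80FB57}
    return {"data": data_blocks, "index": index, "footer": footer}

table = build_sstable([("a", 1), ("b", 2), ("c", 3), ("d", 4)])
print([entry[0] for entry in table["index"]])  # ['c', 'd']
```

Because input arrives already sorted (from the MemTable's SkipList), the whole build is a single sequential pass: no seeks, no rewrites.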
SSTable File Example
SSTable Advantages Summary
| Characteristic | Description | Effect |
|---|---|---|
| Prefix Compression | Adjacent keys share prefixes | Save 30-50% space |
| Block-based | Fixed-size Data Blocks | Efficient caching and compression |
| Bloom Filter | Quickly exclude files not containing key | Reduce 90%+ disk reads |
| Index Structure | Two-level index (Index + Data Block) | O(log n) query |
| Immutability | SSTable unmodifiable once written | Simplify concurrency and caching |
Internal Key format: [User Key | Sequence Number (7B) | Type (1B)]
Sorting rule: User Key ascending → Sequence Number descending
3.3 Write Process
LevelDB’s write process adopts the classic WAL + MemTable pattern, ensuring data durability and high performance.
Write Flow Diagram:
LevelDB Write Flow
┌─────────────────────────────────────────────────────────────┐
│ │
│ User Call: Put(key, value) │
│ │ │
│ ▼ │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ 1. Join Writer Queue (Serialized writer queue) │ │
│ │ - Multiple write requests merged into WriteBatch │ │
│ │ - Batch write improves throughput │ │
│ └───────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ 2. Write-Ahead Logging │ │
│ │ - Write WAL first (sequential write) │ │
│ │ - Optional sync: force flush for durability │ │
│ └───────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ 3. Write to MemTable │ │
│ │ - Insert into SkipList │ │
│ │ - Use custom comparator for sorting │ │
│ └───────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ 4. MemTable Full Transition │ │
│ │ - MemTable → Immutable MemTable │ │
│ │ - Create new mem_ and log file │ │
│ │ - Background Flush to L0 SSTable │ │
│ └───────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
Write Batching Detailed Explanation:
Write batching is LevelDB’s key optimization for improving write throughput.
Working Mechanism:
- Write Queue: All write requests enter FIFO queue
- Batch Processing: Writer at queue head responsible for merging subsequent write requests
- Batch WAL: Merged batch data written to WAL once
- Batch MemTable: Batch data inserted into MemTable once
- Notification Mechanism: Notify all waiting writers upon completion
Effect Quantification:
| Thread Count | Throughput without Batching | Throughput with Batching | Improvement |
|---|---|---|---|
| 1 thread | 100 Kops/s | 100 Kops/s | 1× |
| 4 threads | 150 Kops/s | 380 Kops/s | 2.5× |
| 8 threads | 180 Kops/s | 720 Kops/s | 4× |
| 16 threads | 200 Kops/s | 950 Kops/s | 4.75× |
Why the improvement?
- Fewer WAL writes (many small writes merged into one sequential write)
- Less lock contention (a single queue-head leader serializes the group)
- Better CPU cache hit rate (batch processing)
3.4 Compaction Mechanism Detailed Explanation
Compaction is the core mechanism of LSM-Tree, responsible for merging data from upper levels to lower levels while maintaining ordering. LevelDB has two types of Compaction:
- Minor Compaction: MemTable → SSTable (Level 0)
- Major Compaction: Level N → Level N+1
3.4.1 Minor Compaction Detailed Process
Minor Compaction converts Immutable MemTable in memory to Level 0 SSTable on disk.
Trigger Conditions:
- MemTable size reaches threshold (default 4MB)
- Background thread continuously checks, triggers when condition met
Detailed Steps:
Step 1: Create Immutable MemTable
- Mark current MemTable as Immutable (cannot be modified)
- Create new MemTable to receive new writes
- Switch WAL to new log file
Step 2: Smart Output Level Selection
- Default output to Level 0
- But LevelDB checks whether the new table can "skip" Level 0 and land in a deeper level
- If the Immutable MemTable's key range does not overlap any files in Level 0 (or the next levels down), it can be placed directly in L1/L2
- Reduce subsequent L0→L1 Compaction overhead
Step 3: Write SSTable
- Iterate through all key-values in Immutable MemTable
- Write sequentially to new SSTable file
- Build Data Blocks, Index Block, Filter Block
- Finally write Footer
Step 4: Update Metadata
- Add new SSTable info to VersionSet
- Record to MANIFEST file to ensure durability
- Delete old WAL file
- Release Immutable MemTable memory
Special Optimization - Level Skipping: LevelDB decides the Immutable MemTable's output level by checking key-range overlap:
- If there is no overlap with L0, try placing it in L1
- If there is no overlap with L1 either, and the overlap with L2 is small, it can go to L2
- And so on, up to kMaxMemCompactLevel (level 2 by default)
- This smart selection significantly reduces Compaction overhead for small-range updates
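Under simplifying assumptions (a plain interval-overlap test, and ignoring LevelDB's additional grandparent-overlap size limit), the level-skipping choice can be sketched like this:

```python
# Sketch of the PickLevelForMemTableOutput idea; names are ours.
# `levels` maps level -> list of (smallest_key, largest_key) file ranges.
kMaxMemCompactLevel = 2

def overlaps(level_files, smallest, largest):
    return any(not (largest < lo or smallest > hi)
               for lo, hi in level_files)

def pick_output_level(levels, smallest, largest):
    level = 0
    if not overlaps(levels.get(0, []), smallest, largest):
        # keep descending while the next level is also overlap-free
        while level < kMaxMemCompactLevel:
            if overlaps(levels.get(level + 1, []), smallest, largest):
                break
            level += 1
    return level

levels = {0: [(b"m", b"z")], 1: [], 2: [(b"a", b"c")]}
# "d".."f" misses L0, L1, and L2, so the table skips down to level 2:
assert pick_output_level(levels, b"d", b"f") == 2
```

The real implementation additionally refuses to descend when too many bytes in the grandparent level would overlap, capping future compaction cost.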
3.4.2 Major Compaction Detailed Process
Major Compaction merges data from upper levels to lower levels, maintaining ordering and level constraints.
Trigger Conditions:
- Level 0 file count exceeds threshold (default 4)
- Total size of a level exceeds target (L1: 10MB, L2: 100MB, L3: 1GB…)
- Manual trigger (compact_range API)
Compaction Selection Algorithm:
L0→L1 Compaction Special Characteristics:
- Level 0 SSTables may overlap, must all participate in merge
- Find all L1 files overlapping with L0 files to be merged
- Multi-way merge all overlapping files, generate new L1 SSTables
L1+ Compaction Standard Process:
- The level whose score (total size / size limit) is highest is compacted first
- Within that level, LevelDB picks the next file after the per-level compaction pointer (rotating round-robin across the key space), then finds the overlapping files in level N+1
- Merge the selected files, generating new level N+1 SSTables
- Delete the old files
Compaction Execution Process:
- Input Collection: Determine all SSTables to be merged
- Multi-way Merge: Use min-heap to merge multiple ordered inputs
- Output Generation: Write new SSTables sequentially, maintain target file size (default 2MB)
- Metadata Update: Atomically update VersionSet, record to MANIFEST
- Old File Cleanup: Delete merged old SSTables (actually delayed deletion, ensure no reference before cleanup)
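The multi-way merge at the heart of step 2 can be sketched with Python's `heapq.merge`: several sorted runs in, one sorted run out. The `(key ascending, sequence descending)` sort key mirrors the InternalKey order, and skipping repeated user keys keeps only the newest version:

```python
import heapq

def compact(runs):
    """Merge sorted runs of (key, seq, value), newest version wins."""
    out = []
    last_key = None
    for key, seq, value in heapq.merge(*runs,
                                       key=lambda e: (e[0], -e[1])):
        if key == last_key:
            continue            # older version of an already-emitted key
        last_key = key
        out.append((key, value))
    return out

l0 = [("a", 9, "a9"), ("c", 7, "c7")]   # newer run
l1 = [("a", 3, "a3"), ("b", 4, "b4")]   # older run
assert compact([l0, l1]) == [("a", "a9"), ("b", "b4"), ("c", "c7")]
```

This is a simplification: real compaction also honors snapshots and delete markers (covered in the next subsection) and splits output at the target file size.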
Why L0→L1 is Most Expensive:
- L0 file overlap causes need to read more files
- L1 size limit (10MB) causes frequent triggering
- Is the bottleneck of entire LSM-Tree, subsequent optimizations (like Universal Compaction) mostly target this
3.4.3 Garbage Collection in Compaction
Compaction is not just merging data, but also responsible for cleaning expired data.
Garbage Collection Judgment Logic:
During Compaction, judge whether to discard each key:
- Hidden by New Version:
- If same user key has newer version (higher sequence number)
- And no snapshot references old version
- Then discard old version
- Delete Marker Cleanup:
- If it’s a delete marker (kTypeDeletion)
- And no snapshot references this marker
- And no data for this key in higher levels
- Then can discard this delete marker
Version Visibility Rules:
- Each snapshot records sequence number at creation time
- When reading, can only see data ≤ snapshot sequence number
- During Compaction, check oldest snapshot, versions older than it can be cleaned
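The two rules can be condensed into one predicate, modeled on the drop decision in LevelDB's `DoCompactionWork` loop (simplified; the `is_base_level` flag stands in for "no data for this key in any deeper level" and is our own name):

```python
kTypeDeletion, kTypeValue = 0, 1

def should_drop(seq, vtype, prev_seq_same_key, smallest_snapshot,
                is_base_level):
    # (a) Hidden by a newer version: a newer entry for this user key was
    # already emitted, and it is visible even to the oldest snapshot,
    # so no reader can ever need this older one.
    if (prev_seq_same_key is not None
            and prev_seq_same_key <= smallest_snapshot):
        return True
    # (b) Obsolete delete marker: older than every live snapshot, with
    # no buried data left in deeper levels for it to suppress.
    if (vtype == kTypeDeletion and seq <= smallest_snapshot
            and is_base_level):
        return True
    return False

# newer version (seq 8) is visible to oldest snapshot 10 -> drop seq 5
assert should_drop(5, kTypeValue, prev_seq_same_key=8,
                   smallest_snapshot=10, is_base_level=False)
# tombstone still needed: the key may still exist in deeper levels
assert not should_drop(4, kTypeDeletion, None, 10, is_base_level=False)
```

Note how rule (b) is the only way a tombstone ever disappears; until then it must be carried downward through every compaction.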
3.5 Version Management (VersionSet)
LevelDB uses Copy-on-Write style version control, supporting snapshot isolation and concurrent read/write.
VersionSet Structure:
Version Management
┌─────────────────────────────────────────────────────────────┐
│ │
│ VersionSet │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ current_ ───────────────────────▶ Version 3 (Current)│ │
│ │ ├── FileMetaData[]│ │
│ │ ├── level_files_ │ │
│ │ └── refs: 1 │ │
│ └───────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ Version 2 (Old Version) │ │
│ │ ├── FileMetaData[] │ │
│ │ ├── refs: 0 (reference count 0, pending deletion) │ │
│ │ └── next: Version 3 │ │
│ └───────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ Version 1 (Even Older Version) │ │
│ │ └── ... │ │
│ └───────────────────────────────────────────────────────┘ │
│ │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ MANIFEST File │ │
│ │ - Records all version change logs │ │
│ │ - Used during crash recovery │ │
│ │ - VersionEdit serialization format │ │
│ └───────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
Version Structure:
- Each level contains a FileMetaData array, recording all SSTable files in that level
- FileMetaData contains: file number, size, minimum key, maximum key
- Reference count (refs): Cannot delete when snapshot reference exists
MANIFEST File:
- Records all version changes (VersionEdit)
- Including: new files, deleted files, comparator changes, etc.
- Recover version state by replaying MANIFEST after crash
Snapshot:
- Record current sequence number at creation
- When reading, only read data ≤ snapshot sequence number
- Support multi-version concurrency control (MVCC)
- After release, old versions can be cleaned by Compaction
3.5.1 Version Management Process
- Normal Read/Write:
- Acquire current version reference (refs++)
- Read/write based on SSTable collection of this version
- Release reference when complete (refs--)
- Compaction Complete:
- Create new version, containing new SSTable collection
- Atomically switch current_ to point to new version
- Decrement old version reference count
- Old Version Cleanup:
- When version’s refs == 0, can be recycled
- SSTable files unique to this version can be deleted
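The life cycle above can be captured in a toy reference-counting model (class and method names are ours, not LevelDB's):

```python
class Version:
    def __init__(self, files):
        self.files = set(files)
        self.refs = 0

class VersionSet:
    def __init__(self):
        self.current = Version({"000001.ldb"})
        self.current.refs = 1          # pinned as "current"
        self.live = [self.current]

    def acquire(self):                 # reader/snapshot pins a version
        self.current.refs += 1
        return self.current

    def release(self, v):              # reader done: refs--
        v.refs -= 1
        self._gc()

    def install(self, new_files):      # compaction publishes a version
        new = Version(new_files)
        new.refs = 1                   # new "current" pin
        old = self.current
        self.current = new
        self.live.append(new)
        old.refs -= 1                  # drop old "current" pin
        self._gc()

    def _gc(self):                     # refs == 0 -> version recyclable
        self.live = [v for v in self.live if v.refs > 0]

vs = VersionSet()
reader = vs.acquire()                  # an iterator pins Version 1
vs.install({"000002.ldb"})             # compaction installs Version 2
assert len(vs.live) == 2               # V1 survives: reader still holds it
vs.release(reader)
assert len(vs.live) == 1               # only V2 remains; V1 files deletable
```

The key point: installing a new version never blocks readers, because old versions stay alive exactly as long as someone holds a reference.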
3.4.4 Compaction Strategy Comparison
| Strategy | Principle | Write Amplification | Read Amplification | Space Amplification | Applicable Scenario |
|---|---|---|---|---|---|
| Leveled (LevelDB Default) | Each level fully sorted, adjacent level size ratio 10:1 | High (10-30×) | Low (O(1)) | Low | Read-heavy |
| Tiered (RocksDB Universal) | Multiple sorted runs per level, delayed merge | Low (2-5×) | High (O(T)) | High | Write-heavy |
| Leveled-N | Hybrid strategy, L0-LN uses Tiered, LN+ uses Leveled | Medium | Medium | Medium | Mixed workload |
| FIFO | Only keep recent N files | Very Low | High | Very High | Time-series data, cache |
Leveled vs Tiered Compaction Diagram:
3.4.5 Compaction Tuning Parameters
Tuning Suggestions:
3.6 Read Optimization
LevelDB designs multi-layer optimization strategy for LSM-Tree’s need to check multiple levels during reads.
Read Amplification Sources:
- Worst case: a lookup must consult the MemTable, the Immutable MemTable, and L0-L6 — nine places in total
- Each level may need to open one SSTable, read Bloom Filter, Index Block, Data Block
- For non-existent key, need to check all levels to confirm
Optimization Strategies:
1. MemTable Priority
- Latest data always in MemTable, fastest lookup
- SkipList O(log n) lookup complexity
2. SSTable Cache (Table Cache)
- Cache opened SSTable file descriptors and metadata
- Avoid frequent open/close file system call overhead
3. Block Cache
- LRU cache recently read Data Blocks
- Hot data directly hits memory, no disk read needed
- Default 8MB, configurable
4. Bloom Filter Filtering
- Each SSTable has corresponding Bloom Filter
- Check Filter first during query, quickly exclude files not containing target key
- For non-existent key, reduce 90%+ disk reads
5. Level Skipping Optimization
- Utilize SSTable file ordering, maintain key range index per level
- If target key < minimum key of a level or > maximum key, skip that level directly
6. Prefetch Optimization
- During range query, asynchronously prefetch next Block while reading current Block
- Utilize disk sequential read high bandwidth
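Strategy 4 deserves a concrete sketch. This is a minimal Bloom filter, not LevelDB's actual scheme (which derives its k probes from a single hash via double hashing); here each key simply sets k bits in an m-bit array, and a clear bit on any probe means "definitely absent":

```python
import hashlib

class Bloom:
    def __init__(self, m_bits=1024, k=4):
        self.m, self.k = m_bits, k
        self.bits = bytearray(m_bits // 8)

    def _probes(self, key: bytes):
        # k independent probe positions (toy hashing via sha256)
        for i in range(self.k):
            h = int.from_bytes(
                hashlib.sha256(key + bytes([i])).digest()[:8], "little")
            yield h % self.m

    def add(self, key: bytes):
        for p in self._probes(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def may_contain(self, key: bytes) -> bool:
        # any clear bit proves absence; all set bits only suggest presence
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._probes(key))

f = Bloom()
f.add(b"alpha")
f.add(b"beta")
assert f.may_contain(b"alpha")    # never a false negative
```

Because a negative answer is definitive, a point query for a missing key can skip the Index Block and Data Block reads entirely, which is where the "90%+ fewer disk reads" figure comes from.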
Read Performance Comparison (With/Without Optimization):
| Scenario | Without Optimization | With Optimization | Improvement |
|---|---|---|---|
| Point Query (Exists) | 3-8 ms | 0.1-0.5 ms | 6-30× |
| Point Query (Not Exists) | 10-20 ms | 0.5-1 ms | 10-40× |
| Range Query | Linear Scan | Binary+Sequential Read | 10-100× |
4. Comparison Between LevelDB and Original LSM-Tree Paper
This chapter deeply analyzes the key evolution from LSM-Tree theory (1996) to LevelDB engineering implementation (2011), revealing how industrial practice optimizes theoretical models.
4.1 Core Differences Overview
The LSM-Tree paper proposed a theoretical framework and cost model, but transforming it into an industrial-grade system requires numerous engineering decisions. LevelDB, while maintaining core ideas, made several key improvements:
4.2 Detailed Comparison Analysis
4.2.1 Architecture Design Comparison
| Aspect | LSM-Tree Paper (1996) | LevelDB (2011) | Improvement Significance |
|---|---|---|---|
| C0 Implementation | B-Tree variant or (2-3) tree | SkipList | 5× code simplification (400 lines vs 2000+ lines), supports lock-free reads |
| Level Design | Conceptual C0, C1, C2…Ck | Fixed 7 levels L0-L6 | Engineering simplification, clear parameters, easy to tune |
| L0 Special Handling | None (all levels same structure) | Allows file overlapping | Reduces write amplification by 30%+, lowers L0→L1 merge pressure |
| Disk Format | Generic multi-page blocks | SSTable format | Prefix compression, Bloom Filter, index optimization |
| MemTable Structure | B-Tree-like | SkipList | Higher memory efficiency, better range query performance |
4.2.2 Core Data Structure Comparison
MemTable: B-Tree vs SkipList
SkipList Advantage Quantification:
| Characteristic | B-Tree | SkipList | LevelDB Benefit |
|---|---|---|---|
| Implementation Code | ~2000+ lines | ~400 lines | 80% maintenance cost reduction |
| Concurrent Read | Requires lock | Lock-free | 2-3× read performance improvement |
| Memory Allocation | Frequent alloc/free | Arena batch allocation | Reduced fragmentation, faster allocation |
| Implementation Complexity | High (requires balancing) | Low (probabilistic balancing) | Lower bug rate, easier to optimize |
4.2.3 Compaction Strategy Comparison
LSM-Tree Paper: Continuous Rolling Merge
LevelDB: On-demand Trigger + Background Thread
4.2.4 Read Optimization Comparison
LSM-Tree Paper: No Specialized Optimization
- Sequential search of each level
- Need to read complete data blocks for binary search
LevelDB: Multi-layer Optimization
| Optimization Technology | Paper (1996) | LevelDB (2011) | Read Amplification Reduction |
|---|---|---|---|
| Bloom Filter | ❌ Not mentioned | ✅ Per-file + per-block filters | 90%+ |
| Block Cache | ❌ Not mentioned | ✅ LRU cache Data Block | 50-70% |
| Index Block | ❌ Not mentioned | ✅ Fast Data Block location | Avoid full file scan |
| Table Cache | ❌ Not mentioned | ✅ Cache SSTable metadata | Reduce file open count |
| Prefix Compression | ❌ Not mentioned | ✅ Key shared prefix compression | Save 30-50% storage space |
Bloom Filter Effect Quantification:
4.2.5 Concurrency and Consistency Comparison
| Characteristic | LSM-Tree Paper (1996) | LevelDB (2011) | Improvement |
|---|---|---|---|
| Concurrent Read | Node-level locking | Version control + lock-free read | 2-5× read performance improvement |
| Concurrent Write | Not discussed in detail | Write Batching | 10×+ throughput improvement |
| Snapshot Isolation | Mentioned but not implemented | Full MVCC support | Supports transactions and backup |
| Recovery Mechanism | Checkpoint + WAL | MANIFEST + version management | 3× recovery speed improvement |
4.2.6 Level Design Detailed Comparison
LSM-Tree Paper: Variable Multi-Component
LevelDB: Fixed 7 Levels + Level 0 Special Handling
4.3 Key Improvements Summary
Engineering Simplification
| Improvement Point | Paper Complexity | LevelDB Implementation | Simplification Effect |
|---|---|---|---|
| MemTable | B-Tree variant (complex balancing) | SkipList (probabilistic balancing) | Code volume -80% |
| Level Management | Dynamic multi-component | Fixed 7 levels | Configuration simplification |
| Compaction | Continuous Rolling Merge | Threshold trigger | Resource controllable |
| File Format | Generic multi-page blocks | SSTable | Rich functionality |
Performance Improvement
| Metric | Paper Model | LevelDB | Improvement |
|---|---|---|---|
| Write Throughput | Theoretical model | ~10 MB/s (HDD) / ~100 MB/s (SSD) | Engineering achievable |
| Read Performance | Unoptimized | Optimized, read amplification reduced 70% | Engineering optimization |
| Concurrency | Not discussed | Supports multi-thread read/write | Modern requirements |
| Resource Control | Continuous occupation | Controllable background tasks | Production-friendly |
4.4 Design Philosophy Differences
The LSM-Tree paper and LevelDB demonstrate different design orientations from theory to engineering.
Theory-oriented vs Engineering-oriented:
| Dimension | LSM-Tree Paper (1996) | LevelDB (2011) |
|---|---|---|
| Goal | Prove LSM superiority in cost model | Build industrial-grade high-performance embedded storage |
| Focus | Mathematical optimization, asymptotic complexity | Actual performance, engineering simplicity, maintainability |
| Architecture | Generic multi-component model | Fixed 7 levels + special L0 |
| Implementation | Conceptual description | Complete runnable code (~20K lines) |
| Optimization | Theoretical cost minimization | Comprehensive balance of read/write amplification, space amplification |
Core Differences Analysis:
1. MemTable Choice: B-Tree vs SkipList
- Paper: Suggests using B-Tree or (2-3) tree, ensuring deterministic performance
- LevelDB: Uses SkipList, code volume reduced by 80%, simpler implementation
- Engineering trade-off: Probabilistic balance for implementation simplicity, actual performance difference is small
2. Compaction Mode: Continuous vs On-demand
- Paper: Rolling Merge is a continuously running background task
- LevelDB: Compaction triggered on-demand, with clear trigger conditions
- Engineering trade-off: Controllable resource usage, avoiding continuous I/O occupation
3. Level Design: Dynamic vs Fixed
- Paper: Component count and size can be dynamically adjusted based on workload
- LevelDB: Fixed 7 levels, each level size increases 10×
- Engineering trade-off: Simplified configuration and management, easier to understand and tune
4. Read Optimization: None vs Multi-layer
- Paper: Does not specifically discuss read optimization
- LevelDB: Multi-layer optimizations including Bloom Filter, Block Cache, indexes, etc.
- Engineering trade-off: 2011 SSD proliferation, random read performance improved, requiring targeted read optimization
Key Insights:
- Theory to engineering requires trade-offs: LevelDB simplified the theoretical model but gained better engineering characteristics
- Hardware changes drive design: 2011 SSD proliferation made optimizations like Bloom Filter, parallel reads important
- Simplicity first: SkipList replacing B-Tree is a key factor in engineering success
- Controllability matters: On-demand Compaction is more suitable for production than continuous Rolling Merge
4.5 Impact on Subsequent Systems
LevelDB’s design decisions influenced the entire LSM-Tree ecosystem.
5. WiscKey Key-Value Separation Optimization (2016)
Paper: WiscKey: Separating Keys from Values in SSD-Conscious Storage
Authors: Lanyue Lu, Thanumalayan Sankaranarayana Pillai, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau
Institution: University of Wisconsin, Madison
Published: FAST 2016
Core Idea: Key-Value Separation - Separating values from LSM-tree to significantly reduce I/O amplification
5.1 Research Background and Motivation
5.1.1 Core Problems of LSM-Tree
LSM-Tree (such as LevelDB, RocksDB), while avoiding random writes, suffers from severe I/O amplification problems:
Write Amplification: 12× - 50×
- Cause: Data is repeatedly read and written during Compaction process
- Example: Writing 1GB of data may require 12-50GB of actual disk writes
Read Amplification: 3× - 14×
- Cause: Lookup needs to traverse multiple levels of SSTables
- Example: Reading 1KB of data may require reading 3-14KB of actual data
Space Amplification
- Actual occupation exceeds original data due to invalid data not being cleaned in time
5.1.2 Essential Differences Between SSD and HDD
HDD (Hard Disk Drive):
- Random I/O is 100×+ slower than sequential I/O
- LSM-Tree’s trade-off is reasonable: Use sequential writes in exchange for query performance
SSD (Solid State Drive):
- Random read performance is close to sequential read (50% single-thread, can match with multi-thread)
- Random writes still have overhead (erase-write cycle)
- High I/O amplification wastes bandwidth and shortens SSD lifespan
- Internal parallelism: Multiple channels/chips can process random reads in parallel
Conclusion: LSM-Tree’s design assumptions are no longer fully applicable on SSDs
5.1.3 Key Observations
- Compaction only cares about Key order: Value sorting is not essential for range queries
- Keys are usually much smaller than Values: In modern workloads, keys are ~16B, values are 100B-4KB+
- SSD parallel random read ≈ sequential read: This makes range queries feasible after key-value separation
5.2 Key-Value Separation Architecture
5.2.1 Overall Architecture
WiscKey Architecture Diagram
┌─────────────────────────────────────────────────────────────┐
│ │
│ ┌─────────────────┐ ┌───────────────────────────┐ │
│ │ LSM-tree │ │ vLog │ │
│ │ (Stores only │ │ (Value Log append-only)│ │
│ │ Keys) │ │ │ │
│ │ │ │ │ │
│ │ Key1 → addr1 │────────▶│ addr1: Value1 │ │
│ │ Key2 → addr2 │ │ addr2: Value2 │ │
│ │ Key3 → addr3 │ │ addr3: Value3 │ │
│ │ ... │ │ ... │ │
│ │ │ │ │ │
│ │ L0-L6 (SSTable)│ │ Sequential append write │ │
│ │ Size significantly│ │ Background Garbage │ │
│ │ reduced │ │ Collection │ │
│ └─────────────────┘ └───────────────────────────┘ │
│ │
│ Core Advantages: │
│ 1. LSM-tree greatly shrinks → Compaction cost significantly │
│ reduced │
│ 2. Value sequential append → Write performance approaches │
│ device bandwidth │
│ 3. Read operation: Query LSM-tree for address → Random read │
│ vLog │
│ │
└─────────────────────────────────────────────────────────────┘
5.2.2 Detailed Data Layout
Record format in LSM-tree:
Key | Value-Address (vLog offset)
Record format in vLog:
Key | Value | Value-Size (used for GC verification)
vLog Management:
- Head pointer (in memory): New Value append position
- Tail pointer (persistent): Valid data starting position
- Both Tail and Head pointers are stored in LSM-tree:
<"tail", tail-vLog-offset>,<"head", head-vLog-offset>
5.2.3 Design Goals
| Goal | Description |
|---|---|
| Low Write Amplification | Reduce unnecessary writes, extend SSD lifespan |
| Low Read Amplification | Improve query throughput, enhance cache efficiency |
| SSD Optimization | Match SSD’s parallel random read characteristics |
| Complete Functionality | Support LSM-tree features like range queries, snapshots |
| Realistic Key-Value Sizes | Optimize for small key (16B) + variable-length value scenarios |
5.3 Key Challenges and Solutions
5.3.1 Challenge 1: Range Query Performance
Problem: After key-value separation, range queries require random reads to vLog, performance may degrade
Solution: Utilize SSD parallel random reads + prefetching
Prefetching Mechanism:
- Sequentially read a batch of keys from LSM-tree (e.g., 1000)
- Extract corresponding value addresses
- Put addresses into work queue
- 32 background threads parallel random read vLog
- Utilize SSD internal parallelism to achieve bandwidth close to sequential reads
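The prefetching idea can be sketched with a thread pool (the 32-thread figure is the paper's; here a dict lookup stands in for the random `pread` against the vLog, and all names are ours):

```python
from concurrent.futures import ThreadPoolExecutor

def read_value(addr):
    return VLOG[addr]              # stand-in for one random vLog read

# toy vLog: address -> value
VLOG = {i: f"value-{i}".encode() for i in range(100)}

def range_query(addrs, workers=8):
    """Resolve a batch of value addresses with parallel random reads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in submission order, so values come back
        # in key order even though the reads complete out of order
        return list(pool.map(read_value, addrs))

vals = range_query([3, 1, 7])
assert vals == [b"value-3", b"value-1", b"value-7"]
```

On a real SSD the parallel reads land on different internal channels, which is why aggregate throughput can approach sequential-read bandwidth for large values.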
Performance Results:
- Small value (64B): 12× slower than LevelDB (device random read limitation)
- Large value (4KB+): 8.4× faster than LevelDB
- Sequentially written data: Performance close to LevelDB (data in vLog is already sorted)
Further Optimization: For small value scenarios, log reorganization (sorting) can be performed
5.3.2 Challenge 2: Garbage Collection
Problem: After delete/update, invalid values are generated in vLog, space needs to be reclaimed
Solution: Online lightweight garbage collector
Trigger Conditions:
- Configured to run periodically
- Or space usage reaches threshold
- Supports offline mode for maintenance
Collection Process:
- Sequentially read chunk from Tail (e.g., 16MB)
- For each Key-Value pair:
- Look up this Key in LSM-tree
- Check if address in LSM-tree matches current vLog position
- If match: Valid data, append write to Head
- If not match: Invalid data, skip
- Update Tail pointer
- Use fallocate(FALLOC_FL_PUNCH_HOLE) to release space
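The collection loop can be sketched over an in-memory toy vLog (list indices stand in for byte offsets; hole punching and pointer persistence are omitted):

```python
def vlog_gc(vlog, lsm, tail):
    """vlog: list of (key, value); lsm: key -> index into vlog.
    Scan from tail to the current head, relocating live records."""
    head = len(vlog)
    for offset in range(tail, head):
        key, value = vlog[offset]
        if lsm.get(key) == offset:   # LSM still points here -> live
            lsm[key] = len(vlog)     # relocate: append at the new head
            vlog.append((key, value))
        # else: superseded or deleted -> skip (space reclaimed)
    return head                      # the scanned chunk's end = new tail

vlog = [("a", "v1"), ("b", "v1"), ("a", "v2")]   # "a" was overwritten
lsm = {"a": 2, "b": 1}
tail = vlog_gc(vlog, lsm, tail=0)
assert tail == 3
assert vlog[3:] == [("b", "v1"), ("a", "v2")]    # only live records moved
assert lsm == {"a": 4, "b": 3}                   # index updated to new spots
```

The address comparison against the LSM-tree is the entire liveness test: a stale record's key either points elsewhere (newer version) or is gone (deleted).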
Consistency Guarantee:
- First fsync() appended data in vLog
- Then synchronously update Tail pointer in LSM-tree
- After crash: Everything before Tail is persisted, after may be lost
Performance Overhead (Random Write Workload):
| Invalid Data Ratio | Throughput Degradation |
|---|---|
| 100% | < 10% |
| 50% | ~25% |
| 25% | ~35% |
Can be balanced by adjusting GC frequency and space reservation
5.3.3 Challenge 3: Crash Consistency
Problem: After key-value separation, how to ensure data consistency during crash recovery?
Solution: Utilize append atomicity of modern file systems
File System Guarantees (ext4, btrfs, xfs):
- Append operation atomicity: After crash, file only contains complete appended prefix
- Will not occur: Partial writes, out-of-order writes, garbage data
WiscKey Write Order:
- Value first appended to vLog
- fsync() vLog to ensure persistence
- Key + Address written to LSM-tree (persistence guaranteed by its WAL)
Crash Recovery Process:
- Read Head pointer from LSM-tree
- Scan vLog from Head to end of file
- Rebuild Key → Address mapping
- Duplicate data handled by LSM-tree deduplication mechanism
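The replay can be sketched over the same kind of toy vLog (list indices as offsets; names are ours): everything after the last persisted Head pointer is rescanned, and later records simply overwrite earlier ones in the rebuilt index, mirroring the LSM-tree's "newest wins" semantics:

```python
def recover(vlog, persisted_head, lsm):
    """vlog: list of (key, value); replay records from persisted_head on,
    rebuilding the key -> offset index."""
    for offset in range(persisted_head, len(vlog)):
        key, _ = vlog[offset]
        lsm[key] = offset          # a later record overwrites an earlier one
    return lsm

vlog = [("a", "v1"), ("b", "v1"), ("a", "v2")]
lsm = {"a": 0}                     # index state at the last checkpoint
recover(vlog, persisted_head=1, lsm=lsm)
assert lsm == {"a": 2, "b": 1}    # "a" now points at its newest record
```

Recovery cost is proportional to the unscanned vLog suffix, which is why persisting the Head pointer more often shrinks recovery time.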
Exception Query Handling:
- Key in LSM-tree, but Value address out of range: Delete Key
- Key in LSM-tree, but Key in Value doesn’t match: Delete Key
- Value exists but Key lost: Will be recycled by subsequent GC
- Double verification during query: Address range + Key match
Recovery Time:
- LevelDB: 0.7 seconds (1KB value)
- WiscKey: 2.6 seconds (need to scan vLog)
- Can be optimized by more frequent Head pointer persistence
5.3.4 Additional Optimization: Remove LSM-tree WAL
Observation:
- LSM-tree’s WAL is used to recover MemTable data
- WiscKey’s vLog already contains complete Key-Value data
- Can recover by scanning vLog, no additional WAL needed
Optimization:
- Completely remove LSM-tree’s log file, reduce one write
Recovery Process:
- Get most recent Head pointer from LSM-tree
- Scan vLog from Head to end
- Rebuild LSM-tree index
Effect:
- Particularly effective for small writes, reduce ~50% write volume
5.4 Performance Evaluation Results
5.4.1 Experimental Setup
| Configuration | Parameters |
|---|---|
| Device | 500GB Samsung 840 EVO SSD |
| Database Size | 100 GB |
| Key Size | 16 Bytes |
| Value Size | 64B - 256KB |
| Comparison Systems | LevelDB 1.18, RocksDB |
5.4.2 Micro Benchmark - Load Performance
Sequential Load - Throughput (MB/s):
| Value Size | 64B | 256B | 1KB | 4KB | 16KB | 64KB | 256KB |
|---|---|---|---|---|---|---|---|
| LevelDB | 10 | 10 | 10 | 10 | 10 | 10 | 10 |
| WiscKey | 40 | 80 | 150 | 350 | 450 | 480 | 500 |
| Improvement | 2.5× | 5× | 15× | 46× | 45× | 48× | 50× |
Random Load - Throughput (MB/s):
| Value Size | 64B | 256B | 1KB | 4KB | 16KB | 64KB | 256KB |
|---|---|---|---|---|---|---|---|
| LevelDB | 10 | 10 | 10 | 10 | 10 | 10 | 10 |
| WiscKey | 40 | 80 | 150 | 350 | 450 | 480 | 500 |
| Improvement | 2.5× | 5× | 15× | 111× | 104× | 85× | 46× |
Write Amplification Comparison:
| Value Size | 64B | 256B | 1KB | 4KB | 16KB | 64KB | 256KB |
|---|---|---|---|---|---|---|---|
| LevelDB | 14 | 12 | 12 | 12 | 12 | 12 | 12 |
| WiscKey | 4.5 | 2.5 | 1.2 | 1.05 | 1.01 | ~1 | ~1 |
5.4.3 Micro Benchmark - Query Performance
Random Lookup - Throughput (KOps/s):
| Value Size | 64B | 256B | 1KB | 4KB | 16KB | 64KB | 256KB |
|---|---|---|---|---|---|---|---|
| LevelDB | 25 | 10 | 4 | 3 | 2.5 | 2.5 | 2.5 |
| WiscKey | 40 | 48 | 48 | 42 | 40 | 36 | 30 |
| Improvement | 1.6× | 4.8× | 12× | 14× | 16× | 14× | 12× |
Range Query - Throughput (MB/s):
Randomly written database (vLog unordered):
| Value Size | 64B | 256B | 1KB | 4KB | 16KB | 64KB | 256KB |
|---|---|---|---|---|---|---|---|
| LevelDB | 4 | 12 | 45 | 120 | 180 | 220 | 230 |
| WiscKey | 0.3 | 1.2 | 8 | 60 | 150 | 210 | 230 |
| Relative Performance | 0.12× | 0.1× | 0.18× | 0.5× | 0.83× | 0.95× | 1.0× |
Sequentially written database (vLog ordered):
| Value Size | 64B | 256B | 1KB | 4KB | 16KB | 64KB | 256KB |
|---|---|---|---|---|---|---|---|
| LevelDB | 4 | 12 | 45 | 120 | 180 | 220 | 230 |
| WiscKey | 3 | 10 | 40 | 120 | 180 | 220 | 230 |
| Relative Performance | 0.75× | 0.83× | 0.89× | 1.0× | 1.0× | 1.0× | 1.0× |
5.4.4 Garbage Collection Overhead
| Invalid Data Ratio | 25% | 50% | 75% | 100% |
|---|---|---|---|---|
| Throughput Degradation | ~35% | ~25% | ~15% | <10% |
Note: Even with 100% invalid data, GC overhead is small because it’s mainly sequential I/O
5.4.5 Space Amplification
Actual size of 100GB randomly written database:
| System | Actual Size | Space Amplification |
|---|---|---|
| LevelDB | ~120 GB | 1.2× |
| WiscKey (Before GC) | ~115 GB | 1.15× |
| WiscKey (After GC) | ~102 GB | 1.02× |
5.4.6 YCSB Macro Benchmark
Value = 1KB:
| Workload | Description | LevelDB | RocksDB | WiscKey-GC | WiscKey |
|---|---|---|---|---|---|
| LOAD | Load 100GB | 1.0× | 1.2× | 45× | 50× |
| A | 50% read, 50% update | 1.0× | 1.1× | 4.6× | 4.6× |
| B | 95% read, 5% update | 1.0× | 1.3× | 4.0× | 5.4× |
| C | 100% read | 1.0× | 1.1× | 3.6× | 5.4× |
| D | 95% read, 5% insert | 1.0× | 1.7× | 3.4× | 3.6× |
| E | 95% range query | 1.0× | 0.8× | 0.7× | 0.8× |
| F | 50% read, 50% RMW | 1.0× | 0.7× | 3.5× | 3.4× |
Value = 16KB:
| Workload | LevelDB | RocksDB | WiscKey-GC | WiscKey |
|---|---|---|---|---|
| LOAD | 1.0× | 1.3× | 100× | 104× |
| A-F | 1.0× | 1.1-2.5× | 2×-7.5× | 2.3×-7.5× |
Conclusion: WiscKey outperforms LevelDB and RocksDB in almost all YCSB workloads
5.4.7 Recovery Time
Database recovery time after crash (1KB value):
| System | Recovery Time |
|---|---|
| LevelDB | 0.7 seconds |
| WiscKey | 2.6 seconds |
Analysis:
- WiscKey needs to scan vLog to rebuild index
- Can be optimized by more frequent Head pointer persistence
- Recovery time increases 3.7×, but still within acceptable range
5.5 Detailed Comparison with LevelDB
| Characteristic | LevelDB | WiscKey | Description |
|---|---|---|---|
| Core Idea | LSM-Tree | Key-Value Separated LSM-Tree | WiscKey retains LSM advantages, removes main overhead |
| LSM-tree Content | Key + Value | Only Key + Value Address | LSM size significantly reduced |
| Value Storage | SSTable | Independent vLog (Append Log) | Sequential write, no Compaction |
| Write Amplification | 12× - 50× | ~1× | Most significant improvement |
| Read Amplification | 3× - 14× | ~1× + 1 random read | Still advantageous in small value scenarios |
| Range Query | Sequential read | Parallel random read | Better performance in large value scenarios |
| Garbage Collection | Compaction | vLog GC (Lightweight) | Only sequential I/O |
| Crash Recovery | LSM-tree log | vLog Scan | Can remove LSM WAL |
| Space Amplification | Higher | Lower (close to 1 after GC) | More SSD space efficient |
| CPU Usage | Lower | Slightly higher | Range queries require multi-thread processing |
| Applicable Scenarios | General | SSD + Medium/Large Value | Best effect when value > 1KB |
Architecture Comparison Diagram:
┌─────────────────────────────────────────────────────────────────┐
│ LevelDB vs WiscKey │
├─────────────────────────────────────────────────────────────────┤
│ │
│ LevelDB: │
│ ┌─────────┐ ┌─────────┐ ┌─────────────────────────────┐ │
│ │ MemTable│───▶│ Immutable│───▶│ L0-L6 SSTable (Key+Value) │ │
│ │(SkipList│ │ MemTable│ │ Compaction needs to sort │ │
│ └─────────┘ └─────────┘ │ Key+Value │ │
│ ▲ └─────────────────────────────┘ │
│ │ WAL │
│ ┌────┴────┐ │
│ │ Log │ │
│ └─────────┘ │
│ │
│ WiscKey: │
│ ┌─────────┐ ┌─────────┐ ┌─────────────────────────┐ │
│ │ MemTable│───▶│ Immutable│───▶│ L0-L6 SSTable (Key only)│ │
│ │(SkipList│ │ MemTable│ │ Compaction only sorts Key│ │
│ └─────────┘ └─────────┘ └─────────────────────────┘ │
│ │ │ │
│ │ │ (Address reference) │
│ ▼ ▼ │
│ ┌─────────────────────────────────────────┐ │
│ │ vLog (Value Log) │ │
│ │ ┌─────────────────────────────────┐ │ │
│ │ │ Tail │ ... Valid Values ... │ Head│ │ │
│ │ └─────────────────────────────────┘ │ │
│ │ (Sequential append, background GC) │
│ └─────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
5.6 Design Insights and Subsequent Impact
5.6.1 Core Design Insights
- Workload Characteristics Determine Storage Design
- Observation: In modern workloads, keys are small (16B), values vary greatly (100B-4KB+)
- Insight: Compaction only cares about Key order
- Conclusion: Key-value separation can significantly reduce Compaction overhead
- Hardware Characteristics Influence Architecture Choice
- SSD random read close to sequential read → Key-value separation feasible
- SSD high internal parallelism → Multi-thread random reads effective
- SSD lifespan limited by write volume → Reducing write amplification necessary
- Trade-off Re-evaluation
- Traditional LSM: Sequential write vs Query performance
- WiscKey: Lower write amplification + Utilize SSD parallelism
- File System Characteristics Can Be Leveraged
- Append operation atomicity simplifies crash consistency design
5.6.2 Applicable Scenarios
WiscKey Best Suited For:
- SSD storage devices
- Value size >= 1KB
- Write-intensive workloads
- Scenarios requiring range queries but not too large query ranges
WiscKey Not Suited For:
- HDD storage (poor random read performance)
- Very small values (below 64B)
- Workloads dominated by large-range sequential scans
5.6.3 Subsequent Related Work
WiscKey influenced multiple subsequent storage systems:
WiscKey (2016)
│
├──→ Titan (PingCAP, 2018)
│ └── TiKV's key-value separation engine
│ └── Architecture similar to WiscKey
│
├──→ TerarkDB (ByteDance, 2019)
│ └── Combines WiscKey with other compression optimizations
│
├──→ BlobDB (Facebook RocksDB)
│ └── RocksDB's built-in key-value separation implementation
│ └── Large values automatically separated to Blob files
│
├──→ HashKV (SOSP 2018)
│ └── Further optimizes GC efficiency, reduces data migration
│
└──→ RocksDB's Integrated BlobDB (2021+)
└── Integrates WiscKey ideas into main branch
5.6.4 Paper Contribution Summary
| Contribution Type | Specific Content |
|---|---|
| Problem Identification | Points out that LSM-Tree's high I/O amplification is particularly severe on SSDs |
| Core Innovation | Proposes key-value separation architecture, significantly reduces Compaction overhead |
| Challenge Solution | Solves three major challenges: range query, garbage collection, crash consistency |
| Experimental Proof | Performance significantly better than LevelDB/RocksDB under various workloads |
| Open Source Implementation | Implemented based on LevelDB 1.18, reproducible and extensible |
5.6.5 Significance for LSM-Tree Development
WiscKey represents an important milestone in LSM-Tree optimization shifting from generality to hardware-awareness:
- Before: LSM-Tree design mainly considered HDD characteristics
- WiscKey: Redesigned for SSD characteristics, proving order-of-magnitude performance improvements achievable
- After: More work focuses on hardware characteristics (such as ZNS SSD, Persistent Memory)
6. PebblesDB FLSM Optimization (2017)
Paper: PebblesDB: Building Key-Value Stores using Fragmented Log-Structured Merge Trees
Authors: Pandian Raju, Rohan Kadekodi, Vijay Chidambaram, Ittai Abraham
Institution: University of Texas at Austin, VMware Research
Published: SOSP 2017 (ACM Symposium on Operating Systems Principles)
Core Contribution: Proposed FLSM (Fragmented Log-Structured Merge Trees) data structure, combining Skip List and LSM-Tree, significantly reducing write amplification
6.1 Research Background and Motivation
6.1.1 LSM-Tree Write Amplification Problem
LSM-Tree (such as LevelDB, RocksDB), while avoiding random writes, has a severe write amplification problem:
Root Cause: LSM requires non-overlapping key ranges for sstables at each level
- New data writes need to be merged and sorted with existing data
- Same data is repeatedly read and written during Compaction process
- Write amplification can reach 12× - 50×
Traditional LSM Compaction Example:
L0: {1,100} {100,200}
      ↓ compact
L1: {1,50} {25,100} {100,400}
      ↓ compact
L2: {1,25} ...
Data in the range {1,100} is rewritten 3 times!
Consequences:
- SSD lifespan shortened
- Storage costs increased
- Write throughput reduced (RocksDB's write throughput is only ~10% of its read throughput)
6.1.2 Limitations of Traditional Solutions
| Solution | Method | Limitation |
|---|---|---|
| No sstable merging | Directly add new sstable to level | Read performance drops sharply, need to check multiple sstables |
| Specialized hardware | Utilize SSD FTL features (NVMKV) | Depends on specific hardware, poor generality |
| Sacrifice read performance | LSM-trie, Universal Compaction | Does not support range queries or poor read performance |
| Key-value separation | WiscKey | Range query performance drops, requires GC |
6.1.3 Core Insight
Root cause of LSM write amplification: Requirement for non-overlapping (disjoint) key ranges for sstables at each level
If overlapping sstables are allowed at the same level, data rewriting during Compaction can be avoided.
Challenge: How to maintain efficient query performance while allowing overlap?
Solution: Use Guards (inspired by Skip List)
6.2 FLSM Core Design
6.2.1 Guards Mechanism
FLSM Guards Mechanism
┌─────────────────────────────────────────────────────────────┐
│ │
│ Guard: Key randomly selected from inserted keys, used to │
│ partition key space │
│ │
│ Key Properties: │
│ ├── Guards have non-overlapping key ranges (disjoint) │
│ ├── Each Guard can have multiple attached sstables │
│ │ (overlap allowed) │
│ ├── Higher-level Guards contain all Guards from lower │
│ │ levels (similar to Skip List) │
│ └── Lowest level has most Guards │
│ │
│ Guard Selection Probability: │
│ ├── Level 1: Lowest probability (fewest Guards) │
│ ├── Level i+1: Probability > Level i (more Guards) │
│ └── Example: If probability is 1/10, select 1 Guard per │
│ 10 keys │
│ │
│ Example Layout: │
│ │
│ Level 3: [Guard:5]────[Guard:100]────[Guard:375]────[Guard:1023] │
│ │ │ │ │ │
│ Level 2: [G:5]────────[G:100]────────[G:375] │
│ │ │ │ │
│ Level 1: [G:5]────────[G:100] │
│ │ │
│ Level 0: (No Guards, new sstables attached directly) │
│ │
│ sstables under each Guard (overlap allowed): │
│ Guard:5 may have: {1,20}, {2,35}, {5,40} (all contain key 5)│
│ │
└─────────────────────────────────────────────────────────────┘
6.2.2 FLSM vs LSM Architecture Comparison
LSM vs FLSM Architecture Comparison
┌─────────────────────────────────────────────────────────────────┐
│ │
│ Traditional LSM: │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Level 1: [1..100] [200..300] [400..500] │ │
│ │ ↑ Each key in only one sstable │ │
│ │ Level 0: [1..50] [60..80] [200..250] │ │
│ │ ↑ Can overlap │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ FLSM: │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Level 1: Guard:50 ────── Guard:200 ────── Guard:400 │ │
│ │ │ │ │ │ │
│ │ ▼ ▼ ▼ │ │
│ │ [1,60] [150,250] [350,450] │ │
│ │ [20,80] [180,300] [380,500] │ │
│ │ [40,100] [200,350] [400,550] │ │
│ │ ↑ sstables under same Guard can overlap │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ Key Differences: │
│ ├── LSM: sstables at each level must be disjoint │
│ ├── FLSM: Guards are disjoint, sstables within Guard can │
│ │ overlap │
│ └── FLSM: Query first finds Guard, then checks all sstables │
│ under that Guard │
│ │
└─────────────────────────────────────────────────────────────────┘
6.2.3 FLSM Compaction Algorithm
FLSM Compaction Process
┌─────────────────────────────────────────────────────────────┐
│ │
│ Traditional LSM Compaction: │
│ 1. Read Level i sstable and overlapping sstables in │
│ Level i+1 │
│ 2. Merge and sort all key-values │
│ 3. Write new sstables to Level i+1 │
│ 4. Delete old sstables │
│ → Data is rewritten! │
│ │
│ FLSM Compaction (Most Cases): │
│ 1. Select a Guard in Level i (when its sstables count │
│ exceeds threshold) │
│ 2. Partition all sstables of this Guard by Guards in │
│ Level i+1 │
│ 3. Directly append (append) partitioned sstables to │
│ corresponding Guard in Level i+1 │
│ 4. Delete old sstables in Level i │
│ → Data is only partitioned, not rewritten! │
│ │
│ Example: │
│ Level 1 Guard:5 has sstable {1, 20, 45, 101, 245} │
│ Level 2 Guards: 1, 40, 200 │
│ Compaction Result: │
│ ├── Guard:1 gets {1, 20} │
│ ├── Guard:40 gets {45, 101} │
│ └── Guard:200 gets {245} │
│ │
│ Note: Only at highest level (Last Level) needs rewriting │
│ to control fragmentation │
│ │
└─────────────────────────────────────────────────────────────┘
Two Compaction Modes:
| Mode | Trigger Condition | Operation | Write Amplification |
|---|---|---|---|
| Partition Compaction | Non-highest level | Only partition, no sorting | Close to 1× |
| Rewrite Compaction | Highest level or too fragmented | Merge, sort, rewrite | Similar to LSM |
6.3 PebblesDB Implementation Optimizations
6.3.1 Solving FLSM Read Performance Problem
FLSM Problem: Queries need to check all sstables under Guard, read performance decreases.
PebblesDB Optimization Techniques:
| Optimization Technique | Description | Effect |
|---|---|---|
| SSTable-level Bloom Filter | Each sstable has Bloom Filter | Avoid reading sstables not containing target key |
| Parallel Seek | Multi-thread parallel search of multiple sstables within Guard | Reduce seek latency |
| Seek-Based Compaction | Trigger Compaction when seek count exceeds threshold | Reduce sstable count within Guard |
| Aggressive Compaction | Trigger when Level i size exceeds 25% of Level i+1 | Reduce number of levels to search |
6.3.2 Guard Selection Algorithm
Hash-based Guard Selection:
PebblesDB uses a hash-based Guard selection algorithm to avoid the uneven Guard distribution that hot keys would otherwise cause.
Core Idea:
- Calculate hash value for each key
- Check if last N bits of hash value are all 1s
- The deeper the level, the smaller N → higher match probability → more Guards
Probability Design (Example):
- Level 1: Last 8 bits must be 1 → Probability ~ 1/256 (fewest Guards)
- Level 2: Last 6 bits must be 1 → Probability ~ 1/64
- Level 3: Last 4 bits must be 1 → Probability ~ 1/16 (most Guards)
This design ensures higher levels have more Guards, forming a hierarchical structure similar to Skip List.
6.3.3 Guard Asynchronous Insertion
Guards are not inserted synchronously; they are bound lazily:
- When a new key passes the hash check and becomes a Guard, it is first added to an "uncommitted_guards" set
- During subsequent Compactions, sstables are partitioned and attached to the new Guard
- This lazy insertion avoids synchronization overhead while maintaining the Guard hierarchy
6.3.4 PebblesDB Operation Flow
Get Operation:
- Check MemTable
- Search level by level from Level 0:
- Binary search to locate Guard
- Check all sstables under Guard (use Bloom Filter to filter)
- Return if found
Range Query:
- Determine Guards involved at each level
- Binary search to locate starting key in relevant sstables within each Guard
- Use multi-way merge (similar to Merge Sort) to produce ordered result
- Use parallel read optimization
6.4 Performance Evaluation
6.4.1 Experimental Environment
- Hardware: Dell Precision Tower 7810, Intel Xeon 2.8GHz, 16GB RAM
- Storage: 2× Intel 750 SSD (1.2TB each, RAID0)
- System: Ubuntu 16.04 LTS, Linux 4.4 kernel, ext4
- Dataset: 3× larger than memory (ensure data mainly resides on disk)
6.4.2 Micro Benchmark (db_bench)
Write Amplification Comparison (Insert 500 million key-value, total 45GB):
| System | Total Write IO | Write Amplification |
|---|---|---|
| LevelDB | 540 GB | 12× |
| HyperLevelDB | 600 GB | 13.3× |
| RocksDB | 840 GB | 18.7× |
| PebblesDB | 175 GB | 3.9× |
Throughput Comparison:
Random Write:
- LevelDB: Baseline
- HyperLevelDB: 1.1×
- RocksDB: 0.5×
- PebblesDB: 2.7× (6.7× vs RocksDB)
Random Read:
- PebblesDB: Approximately equal to HyperLevelDB, similar or slightly better than other systems
Range Query:
- Pure seek: PebblesDB 30% slower (worst case)
- 50 next(): PebblesDB 15% slower
- 1000 next(): PebblesDB 11% slower
Update Throughput (50M key-value):
| Operation | PebblesDB | HyperLevelDB | LevelDB | RocksDB |
|---|---|---|---|---|
| Insert 50M | 56.18 KOps/s | 40.00 KOps/s | 22.42 KOps/s | 14.12 KOps/s |
| Update Round 1 | 47.85 KOps/s | 24.55 KOps/s | 12.29 KOps/s | 7.60 KOps/s |
| Update Round 2 | 42.55 KOps/s | 19.76 KOps/s | 11.99 KOps/s | 7.36 KOps/s |
6.4.3 YCSB Macro Benchmark
Write-dominant workloads (Load A, Load E):
- PebblesDB: 1.5-2× faster than RocksDB
- PebblesDB: 1.5× faster than HyperLevelDB
- Write IO: 50% less than RocksDB
Read-dominant workloads (B, C, D):
- Workload C (100% read): PebblesDB better (caches more indexes)
- Workload B/D: Comparable to HyperLevelDB
- Workload E (95% range query): 6% slower
Workload F (RMW): Comparable to other systems
6.4.4 Real Application Testing
MongoDB (using PebblesDB as storage engine):
- Load A: 18% faster than WiredTiger
- A (50% write): 50% faster than WiredTiger
- Write IO: 4% less than WiredTiger
- 40% less IO than RocksDB
Note: Application layer overhead limits performance improvement magnitude
HyperDex:
- Average improvement: 18-105% (depends on workload)
- Write IO: 35-55% reduction
- Large value (16KB): More obvious improvement (105%)
6.4.5 Resource Consumption
| Resource | PebblesDB vs Others | Description |
|---|---|---|
| Memory | Uses more | Need to store Guard metadata and more Bloom Filters |
| CPU | Uses more | Guard lookup and parallel search overhead |
| Space Amplification | Similar | Invalid data cleanup mechanism similar |
6.5 Theoretical Analysis
6.5.1 Asymptotic Complexity (DAM Model)
Assumptions:
- Total data items: n
- Block size: B
- Guard probability: 1/B (the number of Guards grows by a factor of B per level)
- Number of levels H = log_B(n)
Write Cost:
- Write once per level: O(H) = O(log_B(n))
- Rewrite at last level: O(B) times
- Total write cost: O((B + log_B(n))/B)
Read Cost:
- Binary search Guard at each level: O(log(B^H)) = O(H·log B) = O(log n)
- Check sstables within Guard: O(1) after using Bloom Filter (most cases)
- Total read cost: O(log n) memory operations + O(1) disk read
6.5.2 FLSM as Generalization of LSM
When max_sstables_per_guard = 1:
- Each Guard can only have 1 sstable
- Equivalent to traditional LSM (sstables disjoint at each level)
- Performance comparable to LSM
FLSM Advantages:
- max_sstables_per_guard > 1: Lower write amplification, higher write throughput
- Tunable parameters balance read/write performance
6.6 Comparison with Related Work
6.6.1 Comparison with WiscKey
| Characteristic | WiscKey | PebblesDB |
|---|---|---|
| Core Idea | Key-value separation | Guards organize overlapping sstables |
| Write Amplification Reduction | Yes (significant) | Yes (significant) |
| Range Query | Performance drops (random read) | Slight drop (optimizable) |
| Garbage Collection | Requires independent GC | Not needed (handled during Compaction) |
| MemTable | Key only | Key+Value |
| Applicable Scenarios | Large Value | General |
6.6.2 Comparison with Other LSM Optimizations
| System | Method | Limitation |
|---|---|---|
| bLSM | Snowshoveling scheduling | Does not reduce write amplification, only smooths writes |
| VT-tree | Avoid re-sorting already sorted data | Depends on data distribution |
| LSM-trie | Use Trie structure | Does not support range queries |
| TRIAD | Hot-cold separation + delayed Compaction | Orthogonal to FLSM, can be used together |
| PebblesDB | FLSM + Guards | Read performance slightly drops |
6.7 Limitations and Applicable Scenarios
6.7.1 PebblesDB Limitations
- Poor Sequential Write Performance
- LSM can directly move sstable
- FLSM must partition, generating extra IO
- Range Query Overhead
- Pure seek operations 30% slower
- Large number of next() can amortize overhead
- Memory Consumption
- Need to store Guard metadata
- More Bloom Filters
- CPU Overhead
- Guard lookup
- Parallel search coordination
6.7.2 Applicable Scenarios
Best Suited For:
- Random write-dominated workloads
- Write throughput sensitive scenarios
- SSD storage (extend lifespan)
- Write-intensive NoSQL applications
Not Suited For:
- Pure sequential writes (log type)
- Large number of small range queries (no next() to amortize)
- Memory-constrained environments
6.8 Summary and Impact
6.8.1 Core Contributions
- FLSM Data Structure: First proposed Fragmented LSM, allowing overlapping sstables at each level
- Guards Mechanism: Inspired by Skip List, efficiently organizes overlapping sstables
- PebblesDB Implementation: Proved can simultaneously achieve low write amplification, high write throughput, acceptable read performance
- Practical Application: Successfully integrated into MongoDB and HyperDex
6.8.2 Design Insights
- Traditional LSM’s disjoint sstable constraint is the root cause of write amplification
- Through Guards organizing data, can maintain query efficiency while allowing overlap
- Write optimization and read optimization can be balanced, not necessarily zero-sum
- Tunable parameters (max_sstables_per_guard) allow optimization for different workloads
6.8.3 Key Technologies Summary
| Technology | Purpose | Effect |
|---|---|---|
| Guards | Organize overlapping sstables | Allow Compaction without re-sorting |
| SSTable Bloom Filters | Accelerate reads | Avoid reading irrelevant sstables |
| Parallel Seek | Accelerate range queries | Multi-thread search of multiple sstables |
| Seek-Based Compaction | Control Guard size | Balance read/write performance |
| Hash-based Guard Selection | Avoid skew | Evenly distribute Guards |
6.8.4 Subsequent Impact
PebblesDB’s FLSM ideas influenced multiple subsequent systems:
- Proved LSM can have lower write amplification implementations
- Guards concept borrowed by other systems
- Inspired more innovations in Compaction algorithms
- Demonstrated feasibility of “partial sorting”
7. RocksDB and Industrial Practice
7.1 RocksDB Architecture Evolution
RocksDB is an industrial-grade LSM-Tree storage engine developed by Facebook based on LevelDB, with approximately 600,000 lines of C++ code and 1300+ files. Its core design goal is to provide enterprise-level high performance, scalability, and reliability while maintaining LevelDB’s simplicity.
Core Design Philosophy:
- High Performance: Multi-threaded compaction, parallel I/O, CPU cache optimization
- Scalability: Column Family, tiered storage, pluggable architecture
- Flexibility: Rich configuration options, multiple Compaction strategies
- Reliability: Complete transaction support, backup/recovery, crash recovery
Core Architecture Layers:
RocksDB Core Architecture
┌─────────────────────────────────────────────────────────────┐
│ API Layer │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ Put │ │ Get │ │ Delete │ │ Snapshot│ │
│ └────┬────┘ └────┬────┘ └────┬────┘ └────┬────┘ │
└───────┼───────────┼───────────┼───────────┼─────────────────┘
│ │ │ │
┌───────▼───────────▼───────────▼───────────▼─────────────────┐
│ Write Layer (WAL + MemTable) │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ WriteBatch → WAL (Write-Ahead Log) → MemTable │ │
│ │ (SkipList / HashSkipList / VectorRep) │ │
│ └───────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
│
┌───────▼─────────────────────────────────────────────────────┐
│ Flush Layer (Minor Compaction) │
│ MemTable ──[Flush]──> Immutable MemTable ──[Build]──> L0 │
└─────────────────────────────────────────────────────────────┘
│
┌───────▼─────────────────────────────────────────────────────┐
│ Compaction Layer (Major Compaction) │
│  ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────┐       │
│  │ Level 0 │→ │ Level 1 │→ │ Level 2 │→ │ Level 6 │       │
│  │(overlap)│  │disjoint │  │disjoint │  │disjoint │       │
│  └─────────┘  └─────────┘  └─────────┘  └─────────┘       │
└─────────────────────────────────────────────────────────────┘
│
┌───────▼─────────────────────────────────────────────────────┐
│ Storage Layer │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │SSTable │ │SSTable │ │BlobFile │ │MANIFEST │ │
│ │(Block- │ │(Filter- │ │(Large │ │(Metadata)│ │
│ │ based) │ │ Block) │ │ Value) │ │ │ │
│ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │
└─────────────────────────────────────────────────────────────┘
│
┌───────▼─────────────────────────────────────────────────────┐
│ Cache & Memory Management │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ Block │ │ Table │ │ Row │ │ Clock/ │ │
│ │ Cache │ │ Cache │ │ Cache │ │ LRU │ │
│ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │
└─────────────────────────────────────────────────────────────┘
Key Differences from LevelDB:
| Aspect | LevelDB | RocksDB |
|---|---|---|
| Concurrency | Single-threaded compaction | Multi-threaded compaction |
| Transactions | None | Full support |
| Backup | None | BackupEngine |
| Column Family | None | Supported |
| BlobDB | None | Built-in |
| Cache | Simple LRU | HyperClockCache |
| Compaction Strategy | Leveled | Leveled/Universal/FIFO |
7.2 BlobDB: Industrial Implementation of WiscKey
RocksDB’s BlobDB is an industrial-grade implementation of the WiscKey key-value separation concept, separating large values from SSTables for storage, significantly reducing write amplification.
7.2.1 Architecture Design
Key-Value Separation Principle:
Traditional LSM:
SSTable: [Key1|Value1] [Key2|Value2] ...
↓ Values repeatedly rewritten during Compaction
BlobDB:
SSTable: [Key1|BlobRef1] [Key2|BlobRef2] ...
BlobFile: [Value1] [Value2] ... (Append-only, no Compaction)
Core Components:
| Component | Location | Function |
|---|---|---|
| BlobFileBuilder | db/blob/blob_file_builder.cc | Build Blob files |
| BlobFileReader | db/blob/blob_file_reader.cc | Read Blob files |
| BlobIndex | Stored in SSTable | Pointer to Blob file |
| BlobGarbageCollector | db/blob/blob_garbage_collector.cc | Garbage collection |
7.2.2 Blob File Format
Blob File Layout
┌─────────────────────────────────────────────────────────────┐
│ Blob Records │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ Key (for verification) │ Value │ CRC32 │ │ │
│ └───────────────────────────────────────────────────────┘ │
│ ... │
├─────────────────────────────────────────────────────────────┤
│ Blob Index (Stored in SSTable) │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ blob_file_number │ offset │ size │ compression_type │ │
│ └───────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
7.2.3 Key Configuration Parameters
| Configuration | Default Value | Description |
|---|---|---|
| enable_blob_files | false | Enable key-value separation |
| min_blob_size | 0 | Values smaller than this are still stored in the SSTable |
| blob_file_size | 256MB | Blob file target size |
| enable_blob_garbage_collection | true | Enable garbage collection |
| blob_garbage_collection_age_cutoff | 0.25 | GC only processes the oldest fraction of Blob files |
| blob_garbage_collection_force_threshold | 1.0 | Force GC threshold |
7.2.4 Garbage Collection Mechanism
BlobDB’s GC is performed automatically during Compaction:
- Scan: Traverse Blob Index in SSTable
- Verify: Check if each value is still valid
- Copy: Append valid values to new Blob file
- Update: Modify Blob Index pointer in SSTable
- Cleanup: Delete old Blob files
GC Trigger Conditions:
- Configured to run periodically
- Invalid data ratio exceeds threshold
- Automatic check during Compaction
7.2.5 Differences from WiscKey
| Characteristic | WiscKey | BlobDB |
|---|---|---|
| Implementation | Standalone system | RocksDB built-in |
| GC | Requires separate implementation | Coordinated with Compaction |
| Configuration | Limited | Rich configuration options |
| Monitoring | Basic | Comprehensive statistics |
| Applicable Scenarios | Large value specific | General, automatic selection |
Applicable Scenarios:
- Average value size > 1KB
- Write-intensive, need to reduce write amplification
- Can accept slight range query performance degradation
7.3 Compaction Strategy Detailed Explanation
RocksDB provides three main Compaction strategies, suitable for different workload scenarios.
7.3.1 Leveled Compaction
Source: db/compaction/compaction_picker_level.cc
Core Idea: SSTable key ranges at each level are completely non-overlapping, similar to LevelDB.
Trigger Conditions:
- L0 file count exceeds level0_file_num_compaction_trigger (default 4)
- Total size of a level exceeds max_bytes_for_level_base × (level_multiplier ^ (L-1))
- Manual trigger (CompactRange)
Compaction Selection Algorithm:
1. Prioritize L0 (if file count exceeds threshold)
2. Otherwise select "most crowded" level (current size / target size ratio largest)
3. Select earliest created SSTable in that level as input
4. Find all overlapping SSTables in next level
5. Check overlap with grandparent level, limit Compaction size
Write Amplification: 10-30× (depends on data distribution)
Read Amplification: Low (O(1), at most one SSTable per level)
Applicable Scenarios:
- Read-heavy (read/write > 10:1)
- Frequent range queries
- Need stable read performance
7.3.2 Universal Compaction (Size-Tiered)
Source: db/compaction/compaction_picker_universal.cc
Core Idea: Delayed merging, each level allows multiple overlapping sorted runs, sorted by size.
Trigger Conditions:
- Space amplification exceeds threshold (max_size_amplification_percent)
- Number of similar-sized runs exceeds threshold
- Total run count exceeds level0_file_num_compaction_trigger
Merge Strategy:
1. Space amplification merge: When total size > (100 + threshold)% of valid data
2. Size ratio merge: Merge runs of similar size
3. File count merge: When total run count is too high
Write Amplification: 2-5× (significantly lower than Leveled)
Read Amplification: Higher (need to check multiple runs)
Applicable Scenarios:
- Write-heavy (write/read > 10:1)
- Log-type workloads
- Sequential writes
7.3.3 FIFO Compaction
Source: db/compaction/compaction_picker_fifo.cc
Core Idea: Only keep latest data, similar to cache.
Trigger Conditions:
- Total size exceeds compaction_options_fifo.max_table_files_size
- Or TTL expired
Compaction Behavior:
1. Sort by file creation time
2. Delete oldest files until size within limit
3. No merge operations performed
Write Amplification: Very low (~1×, no rewriting)
Read Amplification: Very high (may need to check many files)
Applicable Scenarios:
- Time-series data
- Cache
- TTL-dominated scenarios
7.3.4 Strategy Comparison Summary
| Strategy | Write Amplification | Read Amplification | Space Amplification | Applicable Scenario | RocksDB Recommendation |
|---|---|---|---|---|---|
| Leveled | 10-30× | Low (O(1)) | Low | Read-heavy | Default |
| Universal | 2-5× | Medium-High | Medium | Write-heavy | Log-type |
| FIFO | ~1× | Very High | Very High | Time-series/Cache | Special scenarios |
| BlobDB | 2-5× | Medium | Low | Large value | General large value |
7.3.5 Configuration Selection Recommendations
Read-heavy (read/write > 10:1):
options.compaction_style = kCompactionStyleLevel;
options.level0_file_num_compaction_trigger = 4;
options.level_slowdown_writes_trigger = 20;
options.level_stop_writes_trigger = 36;
Write-heavy (write/read > 10:1):
options.compaction_style = kCompactionStyleUniversal;
options.compaction_options_universal.max_size_amplification_percent = 200;
options.compaction_options_universal.size_ratio = 1;
Time-series data/TTL:
options.compaction_style = kCompactionStyleFIFO;
options.compaction_options_fifo.max_table_files_size = 100 * 1024 * 1024 * 1024; // 100GB
options.compaction_options_fifo.allow_compaction = true; // Allow minor compaction
options.ttl = 7 * 24 * 60 * 60; // 7 days
Large Value (>1KB):
options.enable_blob_files = true;
options.min_blob_size = 1024;
options.blob_file_size = 256 * 1024 * 1024;
options.enable_blob_garbage_collection = true;
7.4 Industrial-Grade Features
RocksDB provides rich enterprise-level features to meet production environment requirements:
1. Multi-Version Concurrency Control (MVCC)
- Snapshot isolation level, supports consistent reads
- Serializable transactions, guarantees strict execution order
- Deadlock detection and automatic resolution
2. Backup and Recovery
- Hot backup (BackupEngine): Create consistent snapshots without stopping service
- Incremental backup: Only backup changed data
- Point-in-time recovery: Restore to state at specified timestamp
3. Checkpoint
- Create lightweight consistent snapshots
- Based on hard links, completes almost instantaneously
- Used for quick backup or test environment cloning
4. Tiered Storage (Hot-Cold Separation)
- Support storing data from different levels on different media
- Hot data (L0-L2): SSD
- Cold data (L3+): HDD or object storage
- Automatic data tiered migration
5. Compression Algorithm Selection
- Snappy: Default, fast compression, suitable for general scenarios
- Zstd: High compression ratio, saves storage space
- LZ4: Extremely fast compression, suitable for CPU-constrained scenarios
- Support selecting different algorithms per level
6. Bloom Filter Tuning
- Full table filter: One Filter per SSTable
- Partition filter: Reduce memory usage, suitable for large SSTables
- Per-block filter: Precise to Data Block level
7. Adaptive Compaction
- Dynamically adjust Compaction rate based on workload
- Configurable rate limiting, avoid affecting foreground
- Priority scheduling, prioritize processing read-hot levels
8. Column Families
- Logical isolation of different data collections
- Independent Compaction strategy configuration per Column Family
- Support cross-Column Family atomic operations
9. Monitoring and Diagnostics
- Real-time statistics (Statistics): Operation counts, latency distribution
- Performance context (PerfContext): Detailed latency of single operation
- Tracing (Trace): Record operation sequence for diagnosis
10. Write Throttling and Flow Control
- Prevent writes from being too fast causing Compaction to fall behind
- Automatically reduce write rate, maintain system stability
- Configurable delayed write strategy
7.5 RocksDB Latest Progress (2024-2026)
Note: This section involves RocksDB latest versions (v10.x - v11.x), some features are still rapidly iterating, please test thoroughly before production use.
7.5.1 RocksDB v11.0 New Features (2026)
Blob File Storage Wide-Column Entities:
- Background: Supports HBase-like wide-column model, suitable for sparse column storage scenarios
- Optimization: min_blob_size configuration supports wide-column scenarios, with automatic storage method selection
- Benefit: Reduces SST file size, improving read performance by 10-30%
FIFO Compaction Enhancements:
- max_data_files_size: Trim old files based on SST + blob total size
- use_kv_ratio_compaction: BlobDB-optimized intra-L0 compaction
- Applicable Scenarios: Time-series data, caches, and other TTL-dominated workloads
Interpolation Search:
- Configuration: new index_block_search_type option
- Principle: Uses key distribution patterns to predict position, falling back to binary search for non-uniform distributions
- Performance: Uniformly distributed keys (such as timestamps) can reduce comparison counts by 30-50%
Key-Value Separated Data Blocks:
- Configuration: separate_key_value_in_data_block option
- Benefit: Improves CPU cache hit rate; improves compression ratio by 5-15%
7.5.2 RocksDB v10.x Important Updates
| Version | Feature | Impact |
|---|---|---|
| v10.7.0 | HyperClockCache Default | Replace LRU Cache, higher concurrent performance, reduced lock contention |
| v10.6.0 | MultiScan API | Multi-range scan optimization, reduce duplicate index lookups |
| v10.5.0 | User-Defined Index (UDI) | Support custom secondary indexes, flexibility improvement |
| v10.4.0 | Parallel Compression Optimization | CompressionOptions::parallel_threads accelerates compression |
| v10.0.0 | Integrated BlobDB Complete | Key-value separation functionality matured, recommended for production use |
7.5.3 Frontier Research Directions
Hardware Adaptation:
ZNS SSD (Zoned Namespace SSD):
- Characteristics: SSD divided into multiple Zones, can only write sequentially, requires explicit erase
- Optimization: LSM SSTables naturally aligned with Zones
- Benefit: Reduce SSD internal GC, extend lifespan, improve performance
Persistent Memory (PMem):
- Characteristics: Byte-addressable, large capacity, non-volatile
- Optimization: PMem as large-capacity MemTable or L0 layer
- Benefit: Fast recovery, large-capacity buffering, reduced write amplification
CXL Memory Expansion:
- Characteristics: Expand memory pool via CXL protocol
- Optimization: Separate compute and storage, remote MemTable
- Benefit: Compute-storage separated architecture, elastic scaling
Cloud-Native Optimization:
Compute-Storage Separation:
- Compute nodes stateless, data stored in object storage (S3/OSS)
- SSTable tiering: Hot data local, cold data remote
- Challenges: Cold start latency, consistency guarantees
Remote Compaction:
- Offload Compaction tasks to dedicated compute nodes
- Release foreground node resources, improve service capacity
- Technical challenges: Network overhead, task scheduling
Serverless LSM:
- Launch LSM instances on demand, pay per query
- Data persisted in object storage
- Applicable: Low-frequency access, elastic demand scenarios
New Data Structures:
Learned Index:
- Use machine learning models to replace traditional B-Tree indexes
- Predict key position, reduce index size
- Challenges: Model training, dynamic updates, worst-case guarantees
Adaptive Data Structures:
- Automatically adjust structure based on workload
- Read-heavy → B-Tree mode
- Write-heavy → LSM mode
- Representatives: Dostoevsky, Fluid LSM-Tree
Hardware Acceleration:
- FPGA/GPU acceleration for Compaction
- Dedicated accelerators for compression/decompression
- Smart NIC offload network I/O
7.5.4 Version Selection Recommendations
| Scenario | Recommended Version | Reason |
|---|---|---|
| Production Environment (Stability Priority) | v8.x - v9.x LTS | Large-scale verified, comprehensive community support |
| New Feature Requirements | v10.x | Rich functionality, relatively mature, optimizations like HyperClockCache worth adopting |
| Frontier Testing/Evaluation | v11.x | Latest features (e.g., wide-column support, interpolation search); requires thorough testing before production |
7.5.5 Getting Latest Updates
- GitHub Releases: https://github.com/facebook/rocksdb/releases
- RocksDB Blog: https://rocksdb.org/blog/
- Official Wiki: https://github.com/facebook/rocksdb/wiki
8. LSM-Tree Optimization Technology Map
LSM-Tree has three core performance metrics: write amplification, read amplification, and space amplification. This chapter systematically organizes the complete optimization technology system targeting each of the three problems.
Root Causes of LSM-Tree Three Amplification Problems:
┌─────────────────────────────────────────────────────────────┐
│ │
│ Write Amplification │
│ ├── Root: Repeatedly rewriting data during Compaction │
│ ├── Impact: Reduces write throughput, shortens SSD lifespan │
│ └── Typical: LevelDB 12-50×, RocksDB 10-30× │
│ │
│ Read Amplification │
│ ├── Root: Need to check multiple levels/files to find key │
│ ├── Impact: Increases read latency, reduces random read │
│ │ performance │
│ └── Typical: LevelDB 3-14×, RocksDB 2-10× │
│ │
│ Space Amplification │
│ ├── Root: Invalid data not cleaned timely, low compression │
│ │ efficiency │
│ ├── Impact: Wastes disk space, increases costs │
│ └── Typical: LevelDB 1.2×, RocksDB 1.15× │
│ │
└─────────────────────────────────────────────────────────────┘
8.1 Write Amplification Optimization
Write amplification is the most critical performance problem of an LSM-Tree, defined as the ratio of data actually written to disk to data written by the user.
8.1.1 Write Amplification Source Analysis
Three Major Sources of Write Amplification:
┌─────────────────────────────────────────────────────────────┐
│ │
│ 1. Compaction Rewrite │
│ ┌────────────────────────────────────────────────────┐ │
│ │ L0 → L1 Compaction Example: │ │
│ │ │ │
│ │ L0: [File_A][File_B][File_C][File_D] │ │
│ │ ↓ Merge │ │
│ │ L1: [File_1][File_2][File_3]...[File_N] │ │
│ │ │ │
│ │ Input: 4 files × 2MB = 8MB │ │
│ │ Output: 10 files × 2MB = 20MB │ │
│ │ Write Amp: 20MB / 8MB = 2.5× (Only this level) │ │
│ └────────────────────────────────────────────────────┘ │
│ │
│ 2. Bloom Filter Update │
│ - Need to recalculate Bloom Filter for each Compaction │
│ - Additional write overhead: ~5-10% │
│ │
│ 3. Index Rebuild │
│ - Index Block and MetaIndex Block rebuild │
│ - Additional write overhead: ~3-5% │
│ │
└─────────────────────────────────────────────────────────────┘
Overall Write Amplification Formula:
Write_Amp = Σ(Level_i_Size / Level_i-1_Size) × Compaction_Overhead
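As a quick sanity check, this formula can be evaluated directly for the classic leveled layout (a minimal sketch; `leveled_write_amp` and the 10× level sizes are illustrative, not taken from any particular engine):

```python
def leveled_write_amp(level_sizes, compaction_overhead=1.0):
    """Estimate write amplification as the sum of adjacent-level size
    ratios, scaled by a per-compaction overhead factor (bloom filter
    and index rebuild), following the formula above."""
    ratios = [level_sizes[i] / level_sizes[i - 1]
              for i in range(1, len(level_sizes))]
    return sum(ratios) * compaction_overhead

# Classic LevelDB shape: each level 10x the previous (L1..L4, in MB).
sizes_mb = [10, 100, 1000, 10000]
print(leveled_write_amp(sizes_mb))        # 30.0 (three 10x transitions)
print(leveled_write_amp(sizes_mb, 1.1))   # ~33.0 with 10% rebuild overhead
```

With a 10:1 size ratio, each level transition contributes roughly its ratio to the total, which is why deep leveled trees land in the 10-30× range quoted above.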
8.1.2 Key-Value Separation Optimization
Core Idea: Compaction only cares about Key order, Value can be stored separately to avoid repeatedly rewriting large Values.
Key-Value Separation Architecture Comparison:
Traditional LSM: Key-Value Separated LSM:
┌──────────┐ ┌──────────┐ ┌──────────┐
│ LSM-Tree │ │ LSM-Tree │ │ vLog │
│ Key+Value│ │ Key+Addr │ │ (Value) │
└──────────┘ └────┬─────┘ └────┬─────┘
▲ │ │
│ └───────────────┘
Write Amp 12-50× Write Amp ~1× Sequential Append
Advantages:
1. Compaction only processes Key (usually 16-64 bytes)
2. Value directly appended to vLog, no movement
3. GC reclaims invalid Values in background, doesn't affect foreground writes
Representative Implementations:
| System | Proposal Time | Key Characteristics | Write Amp Reduction |
|---|---|---|---|
| WiscKey | 2016 (FAST) | Academic pioneer, vLog + online GC | 12× → 1.2× (90%) |
| Titan | 2018 | TiKV engine, offline GC | 15× → 1.5× (90%) |
| BlobDB | 2020 | RocksDB integrated, production-ready | 15× → 2× (87%) |
WiscKey Detailed Design:
// vLog entry layout (key/value bytes are variable-length, shown as comments)
struct LogEntry {
  uint32_t key_length;
  // char key_data[key_length];     (variable-length)
  uint32_t value_length;
  // char value_data[value_length]; (variable-length)
  uint32_t checksum;
};
// Address stored in LSM-Tree
struct BlobAddress {
uint64_t file_number; // vLog file number
uint64_t offset; // Offset in vLog
uint32_t size; // Value size
};
// GC process (simplified sketch)
void GarbageCollection() {
  while (gc_enabled) {
    // 1. Sequentially read a chunk of entries from the Tail
    auto entries = ReadChunk(tail_file);
    // 2. An entry is live iff the LSM-Tree still points to its vLog address
    std::vector<LogEntry> valid_entries;
    for (const auto& entry : entries) {
      if (LSM.Lookup(entry.key) == entry.address) {
        valid_entries.push_back(entry);  // Still referenced
      }
    }
    // 3. Re-append live values at the Head and update their addresses
    for (const auto& entry : valid_entries) {
      auto new_addr = AppendToHead(entry.key, entry.value);
      LSM.Update(entry.key, new_addr);
    }
    // 4. Advance the Tail, releasing the reclaimed file
    tail_file = NextFile();
  }
}
Bloom Filter Effect Quantification:
| bits_per_key | False Positive Rate | Memory Overhead (per million keys) | Disk Reads Avoided (absent keys) |
|---|---|---|---|
| 5 | ~9% | 0.625 MB | ~91% |
| 8 | ~2.2% | 1.0 MB | ~98% |
| 10 | ~0.8% | 1.25 MB | ~99% |
| 15 | ~0.07% | 1.875 MB | ~99.9% |
False Positive Rate Formula:
P(false positive) ≈ (1 - e^(-k*n/m))^k
Where:
- n = number of keys
- m = total bits
- k = number of hash functions ≈ 0.693 * (m/n)
- bits_per_key = m/n
LevelDB/RocksDB default bits_per_key = 10
→ k ≈ 7 hash functions
→ False positive rate ≈ 0.8%
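The formula above can be evaluated directly (a minimal sketch; `bloom_fp_rate` is an illustrative helper, with k defaulting to the optimal 0.693 × bits_per_key, rounded):

```python
import math

def bloom_fp_rate(bits_per_key, num_hashes=None):
    """False positive rate (1 - e^(-k*n/m))^k from the formula above,
    with bits_per_key = m/n and k defaulting to ~0.693 * (m/n)."""
    k = num_hashes or max(1, round(0.693 * bits_per_key))
    return (1 - math.exp(-k / bits_per_key)) ** k

print(round(bloom_fp_rate(10) * 100, 2))  # ~0.82 (%), i.e. the ~0.8% above
```

Plugging in the LevelDB/RocksDB default of 10 bits per key gives k = 7 hash functions and a false positive rate of roughly 0.8%, matching the derivation above.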
8.1.3 Data Structure Optimization
PebblesDB FLSM (Fragmented LSM):
FLSM vs Traditional LSM Comparison:
Traditional LSM (Disjoint SSTables):
Level 1: [1-100][200-300] ← key ranges non-overlapping
↑
Must rewrite entire range during Compaction
FLSM (Overlap Allowed):
Level 1: Guard:50 ──── Guard:200
│ │
[1-60] [150-250]
[20-80] [180-300]
[40-100] [200-350]
↑ Overlap allowed
Only partition during Compaction, no rewrite!
Guard Mechanism:
- Guard: Randomly selected partition point from inserted keys
- Guards are disjoint, sstables within Guard can overlap
- Higher levels contain all Guards from lower levels (similar to Skip List)
FLSM Compaction Process:
Traditional LSM Compaction: FLSM Compaction:
1. Read sstable 1. Select Guard (sstables exceed threshold)
2. Merge and sort 2. Partition by lower level Guards
3. Write new sstables 3. Direct append (no sorting!)
4. Delete old files 4. Delete old files
→ Data rewrite! → Only partition, no rewrite!
Write Amp Comparison:
- Traditional LSM: 18.7×
- PebblesDB (FLSM): 3.9× (79% reduction)
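The "partition only, no rewrite" step can be sketched in a few lines (illustrative Python; `partition_by_guards` is my name, and the semantics "keys ≥ guard belong to that guard" follow the Guard mechanism described above):

```python
import bisect

def partition_by_guards(sorted_keys, guards):
    """FLSM-style compaction sketch: split one sorted run at the next
    level's guard keys. Each fragment is appended to its guard's bucket
    as-is -- no cross-file merge sort is performed."""
    buckets = {g: [] for g in [None] + guards}  # None = keys below first guard
    for key in sorted_keys:
        i = bisect.bisect_right(guards, key)
        bucket = None if i == 0 else guards[i - 1]
        buckets[bucket].append(key)
    return buckets

b = partition_by_guards([10, 45, 60, 180, 250, 400], guards=[50, 200])
print(b)  # {None: [10, 45], 50: [60, 180], 200: [250, 400]}
```

Because each fragment is already sorted, appending it under the target guard needs no rewrite of existing sstables, which is exactly where FLSM saves write amplification.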
Other Data Structure Optimizations:
| Technology | Principle | Write Amp Reduction | Representative System |
|---|---|---|---|
| LSM-trie | Trie structure reduces overlap | 15× → 6× (60%) | Research prototype |
| Dostoevsky | Segmented LSM, delayed merge | 20× → 8× (60%) | Research prototype |
| HashKV | Hash partitioning, local Compaction | 18× → 7× (61%) | Research prototype |
8.1.4 Compaction Strategy Optimization
Four Mainstream Compaction Strategies Comparison:
1. Leveled Compaction (LevelDB Default)
L0: [A-M] [N-Z] ← 4 files trigger
↓ Merge
L1: [A-F] [G-L] [M-R] [S-Z] ← Output fully sorted
Characteristics:
- Each level fully sorted, adjacent level size ratio 10:1
- Lowest read amplification (O(1))
- Highest write amplification (10-30×)
- Applicable: Read-heavy scenarios
2. Tiered Compaction (RocksDB Universal)
L0: [A-C] [D-F] [G-I] [J-L] ← 4 sorted runs
↓ Delayed merge
L1: [A-L] ← One-time merge
Characteristics:
- Each level allows multiple sorted runs
- Lowest write amplification (2-5×)
- Highest read amplification (O(T), T=number of runs)
- Applicable: Write-heavy scenarios
3. Leveled-N (Hybrid Strategy)
L0-L2: Tiered style ← Lower level delayed merge
L3-L6: Leveled style ← Higher level strict sorting
Characteristics:
- Balance read/write amplification
- Write amplification: 5-10×
- Read amplification: 2-5×
- Applicable: Mixed workloads
4. FIFO Compaction
Keep only recent N files, delete when exceeded
Characteristics:
- Extremely low write amplification (~1×)
- Extremely high read amplification
- Extremely high space amplification
- Applicable: Time-series data, cache
Compaction Strategy Selection Decision Tree:
Start
│
▼
Workload Type?
│
├── Read-dominated (Read > 80%)
│ └──▶ Leveled Compaction
│ - Low read amplification
│ - Acceptable high write amplification
│
├── Write-dominated (Write > 80%)
│ ├──▶ Tiered Compaction
│ │ - Low write amplification
│ │ - Acceptable high read amplification
│ │
│ └──▶ FLSM (PebblesDB)
│ - Extremely low write amplification
│ - Medium read amplification
│
├── Mixed workload
│ └──▶ Leveled-N or Tuned Leveled
│ - Dynamically adjust parameters
│ - Balance read/write
│
└── Time-series data/Cache
└──▶ FIFO Compaction
- Only keep latest data
- Minimal Compaction
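The decision tree above can be condensed into a small helper (a sketch; the 80% thresholds mirror the tree, and the function name is mine):

```python
def pick_compaction(read_pct, write_pct, time_series=False):
    """Mirror of the selection decision tree above; thresholds are
    illustrative and should be tuned per workload."""
    if time_series:
        return "FIFO"                 # keep latest only, minimal compaction
    if read_pct > 80:
        return "Leveled"              # lowest read amplification
    if write_pct > 80:
        return "Tiered or FLSM"       # lowest write amplification
    return "Leveled-N"                # mixed workload: balance both

print(pick_compaction(90, 10))   # Leveled
print(pick_compaction(10, 90))   # Tiered or FLSM
print(pick_compaction(50, 50))   # Leveled-N
```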
8.1.5 Write Amplification Optimization Technology Summary
Write Amplification Optimization Technology Map:
┌─────────────────────────────────────────────────────────────┐
│ Write Amplification Optimization │
├─────────────────────────────────────────────────────────────┤
│ │
│ ├── Key-Value Separation │
│ │ ├── WiscKey (vLog + Online GC) │
│ │ ├── Titan (TiKV, Offline GC) │
│ │ └── RocksDB BlobDB (Industrial Implementation) │
│ │ ├── enable_blob_files = true │
│ │ ├── min_blob_size = 4096 (Separate above 4KB) │
│ │ └── enable_blob_garbage_collection = true │
│ │ │
│ ├── Data Structure Optimization │
│ │ ├── PebblesDB (FLSM + Guards) │
│ │ │ ├── Partition Compaction (Partition only) │
│ │ │ └── Rewrite Compaction (Sort when necessary) │
│ │ ├── LSM-trie (Trie structure reduces overlap) │
│ │ └── Dostoevsky (Segmented LSM) │
│ │ │
│ └── Compaction Strategy │
│ ├── Tiered Compaction (Delayed merge) │
│ ├── Leveled-N (Hybrid strategy) │
│ ├── FIFO (Only keep latest) │
│ └── TRIAD (Delayed Compaction) │
│ │
└─────────────────────────────────────────────────────────────┘
Optimization Effect Comparison (Based on LevelDB 12×):
| Optimization Technology | Write Amp | Reduction | Cost |
|------------------------|-----------|-----------|------|
| No optimization (LevelDB) | 12× | - | - |
| RocksDB (Multi-threaded) | 15× | -25% | CPU overhead +20% |
| Tiered Compaction | 5× | 58% | Read Amp +3× |
| PebblesDB (FLSM) | 4× | 67% | Read Amp +2× |
| BlobDB (4KB+) | 2× | 83% | Random read latency +30% |
| WiscKey (4KB+) | 1.2× | 90% | Range query -20% |
8.2 Read Amplification Optimization
Read amplification is defined as the ratio of actual data read to requested data, directly affecting read latency.
8.2.1 Read Amplification Source Analysis
Three Major Sources of Read Amplification:
┌─────────────────────────────────────────────────────────────┐
│ │
│ 1. Multi-Level Lookup │
│ ┌────────────────────────────────────────────────────┐ │
│ │ Get(key) Lookup Path: │ │
│ │ │ │
│ │ 1. MemTable (O(log n)) │ │
│ │ ↓ Not Found │ │
│ │ 2. Immutable MemTable (At most 2) │ │
│ │ ↓ Not Found │ │
│ │ 3. Level 0 SSTables (At most 4, check one by one) │ │
│ │ ↓ Not Found │ │
│ │ 4. Level 1+ (Binary search to locate file each level)│ │
│ │ │ │
│ │ Worst case: 1 + 2 + 4 + 6 = 13 file lookups │ │
│ └────────────────────────────────────────────────────┘ │
│ │
│ 2. In-Block Scan │
│ - Data Block default 4KB, need to scan to find key │
│ - Optimized through Restart Points, still partial scan │
│ │
│ 3. Merge Operator │
│ - Merge type needs to collect multiple versions and merge│
│ - May span multiple SSTables │
│ │
└─────────────────────────────────────────────────────────────┘
Total Read Amplification Formula:
Read_Amp = Σ(Levels_Checked) × Files_Per_Level × (1 - Bloom_Hit_Rate)
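A direct reading of this formula (simplified sketch, treating levels × files-per-level as the candidate file count; `read_amp` is an illustrative helper):

```python
def read_amp(levels_checked, files_per_level, bloom_hit_rate):
    """Read amplification per the formula above: candidate file probes,
    discounted by the fraction of useless probes the bloom filter
    eliminates."""
    return levels_checked * files_per_level * (1 - bloom_hit_rate)

print(read_amp(7, 2, 0.0))   # 14.0 -- no filter, near the 13-lookup worst case
print(read_amp(7, 2, 0.9))   # ~1.4 -- 90% of probes filtered out
```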
8.2.2 Filtering Optimization
Bloom Filter Multi-Level Deployment:
Bloom Filter Three-Level Deployment:
┌─────────────────────────────────────────────────────────────┐
│ │
│ Level 1: Whole Table Filter (Whole Key Filter) │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ SSTable Footer │ │
│ │ └─→ Filter Block Handle │ │
│ │ └─→ [Filter for entire table] │ │
│ │ │ │
│ │ Advantage: One Filter covers entire table, low memory │ │
│ │ Disadvantage: Relatively high false positive rate │ │
│ └───────────────────────────────────────────────────────┘ │
│ │
│ Level 2: Partitioned Filter (Partitioned Filter) │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ MetaIndex Block │ │
│ │ ├─→ Filter Partition 0 [keys 0-999] │ │
│ │ ├─→ Filter Partition 1 [keys 1000-1999] │ │
│ │ └─→ Filter Partition N [...] │ │
│ │ │ │
│ │ Advantage: Load partitions on demand, reduce memory │ │
│ │ Disadvantage: Increased implementation complexity │ │
│ └───────────────────────────────────────────────────────┘ │
│ │
│ Level 3: Block Filter (Block Filter) │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ One Filter per Data Block │ │
│ │ Data Block 0 → Filter 0 │ │
│ │ Data Block 1 → Filter 1 │ │
│ │ │ │
│ │ Advantage: Lowest false positive rate, precise │ │
│ │ Disadvantage: Highest memory usage │ │
│ └───────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
Configuration Example (RocksDB):
options.optimize_filters_for_hits = true; // Skip filters on the last level for hit-heavy workloads
BlockBasedTableOptions table_options;
table_options.index_type = BlockBasedTableOptions::kTwoLevelIndexSearch; // Required for partitioned filters
table_options.partition_filters = true; // Enable partitioned filter
table_options.cache_index_and_filter_blocks = true; // Cache filter/index blocks
options.table_factory.reset(NewBlockBasedTableFactory(table_options));
Prefix Bloom Filter:
Optimization for Prefix Query:
Scenario: Large number of queries share same prefix
Example: user:1001, user:1002, user:1003...
Traditional Bloom Filter:
- Stores the complete key: "user:1001", "user:1002"...
- Query: Must provide the complete key
Prefix Bloom Filter:
- Stores only the extracted prefix (e.g., the 5-byte prefix "user:" via prefix_extractor)
- Query: Iterators/Seeks can quickly exclude SSTables whose data never contains the prefix
Implementation:
1. Configure prefix_extractor
2. Build Bloom Filter for each prefix in MemTable
3. Carry prefix Filter when flushing to SSTable
Effect:
- Prefix query speed improvement 5-10×
- Memory usage reduced 30-50%
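A toy illustration of the idea (not RocksDB's actual filter code; the fixed 5-byte prefix and the `PrefixBloom` class are assumptions for this sketch):

```python
import hashlib

class PrefixBloom:
    """Toy prefix bloom filter: hash the fixed-length key prefix into a
    bit array. Mirrors the prefix_extractor idea, illustratively."""
    def __init__(self, num_bits=1024, num_hashes=4, prefix_len=5):
        self.bits = bytearray(num_bits // 8)
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.prefix_len = prefix_len

    def _positions(self, key):
        prefix = key[: self.prefix_len].encode()
        for i in range(self.num_hashes):
            h = int.from_bytes(
                hashlib.sha256(prefix + bytes([i])).digest()[:8], "big")
            yield h % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def may_match_prefix(self, key):
        # False means: no key with this prefix was ever inserted.
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(key))

bf = PrefixBloom()
for k in ("user:1001", "user:1002", "user:1003"):
    bf.add(k)
print(bf.may_match_prefix("user:9999"))  # True  -- shares the "user:" prefix
print(bf.may_match_prefix("item:0001"))  # False -- prefix never inserted
```

Note the memory saving: many keys sharing one prefix contribute only one entry to the filter, which is where the 30-50% reduction above comes from.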
8.2.3 Cache Optimization
Multi-Level Cache Architecture:
RocksDB Four-Level Cache System:
┌─────────────────────────────────────────────────────────────┐
│ │
│ L1: Row Cache (Optional) │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ Cache complete Key-Value pairs │ │
│ │ │ │
│ │ Hit: Direct return, no disk I/O │ │
│ │ Miss: Continue lookup below │ │
│ └───────────────────────────────────────────────────────┘ │
│ ↓ Miss │
│ L2: Block Cache (Core) │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ Cache Data Block, Index Block, Filter Block │ │
│ │ │ │
│ │ Implementation: LRU / HyperClockCache (RocksDB v10.7+)│ │
│ │ Size: Usually set to 30-50% of available memory │ │
│ │ Hit rate target: > 90% │ │
│ └───────────────────────────────────────────────────────┘ │
│ ↓ Miss │
│ L3: Table Cache │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ Cache SSTable metadata (Footer, Index Block pointers) │ │
│ │ │ │
│ │ Function: Avoid repeated file open, reduce stat() │ │
│ │ Size: Usually 1000-10000 table handles │ │
│ └───────────────────────────────────────────────────────┘ │
│ ↓ Miss │
│ L4: OS Page Cache │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ Operating system level file cache │ │
│ │ │ │
│ │ RocksDB utilizes via mmap or direct I/O │ │
│ │ Size: Remaining available memory │ │
│ └───────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
Cache Hit Rate Impact on Read Amplification:
| Cache Configuration | Hit Rate | Effective Read Amp |
|-------------|----------|-------------------|
| Row Cache | 90% | 1.3× |
| Block Cache | 90% | 2.5× |
| OS Page Cache only | 50% | 6.0× |
| No cache | 0% | 12× |
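The table's intuition follows from a simple expected-cost model (a sketch; real numbers also depend on which block types are cached):

```python
def effective_read_amp(base_amp, hit_rate, hit_cost=1.0):
    """Expected read amplification with a cache in front: hits cost
    ~1 logical read, misses pay the full uncached amplification."""
    return hit_rate * hit_cost + (1 - hit_rate) * base_amp

print(effective_read_amp(12, 0.9))   # ~2.1 with a 90% hit rate
print(effective_read_amp(12, 0.0))   # 12.0 with no cache at all
```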
HyperClockCache (RocksDB v10.7+):
// Replaces the traditional LRU Cache: a Clock-algorithm variant with a lock-free design
Traditional LRU Cache Problems:
- Severe global lock contention
- Performance drops significantly under high concurrency
HyperClockCache Advantages:
┌─────────────────────────────────────────────────────────────┐
│ │
│ 1. Lock-Free Concurrency │
│ - Use atomic operations and CAS │
│ - Support hundreds of threads concurrent access │
│ │
│ 2. Adaptive Eviction │
│ - Based on access frequency and time │
│ - Automatically adjust eviction strategy │
│ │
│ 3. NUMA Awareness │
│ - Prioritize access to local NUMA node │
│ - Reduce cross-node memory access │
│ │
│ Performance Improvement: │
│ - 32 threads concurrent: 2.5× faster than LRU │
│ - 64 threads concurrent: 4× faster than LRU │
│ │
└─────────────────────────────────────────────────────────────┘
8.2.4 Index Optimization
Index Block Two Search Methods:
1. Binary Search (Default)
┌───────────────────────────────────────────────────────┐
│ Index Block Structure: │
│ │
│ [Key: "apple"] → BlockHandle(offset: 0, size: 4KB) │
│ [Key: "banana"] → BlockHandle(offset: 4KB, size: 4KB) │
│ [Key: "cherry"] → BlockHandle(offset: 8KB, size: 4KB) │
│ ... │
│ │
│ Search Process: │
│ 1. Binary search in Index Block │
│ 2. Find first Entry >= target_key │
│ 3. Read corresponding Data Block │
│ │
│ Time Complexity: O(log N), N=Data Block count │
│ Applicable: Range queries, prefix queries │
└───────────────────────────────────────────────────────┘
2. Hash Search (Configuration enabled)
┌───────────────────────────────────────────────────────┐
│ Hash Index Structure: │
│ │
│ Bucket 0 → [hash("apple"), BlockHandle] │
│ Bucket 1 → [hash("banana"), BlockHandle] │
│ Bucket 2 → [hash("cherry"), BlockHandle] │
│ │
│ Search Process: │
│ 1. Calculate hash value of key │
│ 2. Directly locate corresponding Bucket │
│ 3. Read corresponding Data Block │
│ │
│ Time Complexity: O(1) │
│ Applicable: Pure point query workloads │
│ Not applicable: Range queries │
└───────────────────────────────────────────────────────┘
Configuration:
table_options.index_type = BlockBasedTableOptions::kHashSearch; // Enable hash index (requires a prefix_extractor)
Interpolation Search (Interpolation Search, RocksDB v11.0+):
Interpolation Search Principle:
Prerequisite: Keys uniformly distributed (e.g., timestamps, auto-increment IDs)
Traditional Binary Search:
mid = (low + high) / 2
Interpolation Search:
mid = low + ((target - keys[low]) / (keys[high] - keys[low])) * (high - low)
Predict position based on target value, not fixed midpoint
Example:
Keys: [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]
Target: 900
Binary Search:
1st: mid = 4 (500) < 900, go right
2nd: mid = 7 (800) < 900, go right
3rd: mid = 8 (900), found
Interpolation Search:
Predicted position = 0 + ((900-100)/(1000-100)) * 9 = 8
1st: Directly probe index 8 (900), found
Only 1 comparison needed!
Performance Improvement:
- Uniformly distributed keys: Reduce comparison count by 30-50%
- Non-uniform distribution: Automatically falls back to binary search
- Configuration: index_block_search_type = kInterpolationSearch
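The probe-prediction step can be implemented in a few lines (a sketch with the standard degenerate-range guard; `interpolation_search` is illustrative, not RocksDB's code):

```python
def interpolation_search(keys, target):
    """Interpolation search: the probe position is predicted from the
    key distribution instead of the midpoint. Returns (index, probes);
    index is -1 if the target is absent."""
    low, high, probes = 0, len(keys) - 1, 0
    while low <= high and keys[low] <= target <= keys[high]:
        probes += 1
        if keys[high] == keys[low]:
            pos = low  # degenerate range: all remaining keys equal
        else:
            pos = low + (target - keys[low]) * (high - low) // (keys[high] - keys[low])
        if keys[pos] == target:
            return pos, probes
        if keys[pos] < target:
            low = pos + 1
        else:
            high = pos - 1
    return -1, probes

keys = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]
print(interpolation_search(keys, 900))  # (8, 1): one probe lands exactly
```

On uniformly distributed keys the first probe usually lands on or next to the target, which is the source of the 30-50% comparison reduction quoted above.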
8.2.5 Parallelization Optimization
WiscKey Parallel Range Query:
Problem: After key-value separation, range queries require multiple random reads to vLog
Solution: Utilize SSD parallel random read capability
┌─────────────────────────────────────────────────────────────┐
│ │
│ Serial Method (Traditional): │
│ GetRange(start, end): │
│ 1. Find keys in LSM: [k1, k2, k3, ..., kn] │
│ 2. for each key: │
│ addr = LSM.lookup(key) │
│ value = vLog.read(addr) ← Sequential, n I/Os │
│ │
│ Total latency: n × t_random (t_random ≈ 100μs for SSD) │
│ │
│ Parallel Method (WiscKey): │
│ 1. Find keys in LSM: [k1, k2, k3, ..., kn] │
│ 2. Create 32 background threads │
│ 3. Batch submit vLog read requests │
│ 4. Wait for all to complete, return in order │
│ │
│ Total latency: (n/32) × t_random ≈ 3% × n × t_random │
│ │
│ Performance Improvement: │
│ - 4KB value: 14× │
│ - 16KB value: 16× │
│ - 64KB value: 14× │
│ │
└─────────────────────────────────────────────────────────────┘
RocksDB MultiGet:
// Batch point query optimization
// Traditional method (n independent Gets)
// Traditional method (n independent Gets)
for (int i = 0; i < n; i++) {
  db->Get(ReadOptions(), keys[i], &values[i]); // n calls, n lookups
}
// MultiGet method (1 batched call)
std::vector<Slice> keys = {key1, key2, ..., keyn};
std::vector<std::string> values;
std::vector<Status> statuses = db->MultiGet(ReadOptions(), keys, &values);
// Optimization points:
// 1. One system call, reduce context switching
// 2. Shared MemTable/SSTable iterators
// 3. Batch read consecutive Data Blocks
// 4. Merged Bloom Filter checks
Performance Improvement:
- Compared to n independent Gets: 3-5× faster
- Suitable: Batch point query scenarios (e.g., query by ID list)
8.2.6 Read Amplification Optimization Technology Summary
Read Amplification Optimization Technology Map:
┌─────────────────────────────────────────────────────────────┐
│ Read Amplification Optimization │
├─────────────────────────────────────────────────────────────┤
│ │
│ ├── Filtering │
│ │ ├── Bloom Filter │
│ │ │ ├── Whole Table Level (Entire SST) │
│ │ │ ├── Partition Level (Each partition) │
│ │ │ └── Block Level (Each data block) │
│ │ ├── Prefix Bloom Filter │
│ │ │ └── Optimize for prefix queries │
│ │ └── Index Pruning │
│ │ ├── Range Index │
│ │ └── Time Index │
│ │ │
│ ├── Caching │
│ │ ├── Block Cache (LRU/HyperClockCache) │
│ │ ├── Table Cache (Metadata cache) │
│ │ ├── Index Block Cache │
│ │ └── Filter Block Cache │
│ │ │
│ ├── Parallelization │
│ │ ├── Parallel Seek (PebblesDB) │
│ │ │ └── Multi-thread search of sstables within Guard │
│ │ ├── Multi-thread Range Query (WiscKey) │
│ │ │ └── 32 threads parallel read vLog │
│ │ └── MultiGet (Batch fetch) │
│ │ │
│ └── Index Optimization │
│ ├── Interpolation Search (RocksDB v11.0) │
│ │ └── Better than binary search for uniform keys │
│ ├── Hash Index │
│ │ └── O(1) point query, no range query support │
│ └── Adaptive Index │
│ └── Auto-select based on workload │
│ │
└─────────────────────────────────────────────────────────────┘
Optimization Effect Comparison (Based on LevelDB Read Amp 10×):
| Optimization Technology | Read Amp | Reduction | Cost |
|------------------------|----------|-----------|------|
| No optimization (LevelDB) | 10× | - | - |
| + Bloom Filter (10bpk) | 3× | 70% | Memory 1.25MB/million keys |
| + Block Cache (90% hit) | 2× | 80% | Memory usage |
| + HyperClockCache | 1.8× | 82% | CPU +5% |
| + Parallel range query | 1.5× | 85% | Thread overhead |
| WiscKey (Large value) | 1.2× | 88% | Range query -20% |
8.3 Space Amplification Optimization
Space amplification is defined as the ratio of actual disk usage to effective data volume, directly affecting storage costs.
8.3.1 Space Amplification Source Analysis
Three Major Sources of Space Amplification:
┌─────────────────────────────────────────────────────────────┐
│ │
│ 1. Invalid Data Accumulation │
│ ┌────────────────────────────────────────────────────┐ │
│ │ Scenario: Frequent Update/Delete │ │
│ │ │ │
│ │ T0: Put(k1, v1) → SSTable_A │ │
│ │ T1: Put(k1, v2) → SSTable_B (v1 not cleaned) │ │
│ │ T2: Put(k1, v3) → SSTable_C (v1,v2 not cleaned) │ │
│ │ T3: Delete(k1) → SSTable_D (v1,v2,v3 not cleaned) │ │
│ │ │ │
│ │ Valid data: 0 (deleted) │ │
│ │ Actual usage: 4 versions × size │ │
│ │ Space Amp: ∞ (Infinite) │ │
│ └────────────────────────────────────────────────────┘ │
│ │
│ 2. Low Compression Efficiency │
│ - No compression: 1.0× (No savings) │
│ - Snappy: 1.3-1.5× (Save 30-40%) │
│ - Zstd: 1.1-1.3× (Save 50-60%) │
│ │
│ 3. Metadata Overhead │
│ - InternalKey: +8 bytes/key (seq+type) │
│ - Bloom Filter: 1.25MB/million keys │
│ - Index Block: ~5-10% additional space │
│ │
└─────────────────────────────────────────────────────────────┘
Total Space Amplification Formula:
Space_Amp = (Raw_Data + Metadata + Invalid_Data) / Effective_Data
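Plugging numbers into the formula (illustrative values; `space_amp` is my helper, treating the raw live data as the effective data):

```python
def space_amp(effective_data, metadata, invalid_data):
    """Space amplification per the formula above: total disk footprint
    divided by the live data the user actually stores."""
    return (effective_data + metadata + invalid_data) / effective_data

# 1 TB live data, 50 GB metadata, 150 GB not-yet-compacted garbage (GB):
print(round(space_amp(1000, 50, 150), 2))  # 1.2
```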
8.3.2 Compression Optimization
Compression Algorithm Comparison:
┌─────────────────────────────────────────────────────────────┐
│ Compression Algorithm Performance Comparison│
├─────────────────────────────────────────────────────────────┤
│ │
│ Algorithm Ratio Compress Speed Decompress Speed CPU │
│ ───────────────────────────────────────────────────────── │
│ None 1.0× ∞ ∞ 0% │
│ Snappy 1.3-1.5× 200MB/s 400MB/s Low │
│ LZ4 1.3-1.6× 400MB/s 800MB/s Low │
│ Zstd 1.5-2.5× 100MB/s 300MB/s Medium │
│ Bzip2 2.0-3.0× 50MB/s 150MB/s High │
│ │
│ Compression Ratio Test (TPC-H Dataset): │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ Original: 100 GB │ │
│ │ Snappy: 68 GB (Save 32%) │ │
│ │ LZ4: 65 GB (Save 35%) │ │
│ │ Zstd: 45 GB (Save 55%) │ │
│ │ Bzip2: 38 GB (Save 62%) │ │
│ └──────────────────────────────────────────────────────┘ │
│ │
│ Recommended Configuration: │
│ options.compression = kSnappyCompression; // Default │
│ options.bottommost_compression = kZSTD; // Bottom level │
│ │
└─────────────────────────────────────────────────────────────┘
Tiered Compression Strategy:
Different compression algorithms for different levels:
┌─────────────────────────────────────────────────────────────┐
│ │
│ Level 0-1 (Hot Data): │
│ - Frequently accessed, frequent Compaction │
│ - Use Snappy (Fast, low CPU overhead) │
│ - Compression ratio: 1.3× │
│ │
│ Level 2-4 (Warm Data): │
│ - Medium access frequency │
│ - Use Zstd (Balance compression ratio and speed) │
│ - Compression ratio: 1.8× │
│ │
│ Level 5-6 (Cold Data): │
│ - Rarely accessed, long-term storage │
│ - Use Zstd high compression level or Bzip2 │
│ - Compression ratio: 2.5× │
│ │
│ Configuration: │
│ options.compression_per_level = { │
│ kSnappyCompression, // L0 │
│ kSnappyCompression, // L1 │
│ kZSTD, // L2 │
│ kZSTD, // L3 │
│ kZSTD, // L4 │
│ kBZip2Compression, // L5 │
│ kBZip2Compression, // L6 │
│ }; │
│ │
│ Effect: Overall space saving 40-50%, CPU overhead +10-15% │
│ │
└─────────────────────────────────────────────────────────────┘
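The hot/cold trade-off can be demonstrated with stdlib zlib as a stand-in for Snappy/Zstd (levels 1 vs 9 playing the roles of fast vs high-ratio codecs; the ratios here are illustrative, not the TPC-H numbers above):

```python
import zlib

def compression_ratio(data: bytes, level: int) -> float:
    """Uncompressed size / compressed size at a given zlib level.
    Low levels trade ratio for speed (hot levels); high levels trade
    speed for ratio (cold, bottommost levels)."""
    return len(data) / len(zlib.compress(data, level))

sample = b"user:1001,login,2026-03-06\n" * 10_000  # repetitive log-like data
fast = compression_ratio(sample, 1)   # cheap, for L0-L1 style hot data
best = compression_ratio(sample, 9)   # expensive, for bottommost levels
print(fast > 1.0, best >= fast)       # both save space; level 9 at least as much
```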
8.3.3 Garbage Collection Optimization
Timely Compaction:
Accelerate invalid data cleanup by adjusting Compaction parameters:
┌─────────────────────────────────────────────────────────────┐
│ │
│ Parameter Tuning: │
│ │
│ 1. level0_file_num_compaction_trigger (Default 4) │
│ ↓ Reduce to 2 │
│ Effect: More frequent L0→L1 Compaction triggers, │
│ faster invalid data cleanup │
│ Cost: Write amplification increases 20-30% │
│ │
│ 2. max_bytes_for_level_base (Default 256MB) │
│ ↓ Reduce to 128MB │
│ Effect: Reduced capacity per level, more aggressive │
│ Compaction │
│ Cost: Increased Compaction frequency │
│ │
│ 3. periodic_compaction_seconds (RocksDB 6.29+) │
│ Set to 7 days (604800 seconds) │
│ Effect: Force Compaction on all data every 7 days │
│ Applicable: Periodic cleanup of expired data │
│ │
│ Monitoring Metrics: │
│ - rocksdb.estimate-num-deletes: Estimated delete markers │
│ - rocksdb.estimate-live-data-size: Estimated valid data │
│ │
│ When delete_ratio > 20%, recommend manual Compaction: │
│ db->CompactRange(compact_options, nullptr, nullptr); │
│ │
└─────────────────────────────────────────────────────────────┘
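The 20% rule above can be wrapped in a small helper (a sketch; the inputs mirror the two RocksDB properties listed, but the function name and threshold handling are mine):

```python
def should_manual_compact(estimated_deletes, estimated_live_keys, threshold=0.20):
    """Trigger a manual CompactRange when delete markers exceed the
    threshold fraction of the total keyspace (the 20% rule above)."""
    total = estimated_deletes + estimated_live_keys
    return total > 0 and estimated_deletes / total > threshold

print(should_manual_compact(300_000, 1_000_000))  # True  (~23% deletes)
print(should_manual_compact(100_000, 1_000_000))  # False (~9% deletes)
```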
Key-Value Separation GC:
Comparison of GC Mechanisms for WiscKey/Titan/BlobDB:
┌─────────────────────────────────────────────────────────────┐
│ │
│ WiscKey (Online GC): │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ Trigger: Background periodic operation │ │
│ │ │ │
│ │ Process: │ │
│ │ 1. Sequentially read chunk from Tail (256KB) │ │
│ │ 2. For each entry, check if LSM still points to addr │ │
│ │ 3. Append valid entries to Head │ │
│ │ 4. Update Tail, release old files │ │
│ │ │ │
│ │ Advantage: GC doesn't affect foreground I/O │ │
│ │ Disadvantage: Complex implementation, vLog state │ │
│ │ machine maintenance │ │
│ │ │ │
│ │ Performance impact: Throughput drop < 10% even with │ │
│ │ 100% invalid data │ │
│ └──────────────────────────────────────────────────────┘ │
│ │
│ Titan (Offline GC): │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ Trigger: Synchronized with Compaction │ │
│ │ │ │
│ │ Process: │ │
│ │ 1. Compaction reads blob addresses from SSTable │ │
│ │ 2. Read corresponding values from vLog │ │
│ │ 3. Only write valid values to new vLog │ │
│ │ 4. Update addresses in SSTable │ │
│ │ │ │
│ │ Advantage: Simple implementation, natural integration│ │
│ │ with Compaction │ │
│ │ Disadvantage: Increased I/O pressure during │ │
│ │ Compaction │ │
│ │ │ │
│ │ Performance impact: Compaction time extended 30-50% │ │
│ └──────────────────────────────────────────────────────┘ │
│ │
│ BlobDB (RocksDB Integrated): │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ Trigger: Configurable, supports online + offline │ │
│ │ mixed │ │
│ │ │ │
│ │ Special Features: │ │
│ │ - use_kv_ratio_compaction: Optimize based on key/ │ │
│ │ value ratio │ │
│ │ - max_data_files_size: Prune based on SST+blob │ │
│ │ total size │ │
│ │ - intra-L0 compaction: Merge L0 internal blob files │ │
│ │ │ │
│ │ Configuration Example: │ │
│ │ options.enable_blob_files = true; │ │
│ │ options.min_blob_size = 4096; // Separate above 4KB │ │
│ │ options.blob_file_size = 256 << 20; // 256MB │ │
│ │ options.enable_blob_garbage_collection = true; │ │
│ │ options.blob_garbage_collection_age_cutoff = 0.25; │ │
│ │ │ │
│ │ Effect: Space amplification reduced from 1.5× to │ │
│ │ 1.05× (90% of excess space reclaimed) │ │
│ └──────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
8.3.4 Deduplication Optimization
HashKV Hash Partition Deduplication:
HashKV Core Idea: Keys with same hash go to same partition, localize deduplication
┌─────────────────────────────────────────────────────────────┐
│ │
│ Traditional LSM: │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ All keys mixed together │ │
│ │ Compaction requires global scan │ │
│ │ Difficult to identify duplicate data │ │
│ └──────────────────────────────────────────────────────┘ │
│ │
│ HashKV: │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ Partition 0: hash(key) % 100 == 0 │ │
│ │ [key: "user:100", val: "v1"] │ │
│ │ [key: "user:100", val: "v2"] ← Duplicate found! │ │
│ │ [key: "user:200", val: "v1"] │ │
│ │ │ │
│ │ Partition 1: hash(key) % 100 == 1 │ │
│ │ [key: "user:101", val: "v1"] │ │
│ │ ... │ │
│ │ │ │
│ │ Deduplication process: │ │
│ │ 1. Detect duplicates within each partition during │ │
│ │ Compaction │ │
│ │ 2. Only keep latest version, delete old versions │ │
│ │ 3. Generate deduplicated SSTable │ │
│ │ │ │
│ │ Effect: Space amplification reduced from 1.8× to │ │
│ │ 1.2× (75% of excess space reclaimed) │ │
│ └──────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
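The partition-local dedup step can be sketched as follows (illustrative; assumes entries within a partition arrive in write order, so the last version of a key wins):

```python
def dedup_partition(entries):
    """HashKV-style local dedup sketch: within one hash partition, keep
    only the newest version of each key. Because a key always hashes to
    the same partition, no global scan is needed."""
    latest = {}
    for key, value in entries:   # entries in write order; later overwrites
        latest[key] = value
    return latest

part0 = [("user:100", "v1"), ("user:100", "v2"), ("user:200", "v1")]
print(dedup_partition(part0))  # {'user:100': 'v2', 'user:200': 'v1'}
```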
8.3.5 Space Amplification Optimization Technology Summary
Space Amplification Optimization Technology Map:
┌─────────────────────────────────────────────────────────────┐
│ Space Amplification Optimization │
├─────────────────────────────────────────────────────────────┤
│ │
│ ├── Compression │
│ │ ├── Snappy (Default, ~200MB/s) │
│ │ ├── Zstd (High compression, ~100MB/s) │
│ │ ├── LZ4 (Fast, ~400MB/s) │
│ │ └── Bzip2 (Highest compression, ~50MB/s) │
│ │ │
│ ├── Garbage Collection │
│ │ ├── Timely Compaction │
│ │ │ ├── Reduce trigger threshold │
│ │ │ └── Periodic Compaction │
│ │ ├── vLog GC (Key-value separation) │
│ │ │ ├── WiscKey (Online GC) │
│ │ │ ├── Titan (Offline GC) │
│ │ │ └── BlobDB (Hybrid GC) │
│ │ └── Background cleanup threads │
│ │ │
│ └── Deduplication │
│ ├── HashKV (Hash partition deduplication) │
│ ├── Global deduplication │
│ └── Local deduplication │
│ │
└─────────────────────────────────────────────────────────────┘
Optimization Effect Comparison (Based on LevelDB space amplification 1.2×):
| Optimization Technology | Space Amp | Improvement | Cost |
|------------------------|-----------|-------------|------|
| No optimization (LevelDB) | 1.2× | - | - |
| + Zstd compression | 1.08× | 10% | CPU +10% |
| + Timely Compaction | 1.05× | 12.5% | Write amplification +20% |
| + BlobDB GC | 1.02× | 15% | Random read +30% |
| + HashKV deduplication | 1.01× | 16% | Memory +5% |
Comprehensive optimization case:
A log storage system, original data 1TB:
- Before optimization: 1.5TB disk usage (space amplification 1.5×)
- Enable Zstd: 1.2TB (1.2×)
- Enable BlobDB: 1.05TB (1.05×)
- Enable periodic Compaction: 1.02TB (1.02×)
- Final savings: 480GB disk space, annual cost savings $5000+
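The case above reduces to simple arithmetic: each optimization replaces the overall space-amplification factor, and savings are measured against the unoptimized footprint. All figures come from the example itself.

```python
raw_tb = 1.0  # raw data size in TB

# (step, resulting space-amplification factor), from the case above
steps = [
    ("before optimization", 1.5),
    ("+ Zstd", 1.2),
    ("+ BlobDB", 1.05),
    ("+ periodic Compaction", 1.02),
]

for name, amp in steps:
    print(f"{name:24s} {raw_tb * amp:.2f} TB (amplification {amp}x)")

# Savings relative to the unoptimized 1.5x footprint, in GB
saved_gb = (steps[0][1] - steps[-1][1]) * raw_tb * 1000
print(f"total savings: {saved_gb:.0f} GB")  # 480 GB
```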
8.4 Synergy of Three Optimization Techniques
In practical systems, write amplification, read amplification, and space amplification constrain each other and require comprehensive trade-offs.
Trade-off Triangle:
Write Amplification
/ \
/ \
/ \
/ \
/ \
/ \
Read Amp ←────→ Space Amp
Typical Trade-off Scenarios:
1. Reduce Write Amp ↔ Increase Read Amp
- Tiered Compaction: Write Amp↓, Read Amp↑
- Key-value separation: Write Amp↓↓, Random Read↑
2. Reduce Write Amp ↔ Increase Space Amp
- Delayed Compaction: Write Amp↓, Space Amp↑
- FIFO strategy: Write Amp↓↓, Space Amp↑↑
3. Reduce Read Amp ↔ Increase Memory/Space Overhead
- Bloom Filter: Read Amp↓, Memory↑
- More cache: Read Amp↓, Memory↑
Optimal Strategy:
Choose the appropriate balance point based on specific workload!
Workload-Driven Optimization Selection:
┌─────────────────────────────────────────────────────────────┐
│ Optimization Strategies for Different │
│ Scenarios │
├─────────────────────────────────────────────────────────────┤
│ │
│ Scenario 1: Write-heavy, Read-light (Log storage, │
│ Monitoring data) │
│ ├── Main conflict: Write amplification │
│ ├── Recommended: Tiered Compaction + BlobDB │
│ ├── Acceptable cost: Slightly higher read amp, │
│ │ Slightly higher space amp │
│ └── Expected effect: Write Amp 2-3×, Read Amp 5-8×, │
│ Space Amp 1.1× │
│ │
│ Scenario 2: Read-heavy, Write-light (Config center, │
│ Metadata storage) │
│ ├── Main conflict: Read amplification │
│ ├── Recommended: Leveled Compaction + Large Block Cache │
│ ├── Acceptable cost: Slightly higher write amp │
│ └── Expected effect: Write Amp 15-20×, Read Amp 1.5-2×, │
│ Space Amp 1.1× │
│ │
│ Scenario 3: Large Value Storage (Object storage, │
│ Document database) │
│ ├── Main conflict: Write amplification + Space │
│ │ amplification │
│ ├── Recommended: Key-value separation (WiscKey/Titan/ │
│ │ BlobDB) │
│ ├── Acceptable cost: Slightly higher random read latency │
│ └── Expected effect: Write Amp 1-2×, Read Amp 2-3×, │
│ Space Amp 1.05× │
│ │
│ Scenario 4: Time-series Data (Metrics monitoring, │
│ IoT data) │
│ ├── Main conflict: Write throughput + Storage cost │
│ ├── Recommended: FIFO Compaction + Tiered compression │
│ ├── Acceptable cost: Historical data may be quickly │
│ │ cleaned │
│ └── Expected effect: Write Amp 1×, Read Amp 3-5×, │
│ Space Amp 1.1× │
│ │
│ Scenario 5: Mixed Workload (E-commerce, Social │
│ networks) │
│ ├── Main conflict: Balance read/write │
│ ├── Recommended: Leveled-N + Moderate key-value │
│ │ separation │
│ ├── Acceptable cost: Increased implementation complexity │
│ └── Expected effect: Write Amp 5-8×, Read Amp 2-3×, │
│ Space Amp 1.1× │
│ │
└─────────────────────────────────────────────────────────────┘
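The five scenarios above can be condensed into a small decision helper. This is a sketch of the selection logic only: the function name, thresholds (e.g. 4KB for "large value", 80% for "write-heavy"), and strategy strings are illustrative assumptions, not part of any engine's API.

```python
def choose_strategy(write_ratio, avg_value_bytes, time_series=False):
    """write_ratio: fraction of operations that are writes (0.0-1.0).
    Returns the recommended strategy from the scenario table above."""
    if time_series:
        # Scenario 4: throughput and storage cost dominate
        return "FIFO Compaction + tiered compression"
    if avg_value_bytes >= 4096:
        # Scenario 3: large values -> avoid rewriting them in Compaction
        return "key-value separation (WiscKey/Titan/BlobDB)"
    if write_ratio >= 0.8:
        # Scenario 1: write-heavy, read-light
        return "Tiered Compaction + BlobDB"
    if write_ratio <= 0.2:
        # Scenario 2: read-heavy, write-light
        return "Leveled Compaction + large Block Cache"
    # Scenario 5: mixed workload
    return "Leveled-N + moderate key-value separation"

print(choose_strategy(0.9, 100))               # log storage
print(choose_strategy(0.5, 8192))              # document database
print(choose_strategy(0.5, 100, time_series=True))  # IoT metrics
```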
Chapter 9: Performance Comparison Summary
9.1 Core Metrics Comparison Across Systems
| System | Year | Write Amp | Read Amp | Main Optimization | Applicable Scenario |
|---|---|---|---|---|---|
| LSM-Tree | 1996 | Theoretical model | Theoretical model | Sequential write | Theoretical foundation |
| LevelDB | 2011 | 12-50× | 3-14× | Engineering simplification | Embedded |
| RocksDB | 2012 | 10-30× | 2-10× | Multi-threading, Rich features | Industrial general purpose |
| WiscKey | 2016 | ~1× | ~1× (+1 vLog lookup) | Key-value separation | Large value |
| PebblesDB | 2017 | 3-4× | Medium | FLSM | High-throughput writes |
| Titan | 2018 | ~1× | ~1× (+1 vLog lookup) | Key-value separation | TiKV ecosystem |
| BlobDB | 2020 | 2-5× | 2-5× | Industrial-grade key-value separation | RocksDB ecosystem |
9.2 Quantified Optimization Effects
Using LevelDB as the baseline (100%), the tables below compare the effects of the various optimization schemes:
Write Throughput (Random Write, MB/s):
| System | Throughput | Relative to Baseline |
|---|---|---|
| LevelDB | 10 | 100% |
| RocksDB | 15 | 150% |
| PebblesDB | 27 | 270% |
| WiscKey (4KB value) | 350 | 3500% |
| BlobDB (4KB value) | 400 | 4000% |
Write Amplification (Lower is better):
| System | Write Amp | Reduction |
|---|---|---|
| LevelDB | 12× | - |
| RocksDB | 15× | -25% |
| PebblesDB | 4× | 67% |
| WiscKey | 1.2× | 90% |
| BlobDB | 2× | 83% |
Read Performance (Random Point Query, KOps/s):
| System | Point Query Throughput | Relative to Baseline |
|---|---|---|
| LevelDB | 25 | 100% |
| RocksDB | 30 | 120% |
| PebblesDB | 25 | 100% |
| WiscKey | 300 | 1200% |
| BlobDB | 37 | 150% |
Space Efficiency (100GB Raw Data, Actual Usage):
| System | Actual Usage | Space Amplification |
|---|---|---|
| LevelDB | 120 GB | 1.2× |
| RocksDB | 115 GB | 1.15× |
| WiscKey (After GC) | 102 GB | 1.02× |
| BlobDB | 105 GB | 1.05× |
Optimization Benefits Summary:
- Key-value separation is the most effective means to reduce write amplification (WiscKey/BlobDB)
- Bloom Filter is the most effective means to reduce read amplification (Reduce 90%+ invalid disk reads)
- Compression algorithm selection has significant impact on space efficiency (Zstd saves 30-50% space compared to Snappy)
- Multi-threading can significantly improve write throughput (RocksDB is 50%+ higher than LevelDB)
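The Bloom Filter claim above rests on one property: the filter can answer "definitely absent" without touching disk, so only the small false-positive fraction of misses still pays for an SSTable read. A minimal sketch (double hashing per the Kirsch-Mitzenmacher technique; sizes and hash choice are illustrative, not any engine's format):

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=8192, num_hashes=7):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key):
        digest = hashlib.sha256(key.encode()).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:16], "little")
        # Derive k bit positions from two base hashes
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def may_contain(self, key):
        # False means definitely absent: the SSTable can be skipped
        # with zero disk I/O. True means "probably present".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))

bf = BloomFilter()
for k in ("user:100", "user:200"):
    bf.add(k)
present = bf.may_contain("user:100")   # always True for inserted keys
false_positives = sum(bf.may_contain(f"absent:{i}") for i in range(100))
```

In an LSM-Tree each SSTable carries its own filter, so a point query consults the filters first and reads only the (usually one) file that may actually hold the key.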
Chapter 10: References and Further Reading
10.1 Core Papers
| Paper | Year | Core Contribution |
|---|---|---|
| The Log-Structured Merge-Tree | 1996 | LSM-Tree theoretical foundation |
| WiscKey: Separating Keys from Values | 2016 | Key-value separation optimization |
| PebblesDB: Building Key-Value Stores using FLSM | 2017 | FLSM data structure |
| Bigtable: A Distributed Storage System | 2006 | Distributed LSM practice |
| The Five-Minute Rule Twenty Years Later | 2007 | Storage cost analysis |
10.2 Open Source Implementations
| Project | Language | Characteristics |
|---|---|---|
| LevelDB | C++ | Most classic implementation, concise code |
| RocksDB | C++ | Industrial-grade, feature-rich |
| Badger | Go | WiscKey idea implementation |
| Titan | C++ | TiKV’s key-value separation engine |
| mini-lsm | Rust | Educational, step-by-step implementation |
| sled | Rust | Pure Rust, modern design |
10.3 Recommended Books
| Title | Author | Focus |
|---|---|---|
| Designing Data-Intensive Applications | Martin Kleppmann | Chapter 3 Storage and Retrieval |
| Database Internals | Alex Petrov | B-Tree vs LSM comparison |
Appendix: Glossary
| Term | Explanation |
|---|---|
| LSM | Log-Structured Merge |
| SSTable | Sorted String Table |
| MemTable | In-memory table, write buffer |
| Compaction | Merge operation, maintains ordering |
| Bloom Filter | Probabilistic structure that quickly rules out keys that are definitely absent |
| Tombstone | Deletion marker |
| MVCC | Multi-Version Concurrency Control |
| WAL | Write-Ahead Log |
| GC | Garbage Collection |
| vLog | Value Log, value storage after key-value separation |
| Guard | Partition point for organizing SSTables in PebblesDB |
| FLSM | Fragmented LSM, data structure proposed by PebblesDB |
Document End