LSM-Tree Complete Technical Guide: From Theory to Industrial Practice
A comprehensive technical guide covering 30 years of LSM-Tree evolution from 1996 theory to 2026 industrial practice, including LevelDB, WiscKey, PebblesDB, and RocksDB optimizations
Author: Compiled from the classic LSM-Tree papers and modern optimization literature
Version: 1.0
Date: 2026-03-06
Description: This document systematically organizes the complete technical evolution of LSM-Tree from its theoretical foundation in 1996 to industrial practice in the 2020s
Table of Contents
- Overview
- LSM-Tree Theoretical Foundation (1996)
- LevelDB Engineering Implementation (2011)
- Comparison Between LevelDB and LSM-Tree Paper
- WiscKey Key-Value Separation Optimization (2016)
- PebblesDB FLSM Optimization (2017)
- RocksDB and Industrial Practice
- LSM-Tree Optimization Technology Map
- Performance Comparison Summary
- References and Further Reading
1. Overview
1.1 What is LSM-Tree
LSM-Tree (Log-Structured Merge-Tree) is a write-optimized persistent data structure, specifically designed for high write throughput scenarios. It significantly improves write performance by transforming random writes into sequential writes.
1.2 Core Design Philosophy
The design of LSM-Tree revolves around four core principles:
1. Defer and Batch Processing
- Write operations are first recorded in memory structures (MemTable), avoiding immediate disk I/O
- When MemTable reaches a certain threshold, it is batch-flushed to disk to form SSTable
- Background Compaction merges and maintains data ordering
- This batch processing amortizes the disk overhead of single writes
2. Sequential Write Optimization
- Once SSTable is written, it cannot be modified; only new files can be appended
- All disk operations are sequential writes, fully utilizing disk bandwidth (HDD sequential write can reach 100MB/s+, while random write is only 1MB/s)
- Sacrifice some read performance (need to check multiple levels) in exchange for extreme write performance
- Compensate read performance through Bloom Filter, caching, and other technologies
3. Multi-Level Hierarchical Structure
- L0 (Level 0): Memory MemTable + recently flushed immutable files (overlapping allowed)
- L1-LN (Level 1-N): Disk files, each level size increases by a fixed ratio (usually 10 times)
- Data flows from upper levels to lower levels; colder data resides in deeper levels
- Upper levels have small data volume and fast queries; lower levels have large data volume and slower queries but higher storage efficiency
4. Trade Space for Time
- Allow data redundancy: multiple versions of the same key may coexist in different levels
- Use write amplification (rewriting data) in exchange for high throughput of sequential I/O
- Use space amplification (temporarily retaining invalid data) in exchange for efficient delete operations (only need to mark Tombstone)
1.3 Technology Evolution Timeline
| Year | Milestone | Core Contribution |
|---|---|---|
| 1996 | LSM-Tree Paper Published | Theoretical foundation, proposed Rolling Merge, multi-component architecture, cost model |
| 2004 | Google Bigtable | Distributed LSM practice, proving LSM feasibility in large-scale scenarios |
| 2006 | Apache HBase | Open source Bigtable implementation, promoting LSM adoption in Hadoop ecosystem |
| 2008 | Facebook Cassandra | LSM + distributed + decentralized design, supporting multi-data centers |
| 2011 | Google LevelDB | Embedded LSM engine, engineering simplification (SkipList, SSTable), approximately 20K lines of code |
| 2012 | Facebook RocksDB | LevelDB optimization branch, multi-threading, rich features, becoming industrial standard |
| 2014 | MongoDB WiredTiger | Supporting LSM storage engine, providing alternative to B-Tree |
| 2016 | WiscKey Paper | Key-value separation architecture, write amplification reduced to near 1×, suitable for large value scenarios |
| 2017 | PebblesDB Paper | FLSM data structure, reducing write amplification by 4.8× through Guards mechanism |
| 2018 | Titan Open Source | TiKV’s key-value separation engine, industrial-grade implementation of WiscKey ideas |
| 2020 | RocksDB Integrated BlobDB | Key-value separation functionality matured, becoming officially recommended solution |
| 2021+ | ZNS SSD / PMem Adaptation | LSM optimizations for new hardware (Zoned Namespace SSD, Persistent Memory) |
| 2024-2026 | RocksDB v10-v11 | HyperClockCache, Interpolation Search, Wide-Column support, and other cutting-edge features |
2. LSM-Tree Theoretical Foundation (1996)
2.1 Core Contributions of the Paper
Paper: The Log-Structured Merge-Tree (LSM-Tree)
Authors: Patrick O’Neil, Edward Cheng, Dieter Gawlick, Elizabeth O’Neil
Published: Acta Informatica, 1996
Problems Solved
Problems of traditional B-Tree in high-write scenarios:
- Random Writes: Each insertion requires random disk I/O
- Index Maintenance Cost: Real-time indexing doubles transaction I/O cost
- Write Amplification: B-Tree updates may require multiple disk writes
Core Innovations
Two-Component Architecture: LSM-Tree divides storage into memory component C0 and disk component C1.
- C0 resides entirely in memory, supporting fast insertion and query
- C1 resides on disk, organized in multi-page blocks, supporting sequential read/write
- Data migrates progressively from C0 to C1 through Rolling Merge
Rolling Merge Mechanism: A gradual data migration process similar to merge sort.
- Batch read C1’s multi-page blocks into memory buffer
- Read a continuous segment of entries from C0, merge with data in C1 blocks
- Create new C1 blocks and write to new locations (without overwriting old blocks)
- Cycle repeatedly to achieve continuous data flushing
Cost Model: The cost advantage of LSM-Tree over B-Tree stems from differences in disk I/O patterns.
- B-Tree: Random page I/O, each update may trigger multiple disk seeks
- LSM: Multi-page block sequential I/O, batch writes amortize seek costs
- Cost ratio ≈ (Sequential I/O cost / Random I/O cost) × (1 / Merge batch size)
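The cost model can be made concrete with back-of-envelope arithmetic. The numbers below are illustrative assumptions (not from the paper), chosen only to show how batching and sequential I/O compound:

```python
# Illustrative cost-model arithmetic; all numbers are hypothetical.
# A B-Tree insert pays roughly one random page I/O; an LSM insert pays a
# share of one sequential multi-page write, amortized over the merge batch.

random_io_cost = 10.0        # ms per random page I/O (seek-dominated HDD)
sequential_io_cost = 1.0     # ms per page written inside a multi-page block
entries_per_page_merge = 50  # entries migrated per page of merge I/O

btree_cost_per_insert = random_io_cost
lsm_cost_per_insert = sequential_io_cost / entries_per_page_merge

cost_ratio = lsm_cost_per_insert / btree_cost_per_insert
print(f"LSM/B-Tree cost ratio = {cost_ratio}")  # 0.002, i.e. ~500x cheaper
```

With these assumptions each LSM insert costs 1/500 of a B-Tree insert, which is the shape of the advantage the paper's cost model formalizes.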
Two-Component Architecture Diagram:
LSM-Tree Two-Component Architecture
┌─────────────────────────────────────────────────────┐
│ │
│ C0 Component (Memory) C1 Component (Disk) │
│ ┌──────────────┐ ┌──────────────┐ │
│ │ Index Structure │ │ Multi-page │ │
│ │ (B-Tree) │ │ Block 1 │ │
│ ├──────────────┤ Rolling ├──────────────┤ │
│ │ Key1: Val1 │ Merge │ Multi-page │ │
│ │ Key2: Val2 │ ═════════▶ │ Block 2 │ │
│ │ Key3: Val3 │ │ Multi-page │ │
│ │ ... │ │ Block 3 │ │
│ └──────────────┘ │ ... │ │
│ └──────────────┘ │
│ │
│ Characteristics: Characteristics: │
│ - Completely memory-resident - Disk sequential │
│ - Fast read/write support - Append-write opt │
│ - Capacity limited - Large capacity │
│ │
└─────────────────────────────────────────────────────┘
2.2 Rolling Merge Mechanism Detailed Explanation
Rolling Merge is the core mechanism of LSM-Tree, responsible for migrating data from memory component to disk component.
Rolling Merge Process Diagram:
Step 1: Read C1 Multi-page Block Step 2: Merge C0 Entries
┌─────────────────────┐ ┌─────────────────────┐
│ C0: [a,b,c,d,e] │ │ C0: [d,e] │
│ ↑ │ │ (Merged) │
│ Merge Cursor │ │ │
│ │ │ C1 Buffer: │
│ C1 Buffer: │ │ [a,b,c] + [d,e] │
│ [Block1: x,y,z] │ ════════▶ │ = [a,b,c,d,e] │
│ │ │ │
└─────────────────────┘ └─────────────────────┘
Step 3: Write New C1 Block Step 4: Advance Cursor
┌─────────────────────┐ ┌─────────────────────┐
│ C0: [d,e,f,g] │ │ C0: [f,g,h,i,j] │
│ (Next Batch) │ │ ↑ │
│ │ │ New Cursor │
│ New C1 Block: │ ════════▶ │ │
│ [a,b,c,d,e] │ │ C1 Buffer: │
│ (Write to new │ │ [Block2: m,n,o] │
│ disk location) │ │ │
└─────────────────────┘ └─────────────────────┘
Workflow:
- Initialization Phase
- Determine the range of multi-page blocks in C1 to be merged (usually contiguous disk blocks)
- Read these blocks into memory buffer
- Merge Phase
- Read a continuous segment of entries from C0 (called merge cursor)
- Sort and merge C0 entries with existing data in C1 blocks by key
- Handle updates (overwrite old values) and deletes (generate Tombstone)
- Write Phase
- Write merged data to new location in C1
- Use append-write instead of overwrite, ensuring crash safety
- Update index to point to new block
- Advance Phase
- Move merge cursor to next segment of C0
- Loop execution until all C0 data is migrated
Key Characteristics:
- Gradualness: Not a one-time full merge, but performed in batches, controlling single I/O overhead
- Concurrency Safety: Through “Filling Block” and “Emptying Block” design, supports concurrent read/write
- Crash Recovery: Using WAL and checkpoints, can resume from interruption after failure
Difference from LevelDB Compaction:
- Rolling Merge: Continuous, fixed-rhythm merge
- LevelDB Compaction: On-demand triggered, batch-processing background task
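The merge phase of the workflow above can be reduced to a toy function. This is a minimal sketch: the real Rolling Merge streams multi-page blocks through a buffer and handles tombstones and cursors, all of which are elided here, and `rolling_merge_pass` is an illustrative name:

```python
def rolling_merge_pass(c0_segment, c1_block):
    """Merge two key-sorted (key, value) lists; C0 (newer) wins on duplicates.

    The result models a *new* C1 block written to a fresh disk location,
    never an in-place overwrite of the old block.
    """
    merged = dict(c1_block)           # older C1 data first
    merged.update(dict(c0_segment))   # newer C0 entries overwrite duplicates
    return sorted(merged.items())     # new block, sorted by key

new_block = rolling_merge_pass(
    c0_segment=[("b", 2), ("x", 10)],          # fresh entries from memory
    c1_block=[("a", 1), ("c", 3), ("x", 9)])   # existing disk block
print(new_block)  # [('a', 1), ('b', 2), ('c', 3), ('x', 10)]
```

Note that `x` takes the newer value from C0, which is exactly the "updates overwrite old values" rule of the merge phase.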
2.3 Multi-Component Architecture
When the memory cost of two-component (C0+C1) architecture is too high, the paper proposes extending to multi-component architecture.
Architecture Form:
LSM-Tree Multi-Component Architecture
┌────────────────────────────────────────┐
│ │
│ C0 (Memory) │
│ ┌────────────────┐ │
│ │ Key range: │ ← New writes │
│ │ [latest keys] │ │
│ └───────┬────────┘ │
│ │ Rolling Merge │
│ ▼ │
│ C1 (Disk) │
│ ┌────────────────┐ Size: 10× C0 │
│ │ Key range: │ ← Older data │
│ │ [older keys] │ │
│ └───────┬────────┘ │
│ │ Rolling Merge │
│ ▼ │
│ C2 (Disk) │
│ ┌────────────────┐ Size: 10× C1 │
│ │ Key range: │ ← Oldest data │
│ │ [oldest keys] │ │
│ └───────┬────────┘ │
│ │ │
│ ▼ │
│ C3, C4, ... (More Disk Levels) │
│ │
│ Each level capacity increases │
│ (usually 10 times) │
│ Colder data resides in deeper levels │
│ │
└────────────────────────────────────────┘
Component Size Design:
- Optimal size ratio: r = S_i / S_(i-1) = (S_K / S_0)^(1/K)
- S_i: size of the i-th component
- S_0: smallest component (usually C0)
- S_K: largest component
- K: number of components
Practical Experience:
- Usually using 3 components (C0, C1, C2) can satisfy most scenarios
- Benefits diminish beyond 3 components, but management complexity increases significantly
- C0 is completely memory-resident, C1 and C2 are disk-resident
Merge Strategy:
- C0→C1: Most frequent merge, keeping C0 size controllable
- C1→C2: Less frequent, triggered when C1 reaches certain threshold
- Asynchronous execution: Merges at each level can proceed in parallel
This multi-component design reduces memory costs significantly while maintaining write performance. The multi-level Compaction strategy used in actual systems (like LevelDB, RocksDB) is the engineering implementation of this idea.
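The size-ratio formula above is easy to evaluate. The sketch below uses hypothetical capacities (64 MB memory component, 640 GB largest disk component) and an illustrative helper name:

```python
def optimal_ratio(s0_bytes, sk_bytes, k):
    """Geometric size ratio r = (S_K / S_0)^(1/K) between adjacent components."""
    return (sk_bytes / s0_bytes) ** (1.0 / k)

s0 = 64 * 2**20    # C0: 64 MB in memory (hypothetical)
sk = 640 * 2**30   # C_K: 640 GB on disk (hypothetical)
for k in (2, 3, 4):
    print(f"K={k}: r = {optimal_ratio(s0, sk, k):.1f}")
```

For this configuration, K=4 yields r of roughly 10, which matches the "each level is about 10 times larger" convention that LevelDB and RocksDB adopted.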
2.4 Operation Semantics
Delete Operation
LSM-Tree deletion is logical deletion rather than physical deletion:
Tombstone Mechanism Diagram:
Step 1: Insert Tombstone in C0
┌─────────────────────────────────────────┐
│ C0 (Memory) │
│ ┌──────────────────────────────────┐ │
│ │ key=X: [Tombstone] ← New write │ │
│ │ key=A: value1 │ │
│ │ key=B: value2 │ │
│ └──────────────────────────────────┘ │
│ │
│ C1 (Disk) │
│ ┌──────────────────────────────────┐ │
│ │ key=X: old_value ← To be cleaned│ │
│ │ key=C: value3 │ │
│ └──────────────────────────────────┘ │
└─────────────────────────────────────────┘
Step 2: Rolling Merge Propagation
┌─────────────────────────────────────────┐
│ C0 (Memory) │
│ ┌──────────────────────────────────┐ │
│ │ key=A: value1 │ │
│ │ key=B: value2 │ │
│ └──────────────────────────────────┘ │
│ │ │
│ ▼ Rolling Merge │
│ │
│ C1 (Disk) │
│ ┌──────────────────────────────────┐ │
│ │ key=X: [Tombstone] ← Propagated │ │
│ │ key=X: old_value ← Marked for │ │
│ │ cleaning │ │
│ │ key=C: value3 │ │
│ └──────────────────────────────────┘ │
└─────────────────────────────────────────┘
Step 3: Annihilation During Merge
┌─────────────────────────────────────────┐
│ C1 (New Block Written) │
│ ┌──────────────────────────────────┐ │
│ │ key=A: value1 │ │
│ │ key=B: value2 │ │
│ │ key=C: value3 │ │
│ │ (key=X cleaned, no record) │ │
│ └──────────────────────────────────┘ │
└─────────────────────────────────────────┘
Mechanism Explanation:
- Tombstone Generation
- Insert special “delete marker” entry (Tombstone) in C0
- Marker contains deleted key and deletion timestamp
- Original value position no longer matters, only need to mark key as deleted
- Delete Propagation
- Tombstone propagates from C0 to C1, C2… along with Rolling Merge
- During merge, Tombstone “annihilates” when meeting corresponding actual entry
- Eventually completely cleaned, releasing space
- Query Processing
- When querying, scan components from new to old in order
- When encountering Tombstone, immediately stop search, return key does not exist
- If actual entry is encountered first, return that value (indicating version written before deletion)
Advantage: Delete operations only require memory writes, no disk random access, extremely high performance.
Predicate Delete
Supports batch deletion based on conditions (such as “delete data older than 20 days”):
- Check each entry during Rolling Merge process
- Entries meeting conditions are directly discarded, not written to new blocks
- No need for separate scan and marking, utilize existing merge process to complete cleanup
Application Scenarios: Time-series data expiration, log cleanup, privacy data deletion, etc.
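Both the Tombstone annihilation and predicate-delete paths happen inside the same merge loop, which can be sketched as follows. This is a simplified in-memory model; `merge_with_deletes` and `TOMBSTONE` are illustrative names, and real systems encode tombstones as typed entries rather than sentinel values:

```python
TOMBSTONE = object()  # sentinel standing in for a delete marker

def merge_with_deletes(newer, older, predicate=None, is_last_level=True):
    """newer/older: dicts key -> value-or-TOMBSTONE; newer shadows older.

    predicate(key, value) -> True drops an entry during the merge
    (e.g. TTL expiry), with no separate scan-and-mark pass needed.
    """
    merged = dict(older)
    merged.update(newer)   # a tombstone "annihilates" the value it shadows
    out = {}
    for key, value in merged.items():
        if value is TOMBSTONE:
            if not is_last_level:
                out[key] = value   # keep propagating the tombstone downward
            continue               # at the last level the tombstone is dropped
        if predicate and predicate(key, value):
            continue               # predicate delete: discard while merging
        out[key] = value
    return out

older = {"x": "old", "log:1": "day30", "a": "v1"}
newer = {"x": TOMBSTONE, "b": "v2"}
result = merge_with_deletes(newer, older,
                            predicate=lambda k, v: k.startswith("log:"))
print(sorted(result))  # ['a', 'b'] : x annihilated, log:1 predicate-deleted
```

The `is_last_level` flag captures why tombstones must survive intermediate merges: a deeper component may still hold an older version of the key.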
2.5 Concurrency and Recovery
Concurrency Control Mechanism
Component-Level Locking:
Component-Level Read-Write Lock Mechanism
┌────────────────────────────────────────┐
│ │
│ Read Operations (Multiple Concurrent) │
│ ┌─────┐ ┌─────┐ ┌─────┐ │
│ │ R1 │ │ R2 │ │ R3 │ │
│ └──┬──┘ └──┬──┘ └──┬──┘ │
│ └───────┼───────┘ │
│ ▼ │
│ C0: Read Lock (Shared) ────────┐ │
│ ┌──────────────────────────────┴─┐ │
│ │ C0 Memory Component │ │
│ └────────────────────────────────┘ │
│ │
│ Write Operation (Exclusive) │
│ ┌─────┐ │
│ │ W │ ──▶ C0: Write Lock (Exclusive)│
│ └─────┘ │
│ │
│ Rolling Merge (Requires Two-Level Lock)│
│ ┌─────┐ │
│ │ RM │ ──▶ Holds C0+C1 Write Locks │
│ └─────┘ │
│ │
└────────────────────────────────────────┘
Cursor Crossing Handling:
Filling Block and Emptying Block Design
┌────────────────────────────────────────┐
│ │
│ Rolling Merge Cursor Position │
│ C0: [a,b,c,d,e,f,g,h] │
│ ↑ │
│ Cursor (scanned c) │
│ │
│ Concurrent write key=c (new data) │
│ ┌────────────────┐ │
│ │ Filling Block │ ◀── Temp new write│
│ │ key=c: new_val │ │
│ └────────────────┘ │
│ │
│ C1 Block to be merged │
│ ┌────────────────┐ │
│ │ Emptying Block │ ◀── Original data │
│ │ [x,y,z,c_old] │ (Locked) │
│ └────────────────┘ │
│ │
│ Merge Result │
│ ┌────────────────┐ │
│ │ [x,y,z,c_new] │ ◀── Filling block │
│ └────────────────┘ merged in │
│ │
└────────────────────────────────────────┘
- Rolling Merge uses cursor to traverse C0 and C1
- Concurrent writes may insert at positions already scanned by cursor
- Solution: “Filling Block” and “Emptying Block” design
- Lock C1’s pending block before merge (Emptying Block)
- Concurrent writes temporarily stored in special area (Filling Block)
- After merge completes, Filling Block data is merged into result
Recovery Mechanism
Checkpoint and WAL Mechanism Diagram:
Checkpoint and WAL Collaborative Work
┌────────────────────────────────────────┐
│ │
│ Time ─────────────────────────────▶ │
│ │
│ ┌──────────┐ ┌──────────┐ │
│ │ Write 1 │ │ Write 2 │ │
│ │ (WAL 1) │ │ (WAL 2) │ │
│ └────┬─────┘ └────┬─────┘ │
│ │ │ │
│ ┌────▼───────────────▼─────┐ │
│ │ WAL Log File │ │
│ │ [W1][W2][W3][W4][W5]... │ │
│ └──────────────────────────┘ │
│ ↑ │
│ ┌────┴────┐ ┌──────────┐ │
│ │Checkpoint│◀───│ Write 3 │ │
│ │ (t=N) │ │ (WAL 3) │ │
│ └────┬────┘ └──────────┘ │
│ │ │
│ ┌────▼──────────────┐ │
│ │ Checkpoint File │ │
│ │ C0 State Snapshot │ │
│ │ C1 Boundary Key │ │
│ └───────────────────┘ │
│ │
│ Recovery After Crash: │
│ 1. Load checkpoint (t=N state) │
│ 2. Replay WAL (W3, W4, W5...) │
│ 3. Restore to latest state │
│ │
└────────────────────────────────────────┘
Checkpoint:
- Periodically flush all or part of C0 content to disk
- Record current state of each component (size, boundary key, etc.)
- Recover from checkpoint after crash, reducing WAL replay volume
WAL (Write-Ahead Log):
- Each write operation is first recorded to sequential log
- Log records contain key, value, operation type, timestamp
- After crash, scan WAL to replay un-persisted operations
Recovery Process:
- Load most recent checkpoint, restore C0 and C1 state
- Scan WAL, replay all operations after checkpoint
- Verify consistency of each component (checksum, size, etc.)
- Recovery complete, system available
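The recovery process above can be condensed into a small replay function. In-memory dicts and tuples stand in for the checkpoint file and WAL records; the record format `(seq, op, key, value)` is an illustrative assumption:

```python
def recover(checkpoint_state, checkpoint_seq, wal_records):
    """Rebuild state: load the checkpoint, then replay newer WAL records.

    checkpoint_state: dict snapshot taken at sequence checkpoint_seq.
    wal_records: list of (seq, op, key, value) with op in {'put', 'del'}.
    """
    state = dict(checkpoint_state)
    for seq, op, key, value in wal_records:
        if seq <= checkpoint_seq:
            continue                # already covered by the checkpoint
        if op == "put":
            state[key] = value
        elif op == "del":
            state.pop(key, None)
    return state

checkpoint = {"a": 1, "b": 2}       # snapshot written at seq = 2
wal = [(1, "put", "a", 1), (2, "put", "b", 2),     # before the checkpoint
       (3, "put", "c", 3), (4, "del", "a", None)]  # replayed after it
print(recover(checkpoint, checkpoint_seq=2, wal_records=wal))
# {'b': 2, 'c': 3}
```

The checkpoint's only job here is to bound how much of the WAL must be replayed, which is exactly the trade-off the diagram describes.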
3. LevelDB Engineering Implementation (2011)
3.1 Architecture Overview
LevelDB adopts the classic LSM-Tree architecture, divided into memory components and disk components.
LevelDB Overall Architecture Diagram:
LevelDB Architecture
┌─────────────────────────────────────────────────────────────┐
│ │
│ Write Path: │
│ ┌──────────┐ ┌──────────┐ ┌──────────────────────┐ │
│ │ Put(key) │───▶│ WAL │───▶│ MemTable │ │
│ └──────────┘ │ (log) │ │ (SkipList) │ │
│ └──────────┘ └──────────┬───────────┘ │
│ │ │
│ Flush (Background): ▼ │
│ ┌──────────────────────────┐ │
│ │ Immutable MemTable │ │
│ └──────────┬───────────────┘ │
│ │ │
│ ▼ │
│ SSTable Files: │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │Level 0 │ │Level 1 │ │Level 2 │ │Level 3 │ │
│ │~4MB │ │~10MB │ │~100MB │ │~1GB │ │
│ │(Overlapping)│ │(Non-overlap)│ │(Non-overlap)│ │(Non-overlap)│ │
│ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │
│ ▲ │
│ │ Major Compaction (Background Merge) │
│ └───────────────────────────────────────────────┘
│ │
│ MANIFEST File: (Version Change Log) │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ Records all SSTable file metadata for crash recovery │ │
│ └──────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
Level Size Configuration:
| Level | Target Size | File Size | Characteristics |
|---|---|---|---|
| Level 0 | 4 MB | ~2 MB | Compaction triggered at 4 files; keys may overlap between files |
| Level 1 | 10 MB | ~2 MB | Keys do not overlap between files |
| Level 2 | 100 MB | ~2 MB | Size is 10× L1 |
| Level 3 | 1 GB | ~2 MB | And so on… |
| Level 4 | 10 GB | ~2 MB | |
| Level 5 | 100 GB | ~2 MB | |
| Level 6 | 1 TB | ~2 MB | |
Important Threshold Parameters:
- kL0_SlowdownWritesTrigger = 8: Write slowdown when L0 file count reaches 8
- kL0_StopWritesTrigger = 12: Write stop when L0 file count reaches 12
- kTargetFileSize = 2 MB: SSTable target file size
Write Path:
- User calls Put(key, value)
- First write sequentially to WAL (Write-Ahead Log) to ensure durability
- Then insert into memory MemTable (SkipList implementation)
- Return write success (delayed flush)
MemTable Flush (Minor Compaction):
- When MemTable reaches threshold (default 4MB), convert to Immutable MemTable
- Background thread writes Immutable MemTable to disk as Level 0 SSTable
- After write completes, release Immutable MemTable, reset WAL
Background Merge (Major Compaction):
- Level 0 SSTables may have overlapping key ranges
- Triggered when Level 0 file count exceeds threshold, or lower level size exceeds limit
- Merge Level N SSTable with overlapping files in Level N+1
- Generate new sorted SSTables into lower level after merge
Read Path:
- First search in MemTable
- Then search in Immutable MemTable
- Scan SSTables at each level from new to old
- Use Bloom Filter for quick exclusion at each level
- Locate specific Data Block through index
- Binary search target key within Data Block
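The read-path ordering above is what makes newer versions shadow older ones. A minimal sketch (Bloom-filter and block-index steps are elided; dicts stand in for the MemTables and per-level SSTables, and `get` is an illustrative helper):

```python
def get(key, memtable, imm_memtable, levels):
    """Search newest-to-oldest; a None value models a tombstone."""
    for source in [memtable, imm_memtable, *levels]:
        if key in source:
            value = source[key]
            # A tombstone hides any older value in deeper sources.
            return None if value is None else value
    return None  # key not present anywhere

memtable = {"a": "new"}
imm = {"b": None}                    # b deleted; tombstone not yet compacted
levels = [{"b": "old", "c": "v3"}]   # index 0 = newest on-disk level

print(get("a", memtable, imm, levels))  # 'new' (MemTable wins)
print(get("b", memtable, imm, levels))  # None (tombstone shadows L0's value)
print(get("c", memtable, imm, levels))  # 'v3'
```

Stopping at the first hit is safe precisely because sources are consulted from newest to oldest.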
3.2 Core Data Structures
MemTable: SkipList
LevelDB chooses SkipList over B-Tree as MemTable implementation:
SkipList Structure Diagram:
SkipList Multi-Level Index Structure
┌────────────────────────────────────────────────────────────┐
│ │
│ Level 3 (Highest): head ──────────────────────▶ [70] │
│ │ │
│ Level 2: head ─────▶ [30] ─────────▶ [70] │
│ │ │ │ │
│ Level 1: head ─────▶ [30] ──▶ [50] ─▶ [70] │
│ │ │ │ │ │
│ Level 0 (Bottom): head ──▶ [10]─▶ [30]─▶ [50]─▶ [70]─▶ [90] │
│ data data data data │
│ │
│ Search [50]: Start from Level 3, descend level by level │
│ - L3: head ──▶ 70 (>50), descend to L2 │
│ - L2: head ──▶ 30 (<50), forward ──▶ 70 (>50), descend to L1│
│ - L1: 30 ──▶ 50 (Hit!) │
│ │
│ Time Complexity: O(log n), avg levels ≈ log₁/ₚ(n) │
│ │
└────────────────────────────────────────────────────────────┘
SkipList Core Characteristics:
- Simple Implementation: ~400 lines of code, no complex balancing operations
- Concurrency Friendly: Supports lock-free reads, implemented via atomic operations
- Memory Efficient: Uses Arena allocator for batch memory allocation, reduces fragmentation
- Range Query: Ordered structure, supports efficient range scans
Node Height Generation:
- Each node’s height is randomly generated with probability p = 0.25
- Probability of height k: p^(k-1) * (1-p)
- Expected height: 1/(1-p) = 1.33 (average ~1.33 pointers per node)
Time Complexity:
- Lookup/Insert/Delete: O(log n)
- Space Complexity: O(n), average 1/(1-p) pointers per node
Comparison with B-Tree:
| Characteristic | B-Tree | SkipList | LevelDB Benefit |
|---|---|---|---|
| Implementation Code | ~2000+ lines | ~400 lines | 80% maintenance cost reduction |
| Concurrent Read | Requires lock | Lock-free | 2-3× read performance improvement |
| Memory Allocation | Frequent alloc/free | Arena batch allocation | Reduced memory fragmentation |
| Implementation Complexity | High (requires balancing) | Low (probabilistic balancing) | Lower bug rate |
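A toy Python SkipList shows how little machinery the structure needs. This is a sketch mirroring the diagram above, not LevelDB's implementation (which is lock-free C++ over an Arena allocator); it keeps only the probabilistic-height core with p = 0.25:

```python
import random

class Node:
    def __init__(self, key, value, height):
        self.key, self.value = key, value
        self.next = [None] * height   # one forward pointer per level

class SkipList:
    MAX_HEIGHT, P = 12, 0.25          # LevelDB also uses 12 and 1/4

    def __init__(self):
        self.head = Node(None, None, self.MAX_HEIGHT)

    def _random_height(self):
        height = 1                    # P(height = k) = P^(k-1) * (1-P)
        while height < self.MAX_HEIGHT and random.random() < self.P:
            height += 1
        return height

    def insert(self, key, value):
        update, node = [self.head] * self.MAX_HEIGHT, self.head
        for level in range(self.MAX_HEIGHT - 1, -1, -1):
            while node.next[level] and node.next[level].key < key:
                node = node.next[level]
            update[level] = node      # last node before key, per level
        new = Node(key, value, self._random_height())
        for level in range(len(new.next)):
            new.next[level] = update[level].next[level]
            update[level].next[level] = new

    def get(self, key):
        node = self.head
        for level in range(self.MAX_HEIGHT - 1, -1, -1):
            while node.next[level] and node.next[level].key < key:
                node = node.next[level]
        node = node.next[0]
        return node.value if node and node.key == key else None

sl = SkipList()
for k, v in [(30, "c"), (10, "a"), (70, "e"), (50, "d")]:
    sl.insert(k, v)
print(sl.get(50), sl.get(40))  # d None
```

No rebalancing code exists anywhere: the random heights alone keep expected search cost at O(log n), which is the "probabilistic balancing" the table refers to.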
SSTable File Format Detailed Explanation
SSTable (Sorted String Table) is LevelDB’s core disk file format, designed to efficiently store ordered key-value pairs and support fast queries.
Overall Structure
SSTable files are divided into four regions from top to bottom:
Data Region (Data Blocks):
- Stores actual key-value pair records
- Default each Data Block is 4KB (before compression)
- Records within block are ordered by key
- Uses prefix compression (shared prefixes) to reduce space
Metadata Region (Meta Blocks):
- Filter Block: Stores Bloom Filter data, accelerates queries for non-existent keys
- Meta Index Block: Index pointing to Filter Block
Index Region (Index Block):
- Each Data Block corresponds to one index entry
- Index entry key: Maximum key of Data Block
- Index entry value: Offset and size of Data Block
Footer:
- Fixed 48 bytes
- Contains location information of Meta Index Block and Index Block
- Magic Number for file format verification
Data Block Internal Structure
Data Block is the smallest unit storing actual key-value pairs in SSTable, default size is 4KB (before compression).
Structure Composition:
- Key-Value Record Sequence: Actual stored data
- Restart Point Array: Used to accelerate in-block lookups
Prefix Compression Mechanism:
- Records within block are sorted by key, adjacent keys usually share common prefixes
- Each record only stores the difference from previous key
- Set a Restart Point every 16 records, storing complete key at that point
- During lookup, decode from nearest Restart Point to reduce computation
Record Format:
- Shared Length: Length of prefix shared with previous key
- Non-shared Length: Length of non-shared portion of this key
- Value Length: Value length
- Key Delta: Non-shared key bytes
- Value: Actual value
Lookup Process:
- Binary search in Restart Point array to determine target record range
- Decode sequentially from starting Restart Point of that range
- Compare keys until found or out of range
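The prefix-compression scheme above can be sketched end to end. Records are modeled as Python tuples instead of the varint byte encoding, and `encode_block`/`decode_block` are illustrative names; the restart interval of 16 matches LevelDB's default:

```python
RESTART_INTERVAL = 16

def encode_block(sorted_kv):
    """Return (records, restarts).

    Each record is (shared_len, key_delta, value); restarts lists the record
    indices where a complete key is stored.
    """
    records, restarts, prev_key = [], [], ""
    for i, (key, value) in enumerate(sorted_kv):
        if i % RESTART_INTERVAL == 0:
            restarts.append(i)
            shared = 0                      # restart point: store the full key
        else:
            shared = 0
            while (shared < min(len(prev_key), len(key))
                   and prev_key[shared] == key[shared]):
                shared += 1                 # length of the common prefix
        records.append((shared, key[shared:], value))
        prev_key = key
    return records, restarts

def decode_block(records):
    """Sequentially rebuild full keys from a restart point onward."""
    out, prev_key = [], ""
    for shared, delta, value in records:
        key = prev_key[:shared] + delta
        out.append((key, value))
        prev_key = key
    return out

kv = [("app", 1), ("apple", 2), ("apply", 3), ("banana", 4)]
records, restarts = encode_block(kv)
assert decode_block(records) == kv
print(records[1])  # (3, 'le', 2): "apple" shares "app" with the previous key
```

Because decoding is sequential, a lookup must start from a restart point, which is why the restart array exists and why it is binary-searched first.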
Record Encoding Details
LevelDB uses varint (variable-length integer) encoding to reduce storage space for small values:
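A minimal Python rendering of the varint32 scheme (7 payload bits per byte, continuation flag in the high bit; the helper names are illustrative):

```python
def encode_varint(n):
    """Encode a non-negative integer as a LevelDB-style varint."""
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        n >>= 7
    out.append(n)                      # final byte has the high bit clear
    return bytes(out)

def decode_varint(data, pos=0):
    """Decode a varint starting at pos; return (value, next_pos)."""
    result = shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7

assert encode_varint(1) == b"\x01"        # 1 byte instead of a fixed 4
assert encode_varint(300) == b"\xac\x02"  # 2 bytes for values < 16384
assert decode_varint(encode_varint(123456789))[0] == 123456789
print(len(encode_varint(300)))  # 2
```

Since most shared lengths, key deltas, and value lengths are small, they usually fit in one byte, which is where the space saving comes from.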
Filter Block Structure
Filter Block stores SSTable’s Bloom Filter data, used to quickly exclude Data Blocks not containing target key.
Filter Block Layout:
┌─────────────────────────────────────────────────────────────┐
│ Filter Block Structure │
├─────────────────────────────────────────────────────────────┤
│ │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ Filter Data 1 (covers file offsets 0-2KB) │ │
│ │ [Bloom Filter bits] │ │
│ │ One filter is built per 2KB of file data │ │
│ └───────────────────────────────────────────────────────┘ │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ Filter Data 2 (covers file offsets 2KB-4KB) │ │
│ │ [Bloom Filter bits] │ │
│ └───────────────────────────────────────────────────────┘ │
│ ... │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ Filter Metadata (offset array) │ │
│ │ [offset_0: 4 bytes] ← Filter Data 0 offset │ │
│ │ [offset_1: 4 bytes] ← Filter Data 1 offset │ │
│ │ ... │ │
│ │ [base_lg: 1 byte] ← log2(block size), default 11 │ │
│ │ (2KB) │ │
│ └───────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
Bloom Filter Parameters:
- bits_per_key: Default 10 bits
- k (number of hash functions): k ≈ ln 2 × bits_per_key ≈ 0.69 × bits_per_key; LevelDB truncates, giving k = 6 for bits_per_key = 10
- False Positive Rate: ~1% (when bits_per_key = 10)
False Positive Rate Calculation:
P(false positive) ≈ (1 - e^(-k*n/m))^k
Where:
- n = number of keys
- m = total number of bits
- k = number of hash functions
| bits_per_key | False Positive Rate | Memory Overhead (per million keys) |
|---|---|---|
| 5 | ~9.2% | 0.625 MB |
| 8 | ~2.2% | 1.0 MB |
| 10 | ~0.8% | 1.25 MB |
| 15 | ~0.07% | 1.875 MB |
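These rates follow directly from the formula above; a sketch that evaluates it, with `bloom_fp_rate` as an illustrative helper and k truncated the way LevelDB does:

```python
import math

def bloom_fp_rate(bits_per_key):
    """P(false positive) ~ (1 - e^(-k*n/m))^k with m/n = bits_per_key."""
    k = max(1, int(bits_per_key * 0.69))  # LevelDB truncates, clamps to >= 1
    return (1 - math.exp(-k / bits_per_key)) ** k

for b in (5, 8, 10, 15):
    print(f"bits_per_key={b:2d}  k={max(1, int(b * 0.69))}  "
          f"FP rate = {bloom_fp_rate(b):.2%}")
```

Doubling bits_per_key roughly squares the false-positive rate downward, which is why 10 bits per key (about 1.25 MB per million keys) is the common sweet spot.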
Index Block Structure
Index Block stores index information for each Data Block, used to quickly locate Data Blocks.
Index Block Layout:
┌─────────────────────────────────────────────────────────────┐
│ Index Block Structure │
├─────────────────────────────────────────────────────────────┤
│ │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ Index Entry 1 (corresponds to Data Block 1) │ │
│ │ ┌─────────────────────────────────────────────────┐ │ │
│ │ │ Key: Maximum key of Data Block 1 │ │ │
│ │ │ Value: [offset: varint32][size: varint32] │ │ │
│ │ └─────────────────────────────────────────────────┘ │ │
│ └───────────────────────────────────────────────────────┘ │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ Index Entry 2 (corresponds to Data Block 2) │ │
│ │ ┌─────────────────────────────────────────────────┐ │ │
│ │ │ Key: Maximum key of Data Block 2 │ │ │
│ │ │ Value: [offset: varint32][size: varint32] │ │ │
│ │ └─────────────────────────────────────────────────┘ │ │
│ └───────────────────────────────────────────────────────┘ │
│ ... │
│ │
│ Lookup Process: │
│ 1. Binary search for target key in Index Block │
│ 2. Find first Entry where max_key >= target_key │
│ 3. Read Data Block according to offset and size in Entry │
│ │
└─────────────────────────────────────────────────────────────┘
Footer Detailed Format (48 bytes)
Footer is the tail of SSTable file, fixed 48 bytes, containing pointers to critical metadata blocks.
Footer Layout:
┌─────────────────────────────────────────────────────────────┐
│ Footer (48 bytes) │
├─────────────────────────────────────────────────────────────┤
│ │
│ Offset 0-39: Block Handles + Padding (40 bytes) │
│ ├─ Meta Index Block Handle: │
│ │ offset: varint64 (max 10 bytes) │
│ │ size: varint64 (max 10 bytes) │
│ ├─ Index Block Handle: │
│ │ offset: varint64 (max 10 bytes) │
│ │ size: varint64 (max 10 bytes) │
│ └─ Zero padding up to the 40-byte boundary │
│ │
│ Offset 40-47: Magic Number (8 bytes) │
│ Fixed value: 0xdb4775248b80fb57 │
│ (little-endian) │
│ Used to verify file format integrity │
│ │
└─────────────────────────────────────────────────────────────┘
Why 48 bytes?
- Each BlockHandle stores an offset and a size as varint64 values, at most 10 bytes each, so the two handles need at most 40 bytes
- Padding the handle area to a fixed 40 bytes gives the Footer a constant length, so it can be located with a single positioned read from the end of the file
- 40 bytes of handles and padding + 8-byte Magic Number = 48 bytes
Internal Key Format Detailed Explanation
LevelDB adds metadata when storing keys to support multi-versioning, delete markers, and transactions. Internal Key is the core mechanism for LSM-Tree to implement MVCC (Multi-Version Concurrency Control).
Internal Key Structure (Total length = user_key_length + 8 bytes):
Sequence Number
Sequence Number Characteristics:
- Globally Incrementing: Each write operation (Put/Delete/Merge) increments sequence number by 1
- Monotonicity Guarantee: Protected by DBImpl mutex lock, ensuring thread safety
- Snapshot Isolation: Record current sequence number when reading, only read data ≤ that sequence number
- Descending Order: Same User Key ordered by sequence number descending, ensuring latest version returned first
Type (Type Byte)
Type Descriptions:
| Type | Value | Purpose | Cleanup Timing |
|---|---|---|---|
| kTypeDeletion | 0x00 | Mark key as deleted | Can be cleaned after Compaction when no snapshot reference |
| kTypeValue | 0x01 | Normal key-value pair | Can be cleaned when overwritten by new version and no snapshot reference |
| kTypeMerge | 0x02 | Merge operation (incremental update); RocksDB extension, not in LevelDB | Generate new value after merging with base value |
| kTypeSingleDelete | 0x07 | Single-delete optimization; RocksDB extension, not in LevelDB | Annihilates exactly one older Put when they meet in Compaction (caller must guarantee the key was written at most once) |
Internal Key Encoding Details
Encoding Layout (Little-Endian):
Sorting Rules Detailed Explanation
Internal Key comparator implementation:
Sorting Example:
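A minimal Python stand-in for the C++ InternalKeyComparator, covering both the encoding and the ordering. As LevelDB does, the 7-byte sequence number and 1-byte type are packed into a single little-endian uint64:

```python
import functools
import struct

kTypeDeletion, kTypeValue = 0x0, 0x1

def make_internal_key(user_key: bytes, seq: int, type_byte: int) -> bytes:
    # LevelDB appends (sequence << 8) | type as a little-endian uint64.
    return user_key + struct.pack("<Q", (seq << 8) | type_byte)

def internal_compare(a: bytes, b: bytes) -> int:
    """User key ascending, then packed (sequence, type) descending."""
    ua, ub = a[:-8], b[:-8]
    if ua != ub:
        return -1 if ua < ub else 1
    ta, = struct.unpack("<Q", a[-8:])
    tb, = struct.unpack("<Q", b[-8:])
    return (ta < tb) - (ta > tb)       # larger sequence sorts first

keys = [make_internal_key(b"apple", 100, kTypeValue),
        make_internal_key(b"apple", 200, kTypeDeletion),
        make_internal_key(b"banana", 50, kTypeValue)]
ordered = sorted(keys, key=functools.cmp_to_key(internal_compare))
print([(k[:-8], struct.unpack("<Q", k[-8:])[0] >> 8) for k in ordered])
# [(b'apple', 200), (b'apple', 100), (b'banana', 50)]
```

The delete at seq=200 sorts before the older Put at seq=100, so a scan over the same user key sees the newest version (here, the deletion) first.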
Version Control and Garbage Collection
Version Visibility Rules:
- An entry is visible to a snapshot if its sequence number is ≤ the snapshot's sequence number
- Among visible versions of the same key, only the newest (highest sequence number) is returned
- Versions older than the newest version visible to the oldest live snapshot are garbage
Garbage Collection Timing:
- Compaction: Main cleanup mechanism, checks snapshot references
- Snapshot Release: When oldest snapshot advances, old versions can be cleaned
- Periodic Cleanup: Background thread checks expired data
Internal Key Layout Diagram:
┌─────────────────────────────────────────────────────────────┐
│ Internal Key Structure │
├─────────────────────────────────────────────────────────────┤
│ │
│ [User Key (variable)] [Sequence Number (7 bytes)] [Type (1 byte)]│
│ │
│ Example: key="hello", seq=100, type=kTypeValue │
│ ┌──────────┬──────────────────┬──────────┐ │
│ │ "hello" │ 0x000000000064 │ 0x01 │ │
│ │ (5B) │ (7B) │ (1B) │ │
│ └──────────┴──────────────────┴──────────┘ │
│ │
│ Sorting Rule: User Key ascending → Sequence Number descending│
│ Effect: Latest version of same key appears first │
│ │
└─────────────────────────────────────────────────────────────┘
Lookup Key Format
Temporary Key format used during reads:
Lookup Key Characteristics:
- Contains snapshot sequence number
- Uses kValueTypeForSeek (= kTypeValue, the largest valid type value) so a seek at (user_key, snapshot_seq) is positioned before every entry visible to that snapshot
- Used to locate starting position in MemTable/SSTable
SSTable Lookup Process
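The two-level lookup (binary search in the Index Block, then within one Data Block) can be sketched with in-memory lists standing in for on-disk blocks; `sstable_get` is an illustrative helper:

```python
import bisect

def sstable_get(index, data_blocks, target):
    """index: sorted list of each block's maximum key.
    data_blocks[i]: the sorted (key, value) list covered by index[i]."""
    i = bisect.bisect_left(index, target)  # first block with max_key >= target
    if i == len(index):
        return None                        # target is past the last block
    block = data_blocks[i]
    j = bisect.bisect_left(block, (target,))
    if j < len(block) and block[j][0] == target:
        return block[j][1]
    return None                            # block checked, key absent

index = ["cat", "fox"]                     # max key of each Data Block
blocks = [[("ant", 1), ("bee", 2), ("cat", 3)],
          [("dog", 4), ("fox", 5)]]
print(sstable_get(index, blocks, "dog"))   # 4
print(sstable_get(index, blocks, "cow"))   # None (one block read, no match)
```

Note that "cow" still costs one block probe; this is the case the Bloom Filter exists to short-circuit before any Data Block is read.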
SSTable Build Process
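Building mirrors lookup in reverse: fill Data Blocks in key order, record one index entry (max key) per finished block, and write the index and footer last. A sketch with Python structures instead of serialized bytes; `build_sstable` and `BLOCK_TARGET` are illustrative, and the tiny block size is for demonstration only (real blocks target ~4KB):

```python
BLOCK_TARGET = 3  # entries per Data Block (tiny, for illustration)

def build_sstable(sorted_kv):
    """Assemble data blocks, an index (max key -> block id), and a footer."""
    data_blocks, index, current = [], [], []
    for key, value in sorted_kv:
        current.append((key, value))
        if len(current) >= BLOCK_TARGET:          # block full: seal it
            index.append((current[-1][0], len(data_blocks)))
            data_blocks.append(current)
            current = []
    if current:                                   # seal the final partial block
        index.append((current[-1][0], len(data_blocks)))
        data_blocks.append(current)
    footer = {"index_handle": "index", "magic": 0xDB4775248B80FB57}
    return {"data": data_blocks, "index": index, "footer": footer}

table = build_sstable([("a", 1), ("b", 2), ("c", 3), ("d", 4)])
print([entry[0] for entry in table["index"]])  # ['c', 'd']
```

Because input arrives already sorted (from the MemTable's SkipList), the whole build is a single sequential pass: no seeks, no rewrites.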
SSTable File Example
SSTable Advantages Summary
| Characteristic | Description | Effect |
|---|---|---|
| Prefix Compression | Adjacent keys share prefixes | Save 30-50% space |
| Block-based | Fixed-size Data Blocks | Efficient caching and compression |
| Bloom Filter | Quickly exclude files not containing key | Reduce 90%+ disk reads |
| Index Structure | Two-level index (Index + Data Block) | O(log n) query |
| Immutability | SSTable unmodifiable once written | Simplify concurrency and caching |
Internal Key format: [User Key | Sequence Number (7B) | Type (1B)]
Sorting rule: User Key ascending → Sequence Number descending
3.3 Write Process
LevelDB’s write process adopts the classic WAL + MemTable pattern, ensuring data durability and high performance.
Write Flow Diagram:
LevelDB Write Flow
┌─────────────────────────────────────────────────────────────┐
│ │
│ User Call: Put(key, value) │
│ │ │
│ ▼ │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ 1. Join Writer Queue (Serialized writer queue) │ │
│ │ - Multiple write requests merged into WriteBatch │ │
│ │ - Batch write improves throughput │ │
│ └───────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ 2. Write-Ahead Logging │ │
│ │ - Write WAL first (sequential write) │ │
│ │ - Optional sync: force flush for durability │ │
│ └───────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ 3. Write to MemTable │ │
│ │ - Insert into SkipList │ │
│ │ - Use custom comparator for sorting │ │
│ └───────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ 4. MemTable Full Transition │ │
│ │ - MemTable → Immutable MemTable │ │
│ │ - Create new mem_ and log file │ │
│ │ - Background Flush to L0 SSTable │ │
│ └───────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
Write Batching Detailed Explanation:
Write batching is LevelDB’s key optimization for improving write throughput.
Working Mechanism:
- Write Queue: All write requests enter FIFO queue
- Batch Processing: Writer at queue head responsible for merging subsequent write requests
- Batch WAL: Merged batch data written to WAL once
- Batch MemTable: Batch data inserted into MemTable once
- Notification Mechanism: Notify all waiting writers upon completion
Effect Quantification:
| Thread Count | Throughput without Batching | Throughput with Batching | Improvement |
|---|---|---|---|
| 1 thread | 100 Kops/s | 100 Kops/s | 1× |
| 4 threads | 150 Kops/s | 380 Kops/s | 2.5× |
| 8 threads | 180 Kops/s | 720 Kops/s | 4× |
| 16 threads | 200 Kops/s | 950 Kops/s | 4.75× |
Why the improvement?
- Fewer WAL writes (many small writes merged into one sequential write)
- Less lock contention (a single queue-head leader serializes the group)
- Better CPU cache hit rate (batch processing)
3.4 Compaction Mechanism Detailed Explanation
Compaction is the core mechanism of LSM-Tree, responsible for merging data from upper levels to lower levels while maintaining ordering. LevelDB has two types of Compaction:
- Minor Compaction: MemTable → SSTable (Level 0)
- Major Compaction: Level N → Level N+1
3.4.1 Minor Compaction Detailed Process
Minor Compaction converts Immutable MemTable in memory to Level 0 SSTable on disk.
Trigger Conditions:
- MemTable size reaches threshold (default 4MB)
- Background thread continuously checks, triggers when condition met
Detailed Steps:
Step 1: Create Immutable MemTable
- Mark current MemTable as Immutable (cannot be modified)
- Create new MemTable to receive new writes
- Switch WAL to new log file
Step 2: Smart Output Level Selection
- Default output to Level 0
- But LevelDB checks whether the new table can "skip" Level 0 and land in a deeper level
- If the Immutable MemTable's key range does not overlap any files in Level 0 (or the next levels down), it can be placed directly in L1/L2
- Reduce subsequent L0→L1 Compaction overhead
Step 3: Write SSTable
- Iterate through all key-values in Immutable MemTable
- Write sequentially to new SSTable file
- Build Data Blocks, Index Block, Filter Block
- Finally write Footer
Step 4: Update Metadata
- Add new SSTable info to VersionSet
- Record to MANIFEST file to ensure durability
- Delete old WAL file
- Release Immutable MemTable memory
Special Optimization - Level Skipping: LevelDB decides the Immutable MemTable's output level by checking key-range overlap:
- If there is no overlap with L0, try placing it in L1
- If there is no overlap with L1 either, and the overlap with L2 is small, it can go to L2
- And so on, up to kMaxMemCompactLevel (level 2 by default)
- This smart selection significantly reduces Compaction overhead for small-range updates
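Under simplifying assumptions (a plain interval-overlap test, and ignoring LevelDB's additional grandparent-overlap size limit), the level-skipping choice can be sketched like this:

```python
# Sketch of the PickLevelForMemTableOutput idea; names are ours.
# `levels` maps level -> list of (smallest_key, largest_key) file ranges.
kMaxMemCompactLevel = 2

def overlaps(level_files, smallest, largest):
    return any(not (largest < lo or smallest > hi)
               for lo, hi in level_files)

def pick_output_level(levels, smallest, largest):
    level = 0
    if not overlaps(levels.get(0, []), smallest, largest):
        # keep descending while the next level is also overlap-free
        while level < kMaxMemCompactLevel:
            if overlaps(levels.get(level + 1, []), smallest, largest):
                break
            level += 1
    return level

levels = {0: [(b"m", b"z")], 1: [], 2: [(b"a", b"c")]}
# "d".."f" misses L0, L1, and L2, so the table skips down to level 2:
assert pick_output_level(levels, b"d", b"f") == 2
```

The real implementation additionally refuses to descend when too many bytes in the grandparent level would overlap, capping future compaction cost.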
3.4.2 Major Compaction Detailed Process
Major Compaction merges data from upper levels to lower levels, maintaining ordering and level constraints.
Trigger Conditions:
- Level 0 file count exceeds threshold (default 4)
- Total size of a level exceeds target (L1: 10MB, L2: 100MB, L3: 1GB…)
- Manual trigger (compact_range API)
Compaction Selection Algorithm:
L0→L1 Compaction Special Characteristics:
- Level 0 SSTables may overlap, must all participate in merge
- Find all L1 files overlapping with L0 files to be merged
- Multi-way merge all overlapping files, generate new L1 SSTables
L1+ Compaction Standard Process:
- The level whose score (total size / size limit) is highest is compacted first
- Within that level, LevelDB picks the next file after the per-level compaction pointer (rotating round-robin across the key space), then finds the overlapping files in level N+1
- Merge the selected files, generating new level N+1 SSTables
- Delete the old files
Compaction Execution Process:
- Input Collection: Determine all SSTables to be merged
- Multi-way Merge: Use min-heap to merge multiple ordered inputs
- Output Generation: Write new SSTables sequentially, maintain target file size (default 2MB)
- Metadata Update: Atomically update VersionSet, record to MANIFEST
- Old File Cleanup: Delete merged old SSTables (actually delayed deletion, ensure no reference before cleanup)
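The multi-way merge at the heart of step 2 can be sketched with Python's `heapq.merge`: several sorted runs in, one sorted run out. The `(key ascending, sequence descending)` sort key mirrors the InternalKey order, and skipping repeated user keys keeps only the newest version:

```python
import heapq

def compact(runs):
    """Merge sorted runs of (key, seq, value), newest version wins."""
    out = []
    last_key = None
    for key, seq, value in heapq.merge(*runs,
                                       key=lambda e: (e[0], -e[1])):
        if key == last_key:
            continue            # older version of an already-emitted key
        last_key = key
        out.append((key, value))
    return out

l0 = [("a", 9, "a9"), ("c", 7, "c7")]   # newer run
l1 = [("a", 3, "a3"), ("b", 4, "b4")]   # older run
assert compact([l0, l1]) == [("a", "a9"), ("b", "b4"), ("c", "c7")]
```

This is a simplification: real compaction also honors snapshots and delete markers (covered in the next subsection) and splits output at the target file size.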
Why L0→L1 is Most Expensive:
- L0 file overlap causes need to read more files
- L1 size limit (10MB) causes frequent triggering
- Is the bottleneck of entire LSM-Tree, subsequent optimizations (like Universal Compaction) mostly target this
3.4.3 Garbage Collection in Compaction
Compaction is not just merging data, but also responsible for cleaning expired data.
Garbage Collection Judgment Logic:
During Compaction, judge whether to discard each key:
- Hidden by New Version:
- If same user key has newer version (higher sequence number)
- And no snapshot references old version
- Then discard old version
- Delete Marker Cleanup:
- If it’s a delete marker (kTypeDeletion)
- And no snapshot references this marker
- And no data for this key in higher levels
- Then can discard this delete marker
Version Visibility Rules:
- Each snapshot records sequence number at creation time
- When reading, can only see data ≤ snapshot sequence number
- During Compaction, check oldest snapshot, versions older than it can be cleaned
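The two rules can be condensed into one predicate, modeled on the drop decision in LevelDB's `DoCompactionWork` loop (simplified; the `is_base_level` flag stands in for "no data for this key in any deeper level" and is our own name):

```python
kTypeDeletion, kTypeValue = 0, 1

def should_drop(seq, vtype, prev_seq_same_key, smallest_snapshot,
                is_base_level):
    # (a) Hidden by a newer version: a newer entry for this user key was
    # already emitted, and it is visible even to the oldest snapshot,
    # so no reader can ever need this older one.
    if (prev_seq_same_key is not None
            and prev_seq_same_key <= smallest_snapshot):
        return True
    # (b) Obsolete delete marker: older than every live snapshot, with
    # no buried data left in deeper levels for it to suppress.
    if (vtype == kTypeDeletion and seq <= smallest_snapshot
            and is_base_level):
        return True
    return False

# newer version (seq 8) is visible to oldest snapshot 10 -> drop seq 5
assert should_drop(5, kTypeValue, prev_seq_same_key=8,
                   smallest_snapshot=10, is_base_level=False)
# tombstone still needed: the key may still exist in deeper levels
assert not should_drop(4, kTypeDeletion, None, 10, is_base_level=False)
```

Note how rule (b) is the only way a tombstone ever disappears; until then it must be carried downward through every compaction.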
3.5 Version Management (VersionSet)
LevelDB uses Copy-on-Write style version control, supporting snapshot isolation and concurrent read/write.
VersionSet Structure:
Version Management
┌─────────────────────────────────────────────────────────────┐
│ │
│ VersionSet │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ current_ ───────────────────────▶ Version 3 (Current)│ │
│ │ ├── FileMetaData[]│ │
│ │ ├── level_files_ │ │
│ │ └── refs: 1 │ │
│ └───────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ Version 2 (Old Version) │ │
│ │ ├── FileMetaData[] │ │
│ │ ├── refs: 0 (reference count 0, pending deletion) │ │
│ │ └── next: Version 3 │ │
│ └───────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ Version 1 (Even Older Version) │ │
│ │ └── ... │ │
│ └───────────────────────────────────────────────────────┘ │
│ │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ MANIFEST File │ │
│ │ - Records all version change logs │ │
│ │ - Used during crash recovery │ │
│ │ - VersionEdit serialization format │ │
│ └───────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
Version Structure:
- Each level contains a FileMetaData array, recording all SSTable files in that level
- FileMetaData contains: file number, size, minimum key, maximum key
- Reference count (refs): Cannot delete when snapshot reference exists
MANIFEST File:
- Records all version changes (VersionEdit)
- Including: new files, deleted files, comparator changes, etc.
- Recover version state by replaying MANIFEST after crash
Snapshot:
- Record current sequence number at creation
- When reading, only read data ≤ snapshot sequence number
- Support multi-version concurrency control (MVCC)
- After release, old versions can be cleaned by Compaction
3.5.1 Version Management Process
- Normal Read/Write:
- Acquire current version reference (refs++)
- Read/write based on SSTable collection of this version
- Release reference when complete (refs--)
- Compaction Complete:
- Create new version, containing new SSTable collection
- Atomically switch current_ to point to new version
- Decrement old version reference count
- Old Version Cleanup:
- When version’s refs == 0, can be recycled
- SSTable files unique to this version can be deleted
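The life cycle above can be captured in a toy reference-counting model (class and method names are ours, not LevelDB's):

```python
class Version:
    def __init__(self, files):
        self.files = set(files)
        self.refs = 0

class VersionSet:
    def __init__(self):
        self.current = Version({"000001.ldb"})
        self.current.refs = 1          # pinned as "current"
        self.live = [self.current]

    def acquire(self):                 # reader/snapshot pins a version
        self.current.refs += 1
        return self.current

    def release(self, v):              # reader done: refs--
        v.refs -= 1
        self._gc()

    def install(self, new_files):      # compaction publishes a version
        new = Version(new_files)
        new.refs = 1                   # new "current" pin
        old = self.current
        self.current = new
        self.live.append(new)
        old.refs -= 1                  # drop old "current" pin
        self._gc()

    def _gc(self):                     # refs == 0 -> version recyclable
        self.live = [v for v in self.live if v.refs > 0]

vs = VersionSet()
reader = vs.acquire()                  # an iterator pins Version 1
vs.install({"000002.ldb"})             # compaction installs Version 2
assert len(vs.live) == 2               # V1 survives: reader still holds it
vs.release(reader)
assert len(vs.live) == 1               # only V2 remains; V1 files deletable
```

The key point: installing a new version never blocks readers, because old versions stay alive exactly as long as someone holds a reference.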
3.4.4 Compaction Strategy Comparison
| Strategy | Principle | Write Amplification | Read Amplification | Space Amplification | Applicable Scenario |
|---|---|---|---|---|---|
| Leveled (LevelDB Default) | Each level fully sorted, adjacent level size ratio 10:1 | High (10-30×) | Low (O(1)) | Low | Read-heavy |
| Tiered (RocksDB Universal) | Multiple sorted runs per level, delayed merge | Low (2-5×) | High (O(T)) | High | Write-heavy |
| Leveled-N | Hybrid strategy, L0-LN uses Tiered, LN+ uses Leveled | Medium | Medium | Medium | Mixed workload |
| FIFO | Only keep recent N files | Very Low | High | Very High | Time-series data, cache |
Leveled vs Tiered Compaction Diagram:
3.4.5 Compaction Tuning Parameters
Tuning Suggestions:
3.6 Read Optimization
LevelDB designs multi-layer optimization strategy for LSM-Tree’s need to check multiple levels during reads.
Read Amplification Sources:
- Worst case: a lookup must consult the MemTable, the Immutable MemTable, and L0-L6 — nine places in total
- Each level may need to open one SSTable, read Bloom Filter, Index Block, Data Block
- For non-existent key, need to check all levels to confirm
Optimization Strategies:
1. MemTable Priority
- Latest data always in MemTable, fastest lookup
- SkipList O(log n) lookup complexity
2. SSTable Cache (Table Cache)
- Cache opened SSTable file descriptors and metadata
- Avoid frequent open/close file system call overhead
3. Block Cache
- LRU cache recently read Data Blocks
- Hot data directly hits memory, no disk read needed
- Default 8MB, configurable
4. Bloom Filter Filtering
- Each SSTable has corresponding Bloom Filter
- Check Filter first during query, quickly exclude files not containing target key
- For non-existent key, reduce 90%+ disk reads
5. Level Skipping Optimization
- Utilize SSTable file ordering, maintain key range index per level
- If target key < minimum key of a level or > maximum key, skip that level directly
6. Prefetch Optimization
- During range query, asynchronously prefetch next Block while reading current Block
- Utilize disk sequential read high bandwidth
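Strategy 4 deserves a concrete sketch. This is a minimal Bloom filter, not LevelDB's actual scheme (which derives its k probes from a single hash via double hashing); here each key simply sets k bits in an m-bit array, and a clear bit on any probe means "definitely absent":

```python
import hashlib

class Bloom:
    def __init__(self, m_bits=1024, k=4):
        self.m, self.k = m_bits, k
        self.bits = bytearray(m_bits // 8)

    def _probes(self, key: bytes):
        # k independent probe positions (toy hashing via sha256)
        for i in range(self.k):
            h = int.from_bytes(
                hashlib.sha256(key + bytes([i])).digest()[:8], "little")
            yield h % self.m

    def add(self, key: bytes):
        for p in self._probes(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def may_contain(self, key: bytes) -> bool:
        # any clear bit proves absence; all set bits only suggest presence
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._probes(key))

f = Bloom()
f.add(b"alpha")
f.add(b"beta")
assert f.may_contain(b"alpha")    # never a false negative
```

Because a negative answer is definitive, a point query for a missing key can skip the Index Block and Data Block reads entirely, which is where the "90%+ fewer disk reads" figure comes from.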
Read Performance Comparison (With/Without Optimization):
| Scenario | Without Optimization | With Optimization | Improvement |
|---|---|---|---|
| Point Query (Exists) | 3-8 ms | 0.1-0.5 ms | 6-30× |
| Point Query (Not Exists) | 10-20 ms | 0.5-1 ms | 10-40× |
| Range Query | Linear Scan | Binary+Sequential Read | 10-100× |
4. Comparison Between LevelDB and Original LSM-Tree Paper
This chapter deeply analyzes the key evolution from LSM-Tree theory (1996) to LevelDB engineering implementation (2011), revealing how industrial practice optimizes theoretical models.
4.1 Core Differences Overview
The LSM-Tree paper proposed a theoretical framework and cost model, but transforming it into an industrial-grade system requires numerous engineering decisions. LevelDB, while maintaining core ideas, made several key improvements:
4.2 Detailed Comparison Analysis
4.2.1 Architecture Design Comparison
| Aspect | LSM-Tree Paper (1996) | LevelDB (2011) | Improvement Significance |
|---|---|---|---|
| C0 Implementation | B-Tree variant or (2-3) tree | SkipList | 5× code simplification (400 lines vs 2000+ lines), supports lock-free reads |
| Level Design | Conceptual C0, C1, C2…Ck | Fixed 7 levels L0-L6 | Engineering simplification, clear parameters, easy to tune |
| L0 Special Handling | None (all levels same structure) | Allows file overlapping | Reduces write amplification by 30%+, lowers L0→L1 merge pressure |
| Disk Format | Generic multi-page blocks | SSTable format | Prefix compression, Bloom Filter, index optimization |
| MemTable Structure | B-Tree-like | SkipList | Higher memory efficiency, better range query performance |
4.2.2 Core Data Structure Comparison
MemTable: B-Tree vs SkipList
SkipList Advantage Quantification:
| Characteristic | B-Tree | SkipList | LevelDB Benefit |
|---|---|---|---|
| Implementation Code | ~2000+ lines | ~400 lines | 80% maintenance cost reduction |
| Concurrent Read | Requires lock | Lock-free | 2-3× read performance improvement |
| Memory Allocation | Frequent alloc/free | Arena batch allocation | Reduced fragmentation, faster allocation |
| Implementation Complexity | High (requires balancing) | Low (probabilistic balancing) | Lower bug rate, easier to optimize |
4.2.3 Compaction Strategy Comparison
LSM-Tree Paper: Continuous Rolling Merge
LevelDB: On-demand Trigger + Background Thread
4.2.4 Read Optimization Comparison
LSM-Tree Paper: No Specialized Optimization
- Sequential search of each level
- Need to read complete data blocks for binary search
LevelDB: Multi-layer Optimization
| Optimization Technology | Paper (1996) | LevelDB (2011) | Read Amplification Reduction |
|---|---|---|---|
| Bloom Filter | ❌ Not mentioned | ✅ Per-file + per-block filters | 90%+ |
| Block Cache | ❌ Not mentioned | ✅ LRU cache Data Block | 50-70% |
| Index Block | ❌ Not mentioned | ✅ Fast Data Block location | Avoid full file scan |
| Table Cache | ❌ Not mentioned | ✅ Cache SSTable metadata | Reduce file open count |
| Prefix Compression | ❌ Not mentioned | ✅ Key shared prefix compression | Save 30-50% storage space |
Bloom Filter Effect Quantification:
4.2.5 Concurrency and Consistency Comparison
| Characteristic | LSM-Tree Paper (1996) | LevelDB (2011) | Improvement |
|---|---|---|---|
| Concurrent Read | Node-level locking | Version control + lock-free read | 2-5× read performance improvement |
| Concurrent Write | Not discussed in detail | Write Batching | 10×+ throughput improvement |
| Snapshot Isolation | Mentioned but not implemented | Full MVCC support | Supports transactions and backup |
| Recovery Mechanism | Checkpoint + WAL | MANIFEST + version management | 3× recovery speed improvement |
4.2.6 Level Design Detailed Comparison
LSM-Tree Paper: Variable Multi-Component
LevelDB: Fixed 7 Levels + Level 0 Special Handling
4.3 Key Improvements Summary
Engineering Simplification
| Improvement Point | Paper Complexity | LevelDB Implementation | Simplification Effect |
|---|---|---|---|
| MemTable | B-Tree variant (complex balancing) | SkipList (probabilistic balancing) | Code volume -80% |
| Level Management | Dynamic multi-component | Fixed 7 levels | Configuration simplification |
| Compaction | Continuous Rolling Merge | Threshold trigger | Resource controllable |
| File Format | Generic multi-page blocks | SSTable | Rich functionality |
Performance Improvement
| Metric | Paper Model | LevelDB | Improvement |
|---|---|---|---|
| Write Throughput | Theoretical model | ~10 MB/s (HDD) / ~100 MB/s (SSD) | Engineering achievable |
| Read Performance | Unoptimized | Optimized, read amplification reduced 70% | Engineering optimization |
| Concurrency | Not discussed | Supports multi-thread read/write | Modern requirements |
| Resource Control | Continuous occupation | Controllable background tasks | Production-friendly |
4.4 Design Philosophy Differences
The LSM-Tree paper and LevelDB demonstrate different design orientations from theory to engineering.
Theory-oriented vs Engineering-oriented:
| Dimension | LSM-Tree Paper (1996) | LevelDB (2011) |
|---|---|---|
| Goal | Prove LSM superiority in cost model | Build industrial-grade high-performance embedded storage |
| Focus | Mathematical optimization, asymptotic complexity | Actual performance, engineering simplicity, maintainability |
| Architecture | Generic multi-component model | Fixed 7 levels + special L0 |
| Implementation | Conceptual description | Complete runnable code (~20K lines) |
| Optimization | Theoretical cost minimization | Comprehensive balance of read/write amplification, space amplification |
Core Differences Analysis:
1. MemTable Choice: B-Tree vs SkipList
- Paper: Suggests using B-Tree or (2-3) tree, ensuring deterministic performance
- LevelDB: Uses SkipList, code volume reduced by 80%, simpler implementation
- Engineering trade-off: Probabilistic balance for implementation simplicity, actual performance difference is small
2. Compaction Mode: Continuous vs On-demand
- Paper: Rolling Merge is a continuously running background task
- LevelDB: Compaction triggered on-demand, with clear trigger conditions
- Engineering trade-off: Controllable resource usage, avoiding continuous I/O occupation
3. Level Design: Dynamic vs Fixed
- Paper: Component count and size can be dynamically adjusted based on workload
- LevelDB: Fixed 7 levels, each level size increases 10×
- Engineering trade-off: Simplified configuration and management, easier to understand and tune
4. Read Optimization: None vs Multi-layer
- Paper: Does not specifically discuss read optimization
- LevelDB: Multi-layer optimizations including Bloom Filter, Block Cache, indexes, etc.
- Engineering trade-off: 2011 SSD proliferation, random read performance improved, requiring targeted read optimization
Key Insights:
- Theory to engineering requires trade-offs: LevelDB simplified the theoretical model but gained better engineering characteristics
- Hardware changes drive design: 2011 SSD proliferation made optimizations like Bloom Filter, parallel reads important
- Simplicity first: SkipList replacing B-Tree is a key factor in engineering success
- Controllability matters: On-demand Compaction is more suitable for production than continuous Rolling Merge
4.5 Impact on Subsequent Systems
LevelDB’s design decisions influenced the entire LSM-Tree ecosystem.
5. WiscKey Key-Value Separation Optimization (2016)
Paper: WiscKey: Separating Keys from Values in SSD-Conscious Storage
Authors: Lanyue Lu, Thanumalayan Sankaranarayana Pillai, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau
Institution: University of Wisconsin, Madison
Published: FAST 2016
Core Idea: Key-Value Separation - Separating values from LSM-tree to significantly reduce I/O amplification
5.1 Research Background and Motivation
5.1.1 Core Problems of LSM-Tree
LSM-Tree (such as LevelDB, RocksDB), while avoiding random writes, suffers from severe I/O amplification problems:
Write Amplification: 12× - 50×
- Cause: Data is repeatedly read and written during Compaction process
- Example: Writing 1GB of data may require 12-50GB of actual disk writes
Read Amplification: 3× - 14×
- Cause: Lookup needs to traverse multiple levels of SSTables
- Example: Reading 1KB of data may require reading 3-14KB of actual data
Space Amplification
- Actual occupation exceeds original data due to invalid data not being cleaned in time
5.1.2 Essential Differences Between SSD and HDD
HDD (Hard Disk Drive):
- Random I/O is 100×+ slower than sequential I/O
- LSM-Tree’s trade-off is reasonable: Use sequential writes in exchange for query performance
SSD (Solid State Drive):
- Random read performance is close to sequential read (50% single-thread, can match with multi-thread)
- Random writes still have overhead (erase-write cycle)
- High I/O amplification wastes bandwidth and shortens SSD lifespan
- Internal parallelism: Multiple channels/chips can process random reads in parallel
Conclusion: LSM-Tree’s design assumptions are no longer fully applicable on SSDs
5.1.3 Key Observations
- Compaction only cares about Key order: Value sorting is not essential for range queries
- Keys are usually much smaller than Values: In modern workloads, keys are ~16B, values are 100B-4KB+
- SSD parallel random read ≈ sequential read: This makes range queries feasible after key-value separation
5.2 Key-Value Separation Architecture
5.2.1 Overall Architecture
WiscKey Architecture Diagram
┌─────────────────────────────────────────────────────────────┐
│ │
│ ┌─────────────────┐ ┌───────────────────────────┐ │
│ │ LSM-tree │ │ vLog │ │
│ │ (Stores only │ │ (Value Log append-only)│ │
│ │ Keys) │ │ │ │
│ │ │ │ │ │
│ │ Key1 → addr1 │────────▶│ addr1: Value1 │ │
│ │ Key2 → addr2 │ │ addr2: Value2 │ │
│ │ Key3 → addr3 │ │ addr3: Value3 │ │
│ │ ... │ │ ... │ │
│ │ │ │ │ │
│ │ L0-L6 (SSTable)│ │ Sequential append write │ │
│ │ Size significantly│ │ Background Garbage │ │
│ │ reduced │ │ Collection │ │
│ └─────────────────┘ └───────────────────────────┘ │
│ │
│ Core Advantages: │
│ 1. LSM-tree greatly shrinks → Compaction cost significantly │
│ reduced │
│ 2. Value sequential append → Write performance approaches │
│ device bandwidth │
│ 3. Read operation: Query LSM-tree for address → Random read │
│ vLog │
│ │
└─────────────────────────────────────────────────────────────┘
5.2.2 Detailed Data Layout
Record format in LSM-tree:
Key | Value-Address (vLog offset)
Record format in vLog:
Key | Value | Value-Size (used for GC verification)
vLog Management:
- Head pointer (in memory): New Value append position
- Tail pointer (persistent): Valid data starting position
- Both Tail and Head pointers are stored in LSM-tree:
<"tail", tail-vLog-offset>,<"head", head-vLog-offset>
5.2.3 Design Goals
| Goal | Description |
|---|---|
| Low Write Amplification | Reduce unnecessary writes, extend SSD lifespan |
| Low Read Amplification | Improve query throughput, enhance cache efficiency |
| SSD Optimization | Match SSD’s parallel random read characteristics |
| Complete Functionality | Support LSM-tree features like range queries, snapshots |
| Realistic Key-Value Sizes | Optimize for small key (16B) + variable-length value scenarios |
5.3 Key Challenges and Solutions
5.3.1 Challenge 1: Range Query Performance
Problem: After key-value separation, range queries require random reads to vLog, performance may degrade
Solution: Utilize SSD parallel random reads + prefetching
Prefetching Mechanism:
- Sequentially read a batch of keys from LSM-tree (e.g., 1000)
- Extract corresponding value addresses
- Put addresses into work queue
- 32 background threads parallel random read vLog
- Utilize SSD internal parallelism to achieve bandwidth close to sequential reads
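The prefetching idea can be sketched with a thread pool (the 32-thread figure is the paper's; here a dict lookup stands in for the random `pread` against the vLog, and all names are ours):

```python
from concurrent.futures import ThreadPoolExecutor

def read_value(addr):
    return VLOG[addr]              # stand-in for one random vLog read

# toy vLog: address -> value
VLOG = {i: f"value-{i}".encode() for i in range(100)}

def range_query(addrs, workers=8):
    """Resolve a batch of value addresses with parallel random reads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in submission order, so values come back
        # in key order even though the reads complete out of order
        return list(pool.map(read_value, addrs))

vals = range_query([3, 1, 7])
assert vals == [b"value-3", b"value-1", b"value-7"]
```

On a real SSD the parallel reads land on different internal channels, which is why aggregate throughput can approach sequential-read bandwidth for large values.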
Performance Results:
- Small value (64B): 12× slower than LevelDB (device random read limitation)
- Large value (4KB+): 8.4× faster than LevelDB
- Sequentially written data: Performance close to LevelDB (data in vLog is already sorted)
Further Optimization: For small value scenarios, log reorganization (sorting) can be performed
5.3.2 Challenge 2: Garbage Collection
Problem: After delete/update, invalid values are generated in vLog, space needs to be reclaimed
Solution: Online lightweight garbage collector
Trigger Conditions:
- Configured to run periodically
- Or space usage reaches threshold
- Supports offline mode for maintenance
Collection Process:
- Sequentially read chunk from Tail (e.g., 16MB)
- For each Key-Value pair:
- Look up this Key in LSM-tree
- Check if address in LSM-tree matches current vLog position
- If match: Valid data, append write to Head
- If not match: Invalid data, skip
- Update Tail pointer
- Use fallocate(FALLOC_FL_PUNCH_HOLE) to release space
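The collection loop can be sketched over an in-memory toy vLog (list indices stand in for byte offsets; hole punching and pointer persistence are omitted):

```python
def vlog_gc(vlog, lsm, tail):
    """vlog: list of (key, value); lsm: key -> index into vlog.
    Scan from tail to the current head, relocating live records."""
    head = len(vlog)
    for offset in range(tail, head):
        key, value = vlog[offset]
        if lsm.get(key) == offset:   # LSM still points here -> live
            lsm[key] = len(vlog)     # relocate: append at the new head
            vlog.append((key, value))
        # else: superseded or deleted -> skip (space reclaimed)
    return head                      # the scanned chunk's end = new tail

vlog = [("a", "v1"), ("b", "v1"), ("a", "v2")]   # "a" was overwritten
lsm = {"a": 2, "b": 1}
tail = vlog_gc(vlog, lsm, tail=0)
assert tail == 3
assert vlog[3:] == [("b", "v1"), ("a", "v2")]    # only live records moved
assert lsm == {"a": 4, "b": 3}                   # index updated to new spots
```

The address comparison against the LSM-tree is the entire liveness test: a stale record's key either points elsewhere (newer version) or is gone (deleted).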
Consistency Guarantee:
- First fsync() appended data in vLog
- Then synchronously update Tail pointer in LSM-tree
- After crash: Everything before Tail is persisted, after may be lost
Performance Overhead (Random Write Workload):
| Invalid Data Ratio | Throughput Degradation |
|---|---|
| 100% | < 10% |
| 50% | ~25% |
| 25% | ~35% |
Can be balanced by adjusting GC frequency and space reservation
5.3.3 Challenge 3: Crash Consistency
Problem: After key-value separation, how to ensure data consistency during crash recovery?
Solution: Utilize append atomicity of modern file systems
File System Guarantees (ext4, btrfs, xfs):
- Append operation atomicity: After crash, file only contains complete appended prefix
- Will not occur: Partial writes, out-of-order writes, garbage data
WiscKey Write Order:
- Value first appended to vLog
- fsync() vLog to ensure persistence
- Key + Address written to LSM-tree (persistence guaranteed by its WAL)
Crash Recovery Process:
- Read Head pointer from LSM-tree
- Scan vLog from Head to end of file
- Rebuild Key → Address mapping
- Duplicate data handled by LSM-tree deduplication mechanism
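The replay can be sketched over the same kind of toy vLog (list indices as offsets; names are ours): everything after the last persisted Head pointer is rescanned, and later records simply overwrite earlier ones in the rebuilt index, mirroring the LSM-tree's "newest wins" semantics:

```python
def recover(vlog, persisted_head, lsm):
    """vlog: list of (key, value); replay records from persisted_head on,
    rebuilding the key -> offset index."""
    for offset in range(persisted_head, len(vlog)):
        key, _ = vlog[offset]
        lsm[key] = offset          # a later record overwrites an earlier one
    return lsm

vlog = [("a", "v1"), ("b", "v1"), ("a", "v2")]
lsm = {"a": 0}                     # index state at the last checkpoint
recover(vlog, persisted_head=1, lsm=lsm)
assert lsm == {"a": 2, "b": 1}    # "a" now points at its newest record
```

Recovery cost is proportional to the unscanned vLog suffix, which is why persisting the Head pointer more often shrinks recovery time.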
Exception Query Handling:
- Key in LSM-tree, but Value address out of range: Delete Key
- Key in LSM-tree, but Key in Value doesn’t match: Delete Key
- Value exists but Key lost: Will be recycled by subsequent GC
- Double verification during query: Address range + Key match
Recovery Time:
- LevelDB: 0.7 seconds (1KB value)
- WiscKey: 2.6 seconds (need to scan vLog)
- Can be optimized by more frequent Head pointer persistence
5.3.4 Additional Optimization: Remove LSM-tree WAL
Observation:
- LSM-tree’s WAL is used to recover MemTable data
- WiscKey’s vLog already contains complete Key-Value data
- Can recover by scanning vLog, no additional WAL needed
Optimization:
- Completely remove LSM-tree’s log file, reduce one write
Recovery Process:
- Get most recent Head pointer from LSM-tree
- Scan vLog from Head to end
- Rebuild LSM-tree index
Effect:
- Particularly effective for small writes, reduce ~50% write volume
5.4 Performance Evaluation Results
5.4.1 Experimental Setup
| Configuration | Parameters |
|---|---|
| Device | 500GB Samsung 840 EVO SSD |
| Database Size | 100 GB |
| Key Size | 16 Bytes |
| Value Size | 64B - 256KB |
| Comparison Systems | LevelDB 1.18, RocksDB |
5.4.2 Micro Benchmark - Load Performance
Sequential Load - Throughput (MB/s):
| Value Size | 64B | 256B | 1KB | 4KB | 16KB | 64KB | 256KB |
|---|---|---|---|---|---|---|---|
| LevelDB | 10 | 10 | 10 | 10 | 10 | 10 | 10 |
| WiscKey | 40 | 80 | 150 | 350 | 450 | 480 | 500 |
| Improvement | 2.5× | 5× | 15× | 46× | 45× | 48× | 50× |
Random Load - Throughput (MB/s):
| Value Size | 64B | 256B | 1KB | 4KB | 16KB | 64KB | 256KB |
|---|---|---|---|---|---|---|---|
| LevelDB | 10 | 10 | 10 | 10 | 10 | 10 | 10 |
| WiscKey | 40 | 80 | 150 | 350 | 450 | 480 | 500 |
| Improvement | 2.5× | 5× | 15× | 111× | 104× | 85× | 46× |
Write Amplification Comparison:
| Value Size | 64B | 256B | 1KB | 4KB | 16KB | 64KB | 256KB |
|---|---|---|---|---|---|---|---|
| LevelDB | 14 | 12 | 12 | 12 | 12 | 12 | 12 |
| WiscKey | 4.5 | 2.5 | 1.2 | 1.05 | 1.01 | ~1 | ~1 |
5.4.3 Micro Benchmark - Query Performance
Random Lookup - Throughput (KOps/s):
| Value Size | 64B | 256B | 1KB | 4KB | 16KB | 64KB | 256KB |
|---|---|---|---|---|---|---|---|
| LevelDB | 25 | 10 | 4 | 3 | 2.5 | 2.5 | 2.5 |
| WiscKey | 40 | 48 | 48 | 42 | 40 | 36 | 30 |
| Improvement | 1.6× | 4.8× | 12× | 14× | 16× | 14× | 12× |
Range Query - Throughput (MB/s):
Randomly written database (vLog unordered):
| Value Size | 64B | 256B | 1KB | 4KB | 16KB | 64KB | 256KB |
|---|---|---|---|---|---|---|---|
| LevelDB | 4 | 12 | 45 | 120 | 180 | 220 | 230 |
| WiscKey | 0.3 | 1.2 | 8 | 60 | 150 | 210 | 230 |
| Relative Performance | 0.12× | 0.1× | 0.18× | 0.5× | 0.83× | 0.95× | 1.0× |
Sequentially written database (vLog ordered):
| Value Size | 64B | 256B | 1KB | 4KB | 16KB | 64KB | 256KB |
|---|---|---|---|---|---|---|---|
| LevelDB | 4 | 12 | 45 | 120 | 180 | 220 | 230 |
| WiscKey | 3 | 10 | 40 | 120 | 180 | 220 | 230 |
| Relative Performance | 0.75× | 0.83× | 0.89× | 1.0× | 1.0× | 1.0× | 1.0× |
5.4.4 Garbage Collection Overhead
| Invalid Data Ratio | 25% | 50% | 75% | 100% |
|---|---|---|---|---|
| Throughput Degradation | ~35% | ~25% | ~15% | <10% |
Note: Even with 100% invalid data, GC overhead is small because it’s mainly sequential I/O
5.4.5 Space Amplification
Actual size of 100GB randomly written database:
| System | Actual Size | Space Amplification |
|---|---|---|
| LevelDB | ~120 GB | 1.2× |
| WiscKey (Before GC) | ~115 GB | 1.15× |
| WiscKey (After GC) | ~102 GB | 1.02× |
5.4.6 YCSB Macro Benchmark
Value = 1KB:
| Workload | Description | LevelDB | RocksDB | WiscKey-GC | WiscKey |
|---|---|---|---|---|---|
| LOAD | Load 100GB | 1.0× | 1.2× | 45× | 50× |
| A | 50% read, 50% update | 1.0× | 1.1× | 4.6× | 4.6× |
| B | 95% read, 5% update | 1.0× | 1.3× | 4.0× | 5.4× |
| C | 100% read | 1.0× | 1.1× | 3.6× | 5.4× |
| D | 95% read, 5% insert | 1.0× | 1.7× | 3.4× | 3.6× |
| E | 95% range query | 1.0× | 0.8× | 0.7× | 0.8× |
| F | 50% read, 50% RMW | 1.0× | 0.7× | 3.5× | 3.4× |
Value = 16KB:
| Workload | LevelDB | RocksDB | WiscKey-GC | WiscKey |
|---|---|---|---|---|
| LOAD | 1.0× | 1.3× | 100× | 104× |
| A-F | 1.0× | 1.1-2.5× | 2×-7.5× | 2.3×-7.5× |
Conclusion: WiscKey outperforms LevelDB and RocksDB in almost all YCSB workloads
5.4.7 Recovery Time
Database recovery time after crash (1KB value):
| System | Recovery Time |
|---|---|
| LevelDB | 0.7 seconds |
| WiscKey | 2.6 seconds |
Analysis:
- WiscKey needs to scan vLog to rebuild index
- Can be optimized by more frequent Head pointer persistence
- Recovery time increases 3.7×, but still within acceptable range
5.5 Detailed Comparison with LevelDB
| Characteristic | LevelDB | WiscKey | Description |
|---|---|---|---|
| Core Idea | LSM-Tree | Key-Value Separated LSM-Tree | WiscKey retains LSM advantages, removes main overhead |
| LSM-tree Content | Key + Value | Only Key + Value Address | LSM size significantly reduced |
| Value Storage | SSTable | Independent vLog (Append Log) | Sequential write, no Compaction |
| Write Amplification | 12× - 50× | ~1× | Most significant improvement |
| Read Amplification | 3× - 14× | ~1× + 1 random read | Still advantageous in small value scenarios |
| Range Query | Sequential read | Parallel random read | Better performance in large value scenarios |
| Garbage Collection | Compaction | vLog GC (Lightweight) | Only sequential I/O |
| Crash Recovery | LSM-tree log | vLog Scan | Can remove LSM WAL |
| Space Amplification | Higher | Lower (close to 1 after GC) | More SSD space efficient |
| CPU Usage | Lower | Slightly higher | Range queries require multi-thread processing |
| Applicable Scenarios | General | SSD + Medium/Large Value | Best effect when value > 1KB |
Architecture Comparison Diagram:
┌─────────────────────────────────────────────────────────────────┐
│ LevelDB vs WiscKey │
├─────────────────────────────────────────────────────────────────┤
│ │
│ LevelDB: │
│ ┌─────────┐ ┌─────────┐ ┌─────────────────────────────┐ │
│ │ MemTable│───▶│ Immutable│───▶│ L0-L6 SSTable (Key+Value) │ │
│ │(SkipList│ │ MemTable│ │ Compaction needs to sort │ │
│ └─────────┘ └─────────┘ │ Key+Value │ │
│ ▲ └─────────────────────────────┘ │
│ │ WAL │
│ ┌────┴────┐ │
│ │ Log │ │
│ └─────────┘ │
│ │
│ WiscKey: │
│ ┌─────────┐ ┌─────────┐ ┌─────────────────────────┐ │
│ │ MemTable│───▶│ Immutable│───▶│ L0-L6 SSTable (Key only)│ │
│ │(SkipList│ │ MemTable│ │ Compaction only sorts Key│ │
│ └─────────┘ └─────────┘ └─────────────────────────┘ │
│ │ │ │
│ │ │ (Address reference) │
│ ▼ ▼ │
│ ┌─────────────────────────────────────────┐ │
│ │ vLog (Value Log) │ │
│ │ ┌─────────────────────────────────┐ │ │
│ │ │ Tail │ ... Valid Values ... │ Head│ │ │
│ │ └─────────────────────────────────┘ │ │
│ │ (Sequential append, background GC) │
│ └─────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
5.6 Design Insights and Subsequent Impact
5.6.1 Core Design Insights
- Workload Characteristics Determine Storage Design
- Observation: In modern workloads, keys are small (16B), values vary greatly (100B-4KB+)
- Insight: Compaction only cares about Key order
- Conclusion: Key-value separation can significantly reduce Compaction overhead
- Hardware Characteristics Influence Architecture Choice
- SSD random read close to sequential read → Key-value separation feasible
- SSD high internal parallelism → Multi-thread random reads effective
- SSD lifespan limited by write volume → Reducing write amplification necessary
- Trade-off Re-evaluation
- Traditional LSM: Sequential write vs Query performance
- WiscKey: Lower write amplification + Utilize SSD parallelism
- File System Characteristics Can Be Leveraged
- Append operation atomicity simplifies crash consistency design
5.6.2 Applicable Scenarios
WiscKey Best Suited For:
- SSD storage devices
- Value size >= 1KB
- Write-intensive workloads
- Scenarios requiring range queries but not too large query ranges
WiscKey Not Suited For:
- HDD storage (poor random read performance)
- Very small values (below 64B)
- Workloads dominated by large-range sequential scans
5.6.3 Subsequent Related Work
WiscKey influenced multiple subsequent storage systems:
WiscKey (2016)
│
├──→ Titan (PingCAP, 2018)
│ └── TiKV's key-value separation engine
│ └── Architecture similar to WiscKey
│
├──→ TerarkDB (ByteDance, 2019)
│ └── Combines WiscKey with other compression optimizations
│
├──→ BlobDB (Facebook RocksDB)
│ └── RocksDB's built-in key-value separation implementation
│ └── Large values automatically separated to Blob files
│
├──→ HashKV (SOSP 2018)
│ └── Further optimizes GC efficiency, reduces data migration
│
└──→ RocksDB's Integrated BlobDB (2021+)
└── Integrates WiscKey ideas into main branch
5.6.4 Paper Contribution Summary
| Contribution Type | Specific Content |
|---|---|
| Problem Identification | Points out that LSM-Tree's high I/O amplification is particularly severe on SSDs |
| Core Innovation | Proposes key-value separation architecture, significantly reduces Compaction overhead |
| Challenge Solution | Solves three major challenges: range query, garbage collection, crash consistency |
| Experimental Proof | Performance significantly better than LevelDB/RocksDB under various workloads |
| Open Source Implementation | Implemented based on LevelDB 1.18, reproducible and extensible |
5.6.5 Significance for LSM-Tree Development
WiscKey represents an important milestone in LSM-Tree optimization shifting from generality to hardware-awareness:
- Before: LSM-Tree design mainly considered HDD characteristics
- WiscKey: Redesigned for SSD characteristics, proving order-of-magnitude performance improvements achievable
- After: More work focuses on hardware characteristics (such as ZNS SSD, Persistent Memory)
6. PebblesDB FLSM Optimization (2017)
Paper: PebblesDB: Building Key-Value Stores using Fragmented Log-Structured Merge Trees
Authors: Pandian Raju, Rohan Kadekodi, Vijay Chidambaram, Ittai Abraham
Institution: University of Texas at Austin, VMware Research
Published: SOSP 2017 (ACM Symposium on Operating Systems Principles)
Core Contribution: Proposed FLSM (Fragmented Log-Structured Merge Trees) data structure, combining Skip List and LSM-Tree, significantly reducing write amplification
6.1 Research Background and Motivation
6.1.1 LSM-Tree Write Amplification Problem
LSM-Tree (such as LevelDB, RocksDB), while avoiding random writes, has a severe write amplification problem:
Root Cause: LSM requires non-overlapping key ranges for sstables at each level
- New data writes need to be merged and sorted with existing data
- Same data is repeatedly read and written during Compaction process
- Write amplification can reach 12× - 50×
Traditional LSM Compaction Example:
L0: {1,100} {100,200}
      ↓ compact
L1: {1,50} {25,100} {100,400}
      ↓ compact
L2: {1,25} ...
Data in the range {1,100} is rewritten 3 times!
Consequences:
- SSD lifespan shortened
- Storage costs increased
- Write throughput reduced (RocksDB's write throughput is only ~10% of its read throughput)
6.1.2 Limitations of Traditional Solutions
| Solution | Method | Limitation |
|---|---|---|
| No sstable merging | Directly add new sstable to level | Read performance drops sharply, need to check multiple sstables |
| Specialized hardware | Utilize SSD FTL features (NVMKV) | Depends on specific hardware, poor generality |
| Sacrifice read performance | LSM-trie, Universal Compaction | Does not support range queries or poor read performance |
| Key-value separation | WiscKey | Range query performance drops, requires GC |
6.1.3 Core Insight
Root cause of LSM write amplification: Requirement for non-overlapping (disjoint) key ranges for sstables at each level
If overlapping sstables are allowed at the same level, data rewriting during Compaction can be avoided.
Challenge: How to maintain efficient query performance while allowing overlap?
Solution: Use Guards (inspired by Skip List)
6.2 FLSM Core Design
6.2.1 Guards Mechanism
FLSM Guards Mechanism
┌─────────────────────────────────────────────────────────────┐
│ │
│ Guard: Key randomly selected from inserted keys, used to │
│ partition key space │
│ │
│ Key Properties: │
│ ├── Guards have non-overlapping key ranges (disjoint) │
│ ├── Each Guard can have multiple attached sstables │
│ │ (overlap allowed) │
│ ├── Higher-level Guards contain all Guards from lower │
│ │ levels (similar to Skip List) │
│ └── Lowest level has most Guards │
│ │
│ Guard Selection Probability: │
│ ├── Level 1: Lowest probability (fewest Guards) │
│ ├── Level i+1: Probability > Level i (more Guards) │
│ └── Example: If probability is 1/10, select 1 Guard per │
│ 10 keys │
│ │
│ Example Layout: │
│ │
│ Level 3: [Guard:5]────[Guard:100]────[Guard:375]────[Guard:1023] │
│ │ │ │ │ │
│ Level 2: [G:5]────────[G:100]────────[G:375] │
│ │ │ │ │
│ Level 1: [G:5]────────[G:100] │
│ │ │
│ Level 0: (No Guards, new sstables attached directly) │
│ │
│ sstables under each Guard (overlap allowed): │
│ Guard:5 may have: {1,20}, {2,35}, {5,40} (all contain key 5)│
│ │
└─────────────────────────────────────────────────────────────┘
6.2.2 FLSM vs LSM Architecture Comparison
LSM vs FLSM Architecture Comparison
┌─────────────────────────────────────────────────────────────────┐
│ │
│ Traditional LSM: │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Level 1: [1..100] [200..300] [400..500] │ │
│ │ ↑ Each key in only one sstable │ │
│ │ Level 0: [1..50] [60..80] [200..250] │ │
│ │ ↑ Can overlap │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ FLSM: │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Level 1: Guard:50 ────── Guard:200 ────── Guard:400 │ │
│ │ │ │ │ │ │
│ │ ▼ ▼ ▼ │ │
│ │ [1,60] [150,250] [350,450] │ │
│ │ [20,80] [180,300] [380,500] │ │
│ │ [40,100] [200,350] [400,550] │ │
│ │ ↑ sstables under same Guard can overlap │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ Key Differences: │
│ ├── LSM: sstables at each level must be disjoint │
│ ├── FLSM: Guards are disjoint, sstables within Guard can │
│ │ overlap │
│ └── FLSM: Query first finds Guard, then checks all sstables │
│ under that Guard │
│ │
└─────────────────────────────────────────────────────────────────┘
6.2.3 FLSM Compaction Algorithm
FLSM Compaction Process
┌─────────────────────────────────────────────────────────────┐
│ │
│ Traditional LSM Compaction: │
│ 1. Read Level i sstable and overlapping sstables in │
│ Level i+1 │
│ 2. Merge and sort all key-values │
│ 3. Write new sstables to Level i+1 │
│ 4. Delete old sstables │
│ → Data is rewritten! │
│ │
│ FLSM Compaction (Most Cases): │
│ 1. Select a Guard in Level i (when its sstables count │
│ exceeds threshold) │
│ 2. Partition all sstables of this Guard by Guards in │
│ Level i+1 │
│ 3. Directly append (append) partitioned sstables to │
│ corresponding Guard in Level i+1 │
│ 4. Delete old sstables in Level i │
│ → Data is only partitioned, not rewritten! │
│ │
│ Example: │
│ Level 1 Guard:5 has sstable {1, 20, 45, 101, 245} │
│ Level 2 Guards: 1, 40, 200 │
│ Compaction Result: │
│ ├── Guard:1 gets {1, 20} │
│ ├── Guard:40 gets {45, 101} │
│ └── Guard:200 gets {245} │
│ │
│ Note: Only at highest level (Last Level) needs rewriting │
│ to control fragmentation │
│ │
└─────────────────────────────────────────────────────────────┘
Two Compaction Modes:
| Mode | Trigger Condition | Operation | Write Amplification |
|---|---|---|---|
| Partition Compaction | Non-highest level | Only partition, no sorting | Close to 1× |
| Rewrite Compaction | Highest level or too fragmented | Merge, sort, rewrite | Similar to LSM |
6.3 PebblesDB Implementation Optimizations
6.3.1 Solving FLSM Read Performance Problem
FLSM Problem: Queries need to check all sstables under Guard, read performance decreases.
PebblesDB Optimization Techniques:
| Optimization Technique | Description | Effect |
|---|---|---|
| SSTable-level Bloom Filter | Each sstable has Bloom Filter | Avoid reading sstables not containing target key |
| Parallel Seek | Multi-thread parallel search of multiple sstables within Guard | Reduce seek latency |
| Seek-Based Compaction | Trigger Compaction when seek count exceeds threshold | Reduce sstable count within Guard |
| Aggressive Compaction | Trigger when Level i size exceeds 25% of Level i+1 | Reduce number of levels to search |
6.3.2 Guard Selection Algorithm
Hash-based Guard Selection:
PebblesDB uses a hash-based Guard selection algorithm to avoid the uneven Guard distribution that hot keys would otherwise cause.
Core Idea:
- Calculate hash value for each key
- Check if last N bits of hash value are all 1s
- The deeper the level, the smaller N → higher match probability → more Guards
Probability Design (Example):
- Level 1: Last 8 bits must be 1 → Probability ~ 1/256 (fewest Guards)
- Level 2: Last 6 bits must be 1 → Probability ~ 1/64
- Level 3: Last 4 bits must be 1 → Probability ~ 1/16 (most Guards)
This design ensures higher levels have more Guards, forming a hierarchical structure similar to Skip List.
6.3.3 Guard Asynchronous Insertion
Guards are not inserted synchronously; they are bound lazily:
- When a new key passes the hash check and becomes a Guard, it is first added to an "uncommitted_guards" set
- During subsequent Compactions, sstables are partitioned and attached to the new Guard
- This lazy insertion avoids synchronization overhead while maintaining the Guard hierarchy
6.3.4 PebblesDB Operation Flow
Get Operation:
- Check MemTable
- Search level by level from Level 0:
- Binary search to locate Guard
- Check all sstables under Guard (use Bloom Filter to filter)
- Return if found
Range Query:
- Determine Guards involved at each level
- Binary search to locate starting key in relevant sstables within each Guard
- Use multi-way merge (similar to Merge Sort) to produce ordered result
- Use parallel read optimization
6.4 Performance Evaluation
6.4.1 Experimental Environment
- Hardware: Dell Precision Tower 7810, Intel Xeon 2.8GHz, 16GB RAM
- Storage: 2× Intel 750 SSD (1.2TB each, RAID0)
- System: Ubuntu 16.04 LTS, Linux 4.4 kernel, ext4
- Dataset: 3× larger than memory (ensure data mainly resides on disk)
6.4.2 Micro Benchmark (db_bench)
Write Amplification Comparison (Insert 500 million key-value, total 45GB):
| System | Total Write IO | Write Amplification |
|---|---|---|
| LevelDB | 540 GB | 12× |
| HyperLevelDB | 600 GB | 13.3× |
| RocksDB | 840 GB | 18.7× |
| PebblesDB | 175 GB | 3.9× |
Throughput Comparison:
Random Write:
- LevelDB: Baseline
- HyperLevelDB: 1.1×
- RocksDB: 0.5×
- PebblesDB: 2.7× (6.7× vs RocksDB)
Random Read:
- PebblesDB: Approximately equal to HyperLevelDB, similar or slightly better than other systems
Range Query:
- Pure seek: PebblesDB 30% slower (worst case)
- 50 next(): PebblesDB 15% slower
- 1000 next(): PebblesDB 11% slower
Update Throughput (50M key-value):
| Operation | PebblesDB | HyperLevelDB | LevelDB | RocksDB |
|---|---|---|---|---|
| Insert 50M | 56.18 KOps/s | 40.00 KOps/s | 22.42 KOps/s | 14.12 KOps/s |
| Update Round 1 | 47.85 KOps/s | 24.55 KOps/s | 12.29 KOps/s | 7.60 KOps/s |
| Update Round 2 | 42.55 KOps/s | 19.76 KOps/s | 11.99 KOps/s | 7.36 KOps/s |
6.4.3 YCSB Macro Benchmark
Write-dominant workloads (Load A, Load E):
- PebblesDB: 1.5-2× faster than RocksDB
- PebblesDB: 1.5× faster than HyperLevelDB
- Write IO: 50% less than RocksDB
Read-dominant workloads (B, C, D):
- Workload C (100% read): PebblesDB better (caches more indexes)
- Workload B/D: Comparable to HyperLevelDB
- Workload E (95% range query): 6% slower
Workload F (RMW): Comparable to other systems
6.4.4 Real Application Testing
MongoDB (using PebblesDB as storage engine):
- Load A: 18% faster than WiredTiger
- A (50% write): 50% faster than WiredTiger
- Write IO: 4% less than WiredTiger
- 40% less IO than RocksDB
Note: Application layer overhead limits performance improvement magnitude
HyperDex:
- Average improvement: 18-105% (depends on workload)
- Write IO: 35-55% reduction
- Large value (16KB): More obvious improvement (105%)
6.4.5 Resource Consumption
| Resource | PebblesDB vs Others | Description |
|---|---|---|
| Memory | Uses more | Need to store Guard metadata and more Bloom Filters |
| CPU | Uses more | Guard lookup and parallel search overhead |
| Space Amplification | Similar | Invalid data cleanup mechanism similar |
6.5 Theoretical Analysis
6.5.1 Asymptotic Complexity (DAM Model)
Assumptions:
- Total data items: n
- Block size: B
- Guard probability: 1/B (the number of Guards grows by a factor of B per level)
- Number of levels H = log_B(n)
Write Cost:
- Write once per level: O(H) = O(log_B(n))
- Rewrite at last level: O(B) times
- Total write cost: O((B + log_B(n))/B)
Read Cost:
- Binary search Guard at each level: O(log(B^H)) = O(H·log B) = O(log n)
- Check sstables within Guard: O(1) after using Bloom Filter (most cases)
- Total read cost: O(log n) memory operations + O(1) disk read
6.5.2 FLSM as Generalization of LSM
When max_sstables_per_guard = 1:
- Each Guard can only have 1 sstable
- Equivalent to traditional LSM (sstables disjoint at each level)
- Performance comparable to LSM
FLSM Advantages:
- max_sstables_per_guard > 1: Lower write amplification, higher write throughput
- Tunable parameters balance read/write performance
6.6 Comparison with Related Work
6.6.1 Comparison with WiscKey
| Characteristic | WiscKey | PebblesDB |
|---|---|---|
| Core Idea | Key-value separation | Guards organize overlapping sstables |
| Write Amplification Reduction | Yes (significant) | Yes (significant) |
| Range Query | Performance drops (random read) | Slight drop (optimizable) |
| Garbage Collection | Requires independent GC | Not needed (handled during Compaction) |
| MemTable | Key only | Key+Value |
| Applicable Scenarios | Large Value | General |
6.6.2 Comparison with Other LSM Optimizations
| System | Method | Limitation |
|---|---|---|
| bLSM | Snowshoveling scheduling | Does not reduce write amplification, only smooths writes |
| VT-tree | Avoid re-sorting already sorted data | Depends on data distribution |
| LSM-trie | Use Trie structure | Does not support range queries |
| TRIAD | Hot-cold separation + delayed Compaction | Orthogonal to FLSM, can be used together |
| PebblesDB | FLSM + Guards | Read performance slightly drops |
6.7 Limitations and Applicable Scenarios
6.7.1 PebblesDB Limitations
- Poor Sequential Write Performance
- LSM can directly move sstable
- FLSM must partition, generating extra IO
- Range Query Overhead
- Pure seek operations 30% slower
- Large number of next() can amortize overhead
- Memory Consumption
- Need to store Guard metadata
- More Bloom Filters
- CPU Overhead
- Guard lookup
- Parallel search coordination
6.7.2 Applicable Scenarios
Best Suited For:
- Random write-dominated workloads
- Write throughput sensitive scenarios
- SSD storage (extend lifespan)
- Write-intensive NoSQL applications
Not Suited For:
- Pure sequential writes (log type)
- Large number of small range queries (no next() to amortize)
- Memory-constrained environments
6.8 Summary and Impact
6.8.1 Core Contributions
- FLSM Data Structure: First proposed Fragmented LSM, allowing overlapping sstables at each level
- Guards Mechanism: Inspired by Skip List, efficiently organizes overlapping sstables
- PebblesDB Implementation: Proved can simultaneously achieve low write amplification, high write throughput, acceptable read performance
- Practical Application: Successfully integrated into MongoDB and HyperDex
6.8.2 Design Insights
- Traditional LSM’s disjoint sstable constraint is the root cause of write amplification
- Through Guards organizing data, can maintain query efficiency while allowing overlap
- Write optimization and read optimization can be balanced, not necessarily zero-sum
- Tunable parameters (max_sstables_per_guard) allow optimization for different workloads
6.8.3 Key Technologies Summary
| Technology | Purpose | Effect |
|---|---|---|
| Guards | Organize overlapping sstables | Allow Compaction without re-sorting |
| SSTable Bloom Filters | Accelerate reads | Avoid reading irrelevant sstables |
| Parallel Seek | Accelerate range queries | Multi-thread search of multiple sstables |
| Seek-Based Compaction | Control Guard size | Balance read/write performance |
| Hash-based Guard Selection | Avoid skew | Evenly distribute Guards |
6.8.4 Subsequent Impact
PebblesDB’s FLSM ideas influenced multiple subsequent systems:
- Proved LSM can have lower write amplification implementations
- Guards concept borrowed by other systems
- Inspired more innovations in Compaction algorithms
- Demonstrated feasibility of “partial sorting”
7. RocksDB and Industrial Practice
7.1 RocksDB Architecture Evolution
RocksDB is an industrial-grade LSM-Tree storage engine developed by Facebook based on LevelDB, with approximately 600,000 lines of C++ code and 1300+ files. Its core design goal is to provide enterprise-level high performance, scalability, and reliability while maintaining LevelDB’s simplicity.
Core Design Philosophy:
- High Performance: Multi-threaded compaction, parallel I/O, CPU cache optimization
- Scalability: Column Family, tiered storage, pluggable architecture
- Flexibility: Rich configuration options, multiple Compaction strategies
- Reliability: Complete transaction support, backup/recovery, crash recovery
Core Architecture Layers:
RocksDB Core Architecture
┌─────────────────────────────────────────────────────────────┐
│ API Layer │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ Put │ │ Get │ │ Delete │ │ Snapshot│ │
│ └────┬────┘ └────┬────┘ └────┬────┘ └────┬────┘ │
└───────┼───────────┼───────────┼───────────┼─────────────────┘
│ │ │ │
┌───────▼───────────▼───────────▼───────────▼─────────────────┐
│ Write Layer (WAL + MemTable) │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ WriteBatch → WAL (Write-Ahead Log) → MemTable │ │
│ │ (SkipList / HashSkipList / VectorRep) │ │
│ └───────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
│
┌───────▼─────────────────────────────────────────────────────┐
│ Flush Layer (Minor Compaction) │
│ MemTable ──[Flush]──> Immutable MemTable ──[Build]──> L0 │
└─────────────────────────────────────────────────────────────┘
│
┌───────▼─────────────────────────────────────────────────────┐
│ Compaction Layer (Major Compaction) │
│  ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────┐       │
│  │ Level 0 │→ │ Level 1 │→ │ Level 2 │→ │ Level 6 │       │
│  │(overlap)│  │disjoint │  │disjoint │  │disjoint │       │
│  └─────────┘  └─────────┘  └─────────┘  └─────────┘       │
└─────────────────────────────────────────────────────────────┘
│
┌───────▼─────────────────────────────────────────────────────┐
│ Storage Layer │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │SSTable │ │SSTable │ │BlobFile │ │MANIFEST │ │
│ │(Block- │ │(Filter- │ │(Large │ │(Metadata)│ │
│ │ based) │ │ Block) │ │ Value) │ │ │ │
│ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │
└─────────────────────────────────────────────────────────────┘
│
┌───────▼─────────────────────────────────────────────────────┐
│ Cache & Memory Management │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ Block │ │ Table │ │ Row │ │ Clock/ │ │
│ │ Cache │ │ Cache │ │ Cache │ │ LRU │ │
│ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │
└─────────────────────────────────────────────────────────────┘
Key Differences from LevelDB:
| Aspect | LevelDB | RocksDB |
|---|---|---|
| Concurrency | Single-threaded compaction | Multi-threaded compaction |
| Transactions | None | Full support |
| Backup | None | BackupEngine |
| Column Family | None | Supported |
| BlobDB | None | Built-in |
| Cache | Simple LRU | HyperClockCache |
| Compaction Strategy | Leveled | Leveled/Universal/FIFO |
7.2 BlobDB: Industrial Implementation of WiscKey
RocksDB’s BlobDB is an industrial-grade implementation of the WiscKey key-value separation concept, separating large values from SSTables for storage, significantly reducing write amplification.
7.2.1 Architecture Design
Key-Value Separation Principle:
Traditional LSM:
SSTable: [Key1|Value1] [Key2|Value2] ...
↓ Values repeatedly rewritten during Compaction
BlobDB:
SSTable: [Key1|BlobRef1] [Key2|BlobRef2] ...
BlobFile: [Value1] [Value2] ... (Append-only, no Compaction)
Core Components:
| Component | Location | Function |
|---|---|---|
| BlobFileBuilder | db/blob/blob_file_builder.cc | Build Blob files |
| BlobFileReader | db/blob/blob_file_reader.cc | Read Blob files |
| BlobIndex | Stored in SSTable | Pointer to Blob file |
| BlobGarbageCollector | db/blob/blob_garbage_collector.cc | Garbage collection |
7.2.2 Blob File Format
Blob File Layout
┌─────────────────────────────────────────────────────────────┐
│ Blob Records │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ Key (for verification) │ Value │ CRC32 │ │ │
│ └───────────────────────────────────────────────────────┘ │
│ ... │
├─────────────────────────────────────────────────────────────┤
│ Blob Index (Stored in SSTable) │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ blob_file_number │ offset │ size │ compression_type │ │
│ └───────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
7.2.3 Key Configuration Parameters
| Configuration | Default Value | Description |
|---|---|---|
| enable_blob_files | false | Enable key-value separation |
| min_blob_size | 0 | Values smaller than this are still stored in the SSTable |
| blob_file_size | 256MB | Blob file target size |
| enable_blob_garbage_collection | true | Enable garbage collection |
| blob_garbage_collection_age_cutoff | 0.25 | GC only processes the oldest fraction of Blob files |
| blob_garbage_collection_force_threshold | 1.0 | Force GC threshold |
7.2.4 Garbage Collection Mechanism
BlobDB’s GC is performed automatically during Compaction:
- Scan: Traverse Blob Index in SSTable
- Verify: Check if each value is still valid
- Copy: Append valid values to new Blob file
- Update: Modify Blob Index pointer in SSTable
- Cleanup: Delete old Blob files
GC Trigger Conditions:
- Configured to run periodically
- Invalid data ratio exceeds threshold
- Automatic check during Compaction
7.2.5 Differences from WiscKey
| Characteristic | WiscKey | BlobDB |
|---|---|---|
| Implementation | Standalone system | RocksDB built-in |
| GC | Requires separate implementation | Coordinated with Compaction |
| Configuration | Limited | Rich configuration options |
| Monitoring | Basic | Comprehensive statistics |
| Applicable Scenarios | Large value specific | General, automatic selection |
Applicable Scenarios:
- Average value size > 1KB
- Write-intensive, need to reduce write amplification
- Can accept slight range query performance degradation
7.3 Compaction Strategy Detailed Explanation
RocksDB provides three main Compaction strategies, suitable for different workload scenarios.
7.3.1 Leveled Compaction
Source: db/compaction/compaction_picker_level.cc
Core Idea: SSTable key ranges at each level are completely non-overlapping, similar to LevelDB.
Trigger Conditions:
- L0 file count exceeds level0_file_num_compaction_trigger (default 4)
- Total size of a level exceeds max_bytes_for_level_base × (level_multiplier ^ (L-1))
- Manual trigger (CompactRange)
Compaction Selection Algorithm:
1. Prioritize L0 (if file count exceeds threshold)
2. Otherwise select "most crowded" level (current size / target size ratio largest)
3. Select earliest created SSTable in that level as input
4. Find all overlapping SSTables in next level
5. Check overlap with grandparent level, limit Compaction size
Write Amplification: 10-30× (depends on data distribution)
Read Amplification: Low (O(1), at most one SSTable per level)
Applicable Scenarios:
- Read-heavy (read/write > 10:1)
- Frequent range queries
- Need stable read performance
7.3.2 Universal Compaction (Size-Tiered)
Source: db/compaction/compaction_picker_universal.cc
Core Idea: Delayed merging, each level allows multiple overlapping sorted runs, sorted by size.
Trigger Conditions:
- Space amplification exceeds threshold (max_size_amplification_percent)
- Number of similar-sized runs exceeds threshold
- Total run count exceeds level0_file_num_compaction_trigger
Merge Strategy:
1. Space amplification merge: When total size > (100 + threshold)% of valid data
2. Size ratio merge: Merge runs of similar size
3. File count merge: When total run count is too high
Write Amplification: 2-5× (significantly lower than Leveled)
Read Amplification: Higher (need to check multiple runs)
Applicable Scenarios:
- Write-heavy (write/read > 10:1)
- Log-type workloads
- Sequential writes
7.3.3 FIFO Compaction
Source: db/compaction/compaction_picker_fifo.cc
Core Idea: Only keep latest data, similar to cache.
Trigger Conditions:
- Total size exceeds compaction_options_fifo.max_table_files_size
- Or TTL expired
Compaction Behavior:
1. Sort by file creation time
2. Delete oldest files until size within limit
3. No merge operations performed
Write Amplification: Very low (~1×, no rewriting)
Read Amplification: Very high (may need to check many files)
Applicable Scenarios:
- Time-series data
- Cache
- TTL-dominated scenarios
7.3.4 Strategy Comparison Summary
| Strategy | Write Amplification | Read Amplification | Space Amplification | Applicable Scenario | RocksDB Recommendation |
|---|---|---|---|---|---|
| Leveled | 10-30× | Low (O(1)) | Low | Read-heavy | Default |
| Universal | 2-5× | Medium-High | Medium | Write-heavy | Log-type |
| FIFO | ~1× | Very High | Very High | Time-series/Cache | Special scenarios |
| BlobDB | 2-5× | Medium | Low | Large value | General large value |
7.3.5 Configuration Selection Recommendations
Read-heavy (read/write > 10:1):
options.compaction_style = kCompactionStyleLevel;
options.level0_file_num_compaction_trigger = 4;
options.level_slowdown_writes_trigger = 20;
options.level_stop_writes_trigger = 36;
Write-heavy (write/read > 10:1):
options.compaction_style = kCompactionStyleUniversal;
options.compaction_options_universal.max_size_amplification_percent = 200;
options.compaction_options_universal.size_ratio = 1;
Time-series data/TTL:
options.compaction_style = kCompactionStyleFIFO;
options.compaction_options_fifo.max_table_files_size = 100 * 1024 * 1024 * 1024; // 100GB
options.compaction_options_fifo.allow_compaction = true; // Allow minor compaction
options.ttl = 7 * 24 * 60 * 60; // 7 days
Large Value (>1KB):
options.enable_blob_files = true;
options.min_blob_size = 1024;
options.blob_file_size = 256 * 1024 * 1024;
options.enable_blob_garbage_collection = true;
7.4 Industrial-Grade Features
RocksDB provides rich enterprise-level features to meet production environment requirements:
1. Multi-Version Concurrency Control (MVCC)
- Snapshot isolation level, supports consistent reads
- Serializable transactions, guarantees strict execution order
- Deadlock detection and automatic resolution
2. Backup and Recovery
- Hot backup (BackupEngine): Create consistent snapshots without stopping service
- Incremental backup: Only backup changed data
- Point-in-time recovery: Restore to state at specified timestamp
3. Checkpoint
- Create lightweight consistent snapshots
- Based on hard links, completes almost instantaneously
- Used for quick backup or test environment cloning
4. Tiered Storage (Hot-Cold Separation)
- Support storing data from different levels on different media
- Hot data (L0-L2): SSD
- Cold data (L3+): HDD or object storage
- Automatic data tiered migration
5. Compression Algorithm Selection
- Snappy: Default, fast compression, suitable for general scenarios
- Zstd: High compression ratio, saves storage space
- LZ4: Extremely fast compression, suitable for CPU-constrained scenarios
- Support selecting different algorithms per level
6. Bloom Filter Tuning
- Full table filter: One Filter per SSTable
- Partition filter: Reduce memory usage, suitable for large SSTables
- Per-block filter: Precise to Data Block level
7. Adaptive Compaction
- Dynamically adjust Compaction rate based on workload
- Configurable rate limiting, avoid affecting foreground
- Priority scheduling, prioritize processing read-hot levels
8. Column Families
- Logical isolation of different data collections
- Independent Compaction strategy configuration per Column Family
- Support cross-Column Family atomic operations
9. Monitoring and Diagnostics
- Real-time statistics (Statistics): Operation counts, latency distribution
- Performance context (PerfContext): Detailed latency of single operation
- Tracing (Trace): Record operation sequence for diagnosis
10. Write Throttling and Flow Control
- Prevent writes from being too fast causing Compaction to fall behind
- Automatically reduce write rate, maintain system stability
- Configurable delayed write strategy
7.5 RocksDB Latest Progress (2024-2026)
Note: This section involves RocksDB latest versions (v10.x - v11.x), some features are still rapidly iterating, please test thoroughly before production use.
7.5.1 RocksDB v11.0 New Features (2026)
Blob File Storage Wide-Column Entities:
- Background: Supports HBase-like wide-column model, suitable for sparse column storage scenarios
- Optimization: min_blob_size configuration supports wide-column scenarios, with automatic storage method selection
- Benefit: Reduces SST file size, improving read performance by 10-30%
FIFO Compaction Enhancements:
- max_data_files_size: Trim old files based on SST + blob total size
- use_kv_ratio_compaction: BlobDB-optimized intra-L0 compaction
- Applicable Scenarios: Time-series data, caches, and other TTL-dominated workloads
Interpolation Search:
- Configuration: new index_block_search_type option
- Principle: Uses key distribution patterns to predict position, falling back to binary search for non-uniform distributions
- Performance: Uniformly distributed keys (such as timestamps) can reduce comparison counts by 30-50%
Key-Value Separated Data Blocks:
- Configuration: separate_key_value_in_data_block option
- Benefit: Improves CPU cache hit rate; improves compression ratio by 5-15%
7.5.2 RocksDB v10.x Important Updates
| Version | Feature | Impact |
|---|---|---|
| v10.7.0 | HyperClockCache Default | Replace LRU Cache, higher concurrent performance, reduced lock contention |
| v10.6.0 | MultiScan API | Multi-range scan optimization, reduce duplicate index lookups |
| v10.5.0 | User-Defined Index (UDI) | Support custom secondary indexes, flexibility improvement |
| v10.4.0 | Parallel Compression Optimization | CompressionOptions::parallel_threads accelerates compression |
| v10.0.0 | Integrated BlobDB Complete | Key-value separation functionality matured, recommended for production use |
7.5.3 Frontier Research Directions
Hardware Adaptation:
ZNS SSD (Zoned Namespace SSD):
- Characteristics: SSD divided into multiple Zones, can only write sequentially, requires explicit erase
- Optimization: LSM SSTables naturally aligned with Zones
- Benefit: Reduce SSD internal GC, extend lifespan, improve performance
Persistent Memory (PMem):
- Characteristics: Byte-addressable, large capacity, non-volatile
- Optimization: PMem as large-capacity MemTable or L0 layer
- Benefit: Fast recovery, large-capacity buffering, reduced write amplification
CXL Memory Expansion:
- Characteristics: Expand memory pool via CXL protocol
- Optimization: Separate compute and storage, remote MemTable
- Benefit: Compute-storage separated architecture, elastic scaling
Cloud-Native Optimization:
Compute-Storage Separation:
- Compute nodes stateless, data stored in object storage (S3/OSS)
- SSTable tiering: Hot data local, cold data remote
- Challenges: Cold start latency, consistency guarantees
Remote Compaction:
- Offload Compaction tasks to dedicated compute nodes
- Release foreground node resources, improve service capacity
- Technical challenges: Network overhead, task scheduling
Serverless LSM:
- Launch LSM instances on demand, pay per query
- Data persisted in object storage
- Applicable: Low-frequency access, elastic demand scenarios
New Data Structures:
Learned Index:
- Use machine learning models to replace traditional B-Tree indexes
- Predict key position, reduce index size
- Challenges: Model training, dynamic updates, worst-case guarantees
Adaptive Data Structures:
- Automatically adjust structure based on workload
- Read-heavy → B-Tree mode
- Write-heavy → LSM mode
- Representatives: Dostoevsky, Fluid LSM-Tree
Hardware Acceleration:
- FPGA/GPU acceleration for Compaction
- Dedicated accelerators for compression/decompression
- Smart NIC offload network I/O
7.5.4 Version Selection Recommendations
| Scenario | Recommended Version | Reason |
|---|---|---|
| Production Environment (Stability Priority) | v8.x - v9.x LTS | Large-scale verified, comprehensive community support |
| New Feature Requirements | v10.x | Rich functionality, relatively mature, optimizations like HyperClockCache worth adopting |
| Frontier Testing/Evaluation | v11.x | Latest features (e.g., wide-column support, interpolation search); requires thorough testing before production |
7.5.5 Getting Latest Updates
- GitHub Releases: https://github.com/facebook/rocksdb/releases
- RocksDB Blog: https://rocksdb.org/blog/
- Official Wiki: https://github.com/facebook/rocksdb/wiki
8. LSM-Tree Optimization Technology Map
LSM-Tree has three core performance metrics: write amplification, read amplification, and space amplification. This chapter systematically organizes the complete optimization technology system targeting each of the three problems.
Root Causes of LSM-Tree Three Amplification Problems:
┌─────────────────────────────────────────────────────────────┐
│ │
│ Write Amplification │
│ ├── Root: Repeatedly rewriting data during Compaction │
│ ├── Impact: Reduces write throughput, shortens SSD lifespan │
│ └── Typical: LevelDB 12-50×, RocksDB 10-30× │
│ │
│ Read Amplification │
│ ├── Root: Need to check multiple levels/files to find key │
│ ├── Impact: Increases read latency, reduces random read │
│ │ performance │
│ └── Typical: LevelDB 3-14×, RocksDB 2-10× │
│ │
│ Space Amplification │
│ ├── Root: Invalid data not cleaned timely, low compression │
│ │ efficiency │
│ ├── Impact: Wastes disk space, increases costs │
│ └── Typical: LevelDB 1.2×, RocksDB 1.15× │
│ │
└─────────────────────────────────────────────────────────────┘
8.1 Write Amplification Optimization
Write amplification is the most critical performance problem of an LSM-Tree, defined as the ratio of data actually written to disk to data written by the user.
8.1.1 Write Amplification Source Analysis
Three Major Sources of Write Amplification:
┌─────────────────────────────────────────────────────────────┐
│ │
│ 1. Compaction Rewrite │
│ ┌────────────────────────────────────────────────────┐ │
│ │ L0 → L1 Compaction Example: │ │
│ │ │ │
│ │ L0: [File_A][File_B][File_C][File_D] │ │
│ │ ↓ Merge │ │
│ │ L1: [File_1][File_2][File_3]...[File_N] │ │
│ │ │ │
│ │ Input: 4 files × 2MB = 8MB │ │
│ │ Output: 10 files × 2MB = 20MB │ │
│ │ Write Amp: 20MB / 8MB = 2.5× (Only this level) │ │
│ └────────────────────────────────────────────────────┘ │
│ │
│ 2. Bloom Filter Update │
│ - Need to recalculate Bloom Filter for each Compaction │
│ - Additional write overhead: ~5-10% │
│ │
│ 3. Index Rebuild │
│ - Index Block and MetaIndex Block rebuild │
│ - Additional write overhead: ~3-5% │
│ │
└─────────────────────────────────────────────────────────────┘
Overall Write Amplification Formula:
Write_Amp = Σ(Level_i_Size / Level_i-1_Size) × Compaction_Overhead
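As a quick sanity check, this formula can be evaluated directly for the classic leveled layout (a minimal sketch; `leveled_write_amp` and the 10× level sizes are illustrative, not taken from any particular engine):

```python
def leveled_write_amp(level_sizes, compaction_overhead=1.0):
    """Estimate write amplification as the sum of adjacent-level size
    ratios, scaled by a per-compaction overhead factor (bloom filter
    and index rebuild), following the formula above."""
    ratios = [level_sizes[i] / level_sizes[i - 1]
              for i in range(1, len(level_sizes))]
    return sum(ratios) * compaction_overhead

# Classic LevelDB shape: each level 10x the previous (L1..L4, in MB).
sizes_mb = [10, 100, 1000, 10000]
print(leveled_write_amp(sizes_mb))        # 30.0 (three 10x transitions)
print(leveled_write_amp(sizes_mb, 1.1))   # ~33.0 with 10% rebuild overhead
```

With a 10:1 size ratio, each level transition contributes roughly its ratio to the total, which is why deep leveled trees land in the 10-30× range quoted above.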
8.1.2 Key-Value Separation Optimization
Core Idea: Compaction only cares about Key order, Value can be stored separately to avoid repeatedly rewriting large Values.
Key-Value Separation Architecture Comparison:
Traditional LSM: Key-Value Separated LSM:
┌──────────┐ ┌──────────┐ ┌──────────┐
│ LSM-Tree │ │ LSM-Tree │ │ vLog │
│ Key+Value│ │ Key+Addr │ │ (Value) │
└──────────┘ └────┬─────┘ └────┬─────┘
▲ │ │
│ └───────────────┘
Write Amp 12-50× Write Amp ~1× Sequential Append
Advantages:
1. Compaction only processes Key (usually 16-64 bytes)
2. Value directly appended to vLog, no movement
3. GC reclaims invalid Values in background, doesn't affect foreground writes
Representative Implementations:
| System | Proposal Time | Key Characteristics | Write Amp Reduction |
|---|---|---|---|
| WiscKey | 2016 (FAST) | Academic pioneer, vLog + online GC | 12× → 1.2× (90%) |
| Titan | 2018 | TiKV engine, offline GC | 15× → 1.5× (90%) |
| BlobDB | 2020 | RocksDB integrated, production-ready | 15× → 2× (87%) |
WiscKey Detailed Design:
// vLog entry layout (key/value bytes are variable-length, shown as comments)
struct LogEntry {
  uint32_t key_length;
  // char key_data[key_length];     (variable-length)
  uint32_t value_length;
  // char value_data[value_length]; (variable-length)
  uint32_t checksum;
};
// Address stored in LSM-Tree
struct BlobAddress {
uint64_t file_number; // vLog file number
uint64_t offset; // Offset in vLog
uint32_t size; // Value size
};
// GC process (simplified sketch)
void GarbageCollection() {
  while (gc_enabled) {
    // 1. Sequentially read a chunk of entries from the Tail
    auto entries = ReadChunk(tail_file);
    // 2. An entry is live iff the LSM-Tree still points to its vLog address
    std::vector<LogEntry> valid_entries;
    for (const auto& entry : entries) {
      if (LSM.Lookup(entry.key) == entry.address) {
        valid_entries.push_back(entry);  // Still referenced
      }
    }
    // 3. Re-append live values at the Head and update their addresses
    for (const auto& entry : valid_entries) {
      auto new_addr = AppendToHead(entry.key, entry.value);
      LSM.Update(entry.key, new_addr);
    }
    // 4. Advance the Tail, releasing the reclaimed file
    tail_file = NextFile();
  }
}
Bloom Filter Effect Quantification:
| bits_per_key | False Positive Rate | Memory Overhead (per million keys) | Disk Reads Avoided (absent keys) |
|---|---|---|---|
| 5 | ~9% | 0.625 MB | ~91% |
| 8 | ~2.2% | 1.0 MB | ~98% |
| 10 | ~0.8% | 1.25 MB | ~99% |
| 15 | ~0.07% | 1.875 MB | ~99.9% |
False Positive Rate Formula:
P(false positive) ≈ (1 - e^(-k*n/m))^k
Where:
- n = number of keys
- m = total bits
- k = number of hash functions ≈ 0.693 * (m/n)
- bits_per_key = m/n
LevelDB/RocksDB default bits_per_key = 10
→ k ≈ 7 hash functions
→ False positive rate ≈ 0.8%
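The formula above can be evaluated directly (a minimal sketch; `bloom_fp_rate` is an illustrative helper, with k defaulting to the optimal 0.693 × bits_per_key, rounded):

```python
import math

def bloom_fp_rate(bits_per_key, num_hashes=None):
    """False positive rate (1 - e^(-k*n/m))^k from the formula above,
    with bits_per_key = m/n and k defaulting to ~0.693 * (m/n)."""
    k = num_hashes or max(1, round(0.693 * bits_per_key))
    return (1 - math.exp(-k / bits_per_key)) ** k

print(round(bloom_fp_rate(10) * 100, 2))  # ~0.82 (%), i.e. the ~0.8% above
```

Plugging in the LevelDB/RocksDB default of 10 bits per key gives k = 7 hash functions and a false positive rate of roughly 0.8%, matching the derivation above.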
8.1.3 Data Structure Optimization
PebblesDB FLSM (Fragmented LSM):
FLSM vs Traditional LSM Comparison:
Traditional LSM (Disjoint SSTables):
Level 1: [1-100][200-300] ← key ranges non-overlapping
↑
Must rewrite entire range during Compaction
FLSM (Overlap Allowed):
Level 1: Guard:50 ──── Guard:200
│ │
[1-60] [150-250]
[20-80] [180-300]
[40-100] [200-350]
↑ Overlap allowed
Only partition during Compaction, no rewrite!
Guard Mechanism:
- Guard: Randomly selected partition point from inserted keys
- Guards are disjoint, sstables within Guard can overlap
- Higher levels contain all Guards from lower levels (similar to Skip List)
FLSM Compaction Process:
Traditional LSM Compaction: FLSM Compaction:
1. Read sstable 1. Select Guard (sstables exceed threshold)
2. Merge and sort 2. Partition by lower level Guards
3. Write new sstables 3. Direct append (no sorting!)
4. Delete old files 4. Delete old files
→ Data rewrite! → Only partition, no rewrite!
Write Amp Comparison:
- Traditional LSM: 18.7×
- PebblesDB (FLSM): 3.9× (79% reduction)
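The "partition only, no rewrite" step can be sketched in a few lines (illustrative Python; `partition_by_guards` is my name, and the semantics "keys ≥ guard belong to that guard" follow the Guard mechanism described above):

```python
import bisect

def partition_by_guards(sorted_keys, guards):
    """FLSM-style compaction sketch: split one sorted run at the next
    level's guard keys. Each fragment is appended to its guard's bucket
    as-is -- no cross-file merge sort is performed."""
    buckets = {g: [] for g in [None] + guards}  # None = keys below first guard
    for key in sorted_keys:
        i = bisect.bisect_right(guards, key)
        bucket = None if i == 0 else guards[i - 1]
        buckets[bucket].append(key)
    return buckets

b = partition_by_guards([10, 45, 60, 180, 250, 400], guards=[50, 200])
print(b)  # {None: [10, 45], 50: [60, 180], 200: [250, 400]}
```

Because each fragment is already sorted, appending it under the target guard needs no rewrite of existing sstables, which is exactly where FLSM saves write amplification.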
Other Data Structure Optimizations:
| Technology | Principle | Write Amp Reduction | Representative System |
|---|---|---|---|
| LSM-trie | Trie structure reduces overlap | 15× → 6× (60%) | Research prototype |
| Dostoevsky | Segmented LSM, delayed merge | 20× → 8× (60%) | Research prototype |
| HashKV | Hash partitioning, local Compaction | 18× → 7× (61%) | Research prototype |
8.1.4 Compaction Strategy Optimization
Four Mainstream Compaction Strategies Comparison:
1. Leveled Compaction (LevelDB Default)
L0: [A-M] [N-Z] ← 4 files trigger
↓ Merge
L1: [A-F] [G-L] [M-R] [S-Z] ← Output fully sorted
Characteristics:
- Each level fully sorted, adjacent level size ratio 10:1
- Lowest read amplification (O(1))
- Highest write amplification (10-30×)
- Applicable: Read-heavy scenarios
2. Tiered Compaction (RocksDB Universal)
L0: [A-C] [D-F] [G-I] [J-L] ← 4 sorted runs
↓ Delayed merge
L1: [A-L] ← One-time merge
Characteristics:
- Each level allows multiple sorted runs
- Lowest write amplification (2-5×)
- Highest read amplification (O(T), T=number of runs)
- Applicable: Write-heavy scenarios
3. Leveled-N (Hybrid Strategy)
L0-L2: Tiered style ← Lower level delayed merge
L3-L6: Leveled style ← Higher level strict sorting
Characteristics:
- Balance read/write amplification
- Write amplification: 5-10×
- Read amplification: 2-5×
- Applicable: Mixed workloads
4. FIFO Compaction
Keep only recent N files, delete when exceeded
Characteristics:
- Extremely low write amplification (~1×)
- Extremely high read amplification
- Extremely high space amplification
- Applicable: Time-series data, cache
Compaction Strategy Selection Decision Tree:
Start
│
▼
Workload Type?
│
├── Read-dominated (Read > 80%)
│ └──▶ Leveled Compaction
│ - Low read amplification
│ - Acceptable high write amplification
│
├── Write-dominated (Write > 80%)
│ ├──▶ Tiered Compaction
│ │ - Low write amplification
│ │ - Acceptable high read amplification
│ │
│ └──▶ FLSM (PebblesDB)
│ - Extremely low write amplification
│ - Medium read amplification
│
├── Mixed workload
│ └──▶ Leveled-N or Tuned Leveled
│ - Dynamically adjust parameters
│ - Balance read/write
│
└── Time-series data/Cache
└──▶ FIFO Compaction
- Only keep latest data
- Minimal Compaction
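The decision tree above can be condensed into a small helper (a sketch; the 80% thresholds mirror the tree, and the function name is mine):

```python
def pick_compaction(read_pct, write_pct, time_series=False):
    """Mirror of the selection decision tree above; thresholds are
    illustrative and should be tuned per workload."""
    if time_series:
        return "FIFO"                 # keep latest only, minimal compaction
    if read_pct > 80:
        return "Leveled"              # lowest read amplification
    if write_pct > 80:
        return "Tiered or FLSM"       # lowest write amplification
    return "Leveled-N"                # mixed workload: balance both

print(pick_compaction(90, 10))   # Leveled
print(pick_compaction(10, 90))   # Tiered or FLSM
print(pick_compaction(50, 50))   # Leveled-N
```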
8.1.5 Write Amplification Optimization Technology Summary
Write Amplification Optimization Technology Map:
┌─────────────────────────────────────────────────────────────┐
│ Write Amplification Optimization │
├─────────────────────────────────────────────────────────────┤
│ │
│ ├── Key-Value Separation │
│ │ ├── WiscKey (vLog + Online GC) │
│ │ ├── Titan (TiKV, Offline GC) │
│ │ └── RocksDB BlobDB (Industrial Implementation) │
│ │ ├── enable_blob_files = true │
│ │ ├── min_blob_size = 4096 (Separate above 4KB) │
│ │ └── enable_blob_garbage_collection = true │
│ │ │
│ ├── Data Structure Optimization │
│ │ ├── PebblesDB (FLSM + Guards) │
│ │ │ ├── Partition Compaction (Partition only) │
│ │ │ └── Rewrite Compaction (Sort when necessary) │
│ │ ├── LSM-trie (Trie structure reduces overlap) │
│ │ └── Dostoevsky (Segmented LSM) │
│ │ │
│ └── Compaction Strategy │
│ ├── Tiered Compaction (Delayed merge) │
│ ├── Leveled-N (Hybrid strategy) │
│ ├── FIFO (Only keep latest) │
│ └── TRIAD (Delayed Compaction) │
│ │
└─────────────────────────────────────────────────────────────┘
Optimization Effect Comparison (Based on LevelDB 12×):
| Optimization Technology | Write Amp | Reduction | Cost |
|------------------------|-----------|-----------|------|
| No optimization (LevelDB) | 12× | - | - |
| RocksDB (Multi-threaded) | 15× | -25% | CPU overhead +20% |
| Tiered Compaction | 5× | 58% | Read Amp +3× |
| PebblesDB (FLSM) | 4× | 67% | Read Amp +2× |
| BlobDB (4KB+) | 2× | 83% | Random read latency +30% |
| WiscKey (4KB+) | 1.2× | 90% | Range query -20% |
8.2 Read Amplification Optimization
Read amplification is defined as the ratio of actual data read to requested data, directly affecting read latency.
8.2.1 Read Amplification Source Analysis
Three Major Sources of Read Amplification:
┌─────────────────────────────────────────────────────────────┐
│ │
│ 1. Multi-Level Lookup │
│ ┌────────────────────────────────────────────────────┐ │
│ │ Get(key) Lookup Path: │ │
│ │ │ │
│ │ 1. MemTable (O(log n)) │ │
│ │ ↓ Not Found │ │
│ │ 2. Immutable MemTable (At most 2) │ │
│ │ ↓ Not Found │ │
│ │ 3. Level 0 SSTables (At most 4, check one by one) │ │
│ │ ↓ Not Found │ │
│ │ 4. Level 1+ (Binary search to locate file each level)│ │
│ │ │ │
│ │ Worst case: 1 + 2 + 4 + 6 = 13 file lookups │ │
│ └────────────────────────────────────────────────────┘ │
│ │
│ 2. In-Block Scan │
│ - Data Block default 4KB, need to scan to find key │
│ - Optimized through Restart Points, still partial scan │
│ │
│ 3. Merge Operator │
│ - Merge type needs to collect multiple versions and merge│
│ - May span multiple SSTables │
│ │
└─────────────────────────────────────────────────────────────┘
Total Read Amplification Formula:
Read_Amp = Σ(Levels_Checked) × Files_Per_Level × (1 - Bloom_Hit_Rate)
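A direct reading of this formula (simplified sketch, treating levels × files-per-level as the candidate file count; `read_amp` is an illustrative helper):

```python
def read_amp(levels_checked, files_per_level, bloom_hit_rate):
    """Read amplification per the formula above: candidate file probes,
    discounted by the fraction of useless probes the bloom filter
    eliminates."""
    return levels_checked * files_per_level * (1 - bloom_hit_rate)

print(read_amp(7, 2, 0.0))   # 14.0 -- no filter, near the 13-lookup worst case
print(read_amp(7, 2, 0.9))   # ~1.4 -- 90% of probes filtered out
```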
8.2.2 Filtering Optimization
Bloom Filter Multi-Level Deployment:
Bloom Filter Three-Level Deployment:
┌─────────────────────────────────────────────────────────────┐
│ │
│ Level 1: Whole Table Filter (Whole Key Filter) │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ SSTable Footer │ │
│ │ └─→ Filter Block Handle │ │
│ │ └─→ [Filter for entire table] │ │
│ │ │ │
│ │ Advantage: One Filter covers entire table, low memory │ │
│ │ Disadvantage: Relatively high false positive rate │ │
│ └───────────────────────────────────────────────────────┘ │
│ │
│ Level 2: Partitioned Filter (Partitioned Filter) │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ MetaIndex Block │ │
│ │ ├─→ Filter Partition 0 [keys 0-999] │ │
│ │ ├─→ Filter Partition 1 [keys 1000-1999] │ │
│ │ └─→ Filter Partition N [...] │ │
│ │ │ │
│ │ Advantage: Load partitions on demand, reduce memory │ │
│ │ Disadvantage: Increased implementation complexity │ │
│ └───────────────────────────────────────────────────────┘ │
│ │
│ Level 3: Block Filter (Block Filter) │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ One Filter per Data Block │ │
│ │ Data Block 0 → Filter 0 │ │
│ │ Data Block 1 → Filter 1 │ │
│ │ │ │
│ │ Advantage: Lowest false positive rate, precise │ │
│ │ Disadvantage: Highest memory usage │ │
│ └───────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
Configuration Example (RocksDB):
options.optimize_filters_for_hits = true; // Skip filters on the last level for hit-heavy workloads
BlockBasedTableOptions table_options;
table_options.index_type = BlockBasedTableOptions::kTwoLevelIndexSearch; // Required for partitioned filters
table_options.partition_filters = true; // Enable partitioned filter
table_options.cache_index_and_filter_blocks = true; // Cache filter/index blocks
options.table_factory.reset(NewBlockBasedTableFactory(table_options));
Prefix Bloom Filter:
Optimization for Prefix Query:
Scenario: Large number of queries share same prefix
Example: user:1001, user:1002, user:1003...
Traditional Bloom Filter:
- Stores the complete key: "user:1001", "user:1002"...
- Query: Must provide the complete key
Prefix Bloom Filter:
- Stores only the extracted prefix (e.g., the 5-byte prefix "user:" via prefix_extractor)
- Query: Iterators/Seeks can quickly exclude SSTables whose data never contains the prefix
Implementation:
1. Configure prefix_extractor
2. Build Bloom Filter for each prefix in MemTable
3. Carry prefix Filter when flushing to SSTable
Effect:
- Prefix query speed improvement 5-10×
- Memory usage reduced 30-50%
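A toy illustration of the idea (not RocksDB's actual filter code; the fixed 5-byte prefix and the `PrefixBloom` class are assumptions for this sketch):

```python
import hashlib

class PrefixBloom:
    """Toy prefix bloom filter: hash the fixed-length key prefix into a
    bit array. Mirrors the prefix_extractor idea, illustratively."""
    def __init__(self, num_bits=1024, num_hashes=4, prefix_len=5):
        self.bits = bytearray(num_bits // 8)
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.prefix_len = prefix_len

    def _positions(self, key):
        prefix = key[: self.prefix_len].encode()
        for i in range(self.num_hashes):
            h = int.from_bytes(
                hashlib.sha256(prefix + bytes([i])).digest()[:8], "big")
            yield h % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def may_match_prefix(self, key):
        # False means: no key with this prefix was ever inserted.
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(key))

bf = PrefixBloom()
for k in ("user:1001", "user:1002", "user:1003"):
    bf.add(k)
print(bf.may_match_prefix("user:9999"))  # True  -- shares the "user:" prefix
print(bf.may_match_prefix("item:0001"))  # False -- prefix never inserted
```

Note the memory saving: many keys sharing one prefix contribute only one entry to the filter, which is where the 30-50% reduction above comes from.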
8.2.3 Cache Optimization
Multi-Level Cache Architecture:
RocksDB Four-Level Cache System:
┌─────────────────────────────────────────────────────────────┐
│ │
│ L1: Row Cache (Optional) │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ Cache complete Key-Value pairs │ │
│ │ │ │
│ │ Hit: Direct return, no disk I/O │ │
│ │ Miss: Continue lookup below │ │
│ └───────────────────────────────────────────────────────┘ │
│ ↓ Miss │
│ L2: Block Cache (Core) │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ Cache Data Block, Index Block, Filter Block │ │
│ │ │ │
│ │ Implementation: LRU / HyperClockCache (RocksDB v10.7+)│ │
│ │ Size: Usually set to 30-50% of available memory │ │
│ │ Hit rate target: > 90% │ │
│ └───────────────────────────────────────────────────────┘ │
│ ↓ Miss │
│ L3: Table Cache │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ Cache SSTable metadata (Footer, Index Block pointers) │ │
│ │ │ │
│ │ Function: Avoid repeated file open, reduce stat() │ │
│ │ Size: Usually 1000-10000 table handles │ │
│ └───────────────────────────────────────────────────────┘ │
│ ↓ Miss │
│ L4: OS Page Cache │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ Operating system level file cache │ │
│ │ │ │
│ │ RocksDB utilizes via mmap or direct I/O │ │
│ │ Size: Remaining available memory │ │
│ └───────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
Cache Hit Rate Impact on Read Amplification:
| Cache Configuration | Hit Rate | Effective Read Amp |
|-------------|----------|-------------------|
| Row Cache | 90% | 1.3× |
| Block Cache | 90% | 2.5× |
| OS Page Cache only | 50% | 6.0× |
| No cache | 0% | 12× |
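The table's intuition follows from a simple expected-cost model (a sketch; real numbers also depend on which block types are cached):

```python
def effective_read_amp(base_amp, hit_rate, hit_cost=1.0):
    """Expected read amplification with a cache in front: hits cost
    ~1 logical read, misses pay the full uncached amplification."""
    return hit_rate * hit_cost + (1 - hit_rate) * base_amp

print(effective_read_amp(12, 0.9))   # ~2.1 with a 90% hit rate
print(effective_read_amp(12, 0.0))   # 12.0 with no cache at all
```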
HyperClockCache (RocksDB v10.7+):
// Replaces the traditional LRU Cache: a Clock-algorithm variant with a lock-free design
Traditional LRU Cache Problems:
- Severe global lock contention
- Performance drops significantly under high concurrency
HyperClockCache Advantages:
┌─────────────────────────────────────────────────────────────┐
│ │
│ 1. Lock-Free Concurrency │
│ - Use atomic operations and CAS │
│ - Support hundreds of threads concurrent access │
│ │
│ 2. Adaptive Eviction │
│ - Based on access frequency and time │
│ - Automatically adjust eviction strategy │
│ │
│ 3. NUMA Awareness │
│ - Prioritize access to local NUMA node │
│ - Reduce cross-node memory access │
│ │
│ Performance Improvement: │
│ - 32 threads concurrent: 2.5× faster than LRU │
│ - 64 threads concurrent: 4× faster than LRU │
│ │
└─────────────────────────────────────────────────────────────┘
8.2.4 Index Optimization
Index Block Two Search Methods:
1. Binary Search (Default)
┌───────────────────────────────────────────────────────┐
│ Index Block Structure: │
│ │
│ [Key: "apple"] → BlockHandle(offset: 0, size: 4KB) │
│ [Key: "banana"] → BlockHandle(offset: 4KB, size: 4KB) │
│ [Key: "cherry"] → BlockHandle(offset: 8KB, size: 4KB) │
│ ... │
│ │
│ Search Process: │
│ 1. Binary search in Index Block │
│ 2. Find first Entry >= target_key │
│ 3. Read corresponding Data Block │
│ │
│ Time Complexity: O(log N), N=Data Block count │
│ Applicable: Range queries, prefix queries │
└───────────────────────────────────────────────────────┘
2. Hash Search (Configuration enabled)
┌───────────────────────────────────────────────────────┐
│ Hash Index Structure: │
│ │
│ Bucket 0 → [hash("apple"), BlockHandle] │
│ Bucket 1 → [hash("banana"), BlockHandle] │
│ Bucket 2 → [hash("cherry"), BlockHandle] │
│ │
│ Search Process: │
│ 1. Calculate hash value of key │
│ 2. Directly locate corresponding Bucket │
│ 3. Read corresponding Data Block │
│ │
│ Time Complexity: O(1) │
│ Applicable: Pure point query workloads │
│ Not applicable: Range queries │
└───────────────────────────────────────────────────────┘
Configuration:
table_options.index_type = BlockBasedTableOptions::kHashSearch; // Enable hash index (requires a prefix_extractor)
Interpolation Search (Interpolation Search, RocksDB v11.0+):
Interpolation Search Principle:
Prerequisite: Keys uniformly distributed (e.g., timestamps, auto-increment IDs)
Traditional Binary Search:
mid = (low + high) / 2
Interpolation Search:
mid = low + ((target - keys[low]) / (keys[high] - keys[low])) * (high - low)
Predict position based on target value, not fixed midpoint
Example:
Keys: [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]
Target: 900
Binary Search:
1st: mid = 4 (500) < 900, go right
2nd: mid = 7 (800) < 900, go right
3rd: mid = 8 (900), found
Interpolation Search:
Predicted position = 0 + ((900-100)/(1000-100)) * 9 = 8
1st: Directly probe index 8 (900), found
Only 1 comparison needed!
Performance Improvement:
- Uniformly distributed keys: Reduce comparison count by 30-50%
- Non-uniform distribution: Automatically falls back to binary search
- Configuration: index_block_search_type = kInterpolationSearch
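The probe-prediction step can be implemented in a few lines (a sketch with the standard degenerate-range guard; `interpolation_search` is illustrative, not RocksDB's code):

```python
def interpolation_search(keys, target):
    """Interpolation search: the probe position is predicted from the
    key distribution instead of the midpoint. Returns (index, probes);
    index is -1 if the target is absent."""
    low, high, probes = 0, len(keys) - 1, 0
    while low <= high and keys[low] <= target <= keys[high]:
        probes += 1
        if keys[high] == keys[low]:
            pos = low  # degenerate range: all remaining keys equal
        else:
            pos = low + (target - keys[low]) * (high - low) // (keys[high] - keys[low])
        if keys[pos] == target:
            return pos, probes
        if keys[pos] < target:
            low = pos + 1
        else:
            high = pos - 1
    return -1, probes

keys = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]
print(interpolation_search(keys, 900))  # (8, 1): one probe lands exactly
```

On uniformly distributed keys the first probe usually lands on or next to the target, which is the source of the 30-50% comparison reduction quoted above.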
8.2.5 Parallelization Optimization
WiscKey Parallel Range Query:
Problem: After key-value separation, range queries require multiple random reads to vLog
Solution: Utilize SSD parallel random read capability
┌─────────────────────────────────────────────────────────────┐
│ │
│ Serial Method (Traditional): │
│ GetRange(start, end): │
│ 1. Find keys in LSM: [k1, k2, k3, ..., kn] │
│ 2. for each key: │
│ addr = LSM.lookup(key) │
│ value = vLog.read(addr) ← Sequential, n I/Os │
│ │
│ Total latency: n × t_random (t_random ≈ 100μs for SSD) │
│ │
│ Parallel Method (WiscKey): │
│ 1. Find keys in LSM: [k1, k2, k3, ..., kn] │
│ 2. Create 32 background threads │
│ 3. Batch submit vLog read requests │
│ 4. Wait for all to complete, return in order │
│ │
│ Total latency: (n/32) × t_random ≈ 3% × n × t_random │
│ │
│ Performance Improvement: │
│ - 4KB value: 14× │
│ - 16KB value: 16× │
│ - 64KB value: 14× │
│ │
└─────────────────────────────────────────────────────────────┘
RocksDB MultiGet:
// Batch point query optimization
// Traditional method (n independent Gets)
// Traditional method (n independent Gets)
for (int i = 0; i < n; i++) {
  db->Get(ReadOptions(), keys[i], &values[i]); // n calls, n lookups
}
// MultiGet method (1 batched call)
std::vector<Slice> keys = {key1, key2, ..., keyn};
std::vector<std::string> values;
std::vector<Status> statuses = db->MultiGet(ReadOptions(), keys, &values);
// Optimization points:
// 1. One system call, reduce context switching
// 2. Shared MemTable/SSTable iterators
// 3. Batch read consecutive Data Blocks
// 4. Merged Bloom Filter checks
Performance Improvement:
- Compared to n independent Gets: 3-5× faster
- Suitable: Batch point query scenarios (e.g., query by ID list)
8.2.6 Read Amplification Optimization Technology Summary
Read Amplification Optimization Technology Map:
┌─────────────────────────────────────────────────────────────┐
│ Read Amplification Optimization │
├─────────────────────────────────────────────────────────────┤
│ │
│ ├── Filtering │
│ │ ├── Bloom Filter │
│ │ │ ├── Whole Table Level (Entire SST) │
│ │ │ ├── Partition Level (Each partition) │
│ │ │ └── Block Level (Each data block) │
│ │ ├── Prefix Bloom Filter │
│ │ │ └── Optimize for prefix queries │
│ │ └── Index Pruning │
│ │ ├── Range Index │
│ │ └── Time Index │
│ │ │
│ ├── Caching │
│ │ ├── Block Cache (LRU/HyperClockCache) │
│ │ ├── Table Cache (Metadata cache) │
│ │ ├── Index Block Cache │
│ │ └── Filter Block Cache │
│ │ │
│ ├── Parallelization │
│ │ ├── Parallel Seek (PebblesDB) │
│ │ │ └── Multi-thread search of sstables within Guard │
│ │ ├── Multi-thread Range Query (WiscKey) │
│ │ │ └── 32 threads parallel read vLog │
│ │ └── MultiGet (Batch fetch) │
│ │ │
│ └── Index Optimization │
│ ├── Interpolation Search (RocksDB v11.0) │
│ │ └── Better than binary search for uniform keys │
│ ├── Hash Index │
│ │ └── O(1) point query, no range query support │
│ └── Adaptive Index │
│ └── Auto-select based on workload │
│ │
└─────────────────────────────────────────────────────────────┘
Optimization Effect Comparison (Based on LevelDB Read Amp 10×):
| Optimization Technology | Read Amp | Reduction | Cost |
|------------------------|----------|-----------|------|
| No optimization (LevelDB) | 10× | - | - |
| + Bloom Filter (10bpk) | 3× | 70% | Memory 1.25MB/million keys |
| + Block Cache (90% hit) | 2× | 80% | Memory usage |
| + HyperClockCache | 1.8× | 82% | CPU +5% |
| + Parallel range query | 1.5× | 85% | Thread overhead |
| WiscKey (Large value) | 1.2× | 88% | Range query -20% |
8.3 Space Amplification Optimization
Space amplification is defined as the ratio of actual disk usage to effective data volume, directly affecting storage costs.
8.3.1 Space Amplification Source Analysis
Three Major Sources of Space Amplification:
┌─────────────────────────────────────────────────────────────┐
│ │
│ 1. Invalid Data Accumulation │
│ ┌────────────────────────────────────────────────────┐ │
│ │ Scenario: Frequent Update/Delete │ │
│ │ │ │
│ │ T0: Put(k1, v1) → SSTable_A │ │
│ │ T1: Put(k1, v2) → SSTable_B (v1 not cleaned) │ │
│ │ T2: Put(k1, v3) → SSTable_C (v1,v2 not cleaned) │ │
│ │ T3: Delete(k1) → SSTable_D (v1,v2,v3 not cleaned) │ │
│ │ │ │
│ │ Valid data: 0 (deleted) │ │
│ │ Actual usage: 4 versions × size │ │
│ │ Space Amp: ∞ (Infinite) │ │
│ └────────────────────────────────────────────────────┘ │
│ │
│ 2. Low Compression Efficiency │
│ - No compression: 1.0× (No savings) │
│ - Snappy: 1.3-1.5× (Save 30-40%) │
│ - Zstd: 1.1-1.3× (Save 50-60%) │
│ │
│ 3. Metadata Overhead │
│ - InternalKey: +8 bytes/key (seq+type) │
│ - Bloom Filter: 1.25MB/million keys │
│ - Index Block: ~5-10% additional space │
│ │
└─────────────────────────────────────────────────────────────┘
Total Space Amplification Formula:
Space_Amp = (Raw_Data + Metadata + Invalid_Data) / Effective_Data
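Plugging numbers into the formula (illustrative values; `space_amp` is my helper, treating the raw live data as the effective data):

```python
def space_amp(effective_data, metadata, invalid_data):
    """Space amplification per the formula above: total disk footprint
    divided by the live data the user actually stores."""
    return (effective_data + metadata + invalid_data) / effective_data

# 1 TB live data, 50 GB metadata, 150 GB not-yet-compacted garbage (GB):
print(round(space_amp(1000, 50, 150), 2))  # 1.2
```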
8.3.2 Compression Optimization
Compression Algorithm Comparison:
┌─────────────────────────────────────────────────────────────┐
│ Compression Algorithm Performance Comparison│
├─────────────────────────────────────────────────────────────┤
│ │
│ Algorithm Ratio Compress Speed Decompress Speed CPU │
│ ───────────────────────────────────────────────────────── │
│ None 1.0× ∞ ∞ 0% │
│ Snappy 1.3-1.5× 200MB/s 400MB/s Low │
│ LZ4 1.3-1.6× 400MB/s 800MB/s Low │
│ Zstd 1.5-2.5× 100MB/s 300MB/s Medium │
│ Bzip2 2.0-3.0× 50MB/s 150MB/s High │
│ │
│ Compression Ratio Test (TPC-H Dataset): │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ Original: 100 GB │ │
│ │ Snappy: 68 GB (Save 32%) │ │
│ │ LZ4: 65 GB (Save 35%) │ │
│ │ Zstd: 45 GB (Save 55%) │ │
│ │ Bzip2: 38 GB (Save 62%) │ │
│ └──────────────────────────────────────────────────────┘ │
│ │
│ Recommended Configuration: │
│ options.compression = kSnappyCompression; // Default │
│ options.bottommost_compression = kZSTD; // Bottom level │
│ │
└─────────────────────────────────────────────────────────────┘
Tiered Compression Strategy:
Different compression algorithms for different levels:
┌─────────────────────────────────────────────────────────────┐
│ │
│ Level 0-1 (Hot Data): │
│ - Frequently accessed, frequent Compaction │
│ - Use Snappy (Fast, low CPU overhead) │
│ - Compression ratio: 1.3× │
│ │
│ Level 2-4 (Warm Data): │
│ - Medium access frequency │
│ - Use Zstd (Balance compression ratio and speed) │
│ - Compression ratio: 1.8× │
│ │
│ Level 5-6 (Cold Data): │
│ - Rarely accessed, long-term storage │
│ - Use Zstd high compression level or Bzip2 │
│ - Compression ratio: 2.5× │
│ │
│ Configuration: │
│ options.compression_per_level = { │
│ kSnappyCompression, // L0 │
│ kSnappyCompression, // L1 │
│ kZSTD, // L2 │
│ kZSTD, // L3 │
│ kZSTD, // L4 │
│ kBZip2Compression, // L5 │
│ kBZip2Compression, // L6 │
│ }; │
│ │
│ Effect: Overall space saving 40-50%, CPU overhead +10-15% │
│ │
└─────────────────────────────────────────────────────────────┘
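The hot/cold trade-off can be demonstrated with stdlib zlib as a stand-in for Snappy/Zstd (levels 1 vs 9 playing the roles of fast vs high-ratio codecs; the ratios here are illustrative, not the TPC-H numbers above):

```python
import zlib

def compression_ratio(data: bytes, level: int) -> float:
    """Uncompressed size / compressed size at a given zlib level.
    Low levels trade ratio for speed (hot levels); high levels trade
    speed for ratio (cold, bottommost levels)."""
    return len(data) / len(zlib.compress(data, level))

sample = b"user:1001,login,2026-03-06\n" * 10_000  # repetitive log-like data
fast = compression_ratio(sample, 1)   # cheap, for L0-L1 style hot data
best = compression_ratio(sample, 9)   # expensive, for bottommost levels
print(fast > 1.0, best >= fast)       # both save space; level 9 at least as much
```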
8.3.3 Garbage Collection Optimization
Timely Compaction:
Accelerate invalid data cleanup by adjusting Compaction parameters:
┌─────────────────────────────────────────────────────────────┐
│ │
│ Parameter Tuning: │
│ │
│ 1. level0_file_num_compaction_trigger (Default 4) │
│ ↓ Reduce to 2 │
│ Effect: More frequent L0→L1 Compaction triggers, │
│ faster invalid data cleanup │
│ Cost: Write amplification increases 20-30% │
│ │
│ 2. max_bytes_for_level_base (Default 256MB) │
│ ↓ Reduce to 128MB │
│ Effect: Reduced capacity per level, more aggressive │
│ Compaction │
│ Cost: Increased Compaction frequency │
│ │
│ 3. periodic_compaction_seconds (RocksDB 6.29+) │
│ Set to 7 days (604800 seconds) │
│ Effect: Force Compaction on all data every 7 days │
│ Applicable: Periodic cleanup of expired data │
│ │
│ Monitoring Metrics: │
│ - rocksdb.estimate-num-deletes: Estimated delete markers │
│ - rocksdb.estimate-live-data-size: Estimated valid data │
│ │
│ When delete_ratio > 20%, recommend manual Compaction: │
│ db->CompactRange(compact_options, nullptr, nullptr); │
│ │
└─────────────────────────────────────────────────────────────┘
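The 20% rule above can be wrapped in a small helper (a sketch; the inputs mirror the two RocksDB properties listed, but the function name and threshold handling are mine):

```python
def should_manual_compact(estimated_deletes, estimated_live_keys, threshold=0.20):
    """Trigger a manual CompactRange when delete markers exceed the
    threshold fraction of the total keyspace (the 20% rule above)."""
    total = estimated_deletes + estimated_live_keys
    return total > 0 and estimated_deletes / total > threshold

print(should_manual_compact(300_000, 1_000_000))  # True  (~23% deletes)
print(should_manual_compact(100_000, 1_000_000))  # False (~9% deletes)
```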
Key-Value Separation GC:
Comparison of GC Mechanisms for WiscKey/Titan/BlobDB:
┌─────────────────────────────────────────────────────────────┐
│ │
│ WiscKey (Online GC): │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ Trigger: Background periodic operation │ │
│ │ │ │
│ │ Process: │ │
│ │ 1. Sequentially read chunk from Tail (256KB) │ │
│ │ 2. For each entry, check if LSM still points to addr │ │
│ │ 3. Append valid entries to Head │ │
│ │ 4. Update Tail, release old files │ │
│ │ │ │
│ │ Advantage: GC doesn't affect foreground I/O │ │
│ │ Disadvantage: Complex implementation, vLog state │ │
│ │ machine maintenance │ │
│ │ │ │
│ │ Performance impact: Throughput drop < 10% even with │ │
│ │ 100% invalid data │ │
│ └──────────────────────────────────────────────────────┘ │
│ │
│ Titan (Offline GC): │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ Trigger: Synchronized with Compaction │ │
│ │ │ │
│ │ Process: │ │
│ │ 1. Compaction reads blob addresses from SSTable │ │
│ │ 2. Read corresponding values from vLog │ │
│ │ 3. Only write valid values to new vLog │ │
│ │ 4. Update addresses in SSTable │ │
│ │ │ │
│ │ Advantage: Simple implementation, natural integration│ │
│ │ with Compaction │ │
│ │ Disadvantage: Increased I/O pressure during │ │
│ │ Compaction │ │
│ │ │ │
│ │ Performance impact: Compaction time extended 30-50% │ │
│ └──────────────────────────────────────────────────────┘ │
│ │
│ BlobDB (RocksDB Integrated): │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ Trigger: Configurable, supports online + offline │ │
│ │ mixed │ │
│ │ │ │
│ │ Special Features: │ │
│ │ - use_kv_ratio_compaction: Optimize based on key/ │ │
│ │ value ratio │ │
│ │ - max_data_files_size: Prune based on SST+blob │ │
│ │ total size │ │
│ │ - intra-L0 compaction: Merge L0 internal blob files │ │
│ │ │ │
│ │ Configuration Example: │ │
│ │ options.enable_blob_files = true; │ │
│ │ options.min_blob_size = 4096; // Separate above 4KB │ │
│ │ options.blob_file_size = 256 << 20; // 256MB │ │
│ │ options.enable_blob_garbage_collection = true; │ │
│ │ options.blob_garbage_collection_age_cutoff = 0.25; │ │
│ │ │ │
│ │ Effect: Space amplification reduced from 1.5× to │ │
│ │ 1.05× (90% of excess space reclaimed) │ │
│ └──────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
8.3.4 Deduplication Optimization
HashKV Hash Partition Deduplication:
HashKV Core Idea: Keys with same hash go to same partition, localize deduplication
┌─────────────────────────────────────────────────────────────┐
│ │
│ Traditional LSM: │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ All keys mixed together │ │
│ │ Compaction requires global scan │ │
│ │ Difficult to identify duplicate data │ │
│ └──────────────────────────────────────────────────────┘ │
│ │
│ HashKV: │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ Partition 0: hash(key) % 100 == 0 │ │
│ │ [key: "user:100", val: "v1"] │ │
│ │ [key: "user:100", val: "v2"] ← Duplicate found! │ │
│ │ [key: "user:200", val: "v1"] │ │
│ │ │ │
│ │ Partition 1: hash(key) % 100 == 1 │ │
│ │ [key: "user:101", val: "v1"] │ │
│ │ ... │ │
│ │ │ │
│ │ Deduplication process: │ │
│ │ 1. Detect duplicates within each partition during │ │
│ │ Compaction │ │
│ │ 2. Only keep latest version, delete old versions │ │
│ │ 3. Generate deduplicated SSTable │ │
│ │ │ │
│ │ Effect: Space amplification reduced from 1.8× to │ │
│ │ 1.2× (75% of excess space reclaimed) │ │
│ └──────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
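The partition-local dedup step can be sketched as follows (illustrative; assumes entries within a partition arrive in write order, so the last version of a key wins):

```python
def dedup_partition(entries):
    """HashKV-style local dedup sketch: within one hash partition, keep
    only the newest version of each key. Because a key always hashes to
    the same partition, no global scan is needed."""
    latest = {}
    for key, value in entries:   # entries in write order; later overwrites
        latest[key] = value
    return latest

part0 = [("user:100", "v1"), ("user:100", "v2"), ("user:200", "v1")]
print(dedup_partition(part0))  # {'user:100': 'v2', 'user:200': 'v1'}
```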
8.3.5 Space Amplification Optimization Technology Summary
Space Amplification Optimization Technology Map:
┌─────────────────────────────────────────────────────────────┐
│ Space Amplification Optimization │
├─────────────────────────────────────────────────────────────┤
│ │
│ ├── Compression │
│ │ ├── Snappy (Default, ~200MB/s) │
│ │ ├── Zstd (High compression, ~100MB/s) │
│ │ ├── LZ4 (Fast, ~400MB/s) │
│ │ └── Bzip2 (Highest compression, ~50MB/s) │
│ │ │
│ ├── Garbage Collection │
│ │ ├── Timely Compaction │
│ │ │ ├── Reduce trigger threshold │
│ │ │ └── Periodic Compaction │
│ │ ├── vLog GC (Key-value separation) │
│ │ │ ├── WiscKey (Online GC) │
│ │ │ ├── Titan (Offline GC) │
│ │ │ └── BlobDB (Hybrid GC) │
│ │ └── Background cleanup threads │
│ │ │
│ └── Deduplication │
│ ├── HashKV (Hash partition deduplication) │
│ ├── Global deduplication │
│ └── Local deduplication │
│ │
└─────────────────────────────────────────────────────────────┘
Optimization Effect Comparison (Based on LevelDB space amplification 1.2×):
| Optimization Technology | Space Amp | Improvement | Cost |
|------------------------|-----------|-------------|------|
| No optimization (LevelDB) | 1.2× | - | - |
| + Zstd compression | 1.08× | 10% | CPU +10% |
| + Timely Compaction | 1.05× | 12.5% | Write amplification +20% |
| + BlobDB GC | 1.02× | 15% | Random read +30% |
| + HashKV deduplication | 1.01× | 16% | Memory +5% |
Comprehensive optimization case:
A log storage system, original data 1TB:
- Before optimization: 1.5TB disk usage (space amplification 1.5×)
- Enable Zstd: 1.2TB (1.2×)
- Enable BlobDB: 1.05TB (1.05×)
- Enable periodic Compaction: 1.02TB (1.02×)
- Final savings: 480GB disk space, annual cost savings $5000+
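The case above reduces to simple arithmetic: each optimization replaces the overall space-amplification factor, and savings are measured against the unoptimized footprint. All figures come from the example itself.

```python
raw_tb = 1.0  # raw data size in TB

# (step, resulting space-amplification factor), from the case above
steps = [
    ("before optimization", 1.5),
    ("+ Zstd", 1.2),
    ("+ BlobDB", 1.05),
    ("+ periodic Compaction", 1.02),
]

for name, amp in steps:
    print(f"{name:24s} {raw_tb * amp:.2f} TB (amplification {amp}x)")

# Savings relative to the unoptimized 1.5x footprint, in GB
saved_gb = (steps[0][1] - steps[-1][1]) * raw_tb * 1000
print(f"total savings: {saved_gb:.0f} GB")  # 480 GB
```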
8.4 Synergy of Three Optimization Techniques
In practical systems, write amplification, read amplification, and space amplification constrain each other and require comprehensive trade-offs.
Trade-off Triangle:
Write Amplification
/ \
/ \
/ \
/ \
/ \
/ \
Read Amp ←────→ Space Amp
Typical Trade-off Scenarios:
1. Reduce Write Amp ↔ Increase Read Amp
- Tiered Compaction: Write Amp↓, Read Amp↑
- Key-value separation: Write Amp↓↓, Random Read↑
2. Reduce Write Amp ↔ Increase Space Amp
- Delayed Compaction: Write Amp↓, Space Amp↑
- FIFO strategy: Write Amp↓↓, Space Amp↑↑
3. Reduce Read Amp ↔ Increase Memory/Space Overhead
- Bloom Filter: Read Amp↓, Memory↑
- More cache: Read Amp↓, Memory↑
Optimal Strategy:
Choose the appropriate balance point based on specific workload!
Workload-Driven Optimization Selection:
┌─────────────────────────────────────────────────────────────┐
│ Optimization Strategies for Different │
│ Scenarios │
├─────────────────────────────────────────────────────────────┤
│ │
│ Scenario 1: Write-heavy, Read-light (Log storage, │
│ Monitoring data) │
│ ├── Main conflict: Write amplification │
│ ├── Recommended: Tiered Compaction + BlobDB │
│ ├── Acceptable cost: Slightly higher read amp, │
│ │ Slightly higher space amp │
│ └── Expected effect: Write Amp 2-3×, Read Amp 5-8×, │
│ Space Amp 1.1× │
│ │
│ Scenario 2: Read-heavy, Write-light (Config center, │
│ Metadata storage) │
│ ├── Main conflict: Read amplification │
│ ├── Recommended: Leveled Compaction + Large Block Cache │
│ ├── Acceptable cost: Slightly higher write amp │
│ └── Expected effect: Write Amp 15-20×, Read Amp 1.5-2×, │
│ Space Amp 1.1× │
│ │
│ Scenario 3: Large Value Storage (Object storage, │
│ Document database) │
│ ├── Main conflict: Write amplification + Space │
│ │ amplification │
│ ├── Recommended: Key-value separation (WiscKey/Titan/ │
│ │ BlobDB) │
│ ├── Acceptable cost: Slightly higher random read latency │
│ └── Expected effect: Write Amp 1-2×, Read Amp 2-3×, │
│ Space Amp 1.05× │
│ │
│ Scenario 4: Time-series Data (Metrics monitoring, │
│ IoT data) │
│ ├── Main conflict: Write throughput + Storage cost │
│ ├── Recommended: FIFO Compaction + Tiered compression │
│ ├── Acceptable cost: Historical data may be quickly │
│ │ cleaned │
│ └── Expected effect: Write Amp 1×, Read Amp 3-5×, │
│ Space Amp 1.1× │
│ │
│ Scenario 5: Mixed Workload (E-commerce, Social │
│ networks) │
│ ├── Main conflict: Balance read/write │
│ ├── Recommended: Leveled-N + Moderate key-value │
│ │ separation │
│ ├── Acceptable cost: Increased implementation complexity │
│ └── Expected effect: Write Amp 5-8×, Read Amp 2-3×, │
│ Space Amp 1.1× │
│ │
└─────────────────────────────────────────────────────────────┘
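The five scenarios above can be condensed into a small decision helper. This is a sketch of the selection logic only: the function name, thresholds (e.g. 4KB for "large value", 80% for "write-heavy"), and strategy strings are illustrative assumptions, not part of any engine's API.

```python
def choose_strategy(write_ratio, avg_value_bytes, time_series=False):
    """write_ratio: fraction of operations that are writes (0.0-1.0).
    Returns the recommended strategy from the scenario table above."""
    if time_series:
        # Scenario 4: throughput and storage cost dominate
        return "FIFO Compaction + tiered compression"
    if avg_value_bytes >= 4096:
        # Scenario 3: large values -> avoid rewriting them in Compaction
        return "key-value separation (WiscKey/Titan/BlobDB)"
    if write_ratio >= 0.8:
        # Scenario 1: write-heavy, read-light
        return "Tiered Compaction + BlobDB"
    if write_ratio <= 0.2:
        # Scenario 2: read-heavy, write-light
        return "Leveled Compaction + large Block Cache"
    # Scenario 5: mixed workload
    return "Leveled-N + moderate key-value separation"

print(choose_strategy(0.9, 100))               # log storage
print(choose_strategy(0.5, 8192))              # document database
print(choose_strategy(0.5, 100, time_series=True))  # IoT metrics
```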
Chapter 9: Performance Comparison Summary
9.1 Core Metrics Comparison Across Systems
| System | Year | Write Amp | Read Amp | Main Optimization | Applicable Scenario |
|---|---|---|---|---|---|
| LSM-Tree | 1996 | Theoretical model | Theoretical model | Sequential write | Theoretical foundation |
| LevelDB | 2011 | 12-50× | 3-14× | Engineering simplification | Embedded |
| RocksDB | 2012 | 10-30× | 2-10× | Multi-threading, Rich features | Industrial general purpose |
| WiscKey | 2016 | ~1× | ~1× (+1 vLog lookup) | Key-value separation | Large value |
| PebblesDB | 2017 | 3-4× | Medium | FLSM | High-throughput writes |
| Titan | 2018 | ~1× | ~1× (+1 vLog lookup) | Key-value separation | TiKV ecosystem |
| BlobDB | 2020 | 2-5× | 2-5× | Industrial-grade key-value separation | RocksDB ecosystem |
9.2 Quantified Optimization Effects
Using LevelDB as the baseline (100%), the tables below compare the effects of the various optimization schemes:
Write Throughput (Random Write, MB/s):
| System | Throughput | Relative to Baseline |
|---|---|---|
| LevelDB | 10 | 100% |
| RocksDB | 15 | 150% |
| PebblesDB | 27 | 270% |
| WiscKey (4KB value) | 350 | 3500% |
| BlobDB (4KB value) | 400 | 4000% |
Write Amplification (Lower is better):
| System | Write Amp | Reduction |
|---|---|---|
| LevelDB | 12× | - |
| RocksDB | 15× | -25% |
| PebblesDB | 4× | 67% |
| WiscKey | 1.2× | 90% |
| BlobDB | 2× | 83% |
Read Performance (Random Point Query, KOps/s):
| System | Point Query Throughput | Relative to Baseline |
|---|---|---|
| LevelDB | 25 | 100% |
| RocksDB | 30 | 120% |
| PebblesDB | 25 | 100% |
| WiscKey | 300 | 1200% |
| BlobDB | 37 | 150% |
Space Efficiency (100GB Raw Data, Actual Usage):
| System | Actual Usage | Space Amplification |
|---|---|---|
| LevelDB | 120 GB | 1.2× |
| RocksDB | 115 GB | 1.15× |
| WiscKey (After GC) | 102 GB | 1.02× |
| BlobDB | 105 GB | 1.05× |
Optimization Benefits Summary:
- Key-value separation is the most effective means to reduce write amplification (WiscKey/BlobDB)
- Bloom Filter is the most effective means to reduce read amplification (Reduce 90%+ invalid disk reads)
- Compression algorithm selection has significant impact on space efficiency (Zstd saves 30-50% space compared to Snappy)
- Multi-threading can significantly improve write throughput (RocksDB is 50%+ higher than LevelDB)
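The Bloom Filter claim above rests on one property: the filter can answer "definitely absent" without touching disk, so only the small false-positive fraction of misses still pays for an SSTable read. A minimal sketch (double hashing per the Kirsch-Mitzenmacher technique; sizes and hash choice are illustrative, not any engine's format):

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=8192, num_hashes=7):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key):
        digest = hashlib.sha256(key.encode()).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:16], "little")
        # Derive k bit positions from two base hashes
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def may_contain(self, key):
        # False means definitely absent: the SSTable can be skipped
        # with zero disk I/O. True means "probably present".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))

bf = BloomFilter()
for k in ("user:100", "user:200"):
    bf.add(k)
present = bf.may_contain("user:100")   # always True for inserted keys
false_positives = sum(bf.may_contain(f"absent:{i}") for i in range(100))
```

In an LSM-Tree each SSTable carries its own filter, so a point query consults the filters first and reads only the (usually one) file that may actually hold the key.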
Chapter 10: References and Further Reading
10.1 Core Papers
| Paper | Year | Core Contribution |
|---|---|---|
| The Log-Structured Merge-Tree | 1996 | LSM-Tree theoretical foundation |
| WiscKey: Separating Keys from Values | 2016 | Key-value separation optimization |
| PebblesDB: Building Key-Value Stores using FLSM | 2017 | FLSM data structure |
| Bigtable: A Distributed Storage System | 2006 | Distributed LSM practice |
| The Five-Minute Rule Twenty Years Later | 2007 | Storage cost analysis |
10.2 Open Source Implementations
| Project | Language | Characteristics |
|---|---|---|
| LevelDB | C++ | Most classic implementation, concise code |
| RocksDB | C++ | Industrial-grade, feature-rich |
| Badger | Go | WiscKey idea implementation |
| Titan | C++ | TiKV’s key-value separation engine |
| mini-lsm | Rust | Educational, step-by-step implementation |
| sled | Rust | Pure Rust, modern design |
10.3 Recommended Books
| Title | Author | Focus |
|---|---|---|
| Designing Data-Intensive Applications | Martin Kleppmann | Chapter 3 Storage and Retrieval |
| Database Internals | Alex Petrov | B-Tree vs LSM comparison |
Appendix: Glossary
| Term | Explanation |
|---|---|
| LSM | Log-Structured Merge |
| SSTable | Sorted String Table |
| MemTable | In-memory table, write buffer |
| Compaction | Merge operation, maintains ordering |
| Bloom Filter | Probabilistic structure that quickly rules out keys that are definitely absent |
| Tombstone | Deletion marker |
| MVCC | Multi-Version Concurrency Control |
| WAL | Write-Ahead Log |
| GC | Garbage Collection |
| vLog | Value Log, value storage after key-value separation |
| Guard | Partition point for organizing SSTables in PebblesDB |
| FLSM | Fragmented LSM, data structure proposed by PebblesDB |
Document End