Recipe 3: A Database Buffer Pool That Borrows Memory From the Network

Situation

Databases are often constrained by buffer pool size. If the working set grows beyond RAM, performance falls off a cliff as disk I/O increases.

Traditional options:

  • Move to a bigger machine (restart, cost).
  • Add cache nodes (complexity).
  • Add replicas (not the same thing as more buffer pool).

In a disaggregated fabric, the “buffer pool” is not limited to local RAM. You can lease memory on other nodes and use it as a second-tier cache.

The goal: add capacity now, without moving the primary process.

What You Build

A two-tier buffer pool abstraction:

  • Tier 1: low-latency “near” memory lease.
  • Tier 2: a set of additional memory leases (shards) on other nodes.

Reads:

  1. Check tier 1.
  2. If miss, check tier 2.
  3. If miss, go to disk / storage.

Writes:

  • Write-through or write-back depending on your system.
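The difference between the two write policies can be sketched with plain in-process maps. This is a toy model under stated assumptions, not the fabric API: Pool, WritePolicy, write_page, and flush are hypothetical names, and a HashMap stands in for both the cache tier and storage.

```rust
use std::collections::{HashMap, HashSet};

type Page = [u8; 4096];

#[derive(Clone, Copy, PartialEq, Eq)]
pub enum WritePolicy {
    WriteThrough,
    WriteBack,
}

/// Toy pool: `cache` stands in for tier 1, `storage` for the disk.
pub struct Pool {
    pub policy: WritePolicy,
    pub cache: HashMap<u128, Page>,
    pub dirty: HashSet<u128>,
    pub storage: HashMap<u128, Page>,
}

impl Pool {
    pub fn new(policy: WritePolicy) -> Self {
        Pool {
            policy,
            cache: HashMap::new(),
            dirty: HashSet::new(),
            storage: HashMap::new(),
        }
    }

    /// Always update the cache; reach storage now or later depending on policy.
    pub fn write_page(&mut self, id: u128, page: Page) {
        self.cache.insert(id, page);
        match self.policy {
            WritePolicy::WriteThrough => {
                self.storage.insert(id, page);
            }
            WritePolicy::WriteBack => {
                self.dirty.insert(id);
            }
        }
    }

    /// Write every dirty page to storage; returns how many were flushed.
    pub fn flush(&mut self) -> usize {
        let ids: Vec<u128> = self.dirty.drain().collect();
        for id in &ids {
            if let Some(page) = self.cache.get(id) {
                self.storage.insert(*id, *page);
            }
        }
        ids.len()
    }
}
```

With write-back, a page that exists only in the cache is lost if the cache is lost, so dirty pages are safer in tier 1 than in a remote shard that can disconnect.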

Building Blocks

  • MemBuilder and MemLease for acquiring memory.
  • FabricHashMap (or FabricVec) to store pages/blocks.
  • Typed errors: FabricError::Disconnected, FabricError::LeaseExpired.

Design

Key Choice

In a DB, the key is typically (file_id, page_no) or similar. For the recipe:

  • Use a u128 page identifier, or serialize a tuple.
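One way to get a u128 identifier without a serializer is to pack the tuple directly. The sketch below assumes file_id and page_no each fit in 64 bits; page_key and split_page_key are hypothetical helper names, not part of the fabric API.

```rust
/// Pack a (file_id, page_no) pair into one u128 key:
/// file_id in the high 64 bits, page_no in the low 64 bits.
pub fn page_key(file_id: u64, page_no: u64) -> u128 {
    ((file_id as u128) << 64) | (page_no as u128)
}

/// Recover (file_id, page_no) from a key produced by `page_key`.
pub fn split_page_key(key: u128) -> (u64, u64) {
    ((key >> 64) as u64, key as u64)
}
```

Keys built this way compare in (file_id, page_no) order, which can help if you ever scan them.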

Value Choice

The value is a fixed-size page (e.g. 4 KiB, 8 KiB). FabricHashMap stores serialized values; fixed-size pages are a good fit because:

  • You know the stride.
  • You can avoid variable allocation.

Tier 2 Sharding

Tier 2 can be:

  • A single large shard (simple, hotspot risk).
  • Many shards (better parallelism and failure isolation).

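Routing by a hash of the page identifier is sufficient. The walkthrough below uses a plain modulo on the identifier for brevity.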
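A hash-based router might look like the following. It is a sketch, not the library's routing: shard_for and fmix64 are hypothetical names, and the mixer is the 64-bit finalizer from MurmurHash3, chosen only because it is short and deterministic. Any well-distributed hash would do.

```rust
/// 64-bit finalizer from MurmurHash3: scrambles the bits of `h`.
fn fmix64(mut h: u64) -> u64 {
    h ^= h >> 33;
    h = h.wrapping_mul(0xff51_afd7_ed55_8ccd);
    h ^= h >> 33;
    h = h.wrapping_mul(0xc4ce_b9fe_1a85_ec53);
    h ^= h >> 33;
    h
}

/// Pick a tier 2 shard for `id`. Returns None when there are no shards,
/// which avoids a division by zero on an empty tier.
pub fn shard_for(id: u128, shard_count: usize) -> Option<usize> {
    if shard_count == 0 {
        return None;
    }
    let hi = (id >> 64) as u64;
    let lo = id as u64;
    // Mix the high half first so both halves influence the result.
    let h = fmix64(fmix64(hi) ^ lo);
    Some((h % shard_count as u64) as usize)
}
```

Hashing matters most when identifiers are structured: with a packed (file_id, page_no) key, a bare modulo routes on the low bits of page_no alone.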

Consistency

This recipe is about capacity, not transactional semantics. You can treat the buffer pool as a cache:

  • Stale reads are unacceptable for most DBs, so the DB engine still owns correctness: it must update or invalidate a cached page whenever that page changes.
  • The buffer pool holds copies of pages; the source of truth remains storage.

Walkthrough (Implementation Sketch)

1. Tier 1

Acquire a lease and build a map:

let l1 = MemBuilder::new().min_bytes(128 * 1024).acquire()?;
let mut tier1: FabricHashMap<u128, [u8; 4096]> = FabricHashMap::new(l1, 16, 4096)?;

(The stride arguments here, 16 bytes for the u128 key and 4096 bytes for the page, and the lease size are illustrative; match your actual serialization/layout.)
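To see how far the illustrative numbers go, here is the capacity arithmetic as a small helper. It assumes each entry costs exactly key stride plus value stride and that the map adds no metadata of its own, so treat the result as an upper bound. max_entries is a hypothetical name.

```rust
/// Upper bound on entries that fit in a lease, assuming each entry costs
/// exactly key_stride + value_stride bytes and the map has no other overhead.
pub const fn max_entries(lease_bytes: usize, key_stride: usize, value_stride: usize) -> usize {
    let entry = key_stride + value_stride;
    if entry == 0 {
        return 0;
    }
    lease_bytes / entry
}
```

Under that assumption a 128 KiB lease holds at most 31 pages of 4 KiB (131072 / 4112), and a 256 KiB lease holds at most 63. A real buffer pool would request far larger leases.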

2. Tier 2

Acquire multiple leases:

let mut tier2 = Vec::new();
for _ in 0..4 {
    let lease = MemBuilder::new().min_bytes(256 * 1024).acquire()?;
    let map: FabricHashMap<u128, [u8; 4096]> = FabricHashMap::new(lease, 16, 4096)?;
    tier2.push(map);
}

3. Read Path

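// The tiers are passed in explicitly: a free function cannot capture local variables.
// Note: `?` propagates every error; step 4 covers treating a Disconnected shard as a miss.
fn read_page(
    tier1: &mut FabricHashMap<u128, [u8; 4096]>,
    tier2: &mut [FabricHashMap<u128, [u8; 4096]>],
    id: u128,
) -> Result<[u8; 4096]> {
    if let Some(p) = tier1.get(&id)? {
        return Ok(p);
    }
    // Route to one shard; assumes tier2 is non-empty.
    let idx = (id as usize) % tier2.len();
    if let Some(p) = tier2[idx].get(&id)? {
        // Promote to tier1 if desired.
        let _ = tier1.insert(&id, &p)?;
        return Ok(p);
    }
    // Fallback: storage.
    let p = disk_read(id)?;
    let _ = tier1.insert(&id, &p)?;
    Ok(p)
}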

4. Handling Remote Failure

Tier 2 lives on other nodes, so lookups can fail. Decide the behavior up front:

  • If a tier2 shard is Disconnected, treat it as a miss and go to disk.
  • Optionally reacquire a new lease and rebuild that shard.

In a real system, you would also account for latency differences: tier 2 is higher latency than tier 1, but still often lower than disk.

Failure Modes

  • Disconnected: treat as a miss, rebuild shard opportunistically.
  • LeaseExpired: shard died; drop it from tier2 and replace later.
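The two policies above can be expressed as a single match. The sketch is self-contained, so it defines local stand-ins: this FabricError mirrors only the two variants named in this recipe, Shard is a mock backed by a HashMap with an injectable failure, and ShardAction and tier2_lookup are hypothetical names.

```rust
use std::collections::HashMap;

type Page = [u8; 4096];

/// Local stand-in for the two error variants named in this recipe.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FabricError {
    Disconnected,
    LeaseExpired,
}

/// Mock tier 2 shard: a map plus an injectable failure, for illustration only.
pub struct Shard {
    pub pages: HashMap<u128, Page>,
    pub fail_with: Option<FabricError>,
}

impl Shard {
    pub fn new() -> Self {
        Shard { pages: HashMap::new(), fail_with: None }
    }

    pub fn get(&self, id: &u128) -> Result<Option<Page>, FabricError> {
        match self.fail_with {
            Some(e) => Err(e),
            None => Ok(self.pages.get(id).copied()),
        }
    }
}

/// What the caller should do with the shard after a lookup.
#[derive(Debug, PartialEq, Eq)]
pub enum ShardAction {
    Keep,
    Rebuild,
    Drop,
}

/// Look up `id` in one shard, turning fabric errors into a cache miss
/// so the read path can fall through to storage.
pub fn tier2_lookup(shard: &Shard, id: u128) -> (Option<Page>, ShardAction) {
    match shard.get(&id) {
        Ok(hit) => (hit, ShardAction::Keep),
        Err(FabricError::Disconnected) => (None, ShardAction::Rebuild),
        Err(FabricError::LeaseExpired) => (None, ShardAction::Drop),
    }
}
```

Returning an action instead of mutating the shard list keeps the lookup cheap; the rebuild or drop can happen off the read path.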

Because leases have a TTL, a crashed process does not leave “remote shm segments” lying around.

Observability

Track:

  • l1 hit rate
  • l2 hit rate
  • l2 latency percentiles
  • l2 shard health (disconnect count)
  • total leased bytes
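Most of these metrics reduce to a handful of counters. The struct below is a minimal sketch with hypothetical names (PoolStats and its fields); latency percentiles are left out because they need a histogram rather than a counter.

```rust
/// Counters for the two-tier pool. All names here are illustrative.
#[derive(Debug, Default)]
pub struct PoolStats {
    pub l1_hits: u64,
    pub l2_hits: u64,
    pub disk_reads: u64,
    pub l2_disconnects: u64,
    pub leased_bytes: u64,
}

impl PoolStats {
    /// Every read ends in exactly one of: tier 1 hit, tier 2 hit, disk read.
    pub fn total_reads(&self) -> u64 {
        self.l1_hits + self.l2_hits + self.disk_reads
    }

    /// Fraction of all reads served by tier 1 (0.0 when there were no reads).
    pub fn l1_hit_rate(&self) -> f64 {
        ratio(self.l1_hits, self.total_reads())
    }

    /// Fraction of tier 1 misses that tier 2 served.
    pub fn l2_hit_rate(&self) -> f64 {
        ratio(self.l2_hits, self.l2_hits + self.disk_reads)
    }
}

fn ratio(num: u64, den: u64) -> f64 {
    if den == 0 {
        0.0
    } else {
        num as f64 / den as f64
    }
}
```

Measuring the tier 2 hit rate against tier 1 misses, rather than against all reads, shows whether the remote tier is earning its leased bytes.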

Variations

  • Adaptive tier2: add or remove shards based on miss rate.
  • Locality-aware: prefer nodes in the same rack for tier2 leases.
  • Compression: store compressed pages in tier2.
  • Write-back: hold dirty pages in tier1 and flush to storage asynchronously.
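As one example, the adaptive variation could be driven by a small policy function. This is a sketch under stated assumptions: plan_resize and Resize are hypothetical names, and the 10% and 1% thresholds are placeholders rather than recommendations.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum Resize {
    AddShard,
    RemoveShard,
    Hold,
}

/// Decide whether to grow or shrink tier 2 from the share of reads that
/// reached disk. Thresholds are illustrative knobs, not tuned values.
pub fn plan_resize(disk_reads: u64, total_reads: u64, shards: usize, max_shards: usize) -> Resize {
    if total_reads == 0 {
        return Resize::Hold;
    }
    let miss_rate = disk_reads as f64 / total_reads as f64;
    if miss_rate > 0.10 && shards < max_shards {
        Resize::AddShard
    } else if miss_rate < 0.01 && shards > 1 {
        Resize::RemoveShard
    } else {
        Resize::Hold
    }
}
```

Changing the shard count remaps most keys under modulo or hash routing. For a cache that only costs extra misses until the shards warm up again, but it is a reason to resize rarely.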