Recipe 1: The Elastic Cache That Grows Without Restarting

Situation

You run a key-value cache inside a long-lived process (API server sidecar, feature flag cache, edge cache). Traffic is spiky. A conventional in-process cache is bounded by the RAM of the machine it runs on:

If you size for peak, you waste memory at off-peak.
If you size for average, you evict hot keys at peak.
If you need more RAM, you typically restart the process on a bigger instance or shard the cache out into a separate service.

The goal here is not “resize a HashMap”. The goal is to change the cache’s memory footprint at runtime without a process restart, and to have the cache’s memory reclaimed automatically if the process crashes.

What You Build

An elastic cache with:

A small “base” shard (always present).
Optional extra shards that are created by acquiring new memory leases on demand.
A routing function (hashing) from key -> shard.
A shrink policy that drops extra shards when they are no longer needed.

The key idea: cache capacity is a function of how many memory leases you hold, not a fixed process property.

Building Blocks

grafos_cache::ElasticShardSet<K, V> for elastic sharded caching.
grafos_std::mem::MemBuilder and MemLease for leased memory.
grafos_collections::map::FabricHashMap<K, V> as the backing store used internally by each shard.
(Optional) grafos_observe metrics/events to measure churn and hit rate.

Design

Sharding Model

ElasticShardSet manages a collection of shards internally, each backed by a FabricHashMap with its own memory lease. Routing from key to shard is handled by consistent hashing inside the shard set.

When you call add_shard() or remove_shard(), the shard count changes. The shard set tracks which entries need migration, and rehash() performs the actual data movement. This is the “rehash on resize” strategy: simple and correct, at the cost of moving entries each time the shard count changes.
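To make “rehash on resize” concrete, here is a minimal std-only sketch. ToyShardSet is a hypothetical stand-in, not the grafos_cache implementation: each shard is a plain in-process HashMap rather than a leased FabricHashMap, and routing uses simple modulo hashing instead of consistent hashing, so a resize may move most keys.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Hypothetical toy model of a shard set. Each shard is an ordinary HashMap;
/// in the recipe proper, each shard would hold its own memory lease.
pub struct ToyShardSet {
    shards: Vec<HashMap<String, u64>>,
}

impl ToyShardSet {
    pub fn new(shard_count: usize) -> Self {
        assert!(shard_count > 0, "the base shard must always exist");
        Self { shards: (0..shard_count).map(|_| HashMap::new()).collect() }
    }

    /// Modulo routing: hash the key, then reduce to a shard index.
    fn route(key: &str, shard_count: usize) -> usize {
        let mut hasher = DefaultHasher::new();
        key.hash(&mut hasher);
        (hasher.finish() % shard_count as u64) as usize
    }

    pub fn put(&mut self, key: &str, value: u64) {
        let idx = Self::route(key, self.shards.len());
        self.shards[idx].insert(key.to_string(), value);
    }

    pub fn get(&self, key: &str) -> Option<u64> {
        let idx = Self::route(key, self.shards.len());
        self.shards[idx].get(key).copied()
    }

    /// Grow by one shard and move every entry to the shard it now routes to.
    pub fn add_shard(&mut self) {
        let n = self.shards.len() + 1;
        self.rehash_into(n);
    }

    /// Shrink by one shard, never dropping the base shard.
    pub fn remove_shard(&mut self) {
        if self.shards.len() > 1 {
            let n = self.shards.len() - 1;
            self.rehash_into(n);
        }
    }

    /// Full rehash: drain all old shards and redistribute into `new_count` shards.
    fn rehash_into(&mut self, new_count: usize) {
        let fresh = (0..new_count).map(|_| HashMap::new()).collect();
        let old = std::mem::replace(&mut self.shards, fresh);
        for (key, value) in old.into_iter().flatten() {
            let idx = Self::route(&key, new_count);
            self.shards[idx].insert(key, value);
        }
    }

    pub fn shard_count(&self) -> usize {
        self.shards.len()
    }

    pub fn len(&self) -> usize {
        self.shards.iter().map(|s| s.len()).sum()
    }
}
```

The property worth noting is that get() only finds a key if it sits in the shard that the current shard count routes it to, which is why every resize has to be followed by a rehash before reads are trustworthy again.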

Growth Trigger

Decide when to add a shard. Typical signals:

Load factor / occupancy (e.g. entries/capacity) exceeds a threshold.
Hit rate drops or latency rises.
shard_count() is too low for the current working set.

Shrink Trigger

Decide when to remove a shard:

Sustained low occupancy.
Sustained low QPS (or low renewal pressure, if you choose to tie caching to TTL).

Call remove_shard() followed by rehash() to migrate keys back into the remaining shards. The removed shard’s lease is then dropped, which returns its memory to the global pool immediately. If the process crashes instead, the memory is returned when the lease’s TTL expires.
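The grow and shrink triggers can be combined into one small policy object. The sketch below is an assumption, not part of grafos_cache: ScalePolicy, ScaleAction, and all threshold names are hypothetical. It grows as soon as occupancy crosses an upper threshold, and shrinks only after occupancy has stayed below a lower threshold for several consecutive samples, which avoids flapping between add_shard() and remove_shard().

```rust
/// What the policy asks the caller to do next.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum ScaleAction {
    Grow,
    Shrink,
    Hold,
}

/// Hypothetical scaling policy with a hold band between two thresholds.
pub struct ScalePolicy {
    grow_above: f64,   // occupancy above this triggers growth immediately
    shrink_below: f64, // occupancy below this counts toward a shrink
    shrink_after: u32, // consecutive low samples required before shrinking
    min_shards: usize, // never shrink below this many shards
    low_streak: u32,
}

impl ScalePolicy {
    pub fn new(grow_above: f64, shrink_below: f64, shrink_after: u32, min_shards: usize) -> Self {
        assert!(shrink_below < grow_above, "thresholds must leave a hold band");
        Self { grow_above, shrink_below, shrink_after, min_shards, low_streak: 0 }
    }

    /// Feed one sample: entries stored, total capacity, and current shard count.
    pub fn observe(&mut self, entries: usize, capacity: usize, shard_count: usize) -> ScaleAction {
        // Treat zero capacity as fully occupied so the caller is told to grow.
        let occupancy = if capacity == 0 {
            1.0
        } else {
            entries as f64 / capacity as f64
        };

        if occupancy > self.grow_above {
            self.low_streak = 0;
            return ScaleAction::Grow;
        }

        if occupancy < self.shrink_below && shard_count > self.min_shards {
            self.low_streak += 1;
            if self.low_streak >= self.shrink_after {
                self.low_streak = 0;
                return ScaleAction::Shrink;
            }
        } else {
            // Any sample inside the hold band resets the "sustained low" streak.
            self.low_streak = 0;
        }
        ScaleAction::Hold
    }
}
```

A caller would sample on a timer, then map Grow to add_shard() plus rehash() and Shrink to remove_shard() plus rehash(). How entries and capacity are obtained from the shard set is left open here.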

Walkthrough (Implementation Sketch)

1. Pick Strides and Create the Shard Set

ElasticShardSet takes stride parameters for fixed key/value storage. For example:

Keys: strings, with a stride of 32 bytes.
Values: u64, with a stride of 16 bytes.

use grafos_cache::ElasticShardSet;

// Start with 2 shards, 256 buckets per shard, key stride 32, value stride 16.
let mut cache: ElasticShardSet<String, u64> =
    ElasticShardSet::new(2, 256, 32, 16)?;

Each shard acquires its own memory lease internally via MemBuilder.

2. Cache API (Get/Put/Remove)

ElasticShardSet provides the cache interface directly, so no manual shard wrapper or routing is needed:

// Write
cache.put(&"session:abc".to_string(), &42u64)?;

// Read
let val = cache.get(&"session:abc".to_string())?;  // Some(42)

// Delete
cache.remove(&"session:abc".to_string())?;

Routing from key to shard is handled internally by the shard set.

3. Grow (Add Shard and Rehash)

cache.add_shard()?;   // Acquires a new memory lease for the new shard.
cache.rehash()?;      // Migrates entries to rebalance across all shards.

assert_eq!(cache.shard_count(), 3);

add_shard() acquires a new lease; rehash() moves entries so routing is consistent.

4. Shrink (Remove Shard and Rehash)

cache.remove_shard()?;  // Marks last shard for removal.
cache.rehash()?;         // Migrates entries out, then drops the shard's lease.

assert_eq!(cache.shard_count(), 2);

Dropping the shard drops its lease, returning memory to the global pool.

Failure Modes and Recovery

FabricError::Disconnected during reads/writes: the remote node is unreachable.
- Decide whether to treat cache as best-effort (miss) or fail closed (bubble error).
FabricError::LeaseExpired: lease ended; your shard is gone.
- Treat as “shard died”, rebuild it by acquiring a fresh lease.
Process crash:
- All shards are reclaimed when their leases expire. You do not need a cleanup job.
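The best-effort versus fail-closed decision can be isolated in one function so that call sites stay uniform. This is a sketch under assumptions: the FabricError enum below is a local stand-in with only the two variants named above (the real type may differ), and ReadMode, ReadOutcome, and classify_read are hypothetical names.

```rust
/// Local stand-in for the two error cases discussed above.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum FabricError {
    Disconnected,
    LeaseExpired,
}

/// How the caller wants unreachable shards to be treated.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum ReadMode {
    BestEffort,
    FailClosed,
}

#[derive(Debug, PartialEq, Eq)]
pub enum ReadOutcome<V> {
    Hit(V),
    Miss,
    /// The shard's lease is gone: rebuild the shard, then treat the read as a miss.
    RebuildShard,
}

/// Map a raw cache read result onto the policy described above.
pub fn classify_read<V>(
    result: Result<Option<V>, FabricError>,
    mode: ReadMode,
) -> Result<ReadOutcome<V>, FabricError> {
    match (result, mode) {
        (Ok(Some(value)), _) => Ok(ReadOutcome::Hit(value)),
        (Ok(None), _) => Ok(ReadOutcome::Miss),
        // An expired lease is recoverable in either mode: acquire a fresh one.
        (Err(FabricError::LeaseExpired), _) => Ok(ReadOutcome::RebuildShard),
        // Best-effort: an unreachable node is just a cache miss.
        (Err(FabricError::Disconnected), ReadMode::BestEffort) => Ok(ReadOutcome::Miss),
        // Fail closed: bubble the error up to the caller.
        (Err(e @ FabricError::Disconnected), ReadMode::FailClosed) => Err(e),
    }
}
```

Wrapping the result of a get() call in classify_read keeps the mode a single configuration choice rather than something each caller re-decides.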

Observability

With grafos_observe, track:

active_shards gauge
leases_active gauge (per resource type)
get_latency histogram (split by shard)
rehash_duration histogram

The point is to make elastic behavior visible: you should be able to see shard count step up and down.
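If the observability crate is not wired up yet, the same signals can be collected with the standard library first. The sketch below does not use the grafos_observe API; ElasticMetrics and its fields are hypothetical names that mirror the active_shards gauge and rehash_duration histogram listed above.

```rust
use std::time::{Duration, Instant};

/// Hypothetical in-process metrics holder for an elastic cache.
#[derive(Default)]
pub struct ElasticMetrics {
    /// Current value of the active_shards gauge.
    pub active_shards: usize,
    /// Every distinct value the gauge has taken, in order, so steps are visible.
    pub shard_steps: Vec<usize>,
    /// Raw samples for a rehash_duration histogram.
    pub rehash_durations: Vec<Duration>,
}

impl ElasticMetrics {
    /// Update the gauge; record a step only when the value actually changes.
    pub fn set_active_shards(&mut self, count: usize) {
        if count != self.active_shards {
            self.active_shards = count;
            self.shard_steps.push(count);
        }
    }

    /// Run `rehash`, record how long it took, and pass its result through.
    pub fn time_rehash<T>(&mut self, rehash: impl FnOnce() -> T) -> T {
        let start = Instant::now();
        let out = rehash();
        self.rehash_durations.push(start.elapsed());
        out
    }
}
```

Passing the rehash call as a closure to time_rehash and then calling set_active_shards with the new shard count yields the step-up and step-down trace described above.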

Variations

Avoid full rehash: maintain multiple generations and consult old shards on miss.
Two-tier cache: shard 0 is “local-ish” (low latency), added shards may be remote (higher latency).
Admission control: do not insert cold keys; only promote to the cache after repeated hits.
Write-through persistence: combine with Durable to checkpoint hot keys between restarts.
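As an illustration of the admission-control variation, the following sketch admits a key only after it has been requested a set number of times. AdmissionFilter is a hypothetical helper, not part of grafos_cache, and it keeps its counters in an ordinary unbounded HashMap, which a production version would need to cap or age out.

```rust
use std::collections::HashMap;

/// Hypothetical admission filter: a key is admitted to the cache only
/// after it has been requested `threshold` times.
pub struct AdmissionFilter {
    threshold: u32,
    seen: HashMap<String, u32>,
}

impl AdmissionFilter {
    pub fn new(threshold: u32) -> Self {
        assert!(threshold > 0, "threshold must be at least 1");
        Self { threshold, seen: HashMap::new() }
    }

    /// Record one request for `key`. Returns true when the key has just
    /// reached the threshold and should now be inserted into the cache.
    pub fn should_admit(&mut self, key: &str) -> bool {
        let reached = {
            let count = self.seen.entry(key.to_string()).or_insert(0);
            *count = count.saturating_add(1);
            *count >= self.threshold
        };
        if reached {
            // The key now lives in the cache; stop tracking it here.
            self.seen.remove(key);
        }
        reached
    }

    /// Number of cold keys currently being tracked.
    pub fn tracked(&self) -> usize {
        self.seen.len()
    }
}
```

On a cache miss, the caller would fetch the value from the origin and call put() only if should_admit returns true, so one-off keys never occupy leased memory.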

Testing

Run against a live dev fabric or a test cell with enough FBMU capacity for all shards.

Simulate growth and shrink and assert:
- no panics
- keys are still readable after rehash
- leases drop on shard drop (if you expose lease counts via metrics, assert they decrease)