Recipe 1: The Elastic Cache That Grows Without Restarting
Situation
You run a key-value cache inside a long-lived process (API server sidecar, feature flag cache, edge cache). Traffic is spiky. A conventional in-process cache is bounded by the RAM of the machine it runs on:
- If you size for peak, you waste memory at off-peak.
- If you size for average, you evict hot keys at peak.
- If you need more RAM, you typically restart the process on a bigger instance or shard the cache out into a separate service.
The goal here is not “resize a HashMap”. The goal is to change the cache’s memory footprint at runtime
without a process restart, and to have the cache’s memory reclaimed automatically if the process crashes.
What You Build
An elastic cache with:
- A small “base” shard (always present).
- Optional extra shards that are created by acquiring new memory leases on demand.
- A routing function (hashing) from key -> shard.
- A shrink policy that drops extra shards when they are no longer needed.
The key idea: cache capacity is a function of how many memory leases you hold, not a fixed process property.
Building Blocks
- grafos_cache::ElasticShardSet<K, V> for elastic sharded caching.
- grafos_std::mem::MemBuilder and MemLease for leased memory.
- grafos_collections::map::FabricHashMap<K, V> as the backing store used internally by each shard.
- (Optional) grafos_observe metrics/events to measure churn and hit rate.
See also:
- grafos-collections sharded hashmap example
- grafos-collections guide
- grafos-std guide
- grafos-collections README
- grafos-std README
Design
Sharding Model
ElasticShardSet manages a collection of shards internally, each backed by a FabricHashMap with its own
memory lease. Routing from key to shard is handled by consistent hashing inside the shard set.
When you call add_shard() or remove_shard(), the shard count changes. The shard set tracks which entries
need migration, and rehash() performs the actual data movement. This is the “rehash on resize” strategy:
correct and straightforward, at the cost of moving entries on every resize.
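To make the cost of a resize concrete, here is a minimal sketch that uses only the standard library. It is not the routing inside ElasticShardSet, which is internal to the shard set; it swaps in plain modulo hashing, and the names shard_for and keys_to_migrate are hypothetical. It shows why changing the shard count needs a migration pass: some keys now route to a different shard.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical routing function (an assumption, not the grafos implementation):
/// hash the key, then take the hash modulo the shard count.
fn shard_for<K: Hash>(key: &K, shard_count: usize) -> usize {
    assert!(shard_count > 0, "need at least one shard");
    let mut hasher = DefaultHasher::new();
    key.hash(&mut hasher);
    (hasher.finish() % shard_count as u64) as usize
}

/// Counts the keys whose shard changes when the shard count goes from `old` to `new`.
/// These are the entries a rehash pass would have to move.
fn keys_to_migrate<K: Hash>(keys: &[K], old: usize, new: usize) -> usize {
    keys.iter()
        .filter(|k| shard_for(*k, old) != shard_for(*k, new))
        .count()
}
```

With modulo routing a large share of keys moves on every resize; a consistent-hashing scheme aims to keep that share smaller.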
Growth Trigger
Decide when to add a shard. Typical signals:
- Load factor / occupancy exceeds threshold (e.g. entries/capacity).
- Hit rate drops or latency rises.
- shard_count() is too low for the current working set.
Shrink Trigger
Decide when to remove a shard:
- Sustained low occupancy.
- Sustained low QPS (or low renewal pressure, if you choose to tie caching to TTL).
Call remove_shard() followed by rehash() to migrate keys back. The removed shard’s lease is dropped,
returning memory to the global pool immediately (or on TTL expiry if the process crashed).
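The growth and shrink triggers above can be combined into one small policy object. The sketch below is an assumption-laden illustration: ScalePolicy, ScaleAction, the thresholds, and the idea that the caller can supply entry and capacity counts are all hypothetical and not part of grafos. It grows as soon as occupancy crosses a high-water mark, and shrinks only after several consecutive low samples, so a brief dip does not trigger a rehash.

```rust
/// What the caller should do after one occupancy sample.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ScaleAction {
    Grow,
    Shrink,
    Hold,
}

/// Hypothetical policy; field names and thresholds are illustrative only.
struct ScalePolicy {
    grow_above: f64,      // occupancy above which a shard is added
    shrink_below: f64,    // occupancy below which a sample counts as "low"
    sustain_samples: u32, // consecutive low samples required before shrinking
    min_shards: usize,    // never shrink below this many shards
    low_streak: u32,      // current run of consecutive low samples
}

impl ScalePolicy {
    fn new(grow_above: f64, shrink_below: f64, sustain_samples: u32, min_shards: usize) -> Self {
        ScalePolicy { grow_above, shrink_below, sustain_samples, min_shards, low_streak: 0 }
    }

    /// Feed one sample; returns the action to take for this sample.
    fn observe(&mut self, entries: usize, capacity: usize, shard_count: usize) -> ScaleAction {
        // Treat zero capacity as full so the policy asks for more room.
        let occupancy = if capacity == 0 { 1.0 } else { entries as f64 / capacity as f64 };
        if occupancy > self.grow_above {
            self.low_streak = 0;
            return ScaleAction::Grow;
        }
        if occupancy < self.shrink_below && shard_count > self.min_shards {
            self.low_streak += 1;
            if self.low_streak >= self.sustain_samples {
                self.low_streak = 0;
                return ScaleAction::Shrink;
            }
        } else {
            // Any sample that is not low resets the streak.
            self.low_streak = 0;
        }
        ScaleAction::Hold
    }
}
```

A caller would map Grow to add_shard() followed by rehash(), and Shrink to remove_shard() followed by rehash().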
Walkthrough (Implementation Sketch)
1. Pick Strides and Create the Shard Set
ElasticShardSet takes stride parameters for fixed key/value storage. For example:
- Keys: strings with stride 32 bytes.
- Values: u64 with stride 16 bytes.
use grafos_cache::ElasticShardSet;
// Start with 2 shards, 256 buckets per shard, key stride 32, value stride 16.
let mut cache: ElasticShardSet<String, u64> = ElasticShardSet::new(2, 256, 32, 16)?;
Each shard acquires its own memory lease internally via MemBuilder.
2. Cache API (Get/Put/Remove)
ElasticShardSet provides the cache interface directly — no manual shard wrapper or routing needed:
// Write
cache.put(&"session:abc".to_string(), &42u64)?;
// Read
let val = cache.get(&"session:abc".to_string())?; // Some(42)
// Delete
cache.remove(&"session:abc".to_string())?;
Routing from key to shard is handled internally by the shard set.
3. Grow (Add Shard and Rehash)
cache.add_shard()?; // Acquires a new memory lease for the new shard.
cache.rehash()?; // Migrates entries to rebalance across all shards.
assert_eq!(cache.shard_count(), 3);
add_shard() acquires a new lease; rehash() moves entries so routing is consistent.
4. Shrink (Remove Shard and Rehash)
cache.remove_shard()?; // Marks last shard for removal.
cache.rehash()?; // Migrates entries out, then drops the shard's lease.
assert_eq!(cache.shard_count(), 2);
Dropping the shard drops its lease, returning memory to the global pool.
Failure Modes and Recovery
- FabricError::Disconnected during reads/writes: the remote node is unreachable.
  - Decide whether to treat cache as best-effort (miss) or fail closed (bubble error).
- FabricError::LeaseExpired: lease ended; your shard is gone.
  - Treat as “shard died”, rebuild it by acquiring a fresh lease.
- Process crash:
  - All shards are reclaimed when their leases expire. You do not need a cleanup job.
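The choice between best-effort and fail-closed handling can be kept in one place. The sketch below is an assumption, not grafos code: CacheError is a local stand-in that mirrors only the two FabricError variants named above, and Outcome and handle are hypothetical names. It maps the result of a cache read to the action the caller should take.

```rust
/// Local stand-in for the two FabricError variants discussed above (hypothetical type).
#[derive(Debug, PartialEq)]
enum CacheError {
    Disconnected,
    LeaseExpired,
}

/// What the caller should do with the result of a cache read.
#[derive(Debug, PartialEq)]
enum Outcome<V> {
    Hit(V),
    Miss,
    /// The shard's lease is gone; acquire a fresh lease and rebuild the shard.
    RebuildShard,
}

/// Classifies a read result. With `fail_closed = false`, an unreachable node is
/// treated as a plain miss; with `fail_closed = true`, the error is passed up.
fn handle<V>(
    result: Result<Option<V>, CacheError>,
    fail_closed: bool,
) -> Result<Outcome<V>, CacheError> {
    match result {
        Ok(Some(value)) => Ok(Outcome::Hit(value)),
        Ok(None) => Ok(Outcome::Miss),
        Err(CacheError::LeaseExpired) => Ok(Outcome::RebuildShard),
        Err(CacheError::Disconnected) if fail_closed => Err(CacheError::Disconnected),
        Err(CacheError::Disconnected) => Ok(Outcome::Miss),
    }
}
```

Best-effort mode suits caches in front of a source of truth that can always be queried; fail-closed mode suits cases where serving without the cache would overload the backend.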
Observability
With grafos-observe, track:
- active_shards gauge
- leases_active gauge (per resource type)
- get_latency histogram (split by shard)
- rehash_duration histogram
The point is to make elastic behavior visible: you should be able to see shard count step up and down.
Variations
- Avoid full rehash: maintain multiple generations and consult old shards on miss.
- Two-tier cache: shard 0 is “local-ish” (low latency), added shards may be remote (higher latency).
- Admission control: do not insert cold keys; only promote to the cache after repeated hits.
- Write-through persistence: combine with Durable to checkpoint hot keys between restarts.
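As an illustration of the "avoid full rehash" variation, the sketch below keeps two generations and migrates entries lazily on read. It is a minimal model under stated assumptions: it uses std HashMap in place of leased shards, and Generations, rotate, and retire_previous are hypothetical names rather than grafos APIs.

```rust
use std::collections::HashMap;
use std::hash::Hash;

/// Two-generation map: writes go to `current`, reads fall back to `previous`.
struct Generations<K, V> {
    current: HashMap<K, V>,
    previous: Option<HashMap<K, V>>,
}

impl<K: Hash + Eq, V> Generations<K, V> {
    fn new() -> Self {
        Generations { current: HashMap::new(), previous: None }
    }

    /// Starts a new generation. The old `current` becomes `previous`; any older
    /// generation that was still around is dropped along with its entries.
    fn rotate(&mut self) {
        let old = std::mem::take(&mut self.current);
        self.previous = Some(old);
    }

    fn put(&mut self, key: K, value: V) {
        self.current.insert(key, value);
    }

    /// Looks in the current generation first. On a miss, moves the entry over
    /// from the previous generation, so each key migrates at most once.
    fn get(&mut self, key: &K) -> Option<&V> {
        if !self.current.contains_key(key) {
            let moved = self.previous.as_mut().and_then(|p| p.remove_entry(key));
            if let Some((k, v)) = moved {
                self.current.insert(k, v);
            }
        }
        self.current.get(key)
    }

    /// Drops the previous generation and reports how many entries never migrated.
    fn retire_previous(&mut self) -> usize {
        self.previous.take().map_or(0, |p| p.len())
    }
}
```

The trade-off is that a miss costs two lookups while an old generation exists, in exchange for no stop-the-world migration.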
Testing
Run against a live dev fabric or a test cell with enough FBMU capacity for all shards.
- Simulate growth and shrink and assert:
  - no panics
  - keys are still readable after rehash
  - leases drop on shard drop (if you expose lease counts via metrics, assert they decrease)
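The grow/shrink assertions above can be written once against a small trait and reused for any cache that implements it. Everything in this sketch is an assumption: ElasticCache is a hypothetical test-facing trait, and ModelCache is an in-memory stand-in built on std HashMap so the check runs without a fabric. For the real cache, grow would wrap add_shard() plus rehash(), and shrink would wrap remove_shard() plus rehash().

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Hypothetical test-facing trait; implement it for the real cache in your harness.
trait ElasticCache {
    fn put(&mut self, key: &str, value: u64);
    fn get(&self, key: &str) -> Option<u64>;
    fn grow(&mut self);
    fn shrink(&mut self);
    fn shard_count(&self) -> usize;
}

/// In-memory stand-in: one HashMap per shard, modulo routing, full rehash on resize.
struct ModelCache {
    shards: Vec<HashMap<String, u64>>,
}

impl ModelCache {
    fn new(shards: usize) -> Self {
        ModelCache { shards: vec![HashMap::new(); shards.max(1)] }
    }
    fn route(&self, key: &str) -> usize {
        let mut hasher = DefaultHasher::new();
        key.hash(&mut hasher);
        (hasher.finish() % self.shards.len() as u64) as usize
    }
    /// Changes the shard count (never below 1) and re-routes every entry.
    fn resize(&mut self, shards: usize) {
        let entries: Vec<(String, u64)> = self.shards.drain(..).flatten().collect();
        self.shards = vec![HashMap::new(); shards.max(1)];
        for (key, value) in entries {
            let index = self.route(&key);
            self.shards[index].insert(key, value);
        }
    }
}

impl ElasticCache for ModelCache {
    fn put(&mut self, key: &str, value: u64) {
        let index = self.route(key);
        self.shards[index].insert(key.to_string(), value);
    }
    fn get(&self, key: &str) -> Option<u64> {
        self.shards[self.route(key)].get(key).copied()
    }
    fn grow(&mut self) {
        let n = self.shards.len() + 1;
        self.resize(n);
    }
    fn shrink(&mut self) {
        let n = self.shards.len() - 1;
        self.resize(n);
    }
    fn shard_count(&self) -> usize {
        self.shards.len()
    }
}

/// Checks the shard count and that every key written earlier is still readable.
fn verify<C: ElasticCache>(cache: &C, n: u64, expected_shards: usize) -> Result<(), String> {
    if cache.shard_count() != expected_shards {
        return Err(format!("expected {} shards, found {}", expected_shards, cache.shard_count()));
    }
    for i in 0..n {
        let key = format!("key:{}", i);
        if cache.get(&key) != Some(i) {
            return Err(format!("{} not readable after rehash", key));
        }
    }
    Ok(())
}

/// Writes `n` keys, grows by one shard, shrinks back, and verifies after each step.
fn check_grow_shrink<C: ElasticCache>(cache: &mut C, n: u64) -> Result<(), String> {
    let start = cache.shard_count();
    for i in 0..n {
        cache.put(&format!("key:{}", i), i);
    }
    cache.grow();
    verify(&*cache, n, start + 1)?;
    cache.shrink();
    verify(&*cache, n, start)
}
```

Lease-count assertions are left out here because they depend on how your deployment exposes lease metrics.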