Recipe 66: Batching Four Tenants Into One Decode Forward Pass
Situation
Recipe 65 composes two engines for one tenant’s throughput. Recipe 66 covers the orthogonal axis: N independent tenants — different prompts, different KV caches, no shared state — share ONE GPU forward pass, each emitting one token per cycle.
Without this primitive, fabric-leased inference can only serve one request per GPU per decode forward. With it, the GPU’s per-forward cost amortizes across N tenants and tail-latency stays bounded by the slowest tenant in the batch, not by the queue depth. This is the substrate for production LLM serving on shared hardware.
The contract: each tenant’s emitted token must be byte-identical to
the token that tenant would have received from a serial per-tenant decode_one
call. No cross-tenant leakage in KV state, attention pattern,
softmax normalizer, or logits. The fabric does the multiplexing;
per-tenant correctness is preserved.
What You Build
A multi-tenant batched-decode pin: load N=4 distinct prompts into 4
sequence handles on one paged engine, run N decode steps two ways
(serially via decode_one per tenant, and batched via
decode_step of all 4 handles at once), and assert the two
per-tenant emit sequences are token-for-token identical.
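The shape of that check can be sketched without a GPU. The following is a toy stand-in, not the engine's API: `ToySeq`, `toy_decode_one`, `toy_decode_step`, `run_serial_toy`, and `run_batched_toy` are hypothetical names, and the "model" is a deterministic hash over each sequence's own token history. It only illustrates what "token-for-token identical" means when each tenant's next token depends on that tenant's state alone.

```rust
/// Toy per-tenant sequence state: just the token history (a hypothetical
/// stand-in for a real sequence handle and its KV cache).
#[derive(Clone, Debug)]
pub struct ToySeq {
    tokens: Vec<u32>,
}

impl ToySeq {
    pub fn from_prompt(prompt: &[u32]) -> Self {
        Self { tokens: prompt.to_vec() }
    }
}

/// Deterministic toy "model": the next token depends only on this
/// sequence's own history (FNV-1a hash, reduced to a 1000-token vocabulary).
fn toy_next_token(history: &[u32]) -> u32 {
    let mut h: u32 = 2166136261; // FNV-1a offset basis
    for &t in history {
        h ^= t;
        h = h.wrapping_mul(16777619); // FNV-1a prime
    }
    h % 1000
}

/// Serial path: advance one sequence by one token.
pub fn toy_decode_one(seq: &mut ToySeq) -> u32 {
    let tok = toy_next_token(&seq.tokens);
    seq.tokens.push(tok);
    tok
}

/// Batched path: advance every sequence by one token in a single call.
pub fn toy_decode_step(seqs: &mut [ToySeq]) -> Vec<u32> {
    // Compute all next tokens first, then append, mimicking one shared forward.
    let toks: Vec<u32> = seqs.iter().map(|s| toy_next_token(&s.tokens)).collect();
    for (s, &t) in seqs.iter_mut().zip(&toks) {
        s.tokens.push(t);
    }
    toks
}

/// Per-tenant emit sequences from the serial path.
pub fn run_serial_toy(prompts: &[Vec<u32>], steps: usize) -> Vec<Vec<u32>> {
    prompts
        .iter()
        .map(|p| {
            let mut s = ToySeq::from_prompt(p);
            (0..steps).map(|_| toy_decode_one(&mut s)).collect()
        })
        .collect()
}

/// Per-tenant emit sequences from the batched path.
pub fn run_batched_toy(prompts: &[Vec<u32>], steps: usize) -> Vec<Vec<u32>> {
    let mut seqs: Vec<ToySeq> = prompts.iter().map(|p| ToySeq::from_prompt(p)).collect();
    let mut out = vec![Vec::with_capacity(steps); prompts.len()];
    for _ in 0..steps {
        for (i, t) in toy_decode_step(&mut seqs).into_iter().enumerate() {
            out[i].push(t);
        }
    }
    out
}
```

Calling `run_serial_toy` and `run_batched_toy` with the same prompts and step count and comparing the results with `assert_eq!` mirrors the structure of the real pin; the real test swaps the toy functions for engine calls.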
Building Blocks
- FabricNativeCudaEngine::new_paged(entropy, pool_size): the paged-cache engine constructor; required because batched decode uses block-table indirection for per-sequence KV.
- engine.create_sequence(&model_handle, max_kv): per-tenant sequence handle, independent KV cache slot.
- engine.decode_step(handles, samplings): the batched decode entrypoint; takes N mutable handles, returns N emitted tokens.
- paged_flash_attention_multi_head_fused_q_f32_cuda: the FA kernel with per-sequence block-table indirection. Reads K/V from [n_kv_heads, max_seq_len, head_dim] pools through each sequence’s k_block_indices_dev / v_block_indices_dev.
- paged_kv_rope_v_append_batched: the batched K/V append kernel that writes each sequence’s new K/V row to its own block slot.
Design
Resource Model
The paged engine owns ONE K/V pool of shape [n_kv_heads, max_seq_len, head_dim] (per layer). Each tenant’s sequence handle holds:
- A per-sequence position counter (kv_position).
- A per-sequence k_block_indices_dev / v_block_indices_dev buffer mapping logical position → physical block slot in the shared pool.
- A per-sequence Q-bias / RoPE configuration (typically shared across tenants of the same model, but distinct handles).
The pool is allocated once per engine; tenants lease block slots through the pool’s allocator. Slots are released when the sequence handle drops.
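The lease-and-release lifecycle described above can be sketched on the host side as follows. This is an illustration under stated assumptions, not the engine's implementation: `SketchPool`, `SketchSeq`, and the `BLOCK_TOKENS` value of 16 are hypothetical, and only the bookkeeping is modeled (no device buffers).

```rust
use std::cell::RefCell;
use std::rc::Rc;

/// Token rows per block: an assumed value for illustration only.
pub const BLOCK_TOKENS: usize = 16;

/// Shared pool of physical block slots (bookkeeping only).
pub struct SketchPool {
    free: RefCell<Vec<u32>>,
}

impl SketchPool {
    pub fn new(pool_size: u32) -> Rc<Self> {
        // Reversed so that pop() hands out slot 0 first.
        Rc::new(Self { free: RefCell::new((0..pool_size).rev().collect()) })
    }

    pub fn free_blocks(&self) -> usize {
        self.free.borrow().len()
    }
}

/// Per-sequence state: a position counter plus a block table.
pub struct SketchSeq {
    pool: Rc<SketchPool>,
    pub kv_position: usize,
    pub block_table: Vec<u32>, // logical block index -> physical slot
}

impl SketchSeq {
    pub fn new(pool: &Rc<SketchPool>) -> Self {
        Self { pool: Rc::clone(pool), kv_position: 0, block_table: Vec::new() }
    }

    /// Reserve room for one more token, leasing a new block when the current
    /// one is full. Returns (physical slot, offset within the block), or None
    /// when the pool has no free blocks left.
    pub fn append_one(&mut self) -> Option<(u32, usize)> {
        let logical = self.kv_position / BLOCK_TOKENS;
        if logical == self.block_table.len() {
            let slot = self.pool.free.borrow_mut().pop()?;
            self.block_table.push(slot);
        }
        let loc = (self.block_table[logical], self.kv_position % BLOCK_TOKENS);
        self.kv_position += 1;
        Some(loc)
    }
}

impl Drop for SketchSeq {
    fn drop(&mut self) {
        // Return every leased slot to the shared pool.
        self.pool.free.borrow_mut().extend(self.block_table.drain(..));
    }
}
```

Two `SketchSeq` values created from the same pool advance their `kv_position` counters independently and never receive the same physical slot; dropping one returns its slots to the free list.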
Batched Forward Geometry
For N=4 sequences each contributing 1 token to the next forward:
- Embedding lookup: [N, hidden] (one row per tenant).
- Q/K/V matmuls: same shape, computed once for all N tenants.
- Per-sequence FA: ONE multi-head FA kernel launches with n_seqs=N; each thread block handles one (sequence, head) pair, reading K/V through that sequence’s block-table indirection. No cross-tenant memory access.
- K/V append: per-sequence write to each sequence’s block slot.
- FFN: shared across all N tenants (no per-sequence state).
- LM head: [N, vocab]; sample independently per tenant.
The substrate that makes this safe is the block-table indirection in the FA + K/V append kernels — every K/V read is gated by the tenant’s own indices, so cross-tenant contamination is structurally impossible.
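A CPU reference makes the indirection argument concrete. The sketch below is a simplified model, not the CUDA kernel: it uses a single head, one pool row per token, and hypothetical names (`SeqView`, `attend_one`, `attend_batched`). Each sequence reads K/V only through its own row indices, and the softmax normalizer is computed per sequence.

```rust
/// One tenant's view into the shared pool for this decode step.
pub struct SeqView {
    pub q: Vec<f32>,      // this step's query, length head_dim
    pub rows: Vec<usize>, // physical pool row for each logical position
}

/// Single-head attention for one sequence, reading K/V only through `rows`.
fn attend_one(seq: &SeqView, k_pool: &[f32], v_pool: &[f32], head_dim: usize) -> Vec<f32> {
    let scale = 1.0 / (head_dim as f32).sqrt();
    // Scores over this sequence's own rows only.
    let scores: Vec<f32> = seq
        .rows
        .iter()
        .map(|&r| {
            let k = &k_pool[r * head_dim..(r + 1) * head_dim];
            seq.q.iter().zip(k).map(|(a, b)| a * b).sum::<f32>() * scale
        })
        .collect();
    // Numerically stable softmax; the normalizer is per sequence.
    let max = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = scores.iter().map(|s| (s - max).exp()).collect();
    let denom: f32 = exps.iter().sum();
    let mut out = vec![0.0f32; head_dim];
    for (w, &r) in exps.iter().zip(&seq.rows) {
        let v = &v_pool[r * head_dim..(r + 1) * head_dim];
        for (o, x) in out.iter_mut().zip(v) {
            *o += (w / denom) * x;
        }
    }
    out
}

/// "Batched" attention: one call covering all sequences, one output row each.
pub fn attend_batched(
    seqs: &[SeqView],
    k_pool: &[f32],
    v_pool: &[f32],
    head_dim: usize,
) -> Vec<Vec<f32>> {
    seqs.iter().map(|s| attend_one(s, k_pool, v_pool, head_dim)).collect()
}
```

Because `attend_one` never touches a row outside `seq.rows`, overwriting another tenant's rows in the pool leaves this tenant's output bit-for-bit unchanged, which is the property the equivalence pin checks end to end.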
Isolation and Safety
- Per-sequence block-table indirection in FA prevents cross-tenant K/V reads.
- Per-sequence KV positions advance independently — tenant A at position 47 doesn’t pollute tenant B at position 14.
- The shared K/V pool is allocated through a quota-aware allocator (see Recipe 26 for the multi-tenant preemption layer); the per-tenant max-blocks quota prevents one tenant from starving the pool.
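The quota rule in the last bullet can be illustrated with a small host-side allocator. This is a sketch under assumptions, not the allocator from Recipe 26: `QuotaPool`, `LeaseError`, and the integer tenant ids are hypothetical names introduced here.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
pub enum LeaseError {
    QuotaExceeded, // the tenant already holds its max-blocks quota
    OutOfBlocks,   // the pool has no free blocks at all
}

/// Shared block pool with a per-tenant cap on leased blocks.
pub struct QuotaPool {
    free: Vec<u32>,
    max_blocks_per_tenant: usize,
    held: HashMap<u32, Vec<u32>>, // tenant id -> leased block slots
}

impl QuotaPool {
    pub fn new(pool_size: u32, max_blocks_per_tenant: usize) -> Self {
        Self {
            free: (0..pool_size).rev().collect(),
            max_blocks_per_tenant,
            held: HashMap::new(),
        }
    }

    /// Lease one block for `tenant`, enforcing the quota before touching the pool.
    pub fn lease(&mut self, tenant: u32) -> Result<u32, LeaseError> {
        let held = self.held.entry(tenant).or_default();
        if held.len() >= self.max_blocks_per_tenant {
            return Err(LeaseError::QuotaExceeded);
        }
        let slot = self.free.pop().ok_or(LeaseError::OutOfBlocks)?;
        held.push(slot);
        Ok(slot)
    }

    /// Return all of a tenant's blocks to the free list.
    pub fn release_all(&mut self, tenant: u32) {
        if let Some(blocks) = self.held.remove(&tenant) {
            self.free.extend(blocks);
        }
    }
}
```

Checking the quota before popping from the free list means a tenant at its cap gets `QuotaExceeded` even when the pool still has free blocks, so it cannot crowd out the others.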
Walkthrough (Implementation Sketch)
1. Create N Sequences From One Engine
let entropy: Arc<dyn EntropySource> = Arc::new(FixedEntropy(0xDEADBEEF_CAFEBABE));
let mut engine = FabricNativeCudaEngine::new_paged(entropy, 256);
let model = engine.load_model(&mut source, cfg).await?;

let prompts = [
    tokenizer.encode("Hello, ")?,
    tokenizer.encode("The quick ")?,
    tokenizer.encode("In a ")?,
    tokenizer.encode("Once upon ")?,
];
let n_seqs = prompts.len();

let mut handles: Vec<FabricNativeCudaSequenceHandle> = (0..n_seqs)
    .map(|_| engine.create_sequence(&model, 2048).unwrap())
    .collect();
for (h, p) in handles.iter_mut().zip(&prompts) {
    engine.submit_prompt(h, p)?;
}

2. Decode All N Tenants Per Step

let samplings: Vec<SamplingParams> = (0..n_seqs).map(|_| SamplingParams::greedy(1)).collect();

for step in 0..n_decode_steps {
    let mut handle_refs: Vec<&mut FabricNativeCudaSequenceHandle> = handles.iter_mut().collect();
    let toks = engine.decode_step(&mut handle_refs, &samplings).await?;
    // toks[i] is tenant i's emitted token at this step.
}

3. Equivalence Pin (Test Harness)

// Run 1: batched.
let batched_tokens = run_batched(&model, &prompts, n_decode_steps).await;

// Run 2: serial (one sequence at a time on fresh engine).
let serial_tokens = run_serial(&model, &prompts, n_decode_steps).await;

// Assert per-tenant equivalence.
for (seq_i, (batched_seq, serial_seq)) in batched_tokens.iter().zip(&serial_tokens).enumerate() {
    assert_eq!(
        batched_seq, serial_seq,
        "seq {} divergence: batched={:?} serial={:?}",
        seq_i, batched_seq, serial_seq,
    );
}

Verification
Silicon evidence (AWS L4, 2026-05-23):
running 1 test
test cuda_engine_e2e_batched_decode_matches_serial_decode_at_n_eq_4 ... \
  batched_decode_matches_serial_decode_at_n_eq_4 PASS: 4 sequences x 10 steps all agree
ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 5.37s

The equivalence pin is GREEN. Every emitted token across 4 tenants × 10 steps agrees between the batched-decode path and the per-tenant serial-decode path. Combined with Recipe 61’s silicon evidence (serial path is byte-identical to llama.cpp), the per-tenant emitted-token correctness contract holds in batched mode too.
Adjacent evidence from the same run exercises the underlying
batched-paged-forward kernel from a different angle: the
speculative_batched_verify_paged_matches_sequential_decode_one
pin (the substrate behind Recipe 65) emits identical token
streams ([2776, 264, 501, 1196, 1588]) from the batched K+1
verify forward and from K+1 sequential decode_one calls. The
shared kernel (paged FA + K/V append) is behaving consistently
across both consumers.
Run the equivalence pin yourself (requires the batched-decode
feature):
FABRIC_TEST_MODEL_PATH=/opt/grafos/models/qwen2.5-0.5b-instruct-q4_k_m.gguf \
cargo test --release \
  -p grafos-inference-engine \
  --features cuda,batched-decode,test-helpers \
  --test cuda_engine_e2e \
  -- --ignored --nocapture --test-threads=1 \
  cuda_engine_e2e_batched_decode_matches_serial_decode_at_n_eq_4 \
  cuda_engine_e2e_batched_decode_variable_kv_position_matches_serial \
  cuda_engine_e2e_batched_decode_throughput_beats_serial_at_n_eq_4

Failure Modes
- Cross-tenant K/V indirection bug. A buggy block-table indirection in FA could read tenant B’s K rows while computing tenant A’s attention. The equivalence pin catches this: tenant A’s emit would differ from its serial-path emit.
- Per-sequence position counter drift. If batched decode advances all sequences by the same kv_position instead of per-tenant counters, sequences with different prompt lengths desynchronize. The cuda_engine_e2e_batched_decode_variable_kv_position_matches_serial pin (which uses prompts of different lengths) catches this.
- Pool starvation. If the K/V pool allocator runs out of free blocks during batched decode, the engine returns OutOfBlocks. The pin sizes the pool generously (pool_size=256 for 4 sequences); production deployments must size based on worst-case concurrent contexts.
- Throughput regression. The cuda_engine_e2e_batched_decode_throughput_beats_serial_at_n_eq_4 pin asserts batched < serial in wall-clock. A regression in the batched kernel or the dispatcher routing M=N to a sub-optimal variant breaks this. Less load-bearing than the equivalence pins, but it catches “we made batched correct but slow.”
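For the pool-starvation case, worst-case sizing reduces to simple arithmetic once the block granularity is known. The helper below is a sketch resting on two assumptions this recipe does not confirm: that pool_size counts blocks, and that each block holds a fixed number of token rows (the hypothetical `block_tokens` parameter). `worst_case_blocks` is an illustrative name, not an engine function.

```rust
/// Worst-case block count if every concurrent sequence fills its maximum
/// context. `block_tokens` is the assumed number of token rows per block.
pub fn worst_case_blocks(max_contexts: &[usize], block_tokens: usize) -> usize {
    assert!(block_tokens > 0, "block_tokens must be positive");
    max_contexts
        .iter()
        // Ceiling division: a partially filled block still occupies a slot.
        .map(|&ctx| (ctx + block_tokens - 1) / block_tokens)
        .sum()
}
```

Under those assumptions, a pool smaller than this sum can hit OutOfBlocks when all tenants reach their maximum context at the same time; a pool at or above it cannot be exhausted by those tenants alone.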
Observability
Each batched decode step emits grafos_observe::Event::BatchedDecodeStep
with {step, n_seqs, emitted_tokens, per_seq_kv_positions}. The
operator dashboard shows live N (how many tenants are sharing) and
per-tenant decoded-token-count.
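A consumer of these events might validate and aggregate them along the following lines. This is a sketch only: `BatchedDecodeStepSketch` is a local stand-in that mirrors the listed field names, the field types are assumptions, and `decoded_counts` further assumes that batch slot i refers to the same tenant in every event.

```rust
/// Local stand-in for the event payload; field types are assumed.
#[derive(Debug, Clone)]
pub struct BatchedDecodeStepSketch {
    pub step: u64,
    pub n_seqs: usize,
    pub emitted_tokens: Vec<u32>,
    pub per_seq_kv_positions: Vec<usize>,
}

impl BatchedDecodeStepSketch {
    /// Both per-sequence vectors should have exactly n_seqs entries.
    pub fn is_well_formed(&self) -> bool {
        self.emitted_tokens.len() == self.n_seqs
            && self.per_seq_kv_positions.len() == self.n_seqs
    }
}

/// Decoded-token count per batch slot across a stream of events.
pub fn decoded_counts(events: &[BatchedDecodeStepSketch]) -> Vec<usize> {
    let width = events.iter().map(|e| e.emitted_tokens.len()).max().unwrap_or(0);
    let mut counts = vec![0usize; width];
    for e in events {
        // Each event contributes one token to every slot it covers.
        for c in counts.iter_mut().take(e.emitted_tokens.len()) {
            *c += 1;
        }
    }
    counts
}
```

A dashboard that tracks tenants by identity rather than by slot index would need a tenant id alongside each entry, which the field list above does not include.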
Variations
- N > 4. All current pins use N=4. Larger N exercises shared-memory pressure in the FA kernel and the K/V pool allocator’s block packing. Production deployments routinely run N=32 or N=64; the substrate supports it but isn’t pinned here.
- Heterogeneous prompts. Mixed-length prompts (one tenant on a 1024-token context, another on a 4-token prompt) exercise the per-sequence position counter. The ..._variable_kv_position_matches_serial pin covers this.
- Continuous batching (admission control). Recipe 26 (multi-tenant preemption) layers a continuous-batch scheduler on top of this primitive: it admits prompts dynamically, packs them into ongoing batches, and evicts on quota. The batched decode step here is the kernel-level substrate; continuous batching is the policy layer above.
- Cross-tenant prefix sharing. When two tenants share a system prompt prefix, a prefix-cache layer can serve them from the same K/V block slots for the prefix portion and switch to per-tenant slots at the divergence point. Filed as a follow-up recipe.
Why This Is Recipe 66
Recipes 61–64 establish single-engine properties. Recipe 65 composes ACROSS engines (draft + target). Recipe 66 composes ACROSS tenants (N sequences in ONE engine’s forward). Together, 65 and 66 establish two-axis composability — the substrate for production LLM serving on fabric-leased hardware.