Recipe 65: Composing Two Fabric-Leased Engines for Speculative Decode
Situation
Single-engine inference (Recipe 61) produces correct tokens but pays a full forward per token. For latency-sensitive workloads (chat, code completion), throughput matters: how do you serve more tokens-per-second without scaling out hardware?
Speculative decode is the standard answer. A draft model
(smaller, cheaper) proposes K candidate tokens; a target model
(larger, more accurate) verifies them in ONE batched forward pass.
Tokens the target’s argmax agrees with are accepted; the first
disagreement rolls back. When the draft and target agree often (a
“high acceptance rate”), throughput approaches K + 1 tokens per
target forward — a substantial speedup.
The challenge: speculative decode is the simplest non-trivial composition of two engines on one fabric. Both engines hold broker-owned weight regions, both pay Recipe 63’s revocation cost, and the protocol between them MUST preserve correctness — accepted tokens must equal what a target-only greedy decode would have emitted.
What You Build
A two-engine speculative decode session:
- Draft engine: Qwen 2.5 0.5B Q4_K_M, fabric-leased, runs K proposal forwards per cycle.
- Target engine: Qwen 2.5 1.5B Q4_K_M, fabric-leased, runs one batched verify forward per cycle on K+1 token positions.
- Acceptance protocol: accept the longest prefix where the target’s argmax matches the draft’s proposed token; emit the accepted prefix plus one extra “free” token from the target’s logits at the first disagreement position; advance both engines’ KV caches by exactly the accepted count.
The contract: end-to-end greedy emit sequence equals target-only greedy decode’s emit sequence. No correctness loss from speculation; only throughput win.
Building Blocks
- Two instances of `FabricNativeCudaEngine` (one draft, one target), each with its own `LeaseRegion`-backed weights.
- `SpeculativeDecodeEngine`: the wrapper that holds both engines and implements the propose/verify/accept loop.
- `forward_prefill_cuda_paged_all_logits`: the target-side batched verify entrypoint (returns logits for K+1 token positions in one forward).
- `paged_flash_attention_multi_head_fused_q_f32_cuda`: the FA kernel the batched verify path uses; correct under `q_start_pos != 0` (recipe-58 batched-verify-causal-fix).
Design
Resource Model
Each engine independently:
- Acquires its own GPU lease + memory regions for weights.
- Runs its own paged KV cache.
- Pays Recipe 63’s intra-kernel revoke cost.
The speculative engine wires them together at the propose/verify/accept boundary. From the fabric’s POV they’re two tenants on potentially-different GPUs (or the same GPU sharing memory pools).
Acceptance Rule (Greedy)
For step i of K proposals:
- Draft emits `proposal[i] = argmax(draft_logits_at_position_i)`.
- Target’s batched verify gives `target_argmax[i] = argmax(target_logits_at_position_i)`.
- If `proposal[i] == target_argmax[i]`, accept. Else, reject from position `i` onward.
- Emit `accepted_prefix + target_argmax[first_disagreement_or_K]`, i.e., `accepted_count + 1` tokens per cycle.
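The rule above can be sketched as a pure function over token ids. This is a minimal sketch, not the wrapper's actual code: `CycleOutcome` and `accept_greedy` are hypothetical names, and it assumes the batched verify has already been reduced to K+1 argmax token ids.

```rust
/// Result of one propose/verify cycle under the greedy acceptance rule.
/// (Hypothetical type, introduced only for this illustration.)
#[derive(Debug, PartialEq)]
struct CycleOutcome {
    /// Number of draft proposals the target agreed with (0..=K).
    accepted: usize,
    /// Tokens emitted this cycle: the accepted prefix plus one target token.
    emitted: Vec<u32>,
}

/// `proposals` holds the K draft tokens; `target_argmax` holds the K + 1
/// argmax tokens from the batched verify, one per verified position.
fn accept_greedy(proposals: &[u32], target_argmax: &[u32]) -> CycleOutcome {
    assert_eq!(target_argmax.len(), proposals.len() + 1);

    // Longest prefix on which draft and target agree.
    let accepted = proposals
        .iter()
        .zip(target_argmax)
        .take_while(|(p, t)| p == t)
        .count();

    // Accepted prefix, then the target's own token at the first
    // disagreement (or at position K when everything was accepted).
    let mut emitted = proposals[..accepted].to_vec();
    emitted.push(target_argmax[accepted]);

    CycleOutcome { accepted, emitted }
}
```

Every cycle emits at least one token, even when the first proposal is rejected, because the target's argmax at the first verified position is always valid output.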
Isolation and Safety
- Both engines’ weights are fabric-leased; Recipe 63’s revoke semantics apply to both. A revoke on the draft engine doesn’t kill the target’s session, and vice versa — the wrapper surfaces the revocation as a typed error.
- The acceptance rule preserves greedy correctness exactly: every emitted token, whether an accepted proposal or the “free” token, is the target’s argmax at its position, so the output is equivalent to running target-only greedy without speculation.
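The equivalence claim can be exercised without a GPU. The toy simulation below replaces both engines with deterministic next-token functions; `target_next`, `draft_next`, `greedy` and `speculative` are all invented for illustration and stand in for real forwards. It is a sketch of the argument, not of the wrapper's implementation.

```rust
/// Toy "target model": a deterministic next-token function of the context.
fn target_next(ctx: &[u32]) -> u32 {
    let sum: u32 = ctx.iter().sum();
    (sum * 31 + ctx.len() as u32) % 50
}

/// Toy "draft model": agrees with the target except when the context
/// length is a multiple of five, which forces periodic rejections.
fn draft_next(ctx: &[u32]) -> u32 {
    let t = target_next(ctx);
    if ctx.len() % 5 == 0 { (t + 1) % 50 } else { t }
}

/// Target-only greedy decode: one token per "forward".
fn greedy(prompt: &[u32], n: usize) -> Vec<u32> {
    let mut ctx = prompt.to_vec();
    for _ in 0..n {
        let t = target_next(&ctx);
        ctx.push(t);
    }
    ctx[prompt.len()..].to_vec()
}

/// Speculative decode with `k` proposals per cycle and greedy acceptance.
fn speculative(prompt: &[u32], n: usize, k: usize) -> Vec<u32> {
    let mut ctx = prompt.to_vec();
    while ctx.len() - prompt.len() < n {
        // Draft proposes k tokens autoregressively.
        let mut draft_ctx = ctx.clone();
        let mut proposals = Vec::with_capacity(k);
        for _ in 0..k {
            let p = draft_next(&draft_ctx);
            proposals.push(p);
            draft_ctx.push(p);
        }
        // Stand-in for the batched verify: target argmax at each of the
        // k + 1 positions, conditioned on the proposals before it.
        let target_argmax: Vec<u32> = (0..=k)
            .map(|i| {
                let mut c = ctx.clone();
                c.extend_from_slice(&proposals[..i]);
                target_next(&c)
            })
            .collect();
        let accepted = proposals
            .iter()
            .zip(&target_argmax)
            .take_while(|(p, t)| p == t)
            .count();
        ctx.extend_from_slice(&proposals[..accepted]);
        ctx.push(target_argmax[accepted]);
    }
    // The last cycle may overshoot n; trim to the requested length.
    ctx[prompt.len()..prompt.len() + n].to_vec()
}
```

For any K, `speculative(&prompt, n, k)` should return exactly the tokens of `greedy(&prompt, n)`: the draft only changes how many target positions are resolved per cycle, never which token is emitted.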
Throughput Model
Each cycle costs: 1 batched target forward + K draft forwards.
With acceptance rate α:
- Tokens emitted per cycle: `α·K + 1`.
- Target-only baseline: 1 token per target forward.
- Speedup: `(α·K + 1) / (1 + K·(t_draft / t_target))`, where `t_draft / t_target` is the cost ratio of one draft forward to one target forward. For Qwen 0.5B / 1.5B at L4 this ratio is roughly 0.3.
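At α = 0.95 and K = 4, with the 0.3 cost ratio above, the model gives (0.95·4 + 1) / (1 + 4·0.3) = 4.8 / 2.2 ≈ 2.2× over target-only greedy; the ratio would need to be near 0.2 to reach 2.7×.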
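The throughput model is small enough to evaluate directly. The sketch below encodes the speedup formula as written in this section; the function names are hypothetical, and the break-even helper follows from setting the numerator equal to the denominator (α·K + 1 = 1 + K·r gives α = r).

```rust
/// Modelled speedup over target-only greedy decode.
/// `alpha`: acceptance rate, `k`: proposals per cycle,
/// `cost_ratio`: t_draft / t_target.
fn speedup(alpha: f64, k: u32, cost_ratio: f64) -> f64 {
    let k = f64::from(k);
    (alpha * k + 1.0) / (1.0 + k * cost_ratio)
}

/// Under this model speculation beats the baseline exactly when the
/// acceptance rate exceeds the draft/target cost ratio, independent of K.
fn speculation_pays_off(alpha: f64, cost_ratio: f64) -> bool {
    alpha > cost_ratio
}
```

Calling `speedup(0.95, 4, 0.3)` reproduces the worked case for this recipe's model pair. Sweeping `k` for a measured `alpha` is a cheap way to pick K before touching hardware, bearing in mind that the model treats α as constant across proposal positions.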
Walkthrough (Implementation Sketch)
1. Acquire Both Engines

    let draft = FabricNativeCudaEngine::new_paged(entropy.clone(), 1024);
    let target = FabricNativeCudaEngine::new_paged(entropy.clone(), 1024);

    let draft_model = draft.load_model(&mut draft_src, cfg.clone()).await?;
    let target_model = target.load_model(&mut target_src, cfg).await?;

2. Wire into SpeculativeDecodeEngine

    let mut spec = SpeculativeDecodeEngine::new(
        draft, draft_model,
        target, target_model,
        SpecConfig { k: 4, /* ... */ },
    );
    let mut seq = spec.create_sequence(1024)?;
    spec.submit_prompt(&mut seq, &prompt_ids)?;

3. Decode Loop

    let sampling = SamplingParams::greedy(1);
    for _ in 0..n_tokens {
        let emitted = spec.decode_one(&mut seq, &sampling).await?;
        sink(emitted);
    }

4. Internal: One Cycle

    // Pseudocode for the wrapper's per-cycle work.
    let proposals: [u32; K] = (0..K).map(|_| draft.decode_one(...)).collect();
    let target_logits = target.forward_prefill_paged_all_logits(
        &[last_emitted, proposals[0], ..., proposals[K-1]],
    ).await?;
    let target_argmax = target_logits.iter().map(|l| l.argmax()).collect();
    let accepted = proposals.iter().zip(&target_argmax)
        .take_while(|(p, t)| p == t).count();
    emit_n_tokens(&proposals[..accepted]);
    emit_one_token(target_argmax[accepted]); // "free" token at first disagreement
    draft.rollback_kv(K - accepted);         // un-advance rejected positions
    target.rollback_kv(K - accepted);

Verification (Silicon)
On AWS L4 (i-0b00f3d6f383200d4, 2026-05-23, commit b065c11c):

    test speculative_batched_verify_paged_matches_sequential_decode_one ... ok
      sequential a_tokens: [2776, 264, 501, 1196, 1588]
      batched b_tokens: [2776, 264, 501, 1196, 1588]
    test speculative_decode_accept_rate_in_reasonable_range ... ok
      DIAG spec step 0: proposals=[11, 323, 279, 3974] accept_count=0
      DIAG spec step 1: proposals=[1986, 374, 264, 11416] accept_count=0
      DIAG spec step 2: proposals=[3974, 13876, 38835, 34208] accept_count=4
      DIAG spec step 3: proposals=[279, 15678, 5562, 11] accept_count=3
      DIAG spec step 4: proposals=[785, 3974, 13876, 38835] accept_count=4
      recent_accept_rate after 50 steps: 0.955
    test speculative_decode_greedy_matches_target_only_greedy ... ok
    test speculative_decode_kv_position_advances_by_emitted_count ... ok

    test result: ok. 4 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 237.40s
    STEP 7 EXIT: 0

Four pins, all green:
| Pin | Claim |
|---|---|
| `speculative_batched_verify_paged_matches_sequential_decode_one` | Batched K+1 verify forward produces the same tokens as K+1 sequential `decode_one` calls. |
| `speculative_decode_accept_rate_in_reasonable_range` | Over 50 steps, draft argmax matches target argmax at 95.5% of proposed positions. |
| `speculative_decode_greedy_matches_target_only_greedy` | End-to-end emit sequence equals target-only greedy. No quality loss. |
| `speculative_decode_kv_position_advances_by_emitted_count` | KV caches advance by the accepted-token count, not the proposed-token count. Cache state stays sound under rejection. |
Run the pins yourself (requires BOTH GGUFs):

    FABRIC_TEST_MODEL_PATH_DRAFT=/opt/grafos/models/qwen2.5-0.5b-instruct-q4_k_m.gguf \
    FABRIC_TEST_MODEL_PATH_TARGET=/opt/grafos/models/qwen2.5-1.5b-instruct-q4_k_m.gguf \
    cargo test --release \
        -p grafos-inference-engine \
        --features cuda,speculative-decode,test-helpers \
        --test cuda_engine_e2e \
        -- --ignored --nocapture --test-threads=1 \
        speculative_decode_greedy_matches_target_only_greedy \
        speculative_decode_kv_position_advances_by_emitted_count \
        speculative_decode_accept_rate_in_reasonable_range \
        speculative_batched_verify_paged_matches_sequential_decode_one

`--test-threads=1` is mandatory on g6.xlarge (16 GiB RAM): 3 pins × 2 engines × ~1.5 GiB each exhausts host memory.
Failure Modes
- Causal mask drift between batched and sequential paths. The paged batched-verify FA kernel masks at `kg > q_start_pos + q_row_local`. If a future kernel change in either the contig or paged path reverts to local-only Q indexing, batched vs sequential will diverge. Recipe 63’s `cuda_engine_e2e_intra_kernel_revoke_bounds` pin will continue to pass; only this recipe’s first pin will catch the divergence.
- KV cache rollback bugs. If `rollback_kv(K - accepted)` un-advances by the wrong count, subsequent decodes operate against stale KV positions. The `speculative_decode_kv_position_advances_by_emitted_count` pin catches this.
- Low acceptance rate (≪0.5) makes speculation slower than baseline. A poor draft/target affinity (different architectures, different tokenizers, very different sizes) produces α near 0; speculation pays the draft cost per cycle for no throughput gain. Use the `recent_accept_rate` event to detect and disable speculation dynamically.
- Per-engine revoke during a cycle. If the draft revokes mid-cycle, the target’s verify has nothing to verify against. The wrapper bails with the revocation error; the application must re-acquire (Recipe 64).
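The rollback failure mode comes down to position arithmetic, which can be checked in isolation. The sketch below models only the target side and rests on two assumptions taken from the per-cycle pseudocode: the verify forward writes K + 1 KV positions, and `rollback_kv(n)` retreats by `n`. `KvCursor` and `target_cycle` are hypothetical names, not engine APIs.

```rust
/// Hypothetical stand-in for one engine's KV-cache write position.
#[derive(Debug)]
struct KvCursor {
    pos: usize,
}

impl KvCursor {
    /// Advance after a forward that wrote `n` positions.
    fn advance(&mut self, n: usize) {
        self.pos += n;
    }

    /// Un-advance `n` positions; refuses to move before the cache start.
    fn rollback(&mut self, n: usize) -> Result<(), String> {
        if n > self.pos {
            return Err(format!("rollback {n} exceeds position {}", self.pos));
        }
        self.pos -= n;
        Ok(())
    }
}

/// Target-side bookkeeping for one cycle: the verify forward writes K + 1
/// positions, then the rejected tail (K - accepted) is rolled back.
/// Returns the net advance for the cycle.
fn target_cycle(cursor: &mut KvCursor, k: usize, accepted: usize) -> Result<usize, String> {
    if accepted > k {
        return Err(format!("accepted {accepted} exceeds K = {k}"));
    }
    let before = cursor.pos;
    cursor.advance(k + 1);
    cursor.rollback(k - accepted)?;
    Ok(cursor.pos - before)
}
```

Under these assumptions the net target advance is `accepted + 1`, which is the number of tokens emitted in the cycle. An off-by-one in either step shows up immediately as a mismatch between the cursor and the emitted count.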
Observability
Each cycle emits grafos_observe::Event::SpecCycle with
{step, proposals, target_argmax, accepted_count, emitted_count}.
Aggregating over many cycles produces the live acceptance_rate
signal a scheduler can use to start/stop speculation per tenant.
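One way to turn per-cycle events into that signal is a sliding window over recent cycles. The sketch below is an assumption-laden illustration: `CycleStats` is a hypothetical reduction of the event to the two fields the rate needs, and `AcceptWindow` is not part of `grafos_observe`.

```rust
use std::collections::VecDeque;

/// Hypothetical per-cycle record: proposals made and proposals accepted.
#[derive(Clone, Copy)]
struct CycleStats {
    proposed: usize,
    accepted: usize,
}

/// Sliding-window acceptance rate over the most recent `capacity` cycles.
struct AcceptWindow {
    capacity: usize,
    cycles: VecDeque<CycleStats>,
}

impl AcceptWindow {
    fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "window must hold at least one cycle");
        Self { capacity, cycles: VecDeque::with_capacity(capacity) }
    }

    /// Record one cycle, evicting the oldest when the window is full.
    fn record(&mut self, stats: CycleStats) {
        if self.cycles.len() == self.capacity {
            self.cycles.pop_front();
        }
        self.cycles.push_back(stats);
    }

    /// Accepted proposals divided by proposed tokens; None until data exists.
    fn rate(&self) -> Option<f64> {
        let proposed: usize = self.cycles.iter().map(|c| c.proposed).sum();
        if proposed == 0 {
            return None;
        }
        let accepted: usize = self.cycles.iter().map(|c| c.accepted).sum();
        Some(accepted as f64 / proposed as f64)
    }

    /// Keep speculating while the rate is unknown or above `threshold`.
    fn should_speculate(&self, threshold: f64) -> bool {
        self.rate().map_or(true, |r| r > threshold)
    }
}
```

Feeding in the five diagnostic cycles from the verification log (accept counts 0, 0, 4, 3, 4 with K = 4) gives 11 of 20 proposals accepted. Short windows are noisy, so a scheduler would likely want a window of tens of cycles before toggling speculation.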
Variations
- Top-K / top-P sampling instead of greedy. Speculative decode with stochastic sampling requires a different acceptance rule (matching the underlying distribution via rejection sampling). Filed as a follow-up; not yet implemented.
- N-way speculation (N drafts, 1 target). Multiple drafts propose in parallel; target verifies all proposals in one batched forward. Higher acceptance via diversity, higher draft cost.
- Cross-tenant speculation sharing. Multiple tenants share the same target engine; their drafts run in their own leases. The scheduler routes proposals to the shared target’s batched verify queue.
- Heterogeneous quant draft/target. Both Qwens above are Q4_K_M. Cross-quant speculation (e.g., Q4_K_M draft + Q8_0 target) works in principle but isn’t pinned here.
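For the stochastic-sampling variation, the acceptance rule commonly described for speculative sampling accepts a draft token with probability min(1, p/q) and otherwise resamples from the normalized positive part of p − q, where p and q are the target and draft distributions at that position. The sketch below illustrates that published rule only; it is not this project's implementation (which the text says does not exist yet), the function names are hypothetical, and the uniform draw `u` is passed in so the example needs no RNG crate.

```rust
/// Accept the draft token `x` with probability min(1, p[x] / q[x]).
/// `p`: target probabilities, `q`: draft probabilities,
/// `u`: a uniform random draw in [0, 1) supplied by the caller.
fn accept_stochastic(p: &[f64], q: &[f64], x: usize, u: f64) -> bool {
    // The draft sampled `x`, so q[x] should be positive; guard anyway.
    q[x] > 0.0 && u < (p[x] / q[x]).min(1.0)
}

/// Distribution to resample from after a rejection:
/// normalize(max(0, p - q)). Returns None when no mass remains (p == q).
fn residual(p: &[f64], q: &[f64]) -> Option<Vec<f64>> {
    let raw: Vec<f64> = p
        .iter()
        .zip(q)
        .map(|(pi, qi)| (pi - qi).max(0.0))
        .collect();
    let total: f64 = raw.iter().sum();
    if total <= 0.0 {
        return None;
    }
    Some(raw.into_iter().map(|r| r / total).collect())
}
```

With one-hot distributions (all mass on each model's argmax) the rule reduces to the greedy comparison used in this recipe: acceptance becomes certain when the argmaxes match and impossible otherwise, and the residual is the target's argmax.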
Why This Is Recipe 65
Recipes 61–64 establish that a single engine is correct, efficient, and revocable. Recipe 65 establishes that two engines compose into a higher-level inference primitive without losing those properties. Recipe 66 establishes the orthogonal composition (N tenants in ONE engine’s forward). Together they show fabricBIOS’ inference primitive composes across two orthogonal dimensions.