
Recipe 63: Handling Mid-Kernel Lease Revocation in a Decode Loop

Situation

A fabric resource isn’t really a fabric resource if the broker can only reclaim it between top-level RPCs. An LLM forward pass issues hundreds of kernel launches per token (matmul, RoPE, norm, flash attention, softmax, lookup, repeated across N layers and every decode-loop iteration). If revocation only took effect between full forward passes, a hung tenant would block reclaim for the entire forward window, so the worst-case latency to free memory would be bounded by the slowest tenant’s slowest forward pass, not by the broker’s policy.

The pattern: every kernel-launch site polls the revocation token before dispatching. The poll is a single Relaxed-ordered AtomicBool load — cheap enough to inline at every launch site — and returns a typed error if the broker has flipped the token. The engine never starts a kernel against memory the broker considers reclaimed; the worst-case bound is one kernel-launch latency (tens of µs), not one forward pass.

What You Build

A decode loop that observes lease revocation at every kernel-launch boundary, returning a typed CudaForwardError::Revoked to the caller within tens of microseconds of the broker’s token flip — even if the flip happens mid-token, between two adjacent matmul launches. Recipe 63 covers the speed of the bail; Recipe 64 covers the state after.

Building Blocks

  • grafos_tensor_kernels_cuda::lease_region::LeaseRegion::CudaDevice — broker-owned device memory with a revocation token.
  • grafos_inference_engine::cuda_forward::poll_revoke_or_return! — the inline macro at every kernel-launch site.
  • LeaseRevocationView — a cheap-to-clone read-side handle the forward path holds for the duration of one decode call.
  • CudaForwardError::Revoked — the typed error variant the macro returns when the token has been flipped.


Design

Resource Model

Each weight tensor sits behind a LeaseRegion::CudaDevice carrying a RevocationToken. The broker holds one side of the token; the engine holds a LeaseRevocationView (a read-side handle) for the duration of any forward pass. Flipping the broker side makes the view’s atomic load return “revoked” — no IPC, no syscall, just a cache-line read.
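The token/view split described above can be sketched with a shared `AtomicBool`. This is a minimal illustration, not the crate's actual definitions: the field layout, the `u64` lease id, the `view()` constructor, and the `Release` ordering on the broker-side store are all assumptions made for the example. The `Drop` impl models one way the broker side could invalidate outstanding views when the lease goes away.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Broker-side handle (hypothetical shape): the only side that may flip the flag.
pub struct RevocationToken {
    lease_id: u64, // assumed id type
    revoked: Arc<AtomicBool>,
}

/// Engine-side read-only handle. Cloning copies an id and bumps an Arc refcount.
#[derive(Clone)]
pub struct LeaseRevocationView {
    lease_id: u64,
    revoked: Arc<AtomicBool>,
}

impl RevocationToken {
    pub fn new(lease_id: u64) -> Self {
        Self { lease_id, revoked: Arc::new(AtomicBool::new(false)) }
    }

    /// Hand out a read-side view that shares the same flag (hypothetical name).
    pub fn view(&self) -> LeaseRevocationView {
        LeaseRevocationView { lease_id: self.lease_id, revoked: Arc::clone(&self.revoked) }
    }

    /// Flip the flag. Release is a conservative assumption for the store side;
    /// the recipe only specifies the ordering of the engine-side load.
    pub fn revoke(&self) {
        self.revoked.store(true, Ordering::Release);
    }
}

impl Drop for RevocationToken {
    /// Dropping the broker side leaves every outstanding view reading "revoked".
    fn drop(&mut self) {
        self.revoked.store(true, Ordering::Release);
    }
}

impl LeaseRevocationView {
    /// The poll: a single Relaxed load, no IPC and no syscall.
    pub fn is_revoked(&self) -> bool {
        self.revoked.load(Ordering::Relaxed)
    }

    pub fn lease_id(&self) -> u64 {
        self.lease_id
    }
}
```

Because the view holds only an `Arc` and an integer, cloning one per forward call costs a reference-count increment, which is what makes "cheap-to-clone" plausible in this shape.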

Poll Placement

poll_revoke_or_return! appears immediately before every kernel launch in forward_impl_with_shape, forward_impl_paged, and batched_decode_paged_*. There are ~30 sites per forward. The contract: at every site, EITHER the launch proceeds (token live) OR the macro returns CudaForwardError::Revoked without launching. No site reads/writes device memory between the poll and the launch that could race with the broker.

Isolation and Safety

  • The view is cloned-out per forward call from the engine’s broker handle. It cannot outlive the lease — the broker-side Drop invalidates it.
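  • The poll uses Relaxed ordering. This is sufficient because the poll only gates the decision to launch; it does not publish or acquire any other data that must be ordered with the flag. The contract is “if the flip is visible at this poll, bail”, and a flip that is not yet visible at one site is caught by a later poll, so full synchronization isn’t required.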
  • The macro returns BEFORE launching. There is no in-flight kernel against memory the broker considers reclaimed.

Walkthrough (Implementation Sketch)

1. Engine-Side Macro at Every Launch

// From cuda_forward.rs — repeated ~30 times per forward pass.
poll_revoke_or_return!(revoke_view);
flash_attention_multi_head_fused_q_tc_f32_cuda(/* args */)?;
poll_revoke_or_return!(revoke_view);
rmsnorm_f32_cuda(/* args */)?;
poll_revoke_or_return!(revoke_view);
dispatch_matmul_f32_b_kn_cuda(/* args */)?;

2. Macro Expansion

macro_rules! poll_revoke_or_return {
    ($view:expr) => {
        if $view.is_revoked() {
            return Err(CudaForwardError::Revoked {
                lease_id: $view.lease_id(),
            });
        }
    };
}
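To see the poll-before-launch contract end to end without a GPU, the macro can be exercised against stub kernels. The sketch below is an assumption-heavy stand-in: `stub_kernel`, `forward`, `demo`, the four-layer loop, and the `u64` lease id are hypothetical, and the stub flips the flag itself at a chosen launch to simulate the broker revoking mid-forward.

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::sync::Arc;

#[derive(Debug, PartialEq)]
pub enum CudaForwardError {
    Revoked { lease_id: u64 },
}

#[derive(Clone)]
pub struct LeaseRevocationView {
    lease_id: u64,
    revoked: Arc<AtomicBool>,
}

impl LeaseRevocationView {
    pub fn is_revoked(&self) -> bool { self.revoked.load(Ordering::Relaxed) }
    pub fn lease_id(&self) -> u64 { self.lease_id }
}

macro_rules! poll_revoke_or_return {
    ($view:expr) => {
        if $view.is_revoked() {
            return Err(CudaForwardError::Revoked { lease_id: $view.lease_id() });
        }
    };
}

/// Stand-in for a kernel launch. Counts launches and, on launch number
/// `flip_at`, flips the flag to simulate a broker revoke arriving mid-forward.
fn stub_kernel(launches: &AtomicUsize, flag: &AtomicBool, flip_at: usize) -> Result<(), CudaForwardError> {
    let n = launches.fetch_add(1, Ordering::Relaxed) + 1;
    if n == flip_at {
        flag.store(true, Ordering::Release);
    }
    Ok(())
}

/// Hypothetical forward pass: 4 layers x 3 launches, each preceded by a poll.
pub fn forward(view: &LeaseRevocationView, launches: &AtomicUsize, flip_at: usize) -> Result<(), CudaForwardError> {
    for _layer in 0..4 {
        poll_revoke_or_return!(view);
        stub_kernel(launches, &view.revoked, flip_at)?; // "attention"
        poll_revoke_or_return!(view);
        stub_kernel(launches, &view.revoked, flip_at)?; // "norm"
        poll_revoke_or_return!(view);
        stub_kernel(launches, &view.revoked, flip_at)?; // "matmul"
    }
    Ok(())
}

/// Runs one forward pass and reports the result plus how many launches happened.
pub fn demo(flip_at: usize) -> (Result<(), CudaForwardError>, usize) {
    let view = LeaseRevocationView { lease_id: 42, revoked: Arc::new(AtomicBool::new(false)) };
    let launches = AtomicUsize::new(0);
    let result = forward(&view, &launches, flip_at);
    (result, launches.load(Ordering::Relaxed))
}
```

With `demo(5)` the flag flips during the fifth launch, the sixth poll returns `Revoked`, and the launch counter stops at 5. With `demo(0)` the flag never flips and all 12 launches run.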

3. Broker-Side Revoke

// In application code (the broker), e.g. on TTL expiry or tenant kill.
let lease_handle = /* obtained at lease creation */;
lease_handle.revoke(); // flips the AtomicBool the engine view reads.
// The next poll site the engine reaches (mid-forward, or at the start of
// its next forward call if it is idle) returns CudaForwardError::Revoked.
// No further kernels launch.

Verification (Silicon)

On AWS L4 (i-0b00f3d6f383200d4, 2026-05-23, commit b065c11c):

--- Step 6: intra-kernel-revoke wall-clock (THE categorical claim) ---
revoke_sync wall-clock: 0.00 ms
test cuda_engine_e2e_intra_kernel_revoke_bounds_revoke_wall_clock ... ok
STEP 6 EXIT: 0

0.00 ms is the elapsed time measured with std::time::Instant and printed in milliseconds to two decimal places. The actual revoke-to-bail interval is below that printed resolution; by design it is bounded above by one kernel-launch latency (tens of µs for the small kernels in a 0.5B decode path).
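To report the interval at a resolution where it is visible, the same measurement can be printed in microseconds. The sketch below is a CPU-only approximation and not the test from the repository: `measure_revoke_to_bail` and `format_micros` are hypothetical names, the "engine" is a thread spinning on the flag in place of real kernel launches, and the 5 ms warm-up sleep is arbitrary.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::{Duration, Instant};

/// Measures the interval between the broker-side flip and the moment the
/// polling thread observes it. A spin loop stands in for kernel launches.
pub fn measure_revoke_to_bail() -> Duration {
    let revoked = Arc::new(AtomicBool::new(false));
    let engine_flag = Arc::clone(&revoked);

    let engine = thread::spawn(move || loop {
        // The poll: one Relaxed load per "launch site".
        if engine_flag.load(Ordering::Relaxed) {
            return Instant::now();
        }
        std::hint::spin_loop(); // stand-in for a short kernel launch
    });

    // Let the engine thread reach its polling loop before revoking.
    thread::sleep(Duration::from_millis(5));
    let flipped_at = Instant::now();
    revoked.store(true, Ordering::Release);

    let bailed_at = engine.join().expect("engine thread panicked");
    // Saturating, so a (theoretical) inverted reading yields zero, not a panic.
    bailed_at.saturating_duration_since(flipped_at)
}

/// Formats a duration in microseconds with one decimal place.
pub fn format_micros(d: Duration) -> String {
    format!("{:.1} µs", d.as_secs_f64() * 1e6)
}
```

A 4 µs interval formatted as `{:.2} ms` prints `0.00 ms`, which is why a millisecond-formatted line cannot distinguish a microsecond-scale bail from an instantaneous one.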

Run the pin yourself:

FABRIC_TEST_MODEL_PATH=/opt/grafos/models/qwen2.5-0.5b-instruct-q4_k_m.gguf \
  cargo test --release \
  -p grafos-inference-engine \
  --features cuda,test-helpers \
  --test cuda_engine_e2e \
  -- --ignored --nocapture \
  cuda_engine_e2e_intra_kernel_revoke_bounds

Failure Modes

  • Missed poll site. A new kernel-launch site without the macro does not observe the flip: its kernel launches anyway, and revocation takes effect only when the next poll-protected site runs. Symptom: rare, near-impossible-to-reproduce races where the engine returns one extra token after revocation. Mitigation: every kernel-launch site must be reviewed for the macro; the smoke test cuda_engine_e2e_intra_kernel_revoke_bounds_revoke_wall_clock exercises the contract under timing pressure but won’t catch every missing site.
  • CUDA Graph replay paths. When CUDA Graphs are enabled (feature = "cuda-graph"), an entire stream of kernels launches as one atomic unit. The poll-before-launch contract doesn’t trivially extend through graph replay — the graph runs to completion. For revocation safety under graph replay, the surrounding loop (between replays) is the cancel granularity. A cleaner intra-graph cancel is a follow-up.
  • In-flight kernel completion after revoke. The macro stops future launches but doesn’t cancel the kernel already executing when the flip happens. That kernel writes its output to the engine’s scratch buffers, not to broker-owned memory outside the lease — the next poll site catches the revoke before the engine reads from any new broker buffer. The contract is “no kernel launches against revoked memory,” not “no kernels running at all.”
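One possible way to reduce the missed-poll-site risk, and to make the graph-replay granularity explicit, is to route launches through a helper that polls before it calls the launch closure. This is a sketch of an alternative shape, not something the recipe's codebase is known to do: `guarded_launch`, `decode_with_replay`, the `new` constructor, and the `u64` lease id are hypothetical.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

#[derive(Debug, PartialEq)]
pub enum CudaForwardError {
    Revoked { lease_id: u64 },
}

pub struct LeaseRevocationView {
    lease_id: u64,
    revoked: Arc<AtomicBool>,
}

impl LeaseRevocationView {
    pub fn new(lease_id: u64, revoked: Arc<AtomicBool>) -> Self {
        Self { lease_id, revoked }
    }
    pub fn is_revoked(&self) -> bool { self.revoked.load(Ordering::Relaxed) }
    pub fn lease_id(&self) -> u64 { self.lease_id }
}

/// Hypothetical helper: polls first, then runs the launch closure.
/// If every launch goes through here, a site cannot forget the poll.
pub fn guarded_launch<T>(
    view: &LeaseRevocationView,
    launch: impl FnOnce() -> Result<T, CudaForwardError>,
) -> Result<T, CudaForwardError> {
    if view.is_revoked() {
        return Err(CudaForwardError::Revoked { lease_id: view.lease_id() });
    }
    launch()
}

/// Graph-replay shape: the whole replay is one launch, so there is exactly
/// one poll per replay and the replay itself runs to completion.
pub fn decode_with_replay(
    view: &LeaseRevocationView,
    max_tokens: usize,
    mut replay: impl FnMut(usize) -> Result<(), CudaForwardError>,
) -> Result<usize, CudaForwardError> {
    for step in 0..max_tokens {
        guarded_launch(view, || replay(step))?;
    }
    Ok(max_tokens)
}
```

In the replay loop a flip that lands during replay `k` is observed before replay `k + 1`, which matches the "between replays" cancel granularity described for the CUDA Graph path.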

Observability

Every macro return emits a grafos_observe::Event::ForwardRevoked record: {lease_id, kernel_site, kv_position}. Operators can correlate this with the broker’s flip event to measure end-to-end revoke latency including the broker → engine signal path.
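The correlation step can be sketched as follows. The three record fields come from the text above; everything else is an assumption for illustration: the field types, the `observed_at` and `flipped_at` timestamps (the record as described carries no timestamp, so the logging layer is assumed to attach one), the `BrokerFlip` struct, and the `revoke_latency` helper. None of this is the `grafos_observe` API.

```rust
use std::time::{Duration, Instant};

/// Engine-side record. Field names follow the recipe; types are assumptions.
#[derive(Debug, Clone)]
pub struct ForwardRevoked {
    pub lease_id: u64,
    pub kernel_site: &'static str,
    pub kv_position: usize,
    /// Assumed to be attached by the logging layer; not part of the
    /// record as described in the recipe.
    pub observed_at: Instant,
}

/// Hypothetical broker-side flip record.
pub struct BrokerFlip {
    pub lease_id: u64,
    pub flipped_at: Instant,
}

/// End-to-end revoke latency for one lease: time from the broker flip to the
/// earliest matching engine-side ForwardRevoked event, if any was recorded.
pub fn revoke_latency(flip: &BrokerFlip, events: &[ForwardRevoked]) -> Option<Duration> {
    events
        .iter()
        .filter(|e| e.lease_id == flip.lease_id)
        .map(|e| e.observed_at.saturating_duration_since(flip.flipped_at))
        .min()
}
```

This comparison only makes sense when both timestamps come from the same monotonic clock, which holds when broker and engine share a process as in the walkthrough.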

Variations

  • Per-tensor revocation granularity. Today every kernel-launch poll checks the engine-level handle. Per-tensor (per-LeaseRegion) polls would let the broker revoke individual weights without killing the engine — useful for hot-swapping a single layer in research workloads. Filed as a follow-up.
  • Cooperative cancellation token from outside the fabric. Some applications need cancellation from the request-handling layer (e.g., HTTP client disconnect). Passing a second cancellation view alongside the lease view lets poll_revoke_or_return! cover both signals with one atomic load.
  • Latency SLO enforcement. The test prints wall-clock but doesn’t enforce an upper bound. A revoke_wall_clock_p99_us pin would catch regressions in the path between broker flip and engine return; filed as a perf-guard follow-up.
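For the cooperative-cancellation variation, one way to keep the poll at a single atomic load is to have both signal sources write bits into the same atomic word. This is a sketch under that assumption, not the recipe's implementation: `StopFlags`, `StopReason`, the bit layout, and the rule that lease revocation wins when both bits are set are all hypothetical.

```rust
use std::sync::atomic::{AtomicU8, Ordering};
use std::sync::Arc;

const LEASE_REVOKED: u8 = 0b01; // set by the broker
const REQUEST_CANCELLED: u8 = 0b10; // set by the request-handling layer

#[derive(Debug, PartialEq)]
pub enum StopReason {
    LeaseRevoked,
    RequestCancelled,
}

/// Shared stop word: two writers, one reader-side load per poll.
#[derive(Clone)]
pub struct StopFlags(Arc<AtomicU8>);

impl StopFlags {
    pub fn new() -> Self {
        Self(Arc::new(AtomicU8::new(0)))
    }

    /// Broker side: mark the lease as revoked.
    pub fn revoke_lease(&self) {
        self.0.fetch_or(LEASE_REVOKED, Ordering::Release);
    }

    /// Request layer: e.g. the HTTP client disconnected.
    pub fn cancel_request(&self) {
        self.0.fetch_or(REQUEST_CANCELLED, Ordering::Release);
    }

    /// One Relaxed load covers both signals. Lease revocation takes priority
    /// when both bits are set (an arbitrary choice for this sketch).
    pub fn poll(&self) -> Option<StopReason> {
        let bits = self.0.load(Ordering::Relaxed);
        if bits & LEASE_REVOKED != 0 {
            Some(StopReason::LeaseRevoked)
        } else if bits & REQUEST_CANCELLED != 0 {
            Some(StopReason::RequestCancelled)
        } else {
            None
        }
    }
}
```

Keeping the two causes distinguishable matters to the caller: a cancelled request leaves the lease usable for the next request, while a revoked lease does not.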

Why This Is Recipe 63

Recipes 61 and 62 establish that the engine produces correct output with efficient memory. Recipe 63 establishes that the broker can take that memory back at any time — intra-decode, not between-decodes. Without this property, fabric leases on the inference path would be too coarse-grained to support real multi-tenant serving (Recipe 66).