
Recipe 64: Detecting a FENCED Lease State After Revocation

Situation

Recipe 63 covers how fast the engine stops using revoked memory. This recipe covers how to know the engine has stopped, in a way your application layer can reason about. A retried decode after a revoked lease could in principle:

  1. Race the broker’s reclaim and partially execute against memory the broker considers free.
  2. Silently degrade output quality if the memory has been reissued to another tenant.
  3. Crash the engine with a cryptic CUDA error from the kernel layer instead of a typed application-layer error.

The pattern: a revoked lease transitions to a FENCED state. Any subsequent operation against the FENCED lease’s weight regions fails closed with a typed MissingWeight / LeaseFenced error: no kernel runs, no stale memory is read, and no silent wrong answer is produced. The application layer matches on the typed error and routes to recovery (request a new lease, fall back to another tenant’s session, or surface the error to the caller).

What You Build

A revocation contract verifier: load a model, start decode, revoke mid-flight, and observe that the in-flight forward returns Revoked (Recipe 63’s job). Then re-attempt the engine and observe that the SECOND attempt returns a typed MissingWeight / LeaseFenced error rather than a CUDA fault or a silent wrong answer. The FENCED state is one-way: once entered, no future op succeeds against this engine handle.

Building Blocks

  • LeaseRegion::CudaDevice Drop semantics — flips the revocation token (already covered by Recipe 63) but leaves the device pointer in a “fenced” state until cleanup completes.
  • CudaForwardError::MissingWeight { which } — the typed error variant the forward path returns when a required weight is null (i.e., its lease region was dropped/revoked).
  • check_layer_weights — the per-layer guard at the top of every forward call. Returns MissingWeight if any required tensor is null.
  • LeaseFenced — the broker-side state marker indicating a lease cannot be re-armed; only re-issued as a fresh lease.


Design

Resource Model

The engine’s CudaLlamaWeights holds per-layer pointers. Each pointer is either:

  • Live: populated by the loader, broker-owned via a LeaseRegion, in the engine’s weight_regions Vec.
  • NULL: the loader never populated it OR the LeaseRegion was dropped (revocation + cleanup completed). The forward path’s check_layer_weights returns MissingWeight immediately.

After revocation:

  1. Broker flips the revocation token.
  2. Engine bails out of the in-flight forward at the next poll site (Recipe 63).
  3. The lease handle on the broker side transitions to FENCED.
  4. The LeaseRegion’s Drop runs, calls cuda_free on the device pointer, and the corresponding engine pointer is set NULL.
  5. Any future forward sees NULL pointers and returns MissingWeight — fail-closed.
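
The five steps above can be modeled in a few lines. This is a toy sketch only, not the engine's code: `ToyEngine`, `ToyForwardError`, and the `Option<Vec<f32>>` stand-in for a device pointer are all hypothetical, and steps 3 and 4 (FENCED transition and Drop) are collapsed into the forward call for brevity.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Toy stand-ins for the two typed errors named in this recipe.
#[derive(Debug, PartialEq)]
enum ToyForwardError {
    Revoked { lease_id: u64 },
    MissingWeight { which: &'static str },
}

/// Hypothetical engine with a single weight slot.
/// `Some(..)` models a Live pointer, `None` models a NULL pointer.
struct ToyEngine {
    lease_id: u64,
    revoked: Arc<AtomicBool>, // the token the broker flips (step 1)
    attn_q: Option<Vec<f32>>,
}

impl ToyEngine {
    /// Returns the engine plus the broker's copy of the revocation token.
    fn new(lease_id: u64) -> (ToyEngine, Arc<AtomicBool>) {
        let token = Arc::new(AtomicBool::new(false));
        let engine = ToyEngine {
            lease_id,
            revoked: Arc::clone(&token),
            attn_q: Some(vec![0.5; 4]),
        };
        (engine, token)
    }

    fn forward(&mut self) -> Result<f32, ToyForwardError> {
        // Step 2: poll site. Steps 3-4 are collapsed here: the slot is
        // released and nulled as soon as the flip is observed.
        if self.attn_q.is_some() && self.revoked.load(Ordering::Acquire) {
            self.attn_q = None;
            return Err(ToyForwardError::Revoked { lease_id: self.lease_id });
        }
        // Step 5: a NULL slot fails closed before any work runs.
        match &self.attn_q {
            Some(weights) => Ok(weights.iter().sum::<f32>()),
            None => Err(ToyForwardError::MissingWeight { which: "attn_q" }),
        }
    }
}
```

In this model, the first `forward` after the token flip returns `Revoked` and every later call returns `MissingWeight`, which mirrors the two observations this recipe verifies.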

One-Way Transition

A FENCED lease cannot be un-FENCED. The application must either:

  • Acquire a fresh lease (different lease_id, fresh device memory, fresh tokens) and rebuild the engine handle.
  • Accept that this serving session is over and reject the request.

There is no “retry” against the same engine handle that succeeds after the FENCED transition.
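
One way to make the one-way property hard to violate is to encode the allowed transitions in a single function. The sketch below is an assumption about shape, not the broker's implementation: `ToyLeaseState` and `advance` are hypothetical names, and the rule that Live must pass through Revoked before Fenced is inferred from the Live / Revoked / Fenced ordering used elsewhere in this recipe.

```rust
/// Hypothetical lease states, mirroring the Live / Revoked / Fenced
/// values this recipe's observability section mentions.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ToyLeaseState {
    Live,
    Revoked,
    Fenced,
}

impl ToyLeaseState {
    /// Returns the next state if the transition is allowed, `None` otherwise.
    /// Only forward moves are permitted; nothing leaves `Fenced`.
    fn advance(self, next: ToyLeaseState) -> Option<ToyLeaseState> {
        use ToyLeaseState::*;
        match (self, next) {
            (Live, Revoked) | (Revoked, Fenced) => Some(next),
            _ => None,
        }
    }

    fn is_fenced(self) -> bool {
        self == ToyLeaseState::Fenced
    }
}
```

Because `advance` has no arm that starts from `Fenced`, re-arming a fenced lease is unrepresentable in this model; the only way forward is a fresh lease value.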

Isolation and Safety

  • The FENCED check is at the application layer (typed error matching), not the kernel layer. A correctly coded application never sees a CUDA error from a fenced lease; it sees MissingWeight or Revoked and routes accordingly.
  • The lease handle on the broker side carries an is_fenced() query, so multi-step recovery flows (drain in-flight requests, then re-acquire) can be expressed without polling kernel-layer state.

Walkthrough (Implementation Sketch)

1. Detect FENCED on the Application Side

match engine.decode_one(&mut seq, &sampling).await {
    Ok(token) => emit(token),
    Err(CudaForwardError::Revoked { lease_id }) => {
        // In-flight revocation hit Recipe 63's path.
        log::warn!("Lease {} revoked mid-decode", lease_id);
        handle_revocation(lease_id);
    }
    Err(CudaForwardError::MissingWeight { which }) => {
        // Lease has transitioned to FENCED. The engine handle is
        // no longer usable.
        log::error!("Lease FENCED: missing {}", which);
        return Err(ServingError::LeaseFenced);
    }
    Err(other) => return Err(ServingError::Engine(other)),
}
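
The match above depends on the engine crate. For unit-testing the routing decision in isolation, the same logic can be expressed as a pure function over stand-in types. Everything here is hypothetical scaffolding: `ToyCudaForwardError` copies only the variant names from the fragment above, and `Action` and `route` are names introduced for illustration.

```rust
/// Stand-in for the engine's error type; only the variant names
/// are taken from this recipe.
#[derive(Debug, PartialEq)]
enum ToyCudaForwardError {
    Revoked { lease_id: u64 },
    MissingWeight { which: &'static str },
    Other(String),
}

/// Hypothetical description of what the serving layer should do next.
#[derive(Debug, PartialEq)]
enum Action {
    Emit(u32),
    HandleRevocation { lease_id: u64 },
    RejectLeaseFenced { missing: &'static str },
    SurfaceEngineError(String),
}

/// Maps one decode result to a recovery action without side effects.
fn route(result: Result<u32, ToyCudaForwardError>) -> Action {
    match result {
        Ok(token) => Action::Emit(token),
        // In-flight revocation: Recipe 63's path.
        Err(ToyCudaForwardError::Revoked { lease_id }) => {
            Action::HandleRevocation { lease_id }
        }
        // FENCED: this engine handle is finished.
        Err(ToyCudaForwardError::MissingWeight { which }) => {
            Action::RejectLeaseFenced { missing: which }
        }
        Err(ToyCudaForwardError::Other(msg)) => Action::SurfaceEngineError(msg),
    }
}
```

Keeping the decision pure means the FENCED branch can be asserted in a plain test, with logging and lease re-acquisition performed by whatever consumes the `Action`.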

2. Broker-Side Recovery

// Detect FENCED state without invoking a forward:
if broker.lease_status(lease_id).is_fenced() {
    let new_lease = broker.acquire_inference_lease(&policy)?;
    let new_engine = FabricNativeCudaEngine::new(/* fresh entropy */);
    new_engine.load_model_with_lease(&new_lease, ...)?;
    // ... resume work on new_engine
}

3. Forward-Side Guard

// From cuda_forward.rs, called at the top of every forward.
fn check_layer_weights(layer: &CudaLlamaLayer, idx: usize)
    -> Result<(), CudaForwardError>
{
    if layer.attn_q.is_null() {
        return Err(CudaForwardError::MissingWeight { which: "attn_q" });
    }
    // ... 8 other per-layer required slots
    Ok(())
}
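
The guard above elides the remaining slots. A table-driven variant keeps the check and the reported name together, so adding a slot is a one-line change. This is a sketch under assumptions: `ToyLayer` is a stand-in for `CudaLlamaLayer`, the slot names other than `attn_q` are invented for illustration, and the real guard may well be written as a chain of `if` checks.

```rust
/// Stand-in error type carrying only the variant this guard returns.
#[derive(Debug, PartialEq)]
enum ToyCudaForwardError {
    MissingWeight { which: &'static str },
}

/// Hypothetical layer with three pointer slots. Only `attn_q` is named
/// in the recipe; `attn_k` and `attn_v` are illustrative.
struct ToyLayer {
    attn_q: *const u8,
    attn_k: *const u8,
    attn_v: *const u8,
}

/// Returns the first NULL slot as a typed error, or Ok if all are populated.
/// The pointers are only compared against null, never dereferenced.
fn check_layer_weights(layer: &ToyLayer) -> Result<(), ToyCudaForwardError> {
    let slots: [(&'static str, *const u8); 3] = [
        ("attn_q", layer.attn_q),
        ("attn_k", layer.attn_k),
        ("attn_v", layer.attn_v),
    ];
    for &(which, ptr) in slots.iter() {
        if ptr.is_null() {
            return Err(ToyCudaForwardError::MissingWeight { which });
        }
    }
    Ok(())
}
```

Slots are checked in table order, so the reported `which` is the first missing slot rather than an exhaustive list.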

Verification (Silicon)

On AWS L4 (i-0b00f3d6f383200d4, 2026-05-23, commit b065c11c):

--- Step 5: cuda_engine_e2e broker-revoke fence (FENCED on lease flip) ---
test cuda_engine_e2e_broker_revoke_fences_in_flight_decode ... ok
STEP 5 EXIT: 0

The pin:

  1. Loads Qwen 2.5 0.5B Q4_K_M with broker-owned LeaseRegion weights.
  2. Submits a prompt, starts decode.
  3. Flips ONE region’s revocation token mid-decode via the broker handle.
  4. Observes the next forward returns CudaForwardError::Revoked (Recipe 63’s part).
  5. Recipe 64’s part: re-attempts the engine and observes that the second attempt returns MissingWeight (one of the per-layer check_layer_weights slots is now NULL).

Run the pin yourself:

FABRIC_TEST_MODEL_PATH=/opt/grafos/models/qwen2.5-0.5b-instruct-q4_k_m.gguf \
  cargo test --release \
  -p grafos-inference-engine \
  --features cuda,test-helpers \
  --test cuda_engine_e2e \
  -- --ignored --nocapture \
  cuda_engine_e2e_broker_revoke_fences

Failure Modes

  • Partial-region revocation. This recipe revokes a whole region at a time. Sub-region revocation (e.g., a single layer’s weight out of 24) is not currently supported. If you need finer-grained reclaim, file a follow-up; the contract today is per-region.
  • Cross-engine recovery. This recipe runs ONE engine. A scheduler spinning up a NEW engine instance on top of fresh regions is the recovery path; that’s tested separately by the multi-tenant acquire/release tests (Recipe 66).
  • Memory leak shape. The FENCED state means new ops don’t succeed; it doesn’t mean the device memory has been reclaimed by the test process yet. Cleanup-on-drop is a separate contract pinned by Rust’s Drop impls and the cuda_free accounting tests. Memory is reclaimed when the engine’s Drop walks the weight pointers.
  • Recovery path with stale state in the application. If the application caches a sequence handle across the FENCED transition and re-submits against the new lease, it will hit KV-cache state confusion (the new lease has a fresh, empty KV cache). The recovery flow must re-create the sequence on the new engine.
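
The stale-sequence failure mode in the last bullet can be guarded against by tagging each sequence with the lease it was created under. The sketch below is one possible shape, not the engine's API: `ToySequence`, `can_submit`, and `recreate_on` are hypothetical, and replaying already-generated tokens as part of the new prompt is an assumed recovery policy.

```rust
/// Hypothetical sequence handle tagged with its owning lease.
#[derive(Debug, Clone, PartialEq)]
struct ToySequence {
    lease_id: u64,
    prompt: Vec<u32>,
    generated: Vec<u32>,
    kv_len: usize, // positions the engine has cached for this sequence
}

/// A sequence may only be submitted to the engine whose lease it was built on.
fn can_submit(engine_lease_id: u64, seq: &ToySequence) -> bool {
    seq.lease_id == engine_lease_id
}

/// Builds a fresh sequence for a new lease. The old tokens are replayed as
/// prompt so the new engine refills its (empty) KV cache from position 0.
fn recreate_on(new_lease_id: u64, old: &ToySequence) -> ToySequence {
    let mut prompt = old.prompt.clone();
    prompt.extend_from_slice(&old.generated);
    ToySequence {
        lease_id: new_lease_id,
        prompt,
        generated: Vec::new(),
        kv_len: 0,
    }
}
```

With the tag in place, resubmitting a cached handle against the new lease is rejected by `can_submit` instead of producing KV-cache confusion.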

Observability

The lease’s state transitions emit grafos_observe::Event::LeaseState records: {lease_id, state} where state ∈ {Live, Revoked, Fenced}. Pair these with the engine’s Revoked / MissingWeight returns to surface a complete revocation timeline: “broker flip → engine return → state transition → recovery acquired.”
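
As a rough illustration of consuming these records, the sketch below folds an ordered event stream into per-state counts, using the most recent state seen for each lease. The types are stand-ins: `ToyLeaseStateEvent` copies only the {lease_id, state} shape described above and is not the grafos_observe type, and `latest_counts` is a hypothetical helper.

```rust
use std::collections::HashMap;

/// Stand-in for the three states carried by the lease-state records.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum ToyState {
    Live,
    Revoked,
    Fenced,
}

/// Stand-in for one {lease_id, state} record.
#[derive(Debug, Clone, Copy)]
struct ToyLeaseStateEvent {
    lease_id: u64,
    state: ToyState,
}

/// Counts leases by their latest observed state.
/// Events are assumed to arrive in emission order.
fn latest_counts(events: &[ToyLeaseStateEvent]) -> HashMap<ToyState, usize> {
    let mut latest: HashMap<u64, ToyState> = HashMap::new();
    for event in events {
        // Later events overwrite earlier ones for the same lease.
        latest.insert(event.lease_id, event.state);
    }
    let mut counts: HashMap<ToyState, usize> = HashMap::new();
    for state in latest.values() {
        *counts.entry(*state).or_insert(0) += 1;
    }
    counts
}
```

A `Revoked` count that keeps growing while `Fenced` stays flat would suggest leases are not completing the transition.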

Variations

  • Partial revocation (per-tenant in a shared engine). When Recipe 66’s multi-tenant batched decode is layered with per-tenant lease handles, revoking one tenant must FENCE only that tenant’s KV cache slot, not the whole engine. The per-LeaseRegion granularity supports this; the engine-level fence is the simplest case.
  • Cooperative drain before FENCE. Some applications prefer to drain in-flight requests cleanly rather than revoke abruptly. A “soft-revoke” mode that blocks new submissions but lets current decodes complete is a useful variant — the lease enters FENCED only after the drain.
  • State observability on the broker. A dashboard panel showing real-time Live / Revoked / Fenced counts is operator catnip — it surfaces the difference between healthy churn (“leases flow Live → Revoked → Fenced → freed”) and stuck state (“100 leases stuck in Revoked”).
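
The cooperative-drain variation can be sketched as a small admission gate. This is a hypothetical design, not an existing mode of the broker or engine: `SoftRevokeGate`, `Phase`, and the method names are all introduced here for illustration.

```rust
/// Hypothetical phases for a soft-revoke flow.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Phase {
    Live,
    Draining,
    Fenced,
}

/// Tracks in-flight decodes and fences only once they have drained.
struct SoftRevokeGate {
    phase: Phase,
    in_flight: usize,
}

impl SoftRevokeGate {
    fn new() -> SoftRevokeGate {
        SoftRevokeGate { phase: Phase::Live, in_flight: 0 }
    }

    /// Admits a new decode only while the lease is Live.
    fn try_admit(&mut self) -> bool {
        if self.phase == Phase::Live {
            self.in_flight += 1;
            true
        } else {
            false
        }
    }

    /// Blocks new submissions; current decodes are allowed to finish.
    fn soft_revoke(&mut self) {
        if self.phase == Phase::Live {
            self.phase = Phase::Draining;
            self.fence_if_drained();
        }
    }

    /// Called when one in-flight decode completes.
    fn complete_one(&mut self) {
        self.in_flight = self.in_flight.saturating_sub(1);
        self.fence_if_drained();
    }

    fn fence_if_drained(&mut self) {
        if self.phase == Phase::Draining && self.in_flight == 0 {
            self.phase = Phase::Fenced;
        }
    }
}
```

A production version would likely add a drain deadline that falls back to the abrupt revocation path, so a stuck decode cannot hold the lease indefinitely.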

Why This Is Recipe 64

Recipe 63 (intra-kernel revoke) is “how fast does the engine stop.” Recipe 64 is “how does the application know the engine has stopped, in a typed and unambiguous way.” Together they make fabric-leased inference safe to share — Recipe 66 (multi-tenant) sits on top.