Recipe 64: Detecting a FENCED Lease State After Revocation
Situation
Recipe 63 covers how fast the engine stops using revoked memory. This recipe covers how to know the engine has stopped, in a way your application layer can reason about. A retried decode after a revoked lease could in principle:
- Race the broker’s reclaim and partially execute against memory the broker considers free.
- Silently degrade output quality if the memory has been reissued to another tenant.
- Crash the engine with a cryptic CUDA error from the kernel layer instead of a typed application-layer error.
The pattern: a revoked lease transitions to a FENCED state. Any
subsequent operation against the FENCED lease’s weight regions
fails closed with a typed MissingWeight / LeaseFenced error:
no kernel runs, no stale memory is read, and no silent wrong answer
is produced. The application layer matches on the typed error and
routes to recovery (request a new lease, fall back to another
tenant’s session, or surface the error to the caller).
What You Build
A revocation contract verifier: load a model, start decode, revoke
mid-flight, and observe that the in-flight forward returns Revoked
(Recipe 63’s job). Then re-attempt a forward on the same engine and
observe that the SECOND attempt returns a typed MissingWeight /
LeaseFenced error rather than a CUDA fault or silent wrong answer.
The FENCED state is one-way: once entered, no future op succeeds
against this engine handle.
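The contract can be exercised without a GPU. The sketch below is a minimal mock, not the real engine: `MockEngine`, `ForwardError`, and `verify_fence_contract` are hypothetical names, an `Option` stands in for a device pointer, and the mock nulls its slot at the poll site, whereas in the recipe that happens when the LeaseRegion is dropped.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Hypothetical stand-in for the recipe's typed forward errors.
#[derive(Debug, PartialEq)]
enum ForwardError {
    Revoked { lease_id: u64 },
    MissingWeight { which: &'static str },
}

/// Mock engine: one weight slot guarded by a shared revocation token.
struct MockEngine {
    lease_id: u64,
    revoked: Arc<AtomicBool>,
    attn_q: Option<Vec<f32>>, // None models a NULL device pointer
}

impl MockEngine {
    fn new(lease_id: u64, revoked: Arc<AtomicBool>) -> Self {
        Self { lease_id, revoked, attn_q: Some(vec![0.0; 4]) }
    }

    fn forward(&mut self) -> Result<u32, ForwardError> {
        // Guard first: a NULL slot fails closed before any work runs.
        if self.attn_q.is_none() {
            return Err(ForwardError::MissingWeight { which: "attn_q" });
        }
        // Poll site: observe the token, release the slot, report Revoked.
        if self.revoked.load(Ordering::Acquire) {
            self.attn_q = None;
            return Err(ForwardError::Revoked { lease_id: self.lease_id });
        }
        Ok(42) // placeholder token id
    }
}

/// Returns true when the engine honours the revocation contract:
/// healthy before the flip, Revoked once, MissingWeight on every retry.
fn verify_fence_contract(engine: &mut MockEngine, token: &AtomicBool, retries: usize) -> bool {
    let healthy = engine.forward().is_ok();
    token.store(true, Ordering::Release);
    let revoked_seen = matches!(engine.forward(), Err(ForwardError::Revoked { .. }));
    let stays_fenced = (0..retries)
        .all(|_| matches!(engine.forward(), Err(ForwardError::MissingWeight { .. })));
    healthy && revoked_seen && stays_fenced
}
```

Checking several retries rather than one is what distinguishes "the second attempt failed" from "the handle is permanently unusable".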
Building Blocks
- LeaseRegion::CudaDevice Drop semantics: flips the revocation token (already covered by Recipe 63) but leaves the device pointer in a “fenced” state until cleanup completes.
- CudaForwardError::MissingWeight { which }: the typed error variant the forward path returns when a required weight is null (i.e., its lease region was dropped/revoked).
- check_layer_weights: the per-layer guard at the top of every forward call. Returns MissingWeight if any required tensor is null.
- LeaseFenced: the broker-side state marker indicating a lease cannot be re-armed; only re-issued as a fresh lease.
Design
Resource Model
The engine’s CudaLlamaWeights holds per-layer pointers. Each
pointer is either:
- Live: populated by the loader, broker-owned via a LeaseRegion, in the engine’s weight_regions Vec.
- NULL: the loader never populated it OR the LeaseRegion was dropped (revocation + cleanup completed). The forward path’s check_layer_weights returns MissingWeight immediately.
After revocation:
- Broker flips the revocation token.
- Engine bails out of the in-flight forward at the next poll site (Recipe 63).
- The lease handle on the broker side transitions to FENCED.
- The LeaseRegion’s Drop runs, calls cuda_free on the device pointer, and the corresponding engine pointer is set NULL.
- Any future forward sees NULL pointers and returns MissingWeight, failing closed.
One-Way Transition
A FENCED lease cannot be un-FENCED. The application must either:
- Acquire a fresh lease (different lease_id, fresh device memory, fresh tokens) and rebuild the engine handle.
- Accept that this serving session is over and reject the request.
There is no “retry” against the same engine handle that succeeds after the FENCED transition.
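One way to make the one-way property hard to violate is to encode it in the state type. The following is a sketch under assumptions: the `LeaseState` names follow the Live / Revoked / Fenced states used in the Observability section, while `advance` and `InvalidTransition` are hypothetical, as is the choice to require passing through Revoked before Fenced.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum LeaseState {
    Live,
    Revoked,
    Fenced,
}

/// Hypothetical error describing a rejected transition.
#[derive(Debug, PartialEq, Eq)]
struct InvalidTransition {
    from: LeaseState,
    to: LeaseState,
}

impl LeaseState {
    /// Only forward moves are legal: Live -> Revoked -> Fenced.
    /// Nothing leaves Fenced, so a fenced lease can never be re-armed.
    fn advance(self, to: LeaseState) -> Result<LeaseState, InvalidTransition> {
        use LeaseState::*;
        match (self, to) {
            (Live, Revoked) | (Revoked, Fenced) => Ok(to),
            _ => Err(InvalidTransition { from: self, to }),
        }
    }

    fn is_fenced(self) -> bool {
        self == LeaseState::Fenced
    }
}
```

Because `advance` consumes the old state and returns a new one, recovery code cannot mutate a Fenced lease back to Live; it has to obtain a different lease value.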
Isolation and Safety
- The FENCED check is at the application layer (typed error matching), not the kernel layer. A correctly-coded application never sees a CUDA error from a fenced lease; it sees MissingWeight or Revoked and routes accordingly.
- The lease handle on the broker side carries an is_fenced() query, so multi-step recovery flows (drain in-flight requests, then re-acquire) can be expressed without polling kernel-layer state.
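A drain-then-re-acquire flow can be expressed purely in terms of the is_fenced() query and an in-flight counter. This is an illustrative sketch: the `LeaseStatus` trait, `RecoveryStep`, `next_step`, and `FixedStatus` are hypothetical names, and only `is_fenced` comes from the recipe.

```rust
/// Hypothetical broker-side view; only `is_fenced` is named in the recipe.
trait LeaseStatus {
    fn is_fenced(&self) -> bool;
}

#[derive(Debug, PartialEq, Eq)]
enum RecoveryStep {
    /// Lease is healthy; keep accepting work.
    KeepServing,
    /// Lease is fenced but requests are still unwinding.
    Drain { remaining: usize },
    /// Nothing in flight; safe to acquire a fresh lease.
    Reacquire,
}

/// Decides the next recovery step without touching kernel-layer state.
fn next_step(status: &impl LeaseStatus, in_flight: usize) -> RecoveryStep {
    if !status.is_fenced() {
        return RecoveryStep::KeepServing;
    }
    if in_flight > 0 {
        RecoveryStep::Drain { remaining: in_flight }
    } else {
        RecoveryStep::Reacquire
    }
}

/// Test double standing in for the broker's lease status.
struct FixedStatus(bool);

impl LeaseStatus for FixedStatus {
    fn is_fenced(&self) -> bool {
        self.0
    }
}
```

A scheduler would call `next_step` each time an in-flight request completes and act on `Reacquire` exactly once.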
Walkthrough (Implementation Sketch)
1. Detect FENCED on the Application Side
    match engine.decode_one(&mut seq, &sampling).await {
        Ok(token) => emit(token),
        Err(CudaForwardError::Revoked { lease_id }) => {
            // In-flight revocation hit Recipe 63's path.
            log::warn!("Lease {} revoked mid-decode", lease_id);
            handle_revocation(lease_id);
        }
        Err(CudaForwardError::MissingWeight { which }) => {
            // Lease has transitioned to FENCED. The engine handle is
            // no longer usable.
            log::error!("Lease FENCED: missing {}", which);
            return Err(ServingError::LeaseFenced);
        }
        Err(other) => return Err(ServingError::Engine(other)),
    }
2. Broker-Side Recovery
    // Detect FENCED state without invoking a forward:
    if broker.lease_status(lease_id).is_fenced() {
        let new_lease = broker.acquire_inference_lease(&policy)?;
        let new_engine = FabricNativeCudaEngine::new(/* fresh entropy */);
        new_engine.load_model_with_lease(&new_lease, ...)?;
        // ... resume work on new_engine
    }
3. Forward-Side Guard
    // From cuda_forward.rs, called at the top of every forward.
    fn check_layer_weights(layer: &CudaLlamaLayer, idx: usize) -> Result<(), CudaForwardError> {
        if layer.attn_q.is_null() {
            return Err(CudaForwardError::MissingWeight { which: "attn_q" });
        }
        // ... 8 other per-layer required slots
        Ok(())
    }
Verification (Silicon)
On AWS L4 (i-0b00f3d6f383200d4, 2026-05-23, commit b065c11c):
    --- Step 5: cuda_engine_e2e broker-revoke fence (FENCED on lease flip) ---
    test cuda_engine_e2e_broker_revoke_fences_in_flight_decode ... ok
    STEP 5 EXIT: 0
The pin:
- Loads Qwen 2.5 0.5B Q4_K_M with broker-owned LeaseRegion weights.
- Submits a prompt, starts decode.
- Flips ONE region’s revocation token mid-decode via the broker handle.
- Observes the next forward returns CudaForwardError::Revoked (Recipe 63’s part).
- The R64 part: re-attempts the engine. Observes the second attempt returns MissingWeight (one of the per-layer check_layer_weights slots is now NULL).
Run the pin yourself:
    FABRIC_TEST_MODEL_PATH=/opt/grafos/models/qwen2.5-0.5b-instruct-q4_k_m.gguf \
    cargo test --release \
      -p grafos-inference-engine \
      --features cuda,test-helpers \
      --test cuda_engine_e2e \
      -- --ignored --nocapture \
      cuda_engine_e2e_broker_revoke_fences
Failure Modes
- Partial-region revocation. This recipe revokes a whole region at a time. Sub-region revocation (e.g., a single layer’s weight out of 24) is not currently supported. If you need finer-grain reclaim, file a follow-up; the contract today is per-region.
- Cross-engine recovery. This recipe runs ONE engine. A scheduler spinning up a NEW engine instance on top of fresh regions is the recovery path; that’s tested separately by the multi-tenant acquire/release tests (Recipe 66).
- Memory leak shape. The FENCED state means new ops don’t succeed; it doesn’t mean the device memory has been reclaimed by the test process yet. Cleanup-on-drop is a separate contract pinned by Rust’s Drop impls and the cuda_free accounting tests. Memory is reclaimed when the engine’s Drop walks the weight pointers.
- Recovery path with stale state in the application. If the application caches a sequence handle across the FENCED transition and re-submits against the new lease, it will hit KV-cache state confusion (the new lease has a fresh, empty KV cache). The recovery flow must re-create the sequence on the new engine.
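The stale-sequence failure mode can be turned into a typed error by tagging each sequence with the lease it was created under. The sketch below is one possible guard, not the engine's actual API: `Sequence`, `EngineHandle`, `SubmitError`, and `recover_sequence` are hypothetical, and replaying the token history as a fresh prompt is an assumed recovery strategy.

```rust
#[derive(Debug, PartialEq, Eq)]
enum SubmitError {
    /// The sequence was created under a different lease than the engine's.
    StaleSequence { seq_lease: u64, engine_lease: u64 },
}

/// Hypothetical sequence handle tagged with its originating lease.
struct Sequence {
    lease_id: u64,
    tokens: Vec<u32>,
}

/// Hypothetical engine handle bound to exactly one lease.
struct EngineHandle {
    lease_id: u64,
}

impl EngineHandle {
    fn new_sequence(&self, prompt: &[u32]) -> Sequence {
        Sequence { lease_id: self.lease_id, tokens: prompt.to_vec() }
    }

    /// Appends a token, refusing sequences that belong to another lease.
    /// Returns the new sequence length on success.
    fn submit(&self, seq: &mut Sequence, token: u32) -> Result<usize, SubmitError> {
        if seq.lease_id != self.lease_id {
            return Err(SubmitError::StaleSequence {
                seq_lease: seq.lease_id,
                engine_lease: self.lease_id,
            });
        }
        seq.tokens.push(token);
        Ok(seq.tokens.len())
    }
}

/// Re-creates a sequence on the new engine by replaying the token history,
/// so the new (empty) KV cache is rebuilt from scratch.
fn recover_sequence(new_engine: &EngineHandle, stale: &Sequence) -> Sequence {
    new_engine.new_sequence(&stale.tokens)
}
```

With this shape, a cached handle that outlives the FENCED transition produces StaleSequence at submit time instead of silently decoding against an empty cache.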
Observability
The lease’s state transitions emit
grafos_observe::Event::LeaseState records: {lease_id, state}
where state ∈ {Live, Revoked, Fenced}. Pair these with the
engine’s Revoked / MissingWeight returns to surface a complete
revocation timeline: “broker flip → engine return → state
transition → recovery acquired.”
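Pairing the two record streams might look like the sketch below. It assumes each record carries a timestamp (`at_us`), which the {lease_id, state} payload above does not show; `Record` and `timeline_for` are hypothetical names, not part of grafos_observe.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum LeaseState {
    Live,
    Revoked,
    Fenced,
}

/// One entry on the merged timeline. `at_us` is an assumed timestamp field.
#[derive(Debug, Clone, PartialEq, Eq)]
enum Record {
    /// Mirrors the {lease_id, state} payload of a LeaseState event.
    State { at_us: u64, lease_id: u64, state: LeaseState },
    /// A typed error the engine returned, e.g. "Revoked" or "MissingWeight".
    EngineReturn { at_us: u64, lease_id: u64, error: &'static str },
}

impl Record {
    fn at_us(&self) -> u64 {
        match self {
            Record::State { at_us, .. } | Record::EngineReturn { at_us, .. } => *at_us,
        }
    }

    fn lease_id(&self) -> u64 {
        match self {
            Record::State { lease_id, .. } | Record::EngineReturn { lease_id, .. } => *lease_id,
        }
    }
}

/// Filters to one lease and orders its records by time (stable sort,
/// so records with equal timestamps keep their input order).
fn timeline_for(lease_id: u64, records: &[Record]) -> Vec<String> {
    let mut mine: Vec<&Record> = records.iter().filter(|r| r.lease_id() == lease_id).collect();
    mine.sort_by_key(|r| r.at_us());
    mine.into_iter()
        .map(|r| match r {
            Record::State { state, .. } => format!("state:{:?}", state),
            Record::EngineReturn { error, .. } => format!("engine:{}", error),
        })
        .collect()
}
```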
Variations
- Partial revocation (per-tenant in a shared engine). When Recipe 66’s multi-tenant batched decode is layered with per-tenant lease handles, revoking one tenant must FENCE only that tenant’s KV cache slot, not the whole engine. The per-LeaseRegion granularity supports this; the engine-level fence is the simplest case.
- Cooperative drain before FENCE. Some applications prefer to drain in-flight requests cleanly rather than revoke abruptly. A “soft-revoke” mode that blocks new submissions but lets current decodes complete is a useful variant — the lease enters FENCED only after the drain.
- State observability on the broker. A dashboard panel showing real-time Live / Revoked / Fenced counts is operator catnip: it surfaces the difference between healthy churn (“leases flow Live → Revoked → Fenced → freed”) and stuck state (“100 leases stuck in Revoked”).
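For the dashboard variation, the per-state counts can be derived by folding the event stream into each lease's latest state. This is a minimal sketch assuming events arrive in order as (lease_id, state) pairs; `StateCounts` and `count_states` are hypothetical names.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum LeaseState {
    Live,
    Revoked,
    Fenced,
}

#[derive(Debug, Default, PartialEq, Eq)]
struct StateCounts {
    live: usize,
    revoked: usize,
    fenced: usize,
}

/// Folds an ordered event stream into counts of each lease's latest state.
fn count_states(events: &[(u64, LeaseState)]) -> StateCounts {
    // Later events overwrite earlier ones, leaving the latest state per lease.
    let mut latest: HashMap<u64, LeaseState> = HashMap::new();
    for &(lease_id, state) in events {
        latest.insert(lease_id, state);
    }
    let mut counts = StateCounts::default();
    for state in latest.values() {
        match state {
            LeaseState::Live => counts.live += 1,
            LeaseState::Revoked => counts.revoked += 1,
            LeaseState::Fenced => counts.fenced += 1,
        }
    }
    counts
}
```

A `revoked` count that keeps growing while `fenced` stays flat is the "stuck in Revoked" signal described above.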
Why This Is Recipe 64
Recipe 63 (intra-kernel revoke) is “how fast does the engine stop.” Recipe 64 is “how does the application know the engine has stopped, in a typed and unambiguous way.” Together they make fabric-leased inference safe to share — Recipe 66 (multi-tenant) sits on top.