Recipe 53: Per-Request Leased Inference Gateway
Situation
You operate a shared inference gateway. Requests from different tenants should not inherit a long-lived process-level GPU grant, and a stalled request should not leave GPU memory resident after its useful lifetime.
In grafOS, each request can acquire a short GPU lease, receive a scoped capability grant for that lease, run the kernel, and tear the lease down when the request completes.
What You Build
An inference handler that:
- acquires one short GpuLease per request;
- requests a scoped capability through the grafOS authority boundary;
- launches inference through GpuSession;
- records request completion in ReplicatedIdempotencyStore;
- explicitly frees GPU memory, unloads the module, and releases the lease.
The compiled recipe lives in
cookbook/recipe-53-per-request-leased-inference.
Core grafOS API Path
The handler uses the public idempotency store, GPU builder/session APIs, and grafOS capability authority APIs. It does not hold a fabricBIOS signing key or call raw token minting functions:

    use grafos_replicated::{
        FenceEpoch, IdempotencyKey, IdempotencyOutcome, LogicalResourceName, OperationHash,
        ReplicatedIdempotencyStore, ResourceKind, SchemaId,
    };
    use grafos_std::capability::{
        CapabilityAuthority, CapabilityPermission, CapabilityRequest,
    };
    use grafos_std::gpu::{GpuBuilder, GpuExclusivityClass, GpuSession, KernelArgs};
    use cookbook_recipe_53_per_request_leased_inference::{
        InferenceError, InferenceRequest, InferenceResponse,
    };

    fn handle_request(
        effects: &mut ReplicatedIdempotencyStore,
        authority: &mut impl CapabilityAuthority,
        request: InferenceRequest,
        module_bytes: &[u8],
        kernel: &str,
    ) -> Result<InferenceResponse, InferenceError> {
        // Fingerprint the request so a retry with the same key can be recognized.
        let key = IdempotencyKey::new(format!("inference:{}", request.request_id));
        let mut payload = request.request_id.clone().into_bytes();
        payload.push(0);
        payload.extend_from_slice(request.model_id.as_bytes());
        payload.push(0);
        payload.extend_from_slice(&request.output_len.to_be_bytes());
        payload.extend_from_slice(&request.input);
        let fingerprint = OperationHash::from_canonical_parts(
            &LogicalResourceName::new("per-request-inference"),
            ResourceKind::Workflow,
            "infer",
            &SchemaId::new("inference-request.v1"),
            &payload,
        );

        // Reserve the key; a request that already completed is rejected as a duplicate.
        let reservation = effects.reserve(key.clone(), fingerprint, None, FenceEpoch(1))?;
        if matches!(reservation.value.outcome, IdempotencyOutcome::Completed { .. }) {
            return Err(InferenceError::DuplicateRequest(request.request_id));
        }

        // Acquire a short, request-scoped GPU lease.
        let lease = GpuBuilder::new()
            .min_vram(request.min_vram)
            .lease_secs(request.lease_ttl_secs)
            .exclusivity(GpuExclusivityClass::SessionExclusive)
            .acquire()?;

        // Issue a capability scoped to this lease and validate it before any GPU work.
        let grant = authority.issue(
            CapabilityRequest::new(
                lease.lease_id(),
                CapabilityPermission::LEASE_QUERY | CapabilityPermission::TASKLET_SUBMIT,
                request.audience,
            )
            .ttl_secs(request.lease_ttl_secs as u64)
            .nonce(request.token_nonce),
            request.now_secs,
        )?;
        authority.validate(
            &grant,
            request.now_secs,
            &request.audience,
            CapabilityPermission::TASKLET_SUBMIT,
        )?;

        // Stage the input, run the kernel, read the result, then free GPU-side state.
        let mut session = GpuSession::new(&lease);
        let input = session.mem_alloc(request.input.len() as u64)?;
        session.mem_write(&input, 0, &request.input)?;
        let module = session.module_load(module_bytes)?;
        let args = KernelArgs::new()
            .push_u32(request.output_len)
            .push_buffer(&input);
        session.launch_with_args(&module, kernel, [1, 1, 1], [1, 1, 1], args)?;
        session.sync()?;
        let output = session.mem_read(&input, 0, request.output_len)?;
        session.module_unload(module)?;
        session.mem_free(input)?;

        // Record completion in the idempotency store.
        effects.complete(
            key,
            reservation.version,
            IdempotencyOutcome::Completed { effect: None },
            FenceEpoch(1),
        )?;
        // Capture the lease id first so the response does not read from a released lease.
        let lease_id = lease.lease_id();
        lease.free();

        Ok(InferenceResponse {
            request_id: request.request_id,
            lease_id,
            token_expires_at: grant.expires_at_unix_secs(),
            output,
        })
    }

The published serve_inference_request function in the crate is this pattern
packaged as a reusable handler.
In production the CapabilityAuthority handle is supplied by the scheduler or
runtime to the gateway. The local development authority used below exists only
so the native cookbook crate can exercise the same grafOS API path in tests.
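
The reserve-then-complete flow in the handler can be pictured with a small std-only stand-in. This is a sketch, not the grafos_replicated implementation: ToyIdempotencyStore, Outcome, ReserveError, and canonical_payload are hypothetical names, the raw payload stands in for OperationHash, and rejecting a reused key with a different payload is an assumption about how fingerprints are commonly used rather than a documented behavior of ReplicatedIdempotencyStore.

```rust
use std::collections::HashMap;

/// Hypothetical outcome type; a stand-in for IdempotencyOutcome.
#[derive(Debug, Clone, PartialEq)]
enum Outcome {
    Pending,
    Completed,
}

#[derive(Debug, PartialEq)]
enum ReserveError {
    /// The same key was reused with a different request payload (assumed policy).
    FingerprintMismatch,
}

/// Hypothetical in-memory store: key -> (payload fingerprint, outcome).
#[derive(Default)]
struct ToyIdempotencyStore {
    entries: HashMap<String, (Vec<u8>, Outcome)>,
}

impl ToyIdempotencyStore {
    /// Returns the outcome already recorded for `key`, or `Pending` for a
    /// fresh reservation.
    fn reserve(&mut self, key: &str, fingerprint: &[u8]) -> Result<Outcome, ReserveError> {
        match self.entries.get(key) {
            Some((stored, _)) if stored.as_slice() != fingerprint => {
                Err(ReserveError::FingerprintMismatch)
            }
            Some((_, outcome)) => Ok(outcome.clone()),
            None => {
                self.entries
                    .insert(key.to_string(), (fingerprint.to_vec(), Outcome::Pending));
                Ok(Outcome::Pending)
            }
        }
    }

    /// Marks a reserved key as completed; unknown keys are ignored.
    fn complete(&mut self, key: &str) {
        if let Some(entry) = self.entries.get_mut(key) {
            entry.1 = Outcome::Completed;
        }
    }
}

/// Mirrors the handler's payload framing: NUL separators after the two string
/// fields, a big-endian output length, then the raw input bytes.
fn canonical_payload(request_id: &str, model_id: &str, output_len: u32, input: &[u8]) -> Vec<u8> {
    let mut payload = request_id.as_bytes().to_vec();
    payload.push(0);
    payload.extend_from_slice(model_id.as_bytes());
    payload.push(0);
    payload.extend_from_slice(&output_len.to_be_bytes());
    payload.extend_from_slice(input);
    payload
}
```

A first reserve on a fresh key yields Pending, so the handler proceeds to GPU work. After complete, a second reserve with the same key and payload yields Completed, which is the case the handler turns into InferenceError::DuplicateRequest.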
Program

    use cookbook_recipe_53_per_request_leased_inference::{
        replicated_inference_effects, serve_inference_request, InferenceError, InferenceRequest,
    };
    use grafos_std::capability::RuntimeCapabilityAuthority;

    fn main() -> Result<(), InferenceError> {
        let mut effects = replicated_inference_effects()?;
        // Local development authority; in production the runtime supplies the handle.
        let mut authority = RuntimeCapabilityAuthority::local_development();
        let response = serve_inference_request(
            &mut effects,
            &mut authority,
            InferenceRequest {
                request_id: "req-2026-05-06-1".into(),
                model_id: "embedder-v3".into(),
                input: vec![1, 2, 3, 4],
                output_len: 4,
                min_vram: 2 * 1024 * 1024 * 1024,
                lease_ttl_secs: 10,
                audience: [0x22; 32],
                token_nonce: 99,
                now_secs: 1_000,
            },
            include_bytes!("infer.ptx"),
            "infer",
        )?;

        assert_eq!(response.output.len(), 4);
        Ok(())
    }

Design
The request gets only the permissions required for inference:
- CapabilityPermission::TASKLET_SUBMIT to submit the kernel work;
- CapabilityPermission::LEASE_QUERY to bind the request to the lease identity.
The capability grant is validated before GPU work starts. Audience mismatch, expiration, revocation, or insufficient permissions all stop the request before the kernel is launched. The fabricBIOS wire token exists below this API, but the program does not mint or parse it.
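
The four rejection cases named above can be sketched as a plain validation function that runs before any launch. This is an illustration only: ToyGrant, ValidateError, the bit values of LEASE_QUERY and TASKLET_SUBMIT, the order of the checks, and treating a grant as expired when now equals its expiry are all assumptions, not the grafos_std::capability implementation.

```rust
/// Hypothetical permission bits; the real CapabilityPermission values are not
/// shown in this recipe.
const LEASE_QUERY: u32 = 1 << 0;
const TASKLET_SUBMIT: u32 = 1 << 1;

#[derive(Debug, PartialEq)]
enum ValidateError {
    Revoked,
    Expired,
    AudienceMismatch,
    MissingPermission,
}

/// Hypothetical grant record; a stand-in for whatever the authority returns.
struct ToyGrant {
    audience: [u8; 32],
    permissions: u32,
    expires_at_secs: u64,
    revoked: bool,
}

/// Fails closed: any check that does not pass returns an error.
fn validate(
    grant: &ToyGrant,
    now_secs: u64,
    audience: &[u8; 32],
    required: u32,
) -> Result<(), ValidateError> {
    if grant.revoked {
        return Err(ValidateError::Revoked);
    }
    if now_secs >= grant.expires_at_secs {
        return Err(ValidateError::Expired);
    }
    if grant.audience != *audience {
        return Err(ValidateError::AudienceMismatch);
    }
    if grant.permissions & required != required {
        return Err(ValidateError::MissingPermission);
    }
    Ok(())
}

/// The launch is only reachable after validation succeeds.
fn launch_if_valid(
    grant: &ToyGrant,
    now_secs: u64,
    audience: &[u8; 32],
) -> Result<&'static str, ValidateError> {
    validate(grant, now_secs, audience, TASKLET_SUBMIT)?;
    Ok("launched")
}
```

Because the launch sits behind the `?`, there is no code path that reaches the kernel with a grant that failed any of the checks.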
Failure Modes
- Bad or expired capability: fail before GPU launch.
- Duplicate request: a request whose idempotency key is already completed returns InferenceError::DuplicateRequest, so no second logical effect occurs.
- GPU pressure: lease acquisition fails; the gateway can reject or queue the request without owning any partially initialized GPU session.
- Request handler crash: GPU resources are tied to the lease and session handles, so reclaiming them does not depend on a separate cleanup script.
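
The idea of tying cleanup to handles rather than to a cleanup script can be sketched with Rust's Drop trait. This is a std-only illustration under stated assumptions: ToyLease, ToyBuffer, and EventLog are hypothetical, and whether the grafOS lease and session handles implement Drop this way is not stated in this recipe. Drop covers early returns and unwinding panics; it does not run if the process aborts, in which case reclamation would have to come from the lease expiring.

```rust
use std::cell::RefCell;
use std::rc::Rc;

/// Shared log so the example can observe the order of cleanup.
type EventLog = Rc<RefCell<Vec<&'static str>>>;

/// Hypothetical lease handle that records its release when dropped.
struct ToyLease {
    log: EventLog,
}

impl Drop for ToyLease {
    fn drop(&mut self) {
        self.log.borrow_mut().push("lease released");
    }
}

/// Hypothetical GPU buffer that borrows the lease, so it cannot outlive it.
struct ToyBuffer<'lease> {
    _lease: &'lease ToyLease,
    log: EventLog,
}

impl Drop for ToyBuffer<'_> {
    fn drop(&mut self) {
        self.log.borrow_mut().push("buffer freed");
    }
}

/// Request handler stand-in: cleanup happens on both the success path and the
/// early-return error path, with no explicit cleanup code in either.
fn handle(log: &EventLog, fail_midway: bool) -> Result<(), &'static str> {
    let lease = ToyLease { log: Rc::clone(log) };
    let _buffer = ToyBuffer { _lease: &lease, log: Rc::clone(log) };
    if fail_midway {
        return Err("kernel launch failed");
    }
    Ok(())
}
```

Locals are dropped in reverse declaration order, so the buffer is freed before the lease is released on every exit path from handle.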
Tests
Run it with:

    cargo test -p cookbook-recipe-53-per-request-leased-inference

The tests cover request-scoped leases, real capability validation, and fail-closed audience checks.
See also:
- crates/grafos-std/src/capability.rs
- crates/grafos-std/src/gpu.rs