Recipe 53: Per-Request Leased Inference Gateway

Situation

You operate a shared inference gateway. Requests from different tenants should not inherit a long-lived process-level GPU grant, and a stalled request should not leave GPU memory resident after its useful lifetime.

In grafOS, each request can acquire a short GPU lease, receive a scoped capability grant for that lease, run the kernel, and tear the lease down when the request completes.

What You Build

An inference handler that:

acquires one short GpuLease per request;
requests a scoped capability through the grafOS authority boundary;
launches inference through GpuSession;
records request completion in ReplicatedIdempotencyStore;
explicitly frees GPU memory, unloads the module, and releases the lease.

The compiled recipe lives in cookbook/recipe-53-per-request-leased-inference.
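The explicit teardown in the last step only runs if the handler reaches it. A minimal sketch of one way to make release robust to early returns, assuming a drop guard: `DemoLease` and `serve` are hypothetical stand-ins written for this illustration, not grafos_std types, and the sketch makes no claim about what `GpuLease` does when dropped.

```rust
use std::cell::RefCell;
use std::rc::Rc;

/// Hypothetical stand-in for a GPU lease; not the grafos_std type.
/// It records teardown in a shared log so the ordering can be inspected.
struct DemoLease {
    log: Rc<RefCell<Vec<&'static str>>>,
}

impl Drop for DemoLease {
    fn drop(&mut self) {
        // Runs on normal return, on early return via `?` or `return`,
        // and while unwinding from a panic.
        self.log.borrow_mut().push("lease released");
    }
}

/// Hypothetical per-request work that can fail after the lease is acquired.
fn serve(log: Rc<RefCell<Vec<&'static str>>>, fail: bool) -> Result<u32, String> {
    let _lease = DemoLease { log: Rc::clone(&log) };
    log.borrow_mut().push("lease acquired");
    if fail {
        // Early return: `_lease` is still dropped on the way out.
        return Err("kernel launch failed".to_string());
    }
    log.borrow_mut().push("kernel finished");
    Ok(4)
}
```

Calling `serve` with `fail` set to `true` leaves the log with the acquire entry followed by the release entry, even though the function never reached its success path.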

Core grafOS API Path

The handler uses the public idempotency store, GPU builder/session APIs, and grafOS capability authority APIs. It does not hold a fabricBIOS signing key or call raw token minting functions:

use grafos_replicated::{
    FenceEpoch, IdempotencyKey, IdempotencyOutcome, LogicalResourceName,
    OperationHash, ReplicatedIdempotencyStore, ResourceKind, SchemaId,
};
use grafos_std::capability::{
    CapabilityAuthority, CapabilityPermission, CapabilityRequest,
};
use grafos_std::gpu::{GpuBuilder, GpuExclusivityClass, GpuSession, KernelArgs};
use cookbook_recipe_53_per_request_leased_inference::{
    InferenceError, InferenceRequest, InferenceResponse,
};

fn handle_request(
    effects: &mut ReplicatedIdempotencyStore,
    authority: &mut impl CapabilityAuthority,
    request: InferenceRequest,
    module_bytes: &[u8],
    kernel: &str,
) -> Result<InferenceResponse, InferenceError> {
    let key = IdempotencyKey::new(format!("inference:{}", request.request_id));
    let mut payload = request.request_id.clone().into_bytes();
    payload.push(0);
    payload.extend_from_slice(request.model_id.as_bytes());
    payload.push(0);
    payload.extend_from_slice(&request.output_len.to_be_bytes());
    payload.extend_from_slice(&request.input);
    let fingerprint = OperationHash::from_canonical_parts(
        &LogicalResourceName::new("per-request-inference"),
        ResourceKind::Workflow,
        "infer",
        &SchemaId::new("inference-request.v1"),
        &payload,
    );

    let reservation = effects.reserve(
        key.clone(),
        fingerprint,
        None,
        FenceEpoch(1),
    )?;
    if matches!(reservation.value.outcome, IdempotencyOutcome::Completed { .. }) {
        return Err(InferenceError::DuplicateRequest(request.request_id));
    }

    let lease = GpuBuilder::new()
        .min_vram(request.min_vram)
        .lease_secs(request.lease_ttl_secs)
        .exclusivity(GpuExclusivityClass::SessionExclusive)
        .acquire()?;

    let grant = authority.issue(
        CapabilityRequest::new(
            lease.lease_id(),
            CapabilityPermission::LEASE_QUERY | CapabilityPermission::TASKLET_SUBMIT,
            request.audience,
        )
        .ttl_secs(request.lease_ttl_secs as u64)
        .nonce(request.token_nonce),
        request.now_secs,
    )?;
    authority.validate(
        &grant,
        request.now_secs,
        &request.audience,
        CapabilityPermission::TASKLET_SUBMIT,
    )?;

    let mut session = GpuSession::new(&lease);
    let input = session.mem_alloc(request.input.len() as u64)?;
    session.mem_write(&input, 0, &request.input)?;
    let module = session.module_load(module_bytes)?;
    let args = KernelArgs::new()
        .push_u32(request.output_len)
        .push_buffer(&input);
    session.launch_with_args(&module, kernel, [1, 1, 1], [1, 1, 1], args)?;
    session.sync()?;
    let output = session.mem_read(&input, 0, request.output_len)?;
    session.module_unload(module)?;
    session.mem_free(input)?;

    effects.complete(
        key,
        reservation.version,
        IdempotencyOutcome::Completed { effect: None },
        FenceEpoch(1),
    )?;
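    // Read the lease id and end the session borrow before releasing the
    // lease, so the handle is not used after free().
    let lease_id = lease.lease_id();
    drop(session);
    lease.free();

    Ok(InferenceResponse {
        request_id: request.request_id,
        lease_id,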
        token_expires_at: grant.expires_at_unix_secs(),
        output,
    })
}

The published serve_inference_request function in the crate is this pattern packaged as a reusable handler.

In production the CapabilityAuthority handle is supplied by the scheduler or runtime to the gateway. The local development authority used below exists only so the native cookbook crate can exercise the same grafOS API path in tests.
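The fingerprint payload that the handler assembles inline can be factored into a small helper. The sketch below mirrors the handler's field order (request id, NUL, model id, NUL, big-endian output length, raw input). The function name `canonical_payload` is hypothetical, and treating `output_len` as a `u32` is an assumption inferred from the `push_u32` call in the handler.

```rust
/// Builds the byte string that is hashed into the operation fingerprint.
/// Hypothetical helper; `output_len: u32` is an assumption.
fn canonical_payload(
    request_id: &str,
    model_id: &str,
    output_len: u32,
    input: &[u8],
) -> Vec<u8> {
    let mut payload =
        Vec::with_capacity(request_id.len() + model_id.len() + 6 + input.len());
    payload.extend_from_slice(request_id.as_bytes());
    payload.push(0); // separator between request id and model id
    payload.extend_from_slice(model_id.as_bytes());
    payload.push(0); // separator between model id and output length
    payload.extend_from_slice(&output_len.to_be_bytes()); // fixed 4 bytes
    payload.extend_from_slice(input);
    payload
}
```

The NUL separators keep `("ab", "c")` and `("a", "bc")` from encoding to the same bytes. That only holds if neither identifier can itself contain a NUL byte, which this sketch does not check.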

Program

use cookbook_recipe_53_per_request_leased_inference::{
    replicated_inference_effects, serve_inference_request, InferenceRequest,
};
use grafos_std::capability::RuntimeCapabilityAuthority;

let mut effects = replicated_inference_effects()?;
let mut authority = RuntimeCapabilityAuthority::local_development();
let response = serve_inference_request(
    &mut effects,
    &mut authority,
    InferenceRequest {
        request_id: "req-2026-05-06-1".into(),
        model_id: "embedder-v3".into(),
        input: vec![1, 2, 3, 4],
        output_len: 4,
        min_vram: 2 * 1024 * 1024 * 1024,
        lease_ttl_secs: 10,
        audience: [0x22; 32],
        token_nonce: 99,
        now_secs: 1_000,
    },
    include_bytes!("infer.ptx"),
    "infer",
)?;

assert_eq!(response.output.len(), 4);
# Ok::<(), cookbook_recipe_53_per_request_leased_inference::InferenceError>(())

Design

The request gets only the permissions required for inference:

CapabilityPermission::TASKLET_SUBMIT to submit the kernel work;
CapabilityPermission::LEASE_QUERY to bind the request to the lease identity.

The capability grant is validated before GPU work starts. Audience mismatch, expiration, revocation, or insufficient permissions all stop the request before the kernel is launched. The fabricBIOS wire token exists below this API, but the program does not mint or parse it.
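The fail-closed checks described above can be pictured as a sequence of guards that each return an error. This is a sketch under stated assumptions, not the grafOS implementation: `DemoGrant`, `Denied`, the `validate` function, and the permission bit values are all hypothetical and exist only to illustrate the four rejection cases.

```rust
/// Hypothetical permission bits; the real CapabilityPermission values
/// are not shown in this recipe.
const LEASE_QUERY: u32 = 0b01;
const TASKLET_SUBMIT: u32 = 0b10;

#[derive(Debug, PartialEq)]
enum Denied {
    Revoked,
    Expired,
    AudienceMismatch,
    MissingPermission,
}

/// Hypothetical grant record standing in for what an authority returns.
struct DemoGrant {
    audience: [u8; 32],
    permissions: u32,
    expires_at_secs: u64,
    revoked: bool,
}

/// Every check must pass; any failure stops the request before GPU work.
fn validate(
    grant: &DemoGrant,
    now_secs: u64,
    audience: &[u8; 32],
    required: u32,
) -> Result<(), Denied> {
    if grant.revoked {
        return Err(Denied::Revoked);
    }
    if now_secs >= grant.expires_at_secs {
        return Err(Denied::Expired);
    }
    if grant.audience != *audience {
        return Err(Denied::AudienceMismatch);
    }
    // All required bits must be present in the grant.
    if grant.permissions & required != required {
        return Err(Denied::MissingPermission);
    }
    Ok(())
}
```

With a grant for audience `[0x22; 32]` that expires at second 1010, a call at second 1000 with the matching audience succeeds, while a different audience, a call at or after expiry, or a request for a bit the grant lacks each return a distinct error.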

Failure Modes

Bad or expired capability: fail before GPU launch.
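Duplicate request: if the idempotency key is already recorded as completed, the handler returns InferenceError::DuplicateRequest before acquiring a lease, so no second logical effect occurs.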
GPU pressure: lease acquisition fails; the gateway can reject or queue the request without owning any partially initialized GPU session.
Request handler crash: GPU resources are tied to the lease and session handles, so reclaiming them does not depend on a separate cleanup script.
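The duplicate-request case rests on a reserve-then-complete pattern. The sketch below shows that pattern with an in-memory map. `DemoStore` and its types are hypothetical and far simpler than ReplicatedIdempotencyStore: there is no replication, versioning, or fencing, and rejecting a reused key with a different fingerprint is an assumption about sensible behavior, not a documented property of the real store.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
enum Outcome {
    Pending,
    Completed,
}

#[derive(Debug, PartialEq)]
enum ReserveError {
    FingerprintMismatch,
}

/// Hypothetical in-memory store: key -> (fingerprint, outcome).
#[derive(Default)]
struct DemoStore {
    entries: HashMap<String, (u64, Outcome)>,
}

impl DemoStore {
    /// Returns the outcome recorded for `key`, inserting Pending if the key is new.
    fn reserve(&mut self, key: &str, fingerprint: u64) -> Result<Outcome, ReserveError> {
        let entry = self
            .entries
            .entry(key.to_string())
            .or_insert((fingerprint, Outcome::Pending));
        if entry.0 != fingerprint {
            // Same key, different request contents (assumed to be an error).
            return Err(ReserveError::FingerprintMismatch);
        }
        Ok(entry.1.clone())
    }

    /// Marks a previously reserved key as completed.
    fn complete(&mut self, key: &str) {
        if let Some(entry) = self.entries.get_mut(key) {
            entry.1 = Outcome::Completed;
        }
    }
}
```

A handler built on this shape would check the result of `reserve` first and return a duplicate error on `Outcome::Completed`, which is the same ordering the recipe uses: the check happens before any GPU lease is requested.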

Tests

Run the tests with:

cargo test -p cookbook-recipe-53-per-request-leased-inference

The tests cover request-scoped leases, real capability validation, and fail-closed audience checks.

See also:

crates/grafos-std/src/capability.rs
crates/grafos-std/src/gpu.rs