Recipe 5: An Inference Server That Shares GPUs Without Containers

Situation

You have multiple models or services that need GPU access. Traditional approaches to sharing a GPU run into familiar problems:

  • Whole-device allocation wastes VRAM and compute.
  • Container-per-model isolates processes but does not automatically solve VRAM fragmentation and scheduling.
  • GPU scheduling stacks are complex and operationally heavy.

In a lease-based model, you aim for:

  • VRAM scoped to a lease.
  • Session lifetime scoped to a lease.
  • Fast teardown on drop or expiry.

The goal of this recipe is to show the pattern: GPU access becomes a leased resource, and inference becomes “acquire -> run -> drop”, with observability.

What You Build

A simple inference service with:

  • Per-model GPU lease acquisition.
  • A request/response interface (RPC) that uses leased memory as its hot path.
  • A teardown model that returns VRAM capacity immediately when the lease is dropped.
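The teardown behaviour can be illustrated without any GPU at all. The following is a minimal sketch, assuming a hypothetical VramPool and VramLease (stand-ins written for this example, not grafos types): capacity is reserved on acquire and returned by the Drop implementation, which is what makes "acquire -> run -> drop" hold even on early returns and panics that unwind.

```rust
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for the runtime's VRAM accounting (not a grafos type).
#[derive(Clone)]
pub struct VramPool {
    free_bytes: Arc<Mutex<u64>>,
}

/// Hypothetical lease handle: holds reserved bytes until it is dropped.
pub struct VramLease {
    pool: VramPool,
    bytes: u64,
}

impl VramPool {
    pub fn new(capacity_bytes: u64) -> Self {
        Self { free_bytes: Arc::new(Mutex::new(capacity_bytes)) }
    }

    /// Bytes currently available for new leases.
    pub fn free_bytes(&self) -> u64 {
        *self.free_bytes.lock().unwrap()
    }

    /// Reserve `bytes`, or return None if the pool cannot satisfy the request.
    pub fn acquire(&self, bytes: u64) -> Option<VramLease> {
        let mut free = self.free_bytes.lock().unwrap();
        if *free < bytes {
            return None;
        }
        *free -= bytes;
        Some(VramLease { pool: self.clone(), bytes })
    }
}

impl VramLease {
    /// The "run" step: work borrows the lease, so it cannot outlive it.
    pub fn run<T>(&self, work: impl FnOnce(u64) -> T) -> T {
        work(self.bytes)
    }
}

impl Drop for VramLease {
    fn drop(&mut self) {
        // Capacity returns to the pool as soon as the handle goes out of scope.
        *self.pool.free_bytes.lock().unwrap() += self.bytes;
    }
}
```

In a real deployment the accounting lives in the host/runtime rather than in the process; the sketch only shows why tying the reservation to a handle's lifetime gives immediate teardown.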

Building Blocks

  • grafos_std::gpu::GpuBuilder and GpuLease
  • grafos_rpc::{RpcServer, RpcClient} for lease-backed RPC
  • grafos_tensor (if you model tensors explicitly; optional for the conceptual pattern)
  • grafos_observe for metering

Design

Resource Model

Per model instance:

  • Acquire a GPU lease with min_vram(bytes) and an appropriate TTL.
  • Keep the lease handle for the lifetime of that model instance.

On request:

  • Deserialize input.
  • Upload input to GPU (or treat input as bytes passed to a kernel).
  • Launch kernel(s).
  • Return result bytes.
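The four request steps can be sketched as a single function. This is an illustration under stated assumptions: the Device trait, the ReverseDevice fake, and the length-prefixed wire format are all hypothetical and exist only to make the example self-contained; a real handler would call the lease-backed submit API behind the trait.

```rust
/// Hypothetical device abstraction; the lease-backed submit API would sit behind it.
pub trait Device {
    fn launch(&mut self, kernel: &str, args: &[u8]) -> Result<Vec<u8>, String>;
}

/// Assumed toy wire format: 4-byte little-endian length, then the payload.
pub fn decode_request(frame: &[u8]) -> Result<&[u8], String> {
    if frame.len() < 4 {
        return Err("frame too short".to_string());
    }
    let (head, body) = frame.split_at(4);
    let len = u32::from_le_bytes([head[0], head[1], head[2], head[3]]) as usize;
    if body.len() != len {
        return Err(format!("length mismatch: header {} vs body {}", len, body.len()));
    }
    Ok(body)
}

pub fn handle_request<D: Device>(dev: &mut D, frame: &[u8]) -> Result<Vec<u8>, String> {
    // 1. Deserialize input.
    let input = decode_request(frame)?;
    // 2 + 3. Treat the input as argument bytes and launch the kernel.
    let output = dev.launch("infer", input)?;
    // 4. Return result bytes.
    Ok(output)
}

/// Fake device for tests: "inference" just reverses the input bytes.
pub struct ReverseDevice {
    pub launches: u32,
}

impl Device for ReverseDevice {
    fn launch(&mut self, _kernel: &str, args: &[u8]) -> Result<Vec<u8>, String> {
        self.launches += 1;
        Ok(args.iter().rev().copied().collect())
    }
}
```

Keeping the device behind a trait means the request path can be exercised in unit tests without a GPU, and malformed frames are rejected before any kernel is launched.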

Isolation and Safety

This recipe assumes the host/runtime enforces:

  • Lease scoping of GPU submission APIs.
  • Fuel/time limits on kernels (or a watchdog).
  • Output size limits.

The Rust side should still implement:

  • Input size bounds.
  • Output size bounds.
  • Explicit error paths for expired/disconnected leases.
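One way to express these Rust-side checks is a thin wrapper around the launch call. The sketch below is an assumption-laden illustration: InferError, Limits, and bounded_infer are hypothetical names, and the LeaseExpired and Disconnected variants simply mirror the failure names used later in this recipe rather than any confirmed grafos error type.

```rust
/// Hypothetical error type; variant names mirror this recipe's failure modes.
#[derive(Debug, PartialEq)]
pub enum InferError {
    InputTooLarge { len: usize, max: usize },
    OutputTooLarge { len: usize, max: usize },
    LeaseExpired,
    Disconnected,
}

/// Hypothetical per-model size limits, in bytes.
pub struct Limits {
    pub max_input: usize,
    pub max_output: usize,
}

/// Check the input bound, run the launch closure, then check the output bound.
pub fn bounded_infer(
    limits: &Limits,
    input: &[u8],
    launch: impl FnOnce(&[u8]) -> Result<Vec<u8>, InferError>,
) -> Result<Vec<u8>, InferError> {
    if input.len() > limits.max_input {
        // Reject before touching the GPU at all.
        return Err(InferError::InputTooLarge { len: input.len(), max: limits.max_input });
    }
    // Lease and connectivity errors from the launch propagate unchanged.
    let output = launch(input)?;
    if output.len() > limits.max_output {
        return Err(InferError::OutputTooLarge { len: output.len(), max: limits.max_output });
    }
    Ok(output)
}
```

The host-side limits remain the real enforcement boundary; these checks exist so that the service fails early and reports a specific error to the client.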

Walkthrough (Implementation Sketch)

1. Acquire GPU Lease

use grafos_std::gpu::GpuBuilder;

// Request at least 2 GiB of VRAM, leased for 300 seconds.
let gpu = GpuBuilder::new()
    .min_vram(2 * 1024 * 1024 * 1024)
    .lease_secs(300)
    .acquire()?;

2. Submit a Kernel

FabricGpu, reached here through gpu.gpu(), provides submit(kernel_name, binary). You then configure the grid and block dimensions and pass the argument bytes:

let res = gpu.gpu()
    .submit("infer", kernel_binary)
    .grid([256, 1, 1])
    .block([64, 1, 1])
    .arg(input_bytes)
    .launch()?;

3. Serve Requests via RPC

The RPC hot path is shared memory. A colocated client/server can exchange requests without TCP.

You can structure the service as:

  • a server loop that watches for REQUEST_READY
  • an RPC handler that calls gpu.gpu().submit(...) and then launch(), as in step 2

The cookbook-level point: “inference service” is mostly plumbing; the novel part is the lifecycle and isolation.
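The server loop can be sketched with an in-process stand-in for the shared-memory slot. Everything here is hypothetical and not the grafos_rpc API: Slot, the IDLE and RESPONSE_READY states, and the method names are assumptions for illustration; only REQUEST_READY is named in this recipe. A Mutex-protected buffer stands in for the leased memory region.

```rust
use std::sync::atomic::{AtomicU8, Ordering};
use std::sync::Mutex;

pub const IDLE: u8 = 0;
pub const REQUEST_READY: u8 = 1;
pub const RESPONSE_READY: u8 = 2;

/// Hypothetical in-process stand-in for a lease-backed shared-memory slot.
pub struct Slot {
    state: AtomicU8,
    buf: Mutex<Vec<u8>>,
}

impl Slot {
    pub fn new() -> Self {
        Slot { state: AtomicU8::new(IDLE), buf: Mutex::new(Vec::new()) }
    }

    /// Client side: write the request, then publish it with a release store.
    pub fn post_request(&self, bytes: &[u8]) {
        *self.buf.lock().unwrap() = bytes.to_vec();
        self.state.store(REQUEST_READY, Ordering::Release);
    }

    /// Client side: take the response if one is ready, returning the slot to IDLE.
    pub fn take_response(&self) -> Option<Vec<u8>> {
        self.state
            .compare_exchange(RESPONSE_READY, IDLE, Ordering::AcqRel, Ordering::Acquire)
            .ok()?;
        Some(std::mem::take(&mut *self.buf.lock().unwrap()))
    }

    /// Server side: one loop iteration. Returns true if a request was served.
    pub fn poll_once(&self, handler: impl FnOnce(&[u8]) -> Vec<u8>) -> bool {
        if self.state.load(Ordering::Acquire) != REQUEST_READY {
            return false;
        }
        let mut buf = self.buf.lock().unwrap();
        // The handler is where the kernel launch would happen.
        let response = handler(buf.as_slice());
        *buf = response;
        drop(buf);
        self.state.store(RESPONSE_READY, Ordering::Release);
        true
    }
}
```

A real server would call poll_once in a loop (with a wait or yield between empty polls) and pass a handler that performs the kernel launch. The sketch supports one outstanding request per slot, which keeps the state machine to three states.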

Failure Modes

  • LeaseExpired: the model instance loses its GPU; return a clear error to the client and optionally reacquire.
  • Disconnected: the fabric/runtime is unreachable; treat as transient and retry with backoff.
  • Resource pressure: acquiring VRAM fails; degrade (serve smaller models) or queue.
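The three failure modes map naturally onto a small policy function. The sketch below is hypothetical: LeaseError, Action, plan, and with_retry are names invented for this example, and the backoff constants (50 ms base, 5 s cap) are arbitrary illustrative values. The sleep function is injected so the policy can be tested without waiting.

```rust
use std::time::Duration;

/// Hypothetical error kinds mirroring the failure modes listed above.
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum LeaseError {
    LeaseExpired,
    Disconnected,
    ResourcePressure,
}

#[derive(Debug, PartialEq)]
pub enum Action {
    Reacquire,
    RetryAfter(Duration),
    Degrade,
}

/// Decide what to do after a failure; `attempt` counts from 0.
pub fn plan(err: LeaseError, attempt: u32) -> Action {
    match err {
        LeaseError::LeaseExpired => Action::Reacquire,
        LeaseError::Disconnected => {
            // Exponential backoff: 50 ms * 2^attempt, capped at 5 s.
            let ms = 50u64.saturating_mul(1u64 << attempt.min(16)).min(5_000);
            Action::RetryAfter(Duration::from_millis(ms))
        }
        LeaseError::ResourcePressure => Action::Degrade,
    }
}

/// Retry `op` only on transient errors, for at most `max_attempts` calls.
pub fn with_retry<T>(
    max_attempts: u32,
    mut op: impl FnMut() -> Result<T, LeaseError>,
    mut sleep: impl FnMut(Duration),
) -> Result<T, LeaseError> {
    let mut attempt = 0;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            Err(e) => match plan(e, attempt) {
                Action::RetryAfter(delay) if attempt + 1 < max_attempts => {
                    sleep(delay);
                    attempt += 1;
                }
                // Non-transient errors and exhausted retries go to the caller.
                _ => return Err(e),
            },
        }
    }
}
```

In production code the sleep argument would typically be std::thread::sleep or an async timer. Reacquire and Degrade are left to the caller because they change which lease or model is in use, which is a routing decision rather than a retry.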

Observability

Track:

  • gpu_lease_seconds per model
  • kernel_launch_total, kernel_fail_total
  • inference_latency_ms histogram
  • bytes_in / bytes_out
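These metrics fit in a small per-model struct. The following is a minimal sketch that does not use grafos_observe (whose API is not shown in this recipe): ModelMetrics, BUCKETS_MS, and the bucket boundaries are hypothetical, and a real service would export the values through whatever metering layer it uses.

```rust
use std::time::Duration;

/// Assumed histogram upper bounds in milliseconds; the final bucket is overflow.
pub const BUCKETS_MS: [u64; 5] = [1, 5, 25, 100, 500];

/// Hypothetical per-model metrics, named after the list above.
#[derive(Default, Debug)]
pub struct ModelMetrics {
    pub gpu_lease_seconds: f64,
    pub kernel_launch_total: u64,
    pub kernel_fail_total: u64,
    pub bytes_in: u64,
    pub bytes_out: u64,
    /// One count per entry in BUCKETS_MS, plus one overflow bucket.
    pub inference_latency_ms: [u64; 6],
}

impl ModelMetrics {
    /// Record one kernel launch: `result` is Ok(output length) or Err(()) on failure.
    pub fn record_launch(&mut self, input_len: usize, result: Result<usize, ()>, latency: Duration) {
        self.kernel_launch_total += 1;
        self.bytes_in += input_len as u64;
        match result {
            Ok(output_len) => self.bytes_out += output_len as u64,
            Err(()) => self.kernel_fail_total += 1,
        }
        let ms = latency.as_millis() as u64;
        // First bucket whose upper bound covers the latency, else overflow.
        let idx = BUCKETS_MS.iter().position(|&b| ms <= b).unwrap_or(BUCKETS_MS.len());
        self.inference_latency_ms[idx] += 1;
    }

    /// Add time during which the model held its GPU lease.
    pub fn record_lease_held(&mut self, held: Duration) {
        self.gpu_lease_seconds += held.as_secs_f64();
    }
}
```

Recording the failure count separately from the launch count keeps the failure rate derivable as kernel_fail_total / kernel_launch_total.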

Variations

  • Multi-tenant server: multiple model leases in one process, routing requests by model id.
  • Burst models: models that hold a VRAM lease only while warm and drop it when idle.
  • Batching: combine requests into a single kernel launch.
  • Admission control: reject if TTL remaining is too small to finish inference safely.
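Admission control reduces to comparing the lease's remaining TTL against the expected inference time plus a safety margin. The sketch below is a hypothetical illustration: LeaseClock is not a grafos type, and it assumes the service records the expiry time itself when it acquires the lease. Passing `now` explicitly keeps the check deterministic in tests.

```rust
use std::time::{Duration, Instant};

/// Hypothetical local record of when a lease will expire.
pub struct LeaseClock {
    expires_at: Instant,
}

impl LeaseClock {
    /// `now` is the acquisition time; `ttl` matches the lease duration requested.
    pub fn new(now: Instant, ttl: Duration) -> Self {
        Self { expires_at: now + ttl }
    }

    /// Time left on the lease; zero once it has expired.
    pub fn remaining(&self, now: Instant) -> Duration {
        self.expires_at.saturating_duration_since(now)
    }

    /// Admit a request only if it should finish with `margin` to spare.
    pub fn admit(&self, now: Instant, expected: Duration, margin: Duration) -> bool {
        self.remaining(now) >= expected + margin
    }
}
```

A rejected request can be queued until the lease is renewed or reacquired, or failed fast with an error the client can retry on. The `expected` value could come from the inference_latency_ms histogram, for example a high percentile per model.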