Shared inference

Multi-tenant inference servers and request gateways on shared GPUs.

When to read this sub-group

You are operating an inference service that hosts more than one tenant on a shared GPU. The basics of running a kernel are behind you; the question now is how to keep the GPU utilized across tenants without leaking state between them, exceeding per-tenant quotas, or stranding capacity when a request completes.

Suggested order

  1. An Inference Server That Shares GPUs Without Containers: the foundational shared-GPU server. Establishes per-model leasing.
  2. Per-Request Leased Inference Gateway: a variant in which each request gets its own short lease. Use this when request lifetimes are highly variable.
  3. GPU Generation Targeting: the placement-aware variant. Shows how to constrain tenants to specific hardware generations (H100 vs L4 vs A100).
  4. Batching Four Tenants Into One Decode Forward Pass: kernel-level multi-tenant batching. Read this after the gateway recipe to see how the underlying decode kernel handles multiple tenants without leaking KV state.
  5. Continuous Batching With Per-Tenant Quotas: the production scheduler variant built on top of the batched decode kernel. Adds dynamic admission, fair-share scheduling, and starvation avoidance.
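
To make the quota and fair-share vocabulary used in the last recipe concrete, here is a minimal sketch of per-tenant admission into a shared decode batch. It is not taken from any of the recipes: `Tenant`, `admit`, and `release` are hypothetical names, and the policy (always admit from the tenant furthest below its quota) is one simple assumption among many possible fair-share rules.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Tenant:
    # Hypothetical bookkeeping record; the recipes may track this differently.
    name: str
    quota: int                                      # max sequences this tenant may hold in the batch
    running: int = 0                                # sequences currently in the batch
    waiting: deque = field(default_factory=deque)   # queued request ids


def admit(tenants, free_slots):
    """Fill free batch slots one at a time, always from the eligible tenant
    with the lowest running/quota ratio (assumed fair-share rule)."""
    admitted = []
    while free_slots > 0:
        # A tenant is eligible only if it has queued work and is under quota.
        eligible = [t for t in tenants if t.waiting and t.running < t.quota]
        if not eligible:
            break  # remaining slots stay free rather than breaking a quota
        t = min(eligible, key=lambda t: t.running / t.quota)
        admitted.append((t.name, t.waiting.popleft()))
        t.running += 1
        free_slots -= 1
    return admitted


def release(tenant):
    """Call when a sequence finishes so its slot can be reused next step."""
    tenant.running -= 1
```

With two tenants at a quota of 2 each, one with three queued requests and one with a single request, `admit(tenants, 4)` admits two requests from the first tenant and one from the second, and leaves the fourth slot empty until `release` is called. That empty slot is the "stranded capacity" trade-off a production scheduler has to weigh against strict quotas.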

What’s not here

  - Engine correctness comparisons. See correctness and memory.
  - Lease-revocation recovery during in-flight requests. See revocation and recovery.
  - Per-request audit / billing attribution. See audit and attribution.