GPU & Inference

All GPU and LLM workloads. Sub-grouped: basics, shared inference, correctness & memory, revocation & recovery, audit & attribution.

When to read this section

You are running anything on a GPU: a single CUDA kernel, a multi-tenant inference server, a checkpointed training job, or a speculative-decode pipeline. fabricBIOS treats GPU memory and compute as leased resources, so the same acquire → use → drop shape you've seen elsewhere in the cookbook applies here, with stronger contracts. Revocation can happen mid-kernel, not just between RPCs, and KV cache slots, model weights, and tenant quotas all live on the lease layer.
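As a rough illustration of the acquire → use → drop shape with revocation arriving partway through the work, here is a minimal Python sketch. It is a toy in-memory model, not the fabricBIOS API: `ToyGpuLease`, `acquire`, `decode`, and `LeaseRevoked` are hypothetical names invented for this example. The toy can only notice revocation between steps, whereas the real contract described above allows it mid-kernel.

```python
from contextlib import contextmanager


class LeaseRevoked(Exception):
    """Raised when work is attempted on a lease that has been reclaimed."""


class ToyGpuLease:
    """Hypothetical stand-in for a GPU lease; not the fabricBIOS API."""

    def __init__(self, device: str) -> None:
        self.device = device
        self.revoked = False
        self.dropped = False

    def revoke(self) -> None:
        # In a real system the lease layer would trigger this, not the caller.
        self.revoked = True

    def run_step(self, step: int) -> str:
        # Check the lease before every unit of work, not only between requests.
        if self.revoked:
            raise LeaseRevoked(f"lease on {self.device} revoked before step {step}")
        return f"{self.device}:step-{step}"

    def drop(self) -> None:
        self.dropped = True


@contextmanager
def acquire(device: str):
    lease = ToyGpuLease(device)  # acquire
    try:
        yield lease              # use
    finally:
        lease.drop()             # drop, even if the lease was revoked mid-use


def decode(lease, steps, revoke_at=None):
    """Run `steps` units of work; return (completed outputs, finished flag)."""
    done = []
    try:
        for step in range(steps):
            if step == revoke_at:
                lease.revoke()   # simulate reclamation in the middle of the loop
            done.append(lease.run_step(step))
    except LeaseRevoked:
        return done, False       # keep partial results so the caller can recover
    return done, True
```

Calling `decode(lease, 4, revoke_at=2)` inside `with acquire("gpu0") as lease:` returns the first two step outputs and a `False` finished flag, and the lease is still dropped on exit. The design point is that cleanup lives in `finally`, so a revoked lease is released exactly like a healthy one.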

This is the largest section in the cookbook. It is sub-grouped to match the natural reader arc: basics first, then shared inference, then correctness and memory, then revocation and recovery, then audit and attribution. Each sub-group has its own overview page.

Sub-groups

GPU basics — single-tenant, single-session, single-kernel GPU programs. Start here if you have never run a kernel on a leased device before.
Shared inference — multi-tenant inference servers and request gateways on shared GPUs.
Correctness & memory — engine output equivalence against canonical references, memory-efficient model loading, multi-engine composition.
Revocation & recovery — what happens when a lease is reclaimed mid-decode and how the engine recovers cleanly without dropping the other tenants.
Audit & attribution — per-request audit attribution for billing, compliance, and abuse detection.

What’s not here

Lease primitive fundamentals. See Start Here and the rest of the cookbook for the underlying lease, KV, and observe surfaces; these recipes assume you already know them.
Cross-cloud GPU placement and provider migration. See Placement, Scaling & Operations.