Tag: inference

LLM inference recipes: single-tenant, multi-tenant, batched, speculative, and the audit/attribution pattern that closes the per-request loop.

12 recipes carry this tag, ordered by recipe number:

Recipe 5 — An Inference Server That Shares GPUs Without Containers · gpu-and-inference/shared-inference
Recipe 53 — Per-Request Leased Inference Gateway · gpu-and-inference/shared-inference
Recipe 56 — GPU Generation Targeting · gpu-and-inference/shared-inference
Recipe 61 — Verifying Engine Output Against a Canonical Reference · gpu-and-inference/correctness-and-memory
Recipe 62 — Loading an LLM Without f32-at-Load Memory Blowup · gpu-and-inference/correctness-and-memory
Recipe 63 — Handling Mid-Kernel Lease Revocation in a Decode Loop · gpu-and-inference/revocation-and-recovery
Recipe 64 — Detecting a FENCED Lease State After Revocation · gpu-and-inference/revocation-and-recovery
Recipe 65 — Composing Two Fabric-Leased Engines for Speculative Decode · gpu-and-inference/correctness-and-memory
Recipe 66 — Batching Four Tenants Into One Decode Forward Pass · gpu-and-inference/shared-inference
Recipe 67 — Hot-Rebind Inference Continuity After a Lease Revocation · gpu-and-inference/revocation-and-recovery
Recipe 68 — Continuous Batching With Per-Tenant Quotas · gpu-and-inference/shared-inference
Recipe 69 — Per-Request Audit Attribution for Inference · gpu-and-inference/audit-and-attribution