Tag: inference
LLM inference recipes — single-tenant, multi-tenant, batched, speculative, and the audit / attribution pattern that closes the per-request loop.
12 recipes carry this tag, ordered by recipe number:
- Recipe 5 — An Inference Server That Shares GPUs Without Containers · gpu-and-inference/shared-inference
- Recipe 53 — Per-Request Leased Inference Gateway · gpu-and-inference/shared-inference
- Recipe 56 — GPU Generation Targeting · gpu-and-inference/shared-inference
- Recipe 61 — Verifying Engine Output Against a Canonical Reference · gpu-and-inference/correctness-and-memory
- Recipe 62 — Loading an LLM Without f32-at-Load Memory Blowup · gpu-and-inference/correctness-and-memory
- Recipe 63 — Handling Mid-Kernel Lease Revocation in a Decode Loop · gpu-and-inference/revocation-and-recovery
- Recipe 64 — Detecting a FENCED Lease State After Revocation · gpu-and-inference/revocation-and-recovery
- Recipe 65 — Composing Two Fabric-Leased Engines for Speculative Decode · gpu-and-inference/correctness-and-memory
- Recipe 66 — Batching Four Tenants Into One Decode Forward Pass · gpu-and-inference/shared-inference
- Recipe 67 — Hot-Rebind Inference Continuity After a Lease Revocation · gpu-and-inference/revocation-and-recovery
- Recipe 68 — Continuous Batching With Per-Tenant Quotas · gpu-and-inference/shared-inference
- Recipe 69 — Per-Request Audit Attribution for Inference · gpu-and-inference/audit-and-attribution