Recipe 56: GPU Generation Targeting
Situation
A research team runs ML inference on NVIDIA Hopper-class GPUs (H100, H200) for the latency-critical path, and batch training on Ampere-class GPUs (A100, A40) where throughput matters more than per-token latency. The team’s quota envelope shouldn’t be a single fleet-wide “N GPUs”; it should be “up to 4 Hopper and up to 8 Ampere,” with training requests refused if they accidentally target Hopper and inference refused if they accidentally land on Ampere.
In grafOS the tenant’s quota table carries per-generation
GpuGenerationQuota entries. The scheduler treats those entries as
an allow-list: when the table is non-empty, every GPU lease request
MUST declare its HardwareGeneration and unlisted generations have
an effective limit of zero. Missing generation and exceeded
generation are distinct typed denials, so policy code and SIEM
alerts can react to each cleanly.
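The allow-list rule described above can be sketched as a small self-contained model. This is not the grafOS implementation: `Generation`, `Denial`, and `check` are hypothetical stand-ins defined locally, and treating an empty table as "no per-generation policy" is an assumption read from the phrase "when the table is non-empty."

```rust
use std::collections::HashMap;

// Stand-in types for illustration only; the real grafOS definitions may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum Generation {
    Hopper,
    Ampere,
    Blackwell,
}

#[derive(Debug, PartialEq, Eq)]
enum Denial {
    GenerationRequired { requested: u32 },
    LimitExceeded { generation: Generation, limit: u32, used: u32, requested: u32 },
}

/// Allow-list check: `limits` is the per-generation table, `used` the current usage.
fn check(
    limits: &HashMap<Generation, u32>,
    used: &HashMap<Generation, u32>,
    generation: Option<Generation>,
    requested: u32,
) -> Result<(), Denial> {
    // Assumption: an empty table means no per-generation policy is configured.
    if limits.is_empty() {
        return Ok(());
    }
    // A non-empty table makes the generation tag mandatory.
    let generation = generation.ok_or(Denial::GenerationRequired { requested })?;
    // Unlisted generations fall through to an effective limit of zero.
    let limit = limits.get(&generation).copied().unwrap_or(0);
    let in_use = used.get(&generation).copied().unwrap_or(0);
    if in_use.saturating_add(requested) > limit {
        return Err(Denial::LimitExceeded { generation, limit, used: in_use, requested });
    }
    Ok(())
}
```

With a table of Hopper = 4 and Ampere = 8, a tagged Hopper request for 2 passes, an untagged request yields `GenerationRequired`, and a Blackwell request yields `LimitExceeded` with `limit: 0`.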
What You Build
A quota policy + admission check that:
- Builds a typed GpuFleetPolicy from per-generation allowances (Hopper, Ampere, Blackwell, MI300, etc.);
- Rejects duplicate-generation policy entries at construction time;
- Commits the policy onto a QuotaManager;
- Checks an incoming GpuLeaseRequest against the committed envelope and returns a typed AdmissionResult::{Approved, Denied(QuotaDenied)};
- Surfaces the exact QuotaDenied shape the scheduler emits, so callers can pattern-match on GpuGenerationRequired vs GpuGenerationLimitExceeded.
The compiled recipe lives in
cookbook/recipe-56-gpu-generation-targeting.
Core grafOS API Path
use grafos_core::{GpuGenerationQuota, HardwareGeneration};
use grafos_scheduler::{QuotaDenied, QuotaManager, TenantId};

let mut mgr = QuotaManager::new();
let tenant = TenantId(0xa1);

mgr.set_gpu_generation_limits(
    tenant,
    &[
        GpuGenerationQuota {
            generation: HardwareGeneration::NvidiaHopper,
            count: 4,
        },
        GpuGenerationQuota {
            generation: HardwareGeneration::NvidiaAmpere,
            count: 8,
        },
    ],
)?;

// Approved: declared, within envelope.
mgr.check_gpu_generation(tenant, Some(HardwareGeneration::NvidiaHopper), 2)?;

// Denied: generation required when limits are configured.
let err = mgr
    .check_gpu_generation(tenant, None, 1)
    .unwrap_err();
assert_eq!(err, QuotaDenied::GpuGenerationRequired { requested: 1 });

// Denied: unlisted generation has limit=0.
let err = mgr
    .check_gpu_generation(tenant, Some(HardwareGeneration::NvidiaBlackwell), 1)
    .unwrap_err();
assert!(matches!(
    err,
    QuotaDenied::GpuGenerationLimitExceeded {
        generation: HardwareGeneration::NvidiaBlackwell,
        limit: 0,
        used: 0,
        requested: 1,
    }
));
# Ok::<(), Box<dyn std::error::Error>>(())
Program
use cookbook_recipe_56_gpu_generation_targeting::{
    check_gpu_request, AdmissionResult, GpuFleetPolicy, GpuLeaseRequest,
};
use grafos_core::{GpuGenerationQuota, HardwareGeneration};
use grafos_scheduler::{QuotaManager, TenantId};

let tenant = TenantId(0xa1);
let mut mgr = QuotaManager::new();

let policy = GpuFleetPolicy::new(
    tenant,
    vec![
        GpuGenerationQuota {
            generation: HardwareGeneration::NvidiaHopper,
            count: 4,
        },
        GpuGenerationQuota {
            generation: HardwareGeneration::NvidiaAmpere,
            count: 8,
        },
    ],
)?;
policy.commit(&mut mgr).expect("commit");

// Inference path requests Hopper.
let inference = GpuLeaseRequest {
    generation: Some(HardwareGeneration::NvidiaHopper),
    count: 2,
};
assert_eq!(check_gpu_request(&mgr, tenant, inference), AdmissionResult::Approved);

// Training path forgot to tag generation: fail closed.
let untagged = GpuLeaseRequest {
    generation: None,
    count: 1,
};
match check_gpu_request(&mgr, tenant, untagged) {
    AdmissionResult::Denied(_) => {}
    AdmissionResult::Approved => unreachable!("untagged request must be denied"),
}
# Ok::<(), cookbook_recipe_56_gpu_generation_targeting::GpuFleetPolicyError>(())
Design
The per-generation table is an allow-list, not a refinement. That choice has two operator-visible consequences:
- Untagged GPU requests fail closed. When the table has any entries, requests without a HardwareGeneration are rejected with QuotaDenied::GpuGenerationRequired. Tenants that mix targeted and untargeted workloads need a policy entry for every generation they want to land on, including a HardwareGeneration::Other row if they want a catch-all bucket.
- Unlisted generations have effective limit zero. A tenant that lists Hopper + Ampere implicitly forbids Blackwell, MI300, etc. The typed denial carries the unlisted generation, a limit = 0, and the requested count, so a SIEM operator can distinguish “asked for the wrong generation entirely” from “asked for a generation we have, but exceeded the count.”
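One way a caller might route the two denial shapes into SIEM buckets is to match on the `limit` field. The sketch below uses local stand-in types, and the three bucket labels are hypothetical names chosen for illustration, not strings the scheduler is known to emit.

```rust
// Stand-in types for illustration only; the real grafOS definitions may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Generation {
    Hopper,
    Blackwell,
}

#[derive(Debug, PartialEq, Eq)]
enum Denial {
    GenerationRequired { requested: u32 },
    LimitExceeded { generation: Generation, limit: u32, used: u32, requested: u32 },
}

/// Maps a denial to a bucket label (labels are hypothetical).
fn siem_bucket(denial: &Denial) -> &'static str {
    match denial {
        // The request carried no generation tag at all.
        Denial::GenerationRequired { .. } => "generation_required",
        // limit == 0: the generation has no allowance in the policy.
        Denial::LimitExceeded { limit: 0, .. } => "generation_not_allowed",
        // limit > 0: the generation is listed, but the count was exceeded.
        Denial::LimitExceeded { .. } => "generation_over_limit",
    }
}
```

Note that a generation explicitly listed with `count: 0` would land in the same bucket as an unlisted one under this split, since both report a limit of zero.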
The duplicate-generation check at construction time mirrors the
scheduler-side check at set_gpu_generation_limits. Catching it
in the builder shortens the failure path: a misconfigured policy
file is rejected before any scheduler state mutates.
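A builder-side duplicate check of this kind could look roughly like the following. The types here (`FleetPolicy`, `GenerationQuota`, `PolicyError`) are simplified local stand-ins, not the recipe's actual GpuFleetPolicy, whose fields and error type are not shown in this document.

```rust
use std::collections::HashSet;

// Stand-in types for illustration only; the real recipe types may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum Generation {
    Hopper,
    Ampere,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct GenerationQuota {
    generation: Generation,
    count: u32,
}

#[derive(Debug, PartialEq, Eq)]
enum PolicyError {
    DuplicateGeneration(Generation),
}

#[derive(Debug, PartialEq, Eq)]
struct FleetPolicy {
    entries: Vec<GenerationQuota>,
}

impl FleetPolicy {
    /// Validates the entries before anything is committed to a manager.
    fn new(entries: Vec<GenerationQuota>) -> Result<Self, PolicyError> {
        let mut seen = HashSet::new();
        for entry in &entries {
            // `insert` returns false when the generation was already present.
            if !seen.insert(entry.generation) {
                return Err(PolicyError::DuplicateGeneration(entry.generation));
            }
        }
        Ok(Self { entries })
    }
}
```

Because validation happens in the constructor, a caller can never hold a `FleetPolicy` value that contains duplicates, so a commit step never sees one.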
GpuGenerationQuota carries only the generation variant and a count.
Attributes such as driver version, NVLink topology, or per-SM
config are out-of-band (they belong on the inventory side, not the
quota side). Two Hopper { count: 4 } entries with different
driver versions aggregate under the same SIEM bucket; quota
attribution does not split on driver micro-version.
Failure Modes
- Generation required: tenant has per-generation limits configured and the request omitted generation. Typed QuotaDenied::GpuGenerationRequired { requested }.
- Generation limit exceeded: the request’s (generation, count) would push usage past the listed limit. Typed QuotaDenied::GpuGenerationLimitExceeded { generation, limit, used, requested }. Operators read all four fields to size the next policy revision.
- Unlisted generation: the request named a generation not in the policy. Same shape as GpuGenerationLimitExceeded with limit = 0, used = 0; the SIEM bucket can be the same (“wrong generation”) or split by inspecting limit.
- Duplicate generation in policy: rejected at policy build time via GpuFleetPolicy::new, before any scheduler state mutates.
- Release lowers usage: record_gpu_generation_free is the counterpart of record_gpu_generation_alloc. Recipes are responsible for pairing them; a release that exceeds the recorded usage saturates at zero.
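The saturating release behavior in the last item can be modeled with `u32::saturating_sub`. The `Usage` struct and its method names are hypothetical stand-ins that mirror the alloc/free pairing described above; the real QuotaManager signatures are not shown in this document.

```rust
use std::collections::HashMap;

// Stand-in types for illustration only; the real grafOS definitions may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum Generation {
    Hopper,
    Ampere,
}

#[derive(Debug, Default)]
struct Usage {
    in_use: HashMap<Generation, u32>,
}

impl Usage {
    /// Records `count` GPUs of `generation` as allocated.
    fn record_alloc(&mut self, generation: Generation, count: u32) {
        let slot = self.in_use.entry(generation).or_insert(0);
        *slot = slot.saturating_add(count);
    }

    /// Records a release; an over-release clamps at zero instead of underflowing.
    fn record_free(&mut self, generation: Generation, count: u32) {
        let slot = self.in_use.entry(generation).or_insert(0);
        *slot = slot.saturating_sub(count);
    }

    /// Current usage for `generation` (zero if never allocated).
    fn used(&self, generation: Generation) -> u32 {
        self.in_use.get(&generation).copied().unwrap_or(0)
    }
}
```

Saturating at zero keeps the counter valid after an unpaired release, but it also hides the pairing bug, which is why the pairing responsibility stays with the caller.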
Tests
Run it with:
cargo test -p cookbook-recipe-56-gpu-generation-targeting
The tests cover declared-and-fits (Approved), missing generation
(GpuGenerationRequired), exceeding the envelope
(GpuGenerationLimitExceeded with typed limit/used/requested),
unlisted generation (limit = 0 denial), duplicate-generation
policy rejection at construction time, and release lowering usage
so a subsequent request fits again.
Adaptation Notes
- Allow-list vs additive: this recipe uses the allow-list semantic (unlisted = forbidden). If you want a “soft target” semantic where unlisted generations fall back to a fleet-wide pool, leave per-generation limits unset and use the gpu_count field on QuotaSchema instead. The two surfaces compose: a tenant with both has the per-generation table as the strict ceiling and the total gpu_count as an additional cap.
- Adding a new GPU family: extend HardwareGeneration (in grafos-core) with the new variant and its snake_case as_str(). Existing policies that don’t list the new variant continue to forbid it (allow-list semantic stays correct).
- Per-generation rate-card pricing: the typed HardwareGeneration value flows into the billing surface; a tenant with mixed Hopper + Ampere allowances sees per-generation cost rows in their invoice. See docs/operations/scheduler-features.md § “Resource taxonomy” for the billing-side mapping.
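The composition of the two quota surfaces in the first note can be read as "both checks must pass." The helper below is a hypothetical illustration of that reading, using plain integers rather than the real QuotaSchema type; how the scheduler actually orders or reports the two checks is not specified in this document.

```rust
/// Hypothetical helper: admit only if the request fits under both the
/// per-generation ceiling and the tenant-wide total cap.
fn fits_both(
    gen_limit: u32,
    gen_used: u32,
    total_cap: u32,
    total_used: u32,
    requested: u32,
) -> bool {
    // Per-generation table acts as the strict ceiling.
    let within_generation = gen_used.saturating_add(requested) <= gen_limit;
    // The total gpu_count acts as an additional cap across all generations.
    let within_total = total_used.saturating_add(requested) <= total_cap;
    within_generation && within_total
}
```

For example, with a Hopper limit of 4 and a total cap of 10, a request for 4 Hopper GPUs is refused when 8 GPUs are already in use across the tenant, even though no Hopper GPUs are held, while a request for 2 still fits.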
See also:
- crates/grafos-core/src/policy_vocab.rs: HardwareGeneration, GpuGenerationQuota, QuotaSchema.
- crates/grafos-scheduler/src/quota.rs: QuotaManager, check_gpu_generation, record_gpu_generation_alloc.
- docs/operations/scheduler-features.md § “Quota schema” and “HardwareGeneration”.
- docs/operations/siem-vocabulary-cookbook.md: SIEM filter recipes for quota_violation == "gpu_generation_required" and quota_violation == "gpu_generation_limit_exceeded".