Recipe 59: Cost Attribution With Accounting Tags

Situation

A team needs to know what each piece of their workload actually costs. Not “what’s the tenant invoice” — that’s the operator’s view. The team wants to roll up cost by their own budget group (e.g. “production inference” vs “research” vs “internal tools”) and by their own request class (e.g. “realtime”, “interactive”, “batch”) so they can answer questions like:

  • How much did production inference cost this week?
  • Within production inference, how much was realtime vs batch?
  • Is the synthetic monitoring probe inflating any of these numbers?

In grafOS, every lease can carry a fixed-size AccountingTag with five canonical fields: tenant, workload, request_class, budget_group, flags. The fields enter the audit-chain canonical bytes deterministically. A tenant’s local bookkeeping and the operator’s billing path consume the same struct, so the tenant’s roll-up matches what the invoice would compute.

What You Build

A typed cost-attribution helper that:

  • Builds AccountingTags at the call site without hiding which integer is which (tag(tenant, workload, request_class, budget_group, flags));
  • Carries lease-held seconds and a tenant-defined per-second rate;
  • Rolls up cost by budget_group and by (budget_group, request_class);
  • Surfaces a typed FLAG_SYNTHETIC_PROBE bit and an is_synthetic_probe predicate so probes can be filtered out of production roll-ups without magic numbers;
  • Uses saturating arithmetic so a pathological rate doesn’t panic.

The compiled recipe lives in cookbook/recipe-59-cost-attribution-tags.
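
The helper surface listed above could look roughly like the following. This is a minimal sketch, not the code that ships in cookbook/recipe-59-cost-attribution-tags: AccountingTag here is a local stand-in for grafos_core::AccountingTag so the block compiles on its own, and the u64 types on seconds_held and rate_micros_per_sec are assumptions.

```rust
// Local stand-in for grafos_core::AccountingTag (assumption: same five fields).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct AccountingTag {
    pub tenant: u64,
    pub workload: u64,
    pub request_class: u32,
    pub budget_group: u32,
    pub flags: u32,
}

/// Bit 0 of `flags` marks synthetic / probe traffic.
pub const FLAG_SYNTHETIC_PROBE: u32 = 1 << 0;

/// Positional constructor; the argument order mirrors the field order.
pub fn tag(
    tenant: u64,
    workload: u64,
    request_class: u32,
    budget_group: u32,
    flags: u32,
) -> AccountingTag {
    AccountingTag { tenant, workload, request_class, budget_group, flags }
}

/// True when the synthetic-probe bit is set, whatever the other bits are.
pub fn is_synthetic_probe(t: AccountingTag) -> bool {
    (t.flags & FLAG_SYNTHETIC_PROBE) != 0
}

/// One lease's cost inputs (field types are assumed to be u64).
#[derive(Clone, Copy, Debug)]
pub struct LeaseCostRecord {
    pub tag: AccountingTag,
    pub seconds_held: u64,
    pub rate_micros_per_sec: u64,
}

impl LeaseCostRecord {
    /// Saturating product, so a pathological rate clamps instead of panicking.
    pub fn total_micros(&self) -> u64 {
        self.seconds_held.saturating_mul(self.rate_micros_per_sec)
    }
}

fn main() {
    let probe = LeaseCostRecord {
        tag: tag(1, 99, 1, 100, FLAG_SYNTHETIC_PROBE),
        seconds_held: 60,
        rate_micros_per_sec: 1_000,
    };
    println!(
        "probe={} total={} micro-units",
        is_synthetic_probe(probe.tag),
        probe.total_micros()
    );
}
```

Naming every argument in the constructor signature is what keeps the call site honest: a reader can match each positional value to a field without opening the struct definition.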

Core grafOS API Path

use grafos_core::AccountingTag;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let prod_realtime = AccountingTag {
        tenant: 1,
        workload: 42,
        request_class: 1,  // tenant-defined: 1 = realtime
        budget_group: 100, // tenant-defined: 100 = production-inference
        flags: 0,
    };

    // Round-trip through canonical bytes: the same encoding the
    // audit chain consumes when this tag accompanies a lease.
    let mut buf = [0u8; AccountingTag::ENCODED_LEN];
    prod_realtime.encode(&mut buf)?;
    let back = AccountingTag::decode(&buf)?;
    assert_eq!(back, prod_realtime);
    Ok(())
}

Program

use cookbook_recipe_59_cost_attribution_tags::{
    is_synthetic_probe, tag, BudgetByClassRollup, BudgetGroupRollup, LeaseCostRecord,
    FLAG_SYNTHETIC_PROBE,
};

// Three production leases + one synthetic probe.
let records = vec![
    LeaseCostRecord {
        tag: tag(1, /*workload=*/ 1, /*class=*/ 1, /*budget=*/ 100, 0),
        seconds_held: 3600,
        rate_micros_per_sec: 1_000,
    },
    LeaseCostRecord {
        tag: tag(1, 2, 2, 100, 0),
        seconds_held: 1800,
        rate_micros_per_sec: 2_000,
    },
    LeaseCostRecord {
        tag: tag(1, 3, 1, 200, 0),
        seconds_held: 7200,
        rate_micros_per_sec: 1_500,
    },
    LeaseCostRecord {
        tag: tag(1, 99, 1, 100, FLAG_SYNTHETIC_PROBE),
        seconds_held: 60,
        rate_micros_per_sec: 1_000,
    },
];

// Production-only roll-up: drop probes.
let production_only: Vec<_> = records
    .iter()
    .copied()
    .filter(|r| !is_synthetic_probe(r.tag))
    .collect();

let by_budget = BudgetGroupRollup::from_records(production_only.iter().copied());
let by_budget_class = BudgetByClassRollup::from_records(production_only);

for (budget, total) in &by_budget.totals_micros {
    println!("budget_group {budget}: {total} micro-units");
}
for ((budget, class), total) in &by_budget_class.totals_micros {
    println!("  budget={budget} class={class}: {total}");
}

Design

The AccountingTag shape is deliberately five fixed-size fields and nothing else:

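| Field         | Type | Purpose                                                                              |
| ------------- | ---- | ------------------------------------------------------------------------------------ |
| tenant        | u64  | Issuing tenant / namespace owner.                                                    |
| workload      | u64  | Program / service / run identifier.                                                  |
| request_class | u32  | Policy-defined latency / priority bucket.                                            |
| budget_group  | u32  | Policy-defined cost bucket.                                                          |
| flags         | u32  | Policy-defined bitfield; bit 0 reserved for “synthetic / probe edge” by convention.  |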

Two consequences:

  1. No free-form strings, no PII. The fields are u64s and u32s that the scheduler does not interpret. The tenant’s policy layer decides what request_class = 1 means; the scheduler carries the number through unchanged. SIEM streams and billing roll-ups operate on integers, not strings, so aggregation is deterministic.
  2. Fixed canonical bytes. AccountingTag::ENCODED_LEN is 28 bytes. The encoding is byte-stable across versions, so a recorded audit row from yesterday round-trips through today’s decoder unchanged.
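
The 28-byte figure follows from the field widths: 8 + 8 + 4 + 4 + 4. The sketch below shows one way a fixed-width encoding of that size can be laid out. It is illustrative only: the field order and the little-endian byte order are assumptions, and the authoritative wire shape is the one in docs/spec/audit-chain-canonical-bytes.md. Tag, encode, and decode are local stand-ins, not the grafos_core API.

```rust
// Local stand-in for the tag; not the grafos_core type.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct Tag {
    tenant: u64,
    workload: u64,
    request_class: u32,
    budget_group: u32,
    flags: u32,
}

// Two u64 fields plus three u32 fields.
const ENCODED_LEN: usize = 8 + 8 + 4 + 4 + 4;

// ASSUMPTION: fields in declaration order, little-endian.
fn encode(t: &Tag) -> [u8; ENCODED_LEN] {
    let mut buf = [0u8; ENCODED_LEN];
    buf[0..8].copy_from_slice(&t.tenant.to_le_bytes());
    buf[8..16].copy_from_slice(&t.workload.to_le_bytes());
    buf[16..20].copy_from_slice(&t.request_class.to_le_bytes());
    buf[20..24].copy_from_slice(&t.budget_group.to_le_bytes());
    buf[24..28].copy_from_slice(&t.flags.to_le_bytes());
    buf
}

fn u64_at(buf: &[u8], off: usize) -> u64 {
    let mut b = [0u8; 8];
    b.copy_from_slice(&buf[off..off + 8]);
    u64::from_le_bytes(b)
}

fn u32_at(buf: &[u8], off: usize) -> u32 {
    let mut b = [0u8; 4];
    b.copy_from_slice(&buf[off..off + 4]);
    u32::from_le_bytes(b)
}

fn decode(buf: &[u8; ENCODED_LEN]) -> Tag {
    Tag {
        tenant: u64_at(buf, 0),
        workload: u64_at(buf, 8),
        request_class: u32_at(buf, 16),
        budget_group: u32_at(buf, 20),
        flags: u32_at(buf, 24),
    }
}

fn main() {
    let t = Tag { tenant: 1, workload: 42, request_class: 1, budget_group: 100, flags: 0 };
    let buf = encode(&t);
    assert_eq!(decode(&buf), t);
    println!("encoded {} bytes", buf.len());
}
```

Because every field has a fixed width and a fixed offset, there is no length prefix or delimiter to parse, which is what makes byte-for-byte comparison and hashing of encoded tags straightforward.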

The synthetic-probe flag is the one bit the recipe surfaces with a named constant. Monitoring tooling that emits synthetic traffic (latency probes, capacity checks, canary requests) sets bit 0 so production cost roll-ups can filter it out without sampling or heuristics.

BudgetGroupRollup keys on budget_group alone, the typical operator-facing view (“what did production inference cost?”). BudgetByClassRollup keys on (budget_group, request_class) for the deeper “within production inference, how much was realtime vs batch?” view. Both use BTreeMap, which iterates in sorted key order, so roll-up output is byte-identical across runs given the same input, regardless of record order.
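
The two roll-ups can be sketched with plain BTreeMaps. This is an illustration of the keying and ordering idea, not the recipe's BudgetGroupRollup or BudgetByClassRollup: Rec is a hypothetical flattened record, the free functions are stand-ins, and the use of saturating addition when summing is an assumption.

```rust
use std::collections::BTreeMap;

// Hypothetical flattened record: only the fields the roll-ups need.
#[derive(Clone, Copy, Debug)]
struct Rec {
    budget_group: u32,
    request_class: u32,
    total_micros: u64,
}

/// Totals keyed by budget_group alone.
fn by_budget(records: &[Rec]) -> BTreeMap<u32, u64> {
    let mut totals = BTreeMap::new();
    for r in records {
        let slot = totals.entry(r.budget_group).or_insert(0u64);
        // ASSUMPTION: sums saturate rather than wrap or panic.
        *slot = slot.saturating_add(r.total_micros);
    }
    totals
}

/// Totals keyed by (budget_group, request_class).
fn by_budget_class(records: &[Rec]) -> BTreeMap<(u32, u32), u64> {
    let mut totals = BTreeMap::new();
    for r in records {
        let slot = totals
            .entry((r.budget_group, r.request_class))
            .or_insert(0u64);
        *slot = slot.saturating_add(r.total_micros);
    }
    totals
}

fn main() {
    // Same three production leases as the Program section, pre-multiplied.
    let recs = [
        Rec { budget_group: 100, request_class: 1, total_micros: 3_600_000 },
        Rec { budget_group: 100, request_class: 2, total_micros: 3_600_000 },
        Rec { budget_group: 200, request_class: 1, total_micros: 10_800_000 },
    ];
    for (budget, total) in &by_budget(&recs) {
        println!("budget_group {}: {} micro-units", budget, total);
    }
    for ((budget, class), total) in &by_budget_class(&recs) {
        println!("  budget={} class={}: {}", budget, class, total);
    }
}
```

A HashMap would produce the same totals but an unspecified iteration order, so printed or serialized output could differ between runs; the sorted map removes that source of noise.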

Failure Modes

  • Misattributed lease: a lease created without an AccountingTag (or with AccountingTag::ZERO) aggregates into the all-zeros bucket. The tenant’s local discipline catches this — the operator’s billing path will also surface the zeroed rows so neither side hides them.
  • Probe inflation: probes that don’t set FLAG_SYNTHETIC_PROBE get rolled up into production. The recipe ships the constant so probe tooling can set it consistently.
  • Rate overflow: LeaseCostRecord::total_micros uses saturating multiplication. A pathological rate (e.g. u64::MAX from a misconfigured rate-card) saturates at u64::MAX instead of panicking, and the saturated value is visible at the roll-up — it does not silently roll into a neighboring bucket.
  • Tag tampering after seal: the audit-chain hash includes the tag bytes (28 bytes per record). Tampering with any of the five fields after seal breaks the chain hash; the reference collector at crates/grafos-audit-collector catches this on ingest.
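
Two of these failure modes, the all-zeros tag and the overflowing rate, can be detected locally before a record reaches a roll-up. The following is a hedged sketch of such a pre-check. Tag, ZERO, Warning, and check_record are hypothetical names for illustration and are not part of the recipe or of grafos_core. It uses checked_mul to notice the overflow and then clamps to u64::MAX, which gives the same total as a saturating multiply.

```rust
// Local stand-in for the tag; not the grafos_core type.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct Tag {
    tenant: u64,
    workload: u64,
    request_class: u32,
    budget_group: u32,
    flags: u32,
}

// Stand-in for AccountingTag::ZERO.
const ZERO: Tag = Tag { tenant: 0, workload: 0, request_class: 0, budget_group: 0, flags: 0 };

// Hypothetical warning type for the local pre-check.
#[derive(Debug, PartialEq, Eq)]
enum Warning {
    Untagged,
    CostOverflow,
}

/// Returns the (possibly clamped) cost plus any warnings for this record.
fn check_record(tag: Tag, seconds_held: u64, rate_micros_per_sec: u64) -> (u64, Vec<Warning>) {
    let mut warnings = Vec::new();
    if tag == ZERO {
        // This lease would aggregate into the all-zeros bucket.
        warnings.push(Warning::Untagged);
    }
    let total = match seconds_held.checked_mul(rate_micros_per_sec) {
        Some(v) => v,
        None => {
            // Same value a saturating multiply would produce.
            warnings.push(Warning::CostOverflow);
            u64::MAX
        }
    };
    (total, warnings)
}

fn main() {
    let tagged = Tag { tenant: 1, workload: 42, request_class: 1, budget_group: 100, flags: 0 };
    println!("{:?}", check_record(tagged, 3600, 1_000));
    println!("{:?}", check_record(ZERO, 60, 1_000));
    println!("{:?}", check_record(tagged, 2, u64::MAX));
}
```

Flagging the overflow explicitly, rather than comparing the total to u64::MAX afterwards, keeps the check unambiguous in the unlikely case that a product lands exactly on u64::MAX without overflowing.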

Tests

Run it with:

cargo test -p cookbook-recipe-59-cost-attribution-tags

Six tests cover budget-group aggregation across request classes, budget+class split rollup, synthetic-probe flag predicates, canonical-bytes round-trip for AccountingTag (proves the recipe matches what audit-chain canonical bytes carry), rollup stability under record reordering, and saturating arithmetic on pathological rates.

Adaptation Notes

  • Currency: this recipe uses tenant-defined “micro-units per second” so it doesn’t pick a currency. Production callers wire the per-second rate from grafos admin fair-share-policy get / the scheduler’s rate-card so tenant and operator agree.
  • Field semantics: request_class and budget_group are policy-defined u32s. Pick a convention (e.g. request_class = 1 for realtime) and document it in the tenant’s own runbook; the scheduler treats both as opaque identifiers.
  • Flag bits beyond bit 0: the flags field is policy-defined except for bit 0 (“synthetic / probe edge”). Tenants may define additional bits (bit 1 = “speculative cancel-on-loss”, bit 2 = “tenant-internal idle pre-warm”, etc.) — keep the meaning stable, never reuse a retired bit.
  • Cross-tenant rollup: not allowed at the tenant API surface. A tenant rolling up only sees its own tagged leases. Operator-side rollups across tenants are visible through the billing endpoints (/api/v1/billing/invoice) which carry the same AccountingTag shape.
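
For the flag-bit note above, one way to keep meanings stable is a small registry of named bits with a mask of the ones currently in use. The sketch below is an assumption-laden illustration: only bit 0 is a convention from this recipe; FLAG_SPECULATIVE and FLAG_IDLE_PREWARM mirror the example bits mentioned above, and FLAG_RETIRED_LEGACY, ACTIVE_FLAGS, and unassigned_bits are hypothetical names.

```rust
/// Bit 0: reserved by convention for synthetic / probe traffic.
const FLAG_SYNTHETIC_PROBE: u32 = 1 << 0;
// Hypothetical tenant-defined bits, following the examples in the note above.
const FLAG_SPECULATIVE: u32 = 1 << 1;
const FLAG_IDLE_PREWARM: u32 = 1 << 2;
// Hypothetical retired bit: it stays in the registry so it is never reassigned.
const FLAG_RETIRED_LEGACY: u32 = 1 << 3;

/// Bits that currently carry a meaning in this (hypothetical) tenant policy.
const ACTIVE_FLAGS: u32 = FLAG_SYNTHETIC_PROBE | FLAG_SPECULATIVE | FLAG_IDLE_PREWARM;

/// Returns the set bits that are retired or unknown; 0 means all bits are active.
fn unassigned_bits(flags: u32) -> u32 {
    flags & !ACTIVE_FLAGS
}

fn main() {
    let flags = FLAG_SYNTHETIC_PROBE | FLAG_RETIRED_LEGACY;
    let bad = unassigned_bits(flags);
    if bad != 0 {
        println!("flags {:#06b} use unassigned bits {:#06b}", flags, bad);
    }
}
```

Running a check like this where tags are built makes a reused or mistyped bit show up at the call site instead of in a roll-up weeks later.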

See also:

  • crates/grafos-core/src/accounting.rs: AccountingTag, canonical encode/decode.
  • docs/operations/scheduler-features.md § “Accounting tags • project scopes”.
  • docs/spec/audit-chain-canonical-bytes.md — wire shape for the audit-chain bytes that carry these tags.
  • Recipe 57 (Per-Project Fair-Share Policy) — pairs naturally with this one when tagging is keyed off a project_id used in both surfaces.