Recipe 69: Per-Request Audit Attribution for Inference
Situation
When inference is a fabric resource, every emitted token costs something: GPU time, KV memory, electricity. For multi-tenant production serving, you need to attribute that cost back to a specific request and tenant on an auditable channel — one that survives independent of the telemetry pipeline, that operators can verify after the fact, and that a downstream billing or compliance system can join against tenant identity.
Without auditable attribution: billing is impossible, compliance is impossible, abuse detection is impossible.
What You Build
A per-request audit emission that lands one
AuditEventKind::InferenceRequestCompleted record on the existing
hash-chained audit chain at the end of every inference request.
Each record carries the tenant id, the application-side request
id, the total tokens emitted, the total compute wall-clock, and
a typed InferenceCompletionReason (end-of-sequence, max tokens
reached, engine error, cancelled, preempted). Downstream consumers
query the chain on kind=inference_request_completed and derive
per-tenant aggregates without joining against a separate
telemetry store.
This recipe uses the same hash-chained audit infrastructure that already records lease lifecycle, admission decisions, preemptions, and tenant CRUD. Per-token attribution (one event per emitted token) would be too high-frequency for the audit chain shape; that channel belongs in the observe layer (see Variations).
Building Blocks
- grafos_audit::AuditRecord: the per-event chain record. Carries kind, identity, event_data, monotonic sequence, timestamp_unix, prev_event_hash, and current_event_hash.
- grafos_audit::AuditEventData::InferenceRequestCompleted: the typed structured payload.
- grafos_core::AuditEventKind::InferenceRequestCompleted: the discriminator on the record’s kind field.
- grafos_core::InferenceCompletionReason: the typed enum for the terminal reason (EndOfSequence | MaxTokensReached | EngineError | Cancelled | Preempted).
- grafos_core::WorkloadIdentity: the standard identity field on every audit record. Carries the tenant identifier so chain consumers can group / filter without consulting the structured payload.
- grafos_audit::assemble_record + an AnchorStore: produces a fully-chained AuditRecord from caller-supplied input.
- grafos_audit::verify_chain: re-validates a captured chain end-to-end against an anchor. Detects tampering.
- grafos_audit::canonical_bytes: the wire-format serializer that the chain hashes over. Stable across minor versions; new variants APPEND new bytes without shifting existing ones.
See:
- grafos-audit lib (source)
- InferenceCompletionReason (source)
- Recipe 55 — consuming the audit chain
- Recipe 60 — tenant audit dashboard
- Recipe 68 — continuous batching
Design
Resource Model
The audit chain is one append-only stream per producer (typically
one per scheduler / engine pair) with the chain-linkage invariant:
record[N+1].prev_event_hash == record[N].current_event_hash.
Inference attribution slots in alongside the existing
admission / lease / preemption events on the SAME chain — the
scheduler’s SchedulerAuditChain shares one chain across both
admission and preemption producers, and inference completion
joins that stream as a third producer.
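The linkage invariant can be checked mechanically. The following is a minimal sketch, not the grafos-audit implementation: MiniRecord is a hypothetical stand-in for AuditRecord, its hashes are plain u64 placeholders rather than real digests, and first_broken_link is an illustrative helper name. It checks linkage only; a real verifier would presumably also recompute each record's hash.

```rust
/// Hypothetical stand-in for an audit record. Only the linkage fields are
/// modelled, and hashes are u64 placeholders (assumption, not real digests).
#[derive(Debug, Clone, PartialEq)]
pub struct MiniRecord {
    pub sequence: u64,
    pub prev_event_hash: u64,
    pub current_event_hash: u64,
}

/// Returns the index of the first record whose `prev_event_hash` does not
/// equal its predecessor's `current_event_hash`, or `None` when every
/// adjacent pair satisfies the invariant.
pub fn first_broken_link(records: &[MiniRecord]) -> Option<usize> {
    records
        .windows(2)
        .position(|pair| pair[1].prev_event_hash != pair[0].current_event_hash)
        .map(|i| i + 1)
}
```

Because every producer on the shared chain appends to the same stream, a single pass like this covers admission, preemption, and inference-completion records alike.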
Event Cardinality
One record per completed inference request. Not per emitted token. This is deliberate:
- A token-rate audit chain explodes (hundreds of tokens per second per active request × N requests). The hash chain’s prev_event_hash linkage doesn’t parallelize; emit-rate is bounded.
- The per-token DETAIL belongs in the observe layer (FabricEvent / FabricMetrics), where high-rate telemetry is the design point.
- The AUDIT layer’s contract is “what happened” (the per-request outcome), not “every kernel launch that happened on the way.”
If your application needs per-token attribution for billing or debugging, see the Variations section for the observe-layer pattern.
Hash-Chain Tamper-Evidence
The variant’s canonical_bytes serialization is:

    0x0c (variant tag)
    [u32-len + utf8] tenant_id
    [u32-len + utf8] request_id
    [u32 big-endian] total_tokens
    [u64 big-endian] compute_ms
    [u32-len + utf8] reason.as_str() (snake_case typed name)

Each field enters the canonical bytes that current_event_hash
covers. Tampering with the tenant id, request id, token count,
compute time, OR the reason name breaks the hash AND breaks
chain linkage at the next record. A SIEM aggregating per-tenant
totals can trust the numbers because flipping any byte breaks
verify_chain.
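To make the layout concrete, here is a minimal sketch of an encoder that follows the field order listed above. It is not the real grafos_audit::canonical_bytes: encode_payload and put_str are hypothetical names, and the byte order of the u32 length prefix is assumed to be big-endian because the layout does not state it.

```rust
/// Variant tag value taken from the layout in the text.
pub const TAG_INFERENCE_REQUEST_COMPLETED: u8 = 0x0c;

/// Appends a length-prefixed UTF-8 string.
/// ASSUMPTION: the u32 length prefix is big-endian, like the numeric fields.
fn put_str(out: &mut Vec<u8>, s: &str) {
    out.extend_from_slice(&(s.len() as u32).to_be_bytes());
    out.extend_from_slice(s.as_bytes());
}

/// Illustrative encoder for the InferenceRequestCompleted payload layout.
pub fn encode_payload(
    tenant_id: &str,
    request_id: &str,
    total_tokens: u32,
    compute_ms: u64,
    reason: &str,
) -> Vec<u8> {
    let mut out = vec![TAG_INFERENCE_REQUEST_COMPLETED];
    put_str(&mut out, tenant_id);
    put_str(&mut out, request_id);
    out.extend_from_slice(&total_tokens.to_be_bytes());
    out.extend_from_slice(&compute_ms.to_be_bytes());
    put_str(&mut out, reason);
    out
}
```

The length prefixes matter for tamper-evidence: without them, the pairs ("ab", "c") and ("a", "bc") would serialize to the same bytes and therefore the same hash.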
Isolation and Safety
- The audit record’s identity field carries the tenant. The structured payload’s tenant_id is redundant by design: chain consumers can group on identity.tenant without parsing the payload, and the payload’s tenant_id is the authoritative marker for cross-system correlation.
- The InferenceCompletionReason is typed. Adding a new variant on the grafos-core side propagates through the canonical bytes via the variant’s snake_case as_str() name, so a chain reader can match unknown variants safely without breaking verify_chain.
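On the reader side, tolerating unknown variants can look like the sketch below. ReasonView and parse_reason are hypothetical names, not part of grafos-core. Only "engine_error" appears verbatim in this recipe; the other snake_case strings are assumed to follow directly from the variant names.

```rust
/// Reader-side view of the terminal reason. `Unknown` keeps the raw name so
/// a variant added after this reader was built is preserved, not dropped.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ReasonView {
    EndOfSequence,
    MaxTokensReached,
    EngineError,
    Cancelled,
    Preempted,
    Unknown(String),
}

/// Maps a snake_case reason name to the reader-side view.
/// ASSUMPTION: names other than "engine_error" are inferred from the
/// variant names and may differ from what `as_str()` really returns.
pub fn parse_reason(name: &str) -> ReasonView {
    match name {
        "end_of_sequence" => ReasonView::EndOfSequence,
        "max_tokens_reached" => ReasonView::MaxTokensReached,
        "engine_error" => ReasonView::EngineError,
        "cancelled" => ReasonView::Cancelled,
        "preempted" => ReasonView::Preempted,
        other => ReasonView::Unknown(other.to_string()),
    }
}
```

An aggregator built this way can still count records carrying a reason it does not recognize, instead of failing or silently skipping them.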
Walkthrough (Implementation Sketch)
1. Emit on Request Completion
In the serving harness around ContinuousBatchScheduler (Recipe
68), catch the SchedulerEvent::Completed { request, reason }
event and translate it into an audit emission:

    use std::time::Duration;

    use grafos_audit::{AuditEventData, AuditRecordInput, assemble_record};
    use grafos_core::{
        AuditEventKind, InferenceCompletionReason, WorkloadIdentity,
    };
    use grafos_inference_engine::continuous_batch::{
        CompletionReason, SchedulerEvent,
    };
    // RequestId, SchedulerAuditChain, and AuditError must also be in scope;
    // their import paths are not shown in this sketch.

    fn audit_inference_completion(
        chain: &mut SchedulerAuditChain,
        request_id: RequestId,
        reason: CompletionReason,
        tenant: &str,
        total_tokens: u32,
        compute: Duration,
    ) -> Result<(), AuditError> {
        // Translate the scheduler's completion reason into the typed audit reason.
        let audit_reason = match reason {
            CompletionReason::EndOfSequence => InferenceCompletionReason::EndOfSequence,
            CompletionReason::MaxTokensReached => InferenceCompletionReason::MaxTokensReached,
            CompletionReason::EngineError(_) => InferenceCompletionReason::EngineError,
            CompletionReason::Cancelled => InferenceCompletionReason::Cancelled,
            CompletionReason::Preempted => InferenceCompletionReason::Preempted,
        };

        // The identity carries the tenant so consumers can filter without
        // parsing the payload.
        let identity = WorkloadIdentity::new(tenant)
            .with_instance_id(format!("inference:request_completed:{}", request_id));

        let input = AuditRecordInput {
            kind: AuditEventKind::InferenceRequestCompleted,
            identity,
            event_data: Some(AuditEventData::InferenceRequestCompleted {
                tenant_id: tenant.to_string(),
                request_id: request_id.to_string(),
                total_tokens,
                compute_ms: compute.as_millis() as u64,
                reason: audit_reason,
            }),
            // ... other AuditRecordInput fields
        };

        chain.emit(input)?;
        Ok(())
    }

2. Drive From the Scheduler Tick Loop

    let events = scheduler.step().await?;
    for event in events {
        match event {
            SchedulerEvent::Token { request, .. } => { /* per-token telemetry path */ }
            SchedulerEvent::Completed { request, reason } => {
                let ctx = registry.context_for(request);
                audit_inference_completion(
                    &mut audit_chain,
                    request,
                    reason,
                    &ctx.tenant_id,
                    ctx.total_tokens(),
                    ctx.compute_elapsed(),
                )?;
            }
            SchedulerEvent::AdmissionRejected { .. } => { /* admit-time path */ }
        }
    }

3. Query the Chain Per Tenant

    use std::collections::HashMap;

    use grafos_audit::{AuditRecord, AuditEventData, verify_chain};
    use grafos_core::AuditEventKind;

    fn tenant_inference_summary(
        records: &[AuditRecord],
        tenant: &str,
    ) -> TenantInferenceSummary {
        let mut total_tokens: u64 = 0;
        let mut total_compute_ms: u64 = 0;
        let mut by_reason: HashMap<String, u64> = HashMap::new();

        // Keep only completed-inference records that belong to this tenant.
        for rec in records.iter()
            .filter(|r| r.kind == AuditEventKind::InferenceRequestCompleted)
            .filter(|r| r.identity.tenant == tenant)
        {
            if let Some(AuditEventData::InferenceRequestCompleted {
                total_tokens: tokens,
                compute_ms,
                reason,
                ..
            }) = &rec.event_data
            {
                total_tokens += *tokens as u64;
                total_compute_ms += *compute_ms;
                *by_reason.entry(reason.as_str().to_string()).or_default() += 1;
            }
        }

        TenantInferenceSummary { total_tokens, total_compute_ms, by_reason }
    }

verify_chain(records, anchor) confirms the chain is intact
before any aggregation — if it returns an error, an aggregation
is meaningless until the integrity issue is resolved.
Verification
Hash-chain tamper-evidence
The variant ships with two unit tests in grafos-audit:
- audit_event_data_variants_produce_distinct_hashes: asserts the InferenceRequestCompleted variant produces a chain hash distinct from every other AuditEventData variant under the same identity / timestamp / sequence. This proves the canonical bytes cover the payload meaningfully (a tampered tenant id cannot collide into a clean record).
- audit_event_data_variant_tags_are_stable: pins EVENT_DATA_TAG_INFERENCE_REQUEST_COMPLETED = 0x0c. Bumping this value is a wire-format break that requires bumping CANONICAL_VERSION.
Run them:

    cargo test -p grafos-audit --features serde --lib audit_event_data

Metric exposure
FabricMetrics::audit_inference_request_completed is incremented
for every emit, surfaced through count_audit_emit, and rendered
in the Prometheus exposition as
grafos_audit_records_inference_request_completed_total. Run:

    cargo test -p grafos-observe --lib audit

Failure Modes
- Late or missed emit. If the harness’s Completed handler panics before the audit emit lands, the request is unaccounted for. The chain stays internally consistent; the inference request just isn’t represented. Mitigation: emit BEFORE acknowledging the application-side completion (atomic with respect to caller-visible side effects).
- Clock skew across producers. When inference spans cells (Recipe 39), each cell’s audit chain stamps its own wall clock. Cross-cell ordering requires the chain’s prev_event_hash linkage (logical), not the wall clock. Operators must not rely on wall-clock for cross-cell ordering.
- Lease re-issue with same identifier. Some broker implementations reuse lease_id across re-issues; this conflates two distinct leases in the chain. The audit chain carries the application’s request_id separately from the lease id specifically because of this; operators query on request_id for unambiguous per-request attribution.
- Privacy leak through the payload. The payload includes the tenant id and request id but NOT the prompt or the emitted tokens. If your environment requires the prompt-hash to be on the chain too (e.g. for content audit), extend the variant before relying on the chain. Adding fields is a wire-format break that requires bumping CANONICAL_VERSION and a migration drill.
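The emit-before-acknowledge mitigation for late or missed emits can be sketched as an ordering constraint. Everything here is hypothetical scaffolding: AuditSink stands in for the scheduler's audit chain, VecSink is an in-memory sink for demonstration, and complete_request is an illustrative name.

```rust
/// Hypothetical audit sink; stands in for the scheduler's audit chain.
pub trait AuditSink {
    fn emit(&mut self, request_id: &str) -> Result<(), String>;
}

/// In-memory sink for demonstration; `fail` forces every emit to error.
pub struct VecSink {
    pub records: Vec<String>,
    pub fail: bool,
}

impl AuditSink for VecSink {
    fn emit(&mut self, request_id: &str) -> Result<(), String> {
        if self.fail {
            return Err(format!("emit failed for {request_id}"));
        }
        self.records.push(request_id.to_string());
        Ok(())
    }
}

/// Emits the audit record first and runs the caller-visible acknowledgement
/// only if the emit succeeded, so no acknowledged request lacks a record.
pub fn complete_request<S, F>(sink: &mut S, request_id: &str, ack: F) -> Result<(), String>
where
    S: AuditSink,
    F: FnOnce(),
{
    sink.emit(request_id)?;
    ack();
    Ok(())
}
```

With this ordering, a failed emit leaves the request unacknowledged rather than acknowledged and unaccounted for, which is the safer of the two failure shapes for billing.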
Observability
The chain itself IS the observability surface. Operator queries
match the existing Recipe 55: Consuming the Audit Chain pattern:
- “All completed requests from tenant 47 in the last hour”: scan for kind=inference_request_completed, filter identity.tenant == "47", filter on timestamp range.
- “Total compute cost for tenant 47”: sum event_data.compute_ms over the matching records.
- “Tenants whose request error rate jumped”: bucket event_data.reason == "engine_error" per tenant per window.
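The error-rate query can be sketched as a bucketing pass. This is a minimal illustration under assumptions: CompletionRow is a hypothetical flattened projection of a completed-request record, and error_buckets and the fixed-width window are illustrative choices, not part of grafos-audit.

```rust
use std::collections::BTreeMap;

/// Hypothetical flattened row projected from a completed-request record.
pub struct CompletionRow {
    pub tenant: String,
    pub timestamp_unix: u64,
    pub reason: String,
}

/// Counts (total requests, engine errors) per (tenant, window start).
/// Windows are fixed-width and aligned to multiples of `window_secs`.
pub fn error_buckets(
    rows: &[CompletionRow],
    window_secs: u64,
) -> BTreeMap<(String, u64), (u64, u64)> {
    assert!(window_secs > 0, "window_secs must be non-zero");
    let mut out = BTreeMap::new();
    for row in rows {
        let window_start = row.timestamp_unix - row.timestamp_unix % window_secs;
        let entry = out
            .entry((row.tenant.clone(), window_start))
            .or_insert((0u64, 0u64));
        entry.0 += 1;
        if row.reason == "engine_error" {
            entry.1 += 1;
        }
    }
    out
}
```

Comparing the error fraction of consecutive windows for the same tenant then surfaces the jump; the threshold for what counts as a jump is left to the operator.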
The Prometheus counter
grafos_audit_records_inference_request_completed_total provides
the emit-rate signal for scrape-based dashboards (Recipe 60).
Variations
- Per-token observability (not audit). Per-token attribution belongs in the observe layer, not the audit chain. A custom FabricEvent variant emitted per SchedulerEvent::Token is the right pattern: the chain stays bounded to per-request outcomes; high-rate telemetry uses the observe pipeline.
- Cryptographic signing. grafos-audit ships an Ed25519Signer that signs each record. Enable it on the scheduler’s audit chain so the records carry signatures verifiable by external auditors who do not trust the storage provider. Costs one Ed25519 sign per emit.
- Streaming to a separate audit cell. The chain itself is storage-agnostic: write the JSONL stream to a fabric-leased storage volume on a dedicated audit cell (Recipe 36’s stateful KV pattern adapted to append-only event storage). Inference cells emit; the audit cell holds; operator dashboards consume.
- Per-token billing meter. A derived BillingPulse event emitted per N tokens (where N is large enough to keep the audit cardinality bounded) gives finer-grain billing without destroying the chain’s emit-rate budget. Pair with a pricing function applied at aggregation time.
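The per-N-token meter from the last variation can be sketched as a small counter. The recipe names BillingPulse but does not define it, so the fields below, the PulseMeter type, and its methods are all assumptions made for illustration.

```rust
/// Hypothetical pulse payload; the fields are assumptions.
#[derive(Debug, PartialEq, Eq)]
pub struct BillingPulse {
    pub request_id: String,
    pub tokens: u32,
}

/// Counts tokens for one request and yields a pulse every `pulse_every`.
pub struct PulseMeter {
    request_id: String,
    pulse_every: u32,
    pending: u32,
}

impl PulseMeter {
    pub fn new(request_id: &str, pulse_every: u32) -> Self {
        assert!(pulse_every > 0, "pulse_every must be non-zero");
        Self { request_id: request_id.to_string(), pulse_every, pending: 0 }
    }

    /// Records one emitted token; returns a pulse on every Nth token.
    pub fn on_token(&mut self) -> Option<BillingPulse> {
        self.pending += 1;
        if self.pending == self.pulse_every {
            self.pending = 0;
            Some(BillingPulse { request_id: self.request_id.clone(), tokens: self.pulse_every })
        } else {
            None
        }
    }

    /// Flushes the remainder at request completion so no tokens go unbilled.
    pub fn finish(self) -> Option<BillingPulse> {
        if self.pending > 0 {
            Some(BillingPulse { request_id: self.request_id, tokens: self.pending })
        } else {
            None
        }
    }
}
```

Summing the tokens across all pulses for a request, including the final flush, should equal the total_tokens on that request's completion record, which gives the aggregator a cross-check.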
Why This Is Recipe 69
Recipes 61-68 establish the inference primitives. Recipe 69 establishes the attribution primitive that makes those primitives operationally trustworthy. Without per-request audit attribution, fabric-leased inference can’t be monetized, audited, or operationally observed; with it, every completed request is provably traceable to a tenant, the totals are tamper-evident, and the chain doubles as a billing ledger.