Recipe 69: Per-Request Audit Attribution for Inference

Situation

When inference is a fabric resource, every emitted token costs something: GPU time, KV memory, electricity. For multi-tenant production serving, you need to attribute that cost back to a specific request and tenant on an auditable channel: one that survives independently of the telemetry pipeline, that operators can verify after the fact, and that a downstream billing or compliance system can join against tenant identity.

Without auditable attribution, billing, compliance, and abuse detection are all impossible.

What You Build

A per-request audit emission that lands one AuditEventKind::InferenceRequestCompleted record on the existing hash-chained audit chain at the end of every inference request. Each record carries the tenant id, the application-side request id, the total tokens emitted, the total compute wall-clock time in milliseconds, and a typed InferenceCompletionReason (end-of-sequence, max tokens reached, engine error, cancelled, preempted). Downstream consumers query the chain on kind=inference_request_completed and derive per-tenant aggregates without joining against a separate telemetry store.

This recipe uses the same hash-chained audit infrastructure that already records lease lifecycle, admission decisions, preemptions, and tenant CRUD. Per-token attribution (one event per emitted token) would be too high-frequency for the audit chain shape; that channel belongs in the observe layer (see Variations).

Building Blocks

grafos_audit::AuditRecord — the per-event chain record. Carries kind, identity, event_data, monotonic sequence, timestamp_unix, prev_event_hash, and current_event_hash.
grafos_audit::AuditEventData::InferenceRequestCompleted — the typed structured payload.
grafos_core::AuditEventKind::InferenceRequestCompleted — the discriminator on the record’s kind field.
grafos_core::InferenceCompletionReason — the typed enum for the terminal reason (EndOfSequence | MaxTokensReached | EngineError | Cancelled | Preempted).
grafos_core::WorkloadIdentity — the standard identity field on every audit record. Carries the tenant identifier so chain consumers can group / filter without consulting the structured payload.
grafos_audit::assemble_record + an AnchorStore — produces a fully-chained AuditRecord from caller-supplied input.
grafos_audit::verify_chain — re-validates a captured chain end-to-end against an anchor. Detects tampering.
grafos_audit::canonical_bytes — the wire-format serializer that the chain hashes over. Stable across minor versions; new variants APPEND new bytes without shifting existing ones.


Design

Resource Model

The audit chain is one append-only stream per producer (typically one per scheduler / engine pair) with the chain-linkage invariant: record[N+1].prev_event_hash == record[N].current_event_hash. Inference attribution slots in alongside the existing admission / lease / preemption events on the SAME chain — the scheduler’s SchedulerAuditChain shares one chain across both admission and preemption producers, and inference completion joins that stream as a third producer.
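The linkage invariant can be checked mechanically. The following is a minimal sketch rather than the grafos_audit implementation: LinkedRecord and first_broken_link are hypothetical names, and the 32-byte hash width is an assumption made for illustration. It checks adjacency only; recomputing each current_event_hash from the canonical bytes is a separate step.

```rust
/// Hypothetical, simplified stand-in for an audit record: only the two
/// hash fields that the linkage invariant refers to.
/// ASSUMPTION: hashes are 32 bytes wide.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct LinkedRecord {
    pub prev_event_hash: [u8; 32],
    pub current_event_hash: [u8; 32],
}

/// Returns the index of the first record whose `prev_event_hash` does not
/// equal its predecessor's `current_event_hash`, or `None` when every
/// adjacent pair is linked.
pub fn first_broken_link(records: &[LinkedRecord]) -> Option<usize> {
    records
        .windows(2)
        .position(|pair| pair[1].prev_event_hash != pair[0].current_event_hash)
        .map(|i| i + 1) // report the index of the later record in the pair
}
```

Because every producer appends to the same stream, a record dropped or reordered between an admission event and an inference completion shows up as a broken link at the record that follows the gap.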

Event Cardinality

One record per completed inference request. Not per emitted token. This is deliberate:

A token-rate audit chain explodes: hundreds of tokens per second per active request, multiplied by N concurrent requests. Each record’s hash depends on its predecessor’s, so emits on one chain are serialized and the chain’s emit rate is bounded.
The per-token DETAIL belongs in the observe layer (FabricEvent / FabricMetrics), where high-rate telemetry is the design point.
The AUDIT layer’s contract is “what happened” (the per-request outcome), not “every kernel launch that happened on the way.”

If your application needs per-token attribution for billing or debugging, see the Variations section for the observe-layer pattern.
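To make the cardinality argument concrete, the sketch below compares the two emit rates. The function names are hypothetical, and the figures in the usage note are illustrative assumptions, not measurements. The per-request rate uses the steady-state relation that completions per second equal requests in flight divided by mean request duration.

```rust
/// Audit records per second if every emitted token produced a record.
pub fn per_token_emit_rate(active_requests: u32, tokens_per_sec_per_request: u32) -> u64 {
    u64::from(active_requests) * u64::from(tokens_per_sec_per_request)
}

/// Audit records per second with one record per completed request,
/// assuming a steady state where completions balance admissions.
pub fn per_request_emit_rate(active_requests: u32, mean_request_secs: f64) -> f64 {
    assert!(mean_request_secs > 0.0, "mean request duration must be positive");
    f64::from(active_requests) / mean_request_secs
}
```

With an assumed 64 concurrent requests at 200 tokens per second each and a 10-second mean request duration, the per-token design would emit 12,800 records per second, while the per-request design emits about 6.4.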

Hash-Chain Tamper-Evidence

The variant’s canonical_bytes serialization is:

0x0c                                       (variant tag)
[u32-len + utf8] tenant_id
[u32-len + utf8] request_id
[u32 big-endian] total_tokens
[u64 big-endian] compute_ms
[u32-len + utf8] reason.as_str()           (snake_case typed name)

Each field enters the canonical bytes that current_event_hash covers. Tampering with the tenant id, request id, token count, compute time, OR the reason name breaks the hash AND breaks chain linkage at the next record. A SIEM aggregating per-tenant totals can trust the numbers because flipping any byte breaks verify_chain.
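The layout above can be sketched as a small encoder. This is an illustration under stated assumptions, not the real canonical_bytes: encode_inference_completed and put_str are hypothetical names, and the byte order of the u32 length prefix is assumed to be big-endian to match the integer fields.

```rust
/// Variant tag for InferenceRequestCompleted, as listed in the layout.
pub const EVENT_DATA_TAG_INFERENCE_REQUEST_COMPLETED: u8 = 0x0c;

/// Appends a u32 length prefix followed by the UTF-8 bytes of `s`.
/// ASSUMPTION: the length prefix is big-endian, like the integer fields.
fn put_str(out: &mut Vec<u8>, s: &str) {
    let len = u32::try_from(s.len()).expect("field exceeds u32::MAX bytes");
    out.extend_from_slice(&len.to_be_bytes());
    out.extend_from_slice(s.as_bytes());
}

/// Hypothetical encoder for the payload layout shown in this section.
pub fn encode_inference_completed(
    tenant_id: &str,
    request_id: &str,
    total_tokens: u32,
    compute_ms: u64,
    reason: &str,
) -> Vec<u8> {
    let mut out = vec![EVENT_DATA_TAG_INFERENCE_REQUEST_COMPLETED];
    put_str(&mut out, tenant_id);
    put_str(&mut out, request_id);
    out.extend_from_slice(&total_tokens.to_be_bytes());
    out.extend_from_slice(&compute_ms.to_be_bytes());
    put_str(&mut out, reason);
    out
}
```

The length prefixes are what keep adjacent string fields unambiguous: a tenant_id of "ab" with a request_id of "c" encodes differently from "a" with "bc", so bytes cannot be shifted between fields without changing the hashed input.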

Isolation and Safety

The audit record’s identity field carries the tenant. The structured payload’s tenant_id is redundant by design — chain consumers can group on identity.tenant without parsing the payload, and the payload’s tenant_id is the authoritative marker for cross-system correlation.
The InferenceCompletionReason is typed. Adding a new variant on the grafos-core side propagates through the canonical bytes via the variant’s snake_case as_str() name, so a chain reader can match unknown variants safely without breaking verify_chain.
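A chain reader can tolerate reason names it does not recognize by parsing into a type with an explicit unknown case. The sketch below is a hypothetical mirror of the enum, not the grafos_core definition: Reason, ParsedReason, and parse_reason are illustrative names, and apart from "engine_error" the snake_case strings are assumed from the naming convention.

```rust
/// Hypothetical mirror of the terminal-reason enum.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Reason {
    EndOfSequence,
    MaxTokensReached,
    EngineError,
    Cancelled,
    Preempted,
}

impl Reason {
    /// Snake_case name as it would appear in the canonical bytes.
    /// ASSUMPTION: names other than "engine_error" follow the same convention.
    pub const fn as_str(self) -> &'static str {
        match self {
            Reason::EndOfSequence => "end_of_sequence",
            Reason::MaxTokensReached => "max_tokens_reached",
            Reason::EngineError => "engine_error",
            Reason::Cancelled => "cancelled",
            Reason::Preempted => "preempted",
        }
    }
}

/// Reader-side result: a known reason, or the raw name of one added later.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ParsedReason {
    Known(Reason),
    Unknown(String),
}

/// Parses a reason name without failing on variants the reader predates.
pub fn parse_reason(name: &str) -> ParsedReason {
    match name {
        "end_of_sequence" => ParsedReason::Known(Reason::EndOfSequence),
        "max_tokens_reached" => ParsedReason::Known(Reason::MaxTokensReached),
        "engine_error" => ParsedReason::Known(Reason::EngineError),
        "cancelled" => ParsedReason::Known(Reason::Cancelled),
        "preempted" => ParsedReason::Known(Reason::Preempted),
        other => ParsedReason::Unknown(other.to_string()),
    }
}
```

Keeping the raw string in the unknown case lets an older reader still count and report a newer reason instead of discarding the record.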

Walkthrough (Implementation Sketch)

1. Emit on Request Completion

In the serving harness around ContinuousBatchScheduler (Recipe 68), catch the SchedulerEvent::Completed { request, reason } event and translate it into an audit emission:

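use std::time::Duration;

use grafos_audit::{AuditEventData, AuditRecordInput, assemble_record};
use grafos_core::{
    AuditEventKind, InferenceCompletionReason, WorkloadIdentity,
};
use grafos_inference_engine::continuous_batch::{
    CompletionReason, SchedulerEvent,
};
// SchedulerAuditChain, RequestId, and AuditError are assumed to be in scope;
// this recipe does not show their module paths.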

fn audit_inference_completion(
    chain: &mut SchedulerAuditChain,
    request_id: RequestId,
    reason: CompletionReason,
    tenant: &str,
    total_tokens: u32,
    compute: Duration,
) -> Result<(), AuditError> {
    let audit_reason = match reason {
        CompletionReason::EndOfSequence    => InferenceCompletionReason::EndOfSequence,
        CompletionReason::MaxTokensReached => InferenceCompletionReason::MaxTokensReached,
        CompletionReason::EngineError(_)   => InferenceCompletionReason::EngineError,
        CompletionReason::Cancelled        => InferenceCompletionReason::Cancelled,
        CompletionReason::Preempted        => InferenceCompletionReason::Preempted,
    };

    let identity = WorkloadIdentity::new(tenant)
        .with_instance_id(format!("inference:request_completed:{}", request_id));

    let input = AuditRecordInput {
        kind: AuditEventKind::InferenceRequestCompleted,
        identity,
        event_data: Some(AuditEventData::InferenceRequestCompleted {
            tenant_id: tenant.to_string(),
            request_id: request_id.to_string(),
            total_tokens,
            // as_millis() returns u128; saturate instead of silently truncating.
            compute_ms: u64::try_from(compute.as_millis()).unwrap_or(u64::MAX),
            reason: audit_reason,
        }),
        // ... other AuditRecordInput fields
    };

    chain.emit(input)?;
    Ok(())
}

2. Drive From the Scheduler Tick Loop

let events = scheduler.step().await?;
for event in events {
    match event {
        SchedulerEvent::Token { request, .. } => { /* per-token telemetry path */ }
        SchedulerEvent::Completed { request, reason } => {
            let ctx = registry.context_for(request);
            audit_inference_completion(
                &mut audit_chain,
                request,
                reason,
                &ctx.tenant_id,
                ctx.total_tokens(),
                ctx.compute_elapsed(),
            )?;
        }
        SchedulerEvent::AdmissionRejected { .. } => { /* admit-time path */ }
    }
}

3. Query the Chain Per Tenant

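use std::collections::HashMap;

use grafos_audit::{AuditRecord, AuditEventData, verify_chain};
use grafos_core::AuditEventKind;
// TenantInferenceSummary is assumed to be a plain struct holding the three
// aggregates returned below.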

fn tenant_inference_summary(
    records: &[AuditRecord],
    tenant: &str,
) -> TenantInferenceSummary {
    let mut total_tokens: u64 = 0;
    let mut total_compute_ms: u64 = 0;
    let mut by_reason: HashMap<String, u64> = HashMap::new();

    for rec in records.iter()
        .filter(|r| r.kind == AuditEventKind::InferenceRequestCompleted)
        .filter(|r| r.identity.tenant == tenant)
    {
        if let Some(AuditEventData::InferenceRequestCompleted {
            total_tokens: tokens,
            compute_ms,
            reason,
            ..
        }) = &rec.event_data
        {
            total_tokens += u64::from(*tokens);
            total_compute_ms += *compute_ms;
            *by_reason.entry(reason.as_str().to_string()).or_default() += 1;
        }
    }

    TenantInferenceSummary { total_tokens, total_compute_ms, by_reason }
}

Call verify_chain(records, anchor) before any aggregation to confirm the chain is intact. If it returns an error, the aggregates cannot be trusted until the integrity issue is resolved.
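The verify-then-aggregate ordering can be sketched with a toy chain. Everything here is hypothetical and self-contained: ToyRecord, append, verify_toy_chain, and verified_token_total are illustrative names, the anchor is modeled as a plain u64, and FNV-1a stands in for the real hash only to avoid external dependencies. It is not cryptographic and would not be suitable for an actual audit chain.

```rust
/// Toy record: a token count plus the two chain hashes.
#[derive(Debug, Clone)]
pub struct ToyRecord {
    pub total_tokens: u32,
    pub prev_event_hash: u64,
    pub current_event_hash: u64,
}

/// FNV-1a, a dependency-free stand-in. NOT cryptographic.
fn fnv1a(bytes: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf2_9ce4_8422_2325;
    for &b in bytes {
        h ^= u64::from(b);
        h = h.wrapping_mul(0x0000_0100_0000_01b3);
    }
    h
}

/// Hash over the previous hash followed by the payload bytes.
fn toy_hash(prev: u64, total_tokens: u32) -> u64 {
    let mut buf = Vec::with_capacity(12);
    buf.extend_from_slice(&prev.to_be_bytes());
    buf.extend_from_slice(&total_tokens.to_be_bytes());
    fnv1a(&buf)
}

/// Appends a record linked to the chain tail, or to the anchor if empty.
pub fn append(chain: &mut Vec<ToyRecord>, anchor: u64, total_tokens: u32) {
    let prev = chain.last().map_or(anchor, |r| r.current_event_hash);
    chain.push(ToyRecord {
        total_tokens,
        prev_event_hash: prev,
        current_event_hash: toy_hash(prev, total_tokens),
    });
}

/// Returns Err(index) for the first record failing linkage or hash checks.
pub fn verify_toy_chain(records: &[ToyRecord], anchor: u64) -> Result<(), usize> {
    let mut expected_prev = anchor;
    for (i, r) in records.iter().enumerate() {
        let hash_ok = r.current_event_hash == toy_hash(r.prev_event_hash, r.total_tokens);
        if r.prev_event_hash != expected_prev || !hash_ok {
            return Err(i);
        }
        expected_prev = r.current_event_hash;
    }
    Ok(())
}

/// Aggregates only after verification succeeds.
pub fn verified_token_total(records: &[ToyRecord], anchor: u64) -> Result<u64, usize> {
    verify_toy_chain(records, anchor)?;
    Ok(records.iter().map(|r| u64::from(r.total_tokens)).sum())
}
```

Putting the verification call inside the aggregation function, rather than leaving it to the caller, makes it hard to produce a total from an unverified chain by accident.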

Verification

Hash-chain tamper-evidence

The variant ships with two unit tests in grafos-audit:

audit_event_data_variants_produce_distinct_hashes: asserts the InferenceRequestCompleted variant produces a chain hash distinct from every other AuditEventData variant under the same identity / timestamp / sequence. This shows that the variant’s canonical bytes feed into the hash, which is the property that lets a tampered tenant id be detected rather than pass as a clean record.
audit_event_data_variant_tags_are_stable: pins EVENT_DATA_TAG_INFERENCE_REQUEST_COMPLETED = 0x0c. Changing this value is a wire-format break that requires bumping CANONICAL_VERSION.

Run them:

cargo test -p grafos-audit --features serde --lib audit_event_data

Metric exposure

FabricMetrics::audit_inference_request_completed is incremented for every emit, surfaced through count_audit_emit, and rendered in the Prometheus exposition as grafos_audit_records_inference_request_completed_total. Run:

cargo test -p grafos-observe --lib audit

Failure Modes

Late or missed emit. If the harness’s Completed handler panics before the audit emit lands, the request is unaccounted for. The chain stays internally consistent; the inference request just isn’t represented. Mitigation: emit BEFORE acknowledging the application-side completion, so that no caller-visible side effect exists without a matching audit record.
Clock skew across producers. When inference spans cells (Recipe 39), each cell’s audit chain stamps its own wall clock, so timestamps from different cells are not directly comparable. Within a chain, order records by the prev_event_hash linkage (logical order), not by timestamp. Operators must not rely on wall-clock time for cross-cell ordering.
Lease re-issue with same identifier. Some broker implementations reuse lease_id across re-issues; this conflates two distinct leases in the chain. The audit chain carries the application’s request_id separately from the lease id specifically because of this — operators query on request_id for unambiguous per-request attribution.
Privacy leak through the payload. The payload includes the tenant id and request id but NOT the prompt or the emitted tokens. If your environment requires the prompt-hash to be on the chain too (e.g. for content audit), extend the variant before relying on the chain — adding fields is a wire-format break that requires bumping CANONICAL_VERSION and a migration drill.

Observability

The chain itself IS the observability surface. Operator queries match the existing Recipe 55: Consuming the Audit Chain pattern:

“All completed requests from tenant 47 in the last hour” — scan for kind=inference_request_completed, filter identity.tenant == "47", filter on timestamp range.
“Total compute cost for tenant 47” — sum event_data.compute_ms over the matching records.
“Tenants whose request error rate jumped” — bucket event_data.reason == "engine_error" per tenant per window.
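The third query can be sketched as a bucketing pass over records that have already been filtered to kind=inference_request_completed. CompletionRow and engine_error_buckets are hypothetical names; the row is a flattened projection of the fields the query needs, not the AuditRecord type itself.

```rust
use std::collections::BTreeMap;

/// Hypothetical flattened view of one completion record.
#[derive(Debug, Clone)]
pub struct CompletionRow {
    pub tenant: String,
    pub timestamp_unix: u64,
    pub reason: String,
}

/// Maps (tenant, window start) to (engine_error count, total completions).
pub fn engine_error_buckets(
    rows: &[CompletionRow],
    window_secs: u64,
) -> BTreeMap<(String, u64), (u64, u64)> {
    assert!(window_secs > 0, "window must be non-zero");
    let mut buckets = BTreeMap::new();
    for row in rows {
        // Align the timestamp down to the start of its window.
        let window_start = row.timestamp_unix - row.timestamp_unix % window_secs;
        let entry = buckets
            .entry((row.tenant.clone(), window_start))
            .or_insert((0u64, 0u64));
        if row.reason == "engine_error" {
            entry.0 += 1;
        }
        entry.1 += 1;
    }
    buckets
}
```

Keeping the total alongside the error count lets a dashboard compare error rates between consecutive windows for the same tenant, which is what detecting a jump requires.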

The Prometheus counter grafos_audit_records_inference_request_completed_total provides the emit-rate signal for scrape-based dashboards (Recipe 60).

Variations

Per-token observability (not audit). Per-token attribution belongs in the observe layer, not the audit chain. A custom FabricEvent variant emitted per SchedulerEvent::Token is the right pattern — the chain stays bounded to per-request outcomes; high-rate telemetry uses the observe pipeline.
Cryptographic signing. grafos-audit ships an Ed25519Signer that signs each record. Enable it on the scheduler’s audit chain so the records carry signatures verifiable by external auditors who do not trust the storage provider. Costs one Ed25519 sign per emit.
Streaming to a separate audit cell. The chain itself is storage-agnostic — write the JSONL stream to a fabric-leased storage volume on a dedicated audit cell (Recipe 36’s stateful KV pattern adapted to append-only event storage). Inference cells emit; the audit cell holds; operator dashboards consume.
Per-token billing meter. A derived BillingPulse event emitted per N tokens (where N is large enough to keep the audit cardinality bounded) gives finer-grain billing without destroying the chain’s emit-rate budget. Pair with a pricing function applied at aggregation time.
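The per-N-tokens meter from the last variation can be sketched as a small counter. PulseMeter and its methods are hypothetical, and the BillingPulse struct here is only an illustrative shape for the derived event named above; the recipe does not define its fields.

```rust
/// Illustrative shape for the derived event: how many tokens it covers.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct BillingPulse {
    pub tokens: u32,
}

/// Counts tokens for one request and yields a pulse every `pulse_every`.
#[derive(Debug)]
pub struct PulseMeter {
    pulse_every: u32,
    pending: u32,
}

impl PulseMeter {
    pub fn new(pulse_every: u32) -> Self {
        assert!(pulse_every > 0, "pulse interval must be non-zero");
        Self { pulse_every, pending: 0 }
    }

    /// Call once per emitted token; returns a pulse on every Nth token.
    pub fn on_token(&mut self) -> Option<BillingPulse> {
        self.pending += 1;
        if self.pending == self.pulse_every {
            self.pending = 0;
            Some(BillingPulse { tokens: self.pulse_every })
        } else {
            None
        }
    }

    /// Call at request completion to flush tokens not yet billed.
    pub fn finish(&mut self) -> Option<BillingPulse> {
        if self.pending == 0 {
            return None;
        }
        let tokens = std::mem::take(&mut self.pending);
        Some(BillingPulse { tokens })
    }
}
```

With N = 4 and a 10-token request, the meter yields two pulses of 4 tokens during generation and a final pulse of 2 from finish, so the pulse totals match the request's token count.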

Why This Is Recipe 69

Recipes 61-68 establish the inference primitives. Recipe 69 establishes the attribution primitive that makes those primitives operationally trustworthy. Without per-request audit attribution, fabric-leased inference can’t be monetized, audited, or operationally observed; with it, every completed request is provably traceable to a tenant, the totals are tamper-evident, and the chain doubles as a billing ledger.