Recipe 15: Observing Everything (The Meta-Recipe)
Situation
Production failures are rarely “one thing”. In systems with multiple resource types, you usually need:
- CPU / memory / storage / network metrics;
- Causality (what happened around the spike?);
- A durable, tamper-evident record (what was the actual state at the moment the operator paged us?).
In a traditional stack these come from different libraries and different namespaces. In grafOS the lease API is the choke point for resource use, and there are three correlated observability layers above it. Instrument the lease layer once and you observe everything that uses leases.
What You Build
A unified observability surface across three layers, all keyed on the same lease lifecycle events:
| Layer | API | Loss profile | Best for |
|---|---|---|---|
| Metrics | grafos_observe::FabricMetrics | aggregate-only | “is this happening a lot?” |
| Events | grafos_observe::EventRingBuffer + FabricEvent | lossy (bounded ring) | “what was happening around the spike?” |
| Audit chain | grafos_audit::assemble_record + AuditRecord | tamper-evident, durable when anchor persists | “exactly what was sealed at that moment?” |
Every lease lifecycle event hits all three. Recipes 55 (Consuming the Audit Chain) and 60 (Tenant Audit Dashboard) consume the chain this recipe produces.
The compiled recipe lives in cookbook/recipe-15-observing-everything.
Building Blocks
- grafos_observe::{FabricMetrics, EventRingBuffer, FabricEvent, OpType, ResourceType}
- grafos_observe::prometheus::PrometheusExporter (feature-gated)
- grafos_observe::json_log::JsonEventSink (feature-gated)
- grafos_audit::{assemble_record, AnchorStore, AuditInput, AuditRecord, Signer, NullSigner, MemoryAnchorStore}
- grafos_core::{AuditEventKind, WorkloadIdentity}
See:
- grafos-observe README
- grafos-observe guide
- grafos-audit canonical bytes spec
- SIEM vocabulary cookbook
- metrics demo example
- event logging example
- audit-chain demo
Design
Layer 1: Metrics
Aggregate counters and histograms. Fast, lossy, cheap to scrape.
At minimum:
- active leases by resource type
- acquire/drop counts
- operation latency histograms (read / write / submit)
- error counts by FabricError variant
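The counters and histograms listed above can be pictured with a small standard-library sketch. SketchMetrics and all of its methods are hypothetical names invented for this illustration; they are not the grafos_observe::FabricMetrics API. The sketch assumes fixed latency bucket bounds in microseconds and drops observations above the last bound, matching the histogram behaviour described under Failure Modes.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical stand-in for a metrics surface (not the grafos_observe API).
pub struct SketchMetrics {
    leases_acquired: AtomicU64,
    leases_dropped: AtomicU64,
    /// Inclusive upper bound of each latency bucket, in microseconds, ascending.
    bounds_us: Vec<u64>,
    /// One counter per bound; values above the last bound are not recorded.
    buckets: Vec<AtomicU64>,
}

impl SketchMetrics {
    pub fn new(bounds_us: Vec<u64>) -> Self {
        let buckets = bounds_us.iter().map(|_| AtomicU64::new(0)).collect();
        Self {
            leases_acquired: AtomicU64::new(0),
            leases_dropped: AtomicU64::new(0),
            bounds_us,
            buckets,
        }
    }

    pub fn lease_acquired(&self) {
        self.leases_acquired.fetch_add(1, Ordering::Relaxed);
    }

    pub fn lease_dropped(&self) {
        self.leases_dropped.fetch_add(1, Ordering::Relaxed);
    }

    /// Active leases are derived as acquired minus dropped.
    pub fn active_leases(&self) -> u64 {
        self.leases_acquired
            .load(Ordering::Relaxed)
            .saturating_sub(self.leases_dropped.load(Ordering::Relaxed))
    }

    /// Counts the observation in the first bucket whose bound covers it.
    pub fn observe_latency_us(&self, us: u64) {
        if let Some(idx) = self.bounds_us.iter().position(|&bound| us <= bound) {
            self.buckets[idx].fetch_add(1, Ordering::Relaxed);
        }
    }

    pub fn bucket_counts(&self) -> Vec<u64> {
        self.buckets.iter().map(|b| b.load(Ordering::Relaxed)).collect()
    }
}
```

Relaxed ordering is enough here because each counter is read independently at scrape time; nothing in the sketch depends on ordering between counters.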
Layer 2: Events
Typed FabricEvent records in a bounded ring buffer. The shape:
    FabricEvent::LeaseAcquired { resource_type, lease_id, node, bytes, trace_id }
    FabricEvent::LeaseExpired { resource_type, lease_id, node }
    FabricEvent::Disconnected { node, reason }
    ... (closed-set enum)

Lossy by design: when the ring wraps, the oldest events drop. The ring is for short-window in-process correlation: “what happened in the 60 seconds around the spike?”.
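The drop-oldest behaviour can be sketched with a VecDeque. SketchEvent and SketchRing are hypothetical types made up for this example, with only two variants and two fields; the real FabricEvent and EventRingBuffer may differ in shape and in how they report drops.

```rust
use std::collections::VecDeque;

/// Hypothetical, reduced event shape (the real FabricEvent has more variants).
#[derive(Debug, Clone, PartialEq)]
pub enum SketchEvent {
    LeaseAcquired { lease_id: u64, node: String },
    LeaseExpired { lease_id: u64, node: String },
}

impl SketchEvent {
    pub fn lease_id(&self) -> u64 {
        match self {
            SketchEvent::LeaseAcquired { lease_id, .. }
            | SketchEvent::LeaseExpired { lease_id, .. } => *lease_id,
        }
    }
}

/// Bounded ring: when full, the oldest event is evicted to make room.
pub struct SketchRing {
    capacity: usize,
    events: VecDeque<SketchEvent>,
    dropped: u64,
}

impl SketchRing {
    pub fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "capacity must be non-zero");
        Self { capacity, events: VecDeque::with_capacity(capacity), dropped: 0 }
    }

    pub fn push(&mut self, event: SketchEvent) {
        if self.events.len() == self.capacity {
            self.events.pop_front(); // the oldest event is lost
            self.dropped += 1;
        }
        self.events.push_back(event);
    }

    pub fn len(&self) -> usize {
        self.events.len()
    }

    /// Number of events evicted since creation.
    pub fn dropped(&self) -> u64 {
        self.dropped
    }

    /// Events still in the ring for one lease, oldest first.
    pub fn for_lease(&self, lease_id: u64) -> Vec<&SketchEvent> {
        self.events.iter().filter(|e| e.lease_id() == lease_id).collect()
    }
}
```

Tracking a dropped count is a choice made in this sketch: it lets an operator tell an empty correlation window apart from one that has already wrapped.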
Layer 3: Audit chain
grafos_audit::assemble_record produces an AuditRecord with a
typed AuditEventKind (lease_allocated, lease_expired,
preempted, edge_rewritten, …) and the prior chain head sealed
into a SHA-256 hash. Tamper-evident: any byte-level change after
seal breaks verify_chain. The AnchorStore persists the chain
head atomically; on restart the consumer resumes from the persisted
anchor.
Three properties the audit chain has that the metrics / events layers do not:
- Hash linkage: no record can be added, removed, reordered, or modified after seal without detection.
- Typed payloads: AuditEventData::{LeasePreempted, RevokeStateTransition, BundleAdmissionDecided, EdgeRewritten, ...} carry the structured fields downstream consumers pattern-match on.
- Cross-process readable: serialize to JSONL via grafos_audit::jsonl::write_record, ingest via the reference collector at crates/grafos-audit-collector.
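The hash-linkage property can be illustrated with a toy chain. Everything here is an assumption made for the sketch: SketchRecord, seal, and verify are hypothetical names, and the digest uses the standard library's DefaultHasher as a stand-in because std has no SHA-256. DefaultHasher is not a cryptographic hash, so this shows only the linkage mechanics, not real tamper evidence.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical record shape; the real AuditRecord seals canonical bytes with SHA-256.
#[derive(Debug, Clone)]
pub struct SketchRecord {
    pub sequence: u64,
    pub kind: String,
    pub lease_id: u64,
    /// Hash of the previous record (or the anchor for the first record).
    pub prev_hash: u64,
    /// Digest over this record's fields plus prev_hash.
    pub hash: u64,
}

// Stand-in digest: NOT cryptographic, used only to show the linkage.
fn digest(sequence: u64, kind: &str, lease_id: u64, prev_hash: u64) -> u64 {
    let mut h = DefaultHasher::new();
    sequence.hash(&mut h);
    kind.hash(&mut h);
    lease_id.hash(&mut h);
    prev_hash.hash(&mut h);
    h.finish()
}

/// Seals a record onto the chain head `prev_hash`.
pub fn seal(prev_hash: u64, sequence: u64, kind: &str, lease_id: u64) -> SketchRecord {
    let hash = digest(sequence, kind, lease_id, prev_hash);
    SketchRecord { sequence, kind: kind.to_string(), lease_id, prev_hash, hash }
}

/// Walks the chain from `anchor`; returns the index of the first record
/// whose linkage or digest does not check out.
pub fn verify(records: &[SketchRecord], anchor: u64) -> Result<(), usize> {
    let mut expected_prev = anchor;
    for (i, r) in records.iter().enumerate() {
        let linked = r.prev_hash == expected_prev;
        let intact = r.hash == digest(r.sequence, &r.kind, r.lease_id, r.prev_hash);
        if !(linked && intact) {
            return Err(i);
        }
        expected_prev = r.hash;
    }
    Ok(())
}
```

Because each digest covers the previous head, modifying, removing, or reordering a record changes what the next record is expected to link to, which is what lets verification point at a failing index.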
Correlation
Every event in every layer should carry, where applicable:
- request_id / trace_id (W3C traceparent, see docs/operations/scheduler-features.md § “Trace context propagation”)
- lease_id
- node_id
A SIEM filter or operator query then walks across all three layers on the same key.
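As a sketch of how the shared key might be derived, the function below extracts the trace-id from a version-00 W3C traceparent header (version, 32-hex trace-id, 16-hex parent-id, 2-hex flags, separated by hyphens) and bundles it with the lease and node identifiers. CorrelationKey and both function names are hypothetical; how grafOS itself parses or stores the header is not shown in this recipe.

```rust
/// Hypothetical key carried by records in all three layers.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct CorrelationKey {
    pub trace_id: String,
    pub lease_id: u64,
    pub node_id: String,
}

fn is_lower_hex(s: &str, len: usize) -> bool {
    s.len() == len && s.bytes().all(|b| matches!(b, b'0'..=b'9' | b'a'..=b'f'))
}

/// Returns the trace-id of a version-00 traceparent header, or None if the
/// header is malformed or carries an all-zero trace-id or parent-id.
pub fn trace_id_from_traceparent(header: &str) -> Option<&str> {
    let mut parts = header.trim().split('-');
    let version = parts.next()?;
    let trace_id = parts.next()?;
    let parent_id = parts.next()?;
    let flags = parts.next()?;
    if parts.next().is_some() {
        return None; // version 00 has exactly four fields
    }
    let valid = version == "00"
        && is_lower_hex(trace_id, 32)
        && is_lower_hex(parent_id, 16)
        && is_lower_hex(flags, 2)
        && trace_id.bytes().any(|b| b != b'0')
        && parent_id.bytes().any(|b| b != b'0');
    if valid { Some(trace_id) } else { None }
}

/// Builds the key that a query would use to walk metrics labels, ring events,
/// and audit records for one request.
pub fn correlation_key(header: &str, lease_id: u64, node_id: &str) -> Option<CorrelationKey> {
    let trace_id = trace_id_from_traceparent(header)?.to_string();
    Some(CorrelationKey { trace_id, lease_id, node_id: node_id.to_string() })
}
```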
Program
    use cookbook_recipe_15_observing_everything::{
        dev_observability, record_lease_acquired, record_lease_expired, record_operation,
    };
    use grafos_audit::verify_chain;
    use grafos_core::WorkloadIdentity;
    use grafos_observe::{OpType, ResourceType};

    let (metrics, mut events, mut anchor, signer) = dev_observability();
    let identity = WorkloadIdentity::tenant_only("acme");

    let r1 = record_lease_acquired(
        &metrics, &mut events, &mut anchor, &signer,
        /*sequence*/ 1, /*timestamp*/ 1_700_000_000,
        &identity, ResourceType::Mem, /*lease_id*/ 7, "node-a", /*bytes*/ 4096,
    );
    // ... workload runs ...
    record_operation(&metrics, OpType::Read, 120, 4096);
    record_operation(&metrics, OpType::Write, 240, 8192);

    // Lease expires later in the day.
    let r2 = record_lease_expired(
        &metrics, &mut events, &mut anchor, &signer,
        2, 1_700_000_300, &identity, ResourceType::Mem, 7, "node-a",
    );

    // All three layers reflect the same lifecycle.
    assert_eq!(metrics.leases_total.get(), 1);
    assert_eq!(events.len(), 2);
    verify_chain(&[r1, r2], [0u8; grafos_audit::HASH_LEN]).expect("chain verifies");

Debugging Example: Periodic Latency Spike
Symptom:
- p99 jumps every ~300 seconds.
Investigation across the three layers:
- Metrics layer: leases_expired shows bursts at the same interval. Confirms it’s a lease lifecycle event, not a network glitch.
- Events layer: ring buffer shows LeaseExpired → LeaseAcquired pairs at the same lease_id every 300s. Confirms the renewal pattern, not new lease churn.
- Audit chain: grafos admin audit-query --kind lease_expired --since <t> (or Recipe 55’s collector) returns the sealed records with sequence numbers and timestamps. Confirms what was durably recorded at the producer.
Root cause: application renews leases at 99% of TTL — when the network adds even a small RTT, the renewal lands after expiry.
Fix: renew at 60-80% of TTL.
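The arithmetic behind the root cause and the fix can be worked through in a few lines. The 300-second TTL and 5-second round trip below are assumptions chosen to match the ~300-second spike interval; the function names are hypothetical and not part of any grafOS crate.

```rust
use std::time::Duration;

/// Time after acquisition at which the renewal is sent, as a percentage of the TTL.
pub fn renew_after(ttl: Duration, percent: u32) -> Duration {
    assert!(percent > 0 && percent < 100, "percent must be in 1..=99");
    ttl * percent / 100
}

/// Slack left for the renewal round trip before the lease expires.
pub fn renewal_margin(ttl: Duration, percent: u32) -> Duration {
    ttl - renew_after(ttl, percent)
}

/// True when a renewal that takes `rtt` to land still arrives before expiry.
pub fn renewal_lands_in_time(ttl: Duration, percent: u32, rtt: Duration) -> bool {
    rtt < renewal_margin(ttl, percent)
}
```

With a 300-second TTL, renewing at 99% leaves a 3-second margin, so a 5-second round trip misses expiry. Renewing at 70% leaves 90 seconds, which absorbs the same delay with room to retry.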
Verification:
- LeaseExpired events → ~0 in steady state on the metrics layer;
- ring buffer shows no LeaseExpired between checkpoints;
- audit chain shows only LeaseAllocated + LeaseRenewed for long-running workloads.
Failure Modes
- Metrics layer overflow: counters saturate at u64::MAX; histograms drop outliers above the configured ceiling. Both are expected.
- Events layer ring wrap: oldest events drop when the buffer is full. Sizing matters: set the ring to cover the longest correlation window you expect.
- Audit chain tamper: verify_chain returns the failing record index. The reference collector at crates/grafos-audit-collector increments chain_verification_failures and refuses to advance the anchor; production callers route this to an alert.
- Anchor corruption: if the persisted anchor file is missing or corrupt, FileAnchorStore::load_or_unanchored returns a fresh sentinel: producers continue but the chain restarts. Production callers persist the anchor to a side store before the producer process exits.
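For the ring-wrap case, a rough sizing rule is peak event rate times the correlation window, plus headroom. The helper below is a hypothetical sketch of that estimate, not a grafos_observe function; the rates and the headroom percentage are inputs you would measure or choose yourself.

```rust
/// Estimated ring capacity needed to retain `window_secs` of events at
/// `peak_events_per_sec`, padded by `headroom_percent` for bursts.
pub fn ring_capacity(peak_events_per_sec: u64, window_secs: u64, headroom_percent: u64) -> u64 {
    let base = peak_events_per_sec.saturating_mul(window_secs);
    base.saturating_add(base.saturating_mul(headroom_percent) / 100)
}
```

For example, 50 events per second over a 60-second window with 25% headroom suggests a ring of 3750 entries.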
Tests
Run it with:
    cargo test -p cookbook-recipe-15-observing-everything

Two tests cover the full lease lifecycle hitting all three layers (metrics counters move, events ring records two entries, audit chain produces two hash-linked records that verify_chain accepts) and the data-plane operation path going only to metrics (per-byte traffic does not enter the chain).
Adaptation Notes
- Production signer: replace NullSigner with grafos_audit::Ed25519Signer once Phase 220 wires the custody-matrix-named process. Until then NullSigner produces unsigned records; verify_chain accepts them.
- Production anchor: replace MemoryAnchorStore with FileAnchorStore::load_or_unanchored(path) so the chain resumes from a known head after a process restart.
- Alert rules: build alerts on the SIEM-stable counters per docs/operations/siem-vocabulary-cookbook.md, e.g. “grafos_audit_records_lease_expired_total rate over 5m > N” fires when the renewal pattern breaks.
- Per-lease cost tracking: pair with Recipe 59 (Cost Attribution with Accounting Tags). Every lease event carries a typed AccountingTag in canonical bytes, so cost rollups use the same surface this recipe produces.
- Tenant-side dashboard: pair with Recipe 60 (Tenant Audit Dashboard), which consumes the chain this recipe produces and projects it into operator-readable views.
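The “rate over 5m > N” rule from the alert-rules note can be approximated by comparing counter samples at the edges of the window. This is a minimal sketch under stated assumptions: samples are (unix_seconds, counter_value) pairs sorted oldest first, the counter never resets, and both function names are hypothetical. A real deployment would normally express this in the alerting system's own query language.

```rust
/// Increase of a monotonic counter across the samples that fall inside
/// [now - window_secs, now]. Returns None when no sample is in the window.
pub fn increase_over_window(samples: &[(u64, u64)], now: u64, window_secs: u64) -> Option<u64> {
    let start = now.saturating_sub(window_secs);
    let mut in_window = samples.iter().filter(|(t, _)| *t >= start && *t <= now);
    let first = in_window.next()?;
    // With a single sample in the window the increase is zero.
    let last = in_window.last().unwrap_or(first);
    Some(last.1.saturating_sub(first.1))
}

/// Fires when the counter grew by more than `threshold` within the window.
pub fn should_alert(samples: &[(u64, u64)], now: u64, window_secs: u64, threshold: u64) -> bool {
    increase_over_window(samples, now, window_secs).map_or(false, |inc| inc > threshold)
}
```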
Variations
- Trace context propagation: thread a W3C traceparent into every lease event so all three layers carry the same trace_id. The FabricEvent::LeaseAcquired { trace_id, .. } field is already wired for this; the audit-chain layer carries the trace context inside the typed EdgeRecord payload for edge_rewritten records.
- Replay from chain: with the audit chain persisted, an operator can reconstruct lease-lifecycle history without trusting the metrics or events surface; verify_chain proves the records weren’t tampered with after sealing.
See also:
- Recipe 55 (Consuming the Audit Chain) — the downstream collector pattern.
- Recipe 60 (Tenant Audit Dashboard) — the operator-facing projection.
- Recipe 59 (Cost Attribution With Accounting Tags) — cost rollups keyed on the same lease events.