Token-Gating Policy
This document defines which fabricBIOS control-plane operations require capability tokens and what verification steps each target must perform.
Design Principles
- Every state-mutating operation that creates or extends resource access must be token-gated. This includes LEASE_ALLOC, LEASE_RENEW, TASKLET_SUBMIT, and TASKLET_LEASE_ALLOC.
- Cleanup operations (LEASE_FREE) are intentionally ungated. The lease_id itself is the authorization: any holder of a lease_id can free it. This ensures cleanup always works, even when tokens have expired or the minting node is unreachable.
- Read-only operations are ungated. PING, GET_IDENTITY, GET_INVENTORY, GET_BUILD_INFO, GET_THERMAL, LEASE_QUERY, and LEASE_LIST_ACTIVE do not require tokens. LEASE_LIST_ACTIVE reveals allocation patterns but is needed for scheduler reconciliation; gating it would break the scheduler’s ability to detect drift.
- Nodes are the token authority. Each node mints and verifies its own tokens using its own signing key. The scheduler does not hold node keys and cannot forge tokens. It obtains them by calling CAP_REQUEST on the target node, which decides whether to mint based on fencing, epoch, and local policy. This is the fabricBIOS principle: the node is mechanism (issues tokens, enforces leases), the scheduler is policy (decides who gets what). A compromised scheduler can only obtain tokens from nodes it can reach, and the node’s fencing/epoch checks constrain even that.
The scheduler always obtains tokens by calling CAP_REQUEST on the target node; it never mints them itself. After TIME_SYNC, all targets use Unix time, so tokens are technically portable across nodes. However, the design intentionally keeps token minting node-local: no key distribution, no PKI, no trust in external signers. The node remains the sole authority over its own resources.
Operation Token Requirements
| Operation | Token Required | Rationale |
|---|---|---|
| LEASE_ALLOC (0x0200) | Yes | Creates a new resource lease — must prove authorization |
| LEASE_RENEW (0x0202) | Yes | Extends lease lifetime — state mutation equivalent to creation |
| LEASE_FREE (0x0201) | No | Cleanup — lease_id is the auth; must always work |
| LEASE_QUERY (0x0203) | No | Read-only query of lease status |
| LEASE_LIST_ACTIVE (0x0208) | No | Read-only; needed by scheduler reconciliation |
| TASKLET_SUBMIT (0x0500) | Yes | Submits code for execution on a CPU lease |
| TASKLET_STATUS (0x0501) | No | Read-only query of tasklet status |
| TASKLET_FETCH_RESULT (0x0502) | No | Read-only fetch of tasklet output |
| TASKLET_CANCEL (0x0503) | No | Cleanup equivalent — tasklet_id is the auth |
| TASKLET_LEASE_ALLOC (0x0800) | Yes | Creates composite CPU+MEM lease |
| TASKLET_LEASE_FREE (0x0801) | No | Delegates to LEASE_FREE |
| GPU_SUBMIT (0x0600) | No | GPU lease already required token to create; GPU_SUBMIT operates within that lease. The lease-exists check is sufficient. |
| CAP_REQUEST (0x0100) | No | Token minting — this IS the authorization primitive |
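
The table above can be expressed as a small lookup, sketched here in Rust. The opcodes and the token requirement are taken from the table; the `Op` enum, its variant names, and the `from_u16` and `requires_token` helpers are hypothetical names chosen for illustration and are not the fabricBIOS API.

```rust
/// Control-plane operations and their opcodes, copied from the table above.
/// The enum itself is an illustrative assumption, not the real wire type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(u16)]
pub enum Op {
    CapRequest = 0x0100,
    LeaseAlloc = 0x0200,
    LeaseFree = 0x0201,
    LeaseRenew = 0x0202,
    LeaseQuery = 0x0203,
    LeaseListActive = 0x0208,
    TaskletSubmit = 0x0500,
    TaskletStatus = 0x0501,
    TaskletFetchResult = 0x0502,
    TaskletCancel = 0x0503,
    GpuSubmit = 0x0600,
    TaskletLeaseAlloc = 0x0800,
    TaskletLeaseFree = 0x0801,
}

impl Op {
    /// Maps a wire opcode to an `Op`; opcodes not in the table yield `None`.
    pub fn from_u16(code: u16) -> Option<Op> {
        Some(match code {
            0x0100 => Op::CapRequest,
            0x0200 => Op::LeaseAlloc,
            0x0201 => Op::LeaseFree,
            0x0202 => Op::LeaseRenew,
            0x0203 => Op::LeaseQuery,
            0x0208 => Op::LeaseListActive,
            0x0500 => Op::TaskletSubmit,
            0x0501 => Op::TaskletStatus,
            0x0502 => Op::TaskletFetchResult,
            0x0503 => Op::TaskletCancel,
            0x0600 => Op::GpuSubmit,
            0x0800 => Op::TaskletLeaseAlloc,
            0x0801 => Op::TaskletLeaseFree,
            _ => return None,
        })
    }

    /// True only for the rows the table marks "Token Required: Yes".
    pub fn requires_token(self) -> bool {
        matches!(
            self,
            Op::LeaseAlloc | Op::LeaseRenew | Op::TaskletSubmit | Op::TaskletLeaseAlloc
        )
    }
}
```

Keeping the gated set in one `matches!` expression means a dispatcher could call `requires_token` before running the verification steps, so a new operation is ungated only by an explicit decision to leave it out of the list.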
Token Verification Steps (required for all token-gated ops)
Both bare-metal and fabricbiosd must perform these steps in order:
- Empty check: reject if `req.token.is_empty()`
- Decode: `CapabilityToken::decode(&req.token)` → reject on parse failure
- Signature verification: `token.verify_signature(&verify_key)` → reject if invalid
- Time bounds: `token.verify_time_bounds(now, max_ttl)` → reject if expired or future
- Audience: `token.verify_audience(presenter)` → reject if audience doesn’t match (audience=0 is wildcard, accepted by any presenter)
- Permissions: `token.permissions` must include the required permission for the op (WRITE for LEASE_ALLOC/RENEW, WRITE for TASKLET_SUBMIT)
- Revocation: check token_id against revocation cache → reject if revoked
- Caveats: verify any caveats attached to the token (source IP, time, range, etc.)
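
The ordering of the steps above can be sketched as a single function that returns at the first failing check. This is a minimal illustration under several assumptions: `Token`, `VerifyCtx`, `Reject`, `PERM_WRITE`, and `verify_token` are hypothetical stand-ins rather than the real `CapabilityToken` API, signature and caveat checks are reduced to precomputed booleans, and the time-bounds rule (`issued_at <= now < expires_at`, lifetime no longer than `max_ttl`) is one plausible reading of "expired or future".

```rust
use std::collections::HashSet;

/// Hypothetical permission bit; the real encoding is not given in this document.
pub const PERM_WRITE: u32 = 0x2;

#[derive(Debug, PartialEq, Eq)]
pub enum Reject { Empty, Malformed, BadSignature, TimeBounds, Audience, Permission, Revoked, Caveat }

/// Stand-in for a decoded token. `signature_ok` and `caveats_ok` replace
/// real cryptographic and caveat evaluation, which are out of scope here.
#[derive(Debug)]
pub struct Token {
    pub token_id: u64,
    pub issued_at: u64,
    pub expires_at: u64,
    pub audience: u64,
    pub permissions: u32,
    pub signature_ok: bool,
    pub caveats_ok: bool,
}

pub struct VerifyCtx<'a> {
    pub now: u64,
    pub max_ttl: u64,
    pub presenter: u64,
    pub required_perm: u32,
    pub revoked: &'a HashSet<u64>,
}

/// Runs the eight checks in the documented order, stopping at the first failure.
pub fn verify_token<D>(raw: &[u8], decode: D, ctx: &VerifyCtx<'_>) -> Result<Token, Reject>
where
    D: Fn(&[u8]) -> Option<Token>,
{
    // 1. Empty check
    if raw.is_empty() {
        return Err(Reject::Empty);
    }
    // 2. Decode
    let token = decode(raw).ok_or(Reject::Malformed)?;
    // 3. Signature verification (stubbed)
    if !token.signature_ok {
        return Err(Reject::BadSignature);
    }
    // 4. Time bounds (assumed rule, see lead-in)
    let lifetime = token.expires_at.checked_sub(token.issued_at).ok_or(Reject::TimeBounds)?;
    if token.issued_at > ctx.now || ctx.now >= token.expires_at || lifetime > ctx.max_ttl {
        return Err(Reject::TimeBounds);
    }
    // 5. Audience: 0 is the wildcard
    if token.audience != 0 && token.audience != ctx.presenter {
        return Err(Reject::Audience);
    }
    // 6. Permissions
    if token.permissions & ctx.required_perm != ctx.required_perm {
        return Err(Reject::Permission);
    }
    // 7. Revocation
    if ctx.revoked.contains(&token.token_id) {
        return Err(Reject::Revoked);
    }
    // 8. Caveats (stubbed)
    if !token.caveats_ok {
        return Err(Reject::Caveat);
    }
    Ok(token)
}
```

Running the cheap structural checks first and the revocation lookup late follows the documented order; a distinct `Reject` variant per step also makes it easy to see which check refused a request.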
GPU_SUBMIT Rationale
GPU_SUBMIT is intentionally ungated by a capability token. The reasoning:
- Creating a GPU lease (via LEASE_ALLOC) requires a token. The lease-exists check in GPU_SUBMIT confirms the caller has an active GPU lease.
- GPU_SUBMIT is analogous to FBMU WRITE: it operates within the scope of an existing lease rather than creating new resource access.
- Adding a token requirement to GPU_SUBMIT would require the caller to hold both a lease AND a separate submission token, which adds complexity without meaningfully improving security (the lease already proves authorization).
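
As an illustration of the lease-exists check described above, the following Rust sketch looks up the lease and confirms it is an unexpired GPU lease before a submission would proceed. The lease table layout, `ResourceKind`, `Lease`, `SubmitError`, and `check_gpu_lease` are all hypothetical; the document does not specify how GPU_SUBMIT stores or validates leases.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ResourceKind { Cpu, Mem, Gpu }

/// Hypothetical lease record; only the fields needed for the check are shown.
#[derive(Debug, Clone, Copy)]
pub struct Lease {
    pub kind: ResourceKind,
    pub expires_at: u64, // Unix seconds, same time base as tokens
}

#[derive(Debug, PartialEq, Eq)]
pub enum SubmitError { NoSuchLease, NotGpuLease, LeaseExpired }

/// Authorizes a GPU submission purely from the lease table: no token is
/// consulted, because creating the lease already required one.
pub fn check_gpu_lease(
    leases: &HashMap<u64, Lease>,
    lease_id: u64,
    now: u64,
) -> Result<(), SubmitError> {
    let lease = leases.get(&lease_id).ok_or(SubmitError::NoSuchLease)?;
    if lease.kind != ResourceKind::Gpu {
        return Err(SubmitError::NotGpuLease);
    }
    if now >= lease.expires_at {
        return Err(SubmitError::LeaseExpired);
    }
    Ok(())
}
```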
Clock Semantics
- fabricbiosd: `issued_at` and `expires_at` are Unix timestamps (seconds since 1970-01-01). `verify_time_bounds()` uses `SystemTime::now()`. TIME_SYNC is accepted (no-op, returns current time).
- Bare-metal (after TIME_SYNC): `issued_at` and `expires_at` are Unix timestamps, derived from `monotonic_ticks + offset`, where `offset` was set by the first TIME_SYNC from a trusted peer. Tokens are compatible with fabricbiosd tokens.
- Bare-metal (before TIME_SYNC): Operations that mint or verify tokens (CAP_REQUEST, LEASE_ALLOC, LEASE_RENEW, TASKLET_SUBMIT) return `TimeNotSynced` (0x000B). Read-only operations (PING, GET_INVENTORY, GET_THERMAL) work without time sync. The node is honest: “I don’t know what time it is.”
- Lease durations: Use `lease_now()`, which returns Unix time after TIME_SYNC. Lease `expires_at` is on the same time base as tokens.
- Token portability: After TIME_SYNC, all targets use Unix time, so tokens are technically portable across nodes. However, the design intentionally keeps minting node-local: the node is the token authority (see Design Principle 4).
Bare-metal clock source chain
On Raspberry Pi 5 bare-metal:
- Boot: ARM generic timer starts counting from 0.
- First QUIC client sends TIME_SYNC with its Unix time.
- `set_unix_time(peer_secs)`: computes `offset = peer_secs - monotonic`.
- `now_unix_secs()`: returns `monotonic + offset` (real Unix time).
- Before TIME_SYNC: `now_unix_secs()` returns `Err(Unsupported)`, token-gated ops return `TimeNotSynced` (0x000B).
The Pi5 has an RTC in the DA9091 PMIC, but it is not accessible from bare-metal code (no firmware mailbox tag exposed). TIME_SYNC provides the same function — a trusted peer sets the clock on first connection.
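
The offset arithmetic in the clock source chain can be sketched as follows. `NodeClock` and `ClockError` are hypothetical names, and unlike the `set_unix_time(peer_secs)` and `now_unix_secs()` signatures listed above, this version takes the monotonic reading as an explicit parameter so the example runs without hardware. Ignoring every TIME_SYNC after the first is an assumption based on the statement that the offset is set by the first TIME_SYNC.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum ClockError {
    /// No TIME_SYNC has been received yet, so wall-clock time is unknown.
    Unsupported,
}

/// Illustrative clock: Unix time = monotonic seconds since boot + offset.
pub struct NodeClock {
    offset: Option<u64>,
}

impl NodeClock {
    pub const fn new() -> Self {
        Self { offset: None }
    }

    /// Records `offset = peer_secs - monotonic` on the first call only.
    /// `monotonic_secs` stands in for a read of the ARM generic timer.
    pub fn set_unix_time(&mut self, peer_secs: u64, monotonic_secs: u64) {
        if self.offset.is_none() {
            // saturating_sub avoids underflow if a peer sends a nonsensical time.
            self.offset = Some(peer_secs.saturating_sub(monotonic_secs));
        }
    }

    /// Returns `monotonic + offset`, or `Unsupported` before TIME_SYNC.
    pub fn now_unix_secs(&self, monotonic_secs: u64) -> Result<u64, ClockError> {
        self.offset
            .map(|off| monotonic_secs.saturating_add(off))
            .ok_or(ClockError::Unsupported)
    }
}
```

A caller handling a time-dependent operation would map `ClockError::Unsupported` to the `TimeNotSynced` (0x000B) response rather than guessing a timestamp.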
Why not scheduler-minted tokens?
The scheduler could theoretically sign tokens itself (saving a QUIC round-trip per lease). This is not done because:
- Key distribution: the scheduler would need each node’s signing key, or a separate CA that nodes trust. Both add complexity and attack surface.
- Separation of mechanism and policy: the node is the authority over its own resources. The scheduler proves authorization by successfully calling CAP_REQUEST — the node decides whether to mint.
- Compromise containment: a compromised scheduler can only get tokens from nodes it can reach, and the node’s fencing/epoch checks limit even that. If the scheduler held signing keys, it could forge unlimited tokens.
This aligns with the fabricBIOS principle: nodes expose resources and enforce access control; policy lives above.