fabricBIOS wire encoding (v0)

This document defines the normative v0 wire encoding used by the Rust reference implementation. It removes ambiguity around variable-length fields, fragmentation metadata, and version/flag handling so independent implementations can interoperate.

Conventions

All integers are big-endian.

Variable-length fields (normative v0)

  • bytes := u32 len + len bytes (reject if len exceeds implementation limit).
  • list := u16 count + repeated items (reject if count exceeds implementation limit).
  • tlv := u8 type + u16 len + len bytes.
    • Unknown TLVs must be skippable (do not treat as fatal).
  • Optional fields use a u8 present flag (0=absent, 1=present) followed by the field bytes.
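
As a rough illustration of the conventions above, a bounds-checked big-endian reader might look like the following. This is a sketch, not the reference implementation: `Reader`, `DecodeError`, and the `max_len` limit are hypothetical names, and treating a present flag other than 0 or 1 as malformed is an assumption the text does not state.

```rust
/// Hypothetical cursor over a received buffer; every read is bounds-checked.
pub struct Reader<'a> {
    buf: &'a [u8],
    pos: usize,
}

#[derive(Debug, PartialEq)]
pub enum DecodeError {
    Truncated,
    LimitExceeded,
    BadPresentFlag,
}

impl<'a> Reader<'a> {
    pub fn new(buf: &'a [u8]) -> Self {
        Reader { buf: buf, pos: 0 }
    }

    /// Take exactly `n` raw bytes, or fail without advancing.
    pub fn take(&mut self, n: usize) -> Result<&'a [u8], DecodeError> {
        let buf: &'a [u8] = self.buf;
        let end = self.pos.checked_add(n).ok_or(DecodeError::Truncated)?;
        let out = buf.get(self.pos..end).ok_or(DecodeError::Truncated)?;
        self.pos = end;
        Ok(out)
    }

    pub fn u8(&mut self) -> Result<u8, DecodeError> {
        Ok(self.take(1)?[0])
    }

    pub fn u16(&mut self) -> Result<u16, DecodeError> {
        let b = self.take(2)?;
        Ok(u16::from_be_bytes([b[0], b[1]]))
    }

    pub fn u32(&mut self) -> Result<u32, DecodeError> {
        let b = self.take(4)?;
        Ok(u32::from_be_bytes([b[0], b[1], b[2], b[3]]))
    }

    /// bytes := u32 len + len bytes, rejected above `max_len` (assumed limit).
    pub fn bytes(&mut self, max_len: usize) -> Result<&'a [u8], DecodeError> {
        let len = self.u32()? as usize;
        if len > max_len {
            return Err(DecodeError::LimitExceeded);
        }
        self.take(len)
    }

    /// Optional field: u8 present flag, then the field (here a `bytes` value).
    pub fn optional_bytes(&mut self, max_len: usize) -> Result<Option<&'a [u8]>, DecodeError> {
        match self.u8()? {
            0 => Ok(None),
            1 => Ok(Some(self.bytes(max_len)?)),
            // Assumption: any other flag value is treated as malformed.
            _ => Err(DecodeError::BadPresentFlag),
        }
    }
}
```

Checking the length against the limit before touching the payload keeps a hostile `len` from driving a large allocation or read.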

Wire version and flags (normative v0)

  • Unknown protocol versions MUST be rejected (fail-closed).
  • Unknown flag bits MUST be rejected (fail-closed).
  • Fragmentation metadata is present only when FRAG_V2 is set (see below).

Fragmentation (normative v0)

Fragmentation uses CONTINUED/FINAL plus the FRAG_V2 flag and fragment metadata fields.

When FRAG_V2 is set, the header includes:

frag_offset : u32 (byte offset within the full payload)
frag_total_len : u32 (total unfragmented payload length)

This extends the header by 8 bytes; payload length remains the fragment payload only.

Rules:

  • Fragments are keyed by (sender, request_id).
  • Fragments with the same key MUST agree on msg_type, nonce, cleared flags, and frag_total_len (if present); otherwise the reassembly MUST be dropped.
  • frag_offset + payload_len MUST NOT exceed frag_total_len.
  • FINAL MUST be set on the fragment where frag_offset + payload_len == frag_total_len.
  • Overlapping fragments MUST be rejected.
  • CONTINUED and FINAL MUST NOT both be set in the same frame.
  • Implementations SHOULD bound in-flight reassembly and drop incomplete entries after a short timeout to avoid unbounded memory growth.
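
The per-fragment rules above could be checked along these lines. This is a sketch under stated assumptions: `Fragment`, `FragError`, and `check_fragment` are hypothetical names, the caller is assumed to track already-received byte ranges per (sender, request_id) key, and rejecting FINAL on a fragment that does not reach frag_total_len is stricter than the text strictly requires.

```rust
#[derive(Debug, PartialEq)]
pub enum FragError {
    BothContinuedAndFinal,
    ExceedsTotal,
    FinalMismatch,
    Overlap,
}

/// Hypothetical view of the fields carried by one FRAG_V2 frame.
pub struct Fragment {
    pub continued: bool,
    pub is_final: bool,
    pub frag_offset: u32,
    pub frag_total_len: u32,
    pub payload_len: u32,
}

/// Validates one fragment against the ranges already received for its key.
/// `received` holds (start, end) byte ranges with `end` exclusive.
/// On success, returns the range this fragment covers.
pub fn check_fragment(f: &Fragment, received: &[(u64, u64)]) -> Result<(u64, u64), FragError> {
    if f.continued && f.is_final {
        return Err(FragError::BothContinuedAndFinal);
    }
    // Widen to u64 so frag_offset + payload_len cannot wrap.
    let start = f.frag_offset as u64;
    let end = start + f.payload_len as u64;
    let total = f.frag_total_len as u64;
    if end > total {
        return Err(FragError::ExceedsTotal);
    }
    // Assumption: FINAL is required exactly on the fragment that reaches the end.
    if f.is_final != (end == total) {
        return Err(FragError::FinalMismatch);
    }
    if received.iter().any(|&(s, e)| start < e && s < end) {
        return Err(FragError::Overlap);
    }
    Ok((start, end))
}
```

A reassembler would append the returned range to its per-key list and complete the message once the ranges cover `0..frag_total_len`.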

Capability negotiation / fallback:

  • A requester advertises support by setting FRAG_V2 on its request frame.
  • A responder MUST NOT send FRAG_V2 fragments unless the requester advertised support. Without that advertisement it falls back to legacy CONTINUED/FINAL fragments with no offset metadata.

Relay discovery profile (v0)

Relays answering SOLICIT MUST use MsgType::RESPONSE with the following payload encoding:

count : u16
repeated count times:
announce_payload_bytes : bytes (u32 length + bytes of AnnouncePayload encoding)

Rules:

  • Inventory ordering is implementation-defined; clients MUST treat the list as an unordered snapshot and MAY sort by node_id for stable presentation.
  • Relays SHOULD include at most one entry per node_id, choosing the most recently observed ANNOUNCE (or highest sequence when available).
  • Inventory may be truncated to fit relay caps; absence of a node in a RESPONSE does not imply withdrawal or fencing.

WITHDRAW payload

Field | Type | Bytes | Description
node_id | u128 | 16 | Node being withdrawn
sequence | u64 | 8 | Monotonic sequence number
reason | u8 | 1 | Reason code (see below)

Total: 25 bytes.

Reason codes:

Code | Name
0x00 | GRACEFUL_SHUTDOWN
0x01 | MAINTENANCE
0x02 | RESOURCE_EXHAUSTION
0x03 | POLICY

For backwards compatibility, decoders SHOULD accept 16-byte payloads (old format) as node_id only with sequence=0, reason=GRACEFUL_SHUTDOWN.
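
A decoder that accepts both the 25-byte and the legacy 16-byte form might be sketched as below. The `Withdraw` struct and `decode_withdraw` function are hypothetical names; reason codes are passed through unvalidated because the text does not say how unknown codes are handled.

```rust
#[derive(Debug, PartialEq)]
pub struct Withdraw {
    pub node_id: u128,
    pub sequence: u64,
    pub reason: u8,
}

pub const GRACEFUL_SHUTDOWN: u8 = 0x00;

fn be_u64(b: &[u8]) -> u64 {
    let mut a = [0u8; 8];
    a.copy_from_slice(b); // caller guarantees b.len() == 8
    u64::from_be_bytes(a)
}

fn be_u128(b: &[u8]) -> u128 {
    let mut a = [0u8; 16];
    a.copy_from_slice(b); // caller guarantees b.len() == 16
    u128::from_be_bytes(a)
}

/// Decodes a WITHDRAW payload; returns None for any other length.
pub fn decode_withdraw(p: &[u8]) -> Option<Withdraw> {
    match p.len() {
        // Current format: node_id || sequence || reason.
        25 => Some(Withdraw {
            node_id: be_u128(&p[0..16]),
            sequence: be_u64(&p[16..24]),
            reason: p[24],
        }),
        // Old format: node_id only, with the documented defaults.
        16 => Some(Withdraw {
            node_id: be_u128(p),
            sequence: 0,
            reason: GRACEFUL_SHUTDOWN,
        }),
        _ => None,
    }
}
```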

SOLICIT filters (v0)

SOLICIT includes a query_type and a list of fixed-size filters (field, op, value[32]).

Filter field ID ranges:

  • 0x00 reserved
  • 0x01..=0x3F core registry (this doc)
  • 0x40..=0x7F experimental/extension
  • 0x80..=0xFF vendor-specific

Core registry fields:

  • field=1 RESOURCE_TYPE: value[0..2] = BE u16 (e.g., CPU=0x0001, MEM=0x0002); op EQ
  • field=2 NODE_ID: value[0..16] = BE u128; op EQ
  • field=3 SITE_ID: value[0..4] = BE u32; ops EQ/GT/LT
  • field=4 ROW_ID: value[0..4] = BE u32; ops EQ/GT/LT
  • field=5 RACK_ID: value[0..4] = BE u32; ops EQ/GT/LT
  • field=6 LOCALITY_CUSTOM: value[0..32] is an opaque 32-byte value; ops EQ/CONTAINS (for CONTAINS, non-zero bytes act as a position-wise mask and must match)
  • field=7 RESOURCE_FLAGS: value[0..2] = BE u16 bitmask; ops EQ/CONTAINS (FENCED=0x0001, DEGRADED=0x0002; CONTAINS requires all bits set)

Current reference semantics:

  • Filters are applied with AND semantics.
  • RESOURCE_TYPE is evaluated against the ResourceSummary[] list (there must exist a resource that matches all resource-scoped filters).
  • NODE_ID / locality filters are evaluated against the ANNOUNCE payload.
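
The two CONTAINS variants in the core registry could be evaluated roughly as follows. This is one reading of the text, not the reference matcher: the function names are hypothetical, and the LOCALITY_CUSTOM interpretation (zero filter bytes are wildcards, non-zero bytes must be equal at the same position) is an assumption.

```rust
/// RESOURCE_FLAGS with op CONTAINS: every bit set in the filter
/// (value[0..2], big-endian u16) must also be set on the resource.
pub fn flags_contains(resource_flags: u16, filter_value: &[u8; 32]) -> bool {
    let mask = u16::from_be_bytes([filter_value[0], filter_value[1]]);
    (resource_flags & mask) == mask
}

/// LOCALITY_CUSTOM with op CONTAINS: each non-zero filter byte must equal
/// the byte at the same position; zero bytes are ignored (assumed reading).
pub fn locality_contains(locality: &[u8; 32], filter_value: &[u8; 32]) -> bool {
    filter_value
        .iter()
        .zip(locality.iter())
        .all(|(f, l)| *f == 0 || f == l)
}
```

Because filters combine with AND semantics, a caller would require every such predicate to hold for a candidate to match.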

ResourceSummary flags (implementation note)

ResourceSummary.flags is a u16 bitfield.

  • bit0 FENCED: resource is fenced and must not be leased.
  • bit1 DEGRADED: resource is usable but degraded.

All other bits are reserved and must be zero in the reference implementation.

Capability token caveats (implementation note)

The spec names caveat types but does not assign numeric type codes. The current reference assigns:

  • TIME_BOUND = 1: data = u64 start_be || u64 end_be
  • SOURCE_IP = 2: data = [u8; 16] (IPv6) or [u8; 4] (IPv4)
  • RANGE = 3: data = u64 offset_be || u64 len_be
  • RATE_LIMIT = 4: data = u32 max_per_sec_be (exposed as a requirement for the caller to enforce)
  • DEPTH = 5: data = u8 max_delegation_edges
  • AUDIENCE = 6: data = u128 audience_be (additional audience constraint)

Unknown caveat types are rejected by the reference verifier.

Revocation broadcast payload (implementation note)

The spec defines REVOKE_BROADCAST as:

issuer : u128
token_ids : u128[]
until : u64

The reference encodes token_ids with the list encoding: count:u16 followed by count u128 values.

Certificate encoding (implementation note)

fabricbios_core::identity::Certificate is encoded as:

subject : u128
issuer : u128
issued_at : u64
expires_at : u64
public_key : [u8; 32]
extensions : tlv[] (list encoding with `count:u16`)
signature : [u8; 64]

Control plane ops (implementation note)

For early development, the reference originally implemented a minimal control-plane on TCP using the common Frame header and these payload encodings.

Current Pi5 bare-metal direction uses QUIC over UDP for the control plane (no TCP). The message payload schemas remain useful, but the transport framing is QUIC-stream based. See:

  • docs/platform/pi5/design-doc-1-quic-controlplane-udp-dataplanes.md

REQUEST payload

This is the normative v0 wire encoding for control REQUEST payloads. Design-level transport and authentication requirements are in docs/spec/fabricBIOS-design-document.md (Section 11) and docs/spec/fabricBIOS-design-document-v1.1.md (Section 12).

op : u16
resource_id : u128
token : bytes
params : bytes
presenter_id : u128 (REQUIRED on UDP control; OPTIONAL on QUIC — peer identity is TLS-authenticated)
presenter_sig: [u8; 64] (REQUIRED on UDP control; OPTIONAL on QUIC — peer identity is TLS-authenticated)

RESPONSE payload

status : u16
op : u16
result : bytes
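
For illustration, the RESPONSE layout above can be round-tripped with a few lines of code. The function names are hypothetical, and rejecting trailing bytes after `result` is an assumption rather than something the text requires.

```rust
/// Encodes status || op || result (u32 length prefix + bytes), all big-endian.
pub fn encode_response(status: u16, op: u16, result: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(8 + result.len());
    out.extend_from_slice(&status.to_be_bytes());
    out.extend_from_slice(&op.to_be_bytes());
    out.extend_from_slice(&(result.len() as u32).to_be_bytes());
    out.extend_from_slice(result);
    out
}

/// Decodes a RESPONSE payload into (status, op, result).
pub fn decode_response(p: &[u8]) -> Option<(u16, u16, &[u8])> {
    if p.len() < 8 {
        return None;
    }
    let status = u16::from_be_bytes([p[0], p[1]]);
    let op = u16::from_be_bytes([p[2], p[3]]);
    let len = u32::from_be_bytes([p[4], p[5], p[6], p[7]]) as usize;
    // Assumption: the payload must end exactly where `result` ends.
    if p.len() - 8 != len {
        return None;
    }
    Some((status, op, &p[8..]))
}
```

A PING reply with a u64 uptime, for example, would carry an 8-byte `result` and occupy 16 bytes in total.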

Dev op codes and params

These are not yet part of the normative spec; they are a dev scaffold:

  • PING (0x0001): params empty, result u64 uptime_secs
  • GET_IDENTITY (0x0003): params empty, result u128 node_id || [u8;32] controller_pubkey
  • GET_INVENTORY (0x0002): params empty, result ResourceSummary[] (list encoding)
  • ENROLL_REQUEST (0x0004): params u128 node_id || [u8;32] public_key, result Certificate bytes
  • CAP_REQUEST (0x0100): params u32 perms || u32 ttl_secs || u128 audience, result CapabilityToken bytes
  • CAP_REFRESH (0x0101): params u32 ttl_secs, request token is the token to refresh, result CapabilityToken bytes
  • CAP_REVOKE (0x0102): params u128 token_id || u32 ttl_secs, request token must authorize revocation, result RevokeBroadcast bytes
  • LEASE_ALLOC (0x0200): params u32 duration_secs || u32 grace_secs [|| u16 resource_type [|| u64 requested_bytes]], result u128 lease_id || u64 expires_at || binding TLVs
    • resource_type (optional, 10+ bytes): 0x0001=CPU, 0x0002=MEM, 0x0003=BLOCK, 0x0004=NET, 0x0005=GPU, 0x0010=SCHEDULER. Default: MEM.
    • RES_TYPE_SCHEDULER (0x0010): used in ANNOUNCE payloads by the grafos-scheduler-service. Capacity field carries the HTTP API port. Discovery is visibility only — trust requires explicit --scheduler-url.
    • requested_bytes (optional, 18+ bytes): desired allocation size in bytes. Server uses its default (1 MiB) if absent.
    • Response binding TLVs include TLV_LIMITS (0x0104) whose len field reflects the actual granted region size.
  • LEASE_RENEW (0x0202): params u128 lease_id || u32 duration_secs
  • LEASE_FREE (0x0201): params u128 lease_id
  • LEASE_QUERY (0x0203): params u128 lease_id, result u8 lease_status || u64 expires_at
    • lease_status values: 0=ACTIVE, 1=EXPIRED, 2=REVOKED
  • GET_THERMAL (0x0006): params empty, result TLV sequence (extensible)
    • Response TLV tags (0x0900 range):
      • 0x0900 (4 bytes, u32): age_ms — milliseconds since sensor was last sampled at response-generation time
      • 0x0901 (4 bytes, i32): soc_temp_milli_c — SoC temperature in milli-°C
      • 0x0902 (1 byte, u8): soc_status — sensor status: 0=OK, 1=Unavailable, 2=Stale
      • 0x0903 (1 byte, u8): throttled_bits — bit 0 = thermal throttled, remaining bits reserved
      • 0x0904 (4 bytes, i32): nvme_temp_milli_c — NVMe temperature in milli-°C (optional)
      • 0x0905 (1 byte, u8): nvme_status — NVMe sensor status (same values as soc_status)
    • Missing TLVs indicate “not supported on this platform” (not a protocol error).
    • Unknown TLV tags must be skipped by consumers (forward-compatible).
    • Temperature validity bounds: -40 000 to 150 000 milli-°C. Values outside this range are malformed.
    • Older nodes without GET_THERMAL return a standard error response; consumers must handle this gracefully.
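
The length-dependent optional fields of LEASE_ALLOC params listed above (resource_type at 10+ bytes, requested_bytes at 18+ bytes) might be decoded as in this sketch. `LeaseAllocParams` and `decode_lease_alloc` are hypothetical names; bytes past offset 18 are ignored here, and how in-between lengths such as 9 or 12 should be treated is not specified, so this sketch simply reads whichever fields fully fit.

```rust
#[derive(Debug, PartialEq)]
pub struct LeaseAllocParams {
    pub duration_secs: u32,
    pub grace_secs: u32,
    pub resource_type: u16,
    pub requested_bytes: Option<u64>,
}

/// Default resource type when the optional field is absent (MEM).
pub const RES_TYPE_MEM: u16 = 0x0002;

fn be_u32(b: &[u8]) -> u32 {
    u32::from_be_bytes([b[0], b[1], b[2], b[3]])
}

pub fn decode_lease_alloc(p: &[u8]) -> Option<LeaseAllocParams> {
    if p.len() < 8 {
        return None; // duration_secs and grace_secs are mandatory
    }
    let resource_type = if p.len() >= 10 {
        u16::from_be_bytes([p[8], p[9]])
    } else {
        RES_TYPE_MEM
    };
    let requested_bytes = if p.len() >= 18 {
        let mut a = [0u8; 8];
        a.copy_from_slice(&p[10..18]);
        Some(u64::from_be_bytes(a))
    } else {
        None // server falls back to its own default size
    };
    Some(LeaseAllocParams {
        duration_secs: be_u32(&p[0..4]),
        grace_secs: be_u32(&p[4..8]),
        resource_type: resource_type,
        requested_bytes: requested_bytes,
    })
}
```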

Scheduler fencing ops

These ops allow an external scheduler to install and query epoch fences on managed nodes. Once a fence is installed, CAP_REQUEST on that node requires a matching epoch trailer.

  • SCHEDULER_FENCE_SET (0x0009): params u64 new_epoch

    • Installs the exact epoch value new_epoch on the node.
    • The node stores the epoch verbatim (not current + 1).
    • If a fence is already installed, new_epoch MUST be strictly greater than the installed epoch; otherwise the node returns STALE_EPOCH (0x0009).
    • If no fence is installed, any non-zero epoch is accepted as the initial installation.
    • Result: empty on success.
  • SCHEDULER_FENCE_GET (0x000A): params empty

    • Returns the current fence state of the node.
    • Result: u8 installed || u64 epoch
      • installed: 0 = no scheduler fence installed, 1 = fence installed.
      • epoch: the currently installed epoch value. 0 when installed = 0.
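
The node-side behaviour of the two fence ops can be modelled with a small state holder. This is a sketch only: `FenceState` and `FenceError` are hypothetical names, and the text does not say which status a zero epoch receives on initial installation, so it gets its own error variant here.

```rust
#[derive(Debug, PartialEq)]
pub enum FenceError {
    /// Maps to the STALE_EPOCH status.
    StaleEpoch,
    /// Assumption: the status for a zero initial epoch is unspecified.
    ZeroInitialEpoch,
}

#[derive(Default)]
pub struct FenceState {
    installed: Option<u64>,
}

impl FenceState {
    /// SCHEDULER_FENCE_SET: store new_epoch verbatim if it is acceptable.
    pub fn set(&mut self, new_epoch: u64) -> Result<(), FenceError> {
        match self.installed {
            Some(cur) if new_epoch <= cur => return Err(FenceError::StaleEpoch),
            None if new_epoch == 0 => return Err(FenceError::ZeroInitialEpoch),
            _ => {}
        }
        self.installed = Some(new_epoch);
        Ok(())
    }

    /// SCHEDULER_FENCE_GET result: u8 installed || u64 epoch (big-endian).
    pub fn get(&self) -> [u8; 9] {
        let mut out = [0u8; 9];
        if let Some(e) = self.installed {
            out[0] = 1;
            out[1..].copy_from_slice(&e.to_be_bytes());
        }
        out
    }
}
```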

Lease list op

LEASE_QUERY (0x0203) is an existing per-lease targeted probe that returns the status of a single lease by ID. LEASE_LIST_ACTIVE is a new bulk snapshot operation that returns all active leases on the node.

  • LEASE_LIST_ACTIVE (0x0208): params empty
    • Returns all active leases as a list.
    • Result encoding:
      count : u16
      repeated count times:
      lease_id : u128
      resource_id : u128
      holder : u128
      expires_at : u64
      alloc_bytes : u64
    • Each entry is 64 bytes. The list uses standard u16 count prefix encoding.
    • Nodes SHOULD return a consistent snapshot (no new leases created or expired mid-response). If the node cannot guarantee atomicity, the response is best-effort and the caller MUST tolerate minor staleness.

Control-plane status codes

Code | Name | Description
0x0000 | OK | Success
0x0001 | INVALID_TOKEN | Token is malformed, expired, or revoked
0x0002 | INSUFFICIENT_PERM | Token lacks required permission
0x0003 | RESOURCE_NOT_FOUND | Target resource does not exist
0x0004 | RESOURCE_BUSY | Resource is in use and cannot be modified
0x0005 | CAPACITY_EXCEEDED | Node cannot satisfy the requested allocation
0x0006 | LEASE_EXPIRED | The referenced lease has expired
0x0007 | RATE_LIMITED | Too many requests from this sender
0x0008 | RESOURCE_FENCED | Resource is fenced (teardown failure or admin action)
0x0009 | STALE_EPOCH | The scheduler epoch in the request is older than the installed fence
0x000A | FENCE_REQUIRED | The node requires a scheduler fence but none is installed
0x00FF | INTERNAL_ERROR | Unrecoverable internal error

Scheduler substrate state model

The scheduler maintains the following state categories. These are not on-wire types but are referenced by the protocol extensions above and by scheduler service implementations.

  • PendingGrant: The scheduler has issued authority (reserved capacity and quota) for a lease, but no real lease exists on the fabric yet. The client has not yet called LEASE_ALLOC or has not confirmed the result. Pending grants expire if confirmation does not arrive within the token TTL window.

  • ConfirmedLease: The client has reported back a real lease_id from a successful LEASE_ALLOC. The scheduler can now attribute this lease to a tenant and policy domain, and it becomes eligible for preemption and lifecycle tracking.

  • UnattributedLease: A node reports an active lease (via LEASE_LIST_ACTIVE) that the scheduler cannot map to any known pending grant or confirmed lease. This arises from direct client allocations, stale scheduler state, or crash recovery gaps. Unattributed leases consume node capacity but are excluded from tenant quota and automatic preemption until they are explicitly attributed or cleared by an operator.

  • LeaderEpoch: A monotonically increasing u64 epoch counter identifying the current scheduler leader’s term. Each promotion increments the epoch and installs it on all managed nodes via SCHEDULER_FENCE_SET. Token minting on managed nodes requires the request to carry the currently installed epoch.

  • SchedulerRole:

    • Standby: Read-only. The scheduler replays its WAL and reconciles state but does not serve mutating APIs (lease admission, token minting, preemption). A standby can be promoted to active via manual intervention.
    • Active: Serves all mutating APIs. Promotion requires WAL replay, reconciliation, and successful epoch fence installation on all healthy managed nodes.

Dev-only service/network scaffold (lease-bound ingress TCP proxy):

  • SVC_TCP_BIND (0x0400): params u32 duration_secs || u16 backend_port, result u128 lease_id || u64 expires_at || u16 ingress_port
    • ingress_port is the leased TCP listener port on localhost (127.0.0.1) that proxies to backend_port while the lease is active.
    • The request token must include WRITE permission.
  • SVC_TCP_QUERY (0x0403): params u128 lease_id, result u8 lease_status || u64 expires_at || u16 ingress_port
    • ingress_port is 0 if unknown/not tracked by the implementation (the socketed simulator tracks it while the gate is active).
    • lease_status values: 0=ACTIVE, 1=EXPIRED, 2=REVOKED

Tasklet execution ops

These ops submit and manage WASM tasklet invocations on fabric nodes. The tasklet executor runs a sandboxed WASM module with fuel-metered execution, capability-scoped resource access, and configurable limits.

Tasklet status codes (u8):

Code | Name
0 | OK
1 | INVALID
2 | UNAUTHORIZED
3 | FUEL_EXHAUSTED
4 | OOM
5 | EXEC_FAILED
6 | HOSTCALL_LIMIT
7 | UNSUPPORTED_PROFILE
8 | MODULE_NOT_FOUND
9 | UNKNOWN
10 | NOT_FOUND
11 | UNSUPPORTED
12 | FINISHED
13 | LISTENER_REVOKED
14 | LISTENER_FENCED
15 | MAX_SESSIONS
16 | SESSION_CLOSED

Auxiliary structures:

CapEntry (20 bytes):

Field | Type | Bytes | Description
cap_token | u128 | 16 | Capability token granting access to a lease resource
rights | u32 | 4 | Bitmask of rights granted for this token

Rights bitmask bits:

Bit | Name
0 | QUERY
1 | REVOKE
2 | FBMU_ALLOC
3 | FBMU_LEASE_IO
4 | FBBU_ALLOC
5 | FBBU_LEASE_IO
6 | SVC_LISTEN
7 | SVC_SESSION_IO

TaskletLimits (20 bytes):

Field | Type | Bytes | Description
max_fuel | u64 | 8 | WASM fuel units (instruction budget)
max_linear_memory | u32 | 4 | Max WASM linear memory in bytes
max_output_bytes | u32 | 4 | Max output bytes from tasklet_run
max_hostcalls | u32 | 4 | Total hostcall invocations allowed

  • TASKLET_SUBMIT (0x0500): Submit a WASM tasklet for execution.

    Request params:

    Field | Type | Bytes | Description
    resource_id | u128 | 16 | MEM resource (WASM linear memory arena)
    cpu_resource_id | u128 | 16 | CPU resource (compute time); zero = legacy
    module_sha256 | [u8; 32] | 32 | SHA-256 of the WASM module
    wasm | bytes | 4 + len | WASM module bytes (empty = submit-by-hash)
    caps | list(CapEntry) | 2 + N*20 | Capability entries granted to the tasklet
    input | bytes | 4 + len | Input message (typically TaskletInputV0)
    limits | TaskletLimits | 20 | Resource limits for this invocation

    Response result:

    Field | Type | Bytes | Description
    negotiated_profile_version | u16 | 2 | Profile version the runtime negotiated
    status | u8 | 1 | Tasklet status code
    tasklet_id | u64 | 8 | Node-assigned invocation ID
    output | bytes | 4 + len | Output from tasklet_run
    log_count | u16 | 2 | Optional trailer: number of fb_log records; absent in legacy responses
    logs | list(LogRecord) | log_count entries | Optional trailer entries; omitted when log_count is absent

    LogRecord entries are encoded as:

    Field | Type | Bytes | Description
    level | u8 | 1 | fb_log level supplied by the tasklet
    message | bytes | 4 + len | Raw message bytes after hostcall-boundary clamping

    Decoders MUST accept legacy TASKLET_SUBMIT responses that end immediately after output and treat them as log_count = 0. Encoders in this version SHOULD include the trailer, even when log_count = 0, so downstream cell schedulers can persist first-class run log artifacts without inferring logs from the tasklet output payload.

  • TASKLET_STATUS (0x0501): Query the execution status of a submitted tasklet.

    Request params:

    Field | Type | Bytes | Description
    tasklet_id | u64 | 8 | Tasklet invocation ID

    Response result:

    Field | Type | Bytes | Description
    status | u8 | 1 | Tasklet status code

  • TASKLET_FETCH_RESULT (0x0502): Fetch the output of a completed tasklet.

    Request params:

    Field | Type | Bytes | Description
    tasklet_id | u64 | 8 | Tasklet invocation ID

    Response result:

    Field | Type | Bytes | Description
    status | u8 | 1 | Tasklet status code
    output | bytes | 4 + len | Tasklet output data

  • TASKLET_CANCEL (0x0503): Cancel a running tasklet.

    Request params:

    Field | Type | Bytes | Description
    tasklet_id | u64 | 8 | Tasklet invocation ID

    Response result:

    Field | Type | Bytes | Description
    status | u8 | 1 | Tasklet status code
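
The backward-compatibility rule for TASKLET_SUBMIT responses (a legacy response ends right after output and is treated as log_count = 0) might be handled as in the sketch below. `SubmitResponse`, `take`, `take_bytes`, and `decode_submit_response` are hypothetical names, and logs are modelled as plain (level, message) pairs.

```rust
#[derive(Debug, PartialEq)]
pub struct SubmitResponse {
    pub profile_version: u16,
    pub status: u8,
    pub tasklet_id: u64,
    pub output: Vec<u8>,
    pub logs: Vec<(u8, Vec<u8>)>, // (level, message)
}

/// Splits `n` bytes off the front of the cursor, or returns None.
fn take<'a>(p: &mut &'a [u8], n: usize) -> Option<&'a [u8]> {
    let s: &'a [u8] = *p;
    if s.len() < n {
        return None;
    }
    let (head, tail) = s.split_at(n);
    *p = tail;
    Some(head)
}

/// Reads a `bytes` value: u32 length prefix followed by that many bytes.
fn take_bytes<'a>(p: &mut &'a [u8]) -> Option<&'a [u8]> {
    let l = take(p, 4)?;
    let len = u32::from_be_bytes([l[0], l[1], l[2], l[3]]) as usize;
    take(p, len)
}

pub fn decode_submit_response(mut p: &[u8]) -> Option<SubmitResponse> {
    let h = take(&mut p, 11)?; // u16 version + u8 status + u64 tasklet_id
    let mut id = [0u8; 8];
    id.copy_from_slice(&h[3..11]);
    let output = take_bytes(&mut p)?.to_vec();
    let mut logs = Vec::new();
    // Legacy responses end here; that is equivalent to log_count = 0.
    if !p.is_empty() {
        let c = take(&mut p, 2)?;
        for _ in 0..u16::from_be_bytes([c[0], c[1]]) {
            let level = take(&mut p, 1)?[0];
            let message = take_bytes(&mut p)?.to_vec();
            logs.push((level, message));
        }
    }
    Some(SubmitResponse {
        profile_version: u16::from_be_bytes([h[0], h[1]]),
        status: h[2],
        tasklet_id: u64::from_be_bytes(id),
        output: output,
        logs: logs,
    })
}
```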

GPU operations

These ops submit compiled GPU kernels for execution on leased GPU devices.

GPU status codes (u8):

Code | Name
0 | OK
1 | INVALID
2 | UNAUTHORIZED
3 | LOAD_FAILED
4 | LAUNCH_FAILED
5 | UNSUPPORTED
6 | READ_FAILED

  • GPU_SUBMIT (0x0600): Load and launch a compiled GPU kernel (HSACO/PTX) on a leased GPU device.

    Request params:

    Field | Type | Bytes | Description
    resource_id | u128 | 16 | GPU resource from GET_INVENTORY
    binary | bytes | 4 + len | Compiled GPU binary (HSACO for ROCm, PTX/cubin for CUDA)
    kernel_name | u16 len + UTF-8 | 2 + len | Kernel function name within the module
    grid_dim | [u32; 3] | 12 | Grid dimensions (x, y, z)
    block_dim | [u32; 3] | 12 | Block dimensions (x, y, z)
    args | bytes | 4 + len | Serialized kernel arguments (packed contiguously)
    arg_sizes | list(u32) | 2 + N*4 | Per-argument byte sizes (sum must equal args length)
    output_offset | u64 | 8 | Offset in lease GPU memory to read after kernel
    output_size | u32 | 4 | Bytes to copy device-to-host (0 = no output)

    Response result:

    Field | Type | Bytes | Description
    status | u8 | 1 | GPU status code
    submit_id | u64 | 8 | Node-assigned monotonic submit ID
    exit_code | i32 | 4 | Kernel exit/error code (0 = success); wire-encoded as u32
    output | bytes | 4 + len | Output data (device-to-host results)
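
The constraint that the arg_sizes entries must sum to the length of args lends itself to a small check. In this sketch `split_args` is a hypothetical helper that validates the sum and then slices the packed buffer into per-argument views.

```rust
/// Splits the packed `args` buffer into one slice per argument.
/// Returns None when the sizes do not sum to exactly `args.len()`.
pub fn split_args<'a>(args: &'a [u8], arg_sizes: &[u32]) -> Option<Vec<&'a [u8]>> {
    // Sum in u64 so many large u32 sizes cannot overflow.
    let total: u64 = arg_sizes.iter().map(|&s| s as u64).sum();
    if total != args.len() as u64 {
        return None;
    }
    let mut out = Vec::with_capacity(arg_sizes.len());
    let mut rest = args;
    for &s in arg_sizes {
        // Safe: the sum check above guarantees each split is in bounds.
        let (head, tail) = rest.split_at(s as usize);
        out.push(head);
        rest = tail;
    }
    Some(out)
}
```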

Composite tasklet lease ops

These ops allocate and free composite leases that bundle CPU + MEM resources with affinity constraints, providing a single lease covering both compute and memory for tasklet execution.

Affinity values (u8):

Code | Name | Description
0 | SAME_NODE | CPU + MEM on the same physical node (default)
1 | SAME_RACK | CPU on one node, MEM on a nearby node in the same rack (future)
2 | ANY | Scheduler picks the cheapest combination (future)

  • TASKLET_LEASE_ALLOC (0x0800): Allocate a composite CPU + MEM lease.

    Request params (20 bytes, fixed):

    Field | Type | Bytes | Description
    duration_secs | u32 | 4 | Requested lease duration in seconds
    grace_secs | u32 | 4 | Grace period before teardown after expiry
    num_cores | u16 | 2 | Number of CPU cores requested
    mem_bytes | u64 | 8 | Memory size in bytes
    affinity | u8 | 1 | TaskletAffinity value
    max_threads | u8 | 1 | Advisory only in Tasklet Profile v0; current runtimes keep execution width fixed at 1

    Response result (66 bytes, fixed):

    Field | Type | Bytes | Description
    lease_id | u128 | 16 | Composite lease identifier
    expires_at | u64 | 8 | Lease expiry timestamp (seconds since epoch)
    mem_resource_id | u128 | 16 | Allocated MEM resource ID
    cpu_resource_id | u128 | 16 | Allocated CPU resource ID
    arena_size | u64 | 8 | Actual granted memory arena size in bytes
    num_cores | u16 | 2 | Actual granted core count

  • TASKLET_LEASE_FREE (0x0801): Free a composite tasklet lease.

    Request params:

    Field | Type | Bytes | Description
    lease_id | u128 | 16 | Composite lease to free

    Response: standard control-plane RESPONSE with empty result on success.

WASM host ABI (implementation note, v0 in-place update)

The fabricbios_fbmu_v0 and fabricbios_fbbu_v0 WASM import modules keep the legacy hello/read/write/get-size calls and now also expose lease-aware calls in-place under the same v0 module names.

fabricbios_fbmu_v0 additions:

  • fbmu_alloc(min_bytes, lease_secs) -> lease_id + expires_at + arena_size
  • fbmu_query(lease_id) -> lease_status + expires_at + arena_size
  • fbmu_renew(lease_id, duration_secs) -> expires_at
  • fbmu_free(lease_id)
  • fbmu_write_lease(lease_id, offset, ptr, len)
  • fbmu_read_lease(lease_id, offset, len, out_ptr)

fabricbios_fbbu_v0 additions:

  • fbbu_alloc(min_blocks, lease_secs) -> lease_id + expires_at + num_blocks + block_size
  • fbbu_query(lease_id) -> lease_status + expires_at + num_blocks + block_size
  • fbbu_renew(lease_id, duration_secs) -> expires_at
  • fbbu_free(lease_id)
  • fbbu_write_block_lease(lease_id, lba, ptr)
  • fbbu_read_block_lease(lease_id, lba, out_ptr)

Lease status values used by query calls:

  • 0 = ACTIVE
  • 1 = EXPIRED
  • 2 = REVOKED

bytes

Encoded as:

len : u32
data: [u8; len]

Lists

Encoded as:

count : u16
items : repeated `count` times

TLV

Encoded as:

type : u8
length : u16
data : [u8; length]

Optional TLV fields

Encoded as:

present : u8 (0/1)
tlv : TLV (only if present=1)

RDMA dataplane binding TLVs

RDMA (RoCE v2) binding credentials are returned to the client after lease creation. The client uses these fields to set up its own QP and perform RDMA READ/WRITE operations against the server’s registered memory region.

Tag range: 0x02xx.

Tag | Name | Length | Description
0x0201 | RDMA_RKEY | 4 | Remote key for RDMA access (big-endian u32)
0x0202 | RDMA_REMOTE_ADDR | 8 | Virtual address of the registered memory region (big-endian u64)
0x0203 | RDMA_QP_NUM | 4 | Queue Pair number on the server (big-endian u32)
0x0204 | RDMA_GID | 16 | GID (Global Identifier) of the RDMA port (128-bit, IPv6-format)
0x0205 | RDMA_PORT | 2 | RDMA port number on the server HCA (big-endian u16)

Encoded using the standard TLV framing (u16 tag + u16 length + data). Unknown TLV tags in the 0x02xx range MUST be skipped (not treated as fatal).
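
A walk over these binding TLVs, using the u16 tag + u16 length framing stated here and skipping tags it does not recognise, could be sketched as follows. `RdmaBinding` and `parse_rdma_tlvs` are hypothetical names, and rejecting a known tag that arrives with an unexpected length is an assumption.

```rust
#[derive(Debug, Default, PartialEq)]
pub struct RdmaBinding {
    pub rkey: Option<u32>,
    pub remote_addr: Option<u64>,
    pub qp_num: Option<u32>,
    pub gid: Option<[u8; 16]>,
    pub port: Option<u16>,
}

pub fn parse_rdma_tlvs(mut p: &[u8]) -> Option<RdmaBinding> {
    let mut b = RdmaBinding::default();
    while !p.is_empty() {
        if p.len() < 4 {
            return None; // truncated TLV header
        }
        let tag = u16::from_be_bytes([p[0], p[1]]);
        let len = u16::from_be_bytes([p[2], p[3]]) as usize;
        if p.len() < 4 + len {
            return None; // truncated TLV value
        }
        let v = &p[4..4 + len];
        match (tag, len) {
            (0x0201, 4) => b.rkey = Some(u32::from_be_bytes([v[0], v[1], v[2], v[3]])),
            (0x0202, 8) => {
                let mut a = [0u8; 8];
                a.copy_from_slice(v);
                b.remote_addr = Some(u64::from_be_bytes(a));
            }
            (0x0203, 4) => b.qp_num = Some(u32::from_be_bytes([v[0], v[1], v[2], v[3]])),
            (0x0204, 16) => {
                let mut g = [0u8; 16];
                g.copy_from_slice(v);
                b.gid = Some(g);
            }
            (0x0205, 2) => b.port = Some(u16::from_be_bytes([v[0], v[1]])),
            // Assumption: a known tag with the wrong length is malformed.
            (0x0201..=0x0205, _) => return None,
            // Unknown tags are skipped, not treated as fatal.
            _ => {}
        }
        p = &p[4 + len..];
    }
    Some(b)
}
```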

NVMe-oF dataplane binding TLVs

NVMe-oF binding credentials are returned to the client after lease creation. The client uses these fields to connect to the NVMe-oF target subsystem via nvme connect.

Tag range: 0x03xx.

Tag | Name | Length | Description
0x0301 | NVMEOF_NQN | variable | NVMe Qualified Name of the target subsystem (UTF-8)
0x0302 | NVMEOF_TRADDR | variable | Transport address (IP address, UTF-8)
0x0303 | NVMEOF_TRSVCID | 2 | Transport service ID (TCP port, big-endian u16)
0x0304 | NVMEOF_TRTYPE | 1 | Transport type: 0=tcp, 1=rdma, 2=loop

NQN format: nqn.2026-02.io.fabricbios:lease:<lease_id_hex_32>.

Encoded using the standard TLV framing (u16 tag + u16 length + data). Unknown TLV tags in the 0x03xx range MUST be skipped (not treated as fatal).

The initial transport is TCP-only (trtype=0); RDMA transport support (trtype=1) is available but optional and requires rdma-core on both sides.

SR-IOV dataplane binding TLVs

SR-IOV binding credentials identify a Virtual Function and its parent Physical Function. Used when a lease allocates an SR-IOV VF for direct device assignment.

Tag range: 0x05xx.

Tag | Name | Length | Description
0x0501 | SRIOV_VF_PCI_ADDR | 4 | PCI BDF address of the VF ([domain_hi, domain_lo, bus, devfn] where devfn = (device << 3) | function)
0x0502 | SRIOV_VF_INDEX | 2 | VF index within the PF (big-endian u16)
0x0503 | SRIOV_PF_PCI_ADDR | 4 | PCI BDF address of the parent PF (same encoding as SRIOV_VF_PCI_ADDR)

Encoded using the standard TLV framing (u16 tag + u16 length + data). Unknown TLV tags in the 0x05xx range MUST be skipped (not treated as fatal).
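
The 4-byte PCI address value ([domain_hi, domain_lo, bus, devfn]) unpacks with a shift and a mask. The helper names below are hypothetical; the printed form follows the usual domain:bus:device.function notation.

```rust
/// Unpacks a PCI address value into (domain, bus, device, function).
pub fn decode_pci_addr(v: &[u8; 4]) -> (u16, u8, u8, u8) {
    let domain = u16::from_be_bytes([v[0], v[1]]);
    let bus = v[2];
    // devfn = (device << 3) | function
    let device = v[3] >> 3;
    let function = v[3] & 0x07;
    (domain, bus, device, function)
}

/// Formats the address as dddd:bb:dd.f in hexadecimal.
pub fn format_bdf(v: &[u8; 4]) -> String {
    let (domain, bus, device, function) = decode_pci_addr(v);
    format!("{:04x}:{:02x}:{:02x}.{:x}", domain, bus, device, function)
}
```

For example, the value bytes `00 00 3b 11` denote device 2, function 1 on bus 0x3b.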

OFI (libfabric) dataplane binding TLVs

OFI binding credentials are returned to the client after lease creation when the server supports OpenFabrics Interfaces (libfabric). The binding specifies the access model (one-sided RMA or two-sided messaging) and endpoint address.

Tag range: 0x06xx.

Tag | Name | Length | Description
0x0601 | OFI_PROVIDER | variable | Provider name (UTF-8, e.g. “efa”, “tcp”, “verbs”)
0x0602 | OFI_FABRIC_NAME | variable | Fabric name from fi_fabric_attr (UTF-8)
0x0603 | OFI_EP_TYPE | 1 | Endpoint type: 1=FI_EP_MSG, 2=FI_EP_DGRAM, 3=FI_EP_RDM
0x0604 | OFI_ADDR | variable | Endpoint address (opaque, from fi_getname)
0x0605 | OFI_ACCESS_MODEL | 1 | Access model: 0x01=RMA (one-sided), 0x02=MSG (two-sided)
0x0606 | OFI_CAPABILITY_FLAGS | 8 | Capability flags from fi_info->caps (big-endian u64)
0x0607 | OFI_MAX_MSG_SIZE | 8 | Maximum message size (big-endian u64)
0x0608 | OFI_PROTOCOL_VERSION | 1 | Two-sided message framing version (currently 1)
0x0609 | OFI_LEASE_ID | 8 | Lease ID this binding is associated with (big-endian u64)

Encoded using the standard TLV framing (u16 tag + u16 length + data). Unknown TLV tags in the 0x06xx range MUST be skipped (not treated as fatal).

Two-sided messaging protocol (access_model=0x02)

When OFI_ACCESS_MODEL is 0x02 (MSG), data-plane operations use framed send/recv with server-side lease validation on every request.

Request header (32 bytes, big-endian):

Offset | Size | Field
0 | 8 | lease_id
8 | 8 | generation (high byte = opcode: 0x01=WRITE, 0x02=READ)
16 | 8 | offset
24 | 8 | len

For WRITE requests, len bytes of payload follow the header.

Response (1 + N bytes):

Offset | Size | Field
0 | 1 | status: 0x00=OK, 0x10=LEASE_REVOKED, 0x11=LEASE_EXPIRED, 0x12=STALE_GENERATION
1 | N | Payload (READ response data; empty for WRITE)

The server MUST validate (lease_id, generation) against its lease table before servicing any request. Rejection codes:

  • LEASE_REVOKED (0x10): lease has been revoked.
  • LEASE_EXPIRED (0x11): lease has expired.
  • STALE_GENERATION (0x12): generation counter does not match (indicates a stale client).
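
Building the 32-byte request header might look like the sketch below. The function name is hypothetical, and the packing of the second field is an interpretation: the opcode is placed in the most significant byte and the generation counter is assumed to fit in the remaining 56 bits.

```rust
pub const OP_WRITE: u8 = 0x01;
pub const OP_READ: u8 = 0x02;

/// Encodes lease_id || (opcode << 56 | generation) || offset || len, big-endian.
/// Returns None if `generation` does not fit in 56 bits (assumed layout).
pub fn encode_request_header(
    lease_id: u64,
    opcode: u8,
    generation: u64,
    offset: u64,
    len: u64,
) -> Option<[u8; 32]> {
    if (generation >> 56) != 0 {
        return None;
    }
    let word = ((opcode as u64) << 56) | generation;
    let mut h = [0u8; 32];
    h[0..8].copy_from_slice(&lease_id.to_be_bytes());
    h[8..16].copy_from_slice(&word.to_be_bytes());
    h[16..24].copy_from_slice(&offset.to_be_bytes());
    h[24..32].copy_from_slice(&len.to_be_bytes());
    Some(h)
}
```

For a WRITE, the caller would append `len` bytes of payload immediately after these 32 bytes.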

Request param TLV tags

Tag range: 0x04xx.

Tag | Name | Size | Description
0x0401 | REQUESTED_BYTES | 8 | Desired allocation size in bytes (big-endian u64)

Used in LEASE_ALLOC request params as an optional extension. Currently encoded as a flat params extension at byte offset 10 rather than as a TLV wrapper. The tag constant is reserved for future TLV-based param encoding.

0x08xx — Inventory Tier TLVs

These TLVs extend the resource inventory with tier-level memory and GPU attachment metadata. They use the standard TLV framing (u16 tag + u16 length + value). Unknown tags in the 0x08xx range MUST be skipped (not treated as fatal).

TLV_MEM_TIER_INFO (0x0801)

Describes a single memory tier exported by a node.

TLV header:

tag : u16 = 0x0801
length : u16 (value length in bytes)

Value payload (all big-endian):

Offset | Field | Type | Description
0 | tier_kind | u16 | Memory tier classification (see below)
2 | capacity_bytes | u64 | Total capacity in bytes
10 | latency_class_ns | u32 | Approximate local access latency in nanoseconds
14 | bandwidth_hint_bps | u64 | Approximate bandwidth hint in bits per second
22 | page_granule_bytes | u32 | Minimum allocation granularity in bytes (e.g. 4096)
26 | share_mode | u16 | Sharing mode (see below)
28 | attach_domains_count | u16 | Number of attach domain entries
30 | attach_domains | [u128; N] | NUMA/CXL/PCIe domain identifiers (N = attach_domains_count)
30+N*16 | flags | u32 | Tier-specific flags (see below)

Total value length: 34 + attach_domains_count * 16 bytes.

tier_kind values:

Value | Name | Description
0 | Dram | Standard DDR DRAM
1 | Hbm | High Bandwidth Memory (HBM2/HBM3)
2 | CxlExpander | CXL Type 3 memory expander
3 | Pmem | Persistent memory (Intel Optane, CXL PM)
4 | VramVisibleHost | GPU VRAM visible to host CPU via BAR
5 | Block | Block storage (cold cache tier only, not CPU-addressable)

Unknown tier_kind values MUST be rejected (InvalidValue).

share_mode values:

Value | Name
0 | EXCLUSIVE
1 | SHARED_RO
2 | SHARED_RW

flags bits:

Bit | Name | Description
0 | MEM_TIER_ECC | ECC protection enabled
1 | MEM_TIER_HOTPLUG | Hotpluggable memory
2 | MEM_TIER_INTERLEAVED | Interleaved across channels
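
Because attach_domains is variable-length and flags follows it, a decoder for the TLV_MEM_TIER_INFO value has to derive the flags offset from attach_domains_count. The sketch below checks the 34 + N*16 length rule and rejects unknown tier_kind values; `MemTierInfo` and `decode_mem_tier_info` are hypothetical names, and the latency, bandwidth, and granule fields are omitted for brevity.

```rust
#[derive(Debug, PartialEq)]
pub struct MemTierInfo {
    pub tier_kind: u16,
    pub capacity_bytes: u64,
    pub share_mode: u16,
    pub attach_domains: Vec<u128>,
    pub flags: u32,
}

fn be_u64(b: &[u8]) -> u64 {
    let mut a = [0u8; 8];
    a.copy_from_slice(b); // caller guarantees b.len() == 8
    u64::from_be_bytes(a)
}

fn be_u128(b: &[u8]) -> u128 {
    let mut a = [0u8; 16];
    a.copy_from_slice(b); // caller guarantees b.len() == 16
    u128::from_be_bytes(a)
}

pub fn decode_mem_tier_info(v: &[u8]) -> Option<MemTierInfo> {
    if v.len() < 34 {
        return None;
    }
    let tier_kind = u16::from_be_bytes([v[0], v[1]]);
    if tier_kind > 5 {
        return None; // unknown tier_kind values are rejected
    }
    let n = u16::from_be_bytes([v[28], v[29]]) as usize;
    // Total value length must be exactly 34 + N * 16.
    if v.len() != 34 + n * 16 {
        return None;
    }
    let f = 30 + n * 16; // flags start right after the domain entries
    let attach_domains = v[30..f].chunks(16).map(be_u128).collect();
    Some(MemTierInfo {
        tier_kind: tier_kind,
        capacity_bytes: be_u64(&v[2..10]),
        share_mode: u16::from_be_bytes([v[26], v[27]]),
        attach_domains: attach_domains,
        flags: u32::from_be_bytes([v[f], v[f + 1], v[f + 2], v[f + 3]]),
    })
}
```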

TLV_GPU_ATTACH_INFO (0x0802)

Describes how a GPU resource attaches to the memory fabric.

TLV header:

tag : u16 = 0x0802
length : u16 (value length in bytes)

Value payload (all big-endian):

Offset | Field | Type | Description
0 | gpu_resource_id | u128 | Resource ID of the GPU
16 | peer_group_id | u128 | Peer group for P2P transfers (e.g. NVLink domain)
32 | vram_bytes | u64 | VRAM capacity in bytes
40 | supports_host_map | u8 | 1 = GPU VRAM is mappable by host CPU, 0 = not
41 | supports_remote_map | u8 | 1 = GPU VRAM is remotely accessible via fabric, 0 = not
42 | preferred_mem_tiers_count | u16 | Number of preferred tier entries
44 | preferred_mem_tiers | [u16; N] | MemTierKind values ordered by preference (N = count)

Total value length: 44 + preferred_mem_tiers_count * 2 bytes.

0x09xx — Lease Intent TLVs

Lease intent TLVs are advisory placement hints carried in LEASE_ALLOC request params. The allocator uses them for placement scoring but the resulting lease is still a standard memory/block lease. Unknown or unsupported intent TLVs are silently ignored (forward-compatible).

TLV_LEASE_INTENT_KV_CACHE (0x0901)

Advisory placement hint indicating the lease will be used as a KV cache for LLM inference.

TLV header:

tag : u16 = 0x0901
length : u16 (value length: 41 or 57)

Value payload (all big-endian):

Offset | Field | Type | Description
0 | model_fingerprint | [u8; 32] | SHA-256 fingerprint of the model
32 | preferred_tier | u16 | Preferred memory tier (see below)
34 | page_bytes | u32 | Page size the runtime will use for cache pages
38 | sharing_mode | u16 | Sharing mode (see below)
40 | has_gpu | u8 | 0x00 = no GPU, 0x01 = GPU resource ID follows
41 | gpu_resource_id | u128 | (only present when has_gpu = 0x01) GPU to co-locate with

Total value length: 41 bytes (no GPU) or 57 bytes (with GPU).

preferred_tier values:

Value | Name
0x0001 | HBM — High-bandwidth memory
0x0002 | DDR — Standard DDR4/DDR5
0x0003 | CXL — CXL-attached memory
0x0004 | PMEM — Persistent memory

sharing_mode values:

Value | Name | Description
0x0000 | EXCLUSIVE | Single consumer, no sharing
0x0001 | READ_SHARED | Multiple readers, single writer
0x0002 | COPY_ON_WRITE | Readers see snapshot; writer gets private copy

Invalid has_gpu values (not 0x00 or 0x01) MUST be rejected.

TLV_LEASE_CPU_ISOLATION (0x0902)

Per-lease CPU isolation policy. See docs/spec/cpu-isolation-wire-format.md for the design rationale.

TLV header:

tag : u16 = 0x0902
length : u16 = 0x0001

Value payload:

Offset | Field | Type | Description
0 | class | u8 | CPU isolation class

class values:

Value | Name | Description
0x00 | BestEffort | No pinning beyond cgroup quota (equivalent to absent TLV)
0x01 | WholeCore | Lease owns a full core; no SMT sibling sharing
0x02 | StrictIsolated | WholeCore + topology/NUMA constraints
0x03–0xFE | reserved | MUST be rejected
0xFF | reserved sentinel | Never valid on the wire

When the TLV is absent, the lease inherits the daemon-wide default set by --cpu-isolation-policy. Unknown class bytes and incorrect TLV lengths (anything other than 1) MUST be rejected with InvalidIntent.

TLV_LEASE_GPU_EXCLUSIVITY (0x0903)

Per-lease GPU exclusivity class. See docs/spec/gpu-exclusivity-wire-format.md for the design rationale.

TLV header:

tag : u16 = 0x0903
length : u16 = 0x0001

Value payload:

Offset | Field | Type | Description
0 | class | u8 | GPU exclusivity class

class values:

Value | Name | Description
0x00 | Shared | Device may multiplex other tenants
0x01 | SessionExclusive | Exclusive residency for session lifetime
0x02 | DeviceExclusive | Whole device for lease lifetime
0x03 | PartitionExclusive | Reserved for future MIG/partition isolation
0x04–0xFE | reserved | MUST be rejected
0xFF | reserved sentinel | Never valid on the wire

When the TLV is absent, the lease inherits the daemon-wide default set by --gpu-share-mode. The daemon mode acts as a permission envelope (not a ceiling): clients cannot escape a tighter daemon mode by requesting a looser class. Unknown class bytes and incorrect TLV lengths MUST be rejected with InvalidIntent.

TLV_LEASE_AFFINITY (0x0910)

Per-lease affinity constraint. Multiple entries may appear in the same params blob — one per constraint. See docs/spec/affinity-request-model.md for the design rationale and docs/grafos/affinity-taxonomy.md for the taxonomy.

TLV header:

tag : u16 = 0x0910
length : u16 (variable)

Value payload:

| Offset | Field | Type | Description |
| --- | --- | --- | --- |
| 0 | category | u8 | Affinity category |
| 1 | strength | u8 | Strength (bits 0:6) + anti-affinity flag (bit 7) |
| 2 | target_type | u8 | Target type |
| 3 | target_len | u16 BE | Length of target bytes |
| 5 | target | [u8; target_len] | Target value |

category values:

| Value | Name |
| --- | --- |
| 0x01 | Resource — co-locate with a specific resource |
| 0x02 | State — co-locate with a data shard or lease |
| 0x03 | Topology — topology / failure-domain placement |
| 0x04 | Trust — trust domain / attestation requirement |
| 0x05 | Facility — reserved (deferred) |

strength byte layout: bits 0:6 = strength value, bit 7 = anti-affinity flag.

| Value (masked) | Name |
| --- | --- |
| 0x01 | Required — filter stage, fail-closed |
| 0x02 | Preferred — score stage, soft ranking |
| 0x03 | Adaptive — reserved (deferred) |

target_type values:

| Value | Name | target_len |
| --- | --- | --- |
| 0x01 | NodeId | 16 (u128 BE) |
| 0x02 | ResourceId | 16 (u128 BE) |
| 0x03 | LeaseId | 16 (u128 BE) |
| 0x04 | ServiceId | 16 (u128 BE) |
| 0x05 | TrustDomain | variable (UTF-8) |
| 0x06 | RackId | 4 (u32 BE) |

Unknown category, strength, or target_type values MUST be rejected with InvalidIntent. Required affinity that cannot be satisfied results in an empty placement (no candidates pass the filter).
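
As a rough sketch of how a decoder might apply the layout and rejection rules above, the following Rust function parses one TLV_LEASE_AFFINITY value. The names `Affinity` and `parse_affinity` are hypothetical, and two choices are assumptions rather than spec text: the target must fill the remainder of the value exactly, and the reserved (deferred) values 0x05 Facility and 0x03 Adaptive are accepted at parse time because they appear in the tables. Any `Err` here corresponds to InvalidIntent.

```rust
/// Decoded TLV_LEASE_AFFINITY value (illustrative type, not the reference one).
#[derive(Debug, PartialEq, Eq)]
pub struct Affinity<'a> {
    pub category: u8,
    pub strength: u8,        // masked to bits 0:6
    pub anti_affinity: bool, // bit 7 of the strength byte
    pub target_type: u8,
    pub target: &'a [u8],
}

/// Parse the value bytes of TLV 0x0910 (header already consumed).
pub fn parse_affinity(value: &[u8]) -> Result<Affinity<'_>, &'static str> {
    if value.len() < 5 {
        return Err("value shorter than the 5-byte fixed part");
    }
    let category = value[0];
    let strength = value[1] & 0x7F;
    let anti_affinity = (value[1] & 0x80) != 0;
    let target_type = value[2];
    let target_len = u16::from_be_bytes([value[3], value[4]]) as usize;
    let target = &value[5..];

    // Assumption: the target must fill the rest of the value exactly.
    if target.len() != target_len {
        return Err("target_len does not match value length");
    }
    if !(0x01..=0x05).contains(&category) {
        return Err("unknown category");
    }
    if !(0x01..=0x03).contains(&strength) {
        return Err("unknown strength");
    }
    match target_type {
        // NodeId, ResourceId, LeaseId, ServiceId: u128 BE.
        0x01..=0x04 if target_len == 16 => {}
        // TrustDomain: variable-length UTF-8.
        0x05 if std::str::from_utf8(target).is_ok() => {}
        // RackId: u32 BE.
        0x06 if target_len == 4 => {}
        0x01..=0x06 => return Err("target does not fit target_type"),
        _ => return Err("unknown target_type"),
    }
    Ok(Affinity { category, strength, anti_affinity, target_type, target })
}
```

For example, the bytes `03 81 06 00 04 00 00 00 07` decode to a Topology constraint with Required strength, the anti-affinity flag set, and RackId 7 as the target.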

TLV_NODE_AFFINITY_META (0x0804)

Node-level affinity metadata: topology locality and trust domain.

TLV header:

tag : u16 = 0x0804
length : u16 (variable)

Value payload (all big-endian):

| Offset | Field | Type | Description |
| --- | --- | --- | --- |
| 0 | rack_id | u32 | Rack identifier |
| 4 | row_id | u32 | Row/aisle identifier |
| 8 | site_id | u32 | Site/datacenter identifier |
| 12 | geo_hash | u64 | Geographic hash |
| 20 | trust_domain_len | u16 | Length of trust domain name |
| 22 | trust_domain | UTF-8 | Trust domain name (empty = not advertised) |
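
The offsets above translate fairly directly into a decoder. The sketch below is illustrative only: `NodeAffinityMeta`, `array_at`, and `parse_node_affinity_meta` are hypothetical names, and ignoring any bytes after the trust domain is an assumption, since the text does not say how trailing bytes are handled.

```rust
/// Decoded TLV_NODE_AFFINITY_META value (illustrative type).
#[derive(Debug, PartialEq, Eq)]
pub struct NodeAffinityMeta {
    pub rack_id: u32,
    pub row_id: u32,
    pub site_id: u32,
    pub geo_hash: u64,
    /// `None` when the name is empty (not advertised).
    pub trust_domain: Option<String>,
}

/// Copy `N` bytes starting at `at`; the caller has already bounds-checked.
fn array_at<const N: usize>(buf: &[u8], at: usize) -> [u8; N] {
    let mut out = [0u8; N];
    out.copy_from_slice(&buf[at..at + N]);
    out
}

/// Parse the value bytes of TLV 0x0804 (header already consumed).
pub fn parse_node_affinity_meta(value: &[u8]) -> Result<NodeAffinityMeta, &'static str> {
    const FIXED_LEN: usize = 22; // everything up to and including trust_domain_len
    if value.len() < FIXED_LEN {
        return Err("truncated fixed fields");
    }
    let td_len = u16::from_be_bytes(array_at(value, 20)) as usize;
    // Assumption: bytes after the trust domain, if any, are ignored.
    let td = value
        .get(FIXED_LEN..FIXED_LEN + td_len)
        .ok_or("truncated trust_domain")?;
    let td = std::str::from_utf8(td).map_err(|_| "trust_domain is not UTF-8")?;
    Ok(NodeAffinityMeta {
        rack_id: u32::from_be_bytes(array_at(value, 0)),
        row_id: u32::from_be_bytes(array_at(value, 4)),
        site_id: u32::from_be_bytes(array_at(value, 8)),
        geo_hash: u64::from_be_bytes(array_at(value, 12)),
        trust_domain: if td.is_empty() { None } else { Some(td.to_owned()) },
    })
}
```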

TLV_SUPPORTED_ISOLATION (0x0805)

Advertises which CPU isolation and GPU exclusivity classes this node honors per-lease.

TLV header:

tag : u16 = 0x0805
length : u16 (variable)

Value payload:

| Offset | Field | Type | Description |
| --- | --- | --- | --- |
| 0 | cpu_count | u8 | Number of supported CPU isolation classes |
| 1 | cpu_classes | [u8; cpu_count] | CpuIsolationClass values (0x00–0x02) |
| 1+cpu_count | gpu_count | u8 | Number of supported GPU exclusivity classes |
| 2+cpu_count | gpu_classes | [u8; gpu_count] | GpuExclusivityClass values (0x00–0x03) |
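
Because the GPU list starts at an offset that depends on `cpu_count`, a decoder has to walk the value sequentially. A minimal sketch follows, with the hypothetical function name `parse_supported_isolation`. It only splits the two lists and leaves range-checking of the class bytes to the caller, since the text does not state how a receiver should treat an advertised class it does not recognize.

```rust
/// Split a TLV_SUPPORTED_ISOLATION value into (cpu_classes, gpu_classes).
/// The TLV header is assumed to have been consumed already.
pub fn parse_supported_isolation(value: &[u8]) -> Result<(Vec<u8>, Vec<u8>), &'static str> {
    // Offset 0: cpu_count, followed by that many class bytes.
    let (&cpu_count, rest) = value.split_first().ok_or("missing cpu_count")?;
    let cpu_count = usize::from(cpu_count);
    if rest.len() < cpu_count {
        return Err("truncated cpu_classes");
    }
    let (cpu_classes, rest) = rest.split_at(cpu_count);

    // Offset 1 + cpu_count: gpu_count, followed by that many class bytes.
    let (&gpu_count, rest) = rest.split_first().ok_or("missing gpu_count")?;
    let gpu_classes = rest
        .get(..usize::from(gpu_count))
        .ok_or("truncated gpu_classes")?;

    Ok((cpu_classes.to_vec(), gpu_classes.to_vec()))
}
```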

Page Operations (FBMKV_PAGE_V0)

Profile identifier: PROFILE_FBMKV_PAGE_V0 = 0x0A01.

This profile provides a page-based abstraction over leased fabric memory. Each lease is divided into fixed-size pages with tracked state (dirty, access count, tier residency, advisory hints). Six operations form the profile.

Operation codes

| Code | Name | Description |
| --- | --- | --- |
| 0x10 | PAGE_READ | Read bytes from a page |
| 0x11 | PAGE_WRITE | Write bytes to a page |
| 0x12 | PAGE_COPY | Copy a page between leases (or within one) |
| 0x13 | PAGE_ZERO | Zero a page |
| 0x14 | PAGE_ADVISE | Set advisory hints on a page |
| 0x15 | PAGE_QUERY | Query current state of a page |

PAGE_READ (0x10) request

lease_id : u128
page_idx : u32
offset : u32
length : u32

PAGE_READ response:

status : u8
data : [u8; length] (only if status == OK)

PAGE_WRITE (0x11) request

lease_id : u128
page_idx : u32
offset : u32
length : u32
data : [u8; length]

PAGE_WRITE response:

status : u8

PAGE_COPY (0x12) request

src_lease_id : u128
src_page_idx : u32
dst_lease_id : u128
dst_page_idx : u32

Cross-tenant copies are denied. Both leases must be active.

PAGE_COPY response:

status : u8

PAGE_ZERO (0x13) request

lease_id : u128
page_idx : u32

PAGE_ZERO response:

status : u8

PAGE_ADVISE (0x14) request

lease_id : u128
page_idx : u32
advice_flags : u32

PAGE_ADVISE response:

status : u8

PAGE_QUERY (0x15) request

lease_id : u128
page_idx : u32

PAGE_QUERY response:

status : u8
page_idx : u32
resident_tier : u8
dirty : u8 (0 = clean, 1 = dirty)
access_count : u32
last_access_ms : u64
advice_flags : u32
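
To make the PAGE_QUERY response layout concrete, here is a hedged decoding sketch. It assumes the fields are packed in the listed order with no padding (23 bytes in total), that a non-OK status carries no usable fields after it, and that a `dirty` byte other than 0 or 1 is an error; none of these three points is stated explicitly above. `PageState`, `QueryError`, and `parse_page_query_response` are hypothetical names.

```rust
/// Page state carried in a successful PAGE_QUERY response (illustrative type).
#[derive(Debug, PartialEq, Eq)]
pub struct PageState {
    pub page_idx: u32,
    pub resident_tier: u8,
    pub dirty: bool,
    pub access_count: u32,
    pub last_access_ms: u64,
    pub advice_flags: u32,
}

#[derive(Debug, PartialEq, Eq)]
pub enum QueryError {
    Truncated,
    /// Non-OK page status code returned by the peer.
    Status(u8),
    BadDirty(u8),
}

/// Copy `N` bytes starting at `at`; the caller has already bounds-checked.
fn array_at<const N: usize>(buf: &[u8], at: usize) -> [u8; N] {
    let mut out = [0u8; N];
    out.copy_from_slice(&buf[at..at + N]);
    out
}

pub fn parse_page_query_response(buf: &[u8]) -> Result<PageState, QueryError> {
    const RESPONSE_LEN: usize = 1 + 4 + 1 + 1 + 4 + 8 + 4; // 23 bytes
    let status = *buf.first().ok_or(QueryError::Truncated)?;
    if status != 0x00 {
        return Err(QueryError::Status(status));
    }
    if buf.len() < RESPONSE_LEN {
        return Err(QueryError::Truncated);
    }
    let dirty = match buf[6] {
        0 => false,
        1 => true,
        other => return Err(QueryError::BadDirty(other)),
    };
    Ok(PageState {
        page_idx: u32::from_be_bytes(array_at(buf, 1)),
        resident_tier: buf[5],
        dirty,
        access_count: u32::from_be_bytes(array_at(buf, 7)),
        last_access_ms: u64::from_be_bytes(array_at(buf, 11)),
        advice_flags: u32::from_be_bytes(array_at(buf, 19)),
    })
}
```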

MemTierKind (page tier classification)

| Value | Name |
| --- | --- |
| 0 | Dram — Fast local DRAM |
| 1 | Hbm — High-bandwidth memory |
| 2 | Pmem — Persistent memory |
| 3 | Remote — Remote fabric memory (RDMA/NVMe-oF/CXL) |

PageAdvice flags

| Bit | Value | Name | Description |
| --- | --- | --- | --- |
| 0 | 0x01 | WILL_NEED | Page will be accessed soon |
| 1 | 0x02 | DONT_NEED | Page is no longer needed |
| 2 | 0x04 | SEQUENTIAL | Sequential access pattern expected |
| 3 | 0x08 | RANDOM | Random access pattern expected |
| 4 | 0x10 | EVICT_PRIORITY_LOW | Low eviction priority |
| 5 | 0x20 | EVICT_PRIORITY_HIGH | High eviction priority |
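
Since `advice_flags` is a u32 bitmask, several hints can be combined in one PAGE_ADVISE request. The sketch below shows one way to name the bits and inspect a mask. The helper names `advice_names` and `unknown_advice_bits` are hypothetical, and the text above does not say whether undefined bits must be rejected, so the second helper only reports them.

```rust
pub const WILL_NEED: u32 = 0x01;
pub const DONT_NEED: u32 = 0x02;
pub const SEQUENTIAL: u32 = 0x04;
pub const RANDOM: u32 = 0x08;
pub const EVICT_PRIORITY_LOW: u32 = 0x10;
pub const EVICT_PRIORITY_HIGH: u32 = 0x20;

/// Union of every bit defined in the table above.
pub const KNOWN_ADVICE: u32 = 0x3F;

const NAMES: [(u32, &str); 6] = [
    (WILL_NEED, "WILL_NEED"),
    (DONT_NEED, "DONT_NEED"),
    (SEQUENTIAL, "SEQUENTIAL"),
    (RANDOM, "RANDOM"),
    (EVICT_PRIORITY_LOW, "EVICT_PRIORITY_LOW"),
    (EVICT_PRIORITY_HIGH, "EVICT_PRIORITY_HIGH"),
];

/// Names of the defined bits set in `flags`, in ascending bit order.
pub fn advice_names(flags: u32) -> Vec<&'static str> {
    NAMES
        .iter()
        .filter(|(bit, _)| (flags & bit) != 0)
        .map(|(_, name)| *name)
        .collect()
}

/// Bits set in `flags` that the table does not define.
pub fn unknown_advice_bits(flags: u32) -> u32 {
    flags & !KNOWN_ADVICE
}
```

A client hinting that a page will be read soon as part of a scan would send `WILL_NEED | SEQUENTIAL`, which is 0x05 on the wire.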

Page error status codes

0x00 OK
0x01 LEASE_NOT_FOUND
0x02 LEASE_NOT_ACTIVE
0x03 LEASE_EXPIRED
0x04 LEASE_REVOKED
0x05 LEASE_FENCED
0x06 PAGE_OUT_OF_RANGE
0x07 INSUFFICIENT_CAPACITY
0x08 SRC_LEASE_NOT_FOUND
0x09 DST_LEASE_NOT_FOUND
0x0A CROSS_TENANT_COPY_DENIED

Lease Export/Import/Relocate

Three QUIC control-plane opcodes for exporting lease handles to other runtimes, importing them, and relocating data between leases.

Op codes

| Code | Name | Description |
| --- | --- | --- |
| 0x0020 | OP_LEASE_EXPORT_HANDLE | Export a lease handle for another runtime |
| 0x0021 | OP_LEASE_IMPORT_HANDLE | Import a previously exported handle |
| 0x0022 | OP_LEASE_RELOCATE | Relocate data between leases |

BindingType registry

The binding type is a u8 discriminant describing how the leased resource is accessed on the data plane. Unknown binding types MUST be rejected (fail-closed).

| Value | Name | binding_desc layout |
| --- | --- | --- |
| 0x01 | PcieBar | u64 bar_base + u64 bar_len |
| 0x02 | RdmaRkey | u32 rkey + u64 remote_addr |
| 0x03 | LinuxDmabuf | u32 fd + u64 offset + u64 len |
| 0x04 | VramOffset | u64 vram_offset + u64 len |
| 0x05 | FabricPage | u128 page_id + u64 offset + u64 len |
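
The registry maps naturally onto a tagged enum. The following sketch decodes a `binding_desc` body by binding type and fails closed on unknown types. It assumes the fields are packed big-endian in the listed order with no padding, and that leftover bytes are an error; both are assumptions, not spec text. `Binding`, `Reader`, and `parse_binding` are hypothetical names, and `desc` is the data after the u32 length prefix.

```rust
/// Decoded binding descriptor (illustrative type).
#[derive(Debug, PartialEq, Eq)]
pub enum Binding {
    PcieBar { bar_base: u64, bar_len: u64 },
    RdmaRkey { rkey: u32, remote_addr: u64 },
    LinuxDmabuf { fd: u32, offset: u64, len: u64 },
    VramOffset { vram_offset: u64, len: u64 },
    FabricPage { page_id: u128, offset: u64, len: u64 },
}

/// Minimal big-endian cursor over a byte slice.
struct Reader<'a>(&'a [u8]);

impl<'a> Reader<'a> {
    fn take<const N: usize>(&mut self) -> Result<[u8; N], &'static str> {
        if self.0.len() < N {
            return Err("truncated binding_desc");
        }
        let (head, tail) = self.0.split_at(N);
        self.0 = tail;
        let mut out = [0u8; N];
        out.copy_from_slice(head);
        Ok(out)
    }
    fn read_u32(&mut self) -> Result<u32, &'static str> {
        Ok(u32::from_be_bytes(self.take::<4>()?))
    }
    fn read_u64(&mut self) -> Result<u64, &'static str> {
        Ok(u64::from_be_bytes(self.take::<8>()?))
    }
    fn read_u128(&mut self) -> Result<u128, &'static str> {
        Ok(u128::from_be_bytes(self.take::<16>()?))
    }
}

pub fn parse_binding(binding_type: u8, desc: &[u8]) -> Result<Binding, &'static str> {
    let mut r = Reader(desc);
    // Struct fields are evaluated in the order written, which matches wire order.
    let binding = match binding_type {
        0x01 => Binding::PcieBar { bar_base: r.read_u64()?, bar_len: r.read_u64()? },
        0x02 => Binding::RdmaRkey { rkey: r.read_u32()?, remote_addr: r.read_u64()? },
        0x03 => Binding::LinuxDmabuf { fd: r.read_u32()?, offset: r.read_u64()?, len: r.read_u64()? },
        0x04 => Binding::VramOffset { vram_offset: r.read_u64()?, len: r.read_u64()? },
        0x05 => Binding::FabricPage { page_id: r.read_u128()?, offset: r.read_u64()?, len: r.read_u64()? },
        // Unknown binding types are rejected (fail-closed).
        _ => return Err("unknown binding type"),
    };
    // Assumption: bytes left over after the fixed layout are an error.
    if !r.0.is_empty() {
        return Err("trailing bytes in binding_desc");
    }
    Ok(binding)
}
```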

Target runtime kinds

| Value | Name |
| --- | --- |
| 0 | GPU_WORKER |
| 1 | TASKLET |
| 2 | NATIVE_SERVICE |
| 3 | ACCELERATOR |

Rights bitmask

| Bit | Value | Name |
| --- | --- | --- |
| 0 | 1 | RIGHT_READ |
| 1 | 2 | RIGHT_WRITE |
| 2 | 4 | RIGHT_APPEND |

RelocateMode values

| Value | Name | Description |
| --- | --- | --- |
| 0 | Copy | Full copy; both source and destination remain valid |
| 1 | Move | Copy data then mark source for deferred free |
| 2 | CowSeed | Create reference without immediate copy |

Unknown mode values MUST be rejected.

LEASE_EXPORT_HANDLE (0x0020)

Request payload (all big-endian):

lease_id : u64
target_runtime_kind : u16
target_identity : u128
rights : u32
ttl_secs : u32

Total: 34 bytes (8 + 2 + 16 + 4 + 4).

Response payload:

export_handle : u128
expires_at : u64
binding_type : u8
binding_desc : bytes (u32 length + data; layout depends on binding_type)

LEASE_IMPORT_HANDLE (0x0021)

Request payload:

export_handle : u128

Total: 16 bytes.

Response payload:

local_attach_id : u64
rights : u32
expires_at : u64

Total: 20 bytes.

LEASE_RELOCATE (0x0022)

Request payload:

src_lease_id : u64
dst_lease_id : u64
mode : u8 (RelocateMode)
has_preserve : u8 (0 = absent, 1 = present)
preserve_until_ms : u64 (only if has_preserve == 1)

Total: 18 bytes (without preserve) or 26 bytes (with preserve).
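
The optional `preserve_until_ms` field follows the document's present-flag convention, which is what produces the two possible lengths. A minimal encode/decode sketch follows. `RelocateRequest`, `encode_relocate`, and `decode_relocate` are hypothetical names, and rejecting a `has_preserve` value other than 0 or 1 is an assumption made by analogy with the other present flags.

```rust
/// RelocateMode values from the registry above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RelocateMode {
    Copy = 0,
    Move = 1,
    CowSeed = 2,
}

#[derive(Debug, PartialEq, Eq)]
pub struct RelocateRequest {
    pub src_lease_id: u64,
    pub dst_lease_id: u64,
    pub mode: RelocateMode,
    pub preserve_until_ms: Option<u64>,
}

fn be_u64(b: &[u8]) -> u64 {
    let mut a = [0u8; 8];
    a.copy_from_slice(&b[..8]);
    u64::from_be_bytes(a)
}

/// Encode to 18 bytes (no preserve) or 26 bytes (with preserve).
pub fn encode_relocate(req: &RelocateRequest) -> Vec<u8> {
    let mut out = Vec::with_capacity(26);
    out.extend_from_slice(&req.src_lease_id.to_be_bytes());
    out.extend_from_slice(&req.dst_lease_id.to_be_bytes());
    out.push(req.mode as u8);
    match req.preserve_until_ms {
        Some(ms) => {
            out.push(1); // has_preserve = present
            out.extend_from_slice(&ms.to_be_bytes());
        }
        None => out.push(0), // has_preserve = absent
    }
    out
}

pub fn decode_relocate(buf: &[u8]) -> Result<RelocateRequest, &'static str> {
    if buf.len() < 18 {
        return Err("truncated request");
    }
    let mode = match buf[16] {
        0 => RelocateMode::Copy,
        1 => RelocateMode::Move,
        2 => RelocateMode::CowSeed,
        // Unknown mode values are rejected.
        _ => return Err("unknown RelocateMode"),
    };
    let preserve_until_ms = match (buf[17], buf.len()) {
        (0, 18) => None,
        (1, 26) => Some(be_u64(&buf[18..])),
        (0, _) | (1, _) => return Err("length does not match has_preserve"),
        // Assumption: any flag value other than 0 or 1 is invalid.
        _ => return Err("invalid has_preserve"),
    };
    Ok(RelocateRequest {
        src_lease_id: be_u64(&buf[0..]),
        dst_lease_id: be_u64(&buf[8..]),
        mode,
        preserve_until_ms,
    })
}
```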

Response payload:

bytes_transferred : u64
status : u8 (0 = Complete, 1 = InProgress, 2 = Failed)
reason : bytes (u32 length + UTF-8 data; only if status == 2)

Constrained dataplane bindings (v0 draft)

These bindings are transport-specific protocols used by constrained nodes that cannot implement RDMA/NVMe-oF yet. They are part of the v0 wire contract once stabilized; until then they are treated as draft profiles.

The default direction for general-purpose nodes is a secure QUIC control plane plus a QUIC dataplane transport. UDP shims (FBMU/FBBU) are a constrained compatibility profile for bring-up and limited environments. See:

  • docs/platform/pi5/draft-dataplane-shims-v0.md (draft, UDP shims)
  • docs/platform/pi5/design-doc-1-quic-controlplane-udp-dataplanes.md (architecture)
  • Golden vectors: vectors/v0/fbmu/manifest.txt and vectors/v0/fbbu/manifest.txt

Note: The TCP constrained dataplane profiles below (FBMT/FBBT) are deprecated. The intended replacement is QUIC streams carrying the same application framing (READ/WRITE, READ_BLOCK/WRITE_BLOCK). Until the QUIC dataplane profile is fully specified, the UDP shims remain a draft-only compatibility transport. The forthcoming QUIC dataplane profile will be documented alongside docs/platform/pi5/quic-profile-v0.md.

TCP memory protocol (FBMT)

Deprecated: kept for historical reference; do not extend.

This protocol runs over TCP and exposes a leased memory region using simple read/write ops.

All integers are big-endian.

Each message begins with a fixed 16-byte header:

magic : [u8; 4] = "FBMT"
version : u8 = 1
op : u8
flags : u16 (must be 0 for v1)
payload_len : u32
request_id : u32

payload_len counts only the payload bytes following the header. request_id is client-chosen; for IO operations it MUST be non-zero. Unknown magic, unsupported version, or non-zero flags MUST result in the connection being closed.
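
For historical reference, the 16-byte header and its close-on-error rules can be sketched as below. `Header`, `encode_header`, and `parse_header` are hypothetical names, and any `Err` means the connection should be closed. Treating only READ (0x10) and WRITE (0x20) as the "IO operations" that require a non-zero `request_id` is an assumption based on the opcode list that follows.

```rust
pub const HEADER_LEN: usize = 16;

#[derive(Debug, PartialEq, Eq)]
pub struct Header {
    pub op: u8,
    pub payload_len: u32,
    pub request_id: u32,
}

/// Build a v1 FBMT header; flags are always zero in v1.
pub fn encode_header(h: &Header) -> [u8; HEADER_LEN] {
    let mut b = [0u8; HEADER_LEN];
    b[0..4].copy_from_slice(b"FBMT");
    b[4] = 1; // version
    b[5] = h.op;
    // b[6..8] (flags) stay zero.
    b[8..12].copy_from_slice(&h.payload_len.to_be_bytes());
    b[12..16].copy_from_slice(&h.request_id.to_be_bytes());
    b
}

/// Validate and decode a header; on `Err` the caller closes the connection.
pub fn parse_header(b: &[u8; HEADER_LEN]) -> Result<Header, &'static str> {
    if b[..4] != *b"FBMT" {
        return Err("unknown magic");
    }
    if b[4] != 1 {
        return Err("unsupported version");
    }
    if b[6] != 0 || b[7] != 0 {
        return Err("non-zero flags");
    }
    let header = Header {
        op: b[5],
        payload_len: u32::from_be_bytes([b[8], b[9], b[10], b[11]]),
        request_id: u32::from_be_bytes([b[12], b[13], b[14], b[15]]),
    };
    // Assumption: READ and WRITE are the IO operations the rule refers to.
    if matches!(header.op, 0x10 | 0x20) && header.request_id == 0 {
        return Err("IO request with zero request_id");
    }
    Ok(header)
}
```

The FBBT header described later has the same layout, so the same sketch would apply with the magic changed to "FBBT".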

Opcodes

0x01 HELLO
0x02 HELLO_ACK
0x10 READ
0x11 READ_RESP
0x20 WRITE
0x21 WRITE_RESP
0x30 PING
0x31 PONG
0x7F ERROR

Handshake

Client MUST send HELLO immediately after connect. Server replies with HELLO_ACK and either accepts or rejects the lease.

HELLO payload:

lease_id : u128

HELLO_ACK payload:

status : u8
reserved : u8
reserved : u16
resource_len : u64
max_read_len : u32
max_write_len : u32

If status != 0, the server SHOULD close the connection after sending the response.

IO operations

READ payload:

offset : u64
length : u32

READ_RESP payload:

status : u8
reserved : u8
reserved : u16
length : u32
data : [u8; length]

WRITE payload:

offset : u64
length : u32
data : [u8; length]

WRITE_RESP payload:

status : u8
reserved : u8
reserved : u16

Rules:

  • offset + length MUST be within resource_len from HELLO_ACK.
  • length MUST be > 0 and <= max_{read,write}_len.
  • Payload sizes MUST match length; otherwise return BAD_LENGTH.
  • Server MAY process requests out of order; client matches responses by request_id.

Status codes

0x00 OK
0x01 INVALID_LEASE
0x02 EXPIRED
0x03 FENCED
0x04 OUT_OF_RANGE
0x05 BAD_LENGTH
0x06 BUSY
0x07 INTERNAL

For non-OK statuses on IO requests, responses MUST contain no data bytes.
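
The IO rules and status codes above can be combined into a small server-side check, sketched here. Only the mapping of a payload-size mismatch to BAD_LENGTH comes from the rules; returning BAD_LENGTH for a zero or over-limit `length` and OUT_OF_RANGE for a range violation is an assumption. `Limits`, `check_read`, and `check_write` are hypothetical names, with `Limits` mirroring the values the server advertised in HELLO_ACK.

```rust
pub const STATUS_OK: u8 = 0x00;
pub const OUT_OF_RANGE: u8 = 0x04;
pub const BAD_LENGTH: u8 = 0x05;

/// Limits the server advertised in HELLO_ACK.
pub struct Limits {
    pub resource_len: u64,
    pub max_read_len: u32,
    pub max_write_len: u32,
}

/// offset + length must stay within resource_len; overflow counts as out of range.
fn check_range(resource_len: u64, offset: u64, length: u32) -> u8 {
    match offset.checked_add(u64::from(length)) {
        Some(end) if end <= resource_len => STATUS_OK,
        _ => OUT_OF_RANGE,
    }
}

/// Status a server would return for a READ request.
pub fn check_read(l: &Limits, offset: u64, length: u32) -> u8 {
    if length == 0 || length > l.max_read_len {
        return BAD_LENGTH;
    }
    check_range(l.resource_len, offset, length)
}

/// Status for a WRITE request; `data_len` is the number of data bytes received.
pub fn check_write(l: &Limits, offset: u64, length: u32, data_len: usize) -> u8 {
    if length == 0 || length > l.max_write_len || data_len != length as usize {
        return BAD_LENGTH;
    }
    check_range(l.resource_len, offset, length)
}
```

Using `checked_add` keeps a request such as `offset = u64::MAX, length = 1` from wrapping around and passing the range check.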

PING/PONG

PING and PONG have empty payloads and allow a client to test liveness. Servers MAY ignore PING requests.

Connection lifecycle

  • Server MUST close the connection when the lease expires or is revoked.
  • Client SHOULD treat connection close as lease invalidation.
  • Server SHOULD close the connection on any protocol violation.

TCP block storage protocol (FBBT)

Deprecated: kept for historical reference; do not extend.

This protocol runs over TCP and exposes a fixed-size block device using lease-gated access.

All integers are big-endian.

Header

Each message begins with a fixed 16-byte header:

magic : [u8; 4] = "FBBT"
version : u8 = 1
op : u8
flags : u16 (must be 0 for v1)
payload_len : u32
request_id : u32

payload_len counts only the payload bytes following the header. request_id is client-chosen; for IO operations it MUST be non-zero. Unknown magic, unsupported version, or non-zero flags MUST result in the connection being closed.

Opcodes

0x01 HELLO
0x02 HELLO_ACK
0x10 READ_BLOCK
0x11 READ_BLOCK_RESP
0x20 WRITE_BLOCK
0x21 WRITE_BLOCK_RESP
0x30 PING
0x31 PONG
0x7F ERROR

Handshake

Client MUST send HELLO immediately after connect. Server replies with HELLO_ACK and either accepts or rejects the lease.

HELLO payload:

lease_id : u128

HELLO_ACK payload:

status : u8
reserved : u8
reserved : u16
block_size : u32
device_block_cnt : u64
max_blocks_per_io : u32

If status != 0, the server SHOULD close the connection after sending the response.

IO operations

READ_BLOCK payload:

block_index : u64
block_count : u32

READ_BLOCK_RESP payload:

status : u8
reserved : u8
reserved : u16
block_count : u32
block_data : [u8; block_count * block_size]

WRITE_BLOCK payload:

block_index : u64
block_count : u32
block_data : [u8; block_count * block_size]

WRITE_BLOCK_RESP payload:

status : u8
reserved : u8
reserved : u16

Rules:

  • block_index + block_count MUST be within device_block_cnt from HELLO_ACK.
  • block_count MUST be > 0 and <= max_blocks_per_io.
  • Payload sizes MUST match block_count * block_size; otherwise return BAD_LENGTH.
  • Server MAY process requests out of order; client matches responses by request_id.

Status codes

0x00 OK
0x01 INVALID_LEASE
0x02 EXPIRED
0x03 FENCED
0x04 OUT_OF_RANGE
0x05 BAD_LENGTH
0x06 BUSY
0x07 INTERNAL

For non-OK statuses on IO requests, responses MUST contain no data bytes.
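
A block-oriented version of the same checks is sketched below. As with FBMT, only the payload-size mismatch is tied to BAD_LENGTH by the rules; the other status mappings are assumptions. `Device`, `check_write_block`, and `read_block_resp_payload_len` are hypothetical names, with `Device` holding the geometry from HELLO_ACK.

```rust
pub const STATUS_OK: u8 = 0x00;
pub const OUT_OF_RANGE: u8 = 0x04;
pub const BAD_LENGTH: u8 = 0x05;

/// Geometry the server advertised in HELLO_ACK.
pub struct Device {
    pub block_size: u32,
    pub device_block_cnt: u64,
    pub max_blocks_per_io: u32,
}

/// Status a server would return for a WRITE_BLOCK request carrying
/// `data_len` bytes of block data.
pub fn check_write_block(d: &Device, block_index: u64, block_count: u32, data_len: u64) -> u8 {
    if block_count == 0 || block_count > d.max_blocks_per_io {
        return BAD_LENGTH;
    }
    // The product of two u32 values always fits in a u64.
    let expected = u64::from(block_count) * u64::from(d.block_size);
    if data_len != expected {
        return BAD_LENGTH;
    }
    match block_index.checked_add(u64::from(block_count)) {
        Some(end) if end <= d.device_block_cnt => STATUS_OK,
        _ => OUT_OF_RANGE,
    }
}

/// payload_len of a successful READ_BLOCK_RESP: 8 fixed bytes
/// (status, reserved, reserved, block_count) plus the block data.
/// Returns `None` if the result does not fit the u32 header field.
pub fn read_block_resp_payload_len(d: &Device, block_count: u32) -> Option<u32> {
    let total = u64::from(block_count) * u64::from(d.block_size) + 8;
    if total > u64::from(u32::MAX) {
        None
    } else {
        Some(total as u32)
    }
}
```

The second helper highlights a practical constraint: because `payload_len` is a u32, a server would need to choose `max_blocks_per_io` so that the largest response still fits in the header field.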

PING/PONG

PING and PONG have empty payloads and allow a client to test liveness. Servers MAY ignore PING requests.

Connection lifecycle

  • Server MUST close the connection when the lease expires or is revoked.
  • Client SHOULD treat connection close as lease invalidation.
  • Server SHOULD close the connection on any protocol violation.