Premium Dataplane Methodology
This document defines how fabricBIOS integrates premium (native, high-performance) dataplanes. It is the canonical reference for the design rule, conformance requirements, and evidence standards that apply to every transport binding in fabricBIOS.
1. Core Design Rule
Authorize once, bind natively, revoke deterministically.
fabricBIOS owns five things:
- Identity — node identity, certificate validation, trust bootstrap.
- Capability minting — audience-bound, short-TTL tokens with anti-replay.
- Lease lifecycle — create, renew, expire, revoke.
- Revocation — signed REVOKE_BROADCAST; data-plane teardown on expiry or revoke.
- Fencing — fail-closed quarantine when teardown fails.
Native transports own one thing: steady-state bulk I/O.
After binding establishment, there are zero fabricBIOS round trips on the data path and zero extra copies introduced by the control plane. The transport is intended to run at native performance through its native API (ibverbs, nvme connect / kernel NVMe-oF stack, GPU vendor SDK, CXL memory-mapped access). fabricBIOS re-enters the picture only at renewal, revocation, or expiry. Note that this architectural property currently outruns the evidence level of every transport: lifecycle proof can exist before steady-state throughput/latency proof, and the repository must state which of the two has actually been demonstrated.
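The sketch below illustrates this lifecycle split in Rust. All names (ControlPlane, NativeBinding, lease_alloc, lease_renew) are hypothetical stand-ins for illustration; they are not the fabricBIOS API.

```rust
use std::time::{Duration, Instant};

/// Binding material handed to the client at LEASE_ALLOC time. After
/// this point the control plane is out of the I/O path entirely.
struct NativeBinding {
    lease_id: u64,
    expires_at: Instant,
    // Transport-specific credentials (e.g., an RDMA rkey and remote
    // address) would live here and feed the native transport API.
}

struct ControlPlane;

impl ControlPlane {
    /// One control-plane round trip: authorize, mint binding material.
    fn lease_alloc(&self, ttl: Duration) -> NativeBinding {
        NativeBinding { lease_id: 1, expires_at: Instant::now() + ttl }
    }

    /// Periodic renewal: the only steady-state control traffic.
    fn lease_renew(&self, binding: &mut NativeBinding, ttl: Duration) {
        binding.expires_at = Instant::now() + ttl;
    }
}

fn main() {
    let cp = ControlPlane;
    let mut binding = cp.lease_alloc(Duration::from_secs(30));

    // Steady state: bulk I/O goes through the native transport
    // (ibverbs, nvme-tcp, HIP) using the binding's credentials.
    // fabricBIOS sends zero messages and copies zero bytes here.

    // fabricBIOS re-enters only at renewal, revocation, or expiry,
    // when the provider runs transport-specific teardown.
    cp.lease_renew(&mut binding, Duration::from_secs(30));
    assert!(binding.expires_at > Instant::now());
}
```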
2. Contrasting Architectures
2.1 Proxy / Middlebox
Every I/O traverses a mediator process or appliance.
- Extra memory copy per operation (mediator receive buffer to destination).
- Extra latency per operation (mediator scheduling, context switch, or network hop).
- Mediator is a single point of failure and a CPU bottleneck under load.
- Mediator must be at least as fast as the transport it mediates — an impossible requirement for RDMA or GPU fabric traffic at line rate.
2.2 Raw Trusted-Fabric Shortcut
Direct hardware access with no lifecycle management layer.
- No lease expiry enforcement. A departed or misbehaving client retains access indefinitely until an administrator intervenes.
- No revocation mechanism. Revoking access requires manual reconfiguration of hardware ACLs or physical disconnection.
- No fencing on failure. A hardware or teardown fault leaves the resource in an undefined state with no guaranteed isolation.
- Operational burden scales with fleet size; every access control decision is ad hoc.
2.3 fabricBIOS
Authorize the binding, hand the client native transport credentials, then get out of the way.
- Zero mediator on the data path. The client talks directly to the transport hardware (RNIC, NVMe-oF target, GPU, CXL endpoint).
- Deterministic revocation. Lease expiry triggers transport-specific teardown (QP destroy, MR deregister, configfs subsystem removal, session credential invalidation). The node confirms teardown succeeded or fences the resource.
- Fail-closed on teardown failure. A resource that cannot be cleanly torn down enters FENCED state: no new leases are granted and the resource is reported as FENCED in discovery until remediation (reset, power cycle) succeeds.
- Uniform control plane across transports. The same LEASE_ALLOC / LEASE_RENEW / LEASE_FREE / LEASE_QUERY operations work for RDMA, NVMe-oF, GPU, NIC, and CXL resources. Transport-specific binding material is carried in typed TLV extensions within the same wire format.
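The sketch below shows the shape of those typed TLV extensions in Rust. Field names and TLV ranges come from the conformance matrix in Section 3; the struct layout is illustrative and does not reproduce the normative encoding in crates/fabricbios-core/src/binding.rs.

```rust
/// PCI bus/device/function, 4 bytes on the wire.
struct PciBdf { domain: u16, bus: u8, devfn: u8 }

/// RDMA binding material, TLV 0x02xx range (Section 3.1).
struct RdmaBinding {
    rkey: u32,
    remote_addr: u64,
    qp_num: u32,
    gid: [u8; 16],
    port: u16,
}

/// NVMe-oF binding material, TLV 0x03xx range (Section 3.3).
struct NvmeofBinding {
    nqn: String,    // subsystem NQN per the NVMe spec
    traddr: String, // target IP address
    trsvcid: u16,   // target port
    trtype: u8,     // 0 = tcp, 1 = rdma
}

/// SR-IOV binding material, TLV 0x05xx range (Section 3.5).
struct SriovBinding {
    vf_pci_addr: PciBdf,
    vf_index: u16,
    pf_pci_addr: PciBdf,
}

/// One lease wire format; the transport is selected by which typed
/// TLV extension the lease carries.
enum DataplaneBinding {
    Rdma(RdmaBinding),
    Nvmeof(NvmeofBinding),
    Sriov(SriovBinding),
}
```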
3. Premium Dataplane Conformance Matrix
The table below covers every transport binding defined or planned in fabricBIOS. For each transport, it specifies the native I/O mechanism, the binding material exchanged at lease creation, the teardown procedure, the fenced fallback when teardown fails, and the current implementation status.
3.1 RoCEv2 / soft-RoCE (RDMA)
| Aspect | Detail |
|---|---|
| Resource type | MEM (0x0002) |
| Native transport | ibverbs RDMA WRITE / RDMA READ over RC QP |
| Binding material (TLV 0x02xx) | rkey (u32), remote_addr (u64), qp_num (u32), gid ([u8; 16]), port (u16) |
| Teardown | QP destroy, MR deregister, CQ destroy, PD dealloc via ibverbs. RAII Drop impls in RdmaMemoryBinding. |
| Fenced fallback | MR deregistered (rkey invalidated), QP destroyed. RdmaLeaseManager::fence() sets state to Fenced; no new allocations accepted. |
| Implementation | crates/fabricbios-platform-linux/src/rdma.rs. RdmaContext wraps ibv_device/PD/CQ. RdmaMemoryBinding registers an MmapMemoryRegion as ibv_mr, creates RC QP, transitions INIT -> RTR -> RTS. RdmaLeaseManager tracks allocations with free_or_fence for fail-closed teardown. RdmaClient provides the client-side counterpart (connect, rdma_write, rdma_read). |
| Test environment | soft-RoCE (rxe) on Linux. See scripts/setup-rxe.sh. |
| Status | Implemented (Phase 57). Binding TLV roundtrip tested. Server-side QP management, lease allocation, teardown, and fencing tested. Client-side RDMA WRITE/READ tested against soft-RoCE. |
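The Drop-ordering sketch below illustrates the RAII teardown sequence from the table. The wrapper types are hypothetical simplifications of RdmaMemoryBinding, which performs the corresponding ibverbs calls (ibv_destroy_qp, ibv_dereg_mr, ibv_destroy_cq, ibv_dealloc_pd) through dynamically loaded symbols.

```rust
struct Qp; // wraps ibv_qp
struct Mr; // wraps ibv_mr
struct Cq; // wraps ibv_cq
struct Pd; // wraps ibv_pd

impl Drop for Qp { fn drop(&mut self) { println!("ibv_destroy_qp"); } }
impl Drop for Mr { fn drop(&mut self) { println!("ibv_dereg_mr"); } }
impl Drop for Cq { fn drop(&mut self) { println!("ibv_destroy_cq"); } }
impl Drop for Pd { fn drop(&mut self) { println!("ibv_dealloc_pd"); } }

/// Field order matters: Rust drops fields in declaration order, so
/// the QP dies before the MR it references, and the CQ before the PD
/// that owns it.
struct RdmaBindingSketch {
    qp: Qp,
    mr: Mr,
    cq: Cq,
    pd: Pd,
}

fn main() {
    let _binding = RdmaBindingSketch { qp: Qp, mr: Mr, cq: Cq, pd: Pd };
    // Dropping _binding prints the teardown sequence in the required
    // order: ibv_destroy_qp, ibv_dereg_mr, ibv_destroy_cq, ibv_dealloc_pd.
}
```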
3.2 InfiniBand
| Aspect | Detail |
|---|---|
| Resource type | MEM (0x0002) |
| Native transport | ibverbs RDMA WRITE / RDMA READ (same API as RoCEv2) |
| Binding material | Same TLV 0x02xx range (rkey, remote_addr, qp_num, gid, port) |
| Teardown | Same ibverbs teardown path |
| Fenced fallback | Same as RoCEv2 |
| Implementation | Same rdma.rs code path. IB vs. RoCE is a link-layer difference; the ibverbs API is identical. |
| Status | Architecture-ready. Wire format implemented, code path shared with RoCEv2. No InfiniBand hardware tested. |
3.3 NVMe-oF/TCP
| Aspect | Detail |
|---|---|
| Resource type | BLOCK (0x0003) |
| Native transport | nvme-tcp kernel module (client connects via nvme connect) |
| Binding material (TLV 0x03xx) | nqn (String, NQN per NVMe spec), traddr (String, IP), trsvcid (u16, port), trtype (u8, 0=tcp) |
| Teardown | Reverse-order configfs removal: unlink port symlink, remove allowed host, disable namespace (enable=0), remove namespace directory, remove subsystem directory. Implemented in NvmetManager::teardown(). |
| Fenced fallback | NvmetManager::fence_subsystem(): write attr_allow_any_host=0, disable all namespaces (enable=0). No new connections possible. |
| Implementation | crates/fabricbios-platform-linux/src/nvmet.rs. NvmetManager creates/tears down NVMe-oF subsystems, namespaces, port bindings, and host ACLs via /sys/kernel/config/nvmet/ configfs. Each lease gets its own subsystem NQN derived from lease_id, its own namespace, and an explicit host allowlist entry (no allow_any_host). |
| Test environment | Loop-backed nvmet (trtype=loop) for unit tests; nvme-tcp for integration. NvmetManager::with_root() supports tempdir-based testing without root privileges. |
| Status | Implemented for target lifecycle and binding (Phase 58). Subsystem create/teardown tested. Fencing on teardown failure tested. Binding TLV roundtrip tested. NQN validation implemented per NVMe spec. Steady-state initiator I/O and performance proof remain open. |
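The sketch below shows the reverse-order removal against the standard /sys/kernel/config/nvmet layout. The function, its parameters, and the example NQNs are illustrative, not the NvmetManager::teardown() signature; a failure at any step is what drives the fence_subsystem() fallback.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Reverse-order configfs teardown of one per-lease subsystem.
fn teardown_subsystem(root: &Path, nqn: &str, port: u32, nsid: u32, host_nqn: &str) -> io::Result<()> {
    // 1. Unlink the port symlink so no new connections reach the subsystem.
    fs::remove_file(root.join(format!("ports/{port}/subsystems/{nqn}")))?;
    // 2. Remove the allowed-host entry (explicit allowlist, never allow_any_host).
    fs::remove_file(root.join(format!("subsystems/{nqn}/allowed_hosts/{host_nqn}")))?;
    // 3. Disable the namespace before removing it.
    fs::write(root.join(format!("subsystems/{nqn}/namespaces/{nsid}/enable")), "0")?;
    // 4. Remove the namespace directory.
    fs::remove_dir(root.join(format!("subsystems/{nqn}/namespaces/{nsid}")))?;
    // 5. Remove the subsystem directory itself.
    fs::remove_dir(root.join(format!("subsystems/{nqn}")))?;
    Ok(())
}

fn main() -> io::Result<()> {
    // Illustrative invocation; a with_root()-style indirection lets
    // tests point this at a tempdir instead of the real configfs.
    teardown_subsystem(
        Path::new("/sys/kernel/config/nvmet"),
        "nqn.2024-01.io.example:lease-42", // hypothetical lease-derived NQN
        1,
        1,
        "nqn.2014-08.org.nvmexpress:uuid:host", // hypothetical host NQN
    )
}
```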
3.4 NVMe-oF/RDMA
| Aspect | Detail |
|---|---|
| Resource type | BLOCK (0x0003) |
| Native transport | nvme-rdma kernel module |
| Binding material | Same TLV 0x03xx range; trtype=1 (RDMA) |
| Teardown | Same configfs removal path as NVMe-oF/TCP; trtype field changes the kernel transport module, not the configfs structure |
| Fenced fallback | Same as NVMe-oF/TCP |
| Implementation | Wire format and NvmetTransport::Rdma variant implemented. NvmetManager handles RDMA ports identically to TCP ports (configfs addr_trtype=rdma). |
| Status | Wire format implemented. No NVMe-oF/RDMA hardware tested. The configfs management code is transport-agnostic; only the addr_trtype attribute differs. |
3.5 SR-IOV / VFIO NIC
| Aspect | Detail |
|---|---|
| Resource type | NET (0x0004) |
| Native transport | VF (Virtual Function) passthrough to guest or container via VFIO or kernel netdev |
| Binding material (TLV 0x05xx) | vf_pci_addr (PciBdf, 4 bytes: domain, bus, devfn), vf_index (u16), pf_pci_addr (PciBdf, 4 bytes) |
| Teardown | VF release: clear MAC/VLAN override, disable VF. SriovVfManager::release() frees the VF slot; release_or_fence() fences on failure. Macvlan interfaces destroyed via ip link delete. TC bandwidth reservations cleaned up. |
| Fenced fallback | SriovVfManager::fence() sets state to Fenced; no new VF allocations. VF remains assigned but isolated (no MAC/VLAN forwarding). |
| Implementation | crates/fabricbios-platform-linux/src/nic.rs. SriovVfAllocator manages VF index allocation. SriovVfManager handles full lifecycle: enable SR-IOV (sriov_numvfs), allocate VFs with MAC/VLAN/spoofchk/rate configuration via sysfs + ip link set, release with cleanup. MacvlanInterface provides lightweight L2 isolation without SR-IOV hardware. TcBandwidthReservation adds HTB rate limiting. NicLeaseManager coordinates NIC resource leasing with macvlan creation. |
| Test environment | Macvlan and TC tested on standard Linux NICs. SR-IOV requires hardware with VF support. |
| Status | Code implemented (Phase 59). SR-IOV allocation, configuration, release, and fencing implemented. Macvlan and TC bandwidth reservation implemented and tested. Blocked on SR-IOV hardware for end-to-end VF passthrough testing. |
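The sketch below shows the two touch points described above: sysfs for VF enablement and iproute2 for per-VF configuration. Interface names and parameters are illustrative. One real-world wrinkle worth encoding: the kernel rejects changing sriov_numvfs between two nonzero values, so the count must be reset to 0 first.

```rust
use std::fs;
use std::io;
use std::process::Command;

/// Enable `num` VFs on a physical function via sysfs.
fn enable_vfs(pf: &str, num: u32) -> io::Result<()> {
    let path = format!("/sys/class/net/{pf}/device/sriov_numvfs");
    // Nonzero -> nonzero writes are rejected by the kernel; reset first.
    fs::write(&path, "0")?;
    fs::write(&path, num.to_string())
}

/// Pin a VF's MAC and VLAN and enable spoof checking via iproute2.
fn configure_vf(pf: &str, vf: u16, mac: &str, vlan: u16) -> io::Result<()> {
    let status = Command::new("ip")
        .args(["link", "set", "dev", pf, "vf", &vf.to_string(),
               "mac", mac, "vlan", &vlan.to_string(), "spoofchk", "on"])
        .status()?;
    if status.success() {
        Ok(())
    } else {
        Err(io::Error::new(io::ErrorKind::Other, "ip link set failed"))
    }
}

fn main() -> io::Result<()> {
    enable_vfs("eth0", 4)?; // hypothetical PF name and VF count
    configure_vf("eth0", 0, "02:00:00:00:00:01", 100)
}
```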
3.6 Vendor GPU Fabrics / GPUDirect RDMA
| Aspect | Detail |
|---|---|
| Resource type | GPU (0x0005) |
| Native transport | CUDA Driver API (NVIDIA) or HIP/ROCm runtime (AMD) for local compute; GPUDirect RDMA for remote VRAM access |
| Binding material | Session credentials bound to lease. GPU binding contents defined in docs/gpu-leasing-contract-draft.md: lease_id, gpu_resource_id, access_caps (operation mask), queue_caps, mem_caps, revocation_epoch. |
| Teardown | GPU context/session destroy. HIP: hipFree device memory, hipModuleUnload, device context release. CUDA: equivalent driver API teardown. Lease expiry invalidates the session credentials. |
| Fenced fallback | GPU device reported FENCED in inventory. No new leases granted. Device reset required to clear fenced state. |
| Implementation | crates/fabricbios-platform-linux/src/gpu.rs. GpuDevice wraps detected GPUs with NVML (NVIDIA) and HIP (AMD) runtime data. GPU compute via gpu-hip feature: dynamic loading of libamdhip64.so, device memory allocation (hipMalloc/hipFree), host-device transfers (hipMemcpy), kernel launch (hipModuleLaunchKernel). gpu-cuda feature: CUDA Driver API + NVML for discovery, MIG detection, SM count. Exclusive and fractional share modes controlled via --gpu-share-mode. |
| Design documents | docs/aurora-gpu-design-document.md (Aurora GPU pod module architecture), docs/gpu-leasing-contract-draft.md (GPU leasing contract: authority, lifecycle, enforcement, choke points). |
| Status | GPU leasing implemented. Device detection, inventory, exclusive/fractional share modes, HIP compute (alloc, transfer, kernel launch) all functional behind feature flag. GPUDirect RDMA is architecture-specified (Aurora design document) but no GPUDirect hardware has been tested. |
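The sketch below shows the dynamic-loading pattern the gpu-hip feature uses, reduced to an allocate/free pair. The hipMalloc/hipFree signatures follow the public HIP runtime API; everything else (allocation size, error handling) is illustrative.

```rust
use libloading::{Library, Symbol};
use std::ffi::c_void;

// Public HIP runtime signatures: hipError_t hipMalloc(void**, size_t)
// and hipError_t hipFree(void*); hipSuccess == 0.
type HipMalloc = unsafe extern "C" fn(*mut *mut c_void, usize) -> i32;
type HipFree = unsafe extern "C" fn(*mut c_void) -> i32;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    unsafe {
        let hip = Library::new("libamdhip64.so")?;
        let hip_malloc: Symbol<HipMalloc> = hip.get(b"hipMalloc")?;
        let hip_free: Symbol<HipFree> = hip.get(b"hipFree")?;

        // Allocate 1 MiB of device memory under the lease...
        let mut dev_ptr: *mut c_void = std::ptr::null_mut();
        assert_eq!(hip_malloc(&mut dev_ptr, 1 << 20), 0, "hipMalloc failed");

        // ...and release it at teardown. Lease expiry drives this path
        // (hipFree, hipModuleUnload, device context release).
        assert_eq!(hip_free(dev_ptr), 0, "hipFree failed");
    }
    Ok(())
}
```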
3.7 CXL.mem
| Aspect | Detail |
|---|---|
| Resource type | Vendor extension (0x00FF) today; dedicated resource type code 0x0007 allocated |
| Native transport | CXL memory-mapped regions (load/store access via CPU memory map) |
| Binding material | Switch, port, and decoder identifiers (format TBD) |
| Teardown | Decoder rule removal; CXL switch port reconfiguration |
| Fenced fallback | Decoder rule disabled. No new mappings. |
| Implementation | Resource type code allocated in spec. |
| Status | Spec-allocated only. No implementation, no hardware. |
4. Evidence Budgets and Acceptance Criteria
Every premium dataplane binding must satisfy the following evidence requirements before it is considered production-grade. This section defines what must be measured, what must be published, and what distinguishes “proven” from “architecturally specified.”
4.1 Steady-State Path Requirements
After binding establishment, the data path MUST have:
- Zero extra copies introduced by fabricBIOS. The transport operates on its native buffers (ibverbs MR, nvme-tcp scatter-gather, GPU device memory).
- Zero extra fabricBIOS round trips. No control-plane messages between binding establishment and lease renewal/expiry/revocation.
- Native API only. The client uses the transport’s standard API (ibverbs verbs, nvme connect, HIP/CUDA calls) with no fabricBIOS shim in the I/O path.
4.2 Latency Measurements (Required Per Transport)
For each transport that reaches “implemented” status, publish:
| Metric | Description |
|---|---|
| Bind latency | Time from LEASE_ALLOC request to first successful data-plane I/O. Includes control-plane round trip, binding material exchange, and transport-specific setup (QP transition, configfs write, GPU context init). |
| Renew latency | Time for LEASE_RENEW round trip. Should be dominated by QUIC RTT. |
| Revoke latency | Time from REVOKE_BROADCAST receipt (or lease expiry) to data-plane teardown completion. Measures how quickly a stale client loses access. |
| Expiry-cutoff latency | Worst-case time between lease expiry and the last data-plane operation that succeeds. Bounded by grace period + teardown time. |
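As a minimal measurement sketch for the first metric (the two functions are hypothetical stand-ins for the real client calls):

```rust
use std::time::Instant;

fn lease_alloc() { /* control-plane round trip + binding exchange + transport setup */ }
fn first_native_io() { /* first RDMA WRITE, nvme read, or kernel launch */ }

fn main() {
    let t0 = Instant::now();
    lease_alloc();      // LEASE_ALLOC, binding material, QP transition /
                        // configfs write / GPU context init
    first_native_io();  // first successful data-plane operation
    println!("bind latency: {:?}", t0.elapsed());
}
```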
4.3 Throughput and Overhead Measurements (Required For Steady-State Claims)
The table below defines acceptance targets for transports that claim steady-state dataplane proof. These targets are not automatically satisfied by lifecycle automation alone. A transport may be:
- lifecycle-proven: bind, teardown, revoke, and fence behavior exercised
- steady-state-proven: native bulk I/O exercised with published throughput, latency, and CPU deltas versus direct setup
Every proof artifact must label which of those two scopes it actually covers.
| Metric | Description |
|---|---|
| Throughput delta | Throughput of fabricBIOS-managed transport vs. identical transport configured manually (direct nvme connect, direct QP setup, etc.). Target for steady-state-proof claims: less than 1% delta on steady-state bulk I/O. |
| Latency delta | Per-operation latency of fabricBIOS-managed transport vs. direct setup. Target for steady-state-proof claims: zero measurable delta on steady-state I/O (any difference is in bind/teardown, not data path). |
| CPU overhead delta | CPU utilization of fabricBIOS-managed transport vs. direct. Target for steady-state-proof claims: zero delta on steady-state I/O (fabricBIOS is not in the data path). |
Publishing bind/renew/revoke/expiry timing without a connected native I/O harness is still valuable evidence, but it only supports lifecycle-proof claims. It does not, by itself, justify a steady-state dataplane performance claim.
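A small sketch of the two scopes as an explicit label that evidence artifacts could carry (the enum is illustrative, not an existing fabricBIOS type):

```rust
/// Which claim a published proof artifact actually supports.
#[derive(Debug, Clone, Copy)]
enum ProofScope {
    /// Bind, teardown, revoke, and fence behavior exercised.
    LifecycleProven,
    /// Native bulk I/O exercised with published throughput, latency,
    /// and CPU deltas versus direct (non-fabricBIOS) setup.
    SteadyStateProven,
}

impl ProofScope {
    /// Only steady-state proof justifies a dataplane performance claim.
    fn supports_performance_claim(self) -> bool {
        matches!(self, ProofScope::SteadyStateProven)
    }
}

fn main() {
    assert!(!ProofScope::LifecycleProven.supports_performance_claim());
    assert!(ProofScope::SteadyStateProven.supports_performance_claim());
}
```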
4.4 Failure Behavior Measurements (Required Per Transport)
| Scenario | What to verify |
|---|---|
| Lease expiry | Data-plane I/O fails within grace period + teardown time. No stale access. |
| Node crash (provider) | Client-side I/O fails (transport timeout or error). Client must not silently succeed. |
| Node crash (consumer) | Provider-side lease expires normally. Resource becomes available for re-lease. No leaked state. |
| Network partition | Lease cannot be renewed. Expires on schedule. Teardown runs on provider side. |
| Teardown failure | Resource enters FENCED state. Reported in discovery. No new leases. Requires remediation. |
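The teardown-failure row translates directly into a test. The sketch below uses a hypothetical mock; the real suites exercise the free_or_fence-style paths in the platform crates.

```rust
#[derive(Debug, PartialEq)]
enum State { Active, Freed, Fenced }

struct Lease { state: State, teardown_fails: bool }

impl Lease {
    /// Fail closed: any teardown error quarantines the resource.
    fn free_or_fence(&mut self) {
        self.state = if self.teardown_fails { State::Fenced } else { State::Freed };
    }

    fn can_allocate(&self) -> bool {
        self.state != State::Fenced
    }
}

#[test]
fn teardown_failure_fences_and_blocks_new_leases() {
    let mut lease = Lease { state: State::Active, teardown_fails: true };
    lease.free_or_fence();
    assert_eq!(lease.state, State::Fenced);
    assert!(!lease.can_allocate()); // no new leases until remediation
}
```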
4.5 Evidence Classification
Each transport binding in the conformance matrix (Section 3) carries one of the following evidence levels:
| Level | Meaning |
|---|---|
| Proven | Bind/teardown/fencing tested on real or emulated hardware. Steady-state dataplane measurements published. Failure scenarios exercised. |
| Tested | Bind/teardown/fencing tested on real or emulated hardware. Some transports at this level may only have lifecycle/provisioning evidence; steady-state dataplane measurements may still be absent. |
| Code-complete | Implementation exists. Unit tests pass. No hardware or emulated-hardware testing yet. |
| Wire-format only | TLV encoding defined and roundtrip-tested. No platform implementation. |
| Spec-allocated | Resource type code reserved. No wire format or implementation. |
Current evidence levels:
| Transport | Evidence Level |
|---|---|
| RoCEv2 / soft-RoCE | Tested (soft-RoCE emulation) |
| InfiniBand | Code-complete (code path shared with RoCEv2; no InfiniBand hardware tested) |
| NVMe-oF/TCP | Tested (target lifecycle + binding only) |
| NVMe-oF/RDMA | Wire-format only |
| SR-IOV NIC | Code-complete (macvlan/TC tested; SR-IOV blocked on hardware) |
| GPU (HIP/ROCm) | Tested (AMD GPU with HIP runtime) |
| GPU (CUDA) | Code-complete (NVML discovery implemented; compute path behind feature flag) |
| GPUDirect RDMA | Spec-allocated (Aurora design document) |
| CXL.mem | Spec-allocated |
5. Trusted Fabric Posture
A trusted network (physically isolated, switch-only, no untrusted hosts) can simplify deployment, but only through the explicit non-default exception profile defined in docs/spec/fabricBIOS-design-document.md Section 8.5 and expanded in docs/spec/trusted-fabric-profile.md. This methodology does not redefine the normative secure default.
That exception profile permits a narrow set of setup-time relaxations, such as:
- omitting TLS on the control plane when the physical network provides equivalent confidentiality and integrity
- accepting unsigned discovery on a fully trusted segment
- relaxing internal firewalling on the trusted segment
A trusted network does not replace any of the following fabricBIOS invariants. These hold regardless of network trust level:
- Capability checks. Tokens are still validated (audience, expiry, signature) before granting leases.
- Lease expiry enforcement. Leases still expire on schedule. Background `tick_leases` still runs. There is no “infinite lease” mode.
- Fail-closed teardown. Teardown still runs on expiry or revocation. The same QP destroy / configfs removal / session invalidation sequence executes whether the network is trusted or not.
- Fencing on teardown failure. A resource that cannot be cleanly torn down is still fenced. Trusted network status does not grant a pass on teardown failures.
- Anti-replay. Nonce-based replay caches are still enforced on data-plane operations (FBMU/FBBU) even on trusted networks.
The rationale: trusted networks protect against some external attackers. They do not protect against bugs, misconfiguration, hardware faults, or leaked credentials. fabricBIOS’s lifecycle enforcement protects against these failure modes, and disabling it on a trusted network would create a class of failures that are harder to diagnose (silent stale access, resource leaks) than the overhead it saves.
6. References
- crates/fabricbios-core/src/binding.rs — DataplaneBinding, RdmaBinding, NvmeofBinding, SriovBinding TLV encode/decode
- crates/fabricbios-platform-linux/src/rdma.rs — RDMA implementation (ibverbs via libloading, soft-RoCE)
- crates/fabricbios-platform-linux/src/nvmet.rs — NVMe-oF target management (configfs)
- crates/fabricbios-platform-linux/src/nic.rs — SR-IOV VF manager, macvlan, TC bandwidth reservation
- crates/fabricbios-platform-linux/src/gpu.rs — GPU device detection, HIP/ROCm compute, CUDA/NVML discovery
- docs/spec/fabricBIOS-design-document.md — Normative spec: Sections 10 (Lease Management), 10.3 (Teardown), 10.4 (Fencing)
- docs/spec/resource-types.md — Resource type codes, inventory formats, enforcement mechanisms
- docs/aurora-gpu-design-document.md — Aurora composable GPU pod module design
- docs/gpu-leasing-contract-draft.md — GPU leasing contract (fabricBIOS / grafOS interface)
7. Security Without Datapath Tax
Premium dataplane throughput comes from native transport ownership of the hot path. fabricBIOS never interposes on steady-state I/O.
This section defines which security properties are mandatory regardless of deployment posture, which may be relaxed on trusted segments, and how every security enhancement is classified relative to the data path.
7.1 Mandatory Invariants (Never Negotiable)
These five invariants hold on every fabricBIOS deployment, whether the network is trusted, untrusted, or mixed:
- Identity of allocator. Every lease allocation is bound to a verified node identity. Anonymous or unverified allocators cannot obtain leases.
- Lease TTL enforcement. Every lease has a finite TTL. Background `tick_leases` runs unconditionally. There is no infinite-lease mode and no way to disable expiry.
- Revocation. Signed REVOKE_BROADCAST causes immediate teardown of the named lease. There is no “ignore revocation” mode.
- Fail-closed fencing. If teardown fails for any reason, the resource enters FENCED state. No new leases are granted until remediation (reset, power cycle) succeeds.
- Auditability. Bind, renew, revoke, expire, and fence events are logged with timestamps, node identities, and lease identifiers. The audit trail exists regardless of network trust level.
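A minimal sketch of the unconditional expiry sweep behind the TTL invariant. The data structures are hypothetical; the invariant it illustrates is that the sweep always runs and every expired lease ends Freed or Fenced.

```rust
use std::time::Instant;

enum LeaseState { Active, Freed, Fenced }

struct Lease { expires_at: Instant, state: LeaseState }

impl Lease {
    /// Transport-specific teardown (QP destroy, configfs removal,
    /// session invalidation); stubbed out here.
    fn teardown(&mut self) -> Result<(), ()> { Ok(()) }
}

/// Runs from a background loop on every node; there is no off switch.
fn tick_leases(leases: &mut [Lease], now: Instant) {
    for lease in leases.iter_mut() {
        if matches!(lease.state, LeaseState::Active) && now >= lease.expires_at {
            lease.state = match lease.teardown() {
                Ok(()) => LeaseState::Freed,
                Err(()) => LeaseState::Fenced, // fail closed
            };
        }
    }
}

fn main() {
    let mut leases = vec![Lease { expires_at: Instant::now(), state: LeaseState::Active }];
    tick_leases(&mut leases, Instant::now());
}
```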
7.2 Optional Trusted-Segment Relaxations
On physically isolated networks where all participants are trusted, the following may be relaxed via explicit configuration (see docs/spec/trusted-fabric-profile.md):
- Discovery signing policy. ANNOUNCE and SOLICIT frames may omit Ed25519 signatures when all nodes on the L2 segment are trusted. The `signed_announce_required` and `signed_solicit_required` config knobs control this.
- Control-plane transport posture. TLS on the QUIC control plane may be omitted if the physical network provides equivalent confidentiality and integrity. The `tls_required` config knob controls this.
- Compatibility dataplane auth posture. Per-message integrity checks on discovery frames may be skipped when the L2 provides them (e.g., MACsec).
These relaxations affect setup latency and operational complexity. They do not affect steady-state dataplane throughput, which is transport-native regardless of control-plane security posture.
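A sketch of the relaxation knobs gathered into a config struct. The field names are the ones this section names; the struct shape and defaults are illustrative, with the secure posture as the default.

```rust
/// Illustrative trusted-fabric relaxation knobs; not the actual
/// fabricBIOS config type.
struct TrustedFabricProfile {
    /// Require Ed25519 signatures on ANNOUNCE frames.
    signed_announce_required: bool,
    /// Require Ed25519 signatures on SOLICIT frames.
    signed_solicit_required: bool,
    /// Require TLS on the QUIC control plane.
    tls_required: bool,
}

impl Default for TrustedFabricProfile {
    fn default() -> Self {
        // Secure default: relaxations are explicit, never implicit.
        Self {
            signed_announce_required: true,
            signed_solicit_required: true,
            tls_required: true,
        }
    }
}

fn main() {
    let p = TrustedFabricProfile::default();
    assert!(p.signed_announce_required && p.signed_solicit_required && p.tls_required);
}
```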
7.3 Security Enhancement Classification
Every security enhancement for premium dataplanes must be classified into exactly one of the following categories. Enhancements in the “forbidden” category are rejected unconditionally.
| Enhancement | Classification | Rationale |
|---|---|---|
| Identity verification | Bind-time only | Verified once at LEASE_ALLOC. Not checked per I/O. |
| Capability token validation | Bind-time only | Token presented and validated at lease creation. Not re-validated per I/O. |
| Lease creation | Bind-time only | Single control-plane round trip. |
| Lease renewal | Renewal-time only | Periodic LEASE_RENEW, frequency governed by TTL. Not on data path. |
| Anti-replay check | Bind-time only | Nonce validated on FBMU/FBBU control operations, not on native transport I/O. |
| Binding material exchange | Bind-time only | rkey, NQN, VF address, GPU session credentials exchanged once at bind. |
| Transport credential issuance | Bind-time only | Native credentials (rkey, host NQN, VF MAC) issued at bind, consumed by transport directly. |
| Lease expiry enforcement | Revocation-time only | Background tick_leases triggers teardown. No data-path involvement. |
| Revocation broadcast | Revocation-time only | Signed control-plane message triggers teardown. |
| Fencing | Revocation-time only | Applied after teardown failure. No data-path involvement. |
| Teardown execution | Revocation-time only | Transport-specific cleanup (QP destroy, configfs removal, VF release). |
| Stale-access detection | Observability-only | Counters and logs for post-revoke/expiry access attempts. No blocking on data path. |
| Audit logging | Observability-only | Background event recording. No data-path involvement. |
| Per-I/O authorization | FORBIDDEN (hot path) | fabricBIOS must not authorize individual read/write/send operations. Native transport handles access control via binding material (rkey scope, NQN allowlist, VF isolation). |
| Per-I/O encryption by fabricBIOS | FORBIDDEN (hot path) | Use native transport encryption (RDMA over IPsec, NVMe-oF/TLS, MACsec, GPU link encryption). fabricBIOS must not add an encryption layer on steady-state I/O. |
8. Design Checklist for Premium Dataplane Security
Before accepting any security-related change to the premium dataplane path, apply this checklist. If any of the first four items is “yes,” the change must be rejected or redesigned.
- Does this add a per-I/O round trip? If yes, reject.
- Does this add a per-I/O copy? If yes, reject.
- Does this require a long-lived mediator process on the data path? If yes, reject.
- Does this duplicate an authorization primitive the native transport already has? If yes, reject or redesign to attenuate into the native primitive.
- Does this add latency only at bind time, renewal time, or revocation time? If yes, measure and publish the budget, then proceed.
- Does this add CPU cost only in background observability paths? If yes, proceed.