RDMA Lease Revoke Semantics
Status: descriptive. Documents the currently-shipped behavior of RDMA lease revocation in fabricBIOS, with particular attention to the split between the authoritative control-plane contract (closed) and the additive dataplane-observability property for non-cooperating clients (investigated + baselined, not tuned further per reviewer guidance).
Scope: applies to the x86-uefi bare-metal firmware driving a ConnectX-5 over VFIO passthrough. The conclusions below are CX-5 firmware 16.35.4506 specific where noted; the control-plane contract is transport-agnostic.
Related:
- docs/spec/fabricbios-wire-encoding-v0.md: LEASE_REVOKE op encoding (0x020A).
- TODO.md items 10958 (control-plane contract, closed) and 10959 (dataplane property, baselined).
- crates/fabricbios-x86-uefi/src/rdma_backend.rs::Mlx5RdmaBackend::lease_revoke_outcome.
- crates/fabricbios-x86-uefi/src/mlx5_hw.rs::rdma_lease_revoke_fence, rdma_lease_sweep_fenced.
1. Two Distinct Properties
RDMA lease revocation has two separable observability paths. Keeping them distinct is critical — failing to do so historically caused the team to treat a hardware characteristic as a tuning problem.
1.1 Control-plane authoritative recall (CLOSED)
Property: after a LEASE_REVOKE (op 0x020A) request, the server
returns a structured outcome code within ≤ 1 s, authoritatively
declaring the lease recalled:
| Code | Meaning |
|---|---|
| TornDown (0) | Clean teardown initiated; slot will be released once deferred destroy sweeps |
| Fenced (1) | Partial teardown failure; slot permanently consumed (fail-closed) |
| NotFound (2) | No active lease with this id |
Observed cross-host on bare-metal x86-uefi firmware with ConnectX-5 (firmware 16.35.4506) via the dataplane verifier harness: ≈ 12 ms recall latency (includes a synchronous flow-steering FTE install; bare 2ERR_QP + DESTROY_MKEY is ≈ 1.5 ms).
A cooperating runtime acts on this outcome immediately — it must
stop issuing RDMA ops to the revoked lease’s QP/rkey once the
TornDown or Fenced outcome is received. This is the primary
revoke semantic and is unchanged by anything below.
1.2 Dataplane observability for non-cooperating clients (BASELINED)
Property (desired): a sustained ibv_post_send(IBV_WR_RDMA_WRITE)
loop by a client that IGNORES the control-plane outcome should see a
completion error within a bounded sub-second interval after the
server’s revoke.
Property (current baseline on CX-5 fw 16.35.4506): ≈ 1.4 s
dataplane observation latency, status IBV_WC_REM_ACCESS_ERR (12).
This is a hardware floor, not a tuning target. See §3 for the
mechanism.
2. Teardown Pipeline
The revoke teardown is split into two phases to give clean resource lifecycle and to provide a hook point for any future stricter dataplane work (§5 option c).
2.1 Immediate phase (rdma_lease_revoke_fence)
Runs synchronously in the LEASE_REVOKE request handler:
1. 2ERR_QP (opcode 0x0507): transitions the QP to ERR state.
2. DESTROY_MKEY (opcode 0x0202): invalidates the rkey.
3. (If RDMA_RX FT setup succeeded at boot) SET_FLOW_TABLE_ENTRY on the pre-configured misc.bth_opcode DROP group, indexed by the lease slot. Best-effort; does NOT buy latency on this firmware (§3) but is the skeleton for option (c).
4. Mark slot pending_destroy_at_secs = now + GRACE_SECS (default 5 s).
Outcome: TornDown (all three of 1–3 ok) or Fenced (2ERR_QP or
DESTROY_MKEY failed). Returns via LEASE_REVOKE wire response.
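The outcome decision of the immediate phase can be sketched as follows. This is an illustrative model, not the code in rdma_lease_revoke_fence: the FenceSteps and Slot types are hypothetical, and the sketch assumes that a failed best-effort FTE install does not fence the slot and that a fenced slot is not scheduled for the deferred sweep.

```rust
/// Default grace period before the deferred sweep may run (5 s per the text).
pub const GRACE_SECS: u64 = 5;

#[derive(Debug, PartialEq, Eq)]
pub enum Outcome {
    TornDown,
    Fenced,
}

/// Results of the HCA commands issued in the immediate phase (hypothetical type).
pub struct FenceSteps {
    pub qp_2err_ok: bool,
    pub destroy_mkey_ok: bool,
    /// None when the RDMA_RX flow table was not set up at boot.
    pub fte_install_ok: Option<bool>,
}

/// Minimal per-lease slot state (hypothetical type).
pub struct Slot {
    pub pending_destroy_at_secs: Option<u64>,
    pub fenced: bool,
}

pub fn immediate_phase(slot: &mut Slot, steps: &FenceSteps, now_secs: u64) -> Outcome {
    // Assumption: the FTE install is best-effort and does not change the outcome.
    let _ = steps.fte_install_ok;

    if !(steps.qp_2err_ok && steps.destroy_mkey_ok) {
        // Fail closed: the slot is consumed and never handed out again.
        slot.fenced = true;
        return Outcome::Fenced;
    }
    // Clean teardown: schedule the deferred destroy after the grace period.
    slot.pending_destroy_at_secs = Some(now_secs + GRACE_SECS);
    Outcome::TornDown
}
```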
2.2 Deferred phase (rdma_lease_sweep_fenced)
Runs from the QUIC main-loop idle path (tick_rdma_sweep) on
pending-destroy slots past their deadline:
1. DELETE_FLOW_TABLE_ENTRY for the slot’s installed FTE.
2. 2RST_QP (opcode 0x050A): drains QP from ERR to RESET.
3. DESTROY_QP (opcode 0x0501): releases the QP slot on the HCA.
4. Clear slot (reusable for new lease) OR fence (if 2RST or DESTROY failed).
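A sweep pass over the slot table might look like the sketch below. The SlotState enum and the sweep function are hypothetical; the destroy closure stands in for the HCA command sequence listed above, and treating a slot as due when now is greater than or equal to its deadline is an assumption.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SlotState {
    Free,
    Active,
    PendingDestroy { deadline_secs: u64 },
    Fenced,
}

/// Run one sweep pass and return how many slots were processed.
/// `destroy(idx)` stands in for DELETE_FLOW_TABLE_ENTRY, 2RST_QP and
/// DESTROY_QP on slot `idx`; it returns false if 2RST_QP or DESTROY_QP failed.
pub fn sweep(
    slots: &mut [SlotState],
    now_secs: u64,
    mut destroy: impl FnMut(usize) -> bool,
) -> usize {
    let mut processed = 0;
    for (idx, slot) in slots.iter_mut().enumerate() {
        if let SlotState::PendingDestroy { deadline_secs } = *slot {
            if now_secs >= deadline_secs {
                // Success frees the slot for reuse; failure fences it permanently.
                *slot = if destroy(idx) {
                    SlotState::Free
                } else {
                    SlotState::Fenced
                };
                processed += 1;
            }
        }
    }
    processed
}
```

Slots whose deadline has not yet passed are left untouched, so calling this from an idle path repeatedly is harmless.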
2.3 Fail-closed fencing
If any destroy command fails, the slot is permanently fenced:
RdmaLeaseTable::fence(slot, failed_mask, origin) preserves the
failure context on the slot for post-mortem. FENCE_ORIGIN_*
constants distinguish LEGACY (synchronous path), REVOKE (immediate
phase), and SWEEP (deferred phase). fenced_summary() exposes a
diagnostic view. Fenced slots are not reused until firmware reboot.
Deterministic regression test: [FENCE-TEST] in main.rs (gated by
bringup-tests feature) corrupts mkey_index to 0x00FF_FFFF,
which forces DESTROY_MKEY to return BAD_PARAM on the HCA, and
asserts the slot becomes fenced=true, active=false, table.fenced_count +1.
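The fail-closed bookkeeping can be illustrated with a small stand-in table. This is a sketch under assumptions, not RdmaLeaseTable itself: LeaseTable, the FAILED_* bit values, the FenceOrigin enum (standing in for the FENCE_ORIGIN_* constants), and the use of Vec are all hypothetical choices for a hosted example.

```rust
/// Which path fenced the slot (stand-in for the FENCE_ORIGIN_* constants).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FenceOrigin {
    Legacy,
    Revoke,
    Sweep,
}

// Hypothetical failure bits; the real mask layout is not given in this document.
pub const FAILED_2ERR_QP: u8 = 1 << 0;
pub const FAILED_DESTROY_MKEY: u8 = 1 << 1;
pub const FAILED_DESTROY_QP: u8 = 1 << 2;

#[derive(Debug, Clone, Copy, Default)]
pub struct Slot {
    pub active: bool,
    pub fenced: bool,
    pub failed_mask: u8,
    pub origin: Option<FenceOrigin>,
}

pub struct LeaseTable {
    pub slots: Vec<Slot>,
    pub fenced_count: usize,
}

impl LeaseTable {
    pub fn new(n: usize) -> Self {
        Self { slots: vec![Slot::default(); n], fenced_count: 0 }
    }

    /// Hand out the first slot that is neither active nor fenced.
    pub fn alloc(&mut self) -> Option<usize> {
        let idx = self.slots.iter().position(|s| !s.active && !s.fenced)?;
        self.slots[idx].active = true;
        Some(idx)
    }

    /// Permanently fence a slot, keeping the failure context for post-mortem.
    pub fn fence(&mut self, slot: usize, failed_mask: u8, origin: FenceOrigin) {
        let s = &mut self.slots[slot];
        if !s.fenced {
            s.fenced = true;
            s.origin = Some(origin);
            self.fenced_count += 1;
        }
        s.active = false;
        s.failed_mask |= failed_mask;
    }

    /// Diagnostic view: (slot index, failed mask, origin) for every fenced slot.
    pub fn fenced_summary(&self) -> Vec<(usize, u8, FenceOrigin)> {
        self.slots
            .iter()
            .enumerate()
            .filter(|(_, s)| s.fenced)
            .filter_map(|(i, s)| s.origin.map(|o| (i, s.failed_mask, o)))
            .collect()
    }
}
```

The key property is that alloc never returns a fenced slot, which matches the rule that fenced slots stay out of circulation until firmware reboot.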
3. CX-5 Dataplane Observability Floor
3.1 Empirical observation
Across every tested approach — bare 2ERR_QP + DESTROY_MKEY, the
immediate-fence + deferred-destroy split, flow-table DROP on
misc.bth_dst_qp, flow-table DROP on misc.bth_opcode=0x0A (RDMA
WRITE Only) — the measured dataplane first-fail latency is ≈ 1.4 s.
The failing completion status is consistently IBV_WC_REM_ACCESS_ERR
(12), not IBV_WC_REM_OP_ERR (11) that a true QP-in-ERR NAK would
produce.
3.2 Mechanism (hypothesis, consistent with all observations)
The HCA maintains a per-QP fast-path dispatch cache. Once traffic is flowing to a QP:
- Installed flow rules do not invalidate the cache. Proven: the same bth_opcode=0x0A DROP rule that intercepts every WRITE when installed PRE-traffic does NOT intercept inflight WRITEs when installed MID-stream (during revoke).
- Only QP state transitions through 2ERR → 2RST → DESTROY_QP invalidate the cache, and that sequence itself has the ≈ 1.4 s drain floor.
Additional confirming evidence:
- misc.bth_dst_qp matching is non-functional for generic RoCEv2 dispatch despite the ft_field_support.bth_dst_qp=1 capability bit. Kernel mlx5_ib only uses this field when the IB_QP_CREATE_SOURCE_QPN underlay is configured (drivers/infiniband/hw/mlx5/fs.c:995-1003). Any value (388 = actual qpn, 0, 0xAAAAAA) left traffic flowing unchanged.
- SET_FLOW_TABLE_ENTRY on an already-occupied flow_index returns status=0x08 (BAD_INDEX): there is no atomic modify-in-place.
- Pre-installing an ALLOW rule and switching to DROP at revoke ran into the same floor (not directly tested to completion, but the HCA-cache model predicts it and the reviewer explicitly flagged this workaround pattern as not philosophically preferred).
3.3 Baseline acceptance, not tuning
Per reviewer guidance:
The honest conclusion is that this firmware/hardware path appears to impose a real floor for mid-stream dataplane observability. … Treat that as the baseline (a). If we decide the stricter non-cooperating-client property is worth the lift, pursue the real mlx5-native underlay / SOURCE_QPN path (c) rather than a workaround-first approach.
The cross-host regression test
(lease_alloc_rdma_verify --revoke-test) now reports the dataplane
latency informationally with a 2000 ms regression budget. Exceeding
that budget is a signal of genuine regression (lost ack path, new
cache behavior, misconfiguration); within it is the known baseline.
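The informational budget check amounts to a single comparison, sketched below. The names DATAPLANE_BUDGET, Verdict, and classify are hypothetical and are not taken from lease_alloc_rdma_verify; treating a latency exactly equal to the budget as within baseline is also an assumption.

```rust
use std::time::Duration;

/// Regression budget for dataplane first-fail latency (2000 ms per the text).
pub const DATAPLANE_BUDGET: Duration = Duration::from_millis(2000);

#[derive(Debug, PartialEq, Eq)]
pub enum Verdict {
    /// At or under the budget: the known hardware baseline.
    WithinBaseline,
    /// Over the budget: a signal of genuine regression worth investigating.
    Regression,
}

/// Classify a measured first-fail latency against the budget.
pub fn classify(first_fail: Duration) -> Verdict {
    if first_fail <= DATAPLANE_BUDGET {
        Verdict::WithinBaseline
    } else {
        Verdict::Regression
    }
}
```

With the ≈ 1.4 s baseline this leaves roughly 600 ms of headroom before a run is flagged.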
4. Diagnostic Infrastructure
Four feature-gated probes are preserved in the firmware source so any future investigation can reproduce the key findings:
| Feature | Install | Purpose |
|---|---|---|
| nic-rx-drop-probe | NIC_RX DROP on UDP/4791 at boot | Proves NIC_RX is BYPASSED for RoCEv2 on this firmware (verifier WRITE still succeeds), which eliminates that steering domain |
| rdma-rx-ft-probe | Standalone CREATE_FLOW_TABLE on RDMA_RX | Proves FT creation with table_type=0x07 succeeds (contradicts prior “silent null” claim) |
| rdma-rx-catchall-drop | RDMA_RX match-anything DROP at boot | Proves the FT IS in the active dispatch path (verifier WRITE fails immediately pre-traffic) |
| rdma-rx-bth-mismatch-probe | RDMA_RX misc.bth_opcode=0x0A DROP at boot | Proves MISC parsing works, bth_opcode extraction works, DROP action works, all PRE-traffic |
Each probe is enabled by building the bare-metal firmware with the feature flag in the table above and running the dataplane verifier harness against the resulting image.
5. Resolution Options (if Stricter Property Is Ever Justified)
(a) Accept baseline (current posture)
The documented ≈ 1.4 s dataplane floor on this firmware revision is a hardware characteristic. Cooperating clients are unaffected, because the control-plane contract is the primary revoke channel. This is the current answer.
(b) Pre-provision at LEASE_ALLOC
Install an ALLOW flow rule at LEASE_ALLOC (before the QP gets hot),
then DELETE+SET to DROP at revoke. Theoretical — may or may not
bypass the HCA cache. Not philosophically preferred per reviewer.
Acceptable only as an explicit mitigation experiment.
(c) SOURCE_QPN / underlay-QPN native path
The mlx5-native architectural answer. Per-lease underlay QP
allocation via IB_QP_CREATE_SOURCE_QPN; flow rules attached to the
underlay QP BEFORE the main QP is hot; misc.bth_dst_qp matching
then engages the cache-invalidation machinery the hardware was
designed for. Substantial machinery (per-lease extra QP + TIR + RQ
wrapper). See kernel drivers/infiniband/hw/mlx5/fs.c:1374 for the
reference implementation. This is the real path if the stricter
property is declared worth pursuing.
(d) Kernel-trace cache-invalidation deep dive
If options (a)-(c) are all unsatisfactory, capture kernel mlx5_ib
flow-steering byte sequences during actual IPoIB / underlay setup via
dump_command dynamic debug on mlx5_core cmd.c, and replicate the
EXACT sequence including any cache-invalidation commands not yet
identified. High-effort; last resort.