
Lease primitive

A lease is a time-bounded grant of access to a fabric resource. It is the foundation primitive of grafOS: every byte of memory you read, every block you write, every GPU you submit work to, every UDP packet you send across a leased interface — all of it happens under a lease.

If you understand leases, the rest of grafOS follows.

What’s in a lease

A lease binds three things together:

  1. A resource. A specific region of fabric memory, a specific block range, a specific GPU, a specific network interface — addressable by a ResourceId.
  2. A scope. Who’s allowed to use it: a tenant identity, optionally narrowed further by capability caveats (program path, hardware caveats, etc.).
  3. A TTL. A wall-clock deadline. After the deadline, the fabric forces teardown of the data-plane binding.

The TTL is the load-bearing piece. A Linux file descriptor stays open until you close it or the process exits; a grafOS lease expires whether or not your code is paying attention.
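As a rough illustration, the three parts of a lease can be modeled as an immutable record. This is a toy Python model, not the grafOS SDK: the names Scope, Lease and expired, the string form of the resource id, and the choice to treat the deadline instant itself as expired are all assumptions made for the sketch.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Scope:
    tenant: str                            # tenant identity
    caveats: FrozenSet[str] = frozenset()  # optional capability caveats


@dataclass(frozen=True)
class Lease:
    resource_id: str  # stands in for a ResourceId (hypothetical string form)
    scope: Scope
    deadline: float   # wall-clock deadline, seconds since the epoch

    def expired(self, now: float) -> bool:
        # The deadline is absolute; nothing the holder does moves it.
        # Assumption: the deadline instant itself already counts as expired.
        return now >= self.deadline


lease = Lease(
    resource_id="mem/region-7",
    scope=Scope("tenant-a", frozenset({"program=/bin/job"})),
    deadline=1_000.0,
)
```

The record is frozen on purpose: a lease whose deadline could be mutated in place would no longer be time-bounded.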

How a lease is created

Three things happen in sequence on LEASE_ALLOC:

  1. The scheduler admits the request against the requesting tenant’s quota and the target cell’s free capacity.
  2. The cell’s CapBroker mints a capability token scoped to this lease and returns it to the requester.
  3. The cell creates the data-plane binding (RDMA QP, NVMe namespace, GPU context, network flow rule) and surfaces a connection coordinate.

You hold the capability token. The token is what you present on every subsequent data-plane operation; the fabric verifies the token against the lease state on every op. The token’s TTL never exceeds the lease TTL.
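The allocation sequence and the per-op token check can be sketched as a small in-memory model. Everything here is hypothetical (the Cell class, the lease_alloc and verify signatures, the quota_left and token_ttl parameters, the coordinate string); it only mirrors the three steps above and the rule that a token never outlives its lease.

```python
import secrets
from dataclasses import dataclass


class LeaseDenied(Exception):
    """Admission refused: no quota or no capacity."""


@dataclass(frozen=True)
class Grant:
    lease_id: int
    token: str
    token_deadline: float
    lease_deadline: float
    coordinate: str  # hypothetical connection coordinate


class Cell:
    def __init__(self, capacity: int):
        self.free = capacity
        self.bindings = {}  # lease_id -> (token, token_deadline)
        self.next_id = 1

    def lease_alloc(self, quota_left, size, ttl, token_ttl, now):
        # Step 1: admit against tenant quota and the cell's free capacity.
        if size > quota_left or size > self.free:
            raise LeaseDenied("no quota or no capacity")
        self.free -= size
        lease_id, self.next_id = self.next_id, self.next_id + 1
        lease_deadline = now + ttl
        # Step 2: mint a token scoped to this lease; cap its TTL at the lease TTL.
        token = secrets.token_hex(16)
        token_deadline = min(now + token_ttl, lease_deadline)
        # Step 3: record the data-plane binding and surface a coordinate.
        self.bindings[lease_id] = (token, token_deadline)
        return Grant(lease_id, token, token_deadline, lease_deadline,
                     f"cell0/binding/{lease_id}")

    def verify(self, lease_id, token, now):
        # Checked on every op: the token must match and still be live.
        entry = self.bindings.get(lease_id)
        if entry is None:
            return False
        live_token, token_deadline = entry
        return secrets.compare_digest(token, live_token) and now < token_deadline


cell = Cell(capacity=10)
grant = cell.lease_alloc(quota_left=4, size=4, ttl=10.0, token_ttl=60.0, now=0.0)
```

Asking for a 60-second token on a 10-second lease still yields a token that dies with the lease, because the token deadline is clamped in step 2.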

Why expiry forces teardown

In a long-running cluster, leases expire or are pulled constantly, for many reasons (the tenant deletes the program, the scheduler revokes for higher-priority load, the network partitions, a renewal is rejected by quota). The system needs a single, simple answer for “what happens to in-flight access when a lease expires?”

grafOS picks: teardown is mandatory, not advisory.

When a lease expires:

  • The cell tears down the data-plane binding.
  • New ops fail with a typed LeaseRevoked (or equivalent kind-specific code).
  • In-flight ops fail with the same code; bytes already on the wire are not “saved” — the program sees the failure and decides.

If the cell can’t tear down cleanly (hardware error, software hang), the resource transitions to the FENCED state. No new leases are admitted on a fenced resource until it is observably reset. This is the sharpest guarantee grafOS makes: an error in teardown becomes a hard refusal to lease, not a silent corruption window.
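The expiry and fencing rules amount to a small state machine, sketched here in Python. The FREE and LEASED state names, the Resource class, and the teardown hook are assumptions for illustration; only FENCED and LeaseRevoked come from the text above. Tokens are left out to keep the sketch short.

```python
from enum import Enum


class State(Enum):
    FREE = "free"
    LEASED = "leased"
    FENCED = "fenced"


class LeaseDenied(Exception):
    """Admission refused (here: the resource is fenced or already leased)."""


class LeaseRevoked(Exception):
    """The binding is gone; new and in-flight ops both see this."""


class Resource:
    def __init__(self, teardown):
        self.state = State.FREE
        self.deadline = None
        self._teardown = teardown  # hypothetical hook; raises if teardown fails

    def admit(self, deadline):
        if self.state is not State.FREE:
            raise LeaseDenied(self.state.value)
        self.state, self.deadline = State.LEASED, deadline

    def tick(self, now):
        """Enforce the deadline whether or not the holder is paying attention."""
        if self.state is State.LEASED and now >= self.deadline:
            try:
                self._teardown()
                self.state = State.FREE
            except Exception:
                # Teardown failed: refuse to lease rather than risk corruption.
                self.state = State.FENCED

    def op(self, now):
        self.tick(now)
        if self.state is not State.LEASED:
            raise LeaseRevoked("lease no longer bound")
        return "ok"

    def reset(self):
        """Stands in for an observable reset of a fenced resource."""
        if self.state is State.FENCED:
            self.state = State.FREE
```

The only way out of FENCED in this model is reset(); admit() never clears it, which is the point of the guarantee.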

Lease renewal vs new lease

There’s no “extend” — every lease has a fixed deadline at creation time. To keep using a resource, you ask for a renewal before the deadline. A renewal is a fresh lease with a fresh capability token; the old token is invalidated when the renewal succeeds.

The SDK helper grafos-leasekit handles the renewal cadence — you pick a budget, it polls and renews. Most programs never call lease management directly; the typed wrappers in grafos-std and grafos-collections do it for you.
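To make the renewal semantics concrete, here is a minimal Python model of "renewal is a fresh lease with a fresh token". It is not the grafos-leasekit API: Fabric, renew, renewal_due and the margin parameter are hypothetical names, and the margin stands in loosely for the budget mentioned above.

```python
import secrets
from dataclasses import dataclass


class LeaseExpired(Exception):
    """The deadline passed before a renewal happened."""


class StaleToken(Exception):
    """The presented token is no longer the live one."""


@dataclass(frozen=True)
class Lease:
    token: str
    deadline: float  # fixed at creation; there is no extend


class Fabric:
    def __init__(self):
        self.live_tokens = set()

    def alloc(self, now, ttl):
        lease = Lease(secrets.token_hex(8), now + ttl)
        self.live_tokens.add(lease.token)
        return lease

    def _check(self, lease, now):
        if lease.token not in self.live_tokens:
            raise StaleToken("token was superseded or never issued")
        if now >= lease.deadline:
            raise LeaseExpired("deadline passed")

    def renew(self, old, now, ttl):
        self._check(old, now)                # must renew before the deadline
        fresh = self.alloc(now, ttl)         # a new lease with a new token
        self.live_tokens.discard(old.token)  # old token dies once renewal succeeds
        return fresh

    def op(self, lease, now):
        self._check(lease, now)
        return "ok"


def renewal_due(lease, now, margin):
    # Renew once we are within `margin` seconds of the deadline.
    return now >= lease.deadline - margin
```

A caller would poll renewal_due and swap in the lease returned by renew; any code still holding the old Lease object gets StaleToken on its next op.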

Failure modes worth knowing

  • Lease denied at admission. The scheduler refused (no quota, no capacity). This is a runtime error you handle.
  • Lease expired. The TTL passed before you renewed. Same error class as denial — you handle it or fail.
  • Lease revoked. The fabric proactively pulled the lease (resource fenced, tenant suspended, operator revocation). Surfaces as ResourceFenced or LeaseRevoked events on the program’s event queue.
  • Stale token. Op presented an old token after a renewal. Fail closed, refresh, retry.

The SDK surfaces all four as typed errors. There is no untyped “something went wrong” path — every lease-related failure has a kind.
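One way to picture "every failure has a kind" is an exception hierarchy with one class per failure mode, plus a helper for the stale-token case. The class names, run_with_refresh, and its parameters are assumptions for this sketch and are not the SDK's actual error types.

```python
class LeaseError(Exception):
    """Base class: every lease-related failure is one of the kinds below."""


class LeaseDenied(LeaseError):
    """Refused at admission (no quota, no capacity)."""


class LeaseExpired(LeaseError):
    """The TTL passed before a renewal."""


class LeaseRevoked(LeaseError):
    """The fabric proactively pulled the lease."""


class StaleToken(LeaseError):
    """An old token was presented after a renewal."""


def run_with_refresh(op, refresh_token, token, max_retries=1):
    """Run op(token); on StaleToken, fetch a fresh token and retry.

    Any other LeaseError propagates to the caller, who handles it or fails.
    """
    for attempt in range(max_retries + 1):
        try:
            return op(token)
        except StaleToken:
            if attempt == max_retries:
                raise  # fail closed once the retry budget is spent
            token = refresh_token()
```

Only the stale-token case is retried automatically in this sketch; denial, expiry and revocation are left to the caller because the right response depends on the program.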

Where to next