Skip to content

Recipe 11: A Development Environment That Leases a GPU for Five Minutes

Situation

Interactive workflows (notebooks, REPLs, “cells”) alternate between:

  • long stretches of CPU-only exploration
  • short bursts of GPU compute

If the unit of allocation is “a GPU instance”, you pay for idle time and suffer slow startup.

In a lease model, you can allocate VRAM for only as long as you actually use it, and you can make “GPU ended” a first-class, explicit state transition rather than an operational failure.

What You Build

An interactive environment where:

  • GPU is acquired per session/cell as a short TTL lease.
  • The environment renews while active.
  • Long runs checkpoint state to block storage so preemption is cheap.

This recipe complements Recipe 18 (multi-user GPU studio) by focusing on the single-user ergonomics and the preemption/checkpoint loop.

Building Blocks

  • grafos_std::gpu::GpuBuilder
  • grafos_std::block::BlockBuilder for checkpoints
  • grafos_collections::durable::Durable<T> for model state snapshots
  • grafos_observe for per-cell metering

Related API docs:

Design

Preemptible GPU Sessions

Treat the GPU lease as ephemeral:

  • default TTL: ~5 minutes
  • renew while actively computing
  • if renewal fails, stop and surface a structured error

The user experience should be:

  • “GPU session ended, rerun this cell to reacquire”

not:

  • “everything is slow and broken”

Checkpointing

To make preemption normal:

  • checkpoint model state frequently (after each epoch or N steps)
  • store checkpoints in block lease

If the GPU lease ends, the next cell:

  1. acquires a new GPU lease
  2. restores from checkpoint
  3. continues

Walkthrough (Implementation Sketch)

1. Acquire GPU Lease

let gpu = fabric.alloc_gpu()
.min_vram(2 * 1024 * 1024 * 1024)
.lease_secs(300)
.acquire()?;

2. Renew While Active

Use a renewal policy like:

  • renew when remaining TTL < 25% (or every minute)

If renew fails:

  • abort current cell
  • checkpoint if possible

3. Checkpoint Model State

let ckpt_lease = BlockBuilder::new().min_blocks(512).lease_secs(3600).acquire()?;
let mut d: Durable<ModelState> = Durable::new(initial_state, ckpt_lease);
// periodically:
d.checkpoint()?;

4. Restore After Preemption

After a session ends:

let d2: Durable<ModelState> = Durable::restore(existing_ckpt_lease)?;

Failure Modes

  • LeaseExpired mid-run: treat as preemption; surface clear message.
  • Disconnected: retry or fail with “GPU unavailable”.
  • checkpoint failure: fail closed; user should not believe progress was saved.

Observability

Track per-cell:

  • VRAM leased
  • GPU-seconds consumed
  • kernel launches
  • checkpoint bytes written

This is what enables “cost per cell” UX.

Variations

  • multi-GPU: acquire multiple leases for data parallelism
  • interactive batching: batch small requests to amortize lease acquisition