Recipe 11: A Development Environment That Leases a GPU for Five Minutes
Situation
Interactive workflows (notebooks, REPLs, “cells”) alternate between:
- long stretches of CPU-only exploration
- short bursts of GPU compute
If the unit of allocation is “a GPU instance”, you pay for idle time and suffer slow startup.
In a lease model, you can allocate VRAM for only as long as you actually use it, and you can make “GPU ended” a first-class, explicit state transition rather than an operational failure.
What You Build
An interactive environment where:
- GPU is acquired per session/cell as a short TTL lease.
- The environment renews while active.
- Long runs checkpoint state to block storage so preemption is cheap.
This recipe complements Recipe 18 (multi-user GPU studio) by focusing on the single-user ergonomics and the preemption/checkpoint loop.
Building Blocks
- grafos_std::gpu::GpuBuilder
- grafos_std::block::BlockBuilder for checkpoints
- grafos_collections::durable::Durable<T> for model state snapshots
- grafos_observe for per-cell metering
Related API docs:
- GPU leasing and submission API (source)
- grafos-tensor guide
- Durable implementation (source)
- grafos-observe guide
Design
Preemptible GPU Sessions
Treat the GPU lease as ephemeral:
- default TTL: ~5 minutes
- renew while actively computing
- if renewal fails, stop and surface a structured error
The user experience should be:
- “GPU session ended, rerun this cell to reacquire”
not:
- “everything is slow and broken”
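One way to make "GPU ended" an explicit state rather than a generic failure is to model the session as a small state machine. The sketch below is illustrative only: GpuSession, EndReason, tick, and user_message are hypothetical names, not part of grafos_std, and time is advanced by hand instead of by a real lease.

```rust
/// Why a session ended. Hypothetical; not a grafos_std type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum EndReason {
    TtlExpired,
    RenewalRefused,
}

/// Hypothetical session states for one notebook kernel.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum GpuSession {
    /// No lease held; cells run CPU-only.
    Idle,
    /// Lease held, with the seconds remaining on its TTL.
    Active { remaining_secs: u64 },
    /// Lease is gone; the reason is kept for the UI.
    Ended(EndReason),
}

impl GpuSession {
    /// Advance the session by `elapsed_secs` without a renewal.
    /// An active session whose TTL runs out moves to `Ended`.
    pub fn tick(self, elapsed_secs: u64) -> GpuSession {
        match self {
            GpuSession::Active { remaining_secs } if remaining_secs > elapsed_secs => {
                GpuSession::Active { remaining_secs: remaining_secs - elapsed_secs }
            }
            GpuSession::Active { .. } => GpuSession::Ended(EndReason::TtlExpired),
            other => other,
        }
    }

    /// Message for the cell output, or `None` when there is nothing to report.
    pub fn user_message(&self) -> Option<&'static str> {
        match self {
            GpuSession::Ended(_) => Some("GPU session ended, rerun this cell to reacquire"),
            _ => None,
        }
    }
}
```

Because Ended is an ordinary value, the front end can render it as a normal cell result and offer a rerun, instead of catching an opaque runtime error.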
Checkpointing
To make preemption normal:
- checkpoint model state frequently (after each epoch or N steps)
- store checkpoints in a block lease
If the GPU lease ends, the next cell:
- acquires a new GPU lease
- restores from checkpoint
- continues
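The restore-and-continue loop can be sketched without any real GPU or block storage. Everything here is a stand-in: TrainState plays the role of the model state, CheckpointStore is an in-memory substitute for a block-lease-backed checkpoint, and lease_steps simulates how many steps the GPU lease survives before preemption.

```rust
/// Hypothetical training state; stands in for the real model state.
#[derive(Debug, Clone, PartialEq)]
pub struct TrainState {
    pub step: u64,
    pub loss: f64,
}

/// In-memory stand-in for a checkpoint kept in a block lease.
#[derive(Debug, Default)]
pub struct CheckpointStore {
    saved: Option<TrainState>,
}

impl CheckpointStore {
    pub fn save(&mut self, state: &TrainState) {
        self.saved = Some(state.clone());
    }

    pub fn restore(&self) -> Option<TrainState> {
        self.saved.clone()
    }
}

/// Run one cell: restore if a checkpoint exists, then train until
/// `target_step`, checkpointing every `every` steps.
/// Returns `Err(step)` with the last checkpointed step when the simulated
/// lease runs out first.
pub fn run_cell(
    store: &mut CheckpointStore,
    target_step: u64,
    every: u64,
    lease_steps: u64,
) -> Result<TrainState, u64> {
    assert!(every > 0, "checkpoint interval must be at least 1");
    let mut state = store
        .restore()
        .unwrap_or(TrainState { step: 0, loss: 1.0 });
    let mut budget = lease_steps;
    while state.step < target_step {
        if budget == 0 {
            // Preempted: work since the last checkpoint is lost.
            return Err(store.restore().map_or(0, |s| s.step));
        }
        state.step += 1;
        state.loss *= 0.5; // placeholder for a real training step
        budget -= 1;
        if state.step % every == 0 {
            store.save(&state);
        }
    }
    store.save(&state);
    Ok(state)
}
```

With a checkpoint every 3 steps and a lease that lasts 7 steps, the first call stops with Err(6); rerunning the cell restores step 6 and finishes, so preemption costs at most one checkpoint interval of work.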
Walkthrough (Implementation Sketch)
1. Acquire GPU Lease
    let gpu = fabric.alloc_gpu()
        .min_vram(2 * 1024 * 1024 * 1024)
        .lease_secs(300)
        .acquire()?;
2. Renew While Active
Use a renewal policy like:
- renew when remaining TTL < 25% (or every minute)
If renewal fails:
- abort current cell
- checkpoint if possible
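The renewal policy above can be written as two small pure functions, which keeps it easy to test apart from the lease itself. This is a sketch under assumptions: RenewAction, CellAction, renewal_action, and after_renew are hypothetical names, and how the remaining TTL is obtained from a real lease is not shown.

```rust
use std::time::Duration;

/// What the renewal loop should do on this tick.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RenewAction {
    Wait,
    Renew,
}

/// Renew when less than 25% of the TTL remains, or when at least a
/// minute has passed since the last renewal.
pub fn renewal_action(
    ttl: Duration,
    remaining: Duration,
    since_last_renew: Duration,
) -> RenewAction {
    let low_water = ttl / 4;
    if remaining < low_water || since_last_renew >= Duration::from_secs(60) {
        RenewAction::Renew
    } else {
        RenewAction::Wait
    }
}

/// What to do with the running cell after a renewal attempt.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CellAction {
    Continue,
    /// Abort the cell; the flag records whether a final checkpoint succeeded.
    Abort { checkpointed: bool },
}

/// `try_checkpoint` is called only when renewal failed.
pub fn after_renew(renewed: bool, try_checkpoint: impl FnOnce() -> bool) -> CellAction {
    if renewed {
        CellAction::Continue
    } else {
        CellAction::Abort { checkpointed: try_checkpoint() }
    }
}
```

For a 300-second lease the low-water mark is 75 seconds, so a cell with 74 seconds left triggers a renewal while one with 75 seconds left and a recent renewal does not.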
3. Checkpoint Model State
    let ckpt_lease = BlockBuilder::new().min_blocks(512).lease_secs(3600).acquire()?;
    let mut d: Durable<ModelState> = Durable::new(initial_state, ckpt_lease);
    // periodically:
    d.checkpoint()?;
4. Restore After Preemption
After a session ends:
    let d2: Durable<ModelState> = Durable::restore(existing_ckpt_lease)?;
Failure Modes
- LeaseExpired mid-run: treat as preemption; surface clear message.
- Disconnected: retry or fail with “GPU unavailable”.
- checkpoint failure: fail closed; user should not believe progress was saved.
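These three cases can be collected into a single mapping from failure to what the user sees. The sketch assumes a hypothetical CellFailure enum whose first two variants borrow the names above; it is not the actual grafos_std error type. CellReport, report, and the checkpoint-failure wording are likewise illustrative.

```rust
/// Hypothetical failure kinds for one cell run.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CellFailure {
    LeaseExpired,
    Disconnected,
    CheckpointFailed,
}

/// What the front end shows and does for a failed cell.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct CellReport {
    pub message: &'static str,
    /// Whether the environment should retry before showing the message.
    pub retry_automatically: bool,
    /// Whether the user may assume progress is recoverable.
    pub progress_saved: bool,
}

/// `last_checkpoint_ok` says whether the most recent checkpoint succeeded.
pub fn report(failure: CellFailure, last_checkpoint_ok: bool) -> CellReport {
    match failure {
        // Preemption is normal: tell the user how to continue.
        CellFailure::LeaseExpired => CellReport {
            message: "GPU session ended, rerun this cell to reacquire",
            retry_automatically: false,
            progress_saved: last_checkpoint_ok,
        },
        // Retry first; the message is shown if retries are exhausted.
        CellFailure::Disconnected => CellReport {
            message: "GPU unavailable",
            retry_automatically: true,
            progress_saved: last_checkpoint_ok,
        },
        // Fail closed: never claim progress was saved.
        CellFailure::CheckpointFailed => CellReport {
            message: "Checkpoint failed; recent progress was not saved",
            retry_automatically: false,
            progress_saved: false,
        },
    }
}
```

The CheckpointFailed arm ignores last_checkpoint_ok on purpose, so a stale "ok" flag can never make the report claim that progress was saved.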
Observability
Track per-cell:
- VRAM leased
- GPU-seconds consumed
- kernel launches
- checkpoint bytes written
This is what enables “cost per cell” UX.
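A per-cell record of these four counters might look like the sketch below. CellMetrics and its field names are hypothetical and do not reflect the grafos_observe API, and the pricing model (VRAM in GiB multiplied by GPU-seconds and a flat rate) is an assumption made only to show how a cost per cell could be derived.

```rust
/// Hypothetical per-cell counters; not the grafos_observe API.
#[derive(Debug, Default, Clone, PartialEq)]
pub struct CellMetrics {
    pub vram_leased_bytes: u64,
    pub gpu_seconds: f64,
    pub kernel_launches: u64,
    pub checkpoint_bytes_written: u64,
}

impl CellMetrics {
    /// Add a counter sample (for example, one renewal period) to this cell.
    pub fn add(&mut self, gpu_seconds: f64, kernel_launches: u64, checkpoint_bytes: u64) {
        self.gpu_seconds += gpu_seconds;
        self.kernel_launches += kernel_launches;
        self.checkpoint_bytes_written += checkpoint_bytes;
    }

    /// Assumed pricing: GiB of leased VRAM x GPU-seconds x flat rate.
    pub fn cost(&self, price_per_gib_second: f64) -> f64 {
        let gib = self.vram_leased_bytes as f64 / (1024.0 * 1024.0 * 1024.0);
        gib * self.gpu_seconds * price_per_gib_second
    }
}
```

Under that assumed model, a cell that held 2 GiB for 30 GPU-seconds at a rate of 0.5 per GiB-second costs 30 units.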
Variations
- multi-GPU: acquire multiple leases for data parallelism
- interactive batching: batch small requests to amortize lease acquisition