Recipe 32: 1000 GPUs for One Second
Situation
You’re a researcher evaluating 1000 hyperparameter configurations for a model. Today’s options:
- SLURM queue: Wait days for 8 GPUs, then run sequentially for hours. Wall-clock: days.
- Rent a cluster: Provision 1000 GPUs. Provisioning takes longer than compute. You pay for the provisioning time. Teardown is another project.
- Spot instances: Chase preemptions across regions. Maybe get 200 GPUs. Write retry logic. Still hours.
The compute itself is embarrassingly parallel. Each configuration is independent — run a kernel, read a number, move on. If you could get 1000 GPUs simultaneously, the total wall-clock time would be seconds.
In the fabricBIOS model, GPU compute is a pool of VRAM slices, not a pool of whole machines. You lease 1000 slices, submit one kernel to each, collect results, and drop everything. The VRAM returns to the pool immediately. No provisioning, no teardown, no orphaned resources.
What You Build
A coordinator that:
- Defines 1000 work chunks (each = one hyperparameter configuration).
- Acquires GPU leases in a loop — as many as the fabric provides.
- Fans out: each chunk → GPU_SUBMIT to a leased VRAM slice.
- Collects results via JobCoordinator with automatic retry on transient failures.
- Finds the best configuration.
- Drops all leases — VRAM returns to the pool.
Building Blocks
- grafos_std::gpu::{GpuBuilder, GpuLease, FabricGpu}: GPU leasing and submission (source)
- grafos_jobs::{JobCoordinator, RetryPolicy, MemoryOutputStore, WorkChunk, ChunkId}: idempotent burst compute (source)
- grafos_leasekit::{RenewalManager, RenewalPolicy}: lease renewal during execution (source)
Related:
- GPU_SUBMIT wire format (source)
- Recipe 16: Pop-Up Supercomputer — same burst pattern with CPU tasklets
- Recipe 29: CUDA Kernel on Leased GPU — single GPU_SUBMIT walkthrough
Design
GPU Capacity as a Pool
A 48 GB GPU can serve six 8 GB leases or forty-eight 1 GB leases simultaneously. A fabric with 100 nodes, each with 4 GPUs, exposes up to 19,200 one-GB slices. You don’t rent machines — you lease VRAM.
Your coordinator doesn’t know or care which physical cards it gets. It asks for VRAM, gets lease handles, submits kernels, reads results.
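The slice counts above are plain integer arithmetic. A minimal sketch, assuming a uniform fabric and ignoring any per-lease overhead or fragmentation (the function names are hypothetical, not part of grafos):

```rust
/// Whole leases of `lease_gb` that fit on one GPU with `gpu_gb` of VRAM.
/// Assumption: no per-lease overhead and no fragmentation.
fn slices_per_gpu(gpu_gb: u64, lease_gb: u64) -> u64 {
    assert!(lease_gb > 0, "lease size must be non-zero");
    gpu_gb / lease_gb
}

/// Upper bound on concurrent leases across a fabric of identical nodes.
fn fabric_slices(nodes: u64, gpus_per_node: u64, gpu_gb: u64, lease_gb: u64) -> u64 {
    nodes * gpus_per_node * slices_per_gpu(gpu_gb, lease_gb)
}
```

With the numbers from the text, `fabric_slices(100, 4, 48, 1)` gives the 19,200 figure, and the same fabric holds 2,400 eight-GB leases.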
Elastic Acquisition
Acquire in a loop until you hit the target or the fabric is exhausted:
    let mut leases = Vec::new();
    for _ in 0..1000 {
        match GpuBuilder::new().min_vram(vram_per_config).lease_secs(60).acquire() {
            Ok(lease) => leases.push(lease),
            Err(_) => break, // fabric exhausted: run with what we have
        }
    }

If you get 800 leases instead of 1000, the job still runs; the remaining 200 chunks wait for a free slot. If more capacity frees up later (for example, other tenants’ leases expire), you can add leases mid-run.
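The effect of a partial acquisition can be estimated up front. The sketch below assumes each lease runs one kernel at a time and that all kernels take roughly the same time; the helper names are hypothetical:

```rust
/// Sequential "waves" needed to run `chunks` kernels on `leases` slots.
/// Returns None when no lease was acquired at all.
fn waves(chunks: usize, leases: usize) -> Option<usize> {
    if leases == 0 {
        return None;
    }
    // Ceiling division: a partial last wave still costs a full wave.
    Some((chunks + leases - 1) / leases)
}

/// Chunks that cannot start in the first wave.
fn waiting_chunks(chunks: usize, leases: usize) -> usize {
    chunks.saturating_sub(leases)
}
```

Under those assumptions, 800 leases for 1000 chunks means two waves, so wall-clock time roughly doubles rather than growing by 25%.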
Stateless Kernels
Each kernel invocation is stateless: input bytes in, output bytes out. No shared device memory between
invocations. This is the GPU_SUBMIT pattern (Recipe 29), not the session pattern (Recipe 30). The node
creates a CUDA context, loads PTX, launches the kernel, reads output, destroys the context — one call.
This matters because stateless dispatch is idempotent. If a lease expires mid-execution, JobCoordinator
retries the chunk on a different lease.
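Why idempotence makes retries safe can be shown without any fabric at all. This is a std-only sketch, not the JobCoordinator implementation: `Attempt` and `dispatch_with_retry` are hypothetical names, and the closure stands in for a GPU_SUBMIT call.

```rust
/// Hypothetical outcome of one stateless dispatch; not a grafos type.
enum Attempt {
    Done(Vec<u8>),
    LeaseExpired,
}

/// Try `input` on successive leases, starting at index `first`, until one
/// completes or `max_retries` retries are used up. Because the kernel has no
/// state outside its input bytes, re-running it elsewhere gives the same result.
/// Returns the index of the lease that succeeded and its output.
fn dispatch_with_retry<F>(
    input: &[u8],
    lease_count: usize,
    first: usize,
    max_retries: usize,
    mut run_on: F,
) -> Option<(usize, Vec<u8>)>
where
    F: FnMut(usize, &[u8]) -> Attempt,
{
    if lease_count == 0 {
        return None;
    }
    for attempt in 0..=max_retries {
        // Move to the next lease on every retry.
        let lease = (first + attempt) % lease_count;
        if let Attempt::Done(out) = run_on(lease, input) {
            return Some((lease, out));
        }
    }
    None
}
```

The caller never needs to know whether a failed attempt partially ran; it simply resubmits the same bytes.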
Walkthrough (Implementation Sketch)
1. Define Work Chunks
    use grafos_jobs::{WorkChunk, ChunkId};

    #[derive(Clone, serde::Serialize, serde::Deserialize)]
    struct HyperparamConfig {
        config_id: u64,
        learning_rate: f32,
        batch_size: u32,
        dropout: f32,
    }

    impl WorkChunk for HyperparamConfig {
        fn chunk_id(&self) -> ChunkId {
            ChunkId(self.config_id)
        }
        fn to_bytes(&self) -> Vec<u8> {
            postcard::to_allocvec(self).unwrap()
        }
        fn from_bytes(bytes: &[u8]) -> grafos_std::Result<Self> {
            postcard::from_bytes(bytes).map_err(|_| grafos_std::FabricError::IoError(-200))
        }
    }

2. Generate the Search Space
    let configs: Vec<Box<dyn WorkChunk>> = (0..1000)
        .map(|i| {
            Box::new(HyperparamConfig {
                config_id: i,
                learning_rate: 0.0001 * (1.0 + (i % 100) as f32 * 0.01),
                batch_size: 32 * (1 + (i / 100) as u32),
                dropout: 0.1 + (i % 10) as f32 * 0.05,
            }) as Box<dyn WorkChunk>
        })
        .collect();

3. Acquire GPU Leases
    use grafos_std::gpu::GpuBuilder;
    use grafos_leasekit::{RenewalManager, RenewalPolicy};

    let vram_per_config: u64 = 512 * 1024 * 1024; // 512 MiB per eval
    let lease_ttl: u32 = 120;

    let mut leases: Vec<_> = Vec::new();
    let mut renewal_mgr = RenewalManager::new();
    let policy = RenewalPolicy::default();

    for _ in 0..1000 {
        match GpuBuilder::new().min_vram(vram_per_config).lease_secs(lease_ttl).acquire() {
            Ok(lease) => {
                renewal_mgr.register(
                    lease.lease_id() as u64,
                    lease.expires_at_unix_secs(),
                    policy,
                );
                leases.push(lease);
            }
            Err(_) => break,
        }
    }
    // leases.len() GPU slices acquired: might be 1000, might be fewer.

4. Compile the Evaluation Kernel
    // eval_config.cu: evaluate one hyperparameter configuration
    extern "C" __global__ void eval_config(
        float* output,        // single float: validation loss
        float learning_rate,
        int batch_size,
        float dropout
    ) {
        float validation_loss = 0.0f; // placeholder; set by the elided training loop

        // ... mini training loop on embedded dataset ...

        // Write final validation loss to output[0]
        if (threadIdx.x == 0) {
            output[0] = validation_loss;
        }
    }

Compile once: nvcc --ptx eval_config.cu -o eval_config.ptx
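The kernel signature above takes three 4-byte scalars and writes one 4-byte float, so the host side is a matter of byte packing. A std-only sketch, assuming host and device share the same endianness; `pack_args` and `decode_loss` are hypothetical helpers, not grafos functions:

```rust
/// Pack the three scalar kernel arguments as native-endian bytes,
/// in the order the kernel declares them.
/// Note: the kernel declares `int batch_size`, so values above i32::MAX
/// would be read as negative on the device.
fn pack_args(learning_rate: f32, batch_size: u32, dropout: f32) -> [[u8; 4]; 3] {
    [
        learning_rate.to_ne_bytes(),
        batch_size.to_ne_bytes(),
        dropout.to_ne_bytes(),
    ]
}

/// Decode the kernel's output buffer as an f32 loss.
/// Returns None if fewer than 4 bytes came back.
fn decode_loss(output: &[u8]) -> Option<f32> {
    if output.len() < 4 {
        return None;
    }
    Some(f32::from_ne_bytes([output[0], output[1], output[2], output[3]]))
}
```

Returning `None` for a short buffer lets the aggregation step skip a malformed result instead of panicking.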
5. Fan Out with JobCoordinator
    use grafos_jobs::{JobCoordinator, RetryPolicy, MemoryOutputStore};

    let ptx = include_bytes!("../vectors/gpu/eval_config.ptx");
    let mut lease_idx = 0;

    let mut output_store = MemoryOutputStore::new();
    let mut coord = JobCoordinator::new(RetryPolicy {
        max_retries: 3,
        initial_backoff_secs: 1,
        max_backoff_secs: 16,
    });

    let result = coord.run(
        &configs,
        &mut output_store,
        |chunk_bytes| {
            let config: HyperparamConfig = postcard::from_bytes(chunk_bytes)
                .map_err(|_| grafos_std::FabricError::IoError(-200))?;

            // Round-robin across available leases.
            // Assumes at least one lease was acquired; an empty `leases` would panic here.
            let lease = &leases[lease_idx % leases.len()];
            lease_idx += 1;

            // Build kernel args as native-endian bytes
            let lr_bytes = config.learning_rate.to_ne_bytes();
            let bs_bytes = config.batch_size.to_ne_bytes();
            let do_bytes = config.dropout.to_ne_bytes();

            let result = lease.gpu()
                .submit("eval_config", ptx)
                .grid([1, 1, 1])
                .block([256, 1, 1])
                .arg(&lr_bytes)
                .arg(&bs_bytes)
                .arg(&do_bytes)
                .max_output(4) // one f32
                .launch()?;

            Ok(result.output)
        },
        |outputs| {
            // Find the config with the lowest loss
            let mut best_id: u64 = 0;
            let mut best_loss: f32 = f32::MAX;
            for (chunk_id, output) in outputs {
                if output.len() >= 4 {
                    let loss = f32::from_ne_bytes(output[..4].try_into().unwrap());
                    if loss < best_loss {
                        best_loss = loss;
                        best_id = chunk_id.0;
                    }
                }
            }
            postcard::to_allocvec(&(best_id, best_loss)).unwrap()
        },
    )?;

    let (best_config_id, best_loss): (u64, f32) =
        postcard::from_bytes(&result.aggregate).unwrap();

6. Drop Everything
    drop(leases); // All VRAM returns to the pool immediately.
    coord.teardown(&mut output_store);

Your graph shows: a rapid ramp to ~1000 GPU leases, flat during compute, then an immediate drop to zero.
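The "drop returns VRAM" behaviour is the usual Rust RAII pattern. The sketch below models it with std types only; `Pool` and `Slice` are hypothetical stand-ins, and how grafos releases a lease on the wire is not shown here:

```rust
use std::cell::Cell;
use std::rc::Rc;

/// Hypothetical stand-in for the fabric's free-VRAM counter.
struct Pool {
    free_bytes: Cell<u64>,
}

/// Hypothetical lease handle: holds part of the pool until it is dropped.
struct Slice {
    pool: Rc<Pool>,
    bytes: u64,
}

impl Slice {
    /// Reserve `bytes` from the pool, or return None if the pool is exhausted.
    fn acquire(pool: &Rc<Pool>, bytes: u64) -> Option<Slice> {
        let free = pool.free_bytes.get();
        if bytes > free {
            return None;
        }
        pool.free_bytes.set(free - bytes);
        Some(Slice { pool: Rc::clone(pool), bytes })
    }
}

impl Drop for Slice {
    /// Give the bytes back as soon as the handle goes out of scope.
    fn drop(&mut self) {
        self.pool.free_bytes.set(self.pool.free_bytes.get() + self.bytes);
    }
}
```

Dropping a `Vec<Slice>` runs `Drop` for every element, which is why a single `drop(leases)` is enough to release everything.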
Failure Modes
- CapacityExceeded: The fabric doesn’t have 1000 free slices. The job runs with fewer, just slower.
- LeaseExpired: Classified as transient by RetryPolicy. The chunk is retried on another lease.
- STATUS_LOAD_FAILED: PTX compilation failed. Architecture mismatch: recompile without -arch.
- STATUS_LAUNCH_FAILED: Too many threads or bad kernel args. Permanent error; the chunk fails.
- Disconnected: Node went away. Transient; retry on a different node’s lease.
- Coordinator crash: All leases expire on their own. VRAM returns. Rerun the job.
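The list above amounts to a small decision table plus a backoff schedule. A sketch under stated assumptions: the `Failure` and `Action` enums are hypothetical mirrors of the names above (not the grafos error type), treating STATUS_LOAD_FAILED as permanent is an inference from "recompile", and doubling the delay on each retry is an assumed reading of the initial/max backoff fields:

```rust
/// Hypothetical mirror of the failure names listed above.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Failure {
    CapacityExceeded,
    LeaseExpired,
    LoadFailed,
    LaunchFailed,
    Disconnected,
}

/// What the coordinator does in response (hypothetical).
#[derive(Debug, PartialEq)]
enum Action {
    RunWithFewer,
    RetryElsewhere,
    FailChunk,
}

fn classify(f: Failure) -> Action {
    match f {
        Failure::CapacityExceeded => Action::RunWithFewer,
        // Transient: the same bytes can be resubmitted on another lease.
        Failure::LeaseExpired | Failure::Disconnected => Action::RetryElsewhere,
        // Permanent: retrying the same PTX or args would fail again (assumption for LoadFailed).
        Failure::LoadFailed | Failure::LaunchFailed => Action::FailChunk,
    }
}

/// Delay before retry number `retry` (0-based): `initial` doubled each time, capped at `max`.
/// Assumption: exponential doubling; the real RetryPolicy schedule may differ.
fn backoff_secs(retry: u32, initial: u64, max: u64) -> u64 {
    initial.saturating_mul(1u64 << retry.min(63)).min(max)
}
```

With the policy used in the walkthrough (initial 1 s, max 16 s, 3 retries), this schedule waits 1, 2, and 4 seconds.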
Observability
- gpu_leases_active: should spike to ~1000, then drop to 0
- gpu_submit_total / gpu_submit_errors: kernel execution rate
- chunks_done / chunks_retry_total: job progress
- gpu_vram_allocated_bytes: total fabric VRAM in use
- Wall-clock time: start to finish should be seconds, not hours
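The expected shape of gpu_leases_active (spike, then back to zero) can be checked mechanically from sampled values. A minimal sketch, assuming the gauge has already been scraped into a slice; `BurstShape` and `burst_shape` are hypothetical names:

```rust
/// Summary of a sampled `gpu_leases_active` series.
#[derive(Debug, PartialEq)]
struct BurstShape {
    peak: u64,
    returned_to_zero: bool,
}

/// Returns None for an empty series.
fn burst_shape(samples: &[u64]) -> Option<BurstShape> {
    let peak = *samples.iter().max()?;
    // A clean burst ends with no leases held.
    let returned_to_zero = samples.last() == Some(&0);
    Some(BurstShape { peak, returned_to_zero })
}
```

A series that ends above zero after the job finishes suggests leases that were not dropped and are waiting to expire.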
Variations
- Monte Carlo simulation: Each GPU runs a simulation with different random seeds. Aggregate by averaging or computing confidence intervals.
- Batch inference: Each GPU runs inference on a different input batch. Aggregate results into a single output set.
- Rendering: Each GPU renders a different frame or tile. Collect frames into a video.
- Genetic algorithm: Each GPU evaluates fitness for a different individual. Aggregate selects the fittest for the next generation.
- Right-sizing VRAM: If your kernel only needs 256 MiB, lease 256 MiB — not 8 GB. Smaller slices mean more concurrent leases from the same hardware.
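For the Monte Carlo variation, the aggregate step replaces "pick the minimum" with a mean and an interval. A sketch assuming each GPU's output has already been decoded to an `f64` and that a normal approximation is acceptable; `mean_ci95` is a hypothetical helper:

```rust
/// Mean and half-width of an approximate 95% confidence interval,
/// using the sample variance and a normal approximation (z = 1.96).
/// Returns None for fewer than two samples.
fn mean_ci95(samples: &[f64]) -> Option<(f64, f64)> {
    let n = samples.len();
    if n < 2 {
        return None;
    }
    let mean = samples.iter().sum::<f64>() / n as f64;
    // Unbiased sample variance (divide by n - 1).
    let var = samples.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / (n as f64 - 1.0);
    let half_width = 1.96 * (var / n as f64).sqrt();
    Some((mean, half_width))
}
```

The interval is `mean ± half_width`; it narrows with the square root of the number of simulations, which is what makes a wide burst worthwhile.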
Testing
Run scheduler and job tests locally, then validate the GPU burst on a GPU-capable cell:
    cargo test -p grafos-jobs -- coordinator       # retry and aggregation
    cargo test -p fabricbios-core -- gpu_submit    # wire format roundtrips
    grafos deploy run --requires gpu --replicas 1000 --tasklet gpu-burst --json