HISTORICAL — API UPDATED. This recipe was written against the
v0 gpu_submit SDK surface (FabricGpu::submit(...).launch() and
GpuSubmitBuilder), which was removed from grafos-std in task
#57. The “single function call” framing (gpu_call(input)) was a
direct reflection of the v0 one-shot builder.
In the v1 session-op model, the equivalent pattern is:
// buf, module, sess, lease all drop — RAII teardown.
This is still “a single logical operation from the programmer’s
perspective,” but explicit about the device-memory and module
lifecycle. The v1 decomposition is intentional: it makes the
authority model (cap-rights, lease scoping) and the failure modes
(per-op SESSION_STATUS_*) visible at the call site rather than
hiding them inside a one-shot builder.
See Recipe 30: Persistent GPU Sessions for the canonical v1
pattern and the GPU Module section of docs/grafos-std-guide.md
for the full walkthrough. The text below is preserved as
historical context for how the v0 “pure function” programming
model looked; do not copy the Rust snippets into new code.
Situation
You have a CPU-bound pipeline with one bottleneck step that would be fast on a GPU: a matrix multiply, an FFT,
a convolution, image scaling. The step is a pure function — data in, data out, no state carried between calls.
In a conventional setup, using a GPU for this means:
Install CUDA (or ROCm, or OpenCL).
Manage device contexts and memory allocations in your code.
Handle driver version mismatches across machines.
If it’s a cloud GPU: provision an instance, SSH in, install drivers, run your job, tear down.
You don’t want any of that. You want result = gpu_call(input) — like calling a function that happens to
run on someone else’s hardware.
What You Build
A pattern where GPU acceleration is a single function call:
Write a CUDA kernel once, compile to PTX offline.
Lease a VRAM slice (just enough for your I/O buffers).
GPU_SUBMIT sends a compiled kernel binary and arguments over the QUIC control stream to a GPU node. The
node loads the PTX, launches the kernel, synchronizes, copies output from device memory, and returns the
bytes. From your perspective, it’s a remote procedure call with bytes-in, bytes-out semantics.
Your code                          GPU node
─────────                          ────────
gpu.submit("fft", ptx)
    .arg(&signal_bytes)
    .max_output(signal_len)
    .launch()
  → GPU_SUBMIT ──QUIC──→           cuModuleLoadData(ptx)
                                   cuModuleGetFunction("fft")
                                   cuLaunchKernel(...)
                                   cuCtxSynchronize()
                                   cuMemcpyDtoH(output)
  ← output bytes ◄──────
One network round-trip. No device context in your process. No CUDA headers. No driver dependency.
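Because the call is bytes-in, bytes-out, the caller has to flatten typed data into a byte buffer before submitting and rebuild it from the returned bytes. The helpers below are a minimal sketch of that step. The function names are hypothetical, and the little-endian f32 layout is an assumption: this recipe does not specify the argument encoding, although NVIDIA GPUs do read floats in little-endian order.

```rust
/// Pack f32 samples into a flat little-endian byte buffer.
/// (Hypothetical helper; the byte layout is an assumption, not a documented wire format.)
pub fn f32s_to_le_bytes(samples: &[f32]) -> Vec<u8> {
    let mut out = Vec::with_capacity(samples.len() * 4);
    for s in samples {
        out.extend_from_slice(&s.to_le_bytes());
    }
    out
}

/// Unpack a little-endian byte buffer into f32 samples.
/// Returns None if the length is not a multiple of 4 bytes.
pub fn le_bytes_to_f32s(bytes: &[u8]) -> Option<Vec<f32>> {
    if bytes.len() % 4 != 0 {
        return None;
    }
    Some(
        bytes
            .chunks_exact(4)
            .map(|c| f32::from_le_bytes([c[0], c[1], c[2], c[3]]))
            .collect(),
    )
}
```

Rejecting a buffer whose length is not a multiple of four catches a truncated response early instead of silently dropping the trailing bytes.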
When This Is the Right Pattern
Use stateless GPU dispatch when:
The function is pure: the same input always produces the same output. No accumulated state.
Each call is independent: no need to keep data on the GPU between calls.
The compute dominates the transfer: if you’re sending 1 KB and getting back 1 KB, the QUIC
round-trip is negligible. If you’re sending 1 GB, consider a session (Recipe 30) to avoid
re-uploading.
You want simplicity: one call, one result, no cleanup.
Don’t use this pattern when:
You need multiple kernel launches on shared device memory → use sessions (Recipe 30).
You’re running the same kernel thousands of times with different args → use burst (Recipe 32).
The data is already on the GPU from a prior step → sessions keep it there.
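The "compute dominates the transfer" criterion can be checked with a back-of-envelope estimate. The sketch below is a deliberately crude model (one round-trip plus payload size divided by link bandwidth). Both function names and any bandwidth, latency, or kernel-time figures you feed it are assumptions to replace with your own measurements.

```rust
/// Rough cost of one stateless call's data movement: one round-trip
/// plus the time to push input and output bytes over the link.
/// (Hypothetical model; ignores congestion, framing, and host/device copies.)
pub fn transfer_secs(bytes_in: u64, bytes_out: u64, bytes_per_sec: f64, rtt_secs: f64) -> f64 {
    rtt_secs + (bytes_in + bytes_out) as f64 / bytes_per_sec
}

/// True when the kernel's run time exceeds the estimated transfer time,
/// which is the case where stateless dispatch is a comfortable fit.
pub fn compute_dominates(kernel_secs: f64, transfer: f64) -> bool {
    kernel_secs > transfer
}
```

For example, with an assumed 1 GB/s link and 1 ms round-trip, a 1 KB in, 1 KB out call costs about a millisecond of transfer, while a 1 GiB payload each way costs roughly two seconds per call, which is where a session starts to pay off.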
Contrast with Recipe 29
Recipe 29 is an API tutorial: “here’s how GPU_SUBMIT works, field by field.” This recipe is a pattern:
when and why to treat GPU compute as a pure function, and how to integrate it into a larger pipeline
without restructuring your code around GPU lifecycle.
Walkthrough (Implementation Sketch)
1. Write the Kernel (One-Time, Offline)
// scale.cu: multiply every element by a factor
extern "C" __global__ void scale(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}
Compile: nvcc --ptx scale.cu -o scale.ptx
This is the only step that touches CUDA tooling. Your application code never imports CUDA.
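The kernel's `if (i < n)` guard exists because the launch grid is rounded up to a whole number of blocks, so the last block may contain threads past the end of the array. The sketch below shows that rounding and a CPU reference for checking returned bytes. Both function names are hypothetical, and how the v0 builder actually chose launch dimensions is not shown in this recipe.

```rust
/// Number of thread blocks needed to cover `n` elements (ceiling division).
/// Hypothetical helper; the block size is whatever the launch uses.
pub fn grid_dim(n: u32, block_dim: u32) -> u32 {
    assert!(block_dim > 0, "block_dim must be non-zero");
    n / block_dim + u32::from(n % block_dim != 0)
}

/// CPU reference for the `scale` kernel: the same arithmetic, element by element.
/// Useful for validating GPU output in tests.
pub fn scale_reference(data: &mut [f32], factor: f32) {
    for x in data.iter_mut() {
        *x *= factor;
    }
}
```

With an assumed block size of 256, 1000 elements need 4 blocks (1024 threads), and the 24 surplus threads are exactly the ones the bounds check turns into no-ops.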
    // GPU step: scale by calibration factor (the bottleneck)
    let scaled = gpu_scale(&normalized, 2.5)?;
    // CPU step 2: filter
    let filtered = low_pass_filter(&scaled);
    Ok(filtered)
}
The caller doesn’t know or care that gpu_scale runs on remote hardware. It takes a slice, returns a
Vec. The lease is acquired and dropped inside the function — invisible to the caller.
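One way to keep the pipeline testable without a GPU node is to pass the GPU step in as a function value, so a CPU stand-in with the same signature can replace it. The sketch below assumes this shape. `process`, `normalize`, `cpu_scale`, and the bodies of both CPU steps are hypothetical stand-ins, not the recipe's original code, and nothing here touches the removed v0 SDK.

```rust
/// Hypothetical CPU step 1: scale samples so the largest magnitude is 1.0.
fn normalize(raw: &[f32]) -> Vec<f32> {
    let peak = raw.iter().fold(0.0f32, |m, x| m.max(x.abs()));
    if peak == 0.0 {
        return raw.to_vec();
    }
    raw.iter().map(|x| x / peak).collect()
}

/// Hypothetical CPU step 2: two-point moving average.
fn low_pass_filter(xs: &[f32]) -> Vec<f32> {
    xs.windows(2).map(|w| (w[0] + w[1]) * 0.5).collect()
}

/// The pipeline, generic over the GPU step: slice in, Vec out,
/// with any dispatch error propagated by `?`.
pub fn process<E>(
    raw: &[f32],
    gpu_scale: impl Fn(&[f32], f32) -> Result<Vec<f32>, E>,
) -> Result<Vec<f32>, E> {
    let normalized = normalize(raw);
    let scaled = gpu_scale(&normalized, 2.5)?;
    let filtered = low_pass_filter(&scaled);
    Ok(filtered)
}

/// CPU stand-in with the same signature as the GPU step, for tests.
pub fn cpu_scale(data: &[f32], factor: f32) -> Result<Vec<f32>, String> {
    Ok(data.iter().map(|x| x * factor).collect())
}
```

Calling `process(&samples, cpu_scale)` exercises the whole pipeline locally. Swapping in a function that performs the remote dispatch changes nothing in `process` itself, which is the point of treating the GPU step as a pure function.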
4. Reuse the Lease for Multiple Calls (Optional)
If you’re calling the function in a loop, acquire the lease once outside the loop:
Each submit still creates and destroys a context on the node, but the lease (VRAM reservation)
persists. This avoids the lease acquisition overhead per call while keeping the stateless dispatch
semantics.
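The shape of that loop can be sketched with a mock. `MockLease` below is a hypothetical stand-in that only counts acquisitions and does the arithmetic on the CPU. It is not the grafos-std lease type and makes no network calls.

```rust
use std::cell::Cell;

/// Hypothetical stand-in for a VRAM lease; not the grafos-std type.
pub struct MockLease;

impl MockLease {
    /// Pretend to reserve VRAM, recording that an acquisition happened.
    pub fn acquire(acquisitions: &Cell<u32>) -> Self {
        acquisitions.set(acquisitions.get() + 1);
        MockLease
    }

    /// Pretend one-shot dispatch: each call is independent of the others.
    pub fn scale(&self, data: &[f32], factor: f32) -> Vec<f32> {
        data.iter().map(|x| x * factor).collect()
    }
}

/// Process every batch under a single lease.
pub fn scale_batches(
    batches: &[Vec<f32>],
    factor: f32,
    acquisitions: &Cell<u32>,
) -> Vec<Vec<f32>> {
    // Acquire once, outside the loop.
    let lease = MockLease::acquire(acquisitions);
    let mut out = Vec::with_capacity(batches.len());
    for batch in batches {
        // Each iteration is still a stateless call; only the lease is shared.
        out.push(lease.scale(batch, factor));
    }
    out
    // `lease` drops here, after the last batch.
}
```

However many batches are processed, the acquisition counter ends at one, which is the overhead this variation removes.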
Failure Modes
LeaseExpired: The lease TTL elapsed before the kernel completed. Use a longer lease_secs
or renew with lease.renew().
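STATUS_LOAD_FAILED: The node could not load or JIT-compile the PTX. The usual cause is an architecture
mismatch: the PTX targets a newer virtual architecture than the node’s GPU supports. Recompile without
-arch (or with a lower compute_XX target) so the driver JIT-compiles for whatever GPU is present.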
STATUS_LAUNCH_FAILED: Too many threads per block, invalid kernel name, or argument count
mismatch.
CapacityExceeded: No free VRAM on any node. Wait and retry, or reduce min_vram.
Disconnected: Node went away mid-execution. Retry the call — it’s stateless, so safe to
re-invoke.
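The failure modes above split into those worth retrying as-is and those that need a code or configuration change first. The sketch below encodes that split. `GpuError` and `call_with_retry` are hypothetical names that mirror the list, not types from grafos-std, and treating LeaseExpired as non-retryable (because it calls for a longer or renewed lease) is this sketch's own choice.

```rust
/// Hypothetical error type mirroring the failure modes listed above.
#[derive(Debug, Clone, PartialEq)]
pub enum GpuError {
    LeaseExpired,
    LoadFailed,
    LaunchFailed,
    CapacityExceeded,
    Disconnected,
}

impl GpuError {
    /// Whether re-invoking the same call unchanged can plausibly succeed.
    pub fn is_retryable(&self) -> bool {
        matches!(self, GpuError::CapacityExceeded | GpuError::Disconnected)
    }
}

/// Run `call` up to `max_attempts` times, retrying only retryable errors.
/// Safe for stateless dispatch because a repeated call has no side effects.
pub fn call_with_retry<T>(
    max_attempts: u32,
    mut call: impl FnMut() -> Result<T, GpuError>,
) -> Result<T, GpuError> {
    let mut attempt = 1;
    loop {
        match call() {
            Ok(v) => return Ok(v),
            Err(e) if e.is_retryable() && attempt < max_attempts => attempt += 1,
            Err(e) => return Err(e),
        }
    }
}
```

The loop retries immediately to stay dependency-free. For CapacityExceeded in particular, a real caller would likely sleep between attempts, since the advice is to wait before retrying.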