Recipe 6: A Build System That Scales to 100 Cores in 200 Milliseconds

Situation

You have a build (or test) workload that is embarrassingly parallel:

  • compile units
  • code generation
  • lint / formatting
  • test shards

Traditional “burst capacity” approaches (VMs, containers, k8s) have non-trivial startup and teardown:

  • Seconds to minutes of overhead.
  • A cleanup problem when coordinators die mid-run.

In a lease-based system, you want to:

  • acquire CPU capacity quickly
  • run work in a sandbox
  • return all resources automatically on drop or TTL expiry

The goal: make the “build farm” a transient resource graph rather than a fleet.

What You Build

A coordinator that:

  1. Leases CPU resources across multiple nodes (short TTL).
  2. Dispatches compilation/test work as WASM tasklets with fuel limits.
  3. Writes intermediate artifacts (object files, test logs) to leased block storage.
  4. Drops all leases at the end, leaving no cleanup tail.

Building Blocks

  • grafos_std::cpu::CpuBuilder and CpuLease
  • WASM tasklets (the payload you submit)
  • grafos_std::block::BlockBuilder for shared artifact storage
  • grafos_jobs::{JobCoordinator, RetryPolicy} for idempotent retry and dispatch
  • grafos_leasekit::RenewalManager for CPU lease renewal
  • grafos_observe for measuring lease churn and throughput

Design

Work Decomposition

You need deterministic chunking:

  • For compilation: per-crate or per-module units.
  • For tests: shard the test list by a stable hash of each test name, so the same test always lands in the same shard.

Each chunk must be:

  • idempotent (safe to retry)
  • bounded in time (fuel / TTL)
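The hash-based test sharding described above can be sketched with the standard library alone. This is a minimal illustration, not part of any grafos crate: `fnv1a64`, `shard_of`, and `shard_tests` are hypothetical helper names. FNV-1a is used here because its output is fixed by the algorithm, whereas the standard library's default hasher makes no stability promise across Rust versions.

```rust
/// FNV-1a 64-bit hash. The result depends only on the input bytes,
/// so shard assignment is identical on every machine and every run.
fn fnv1a64(bytes: &[u8]) -> u64 {
    const OFFSET_BASIS: u64 = 0xcbf29ce484222325;
    const PRIME: u64 = 0x100000001b3;
    let mut hash = OFFSET_BASIS;
    for &b in bytes {
        hash ^= u64::from(b);
        hash = hash.wrapping_mul(PRIME);
    }
    hash
}

/// Map a test name to a shard index in `0..shard_count`.
fn shard_of(test_name: &str, shard_count: u64) -> u64 {
    assert!(shard_count > 0, "shard_count must be non-zero");
    fnv1a64(test_name.as_bytes()) % shard_count
}

/// Group test names into `shard_count` shards.
/// Order within a shard follows the input order.
fn shard_tests<'a>(tests: &[&'a str], shard_count: u64) -> Vec<Vec<&'a str>> {
    let mut shards = vec![Vec::new(); shard_count as usize];
    for &test in tests {
        shards[shard_of(test, shard_count) as usize].push(test);
    }
    shards
}
```

Because the assignment depends only on the test name and the shard count, a retried shard contains exactly the same tests as the original attempt, which is what makes the retry idempotent.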

Isolation Model

Each task runs as WASM:

  • fuel-limited to bound CPU usage
  • bounded output size

This gives you container-like isolation without a container runtime.
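As a conceptual model of the two limits above, the sketch below tracks a fuel budget and an output cap in plain Rust. It is an assumption-laden toy: a real WASM runtime meters fuel at the instruction level inside the engine, and `Budget`, `charge`, and `emit` are hypothetical names rather than grafos APIs.

```rust
#[derive(Debug, PartialEq, Eq)]
enum TaskError {
    OutOfFuel,
    OutputTooLarge,
}

/// Per-task resource budget: remaining fuel plus a capped output buffer.
struct Budget {
    fuel: u64,
    max_output: usize,
    output: Vec<u8>,
}

impl Budget {
    fn new(fuel: u64, max_output: usize) -> Self {
        Budget { fuel, max_output, output: Vec::new() }
    }

    /// Deduct `cost` units of fuel. On failure the remaining fuel is unchanged.
    fn charge(&mut self, cost: u64) -> Result<(), TaskError> {
        self.fuel = self.fuel.checked_sub(cost).ok_or(TaskError::OutOfFuel)?;
        Ok(())
    }

    /// Append to the output buffer, rejecting writes that would exceed the cap.
    fn emit(&mut self, bytes: &[u8]) -> Result<(), TaskError> {
        if self.output.len() + bytes.len() > self.max_output {
            return Err(TaskError::OutputTooLarge);
        }
        self.output.extend_from_slice(bytes);
        Ok(())
    }
}
```

Both limits fail closed: a task that exceeds either one gets an error instead of consuming more of the host's CPU or memory.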

Artifact Storage

Intermediate artifacts must outlive any single worker:

  • store in block leases
  • coordinator can read them back and link/aggregate

In more advanced designs, you might use a content-addressed store. For the recipe, a simple “object file per chunk” layout works.
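One way to realize the "object file per chunk" layout is to give every chunk a fixed-size slot of contiguous blocks. The sketch below only does the address arithmetic. `Layout`, `blocks_per_slot`, and the slot scheme are illustrative assumptions; the recipe does not specify how the block lease is addressed.

```rust
use std::ops::Range;

/// Hypothetical fixed-slot layout over a block lease.
struct Layout {
    total_blocks: u64,    // size of the leased region, in blocks
    blocks_per_slot: u64, // blocks reserved for each chunk's artifact
}

impl Layout {
    /// Number of whole slots that fit in the lease (0 if the slot size is 0).
    fn slot_count(&self) -> u64 {
        self.total_blocks.checked_div(self.blocks_per_slot).unwrap_or(0)
    }

    /// Block range owned by `chunk_index`, or `None` if it does not fit.
    fn slot_range(&self, chunk_index: u64) -> Option<Range<u64>> {
        if chunk_index >= self.slot_count() {
            return None;
        }
        let start = chunk_index * self.blocks_per_slot;
        Some(start..start + self.blocks_per_slot)
    }
}
```

With a 4096-block lease and 16 blocks per slot, this yields 256 slots, and chunk 255 owns blocks 4080 to 4095. Since each chunk writes only to its own slot, a retried chunk overwrites its earlier output and never touches another chunk's artifact.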

Walkthrough (Implementation Sketch)

1. Coordinator Acquires CPU Leases

use grafos_std::cpu::CpuBuilder;

// 50 leases x 2 cores each = 100 cores; each lease is valid for 120 seconds.
let mut workers = Vec::new();
for _ in 0..50 {
    let lease = CpuBuilder::new().cores(2).lease_secs(120).acquire()?;
    workers.push(lease);
}

You now have 100 cores' worth of leased capacity, but no fleet to manage. You have 50 lease handles.

2. Coordinator Acquires Block Lease for Artifacts

use grafos_std::block::BlockBuilder;
let artifacts = BlockBuilder::new().min_blocks(4096).lease_secs(600).acquire()?;

3. Dispatch Work as Tasklets

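Each worker launches a WASM tasklet. Here `i` is the index of the worker lease chosen for this chunk:

let result = workers[i].cpu()
    .submit(tasklet_wasm_bytes)     // compiled tasklet module
    .fuel(5_000_000)                // upper bound on CPU work
    .input(chunk_descriptor_bytes)  // which chunk to build or test
    .launch()?;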

The tasklet writes its output to block storage (or returns it in the output buffer, if small).
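The snippet above leaves open how chunks are assigned to workers. One simple option is a shared pull queue, sketched below with scoped threads standing in for leased workers. `dispatch` is a hypothetical helper, not a grafos API, and the closure passed as `run` is where a real coordinator would call the tasklet launch for the given worker.

```rust
use std::sync::Mutex;
use std::thread;

/// Run `run(worker_index, chunk)` for every chunk on `worker_count` threads.
/// Results are returned in the original chunk order.
fn dispatch<T, R, F>(chunks: Vec<T>, worker_count: usize, run: F) -> Vec<R>
where
    T: Send,
    R: Send,
    F: Fn(usize, T) -> R + Sync,
{
    // Shared pull queue: each worker takes the next (index, chunk) pair.
    let queue = Mutex::new(chunks.into_iter().enumerate());
    let results = Mutex::new(Vec::new());

    thread::scope(|s| {
        for worker in 0..worker_count {
            let (queue, results, run) = (&queue, &results, &run);
            s.spawn(move || loop {
                // Hold the lock only long enough to pop one item.
                let next = queue.lock().unwrap().next();
                let Some((index, chunk)) = next else { break };
                let output = run(worker, chunk);
                results.lock().unwrap().push((index, output));
            });
        }
    });

    // Workers finish in arbitrary order; restore chunk order before returning.
    let mut indexed = results.into_inner().unwrap();
    indexed.sort_by_key(|(index, _)| *index);
    indexed.into_iter().map(|(_, output)| output).collect()
}
```

A pull queue balances load automatically: a worker that gets cheap chunks simply pulls more often, which matters when compile units vary widely in size.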

4. Retry and Failure Handling

Use JobCoordinator with a RetryPolicy to handle worker failures declaratively:

use grafos_jobs::{JobCoordinator, RetryPolicy, Backoff};

let policy = RetryPolicy::default()
    .with_max_retries(3)
    .with_backoff(Backoff::exponential(100, 5000));
let mut coordinator = JobCoordinator::new(policy);
let result = coordinator.run(
    chunks,    // work items
    artifacts, // shared block store for outputs
    |chunk, store| { // exec_fn: run one chunk
        let lease = CpuBuilder::new().cores(2).lease_secs(120).acquire()?;
        let r = lease.cpu().submit(tasklet_wasm_bytes).fuel(5_000_000)
            .input(chunk).launch()?;
        store.write(chunk.id, &r.output)?;
        Ok(r)
    },
    |results| { // agg_fn: combine outputs
        Ok(results)
    },
)?;

JobCoordinator retries failed chunks with exponential backoff. If a worker lease expires or disconnects, the chunk is resubmitted on a new lease automatically.
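To make the retry behaviour concrete, here is a minimal std-only sketch of a capped exponential backoff loop. It assumes that `Backoff::exponential(100, 5000)` means a 100 ms base delay that doubles per attempt, capped at 5000 ms. The recipe does not state those semantics, so treat the schedule as an assumption. `backoff_ms` and `run_with_retry` are hypothetical helpers, not grafos_jobs APIs.

```rust
/// Delay before retry number `attempt` (0-based): base * 2^attempt, capped.
fn backoff_ms(attempt: u32, base_ms: u64, cap_ms: u64) -> u64 {
    // checked_shl returns None once the shift would overflow 64 bits.
    let factor = 1u64.checked_shl(attempt).unwrap_or(u64::MAX);
    base_ms.saturating_mul(factor).min(cap_ms)
}

/// Call `op` until it succeeds or `max_retries` retries have been used.
/// `sleep` receives the delay in milliseconds, so tests can record delays
/// instead of actually waiting.
fn run_with_retry<T, E>(
    max_retries: u32,
    mut op: impl FnMut(u32) -> Result<T, E>,
    mut sleep: impl FnMut(u64),
) -> Result<T, E> {
    let mut attempt = 0;
    loop {
        match op(attempt) {
            Ok(value) => return Ok(value),
            // Out of retries: surface the last error to the caller.
            Err(err) if attempt >= max_retries => return Err(err),
            Err(_) => {
                sleep(backoff_ms(attempt, 100, 5_000));
                attempt += 1;
            }
        }
    }
}
```

Under these assumptions the delays run 100, 200, 400 ms and so on until they reach the 5000 ms cap. With `max_retries` set to 3, a chunk is attempted at most four times. In production code the `sleep` argument would typically wrap `std::thread::sleep`.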

5. Lease Renewal

For long-running builds, use RenewalManager to keep CPU leases alive:

use grafos_leasekit::RenewalManager;

let mut renewals = RenewalManager::new();
for (i, worker) in workers.iter().enumerate() {
    // Register each lease under a numeric id with the default renewal policy.
    renewals.register(i as u64, worker.expiry(), Default::default());
}

// In your event loop (`now` is the current time, supplied by the caller):
let summary = renewals.tick(now);
// summary tells you which leases were renewed and which failed.
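The decision a renewal tick has to make for each lease can be sketched as a pure function. This is an illustration of the idea only. How RenewalManager actually decides is not described in the recipe, and `LeaseStatus`, `classify`, and the `margin_ms` safety margin are hypothetical.

```rust
#[derive(Debug, PartialEq, Eq)]
enum LeaseStatus {
    /// Plenty of time left; do nothing this tick.
    Healthy,
    /// Inside the safety margin; renew before the TTL runs out.
    RenewNow,
    /// TTL already elapsed; the lease must be re-acquired, not renewed.
    Expired,
}

/// Classify a lease given its expiry time, the current time, and a safety
/// margin, all in milliseconds on the same clock.
fn classify(expires_at_ms: u64, now_ms: u64, margin_ms: u64) -> LeaseStatus {
    if now_ms >= expires_at_ms {
        LeaseStatus::Expired
    } else if expires_at_ms - now_ms <= margin_ms {
        LeaseStatus::RenewNow
    } else {
        LeaseStatus::Healthy
    }
}
```

For a 120-second lease with a 30-second margin, the lease is healthy for the first 90 seconds and due for renewal after that. Choose a margin larger than the worst-case renewal round trip so that a single slow renewal does not let the lease lapse.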

6. Teardown

When done:

  • drop CPU leases
  • drop the artifact lease (or keep it if you want cache/reuse)

If the coordinator crashes, TTL expiry tears down leases automatically.
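The "drop returns the resource" behaviour is ordinary Rust RAII. The mock below shows the pattern with a shared counter in place of a real fabric. `MockLease` is a stand-in written for this illustration, and nothing here reflects how CpuLease is implemented.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Stand-in for a lease handle: holding it counts as one active lease.
struct MockLease {
    active: Arc<AtomicUsize>,
}

impl MockLease {
    /// "Acquire" a lease by incrementing the shared active-lease gauge.
    fn acquire(active: &Arc<AtomicUsize>) -> Self {
        active.fetch_add(1, Ordering::SeqCst);
        MockLease { active: Arc::clone(active) }
    }
}

impl Drop for MockLease {
    /// Release happens automatically when the handle goes out of scope,
    /// including on early return via `?` or during panic unwinding.
    fn drop(&mut self) {
        self.active.fetch_sub(1, Ordering::SeqCst);
    }
}
```

Dropping the `Vec` of handles releases every lease at once, with no explicit teardown loop. Drop only covers a live process, though. A killed coordinator never runs its destructors, which is why TTL expiry is still needed as the backstop.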

Failure Modes

  • FabricError::Disconnected: worker node unreachable; retry elsewhere.
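  • FabricError::LeaseExpired: the TTL elapsed before the work finished. If this happens often, treat it as a configuration bug: increase the TTL or renew earlier.
  • Output too large: enforce max_output on the tasklet and write large outputs to block storage instead of the output buffer.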
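The three failure modes above map naturally onto three recovery actions. The sketch below encodes that mapping with local stand-in types. `TaskFailure` and `Recovery` are hypothetical and deliberately not named after FabricError, whose full definition is not shown in this recipe.

```rust
/// Local stand-in for the failures listed above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TaskFailure {
    Disconnected,
    LeaseExpired,
    OutputTooLarge,
}

/// What the coordinator should do next.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Recovery {
    /// Resubmit the chunk on a lease from a different node.
    RetryOnNewLease,
    /// Re-acquire with a longer TTL (or renew earlier), then resubmit.
    ExtendTtlAndRetry,
    /// Rerun with outputs directed to block storage, not the output buffer.
    SpillToBlockStorage,
}

fn recovery_for(failure: TaskFailure) -> Recovery {
    match failure {
        TaskFailure::Disconnected => Recovery::RetryOnNewLease,
        TaskFailure::LeaseExpired => Recovery::ExtendTtlAndRetry,
        TaskFailure::OutputTooLarge => Recovery::SpillToBlockStorage,
    }
}
```

Keeping the mapping in one exhaustive `match` means the compiler flags any new failure variant that lacks a recovery decision.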

Observability

Track:

  • cpu_leases_active
  • tasklets_launched_total
  • tasklet_duration_ms histogram
  • artifact_bytes_written
  • retries_total

The “wow” metric: cpu_leases_active returns to zero immediately after the job completes.
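For illustration, the counters and the duration histogram can be kept in a small struct of atomics. This is a std-only sketch that does not use the grafos_observe API (which is not shown in this recipe). `BuildMetrics` and the bucket bounds are assumptions chosen for the example.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Upper bounds (inclusive, in ms) of the histogram buckets.
/// Durations above the last bound go into one extra overflow bucket.
const BUCKET_BOUNDS_MS: [u64; 5] = [10, 50, 100, 500, 1000];

#[derive(Default)]
struct BuildMetrics {
    tasklets_launched_total: AtomicU64,
    retries_total: AtomicU64,
    artifact_bytes_written: AtomicU64,
    /// tasklet_duration_ms histogram: 5 bounded buckets + 1 overflow bucket.
    duration_buckets: [AtomicU64; 6],
}

impl BuildMetrics {
    /// Record one finished tasklet: its duration and the bytes it wrote.
    fn record_tasklet(&self, duration_ms: u64, bytes_written: u64) {
        self.tasklets_launched_total.fetch_add(1, Ordering::Relaxed);
        self.artifact_bytes_written.fetch_add(bytes_written, Ordering::Relaxed);
        // First bucket whose bound covers the duration, else the overflow bucket.
        let bucket = BUCKET_BOUNDS_MS
            .iter()
            .position(|&bound| duration_ms <= bound)
            .unwrap_or(BUCKET_BOUNDS_MS.len());
        self.duration_buckets[bucket].fetch_add(1, Ordering::Relaxed);
    }

    /// Record one retry of a failed chunk.
    fn record_retry(&self) {
        self.retries_total.fetch_add(1, Ordering::Relaxed);
    }
}
```

All methods take `&self`, so a single `BuildMetrics` can be shared across worker threads behind an `Arc` without a lock.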

Variations

  • Warm pool: keep a few CPU leases alive for near-zero-latency bursts.
  • Cache: store compilation outputs in block storage keyed by hash.
  • Heterogeneous workers: pick nodes with AVX512 / big cores for heavy chunks.