Recipe 13: Testing a Distributed System Without a Distributed System

Situation

Distributed systems are hard to test because their behavior emerges from:

  • failure timing
  • partitions
  • retries
  • timeouts

Testing these behaviors in a real multi-node environment is slow and flaky.

If your application is written against the lease abstraction rather than raw sockets, you can exercise a large fraction of the distributed behavior in-process with deterministic control.
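To make "written against the lease abstraction" concrete, here is a minimal sketch. It assumes a hypothetical `Lease` trait with a single `read` method; `Lease`, `LeaseError`, `FakeLease`, and `read_with_one_retry` are illustrative names, not grafos APIs. The point is only that logic depending on a trait can be driven by an in-process fake whose failures the test controls.

```rust
// Hypothetical names throughout: `Lease`, `LeaseError`, `FakeLease`, and
// `read_with_one_retry` are illustrative, not grafos APIs.

#[derive(Debug, PartialEq)]
pub enum LeaseError {
    Disconnected,
    LeaseExpired,
}

/// The only surface the application sees; no sockets are involved.
pub trait Lease {
    fn read(&mut self, offset: usize, len: usize) -> Result<Vec<u8>, LeaseError>;
}

/// In-process stand-in whose next failure is chosen by the test.
pub struct FakeLease {
    pub bytes: Vec<u8>,
    pub fail_next_read: Option<LeaseError>,
}

impl Lease for FakeLease {
    fn read(&mut self, offset: usize, len: usize) -> Result<Vec<u8>, LeaseError> {
        // A queued failure is consumed once, then reads succeed again.
        if let Some(err) = self.fail_next_read.take() {
            return Err(err);
        }
        // Panics on out-of-range input; acceptable for a test double.
        Ok(self.bytes[offset..offset + len].to_vec())
    }
}

/// Application logic written against the trait: retry once on a disconnect,
/// surface every other error to the caller.
pub fn read_with_one_retry<L: Lease>(
    lease: &mut L,
    offset: usize,
    len: usize,
) -> Result<Vec<u8>, LeaseError> {
    match lease.read(offset, len) {
        Err(LeaseError::Disconnected) => lease.read(offset, len),
        other => other,
    }
}

#[test]
fn retry_hides_a_single_disconnect() {
    let mut lease = FakeLease {
        bytes: vec![1, 2, 3, 4],
        fail_next_read: Some(LeaseError::Disconnected),
    };
    assert_eq!(read_with_one_retry(&mut lease, 1, 2), Ok(vec![2, 3]));
}
```

Because `read_with_one_retry` is generic over the trait, the same function would run unchanged against a production lease type.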

What You Build

A test strategy that:

  • runs application logic against ScenarioRunner with step-based fault injection
  • deterministically injects failures (Disconnected, LeaseExpired, etc.) at specific steps
  • checks invariants using assertion helpers (assert_no_leaked_leases, assert_rebind_converges, etc.)

Building Blocks

  • grafos_testkit::{ScenarioRunner, FaultConfig, FaultInjector}
  • grafos_testkit::{assert_no_leaked_leases, assert_rebind_converges, assert_eventually, assert_stale_epoch_rejected}
  • grafos_observe for asserting lease lifecycle events

Design

Declarative Fault Specification

Use FaultConfig to declare which operations fail at which step. No timers, no randomness, no sleeps — everything is step-based and deterministic:

use grafos_testkit::FaultConfig;

let faults = FaultConfig::new()
    .read_fails_at(3)     // step 3: reads fail
    .write_fails_at(5)    // step 5: writes fail
    .lease_expires_at(7); // step 7: lease operations fail

ScenarioRunner Drives the Steps

ScenarioRunner ticks the FaultInjector before each step, so step N sees the fault state configured for tick N. Each step receives a &FaultInjector to query:

  • fi.should_fail_read() — true if the current step is configured to fail reads
  • fi.should_fail_write() — true for write failures
  • fi.should_fail_lease() — true for lease failures
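The tick-before-step rule can be illustrated with a small stand-alone model. This is a sketch under stated assumptions, not the grafos_testkit implementation: `MiniFaults`, `MiniInjector`, and `run_steps` are hypothetical names, and the real crate's internals may differ. It shows why a fault declared for step 3 is visible to the third step and to no other.

```rust
use std::collections::BTreeSet;

// Hypothetical model: `MiniFaults`, `MiniInjector`, and `run_steps` are
// illustrative names, not the grafos_testkit implementation.

#[derive(Default)]
pub struct MiniFaults {
    read_fail_steps: BTreeSet<u32>,
    write_fail_steps: BTreeSet<u32>,
}

impl MiniFaults {
    pub fn read_fails_at(mut self, step: u32) -> Self {
        self.read_fail_steps.insert(step);
        self
    }
    pub fn write_fails_at(mut self, step: u32) -> Self {
        self.write_fail_steps.insert(step);
        self
    }
}

pub struct MiniInjector {
    faults: MiniFaults,
    tick: u32, // 0 before the first tick, so steps are numbered from 1
}

impl MiniInjector {
    pub fn should_fail_read(&self) -> bool {
        self.faults.read_fail_steps.contains(&self.tick)
    }
    pub fn should_fail_write(&self) -> bool {
        self.faults.write_fail_steps.contains(&self.tick)
    }
}

/// Runs `step` `step_count` times and returns (steps executed, steps failed).
/// A failing step is counted and the run continues.
pub fn run_steps<F>(faults: MiniFaults, step_count: u32, mut step: F) -> (u32, u32)
where
    F: FnMut(&MiniInjector) -> Result<(), String>,
{
    let mut injector = MiniInjector { faults, tick: 0 };
    let mut failed = 0;
    for _ in 0..step_count {
        injector.tick += 1; // tick first: step N observes the faults for tick N
        if step(&injector).is_err() {
            failed += 1;
        }
    }
    (step_count, failed)
}

#[test]
fn only_step_three_sees_the_read_fault() {
    let mut seen = Vec::new();
    let (executed, failed) = run_steps(MiniFaults::default().read_fails_at(3), 4, |fi| {
        seen.push(fi.should_fail_read());
        if fi.should_fail_read() { Err("disconnected".to_string()) } else { Ok(()) }
    });
    assert_eq!((executed, failed), (4, 1));
    assert_eq!(seen, vec![false, false, true, false]);
}
```

Keeping the fault schedule as plain data keyed by step number is what removes timers and sleeps from the picture: the only clock is the step counter.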

Assertion Helpers

After running a scenario, use the built-in assertion helpers:

  • assert_no_leaked_leases() — verify all leases were freed
  • assert_rebind_converges() — verify convergence within a step budget
  • assert_eventually() — generic polling assertion
  • assert_stale_epoch_rejected() — verify stale-epoch operations are rejected
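As a rough illustration of what "convergence within a step budget" means, the sketch below polls a condition once per step and reports the first step at which it holds. `first_step_where` and `ToyRebind` are hypothetical names invented for this example; the signatures of the real helpers are not shown in this recipe and may look quite different.

```rust
// Hypothetical helper and toy type; not the grafos_testkit implementation.

/// Polls `probe` once per step for at most `budget` steps and returns the
/// 1-based step at which it first returned true, or None if it never did.
pub fn first_step_where<F>(budget: u32, mut probe: F) -> Option<u32>
where
    F: FnMut(u32) -> bool,
{
    (1..=budget).find(|&step| probe(step))
}

/// Toy component that needs a fixed number of steps to finish rebinding.
pub struct ToyRebind {
    remaining: u32,
}

impl ToyRebind {
    pub fn new(steps_needed: u32) -> Self {
        Self { remaining: steps_needed }
    }

    /// Advances one step and reports whether the rebind has completed.
    pub fn advance(&mut self) -> bool {
        self.remaining = self.remaining.saturating_sub(1);
        self.remaining == 0
    }
}

#[test]
fn rebind_converges_within_budget() {
    let mut rebind = ToyRebind::new(3);
    assert_eq!(first_step_where(5, |_step| rebind.advance()), Some(3));
}
```

Counting steps rather than elapsed time keeps the assertion deterministic: a rebind that needs three steps either fits in the budget or it does not, regardless of machine load.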

Walkthrough (Test Sketches)

Test: Read Failure and Recovery

use grafos_testkit::{FaultConfig, ScenarioRunner};
use grafos_std::error::FabricError;

let runner = ScenarioRunner::new("read-failure-recovery")
    .with_faults(FaultConfig::new().read_fails_at(3))
    .step("step 1: read ok", |fi| {
        assert!(!fi.should_fail_read());
        Ok(())
    })
    .step("step 2: read ok", |fi| {
        assert!(!fi.should_fail_read());
        Ok(())
    })
    .step("step 3: read fails", |fi| {
        assert!(fi.should_fail_read());
        // Application detects failure and starts recovery
        Err(FabricError::Disconnected)
    })
    .step("step 4: read recovers", |fi| {
        assert!(!fi.should_fail_read());
        // Application has rebound to new lease
        Ok(())
    });

let result = runner.run();
assert_eq!(result.steps_executed, 4);
assert_eq!(result.steps_failed, 1);

Test: Lease Expiry Mid-Write

use grafos_testkit::{FaultConfig, ScenarioRunner};
use grafos_std::error::FabricError;

let runner = ScenarioRunner::new("lease-expiry-mid-write")
    .with_faults(
        FaultConfig::new()
            .write_fails_at(2)
            .lease_expires_at(2),
    )
    .step("step 1: write succeeds", |fi| {
        assert!(!fi.should_fail_write());
        Ok(())
    })
    .step("step 2: write + lease fail", |fi| {
        assert!(fi.should_fail_write());
        assert!(fi.should_fail_lease());
        Err(FabricError::LeaseExpired)
    })
    .step("step 3: rebuild and continue", |fi| {
        assert!(!fi.should_fail_write());
        assert!(!fi.should_fail_lease());
        Ok(())
    });

let result = runner.run();
assert_eq!(result.steps_executed, 3);
assert_eq!(result.steps_failed, 1);

Post-Scenario Assertions

use grafos_testkit::{assert_no_leaked_leases, assert_stale_epoch_rejected};
// After any scenario, verify cleanup
assert_no_leaked_leases();
// Verify fencing rejects stale epochs
assert_stale_epoch_rejected();
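The two invariants checked above can be pictured with a toy model. This is only a sketch: `ToyFabric` and `StaleEpoch` are hypothetical names, and the model assumes that a rebind bumps an epoch counter and that writes carrying an older epoch are refused. It is meant to show what "leaked lease" and "stale epoch rejected" mean, not how grafos implements them.

```rust
use std::collections::HashSet;

// Hypothetical toy model; `ToyFabric` and `StaleEpoch` are not grafos types.

#[derive(Debug, PartialEq)]
pub struct StaleEpoch {
    pub presented: u64,
    pub current: u64,
}

#[derive(Default)]
pub struct ToyFabric {
    epoch: u64,
    next_id: u64,
    live: HashSet<u64>,
}

impl ToyFabric {
    /// Hands out a lease id together with the epoch it was issued in.
    pub fn acquire(&mut self) -> (u64, u64) {
        let id = self.next_id;
        self.next_id += 1;
        self.live.insert(id);
        (id, self.epoch)
    }

    pub fn release(&mut self, id: u64) {
        self.live.remove(&id);
    }

    /// Simulates a rebind: holders of the previous epoch become stale.
    pub fn bump_epoch(&mut self) {
        self.epoch += 1;
    }

    /// Fencing check: a write presenting an older epoch is rejected.
    pub fn write(&self, presented: u64) -> Result<(), StaleEpoch> {
        if presented < self.epoch {
            Err(StaleEpoch { presented, current: self.epoch })
        } else {
            Ok(())
        }
    }

    /// Number of leases acquired but never released.
    pub fn leaked(&self) -> usize {
        self.live.len()
    }
}

#[test]
fn stale_writer_is_fenced_and_nothing_leaks() {
    let mut fabric = ToyFabric::default();
    let (id, old_epoch) = fabric.acquire();
    fabric.bump_epoch();
    assert!(fabric.write(old_epoch).is_err());
    fabric.release(id);
    assert_eq!(fabric.leaked(), 0);
}
```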

Variations

  • Property-based tests: generate a different FaultConfig for each run from a recorded seed, so any failing run can be reproduced exactly.
  • Multi-fault scenarios: combine read_fails_at, write_fails_at, and lease_expires_at on different steps.
  • Replay tests: replay event logs captured from production, which requires building an event sink first.
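One way to randomize fault placement without giving up determinism is to derive each fault schedule from a seed. The sketch below assumes a hand-rolled generator and a plain data struct; `Lcg`, `FaultPlan`, `plan_for_seed`, and `faulty_steps` are hypothetical names, and a real suite would translate the plan into FaultConfig calls (or use a property-testing crate) instead.

```rust
// Hypothetical names: `Lcg`, `FaultPlan`, `plan_for_seed`, and `faulty_steps`
// are illustrative. A real suite would build a FaultConfig from the plan.

/// Tiny seeded generator so a failing run can be replayed from its seed.
pub struct Lcg(u64);

impl Lcg {
    pub fn next_u64(&mut self) -> u64 {
        // 64-bit linear congruential step; keep the higher-quality high bits.
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        self.0 >> 33
    }

    /// Picks a step number in 1..=max_step (a max_step of 0 is treated as 1).
    pub fn pick_step(&mut self, max_step: u32) -> u32 {
        (self.next_u64() % u64::from(max_step.max(1))) as u32 + 1
    }
}

#[derive(Debug, Clone, PartialEq)]
pub struct FaultPlan {
    pub read_fails_at: u32,
    pub write_fails_at: u32,
    pub lease_expires_at: u32,
}

pub fn plan_for_seed(seed: u64, max_step: u32) -> FaultPlan {
    let mut rng = Lcg(seed);
    FaultPlan {
        read_fails_at: rng.pick_step(max_step),
        write_fails_at: rng.pick_step(max_step),
        lease_expires_at: rng.pick_step(max_step),
    }
}

/// Counts the steps in 1..=max_step that have at least one fault scheduled.
pub fn faulty_steps(plan: &FaultPlan, max_step: u32) -> u32 {
    (1..=max_step)
        .filter(|&s| {
            s == plan.read_fails_at || s == plan.write_fails_at || s == plan.lease_expires_at
        })
        .count() as u32
}

#[test]
fn every_seed_yields_a_replayable_plan() {
    for seed in 0..200u64 {
        let plan = plan_for_seed(seed, 10);
        assert_eq!(plan, plan_for_seed(seed, 10)); // same seed, same plan
        assert!((1..=3).contains(&faulty_steps(&plan, 10)), "seed {}", seed);
    }
}
```

Printing the seed in the assertion message is the important design choice: a failure reported for a given seed can be rerun as a single fixed scenario.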