Recipe 58: Workload That Migrates on Node Drain

Situation

A stateful workload (a queue worker, a stream consumer, a long-running batch step) has been placed on a specific node. An operator decides to drain that node for a hardware swap. The scheduler walks the node through Accepting → Draining → Drained → Maintenance → Returning → Accepting. The workload needs to:

  • Stop admitting new work the moment the node leaves Accepting;
  • Finish whatever’s in-flight, then checkpoint cleanly;
  • Persist the final checkpoint atomically when the node confirms Drained;
  • Stay quiet through Maintenance and Returning;
  • Resume from the checkpoint when the node (or a successor) goes back to Accepting;
  • Halt fail-closed if the fence ack is lost mid-drain (FenceLost) — operator intervention required before resuming.

The grafOS scheduler emits typed NodeModeTransition events on every state change. This recipe is the in-workload state machine that consumes those events and produces the right operational action.

What You Build

A pure decision function + a lifecycle state struct that:

  • Takes a typed NodeModeTransition, validates it against the scheduler’s state machine (NodeMode::legal_transition_to), and refuses garbage events fail-closed;
  • Emits a typed WorkloadAction (Accept, CheckpointAndDrain, PersistFinalCheckpoint, Quiesce, ResumeFromCheckpoint, Halt);
  • Carries minimal lifecycle state (current_mode, accepts_new_work, has_persisted_checkpoint) the runtime adapter can persist alongside its checkpoint payload.

The compiled recipe lives in cookbook/recipe-58-workload-on-node-drain.
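
As a rough picture of the shapes listed above, the following sketch declares the action enum and the lifecycle struct. The variant and field names come from this recipe; the field types, the derives, the initial field values, and the local NodeMode stand-in (used in place of grafos_scheduler_service::NodeMode so the block compiles on its own) are assumptions, not the crate's actual definitions.

```rust
// Stand-in for grafos_scheduler_service::NodeMode (assumption: the real
// enum lives in the scheduler crate and may carry more than this).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum NodeMode {
    Accepting,
    Draining,
    Drained,
    Maintenance,
    Returning,
    FenceLost,
}

/// What the runtime adapter should do after a transition.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum WorkloadAction {
    Accept,
    CheckpointAndDrain,
    PersistFinalCheckpoint,
    Quiesce,
    ResumeFromCheckpoint,
    Halt,
}

/// Minimal lifecycle state, small enough to store next to a checkpoint.
/// Field types are assumed; only the names appear in the recipe text.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct WorkloadLifecycle {
    pub current_mode: NodeMode,
    pub accepts_new_work: bool,
    pub has_persisted_checkpoint: bool,
}

impl WorkloadLifecycle {
    /// A workload that has just been placed on a node in Accepting mode
    /// (assumed initial values: admitting work, nothing persisted yet).
    pub fn fresh_on_accepting_node() -> Self {
        Self {
            current_mode: NodeMode::Accepting,
            accepts_new_work: true,
            has_persisted_checkpoint: false,
        }
    }
}
```

Keeping the struct to three plain fields is what makes it cheap to serialize alongside whatever checkpoint payload the workload already writes.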

Core grafOS API Path

use grafos_scheduler_service::{NodeMode, NodeModeTransition, NodeModeTransitionError};

// The state machine is authoritative here:
assert!(NodeMode::Accepting.legal_transition_to(NodeMode::Draining));
assert!(NodeMode::Draining.legal_transition_to(NodeMode::Drained));
assert!(NodeMode::Drained.legal_transition_to(NodeMode::Maintenance));
assert!(NodeMode::Maintenance.legal_transition_to(NodeMode::Returning));
assert!(NodeMode::Returning.legal_transition_to(NodeMode::Accepting));

// The fault path:
assert!(NodeMode::Drained.legal_transition_to(NodeMode::FenceLost));
assert!(NodeMode::FenceLost.legal_transition_to(NodeMode::Draining));
assert!(NodeMode::FenceLost.legal_transition_to(NodeMode::Drained));

// Illegal transitions return the typed error:
let illegal = NodeModeTransition {
    node_id: 0x42,
    from: NodeMode::Accepting,
    to: NodeMode::Drained,
};
if !illegal.from.legal_transition_to(illegal.to) {
    let _err = NodeModeTransitionError::IllegalTransition {
        from: illegal.from,
        to: illegal.to,
    };
}

Program

use cookbook_recipe_58_workload_on_node_drain::{
    action_for_transition, validate_transition, WorkloadAction, WorkloadLifecycle,
};
use grafos_scheduler_service::{NodeMode, NodeModeTransition};

let mut state = WorkloadLifecycle::fresh_on_accepting_node();

// Runtime adapter delivers a transition. Apply, get the action,
// route the action to the workload's executor.
let event = NodeModeTransition {
    node_id: 0x42,
    from: NodeMode::Accepting,
    to: NodeMode::Draining,
};

match state.apply(event)? {
    WorkloadAction::Accept => { /* keep admitting new work */ }
    WorkloadAction::CheckpointAndDrain => {
        // Refuse new work; finish in-flight; checkpoint when
        // in-flight drops to zero.
    }
    WorkloadAction::PersistFinalCheckpoint => {
        // Drain reached fence ack; atomically write the final
        // checkpoint somewhere durable.
    }
    WorkloadAction::Quiesce => { /* Maintenance / Returning */ }
    WorkloadAction::ResumeFromCheckpoint => {
        // Node is back; resume from the most recent checkpoint
        // (here or on another node, depending on the runtime).
    }
    WorkloadAction::Halt => {
        // Fault: fence ack lost or unexpected transition. Refuse
        // all work until an operator clears the fault.
    }
}
// The `?` above assumes the enclosing function returns this Result type.
Ok::<(), grafos_scheduler_service::NodeModeTransitionError>(())
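
The match arms above only describe what the executor should do. One way to wire them up is a small dispatch function over an executor trait, sketched here. The WorkloadExecutor trait, its method names, the dispatch function, and the RecordingExecutor test double are all hypothetical and not part of the recipe crate; WorkloadAction is redeclared locally so the block compiles on its own.

```rust
// Local copy of the recipe's action enum so the sketch is self-contained.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum WorkloadAction {
    Accept,
    CheckpointAndDrain,
    PersistFinalCheckpoint,
    Quiesce,
    ResumeFromCheckpoint,
    Halt,
}

/// Hypothetical executor interface; a real runtime adapter defines its own.
pub trait WorkloadExecutor {
    fn set_admitting(&mut self, admitting: bool);
    fn checkpoint_when_idle(&mut self);
    fn persist_final_checkpoint(&mut self);
    fn resume_from_checkpoint(&mut self);
    fn halt(&mut self);
}

/// Routes one action to the executor. Every non-Accept, non-Resume arm
/// leaves admission off, matching the fail-closed intent of the recipe.
pub fn dispatch<E: WorkloadExecutor>(action: WorkloadAction, exec: &mut E) {
    match action {
        WorkloadAction::Accept => exec.set_admitting(true),
        WorkloadAction::CheckpointAndDrain => {
            exec.set_admitting(false);
            exec.checkpoint_when_idle();
        }
        WorkloadAction::PersistFinalCheckpoint => exec.persist_final_checkpoint(),
        WorkloadAction::Quiesce => exec.set_admitting(false),
        WorkloadAction::ResumeFromCheckpoint => {
            exec.resume_from_checkpoint();
            exec.set_admitting(true);
        }
        WorkloadAction::Halt => {
            exec.set_admitting(false);
            exec.halt();
        }
    }
}

/// Test double that records the calls it receives, in order.
#[derive(Default)]
pub struct RecordingExecutor {
    pub calls: Vec<&'static str>,
}

impl WorkloadExecutor for RecordingExecutor {
    fn set_admitting(&mut self, admitting: bool) {
        self.calls.push(if admitting { "admit" } else { "refuse" });
    }
    fn checkpoint_when_idle(&mut self) {
        self.calls.push("checkpoint_when_idle");
    }
    fn persist_final_checkpoint(&mut self) {
        self.calls.push("persist_final_checkpoint");
    }
    fn resume_from_checkpoint(&mut self) {
        self.calls.push("resume_from_checkpoint");
    }
    fn halt(&mut self) {
        self.calls.push("halt");
    }
}
```

Because the match has no wildcard arm, adding a variant to WorkloadAction becomes a compile error in dispatch rather than a silently ignored action.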

Design

The recipe splits the policy into three orthogonal pieces:

  1. validate_transition — checks the transition against the scheduler’s state machine (NodeMode::legal_transition_to). This is the same predicate the scheduler enforces on the producer side, so a workload that validates here rejects exactly the same garbage events the scheduler would.
  2. action_for_transition — pure mapping from (from, to) to a WorkloadAction. No state, no side effects. Easy to unit-test in isolation and easy to evolve when new transitions are added.
  3. WorkloadLifecycle::apply — composes validate + action, updates the in-process state (current_mode, accepts_new_work, has_persisted_checkpoint), and surfaces the action for the runtime to execute.

The WorkloadAction enum is closed-set: an unexpected transition maps to Halt, not Ignore. This is the fail-closed default — a workload running against an unknown state machine refuses to guess what to do.

The FenceLost fault path is explicit. The scheduler emits Drained → FenceLost only when the fence ack went missing, and the workload halts on that transition. Two transitions out of FenceLost are legal: back to Draining (re-attempt) and to Drained (the operator has forensically declared the prior fence valid). The workload still halts on either one and stays halted until the operator’s intervention clears the state. The recipe encodes this: any transition where from == FenceLost returns WorkloadAction::Halt.
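
The three pieces and the FenceLost rule can be sketched end to end as follows. This is a minimal reconstruction from the behaviour described on this page, not the recipe crate's source: the scheduler types are local stand-ins, the transition table contains only the pairs listed here plus same-state pairs, and the way apply updates the two boolean fields is an assumption.

```rust
// Stand-ins for the grafos_scheduler_service types (assumed shapes).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum NodeMode { Accepting, Draining, Drained, Maintenance, Returning, FenceLost }

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct NodeModeTransition { pub node_id: u64, pub from: NodeMode, pub to: NodeMode }

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum NodeModeTransitionError { IllegalTransition { from: NodeMode, to: NodeMode } }

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum WorkloadAction {
    Accept, CheckpointAndDrain, PersistFinalCheckpoint, Quiesce, ResumeFromCheckpoint, Halt,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct WorkloadLifecycle {
    pub current_mode: NodeMode,
    pub accepts_new_work: bool,
    pub has_persisted_checkpoint: bool,
}

impl NodeMode {
    /// Stand-in predicate: same-state pairs plus the pairs this page lists.
    pub fn legal_transition_to(self, to: NodeMode) -> bool {
        use NodeMode::*;
        self == to
            || matches!(
                (self, to),
                (Accepting, Draining) | (Draining, Drained) | (Drained, Maintenance)
                    | (Maintenance, Returning) | (Returning, Accepting)
                    | (Drained, FenceLost) | (FenceLost, Draining) | (FenceLost, Drained)
            )
    }
}

/// Piece 1: reject anything the state machine would reject.
pub fn validate_transition(t: &NodeModeTransition) -> Result<(), NodeModeTransitionError> {
    if t.from.legal_transition_to(t.to) {
        Ok(())
    } else {
        Err(NodeModeTransitionError::IllegalTransition { from: t.from, to: t.to })
    }
}

/// Piece 2: pure mapping from (from, to) to an action.
pub fn action_for_transition(from: NodeMode, to: NodeMode) -> WorkloadAction {
    use NodeMode::*;
    match (from, to) {
        // Fault path first: entering or leaving FenceLost always halts.
        (FenceLost, _) | (_, FenceLost) => WorkloadAction::Halt,
        // Duplicate delivery of an already-observed state.
        (a, b) if a == b => WorkloadAction::Accept,
        (Accepting, Draining) => WorkloadAction::CheckpointAndDrain,
        (Draining, Drained) => WorkloadAction::PersistFinalCheckpoint,
        (Drained, Maintenance) | (Maintenance, Returning) => WorkloadAction::Quiesce,
        (Returning, Accepting) => WorkloadAction::ResumeFromCheckpoint,
        // Legal but unmodelled pair: refuse to guess.
        _ => WorkloadAction::Halt,
    }
}

impl WorkloadLifecycle {
    pub fn fresh_on_accepting_node() -> Self {
        Self { current_mode: NodeMode::Accepting, accepts_new_work: true, has_persisted_checkpoint: false }
    }

    /// Piece 3: validate, map, then commit. The early return on error
    /// happens before any field is written, so a rejected event leaves
    /// the state untouched. This sketch does not compare `t.from` with
    /// `current_mode`; whether the real recipe does is not stated here.
    pub fn apply(&mut self, t: NodeModeTransition) -> Result<WorkloadAction, NodeModeTransitionError> {
        validate_transition(&t)?;
        let action = action_for_transition(t.from, t.to);
        self.current_mode = t.to;
        // Assumed rule: new work is admitted only while the node is Accepting.
        self.accepts_new_work = t.to == NodeMode::Accepting;
        if action == WorkloadAction::PersistFinalCheckpoint {
            self.has_persisted_checkpoint = true;
        }
        Ok(action)
    }
}
```

Placing the FenceLost arm ahead of the same-state arm means a duplicate FenceLost → FenceLost delivery also halts, which keeps the fault path sticky under the assumptions above.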

Failure Modes

  • Illegal transition delivered (event-bus glitch, stale message): NodeModeTransitionError::IllegalTransition { from, to } returned from apply. The workload’s lifecycle state does NOT mutate.
  • Fence ack lost (Drained → FenceLost): workload halts. The runtime should surface this on the operator alert channel.
  • Unexpected legal transition (a pair the state machine accepts but the recipe doesn’t model — e.g. a future addition): maps to WorkloadAction::Halt so the workload doesn’t silently misinterpret.
  • Idempotent same-state delivery: a duplicate Accepting → Accepting (or any other same-state pair) is accepted by the state machine and returns WorkloadAction::Accept, so polling loops that re-apply observed state don’t synthesize failures.

Tests

Run it with:

cargo test -p cookbook-recipe-58-workload-on-node-drain

Six tests cover the full drain-and-rejoin lifecycle (5 transitions, each producing the right WorkloadAction), FenceLost halting, FenceLost recovery still halting (workload refuses to act until operator clears), illegal-transition rejection with typed error, idempotent same-state update accepted, and a validate_transition parity check against the state machine.

Adaptation Notes

  • Checkpoint payload: WorkloadLifecycle carries only the workload-lifecycle bits. Production callers wrap it with the domain-specific checkpoint state (in-flight queue positions, partial results, retry counters). The recipe’s has_persisted_checkpoint boolean signals that the workload CALLED its persist hook; the runtime is responsible for what’s actually durable.
  • Cross-node migration: WorkloadAction::ResumeFromCheckpoint doesn’t pin “where” — that’s the runtime adapter’s call. A typical adapter has its own placement input (the same node post-Returning → Accepting, or a different node when the drained one stays in Maintenance).
  • Operator-facing telemetry: emit the transition + chosen WorkloadAction as a typed event (snake_case as_str() — see docs/operations/siem-vocabulary-cookbook.md) so the dashboard surface shows drain progress per workload.
  • Multi-node workloads: the recipe scopes one workload to one node. Workloads that span multiple nodes (replicated services, distributed batches) run one WorkloadLifecycle per node and aggregate the action stream — e.g. when a 3-replica service sees one node hit CheckpointAndDrain, the orchestrator can pre-warm a fourth replica before the drain completes.
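
For the multi-node note above, a minimal sketch of aggregating per-node actions might look like this. The ReplicaActions type and its methods are hypothetical names introduced for illustration, the u64 node id type is an assumption, and the pre-warm rule is simply the 3-replica example restated as code.

```rust
use std::collections::HashMap;

// Local copy of the recipe's action enum so the sketch is self-contained.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum WorkloadAction {
    Accept,
    CheckpointAndDrain,
    PersistFinalCheckpoint,
    Quiesce,
    ResumeFromCheckpoint,
    Halt,
}

/// Hypothetical orchestrator-side view: the latest action emitted by the
/// per-node WorkloadLifecycle of each replica, keyed by node id.
pub struct ReplicaActions {
    latest: HashMap<u64, WorkloadAction>,
}

impl ReplicaActions {
    pub fn new() -> Self {
        Self { latest: HashMap::new() }
    }

    /// Records the most recent action for one node, replacing any older one.
    pub fn record(&mut self, node_id: u64, action: WorkloadAction) {
        self.latest.insert(node_id, action);
    }

    /// Node ids whose replica is currently draining, in ascending order.
    pub fn draining_nodes(&self) -> Vec<u64> {
        let mut ids: Vec<u64> = self
            .latest
            .iter()
            .filter(|(_, action)| **action == WorkloadAction::CheckpointAndDrain)
            .map(|(id, _)| *id)
            .collect();
        ids.sort_unstable();
        ids
    }

    /// True as soon as at least one replica has started to drain, which is
    /// the point where pre-warming a replacement can begin.
    pub fn should_prewarm(&self) -> bool {
        !self.draining_nodes().is_empty()
    }
}
```

An orchestrator would call record from each node's event loop and poll should_prewarm; what "pre-warm" means in practice is left to the runtime.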

See also:

  • crates/grafos-scheduler-service/src/lib.rs: NodeMode, NodeModeTransition, NodeModeTransitionError, legal_transition_to.
  • docs/operations/scheduler-features.md § “Node mode lifecycle” — operator-facing reference + state-machine diagram.
  • docs/operations/siem-vocabulary-cookbook.md — log filters keyed on kind == "node_mode_transition".
  • grafos admin manage-node --node <id> --mode <name> — the CLI surface operators use to drive the state machine.