Work Orders & Build System

This document explains the design rationale behind Brokkr’s work order system and its integration with Shipwright for container builds. Work orders extend Brokkr beyond Kubernetes manifest distribution into task orchestration across clusters.

Why Work Orders Exist

Brokkr’s core workflow — pushing YAML deployment objects to agents — solves the “distribute manifests” problem well. But real-world operations need more:

Build container images on specific clusters (near source registries, with GPU access, etc.)
Run one-off tasks like database migrations, cleanup scripts, or certificate rotations
Execute maintenance operations that need to happen on specific clusters

Work orders address these needs without changing the pull-based deployment model. They’re a parallel system: deployment objects are continuous desired state; work orders are discrete, one-time tasks.

Design Decisions

Pull-Based, Not Push-Based

Like deployment objects, work orders use a pull model. The broker doesn’t push commands to agents. Instead:

Admin creates a work order in the broker
Broker determines eligible agents via targeting rules
Agents poll for pending work orders they’re authorized to claim
One agent claims and executes the work order
Agent reports completion (success or failure)

This preserves Brokkr’s key property: clusters behind firewalls work without inbound connections.
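The steps above can be sketched from the agent's side as a single polling cycle. This is a minimal illustration, not Brokkr's agent code: `FakeBroker`, `claim_next`, `report`, and `poll_once` are hypothetical names, and the in-memory broker stands in for what would be outbound HTTP requests to the real broker.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class FakeBroker:
    """In-memory stand-in for the broker (hypothetical; the real one is remote)."""
    pending: list = field(default_factory=list)
    results: dict = field(default_factory=dict)

    def claim_next(self, agent_id: str) -> Optional[dict]:
        # Hand out one pending work order, or None if there is nothing to claim.
        return self.pending.pop(0) if self.pending else None

    def report(self, order_id: str, success: bool, message: str) -> None:
        # Record the completion report sent by the agent.
        self.results[order_id] = (success, message)


def poll_once(broker: FakeBroker, agent_id: str,
              execute: Callable[[dict], str]) -> bool:
    """Run one polling cycle; every request is initiated by the agent."""
    order = broker.claim_next(agent_id)
    if order is None:
        return False  # nothing to do this cycle
    try:
        broker.report(order["id"], True, execute(order))
    except Exception as exc:
        # Any failure during execution is reported back instead of raised.
        broker.report(order["id"], False, str(exc))
    return True
```

Because the agent always opens the connection, nothing in this loop requires the broker to reach into the cluster.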

Single-Claim Semantics

A work order is claimed by exactly one agent at a time. This prevents duplicate execution: you don’t want two clusters running the same database migration. The claim is atomic: the work order transitions from PENDING to CLAIMED in a single database transaction, so two concurrent claim attempts cannot both succeed.
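One common way to get this kind of atomic claim is a conditional UPDATE that only matches rows still in PENDING. The sketch below uses SQLite purely for illustration; the `status` and `claimed_by` column names are assumptions, and Brokkr's actual schema and database may differ.

```python
import sqlite3


def claim(conn: sqlite3.Connection, order_id: str, agent_id: str) -> bool:
    """Try to claim a work order; True only for the single winning agent."""
    with conn:  # one transaction: commit on success, roll back on error
        cur = conn.execute(
            "UPDATE work_orders SET status = 'CLAIMED', claimed_by = ? "
            "WHERE id = ? AND status = 'PENDING'",
            (agent_id, order_id),
        )
    # The WHERE clause matches at most once, so only one caller sees rowcount 1.
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE work_orders (id TEXT PRIMARY KEY, status TEXT, claimed_by TEXT)"
)
conn.execute("INSERT INTO work_orders VALUES ('wo-1', 'PENDING', NULL)")
conn.commit()

first = claim(conn, "wo-1", "agent-a")   # wins the claim
second = claim(conn, "wo-1", "agent-b")  # loses: status is no longer PENDING
```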

Retry with Exponential Backoff

When a work order fails, the system supports automatic retries:

PENDING → CLAIMED → (failure) → RETRY_PENDING → (backoff expires) → PENDING → CLAIMED

The backoff follows the formula: 2^retry_count × backoff_seconds

Retry	Calculation (backoff_seconds = 60)	Wait
1st	2¹ × 60	2 minutes
2nd	2² × 60	4 minutes
3rd	2³ × 60	8 minutes

After max_retries failures, the work order is moved to the work order log with success=false.
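The formula can be checked with a few lines of arithmetic. The function name here is illustrative, not part of Brokkr's API; the 60-second base matches the table above.

```python
def backoff_wait_seconds(retry_count: int, backoff_seconds: int = 60) -> int:
    """Wait before a retry: 2^retry_count * backoff_seconds."""
    return (2 ** retry_count) * backoff_seconds


waits = [backoff_wait_seconds(n) for n in (1, 2, 3)]
print(waits)  # [120, 240, 480], i.e. 2, 4, and 8 minutes
```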

Why Retryability Is Caller-Declared

The retryable flag on completion is set by the agent, not the broker. This is intentional: only the agent knows whether the failure was transient (network timeout, resource contention) or permanent (invalid YAML, missing permissions). Non-retryable failures skip the retry loop entirely.

Stale Claim Detection

If an agent claims a work order but crashes before completing it, the work order would be stuck in CLAIMED forever. The broker’s maintenance task detects stale claims:

Each work order has a claim_timeout_seconds (default: 3600 = 1 hour)
If a claimed work order exceeds its timeout, it’s released back to PENDING
The maintenance task runs every 10 seconds
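A maintenance pass of this kind might look like the sketch below. It is an in-memory illustration under stated assumptions: the `WorkOrder` class, the `claimed_at` field, and the `release_stale_claims` function are hypothetical, and only `claim_timeout_seconds` and the state names come from the description above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class WorkOrder:
    id: str
    status: str
    claimed_at: Optional[float] = None  # epoch seconds (hypothetical field)
    claim_timeout_seconds: int = 3600   # default of one hour


def release_stale_claims(orders: List[WorkOrder], now: float) -> List[str]:
    """Return claimed orders whose timeout has elapsed to PENDING."""
    released = []
    for order in orders:
        if order.status != "CLAIMED" or order.claimed_at is None:
            continue
        if now - order.claimed_at > order.claim_timeout_seconds:
            order.status = "PENDING"
            order.claimed_at = None
            released.append(order.id)
    return released


orders = [
    WorkOrder("wo-1", "CLAIMED", claimed_at=0.0),     # claimed long ago
    WorkOrder("wo-2", "CLAIMED", claimed_at=3000.0),  # still within timeout
    WorkOrder("wo-3", "PENDING"),
]
released = release_stale_claims(orders, now=4000.0)
```

Here only `wo-1` is released, since 4000 seconds have passed against a 3600-second timeout.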

State Machine

                    ┌──────────┐
                    │ PENDING  │◄─────────────────────┐
                    └────┬─────┘                      │
                         │ claim()                    │ backoff expires
                         ▼                            │
                    ┌──────────┐              ┌───────┴──────┐
                    │ CLAIMED  │              │RETRY_PENDING │
                    └────┬─────┘              └──────────────┘
                         │                            ▲
              ┌──────────┴──────────┐                 │
              ▼                     ▼                 │
     ┌────────────────┐    ┌────────────────┐         │
     │complete_success│    │complete_failure│─────────┘
     └────────┬───────┘    └────────┬───────┘  (retryable + retries left)
              │                     │
              ▼                     ▼
     ┌────────────────┐    ┌────────────────┐
     │  WORK_ORDER_LOG│    │  WORK_ORDER_LOG│
     │  success=true  │    │  success=false │
     └────────────────┘    └────────────────┘
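The completion branch of the diagram can be expressed as a small decision function. This is a sketch, not the broker's implementation: the names are illustrative, and it assumes a retry is allowed while retry_count is strictly less than max_retries, which the text does not spell out.

```python
from enum import Enum


class Outcome(Enum):
    LOG_SUCCESS = "work_order_log, success=true"
    LOG_FAILURE = "work_order_log, success=false"
    RETRY_PENDING = "RETRY_PENDING"


def on_completion(success: bool, retryable: bool,
                  retry_count: int, max_retries: int) -> Outcome:
    """Decide where a CLAIMED work order goes once the agent reports back."""
    if success:
        return Outcome.LOG_SUCCESS
    # Assumption: retries remain while retry_count < max_retries.
    if retryable and retry_count < max_retries:
        return Outcome.RETRY_PENDING
    # Non-retryable failures and exhausted retries both end in the log.
    return Outcome.LOG_FAILURE
```

A failure the agent marks as non-retryable goes straight to the log, regardless of how many retries remain.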

Targeting

Work orders support three targeting mechanisms: hard targets (explicit agent UUIDs), label matching, and annotation matching. At least one must be specified — the API rejects work orders with no targeting.

All three mechanisms use OR logic: an agent is eligible if it matches any of the specified targets, labels, or annotations. Combining mechanisms is also an OR: a work order that targets agent UUID-1 and the label builder:true can be claimed by agent UUID-1 or by any agent carrying that label.

Design note: This differs from template matching, which uses AND logic (a template’s labels must all be present on the stack). The rationale: work orders need to reach at least one capable agent (OR is permissive), while template matching needs to ensure full compatibility with a stack (AND is restrictive). See Template Matching & Rendering for comparison.
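The contrast between the two matching rules can be sketched as follows. The data shapes are assumptions made for the example (labels as a set of strings, annotations as key/value pairs), and the class and function names are hypothetical rather than taken from Brokkr's API.

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    id: str
    labels: set = field(default_factory=set)
    annotations: dict = field(default_factory=dict)


@dataclass
class Targeting:
    agent_ids: set = field(default_factory=set)
    labels: set = field(default_factory=set)
    annotations: dict = field(default_factory=dict)


def is_eligible(agent: Agent, t: Targeting) -> bool:
    """Work order targeting: any single match is enough (OR)."""
    if not (t.agent_ids or t.labels or t.annotations):
        raise ValueError("at least one targeting mechanism is required")
    return (
        agent.id in t.agent_ids
        or bool(agent.labels & t.labels)
        or any(agent.annotations.get(k) == v for k, v in t.annotations.items())
    )


def template_matches(template_labels: set, stack_labels: set) -> bool:
    """Template matching: every template label must be on the stack (AND)."""
    return template_labels <= stack_labels


targeting = Targeting(agent_ids={"UUID-1"}, labels={"builder:true"})
builder = Agent("UUID-2", labels={"builder:true", "region:eu"})
other = Agent("UUID-3", labels={"region:eu"})
```

With these values, `builder` is eligible through its label alone, while `other` matches neither the UUID nor the label.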

See the Work Orders Reference for the complete targeting API with request body examples.

Shipwright Build Integration

The primary built-in work order type is build, which integrates with Shipwright for container image builds.

The agent claims a build work order, applies the Shipwright Build resource, creates a BuildRun, watches it until completion, and reports back (including the image digest on success). See Container Builds with Shipwright for the complete operational guide.

Why Shipwright?

Shipwright provides a Kubernetes-native build abstraction:

No privileged containers — uses unprivileged build strategies (Buildah, Kaniko, etc.)
Cluster-native — builds run as Kubernetes resources, leverage cluster scheduling
Strategy flexibility — swap between Buildah, Kaniko, ko, S2I without changing build definitions
Build caching — strategies can cache layers for faster rebuilds

Builds have a 15-minute timeout: if the BuildRun doesn’t complete in time, it’s reported as failed. The timeout is a compile-time constant in the agent, not configurable at runtime.
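The watch-with-deadline behavior can be illustrated with a generic polling loop. This is a sketch only: `get_status` is a hypothetical callback standing in for a BuildRun status check, the polling interval is invented for the example, and the real agent may watch the resource rather than poll it.

```python
import time
from typing import Callable, Optional, Tuple

BUILD_TIMEOUT_SECONDS = 15 * 60  # mirrors the 15-minute limit described above


def wait_for_build(
    get_status: Callable[[], Optional[bool]],  # None while still running
    timeout: float = BUILD_TIMEOUT_SECONDS,
    poll_interval: float = 5.0,                # hypothetical interval
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[bool, str]:
    """Poll until the build finishes or the deadline passes."""
    deadline = clock() + timeout
    while clock() < deadline:
        status = get_status()
        if status is not None:
            return (status, "build succeeded" if status else "build failed")
        sleep(poll_interval)
    # A build that never finishes is reported as a failure.
    return (False, "build timed out")
```

Passing `clock` and `sleep` as parameters keeps the loop testable without waiting for real time to elapse.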

Work Order Log

Completed work orders (success or failure) are moved from the active work_orders table to the work_order_log table, which serves as an immutable audit trail:

Active Table	Log Table
`work_orders`	`work_order_log`
Mutable (status changes)	Immutable (write-once)
Current/pending work	Historical record
Cleaned up on completion	Retained indefinitely

The log records: original ID, work type, timing, claiming agent, success/failure, retry count, and result message.
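Moving a row from the active table to the log is naturally done as one transaction, so a work order is never in both tables or in neither. The sketch below uses SQLite for illustration; all column names are assumptions, and the timing fields are left out to keep it short.

```python
import sqlite3


def archive(conn: sqlite3.Connection, order_id: str,
            success: bool, message: str) -> None:
    """Copy a finished work order into the log and delete the active row."""
    with conn:  # insert and delete commit together or not at all
        row = conn.execute(
            "SELECT id, work_type, claimed_by, retry_count "
            "FROM work_orders WHERE id = ?",
            (order_id,),
        ).fetchone()
        if row is None:
            raise KeyError(order_id)
        conn.execute(
            "INSERT INTO work_order_log (original_id, work_type, claimed_by, "
            "retry_count, success, result_message) VALUES (?, ?, ?, ?, ?, ?)",
            (*row, int(success), message),
        )
        conn.execute("DELETE FROM work_orders WHERE id = ?", (order_id,))


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE work_orders (id TEXT PRIMARY KEY, work_type TEXT, "
    "claimed_by TEXT, retry_count INTEGER)"
)
conn.execute(
    "CREATE TABLE work_order_log (original_id TEXT, work_type TEXT, "
    "claimed_by TEXT, retry_count INTEGER, success INTEGER, result_message TEXT)"
)
conn.execute("INSERT INTO work_orders VALUES ('wo-1', 'build', 'agent-a', 2)")
conn.commit()

archive(conn, "wo-1", success=False, message="build timed out")
```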

Custom Work Orders

Beyond builds, work orders support arbitrary YAML:

{
  "work_type": "custom",
  "yaml_content": "apiVersion: batch/v1\nkind: Job\nmetadata:\n  name: db-migration\nspec:\n  template:\n    spec:\n      containers:\n      - name: migrate\n        image: myapp/migrate:v1\n      restartPolicy: Never"
}

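Custom work orders apply the YAML to the cluster and monitor completion. This lets Brokkr orchestrate arbitrary Kubernetes Jobs, CronJobs, or any other resource. The snippet above shows only the fields specific to custom work orders; a complete request also needs targeting, as described under Targeting.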
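Hand-escaping a manifest into a JSON string is error-prone, so it can help to generate the body from readable YAML. The sketch below builds the same two fields shown in the JSON above; the helper name is hypothetical, and any further fields the API expects, including targeting, are not shown because their names are not given here.

```python
import json

JOB_YAML = """\
apiVersion: batch/v1
kind: Job
metadata:
  name: db-migration
spec:
  template:
    spec:
      containers:
      - name: migrate
        image: myapp/migrate:v1
      restartPolicy: Never
"""


def custom_work_order(yaml_content: str) -> str:
    """Serialize a custom work order body; json.dumps handles the escaping."""
    return json.dumps(
        {"work_type": "custom", "yaml_content": yaml_content}, indent=2
    )


payload = custom_work_order(JOB_YAML)
```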

Work Orders Reference — API endpoints and data model
Container Builds with Shipwright — setup and usage guide
Data Flows — work order lifecycle in context
Architecture — system-level view
