Work Orders

Work orders are transient operations that Brokkr routes to agents for execution. Unlike deployment objects, which represent persistent desired state, work orders are one-time operations such as container builds, tests, or backups.

Concepts

Work Order vs Deployment Object

| Aspect | Deployment Object | Work Order |
|---|---|---|
| Purpose | Persistent state | One-time operation |
| Lifecycle | Applied, reconciled, deleted | Created, claimed, completed |
| Examples | Deployments, ConfigMaps | Container builds, tests |
| Storage | Permanent in stack | Moved to log after completion |

Work Order Lifecycle

PENDING -> CLAIMED -> (success) -> work_order_log
                  \-> (failure) -> RETRY_PENDING -> PENDING (retry)
                                \-> work_order_log (max retries)
  1. PENDING: Work order created, waiting for an agent to claim
  2. CLAIMED: Agent has claimed the work order and is executing
  3. RETRY_PENDING: Execution failed, waiting for retry backoff
  4. Completed: Moved to work_order_log (success or max retries exceeded)
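The lifecycle above can be read as a small state machine. The following is a minimal sketch, not Brokkr code: the `TRANSITIONS` table and `can_transition` helper are hypothetical names, and the CLAIMED to PENDING edge reflects the stale-claim recovery described under Stale Claim Detection.

```python
# Illustrative model of the work order lifecycle.
# All names here are hypothetical; they are not part of Brokkr's API.
LOGGED = "work_order_log"  # terminal: the work order leaves the active queue

TRANSITIONS = {
    "PENDING": {"CLAIMED"},
    # success or max retries -> log; failure -> retry; stale claim -> PENDING
    "CLAIMED": {LOGGED, "RETRY_PENDING", "PENDING"},
    "RETRY_PENDING": {"PENDING"},
    LOGGED: set(),
}


def can_transition(current: str, target: str) -> bool:
    """Return True if the lifecycle allows moving from current to target."""
    return target in TRANSITIONS.get(current, set())
```

For example, `can_transition("PENDING", "CLAIMED")` holds, while a work order cannot jump from PENDING straight to RETRY_PENDING.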

Targeting

Work orders are routed to agents using the same targeting mechanisms as stacks:

  • Direct agent IDs: Route to specific agents by UUID
  • Labels: Route to agents with matching labels (OR logic)
  • Annotations: Route to agents with matching annotations (OR logic)

An agent can claim a work order if it matches ANY of the specified targeting criteria.
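The ANY-match rule can be sketched as follows. This is an illustration under assumptions, not the broker's implementation: the `Agent` and `Targeting` classes are hypothetical, annotation matching is assumed to require both key and value to be equal, and an empty targeting object is assumed to match no agent.

```python
from dataclasses import dataclass, field


@dataclass
class Targeting:
    # Mirrors the targeting fields of the create request.
    agent_ids: list[str] = field(default_factory=list)
    labels: list[str] = field(default_factory=list)
    annotations: dict[str, str] = field(default_factory=dict)


@dataclass
class Agent:
    # Hypothetical view of an agent's identity and metadata.
    id: str
    labels: set[str] = field(default_factory=set)
    annotations: dict[str, str] = field(default_factory=dict)


def agent_matches(agent: Agent, targeting: Targeting) -> bool:
    """Return True if the agent satisfies ANY targeting criterion (OR logic)."""
    if agent.id in targeting.agent_ids:
        return True
    if any(label in agent.labels for label in targeting.labels):
        return True
    # Assumption: an annotation matches only when key and value are both equal.
    return any(agent.annotations.get(k) == v for k, v in targeting.annotations.items())
```

With targeting `labels=["env=dev"]` and `annotations={"capability": "builder"}`, an agent carrying only the `env=dev` label qualifies, as does an agent carrying only the `capability: builder` annotation.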

Work Types

Build (build)

Container image builds using Shipwright. The yaml_content should contain a Shipwright Build specification.

See Container Builds with Shipwright for details.

API Reference

Create Work Order

POST /api/v1/work-orders
Authorization: Bearer <admin-pak>
Content-Type: application/json

{
  "work_type": "build",
  "yaml_content": "<shipwright-build-yaml>",
  "max_retries": 3,
  "backoff_seconds": 60,
  "claim_timeout_seconds": 3600,
  "targeting": {
    "labels": ["env=dev"],
    "annotations": {"capability": "builder"}
  }
}

Parameters:

| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| work_type | string | Yes | - | Type of work (e.g., "build") |
| yaml_content | string | Yes | - | YAML content for the work |
| max_retries | integer | No | 3 | Maximum retry attempts |
| backoff_seconds | integer | No | 60 | Base backoff for exponential retry |
| claim_timeout_seconds | integer | No | 3600 | Seconds before claimed work is considered stale |
| targeting | object | Yes | - | Targeting configuration |
| targeting.agent_ids | array | No | - | Direct agent UUIDs |
| targeting.labels | array | No | - | Agent labels to match |
| targeting.annotations | object | No | - | Agent annotations to match |

List Work Orders

GET /api/v1/work-orders?status=PENDING&work_type=build
Authorization: Bearer <admin-pak>

Query Parameters:

| Parameter | Description |
|---|---|
| status | Filter by status (PENDING, CLAIMED, RETRY_PENDING) |
| work_type | Filter by work type |

Get Work Order

GET /api/v1/work-orders/:id
Authorization: Bearer <admin-pak>

Cancel Work Order

DELETE /api/v1/work-orders/:id
Authorization: Bearer <admin-pak>

Get Pending Work Orders (Agent)

GET /api/v1/agents/:agent_id/work-orders/pending?work_type=build
Authorization: Bearer <agent-pak>

Returns work orders that the agent can claim based on targeting rules.

Claim Work Order (Agent)

POST /api/v1/work-orders/:id/claim
Authorization: Bearer <agent-pak>
Content-Type: application/json

{
  "agent_id": "<agent-uuid>"
}

Atomically claims the work order. Returns 404 Not Found if the work order does not exist or is not in a claimable state.
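The source does not say how the broker makes claiming atomic. One common technique is a conditional UPDATE that succeeds only while the row is still PENDING, sketched here with Python's built-in `sqlite3`. The table layout and the `try_claim` helper are assumptions for illustration; only the `claimed_by` and `claimed_at` field names come from this page.

```python
import sqlite3


def try_claim(conn: sqlite3.Connection, work_order_id: str, agent_id: str) -> bool:
    """Claim the work order only if it is still PENDING; return True on success."""
    cur = conn.execute(
        "UPDATE work_orders "
        "SET status = 'CLAIMED', claimed_by = ?, claimed_at = CURRENT_TIMESTAMP "
        "WHERE id = ? AND status = 'PENDING'",
        (agent_id, work_order_id),
    )
    conn.commit()
    # Zero rows updated means the order is missing or no longer claimable.
    return cur.rowcount == 1


# Hypothetical in-memory table for demonstration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE work_orders "
    "(id TEXT PRIMARY KEY, status TEXT, claimed_by TEXT, claimed_at TEXT)"
)
conn.execute("INSERT INTO work_orders (id, status) VALUES ('wo-1', 'PENDING')")
```

Because the status check and the update happen in a single statement, only one of two competing agents can win. A missing work order and an already-claimed one both yield `False`, which lines up with the single 404 response described above.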

Complete Work Order (Agent)

POST /api/v1/work-orders/:id/complete
Authorization: Bearer <agent-pak>
Content-Type: application/json

{
  "success": true,
  "message": "sha256:abc123..."
}

Parameters:

| Field | Type | Description |
|---|---|---|
| success | boolean | Whether the work completed successfully |
| message | string | Optional result message (image digest on success, error on failure) |
| retryable | boolean | Whether the work order can be retried on failure (default: true) |
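An agent might assemble the completion body as in the sketch below. The `completion_body` helper is hypothetical, and sending `retryable` only on failure is an assumption based on the field description, not documented behavior.

```python
import json
from typing import Optional


def completion_body(success: bool, message: Optional[str] = None, retryable: bool = True) -> str:
    """Build the JSON body for POST /api/v1/work-orders/:id/complete."""
    body = {"success": success}
    if message is not None:
        body["message"] = message
    if not success:
        # Assumption: retryable is only meaningful when reporting a failure.
        body["retryable"] = retryable
    return json.dumps(body)
```

A permanent failure, such as an invalid build specification, could then be reported with `completion_body(False, "invalid Build spec", retryable=False)`.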

Get Work Order Details

When retrieving a work order, the response includes error tracking fields:

| Field | Type | Description |
|---|---|---|
| last_error | string | Error message from the most recent failed attempt (null if no failures) |
| last_error_at | timestamp | When the last error occurred (null if no failures) |
| retry_count | integer | Number of retry attempts so far |
| next_retry_after | timestamp | When the work order will be eligible for retry (null if not in retry) |

These fields help diagnose failed work orders without needing to check the work order log.

Work Order Log

Completed work orders are moved to the log for audit purposes.

# List completed work orders
GET /api/v1/work-order-log?work_type=build&success=true&limit=100
Authorization: Bearer <admin-pak>

# Get specific completed work order
GET /api/v1/work-order-log/:id
Authorization: Bearer <admin-pak>

Query Parameters:

| Parameter | Description |
|---|---|
| work_type | Filter by work type |
| success | Filter by success status (true/false) |
| agent_id | Filter by agent that executed |
| limit | Maximum results to return |

Retry Behavior

When a work order fails:

  1. Agent reports failure via /complete with success: false
  2. Broker increments retry_count
  3. If retry_count < max_retries:
    • Status set to RETRY_PENDING
    • next_retry_after calculated with exponential backoff
    • After backoff period, status returns to PENDING
  4. If retry_count >= max_retries:
    • Work order moved to work_order_log with success: false

Backoff Formula:

next_retry_after = now + (backoff_seconds * 2^retry_count)
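As a worked illustration of the formula, the sketch below computes the delay for a given `retry_count`. The helper names are hypothetical, and the source does not say whether `retry_count` is read before or after the increment, so the function takes whichever value the caller supplies.

```python
from datetime import datetime, timedelta, timezone


def backoff_delay(backoff_seconds: int, retry_count: int) -> timedelta:
    """Delay given by backoff_seconds * 2^retry_count."""
    return timedelta(seconds=backoff_seconds * 2 ** retry_count)


def next_retry_after(now: datetime, backoff_seconds: int, retry_count: int) -> datetime:
    """Timestamp at which the work order becomes eligible for retry."""
    return now + backoff_delay(backoff_seconds, retry_count)
```

With the default `backoff_seconds` of 60, the delay is 60 seconds for `retry_count` 0, 120 seconds for 1, and 240 seconds for 2.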

Stale Claim Detection

The broker runs a background job every 30 seconds to detect and recover stale claims. A claim is considered stale when an agent has held a work order for longer than claim_timeout_seconds without completing it.

On each run, the job processes claimed work orders as follows:

  1. The work order’s claimed_at timestamp is compared against the current time
  2. If the elapsed time exceeds claim_timeout_seconds, the claim is released
  3. The work order status returns to PENDING
  4. The claimed_by field is cleared, allowing any eligible agent to claim it
  5. The retry_count is incremented (counts as a failed attempt)

This mechanism handles several failure scenarios:

  • Agent crashes: If an agent crashes mid-execution, the work order becomes claimable again
  • Network partitions: If an agent loses connectivity, work doesn’t remain stuck indefinitely
  • Slow operations: A legitimate long-running operation is also treated as stale once it exceeds the timeout, so set claim_timeout_seconds high enough to cover the expected run time

The default claim_timeout_seconds is 3600 (1 hour). For build operations that may take longer, increase this value in the work order creation request.
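The stale-claim check described in this section can be sketched as follows. It is a minimal illustration, not the broker's background job: the dictionary layout and helper names are assumptions, and "exceeds" is read as a strict comparison.

```python
from datetime import datetime, timedelta, timezone


def is_stale(claimed_at: datetime, claim_timeout_seconds: int, now: datetime) -> bool:
    """True when the claim has been held longer than claim_timeout_seconds."""
    return (now - claimed_at).total_seconds() > claim_timeout_seconds


def release_if_stale(order: dict, now: datetime) -> bool:
    """Release a stale claim as documented; return True if it was released."""
    if order["status"] != "CLAIMED":
        return False
    if not is_stale(order["claimed_at"], order["claim_timeout_seconds"], now):
        return False
    order["status"] = "PENDING"   # claimable again
    order["claimed_by"] = None    # any eligible agent may claim it
    order["retry_count"] += 1     # a stale claim counts as a failed attempt
    return True
```

A scheduler would call `release_if_stale` for every CLAIMED work order on each pass; under the default timeout, a claim held for 3601 seconds is released while one held for exactly 3600 seconds is not.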

Example: Container Build

# Create a build work order
curl -X POST http://localhost:3000/api/v1/work-orders \
  -H "Authorization: Bearer $ADMIN_PAK" \
  -H "Content-Type: application/json" \
  -d '{
    "work_type": "build",
    "yaml_content": "apiVersion: shipwright.io/v1beta1\nkind: Build\nmetadata:\n  name: my-build\nspec:\n  source:\n    type: Git\n    git:\n      url: https://github.com/org/repo\n  strategy:\n    name: buildah\n    kind: ClusterBuildStrategy\n  output:\n    image: ttl.sh/my-image:latest",
    "targeting": {
      "labels": ["capability=builder"]
    }
  }'

# Check status (WORK_ORDER_ID is the ID of the work order created above)
curl http://localhost:3000/api/v1/work-orders/$WORK_ORDER_ID \
  -H "Authorization: Bearer $ADMIN_PAK"

# View completed builds
curl "http://localhost:3000/api/v1/work-order-log?work_type=build" \
  -H "Authorization: Bearer $ADMIN_PAK"

Database Schema

Work orders use a two-table design:

  • work_orders: Active queue for routing and retry management
  • work_order_log: Permanent audit trail of completed work

This separation optimizes queue operations while maintaining complete execution history.
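To illustrate the two-table idea, the sketch below moves a finished work order from the queue table to the log table inside one transaction, using Python's built-in `sqlite3`. The column names and the `archive` helper are assumptions for illustration only; the actual Brokkr schema is not shown in this page.

```python
import sqlite3

# Hypothetical, simplified columns; the real schema is not documented here.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE work_orders (id TEXT PRIMARY KEY, work_type TEXT, status TEXT);
CREATE TABLE work_order_log (id TEXT PRIMARY KEY, work_type TEXT, success INTEGER, message TEXT);
""")


def archive(conn: sqlite3.Connection, work_order_id: str, success: bool, message: str) -> None:
    """Move a work order from the active queue to the audit log atomically."""
    with conn:  # commits on success, rolls back if an exception is raised
        row = conn.execute(
            "SELECT id, work_type FROM work_orders WHERE id = ?", (work_order_id,)
        ).fetchone()
        if row is None:
            raise KeyError(work_order_id)
        conn.execute(
            "INSERT INTO work_order_log (id, work_type, success, message) VALUES (?, ?, ?, ?)",
            (row[0], row[1], int(success), message),
        )
        conn.execute("DELETE FROM work_orders WHERE id = ?", (work_order_id,))
```

Keeping the insert and the delete in the same transaction means a work order is never in both tables or in neither, and the queue table stays small because finished rows leave it.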