Diagnostics Reference

Brokkr provides an on-demand diagnostic system for collecting Kubernetes pod information, events, and logs from remote clusters. Administrators request diagnostics through the broker API, and agents collect the data from their local clusters.

Diagnostic Request Lifecycle

Created (pending) → Claimed (by agent) → Result submitted (completed)
                  → Expired (past retention)
                  → Failed (agent error)

Status Values

| Status | Description |
| --- | --- |
| pending | Request created, waiting for agent to claim |
| claimed | Agent has claimed the request and is collecting data |
| completed | Agent submitted diagnostic results |
| failed | Agent encountered an error during collection |
| expired | Request exceeded its retention period without completion |
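
The status values above can be read as a small state machine. The sketch below is one possible encoding. The transition table is an assumption inferred from the lifecycle diagram and the status descriptions, not the broker's actual implementation; `Status`, `ALLOWED_TRANSITIONS`, and `can_transition` are hypothetical names.

```python
from enum import Enum


class Status(str, Enum):
    PENDING = "pending"
    CLAIMED = "claimed"
    COMPLETED = "completed"
    FAILED = "failed"
    EXPIRED = "expired"


# Assumed transitions: a pending request is either claimed or expires;
# a claimed request either completes or fails. Terminal states have no exits.
ALLOWED_TRANSITIONS = {
    Status.PENDING: {Status.CLAIMED, Status.EXPIRED},
    Status.CLAIMED: {Status.COMPLETED, Status.FAILED},
    Status.COMPLETED: set(),
    Status.FAILED: set(),
    Status.EXPIRED: set(),
}


def can_transition(current: Status, target: Status) -> bool:
    """Return True if moving from `current` to `target` is allowed by the table."""
    return target in ALLOWED_TRANSITIONS[current]
```

Whether a claimed request can also expire is not stated in this reference, so the sketch leaves that transition out.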

Data Model

DiagnosticRequest

| Field | Type | Description |
| --- | --- | --- |
| id | UUID | Unique identifier |
| agent_id | UUID | Target agent to collect from |
| deployment_object_id | UUID | Deployment object to diagnose |
| status | String | Current status (see above) |
| requested_by | String? | Who requested the diagnostic (free-text) |
| created_at | DateTime | Request creation time |
| claimed_at | DateTime? | When agent claimed the request |
| completed_at | DateTime? | When result was submitted |
| expires_at | DateTime | When the request expires |

DiagnosticResult

| Field | Type | Description |
| --- | --- | --- |
| id | UUID | Unique identifier |
| request_id | UUID | Associated diagnostic request |
| pod_statuses | String (JSON) | Pod status information |
| events | String (JSON) | Kubernetes events |
| log_tails | String? (JSON) | Container log tails (last 100 lines per container) |
| collected_at | DateTime | When data was collected on the agent |
| created_at | DateTime | Record creation time |

API Endpoints

Create Diagnostic Request

POST /api/v1/deployment-objects/{deployment_object_id}/diagnostics

Auth: Admin only.

Request body:

{
  "agent_id": "uuid-of-target-agent",
  "requested_by": "oncall-engineer",
  "retention_minutes": 60
}

| Field | Type | Required | Default | Constraints |
| --- | --- | --- | --- | --- |
| agent_id | UUID | Yes | | Must be a valid agent |
| requested_by | String | No | null | Free-text identifier |
| retention_minutes | Integer | No | 60 | 1-1440 (max 24 hours) |

Response: 201 Created

{
  "id": "diag-uuid",
  "agent_id": "agent-uuid",
  "deployment_object_id": "do-uuid",
  "status": "pending",
  "requested_by": "oncall-engineer",
  "created_at": "2025-01-15T10:00:00Z",
  "expires_at": "2025-01-15T11:00:00Z"
}
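
A client can check the documented constraints before sending the request. The following is a minimal sketch that builds only the URL and JSON body. `build_create_request` and the base URL are hypothetical, and authentication is omitted because the admin credential format is not described in this reference.

```python
import json
from typing import Optional, Tuple
from uuid import UUID


def build_create_request(
    base_url: str,
    deployment_object_id: str,
    agent_id: str,
    requested_by: Optional[str] = None,
    retention_minutes: int = 60,
) -> Tuple[str, str]:
    """Return (url, json_body) for a Create Diagnostic Request call."""
    # Documented constraint: 1-1440 minutes (max 24 hours).
    if not 1 <= retention_minutes <= 1440:
        raise ValueError("retention_minutes must be between 1 and 1440")
    # UUID() raises ValueError on malformed input.
    deployment_object_id = str(UUID(deployment_object_id))
    agent_id = str(UUID(agent_id))

    url = (
        f"{base_url.rstrip('/')}/api/v1/deployment-objects/"
        f"{deployment_object_id}/diagnostics"
    )
    body = {"agent_id": agent_id, "retention_minutes": retention_minutes}
    if requested_by is not None:  # optional field, defaults to null
        body["requested_by"] = requested_by
    return url, json.dumps(body)
```

The returned pair can be passed to any HTTP client as a POST with a JSON content type.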

Get Diagnostic

GET /api/v1/diagnostics/{id}

Auth: Admin or the target agent.

Response: 200 OK

If the diagnostic is completed, the response includes the result:

{
  "request": {
    "id": "diag-uuid",
    "status": "completed",
    "claimed_at": "2025-01-15T10:00:15Z",
    "completed_at": "2025-01-15T10:00:20Z"
  },
  "result": {
    "pod_statuses": "[{\"name\": \"myapp-abc12\", \"namespace\": \"default\", \"phase\": \"Running\", ...}]",
    "events": "[{\"event_type\": \"Normal\", \"reason\": \"Pulled\", ...}]",
    "log_tails": "{\"myapp-abc12/myapp\": \"2025-01-15 10:00:00 INFO Starting...\\n...\"}",
    "collected_at": "2025-01-15T10:00:18Z"
  }
}
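
Because `pod_statuses`, `events`, and `log_tails` arrive as JSON-encoded strings inside the JSON response, a client has to decode them a second time. The helper below is a sketch of that step. `decode_result` is a hypothetical name, and treating a missing or null `result` as "not completed yet" is an assumption.

```python
import json
from typing import Any, Dict, Optional


def decode_result(response: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Decode the string-encoded JSON fields of a Get Diagnostic response."""
    result = response.get("result")
    if result is None:
        # Assumed: no result is present until the diagnostic is completed.
        return None
    log_tails = result.get("log_tails")  # optional field (String?)
    return {
        "pod_statuses": json.loads(result["pod_statuses"]),
        "events": json.loads(result["events"]),
        "log_tails": json.loads(log_tails) if log_tails is not None else {},
        "collected_at": result["collected_at"],
    }
```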

Get Pending Diagnostics (Agent)

GET /api/v1/agents/{agent_id}/diagnostics/pending

Auth: Agent (own ID only).

Returns all pending diagnostic requests for the agent.

Response: 200 OK, with a body of type DiagnosticRequest[]


Claim Diagnostic Request

POST /api/v1/diagnostics/{id}/claim

Auth: Agent.

Transitions the request from pending to claimed. Only one agent can claim a request.

Response: 200 OK


Submit Diagnostic Result

POST /api/v1/diagnostics/{id}/result

Auth: Agent (must have claimed the request).

Request body:

{
  "pod_statuses": "[{\"name\": \"myapp-abc12\", \"namespace\": \"default\", \"phase\": \"Running\", \"conditions\": [{\"condition_type\": \"Ready\", \"status\": \"True\"}], \"containers\": [{\"name\": \"myapp\", \"ready\": true, \"restart_count\": 0, \"state\": \"running\"}]}]",
  "events": "[{\"event_type\": \"Normal\", \"reason\": \"Pulled\", \"message\": \"Successfully pulled image\", \"involved_object_kind\": \"Pod\", \"involved_object_name\": \"myapp-abc12\", \"count\": 1}]",
  "log_tails": "{\"myapp-abc12/myapp\": \"2025-01-15 10:00:00 INFO Starting server on :8080\\n2025-01-15 10:00:01 INFO Ready to accept connections\"}",
  "collected_at": "2025-01-15T10:00:18Z"
}

Response: 201 Created
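
Taken together, the three agent endpoints above suggest a poll, claim, collect, submit loop. The sketch below shows one pass of such a loop. The `send` and `collect` callables are hypothetical stand-ins for an HTTP client and the agent's Kubernetes collector, and skipping a request on any non-200 claim response is an assumption, since this reference does not list the error codes.

```python
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical transport: (method, path, json_body) -> (status_code, parsed_body)
Send = Callable[[str, str, Any], Tuple[int, Any]]
# Hypothetical collector: takes a DiagnosticRequest dict, returns a result body.
Collect = Callable[[Dict[str, Any]], Dict[str, Any]]


def process_pending(agent_id: str, send: Send, collect: Collect) -> List[str]:
    """Run one polling pass; return the IDs whose results were accepted."""
    status, pending = send(
        "GET", f"/api/v1/agents/{agent_id}/diagnostics/pending", None
    )
    if status != 200:
        return []

    submitted = []
    for request in pending:
        diag_id = request["id"]
        status, _ = send("POST", f"/api/v1/diagnostics/{diag_id}/claim", None)
        if status != 200:
            continue  # claim was not granted; do not collect
        body = collect(request)
        status, _ = send("POST", f"/api/v1/diagnostics/{diag_id}/result", body)
        if status == 201:
            submitted.append(diag_id)
    return submitted
```

Claiming before collecting keeps the agent from doing Kubernetes API work for a request the broker will not accept.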


Collected Data

Pod Statuses

Each pod status includes:

| Field | Type | Description |
| --- | --- | --- |
| name | String | Pod name |
| namespace | String | Pod namespace |
| phase | String | Pod phase (Running, Pending, Failed, etc.) |
| conditions | Array | Pod conditions (Ready, Initialized, etc.) |
| containers | Array | Container statuses |

Container status fields:

| Field | Type | Description |
| --- | --- | --- |
| name | String | Container name |
| ready | Boolean | Whether the container is ready |
| restart_count | Integer | Number of restarts |
| state | String | Current state (running, waiting, terminated) |
| state_reason | String? | Reason for waiting/terminated state |
| state_message | String? | Message for waiting/terminated state |
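
As an illustration of how these fields might be used, the sketch below scans a `pod_statuses` string for containers that are not ready or have restarted. `unhealthy_containers` is a hypothetical helper, and the "not ready or restarted" heuristic is only one possible definition of unhealthy.

```python
import json
from typing import Any, Dict, List


def unhealthy_containers(pod_statuses_json: str) -> List[Dict[str, Any]]:
    """List containers that are not ready or have restarted at least once."""
    findings = []
    for pod in json.loads(pod_statuses_json):
        for container in pod.get("containers", []):
            restarts = container.get("restart_count", 0)
            if container.get("ready", False) and restarts == 0:
                continue  # healthy by this heuristic
            findings.append(
                {
                    "pod": f"{pod['namespace']}/{pod['name']}",
                    "container": container["name"],
                    "state": container.get("state"),
                    "reason": container.get("state_reason"),  # optional field
                    "restarts": restarts,
                }
            )
    return findings
```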

Events

| Field | Type | Description |
| --- | --- | --- |
| event_type | String? | Normal or Warning |
| reason | String? | Short reason string |
| message | String? | Human-readable message |
| involved_object_kind | String? | Kind of involved object (Pod, ReplicaSet, etc.) |
| involved_object_name | String? | Name of involved object |
| count | Integer? | Number of occurrences |
| first_timestamp | String? | First occurrence |
| last_timestamp | String? | Last occurrence |

Log Tails

A JSON object mapping pod-name/container-name to the last 100 lines of logs:

{
  "myapp-abc12/myapp": "line 1\nline 2\n...",
  "myapp-abc12/sidecar": "line 1\nline 2\n..."
}

The maximum log lines collected per container is 100 (configured via MAX_LOG_LINES).
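
The key format and line limit can be sketched as follows. This is not the agent's actual code: `tail` and `build_log_tails` are hypothetical helpers, and only the constant name `MAX_LOG_LINES` and its value of 100 come from this reference.

```python
import json
from typing import Dict, Tuple

MAX_LOG_LINES = 100  # per-container limit described above


def tail(text: str, max_lines: int = MAX_LOG_LINES) -> str:
    """Keep only the last `max_lines` lines of `text` (max_lines must be > 0)."""
    lines = text.splitlines()
    return "\n".join(lines[-max_lines:])


def build_log_tails(logs: Dict[Tuple[str, str], str]) -> str:
    """Encode {(pod, container): full_log} as the log_tails JSON string."""
    return json.dumps(
        {f"{pod}/{container}": tail(text) for (pod, container), text in logs.items()}
    )
```

The result is a string, so it can be placed directly in the `log_tails` field of a result submission.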


Automatic Cleanup

The broker runs a background task that periodically cleans up diagnostic data:

| Setting | Default | Description |
| --- | --- | --- |
| broker.diagnostic_cleanup_interval_seconds | 900 (15 min) | How often cleanup runs |
| broker.diagnostic_max_age_hours | 1 | Max age for completed/expired/failed diagnostics |

The cleanup task:

  1. Expires pending requests past their expires_at time
  2. Deletes completed, expired, and failed requests older than diagnostic_max_age_hours
  3. Deletes associated diagnostic results
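
The three steps above can be sketched against in-memory stores. This is an illustration under stated assumptions, not the broker's implementation: `Request` and `run_cleanup` are hypothetical, results are assumed to be keyed by request ID, and age is assumed to be measured from `created_at`, which this reference does not specify.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Any, Dict, List

TERMINAL = {"completed", "expired", "failed"}


@dataclass
class Request:
    id: str
    status: str
    created_at: datetime
    expires_at: datetime


def run_cleanup(
    requests: Dict[str, Request],
    results: Dict[str, Any],
    now: datetime,
    max_age_hours: int = 1,
) -> List[str]:
    """Apply one cleanup pass in place; return the IDs of deleted requests."""
    # Step 1: expire pending requests past their expires_at time.
    for request in requests.values():
        if request.status == "pending" and request.expires_at <= now:
            request.status = "expired"

    # Step 2: delete terminal requests older than the max age
    # (assumed to be measured from created_at).
    cutoff = now - timedelta(hours=max_age_hours)
    stale = [
        r.id for r in requests.values()
        if r.status in TERMINAL and r.created_at < cutoff
    ]
    for request_id in stale:
        del requests[request_id]
        # Step 3: delete the associated result, if any.
        results.pop(request_id, None)
    return stale
```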