How-To: Running On-Demand Diagnostics
This guide shows how to collect pod statuses, Kubernetes events, and container logs from a remote cluster when a deployment is misbehaving. Brokkr’s diagnostic system lets you request this data through the broker API without direct kubectl access to the target cluster.
When to Use Diagnostics
Use on-demand diagnostics when:
- A deployment shows degraded or failing health status
- You need to see pod conditions, restart counts, or OOMKill events
- You want container logs from a remote cluster you can’t directly access
- You’re troubleshooting why a deployment object failed to apply
Prerequisites
- Admin PAK for the broker
- The deployment_object_id of the resource you want to diagnose
- The agent_id of the agent running in the target cluster
- The agent must be connected and sending heartbeats
Step 1: Identify the Deployment Object
If you know the stack, list its deployment objects:
curl -s "http://localhost:3000/api/v1/stacks/${STACK_ID}/deployment-objects" \
-H "Authorization: <admin-pak>" | jq '.[] | {id, sequence_id, created_at}'
Check the health status to confirm something is wrong:
curl -s "http://localhost:3000/api/v1/deployment-objects/${DO_ID}/health" \
-H "Authorization: <admin-pak>" | jq .
Step 2: Find the Target Agent
List the agents and pick the one that targets this stack:
curl -s "http://localhost:3000/api/v1/agents" \
-H "Authorization: <admin-pak>" | jq '.[] | {id, name, cluster_name, last_heartbeat}'
Verify the agent has a recent heartbeat (within the last few minutes).
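If you want to script the heartbeat check rather than eyeball it, a small helper along these lines may do. It is only a sketch: heartbeat_is_fresh is a hypothetical name, the 300-second default is an assumed reading of "a few minutes", and the timestamps are passed as Unix epoch seconds so the function itself needs no date parsing.

```shell
# Hypothetical helper: succeeds when the heartbeat is between 0 and
# max_age seconds old. Timestamps in the future are rejected as suspicious.
heartbeat_is_fresh() {
  local heartbeat_epoch="$1" now_epoch="$2" max_age="${3:-300}"
  local age=$(( now_epoch - heartbeat_epoch ))
  [ "$age" -ge 0 ] && [ "$age" -le "$max_age" ]
}

# Usage sketch. Assumes GNU date and that last_heartbeat is an ISO 8601
# timestamp stored in LAST_HEARTBEAT:
#   hb=$(date -d "$LAST_HEARTBEAT" +%s)
#   heartbeat_is_fresh "$hb" "$(date +%s)" 300 && echo "agent looks alive"
```

Keeping the comparison on epoch seconds leaves the timestamp parsing, which differs between GNU and BSD date, outside the helper.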
Step 3: Request Diagnostics
Create a diagnostic request:
curl -s -X POST "http://localhost:3000/api/v1/deployment-objects/${DO_ID}/diagnostics" \
-H "Authorization: <admin-pak>" \
-H "Content-Type: application/json" \
-d "{
\"agent_id\": \"${AGENT_ID}\",
\"requested_by\": \"oncall-engineer\",
\"retention_minutes\": 120
}" | jq .
Save the diagnostic request ID from the response:
DIAG_ID="..."
The retention_minutes field controls how long the request stays active before expiring. The default is 60 minutes; the maximum is 1440 (24 hours).
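To catch an out-of-range value before it reaches the broker, you could run a pre-flight check like the one below. This is a sketch, not part of Brokkr: valid_retention_minutes is a hypothetical helper, and the lower bound of 1 is an assumption, since only the maximum is stated above.

```shell
# Hypothetical pre-flight check: accepts whole numbers from 1 to 1440.
# The upper bound comes from the guide; the lower bound is assumed.
valid_retention_minutes() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;   # reject empty or non-numeric input
  esac
  [ "$1" -ge 1 ] && [ "$1" -le 1440 ]
}

RETENTION_MINUTES=120
if valid_retention_minutes "$RETENTION_MINUTES"; then
  echo "retention_minutes=${RETENTION_MINUTES} is within range"
else
  echo "retention_minutes must be an integer between 1 and 1440" >&2
fi
```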
Step 4: Wait for Results
The agent picks up the diagnostic request on its next poll cycle. Poll the diagnostic status:
curl -s "http://localhost:3000/api/v1/diagnostics/${DIAG_ID}" \
-H "Authorization: <admin-pak>" | jq '.request.status'
Status progression: pending → claimed → completed
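Rather than re-running the status command by hand, you can wrap it in a loop. The following is a minimal sketch under a few assumptions: wait_for_diagnostic and fetch_diag_status are hypothetical names, the attempt count and interval are arbitrary defaults, and failed and expired are treated as terminal states because the Troubleshooting section lists them as possible outcomes.

```shell
# Polls until the diagnostic reaches a terminal state or attempts run out.
# $1 is a command that prints the current status as a bare string.
# Returns 0 on completed, 1 on failed or expired, 2 on timeout.
wait_for_diagnostic() {
  local fetch_status="$1" max_attempts="${2:-30}" interval="${3:-5}"
  local attempt=1 status
  while [ "$attempt" -le "$max_attempts" ]; do
    status=$("$fetch_status")
    case "$status" in
      completed) echo "$status"; return 0 ;;
      failed|expired) echo "$status"; return 1 ;;
    esac
    sleep "$interval"
    attempt=$(( attempt + 1 ))
  done
  echo "timeout"
  return 2
}

# Hypothetical fetch command wrapping the request shown above.
# jq -r strips the JSON quotes so the case patterns match.
fetch_diag_status() {
  curl -s "http://localhost:3000/api/v1/diagnostics/${DIAG_ID}" \
    -H "Authorization: <admin-pak>" | jq -r '.request.status'
}

# Usage: wait_for_diagnostic fetch_diag_status 30 5
```

Passing the fetch command as an argument keeps the loop independent of curl, so it can be exercised with a stub before pointing it at a real broker.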
Step 5: Read the Results
Once the status is completed, the full results are available:
# Pod statuses
curl -s "http://localhost:3000/api/v1/diagnostics/${DIAG_ID}" \
-H "Authorization: <admin-pak>" | jq -r '.result.pod_statuses' | jq .
# Kubernetes events
curl -s "http://localhost:3000/api/v1/diagnostics/${DIAG_ID}" \
-H "Authorization: <admin-pak>" | jq -r '.result.events' | jq .
# Container logs
curl -s "http://localhost:3000/api/v1/diagnostics/${DIAG_ID}" \
-H "Authorization: <admin-pak>" | jq -r '.result.log_tails' | jq .
Reading Pod Statuses
Look for:
- Phase: Pending or Failed indicates problems
- Conditions: Check Ready=False with the reason
- Containers: Look for restart_count > 0, state=waiting with reasons like CrashLoopBackOff, or state=terminated with reason OOMKilled
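The checks above can be folded into a single jq filter. Treat the following as a sketch: flag_problem_pods is a hypothetical helper, and the field names (name, phase, containers, restart_count, reason) are assumptions about the payload shape, so compare them with a real response before relying on it. It expects the already-decoded pod_statuses array on stdin, that is, the output of the jq -r '.result.pod_statuses' step from Step 5.

```shell
# Prints "<pod name><TAB><phase>" for every pod showing a warning sign:
# a Pending or Failed phase, any container restarts, or a container
# reason of CrashLoopBackOff or OOMKilled.
# All field names here are assumed; adjust them to the actual payload.
flag_problem_pods() {
  jq -r '
    .[]
    | select(
        (.phase == "Pending" or .phase == "Failed")
        or any(.containers[]?;
               (.restart_count // 0) > 0
               or .reason == "CrashLoopBackOff"
               or .reason == "OOMKilled")
      )
    | "\(.name)\t\(.phase)"
  '
}

# Usage sketch:
#   curl -s "http://localhost:3000/api/v1/diagnostics/${DIAG_ID}" \
#     -H "Authorization: <admin-pak>" \
#     | jq -r '.result.pod_statuses' | flag_problem_pods
```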
Reading Events
Filter for warnings:
curl -s "http://localhost:3000/api/v1/diagnostics/${DIAG_ID}" \
-H "Authorization: <admin-pak>" \
| jq -r '.result.events' \
| jq '.[] | select(.event_type == "Warning")'
Common warning events: FailedScheduling, Unhealthy, BackOff, FailedMount.
Reading Logs
Log tails are keyed by pod-name/container-name:
curl -s "http://localhost:3000/api/v1/diagnostics/${DIAG_ID}" \
-H "Authorization: <admin-pak>" \
| jq -r '.result.log_tails' \
| jq 'to_entries[] | "\(.key):\n\(.value)\n---"' -r
Each container’s last 100 log lines are included.
Troubleshooting
Diagnostic stays in pending state:
- Check the agent’s heartbeat — it may be disconnected
- Verify the agent is targeting the stack that contains the deployment object
- Check the agent logs for errors
Diagnostic moves to expired:
- The retention period elapsed before the agent could claim it
- Increase retention_minutes and try again
- Check if the agent is running and polling
Diagnostic moves to failed:
- The agent encountered an error collecting data
- Check the agent logs for Kubernetes API errors
- Verify the agent has RBAC permissions to read pods, events, and logs
Cleanup
Diagnostics are automatically cleaned up by the broker’s background task based on broker.diagnostic_cleanup_interval_seconds (default: 15 minutes) and broker.diagnostic_max_age_hours (default: 1 hour).
Related Documentation
- Diagnostics Reference — complete API and data model reference
- Monitoring Deployment Health — continuous health monitoring
- Health Endpoints — health check configuration