Monitoring Deployment Health

Brokkr agents continuously monitor the health of deployed Kubernetes resources and report status back to the broker. This provides centralized visibility into deployment health across all clusters without requiring direct cluster access. This guide covers configuring health monitoring, interpreting health status, and troubleshooting common issues.

How Health Monitoring Works

When an agent applies deployment objects to a Kubernetes cluster, it tracks those resources and periodically checks their health. The agent examines pod status, container states, and Kubernetes conditions to determine overall health. This information is reported to the broker, where it can be viewed through the API or UI.

Health monitoring runs as a background process on each agent. On each check interval, the agent queries the Kubernetes API for pods associated with each deployment object, analyzes their status, and sends a consolidated health report to the broker.

Health Status Values

The health monitoring system reports one of four status values:

Status	Description
`healthy`	All pods are ready and running without issues
`degraded`	Some pods have issues but the deployment is partially functional
`failing`	The deployment has failed or all pods are in error states
`unknown`	Health cannot be determined (no pods found or API errors)
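
As a rough illustration of how the four values relate to pod counts and detected conditions, the Python sketch below classifies a single health summary. The function name `classify_health` and the exact rules are assumptions inferred from the table above; the agent's real logic may differ.

```python
from typing import List


def classify_health(pods_ready: int, pods_total: int, conditions: List[str]) -> str:
    """Map pod counts and detected conditions to a status value.

    Hypothetical rules inferred from the status table; not the agent's
    actual implementation.
    """
    if pods_total == 0:
        return "unknown"   # no pods found
    if pods_ready == 0:
        return "failing"   # no pod is ready
    if pods_ready < pods_total or conditions:
        return "degraded"  # some pods have issues, others still ready
    return "healthy"       # every pod ready, no conditions detected
```

Under these assumed rules, three ready pods out of three with no conditions is `healthy`, while two out of three with `ImagePullBackOff` is `degraded`.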

Detected Conditions

The agent detects and reports these problematic conditions:

Container Issues:

ImagePullBackOff - Unable to pull container image
ErrImagePull - Error pulling container image
CrashLoopBackOff - Container repeatedly crashing
CreateContainerConfigError - Invalid container configuration
InvalidImageName - Malformed image reference
RunContainerError - Error starting container
ContainerCannotRun - Container failed to run

Resource Issues:

OOMKilled - Container killed due to memory limits
Error - Container exited with error

Pod Issues:

PodFailed - Pod entered failed phase
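
When consuming health reports in your own tooling, it can help to group reported conditions by the categories listed above. The following is a minimal sketch; the names `CONDITION_CATEGORIES` and `group_conditions` are hypothetical helpers and not part of Brokkr.

```python
from typing import Dict, List

# Condition names grouped as in the lists above.
CONDITION_CATEGORIES: Dict[str, str] = {
    "ImagePullBackOff": "container",
    "ErrImagePull": "container",
    "CrashLoopBackOff": "container",
    "CreateContainerConfigError": "container",
    "InvalidImageName": "container",
    "RunContainerError": "container",
    "ContainerCannotRun": "container",
    "OOMKilled": "resource",
    "Error": "resource",
    "PodFailed": "pod",
}


def group_conditions(conditions: List[str]) -> Dict[str, List[str]]:
    """Group condition names by category; unrecognized names go to 'other'."""
    grouped: Dict[str, List[str]] = {}
    for condition in conditions:
        category = CONDITION_CATEGORIES.get(condition, "other")
        grouped.setdefault(category, []).append(condition)
    return grouped
```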

Configuring Health Monitoring

Enabling Health Monitoring

Health monitoring is enabled by default. Configure it through the agent's Helm values:

# Helm values for agent
agent:
  config:
    deploymentHealthEnabled: true
    deploymentHealthInterval: 60

Or set environment variables directly:

BROKKR__AGENT__DEPLOYMENT_HEALTH_ENABLED=true
BROKKR__AGENT__DEPLOYMENT_HEALTH_INTERVAL=60
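
The variable names above suggest a `PREFIX__SECTION__KEY` layout with a double underscore between nesting levels. The Python sketch below shows how such names would map onto nested settings. The helper `parse_brokkr_env` is hypothetical and the separator rule is inferred only from the two examples, so treat it as an illustration and not as the agent's actual loader.

```python
from typing import Dict, Mapping


def parse_brokkr_env(env: Mapping[str, str], prefix: str = "BROKKR") -> Dict[str, Dict[str, str]]:
    """Group PREFIX__SECTION__KEY variables into {section: {key: value}}.

    Assumes a double underscore separates nesting levels (inferred from
    the variable names shown in this guide).
    """
    config: Dict[str, Dict[str, str]] = {}
    for name, value in env.items():
        parts = name.split("__")
        if len(parts) != 3 or parts[0] != prefix:
            continue  # not a two-level Brokkr setting
        _, section, key = parts
        config.setdefault(section.lower(), {})[key.lower()] = value
    return config
```

Calling `parse_brokkr_env(os.environ)` with the two variables set would yield an `agent` section containing `deployment_health_enabled` and `deployment_health_interval`.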

Adjusting Check Interval

The check interval determines how frequently the agent evaluates deployment health. The default is 60 seconds, which balances responsiveness with API load.

For environments where rapid detection is critical:

agent:
  config:
    deploymentHealthInterval: 30  # Check every 30 seconds

For large clusters with many deployments, increase the interval to reduce API load:

agent:
  config:
    deploymentHealthInterval: 120  # Check every 2 minutes
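
To get a feel for the trade-off, the following Python sketch estimates the pod-list query rate for one agent. It assumes one query per deployment object per check, as described under How Health Monitoring Works; the function name and the figure of 200 deployment objects are illustrative only.

```python
def pod_queries_per_minute(deployment_objects: int, interval_seconds: int) -> float:
    """Estimate pod-list queries per minute for one agent.

    Assumes exactly one query per deployment object per check interval.
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return deployment_objects * 60 / interval_seconds


# Compare the three intervals used in this guide for 200 deployment objects.
for interval in (30, 60, 120):
    print(interval, pod_queries_per_minute(200, interval))
```

Under that assumption, 200 deployment objects produce about 400 queries per minute at a 30-second interval, 200 at 60 seconds, and 100 at 120 seconds.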

Disabling Health Monitoring

To disable health monitoring entirely:

agent:
  config:
    deploymentHealthEnabled: false

Note that with health monitoring disabled, the agent sends no health reports, so the broker has no visibility into the health of that agent's deployments.

Viewing Health Status

Via API

Query health status for a specific deployment object:

curl "http://broker:3000/api/v1/deployment-objects/{id}/health" \
  -H "Authorization: Bearer $ADMIN_PAK"

Response:

{
  "deployment_object_id": "a1b2c3d4-...",
  "overall_status": "healthy",
  "health_records": [
    {
      "agent_id": "e5f6g7h8-...",
      "status": "healthy",
      "summary": {
        "pods_ready": 3,
        "pods_total": 3,
        "conditions": []
      },
      "checked_at": "2025-01-02T10:00:00Z"
    }
  ]
}
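
The same request can be made from a script. The following Python sketch uses only the standard library; the helper names `build_health_request` and `fetch_health` are hypothetical, and the base URL and token are whatever you would pass to the curl command above.

```python
import json
import urllib.request
from typing import Any, Dict


def build_health_request(base_url: str, object_id: str, token: str) -> urllib.request.Request:
    """Build the GET request for a deployment object's health endpoint."""
    url = f"{base_url.rstrip('/')}/api/v1/deployment-objects/{object_id}/health"
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})


def fetch_health(base_url: str, object_id: str, token: str, timeout: float = 10.0) -> Dict[str, Any]:
    """Send the request and decode the JSON response body."""
    request = build_health_request(base_url, object_id, token)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

For example, `fetch_health("http://broker:3000", object_id, admin_pak)["overall_status"]` would return the aggregate status string, assuming the response has the shape shown above.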

Understanding the Summary

The health summary provides details about pod status:

{
  "pods_ready": 2,
  "pods_total": 3,
  "conditions": ["ImagePullBackOff"],
  "resources": [
    {
      "kind": "Pod",
      "name": "my-app-abc123",
      "namespace": "production",
      "ready": false,
      "message": "Back-off pulling image \"myapp:invalid\""
    }
  ]
}

Field	Description
`pods_ready`	Number of pods in Ready state
`pods_total`	Total number of pods found
`conditions`	List of detected problematic conditions
`resources`	Per-resource details (optional)
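
A summary like the one above can be condensed into a few readable lines, for example in a status script. This is a minimal Python sketch; `describe_summary` is a hypothetical helper that relies only on the fields documented in the table and treats `resources` as optional.

```python
from typing import Any, Dict, List


def describe_summary(summary: Dict[str, Any]) -> List[str]:
    """Render a health summary as short human-readable lines."""
    lines = [f"{summary['pods_ready']}/{summary['pods_total']} pods ready"]
    conditions = summary.get("conditions", [])
    if conditions:
        lines.append("conditions: " + ", ".join(conditions))
    # "resources" is optional, so fall back to an empty list.
    for resource in summary.get("resources", []):
        if not resource.get("ready", False):
            location = f"{resource['namespace']}/{resource['name']}"
            message = resource.get("message", "not ready")
            lines.append(f"{resource['kind']} {location}: {message}")
    return lines
```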

Common Scenarios

ImagePullBackOff

When the agent reports ImagePullBackOff:

Verify the image name and tag are correct
Check that the image exists in the registry
Verify the cluster has network access to the registry
Check image pull secrets are configured correctly

# Check pod events for details
kubectl describe pod <pod-name> -n <namespace>

# Check image pull secrets
kubectl get secrets -n <namespace>

CrashLoopBackOff

When containers repeatedly crash:

Check container logs for error messages:

kubectl logs <pod-name> -n <namespace> --previous

Verify the application configuration is correct
Check resource limits aren’t too restrictive
Ensure required environment variables and secrets are present

OOMKilled

When containers are killed for exceeding their memory limit:

Increase memory limits:

resources:
  limits:
    memory: "512Mi"  # Increase as needed

Investigate application memory usage
Consider memory profiling to identify leaks

Unknown Status

When status shows as unknown:

Verify pods exist for the deployment object
Check the agent has RBAC permissions to list pods

Check agent logs for API errors:

kubectl logs -l app=brokkr-agent -c agent

Multi-Agent Deployments

When a deployment object is targeted to multiple agents, each agent reports its own health status. The broker stores health per agent, reflecting that the same deployment may have different health on different clusters.

The health endpoint always returns records from all reporting agents in the `health_records` array, along with an `overall_status` that reflects the aggregate state:

curl "http://broker:3000/api/v1/deployment-objects/{id}/health" \
  -H "Authorization: Bearer $ADMIN_PAK"
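
Because each record carries its own `agent_id` and `status`, it is straightforward to find which agents are reporting problems. The Python sketch below assumes the response shape shown under Viewing Health Status; `agents_needing_attention` is a hypothetical helper.

```python
from typing import Any, Dict


def agents_needing_attention(response: Dict[str, Any]) -> Dict[str, str]:
    """Return {agent_id: status} for every record that is not healthy."""
    return {
        record["agent_id"]: record["status"]
        for record in response.get("health_records", [])
        if record["status"] != "healthy"
    }
```

An empty result means every reporting agent considers the deployment object healthy.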

Webhook Integration

Configure webhooks to receive notifications when deployment health changes:

curl -X POST "http://broker:3000/api/v1/webhooks" \
  -H "Authorization: Bearer $ADMIN_PAK" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Health Alerts",
    "url": "https://alerts.example.com/webhook",
    "event_types": ["deployment.failed"]
  }'

The `deployment.failed` event fires when a deployment transitions to `failing` status.
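
On the receiving side, a small HTTP endpoint is enough to start inspecting events. The following Python sketch uses only the standard library and logs whatever JSON object arrives. It assumes the broker delivers events as JSON in a POST body and makes no assumption about the payload's field names, which this guide does not specify; `decode_event`, `WebhookHandler`, and `serve` are hypothetical names.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Any, Dict


def decode_event(body: bytes) -> Dict[str, Any]:
    """Decode a webhook body, assuming it is a UTF-8 JSON object."""
    event = json.loads(body.decode("utf-8"))
    if not isinstance(event, dict):
        raise ValueError("expected a JSON object")
    return event


class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self) -> None:
        length = int(self.headers.get("Content-Length", 0))
        try:
            event = decode_event(self.rfile.read(length))
        except ValueError:
            # Covers invalid UTF-8, invalid JSON, and non-object bodies.
            self.send_response(400)
            self.end_headers()
            return
        print("received event:", json.dumps(event, sort_keys=True))
        self.send_response(204)  # acknowledge with an empty 2xx response
        self.end_headers()


def serve(port: int = 8080) -> None:
    """Listen on all interfaces until interrupted."""
    HTTPServer(("", port), WebhookHandler).serve_forever()
```

Calling `serve()` starts the listener; in production you would place it behind TLS and verify that requests come from the broker.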

Troubleshooting

Health Not Updating

If health status isn’t updating:

Check the agent is running and connected:

```
kubectl get pods -l app=brokkr-agent
```

Verify health monitoring is enabled:

kubectl get configmap brokkr-agent-config -o yaml

Check agent logs for health check errors:

kubectl logs -l app=brokkr-agent -c agent | grep -i health

Incorrect Health Status

If reported health doesn’t match actual pod status:

Verify pods have the correct deployment object ID label
Check the health check interval - the reported status can lag the cluster by roughly one interval
Confirm the agent has permission to list pods across namespaces
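
To tell a stale report from a genuinely wrong one, compare the record's `checked_at` timestamp against the configured interval. The Python sketch below flags a record as stale when it is older than a multiple of the interval; the helper `is_stale` and the default grace factor of 2 are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_stale(
    checked_at: str,
    interval_seconds: int,
    now: Optional[datetime] = None,
    grace: float = 2.0,
) -> bool:
    """Return True if the record is older than grace * interval_seconds."""
    # Accept the trailing "Z" used in the API response on any Python 3.7+.
    checked = datetime.fromisoformat(checked_at.replace("Z", "+00:00"))
    current = now or datetime.now(timezone.utc)
    return current - checked > timedelta(seconds=interval_seconds * grace)
```

With a 60-second interval, a record checked 90 seconds ago is still considered fresh, while one checked three minutes ago is flagged as stale.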

High API Load

If health monitoring causes excessive Kubernetes API load:

Increase the check interval
Consider reducing the number of deployment objects per agent
Monitor agent metrics for API call rates

Related Documentation

Configuration Reference - Agent configuration options
Architecture - How agents monitor health
Webhooks - Alert on health changes
