Monitoring Computation Graph Health
This guide shows how to use the Cloacina API to monitor the health of running computation graphs, accumulators, and reactors.
Before you begin, you will need:
- The API server running (see Deploying the API Server)
- A valid API key stored in `API_KEY`
- At least one computation graph registered with the server
Cloacina exposes two levels of health checking: unauthenticated liveness/readiness probes, and authenticated endpoints that report per-component status:
| Endpoint | Auth | Purpose |
|---|---|---|
| `GET /health` | None | Liveness — server is up |
| `GET /ready` | None | Readiness — DB reachable and no graphs crashed |
| `GET /v1/health/accumulators` | Required | Per-accumulator status |
| `GET /v1/health/graphs` | Required | Per-reactor status summary |
| `GET /v1/health/graphs/{name}` | Required | Single reactor detail |
The /ready endpoint returns 503 Service Unavailable when any registered computation graph has crashed (its task has exited), making it suitable for Kubernetes readiness probes and load balancer health checks.
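For checks that only need the status code, curl can print it directly; for example:
# Prints 200 when ready, 503 when a registered graph has crashed
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8080/ready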
curl -s http://localhost:8080/v1/health/accumulators \
-H "Authorization: Bearer $API_KEY" | jq
Response:
{
"accumulators": [
{
"name": "orderbook",
"status": "live"
},
{
"name": "pricing",
"status": "live"
},
{
"name": "exchange_rate_poller",
"status": "live"
}
]
}
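To surface only the accumulators that need attention, the response can be filtered with jq; a small sketch assuming the response shape shown above:
curl -s http://localhost:8080/v1/health/accumulators \
  -H "Authorization: Bearer $API_KEY" |
  jq -r '.accumulators[] | select(.status != "live") | "\(.name): \(.status)"'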
| State | Meaning |
|---|---|
| `starting` | Loading checkpoint from DAL — normal at startup |
| `connecting` | Checkpoint loaded, establishing broker connection (stream accumulators) |
| `live` | Connected and processing events normally |
| `disconnected` | Lost broker connection, retrying — data may be stale |
| `socket_only` | No active source (passthrough accumulator) — healthy by definition |
A disconnected accumulator continues to accept socket pushes but is not receiving from its broker topic. The reactor that depends on it will enter degraded state.
curl -s http://localhost:8080/v1/health/graphs \
-H "Authorization: Bearer $API_KEY" | jq
Response:
{
"graphs": [
{
"name": "market_pipeline",
"health": {
"state": "live"
},
"accumulators": ["orderbook", "pricing"],
"paused": false
},
{
"name": "rate_monitor",
"health": {
"state": "warming",
"healthy": ["exchange_rate_poller"],
"waiting": ["fx_stream"]
},
"accumulators": ["exchange_rate_poller", "fx_stream"],
"paused": false
}
]
}
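For a one-line-per-reactor summary, the same response can be reduced with jq (assuming the shape above):
curl -s http://localhost:8080/v1/health/graphs \
  -H "Authorization: Bearer $API_KEY" |
  jq -r '.graphs[] | "\(.name): \(.health.state)"'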
| State | Meaning |
|---|---|
| `starting` | Loading cache from DAL, spawning accumulators |
| `warming` | Some accumulators healthy, waiting for the rest — includes lists of healthy and waiting names |
| `live` | All accumulators healthy, evaluating reaction criteria |
| `degraded` | Was live, one or more accumulators disconnected — includes list of disconnected names |
The `paused` field indicates whether the reactor is accepting boundaries but skipping graph execution (useful for maintenance windows).
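To list reactors that are currently paused, e.g. when closing out a maintenance window, filtering on that field works against the same endpoint:
curl -s http://localhost:8080/v1/health/graphs \
  -H "Authorization: Bearer $API_KEY" |
  jq -r '.graphs[] | select(.paused) | .name'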
curl -s http://localhost:8080/v1/health/graphs/market_pipeline \
-H "Authorization: Bearer $API_KEY" | jq
Response when healthy:
{
"name": "market_pipeline",
"health": {
"state": "live"
},
"accumulators": ["orderbook", "pricing"],
"paused": false
}
Response when degraded:
{
"name": "market_pipeline",
"health": {
"state": "degraded",
"disconnected": ["orderbook"]
},
"accumulators": ["orderbook", "pricing"],
"paused": false
}
Returns 404 Not Found if the reactor name does not exist.
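When scripting against this endpoint, it can be simpler to branch on the status code than to parse the body:
# Prints 200 for a registered reactor, 404 for an unknown name
curl -s -o /dev/null -w "%{http_code}\n" \
  -H "Authorization: Bearer $API_KEY" \
  http://localhost:8080/v1/health/graphs/market_pipeline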
The /ready endpoint checks both database connectivity and computation graph status:
curl -s http://localhost:8080/ready | jq
Healthy:
{"status": "ready"}
Graph crashed:
{
"status": "not ready",
"reason": "crashed computation graphs",
"crashed_graphs": ["market_pipeline"]
}
Database unreachable:
{
"status": "not ready",
"reason": "database unreachable"
}
Use /ready as your Kubernetes readiness probe:
readinessProbe:
httpGet:
path: /ready
port: 8080
initialDelaySeconds: 5
periodSeconds: 10
failureThreshold: 3
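Because /health carries no dependency checks, it pairs naturally with a liveness probe; a sketch mirroring the settings above:
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
  failureThreshold: 3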
A graph enters the "crashed" state when its tokio task exits. This happens if the reactor's `run()` future returns, which normally only occurs after a shutdown signal. An unexpected crash flips the readiness check and removes the pod from the load balancer until it restarts and the graph re-registers.
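To see which graphs tripped the readiness check, the crashed_graphs field from the /ready response shown above can be extracted directly:
# Prints one crashed graph name per line, nothing when healthy
curl -s http://localhost:8080/ready | jq -r '.crashed_graphs // [] | .[]'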
Poll all reactors and alert on non-live states:
#!/usr/bin/env bash
set -euo pipefail
BASE_URL="${CLOACINA_URL:-http://localhost:8080}"
API_KEY="${API_KEY:?API_KEY must be set}"
# Fetch all reactors and print an alert line for any that are not live
curl -sf "${BASE_URL}/v1/health/graphs" \
  -H "Authorization: Bearer ${API_KEY}" |
  jq -r '.graphs[] | select(.health.state != "live") |
    "ALERT: reactor \(.name) is \(.health.state) — disconnected: \(.health.disconnected // "none")"'
Save as `check-graphs.sh`, make it executable, and run it from a cron job or monitoring system.
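One way to schedule it, assuming a hypothetical install path of /usr/local/bin/check-graphs.sh and that the key is safe to inline in the crontab:
# Run every 5 minutes; tag any alert output for syslog
*/5 * * * * API_KEY=<your-key> /usr/local/bin/check-graphs.sh 2>&1 | logger -t cloacina-health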
A degraded reactor is still running. It continues to evaluate reaction criteria and fire the graph using the last known (cached) value from the disconnected accumulator.
Steps to investigate:
- Identify the disconnected accumulator from the `disconnected` list in the reactor health.
- Check accumulator health: `GET /v1/health/accumulators` — look for `disconnected` status.
- Verify the broker is reachable and the topic still exists.
- Check server logs for the accumulator's reconnection attempts.
A degraded reactor recovers automatically when the accumulator reconnects and returns to live status. No manual intervention is required unless the broker is permanently gone.
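To watch a recovery in progress, poll the single-reactor endpoint until the state flips back to live:
# Re-query the reactor's health every 5 seconds
watch -n 5 "curl -s http://localhost:8080/v1/health/graphs/market_pipeline \
  -H 'Authorization: Bearer $API_KEY' | jq .health"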