Monitoring Computation Graph Health

This guide shows how to use the Cloacina API to monitor the health of running computation graphs, accumulators, and reactors.

Prerequisites

  • API server running (see Deploying the API Server)
  • A valid API key stored in API_KEY (see the example after this list)
  • At least one computation graph registered with the server
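
For example, export the key into your shell before running the commands in this guide (the value is a placeholder):

export API_KEY="your-api-key"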

Health vs readiness

Cloacina exposes two levels of health checking:

| Endpoint | Auth | Purpose |
| --- | --- | --- |
| GET /health | None | Liveness — server is up |
| GET /ready | None | Readiness — DB reachable and no graphs crashed |
| GET /v1/health/accumulators | Required | Per-accumulator status |
| GET /v1/health/graphs | Required | Per-reactor status summary |
| GET /v1/health/graphs/{name} | Required | Single reactor detail |

The /ready endpoint returns 503 Service Unavailable when any registered computation graph has crashed (its task has exited), making it suitable for Kubernetes readiness probes and load balancer health checks.
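
To exercise this from a shell, checking only the status code is often enough (a sketch; -w prints the HTTP status after a silent request):

curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8080/ready
# expect 200 when ready, 503 when a registered graph has crashed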


Listing accumulator health

curl -s http://localhost:8080/v1/health/accumulators \
  -H "Authorization: Bearer $API_KEY" | jq

Response:

{
  "accumulators": [
    {
      "name": "orderbook",
      "status": "live"
    },
    {
      "name": "pricing",
      "status": "live"
    },
    {
      "name": "exchange_rate_poller",
      "status": "live"
    }
  ]
}

Accumulator health states

| State | Meaning |
| --- | --- |
| starting | Loading checkpoint from DAL — normal at startup |
| connecting | Checkpoint loaded, establishing broker connection (stream accumulators) |
| live | Connected and processing events normally |
| disconnected | Lost broker connection, retrying — data may be stale |
| socket_only | No active source (passthrough accumulator) — healthy by definition |

A disconnected accumulator continues to accept socket pushes but is not receiving from its broker topic. The reactor that depends on it will enter degraded state.
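
To spot unhealthy accumulators at a glance, you can filter the same response with jq (a sketch; socket_only is excluded because it is healthy by definition, per the table above):

curl -s http://localhost:8080/v1/health/accumulators \
  -H "Authorization: Bearer $API_KEY" |
  jq -r '.accumulators[] | select(.status != "live" and .status != "socket_only") | "\(.name): \(.status)"'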


Listing reactor health

curl -s http://localhost:8080/v1/health/graphs \
  -H "Authorization: Bearer $API_KEY" | jq

Response:

{
  "graphs": [
    {
      "name": "market_pipeline",
      "health": {
        "state": "live"
      },
      "accumulators": ["orderbook", "pricing"],
      "paused": false
    },
    {
      "name": "rate_monitor",
      "health": {
        "state": "warming",
        "healthy": ["exchange_rate_poller"],
        "waiting": ["fx_stream"]
      },
      "accumulators": ["exchange_rate_poller", "fx_stream"],
      "paused": false
    }
  ]
}

Reactor health states

| State | Meaning |
| --- | --- |
| starting | Loading cache from DAL, spawning accumulators |
| warming | Some accumulators healthy, waiting for the rest — includes lists of healthy and waiting names |
| live | All accumulators healthy, evaluating reaction criteria |
| degraded | Was live, one or more accumulators disconnected — includes list of disconnected names |

The paused field indicates whether the reactor is accepting boundaries but skipping graph execution (useful for maintenance windows).
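
To list reactors that are currently paused, for example before closing a maintenance window, a jq filter over the same endpoint works (a sketch):

curl -s http://localhost:8080/v1/health/graphs \
  -H "Authorization: Bearer $API_KEY" |
  jq -r '.graphs[] | select(.paused) | .name'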


Getting detail for a specific reactor

curl -s http://localhost:8080/v1/health/graphs/market_pipeline \
  -H "Authorization: Bearer $API_KEY" | jq

Response when healthy:

{
  "name": "market_pipeline",
  "health": {
    "state": "live"
  },
  "accumulators": ["orderbook", "pricing"],
  "paused": false
}

Response when degraded:

{
  "name": "market_pipeline",
  "health": {
    "state": "degraded",
    "disconnected": ["orderbook"]
  },
  "accumulators": ["orderbook", "pricing"],
  "paused": false
}

Returns 404 Not Found if the reactor name does not exist.
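
In scripts it is easier to branch on the status code than to parse an error body (a sketch; market_pipeline is the example reactor from above):

status=$(curl -s -o /dev/null -w "%{http_code}" \
  -H "Authorization: Bearer $API_KEY" \
  http://localhost:8080/v1/health/graphs/market_pipeline)
if [ "$status" = "404" ]; then
  echo "reactor not registered"
fi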


Readiness integration

The /ready endpoint checks both database connectivity and computation graph status:

curl -s http://localhost:8080/ready | jq

Healthy:

{"status": "ready"}

Graph crashed:

{
  "status": "not ready",
  "reason": "crashed computation graphs",
  "crashed_graphs": ["market_pipeline"]
}

Database unreachable:

{
  "status": "not ready",
  "reason": "database unreachable"
}

Use /ready as your Kubernetes readiness probe:

readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
  failureThreshold: 3

A graph enters the “crashed” state when its tokio task exits. This happens if the reactor’s run() future returns, which normally only occurs after a shutdown signal. An unexpected crash will flip the readiness check and remove the pod from the load balancer until it restarts and the graph re-registers.
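
Because /ready names the crashed graphs, an external check can extract them directly (a sketch; the []? suppresses output when the field is absent):

curl -s http://localhost:8080/ready | jq -r '.crashed_graphs[]?'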


Monitoring script

Poll all reactors and alert on non-live states:

#!/usr/bin/env bash
set -euo pipefail

BASE_URL="${CLOACINA_URL:-http://localhost:8080}"
API_KEY="${API_KEY:?API_KEY must be set}"

graphs=$(curl -sf "${BASE_URL}/v1/health/graphs" \
  -H "Authorization: Bearer ${API_KEY}" | jq -c '.graphs[]')

echo "$graphs" | jq -r 'select(.health.state != "live") |
  "ALERT: reactor \(.name) is \(.health.state) — disconnected: \(.health.disconnected // "none")"'

Save as check-graphs.sh, make it executable, and run from a cron job or monitoring system.
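
For example, a crontab entry that runs the check every five minutes might look like this (a sketch; the key value and script path are placeholders for your own):

*/5 * * * * API_KEY=your-api-key /usr/local/bin/check-graphs.sh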


What to do when a reactor is degraded

A degraded reactor is still running. It continues to evaluate reaction criteria and fire the graph using the last known (cached) value from the disconnected accumulator.

Steps to investigate:

  1. Identify the disconnected accumulator from the disconnected list in the reactor health.
  2. Check accumulator health: GET /v1/health/accumulators — look for disconnected status.
  3. Verify the broker is reachable and the topic still exists.
  4. Check server logs for the accumulator reconnection attempts.

A degraded reactor recovers automatically when the accumulator reconnects and returns to live status. No manual intervention is required unless the broker is permanently gone.
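
If you want to watch a degraded reactor until it recovers, a small poll loop against the detail endpoint is enough (a sketch; market_pipeline is the example reactor from above):

until [ "$(curl -s -H "Authorization: Bearer $API_KEY" \
  http://localhost:8080/v1/health/graphs/market_pipeline | jq -r '.health.state')" = "live" ]; do
  sleep 10
done
echo "reactor is live again"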