Performance Characteristics
Baseline performance numbers for the embedded computation graph pipeline. The benchmark exercises the full pipeline from event injection to graph execution completion, using a minimal single-accumulator graph to isolate the framework overhead from the application logic.
injector
│
▼ (socket channel, capacity=64)
accumulator socket task
│
▼ (merge channel, capacity=1024)
accumulator processor task
│
▼ (boundary channel, capacity=256)
reactor
│
▼
graph_fn(cache snapshot)
│
▼
GraphResult
The benchmark measures wall-clock time from the moment socket_tx.send(bytes) returns to the moment graph_fn completes. This includes all channel hops, task wakeups, cache updates, dirty flag evaluation, and the graph execution itself.
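The measurement window described above can be sketched with standard-library threads and bounded channels. This is a rough illustration only: it assumes `std::sync::mpsc::sync_channel` and OS threads as stand-ins for the async channels and tokio tasks, it leaves out the merge channel, and `run_pipeline` is a hypothetical helper rather than code from cg-bench.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;
use std::time::{Duration, Instant};

/// Pushes `events` values through a two-channel passthrough pipeline and
/// returns one latency per event, measured from the moment `send` returns
/// to the moment the reactor stand-in finishes handling that event.
fn run_pipeline(events: u64) -> Vec<Duration> {
    // Capacities mirror the documented socket and boundary channel sizes.
    let (socket_tx, socket_rx) = sync_channel::<u64>(64);
    let (boundary_tx, boundary_rx) = sync_channel::<u64>(256);

    // Stand-in for the accumulator: forwards every event unchanged.
    let processor = thread::spawn(move || {
        for value in socket_rx {
            if boundary_tx.send(value).is_err() {
                break;
            }
        }
    });

    // Stand-in for the reactor: records a completion timestamp per event.
    let reactor = thread::spawn(move || {
        let mut done_at = Vec::new();
        for value in boundary_rx {
            let _result = value; // placeholder for graph_fn(snapshot)
            done_at.push(Instant::now());
        }
        done_at
    });

    // Injector: the clock starts when `send` has returned.
    let mut sent_at = Vec::new();
    for i in 0..events {
        socket_tx.send(i).expect("processor thread is alive");
        sent_at.push(Instant::now());
    }
    drop(socket_tx); // closing the first channel lets both threads exit

    processor.join().expect("processor thread panicked");
    let done_at = reactor.join().expect("reactor thread panicked");

    // Saturate to zero if an event completed before its send timestamp was taken.
    sent_at
        .iter()
        .zip(done_at.iter())
        .map(|(sent, done)| done.saturating_duration_since(*sent))
        .collect()
}
```

Unlike the real reactor with the Latest strategy, this stand-in handles every event, so it returns exactly one latency per injected event.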
- Graph: single accumulator (source) → process → output (minimal overhead)
- Binary: cg-bench at examples/performance/computation-graph/
- Strategy: when_any reaction criteria, Latest input strategy
- Channels: socket=64, boundary=256, merge=1024
| Metric | Debug Build | Release Build |
|---|---|---|
| p50 | ~0 us | 2,745 us |
| p95 | 7,638 us | 9,196 us |
| p99 | 9,076 us | 10,480 us |
| mean | 1,874 us | 3,355 us |
Latency was measured at a 1 ms injection interval over 10 seconds (~10,000 events pushed, ~7,600-7,900 graph fires).
| Metric | Debug Build | Release Build |
|---|---|---|
| Max sustained | ~733 events/sec | ~763 events/sec |
Maximum sustained throughput was measured by ramping the injection interval from 500 us down to 10 us until TrySendError::Full is detected.
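The detection step can be illustrated with a bounded channel from the standard library. This is a minimal sketch, assuming `std::sync::mpsc::sync_channel` as a stand-in for the pipeline's async channel; `sends_before_full` is a hypothetical helper, not part of cg-bench, and the rate ramp itself is left out.

```rust
use std::sync::mpsc::{sync_channel, TrySendError};

/// Pushes up to `attempts` events into a bounded channel that nobody drains
/// and returns how many were accepted before the channel reported `Full`.
fn sends_before_full(capacity: usize, attempts: usize) -> usize {
    // `_rx` stays alive for the whole function, so the channel never disconnects.
    let (tx, _rx) = sync_channel::<u64>(capacity);
    for i in 0..attempts {
        match tx.try_send(i as u64) {
            Ok(()) => {}
            // Backpressure: the buffer is full and the consumer has not caught up.
            Err(TrySendError::Full(_)) => return i,
            // Not expected here because the receiver is still in scope.
            Err(TrySendError::Disconnected(_)) => return i,
        }
    }
    attempts
}
```

With the documented socket capacity of 64 and no consumer, the 65th `try_send` is the first one to fail. In the benchmark the consumer is running, so `Full` only appears once the injection rate outpaces the pipeline.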
These numbers measure sustained throughput from a live Kafka broker through the stream accumulator into graph execution. The soak ran for 5 minutes to surface any backpressure or offset-commit saturation.
| Accumulator type | Sustained throughput |
|---|---|
| Stream (latest value) | ~70 events/sec |
| Batch (flush after graph) | ~45 graph firings/sec |
Stream throughput is lower than the passthrough baseline because each message involves a Kafka recv() call (network round trip) plus an offset commit() call after graph execution. The batch figure is lower still, but it counts graph firings rather than events: batch size determines firing frequency, so smaller batches fire more often, while larger batches fire less often but process more events per fire. These numbers reflect default Kafka consumer configuration; acks, fetch.wait.max.ms, and consumer group partition count all affect real-world throughput.
- Apple M3 (macOS)
- Rust 1.85+
- tokio 1.x multi-threaded runtime
The benchmark’s most counterintuitive result is that debug and release builds have nearly identical throughput (~733 vs ~763 events/sec). This is because the bottleneck is async channel hops and tokio task scheduling, not computation or serialization.
Each event traverses two mpsc channels before reaching the reactor (this simplified view omits the internal merge channel shown in the pipeline diagram above):
injector
│ try_send() — if full, backpressure detected
▼
[socket channel, cap=64]
│ recv() + deserialize
▼
accumulator processor
│ process() — user code, ~negligible
│ serialize + send()
▼
[boundary channel, cap=256]
│ recv()
▼
reactor
│ cache.update() + dirty.set() + criteria check
│ graph_fn(snapshot).await
▼
GraphResult
Each channel hop involves a tokio task wakeup: a sleeping task is woken, scheduled onto a tokio thread, and begins executing. On the Apple M3, each wakeup cycle takes roughly 3-4 ms under load. With two hops, the latency floor is approximately 6-8 ms, which is in the same range as the observed p95/p99 numbers (roughly 7.6-10.5 ms).
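The arithmetic behind that estimate is small enough to write down. The helper below is a hypothetical back-of-envelope model that assumes every hop pays exactly one wakeup and nothing else; the 3 ms and 4 ms inputs are the rough per-wakeup figures quoted above, not new measurements.

```rust
use std::time::Duration;

/// Scheduling-only latency estimate: each channel hop costs one task wakeup.
fn scheduling_floor(hops: u32, wakeup: Duration) -> Duration {
    wakeup * hops
}
```

Calling `scheduling_floor(2, Duration::from_millis(3))` and `scheduling_floor(2, Duration::from_millis(4))` gives the two ends of the 6-8 ms range.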
Rust’s release optimization eliminates dead code and speeds up serialization, but it cannot eliminate tokio task scheduling overhead. The computation itself (the process() call and the graph function) is genuinely negligible compared to the scheduling cost. This means:
- Adding more complex user logic in process() or graph nodes will not significantly affect throughput until it exceeds the scheduling cost
- Profile-switching from debug to release will not dramatically change latency for typical workloads
- The latency floor is set by the number of channel hops, not by the amount of work done in each hop
Increasing channel buffer sizes increases latency (more queuing delay) without improving throughput. The current sizes (socket=64, boundary=256) are tuned for the latency/throughput tradeoff. A larger socket channel means more events can queue up before backpressure, which allows bursts but also allows a slow reactor to fall further behind before the injector notices.
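A rough way to see the queuing-delay cost: an event that lands at the back of a full buffer has to wait for everything ahead of it to drain. The sketch below is a back-of-envelope model, not something the benchmark measures; `full_queue_delay` is a hypothetical helper, and the 750 events/sec drain rate is an assumed value picked to be close to the measured maximum throughput.

```rust
use std::time::Duration;

/// Time for a consumer draining at `drain_rate_per_sec` to work through a
/// queue holding `capacity` events. Returns `None` for a rate that is not
/// a positive, finite number.
fn full_queue_delay(capacity: usize, drain_rate_per_sec: f64) -> Option<Duration> {
    if !drain_rate_per_sec.is_finite() || drain_rate_per_sec <= 0.0 {
        return None;
    }
    Some(Duration::from_secs_f64(capacity as f64 / drain_rate_per_sec))
}
```

Under that assumed drain rate, a full 64-slot socket channel adds about 85 ms of waiting and a full 256-slot channel about 341 ms, which is why larger buffers trade burst tolerance for worst-case latency.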
The merge_channel_capacity (1024 by default) is larger because it is the internal merge point for both the socket task and the event source task. It needs headroom to avoid deadlocking when both paths produce events simultaneously.
The benchmark uses the Latest input strategy: if 10 events arrive while the reactor is executing a graph, only the 10th value is in the cache when the next execution starts. This is why only ~7,600-7,900 graph fires are observed for ~10,000 events pushed at 1 ms intervals: events that arrive while the reactor is busy collapse into a single execution. This is the correct behavior for reactive workloads. If every event must produce exactly one graph execution, use the Sequential strategy (at the cost of higher per-event latency as the queue builds up).
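The collapsing behavior can be modeled with a single overwrite-on-write slot. This is a simplified sketch under stated assumptions: `LatestSlot` and `run_latest` are hypothetical names, the real cache and dirty-flag logic are not shown in this document, and "the reactor is busy for N arrivals" is modeled by replaying events in fixed-size chunks.

```rust
/// Hypothetical single-slot cache: each update overwrites the previous value.
struct LatestSlot<T> {
    value: Option<T>,
    dirty: bool,
}

impl<T: Clone> LatestSlot<T> {
    fn new() -> Self {
        LatestSlot { value: None, dirty: false }
    }

    /// Stores the newest value and marks the slot as changed.
    fn update(&mut self, v: T) {
        self.value = Some(v);
        self.dirty = true;
    }

    /// Returns the current value if anything changed since the last call.
    fn take_snapshot(&mut self) -> Option<T> {
        if !self.dirty {
            return None;
        }
        self.dirty = false;
        self.value.clone()
    }
}

/// Replays `events`, letting the reactor fire once per `arrivals_per_fire`
/// arrivals, and returns the value seen by each graph execution.
fn run_latest(events: &[u32], arrivals_per_fire: usize) -> Vec<u32> {
    let mut slot = LatestSlot::new();
    let mut fired = Vec::new();
    // `max(1)` avoids the panic that `chunks(0)` would cause.
    for chunk in events.chunks(arrivals_per_fire.max(1)) {
        for &event in chunk {
            slot.update(event);
        }
        if let Some(snapshot) = slot.take_snapshot() {
            fired.push(snapshot);
        }
    }
    fired
}
```

Ten events arriving during one busy period produce a single execution that sees the 10th value, while `arrivals_per_fire = 1` behaves like one execution per event.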
```bash
# Full benchmark (default: 15s latency + 10s throughput)
angreal performance computation-graph-bench

# Quick run
angreal performance computation-graph-bench --latency-duration 5 --throughput-duration 3

# Release build for production-representative numbers
cd examples/performance/computation-graph
cargo run --release --bin cg-bench -- --latency-duration 15 --throughput-duration 10
```
The cg-bench binary is at examples/performance/computation-graph/src/main.rs. It creates an in-process graph with a single passthrough accumulator, injects events at a configurable rate, and records timestamps at injection and graph completion to compute latency histograms.
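Turning recorded latencies into the p50/p95/p99 and mean figures reported above could look like the sketch below. It assumes the nearest-rank percentile definition, which may differ from what cg-bench actually does; `percentile` and `mean` are hypothetical helpers.

```rust
use std::time::Duration;

/// Nearest-rank percentile: the smallest sample such that at least `p`
/// percent of all samples are less than or equal to it.
/// Sorts `samples` in place; returns `None` for empty input or `p > 100`.
fn percentile(samples: &mut [Duration], p: u32) -> Option<Duration> {
    if samples.is_empty() || p > 100 {
        return None;
    }
    samples.sort_unstable();
    // Integer ceiling of p * n / 100, clamped so p = 0 picks the minimum.
    let rank = ((p as usize * samples.len() + 99) / 100).max(1);
    Some(samples[rank - 1])
}

/// Arithmetic mean of the samples, or `None` for empty input.
fn mean(samples: &[Duration]) -> Option<Duration> {
    if samples.is_empty() {
        return None;
    }
    let total: Duration = samples.iter().sum();
    Some(total / samples.len() as u32)
}
```

For 100 samples of 1 us through 100 us, this yields a p50 of 50 us, a p95 of 95 us, a p99 of 99 us, and a mean of 50.5 us.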
- Architecture — the reactor loop and input strategy semantics
- Accumulator Design — how channel sizes and accumulator types affect throughput