Multi-Tenant Recovery
This guide covers how recovery works in multi-tenant Cloacina deployments, how to handle common failure scenarios, and how to monitor tenant health.
Cloacina has recovery enabled by default. Each tenant’s recovery operates independently, so a failure in one tenant does not affect others.
// First runner creates schema and runs migrations
let runner1 = DefaultRunner::with_schema(db_url, "tenant_acme").await?;
// ... work gets interrupted ...
runner1.shutdown().await?;
// Second runner automatically recovers any interrupted work
let runner2 = DefaultRunner::with_schema(db_url, "tenant_acme").await?;
// - Schema already exists (not recreated)
// - Migrations already applied (not re-run)
// - Orphaned tasks automatically detected and recovered
// - Failed tasks retried based on retry configuration
Key points:
- Always on: Recovery is enabled by default for all executors
- Per-tenant isolation: Each tenant’s recovery is independent
- Automatic: No manual intervention needed
- Stateful: Schemas and data persist across restarts
When a runner crashes mid-execution, tasks that were in progress become orphaned. On restart, the recovery service detects these and re-queues them.
Step 1: Restart the runner for the affected tenant:
let runner = DefaultRunner::with_schema(db_url, "tenant_acme").await?;
Step 2: The runner automatically scans for orphaned tasks on startup. Tasks that were in a Running state with no active executor are identified as orphaned and rescheduled. No additional code is needed.
Step 3: Verify recovery by checking execution status:
let stats = runner
.get_cron_execution_stats(chrono::Utc::now() - chrono::Duration::try_minutes(30).unwrap())
.await?;
info!("Recovered executions: {}", stats.lost_executions);
info!("Successful executions: {}", stats.successful_executions);
If a process hosts runners for multiple tenants and crashes, each tenant must be recovered independently:
let tenants = vec!["tenant_acme", "tenant_globex", "tenant_initech"];
for tenant_id in &tenants {
info!("Recovering tenant: {}", tenant_id);
// Each call triggers independent recovery for that tenant's schema
let runner = DefaultRunner::with_schema(db_url, tenant_id).await?;
// Optionally check for recovered work
let stats = runner
.get_cron_execution_stats(chrono::Utc::now() - chrono::Duration::try_hours(1).unwrap())
.await?;
info!(
"Tenant {}: {} total, {} successful, {} lost",
tenant_id, stats.total_executions, stats.successful_executions, stats.lost_executions
);
}
When a PostgreSQL primary fails over to a replica, existing connection pools become invalid. To recover:
Step 1: Shut down all tenant runners gracefully (if possible):
for (tenant_id, runner) in tenant_runners.drain() {
if let Err(e) = runner.shutdown().await {
warn!("Failed to shut down runner for {}: {}", tenant_id, e);
}
}
Step 2: Re-create runners pointing to the new primary:
let new_db_url = "postgresql://cloacina:cloacina@new-primary:5432/cloacina";
for tenant_id in &tenants {
let runner = DefaultRunner::with_schema(new_db_url, tenant_id).await?;
tenant_runners.insert(tenant_id.to_string(), runner);
}
The new runners will pick up any orphaned tasks from the database and resume execution.
With PostgreSQL, all tenants share a single database but operate in separate schemas. Recovery characteristics:
- Shared infrastructure: A single database failover affects all tenants simultaneously
- Schema scoping: Recovery queries are automatically scoped to the tenant’s schema via
SET search_path - Connection pool reset: After a failover, all connection pools must be re-established
- Concurrent recovery: Multiple tenant runners can recover in parallel since schemas are independent
// PostgreSQL: all tenants share the same database URL, differ by schema
let runner = DefaultRunner::with_schema(
"postgresql://cloacina:cloacina@localhost:5432/cloacina",
"tenant_acme"
).await?;
With SQLite, each tenant has a completely separate database file. Recovery characteristics:
- Physical isolation: A corrupt database file affects only one tenant
- Independent recovery: Each tenant’s file can be backed up and restored independently
- No failover complexity: No connection pool invalidation since each file is self-contained
- WAL mode recommended: Use WAL journal mode for better crash recovery
// SQLite: each tenant gets its own database file
let runner = DefaultRunner::new(
"sqlite://./data/tenant_acme.db?mode=rwc&_journal_mode=WAL&_synchronous=NORMAL"
).await?;
For SQLite tenants, if a database file becomes corrupt, you can restore from a backup without affecting other tenants:
# Stop the runner for the affected tenant
# Replace the corrupt file with a backup
cp backups/tenant_acme.db data/tenant_acme.db
# Restart the runner -- recovery will handle any orphaned tasks
Run periodic checks across all tenants to detect problems early:
async fn check_tenant_health(
db_url: &str,
tenant_ids: &[&str],
) -> Result<(), Box<dyn std::error::Error>> {
for tenant_id in tenant_ids {
let runner = DefaultRunner::with_schema(db_url, tenant_id).await?;
let stats = runner
.get_cron_execution_stats(
chrono::Utc::now() - chrono::Duration::try_minutes(10).unwrap()
)
.await?;
if stats.lost_executions > 0 {
warn!(
"Tenant {} has {} lost executions in the last 10 minutes",
tenant_id, stats.lost_executions
);
}
if stats.success_rate < 95.0 {
warn!(
"Tenant {} success rate is {:.1}% (below 95% threshold)",
tenant_id, stats.success_rate
);
}
runner.shutdown().await?;
}
Ok(())
}
Use structured logging to make tenant-specific issues easy to filter:
use tracing::info_span;
for tenant_id in &tenants {
let span = info_span!("tenant", id = tenant_id);
let _guard = span.enter();
let runner = DefaultRunner::with_schema(db_url, tenant_id).await?;
// All log output within this scope includes tenant.id
}
When migrating an existing single-tenant deployment to multi-tenant:
// Existing single-tenant application uses public schema
let legacy_runner = DefaultRunner::new(db_url).await?;
// New tenant uses isolated schema
let tenant_runner = DefaultRunner::with_schema(db_url, "tenant_001").await?;
// Both can run side-by-side during migration
// Existing data remains in public schema
// New tenant data is isolated in tenant_001 schema
Key points:
- Existing data remains in the
publicschema - Each new tenant gets their own schema
- No data migration required – schemas are independent
- Applications can be migrated gradually
- Both single-tenant and multi-tenant can coexist
// Development: Quick setup for testing
let dev_tenant = DefaultRunner::with_schema(
"postgresql://localhost/dev_db",
"test_tenant"
).await?;
// Production: Full configuration
let prod_tenant = DefaultRunner::builder()
.database_url(&production_url)
.schema(&tenant_id)
.enable_recovery(true)
.max_concurrent_tasks(10)
.build()
.await?;
- Use only alphanumeric characters and underscores
- Examples:
tenant_001,acme_corp,customer_xyz - Avoid special characters, spaces, or hyphens
See the multi-tenant example for a working demonstration of these concepts.