# Reliability & SLA
Automated backups, disaster recovery, and infrastructure-as-code ensure your Canton nodes stay operational.
## Multi-Environment Strategy

Canton NaaS provides three environments, built from the same infrastructure templates, for a staged deployment pipeline:
```mermaid
graph LR
    Dev[Devnet<br/>Development] -->|promote| Test[Testnet<br/>Pre-Production]
    Test -->|promote| Main[Mainnet<br/>Production]
```
| Environment | Purpose | Characteristics |
|---|---|---|
| Devnet | Development and experimentation | Fast iteration, frequent resets possible |
| Testnet | Pre-production validation | Mirrors production configuration exactly |
| Mainnet | Production workloads | Full security, monitoring, and backup coverage |
All three environments run on identical infrastructure templates. Configuration changes are validated in Devnet and Testnet before reaching Mainnet, greatly reducing the risk of environment-specific surprises.
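Whether the templates are realized with Kustomize, Helm, or plain manifests, the pattern is the same: a shared base plus a thin per-environment layer. As an illustrative sketch (paths and file names are hypothetical), a Kustomize overlay for Testnet might look like:

```yaml
# overlays/testnet/kustomization.yaml (hypothetical layout)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base                   # the shared infrastructure template
patches:
  - path: resources-patch.yaml   # environment-specific sizing only
```

Because every environment consumes the same `base`, a change validated in Devnet and Testnet is byte-for-byte the change that reaches Mainnet.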
## Automated Backups

### PostgreSQL Database Backups
Canton node state is stored in dedicated PostgreSQL databases. These are backed up automatically:
- Frequency — every 4–10 hours (configurable per environment)
- Retention — 5–10 days of backup history
- Storage — Google Cloud Storage with regional redundancy
- Method — scheduled CronJob with `pg_dump` for consistent snapshots
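The schedule and naming below are placeholders, and the upload to GCS is omitted for brevity; this is a minimal sketch of what such a backup CronJob can look like:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: validator-db-backup      # hypothetical name
spec:
  schedule: "0 */6 * * *"        # every 6h, within the 4–10h window
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: pg-dump
              image: postgres:15
              command: ["/bin/sh", "-c"]
              # -Fc produces a compressed custom-format dump from a
              # single consistent snapshot of the database
              args:
                - pg_dump "$DATABASE_URL" -Fc -f /backup/validator-$(date +%Y%m%dT%H%M).dump
              env:
                - name: DATABASE_URL
                  valueFrom:
                    secretKeyRef:
                      name: validator-db-credentials  # hypothetical secret
                      key: url
              volumeMounts:
                - name: backup
                  mountPath: /backup
          volumes:
            - name: backup
              emptyDir: {}       # a follow-up step would copy this to GCS
```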
### Keycloak Identity Backups
User accounts, authentication realms, and OAuth client configurations are backed up separately:
- Dedicated CronJob for identity data export
- Stored in GCS alongside database backups
- Includes — realm settings, users, roles, client configurations
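Keycloak ships a built-in `kc.sh export` command that writes realms, users, roles, and clients as JSON. A hedged sketch of a dedicated export CronJob (names and schedule are hypothetical; the `KC_DB_*` connection variables the server uses are omitted):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: keycloak-realm-export    # hypothetical name
spec:
  schedule: "0 2 * * *"          # nightly; the real schedule is configurable
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: export
              image: quay.io/keycloak/keycloak:24.0
              # the image's entrypoint is kc.sh; "realm_file" bundles each
              # realm's settings and users into a single JSON file
              args: ["export", "--dir", "/export", "--users", "realm_file"]
              volumeMounts:
                - name: export
                  mountPath: /export
          volumes:
            - name: export
              emptyDir: {}       # a follow-up step would copy this to GCS
```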
### Backup Storage
| Data | Schedule | Retention | Storage |
|---|---|---|---|
| PostgreSQL (Validator) | Every 4–10h | 5–10 days | GCS bucket |
| PostgreSQL (Participant) | Every 4–10h | 5–10 days | GCS bucket |
| Keycloak Identities | Scheduled | Configurable | GCS bucket |
## Disaster Recovery

### Database Restore
Documented restore procedures allow recovery from any backup point:
1. Identify the target backup timestamp
2. Execute the restore job (pre-built Kubernetes Job template)
3. Verify data integrity and node connectivity
4. Resume normal operations
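A restore job of this kind can be sketched as a two-stage Kubernetes Job: an init container fetches the chosen dump from GCS, then `pg_restore` replays it. Bucket, object, and secret names below are placeholders:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: validator-db-restore     # hypothetical name
spec:
  template:
    spec:
      restartPolicy: Never
      initContainers:
        - name: fetch
          image: google/cloud-sdk:slim
          # pull the backup selected in step 1 (placeholder bucket/object)
          command: ["gsutil", "cp",
                    "gs://example-backups/validator-20240101T0000.dump",
                    "/restore/db.dump"]
          volumeMounts:
            - name: restore
              mountPath: /restore
      containers:
        - name: pg-restore
          image: postgres:15
          command: ["/bin/sh", "-c"]
          # --clean drops existing objects before recreating them from the dump
          args: ['pg_restore --clean --if-exists -d "$DATABASE_URL" /restore/db.dump']
          env:
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: validator-db-credentials  # hypothetical secret
                  key: url
          volumeMounts:
            - name: restore
              mountPath: /restore
      volumes:
        - name: restore
          emptyDir: {}
```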
### Infrastructure Recovery
The entire platform is defined as code in Git:
- Full environment rebuild — the complete Kubernetes configuration can be reapplied from the Git repository
- ArgoCD reconciliation — once the cluster is available, ArgoCD automatically restores all applications to their declared state
- No manual configuration — nothing exists only in the cluster; everything is version-controlled
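In practice a rebuild usually bootstraps from a single "root" ArgoCD Application that points at the Git repository; ArgoCD then recreates everything else declared there. A minimal sketch, with a placeholder repository URL and path:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: root
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/canton-infra.git  # placeholder
    targetRevision: main
    path: environments/mainnet    # hypothetical path
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated: {}   # start reconciling as soon as the cluster is available
```

Applying this one manifest to a fresh cluster is the only manual step; everything downstream is restored from version control.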
### Regional Redundancy
- GKE clusters run in European GCP regions
- GCS backups are stored with regional redundancy
- GCP KMS keys are regional resources with Google-managed durability
- DNS is managed through Cloudflare’s global anycast network
## GitOps Reliability

The GitOps model provides several reliability guarantees:

### Self-Healing
ArgoCD continuously compares the cluster state to the Git repository. If a resource is accidentally modified or deleted, ArgoCD automatically restores it:
```mermaid
graph LR
    Git[Git Repository<br/>Desired State] -->|compare| ArgoCD[ArgoCD]
    ArgoCD -->|sync| Cluster[Kubernetes<br/>Actual State]
    Cluster -->|drift detected| ArgoCD
    ArgoCD -->|auto-heal| Cluster
```
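This behavior is opt-in per application via ArgoCD's sync policy; the two flags below are what enable the loop shown above:

```yaml
# excerpt of an ArgoCD Application spec
syncPolicy:
  automated:
    prune: true     # delete cluster resources that were removed from Git
    selfHeal: true  # revert manual edits back to the Git-declared state
  retry:
    limit: 5        # retry failed syncs before flagging the app as degraded
```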
### Instant Rollback
Rolling back a change is as simple as reverting a Git commit. ArgoCD picks up the revert and restores the previous configuration automatically — no manual Kubernetes operations needed.
### Audit Trail
Every infrastructure change is recorded in Git history:
- Who made the change (Git author)
- What was changed (Git diff)
- When it was changed (Git timestamp)
- Why it was changed (commit message and PR description)
## Managed Infrastructure

### Google Kubernetes Engine (GKE)
The underlying Kubernetes platform is managed by Google Cloud:
- Automatic node upgrades — GKE applies security patches and Kubernetes version updates
- Node auto-repair — unhealthy nodes are automatically replaced
- Cluster autoscaling — compute resources scale with demand
- Regional clusters — control plane redundancy across availability zones
### Certificate Management
TLS certificates are managed automatically:
- cert-manager monitors certificate expiry dates
- Renewal happens automatically before expiry (typically 30 days in advance)
- Zero-downtime rotation — new certificates are applied without service interruption
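With cert-manager, the renewal window is declared on the `Certificate` resource itself. A sketch with placeholder names and hostname:

```yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: canton-node-tls          # hypothetical name
spec:
  secretName: canton-node-tls    # TLS secret consumed by the workload
  duration: 2160h                # 90-day certificate lifetime
  renewBefore: 720h              # renew 30 days before expiry
  dnsNames:
    - validator.example.com      # placeholder hostname
  issuerRef:
    name: letsencrypt-prod       # hypothetical issuer
    kind: ClusterIssuer
```

cert-manager replaces the contents of `secretName` in place on renewal, which is what makes the rotation zero-downtime for consumers of the secret.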
### Secret Synchronization
External Secrets Operator runs continuously:
- Polling interval — secrets are checked for updates regularly
- Automatic propagation — when a secret changes in GCP Secret Manager, the Kubernetes secret is updated
- No restarts needed — pods pick up updated secrets based on their mount configuration
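The polling interval and the GCP Secret Manager mapping are declared on each `ExternalSecret` resource. A minimal sketch with placeholder names:

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: validator-app-secret     # hypothetical name
spec:
  refreshInterval: 1h            # how often ESO re-checks the backing store
  secretStoreRef:
    name: gcp-secret-manager     # hypothetical ClusterSecretStore
    kind: ClusterSecretStore
  target:
    name: validator-app-secret   # Kubernetes Secret kept in sync
  data:
    - secretKey: api-key
      remoteRef:
        key: validator-api-key   # secret name in GCP Secret Manager
```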
## Availability Summary
| Component | Availability Mechanism |
|---|---|
| Canton Nodes | Kubernetes pod restart policy, health checks, ArgoCD self-healing |
| Database | Persistent volumes (premium SSD), automated backups, restore procedures |
| Authentication | Keycloak with identity backups, OIDC standard failover |
| Certificates | Auto-renewal via cert-manager, 30-day advance renewal |
| Secrets | GCP Secret Manager SLA, automatic sync via ESO |
| Monitoring | Multi-tier Prometheus (agent + central), Grafana high availability |
| Infrastructure | GKE managed control plane, node auto-repair, regional redundancy |