Reliability & SLA

Automated backups, disaster recovery, and infrastructure-as-code ensure your Canton nodes stay operational.

Table of contents
  1. Multi-Environment Strategy
  2. Automated Backups
    1. PostgreSQL Database Backups
    2. Keycloak Identity Backups
    3. Backup Storage
  3. Disaster Recovery
    1. Database Restore
    2. Infrastructure Recovery
    3. Regional Redundancy
  4. GitOps Reliability
    1. Self-Healing
    2. Instant Rollback
    3. Audit Trail
  5. Managed Infrastructure
    1. Google Kubernetes Engine (GKE)
    2. Certificate Management
    3. Secret Synchronization
  6. Availability Summary

Multi-Environment Strategy

Canton NaaS provides three identical environments for a staged deployment pipeline:

```mermaid
graph LR
    Dev[Devnet<br/>Development] -->|promote| Test[Testnet<br/>Pre-Production]
    Test -->|promote| Main[Mainnet<br/>Production]
```
| Environment | Purpose | Characteristics |
|---|---|---|
| Devnet | Development and experimentation | Fast iteration, frequent resets possible |
| Testnet | Pre-production validation | Mirrors production configuration exactly |
| Mainnet | Production workloads | Full security, monitoring, and backup coverage |

All three environments run on identical infrastructure templates. Configuration changes are validated in Devnet and Testnet before reaching Mainnet, eliminating environment-specific surprises.


Automated Backups

PostgreSQL Database Backups

Canton node state is stored in dedicated PostgreSQL databases. These are backed up automatically:

  • Frequency — every 4–10 hours (configurable per environment)
  • Retention — 5–10 days of backup history
  • Storage — Google Cloud Storage with regional redundancy
  • Method — a scheduled Kubernetes CronJob running pg_dump, which produces a transactionally consistent snapshot
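The backup job described above boils down to a pg_dump pipeline uploaded to GCS. The sketch below assembles such a command; the database name, bucket, and object layout are illustrative placeholders, not the platform's actual resource names:

```python
from datetime import datetime, timezone

def build_backup_command(db_name: str, bucket: str, now: datetime) -> str:
    """Build the pg_dump -> gzip -> gsutil pipeline a backup CronJob might run.

    The bucket and object path are hypothetical examples, not the
    platform's real naming scheme.
    """
    stamp = now.strftime("%Y%m%dT%H%M%SZ")
    object_path = f"gs://{bucket}/postgres/{db_name}/{db_name}-{stamp}.sql.gz"
    # Plain SQL piped through gzip keeps the restore path simple
    # (gunzip | psql); pg_dump's custom format would also work.
    return (
        f"pg_dump --no-owner --dbname={db_name} "
        f"| gzip "
        f"| gsutil cp - {object_path}"
    )

cmd = build_backup_command(
    "validator", "canton-backups",
    datetime(2026, 1, 5, 4, 0, tzinfo=timezone.utc),
)
print(cmd)
```

Timestamping each object means any backup in the retention window can be addressed directly during a restore.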

Keycloak Identity Backups

User accounts, authentication realms, and OAuth client configurations are backed up separately:

  • Dedicated CronJob for identity data export
  • Stored in GCS alongside database backups
  • Includes — realm settings, users, roles, client configurations
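A Keycloak export of this kind is typically driven by the kc.sh CLI of the Quarkus distribution. The sketch below assembles such an invocation; the realm name, export directory, and exact flags are assumptions to verify against your Keycloak version:

```python
def build_keycloak_export(realm: str, export_dir: str) -> list[str]:
    """Assemble a kc.sh export command for one realm.

    kc.sh export is Keycloak's offline export tool (Quarkus distribution);
    the binary path and flags shown are typical but should be checked
    against the deployed version.
    """
    return [
        "/opt/keycloak/bin/kc.sh", "export",
        "--dir", export_dir,
        "--realm", realm,
        "--users", "realm_file",  # bundle users into the realm export file
    ]

print(" ".join(build_keycloak_export("canton", "/backup/keycloak")))
```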

Backup Storage

| Data | Schedule | Retention | Storage |
|---|---|---|---|
| PostgreSQL (Validator) | Every 4–10 h | 5–10 days | GCS bucket |
| PostgreSQL (Participant) | Every 4–10 h | 5–10 days | GCS bucket |
| Keycloak Identities | Scheduled | Configurable | GCS bucket |

Disaster Recovery

Database Restore

Documented restore procedures allow recovery from any backup point:

  1. Identify the target backup timestamp
  2. Execute the restore job (pre-built Kubernetes Job template)
  3. Verify data integrity and node connectivity
  4. Resume normal operations
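Step 1 above, identifying the target backup, is a selection over the timestamped backup history. A minimal sketch, assuming backups are addressed by their creation time:

```python
from datetime import datetime

def pick_restore_point(backups: list[datetime], target: datetime) -> datetime:
    """Return the most recent backup taken at or before `target`.

    A restore Job would then fetch the matching object from GCS and feed
    it to psql; the in-memory list here stands in for a bucket listing.
    """
    candidates = [b for b in backups if b <= target]
    if not candidates:
        raise ValueError("no backup exists at or before the requested time")
    return max(candidates)

# Three days of backups, three per day (hypothetical 8-hour schedule).
backups = [datetime(2026, 1, d, h) for d in (1, 2, 3) for h in (0, 8, 16)]
print(pick_restore_point(backups, datetime(2026, 1, 2, 12)))  # → 2026-01-02 08:00:00
```

Choosing the backup at or before the target time bounds data loss to one backup interval, which is why the schedule frequency directly determines the recovery point objective.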

Infrastructure Recovery

The entire platform is defined as code in Git:

  • Full environment rebuild — the complete Kubernetes configuration can be reapplied from the Git repository
  • ArgoCD reconciliation — once the cluster is available, ArgoCD automatically restores all applications to their declared state
  • No manual configuration — nothing exists only in the cluster; everything is version-controlled

Regional Redundancy

  • GKE clusters run in European GCP regions
  • GCS backups are stored with regional redundancy
  • GCP KMS keys are regional resources with Google-managed durability
  • DNS is managed through Cloudflare’s global anycast network

GitOps Reliability

The GitOps model provides several reliability guarantees:

Self-Healing

ArgoCD continuously compares the cluster state to the Git repository. If a resource is accidentally modified or deleted, ArgoCD automatically restores it:

```mermaid
graph LR
    Git[Git Repository<br/>Desired State] -->|compare| ArgoCD[ArgoCD]
    ArgoCD -->|sync| Cluster[Kubernetes<br/>Actual State]
    Cluster -->|drift detected| ArgoCD
    ArgoCD -->|auto-heal| Cluster
```
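The compare-and-sync loop can be sketched in a few lines. Real ArgoCD diffs full Kubernetes manifests; plain dicts of resource name to spec stand in for that here:

```python
def reconcile(desired: dict, actual: dict) -> dict:
    """One ArgoCD-style sync pass: make `actual` match `desired`."""
    patched = dict(actual)
    for name, spec in desired.items():
        if patched.get(name) != spec:          # drift or deletion detected
            patched[name] = spec               # auto-heal: restore declared state
    for name in set(patched) - set(desired):   # resource absent from Git
        del patched[name]                      # prune it
    return patched

desired = {"validator": {"replicas": 1}, "participant": {"replicas": 1}}
actual = {"validator": {"replicas": 0}, "orphan": {"replicas": 2}}
print(reconcile(desired, actual))  # drift healed, orphan pruned
```

Because the loop is idempotent, running it continuously is safe: a converged cluster passes through unchanged, and any manual edit is reverted on the next pass.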

Instant Rollback

Rolling back a change is as simple as reverting a Git commit. ArgoCD picks up the revert and restores the previous configuration automatically — no manual Kubernetes operations needed.

Audit Trail

Every infrastructure change is recorded in Git history:

  • Who made the change (Git author)
  • What was changed (Git diff)
  • When it was changed (Git timestamp)
  • Why it was changed (commit message and PR description)

Managed Infrastructure

Google Kubernetes Engine (GKE)

The underlying Kubernetes platform is managed by Google Cloud:

  • Automatic node upgrades — GKE applies security patches and Kubernetes version updates
  • Node auto-repair — unhealthy nodes are automatically replaced
  • Cluster autoscaling — compute resources scale with demand
  • Regional clusters — control plane redundancy across availability zones

Certificate Management

TLS certificates are managed automatically:

  • cert-manager monitors certificate expiry dates
  • Renewal happens automatically before expiry (typically 30 days in advance)
  • Zero-downtime rotation — new certificates are applied without service interruption
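The renewal decision above is a simple window check: renew once the current time falls within the advance-renewal window before expiry. A minimal sketch, assuming the 30-day window stated above:

```python
from datetime import datetime, timedelta

RENEW_BEFORE = timedelta(days=30)  # matches the 30-day advance renewal above

def needs_renewal(not_after: datetime, now: datetime) -> bool:
    """cert-manager-style check: renew once inside the renewal window."""
    return now >= not_after - RENEW_BEFORE

expiry = datetime(2026, 3, 1)
print(needs_renewal(expiry, datetime(2026, 1, 15)))  # False: more than 30 days left
print(needs_renewal(expiry, datetime(2026, 2, 10)))  # True: inside the window
```

Renewing well before expiry leaves time to retry failed issuances while the old certificate is still valid, which is what makes the rotation zero-downtime.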

Secret Synchronization

External Secrets Operator runs continuously:

  • Polling interval — secrets are checked against their upstream source at a configured refresh interval
  • Automatic propagation — when a secret changes in GCP Secret Manager, the Kubernetes secret is updated
  • No restarts needed — pods pick up updated secrets based on their mount configuration
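Each refresh pass follows the same pattern: compare the upstream version with the cluster copy and propagate only when they differ. A minimal sketch with illustrative field names (real External Secrets Operator objects carry considerably more metadata):

```python
def sync_secret(upstream: dict, cluster: dict) -> tuple[dict, bool]:
    """One refresh pass: copy the upstream value into the cluster secret
    if the versions differ. Returns (secret, changed)."""
    if cluster.get("version") == upstream["version"]:
        return cluster, False                    # up to date, nothing to do
    synced = {"version": upstream["version"], "data": upstream["data"]}
    return synced, True                          # propagated the new version

upstream = {"version": "3", "data": "s3cr3t-v3"}
cluster = {"version": "2", "data": "s3cr3t-v2"}
cluster, changed = sync_secret(upstream, cluster)
print(changed, cluster["version"])  # True 3
```

Version comparison makes the sync cheap when nothing changed, so a short polling interval costs little while keeping rotation latency low.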

Availability Summary

| Component | Availability Mechanism |
|---|---|
| Canton Nodes | Kubernetes pod restart policy, health checks, ArgoCD self-healing |
| Database | Persistent volumes (premium SSD), automated backups, restore procedures |
| Authentication | Keycloak with identity backups, standard OIDC failover |
| Certificates | Auto-renewal via cert-manager, 30-day advance renewal |
| Secrets | GCP Secret Manager SLA, automatic sync via ESO |
| Monitoring | Multi-tier Prometheus (agent + central), Grafana high availability |
| Infrastructure | GKE managed control plane, node auto-repair, regional redundancy |
