# Reliability & SLA
Automated backups, disaster recovery, and infrastructure-as-code ensure your Canton nodes stay operational.
## Multi-Environment Strategy

Canton NaaS provides three environments, built from the same infrastructure templates, for a staged deployment pipeline:
```mermaid
graph LR
    Dev[Devnet<br/>Development] -->|promote| Test[Testnet<br/>Pre-Production]
    Test -->|promote| Main[Mainnet<br/>Production]
```
| Environment | Purpose | Characteristics |
|---|---|---|
| Devnet | Development and experimentation | Fast iteration, frequent resets possible |
| Testnet | Pre-production validation | Mirrors production configuration exactly |
| Mainnet | Production workloads | Full security, monitoring, and backup coverage |
All three environments run on identical infrastructure templates. Configuration changes are validated in Devnet and Testnet before reaching Mainnet, greatly reducing the risk of environment-specific surprises.
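Whether the templates are realized with Kustomize, Helm, or plain manifests, the pattern is the same: a shared base plus a thin per-environment layer. As an illustrative sketch (paths and file names are hypothetical), a Kustomize overlay for Testnet might look like:

```yaml
# overlays/testnet/kustomization.yaml (hypothetical layout)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base                   # the shared infrastructure template
patches:
  - path: resources-patch.yaml   # environment-specific sizing only
```

Because every environment consumes the same `base`, a change validated in Devnet and Testnet is byte-for-byte the change that reaches Mainnet.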
## Automated Backups

### PostgreSQL Database Backups
Canton node state is stored in dedicated PostgreSQL databases. These are backed up automatically:
- Frequency — every 4–10 hours (configurable per environment)
- Retention — 5–10 days of backup history
- Storage — Google Cloud Storage with regional redundancy
- Method — scheduled CronJob with `pg_dump` for consistent snapshots
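The schedule and naming below are placeholders, and the upload to GCS is omitted for brevity; this is a minimal sketch of what such a backup CronJob can look like:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: validator-db-backup      # hypothetical name
spec:
  schedule: "0 */6 * * *"        # every 6h, within the 4–10h window
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: pg-dump
              image: postgres:15
              command: ["/bin/sh", "-c"]
              # -Fc produces a compressed custom-format dump from a
              # single consistent snapshot of the database
              args:
                - pg_dump "$DATABASE_URL" -Fc -f /backup/validator-$(date +%Y%m%dT%H%M).dump
              env:
                - name: DATABASE_URL
                  valueFrom:
                    secretKeyRef:
                      name: validator-db-credentials  # hypothetical secret
                      key: url
              volumeMounts:
                - name: backup
                  mountPath: /backup
          volumes:
            - name: backup
              emptyDir: {}       # a follow-up step would copy this to GCS
```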
### Keycloak Identity Backups
User accounts, authentication realms, and OAuth client configurations are backed up separately:
- Dedicated CronJob for identity data export
- Stored in GCS alongside database backups
- Includes — realm settings, users, roles, client configurations
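Keycloak ships a built-in `kc.sh export` command that writes realms, users, roles, and clients as JSON. A hedged sketch of a dedicated export CronJob (names and schedule are hypothetical; the `KC_DB_*` connection variables the server uses are omitted):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: keycloak-realm-export    # hypothetical name
spec:
  schedule: "0 2 * * *"          # nightly; the real schedule is configurable
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: export
              image: quay.io/keycloak/keycloak:24.0
              # the image's entrypoint is kc.sh; "realm_file" bundles each
              # realm's settings and users into a single JSON file
              args: ["export", "--dir", "/export", "--users", "realm_file"]
              volumeMounts:
                - name: export
                  mountPath: /export
          volumes:
            - name: export
              emptyDir: {}       # a follow-up step would copy this to GCS
```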
### Backup Storage
| Data | Schedule | Retention | Storage |
|---|---|---|---|
| PostgreSQL (Validator) | Every 4–10h | 5–10 days | GCS bucket |
| PostgreSQL (Participant) | Every 4–10h | 5–10 days | GCS bucket |
| Keycloak Identities | Scheduled | Configurable | GCS bucket |
## Disaster Recovery

### Database Restore
Documented restore procedures allow recovery from any backup point:
1. Identify the target backup timestamp
2. Execute the restore job (pre-built Kubernetes Job template)
3. Verify data integrity and node connectivity
4. Resume normal operations
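A restore job of this kind can be sketched as a two-stage Kubernetes Job: an init container fetches the chosen dump from GCS, then `pg_restore` replays it. Bucket, object, and secret names below are placeholders:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: validator-db-restore     # hypothetical name
spec:
  template:
    spec:
      restartPolicy: Never
      initContainers:
        - name: fetch
          image: google/cloud-sdk:slim
          # pull the backup selected in step 1 (placeholder bucket/object)
          command: ["gsutil", "cp",
                    "gs://example-backups/validator-20240101T0000.dump",
                    "/restore/db.dump"]
          volumeMounts:
            - name: restore
              mountPath: /restore
      containers:
        - name: pg-restore
          image: postgres:15
          command: ["/bin/sh", "-c"]
          # --clean drops existing objects before recreating them from the dump
          args: ['pg_restore --clean --if-exists -d "$DATABASE_URL" /restore/db.dump']
          env:
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: validator-db-credentials  # hypothetical secret
                  key: url
          volumeMounts:
            - name: restore
              mountPath: /restore
      volumes:
        - name: restore
          emptyDir: {}
```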
### Infrastructure Recovery
The entire platform is defined as code in Git:
- Full environment rebuild — the complete Kubernetes configuration can be reapplied from the Git repository
- ArgoCD reconciliation — once the cluster is available, ArgoCD automatically restores all applications to their declared state
- No manual configuration — nothing exists only in the cluster; everything is version-controlled
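In practice a rebuild usually bootstraps from a single "root" ArgoCD Application that points at the Git repository; ArgoCD then recreates everything else declared there. A minimal sketch, with a placeholder repository URL and path:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: root
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/canton-infra.git  # placeholder
    targetRevision: main
    path: environments/mainnet    # hypothetical path
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated: {}   # start reconciling as soon as the cluster is available
```

Applying this one manifest to a fresh cluster is the only manual step; everything downstream is restored from version control.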
### Regional Redundancy
- GKE clusters run in European GCP regions
- GCS backups are stored with regional redundancy
- GCP KMS keys are regional resources with Google-managed durability
- DNS is managed through Cloudflare’s global anycast network
## GitOps Reliability

The GitOps model provides several reliability guarantees:

### Self-Healing
ArgoCD continuously compares the cluster state to the Git repository. If a resource is accidentally modified or deleted, ArgoCD automatically restores it:
```mermaid
graph LR
    Git[Git Repository<br/>Desired State] -->|compare| ArgoCD[ArgoCD]
    ArgoCD -->|sync| Cluster[Kubernetes<br/>Actual State]
    Cluster -->|drift detected| ArgoCD
    ArgoCD -->|auto-heal| Cluster
```
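This behavior is opt-in per application via ArgoCD's sync policy; the two flags below are what enable the loop shown above:

```yaml
# excerpt of an ArgoCD Application spec
syncPolicy:
  automated:
    prune: true     # delete cluster resources that were removed from Git
    selfHeal: true  # revert manual edits back to the Git-declared state
  retry:
    limit: 5        # retry failed syncs before flagging the app as degraded
```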
### Instant Rollback
Rolling back a change is as simple as reverting a Git commit. ArgoCD picks up the revert and restores the previous configuration automatically — no manual Kubernetes operations needed.
### Audit Trail
Every infrastructure change is recorded in Git history:
- Who made the change (Git author)
- What was changed (Git diff)
- When it was changed (Git timestamp)
- Why it was changed (commit message and PR description)
## Managed Infrastructure

### Google Kubernetes Engine (GKE)
The underlying Kubernetes platform is managed by Google Cloud:
- Automatic node upgrades — GKE applies security patches and Kubernetes version updates
- Node auto-repair — unhealthy nodes are automatically replaced
- Cluster autoscaling — compute resources scale with demand
- Regional clusters — control plane redundancy across availability zones
### Certificate Management
TLS certificates are managed automatically:
- cert-manager monitors certificate expiry dates
- Renewal happens automatically before expiry (typically 30 days in advance)
- Zero-downtime rotation — new certificates are applied without service interruption
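With cert-manager, the renewal window is declared on the `Certificate` resource itself. A sketch with placeholder names and hostname:

```yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: canton-node-tls          # hypothetical name
spec:
  secretName: canton-node-tls    # TLS secret consumed by the workload
  duration: 2160h                # 90-day certificate lifetime
  renewBefore: 720h              # renew 30 days before expiry
  dnsNames:
    - validator.example.com      # placeholder hostname
  issuerRef:
    name: letsencrypt-prod       # hypothetical issuer
    kind: ClusterIssuer
```

cert-manager replaces the contents of `secretName` in place on renewal, which is what makes the rotation zero-downtime for consumers of the secret.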
### Secret Synchronization
External Secrets Operator runs continuously:
- Polling interval — secrets are checked for updates regularly
- Automatic propagation — when a secret changes in GCP Secret Manager, the Kubernetes secret is updated
- No restarts needed — pods pick up updated secrets based on their mount configuration
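The polling interval and the GCP Secret Manager mapping are declared on each `ExternalSecret` resource. A minimal sketch with placeholder names:

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: validator-app-secret     # hypothetical name
spec:
  refreshInterval: 1h            # how often ESO re-checks the backing store
  secretStoreRef:
    name: gcp-secret-manager     # hypothetical ClusterSecretStore
    kind: ClusterSecretStore
  target:
    name: validator-app-secret   # Kubernetes Secret kept in sync
  data:
    - secretKey: api-key
      remoteRef:
        key: validator-api-key   # secret name in GCP Secret Manager
```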
## Availability Summary
| Component | Availability Mechanism |
|---|---|
| Canton Nodes | Kubernetes pod restart policy, health checks, ArgoCD self-healing |
| Database | Persistent volumes (premium SSD), automated backups, restore procedures |
| Authentication | Keycloak with identity backups, OIDC standard failover |
| Certificates | Auto-renewal via cert-manager, 30-day advance renewal |
| Secrets | GCP Secret Manager SLA, automatic sync via ESO |
| Monitoring | Multi-tier Prometheus (agent + central), Grafana high availability |
| Infrastructure | GKE managed control plane, node auto-repair, regional redundancy |