PostgreSQL on Kubernetes: From StatefulSets to Operator-Managed Database Platforms

PostgreSQL has been the backbone of most projects I’ve delivered over the last decade. When someone needs real transactions, a solid SQL interface, and room to grow with extensions, Postgres is the database I reach for. Kubernetes, meanwhile, has become the default way to orchestrate everything else.

In 2024, I helped move a legacy ETL platform onto Azure Kubernetes Service. We started with a plain StatefulSet running a primary and a few replicas. That was enough to prove the concept, but it quickly exposed what wasn’t covered: backups, coordinated failover, replica promotion, and the kind of observability you need when it’s late and production is unhappy.

That is the point where a PostgreSQL operator becomes more than a convenience.

Why an operator matters

Kubernetes lets you declare intent, and controllers reconcile the cluster back toward that intent. A PostgreSQL operator follows the same pattern, but for databases: it adds CRDs and control loops that bundle operational steps like provisioning, scaling, failover, backups, user management, and monitoring into software. Instead of scripts and one-off CronJobs, you manage database resources and let the operator do the repetitive work.

The details vary a lot by operator. Some, like Crunchy Postgres for Kubernetes, run Postgres with StatefulSets. Others, like CloudNativePG, use their own controller model instead of a StatefulSet-based deployment. That difference matters when you debug scheduling, volume ownership, upgrades, or failover behavior.

A practical example using CloudNativePG

Here is a small CloudNativePG cluster:

nvim ~/snippet.yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: warehouse-db
spec:
  instances: 3
  storage:
    size: 1Ti
  postgresql:
    parameters:
      max_connections: '600'
      shared_buffers: '3GB'

Apply that and the operator creates the database pods, PVCs, and services, configures replication, and keeps the read/write service pointing at the current primary. If the primary fails, the operator can promote a replica and update the service that applications use for writes.
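As a rough sketch of how an application might consume what the operator creates, the fragment below points a Deployment at the read/write service and the generated application credentials. It assumes CloudNativePG's default naming (a warehouse-db-rw service and a warehouse-db-app secret with username and password keys) and the default app database. The Deployment name and container image are hypothetical.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: warehouse-etl                 # hypothetical application
spec:
  replicas: 1
  selector:
    matchLabels:
      app: warehouse-etl
  template:
    metadata:
      labels:
        app: warehouse-etl
    spec:
      containers:
        - name: etl
          image: registry.example.com/warehouse-etl:1.0.0   # hypothetical image
          env:
            # Read/write service that follows the current primary
            - name: PGHOST
              value: warehouse-db-rw
            # Assumes the operator's default application database name
            - name: PGDATABASE
              value: app
            # Credentials come from the secret the operator generates
            - name: PGUSER
              valueFrom:
                secretKeyRef:
                  name: warehouse-db-app
                  key: username
            - name: PGPASSWORD
              valueFrom:
                secretKeyRef:
                  name: warehouse-db-app
                  key: password
```

Because the application only knows the service name, a promotion does not require a configuration change on the application side, although existing connections are still dropped and must be retried.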

That does not mean zero interruption. Failover still has RTO and RPO trade-offs. CloudNativePG exposes settings such as failoverDelay and switchoverDelay, and its own documentation is explicit that tuning them can favor faster recovery or lower data-loss risk depending on the workload.
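To make those knobs concrete, here is a minimal sketch of where they sit in the Cluster spec. The values are illustrative assumptions, not recommendations; the right numbers depend on how much data loss and downtime your workload tolerates.

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: warehouse-db
spec:
  instances: 3
  # Seconds to wait after the primary is detected as unhealthy
  # before a failover is triggered (illustrative value)
  failoverDelay: 15
  # Seconds the old primary is given to shut down gracefully
  # during a switchover (illustrative value)
  switchoverDelay: 120
  storage:
    size: 1Ti
  postgresql:
    parameters:
      max_connections: '600'
      shared_buffers: '3GB'
```

A longer failoverDelay avoids promoting on a brief network blip at the cost of a longer outage when the primary is really gone. A shorter switchoverDelay speeds up planned changes but gives the old primary less time to flush and archive WAL.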

Backups are separate resources. In CloudNativePG, a scheduled backup is expressed with a ScheduledBackup CRD:

nvim ~/snippet.yaml
apiVersion: postgresql.cnpg.io/v1
kind: ScheduledBackup
metadata:
  name: warehouse-db-nightly
spec:
  schedule: '0 0 2 * * *'
  backupOwnerReference: self
  cluster:
    name: warehouse-db

One subtle but important detail: CloudNativePG uses a six-field cron expression with seconds first. The schedule above runs at 02:00:00 every day; a Kubernetes CronJob would express the same time with five fields, as 0 2 * * *. Also check the current CloudNativePG backup documentation before copying backup configuration from an old article. Recent docs emphasize plugin-based object-store backups and Kubernetes volume snapshots; older native barmanObjectStore examples still exist, but the recommended path has moved.
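As one hedged illustration of the snapshot-based path, the sketch below pairs a Cluster that names a VolumeSnapshotClass with a ScheduledBackup that uses the volumeSnapshot method. The snapshot class name and the weekly schedule are assumptions for the example, and your CSI driver has to support snapshots for this to work.

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: warehouse-db
spec:
  instances: 3
  storage:
    size: 1Ti
  backup:
    volumeSnapshot:
      className: csi-snapshot-class   # hypothetical VolumeSnapshotClass
---
apiVersion: postgresql.cnpg.io/v1
kind: ScheduledBackup
metadata:
  name: warehouse-db-weekly-snapshot
spec:
  # Six fields, seconds first: 03:30:00 every Sunday.
  # A five-field CronJob schedule for the same time would be '30 3 * * 0'.
  schedule: '0 30 3 * * 0'
  method: volumeSnapshot
  backupOwnerReference: self
  cluster:
    name: warehouse-db
```

Snapshots on their own give you a restore point, not point-in-time recovery. For PITR you still need WAL archiving configured through whichever object-store mechanism your CloudNativePG version recommends.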

Reconciliation in practice

Operators watch custom resources and continually reconcile them. With CloudNativePG, that can include resources such as Cluster, Backup, ScheduledBackup, Pooler, and Database, while roles are managed inside .spec.managed.roles. Other operators use different CRDs and field names.

That distinction is important. The operator pattern is portable, but manifests are not. Change the replica count, adjust connection parameters, swap an image tag, update backup policy, or add a pooler, and the operator drives its managed objects toward that state. The workflow pairs naturally with GitOps because database infrastructure changes can be reviewed and applied like other Kubernetes changes.

Older approaches, built on scripts and one-off CronJobs, predate this model and lean more on imperative tooling, which can be awkward to wrap in a declarative deployment pipeline.

It’s not “just Postgres”

Modern operators often include integrations you would otherwise assemble yourself: connection pooling, backup tooling, HA orchestration, extension handling, monitoring hooks, and declarative database objects. The exact package depends on the operator.

CloudNativePG, for example, has a Pooler resource for PgBouncer:

nvim ~/snippet.yaml
apiVersion: postgresql.cnpg.io/v1
kind: Pooler
metadata:
  # Must differ from the cluster's own app-db-rw service,
  # because the Pooler creates a Service with its own name
  name: app-db-pooler-rw
spec:
  cluster:
    name: app-db
  instances: 2
  type: rw
  pgbouncer:
    poolMode: transaction
    parameters:
      max_client_conn: '2000'
      default_pool_size: '50'

That pooler is an access layer in front of the database, not a magic fix for connection management. Choose transaction or session pooling deliberately. Application behavior that assumes a stable session, such as prepared statements, advisory locks, and session-level state, can break when a pooler changes how sessions are reused.
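If part of the workload does depend on session state, one option is a separate pooler in session mode rather than forcing everything through transaction pooling. The sketch below assumes a read-only reporting workload; the pooler name and the PgBouncer limits are illustrative, not tuned values.

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Pooler
metadata:
  name: app-db-pooler-ro       # hypothetical name, distinct from cluster services
spec:
  cluster:
    name: app-db
  instances: 2
  type: ro                     # route to replicas instead of the primary
  pgbouncer:
    # Session pooling keeps one server connection per client session,
    # so session-level features keep working at the cost of less reuse
    poolMode: session
    parameters:
      max_client_conn: '500'
      default_pool_size: '20'
```

Applications then choose the pooler that matches their behavior by connecting to the corresponding service name.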

Resilience and failover

Operators are designed around failure handling. They maintain standbys, coordinate promotion, archive WAL for point-in-time recovery, and keep backups running. But the exact safety model is not universal.

CloudNativePG supports asynchronous and synchronous streaming replication, publishes separate services for read/write and read-only traffic, and can build cross-cluster disaster recovery with replica clusters. During failover, the write service follows the promoted primary. With asynchronous replication, some data loss is still possible if the old primary accepted writes that never reached the promoted replica. With synchronous replication, you reduce that risk but may trade off write latency or availability.
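A minimal sketch of quorum-based synchronous replication is shown below. It assumes a recent CloudNativePG release that exposes the .spec.postgresql.synchronous stanza; older releases used minSyncReplicas and maxSyncReplicas instead, so check the API reference for your version. The storage size is a placeholder.

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: app-db
spec:
  instances: 3
  storage:
    size: 100Gi
  postgresql:
    synchronous:
      # Quorum-based: any one of the standbys may confirm a commit
      method: any
      # Number of standbys that must confirm before a commit returns
      number: 1
```

With three instances and number set to 1, the cluster can lose one standby and keep accepting writes. How it behaves when no standby is available depends on the durability settings of your operator version, which is exactly the latency and availability trade-off described above.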

On one team we moved from manual failovers, which involved humans, runbooks, and DNS, to automatic promotions in under a minute. That shift turned an incident into something closer to routine operations, but only because we had tested the path and understood what the automation was allowed to sacrifice.

Backups and recovery drills

Backups only matter if you can restore them. Operators can automate base backups, WAL archiving, retention, and restore workflows, but you still need to validate them.

A useful recovery drill should answer practical questions:

  • What is our recovery time objective?
  • What is our recovery point objective?
  • How long does a full restore take at the current database size?
  • How much WAL has to be fetched and replayed?
  • Are Kubernetes Secrets, certificates, and application credentials backed up separately?
  • Can we restore into a clean namespace or a different cluster?

For CloudNativePG, a one-off backup can be requested with a Backup resource:

nvim ~/snippet.yaml
apiVersion: postgresql.cnpg.io/v1
kind: Backup
metadata:
  name: app-db-before-upgrade
spec:
  cluster:
    name: app-db

The next step is to restore into a temporary environment and deliberately test promotion, application connectivity, and read/write behavior. Do not treat a green backup status as a recovery plan.
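For a drill inside the same namespace, a new cluster can be bootstrapped from an existing Backup object. The sketch below assumes the app-db-before-upgrade backup has completed and is still present; the drill cluster name, instance count, and storage size are hypothetical.

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: app-db-restore-drill   # hypothetical throwaway cluster
spec:
  instances: 1                 # one instance is enough to validate the restore
  storage:
    size: 100Gi                # must be large enough to hold the restored data
  bootstrap:
    recovery:
      backup:
        # References a Backup object in the same namespace
        name: app-db-before-upgrade
```

Restoring into a different namespace or Kubernetes cluster cannot reference the Backup object directly and usually goes through an external cluster definition that points at the backup storage. That path is worth rehearsing separately, because it is the one you will need when the original cluster is gone.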

Rolling changes and upgrades

Operators can orchestrate controlled changes by restarting instances one at a time or switching to a new image. You still need staging and maintenance planning, but you reduce the number of manual steps and the chances of someone skipping one.
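In CloudNativePG, the rollout behavior is part of the Cluster spec. The sketch below is one conservative configuration, assuming you want replicas updated automatically and the primary handled only after a human decides the time is right. The image tag and storage size are placeholders.

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: app-db
spec:
  instances: 3
  # Changing the image triggers a rolling update (placeholder tag)
  imageName: ghcr.io/cloudnative-pg/postgresql:16.4
  # supervised: replicas are updated first, then the operator waits
  # for a manual switchover before touching the primary
  primaryUpdateStrategy: supervised
  # switchover: promote an updated replica instead of restarting
  # the primary in place
  primaryUpdateMethod: switchover
  storage:
    size: 100Gi
```

Setting primaryUpdateStrategy to unsupervised lets the operator finish the rollout on its own, which suits staging environments and teams that have already rehearsed the procedure.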

PostgreSQL major upgrades deserve special care. Operator support differs: some projects provide declarative upgrade resources, some rely on dump/restore or logical replication workflows, and some expect you to run a controlled migration path outside the operator. Test the operator’s documented upgrade path before you bet production on it.

Declarative databases and roles

CloudNativePG solved one friction point for us: we could describe databases and permissions alongside the applications that needed them. That made it easier to build a reusable internal framework for workloads deployed on Kubernetes while still fitting the security controls we had at the time.

Roles can be managed under the cluster spec:

nvim ~/snippet.yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: app-db
spec:
  managed:
    roles:
      - name: readonly
        ensure: present
        login: true
        inRoles:
          - pg_read_all_data

      - name: app_migrator
        ensure: present
        login: true
        inRoles:
          - app_owner

Current CloudNativePG versions also expose declarative database management for database-scoped objects. The exact API is operator-specific, but the operational win is the same: reviewable manifests replace ad hoc SQL runbooks for common platform concerns.
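As a hedged example of that declarative database management, the sketch below uses the Database resource to create a database in an existing cluster. It assumes a CloudNativePG version that ships the Database CRD and that the app_owner role already exists; the resource and database names are hypothetical.

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Database
metadata:
  name: app-db-orders          # hypothetical Kubernetes resource name
spec:
  cluster:
    name: app-db               # Cluster that will host the database
  name: orders                 # database name inside PostgreSQL
  owner: app_owner             # role must already exist in the cluster
  ensure: present
```

Keeping this manifest next to the application that uses the database means the request for a new database goes through the same review as the rest of the deployment.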

Kubernetes-native benefits, with nuance

With an operator, PostgreSQL becomes an infrastructure-as-code object. CRDs can be applied by your GitOps tool, and the database stack can stay declarative. You also get a path to metrics, anti-affinity, tolerations, topology spread constraints, and node selectors, which are the tools you need to spread replicas across failure domains and avoid noisy neighbors.
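A sketch of how those scheduling controls can look in a CloudNativePG Cluster follows. The node label, the taint key and value, and the storage size are assumptions about how the nodes are set up; only the zone topology key is a standard Kubernetes label.

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: app-db
spec:
  instances: 3
  storage:
    size: 100Gi
  affinity:
    # Keep instances of this cluster apart from each other
    enablePodAntiAffinity: true
    # Spread across availability zones rather than just nodes
    topologyKey: topology.kubernetes.io/zone
    # required: refuse to schedule two instances in the same zone
    podAntiAffinityType: required
    # Hypothetical label on nodes reserved for databases
    nodeSelector:
      workload: postgres
    # Hypothetical taint that keeps other workloads off those nodes
    tolerations:
      - key: dedicated
        operator: Equal
        value: postgres
        effect: NoSchedule
```

With the required anti-affinity type, a cluster with three instances needs at least three zones with schedulable nodes, otherwise one instance stays pending. The preferred type relaxes that at the cost of weaker isolation.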

Portability also improves because the database definition lives in manifests. Moving environments becomes less about rewriting configurations and more about provisioning storage correctly and then syncing data via backup restore or replication.

But portability has limits. A CloudNativePG Cluster manifest is not a Crunchy PostgresCluster manifest. Operator choice becomes part of your platform contract.

Storage affects everything

Storage misconfiguration is a common failure mode. Putting WAL on slow network storage can produce write latency spikes and replication lag. Local SSDs can be faster but may change availability characteristics and failure implications. The practical takeaway is to test storage carefully and match the StorageClass to the workload.
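One way to act on that in CloudNativePG is to give WAL its own volume and storage class. The sketch below assumes two hypothetical StorageClass names and placeholder sizes; substitute whatever your cluster actually provides and benchmark before committing.

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: warehouse-db
spec:
  instances: 3
  # Data files: capacity matters most
  storage:
    storageClass: premium-ssd               # hypothetical StorageClass
    size: 1Ti
  # WAL: sequential writes on the commit path, so latency matters most
  walStorage:
    storageClass: premium-ssd-low-latency   # hypothetical StorageClass
    size: 100Gi
```

Separating the volumes also means a burst of WAL cannot fill the data volume, and each can be sized and monitored on its own.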

For production, also test failure behavior: node loss, zone loss, volume detach/attach latency, snapshot restore time, and how your operator behaves when the Kubernetes control plane is degraded.

Picking an operator matters

There are many operators: CloudNativePG, Crunchy Postgres for Kubernetes, StackGres, KubeDB, Percona, and others. Swapping later is rarely painless, so don’t evaluate solely on popularity metrics. Look at release cadence, security handling, documentation quality, license/support posture, and whether important operations like initialization, cloning, upgrades, backup, restore, and connection pooling are mature.

| Operator | Deployment / HA model | Backups | Pooling | Notable strengths |
| --- | --- | --- | --- | --- |
| CloudNativePG | Custom controller model, PostgreSQL streaming replication, primary services managed by the operator | Backup and restore CRDs, plugin-based object-store backups, volume snapshots | Pooler CRD for PgBouncer | Strong Kubernetes-native design, declarative roles/databases, no external HA layer like Patroni |
| Crunchy Postgres for Kubernetes | StatefulSets for Postgres instances, Patroni-based HA | pgBackRest | PgBouncer via operator-managed resources | Mature production story, explicit HA docs, strong backup/restore tooling |
| StackGres | Operator-managed Postgres platform with HA components | WAL-G-based backup/restore workflows | PgBouncer integration | UI/CLI, extension catalog, batteries-included platform experience |

Choose like you’re selecting a vendor, because you effectively are.

Trade-offs: you still run a database service

An operator is not a shortcut to “no work.” It helps automate Postgres, but you still own the service.

Kubernetes won’t fix resiliency by itself, and an operator won’t either. You still need people who understand replication, WAL, tuning (shared_buffers, work_mem, autovacuum), storage, backup integrity, failover behavior, and Kubernetes concepts like storage classes, taints, affinity, and disruption budgets. Running databases on Kubernetes adds complexity to both sides.

Self-hosting Postgres on Kubernetes means you become your own DBaaS: upgrades, backups, storage, monitoring, capacity planning, and incident response are your responsibility. Managed services abstract a lot of that, but they can be expensive or restrictive. Operators help reduce lock-in and fit infrastructure-as-code practices, but they come with a pager.

Final note

If you plan to run Postgres on Kubernetes, treat the operator as the foundation of a database platform rather than a feature toggle. Start small, automate backups, rehearse restores and failovers regularly, and make sure at least one person truly understands both PostgreSQL and Kubernetes. That’s how a proof-of-concept StatefulSet becomes a service you can rely on.