Step-by-Step Guide to Setting Up Cross-AZ Read Replicas

Operational question: How do I provision a read replica in a different availability zone, verify replication is healthy, and route application traffic to it without risking data consistency violations?

Cross-AZ replicas distribute read traffic across physical fault domains within a single AWS region, insulating your read workload from a zonal hardware failure. Because they rely on asynchronous replication, they always carry some replica lag: architect for eventual consistency in read paths and enforce strict primary routing for transactional reads. Before provisioning, review the guidance on designing multi-region read replica topologies to align on placement strategy and latency budgets.


Figure: Cross-AZ read replica data flow. A primary PostgreSQL instance (rw, wal_level = replica) in Availability Zone A streams WAL records asynchronously over TCP 5432 (1–3 ms inter-AZ) to a read replica (ro, hot_standby = on, replay_lag monitored) in Availability Zone B. An application proxy layer (PgBouncer / HAProxy / RDS Proxy) sits between the web/API tier and the databases, routing INSERT/UPDATE/DELETE to the primary and SELECT queries to the replica.

Symptom Identification

Recognize a cross-AZ replication problem, rather than an application or network issue, by checking these signals:

  • replay_lag in pg_stat_replication climbs above 500 ms and keeps rising under sustained write load.
  • CloudWatch metric ReplicaLag (RDS) exceeds your SLA threshold (typically 1–5 s for OLTP workloads).
  • Application reports "missing recently committed rows" on replica reads: a read-after-write consistency violation caused by replica lag rather than a bug.
  • SHOW POOLS in PgBouncer shows cl_waiting > 0 sustained for more than a few seconds, indicating connection pool pressure downstream of a struggling replica.
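  • Network telemetry points at the inter-AZ path: VPC Flow Logs show rejected flows or throughput plateauing at a bandwidth cap between the primary and replica subnets, or host TCP statistics show rising retransmits (Flow Logs do not record retransmits themselves).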

Root Cause Analysis

Cross-AZ replication lag has three primary sources:

Network transit. WAL records must cross the inter-AZ network fabric (1–3 ms per round trip under normal conditions). Under a bulk load or large transaction, hundreds of MB of WAL can back up, and any bandwidth throttle or transient packet loss compounds the delay.

I/O saturation on the replica. The replica applies WAL at the speed its disk subsystem permits. If the replica's provisioned IOPS are lower than the primary's peak write throughput, the startup process that replays WAL falls behind what the WAL receiver has already written, causing replay_lag to grow even when network conditions are fine. A common misconfiguration is provisioning the primary with io2 and the replica with gp3 at the default 3,000 IOPS.
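
The interaction between WAL backlog, WAL generation rate, and replica apply rate can be sketched as simple arithmetic. The function below is an illustrative estimate only; `catchup_seconds` and its three parameters are hypothetical names, and the inputs are measurements you would have to supply from your own monitoring.

```python
import math


def catchup_seconds(backlog_mb: float, wal_gen_mb_s: float, apply_mb_s: float) -> float:
    """Estimate how long a replica needs to drain a WAL backlog.

    backlog_mb   -- WAL already generated but not yet replayed (MB)
    wal_gen_mb_s -- rate at which the primary keeps producing WAL (MB/s)
    apply_mb_s   -- rate at which the replica can replay WAL (MB/s)
    """
    if backlog_mb < 0 or wal_gen_mb_s < 0 or apply_mb_s < 0:
        raise ValueError("backlog and rates must be non-negative")
    if backlog_mb == 0:
        return 0.0
    headroom = apply_mb_s - wal_gen_mb_s
    if headroom <= 0:
        # The replica replays no faster than WAL arrives: lag never shrinks.
        return math.inf
    return backlog_mb / headroom


if __name__ == "__main__":
    # 600 MB backlog, primary writing 20 MB/s, replica replaying 50 MB/s
    print(catchup_seconds(600, 20, 50))
```

When the apply rate does not exceed the generation rate, the estimate is infinite, which matches the case where replay_lag keeps growing despite a healthy network.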

Inactive replication slots accumulating WAL. If a replication slot was created but the standby disconnected, the primary retains WAL indefinitely (pg_replication_slots.active = false). This bloats the primary's WAL directory and can exhaust disk, causing replication to stall for all standbys.

Pre-Deployment Architecture Validation

Before provisioning, confirm these conditions or the replica will lag from minute one:

  • VPC route tables show direct, low-latency peering between source and target AZ subnets, with no NAT traversal in the WAL path.
  • Security group egress rules permit TCP 5432 (PostgreSQL) or TCP 3306 (MySQL) bidirectionally between primary and replica subnets.
  • Primary instance WAL configuration: wal_level = replica, archive_mode = on, max_wal_senders ≥ 3.
  • Primary provisioned IOPS capacity is at least 3× peak write throughput so the replica can sustain catchup bursts.
  • DNS TTL on all database CNAME endpoints is ≤ 60 s to enable rapid routing failover during zonal degradation.
  • NTP/Chrony is synchronized across all AZs: clock skew between primary and replica distorts timestamp-based lag measurements such as now() - pg_last_xact_replay_timestamp() on self-managed PostgreSQL.
  • No inactive replication slots on the primary: SELECT slot_name, active FROM pg_replication_slots WHERE NOT active;
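
Part of the checklist above can be expressed as an automated preflight check. The sketch below assumes you have already collected the relevant values into a dictionary; the function name `preflight_failures` and the keys `dns_ttl_seconds` and `inactive_slots` are hypothetical, not part of any PostgreSQL or AWS API.

```python
from typing import Any, Dict, List


def preflight_failures(cfg: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable checklist failures (empty means pass)."""
    failures: List[str] = []

    # wal_level = logical is a superset of replica, so it also satisfies the check.
    if cfg.get("wal_level") not in ("replica", "logical"):
        failures.append("wal_level must be replica (or logical)")
    if cfg.get("archive_mode") != "on":
        failures.append("archive_mode must be on")
    if int(cfg.get("max_wal_senders", 0)) < 3:
        failures.append("max_wal_senders must be >= 3")
    # A missing TTL is treated as a failure.
    ttl = cfg.get("dns_ttl_seconds")
    if ttl is None or int(ttl) > 60:
        failures.append("DNS TTL must be <= 60 s")
    # Names of slots where pg_replication_slots.active is false.
    inactive = cfg.get("inactive_slots", [])
    if inactive:
        failures.append("inactive replication slots: " + ", ".join(inactive))
    return failures


if __name__ == "__main__":
    sample = {
        "wal_level": "replica",
        "archive_mode": "on",
        "max_wal_senders": 10,
        "dns_ttl_seconds": 60,
        "inactive_slots": [],
    }
    print(preflight_failures(sample))
```

Network, IOPS, and clock checks are left out because they depend on telemetry that varies by environment.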

Step-by-Step Resolution

Step 1: Provision the Replica Instance

Deploy in a different AZ, matching the primary's instance class, storage engine, and KMS encryption key. Storage type mismatch is the most common provisioning mistake; the replica must match or exceed the primary's IOPS capacity.

bash
# AWS RDS CLI: create a read replica in AZ-b from a primary in AZ-a
aws rds create-db-instance-read-replica \
  --db-instance-identifier app-db-replica-az2 \
  --source-db-instance-identifier app-db-primary-az1 \
  --availability-zone us-east-1b \
  --db-instance-class db.r6g.2xlarge \
  --storage-type gp3 \
  --no-auto-minor-version-upgrade

--allocated-storage is not accepted for create-db-instance-read-replica; the replica inherits storage size from the source automatically. Scale storage independently after creation with modify-db-instance.

Inline verification (wait for Available state):

bash
aws rds wait db-instance-available \
  --db-instance-identifier app-db-replica-az2

aws rds describe-db-instances \
  --db-instance-identifier app-db-replica-az2 \
  --query 'DBInstances[0].{Status:DBInstanceStatus,AZ:AvailabilityZone,Endpoint:Endpoint.Address}'

Step 2 β€” Configure Replication Parameters

For self-managed PostgreSQL on EC2, tune both sides before the replica starts streaming:

ini
# postgresql.conf (Primary)
wal_level = replica
max_wal_senders = 10
max_replication_slots = 10
wal_keep_size = 1024          # MB; keep enough WAL for brief replica disconnects
synchronous_commit = local    # do not wait for the standby; commit returns after local WAL flush

# postgresql.conf (Replica)
hot_standby = on
hot_standby_feedback = on     # prevents primary vacuuming rows the replica still needs
max_standby_streaming_delay = 30s

Configure PgBouncer keepalives to prevent stale socket retention during transient inter-AZ network blips:

ini
# pgbouncer.ini
server_idle_timeout = 300
server_lifetime = 3600
tcp_keepalive = 1
tcp_keepidle = 30
tcp_keepintvl = 10
tcp_keepcnt = 3

Inline verification:

sql
-- Run on primary after replica connects
SELECT client_addr, state, sync_state
FROM pg_stat_replication;
-- Expected: state = 'streaming', sync_state = 'async'

Step 3: Configure Connection Routing and Read/Write Splitting

Deploy a proxy layer (PgBouncer, HAProxy, or RDS Proxy) to split reads from writes. Application-layer routing via ORM middleware is an alternative for codebases where a proxy layer adds operational overhead; the proxy approach is preferred for workloads with multiple application services.

Route all SELECT traffic to the replica endpoint and INSERT/UPDATE/DELETE to the primary. Use DNS CNAMEs with ≤ 60 s TTLs so you can cut over without application restarts:

code
db-write.internal  →  app-db-primary-az1.xxxx.rds.amazonaws.com
db-read.internal   →  app-db-replica-az2.xxxx.rds.amazonaws.com

Implement sticky session routing for transactional read paths (e.g., "read your own write" after a form submission) by tagging those requests to route to the primary for a configurable window after a write.
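
The sticky window described above can be sketched at the application layer as follows. This is a minimal in-memory illustration, not a production router: `StickyRouter`, its `window_s` parameter, and the "primary" / "replica" return values are hypothetical names, and a real deployment would typically keep the last-write timestamp in a session store shared across application instances.

```python
import time
from typing import Callable, Dict


class StickyRouter:
    """Route a session's reads to the primary for window_s seconds after it writes."""

    def __init__(self, window_s: float = 5.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.window_s = window_s
        self._clock = clock  # injectable so the window can be tested deterministically
        self._last_write: Dict[str, float] = {}

    def route(self, session_id: str, is_write: bool) -> str:
        now = self._clock()
        if is_write:
            # Writes always go to the primary and (re)start the sticky window.
            self._last_write[session_id] = now
            return "primary"
        last = self._last_write.get(session_id)
        if last is not None and now - last < self.window_s:
            return "primary"
        # Window expired or the session never wrote: forget it and use the replica.
        self._last_write.pop(session_id, None)
        return "replica"


if __name__ == "__main__":
    router = StickyRouter(window_s=5.0)
    print(router.route("session-a", is_write=True))
    print(router.route("session-a", is_write=False))
    print(router.route("session-b", is_write=False))
```

Choose a window comfortably above the replica lag you observe under load; a window shorter than the actual lag still allows stale reads.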

Inline verification (confirm routing is live):

sql
-- Connect via the read CNAME and check you are on the replica
SELECT pg_is_in_recovery();
-- Must return 't' (true) on the replica

-- Connect via the write CNAME and confirm primary
SELECT pg_is_in_recovery();
-- Must return 'f' (false) on the primary

Step 4: Validate, Monitor, and Set Lag Alerts

Run a synthetic workload to confirm routing and measure baseline lag before serving production traffic:

bash
# Benchmark primary write throughput via the write endpoint
# (assumes the pgbench tables already exist, e.g. from a prior pgbench -i run)
pgbench -h db-write.internal -U app_user -d app_db \
  -c 20 -j 4 -T 60 --no-vacuum

Monitor lag from the primary continuously. The query below surfaces per-replica lag in seconds:

sql
-- Run on primary (PostgreSQL)
SELECT
  client_addr,
  state,
  sent_lsn,
  write_lsn,
  flush_lsn,
  replay_lsn,
  EXTRACT(EPOCH FROM replay_lag) AS lag_seconds
FROM pg_stat_replication;
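
Time-based lag can read as NULL or stale when the primary is idle, so it can help to look at lag in bytes as well. On the server, pg_wal_lsn_diff() computes this directly; the sketch below shows the same arithmetic client-side for LSN strings already fetched from the query above. The helper names `lsn_to_int` and `byte_lag` are hypothetical.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a PostgreSQL LSN such as '16/B374D848' to an absolute byte position.

    The text form is two hexadecimal numbers: the high 32 bits, a slash,
    then the low 32 bits.
    """
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def byte_lag(sent_lsn: str, replay_lsn: str) -> int:
    """Bytes of WAL the primary has sent that the replica has not yet replayed."""
    return lsn_to_int(sent_lsn) - lsn_to_int(replay_lsn)


if __name__ == "__main__":
    # Example values only; substitute sent_lsn and replay_lsn from pg_stat_replication.
    print(byte_lag("1/0", "0/FFFFFFF0"))
```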

To detect and handle replication lag in real time, configure CloudWatch alarms (RDS) or Prometheus alerts (self-managed) to fire when lag_seconds > 5. Implement circuit breakers in the proxy layer to bypass the replica automatically when lag exceeds your SLA threshold; see fallback strategies when replicas fall behind for circuit-breaker patterns.
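
One possible shape for such a circuit breaker is sketched below. It is an assumption-laden illustration rather than the pattern from the referenced guide: the class name `LagCircuitBreaker`, the thresholds, and the rule of requiring several consecutive healthy samples before restoring replica traffic are all hypothetical choices.

```python
class LagCircuitBreaker:
    """Send reads to the primary while replica lag is above the SLA threshold.

    trip_s          -- lag (seconds) above which the replica is bypassed
    reset_s         -- lag at or below which a sample counts as healthy
    healthy_samples -- consecutive healthy samples needed to restore the replica
    """

    def __init__(self, trip_s: float = 5.0, reset_s: float = 1.0,
                 healthy_samples: int = 3) -> None:
        if reset_s > trip_s:
            raise ValueError("reset_s must not exceed trip_s")
        self.trip_s = trip_s
        self.reset_s = reset_s
        self.healthy_samples = healthy_samples
        self.open = False          # open circuit = replica bypassed
        self._healthy_streak = 0

    def observe(self, lag_s: float) -> str:
        """Record one lag sample and return the target for read traffic."""
        if lag_s > self.trip_s:
            self.open = True
            self._healthy_streak = 0
        elif self.open:
            if lag_s <= self.reset_s:
                self._healthy_streak += 1
                if self._healthy_streak >= self.healthy_samples:
                    self.open = False
                    self._healthy_streak = 0
            else:
                # Below the trip point but not yet healthy: restart the streak.
                self._healthy_streak = 0
        return "primary" if self.open else "replica"


if __name__ == "__main__":
    breaker = LagCircuitBreaker(trip_s=5.0, reset_s=1.0, healthy_samples=2)
    for sample in (0.2, 6.0, 0.5, 0.4):
        print(sample, breaker.observe(sample))
```

The gap between trip_s and reset_s gives hysteresis, so read traffic does not flap between endpoints when lag hovers near the threshold.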

Validate read-after-write consistency by testing your application's critical "write then read" paths under load, confirming that transactions that must read their own writes (for example, those running at REPEATABLE READ isolation) are routed to the primary.

Configuration Snippet

Minimal annotated configuration for a self-managed PostgreSQL cross-AZ setup:

ini
# ── PRIMARY (postgresql.conf) ──
wal_level = replica            # minimum required for streaming replication
max_wal_senders = 10           # max concurrent wal_sender processes (one per replica + spares)
max_replication_slots = 10     # pre-allocate; must be ≥ number of replicas
wal_keep_size = 1024           # retain 1 GB of WAL for catchup on brief disconnects
synchronous_commit = local     # async mode; writes return after local WAL flush only

# ── REPLICA (postgresql.conf) ──
hot_standby = on               # allow read queries while in recovery
hot_standby_feedback = on      # inform primary of oldest xmin on replica (prevents vacuum conflicts)
max_standby_streaming_delay = 30s  # how long to delay applying WAL to avoid killing long reads

# ── PGBOUNCER (pgbouncer.ini, replica pool) ──
# PgBouncer only treats ; or # as a comment at the start of a line,
# so comments sit on their own lines here.
[databases]
app_db_read = host=db-read.internal port=5432 dbname=app_db

[pgbouncer]
# transaction pooling for high concurrency
pool_mode = transaction
max_client_conn = 500
default_pool_size = 25
# extra connections for lag-driven traffic spikes
reserve_pool_size = 10
server_idle_timeout = 300
tcp_keepalive = 1
tcp_keepidle = 30
tcp_keepintvl = 10
tcp_keepcnt = 3

Verification and Rollback

Confirming the fix is working:

sql
-- Verify replication state on primary
SELECT
  application_name,
  state,           -- should be 'streaming'
  sync_state,      -- should be 'async'
  EXTRACT(EPOCH FROM replay_lag) AS lag_s
FROM pg_stat_replication;

-- Run on the replica: verify it is in recovery and check replay delay
SELECT pg_is_in_recovery(), now() - pg_last_xact_replay_timestamp() AS lag;

RDS equivalent: check the CloudWatch ReplicaLag metric; it should be < 1 s under normal OLTP load.

Rollback procedure if the replica diverges or lag becomes unrecoverable:

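  1. Drain the replica: halt new connections via the proxy admin interface (for PgBouncer, psql -p 6432 -d pgbouncer -c "PAUSE app_db_read;"; for HAProxy, put the replica server into the drain state through the runtime API).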
  2. Remove the replica from DNS by setting the read CNAME weight to 0 or pointing it back at the primary endpoint.
  3. If data divergence is confirmed, promote the replica to standalone:
bash
# Self-managed PostgreSQL
pg_ctl promote -D /var/lib/postgresql/data

# AWS RDS
aws rds promote-read-replica \
  --db-instance-identifier app-db-replica-az2
  4. Revert IaC state to the last known-good baseline. Verify checksum integrity before re-establishing replication (pg_checksums requires a cleanly shut down server with data checksums enabled):
bash
pg_checksums --check -D /var/lib/postgresql/data
  5. Drop the stale replication slot on the primary, then recreate the replica from a fresh base backup:
sql
SELECT pg_drop_replication_slot('replica_slot');

Watch pg_stat_replication.state transition from catchup to streaming before restoring read traffic to the new replica.

Edge Cases and Gotchas

Cascading replication (replica of a replica). If you chain a second replica off the cross-AZ replica rather than the primary, WAL must traverse two hops: primary → AZ-b replica → downstream replica. Lag compounds at each hop, and promoting the intermediate replica breaks the downstream replica's replication stream. Use a dedicated replication slot on the primary for each replica whenever possible. See comparing PostgreSQL streaming replication vs MySQL GTID for slot management differences between engines.

Logical replication vs physical replication for cross-AZ. Physical (streaming) replication replicates the entire PostgreSQL instance block-by-block; logical replication replicates individual tables as a decoded change stream. If you need to replicate only a subset of tables cross-AZ, or the source and target are different PostgreSQL major versions, logical replication for read scaling is the correct approach, but it does not replicate schema changes automatically and requires explicit publication/subscription management.

Connection pool exhaustion during initial sync. The initial base backup from primary to replica consumes significant I/O and can briefly saturate the primary's connection quota. Disable automated backups on the replica during initial sync (--backup-retention-period 0) and re-enable post-sync via modify-db-instance to eliminate I/O contention and WAL shipping delays. Monitor max_connections usage on the primary during provisioning and increase reserve_pool_size in PgBouncer to absorb the connection spike from application retries.

Frequently Asked Questions

Does a cross-AZ read replica use synchronous or asynchronous replication?

AWS RDS cross-AZ read replicas use asynchronous replication by default. WAL records are streamed from primary to replica without the primary waiting for confirmation, which means replica lag is always non-zero under write load.

Can I promote a cross-AZ read replica to become the new primary?

Yes. For RDS, run aws rds promote-read-replica --db-instance-identifier <replica-id>. For self-managed PostgreSQL, use pg_ctl promote. The promoted instance becomes a standalone writable instance; you must update DNS and application connection strings manually.

How much additional latency does cross-AZ replication add compared to same-AZ?

Typical inter-AZ round-trip latency within a single AWS region is 1–3 ms. Under normal write load this adds only a few milliseconds to replica lag, but WAL-intensive workloads (bulk loads, large transactions) can push lag to seconds if the replica's I/O cannot keep pace.

