Step-by-Step Guide to Setting Up Cross-AZ Read Replicas

Establish baseline architecture requirements, define RTO/RPO targets, and align on cross-AZ network topology. Review core replication concepts in Database Replication Fundamentals & Architecture to standardize terminology before provisioning. Cross-AZ replicas mitigate zonal failures but introduce asynchronous replication lag; architect for eventual consistency where applicable and enforce strict isolation boundaries for transactional workloads.

Pre-Deployment Architecture Validation

Audit VPC route tables to confirm direct, low-latency connectivity between the source and target availability zones. Validate that security group rules permit TCP 5432 (or the engine-specific port) in both directions without NAT traversal. Confirm cross-AZ bandwidth limits will not throttle the initial base backup transfer. Verify the primary instance’s WAL configuration (archive_mode=on), provisioned IOPS capacity (minimum 3x peak write throughput), and IAM roles for automated snapshotting. Lower DNS TTLs to ≤60s on all database endpoints to enable rapid routing failover during zonal degradation. Ensure NTP/Chrony synchronization across all AZs so clock skew does not distort lag measurements or log correlation.
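
Before provisioning, a plain TCP connect probe from an application host in the target AZ is a quick way to confirm reachability and round-trip latency on the replication port. The sketch below is illustrative only; the endpoint hostnames are assumptions.

# check_db_reachability.py (illustrative pre-deployment probe)
import socket
import time

# Hypothetical endpoints; replace with your actual primary/replica hosts.
ENDPOINTS = [
    ("app-db-primary-az1.example.internal", 5432),
    ("app-db-replica-az2.example.internal", 5432),
]

def tcp_probe(host, port, timeout=3.0):
    """Attempt a TCP connect and return the connect round-trip time in milliseconds."""
    start = time.monotonic()
    with socket.create_connection((host, port), timeout=timeout):
        return (time.monotonic() - start) * 1000

for host, port in ENDPOINTS:
    try:
        rtt_ms = tcp_probe(host, port)
        print(f"{host}:{port} reachable, connect RTT {rtt_ms:.1f} ms")
    except OSError as exc:
        print(f"{host}:{port} UNREACHABLE: {exc}")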

Step 1: Provisioning the Cross-AZ Replica Instance

Deploy the replica using infrastructure-as-code or CLI. Match the primary’s instance class, storage type (e.g., gp3/io2), and KMS encryption key so the replica can sustain the primary’s write rate and restore from encrypted snapshots without key mismatches. Apply explicit AZ placement (and tags) so capacity planning and schedulers account for the zonal footprint.

# AWS RDS CLI example
aws rds create-db-instance-read-replica \
 --db-instance-identifier app-db-replica-az2 \
 --source-db-instance-identifier app-db-primary-az1 \
 --availability-zone us-east-1b \
 --db-instance-class db.r6g.2xlarge \
 --storage-type gp3 \
 --allocated-storage 500 \
 --no-auto-minor-version-upgrade

Disable automated backups (backup_retention_period=0) on the replica during the initial sync to eliminate I/O contention and WAL shipping delays. Re-enable them once the replica has caught up by modifying the instance (backup retention is an instance setting, not a parameter group value), as sketched below.
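
If the fleet is managed from Python, toggling the replica’s backup retention around the initial sync could look like the boto3 sketch below; the instance identifier is carried over from the provisioning example above, and the 7-day retention value is an assumption.

# toggle_replica_backups.py (illustrative)
import boto3

rds = boto3.client("rds")
REPLICA_ID = "app-db-replica-az2"  # assumed identifier from the provisioning example

def set_backup_retention(days: int) -> None:
    """Set BackupRetentionPeriod on the replica; 0 disables automated backups."""
    rds.modify_db_instance(
        DBInstanceIdentifier=REPLICA_ID,
        BackupRetentionPeriod=days,
        ApplyImmediately=True,
    )

# During initial sync:
set_backup_retention(0)
# After the replica reaches a steady streaming state (assumed retention of 7 days):
# set_backup_retention(7)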

Step 2: Configuring Replication & Connection Parameters

Tune engine-specific parameters to optimize async replication and connection handling. For PostgreSQL:

# postgresql.conf (Primary)
wal_level = replica          # minimum level required for streaming replication
max_wal_senders = 10         # walsender headroom for replicas and base backups
max_replication_slots = 10   # one slot per replica plus spares
synchronous_commit = off     # commits return before WAL flush or replica ack (async replication)

# postgresql.conf (Replica)
hot_standby = on                   # serve read-only queries during recovery
hot_standby_feedback = on          # reduce replica query cancellations (adds bloat risk on primary)
max_standby_streaming_delay = 30s  # cancel conflicting standby queries after 30s

Configure the connection pooler to enforce strict timeouts and idle session eviction:

# pgbouncer.ini
server_idle_timeout = 300
server_lifetime = 3600
tcp_keepalive = 1
tcp_keepidle = 30
tcp_keepintvl = 10
tcp_keepcnt = 3

These settings prevent stale socket retention during transient cross-AZ network blips and ensure rapid connection recycling under load.

Step 3: Implementing Connection Routing & Read/Write Splitting

Deploy a stateful proxy layer (PgBouncer, HAProxy, or cloud-native router) between application servers and database endpoints. Implement read/write splitting using DNS CNAMEs or application-level middleware (e.g., Spring AbstractRoutingDataSource, Rails connects_to, or Django DATABASE_ROUTERS). Route SELECT queries to the replica endpoint and INSERT/UPDATE/DELETE to the primary. Reference advanced routing strategies in Designing Multi-Region Read Replica Topologies for latency-aware query distribution, sticky sessions for transactional consistency, and graceful connection draining during maintenance windows.
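
As a framework-agnostic illustration of application-level splitting, a minimal sketch using psycopg2 with separate primary and replica DSNs might look like the following; the DSNs and the SELECT-prefix heuristic are assumptions, and a production router would also account for transactions, writes hidden inside CTEs or functions, and sticky sessions.

# rw_router.py (naive read/write splitting sketch, illustrative)
import psycopg2

# Hypothetical DSNs; point these at your proxy or instance endpoints.
PRIMARY_DSN = "host=app-db-primary-az1 port=5432 dbname=app user=app"
REPLICA_DSN = "host=app-db-replica-az2 port=5432 dbname=app user=app"

primary = psycopg2.connect(PRIMARY_DSN)
replica = psycopg2.connect(REPLICA_DSN)
replica.set_session(readonly=True)  # fail fast if a write leaks to the replica

def execute(sql: str, params=None):
    """Route plain SELECTs to the replica, everything else to the primary."""
    is_read = sql.lstrip().lower().startswith("select")
    conn = replica if is_read else primary
    with conn.cursor() as cur:
        cur.execute(sql, params)
        rows = cur.fetchall() if cur.description else None
    conn.commit()
    return rows

# Usage:
# execute("INSERT INTO events (payload) VALUES (%s)", ("signup",))
# rows = execute("SELECT count(*) FROM events")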

Step 4: Validation, Monitoring & Consistency Checks

Execute synthetic read/write workloads using pgbench or sysbench to verify routing behavior and failover thresholds. Monitor replication lag continuously:

-- Run on primary
SELECT client_addr, state, sent_lsn, write_lsn, flush_lsn, replay_lsn,
 extract(epoch from replay_lag) AS lag_seconds
FROM pg_stat_replication;

Validate read-after-write consistency by testing REPEATABLE READ and READ COMMITTED isolation levels against application transaction boundaries. Configure alert routing for lag spikes exceeding 500ms and connection pool saturation >85%. Implement automated circuit breakers in the proxy layer to temporarily bypass replicas when lag thresholds are breached.
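
A lightweight way to exercise read-after-write behavior end to end is to commit a marker row on the primary and poll the replica until it becomes visible. The sketch below is a minimal probe, not part of the stack described above: the DSNs and the replication_probe table are assumptions.

# read_after_write_probe.py (measure visible replication delay, illustrative)
import time
import uuid
import psycopg2

PRIMARY_DSN = "host=app-db-primary-az1 dbname=app user=app"
REPLICA_DSN = "host=app-db-replica-az2 dbname=app user=app"

marker = str(uuid.uuid4())

with psycopg2.connect(PRIMARY_DSN) as primary:
    with primary.cursor() as cur:
        # Assumes a small probe table:
        #   CREATE TABLE replication_probe (token text, ts timestamptz DEFAULT now());
        cur.execute("INSERT INTO replication_probe (token) VALUES (%s)", (marker,))
committed_at = time.monotonic()  # the context manager commits on exit

with psycopg2.connect(REPLICA_DSN) as replica:
    replica.set_session(readonly=True, autocommit=True)
    with replica.cursor() as cur:
        while True:
            cur.execute("SELECT 1 FROM replication_probe WHERE token = %s", (marker,))
            if cur.fetchone():
                break
            time.sleep(0.05)

print(f"read-after-write delay ~ {(time.monotonic() - committed_at) * 1000:.0f} ms")

Running the probe through the same proxy path the application uses gives a more realistic picture than connecting to the instances directly.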

Troubleshooting & Runbook Execution

Structured diagnostic workflow for production incidents involving replica degradation, routing failures, or consistency violations.

Symptom Identification

Detect elevated replication lag (replay_lag > 5s), connection pool exhaustion (waiting_connections > 0), stale read anomalies (missing recently committed rows), or proxy routing loops (cascading 503s). Correlate symptoms with APM traces (DB spans showing retry storms), database error logs (postgresql.log), and cross-AZ network telemetry (VPC Flow Logs, CloudWatch Network In/Out).
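
For the connection pool symptom specifically, PgBouncer’s admin console can be queried directly. The sketch below assumes the admin pseudo-database pgbouncer is reachable on port 6432 with a user listed in stats_users, and reads the cl_waiting column from SHOW POOLS.

# pool_saturation_check.py (flag waiting clients in PgBouncer, illustrative)
import psycopg2

# Assumed DSN; credentials must match stats_users/admin_users in pgbouncer.ini.
ADMIN_DSN = "host=127.0.0.1 port=6432 dbname=pgbouncer user=pgbouncer_stats"

conn = psycopg2.connect(ADMIN_DSN)
conn.autocommit = True  # the admin console does not support transactions
with conn.cursor() as cur:
    cur.execute("SHOW POOLS")
    cols = [c.name for c in cur.description]
    for row in cur.fetchall():
        pool = dict(zip(cols, row))
        if int(pool["cl_waiting"]) > 0:
            print(f"pool {pool['database']}/{pool['user']}: "
                  f"{pool['cl_waiting']} clients waiting")
conn.close()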

Root Cause Analysis

Isolate the failure vector:

  • Network Partitioning: Check AZ-to-AZ latency and packet loss via ping/mtr or cloud network metrics.
  • WAL Shipping Bottlenecks: Verify WAL retention (wal_keep_size/wal_keep_segments, and slot retention bounded by max_slot_wal_keep_size) isn’t causing disk pressure on the primary; see the sketch after this list.
  • I/O Contention: Monitor replica iowait and disk_utilization. Cross-AZ bandwidth throttling often manifests as sustained wal_receiver stalls.
  • Routing Misconfiguration: Audit DNS caching delays (dig +trace) and application connection leak patterns (unreturned connections to pool).
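
For the WAL shipping check, inspecting how much WAL each replication slot is retaining on the primary quickly confirms or rules out disk pressure. This sketch reuses the primary DSN assumed in earlier examples and standard catalog functions.

# slot_wal_retention.py (inspect WAL retained by replication slots, illustrative)
import psycopg2

PRIMARY_DSN = "host=app-db-primary-az1 dbname=app user=app"

QUERY = """
SELECT slot_name,
       active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC;
"""

with psycopg2.connect(PRIMARY_DSN) as conn:
    with conn.cursor() as cur:
        cur.execute(QUERY)
        for slot_name, active, retained in cur.fetchall():
            # An inactive slot keeps retaining WAL (bounded only by max_slot_wal_keep_size).
            flag = "" if active else "  <-- inactive slot, still retaining WAL"
            print(f"{slot_name}: {retained} retained{flag}")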

Mitigation Strategies

Apply immediate containment:

  1. Dynamic Pool Resizing: Increase max_client_conn and reserve_pool_size in PgBouncer to absorb retry storms.
  2. Commit Behavior Adjustment: If synchronous replication is enabled, temporarily set synchronous_commit = local on the primary so commits wait only for the local WAL flush, reducing write latency while lag recovers (note: recently committed transactions may be missing from the replica if the primary fails before it catches up).
  3. Traffic Rerouting: Shift 100% of read traffic to the primary via DNS weight adjustment or proxy reconfiguration (RELOAD via the PgBouncer admin console, or an online restart with pgbouncer -R).
  4. Vertical Scale: Upgrade replica instance class or storage IOPS if CPU/IO bottlenecks are confirmed.
  5. Circuit Breakers: Implement application-level fallbacks to bypass degraded replicas automatically when lag_seconds > 2.0.
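
One possible shape for the circuit breaker in item 5 is a lag-aware gate in the application’s read path. The sketch below is illustrative: the DSNs, threshold, and refresh interval are assumptions, and a production breaker would add hysteresis and share state across workers.

# replica_circuit_breaker.py (bypass the replica when lag is high, illustrative)
import time
import psycopg2

PRIMARY_DSN = "host=app-db-primary-az1 dbname=app user=app"
REPLICA_DSN = "host=app-db-replica-az2 dbname=app user=app"
LAG_THRESHOLD_S = 2.0   # matches the runbook threshold above
CHECK_INTERVAL_S = 5.0  # how often the cached lag reading is refreshed

_last_check = 0.0
_replica_healthy = True

def _replica_lag_seconds() -> float:
    """Measure apply lag on the replica via pg_last_xact_replay_timestamp().
    Note: on an idle primary this overestimates lag; NULL (nothing replayed) is treated as unbounded."""
    with psycopg2.connect(REPLICA_DSN) as conn, conn.cursor() as cur:
        cur.execute("SELECT EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp())")
        value = cur.fetchone()[0]
        return float(value) if value is not None else float("inf")

def reads_dsn() -> str:
    """Return the replica DSN when healthy, otherwise fall back to the primary."""
    global _last_check, _replica_healthy
    if time.monotonic() - _last_check > CHECK_INTERVAL_S:
        try:
            _replica_healthy = _replica_lag_seconds() < LAG_THRESHOLD_S
        except psycopg2.Error:
            _replica_healthy = False  # treat connection errors as an open breaker
        _last_check = time.monotonic()
    return REPLICA_DSN if _replica_healthy else PRIMARY_DSN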

Rollback Procedures

If data divergence or unrecoverable lag occurs, execute controlled rollback:

  1. Graceful Drain: Halt new connections to the replica via the proxy admin interface: PAUSE or DISABLE in the PgBouncer admin console, or HAProxy disable server db/replica-az2.
  2. Disable Read Routing: Remove or set DNS CNAME weight to 0 for the replica endpoint.
  3. Promote/Isolate: If divergence is detected, promote the replica to standalone: pg_ctl promote -D /var/lib/postgresql/data or aws rds promote-read-replica.
  4. Restore Primary Config: Revert IaC state to baseline. Verify checksum integrity using pg_checksums or pg_verifybackup before re-establishing replication.
  5. Re-sync Topology: Drop the degraded replication slot (SELECT pg_drop_replication_slot('replica_slot');), recreate the replica from a fresh base backup, and re-enable standard async replication. Validate pg_stat_replication state transitions from catchup to streaming before restoring read routing.
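
To verify that final state transition programmatically before restoring read routing, a small poller against pg_stat_replication on the primary can be used; the DSN and timeout below are assumptions consistent with the earlier examples.

# wait_for_streaming.py (block until the rebuilt replica is streaming, illustrative)
import time
import psycopg2

PRIMARY_DSN = "host=app-db-primary-az1 dbname=app user=app"
TIMEOUT_S = 1800  # assumed ceiling for the base backup restore and catch-up

deadline = time.monotonic() + TIMEOUT_S
while time.monotonic() < deadline:
    with psycopg2.connect(PRIMARY_DSN) as conn, conn.cursor() as cur:
        cur.execute("SELECT state FROM pg_stat_replication")
        states = [row[0] for row in cur.fetchall()]
    if states and all(s == "streaming" for s in states):
        print("replica is streaming; safe to restore read routing")
        break
    print(f"current walsender states: {states or 'none'}; waiting...")
    time.sleep(15)
else:
    raise SystemExit("replica did not reach streaming state within the timeout")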