Disaster Recovery in AWS: A Practical Guide

Disaster recovery is not the same as having backups. Backups give you data. Disaster recovery gives you a running business. The difference matters when your primary region goes down, your database gets corrupted, or a critical service failure cascades through your application. AWS provides the building blocks for every DR strategy from basic backup-and-restore to multi-region active-active — but choosing the wrong one wastes money, and choosing nothing leaves your business exposed.

Disaster Recovery vs Backup: Understanding the Difference

Backup is about preserving data. You create copies of databases, files, and configurations so that if data is lost or corrupted, you can restore it. Backup answers the question: can I get my data back?

Disaster recovery is about preserving business operations. It encompasses the entire process of bringing your application back online — infrastructure, networking, compute, data, and DNS — after a major failure. DR answers the question: how quickly can my business operate again?

A solid backup strategy is a prerequisite for DR, but backups alone are not a DR plan. Restoring a database from backup takes time. Provisioning new infrastructure takes time. Configuring networking and DNS takes time. Your DR strategy defines how much of that work is done in advance versus on-demand during an actual disaster.
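
To make the "in advance versus on-demand" trade-off concrete, here is a minimal Python sketch that estimates recovery time as the sum of the steps still left to do when disaster strikes. The step names and durations are hypothetical placeholders, not measurements, and the sketch assumes steps run one after another.

```python
from datetime import timedelta

# Hypothetical recovery steps and durations; replace with figures from your own drills.
RECOVERY_STEPS = {
    "provision_infrastructure": timedelta(hours=2),
    "restore_database": timedelta(hours=3),
    "configure_networking": timedelta(minutes=45),
    "update_dns": timedelta(minutes=15),
}


def estimate_recovery_time(steps, done_in_advance=frozenset()):
    """Sum the durations of the steps that still have to run during the disaster.

    Assumes the remaining steps run sequentially, which is a pessimistic
    simplification: in practice some of them can overlap.
    """
    unknown = set(done_in_advance) - set(steps)
    if unknown:
        raise ValueError(f"unknown steps: {sorted(unknown)}")
    remaining = (d for name, d in steps.items() if name not in done_in_advance)
    return sum(remaining, timedelta())


# Everything on-demand versus a setup where the database and network already exist.
on_demand = estimate_recovery_time(RECOVERY_STEPS)
pre_staged = estimate_recovery_time(
    RECOVERY_STEPS, {"restore_database", "configure_networking"}
)
```

Each DR strategy described below amounts to a different choice of which steps to move into the done_in_advance set.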

RTO and RPO: The Two Numbers That Define Your Strategy

Recovery Time Objective (RTO) is the maximum acceptable time your application can be down. If your RTO is 4 hours, your DR strategy must restore full operations within 4 hours of a disaster declaration. A 1-hour RTO requires significantly more investment than a 24-hour RTO.

Recovery Point Objective (RPO) is the maximum acceptable data loss measured in time. An RPO of 1 hour means you can afford to lose up to 1 hour of data. An RPO of zero means no data loss is acceptable, requiring synchronous replication.

These numbers come from business impact analysis, not engineering preference. Talk to stakeholders about what downtime costs per hour, what data loss is unrecoverable, and which systems are genuinely critical versus nice-to-have. The answers will vary dramatically between systems and will directly determine your DR investment.

Strategy 1: Backup and Restore

How it works: You maintain regular backups of all critical data in a secondary region. When disaster strikes, you provision new infrastructure from scratch and restore data from backups. Nothing runs in the DR region until you need it.

RTO: Hours to days, depending on infrastructure complexity and data volume.

RPO: Depends on backup frequency. Daily backups mean up to 24 hours of data loss.

Cost: Lowest. You only pay for backup storage during normal operations. Compute costs only begin during an actual disaster event.

Best for: Non-critical systems, development environments, and businesses where several hours of downtime is acceptable. Also appropriate as a starting point for businesses building toward a more robust strategy.
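
The link between backup frequency and RPO can be sketched in a few lines of Python. This is an illustration under one stated assumption: a backup captures the state at the moment it starts and is only usable once it has finished. The function names are hypothetical.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval, backup_duration=timedelta(0)):
    """Upper bound on data loss for a periodic backup schedule.

    Assumption: a backup reflects the data as of its start time and cannot be
    restored until it completes. A failure just before the next backup
    finishes therefore loses one full interval plus the backup duration.
    """
    return backup_interval + backup_duration


def meets_rpo(rpo, backup_interval, backup_duration=timedelta(0)):
    """Return True if the schedule keeps worst-case data loss within the RPO."""
    return worst_case_data_loss(backup_interval, backup_duration) <= rpo


one_hour_rpo = timedelta(hours=1)
daily_ok = meets_rpo(one_hour_rpo, backup_interval=timedelta(days=1))
frequent_ok = meets_rpo(
    one_hour_rpo,
    backup_interval=timedelta(minutes=30),
    backup_duration=timedelta(minutes=10),
)
```

Under this assumption, daily backups cannot satisfy a 1-hour RPO, while 30-minute backups that take 10 minutes each can.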

Strategy 2: Pilot Light

How it works: Core infrastructure components run continuously in the DR region at minimal scale. Databases are replicated. AMIs and container images are pre-staged. But application servers and web tiers are not running. During a disaster, you scale up the pre-positioned resources and start the application layer.

RTO: 30 minutes to a few hours. The database is already running and current, so you only need to start compute resources and update DNS.

RPO: Minutes, because databases are continuously replicated.

Cost: Low to moderate. You pay for database replication and minimal infrastructure in the DR region. Typically 10-20% of production costs during normal operations.

Best for: Business-critical applications that need recovery within a few hours but cannot justify the cost of running full parallel infrastructure. Common for SaaS platforms and internal business applications.

Strategy 3: Warm Standby

How it works: A scaled-down but fully functional copy of your production environment runs in the DR region at all times. All tiers — web, application, and database — are active, just at reduced capacity. During failover, you scale up to production capacity and redirect traffic.

RTO: Minutes. The application is already running. You only need to scale up and redirect traffic via Route 53.

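RPO: Seconds to near-zero. Cross-region database replication on AWS, such as RDS cross-region read replicas and Aurora Global Database, is asynchronous, so the data at risk is bounded by replication lag at the moment of failure.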

Cost: Moderate to high. Running a scaled-down environment typically costs 25-50% of production. However, the warm standby can serve as a staging environment or handle read traffic during normal operations to offset costs.

Best for: Revenue-generating applications where minutes of downtime have measurable financial impact. E-commerce platforms, payment processing systems, and customer-facing SaaS products.

Strategy 4: Multi-Site Active-Active

How it works: Full production environments run simultaneously in multiple regions, actively serving traffic. Data is replicated bidirectionally. If one region fails, the other continues serving all traffic without any failover process.

RTO: Near-zero. There is no failover — traffic simply routes to healthy regions automatically.

RPO: Zero with synchronous replication, or near-zero with asynchronous replication and conflict resolution.

Cost: Highest. You are running full production infrastructure in multiple regions simultaneously. Typically 2x or more of single-region costs, plus cross-region data transfer.

Best for: Mission-critical systems where any downtime is unacceptable. Financial services, healthcare platforms, and global applications that already need multi-region presence for latency reasons.

Cost vs Recovery Time Tradeoff

The relationship between cost and recovery speed is not linear: each step toward a shorter RTO costs disproportionately more than the previous one. As a rough illustration, moving from a 24-hour RTO to a 4-hour RTO might cost 20% more, moving from 4 hours to 30 minutes might cost 3x more, and moving from 30 minutes to near-zero might cost 5-10x more.

The right answer depends entirely on what downtime costs your business. If you lose $50,000 per hour of downtime, investing $5,000/month in warm standby infrastructure pays for itself after a single 2-hour incident. If downtime costs $500/hour, backup-and-restore with a 12-hour RTO is perfectly rational.
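
The arithmetic behind that comparison, as a small Python sketch. It makes one simplifying assumption that real incidents will not fully honor: the DR investment avoids the downtime entirely. The function names are illustrative.

```python
def downtime_cost(cost_per_hour, hours):
    """Money lost to an outage of the given length."""
    return cost_per_hour * hours


def months_of_dr_covered(cost_per_hour, incident_hours, dr_cost_per_month):
    """How many months of DR spend one avoided incident pays for."""
    return downtime_cost(cost_per_hour, incident_hours) / dr_cost_per_month


def break_even_hours_per_year(cost_per_hour, dr_cost_per_month):
    """Hours of avoided downtime per year at which DR spend pays for itself."""
    return 12 * dr_cost_per_month / cost_per_hour


# Figures from the paragraph above.
high_cost_months = months_of_dr_covered(50_000, 2, 5_000)
high_cost_break_even = break_even_hours_per_year(50_000, 5_000)
low_cost_break_even = break_even_hours_per_year(500, 5_000)
```

At $50,000 per hour, a single 2-hour incident costs $100,000, which covers 20 months of a $5,000/month standby, and the standby breaks even at 1.2 hours of avoided downtime per year. At $500 per hour, the same spend only breaks even after 120 hours of avoided downtime per year.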

Do not over-engineer DR for systems that do not justify it. Apply different strategies to different systems based on their business criticality. Your customer-facing payment system needs warm standby. Your internal reporting dashboard can use backup-and-restore.
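
One way to apply this per-system thinking is to pick the cheapest strategy whose worst-case RTO still meets each system's target. The sketch below does that in Python. The worst-case figures are assumptions chosen for illustration, loosely based on the ranges in this guide; replace them with numbers measured in your own drills.

```python
from datetime import timedelta

# Ordered from cheapest to most expensive. The RTO bounds are illustrative
# assumptions, not AWS guarantees.
STRATEGIES = [
    ("backup-and-restore", timedelta(hours=12)),
    ("pilot-light", timedelta(hours=3)),
    ("warm-standby", timedelta(minutes=15)),
    ("multi-site-active-active", timedelta(minutes=1)),
]


def cheapest_strategy(target_rto):
    """Return the first (cheapest) strategy whose worst-case RTO fits the target."""
    for name, worst_case_rto in STRATEGIES:
        if worst_case_rto <= target_rto:
            return name
    raise ValueError("no listed strategy meets the target RTO")


# Hypothetical systems and their RTO targets from a business impact analysis.
targets = {
    "payments": timedelta(minutes=30),
    "reporting-dashboard": timedelta(hours=24),
}
plan = {system: cheapest_strategy(rto) for system, rto in targets.items()}
```

With these assumed bounds, the payment system lands on warm standby and the reporting dashboard on backup-and-restore. A fuller version would also check each strategy against the system's RPO.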

Key AWS Services for Disaster Recovery

Route 53: DNS-based traffic routing with health checks. Automatically redirects traffic to your DR region when the primary region fails health checks. Supports weighted, failover, and latency-based routing policies.
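
As a rough sketch of what failover routing looks like in practice, the Python below builds a pair of failover records in the shape accepted by Route 53's ChangeResourceRecordSets API. The domain name, IP addresses, and health check ID are placeholders, and the helper functions are hypothetical; the code only constructs the request and does not call AWS.

```python
def failover_record(name, ip_address, role, health_check_id=None, ttl=60):
    """Build one UPSERT change for a failover A record."""
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError("role must be 'PRIMARY' or 'SECONDARY'")
    record = {
        "Name": name,
        "Type": "A",
        "SetIdentifier": f"{name}-{role.lower()}",  # must be unique per record
        "Failover": role,
        "TTL": ttl,  # kept low so clients pick up a failover quickly
        "ResourceRecords": [{"Value": ip_address}],
    }
    if health_check_id is not None:
        record["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": record}


def failover_change_batch(name, primary_ip, secondary_ip, primary_health_check_id):
    """Primary record guarded by a health check, secondary as the fallback."""
    return {
        "Comment": "Failover routing for DR",
        "Changes": [
            failover_record(name, primary_ip, "PRIMARY", primary_health_check_id),
            failover_record(name, secondary_ip, "SECONDARY"),
        ],
    }


# Placeholder values from documentation-reserved ranges.
batch = failover_change_batch(
    "app.example.com.", "192.0.2.10", "198.51.100.10", "example-health-check-id"
)
```

With boto3, a batch like this would be passed as the ChangeBatch argument of the Route 53 client's change_resource_record_sets call, together with the HostedZoneId of your zone.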

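S3 Cross-Region Replication: Automatically replicates new objects to a bucket in your DR region once versioning is enabled on both buckets. Objects that existed before replication was configured can be copied with S3 Batch Replication, and replication of delete markers and KMS-encrypted objects is available as an opt-in setting. Useful for every DR strategy that keeps data in a second region, including backup-and-restore.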
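
For illustration, a replication rule can be expressed as a plain Python dictionary in the shape expected by S3's PutBucketReplication API. The role and bucket ARNs below are placeholders, the rule ID is arbitrary, and the helper function is hypothetical; the code builds the configuration without calling AWS.

```python
def replication_configuration(role_arn, destination_bucket_arn,
                              replicate_delete_markers=False):
    """Build a single-rule replication configuration covering the whole bucket."""
    return {
        "Role": role_arn,  # IAM role that S3 assumes to copy objects
        "Rules": [
            {
                "ID": "dr-replication",
                "Priority": 1,
                "Status": "Enabled",
                "Filter": {},  # empty filter: the rule applies to every object
                "DeleteMarkerReplication": {
                    "Status": "Enabled" if replicate_delete_markers else "Disabled"
                },
                "Destination": {"Bucket": destination_bucket_arn},
            }
        ],
    }


# Placeholder ARNs; substitute your own role and DR bucket.
config = replication_configuration(
    "arn:aws:iam::123456789012:role/example-replication-role",
    "arn:aws:s3:::example-dr-bucket",
    replicate_delete_markers=True,
)
```

With boto3, this dictionary would be passed as the ReplicationConfiguration argument of the S3 client's put_bucket_replication call on the source bucket. Both buckets need versioning enabled first.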

RDS Multi-AZ and Cross-Region Read Replicas: Multi-AZ handles single-AZ failures automatically. Cross-region read replicas provide a warm database in your DR region that can be promoted to primary within minutes.

Aurora Global Database: Purpose-built for multi-region deployments. Replicates to up to five secondary regions with typical lag under a second, and supports managed failover that can typically promote a secondary region in under a minute. Replication is asynchronous, so an unplanned failover can lose the most recent writes. The strongest option for relational database DR.

Testing Your DR Plan

An untested DR plan is an assumption, not a plan. Schedule regular DR drills: at least annually for every system, and quarterly for critical ones.

Tabletop exercises: Walk through the DR procedure on paper with your team. Identify who does what, what runbooks exist, and where gaps are. These cost nothing and reveal procedural gaps quickly.

Simulated failovers: Actually execute your failover procedure in a controlled manner. Redirect traffic to the DR region, verify application functionality, then fail back. Measure actual RTO versus planned RTO.
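
Measuring actual versus planned RTO only needs two timestamps from the drill. A minimal Python sketch, with hypothetical function names and sample times:

```python
from datetime import datetime, timedelta


def measured_rto(declared_at, restored_at):
    """Elapsed time from disaster declaration to verified service restoration."""
    if restored_at < declared_at:
        raise ValueError("restored_at must not be earlier than declared_at")
    return restored_at - declared_at


def drill_report(declared_at, restored_at, planned_rto):
    """Compare the measured RTO from a drill against the planned target."""
    actual = measured_rto(declared_at, restored_at)
    return {
        "actual_rto": actual,
        "planned_rto": planned_rto,
        "met_target": actual <= planned_rto,
        "overrun": max(actual - planned_rto, timedelta(0)),
    }


# Sample drill: declared at 09:00, service verified healthy at 10:20.
report = drill_report(
    declared_at=datetime(2024, 1, 15, 9, 0),
    restored_at=datetime(2024, 1, 15, 10, 20),
    planned_rto=timedelta(hours=1),
)
```

In this sample the drill took 1 hour 20 minutes against a 1-hour target, so it missed by 20 minutes. Recording the report after every drill shows whether the gap is shrinking.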

Chaos engineering: Use AWS Fault Injection Service to inject realistic failures — AZ outages, network latency, service degradation — and observe how your DR automation responds. This builds confidence that automated failover actually works under realistic conditions.

Common DR Mistakes

Never testing failover: The most common and most dangerous mistake. Teams build DR infrastructure and never validate it works. When a real disaster occurs, they discover configuration drift, expired credentials, or missing runbooks.

Forgetting DNS TTL: If your DNS records have a 24-hour TTL, a failover will take hours to propagate to all clients even if your infrastructure switches in minutes. Set low TTLs (60-300 seconds) on records used for DR failover.
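
A back-of-the-envelope Python estimate shows why the TTL dominates. It assumes a health check has to fail a set number of consecutive times before DNS answers change (the defaults below match Route 53's standard 30-second interval and failure threshold of 3), and that resolvers honor the TTL, which not all of them do.

```python
from datetime import timedelta


def worst_case_switchover(ttl_seconds, check_interval_seconds=30, failure_threshold=3):
    """Rough upper bound on how long clients keep using the failed endpoint.

    Detection time is approximated as interval * threshold. A client that
    cached the old answer just before the switch keeps it for one full TTL.
    """
    detection_seconds = check_interval_seconds * failure_threshold
    return timedelta(seconds=detection_seconds + ttl_seconds)


low_ttl = worst_case_switchover(ttl_seconds=60)
day_ttl = worst_case_switchover(ttl_seconds=24 * 60 * 60)
```

With a 60-second TTL the estimate is 2.5 minutes; with a 24-hour TTL it is just over a day, regardless of how fast the infrastructure itself fails over.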

Ignoring dependencies: Your application may fail over successfully, but if it depends on a third-party API that is region-specific, or a hardcoded IP address, the failover is incomplete. Map all dependencies and ensure they are DR-aware.

No runbook: DR procedures should be documented step-by-step so that anyone on the team can execute them under pressure at 3 AM. If the only person who knows the DR procedure is unavailable during the disaster, you effectively have no DR plan.

Start with Pilot Light, Scale from There

For most small and medium businesses, Pilot Light provides the best balance of cost and recovery speed. It keeps your data continuously replicated, with an RTO of roughly 30 minutes to a few hours, at 10-20% of production cost. You can always upgrade to Warm Standby later as your business grows and downtime becomes more expensive.