Episode 25: The AWS Shared Responsibility Model
When organizations move workloads into the cloud, one of the most important design considerations is how they will handle failure. No system is immune to outages, and businesses must plan for disruptions by building resilience, availability, and recovery into their architectures. AWS provides many tools and patterns that make it easier to prepare for the unexpected. For the AWS Certified Cloud Practitioner exam, you should understand the basic terms, the services that support backup and recovery, and the disaster recovery patterns that companies use to meet business continuity goals.
Resilience and redundancy are related but not identical. Resilience refers to a system’s ability to withstand and recover from disruptions, continuing to operate even when something goes wrong. Redundancy refers to having extra capacity or duplicate resources in place to take over if the primary ones fail. For example, running databases in multiple Availability Zones is redundancy, while designing failover routing that shifts traffic automatically is resilience. Together, these principles ensure systems not only survive failures but also continue to serve users with minimal impact.
Recovery Time Objective, or RTO, and Recovery Point Objective, or RPO, are two critical measures in disaster recovery planning. RTO defines how quickly you need to restore systems after an outage, while RPO defines how much data you can afford to lose, measured in time. For example, a bank may have an RTO of 15 minutes and an RPO of near zero, meaning downtime and data loss must be minimal. On the exam, remember that RTO is about time to recovery, and RPO is about acceptable data loss.
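To make the two measures concrete, here is a minimal Python sketch. It assumes that worst-case data loss equals the gap between backups and that an estimate of restore time is available. The function names and figures are hypothetical and chosen only for illustration.

```python
from datetime import timedelta


def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """Worst-case data loss is the time since the last backup,
    which can be as large as the interval between backups."""
    return backup_interval <= rpo


def meets_rto(estimated_restore_time: timedelta, rto: timedelta) -> bool:
    """The plan meets the RTO if systems can be restored within it."""
    return estimated_restore_time <= rto


# Hypothetical targets: lose at most 1 hour of data, recover within 15 minutes.
rpo_target = timedelta(hours=1)
rto_target = timedelta(minutes=15)

print(meets_rpo(timedelta(hours=24), rpo_target))    # False: daily backups risk 24h of loss
print(meets_rpo(timedelta(minutes=5), rpo_target))   # True
print(meets_rto(timedelta(minutes=45), rto_target))  # False: restore takes too long
```

The point of the sketch is that RPO constrains how often data is captured, while RTO constrains how long the restore may take.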
High availability is supported in AWS by using Multi-AZ deployments. Multi-AZ means placing resources across at least two Availability Zones within a Region. For example, an RDS database can be configured with a primary instance in one zone and a standby replica in another. If the primary’s zone fails, RDS automatically fails over to the standby, keeping the application running. Multi-AZ is one of the simplest and most effective ways to improve availability, and for the exam, remember that it is a built-in redundancy strategy.
The AWS Backup service provides centralized backup management across multiple AWS services. Instead of creating manual backups for each service, AWS Backup automates and enforces policies for resources like RDS, EFS, DynamoDB, and EC2. For example, an administrator can create a daily backup policy that applies to all databases in an account. AWS Backup simplifies compliance and reduces human error. For the exam, remember that AWS Backup is the managed service for automated, centralized backups across AWS workloads.
S3 provides built-in resilience features through versioning and MFA Delete. Versioning allows multiple versions of an object to be kept, so if a file is accidentally deleted or overwritten, older copies can be restored. MFA Delete adds an extra layer of protection by requiring multi-factor authentication before permanent deletions. These features make S3 data far more resilient to human error or malicious activity. For exam preparation, know that versioning and MFA Delete are tools to safeguard objects in S3.
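The idea behind versioning can be illustrated with a toy in-memory model in Python. This is not the S3 API; the class and method names are hypothetical. It only shows why keeping every version, and recording a delete as a marker rather than erasing data, makes accidental changes reversible.

```python
class VersionedBucket:
    """Toy model of a versioned object store (not the real S3 API)."""

    def __init__(self):
        # key -> list of versions, oldest first; None stands for a delete marker
        self._versions = {}

    def put(self, key, data):
        self._versions.setdefault(key, []).append(data)

    def delete(self, key):
        # A delete adds a marker on top; older versions are kept.
        self._versions.setdefault(key, []).append(None)

    def get(self, key):
        versions = self._versions.get(key, [])
        if not versions or versions[-1] is None:
            raise KeyError(key)
        return versions[-1]

    def restore_previous(self, key):
        """Drop the newest entry (an overwrite or a delete marker)."""
        versions = self._versions.get(key, [])
        if len(versions) < 2:
            raise KeyError(key)
        versions.pop()
        return self.get(key)


bucket = VersionedBucket()
bucket.put("report.txt", "v1")
bucket.put("report.txt", "v2")   # overwrite
bucket.delete("report.txt")      # accidental delete
print(bucket.restore_previous("report.txt"))  # v2
```

In this model, MFA Delete would correspond to requiring an extra authentication step before any version is permanently removed from the list.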
Cross-Region replication is another S3 feature that supports disaster recovery. Once it is configured between versioning-enabled buckets, it automatically copies new objects from one bucket to another in a different Region. This keeps a copy of the data available even if an entire Region suffers an outage. For example, a business in North America might replicate backups to Europe for geographic redundancy. For the exam, remember that cross-Region replication supports durability, compliance, and recovery by spreading data geographically.
EBS snapshots provide backup capabilities for block storage volumes attached to EC2 instances. Snapshots are incremental, meaning only changes since the last backup are saved, which reduces cost and time. Snapshots can also be used to create new volumes or copied across Regions. Lifecycle policies can automate snapshot management, ensuring backups occur regularly. For the exam, remember that EBS snapshots are a core backup tool for EC2 instances.
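What "incremental" means can be sketched with a toy Python model. It treats a volume as a dictionary of numbered blocks and is not how EBS is implemented internally; the function names are hypothetical. Each snapshot stores only the blocks that changed since the previous one, and a restore layers the snapshots in order.

```python
def take_snapshot(volume: dict, previous_state: dict) -> dict:
    """Return only the blocks that differ from the previously captured state."""
    return {block: data for block, data in volume.items()
            if previous_state.get(block) != data}


def restore(snapshots: list) -> dict:
    """Rebuild a volume by applying snapshots from oldest to newest."""
    state = {}
    for snapshot in snapshots:
        state.update(snapshot)
    return state


volume = {0: "boot", 1: "config-v1", 2: "data"}
snap1 = take_snapshot(volume, {})                 # first snapshot: every block

volume[1] = "config-v2"                           # one block changes
snap2 = take_snapshot(volume, restore([snap1]))   # second snapshot: changes only

print(len(snap1), len(snap2))              # 3 1
print(restore([snap1, snap2]) == volume)   # True
```

The second snapshot holds one block instead of three, which is where the savings in cost and time come from.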
RDS databases include automated backups and manual snapshots. Automated backups are retained for a configurable number of days, up to 35, and support point-in-time recovery within that window. Manual snapshots are kept until you delete them, so they can be retained indefinitely. For example, before applying an update, an administrator might create a snapshot to ensure rollback capability. On the exam, expect questions about how RDS backups support recovery, as distinct from Multi-AZ, which addresses high availability.
DynamoDB also supports backups, including point-in-time recovery, or PITR. PITR allows restoring a DynamoDB table to any point in the past 35 days. This is valuable for recovering from accidental deletions or corruption. On the exam, remember that PITR provides fine-grained recovery capability for DynamoDB workloads, ensuring business continuity even when data issues occur.
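The 35-day window can be expressed as a simple check. The Python sketch below is a simplification with hypothetical names: it assumes PITR has been enabled for the whole period and ignores the short delay before the most recent changes become restorable.

```python
from datetime import datetime, timedelta, timezone

PITR_WINDOW = timedelta(days=35)  # recovery window described above


def can_restore_to(target: datetime, now: datetime) -> bool:
    """True if the target time falls inside the recovery window ending at 'now'."""
    return now - PITR_WINDOW <= target <= now


now = datetime(2024, 6, 30, tzinfo=timezone.utc)
print(can_restore_to(now - timedelta(days=10), now))  # True
print(can_restore_to(now - timedelta(days=40), now))  # False: outside the window
```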
Route 53 health checks and failover routing add resilience at the DNS level. Health checks monitor endpoints, and if one becomes unavailable, traffic is rerouted automatically to a healthy endpoint. For example, a website may use failover routing between servers in two Regions, ensuring users always reach an operational site. For the exam, remember that Route 53 provides DNS-level failover, making it a key tool in disaster recovery strategies.
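The failover decision itself is simple, as this Python sketch suggests. It is a toy model rather than the Route 53 API, and the endpoint names are hypothetical. Raising an error when both endpoints are down is a choice made for this sketch only.

```python
from dataclasses import dataclass


@dataclass
class Endpoint:
    name: str
    healthy: bool  # in practice this would come from a health check


def resolve(primary: Endpoint, secondary: Endpoint) -> str:
    """Send traffic to the primary while it is healthy, otherwise fail over."""
    if primary.healthy:
        return primary.name
    if secondary.healthy:
        return secondary.name
    raise RuntimeError("no healthy endpoint available")


primary = Endpoint("site-region-a.example.com", healthy=True)
secondary = Endpoint("site-region-b.example.com", healthy=True)

print(resolve(primary, secondary))  # site-region-a.example.com
primary.healthy = False             # health check starts failing
print(resolve(primary, secondary))  # site-region-b.example.com
```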
The pilot light disaster recovery pattern involves maintaining a minimal version of a workload in a secondary Region. In normal operation, only essential components, such as a replicated database, run in the secondary Region. In a disaster, the remaining services are started and scaled up quickly. This keeps costs low while providing a foundation for recovery. On the exam, remember that pilot light balances readiness and cost efficiency.
The warm standby pattern keeps a scaled-down but fully functional version of a system running in a secondary Region. Unlike pilot light, warm standby has more services already active, so recovery is faster. When needed, the standby environment is scaled up to production capacity. This pattern is more costly but provides quicker recovery. For the exam, recognize warm standby as an intermediate option between pilot light and full multi-site setups.
The most robust disaster recovery pattern is multi-site active-active. In this setup, workloads run simultaneously in multiple Regions, with traffic routed between them. If one Region fails, the other continues without interruption. This provides the lowest RTO and RPO but is also the most expensive. For example, a global e-commerce site might run active-active across continents. On the exam, remember that multi-site active-active delivers maximum resilience at higher cost.
Choosing the right disaster recovery strategy depends heavily on RTO and RPO targets. If an organization requires near-zero downtime and cannot tolerate any data loss, then multi-site active-active is the best choice, despite its high cost. If the business can afford hours of downtime and some data loss, then less expensive approaches like pilot light or warm standby may be sufficient. Matching strategy to requirements ensures resources are not overinvested for workloads that don’t need extreme resilience. On the exam, remember that RTO and RPO determine which DR approach is appropriate.
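One way to picture this matching process is as a lookup that picks the least expensive pattern able to meet both targets. In the Python sketch below, the recovery times attached to each pattern are illustrative assumptions, not AWS figures, and the function name is hypothetical.

```python
from datetime import timedelta

# (pattern, achievable RTO, achievable RPO), ordered from cheapest to most
# expensive. The timings are assumed values for illustration only.
PATTERNS = [
    ("pilot light", timedelta(hours=1), timedelta(minutes=15)),
    ("warm standby", timedelta(minutes=10), timedelta(minutes=1)),
    ("multi-site active-active", timedelta(0), timedelta(0)),
]


def cheapest_pattern(rto: timedelta, rpo: timedelta) -> str:
    """Return the first (cheapest) pattern that satisfies both targets."""
    for name, achievable_rto, achievable_rpo in PATTERNS:
        if achievable_rto <= rto and achievable_rpo <= rpo:
            return name
    raise ValueError("targets must be non-negative durations")


print(cheapest_pattern(timedelta(hours=4), timedelta(hours=1)))       # pilot light
print(cheapest_pattern(timedelta(minutes=15), timedelta(minutes=5)))  # warm standby
print(cheapest_pattern(timedelta(seconds=30), timedelta(0)))          # multi-site active-active
```

Ordering the list by cost is what prevents overinvesting: a stricter pattern is chosen only when a cheaper one cannot meet the targets.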
Testing disaster recovery plans is just as important as creating them. AWS recommends running regular “game days,” which are practice exercises where teams simulate outages and test recovery procedures. These tests reveal weaknesses in the plan and ensure staff are familiar with their roles during real incidents. Without testing, even the best-designed plan may fail when put into action. For the exam, know that testing DR plans is a best practice to ensure recovery strategies work as expected.
Data protection is strengthened with encryption. AWS Key Management Service, or KMS, allows customers to encrypt backups, snapshots, and data stored in S3, RDS, or DynamoDB. Encrypting backups ensures that even if copies are intercepted or compromised, they cannot be read without the proper keys. For example, a healthcare company must encrypt patient data to meet regulatory requirements. On the exam, remember that KMS is the central tool for securing backup and recovery data.
IAM plays a role in backup operations by enforcing least privilege. Administrators should grant backup permissions only to specific roles or users who require them. For example, a backup operator role might be allowed to create snapshots but not delete them. Limiting permissions reduces the chance of accidental or malicious activity. On the exam, remember that IAM and least privilege are essential parts of securing backups and recovery operations.
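The backup operator example might look like the following policy document, built here as a Python dictionary and printed as JSON. This is a minimal sketch: the statement IDs are hypothetical, the wildcard resource is an assumption made for brevity, and a real policy would usually be scoped to specific volumes or tags.

```python
import json

# Hypothetical policy for a backup operator role: may create and list
# EBS snapshots, but is explicitly denied the ability to delete them.
backup_operator_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowSnapshotCreation",
            "Effect": "Allow",
            "Action": [
                "ec2:CreateSnapshot",
                "ec2:DescribeSnapshots",
                "ec2:DescribeVolumes",
            ],
            "Resource": "*",  # assumption: narrowed in a real deployment
        },
        {
            "Sid": "DenySnapshotDeletion",
            "Effect": "Deny",
            "Action": "ec2:DeleteSnapshot",
            "Resource": "*",
        },
    ],
}

print(json.dumps(backup_operator_policy, indent=2))
```

The explicit deny is a design choice: in IAM, an explicit deny overrides any allow granted elsewhere, so the role cannot gain delete rights through another attached policy.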
Stateless design helps applications recover faster. A stateless application does not store session data or state information on individual servers, meaning instances can be replaced quickly without disrupting users. For example, storing session information in DynamoDB or ElastiCache allows EC2 instances to be disposable. This design supports resilience by making workloads easier to scale and recover. On the exam, know that stateless applications are faster to recover and more resilient than stateful ones.
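A small Python sketch shows why this works. The in-memory `SessionStore` stands in for an external store such as DynamoDB or ElastiCache; both class names are hypothetical. Because the server object keeps no session data of its own, it can be thrown away and replaced without the user losing anything.

```python
class SessionStore:
    """Stand-in for an external session store shared by all servers."""

    def __init__(self):
        self._data = {}

    def get(self, session_id):
        return self._data.get(session_id, {})

    def put(self, session_id, session):
        self._data[session_id] = session


class WebServer:
    """Stateless server: all session state lives in the shared store."""

    def __init__(self, store):
        self.store = store

    def add_to_cart(self, session_id, item):
        session = self.store.get(session_id)
        cart = session.get("cart", []) + [item]
        self.store.put(session_id, {**session, "cart": cart})
        return cart


store = SessionStore()
server_a = WebServer(store)
server_a.add_to_cart("session-1", "book")

del server_a                  # the instance fails or is terminated
server_b = WebServer(store)   # a replacement instance takes over
print(server_b.add_to_cart("session-1", "pen"))  # ['book', 'pen']
```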
AWS Application Migration Service helps organizations move workloads by continuously replicating servers into AWS. It is built for rehosting scenarios, also called “lift and shift.” The same replication approach is useful for disaster recovery, although the AWS service dedicated to that purpose is AWS Elastic Disaster Recovery. By keeping up-to-date replicas of workloads in AWS, customers can launch replacement systems quickly if on-premises servers fail. For the exam, remember that Application Migration Service replicates servers into AWS for migration, and that replication of this kind also underpins recovery.
Auto Scaling supports resilience by ensuring recovery capacity is available when needed. If a failure occurs and workloads must be shifted, Auto Scaling can automatically launch new resources to meet demand. This prevents bottlenecks and ensures user experience remains consistent. For example, a pilot light system may rely on Auto Scaling to rapidly grow capacity after a failover event. On the exam, remember that Auto Scaling provides the elasticity needed for recovery.
Cross-account backup copies are another best practice. By storing backups in a separate AWS account, organizations protect them from accidental deletion or compromise in the primary account. This separation provides an extra layer of resilience. For example, if a malicious actor gains access to production, cross-account backups ensure data can still be restored. On the exam, know that cross-account copies strengthen resilience and disaster recovery.
Cost considerations are always part of disaster recovery planning. Active-active designs provide the best recovery but are the most expensive. Pilot light or warm standby offer lower costs but slower recovery. Organizations must balance business needs against budget constraints. For example, a social media site may justify high costs for active-active, while a small internal HR application may not. On the exam, remember that DR strategies represent trade-offs between cost, RTO, and RPO.
Monitoring and alerting are key to maintaining resilience. CloudWatch provides metrics and alarms for system health and performance, CloudTrail records API activity for auditing, and GuardDuty detects security threats. Automated alerts ensure teams are notified promptly when problems arise, enabling quick response. For example, a Route 53 health check can feed a CloudWatch alarm that notifies the team when an endpoint fails. On the exam, remember that proactive monitoring supports resilience by catching issues before they escalate.
Common pitfalls in disaster recovery planning include failing to test backups, overlooking data transfer costs, and not updating plans as workloads evolve. Organizations may also underestimate the time needed to recover systems, leading to unrealistic RTOs. Avoiding these pitfalls requires discipline, regular reviews, and continuous improvement. For the exam, know that poor planning or lack of testing often causes DR failures.
Compliance is a major driver for disaster recovery. Industries like finance, healthcare, and government often require documented recovery strategies, encrypted backups, and evidence of testing. AWS provides tools to meet these requirements, but customers must implement them properly. For example, HIPAA compliance demands both data protection and documented recovery processes. On the exam, remember that compliance is a shared responsibility, with AWS providing secure infrastructure and customers ensuring workload-level recovery plans.
Resilience principles will appear on the Certified Cloud Practitioner exam in multiple ways. You may be asked which service provides point-in-time recovery for DynamoDB, what Multi-AZ offers for RDS, or which disaster recovery pattern is most cost-effective for certain requirements. The exam focuses on recognizing services, strategies, and trade-offs rather than technical details. Knowing these fundamentals ensures you can answer questions with confidence.
As we close this episode, remember that resilience, backup, and disaster recovery are not optional—they are essential for business continuity. AWS provides services like Backup, snapshots, replication, and monitoring to simplify the process, but customers must design strategies that align with business goals. Whether choosing pilot light, warm standby, or active-active, the key is to balance cost, recovery objectives, and compliance needs. For the exam, focus on the principles of resilience and the tools AWS offers. In practice, these strategies protect organizations from disruption and ensure they remain available to customers when it matters most.
