Strategic Product Management: How to Avoid Application Failures in the Cloud: Part 3

This is the third in a series of five blog posts that examine how you can build cloud applications that are secure, scalable, and resilient to failures, whether those failures occur in the application components or in the underlying cloud infrastructure itself. In this post we will look at disaster recovery.

Disaster Recovery

While security groups, elastic load balancing, and auto scaling are important for making your application secure, scalable, and reliable, these features alone do not protect you against an outage that affects a whole data center¹, like those experienced by Amazon in Virginia and Ireland. To do that, you also need disaster recovery protection. But before we look at disaster recovery solutions for Amazon’s EC2 cloud, we first need to discuss how EC2 is segmented into Regions and Availability Zones, and the relationship between the two.

Amazon EC2 is divided into geographical Regions (U.S. West, U.S. East, EU, Asia Pacific, and so on) that allow you to deploy your application in a location that is best suited for a given customer base or regulatory environment.

Each region is divided into Availability Zones, which are defined by Amazon as “distinct locations that are engineered to be insulated from failures in other Availability Zones and provide inexpensive, low latency network connectivity to other Availability Zones in the same Region. By launching instances in separate Availability Zones, you can protect your applications from failure of a single location.” Additionally, Amazon states that “…each Availability Zone runs on its own physically distinct, independent infrastructure, and is engineered to be highly reliable. Common points of failures like generators and cooling equipment are not shared across Availability Zones. Additionally, they are physically separate, such that even extremely uncommon disasters such as fires, tornados or flooding would only affect a single Availability Zone.”

This design allows applications running on Amazon EC2 to survive a complete failure of a data center in one Availability Zone by recovering in another Availability Zone. The key EC2 features behind such a recovery solution are Elastic IP addresses and Elastic Block Store snapshots and replication.

Elastic IP Addresses

Elastic IP addresses are actually static IP addresses that are specifically designed for the dynamic nature of cloud computing. Similar to a traditional static IP address, they can be mapped to an application instance or to an ELB instance to provide a fixed address through which users can connect to your application. However, unlike traditional static IP addresses, you can programmatically reassign an elastic IP address to a different target instance if the original instance fails. The new target instance can even reside in a different Availability Zone within the same Region, thereby allowing your application to fail over to a new Availability Zone in the event of a complete Availability Zone outage.
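As a rough illustration of that reassignment, here is a minimal sketch that assumes the boto3 Python SDK and an Elastic IP identified by an allocation ID. The function name and the IDs are hypothetical, and the health check that would decide when to call it is left out.

```python
def fail_over_elastic_ip(ec2, allocation_id, standby_instance_id):
    """Point an Elastic IP at a standby instance and return the association ID.

    `ec2` is assumed to behave like a boto3 EC2 client, for example
    boto3.client("ec2", region_name="us-east-1").
    """
    response = ec2.associate_address(
        AllocationId=allocation_id,
        InstanceId=standby_instance_id,
        # Take the address over even if it is still mapped to the failed instance.
        AllowReassociation=True,
    )
    return response.get("AssociationId")
```

The standby instance can be in any Availability Zone of the Region that owns the address, which is what makes this call usable as the last step of a failover.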

Amazon EC2 Elastic Block Store (EBS)

The Elastic Block Store (EBS) is a block-level storage system designed for use with Amazon EC2 instances. EBS volumes are automatically replicated within a given Availability Zone to ensure reliability. You can also create EBS snapshots, which are incremental backups of a volume. Snapshots are stored in Amazon S3 rather than in a single Availability Zone, so a snapshot can be used to create a new volume in a different Availability Zone of the same Region. EBS snapshots therefore provide a simple mechanism for replicating and synchronizing data across Availability Zones, which is a requirement for any enterprise-caliber disaster recovery solution.
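To make the snapshot-based replication concrete, the sketch below creates a new volume in a standby Availability Zone from the most recent completed snapshot of a source volume. It assumes a boto3-style EC2 client; the function name and IDs are hypothetical, and pagination of the snapshot listing is ignored for brevity.

```python
def restore_volume_in_zone(ec2, volume_id, target_zone):
    """Create a volume in target_zone from the newest completed snapshot.

    `ec2` is assumed to behave like a boto3 EC2 client.
    Returns the ID of the newly created volume.
    """
    snapshots = ec2.describe_snapshots(
        OwnerIds=["self"],
        Filters=[
            {"Name": "volume-id", "Values": [volume_id]},
            {"Name": "status", "Values": ["completed"]},
        ],
    )["Snapshots"]
    if not snapshots:
        raise RuntimeError(f"no completed snapshot found for {volume_id}")

    # StartTime is a datetime, so max() picks the most recent snapshot.
    latest = max(snapshots, key=lambda s: s["StartTime"])
    new_volume = ec2.create_volume(
        SnapshotId=latest["SnapshotId"],
        AvailabilityZone=target_zone,
    )
    return new_volume["VolumeId"]
```

The new volume would still need to be attached to a standby instance in the target zone before the application can use it.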

How often you take EBS snapshots depends on the nature of your data and on how much data loss you can tolerate after a failover. If your data changes frequently and you need your replicated data to be as current as possible, you will need more frequent snapshots. However, if your data is relatively static, or you can live with a failover that uses data that might be a bit stale (e.g. 30 minutes or an hour old), your EBS snapshots can be less frequent.
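One way to express that trade-off in code is to take a new snapshot only when the last one is older than the staleness you are willing to accept. The sketch below assumes a boto3-style EC2 client and a caller that tracks when the last snapshot was taken; the function and parameter names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def snapshot_if_due(ec2, volume_id, last_snapshot_at, max_staleness, now=None):
    """Create a snapshot if the last one is older than max_staleness.

    Returns the new snapshot ID, or None if no snapshot was needed.
    `ec2` is assumed to behave like a boto3 EC2 client.
    """
    now = now or datetime.now(timezone.utc)
    if now - last_snapshot_at < max_staleness:
        return None  # the existing snapshot is still fresh enough
    response = ec2.create_snapshot(
        VolumeId=volume_id,
        Description=f"DR snapshot of {volume_id} at {now.isoformat()}",
    )
    return response["SnapshotId"]
```

Run from a scheduler every few minutes with max_staleness=timedelta(minutes=30), this would keep the replicated data within roughly the 30-minute window mentioned above, not counting the time the snapshot itself takes to complete.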

The combination of an elastic IP address and Elastic Block Store snapshots to support a disaster recovery solution is illustrated in Figure 3.

Figure 3 - Disaster Recovery Using Elastic IP Address and EBS Snapshots

[1] You can use the Elastic Load Balancing functionality to load balance across application instances that reside in different Amazon Availability Zones. While this can protect against the complete failure of an Availability Zone or data center, it introduces more complexity, such as real-time database synchronization across geographically distributed databases. If your application doesn’t require all application instances to use a consistent data set, load balancing across Availability Zones might be a better option than a full disaster recovery solution. However, if you do require all application instances to use the same consistent data set, it might be simpler to restrict your application to a single Availability Zone with a single data set and use a disaster recovery solution to protect against the complete failure of that Availability Zone.

Friday, November 16, 2012
