Tuesday, November 20, 2012

How to Avoid Application Failures in the Cloud: Part 4

This is the fourth in a series of five blog posts that examine how you can build cloud applications that are secure, scalable, and resilient to failures - whether the failures are in the application components or in the underlying cloud infrastructure itself. In this post we will look at application monitoring.

Monitoring


A key component of any successful application deployment — whether in the cloud or on-premise — is the ability to know what is happening with your application at all times. This means monitoring the health of the application and being alerted when something goes wrong, preferably before it becomes noticeable to the application users. For on-premise applications, a wealth of solutions is available, such as HP’s Application Performance Management and Business Availability Center products. Most cloud infrastructure providers offer similar capabilities for your applications in the cloud. On Amazon EC2, application monitoring is provided by CloudWatch.

CloudWatch provides visibility into the state of your application running in the Amazon cloud and provides the tools necessary to quickly — and, in many cases, automatically — correct problems by launching new application instances or taking other corrective actions, such as gracefully handling component failures with minimal user disruption.

CloudWatch allows you to monitor your application instances using pre-defined and user-defined alarms. If an alarm threshold is breached for a specified period of time (such as more than three monitoring periods), CloudWatch will trigger an alert. The alert can be a notification, such as an email or SMS text message sent to a system administrator, or it can trigger an automatic action to rectify the problem. For example, the alert might prompt the EC2 auto-scaling feature to start new application instances, or run a script that changes configuration settings (e.g. remapping an elastic IP address to another application instance).
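As an illustration of the kind of threshold evaluation described above, here is a hand-rolled sketch of the logic. This is our own simplified model, not the CloudWatch API; the metric values and names are invented:

```python
def alarm_breached(datapoints, threshold, periods):
    """Return True if the metric exceeded `threshold` for the last
    `periods` consecutive monitoring periods (oldest sample first)."""
    if len(datapoints) < periods:
        return False
    return all(value > threshold for value in datapoints[-periods:])

# CPU utilization samples, one per 5-minute monitoring period:
cpu = [42.0, 55.0, 81.5, 84.0, 90.2]

# Alert only after the threshold is breached for 3 consecutive periods,
# as in the example above.
if alarm_breached(cpu, threshold=80.0, periods=3):
    print("ALARM: notify the administrator or trigger auto scaling")
```

Requiring several consecutive breaches, rather than alerting on a single sample, is what keeps a short spike from paging anyone at 3 a.m.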


In the final post we'll look at a real-life example of how all of the features that I've described over the first four posts in the series are used to create a secure, scalable and resilient service offering.

Friday, November 16, 2012

How to Avoid Application Failures in the Cloud: Part 3

This is the third in a series of five blog posts that examine how you can build cloud applications that are secure, scalable, and resilient to failures - whether the failures are in the application components or in the underlying cloud infrastructure itself. In this post we will look at disaster recovery.

Disaster Recovery


While security groups, elastic load balancing, and auto scaling are important for making your application secure, scalable, and reliable, these features alone do not protect you against an outage that affects a whole data center [1], like those experienced by Amazon in Virginia and Ireland. To do that, you also need disaster recovery protection. But before we look at disaster recovery solutions for Amazon’s EC2 cloud, we first need to discuss how EC2 is segmented into Regions and Availability Zones, and the relationship between the two.

Amazon EC2 is divided into geographical Regions (U.S. West, U.S. East, EU, Asia Pacific, and so on) that allow you to deploy your application in a location that is best suited for a given customer base or regulatory environment. 

Each region is divided into Availability Zones, which are defined by Amazon as “distinct locations that are engineered to be insulated from failures in other Availability Zones and provide inexpensive, low latency network connectivity to other Availability Zones in the same Region. By launching instances in separate Availability Zones, you can protect your applications from failure of a single location.” Additionally, Amazon states that “…each Availability Zone runs on its own physically distinct, independent infrastructure, and is engineered to be highly reliable. Common points of failures like generators and cooling equipment are not shared across Availability Zones. Additionally, they are physically separate, such that even extremely uncommon disasters such as fires, tornados or flooding would only affect a single Availability Zone.”

This design enables an application running in Amazon EC2 to survive the complete failure of a data center in one Availability Zone by recovering in another Availability Zone. The key functionality behind an Amazon EC2 recovery solution includes elastic IP addresses and Elastic Block Store snapshots and replication.

Elastic IP Addresses


Elastic IP addresses are actually static IP addresses that are specifically designed for the dynamic nature of cloud computing. Similar to a traditional static IP address, they can be mapped to an application instance or to an ELB instance to provide a fixed address through which users can connect to your application. However, unlike traditional static IP addresses, you can programmatically reassign an elastic IP address to a different target instance if the original instance fails. The new target instance can even reside in a different Amazon Availability Zone, thereby allowing your application to fail over to a new Availability Zone in the event of a complete Availability Zone outage.
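The remapping step can be scripted against the EC2 API. Below is a minimal sketch of the failover decision only; the instance IDs are invented, and the actual API call to reassign the address is omitted:

```python
def choose_eip_target(primary, standby, healthy):
    """Decide which instance an elastic IP address should point at.

    `healthy` is the set of instance IDs currently passing health checks;
    the standby may live in a different Availability Zone.
    """
    if primary in healthy:
        return primary
    if standby in healthy:
        return standby
    raise RuntimeError("no healthy instance to remap the elastic IP to")

# The primary in one Availability Zone fails; the elastic IP is remapped
# to the standby instance in another Availability Zone.
target = choose_eip_target("i-primary", "i-standby", healthy={"i-standby"})
print(target)  # i-standby
```

In a real deployment this decision would be driven by monitoring (see Part 4), and the return value would feed the API request that re-associates the address.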

Amazon EC2 Elastic Block Store (EBS)


The Elastic Block Store (EBS) is a block-level storage system designed for use with Amazon EC2 instances. EBS volumes are automatically replicated within a given Availability Zone to ensure reliability. You can also create EBS snapshots, or incremental backups, which can be stored in a different Availability Zone. EBS snapshots provide a simple mechanism for replicating and synchronizing data across different Availability Zones — a requirement for any enterprise-caliber disaster recovery solution.

How frequently you take EBS snapshots will depend on the nature of your data and the recovery point you want to provide for a failover. If your data changes frequently and you need your replicated data to be as current as possible, you will need frequent snapshots. However, if your data is relatively static, or you can live with a failover that uses data that might be a bit stale (e.g. 30 minutes or an hour old), your EBS snapshots can be less frequent.
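A quick way to think about the trade-off: pick the maximum data staleness you can tolerate on failover, and derive the snapshot schedule from it. A back-of-the-envelope helper (our own arithmetic, not an AWS tool, and it ignores snapshot transfer time):

```python
def snapshots_per_day(max_staleness_minutes):
    """How many snapshots per day are needed so that restored data is
    never older than `max_staleness_minutes`."""
    minutes_per_day = 24 * 60
    # Round up, so the interval never exceeds the staleness target.
    return -(-minutes_per_day // max_staleness_minutes)

print(snapshots_per_day(30))   # tolerate 30 minutes of staleness -> 48/day
print(snapshots_per_day(60))   # tolerate an hour of staleness    -> 24/day
```

Because EBS snapshots are incremental, a more aggressive schedule mainly costs you in the number of operations, not in storing full copies each time.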

The combination of an elastic IP address and Elastic Block Store snapshots to support a disaster recovery solution is illustrated in Figure 3.


Figure 3 - Disaster Recovery Using Elastic IP Address and EBS Snapshots


[1] You can use the Elastic Load Balancing functionality to load balance across application instances that reside in different Amazon Availability Zones. While this can protect against the complete failure of an Availability Zone or data center, it introduces additional complexity, such as real-time database synchronization across geographically distributed databases. If your application doesn’t require all application instances to be using a consistent data set, load balancing across Availability Zones might be a better option than a full disaster recovery solution. However, if you do require all application instances to be using the same consistent data set, it might be simpler to restrict your application to a single Availability Zone with a single data set and utilize a disaster recovery solution to protect against the complete failure of that Availability Zone.

Tuesday, November 13, 2012

How to Avoid Application Failures in the Cloud: Part 2

This is the second in a series of five blog posts that examine how you can build cloud applications that are secure, scalable, and resilient to failures - whether the failures are in the application components or in the underlying cloud infrastructure itself. In the first post, we looked at securing applications. In this post we will look at scalability and availability.

Scalability and Availability


In today’s multi-tiered application architectures, clustering and load-balancing [1] capabilities mean that scalability and availability often go hand-in-hand.

When applications are located on premise, you can configure load-balancing routers to spread connections and inbound traffic across multiple instances of an application in a cluster, providing better response times for users. Load balancing can also provide increased application availability, because the application is less susceptible to the failure of a single application instance. If one does fail, the load balancer can distribute the load over the remaining healthy instances in the cluster. Of course, some sessions or transactions might fail or be rolled back, but the application generally continues to operate unaffected by the instance failure.

Amazon EC2 Elastic Load Balancing (ELB)


Although you don’t have control of the hardware (e.g. routers) used in the Amazon EC2 cloud, you can still implement load balancing strategies for your applications using the Amazon Elastic Load Balancing (ELB) feature [2]. ELB allows you to load balance incoming traffic over a specified number of application instances, with automatic health-checking of each of the application instances. If an instance fails the health check, ELB will stop sending traffic to it.
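The behavior is easy to picture as round-robin distribution over only the instances that currently pass their health checks. A simplified sketch (ELB's real routing is more sophisticated, and the names here are illustrative):

```python
from itertools import cycle

def balance(requests, instances, healthy):
    """Round-robin requests across instances, skipping any instance
    that is currently failing its health check."""
    pool = [i for i in instances if i in healthy]
    if not pool:
        raise RuntimeError("no healthy instances behind the load balancer")
    rr = cycle(pool)
    return {req: next(rr) for req in requests}

instances = ["i-a", "i-b", "i-c"]
# i-b fails its health check, so traffic only reaches i-a and i-c.
routing = balance(["req1", "req2", "req3", "req4"], instances,
                  healthy={"i-a", "i-c"})
print(routing)  # {'req1': 'i-a', 'req2': 'i-c', 'req3': 'i-a', 'req4': 'i-c'}
```

The important property is that clients never see the failed instance: requests are simply spread over the remaining healthy pool until the instance recovers.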


Figure 2 – Amazon Elastic Load Balancing

Amazon EC2 Auto Scaling


The Amazon EC2 auto scaling feature can dynamically and automatically scale your applications — up or down — based on demand and other conditions (such as response time), so you only pay for the compute capacity you actually need and use. This is a case where cloud computing provides a clear cost advantage: to dynamically scale your on-premise applications, even with virtualization technologies such as VMware or the Xen hypervisor, you would first need to invest in and maintain excess server capacity to handle peak application demand.

You can define your own Amazon auto-scaling rules to protect your application against slow response times or to ensure that there are enough “healthy” application instances running to guarantee application availability.
  • Availability: You can specify that you always need a minimum of, say, four application instances running to ensure availability to users. The auto-scaling feature will check the health of your application instances to ensure that you have the specified minimum number of instances running. If the number of healthy instances drops below the minimum threshold, the auto-scaling feature will automatically start the required number of instances to restore your application to a healthy state.
  • Response time: You can also specify auto-scaling rules based on application response times. For example, you can define a rule to start a new application instance if the response time of the application exceeds 4 seconds for a 15-minute period. If you are using ELB with your application instances, the newly started instances are added to your load balancing group so they can share the user load with the other healthy instances. 
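The two example rules above can be sketched as a single decision function. This is a simplified model of our own, not the EC2 auto-scaling API; the thresholds come from the examples in the text:

```python
def scaling_decision(healthy_count, min_instances, avg_response_s,
                     response_limit_s, sustained_minutes, window_minutes=15):
    """Apply the two example rules:
    - availability: keep at least `min_instances` healthy instances;
    - response time: add an instance if responses have exceeded the
      limit for the whole monitoring window.
    Returns the number of new instances to start (0 = no action)."""
    if healthy_count < min_instances:
        return min_instances - healthy_count
    if avg_response_s > response_limit_s and sustained_minutes >= window_minutes:
        return 1
    return 0

# Two healthy instances left, minimum of four -> start two more.
print(scaling_decision(2, 4, 1.2, 4.0, 0))
# Responses over 4 seconds for a full 15 minutes -> start one more.
print(scaling_decision(4, 4, 5.1, 4.0, 15))
```

In the real service, the availability rule is driven by the auto-scaling group's health checks and the response-time rule by a CloudWatch alarm; the function above only models the resulting decisions.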

Summary


Given this brief description of load balancing and auto scaling within the Amazon EC2 cloud, you can see how these features can be applied to a multi-tiered application like the one illustrated in Figure 1 to improve scalability and availability. You can imagine that we could use ELB in front of each tier of the application — load balancing across the instances of each security group — and also apply auto-scaling rules to ensure that the application is resilient against an instance failure and can effectively respond to changes in user demand. We will examine a real-life example of combining security groups, load balancing, and auto scaling after we discuss disaster recovery in the next post.

[1] Load balancers provide a host of advanced functionality, including support for sticky user sessions, SSL termination (i.e. handling the SSL processing in the router), and multiple load balancing algorithms.

[2] Amazon ELB capabilities include SSL termination and sticky user sessions, enabling you to implement the same type of load balancing policies as you can with on-premise hardware-based load balancers.

Saturday, November 10, 2012

How to Avoid Application Failures in the Cloud: Part 1

This is the first of a series of five blog posts that examine how you can build cloud applications that are secure, scalable, and resilient to failures - whether the failures are in the application components or in the underlying cloud infrastructure itself.

When people think of “the cloud,” they tend to imagine an amorphous thing that is always there, always on. However, the truth is that the cloud — or, rather, applications running in the cloud — can suffer from failures just like those running on your on-premise systems. This became painfully clear in June 2012, when an electrical storm in the mid-Atlantic region of the United States knocked out power to an Amazon data center in Virginia, resulting in temporary outages to services such as Netflix and Instagram. Similarly, in 2011, a transformer failure in Dublin, Ireland, affected Amazon and Microsoft data centers, bringing down some cloud services for up to two days. And as recently as October 2012, a problem with the storage component of the Amazon EC2 infrastructure caused disruptions for sites including Pinterest, reddit, TMZ, and Heroku.

As these examples show, the cloud itself is not immune to failures. But there are things you can do to protect your applications running in the cloud. In this series of blog posts, we will discuss some of the ways you can make your cloud applications more reliable and less prone to failures.

When looking at improving the resilience and reliability of your applications, you need to consider the following four factors:
  1. Security: Is your application protected against intrusion?
  2. Scalability and Availability: How can you make your application respond effectively to changing demand and, at the same time, protect against component failures?
  3. Disaster Recovery: What happens if, as in the examples above, an entire data center fails?
  4. Monitoring: How do you know when you have problems? And how can you respond quickly enough to prevent outages?
We will look at each of these factors in the context of an application running in the Amazon EC2 cloud infrastructure, as this is the environment in which Axway has the most experience. (Other cloud providers, such as Rackspace, provide similar capabilities.)

Security


Obviously, application security is very important to every organization. Preventing unwanted and unauthorized access to applications and data is critical because the consequences of a security breach, including potential data loss and exposure of confidential information, can be extremely costly in both financial and business terms.

When you are running applications in your own on-premise data center, your IT department can configure and manage security using well-tested methods such as firewalls, DMZs, routers, and secure proxy servers. They can create multi-layered security zones to protect internal applications, with each layer becoming more restrictive in terms of how and by whom it can be accessed. For example, the outer layer might allow access via certain standard ports (e.g. port 80 for HTTP traffic, port 22 for SFTP traffic, port 443 for secure HTTP traffic (SSL), and so on). The next layer might restrict inbound access to certain secure ports and only from servers in the adjacent layer — so, if you have a highly secure inner layer containing your database(s), you can allow access only via port 1521 (the standard port used by Oracle database servers) and only from servers in the application layer.
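The layered model boils down to per-layer checks on the inbound port and source. A toy evaluator, with a rule format invented purely for illustration:

```python
def allowed(layer_rules, port, source):
    """Check an inbound connection against one security layer's rules.
    Each rule is (allowed_port, allowed_sources); "any" matches all sources."""
    return any(port == p and (srcs == "any" or source in srcs)
               for p, srcs in layer_rules)

# Outer layer: standard web ports, open to the world.
outer = [(80, "any"), (443, "any")]
# Inner database layer: Oracle's port 1521, and only from the app tier.
inner = [(1521, {"app-tier"})]

print(allowed(outer, 443, "internet"))   # True
print(allowed(inner, 1521, "app-tier"))  # True
print(allowed(inner, 1521, "internet"))  # False: wrong source for the inner layer
```

Each layer inward accepts fewer ports from fewer sources, which is exactly the pattern the security groups below reproduce in EC2.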

When you move to the cloud, however, you are relying on others (the cloud infrastructure providers) to provide these security capabilities on your behalf. But even though you are outsourcing some of these security functions, you are not powerless when it comes to making your applications more secure and less susceptible to security breaches.

Amazon EC2 Security Groups


Amazon EC2 provides a feature called “security groups” that allows you to recreate the same type of security zone protection and isolation you can achieve with on-premise systems. You can use Amazon EC2 security groups to create a DMZ/firewall-like configuration, even though you don’t have access to or control of the physical routers within the EC2 cloud. This allows you to isolate the different layers of your application stack and guard against unauthorized access and data loss. Based on rules you define to control traffic, security groups provide different levels of protection and isolation within a multi-tier application by acting as a firewall for a specific set of Amazon EC2 instances. (See Figure 1.)

 
Figure 1 - Amazon EC2 Security Groups

In this example, three different security groups are used to isolate and protect the three tiers of the cloud application: the web server tier, the application server tier, and the database server tier.
  • Web server security group: All of the instances of the web server are assigned to the WebServerSG security group, which allows inbound traffic on ports 80 (HTTP) and 443 (HTTPS) only — but from anywhere on the Internet. This makes the web server instances open to anyone who knows their URL, but access is restricted to the standard ports for HTTP and HTTPS traffic. This is typical practice for anyone configuring an on-premise web server. By defining security groups, you can have the same type of configuration in the Amazon EC2 cloud.
  • Application server security group: The AppServerSG security group restricts inbound application server access to those instances in the previously defined WebServerSG security group or to developers using SSH (port 22) from the corporate network. This illustrates a couple of important capabilities of security groups:
    1. You can specify other security groups as a valid source of inbound traffic.
    2. You can restrict inbound access by IP address.
    Specifying other defined security groups as a valid source of inbound traffic means that you can dynamically scale the web server group to meet demand by launching new web server instances — without having to update the application server security group configuration. All instances in the web server security group are automatically allowed access to the application servers based on the application server security group rule. Being able to restrict inbound access by IP address means that you can open ports within the security group, but only allow access by known (and presumably friendly) sources. In our example, we allow access to the application servers via SSH (for updates, etc.) only to developers connecting from the corporate network.
  • Database server security group: The DBServerSG security group is used to control access to the database server instances. Because this tier of the application contains the data, access is more restricted than the other layers. In our example, only the application server instances in the AppServerSG security group can access the database servers. All other access is denied by the security group filters. In addition to restricting access to the instances in the AppServerSG security group, you can also restrict the access to certain ports.  In our case, we’ve restricted access from the application servers so they can use only port 1521, the standard Oracle port.
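The three groups above can be modeled as data to see how the rules compose. This is an illustrative model only: the group names come from the text, but the rule format, the application server port (8080), and the corporate network range (10.0.0.0/8) are assumptions, not the EC2 API:

```python
security_groups = {
    "WebServerSG": [
        {"port": 80,   "source": "0.0.0.0/0"},    # HTTP from anywhere
        {"port": 443,  "source": "0.0.0.0/0"},    # HTTPS from anywhere
    ],
    "AppServerSG": [
        {"port": 8080, "source": "WebServerSG"},  # only the web tier (assumed port)
        {"port": 22,   "source": "10.0.0.0/8"},   # SSH from the corporate network
    ],
    "DBServerSG": [
        {"port": 1521, "source": "AppServerSG"},  # Oracle, app tier only
    ],
}

def inbound_allowed(group, port, source):
    """`source` is either a CIDR label or the name of another security group."""
    return any(r["port"] == port and
               (r["source"] == source or r["source"] == "0.0.0.0/0")
               for r in security_groups[group])

print(inbound_allowed("DBServerSG", 1521, "AppServerSG"))  # True
print(inbound_allowed("DBServerSG", 1521, "WebServerSG"))  # False: web tier blocked
```

Note how naming AppServerSG as the source means a newly launched application server is authorized automatically, with no change to the database group's rules.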

In the next blog post in this series, we'll look at scalability and availability.