One of the things that Amazon recommends for high availability is to be sure that your services span multiple AWS regions. The need for this advice was underlined last week when the AWS Frankfurt data center experienced a failure. While this sort of thing is extremely rare, it does happen.
When your services can fail over to another region, this lessens the impact that these types of failures can have on your organization. However, if all of your services were located in a data center that had a failure, you would experience downtime for those cloud services.
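As a rough illustration of the failover idea, here is a minimal Python sketch that picks the first healthy region from a priority list. The function name `pick_region`, the `is_healthy` callback, and the hard-coded status table are all hypothetical and stand in for whatever health checks your own deployment uses; this is not an AWS API.

```python
from typing import Callable, Optional, Sequence


def pick_region(regions: Sequence[str],
                is_healthy: Callable[[str], bool]) -> Optional[str]:
    """Return the first region, in priority order, whose health check passes."""
    for region in regions:
        try:
            if is_healthy(region):
                return region
        except Exception:
            # A health check that raises is treated as "unhealthy".
            continue
    # No region passed its health check.
    return None


# Hypothetical scenario: the primary region is down, the secondary is up.
status = {"eu-central-1": False, "eu-west-1": True}
active = pick_region(["eu-central-1", "eu-west-1"], lambda r: status[r])
print(active)
```

In this made-up scenario the primary region (Frankfurt, `eu-central-1`) fails its check, so traffic would be directed to the secondary region. If every region fails, the function returns `None` and the caller has to decide what to do next.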
What caused the outage?
In the Frankfurt event, an AWS availability zone was down for roughly three hours after the air circulation systems failed. According to Amazon’s records, the outage began at 13:24 PDT on June 10 and initially caused “connectivity issues for some EC2 instances.” It lasted until 16:33 PDT, when network services were restored. By 17:19 PDT, Amazon had issued an update stating that “environmental conditions within the affected Availability Zone have now returned to normal level.”
In this instance, the downtime was not caused by an actual fire or other major disaster. Instead, it was caused by a “failure of a control system which disabled multiple air handlers in the affected Availability Zone.” In other words, it was a problem with the air conditioners that cool the data center. They stopped working, causing the internal temperatures in the data center to rise to abnormal levels, and the AWS servers shut down as a result. Normally, this is a situation that could have been handled quickly. However, there was an additional complication: hypoxic gas was released into the data center. Hypoxic gas is used instead of water to put out fires in data centers, as water would damage the computing equipment. It removes the oxygen from the air, which made it impossible for technicians to enter the affected areas.
Amazon issued a statement that described the event as follows: “While our operators would normally have been able to restore cooling before impact, a fire suppression system activated inside a section of the affected Availability Zone. When this system activates, the data center is evacuated and sealed, and a chemical is dispersed to remove oxygen from the air to extinguish any fire. In order to recover the impacted instances and network equipment, we needed to wait until the fire department was able to inspect the facility. After the fire department determined that there was no fire in the data center and it was safe to return, the building needed to be re-oxygenated before it was safe for engineers to enter the facility and restore the affected networking gear and servers. The fire suppression system that activated remains disabled. This system is designed to require smoke to activate and should not have discharged. This system will remain inactive until we are able to determine what triggered it improperly.”
Again, this shows the importance of cross-region protection for your essential services, even with cloud providers that typically provide very high availability. If you experience downtime or other abnormalities with your AWS services, you can check their status using the AWS Service Health Dashboard.