
AWS Details Frankfurt Data Center Outage Cause

For high availability, Amazon recommends making sure that your services span multiple AWS regions. The need for this advice was underlined last week when the AWS Frankfurt data center experienced a failure. While this sort of thing is extremely rare, it does happen.

When your services can fail over to another region, failures like this have far less impact on your organization. However, if all of your services were located in the data center that failed, you would experience downtime for those cloud services.
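The failover idea can be illustrated with a minimal Python sketch. It is not an AWS API or Amazon's recommended implementation: the function name, the preference list, and the simulated health check are all hypothetical, and a real deployment would more likely rely on DNS or load-balancer health checks. The region codes are the ones AWS uses for Frankfurt (eu-central-1) and Ireland (eu-west-1).

```python
from typing import Callable, Optional, Sequence


def pick_healthy_region(
    regions: Sequence[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first region whose health check passes, or None.

    `is_healthy` is a caller-supplied (hypothetical) check, for example
    an HTTP probe against your own service endpoint in that region.
    """
    for region in regions:
        try:
            if is_healthy(region):
                return region
        except Exception:
            # A health check that raises is treated as an unhealthy region.
            continue
    return None


# Hypothetical preference order: primary region first, then the fallback.
REGION_PREFERENCE = ["eu-central-1", "eu-west-1"]

if __name__ == "__main__":
    # Simulated status: pretend the primary (Frankfurt) region is down.
    simulated_status = {"eu-central-1": False, "eu-west-1": True}
    chosen = pick_healthy_region(
        REGION_PREFERENCE,
        lambda region: simulated_status.get(region, False),
    )
    print(chosen)  # prints "eu-west-1"
```

Passing the health check in as a function keeps the selection logic separate from however you actually probe your services, so the same routine works with a real probe or a simulated one.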

What caused the outage?

In the Frankfurt event, an AWS availability zone was down for roughly three hours after its air circulation systems failed. According to Amazon’s records, the outage began at 13:24 PDT on June 10 and initially caused “connectivity issues for some EC2 instances.” It lasted until 16:33 PDT, when network services were restored. By 17:19 PDT, Amazon had issued an update stating that “environmental conditions within the affected Availability Zone have now returned to normal level.”

In this instance, the downtime was not caused by an actual fire or other major disaster. Instead, it was caused by a “failure of a control system which disabled multiple air handlers in the affected Availability Zone.” In other words, it was a problem with the air conditioners that cool the data center. They stopped working, causing the internal temperatures in the data center to rise to abnormal levels, and the AWS servers shut down as a result. Normally, this is a situation that could have been handled quickly. However, there was an additional complication: hypoxic gas was released into the data center. Hypoxic gas is used instead of water to put out fires in data centers, because water would damage the computing equipment. The gas removes the oxygen from the air, making it impossible for technicians to enter the affected areas.

Amazon issued a statement that described the event as follows: “While our operators would normally had been able to restore cooling before impact, a fire suppression system activated inside a section of the affected Availability Zone. When this system activates, the data center is evacuated and sealed, and a chemical is dispersed to remove oxygen from the air to extinguish any fire. In order to recover the impacted instances and network equipment, we needed to wait until the fire department was able to inspect the facility. After the fire department determined that there was no fire in the data center and it was safe to return, the building needed to be re-oxygenated before it was safe for engineers to enter the facility and restore the affected networking gear and servers. The fire suppression system that activated remains disabled. This system is designed to require smoke to activate and should not have discharged. This system will remain inactive until we are able to determine what triggered it improperly.”

Again, this shows the importance of cross-region protection for your essential services – even with cloud providers that typically provide very high availability. If you experience downtime or other abnormalities with your AWS services, you can check on their status using the AWS Service Health Dashboard.

Michael Otey is president of TECA, a technical content production, consulting and software development company in Portland,