Starting at 1:18 PM PDT we experienced connectivity issues to some EC2 instances, increased API error rates, and degraded performance for some EBS volumes within a single Availability Zone in the EU-CENTRAL-1 Region.
At 4:26 PM PDT, network connectivity was restored and the majority of affected instances and EBS volumes began to recover.
At 4:33 PM PDT, increased API error rates and latencies had also returned to normal levels. The issue has been resolved and the service is operating normally. The root cause of this issue was a failure of a control system which disabled multiple air handlers in the affected Availability Zone.
These air handlers move cool air to the servers and equipment, and when they were disabled, ambient temperatures began to rise. Servers and networking equipment in the affected Availability Zone began to power off when unsafe temperatures were reached. Unfortunately, because this issue impacted several redundant network switches, a larger number of EC2 instances in this single Availability Zone lost network connectivity.
While our operators would normally have been able to restore cooling before impact, a fire suppression system activated inside a section of the affected Availability Zone. When this system activates, the data center is evacuated and sealed, and a chemical is dispersed to remove oxygen from the air to extinguish any fire. In order to recover the impacted instances and network equipment, we needed to wait until the fire department was able to inspect the facility.
After the fire department determined that there was no fire in the data center and it was safe to return, the building needed to be re-oxygenated before it was safe for engineers to enter the facility and restore the affected networking gear and servers. The fire suppression system that activated remains disabled. This system is designed to require smoke to activate and should not have discharged. This system will remain inactive until we are able to determine what triggered it improperly.
In the meantime, alternate fire suppression measures are being used to protect the data center. Once cooling was restored and the servers and network equipment were re-powered, affected instances recovered quickly. A very small number of instances and volumes that were adversely affected by the increased ambient temperatures and loss of power remain unresolved.
We continue to work to recover those last affected instances and volumes, and have opened notifications for the remaining impacted customers via the Personal Health Dashboard. For immediate recovery of those resources, we recommend replacing any remaining affected instances or volumes if possible.
Gordon, we cannot predict how long the system can operate at this level, nor how long the reading will take. Please work as quickly as you can. Uh...it's probably not a problem...probably...but I'm showing a small discrepancy in...well, no, it's well within acceptable bounds again. Sustaining sequence.
I remember the summer job I had working in the European HQ of a chemical company. I was managing servers and thus had access to the server room, after the stern safety lecture about the Halon suppression system and being informed that "if the alarm goes off, don't even bother going for the respirator on the wall, just bolt out of there double quick".
I have been in this situation. During maintenance on cooling tower 1, the engineer put it into bypass mode so he could work on it safely (in case the BMS decided to turn it on while he was on it). When he left the site he forgot to take it out of bypass mode. Every few days the BMS rotates the cooling towers; it switched off tower B and turned on tower A, which was still in bypass mode, so air was blowing but the water was not circulating. The temperature graphs showed the affected floors of data center space heating up very quickly, and there was a lot of permanently damaged equipment. The 24/7 human operators ignored many warnings about ambient temperature and low humidity, assuming it was a system error ("does it look hot on the cameras?" shrug). Various equipment has different power-down temperatures; the halon did not trigger for us. Some equipment, for whatever reason, did not shut down when it hit the cutoff temps and cooked itself. I suspect the low humidity caused too much static, and that's where the damage came from.
I guess that's something that's hard to test for. "Let's put this switch in an oven and run it to make sure it auto powers down at 80 °C every time we release a firmware update" :)
It was a problem across multiple vendors, who issued BIOS/firmware patches: Cisco, Dell, EMC, Brocade.
>a fire suppression system activated inside a section of the affected Availability Zone. When this system activates, the data center is evacuated and sealed, and a chemical is dispersed to remove oxygen from the air to extinguish any fire.
Sounds like a submarine. On a submarine, the people who couldn't make it out of a sealed section in time would use their personal breathing devices. I wonder if AWS has such a thing in the datacenter, and how they make sure that everyone is out of a huge datacenter. Several years ago a Russian nuclear sub, on sea trials after a refit, had its fire suppression system activate in a section where a bunch of people had gathered (crew and navy yard civilians, probably celebrating with drinking and smoking), and an insufficient number of breathing devices, combined with the civilians' lack of skill in using them quickly, resulted in 20 deaths and 40 injuries (lung damage, etc.), mostly among the civilians.
This is why you have those annoying kinds of rules like "only 8 people allowed in this room". And then of course that gets ignored, because 15 people easily fit inside. All fine until you find out there aren't enough breathing/floating/rescue kits for more than 8 people.
My guess for "how to make sure all people are out": typically in such industrial, high-security facilities, each person going in has to badge at each door, i.e. no holding the door for a group of people. Combined with many locked doors, this lets you know exactly who is in which part of the building.
That, plus you typically don't go and hang out on the DC floor; only the personnel who are really required get to go in.
> My guess for "how to make sure all people are out": typically in such industrial, high-security facilities, each person going in has to badge at each door, i.e. no holding the door for a group of people. Combined with many locked doors, this lets you know exactly who is in which part of the building.
Holding doors was always a problem, but I guess that nowadays badges can have RFID or some other kind of wireless tracking so even if you hold the door, people in the control room can always know how many people are in each room, all the time. It's like 6 years now since the last time I was in a datacenter...
>you typically don't go and hang out on the DC floor; only the personnel who are really required get to go in.
Times changed. I remember a friend, a bunch of years ago, complaining to me that the DC at their branch of a large, well-known transnational was the employees' favorite place for various informal activities; among other things, he was constantly dealing with used drug paraphernalia being left all over the place.
Answer to the comment below: yep, mostly syringes/needles.
Usually there are muster stations where a designated emergency captain takes attendance of everyone who is expected to be present, plus a remote operations center with access to HRIS data to know who is out or on vacation when that attendance is taken.
Door locks during an emergency are usually designed to fail open so someone doesn't get trapped, though that may differ depending on the security level of the facility.
Reminds me of my first trip to a datacenter, where the guy who accompanied us said: "In the event of a fire this room is filled with nitrogen in 20 seconds. But don't worry: nitrogen is not toxic!" Well, I was a little worried :)
What makes things worse are the strictest security measures in the industry; we were always afraid of a fire at this scale, when outside parties have to enter the pods, and of all the checkups afterwards. I don't envy the people who worked that shift and the next one.
> Also they didn't have breathing gear (and trained staff) so you could go in and restart without waiting, and also, in case of an accident, be able to try and rescue people.
At that point, without actually going in and checking, they have no real way of knowing whether there really is/was a fire. So the proper procedure is to let the professionals handle it (wait for the fire department to clear the building). No amount of server downtime is worth sending a "not a firefighter" into a possibly burning building.
And facilities like these have strict control over where people can be, so they know if someone is in there or not without going in to check.
Also, if you follow AWS HA guidelines, this does not lead to a service outage. We were affected by this and it knocked a dozen or two systems offline for 6 hours or so. AZ redundancy took over, and that was it; on-call went back to sleep.
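For anyone wondering what that setup looks like concretely, here's a minimal sketch (boto3; the group name, launch template, and subnet IDs are made-up placeholders, not anything from this incident) of an Auto Scaling group stretched across one subnet per AZ, so losing a zone just shifts capacity to the survivors:

    import boto3

    autoscaling = boto3.client("autoscaling", region_name="eu-central-1")

    # An ASG spanning three AZs: if one AZ fails, the group replaces the
    # lost instances in the remaining two.
    autoscaling.create_auto_scaling_group(
        AutoScalingGroupName="web-fleet",          # hypothetical name
        MinSize=3,
        MaxSize=9,
        DesiredCapacity=3,
        LaunchTemplate={
            "LaunchTemplateName": "web-template",  # hypothetical template
            "Version": "$Latest",
        },
        # One subnet per Availability Zone.
        VPCZoneIdentifier="subnet-aaa111,subnet-bbb222,subnet-ccc333",
    )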
I was about to reply that AWS shouldn't be relied on for safety-critical systems, but someone is probably already doing that.
I'll revise that to: I hope that whoever is relying on AWS for safety-critical systems at least does it across many regions. It's still dumb, because even AWS occasionally has global/multi-region outages, but at least it hopefully reduces the chance of one.
> I was about to reply that AWS shouldn't be relied on for safety-critical systems, but someone is probably already doing that
Wtf, why not? It's drastically easier, and probably cheaper, to achieve that level of redundancy with AWS than doing it yourself.
> It's still dumb, because even AWS occasionally has global/multi-region outages
Really? Like when? The only potential one you can claim was multi-region was when S3 us-east-1 was down: with the old default behaviour, if you didn't specify where your S3 bucket was, requests would pass through us-east-1 to ask where it is, and that impacted lazy code that had nothing to do with us-east-1. That's almost entirely on the developers though, so it's hard to claim it was a multi-region or global outage.
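For what it's worth, the usual fix for that old behaviour was just to pin the client to the bucket's actual region, roughly like this sketch (bucket and key are made up; newer SDK versions resolve endpoints differently):

    import boto3

    # Talk to the regional endpoint directly instead of letting requests
    # route through the default (us-east-1) endpoint to locate the bucket.
    s3 = boto3.client("s3", region_name="eu-central-1")
    s3.get_object(Bucket="my-frankfurt-bucket", Key="some/key")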
AWS uses PDT like this for all of their outage updates, presumably because Amazon HQ is in Washington (Pacific time). I would think UTC would be the preferred "universal" time for communicating these kinds of incidents, because AWS customers span the globe.
We have restored network connectivity within the affected Availability Zone in the EU-CENTRAL-1 Region. The vast majority of affected EC2 instances have now fully recovered but we’re continuing to work through some EBS volumes that continue to experience degraded performance. The environmental conditions within the affected Availability Zone have now returned to normal levels. We will provide further details on the root cause in a subsequent post, but can confirm that there was no fire within the facility.
the two most likely scenarios where "environmental conditions" prevent access to a DC are fire or cooling failure. aws says it's not a fire, so probably the temperature rose above their allowable safety threshold for access.
“The EU ban on the use of Halon in fire extinguishers actually came into force in October 2000 and was implemented in the UK in 2003, as a result of scientific research linking Halon and other CFC’s to Ozone depletion. The ban in practice is not total.
Existing owners and users of Halon 1211 portable fire extinguishers may be able to claim exemption to the EU ban for certain “Critical Uses”.
Broadly speaking this includes limited applications within the aircraft industry, military / armed forces, petrochemical industry and some specific marine applications“
Chances are this facility was built after 2000. I think that makes it extremely unlikely it can claim exemption.
Former controls-system guy here, and I've worked in data centers. I'd be concerned about why a control system failure took down multiple air handlers. Units typically have their own controllers and can be configured to run by themselves without input from a "parent" controller.
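To illustrate the point (purely a sketch, not any real BMS protocol or API): a unit controller can keep cooling on its last local setpoint when the parent goes silent, so a parent failure can't switch every air handler off:

    import time

    HEARTBEAT_TIMEOUT_S = 30
    local_setpoint_c = 22.0    # last setpoint received from the parent/BMS
    remote_disable = False     # parent may request a stop (e.g. maintenance)
    last_heartbeat = time.monotonic()

    def on_parent_command(setpoint_c: float, disable: bool) -> None:
        """Called whenever the parent controller sends a command."""
        global local_setpoint_c, remote_disable, last_heartbeat
        local_setpoint_c = setpoint_c
        remote_disable = disable
        last_heartbeat = time.monotonic()

    def fans_should_run(current_temp_c: float) -> bool:
        parent_alive = (time.monotonic() - last_heartbeat) < HEARTBEAT_TIMEOUT_S
        if parent_alive and remote_disable:
            return False       # honor a disable only while the parent is alive
        # Otherwise run local control on the stored setpoint: a dead or
        # misbehaving parent cannot take the cooling down with it.
        return current_temp_c > local_setpoint_c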
As said in another comment, we had a dozen or two instances affected. Most of the hashi stack just lost a node and chugged along at reduced redundancy. A patroni/postgres cluster lost a replica, but automatically re-integrated it into the cluster. Very nice and smooth.
We mostly found one or two classes of jobs in the orchestration for which Nomad stopped retrying deployments before the EC2 instances running the allocations had fully failed and been removed from the cluster, and our on-call was unsure how to handle that situation in Nomad correctly. Network and routing were really weird at some point. Additionally, we ended up with a couple of container instances orphaned from the container management, which was strange for a moment.
This was made a bit more hectic over here because a second hosting provider apparently fried their own network at the same time, so we needed some time to realize we had two separate issues.
Overall, 5/5 Outage, would fail again once we've updated our jobs. We're happily close to not caring about such an incident.
> Most of the hashi stack just lost a node and chugged along at reduced redundancy.
Amazon had released a version of Amazon Linux that rebooted randomly due to a kernel bug. I was working on our staging cluster at the time and didn't even notice that I had EC2 instances randomly rebooting and dropping out, because Nomad just kept the workload up.
This is true, but there is a consistent identifier across accounts (likely to make it possible to build multi-account architectures without having to measure network latency between zones manually): the AZ ID, as opposed to the AZ Name.
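A quick way to see the mapping for your own account (boto3 sketch):

    import boto3

    # AZ names ("eu-central-1a") are shuffled per account; the ZoneId
    # (e.g. "euc1-az2") refers to the same physical AZ in every account.
    ec2 = boto3.client("ec2", region_name="eu-central-1")
    for az in ec2.describe_availability_zones()["AvailabilityZones"]:
        print(az["ZoneName"], "->", az["ZoneId"])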
We had a few interesting "bugs" appear: mostly, logging went down but EC2 machines kept running. E.g. we run a lot in ECS on EC2 machines, like a RabbitMQ cluster with 20 instances spread across all AZs. None of the machines died, and none of the containers had degraded performance as far as I can tell, but the performance logs just disappear from 21:50 to 22:30 GMT+2. It's just blank in the Metrics and CloudWatch dashboards. Same with other CloudWatch logs.
We also have some EC2s running Node.js applications, and there the aws-sdk just errored out with "UnknownError: 503" and simply stopped logging until we restarted the machines. The machines themselves were not stopped at all.
Other than that, I can't see any effects across our accounts, not in RDS or anything else. Fascinating. Glad it's under control and, seemingly, nobody died.
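The Node.js SDK has its own retry knobs; for anyone on Python, the equivalent is a botocore Config that retries transient 503s instead of surfacing them immediately (a sketch; tune the numbers to taste):

    import boto3
    from botocore.config import Config

    # Retry transient 5xx responses, with adaptive client-side rate limiting.
    retry_config = Config(retries={"max_attempts": 10, "mode": "adaptive"})
    logs = boto3.client("logs", region_name="eu-central-1", config=retry_config)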
I wonder whether it's because a sudden surge of capacity demand in the unaffected AZs put abnormal stress on things like networking across those AZs.
My only experience with that is in Azure: when they deployed patches for the Heartbleed etc. issues, things were much slower CPU-wise for a few weeks (our response times just shot up 20% for no apparent reason, then recovered a few weeks later), and there were abnormal network-related timeouts. It all settled down eventually.
Presumably you mean Fahrenheit? 100 degrees C is the boiling point of water at sea level, and I wouldn't want to sit in a room hotter than boiling water. Whereas 140 degrees F is 60 degrees C, well under twice your body's natural temperature, or a bit above a really hot Mediterranean summer's day, which seems a much more realistic temperature.
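For the record, the conversion: T(°C) = (T(°F) − 32) × 5/9, so 140 °F = (140 − 32) × 5/9 = 60 °C, while 100 °C would be 212 °F.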
60 degrees C is a lot above a really hot Mediterranean summer day. Average temp is 33 degrees in the hottest summer months, and in extremes it may be 40 degrees Celsius. 60 degrees would therefore be almost twice as hot as average or 50% hotter than even the hottest days.
Also 60 degrees Celsius would mean it made an appearance here:
> 60 degrees C is a lot above a really hot Mediterranean summer day. Average temp is 33 degrees in the hottest summer months, and in extremes it may be 40 degrees Celsius.
I was thinking 40 degrees. Given "average" isn't implied by "really hot", I'd say your 40 degree figure falls in line with what I had in mind. But you're right that it's still a big jump to 60 degrees.
Maybe I shouldn't have put vague comparisons in my post because it's rather distracting from the core point that the GP got their Celsius and Fahrenheit mixed up.
The exact temperature that's tolerable depends on the sauna, but can exceed 100 degrees C for dry saunas. Air has comparatively little heat capacity, so even at those temperatures, sweat can cool your body fairly efficiently (feels pretty toasty, though). If you add moisture to the air (steam sauna), that temperature becomes unbearable quickly, so those are at lower temperature.
It becomes unbearable not because of heat capacity but because of the body's inability to expel heat.
When the outside temperature is equal to or higher than body temperature, the only way to expel heat is through sweating. For that the air must be dry enough (wet-bulb temperature below body temperature). https://en.wikipedia.org/wiki/Wet-bulb_temperature
If the body can't expel enough heat, the result is eventually death.
No, Celsius. Admittedly 140 °C is very hot (but not unheard of) for a sauna, and you should probably not stay in for more than 2-3 minutes at a time. 90-110 °C, however, is pretty standard and really nice. I can highly recommend giving it a try.
Pulling a spec sheet for a typical Dell 1U server, it has an operating range of 5°C to 40°C, and a non-operating (in storage) range of -40°C to 65°C. I imagine it would survive much higher temps, but it doesn't seem recommended. I would be wary of how reliable anything in that room might be later.
Perhaps not the PCB itself, but there's typically lots of thin wire around with thin insulation.
A lot of DCs after 2000 went from running cold to running warm, based on failure rates not changing.
If you get a chance to tour HE Fremont (the old Apple factory), it's not cold.
Factoid: most startups in SV started with a rack in a closet that one day lost air circulation or cooling and went to the boiling point. I recommend people monitor the temp, and cut 2 inches off the bottom and top of their closet door to allow passive cooling while you drive back to the office.
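For the "monitor the temp" part, even polling a Linux thermal zone from a box in the closet works (a sketch; the sysfs path is standard on Linux, and the alert is a placeholder for your pager/email of choice):

    import time

    SENSOR = "/sys/class/thermal/thermal_zone0/temp"  # millidegrees Celsius
    THRESHOLD_C = 35.0

    def read_temp_c() -> float:
        with open(SENSOR) as f:
            return int(f.read().strip()) / 1000.0

    while True:
        temp = read_temp_c()
        if temp > THRESHOLD_C:
            # Placeholder alert: swap in email/SMS/pager of choice.
            print(f"ALERT: closet at {temp:.1f} C (threshold {THRESHOLD_C} C)")
        time.sleep(60)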
The environmental conditions within the affected Availability Zone have now returned to normal levels. We will provide further details on the root cause in a subsequent post, but *can confirm that there was no fire* within the facility.
The other day Zoom posted a status update that listed the time in both CEST and PST... How did they get daylight saving time right for one and not the other? PST isn't a thing for another few months yet...
Google is a bit more of an interesting case, at least as far as is known. Namely, they fucked up early in their history by using the local timezone, and when it started biting them it proved cheaper and easier to use that as the one global timezone rather than rebase everything on UTC. Since then I've heard it referred to as Mountain View Time.
OTOH, they could add a translation for it on the website...
I know you're being sarcastic, but it's funny how "localization" affects our perceived reality. Looking it up, there are also China-centric maps, so most Chinese students probably think all world maps are like that.
AFAIK most maps I've seen have the longitude 180 degrees West at the left edge, and 180 degrees East on the right.