AWS Frankfurt incident

(aws.amazon.com)

145 points | by jabo 9 days ago

13 comments

  • tyingq 9 days ago
    6:54 PM PDT

    Starting at 1:18 PM PDT we experienced connectivity issues to some EC2 instances, increased API error rates, and degraded performance for some EBS volumes within a single Availability Zone in the EU-CENTRAL-1 Region.

    At 4:26 PM PDT, network connectivity was restored and the majority of affected instances and EBS volumes began to recover.

    At 4:33 PM PDT, increased API error rates and latencies had also returned to normal levels. The issue has been resolved and the service is operating normally. The root cause of this issue was a failure of a control system which disabled multiple air handlers in the affected Availability Zone. These air handlers move cool air to the servers and equipment, and when they were disabled, ambient temperatures began to rise. Servers and networking equipment in the affected Availability Zone began to power-off when unsafe temperatures were reached. Unfortunately, because this issue impacted several redundant network switches, a larger number of EC2 instances in this single Availability Zone lost network connectivity.

    While our operators would normally have been able to restore cooling before impact, a fire suppression system activated inside a section of the affected Availability Zone. When this system activates, the data center is evacuated and sealed, and a chemical is dispersed to remove oxygen from the air to extinguish any fire. In order to recover the impacted instances and network equipment, we needed to wait until the fire department was able to inspect the facility. After the fire department determined that there was no fire in the data center and it was safe to return, the building needed to be re-oxygenated before it was safe for engineers to enter the facility and restore the affected networking gear and servers. The fire suppression system that activated remains disabled. This system is designed to require smoke to activate and should not have discharged. This system will remain inactive until we are able to determine what triggered it improperly.

    In the meantime, alternate fire suppression measures are being used to protect the data center. Once cooling was restored and the servers and network equipment were re-powered, affected instances recovered quickly. A very small number of remaining instances and volumes that were adversely affected by the increased ambient temperatures and loss of power remain unresolved.

    We continue to work to recover those last affected instances and volumes, and have opened notifications for the remaining impacted customers via the Personal Health Dashboard. For immediate recovery of those resources, we recommend replacing any remaining affected instances or volumes if possible.

    • xs83 8 days ago
      3.6 roentgen, not great, not terrible
      • varjag 8 days ago
        > This system is designed to require smoke to activate and should not have discharged.

        The Smoke Detector cloud instance was down for some reason so the system had to go with temperature!

        • bawolff 9 days ago
          Wow, sounds like the start of some implausible hollywood heist movie.
          • bombcar 9 days ago
            Gordon, we cannot predict how long the system can operate at this level, nor how long the reading will take. Please work as quickly as you can. Uh...it's probably not a problem...probably...but I'm showing a small discrepancy in...well, no, it's well within acceptable bounds again. Sustaining sequence.
            • dharmab 9 days ago
              In Tenet, the protagonist and his partner trigger a halon system in a building to force an evacuation as part of a heist.
              • philjohn 8 days ago
                I remember the summer job I had working in the European HQ of a chemical company. I was managing servers and thus had access to the server room - after a stern safety lecture about the Halon suppression system, where we were informed that "if the alarm goes off, don't even bother going for the respirator on the wall, just bolt it out of there double quick".
                • toyg 8 days ago
                  In the first season of Mr Robot, when they want to create financial chaos by wiping records of all debt, iirc they trigger a fire alert in a datacenter to force a backup wipe of some sort.
            • Symbiote 8 days ago
              > A very small number of remaining instances and volumes that were adversely affected by the increased ambient temperatures and loss of power remain unresolved.

              No noticeable smoke, but that does sound a bit closer to a fire than they planned for.

              Either some system(s) not shutting down when the ambient temperature becomes high, or the ambient temperature was so hot it damaged powered-off equipment (!).

              • brentcetinich 8 days ago
                I have been in this situation. During maintenance on cooling tower 1, the engineer put it into bypass mode so he could work on it safely (in case the BMS decided to turn it on while he was on it). When he left the site he forgot to take it out of bypass mode. Every few days the BMS rotates the cooling towers; it switched off tower B and turned on tower A, which was still in bypass mode, so air was blowing but the water was not circulating. The temperature graphs show the affected floors of data center space heating up very quickly. There was a lot of permanently damaged equipment. The 24/7 human operators ignored many warnings about ambient temperature and low humidity, assuming it was system error (does it look hot on the cameras?? shrug??). Various equipment has different power-down temperatures; the halon did not trigger for us. Some equipment, for whatever reason, did not shut down when it hit the cutoff temps and cooked itself. I suspect the low humidity caused too much static and that's where the damage was from.

                I guess that's something hard to test for: let's put this switch in an oven and run it to make sure it auto powers down at 80°C :) every time we release new firmware.

                It was a problem across multiple vendors, who issued BIOS/firmware patches: Cisco, Dell, EMC, Brocade.

                • paranoidrobot 8 days ago
                  There's always a risk that equipment that was running fine, will fail when powered back on, even if not damaged otherwise by the incident.

                  Thermal expansion/contraction is one thing - a solder joint that was marginal might have contracted enough to actually break.

                  • kube-system 8 days ago
                    IIRC it is common to experience a lot of spinning disk failures after a cooling failure.
                • trhway 8 days ago
                  >a fire suppression system activated inside a section of the affected Availability Zone. When this system activates, the data center is evacuated and sealed, and a chemical is dispersed to remove oxygen from the air to extinguish any fire.

                  Sounds like a submarine. On a submarine, the people who couldn't make it out of the sealed section in time would use personal breathing devices - I wonder if AWS has such a thing in the datacenter, and how they make sure all people are out of a huge datacenter. Several years ago a Russian nuclear sub on sea trials after a refit had the fire suppression system activate in a section where a bunch of people had gathered - crew and navy yard civilians, probably celebrating, drinking and smoking - and an insufficient number of breathing devices, combined with the civilians' lack of skill at using them quickly, resulted in 20 deaths and 40 injured (lung damage, etc.), mostly civilians.

                  • t0mas88 8 days ago
                    This is why you have those annoying kind of rules like "only 8 people allowed in this room". And then of course that gets ignored because 15 people easily fit inside. All fine until you find out there aren't enough breathing/floating/rescue kind of things for more than 8 people.
                    • antoinealb 8 days ago
                      My guess for "how to make sure all people are out": typically in such industrial, high-security facilities, each person going in has to badge through each door, i.e. no holding the door for a group of people. Combined with many locked doors, this lets you know exactly who is in which part of the building.

                      That, plus you typically don't go and hang out on the DC floor; only the personnel who are really required get to go in.

                      • darkwater 8 days ago
                        > My guess for "how to make sure all people are out": typically in such industrial, high-security facilities, each person going in has to badge through each door, i.e. no holding the door for a group of people. Combined with many locked doors, this lets you know exactly who is in which part of the building.

                        Holding doors was always a problem, but I guess that nowadays badges can have RFID or some other kind of wireless tracking so even if you hold the door, people in the control room can always know how many people are in each room, all the time. It's like 6 years now since the last time I was in a datacenter...

                        • trhway 8 days ago
                          >you typically don't go and hang out on the DC floor; only the personnel who are really required get to go in.

                          Times changed. I remember a friend, a bunch of years ago, complaining to me that the DC at their branch of a large, well-known transnational was the favorite place of employees for various informal activities, including that he was constantly dealing with used drug paraphernalia left all over the place.

                          Answer to the comment below: yep, mostly syringes/needles.

                          • qlm 8 days ago
                            Used drug paraphernalia? Were people shooting up or what? Surely smoking wouldn't be possible without alarms going off.
                          • orbz 8 days ago
                            Usually there are muster stations where a designated emergency captain takes attendance of everyone who is expected to be present, and a remote operations center with access to HRIS data knows who is out or on vacation when that attendance is taken.

                            Door locks during an emergency are usually designed to fail open so someone doesn't get trapped, though that may differ depending on the security level of the facility.

                        • greggyb 8 days ago
                          > Unfortunately, because this issue impacted several redundant network switches, a larger number of EC2 instances in this single Availability Zone lost network connectivity.

                          These may be redundant by a reasonable definition, but they clearly weren't here....

                          • fulafel 8 days ago
                            They seemed to be redundant in the other sense of the word.
                          • growt 8 days ago
                            Reminds me of my first trip to a datacenter, where the guy who accompanied us said: "In the event of a fire this room is filled with nitrogen in 20 seconds. But don't worry: nitrogen is not toxic!" Well, I was a little worried :)
                            • tyingq 8 days ago
                              Heh. Sure. Non toxic, but perhaps the most popular method of assisted suicide. Nothing to worry about.
                            • roudaki 8 days ago
                              What makes things worse is that these facilities have the strictest security measures in the industry; we were always afraid of a fire at this scale, with outside parties having to enter the pods, and all the checkups afterwards. I don't envy the people who worked that shift and the next one.
                              • tester34 8 days ago
                                Boom, there goes probably thousands of US dollars.
                                • k12sosse 6 days ago
                                  It's ok. Jeff's gonna cover it with employees' tips.
                                • Mauricebranagh 8 days ago
                                  Bit surprising that they didn't override this, as there was no fire, and take the hit from the over-temp.

                                  Also, they didn't have breathing gear (and trained staff) so someone could go in and restart without waiting, and, in case of an accident, try to rescue people.

                                  Back when I worked in R&D, in a lab where we could have had a Freon leak, we had breathing gear just outside and some people trained to use it.

                                  • doikor 8 days ago
                                    > Also, they didn't have breathing gear (and trained staff) so someone could go in and restart without waiting, and, in case of an accident, try to rescue people.

                                    At that point without actually going in and checking they have no real way of knowing if there really is/was a fire or not. So the proper procedure is to let the professionals handle it (wait for the fire department to clear the building). No amount of server downtime is worth sending a "not a firefighter" into a possibly burning building.

                                    And facilities like these have strict control of where people can be, so they know if someone is in there or not without going in to check.

                                    • witrak 8 days ago
                                      > So the proper procedure is to let the professionals handle it (wait for the fire department to clear the building).

                                      Nevertheless, having breathing gear would allow recovery to begin right after the fire department finished its procedures. This would shorten recovery time.

                                      • Mauricebranagh 8 days ago
                                        That was my point, and large industrial sites quite often have their own internal fire service.
                                      • Mauricebranagh 8 days ago
                                        Video cameras maybe or sensors that detect products of burning :-)

                                        I would hope that they do audit people in and out, so in case of accidents you can account for everyone.

                                        Oh and these would be trained people

                                      • darkcha0s 8 days ago
                                        I'll gladly take a few minutes of outage, if it means some guy doesn't have to run into an oxygenless building with nothing but a breathing apparatus to restart my server
                                        • tetha 8 days ago
                                          Also, if you follow AWS HA guidelines, this does not lead to a service outage. We were affected by this and it knocked a dozen or two systems offline for 6 hours or so. AZ redundancy took over, that was it, and on-call went back to sleep.
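
                                          The AZ-redundancy pattern described here can be sketched in a few lines - a toy illustration (function and replica names are made up, not AWS tooling): place replicas round-robin across zones, so the loss of any single AZ leaves at least one replica serving.

```python
from itertools import cycle

# Toy sketch (made-up names, not AWS tooling): round-robin replicas
# across Availability Zones so a single-AZ failure leaves survivors.
def place_replicas(n_replicas, zones):
    return {f"replica-{i}": z for i, z in zip(range(n_replicas), cycle(zones))}

def survivors(placement, lost_zone):
    # Replicas that remain reachable after one zone goes dark.
    return [r for r, z in placement.items() if z != lost_zone]

zones = ["eu-central-1a", "eu-central-1b", "eu-central-1c"]
placement = place_replicas(3, zones)
# Whichever single zone goes dark, something is still up.
assert all(survivors(placement, z) for z in zones)
```

                                          The same check fails with a single replica, which is the difference between a quiet night and a paged on-call.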
                                          • nix23 8 days ago
                                            Just imagine AWS being used to save lives, e.g. holding medical information.
                                            • darkcha0s 5 days ago
                                              If you are using a single region/DC to store safety-critical data you're already doing it wrong, and whoever handles your disaster recovery plan should be fired
                                              • nix23 5 days ago
                                                You mean whoever is using an American company for privacy-related data should be fired?
                                              • paranoidrobot 8 days ago
                                                I was about to reply that AWS shouldn't be relied on for safety-critical systems, but someone is probably already doing that.

                                                I'll revise that to: I hope that whoever is relying on AWS for safety-critical systems at least does it over many regions. It's still dumb, because even AWS occasionally has global/multi-region outages, but at least it hopefully reduces the chance of one.

                                                • sofixa 8 days ago
                                                  > I was about to reply that AWS shouldn't be relied on for safety-critical systems, but someone is probably already doing that

                                                  Wtf, why not? It's drastically easier, and probably cheaper, to achieve that level of redundancy with AWS than doing it yourself.

                                                  > It's still dumb, because even AWS occasionally has global/multi-region outages

                                                  Really? Like when? The only one you could even claim was multi-region was when S3 us-east-1 was down, and with the old default behaviour (if you didn't specify where your S3 bucket was, requests would pass through us-east-1 to ask where it is) that impacted lazy code that had nothing to do with us-east-1. That's almost entirely on developers though, so it's hard to claim it was a multi-region or global outage.

                                        • hyperman1 8 days ago
                                          Strange to see them using PDT as a time zone. Both customers and local personnel would be better served by either UTC or the local time zone.
                                          • benglish11 8 days ago
                                            AWS does this with all of their outages, presumably because Amazon HQ is in Washington (PDT). I would think UTC would be the preferred "universal" time for communicating these kinds of incidents, because AWS customers span the globe.
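
                                            For anyone doing the conversion by hand - a small sketch using only the standard library, assuming the incident date of 10 June 2021 (the Register article on this incident is dated 11 June) - the advisory's "1:18 PM PDT" works out as:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # stdlib since Python 3.9

# "1:18 PM PDT" on the assumed incident date, 2021-06-10.
start = datetime(2021, 6, 10, 13, 18, tzinfo=ZoneInfo("America/Los_Angeles"))
print(start.astimezone(timezone.utc))               # 2021-06-10 20:18:00+00:00
print(start.astimezone(ZoneInfo("Europe/Berlin")))  # 2021-06-10 22:18:00+02:00
```

                                            So Frankfurt readers get 22:18 local time; publishing in UTC would avoid the guesswork entirely.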
                                            • ThePowerOfFuet 8 days ago
                                              Google does it too. 'Murica.
                                            • rsync 9 days ago
                                              5:19 PM PDT

                                              We have restored network connectivity within the affected Availability Zone in the EU-CENTRAL-1 Region. The vast majority of affected EC2 instances have now fully recovered but we’re continuing to work through some EBS volumes that continue to experience degraded performance. The environmental conditions within the affected Availability Zone have now returned to normal levels. We will provide further details on the root cause in a subsequent post, but can confirm that there was no fire within the facility.

                                              • GauntletWizard 9 days ago
                                                If you have data in Frankfurt, now is the time to test your backups. There's going to be a massive rash of failures in the next few months as hardware that was compromised but limping along dies off.
                                                • tyingq 9 days ago
                                                  Big sale on Frankfurt spot instances coming soon!
                                                • buremba 9 days ago
                                                  why?
                                                  • notatoad 9 days ago
                                                    the two most likely scenarios where "environmental conditions" prevent access to a DC are fire or cooling failure. aws says it's not a fire, so probably the temperature rose above their allowable safety threshold for access.

                                                    that's not good for computers either

                                                    • midasuni 8 days ago
                                                      We had a “halon” discharge in a data centre a few years ago. The pressure wave knocked out a whole rack of disks.
                                                      • aequitas 8 days ago
                                                        Like screaming at your disk?
                                                      • derefr 9 days ago
                                                        It could have just been that someone pulled a fire alarm when there wasn't a fire, and then they had to wait for the Halon to disperse.
                                                        • ta988 9 days ago
                                                          Another question: are there Halon detectors that tell you when it is safe? Or, just like in MRI/NMR rooms, simple oxygen meters?
                                                          • Someone 8 days ago
                                                            Nitpick: it’s unlikely this uses Halon. https://fireandsafetycentre.co.uk/blogs/extinguisher-types/h...:

                                                            “The EU ban on the use of Halon in fire extinguishers actually came into force in October 2000 and was implemented in the UK in 2003, as a result of scientific research linking Halon and other CFC’s to Ozone depletion. The ban in practice is not total.

                                                            Existing owners and users of Halon 1211 portable fire extinguishers may be able to claim exemption to the EU ban for certain “Critical Uses”.

                                                            Broadly speaking this includes limited applications within the aircraft industry, military / armed forces, petrochemical industry and some specific marine applications“

                                                            Chances are this facility was built after 2000. I think that makes it extremely unlikely it can claim exemption.

                                                            See also https://en.wikipedia.org/wiki/Montreal_Protocol, which says

                                                            “The Vienna Convention and the Montreal Protocol have each been ratified by 196 nations and the European Union, making them the first universally ratified treaties in United Nations history”

                                                            • MaxBarraclough 8 days ago
                                                              From the linked article, they're presumably using either CO2 or 'FE-36' then?
                                                              • ta988 8 days ago
                                                                I didn't know that, thanks. Next time I buy extinguishers for electronics I'll pay more attention.
                                                            • ta988 9 days ago
                                                              Do we have any idea how long it would take? I would have expected the venting systems to get that out pretty quickly.
                                                        • Proven 9 days ago
                                                          Baseless speculation.

                                                          And it's always the right time to test backups everywhere.

                                                        • c_o_n_v_e_x 6 days ago
                                                          Former controls system guy and have worked in data centers. I'd be concerned about why a control system failure took down multiple air handlers. Units typically have their own controllers and can be configured to run by themselves without input from a "parent" controller.
                                                          • simzor 9 days ago
                                                            This reminds me that we should get ChaosMonkey up and running. :D
                                                            • acid__ 9 days ago
                                                              Now imagining a literal chaos monkey running around a datacenter with a blowtorch, setting random racks on fire
                                                              • ta988 9 days ago
                                                                That's a concept for a startup, sell drones/robots that randomly put racks on fire.
                                                                • simzor 8 days ago
                                                                  I like that idea :D
                                                              • plasma 9 days ago
                                                                I'm curious to hear if anyone's multi-az setup (RDS, ECS, etc) handled this event without much of an issue?

                                                                I assume so, but it would be nice to know it's working as expected!

                                                                • tetha 8 days ago
                                                                  As said in another comment, we had a dozen instances or two affected. Most of the hashi stack just lost a node and chugged along at reduced redundancy. A patroni/postgres cluster lost a replica, but automatically re-integrated it into the cluster. Very nice and smooth.

                                                                  We mostly found one or two classes of jobs in the orchestration for which Nomad stopped retrying deployments before the EC2 instances running the allocations had fully failed and been removed from the cluster - and our on-call was unsure how to handle that situation in Nomad correctly. Network and routing were really weird at some point. Additionally, we ended up with a couple of container instances orphaned from the container management, which was strange for a moment.

                                                                  This was made a bit more hectic over here because a second hoster apparently fried their own network at the same time so we needed some time to realize we have two issues.

                                                                  Overall, 5/5 Outage, would fail again once we've updated our jobs. We're happily close to not caring about such an incident.

                                                                  • kawsper 8 days ago
                                                                    > Most of the hashi stack just lost a node and chugged along at reduced redundancy.

                                                                    Amazon had released a version of their AWS Linux distribution that rebooted randomly due to a kernel bug while I was working on our staging cluster, but I didn't even notice that I had EC2 instances randomly rebooting and dropping, because Nomad just kept the workload up.

                                                                  • mercora 8 days ago
                                                                    note that every AZ assignment is mapped randomly for each account [0]:

                                                                    "To ensure that resources are distributed across the Availability Zones for a Region, we independently map Availability Zones to names for each account.".

                                                                    my eu-central-1a is not necessarily yours.

                                                                    [0] https://docs.aws.amazon.com/ram/latest/userguide/working-wit...
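
                                                                    A toy sketch of what such a per-account mapping could look like - this is NOT AWS's actual algorithm, just an illustration of a deterministic shuffle keyed on the account ID; `az_mapping` and its zone lists are made up:

```python
import hashlib

# Toy illustration (NOT AWS's real algorithm) of why "eu-central-1a"
# can name a different physical zone in each account: derive a
# per-account permutation of the underlying Zone IDs from the account ID.
def az_mapping(account_id, zone_ids=("euc1-az1", "euc1-az2", "euc1-az3")):
    names = ["eu-central-1a", "eu-central-1b", "eu-central-1c"]
    seed = int(hashlib.sha256(account_id.encode()).hexdigest(), 16)
    remaining = list(zone_ids)
    shuffled = []
    while remaining:
        # Consume the seed digit by digit to pick the next zone.
        shuffled.append(remaining.pop(seed % len(remaining)))
        seed //= 7
    return dict(zip(names, shuffled))
```

                                                                    To de-alias for real, compare the `ZoneId` field returned by `aws ec2 describe-availability-zones` across accounts: the letter suffix is the per-account alias, while the Zone ID names the physical zone.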

                                                                  • tomw1808 8 days ago
                                                                    We had a few interesting "bugs" appear: mostly logging went down, but EC2 machines kept running. E.g. we run a lot in ECS on EC2 machines, like a RabbitMQ cluster with 20 instances across all AZs. None of the machines died, none of the containers had degraded performance as far as I can tell, but the performance logs just disappear from 21:50 to 22:30 GMT+2. It's just blank in the Metrics and CloudWatch dashboards. Same with other CloudWatch logs.

                                                                    We also have some EC2s running Node.js applications, and there the aws-sdk just errored out with "UnknownError: 503" and simply stopped logging until we restarted the machines. The machines themselves were not stopped at all.

                                                                    Other than that, I can't see any effects across our accounts. Also not RDS or anything else. Fascinating. Glad it's under control and seemingly nobody died or anything.

                                                                    • clutchdude 9 days ago
                                                                      It affected services we have that are not located in the impacted AZ.
                                                                      • gkop 9 days ago
                                                                        Likewise, our RDS replica experienced degenerate lag, when neither the primary nor replica were in the scorched AZ.
                                                                        • plasma 9 days ago
                                                                          That's really interesting.

                                                                          I wonder whether it's because a sudden surge of capacity needs in the unaffected AZs put abnormal stress on things like networking across those AZs.

                                                                          My only experience there is with Azure: when they deployed patches for the Heartbleed etc. issues, for a few weeks things were much slower in CPU power (our response times just shot up 20% for no reason, then recovered a few weeks later) and there were abnormal network-related timeouts; it all settled down eventually.

                                                                          • xmodem 9 days ago
                                                                            Heartbleed? I assume you mean Spectre and Meltdown
                                                                            • celticninja 8 days ago
                                                                              Your comment is not clear, were you unaware of heartbleed? Or were you questioning if the GP was remembering correctly?

                                                                              For reference https://heartbleed.com/

                                                                              • chrisandchris 8 days ago
                                                                                It is clear. Heartbleed does not affect CPU like Spectre/Meltdown did (both migitations increased CPU usage quite a lot).
                                                                      • Lost one node in a small Elasticsearch cluster. The incident did not affect cluster availability. The node was reintegrated into the cluster once it came back online.
                                                                        • adrianpike 9 days ago
                                                                          A handful of AZ-less managed AWS services were having a bad time(tm) during the peak of the incident, so if anyone was reliant on those they were having a bad time as well.
                                                                          • ununoctium87 8 days ago
                                                                            Some instances of our services went down but our deployments are multi-AZ by default so minimal perceptible outage to our customers
                                                                            • TheP1000 8 days ago
                                                                              Kinesis Data Streams and Firehose were down. All but one of our 300 RDS instances failed over; that one was down for 5 hours.
                                                                              • plasma 8 days ago
                                                                                Are you able to find out why it didn’t work? Very surprising.
                                                                            • slater 9 days ago
                                                                              is it on fire?
                                                                              • samizdis 9 days ago
                                                                                An article in The Register [1] says it wasn't a fire, but speculates:

                                                                                While we lack any evidence on which to base an assertion, The Register has reported on erupting UPSes and tiny puffs of smoke leading to hypoxic gas being released into data centres.

                                                                                The whole point of hypoxic gas release into data centres is to deprive fires of oxygen. And as humans need oxygen, it can be a while before engineers can return to a data centre.

                                                                                The Register mentions this as it fits the facts offered in this incident, and with AWS’s language about “environmental conditions” preventing entry.

                                                                                The Register says that it'll update its article if it gets further info.

                                                                                [1] https://www.theregister.com/2021/06/11/aws_eu_central_1_inci...

                                                                                • Right? At what temperatures would servers continue to operate, but which would be unsafe for humans?
                                                                                  • lmilcin 9 days ago
                                                                                    I was in a server room that overheated and servers started to shut down. AC failed and there was no environmental monitoring.

                                                                                    The air was too hot to breathe.

                                                                                    I had to take a breath and hold it so I could jump in and shut down the most important assets.

                                                                                    There was redundant air conditioning except for exactly one element: the air inlet on the roof. A thin plastic bag had blocked it.

                                                                                    • jacquesm 9 days ago
                                                                                      Classic. $0.05 bit of junk causing a very large multiple of damage.

                                                                                      Nice case to prove that plastic waste is hazardous though!

                                                                                      • lmilcin 8 days ago
                                                                                        Yes, that plastic bag cost a couple million.

                                                                                        But the offset was definitely increased understanding of the concept of single point of failure.

                                                                                        • rdines 8 days ago
                                                                                          oh my. that reminds me of this story that I hadn't thought of in a long time, LOL: https://www.datacenterknowledge.com/archives/2012/07/09/outa...
                                                                                          • jacquesm 8 days ago
                                                                                            That article references the explosion at 'The Planet' in 2008. I got caught up in that; it wasn't fun at all. Fortunately we had good off-site backups.

                                                                                            Never knew about the squirrel angle, thank you!

                                                                                            • rdines 8 days ago
                                                                                              I forgot about that! Glad you came out ok in The Planet explosion, I remember that, it was insane. Fun fact: IBM now owns The Planet’s assets via Softlayer.
                                                                                              • jacquesm 7 days ago
                                                                                                By the way, we also were hit in 2003 when a transformer exploded.

                                                                                                This may be a hint that you don't want to colocate where I stuff my machines ;)

                                                                                                EV1 was pretty reliable right up to the point that it wasn't. Taught me some good lessons about multi-DC availability. Some of which were recently re-learned by the folks at OVH.

                                                                                      • ta988 9 days ago
                                                                                        Was there some work after to add other inlets?
                                                                                        • geoduck14 9 days ago
                                                                                          BRB. Going to go update my FMEA
                                                                                          • serpix 9 days ago
                                                                                            So, well over 140 Celsius then, as there are saunas that are that hot and breathing in them is fine.
                                                                                            • thebeardisred 8 days ago
                                                                                              You're also not fully clothed and running around in a stress induced panic while in the sauna.

                                                                                              Actually, I apologize that was a huge presumption.

                                                                                              When _I_ am in a sauna I'm normally not fully clothed and running around in a stress induced panic.

                                                                                              • hnlmorg 8 days ago
                                                                                                Presumably you mean Fahrenheit? 100 degrees C is the boiling point of water at sea level; I wouldn't want to sit in a room hotter than boiling water. Whereas 140 degrees F is 60 degrees C, a little under twice your body's natural temperature, or a bit above a really hot Mediterranean summer's day. Which seems a much more realistic temperature.
                                                                                                • celticninja 8 days ago
                                                                                                  60 degrees C is a lot above a really hot Mediterranean summer day. Average temp is 33 degrees in the hottest summer months, and in extremes it may be 40 degrees Celsius. 60 degrees would therefore be almost twice as hot as average or 50% hotter than even the hottest days.

                                                                                                  Also 60 degrees Celsius would mean it made an appearance here:

                                                                                                  https://en.m.wikipedia.org/wiki/Highest_temperature_recorded....

                                                                                                  Given that the highest recorded air temperature on earth is 56.7 degrees, you may wish to revise your last sentence.

                                                                                                  • hnlmorg 6 days ago
                                                                                                    > 60 degrees C is a lot above a really hot Mediterranean summer day. Average temp is 33 degrees in the hottest summer months, and in extremes it may be 40 degrees Celsius.

                                                                                                    I was thinking 40 degrees. Given "average" isn't implied by "really hot", I'd say your 40 degree figure falls in line with what I had in mind. But you're right that it's still a big jump to 60 degrees.

                                                                                                    Maybe I shouldn't have put vague comparisons in my post because it's rather distracting from the core point that the GP got their Celsius and Fahrenheit mixed up.

                                                                                                  • Xylakant 8 days ago
                                                                                                    The exact temperature that's tolerable depends on the sauna, but can exceed 100 degrees C for dry saunas. Air has comparatively little heat capacity, so even at those temperatures, sweat can cool your body fairly efficiently (feels pretty toasty, though). If you add moisture to the air (steam sauna), that temperature becomes unbearable quickly, so those are at lower temperature.
                                                                                                    • lmilcin 8 days ago
                                                                                                      It becomes unbearable not because of heat capacity but because of the body's inability to expel heat.

                                                                                                      When the outside temperature is equal to or higher than body temperature, the only way to expel heat is through sweating. For this the air must be dry enough (wet-bulb temperature less than body temperature). https://en.wikipedia.org/wiki/Wet-bulb_temperature

                                                                                                      If the body can't expel enough heat, the result is eventually death.
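The wet-bulb limit described above can be estimated numerically. A minimal sketch using the Stull (2011) empirical approximation (not anything AWS uses; valid roughly for temperatures up to 50 C and relative humidity of 5-99%):

```python
import math

def wet_bulb_c(temp_c: float, rel_humidity_pct: float) -> float:
    """Stull (2011) empirical wet-bulb approximation.

    Valid roughly for -20..50 C and 5..99% relative humidity.
    """
    t, rh = temp_c, rel_humidity_pct
    return (
        t * math.atan(0.151977 * math.sqrt(rh + 8.313659))
        + math.atan(t + rh)
        - math.atan(rh - 1.676331)
        + 0.00391838 * rh ** 1.5 * math.atan(0.023101 * rh)
        - 4.686035
    )

# Dry heat is survivable because the wet-bulb temperature stays far below
# body temperature; humid heat near body temperature is what kills.
print(round(wet_bulb_c(30, 50), 1))   # hot day, moderate humidity: ~22 C, fine
print(round(wet_bulb_c(45, 90), 1))   # hot and humid: above 37 C body temp, lethal
```

This is why a 100 C dry sauna is tolerable while a much cooler steam room can quickly become dangerous.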

                                                                                                    • zimpenfish 8 days ago
                                                                                                      Apparently there are some lunatics who go above 100C in saunas - sometimes they die.

                                                                                                      https://www.bbc.co.uk/news/magazine-10912578

                                                                                                      But it seems like 175F (80C) would be a normal-ish temperature range.

                                                                                                      • fulafel 8 days ago
                                                                                                        Well, the article is about the first time someone died in a competitive "world championship" event with constantly humidified 100+C air, a contest of who can tolerate it the longest.

                                                                                                        Probably people die in saunas all the time in countries where it's popular, like they sometimes do when eating, walking, or sleeping or watching tv.

                                                                                                        • hnlmorg 6 days ago
                                                                                                          It really depends on the type of sauna, local customs, and even then there is still quite a large range due to individual preferences.

                                                                                                          80 degrees C would generally be the hotter end of the spectrum. Some go as low as 40 degrees C.

                                                                                                        • dagw 8 days ago
                                                                                                          > Presumably you mean Fahrenheit?

                                                                                                          No, Celsius. Admittedly 140 C is very hot (but not unheard of) for a sauna, and you should probably not be in for more than 2-3 minutes at a time. 90-110 C however is pretty standard and really nice. I can highly recommend giving it a try.

                                                                                                  • tyingq 9 days ago
                                                                                                    Pulling a spec sheet for a typical Dell 1U server, it has an operating range of 5°C to 40°C, and a non-operating (in storage) range of -40°C to 65°C. I imagine it would survive much higher temps, but it doesn't seem recommended. I would be wary of how reliable anything in that room might be later.

                                                                                                    Perhaps not the PCB itself, but there's typically lots of thin wire around with thin insulation.

                                                                                                    • ta988 9 days ago
                                                                                                      Insulation usually resists much higher temperatures.
                                                                                                      • tyingq 9 days ago
                                                                                                        It depends. Some insulation melts at 60-90C.
                                                                                                    • JacobDotVI 9 days ago
                                                                                                      It might not be the fire itself; the fire suppression system might use gases fatal to humans, so they must wait for those gases to vent / dissipate.
                                                                                                      • ta988 9 days ago
                                                                                                        I've had servers running in a 35°C room for almost a day. Curiously 3 years later only one hard drive died. And they are >6yo machines. Recent machines are supposed to handle 35°C air intake according to Anandtech and special servers can take more https://www.anandtech.com/show/7723/free-cooling-the-server-....
                                                                                                        • offmycloud 9 days ago
                                                                                                          If they did have a fire, they're probably waiting for all the smoke and FM200 (or other extinguishing agent) to clear out before they send anyone in.
                                                                                                          • The hardware is heat-resistant, so probably around ~75C (167F).
                                                                                                            • redis_mlc 9 days ago
                                                                                                              If you can breathe, it should be ok.

                                                                                                              A lot of DCs after 2000 went from cold to warm, based on failure rates not changing.

                                                                                                              If you get a chance to tour HE Fremont (the old Apple factory), it's not cold.

                                                                                                              Factoid: most startups in SV started with a rack in a closet that one day lost air circulation or cooling and went to the boiling point. I recommend people monitor the temp, and cut 2 inches off the bottom and top of the closet door to allow passive cooling while you drive back to the office.
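The monitoring advice above can be as simple as polling the kernel's thermal zones. A minimal sketch assuming a Linux host; the sysfs paths are standard, but the 35 C alert threshold is an illustrative choice:

```python
import glob

ALERT_C = 35.0  # illustrative threshold for a closet rack

def millideg_to_c(raw: str) -> float:
    """Sysfs thermal zones report temperatures in millidegrees Celsius."""
    return int(raw.strip()) / 1000.0

def check_zones() -> list[float]:
    """Read every thermal zone the kernel exposes; empty list if none."""
    temps = []
    for path in glob.glob("/sys/class/thermal/thermal_zone*/temp"):
        with open(path) as f:
            temps.append(millideg_to_c(f.read()))
    return temps

if __name__ == "__main__":
    for t in check_zones():
        print(f"{t:.1f} C  {'ALERT' if t >= ALERT_C else 'ok'}")
```

In practice you would wire the ALERT branch to a pager or SMS rather than a print statement, since the whole point is to be warned before driving back to the office.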

                                                                                                            • Last update:

                                                                                                              The environmental conditions within the affected Availability Zone have now returned to normal levels. We will provide further details on the root cause in a subsequent post, but *can confirm that there was no fire* within the facility.

                                                                                                              • mrandish 9 days ago
                                                                                                                Hmmm, maybe the ambient cooling system failed. No fire but got very toasty.
                                                                                                                • Hamuko 8 days ago
                                                                                                                  Yup. Cooling system failed -> fire suppression kicked in -> evacuation, chemicals released to displace oxygen.
                                                                                                            • intsunny 9 days ago
                                                                                                              Omg, please Amazon, switch to UTC for timestamps.

                                                                                                              It is 3AM in Germany, and I'm tired and I don't want to know what PDT is.
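For anyone doing the conversion by hand, a quick sketch with Python's zoneinfo module (the 2021-06-10 date is inferred from the incident timeline, not stated in the status update itself):

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# First timestamp from the status update, interpreted as PDT
# (incident date assumed to be 2021-06-10).
start = datetime(2021, 6, 10, 13, 18, tzinfo=ZoneInfo("America/Los_Angeles"))

print(start.astimezone(ZoneInfo("UTC")))            # 2021-06-10 20:18:00+00:00
print(start.astimezone(ZoneInfo("Europe/Berlin")))  # 2021-06-10 22:18:00+02:00
```

zoneinfo applies the correct daylight-saving offset for the date, which is exactly the step status pages that hardcode "PDT" or "PST" get wrong.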

                                                                                                              • plasma 9 days ago
                                                                                I'm surprised these sites don't show a localised time with UTC next to it. Even the popular statuspage.io doesn't do this, which blows my mind.
                                                                                                                • JshWright 9 days ago
                                                                                  The other day Zoom posted a status update that listed the time in both CEST and PST... How did they get daylight saving time right for one and not the other? PST isn't a thing for another few months yet...
                                                                                                                  • schoen 8 days ago
                                                                                                                    Maybe a European thought that Americans also use "S" for "Summer" rather than "Standard"?
                                                                                                                  • a012 8 days ago
                                                                                    And Google too. It makes me wonder who their audience is when they only use a US timezone while their customers are worldwide.
                                                                                                                    • p_l 8 days ago
                                                                                      Google is a bit more interesting of a case, or at least a known one. They fucked up early in their history by using a local timezone, and when it started biting them it proved cheaper and easier to keep that as the one global timezone rather than rebase everything on UTC. Since then I've heard it referenced as Mountain View Time.

                                                                                      OTOH, they could add a translation for it on the website...

                                                                                                                      • antoinealb 8 days ago
                                                                                        My "new tab page" in Chrome shows me clocks for Mountain View, Zurich (local time for me), and a few other relevant ones. Very useful when working across multiple time zones.
                                                                                                                    • simonmales 8 days ago
                                                                                      It annoys me too that a lot of tech companies report in North American timezones.

                                                                                      Whilst in hotel quarantine I hacked together a webapp to show timezones around the world.

                                                                                      I did it to glance at my remote colleagues' local time.

                                                                                                                      https://localtime.app/

                                                                                                                      • a012 8 days ago
                                                                                        I hope you learnt something while doing that, but I present to you https://www.worldtimebuddy.com/
                                                                                                                        • Marazan 8 days ago
                                                                                          I discovered the joy of World Time Buddy when working on a team that had members in India, the UK, Canada, New York and Australia.
                                                                                                                          • simonmales 8 days ago
                                                                                                                            Looks nice. Perhaps not the most friendly thing on mobile.

                                                                                            Reminds me of the meeting planner by timeanddate.com (which I use a lot)

                                                                                                                        • rad_gruchalski 8 days ago
                                                                                                                          Here's a great website: https://everytimezone.com/.
                                                                                                                          • ksec 9 days ago
                                                                                                                            I am sorry I laughed so hard. Frankfurt in PDT.
                                                                                                                          • trhway 8 days ago
                                                                                                                            that can be done to look great on a webpage https://media.gettyimages.com/photos/clocks-showing-the-loca...
                                                                                                                            • 88840-8855 8 days ago
                                                                                                                              Why? Every single time I look on a map, the American continent is in the centre. It only makes sense to use PDT as the only correct timezone. /s
                                                                                                                          • Sounds like a status update from Fukushima.
                                                                                                                            • idownvoted 8 days ago
                                                                                                                              Well then Germany, you know what to do: Ban all AWS data centers!