Azure Appears to Be Full

(theregister.co.uk)

76 points | by estensen 10 days ago

14 comments

  • sz4kerto 10 days ago

    This is a bit like the financial crisis in 2008.

    In 2008, the idea was that if you bundle up a large bunch of mortgages, the bundle will have low risk because the chance of everything failing at the same time is low. The cloud is designed so that resource usage spikes of individual customers can always be served, because one customer is very small compared to the whole infrastructure.

    However, in some cases, these mortgages/resource spikes become highly correlated.
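
    A minimal sketch of that point, under made-up numbers: 1000 customers who each spike 10% of the time, and a provider with room for 150 simultaneous spikes. The function name and the `p_shared` parameter (the chance that one common event makes everyone spike together) are assumptions for illustration, not anything from the article.

```python
import random


def overflow_rate(n_customers, capacity, p_spike, p_shared, trials, seed=0):
    """Fraction of trials in which simultaneous spikes exceed capacity.

    p_shared is an assumed chance that a common event makes every
    customer spike at once; otherwise customers spike independently.
    """
    rng = random.Random(seed)
    overflows = 0
    for _ in range(trials):
        if rng.random() < p_shared:
            # Correlated case: one shared cause, everyone spikes together.
            spiking = n_customers
        else:
            # Independent case: each customer spikes on its own.
            spiking = sum(rng.random() < p_spike for _ in range(n_customers))
        if spiking > capacity:
            overflows += 1
    return overflows / trials


# Same customers, same capacity; only the correlation differs.
independent = overflow_rate(1000, 150, 0.10, 0.0, 2000)
correlated = overflow_rate(1000, 150, 0.10, 0.01, 2000)
print(f"independent: {independent:.4f}  correlated: {correlated:.4f}")
```

    With independent spikes the pool almost never overflows, because the expected load is 100 and 150 is several standard deviations away. Even a small chance of a shared trigger puts a floor under the overflow rate that no amount of pooling removes.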

    • sksksk 10 days ago

      It's a pretty common model in many industries...

      If every gym member visited the gym at the same time, they wouldn't all fit. Only a small fraction of the members use the gym at any one time, so it works.

      Banks would crash if everyone tried to withdraw their money at the same time, but they don't, so the bank can loan the money out.

      • buran77 10 days ago

        > Only a small fraction of the members use the gym at any one time

        Only a fraction of the members use the gym at all. If every member of the gym wanted to use it there would be no reasonable schedule to make that possible. ~50% of gym members use it less than 100 times per year, and only ~25% use it consistently.

        Banks, depending on legislation and on their size, have to keep 0/3/10% in reserves, which is far worse than most clouds or gyms would ever offer.

        • axlee 10 days ago

          As a matter of fact, I cannot think of a single industry that can serve a full service to the entirety of their clients at a discrete point in time.

          • Intermernet 10 days ago

            Garbage collection. As in the people who pick up your bins. They do this on a weekly basis.

            • dragonwriter 10 days ago

              Garbage collectors, like clouds, have usage based pricing, and cannot handle everyone spiking at once. They can handle everyone at normal baseline usage, but that's just like saying a cloud provider can handle everyone using the same number of reserved instances they’ve purchased on a long-term basis.

              • tdrgabi 10 days ago

                But if all their customers want their garbage picked up today, or everyday, it won't work.

                • dragonwriter 10 days ago

                  More relevant to the cloud analogy, if all their customers wanted to purchase a large extra pickup (beyond their normal baseline) on their normal day, which is part of garbage service offerings, they wouldn't be able to accommodate it.

                  • Intermernet 10 days ago

                    Aha, thank you for finding the flaw in my example! I lounge corrected ;-)

                  • Intermernet 10 days ago

                    On any single day, they only have a set list of customers. This is part of the contract.

            • JackPoach 10 days ago

              That's really an interesting analogy. Of course all analogies are wrong BUT if AWS goes down, half the internet goes down (in terms of important services).

              • ironic_ali 10 days ago

                Including the CIA. Snowden will be happy!

                • cameronbrown 10 days ago

                  Isn't AWS GovCloud a dedicated set of DCs for government? I'm sure the CIA can give DMV jobs the boot if need be.

                  • votepaunchy 10 days ago

                    The AWS Secret Region is separate from GovCloud.

                    https://aws.amazon.com/federal/us-intelligence-community/

                    • sneak 10 days ago

                      Just the US government. They even have special racist hiring policies to comply with the strict regulations set about who is allowed to be in the building where the US government holds your data.

                      • votepaunchy 10 days ago

                        race != nationality != citizenship

                        • upofadown 10 days ago

                          Racism is normally implemented on attributes other than race. Instead, some aspect of the targeted group that correlates with race is used. So this does not disprove the contention by itself.

                          • cameronbrown 8 days ago

                            By definition, that's not racism then. It's classism or nationalism or whatever.

              • estensen 10 days ago

                I think it's very problematic that a major cloud provider is unable to update their status page, even when this has been ongoing for days. All green ticks here: https://status.azure.com/en-us/status

                • logicallee 10 days ago

                  How to Use a Cloud Provider Status Page:

                  1. Enter the URL of the cloud provider's status page into your browser and press enter.

                  2. If the status page loads instantly, all services are go.

                  3. If the status page takes between 2 and 5 seconds to serve, the cloud provider is experiencing a slowdown.

                  4. If the status page takes between 5 and 30 seconds to load, the cloud provider is experiencing a major problem.

                  5. If the status page takes between 30 seconds and 1 minute to load, requires you to refresh before you can see it, or fails to load completely such as with missing images, then the cloud provider is experiencing widespread problems in multiple regions and has only sporadic availability.

                  6. If the status page doesn't load at all, all services are down. Check the CEO's twitter page.

                  7. If the CEO's twitter page has a pinned tweet telling you not to worry, then all of your data has been lost.

                  • zip1234 10 days ago

                    From my experience, status pages usually are out of date unless they are internal. If you want the real info and are seeing issues on your end, check Twitter or open a ticket. Usually Twitter is faster.

                  • imeron 10 days ago

                    I heard from an insider that some Azure services had a 10x growth because of the recent changes in our society. It's not like you can prepare for a 10x hit.

                    My personal experience with our AWS CI infra is that it's struggling more and more recently. Builds are slower on average than a couple of weeks ago. Maybe those vCPUs are not the same vCPUs as yesterday ;D.

                    • dmos62 10 days ago

                      There is something satisfying about the bounds of the cloud metaphor being reached (a cloud can't fill the sky).

                      • ironic_ali 10 days ago

                        A trip to Scotland will change your mind, fast...

                        • arethuza 10 days ago

                          That's a bit harsh, we've had at least 3 days this year where it wasn't raining!

                        • leecb 10 days ago

                          If the cloud is full, will it rain?

                          • jimktrains2 10 days ago

                            > a cloud can't fill the sky.

                            Someone's never been to Pittsburgh.

                          • redwood 10 days ago

                            AWS and GCP are not full

                            • jeffhuys 10 days ago

                              Feels like an opportunity for building a service that uses AWS, GCP or Azure based on which is cheapest at that moment + which is not "full"... Unless that already exists.
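
                              A rough sketch of the selection step, assuming you already have a current price and a capacity flag per provider. The `Offer` type, `pick_provider`, and the prices below are all hypothetical; the `has_capacity` flag in particular is an assumption, since in practice you may only learn it by attempting to provision.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Offer:
    provider: str        # e.g. "aws", "gcp", "azure"
    hourly_price: float  # made-up price for a comparable VM size
    has_capacity: bool   # assumed known; often only a failed request reveals it


def pick_provider(offers: List[Offer]) -> Optional[Offer]:
    """Return the cheapest offer that still has capacity, or None."""
    available = [o for o in offers if o.has_capacity]
    return min(available, key=lambda o: o.hourly_price, default=None)


offers = [
    Offer("azure", 0.090, has_capacity=False),  # cheapest, but "full"
    Offer("aws", 0.100, has_capacity=True),
    Offer("gcp", 0.095, has_capacity=True),
]
choice = pick_provider(offers)
print(choice.provider if choice else "nothing available")
```

                              This only covers plain VMs that are interchangeable across providers; anything built on provider-specific services would not fit a selector this simple.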

                              • tyingq 10 days ago

                                Lowest common denominator though. If you can use just plain old VMs, there's probably little value in using the big cloud vendors. Traditional hosting would be loads cheaper.

                                • dannyeei 10 days ago

                                  Really? Do you mean running your own data center or what services are you referring to?

                                  • tyingq 10 days ago

                                    No, not running your own data center. Traditional server hosting. Rackspace, Liquidweb, Packet.net and similar.

                                    Meaning that if you're going to use the lowest common denominator, why not pay fair market prices for egress and compute?

                                    Any value in cloud is typically the services that are higher level than a VM. Those services would be hard to put a generic multi-cloud facade in front of. It would be brittle and bug ridden.

                                • Juliate 10 days ago

                                  The famous "multi cloud" or "cloud agnostic" thing. We're not there yet.

                                  AWS, GCP & Azure still feel a lot like PC, Amiga and Macintosh at this moment.

                                  • sarathyweb 10 days ago

                                    For my UG project, I am building a platform to provision, monitor and manage cloud resources from cloud providers including AWS, Digitalocean and GCP through a single web interface.

                                    The platform also has the ability to deploy commonly used web applications like WordPress, Moodle, etc.

                                    I will launch it here on HN when the platform is ready.

                                    If you have any questions or suggestions, please let me know.

                                • neilwilson 10 days ago

                                  Neither are other cloud providers I could mention kof.

                                  Plenty of us out there.

                                • tasubotadas 10 days ago

                                  That would explain why I sometimes have to wait ~1h for queued Azure DevOps pipeline builds to start.

                                  • taspeotis 10 days ago

                                    They do document capacity problems with the hosted pools on their status page https://status.dev.azure.com/_history e.g. from their last event:

                                    > Capacity constraints due to increased demand stemming from the global health pandemic are causing pipeline delays when using our hosted pools. We are working on mitigations, but currently expect the issue to persist for at least the rest of 25-March peak hours. You can work around these issues by temporarily moving critical pipelines to self-hosted agents.

                                  • paulcarroty 10 days ago

                                    I worked on a small Azure setup several weeks ago, and my experience may be useful to other people:

                                    Pro:

                                    - you can use shell in browser

                                    - traffic is cheaper relative to AWS

                                    - fast 1GbE network

                                    Cons:

                                    - VM deploy is VERY slow, 2-3 minutes

                                    - no IPv6 out of the box, you need a balancer(!) and 4-5 non-trivial shell commands

                                    - attaching new storage was an extremely painful experience

                                    In general, Azure feels like just a middling cloud service.

                                    • JackPoach 10 days ago

                                      Yep, people are finally realizing that 'cloud' isn't something magical and limitless. It's just a bunch of servers, connected together, each with a limit on how much data it can store and process.

                                      • satanspastaroll 10 days ago

                                        I doubt anyone actually thinks that.

                                        It's understandable to be surprised; it's not every day that everyone needs resources at the same time. Still, some foresight a month earlier couldn't have hurt.

                                      • Just1689 10 days ago

                                        I think this introduces some interesting points to the DR and BCP conversation.

                                        Is it a safe bet that we can rely on the cloud to have capacity? Normally I wouldn't doubt it, but in this sort of situation it becomes more likely that they will be put under capacity stress.

                                        Will the cloud vendors learn and build slack in? I think they're very lean operations and maybe this kind of slack would damage the profitability too much.

                                        If the cloud vendors can't guarantee capacity (I suspect this will be the conclusion), then what does that mean for our DR and BCP planning?

                                        • redis_mlc 10 days ago

                                          > Normally I wouldn't doubt it but in this sort of situation

                                          Then you're very misinformed.

                                          As a cloud administrator, I see resource availability and account limits on a weekly basis going back years.

                                          I tell people:

                                          - to pre-provision at least some extra servers rather than wait for an autoscaling operation to fail.

                                          - that new instance types often are rolled out gradually, and lead time is often 1 month in AWS

                                          - that killing a 1000-node cluster then expecting to immediately rebuild it often doesn't work.

                                          - for DR and BCP planning, each region (or AZ) should be able to handle enough load at all times in case one region (or AZ) is unavailable. I've never seen anybody do that, even after I told them, because cost.
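
                                          The cost in that last point can be put in numbers with a small sketch. It assumes load is spread evenly across regions and that the surviving regions must carry the full peak on their own; the function names and the 300-instance figure are hypothetical.

```python
import math


def instances_per_region(peak_instances: int, regions: int,
                         tolerated_failures: int = 1) -> int:
    """Instances each region must run so the survivors can carry the peak."""
    surviving = regions - tolerated_failures
    if surviving < 1:
        raise ValueError("need at least one surviving region")
    return math.ceil(peak_instances / surviving)


def overprovision_factor(regions: int, tolerated_failures: int = 1) -> float:
    """Total provisioned capacity divided by what peak load alone needs."""
    return regions / (regions - tolerated_failures)


# Hypothetical workload: 300 instances at peak, spread over 3 regions.
per_region = instances_per_region(300, 3)
total = per_region * 3
print(per_region, total, overprovision_factor(3))
```

                                          Pre-provisioning this way means paying for the spare capacity all the time instead of relying on a scale-out that may fail during the outage, which is the "because cost" part.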

                                          • ldoughty 10 days ago

                                            For AWS, limit monitor is a handy tool for "small" customers:

                                            https://aws.amazon.com/solutions/limit-monitor/

                                            It starts having issues when you get to 5,000+ ec2 instances, but it's somewhat understandable that they don't aim to support that level of usage within a single AWS account.

                                            On another bullet point: if you go serverless (API/HTTP Gateway, Lambda, DynamoDB), you automatically get full region DR. I personally recommend HTTP Gateway if you can swing it; API Gateway is only worth it if you are doing personal projects (mostly free tier) or are seriously leveraging the API Gateway-specific features.

                                            • yekta_ 10 days ago

                                              Seems like there's some confusion on what that one really does.

                                              It only notifies you about your own Service Limits, so you will know before you hit one in an unfortunate moment. It's important to monitor that, but it doesn't protect or notify you against cloud provider's own limitations. A scale-out event can still fail if AWS has no more extra capacity ("full") even if your limits allow you otherwise.

                                              AFAIK there's currently no way to know beforehand whether they actually have the capacity or not.

                                            • jiggawatts 10 days ago

                                              bUt ThE sAlEs GuY sAiD tHe ClOuD iS cHeApEr!

                                              • scarface74 10 days ago

                                                No one who works with cloud infrastructure would tell you the cloud is cheaper on a resource basis. It’s only cheaper if you can take advantage of elasticity and use managed services to reduce the overhead of managing infrastructure.

                                                But then don’t hire a bunch of old school net ops folks who got one certification and call themselves consultants when all they know how to do are lift and shifts.

                                                • jiggawatts 9 days ago

                                                  Can I visit the planet you -- and the three guys that voted my post down -- live on? Please?

                                                  Literally everywhere I go, and it's a lot of different organisations, there's people at every level and every job title unconditionally stating that the cloud is "cheaper".

                                                  They look at a cloud VM tier that is literally 3x more expensive per year than their current hardware and say that it is "cheaper" with a straight face.

                                                  They look at a SaaS that costs $1M per year while they're currently "burning" $200K in staffing costs to maintain an equivalent on-prem and say that they have to move to the cloud to save money.

                                                  They run VMs 24/7 with zero users and say that this is efficiency.

                                                  Nobody I've seen uses reserved instances or any similar cost-cutting measure, such as spot instances.

                                                  For fuck's sake, I just worked on a project to move users from one cloud service that cost $30/user/month to another cloud service that costs $100/user/month supposedly to... wait for it... "save money".

                                                  • scarface74 9 days ago

                                                    And I addressed everything you said....

                                                    But then don’t hire a bunch of old school net ops folks who got one certification and call themselves consultants when all they know how to do are lift and shifts.

                                            • tyingq 10 days ago

                                              Assuming they expand the list of services it can provide, products like the AWS Outpost might be the eventual fix for that. Very expensive and limited right now though.

                                            • rzmnzm 10 days ago

                                              That's been a recurring issue with the UK South region ever since they introduced it.

                                              Theregister even reported on it a couple of years ago

                                              https://www.theregister.co.uk/2017/05/04/microsoft_azure_cap...

                                              • tyingq 10 days ago

                                                I'm watching many struggle also with on-prem VPNs, Citrix, WebEx, and so on. Though there do seem to be honest efforts to shore those up and also try more modern tools. I imagine a lot of stodgy companies will have a much better WFH environment after all the dust settles.

                                                • jiggawatts 10 days ago

                                                  I'm running around fixing load balancers, SSL gateways, NetScalers, converting double-hop to single-hop access, upgrading key components, etc...

                                                  It's kinda fun, but it's also infuriating that 90% of our customers decided to wait until the week before the lockdown that we warned them would be coming months ago.

                                                • swebs 10 days ago

                                                  Are there any good guides on switching from Azure to AWS?

                                                  • abafazi 10 days ago

                                                    I bet AWS or GCP will never have a problem like this because they're not stupid enough to operate at only 20% spare capacity at any given time

                                                  • rcarmo 10 days ago

                                                    Disclaimer: I work for Microsoft. I have no particular info on this, but I did read the article _yesterday_ over breakfast, followed links to complaints, etc., and would like to point out two things:

                                                    1. This appears to be a UK-centric thing (and those datacenters don't have the full Azure portfolio, as can be seen here: https://azure.microsoft.com/en-us/global-infrastructure/serv...)

                                                    2. The very last paragraph on the linked article reads: "Note that Azure is a huge service and it would be wrong to give disproportionate weight to a small number of reports. Most of Azure seems to be working fine. That said, capacity in the UK regions was showing signs of stress even before the current crisis, so it is not surprising that issues are occurring now."

                                                    All of this is public info, so maybe people should read up on facts first? :)

                                                    • tibiapejagala 10 days ago

                                                      Not sure if it is a UK-only issue. Couldn't start a VM yesterday in other European regions. Today I have some other problems with Azure search services.

                                                      • rcarmo 10 days ago

                                                        P.S.: Thanks for the downvotes, I needed a reminder of the HN reaction to pointing out things people gloss over in articles...

                                                        • estensen 10 days ago

                                                          We're having problems creating VMs and databases in West Europe and North Europe, so it's not just a UK problem.