In 2008, the idea was that if you bundle up a large bunch of mortgages, the bundle will have low risk because the chance of everything failing at the same time is low. The cloud is designed on the same premise: the usage spikes of individual customers can always be served because one customer is very small compared to the whole infrastructure.
However, in some cases, these mortgages/resource spikes become highly correlated.
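To make the correlation point concrete, here is a rough simulation sketch. Every number in it (customer count, spike probability, capacity, and the `p_shock` chance of a common event) is a made-up assumption for illustration, not data from any provider or from 2008.

```python
import random


def overload_rate(n_customers, p_spike, capacity, p_shock, trials=1000, seed=0):
    """Fraction of trials in which simultaneous spikes exceed capacity.

    p_shock is a hypothetical parameter: the chance of a common event
    that makes every customer spike in the same trial.
    """
    rng = random.Random(seed)
    overloaded = 0
    for _ in range(trials):
        if rng.random() < p_shock:
            # Correlated case: everyone spikes together.
            spiking = n_customers
        else:
            # Independent case: each customer spikes on their own.
            spiking = sum(rng.random() < p_spike for _ in range(n_customers))
        if spiking > capacity:
            overloaded += 1
    return overloaded / trials


# Capacity is sized at more than twice the average number of spikers (25).
independent = overload_rate(500, 0.05, capacity=60, p_shock=0.0)
correlated = overload_rate(500, 0.05, capacity=60, p_shock=0.02)
print(independent, correlated)
```

With independent spikes the pool is essentially never overloaded, but even a small chance of a shared shock makes overload a regular event, no matter how small each customer is relative to the whole.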
> Only a small fraction of the members use the gym at any one time
Only a fraction of the members use the gym at all. If every member of the gym wanted to use it there would be no reasonable schedule to make that possible. ~50% of gym members use it less than 100 times per year, and only ~25% use it consistently.
Depending on legislation and the size of the bank, banks have to keep 0/3/10% in reserves, which is far worse than most clouds or gyms would ever offer.
Garbage collectors, like clouds, have usage-based pricing, and cannot handle everyone spiking at once. They can handle everyone at normal baseline usage, but that's just like saying a cloud provider can handle everyone using the same number of reserved instances they’ve purchased on a long-term basis.
More relevant to the cloud analogy, if all their customers wanted to purchase a large extra pickup (beyond their normal baseline) on their normal day, which is part of garbage service offerings, they wouldn't be able to accommodate it.
1. Enter the URL of the cloud provider's status page into your browser and press enter.
2. If the status page loads instantly, all services are go.
3. If the status page takes between 2 and 5 seconds to serve, the cloud provider is experiencing a slowdown.
4. If the status page takes between 5 and 30 seconds to load, the cloud provider is experiencing a major problem.
5. If the status page takes between 30 seconds and 1 minute to load, requires you to refresh before you can see it, or fails to load completely such as with missing images, then the cloud provider is experiencing widespread problems in multiple regions and has only sporadic availability.
6. If the status page doesn't load at all, all services are down. Check the CEO's Twitter page.
7. If the CEO's Twitter page has a pinned tweet telling you not to worry, then all of your data has been lost.
From my experience, status pages usually are out of date unless they are internal. If you want the real info and are seeing issues on your end, check Twitter or open a ticket. Usually Twitter is faster.
I heard from an insider that some Azure services had a 10x growth because of the recent changes in our society. It's not like you can prepare for a 10x hit.
My personal experience with our AWS CI infra is that it's struggling more and more recently. Builds are slower on average than a couple of weeks ago. Maybe those vCPUs are not the same vCPUs as yesterday ;D.
> Capacity constraints due to increased demand stemming from the global health pandemic are causing pipeline delays when using our hosted pools. We are working on mitigations, but currently expect the issue to persist for at least the rest of 25-March peak hours. You can work around these issues by temporarily moving critical pipelines to self-hosted agents.
Yep, people are finally realizing that 'cloud' isn't something magical and limitless. It's just a bunch of servers, connected together, each with a limit on how much data it can store and process.
> Normally I wouldn't doubt it but in this sort of situation
Then you're very misinformed.
As a cloud administrator, I have run into resource availability problems and account limits on a weekly basis, going back years.
I tell people:
- to pre-provision at least some extra servers rather than wait for an autoscaling operation to fail.
- that new instance types are often rolled out gradually, and lead time is often 1 month in AWS.
- that killing a 1000-node cluster then expecting to immediately rebuild it often doesn't work.
- for DR and BCP planning, each region (or AZ) should be able to handle enough load at all times in case one region (or AZ) is unavailable. I've never seen anybody do that, even after I told them, because cost.
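The last bullet comes down to simple arithmetic, which also shows why cost is the usual objection. A minimal sketch, assuming load is spread evenly across regions and measured in some uniform unit such as servers; the function names and the numbers are hypothetical:

```python
import math


def per_region_capacity(peak_load, regions, tolerated_failures=1):
    """Capacity each region needs so the surviving regions can carry peak_load."""
    survivors = regions - tolerated_failures
    if survivors < 1:
        raise ValueError("need at least one surviving region")
    return math.ceil(peak_load / survivors)


def overprovision_factor(regions, tolerated_failures=1):
    """Total provisioned capacity relative to peak load (1.0 means no spare)."""
    survivors = regions - tolerated_failures
    if survivors < 1:
        raise ValueError("need at least one surviving region")
    return regions / survivors


# Hypothetical example: 900 servers at peak.
print(per_region_capacity(900, regions=3))  # 450 per region, 1350 in total
print(overprovision_factor(3))              # 1.5
print(overprovision_factor(2))              # 2.0
```

With two regions (or AZs) you pay for double the peak; with three, for one and a half times the peak. The same formula applies at the AZ level.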
It starts having issues when you get to 5,000+ EC2 instances, but it's somewhat understandable that they don't aim to support that level of usage within a single AWS account.
On another bullet point: if you go serverless (API/HTTP Gateway, Lambda, DynamoDB), you automatically get full region DR. I personally recommend HTTP Gateway if you can swing it; API Gateway is only worth it if you are doing personal projects (mostly free tier) or are seriously leveraging the API Gateway-specific features.
Seems like there's some confusion on what that one really does.
It only notifies you about your own Service Limits, so you will know before you hit one at an unfortunate moment. It's important to monitor that, but it doesn't protect you from, or notify you about, the cloud provider's own limitations. A scale-out event can still fail if AWS has no more extra capacity ("full"), even if your limits would otherwise allow it.
AFAIK there's currently no way to know beforehand whether they actually have the capacity or not.
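Since capacity can't be checked up front, about the best you can do is treat a "full" response as an expected outcome and fall back to other instance types or zones. A rough sketch of that pattern follows. The `launch` callable and the `InsufficientCapacity` class are hypothetical stand-ins, not a real SDK API (EC2 itself reports this condition with the `InsufficientInstanceCapacity` error code).

```python
class InsufficientCapacity(Exception):
    """Hypothetical stand-in for a provider-side 'no capacity' error."""


def launch_with_fallback(launch, candidates, count):
    """Try each (instance_type, zone) pair until one launch succeeds.

    `launch` is a caller-supplied callable (hypothetical) that raises
    InsufficientCapacity when the provider is full.
    Returns (instances, failures) so callers can log what was skipped.
    """
    failures = []
    for instance_type, zone in candidates:
        try:
            return launch(instance_type, zone, count), failures
        except InsufficientCapacity as exc:
            failures.append((instance_type, zone, str(exc)))
    raise InsufficientCapacity(f"all {len(failures)} candidates were full")


def fake_launch(instance_type, zone, count):
    """Fake launcher for the demo: pretends zone-a is out of capacity."""
    if zone == "zone-a":
        raise InsufficientCapacity(f"{instance_type} unavailable in {zone}")
    return [f"{instance_type}-{zone}-{i}" for i in range(count)]


instances, skipped = launch_with_fallback(
    fake_launch, [("m5.large", "zone-a"), ("m5.large", "zone-b")], count=2
)
print(instances, skipped)
```

This doesn't help when every candidate is full, which is why pre-provisioning still matters.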
No one who works with cloud infrastructure would tell you the cloud is cheaper on a resource basis. It’s only cheaper if you can take advantage of elasticity and use managed services to reduce the overhead of managing infrastructure.
But then don’t hire a bunch of old school net ops folks who got one certification and call themselves consultants when all they know how to do are lift and shifts.
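The elasticity point above is easy to check with back-of-the-envelope arithmetic. A sketch with entirely made-up numbers: the cloud unit is assumed to cost 3x the fixed unit per hour, and the fixed fleet has to be sized for peak demand around the clock.

```python
def fixed_cost(hourly_demand, unit_price):
    """Fleet sized for peak demand and paid for every hour."""
    return max(hourly_demand) * len(hourly_demand) * unit_price


def elastic_cost(hourly_demand, unit_price):
    """Pay only for the units actually used in each hour."""
    return sum(hourly_demand) * unit_price


# Hypothetical prices and workloads, per server-hour.
FIXED_PRICE, CLOUD_PRICE = 1, 3

spiky = [10] * 20 + [100] * 4  # quiet most of the day, 4-hour peak
flat = [100] * 24              # lift-and-shift of a steady workload

print(fixed_cost(spiky, FIXED_PRICE), elastic_cost(spiky, CLOUD_PRICE))
print(fixed_cost(flat, FIXED_PRICE), elastic_cost(flat, CLOUD_PRICE))
```

Under these assumptions the spiky workload is cheaper in the cloud (1800 vs 2400), while the flat lift-and-shift workload costs three times as much (7200 vs 2400).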
I'm also watching many struggle with on-prem VPNs, Citrix, WebEx, and so on, though there do seem to be honest efforts to shore those up and also try more modern tools. I imagine a lot of stodgy companies will have a much better WFH environment after all the dust settles.
2. The very last paragraph on the linked article reads: "Note that Azure is a huge service and it would be wrong to give disproportionate weight to a small number of reports. Most of Azure seems to be working fine. That said, capacity in the UK regions was showing signs of stress even before the current crisis, so it is not surprising that issues are occurring now."
All of this is public info, so maybe people should read up on facts first? :)