12 comments

  • jedberg 13 days ago
    I work for a company (Lambda Labs) that would span all these categories. We build GPU servers and sell them, shipping them off to the customer. We also build them and sell them, but then host them for the customer. We also buy them for ourselves and rent them out, either bare metal or virtualized.

    The "build it and ship it to the customer" business is waning. The reason is because even the most deep pocketed customers no longer have data centers that can support the power and cooling needs of current GPUs. The latest from NVIDIA requires liquid cooling, tons of power, and solid floors (the racks are too heavy for raised floors).

    Being able to run a modern server is becoming a very niche skill, one that most companies don't want or need to invest in.

    • candiddevmike 13 days ago
      With the rise of WFH, I wonder if there could be a model (with some kind of trusted computing angle to prevent physical tampering) where you host your servers at employee locations/homes in a very distributed manner. Probably wouldn't fly with things like SOC2, but it's probably cheaper to pay/reimburse employees than pay for rack and network, especially with a small footprint.
      • anon373839 13 days ago
        The heat, noise, and space consumption would be unbearable, I think.
        • bravetraveler 12 days ago
          My garage (in Texas of all places) does fine

          Not to say we're free of challenges... but hey. We can make this happen. I have a couple racks in the space where car doors are expected, yet can still park.

          Think of the business continuity wins! Each employee has a Starting Google!

      • ianburrell 12 days ago
        Reliability is one reason: residential internet and power are less reliable than commercial. You will have to provide a UPS to everybody. Plus, if employees are located near each other, there is a risk that a regional power outage could take down the system.

        Then there is latency. Most apps talk to a data store, which is now separated from the clients. The load balancers are going to be separated from the backend. Also, database servers tend to be beefy machines, so you either need to shard the data or someone has to host a rack.

        Another is that you need to write software to be truly distributed. There is a difference between being distributed for when things break and being distributed all the time over the internet.

      • countvonbalzac 13 days ago
        I wonder if that's less efficient though because now every employee has to manage a server, rather than having a small dedicated team that can manage hundreds or thousands of servers.
        • candiddevmike 13 days ago
          Just have a fleet of NUC-like things (ones that have ECC RAM) that you ship around if they break. You could have them boot from network/CDN and never worry about imaging, too.

          With 1Gb+ residential internet speeds, you could probably build a fairly robust and scalable platform like this.

          • solarpunk 13 days ago
            If companies are seriously considering this, they should instead figure out how to distribute applications across their workstations to begin with: no need to distribute extra hardware if you can ensure security, connectivity, and uptime of a single laptop at each employee's house.
          • ianburrell 12 days ago
            Rack servers are significantly more powerful than a NUC. This means you're going to need more little servers, and more employees, or employees that host lots of servers and turn into a point of failure. There are also powerful servers, like databases, that can't be split.

            How many people do you think have symmetric gigabit? The company is going to have to pay for employees to upgrade, and do some separation to keep work and home traffic from saturating the link.

      • jeffjobs4000 11 days ago
        If there is physical access to the hardware, then the data and software are not secure. So you couldn't run any business applications, or anything where you would need an assumption of data privacy.
      • JambalayaJim 13 days ago
        This is ludicrous and I’d revolt. A desk is already too much space to give up to my workplace.
    • dylan604 13 days ago
      Do these solid floor racks still have troughs between the racks for flooding?
      • ytjohn 12 days ago
        Datacenters I've been in that were built with "no raised floors" and flooding in mind do have a "raised floor", but it just happens to be solid concrete. This is very much the same as a warehouse or factory floor. It's built several feet higher than the surrounding parking lot, so that everything is level with loading docks.

        If you're talking about in-building flooding, not much is going on there other than the racks themselves keeping the equipment off the floor. All power is run from overhead. A lot of modern "rack groups" (OCS, TSCIF, etc.) are designed with limited water protection in mind.

      • wmf 13 days ago
        Not that I've seen. It's just a concrete slab.
    • echelon 13 days ago
      > The latest from NVIDIA requires liquid cooling, tons of power, and solid floors

      We won't be running on GPUs forever. The innovation in the chip space will continue.

      With as much margin as cloud takes, there will be increased interest in on-prem and mixed environments.

      We run our own GPUs in production as well as having a cloud footprint.

      • latchkey 13 days ago
        GPUs are no longer GPUs... AI accelerators (AIAs) is a better term. My MI300x don't have display ports. ;-)

        I actually see a lot of parallels to bitcoin mining. It started on CPUs, then GPUs, then FPGAs, and soon after ASICs. We are starting to see something similar in the AI space with Groq and Tenstorrent.

        The problem with ASICs today is that, due to AI's need for memory, you need significantly more hardware to get to the same compute density as an AIA.

        Right now, the MI300x is the leader with 192 GB; if you do the math on Groq, it just doesn't work out. This is part of the reason why they stopped selling to retail and are just focused on hosting services.

        It will be a while before the smaller chipmakers can get their hands on enough HBM to make a dent in the AIA market (due to AMD/NV hogging it all). Don't forget that AMD/NV can and will just come out with better designs too.

      • jedberg 13 days ago
        > We won't be running on GPUs forever. The innovation in the chip space will continue.

        Sure, but I'm seeing a divergence. There is a group of chips that are getting smaller and require less power and cooling, yet can handle a lot of the workloads you used to need huge chips for, while the most advanced chips for the most advanced workloads are getting bigger and more power hungry.

        There will always be big customers who need the big chips, and that is what will be nearly impossible for them to self-host.

        • PLenz 13 days ago
          This is the same argument that mainframe companies made in the 80s. Mass commodity systems will eventually kill the big gpu except for legacy and niche workloads. It's the tech infra circle of life.
          • cogman10 13 days ago
            I'm less optimistic. Unlike the 80s we are quickly approaching the limits of node feature sizes. Barring radically new designs and materials, I think it's reasonable to assume that we are reaching the power:computational limits of silicon.
            • latchkey 13 days ago
              Maybe in the case of CPUs. But, there is a fundamental technical shift going on away from computationally expensive things being done on CPUs and moving to AIAs. I suspect that this is why Jensen said Moore's law is dead.
              • refulgentis 13 days ago
                Jensen said Moore's Law is dead because of the dichotomy explained a few times -- the free lunches are coming fewer and farther between; you can't get a substantial performance boost and power decrease every 18 months, per Moore's Law.

                We are reaching the power:computational limits of silicon.

                Both GPU and CPU.

                • latchkey 13 days ago
                  Here is the quote:

                  “Moore’s Law’s dead,” Huang said, referring to the standard that the number of transistors on a chip doubles every two years. “And the ability for Moore’s Law to deliver twice the performance at the same cost, or at the same performance, half the cost, every year and a half, is over. It’s completely over, and so the idea that a chip is going to go down in cost over time, unfortunately, is a story of the past. Computing is not a chip problem, it’s a software and chip problem,” Huang said.

                  What we are seeing now is software engineers offloading computationally expensive workloads to AIAs, more and more. This is enabled through the use of libraries like PyTorch and access to HPC levels of compute that were not broadly accessible before.
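
                  A minimal sketch of what that offload looks like in practice (PyTorch here; the matrix sizes are arbitrary and just for illustration):

                    import torch

                    # Pick an accelerator if one is present; otherwise fall back to the CPU.
                    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

                    # The same matmul, just placed on the accelerator; no hand-written kernels.
                    a = torch.randn(8192, 8192, device=device)
                    b = torch.randn(8192, 8192, device=device)
                    c = a @ b
                    print(c.device)  # "cuda:0" on an NVIDIA or ROCm box, "cpu" otherwise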

                  • MichaelZuo 13 days ago
                    The 'AIA's also have a scaling limit, and the coordination overhead increases exponentially too.

                    e.g. A million chickens could have a thousand times more muscle mass than two strong oxen, but I doubt anyone could plow a field even 2x faster with a million chickens.

                    So it's only a net benefit when the work can be split up into tiny parallel chunks and then recombined with near-perfect efficiency.

                    • latchkey 13 days ago
                      > So it's only a net benefit when the work can be split up into tiny parallel chunks and then recombined with near-perfect efficiency.

                      Hasn't that pretty much always been the case?

                      One thing we are seeing more and more of is composable fabrics, where the PCIe bus is effectively extended outside the case such that you log into a single instance and, instead of seeing just 8 AIAs, you now see 32+. This makes the coordination a lot easier.

                      • refulgentis 13 days ago
                        I don't know how you keep rewording simple things everyone knows and are being patiently explained to you, as if you are encountering them the first time in the thread.
                        • latchkey 13 days ago
                          Which composable fabric am I talking about then?
                          • refulgentis 13 days ago
                            >> So it's only a net benefit when the work can be split up into tiny parallel chunks and then recombined with near-perfect efficiency.

                            > Hasn't that pretty much always been the case?

                            I was talking about that. The fabric stuff is a non sequitur, random hardware; it doesn't make GPUs 2x in speed every 18 months, much less lead to a shift in everyday computing loads to GPUs.

                            1. Moore's Law also applies to GPUs.

                            2. If we could make use of 1000 cores for anything but long tail tasks, ye average CPU would have a lot more than 8 cores by now.

                            That's what the million chickens thing is about.

                            You can give me ultra-unobtainium fabric for "AIAs", it doesn't matter unless I have an algorithm that's massively parallelizable.

                            • latchkey 13 days ago
                              You seem stuck on the Moore's law of things and I've moved past that.

                              To be clear, I was responding to this from the OP above:

                              "and the coordination overhead increases exponentially too"

                              Cross node coordination is complicated and more easily managed with composable fabrics, which is why I brought it up.

                              --

                              I went back and read a bunch of your older comments here. At first, I thought it was personal but the tone that you use with me is carried across many comments that you've made to others. These also tend to be heavily downvoted.

                              I went and looked you up a bit more, it seems you had an exit to Google and then worked there for a number of years. There is something about how you're communicating that seems brilliant, but at the same time, is missing some level of EQ.

                              Like you're super smart, but also have some sort of ego about it all. As if you feel a deep need to somehow prove yourself as being "better" than the people around you. Saying things like this: "as if you are encountering them the first time in the thread"... is honestly, just rude.

                              My goal here is to help educate and discuss things as best as I know how. If that isn't good enough for you, honestly, just keep it to yourself. Unlike you, I'm not putting my ego into it. If you have a personal problem with me or my background or what I'm saying, my suggestion would be to stick with the technicals and leave the ego at the door.

                              Feel free to say whatever you want, but I won't respond to you any further.

                              • refulgentis 12 days ago
                                I'm sorry, I'm just here to share knowledge.

                                There's a few things you got very wrong. I'm happy to talk about my background, but...I have a weird feeling you aren't actually interested.

                                Seems you thought I was putting you down personally so you needed to find out who I was. I don't know you from an 8 year old. You're brilliant too or whatever you need to hear.

                                Whole thing made me cringe. It's not worth it my friend. Makes you feel good until 30 seconds after you click post.

                                Going off on a long personal attack, claiming you did your OSINT research on your interlocutor, and they're victimizing you with their vicious claims that GPUs are limited by Moore's Law...not a good look.

                                • latchkey 12 days ago
                                  > I'm sorry, I'm just here to share knowledge.

                                  Apology accepted.

                                  Update: Weird, now you keep going after I accepted your apology. Now you're editing your comments to change things around to somehow make yourself look better. ¯\_(ツ)_/¯

                                  • refulgentis 12 days ago
                                    Also I flagged your comment, way out of bounds for HN. Also went back, lol, you got enraged because I didnt follow your rule I was only supposed to talk about the last sentence in the post you replied to. And now I remember why I was rolling my eyes and being direct with you. You were being rude to other people, kept changing the subject, and not addressing what people were saying. I.e. you completely ignored 2 different people's analogy about chickens, replied more about hardware, then I pointed out you did that, and you tried to pop quiz me, then when I patiently explained the relevance of their two comments, you went off about how you were done talking about anything else anyone said other than one sentence and I was trying to make myself feel smart (??? You're the one doing the pop quiz!!). Bonkers behavior.
                    • philipov 13 days ago
                      What does AIA stand for? Is that short for "AI Assisted Chip Design" or something else?
                    • wbl 13 days ago
                      The chickens won that decisively.
          • fragmede 13 days ago
            What's different today is the divide between training and inference. Inference is ridiculously cheap compared to training, and we're still early days with optimizations across the whole of the stack, so we'll have to see how it develops. Once constant training gets figured out, then we're really in for a ride.
      • nradov 13 days ago
        The demand for compute utility capacity is effectively infinite. Eventually we'll move on from GPUs to some new architecture but the servers will still have to run in data centers that can provide enough power and cooling.
      • marcosdumay 13 days ago
        > With as much margin as cloud takes, there will be increased interest in on-prem and mixed environments.

        Or cheaper clouds...

        I wonder how much lock-in the current clouds actually have. And whether they'll take customers down, or hold entire industries hostage, like the mainframes did.

        • latchkey 13 days ago
          GPU clouds use long term reserved cloud contracts as lock-in. Lambda started off kind of bragging about shorter term contracts on their website and then changed that to one year after the demand picked up (and investors piled in tons of money).

          https://lambdalabs.com/blog/voltron-data-case-study-why-ml-t...

          "While most other cloud companies offer contracts starting at one year, you can reserve a Lambda Reserved Cloud Cluster starting at six months."

    • juliangoldsmith 13 days ago
      Is any of this related to newer standards like Open Rack?
  • zer00eyz 13 days ago
    What I find odd is that I know of a lot of companies looking at moving back to their own hardware.

    They are going to need a hell of a lot fewer boxes today than in 2009 (15 years ago)

    This is 2009: https://www.intc.com/news-events/press-releases/detail/1307/...

    I can get a 128 core system today, if I need the density. And that is the thing, density is through the roof, and pricing is roughly the same as 2009.

    The same thing has happened to storage performance and (to a degree) scale.

    What would have taken a few racks 15 years ago barely fills one today.

    30 percent of AWS's revenue is profit. That's what you're leaving on the table. For a lot of companies (at least the ones that can plan ahead and do basic accounting) that does not make sense. Even more so when lending is tight and every dollar counts.

    • jedberg 13 days ago
      > 30 percent of AWS's revenue is profit.

      If you built exactly what AWS has you wouldn't recapture that because most of that comes from operating efficiencies at scale. Yes, you can get some of that back, but it won't be 30%.

      • SoftTalker 13 days ago
        Yes you have to think of the overhead. It's not just buying a rack and some hardware. You need power, cooling, standby generators, redundant internet connectivity, firewalls, people who know how to run all of those things, and people to manage those people.

        Amazon and Google and Microsoft spread the cost of all of that out over many customers; you're paying for it yourself if you're doing it yourself.

        As long as there is no collusion, having at least a few big cloud providers should eventually drive prices down to close to the marginal cost of providing the service. Your job is to avoid lock-in to any specific provider's platform.

        • vidarh 13 days ago
          Most people rent rack space, or cages, rather than build data centers. I worked at Yahoo back when they were still 10k employees, and "their" data centers in London were still someone else's - they rented cages (if you haven't worked in colo'd data centers before, you can rent by the 19-inch rack, or sometimes fractions, or you can have them wall off a section for just your racks).

          This is a commodity service - there are several data centers within 20 minutes of me, and dozens within an hour.

          All of the services you list, including the staff and access to them on a per-incident or per-hour basis, are a commodity, to the point that the owners of these providers are increasingly real estate companies because they are "boring", relatively low-margin plays priced by the square metre and kilowatts of power and HVAC capacity, not tech companies.

          So are services for these data centre operators. There's a company within walking distance of me whose sole product is software for optimising HVAC costs for data centers.

          My current employer's core offering is digital twin solutions - being able to model and collect data from building sensors. Turns out a data centre needs mostly the same set of sensors as any other modern building, just different densities and layouts. Yes, you want raised floors for cable ducts. Most offices have dropped ceilings for the same reasons but less volume.

          It's a basic real estate play, and building your own is not where you save over a public cloud unless you're a unicorn, because you can recapture most of the margin by going to a provider that doesn't hide that from you.

          Again, it's a basic real estate play - the margins when you rent colo space are nowhere near AWS-level margins.

        • pjmlp 13 days ago
          Back in the day we would rent some racks, or virtual servers, at the local ISP.

          Very seldom it was done in-house.

          • dilyevsky 13 days ago
            It’s still mostly done that way, except you rent rack space from Equinix, CoreSite, etc., and buy transit separately from an ISP. It doesn’t make sense to build until you’re spending tens of millions a year on colos.
        • kaliszad 13 days ago
          Yes, but your offering doesn't have to be so general: for instance, on your own hardware there does not have to be any sharing of resources with non-trusted parties. You also don't need a lot of the accounting and billing stuff internally. And in general, if you are willing to drop a nine or two on availability/reliability, the costs fall much further. Obviously, you have to do your own estimates and calculations, but beating AWS on price in certain areas isn't really hard even if you just use a different hoster/cloud.

          Source: Ran a SaaS business with a fraction of the cost that I normally hear from people.

          • vidarh 13 days ago
            We used to price out an AWS transition every year at an old job, as well as managed hosting providers. Six or seven years in, Hetzner reached parity with our colo setup and we started using them (our colos were near London, with London real estate prices). AWS, however, never got within a factor of 2.

            Which is fine for many businesses where a 2x on hosting does little to your margins, or you e.g. serve customers whose infra is in AWS, but for my employer at the time, moving to AWS would have meant bankruptcy.

      • chasd00 13 days ago
        I think the idea is a company would only build what they use instead of replicating AWS feature for feature. If they just build what they need, then I think 30% is on the low side.
        • jedberg 13 days ago
          Power and cooling are the two main things that they amortize over many customers, and everyone needs those.
          • vidarh 13 days ago
            They're however low margin commodity services at a colo provider.

            Building your own data centers is the last of a long list of steps up in difficulty, and one most people won't need.

            Renting by the rack in a colo facility with a support contract is, however, approachable from the point your cost is above ~$10k/month. Depending on location, once you're above ~10 racks or so per physical location, a dedicated cage can start to make sense, and a bit above that a data centre where you can rent office space for a couple of your own dedicated staff.

            Even many colo providers haven't owned all of their own setups for years. I think 2001 or so was the first time we had facilities in a colo building where half the building had been leased, ready-made with raised floors, HVAC, generators, etc., to another colo provider which "just" had to staff it and handle sales and accounts.

            You can go to a commercial real estate broker and get a fractional data centre building, or you can go to a data centre operator and get a fractional, functional, staffed data centre - anything from large percentages of the space, with locked steel cages or walls separating your part from the rest, down to individual 1U server slots.

            • vbezhenar 13 days ago
              How do you deal with hardware management? It seems that at this scale, hardware does not break often enough to warrant a full-time system administrator position. However, when hardware breaks, you need to fix it ASAP; you can't spend time looking for someone on the market for a one-off job.
              • vidarh 13 days ago
                Firstly, if you need to fix hardware ASAP, you're dangerously underprovisioned. You avoid that either by having excess capacity - it's still cheaper than cloud unless you ridiculously overprovision - or you pick a provider that also offers managed servers or cloud servers, and fall back on using that for elasticity (this, incidentally, makes public cloud even less attractive cost-wise - being able to tie VMs or managed servers into your infra to handle spikes means you can run your own servers far closer to capacity, and so the cost per unit of compute drops dramatically).

                But when you do need to, any colo provider offers "remote hands" and 24/7 staffing, so if you want you can usually have your server shipped straight to the data centre and just raise a ticket to have them rack and connect it.

                I have also been second level support for companies for this - getting people, or companies, on retainer separate from the data centre operator, to handle all your ops needs including going out to the data centre in the middle of the night to fix issues their on-site staff can't, is a service you trivially buy.

                But any reasonable setup these days will have IPMI and/or network boot, so you can have it slotted in and automatically boot into an installer image providing remote access to complete the setup, as well as power cycle servers and check console output, so hands-on access is rarely needed beyond the initial racking, pulling and replacing hot-swappable drives, or eventual removal.
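
                As a rough illustration, most of that remote management reduces to a handful of BMC calls. A minimal sketch, Python shelling out to ipmitool; the BMC address and credentials are placeholders:

                  import subprocess

                  # Placeholder BMC address/credentials; in practice these come from your inventory.
                  BMC, USER, PASSWORD = "10.0.0.42", "admin", "changeme"

                  def ipmi(*args):
                      # ipmitool over the network (lanplus), so nobody has to touch the box.
                      cmd = ["ipmitool", "-I", "lanplus", "-H", BMC, "-U", USER, "-P", PASSWORD, *args]
                      return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

                  print(ipmi("chassis", "power", "status"))  # e.g. "Chassis Power is on"
                  ipmi("chassis", "bootdev", "pxe")          # network-boot the installer on next reset
                  ipmi("chassis", "power", "cycle")          # remote power cycle instead of a site visit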

                In practice, last time I had racks where we owned the servers as opposed to managed servers where none of this was our problem (still vastly cheaper than public cloud, though somewhat less flexible if you want it to remain cheap), it took on average a fraction of a day per year per rack time spent dealing with anything lower level than our orchestration layer. I tracked the time, and it remained fairly stable over many years, including in cases where our rack costs were so low that it was justified to keep servers in the racks years past the point where they'd been written off.

      • withinboredom 13 days ago
        You mostly don't need all the things AWS built. You need, maybe, 10-20% of it.
        • jedberg 13 days ago
          Right but you need the power and cooling, which is the main thing they amortize over many customers.
          • withinboredom 13 days ago
            I pay a colo for power (which also covers cooling) and the costs are still waaaay less expensive than EC2; orders of magnitude less. Like <300 USD a month (amortized) for hundreds of cores, hundreds of GB of RAM, hundreds of TB of disk space. The same capacity in EC2 instances (reserved 3-year, so it compares nicely) is thousands of USD per month.
            • latchkey 13 days ago
              Anecdotal response. That doesn't scale to every single business trying to run this hardware. If that happened, I guarantee that your bills would go up significantly. I've actually been hearing recently of people getting booted from their data center because the DC is trying to reconfigure their customer base in order to increase rack density.
              • withinboredom 13 days ago
                That's simply because there isn't as much demand, so supply is down. If demand goes up, supply goes up eventually, but price goes up until an equilibrium is reached.

                Eco 101.

                • latchkey 13 days ago
                  A year ago, I had a call with Digital Realty on a Friday. I was looking for 5-10 MW that they had available. They called me back on Monday to let me know that it was gone. Mind you, this was also at $0.35/kWh, which is like residential rates!

                  CoreWeave is in the process of opening 25 more data centers, concurrently.

                  Eco 101 on steroids.

          • vidarh 13 days ago
            Unless your spend is in the tens of millions a year, you don't generally build datacentres, you buy space in a colo.
    • ben-schaaf 13 days ago
      > I can get a 128 core system today, if I need the density.

      256 core now with bergamo.

      • latchkey 13 days ago
        This is what we have in our MI300x box. 2x9754. Beast of a CPU. Intel has nothing even close.
        • zer00eyz 13 days ago
          OHHHH That's like 500 cores in one box?

          cores/threads total? How much ram? disk? do you know the all in cost and rack U?

          There are lots of folks whose entire production infrastructure is half that core count or less.

          • latchkey 13 days ago
            Whole box is a beast. 3 TB RAM, 122 TB NVMe... plus the 8x MI300x (192 GB each). 2x 400G CX7 NICs.

            Here are the CPU specs: https://www.amd.com/en/products/cpu/amd-epyc-9754

            We went with higher core counts vs. higher clocks/cache due to wanting multi-tenancy on this box. The MI300x is so new that we wanted to be able to give people 1-2 GPUs at a time (via VM/PCIe passthrough), so we figured more cores are better for now.
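
            For the curious, handing a tenant a GPU that way mostly comes down to attaching its PCI address to their VM. A rough sketch with the libvirt Python bindings (the guest name and PCI address below are made up):

              import libvirt

              GUEST = "tenant-vm-01"                        # hypothetical guest name
              GPU_ADDR = ("0x0000", "0xc1", "0x00", "0x0")  # hypothetical PCI domain/bus/slot/function

              # Standard libvirt hostdev XML for passing one PCI device through to the guest.
              HOSTDEV_XML = """
              <hostdev mode='subsystem' type='pci' managed='yes'>
                <source>
                  <address domain='{}' bus='{}' slot='{}' function='{}'/>
                </source>
              </hostdev>
              """.format(*GPU_ADDR)

              conn = libvirt.open("qemu:///system")
              dom = conn.lookupByName(GUEST)
              # Attach the GPU to the guest's persistent config; inside the VM it looks like bare metal.
              dom.attachDeviceFlags(HOSTDEV_XML, libvirt.VIR_DOMAIN_AFFECT_CONFIG)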

            In the future, we will buy a mix of different CPUs (including Intel as well).

            Yes, of course I know the costs. It ain't cheap. ;-)

            • zer00eyz 13 days ago
              >> Yes, of course I know the costs. It ain't cheap. ;-)

              I hate that we have to be this way about prices (it's stupid).

              Core density is through the roof... Your one box today is the equal of what's likely a 15x multiplier. 15 years ago that much compute would have spanned 2 racks, or jammed one full of blade servers (and they were expensive).

              It's in what? 4? 8? U of space?

              • latchkey 13 days ago
                8u. Whole box weighs 300+lbs.
            • alcover 13 days ago
              Those specs amaze me.

              How far we have come, for those who remember 90's computing!

  • latchkey 13 days ago
    I'm building a business that takes on the capex/opex for businesses that do not want to invest in the huge upfront costs and effort to deploy the best compute and storage that we can get today. Especially around AI/ML/CFD types of workloads.

    I see this as a niche that lets you focus on building products and features while we handle the rest of the details. Gone are the days of just racking some servers, as deployment complexity has only increased exponentially over time. Especially with this newer hardware, the infrastructure needs, failure rates, and debugging issues are all through the roof.

    I see us as kind of a middle ground between hyperscalers and building it yourself. We give full bare metal access to the hardware, as if you own it yourself and are sitting at it. This should help protect at least a small amount against hyperscalers and HPC completely dominating the market.

    My competitors are all focused on one company beating another in this AI compute race, as if it is a sporting competition. My focus is on running only the best of the best; I don't care whose equipment it is.

    We have been doing large scale deployments for a while now, but just getting started in this new business. Looking forward to seeing how things develop over time.

    • kaliszad 13 days ago
      The deployment and operations of modern hardware certainly is a field in itself and always was. What experience do you have with that? It doesn't really seem exponentially more complex to me. Yes, accelerators are more common now and the heat production is similar to that of old rack-mountable mainframes, but that in itself is nothing too new. Also, we have had low-latency interconnects in the HPC space for ages - that hasn't changed much either; it is just more widely deployed now.
      • latchkey 13 days ago
        Hi, thanks for the response.

        30 years of being in tech and startups. Previously deployed, managed and optimized 150,000 GPUs across 7 data centers.

        Power usage has changed quite a bit, such that we now see articles about how older datacenters' power and rack density don't work any more [0]. Each of our MI300x boxes is 7 kW. That doesn't work with only a 20-amp circuit to your rack.
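
        The arithmetic is unforgiving (a quick sketch, assuming a common 208 V single-phase feed and the usual 80% continuous-load derating):

          # Why a single 20 A rack circuit can't feed one of these boxes.
          volts = 208          # common single-phase feed in US data centers (assumption)
          breaker_amps = 20
          derating = 0.8       # continuous loads are typically held to 80% of the breaker rating

          usable_watts = volts * breaker_amps * derating
          print(usable_watts)         # 3328.0 -> about 3.3 kW usable
          print(7000 / usable_watts)  # ~2.1 -> one 7 kW box needs more than two such circuits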

        Liquid cooling has been in development for ages, but is now being pushed to the front due to the higher power requirements of the servers. We have to lower the PUE, in order to increase rack density. This increases complexity through the entire supply chain.

        Networking is more complicated. For example, we are running into SONiC software issues with the need to deploy VRFs to support multiple customers, and it is buggy AF in some implementations from some vendors. NICs no longer talk to CPUs, they talk directly to GPUs (RoCE), and that means more interconnects (8x), along with the way they are connected and managed at the switches. One design even puts each GPU on a separate switch, requiring even more cables. When you are at 400G/800G, which chip you choose in your switch matters.

        Relationships. Dealing with multiple vendors, supply chain issues, finding power/space, all these are more complicated now that demand is far far higher. 6 years ago, Cenly Chen from Supermicro said AI spend in 2025 would be $36B [1]. Current estimates now are $750B+.

        You can't just order a box of MI300x from your local BestBuy. These things are export controlled by the US govt and effectively classified as weapons with restricted usage. We have to KYC our customers.

        This is just the tip of the iceberg, I could go on and on... now that I'm in the thick of things, I feel pretty confident that we are providing a pretty necessary service. Especially at the scale we are building for.

        [0] https://www.datacenterknowledge.com/hardware/data-center-rac...

        [1] https://youtu.be/WzqBuiwkv5I?feature=shared&t=58

        • kaliszad 12 days ago
          Thanks for the details. What a career!
  • jiayq84 13 days ago
    I run a startup called Lepton AI. We provide AI PaaS and fast AI runtimes as a service, so we keep a close eye on the IaaS supply chain. For the last few months we have seen the supply chain getting better and better, so the business model that worked 6 months ago - "we have GPUs, come buy barebone servers" - no longer works. However, a bigger problem emerges, probably one that could shake the industry: people don't know how to efficiently use these machines.

    There are clusters of GPUs sitting idle because companies don't know how to use them. It's embarrassing to resell them, too, because that makes the image look bad to VCs, but a secondary market is slowly happening.

    Essentially, people want a PaaS or SaaS on top of the barebone machines.

    For example, for the last couple of months we have been helping a customer fully utilize their hundreds-of-cards cluster. Their IaaS provider was new to the field. So we literally helped both sides to (1) understand InfiniBand and NCCL and training code and stuff; (2) figure out control plane traffic; (3) build an accelerated storage layer for training; (4) watch all kinds of subtle signals that need attention - do you know that a GPU can appear OK in nvidia-smi, but still encounter issues when you actually run a CUDA or NCCL kernel? That needs care. (5) provide fast software runtimes, like an LLM runtime, finetuning scripts, and many others.
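
    On (4), the idea is to actually launch work on every device rather than trusting nvidia-smi. A minimal sketch of that kind of smoke test (PyTorch, arbitrary matrix sizes):

      import torch

      # nvidia-smi can report a GPU as healthy even when real kernels fault,
      # so run a small matmul on every device and see what actually breaks.
      def check_gpus():
          bad = []
          for i in range(torch.cuda.device_count()):
              try:
                  with torch.cuda.device(i):
                      a = torch.randn(4096, 4096, device="cuda")
                      b = torch.randn(4096, 4096, device="cuda")
                      (a @ b).sum().item()   # force the kernel to run and copy the result back
                      torch.cuda.synchronize()
              except RuntimeError as e:
                  bad.append((i, str(e)))
          return bad

      for idx, err in check_gpus():
          print(f"GPU {idx} failed the smoke test: {err}")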

    So I think AI PaaS and SaaS is going to be a very valuable (and big) market, after people come out of the frenzy of "grabbing gpus" - and now we need to use them efficiently.

  • kaliszad 13 days ago
    The idea is that servers should become a lot more manageable and debuggable, down to the very bottom. At least that's what Oxide Computer offers. (I am not affiliated with them in any way, I just find the idea great.) There is a company in Germany - Cloud&Heat - that offers whole-rack solutions specially designed for efficiency/water cooling and security. Both companies offer interesting stuff for largeish customers.
  • FpUser 13 days ago
    This is really scary. Everything we do gravitates to practically a single point of failure in way too many areas of life. At some point a single small-scale unfortunate event, evil entity/person, or whatnot might bring countries to a point of collapse. And our "servants" do not seem to give a flying fuck about it. Well, they might have secured a place in the hideouts of the world's Zuckerbergs.
  • blackeyeblitzar 13 days ago
    Probably what’s needed is regulation, to control their pricing and ability to accept or deny customers. They’re basically a utility.
    • dilyevsky 13 days ago
      Nothing will ensure incumbency better than slapping more regs on it
      • blackeyeblitzar 13 days ago
        Is incumbency a bad thing if they have to serve everyone equally and at prices that are fair? It’s not like they operate without any regs at all today, right?
        • dilyevsky 13 days ago
          It is a bad thing bc it stifles innovation and ultimately hurts everyone but incumbent shareholders and their pocket legislators. And on the latter point it is also unlikely to be “fair”
      • wmf 13 days ago
        Yes, you only regulate a monopoly after you've decided that competition is impossible.
        • blackeyeblitzar 13 days ago
          I think we need to be more proactive than that personally. Lots of big companies do anti competitive things that we don’t want even if they aren’t exactly a monopoly. But no one’s doing anything about that.
  • nightshift1 13 days ago
    I am looking at the graphs, and all I see is a substantial influx of new funds into cloud services. Non-cloud investments remain steady but are not experiencing significant growth. What does the volume indicate? Is the cost of on-premises infrastructure rising, or are people simply facing increased computing demands? I wish.

    I sincerely hope that owning the entire stack in a private data center won't become exclusive to a select few hyperscalers in the coming years. I am somewhat biased, because I am a Linux admin for a company that has its own datacenters and is migrating to the cloud, but I find it distressing that everyone appears to be migrating to the cloud for reasons that seem dubious, such as "Gartner told me to." As mentioned at the end of the article, I predict a great loss of knowledge for everything that is not about the usage of a proprietary API.

  • karma_pharmer 12 days ago
    The upside is that these folks are extremely, extremely space-and-power constrained. Cloud data centers cycle through hardware much faster than any other industry (enterprise users, engineering workstations, render farms, etc) and when they're done with it they dump it on ebay for pennies. If you look at the large-volume sellers of rackmount servers on ebay, you'll notice that there are one or two near each major data center cluster.

    The best part is that they really do not care at all about any kind of cost recovery, so they dump truckloads of this stuff all at once and tank the price. If you can afford to wait you can get some utterly amazing deals on five-year-old top-of-the-line gear. If you buy enough to justify freight the (amortized) shipping is really cheap too.

  • hintymad 13 days ago
    I wonder why the cost of running on the cloud, engineering cost included, is not a concern to companies. Per DHH, even a modest app like 37signals' would cost millions more than they liked (https://world.hey.com/dhh/we-have-left-the-cloud-251760fb)

    On the other hand, as increasingly only a few cloud companies know how to build out datacenters, maybe the talent pool for building datacenters will keep shrinking - a vicious cycle in which fewer companies will choose to go with their own data centers.

    • lowbloodsugar 13 days ago
      Sure. Bring home legacy services that you aren’t innovating on any more. Running it on a single server. Obviously that’s a great option for a small tech company.
  • sixdimensional 13 days ago
    This is a good reminder that what we call the public cloud today is actually a misnomer - it’s not actually a public cloud, it’s someone else’s cloud. Just like we used to say the cloud is really just someone else’s computer.

    I think people started calling it the public cloud simply because of the scale and seemingly easy access to everyone, where the current clouds are seemingly in open competition with each other.

    • throwaway11460 13 days ago
      The public swimming pool is also someone else's pool, but it's still public.
      • jedberg 13 days ago
        Kind of different. The public pool is paid for by taxes and is publicly owned. The public cloud is still privately owned.
        • throwaway11460 13 days ago
          My point is that the pool very often is just a private company. All three public pools around me here in central Europe are privately held companies.
    • curt15 13 days ago
      >Just like we used to say the cloud is really just someone else’s computer.

      I still describe cloud computing as renting someone else's computers.

    • eddythompson80 13 days ago
      Not really sure what you mean by misnomer or what an “actual public cloud” would mean. It’s called public cloud to distinguish it from private clouds which plenty of large organizations have. The big cloud providers license their clouds for such deployments. Unless you mean that a “public cloud” should mean something like non-corporate governance?

      With things like kubernetes, it’s easier than ever to deploy your own private clouds these days.

      I think the only misnomer is the word “cloud” itself. We had the term “hosting providers/companies”, which is exactly what these public clouds are. It’s just that the term “hosting provider/company” had about 15 years of precedence that almost always meant “small-scale hosting”, so they just came up with a new marketing term.

    • toast0 13 days ago
      Public has many meanings. Everyone can buy or access it is a valid meaning and doesn't say anything about ownership.
  • acd 13 days ago
    "Buy local" also applies to data center space.