Ask HN: How to learn how to scale software?

I'm a mid-level software engineer, specialized in back-end programming. I'm getting my hands on Elixir currently, and I've read a few interesting blog posts (notably the one from Discord[1]) about its use case.

I've read the book called "Designing Data-Intensive applications" but it didn't scratch the same itch. I'd like to know how all of you who are writing software for millions of people scale your software. I'm guessing a lot of it comes from theory (e.g. blog articles, textbooks, courses) since it's not easily practicable (at least in my own company I don't need to scale for millions of users).

I don't know where to start so I'm hoping some of you can help

[1] https://discord.com/blog/using-rust-to-scale-elixir-for-11-million-concurrent-users

35 points | by Pooge 30 days ago

10 comments

  • kunley 29 days ago
    A lot in this area comes from experience in operations, or at least an operations mindset.

    Even if you haven't solved problems caused by real users and real loads on production systems, you can still develop that mindset.

    Build a habit of overloading your app with tons of requests, using load-testing tools like wrk, wrk2, or k6. Try to exercise every layer of your system this way, including the database (for that you obviously need load-testing requests that cause a real conversation with the database). Check how much I/O, memory, and CPU (in that order) your app, database, cache, messaging, etc. are using under that load.

    Develop a habit of opening the inspector in your browser and peeking at HTTP conversations, not only for your own services but for any random page you visit. Get acquainted with Brendan Gregg's perf tools and his blog. Know how to use strace and understand its output (although its use on production is discouraged, since it slows down the examined process substantially).

    In general, become interested in when your code ends up making kernel syscalls and when it doesn't, how to limit syscalls by buffering, and how to read what's really going on. Make sure you understand how virtual memory works on a modern system. That's just what comes off the top of my head.
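    To make the buffering point concrete, here is a minimal Python sketch (not from any particular production system): the same 10,000 small writes issued as raw syscalls versus through a userspace buffer. Under strace, the first loop would show roughly 10,000 write(2) calls, the second only a handful.

```python
import os
import tempfile

# Sketch: 10,000 small writes as raw syscalls vs. through a userspace
# buffer. The file contents end up identical; only the number of kernel
# crossings differs.
records = [b"event %d\n" % i for i in range(10_000)]

with tempfile.TemporaryDirectory() as d:
    # Unbuffered: os.write() maps 1:1 to the write(2) syscall.
    fd = os.open(os.path.join(d, "raw.log"), os.O_WRONLY | os.O_CREAT)
    for r in records:
        os.write(fd, r)
    os.close(fd)

    # Buffered: writes accumulate in a 64 KiB userspace buffer and only
    # hit the kernel when the buffer fills or the file is closed.
    with open(os.path.join(d, "buffered.log"), "wb", buffering=64 * 1024) as f:
        for r in records:
            f.write(r)

    raw_size = os.path.getsize(os.path.join(d, "raw.log"))
    buffered_size = os.path.getsize(os.path.join(d, "buffered.log"))
```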

  • SkyPuncher 29 days ago
    You learn by doing it, but it's also not incredibly complex for nearly all SaaS products (most simply aren't at the scale that actually necessitates it).

    At its most basic, you need insight into bottlenecks. That means some sort of profiling tool. From there, you start looking at slow things and making decisions.

    In my experience, those decisions largely come down to a few things:

    * Did I miss a basic optimization, like an index on a query, some sort of caching, or an N+1 query? Don't look too deep here.

    * Am I hitting hard limits of my hardware: memory, CPU, disk throughput, network throughput? If you are, ask whether you can simply add more or bigger servers. If you are small, the answer is almost always "yes".

    * Am I doing something really inefficiently? This is similar to the first point, but looks at things a bit more holistically. This might be slow external network requests, a lack of caching, or an inefficient join or query. Spend a lot of time here before moving on to the next point. These fixes are almost always easier than architectural changes.

    * Do I need to make an architectural change? Look for things like poor dependency isolation, spaghetti code (really, spaghetti data), and components that can be generalized.
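    The N+1 case in the first bullet is worth seeing once. A minimal sketch with an in-memory SQLite database and a made-up posts/users schema: the slow version issues one extra query per row, and the fix collapses everything into a single JOIN.

```python
import sqlite3

# Hypothetical schema for illustration: posts reference their authors.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE posts (id INTEGER PRIMARY KEY, user_id INTEGER, title TEXT);
""")
db.executemany("INSERT INTO users VALUES (?, ?)", [(1, "ada"), (2, "lin")])
db.executemany("INSERT INTO posts VALUES (?, ?, ?)",
               [(1, 1, "intro"), (2, 1, "scaling"), (3, 2, "queues")])

# N+1: one query for the posts, then one query per post for its author.
posts = db.execute("SELECT id, user_id, title FROM posts ORDER BY id").fetchall()
n_plus_1 = [(title,
             db.execute("SELECT name FROM users WHERE id = ?", (uid,)).fetchone()[0])
            for _, uid, title in posts]

# Fixed: a single query with a JOIN returns the same rows.
joined = db.execute("""
    SELECT p.title, u.name FROM posts p JOIN users u ON u.id = p.user_id
    ORDER BY p.id
""").fetchall()
```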

    • tstrimple 28 days ago
      > That means some sort of profiling tool

      I’ve really enjoyed using Application Performance Monitoring tools which sample in real time across production workloads to get insights into application bottlenecks. Sometimes the APM tool alone is good enough to track down performance issues and sometimes you’ll need to drop down to language specific profiling tools. But it’s a great way to get pointed in the right direction based on actual app usage.

      • matt_s 26 days ago
        I also recommend an APM tool.

    An answer people will probably dislike, and this may sound careless, but for the large majority of SaaS web applications out there, scaling horizontally in the middle/web tier behind load balancers and having a beefy DB server with replication to a stand-by is going to be the answer like 99% of the time.

    From a business perspective, when you are operating in the cloud, the cost of having engineers spend a lot of time tuning something to run optimally can exceed the cost of just adding more cloud servers, increasing the size of existing ones, or both. That is time those engineers could have spent building new product features to bring in more revenue. Time-box the analysis to something like one developer-day (e.g. the equivalent of 8 devs in a 1-hour meeting) and move on with business work. Or just increase resources until the problem goes away, assuming you've already fixed the obvious things like N+1 queries.

  • crq-yml 30 days ago
    What the contemporary dev environment teaches about scale is that it barely matters until you hit certain thresholds, and then it suddenly dominates everything.

    A corollary of this is that as your program becomes more complex, relatively more of the code ceases to matter to performance, because its purpose is to configure a certain inner loop, and the inner loop becomes the only real bottleneck.

    So, when coding for very small systems (retro or embedded), the whole program architecture matters, but at hyper scale the problem is seen mostly in terms of efficiently distributing the load, and only deeply optimizing the program in very specific places.

    A good starting point for thinking about this is a system like BitTorrent: The ideal torrent experience will most likely use multiple peers, and the performance needs to remain consistent as the number of peers increases and the load becomes more complex to distribute. But it's not really about the local performance so much as it is maintaining the overall network conditions - if every peer is doing "OK" at serving and retrieving files that's better than a very good experience served inconsistently.

  • tacostakohashi 29 days ago
    Another way to think about it is to be aware of and keep in mind all the various limitations of your implementation.

    If you have a program that reads some stuff into memory, then transforms it, and then spits it back out (which is basically every program)... how can it still work if the data set is bigger than the amount of memory? What if it's bigger than the address space? (figure out how to split the data up and work on segments then combine the results)
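    The split/work/combine idea can be sketched in a few lines of Python; the generator below stands in for a dataset that would never fit in memory at once:

```python
# Sketch: aggregating a dataset too big for memory by streaming it in
# fixed-size segments and combining per-segment results.
def segment_sums(stream, segment_size):
    """Reduce each segment independently, then combine the results;
    memory stays bounded at segment_size no matter how big the stream is."""
    total, buf = 0, []
    for value in stream:
        buf.append(value)
        if len(buf) == segment_size:
            total += sum(buf)   # per-segment result
            buf.clear()
    return total + sum(buf)      # combine the final partial segment

# A generator stands in for data read lazily from disk or object storage.
big_dataset = (i % 7 for i in range(1_000_000))
result = segment_sums(big_dataset, segment_size=10_000)
```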

    Using ints or longs or whatever as an identifier? What if we have more things to identify than UINT_MAX or ULONG_MAX? What if we need to allocate identifiers so quickly, and in so many different places, that contention for the next available number becomes a bottleneck? (Use UUIDs or random identifiers.)
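    For instance (a toy Python sketch, not a real allocator): a sequential counter has both a hard ceiling and a single point of contention, while random UUIDs can be minted on any node with no coordination at all.

```python
import uuid

# Toy sequential allocator: every caller must go through this one object,
# and it runs out of identifiers at max_value.
class SequentialIds:
    def __init__(self, max_value):
        self.next_id, self.max_value = 0, max_value

    def allocate(self):
        if self.next_id > self.max_value:
            raise OverflowError("out of identifiers")
        n, self.next_id = self.next_id, self.next_id + 1
        return n

# Random allocation: any process on any machine can do this concurrently.
# uuid4 has 122 random bits, so collisions are negligible in practice.
def allocate_uuid():
    return uuid.uuid4()

ids = {allocate_uuid() for _ in range(10_000)}
```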

    A lot of people write software with the assumption that there is an unlimited amount of memory, CPU, etc., and usually that works fine, until it doesn't and it blows up. Scalability is about being aware of all the finite limitations, and having strategies for dealing with them instead of just hoping you don't get close to them.

  • aristofun 30 days ago
    Skills like that are only really learned by doing.

    There are no shortcuts, sorry. No amount of synthetic experiments, simulations, theoretical examples will give you the actual hands on knowledge.

    For sure, books etc. are useful to gain some basic understanding.

    But beyond that - you should set more practical and specific goals that would lead you to scaling challenges naturally.

    For example - build your own web crawler etc.

    Or just set a goal to join a company with demanding products.

    • mathgeek 30 days ago
      > Or just set a goal to join a company with demanding products.

      This highlights a related but slightly different problem: it's important to learn enough to be able to talk about design choices for optimization, because these companies will throw you into such an interview as part of the hiring process.

  • poulsbohemian 29 days ago
    I spent about a decade doing performance work, i.e.: app crashes, can't handle the load, can't scale up to what we need, etc. I think there are at least two parts to answering your question:

    1) Get good at testing and profiling applications so that you can identify specific bottlenecks and then find architectural solutions to those pain points.

    2) Get good at looking at an architectural model and identifying those pain points even before you've built a system. BUT - sometimes the problem is that you have incomplete information about the demands on a system and/or in other cases someone will want to over-architect from the beginning - which in some cases even becomes the bottleneck.

    I guess I'd add perhaps a third option based on your specific question: Learn what it means to scale software. Sometimes that means low level things in software. Sometimes that means architecture. And sometimes it's just throwing a shit-ton of hardware at a problem. So in your desire to scale software, you need to know what technical solutions are available to you and make the tradeoffs depending on the resources (time, money, people, etc) available.

    So how do you learn this stuff? Well, short of companies paying you like they did in my case, maybe consider some problem you want to explore, build a prototype, and consider introducing different types of both load and errors into the system so you can see how the system performs. A few years back I was asked to build a system to ingest what the client thought might be 10,000 images a day and run through a series of steps with those images. Well, the client effed up and I discovered that the real demand was more like 10,000,000 images a day that needed to be processed. So I had to modify the architecture. Point being - you could come up with some scenario like that, build yourself a little prototype lab experiment, and start playing with what happens as you modify parts of the architecture.

    • sagelemur 29 days ago
      Any good tools to build these prototype experiments?
      • poulsbohemian 29 days ago
        Whatever code you are handy with. You'll need load-generation tools and profiling tools - and picking those out and developing skills with them is a step toward understanding scale too.

        In the course of building those prototypes, you are learning something about scale too... what if I spawn a bunch of threads to handle this task at this point? What if instead I put each of those tasks on a queue and have a task handler? What if I take this engine piece and put in on separate hardware? What if I have a bunch of these engines running in parallel? What if I make this change to the DB connection? What if I remove the dependency on the DB entirely?

        It's easy to say a lot of this stuff is basic computer science you should have learned in school, but that's not reality. There's a lot of emphasis put on algorithmic complexity and efficiency, there's a lot less put on systems design and architecture. You kinda have to do it in the wild to learn it.

  • brudgers 30 days ago
    Standard engineering practice scales to large engineering projects.

    Know what problem the design should solve.

    Know what resources are available.

    Measure, prototype, test iteratively.

    Eat the elephant one byte at a time.

    Good luck.

  • Jemaclus 30 days ago
    First of all, the fact that you're asking this question puts you ahead of most engineers I know. There's a well-known saying that goes something like "Make it work, make it work well, then make it fast."

    One of the simplest ways to think about scale is to think in terms of speed. This is a very very gross oversimplification and glosses over a lot of really important concepts, but at its core, you can say "if it's fast enough, it'll scale."

    In a very simple mathematical sense, consider the idea that you have a single-instance, single-threaded application with no concurrency. If a request takes 1000ms to run, then you can do, at most, 1 request per second. If the request takes 100ms, you can do 10 requests per second. If it takes 10ms, you can do 100 requests per second, and if it takes 1ms, you can do 1000 requests per second.
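    The arithmetic above is just the reciprocal of per-request latency:

```python
# For a single-instance, single-threaded server with no concurrency,
# maximum throughput is the reciprocal of per-request latency.
def max_throughput_rps(latency_ms):
    return 1000.0 / latency_ms

# 1000ms -> 1 req/s, 100ms -> 10 req/s, 10ms -> 100 req/s, 1ms -> 1000 req/s
table = {ms: max_throughput_rps(ms) for ms in (1000, 100, 10, 1)}
```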

    See? Speed is throughput is scale.

    But that is, obviously, an oversimplification of the problem. Real applications are multi-threaded, multi-instance, and offer concurrency. So now the problem is identifying your bottlenecks and fixing them. But again, at its core, the main idea is speed. How can you make things as fast as possible?

    (Note: There is a need to consider concurrency and parallelism, plus certain data stores have inherent speed limitations that may need to be overcome, and those things can offset poor speed, but the simplest path to scalability is speed and optimizing throughput.)

    The analogy I like to use is the grocery store. Imagine you own a grocery store, and you want to make as much money as possible. Well, the best way to do that is to make sure your customers can get their food and check out as fast as possible. That means making sure the food is easy to find (i.e., read access is fast!), that they don't have to wait to check out (i.e., queue depth is low), and that checking out is fast (i.e., writes are fast). The faster your customers can walk in the door and back out again, the more customers you can sustain over a period of time.

    On the other hand, if your customers take too long to find their groceries, or they spend too long waiting in line, or they have to write checks instead of swiping a smart phone, then you wind up with a backlog. And the larger the backlog, the longer it takes for money to hit your bank account.

    So in this sense, time is literally money. The faster they can get through your system, the better.

    I mentioned three different ways of thinking about speed: reads, writes, and queue depth.

    Keeping with our grocery store analogy, consider how to improve each of those things. How do you make sure your customers can find what they're looking for as fast as possible? You "index" things. You put signs on the aisle, you organize your content in a way that is intuitive and puts related things near each other. If you want spaghetti, the pasta and the sauce and the parmesan cheese are all right next to each other. If you want breakfast, the eggs and milk and cinnamon rolls are right next to each other. In and out.

    Similarly, your data needs to be organized smartly so that the user can get in and out as fast as possible. In a database, this means optimizing data structures, adding indices, and optimizing queries. Reduce expensive queries, keep cheap fast queries. Find ways to cache hot data. Make it easy to find what you need.
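    The aisle-sign analogy maps directly onto a database index. A small SQLite sketch (with a hypothetical products table) shows the query plan flipping from a full table scan to an index search once the "sign" exists:

```python
import sqlite3

# Hypothetical table for illustration: 1000 products shelved by aisle.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, aisle TEXT, name TEXT)")
db.executemany("INSERT INTO products (aisle, name) VALUES (?, ?)",
               [("pasta", f"item-{i}") for i in range(1000)])

def plan(sql):
    # The last column of EXPLAIN QUERY PLAN output describes the access path.
    return db.execute("EXPLAIN QUERY PLAN " + sql).fetchone()[3]

query = "SELECT name FROM products WHERE aisle = 'pasta'"
before = plan(query)                                 # full table scan
db.execute("CREATE INDEX idx_aisle ON products (aisle)")
after = plan(query)                                  # index search
```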

    For writes, how do you speed up writes? One way is to make things asynchronous. Throw things that can be eventually consistent into queues and let an asynchronous job handle it outside the normal flow. The customer experiences minimal latency, and you've introduced concurrency to keep the data flowing while the customer is doing something else. This is, in part, why those little screens at the checkout counter ask you so many questions. They're distracting you while the cashier is scanning your groceries.
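    A bare-bones version of that pattern in Python, with a thread and an in-process queue standing in for a real message broker and worker fleet:

```python
import queue
import threading

# Sketch: the request handler enqueues and returns immediately; a
# background worker drains the queue and does the slow write off the
# critical path (eventual consistency).
write_queue = queue.Queue()
committed = []  # stands in for the real datastore

def worker():
    while True:
        item = write_queue.get()
        if item is None:            # shutdown sentinel
            break
        committed.append(item)      # the slow, durable write happens here
        write_queue.task_done()

t = threading.Thread(target=worker, daemon=True)
t.start()

def handle_request(payload):
    write_queue.put(payload)        # O(1): the user never waits on the write
    return "accepted"

for i in range(100):
    handle_request({"order": i})

write_queue.join()                  # for the demo only: wait for the drain
write_queue.put(None)
t.join()
```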

    Queue depth optimization is important as well. If the queue gets really long at the grocery store, how do you improve that? You add more cashiers! The more cashiers you have, the more concurrent customers you can handle. But does it make sense to have 1 cashier per customer? Probably not. Now you've overscaled and you're spending too much money.

    As you can see, this is a complex operation, and again, my analogy is overly simplified and very dumb, but I hope this gives you a decent idea of how to visualize a scalability problem.

    I'm not familiar with Elixir, but frankly the concepts should translate to any language, although the details may vary.

    My suggestion? Learn how to do profiling, identify bottlenecks, and target the biggest bang for your buck. The big risk here is micro-optimization, so fight for changes that give you order of magnitude improvements. Saving 50 microseconds isn't worth your time, but shaving off 1500 milliseconds almost certainly is.

    Best of luck.

    • Everdred2dx 30 days ago
      As someone working in an extremely small scale world I don’t get opportunities to tackle scaling problems, so this was a very nice read. Thanks!
      • Jemaclus 29 days ago
        Even at small scales, speed is extremely important! I don't know if it's still true, but I talked to an engineer at SmartyStreets several years ago, and they said they could serve up 100,000 requests per second on a Raspberry Pi. At the time I didn't believe them, but since then I've developed several systems that could absolutely do that.

        At small companies, every dollar counts. You could save a bajillion dollars by serving up all your traffic on an AWS micro or small instance, instead of larger machines!

        So even at small scales, it's worth figuring out how to make things fast. That makes scaling later much easier!

    • M5x7wI3CmbEem10 29 days ago
      Thanks! Why would you want fast read access in the customer example? More customers looking for items means a smaller queue
      • Jemaclus 29 days ago
        You mean "customers in the aisles means shorter lines at checkout"? That would be true if you're optimizing for short lines at checkout, but what a business is really optimizing for is sales over time, and so you want as many people to come in, buy their stuff, and leave in as short a period as you can. The more people you can get in and out of your store, the more money you make.

        The more time they spend wandering the aisles (or browsing your site), the more opportunity for them to say "I can get this somewhere else faster/easier/cheaper." Every second they aren't punching in payment details is a moment for the baby to cry, for someone to call, for the boss to give them an assignment, for a bathroom break. Any of those things can totally break the moment and cause the customer to abandon their cart and leave.

        Solution? Don't let them hang around. Get them to their goal as fast as possible and keep the total transaction time -- from landing on the site to checking out -- as short as you possibly can.

        To give a more concrete example, imagine a company like Instacart. If it takes you more than an hour to fill up your cart on the site -- for whatever reason! whether bad organization, slow response times, whatever -- then you might as well just go to your local grocery store yourself! You can almost certainly be in and out of your local store in less than an hour.

        The value prop that Instacart has is "you never have to go to the grocery store again, because it's easier to order from home." But if it's harder to order from home, then what's the value of Instacart? (Again, I'm oversimplifying here, but this is the gist of the value prop. Instacart doesn't sell groceries -- it sells TIME. It sells that hour of your life back so you can spend it with your family or playing video games or arguing with me on HN.)

        And so in terms of scalability, Instacart wants you to land on the site, add everything to your cart, and get the order placed as fast as possible. And to do that, everything needs to be fast. The category pages need to load fast, the product detail pages need to load fast, your cart needs to load fast, the checkout pages need to load fast. The faster everything is, the quicker -- and better! -- your experience is.

        There are numerous studies out there that show that as little as 500ms latency can cost millions of dollars for a company. It's really important to keep everything fast!

        I have more thoughts about this, but this is the gist of the answer to your question: because the goal isn't short queues, but rather faster total trips.