How we built a serverless architecture with AWS


114 points | by kdeorah 102 days ago


  • LaserToy 102 days ago

    Well, you folks now made your business super coupled with AWS. I just have 1 word: Oracle

    • shiftpgdn 102 days ago

      It's pretty obvious that someone at Amazon is watching and voting on this thread because this is absolutely true. This kind of thing is vendor lock-in to the nth degree.

      • LaserToy 102 days ago

        Funny, right. I worked for at least 2 companies that at some point in time put a lot of money into Oracle. One of them is a leading gaming network, a lot of billions in revenue.

        Teams struggled with migration off it. It was a multi-year, multi-million-dollar project and there is no end to it. And newcomers were saying: oh, that was a silly idea to use all this stuff (why didn't they use Dynamo :) ). However, 15 years ago it was pretty OK, plus Oracle solution architects were all over the company.

        I don't see how Amazon's strategy is different. And I don't get how folks who say Oracle lock-in was bad but Amazon is OK can justify that thinking.

        I will put my money on it, in 10 years those will be good examples of how not to do things. Like, for example, when AWS leadership changes. And internet will be: who would've seen it coming.

        • zaarn 102 days ago

          Same reason that people think that Chrome lock-in is okay and IE lock-in was bad; the new product is shiny and has good features. At the moment.

          Well, and you gotta justify that 1 million $ investment (R&D + Costs) into your AWS architecture somehow.

        • jcrites 102 days ago

          Lots of folks from Amazon participate in Hacker News, like Tim Bray, Colm MacCarthaigh, and Jeff Barr. Jeff Barr often comments in threads about announcements he's written. One of Tim's blog posts was recently on the front page [1]. See: timbray, colmmacc, jeffbarr

          I doubt there's any kind of voting cabal, but if folks are participating then they're probably voting according to their inclinations. (I don't vote too much myself, either on comments or articles.)

          Any time you invent a new technology with a unique interface, then software built using that technology is coupled to it to some extent. It's actually fairly rare for software components to be so completely interchangeable that you can swap out implementations without changing the software that uses it.


          • jackpeterfletch 102 days ago

            At the most basic level, it wouldn't be a tough migration to any other FAAS. Yes it would be work, but I can't think of any other infra migrations that would be less effort.

            But also you don't need to think of lambda code as code that can _only_ necessarily run on AWS Lambda.

            We organize related lambdas (that would traditionally constitute an 'application') as a gradle multiproject, one module per lambda, with a common module for shared code, like DAOs. The CI creates and uploads an individual jar per Lambda, but updates them all every release.

            We then have an extra module that pulls all of those together onto a web API and can be run as a container independently of any FaaS. At that point the fact that you're deploying to Lambda is basically irrelevant to your codebase; it looks and feels like any other 'application' and is probably even a little more organized.
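
A minimal sketch of the layout described above, in Python rather than the commenter's Java/Gradle setup. The business logic is a plain function, and two thin adapters call it: one for Lambda and one for an ordinary HTTP server that could run in a container. All names here (`get_greeting`, `lambda_handler`, `GreetingHTTPHandler`) are hypothetical, and the Lambda adapter assumes an API Gateway proxy-style event with a `queryStringParameters` field.

```python
import json
from http.server import BaseHTTPRequestHandler
from urllib.parse import parse_qs, urlparse


def get_greeting(name: str) -> dict:
    """Provider-independent business logic (hypothetical example)."""
    return {"message": f"Hello, {name}!"}


def lambda_handler(event, context):
    """Thin Lambda adapter; assumes an API Gateway proxy-style event."""
    params = event.get("queryStringParameters") or {}
    body = get_greeting(params.get("name", "world"))
    return {"statusCode": 200, "body": json.dumps(body)}


class GreetingHTTPHandler(BaseHTTPRequestHandler):
    """Thin adapter for a plain HTTP server, e.g. inside a container."""

    def do_GET(self):
        # Parse ?name=... from the request path and reuse the same core logic.
        query = parse_qs(urlparse(self.path).query)
        name = query.get("name", ["world"])[0]
        payload = json.dumps(get_greeting(name)).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)
```

To serve the same logic outside Lambda, the HTTP adapter can be passed to the standard library's `http.server.HTTPServer`. Only the adapters would need to change when moving between providers; the core function stays untouched.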

          • akishinevsky 101 days ago

            Usage of AWS services is a conscious decision, absolutely. However, the product architecture that uses these AWS services is subject to careful review for design and functional components integrity. Any of these components must be replaceable as issues/bottlenecks are identified. For example, if AppSync is proven to have issues as the company scales further, AppSync can be replaced with self-hosted GraphQL clusters. Additionally, other components in the architecture can be similarly replaced.

            • antpls 102 days ago

              Another way to phrase the question: How reproducible are architectures on AWS?

              It's the reproducible build problem again, but at the architecture/infrastructure level

            • apsdsm 102 days ago

              It occurs to me that you can stick a single colon into the title after “architecture” and you pretty much get the summary of the article.

              • bborud 101 days ago

                I like the idea of serverless architectures, but I still wouldn't use it for anything that is important.

                - Using a serverless architecture almost always implied getting married to your provider. You can run your code in only one place. You have given up all bargain power. When the relationship ends you have to build your system over again.

                - It isn't really serverless; they're just not your servers.

                - They are only efficient for the workloads the architecture is designed for. Stray outside the parameters and things start to become expensive, slow or both.

                - If you use serverless architectures you have to make damn sure the people who built it stick around, because the only value you are left with if your provider folds or increases prices on you is inside the heads of the people who built the solution.

                I have already seen friends getting burnt by this. Typically people build a prototype or a technology demo, it gets funding, the CTO insists that it isn't important to do anything about the serverless bits and just go with them (there is no pause to make good on "we'll fix it later" once money gets involved), then get jerked around by the service provider because they can't provide the support needed. Then they slam head first into the costs of actual production traffic which, somehow, even though it requires only basic arithmetic skills, none of them had been able to calculate before the huge bills started rolling in.

                • plufz 101 days ago

                  > Using a serverless architecture almost always implied getting married to your provider

                  I don’t disagree but I’ve made a web app with aws serverless. Frontend on s3, Backend Python flask on serverless and MySQL server (haven’t tried RDS serverless yet). Works fine, had a compiled library that did not work but all standard stuff. No marriage. :)

                  • bborud 94 days ago

                    Then strictly speaking it isn’t serverless.

                  • notthisshitaga 101 days ago

                    "- Using a serverless architecture almost always implied getting married to your provider. You can run your code in only one place. You have given up all bargain power. When the relationship ends you have to build your system over again."

                    I think you can use a format that is mostly provider-independent. Any movement will change just the integration. Also, this kind of applies for anything large. You get married to the API anyway and movement will always be painful to some degree.

                    "- It isn't really serverless; they're just not your servers."

                    You mean they are not on prem? 'Cuz they can be. You mean that the name is bad? It really isn't as bad as people seem to imply. When done right, you don't worry about the servers. You mean that you have no control over the execution environment? Well, if Spectre and Meltdown have taught us anything, it is that you really lose control at some point anyway.

                  • holoduke 102 days ago

                    I don't understand what you really gain with this setup. I mean, this extreme vendor lock-in situation is so short term. The absolute wrong strategy if you ask me. I would be curious to see this company in 5 years from now.

                    • akishinevsky 101 days ago

                      Let me ask you this question differently: let's say you are exploring a new market opportunity for an exciting product you want to build. Would you rather spend cycles building what AWS has done already and thus delay speed to market, or would you rather use what has already been done by AWS managed services and build what nobody has done before? Surely, this architecture will evolve over time, but it will only evolve as the startup quickly discovers what the market needs.

                    • leetbulb 102 days ago

                      Oh boy, I bet that's costing a new car each month.

                      • nisten 102 days ago

                        I used to hate aws for how expensive their bandwidth and storage was, until I started actually using it last year. I think their new serverless stack is about to leave a lot of devops out of a job.

                        You can set up a CI/CD pipeline in about half an hour with amplify, at the previous company I remember it taking a good 3 weeks to get CircleCi up and running properly.

                        And then moving a microservice over to it is basically 1 command, a few options, mostly just copy over the config from your old express backend with a few changes, and you're done. It's insane.

                        One other dev I showed the Lighthouse scores of the React stack I deployed on it even said "this should be illegal". And they're right, it's pretty much automated devops, the whole app now loads in 300ms. If you have server side rendering in your app the static content will automatically be cached on their CDNs.

                        And if you want to save a bit of money you can just use Google Firebase for your authentication and db. GraphQL is surprisingly a breeze too as a middle layer if you want to leave your Java or .NET backend APIs untouched.

                        At the end of the day, nodejs is completely insecure by design, your infrastructure will never be as secure as running it on gcp or aws. That's why you go serverless and stop messing with security and front end scalability.

                        If they solve the cold-start issue of databases on aurora they will completely dominate the market even more than they already have.

                        • closeparen 102 days ago

                          >You can set up a CI/CD pipeline in about half an hour with amplify, at the previous company I remember it taking a good 3 weeks to get CircleCi up and running properly.

                          >And then moving a microservice over to it is basically 1 command, a few options, mostly just copy over the config from your old express backend with a few changes, and you're done. It's insane.

                          As an engineer at a decent-sized tech company, this sounds pretty normal, because our infrastructure teams have been providing it (and much more) to service authors via internal APIs/web UIs for much longer than "serverless" has been a buzzword.

                          • remify 101 days ago

                            Except now you don't need an infrastructure team. That's the whole point of serverless architecture: to be able to scale without having a huge team scale as well.

                            • closeparen 101 days ago

                              You haven’t needed an infrastructure team since PHP shared hosting, and certainly not since Heroku or Elastic Beanstalk, except that people kept wanting greater complexity at lower cost. There is nothing new about “serverless” there.

                              • akishinevsky 101 days ago

                                There is a difference between what was then and what is today. The key difference is that the "serverless" term is massively overloaded here and once you dissect it you will see that it's a mix of multiple serverlessly managed services that we are able to take advantage of: Kinesis, DynamoDB streams, Kinesis Firehose, SNS, Lambda, Cloudwatch, and GraphQL/AppSync. Serverless computing came a long way.

                                • kdeorah 101 days ago

                                  Agree that serverless is a buzzword here, just as data science and machine learning have now become. It is about taming greater complexity at lower cost, and at increasingly more granular levels.

                                • akishinevsky 101 days ago

                                  Exactly. We were able to support millions of new devices without infrastructure involvement.

                              • scrollaway 102 days ago

                                Can you elaborate on Amplify? Is it really that good? It didn't take me terribly long to set up gitlab-ci with ECS and later Fargate; both of these feel more appropriate for web-serving apps.

                                I may see it in a full-JS app, but I still can't find a good fit for a JS-based backend. I've recently been exploring alternatives to Django for API backends and seriously considering a JS-based framework. I have yet to find one that is all three of: good, simple, in typescript. TypeORM looks excellent for the ORM side but there's still the matter of writing APIs; anything I've looked at (Express, Koa…) is atrociously repetitive compared to Django REST Framework -- NestJS is the best I've found, and it's still miles away.

                                • dmix 102 days ago

                                  SSR is going to do wonders for page load times on the internet as it finally gets popular via React/Vue. I hope it's the future for all of these heavy-weight user-facing JS apps.

                                  • Jedi72 102 days ago

                                    SSR is the future? What has PHP been doing for 15+ years?

                                    • dmix 102 days ago

                                      I'm talking about Next.js/Nuxt.js style JS front-ends replacing exactly that plus JS heavy frontends like Angular and SPA react apps which was the last decade's modus operandi.

                                      The way SSR hooks React/Vue into these JS apps, "hydrating" the prerendered component-based markup after it loads to make it interactive without losing any performance compared to static HTML, is unique and extremely powerful, which most people don't understand until they do it. It really is the future of frontend development.

                                      SSR combined with async-loaded, chunked bundles of components is far more than prerendering templates with some server-side web app templating library with full HTTP requests in between. All the power of a full-fledged SPA but with none of the performance or SEO downsides, with automatic offline + service worker caching. It's great for the web's future.


                                      • Jedi72 101 days ago

                                        Yo dawg I heard you like job security, so we put a program in your program, now you can render while you render

                                        More seriously, I do understand the difference, but disagree with the whole approach in 95% of cases

                                        • dmix 101 days ago

                                          Even the Haskell people are adopting SSR’d JS-heavy frontends (Miso, Purescript, etc) for their web apps. That’s when you know it’s mainstream. Good luck with PHP!

                                    • manigandham 93 days ago

                                      SSR is the default for the web since it started.

                                    • root_axis 102 days ago

                                      Why do you say that nodejs is completely insecure by design and how does gcp or aws mitigate those security concerns?

                                      • rapsey 102 days ago

                                        Because you will inevitably have hundreds/thousands of dependencies, controlled by at least as many people, any one of which could inject code to backdoor your server.

                                        A supply chain attack will sooner or later be the cause of a major incident.

                                        • holoduke 102 days ago

                                          It's the same for any other language. With Java, with C++, .NET, PHP and even with Erlang. None of them force you to use governed central repositories. And that's a good thing.

                                          • rapsey 102 days ago

                                            The scale is on a different level however. Your average node project will have 10/100x as many dependencies compared to other languages. Too many to conceivably check. Also due to how dynamic the language is, I think it is way easier to hide something.

                                        • nisten 102 days ago

                                          The V8 runtime itself is pretty secure. However, every npm package has total access to your filesystem and network I/O. This is by design; the author of node himself has apologized for it and admitted that nothing can be done now because it'd basically break the internet. This means any package (e.g. eslint), dependency, anything that has code from just one malicious contributor can grab all your API keys, ssh keys (if you still use those), environment variables, and crypto wallets of your users (this has actually happened a few times now at scale).

                                          With something like aws-amplify you just go on their site and put your environment variables there, instead of keeping them on your own machine.

                                          Now you don't have to worry about using sketchy docker images, or your junior devs using their work laptops on a malware infested gaming cafe while still running their localhost server.

                                          AWS and GCP can afford to have way better internal security and regular pentesting of their containers and infrastructure, so now wrapping those protective layers around node, express, etc. is their problem. You just push your code to the production or testing branch and they handle all the provisioning, builds and deployments in 3-5 minutes.

                                          • root_axis 102 days ago

                                            The npm dependency issue is a serious concern, but I'm not convinced that gcp or aws would mitigate the issue. If the problem is unaudited code that could be potentially compromised, gcp and aws will run that compromised code without protest.

                                      • akishinevsky 101 days ago

                                        It's very easy to incur high costs here. We implemented cost analysis dashboards that allow us to monitor costs per each event, device, with visibility into each AWS service we use, with charts showing historical data. Fiscal planning is now part of our architecture design and implementation.

                                        • brad0 102 days ago

                                          It all depends on how much data they're processing. It looks to be mostly a pay-per-use model.

                                          I'd say their big cost is Kinesis and potentially API Gateway. Lambda is great for this kind of workload (mostly).

                                          • kdeorah 102 days ago

                                            Top 3 are EC2, DynamoDB and Lambda.

                                            • brad0 102 days ago

                                              I don't see EC2 in your architecture diagram. Where are you using EC2?

                                              How are you structuring your dynamo tables? Is there one table that is used much more than another?

                                              • kdeorah 102 days ago

                                                Not using EC2 directly, though AWS breaks down that cost with a line item for EC2.

                                                Many tables in DynamoDB. Two out of those are most used (equally).

                                        • agraebe 102 days ago

                                          Saw this post a while ago: - did you hit any limitations with AppSync?

                                          • kdeorah 102 days ago

                                            We haven't hit any scaling issues yet. GraphQL is nice. It's really about getting data directly from DynamoDB and Aurora to an end point that Android/iOS/React-JS can query and subscribe to. Apache Velocity Template Language that AppSync uses is a pain though. This post captures it well (unfortunately):

                                            • akishinevsky 102 days ago

                                              AppSync does have limitations we have to contend with. Custom scalar types cannot be defined, hence we are not able to define strictly typed GeoJSON objects. Apache VTL has its own learning curve; once you master it you can implement functionality without invoking Lambda functions, which avoids paying for their usage in high-volume GraphQL call scenarios and gets you the queried data faster.

                                        • reilly3000 102 days ago

                                          How do you deal with Lambda concurrency? I have found it's pretty easy to hit 1K concurrents if functions take a long time to run and receive bursty traffic.

                                          • L_226 101 days ago

                                            You can IIRC ping support and ask for a concurrency limit increase, but probably what I would do first is try to segregate lambda deployments and API endpoints (or whatever trigger) by region so that total load is distributed (you get 1000 concurrents per region). Obviously at this point you would also profile your code to optimise function executions.

                                            • xsmasher 102 days ago

                                              Do you mean you don't want it to handle 1k concurrent requests (you want some to be rejected or queued instead?) or do you mean that the concurrent execution causes some other problem?

                                              (honest question, not snark)

                                              • bzbz 102 days ago

                                                I think they mean there’s a 1k concurrent request limit that they hit. Though the alternative would be dedicated servers and load balancers, no?

                                                • reilly3000 101 days ago

                                                  Right, I'm referring to AWS limits. I was running a benchmark yesterday against a logging endpoint I made with a similar architecture to the article. One function is attached to a public ALB endpoint and does some validation, then writes the event to SQS; this was taking 100-200ms with 128 MB of RAM. A second function was attached to the SQS queue; its job was to pull events and write them out to an external service (Stackdriver, which sinks to BigQuery). This function was taking 800-1200ms at 128 MB RAM, or 300-500ms at 512 MB (expensive!).

                                                  While running some load testing with Artillery I found that I was often getting 429 errors on my front-end endpoint. When pushing 500+ RPS, the 2nd function was taking up over 50% of the concurrent execution limit and new events coming into the front-end would get throttled and in this case thrown out. That also means that any future Lambdas in the same AWS account would exacerbate this problem. Our traffic is spiky and can easily hit 500+ RPS on occasion, so this really wasn't acceptable.

                                                  My solution was to refactor the 2nd function into a Fargate task that polls the SQS queue instead. It was easily able to handle any workload I threw at it, and also able to run 24/7 for a fraction of the cost of the Lambda. Each invocation of the Lambda was authenticating with the GCP SDK before passing the event and the Lambda has to stay executing while the 2 stages of network requests were completed.

                                                  I'm happy to report I haven't been able to muster a test that breaks anything since I started using Fargate!
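
The numbers in this comment can be sanity-checked with some back-of-the-envelope arithmetic. The sketch below uses Little's law (in-flight executions ≈ arrival rate × average duration) for concurrency and memory × duration for billed GB-seconds. The durations are assumed midpoints of the ranges quoted above, not measurements, and the function names are hypothetical.

```python
def estimated_concurrency(requests_per_second: float, avg_duration_s: float) -> float:
    """Little's law: in-flight executions = arrival rate x average duration."""
    return requests_per_second * avg_duration_s


def gb_seconds(memory_mb: int, duration_s: float) -> float:
    """Billed compute for a single invocation, in GB-seconds."""
    return (memory_mb / 1024) * duration_s


if __name__ == "__main__":
    rps = 500  # request rate quoted in the comment

    # Assumed midpoints: front-end ~0.15 s, second function ~1.0 s at
    # 128 MB or ~0.4 s at 512 MB.
    print("front-end concurrency:", estimated_concurrency(rps, 0.15))
    print("2nd function @128 MB: ", estimated_concurrency(rps, 1.0))
    print("2nd function @512 MB: ", estimated_concurrency(rps, 0.4))

    # Per-invocation billed compute for the two memory settings.
    print("GB-s @128 MB:", gb_seconds(128, 1.0))
    print("GB-s @512 MB:", gb_seconds(512, 0.4))
```

Under these assumptions the second function alone needs roughly 500 concurrent executions at 500 RPS, which matches the "over 50%" of a 1,000-execution limit reported above. Raising memory to 512 MB lowers concurrency but bills more GB-seconds per invocation, because the duration drops by less than the 4x memory increase.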


                                                  • lentil 101 days ago

                                                    > the 2nd function was taking up over 50% of the concurrent execution limit and new events coming into the front-end would get throttled and in this case thrown out.

                                                    It sounds like you already found a great solution for your particular case. But it's also worth mentioning that you can apply per-function concurrency limits, which can be another way to prevent a particular function from consuming too much of the overall concurrency. For anyone whose Lambda workload is cheaper than a 24/7 task, that could be a good option.

                                                    > Each invocation of the Lambda was authenticating with the GCP SDK before passing the event

                                                    I'm curious whether you tried moving the authentication outside of the handler function so it could be reused for multiple events? I've found that can make a huge difference for some use cases.
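
A minimal sketch of the pattern suggested here: keep the expensive client in module-level state so that a warm execution environment reuses it across invocations. `_authenticate` is a hypothetical stand-in for a real SDK login, and the counter exists only to make the reuse visible.

```python
import time

_AUTH_CALLS = 0  # counts how often the expensive step runs (illustration only)
_client = None   # cached for as long as the execution environment stays warm


def _authenticate() -> dict:
    """Stand-in for an expensive SDK authentication call (hypothetical)."""
    global _AUTH_CALLS
    _AUTH_CALLS += 1
    return {"token": "example-token", "created_at": time.time()}


def get_client() -> dict:
    """Create the client on first use, then reuse it on later invocations."""
    global _client
    if _client is None:
        _client = _authenticate()
    return _client


def handler(event, context):
    client = get_client()
    # A real handler would forward the event using `client` here.
    return {"forwarded": True, "auth_calls": _AUTH_CALLS}
```

Calling `handler` repeatedly in the same process authenticates only once. A production version would also need to handle credential expiry, which this sketch leaves out.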

                                              • methodin 102 days ago

                                                1 lambda hitting 1k concurrent or many lambdas hitting that in aggregate?

                                              • thesanerguy 102 days ago

                                                How does the cost of DynamoDB (and other components) compare to other options that you considered, especially at scale? Would the economics work with the same architecture at, say, 100X scale?

                                                • kdeorah 102 days ago

                                                  Good question. At 100x, probably not. At 10x, yes, it would be better than managing services on our own. By that time, we would have a better prioritized list of which services to self-manage and which ones to leave to AWS. Are you specifically concerned about DynamoDB for some reason?

                                                  • thesanerguy 102 days ago

                                                    How easy or hard would it be to switch to self-managed components as you grow from 10X to 100X? Quite often, they end up becoming a tech debt that remains in the back burner. Just curious.

                                                    • kdeorah 102 days ago

                                                      Ah yes. The engineer would tell you we can move when we want. The manager would tell you it is harder than it looks. Management would tell you it will never happen. :-)

                                                      See it as reducing startup risk and deferring the payment to when you become successful and have money/time to throw at the problem. Though there are best practices to do it in a clean way so moving is easier.

                                                      Do you know some known gotchas here?

                                                    • awinder 102 days ago

                                                      I’d be curious why you think at 100x is where you would lose out on TCO with self managed. I feel like staff time commitment should only go up with larger fleets, you’d really start running into the pricing advantages on 0-rated network here etc.

                                                  • Charles_t 102 days ago

                                                    This is not going to scale. Lambdas are hella slow. The cold starts will kill you.

                                                    • brad0 102 days ago

                                                      The only two places where the cold starts will hurt is the API key auth and the JWT auth/posting to kinesis. Plus if it's being called with any kind of decent frequency it won't matter.

                                                      • poxrud 102 days ago

                                                        With constant traffic cold starts should not be happening. Also lambdas will stick around for 45-60 minutes before going cold.

                                                        • kdeorah 102 days ago

                                                          That's exactly right. Background location tracking leads to constant traffic. Not running into cold starts as an issue.

                                                        • vageli 101 days ago

                                                          > This is not going to scale. Lambdas are hella slow. The cold starts will kill you.

                                                          Frameworks like Zappa handle this out of the box by setting up executions to run via CloudWatch event cron.
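
A minimal sketch of the keep-warm idea, not Zappa's actual implementation: a scheduled rule invokes the function periodically, and the handler returns early when it recognizes the ping. It assumes the scheduled event carries `"source": "aws.events"` and `"detail-type": "Scheduled Event"`; the function names are hypothetical.

```python
def is_keep_warm_ping(event: dict) -> bool:
    """Return True for a scheduled CloudWatch Events invocation."""
    return (
        event.get("source") == "aws.events"
        and event.get("detail-type") == "Scheduled Event"
    )


def handler(event, context):
    if is_keep_warm_ping(event):
        # Return immediately so the ping costs as little as possible.
        return {"warmed": True}
    # Normal request path (placeholder for real work).
    return {"statusCode": 200, "body": "real work done"}
```

A single scheduled ping only keeps the instance that answers it warm, so bursts that need additional concurrent instances can still see cold starts.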

                                                          • vorpalhex 102 days ago

                                                            You can pay to never hit cold starts..

                                                            • iends 102 days ago

                                                              No you can't, you can just try and keep your lambdas warm, but that isn't the same thing.

                                                              • jjtheblunt 102 days ago

                                                                wouldn't EC2 be an example of paying to never hit cold starts?

                                                              • snak 102 days ago

                                                                No, you can't.