I want to understand what kind of challenges you all are facing with regard to this, and what tools or practices you're using to reduce the pain. For example: How do you deploy resources? How do you define your architecture? How do you manage your environments, your observability, etc.?
I'd use AWS + ECR + ECS personally. Stay away from kubernetes until you have a fairly large deployment as it requires huge administrative overhead to keep it up to date and well managed and a lot of knowledge overhead. Keep your containers entirely portable. Make sure you understand persistent storage and backups for any state.
Also stay the hell away from anything which is not portable between cloud vendors.
As for observability, carefully price up buy vs build because buy gets extremely expensive. Last thing you want is a $500k bill for logging one day.
And importantly every time you do something in any major public cloud, someone is making money out of you and it affects your bottom line. You need to develop an accounting mentality. Nothing is free. Build around that assumption. Favour simple, scalable architectures, not complex sprawling microservices.
I'm currently VP Eng at a "growth stage" startup, and the state of things when I joined was that about 75% of a dev's time was spent fighting with external services and data. Someone would spend an entire sprint on a feature, but most of the time was actually spent fighting with our IdP.
A huge effort and focus of mine (one I had to beat into everyone's heads) was that being able to a) run everything locally and/or b) have reasonable fakes for external dependencies means we can spend far more time building actual product and far less time debugging infra issues.
Doesn't ECS violate this? This is something I've preferred in theory about kubernetes. At least in theory, it's supported everywhere. But I have also wondered whether it would be just as "easy" in practice to move from ECS to another cloud, as it is with EKS. (That is, neither would actually be easy.)
CDN->Load Balancer->autoscale docker->database. That serves 90% of use cases but you'll have access to the myriad other services they offer for anything else that comes up.
> ...
> Also stay the hell away from anything which is not portable between cloud vendors.
But if your team happens to have Kubernetes expertise, I would say go for Kubernetes.
Anything after building container images will be somewhat cloud-specific regardless. And arguably it can be easier to hire infra/platform folks with Kubernetes skillsets later when you need to.
Most people don’t need kubernetes, and ECS + Terraform is very approachable and easy to get started with, especially with Fargate for compute.
Who has access? How do we audit / rotate? How do we secure?
You can apply this approach at each step along the way: how do you secure secrets in your cloud? In code? In IaC? In container deployments? In CI/CD?
If we assume infra / app is code, the tooling matters a lot less. How do you provision certificates via IaC? How do you grant IAM to resources and how do you revoke?
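As a sketch of what that looks like when infra is code (resource and role names here are illustrative, and the app task role is assumed to be defined elsewhere): the secret and the grant are both Terraform, so granting access is a reviewed PR and revoking it is a revert.

```hcl
resource "aws_secretsmanager_secret" "db_password" {
  name = "prod/app/db-password"
}

data "aws_iam_policy_document" "read_db_secret" {
  statement {
    actions   = ["secretsmanager:GetSecretValue"]
    resources = [aws_secretsmanager_secret.db_password.arn]
  }
}

# Attach the policy to the (hypothetical) app task role. Deleting this
# block in a PR revokes access, and plan/apply logs are your audit trail.
resource "aws_iam_role_policy" "app_reads_db_secret" {
  role   = aws_iam_role.app_task.id
  policy = data.aws_iam_policy_document.read_db_secret.json
}
```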
There are examples like https://github.com/terraform-google-modules/terraform-exampl... of more advanced IaC architectures, but you can start as small or as complex as you want and evolve if done properly.
Personally, I love me some Kubernetes + ArgoCD (GitOps) + Google Workload Identity + Google Secret Manager, but I am 100% biased.
Is there something like this for AWS?
Check out the ECS repository for more complete examples.
- Document everything. Hacking together some cloud infra is easy, maintaining it is a completely different story. You want to have clear and up-to-date documentation on what exactly you have running and why. If you don't have good docs, it'll be impossible to scale, refactor, or solve outages.
- Make sure people are trained. Cloud infra management gets complicated real fast. You want people to actually know what they are doing (no "hey, why is the bucket public" oopsies), and you want to avoid running into Bus Factor issues. Not everyone needs to be an expert, but you should always have one go-to person for questions who's actually capable, and one backup in case something happens to them.
- Watch your bills. Every individual part isn't too expensive, but it all adds up quickly. Make sure you do a sanity check every once in a while (no, $5k / month for a DB storing megabytes is not normal). Don't use scaling as a "magic fix-everything button" - sometimes your code just sucks.
I can do both, but for a bootstrapped solo business, Kubernetes is overkill and overengineered. What I would really love is a multi-node podman infrastructure, where I can scale out without having to deal with k8s and its infernal circus of YAML, Helm, etcd, kustomize, certificate rotation, etc.
Recently I had to set up zero-downtime deploys for my app. I spent a week seriously considering a move to k3s, but the churn across the entire Kubernetes ecosystem frustrated me so much that I simply wrote a custom script based on Caddy, regular container health checks, and container cloning. It's easier to understand, it's 20 lines of code, and I don't have to sell my soul to the k8s devil just yet.
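A rough sketch of that kind of script (container names, ports, and the Caddy config path are all invented, and the admin API call is simplified): start a clone, wait for it to pass health checks, repoint the proxy, retire the old container.

```shell
# Poll a command until it succeeds, or give up after $2 tries.
wait_healthy() {
  cmd=$1
  tries=${2:-30}
  i=0
  while [ "$i" -lt "$tries" ]; do
    if $cmd >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep "${SLEEP:-1}"
  done
  return 1
}

deploy() {
  # Start the clone on a spare port, gate on its health endpoint.
  docker run -d --name app-new -p 8081:8080 "$IMAGE"
  wait_healthy "curl -fsS http://localhost:8081/healthz" || exit 1
  # Caddy's admin API (default :2019) can reconfigure upstreams
  # without dropping in-flight requests.
  curl -sX PATCH "localhost:2019/config/apps/http/servers/srv0/routes" \
       -H "Content-Type: application/json" -d @new-upstream.json
  docker rm -f app-old
}
```

The health-check loop is the part doing the real work; everything else is plumbing around your proxy of choice.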
Sadly, I don't think a startup can help make this better. I want a bona fide FOSS solution to this problem, not another tool to get handcuffed to. I seem to remember Red Hat were working on a prototype of a systemd-podman orchestration system to make it easy to deploy a single systemd unit onto multiple hosts, but I can't remember what it's called anymore.
---
Also, I seem to be an outlier, judging from the rest of the comments, by running on dedicated servers. These days everybody is using one of the clouds and terribly afraid of managing servers. I think it's going to be hard to make DevOps better when everyone is in the "loving" Azure/AWS/GCP embrace: you're basically positioning as their competitor, as the cloud vendor itself is always trying to upsell its customers and reduce friction to as close to zero as possible.
I'd say it depends where you're coming from. For me, setting up a Kubernetes cluster (no matter which flavor) with external-dns and cert-manager will most likely take 30m-1h and that is the basic stuff that you need for running an app with the topics you mentioned. To navigate through k8s just use k9s and you're golden.
I never get where all the "k8s is the devil" comments come from. There is nothing really complex about it. It's a well defined API with some controllers, that's it. As soon as I need to have more than one server running my workloads I would always default to k8s.
And Linux is just a kernel with some tools, which are all well defined, that's it! But when you need to debug a complex interaction, "that's it" and "it's well defined" aren't enough.
Kubernetes is quite complex, with a lot of interactions between different components. Upgrades are a pain because all those interactions need to be verified as compatible with one another, and the versioned APIs, as cool a concept as they are on paper, mean there are constantly moving targets that need constant supervision. You can't just jump a version: you need to check that all your admission controllers, CNI and CSI drivers, Ingress controller, cert-manager, and everything else are compatible with the new version and with each other. This is not trivial at any scale, which is why many orgs adopt the approach of just deploying a new cluster, redeploying everything to it, and switching over - which is indicative of exactly how much of a pain it is.
Even Google, who created it, admit it's complex, and they offer three managed services at different levels of abstraction to make it less complicated to use and maintain.
I understand how it works, it makes sense, but then you're faced with Helm, kustomize, jsonnet, and a lot of bullshit just to have minimal and reproducible templating around YAML. Or maybe you should use Ansible to set it up. Maybe instead try ArgoCD. Everybody and their dog is an AWS or GCP evangelist and keeps trying to discourage you from running it outside of the blessed clouds. If anything breaks, you're told you're stupid, that's what you get, and that you should've paid someone else to manage it.
It feels like everybody is selling you something to just manage the complexity they have created. This is what keeps me away. It's insane.
Personally I like VMs, and I’ve been toying around with the idea of having a LXD cluster on dedicated servers where each LXD container hosts some of my workloads.
- No ability to use secrets as environment variables, and no plans to change this
- Cannot use `network_mode` to specify a service to use for network connections a la Docker Compose
There were a few other minor issues which resulted in ditching Docker Swarm completely and moving to a Nomad + Consul stack instead.
I disagree - there are cloud PaaSes like Fargate or Cloud Run, or, when self-managing, a third option which is much less known but considerably easier while also being more flexible - HashiCorp Nomad. Disclaimer time: I work at HashiCorp, but I'd held this opinion for years before joining - https://atodorov.me/2021/02/27/why-you-should-take-a-look-at... (it's out of date, but the principles still apply, just to a different extent, Nomad having gotten easier since). All opinions are my own, etc.
Not FOSS anymore, but free and source available, composable, and does a big portion of what Kubernetes does at a fraction of the complexity. I ran it in production for a few years and everything was a breeze, unlike the Kubernetes clusters I was maintaining at the same time. You get all the "basics" of HA, failover, health checks, easy ingress, advanced deployment types, basic secrets storage, etc. without having to write thousands of lines of YAML, and with simple upgrades and close to no maintenance.
> Also, I seem to be an outlier, judging from the rest of the comments, by running on dedicated servers. These days everybody is using one of the clouds and terribly afraid of managing servers. I think it's going to be hard to make DevOps better when everyone is in the "loving" Azure/AWS/GCP embrace: you're basically positioning as their competitor, as the cloud vendor itself is always trying to upsell its customers and reduce friction to as close to zero as possible.
Because running on dedicated servers means you spend time managing them that could be spent more productively working on your product. Don't get me wrong, I love that stuff, but it's like when I started my blog: I spent weeks on CI/CD, a nice theme, a fancy static site generator, all sorts of optimisations, choosing a good hosting provider, and so on, instead of doing what I was supposed to be doing - writing content. When you manage your own servers your costs might even be higher than a FaaS/CaaS/PaaS from a cloud provider, especially with a free trial year, free tier, or startup credits. As long as you keep an eye on not locking yourself in, you can easily migrate most products to self-managed dedicated servers later if performance or costs require it.
So, I don't really care to try out Nomad for my business, honestly.
I disagree with the DSL part, but it's in big part subjective so people have all sorts of different experiences and opinions.
I guess it is proprietary tech though which might be off putting.
We should keep things simple - KISS. A few key points I keep in mind for myself going forward:
1. Be cloud agnostic. Everyone operates on margins and a slight increase in a cloud provider's fees could be a death sentence for your business. Remember, cloud providers are not your friends; they are in the business of taking your money and putting it in their pockets.
2. Consider using bare metal if it's feasible for your operations. The price/performance ratio of bare metal is unbeatable, and it encourages a simpler infrastructure. It also presents an opportunity to learn about devops, making it harder for others to sell you junk as a premium service. This approach also discourages the proliferation of multiple databases/tech/tools for the sake of CV updates by your colleagues, keeping your infrastructure streamlined.
3. Opt for versatile tools like Ansible that can handle a variety of tasks. Don't be swayed by what's popular among the "cool kids". Your focus should be on making your business succeed, not on experimenting with every new tool in the market. Master it well.
4. Make sure you can replicate your whole production stack on your box in a few seconds, a minute max. If you can't, well, back to the drawing board.
5. Use old, tried-and-true tech, and choose it wisely. Docker is no longer cool and Podman is all the rage on HN, but there are hundreds of man-hours' worth of documentation online for every Docker issue you can think of, and Docker will stay around for a while. The same goes for Java/Rails/PHP...
6. Keep everything reproducible in your repository: code, documentation, deployment scripts, and infra diagrams. I've seen people use one service for infra diagrams and another to describe database schema. It's madness.
7. (addon) Stay away from monorepos. They are cool, they are "googly," but you are not Google or Microsoft. They are notoriously hard to scale and secure without custom tooling, no matter what people say. If you have problems with the code sharing between repos, back to the drawing board.
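Point 4 above (replicating the whole stack on your box) is often just a compose file that mirrors production; a minimal sketch, with service names, images, and credentials all illustrative:

```yaml
# docker-compose.yml -- `docker compose up` brings up the whole stack
services:
  app:
    build: .
    ports: ["8080:8080"]
    environment:
      DATABASE_URL: postgres://app:app@db:5432/app
    depends_on: [db]
  db:
    image: postgres:16
    environment:
      POSTGRES_USER: app
      POSTGRES_PASSWORD: app
      POSTGRES_DB: app
```

If bringing this up takes more than a minute, that's the "back to the drawing board" signal.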
I could potentially* get behind a logic of splitting unique web services into their own repo, but for libraries I find separate repos to be a huge unnecessary overhead.
*potentially, as in I still prefer a monorepo for everything but I can see valid arguments for this one case
I only think monorepos for a Javascript / Typescript stack are heaven sent (pnpm, turborepo), because reusing NPM packages otherwise is a huge pain.
There are a lot of great tools out there, but making them play well together is an exercise for the reader. There are also a lot of preference-based choices you need to make in how you want your setup to look, and what you chose will affect what tools make sense to you.
Do you go monorepo or polyrepo? If you go monorepo, how do you decide what to build and deploy on each merge? If you go polyrepo, how do you keep stuff in sync between any code you want to share?
Once a build is complete, how do you trigger a deployment? How does your CI system integrate with your deployment system, or is the answer "with some shell scripts you have to write"?
> How do you deploy resources?
For us, we have a monorepo setup with bazel. I wrote some fairly primitive scripts to scan git changes to decide what to build. We use Buildkite for CI, which triggers rollouts to kubernetes with ArgoCD. I had to do a non-trivial amount of work to tie all this together, but it's been fairly robust and has only needed a minimal amount of care and feeding.
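The "scan git changes" step can be sketched roughly like this (directory layout and target naming are invented, not the poster's actual setup): map changed paths to per-service build targets.

```shell
# Map changed file paths (stdin) to unique "//services/<name>" targets.
# Assumes services live under services/<name>/; everything else is
# ignored (a real setup would also handle shared-library changes).
targets_for_changes() {
  sed -n 's|^\(services/[^/]*\)/.*|//\1|p' | sort -u
}

# In CI, feed it the diff between the base and head commits:
#   git diff --name-only "$BASE_SHA" "$HEAD_SHA" | targets_for_changes
```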
> How do you define architecture?
Kubernetes charts for our services are in git, but there's some amount of extra stuff deployed (an ingress controller, for example) that is only documented in a text file.
> How do you manage your environments
We don't need to deploy environments super often, so just do it manually and update documentation in the process if any variations are needed.
> observability
Datadog and sumologic.
Overall our setup doesn't come close to the setup I worked on at my last employer, but I have to balance time spent on devops infra with time spent on the product, and that setup took ~5 full time engineers to maintain.
Out of curiosity, why just the "readmeware" for those components? I can't think of a single thing that requires clickops in a modern k8s setup, so much so that in the beginning we used to bring up the full stack from nothing based on a single CFN template - roles, load balancer, auto-scaling group, control plane, CSI driver (this was back when EKS was a raging tire fire) - and then lay the actual business apps on it. The whole process took about 8 minutes from go.
If nothing else, one will want to be cautious about readmeware components in disaster recovery situations. If no one has run those steps in 6 months, and then there's some kind of "all hands on deck," the stress will likely make that institutional knowledge leak out of their ears
Because there are so few of them. Our setup has an ingress controller and a certificate manager, and then some bookkeeping like copying the container registry credentials into every namespace
> I can't think of a single thing that requires clickops in a modern k8s setup
Absolutely agree.
> The whole process took about 8 minutes from go
How long to do the development and testing of the template, and what size is your team?
Don't get me wrong, I'm not happy about this situation. As well as the DR concern you raise, we can't quickly spin up short lived clones of our infra for testing complex changes, so we test them in our staging environment and have to block prod deploys until we're either happy with the change or decide to roll it back. At a larger org this would be a major headache but at our current size it does not matter.
So is it that you don't need to deploy multiple times, or that you don't do it because the system is more stable when you deploy less often? I mean, is it by choice, or because of some tooling or expertise limitation?
> Also, for architecture stored in text files - does that cause any problems for you?
> does that cause any problems for you?
Nope. Our Kubernetes setup is just about as simple as it is possible for a Kubernetes setup to be. I entertained the idea of going with something else, because we definitely don't need all Kubernetes has to offer. But I settled on it because it's what I know best, and the overhead + risk of something new would have exceeded the cost of the (for us unnecessary) baggage that Kubernetes brings.
If our requirements were different or if I was making regular changes, we would be in a very different spot. But as it stands today it is just not a priority.
my advice would be:
Separate your build from your infra. While it's nice to have your cloud spun up by CI, it's really not a great idea, and it means your CI has loads of power that can be abused.
GitLab with local runners is a good place to start for CI. It's relatively simple, and your personal runners can be shared between projects (great for keeping costs down while speeding builds up, as you can share one massive instance).
Avoid raw Kubernetes until you really, really have to. It's not worth the time unless you have someone to manage it and your use case requires it. Push back hard if anyone asserts that it's the solution to x; most of the time it's because "it's cool". K8s only really becomes useful if you are trying to run multiple nodes across different clouds or a hybrid local/cloud deployment. For almost everything else, it's just not worth it.
You are unlikely to change cloud providers, so choose one and stick to it. Use their managed features. Assuming you are using AWS, Lambdas are really good for starting out. But make sure you start deploying them with CloudFormation/Terraform (Terraform is faster, but not always better).
Use ECS to manage services and RDS to manage data. Yes, it is more expensive, but backups and duplication come for free (i.e. you can spin up a test deployment with actual data). Take the time to make sure you are not relying on hand-rolled stuff made in the web console; really put the effort into making sure everything is stored in Terraform/CF and in a git repo somewhere.
Limit the access you grant to people, services and things. Take the time to learn IAM/equivalent. Make sure that you have bespoke roles for each service/thing.
Rotate keys weekly, use the managed key/secrets storage to do that. Automate it where you can.
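A sketch of what automating that rotation could look like (the surrounding plumbing is invented; `aws iam list-access-keys` / `create-access-key` / `delete-access-key` are real subcommands, but error handling and pushing the new key into your secret store are left out):

```shell
MAX_AGE_DAYS=7

# Pure helper: is a key older than MAX_AGE_DAYS?
# $1 = creation time (epoch seconds), $2 = now (epoch seconds)
key_expired() {
  [ $(( $2 - $1 )) -gt $(( MAX_AGE_DAYS * 86400 )) ]
}

rotate_user() {  # $1 = IAM user name
  now=$(date +%s)
  aws iam list-access-keys --user-name "$1" \
      --query 'AccessKeyMetadata[].[AccessKeyId,CreateDate]' --output text |
  while read -r key_id created; do
    created_s=$(date -d "$created" +%s)   # GNU date; adjust on BSD/macOS
    if key_expired "$created_s" "$now"; then
      aws iam create-access-key --user-name "$1"  # store the new key first!
      aws iam delete-access-key --user-name "$1" --access-key-id "$key_id"
    fi
  done
}
```

In practice you'd let Secrets Manager's built-in rotation do this where possible, and only script it for things it doesn't cover.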
I am curious about this because I see opposing views expressed by different people. I have never personally been in a position where the decision has been relevant.
I work at a cloud provider, and I'm told that a big slice of our revenue comes from customers who are already load-balancing across multiple clouds, so if we degrade perf/dollar they just turn a dial and shift load to our competitors.
This has always seemed very smart to me and I would love to get more perspectives on how easy it is to get to that position where your infrastructure is so commoditised that you can migrate between providers and back at the push of a button.
Is this something people achieve late in their lifecycle after massive cost-optimisation push? Or is this actually something you can build towards from day 1?
It's only anecdata, but I'd highly question that. All the companies I've worked at (startup, midsize, megacorps) went with one cloud provider and stuck to it. That's not only my experience but also that of friends who work in the same field. There might be a slight difference at megacorps, where I've seen multiple cloud providers in use, but more like: Team A uses AWS, Team B uses GCP. Never: Team A uses AWS and has a copy running, ready to go, on GCP.
I tend to agree with the GP, to a degree. Choose a cloud provider and stick with it; the probability that the company changes providers is very low (in the end, all cloud providers offer the same things at similar prices, so there's no need to switch). But don't fully buy in: if you're on AWS, of course use RDS for your DB and S3 for object storage, but don't use CodePipeline to build. Don't go "fully" proprietary.
My sample size is me and my immediate friends. I suspect that if you have a resilient multi-cloud deployment, then you'll use it to hunt for a metric you want to hit (speed/price/latency)
However, the engineering cost to get there is pretty high, and almost negates the point of having the cloud (unless you have a scaling requirement where you need 10x at short notice.)
I worked at a large news company, and it was decided that it was cheaper to just pay for hosted services than to pay people to run them (think RDS vs. a home-grown DB). RDS is, what, 2x the cost of a normal instance? That's still cheaper than the three engineers plus on-call needed to manage a custom deployment and handle the backups and migrations (along with Conway's law of having DBAs).
Our biggest problem is feature environments, or actual integration tests where multiple services have to change. Because infra is in its own repo in terraform and the apps have their own repo we don’t have a good way of creating infra-identical environments for testing code changes that affect multiple services. We always end up with some hack and manual tweaks in staging.
Data engineering is another problem, managing how to propagate app schema changes to the data warehouse is a pain because it has to happen in sequence across repo borders. If it was all one repo and we got a new data warehouse per PR it would be trivial.
Not trusting CI to hold secrets is another. As soon as we do anything in CI that needs "real" data, we have to trigger AWS ECS tasks, because CircleCI has leaked secrets before, so we don't trust them and keep all the valuable secrets that can access real data in AWS SSM. The more complex the integrations, the harder they are to test.
If we had a monorepo I think this type of work would be much easier. But that comes with its own set of problems, mainly deployment speed and cost.
If there was a way to snapshot all our state and “copy” it to a clean environment created for each PR that the PR could then change at will and test completely, end to end, that would be the dream.
OT1H, :fu: terraform, so I could readily imagine it could actually be the whole problem you're experiencing, but OTOH it is just doing what it is told, so that's why I wanted to hear more about what, specifically, the problem is? too many hard coded strings? permission woes? race conditions (that is my life-long battle with provisioning stuff)?
this whole Ask HN is nerd sniping me, but I'm also hoping that we genuinely can try and find some "this has worked for me" that can lift all boats
Then of course there's always the issue of people taking shortcuts: "I will test this once, then we know it works, then I'll just hardcode this thing." Making stuff truly idempotent and portable is extra work for a "nice to have" feature env. YAGNI, until you do. Most devs are happy to have something that works so they can move on and ship the next thing.
Personally i’m a big fan of monoliths because they supposedly make this stuff easier. Then again i’ve never worked on a huge one and my colleagues much prefer spinning up an “isolated independent loosely coupled” service to adding it in the main app.
I'm always baffled to see how many shops claim to either not have this problem, or sidestep it. Every project I've ever worked on has had multiple enhancements/fixes in flight at the same time, they need to be tested and deployed independently of each other, on their own timelines. For this we need story branches and a fast way of deploying different story branches to different test environments. If you're merging everything into kitchen-sink "dev", "staging", "test" branches because someone drank the Gitflow kool-aid, your confidence goes way down that a specific story branch is "ready to go" and that production will behave exactly like dev and test. And, as you mention, accomplishing the story-branch approach across multiple repos (assuming the change is large enough to affect multiple repos) sounds like swimming with crocodiles.
Now that I looked at the website to share the link, I would probably be scared away because it looks very enterprise and corporate but it works well and it's just a Docker container built in CI and I don't have to interact with it.
Imperative tools express schema changes as a filesystem of incremental "migrations", whereas declarative tools use a filesystem of CREATE statements and can generate DDL diffs to transition between states.
The declarative flow has a lot of benefits [1]. It is much closer to how source code is typically managed/reviewed/deployed, which is especially important for stored procedures [2].
However, using a purely declarative flow (as opposed to using a declarative tool just to auto-generate incremental migrations, which are then run from another tool) isn't necessarily advisable in Postgres, since Postgres supports DDL in transactions. So there's potentially multiple paths/orderings to reach any given desired state, as well as the ability to mix schema changes with data changes in one transaction. This means in Postgres there's often manual editing of the auto-generated incremental change, whereas in other DBs with fewer features this isn't necessary.
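To make the two flows concrete (table and file names invented): an imperative repo stores the steps taken, while a declarative repo stores the desired end state and lets the tool derive the steps.

```sql
-- Imperative: migrations/0007_add_users_email.sql, written by hand,
-- applied once, never edited again.
ALTER TABLE users ADD COLUMN email text;

-- Declarative: schema/users.sql always holds the desired end state;
-- the tool diffs it against the live database and emits the ALTER.
CREATE TABLE users (
  id    bigint PRIMARY KEY,
  email text
);
```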
Disclosure: I'm the author of Skeema, mentioned in the GP comment.
[1] https://www.skeema.io/blog/2019/01/18/declarative/
[2] https://www.skeema.io/blog/2023/10/24/stored-proc-deployment...
Then you can just use pg like normal, directly.
In general, when you have a solution that is supposed to handle any problem and scenario, as with AWS, you'll eventually end up with some complicated Frankenstein-y creation; there's probably no way around it if they want such a robust set of features and capabilities.
Well that and the constant cost increases. Container apps in Azure went from around $20-25 to $120 on our subscription. Along with all the other price hikes we're looking to move out of the cloud and go into something like Hetzner (but localized to our country).
I was already using more than one service, but I cancelled Rackspace when this happened, although I was not yet making enough to be fully and automatically redundant. It was a pain to suddenly have to drop everything and rebuild, as it broke the entire service I was offering. I rebuilt everything on my existing service at Linode (now Akamai), and then got a VPS at RamNode as my backup. So it was a lot of work suddenly thrown at me that I didn't need. Luckily, while my backup practices are not completely ideal, I follow the 3-2-1 backup rule well enough that it wasn't catastrophic.
Here's the message I got from Rackspace in 2019:
This message is a follow-up to our previous notifications regarding cloud server, "c0defeed". Your cloud server could not be recovered due to the failure of its host.
Please reference this ID if you need to contact support: "CSHD-5a271828"
You have the option to rebuild your server from your most recent server image or from a stock image. Once you have verified that your rebuilt server is online, you must (1) Copy your data from your most recent backup or attach your data device if you use Cloud Block Storage and (2) Adjust your DNS to reflect the new server’s IP address for any affected domains. When you have verified that your rebuilt server is online, with your data intact, you will need to delete the impacted server from your account to prevent any further billing associated with the impacted device.
We apologize for any inconvenience this may have caused you.
Number two: using tech that you really don't need. Keep it simple, even if it means not using something that the rest of the industry is in love with. Figure out what your requirements are, pick the simplest way of setting it up (keeping #1 in account, because using tooling that isn't able to grow with you is worse IMO), and keep it agnostic. Then when you start to hit the edges of the capabilities, either scale up your ops understanding and start using more sophisticated things, or hire someone in to help out.
Also, I really do agree with @KaiserPro when they say to separate your infra from your deploy. It makes moving to something else much, MUCH easier when you inevitably need to.
Without dedicated devops the challenge is allocating developer time to do the integration work required to get the various ops tools playing nicely together. In my experience that work is non-trivial and eventually leads to some form of dedicated ops.
That is also part of a dynamic -- lots of tools available for solving these problems exist because of dedicated ops. It's not easy for a team trying to build software to also take on these extra operational responsibilities.
What we're trying to do is create a highly curated set of what we think of as application primitives. These primitives include CI/CD, Logs, Services, Resources etc. Because they're already integrated it doesn't require developers to figure out API keys, access control, data synchronization, etc.
1. https://noop.dev
they seem to be using the excellent lima <https://github.com/lima-vm/lima#readme> for booting on macOS; I run colima for its containerd and k8s support but strongly recommend both projects $(brew install lima colima)
Linked in parent is a blueprint which is how we define applications. That config can be used to stand up an application locally as well as in a cloud environment without additional configuration.
Use CodePipeline/Cloud Build, then either a container registry plus ECS, or Beanstalk.
You get everything you mentioned for free with either set up.
Seriously, it takes moments to write the Terraform, or even to do it by hand.
This comes on top of the benefit that architecting for multi-cloud at that scale usually forces you to simplify/rationalize/automate ruthlessly and so is probably beneficial for a "mature" org with lots of cruft anyway.
That said there may only be a few dozen companies in the world where this makes sense. I happen to work for one, but most engineers don't. Even then, this is at the extreme right of the maturity curve, meaning that there is lots of lower-hanging fruit to pick first
And this is possible because we follow an architecture-first approach. We have blueprints where you define your architecture, and plug that template to any cloud provider.
So yeah, one of the clients we were working with had to move from AWS to GCP (mostly for utilizing cloud credits). The setup was small but the estimated time was nearly 2-3 months. It's really a pain, and I think a lot of organizations don't think of doing it because when they weigh rewards and the effort/time needed, it doesn't make sense.
If we need to move it then we move it... not exactly difficult or time consuming. There are always ways of half moving things too, for example we could move the container deployment to another cloud but keep our build process and object store at the original vendor and so on.
In reality we are not going to be shifting production load between cloud vendors every 2 weeks.
Keeping it very simple: I push code to Github; then Capistrano (think bash script with some bells and whistles) deploys that code to the server and restarts systemd processes, namely Puma and Sidekiq.
The tech stack is fairly simple as well: Rails, SQLite, Sidekiq + Redis, and Caddy, all hosted on a single Hetzner dedicated server.
The only problem is that I can't deploy as often as I want because some Sidekiq jobs run for several days, and deploying code means disrupting those jobs. I try to schedule deployment times; even then, sometimes, I have to cut some jobs short.
Sounds like a use case for Cadence/Temporal-style fault-oblivious stateful workflow execution. At my last job we did Unreal Engine deployments with pixel streaming at scale on a huge fleet of GPU servers, and being able to persist execution state so hassle-free that the code would magically resume at the exact line where it got interrupted was astounding.
If AWS is required, a monolith on ECS Fargate + Aurora Postgres Serverless is perfectly cromulent. You'll need to string together some open source tools for e.g. ci/cd.
"In the cloud" has recently come to mean "NSA gets a copy of whatever they want" unless you choose wisely.
And even so, how much time would developers be spending on these tasks weekly?
And the lack of attention to lean alternatives like Docker Swarm is another one.
You can quite easily connect it to GCP/AWS/Azure.