11 comments

  • hardwaresofton 1925 days ago
    This was a fascinating read.

    > I’ve had concerns about the sidecars in the past, however after this event I am convinced that having each workload ship with their own logging and metric sidecars is the right thing. If these had been a shared resource to the cluster, the problem almost certainly would have been exacerbated and much harder to fix.

    While I'm all for the sidecars model, I'm not sure that I agree with this...

    If logging/metrics ingestion were a shared resource that applications called out to directly (basically a shim for Kafka 99% of the time), it seems they could have built in a layer to dump writes to disk in case Kafka was unreachable. Even if they didn't and the shared service just started failing, that's a single (likely highly scaled) service failing (or, best case, 503ing and dropping logging/metrics where necessary), not service downtime or the cascading failures seen here. Maybe I'm misunderstanding something about this quote.
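
    Roughly the kind of fallback layer I mean, as a sketch in Go — the Sink interface and SpoolingSink type are hypothetical stand-ins for whatever the real Kafka shim would be:

      package logshim

      import (
          "fmt"
          "os"
      )

      // Sink is whatever the shared ingestion service ultimately writes to
      // (in practice a Kafka producer; here just an interface).
      type Sink interface {
          Write(msg []byte) error
      }

      // SpoolingSink wraps a primary sink and spools to local disk whenever
      // the primary is unreachable, instead of failing the calling workload.
      type SpoolingSink struct {
          primary Sink
          spool   *os.File
      }

      func NewSpoolingSink(primary Sink, spoolPath string) (*SpoolingSink, error) {
          f, err := os.OpenFile(spoolPath, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
          if err != nil {
              return nil, err
          }
          return &SpoolingSink{primary: primary, spool: f}, nil
      }

      func (s *SpoolingSink) Write(msg []byte) error {
          if err := s.primary.Write(msg); err != nil {
              // Primary (e.g. Kafka) unreachable: dump the write to disk and
              // carry on, so callers never block on logging.
              if _, werr := s.spool.Write(append(msg, '\n')); werr != nil {
                  return fmt.Errorf("primary failed (%v) and spool failed (%v)", err, werr)
              }
          }
          return nil
      }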

    Maybe it's just me, but workloads going down due to a sidecar that handles logging/metrics doesn't sound like the right prioritization to me.

    Also, like other people have said, resource limits and the Kubernetes compute reservation system[0] seem like a key bit that was missing from this.

    [0]: https://kubernetes.io/docs/tasks/administer-cluster/reserve-...
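
    For reference, the reservation in [0] boils down to a few kubelet config fields. A rough sketch via the Go types (the numbers are made up, and actually enforcing the reserved slices via cgroups needs extra node setup beyond this):

      package main

      import (
          "fmt"

          kubeletv1beta1 "k8s.io/kubelet/config/v1beta1"
      )

      func main() {
          // Carve node capacity out for system daemons and the kubelet itself,
          // so pods can never starve dockerd/kubelet (illustrative values).
          cfg := kubeletv1beta1.KubeletConfiguration{
              SystemReserved:         map[string]string{"cpu": "500m", "memory": "1Gi"},
              KubeReserved:           map[string]string{"cpu": "500m", "memory": "1Gi"},
              EnforceNodeAllocatable: []string{"pods"},
          }
          fmt.Printf("%+v\n", cfg)
      }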

  • rmb938 1925 days ago
    For those who are not too versed in distributed systems, Kubernetes, or Docker, I highly suggest you read the comments on the post. I am going to promote myself a bit here and link directly to my comment: https://medium.com/@rmb938/thanks-dan-for-the-great-post-1ef...

    Sadly a large environment was brought down by not following best practices. This event probably could have been prevented, or at least have had very little impact, if best practices were followed.

  • jacques_chester 1925 days ago
    Some questions I had after reading it:

    * Would this have been less painful with Netflix-esque patterns? Circuit breakers, throttling and the like? Failing hard on purpose is often easier to diagnose than failing fuzzily without clear cause.

    * Would it be acceptable to maintain spare capacity? For a prod cluster on the main money path, I think everyone thinks yes. Who holds the decision bit for dev, which is a second-order source of money?

    * Can I avoid the second act by having a clean way to (1) reset the entire system and (2) bring it up in an orderly fashion, as quickly as possible? Not every app needs to be relaunched simultaneously; there's presumably _some_ order of preference. Once the cluster state began to oscillate, it doesn't seem like there were any easy options other than resetting it.

    • kronin 1925 days ago
      The risk with having a reset mentality, "after all it's only dev" (edit: not intending to imply you said this), is that the opportunity to learn is lost. If they took the reset tack here, and this later occurred in production where that option wasn't available, choosing the reset path would in hindsight be seen as a lost opportunity.
      • jacques_chester 1925 days ago
        I hadn't thought of that. I still think it would be helpful in prod, though.
  • sciurus 1925 days ago
    The author's conclusions seem questionable. Ryan Belgrave's comment on them is worth reading.

    https://medium.com/@rmb938/thanks-dan-for-the-great-post-1ef...

  • tzhenghao 1925 days ago
    For some reason, this reminds me of Jay's piece on reliability of distributed systems [1]

    [1] - http://blog.empathybox.com/post/19574936361/getting-real-abo...

  • tomnipotent 1925 days ago
    > This problem did not affect the TAP production environment in the same way due to it simply being a much smaller environment

    Is it common for a dev environment to be larger than production...? Maybe it paints a picture of Target's ratio of software developers to traffic/load?

    • throwawayjava 1925 days ago
      He's not saying that the total dev environment is bigger than the total prod environment. He's saying that this one cluster of dev machines is much bigger than other clusters of machines (dev or prod). That could be true while simultaneously having many more prod machines than dev machines.

      The author explains:

      > however there is one cluster that was built significantly larger than the others and hosted around 2,000 development environment workloads. (This was an artifact of the early days of running “smaller clusters, more of them”, when we were trying to figure out the right size of “smaller”.)

    • kronin 1925 days ago
      Another possibility is that they deploy feature branches in dev, so for a given service there are more versions running simultaneously.
  • tedk-42 1925 days ago
    Key takeaway that isn't mentioned in the article:

    - Put resource limits on your pods.

    • bboreham 1925 days ago
      What’s your reasoning?

      From my experience, a memory limit, if hit, will cause periodic OOM kills and restarts; and a CPU limit will throttle the processes, slowing them down.

      The tale already mentions restarting and slowing-down; it’s not clear to me that adding more would have improved anything.

      • jpgvm 1925 days ago
        The issue was that the Docker daemon was affected by poor resource configuration. It should run at a higher priority and with guaranteed resources so it can't be impacted by user-mode tasks.

        k8s has been slow to catch up in this area but finally has priorities and preemption. That said, Docker normally doesn't run in a pod, so it's also a matter of setting the node's total allocatable CPU to something reasonable and ensuring all pods spawned by the kubelet are nested under a cgroup hierarchy that has a lower priority than the system tasks (the kubelet itself, Docker if you use it, etc.).
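
        For anyone who hasn't used the priority/preemption bits yet, a sketch using the Go types (the class name and value here are made up):

          package main

          import (
              "fmt"

              schedulingv1 "k8s.io/api/scheduling/v1"
              metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
          )

          func main() {
              // A high-value class for cluster plumbing; ordinary workloads keep
              // the default priority (0) and get preempted first under pressure.
              pc := schedulingv1.PriorityClass{
                  ObjectMeta:    metav1.ObjectMeta{Name: "infra-critical"},
                  Value:         1000000,
                  GlobalDefault: false,
                  Description:   "logging/metrics and other plumbing that must survive noisy neighbours",
              }
              fmt.Println(pc.Name, pc.Value)

              // Pods opt in by setting spec.priorityClassName to "infra-critical".
          }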

        • merb 1925 days ago
          Yeah, after reading that I came to the same conclusion and thought that he mostly blames it on cluster size, which is just wrong.
        • bboreham 1924 days ago
          Kubernetes has a “--system-reserved” flag for that purpose. Limits are something else entirely.
        • chupasaurus 1925 days ago
          And it could be easily done by adjusting Docker's systemd service unit.
      • tedk-42 1919 days ago
        If you've got an app that constantly uses more memory, then it's either a memory leak or poor tuning that's causing it.

        An app without a CPU limit can completely consume the host's CPU and take down everything else on that node. If you've got a very large instance with a lot of apps, a bunch of them are gonna experience poor QoS.

        Their tale suggests their sidecar pods/containers (which likely didn't have limits) made their production issues worse. A CPU limit on those pods would have limited the damage they caused to the rest of their infra.
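
        Roughly what that looks like on a sidecar container, sketched with the Go API types (the image name and numbers are made up):

          package main

          import (
              "fmt"

              corev1 "k8s.io/api/core/v1"
              "k8s.io/apimachinery/pkg/api/resource"
          )

          func main() {
              // A logging sidecar with a hard ceiling: it can burst a little,
              // but it can never eat the whole node (illustrative numbers).
              sidecar := corev1.Container{
                  Name:  "logging-sidecar",
                  Image: "example/logship:latest",
                  Resources: corev1.ResourceRequirements{
                      Requests: corev1.ResourceList{
                          corev1.ResourceCPU:    resource.MustParse("100m"),
                          corev1.ResourceMemory: resource.MustParse("128Mi"),
                      },
                      Limits: corev1.ResourceList{
                          corev1.ResourceCPU:    resource.MustParse("250m"),
                          corev1.ResourceMemory: resource.MustParse("256Mi"),
                      },
                  },
              }
              fmt.Printf("%s limits: %v\n", sidecar.Name, sidecar.Resources.Limits)
          }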

      • KaiserPro 1924 days ago
        You want to signal early that a node is unable to do what is requested of it.

        It's the same with unbounded queues: you need to put in limits so that alarms are triggered much earlier.
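
        Same idea in miniature: a bounded queue that refuses work (and lets you alarm) instead of growing without limit. A hedged sketch, not anyone's actual code:

          package main

          import "fmt"

          // boundedEnqueue tries to add an item to a fixed-capacity queue.
          // A full queue is surfaced immediately — that's the early alarm —
          // rather than letting the backlog grow unbounded.
          func boundedEnqueue(q chan string, item string) bool {
              select {
              case q <- item:
                  return true
              default:
                  return false // queue full: count it, alert on it
              }
          }

          func main() {
              q := make(chan string, 2) // deliberately small limit
              for i := 0; i < 4; i++ {
                  if !boundedEnqueue(q, fmt.Sprintf("job-%d", i)) {
                      fmt.Println("queue full, dropping and alerting:", i)
                  }
              }
          }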

        Infinitely spinning up new things is generally bad, unless you are responding to an external signal. Minimum deploy times are again useful: there is a financial cost (on the cloud, not on real tin) to short-term instances, so you need to keep them running as long as possible.

        • bboreham 1924 days ago
          Sounds like a Kubernetes resource request, rather than a limit.
  • coldcode 1925 days ago
    Complex systems, even when you think you've designed them to avoid complexity, are unpredictable when shit happens. Any system which is hard to completely understand is likely to fail in unforeseen ways, and may be difficult to recover from or to know how to change to keep it from happening again. Sadly I see these types of failures all the time where I work, though not quite at the same scale. One time a simple DNS update brought down our operations worldwide, even in places we thought weren't connected at all.
  • nerdbaggy 1925 days ago
    Interesting. Is there a benefit to running Kubernetes on OpenStack in their case vs. running it on bare metal?
    • erikb 1925 days ago
      Storage and network stability are much higher if you run it on top of OpenStack. k8s on bare metal is a pain in the butt. The disadvantage is that you have even more layers of abstraction.
      • nalbyuites 1925 days ago
        I understand storage, but could you elaborate on why there would be higher network stability with Openstack?
  • marknadal 1925 days ago
    The only real way to deal with this is to test distributed systems.

    Doing so is hard, but it's the only way to reliably know how a system behaves given unpredictable failures.

    So learn up:

    - http://jepsen.io

    - https://github.com/Netflix/chaosmonkey

    - https://github.com/gundb/panic-server

  • jijji 1925 days ago
    I wonder if Target still stores their customers' cardholder PINs [1].

    [1] https://www.theguardian.com/world/2013/dec/27/target-hackers...

    • kiallmacinnes 1925 days ago
      According to your own linked article, Target does not store the PINs.

      What happens is, the PIN pad encrypts the PIN and other payment info using a key known only to the card issuer (e.g. Visa). This encrypted data then finds its way to the card issuer for verification via one of a few possible paths: either PSTN dialup or, more commonly these days, over the internet. In the Target incident, hackers grabbed this data as it was being transferred over the LAN.

      (For completeness, there are many more organisations this encrypted data passes through between the merchant and card issuer, but nobody but the card issuer can decrypt it)

      Also, it's incredibly off-topic for the post. I'll bet that's why you're getting all the downvotes.