1 comments

  • csears 1986 days ago
    ... also known as a monitoring tool
    • chrisparnin 1985 days ago
      This came up in our discussion with folks at Netflix, where we had a long talk about their verification checks. Regarding monitoring:

      > Monitoring and alerting is a different beast of a topic. Certainly you can monitor all of these things, but our guidance on alerting strategies is along the lines of finding the top-level metrics for a service (Google SREs would call these Service Level Objectives) and only alerting people to dig in when those are impacted. Usually error rates, latency (how many micro/milli/seconds to respond), and throughput (requests per second). A lot of this relies on the people writing the application to instrument their code to store metrics and events. There’s only so much that you can monitor from outside the application. Some sort of outlier detection is popular too to terminate bad, one-off instances (particularly common in the public cloud); funny thing is you have to alert on whether you’re killing too many instances too quickly (irony of automation).
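
      To make the quoted guidance concrete, here is a rough Python sketch of the two ideas in it: alert only when a top-level objective (error rate, latency, throughput) is breached, and cap automated instance termination so a human is paged when too many instances look bad at once. Everything here is an assumption for illustration, not Netflix's implementation: the `ServiceWindow` shape, the function names, the thresholds, and the median-based outlier rule are all hypothetical.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class ServiceWindow:
    """Aggregated metrics for one service over one time window (hypothetical shape)."""
    requests: int
    errors: int
    p99_latency_ms: float
    window_seconds: float


def slo_breaches(w, max_error_rate=0.01, max_p99_ms=500.0, min_rps=10.0):
    """Return the names of the top-level objectives this window violates.

    The default thresholds are placeholders, not recommendations.
    """
    breaches = []
    error_rate = w.errors / w.requests if w.requests else 0.0
    if error_rate > max_error_rate:
        breaches.append("error_rate")
    if w.p99_latency_ms > max_p99_ms:
        breaches.append("latency")
    if w.requests / w.window_seconds < min_rps:
        breaches.append("throughput")
    return breaches


def pick_outliers(error_rates, factor=5.0, floor=0.01):
    """Flag instances whose error rate is far above the fleet median.

    `error_rates` maps instance name -> error rate. The median rule is a
    stand-in for whatever outlier detection a real system would use.
    """
    if not error_rates:
        return []
    threshold = max(median(error_rates.values()) * factor, floor)
    return sorted(name for name, rate in error_rates.items() if rate > threshold)


def plan_terminations(error_rates, max_kill_fraction=0.25):
    """Return (instances_to_kill, alert_humans).

    If more instances look bad than the kill budget allows, terminate
    nothing and alert instead: the automation itself may be the problem.
    """
    outliers = pick_outliers(error_rates)
    budget = int(len(error_rates) * max_kill_fraction)
    if len(outliers) > budget:
        return [], True
    return outliers, False


window = ServiceWindow(requests=6000, errors=120, p99_latency_ms=300.0, window_seconds=60.0)
print(slo_breaches(window))

fleet = {"i-a": 0.001, "i-b": 0.002, "i-c": 0.001, "i-d": 0.30}
print(plan_terminations(fleet))
```

      The point of returning an empty kill list when the budget is exceeded is the "irony of automation" from the quote: a single bad instance is safe to terminate automatically, but many bad instances at once is more likely a systemic problem that people should look at.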

      I would venture that we want to separate two things: testing images and infrastructure (on startup before receiving traffic, or just after making it out of CI) versus monitoring the health of running instances. But the two can also have things in common!