
  • not_kurt_godel 1927 days ago
    There are 2 gaps here, at least as they relate to my understanding of what "canaries" are (based on experience):

    1. Requiring manual validation of canary results is antithetical to CI/CD principles. If you can't trust your canaries enough to automatically promote or block based on results, doing manual validation is just a band-aid that will have to be ripped off painfully eventually.

    2. Canaries should run continuously against all deployment stages, not just as a one-time approval process in your pipeline. Good canaries give visibility into baseline metrics at all times, not just when validating a new application version.

    Overall I would say this guide aligns more with what I would term a non-CI/CD load-test approval workflow than a "canary".

    • mrtrustor 1927 days ago
      Author here.

      1. Yes. This is what the article explains in the "Monitor your new pipeline before you trust it" section. This is more of a "how to get started", and when you've just created your canary config, you probably don't want to trust it to push to production just yet.

      2. I'm not sure I understand here. What version of your application is your canary running if you're not using it to validate a new version? The same as the baseline? But then, what are you using it for?

      • donavanm 1927 days ago
        For #2, canaries are commonly run against the stable/“production” deployment on some sort of periodic basis. This is used to approximate a customer experience and detect faults in underlying components, intermediate infrastructure, or changes outside of “the software” like configuration data. It's an adjunct or backstop to metric-based anomaly detection, from what I've seen.

        Edit: as an aside, there's a very interesting area of discussion around the spectrum of integration tests, canaries, & user experience monitoring. If you change the periods and the sources, they seem to blend into the same outcome. I.e. write your integ tests to cover the UX. Run them continuously. Associate results with underlying inputs. Suddenly they're very much the same thing.

        • cavisne 1927 days ago
          I suspect you are confusing the terms.

          What Google/Netflix call a "Canary" other companies call "deploying a single host/percentage of production traffic with a new version". When other companies talk about canaries they mean regular tests against production to detect issues.

          • donavanm 1927 days ago
            You're correct. I didn't catch the distinction when I first skimmed the article and parent comment. Part of it is that my active “canary” tests themselves emit relevant TSD indicative of system performance.

            The general concept outlined I'd lump into “approval workflows.” The gradual, intentional deployment of mixed versions to the same workload I'd call something like A/B or red/blue version deployments.

            • joshuamorton 1927 days ago
              At least from what I've seen, red/green (I've heard blue/green) or A/B deployments represent a different thing. A blue/green deployment says you have 2 environments, each able to handle all of your traffic. So you have 2x the servers you need running, and move traffic between environments to upgrade. It's double buffering, but with binary versions.

              The (traffic) canarying process that Google and Netflix use, and that is described in this article, is distinct from that, since you don't need a significant amount of overhead.
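              To make the distinction concrete, here's a rough routing sketch (purely illustrative; the environment names and the 5% weight are made up, not from the article):

              ```python
              import random

              def blue_green_route(active: str) -> str:
                  """Blue/green: every request goes to whichever full-size
                  environment ("blue" or "green") is currently active; cutover
                  is an atomic flip of `active`."""
                  return active

              def canary_route(canary_weight: float = 0.05) -> str:
                  """Traffic canary: a small, adjustable fraction of requests
                  hits the new version; the rest stays on the known-good baseline."""
                  return "canary" if random.random() < canary_weight else "baseline"

              # Blue/green needs two environments each sized for 100% of traffic;
              # the canary only needs capacity for its small slice (e.g. 5%),
              # which is why the overhead stays low.
              ```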

              • donavanm 1927 days ago
                Huh. Nomenclature. FWIW I've also never heard of A/B being limited to binary or requiring full N sets of resources. I've only seen it as small subsets of traffic that are ramped up to some confidence interval. Similarly, two concurrent variants is the simplest and minimal case. But I've also seen literally thousands of concurrent variants with enough workload & consumers. Agree on overhead: as it's essentially a version management + stable routing problem, you don't/shouldn't increase resource requirements.

        • not_kurt_godel 1927 days ago
          Yup, the key point is they let you treat complex systems as a black box that either does or does not successfully support the end-user functionality/experience. Ideally the components of the black box are sufficiently self-regulating that the end-to-end functionality is never interrupted, but complex systems by their nature exhibit emergent behavior that may not register as anomalous at a per-component level.

          Totally agree that the nuances regarding levels of coverage between various end-to-end testing mechanisms are an interesting subject. Generally speaking, I think it's preferable for canaries to essentially be continuously-run integ tests, but this is not universally applicable. Deep integ tests may be too resource-intensive to run continuously. Integ tests also aren't always ideal for monitoring behavior of long-lived resources, as they typically start from a blank slate, create some new resources, test their functionality, and then clean them up. Canaries may be geared towards validating state that persists indefinitely by design.

      • not_kurt_godel 1927 days ago
        > 2. I'm not sure I understand here. What version of your application is your canary running if you're not using it to validate a new version? The same as the baseline? But then, what are you using it for?

        We run canaries continuously against all stages of our pipelines, including prod. Once a change is deployed to a stage (initially triggered by commit to git repo), there is a time window where canary failures will trigger approval workflow failure and rollback (note that we additionally have separate integration tests that also run as part of this approval workflow). Canary failures at other times trigger steady state alarms. Thus, our canaries serve both to validate new versions and continuously monitor the existing versions.
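        Roughly, the routing of a canary failure looks something like the sketch below (the bake-window length and names are illustrative, not our actual tooling):

        ```python
        from datetime import datetime, timedelta

        APPROVAL_WINDOW = timedelta(hours=2)  # assumed bake time after a deployment

        def handle_canary_failure(failure_time: datetime, last_deploy_time: datetime) -> str:
            """Decide what a canary failure means based on when it happened."""
            if failure_time - last_deploy_time <= APPROVAL_WINDOW:
                # Failure during the bake window: treat the new version as bad,
                # fail the approval workflow, and roll back.
                return "fail-approval-and-rollback"
            # Failure outside any deployment window: the running version (or a
            # dependency) regressed, so raise a steady-state alarm instead.
            return "steady-state-alarm"
        ```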

        • joshuamorton 1927 days ago
          > Canary failures at other times trigger steady state alarms. Thus, our canaries serve both to validate new versions and continuously monitor the existing versions.

          This is bad.

          Canaries are not alerts. Alerting should be separate from canarying. If you're getting steady-state alerts from your canaries, your steady-state alerting is bad.

          A system that isn't undergoing change should be able to entirely disable its canary, and you should still remain confident that any changes in traffic or whatnot will be handled by other alerting tools. If not, then your canaries aren't correctly balanced, and you're getting canary failures due to traffic imbalances or something else that shouldn't be affecting a good canary. In other words, if your canary is alerting in steady state, your experimental setup is invalid and you don't have a good test/control pair. If you did, the only steady-state alerts you'd get would be noise.

          You can't have the same experimental setup control for production differences to isolate changes due to binary version bumps, while also detecting changes in production traffic independent of binary version bumps. At least one of those is broken.

          • not_kurt_godel 1927 days ago
            > This is bad.

            I would agree that it is not appropriate for all applications/systems, and in fact I have pushed against initiatives to run canaries where they're not necessary.

            > A system that isn't undergoing change

            This is an assumption which may not always be valid for sufficiently complex systems/services. Systems may be distributed amongst multiple independent microservices that are versioned and deployed independently, with the interdependencies between microservices being very difficult to capture and validate at a per-microservice level.

            For example, imagine microservice A receives events from microservice B which it then transforms and passes to microservice C. A deployment is made to service A which intentionally ignores/filters certain events from B, but contains a bug which causes A to filter additional events that it should not. C's low traffic alarms begin firing due to a drop in received events but from A's perspective, all is well (assuming the bug is overlooked both by its unit tests and integration tests). Likewise, B sees no issues since it is still able to successfully hand off events to A as normal. Thus, C experiences a steady state failure without having undergone any changes itself. Without an end-to-end canary, the system is stuck in a broken state until the oncall is able to figure out that the alerts generated by C have been triggered by a change in A (requiring them to subsequently manually roll A back). Alternatively, with an end-to-end canary, A's deployment is automatically rolled back and the oncall is alerted to a pipeline blockage which can be investigated while the system as a whole continues to function properly as normal.
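            As a purely illustrative sketch of the kind of end-to-end canary I mean (the event shapes and the publish/read helpers are hypothetical):

            ```python
            import time

            def run_canary_probe(publish_to_b, read_from_c) -> bool:
                """Publish synthetic events at B and verify that the ones A is
                *supposed* to pass through actually reach C."""
                should_pass = {"type": "order", "id": "canary-1"}      # A must forward this
                should_drop = {"type": "heartbeat", "id": "canary-2"}  # A intentionally filters this
                publish_to_b(should_pass)
                publish_to_b(should_drop)
                time.sleep(5)  # arbitrary wait for the pipeline to propagate
                received = {e["id"] for e in read_from_c()}
                return "canary-1" in received and "canary-2" not in received

            # A's buggy deployment that over-filters drops "canary-1" too, so this
            # probe fails end to end even though A, B, and C each look healthy in
            # isolation, and the pipeline can roll A back automatically.
            ```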

            Of course, it's certainly a possibility that such a scenario is indicative of poor architecture decisions, whether due to initial oversight or organic growth in complexity over time, but the reality is that such systems commonly exist and that it's very difficult to completely avoid such undesirable emergent behavior on sufficiently large timescales.

            Even if your service is 'perfectly' well-architected on its own, it may have external dependencies out of its control and SLAs that require reporting when overall functionality is impacted by issues with those dependencies. A steady state failure may not be directly actionable but still require reporting to customers regardless of the underlying cause, and canaries are an excellent way to isolate and quantify such impacts independently of variable customer usage.

            As you state, ideally your experimental setup would completely isolate variables to the extent that alerting can be completely self-contained/decoupled without external input & measurement, and one should strive to meet this goal regardless of whether canaries are implemented or not. However, at the end of the day, your system needs to be working end-to-end at all times and canaries can be a powerful and elegant mechanism to validate this without creating a gordian knot of granular entwined inter-microservice operational dependencies. Continuous canaries should not be your first line of defense against bad deployments, but they can serve well as complementary guardrails in many circumstances.

            • joshuamorton 1927 days ago
              So, to preface this, so we're on the same page, a canary, in the context I'm speaking, is a specific type of comparison between a known good version X, and an unknown version X'. A canary is made up of a test and control pair, which should be as similar as possible (a common pattern is you sit both behind the same load balancer). This is done to isolate outside impact. We'll get to that more in a minute.

              >I would agree that it is not appropriate for all applications/systems, and in fact I have pushed against initiatives to run canaries where they're not necessary.

              You misunderstand me then. I think that canarying is almost always a net positive. I am saying that canarying is not a replacement for steady-state alerting. Again, a canary should not be capable of detecting steady-state alert issues. If it did, it would by definition not be a controlled experiment, which is your aim.

              >This is an assumption which may not always be valid for sufficiently complex systems/services.

              If you are not currently comparing X and X' on your canary (i.e. both test and control are running X), then any attempt to detect differences between them should yield only statistical noise. If you get any result that isn't statistical noise, your canary is flawed.
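              A toy illustration of that comparison (the two-proportion z-test and the threshold are an arbitrary stand-in, not what any real canary-analysis tool necessarily does):

              ```python
              from math import sqrt

              def canary_looks_ok(control_errors: int, control_requests: int,
                                  canary_errors: int, canary_requests: int,
                                  z_threshold: float = 3.0) -> bool:
                  """True if the canary's error rate is statistically
                  indistinguishable from the control's, i.e. any difference
                  looks like noise."""
                  p_control = control_errors / control_requests
                  p_canary = canary_errors / canary_requests
                  pooled = (control_errors + canary_errors) / (control_requests + canary_requests)
                  se = sqrt(pooled * (1 - pooled) * (1 / control_requests + 1 / canary_requests))
                  if se == 0:
                      return p_control == p_canary
                  return abs(p_canary - p_control) / se < z_threshold
              ```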

              >Systems may be distributed amongst multiple independent microservices that are versioned and deployed independently, with the interdependencies between microservices being very difficult to capture and validate at a per-microservice level.

              This is true. There are a few strategies to deal with this: one is to not worry about it, and another is to fully isolate a test stack and a control stack. I'll discuss how neither of them works the way you describe in the rest of your post.

              So let's look at your microservice example. Imagine A, B, and C, with a dataflow of B -> A -> C. Above I described two possible layouts: one where you have a fully isolated canary stack, and one where you don't. Let's look at the first one first. (An aside, though: I really hope you mean "a feature flag is updated on A'"; your features really shouldn't be tied to deployments.)

              B -> A -> C exists, as does B' -> A' -> C'. Note that B, B' and C, C' may be identical here. A request goes to either B or B', and then entirely through the test or control stack. Your canarying tooling detects a regression between the corresponding pairs. Note that you don't need to "constantly" canary here: when you change A', you can check the differences between B and B', A and A', and C and C'. You need only monitor those in response to a change. The downside to this is that you've artificially cut your traffic, so it's unclear if A and A' will really be getting equivalent traffic now. If you push a change to B and to A at the same time, you can't isolate it to either specific component. There are also a number of other issues that come with this plan that just make it difficult.

              The other (and much, much more common) option is to not fully isolate the entire canary stack. Your layout then looks something like [B, B'] -> [A, A'] -> [C, C']. At each arrow, a request may be routed to either S or S', and importantly you don't know which (and it shouldn't matter!). Now suppose A' has the bug: traffic drops to C and C' both! In expectation, C and C' will each receive half the traffic A produces and half the traffic A' produces. So if A' is producing half as much traffic as before, both C and C' will notice QPS drop to 75%. But since all a canary does is compare the traffic to C with the traffic to C', a continual canary will notice no change. You still need alerting that checks globally that traffic to [C, C'] hasn't dropped.
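              To spell that arithmetic out with made-up numbers (assuming a clean 50/50 split at each hop):

              ```python
              healthy_per_instance = 500              # events/sec A and A' each emit when healthy
              a_out = healthy_per_instance            # A is fine
              a_prime_out = healthy_per_instance / 2  # A' has the bug and emits half as much

              # C and C' each receive half of A's output plus half of A''s output.
              c_in = 0.5 * a_out + 0.5 * a_prime_out        # 375
              c_prime_in = 0.5 * a_out + 0.5 * a_prime_out  # 375

              # Both dropped to 75% of the healthy 500/sec, so a C-vs-C' comparison
              # sees no difference at all; only global alerting on total [C, C']
              # traffic notices the drop.
              assert c_in == c_prime_in == 375
              ```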

              So, if you have a proper un-isolated continual canary, it won't notice any change. If you do have an isolated continual canary, you're able to run the entire canary evaluation across the stack in response to a change anywhere in the stack, so you still don't need to do it continually.

              >Even if your service is 'perfectly' well-architected on its own, it may have external dependencies out of its control and SLAs that require reporting when overall functionality is impacted by issues with those dependencies. A steady state failure may not be directly actionable but still require reporting to customers regardless of the underlying cause, and canaries are an excellent way to isolate and quantify such impacts independently of variable customer usage.

              No! These are all things that should be handled by non-canary alerting. Again, a well balanced canary will not detect traffic in an upstream service dropping to 0, because it will drop to zero in both the test and control environments, and comparisons between 0 and 0 are almost certainly within whatever acceptable bounds you have set.

              • not_kurt_godel 1927 days ago
                > So, to preface this, so we're on the same page, a canary, in the context I'm speaking, is a specific type of comparison between a known good version X, and an unknown version X'.

                Ah, that's the issue. You are using what I consider an 'alternative' definition of "canary". The concepts you are referring to are what I would refer to, broadly, as applying to "pre prod" (a term which I think is somewhat misleading and inappropriate itself, but that's what people call it in my circles, so meh). API Gateway, and I presume other entities, uses "canary" the same way, so you are definitely not alone: https://docs.aws.amazon.com/apigateway/latest/developerguide.... I have nits on a few of your points but overall what you're saying is sensible. Glad I understand now! Cheers.

  • joatmon-snoo 1927 days ago
    For clarification, "canary" in this article refers to rolling out a new release to some small subset of production traffic, a la canary release (eg https://martinfowler.com/bliki/CanaryRelease.html).

    At least a few other people in the comments are saying that in their experience, "canaries" are black-box monitoring programs that simulate critical user journeys. This is not what this article is discussing.

  • btmiller 1928 days ago
    > Spinnaker is an open-source, continuous delivery system built by Netflix and Google

    Politely correct me if I'm wrong...isn't Spinnaker originally a Netflix system, having nothing to do with Google? Unless perhaps the author is alluding to open source contributions by Google after the tool went open source?

    • svachalek 1928 days ago
      It is originally a Netflix creation but Google has been a contributor since it went open source. In particular, the canary analysis features were co-developed with Google. (I was a contributor on the Netflix side.)

      • techcofounder 1928 days ago
        That is correct. Google played a big role in open-sourcing Spinnaker alongside Netflix back in Nov 2015.

    • jedberg 1928 days ago
      It's just a question of where you draw the line. Google made a lot of contributions after it went open source, particularly with the canary analysis.

      But I mean, I evaluated the initial design docs, so did I build it too? :)

      • ajordens 1927 days ago
        Not that it really matters but Google had been an active participant in the project for at least a year prior to the open sourcing in 2015. Meh.

    • mrtrustor 1928 days ago
      Author here. As mentioned in the other comments, Google is a very active contributor to Spinnaker. Today, Netflix and Google are the 2 companies contributing the most to the project, hence this sentence.