Cadence: Uber's Workflow Orchestration Engine

(github.com)

243 points | by vruiz 1830 days ago

24 comments

  • jbbarth 1830 days ago
    Cadence looks like an OSS version of Amazon Simple Workflow (SWF) service. The author used to work on SWF at AWS afaik.

    I'm a heavy SWF user at work for managing complex data pipelines. SWF requires a significant conceptual and tooling investment at the beginning, but it pays off if you use it a lot.

    As for other comments mentioning Airflow: the programming model is quite different, since Airflow, as far as I understand, forces you to provide a DAG of tasks upfront. SWF (and Cadence?) doesn't; it coordinates the work of Deciders and Activity Workers and only acts as a source of truth for the state of the workflow (plus distributing each task to exactly one of many long-polling workers). As a result you don't declare anything upfront, and deciders can take dynamic decisions along the way, which is really nice when you want very dynamic logic in your workflows (e.g. dynamic partitioning of tasks, decisions depending on external factors, etc.).

    I'd love to have Maxim's insights on how Cadence compares to SWF, and on the reasons/challenges behind migrating from SWF to Cadence for SWF users (other than SWF being basically stale for 4+ years and riddled with arbitrary limits).

    • mfateev 1830 days ago
      Cadence vs SWF

      Cadence was conceived and is still led by the original tech leads of SWF.

      SWF has had no new features added in the last 5 years. Cadence is open source and under active development.

      Cadence was initially based on the SWF public API. It uses Thrift and TChannel for communication, while SWF uses the AWS version of REST. Currently the API is not compatible with SWF, as Cadence has added a large number of new features and deprecated a few problematic ones. We are planning to migrate to gRPC later this year.

      Cadence can potentially run on any database that supports single-shard multi-row transactions as a backend. Currently it supports Cassandra and MySQL.

      SWF has pretty tight throttling limits. Cadence scales very well: it has production use cases that require hundreds of millions of open workflows and thousands of events per second.

      SWF has pretty tight limits on individual payloads and the number of events. For example, the maximum activity input size is 32KB, while Cadence's current limit is 256KB. The SWF history size limit is 10k events, while the Cadence limit is 200k. All other limits are also higher.

      Cadence has no limit on the activity and workflow execution duration.

      Cadence, through archival, supports unlimited retention after a workflow closes.

      SWF has Java and Ruby client libraries. Cadence has Java and Go client libraries.

      The SWF Java library is fully asynchronous and relies on both code generation (through an annotation processor) and AspectJ. It is hard to set up, doesn't play well with IDEs and has a very steep learning curve. The Cadence Java library (as well as the Go one) allows writing workflows as synchronous programs, which greatly simplifies the programming model. It is also just a library, without any need for code generation, AspectJ or similar intrusive technologies.
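
      To give a flavor of the synchronous style, here is a minimal sketch of a Cadence Go workflow (the DownloadActivity and TranscodeActivity names are invented for illustration):

          import (
              "time"

              "go.uber.org/cadence/workflow"
          )

          // The workflow reads as plain sequential code; the library records
          // each step in the workflow history so execution can survive process
          // failures and be resumed on another worker.
          func MediaWorkflow(ctx workflow.Context, fileURL string) error {
              ao := workflow.ActivityOptions{
                  ScheduleToStartTimeout: time.Minute,
                  StartToCloseTimeout:    time.Hour,
              }
              ctx = workflow.WithActivityOptions(ctx, ao)

              var localPath string
              if err := workflow.ExecuteActivity(ctx, DownloadActivity, fileURL).Get(ctx, &localPath); err != nil {
                  return err
              }
              return workflow.ExecuteActivity(ctx, TranscodeActivity, localPath).Get(ctx, nil)
          }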

      Cadence client-side libraries have much better unit testing support. For example, the Java library utilizes an in-memory implementation of the Cadence service.

      Cadence features that SWF doesn't have:

      Workflow stickiness. SWF replays the whole workflow history on every decision, which means that a workflow's resource usage is O(n^2) in the number of events in the history. Cadence caches workflows on a worker and delivers only new events to them; the whole history is replayed only when a worker goes down or the workflow falls out of the cache. So Cadence workflow resource usage is O(n) in the number of events in the history. For large workflows this makes a huge difference, and it also allows a higher per-workflow scale. For example, it is not recommended to have workflows that execute over a hundred activities in SWF, while Cadence routinely executes workflows that have over a thousand activities or child workflows.

      Query workflow execution. It allows synchronously getting any information out of a workflow. An example of a built-in query is the stack trace of a running workflow.
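
      In the Go client a workflow can also expose custom queries by registering a handler. A rough sketch (the "status" query name is made up):

          // Inside workflow code: expose a value to synchronous
          // QueryWorkflow calls from the CLI or any client.
          status := "started"
          if err := workflow.SetQueryHandler(ctx, "status", func() (string, error) {
              return status, nil
          }); err != nil {
              return err
          }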

      Cross-region (in AWS terminology) replication. SWF in each region is fully independent, and if the regional SWF is down, all workflows in that region are stuck. Cadence supports asynchronous replication across regions, so even in the event of a complete loss of a region the workflows continue execution without interruption.

      Server-side retry is the ability to retry an activity or a workflow according to an exponential retry policy without growing the history size.
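
      In the Go client the policy is part of the activity options; a sketch in the style of the cadence-samples (the timeout values are made up):

          ao := workflow.ActivityOptions{
              ScheduleToStartTimeout: time.Minute,
              StartToCloseTimeout:    time.Minute * 10,
              RetryPolicy: &cadence.RetryPolicy{ // import "go.uber.org/cadence"
                  InitialInterval:    time.Second,
                  BackoffCoefficient: 2.0,
                  MaximumInterval:    time.Minute,
                  ExpirationInterval: time.Hour, // stop retrying after an hour
              },
          }
          // Retries are driven by the service without adding events to the history.
          ctx = workflow.WithActivityOptions(ctx, ao)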

      Reset is the ability to restart a workflow from any point of its execution by creating a new run and copying part of the history. For example, reset is used to automatically roll back workflows to the point before a bad deployment, once that deployment has been rolled back.

      Cron is the ability to schedule periodic workflow executions by passing a cron string to the start method.
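
      With the Go client it looks roughly like this, assuming cadenceClient is a connected client.Client (the workflow ID, task list and ReportWorkflow are hypothetical):

          wo := client.StartWorkflowOptions{ // import "go.uber.org/cadence/client"
              ID:                           "nightly-report",
              TaskList:                     "report-tasklist",
              ExecutionStartToCloseTimeout: time.Hour,
              CronSchedule:                 "0 2 * * *", // run daily at 02:00
          }
          // ctx here is a standard context.Context (this is client-side code).
          _, err := cadenceClient.StartWorkflow(ctx, wo, ReportWorkflow)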

      Local activity is a short activity that is executed in the context of a decision. It uses 6x fewer DB operations than a normal activity execution.
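
      In the Go client (ValidateActivity and input are hypothetical):

          lao := workflow.LocalActivityOptions{
              ScheduleToCloseTimeout: time.Second * 5,
          }
          ctx = workflow.WithLocalActivityOptions(ctx, lao)
          // Runs inside the worker as part of the decision task, avoiding the
          // extra server round trips of a regular activity execution.
          if err := workflow.ExecuteLocalActivity(ctx, ValidateActivity, input).Get(ctx, nil); err != nil {
              return err
          }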

      Long poll on history allows efficiently watching for new history events and is also used for efficiently waiting for a workflow completion.

      Cadence uses Elasticsearch for visibility. Soon it is going to support complex searches across multiple customer-defined columns, which is far superior to the tag-based search SWF supports.

      If a decider constantly fails during a decision, SWF records a few events on every failure, eventually growing the history beyond the limit and terminating the workflow. Cadence supports a transient decision feature that doesn't grow the history on such failures, which allows workflows to continue without a problem once a fix to the workflow code is deployed.

      Cadence provides a command line interface.

      Cadence Web is open sourced and is much nicer than the SWF console.

      Cadence supports local development through unit testing, as well as through a local Docker container that contains the full implementation of the Cadence service and the UI.

      Cadence doesn’t yet have activity and workflow type registration. The advantage is that changes to activity or workflow scheduling options do not require version bumps that affect clients.

      • jbbarth 1830 days ago
        Wow thanks for the completeness of the answer.

        I found myself nodding along to all the conceptual limits and features you described. I'm sold, gonna try Cadence asap :)

        The tooling around SWF (web console and the ability to get insights about tasks, failures, etc.) is definitely a big one from an operational perspective. The SWF console is indeed absolutely terrible (with basic bugs not fixed for years, like broken pagination), so we ended up developing our own here at Botify, along with a Python-based client lib that mimics most of the RubyFlow principles. I'm curious whether all this can be integrated with Cadence, will have a look. I can keep you informed if you feel it's valuable for the Cadence project.

        • mfateev 1830 days ago
          Besides the UI, Cadence provides a CLI that supports most of the API features.

          The core API is almost the same, so porting an existing python client should not be a very large task.

      • throwaway6497 1829 days ago
        Is there a design/architecture doc for Cadence? I want to learn about the design goals, non-goals, alternatives considered, and trade-offs made in building a system like Cadence.
      • wikyd 1829 days ago
        Thanks, that was really interesting! Why switch from Thrift / TChannel to gRPC?
        • mfateev 1829 days ago
          TChannel has very limited language support and is essentially deprecated. gRPC is supported by the majority of mainstream languages and is under active development.
      • erpellan 1829 days ago
        Everything here looks awesome! Except gRPC... I thought gRPC was great until I had to use it in anger. JSON or CBOR for me!

        https://reasonablypolymorphic.com/blog/protos-are-wrong/

        • mfateev 1829 days ago
          gRPC is not exposed to Cadence users directly. They program against the client-side library, which completely hides the communication mechanism. And you are free to choose any object serialization mechanism; currently JSON is the default wire encoding.
  • redact207 1830 days ago
    I've been working on a similar workflow engine for node at https://www.npmjs.com/package/@node-ts/bus-workflow

    The main objective of workflows is to manage long-running processes. By processes I mean business processes, like coordinating the activities of fulfilling a customer order (settling charges, picking inventory, packing, dispatching, emailing receipts, etc). It's a way to keep all those individual commands decoupled but coordinate them at a higher level.

    This isn't a new concept by any means, and it is often paired with Domain Driven Design and message-based systems. Doing so gives you a library of events, emitted every time something happens in your system, that can be reacted to in a workflow.

    If you've ever dealt with microservices, or even a monolith where two internal services are incorrectly coupled together, then this approach may be worth looking into.

    • wetpaste 1830 days ago
      Thank you. I feel kind of silly about this but I feel like I've had a hard time understanding when an org should, or could use something like this. I have seen them mentioned but every time it's explained it's explained with more abstract language on top of it that confuses me. I keep hearing "it manages business processes" but then it fails to mention if this means like, a human being's process within an org, or something coupled with an application of some sort that has business processes in the application? Does this type of thing replace sort of what Jira does, make a ticket and then pass it off to the next team or whatever? Do you ship it with the app for on-premise deployments of a software product? I have a hard time seeing the big picture with things like this sometimes. Then I hear workflow orchestrator and I think, oh okay so like ansible, but for, work...flows? But what is a workflow really exactly?
      • mfateev 1830 days ago
        My personal opinion is that a workflow is any business logic that goes beyond a single request-reply. Examples of workflows:

        * Service deployment

        * Uber trip

        * Media processing (download file, transcode, upload result)

        * Order processing

        * Customer support ticket processing

        * Customer incentive program management

        * Data pipeline processing

        * ML training

        * Distributed cron

        * Customer signup flow

        and many others.

      • raxxorrax 1829 days ago
        This could also be used to kill off systems like SharePoint in many businesses and that would be great.

      Seriously, its workflow engine has race conditions, randomly fails and has no transaction management. But there are few alternatives, and I don't know why there hasn't been any real contender. You would need a full suite to challenge it though.

        • Angostura 1829 days ago
          Speaking as someone who has just implemented a complex multi-step business process workflow in Sharepoint 2016 - I concur.
      • redact207 1830 days ago
        I can relate, it didn't make sense to me either in the beginning but eventually the penny dropped after working with it in a past company.

        Rather than give a half cooked explanation here and risk more confusion, I'll update the readme with some examples and ping you if you're interested.

        • nij4uyr 1829 days ago
          redact207, could you please tell me if I understood it correctly?

          Say, I have two services deployed individually working in their own domains.

          UserService and EmailService

          When the task is a simple user signup & welcome email

          1. Workflow requests UserService.signupUser

          2. UserService.signupUser creates User, then dispatches UserCreated event.

          From my understanding, this is where it differs between having a Workflow and not having one.

          3. Workflow receives UserCreated event, then requests EmailService.sendEmail

          If I did not have a Workflow in my design, then EmailService would be listening for the UserCreated event from UserService directly.

          It sounds almost the same as having an orchestration service called UserSignupService that does the same thing the Workflow does.

          Is my understanding correct? Thanks!

          • redact207 1829 days ago
            Yep! The one thing I would change, though, is that it's common for workflows to be started by an event. So in your example, the first step would be UserService.signupUser, which emits a sign-up event, which starts a workflow that sends the email.

            Without the workflow/orchestration, we're effectively coupling the EmailService to the UserService, and it's that type of coupling that reduces reusability and isolation.

            • nij4uyr 1829 days ago
              Thank you very much for your reply! I wonder if the flow can start from the Workflow rather than from UserService.

              E.g., 1. Browser sends a UserSignup request (or the API gateway receives it and, instead of calling UserService, emits an event)

              2. Workflow receives the event, then calls the UserService.signupUser activity

              3. UserService creates a user, then returns the call back to the Workflow

              4. Workflow resolves the call (signupUser), then calls EmailService.sendEmail

              5. EmailService.sendEmail sends an email, then calls back to the Workflow

              6. Workflow resolves the call (sendEmail), and the flow is completed

              The difference is that every workflow would be defined inside the Workflow, and the services wouldn't serve requests directly, which I believe gives a complete view of the flow.

              However, there must be something I'm missing here, since what I'm describing seems like an anti-pattern.

              • redact207 1829 days ago
                It's certainly acceptable to start workflows explicitly! However it wouldn't be a good fit for the user signup process.

                In the above example, the user just wants to sign up. They don't care about receiving a welcome email or being subscribed to a mailing list or anything else; they just want to register an account. That's a pretty good use case for just `POST /signup`, which hits the user service and spits out an event that the user has signed up.

                Starting a new workflow when that event is published makes sense in this case.

                An example of a workflow being started explicitly could be something like running a fire system test. You could:

                1. shoot off a command that starts the FireSystemTestWorkflow

                2. the workflow sends a bunch of commands to test sensors and sirens

                3. those things publish events that they're functioning correctly

                4. the workflow waits for all of these events to come back

                5. the workflow publishes a FireSystemTestedSuccessfully event

                The nice side of this is that the workflow can respond if a sensor or siren fails and perform what's called a "compensating action", i.e. compensate for the deviation from the successful path by taking a corrective action like sending a command to start the device or notifying a technician.

                • nij4uyr 1829 days ago
                  Wow. Thanks for the extra explanation about when the different approaches are beneficial.

                  I love you. Will definitely give bus-workflow a thorough shot.

                  And one very last question, if you could spare a bit more time...

                  Again, on the user signup & event flow:

                  When the Workflow calls EmailService.sendEmail (given that the communication is via RPC and EmailService.sendEmail is an async operation that resolves if the email was sent successfully), should the Workflow wait for the sendEmail operation to resolve and then complete the flow? Or should EmailService dispatch an EmailSent event so that the Workflow can complete the flow?

                  This is a bit off topic, but I've been sticking with RPC-style calls rather than events and still don't know what the best practice is.

                  Many Thanks

  • fokinsean 1830 days ago
    Sounds cool, but even after poking through the repos I still don't fully understand what it does.
  • bozoUser 1830 days ago
    Looking at a few comments, can someone explain the nuances between managing data workflows vs service workflows?
  • formalsystem 1830 days ago
    Does anyone know what the best OSS orchestration engines are? I'm wondering what I should be comparing this to.
    • manyxcxi 1830 days ago
      I don't know about best, as there are lots of trade-offs between the various projects. I've reviewed and worked with a lot of Java-based ones; top of mind:

      - Airflow

      - jBPM

      - Enhydra Shark

      - Activiti

      - Netflix Conductor

      Most of those are more workflow engine than pure orchestration.

      A few years back, we searched high and low and I was generally unhappy with the commercial offerings so we wound up building around Netflix Conductor (after a disastrous run with Joget, which is built on Enhydra Shark).

      Since then I’ve been pretty happy with Conductor, submitted numerous PRs and accidentally became one of the people keeping the MySQL backend implementation going forward.

      • mfateev 1830 days ago
        AFAIK Netflix Conductor was inspired by the AWS Simple Workflow engine. Cadence is a direct evolution of SWF in the open source world. I was the tech lead for both of them :).
      • zok3102 1829 days ago
        You may also want to look at Project Flogo (flogo.io) - in particular, the Flogo Flow action. It's a process engine in Golang that can be used to implement service orchestration and event-driven choreography integration patterns.

        Flogo has been focused more on stateless service choreography running in serverless FaaS or on edge devices - however, the underlying process engine can be used to implement a long-running orchestration process with fully externalized state for resilience and scale. It's 3-clause BSD licensed, including the modeling UI, with hundreds of triggers and activity extensions in the community. We have a fairly active gitter channel in case you have questions for the core team (https://gitter.im/project-flogo/Lobby)

        Disclaimer: I work at TIBCO & we provide commercial support for Project Flogo as well as use it in our commercial products.

      • opportune 1829 days ago
        I guess technically Oozie is also a FOSS Java-based workflow/orchestration engine, except it's mostly for Hadoop jobs. I haven't used too many orchestration engines, but I found Oozie very barebones.
  • iblaine 1830 days ago
    Workflow engine = can support hundreds of parallel workflows. These include Airflow, Luigi, Dagster and AppWorx; they are used to manage data, typically with processes that run in minutes to hours. Orchestration engine = can support millions of parallel workflows. These include Uber Cadence and Netflix Conductor; they are used to manage services, typically with processes that run in microseconds to minutes.
    • mfateev 1830 days ago
      Cadence does support processes that run for an unlimited time. We have workflows in production that are always running. For example, there are services at Uber that keep an always-open Cadence workflow per rider.
      • viswabharathi 1827 days ago
        Uber could've used Cadence in place of Piper/Airflow, couldn't it? Can you shed more light on where each of these fits?
      • iblaine 1830 days ago
        Very interesting. Can you give an example where you'd want to track a process that takes days or months to execute?
        • mfateev 1830 days ago
          For example, the Uber loyalty program needs to accumulate points (similar to airline points). A customer workflow receives trip completion events and updates the state accordingly. When a certain number of points is reached, some actions (mostly calls to downstream services) are executed.
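
          A rough sketch of such a workflow with the Go client (the signal name, payload type, activity and threshold are all invented for illustration):

              func LoyaltyWorkflow(ctx workflow.Context) error {
                  ao := workflow.ActivityOptions{
                      ScheduleToStartTimeout: time.Minute,
                      StartToCloseTimeout:    time.Minute,
                  }
                  ctx = workflow.WithActivityOptions(ctx, ao)

                  points := 0
                  tripCh := workflow.GetSignalChannel(ctx, "trip_completed")
                  for {
                      var trip TripEvent            // hypothetical payload type
                      tripCh.Receive(ctx, &trip)    // blocks until the next trip completion signal
                      points += trip.Points
                      if points >= 1000 {           // hypothetical reward threshold
                          if err := workflow.ExecuteActivity(ctx, GrantRewardActivity, points).Get(ctx, nil); err != nil {
                              return err
                          }
                          points = 0
                      }
                      // A real always-open workflow would periodically call
                      // continue-as-new to keep the history bounded.
                  }
              }
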
          • chrischen 1829 days ago
            For this use case, could it be solved with an asynchronous task system such as Celery? What advantages does Cadence offer for something like this, which seems to be repurposing Cadence's scheduling system to process instantaneous events?

            Also, how does it listen for events? By polling?

            • mfateev 1829 days ago
              It could be solved with Celery, but it would also require a database, and Celery doesn't scale that well unless it runs on top of Redis, which is not really fault tolerant. Also, invoking actions with guarantees and exponential retries is not trivial with Celery.

              Actually, at Uber a large number of services are being migrated from Python/Celery to Go/Cadence.

  • ajbosco 1830 days ago
    I thought they used Piper (based on Airflow) for workflows at Uber. https://eng.uber.com/managing-data-workflows-at-scale/
    • thor24 1830 days ago
      That is for data workflows (basically your ETL jobs). This is for services.
      • dgladkov 1829 days ago
        If you already have a service that requires ETL, Cadence might be a good choice as well. You will need more boilerplate and setup compared to Airflow/Piper, but you can share data structures between the ETLs and the services that use them, and you have more control over execution and deployment as you own your workers.
  • timbray 1829 days ago
    If you want to do your workflow in your own procedural code, SWF (and presumably Cadence) are good choices. If you want to use a dependency graph, Airflow is for you (but I hear operating it is kind of tricky). If you like a state-machine/flowchart kind of approach, AWS Step Functions.

    AWS customers these days seem to mostly like Step Functions, although SWF isn't going away, and lots of EC2 instances are running Airflow. Obviously, some people want a managed service and others want OSS that they can control & fine-tune. Nothing wrong with either choice.

    Most of the engineering cycles these days are going into Step Functions; keep an eye on that space.

    • drewda 1829 days ago
      Some people like both managed services _and_ OSS. It's nice to see how, say, GCP offers a managed Airflow service.

      (Edit: I don't mean this as a negative comment about AWS managed workflow services. Just pointing out some advantages to GCP's approach at present. I'm a happy user of both AWS and GCP services.)

    • timbray 1829 days ago
      (Oh, should disclose, I helped design & build Step Functions.)
      • mfateev 1829 days ago
        If only SWF were extended to run deciders on AWS Lambda. Without this, the main advantage of Step Functions is hosting. I personally would rather see an integrated system where Step Functions are a natural extension of SWF, not a completely separate system.

        This is the direction Cadence is going. We are planning to add support for integrating custom DSLs easily, while maintaining the core code-based libraries.

        BTW: if anyone is interested in running the Step Functions DSL on top of Cadence, contact the Cadence team. We could work together to get it implemented.

  • mleonard 1829 days ago
    Hi, I watched the Cadence talks and read through the golang code a while back, and I love what you're doing with Cadence. Really glad to see you're moving from Thrift/TChannel to protobuf/gRPC - that was a blocker before.

    If anyone could help me understand the following I'd appreciate it...

    I understand that the event history is cached at worker nodes, that the whole history of events is only delivered to workers if needed (i.e. if out of cache), and that normally Cadence manages to deliver events for the same workflow to the same worker.

    My question relates to what exactly happens within a single worker process.

    On each new event, does the worker process loop through all events in the history - starting from the beginning - in order to rebuild its internal state, then process the new event, and then shut down, ready to repeat this for the next event?

    Or... does Cadence keep the internal state around by keeping the goroutine alive, waiting on a channel midway through its workflow logic, waiting for the next event to continue execution?

    Thanks

    • mfateev 1829 days ago
      It is the latter. The workflow state object is cached, including the goroutines the workflow code is blocked on, and the new events are applied to the cached object.
  • onionking 1822 days ago
    Hello, I am very interested in an open source version of SWF. I am a heavy SWF user and I watched all the Cadence videos, especially the architecture one. I wonder what the recovery mechanism is for a single shard on one host? Let's say one host is down; how are the shards on that host recovered on the next host? I heard the presenter mention consistent hashing or Ringpop, so I am thinking all shards should be migrated to the next available host or hosts?
  • amelius 1829 days ago
    Cadence is also the name of an electronic design automation company.

    https://www.cadence.com/

  • jontro 1830 days ago
    How does this compare to Netflix Conductor? I've just started experimenting with it, and somehow I missed this during my evaluation.
    • mfateev 1829 days ago
      I'm from Cadence team, so I'm obviously biased :).

      Besides very different implementation backends, the main difference is that Conductor defines workflows through a JSON-based DSL while Cadence defines workflows as code. Because of that, it is possible to extend Cadence to interpret the Conductor DSL, but the reverse is not possible.

      I believe that any non-trivial workflows that have some state management requirements are more easily expressed as code. Any attempt to come up with a JSON or XML or YAML or whatever language for workflows will always be inferior to existing programming languages like Go, Python or Java.

      • sandGorgon 1829 days ago
        How does this work? Do you store the code for the workflow in the database?
        • mfateev 1829 days ago
          Cadence is a service. The workflow and activity code lives outside of it. Think about Cadence workflow and activity code the same way you think about a queue consumer, which is external to the queueing service.
          • sandGorgon 1829 days ago
            No - I'm wondering about the abstractions that allow for specific workflow specifications to happen in code versus a DSL/JSON.

            There are two extremes here - I can slap Celery in and run a bunch of custom code as workflows. On the other hand, I can use a workflow system with a built-in DSL that abstracts some of the underlying behaviour out.

            Cadence seems to fall in the middle - and I'm wondering how it works. Why doesn't it degenerate into the same mess that Celery + a bunch of custom Python code becomes?

            • mfateev 1829 days ago
              Cadence is above :).

              It allows you to:

              Integrate any DSL without modifying the core service. Internally at Uber there are at least half a dozen DSLs running on top of it.

              Write code that hides all the complexity that leads to the mess of queue + DB implementations. The beginning of this talk explains the idea: https://youtu.be/BJwFxqdSx4Y

              The gist of it is that you write just your business logic, without thinking about callbacks and storage.

              I recommend looking at the Cadence samples to get a taste of it. Join the Cadence Slack channel if you have any specific questions: https://join.slack.com/t/uber-cadence/shared_invite/enQtNDcz...

              • sandGorgon 1829 days ago
                Thanks for the detailed reply. Really appreciate it. We have an internal workflow engine that's nice, but it still has numerous quirks. We are trying to learn to do better.

                I'm wondering why you didn't create a DSL in the first place... if it's ending up with DSLs all over the place anyway. To choose an example - why not go the Kubernetes way, with YAML? Sure, it's verbose as hell... but there aren't multiple forms of YAML that could bitrot.

                Why do language primitives matter ?

                Let me also ask you another related question - suppose you wanted to build a GUI that builds workflows for your business teams... are you saying you'd generate language code? Or would you generate YAML/JSON and then interpret it using your worker code? Again, the same question: wouldn't it have been better to have a uniform declarative JSON?

                • mfateev 1829 days ago
                  A DSL is a DOMAIN-specific language, but it is a common mistake to call a GENERIC workflow definition a DSL.

                  My opinion is that DOMAIN-specific languages are awesome when they are used for a specific narrow domain. For example, an AWS CloudFormation template is a DSL for cloud deployment. If This Then That, also known as IFTTT, is another good example of a narrow workflow definition.

                  At the same time, a generic Turing-complete language in JSON/YAML/XML always starts simple but ends up as a complete mess. See https://mikehadlow.blogspot.com/2012/05/configuration-comple.... Any programming language is much better for writing complex programs. The problem is that most existing workflow/orchestration systems force developers to use unnatural programming patterns and libraries to make the code fault tolerant. Cadence is an attempt (pioneered by my team at AWS Simple Workflow and later picked up by Azure Durable Functions) to implement workflows as natural programs without much boilerplate in concepts or code.

                  Think about why nobody tries to write complex backend programs in JSON. The reason is that programming languages have well-defined ways of dealing with complexity. JSON-based languages are good for limited domains, but anything complex makes them unusable. I've seen it hundreds of times already: the DSL gets abused and developers hate it.

                  So most Cadence workflows are written directly as Go/Java code. But when a DSL is appropriate, it can be added trivially by interpreting it in the worker code.

                  >Again - the same question: wouldn't it have been better to have a uniform declarative JSON ?

                  Again, declarative JSON is declarative only in narrow domains. A generic workflow definition language in JSON works only for very simple scenarios and is harder to write and debug than Go/Java code.

    • jusonchan81 1829 days ago
      Conductor solves the orchestration problem well. It's one of the most heavily used components across Netflix. GitHub and other companies use Conductor too. The JSON DSL is simple to learn.
  • chrischen 1829 days ago
    Is this like Celery + RabbitMQ but with a GUI? On a high level it sounds like that, but could someone be so kind as to give an example use case for this? How is it different, such that it's not just described as a distributed task queue?
    • mfateev 1829 days ago
      The main difference is that a workflow has state and tasks can be very long running. A workflow can also react to external asynchronous events. Visibility into overall progress is also a very important feature; when using a queueing system, it is hard to answer questions about the current state of the business process.

      For example, implementing service deployment to a public cloud using Celery + RabbitMQ is very non-trivial and error prone. It is a pretty straightforward Cadence workflow.

  • cle 1829 days ago
    What does the workflow versioning story look like? This is one of the most frustrating parts of SWF that Step Functions + Lambda has effectively solved for us.
    • mfateev 1829 days ago
      Cadence doesn't require workflow and activity type registrations, which eliminates most of the problems with SWF versioning. Cadence supports versioning of the workflow code out of the box: any change has to be protected with a version condition. It works even with shared libraries and very long running workflows.
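
      In the Go client a change is guarded roughly like this ("email-change", the activities and addr are made-up names):

          // Histories recorded before the change replay the old branch;
          // new executions take the new one.
          v := workflow.GetVersion(ctx, "email-change", workflow.DefaultVersion, 1)
          if v == workflow.DefaultVersion {
              err = workflow.ExecuteActivity(ctx, SendEmailV1, addr).Get(ctx, nil)
          } else {
              err = workflow.ExecuteActivity(ctx, SendEmailV2, addr).Get(ctx, nil)
          }
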
  • viswabharathi 1827 days ago
    Can somebody from Uber clarify? Uber could've used Cadence instead of Piper/Airflow, couldn't it? Please correct me if I'm wrong.
    • mfateev 1827 days ago
      Yes, it is technically possible to use Cadence instead of Piper/Airflow. The reason Piper/Airflow is used is mostly historical: it has been around for quite a while, and Cadence is a relatively new project.
  • vxa_victor 1830 days ago
    How does it compare to a BPM solution like Camunda?
    • mfateev 1830 days ago
      Cadence allows writing workflows as code. Think of it as a virtual machine for OO code that makes the code fully fault tolerant to process failures. As it is code, it can be used to implement any business logic. Camunda is a BPMN engine that interprets BPMN workflow definitions. It is possible to implement a Cadence workflow that interprets BPMN without changing the core Cadence service.
      • _ph_ 1829 days ago
        And we are talking about a software company and a software package. Sounds like a strong overlap to me - or do you think you can call a new software library "microsoft"?
  • dlphn___xyz 1830 days ago
    how does this compare to Luigi or Airflow?
    • alfalfasprout 1830 days ago
      I would strongly suggest not using Airflow if your company doesn't already... and that's coming from someone working at the company that made it.

      1) It has no knowledge of data dependencies. This means task success is very primitive and you end up with a bunch of "polling" tasks waiting, e.g., for a table partition to land. 2) The UI constantly hangs and/or crashes. 3) Airflow "workers" using Celery are rarely given the right number of tasks; OOMing, etc. are all commonplace even when using Docker.

      And many, many more.

      I strongly suggest using Apache Beam or Argo with Kubernetes instead. They can scale quite a bit more and deal well with long-running tasks.

      • red_hare 1830 days ago
        I'm actually finding airflow + beam + kubernetes to be a really powerful combination. Especially if you're on GCP where airflow is managed by Cloud Composer, kubernetes is managed by GKE, and beam is managed by Dataflow.

        Our airflow cluster _does_ almost nothing. Our operators just tell kubernetes to run containers with commands, or dataflow to run some beam template, and wait for the results from afar.

        I love this setup but man do I agree with your points about the UI. For a product so powerful I can't believe how problematic it is.

    • mfateev 1830 days ago
      The biggest difference, besides the programming model, is scale. Luigi and Airflow target data pipelines that don't require much scale. Cadence is built to support business-level transactions: it can handle tens of thousands of events per second and hundreds of millions of open workflows. Obviously it is also a good fit for low-scale use cases. Cadence is also generic enough that it can be used to implement practically any workflow definition language. For example, it is possible to create an extension to run Airflow pipelines on Cadence.

      I'm from the Cadence team.

      • zimablue 1829 days ago
        As someone who uses (but doesn't love, actually) Airflow: I think this might be a situation where lots of people are tempted to think they're "at scale". I disagree slightly with your use of "business level" to describe this.

        I've seen quant hedge fund platforms (fairly intensive data engines compared to most businesses) using Airflow, because when it comes down to it, hundreds of DAGs is more likely than hundreds of thousands.

      • davis_m 1830 days ago
        > Obviously it is also a good fit for low scale use cases.

        Is this a given? Just because something is necessary at scale doesn't mean it is a good fit for low scale use cases. I would expect the opposite is actually true.

        • mfateev 1830 days ago
          We have both very low-scale use cases, such as a single distributed cron, and very high-scale use cases in production at Uber.
    • daveFNbuck 1830 days ago
      Cadence says it's meant for long-running tasks, while Luigi tasks are supposed to be short.
      • mfateev 1830 days ago
        Cadence supports both long-running as well as very short tasks.
        • daveFNbuck 1829 days ago
          I didn't think that Cadence would fail if the task finished too quickly. The point is that Cadence is designed to solve a slightly different problem than Luigi.
  • mshockwave 1830 days ago
    just a random comment: I thought Cadence, the EDA company, owns the trademark for the name
  • bhouston 1829 days ago
    How does this compare to Argo?
    • mfateev 1829 days ago
      Argo workflows are DAGs written in YAML. The types of workflows you can create using this syntax are very limited. Cadence gives you the full power of a programming language like Java or Go to implement workflow logic. It is possible to implement support for the Argo DSL on top of Cadence; the reverse is not possible.

      Cadence is also more scalable, supporting tens of thousands of events per second and hundreds of millions of open workflows.

  • hcnews 1830 days ago
    Does Cadence work for low-latency scenarios? E.g. web serving?
    • kbuckner 1830 days ago
      We use Cadence to run low-latency workflows for routing customer support tickets. It handles our use case very well. https://eng.uber.com/customer-obsession-ticket-routing-workf...
    • mfateev 1830 days ago
      It can be used for them. There are multiple customers that have workflows with both synchronous and asynchronous components. For example, a customer sign-up flow might require a long-running background check.
  • mleonard 1829 days ago
    The way I think about workflow engines is as follows. Please comment and correct me. Keen to discuss.

    Workflow engines like Cadence essentially work by letting you write regular-looking procedural logic for your workflow. This looks and feels very much like writing an async function with async/await in JavaScript or C#.

    The state in your workflow is then implicit in your code instead of explicit. Here's what I mean by that:

    Usually you would explicitly serialise your state between each incoming event and do an atomic compare-and-set operation on an external database to store the new state.

    For a new incoming event: (1) fetch the state; (2) unmarshal it into an object in your programming language (i.e. a Java class or golang struct); (3) given the current state, process the event and perform any external actions like sending an email - these external actions need to be OK with at-least-once semantics - and update the state object ready for the next incoming event; (4) serialise the state; (5) store the state in the database (atomically, with a compare-and-set operation); (6) repeat on each event. Do everything with at-least-once, repeat-on-failure semantics.
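
    In Go-flavoured pseudocode, that explicit-state loop would look something like this (the Store interface, Event and WorkflowState types are invented for illustration):

        // Hypothetical explicit-state handler: fetch, apply, compare-and-set.
        // Uses "encoding/json" for serialisation.
        type Store interface {
            Get(id string) (state []byte, version int64, err error)
            CompareAndSet(id string, state []byte, version int64) error // fails on version mismatch
        }

        func handleEvent(store Store, workflowID string, ev Event) error {
            raw, version, err := store.Get(workflowID) // (1) fetch state + version
            if err != nil {
                return err
            }
            var s WorkflowState
            if err := json.Unmarshal(raw, &s); err != nil { // (2) unmarshal into an object
                return err
            }
            if err := s.Apply(ev); err != nil { // (3) process the event, at-least-once side effects
                return err
            }
            out, err := json.Marshal(&s) // (4) serialise the new state
            if err != nil {
                return err
            }
            return store.CompareAndSet(workflowID, out, version) // (5) atomic compare-and-set write
        }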

    In a workflow engine like Cadence, what is persisted to the database is the entire history of events, instead of a single state object as described above.

    In Cadence the code you write looks very much like async-await style code in languages like javascript or C#. The workflow logic is in some sense an async function that pauses at await statements and picks up again where it paused when the next event comes in.

    Remember that cadence stores the entire history of events for a workflow. It does this so that it can rerun the workflow from the beginning, this time with the new incoming event on the end of the history.

    Notice that you need to be careful about your workflow being deterministic.

    Optimisations: (1) it knows when it is replaying already-seen events and doesn't redo external actions such as sending emails; (2) it tries to send events for the same workflow to the same worker node each time, and caches events at worker nodes; (3) at the macro level everything is highly available and repeats on failure with backoff, to ensure progress and at-least-once semantics; (4) it supports repeating workflows and child workflows; (5) monitoring, tracing, and other things you'd expect; (6) etc.

    Importantly: notice that there is still, in some sense, a single state object. The state is just implicit: it lives deep down in the internal state of the language runtime you wrote the workflow function in (Java, golang). Instead of serialising state such as 'time-since-last-email' into a state object in a database... you have 'time-since-last-email' as a local variable in the scope of the workflow function. Similarly, your programming language is tracking the call stack and current execution position of the function... normally you'd keep track of progress through the workflow in the state object and condition on this state when receiving a new incoming event.

    Thinking about state as explicit (state object approach) versus implicit (replay-history approach) helps me when thinking about cadence and similar workflow engines.

    ....................................................... Thanks for reading so far. I'd love to hear from users of cadence at uber or elsewhere:

    (1) why do you choose to write workflows with implicit state (by replaying history) instead of storing the state explicitly as a serialised state object in the database?

    My guess: developer productivity of writing and maintaining the workflows. Having a common approach and single observable system for many different workflows.

    (2) how do you reason about long-running workflows where the business logic needs to be updated? Would this not be much easier if the state object (say a serialised protobuf) was stored explicitly in the database?

    (3) wouldn't non-determinism be much easier as well if you stored the state explicitly?

    • mfateev 1829 days ago
      (1) It simplifies the programming model. There is no way to serialize the state of a call stack through a library in most programming languages. You mentioned that Cadence is similar to C# async/await; the SWF Flow library is, but Cadence workflow code is fully synchronous, not requiring callbacks unless they are needed by the business logic. Applying new events to the cached workflow is also more efficient for large states.

      (2) It depends. Nothing prevents a workflow writer from checkpointing the state explicitly (by calling continue-as-new); infinitely running workflows do it periodically. But having the event history is awesome for rollbacks. For example, in Cadence it is possible to roll back a bad change and automatically roll back the state of all your workflows to the last good state. In the database world, a change that corrupted the state is much harder to deal with.

      (3) Experience shows that the determinism requirement, while it requires some learning, is not that hard to deal with. And users like the superior programming model it enables.

  • whoevercares 1829 days ago
    Interesting! Is it possible to run this with a DynamoDB backend?