Neural Ordinary Differential Equations

(arxiv.org)

240 points | by asparagui 1958 days ago

5 comments

duvenaud 1958 days ago
Senior author here, I'm happy to answer any questions.
We just released source code: https://github.com/rtqichen/torchdiffeq . This includes PyTorch implementations of adaptive ODE solvers that can be differentiated through automatically. So you can mix and match these ODE solvers with any other differentiable model component.
There's already been a bit of follow-up work, turning Continuous Normalizing Flows into a practical generative density model: https://arxiv.org/abs/1810.01367
And now we're mainly working on 1) Regularizing ODE nets to be faster to solve and 2) getting the time-series model to scale up and extend it to stochastic differential equations.
[-]
- ssivark 1958 days ago
  Could you place the work in context, and provide a simplified explanation for someone who understands math and ML, but is not familiar with the literature on normalizing flows and autoencoders? Thanks!
  I tried reading, but the abstract and introductory section were a little too terse for me :-)
  [-]
  - duvenaud 1958 days ago
    Sure thing. A few years ago, everyone switched their deep nets to "residual nets". Instead of building deep models like this:
```
  h1 = f1(x)
  h2 = f2(h1)
  h3 = f3(h2)
  h4 = f3(h3)
  y  = f5(h4)
```
    They now build them like this:
```
  h1 = f1(x)  + x
  h2 = f2(h1) + h1
  h3 = f3(h2) + h2
  h4 = f4(h3) + h3
  y  = f5(h4) + h4
```
    Where f1, f2, etc are neural net layers. The idea is that it's easier to model a small change to an almost-correct answer than to output the whole improved answer at once.
    In the last couple of years a few different groups noticed that this looks like a primitive ODE solver (Euler's method) that solves the trajectory of a system by just taking small steps in the direction of the system dynamics and adding them up. They used this connection to propose things like better training methods.
    We just took this idea to its logical extreme: What if we _define_ a deep net as a continuously evolving system? So instead of updating the hidden units layer by layer, we define their derivative with respect to depth instead. We call this an ODE net.
    Now, we can use off-the-shelf adaptive ODE solvers to compute the final state of these dynamics, and call that the output of the neural network. This has drawbacks (it's slower to train) but lots of advantages too: We can loosen the numerical tolerance of the solver to make our nets faster at test time. We can also handle continuous-time models a lot more naturally. It turns out that there is also a simpler version of the change of variables formula (for density modeling) when you move to continuous time.
    [-]
    - im3w1l 1958 days ago
      So one question about that. In
      h1 = f1(x) + x h2 = f2(h1) + h1 h3 = f3(h2) + h2 h4 = f4(h3) + h3 y = f5(h4) + h4
      the functions are all different. But to see it as "a primitive ODE solver", then the functions should be the same?
      So if I understand correctly you have a different take on RNN's but not on deep residual nets in general?
      [-]
      - sampo 1958 days ago
        > then the functions should be the same?
        If we conceptually think that advancing from one neural net layer to the next one is the same as taking a time step with an ODE solver, then a bit more precise notation would be
        h1 = f(t=1,x) + x h2 = f(t=2,h1) + h1 h3 = f(t=3,h2) + h2 h4 = f(t=4,h3) + h3 y = f(t=5,h4) + h4
        Now you can say that the function f is always the same, but it still can give very different values for Δh when evaluated at different time points.
        [-]
        soVeryTired 1958 days ago
        I do think it's misleading to compare the method to a general feed-forward network though, for two reasons.
        First, to preserve the analogy between eq. 1 and 2, the thetas in equation two should have their own dynamics, which should be learned.
        Second, even if Equation 1 doesn't allow it, in a general feed-forward network it's possible for the state to change dimension between layers. I don't see how that could happen with the continuous model.
        Neat paper, but it'd be nice if they had tied the analogy more explicitly to RNNs in the introduction.
        [-]
        duvenaud 1958 days ago
        The comparison we make is to residual networks, which I think is valid. First, we do parameterize a theta that changes with time, using a hypernet. But this is equivalent to the way sampo wrote the equations above - you can just feed time as another input to the dynamics network to get dynamics that change with time.
        Second, I agree that general feedforward nets allow dimension changes, but resnets don't. This model is a drop-in replacement for resnets, but not for any feedforward net. If we gave the wrong impression somewhere, please let us know.
        We didn't make the analogy with RNNs, because I don't think it fits - standard input-output RNNs have to take in part of the data with every time step, while here the data only appears at the input (depth 0) layer of the network.
        [-]
        soVeryTired 1957 days ago
        You're absolutely right - sorry, somehow I managed to miss the explicit time parameter in your equation two, and didn't read carefully enough to see that you were restricting the discussion to resnets and normalising flows.
        You might be able to make a better connection to RNNs by having the input data as a 'forcing' function in your ODE. But you probably need some regularity conditions on the input data to make sure the result is nicely behaved.
        im3w1l 1957 days ago
        > standard input-output RNNs have to take in part of the data with every time step
        Well they can but I don't see they why they have to. And couldn't your network also take input and give output for all times?
        > First, we do parameterize a theta that changes with time, using a hypernet.
        Ah I see. Did you end up using it in the final model? I don't see that in the mnist example, but I could be missing it as I only skimmed the code.
        joe_the_user 1958 days ago
        Hasn't there been similar work to this in the past?
        I don't see a "related work" section in your paper.
        [-]
        duvenaud 1957 days ago
        Section 7 is titled "Related Work".
        [-]
        joe_the_user 1957 days ago
        Thanks,
        Apologies for not looking harder
        magicalhippo 1958 days ago
        Now it's been ages since I dabbled with neural nets so this might be completely silly, but can't a change in dimension be thought of as forcing the weights to/from certain nodes to be zero?
        [-]
        duvenaud 1957 days ago
        Ah, that would stop certain dimensions from changing, but the output would still be the same size.
        [-]
        eximius 1957 days ago
        While technically still the same size, I think he's proposing that it's, in a sense, isomorphic to a dimension change if the fix to zero is propogates throughout the remainder of the layers (until the next 'change' that is).
        Or something like that.
        [-]
        magicalhippo 1957 days ago
        I hadn't entirely fleshed out the idea, but yeah.
        Take a simple NN with 3 layers: 5 neurons in the input layer, 3 in the hidden and 1 output.
        Force inputs to neuron 4 and 5 in the hidden layer to be zero, and force inputs to neurons 2-5 to be zero in the output layer (and ignore their output). I'm assuming the transfer function obeys f(0) = 0, if not, fix output to zero as well.
        My thought was this would be similar to how you enforce boundary conditions when solving partial differential equations by directly setting the value of certain matrix elements before running the solver.
        Again, may be completely silly.
      - gundeep59 1958 days ago
        They model the above as dh(t)/dt, generalizing the discrete case (the equations you wrote) to a continuous case. Check the eq 2 in the paper. The statement following the equation makes it clear that "Starting from the input layer h(0), we can define the output layer h(T) to be the solution to this ODE initial value problem at some time T". Here, as per my understanding, h(0) can be the input itself. The function f mentioned in the eq 2 is the RNN cell.
      - spaced-out 1957 days ago
        This paper explains the reasoning begins that
        https://arxiv.org/abs/1512.03385
    - chombier 1957 days ago
      > We just took this idea to its logical extreme: What if we _define_ a deep net as a continuously evolving system?
      What about symmetries of the underlying continuous system?
      I'm under the impression that having deep nets as ODEs should make it possible to enforce a certain geometry on the information flow (like incompressible fluid, Hamiltonian, etc..) which would correspond to some invariant of the whole network.
      Does this idea make sense?
      [-]
      - thanatropism 1957 days ago
        My dissertation was about energy- and symplecticity preserving methods for Hamiltonian ODEs. Try to find the book by Blanes or the one by Leimkuhler.
    - snrji 1958 days ago
      Off-topic: could you tell us what degree did you study? What is your academic background?
      [-]
      - duvenaud 1958 days ago
        I did a CS undergrad at the University of Manitoba. Then took some time off to do a startup and was in the army reserves. Then when to UBC to do an MSc in CS + Stats. My Phd was officially in the made-up-sounding subject of "Information Engineering" at Cambridge, but really I just worked on Bayesian nonparametrics the whole time. I didn't start working on deep learning until my postdoc.
        [-]
        snrji 1957 days ago
        Thanks!
      - Donald 1958 days ago
        His background is available on his CV: http://www.cs.toronto.edu/~duvenaud/
        [-]
        snrji 1957 days ago
        > His background is available on his CV: http://www.cs.toronto.edu/~duvenaud/
        Thank you.
  - innagadadavida 1957 days ago
    From a software standpoint, will any of these ideas be ported to TensorFlow or is this very different?
- stealthcat 1958 days ago
  How is this different from spiking neural networks (SNN)? Seems like easier and simplified (e.g. no synapse) and densely connected version of SNN, with less control on inter neuron connections.
  Last time, I read work to port SNN (with ODE solver) to Pytorch https://github.com/Hananel-Hazan/bindsnet https://www.frontiersin.org/articles/10.3389/fninf.2018.0008...
  [-]
  - duvenaud 1958 days ago
    My understanding is that the advantage of something like a SNN is mainly in custom-hardware ultra-lower-power applications. But it adds a lot of complexity and makes training a lot more difficult, compared to something like ODE nets or standard discrete neural nets.
    Looking at the links you provided, they don't appear to train the network dynamics, only the feedout weights. In principle you could differentiate through the ODE solvers used in the software package linked to train the dynamics, but as far as I know we were the first people to release an open-source set of differentiable solvers using reverse-mode adjoint sensitivities (the most scalable variant of autodiff for training scalar losses). [https://github.com/rtqichen/torchdiffeq]
    In the paper we talk a bit about how you can now more easily train Possion process likelihoods, which might make training SNNs easier. I'm not sure.
    [-]
    - ssfrr 1957 days ago
      > as far as I know we were the first people to release an open-source set of differentiable solvers using reverse-mode adjoint sensitivities
      Would the DifferentialEquations.jl work in the Julia community qualify? As I understand it for the pure Julia solvers you can run them through autodiff to differentiate with respect to the parameters. Chris Rackauckas has talked a lot about how cool it is that he can take his solvers that were not written to be AD-aware and use somebody else’s AD package that’s not specialized for diff eqs, and combined you get differentialable diff eqs.
      [-]
      - ChrisRackauckas 1956 days ago
        Yes, we've had this working in Julia for a few years now (with Flux.jl and ReverseDiff.jl tracing-based reverse-mode) and it's surprising that wasn't know because it's been pretty widely publicized (and mentioned to the authors). In fact, we even have a preprint showing that reverse-mode adjoint isn't what you want to use in most scenarios:
        https://arxiv.org/abs/1812.01892
        We identified that tracing-based reverse-mode adjoint + solving without mutation is not competitive with the other sensitivity analysis methods (this is the method which they have implemented in this repository) when comparing it to methods which are fully compiled and utilize mutation. Since the methods in SciPy don't utilize mutation (and add quite a bit of their own overhead, and don't necessarily work well with the JIT compilation being of context switching, as previously demonstrated in http://juliadiffeq.org/2018/04/30/Jupyter.html ), I wouldn't be surprised that they saw a speedup anyways over the odeint+adjoint method described in the paper, but that's mostly because the method from the paper is highly inefficient.
        In terms of efficient implementations, the first ones which are equivalent to a reverse-mode adjoint were actually those of FATODE ( https://dl.acm.org/citation.cfm?id=2048596&dl=ACM&coll=DL ) which implemented quite a few Runge-Kutta and Rosenbrock methods with forward-mode and reverse-mode transforms of the solvers manually written in there (described as "discrete sensitivity analysis") for hooking into Fortran AD implementations. JuliaDiffEq's methods are based along this existing heritage and testing.
        But sensitivity analysis goes back a lot further. In their paper though they utilize the adjoint sensitivity analysis, which the most common software one would point to for this is SUNDIALS CVODES (https://computation.llnl.gov/sites/default/files/public/cvs_... ). If you compare page 26 of the SUNDIALS manual to the Neural ODE paper, you'll notice that it's eerily similar because it's almost the same method. And actually, 2.7.1 describes an improvement to the method in the Neural ODE paper by using "checkpointing". In the Neural ODE paper, to do a reverse solve of the adjoint ODE it solve the forward ODE from the beginning time point until the point. Clearly, this is really slow because it requires a lot of forward solves over long intervals. You can instead save a few time points (checkpoints) along the forward solve and use those as the starting point, and this is the method implemented in CVODES and IDAS. In JuliaDiffEq, since we were focused on less memory intensive applications (CVODES did it for PDEs), we did it in a way where you get a continuous solution in one forward solve, and then just use the interpolation for an essentially free calculation for the Jacobian at a given time point. This of course goes from many ODE solves to a single ODE solve, at the cost of using more memory, and we proposed implementing checkpointing in a GSoC. But anyways, the derivation of the paper for the adjoint method wasn't new and it was actually already improved by 2005 in the TOMS paper on CVODES ( https://computation.llnl.gov/projects/sundials/toms_cvodes_w... ), with other methods for improving runtime done as well.
        In a previous Twitter thread, the senior author also claimed that using the autodifferentiation to do the vector-Jacobian products was a first. For reference, what I mean is that there is a vjp in the adjoint ODE and this can be solved without explicitly building the Jacobian and seeding the backsolve of the derivative function f appropriately. I will admit that in JuliaDiffEq we weren't doing this, but we have since improved our implementation to make use of this trick (and the advantage of this method is shown in the Arxiv paper linked above). That said, the Neural ODE paper isn't the first group to do this either, with earlier work actually being traced back to CasADi (https://link.springer.com/chapter/10.1007/978-3-642-30023-3_...), and of course the author of CasADi was confused as well in the same Twitter thread about the novelty claim here.
        And this is just the tip of the iceberg for schemes utilizing adjoint methods. There's entire pharmacometric software (NONMEM) and engineering software (Modelica) built back in the 80's for building the adjoint equations in ways to get these sensitivities for parameter estimation of methods.
        So I honestly cannot find a single part of the adjoint method either as mentioned or implemented where this PyTorch implementation is the first. I think a lot of this comes from a lack of technical expertise in the area of numerical differential equations and just an honest mistake though. There are other glaring examples of this in the paper as well. For example, in the paper they mention:
        >To solve ODE initial value problems numerically, we use the implicit Adams method implemented in LSODE and VODE and interfaced through the scipy.integrate package. Being an implicit method, it has better guarantees than explicit methods such as Runge-Kutta but requires solving a nonlinear optimization problem at every step.
        However, it's well known that not all implicit methods have better stability guarantees than explicit methods. Specifically, the implicit Adams methods they mention are Adams-Bashforth-Moulton methods are not unconditionally stable and have stability similar to explicit methods (you can find this in a lot of numerical analysis course lecture notes, like this one: http://www.math.vt.edu/people/embree/math5466/lecture37_38.p... . The best resource on this is probably Hairer Solving Ordinary Differential Equations I: Non-stiff Problems). Because of this lack of stability, LSODE with Adams coefficients is only recommended non-stiff equations and common stiff test examples like the ROBER will cause it to fail. Instead, the backwards differentiation formula (BDF) is the similar method which is used for stiff equations (there are lots of sources on this. For example, ode113 is the Adams and ode15s is the BDF, and go look at MATLAB's documentation for how they do the recommendations. Another example is CVODE which implements both the Adams and BDF methods, and only recommends the Adams method for non-stiff equations). This also means that the current set of PyTorch differential equation solvers is only applicable to (some) non-stiff ODEs BTW.
        I do want to end this critique though by saying that the paper itself describes a very new and novel application of differential equation solvers as part of a neural net in a way that seems very beneficial. So even though what was demonstrated is not technically sophisticated compared to current software and techniques, their application is a very interesting idea and a new research direction which is deserving of a best paper. I don't think we should lose track that the idea of an RNN as an Euler discretization and thus building the continuous one to get better training times is a very neat solution to a machine learning problem and I very much respect the authors for this advance.
        I think that what we can learn from this case is that numerical differential equation and machine learning communities need to come together to do this in a way that pushes the research and the software forward. We are starting to do this in Julia, and that's why in our Arxiv paper you can see the JuliaDiffEq crew has teamed up with the machine learning and AD people (Jerret Revels and Mike Innes) to make sure we can get the full adaptive + events + stiffness handling etc. solves working with AD, and get these utilized in new applications for machine learning. With the help of one of the paper's authors we hope to put out a package in January for implementing said neural ODEs in Julia. From there we can all probe the method our own ways. Since my background is in numerical differential equation solvers, I plan to look at this to see what kinds of solvers are optimal in this application (I think some EPIRK type methods might be the right deal). The ML folks will likely want to actually use it as a model to train and predict. The AD folk see this as a good test case for source-to-source transformations, since building a tape a la Flux or PyTorch can be pretty inefficient for large nonlinear operations like an ODE solver (so we plan on seeing what happens with Zygote.jl on this. Right now it errors, but we'll see!).
        I see a very nice future merging differential equations and machine learning in various ways, and this paper is a fantastic start in that direction. Let's just make sure we cite the relevant existing work from both communities.
        [-]
        duvenaud 1955 days ago
        Thanks for the detailed remarks, and for pointing out that my statement above about the novelty of our implementation was wrong - CasADi uses the same algorithm as was released in 2013. My apologies. Joel Andersson pointed this out to us and we added a cite to his thesis several months ago.
        As for Sundials and FatODE, my understanding was that they used finite differences, forward mode, or differentiating the solver operations for at least some aspect of their sensitivity analysis.
        On another topic, I think you might be misunderstanding how we're running our adjoint sensitivity analysis. You say
        > In the Neural ODE paper, to do a reverse solve of the adjoint ODE it solve the forward ODE from the beginning time point until the point. Clearly, this is really slow because it requires a lot of forward solves over long intervals.
        This isn't true - when we do the reverse solve, we get all gradients using a _single_ solve going backwards in time. I'm now realizing that the misunderstanding might be caused by our Fig. 2, which shows that multiple backward solves are necessary when the loss depends on the state at multiple time points.
        I agree that our statement about the stability of implicit over explicit was overly broad, thanks for pointing that out. Can you suggest a more accurate statement about the advantages of implicit over explicit methods?
        I also agree there is a lot of numerical work to be done in this area, and I'm glad that people more knowledgable than us (such as yourself) are looking at it too!
        [-]
        ChrisRackauckas 1955 days ago
        >As for Sundials and FatODE, my understanding was that they used finite differences, forward mode, or differentiating the solver operations for at least some aspect of their sensitivity analysis.
        No, their sensitivity analysis doesn't use finite differences. They perform the sensitivity analysis as described in their documentation. If methods for the Jacobian calculation are not provided, then they utilize finite differences on the Jacobian calculation of course, using the same routine as for the stuff solver. But with a given Jacobian function there's no finite differences or forward mode.
        >This isn't true - when we do the reverse solve, we get all gradients using a _single_ solve going backwards in time. I'm now realizing that the misunderstanding might be caused by our Fig. 2, which shows that multiple backward solves are necessary when the loss depends on the state at multiple time points.
        No. Of course it's a single solve going backwards in time. I see what my misread was, but I'm surprised you'd attempt to solve the equation backwards like that because it's known to not be stable. Without a reversible integrator (implicit Adams is not reversible), it's a well-known result that the method drifts from the true solution doing a backwards integration, so the values so z(t) needs to computed with forward passes in order to be correct. A good test equation for this is probably the Lornez equation with standard parameters over a time like [0,300]. The backwards pass will diverge and not necessarily be on the same butterfly wing as the forwards pass. The Julia code using CVODE is shown here ( https://gist.github.com/ChrisRackauckas/fef4ae7778320530d44b... ) and you see that starting from [1.0,1.0,1.0] the result of going backwards is then off by [-17.5445, -14.7706, 39.7985]. So there you go, CVODE's Adams method ends up on a different "wing" of the butterfly when integrated backwards, ending up not even close to the actual initial point (CVODE is the successor to LSODE, both by Alan Hindmarsh, but utilizes constant leading coefficient forms to reduce computations. So not exactly the same as the paper, but very close). Thus to ensure correctness, existing sensitivity analysis packages only get Jacobians of f using data from forward passes. The Neural ODEs may have had a small enough Lyopunov coefficient or a short enough integration that this wasn't an issue, but it is in general something to note. Of course, if you are only doing this on Hamiltonian systems...
        >I agree that our statement about the stability of implicit over explicit was overly broad, thanks for pointing that out. Can you suggest a more accurate statement about the advantages of implicit over explicit methods?
        Any statement on it is too broad to be useful. Runge-Kutta Chebyshev methods are explicit methods for stiff systems. Implicit Adams is an implicit method for non-stiff systems. And there's many more examples. The stiffness handling also depends on implementation details. Using functional iteration on a BDF method reduces the region of stability, which is why BDF needs to use Newton's method for solving the implicit equation in order to be applicable to stiff ODEs. It's best to just talk about the stability of individual methods and their implementation.
        [-]
        duvenaud 1955 days ago
        > But with a given Jacobian function there's no finite differences or forward mode.
        Right, but instantiating an entire Jacobian is always going to scale at least quadratically with time. The point I was trying to make is that the existing non-adjoint approaches were never going to scale to large systems with millions of parameters. This is the main attraction of reverse-mode, and it appeared to me that this was a major obstacle for fitting large models using existing packages (excepting CasADi).
        > I'm surprised you'd attempt to solve the equation backwards like that because it's known to not be stable.
        > it's a well-known result that the method drifts from the true solution doing a backwards integration, so the values so z(t) needs to computed with forward passes in order to be correct.
        I agree that a purely reverse-mode gradient solve will diverge from the forward trajectory to some degree. But to say the gradients are 'correct' or not seems a bit strange to me. Every numerical solve introduces some degree of error, and I think the most useful discussion to have is the tradeoff between computation cost and numerical error. Re-solving the system forwards is one strategy to reduce error at the cost of computation. Another strategy would be reducing the error tolerance of the reverse solve. There are situations where our strategy might give worse precision wrt the parameter gradients for a given computational budget, but I wouldn't dismiss it out of hand. Especially since it's about as computationally cheap as one could hope for - O(1) memory and similar time cost as the forward solve. Also, it worked for our applications.
        > Any statement on it is too broad to be useful.
        I appreciate the detailed reply. But is there anything you can say about when to try implicit methods over explicit? What was the motivation for developing implicit methods in the first place?
      - duvenaud 1955 days ago
        Well, you can differentiate through the operations of ODE solvers written in any framework that has autodiff. But the whole idea of adjoint sensitivity analysis is to not differentiate through the operations of the solver, to save memory and to control numerical error more directly.
    - duvenaud 1955 days ago
      Whoops, I just realized that I failed to qualify that statement properly. CasADi [https://web.casadi.org/], released in 2013, uses adjoint sensitivities to compute gradients through ODE solutions using symbolic (but still reverse-mode) differentiation.
      I should have said something like: We were the first to implement adjoint sensitivities in a modern, tracing-based autodiff framework suitable for machine learning [HIPS autograd]. But I messed up and should have given credit to Joel Andersson for being first, my apologies.
- ArtWomb 1958 days ago
  Congrats on winning Best Paper ;)
  Is the Zero-G environment of deep space the ideal environment for continuous differential models?
  [-]
  - duvenaud 1958 days ago
    Thanks! If I understand your question, you're asking if the ODE time-series model we proposed works best in things like idealized frictionless or gravity-less setting? The answer is that we expect it to work (in principle) even for messy, non-physics-experiment-style situations. This because we don't model the system dynamics directly, but in a latent space that we learn. So even in a messy situation like health monitoring of humans, as long as the data are being driven mostly by hidden factors that are changing smoothly through time (like overall health, infections, hormones or whatever) the neural ODE can abstract away from the messy data. This is all still just in principle - there are a few technical challenges to scaling these models up. But nothing that looks insurmountable right now.
    [-]
    - zygotic12 1958 days ago
      Stunning work. Please don't waste yourself in Finance - I bet your phone has already rung.
- joe_the_user 1958 days ago
  What results do you have on state of the art metrics?
  Also, any outline of how one would go about training one of your systems? What kind of dataset does it do best with?
  [-]
  - duvenaud 1958 days ago
    None in this paper, which is just showing proofs of concept. But our follow-up paper does have SOTA density modeling among the class of efficiently-sample-able generative models. This is because the design of continuous flow models is less constrained than for discrete flows. But other than that, it's still early days for this whole model class.
    You train these models in the same way you train a neural network, with stochastic gradient descent. For supervised learning, its main benefit is extra flexibility in the speed/precision tradeoff. For time-series problems, we expect this will ultimately allow us to handle data that's collected at irregular intervals - but we haven't yet moved past the prototype stage in that setting.
- byt143 1958 days ago
  How does this relate to previous work that's described NNets in the limit of infinite width as GPs?
  [-]
  - duvenaud 1958 days ago
    Ooo, great question. In both cases, we define a function using a sum over an infinite number of infinitesimal things.
    In the GP case, those things are random and independent of each other, so the central limit theorem applies and we get a simple Gaussian.
    In the ODE case, the infinitesimals are deterministic and depend on the previous ones in the sequence, so the final answer is deterministic and impossible to compute exactly in general.
    You could also use a GP to model the dynamics of an ODE, and this was done recently: [https://arxiv.org/abs/1803.04303] although the drawback was that they couldn't train the model by backpropagation.
- rkevingibson 1958 days ago
  I'm a little late to this thread, but wanted to thank you for the paper! I found it interesting enough to blog about it (https://rkevingibson.github.io/blog/neural-networks-as-ordin...).
  I wonder if you've given any thought to generalizing to fractional differential equations? My intuition tells me that the dynamics that you're learning are "local" in the sense that the ODE solvers depend only on the current state (and maybe some recent history), whereas learning the dynamics of a fractional system could give the system a larger "history" in the case of your time-series models.
- ExpectMax 1958 days ago
  Thanks for this awesome peace of research! I'm really looking forward to further developments in the field :)
  I have two small questions regarding the paper:
  1. When comparing to Normalizing Flows (planar flows), in Section 4.1, how were these fitted in the Maximum Likelihood Training section? If I understand it correctly NF's don't have a closed form inverse, s.t. ML training should not be possible.
  2. Do you encounter any issues regarding stability during training? Other Flow based approaches such as Glow use certain tricks to ensure that the Flow initial reduces to an identity transform, to increase stability and ensure reliable convergence.
  [-]
  - duvenaud 1958 days ago
    1. Great question! You're correct that standard NF isn't efficiently invertible. CNF is, and we wanted a fair comparison. So for this experiment, we reversed the direction that NF transforms the data, so that it goes from the data to the latent space. Training this way means that you can't use the resulting model as a generator, but it at least let us compare likelihoods with CNF for this paper.
    2. We had to set the error tolerance relatively small during training to keep the gradients stable. I don't think we used any fancy initialization tricks, but to be honest I have to ask Ricky Chen and Will Grathwohl, who ran all the FFJORD experiments.
- diego898 1958 days ago
  Hello! Can you comment on how this relates to: https://arxiv.org/pdf/1804.00779.pdf
  [-]
  - duvenaud 1958 days ago
    Good question. The model developed in that paper, (neural autoregressive flows) is a discrete-layered architecture. It's a member of the Normalizing Flows family of models. Normalizing flows define a parametric density by transforming a sample from a Gaussian by a series of transformations:
```
  z0 ~ Normal(0, I)
  z1 = f1(z0)
  z2 = f2(z1)
  z3 = f3(z2)
  x  = f4(z3)
```
    and they use the change of variables formula to compute p(x):
```
  log p(x) = log p(z0) + log det | df/dz0 |
```
    In our paper, we propose a continuous-time version of normalizing flows, called Continuous Normalizing Flows. We derived a continuous-time version of the change of variables formula:
```
  dlogp(z(t)) = -trace(df/dz)
```
    Anyways, the confusion is probably that both models are called flows. We wish that Normalizing Flows had used a different name, so that we could save the word "flow" for continuous-time transformations, but it's too late for that :)
    Neural autoregressive flows are powerful density models, but are computationally costly to sample from. Continuous normalizing flows cost about the same to evaluate densities and to sample. We compared the two in a follow-up paper: [https://arxiv.org/abs/1810.01367]
- wjhuggins 1958 days ago
  Would it be possible to apply the kind of reasoning you used to a discretization of space also?
  One could imagine starting with a convolutional neural network that is also a residual network (I may be butchering the proper terminology here) and taking the limit of an infinitely fine discretization in space as well as time to arrive at a PDE instead of an ODE.
  Would the adjoint method approach that you used for backpropagation work in this case?
- ratman 1958 days ago
  Hi,
  1. What is a black box ode solver?
  (It sounds like a proprietary tool but that doesn't make sense and googling didn't bring up anything useful)
  2. Have you found any areas where there is a significant difference in the quality of the results using this method rather then the normal discrete method?
  I am sorry if these questions are a bit bellow you, or if you already answered them in the paper.
  [-]
  - theSage 1958 days ago
    Black box means you have no access to how a system actually does some task. All you know is that given an input the system will provide the output. ODE solver is an ordinary differential equation solver. For example [a]
    [a] : https://docs.scipy.org/doc/scipy/reference/generated/scipy.i...
  - duvenaud 1958 days ago
    1. As the other reply said, we just mean an ODE solver whose internal operations are a detail we don't need to worry about.
    2. This paper only had proofs of concepts and toy demos. But our follow-up paper, FFJORD [https://arxiv.org/abs/1810.01367] we used the continuous normalizing flows to get SOTA efficiently sample-able density models.
- iandanforth 1958 days ago
  Several recent papers take steps to isolate weights from updates during training to prevent catastrophic forgetting. In the ODE formulation is there a way to do something similar?
  [-]
  - duvenaud 1958 days ago
    Interesting question. There might be, but that's more a question of fitting parameters. This paper was about a different way to set up a parametric model, that can be trained in the usual way. I don't think the fact that internally it uses ODEs changes its susceptibility to catastrophic forgetting in the online learning setting.
- MielMe 1950 days ago
  Was the backwards integration stable? It would be surprising from a mathematical perspective of ODEs.
- bionhoward 1957 days ago
  Would it be possible to apply ODE nets to
  Universal [Solomonoff] Induction
  Program [Levin] Search
  Tensor Factorization
  ?
  Bion Howard (a biologist)
  medium.com/bit-pharma
  bitpharma.com
bitL 1958 days ago
Wow, I need to master ODEs/PDEs to keep up with Deep Learning now! Seems like one has to be a master of statistics, operations research, calculus and algorithms to push it forward!
Comparison to RNN was impressive! Any well-known real-world models for comparison to state of art?
[-]
- duvenaud 1958 days ago
  The closest thing to a continuous-time RNN that exists are Neural Hawkes Processes [https://arxiv.org/abs/1612.09328]. They use a different model where observing the system necessarily changes its state, which is natural in some settings but doesn't fit in others. On the other hand, their model scales today and ours doesn't :)
- mishurov 1958 days ago
  For using a numerical solver e.g. Runge-Kutta you need nothing to master, statistics operations master.
heinrichf 1958 days ago
Some reporting in MIT's newsletter on AI: https://mailchi.mp/technologyreview/a-new-type-of-deep-neura...
syntaxing 1958 days ago
I remember something similar for CFD application but haven't seen much after that. It would be awesome if we can build a cheap and fast Navier Stokes solver with neural networks.
[-]
- mishurov 1958 days ago
  You haven't read what the paper about, are you? The Navier Stokes PDEs and its numerical solvers have absolutely nothing with this paper.
shsjxzh 1945 days ago
Hi, I am very interesting about your models. When you do the back propagation, it seems that it still needs complex calculation. Although O(1) memory cost is an important contribution, do you think record some of the intermediate value will significantly boost the training?