A Handwavey Explanation of Metropolis-Hastings

(ncollins.github.io)

28 points | by luu 1496 days ago

2 comments

  • ur-whale 1495 days ago
    Somewhat OT, but: I've always felt there's some sort of connection between Metropolis and the alias method [1]

    Can someone more knowledgeable perhaps shed some light on this idea?

    [1] https://en.wikipedia.org/wiki/Alias_method

    • dahart 1495 days ago
      I don't pretend to be all that knowledgeable, but I've implemented & used both methods...

      There is a similarity in the sense that both are trying to generate samples with a density proportional to a given distribution. Metropolis is generally used in the continuous case, while the alias method is for sampling discrete distributions.

      Apart from that, I don't see a very strong connection between the techniques. Metropolis is for random walks, while the alias method is for direct sampling.

      The brilliantly clever observation of the alias method is that when you have N items in your discrete distribution, you can sample from a uniformly discretized set of buckets whose width is the average size of your discrete weights, and then split up the larger-than-average weights and distribute the excess into buckets holding smaller-than-average weights. It's always possible to end up with no more than 2 items per bucket, so the data structure fits in an array of size 2N, and sampling from your discrete weights is O(1).
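To make the bucket-splitting idea above concrete, here is a minimal sketch of the alias method (the names and structure are my own, following Vose's variant of the construction, not anything from the thread):

```python
import random

def build_alias_table(weights):
    """Build probability/alias tables from nonnegative weights.

    Each bucket ends up holding at most two outcomes: itself and one
    'alias', so sampling is O(1) after O(N) construction.
    """
    n = len(weights)
    total = sum(weights)
    # Scale so the average bucket weight is exactly 1.
    scaled = [w * n / total for w in weights]
    prob = [0.0] * n
    alias = [0] * n
    small = [i for i, p in enumerate(scaled) if p < 1.0]
    large = [i for i, p in enumerate(scaled) if p >= 1.0]
    while small and large:
        s = small.pop()
        l = large.pop()
        prob[s] = scaled[s]           # chance of keeping s in its own bucket
        alias[s] = l                  # otherwise redirect to the large item
        scaled[l] -= 1.0 - scaled[s]  # the large item donates the leftover mass
        (small if scaled[l] < 1.0 else large).append(l)
    for leftover in small + large:    # remaining buckets are exactly full
        prob[leftover] = 1.0
    return prob, alias

def alias_sample(prob, alias, rng=random):
    """Draw one index in O(1): pick a bucket uniformly, then a biased coin flip."""
    i = rng.randrange(len(prob))
    return i if rng.random() < prob[i] else alias[i]
```

Note the two arrays of size N (probabilities and aliases) are the "array of size 2N" mentioned above, and every draw costs exactly one uniform integer plus one uniform float.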

      Metropolis is more about taking a walk through a probability field, so each sample is in some sense a result of the history of the previous samples, and it spends more time in local areas where probabilities are high.

      With Metropolis, you have to make sure you don't get stuck in local maxima, and that your sampler can break through low-probability regions in your distribution. With the alias method, you get independent samples, so there's no risk of missing some of the space. Both methods can accept quasi-random or low-discrepancy inputs, but the benefit will be limited for both of them, unlike with inverse transform sampling.
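The random-walk behavior described above can be sketched in a few lines. This is a hedged, minimal random-walk Metropolis sampler (symmetric Gaussian proposal, my own naming), illustrating how each sample depends on the previous one and how the chain lingers where the unnormalized density is high:

```python
import math
import random

def metropolis_sample(log_density, x0, n_steps, step_size=1.0, rng=random):
    """Random-walk Metropolis over an unnormalized log-density.

    Each sample is a function of the previous state, so the output is a
    correlated chain, not independent draws.
    """
    x = x0
    samples = []
    for _ in range(n_steps):
        proposal = x + rng.gauss(0.0, step_size)  # symmetric proposal
        # Accept with probability min(1, p(x')/p(x)); done in log space.
        if math.log(rng.random()) < log_density(proposal) - log_density(x):
            x = proposal
        samples.append(x)  # on rejection the current state repeats
    return samples
```

For example, `metropolis_sample(lambda x: -0.5 * x * x, 0.0, 50000)` walks through a standard normal known only up to its normalizing constant; the early part of the chain (burn-in) is typically discarded before use.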

  • mlthoughts2018 1495 days ago
    I feel like any explanation of Metropolis-Hastings that doesn’t explain detailed balance and why it matters is so oversimplified as to be not only useless in practice but dangerous: it gives mathematical lay persons a very false confidence that they understand it, or that they could just take the relatively simple code of the algorithm (in the case of a symmetric proposal function) and plug it in for a real use case, without understanding autocorrelation, thinning, burn-in, convergence, or multi-chain methods (all of which are simply table stakes for actually using MCMC for any real purpose).
    • dahart 1495 days ago
      Just for friendly argument’s sake, allow me to disagree a little bit, and explain why. It’s a shame if we can’t allow people to learn and share mathematical intuition in small pieces. I like that the author saw the boundary behavior and thought about it, questioned why it actually works. The intuitive argument presented does take one step in the direction of detailed balance, it just doesn’t cover the continuous case. (And I’m not sure - does detailed balance actually apply to the boundary? The process doesn’t run in reverse for areas where your probability field goes to zero, right?)

      I don’t necessarily think that an article on math should cover everything that is known about a subject. It’s okay to write for an audience that doesn’t know anything about autocorrelation and multi-chain methods. The need to elaborate so thoroughly on every possible prerequisite and/or application, and to cover every corner case, is one of the reasons so many people dislike reading math on Wikipedia ... it’s unapproachable unless you already know everything about it. It becomes a reference only for experts, not very usable for learning by a student.

      Anyway, what’s the true danger of a simple or incomplete understanding? It’s not that likely to lead to people putting the wrong thing into nuclear reactors; isn’t it more likely to lead to someone getting the wrong answer in a weekend software project and then spending Sunday learning a little more about MCMC methods?

      • mlthoughts2018 1495 days ago
        > “ Anyway, what’s the true danger of a simple or incomplete understanding?”

        In practice the danger is someone not trained in statistics will copy/paste the “simple” approach and generate poor chains of samples, and base a seriously incorrect MCMC calculation off their misunderstood application.

        If it’s clear this is just for teaching, then sure the risk is less. But it’s not usually clear.

    • cf 1495 days ago
      While detailed balance is typically how we explain why Metropolis-Hastings works at all, it's actually not quite what matters. What matters is ergodicity, and detailed balance is a proxy for checking that. That's helpful since ergodicity does have an easier intuition behind it as well.
      • mlthoughts2018 1495 days ago
        Detailed balance is the specific way that ergodicity is guaranteed for Metropolis-Hastings (literally, it is the required aspect of the proof when you decompose the transition function into the proposal and acceptance / rejection functions), hence why detailed balance is the specific thing for this case.
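For readers who haven't seen it spelled out, the condition both commenters are referring to can be sketched as follows (π is the target distribution, q the proposal, α the acceptance probability):

```latex
% Detailed balance: probability flow from x to x' equals the reverse flow.
\pi(x)\, q(x' \mid x)\, \alpha(x, x') \;=\; \pi(x')\, q(x \mid x')\, \alpha(x', x)

% Metropolis-Hastings chooses the acceptance probability so this holds:
\alpha(x, x') \;=\; \min\!\left( 1,\ \frac{\pi(x')\, q(x \mid x')}{\pi(x)\, q(x' \mid x)} \right)
```

For a symmetric proposal, q cancels and this reduces to the familiar min(1, π(x')/π(x)) acceptance rule.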
        • cf 1493 days ago
          I think we are agreeing. My point is you can get away with explaining Metropolis-Hastings in a useful way and never mention detailed balance. You will need it to work out a formal proof of convergence, but you don't need it to understand or even code up the algorithm.

          I think explanations don't require formal proofs to be useful.

          • mlthoughts2018 1492 days ago
            No, in the case of Metropolis-Hastings, you do need to understand detailed balance to understand the algorithm, because the conditions required of your proposal function literally come from it... not in a vague way about math convergence, but literally in the algorithm. For asymmetric proposal functions this is critical to understand.

            Otherwise you are cargo cult copying some code and expecting it to work, without understanding the algorithm the code is executing.
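The point about asymmetric proposals can be made concrete with a small sketch (my own example, not from the thread): a multiplicative log-normal proposal on a positive-valued target, where the Hastings correction q(x|x')/q(x'|x) happens to simplify to x'/x. Dropping that correction term silently biases the chain.

```python
import math
import random

def mh_lognormal_walk(log_target, x0, n_steps, sigma=0.5, rng=random):
    """Metropolis-Hastings with an *asymmetric* multiplicative proposal
    x' = x * exp(sigma * z), z ~ N(0, 1).

    Because q(x'|x) != q(x|x'), the acceptance ratio must include the
    Hastings correction q(x|x')/q(x'|x), which for this log-normal
    proposal works out to x'/x.
    """
    x = x0
    samples = []
    for _ in range(n_steps):
        proposal = x * math.exp(rng.gauss(0.0, sigma))
        log_accept = (log_target(proposal) - log_target(x)
                      + math.log(proposal) - math.log(x))  # Hastings term
        if math.log(rng.random()) < log_accept:
            x = proposal
        samples.append(x)
    return samples
```

For instance, with `log_target = lambda x: -x` (an Exp(1) target, known up to a constant) the chain stays on the positive reals by construction, and the extra log(x'/x) term is exactly what detailed balance demands of this proposal.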

            Further, you need to understand the convergence aspects too, because when your software produces an actual chain of samples, whether you can safely use that chain for the estimation or simulation you intended really depends on autocorrelation and convergence criteria. You cannot just trust that the code was OK and therefore the samples can be used... but “intuition” tutorials like this give a false sense of security that you can.