Morton: Bit Interleaving in C/C++

(github.com)

119 points | by tosh 14 days ago

11 comments

  • Malipeddi 13 days ago
    Related write-ups below along with comparison of performance of different approaches. BMI2 instruction set based method seems to be the fastest of the lot.

    1. https://www.forceflow.be/2013/10/07/morton-encodingdecoding-...

    2. https://www.forceflow.be/2016/01/18/libmorton-a-library-for-...

    3. https://github.com/Forceflow/libmorton

    I used it to accelerate nearest neighbor detection for collision processing in particle-laden flow for modeling complicated domains in 3d (for biological fluids simulation). I was using it as a locality sensitive hashing to put particles near each other in the same bucket in an hash map. I came across the ideas of BIGMIN (big minimum) and LITMAX (little maximum) for range search in a morton encoded data that I found to be cool.

  • zX41ZdbW 13 days ago
    I use this technique in a few of my projects:

    https://reversedns.space/ (a map of the Internet) - Morton curve is used as a visualization tool;

    https://adsb.exposed/ (a visualizer of air traffic) - Morton curve is used as a database index;

    Both projects are open-source, so you can find how exactly it is applied: https://github.com/ClickHouse/adsb.exposed/blob/main/setup.s...

  • jandrewrogers 13 days ago
    Construction of n-dimensional Morton codes of an arbitrary bit length is regular enough that you can generate optimal shift/mask sequences at compile-time. There are also simpler generator algorithms that are slightly less optimal (e.g. in a few cases there may be an unnecessary shift/mask step).

    Intel microarchitectures added fast ALU instructions (PDEP/PEXT) that can directly effect this as long ago as Haswell (2013) but AMD only very recently added comparable instruction implementations so portability of these instructions is an issue. Also, depending on the dimensionality and bit length of the code, there may be a few cases where the optimal shift/mask version will be about as fast without the portability concerns.

    • jart 13 days ago
      AMD introduced BMI2 in Excavator c. 2015.
      • jandrewrogers 13 days ago
        AMD did not have useful implementation of PDEP/PEXT instructions until Zen 3, which is barely more than 3 years old. All prior AMD implementations were microcoded for compatibility purposes only. If you cared about performance it was faster to do it in software for most applications. The Zen 2 implementation was something like 6-7x slower than the original Haswell implementation.

        For all practical purposes, you had to write performance code like AMD did not support PDEP/PEXT.

        • mlochbaum 13 days ago
          It's much worse than that: based on measurements these use a loop that goes one set bit of the mask argument at a time. uops.info and so on measure a mask that's almost all zero so they vastly underreport. Worst case is in the hundreds of cycles. Do not use pdep/pext before Zen 3. https://twitter.com/uops_info/status/1202950247900684290
  • jokoon 13 days ago
    I already implemented some spatial partitioning with a Z order curve, it's not too difficult to write and understand, but it works well.

    If I remember, the main advantage is that you can use a sorted multimap, insert item by converting 2D coordinates to 1D with some function. Multimaps are already implemented and fast.

    The result is that your data is packed, and "sorted" by 2D location, so when you query items in an area, it's supposed to be very fast because of cache friendliness, I think. The interesting trick is when you query a rectangle, you calculate 2 corner bounds, and use them to query the multimap.

    Of course, this is a simple way to index 2D data, and R trees are still faster, but no engineers will enjoy implementing an R-tree. BSP or Kd-trees are also much harder to implement than a Z-order curve.

    I am not good at those subjects, but I feel like a Z order curve is the "best" way to index 2D, since it's a very good trade-off between simplicity and performance, which is if one wants to write lean software, like for a video game for example. Feel free to correct me.

    Although I think there are things around Z order curves that are still under patent.

  • tubs 13 days ago
    This doesn't work efficiently for non-square arrays.

    E.g. if you have a 128x32 then you don't want to interleave the y bits once they are exhausted (only need 5 bits of y, but 7 bits of x to fully address this).

    The posted implementation would produce

        y6,x6,y5,x5,y4,x4,y3,x3,y2,x2,y1,x1,y0,x0
    
    but really we want:

        x6,x5,y4,x4,y3,x3,y2,x2,y1,x1,y0,x0
    • HelloNurse 13 days ago
      In a microprocessor with fixed-size registers numbers of 5 and 7 bits are going to be extended to at least 8 bits and interleaved into a code of 16, 32, 64 or even more bits: knowing that y5 and y6 are zero doesn't allow for a more efficient computation.

      On the other hand, if you intend to store and compare compact codes of 12 bits you only need a few shift, OR and AND operations (or possibly dedicated fancy instructions) to knock out the known zero bits and place the significant digits contiguously.

  • muragekibicho 13 days ago
    I saw something similar in Bit Twiddling hacks. Out of utter curiosity, when would you need to interleave bits in prod? Is it something a Saas dev would be doing or maybe sb in embedded programming?
    • tubs 13 days ago
      To expand on azornathogron's answer, when you are working with 2d data (it generalises to 3d too!) you often want to filter pixels in a rectangular area. This is commonly for bilinear filtering or some kind of convolution kernel.

      If you don't interleave the bits and have large textures (think 4096 pixels wide) where each pixel is 4bytes big, that means there is a distance of 16kb between a pixel and the pixel below it.

      This is super bad for caches (really important for the TLB in the MMU which is usually way smaller than data caches).

      In GPU literature you'll see this called "tiling" (again like azornathogron said it's not always pure morton order), Intel document their tiling layout, here's an older layout doc:

      https://docs.mesa3d.org/isl/tiling.html

    • azornathogron 13 days ago
      If you're doing low level graphics programming (processing pixel data) or in some other way dealing with 2D raster type data, you might want to work with data in Morton order or with Morton order tiles or something similar. Interleaving the bits of the x & y coordinate values helps to put pixels that are close together in your 2D space also close together in memory, which can help with making best use of caches.

      See for example https://fgiesen.wordpress.com/2011/01/17/texture-tiling-and-... (search "Morton" in the page)

      There are probably other use-cases that I'm unaware of.

    • pbsd 13 days ago
      Keccak (and other ciphers only using bit rotation and bitwise ops) can use bit interleaving to avoid slow 64-bit rotations on 32-bit hardware, by replacing 1 64-bit rotation by 2 independent 32-bit rotations on the interleaved words [1, §2.1].

      [1] https://keccak.team/files/Keccak-implementation-3.2.pdf

    • chrchang523 13 days ago
      Two-bit values are common in bioinformatics, and I’ve found the ability to efficiently convert between packed arrays of 1- and 2-bit values to be valuable in that domain.
    • gliptic 13 days ago
  • robinsonb5 13 days ago
    The masking constants in the source bring back memories of Chunky-to-Planar conversion on the Amiga 25 years ago!
    • vardump 13 days ago
      I was thinking exactly same.
  • dotnet00 13 days ago
    Morton codes are pretty fun, I remember playing with them a few years ago when toying around with tricks for optimizing voxel rendering.

    I wrote a routine that was the fastest I could manage to encode/decode them, I wonder how it compares nowadays to this implementation. This was back when Ryzen CPUs still implemented PDEP in microcode, so lookup table and bit twiddling was much faster than PDEP.

  • skavi 12 days ago
    SVE2 includes BDEP and BEXT which should let you do this absurdly quickly.
  • limbicsystem 13 days ago
    Weird use case: sending EEG trigger codes through a Dpixx interface to an old ANT amplifier requires the codes to be Morton numbers. It took us a loooong time to work that out :)
  • zerr 13 days ago
    In C.
    • cmovq 13 days ago
      C/C++ is valid if it uses a subset of C that is also valid C++ and would compile with a C++ compiler.