The radix 2^51 trick (2017)

(chosenplaintext.ca)

518 points | by mooreds 1399 days ago

21 comments

  • Strilanc 1399 days ago
    In hardware this would be called a carry save adder [1].

    One interesting fact about carry save adders is that the carry part of the register can be extended in order to avoid any carrying between words for very long periods of time. Instead of using 53 bits to represent 52 bits, use 60 bits, and now you can perform 256 sequential additions with no carrying between words before the carry part saturates and needs to be handled.
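
    A rough software analogue of that headroom math, in C (a sketch with widths of my choosing, not a hardware design):

        #include <stddef.h>
        #include <stdint.h>

        /* DIGIT_BITS-bit digits stored in 64-bit words leave PAD_BITS of
         * headroom, so one word can absorb 2^PAD_BITS plain additions
         * before any carrying between words has to happen. */
        #define DIGIT_BITS 52
        #define PAD_BITS (64 - DIGIT_BITS)

        uint64_t accumulate(const uint64_t *vals, size_t n) {
            uint64_t acc = 0;
            if (n > (1ULL << PAD_BITS))
                n = 1ULL << PAD_BITS;   /* past this the padding saturates */
            for (size_t i = 0; i < n; i++)
                acc += vals[i];         /* plain adds, carries deferred */
            return acc;                 /* caller propagates acc >> DIGIT_BITS later */
        }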

    Somewhat surprisingly, it's even possible to use this construction in the context of a quantum computation [2][3]. The reason it's surprising is that tacking on additional registers like this can act as an information leakage mechanism (which is an error in a quantum computation). By analogy: if you computed a public key using this trick, you would not be comfortable handing an attacker the registers storing your public key before you removed the carry padding that made the computation faster. But it turns out that if you initialize the carry padding randomly, subtracting it out of the next word at the start so that everything sums to the correct result at the end, then the information leakage and the chance of a bad overflow are both exponentially suppressed in the length of the carry padding. Currently the most efficient known way to implement Shor's algorithm uses this technique [4].

    1: https://en.wikipedia.org/wiki/Carry-save_adder

    2: https://arxiv.org/abs/1905.08488

    3: https://youtu.be/upTipX9yXNg?t=777

    4: https://arxiv.org/abs/1905.09749

  • tspiteri 1399 days ago
    > Aside: Why 13 bits instead of 12? For our purposes, we’re going to ignore the carries in the most significant limb, allowing numbers to wrap when they overflow past 2^256 - 1 (just like how unsigned addition works in C with normal size integer types). As a result, we can assign 52 bits to the most significant limb and ignore the fact that it will run out of room for carries before the other limbs do.

    If you're going to wrap, you could go all the way and assign 64 bits to the most significant limb; that way you save 12 bits which you can spread to the other four limbs. You can go from {52, 51, 51, 51, 51} to {64, 48, 48, 48, 48}, so you have a spare 16 bits instead of 13 bits.

    • beagle3 1399 days ago
      Indeed, if you use only 48 bits, you could also parallelize using the FP hardware: the mantissa is 52 bits, so with 48-bit limbs you get 16 rounds before carrying. That's far less headroom than 16 (or even 13) spare bits gives you, but on processors that have distinct FP and integer adders and can issue them in parallel, you might get a speed boost.
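
      A sketch of that in C, assuming 48-bit limbs (six limbs to cover 256 bits; the layout and names are mine):

          /* Sum up to 16 numbers held as 48-bit limbs in doubles. Each limb
           * is < 2^48, so 16 of them stay below 2^52, and every intermediate
           * result is an exactly representable integer: no rounding, and no
           * carry handling until the end. */
          void fp_limb_sum(double acc[6], const double (*xs)[6], int n /* <= 16 */) {
              for (int i = 0; i < n; i++)
                  for (int j = 0; j < 6; j++)
                      acc[j] += xs[i][j];   /* exact while sums stay under 2^53 */
          }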
    • gene91 1399 days ago
      Exactly. Can anyone offer an explanation of why they didn't take this path for 256-bit numbers instead?
      • dgoldstein0 1399 days ago
        Taking a guess: if you use this form to do a bunch of operations before normalizing, you don't need to worry as much about intermediate overflow. E.g. you could easily add 10 different 256 bit numbers in this form and normalize at the end. Whereas if the leading chunk is 64 bits, every addition can overflow.

        So {52, 51, 51, 51, 51} seems like the more generally useful choice.

        • romwell 1399 days ago
          To expand on that: the whole idea is that normalization is not applied after every step.

          This approach is efficient when you know you have to add a bunch of numbers together. You add them up, and then normalize.

          Examples would be doing a long multiplication, or adding up all the numbers in a list.

          (Just repeating what was said in the parent comment for emphasis)

        • bonzini 1399 days ago
          Why would you ever worry about overflow of the leading chunk (aka limb)?
          • dgoldstein0 1399 days ago
            Because you want to know when your result is inaccurate?
            • bonzini 1399 days ago
              But you're operating modulo 2^256.
      • kroeckx 1399 days ago
        Because in crypto, the math usually isn't mod 2^256. For instance, in curve25519 you do math mod 2^255 - 19, so the limbs actually all fit in 51 bits.
        • tspiteri 1398 days ago
          Ah, in that case it's actually five 51-bit limbs making up the required 255 bits. So the article probably has a slight mistake here: it assumes that 52 bits of the highest limb are used, which led to an inaccurate explanation of why that's fine.
      • ChrisLomont 1399 days ago
        "Unfortunately, we can’t do that here – a 64-bit integer only has so many possible values" and "The remaining 12 or 13 bits give us the extra “digits” we need for preventing carries."

        Using 64-bit registers, but only 51 bits of each, there is no overflow. Doing full 64-bit adds in 64-bit registers overflows, making the carry updates more costly.

  • brandmeyer 1399 days ago
    The design of Poly1305 was directly informed by this very technique, with the addition (heh) of performing the arithmetic on the x87 FPU operating in 80-bit mode. It even has some zero padding bits placed strategically to aid the technique, hence the security bound of 2^106 instead of 2^128 or 2^130.

    See also https://cr.yp.to/mac/poly1305-20050113.pdf

  • steerablesafe 1399 days ago
    Redundant number systems are a really great topic.

    https://en.wikipedia.org/wiki/Redundant_binary_representatio...

    https://en.wikipedia.org/wiki/Carry-save_adder

    http://lux.dmcs.pl/csII/ca2_RedundantNS.pdf

    I have an arbitrary precision arithmetic library in my mind that represents numbers as streams of digits starting from the most significant digit. A redundant numeric representation could avoid unbounded backtracking (think flipping between 0.9999... and 1.0000... in a regular representation) and could guarantee forward progress for the generated digits.

    • jjoonathan 1399 days ago
      I love that the same trick works for human brains as well as CPUs. Your brain already pays the price to represent numbers as high level abstract concepts rather than tightly packed fields. Just let the digits overflow, then clean them up later.

          14*14
          140 + 14*4
          140 + 4(16)  <-- 4 tens + 16 ones
          18(16)
          196
      • HeWhoLurksLate 1399 days ago
        I had the same thought and broke them up a little differently:

           14*14
           7*7*2*2 [b/c 14 = 2*7]
           49*2*2
           98*2
           196
        
        For whatever reason, my brain likes using twos (and powers of it), so I play to that strong spot wherever possible.

        EDIT: as noted below, I made a mistake (7·7 = 49, not 47) and carried it halfway through my solution. Oops.

        • jjoonathan 1399 days ago
          Yeah, for sure, if the numbers break easily into small factors, that's the way to go. Redundant numbers still work even when the factors are few, large, or difficult to calculate, however:

              17 * 17
              170 + 17*7
              170 + 07(49)
              1(14)(49)
              24(49)
              289
        • iso947 1399 days ago
          7x7 isn’t 47, and 94x2 isn’t 196
          • dr_zoidberg 1399 days ago
            Despite that, the breakup does hold up the correct calculation:

                14 * 14
                (7 * 2) * (7 * 2)
                7 * 7 * 2 * 2
                49 * 2 * 2
                98 * 2
                196
            
            And on my side, 49 * 2 quickly resolves to "a hundred minus 2" (because 49 * 2 = (50 - 1) * 2), and pretty much the same for 196 (200 - 4).
          • HeWhoLurksLate 1399 days ago
            Woops, fixed it.

            Thanks for noticing!

    • jchook 1399 days ago
      For my undergrad thesis I studied “Fibinary”, or Fibonacci coding. You can think of it like “base phi”, using Fibonacci numbers as the place values, and only 0,1.

      The “normalized” representation, called Zeckendorf representation, perfectly packs error correction into the binary, as it ensures no number will ever have consecutive ones.

      It also has applications in data compression.

      https://en.m.wikipedia.org/wiki/Fibonacci_coding

  • majke 1399 days ago
    Funnily enough this is the code I was reading today

    https://github.com/constructor-igor/cudafy/blob/11cdd4def4e7...

        s = a ^ b;          // sum bits
        r = a & 0x7f7f7f7f; // clear msbs
        t = b & 0x7f7f7f7f; // clear msbs
        s = s & 0x80808080; // msb sum bits
        r = r + t;          // add without msbs, record carry-out in msbs
        r = r ^ s;          // sum of msb sum and carry-in bits, w/o carry-out
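
    For context, that's a SWAR (SIMD within a register) byte-wise add: it sums four packed bytes per 32-bit word while throwing away any carry that would cross a byte boundary. A self-contained C paraphrase:

        #include <stdint.h>

        /* Add four packed bytes at once; carries never cross byte lanes. */
        uint32_t vadd4(uint32_t a, uint32_t b) {
            uint32_t s = (a ^ b) & 0x80808080u;   /* sum of the msbs alone */
            uint32_t r = (a & 0x7f7f7f7fu)
                       + (b & 0x7f7f7f7fu);       /* add the low 7 bits of each lane */
            return r ^ s;   /* fold the msb sums back in, dropping inter-lane carries */
        }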
  • posix_me_less 1399 days ago
    Very interesting. It may seem that the bad performance of the ADC instruction is due to bad design of the ALU, but I think this is more an example of how replacing a single sequential task with several parallel tasks, whose results are combined at the end, lets the CPU finish faster.
    • cwzwarich 1399 days ago
      He’s testing on Haswell, where (at least according to Agner Fog) ADC has an extra uop compared to ADD. On Broadwell and later, non-memory ADC is a single uop with a single cycle of latency. He would probably see a big speedup on a newer CPU.
    • csense 1399 days ago
      It's a data flow problem. Four ADD instructions can be parallelized. Four ADC instructions have to run serially.
      • a1369209993 1399 days ago
        And in particular, sixteen ADC instructions in four chains shouldn't run any slower than sixteen ADD instructions (assuming four execution ports and a large enough reorder buffer), but a: Haswell sucks, and b: even on properly-designed CPUs, making parallelization easier is almost always some kind of win, because general-purpose CPUs always have dispatch overhead.
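
        Sketched with the add-with-carry intrinsic (_addcarry_u64 from <immintrin.h>), the four-chain version looks like:

            #include <immintrin.h>

            /* Four independent 256-bit additions. Each chain is serial, but
             * the chains share no flags or registers, so an out-of-order
             * core can overlap all sixteen adc-class operations. */
            void add256x4(unsigned long long a[4][4],
                          const unsigned long long b[4][4]) {
                for (int n = 0; n < 4; n++) {
                    unsigned char c = 0;
                    for (int i = 0; i < 4; i++)
                        c = _addcarry_u64(c, a[n][i], b[n][i], &a[n][i]);
                }
            }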
  • CliffStoll 1399 days ago
    Carries were a major problem in mechanical calculators as well. For example, the upper section of the Curta had a special bell & gearing to allow carries. The Friden STW - a mechanical behemoth - had a complex set of gears within the moving carriage to allow carries & borrows. Same with Monroe & Marchant mechanical calculators. For electric motor driven calculators, the additional weight of the moving carriage - as well as the gearing - set limits on how fast the machine could add and subtract.
  • ramshorns 1399 days ago
    It's amazing that the 17 instructions at the end to tidy up all the carries, which look like they have to be done serially, are still faster. But I guess each register's carry bits are independent from the ones that get carries added from the last register, so it could be ...

      mov W, E        ; copy each limb (E..B) aside...
      mov V, D
      mov U, C
      mov T, B
      shr W, 51       ; ...and shift out its accumulated carry bits
      shr V, 51
      shr U, 51
      shr T, 51
      add D, W        ; then add each limb's carries into the next limb up
      add C, V
      add B, U
      add A, T
    
    which does seem like it could be parallelized.
    • Suncho 1399 days ago
      > each register's carry bits are independent from the ones that get carries added from the last register

      No they're not. Let's say B's non-carry bits are all 1. If you carry anything from C, that will affect B's carry bits.

      • londons_explore 1399 days ago
        You could check whether the middle 64-12-12 bits are all 1's, and if they are, branch to a slow codepath.

        It would be an incredibly rare case, so branch prediction will always get it right, and the cost of the branch is nearly nil.

        • a1369209993 1399 days ago
          If you're trying to add two 256-bit numbers, there's probably cryptography involved, and data-dependent branches are poison to crypto code.
    • gpvos 1399 days ago
      Well, it's only faster if you do three or more additions before that. But otherwise yes.
  • bobbiechen 1399 days ago
    Cool trick. I wonder how it would compare in performance to a carry-lookahead adder [1], which uses the fact that you can compute any carry digit directly from a combination of the original inputs (which is increasingly long the deeper you go).

    [1] https://en.wikipedia.org/wiki/Carry-lookahead_adder

    • jonsen 1399 days ago
      To compare you would have to implement carry-lookahead in software. How would you do that?
      • bobbiechen 1399 days ago
        Yikes, good question. I thought about it for a bit:

        Here's a toy example adding three-bit numbers, where all letters are bits (0 or 1) and you can normally only add one bit at a time.

              a b c
            + d e f
            -------
            g h i j
        
        Instead of doing

            j = c ^ f
            j_carry = c & f
            i = j_carry ^ (b ^ e)
        
        in two steps, we can condense it to

            j = c ^ f
            i = (c & f) ^ (b ^ e)
        
        And similarly, we can get g and h combinationally from the input:

            #                   "j_carry AND one of b, e"
            i_carry = (b & e) | ( (c & f) & (b | e) )
        
            h = i_carry  ^ (a ^ d)
            h = ( (b & e) | ((c & f) & (b | e)) ) ^ (a ^ d)
        
            g = (a & d) | (i_carry & (a | d))
            g = (a & d) | (((b & e) | ((c & f) & (b | e))) & (a | d))
        
        So we end up directly computing

            g = (a & d) | (((b & e) | ((c & f) & (b | e))) & (a | d))
            h = ( (b & e) | ((c & f) & (b | e)) ) ^ (a ^ d)
            i = (c & f) ^ (b ^ e)
            j = c ^ f
        
        All of these innermost combinations like (a & d) you can get just by AND / OR of the original inputs.

        But as you suggest (I think), you then need to combine these further to actually get results. In theory they're all parallelizable (if you picture this as a tree), but I don't see a good way to do that quickly. Though I'm no assembly expert, so maybe I'm missing something.

        I guess thinking in terms of gate delays doesn't really help much at this abstraction level.
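
        For what it's worth, the parallel-prefix ("Kogge-Stone") form of carry-lookahead does translate to word-level operations. A sketch in C (five combine steps for 32 bits, though each step is several instructions, so I doubt it beats the hardware adder):

            #include <stdint.h>

            /* Carry-lookahead in software: parallel prefix over the
             * generate (g) and propagate (p) bits of every position. */
            uint32_t cla_add(uint32_t a, uint32_t b) {
                uint32_t g = a & b;   /* positions that generate a carry */
                uint32_t p = a ^ b;   /* positions that propagate one */
                for (int sh = 1; sh < 32; sh <<= 1) {
                    g |= p & (g << sh);   /* absorb carries generated below */
                    p &= p << sh;         /* merge propagate chains */
                }
                return (a ^ b) ^ (g << 1);   /* sum = propagate ^ carry-in */
            }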

      • londons_explore 1399 days ago
        It can be done by AND-ing the input numbers.

        Not sure you'd gain any speed boost though.

  • munificent 1399 days ago
    Ah, this post is so good. This is why I come to HN.
  • zodiac 1399 days ago
    This reminds me of "zeroless number systems" or "bijective number systems" (https://en.wikipedia.org/wiki/Bijective_numeration#The_bijec...) - in the language of the article, the allowed digits in base 10 are 123456789A. Note that 0 is not an allowed digit.

    For instance here are the first few natural numbers in this system: 1,2,3,4,5,6,7,8,9,A,11,12...

    I first read about them in Chris Okasaki's purely functional data structures book, they have some applications in terms of designing data structures (e.g. combining perfect-powers-of-two sized heaps to build heaps of arbitrary size).

    They're conceptually interesting too, being a positional number system with no redundancy (exactly one way to represent each number) and without 0. Since the concept of a 0 digit is not required, one could argue that it would be conceptually easier for, say, a classical Roman mathematician to learn this than the positional system we use today.

    Also, excel's system for numbering columns uses this - A, B, C, ... X, Y, Z, AA, AB ...
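
    Excel's scheme is exactly bijective base 26. A quick C sketch of the conversion (the n-1 inside the loop is what makes a zero digit unnecessary):

        #include <stdio.h>

        /* Format n >= 1 as an Excel-style column label (bijective base 26). */
        void excel_col(unsigned n, char out[8]) {
            char tmp[8];
            int i = 0, j = 0;
            while (n > 0) {
                n--;                        /* map digit range 1..26 down to 0..25 */
                tmp[i++] = 'A' + (n % 26);
                n /= 26;
            }
            while (i > 0)
                out[j++] = tmp[--i];        /* digits emerged least significant first */
            out[j] = '\0';
        }

        int main(void) {
            unsigned tests[] = {1, 26, 27, 52, 703};
            for (int k = 0; k < 5; k++) {
                char buf[8];
                excel_col(tests[k], buf);
                printf("%u -> %s\n", tests[k], buf);   /* A, Z, AA, AZ, AAA */
            }
            return 0;
        }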

  • pachico 1399 days ago
    I really enjoyed it. I might never use what I learned by reading it but it was so well explained. Bravo!
  • glangdale 1399 days ago
    A question here is whether this trick (which I acknowledge is very interesting) is worth doing on Broadwell. Broadwell has a pair of new instructions (ADCX and ADOX) that don't behave like a normal ADC - they only read and set a single bit (carry bit and overflow bit). As such you can do two of these at once and at least on recent Intel processors there are two execution ports (0 and 6) that can handle these instructions.

    You'd still need some cleverness if your goal is to add just one giant bignum (as opposed to doing lots of independent add-with-carry sums).

  • peter_d_sherman 1399 days ago
    >"The key insight here is that we can use this technique to delay carry propagation until the end. We can’t avoid carry propagation altogether, but we can avoid it temporarily. If we save up the carries that occur during the intermediate additions, we can propagate them all in one go at the end."

    Brilliant and tremendously interesting!

    Useful if/when I write a fast super-long integer math library in the future (yes, I know GNU already has one, but it's always interesting to know/understand tricks like this for >64-bit, AKA "too-long-to-fit-in-a-single-register" integers...)
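
    For reference, my reading of the trick in C (a sketch, not the article's code): five 51-bit limbs in 64-bit words, plain adds, then one propagation pass at the end:

        #include <stdint.h>

        #define MASK51 ((1ULL << 51) - 1)

        /* 256-bit values as five limbs, least significant first:
         * 51+51+51+51+52 bits. Safe for roughly 2^13 accumulations. */
        void add_limbs(uint64_t acc[5], const uint64_t x[5]) {
            for (int i = 0; i < 5; i++)
                acc[i] += x[i];              /* five independent adds, no carry chain */
        }

        void propagate(uint64_t acc[5]) {    /* the one-time cleanup pass */
            for (int i = 0; i < 4; i++) {
                acc[i + 1] += acc[i] >> 51;  /* hand the saved-up carries upward */
                acc[i] &= MASK51;
            }
            acc[4] &= (1ULL << 52) - 1;      /* wrap mod 2^256, per the article's aside */
        }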

  • Slartie 1399 days ago
    What I'm wondering about now is: does that 'adc' opcode have any other useful application except for adding numbers bigger than the width of a single register?

    Because it does look a little bit as if it was engineered for that particular situation. And if that's the case, now that the situation is in practice solved much more efficiently without a single 'adc' instruction, the opcode might suddenly be completely useless bloat, which of course still has to be maintained forever, even though nobody uses it in practice, because it's part of the spec.

    • FartyMcFarter 1399 days ago
      It's not useless bloat though. It's probably hard to beat in terms of code size, which is often even relevant for performance (due to cache size considerations).

      Furthermore, the article is talking about adding many numbers, which is just one of the possible use cases. As far as I can tell, adc is still useful for adding a pair of 128-bit numbers on a 64-bit CPU for example.
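
      For instance (a sketch using the GCC/Clang __int128 extension; on x86-64 this typically compiles straight to an add/adc pair):

          typedef unsigned __int128 u128;

          /* 128-bit addition on a 64-bit machine: one add plus one adc. */
          u128 add128(u128 a, u128 b) {
              return a + b;
          }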

    • mrfredward 1399 days ago
      I believe adc is still commonly used as a way of reading the carry bit. For example, assuming eax has already been zeroed, adc eax, eax would be the idiomatic way of reading the carry bit into eax.

      The carry flag gets set by all sorts of instructions other than addition and subtraction (comparisons, bit shifts, multiplication, etc) so this is more useful than you might think.

      • giovannibajo1 1399 days ago
        Actually, setc is the idiomatic way, and doesn’t depend on the previous value of the destination register so it’s much faster to execute because it has no dependencies
        • zwegner 1399 days ago
          No, setcc actually does depend on the previous value of the register, because it only comes in the low-byte variants (i.e. al, bl, cl, dl etc)--the top 7 bytes of the destination are left unmodified.

          A one-instruction variant I've seen that gets you the negated value of the carry flag is sbb rax, rax. This doesn't depend on the previous value of rax in a mathematical sense, though I'm unsure if it depends in an out-of-order sense; that is, if it's recognized as a zeroing idiom (or rather, zeroing or all-one-ing in this case). Probably not.

          EDIT: did a quick test, and sbb rax, rax etc are not recognized to be a zeroing idiom, at least on my old Haswell. So it's still one instruction, but has a dependency on the previous register value.

      • kccqzy 1399 days ago
        The carry flag is also the canonical way to signal "something has gone wrong" in low-level code. Example: Intel virtual machine extension instructions (like vmwrite, vmxon) all set the carry flag when they fail. But no, you'd usually use jc or setc to read the flag.
    • 0x0 1399 days ago
      x86 has a lot of "useless bloat" with legacy high-level instructions. Modern compilers avoid those and newer cpus aren't engineered with performance in mind for those instructions since nobody sane uses them anymore.
    • waynecochran 1399 days ago
      Hopefully this will all be replaced with a quantum computer that does addition with ripple carry or uses QFT... :)

      https://link.springer.com/article/10.1007/s11432-015-5411-x

      https://arxiv.org/pdf/1411.5949.pdf

    • andrewla 1399 days ago
      That might be another reason why it is slow -- it is an old opcode that has to be supported for compatibility, but isn't prioritized in any of the pipelines in newer chip designs.
    • fegu 1399 days ago
      This is the reason some of the esoteric x86 instructions actually perform slower on newer hardware. They are just there for backwards compatibility.
    • londons_explore 1399 days ago
      ADC is still the quickest way if you need to add two 256 bit numbers.
  • eumenides1 1399 days ago
    This is really cool, but gives me PTSD from my compliers course.
    • jonsen 1399 days ago
      Were you forced to comply against your will?
      • chowyuncat 1399 days ago
        ED-209 was a killer class.
  • barbegal 1399 days ago
    What sort of real world use case adds large numbers like this?
    • commandlinefan 1399 days ago
      public-key crypto.
      • cirno 1399 days ago
        And djb's implementations of Ed25519, Curve25519 use this exact radix 2^51 trick.
        • mratsim 1399 days ago
          Actually on 32-bit it uses a radix of 2^25.5: some limbs are 26 bits and others are 25.
          • cirno 1398 days ago
            I was meaning on 64-bit, but it's good to clarify that, thank you.
    • Y_Y 1399 days ago
      hn upvotes
      • saagarjha 1399 days ago
        Hacker News is a bit too small for that
  • smithza 1399 days ago
    The post did not mention the potential cost of splitting and masking these into 5 limbs instead of 4. Perhaps I am missing that in this proposed system, 256-bit ints are always stored in 5 registers, and you only normalize when the result is needed? What are the costs of splitting/masking into 5 limbs, and could that overhead meaningfully eat into the gains of the parallel adds/subtracts?
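
    For reference, the split itself is only a handful of shifts, ors, and masks. A sketch using the article's {52, 51, 51, 51, 51} layout:

        #include <stdint.h>

        #define MASK51 ((1ULL << 51) - 1)

        /* Split a 256-bit value (four 64-bit words, least significant
         * first) into five limbs: 51+51+51+51+52 bits. */
        void to_limbs(const uint64_t x[4], uint64_t l[5]) {
            l[0] = x[0] & MASK51;
            l[1] = ((x[0] >> 51) | (x[1] << 13)) & MASK51;
            l[2] = ((x[1] >> 38) | (x[2] << 26)) & MASK51;
            l[3] = ((x[2] >> 25) | (x[3] << 39)) & MASK51;
            l[4] = x[3] >> 12;   /* the top limb keeps 52 bits */
        }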
  • dreamcompiler 1399 days ago
    TLDR: Denormalized polynomial addition followed by normalization is faster than normalized polynomial addition on Haswell. Nice trick!
  • gautamcgoel 1399 days ago
    What I want to know is what, if any, are the implications for CPU architecture. How should bignum arithmetic be efficiently implemented in hardware?
    • not2b 1399 days ago
      Since good bignum libraries already use tricks like this, and many others besides (like using FFT to handle very large bignum multiplication), it doesn't seem that the hardware needs to change.
    • kwillets 1399 days ago
      There's a whole subfield of adder design focused on carrying, since it's a big dependency chain even within one register. I believe this is a software version of what some adder circuits already do (although I'll need some reading to check).
  • asciimov 1399 days ago
    edit: Ignore this, I'm tired and dumb.

    You aren't gaining performance by reducing carries, you are gaining performance by running these additions in parallel.

    The ALU is doing the addition in binary. Even though you add padding to the beginning of the number chunks it's still a binary number. When you recombine the chunks of numbers you will be doing all those carries you feel like you got for free.

    • tspiteri 1399 days ago
      But the carry propagation blocks parallelism. So by reducing the need to carry, you are getting the parallel gains.
      • asciimov 1399 days ago
        But you are doing the same number of carries. OP is making it sound like you are doing fewer carries.
        • phkahler 1399 days ago
          It depends how many numbers you add. If you compute A+B it will be slower, because the carries still have to be handled after the addition. But if you compute A+B+C+D+E you do four fast additions and only deal with carry bits once. TFA allows up to about 8000 numbers to be summed for a single pass of carry propagation.
          • asciimov 1399 days ago
            Right, I understand we are parallelizing the addition operations and the performance gain is due to adding chunks of the numbers in parallel.

            What I am saying is the total number of carry operations is conserved, even though they are delayed.

            • MauranKilom 1399 days ago
              I can't tell from your comments whether you realized yet, but the trick is only useful if you add several 256 bit numbers. You can add up to 2^13 numbers (each one 256 bits) before you actually have to do the four carry shift-outs once. In effect, you do reduce the number of carry operations (to four, albeit slightly more involved), and that is what ends up saving time.

              There is no performance gain if you only add two numbers, even though that's the setup the article discusses for 90% of its length. It seems that this is leading to some confusion.

              • asciimov 1399 days ago
                Yeah, I re-read the whole thing again, and I see the mistake I made. I took a wrong turn when op went into base 37 numbers, which made me consider every bit carry not just the carry on word boundary.
        • tspiteri 1399 days ago
          You're grouping carries. If you have 13 spare bits, you can accumulate 2^13 additions before you have to take care of carries. Then you have to add some overhead for bit manipulation, but the article's point is that the parallelism gains more than make up for the bit manipulation.
          • asciimov 1399 days ago
            Right, I am not saying that this isn't faster; I am saying you aren't reducing the total number of carries.
            • tspiteri 1399 days ago
              You are reducing the number of carries. The article explains it clearly. If you're not convinced by the article, I won't try to convince you. (Edit: the last thing I'll add: you are reducing the number of carries for 64-bit additions. When adding 2^13 51-bit numbers, that does not count as carries instruction-wise, as it's still add, not adc.)
              • asciimov 1399 days ago
                I see what I was doing wrong: I was considering the total bit carries during the addition (i.e. 1+1 = 10 is a one-bit carry), not the carry operation done after overflowing the word boundary.

                Next time, I'll keep from reading tech articles while running on too little sleep.

                Thanks for your understanding.

                • tspiteri 1399 days ago
                  No worries, I kinda realized just after posting the comment, which explains why I added the edit. :)
    • brohee 1399 days ago
      He didn't talk about it, but energy usage likely goes up, so if you keep doing this long enough the frequency may have to be throttled down to keep the chip cool.
      • tspiteri 1399 days ago
        I don't think so. If a core is busy, it's going to be consuming almost the same amount of power whether it's using one ALU or four. So the best strategy is to finish your processing as fast as possible so that you can idle, where the significant power saving happens. Basically, if the processing finishes faster, you save energy.