In hardware this would be called a carry save adder [1].
One interesting fact about carry save adders is that the carry part of the register can be extended in order to avoid any carrying between words for very long periods of time. Instead of using 53 bits to represent 52 bits, use 60 bits, and now you can perform 256 sequential additions with no carrying between words before the carry part saturates and needs to be handled.
Somewhat surprisingly, it's even possible to use this construction in the context of a quantum computation [2][3]. The reason it's surprising is because tacking on additional registers like this can act as an information leakage mechanism (which is an error in a quantum computation). For analogy, you can imagine that if you computed a public key using this trick that you would not be comfortable handing an attacker the registers storing your public key before you removed the carry padding that made the computation faster. But it turns out that if you just initialize the carry padding randomly, subtracting out of the next word at the start to ensure everything sums to the correct result at the end, then you can show that the information leakage and chance-of-bad-overflow is exponentially suppressed in the length of the carry padding. Currently the most efficient known way to implement Shor's algorithm uses this technique [4].
> Aside: Why 13 bits instead of 12? For our purposes, we’re going to ignore the carries in the most significant limb, allowing numbers to wrap when they overflow past 2^256 - 1 (just like how unsigned addition works in C with normal size integer types). As a result, we can assign 52 bits to the most significant limb and ignore the fact that it will run out of room for carries before the other limbs do.
If you're going to wrap, you could go all the way and assign 64 bits to the most significant limb; that way you save 12 bits which you can spread to the other four limbs. You can go from {52, 51, 51, 51, 51} to {64, 48, 48, 48, 48}, so you have a spare 16 bits instead of 13 bits.
Indeed, if you use only 48 bits, you could also parallelize using the FP hardware - the mantissa is 52 bits, so if you use 48 bit limbs, you have 16 rounds before carry. Which is much less than 16 (or even 13) bits, but for processors which have distinct FP vs. integer adders, and that can issue them in parallel - you might get a speed boost.
Taking a guess: if you use this form to do a bunch of operations before normalizing, you don't need to worry as much about intermediate overflow. E.g. you could easily add 10 different 256 bit numbers in this form and normalize at the end. Whereas if the leading chunk is 64 bits, every addition can overflow.
So 52/51/51/51/51 seems like the more generally useful choice
Ah, then in that case it's actually five 51-bit limbs making up the required 255 bits. So probably the article has a slight mistake here assuming that 52 bits of the highest limb are used which led to an inaccurate reason when trying to explain why it's fine.
"Unfortunately, we can’t do that here – a 64-bit integer only has so many possible values" and "The remaining 12 or 13 bits give us the extra “digits” we need for preventing carries."
using 64 bit registers, but only using 51 bits, has no overflow. Using 64 bit adds in 64 bit registers overflows, making the carry updates more costly.
The design of Poly1305 was directly informed by this very technique, with the addition (heh) of performing the arithmetic on the x87 FPU operating in 80-bit mode. It even has some zero padding bits placed strategically to aid the technique, hence the security bound of 2^106 instead of 2^128 or 2^130.
I have an arbitrary precision arithmetic library in my mind that represents numbers as streams of digits starting from the most significant digit. A redundant numeric representation could avoid unbounded backtracking (think flipping between 0.9999... and 1.0000... in a regular representation) and could guarantee forward progress for the generated digits.
I love that the same trick works for human brains as well as CPUs. Your brain already pays the price to represent numbers as high level abstract concepts rather than tightly packed fields. Just let the digits overflow, then clean them up later.
Yeah, for sure, if the numbers break easily into small factors, that's the way to go. Redundant numbers still work even when the factors are few, large, or difficult to calculate, however:
For my undergrad thesis I studied “Fibinary”, or Fibonacci coding. You can think of it like “base phi”, using Fibonacci numbers as the place values, and only 0,1.
The “normalized” representation, called Zeckendorf representation, perfectly packs error correction into the binary, as it ensures no number will ever have consecutive ones.
s = a ^ b; // sum bits
r = a & 0x7f7f7f7f; // clear msbs
t = b & 0x7f7f7f7f; // clear msbs
s = s & 0x80808080; // msb sum bits
r = r + t; // add without msbs, record carry-out in msbs
r = r ^ s; // sum of msb sum and carry-in bits, w/o carry-out
Very interesting. It may seem that the bad performance of the ADC instruction is due to bad design of the ALU, but I think this is more of an example of how replacing (single sequential task) by (several parallel tasks whose results are combined at the end) allows the CPU to finish faster.
He’s testing on Haswell, where (at least according to Agner Fog) ADC has an extra uop compared to ADD. On Broadwell and later, non-memory ADC is a single uop with a single cycle of latency. He would probably see a big speedup on a newer CPU.
And in particular, sixteen ADC instructions in four chains shouldn't run any slower than sixteen ADD instructions (assuming four execution ports and a large enough reorder buffer), but a: Haswell sucks, and b: even on properly-designed CPUs, making parallelization easier is almost always some kind of win, because general-purpose CPUs always have dispatch overhead.
Carries were a major problem in mechanical calculators as well. For example, the upper section of the Curta had a special bell & gearing to allow carries. The Friden STW - a mechanical behemoth - had a complex set of gears within the moving carriage to allow carries & borrows. Same with Monroe & Marchant mechanical calculators. For electric motor driven calculators, the additional weight of the moving carriage - as well as the gearing - set limits on how fast the machine could add and subtract.
It's amazing that the 17 instructions at the end to tidy up all the carries, which looks like it has to be done serially, is still faster. But I guess each register's carry bits are independent from the ones that get carries added from the last register, so it could be ...
mov W, E
mov V, D
mov U, C
mov T, B
shr W, 51
shr V, 51
shr U, 51
shr T, 51
add D, W
add C, V
add B, U
add A, T
Cool trick. I wonder how it would compare in performance to a carry-lookahead adder [1], which uses the fact that you can compute any carry digit directly from a combination of the original inputs (which is increasingly long the deeper you go).
Yikes, good question. I thought about it for a bit:
Here's a toy example adding three-bit numbers, where all letters are bits (0 or 1) and you can normally only add one bit at a time.
a b c
+ d e f
-------
g h i j
Instead of doing
j = c ^ f
j_carry = c & f
i = j_carry ^ (b ^ e)
in two steps, we can condense it to
j = c ^ f
i = (c & f) ^ (b ^ e)
And similarly, we can get g and h combinationally from the input:
# "j_carry AND one of b, e"
i_carry = (b & e) | ( (c & f) & (b | e) )
h = i_carry ^ (a ^ d)
h = ( (b & e) | ((c & f) & (b | e)) ) ^ (a ^ d)
g = (a & d) | (i_carry & (a | d))
g = (a & d) | ((b & e) | ( (c & f) & (b | e)) & (a | d))
So we end up directly computing
g = (a & d) | ((b & e) | ( (c & f) & (b | e)) & (a | d))
h = ( (b & e) | ((c & f) & (b | e)) ) ^ (a ^ d)
i = (c & f) ^ (b ^ e)
j = c ^ f
All of these innermost combinations like (a & d) you can get just by AND / OR of the original inputs.
But as you suggest (I think), then you need to combine these further to actually get results. In theory they're all parallelizable (if you picture this as tree), but I don't see a good way to do that quickly. Though I'm no assembly expert so maybe I'm missing something.
I guess thinking in terms of gate delays doesn't really help much at this abstraction level.
This reminds me of "zeroless number systems" or "bijective number systems" (https://en.wikipedia.org/wiki/Bijective_numeration#The_bijec...) - in the language of the article, the allowed digits in base 10 are 123456789A. Note that 0 is not an allowed digit.
For instance here are the first few natural numbers in this system: 1,2,3,4,5,6,7,8,9,A,11,12...
I first read about them in Chris Okasaki's purely functional data structures book, they have some applications in terms of designing data structures (e.g. combining perfect-powers-of-two sized heaps to build heaps of arbitrary size).
They're conceptually interesting too, being a positional number system with no redundancy (exactly one way to represent each number) and without 0. Since the concept of a 0 digit is not required, one could argue that it's conceptually easier for e.g. a classical Roman mathematician to learn this than the positional system we do use today.
Also, excel's system for numbering columns uses this - A, B, C, ... X, Y, Z, AA, AB ...
A question here is whether this trick (which I acknowledge is very interesting) is worth doing on Broadwell. Broadwell has a pair of new instructions (ADCX and ADOX) that don't behave like a normal ADC - they only read and set a single bit (carry bit and overflow bit). As such you can do two of these at once and at least on recent Intel processors there are two execution ports (0 and 6) that can handle these instructions.
You'd still need some cleverness if your goal is to add just one giant bignum (as opposed to doing lots of independent add-with-carry sums).
>"The key insight here is that we can use this technique to delay carry propagation until the end. We can’t avoid carry propagation altogether, but we can avoid it temporarily. If we save up the carries that occur during the intermediate additions, we can propagate them all in one go at the end."
Brilliant and tremendously interesting!
Useful for if/when I write a fast super-long integer math library in the future (yes, I know GNU already has one, but it's always interesting to know/understand tricks like this for >64-bit, AKA "too-long-to-fit-in-a-single-register" integers...)
What I'm wondering about now is: does that 'adc' opcode have any other useful application except for adding numbers bigger than the width of a single register?
Because it does look a little bit as if it was engineered for that particular situation, and if that was the case, with this situation now in practice being solved much more efficiently without a single 'adc' instruction, the opcode might suddenly be completely useless bloat - which of course still had to be maintained forever, even though nobody uses it in practice, because it's part of the spec.
It's not useless bloat though. It's probably hard to beat in terms of code size, which is often even relevant for performance (due to cache size considerations).
Furthermore, the article is talking about adding many numbers, which is just one of the possible use cases. As far as I can tell, adc is still useful for adding a pair of 128-bit numbers on a 64-bit CPU for example.
I believe adc is still commonly used as a way of reading the carry bit. For example, assuming eax has already been zeroed, adc eax, eax would be the idiomatic way of reading the carry bit into eax.
The carry flag gets set by all sorts of instructions other than addition and subtraction (comparisons, bit shifts, multiplication, etc) so this is more useful than you might think.
Actually, setc is the idiomatic way, and doesn’t depend on the previous value of the destination register so it’s much faster to execute because it has no dependencies
No, setcc actually does depend on the previous value of the register, because it only comes in the low-byte variants (i.e. al, bl, cl, dl etc)--the top 7 bytes of the destination are left unmodified.
A one-instruction variant I've seen that gets you the negated value of the carry flag is sbb rax, rax. This doesn't depend on the previous value of rax in a mathematical sense, though I'm unsure if it depends in an out-of-order sense; that is, if it's recognized as a zeroing idiom (or rather, zeroing or all-one-ing in this case). Probably not.
EDIT: did a quick test, and sbb rax, rax etc are not recognized to be a zeroing idiom, at least on my old Haswell. So it's still one instruction, but has a dependency on the previous register value.
You're right, but only for pre-Haswell chips, where al etc. are renamed separately. From Haswell onwards, they aren't, and depend on the old value. So movzx eax, al gets you mathematical independence, but not out-of-order independence. There's a bunch more detail in this Stack Overflow question (referenced in the one you linked): https://stackoverflow.com/questions/45660139/how-exactly-do-...
The carry flag is also the canonical way to signal "something has gone wrong" in low-level code. Example: Intel virtual machine extension instructions (like vmwrite, vmxon) all set the carry flag when they fail. But no, you'd usually use jc or setc to read the flag.
x86 has a lot of "useless bloat" with legacy high-level instructions. Modern compilers avoid those and newer cpus aren't engineered with performance in mind for those instructions since nobody sane uses them anymore.
That might be another reason why it is slow -- it is an old opcode that has to be supported for compatibility, but isn't prioritized in any of the pipelines in newer chip designs.
The post did not mention potential cost of splitting and masking these into 5 limbs instead of 4. Perhaps I am missing that in this proposed system, 256-bit ints are always stored in 5 registers, then you only normalize when the result is needed? What are the costs of splitting/masking into 5-limbs and could the scaling impact the gains of parallel adds/subtracts meaningfully?
Since good bignum libraries already use tricks like this, and many others besides (like using FFT to handle very large bignum multiplication), it doesn't seem that the hardware needs to change.
There's a whole subfield of adder design focused on carrying, since it's a big dependency chain even within one register. I believe this is a software version of what some adder circuits already do (although I'll need some reading to check).
You aren't gaining performance by reducing carries, you are gaining performance by running these additions in parallel.
The ALU is doing the addition in binary. Even though you add padding to the beginning of the number chunks it's still a binary number. When you recombine the chunks of numbers you will be doing all those carries you feel like you got for free.
It depends how many numbers you add. If you compute A+B it will be slower because the Carrie's still have to be handled after the addition. But if you compute A+B+C+D+E you do 4 fast addition and only deal with carry bits once. TFA allows up to about 8000 numbers to be summed for a single pass of carry propagation.
I can't tell from your comments whether you realized yet, but the trick is only useful if you add several 256 bit numbers. You can add up to 2^13 numbers (each one 256 bits) before you actually have to do the four carry shift-outs once. In effect, you do reduce the number of carry operations (to four, albeit slightly more involved), and that is what ends up saving time.
There is no performance gain if you only add two numbers, even though that's the setup the article discusses for 90% of its length. It seems that this is leading to some confusion.
Yeah, I re-read the whole thing again, and I see the mistake I made. I took a wrong turn when op went into base 37 numbers, which made me consider every bit carry not just the carry on word boundary.
You're grouping carries. If you have 13 spare bits, you can accumulate 13 additions before you have to take care of carries. Then you have to add some overhead for bit manipulation, but the article's point is that the parallelism gains more than make up for the bit manipulation.
You are reducing the number of carries. The article explains it clearly. If you're not convinced by the article, I won't try to convince you. (Edit: the last thing I'll add: you are reducing the number of carries for 64-bit additions. When adding 13 51-bit numbers, that does not count as carries instruction-wise, as it's still add, not adc.)
I see what I was doing wrong, I was considering total bit carries during the addition, ie... 1+1= 10 and is a one bit carry, and not the carry operation done after overflowing the word boundary.
Next time, I'll keep from reading tech articles while running on too little sleep.
He didn't talk about it, but energy usage is likely going up, so maybe if you keep doing it long enough the frequency will have to be throttled down in order to cool down.
I don't think so. If a core is busy, it's going to be consuming almost the same amount of power whether it's using one ALU or four. So the best strategy is to finish your processing as fast as possible so that you can idle, where the significant power saving happens. Basically, if the processing finishes faster, you save energy.
One interesting fact about carry save adders is that the carry part of the register can be extended in order to avoid any carrying between words for very long periods of time. Instead of using 53 bits to represent 52 bits, use 60 bits, and now you can perform 256 sequential additions with no carrying between words before the carry part saturates and needs to be handled.
Somewhat surprisingly, it's even possible to use this construction in the context of a quantum computation [2][3]. The reason it's surprising is because tacking on additional registers like this can act as an information leakage mechanism (which is an error in a quantum computation). For analogy, you can imagine that if you computed a public key using this trick that you would not be comfortable handing an attacker the registers storing your public key before you removed the carry padding that made the computation faster. But it turns out that if you just initialize the carry padding randomly, subtracting out of the next word at the start to ensure everything sums to the correct result at the end, then you can show that the information leakage and chance-of-bad-overflow is exponentially suppressed in the length of the carry padding. Currently the most efficient known way to implement Shor's algorithm uses this technique [4].
1: https://en.wikipedia.org/wiki/Carry-save_adder
2: https://arxiv.org/abs/1905.08488
3: https://youtu.be/upTipX9yXNg?t=777
4: https://arxiv.org/abs/1905.09749
If you're going to wrap, you could go all the way and assign 64 bits to the most significant limb; that way you save 12 bits which you can spread to the other four limbs. You can go from {52, 51, 51, 51, 51} to {64, 48, 48, 48, 48}, so you have a spare 16 bits instead of 13 bits.
So 52/51/51/51/51 seems like the more generally useful choice
This approach is efficient when you know you have to add a bunch of numbers together. You add them up, and then normalize.
The example can be, e.g. doing long multiplication, or adding all numbers in a list.
(Just repeating what was said in the parent comment for emphasis)
using 64 bit registers, but only using 51 bits, has no overflow. Using 64 bit adds in 64 bit registers overflows, making the carry updates more costly.
See also https://cr.yp.to/mac/poly1305-20050113.pdf
https://en.wikipedia.org/wiki/Redundant_binary_representatio...
https://en.wikipedia.org/wiki/Carry-save_adder
http://lux.dmcs.pl/csII/ca2_RedundantNS.pdf
I have an arbitrary precision arithmetic library in my mind that represents numbers as streams of digits starting from the most significant digit. A redundant numeric representation could avoid unbounded backtracking (think flipping between 0.9999... and 1.0000... in a regular representation) and could guarantee forward progress for the generated digits.
EDIT: as noted below, I made a mistake (7·7=49,not 47,) and carried it halfway through my solution. Oops.
Thanks for noticing!
The “normalized” representation, called Zeckendorf representation, perfectly packs error correction into the binary, as it ensures no number will ever have consecutive ones.
It also has applications in data compression.
https://en.m.wikipedia.org/wiki/Fibonacci_coding
https://github.com/constructor-igor/cudafy/blob/11cdd4def4e7...
No they're not. Let's say B's non-carry bits are all 1. If you carry anything from C, that will affect B's carry bits.
It would be an incredibly rare case, so branch prediction will always get it right, and the cost of a branch nearly nill.
[1] https://en.wikipedia.org/wiki/Carry-lookahead_adder
Here's a toy example adding three-bit numbers, where all letters are bits (0 or 1) and you can normally only add one bit at a time.
Instead of doing in two steps, we can condense it to And similarly, we can get g and h combinationally from the input: So we end up directly computing All of these innermost combinations like (a & d) you can get just by AND / OR of the original inputs.But as you suggest (I think), then you need to combine these further to actually get results. In theory they're all parallelizable (if you picture this as tree), but I don't see a good way to do that quickly. Though I'm no assembly expert so maybe I'm missing something.
I guess thinking in terms of gate delays doesn't really help much at this abstraction level.
Not sure you'd gain any speed boost though.
For instance here are the first few natural numbers in this system: 1,2,3,4,5,6,7,8,9,A,11,12...
I first read about them in Chris Okasaki's purely functional data structures book, they have some applications in terms of designing data structures (e.g. combining perfect-powers-of-two sized heaps to build heaps of arbitrary size).
They're conceptually interesting too, being a positional number system with no redundancy (exactly one way to represent each number) and without 0. Since the concept of a 0 digit is not required, one could argue that it's conceptually easier for e.g. a classical Roman mathematician to learn this than the positional system we do use today.
Also, excel's system for numbering columns uses this - A, B, C, ... X, Y, Z, AA, AB ...
You'd still need some cleverness if your goal is to add just one giant bignum (as opposed to doing lots of independent add-with-carry sums).
Brilliant and tremendously interesting!
Useful for if/when I write a fast super-long integer math library in the future (yes, I know GNU already has one, but it's always interesting to know/understand tricks like this for >64-bit, AKA "too-long-to-fit-in-a-single-register" integers...)
Because it does look a little bit as if it was engineered for that particular situation, and if that was the case, with this situation now in practice being solved much more efficiently without a single 'adc' instruction, the opcode might suddenly be completely useless bloat - which of course still had to be maintained forever, even though nobody uses it in practice, because it's part of the spec.
Furthermore, the article is talking about adding many numbers, which is just one of the possible use cases. As far as I can tell, adc is still useful for adding a pair of 128-bit numbers on a 64-bit CPU for example.
The carry flag gets set by all sorts of instructions other than addition and subtraction (comparisons, bit shifts, multiplication, etc) so this is more useful than you might think.
A one-instruction variant I've seen that gets you the negated value of the carry flag is sbb rax, rax. This doesn't depend on the previous value of rax in a mathematical sense, though I'm unsure if it depends in an out-of-order sense; that is, if it's recognized as a zeroing idiom (or rather, zeroing or all-one-ing in this case). Probably not.
EDIT: did a quick test, and sbb rax, rax etc are not recognized to be a zeroing idiom, at least on my old Haswell. So it's still one instruction, but has a dependency on the previous register value.
You should be able to break the dependency and avoid the partial register stall by doing: movzx eax, ax
See: https://stackoverflow.com/questions/41573502/why-doesnt-gcc-... ; https://software.intel.com/en-us/forums/intel-isa-extensions... ; https://www.agner.org/optimize/microarchitecture.pdf (section 6.8)
https://link.springer.com/article/10.1007/s11432-015-5411-x https://arxiv.org/pdf/1411.5949.pdf
You aren't gaining performance by reducing carries, you are gaining performance by running these additions in parallel.
The ALU is doing the addition in binary. Even though you add padding to the beginning of the number chunks it's still a binary number. When you recombine the chunks of numbers you will be doing all those carries you feel like you got for free.
What I am saying is the total number of carry operations is conserved, even though they are delayed.
There is no performance gain if you only add two numbers, even though that's the setup the article discusses for 90% of its length. It seems that this is leading to some confusion.
Next time, I'll keep from reading tech articles while running on too little sleep.
Thanks for your understanding.