The x86 Advanced Matrix Extension

(fuse.wikichip.org)

78 points | by rbanffy 4 days ago

11 comments

  • PaulHoule 4 days ago

    I wonder what Charlie Demerjian is going to say about this.

    It is an awful lot of registers for a feature that few programs may use. There's the risk that bfloat16 is a fad and five years from now it is hardly used at all. At best it ends up as a full-stack perception-and-synthesis feature about as good as

    https://en.wikipedia.org/wiki/Intel_Quick_Sync_Video

    Most of all it drives me nuts that Intel is going down the fixed-length SIMD instruction route (going wider) where the instructions you write are specific to the structures of the processor you run on. Machines like the ILLIAC, Cray, and the vector processor for the 3090 mainframe would automatically chunk the work so that you didn't need to rewrite your code for a future wider model.

    ARM is doing it the right way:

    https://alastairreid.github.io/papers/sve-ieee-micro-2017.pd...

    • Veedrac 4 days ago

      It's a controversial opinion I've not heard anyone else express, but IMO variable-width vector instructions are the wrong approach, since they're optimizing for the easy cases and solving problems that don't matter much, like instruction counts.

      Although x86's SIMD extensions have a bunch of crippling issues, fixed-width instructions are fundamentally more general because they work both for vectorizable length-N loops and for the many, many cases where you have a fixed number of units of work that you can perform in parallel. If you design your architecture to depend on loops to extract full performance, you cut yourself off from many productive uses of vector instructions outside of loops, and make it more difficult to juggle cases where you have more than one loop dimension.

      Upgrading to new architectures with longer supported vector lengths is best left to recompilation and variable-width vector libraries.

      • ritter2a 4 days ago

        Wouldn't be the first of Intel's ISA extensions to be unsuccessful because of its limitations; look at MPX: https://intel-mpx.github.io/

        • zamalek 4 days ago

          > It is an awful lot of registers for a feature that few programs may use

          I was just wondering what you could do with 8KiB of register memory, if you ignored the intent of the registers outright (assuming you can load/store to the classical registers from these registers).

          • api 4 days ago

            Cache expanded cryptographic keys and tables entirely in registers for one...

            • fuoqi 3 days ago

              AFAIK SIMD accelerated cipher implementations already mostly do it. For example, AES-NI based implementations of AES-128 and AES-192 keep round keys in XMM registers without reloading them during processing of blocks.
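
              Roughly like this (a minimal sketch of the idea, assuming the round keys are already expanded; the compiler is what actually keeps rk[] in XMM registers across the block loop):

                #include <immintrin.h>   /* AES-NI: compile with -maes */

                /* Encrypt one 16-byte block of AES-128 with round keys already
                   expanded into rk[0..10]; all eleven fit in XMM registers. */
                static __m128i aes128_encrypt_block(__m128i block, const __m128i rk[11]) {
                    block = _mm_xor_si128(block, rk[0]);         /* initial whitening */
                    for (int i = 1; i < 10; i++)
                        block = _mm_aesenc_si128(block, rk[i]);  /* rounds 1..9 */
                    return _mm_aesenclast_si128(block, rk[10]);  /* final round */
                }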

          • rbanffy 4 days ago

            Intel reasons that it's so massive that any way it decides to go is The Right Way, and that people will accommodate anything it does. So far, this is mostly true - we have code that takes different branches depending on the ISA extensions available, probably all the way down to x87.

            • jeffbee 4 days ago

              I don't think that's how these features get developed. Someone who orders CPUs by the cubic meter comes to Intel and asks for a bfloat16 unit. Six years later, they ship one. I know for a fact this is how BMI came to exist.

              • rbanffy 4 days ago

                Indeed. We should never think that we are Intel's (or AMD's or IBM's) customers. They have a very short list - Dell, Lenovo, HP and, more recently, Google, AWS, and Facebook - and some of these are procuring directly from TSMC.

                • im3w1l 4 days ago

                  Commodity GPUs, and before that PlayStations, were used by researchers for their excellent parallel computing performance. So the influence goes both ways.

                  • jeffbee 4 days ago

                    Thinking it over a bit more, there definitely are examples that seem to have been driven speculatively by Intel, and their application in industry is not clear. Optane is the one that strikes me as a thing that Intel decided they _could_ do, and then did. The companies where I have worked that considered Optane did so after Intel shipped it, they didn't do research that concluded they needed it before it existed. Also, I can't recall an instance where the TCO looked good for Optane.

                    But as for architectural extensions, there clearly are customers asking for these things, even TSX, which has never worked.

          • api 4 days ago

            Intel is going to start thrashing around now, adding a million features to try to beat ARM and AMD on various microbenchmarks and special use cases. Meanwhile ARM and AMD will keep winning on throughput, price/performance, and (for ARM particularly) performance/watt.

            Any win all these features bring can also be achieved with ARM or Zen by adding more cores, with the exception of those few cases where you have huge computational tasks that cannot be efficiently parallelized and where there are too few discrete jobs to allow for coarse-grained (job/task) parallelization. There are not many of these.

            Meanwhile all these features are going to make Intel chips even more complex, making bugs more likely and making iteration more costly.

            My read is that Intel is shooting for maximum possible single threaded performance because they can't compete on power efficiency or many-core. They can't compete in those areas because their process nodes are not as small as what TSMC can offer, and both (most) ARM chips and AMD are using TSMC and fabricating at 7nm and soon 5nm. (Yes I know nm node sizes are no longer directly comparable, but they are ahead of Intel and likely to stay ahead unless Intel can really push hard on fab engineering.)

            • TinkersW 4 days ago

              It sounded interesting until I saw that the only float type it supports is Brain float :(

              • cesaref 4 days ago

                I'm kind of interested to see what bfloat16 sounds like (for audio DSP). I'd expect it to be good enough for a large number of algorithms so long as they are numerically stable, and if we have decent performance and reduced power use, I'm all for that!

                • klodolph 4 days ago

                  I’m sure you could come up with an application for it, but if you want audio output at some point, half-float sounds like quite a challenge.

                  - The -66dB noise floor is pretty bad, and it accumulates at every step.

                  - 11 bits is probably not enough for filter coefficients. So your filters would likely be running with single precision floats, at least. Even low-cost DSP chips tend to give you a big chunk of bits for your filters.

                  - Naïve oscillator designs would accumulate a lot of error. Back of the envelope calculation, if you wanted an oscillator at C4, you’d likely be around 1/4 tone sharp or flat unless you ran the oscillator at higher precision.
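
                  If anyone wants to check that estimate, a rough simulation is only a few lines (a sketch that models only the 11-bit significand rounding of half floats, not their exponent range):

                    #include <math.h>
                    #include <stdio.h>

                    /* Round x to 'bits' significant bits, approximating fp16 precision. */
                    static float quantize(float x, int bits) {
                        if (x == 0.0f) return 0.0f;
                        int e;
                        frexpf(x, &e);                     /* x = m * 2^e, 0.5 <= |m| < 1 */
                        float s = ldexpf(1.0f, bits - e);
                        return roundf(x * s) / s;
                    }

                    int main(void) {
                        const float fs = 44100.0f, f0 = 261.63f;   /* C4 */
                        float inc = quantize(f0 / fs, 11), phase = 0.0f;
                        long cycles = 0;
                        const long n = 10L * 44100L;               /* ten seconds of samples */
                        for (long i = 0; i < n; i++) {
                            phase = quantize(phase + inc, 11);     /* fp16-ish phase accumulator */
                            if (phase >= 1.0f) { phase = quantize(phase - 1.0f, 11); cycles++; }
                        }
                        double f = cycles * (double)fs / n;
                        printf("pitch: %.2f Hz (%+.1f cents off C4)\n", f, 1200.0 * log2(f / f0));
                        return 0;
                    }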

                  I’m definitely of the mind that bit depth is overrated in music. 16 bits is great for mastered music and simple tweaks. By contrast, from my experience writing DSP code, it often makes your code simpler and faster to run at higher depths and sample rates, and then convert to e.g. 16 bit as the very last step. The problem is that squeezing good output from low precision or low sample rates requires more complicated and slower algorithms.

                  • rbanffy 4 days ago

                    Probably not as crisp as 16-bit integers - the mantissa is 7 bits, so 8 bits with sign. Clever use of exponents may give some additional range, but I wouldn't be too optimistic.

                    The good thing is that, like any FP format, it can represent a very large range of values, so I'd expect percussion and other things with very high-frequency transients to sound nice. My bet is it'll sound "colorful". IEEE float16 should sound better.

                    But I'm no expert and I don't have the time to write something that forces a good high-dynamic-range track to be rounded to the nearest bfloat16 and then expanded back. Also, I don't have gear good enough to hear anything better than CD-grade audio.
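
                    (If anyone wants to run that experiment, the rounding itself is only a few lines - a sketch using round-to-nearest-even truncation to bfloat16, NaN handling omitted:)

                      #include <stddef.h>
                      #include <stdint.h>
                      #include <string.h>

                      /* Round a float sample to the nearest bfloat16 and expand it back. */
                      static float bf16_roundtrip(float x) {
                          uint32_t u;
                          memcpy(&u, &x, sizeof u);
                          u += 0x7FFFu + ((u >> 16) & 1u);  /* round to nearest, ties to even */
                          u &= 0xFFFF0000u;                 /* keep sign, 8-bit exponent, 7-bit mantissa */
                          float y;
                          memcpy(&y, &u, sizeof y);
                          return y;
                      }

                      /* Degrade a whole buffer of decoded samples in place. */
                      static void degrade_to_bf16(float *samples, size_t n) {
                          for (size_t i = 0; i < n; i++)
                              samples[i] = bf16_roundtrip(samples[i]);
                      }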

                • innocenat 4 days ago

                  Those tile registers are crazy. But I wonder how long it will actually take to become performant, considering the present problem with AVX512 switching.

                  • google234123 4 days ago

                    AVX512 is already performant.

                    • jfkebwjsbx 4 days ago

                      Only in a subset of cases, which is the problem: you cannot simply always use it like the previous extensions.

                      • klodolph 4 days ago

                        None of the previous extensions could be used blindly, either. It was a while before people figured out how to use MMX or SSE well, and people still often find that the scalar version of an algorithm beats their vector version.

                        • jfkebwjsbx 4 days ago

                          I am not talking about ease of use, but about the downclock.

                          The other extensions do not trigger it, not even AVX256.

                          With AVX512 it is not always a win, and you don't even know until you try on the particular hardware.

                          • jcranmer 4 days ago

                            The 256-bit vector instructions do trigger a downclock, but not as severe as the AVX512 downclock.

                            • jfkebwjsbx 3 days ago

                              You are 100% right, it is AVX1 I was thinking about (and nowadays I am not sure whether that has a downclock or not either).

                            • MaxBarraclough 4 days ago
                              • Const-me 4 days ago

                                I don’t think that applies to modern AMD processors, though.

                                Agner’s microarchitecture.pdf says about Ryzen “There is no penalty for mixing AVX and non-AVX vector instructions on this processor.”

                                Not sure if it applies to Zen 2, but I’ve been using one for a year for my work, AVX 1 & 2 included; I think I would have noticed.

                                • jcranmer 4 days ago

                                  AMD processors used to implement AVX instructions by double-pumping them, using only 128-bit vector ALUs. This means there's no clock penalty, but there's also no speedup over an SSE instruction by doing so. I don't know if this is still the case with the newest µarchs though.

                                  • Const-me 4 days ago

                                    > but there's also no speedup over an SSE instruction by doing so

                                    Just because they are split doesn’t mean they run sequentially. Zen 1 can handle up to 4 floating-point micro-ops/cycle, and there are 4 floating-point execution units, each 128 bits wide (that’s excluding load/store; these 4 EUs only compute).

                                    Native 256-bit is even faster due to fewer micro-ops and potentially more in-flight instructions, but I’m pretty sure even on Zen 1 AVX is faster than SSE.

                                    • innocenat 4 days ago

                                      It depends. If the EUs are actually the bottleneck, then doing SSE or AVX wouldn't make any difference in speed in that case.

                                      However, when instruction decode/retire is the bottleneck, AVX can be faster. I remember this could be the case on Intel Sandy Bridge (first-gen AVX, double pumped, retiring 3 instructions/cycle), where AVX could sometimes be faster (usually it's not that different).

                                      With recent CPUs from both Intel and AMD able to decode/retire at least 4 instructions per cycle, this has really ceased to be the case.

                                      • Const-me 4 days ago

                                        > AVX can be faster

                                        Yes. Another possible reason for that is instructions without SSE equivalents. I remember working on some software where AVX2 broadcast load instruction helped substantially.

                                • google234123 4 days ago

                                  Why link to a 5 year old thread? There has to be more recent work.

                                  • tarlinian 4 days ago

                                    There is more recent work. This blog post by Travis Downs is the most detailed analysis of transition behavior I've seen: https://travisdowns.github.io/blog/2020/01/17/avxfreq1.html

                                    For general guidelines on when to use AVX-512, this (older) post remains the best guide I've seen: https://lemire.me/blog/2018/09/07/avx-512-when-and-how-to-us...

                                    • google234123 4 days ago

                                      So, are programs that are compiled with those instructions faster or slower? In my experience they have been faster.

                                      • klodolph 4 days ago

                                        Short answer: Yes, faster. Long answer: It depends, and you may be measuring the wrong thing.

                                        Among other things, it depends on the workload and the exact processor. You can find plenty of cases where AVX512 makes things faster. You can also find cases where the entire system slows down because it is running sections of AVX512 code here and there. Apparently, for certain Intel processors, the processor needs to turn on the top 256 bits of the register files and interconnects, and to get full speed for AVX512 it will alter the processor’s voltage and clock speed. This reduces the speed of other instructions and even other cores on the same die (which may be surprising).

                                        While the specifics may be new, the generalities seem familiar—it has long been true that a well-intentioned improvement to a small section of your code base can improve performance locally while degrading overall system performance. The code that you’re working on occupies a smaller and smaller slice of your performance metrics, and meanwhile, the whole system is slowing down. There are so many reasons that this can happen, dynamic frequency scaling with AVX512 is just one more reason.

                                • klodolph 4 days ago

                                  Different causes, similar consequences.

                                  • innocenat 4 days ago

                                    Porting SSE to AVX code (with equivalent instructions and proper vzeroupper) will increase performance in most cases (the only case where it can be slower, off the top of my head, is on Sandy Bridge). The same is not true for AVX to AVX512.
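
                                    For reference, the mechanical part of such a port looks roughly like this (a sketch; n is assumed to be a multiple of the vector width, and recent compilers typically insert vzeroupper for you):

                                      #include <immintrin.h>
                                      #include <stddef.h>

                                      /* SSE: sum 4 floats per iteration. */
                                      float sum_sse(const float *a, size_t n) {
                                          __m128 acc = _mm_setzero_ps();
                                          for (size_t i = 0; i < n; i += 4)
                                              acc = _mm_add_ps(acc, _mm_loadu_ps(a + i));
                                          float out[4];
                                          _mm_storeu_ps(out, acc);
                                          return out[0] + out[1] + out[2] + out[3];
                                      }

                                      /* AVX: same loop, 8 floats per iteration, plus vzeroupper on exit. */
                                      float sum_avx(const float *a, size_t n) {
                                          __m256 acc = _mm256_setzero_ps();
                                          for (size_t i = 0; i < n; i += 8)
                                              acc = _mm256_add_ps(acc, _mm256_loadu_ps(a + i));
                                          float out[8];
                                          _mm256_storeu_ps(out, acc);
                                          _mm256_zeroupper();  /* avoid SSE/AVX transition penalties in callers */
                                          return out[0] + out[1] + out[2] + out[3]
                                               + out[4] + out[5] + out[6] + out[7];
                                      }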

                                    • Const-me 4 days ago

                                      It will increase performance if you have a sufficient amount of dense data on input.

                                      When that’s the case, especially if the numbers being crunched are 32-bit floats, there’s not much point in doing it on the CPU at all; GPGPUs are way more efficient for such tasks.

                                      However, imagine sparse matrix * dense vector multiplication. If you rarely have more than 4 consecutive non-zero elements in the rows of the input matrix, and large gaps between the non-zero elements, moving from SSE to AVX or AVX512 will decrease performance; you’ll just be wasting electricity multiplying by zeros.
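
                                      Roughly what I mean, as a sketch (a hypothetical run-based encoding, each run holding up to 4 consecutive non-zeros):

                                        #include <immintrin.h>
                                        #include <string.h>

                                        /* One run of consecutive non-zeros in a sparse row (hypothetical encoding). */
                                        struct run { int col; int len; /* 1..4 */ const float *vals; };

                                        /* Accumulate run->vals * x[col..col+len-1] with one 128-bit multiply.
                                           With 512-bit vectors, most of the 16 lanes would multiply zeros. */
                                        static __m128 accum_run(__m128 acc, const struct run *r, const float *x) {
                                            float a[4] = {0}, b[4] = {0};
                                            memcpy(a, r->vals, (size_t)r->len * sizeof(float));
                                            memcpy(b, x + r->col, (size_t)r->len * sizeof(float));
                                            return _mm_add_ps(acc, _mm_mul_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
                                        }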

                                      • tarlinian 4 days ago

                                        So in some sense very similar to SKX behavior? The first iteration of the instruction implementation requires judicious use of instructions, while later implementations won't (this is something to be upset about... those "later implementations" should have been available quite some time ago).

                                        This is also ignoring the fact that none of these penalties come into play if you use the AVX512 instructions with 256-bit or 128-bit vectors. (This still has significant benefits due to the much nicer set of shuffles, dedicated mask registers, etc.)

                                        • google234123 4 days ago

                                          AVX to AVX512 will "increase performance in most cases". https://www.researchgate.net/figure/Speedup-from-AVX-512-ove...

                            • unwind 4 days ago

                              Meta: I'm not a native speaker, but that lonely 'W' in the title really irks me. Suggested alternate title would be something like "Intel's Sapphire Rapid debuts x86 Advanced Matrix Extension" (59 chars).

                              • beojan 4 days ago

                                It needs a '/' following it (and should probably be a lowercase 'w'). 'w/' is quite a common abbreviation for 'with' though (with 'w/o' for 'without').

                                Presumably there's a character limit on HN titles, and '/' isn't allowed either?

                                • messe 4 days ago

                                  Or how about removing “the ” from the start of the sentence, and just writing “with”?

                                  • stefan_ 4 days ago

                                    Or we just replace the w/ with "in".

                                • waynesonfire 4 days ago

                                  Where are these extensions being used? I didn't look hard, but are there open source libraries / compilers that will take advantage of these?

                                  It's sort of amazing the amount of performance you can squeeze when you fix your OS and CPU architecture.

                                  There are so many extensions: https://software.intel.com/sites/landingpage/IntrinsicsGuide -- are we supposed to write our own libraries to leverage these, or do we need to file tickets with our favorite compilers for them to develop these optimizations?

                                  Oh, one more question: how do these overlap with AMD?

                                  • mratsim 4 days ago

                                    BLAS libraries, oneDNN, OpenCV, Eigen, Tensorflow, PyTorch, LLVM MLIR, ...

                                    AMD usually implements them but with a couple of years of delay.

                                    For example AVX512 is not implemented, and we had to wait for Ryzen 3 to have the same AVX capabilities as Intel (2 AVX units per core instead of one).

                                    • fomine3 3 days ago

                                      An Intel engineer sent a PR: https://github.com/herumi/xbyak/pull/95

                                    • sradman 4 days ago

                                      OK, this is a new Intel SIMD-like instruction set: AVX for vectors, now AMX for matrices. I guess this is an alternative to Nvidia GPU, Google TPU, Apple Neural Engine, etc.

                                      • nabla9 4 days ago

                                        It's not a general alternative. It's good for some subset of inference tasks.

                                        Intel will deploy their own GPU some time in the future.

                                      • mratsim 4 days ago

                                        Looks very interesting, but... AVX512 is already problematic cooling-wise, and this seems even worse.

                                        • deltasquared 4 days ago

                                          I am wondering why I would want this on a CPU when this kind of processing is already available on a GPU.

                                          • chrisseaton 4 days ago

                                            Where is your data? Is it in the CPU cache or is it in the GPU? Computing where your data is, rather than moving your data to where your compute is, can often be the best option.

                                            • emcq 4 days ago

                                              For small networks it's often a win to stay on-chip, at least on the power side. But if you do need to go off-chip for memory, it's hard to beat the memory bandwidth you have on a GPU.

                                          • im3w1l 4 days ago

                                            How big of an issue is context switching in the middle of a sequence of AMX operations?

                                            • jcranmer 4 days ago

                                              The AMX extensions drop 2 more components into XSAVE: XTILECFG (which is 64 bytes) and XTILEDATA (8192 bytes).

                                              Interestingly, there does seem to be a new extension (see §3.2.6 of https://software.intel.com/content/www/us/en/develop/downloa...) that, at first glance, looks to be a per-thread enable/disable bit for using these registers, which suggests that an OS could expose a process-level capability to enable/disable AMX and thereby not bother saving these registers on context switches when switching to a process without AMX.
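
                                              For a sense of what those two components hold, here's a rough sketch using the published _tile_* intrinsics (hedged: the config layout follows the ISA reference, -mamx-tile/-mamx-bf16 compiler support is assumed, and any OS opt-in for the tile state is skipped):

                                                #include <immintrin.h>
                                                #include <stdint.h>
                                                #include <string.h>

                                                /* 64-byte XTILECFG image: palette, then per-tile rows and bytes-per-row. */
                                                struct tilecfg {
                                                    uint8_t  palette_id, start_row, reserved[14];
                                                    uint16_t colsb[16];  /* bytes per row for each tile */
                                                    uint8_t  rows[16];   /* rows for each tile */
                                                };

                                                /* C += A * B on one 16x16 fp32 tile; A and B hold packed bf16 pairs. */
                                                void tile_matmul_16x16(float *c, const void *a, const void *b) {
                                                    struct tilecfg cfg;
                                                    memset(&cfg, 0, sizeof cfg);
                                                    cfg.palette_id = 1;
                                                    cfg.rows[0] = 16; cfg.colsb[0] = 64;  /* tmm0: C, 16x16 fp32 */
                                                    cfg.rows[1] = 16; cfg.colsb[1] = 64;  /* tmm1: A, 16x32 bf16 */
                                                    cfg.rows[2] = 16; cfg.colsb[2] = 64;  /* tmm2: B, packed bf16 */
                                                    _tile_loadconfig(&cfg);               /* fills XTILECFG */
                                                    _tile_loadd(0, c, 64);                /* XTILEDATA holds the tiles */
                                                    _tile_loadd(1, a, 64);
                                                    _tile_loadd(2, b, 64);
                                                    _tile_dpbf16ps(0, 1, 2);              /* tmm0 += tmm1 * tmm2 */
                                                    _tile_stored(0, c, 64);
                                                    _tile_release();                      /* clear tile state */
                                                }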

                                            • snvzz 4 days ago

                                              In other news, bloated CISC architecture becomes further bloated.

                                              I'm looking forward to RISC-V's V extension, which is due to become standard around September. Unlike AVX512 and friends, this one is vector size agnostic.

                                              • monocasa 4 days ago

                                                This isn't built on the AVX512-style register file, or even on vector registers at all. It's a set of huge matrix registers, so it's pretty orthogonal to both AVX512 and RV-V.

                                                • jabl 4 days ago

                                                  I haven't followed the RV-V extension in a while, but IIRC it has acquired features to configure the vector registers as matrix tiles, and matrix multiplication instructions.

                                                  • monocasa 4 days ago

                                                    It hasn't as of the 0.9 draft, but maybe there's something new I don't know about.

                                                    • jabl 4 days ago

                                                      I tried to look it up again, but I couldn't find anything. Might have been just some offhand remark in some presentation about future improvements upon the basic vector ISA.