I wonder what Charlie Demerjian is going to say about this.
It is an awful lot of registers for a feature that few programs may use. There's the risk that bfloat16 is a fad and five years from now it is used hardly at all. At best it is for a full stack perception-and-synthesis feature about as good as
Most of all it drives me nuts that Intel is going down the fixed-length SIMD instruction route (going wider) where the instructions you write are specific to the structures of the processor you run on. Machines like the ILLIAC, Cray, and the vector processor for the 3090 mainframe would automatically chunk the work so that you didn't need to rewrite your code for a future wider model.
It's a controversial opinion I've not heard anyone else express, but IMO variable-width vector instructions are the wrong approach, since they're optimizing for the easy cases and solving problems that don't matter much, like instruction counts.
Although x86's SIMD extensions have a bunch of crippling issues, fixed-width instructions are fundamentally more general because they work both for vectorizable length-N loops and for the many, many cases where you have a fixed number of units of work that you can perform in parallel. If you design your architecture to depend on loops to extract full performance, you cut yourself off from many productive uses of vector instructions outside of loops, and you make it more difficult to juggle cases where you have more than one loop dimension.
Upgrading to new architectures with longer supported vector lengths is best left to recompilation and variable-width vector libraries.
AFAIK SIMD-accelerated cipher implementations already mostly do this. For example, AES-NI-based implementations of AES-128 and AES-192 keep round keys in XMM registers without reloading them while processing blocks.
Intel reasons that x86 is so massive that whichever way they decide to go is The Right Way, and people will accommodate anything they do. So far, this has mostly been true - we have code that takes different branches depending on the ISA extensions available, probably all the way down to x87.
I don't think that's how these features get developed. Someone who orders CPUs by the cubic meter comes to Intel and asks for a bfloat16 unit. Six years later, they ship one. I know for a fact this is how BMI came to exist.
Indeed. We should never think that we are Intel's (or AMD's or IBM's) customers. They have a very short list - Dell, Lenovo, HP and, more recently, Google, AWS, and Facebook - and some of those are procuring directly from TSMC.
Thinking it over a bit more, there definitely are examples that seem to have been driven speculatively by Intel, where the application in industry was not clear. Optane is the one that strikes me as a thing that Intel decided they _could_ do, and then did. The companies where I have worked that considered Optane did so after Intel shipped it; they didn't do research that concluded they needed it before it existed. Also, I can't recall an instance where the TCO looked good for Optane.
But as for architectural extensions, there clearly are customers asking for these things - even TSX, which has never worked.
Intel is going to start thrashing around now, adding a million features to try to beat ARM and AMD on various microbenchmarks and special use cases. Meanwhile ARM and AMD will keep winning on throughput, price/performance, and (for ARM particularly) performance/watt.
Any win all these features bring can also be achieved with ARM or Zen by adding more cores, with the exception of those few cases where you have huge computational tasks that cannot be efficiently parallelized and where there are few discrete jobs to allow for coarse grained (job/task) parallelization. There are not many of these.
Meanwhile all these features are going to make Intel chips even more complex, making bugs more likely and making iteration more costly.
My read is that Intel is shooting for maximum possible single threaded performance because they can't compete on power efficiency or many-core. They can't compete in those areas because their process nodes are not as small as what TSMC can offer, and both (most) ARM chips and AMD are using TSMC and fabricating at 7nm and soon 5nm. (Yes I know nm node sizes are no longer directly comparable, but they are ahead of Intel and likely to stay ahead unless Intel can really push hard on fab engineering.)
I'm kind of interested to hear what bfloat16 sounds like (for audio DSP). I'd expect it to be good enough for a large number of algorithms as long as they are numerically stable, and if we get decent performance and reduced power use, I'm all for that!
I’m sure you could come up with an application for it, but if you want audio output at some point, half-float sounds like quite a challenge.
- The -66dB noise floor is pretty bad, and it accumulates at every step.
- 11 bits is probably not enough for filter coefficients. So your filters would likely be running with single precision floats, at least. Even low-cost DSP chips tend to give you a big chunk of bits for your filters.
- Naïve oscillator designs would accumulate a lot of error. Back-of-the-envelope calculation: if you wanted an oscillator at C4, you’d likely be around a quarter tone sharp or flat unless you ran the oscillator at higher precision.
I’m definitely of the mind that bit depth is overrated in music. 16 bits is great for mastered music and simple tweaks. By contrast, from my experience writing DSP code, it often makes your code simpler and faster to run at higher bit depths and sample rates, and then convert to e.g. 16 bit as the very last step. The problem is that squeezing good output from low precision or low sample rates requires more complicated and slower algorithms.
Probably not as crisp as 16-bit integers - the mantissa is 7 bits, 8 counting the implicit leading bit. Clever use of exponents may give some additional range, but I wouldn't be too optimistic.
The good thing is that, like any FP format, it can represent a very large range of values, so I'd expect percussion and other material with very high-frequency transients to sound nice. My bet is it'll sound "colorful". IEEE float16 should sound better.
But I'm no expert, and I don't have the time to write something that forces a good high-dynamic-range track to be rounded to the nearest bfloat16 and then expanded back. Also, I don't have gear good enough to hear anything better than CD-grade audio.
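For what it's worth, that round-trip experiment is cheap to sketch in numpy without bfloat16 hardware: since bfloat16 is just the top 16 bits of an IEEE float32, you can emulate the rounding with bit tricks and measure the resulting noise floor. The function name and the sine test signal below are mine, purely illustrative:

```python
import numpy as np

def round_to_bfloat16(x):
    """Round float32 values to the nearest bfloat16, returned as float32.

    bfloat16 is the upper 16 bits of a float32, so the round trip can be
    emulated by round-to-nearest-even at bit 16 and zeroing the low bits.
    """
    u = np.asarray(x, dtype=np.float32).view(np.uint32)
    u = u + 0x7FFF + ((u >> 16) & 1)  # round to nearest, ties to even
    return (u & 0xFFFF0000).view(np.float32)

# A 440 Hz sine at 48 kHz as a stand-in for a high-dynamic-range track.
t = np.arange(48_000, dtype=np.float32) / 48_000
signal = np.sin(2 * np.pi * 440 * t).astype(np.float32)
noise = signal - round_to_bfloat16(signal)
snr_db = 10 * np.log10(np.mean(signal**2) / np.mean(noise**2))
```

Back of the envelope, an 8-bit significand should land the SNR somewhere in the 50-60 dB range for a full-scale sine, well short of the ~96 dB of 16-bit PCM.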
None of the previous extensions could be used blindly, either. It was a while before people figured out how to use MMX or SSE well, and people still often find that the scalar version of an algorithm beats their vector version.
AMD processors used to implement AVX instructions by double-pumping them, using only 128-bit vector ALUs. This means there's no clock penalty, but there's also no speedup over an SSE instruction by doing so. I don't know if this is still the case with the newest µarchs though.
> but there's also no speedup over an SSE instruction by doing so
Just because they are split doesn’t mean they run sequentially. Zen 1 can handle up to 4 floating-point micro-ops per cycle, and there are 4 floating-point execution units, each 128 bits wide (that’s excluding load/store; those 4 EUs only compute).
Native 256-bit implementations are even faster due to fewer micro-ops and potentially more in-flight instructions, but I’m pretty sure that even on Zen 1, AVX is faster than SSE.
It depends. If the EUs are actually the bottleneck, then SSE vs. AVX wouldn't make any difference in speed.
However, when instruction decode/retire is the bottleneck, AVX can be faster. I remember this being the case on Intel Sandy Bridge (first-gen AVX, double-pumped, retiring 3 instructions/cycle), where AVX could sometimes be faster (usually it's not that different).
With recent CPUs from both Intel and AMD able to decode/retire at least 4 instructions per cycle, this has largely ceased to be the case.
Short answer: Yes, faster. Long answer: It depends, and you may be measuring the wrong thing.
Among other things, it depends on the workload and the exact processor. You can find plenty of cases where AVX512 makes things faster. You can also find cases where the entire system slows down because it is running sections of AVX512 code here and there—apparently, for certain Intel processors, the processor needs to power up the top 256 bits of the register files and interconnects, and to get full speed for AVX512 it will alter the processor’s voltage and clock speed. This reduces the speed of other instructions and even of other cores on the same die (which may be surprising).
While the specifics may be new, the generalities seem familiar—it has long been true that a well-intentioned improvement to a small section of your code base can improve performance locally while degrading overall system performance. The code that you’re working on occupies a smaller and smaller slice of your performance metrics, and meanwhile, the whole system is slowing down. There are so many reasons that this can happen, dynamic frequency scaling with AVX512 is just one more reason.
Porting SSE code to AVX (with equivalent instructions and proper vzeroupper) will increase performance in most cases (the only case where it can be slower, off the top of my head, is on Sandy Bridge). The same is not true for AVX to AVX512.
It will increase performance if you have a sufficient amount of dense data on the input.
When that’s the case, especially if the numbers being crunched are 32-bit floats, there’s not much point in doing it on the CPU at all; GPGPUs are way more efficient for such tasks.
However, imagine sparse-matrix * dense-vector multiplication. If you rarely have more than 4 consecutive non-zero elements in the rows of the input matrix, with large gaps between them, moving from SSE to AVX or AVX512 will decrease performance; you’ll just be wasting electricity multiplying by zeros.
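To make the shape of that workload concrete, here is a plain CSR (compressed sparse row) matvec sketch; the names are illustrative, not from any particular library. Each row contributes only a short run of nonzeros, which is exactly the case where padding a row out to 8 or 16 SIMD lanes mostly multiplies zeros:

```python
import numpy as np

def csr_matvec(data, indices, indptr, x):
    """y = A @ x for a matrix A stored in CSR form.

    data    - nonzero values, row by row
    indices - column index of each nonzero
    indptr  - indptr[r]:indptr[r+1] bounds row r's slice of data/indices
    """
    y = np.zeros(len(indptr) - 1)
    for row in range(len(y)):
        lo, hi = indptr[row], indptr[row + 1]
        # A vectorized version would pad this run out to the vector width;
        # with <= 4 nonzeros per row, 512-bit lanes sit mostly idle.
        y[row] = np.dot(data[lo:hi], x[indices[lo:hi]])
    return y
```

The per-row dot product is over however many nonzeros that row happens to have, so the useful work per instruction is set by the sparsity pattern, not by the vector width.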
So in some sense very similar to SKX behavior? The first iteration of an instruction set's implementation requires judicious use of the instructions, while later implementations handle them gracefully (this is something to be upset about... those "later implementations" should have been available quite some time ago).
This is also ignoring the fact that none of these penalties come into play if you use the AVX512 instructions with 256-bit or 128-bit vectors. (This still has significant benefits due to the much nicer set of shuffles, dedicated mask registers, etc.)
Meta: I'm not a native speaker, but that lonely 'W' in the title really irks me. A suggested alternate title would be something like "Intel's Sapphire Rapids debuts x86 Advanced Matrix Extension" (60 chars).
The AMX extension adds 2 more components to the XSAVE state: XTILECFG (64 bytes) and XTILEDATA (8192 bytes).
Interestingly, there does seem to be a new mechanism (see §3.2.6 of https://software.intel.com/content/www/us/en/develop/downloa...) that, at first glance, looks like a per-thread enable/disable bit for these registers. That suggests an OS could expose a per-process capability to enable/disable AMX and thereby avoid saving these registers on a context switch to a process without AMX enabled.