Exploring .NET Core platform intrinsics: Part 4 – Alignment and pipelining

(mijailovic.net)

107 points | by benaadams 2099 days ago

2 comments

zvrba 2098 days ago
Further optimization potential: the four lines
```
    sum = Avx2.Add(block0, sum);
    sum = Avx2.Add(block1, sum);
    sum = Avx2.Add(block2, sum);
    sum = Avx2.Add(block3, sum);
```
all have a serializing dependency on the sum variable. But (integer) addition is associative and commutative, so you could sum in a tree-like manner, ending up with only a single serializing dependency:
```
    sum01 = Avx2.Add(block0, block1);
    sum23 = Avx2.Add(block2, block3); // independent of sum01, so these two can run in parallel
    sum = Avx2.Add(sum, sum01);       // sum01 hopefully ready; parallel with sum23
    sum = Avx2.Add(sum, sum23);       // sum23 hopefully ready
```
Where only the last line serializes with the previous one. Maybe the HW is smart enough to rename the registers and do the same thing internally, but it'd be interesting to benchmark it.
- Metalnem 2098 days ago
  I already tried that, but was disappointed that the performance gain was only 1%, which is why I didn't include the optimization in the post.
  - physguy1123 2098 days ago
    You should try maintaining 4 independent sum variables and summing them after the loop, so there's no serializing dependency between them at all. Such a transformation in microbenchmarks is a fun trick to show the power of a proper OOO engine with pipelined instruction units. Assuming no memory problems, one should be able to use (issue width × instruction latency) independent sum streams without spending more time in the hot loop.
    For what it's worth, vmovdqa only has a 4-wide issue width when it is moving between registers; the memory load has a 2-wide issue width. Floating-point adders themselves only have a 1-2 wide issue width, depending on your hardware, so it doesn't really matter.
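A minimal sketch of the four-accumulator idea suggested above, assuming 32-bit integer data, wrapping overflow, and .NET Core 3.0 or later. The names `MultiAccumulatorSum` and `Sum` are hypothetical and not taken from the article, and reading the data through `MemoryMarshal.Cast` is a choice made here to avoid `unsafe` code, not necessarily what the post does.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Hypothetical helper for illustration only.
public static class MultiAccumulatorSum
{
    public static int Sum(ReadOnlySpan<int> data)
    {
        int total = 0;
        int i = 0;

        if (Avx2.IsSupported)
        {
            int lanes = Vector256<int>.Count; // 8 ints per 256-bit vector

            // View the largest whole-vector prefix of the data as 256-bit blocks.
            ReadOnlySpan<Vector256<int>> blocks =
                MemoryMarshal.Cast<int, Vector256<int>>(data);

            // Four accumulators with no dependency on each other.
            Vector256<int> sum0 = Vector256<int>.Zero;
            Vector256<int> sum1 = Vector256<int>.Zero;
            Vector256<int> sum2 = Vector256<int>.Zero;
            Vector256<int> sum3 = Vector256<int>.Zero;

            int b = 0;
            for (; b + 4 <= blocks.Length; b += 4)
            {
                // Each add depends only on its own accumulator.
                sum0 = Avx2.Add(sum0, blocks[b]);
                sum1 = Avx2.Add(sum1, blocks[b + 1]);
                sum2 = Avx2.Add(sum2, blocks[b + 2]);
                sum3 = Avx2.Add(sum3, blocks[b + 3]);
            }

            // Up to three whole vectors may be left over.
            for (; b < blocks.Length; b++)
            {
                sum0 = Avx2.Add(sum0, blocks[b]);
            }

            // Combine the accumulators once, after the loop.
            Vector256<int> sum = Avx2.Add(Avx2.Add(sum0, sum1), Avx2.Add(sum2, sum3));
            for (int lane = 0; lane < lanes; lane++)
            {
                total += sum.GetElement(lane);
            }

            i = blocks.Length * lanes;
        }

        // Scalar tail (or the whole input when AVX2 is unavailable).
        for (; i < data.Length; i++)
        {
            total += data[i];
        }

        return total;
    }
}
```

Calling `MultiAccumulatorSum.Sum(values)` on an `int[]` should give the same result as a plain scalar loop; whether it is actually faster depends on the hardware and on memory bandwidth, so it would need to be benchmarked.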
rossnordby 2099 days ago
Seeing the intrinsics APIs get filled out (in the open, no less) has been pretty exciting. The fact that something like AES could be implemented competitively in C# is not something I would have predicted even five years ago.
It's remarkable how fast the language and runtime have evolved for performance. It wasn't that long ago that I was manually inlining Vector3 operators to try to get a few extra cycles out of XNA on the Xbox360.
- pjmlp 2097 days ago
  The Xbox360 runtime was notoriously bad and suffered from the WinDev/DevTools difference of opinion on what the future of Windows development should look like.
  Hence XNA was killed when they took over Windows 8 development, WinRT and such.
  It took all the reorganizations and a change of politics for the .NET runtime to finally start getting some additional love regarding performance.
  - oceanswave 2097 days ago
    > The Xbox360 runtime was notoriously bad and suffered from the WinDev/DevTools difference of opinion on what the future of Windows development should look like.
    The past-tense structure makes it sound like progress has been made on this front, while it’s still the same problem presently. It’s just that those tools in particular have been deprecated (and not replaced).
    - pjmlp 2097 days ago
      Well, how would you correctly phrase it in proper English then?
      That was the reason why the Xbox 360 runtime was bad; the rest of my comment refers to the standard .NET Framework.