AMD Ryzen is a good architecture at a good price. But compared to Intel, there are a few important differences IMO, two against and two in favor:
1. pext / pdep are emulated -- pext and pdep take many cycles to execute on Zen, while Intel can execute one per clock. These are crazy awesome instructions for any low-level programmer, and it's a shame they aren't practical to use on AMD Zen processors.
2. Zen is a bit slower with 256-bit AVX instructions, since it splits them into two 128-bit operations.
1. Zen offers more cores per dollar
2. Zen offers two AES-encryption units per core. This means you can run two AES instructions per clock tick. Dunno why AMD does this, but it's kinda cool in some obscure cases I've coded.
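(For anyone who hasn't run into them: pext gathers the bits selected by a mask and packs them into the low bits of the result; pdep scatters low bits back out to the masked positions. A minimal pure-Python model of the pair -- just a sketch of the semantics, not how the hardware does it:)

```python
def pext(value, mask):
    """Software model of BMI2 PEXT: gather the bits of `value` selected
    by `mask` and pack them into the low bits of the result."""
    result, out = 0, 0
    while mask:
        low = mask & -mask          # lowest set bit of the mask
        if value & low:
            result |= 1 << out
        out += 1
        mask &= mask - 1            # clear that mask bit
    return result

def pdep(value, mask):
    """Software model of BMI2 PDEP: scatter the low bits of `value` out
    to the bit positions selected by `mask`."""
    result, idx = 0, 0
    while mask:
        low = mask & -mask
        if value & (1 << idx):
            result |= low
        idx += 1
        mask &= mask - 1
    return result
```

On Intel, `_pext_u64` / `_pdep_u64` from `immintrin.h` do each of these in a single 3-cycle-latency instruction; the loops above are only there to show what the bits do.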
I hope AVX-512 is studied for its lessons on how not to roll out an ISA extension. I have very few technical complaints about it, but the following were deal breakers:
1. Limiting it to a subset of chips, and initially not releasing it for client chips at all. Creating an ISA extension for a small slice of the market is the best way to ensure it never sees any use.
2. Reasoning about AVX-512 performance is ridiculously difficult for most workloads because of the clock penalty. Unless you are running AVX-512 all the time, you will likely see performance drops.
I suspect the dual AES units are a side effect of Zen having two 128-bit SIMD units. The AES instructions use the SIMD registers so naturally the AES implementation is integrated into the SIMD unit. Presumably it was easier to duplicate the AES engine along with the rest of the SIMD unit than to split it out.
pext/pdep are awesome, but I imagine you'd never notice the difference in real world usage. You'd have to use a program often where those instructions are on the critical path and comprise a significant percentage of execution time. The chances of that are slim to none. You may well notice the extra cores though, to a point, depending what you do.
I was writing a program similar to the 4-coloring problem. I represented colors as 2 bits (colors 0, 1, 2, and 3).
I also created a bitmask representation of relations, which represents one variable in 4 bits, two variables in 16 bits, three variables in 64 bits, and four variables in 256 bits.
Ex: a Texas/Oklahoma/Arizona relation would be a 64-bit number ("true" means a color-set is in the relation; "false" means the color-set is not in the relation), and extracting or packing the data across these three variables would be a pext or pdep operation.
Extracting data (pext) would be a "select" operation, while pdep + OR would be an "update" operation over the relation. I've written a join for fun, but I haven't gotten much further than that. First, because pdep/pext were slow on my machine. Second, because I figured out an alternative solution to my particular problem.
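I don't have my original code handy, but the select/update idea can be sketched like this, with software stand-ins for pext/pdep and a hypothetical bit layout (index bits [1:0] = Texas's color, [3:2] = Oklahoma's, [5:4] = Arizona's):

```python
def pext(value, mask):
    # software stand-in for the BMI2 PEXT instruction
    res, out = 0, 0
    while mask:
        low = mask & -mask
        if value & low:
            res |= 1 << out
        out += 1
        mask &= mask - 1
    return res

def pdep(value, mask):
    # software stand-in for the BMI2 PDEP instruction
    res, idx = 0, 0
    while mask:
        low = mask & -mask
        if value & (1 << idx):
            res |= low
        idx += 1
        mask &= mask - 1
    return res

# One mask bit at every index whose low 2 bits (Texas's color) match:
# 0x1111... selects every 4th index; shifting it by the color picks which.
TX_STRIDE = 0x1111111111111111

def select_tx(relation, color):
    """'select': keep only the tuples where Texas == color, packed down
    into a 16-bit relation over (Oklahoma, Arizona)."""
    return pext(relation, TX_STRIDE << color)

def update_tx(relation, small, color):
    """'update': scatter a 16-bit (Oklahoma, Arizona) relation back to
    the Texas == color positions and OR it into the full 64-bit relation."""
    return relation | pdep(small, TX_STRIDE << color)
```

For example, the tuple (TX=2, OK=1, AZ=3) is index 2 + (1 << 2) + (3 << 4) = 54, and `select_tx(1 << 54, 2)` yields `1 << 13`, the (OK=1, AZ=3) entry of the smaller relation.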
I think the pext/pdep instructions have HUGE implications for the 4-coloring problem, 3-SAT, constraint solvers, etc. More researchers should probably look into those two instructions.
Just look at Binary Decision Diagrams, and other such combinatorial data structures, and you can definitely see the potential uses of PEXT / PDEP all over the place.
Hobby code: I'm slowly developing a Gin rummy engine. Card sets are represented by 64-bit integers, and all operations (finding melds, sets, etc.) are implemented with bitwise operators.
I have used pext/pdep for the iterator implementation (iterate over all subsets of a card set, or all combinations of n cards).
(e.g. to iterate over all 10-card combinations brute-force, filter, and print out all the ones which can be knocked (evaluating 15_820_024_220 hands) takes 70 seconds, single-threaded, on my 7th-gen Intel i3.)
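I won't post the whole engine, but the pdep iterator trick looks roughly like this sketch (pure-Python pdep standing in for the instruction; with BMI2 the scatter is a single instruction per iteration):

```python
def pdep(value, mask):
    # software stand-in for the BMI2 PDEP instruction
    res, idx = 0, 0
    while mask:
        low = mask & -mask
        if value & (1 << idx):
            res |= low
        idx += 1
        mask &= mask - 1
    return res

def subsets(card_set):
    """Yield every subset of `card_set`: run a plain counter up to
    2^popcount and PDEP it onto the set's bit positions."""
    n = bin(card_set).count("1")
    for i in range(1 << n):
        yield pdep(i, card_set)

def combinations_of(card_set, k):
    """Yield the subsets of `card_set` with exactly k bits set: enumerate
    k-bit patterns with Gosper's hack and PDEP each onto the set."""
    if k == 0:
        yield 0
        return
    n = bin(card_set).count("1")
    v = (1 << k) - 1              # smallest k-bit pattern
    while v < (1 << n):
        yield pdep(v, card_set)
        c = v & -v                # Gosper's hack: next pattern with k set bits
        r = v + c
        v = (((r ^ v) >> 2) // c) | r
```

With a 52-card deck mask, `combinations_of((1 << 52) - 1, 10)` walks exactly the 15_820_024_220 hands mentioned above (don't actually run that in this pure-Python model).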
A while back, I was working on a fun little video game side project which used BMI2 instructions to compute Morton codes on the critical path.
Voxels were stored in a buffer sorted in Morton order. The idea was to balance performance improvements realized by increased spatial locality against the cost of computing the Morton codes. The trade-off was only worthwhile on Intel because of the use of pdep/pext in optimized encode/decode functions.
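For the curious, the encode/decode pair is tiny; here's a sketch with software models of the instructions (on Intel each pdep/pext is a single fast instruction, which is exactly what made the trade-off work there):

```python
def pdep(value, mask):
    # software stand-in for the BMI2 PDEP instruction
    res, idx = 0, 0
    while mask:
        low = mask & -mask
        if value & (1 << idx):
            res |= low
        idx += 1
        mask &= mask - 1
    return res

def pext(value, mask):
    # software stand-in for the BMI2 PEXT instruction
    res, out = 0, 0
    while mask:
        low = mask & -mask
        if value & low:
            res |= 1 << out
        out += 1
        mask &= mask - 1
    return res

EVEN = 0x55555555  # bit positions 0, 2, 4, ... (x lives here)
ODD  = 0xAAAAAAAA  # bit positions 1, 3, 5, ... (y lives here)

def morton_encode(x, y):
    """Interleave the bits of two 16-bit coordinates into a 32-bit code."""
    return pdep(x, EVEN) | pdep(y, ODD)

def morton_decode(code):
    """De-interleave a 32-bit Morton code back into (x, y)."""
    return pext(code, EVEN), pext(code, ODD)
```

A 3D (voxel) variant is the same idea with every-third-bit masks and a third coordinate.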
I imagine something similar would probably apply to texture lookups in a software 3D renderer.
> I imagine something similar would probably apply to texture lookups in a software 3D renderer.
Except that for a 3D rasterizer you'd probably be better off calculating Morton codes for 8 pixels at once in a SIMD register and then using vpgatherdd to fetch 8 ARGB pixel values "in parallel" (in theory; in practice AVX2 gather might not be any faster than scalar loads).
I will agree with dragontamer that for some real-world code, PEXT/PDEP sit right in the middle of the hot path. It isn't just the clock-cycle savings, either; for some logic they can substantially simplify the code path. There's a lot of neat wizardry that can be done by composing sequences of those two instructions (mixed with other basic integer ops).
I don't use them often, but there are cases where I would not want to try to write code without them.
PEXT / PDEP are not instructions that any compiler I'm aware of will generate automatically.
It's a new fundamental bitwise operator. Some other programmers have called it a "bitwise gather (pext) or bitwise scatter (pdep)" (EDIT: had it backwards the first time). It's a very powerful way to think about bits that Intel introduced with those instructions.
If you have any data-structure that is bitwise, I can almost guarantee you that PEXT or PDEP will be useful in some operation. These instructions have been used to calculate bishop / rook moves in less than five operations.
And yes, remember that bishops and rooks can be blocked by other pieces. So given all the pieces on a chessboard, and the location of the bishop in question, calculate all possible locations the bishop can move (after accounting for "being blocked").
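Here's a cut-down illustration of that idea (a rook on a single 8-square rank instead of a bishop on a full board, with a software pext in place of the instruction): compress just the relevant blocker bits into a dense table index, so the runtime work is one PEXT plus one table load.

```python
def pext(value, mask):
    # software stand-in for the BMI2 PEXT instruction
    res, out = 0, 0
    while mask:
        low = mask & -mask
        if value & low:
            res |= 1 << out
        out += 1
        mask &= mask - 1
    return res

def slow_rank_attacks(sq, occ):
    """Reference: squares a rook on square `sq` attacks along an 8-square
    rank, stopping at (and including) the first blocker each way."""
    attacks = 0
    for step in (1, -1):
        s = sq + step
        while 0 <= s < 8:
            attacks |= 1 << s
            if occ & (1 << s):
                break
            s += step
    return attacks

def relevant_mask(sq):
    # Edge squares and the rook's own square never change the answer.
    return 0b01111110 & ~(1 << sq)

# Precompute one dense table per square, indexed by the PEXT-ed occupancy.
TABLES = []
for sq in range(8):
    mask = relevant_mask(sq)
    table = [0] * (1 << bin(mask).count("1"))
    for idx in range(len(table)):
        occ, bits, m = 0, idx, mask   # scatter idx onto mask (what PDEP does)
        while m:
            low = m & -m
            if bits & 1:
                occ |= low
            bits >>= 1
            m &= m - 1
        table[idx] = slow_rank_attacks(sq, occ)
    TABLES.append(table)

def fast_rank_attacks(sq, occ):
    # Runtime cost: one PEXT plus one table load.
    return TABLES[sq][pext(occ, relevant_mask(sq))]
```

The real chess version ("PEXT bitboards") is the same trick with 64-bit diagonal and rank/file masks and per-square attack tables.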
Don't optimizers already do quite a lot of analysis to output what you mean rather than what you say? I can think of quite a lot of bitwise shift/and and shift/or patterns that should be automatically convertible to single-instruction pext/pdep.
Not a processor geek per se, but I do appreciate some insight into these details. As an aside, I'm really happy to see AMD being competitive across most products, with even better bang for the buck at a lot of price points. Waiting on the Zen 2 architecture to upgrade my desktop (it will be over 5 years old at that point)... depending on the initial release or the Threadripper version.
Finding a motherboard that supports it is relatively easy. The Asus Prime X370-Pro, for example, seems like a good choice for a simple home server, with ECC and eight SATA ports. The problem is actually finding reasonable ECC RAM. Unregistered/unbuffered ECC RAM is an unusual configuration that most manufacturers don't provide. It's hard to find, expensive, and much slower, which Zen is supposedly sensitive to.
Shouldn't we have moved to ECC RAM everywhere a long time ago? With economies of scale would it actually be any more expensive or slower? There's no place where the extra safety is a negative, is there?
ECC RAM not being in consumer PCs is largely market segmentation pushed by Intel. It is in fact quite ridiculous if you consider that essentially every other bus, interconnect, and storage device in your computer has error correction except _main_ memory (and the main memory bus). If it weren't so normalized we'd go, "Dude, do you even realize how absurd it is to have no parity on the most important data the computer is working with?!"
Also the lost productivity due to main memory errors not being detected probably easily goes into the billions. Thanks, Intel.
There was a time when consumer systems genuinely didn't support ECC for lack of hardware support. This hasn't been the case for many, many years.
I don't believe any CPU cache has ECC, either. Which is where the memory you're actually working with lives.
Also no other interconnect in your system is as much of a bottleneck as the one to main memory is. It's worth keeping that in mind before entirely blaming this on "market segmentation". ECC RAM does actually slow down the part of a system that is already the bottleneck in most common situations.
> I don't believe any CPU cache has ECC, either. Which is where the memory you're actually working with lives.
I'm not aware of any desktop CPU that doesn't have ECC caches. CPU internal busses use ECC and external interconnects (e.g. PCIe, DMI, QPI/UPI) use it as well.
> ECC RAM does actually slow down the part of a system that is already the bottleneck in most common situations.
ECC invariably introduces some additional latency in the memory controller, but I don't see a persuasive argument why it would reduce throughput. It would surprise me if this additional latency is measurable, given that the ECC logic is already in the core and in the data path anyway, and the system configuration (AMD, Intel) / CPU fuses (Intel) only decide whether it is active.
That being said buffered ECC modules are usually not the fastest. I don't think that this is due to any technical limitation per se, but rather market demand (cost, perf per Watt).
> I'm not aware of any desktop CPU that doesn't have ECC caches. CPU internal busses use ECC and external interconnects (e.g. PCIe, DMI, QPI/UPI) use it as well.
I believe all of those are just parity checked and not ECC?
> ECC invariably introduces some additional latency in the memory controller
Latency is a non-trivial factor here, too, though.
> That being said buffered ECC modules are usually not the fastest. I don't think that this is due to any technical limitation per se, but rather market demand (cost, perf per Watt).
Poking around, it looks like ECC RAM tops out at DDR4-2666 @ 1.2V. By comparison, there's no shortage of DDR4-3200+ options at 1.2V. Whether or not this is purely market demand, there doesn't seem to be a power reason for it.
But you also can't solely blame Intel for a lack of market demand. Even when the choice is there nobody seems to be making ECC memory for high-end desktop usages. Where's the DDR4 3200 for Threadripper or Xeon-W workstations, for example? They surely benefit from the improved bandwidth, or else they wouldn't have triple & quad channel memory. And they'd surely pay the price of admission, because we're talking $3,000+ entry points for builds.
Caches are definitely ECC, because the corresponding MCEs can tell the system both that an error was detected and corrected, and that an uncorrectable error occurred (which by default leads to a kernel panic, IIRC).
I haven't bought new RAM lately, but not that long ago it was often much cheaper to buy an old server and outfit it with used ECC DDR3 than to buy equivalent consumer RAM, simply because there wasn't much demand for previous-generation ECC RAM.
Right now DDR4 is fairly new, but as old servers get rotated out I expect a good market for cheap ECC DDR4 sticks that come from used servers but are too small to get reused in new servers. (unregistered/unbuffered is still a problem though)
The main advantage of ECC is reliability, so buying used RAM doesn't seem like a great option. If you're willing to buy used, a good way to get a nice workstation is to just buy a used workstation machine (e.g., the HP Z400/Z600/Z800 line) and upgrade the storage and GPU. But if you're gaming or trying to upgrade a home server, what you want instead is a nice motherboard/CPU/RAM combination. Right now Ryzen would be a great option for that if there were some good ECC UDIMM options.
So you will need to pay $360 instead of $320. Many people would choose the cheaper memory. I'd guess almost everyone except some PC enthusiasts would (last time I discussed it, the majority of PC enthusiasts thought ECC on the desktop wasn't needed, so they wouldn't want to pay for it). I agree that ECC is nice to have, but the price is real.
RAM prices already vary by more than 12%, and it seems like there is no shortage of people paying a premium for brand recognition, different board colors (green is "out"), useless heat sinks, etc. I think there would be plenty of people willing to pay extra for ECC (some for peace of mind, some because they need it, some just to feel superior).
The market segment of "people who are willing to pay a premium for computing devices" is pretty vast, isn't it?
Most people would be fine with a $200 Chromebook, $500 bare-bones Windows notebook, or a garden-variety PC from 2007 but millions of us choose to pay more because we value the additional things that newer, more powerful computing devices give us.
Lots of professionals and enthusiasts gladly pay large premiums for higher-spec gear, even when the improvements are quite small, because those small improvements are enjoyed over many thousands of hours of lifetime use.
Perhaps more to the point, you already see gamers paying premiums for higher-specced memory to enable their overclocking and tweaking endeavors.
So I definitely think there's a market of people who'd pay more for ECC...
Just like I paid extra for a nice PSU instead of the very cheapest, I would pay more for ECC.
The cost difference for ECC amortized over the life of the hardware is negligible compared to the annoyance and time spent trying to work out what's causing those random bluescreens/reboots/corruption.
The risk of going without ECC is significantly lower than the risk of going without insurance or backups, though.
Drive fails and no backups? Potentially terabytes of data vanishes. House burns down and no insurance? Hundreds of thousands of dollars to repair. No ECC and a bit flips? Nothing happens, program crashes, or maybe a single file gets corrupted in an unrecoverable way.
And maybe that file happens to be an important encryption key. Or maybe a whole filesystem gets corrupted because some important metadata is. Or maybe your system develops a hardware problem over time, like an oxidized CPU pin or a bad contact in a memory slot, and then you're getting flaky bits on a regular basis.
ECC is useless until you experience a problem of one of those kinds. Many people never will. But some will. I have.
ECC logic is implemented in the memory controller (in the CPU these days).
The ECC DIMM just provides extra chips to store ECC bits.
And the motherboard, if it supports ECC, just provides the extra data lanes that connect the extra DIMM chips to the appropriate pins on the CPU.
Pretty sure my next PC build is going to be Zen 2 (3xxx series) and I'm going to try and get ECC memory for it. Even if it costs a little more/a little slower I think the knowledge that my data hasn't been corrupted is worth it.
Same. I've been through two computers that have randomly been flaky. It turned out, after 2 years of debugging, that the PSU didn't really like suddenly having load on it (like when the CPU turbo-boosts 8 cores) and that that was leading to crashes. (It was a 1000W PSU, too; I thought overprovisioning would solve all my problems, but I guess not.) I suspected the memory the whole time, though, and having ECC would have at least been one less thing to worry about.
The other thing that annoys me about PC hardware is that the motherboard tries as hard as possible to make your system unstable. I don't want overclocking. I don't want to run the memory at XMP speeds. Just give me a button for "run everything at its conservative spec". (With that in mind, I'm not sure memory manufacturers test anything other than their XMP timings, leaving you to guess whether the non-XMP profile has the right voltage/latency numbers. It's infuriating!)
>This wasn't my experience either on memory.net or crucial.com
As far as I can tell, crucial.com shows a total of 2 options, both 16GB UDIMMs, one tall, one short. I think last time I checked it had none. I didn't see a single option on memory.net. Remember that you need unbuffered/unregistered ECC (UDIMMs with ECC), and there aren't many options for those. RDIMMs will not work.
It's also hard to verify the DIMMs will work, because the motherboard's QVL doesn't list any of the ones I've found so far. Not all retailers carry these either, so they're hard enough to find for me.
There are only a few modules of 2666 MHz DDR4 ECC RAM from any of those manufacturers. Non-ECC goes to almost 2x that speed, and there is plenty of choice. In the places I've seen, it's around 40-50% more expensive for 12.5% more RAM chips. These are not small differences.
As someone with a Ryzen 1xxx, X370-based motherboard, and 32GB of ECC RAM the link you give is a bit out of date in that Windows 10 better supports ECC (the same X370 with `wmic memphysical get memoryerrorcorrection` reports 6 now). Much of the rest of the article about a wide selection of memory, finding the motherboard firmware toggles, etc are still valid.
There are also posts at other fora complaining that Hardware Canucks is wrong to suggest that an uncorrectable error should result in an immediate system halt - I leave that argument to those who are interested.
I've been running a Threadripper 1950X in my main box for the past 15 months or so and am generally extremely pleased with the results. However, my biggest takeaway from the experience of having 32 threads has been that an embarrassingly large amount of the software I use on a daily basis for productivity runs in a single thread. My expectation was that UI blocking would be rare; it isn't, particularly with Chrome, Firefox, and Slack. Jira in the browser is terrible: even with insane resources and 1 Gbps bandwidth I regularly have to wait 10-15 seconds to be able to enter text.
> Cue all the people saying how they couldn't live without their 32 threads
I can live without 24 threads, but I love not having my computer become unusable because I'm encoding video or doing some other CPU-heavy task. Having more than 4 threads has opened a whole new world of thinking about how to parallelize common tasks; not ever having to wait for your computer feels like a superpower. Paradoxically, this has freed me up to use an ARM Chromebook for day-to-day usage: when I need firepower, I remote into the TR workstation (smart plug + boot-on-power BIOS + dynamic DNS).
I'll never stop feeling a little exhilaration from typing "make -j 22"
UI lock-ups are caused by thread locks, not by a lack of cores. Actually, I can load all 4 of my cores at 100% with some task and the PC stays very responsive; it's really hard to notice a difference in most tasks. So yeah, 32 cores are nice when you have work for those cores, but it's not magic. Frequency is magic :)
> an embarrassingly large amount of the software I use on a daily basis for productivity runs in a single thread. My expectation was that UI blocking would be rare; it isn't, particularly with Chrome, Firefox, and Slack. Jira in the browser is terrible: even with insane resources and 1 Gbps bandwidth I regularly have to wait 10-15 seconds to be able to enter text.
Why would you expect anything different when JS _does_ run in a single thread? We'll have to wait for WebAssembly to have anything like real multithreading, with good-enough performance, on the Web.
> Why would you expect anything different when JS _does_ run in a single thread?
No reason for site A rendering to block site B; no reason for either to block the main UI. No reason for an issue tracker to take 10 seconds to achieve interactivity on LAN (heck, I'd consider 0.5 seconds slow).
Layout and rendering can happen asynchronously in background threads, but you have to carefully structure your JS code to not read back layout properties soon after modifying the DOM; otherwise it will block and turn everything back into sequential execution.
So an interesting thing happened to me last month. I had a Gigabyte ecc pro 150 with a Xeon processor, and it died (hardware failure; it refused to POST after I'd had it for two years).
I run Debian Stable. When I swapped in the new CPU (Ryzen 7 2700X), motherboard, and RAM and powered on Debian, it booted up normally and automatically configured itself for the new CPU, motherboard, and RAM.
It even works on Windows, since at least Windows 7, although you need to do some chipset driver cleanup by hand after the reboot before installing the new ones (if you need them). I wouldn't try that on earlier versions of Windows, though.
As of Windows 10 it is reasonable to expect to be able to rip a drive out of any given computer and put it in another one and have it work. Actually did this recently to jump several years in hardware on my workstation.
For a development box, is the single-thread performance on the AMD system really something you'd notice? For a production system you'd ideally pick the CPU architecture best suited to your workload, but most of us just go with whatever is currently under our hypervisor.
In my mind you're doing something incredibly specialized if you notice the difference between AMD and Intel, or between current generation and last generation CPUs. Video encoding is really the only "mainstream" application I can think of.
I mean, theoretically my 2700X has slightly worse performance per core (though not at the same price point; it's not fair to compare a $350 processor with a $600+ one), but it doesn't matter when I have webpack running with 4 threads, type checking on a separate thread, a DB server, and IntelliJ all running without remotely a stutter.
More cores are important, but the i7-8700 uses 65W, a single thread is faster than AMD's, and it has six cores / 12 threads. To get close performance but with 8 cores / 16 threads, I'd have to get the 2700X, which is more expensive, plus a video card, which means much more wattage and, unless I settle for a throwaway video card, much more expense. There are also benchmarks that show AM4 has storage performance issues of 10-30%, which could affect complex workflows involving builds and containers.
Still, I'm within the return period and trying to find a way to justify AMD, since my work is increasingly about containers. It's just such a huge timesink.
Only a select few games are able to use more than 6 cores, and then only in some situations. For compilation and other workstation tasks the 8-cores (and more) are king, but they're expensive, so unless you have money to blow, go for the 2700.
> Only a select few games are able to use more than 6 cores
Poorly designed older games. Modern games should be able to use all cores, because Vulkan is available. Something like dxvk, for example, uses as many cores as reasonable for compiling Vulkan pipelines.
And what about the difference between the 2600X and the non-X 2600? I think I'd need an aftermarket cooler for either variant... and the 2600 (non-X) is cheaper and draws less power; are the clock losses acceptable, or not?
And would a B450 motherboard be good enough? I'm using an Nvidia 1060 6GB graphics card.
Isn't that a big part of the value added by Apple? You don't have to care about CPUs, motherboards, etc...
If you want to buy a decent machine and not spend time finding out what is decent right now, get Apple. If you want control and perfect tuning for your particular situation, definitely do not get Apple.
I am still on a 4.3GHz 3570K at home (holding up pretty well after some minor mechanical/percussive maintenance revived a dead memory channel caused by a flaky CPU socket). I'm eyeing 3rd Gen Ryzen later this year but for now upgrading from 16GB DDR3 to 16GB DDR4 doesn't seem cheap ;).
Can recommend. I have a Ryzen 7 2700X and it mops the floor with all my other builds (which are admittedly all older builds). Runs Linux great, and IOMMU works very well; it seems they now officially support it, so GPU passthrough has worked super well and saves me from needing to dual boot.
Another bonus: the stock CPU fan, although flashy with its RGB LEDs, is very formidable and can probably even stand up to a bit of overclocking.
I am glad to see competition in the CPU space again. It's been too long.
I'm passing through a spare GTX 1070 using good ol' qemu kvm.
It's extremely easy to set up the PCI passthrough itself in virt manager, the system-level configuration is a bit more involved. You may also want a KVMFR like Looking Glass, since otherwise you'll need a physically separate keyboard/mouse/video setup.
I don't think most commercial VM solutions support this kind of configuration. I'd guess VirtualBox might, but I know for a fact VMware Workstation doesn't (and there's no VMware Workstation package for NixOS yet, so my license is collecting dust at the moment :().
It's worth noting you need a separate GPU for this right now. Intel just recently started supporting something called GVT-G that lets you split an Intel IGP into multiple VMs, not as useful for me since I want a better GPU but maybe useful to others. I have yet to try it.
Thanks for the GVT-G reference. For running a Windows VM in a laptop that seems like the only missing piece. Graphical performance is clearly lacking and if that works as described it seems like it would fix it.
Sure. I believe the motherboard is an ASUS X470 Prime Pro. I picked it up at Fry's, and I'm not home to look at it, so I could be a little off.
It is indeed a pair of Nvidia cards, but that part only matters a little. I don't particularly recommend Nvidia for the host, and as far as I know you can run whatever card you want on the Linux host. Looking Glass may care about the guest GPU simply because it's still a bit experimental, but there's no real reason I'm aware of that it can't work with AMD or Intel graphics processors.
I went 2700X as well (paired with 32GB DDR4-3200 and an RTX 2080). Solid machine, but the RTX 2080 was not cheap (went EVGA, since once you're into silly money, an extra $100 for the warranty/customer service and build quality hardly matters).
Did a 2600 build late last year... it's actually a little faster than my 4790K from 5 years ago, at a fraction of the cost. Waiting on Zen 2 to upgrade my own desktop later this year... hoping to see a 16-core mainstream part, or I might wait for Threadripper.