More on x86 – The Chip Letter

(thechipletter.substack.com)

65 points | by rbanffy 9 days ago

6 comments

  • Joker_vD 9 days ago
    I've recently read the original RISC paper, "The Case for the Reduced Instruction Set Computer" by D.A. Patterson [0], and I believe that most of the arguments he presents... are just false, as in, they don't hold, at least not anymore. His arguments for RISC are:

    1) the memory speeds are increasing and approaching those of the CPU — that's definitely not true today;

    2) code density is not as important because memory is faster and cheaper — again, memory may be cheap, but keeping the icache full is more important for performance than ever;

    3) designing and implementing CISC is difficult and takes a long time — that's true... but designing and implementing a highly performant RISC is also quite difficult and takes a long time, and today the difference is mainly in the decoding stage: after it, it's all RISC-like microinstructions in either case;

    4) chip area used for complex instructions can be used for something better, like caches — is it really true today? Does the microcode for those instructions take that much space, and if we consider operation fusion (which happens on RISCs as well), is there even an advantage?

    5) Complex instructions can be easily and more performantly simulated by combining several simpler ones — I would generally agree, but... he picked a very bad example of an unnecessary instruction, the VAX's INDEX instruction. Considering that we now have the CHERI project (which targets RISC-V as well), which is all about moving fine-grained bounds checking back into hardware, this example has aged very poorly.
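
    For context, the VAX INDEX instruction computed an array subscript and bounds-checked it in a single instruction, trapping on violation. A minimal C sketch of roughly what it did (my paraphrase of the semantics, not an exact rendering):

      #include <stdio.h>
      #include <stdlib.h>

      /* One INDEX step: check the subscript against [low, high], then fold
         it into a running index (the VAX chained these for multidimensional
         arrays). CHERI pushes this kind of check back into hardware. */
      long vax_index(long subscript, long low, long high,
                     long size, long index_in) {
          if (subscript < low || subscript > high) {
              fprintf(stderr, "subscript range trap: %ld\n", subscript);
              abort();                    /* a hardware trap on the VAX */
          }
          return (index_in + subscript) * size;
      }

      int main(void) {
          /* a[3] in an int32 array indexed 0..9: byte offset 12 */
          printf("%ld\n", vax_index(3, 0, 9, 4, 0));
          return 0;
      }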

    What remains is that RISCs are more pleasant for the programmer to work with, they should have fewer errors in their implementations due to their simplicity (which, again, holds only for the most simplistic implementations), and they're faster (which they marginally are? Apples-to-apples comparisons here are not quite easy). Power efficiency, the actual advantage of RISCs, was not even considered back then, so my final conclusion is... "right for the wrong reasons", I guess?

    [0] https://inst.eecs.berkeley.edu/~n252/paper/RISC-patterson.pd...

    • dgacmu 9 days ago
      I think it's useful to draw a distinction between instructions that are somewhat complex and overall design choices that make decoding (and scheduling) hard. Like, does it really matter if you have a reciprocal square root instruction? Or is it more important to have fixed-length instructions with an easy decode path? And a way to avoid a lot of implicit dependencies on, e.g., status flags? RISC got many of these things right, but it wasn't really by avoiding complex instructions.

      (The fact that arm has a reciprocal square root instruction gives away my opinion a little bit...)

      Dark silicon and power limits changed the equation a lot, as you observe.
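
      To make the status-flags point concrete, a small C sketch of my own; the comments describe typical codegen, which will vary by compiler:

        #include <stdint.h>

        /* On x86-64 the comparison result lives in EFLAGS, which the
           unrelated add below also rewrites, so hardware must rename the
           flags to keep the two independent. On RISC-V there is no flags
           register: the comparison yields an ordinary register value and
           every dependency is explicit. */
        int64_t pick(int64_t a, int64_t b, int64_t c, int64_t d) {
            int lt = a < b;      /* x86: cmp -> EFLAGS; RISC-V: slt -> reg  */
            int64_t sum = c + d; /* x86: add clobbers EFLAGS; RISC-V: add   */
            return lt ? sum : 0; /* x86: cmov/jcc reads flags; RISC-V: bnez */
        }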

      • gpderetta 9 days ago
        > does it really matter if you have a reciprocal square root instruction

        It was important at the time of the original RISC, as anything that could not be handled by the single-cycle execution stage would significantly increase the complexity of the pipeline.

        Today pipelines are orders of magnitude more complex than the original RISC, so it doesn't really matter.

        • librasteve 9 days ago
          I don't think so ... without reciprocal square root, you would need to issue a square root and a reciprocal instruction sequentially and trust your compiler to unroll, pipeline, or whatever. Anyway, square root on an IEEE 754 FPU is going to use something like microcoded Newton-Raphson with a variable number of cycles to reach sufficient accuracy. BUT the result of the two chained instructions will often differ from that of the unified instruction (just as a MAC differs from a multiply followed by an add), so this MAY be a philosophical decision, or more likely someone could code LINPACK tighter
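
          That fused-vs-chained difference is easy to show in plain C with fma(), which rounds once where a*b + c rounds twice (rsqrt isn't standard C, but the principle is identical):

            #include <math.h>
            #include <stdio.h>

            int main(void) {
                double a = 1.0 + 0x1p-27, b = 1.0 - 0x1p-27, c = -1.0;
                double chained = a * b + c;    /* a*b rounds to 1.0 -> 0.0 */
                double fused   = fma(a, b, c); /* exact a*b + c = -2^-54   */
                printf("chained = %g\nfused   = %g\n", chained, fused);
                return 0;                      /* link with -lm            */
            }

          A hardware rsqrt estimate differs from sqrt-then-divide for the same reason: each step rounds differently.
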
    • hajile 9 days ago
      > 3) designing and implementing CISC is difficult and takes a long time — that's true... but designing and implementing a highly performant RISC is also quite difficult and takes a long time, and today the difference is mainly in the decoding stage: after it, it's all RISC-like microinstructions in either case;

      You are massively understating this. Tenstorrent has ~280 people. They are working on 2-3 CPU designs, 2-3 AI accelerators, all the IO stuff, etc. A decent percentage of these people will have to be dedicated to testing, marketing, HR, finances, legal, etc. Jim Keller projects their 8-wide Ascalon to be competitive with the upcoming Zen 5 core.

      I'll guarantee you that the Zen 5 core has way more than 280 people working on it. Even if Ascalon were just 70% of the performance of Zen 5, that would put it around Zen 2 performance and that core also required hundreds to thousands of people.

      This is an R&D economic disparity that cannot be sustained.

      This isn't limited to Tenstorrent either: Nuvia had around 300 people when it was acquired, and Qualcomm's Oryon also seems to be competitive with x86 for a fraction of the R&D cost.

      • fpgamlirfanboy 9 days ago
        i don't have an outrageous amount of experience here but i work at a place that designs and sells x86 chips (and i work on arch adjacent things). i'll just say this: do not underestimate the effect of institutional friction/momentum/complacency on "time to ship". point being: maybe RISC is more "agile" than CISC but also maybe Tenstorrent could have chosen a CISC approach and still iterated quickly simply because they're not an institution.

        edit:

        > They are working on 2-3 CPU designs, 2-3 AI accelerators, all the IO stuff, etc

        also do we actually know how much they're designing whole cloth and not buying off the shelf?

        edit2: i think Tenstorrent is great and i'll probably buy a devkit soon (as well as maybe apply lol...)

        • The_Colonel 9 days ago
          Yeah, I don't buy this argument either. Startups like that often have the best handpicked people, aren't burdened by established processes, don't need segmentation or chips designed for specific customers / use cases, don't have a brand to protect, and can generally act aggressively / take risks.

          It's kind of like comparing Photopea with Photoshop, the former being developed by a single person, and then claiming this huge productivity difference is caused by Photopea using JavaScript (vs. C++ for Photoshop).

      • UncleOxidant 9 days ago
        > I'll guarantee you that the Zen 5 core has way more than 280 people working on it.

        And I'll postulate that however many people AMD has working on a new CPU, Intel probably has at least 2x that many. Part of this is just due to the inertia at Intel; part of it is likely due to Keller succeeding at streamlining AMD's design process, while he had to give up on doing the same at Intel because of that institutional inertia.

        Would Intel be any more efficient if they were designing a RISC-V processor or would they still suffer from the same institutional inertia? I suspect the latter.

        • hajile 9 days ago
          Your argument is that this is the only difference.

          If so, then Tenstorrent should have chosen x86. They could have supported everything up through SSE3 without patent negotiations and maintained backward compatibility rather than risking it all on a new ISA. Why didn't they?

          For the same reason that software devs don't want to implement a bad API that's been continually updated for years. A new API gives you a chance to do things better even if both APIs technically do exactly the same things. You can move quickly and confidently, spending that mental energy and time on things like performance instead of contemplating weird edge cases and writing endless tests for each of them.

          The same applies to Tenstorrent and RISC-V. The cleaner ISA means you don't have to sit around thinking about how an instruction can be potentially executed for a weird side effect and how that will propagate through the pipeline. Instead, you can focus that mental energy on performance enhancing changes.

          In any case though, Intel sells 6-7x more chips than AMD, so even with 2x the people, their economies of scale are still better.

          That said, I believe a cleanroom RISC-V design by Intel would still progress faster at Intel than x86 does, and it would probably require fewer people, allowing them to put the rest on other projects; the overall reduction in team size might even further increase the design rate (as you've stated).

          • UncleOxidant 8 days ago
            No, Tenstorrent wouldn't have any advantage in choosing x86. My argument is that Intel as an organization would face the same institutional inertia developing a RISC-V CPU as it does with its x86 CPUs.

            > that said, I believe a cleanroom RISC-V design by Intel would still progress faster at Intel than x86

            Possibly. But it would still take a lot longer than at Tenstorrent or even AMD (should they decide to do this). There are a lot of tasks that Intel tends to throw bodies at that an AMD or a Tenstorrent has automated.

      • klelatti 9 days ago
        > This is an R&D economic disparity that cannot be sustained.

        AMD already has enormous scale that it can spread development costs over. Tenstorrent won’t ‘win’ vs AMD solely because it needs fewer engineers and that reduces development costs (it might win for other reasons but that’s a separate argument).

        Even if it does make a cheaper product, you then have to take into account the fact that you're asking hundreds of millions of users to switch ISAs, with the corresponding economic costs, in return for a small reduction in the purchase price.

        • hajile 9 days ago
          AMD's and Intel's chips are high-margin rather than high-volume products.

          AMD shipped 8M chips in Q4 of 2023 and Intel shipped 50M (not what the tech press would have you believe given how they rave about Zen). Even if we assume they shipped that many every quarter (Q4 is the high water mark), that's 32M chips for AMD and another 200M for Intel.

          Apple sold 235M smartphones in 2023 and total smartphone sales were around 1.2B. Apple alone outsold Intel and AMD combined and Apple is shipping their high-end cores in every one of those devices. Put simply, AMD and Intel aren't the big boys anymore. Everything you say about AMD/Intel's economies of scale apply even more to the rest of the industry.

          Why would a company make a risky expansion into the race-to-the-bottom Wintel market when the phone and server/AI markets are easier to penetrate and offer a great combination of massive economies of scale and super-high margins?

          Qualcomm is only trying to enter the PC market because they are dominating the phone market and are looking to expand their business. RISC-V won't be entering until ARM has done the hard part of forcing more ISA flexibility onto devs.

    • retrac 9 days ago
      Re: point 1 in the 1980 paper: that should be interpreted in the context of the time. Cache memory was still a relatively new idea, and uncommon. Core memory (only supplanted by semiconductor memory c. 1975) had been much slower than digital logic: core was clocked at something like 0.1-2 MHz when processors were clocked at 1-50 MHz. DRAM was faster, up to about 5 MHz by then. And SRAM was very fast; it could be used to construct no-wait-state caches. That had only been done on the largest mainframes in the 60s and early 70s, because it had been prohibitively expensive to construct such a cache out of individual flip-flops (the only technology available then).

      So one might say point 1 was so thoroughly true that it became internalized on-chip, and is why we have L1 caches.

    • librasteve 9 days ago
      Patterson and Hennessy's seminal work http://acs.pub.ro/~cpop/SMPA/Computer%20Architecture%20A%20Q... states:

      At the core is a quantitative approach to computer design and analysis that uses empirical observations of programs, experimentation, and simulation as its tools.

      So they (rightly) do not judge what is the best solution for all time; they just claim that in 1988 RISC outperformed CISC.

      As an aside, I was at a memorable CHIPS(?) conference where Dave House, a VP at Intel, followed speakers from MIPS, SPARC, etc. (the RISC guys) and opened by looking at his watch and saying "it's now 11am and Intel made more CPUs this morning than all the RISC guys ever made in their entire history" ... so process / scale cannot be ignored.

    • phkahler 9 days ago
      >> 2) code density is not as important because memory is faster and cheaper — again, memory may be cheap, but keeping the icache full is more important for performance than ever;

      RISC-V achieves better code density than x86, so the same C code when compiled takes less icache. IIRC it's not a lot smaller, just a little.

      • hajile 9 days ago
        The difference is significant when put into proper perspective.

        Average x86 instruction length is 4.25 bytes. ARM64 is a fixed 32 bits (4 bytes). RISC-V with the compressed extension averaged a little under 3 bytes when it was introduced around 10 years ago.

        So if RISC-V instruction length is ~30% shorter than x86's, why didn't RISC-V density get a similar boost? x86 has some relatively common instructions that RISC-V didn't have until just a year or two ago. There are still some things that take 4-6 instructions in RISC-V but just 1-2 instructions in x86 or ARM64.
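
        As a hedged illustration of that gap, take an indexed load; the assembly in the comments is typical compiler output, though exact sequences vary by compiler and flags:

          #include <stdint.h>

          int32_t get(int32_t *a, long i) { return a[i]; }

          /* x86-64: scaled-index addressing does it in one instruction:
                 mov  eax, dword ptr [rdi + 4*rsi]
                 ret
             RV64GC without Zba needs three:
                 slli a1, a1, 2
                 add  a0, a0, a1
                 lw   a0, 0(a0)
                 ret
             RV64 with the Zba extension closes most of the gap:
                 sh2add a0, a1, a0
                 lw     a0, 0(a0)
                 ret                                                  */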

        Over time, the code density of RISC-V will increase. Meanwhile, stuff like APX seems to require very long instructions, so compilers will be playing an interesting game where more registers and 3-register instructions decrease MOVs, but the length of the instructions also shoots up a lot.

        • klelatti 9 days ago
          > Over time, the code density of RISC-V will increase

          Why, given the pushback against adding new instructions that would achieve this?

          • hajile 9 days ago
            There isn't a pushback against new instructions. Look at the large number of extensions being worked on at the moment, let alone all the extensions already finalized and required for current profiles.

            The only issue is adding bad extensions that bleed microarchitectural details to the consumer.

      • Joker_vD 8 days ago
        > RISC-V achieves better code density than x86

        Maybe, but one of the original arguments for RISC, again, was that code density would no longer matter. Apparently, it still does.

    • gpderetta 9 days ago
      Re 4: CPUs already have huge caches. L1 sizes are limited by latency constraints, not area. CPUs are power-limited, not area-limited (see dark silicon), and designers find creative ways to use otherwise unused space (hence the giant SIMD ALUs).
      • sudosysgen 9 days ago
        The L1 latency constraint is a size constraint - if you increase the L1 size, you simply cannot avoid increasing L1 latency.
        • gpderetta 9 days ago
          That's what I meant: for performance reasons, L1 latency must be limited to single-digit cycles on a general-purpose CPU, hence L1 size is capped.
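
          The size/latency coupling is easy to observe with a dependent pointer chase; a rough sketch of mine (absolute numbers and the points where latency jumps will vary by machine):

            #include <stdio.h>
            #include <stdlib.h>
            #include <time.h>

            int main(void) {
                for (size_t kb = 4; kb <= 16384; kb *= 2) {
                    size_t n = kb * 1024 / sizeof(void *);
                    void **ring = malloc(n * sizeof *ring);
                    for (size_t i = 0; i < n; i++) ring[i] = &ring[i];
                    /* Sattolo shuffle: one random cycle, so the next load
                       address is unpredictable and prefetchers can't hide
                       the latency. */
                    for (size_t i = n - 1; i > 0; i--) {
                        size_t j = (size_t)rand() % i;
                        void *t = ring[i]; ring[i] = ring[j]; ring[j] = t;
                    }
                    void **p = ring;
                    clock_t t0 = clock();
                    for (long k = 0; k < 20000000; k++)
                        p = *p;                    /* serialized loads */
                    double ns = (clock() - t0) * 1e9 / CLOCKS_PER_SEC
                                / 20000000;
                    printf("%6zu KiB: %5.2f ns/load\n", kb, ns);
                    free(ring);
                    if (!p) return 1;  /* keep the chase from being
                                          optimized away */
                }
                return 0;
            }
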
    • rbanffy 9 days ago
      > and today the difference is mainly in the decoding stage

      Reordering RISC-like instructions is also a lot easier than reordering complex ones. A larger set of architectural registers also makes that easier.

      • Joker_vD 8 days ago
        I'd argue an empty set of general-purpose architectural registers, a set of several base/index registers, instructions that take both their sources and destinations from indexed-memory operands, coupled with a stack engine, internal register renaming, and memory remapping (to an unexposed register file with, oh I don't know, 512 or more non-architectural general-purpose registers) would work as well, if not better. You get the effect of register windows by simply passing arguments on the stack (which is actually backed by invisible registers), and it's easier to resolve dependencies when you use an essentially unbounded number of stack slots instead of a fixed number of registers.

        Look at the 6502: making an implementation of it with the zero page backed by a register file instead of RAM is entirely possible.
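
        As a toy model of that remapping idea (entirely my sketch, nothing like a real design): the front end maps stack-slot addresses to a hidden physical register file, so stack traffic never becomes real memory traffic:

          #include <stdint.h>
          #include <stdio.h>
          #include <stdlib.h>

          #define PHYS 512               /* hidden, non-architectural regs */

          static uint64_t addr_of[PHYS], value[PHYS];
          static int live[PHYS];

          /* Return the hidden register backing a stack address, allocating
             on first use. Real hardware would use a CAM, not a linear scan,
             and would spill to actual memory when the file fills up. */
          static uint64_t *slot(uint64_t addr) {
              int free_i = -1;
              for (int i = 0; i < PHYS; i++) {
                  if (live[i] && addr_of[i] == addr) return &value[i];
                  if (!live[i] && free_i < 0) free_i = i;
              }
              if (free_i < 0) { fprintf(stderr, "spill\n"); exit(1); }
              live[free_i] = 1;
              addr_of[free_i] = addr;
              return &value[free_i];
          }

          int main(void) {
              uint64_t sp = 0x7ffff000;
              *slot(sp - 8)  = 40;       /* "push" two arguments; both    */
              *slot(sp - 16) = 2;        /* land in hidden regs, not RAM  */
              printf("%llu\n",
                     (unsigned long long)(*slot(sp - 8) + *slot(sp - 16)));
              return 0;
          }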

    • shash 7 days ago
      An interesting statistic from the Shakti [1] project: after building a 5-stage RISC-V core with an FPU, the same team had a reason to write an x86 decoder. The _decode_ stage of the x86 alone was bigger than the entire pipeline of the RISC-V core.

      [1] https://shakti.org.in

    • sapiogram 9 days ago
      > 1) the memory speeds are increasing and approaching those of the CPU — that's definitely not true today;

      What does "speed" mean here? Latency? Bandwidth?

      • gpderetta 9 days ago
        It means keeping the instruction fetcher fed. So bandwidth primarily, but also latency when resteering it after a jump.
    • userbinator 9 days ago
      > the memory speeds are increasing and approaching those of the CPU — that's definitely not true today

      There was a brief time in the 80s when that was true, but it clearly hasn't been for a while. (Why else would caches be necessary?)

      > chip area used for complex instructions can be used for something better, like caches — is it really true today

      Some of Intel's CPUs can turn off sections of their caches to save power.

      > Complex instructions can be easily and more performantly simulated by combining several simpler ones

      That now happens with RISCs too, e.g. there are ARM instructions which get split into multiple uops, and of course a larger number of simpler instructions is essentially what uops are. Patterson wrote the paper at a time when the Pentium Pro's uop-decoding approach didn't exist and "traditional" CISCs were all sequentially microcoded.

      > Apples-to-apples comparisons here are not quite easy

      Pun intended? I believe there was plenty of discussion on that when they came out with the M1, and some have attributed that to a process advantage rather than uarch or ISA.

  • pjdesno 9 days ago
    One significant event not mentioned was the failure of AMD's Barcelona server architecture in 2008 or so. Opteron had been keeping Intel on its toes, and Barcelona with nested page tables (hardware memory virtualization) would have continued their run as a strong server competitor. But there was a bug in their nested page tables that would occasionally corrupt memory, and they had to sell it as the gamer-oriented Phenom, turning off the virtualization support. It took years to recover.
  • hajile 9 days ago
    I glanced over the Intel APX spec paper while reading this article.

    https://cdrdv2.intel.com/v1/dl/getContent/784266

    It looks like it uses the 4-byte EVEX prefix. It would also need 1-2 operand bytes and another register byte.

    That's 6-7 bytes for basic 3-register instructions. If you need to do something with an immediate/displacement value on the top 16 registers, that seems like it would be potentially 10-12 bytes long including 4 bytes for the value and 1 byte for the scaled index for displacement.

    I'll admit I don't have time to read the article at the moment, and I'm wondering if my quick look is wrong. If not, I don't see how the extra registers are going to be worth the massive instruction length.

    • jcranmer 9 days ago
      > It would also need 1-2 operand bytes

      Only 1 operand byte. The VEX and EVEX prefixes replace the 0f/0f38/0f3a opcode prefixes with an opcode map selector, and the 66/f2/f3 prefixes are also shoved into some more bits in the prefix. The only prefixes not subsumed by the EVEX prefix are F0 (the LOCK prefix), the segment selectors, and the address-size override prefix... the last of which should never be used in 64-bit code anyway, and the others are pretty rare in instructions.

      If you've got a 3-byte opcode that needs the REX prefix, you're going from a 5-byte encoding to a 6-byte encoding, which isn't too bad, but it is going to hurt for those instructions that previously could get away with 2 bytes.

      • hajile 9 days ago
        An extra byte to eliminate a MOV via a 3-register instruction is a no-brainer savings.

        The extra 16 registers are going to be a lot more fun, though. They will also eliminate a lot of those MOVs, but every access will pay a 17% penalty going from 5 to 6 bytes.

        Meanwhile, ARM/RISC-V can use 32 registers and 3-register assignments with just 2/4 bytes.
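
        Rough encoded sizes for a register-register add under the descriptions above (my back-of-envelope tally, not measured from real binaries):

          x86-64 legacy  add rax, rbx       3 bytes (REX.W + opcode + ModRM)
          x86-64 APX     add r16, r17, r18  6 bytes (4-byte EVEX + opcode + ModRM)
          ARM64          add x0, x1, x2     4 bytes (fixed width)
          RV64           add a0, a1, a2     4 bytes
          RV64C          c.add a0, a1       2 bytes (2-operand destructive form)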

  • phkahler 9 days ago
    x86 is already legacy. AMD64 deprecated a bunch of the old stuff, but retained it for backward compatibility. This idea of tweaking x64 with new registers and suppressing flag modifications just tells me RISC-V got those things right. But the complexity (circuitry to handle flags) will remain even with this spring cleaning.

    I'm still waiting for AMD or Intel to make a RISC-V chip that drops in one of their existing sockets. Yeah, you'll need to replace the BIOS and such but I'm looking to buy a new motherboard anyway.

    • yjftsjthsd-h 9 days ago
      > I'm still waiting for AMD or Intel to make a RISC-V chip that drops in one of their existing sockets. Yeah, you'll need to replace the BIOS and such but I'm looking to buy a new motherboard anyway.

      If you need to replace the motherboard, what's the point of reusing the existing socket? That said, I think I've seen this done, although the details are eluding me and web searches aren't helping; I think there was, once upon a time, a motherboard that could swap between x86 and Alpha processors with only minor reconfiguration (minor as in changing a jumper).

      • toast0 9 days ago
        > I think there was, once upon a time, a motherboard that could swap between x86 and Alpha processors with only minor reconfiguration (minor as in changing a jumper).

        Deep into rumors that have been circulating in my brain for two decades now... But supposedly Slot A for the (original) K7 Athlon was very similar to the Alpha CPU bus (Wikipedia says "The Athlon utilizes the Alpha 21264's EV6 bus architecture with double data rate (DDR) technology.[citation needed]"). The K7 team had some people who had worked on the Alpha at DEC, so there were lots of comparisons drawn.

        • whaleofatw2022 9 days ago
          IIRC there were at least a couple of Samsung Alpha boards that used AMD's 750/760 chipsets: the UP1100 and UP1500.
      • phkahler 9 days ago
        >> If you need to replace the motherboard, what's the point of reusing the existing socket?

        Zero hardware changes for the board maker. I'm not sure it would be possible to reflash from an x86 BIOS to a RISC-V BIOS and reboot. That would be cool though. Or maybe they'd have a jumper to select which is expected (to switch between 2 BIOS images) and everything else would be identical.

    • rbanffy 9 days ago
      > I'm still waiting for AMD or Intel to make a RISC-V chip that drops in one of their existing sockets

      There was once a MIPS CPU that was a drop-in replacement for the 80486. It required, of course, a different BIOS.

      This is one good thing OpenFirmware had going for it.

    • shash 7 days ago
      IIRC, Zen at one point had an ARM decode stage. I don't know if it was intended commercially, or if they just did it as an experiment...
  • snvzz 9 days ago
    The first batch of very high performance RISC-V chips is dropping soon, and there will be servers as well as Android devices.

    These microarchitectures come from many vendors at once, and will make a significant impact.

    Not long after, Windows for RISC-V will appear; Microsoft has been working on it for years, while steering the RISC-V committee to prioritize its requirements.

    Thanks to emulation, x86 software lock-in is no more, and the moat loses its significance.

    AMD and/or Intel will offer RISC-V CPUs with x86 acceleration as a distinguishing feature, thus retaining a decent slice of the market through this transition, which will not take anywhere near as long as the x86 32-to-64 transition did.

    • yzmtf2008 9 days ago
      > The first batch of very high performance RISC-V chips is dropping soon, and there will be servers as well as Android devices.

      Is this the new "2024 is the year of the Linux desktop"?

    • The_Colonel 9 days ago
      > Thanks to emulation, x86 software lock-in is no more, and the moat loses its significance.

      With reduced performance and efficiency, so one might ask what the point is for a consumer to buy such a device. PC laptop vendors don't have anywhere close to the market power to force such a change. Meanwhile, AMD/Intel can simply lower their prices for a while whenever RISC-V PCs/laptops appear to become a potential threat.

    • dmitrygr 9 days ago
      > Thanks to emulation, x86 software lock-in is no more, and the moat loses its significance.

      Tell me you've never attempted to performantly emulate x86 without saying those words. As a small example, go look into the difference between the x86 memory model and the weak memory models used by modern architectures when it comes to the ordering of accesses as seen by other cores.

      Now imagine emulating this correctly on a weak-memory-model system.
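
      A concrete instance of the gap, in C11 atomics (my sketch; needs a libc that ships <threads.h>): x86-TSO makes plain stores release-like and plain loads acquire-like, so even the fully relaxed version of this works on x86, while on ARM/RISC-V it can legally print 0. That's why a correct emulator ends up strengthening nearly every x86 memory access, or relying on a hardware TSO mode like Apple's.

        #include <stdatomic.h>
        #include <stdio.h>
        #include <threads.h>

        atomic_int data, flag;

        int producer(void *arg) {
            (void)arg;
            atomic_store_explicit(&data, 42, memory_order_relaxed);
            /* An x86 mov store already behaves like this release store;
               a translator targeting ARM must emit stlr or a fence. */
            atomic_store_explicit(&flag, 1, memory_order_release);
            return 0;
        }

        int consumer(void *arg) {
            (void)arg;
            while (!atomic_load_explicit(&flag, memory_order_acquire))
                ;                          /* spin until flag is published */
            /* Guaranteed to print 42: the acquire load orders the read. */
            printf("data = %d\n",
                   atomic_load_explicit(&data, memory_order_relaxed));
            return 0;
        }

        int main(void) {
            thrd_t a, b;
            thrd_create(&a, producer, NULL);
            thrd_create(&b, consumer, NULL);
            thrd_join(a, NULL);
            thrd_join(b, NULL);
            return 0;
        }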