Tiny GPU: A minimal GPU implementation in Verilog

(github.com)

314 points | by fgblanch 10 days ago

9 comments

  • userbinator 9 days ago
    Because the GPU market is so competitive, low-level technical details for all modern architectures remain proprietary.

    Except for Intel, which publishes lots of technical documentation on their GPUs: https://kiwitree.net/~lina/intel-gfx-docs/prm/

    You can also find the i810/815 manuals elsewhere online, and apart from an odd gap between those and the 965 (i.e. the 855/910/915/945 are missing for some reason), they've been pretty consistent with the documentation.

    • kimixa 9 days ago
      AMD also publish a fair bit of documentation - https://www.amd.com/en/developer/browse-by-resource-type/doc...

      It includes full ISA documentation for their current and past offerings, though it looks like they tend to be aimed at implementors rather than at a "high level" description for interested enthusiasts.

      • adrian_b 7 days ago
        The AMD documentation consists mostly of very laconic descriptions of the (hundreds or thousands of) registers and of their bit fields.

        There is almost no explanation of how they are intended to be used or of the detailed microarchitecture of their GPUs. For that, the best option remains reading the source code of their Linux drivers, though even that is not always as informative as it could be, as some of the C code may have been automatically generated from some other form used internally by AMD.

        The Intel documentation is much more complete.

        Nevertheless, AMD has recently promised on several occasions to publish additional GPU documentation and to open-source additional parts of their GPU drivers in the near future, so hopefully the quality of their documentation will eventually match that of Intel again, as it did until around 2000.

    • matheusmoreira 9 days ago
      The Linux drivers are also high quality and mainlined. Wish every company followed their lead.
      • EasyMark 9 days ago
        My cheap little Dell laptop is the most solid machine I have in my house. I haven't seen it crash yet, and I half suspect it's because it's the only one with just an Intel GPU in it :). My tower machine can go a few days without crashing, but inevitably it will, whether after 3 days or a week. Nvidia card. It's not often enough to really worry about, but I've been thinking about switching to an AMD card, if I can find a second-hand one reasonably priced.
    • xeonmc 9 days ago
      Somewhat relevant, from 2018:

      [The Thirty Million Line Problem - Casey Muratori](https://www.youtube.com/watch?v=kZRE7HIO3vk)

  • jgarzik 9 days ago
    Nice! I warmly encourage open-core GPU work.

    Here's another: https://github.com/jbush001/NyuziProcessor

    • joe_the_user 9 days ago
      What would be nice would be a bare-bones CUDA implementation for one of these open-core processors.

      What size run would be needed for TSMC or some other fab to produce such a processor economically?

  • ginko 9 days ago
    Really cool project, I love seeing HW projects like this in the open. But I'd argue that this is a SIMD coprocessor. For something to be a GPU it should at least have some sort of display output.

    I know the terminology has gotten quite loose in recent years with Nvidia & Co. selling server-only variants of their graphics architectures as GPUs, but the "graphics" part of GPU designs makes up a significant part of the complexity, to this day.

    • jdiff 9 days ago
      If it processes graphics, I think it counts, even if it has no output. There's still use for GPUs even if they're not outputting anything. My place of work has around 75 workstations with mid-tier Quadros, but they only have mini-DisplayPort and my employer only springs for HDMI cables, so they're all hooked into the onboard graphics. The cards still accelerate our software, they still process graphics, they just don't output them.
      • Narishma 9 days ago
        > If it processes graphics, I think it counts, even if it has no output.

        That's not a good definition, since a CPU or a DSP would count as a GPU. Both have been used for that purpose in the past.

        > There's still use for GPUs even if they're not outputting anything.

        The issue is not their existence, it's about calling them GPUs when they have no graphics functionality.

        • vineyardlabs 9 days ago
          Graphics functionality != display output. What about laptop GPUs, which don't necessarily output to the screen at all times? Sometimes they don't even have the capability to do so. If it's a coprocessor working alongside the general processor for the primary purpose of accelerating graphics computing workloads, it seems appropriate to call it a GPU.

          Edit: perhaps your point is that it doesn't make sense to call a device designed primarily to accelerate ML workloads, or just general-purpose vector calculations, a GPU. In that case I'd agree that GPU isn't the right name.

          • omikun 9 days ago
            >> Graphics functionality != display output

            Exactly. Graphics functionality also includes graphics-specific hardware like vertex and fragment processing, which this does not have. It has no graphics-specific hardware, ergo not a GPU.
        • jdiff 9 days ago
          If it looks like a duck and it walks like a duck, why is it not a duck? If you are using a DSP to process graphics, then at least in the context of your system it has become your graphics processor.

          Plenty of GPUs don't have (or aren't used for their) display output. It's a GPU because of what it does: graphics processing. Not because of what connectivity it has.

          • Narishma 9 days ago
            But it doesn't do graphics, so it shouldn't be called GPU. That's the whole point of this thread.
            • samus 9 days ago
              But it does - it just needs an application to retrieve the buffer and do something with it. For example pushing it to storage.
            • jdiff 9 days ago
              It does do graphics. Calculating graphics is different from handling display output. You can separate the two.

              Like someone else mentioned, laptops often have discrete graphics cards that are not wired to display hardware at all, needing to shuffle framebuffers through the onboard graphics when something needs to make its way to a screen.

              • Narishma 9 days ago
                > Like someone else mentioned, laptops often have discrete graphics cards that are not wired to display hardware at all, needing to shuffle framebuffers through the onboard graphics when something needs to make its way to a screen.

                Those are GPUs even if they aren't connected to a display because they still have graphics components like ROPs, TMUs and whatnot.

                • jdiff 9 days ago
                  You're free to define it that way, but that's substantially different from GP's "if it's not a display adapter, it's not a GPU" that I was pushing against. It does seem pretty fragile to define a GPU in terms of the particular architecture of the day, though. There's plenty of things called GPUs that don't/didn't have TMUs, for example.
        • samus 9 days ago
          CPUs and DSPs are not primarily designed for graphics work, therefore they don't count as GPUs. CPUs are general-purpose; DSPs might be abused for graphics work.

          The "G" in GPU doesn't imply that they have to render directly to a screen. In fact, professional graphics cards are commonly used for bulk rendering for animating videos.

          Datacenter GPUs are mostly used for AI these days, but they can nevertheless do graphics work very well, and if they are used for generative AI or if their built-in super sampling capability is used, the distinction becomes rather blurry.

          • Narishma 8 days ago
            But this particular one isn't designed for graphics work either, so it shouldn't be called GPU.
            • samus 8 days ago
              It's in the very name: "Tiny-GPU". Since it's a demonstration project by a hobbyist, the author probably didn't want to implement the whole optimized rendering stack yet.

              On the other hand, they also left out some features that you'd expect to find on a general-purpose compute accelerator.

              For example, they focus on tensor math. No support for bit wrangling and other integer math. No exotic floating point formats. Minimal branching capabilities.

              • Narishma 7 days ago
                The name is what I'm contesting. It's called Tiny-GPU but there's no mention of graphics functionality anywhere in the project.
                • samus 7 days ago
                  Graphics pretty much boils down to matrix multiplication, and that's exactly what this thing accelerates. If it were a generalized accelerator, it would have to support other kinds of arithmetic as well.
                  • Narishma 7 days ago
                    Agree to disagree. I'll stop here because we're just wasting time running in circles.
      • omikun 9 days ago
        It's the shader core of a GPU. There are no graphics-specific pipelines, e.g. vertex processing, culling, rasterization, color buffers, depth buffers, etc. That's like saying a CPU is also a GPU if it runs graphics in software.
  • piotrrojek 9 days ago
    Really awesome project. I want to get into FPGAs, but honestly it's hard to even grasp where to start, and the whole field feels very intimidating. My eventual goal would be to create an acceleration card for LLMs (completely arbitrary), so a lot of the same bits and pieces as in this project, probably except for the memory offloading part needed to load bigger models.
    • Aromasin 9 days ago
      Reframe it in your mind. "Getting into FPGAs" needs to be broken down. There are so many subsets of skills within the field that you need to level expectations. No one expects a software engineer to jump into things by building a full computer from first principles, writing an instruction set architecture, understanding machine code, converting that to assembly, and then developing a programming language so that they can write a bit of Python code to build an application. You start from the top and work your way down the stack.

      If you abstract away the complexities and focus on building a system using some pre-built IP, FPGA design is pretty easy. I always point people to something like MATLAB, so they can create some initial applications using HDL Coder on a DevKit with a Reference design. Otherwise, there's the massive overhead of learning digital computing architecture, Verilog, timing, transceivers/IO, pin planning, Quartus/Vivado, simulation/verification, embedded systems, etc.

      In short, start with some system-level design. Take some plug-and-play IP, learn how to hook it together at the top level, and insert that module into a prebuilt reference design. Eventually, peel back the layers to reveal the complexity underneath.

    • checker659 9 days ago
      I'm in the same boat. Here's my plan.

      1. Read Harris, Harris → Digital Design and Computer Architecture. (2022). Elsevier. https://doi.org/10.1016/c2019-0-00213-0

      2. Follow the author's RVFpga course to build an actual RISC-V CPU on an FPGA → https://www.youtube.com/watch?v=ePv3xD3ZmnY

      • dailykoder 9 days ago
        Love the Harris and Harris book!

        I might add these:

        - Computer Architecture, Fifth Edition: A Quantitative Approach - https://dl.acm.org/doi/book/10.5555/1999263

        - Computer Organization and Design RISC-V Edition: The Hardware Software Interface - https://dl.acm.org/doi/10.5555/3153875

        both by Patterson and Hennessy

        Edit: And if you want to get into CPU design and can get a grip on "Advanced Computer Architecture: Parallelism, Scalability, Programmability" by Kai Hwang, then I'd recommend that too. It's super old and some things are probably done differently in newer CPUs, but it's exceptionally good for learning the fundamentals. Very well written. But I think it's hard to find a good (physical) copy.

    • ruslan 8 days ago
      I would suggest the following route:

      1. Clone this educational repo: https://github.com/yuri-panchul/basics-graphics-music - a set of simple labs for those learning Verilog from scratch. It's written by Yuri Panchul, who worked at Imagination developing GPUs, by the way. :)

      2. Obtain one of the dozens of supported FPGA boards and some accessories (keys, LEDs, etc).

      3. Install Yosys and friends.

      4. Perform as many labs from the repo as you can, starting from lab01 - DeMorgan.

      You can work through the labs while reading Harris & Harris. Once done with the labs and the book, it's time to start your own project. :)

      PS: They have a weekly meetup at Hacker Dojo; you can participate by Zoom if you are not in the Valley.

    • samvher 9 days ago
      I don't know where you are in your journey, but I found these resources very helpful to better understand digital logic and CPU/GPU architecture:

      1. https://learn.saylor.org/course/CS301

      2. https://www.coursera.org/learn/comparch

      3. https://hdlbits.01xz.net/wiki/Main_Page

    • imtringued 9 days ago
      If you want to accelerate LLMs, you will need to know the architecture first. Start from that. The hardware is actually both the easy part (design) and the hard part (manufacturing).
    • lusus_naturae 8 days ago
      A simple project is implementing a FIR filter in an HDL like Verilog. The Altera university-program FPGA boards are cheap enough.
    • IshKebab 9 days ago
      You might want to pick a more realistic goal! An FPGA capable of accelerating LLMs is going to cost at least tens of thousands, probably hundreds.
      • JoachimS 9 days ago
        Depends heavily on what system it is supposed to provide acceleration for.

        If it is an MCU based on a simple ARM Cortex-M0, M0+, M3 or a RISC-V RV32I core, then you could use an iCE40 or similar FPGA to provide a big speedup just by using the DSPs and the large SPRAM.

        Basically, you add the custom compute operations and memory that don't exist in the MCU - operations that would otherwise take many instructions to do in SW. Also, offloading to the FPGA AI 'co-processor' frees up the MCU to do other things.

        The kernel operations in the Tiny GPU project are actually really good examples of things you could efficiently implement in an iCE40UP FPGA device, resulting in substantial acceleration. And using EBRs (block RAM) and/or the SPRAM for block queues would make a nice interface to the MCU.

        One could also implement a RISC-V core in the FPGA, thus having a single chip with a low-latency interface to the AI accelerator. You could even implement the AI accelerator as a set of custom instructions. There are so many possible solutions!

        An iCE40UP5K FPGA will set you back 9 EUR in single quantities.

        This concept of course scales up to the performance and cost levels you're talking about, with many possible steps in between.
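
        As a rough illustration of the kind of kernel this points at (my own sketch, not taken from the tiny-gpu repo): a signed multiply-accumulate written plainly enough that synthesis tools such as Yosys can usually map the multiplier onto an iCE40UP DSP block (SB_MAC16).

          module mac16 (
              input  wire               clk,
              input  wire               clear,   // zero the accumulator
              input  wire               en,      // accept a new weight/activation pair
              input  wire signed [15:0] a,
              input  wire signed [15:0] b,
              output reg  signed [31:0] acc
          );
              // A plain "*" is enough for the tools to infer a hardware multiplier;
              // on an iCE40UP part this typically ends up in an SB_MAC16 DSP block.
              always @(posedge clk) begin
                  if (clear)
                      acc <= 32'sd0;
                  else if (en)
                      acc <= acc + a * b;
              end
          endmodule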

        • rjsw 9 days ago
          Or use one of the combined CPU+FPGA chips like the AMD/Xilinx Zynq, there are plenty of low cost dev boards for them.
          • JoachimS 9 days ago
            Sure, a good example of a step between a really tiny system and 100k+ systems.
      • imtringued 9 days ago
        Something that seems to be hardly known is that the transformer architecture needs to become more compute bound. Inventing a machine learning architecture that is FLOPs-heavy instead of bandwidth-heavy would be a good start.

        It could be as simple as using a CNN instead of a V matrix. Yes, this makes the architecture less efficient, but it also makes it easier for an accelerator to speed it up, since CNNs tend to be compute bound.
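
        A rough back-of-the-envelope for why single-stream transformer decoding is bandwidth bound rather than compute bound (my own illustrative numbers: a generic 7B-parameter model, fp16 weights, batch size 1):

          ~7e9 weights x 2 bytes         -> ~14 GB read per generated token
          ~2 FLOPs per weight per token  -> ~14 GFLOPs per token
          arithmetic intensity           -> roughly 1 FLOP per byte

        A big datacenter GPU offers on the order of 100+ fp16 FLOPs per byte of memory bandwidth, so at batch size 1 most of the ALUs sit idle waiting on memory; reusing each weight across many activations (larger batches, or conv-style weight sharing as suggested above) is what pushes the workload toward being compute bound.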

  • vineyardlabs 9 days ago
    Is there a reason they're mixing non-blocking and blocking assignment operators in sequential always blocks here?
    • urmish 9 days ago
      looks like those are local variables
    • CamperBob2 9 days ago
      You can feel free to do that, if you're not too hung up on simulation-synthesis matching.
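
      For readers newer to Verilog, here's a minimal sketch (mine, not from the repo) of the pattern being described: a blocking assignment used as a local intermediate inside a clocked block, with non-blocking assignments for the real registered state. Style guides differ on mixing the two, but as long as the intermediate is always written before it is read in the same block, simulation and synthesis agree:

        module mac_step (
            input  wire        clk,
            input  wire [7:0]  a,
            input  wire [7:0]  b,
            output reg  [15:0] acc
        );
            reg [15:0] product;            // local intermediate; no flip-flop is inferred

            always @(posedge clk) begin
                product = a * b;           // blocking: value is usable on the next line
                acc     <= acc + product;  // non-blocking: the actual registered state
            end
        endmodule
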
  • novaRom 9 days ago
    I did something similar many years ago in VHDL. There was a site called OpenCores for various open-source HDL projects. I wonder whether any good HPC-level, large-scale distributed HDL simulator exists today? It would make sense to utilize modern GPUs for RTL-level simulation.
  • mk_stjames 9 days ago
    Uh, the ALU implements a DIV instruction straight up at the hardware level? Is this normal to have as a real instruction in something like a modern CUDA core, or is DIV usually a software emulation instead? Because actual hardware divide circuits take up a ton of space, and I wouldn't have expected them in a GPU ALU.

    It's so easy to write "DIV: begin alu_out_reg <= rs / rt; end" in your Verilog, but that one line takes a lot of silicon. And the person simulating this might never see that if all they do is simulate the Verilog.
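
    For a sense of what that single "/" replaces, here is a rough sketch (mine, not from the project) of the usual alternative: an iterative restoring divider that spends one cycle per quotient bit instead of burning area on a big combinational divide array.

      module div8_seq (
          input  wire       clk,
          input  wire       start,
          input  wire [7:0] dividend,
          input  wire [7:0] divisor,     // divide-by-zero not handled here
          output reg  [7:0] quotient,
          output reg        done
      );
          reg [8:0] rem;                 // partial remainder, one spare bit
          reg [7:0] dvd;                 // dividend bits still to be consumed
          reg [3:0] count;
          reg       busy;

          always @(posedge clk) begin
              done <= 1'b0;
              if (start && !busy) begin
                  busy  <= 1'b1;
                  rem   <= 9'd0;
                  dvd   <= dividend;
                  count <= 4'd8;
              end else if (busy) begin
                  // shift in the next dividend bit, then try to subtract the divisor
                  if ({rem[7:0], dvd[7]} >= {1'b0, divisor}) begin
                      rem      <= {rem[7:0], dvd[7]} - {1'b0, divisor};
                      quotient <= {quotient[6:0], 1'b1};
                  end else begin
                      rem      <= {rem[7:0], dvd[7]};
                      quotient <= {quotient[6:0], 1'b0};
                  end
                  dvd   <= {dvd[6:0], 1'b0};
                  count <= count - 4'd1;
                  if (count == 4'd1) begin
                      busy <= 1'b0;
                      done <= 1'b1;
                  end
              end
          end
      endmodule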

    • daghamm 9 days ago
      This is just someone learning Verilog.

      The project stops at simulation, making real hardware out of this requires much more work.

  • Narishma 9 days ago
    Yet another "GPU" providing no graphics functionality. IMO theses should be called something else.
    • Lichtso 9 days ago
      The first question is why there is a divide between CPUs and GPUs in the first place. Yes, the gap is closing and both categories are adding features of one another, but there still is a significant divide. IMO it has to do with Amdahl's law [0]. In that sense CPUs should be called Latency-Optimizing-Processors (LOPs) and GPUs should be called Throughput-Optimizing-Processors (TOPs).

      More specifically [1] we could also call CPUs long / deep data dependency processors (LDDPs) and GPUs wide / flat data dependency processors (WDDPs).

      [0]: https://en.wikipedia.org/wiki/Amdahl%27s_law

      [1]: https://en.wikipedia.org/wiki/Data_dependency

    • 127 9 days ago
      TPU, a Tensor Processing Unit

      Tensors are just n-dimensional arrays

      Then you can run software (firmware) on top of the TPU to make it behave like a GPU.

    • deivid 9 days ago
      I've been thinking about starting a project to build a 'display adapter', but I've gotten stuck before starting, as I wasn't able to figure out what the communication protocol is between UEFI's GOP driver and the display adapter. I've been trying to piece it together from EDK2's source, but it's unclear how much of it is QEMU-specific.
    • tossandthrow 9 days ago
      I think the established term is AIA, AI Accelerator.
      • fancyfredbot 9 days ago
        I have seen the term NPU used in reference to neural network accelerators a lot. I think AMD, Intel and Qualcomm all use this term for their AI accelerators. I think Apple call their AI accelerators neural engines, but I've definitely heard others refer to these as NPUs even though that's not their official name.

        I'll be honest, I've never heard the AIA acronym used in this way. It seems all acronyms for processors need to end in PU, for better or for worse.

      • n4r9 9 days ago
        That would ignore applications like crypto mining, which I'm guessing is still a biggie.

        What is it exactly that sets these units apart from CPUs? Something to do with the parallel nature of the hardware?

        • pjc50 9 days ago
          The distinction that seems to be important is the warp-thread architecture: multiple compute units sharing a single program counter, but instead of the SIMD abstraction they are presented as conceptually separate threads.

          Also, they tend to lack interrupt mechanisms and virtualization, at least at the programmer API level (NVIDIA systems usually have these, but managed by the proprietary top-level scheduler).
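
          A toy sketch of that sharing (my own illustration, not tied to any real ISA): one program counter advanced once per cycle, with every lane applying the same decoded operation to its own private state.

            module warp_lanes #(parameter LANES = 4) (
                input  wire       clk,
                input  wire [7:0] decoded_imm,   // stand-in for the shared decoded instruction
                output reg  [7:0] pc             // single shared program counter
            );
                reg [7:0] lane_reg [0:LANES-1];  // per-lane private register
                integer i;

                always @(posedge clk) begin
                    pc <= pc + 8'd1;             // one fetch/advance for the whole warp
                    for (i = 0; i < LANES; i = i + 1)
                        // same operation everywhere, different data per lane
                        lane_reg[i] <= lane_reg[i] + decoded_imm;
                end
            endmodule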

        • Narishma 9 days ago
          CPUs are also pretty parallel. They have multiple cores, each of which can execute multiple instructions working on multiple data items all in a single clock cycle.
        • tossandthrow 9 days ago
            The fact that they are not central; they work as coprocessors.

            However, a CPU could easily embed an AIA, and certainly some do.

    • checker659 9 days ago
      GPGPU
    • andersa 9 days ago
      Easy, it's now a General Processing Unit. Or perhaps a Great Processing Unit?
      • barkingcat 9 days ago
        Is that pronounced gee pee you, or Gip Pee You?
      • Narishma 9 days ago
        But how is that different from a CPU?
        • ginko 9 days ago
          It's not the central processor.
          • trollied 9 days ago
            It's a coprocessor. They have existed for a very long time.
        • andersa 9 days ago
          It starts with a G.
    • cbm-vic-20 9 days ago
      MPU- Matrix Processing Unit.
    • djmips 9 days ago
      Haha I love this project but it's just PU
    • how2dothis 9 days ago
      ...Not meant to sound offending... But, but ...a graphics card has ports (to attach a keyboard to), RAM (very fast), CPUs (many), and may be used like a full computer, even without being driven by someone else like ...you - I suspect, no?

      ...I for my part want to say thanks for the findings! :-)

      [Setting: weekend mode]

  • Jasper_ 9 days ago
    > Since threads are processed in parallel, tiny-gpu assumes that all threads "converge" to the same program counter after each instruction - which is a naive assumption for the sake of simplicity.

    > In real GPUs, individual threads can branch to different PCs, causing branch divergence where a group of threads initially being processed together has to split out into separate execution.

    Whoops. Maybe this person should try programming for a GPU before attempting to build one out of silicon.

    Not to mention the whole SIMD that... isn't.

    (This is the same person who stapled together other people's circuits to blink an LED and claimed to have built a CPU)

    • bootsmann 9 days ago
      Isn't the first just equivalent to calling __syncthreads() on every launch?
      • Jasper_ 9 days ago
        No, that effectively syncs all warps in a thread group. This implementation isn't doing any synchronization; it's independently doing PC/decode for different instructions and just assuming they won't diverge. That's... a baffling combination of decisions; why do independent PC/decode if the threads are never going to diverge? It reads as a very basic failure to understand the core fundamental value of a GPU. And this isn't a secret GPU architecture thing. Here's a slide deck from 2009 going over the actual high-level architecture of a GPU. Notice how fetch/decode are shared between threads.

        https://engineering.purdue.edu/~smidkiff/ece563/slides/GPU.p...
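
        For contrast, a very rough sketch (mine, not from tiny-gpu or those slides) of how divergence is usually handled while keeping one shared PC and decoder: a per-lane execution mask gates write-back, and the warp runs each side of a branch in turn before the full mask is restored at the reconvergence point.

          module warp_writeback #(parameter LANES = 4) (
              input  wire               clk,
              input  wire [LANES-1:0]   exec_mask,   // 1 = lane is active on the current path
              input  wire [8*LANES-1:0] alu_result,  // flattened per-lane ALU outputs
              output reg  [8*LANES-1:0] lane_reg     // flattened per-lane destination registers
          );
              integer i;
              // One shared instruction stream; masked-off lanes simply skip write-back.
              always @(posedge clk) begin
                  for (i = 0; i < LANES; i = i + 1)
                      if (exec_mask[i])
                          lane_reg[8*i +: 8] <= alu_result[8*i +: 8];
              end
          endmodule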

      • stanleykm 9 days ago
        syncthreads synchronizes threads within a threadgroup and not across all threads.
      • hyperbovine 9 days ago
        Which experienced CUDA programmers do anyways!