An Introduction to GPU Programming in Julia

(nextjournal.com)

290 points | by simondanisch 2010 days ago

8 comments

  • maxbrunsfeld 2010 days ago
    > GPUArrays never had to implement automatic differentiation explicitly to support the backward pass of the neural network efficiently. This is because Julia's automatic differentiation libraries work for arbitrary functions and emit code that can run efficiently on the GPU. This helps a lot to get Flux working on the GPU with minimal developer effort - and makes Flux GPU support work efficiently even for user defined functions. That this works out of the box without coordination between GPUArrays + Flux is a pretty unique property of Julia.

    Every time I read about Julia, I’m amazed. What a game changing tool.
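
    The mechanism behind that quote is that any plain Julia function broadcast over a GPU array gets compiled into a device kernel, and the backward-pass code the AD library emits is just more Julia code. A minimal sketch of the broadcast part alone (not Flux itself; the function and sizes are made up):

        using CuArrays

        # an arbitrary user-defined scalar function; nothing GPU-specific about it
        myact(x) = x * tanh(x)

        x = cu(rand(Float32, 10^6))      # move the data to the GPU
        y = myact.(2f0 .* x .+ 1f0)      # the fused broadcast compiles to one GPU kernel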

  • daenz 2010 days ago
    GPGPU (general-purpose GPU) programming is pretty cool. I wrote a utility to let you do it in JavaScript, in the browser, a while back: https://github.com/amoffat/gpgpu.js

    The thing to note about GPU programming is that the vast majority of the overhead comes from data transfer. Sometimes it is net faster to do the computation on the CPU if your data set and results are very large, even if the GPU performs each calculation faster on average thanks to parallelism. To illustrate, look at the benchmarks of gpgpu.js running a simple kernel:

      CPU: 6851.25ms
      GPU Total: 1449.29ms
      GPU Execution: 30.64ms
      GPU IO: 1418.65ms
      Theoretical Speedup: 223.59x
      Actual Speedup: 4.73x
    
    The theoretical speedup excludes data transfer while actual speedup includes it. The longer you can keep your data set on the GPU to do more calculations (avoiding back and forth IO), the bigger your net speed gains are.
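
    In the article's Julia/CuArrays terms, the same idea looks roughly like this (a sketch; sizes and operations are made up):

      using CuArrays

      x = cu(rand(Float32, 10^7))        # pay the host-to-device transfer once

      # keep the intermediates on the device and only bring back a scalar
      total = sum(sqrt.(abs.(x)) .+ 1f0)

      # copying the full array back between steps, e.g. Array(sqrt.(abs.(x))),
      # would round-trip the buffer and eat most of the theoretical speedup
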
    • johndough 2010 days ago
      Note that `gl.finish` is often implemented as non-blocking in WebGL, so measuring query submission time on the host (cpu) does not necessarily reflect execution time on the device (gpu). It is more accurate to use timer queries. For more info see: https://stackoverflow.com/questions/20798294/is-it-possible-...
    • pjmlp 2010 days ago
      WebGL compute shaders are being discussed; they could be a possible improvement for your library when they finally get adopted.
      • johndough 2010 days ago
        Is there any news on WebGL compute shaders?

        I remember OpenCL for Firefox being discussed and discarded in favor of compute shaders about 7 years ago, and then when WebGL 2.0 was finally released 5 years later, compute shaders were not part of it.

        Additionally, SIMD.js was developed and then killed again, and support for SIMD in WebAssembly has been delayed, so I don't believe we'll be able to implement efficient numerical methods in the browser any time soon.

      • obl 2010 days ago
        I wonder how they are going to pull that off without security implications.

        If you've played with compute shaders (or any of the modern "general purpose" shader stuff, i.e. arbitrary loads & stores, etc.) you probably know that it's quite easy to crash drivers. Of course, you generally do so by provoking some form of UB (although not always; their runtimes/compilers are far from bug-free).

        But WebGL can't have that, so I don't see how they could pull this off without adding a ton of runtime bounds checks to shaders, like they do for index buffers right now, but on the GPU side this time.

        Not only would that be bad for performance, but I still would never trust this whole stack to run arbitrary code.

        • fulafel 2010 days ago
          Compute shaders are no different in this regard from other types of shaders that are currently supported by WebGL. They have the same capabilities and data types as other shaders. The difference is when they are invoked and the lack of implicit inputs/outputs. This is needed to perform computation out of lockstep with the graphics pipeline when you are also rendering graphics.

          The existing WebGL 1 & WebGL 2 shader compilation process does involve "a ton of checks": shaders are rewritten to have known-safe memory accesses, to eliminate uninitialized memory/variables, etc.

          Usually bounds checking does not involve a big cost in compilers: statically known sizes are zero cost, and most dynamic checks can be hoisted out of loops.
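
          To illustrate the hoisting point with a generic sketch (ordinary CPU-side Julia, not a shader; names made up): rather than checking every access inside the loop, the compiler can establish the bound once before it:

              # naive version: a bound check on every iteration
              function sum_checked(a, n)
                  s = 0f0
                  for i in 1:n
                      s += a[i]              # checked access each time
                  end
                  s
              end

              # hoisted version: one check up front, unchecked accesses in the loop
              function sum_hoisted(a, n)
                  n <= length(a) || throw(BoundsError(a, n))
                  s = 0f0
                  for i in 1:n
                      @inbounds s += a[i]    # bound already established above
                  end
                  s
              end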

  • Athas 2010 days ago
    I'm a bit surprised to see that GPU Mandelbrot is only at best x75 faster than (sequential?) CPU. Does Julia just generate really fast (multicore/vectorized?) CPU code? Does it also count communication costs? Fractal computations like that are extremely GPU friendly because they involve no memory accesses at all, except for writing the final result. I would expect at least two orders of magnitude improvement over a straightforwardly written C implementation.
    • Athas 2010 days ago
      I had a suspicion that the problem was that maxiters was set fairly low (16). For the Mandelbrot sets I'm used to, this would result in relatively little computation (over a hundred iterations is more common). To investigate this hunch, I wrote a Futhark program as a proxy for a hand-written GPU implementation (didn't quite have the patience for that this evening):

          import "lib/github.com/diku-dk/complex/complex"
      
          module c32 = mk_complex f32
          type c32 = c32.complex
      
          let juliaset (maxiter: i32) (z0: c32) =
            let c = c32.mk (-0.5) 0.75
            let (_, i) = loop (z, i) = (z0, 0) while i < maxiter && c32.mag z <= 4 do
                         (c32.(z * z + c), i+1)
            in i
      
          let main (n: i32) (maxiter: i32) =
            let N = 2**n
            let (w, h) = (N, N)
            let q = tabulate_2d w h (\i j -> let i = 1 - r32 i*(2/r32 w)
                                             let j = -1.5 + r32 j*(3/r32 h)
                                             in c32.mk i j)
            in map (map (juliaset maxiter)) q
      
      I don't have the tools or skills to easily generate nice graphs, but I get about a x271 speedup when this code is compiled to OpenCL and run on a Vega 64 GPU, versus when it is compiled to C. The C code runs in 272ms, which is very close to the number reported for Julia here[0] (I assume that the page has a typo and that the N=2²⁴ column actually means N=2¹², because an N=2²⁴ image would take up dozens of TiB in GPU memory. Also, the benchmark code linked only goes up to N=2¹².)

      Unfortunately, if I change maxiters to 256, the speedup actually drops, to x182. So much for that theory.

      Edit: also tried on an NVIDIA GTX 780Ti, and the results are the same. My new theory is that the Julia code also counts the cost of moving the resulting array back to the host (which can easily be dominant for a kernel that runs for just a millisecond or two).

      [0]: https://github.com/JuliaGPU/GPUBenchmarks.jl/blob/master/res...

    • ummonk 2010 days ago
      It does autovectorize quite often.
  • currymj 2010 days ago
    While having a Torch-esque GPU ndarray is great, the ability to easily write your own kernels without having to compile gnarly C++ code is what sets Julia apart from competitors IMO. Not sure if there's any other dynamic language offering anything like this.
    • schmudde 2010 days ago
      ClojureCL has already been mentioned, but I found the author's interview on the Defn podcast particularly enlightening: https://soundcloud.com/defn-771544745/defn-24-v3
    • dragandj 2010 days ago
      ClojureCL and ClojureCUDA have been doing it for 4 years now.

      https://clojurecl.uncomplicate.org

      https://clojurecuda.uncomplicate.org

      • KenoFischer 2010 days ago
        Am I missing something? I was excited about other languages doing the same as Julia, but the first example for ClojureCUDA has:

            (def kernel-source
              "extern \"C\"
               __global__ void increment (int n, float *a) {
                 int i = blockIdx.x * blockDim.x + threadIdx.x;
                 if (i < n) {
                   a[i] = a[i] + 1.0f;
                 }
               };")
        
        Is there an example where the kernel is written in Clojure?
        • dragandj 2010 days ago
          You can of course create Clojure/Lisp code that would generate kernel code, but I am strongly against it as the default option, since I find plain C much better suited to small kernels close to the hardware (which are only a small proportion of the overall code) than either Lisp or Julia.

          Please note that there is no need for an out-of-process compilation step. You compile this by calling a Clojure function, interactively from the REPL.

          • KenoFischer 2010 days ago
            Sure, but that's a completely separate thing from what Julia is doing here. The whole point of the GPU backend is that you get almost the full flexibility of the Julia language at your disposal, without having to fall back to C code. As a result, not only is your code more familiar to Julia programmers, but you also get to share large amounts of code between CPU/GPU/TPU. If all you're doing is super small kernels, that may not matter much, but GPUs are so powerful these days that it's perfectly sensible to offload rather large programs that you may not want to write in C anymore.
            • 3rdAccount 2010 days ago
              It's not just easier to read for Julia programmers. Any Pythonista can mostly read Julia without significant issue. They're both high-level and syntactically similar enough. I know Python is a lot more OO, but when you're just writing a short script it looks a lot like what Julia would be if it didn't need you to specify types.
            • byt143 2010 days ago
              Also, that exact C example can be written almost verbatim in Julia, so C isn't "better suited" even for these small kernels, because one can drop to a low level with even clearer syntax in Julia.
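
              For reference, a rough CUDAnative.jl rendering of that kernel (a sketch from memory; the launch configuration is made up):

                  using CUDAnative, CuArrays

                  function increment!(a, n)
                      i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
                      if i <= n
                          @inbounds a[i] += 1.0f0
                      end
                      return nothing
                  end

                  a = CuArray(zeros(Float32, 1000))
                  @cuda threads=256 blocks=4 increment!(a, length(a))
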
    • pjmlp 2010 days ago
      "Parallel computations on the GPU with CUDA in Clojure"

      https://clojurecuda.uncomplicate.org/

      • imtringued 2010 days ago
        The article is about compiling a restricted subset of Julia code to run on a GPU and accelerate your project without leaving the language.

        The library you linked doesn't compile Clojure code to run on the GPU. It's basically just an FFI wrapper to pass C/C++ kernels to CUDA.

        • pjmlp 2009 days ago
          Yeah, I missed that part.
  • pjmlp 2010 days ago
    Love it! So much more fun than being stuck with C derived languages for GPGPU programming.
    • tanderson92 2010 days ago
      You can do GPGPU programming with Fortran, which is definitely not C-derived; it predates C by about 15 years.
      • 3rdAccount 2010 days ago
        I find Fortran pretty pleasant for scientific work (what it was designed for) and take pleasure in knowing it hasn't changed a whole lot over the years (F77, F90, etc.).
      • pjmlp 2010 days ago
        Sure, and it is a good alternative that I forgot about, given that numerical computing is its strong point anyway.

        However, it doesn't win many hearts outside HPC nowadays, even with the latest standard revisions, and I was thinking more about mainstream adoption.

        There are also Accelerate, CUDA4J, ClojureCUDA, Hybridizer and Alea, but I am not sure about their adoption at large.

  • eigenspace 2010 days ago
    It seems kinda weird to tout how great it is that we have CuArrays and CLArrays when CLArrays hasn't been updated for 1.0 and only claims experimental support for 0.6.

    Really hoping we see some movement on CLArrays in the near future.

    • ChrisRackauckas 2010 days ago
      I want to see it updated too. It was really nice to write/test/debug GPU codes on my cheap laptop without an NVIDIA GPU, and then switch over to my desktop and CUDA with single line changes. Even though CuArrays tends to be quite a bit faster, this convenience cannot be overstated. I didn't realize how much I'd miss it until I was at a conference and couldn't just fiddle around with some GPU integration I had to debug.
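
      The "single line change" is really just the array constructor; everything else stays generic. A rough sketch (the function is made up):

          using CuArrays                     # or CLArrays on a machine without CUDA

          # generic code: works for Array, CuArray, CLArray, ...
          relu_sum(u) = sum(max.(u, 0f0))

          u = CuArray(rand(Float32, 1024))   # the one line you change
          # u = CLArray(rand(Float32, 1024)) # OpenCL version
          # u = rand(Float32, 1024)          # plain CPU Array
          relu_sum(u)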

      I think the reason CLArrays.jl only claims experimental support on v0.6 is that it uses Transpiler.jl, which is quite limited and does a fairly direct translation. CUDAnative.jl is much more advanced and uses LLVM's PTX backend. I think Simon mentioned he wanted to do a more direct compilation route, and that would probably be the change that moves it past "experimental"? I don't know, I'd like to hear about the future of CLArrays as well. If we have both working strong, boy oh boy I'll be happy.

    • keldaris 2010 days ago
      It is clearly mentioned in a separate paragraph which reads:

      > For this article I'm going to choose CuArrays, since this article is written for Julia 0.7 / 1.0, which still isn't supported by CLArrays.

      • eigenspace 2010 days ago
        My bad, I should really finish reading things before I comment.
    • boromi 2010 days ago
      Sadly, OpenCL always seems to get ignored.
      • pjmlp 2010 days ago
        They did that to themselves by originally sticking with C and forcing the whole read-source/compile/link process at runtime, instead of the multiple-language support CUDA has.

        OpenCL has been improved since then, but now it is too late.

  • eghad 2010 days ago
    If anyone wants to try out a free GPU using Google Colab/Jupyter (a K80; you might run into RAM allocation issues if you're not one of the lucky users who get the full amount), here's a quick guide to getting a Julia kernel up and running: https://discourse.julialang.org/t/julia-on-google-colab-free...
    • simondanisch 2009 days ago
      Or just edit the article directly ;) That will also give you a K80, and you can run the code inside the article directly.
  • IshKebab 2010 days ago
    It doesn't really describe the fundamental difference between a GPU and a 4000-core CPU, which is that the GPU has a shared program counter. All the cores must execute the same instruction at each cycle.
    • twtw 2010 days ago
      The conceptually independent threads are executed on 32-wide SIMD units. "All the cores" within a warp/wavefront must execute the same instruction each cycle.

      GPU threads got even more independent in NVIDIA's Volta architecture - search for "Volta ITS" (independent thread scheduling) for details.
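
      You can still write branches; divergent lanes within a warp just run both sides serially with the inactive lanes masked off. In the CUDAnative.jl terms the article uses, a divergent kernel is simply (a sketch):

          using CUDAnative

          function kernel!(a)
              i = threadIdx().x
              # both branches are compiled in; lanes of a warp that disagree on the
              # condition execute the two sides one after the other, masked
              if iseven(i)
                  @inbounds a[i] *= 2.0f0
              else
                  @inbounds a[i] += 1.0f0
              end
              return nothing
          end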

    • why_only_15 2010 days ago
      Are you sure? I'm pretty sure I can branch on thread-specific values like (in CUDA) threadIdx and blockIdx.
    • llukas 2010 days ago
      On NVIDIA's Volta architecture, threads have individual program counters.