My history: I’m a devops guy with about four years of experience in IT and about a year of experience writing Python at a professional level. My degree is in general mathematics, though I did best in the prob/stat courses (and enjoyed them more than the others).
Side note: I wonder if I “33 bits”’d myself above...
There are two recently updated great books I recommend:
- Computer Architecture: A Quantitative Approach (2017) by Hennessy & Patterson
- Computer Organization and Design - RISC-V Edition (2017) by Hennessy & Patterson (I have the older MIPS edition)
You also need a book and documentation for the specific architecture (x86 or ARM), but the two books above teach generic stuff that is useful everywhere.
If you do numerical HPC programming, you usually write very little assembly. You might add some inline assembly tweaks inside (C/C++/Fortran) functions when needed. You must know how to program in C, C++, or Fortran, depending on what the code base you are working on uses, and how to embed assembly code inside them.
EDIT: CUDA programming might be important to learn if you want to do low level numerical programming.
Here's my general workflow for optimizing functions in HFT:
Write a function in C++ and compile it. Look at the annotated disassembly and try to improve it using intrinsics, particularly vector intrinsics, measuring with rdtsc.
Then compile with "-ftree-vectorize -march=native" and compare what the compiler did to what you did. Look up the instructions it used and compare them with yours; check for redundancies, bad ordering, and register misuse/underuse in the compiler output.
Then see if you can improve that.
But all that being said, note that in general this kind of cycle-counting micro-optimization is often overshadowed by instruction/data cacheline loads. It's rare that you have a few kilobytes of data that you will constantly iterate over with the same function. Most learning resources and optimizing compilers seem to ignore this fact.
You can use the start and stop macros in valgrind.h to show cache behaviour of a specific chain of function calls, like when a network event happens, then in the view menu of kcachegrind select IL Fetch Misses, and show the hierarchical function view.
It doesn't mimic the exact branch prediction or whatever of your architecture but when you compare it to actual timings it's damn close.
2) Because after you do this enough times, you will learn when to write your own, when not to, and how to spot inefficiencies in the compiler output. The point is to learn, both about how the instructions work and how the compiler works.
3) The C/C++ implementation serves as documentation of intent and is portable across architectures (including future x86-64 architectures). It's fucking atrocious when devs write pure assembly without a C/C++ reference that can replace it. To me, finding random assembly without a code implementation in the project is the ultimate indictment of a hot rod programmer not thinking about the future or future maintainers.
I do, but I very rarely code in assembly. Usually I just read the output of the C++ compiler.
It’s hard to beat well-written manually vectorized C++ code. Even then, the compiler is sometimes obviously doing something wrong: https://developercommunity.visualstudio.com/content/problem/... The compiler appears to know the typical latency figures for these instructions, i.e. it reorders them when it can if that's going to help.
> how is software keeping up with hardware advances?
Barely. The only way to approach the advertised compute performance of CPUs is SIMD instructions. Despite being available in mainstream CPUs for a couple of decades now (the Pentium 3 launched in 1999), modern compilers can only auto-vectorize very simple code.
Fortunately, these instructions are available as C and C++ compiler intrinsics. Supported by all modern compilers and quite portable in practice (across different compilers building for the same architecture).
> Do newer low-level languages like Rust and Go really utilize the massive advances that have taken place?
Both are worse than C++.
About what to start with… I would start with picking either CPU or GPGPU. GPUs have way more raw computing power. Programming models are entirely different between them.
If you pick GPU, I’d recommend the “CUDA by Example” book; it helped me when I started GPGPU programming. BTW, for GPUs, the assemblies and instruction sets are proprietary, i.e. no one works with assembly. There are low-level assembly-like things, Nvidia PTX, MS shader assembly, but these instructions are not executed by hardware; the GPU driver compiles them once again into proprietary stuff.
I don’t know good books for CPU SIMD. I started organically with some random articles and the Intel reference, but I had many years of C and C++ programming behind me when I did. Not sure it’ll work for you.
Lately, working on CAD/CAM/CAE Windows desktop software, also some Linux embedded. Before that worked in game development, HPC (have not coded for supercomputers, just commodity servers i.e. Xeon + nVidia), realtime multimedia (video processing, encoding, broadcasting).
How so?
Intel makes CPUs and invented many of these instruction sets, they also make a C++ compiler, and their implementation of intrinsics is what's adopted by other compilers. They also provide decent documentation. Intel doesn't support any golang or rust packages or language extensions.
They’re the exact same thing as if you used clang.
(It’s only relevant because you seem to imply that Intel doesn’t care about Rust at all. That’s not true.)
Are you sure the function calls that crate adds on top of every single instruction, transmute(), as_i32x4(), etc., compile into nothing? And do so reliably, i.e. every single time, regardless of the surrounding code?
BTW, functions built on top of intrinsics aren’t reliable in clang. I sometimes have to use compiler-specific trickery to force compilers to inline stuff, keep data in registers instead of loads/stores, and otherwise not screw up the performance.
https://github.com/AdamNiederer/faster
In fact, in Rust, they are easier to use.
Right, I know there’s some support.
> You can use intrinsics from a lot of languages.
Yes. However, Intel (the guys making CPUs actually implementing these instructions) only supports them for C/C++. Just because you can use them from other languages (e.g. modern .NET has them as well, System.Numerics.Vectors) doesn’t necessarily mean it’s a good idea to do so.
> In fact, in Rust, they are easier to use.
That’s not “in fact”, that’s your opinion. Personally, I don’t think simple is good.
When I code at that level of abstraction, I want to get whatever instructions are implemented by CPU. No more, no less.
I’ve looked at the example on the front page. There are two methods to compute rsqrt in SSE/AVX, a fast approximate one (rsqrtps) and a precise one (sqrtps, divps). There are several methods to compute ceil/floor, again with different tradeoffs. Do you know which instructions their example compiles into? Neither do I.
Also, one tricky part of CPU SIMD is cross-lane operations (shuffle, movelh/movehl, unpack, etc). Another one is integers: the instruction set is not comparable to any programming language, saturated versions of + and -, relatively high-level operations like psadbw, pmaddubsw, palignr, lack of something simple (e.g. can’t compare unsigned bytes for greater/less, only signed ones).
For trivially simple algorithms that compute the same math on wide vectors of float values, you're better off using OpenCL and running on the GPU. It will likely be faster.
> That’s not “in fact”, that’s your opinion. Personally, I don’t think simple is good.
When I grow up I want to be as smart as you.
And re: the faster project, you can use the exact same instructions. This is nothing about rust or not.
Also, Rust is simply a better language for low level things even ignoring intrinsics.
Intrinsics are not library functions. You don’t link them anywhere. They’re processed by compiler not linker, and for SIMD math, each one usually becomes a single instruction. Linked functions are too slow for that.
> This is nothing about rust or not.
When I code C and write y=_mm_rsqrt_ps(x) I know I’ll get my rsqrtps instruction. When I write y=_mm_div_ps(_mm_set1_ps(1), _mm_sqrt_ps(x)) I know I’ll get slower more precise version. I don’t want compiler to choose one for me while converting a formula into machine code.
You can do the same in rust. See the explicit section of the faster project.
Sorry to disappoint but Rust can’t include C++ headers. Even if it could, they wouldn’t work, because intrinsics are not library functions.
> See the explicit section of the faster project.
These aren’t C intrinsics, they are library functions exported from the stdsimd crate, which in turn forwards them to LLVM. Requires Rust nightly. Also, I’m not sure that many levels of indirection are good for performance. You usually want these m128/m256 values to stay in registers. In C++, I sometimes have to write __forceinline to achieve that, or the compiler breaks performance by making function calls or referencing RAM.
You are pedantic.
https://doc.rust-lang.org/1.29.0/std/arch/#static-cpu-featur...
Looks like significant overhead over C intrinsics. Two calls to transmute() for every instruction. And other calls for every instruction, stuff like as_i32x4.
It’s technically possible that every last one of them compiles into nothing at all and emits just a single desired instruction. I don’t believe these optimizations are 100% reliable, however. They aren’t reliable in clang or VC++; I sometimes have to use trickery to force compilers to inline stuff, keep data in registers instead of loads/stores, and otherwise not screw up the performance.
That crate calls transmute twice, for every single instruction.
https://github.com/rust-lang-nursery/stdsimd/blob/master/cra...
https://github.com/rust-lang-nursery/stdsimd/blob/master/cra...
They correspond to the machine instructions, that’s their entire purpose. That’s also why they’re intrinsics.
One notable use of raw assembly is that Intel themselves contribute optimized strcpy etc. implementations to glibc. You might find it interesting to go see how the most recent ones work.
The details will vary significantly with the type of high-performance code you are writing e.g. is it floating point numerics, integer domain, memory-hard, trivially parallelizable, etc. Different architectures are optimized for different kinds of codes, but a modern CPU with strong vector support likely gives the best performance across the broadest range of code types. Contrary to popular impression, most HPC is not doing linear algebra and quite a bit of it is integer code.
Becoming intimately familiar with the details of microarchitectures is hugely important to understanding how to optimally structure codes for them. Agner Fog's resources are a good starting point for understanding some of these issues.
Software has probably been free-riding on hardware, but hardware has also not been keeping up with itself: the big difficulty is the latency between memory and the processor.
Rust is closer than Go. Go isn't designed to make optimal code. It's there to make good enough code. And it does so very quickly.
In terms of resources, I really like Crafting Interpreters. Would love for you to take a glance at my book (Rust in Action, Manning), if you want to learn about other systems topics outside of compiler design.
> LockOSThread wires the calling goroutine to its current operating system thread. The calling goroutine will always execute in that thread, and no other goroutine will execute in it, until the calling goroutine has made as many calls to UnlockOSThread as to LockOSThread. [...] A goroutine should call LockOSThread before calling OS services or non-Go library functions that depend on per-thread state.
Source: https://golang.org/pkg/runtime/#LockOSThread
Assembly is still used where needed: kernels of media encoders (for performance), some interrupt handlers (for control), other things that shouldn't comply with ABIs for whatever reason, and things that can't be accessed from higher-level languages (architecture-specific registers).
By programming for the Game Boy Advance, of course
For x86 you may find the following useful:
1) Computer systems: A programmer's perspective by Bryant and O'Hallaron.
2) Low-Level Programming: C, Assembly, and Program Execution on Intel® 64 Architecture by Igor Zhirkov.
3) Modern X86 Assembly Language Programming by Daniel Kusswurm.
4) Agner Fog's writings.
5) Ciro Santilli's x86 examples - https://github.com/cirosantilli/x86-bare-metal-examples
It starts off with booting a simple OS written in assembly from SD card, blinking LED and then continues to more advanced topics such as controlling the GPU/screen.
Sometimes on simple 8-bit architectures for some interrupt routines, where under certain conditions it can increase performance on I/O operations. And 99% of that is read -> write.
On more complex architectures like x86? Not really. I wouldn't even come close to modern compilers. Maybe for fun.
If you want to create a simple system, I would recommend ignoring optimization and focusing on system design. That is probably the only architecture-agnostic part of working in assembler.
You could also use a µC with an external programmer to get a basic system running. The mnemonics are often very similar to x86, and while optimizations are hard and in-depth knowledge of the architecture is required, the first step is probably the system design itself; the efficiency of the assembly is secondary. All that should come later.
For the very first steps, I would recommend Linux or Windows assembler and learning how to interface with the underlying OS. That could help harden the requirements for the design.
I suggest studying some general assembly language concepts: memory locations, how to set up RAM timings and bring off-chip RAM into an address space, etc.
This stuff is used daily in the world of microcontrollers. I'm a huge fan of the Parallax Propeller, in which the Spin interpreter in ROM launches assembly routines on individual cores in about the simplest fashion possible...
Some things you may find interesting: https://github.com/dwelch67/raspberrypi
https://github.com/rsta2/circle
https://ultibo.org/
Think of the latter 2 as "HAL plus some primitives" rather than RTOS...
C, C++, Rust, possibly Ocaml, FORTH, FreePascal, etc... or else you are actually on about a virtual machine, not bare metal.
Learn how to create your own heap. Make your own malloc/free on a bare-metal device and that will be a huge boost for you, confidence-wise...
Others beg to differ, selling GC enabled development environments from small PICs to grown up ARM deployments.
http://www.astrobe.com/default.htm
http://www.microej.com/
https://www.aicas.com/cms/
https://www.ptc.com/en/products/developer-tools/perc
https://www.microdoc.com/ibm-websphere-everyplace-custom-env...
Some solutions even considered suitable for bare metal work by the US and French military.
And speaking of Go on bare metal, https://tinygo.org/
http://www.projectoberon.com/
https://inf.ethz.ch/personal/wirth/ProjectOberon/index.html
Sadly the ready made OberonStation boards are no longer on sale.
https://www.pvk.ca/Blog/2014/03/15/sbcl-the-ultimate-assembl...
https://github.com/cirosantilli/x86-bare-metal-examples
He has a lot of them, and you can also check out his stack overflow answers.
>Does anyone even work with assembly anymore?
in infosec, definitely
As a little bit of a different fun one:
I have a MicroPython project I'm building, a limited run toy, which live-assembles some assembler that it deploys to a co-processor which handles low power mode.
A real mix of very high level and very low level code.
I do so not for performance but for reducing dependencies, for building a more parsimonious stack at all levels: https://github.com/akkartik/mu/blob/master/subx/Readme.md. It's surprisingly pleasant. Lately I find myself thinking heretical thoughts, like whether high-level languages are worth all the trouble.
Otherwise the only things I've done recently have been writing a simple "compiler" for maths, converting "3 4 + 5 *" into floating-point assembly, or generating code for a toy-language.