My history: I’m a devops guy with about four years of experience in IT and about a year of experience writing Python at a professional level. My degree is in general mathematics, though I did best in the prob/stat courses (and enjoyed them more than the others).
Side note: I wonder if I “33 bits”’d myself above...
There are two recently updated great books I recommend:
- Computer Architecture: A Quantitative Approach (2017) by Hennessy & Patterson
- Computer Organization and Design - RISC-V Edition (2017) by Hennessy & Patterson (I have the older MIPS edition)
You also need a book and documentation for the specific architecture (x86 or ARM), but the two books above teach generic stuff that is useful everywhere.
If you do numerical HPC programming, you usually write very little assembly. You might add some inline assembly tweaks inside (C/C++/Fortran) functions when needed. You must know how to program in C, C++, or Fortran, depending on what the code base you are working on uses, and how to embed assembly code inside them.
EDIT: CUDA programming might be important to learn if you want to do low level numerical programming.
Here's my general workflow for optimizing functions in HFT:
Write a function in C++ and compile it. Look at the annotated disassembly and try to improve it using intrinsics, particularly vector intrinsics, measuring with rdtsc.
Then compile with "-ftree-vectorize -march=native" and compare what the compiler did to what you did. Look up the instructions it used and compare them with yours; check for redundancies, bad ordering, and register misuse/underuse in the compiler output.
Then see if you can improve that.
But all that being said, note that in general this kind of cycle-counting micro-optimization is often overshadowed by instruction/data cacheline loads. It's rare that you have a few kilobytes of data that you will constantly iterate over with the same function. Most learning resources and optimizing compilers seem to ignore this fact.
You can use the start and stop macros in valgrind.h to show cache behaviour of a specific chain of function calls, like when a network event happens, then in the view menu of kcachegrind select IL Fetch Misses, and show the hierarchical function view.
It doesn't mimic the exact branch prediction or whatever of your architecture but when you compare it to actual timings it's damn close.
2) Because after you do this enough times, you will learn when to write your own, when not to, and how to spot inefficiencies in the compiler output. The point is to learn, both about how the instructions work and how the compiler works.
3) The C/C++ implementation serves as documentation of intent and is portable across architectures (including future x86-64 architectures). It's fucking atrocious when devs write pure assembly without a C/C++ reference that can replace it. To me, finding random assembly without a code implementation in the project is the ultimate indictment of a hot rod programmer not thinking about the future or future maintainers.
I do, but I very rarely code in assembly. Usually I just read the output of the C++ compiler.
It’s hard to beat well-written manually vectorized C++ code. Even then, the compiler is sometimes obviously doing something wrong: https://developercommunity.visualstudio.com/content/problem/... The compiler appears to know the typical latency figures for these instructions, i.e. it reorders them when it can if that's going to help.
> how is software keeping up with hardware advances?
Barely. The only way to approach the advertised compute performance of CPUs is SIMD instructions. Despite being available in mainstream CPUs for a couple of decades now (the Pentium 3 launched in 1999), modern compilers can only auto-vectorize very simple code.
Fortunately, these instructions are available as C and C++ compiler intrinsics. Supported by all modern compilers and quite portable in practice (across different compilers building for the same architecture).
> Do newer low-level languages like Rust and Go really utilize the massive advances that have taken place?
Both are worse than C++.
About what to start with… I would start with picking either CPU or GPGPU. GPUs have way more raw computing power. Programming models are entirely different between them.
If you pick GPU, I’d recommend the “CUDA by Example” book; it helped me when I started GPGPU programming. BTW, for GPUs, the assemblies and instruction sets are proprietary, i.e. no one works with assembly. There are low-level assembly-like things, Nvidia PTX, MS shader assembly, but these instructions are not executed by hardware; the GPU driver compiles them once again into proprietary stuff.
I don’t know good books for CPU SIMD. I started organically with some random articles and the Intel reference, but I had many years of C and C++ programming behind me when I did. Not sure it’ll work for you.
Lately, working on CAD/CAM/CAE Windows desktop software, also some Linux embedded. Before that worked in game development, HPC (have not coded for supercomputers, just commodity servers i.e. Xeon + nVidia), realtime multimedia (video processing, encoding, broadcasting).
How so?
Intel makes CPUs and invented many of these instruction sets, they also make a C++ compiler, and their implementation of intrinsics is what's adopted by other compilers. They also provide decent documentation. Intel doesn't support any golang or rust packages or language extensions.
They’re the exact same thing as if you used clang.
(It’s only relevant because you seem to imply that Intel doesn’t care about Rust at all. That’s not true.)
Are you sure the function calls that crate adds on top of every single instruction, transmute(), as_i32x4(), etc., compile into nothing? And do so reliably, i.e. every single time, regardless of the surrounding code?
BTW, functions built on top of intrinsics aren’t reliable in clang. I sometimes have to use compiler-specific trickery to force compilers to inline stuff, keep data in registers instead of loads/stores, and otherwise not screw up the performance.
https://github.com/AdamNiederer/faster
In fact, in Rust, they are easier to use.
Right, I know there’s some support.
> You can use intrinsics from a lot of languages.
Yes. However, Intel (the guys making CPUs actually implementing these instructions) only supports them for C/C++. Just because you can use them from other languages (e.g. modern .NET has them as well, System.Numerics.Vectors) doesn’t necessarily mean it’s a good idea to do so.
> In fact, in Rust, they are easier to use.
That’s not “in fact”, that’s your opinion. Personally, I don’t think simple is good.
When I code at that level of abstraction, I want to get whatever instructions are implemented by CPU. No more, no less.
I’ve looked at the example on the front page. There are two methods to compute rsqrt in SSE/AVX, a fast approximate one (rsqrtps) and a precise one (sqrtps, divps). There are several methods to compute ceil/floor, again with different tradeoffs. Do you know which instructions their example compiles into? Neither do I.
Also, one tricky part of CPU SIMD is cross-lane operations (shuffle, movelh/movehl, unpack, etc). Another one is integers: the instruction set is not comparable to any programming language, saturated versions of + and -, relatively high-level operations like psadbw, pmaddubsw, palignr, lack of something simple (e.g. can’t compare unsigned bytes for greater/less, only signed ones).
For trivially simple algorithms that compute the same math on wide vectors of float values, you're better off using OpenCL and running on the GPU. It will likely be faster.
> That’s not “in fact”, that’s your opinion. Personally, I don’t think simple is good.
When I grow up I want to be as smart as you.
And re: the faster project, you can use the exact same instructions. This is nothing about rust or not.
Also, Rust is simply a better language for low level things even ignoring intrinsics.
Intrinsics are not library functions. You don’t link them anywhere. They’re processed by compiler not linker, and for SIMD math, each one usually becomes a single instruction. Linked functions are too slow for that.
> This is nothing about rust or not.
When I code C and write y=_mm_rsqrt_ps(x) I know I’ll get my rsqrtps instruction. When I write y=_mm_div_ps(_mm_set1_ps(1), _mm_sqrt_ps(x)) I know I’ll get slower more precise version. I don’t want compiler to choose one for me while converting a formula into machine code.
You can do the same in rust. See the explicit section of the faster project.
Sorry to disappoint but Rust can’t include C++ headers. Even if it could, they wouldn’t work, because intrinsics are not library functions.
> See the explicit section of the faster project.
These aren’t C intrinsics, they are library functions exported from the stdsimd crate, which in turn forwards them to LLVM. Requires Rust nightly. Also, I’m not sure that many levels of indirection are good for performance. You usually want these m128/m256 values to stay in registers. In C++, I sometimes have to write __forceinline to achieve that, or the compiler breaks performance by making function calls or referencing RAM.
You are pedantic.
https://doc.rust-lang.org/1.29.0/std/arch/#static-cpu-featur...
Looks like significant overhead over C intrinsics. Two calls to transmute() for every instruction. And other calls for every instruction, stuff like as_i32x4.
It’s technically possible that every last one of them compiles into nothing at all and emits just a single desired instruction. I don’t believe these optimizations are 100% reliable, however. They aren’t reliable in clang or VC++; I sometimes have to use trickery to force compilers to inline stuff, keep data in registers instead of loads/stores, and otherwise not screw up the performance.
That crate calls transmute twice, for every single instruction.
https://github.com/rust-lang-nursery/stdsimd/blob/master/cra...
https://github.com/rust-lang-nursery/stdsimd/blob/master/cra...
They correspond to the machine instructions, that’s their entire purpose. That’s also why they’re intrinsics.
One notable use of raw assembly is that Intel themselves contribute optimized strcpy etc. implementations to glibc. You might find it interesting to go see how the most recent ones work.
The details will vary significantly with the type of high-performance code you are writing e.g. is it floating point numerics, integer domain, memory-hard, trivially parallelizable, etc. Different architectures are optimized for different kinds of codes, but a modern CPU with strong vector support likely gives the best performance across the broadest range of code types. Contrary to popular impression, most HPC is not doing linear algebra and quite a bit of it is integer code.
Becoming intimately familiar with the details of microarchitectures is hugely important to understanding how to optimally structure codes for them. Agner Fog's resources are a good starting point for understanding some of these issues.
Software has probably been free-riding on hardware, but hardware has also not been keeping up with itself: the big difficulty is the latency between memory and the processor.
Rust is closer than Go. Go isn't designed to make optimal code. It's there to make good enough code. And it does so very quickly.
In terms of resources, I really like Crafting Interpreters. Would love for you to take a glance at my book (Rust in Action, Manning), if you want to learn about other systems topics outside of compiler design.
> LockOSThread wires the calling goroutine to its current operating system thread. The calling goroutine will always execute in that thread, and no other goroutine will execute in it, until the calling goroutine has made as many calls to UnlockOSThread as to LockOSThread. [...] A goroutine should call LockOSThread before calling OS services or non-Go library functions that depend on per-thread state.
Source: https://golang.org/pkg/runtime/#LockOSThread
Assembly is still used where needed: kernels of media encoders (for performance), some interrupt handlers (for control), other things that shouldn't comply with ABIs for whatever reason, and things that can't be accessed from higher-level languages (architecture-specific registers).
By programming for the Game Boy Advance, of course
For x86 you may find the following useful:
1) Computer systems: A programmer's perspective by Bryant and O'Hallaron.
2) Low-Level Programming: C, Assembly, and Program Execution on Intel® 64 Architecture by Igor Zhirkov.
3) Modern X86 Assembly Language Programming by Daniel Kusswurm.
4) Agner Fog's writings.
5) Ciro Santilli's x86 examples - https://github.com/cirosantilli/x86-bare-metal-examples
It starts off with booting a simple OS written in assembly from SD card, blinking LED and then continues to more advanced topics such as controlling the GPU/screen.
Sometimes on simple 8-bit architectures for some interrupt routines, where under certain conditions it can increase performance on I/O operations. And 99% of that is read -> write.
On more complex architectures like x86? Not really. I wouldn't even come close to modern compilers. Maybe for fun.
If you want to create a simple system, I would recommend ignoring optimization and focusing on system design. That is probably the only architecture-agnostic part of working in assembler.
You could also use a µC with an external programmer to get a basic system running. The mnemonics are often very similar to x86, and while optimizations are hard and in-depth knowledge of the architecture is required, the first step is probably the system design itself; the efficiency of the assembly is secondary. All that should come later.
For the very first steps, I would recommend Linux or Windows assembler and learning how to interface with the underlying OS. That could help harden the requirements for the design.
I suggest studying some general assembly language concepts: memory locations, how to set up RAM timings and bring off-chip RAM into an address space, etc.
This stuff is used daily in the world of microcontrollers. I'm a huge fan of the Parallax Propeller, in which the Spin interpreter in ROM launches assembly routines on individual cores in about the simplest fashion possible...
Some things you may find interesting: https://github.com/dwelch67/raspberrypi
https://github.com/rsta2/circle
https://ultibo.org/
Think of the latter 2 as "HAL plus some primitives" rather than RTOS...
C, C++, Rust, possibly Ocaml, FORTH, FreePascal, etc... or else you are actually on about a virtual machine, not bare metal.
Learn how to create your own heap. Make your own malloc/free on a bare-metal device and that will be a huge boost for you, confidence-wise...
Others beg to differ, selling GC enabled development environments from small PICs to grown up ARM deployments.
http://www.astrobe.com/default.htm
http://www.microej.com/
https://www.aicas.com/cms/
https://www.ptc.com/en/products/developer-tools/perc
https://www.microdoc.com/ibm-websphere-everyplace-custom-env...
Some solutions even considered suitable for bare metal work by the US and French military.
And speaking of Go on bare metal, https://tinygo.org/
http://www.projectoberon.com/
https://inf.ethz.ch/personal/wirth/ProjectOberon/index.html
Sadly the ready made OberonStation boards are no longer on sale.
https://www.pvk.ca/Blog/2014/03/15/sbcl-the-ultimate-assembl...
https://github.com/cirosantilli/x86-bare-metal-examples
He has a lot of them, and you can also check out his stack overflow answers.
>Does anyone even work with assembly anymore?
in infosec, definitely
As a little bit of a different fun one:
I have a MicroPython project I'm building, a limited run toy, which live-assembles some assembler that it deploys to a co-processor which handles low power mode.
A real mix of very high level and very low level code.
I do so not for performance but for reducing dependencies, for building a more parsimonious stack at all levels: https://github.com/akkartik/mu/blob/master/subx/Readme.md. It's surprisingly pleasant. Lately I find myself thinking heretical thoughts, like whether high-level languages are worth all the trouble.
Otherwise the only things I've done recently have been writing a simple "compiler" for maths, converting "3 4 + 5 *" into floating-point assembly, or generating code for a toy-language.