Green threads explained in 200 lines of Rust

(cfsamson.gitbook.io)

374 points | by taheris 99 days ago

12 comments

  • FullyFunctional 99 days ago

    Co-routines are very useful and likely underused, but sometimes you are actually better off being able to pass control to a given thread directly, rather than having a scheduler involved.

    Anecdote: almost a decade ago, I was responsible for an NVMe-like implementation (hardware and software). The 3rd version of the firmware realized the various components as threads, but there was no need for preemption (which would have required expensive locking). Traditional scheduling would work, but you actually know exactly which thread should execute next (the hardware will signal completion), so an explicit yield_to() was far cheaper than a scheduler and only slightly more expensive than a function call.

    • wahern 98 days ago

      > Co-routines are very useful and likely underused, but sometimes you are actually better off being able to pass control to a given thread directly, rather than having a scheduler involved.

      That's almost the very definition of a coroutine--explicit transfer of control. In symmetric coroutines you must specify a coroutine for both yield and resume; in asymmetric coroutines you specify what to resume to but yield implicitly returns to whatever resumed the current coroutine. In either case the actual control flow transfer is explicitly invoked.

      The term thread is more ambiguous, but it almost always implies control transfers--both the timing and target of control transfer--are implicit and not directly exposed to application logic. (Automagic control transfer might be hidden within commonly used functions (e.g. read and write), injected by the compiler (Go does this), or triggered by hardware.)

      You can synthesize a threading framework with both asymmetric and symmetric stackful[1] coroutines by simply overloading the resume and yield operations to transfer control to a scheduler, and then hiding implicit resume/yield points within commonly used functions or by machine translation of the code. In languages where "yield" and "resume" are exposed as regular functions this is especially trivial. Stackful coroutines (as opposed to stackless, which are the most commonly provided type of coroutine) are a powerful enough primitive that building threads is relatively trivial, which is why the concepts are easy to conflate, but they shouldn't be confused.

      LISP-y languages blur some of these distinctions as libraries can easily rewrite code; they can inject implicit control transfer and stack management in unobtrusive ways.[2] This isn't possible to the same extent in languages like C, C++, or Rust; lacking a proper control flow primitive (i.e. stackful coroutine) their "threading" frameworks[3] are both syntactically and semantically leaky.

      [1] By definition a thread preserves stack state--recursive function state--and this usually implies that stack management occurs at a very low-level in the execution environment, but in any case largely hidden from the logical application code.

      [2] OTOH, this is usually inefficient--stack management is a very performance critical aspect of the runtime. For example, Guile, a Scheme implementation, now provides a stackful coroutine primitive. For a good discussion of some of these issues, see https://wingolog.org/archives/2017/06/27/growing-fibers

      [3] Specifically the frameworks that attempt to make asynchronous I/O network programming simple and efficient. So-called native threads are a different matter as both stack management and control transfer are largely implemented outside the purview of those languages, very much like how native processes are implemented. If you go back far enough in the literature, especially before virtual memory, the distinctions between process and thread fall away. Nowadays threads are differentiated from processes by sharing the same memory/object space.

      • FullyFunctional 98 days ago

        Yes, I agree, but that's not what the OP does. Also, quite often, co-routines get conflated with cooperative threads/scheduling.

        For some color on your other points: the previous version actually used continuation-passing style, which worked and was very fast (faster than coroutines), but was challenging to understand without a good background in FP and FP implementations.

      • PopeDotNinja 99 days ago

        If you have any knowledge of it, what are your thoughts about the Erlang scheduler & Erlang's concurrency model?

        • FullyFunctional 99 days ago

          Funny enough, we actually started prototyping with Erlang for a subsequent project (which was cancelled before it went far). Unfortunately I don't know enough to say what's special about the Erlang scheduler (if anything), but as I understand the Erlang concurrency model, it's mostly about not sharing memory (which forces explicit communication). That's obviously going to eliminate a host of bugs, but it would have been way too expensive for the aforementioned firmware.

          Once we got started with Erlang I was pretty turned off. The pretty examples you see in tutorials aren't what you'll actually be using. Instead it's framework upon framework, far from elegant IMO. I was happy to not have to deal with that again. Today I'd probably choose Rust for the same task (static types FTW).

          (EDIT: typos)

          • PopeDotNinja 99 days ago

            Given that Erlang is sufficiently different from anything else I've seen, it doesn't surprise me that trying to be productive in Erlang before you knew it well enough was a suboptimal experience. I like it, and more specifically Elixir, quite a bit, but the learning curve was steep.

            • verttii 98 days ago

              It's fair to say Erlang's strength or appeal is not the language itself but the platform you drive with it. That's also where the learning curve is.

            • qohen 98 days ago

              > Unfortunately I don't know enough to know what's special about the Erlang scheduler (if anything)

              Bon appetit: :-)

              Appetizer (blog post):

              http://jlouisramblings.blogspot.com/2013/01/how-erlang-does-...

              Entree (book chapter):

              https://blog.stenmans.org/theBeamBook/#CH-Scheduling

            • davidw 98 days ago

              As he replies, Erlang is operating at a significantly higher level than that kind of project.

              It's well worth learning about how it works, as it's a great fit for some problems.

            • shanemhansen 98 days ago

              There was some work to give linux this functionality via a switchto* family of syscalls. https://youtu.be/KXuZi9aeGTw

              • kazinator 98 days ago

                Anyone who thinks that a kernel system call is the way to do this basically doesn't get the concept.

            • identity0 98 days ago

              I feel like "200 lines of Rust" is a little misleading considering how much explanation there is. It's like saying you can learn all of string theory in 0 lines of Python.

              • mlevental 98 days ago

                It seems like every post has a naysayer about something. You're complaining about an article that explains a difficult topic extensively and generously - in my day you had to pay really good money for that. Who cares if the title focuses on one aspect of the explanation? Sure, it's a little deceptive, but it's such a minor cost to you that you literally spent more time complaining than the deception cost you (click link with a false notion, see length, close page). LPT: being critical in and of itself does not make you look smart.

              • amluto 99 days ago

                Part of the code is:

                    unsafe {
                        std::ptr::write(stack_ptr.offset(SSIZE - 16) as *mut u64, hello as u64);
                        ctx.rsp = stack_ptr.offset(SSIZE - 16) as u64;
                        gt_switch(&mut ctx, &mut n);
                    }
                
                Only the last line should be unsafe — the first line appears to be writing in bounds to a Vec, which is easy (and much more readable) in safe Rust.

                Linux used to do its actual context switch in C with inline asm, and it changed to being in straight asm a while back. This was absolutely a win. Rather than trying to make everything work out behind the compiler’s back, it’s much more straightforward to write a function in asm that does the stack switch.

                • oconnor663 99 days ago

                  The ptr::write is needed because the stack vector contains bytes and the code wants to do a pointer cast and write a u64. It could be done in safe Rust with u64::to_ne_bytes and a copy_from_slice or something like that, but since all of this code is extravagantly unsafe anyway, I think it's reasonably clear to Just Do It :)

                  • gravitas 99 days ago

                    Fantastic (IMHO) read on the last paragraph's history and code movement: https://www.maizure.org/projects/evolution_x86_context_switc...

                    • paines 97 days ago

                      >Linux used to do its actual context switch in C with inline asm, and it changed to being in straight asm a while back.

                      Could you be a bit more specific, maybe with a link to the diff? Thanks!

                    • amelius 99 days ago

                      Isn't one downside that the inline assembly will not be peephole-optimized by the code generator (LLVM)? E.g., you'd be saving and restoring registers that are not even used.

                      • stephanimal 99 days ago

                        I don't know how practical that optimization would be in this particular use case. Wouldn't that mean the context switch code would have to be inlined at every call site? And when you switch back to that context, how would you generate that code correctly? You would need to know which registers were not saved by green thread A and not touched by green thread B. So the compiler would basically need to know the runtime behavior of your program to optimize pushing and popping context records, unless I misunderstood your point?

                        If you set a breakpoint in say a win32 fiber switch, and look at the disassembly, it jumps to an internal function that just saves all the registers (and flags) to the active context and restores all the registers from the resumed context every time. Don't know how more optimal that can be for the general case.

                        • chc4 98 days ago

                          Not really.

                          You potentially don't know who is "resuming", and so don't know which registers they will clobber. It would only be a "downside" if 1) your code uses a register and 2) no other possible green thread does, and that isn't an invariant any compiler I know of will promise, especially in the face of FFI calls.

                          If you're at the point where you want to do register allocation and spilling optimization across multitasking points, you're probably better off writing your own compiler instead of expecting a thread runtime to do it for you.

                          For comparison, normal threads have the same downside: the kernel saves and restores all registers (even more than what the example does), so you're not doing any worse than that.

                        • snissn 99 days ago

                          > a simple but working example

                          • amelius 99 days ago

                            Yeah, of course, but how would you work this type of code into something that can be used in production?

                            The goal should be to have a green thread library that you can just use without thinking about registers or assembly.

                            • arnarbi 99 days ago

                              > how would you work this type of code into something that can be used in production?

                              You wouldn't. There are production-ready solutions to this in many languages.

                              > The goal should be to have a green thread library that you can just use without thinking about registers or assembly.

                              The goal here is very explicitly to explain a concept by example, which is largely incompatible with what you say the goal should be. I think the author did a very good job.

                              • amelius 99 days ago

                                > I think the author did a very good job.

                                I wasn't questioning that!

                              • masklinn 99 days ago

                                > Yeah, of course, but how would you work this type of code into something that can be used in production?

                                Why would you?

                                > The goal should be to have a green thread library that you can just use without thinking about registers or assembly.

                                No it should not. That's a completely separate (and mostly incompatible) goal than TFA's, which is educational.

                                • mamcx 98 days ago

                                  Going from "toy" code to production-ready is hard.

                                  Also, Rust already has several options for concurrency/parallelism that are more idiomatic.

                                  The only reason I see is to have a coroutine library for building a language, yet I wonder if I'd put a facade of Actix actors or similar in front for this case...

                                  • nwallin 98 days ago

                                    > Going from "toy" code to production-ready is hard.

                                    Too true. I recently implemented a feature, and I had a standalone working version 90% complete in a day or two. Getting the thing solid, tested, and integrated took the better part of a month.

                                    Part of that, I think, is our education model. Every project I ever had in school was started from scratch, I worked on it for a short period of time, turned it in, and never looked at it again. I got really good at that. I can write a toy version of a hard problem in a very short period of time. But I'm terrible at integrating those things into a larger whole, and I know for a fact I'm not alone in this.

                                    I think it's a huge problem that corporations think that a CS degree is job training and won't hire you if you haven't done it, but colleges think it's about abstract concepts and research and don't teach engineering principles.

                                  • heyzooos 99 days ago

                                    > I’m not trying to make a perfect implementation here. I’m cutting corners to get down to the essence and fit it into what was originally intended to be an article but expanded into a small book.

                                    lol

                              • swolchok 98 days ago

                                FWIW, there is POSIX-specified functionality (swapcontext and friends) for changing the user thread context: http://pubs.opengroup.org/onlinepubs/009695399/functions/swa...

                                • asdfasgasdgasdg 98 days ago

                                  It's very slow though, at least if you believe the Boost docs. They have various context-controlling types. One of them is ucontext_t, which falls back to ucontext. Then there's fcontext_t, which is basically a more advanced version of what this article does. Boost's docs claim it is much faster than ucontext.

                                • mhh__ 98 days ago

                                  https://github.com/dlang/druntime/blob/v2.086.0/src/core/thr...

                                  D's implementation ^ (Linked partly because the inline assembly is very clear)

                                  • kccqzy 98 days ago

                                    Great explanation! Although I have to wonder, why did the author choose to use Rust for this project? As far as I can see, it didn't really use any of Rust's advanced features not available in C or C++, like tagged unions, pattern matching, advanced type system, traits, borrow checking…

                                    For what it's worth, I translated the whole code into C++. And it's essentially the same: https://gist.github.com/kccqzy/c404b8614f39f854f137dcc9284b0... And it runs correctly on macOS and Linux when compiled with clang.

                                    • kasane 98 days ago

                                      Why would one pick C or C++ over Rust for this project?

                                      • sim_card_map 98 days ago

                                        Much easier to build, for me at least.

                                        I tried installing Rust and building the example, but then it complained about nightly features, and I didn't have time to figure that out.

                                        C/C++ are mainstream languages, and building is as simple as g++ test.cpp.

                                        • eridius 98 days ago

                                          Using nightly features is as simple as inserting +nightly as the first argument to rustc or cargo (assuming rust was installed with rustup, which is the default way to install it).

                                          So in this case, it's just `rustc +nightly green_threads.rs`

                                        • kccqzy 98 days ago

                                          If the point is to explain how green threads work, it is better to use a common language so people can simply try to understand the concept and not try to understand both the concept and the language at the same time.

                                        • sim_card_map 98 days ago

                                          Thanks a lot for this!

                                        • bluejekyll 99 days ago

                                          The explanation of the asm! macro is great, I’ve always been a little confused by it.

                                          It does make me think about how this could be useful in the context of Futures and async/await.

                                          • steveklabnik 99 days ago

                                            > It does make me think about how this could be useful in the context of Futures and async/await.

                                            At its core, when you pass a Future (really, a chain of futures, but in the end that's still a Future) to an executor, the executor creates a stack for it. An executor with multiple futures will then switch between them, sorta similar to the code seen here. The details depend on the exact executor, of course, but the fundamental idea is the same.

                                            One of the nice things about the futures design is that you can know exactly how big the stack needs to be, so the resizing stuff isn't an issue.

                                            • cfsamson 98 days ago

                                              The long-term plan for this was/is to actually connect it to futures and the async story in Rust, adapting this to be an executor and implementing our own futures (though I'd probably have to implement a simple reactor and fake some IO operation in a separate thread as well). However, it's quite a bit of rework and better to separate this into two parts, but I feel it could be a good way to build, brick by brick, towards a pretty good understanding for those interested.

                                          • joaomoreno 99 days ago

                                            Beautiful docs

                                            • iamcreasy 98 days ago

                                              Now that there is a lot more interest in green threads, I was wondering if anybody knows why, back in the day, Java dropped support for green threads in favor of native threads?

                                              • fc_barnes 98 days ago

                                                Daily reminder that goroutines don't require a context switch.

                                                • steveklabnik 98 days ago

                                                  A kernel context switch. Neither do the ones described here.