Co-routines are very useful and likely underused, but sometimes you are actually better off being able to pass control to a given thread directly, rather than having a scheduler involved.
Anecdote: almost a decade ago, I was responsible for an NVMe-like implementation (hardware and software). The third version of the firmware organized the various components as threads, but there was no need for preemption (which would have required expensive locking). Traditional scheduling would have worked, but you know exactly which thread should execute next (the hardware signals completion), so an explicit yield_to() was far cheaper: only slightly more expensive than a function call.
> Co-routines are very useful and likely underused, but sometimes you are actually better off being able to pass control to a given thread directly, rather than having a scheduler involved.
That's almost the very definition of a coroutine--explicit transfer of control. In symmetric coroutines you must specify a coroutine for both yield and resume; in asymmetric coroutines you specify what to resume to but yield implicitly returns to whatever resumed the current coroutine. In either case the actual control flow transfer is explicitly invoked.
The term thread is more ambiguous, but it almost always implies control transfers--both the timing and target of control transfer--are implicit and not directly exposed to application logic. (Automagic control transfer might be hidden within commonly used functions (e.g. read and write), injected by the compiler (Go does this), or triggered by hardware.)
You can synthesize a threading framework with both asymmetric and symmetric stackful coroutines by simply overloading the resume and yield operations to transfer control to a scheduler, and then hiding implicit resume/yield points within commonly used functions or by machine translation of the code. In languages where "yield" and "resume" are exposed as regular functions this is especially trivial. Stackful coroutines (as opposed to stackless, which are the most commonly provided type of coroutine) are a powerful enough primitive that building threads is relatively trivial, which is why the concepts are easy to conflate, but they shouldn't be confused.
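Since Rust (the article's language) has no stackful coroutine primitive in the standard library, here is a rough sketch of the scheduler-synthesis idea using stackless tasks instead: each "coroutine" is a closure that runs to its next yield point and tells a round-robin scheduler whether to resume it again. All names (`Scheduler`, `Step`) are made up for illustration.

```rust
// A minimal round-robin scheduler over stackless "coroutines": each task is a
// closure that runs until its next yield point and reports whether it is done.
// This sketches the idea of routing yield/resume through a scheduler; it is
// not a real stackful coroutine implementation.

enum Step {
    Yielded, // task gave control back but wants to be resumed later
    Done,    // task finished
}

struct Scheduler {
    tasks: Vec<Box<dyn FnMut() -> Step>>,
}

impl Scheduler {
    fn new() -> Self {
        Scheduler { tasks: Vec::new() }
    }

    fn spawn(&mut self, task: Box<dyn FnMut() -> Step>) {
        self.tasks.push(task);
    }

    // "Resume" each task in turn until all of them have completed.
    fn run(&mut self) {
        while !self.tasks.is_empty() {
            self.tasks.retain_mut(|t| matches!(t(), Step::Yielded));
        }
    }
}

fn main() {
    let mut sched = Scheduler::new();
    for id in 0..2 {
        let mut count = 0;
        sched.spawn(Box::new(move || {
            count += 1;
            println!("task {} step {}", id, count);
            if count < 2 { Step::Yielded } else { Step::Done }
        }));
    }
    sched.run();
    // Prints: task 0 step 1, task 1 step 1, task 0 step 2, task 1 step 2
}
```

The interesting property is exactly the one described above: the transfer points are explicit (each return of `Step::Yielded`), but the *target* of the transfer is decided by the scheduler, which is what turns coroutines into something thread-like.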
LISP-y languages blur some of these distinctions as libraries can easily rewrite code; they can inject implicit control transfer and stack management in unobtrusive ways. This isn't possible to the same extent in languages like C, C++, or Rust; lacking a proper control flow primitive (i.e. stackful coroutine) their "threading" frameworks are both syntactically and semantically leaky.
By definition a thread preserves stack state--recursive function state--and this usually implies that stack management occurs at a very low level in the execution environment, but in any case largely hidden from the logical application code.
OTOH, this is usually inefficient--stack management is a performance-critical aspect of the runtime. For example, Guile, a Scheme implementation, now provides a stackful coroutine primitive. For a good discussion of some of these issues, see https://wingolog.org/archives/2017/06/27/growing-fibers
Specifically the frameworks that attempt to make asynchronous I/O network programming simple and efficient. So-called native threads are a different matter, as both stack management and control transfer are largely implemented outside the purview of those languages, very much like how native processes are implemented. If you go back far enough in the literature, especially before virtual memory, the distinctions between process and thread fall away. Nowadays threads are differentiated from processes by sharing the same memory/object space.
Yes, I agree, but that's not what the OP does. Also, quite often, co-routines get conflated with cooperative threads/scheduling.
For some color on your other points: the previous version actually used continuation-passing style, which worked and was very fast (faster than coroutines), but was challenging to understand without a good background in FP and FP implementations.
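For readers unfamiliar with the style: in continuation-passing style (CPS) a function never "returns"; it takes the rest of the computation as an argument and calls it. A toy sketch of the idea (nothing here reflects the actual firmware; the names and the fake DMA read are invented for illustration):

```rust
// Toy illustration of continuation-passing style (CPS): each step receives
// the "rest of the computation" as a closure and calls it instead of
// returning a value.

// Pretend to read a block "from hardware", then hand the result to the
// continuation `k` rather than returning it.
fn read_block(addr: u64, k: impl FnOnce(Vec<u8>)) {
    let data = vec![addr as u8; 4]; // fake: pretend DMA completed
    k(data);
}

fn checksum(data: &[u8]) -> u32 {
    data.iter().map(|&b| b as u32).sum()
}

fn main() {
    // Control flow is spelled out explicitly: read, then check, then
    // "complete the command" -- there is no scheduler in sight.
    read_block(7, |data| {
        let sum = checksum(&data);
        println!("checksum = {}", sum); // 7 * 4 = 28
    });
}
```

The appeal in a firmware setting is that "what runs next" is encoded directly in the continuation, with no scheduler or stack switch; the downside, as the parent says, is readability.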
Funny enough, we actually started prototyping with Erlang for a subsequent project (which was cancelled before it went far). Unfortunately I don't know enough to say what's special about the Erlang scheduler (if anything), but as I understand the Erlang concurrency model, it's mostly about not sharing memory (forcing explicit communication). That obviously eliminates a host of bugs, but it would have been way too expensive for the mentioned firmware.
Once we got started with Erlang I was pretty turned off. The pretty examples you see in tutorials aren't what you'll actually be using; instead it's framework upon framework, far from elegant IMO. I was happy not to have to deal with that again. Today I'd probably choose Rust for the same task (static types FTW).
Given that Erlang is sufficiently different from anything else I've seen, it doesn't surprise me that trying to be productive in Erlang before you knew it well enough was a suboptimal experience. I like it, and more specifically Elixir, quite a bit, but the learning curve was steep.
it seems like in every post there's a naysayer about something. you're complaining about an article that explains a difficult topic extensively and generously - in my day you had to pay really good money for that. who cares if the title focuses on one aspect of the explanation? sure it's a little deceptive, but it's such a minor cost to you that you literally spent more time complaining than they deceived you out of (click link with false notion, see length, close page). LPT: being critical in and of itself does not make you look smart.
std::ptr::write(stack_ptr.offset(SSIZE - 16) as *mut u64, hello as u64);
ctx.rsp = stack_ptr.offset(SSIZE - 16) as u64;
gt_switch(&mut ctx, &mut n);
Only the last line should be unsafe -- the first line appears to be writing in bounds to a Vec, which is easy (and much more readable) in safe Rust.
Linux used to do its actual context switch in C with inline asm, and it changed to being in straight asm a while back. This was absolutely a win. Rather than trying to make everything work out behind the compiler’s back, it’s much more straightforward to write a function in asm that does the stack switch.
The ptr::write is needed because the stack vector contains bytes and the code wants to do a pointer cast and write a u64. It could be done in safe Rust with u64::to_ne_bytes and a copy_from_slice or something like that, but since all of this code is extravagantly unsafe anyway, I think it's reasonably clear to Just Do It :)
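For the curious, the safe version alluded to might look roughly like this (`SSIZE` and the offset mirror the article's example; the value written is an arbitrary stand-in for the `hello` function address):

```rust
// Safe alternative to ptr::write for placing a u64 (e.g. a function address)
// into a byte buffer used as a stack. Bounds-checked, endian-explicit, and
// no unsafe needed.
use std::convert::TryInto;

const SSIZE: usize = 48;

fn main() {
    let mut stack = vec![0u8; SSIZE];
    let value: u64 = 0xdead_beef; // stand-in for `hello as u64`
    let at = SSIZE - 16;

    // Write the 8 bytes of `value` at offset SSIZE - 16, as in the article.
    stack[at..at + 8].copy_from_slice(&value.to_ne_bytes());

    // Read it back to confirm the round-trip.
    let got = u64::from_ne_bytes(stack[at..at + 8].try_into().unwrap());
    assert_eq!(got, value);
}
```

Of course the pointer into that buffer still has to be handed to the assembly routine eventually, so the unsafety is merely moved, not eliminated.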
I don't know how practical that optimization would be in this particular use case. Wouldn't it mean the context-switch code would have to be inlined at every call site? And when you switch back to that context, how would you generate that code correctly? You would need to know which registers were not saved by green thread A and not touched by green thread B. So the compiler would basically need to know the runtime behavior of your program to optimize pushing and popping context records, unless I misunderstood your point?
If you set a breakpoint in, say, a Win32 fiber switch and look at the disassembly, it jumps to an internal function that just saves all the registers (and flags) to the active context and restores all the registers from the resumed context, every time. I don't know how much more optimal it can get for the general case.
You potentially don't know who is "resuming", and so don't know what registers they will clobber. It would only be a "downside" if 1) your code uses a register 2) no other possible green threads do, and that isn't an invariant any compiler I know will promise, especially in the face of FFI calls.
If you're at the point where you want to do register allocation and spilling optimization across multitasking points, you're probably better off writing your own compiler instead of expecting a thread runtime to do it for you.
For comparison, normal threads have the same downside: the kernel saves and restores all registers (even more than what the example does), so you're not doing any worse than that.
> Going for "toy" code to production ready is hard.
Too true. I recently implemented a feature, and I had a standalone working version 90% complete in a day or two. Getting the thing solid, tested, and integrated took the better part of a month.
Part of that, I think, is our education model. Every project I ever had in school was started from scratch, I worked on it for a short period of time, turned it in, and never looked at it again. I got really good at that. I can write a toy version of a hard problem in a very short period of time. But I'm terrible at integrating those things into a larger whole, and I know for a fact I'm not alone in this.
I think it's a huge problem that corporations treat a CS degree as job training and won't hire you without one, while colleges treat it as abstract concepts and research and don't teach engineering principles.
It's very slow though, at least if you believe the Boost docs. Boost.Context has various context-switching implementations: one falls back to ucontext, while another, fcontext_t, is basically a more advanced version of what this article does. Boost's docs claim fcontext_t is much faster than ucontext.
Great explanation! Although I have to wonder: why did the author choose Rust for this project? As far as I can see, it doesn't really use any of Rust's advanced features not available in C or C++, like tagged unions, pattern matching, an advanced type system, traits, borrow checking…
If the point is to explain how green threads work, it is better to use a common language so people can simply try to understand the concept and not try to understand both the concept and the language at the same time.
> It does make me think about how this could be useful in the context of Futures and async/await.
At its core, when you pass a Future (really, a chain of futures, but in the end that's still a Future) to an executor, the executor stores it as a task; the future's own state machine plays the role of the stack. An executor with multiple futures will then switch between them, somewhat like the code seen here. The details depend on the exact executor, of course, but the fundamental idea is the same.
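A toy sketch of that switching idea (a hypothetical `run_all` that polls boxed futures round-robin with a no-op waker; real executors park and rely on wakers instead of busy-polling, and `YieldOnce` here is just a hand-rolled stand-in for an await point that isn't immediately ready):

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

// A waker that does nothing: this toy executor polls every task each pass,
// so it never needs to be woken. A real executor would sleep until woken.
struct NoopWaker;
impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

// A future that returns Pending exactly once, forcing the executor to
// switch to another task before this one continues.
struct YieldOnce(bool);
impl Future for YieldOnce {
    type Output = ();
    fn poll(mut self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<()> {
        if self.0 { Poll::Ready(()) } else { self.0 = true; Poll::Pending }
    }
}

// Poll every task in turn until all of them complete.
fn run_all(mut tasks: Vec<Pin<Box<dyn Future<Output = ()>>>>) {
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);
    while !tasks.is_empty() {
        tasks.retain_mut(|t| t.as_mut().poll(&mut cx).is_pending());
    }
}

fn main() {
    let tasks: Vec<Pin<Box<dyn Future<Output = ()>>>> = vec![
        Box::pin(async {
            println!("a: before yield");
            YieldOnce(false).await;
            println!("a: after yield");
        }),
        Box::pin(async {
            println!("b: before yield");
            YieldOnce(false).await;
            println!("b: after yield");
        }),
    ];
    run_all(tasks);
    // Prints both "before" lines, then both "after" lines: the executor
    // interleaves the two futures at their yield points.
}
```

Note that the "switch" here is just returning from poll() and calling another future's poll(): no stack is swapped, which is exactly what distinguishes Rust's futures from the green threads in the article.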
One of the nice things about the futures design is that you can know exactly how big the stack needs to be, so the resizing stuff isn't an issue.
The long-term plan for this was/is to actually connect it to futures and the async story in Rust, adapting this to be an executor and implementing our own futures (though I'd probably have to implement a simple reactor and fake some I/O operations in a separate thread as well). However, it's quite a bit of rework, and it's better to separate this into two parts. Still, I feel it could be a good way to build, brick by brick, toward a pretty good understanding for those interested.