The Rust calling convention we deserve

(mcyoung.xyz)

291 points | by matt_d 13 days ago

17 comments

  • pizlonator 13 days ago
    The main thing you want to do when optimizing the calling convention is measure its perf, not ruminate about what you think is good. Code performs well if it runs fast, not if it looks like it will.

    Sometimes, what the author calls bad code is actually the fastest thing you can do for totally not obvious reasons. The only way to find out is to measure the performance on some large benchmark.

    One reason why sometimes bad looking calling conventions perform well is just that they conserve argument registers, which makes the register allocator’s life a tad easier.

    Another reason is that the CPUs of today are optimized on traces of instructions generated by C compilers. If you generate code that looks like what the C compiler would do - which passes on the stack surprisingly often, especially if you’re MSVC - then you hit the CPU’s sweet spot somehow.

    Another reason is that inlining is so successful, so calls are a kind of unusual boundary on the hot path. It’s fine to have some jank on that boundary if it makes other things simpler.

    Not saying that the changes done here are bad, but I am saying that it’s weird to just talk about what looks like weird code without measuring.

    (Source: I optimized calling conventions for a living when I worked on JavaScriptCore. I also optimized other things too but calling conventions are quite dear to my heart. It was surprising how often bad-looking pass-on-the-stack code won on big, real code. Weird but true.)

    • weinzierl 13 days ago
      I very much agree with that, especially since - like you said - code that looks like it will perform well doesn't always do so.

      That being said I'd like to add that in my opinion performance measurement results should not be the only guiding principle.

      You said it yourself: "Another reason is that the CPUs of today are optimized [..]"

      The important word is "today". CPUs have evolved and still do, and a calling convention should be designed for the long term.

      Sadly, it means that it is beneficial to not deviate too much from what C++ does [1], because it is likely that future processor optimizations will be targeted in that direction.

      Apart from that it might be worthwhile to consider general principles that are not likely to change (e.g. conserve argument registers, as you mentioned), to make the calling convention robust and future proof.

      [1] It feels a bit strange, when I say that because I think Rust has become a bit too conservative in recent years, when it comes to its weirdness budget (https://steveklabnik.com/writing/the-language-strangeness-bu...). You cannot be better without being different, after all.

      • workingjubilee 13 days ago
        The Rust calling convention is actually defined as unstable, so 1.79 is allowed to have a different calling convention than 1.80 and so on. I don't think designing one for the long term is a real concern right now.
        • weinzierl 13 days ago
          I know, but from what I understand there are initiatives to stabilize the ABI, which would also mean stabilizing calling conventions. I read the article in that broader context, even if it does not talk about that directly.
          • JoshTriplett 13 days ago
            There's no proposal to stabilize the Rust ABI. There are proposals to define a separate stable ABI, which would not be the default ABI. (Such a separate stable ABI would want to plan for long-term performance, but the default ABI could continue to improve.)
            • zozbot234 12 days ago
              There is already a separate stable ABI, it's just the C ABI. There are also multiple crates that address the issue of stable ABIs for Rust code. It's not very clear why compiler involvement would be required for this.
              • tialaramex 11 days ago
                Surely it would be nice to be able to specify using the repr mechanism that you explicitly want such-and-such ABI, in the same way that you can for the C ABI.

                I haven't looked at the crates you're describing, but presumably they're providing a proc macro or something instead, which is not really the right layer to do this stuff.

            • weinzierl 12 days ago
              Thanks, I had never considered that a possibility when hearing about "Rust stable ABI", but it makes a lot of sense.
        • dathinab 13 days ago
          If I remember correctly there is a bit of difference between explicit `extern "rust"` and no explicit calling convention but I'm not so sure.

          Anyway, at least when not using an explicit repr, Rust doesn't even guarantee that the layout of a struct is the same across two repeated builds _with the same compiler and code_. That is very intentional and I think there is no intent to change it "in general" (though various subsets might be standardized; `Option<&T> where T: Sized` mapping `None` to a null pointer, allowing you to use it in C FFI, is already a de facto standard). Which, as far as I remember, is where explicit `extern "Rust"` comes in, to make sure that we can have a prebuilt libstd - but it still can change with _any_ compiler version, including patch versions. E.g. a hypothetical 1.100 and 1.100.1 might not have the same unstable Rust calling convention.

      • Ygg2 12 days ago
        > means that it is beneficial to not deviate too much from what C++ does

        Or just C.

        Reminds me of when I looked up SIMD instructions for searching string views. It was more performant to slap a '\0' on the end and use null-terminated string instructions than to use string view search functions.

        • mananaysiempre 12 days ago
          Huh, I thought they fixed that (the PCMPISTR? string instructions from SSE4.2 being significantly faster than PCMPESTR?), but looks like the explicit-length version still takes twice as many uops on recent Intel and AMD CPUs. They don’t seem to get much use nowadays anyway, though, what with being stuck in the 128-bit realm (there’s a VEX encoding but that’s basically it).
      • flohofwoe 13 days ago
        > and a calling convention should be designed for the long term

        ...isn't the article just about Rust code calling Rust code? That's a much more flexible situation than calling into operating system functions or into other languages. For calling within the same language a stable ABI is by far not as important as on the 'ecosystem boundaries', and might actually be harmful (see the related drama in the C++ world).

        • weinzierl 13 days ago
          You are right, as Josh Triplett also pointed out above. I was mistaken about the plans to stabilize the ABI.
    • leni536 13 days ago
      Yep. Whether passing in registers is faster or not also depends on the function body. It doesn't make much sense if the first thing the function does is take the address of the parameter and pass it to some opaque function. Then it needs to be spilled onto the stack anyway.

      It would be interesting to see calling convention optimizations based on function body. I think that would be safe for static functions in C, as long as their address is not taken.
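
      A small illustration of the spill point above (hypothetical functions, just to show the shape of the problem):

          #[derive(Clone, Copy)]
          struct Config { a: u64, b: u64 }

          #[inline(never)]
          fn opaque(cfg: &Config) -> u64 { cfg.a ^ cfg.b }

          // Even if `cfg` arrives in registers, the first thing this function does is
          // take its address to pass a reference on, so it gets spilled to the stack anyway.
          fn run(cfg: Config) -> u64 {
              opaque(&cfg)
          }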

      • __s 12 days ago
        Dynamic calling conventions also won't work with dynamic linking
        • kevincox 11 days ago
          Even when dynamic linking a lot of calls will be internal to each library. So you can either:

          1. Use a stable calling convention for external interfaces.

          2. Use the unstable calling convention for everything, but generate trampolines for external calls.

          Swift is actually pretty cool here. It basically does 2. But you can also specify which dependencies are "pinned" so that even if they are dynamically linked they can't be updated without a recompile. Then you can use the unstable calling convention when calling those dependencies.
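
          A rough sketch of option 2 in Rust terms (hypothetical names; the trampoline exposes a fixed C ABI on the outside while the real function keeps the unstable Rust convention):

              // Internal function: plain `extern "Rust"`, free to change between compiler versions.
              fn fast_impl(a: u64, b: u64) -> u64 {
                  a.wrapping_mul(b).rotate_left(7)
              }

              // Trampoline for external callers: stable ABI outside, a direct
              // (ideally tail) call to the fast-convention function inside.
              #[no_mangle]
              pub extern "C" fn lib_entry(a: u64, b: u64) -> u64 {
                  fast_impl(a, b)
              }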

    • workingjubilee 13 days ago
      Your experience is not perfectly transferable. JITs have it easy on this because they've already gathered a wealth of information about the actually-executing-on CPU by the time they generate a single line of assembly. Calls appear on the hot path more often in purely statically compiled code because things like the runtime architectural feature set are not known, so you often reach inlining barriers precisely in the code that you would most like to optimize.
      • pizlonator 12 days ago
        LLVM inlines even more than my JIT does.

        The JIT has both more and less information.

        It has more information about the program globally. There is no linking or “extern” boundary.

        But whereas the AOT compiler can often prove that it knows about all of the calls to a function that could ever happen, the JIT only knows about those that happened in the past. This makes it hard (and sometimes impossible) for the JIT to do the call graph analysis style of inlining that llvm does.

        One great example of something I wish my jit had but might never be able to practically do, but that llvm eats for breakfast: “if A calls B in one place and nothing else calls B, then inline B no matter how large it is”.

        (I also work on ahead of time compilers, though my impact there hasn’t been so big that I brag about it as much.)

        • kevincox 11 days ago
          > the JIT only knows about those that happened in the past.

          This is typically handled by assuming that all future calls will be representative of past calls. Then you add a (cheap) check for that assumption and fall back to interpreter or an earlier JIT that didn't make that assumption.

          This can actually be better than AOT because you may have some incredibly rare error path that creates a second call to the function. But you are better off compiling that function assuming that the error never occurs. In the unlikely case it does occur you can fall back to the slower path and end up faster overall. Unless the AOT compiler wants to emit two specializations of the function the generated code needs to handle all possible cases, no matter how unlikely.

          Of course in practice AOT wins. But there are many interesting edge cases where a JIT can pull off an optimization that an AOT compiler can't do.

      • saagarjha 13 days ago
        The people who write JITs also write a bunch of C++ that gets statically compiled.
    • mkj 13 days ago
      And remember that performance can include binary size, not just runtime speed. Current Rust seems to suffer in that regard for small platforms; the calling convention could possibly help there wrt Result returns.
      • fleventynine 13 days ago
        The current calling convention is terrible for small platforms, especially when using Result<> in return position. For large enums, the compiler should put the discriminant in a register and the large variants on the stack. As is, you pay a significant code size penalty for idiomatic rust error handling.
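
        For concreteness, a sketch of the kind of signature that hits this (made-up names; today the whole enum is typically returned through a hidden out-pointer once it no longer fits in a register pair):

            struct Packet { bytes: [u8; 64] }
            enum ReadError { Closed, WouldBlock }

            // Result<Packet, ReadError> is larger than two registers, so the caller
            // reserves stack space and passes a pointer; putting just the discriminant
            // in a register would let the common `match`/`?` check stay out of memory.
            fn read_packet() -> Result<Packet, ReadError> {
                Err(ReadError::WouldBlock)
            }
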
        • planede 13 days ago
          There were proposals for optimizing this kind of stuff for C++ in particular for error handling, like:

          https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p07...

          > Throwing such values behaves as-if the function returned union{R;E;}+bool where on success the function returns the normal return value R and on error the function returns the error value type E, both in the same return channel including using the same registers. The discriminant can use an unused CPU flag or a register

          • mananaysiempre 12 days ago
            IBM BIOS and MS DOS calls used the carry flag as a boolean return or error indicator (the 8086 has instructions for manually setting and resetting it). I don’t think people do that these days except in manually-coded assembly, unfortunately (which the relevant parts of both BIOS and DOS also were of course).
      • pizlonator 12 days ago
        Also a thing you gotta measure.

        Passing a lot of stuff in registers causes a register shuffle at call sites and prologues. Hard to predict if that’s better or worse than stack spills without measuring.

    • CalChris 12 days ago
      "If you generate code that looks like what the C compiler would do - which passes on the stack surprisingly often, especially if you’re MSVC - then you hit the CPU’s sweet spot somehow."

      The FA is mostly about x86 and Intel indeed did an amazing amount of clever engineering over decades to allow your ugly x86 code to run fast on their silicon that you buy.

      Still, does your point about the empirical benefit of passing on the stack continue to apply with a transition to register rich ARMV8 CPUs or RISC-V?

      • Maken 12 days ago
        ARM follows its own calling convention, which by default uses registers for both argument and return value passing [1], so these lessons likely do not apply.

        [1] https://developer.arm.com/documentation/dui0041/c/ARM-Proced...

      • pizlonator 12 days ago
        Yes.

        If you flatten big structs into registers to pass them you have a bad time on armv8.

        I tried. That was an llvm experiment. Ahead of time compiler for a modified version of C.

    • amelius 12 days ago
      If you want fast, then you probably need to have a different calling convention per call.
  • JonChesterfield 13 days ago
    Reasonable sketch. This is missing the caller/callee-save distinction and makes the usual error of assigning a subset of the input registers to output.

    It's optimistic about debuggers understanding non-C-like calling conventions which I'd expect to be an abject failure, regardless of what dwarf might be able to encode.

    Changing ABI with optimization setting interacts really badly with separate compilation.

    Shuffling arguments around in bin packing fashion does work but introduces a lot of complexity in the compiler, not sure it's worth it relative to left to right first fit. It also makes it difficult for the developer to predict where arguments will end up.

    The general plan of having different calling conventions for addresses that escape than for those that don't is sound. Peeling off a prologue that does the impedance matching works well.

    Rust probably should be willing to have a different calling convention to C, though I'm not sure it should be a hardcoded one that every function uses. Seems an obvious thing to embed in the type system to me and allowing developer control over calling convention removes one of the performance advantages of assembly.

    • LegionMammal978 13 days ago
      > This is missing the caller/callee-save distinction and makes the usual error of assigning a subset of the input registers to output.

      Out of curiosity, what's so problematic about using some input registers as output registers? On the caller's side, you'd want to vacate the output registers between any two function calls regardless. And it occurs pretty widely in syscall conventions, to my binary-golfing detriment.

      Is it for the ease of the callee, so that it can set up the output values while keeping the input values in place? That would suggest trying to avoid overlap (by placing the output registers at the end of the input sequence), but I don't see how it would totally contraindicate any overlap.

      • JonChesterfield 13 days ago
        You should use all the input registers as output registers, unless your arch is doing some sliding window thing. The x64 proposal linked uses six to pass arguments in and three to return results. So returning six integers means three in registers, three on the stack, with three registers that were free to use containing nothing in particular.
        • jcranmer 13 days ago
          The LLVM calling conventions for x86 only allow returning 3 integer registers, 4 vector registers, and 2 x87 floating point registers (er, stack slots technically because x87 is weird).
          • JonChesterfield 13 days ago
            Sure. That would be an instance of the "usual error". The argument registers are usually caller save, where any unused ones get treated as scratch in the callee, in which case making them all available for returning data as well is zero cost.

            There's no reason not to, other than C makes returning multiple things awkward and splitting a struct across multiple registers is slightly annoying for the compiler.

          • Denvercoder9 13 days ago
            Limiting a newly designed Rust ABI to whatever LLVM happens to support at the moment seems unnecessarily restrictive. Yeah, you'd need to write some C++ to implement it, but that's not the end of the world, especially compared to getting stuck with arbitrary limits in your ABI for the next decade or two.
            • anonymoushn 13 days ago
              This sort of thing is why integer division by 0 is UB in rust on targets where it's not UB, because it's UB in LLVM :)
              • tialaramex 12 days ago
                I stared at this really hard, and I eventually couldn't figure out what you mean here.

                Obviously naively just dividing integers by zero in Rust will panic, because that's what is defined to happen.

                So you have to be thinking about a specific case where it's defined not to panic. But, what case? There isn't an unchecked_div defined on the integers. The wrapping and saturating variants panic for division by zero, as do the various edge cases like div_floor

                What case are you thinking of where "integer division by 0 is UB in Rust" ?

                • Marzepansion 12 days ago
                  The poster is both correct and incorrect. It definitely is true that LLVM only has two instructions to deal with division, udiv and sdiv specifically, and it used to be the case that Rust, as a consequence, had UB when encountering division by zero, as those two instructions consider that operation UB.

                  But Rust has solved this problem by inserting a check before every division that reasonably could get a division by zero (might even be all operations, I don't know the specifics), which checks for zero and defines the consequences.

                  So as a result divisions aren't just divisions in Rust, they come with an additional check as overhead, but they aren't UB either.

                  • tialaramex 12 days ago
                    Oh, I see, yes obviously if you know your value isn't zero, that's what the NonZero types are for, and these of course don't emit a check because it's unnecessary.
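
                    A small sketch of the two cases (illustrative; the checked plain division versus the check-free NonZero path):

                        use std::num::NonZeroU32;

                        // Plain integer division: the compiler inserts a zero check that
                        // panics, so the underlying LLVM `udiv` never sees a zero divisor.
                        fn div(a: u32, b: u32) -> u32 {
                            a / b // panics with "attempt to divide by zero" if b == 0
                        }

                        // With NonZero the zero case is ruled out by the type, so no check
                        // is emitted and the codegen is a bare division.
                        fn div_nonzero(a: u32, b: NonZeroU32) -> u32 {
                            a / b
                        }
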
                    • anonymoushn 12 days ago
                      Sure, and if you actually want a branchless integer division for an arbitrary input, which is defined for the entire input domain on x64, then to get it you'll have to pull some trick like reinterpreting a zeroable type as a nonzero one, heading straight through LLVM IR UB on your way to the defined behavior on x64.
                      • tialaramex 11 days ago
                        By the way: Don't actually do this. The LLVM IR is not defined to do what you wanted, and even if it works today, and it worked yesterday it might just stop working tomorrow, or on a different CPU model or with different optimisation settings.

                        If what you want is "Whatever happens when I execute this CPU instruction" you can literally write that in Rust today and that will do what you wanted. Invoking UB because you're sure you know better is how you end up with mysterious bugs.

                        This reminds me of people writing very crazy unsafe Rust to try to reproduce the "Quake fast inverse square root" even though um, you can just write that exact routine in safe Rust and it's guaranteed to do exactly what you meant with the IEEE re-interpretation as integer etc., safely and emitting essentially the same machine code on x86 - not even mentioning that's not how to calculate an inverse square root quickly today because Quake was a long time ago and your CPU is much better today than the ones Carmack wrote that code for.

                        • anonymoushn 11 days ago
                          I agree, one should put asm inside of the unsafe block instead. Because of the unfortunate fact about LLVM IR mentioned many comments above, the raw div in LLVM IR is allowed to be treated the same as if it was preceded with `if (divisor == 0) unreachable;`, which is a path to disaster.
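
                          A minimal sketch of that asm escape hatch (assuming x86-64; illustrative only - a zero divisor raises #DE, i.e. a SIGFPE, rather than LLVM-level UB):

                              use std::arch::asm;

                              fn raw_div(dividend: u64, divisor: u64) -> u64 {
                                  let quotient: u64;
                                  unsafe {
                                      // `div` divides rdx:rax by the operand; quotient lands in
                                      // rax, remainder in rdx (discarded here).
                                      asm!(
                                          "div {d}",
                                          d = in(reg) divisor,
                                          inout("rax") dividend => quotient,
                                          inout("rdx") 0u64 => _,
                                      );
                                  }
                                  quotient
                              }
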
    • workingjubilee 13 days ago
      Allowing developer control over calling conventions also means disallowing optimization in the case where Function A calls Function B calls Function C calls Function D etc., but along the way one or more of those functions could have their arguments swapped around to a different convention to reduce overhead. What semantics would preserve such an optimization but allow control? Would it just be illusory?

      And in practice assembly has the performance disadvantage of not being subject to most compiler optimizations, often including "introspecting on its operation, determining it is fully redundant, and eliminating it entirely". It's not the 1990s anymore.

      In the cases where that kind of optimization is not even possible to consider, though, the only place I'd expect inline assembly to be decisively beaten is using profile-guided optimization. That's the only way to extract more information than "perfect awareness of how the application code works", which the app dev has and the compiler dev does not. The call overhead can be eliminated by simply writing more assembly until you've covered the relevant hot boundaries.

      • JonChesterfield 12 days ago
        If those functions are external you've lost that optimisation anyway. If they're not, the compiler chooses whether to ignore your annotation or not as usual. As is always the answer, the compiler doesn't get to make observable changes (unless you ask it to, fwrong-math style).

        I'd like to specify things like extra live out registers, reduced clobber lists, pass everything on the stack - but on the function declaration or implementation, not having to special case it in the compiler itself.

        Sufficiently smart programmers beat ahead of time compilers. Sufficiently smart ahead of time compilers beat programmers. If they're both sufficiently smart you get a common fix point. I claim that holds for a jit too, but note that it's just far more common for a compiler to rewrite the code at runtime than for a programmer to do so.

        I'd say that assembly programmers are rather likely to cut out parts of the program that are redundant, and they do so with domain knowledge and guesswork that is difficult to encode in the compiler. Both sides are prone to error, with the classes of error somewhat overlapping.

        I think compilers could be a lot better at codegen than they presently are, but the whole "programmers can't beat gcc anymore" idea isn't desperately true even with the current state of the art.

        Mostly though I want control over calling conventions in the language instead of in compiler magic because it scales much better than teaching the compiler about properties of known functions. E.g. if I've written memcpy in asm, it shouldn't be stuck with the C caller save list, and avoiding that shouldn't involve a special case branch in the compiler backend.

    • khuey 13 days ago
      > It's optimistic about debuggers understanding non-C-like calling conventions which I'd expect to be an abject failure, regardless of what dwarf might be able to encode.

      DWARF doesn't encode bespoke calling conventions at all today.

    • t0b1 13 days ago
      The bin packing will probably make it slower though, especially in the bool case since it will create dependency chains. For bools on x64, I don’t think there’s a better way than first having to get them in a register, shift them and then OR them into the result. The simple way creates a dependency chain of length 64 (which should also incur a 64 cycle penalty) but you might be able to do 6 (more like 12 realistically) cycles. But then again, where do these 64 bools come from? There aren’t that many registers so you will have to reload them from the stack. Maybe the rust ABI already packs bools in structs this tightly so it’s work that has to be done anyway but I don’t know too much about it.

      And then the caller will have to unpack everything again. It might be easier to just teach the compiler to spill values into the result space on the stack (in cases where the IR doesn’t already store the result after the computation), which will likely also perform better.

      • dzaima 13 days ago
        Unpacking bools is cheap - to move any bit into a flag is just a single 'test' instruction, which is as good as it gets if you have multiple bools (other than passing each in a separate flag, which is quite undesirable).

        Doing the packing in a tree fashion to reduce latency is trivial, and store→load latency isn't free either depending on the microarchitecture (and at the counts where log2(n) latency becomes significant you'll be at IPC limit anyway). Packing vs store should end up at roughly the same instruction counts too - a store vs an 'or', and the exact same amount of moving between flags and GPRs.

        Reaching 64 bools might be a bit crazy, but 4-8 seems reasonably attainable from each of many arguments being an Option<T>, where the packing would reduce needed register/stack slot count by ~2.

        Where possible it would of course make sense to pass values in separate registers instead of in one, but when the alternative is spilling to stack, packing is still worthy of consideration.
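
        A sketch of the tree-shaped packing (and the single-instruction unpack on the other side); hypothetical helpers:

            // Packing: depth-3 tree of ORs instead of a 7-deep serial chain.
            fn pack8(b: [bool; 8]) -> u8 {
                let bit = |i: usize| (b[i] as u8) << i;
                let lo = (bit(0) | bit(1)) | (bit(2) | bit(3));
                let hi = (bit(4) | bit(5)) | (bit(6) | bit(7));
                lo | hi
            }

            // Unpacking one flag is a single test-against-mask, feeding a flag directly.
            fn is_set(packed: u8, i: usize) -> bool {
                packed & (1 << i) != 0
            }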

        • saghm 13 days ago
          > Reaching 64 bools might be a bit crazy, but 4-8 seems reasonably attainable from each of many arguments being an Option<T>, where the packing would reduce needed register/stack slot count by ~2

          I don't have a strong sense of how much more common owned `Option` types are than references, but it's worth noting that if `T` is a reference, `Option<T>` will just use a pointer and treat the null value as `None` under the hood to avoid needing any tag. There are probably other types where this is done as well (maybe `NonZero` integer types?)

          • tialaramex 11 days ago
            Rust has a thing called the Guaranteed Niche Optimisation, which says that if you make a sum type with exactly one variant that carries no data (just the variant itself), plus exactly one variant holding a type which has a niche (a bit pattern which isn't used by any valid representation of that type), then it promises that your type is the same size as the type with the niche in it.

            That is, if you made your own Maybe type which works like Option, it's also guaranteed to get this optimisation, and the optimisation works for any type which the compiler knows has a "niche", not just obvious things like references or small enumerations, the NonZero type, but also e.g. OwnedFd, a type which is a Unix file descriptor - Unix file descriptors cannot be -1, and so logically the bit pattern for -1 serves as a niche for this type.
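
            A quick illustration of that guarantee (a hand-rolled Maybe gets the same treatment as Option):

                use std::mem::size_of;
                use std::num::NonZeroU8;

                #[allow(dead_code)]
                enum Maybe<T> { Nothing, Just(T) }

                fn main() {
                    assert_eq!(size_of::<Option<NonZeroU8>>(), 1);
                    assert_eq!(size_of::<Maybe<NonZeroU8>>(), 1);           // same guarantee
                    assert_eq!(size_of::<Option<&u8>>(), size_of::<&u8>()); // null is the niche
                }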

            I really like this feature, and I want to use it more. There's good news and bad news. The good news is that although the Guaranteed Niche Optimisation is the only such guarantee, in practice the Rust compiler will do much more with a niche.

            The bad news is that we're not allowed to make new types with their own niches (other than enumerations which automatically get an appropriately sized niche) in stable Rust today. In fact the ability to mark a niche is not only permanently unstable (thus usable in practice only from the Rust stdlib) but a compiler-internal feature (they're pretty much telling you not to touch this; it can't and won't get stabilized in this form).

            But we do have a good number of useful niches in the standard library: all references, the NonNull pointers (if you use pointers for something), the NonZero types, the booleans, small C-style enumerations, OwnedFd - that's quite a lot of possibilities.

            The main thing I want, and the reason I tried to make more movement happen (but I have done very little for about a year), is BalancedIx: a suite of types like NonZero, but missing the most negative value of each signed integer. You very rarely need -128 in an 8-bit signed integer, and it's kind of a nuisance, so BalancedI8 would be the same size as i8 but lose -128; in exchange Option<BalancedI8> is also that size and abs does what you expected - two for the price of one!

          • ratmice 13 days ago
            Yeah, `NonZero*`, but also a type like `#[repr(u8)] enum Foo { X }`. According to `assert_eq!(std::mem::size_of::<Option<Foo>>(), std::mem::size_of::<Foo>())`, you need an enum which fully saturates the repr, e.g. `#[repr(u8)] enum Bar { X0, ... X255 }` (pseudo code), before niche optimization fails to kick in.
            • saghm 12 days ago
              Oh, good to know!
    • rayiner 13 days ago
      Also, most modern processors will easily forward the store to the subsequent read and have a bunch of tricks for tracking the stack state. So how much does putting things in registers help anyway?
      • kevingadd 13 days ago
        Forwarding isn't unlimited, though, as I understand it. The CPU has limited-size queues and buffers through which reordering, forwarding, etc. can happen. So I wouldn't be surprised if using registers well takes pressure off of that machinery and ensures that it works as you expect for the data that isn't in registers.

        (Looked around randomly to find example data for this) https://chipsandcheese.com/2022/11/08/amds-zen-4-part-2-memo... claims that Zen 4's store queue only holds 64 entries, for example, and a 512-bit register store eats up two. I can imagine how an algorithm could fill that queue up by juggling enough data.

        • rayiner 13 days ago
          It’s limited, but in the argument passing context you’re storing to a location that’s almost certainly in L1, and then probably loading it immediately within the called function. So the store will likely take up a store queue slot for just a few cycles before the store retires.
          • FullyFunctional 12 days ago
            Due to speculative out-of-order execution, it's not just "a few cycles". The LSU has a hard, small, limit on the number of outstanding loads and stores (usually separate limits, on the order of 8-32) and once you fill that, you have to stop issuing until commit has drained them.

            This discussion is yet another instance of the fallacy of "Intel has optimized for the current code so let's not improve it!". Other examples include branch prediction (a correctly predicted branch has a small but not zero cost) and indirect jump prediction. And this doesn't even begin to address implementations that might be less aggressive about making up for bad code (like most RISCs and RISC-likes).

      • dwattttt 13 days ago
        More broadly: processor design has been optimised around C-style antics for a long time; trying to optimise the generated code away from that could well inhibit processor tricks in such a way that the result is _slower_ than if you stuck with the "looks terrible but is expected & optimised" status quo.
        • eru 13 days ago
          Reminds me of Fortran compilers recognising the naive three-nested-loops matrix multiplication and optimising it to something sensible.
      • pcwalton 12 days ago
        Register allocation decisions routinely result in multi-percent performance changes, so yes, it does.

        Also, registers help the MachineInstr-level optimization passes in LLVM, of which there are quite a few.

  • quotemstr 13 days ago
    The C calling convention kind of sucks. True, we can't change the C calling convention, but that doesn't make it any less unfortunate.

    We should use every available caller-saved register for arguments and return values, but in the traditional SysV ABI, we use only one register (sometimes two) for return values. If you return a struct Point3D { long x, y, z; }, you spill to the stack even though we could damned well put Point3D in rax, rdi, and rsi.
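
    A sketch of that case (SysV x86-64 classification; a 24-byte aggregate is class MEMORY, so it comes back through a hidden pointer):

        #[repr(C)]
        pub struct Point3D { pub x: i64, pub y: i64, pub z: i64 }

        // The caller passes a hidden out-pointer (in rdi) and the callee stores
        // x, y and z through it, even though several caller-saved registers are
        // sitting idle and could have carried the three values directly.
        pub extern "C" fn origin() -> Point3D {
            Point3D { x: 0, y: 0, z: 0 }
        }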

    There are other tricks other systems use. For example, if I recall correctly, in SBCL, functions set the carry flag on exit if they're returning multiple values. Wouldn't it be nice if we used the carry flag to indicate, e.g., whether a Result contains an error?

    • fch42 13 days ago
      "sucks" is a strong word but with respect to return values, you're right. The C calling conventions, everywhere really, support what C supports - returning one argument. Well, not even that (struct returns ... nope). Kind of "who'd have thought" in C I guess. And then there's the C++ argument "just make it inline then".

      On the other hand, memory spills happen. For SPARC, for example, the gracious register space (windows) ended up with lots of unused regs in simple functions and a cache-busting huge stack size footprint, definitely so if you ever spilled the register ring. Even with all the mov in x86 (and there is always lots of it, at least in compiled C code) to rearrange data to "where it needed to be", it often ended up faster.

      When you only look at the callee code (code generated for a given function signature), it's tempting to say "oh it'll definitely be fastest if this arg is here and that return there". You don't know the callers though. There's no guarantee the argument marshalling will end up "pass through" or the returns are "hot" consumed. Say, a struct Point { x: i32, y: i32, z: i32 } as arg/return; if the caller does something like mystruct.deepinside.point[i] = func(mystruct.deepinside.point[i]) in a loop then moving it in/out of regs may be overhead or even prevent vectorisation. But the callee cannot know. Unless... the compiler can see both and inline (back to the C++ excuse). Yes, for function call chaining javascript/rust style it might be nice/useful "in principle". But in practice only if the compiler has enough caller/callee insight to keep the hot working set "passthrough" (no spills).

      The lowest hanging fruit on calling is probably to remove the "functions return one primitive thing" restriction that's ingrained in the C ABIs almost everywhere. For the rest? A lot of benchmarking and code generation statistics. I'd love to see more of that. Even if it's dry stuff.

      • flohofwoe 13 days ago
        > Well, not even that (struct returns ... nope).

        C compilers actually pack small struct return values into registers:

        https://godbolt.org/z/51q5se86s

        It's just limited in that on x86-64, GCC and Clang use up to two registers while MSVC only uses one.

        Also, IMHO there is no such thing as a "C calling convention", there are many different calling conventions that are defined by the various runtime environments (usually the combination of CPU architecture and operating system). C compilers just must adhere to those CPU+OS calling conventions like any other language that wants to interact directly with the operating system.

        IMHO the whole performance angle is a bit overblown though, for 'high frequency functions' the compiler should inline the function body anyway. And for situations where that's not possible (e.g. calling into DLLs), the DLL should expose an API that doesn't require such 'high frequency functions' in the first place.

        • fch42 13 days ago
          > Also, IMHO there is no such thing as a "C calling convention", there are many different calling conventions [ ... ]

          I did not say that. I said "C calling conventions" (plural). Rather aware of the fact that the devil is in the detail here ... heck, if you want it all, back in the bad old days, even the same compiler supported/used multiple ("fastcall" & Co, or on Win 3.x "pascal" for system interfaces, or the various ARM ABIs, ...).

          • dzaima 12 days ago
            Clang still has some alternative calling conventions via __attribute__((X)) for individual functions with a bunch of options[0], though none just extend the set of arguments passed via GPRs (closest seems to be preserve_none with 12 arguments passed by register, but it also unconditionally gets rid of all callee-saved registers; preserve_most is nice for rarely-taken paths, though until clang-17 it was broken on functions which returned things).

            [0]: https://clang.llvm.org/docs/AttributeReference.html#calling-...

  • Arnavion 12 days ago
    Tangentially related, there's another "unfortunate" detail of Rust that makes some structs bigger than you want them to be. Imagine a struct Foo that contains eight `Option<u8>` fields, ie each field is either `None` or `Some(u8)`. In C, you could represent this as a struct with eight 1-bit `bool`s and eight `uint8_t`s, for a total size of 9 bytes. In Rust however, the struct will be 16 bytes, ie eight sequences of 1-byte discriminant followed by a `uint8_t`.

    Why? The reason is that structs must be able to present borrows of their fields, so given a `&Foo` the compiler must allow the construction of a `&Foo::some_field`, which in this case is an `&Option<u8>`. This `&Option<u8>` must obviously look identical to any other `&Option<u8>` in the program. Thus the underlying `Option<u8>` is forced to have the same layout as any other `Option<u8>` in the program, ie its own personal discriminant bit rounded up to a byte followed by its `u8`. The struct pays this price even if the program never actually constructs a `&Foo::some_field`.

    This becomes even worse if you consider Options of larger types, like a struct with eight `Option<u16>` fields. Then each personal discriminant will be rounded up to two bytes, for a total size of 32 bytes with a quarter (or almost half, if you include the unused bits of the discriminants) being wasted interstitial padding. The C equivalent would only be 18 bytes. With `Option<u64>`, the Rust struct would be 128 bytes while the C struct would be 72 bytes.

    You *can* implement the C equivalent manually of course, with a `u8` for the packed discriminants and eight `MaybeUninit<T>`s, and functions that map from `&Foo` to `Option<&T>`, `&mut Foo` to `Option<&mut T>`, etc, but not to `&Option<T>` or `&mut Option<T>`.
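
    A minimal sketch of that manual layout (hypothetical names, only one accessor shown; not necessarily what the playground link below contains):

        use std::mem::MaybeUninit;

        // size_of::<PackedOptions>() == 9, matching the C layout described above.
        struct PackedOptions {
            present: u8,                 // bit i set => slot i holds a value
            slots: [MaybeUninit<u8>; 8], // payloads, left uninitialized when absent
        }

        impl PackedOptions {
            fn new() -> Self {
                PackedOptions { present: 0, slots: [MaybeUninit::uninit(); 8] }
            }

            fn set(&mut self, i: usize, value: u8) {
                self.slots[i] = MaybeUninit::new(value);
                self.present |= 1 << i;
            }

            // Can hand out Option<&u8>, but not &Option<u8> - exactly the
            // trade-off described above.
            fn get(&self, i: usize) -> Option<&u8> {
                if self.present & (1 << i) != 0 {
                    // SAFETY: the bit says `set` initialized this slot.
                    Some(unsafe { self.slots[i].assume_init_ref() })
                } else {
                    None
                }
            }
        }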

    https://play.rust-lang.org/?version=stable&mode=debug&editio...

    • Arcuru 12 days ago
      You have to implement the C version manually, so it's not that odd you'd need to do the same for Rust?

      You've described, basically, a custom type that is 8 Options<u8>s. If you start caring about performance you'll need to roll your own internal Option handling.

      • Arnavion 12 days ago
        >You have to implement the C version manually

        There's no "manually" about it. There's only one way to implement it in C, ie eight booleans and eight uint8_ts as I described. Going from there to the further optimization of adding a `:1` to every `bool` field is a simple optimization. Reimplementing `Option` and the bitpacking of the discriminants is much more effort compared to the baseline implementation of using `Option`.

        • IshKebab 12 days ago
          The alternative is `std::optional` which works exactly the same as Rust's `Option` (without the niche optimisation).

          I'm not a C programmer but I imagine you could make something like `std::optional` in C using structs and macros and whatnot.

        • surajrmal 11 days ago
          But it's not any more work than it would take in C. What does it matter how much work it is relative to rust's happy path?
    • Aurornis 12 days ago
      > You can implement the C equivalent manually of course

      But you have to implement the C version manually as well.

      It's not really a downside to Rust if it provides a convenient feature that you can choose to use if it fits your goals.

      The use case you're describing is relatively rare. If it's an actual performance bottleneck then spending a little extra time to implement it in Rust doesn't seem like a big deal. I have a hard time considering this an "unfortunate detail" to Rust when the presence of the Option<_> type provides so much benefit in typical use cases.

      • Arnavion 12 days ago
        I answered this in the other subthread already.
  • dwattttt 13 days ago
    > If a non-polymorphic, non-inline function may have its address taken (as a function pointer), either because it is exported out of the crate or the crate takes a function pointer to it, generate a shim that uses -Zcallconv=legacy and immediately tail-calls the real implementation. This is necessary to preserve function pointer equality.

    If the legacy shim tail calls the Rust-calling-convention function, won't that prevent it from fixing any return value differences in the calling convention?

    • JonChesterfield 13 days ago
      Yes. People tend to forget about the return half of the calling convention though so it's an understandable typographical error.
  • yogorenapan 13 days ago
    Tangentially related: Is it currently possible to have interop between Go and Rust? I remember seeing someone achieving it with Zig in the middle but can’t for the life of me find it. Have some legacy Rust code (what??) that I’m hoping to slowly port to Go piece by piece
    • 100k 13 days ago
      Yes, you can use CGO to call Rust functions using extern "C" FFI. I gave a talk about how we use it for GitHub code search at RustConf 2023 (https://www.youtube.com/watch?v=KYdlqhb267c) and afterwards I talked to some other folks (like 1Password) who are doing similar things.

      It's not a lot of fun because moving types across the C interop boundary is tedious, but it is possible and allows code reuse.

    • apendleton 13 days ago
      If you want to call from Go into Rust, you can declare any Rust function as `extern "C"` and then call it the same way you would call C from Go. Not sure about going the other way.
    • duped 13 days ago
      It's usually unwise to mix managed and unmanaged memory since the managed code needs to be able to own the memory it's freeing and moving whereas the unmanaged code needs to reason about when memory is freed or moved. cgo (and other variants) let you mix FFI calls into unmanaged memory from managed code in Go, but you pay a penalty for it.

      In language implementations where GC isn't shared by the different languages calling each other you're always going to have this problem. Mixing managed/unmanaged code is both an old idea and actively researched.

      It's almost always a terrible idea to call into managed code from unmanaged code unless you're working directly with an embedded runtime that's been designed for it. And when you do, there's usually a serialization layer in between.

      • zozbot234 12 days ago
        > It's usually unwise to mix managed and unmanaged memory

        Broadly stated, you can achieve this by marking a managed object as a GC root whenever it's to be referenced by unmanaged code (so that it won't be freed or moved in that case) and adding finalizers whenever managed objects own or hold refcounted references over unmanaged memory (so that the unmanaged code can reason about these objects being freed). But yes, it's a bit fiddly.

      • neonsunset 13 days ago
        Mixing managed and unmanaged code being an issue is simply not true in programming in general.

        It may be an issue in Go or Java, but it just isn't in C# or Swift.

        Calling `write` in C# on Unix is as easy as the following snippet and has almost no overhead:

            var text = "Hello, World!\n"u8;
            Interop.Write(1, text, text.Length);
        
            static unsafe partial class Interop
            {
                [LibraryImport("libc", EntryPoint = "write")]
                public static partial void Write(
                    nint fd, ReadOnlySpan<byte> buffer, nint length);
            }
        
        In addition, unmanaged->managed calls are also rarely an issue, both via function pointers and plain C exports if you build a binary with NativeAOT:

            public static class Exports
            {
                [UnmanagedCallersOnly(EntryPoint = "sum")]
                public static nint Sum(nint a, nint b) => a + b;
            }
        
        It is indeed true that more complex scenarios may require some form of bespoke embedding/hosting of the runtime, but that is more of a peculiarity of Go and Java, not an actual technical limitation.
        • jcranmer 13 days ago
          That's not the direction being talked about here. Try calling the C# method from C or C++ or Rust.

          (I somewhat recently did try setting up mono to be able to do this... it wasn't fun.)

          • int_19h 11 days ago
            It is very easy to call a C# method from C++, since .NET has a COM interop layer. From C++ this will just look like a class with no fields but a bunch of virtual methods. Alternatively, you can easily convert a static method to a native function pointer and then invoke that - this way it's also easy to do from C, Rust, and just about anything else that speaks the C ABI.

            If your C# method doesn't take any arguments like managed strings or arrays that require marshaling, it's also very cheap (and there's unsafe pointers, structs, and fixed arrays that can be used at interop boundary to avoid marshaling even for fairly complicated data structures).

            .NET was very much designed around these kinds of things. It's not a coincidence that its full type system covers everything that you can find in C.

          • neonsunset 13 days ago
            What you may have been looking for is these:

            - https://learn.microsoft.com/en-us/dotnet/core/deploying/nati...

            - https://github.com/dotnet/samples/blob/main/core/nativeaot/N...

            With that said, Mono has been a staple choice for embedding in game-script style scenarios, in particular, because of the ability to directly call its methods inside (provided the caller honors the calling convention correctly), but it has been slowly becoming more of a liability as you are missing out on a lot of performance by not hosting CoreCLR instead.

            For .dll/.so/.dylib's, it is easier and often better to just build a native library with naot instead (the links above, you can also produce statically linkable binaries but it might have issues on e.g. macOS which has...not the most reliable linker that likes to take breaking changes).

            This type of library works in almost every scenario a library implemented in C/C++/Rust with C exports does. For example, here someone implemented a hello-world demonstration of using C# to write an OBS plugin: https://sharovarskyi.com/blog/posts/dotnet-obs-plugin-with-n...

            Using the exports boils down to just this https://github.com/kostya9/DotnetObsPluginWithNativeAOT/blob... and specifying correct build flags.

            • duped 13 days ago
              I haven't been looking for those because I don't work with .NET. Regardless, what you're linking still needs callers and callees to agree on calling convention and special binding annotations across FFI boundaries which isn't particularly interesting from the perspective of language implementation like the promises of Graal or WASM + GC + component model.
              • neonsunset 13 days ago
                There is no free lunch. WASM just means another lowest common denominator abstraction for FFI. I'm also looking forward to WASM getting actually good so .NET could properly target it (because shipping WASM-compiled GC is really, really painful, it works acceptably today, but could be better). Current WasmGC spec is pretty much unusable by any language that has non-primitive GC implementation.

                Just please don't run WASM on the server, we're already getting diminishing generational performance gains in hardware, no need to reduce them further.

                The exports in the examples follow C ABI with respective OS/ISA-specific calling convention.

        • duped 13 days ago
          There are more managed languages than just Go, Java, and C#. Swift (and Objective C with ARC) are a bit different in that they don't use mark and sweep/generational GCs for automatic memory management so it's significantly less of an issue. Compare with Lua, Python, JS, etc where there's a serialization boundary between the two.

          But I stand by what I said. It's generally unwise to mix the two, particularly calling unmanaged code from managed code.

          I wouldn't say it's "not a problem" because there are very few environments where you don't pay some cost for mixing and matching between managed/unmanaged code, and the environments designed around it are built from first principles to support it, like .NET. More interesting to me are Graal and WASM (once GC support lands) which should make it much easier to deal with.

        • pjmlp 13 days ago
          Except that is only true since those attributes were introduced in recent .NET versions, and it doesn't account for COM marshaling issues.

            Plenty of .NET code is still using the old ways and isn't going to be rewritten, either for these attributes, or the new C#/WinRT, or the new Core COM interop, which doesn't support all COM use cases anyway.

          • neonsunset 12 days ago
              Code written for .NET Framework is completely irrelevant to the conversation, since that's not what's being evaluated here.

            You should treat it as dead and move on because it does not impact what .NET can or can’t do.

            There is no point to bring up “No, but 10 years ago it was different”. So what? It’s not 2014 anymore.

            • pjmlp 12 days ago
                My remarks also apply to modern .NET, as those improvements were introduced in .NET 6 and .NET 8 and require a code rewrite to adopt, instead of the old ways which are also still available - something your blind advocacy happened to miss.

              Very few code gets written from scratch unless we are talking about startups.

        • meindnoch 13 days ago
          Swift is not a "managed" (i.e. GC) language.
          • pjmlp 13 days ago
            Reference counting is a GC algorithm in any decent CS book.

            "A Unified Theory of Garbage Collection"

            https://courses.cs.washington.edu/courses/cse590p/05au/p50-b...

            • meindnoch 12 days ago
              I was expecting this pedantic comment... If refcounting makes a language "managed", then C++ with shared_ptr is also "managed".

              _______

              The charitable interpretation is that OP was likely referring to the issues when calling into a language with a relocating GC (because you need to tell the GC not to move objects while you're working with them), which Swift is not.

              • neonsunset 12 days ago
                Swift has just as many concerns for its structs and classes passing across FFI, in terms of marshalling/unmarshalling and ensuring that ARC-unaware code either performs manual retain/release calls or adapts them to whatever other memory management mechanism the callee uses.

                One of the comments here mentions that Swift has its own stable ABI, which exposes richer type system, so it does stand out in terms of interop (.NET 9 will add support for it natively (library evolution ABI) without having to go through C calls or C "glue" code on iOS and macOS, maybe the first non-Swift/ObjC platform to do so?).

                Object pinning in .NET is only a part of the equation and at this point far from the biggest concern (it just works, like it did 15 years ago, maybe it's a matter of fuss elsewhere?).

              • pjmlp 12 days ago
                Nope, because that is a library class without any language support.

                The pedantic comment is synonymous with proper education instead of street urban myths.

                • meindnoch 12 days ago
                  It is a library class, because C++ is a rich enough language to implement automatic refcounting as a library class, by hooking into the appropriate lifecycle methods (copy ctor, dtor).
    • mrits 13 days ago
      I have to use Rust and Swift quite a bit. I basically just landed on sending a byte array of serialized protobufs back and forth with cookie cutter function calls. If this is your full time job I can see how you might think that is lame, but I really got tired of coming back to the code every few weeks and not remembering how to do anything.
    • Voultapher 12 days ago
      If you want a particularly cursed example, I've recently called Go code from Rust via C in the middle, including passing a Rust closure with state into the Go code as a callback into a Go stdlib function, including panic unwinding from inside the Rust closure https://github.com/Voultapher/sort-research-rs/commit/df6c91....
    • neonsunset 13 days ago
      You have to go through C bindings, but FFI is very far from being Go's strongest suit (if we don't count Cgo), so if that's what interests you, it might be better to explore a different language.
  • dhosek 13 days ago
    I just spent a bunch of time on inspect element trying to figure out how the section headings are set at an angle and (at least with Safari tools), I’m stumped. So how did he do this?
    • caperfee 13 days ago
      The style is on the `.post-title` element: `transform: skewY(-2deg) translate(-1rem, -0.4rem);`
    • skgough 13 days ago
      related, I thought the minimap was using the CSS element() function [0], but it turns out it's actually just a copy of the article shrunk down real small.

      [0] https://developer.mozilla.org/en-US/docs/Web/CSS/element

    • aaron_seattle2 13 days ago
      h1, h2, h3, h4, h5, h6 { transform:skewY(-2deg) translate(-1rem,0rem); transform-origin:top; font-style:italic; text-decoration-line:underline; text-decoration-color:goldenrod; text-underline-offset:4%; text-decoration-thickness:.25ex }
  • AceJohnny2 13 days ago
    In contrast: "How Swift Achieved Dynamic Linking Where Rust Couldn't" (2019) [1]

    On the one hand I'm disappointed that Rust still doesn't have a calling convention for Rust-level semantics. On the other hand the above article demonstrates the tremendous amount of work that's required to get there. Apple was deeply motivated to build this as a requirement to make Swift a viable system language that applications could rely on, but Rust does not have that kind of backing.

    [1] https://faultlore.com/blah/swift-abi/

    HN discussion: https://news.ycombinator.com/item?id=21488415

    • fl0ki 13 days ago
      It's only fair to point out that Swift's approach has runtime costs. It would be good to have more supported options for this tradeoff in Rust, including but not limited to https://github.com/rust-lang/rfcs/pull/3470
      • ninkendo 13 days ago
        Notably these runtime costs only occur if you’re calling into another library. For calls within a given swift library, you don’t incur the runtime costs: size checks are elided (since size is known), calls can be inlined, generics are monomorphized… the costs only happen when you’re calling into code that the compiler can’t see.
  • Animats 13 days ago
    Given that the current Rust compiler does aggressive inlining and then optimizes, is this worth the trouble? If the function being called is tiny, it should be inlined. If it's big, you're probably going to spend some time in it and the call overhead is minor.
    • celeritascelery 13 days ago
      Runtime functions (eg dyn Trait) can’t be inlined for one, so this would help there. But also if you can make calls cheaper then you don’t have to be so aggressive with inlining, which can help with code size and compile times.
    • jonstewart 13 days ago
      Probably? A complex function that’s not a good fit for inlining will probably access memory a few times and those accesses are likely to be the bottlenecks for the function. Passing on the stack squeezes that bottleneck tighter — more cache pressure, load/stores, etc. If Rust can pass arguments optimally in a decent ratio of function calls, not only is it avoiding the several clocks of L1 access, it’s hopefully letting the CPU get to those essential memory bottlenecks faster. There are probably several percentage points of win here…? But I am drinking wine and not doing the math, so…
  • repelsteeltje 13 days ago
    Can someone explain the “Diana’s silk dress cost $89” mnemonic on x86 reference?
  • vrotaru 13 days ago
    There was an interesting approach to this in an experimental language some time ago:

        fn f1 (x, y) #-> // Use C calling conventions

        fn f2 (x, y) -> // use fast calling conventions
    
    The first one was mostly for interacting with C code, and the compiler knew how to call each function.
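
    For comparison, Rust already spells this per function, just with a keyword-plus-string rather than a sigil (sketch):

        // Exposed with the platform C convention, callable from C code.
        #[no_mangle]
        pub extern "C" fn add_for_c(x: u32, y: u32) -> u32 { x + y }

        // No annotation means `extern "Rust"`: the compiler's own (unstable) convention.
        fn add_internal(x: u32, y: u32) -> u32 { x + y }
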
    • magicalhippo 13 days ago
      Delphi, and I'm sure others, have had[1] this for ages:

      When you declare a procedure or function, you can specify a calling convention using one of the directives register, pascal, cdecl, stdcall, safecall, and winapi.

      As in your example, cdecl is for calling C code, while stdcall/winapi are used on Windows for calling Windows APIs.

      [1]: https://docwiki.embarcadero.com/RADStudio/Sydney/en/Procedur...

      • pjmlp 13 days ago
        Since the Turbo Pascal days actually.
        • magicalhippo 12 days ago
          I was pretty sure it had it, I just couldn't find an online reference.
          • pjmlp 12 days ago
            Turbo Pascal for Windows v1.5, on Windows 3.1, the transition step before Delphi came to be.
      • JonChesterfield 12 days ago
        C for example does this, albeit in compiler extensions, and with a longer tag than #.
    • IshKebab 13 days ago
      Terrible taste. Why would you hide such an infrequently used feature behind a single character? In this case you should absolutely use a keyword.
    • dgellow 13 days ago
      Is it similar to Zig’s callconv keyword?
      • vrotaru 13 days ago
        Guess so. Unfamiliar with Zig. The point is that it's not an "all or nothing" strategy for a compilation unit.

        Debugger writers may not be happy, but maybe lldb supports all conventions supported by llvm.

        • int_19h 11 days ago
          Debugger writers have dealt with different calling conventions for decades. The notion predates C even. They can handle it just fine.
  • zamalek 13 days ago
    > Debuggers

    Simply throw it in as a Cargo.toml flag and sidestep the worry. Yes, you do sometimes have to debug release code - but there you can use the not-quite-perfect debugging that the author mentions.

    Also, why aren't we size-sorting fields already? That seems like an easy optimization, and can be turned off with a repr.

    • fl0ki 12 days ago
      > Also, why aren't we size-sorting fields already?

      We are for struct/enum fields. https://camlorn.net/posts/April%202017/rust-struct-field-reo...

      There's even an unstable flag to help catch incorrect assumptions about struct layout. https://github.com/rust-lang/compiler-team/issues/457

      • zamalek 12 days ago
        Oh wow, awesome! I was somewhat considering this being my first contribution - glad someone already tackled it!
        • fl0ki 12 days ago
          If I can suggest, the next big breakthrough in this space would be generalizing niche filling optimization. Every thread about this seems to fizzle out, to the point that I couldn't even find which one is the latest any more.

          Today most data-carrying enums decay into the lowest common denominator of a 1-byte discriminant padded by 7 more bytes before any variant's data payload can begin. This really adds up when enums are nested, not just blowing out the calling register set but also making each data structure a swiss cheese of padding bytes.

          Even a few more improvements in that space would have enormous impact, compounding with other optimizations like better calling conventions.
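
          A small illustration of the decay meant here (the sizes are what the current compiler produces on a typical 64-bit target, not something the language guarantees):

              use std::mem::size_of;

              #[allow(dead_code)]
              enum Inner { A(u64), B(u64) }     // 1-byte tag + 7 padding + 8-byte payload
              #[allow(dead_code)]
              enum Outer { X(Inner), Y(Inner) } // another tag + 7 more padding bytes on top

              fn main() {
                  assert_eq!(size_of::<Inner>(), 16);
                  assert_eq!(size_of::<Outer>(), 24); // no niche reuse between the two tags today
              }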

    • JonChesterfield 12 days ago
      Did you want alignment sorting? In general the problem with things like that is the ideal layout is usually architecture and application specific - if my struct has padding in it to push elements onto different cache lines, I don't want the struct reordered.
      • zamalek 12 days ago
        > Did you want alignment sorting?

        Yep. It will probably improve (to be measured) the 80%. Less memory means less bandwidth usage etc.

        > if my struct has padding in it to push elements onto different cache lines, I don't want the struct reordered.

        I did suggest having a repr for situations like yours. Something like #[repr(yeet)]. Optimizing for false sharing etc. is probably well within 5% of code that exists today, and is usually wrapped up in a library that presents a specific data structure.

  • sheepscreek 13 days ago
    Very interesting but pretty quickly went over my head. I have a question that is slightly related to SIMD and LLVM.

    Can someone explain simply where does MLIR fit into all of this? Does it standardize more advanced operations across programming languages - such as linear algebra and convolutions?

    Side-note: Mojo has been designed by the creator of LLVM and MLIR to prioritize and optimize vector hardware use, as a language that is similar to Python (and somewhat syntax compatible).

    • fpgamlirfanboy 13 days ago
      > Side-note: Mojo has been designed by the creator of LLVM and MLIR to prioritize and optimize vector hardware use, as a language that is similar to Python (and somewhat syntax compatible).

      Are people getting paid to repeat this ad nauseum?

    • jadodev 13 days ago
      MLIR includes a "linalg" dialect that contains common operations. You can see those here: https://mlir.llvm.org/docs/Dialects/Linalg/

      This post is rather unrelated. The linalg dialect can be lowered to LLVM IR, SPIR-V, or you could write your own pass to lower it to e.g. your custom chip.

    • jcranmer 13 days ago
      > Can someone explain simply where does MLIR fit into all of this?

      It doesn't.

      MLIR is a design for a family of intermediate languages (called 'dialects') that allow you to progressively lower high-level languages into low-level code.

      • fl0ki 12 days ago
        The ML media cycle is so unhinged that I've seen people simply assume out of hand that MLIR stands for Machine Learning Intermediate Representation.
  • m463 13 days ago
    interesting website - the title text is slanted.

    Sometimes people who dig deep into the technical details end up being creative with those details.

    • eviks 13 days ago
      True, creative, but usually in a quality degrading way like here (slanted text is harder to read, also due to the underline being too thick, and takes more space) or like with those poorly legible bg/fg color combinations
  • jayachandranpm 12 days ago
    good
  • retox 13 days ago
    Meta: the minimap is quite interesting, it's 'just' a smaller copy of all the content.
    • edflsafoiewq 13 days ago
      Clever! Should probably have aria-hidden though.