The Performance Impact of C++'s `final` Keyword

(16bpp.net)

247 points | by hasheddan 11 days ago

151 comments

  • mgaunard 11 days ago
    What final enables is devirtualization in certain cases. The main advantage of devirtualization is that it is necessary for inlining.

    Inlining has other requirements as well -- LTO pretty much covers it.

    The article doesn't have sufficient data to tell whether the testcase is built in such a way that any of these optimizations can happen or is beneficial.

    • chipdart 10 days ago
      > What final enables is devirtualization in certain cases. The main advantage of devirtualization is that it is necessary for inlining.

      I think that enabling inlining is just one of the indirect consequences of devirtualization, and perhaps one that is largely irrelevant for performance improvements.

      The whole point of devirtualization is eliminating the need to resort to pointer dereferencing when calling virtual members. The main trait of a virtual class is its use of a vtable, which requires a dereference to reach each and every virtual member.

      In classes with larger inheritance chains, you can easily have more than one pointer dereference taking place before you call a virtual member function.

      Once a class is final, none of that is required anymore. When a member function is called, no vtable dereferencing takes place.

      Devirtualization helps performance because you are able to benefit from inheritance without paying a performance penalty for it. Without the final keyword, a performance-oriented project would need to be architected to not use inheritance at all, or at the very least not in hot-path code, because it sneaks gratuitous pointer dereferences all over the place, which cost extra operations and hurt caching.

      The whole purpose of the final keyword is that compilers can easily eliminate all the pointer dereferencing used by virtual members. What normally stops them from applying this optimization is that they have no way of knowing whether the class will be inherited from and one of its virtual members overridden.

      With the introduction of the final keyword, you are now able to tell the compiler "from here on, this is exactly what you get", and the compiler can trim out anything loose.
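
      A minimal sketch of the idea (hypothetical types, not taken from the article's benchmark):

          #include <cstdio>

          struct Widget {
            virtual ~Widget() = default;
            virtual int cost() const { return 1; }
          };

          // final: the compiler knows no class can derive from Button and
          // override cost() any further.
          struct Button final : Widget {
            int cost() const override { return 2; }
          };

          int total(const Button& b) {
            // Static type is a final class, so the compiler is free to skip the
            // vtable load and call (or inline) Button::cost() directly.
            return b.cost();
          }

          int main() { std::printf("%d\n", total(Button{})); }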

      • simonask 10 days ago
        An extra indirection (indirect call versus direct call) is practically nothing on modern hardware. Branch predictors are insanely good, and this isn't something you generally have to worry about.

        Inlining is by far the most impactful optimization here, because it can eliminate the call altogether, and thus specialize the called function to the callsite, lifting constants, hoisting loop variables, etc.
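
        A tiny illustration of that specialization effect (hypothetical functions):

            // Once scale() is inlined into price(), the constant factor becomes
            // visible to the optimizer: the multiply typically folds to a shift
            // and gets combined with the surrounding arithmetic.
            static int scale(int x, int factor) { return x * factor; }

            int price(int x) {
              return scale(x, 8) + 1;  // after inlining: roughly (x << 3) + 1
            }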

        • ot1138 10 days ago
          I had a section of code which incurred ~20 clock cycles to make a function call to a virtual function in a critical loop. That's over and above potential delays resulting from cache misses and the need to place multiple parameters on the stack.

          I was going to eliminate polymorphism altogether for this object but later figured out how to refactor so that this particular call could be called once a millisecond. Then if more work was needed, it would dispatch a task to a dedicated CPU.

          This was an incredible performance improvement which made a significant difference to my P&L.

          • mgaunard 10 days ago
            Could just be inefficient spilling caused by ABI requirements due to the inability to inline.

            In general, if you're manipulating values that fit into registers and work on a platform with a shitty ABI, you need to be very careful of what your function call boundaries look like.

            The most obvious example is SIMD programming on Windows x86 32-bit.

        • silvestrov 10 days ago
          "is practically nothing on modern hardware" if the data is already present in the L2 cache. Random RAM access that stalls execution is expensive.

          My guess is this is why he didn't see any speedup: all the code could fit inside the L2 cache, so he did not have to pay for RAM access for the dereference.

          The number of different classes is what matters, not the number of objects, as objects of the same class share the same small number of vtable pointers.

          It might be different for large codebases like Chrome and Firefox.

          • simonask 9 days ago
            Both the number of objects (dcache) and the number of classes (icache) are significant, as well as the size of both, but yeah. It's pretty rare to have extremely wide class hierarchies, though. You really have to go out of your way to run into significant icache misses.
          • dblohm7 10 days ago
            Firefox has done a lot of work on devirtualization over the years. There is a cost.
        • rurban 9 days ago
          C++ vtables need 2 levels of indirection. See the asm or decompile it with Ghidra: first the vtable pointer is loaded, then the method pointer.

          Of course you have to worry about pointer chasing, when you can easily avoid it. Either via a switch to a single indirection (by passing method pointers around) or inlining with final. Or other compile-time specialization.
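
          Roughly what a single virtual call costs, written out by hand (a sketch, not actual compiler output):

              struct Animal {
                virtual ~Animal() = default;
                virtual void speak() const = 0;
              };

              void call_it(const Animal& a) {
                a.speak();
                // Conceptually the generated code does two dependent loads
                // plus an indirect call:
                //   vtable = load(&a)          ; fetch the object's vtable pointer
                //   fn     = load(vtable+slot) ; fetch the slot for speak()
                //   call fn(&a)                ; indirect call
              }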

          • phire 9 days ago
            Though the branch predictor can chew through both layers of indirection. It can actually start fetching code from the function (and even executing it) before it even reads the function pointer from the vtable.

            That assumes a correct prediction, of course. But modern branch predictors are really good: they can track and correctly predict hundreds (if not thousands) of indirect calls, taking into account the history of the last few branches (so they can even get an idea of which class is currently being executed and make predictions based on that). They do a really good job of chewing through indirect branches in hot sequences of code.

            Virtual functions are probably the most harmful for warm code. We are talking about code that's executed too often to be considered cold code, but not often enough to stick around in the branch predictors' cache, executed only a few hundred times a second. It's a death by a thousand cuts type thing. And that's where devirtualisation will help the most...

            As long as you don't go too far with the inlining and start causing icache misses with code bloat. In an ideal world the compiler would inline enough to devirtualise the call, but not necessarily inline the actual functions (unless they are small, or only called from one place).

          • simonask 9 days ago
            Show me the receipts. :-)

            In general it takes a significant amount of nondeterministic pointer chasing to fool modern branch predictors. Decades of research have been put into optimizing the hardware for languages like C++ and Java, both of which exhibit a lot of pointer chasing.

        • pixelpoet 10 days ago
          Vfuncs are only fast when they can be predicted: https://forwardscattering.org/post/28
          • mgaunard 10 days ago
            Same as any other branch. They're fast if predicted correctly and slow if not.

            If they cannot be predicted, write your code accordingly.

      • oasisaimlessly 10 days ago
        > In classes with larger inheritance chains, you can easily have more than one pointer dereferencing taking place before you call a virtual members function.

        This is not a thing in C++; vtables are flat, not nested. Function pointers are always 1 dereference away.

      • account42 10 days ago
        > Devirtualization helps performance because you are able to benefit from inheritance and not have to pay a performance penalty for that. Without the final keyword, a performance oriented project would need to be architected to not use inheritance at all, or in the very least in code in the hot path, because that sneaks gratuitous pointer dereferences all over the place, which require running extra operations and has a negative impact on caching.

        virtual inheritance. Regular old inheritance does not need or benefit from devirtualization. This is why the CRTP exists.

        • scaredginger 10 days ago
          Maybe a nitpick, but virtual inheritance is a term used for something else entirely.

          What you're talking about is dynamic dispatch

        • chipdart 10 days ago
          > This is why the CRTP exists.

          CRTP does not exist for that. CRTP was one of the many happy accidents in template metaprogramming that happened to be discovered when doing recursive templates.

          Also, you've missed the point. CRTP is a way to rearchitect your code to avoid dereferencing pointers to virtual members in inheritance. The whole point of final is that you do not need to pull tricks: just tell the compiler that you don't want the class to be inherited from, and the compiler picks up from there and does everything for you.
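
          For reference, a minimal CRTP sketch (hypothetical types); the dispatch is resolved at compile time, so there is no vtable to devirtualize in the first place:

              template <typename Derived>
              struct Shape {
                // Statically forwards to the derived implementation; no vtable.
                double area() const {
                  return static_cast<const Derived*>(this)->area_impl();
                }
              };

              struct Circle : Shape<Circle> {
                double r = 1.0;
                double area_impl() const { return 3.14159 * r * r; }
              };

              double use(const Circle& c) { return c.area(); }  // direct, inlinable call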

          • account42 10 days ago
            If that's your point then it is simply wrong. Final does not allow the compiler to devirtualize calls through a base pointer; it only eliminates the virtualness for calls through pointers to the (final) derived type. The compiler can devirtualize calls through base pointers in other ways (by deducing the possible derived types via whole program optimization or PGO), but final does not help with that.
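
            A sketch of that distinction (hypothetical types):

                struct Base {
                  virtual ~Base() = default;
                  virtual int f() const { return 0; }
                };

                struct Derived final : Base {
                  int f() const override { return 1; }
                };

                // Static type is the final class: the compiler may call
                // Derived::f() directly (and inline it).
                int via_derived(const Derived& d) { return d.f(); }

                // Still a virtual call: final on Derived says nothing about
                // what else might derive from Base.
                int via_base(const Base& b) { return b.f(); }
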
            • chipdart 10 days ago
              > If that's your point then it is simply wrong. Final does not allow the compiler to devirtualize calls through a base pointer, it only eliminates the virtualness for calls through pointers to the (final) derived type.

              Please read my post. That's not my claim. I think I was very clear.

    • Negitivefrags 11 days ago
      See this is why I find this odd.

      Is there a theory as to how devirtualisation could hurt performance?

      • phire 11 days ago
        Jumps/calls are actually pretty cheap with modern branch predictors. Even indirect calls through vtables, which is the opposite of most programmers' intuition.

        And if the devirtualisation leads to inlining, that results in code bloat, which can lower performance through more instruction cache misses, which are not cheap.

        Inlining is actually pretty evil. It almost always speeds things up for microbenchmarks, as such benchmarks easily fit in icache. So programmers and modern compilers often go out of their way to do more inlining. But when you apply too much inlining to a whole program, things start to slow down.

        But it's not like inlining is universally bad in larger programs; it can enable further optimisations, mostly because it allows constant propagation to travel across function boundaries.

        Basically, compilers need better heuristics about when they should be inlining. If it's just saving the overhead of a lightweight call, then they shouldn't be inlining.

        • qsdf38100 11 days ago
          "Inlining is actually pretty evil".

          No it's not. Except if you __force_inline__ everything, of course.

          Inlining reduces the number of instructions in a lot of cases, especially when things are abstracted and factored with lots of indirections into small functions that call other small functions and so on. Consider an 'isEmpty' function, which dissolves to 1 cpu instruction once inlined, compared with a call/save reg/compare/return (sketched at the end of this comment). Highly dynamic code (with most functions being virtual) tends to result in a fest of chained calls, jumping into functions doing very little work. Yes the stack is usually hot and fast, but spending 80% of the instructions doing stack management is still a big waste.

          Compilers already have good heuristics about when they should be inlining; chances are they are a lot better at it than you. They don't always inline, and that's not possible anyway.

          My experience is that compilers do marvels with inlining decisions when there are lots of small functions they _can_ inline if they want to. It gives the compiler a lot of freedom. Lambdas are great for that as well.

          Make sure you make as much compile-time information as possible available to the compiler, factor your code, don't have huge functions, and let the compiler do its magic. As a plus, you can have high level abstractions, deep hierarchies, and still get excellent performance.
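
          To make the isEmpty point above concrete, a minimal sketch (hypothetical type):

              #include <cstddef>

              struct SmallVec {
                std::size_t size_ = 0;
                bool isEmpty() const { return size_ == 0; }
              };

              bool shouldSkip(const SmallVec& v) {
                // Inlined: roughly a single compare of size_ against zero
                // feeding the branch. Not inlined: argument setup, call,
                // prologue/epilogue, return, then the compare.
                return v.isEmpty();
              }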

          • grdbjydcv 11 days ago
            The “evilness” is just that sometimes if you inline aggressively in a microbenchmark things get faster but in real programs things get slower.

            As you say: “chances are they are a lot better at it than you”. Infrequently they are not.

          • EasyMark 10 days ago
            doesn't the compiler usually do well enough that you really only need to worry about time critical sections of code? Even then you could go in and look at the assembler and see if it's being inlined, no?
            • somenameforme 10 days ago
              I find the Unreal Engine source to be a reasonable reference for C++ discussions, because it runs just unbelievably well for what it does, and on a huge array of hardware (and software). And it's explicit with inlining, other hints, and even a million things that could be easily called micro-optimizations, to a somewhat absurd degree. So I'd take away two conclusions from this.

              The first is that when building a code base you don't necessarily know what it's being compiled with. And so even if there were a super-amazing compiler, there's no guarantee that's what will be compiling your code. Making it explicit, so long as you have a reasonably good idea of what you're doing, is generally just a good idea. It also conveys intent to some degree, especially things like final.

              The second is that I think the saying 'premature optimization is the root of all evil' is the root of all evil. Because that mindset has gradually transitioned to being against optimization in general, outside of the most primitive things like not running critical sections in O(N^2) when they could be O(N). And I think it's this mindset that has gradually brought us to where we are today, where we need what would have been a literal supercomputer not that long ago to run a word processor. It's like death by a thousand cuts, and quite ridiculous.

              • moring 9 days ago
                > The second is that I think the saying 'premature optimization is the root of all evil' is the root of all evil.

                The greater evil is putting a one-sentence quote out of context:

                """ There is no doubt that the grail of efficiency leads to abuse. Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.

                Yet we should not pass up our opportunities in that critical 3%. A good programmer will not be lulled into complacency by such reasoning, he will be wise to look carefully at the critical code; but only after that code has been identified. It is often a mistake to make a priori judgments about what parts of a program are really critical, since the universal experience of programmers who have been using measurement tools has been that their intuitive guesses fail. After working with such tools for seven years, I've become convinced that all compilers written from now on should be designed to provide all programmers with feedback indicating what parts of their programs are costing the most; indeed, this feedback should be supplied automatically unless it has been specifically turned off. """

                • somenameforme 9 days ago
                  Indeed, but I think even that advice, with context, is pretty debatable. Obviously one should prioritize critical sections, but completely ignoring those "small efficiencies" is certainly a big part of how we got to where we are today in software performance. A 10% jump in performance is huge; whether that comes from a single 10% jump, or a hundred 0.1% jumps - it's exactly the same!

                  So referencing something in particular from Unreal Engine, they actually created a caching system for converting between a quaternion and a rotator (euler rotation)! Obviously that sort of conversion isn't going to, in a million years, be even close to a bottleneck. That conversion is quite cheap on modern hardware, and so that caching system probably only gives the engine one of those 0.1% boosts in performance. But there are literally thousands of these "small efficiencies" spread all throughout the code. And it yields a final product that runs dramatically better than comparable engines.

                  • mgaunard 8 days ago
                    That's not how compounding percentages works.
            • usefulcat 10 days ago
              I find that gcc and clang are so aggressive about inlining that it's usually more effective to tell them what not to inline.

              In a moderately-sized codebase I regularly work on, I use __attribute__((noinline)) nearly ten times as often as __attribute__((always_inline)). And I use __attribute__((cold)) even more than noinline.

              So yeah, I can kind of see why someone would say inlining is 'evil', though I think it's more accurate to say that it's just not possible for compilers to figure out these kinds of details without copious hints (like PGO).
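
              For anyone unfamiliar with those hints, a rough GCC/Clang-style sketch (hypothetical functions):

                  #include <cstdio>

                  // Rarely-taken error path: keep it out of line and out of the
                  // hot instruction stream.
                  __attribute__((cold, noinline))
                  static void report_failure(int code) {
                    std::fprintf(stderr, "lookup failed: %d\n", code);
                  }

                  // Tiny hot-path helper where call overhead would dominate.
                  __attribute__((always_inline))
                  static inline int clamp_index(int i, int n) {
                    return i < 0 ? 0 : (i >= n ? n - 1 : i);
                  }

                  int lookup(const int* table, int n, int i) {
                    if (n <= 0) {  // improbable guard path
                      report_failure(-1);
                      return 0;
                    }
                    return table[clamp_index(i, n)];
                  }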

              • jandrewrogers 10 days ago
                +1 on the __attribute__((cold)). Compilers so aggressively optimize based on their heuristics that you spend more time telling them that an apparent optimization opportunity is not actually an optimization.

                When writing ultra-robust code that has to survive every vaguely plausible contingency in a graceful way, the code is littered with code paths that only exist for astronomically improbable situations. The branch predictor can figure this out but the compiler frequently cannot without explicit instructions to not pollute the i-cache.

        • a_e_k 10 days ago
          Another for the pro side: inlining can allow for better branch prediction if the different call sites would tend to drive different code paths in the function.
          • phire 10 days ago
            This was true 15 years ago, but not so much today.

            The branch predictors actually hash the history of the last few branches taken into the branch prediction query. So the exact same branch within a child function will map to different branch predictor entries depending on which parent function it was called from, and there is no benefit to inlining.

            It also means the branch predictor can learn correlations between branches within a function, like when branches at the top and bottom of a function share conditions, or have inverted conditions.

      • variadix 11 days ago
        It basically never should unless the inliner made a terrible judgement. Devirtualizing in C++ can remove 3 levels of pointer chasing, all of which could be cache misses. Many optimizations in modern compilers require the context of the function to be inlined to make major optimizations, which requires devirtualization. The only downside is I$ pressure, but this is generally not a problem because hot loops are usually tight.
      • hansvm 11 days ago
        There's a cost to loading more instructions, especially if you have more types of instructions.

        The main advantages to inlining are (1) avoiding a jump and other function call overhead, (2) the ability to push down optimizations.

        If you execute the "same" code (same instructions, different location) in many places that can cause cache evictions and other slowdowns. It's worse if some minor optimizations were applied by the inlining, so you have more types of instructions to unpack.

        The question, roughly, is whether the gains exceed the costs. This can be a bit hard to determine because it can depend on the size of the whole program and other non-local parameters, leading to performance cliffs at various stages of complexity. Microbenchmarks will tend to suggest inlining is better in more cases than it actually is.

        Over time you get a feel for which functions should be inlined. E.g., very often you'll have guard clauses or whatnot around a trivial amount of work when the caller is expected to be able to prove the guarded information at compile-time. A function call takes space in the generated assembly too, and if you're only guarding a few instructions it's usually worth forcing an inline (even in places where the compiler's heuristics would choose not to because the guard clauses take up too much space), regardless of the potential cache costs.
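
        A sketch of that guard-clause case (hypothetical helper; GCC/Clang attribute syntax):

            #include <cstddef>
            #include <cstdint>

            // Guards around a trivial body: a caller that can already prove
            // (p != nullptr && len >= 4) loses both branches after inlining
            // and keeps only the 4-byte load.
            __attribute__((always_inline))
            static inline std::uint32_t read_u32(const std::uint8_t* p, std::size_t len) {
              if (p == nullptr || len < 4) return 0;
              return static_cast<std::uint32_t>(p[0])
                   | static_cast<std::uint32_t>(p[1]) << 8
                   | static_cast<std::uint32_t>(p[2]) << 16
                   | static_cast<std::uint32_t>(p[3]) << 24;
            }

            std::uint32_t header_magic(const std::uint8_t (&buf)[64]) {
              return read_u32(buf, sizeof buf);  // guards fold away once inlined
            }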

      • cogman10 11 days ago
        Through inlining.

        If you have something like a `while` loop and that loop's instructions fit neatly in the cache line, then executing that loop can be quite fast even if you have to jump to different code locations to do the internals. However, if you pump more instructions into that loop you can exceed the length of the cache line, which means you need more memory loads to do the same work.

        It can also create more code. A method that takes a `foo(NotFinal& bar)` could be duplicated by the compiler for the specialized cases, which would be bad if there are a lot of implementations of `NotFinal` that end up being marshalled into foo. You could end up loading multiple implementations of the same function, which may be slower than just keeping the virtual dispatch tables warm.

      • neonsunset 11 days ago
        Practically - it never does. It is always cheaper to perform a direct, possibly inlined, call (devirtualization != inlining) than a virtual one.

        Guarded devirtualization is also cheaper than virtual calls, even when it has to do

            if (instance is SpecificType st) { st.Call() }
            else { instance.Call() } 
        
        or even chain multiple checks at once (with either regular ifs or emitting a jump table)

        This technique is heavily used in various forms by .NET, JVM and JavaScript JIT implementations (other platforms also do that, but these are the major ones)

        The first two devirtualize virtual and interface calls (important in Java because all calls default to virtual, important in C# because people like to abuse interfaces and occasionally inheritance; C# delegates are also devirtualized/inlined now). The JS JIT (like V8) performs "inline caching", which is similar: for known object shapes, property access becomes a shape identifier comparison plus a direct property read instead of a keyed lookup, which is way more expensive.

        • ynik 10 days ago
          Caution! If you compare across languages like that, not all virtual calls are implemented equally. A C++ virtual call is just a load from a fixed offset in the vtbl followed by an indirect call. This is fairly cheap, on modern CPUs pretty much the same as a non-virtual non-inlined call. A Java/C# interface call involves a lot more stuff, because there's no single fixed vtbl offset that's valid for all classes implementing the interface.
          • neonsunset 10 days ago
            Yes, it is true that there is a difference. I'm not sure about JVM implementation details, but the reason the comment says "virtual and interface" calls is to outline it. Virtual calls in .NET are sufficiently close[0] to virtual calls in C++. Interface calls, however, are coded differently[1].

            Also you are correct - virtual calls are not terribly expensive, but they encroach on ever limited* CPU resources like indirect jump and load predictors and, as noted in parent comments, block inlining, which is highly undesirable.

            [0] https://github.com/dotnet/runtime/blob/5111fdc0dc464f01647d6...

            [1] https://github.com/dotnet/runtime/blob/main/docs/design/core... (mind you, the text was initially written 18 years ago, wow)

            * through great effort of our industry to take back whatever performance wins each generation brings with even more abstractions that fail to improve our productivity

      • samus 11 days ago
        Devirtualization maybe not necessarily, but inlining might make code fail to fit into instruction caches.
      • bandrami 10 days ago
        If it's done badly, the same code that runs N times also gets cached N times because it's in N different locations in memory rather than one location that gets jumped to. Modern compilers and schedulers will eliminate a lot of that (but probably not for anything much smaller than a page), but in general there's always a tradeoff.
      • masklinn 11 days ago
        Code bloat causing icache evictions?
    • i80and 11 days ago
      If you already have LTO, can't the compiler determine this information for devirtualization purposes on its own?
      • ot 11 days ago
        In general the compiler/linker cannot assume that derived classes won't arrive later through a shared object.

        You can tell it "I won't do that" though with additional flags, like Clang's -fwhole-program-vtables, and even then it's not that simple. There was an effort in Clang to better support whole program devirtualization, but I haven't been following what kind of progress has been made: https://groups.google.com/g/llvm-dev/c/6LfIiAo9g68?pli=1

        • Slix 10 days ago
          This optimization option isn't on by default? That sounds like a lot of missed optimization. Most programs aren't going to be loading from shared libraries.

          Maybe I can set this option at work. Though it's scary because I'd have to be certain.

          • Thiez 10 days ago
            The JVM can actually perform this optimization optimistically and can undo it if the assumption is violated at runtime. So Java's 'everything is virtual by default' approach doesn't hurt. Of course, relying on a sufficiently smart JIT comes with its own trade-offs.
          • JonChesterfield 10 days ago
            Optimization means "make it faster without changing behaviour in ways I don't like". Clang can't generally default that one to on because it doesn't know whether you're going to splice in more code it can't see at runtime.

            Lots of code gets slower if it might need to be called from something not currently in the compiler's scope. That's essentially what ABI overhead is. If there isn't already, there should be a compiler flag that says "this is the whole program, have at it" which implies the vtables option.

          • soooosn 10 days ago
            I think you have answered your own question: If turning on the setting is scary for you in a very localized project at your company, imagine how scary it would be to turn on by default for everybody :-P
      • wiml 11 days ago
        If your runtime environment has dynamic linking, then the LTO pass can't always be sure that a subclass won't be introduced later that overrides the method.
        • gpderetta 11 days ago
          You can tell the compiler it is indeed compiling the whole program.
        • i80and 11 days ago
          Aha! That makes sense. I wasn't thinking of that case. Thanks!
      • samus 11 days ago
        This is one of the cases where JIT compiling can shine. You can use a bazillion interfaces to decouple application code, and the JIT will optimize the calls after it finds out which implementation is used. This works as long as there are only one or two of them actually active at runtime.
        • account42 10 days ago
          You don't need a JIT to do whole program optimization.
          • samus 10 days ago
            AOT whole program optimization has two limits:

            * It is possible with `dlopen()` to load code objects that violate the assumptions made during compilation.

            * The presence of runtime configuration mechanisms and application input can make it impossible to anticipate things like the choice of implementations of an interface.

            One can always strive to reduce such situations, but it might simply not be necessary if a JIT is present.

      • nickwanninger 11 days ago
        At the level that LLVM's LTO operates, no information about classes or objects is left, so LLVM itself can't really devirtualize C++ methods in most cases
        • nwallin 11 days ago
          You appear to be correct. Clang does not devirtualize in LTO, but GCC does. Personally I consider this very strange.

               $ cat animal.h cat.cpp main.cpp
              // animal.h
              
              #pragma once
              
              class animal {
               public:
                virtual ~animal() {}
                virtual void speak() = 0;
              };
              
              animal& get_mystery_animal();
              // cat.cpp
              
              #include "animal.h"
              #include <cstdio>
              
              class cat final : public animal {
              public:
                ~cat() override{}
                void speak() override{
                  puts("meow");
                }
              };
              
              static cat garfield{};
              
              animal& get_mystery_animal() {
                return garfield;
              }
              // main.cpp
              
              #include "animal.h"
              
              int main() {
                animal& a = get_mystery_animal();
                a.speak();
              }
               $ make clean && CXX=clang++ make -j && objdump --disassemble=main -C lto_test
              rm -f *.o lto_test
              clang++ -c -flto -O3 -g cat.cpp -o cat.o
              clang++ -c -flto -O3 -g main.cpp -o main.o
              clang++ -flto -O3 -g cat.o main.o -o lto_test
              
              lto_test:     file format elf64-x86-64
              
              
              Disassembly of section .init:
              
              Disassembly of section .plt:
              
              Disassembly of section .plt.got:
              
              Disassembly of section .text:
              
              00000000000011b0 <main>:
                  11b0: 50                    push   %rax
                  11b1: 48 8b 05 58 2e 00 00  mov    0x2e58(%rip),%rax        # 4010 <garfield>
                  11b8: 48 8d 3d 51 2e 00 00  lea    0x2e51(%rip),%rdi        # 4010 <garfield>
                  11bf: ff 50 10              call   *0x10(%rax)
                  11c2: 31 c0                 xor    %eax,%eax
                  11c4: 59                    pop    %rcx
                  11c5: c3                    ret
              
              Disassembly of section .fini:
               $ make clean && CXX=g++ make -j && objdump --disassemble=main -C lto_test|sed -e 's,^,    ,'
              rm -f *.o lto_test
              g++ -c -flto -O3 -g cat.cpp -o cat.o
              g++ -c -flto -O3 -g main.cpp -o main.o
              g++ -flto -O3 -g cat.o main.o -o lto_test
              
              lto_test:     file format elf64-x86-64
              
              
              Disassembly of section .init:
              
              Disassembly of section .plt:
              
              Disassembly of section .plt.got:
              
              Disassembly of section .text:
              
              0000000000001090 <main>:
                  1090: 48 83 ec 08           sub    $0x8,%rsp
                  1094: 48 8d 3d 75 2f 00 00  lea    0x2f75(%rip),%rdi        # 4010 <garfield>
                  109b: e8 50 01 00 00        call   11f0 <cat::speak()>
                  10a0: 31 c0                 xor    %eax,%eax
                  10a2: 48 83 c4 08           add    $0x8,%rsp
                  10a6: c3                    ret
              
              Disassembly of section .fini:
          • JonChesterfield 10 days ago
            I think this is a bug. There's dedicated metadata that's supposed to end up on the indirect call to list the possible targets and when that list of possible targets is this short it should be turning into a switch over concrete targets. Don't have time to dig into the IR now but it might be worth posting to the github llvm issues.
          • ranger_danger 10 days ago
            What if you add -fwhole-program-vtables on clang?
      • adzm 11 days ago
        MSVC with LTO and PGO will inline virtual calls in some situations along with a check for the expected vtable, bypassing the inlined code and calling the virtual function normally if it is an unexpected value.
      • bluGill 11 days ago
        Not if there is a shared library or other plugin. Then you cannot determine until runtime whether there is an override.
    • bdjsiqoocwk 10 days ago
      What's devirtualization in C++?

      Funny how things work. From working with Julia I've built a good intuition for guessing when functions would be inlined. And yet, I've never heard the word devirtualization until now.

      • saagarjha 10 days ago
        In C++ virtual functions are polymorphic and indirected, with the target not known to the compiler. Devirtualization gives the compiler this information (in this case a final method cannot be overridden and branch to something else).
  • tombert 11 days ago
    I don't do much C++, but I have definitely found that engineers will just assert that something is "faster" without any evidence to back that up.

    Quick example, I got in an argument with someone a few years ago that claimed in C# that a `switch` was better than an `if(x==1) elseif(x==2)...` because switch was "faster" and rejected my PR. I mentioned that that doesn't appear to be true, we went back and forth until I did a compile-then-decompile of a minimal test with equality-based-ifs, and showed that the compiler actually converts equality-based-ifs to `switch` behind the scenes. The guy accepted my PR after that.

    But there's tons of stuff like this in CS, and I kind of blame professors for a lot of it [1]. A large part of becoming a decent engineer [2] for me was learning to stop trusting what professors taught me in college. Most of what they said was fine, but you can't assume that; what they tell you could be out of date, or simply never correct to begin with, and as far as I can tell you have to always test these things.

    It doesn't help that a lot of these "it's faster" arguments are often reductive because they are only faster in extremely minimal tests. Sometimes a microbenchmark will show that something is faster, and there's value in that, but it's important to remember that it can also be a small percentage of the total program; compilers are obscenely good at optimizing nowadays, it can be difficult to determine when something will be optimized, and your assertion that something is "faster" might not actually be true in a non-trivial program.

    This is why I don't really like doing any kind of major optimizations before the program actually works. I try to keep the program in a reasonable Big-O and I try and minimize network calls cuz of latency, but I don't bother with any kind of micro-optimizations in the first draft. I don't mess with bitwise, I don't concern myself on which version of a particular data structure is a millisecond faster, I don't focus too much on whether I can get away with a smaller sized float, etc. Once I know that the program is correct, then I benchmark to see if any kind of micro-optimizations will actually matter, and often they really don't.

    [1] That includes me up to about a year ago.

    [2] At least I like to pretend I am.

    • jandrewrogers 11 days ago
      A significant part of it is that what engineers believe was effectively true at one time. They simply haven't revisited those beliefs or verified their relevance in a long time. It isn't a terrible heuristic for life in general to assume that what worked ten years ago will work today. The rate at which the equilibriums shift due to changes in hardware and software environments when designing for system performance is so rapid that you need to make a continuous habit of checking that your understanding of how the world works maps to reality.

      I've solved a lot of arguments with godbolt and simple performance tests. Some topics are recurring themes among software engineers e.g.:

      - compilers are almost always better at micro-optimizations than you are

      - disk I/O is almost never a bottleneck in competent designs

      - brute-force sequential scans are often optimal algorithms

      - memory is best treated as a block device

      - vectorization can offer large performance gains

      - etc...

      No one is immune to this. I am sometimes surprised at the extent to which assumptions are no longer true when I revisit optimization work I did 10+ years ago.

      Most performance these days is architectural, so getting the initial design right often has a bigger impact than micro-optimizations and localized Big-O tweaks. You can always go back and tweak algorithms or codegen later but architecture is permanent.

      • neonsunset 11 days ago
        .NET is a particularly bad case for this because there was a decade with few performance improvements, which caused a certain intuition to develop within the industry, and then 6-8 years of significant changes each year (with most wins compressed into the last 4 years or so). Companies moving from .NET Framework 4.6/7/8 to .NET 8 experience a 10x average performance improvement, which naturally comes with rendering a lot of performance know-how obsolete overnight.

        (the techniques that used to work were similar to earlier Java versions and overall very dynamic languages with some exceptions, the techniques that still work and now are required today are the same as in C++ or Rust)

        • throwaway2037 10 days ago
          .NET 4.6 to .NET 8 is a 10x "average" performance improvement. I find this hard to believe. In what scenarios? I tried to Google for it and found very little hard evidence.
          • neonsunset 10 days ago
            In general purpose scenarios, particularly in codebases which have a high amount of abstractions, use ASP.NET Core and EF Core, parse and de/serialize text with JSON, Regex and other options, have network and file IO, and are deployed on many-core hosts/container images.

            There are a few articles on msft devblogs that cover from-netframework migration to older versions (Core 3.1, 5/6/7):

            - https://devblogs.microsoft.com/dotnet/bing-ads-campaign-plat...

            - https://devblogs.microsoft.com/dotnet/microsoft-graph-dotnet...

            - https://devblogs.microsoft.com/dotnet/the-azure-cosmos-db-jo...

            - https://devblogs.microsoft.com/dotnet/one-service-journey-to...

            - https://devblogs.microsoft.com/dotnet/microsoft-commerce-dot...

            The tl;dr is that depending on the codebase the latency reduction was anywhere from 2x to 6x, varying per percentile, or the RPS was maintained with CPU usage dropping by ~2-6x.

            Now, these are codebases of likely above average quality.

            If you consider that moving 6 -> 8 yields another up to 15-30% on average through improved and enabled by default DynamicPGO, and if you also consider that the average codebase is of worse quality than whatever msft has, meaning that DPGO-reliant optimizations scale way better, it is not difficult to see the 10x number.

            Keep in mind that while a particular piece of regular enterprise code could have improved within the bounds of "poor netfx codegen" -> "not far from LLVM with FLTO and PGO", the bottlenecks have changed significantly. Previously they could have been in lock contention (within the GC or user code), object allocation, object memory copying, e.g. for financial domains anything including possibly complex Regex queries on imported payment reports (these alone now differ by anywhere between 2x and >1000x[0]), and for pretty much every code base also in interface/virtual dispatch for layers upon layers of "clean architecture" solutions.

            The vast majority of performance improvements (both compiler+gc and CoreLib+frameworks), which is difficult to think about given it was 8 years, address the above first and foremost. At my previous employer, the migration from NETFX 4.6 to .NET Core 3.1, while also deploying to much more constrained container images compared to beefy Windows Server hosts, reduced the latency of most requests by the same factor of >5x (a certain request type went from 2s to 350ms). It was my first wow moment and why I decided to stay with .NET rather than move over to Go back then (I was never a fan of Go's syntax though, and other issues that Go still has, which subsequently got fixed in .NET, are not tolerable for me).

            [0] Cumulative of

            https://devblogs.microsoft.com/dotnet/regex-performance-impr...

            https://devblogs.microsoft.com/dotnet/regular-expression-imp...

            https://devblogs.microsoft.com/dotnet/performance-improvemen...

            • rerdavies 10 days ago
              Cheating.

              All of the 6x performance improvement cases seem to be related to using the .net based Kestrel web server instead of the IIS web server, which requires marshalling and interprocess communication. Several of the 2x gains appear to be related to using a different database backend. Claims that regex performance has improved a thousand-fold... seem more troubling than cause for celebration. Were you not precompiling your regexes in the older code? That would be a bug.

              Somewhere in there, there might be 30% improvements in .net codegen (it's hard to tell). Profile Guided Optimization (PGO) seems to provide a 35% performance improvement over older versions of .net with PGO disabled. But that's dishonest. PGO was around long before .net Core. And claiming that PGO will provide 10x performance because our code is worse than Microsoft's code insults both our code and our intelligence.

              • ygra 10 days ago
                Not sure about the 10×, either, and if true it would involve more than just the JIT changes. But changing ASP.NET to ASP.NET Core at the same time and the web server as well as other libraries may make it plausible. For certain applications moving from .NET Framework to .NET isn't so simple when they have dependencies and those have changed their API significantly. And in that case most of the newer stuff seems to be built with performance in mind. So you gain 30 % from the JIT, 2× from Kestrel, and so on. Perhaps.

                With a Roslyn-based compiler at work I saw 20 % perf improvement just by switching from .NET Core 3.1 to .NET 6. No idea how slow .NET Framework was, though. I probably can't target the code to that anymore.

                But for regex, even with precompilation, the compiler got a lot better at transforming the regex into an equivalent regex that performs better (automatic atomic grouping to reduce unnecessary backtracking when it's statically known that backtracking won't create more matches, for example), and it also benefits a lot from the various vectorized implementations of IndexOf, etc. Typically with each improvement of one of those core methods for searching stuff in memory there's a corresponding change that uses it in regex.

                So where in .NET Framework a regex might walk through a whole string character by character multiple times with backtracking it might be replaced with effectively an EndsWith and LastIndexOfAny call in newer versions.

                • neonsunset 10 days ago
                  Roslyn didn't have many changes in terms of optimizations - it compiles C# to IL so it does very little of that, save for switches and certain newer features like collection literals. You are probably talking about RyuJIT, also called just JIT nowadays :D

                  (the distinction becomes important for targets serviced by Mono, so to outline the difference Mono is usually specified, while CoreCLR and RyuJIT may not be; it also doesn't help that the JIT, that is, the IL to machine code compiler, also services NativeAOT, so it gets more annoying to be accurate in a conversation without saying the generic ".net compiler"; some people refer to it as JIT/ILC)

                  • ygra 10 days ago
                    No, I meant that we've written a compiler, based on Roslyn, whose runtime for compiling the code has improved by 20 % when switching to .NET 6.

                    And indeed, on the C# -> IL side there's little that's being actually optimized. Besides collection literals there's also switch statements/expressions over strings, along with certain pattern matching constructs that get improved on that side.

                    • neonsunset 10 days ago
                      Interesting! (I was way off the mark, not reading carefully, ha)

                      Is it a public project?

                      • ygra 10 days ago
                        Nope, completely internal and part of how we offer essentially the same product on multiple platforms with minimal integration work. And existing C# → anything compilers are typically too focused on compiling a whole application instead of offering a library with a stable and usable API on the other end, so we had to roll our own.
              • neonsunset 10 days ago
                No. DynamicPGO was first introduced in .NET 6 but was not mature and needed two releases worth of work to become enabled by default. It needs no user input and is similar to what OpenJDK Hotspot has been doing for some time and then a little more. It also is required for major features that were strictly not available previously: guarded devirtualization of virtual and interface calls and delegate inlining.

                Also, IIS hosting through Http.sys is still an option that sees separate set of improvements, but that's not relevant in most situations given the move to .NET 8 from Framework usually also involves replacing Windows Server host with a Linux container (though it works perfectly fine on Windows as well).

                On Regex, compiled and now source-generated automata have seen a lot of work in all recent releases; it is night and day compared to what it was before - just read the articles. Previously linear scans against heavy internal data structures (matching by hashset) and heavy transient allocations got replaced with bloom-filter style SIMD search and other state of the art text search algorithms[0], at the completely opposite end of the performance spectrum.

                So when you have compiler improvements multiplied by changes to CoreLib internals multiplied by changes to frameworks built on top - it's achievable with relative ease. .NET Framework, while performing adequately, was still that slow compared to what we got today.

                [0] https://github.com/dotnet/runtime/tree/main/src/libraries/Sy...

                • rerdavies 10 days ago
                  Sure. But static PGO was introduced in .Net Framework 4.7.0. And we're talking about apps in production, so there's no excuse NOT to use static PGO on the .net framework 4.7.0 version.

                  And you have misrepresented the contents of the blogs. The projects discussed in the blogs are typically claiming ~30% improvements (perhaps because they weren't using static PGO in their 4.7.0 incarnation), with two dramatic outliers that seem to be related to migrating from IIS to Kestrel.

                  • neonsunset 10 days ago
                    It’s a moot point. Almost no one used static PGO and its feature set was way more limited - it did not have devirtualization which provides the biggest wins. Though you are welcome to disagree it won’t change the reality of the impact .NET 8 release had on real world code.

                    It’s also convenient to ignore the rest of the content at the links but it seems you’re more interested in proving your argument so the data I provided doesn’t matter.

              • andyayers 10 days ago
                Something closer to a "pure codegen/runtime" example perhaps: I have data showing Roslyn (the C# compiler, itself written in C#) speeds up between ~2x and ~3x running on .NET 8 vs .NET 4.7.1. Roslyn is built so that it can run either against full framework or core, so it's largely the same application IL.
              • throwaway2037 6 days ago

                    > Were you not precompiling your regex's in the older code? That would be a bug.
                
                I never heard of this before. Perl has legendarily fast regexen and I never heard of this feature. Does Java do it? I don't think so, and the regexes are fast enough in my experience. Can you name a language where regexen are precompiled?
      • tombert 11 days ago
        Yep, completely agree with you on this. Intuition is often wrong, or at least outdated.

        When I'm building stuff I try my best to focus on "correctness", and try to come up with an algorithm/design that will encompass all realistic use cases. If I focus on that, it's relatively easy to go back and convert my `decimal` type to a float64, or even convert an if statement into a switch if it's actually faster.

    • saghm 11 days ago
      > But there's tons of this stuff like this in CS

      Reminds me of the classic https://stackoverflow.com/questions/24848359/which-is-faster...

      • sgerenser 11 days ago
        Never saw that before, that is indeed a classic.
    • wvenable 11 days ago
      In my opinion, the only things that really matter are algorithmic complexity and readability. And even algorithmic complexity is usually only an issue at certain scales. Whether or not an 'if' is faster than a 'switch' is the micro of micro optimizations -- you better have a good reason to care. The question I would have for you is whether your bunch of ifs was more readable than a switch would be.
      • tombert 11 days ago
        Yeah, and it's not like I didn't know how to do the stuff I was doing with a switch, I just don't like switches because I've forgotten to add break statements and had code that appeared correct but actually broke a month down the line. I've also seen other people make the same mistakes. ifs, in my opinion at least, are a bit harder to screw up, so I will always prefer them.

        But I agree, algorithmic complexity is generally the only thing I focus on, and even then it's almost always a case of "will that actually matter?" If I know that `n` is never going to be more than like `10`, I might not bother trying to optimize an O(n^2) operation.

        What I feel often gets ignored in these conversations is latency; people obsess over some "optimization" they learned in college a decade ago, and ignore the 200 HTTP or Redis calls being made ten lines below, despite the fact that the latter will have a substantially higher impact on performance.

        • dllthomas 10 days ago
          > in my opinion at least, are a bit harder to screw up, so I will always prefer them

          My experience is the opposite - a sizeable chain of ifs has more that can go wrong precisely because it is more flexible. If I'm looking at a switch, I immediately know, for instance, that none of the tests modifies anything.

          Meanwhile, while a missing break can be a brutal error in a language that allows it, it's usually trivial to set up linting to require either an explicit break or a comment indicating fallthrough.
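
          In C++, for instance, [[fallthrough]] plus a warning flag like -Wimplicit-fallthrough gets that check from the compiler itself (a small sketch):

              // Compile with -Wimplicit-fallthrough (GCC/Clang): a case body that
              // falls through without the explicit annotation triggers a warning.
              int category(int code, int& hits) {
                switch (code) {
                  case 0:
                    return 0;
                  case 1:
                    ++hits;
                    [[fallthrough]];  // intentional, and visible to the compiler
                  case 2:
                    return 1;
                  default:
                    return -1;
                }
              }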

      • doctor_phil 11 days ago
        But a switch and an if-else *is* a matter of algorithmic complexity. (Well, at least could be for a naive compiler). A switch could be converted to a constant time jump, but the if-else would be trying each case linearly.
        • bregma 11 days ago
            But what if, and stick with me here, a compiler is capable of reading and processing your code, and through simple scalar evolution of the conditionals and phi-reduction it can't tell the difference between a switch statement and a sequence of if statements by the time it finishes its static single assignment analysis phase?

          It turns out the algorithmic complexity of a switch statement and the equivalent series of if-statements is identical. The bijective mapping between them is close to the identity function. Does a naive compiler exist that doesn't emit the same instructions for both, at least outside of toy hobby project compilers written by amateurs with no experience?
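
            One way to check that claim for a given compiler (a hypothetical pair to compare at -O2, e.g. on godbolt):

                // Semantically equivalent dispatch written both ways; optimizing
                // compilers commonly lower both to the same jump table or the
                // same compare/branch sequence.
                int with_switch(int x) {
                  switch (x) {
                    case 1: return 10;
                    case 2: return 20;
                    case 3: return 30;
                    default: return 0;
                  }
                }

                int with_ifs(int x) {
                  if (x == 1) return 10;
                  if (x == 2) return 20;
                  if (x == 3) return 30;
                  return 0;
                }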

          • smaudet 10 days ago
            The issue with if statements (for compiled languages) is not one of "speed" but of correctness.

              If statements are unbounded, unconstrained logic constructs, whereas switch statements are type-checkable. The concern about missing break statements here is irrelevant: where your linter/compiler can warn about missing switch cases, it can easily warn about non-terminated (not explicitly marked as fall-through) cases.

              For non-compiled languages (so branch prediction is not possible because the code is not even loaded), switch statements also provide a speed-up, i.e. the parser can immediately evaluate the branch to execute vs being forced to evaluate intermediate steps (and the conditions of each if statement can produce side-effects, e.g. if(checkAndDo()) { ... } else if (checkAndDoB()) { ... } else if (checkAndDoC()) { ... }).

            Which, of course, is a potential use of if statements that switches cannot use (although side-effects are usually bad, if you listened to your CS profs)... And again a sort of "static analysis" guarantee that switches can provide that if statements cannot.

        • saurik 11 days ago
          While I personally find the if statements harder to immediately mentally parse/grok--as I have to prove to myself that they are all using the same variable and are all chained correctly in a way that is visually obvious for the switch statement--I don't find "but what if we use a naive compiler" at all a useful argument to make as, well, we aren't using a naive compiler, and, if we were, there are a ton of other things we are going to be sad about the performance of leading us down a path of re-implementing a number of other optimizations. The goal of the compiler is to shift computational complexity from runtime to compile time, and figuring out whether the switch table or the comparisons are the right approach seems like a legitimate use case (which maybe we have to sometimes disable, but probably only very rarely).
          • smaudet 10 days ago
            Per my sibling comment, I think the argument is not about speed, but simplicity.

            Awkward switch syntax aside, the switch is simpler to reason about. Fundamentally we should strive to keep our code simple to understand and verify, not worry about compiler optimizations (on the first pass).

            • saurik 10 days ago
              Right, and there I would say we even agree, per my first sentence; however, I wanted to reply not to you, but to doctor_phil, who was explicitly disagreeing about speed.
        • cogman10 11 days ago
          Yup.

          That said, the linear test is often faster due to CPU caches, which is why JITs will often convert switches to if/elses.

          IMO, switch is clearer in general and potentially faster (at very least the same speed) so it should be preferred when dealing with 3+ if/elseif statements.

          • neonsunset 11 days ago
            Any sufficiently advanced compiler will rewrite those arbitrarily depending on its heuristics. What authors usually forget is that there is defined behavior and a specification which the compiler abides by, but it is otherwise free to produce any codegen that preserves the defined program order. Branch reordering, generating jump tables, and optimizing away or coalescing checks into branchless forms are all very common. When someone says "oh, I write C because it lets you tell the CPU exactly how to execute the code", it is simply a sign that the person never actually looked at disassembly and has little to no idea how the tool they use works.
            • cogman10 11 days ago
              A compiler will definitely try this, but it's important to note that if/else blocks tell the compiler "you will run these evaluations in order". Now, if the compiler can detect that the evaluations have no side effects (which, in this simple example with just integer checks, is fairly likely) then yeah, I can see a jump table getting shoved in as an optimization.

              However, the moment you add a side effect or something more complicated like a method call, it becomes really hard for the complier to know if that sort of optimization is safe to do.

              The benefit of the switch statement is that it's already well positioned for the compiler to optimize as it does not have the "you must run these evaluations in order" requirement. It forces you to write code that is fairly compiler friendly.
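
              To make that concrete, here is a minimal sketch (hypothetical helper names, each with a visible side effect):

                  int g_work = 0;  // mutated by the checks below, so they have side effects

                  bool checkAndDoA(int x) { ++g_work; return x == 1; }
                  bool checkAndDoB(int x) { ++g_work; return x == 2; }
                  bool checkAndDoC(int x) { ++g_work; return x == 3; }

                  int dispatch_if(int x) {
                      // The chain pins down evaluation order: A runs before B, B before C.
                      // A jump table is only possible if the compiler can prove the calls
                      // have no observable side effects (here they do).
                      if (checkAndDoA(x)) return 1;
                      else if (checkAndDoB(x)) return 2;
                      else if (checkAndDoC(x)) return 3;
                      return 0;
                  }

                  int dispatch_switch(int x) {
                      // Pure value comparison with no ordering promise: free to become a
                      // jump table, a branch tree, or branchless code.
                      switch (x) {
                          case 1: return 1;
                          case 2: return 2;
                          case 3: return 3;
                          default: return 0;
                      }
                  }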

              All that said, probably a waste of time debating :D. Ideally you have profiled your code and the profiler has told you "this is the slow block" before you get to the point of worrying about how to make it faster.

              • tombert 11 days ago
                I agree with what you said but in this particular case, it actually was a direct integer equality check, there was zero risk of hitting side effects and that was plainly obvious to me, the checker, and compiler.
                • cogman10 11 days ago
                  And to your original comment, I think the reviewer was wrong to reject the PR over that. Performance has to be measured before you can use it to reject (or create...) a PR. If someone hasn't done that, then unless it's something obvious like "you are making a ton of tiny heap allocations in a tight loop", I think nitpicking these sorts of things is just wrong.
          • tombert 11 days ago
            Hard disagree that it's "clearer". I have had to deal with a ton of bugs with people trying to be clever with the `break` logic, or forgetting to put `break` in there at all.

            if statements are dumber, and maybe arguably uglier, but I feel like they're also more clear, and people don't try and be clever with them.

            • cogman10 11 days ago
              Updates to languages (don't know where C# is on this) have different types of switch statements that eliminate the `break` problem.

              For example, with java there's enhanced switch that looks like this

                  var val = switch(foo) {
                   case 1, 2, 3 -> bar;
                   case 4 -> baz;
                   default -> {
                     yield bat();
                   }
                   };
              
              The C style switch break stuff is definitely a language mistake.
              • neonsunset 11 days ago
                C# has switch statements, which are C/C++ style switches, and switch expressions, which are like Rust's match except with no control flow statements inside:

                    var len = slice switch
                    {
                        null => 0,
                        "Hello" or "World" => 1,
                        ['@', ..var tags] => tags.Length,
                        ['{', ..var body, '}'] => body.Length,
                        _ => slice.Length,
                    };
                
                (it supports a lot more patterns but that wouldn't fit)
              • wvenable 11 days ago
                C# has both switch expressions like this and also break statements are not optional in traditional switch statements so it actually solves both problems. You can't get too clever with switch statements in C#.

                However most languages have pretty permissive switch statements just like C.

                • tombert 11 days ago
                  Yeah, fair, it's been awhile since I've done any C#, so my memory is a bit hazy with the details. I've been burned by C switch statements, so I have a pretty strong distaste for them.
                  • smaudet 10 days ago
                    I think using C as the language with which to judge language constructs is hardly fair - one of its main strengths has been as a fairly stable, unchanging code-to-compiler contract, i.e. little to no syntax change or improvement.

                    So no offense, but I would revisit the wider world of language constructs before claiming that switch statements are "all bad". There are plenty of bad languages or languages with poor implementations of syntax, that do not make the fundamental language construct bad.

              • gloryjulio 11 days ago
                This is just forcing a return value. You either have to break or return at the branches. To me they all look equivalent.
            • SAI_Peregrinus 10 days ago
              I always set -Werror=implicit-fallthrough, among others. That prevents fallthrough unless explicitly annotated. Sadly these will forever remain optional warnings requiring specific compiler flags, since requiring them could break compiling broken legacy code.
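
              With that in place, an intentional fallthrough has to be annotated or the build fails, e.g. (a minimal sketch):

                  // compile with -Werror=implicit-fallthrough (GCC/Clang)
                  int category(int c) {
                      switch (c) {
                          case 0:
                              return 10;
                          case 1:
                              ++c;              // some work, then a deliberate fallthrough
                              [[fallthrough]];  // C++17 attribute; without it, this is an error
                          case 2:
                              return 20 + c;
                          default:
                              return -1;
                      }
                  }
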
        • adrianN 10 days ago
          Both the switch and the if have O(1) instructions, so both are the same from an algorithmic complexity perspective.
        • Gazoche 10 days ago
          It's linear with respect to the number of cases, not the size of inputs. It's still O(1) in the sense of algorithmic complexity.
        • yau8edq12i 10 days ago
          Unless the number of "else if" statements somehow grows e.g. linearly with the size of your input, which isn't plausible, the "else if" statements also execute in O(1) time.
      • jpc0 10 days ago
        > ... really matter are algorithmic complexity ...

        This is not entirely true either... Measure. There are many cases where the optimiser will vectorise a certain algorithm but not another... In many cases O(n^2) vectorised may be significantly faster than O(n) or O(n log n), even for very large datasets, depending on your data...

        Make your algorithms generic and it won't matter which one you use; if you find that one is slower, swap it for the quicker one. Depending on CPU arch and compiler optimisations, the fastest algorithm may actually change multiple times in a codebase's lifetime even if the usage pattern doesn't change at all.

        • bluGill 10 days ago
          While you are not wrong, if you have a decent language you will discover all the useful algorithms are already in your standard library, and so it isn't a worry. Your code should mostly look like "apply this existing algorithm to some new data structure".
          • jpc0 10 days ago
            I don't disagree with you at all on this. However you may need to combine several to get to an end result. And if that happens a few times in a codebase, well, it makes sense to factor that into a library.
    • leetcrew 11 days ago
      agreed, especially in cases like this. final is primarily a way to prohibit overriding methods and extending classes, and it indicates to the reader that they should not be doing this. use it when it makes conceptual sense.

      that said, c++ is usually a language you use when you care about performance, at least to an extent. it's worth understanding features like nrvo and rewriting functions to allow the compiler to pick the optimization if it doesn't hurt readability too much.

    • BurningFrog 11 days ago
      Even if one of these constructs is faster it doesn't matter 99% of the time.

      Writing well structured readable code is typically far more important than making it twice as fast. And those times can rarely be predicted beforehand, so you should mostly not worry about it until you see real performance problems.

      • apantel 11 days ago
        The counter-argument to this is if you are building something that is in the critical path of an application (for example, parsing HTTP in a web server), you need to be performance-minded from the beginning because design decisions lead to design decisions. If you are building something in the critical path of the application, the best thing to do is build it from the ground up measuring the performance of what you have as you go. This way, each time you add something you will see the performance impact and usually there’s a more performant way of doing something that isn’t more obscure. If you do this as you build, early choices become constraints, but because you chose the most performant thing at every stage, the whole process takes you in the direction of a highly-performant implementation.

        Why should you care about performance?

        I can give you my personal experience: I’ve been working on a Java web/application server for the past 15 years and a typical request (only reading, not writing to the db) would take maybe 4-5 ms to execute. That includes HTTP request parsing, JSON parsing, session validation, method execution, JSON serialization, and HTTP response dispatch. Over the past 9 months I have refactored the entire application for performance and a typical request now takes about 0.25 ms or 250 microseconds. The computer is doing so much less work to accomplish the same tasks, it’s almost silly how much work it was doing before. And the result is the machine can handle 20x more requests in the same amount of time. If it could handle 200 requests per second per core before, now it can handle 4000. That means the need to scale is felt 20x less intensely, which means less complexity around scaling.

        High performance means reduced scaling requirements.

        • neonsunset 11 days ago
          Please accept a high five from a fellow "it does so little work it must have sub-millisecond request latency" aficionado (though I must admit I'm guilty of abusing memory caches to achieve this).
          • apantel 11 days ago
            Caches, precomputed values, lookup tables — it’s all good as long as it’s well-organized and maintainable.
        • tombert 11 days ago
          But even that sort of depends, right? Hardware is often pretty cheap in comparison to dev-time. It really depends on the project, what kind of servers you're using, the nature of the application, etc., but I think a lot of the time it might be cheaper to just pay for 20x the servers than it would be to pay a human to go find a critical path.

          I'm not saying you completely throw caution to the wind, I'm just saying that there's a finite amount of human resources and it can really vary how you want to allocate them. Sometimes the better path is to just throw money at the problem.

          It really depends.

          • apantel 11 days ago
            I think it depends on what you’re building and who’s building it. We’re all benefitting from the fact that the designers of NGINX made performance a priority. We like using things that were designed to be performant. We like high-FPS games. We like fast internet.

            I personally don’t like the idea of throwing compute at a slow solution. I like when the extra effort has been put into something. The good feeling I get from interacting with something that is optimal or excellent is an end in itself and one of the things I live for.

            • tombert 11 days ago
              Sure, though I've mentioned a few times in this thread now that the thing that bothers me more than CPU optimizations is not taking into account latency, particularly when hitting the network, and I think focusing on that will generally pay higher dividends than trying to optimize for processing.

              CPUs are ridiculously fast now, and compilers are really really good now too. I'm not going to say that processing speed is a "solved" problem, but I am going to say that in a lot of performance-related cases the CPU processing is probably not your problem. I will admit that this kind of pokes holes in my previous response, because introducing more machines into the mix will almost certainly increase latency, but I think it more or less holds depending on context.

              But I think it really is a matter of nuance, which you hinted at. If I'm making an admin screen that's going to have like a dozen users max, then a slow, crappy solution is probably fine; the requests will be served fast enough to where no one will notice anyway, and you can probably even get away with the cheapest machine/VM. If I'm making an FPS game that has 100,000 concurrent users, then it almost certainly will be beneficial to squeeze out as much performance out of the machine as possible, both CPU and latency-wise.

              But as I keep repeating everywhere, you have to measure. You cannot assume that your intuition is going to be right, particularly at-scale.

              • apantel 11 days ago
                I absolutely agree that latency is the real thing to optimize for. In my case, I only leave the application to access the db, and my applications tend not to be write-heavy. So in my case latency-per-request == how much work the computer has to do, which is constrained to one core because the overhead of parallelizing any part of the pipeline is greater than the work required. See, in that sense, we’re already close to the performance ceiling for per-request processing because clock speeds aren’t going up. You can’t make the processing of a given request faster by throwing more hardware at it. You can only make it faster by creating less work for the hardware to do.

                (Ironically, HN is buckling under load right now, or some other issue.)

          • oivey 11 days ago
            It almost certainly would require more than 20x servers because setting up horizontal scaling will have some sort of overhead. Not only that, there is the significant engineering effort to develop and maintain the code to scale.

            If your problem can fit on one server, it can massively reduce engineering and infrastructure costs.

      • tombert 11 days ago
        I mostly focus on "using stuff that won't break", and yeah "if it actually matters".

        For example, much to the annoyance of a lot of people, I don't typically use floating point numbers when I start out. I will use the "decimal" or "money" types of the language, or GMP if I'm using C. When I do that, I can be sure that I won't have to worry about any kind of funky overflow issues or bizarre rounding problems. There might be a performance overhead associated with it, but then I have to ask myself "how often is this actually called?"
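
        A toy illustration of the kind of surprise I'm avoiding (with plain integer cents standing in for a real decimal/money type):

            #include <cstdio>

            int main() {
                // Binary floating point cannot represent 0.10 or 0.20 exactly:
                double a = 0.10, b = 0.20;
                std::printf("%.17f\n", a + b);  // prints 0.30000000000000004

                // Exact integer cents: no rounding surprises, at the cost of some ceremony.
                long long cents = 10 + 20;
                std::printf("%lld.%02lld\n", cents / 100, cents % 100);  // prints 0.30
                return 0;
            }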

        If the answer is "a billion times" or "once in every iteration of the event loop" or something, then I will probably eventually go back and figure out if I can use a float or convert it to an integer-based thing, but in a lot of cases the answer is "like ten or twenty times", and at that point I'm not even 100% sure it would be even measurable to change to the "faster" implementations.

        What annoys me is that people will act like they really care about speed, do all these annoying micro-optimizations, and then forget that pretty much all of them get wiped out immediately upon hitting the network, since the latency associated with that is obscene.

      • neonsunset 11 days ago
        This attitude is part of the problem. Another part of the problem is having no idea which things actually end up costing performance and how much.

        It is why many language ecosystems suffered from performance issues for a really long time even if completely unwarranted.

        Is changing ifs to switches or vice versa, as outlined in the post above, a waste of time? Yes. Unless you are writing some encoding algorithm or a parser, it will not matter: the compiler will lower trivial statements to the same codegen, and even if there were a difference, it would not impact the resulting performance given the problem the code is solving.

        However, there are things that do cost: interface spam, abusing lambdas to write needlessly complex workflow-style patterns (which are also less readable and worse in 8 out of 10 instances), not caching objects that always have the same value, etc.

        These kinds of issues, for example, plagued the .NET ecosystem until the more recent culture shift where it became cool once again to focus on performance. It wasn't helped by the notion of "well-structured code" being just idiotic "clean architecture" and "GoF patterns" style dogma applied to the smallest applications and the simplest of business domains.

        (It is also the reason why picking slow languages in general is a really bad idea - everything costs more and you have way less leeway for no productivity win - Ruby and Python, and JS with Node.js, are less productive to write in than C#/F#, Kotlin/Java or Go (under some conditions).)

        • tombert 11 days ago
          I mean, that's kind of why I tried to emphasize measuring things yourself instead of depending on tribal knowledge.

          There are plenty of cases where even the "slow" implementation is more than fast enough, and there are also plenty of cases where the "correct" solution (from a big-O or intuition perspective) is actually slower than the dumb one. Intuition helps, but you have to measure and/or look at the compiled results if you want to ensure correct numbers.

          An example that really annoys me is how every whiteboard interview ends up being "interesting ways to use a hashmap", which isn't inherently an issue, but they will usually be so small-scoped that an iterative "array of pairs" might actually be cheaper than paying the up-front cost of hashing and potentially dealing with collisions. Interviews almost always ignore constant factors, and that's fair enough, but in reality constant factors can matter, and we're training future employees to ignore that.

          I'll say it again: as far as I can tell, you have to measure if you want to know if your result is "faster". "Measuring" might involve memory profilers, or dumb timers, or a mixture of both. Gut instincts are often wrong.

    • klyrs 11 days ago
      > A large part of becoming a decent engineer [2] for me was learning to stop trusting what professors taught me in college

      When I was taught about performance, it was all about benchmarking and profiling. I never needed to trust what my professors taught, because they taught me to dig in and find the truth for myself. This was taught alongside the big-O stuff, with several examples where "fast" algorithms are slower on small inputs.

      • TylerE 11 days ago
        How do you even get meaningful profiling out of most modern langs? It seems the vast majority of time and calls gets spent inside tiny anonymous functions, GC allocations, and stuff like that.
        • neonsunset 11 days ago
          This is easy in most modern programming languages.

          JVM ecosystem has IntelliJ Idea profiler and similar advanced tools (AFAIK).

          .NET has VS/Rider/dotnet-trace profilers (they are very detailed) to produce flamegraphs.

          Then there are native profilers which can work with any AOT compiled language that produces canonically symbolicated binaries: Rust, C#/F#(AOT mode), Go, Swift, C++, etc.

          For example, you can do `samply record ./some_binary`[0] and then explore multi-threaded flamegraph once completed (I use it to profile C#, it's more convenient than dotTrace for preliminary perf work and is usually more than sufficient).

          [0] https://github.com/mstange/samply

          • TylerE 10 days ago
            I mean sure, but I've never seen much in a flamegraph besides noise.
            • neonsunset 10 days ago
              My experience is the complete opposite. You just need to construct a realistic load test for the code and the bottlenecks will stand out (more often than not).

              Also, there is a learning curve to grouping and aggregating the data.

        • klyrs 11 days ago
          I don't use most modern langs! And especially if I'm doing work where performance is critical, I won't kneecap myself by using a language that I can't reasonably profile.
    • jollyllama 11 days ago
      I've encountered similar situations before. It's insane to me when people hold up PRs over that kind of thing.
    • ot1138 10 days ago
      >I don't do much C++, but I have definitely found that engineers will just assert that something is "faster" without any evidence to back that up.

      Very true, though there is one case where one can be highly confident that this is the case: code elimination.

      You can't get any faster than not doing something in the first place.

      • konstantinua00 10 days ago
        inb4 instruction (cache) alignment screws everything up
    • zmj 11 days ago
      .NET is a little smarter about switch code generation these days: https://github.com/dotnet/roslyn/pull/66081
    • KerrAvon 11 days ago
      > `if(x==1) elseif(x==2)...` because switch was "faster" and rejected my PR

      Yeah, that's never been true. Old compilers would often compile a switch to __slower__ code because they'd tend to always go to a jump table implementation.

      A better reason to use the switch is because it's better style in C-like languages. Using an if statement for that sort of thing looks like Python; it makes the code harder to maintain.

      • wzdd 10 days ago
        And it's better style because it better conveys intent. An if-else chain in C/C++ implies there's something important about the ordering of cases. Though I'd say that for a very small number of cases it's fine.

        (Also, Python has a switch-like construct now.)

    • mynameisnoone 10 days ago
      Yep. "Profiling or it didn't happen." The issue is that it's essentially impossible for even the most neckbeard of us to predict with a high degree of accuracy and precision the performance on modern systems impact of change A vs. change B due to the unpredictable nature of the many variables that are difficult to control including compiler optimization passes, architecture gotchas (caches, branch misses), and interplay of quirks on various platforms. Therefore, irreducible and necessary work to profile the differences become the primary viable path to resolving engineering decision points. Hopefully, LLMs now and in the future will be able to help build out boilerplate roughly in the direct of creating such profiling benchmarks and fixtures.

      PS: I'm presently revisiting C++14 because it's the most universal statically-compiled language to quickly answer interview problems. It would be unfair to impose Rust, Go, Elixir, or Haskell on an interviewer software engineer.

      • pjmlp 10 days ago
        I would say it would be safer to go up to C++17, and there are some goodies there, especially for better compile-time stuff.
    • trueismywork 11 days ago
      There's not yet a culture of writing reproducible benchmarks to gauge these effects.
    • dosshell 11 days ago
      > I can get away with a smaller sized float

      When talking about not assuming optimizations...

      32bit float is slower than 64bit float on reasonably modern x86-64.

      The reason is that 32bit float is emulated by using 64bit.

      Of course if you have several floats you need to optimize against cache.

      • jcranmer 11 days ago
        Um... no. This is 100% completely and totally wrong.

        x86-64 requires the hardware to support SSE2, which has native single-precision and double-precision instructions for floating-point (e.g., scalar multiply is MULSS and MULSD, respectively). Both the single precision and the double precision instructions will take the same time, except for DIVSS/DIVSD, where the 32-bit float version is slightly faster (about 2 cycles latency faster, and reciprocal throughput of 3 versus 5 per Agner's tables).

        You might be thinking of x87 floating-point units, where all arithmetic is done internally using 80-bit floating-point types. But all x86 chips in like the last 20 years have had SSE units--which are faster anyways. Even in the days when it was the major floating-point units, it wasn't any slower, since all floating-point operations took the same time independent of format. It might be slower if you insisted that code compilation strictly follow IEEE 754 rules, but the solution everybody did was to not do that and that's why things like Java's strictfp or C's FLT_EVAL_METHOD were born. Even in that case, however, 32-bit floats would likely be faster than 64-bit for the simple fact that 32-bit floats can safely be emulated in 80-bit without fear of double rounding but 64-bit floats cannot.

        • dosshell 11 days ago
          I agree with you. It should take the same time, now that I think more about it. I remember learning this in ~2016, and I did a performance test on Skylake which confirmed it (Windows, VS2015). I think I remember that I only tested with addsd/addss. Definitely not x87. But as always, if the result cannot be reproduced... I stand corrected until then.
          • dosshell 10 days ago
            I tried to reproduce it on Ivy Bridge (Windows, VS2012) and failed (mulss and mulsd) [0]. Single and double precision take the same time. I also found a behavior where the first batch of iterations takes more time regardless of precision. It is possible that this tricked me last time.

            [0] https://gist.github.com/dosshell/495680f0f768ae84a106eb054f2...

            Sorry for the confusion and spreading false information.

      • tombert 11 days ago
        Sure, I clarified this in a sibling comment, but I kind of meant that I will use the slower "money" or "decimal" types by default. Usually those are more accurate and less error-prone, and then if it actually matters I might go back to a floating point or integer-based solution.
      • sgerenser 11 days ago
        I think this is only true if using x87 floating point, which anything computationally intensive is generally avoiding these days in favor of SSE/AVX floats. In the latter case, for a given vector width, the cpu can process twice as many 32 bit floats as 64 bit floats per clock cycle.
        • dosshell 11 days ago
          Yes, as I wrote, it is only true for one float value.

          SIMD/MIMD will benefit from working on smaller widths. This is not only true because they do more work per clock but because memory is slow. Super slow compared to the CPU. Optimization is a lot about optimizing cache misses.

          (But remember that the cache line is 64 bytes, so reading a single value smaller than that will take the same time. So it does not matter in theory when comparing one f32 against one f64)

  • andrewla 11 days ago
    I'm surprised that it has any impact on performance at all, and I'd love to see the codegen differences between the applications.

    Mostly the `final` keyword serves as a compile-time assertion. The compiler (sometimes linker) is perfectly capable of seeing that a class has no derived classes, but what `final` assures is that if you attempt to derive from such a class, you will raise a compile-time error.
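
    For example (hypothetical types):

        struct Shape {
            virtual double area() const = 0;
            virtual ~Shape() = default;
        };

        struct Circle final : Shape {
            double radius = 1.0;
            double area() const override { return 3.14159265 * radius * radius; }
        };

        // Attempting this is a compile-time error ("cannot derive from 'final' base"):
        // struct WobblyCircle : Circle {};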

    This is similar to how `inline` works in practice -- rather than providing a useful hint to the compiler (though the compiler is free to treat it that way) it provides an assertion that if you do non-inlinable operations (e.g. non-tail recursion) then the compiler can flag that.

    All of this is to say that `final` can speed up runtimes -- but it does so by forcing you to organize your code such that the guarantees apply. By using `final` classes, in places where dynamic dispatch can be reduced to static dispatch, you force the developer to not introduce patterns that would prevent static dispatch.

    • GuB-42 11 days ago
      "inline" is confusing in C++, as it is not really about inlining. Its purpose is to allow multiple definitions of the same function. It is useful when you have a function defined in a header file, because if included in several source files, it will be present in multiple object files, and without "inline" the linker will complain of multiple definitions.

      It is also an optimization hint, but AFAIK, modern compilers ignore it.

      • fweimer 11 days ago
        GCC does not ignore inline for inlining purposes:

        Need a way to make inlining heuristics ignore whether a function is inline https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93008

        (Bug saw a few updates recently, that's how I remembered.)

        As a workaround, if you need the linkage aspect of the inline keyword, you currently have to write fake templates instead. Not great.

      • jacoblambda 11 days ago
        The thing with `inline` as an optimisation is that it's not about optimising by inlining directly. It's a promise about how you intend to use the function.

        It's not just "you can have multiple definitions of the same function" but rather a promise that the function doesn't need to be address/pointer equivalent between translation units. This is arguably more important than inlining directly because it means the compiler can fully deduce how the function may be used without any LTO or other cross translation unit optimisation techniques.

        Of course you could still technically expose a pointer to the function outside a TU but doing so would be obvious to the compiler and it can fall back to generating a strictly conformant version of the function. Otherwise however it can potentially deduce that some branches in said function are unreachable and eliminate them or otherwise specialise the code for the specific use cases in that TU. So it potentially opens up alternative optimisations even if there's still a function call and it's not inlined directly.

      • wredue 11 days ago
        I believe the wording I’ve seen is that compilers may not respect the inline keyword, not that it is ignored.
      • lqr 11 days ago
        10 years ago it was already folklore that compilers ignore the "inline" keyword when optimizing, but that was false for clang/llvm: https://stackoverflow.com/questions/27042935/are-the-inline-...
      • lelanthran 10 days ago
        > It is useful when you have a function defined in a header file, because if included in several source files, it will be present in multiple object files, and without "inline" the linker will complain of multiple definitions.

        Traditionally you'd use `static` for that use case, wouldn't you?

        After all, `inline` can be ignored, `static` can't.

        • pjmlp 10 days ago
          No, because that would make it internal to each object file, while what you want is for all object files to see the same memory location.
          • lelanthran 10 days ago
            > No, because that would make it internal to each object file, while what you want is for all object files to see the same memory location.

            I can see exactly one use for an effect like that: static variables within the function.

            Are there any other uses?

            • pjmlp 10 days ago
              Global variables and the magic of a build system based on C semantics.
      • ack_complete 10 days ago
        > "inline" is confusing in C++, as it is not really about inlining. Its purpose is to allow multiple definitions of the same function.

        No, its purpose was and is still to specify a preference for inlining. The C++ standard itself says this:

        > The inline specifier indicates to the implementation that inline substitution of the function body at the point of call is to be preferred to the usual function call mechanism.

        https://eel.is/c++draft/dcl.inline

    • bgirard 11 days ago
      > The compiler (sometimes linker) is perfectly capable of seeing that a class has no derived classes

      How? The compiler doesn't see the full program.

      The linker I'm less sure about. If the class isn't guaranteed to be fully private wouldn't an optimizing linker have to be conservative in case you inject a derived class?

    • lanza 11 days ago
      > Mostly the `final` keyword serves as a compile-time assertion. The compiler (sometimes linker) is perfectly capable of seeing that a class has no derived classes

      That's incorrect. The optimizer has to assume everything escapes the current optimization unit unless explicitly told otherwise. It needs explicit guarantees about the visibility to figure out the extent of the derivations allowed.

    • wheybags 11 days ago
      What if I dlopen a shared object that contains a derived class, then instantiate it? You cannot statically verify that I won't. Or you could swap out a normally linked shared object for one that creates a subclass. Etc etc. This kind of stuff is why I think shared object boundaries should be limited to the lowest common denominator (basically the C ABI). Dynamic linking high level languages was a mistake. The only winning move is not to play.
    • sixthDot 10 days ago
      > I'd love to see the codegen differences between the applications

      There are two applications: dynamic calls and dynamic casts.

      Dynamic casts to final classes don't require checking the whole inheritance chain. I recently did this in styx [0]. The gain may appear marginal, e.g. 3 or 4 dereferences saved, but in programs based on OOP you can easily have *billions* of dynamic casts saved.

      [0]: https://gitlab.com/styx-lang/styx/-/commit/62c48e004d5485d4f....
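
      In C++ terms, the kind of cast that benefits looks roughly like this (made-up types):

          struct Node {
              virtual ~Node() = default;
          };

          struct Leaf final : Node {
              int value = 42;
          };

          int valueOf(Node* n) {
              // Because Leaf is final, the cast only has to test for one exact dynamic
              // type; no walk over a possible chain of further-derived classes.
              if (auto* leaf = dynamic_cast<Leaf*>(n))
                  return leaf->value;
              return 0;
          }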

  • mgraczyk 11 days ago
    The main case where I use final and where I would expect benefits (not covered well by the article) is when you are using an external library with pure virtual interfaces that you implement.

    For example, the AWS C++ SDK uses virtual functions for everything. When you subclass their classes, marking your classes as final allows the compiler to devirtualize your own calls to your own functions (GCC does this reliably).
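
    Roughly this shape (invented names, not the actual SDK types):

        // Interface from a third-party library, full of virtual functions.
        struct RetryStrategy {
            virtual bool ShouldRetry(int attempt) const = 0;
            virtual ~RetryStrategy() = default;
        };

        // Marking our implementation final lets the compiler devirtualize (and
        // potentially inline) calls made through MyRetryStrategy, even without LTO.
        struct MyRetryStrategy final : RetryStrategy {
            bool ShouldRetry(int attempt) const override { return attempt < 3; }
        };

        bool retryTwice(const MyRetryStrategy& strategy) {
            // The static type is the final class, so these can be direct calls.
            return strategy.ShouldRetry(1) && strategy.ShouldRetry(2);
        }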

    I'm curious to understand better how clang is producing worse code in these cases. The code used for the blog post is a bit too complicated for me to look at, but I would love to see some microbenchmarks. My guess is that there is some kind of icache or code-size problem, where inlining more produces worse code.

    • cogman10 11 days ago
      Could easily just be a bad optimization pathway.

      `final` tells the compiler that nothing extends this class. That means the compiler can theoretically do things like inlining class methods and eliminating virtual method calls (perhaps duplicating the method).

      However, it's quite possible that one of those optimizations makes the code bigger or misaligns things with the cache in unexpected ways. Sometimes a method call can be faster than inlining, especially with hot loops.

      All this being said, I'd expect final to offer very little benefit over PGO. Its main value is the constraint it imposes and not the optimization it might enable.

    • lpapez 10 days ago
      > For example, the AWS C++ SDK uses virtual functions for everything. When you subclass their classes, marking your classes as final allows the compiler to devirtualize your own calls to your own functions (GCC does this reliably).

      I want to ask, and I sincerely mean no snark, what is the point?

      When working with AWS through an SDK your code will spend most of the time waiting on network calls.

      What is the point of devirtualizing your function calls to save an indirection when you will be spending several orders of magnitude more time just waiting for the RPC to resolve?

      It just doesn't seem like something even worth thinking about at all.

      • mgraczyk 10 days ago
        Yeah, that was just the first public C++ library with this pattern that popped into my head. I just make all my classes final out of habit and don't think about it. I remove final if I want to subclass, but that almost never happens.
  • lionkor 10 days ago
    The only thing worse than no benchmark is a bad benchmark.

    I don't think this really shows what `final` does, not to code generation, not to performance, not to the actual semantics of the program. There is no magic bullet - if putting `final` on every single class would always make it faster, it wouldn't be a keyword, it'd be a compiler optimization.

    `final` does one specific thing: It tells a compiler that it can be sure that the given object is not going to have anything derive from it.

    • Nevermark 10 days ago
      'Final' cannot be assumed without complete knowledge of all final linking cases, and knowledge that this will not change in the future. The latter can never be assumed by a compiler without indication.

      "In theory" adding 'final' only gives a compiler more information, so should only result in same or faster code.

      In practice, some optimizations improve performance for more expected or important cases (in the compiler writer's estimation), with worse outcomes in other less expected, less important cases. Without a clear understanding of the when and how of these 'final' optimizations, it isn't clear without benchmarking after the fact when to use it, or not.

      That makes any given test much less helpful, since all we know is that 'final' was not helpful in this case. We have no basis for knowing how general these results are.

      But it would be deeply strange if 'final' was generally unhelpful. Informationally it does only one purely helpful thing: reduce the number of linking/runtime contexts the compiler needs to worry about.

    • opticfluorine 10 days ago
      Not disagreeing with your point, but it couldn't be a compiler optimization, could it? The compiler isn't able to infer that the class will not be inherited anywhere else, since another compilation unit unknown to the class could inherit.
      • vedantk 10 days ago
        Possibly not in the default c++ language mode, but check out -fwhole-program-vtables. It can be a useful option in cases where all relevant inheritance relationships are known at compile time.

        https://reviews.llvm.org/D16821

        • bluGill 10 days ago
          Which is good, but may not apply. I have an application where I can't do that because we support plugins, and so a couple of classes will get overridden outside of the compilation (this was in hindsight a bad decision, but too late to change now). Meanwhile most classes will never be overridden, and so I use final to say that. We are also a multi-repo project (which despite the hype I think is better for us than mono-repo), another reason why -fwhole-program-vtables would be difficult to use - but we could make it work with effort if it wasn't for the plugins.
      • ftrobro 10 days ago
        I assume it could be or is part of link time optimization when compiling an application rather than a library?
    • paulddraper 10 days ago
      > `final` does one specific thing: It tells a compiler that it can be sure that the given object is not going to have anything derive from it.

      ...and the compiler can optimize using that information.

      (It could also do the same without the keyword, with LTO.)

      • bluGill 10 days ago
        LTO can only apply in specific situations though, if there is any possibility that a plugin derived from the class LTO can do nothing.
  • akoboldfrying 11 days ago
    I would expect "final" to have no effect on this type of code at all. That it does in some cases cause measurable differences I put down to randomly hitting internal compiler thresholds (perhaps one of the inlining heuristics is "Don't inline a function with more than 100 tokens", and the "final" keyword pushes a couple of functions to 101).

    Why would I expect no performance difference? I haven't looked at the code, but I would expect that for each pixel, it iterates through an array/vector/list etc. of objects that implement some common interface, and calls one or more methods (probably something called intersectRay() or similar) on that interface. By design, that interface cannot be made final, and that's what counts. Whether the concrete derived classes are final or not makes no difference.
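
    I.e. something of this shape (hypothetical types and names):

        #include <memory>
        #include <vector>

        struct Ray {};  // placeholder

        struct IHittable {
            virtual bool intersectRay(const Ray& r) const = 0;
            virtual ~IHittable() = default;
        };

        struct Sphere final : IHittable {
            bool intersectRay(const Ray&) const override { return true; }
        };

        bool anyHit(const std::vector<std::unique_ptr<IHittable>>& objects, const Ray& r) {
            for (const auto& obj : objects)
                if (obj->intersectRay(r))  // still an indirect call: the static type here
                    return true;           // is IHittable, so `final` on Sphere can't help
            return false;
        }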

    In order to make this a good test of "final", the pointer type of that container should be constrained to a concrete object type, like Sphere. Of course, this means the scene is limited to spheres.

    The only case where final can make a difference, by devirtualising a call that couldn't otherwise be devirtualised, is when you hold a pointer to that type, and the object it points at was allocated "uncertainly", e.g., by the caller. (If the object was allocated in the same basic block where the method call later occurs, the compiler already knows its runtime type and will devirtualise the call anyway, even without "final".)

    • koyote 11 days ago
      > (perhaps one of the inlining heuristics is "Don't inline a function with more than 100 tokens", and the "final" keyword pushes a couple of functions to 101).

      That definitely is one of the heuristics in MSVC++.

      We have some performance critical code and at one point we noticed a slowdown of around ~4% in a couple of our performance tests. I investigated but the only change to that code base involved fixing up an error message (i.e. no logic difference and not even on the direct code path of the test as it would not hit that error).

      Turns out that:

          int some_func() {
            if (bad)
              throw std::exception("Error");
          
            return some_int;
          }
      
      Inlined just fine, but after adding more text to the exception error message it no longer inlined, causing the slow-down. You could either fix it with __forceinline or by moving the exception to a function call.
      • Maxatar 11 days ago
        Since the inlining is performed in MSVC's backend, as opposed to its frontend, it operates strictly on MSVC's intermediate representation, which lacks information about tokens or the AST, so it's unlikely to be due to tokens.

        std::exception does not take a string in its constructor, so most likely you used std::runtime_error. std::runtime_error has a pretty complex constructor if you pass into it a long string. If it's a small string then there's no issue because it stores its contents in an internal buffer, but if it's a longer string then it has to use a reference counting scheme to allow for its copy constructor to be noexcept.

        That is why you can see different behavior if you use a long string versus a short string. You can also see vastly different codegen with plain std::string as well depending on whether you pass it a short string literal or a long string literal.

        • koyote 11 days ago
          > std::exception does not take a string in its constructor

          You're right, I used it as a short-hand for our internal exception function, forgetting that the std one does not take a string. Our error handling function is a simple static function that takes an std::string and throws a newly constructed object with that string as a field.

          But yes, it could very well have been that the string surpassed the short string optimisation threshold or something similar. I did verify the assembly before and after and the function definitely inlined before and no longer inlined after. Moving the 'throw' (and, importantly, the string literal) into a separate function that was called from the same spot ensured it inlined again and the performance was back to normal.

        • akoboldfrying 11 days ago
          Wow, I had no idea. And I thought I knew about most of C++'s weirdnesses.
    • simonask 10 days ago
      Actually, the compiler can only implicitly devirtualize under very specific circumstances. For example, it cannot devirtualize if there was previously a non-inlined call through the same pointer.

      The reason is placement new. It is legal (given that certain invariants are upheld) in C++ to say `new(this) DerivedClass`, and compilers must assume that each method could potentially have done this, changing the vtable pointer of the object.

      The `final` keyword somewhat counteracts this, but even GCC still only opportunistically honors it - i.e. it inserts a check if the vtable is the expected value before calling the devirtualized function, falling back on the indirect call.

      • akoboldfrying 10 days ago
        Fascinating, though a little sad. Are there any important kinds of behaviour that can only be implemented via this `new(this) DerivedClass` chicanery? Because if not, it seems a shame to make the optimiser pay such a heavy price just to support it.
        • simonask 9 days ago
          Presumably there is some arcane trick that somebody will argue is only implementable in this way, but I would personally never let such code through review.
  • ein0p 11 days ago
    You should use final to express design intent. In fact I’d rather it were the default in C++, and there was some sort of an opposite (‘derivable’?) keyword instead, but that ship has sailed long time ago. Any measurable negative perf impact should be filed as a bug and fixed.
    • leni536 11 days ago
      C++ doesn't have the fragile base class problem, as members aren't virtual by default. The only concern with unintended inheritance is with polymorphic deletion. "final" on a class definition disables some tricks that you can do with private inheritance.

      Having said that "final" on member functions is great, and I like to see that instead of "override".
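
      e.g. (a minimal sketch):

          struct Base {
              virtual void step() {}
          };

          struct Derived : Base {
              // Overrides Base::step() and forbids any further overriding below Derived.
              // Since `final` is only valid on a virtual function, it also catches the
              // "doesn't actually override anything" mistake, much like `override` does.
              void step() final {}
          };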

      • pjmlp 10 days ago
        All OOP languages have it; the issue is related to changing the behaviour of the base class, and the change introducing unforeseen consequences on the inheritance tree.

        Changing how an existing method is called (regular, virtual, static), changing visibility, overloading, introducing a name that clashes downstream, introducing a virtual destructor, making a data member non-copyable, ...

        • leni536 10 days ago
          > All OOP languages have it; the issue is related to changing the behaviour of the base class, and the change introducing unforeseen consequences on the inheritance tree.

          C++ largely solves it by having tight encapsulation. As long as you don't change anything that breaks your existing interface, you should be good. And your interface is opt-in, including public members and virtual functions.

          • pjmlp 10 days ago
            Not when you change the contents of the class itself for public and protected inheritance members, which is exactly the whole issue of fragile base class.

            It doesn't go away just because private members exist as possible language feature.

            • leni536 10 days ago
              That's not a fragile base, that's just a fragile class. You can break APIs for all kinds of users, including derived classes.

              Some APIs are aimed towards derived classes, like protected members and virtual functions, but that doesn't make the issue fundamentally different. It's just breaking APIs.

              Point is, in C++ you have to opt-in to make these API surfaces, they are not the default.

              • pjmlp 10 days ago
                I give up, word games to avoid acknowledging the same happens.
      • jstimpfle 10 days ago
        Now try a regular function, you will be blown away. No need to type "final"...
    • josefx 11 days ago
      Intent is nice and all that, but I would like a "nonwithstanding" keyword instead that just lets me bypass that kind of "intent" without having to copy paste the entire implementation just to remove a pointless keyword or make a destructor public when I need it.
    • cesarb 11 days ago
      > In fact I’d rather it were the default in C++, and there was some sort of an opposite (‘derivable’?) keyword instead

      Kotlin (which uses the equivalent of the Java "final" keyword by default) uses the "open" keyword for that purpose.

    • jbverschoor 11 days ago
      In general, I think things should be strict by default. Way easier to optimize and less error prone.
  • ndesaulniers 11 days ago
    As an LLVM developer, I really wish the author filed a bug report and waited for some analysis BEFORE publishing an article (that may never get amended) that recommends not using this keyword with clang for performance reasons. I suspect there's just a bug in clang.
    • fransje26 10 days ago
      Is there any logical reason why Clang is 50% slower than GCC on Ubuntu?
      • ijan1 8 days ago
        Hi, I had a look today because it was bugging me and no one had investigated it, so I created an issue[1] with an explanation.

        But it basically boils down to uniform_real_distribution having a bunch of uninlined calls to 'logl' when compiled with Clang.

        Otherwise, Clang beats GCC at least on the configuration I tested.

        (I am the author of the issue)

        [1] https://gitlab.com/define-private-public/PSRayTracing/-/issu...

        • fransje26 8 days ago
          Oh, very nice find!

          Coincidentally, I happened to be playing around yesterday with a small performance test case using uniform_real_distribution, and for some strange reason Clang was 6x slower than GCC.

          I put it down to some weird clang bug on my LTS version of Ubuntu. As my installed version was clang-14, I decided it possibly had been noticed and fixed a long time ago.

          After reading your message I replaced uniform_real_distribution by uniform_int_distribution, and lo and behold, Clang was indeed faster than GCC, as expected.

          Thank you for coming back to me with your findings.

    • saagarjha 10 days ago
      Bug, misunderstanding, weird edge case…
  • mastax 11 days ago
    Changes in the layout of the binary can have large impacts on the program performance [0] so it's possible that the unexpected performance decrease is caused by unpredictable changes in the layout of the binary between compilations. I think there is some tool which helps ensure layout is consistent for benchmarking, but I can't remember what it's called.

    [0]: https://research.facebook.com/publications/bolt-a-practical-...

  • jeffbee 11 days ago
    I profiled this project and there are abundant opportunities for devirtualization. The virtual interface `IHittable` is the hot one. However, the WITH_FINAL define is not sufficient, because the hot call is still virtual. At `hit_object |= _objects[node->object_index()]->hit` I am still seeing `mov (%rdi),%rax; call *0x18(%rax)`, so the application of final here was not sufficient to do the job. Whatever differences are being measured are caused by bogons.
    • akoboldfrying 11 days ago
      An interface, like IHittable, can't possibly be made final since its whole purpose is to enable multiple different concrete subclasses that implement it.

      As you say, that's the hot one -- and making the concrete subclasses themselves "final" enables no devirtualisations since there are no opportunities for it.

    • gpderetta 11 days ago
      I haven't looked at the code, but if you have multiple leaves, even marking all of them as final won't help if the call is through a base class.
      • jeffbee 11 days ago
        Yeah the practical cases for devirtualization are when you have a base class, a derived class that you actually use, and another derived class that you use in tests. For your release binary the tests aren't visible so that can all be devirtualized.

        In cases where you have Dog and Goose that both derive from Animal and then you have std::vector<Animal>, what is the compiler supposed to do?

        • kccqzy 10 days ago
          The compiler simply knows that the actual dynamic type is Animal because it is not a pointer. You need Animal* to trigger all the fun virtual dispatch stuff.
          • froh 10 days ago
            I intuit vector<Animal*> is what was meant...
            • jeffbee 10 days ago
              Yes. I reflexively avoid asterisks on this site because they can hose your formatting.
  • bluGill 11 days ago
    I use final more for communication: don't look for deeper derived classes, as there are none. That it results in slower code is an annoying surprise.
  • leni536 11 days ago
    This is the gist of the difference in code generation when final is involved:

    https://godbolt.org/z/7xKj6qTcj

    edit: And a case involving inlining:

    https://godbolt.org/z/E9qrb3hKM

  • alex_smart 10 days ago
    One thing I wish the article had mentioned is the size of the compiled binary with and without final. The only reason I would expect the final version to be slower is that we are emitting more code because of inlining, and that is resulting in a larger portion of instruction cache misses.

    Also, now that I think of it, they should have run the code under perf and compared the stats.

    • account42 10 days ago
      Yeah, really unsatisfying that there was no attempt to explain why it might be slower, since it just gives the compiler more information to decide on optimizations, which in theory should only make things faster.
  • magnat 11 days ago
    > I created a "large test suite" to be more intensive. On my dev machine it needed to run for 8 hours.

    During such long and compute-intensive tests, how are thermal considerations mitigated? Not saying that this was case here, but I can see how after saturating all cores for 8 hours, the whole PC might get hot to the point CPU starts throttling, so when you reboot to next OS or start another batch, overall performance could be a bit lower.

    • lastgeniusua 11 days ago
      Having recently done similar day-and-night-long suites of benchmarks (on a laptop, in heat dissipation conditions worse than on any decent desktop), I've found that there is no correlation between the order the benchmarks are run in and their performance (or energy consumption!). I would therefore assume that a non-overclocked processor would not exhibit the patterns you are thinking of here.
    • Bene592 9 days ago
      8 hours should be enough to let the temperatures settle
  • JackYoustra 11 days ago
    I really wish he'd listed all the flags he used. To add on to the flags already listed by some other commenters, `-mcpu` and related flags are really crucial in these microbenchmarks: over such a small change and such a small set of tight loops, you could just be seeing regressions from coincidences in the microarchitectural scheduling rather than from higher-level assumptions.
    • j_not_j 11 days ago
      And he didn't repeat each test case 5 or 9 times, and take the median (or even an average).

      There will be operating system noise that can be in the multi-percent range. This is defined as various OS services that run "in the background" taking up cpu time, emptying cache lines (which may be most important), and flushing a few translate lookaside entries.

      Once you recognize the variability from run to run, claiming "1%" becomes less credible. Depending on the noise level, of course.

      Linux benchmarks like SPECcpu tend to be run in "single-user mode" meaning almost no background processes are running.

  • gpderetta 11 days ago
    1% is nothing to scoff at. But I suspect that the variability of compilation (specifically quirks of instruction selection, register allocation and function alignment) more than masks any gains.

    The clang regression might be explainable by final allowing some additional inlining and clang making a hash of it.

  • fransje26 11 days ago
    I'm actually more worried about Clang being close to 100% slower than GCC on Linux. That doesn't seem right.

    I am prepared to believe that there is some performance difference between the two, varying per case, but I would expect a few percent difference, not twice the run time.

  • lanza 11 days ago
    If you're measuring a compiler you need to post the flags and version used. Otherwise the entire experiment is in the noise.
  • sfink 11 days ago
    tldr: sprinkled a keyword around in the hopes that it "does something" to speed things up, tested it, got noisy results but no miraculous speedup.

    I started skimming this article after a while, because it seemed to be going into the weeds of performance comparison without ever backing up to look at what the change might be doing. Which meant that I couldn't tell if I was going to be looking at the usual random noise of performance testing or something real.

    For `final`, I'd want to at least see whether it's changing the generated code by replacing indirect vtable calls with direct or inlined calls. It might be that the compiler is already figuring it out and the keyword isn't doing anything. It might be that the compiler is changing code, but the target address was already well-predicted and it's perturbing code layout enough that it gets slower (or faster). There could be something interesting here, but I can't tell without at least a little assembly output (or perhaps a relevant portion of some intermediate representation, not that I would know which one to look at).

    If it's not changing anything, then perhaps there could be an interesting investigation into the variance of performance testing in this scenario. If it's changing something, then there could be an interesting investigation into when that makes things faster vs slower. As it is, I can't tell what I should be looking for.

    • akoboldfrying 11 days ago
      >changing the generated code by replacing indirect vtable calls with direct or inlined calls

      It can't possibly be doing this, if the raytracing code is like any other raytracer I've ever seen: it must be looping through a list of concrete objects that implement some shared interface, calling intersectRay() on each one. The existence of those derived concrete types means the shared interface itself can't be made final, and that's the only thing that would enable devirtualisation here -- it makes no difference whether the concrete derived types themselves are final or not.
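
      To make that concrete, here is a minimal sketch (Shape, Sphere and anyHit are illustrative names, not taken from the article) of the pattern in question -- the call site only sees the base interface, so marking the derived type final doesn't devirtualise the hot call:

        #include <memory>
        #include <vector>

        struct Ray {};

        struct Shape {
            virtual bool intersectRay(const Ray&) const = 0;  // can't be final: derived types override it
            virtual ~Shape() = default;
        };

        struct Sphere final : Shape {
            bool intersectRay(const Ray&) const override { return false; }
        };

        bool anyHit(const std::vector<std::unique_ptr<Shape>>& scene, const Ray& r) {
            for (const auto& obj : scene)
                if (obj->intersectRay(r))  // static type is Shape, so this stays an indirect call
                    return true;
            return false;
        }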

    • sgerenser 11 days ago
      This is what I was waiting for too. Especially with the large regression on Clang/Ubuntu. Maybe he uncovered a Clang/LLVM codegen bug, but you’d need to compare the generated assembly to know.
    • drivebycomment 10 days ago
      +1. On modern hardware and software systems, performance is effectively stochastic to some degree: small random perturbations to the input (code, data, environment, etc.) can have arbitrary effects on performance. This means you can't draw a direct causal chain from what you changed to the performance change -- when it matters, you need deeper analysis and investigation to find the actual, full causal mechanism. In other words, correlation is not causation, and that is especially true on modern hardware and software systems.
  • pklausler 10 days ago
    Mildly related programming language trivia:

    Fortran has virtual functions ("type bound procedures"), and supports a NON_OVERRIDABLE attribute on them that is basically "final". (FINAL exists in Fortran too, but means something else.) But it also has a means for localizing the non-overridable property.

    If a type bound procedure is declared in a module, and is PRIVATE, then overrides in subtypes ("extended derived types") work as usual for subtypes in the same module, but can't be affected by overrides that appear in other modules. This allows a compiler to notice when a type has no subtypes in the same module, and basically infer that it is non-overridable locally, and thus resolve calls at compilation time.

    Or it would, if compilers implemented this feature correctly. It's not well described in the standard, and only half of the Fortran compilers in the wild actually support it. So like too many things in the Fortran world, it might be useful, but it's not portable.

  • jeffbee 11 days ago
    It's difficult to discuss this stuff because the impact can be negligible or negative for one person, but large and consistently positive for another. You can only usefully discuss it on a given baseline, and for something like final I would hope that baseline would be a project that already enjoys PGO, LTO, and BOLT.
  • pcvarmint 9 days ago
    Each of the test cases measured needs to be run at least 3 times in a row, to warm caches (not just CPU but OS too) and to detect and remove noise.

    In fact, I would run the same test repeatedly, keeping track of the k fastest times (k being ~3-7), and only stopping when the first and the kth fastest times are within a certain tolerance (as low as 1%). This ensures repeatability.
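
    A minimal sketch of that stopping rule (the timed function, k and the tolerance are placeholders):

      #include <algorithm>
      #include <chrono>
      #include <vector>

      // Re-run `bench` until the k fastest times agree to within `tol`,
      // then report the fastest one.
      template <typename F>
      double stable_min_seconds(F bench, int k = 5, double tol = 0.01, int max_runs = 200) {
          std::vector<double> fastest;  // kept sorted ascending, at most k entries
          for (int run = 0; run < max_runs; ++run) {
              auto t0 = std::chrono::steady_clock::now();
              bench();
              auto t1 = std::chrono::steady_clock::now();
              double s = std::chrono::duration<double>(t1 - t0).count();
              fastest.insert(std::lower_bound(fastest.begin(), fastest.end(), s), s);
              if (static_cast<int>(fastest.size()) > k) fastest.pop_back();
              if (static_cast<int>(fastest.size()) == k &&
                  fastest.back() - fastest.front() <= tol * fastest.front())
                  break;  // 1st and kth fastest within tolerance: the measurement is repeatable
          }
          return fastest.front();
      }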

    One sample of performance data for each test is not enough. This study provides no new insights.

    Performance analyst

  • chris_wot 10 days ago
    Surely "final" is a conceptual thing... in other words, you don't want anyone else to derive from the class for good reasons. It's for conceptual understanding, surely?
  • MathMonkeyMan 10 days ago
    I think it was Chandler Carruth who said "If you're not measuring, then you don't care about performance." I agree, and by that measure, nobody I've ever worked with cares about performance.

    The best I'll see is somebody who cooked up a naive microbenchmark to show that style 1 takes fewer wall nanoseconds than style 2 on his laptop.

    People I've worked with don't use profilers, claiming that they can't trust it. Really they just can't be bothered to run it and interpret the output.

    The truth is, most of us don't write C++ because of performance; we write C++ because that's the language the code is written in.

    The performance gained by different C++ techniques seldom matters, and when it does you have to measure. Profiler reports almost always surprise me the first few times -- your mental model of what's going on and what matters is probably wrong.

    • scottLobster 10 days ago
      It matters to some degree. If it's just a simple technique you can file away and repeat as muscle memory, well that means your code is that much better.

      From a user perspective it could be the difference between software that's pleasant to use and software that's annoying to use. From a philosophical perspective it's the difference between software that functions vs software that works well.

      Of course it depends on your context as to whether this is valued, but I wouldn't dismiss it. One person's micro-optimization is another person's polish.

  • jcalvinowens 11 days ago
    That's interesting. Maybe final enabled more inlining, and clang is being too aggressive about it for the icache sizes in play here? I'd love to see a comparison of the generated code.

    I'm disappointed the author's conclusion is "don't use final", not "something is wrong with clang".

    • ot 11 days ago
      Or "something is wrong with my benchmark setup", which is also a possibility :)

      Without a comparison of generated code, it could be anything.

  • indigoabstract 11 days ago
    If it does have a noticeable impact, that would be surprising, a bit like going back to the days when 'inline' was supposed to tell the compiler to inline the designated functions (no longer its main use case nowadays).
  • account42 10 days ago
    I'm amused at the AI advert spam in the comments here that can't even be bothered to make the spam even vaguely normal looking comments.
  • pineapple_sauce 11 days ago
    What should be evaluated instead is removing indirection and tightly packing your data; I'm sure you'd gain a bigger performance improvement that way. Virtual calls and shared_ptr are littered throughout the codebase.

    That way you also avoid the need for the `final` keyword entirely while still getting the optimization it enables (devirtualized calls) -- see the sketch below.
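
    A rough sketch of that data-oriented layout (the type names are illustrative, not taken from the article's codebase):

      #include <vector>

      struct Ray {};
      struct Sphere   { bool hit(const Ray&) const { return false; } };
      struct Triangle { bool hit(const Ray&) const { return false; } };

      // One tightly packed vector per concrete type: contiguous data, no
      // shared_ptr indirection, and every call is direct and inlinable.
      struct Scene {
          std::vector<Sphere>   spheres;
          std::vector<Triangle> triangles;

          bool intersects(const Ray& r) const {
              for (const auto& s : spheres)   if (s.hit(r)) return true;
              for (const auto& t : triangles) if (t.hit(r)) return true;
              return false;
          }
      };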

    >Yes, it is very hacky and I am disgusted by this myself. I would never do this in an actual product

    Why? What's with the C++ community and their disgust for macros without any underlying reasoning? It reminds me of everyone blindly saying "Don't use goto; it creates spaghetti code".

    Sure, if macros are overused, code can become hard to read and maintain. But for something simple like this, you shouldn't be thinking "I would never do this in an actual product".

    • sfink 11 days ago
      Macros that are giving you some value can be ok. In this case, once the performance conclusion is reached, the only reason to continue using a macro is if you really need the `final`ity to vary between builds. Otherwise, just delete it or use the actual keyword.

      (But I'm worse than the author; if I'm just comparing performance, I'd probably put `final` everywhere applicable and then do separate compiles with `-Dfinal=` and `-Dfinal=final`... I'd be making the assumption that it's something I either always or never want eventually, though.)

    • jandrewrogers 11 days ago
      In modern C++, macros are viewed as a code smell because they are strictly worse than the alternatives in almost all situations. It is a cultural norm, a bit like using "unsafe" in Rust when it is not strictly required. The C++ language has made a concerted effort since C++11 to eliminate virtually all use cases for macros and replace them with type-safe, first-class features. It is a bit of a legacy thing at this point; there are large modern C++ codebases with no macros at all, not even for things like logging. While macros aren't going away, especially in older code, the cultural norm in modern C++ has tended toward treating macros as a legacy foot-gun, best avoided if at all possible.

      The main remaining use case for the old C macro facility I still see in new code is to support conditional compilation of architecture-specific code e.g. ARM vs x86 assembly routines or intrinsics.

      • sgerenser 11 days ago
        But how would one conditionally enable or disable the “final” keyword on class members without a preprocessor macro, even in C++23?
        • jandrewrogers 11 days ago
          Macros are still useful for conditional compilation, as in this case. They've been sunsetted for anything that looks like code generation, which this isn't. I was more commenting on the reflexive "ick" reaction of the author to the use of macros (even when appropriate) because avoiding them has become so engrained in C++ culture. I'm a macro minimalist but I would use them here.
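
          For the record, the conditional-compilation use being discussed is a one-liner; a minimal sketch (MAYBE_FINAL and USE_FINAL are illustrative names, not from the article):

            // Toggle `final` per build configuration, e.g. with -DUSE_FINAL.
            #ifdef USE_FINAL
            #  define MAYBE_FINAL final
            #else
            #  define MAYBE_FINAL
            #endif

            struct Shape  { virtual ~Shape() = default; };
            struct Sphere MAYBE_FINAL : Shape { /* ... */ };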

          Many people have a similar reaction to the use of "goto", even though it is absolutely the right choice in some contexts.

    • bluGill 11 days ago
      Macros in C are a text replacement, so it is hard to see from a debugger how the code got like that.
      • pineapple_sauce 11 days ago
        Yes, I'm well aware of the definition of a macro in C and C++. Macros are simpler than templates. You can expand them with a compiler flag.
        • bluGill 11 days ago
          When things get complex, template error messages are easier to follow. Nobody writes truly complex macros, but if you tried, it would be worse. (Template error messages are legendary for a reason; nested macros are worse.)
  • p0w3n3d 11 days ago
    I would say the biggest performance impact comes from `constexpr`, followed by `const`. I wouldn't bet any money on `final`, which in C++ is a guard against inheritance; a virtual function's invocation address is resolved through the vtable, hence final wouldn't change anything. Maybe the author confused it with the `final` keyword in Java.
    • adrianN 11 days ago
      In my experience the compiler is pretty good at figuring out what is constant so adding const is more documentation for humans, especially in C++, where const is more of a hint than a hard boundary. Devirtualization, as can happen when you add a final, or the optimizations enabled by adding a restrict to a pointer, are on the other hand often essential for performance in hot code.
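
      For the `restrict` point, a small sketch using the GCC/Clang `__restrict__` extension (standard C++ has no `restrict`):

        // Promising that dst and src never alias lets the compiler
        // vectorize the loop without runtime overlap checks.
        void scale(float* __restrict__ dst, const float* __restrict__ src, int n) {
            for (int i = 0; i < n; ++i)
                dst[i] = 2.0f * src[i];
        }
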
      • bayindirh 11 days ago
        Since "const" makes things read-only, being const correct makes sure that you don't do funny things with the data you shouldn't mutate, which in turn eliminates tons of data bugs out of the gate.

        So, it's an opt-in security feature first, and a compiler hint second.

        • Lockal 10 days ago
          How does const affect code generation in C/C++? Last time I checked, const was purely informational. Compilers can't eliminate reads through pointers to const data, because const_cast exists. Compilers can't eliminate repeated calls to const member functions, because such functions can still legally modify mutable members (and have other side effects).

          What actually may help is __attribute__((pure)) and __attribute__((const)), but I don't see them often in real code (unfortunately).
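
          For reference, a small sketch of those attributes (GCC/Clang extensions; the function names are made up):

            // `const`: the result depends only on the arguments; no memory reads or writes.
            __attribute__((const)) int square(int x) { return x * x; }

            // `pure`: may read memory but has no side effects, so repeated calls
            // with the same arguments can be merged by the compiler.
            __attribute__((pure)) int count_nonzero(const int* v, int n) {
                int c = 0;
                for (int i = 0; i < n; ++i) c += (v[i] != 0);
                return c;
            }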

          • account42 10 days ago
            Const affects code generation when used on variables. If you have a `const int i` then the compiler can assume that i never changes.

            But you're right that this does not hold true for const pointers or references.

            > What actually may help is __attribute__((pure)) and __attribute__((const)), but I don't see them often in real code (unfortunately).

            It's disappointing that these haven't been standardized. I'd prefer different semantics though, e.g. something that allows memoization or other forms of caching that are technically side effects but where you are still OK with the compiler removing, reordering, or eliminating calls.

            • adrianN 9 days ago
              Do you have an example where a const on a variable changes codegen? I would be surprised if the compiler couldn't figure out variable constness itself.
              • account42 9 days ago
                Sure, in the following example the compiler is able to propagate the constant to the return statement with const in f1 but needs to load it back from the stack without const in f0:

                https://godbolt.org/z/6ebrbaM7b
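
                A hypothetical reconstruction of that f0/f1 shape (the actual snippet behind the link may differ in details):

                  void observe(const int&);  // defined in another TU; opaque to the compiler

                  int f0() {
                      int i = 42;        // non-const object: observe() could legally
                      observe(i);        // cast away const and modify it
                      return i;          // so i must be reloaded from the stack
                  }

                  int f1() {
                      const int i = 42;  // const object: any modification would be UB
                      observe(i);
                      return i;          // so the compiler may fold this into `return 42;`
                  }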

                In general, whenever you call a function that the compiler cannot inspect (because it is in another TU) and the compiler cannot prove that that function doesn't have any reference to your variable it has to assume that the function might change your variable. Only passing a const reference won't help you here because it is legal to cast away constness and modify the variable unless the original variable was const.

                I wish that const meant something on references or pointers, and that you had to do something more explicit, like a mutable member, to allow modifying a variable. But even that would not help if the compiler can't prove that a non-const pointer hasn't escaped somehow. You could add __attribute__((pure)) to the function to help the compiler, but that is a lot stricter, so it can't always be used.

              • bayindirh 9 days ago
                If you modify a const variable, the compiler will error out and refuse to compile.
            • bayindirh 9 days ago
              > If you have a `const int i` then the compiler can assume that i never changes.

              Plus, you can’t even compile your code if you try to modify a const variable.

              • account42 9 days ago
                This isn't guaranteed: modifying a const object after casting away constness is undefined behaviour, and the compiler is not required to diagnose it -- and generally it can't, because it doesn't know what your reference/pointer points to.
      • lelanthran 10 days ago
        > In my experience the compiler is pretty good at figuring out what is constant so adding const is more documentation for humans,

        In the same TU, sure. But across TU boundaries the compiler really can't figure out what should be const and what shouldn't, so `const` on parameters or return values lets the compiler tell the human "You are attempting to modify a value that some other TU put into read-only memory", or issue similar diagnostics.

    • account42 10 days ago
      > followed by `const`

      Const can only ever have a performance impact when used directly on variables. Const pointers/references are purely for the benefit of the programmer -- the compiler can assume nothing, because the variable could be modified elsewhere or through another pointer/reference, and const_cast is legal anyway unless the original variable was const.

  • teeuwen 10 days ago
    I do not see how the final keyword would make a difference in performance at all in this case. The compiler should be able to build an inheritance tree and determine by itself which classes are to be treated as final.

    Now for libraries, this is a different story. There I can imagine final keyword could have an impact.

    • connicpu 10 days ago
      But dynamically loaded libraries exist. So even if LTO tells the compiler that the class is the most derived one in all of the statically linked code, it won't be able to devirtualize the calls unless it can also see the instantiation site -- or the class is marked final.
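
      A minimal sketch of the case where `final` alone is enough (the names are illustrative):

        struct Shape {
            virtual double area() const = 0;
            virtual ~Shape() = default;
        };

        struct Circle final : Shape {
            double r = 1.0;
            double area() const override { return 3.14159265 * r * r; }
        };

        // The static type is a final class, so even if `c` was created inside a
        // dynamically loaded library the compiler may emit a direct, inlinable call.
        double area_of(const Circle& c) { return c.area(); }
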
    • pjmlp 10 days ago
      Only if the complete source code is available to the compiler.
  • juliangmp 10 days ago
    >Personally, I'm not turning it on. And would in fact, avoid using it. It doesn't seem consistent.

    I feel like we'd have to repeat these tests quite a few times to reach a solid conclusion. Hell, small variations in performance can be caused by all sorts of things outside the actual program.

    • kreetx 10 days ago
      AFAIU, these tests were run 30 times each, and apparently some took minutes to run, so it's unlikely that you'd reach a different conclusion.
  • jey 11 days ago
    I wonder if LTO was turned on when using Clang? Might lead to a performance improvement.
  • AtNightWeCode 10 days ago
    Most benchmarks are wrong, and I doubt this one is correct. I do think final should have been the default in the language, though.

    There are tons of these suggestions. Like always using sealed in C# or never use private in Java.

  • headline 10 days ago
    re: final macro

    > I would never do this in an actual product

    what, why?

  • kasajian 10 days ago
    I'm surprised by this article. The author genuinely believes that a language construct meant to benefit performance was added to the language without anyone ever running any metrics to verify it -- "just trust me bro", as he puts it.

    It's an insane level of ignorance about how these things are decided by the standards committee.

    • kreetx 10 days ago
      And yet the results from current compilers are mixed and, in summary, do not make programs faster.
  • manlobster 10 days ago
    This seems like a reasonable use of the preprocessor to me. I've seen similar use in high-quality codebases. I wonder why the author is so disgusted by it.
  • LorenDB 11 days ago
    Man, I wish this blog had an RSS feed.
  • kookamamie 10 days ago
    > And probably, that reason is performance.

    That's the first problem I see with the article. C++ isn't a fast language, as it is. There are far too many issues with e.g. aliasing rules, lack of proper vectorization (for the runtime arch), etc.

    If you wish to get relatively good performance from your code, try ISPC, which lets you get great performance with vectorization up to AVX-512 without turning to intrinsics.

    • chipdart 10 days ago
      > That's the first problem I see with the article. C++ isn't a fast language, as it is. There are far too many issues with e.g. aliasing rules, lack of proper vectorization (for the runtime arch), etc.

      That's a bold statement due to the way it heavily contrasts with reality.

      C++ is ever present in high performance benchmarks as either the highest performing language or second only to C. It's weird seeing someone claim with a straight face that "C++ isn't a fast language, as it is".

      To make matters worse, you go on confusing what a programming language is, and confusing implementation details with language features. It's like claiming that C++ isn't a language for computational graphics just because no C++ standard dedicates a chapter to it.

      Just like in every engineering domain, you need deep knowledge of the details to milk the last drop of performance out of a program. Low-latency C++ is a testament to how the smallest details can be critical to performance. But you need to be completely detached from reality to claim that C++ isn't a fast language.

      • kookamamie 10 days ago
        > That's a bold statement due to the way it heavily contrasts with reality.

        I'm ready to back this up. And no, I'm not confusing things - I work in HPC (realtime computer vision) and in reality the only thing we'd use C++ for is "glue", i.e. binding implementations of the actual algorithms implemented in other languages together.

        Implementations could be e.g. in CUDA, ISPC, neural-inference via TensorRT, etc.

        • jpc0 10 days ago
          "We use extreme vectorisation and can't do it in native C++ therefore the language is slow"

          You a junior or something? For 99% of use cases, C++ autovectorisation does plenty and will outperform the same code written in higher-level languages. You are literally in the 1%, and you're conflating your use case with the general case...

        • chipdart 10 days ago
          I've worked in computer vision and real-time image processing. We use C++ extensively in the field due to its high performance. OpenCV is the tool of the trade, and both iOS and Android support C++ modules for performance reasons.

          But to add to all the nonsense, you claim otherwise.

          Frankly, your comments lack any credibility, which is confirmed by your lame appeal to authority.