What final enables is devirtualization in certain cases. The
main advantage of devirtualization is that it is necessary for inlining.
Inlining has other requirements as well -- LTO pretty much covers it.
The article doesn't have sufficient data to tell whether the testcase is built in such a way that any of these optimizations can happen or is beneficial.
> What final enables is devirtualization in certain cases. The main advantage of devirtualization is that it is necessary for inlining.
I think that enabling inlining is just one of the indirect consequences of devirtualization, and perhaps one that is largely irrelevant for performance improvements.
The whole point of devirtualization is eliminating the need to resort to pointer dereferencing when calling virtual members. The main trait of a virtual class is its use of a vtable, which requires a pointer dereference to reach each and every virtual member.
In classes with larger inheritance chains, you can easily have more than one pointer dereference taking place before you call a virtual member function.
Once a class is final, none of that is required anymore. When a member function is called, no dereferencing takes place.
Devirtualization helps performance because you are able to benefit from inheritance without paying a performance penalty for it. Without the final keyword, a performance-oriented project would need to be architected to not use inheritance at all, or at the very least not in hot-path code, because it sneaks gratuitous pointer dereferences all over the place, which require extra operations and have a negative impact on caching.
The whole purpose of the final keyword is that compilers can easily eliminate all the pointer dereferencing used by virtual members. What stops them from applying this optimization otherwise is that they have no information on whether the class will be inherited, and whether a derived class will override any of its virtual members.
With the introduction of the final keyword, you are now able to tell the compiler "from here on, this is exactly what you get", and the compiler can trim out anything loose.
An extra indirection (indirect call versus direct call) is practically nothing on modern hardware. Branch predictors are insanely good, and this isn't something you generally have to worry about.
Inlining is by far the most impactful optimization here, because it can eliminate the call altogether, and thus specialize the called function to the callsite, lifting constants, hoisting loop variables, etc.
I had a section of code which incurred ~20 clock cycles to make a function call to a virtual function in a critical loop. That's over and above potential delays resulting from cache misses and the need to place multiple parameters on the stack.
I was going to eliminate polymorphism altogether for this object but later figured out how to refactor so that this particular call could be called once a millisecond. Then if more work was needed, it would dispatch a task to a dedicated CPU.
This was a substantial performance improvement which made a significant difference to my P&L.
Could just be inefficient spilling caused by ABI requirements due to the inability to inline.
In general, if you're manipulating values that fit into registers and work on a platform with a shitty ABI, you need to be very careful of what your function call boundaries look like.
The most obvious example is SIMD programming on Windows x86 32-bit.
"is practically nothing on modern hardware" if the data is already present in the L2 cache. Random RAM access that stalls execution is expensive.
My guess is this is why he didn't see any speedup: all the code could fit inside the L2 cache, so he did not have to pay for a RAM access for the dereference.
The number of different classes is what matters, not the number of objects, since they all share the same small set of vtables.
It might be different for large codebases like Chrome and Firefox.
Both the number of objects (dcache) and the number of classes (icache) are significant, as well as the size of both, but yeah. It's pretty rare to have extremely wide class hierarchies, though. You really have to go out of your way to run into significant icache misses.
C++ vtables need 2 levels of indirection. See the asm or decompile it with Ghidra.
First the vtable field, and then the method field.
Of course you have to worry about pointer chasing, when you can easily avoid it. Either via a switch to a single indirection (by passing method pointers around) or inlining with final. Or other compile-time specialization.
Though the branch predictor can chew through both layers of indirection. It can actually start fetching code from the function (and even executing it) before it has even read the function pointer from the vtable.
Though that assumes a correct prediction. But modern branch predictors are really good: they can track and correctly predict hundreds (if not thousands) of indirect calls, taking into account the history of the last few branches (so a predictor can even get an idea of which class is currently executing, and make predictions based on that). They do a really good job of chewing up indirect branches in hot sequences of code.
Virtual functions are probably the most harmful for warm code. We are talking about code that's executed too often to be considered cold code, but not often enough to stick around in the branch predictors' cache, executed only a few hundred times a second. It's a death by a thousand cuts type thing. And that's where devirtualisation will help the most...
As long as you don't go too far with the inlining and start causing icache misses from code bloat. In an ideal world the compiler would inline enough to devirtualise the call, but not necessarily inline the actual function (unless it is small, or only called from one place).
In general it takes a significant amount of nondeterministic pointer chasing to fool modern branch predictors. Decades of research have been put into optimizing the hardware for languages like C++ and Java, both of which exhibit a lot of pointer chasing.
> In classes with larger inheritance chains, you can easily have more than one pointer dereference taking place before you call a virtual member function.
This is not a thing in C++; vtables are flat, not nested. Function pointers are always 1 dereference away.
> Devirtualization helps performance because you are able to benefit from inheritance without paying a performance penalty for it. Without the final keyword, a performance-oriented project would need to be architected to not use inheritance at all, or at the very least not in hot-path code, because it sneaks gratuitous pointer dereferences all over the place, which require extra operations and have a negative impact on caching.
That only applies to virtual functions. Regular old inheritance does not need or benefit from devirtualization. This is why the CRTP exists.
CRTP does not exist for that. CRTP was one of the many happy accidents in template metaprogramming that happened to be discovered when doing recursive templates.
Also, you've missed the whole point. CRTP is a way to rearchitect your code to avoid dereferencing pointers to virtual members in inheritance. The whole point is that with final you do not need to pull tricks: just tell the compiler that you don't want the class to be inherited, and the compiler picks up from there and does everything for you.
If that's your point then it is simply wrong. Final does not allow the compiler to devirtualize calls through a base pointer, it only eliminates the virtualness for calls through pointers to the (final) derived type. The compiler can devirtualize calls through base pointers in others ways (by deducing the possible derived types via whole program optimization or PGO) but final does not help with that.
> If that's your point then it is simply wrong. Final does not allow the compiler to devirtualize calls through a base pointer, it only eliminates the virtualness for calls through pointers to the (final) derived type.
Please read my post. That's not my claim. I think I was very clear.
Jumps/calls are actually pretty cheap with modern branch predictors. Even indirect calls through vtables, which is the opposite of most programmers' intuition.
And if the devirtualisation leads to inlining, that results in code bloat which can lower performance through more instruction cache misses, which are not cheap.
Inlining is actually pretty evil. It almost always speeds things up for microbenchmarks, as such benchmarks easily fit in icache. So programmers and modern compilers often go out of their way to do more inlining. But when you apply too much inlining to a whole program, things start to slow down.
But it's not like inlining is universally bad in larger programs; inlining can enable further optimisations, mostly because it allows constant propagation to travel across function boundaries.
Basically, compilers need better heuristics about when they should be inlining. If it's just saving the overhead of a lightweight call, then they shouldn't be inlining.
No it's not. Except if you __force_inline__ everything, of course.
Inlining reduces the number of instructions in a lot of cases, especially when things are abstracted and factored with lots of indirections into small functions that call other small functions, and so on. Consider an 'isEmpty' function, which dissolves to 1 cpu instruction once inlined, compared with a call/save reg/compare/return. Highly dynamic code (with most functions being virtual) tends to result in a fest of chained calls, jumping into functions doing very little work. Yes, the stack is usually hot and fast, but spending 80% of the instructions doing stack management is still a big waste.
Compilers already have good heuristics about when they should be inlining, chances are they are a lot better at it than you. They don't always inline, and that's not possible anyway.
My experience is that compilers do marvels with inlining decisions when there are lots of small functions they _can_ inline if they want to. It gives the compiler a lot of freedom. Lambdas are great for that as well.
Make sure you make as much compile-time information as possible available to the compiler, factor your code, don't have huge functions, and let the compiler do its magic. As a plus, you can have high-level abstractions, deep hierarchies, and still get excellent performance.
Doesn't the compiler usually do well enough that you really only need to worry about time-critical sections of code? Even then you could go in and look at the assembler and see if it's being inlined, no?
I find the Unreal Engine source to be a reasonable reference for C++ discussions, because it runs just unbelievably well for what it does, and on a huge array of hardware (and software). And it's explicit with inlining, other hints, and even a million things that could be easily called micro-optimizations, to a somewhat absurd degree. So I'd take away two conclusions from this.
The first is that when building a code base you don't necessarily know what it's being compiled with. And so even if there were a super-amazing compiler, there's no guarantee that's what will be compiling your code. Making it explicit, so long as you have a reasonably good idea of what you're doing, is generally just a good idea. It also conveys intent to some degree, especially things like final.
The second is that I think the saying 'premature optimization is the root of all evil' is the root of all evil. Because that mindset has gradually transitioned to being against optimization in general, outside of the most primitive things like not running critical sections in O(N^2) when they could be O(N). And I think it's this mindset that has gradually brought us to where we are today, where we need what would have been a literal supercomputer not that long ago to run a word processor. It's like death by a thousand cuts, and quite ridiculous.
> The second is that I think the saying 'premature optimization is the root of all evil' is the root of all evil.
The greater evil is putting a one-sentence quote out of context:
"""
There is no doubt that the grail of efficiency leads to abuse. Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.
Yet we should not pass up our opportunities in that critical 3%. A good programmer will not be lulled into complacency by such reasoning, he will be wise to look carefully at the critical code; but only after that code has been identified. It is often a mistake to make a priori judgments about what parts of a program are really critical, since the universal experience of programmers who have been using measurement tools has been that their intuitive guesses fail. After working with such tools for seven years, I've become convinced that all compilers written from now on should be designed to provide all programmers with feedback indicating what parts of their programs are costing the most; indeed, this feedback should be supplied automatically unless it has been specifically turned off.
"""
Indeed, but I think even that advice, with context, is pretty debatable. Obviously one should prioritize critical sections, but completely ignoring those "small efficiencies" is certainly a big part of how we got to where we are today in software performance. A 10% jump in performance is huge; whether that comes from a single 10% jump, or a hundred 0.1% jumps - it's exactly the same!
So referencing something in particular from Unreal Engine, they actually created a caching system for converting between a quaternion and a rotator (euler rotation)! Obviously that sort of conversion isn't going to, in a million years, be even close to a bottleneck. That conversion is quite cheap on modern hardware, and so that caching system probably only gives the engine one of those 0.1% boosts in performance. But there are literally thousands of these "small efficiencies" spread all throughout the code. And it yields a final product that runs dramatically better than comparable engines.
I find that gcc and clang are so aggressive about inlining that it's usually more effective to tell them what not to inline.
In a moderately-sized codebase I regularly work on, I use __attribute__((noinline)) nearly ten times as often as __attribute__((always_inline)). And I use __attribute__((cold)) even more than noinline.
So yeah, I can kind of see why someone would say inlining is 'evil', though I think it's more accurate to say that it's just not possible for compilers to figure out these kinds of details without copious hints (like PGO).
+1 on the __attribute__((cold)). Compilers so aggressively optimize based on their heuristics that you spend more time telling them that an apparent optimization opportunity is not actually an optimization.
When writing ultra-robust code that has to survive every vaguely plausible contingency in a graceful way, the code is littered with code paths that only exist for astronomically improbable situations. The branch predictor can figure this out but the compiler frequently cannot without explicit instructions to not pollute the i-cache.
Another for the pro side: inlining can allow for better branch prediction if the different call sites would tend to drive different code paths in the function.
This was true 15 years ago, but not so much today.
The branch predictors actually hash the history of the last few branches taken into the branch prediction query. So the exact same branch within a child function will map to different branch-predictor entries depending on which parent function it was called from, and there is no benefit to inlining.
It also means the branch predictor can learn correlations between branches within a function. Like when branches at the top and bottom of a function share conditions, or have inverted conditions.
It basically never should unless the inliner made a terrible judgement. Devirtualizing in C++ can remove 3 levels of pointer chasing, all of which could be cache misses. Many optimizations in modern compilers require the context of the function to be inlined to make major optimizations, which requires devirtualization. The only downside is I$ pressure, but this is generally not a problem because hot loops are usually tight.
There's a cost to loading more instructions, especially if you have more types of instructions.
The main advantages to inlining are (1) avoiding a jump and other function call overhead, (2) the ability to push down optimizations.
If you execute the "same" code (same instructions, different location) in many places that can cause cache evictions and other slowdowns. It's worse if some minor optimizations were applied by the inlining, so you have more types of instructions to unpack.
The question, roughly, is whether the gains exceed the costs. This can be a bit hard to determine because it can depend on the size of the whole program and other non-local parameters, leading to performance cliffs at various stages of complexity. Microbenchmarks will tend to suggest inlining is better in more cases than it actually is.
Over time you get a feel for which functions should be inlined. E.g., very often you'll have guard clauses or whatnot around a trivial amount of work when the caller is expected to be able to prove the guarded information at compile-time. A function call takes space in the generated assembly too, and if you're only guarding a few instructions it's usually worth forcing an inline (even in places where the compiler's heuristics would choose not to because the guard clauses take up too much space), regardless of the potential cache costs.
If you have something like a `while` loop and that loop's instructions fit neatly in the cache line, then executing the loop can be quite fast even if you have to jump to different code locations to do the internals. However, if you pump more instructions into that loop you can exceed the length of the cache line, which causes you to need more memory loads to do the same work.
It can also create more code. A method that took a `foo(NotFinal& bar)` could be duplicated by the compiler for the specialized cases which would be bad if there's a lot of implementations of `NotFinal` that end up being marshalled into foo. You could end up loading multiple implementations of the same function which may be slower than just keeping the virtual dispatch tables warm.
Practically - it never does. It is always cheaper to perform a direct, possibly inlined, call (devirtualization != inlining) than a virtual one.
Guarded devirtualization is also cheaper than virtual calls, even when it has to do
if (instance is SpecificType st) { st.Call(); }
else { instance.Call(); }
or even chain multiple checks at once (with either regular ifs or emitting a jump table)
This technique is heavily used in various forms by .NET, JVM and JavaScript JIT implementations (other platforms also do that, but these are the major ones)
The first two devirtualize virtual and interface calls (important in Java because all calls default to virtual, important in C# because people like to abuse interfaces and occasionally inheritance; C# delegates are also devirtualized/inlined now). The JS JITs (like V8) perform "inline caching", which is similar: for known object shapes, a property access becomes a shape-identifier comparison plus a direct property read, instead of a keyed lookup, which is way more expensive.
Caution! If you compare across languages like that, not all virtual calls are implemented equally.
A C++ virtual call is just a load from a fixed offset in the vtbl followed by an indirect call. This is fairly cheap, on modern CPUs pretty much the same as a non-virtual non-inlined call.
A Java/C# interface call involves a lot more stuff, because there's no single fixed vtbl offset that's valid for all classes implementing the interface.
Yes, it is true that there is a difference. I'm not sure about JVM implementation details, but the reason the comment says "virtual and interface" calls is to highlight it. Virtual calls in .NET are sufficiently close[0] to virtual calls in C++. Interface calls, however, are coded differently[1].
Also you are correct - virtual calls are not terribly expensive, but they encroach on ever limited* CPU resources like indirect jump and load predictors and, as noted in parent comments, block inlining, which is highly undesirable.
* through great effort of our industry to take back whatever performance wins each generation brings with even more abstractions that fail to improve our productivity
If it's done badly, the same code that runs N times also gets cached N times because it's in N different locations in memory rather than one location that gets jumped to. Modern compilers and schedulers will eliminate a lot of that (but probably not for anything much smaller than a page), but in general there's always a tradeoff.
In general the compiler/linker cannot assume that derived classes won't arrive later through a shared object.
You can tell it "I won't do that" though with additional flags, like Clang's -fwhole-program-vtables, and even then it's not that simple. There was an effort in Clang to better support whole program devirtualization, but I haven't been following what kind of progress has been made:
https://groups.google.com/g/llvm-dev/c/6LfIiAo9g68?pli=1
This optimization option isn't on by default? That sounds like a lot of missed optimization. Most programs aren't going to be loading from shared libraries.
Maybe I can set this option at work. Though it's scary because I'd have to be certain.
The JVM can actually perform this optimization optimistically and can undo it if the assumption is violated at runtime. So Java's 'everything is virtual by default' approach doesn't hurt. Of course, relying on a sufficiently smart JIT comes with its own trade-offs.
Optimization means "make it faster without changing behaviour in ways I don't like". Clang can't generally default that one to on because it doesn't know whether you're going to splice in more code it can't see at runtime.
Lots of code gets slower if it might need to be called from something not currently in the compiler's scope. That's essentially what ABI overhead is. If there isn't already, there should be a compiler flag that says "this is the whole program, have at it" which implies the vtables option.
I think you have answered your own question: If turning on the setting is scary for you in a very localized project at your company, imagine how scary it would be to turn on by default for everybody :-P
If your runtime environment has dynamic linking, then the LTO pass can't always be sure that a subclass won't be introduced later that overrides the method.
This is one of the cases where JIT compiling can shine. You can use a bazillion interfaces to decouple application code, and the JIT will optimize the calls after it found out which implementation is used. This works as long as there is only one or two of them actually active at runtime.
* It is possible with `dlopen()` to load code objects that violate the assumptions made during compilation.
* The presence of runtime configuration mechanisms and application input can make it impossible to anticipate things like the choice of implementations of an interface.
One can always strive to reduce such situations, but it might simply not be necessary if a JIT is present.
At the level that LLVM's LTO operates, no information about classes or objects is left, so LLVM itself can't really devirtualize C++ methods in most cases
I think this is a bug. There's dedicated metadata that's supposed to end up on the indirect call to list the possible targets and when that list of possible targets is this short it should be turning into a switch over concrete targets. Don't have time to dig into the IR now but it might be worth posting to the github llvm issues.
MSVC with LTO and PGO will inline virtual calls in some situations along with a check for the expected vtable, bypassing the inlined code and calling the virtual function normally if it is an unexpected value.
Funny how things work. From working with Julia I've built a good intuition for guessing when functions would be inlined. And yet, I've never heard the word devirtualization until now.
In C++ virtual functions are polymorphic and indirected, with the target not known to the compiler. Devirtualization gives the compiler this information (in this case a final method cannot be overridden and branch to something else).
I don't do much C++, but I have definitely found that engineers will just assert that something is "faster" without any evidence to back that up.
Quick example, I got in an argument with someone a few years ago that claimed in C# that a `switch` was better than an `if(x==1) elseif(x==2)...` because switch was "faster" and rejected my PR. I mentioned that that doesn't appear to be true, we went back and forth until I did a compile-then-decompile of a minimal test with equality-based-ifs, and showed that the compiler actually converts equality-based-ifs to `switch` behind the scenes. The guy accepted my PR after that.
But there's tons of this stuff like this in CS, and I kind of blame professors for a lot of it [1]. A large part of becoming a decent engineer [2] for me was learning to stop trusting what professors taught me in college. Most of what they said was fine, but you can't assume that; what they tell you could be out of date, or simply never correct to begin with, and as far as I can tell you have to always test these things.
It doesn't help that a lot of these "it's faster" arguments are often reductive because they only are faster in extremely minimal tests. Sometimes a microbenchmark will show that something is faster, and there's value in that, but I think it's important that that can also be a small percentage of the total program; compilers are obscenely good at optimizing nowadays, it can be difficult to determine when something will be optimized, and your assertion that something is "faster" might not actually be true in a non-trivial program.
This is why I don't really like doing any kind of major optimizations before the program actually works. I try to keep the program in a reasonable Big-O and I try and minimize network calls cuz of latency, but I don't bother with any kind of micro-optimizations in the first draft. I don't mess with bitwise, I don't concern myself on which version of a particular data structure is a millisecond faster, I don't focus too much on whether I can get away with a smaller sized float, etc. Once I know that the program is correct, then I benchmark to see if any kind of micro-optimizations will actually matter, and often they really don't.
A significant part of it is that what engineers believe was effectively true at one time. They simply haven't revisited those beliefs or verified their relevance in a long time. It isn't a terrible heuristic for life in general to assume that what worked ten years ago will work today. The rate at which the equilibriums shift due to changes in hardware and software environments when designing for system performance is so rapid that you need to make a continuous habit of checking that your understanding of how the world works maps to reality.
I've solved a lot of arguments with godbolt and simple performance tests. Some topics are recurring themes among software engineers e.g.:
- compilers are almost always better at micro-optimizations than you are
- disk I/O is almost never a bottleneck in competent designs
- brute-force sequential scans are often optimal algorithms
- memory is best treated as a block device
- vectorization can offer large performance gains
- etc...
No one is immune to this. I am sometimes surprised at the extent to which assumptions are no longer true when I revisit optimization work I did 10+ years ago.
Most performance these days is architectural, so getting the initial design right often has a bigger impact than micro-optimizations and localized Big-O tweaks. You can always go back and tweak algorithms or codegen later but architecture is permanent.
.NET is a particularly bad case for this because it was a decade of few performance improvements, which caused a certain intuition to develop within the industry, then 6-8 years of significant changes each year (with most wins compressed to the last 4 years or so). Companies moving from .NET Framework 4.6/7/8 to .NET 8 experience a 10x average performance improvement, which naturally comes with rendering a lot of performance know-how obsolete overnight.
(the techniques that used to work were similar to earlier Java versions and overall very dynamic languages with some exceptions, the techniques that still work and now are required today are the same as in C++ or Rust)
.NET 4.6 to .NET 8 is a 10x "average" performance improvement. I find this hard to believe. In what scenarios? I tried to Google for it and found very little hard evidence.
In general purpose scenarios, particularly in codebases which have high amount of abstractions, use ASP.NET Core and EF Core, parse and de/serialize text with the use of JSON, Regex and other options, have network and file IO, and are deployed on many-core hosts/container images.
There are a few articles on msft devblogs that cover from-netframework migration to older versions (Core 3.1, 5/6/7):
The tl;dr is that depending on the codebase, the latency reduction was anywhere from 2x to 6x, varying per percentile, or the RPS was maintained with CPU usage dropping by ~2-6x.
Now, these are codebases of likely above average quality.
If you consider that moving 6 -> 8 yields another up to 15-30% on average through improved and enabled by default DynamicPGO, and if you also consider that the average codebase is of worse quality than whatever msft has, meaning that DPGO-reliant optimizations scale way better, it is not difficult to see the 10x number.
Keep in mind that while a particular regular piece of enterprise code could have improved within the bounds of "poor netfx codegen" -> "not far from LLVM with full LTO and PGO", the bottlenecks have changed significantly. Previously they could have been in lock contention (within GC or user code), object allocation, object memory copying, or, for financial domains, anything including possibly complex Regex queries on imported payment reports (these alone now differ by anywhere between 2x and >1000x[0]), and for pretty much every codebase also in interface/virtual dispatch for layers upon layers of "clean architecture" solutions.
The vast majority of performance improvements (both compiler+GC and CoreLib+frameworks), which is difficult to think about given it spans 8 years, address the above first and foremost. At my previous employer, the migration from NETFX 4.6 to .NET Core 3.1, while also deploying to much more constrained container images compared to beefy Windows Server hosts, reduced the latency of most requests by the same factor of >5x (a certain request type went from 2s to 350ms). It was my first wow moment, and why I decided to stay with .NET rather than move over to Go back then (I was never a fan of Go's syntax anyway, and other issues that Go still has, which have since been fixed in .NET, are not tolerable for me).
All of the 6x performance improvement cases seem to be related to using the .NET-based Kestrel web server instead of the IIS web server, which requires marshalling and interprocess communication. Several of the 2x gains appear to be related to using a different database backend. Claims that regex performance has improved a thousand-fold... seem more troubling than cause for celebration. Were you not precompiling your regexes in the older code? That would be a bug.
Somewhere in there, there might be 30% improvements in .net codegen (it's hard to tell). Profile Guided Optimization (PGO) seems to provide a 35% performance improvement over older versions of .net with PGO disabled. But that's dishonest. PGO was around long before .net Core. And claiming that PGO will provide 10x performance because our code is worse than Microsoft's code insults both our code and our intelligence.
Not sure about the 10×, either, and if true it would involve more than just the JIT changes. But changing ASP.NET to ASP.NET Core at the same time and the web server as well as other libraries may make it plausible. For certain applications moving from .NET Framework to .NET isn't so simple when they have dependencies and those have changed their API significantly. And in that case most of the newer stuff seems to be built with performance in mind. So you gain 30 % from the JIT, 2× from Kestrel, and so on. Perhaps.
With a Roslyn-based compiler at work I saw 20 % perf improvement just by switching from .NET Core 3.1 to .NET 6. No idea how slow .NET Framework was, though. I probably can't target the code to that anymore.
But for regex, even with precompilation, the compiler got a lot better at transforming the regex into an equivalent regex that performs better (automatic atomic grouping to reduce unnecessary backtracking when it's statically known that backtracking won't create more matches, for example), and it also benefits a lot from the various vectorized implementations of IndexOf, etc. Typically with each improvement of one of those core methods for searching stuff in memory there's a corresponding change that uses it in regex.
So where in .NET Framework a regex might walk through a whole string character by character multiple times with backtracking it might be replaced with effectively an EndsWith and LastIndexOfAny call in newer versions.
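To make the rewrite idea concrete, here's a hypothetical C++ sketch (not the actual .NET implementation, and the pattern `.*\.(txt|log)$` is made up for illustration): the backtracking-style scan and the suffix checks answer the same question, but the second never walks the whole string.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical sketch (not the actual .NET rewrite): a pattern like
// `.*\.(txt|log)$` reduces to two suffix comparisons, so no regex
// engine needs to scan the input at all.
bool matches_scan(std::string_view s) {
    // what a naive backtracking engine conceptually does: try every start
    for (std::size_t i = 0; i <= s.size(); ++i) {
        std::string_view rest = s.substr(i);
        if (rest == ".txt" || rest == ".log") return true;
    }
    return false;
}

bool matches_suffix(std::string_view s) {
    // the "EndsWith" rewrite (C++20 has string_view::ends_with built in)
    auto ends = [&](std::string_view suf) {
        return s.size() >= suf.size() && s.substr(s.size() - suf.size()) == suf;
    };
    return ends(".txt") || ends(".log");
}
```

Both functions agree on every input; only the amount of work differs.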
Roslyn hasn't had many changes in terms of optimizations - it compiles C# to IL, so it does very little of that, save for switches and certain newer features like collection literals. You are probably talking about RyuJIT, also called just "the JIT" nowadays :D
(The distinction becomes important for targets serviced by Mono, so to outline the difference Mono is usually named explicitly, while CoreCLR and RyuJIT may not be. It also doesn't help that the JIT - that is, the IL-to-machine-code compiler - also services NativeAOT, so it gets more annoying to stay accurate in a conversation without saying the generic ".NET compiler"; some people refer to it as JIT/ILC.)
No, I meant that we've written a compiler, based on Roslyn, whose runtime for compiling the code has improved by 20 % when switching to .NET 6.
And indeed, on the C# -> IL side there's little that's being actually optimized. Besides collection literals there's also switch statements/expressions over strings, along with certain pattern matching constructs that get improved on that side.
Nope, completely internal and part of how we offer essentially the same product on multiple platforms with minimal integration work. And existing C# → anything compilers are typically too focused on compiling a whole application instead of offering a library with a stable and usable API on the other end, so we had to roll our own.
No. DynamicPGO was first introduced in .NET 6 but was not mature and needed two releases worth of work to become enabled by default. It needs no user input and is similar to what OpenJDK Hotspot has been doing for some time and then a little more. It also is required for major features that were strictly not available previously: guarded devirtualization of virtual and interface calls and delegate inlining.
Also, IIS hosting through Http.sys is still an option that sees separate set of improvements, but that's not relevant in most situations given the move to .NET 8 from Framework usually also involves replacing Windows Server host with a Linux container (though it works perfectly fine on Windows as well).
On Regex: the compiled and now source-generated automata have seen a lot of work in all recent releases; it is night and day compared to what it was before - just read the articles. Previously linear scans against heavy internal data structures (matching by hashset) and heavy transient allocations got replaced with bloom-filter-style SIMD search and other state-of-the-art text search algorithms[0], on the completely opposite end of the performance spectrum.
So when you have compiler improvements multiplied by changes to CoreLib internals multiplied by changes to frameworks built on top - it's achievable with relative ease. .NET Framework, while performing adequately, was still that slow compared to what we got today.
Sure. But static PGO was introduced in .NET Framework 4.7.0. And we're talking about apps in production, so there's no excuse NOT to use static PGO in the .NET Framework 4.7.0 version.
And you have misrepresented the contents of the blogs. The projects discussed in the blogs are typically claiming ~30% improvements (perhaps because they weren't using static PGO in their 4.7.0 incarnation), with two dramatic outliers that seem to be related to migrating from IIS to Kestrel.
It’s a moot point. Almost no one used static PGO, and its feature set was way more limited - it did not have devirtualization, which provides the biggest wins. Though you are welcome to disagree, it won’t change the reality of the impact the .NET 8 release had on real-world code.
It’s also convenient to ignore the rest of the content at the links but it seems you’re more interested in proving your argument so the data I provided doesn’t matter.
Something closer to a "pure codegen/runtime" example perhaps: I have data showing Roslyn (the C# compiler, itself written in C#) speeds up between ~2x and ~3x running on .NET 8 vs .NET 4.7.1. Roslyn is built so that it can run either against full framework or core, so it's largely the same application IL.
> Were you not precompiling your regex's in the older code? That would be a bug.
I never heard of this before. Perl has legendarily fast regexen and I never heard of this feature. Does Java do it? I don't think so, and the regexes are fast enough in my experience. Can you name a language where regexen are precompiled?
Yep, completely agree with you on this. Intuition is often wrong, or at least outdated.
When I'm building stuff I try my best to focus on "correctness", and try to come up with an algorithm/design that will encompass all realistic use cases. If I focus on that, it's relatively easy to go back and convert my `decimal` type to a float64, or even convert an if statement into a switch if it's actually faster.
In my opinion, the only things that really matter are algorithmic complexity and readability. And even algorithmic complexity is usually only an issue at certain scales. Whether or not an 'if' is faster than a 'switch' is the micro of micro optimizations -- you'd better have a good reason to care. The question I would have for you is: was your bunch of ifs more readable than a switch would have been?
Yeah, and it's not like I didn't know how to do the stuff I was doing with a switch, I just don't like switches because I've forgotten to add break statements and had code that appeared correct but actually broke a month down the line. I've also seen other people make the same mistake. ifs, in my opinion at least, are a bit harder to screw up, so I will always prefer them.
But I agree, algorithmic complexity is generally the only thing I focus on, and even then it's almost always a case of "will that actually matter?" If I know that `n` is never going to be more than like `10`, I might not bother trying to optimize an O(n^2) operation.
What I feel often gets ignored in these conversations is latency; people obsess over some "optimization" they learned in college a decade ago, and ignore the 200 HTTP or Redis calls being made ten lines below, despite the fact that the latter will have a substantially higher impact on performance.
> in my opinion at least, are a bit harder to screw up, so I will always prefer them
My experience is the opposite - a sizeable chain of ifs has more that can go wrong precisely because it is more flexible. If I'm looking at a switch, I immediately know, for instance, that none of the tests modifies anything.
Meanwhile, while a missing break can be a brutal error in a language that allows it, it's usually trivial to set up linting to require either an explicit break or a comment indicating fallthrough.
But a switch and an if-else *is* a matter of algorithmic complexity. (Well, at least could be for a naive compiler). A switch could be converted to a constant time jump, but the if-else would be trying each case linearly.
But what if, and stick with me here, a compiler is capable of reading and processing your code, and through simple scalar evolution of the conditionals and phi-reduction it can't tell the difference between a switch statement and a sequence of if statements by the time it finishes its static single-assignment (SSA) analysis phase?
It turns out the algorithmic complexity of a switch statement and the equivalent series of if-statements is identical. The bijective mapping between them is close to the identity function. Does a naive compiler exist that doesn't emit the same instructions for both, at least outside of toy hobby project compilers written by amateurs with no experience?
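A minimal C++ sketch of that equivalence; whether both really compile to the same instructions is compiler- and flag-dependent, so check the disassembly (e.g. on godbolt.org) rather than taking this on faith.

```cpp
#include <cassert>

// Two formally different but semantically identical dispatchers.
// Mainstream compilers at -O2 typically lower both to the same jump
// table or compare chain, since the mapping between them is trivial.
int classify_switch(int x) {
    switch (x) {
        case 1: return 10;
        case 2: return 20;
        case 3: return 30;
        default: return 0;
    }
}

int classify_if(int x) {
    if (x == 1) return 10;
    else if (x == 2) return 20;
    else if (x == 3) return 30;
    else return 0;
}
```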
The issue with if statements (for compiled languages) is not one of "speed" but of correctness.
If statements are unbounded, unconstrained logic constructs, whereas switch statements are type-checkable. The concern about missing break statements here is irrelevant, where your linter/compiler can warn about missing switch cases they can easily warn about non-terminated (non-explicitly marked as fall-through) cases.
For non-compiled languages (where branch prediction is not possible because the code is not even loaded), switch statements also provide a speed-up, i.e. the interpreter can immediately evaluate which branch to execute vs being forced to evaluate intermediate steps (and the conditions of each if statement can produce side effects, e.g. `if (checkAndDo()) { ... } else if (checkAndDoB()) { ... } else if (checkAndDoC()) { ... }`).
Which, of course, is a potential use of if statements that switches cannot use (although side-effects are usually bad, if you listened to your CS profs)... And again a sort of "static analysis" guarantee that switches can provide that if statements cannot.
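A short C++ sketch of that side-effect point; the guard functions here are made up for illustration. Because each condition both tests and mutates, the chain's order is observable, which is exactly what a switch cannot express (and what blocks a compiler from reordering the tests).

```cpp
#include <cassert>

int calls = 0;  // observable side effect shared by the guards

// Hypothetical guards: each one records that it ran before answering.
bool check_a(int x) { ++calls; return x == 1; }
bool check_b(int x) { ++calls; return x == 2; }

// Legal as an if/else chain, inexpressible as a switch: the number of
// guard calls depends on which branch matches first.
int dispatch(int x) {
    if (check_a(x)) return 100;
    else if (check_b(x)) return 200;
    return 0;
}
```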
While I personally find the if statements harder to immediately mentally parse/grok--as I have to prove to myself that they are all using the same variable and are all chained correctly in a way that is visually obvious for the switch statement--I don't find "but what if we use a naive compiler" at all a useful argument to make as, well, we aren't using a naive compiler, and, if we were, there are a ton of other things we are going to be sad about the performance of leading us down a path of re-implementing a number of other optimizations. The goal of the compiler is to shift computational complexity from runtime to compile time, and figuring out whether the switch table or the comparisons are the right approach seems like a legitimate use case (which maybe we have to sometimes disable, but probably only very rarely).
Per my sibling comment, I think the argument is not about speed, but simplicity.
Awkward switch syntax aside, the switch is simpler to reason about. Fundamentally we should strive to keep our code simple to understand and verify, not worry about compiler optimizations (on the first pass).
Right, and there I would say we even agree, per my first sentence; however, I wanted to reply not to you, but to doctor_phil, who was explicitly disagreeing about speed.
That said, the linear test is often faster due to CPU caches, which is why JITs will often convert switches to if/elses.
IMO, switch is clearer in general and potentially faster (at very least the same speed) so it should be preferred when dealing with 3+ if/elseif statements.
Any sufficiently advanced compiler will rewrite those arbitrarily depending on its heuristics. What authors usually forget is that there is defined behavior and a specification which the compiler abides by, but it is otherwise free to produce any codegen that preserves the defined program order. Branch reordering, generating jump tables, and optimizing away or coalescing checks into branchless forms are all very common. When someone says "oh, I write C because it lets you tell the CPU how exactly to execute the code", it is simply a sign that the person has never actually looked at disassembly and has little to no idea how the tool they use works.
A compiler will definitely try this, but it's important to note that if/else blocks tell the compiler "you will run these evaluations in order". Now, if the compiler can detect that the evaluations have no side effects (which, in this simple example with just integer checks, is fairly likely) then yeah, I can see a jump table getting shoved in as an optimization.
However, the moment you add a side effect or something more complicated like a method call, it becomes really hard for the compiler to know whether that sort of optimization is safe to do.
The benefit of the switch statement is that it's already well positioned for the compiler to optimize as it does not have the "you must run these evaluations in order" requirement. It forces you to write code that is fairly compiler friendly.
All that said, probably a waste of time debating :D. Ideally you have profiled your code and the profiler has told you "this is the slow block" before you get to the point of worrying about how to make it faster.
I agree with what you said but in this particular case, it actually was a direct integer equality check, there was zero risk of hitting side effects and that was plainly obvious to me, the checker, and compiler.
And to your original comment, I think the reviewer was wrong to reject the PR over that. Performance has to be measured before you can use it to reject (or create...) a PR. If someone hasn't done that then unless it's something obvious like "You are making a ton of tiny heap allocations in a tight loop" then I think nitpicking these sorts of things is just wrong.
Hard disagree that it's "clearer". I have had to deal with a ton of bugs with people trying to be clever with the `break` logic, or forgetting to put `break` in there at all.
if statements are dumber, and maybe arguably uglier, but I feel like they're also more clear, and people don't try and be clever with them.
C# has both switch expressions like this and also break statements are not optional in traditional switch statements so it actually solves both problems. You can't get too clever with switch statements in C#.
However most languages have pretty permissive switch statements just like C.
Yeah, fair, it's been a while since I've done any C#, so my memory is a bit hazy on the details. I've been burned by C switch statements, so I have a pretty strong distaste for them.
I think using C as your language with which to judge language constructs is hardly fair - one of its main strengths has been as a fairly stable, unchanging code-to-compiler contract, i.e. little to none syntax change or improvements.
So no offense, but I would revisit the wider world of language constructs before claiming that switch statements are "all bad". There are plenty of bad languages or languages with poor implementations of syntax, that do not make the fundamental language construct bad.
I always set -Werror=implicit-fallthrough, among others. That prevents fallthrough unless explicitly annotated. Sadly these will forever remain optional warnings requiring specific compiler flags, since requiring them could break compiling broken legacy code.
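For illustration, a C++17 sketch of the kind of annotated fallthrough those flags accept: the `[[fallthrough]]` attribute is standard, and GCC/Clang's `-Wimplicit-fallthrough` stays quiet only when it (or a recognized comment) is present.

```cpp
#include <cassert>

// With -Werror=implicit-fallthrough, running off the end of `case 2:`
// without the annotation would be a hard error; with it, the intent
// is explicit: level 2 implies everything level 1 grants.
int permission_flags(int level) {
    int flags = 0;
    switch (level) {
        case 2:
            flags |= 2;
            [[fallthrough]];  // intentional fallthrough into case 1
        case 1:
            flags |= 1;
            break;
        default:
            break;
    }
    return flags;
}
```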
Unless the number of "else if" statements somehow grows e.g. linearly with the size of your input, which isn't plausible, the "else if" statements also execute in O(1) time.
This is not entirely true either... Measure. There are many cases where the optimiser will vectorise a certain algorithm but not another... In many cases O(n^2) vectorised may be significantly faster than O(n) or O(n log n) even for very large datasets, depending on your data...
Make your algorithms generic and it won't matter which one you use; if you find that one is slower, swap it for the quicker one. Depending on CPU arch and compiler optimisations, the fastest algorithm may actually change multiple times in a codebase's lifetime even if the usage pattern doesn't change at all.
While you are not wrong, if you have a decent language you will discover all the useful algorithms are already in your standard library and so it isn't a worry. Your code should mostly look like apply this existing algorithm to some new data structure.
I don't disagree with you at all on this. However you may need to combine several to get to an end result. And if that happens a few times in a codebase, well makes sense to factor that into a library.
agreed, especially in cases like this. final is primarily a way to prohibit overriding methods and extending classes, and it indicates to the reader that they should not be doing this. use it when it makes conceptual sense.
that said, c++ is usually a language you use when you care about performance, at least to an extent. it's worth understanding features like nrvo and rewriting functions to allow the compiler to pick the optimization if it doesn't hurt readability too much.
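a rough c++ sketch of the nrvo point (whether the copy is actually elided is up to the compiler; the pattern that makes it possible is a single named return object on every path):

```cpp
#include <cassert>
#include <string>
#include <vector>

// NRVO-friendly shape: one named local, returned on every path, lets
// the compiler construct the result directly in the caller's storage
// and skip the copy/move. Returning different locals from different
// branches typically defeats the optimization.
std::vector<std::string> make_names() {
    std::vector<std::string> names;  // the single named result object
    names.push_back("alice");
    names.push_back("bob");
    return names;                    // same object on every return path
}
```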
Even if one of these constructs is faster it doesn't matter 99% of the time.
Writing well structured readable code is typically far more important than making it twice as fast. And those times can rarely be predicted beforehand, so you should mostly not worry about it until you see real performance problems.
The counter-argument to this is if you are building something that is in the critical path of an application (for example, parsing HTTP in a web server), you need to be performance-minded from the beginning because design decisions lead to design decisions. If you are building something in the critical path of the application, the best thing to do is build it from the ground up measuring the performance of what you have as you go. This way, each time you add something you will see the performance impact and usually there’s a more performant way of doing something that isn’t more obscure. If you do this as you build, early choices become constraints, but because you chose the most performant thing at every stage, the whole process takes you in the direction of a highly-performant implementation.
Why should you care about performance?
I can give you my personal experience: I’ve been working on a Java web/application server for the past 15 years and a typical request (only reading, not writing to the db) would take maybe 4-5 ms to execute. That includes HTTP request parsing, JSON parsing, session validation, method execution, JSON serialization, and HTTP response dispatch. Over the past 9 months I have refactored the entire application for performance and a typical request now takes about 0.25 ms or 250 microseconds. The computer is doing so much less work to accomplish the same tasks, it’s almost silly how much work it was doing before. And the result is the machine can handle 20x more requests in the same amount of time. If it could handle 200 requests per second per core before, now it can handle 4000. That means the need to scale is felt 20x less intensely, which means less complexity around scaling.
High performance means reduced scaling requirements.
Please accept a high five from a fellow "it does so little work it must have sub-millisecond request latency" aficionado (though I must admit I'm guilty of abusing memory caches to achieve this).
But even that sort of depends, right? Hardware is often pretty cheap in comparison to dev time. It really depends on the project, what kind of servers you're using, the nature of the application, etc., but I think a lot of the time it might be cheaper to just pay for 20x the servers than it would be to pay a human to go find a critical path.
I'm not saying you completely throw caution to the wind, I'm just saying that there's a finite amount of human resources and it can really vary how you want to allocate them. Sometimes the better path is to just throw money at the problem.
I think it depends on what you’re building and who’s building it. We’re all benefitting from the fact that the designers of NGINX made performance a priority. We like using things that were designed to be performant. We like high-FPS games. We like fast internet.
I personally don’t like the idea of throwing compute at a slow solution. I like when the extra effort has been put into something. The good feeling I get from interacting with something that is optimal or excellent is an end in itself and one of the things I live for.
Sure, though I've mentioned a few times in this thread now that the thing that bothers me more than CPU optimizations is not taking into account latency, particularly when hitting the network, and I think focusing on that will generally pay higher dividends than trying to optimize for processing.
CPUs are ridiculously fast now, and compilers are really really good now too. I'm not going to say that processing speed is a "solved" problem, but I am going to say that in a lot of performance-related cases the CPU processing is probably not your problem. I will admit that this kind of pokes holes in my previous response, because introducing more machines into the mix will almost certainly increase latency, but I think it more or less holds depending on context.
But I think it really is a matter of nuance, which you hinted at. If I'm making an admin screen that's going to have like a dozen users max, then a slow, crappy solution is probably fine; the requests will be served fast enough to where no one will notice anyway, and you can probably even get away with the cheapest machine/VM. If I'm making an FPS game that has 100,000 concurrent users, then it almost certainly will be beneficial to squeeze out as much performance out of the machine as possible, both CPU and latency-wise.
But as I keep repeating everywhere, you have to measure. You cannot assume that your intuition is going to be right, particularly at-scale.
I absolutely agree that latency is the real thing to optimize for. In my case, I only leave the application to access the db, and my applications tend not to be write-heavy. So in my case latency-per-request == how much work the computer has to do, which is constrained to one core because the overhead of parallelizing any part of the pipeline is greater than the work required. See, in that sense, we’re already close to the performance ceiling for per-request processing because clock speeds aren’t going up. You can’t make the processing of a given request faster by throwing more hardware at it. You can only make it faster by creating less work for the hardware to do.
(Ironically, HN is buckling under load right now, or some other issue.)
It almost certainly would require more than 20x servers because setting up horizontal scaling will have some sort of overhead. Not only that, there is the significant engineering effort to develop and maintain the code to scale.
If your problem can fit on one server, it can massively reduce engineering and infrastructure costs.
I mostly focus on "using stuff that won't break", and yeah "if it actually matters".
For example, much to the annoyance of a lot of people, I don't typically use floating point numbers when I start out. I will use the "decimal" or "money" types of the language, or GMP if I'm using C. When I do that, I can be sure that I won't have to worry about any kind of funky overflow issues or bizarre rounding problems. There might be a performance overhead associated with it, but then I have to ask myself "how often is this actually called?"
If the answer is "a billion times" or "once in every iteration of the event loop" or something, then I will probably eventually go back and figure out if I can use a float or convert it to an integer-based thing, but in a lot of cases the answer is "like ten or twenty times", and at that point I'm not even 100% sure it would be even measurable to change to the "faster" implementations.
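In C++ terms (which has no built-in decimal type), the usual stand-in is integer minor units; a small sketch of why the "slower but safe" default pays off:

```cpp
#include <cassert>
#include <cstdint>

// Money as binary floating point accumulates representation error;
// money as integer cents is exact. The int64_t version is the C++
// analogue of defaulting to a decimal/money type first.
double add_dollars_float(double a, double b) { return a + b; }

std::int64_t add_cents(std::int64_t a, std::int64_t b) { return a + b; }
```

The classic demonstration: `0.1 + 0.2` in binary doubles is not exactly `0.3`, while `10 + 20` cents is exactly `30`.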
What annoys me is that people will act like they really care about speed, do all these annoying micro-optimizations, and then forget that pretty much all of them get wiped out immediately upon hitting the network, since the latency associated with that is obscene.
This attitude is part of the problem. Another part of the problem is having no idea which things actually end up costing performance and how much.
It is why many language ecosystems suffered from performance issues for a really long time even if completely unwarranted.
Is changing ifs to switch or vice versa, as outlined in the post above, a waste of time? Yes, unless you are writing some encoding algorithm or a parser, it will not matter. The compiler will lower trivial statements to the same codegen and it will not impact the resulting performance anyway even if there was difference given a problem the code was solving.
However, there are things that do cost, like interface spam, abusing lambdas to write needlessly complex workflow-style patterns (which are also less readable and worse in 8 out of 10 instances), not caching objects that always have the same value, etc.
These kinds of issues, for example, plagued .NET ecosystem until more recent culture shift where it started to be cool once again to focus on performance. It wasn't being helped by the notion of "well-structured code" being just idiotic "clean architecture" and "GoF patterns" style dogma applied to smallest applications and simplest of business domains.
(it is also the reason why picking slow languages in general is a really bad idea - everything costs more and you have way less leeway for no productivity win - Ruby and Python, and JS with Node.js are less productive to write in than C#/F#, Kotlin/Java or Go(under some conditions))
I mean, that's kind of why I tried to emphasize measuring things yourself instead of depending on tribal knowledge.
There are plenty of cases where even the "slow" implementation is more than fast enough, and there are also plenty of cases where the "correct" solution (from a big-O or intuition perspective) is actually slower than the dumb one. Intuition helps, but you have to measure and/or look at the compiled results if you want correct numbers.
An example that really annoys me is how every whiteboard interview ends up being "interesting ways to use a hashmap", which isn't inherently an issue, but they will usually be so small-scoped that an iterative "array of pairs" might actually be cheaper than paying the up-front cost of hashing and potentially dealing with collisions. Interviews almost always ignore constant factors, and that's fair enough, but in reality constant factors can matter, and we're training future employees to ignore that.
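A small C++ sketch of the "array of pairs" alternative: for a table of ~10 entries, a linear scan is cache-friendly and skips hashing entirely, and often beats a hash map at that size - though, as always, measure.

```cpp
#include <cassert>
#include <string_view>
#include <utility>
#include <vector>

// O(n) lookup over a flat table of pairs. For tiny n the constant
// factors (no hashing, contiguous memory) can make this cheaper than
// an unordered_map despite the "worse" asymptotic complexity.
int lookup(const std::vector<std::pair<std::string_view, int>>& table,
           std::string_view key) {
    for (const auto& [k, v] : table)
        if (k == key) return v;
    return -1;  // sentinel for "not found"
}
```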
I'll say it again: as far as I can tell, you have to measure if you want to know if your result is "faster". "Measuring" might involve memory profilers, or dumb timers, or a mixture of both. Gut instincts are often wrong.
> A large part of becoming a decent engineer [2] for me was learning to stop trusting what professors taught me in college
When I was taught about performance, it was all about benchmarking and profiling. I never needed to trust what my professors taught, because they taught me to dig in and find the truth for myself. This was taught alongside the big-O stuff, with several examples where "fast" algorithms are slower on small inputs.
How do you even get meaningful profiling out of most modern langs? It seems the vast majority of time and calls gets spent inside tiny anonymous functions, GC allocations, and stuff like that.
This is easy in most modern programming languages.
The JVM ecosystem has the IntelliJ IDEA profiler and similar advanced tools (AFAIK).
.NET has VS/Rider/dotnet-trace profilers (they are very detailed) to produce flamegraphs.
Then there are native profilers which can work with any AOT compiled language that produces canonically symbolicated binaries: Rust, C#/F#(AOT mode), Go, Swift, C++, etc.
For example, you can do `samply record ./some_binary`[0] and then explore multi-threaded flamegraph once completed (I use it to profile C#, it's more convenient than dotTrace for preliminary perf work and is usually more than sufficient).
My experience is the complete opposite. You just need to construct a realistic load test for the code and the bottlenecks will stand out (more often than not).
Also, there is a learning curve to grouping and aggregating the data.
I don't use most modern langs! And especially if I'm doing work where performance is critical, I won't kneecap myself by using a language that I can't reasonably profile.
> `if(x==1) elseif(x==2)...` because switch was "faster" and rejected my PR
Yeah, that's never been true. Old compilers would often compile a switch to __slower__ code because they'd tend to always go to a jump table implementation.
A better reason to use the switch is because it's better style in C-like languages. Using an if statement for that sort of thing looks like Python; it makes the code harder to maintain.
And it's better style because it better conveys intent. An if-else chain in C/C++ implies there's something important about the ordering of cases. Though I'd say that for a very small number of cases it's fine.
Yep. "Profiling or it didn't happen." The issue is that it's essentially impossible for even the most neckbeard of us to predict with a high degree of accuracy and precision the performance on modern systems impact of change A vs. change B due to the unpredictable nature of the many variables that are difficult to control including compiler optimization passes, architecture gotchas (caches, branch misses), and interplay of quirks on various platforms. Therefore, irreducible and necessary work to profile the differences become the primary viable path to resolving engineering decision points. Hopefully, LLMs now and in the future will be able to help build out boilerplate roughly in the direct of creating such profiling benchmarks and fixtures.
PS: I'm presently revisiting C++14 because it's the most universal statically-compiled language for quickly answering interview problems. It would be unfair to impose Rust, Go, Elixir, or Haskell on an interviewer.
Um... no. This is 100% completely and totally wrong.
x86-64 requires the hardware to support SSE2, which has native single-precision and double-precision instructions for floating-point (e.g., scalar multiply is MULSS and MULSD, respectively). Both the single precision and the double precision instructions will take the same time, except for DIVSS/DIVSD, where the 32-bit float version is slightly faster (about 2 cycles latency faster, and reciprocal throughput of 3 versus 5 per Agner's tables).
You might be thinking of x87 floating-point units, where all arithmetic is done internally using 80-bit floating-point types. But all x86 chips in like the last 20 years have had SSE units--which are faster anyways. Even in the days when it was the major floating-point units, it wasn't any slower, since all floating-point operations took the same time independent of format. It might be slower if you insisted that code compilation strictly follow IEEE 754 rules, but the solution everybody did was to not do that and that's why things like Java's strictfp or C's FLT_EVAL_METHOD were born. Even in that case, however, 32-bit floats would likely be faster than 64-bit for the simple fact that 32-bit floats can safely be emulated in 80-bit without fear of double rounding but 64-bit floats cannot.
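A minimal C++ timing sketch for checking the claim on your own hardware (the numbers and conclusions are yours to measure; the `volatile` is only there to keep the compiler from folding the multiply loop away):

```cpp
#include <cassert>
#include <chrono>
#include <cstdio>

// Same dependent multiply chain in float and double. On SSE2 hardware
// (MULSS vs MULSD) the two loops should time out nearly identically;
// run it rather than trusting intuition. Returns elapsed seconds.
template <typename T>
double time_muls(int iters) {
    volatile T x = T(1.0000001);  // volatile: re-read each iteration
    T acc = T(1);
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i) acc *= x;
    auto t1 = std::chrono::steady_clock::now();
    std::printf("result %g\n", (double)acc);  // keep the result live
    return std::chrono::duration<double>(t1 - t0).count();
}
```

Note that timing a single run is noisy (the first batch of iterations often runs slower, as mentioned below); repeat and take the minimum.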
I agree with you. It should take the same time when thinking more about it. I remember learning this in ~2016 and I did performance test on Skylake which confirmed (Windows VS2015). I think I remember that i only tested with addsd/addss. Definitely not x87. But as always, if the result can not be reproduced... I stand corrected until then.
I tried to reproduce it on Ivy Bridge (Windows, VS2012) and failed (mulss and mulsd) [0]. Single and double precision take the same time. I also found a behavior where the first batch of iterations takes more time regardless of precision. It is possible that this tricked me last time.
Sure, I clarified this in a sibling comment, but I kind of meant that I will use the slower "money" or "decimal" types by default. Usually those are more accurate and less error-prone, and then if it actually matters I might go back to a floating point or integer-based solution.
I think this is only true if using x87 floating point, which anything computationally intensive is generally avoiding these days in favor of SSE/AVX floats. In the latter case, for a given vector width, the cpu can process twice as many 32 bit floats as 64 bit floats per clock cycle.
Yes, as I wrote, it is only true for one float value.
SIMD/MIMD will benefit from working on smaller widths. This is not only because they do more work per clock, but because memory is slow. Super slow compared to the CPU. Optimization is a lot about minimizing cache misses.
(But remember that the cache line is 64 bytes, so reading a single value smaller than that will take the same time. So it does not matter in theory when comparing one f32 against one f64)
I'm surprised that it has any impact on performance at all, and I'd love to see the codegen differences between the applications.
Mostly the `final` keyword serves as a compile-time assertion. The compiler (sometimes linker) is perfectly capable of seeing that a class has no derived classes, but what `final` assures is that if you attempt to derive from such a class, you will raise a compile-time error.
This is similar to how `inline` works in practice -- rather than providing a useful hint to the compiler (though the compiler is free to treat it that way) it provides an assertion that if you do non-inlinable operations (e.g. non-tail recursion) then the compiler can flag that.
All of this is to say that `final` can speed up runtimes -- but it does so by forcing you to organize your code such that the guarantees apply. By using `final` classes, in places where dynamic dispatch can be reduced to static dispatch, you force the developer to not introduce patterns that would prevent static dispatch.
"inline" is confusing in C++, as it is not really about inlining. Its purpose is to allow multiple definitions of the same function. It is useful when you have a function defined in a header file, because if included in several source files, it will be present in multiple object files, and without "inline" the linker will complain of multiple definitions.
It is also an optimization hint, but AFAIK, modern compiler ignore it.
The thing with `inline` as an optimisation is that it's not about optimising by inlining directly. It's a promise about how you intend to use the function.
It's not just "you can have multiple definitions of the same function" but rather a promise that the function doesn't need to be address/pointer equivalent between translation units. This is arguably more important than inlining directly because it means the compiler can fully deduce how the function may be used without any LTO or other cross translation unit optimisation techniques.
Of course you could still technically expose a pointer to the function outside a TU but doing so would be obvious to the compiler and it can fall back to generating a strictly conformant version of the function. Otherwise however it can potentially deduce that some branches in said function are unreachable and eliminate them or otherwise specialise the code for the specific use cases in that TU. So it potentially opens up alternative optimisations even if there's still a function call and it's not inlined directly.
> It is useful when you have a function defined in a header file, because if included in several source files, it will be present in multiple object files, and without "inline" the linker will complain of multiple definitions.
Traditionally you'd use `static` for that use case, wouldn't you?
After all, `inline` can be ignored, `static` can't.
> "inline" is confusing in C++, as it is not really about inlining. Its purpose is to allow multiple definitions of the same function.
No, its purpose was and is still to specify a preference for inlining. The C++ standard itself says this:
> The inline specifier indicates to the implementation that inline substitution of the function body at the point of call is to be preferred to the usual function call mechanism.
> The compiler (sometimes linker) is perfectly capable of seeing that a class has no derived classes
How? The compiler doesn't see the full program.
The linker I'm less sure about. If the class isn't guaranteed to be fully private, wouldn't an optimizing linker have to be conservative in case you inject a derived class?
> Mostly the `final` keyword serves as a compile-time assertion. The compiler (sometimes linker) is perfectly capable of seeing that a class has no derived classes
That's incorrect. The optimizer has to assume everything escapes the current optimization unit unless explicitly told otherwise. It needs explicit guarantees about the visibility to figure out the extent of the derivations allowed.
What if I dlopen a shared object that contains a derived class, then instantiate it? You cannot statically verify that I won't. Or you could swap out a normally linked shared object for one that creates a subclass. Etc. etc. This kind of stuff is why I think shared object boundaries should be limited to the lowest common denominator (basically the C ABI). Dynamic linking high-level languages was a mistake. The only winning move is not to play.
> I'd love to see the codegen differences between the applications
There are two applications, dynamic calls and dynamic casts.
Dynamic casts to final classes don't require checking the whole inheritance chain. I recently did this in styx [0]. The gain may appear marginal, e.g. 3 or 4 dereferences saved, but in programs based on OOP you can easily have *billions* of dynamic casts saved.
The main case where I use final and where I would expect benefits (not covered well by the article) is when you are using an external library with pure virtual interfaces that you implement.
For example, the AWS C++ SDK uses virtual functions for everything. When you subclass their classes, marking your classes as final allows the compiler to devirtualize your own calls to your own functions (GCC does this reliably).
I'm curious to understand better how clang is producing worse code in these cases. The code used for the blog post is a bit too complicated for me to look at, but I would love to see some microbenchmarks. My guess is that there is some kind of icache or code-size problem, where inlining more produces worse code.
`final` tells the compiler that nothing extends this class. That means the compiler can theoretically do things like inlining class methods and eliminate virtual method calls (perhaps duplicating the method)?
However, it's quite possible that one of those optimizations makes the code bigger or misaligns things with the cache in unexpected ways. Sometimes a method call can be faster than inlining, especially with hot loops.
All this being said, I'd expect final to offer very little benefit over PGO. Its main value is the constraint it imposes and not the optimization it might enable.
> For example, the AWS C++ SDK uses virtual functions for everything. When you subclass their classes, marking your classes as final allows the compiler to devirtualize your own calls to your own functions (GCC does this reliably).
I want to ask, and I sincerely mean no snark, what is the point?
When working with AWS through an SDK your code will spend most of the time waiting on network calls.
What is the point of devirtualizing your function calls to save an indirection when you will be spending several orders of magnitude more time just waiting for the RPC to resolve?
It just doesn't seem like something even worth thinking about at all.
Yeah, that was just the first public C++ library with this pattern that popped into my head. I just make all my classes final out of habit and don't think about it. I remove final if I want to subclass, but that almost never happens.
The only thing worse than no benchmark is a bad benchmark.
I don't think this really shows what `final` does, not to code generation, not to performance, not to the actual semantics of the program. There is no magic bullet - if putting `final` on every single class would always make it faster, it wouldn't be a keyword, it'd be a compiler optimization.
`final` does one specific thing: It tells a compiler that it can be sure that the given object is not going to have anything derive from it.
'Final' cannot be assumed without complete knowledge of all final linking cases, and knowledge that this will not change in the future. The latter can never be assumed by a compiler without indication.
"In theory" adding 'final' only gives a compiler more information, so should only result in same or faster code.
In practice, some optimizations improve performance for more expected or important cases (in the compiler writer's estimation), with worse outcomes in other less expected, less important cases. Without a clear understanding the when and how of these 'final' optimizations, it isn't clear without benchmarking after the fact, when to use it, or not.
That makes any given test much less helpful. Since all we know is 'final' was not helpful in this case. We have no basis to know how general these results are.
But it would be deeply strange if 'final' was generally unhelpful. Informationally it does only one purely helpful thing: reduce the number of linking/runtime contexts the compiler needs to worry about.
Not disagreeing with your point, but it couldn't be a compiler optimization, could it? The compiler isn't able to infer that the class will not be inherited anywhere else, since another compilation unit unknown to the class could inherit.
Possibly not in the default c++ language mode, but check out -fwhole-program-vtables. It can be a useful option in cases where all relevant inheritance relationships are known at compile time.
Which is good, but may not apply. I have an application where I can't do that because we support plugins, so a couple of classes will get overridden outside of the compilation (in hindsight a bad decision, but too late to change now). Meanwhile most classes will never be overridden, so I use final to say that. We are also a multi-repo project (which despite the hype I think is better for us than a mono-repo), another reason why -fwhole-program-vtables would be difficult to use -- though we could make it work with effort if it weren't for the plugins.
I would expect "final" to have no effect on this type of code at all. That it does in some cases cause measurable differences I put down to randomly hitting internal compiler thresholds (perhaps one of the inlining heuristics is "Don't inline a function with more than 100 tokens", and the "final" keyword pushes a couple of functions to 101).
Why would I expect no performance difference? I haven't looked at the code, but I would expect that for each pixel, it iterates through an array/vector/list etc. of objects that implement some common interface, and calls one or more methods (probably something called intersectRay() or similar) on that interface. By design, that interface cannot be made final, and that's what counts. Whether the concrete derived classes are final or not makes no difference.
In order to make this a good test of "final", the pointer type of that container should be constrained to a concrete object type, like Sphere. Of course, this means the scene is limited to spheres.
The only case where final can make a difference, by devirtualising a call that couldn't otherwise be devirtualised, is when you hold a pointer to that type, and the object it points at was allocated "uncertainly", e.g., by the caller. (If the object was allocated in the same basic block where the method call later occurs, the compiler already knows its runtime type and will devirtualise the call anyway, even without "final".)
> (perhaps one of the inlining heuristics is "Don't inline a function with more than 100 tokens", and the "final" keyword pushes a couple of functions to 101).
That definitely is one of the heuristics in MSVC++.
We have some performance critical code and at one point we noticed a slowdown of around ~4% in a couple of our performance tests.
I investigated but the only change to that code base involved fixing up an error message (i.e. no logic difference and not even on the direct code path of the test as it would not hit that error).
Turns out that:
int some_func() {
if (bad)
throw std::exception("Error");
return some_int;
}
Inlined just fine, but after adding more text to the exception error message it no longer inlined, causing the slow-down.
You could either fix it with __forceinline or by moving the exception to a function call.
Since the inlining is performed in MSVC's backend, as opposed to its frontend, it operates strictly on MSVC's intermediate representation, which lacks information about tokens or the AST -- so it's unlikely to be about tokens.
std::exception does not take a string in its constructor, so most likely you used std::runtime_error. std::runtime_error has a pretty complex constructor if you pass into it a long string. If it's a small string then there's no issue because it stores its contents in an internal buffer, but if it's a longer string then it has to use a reference counting scheme to allow for its copy constructor to be noexcept.
That is why you can see different behavior if you use a long string versus a short string. You can also see vastly different codegen with plain std::string as well depending on whether you pass it a short string literal or a long string literal.
> std::exception does not take a string in its constructor
You're right, I used it as a short-hand for our internal exception function, forgetting that the std one does not take a string.
Our error handling function is a simple static function that takes an std::string and throws a newly constructed object with that string as a field.
But yes, it could very well have been that the string surpassed the short string optimisation threshold or something similar.
I did verify the assembly before and after and the function definitely inlined before and no longer inlined after. Moving the 'throw' (and, importantly, the string literal) into a separate function that was called from the same spot ensured it inlined again and the performance was back to normal.
Actually, the compiler can only implicitly devirtualize under very specific circumstances. For example, it cannot devirtualize if there was previously a non-inlined call through the same pointer.
The reason is placement new. It is legal (given that certain invariants are upheld) in C++ to say `new(this) DerivedClass`, and compilers must assume that each method could potentially have done this, changing the vtable pointer of the object.
The `final` keyword somewhat counteracts this, but even GCC still only opportunistically honors it - i.e. it inserts a check if the vtable is the expected value before calling the devirtualized function, falling back on the indirect call.
Fascinating, though a little sad. Are there any important kinds of behaviour that can only be implemented via this `new(this) DerivedClass` chicanery? Because if not, it seems a shame to make the optimiser pay such a heavy price just to support it.
Presumably there is some arcane trick that somebody will argue is only implementable in this way, but I would personally never let such code through review.
You should use final to express design intent. In fact I'd rather it were the default in C++, with some sort of opposite ('derivable'?) keyword instead, but that ship sailed a long time ago. Any measurable negative perf impact should be filed as a bug and fixed.
C++ doesn't have the fragile base problem, as members aren't virtual by default. The only concern with unintended inheritance is polymorphic deletion. "final" on a class definition disables some tricks that you can do with private inheritance.
Having said that, "final" on member functions is great, and I like to see that instead of "override".
All OOP languages have it, the issue is related to changing the behaviour of the base class, and the change introducing unforeseen consequences on the inheritance tree.
Changing how an existing method is called (regular, virtual, static), changing visibility, overloading, introducing a name that clashes downstream, introducing a virtual destructor, making a data member non-copyable, ...
> All OOP languages have it, the issue is related to changing the behaviour of the base class, and the change introducing unforeseen consequences on the inheritance tree.
C++ largely solves it by having tight encapsulation. As long as you don't change anything that breaks your existing interface, you should be good. And your interface is opt-in, including public members and virtual functions.
Not when you change the contents of the class itself that public and protected inheritance expose to derived classes, which is exactly the whole issue of the fragile base class.
It doesn't go away just because private members exist as a possible language feature.
That's not a fragile base, that's just a fragile class. You can break APIs for all kinds of users, including derived classes.
Some APIs are aimed towards derived classes, like protected members and virtual functions, but that doesn't make the issue fundamentally different. It's just breaking APIs.
Point is, in C++ you have to opt-in to make these API surfaces, they are not the default.
Intent is nice and all that, but I would like a "notwithstanding" keyword instead that just lets me bypass that kind of "intent" without having to copy-paste the entire implementation just to remove a pointless keyword or make a destructor public when I need it.
As an LLVM developer, I really wish the author filed a bug report and waited for some analysis BEFORE publishing an article (that may never get amended) that recommends not using this keyword with clang for performance reasons. I suspect there's just a bug in clang.
Coincidentally, I happened to be playing around yesterday with a small performance test case using uniform_real_distribution, and for some strange reason Clang was 6x slower than GCC.
I put it down to some weird clang bug on my LTS version of Ubuntu. As my installed version was clang-14, I decided it possibly had been noticed and fixed a long time ago.
After reading your message I replaced uniform_real_distribution by uniform_int_distribution, and lo and behold, Clang was indeed faster than GCC, as expected.
Thank you for coming back to me with your findings.
Changes in the layout of the binary can have large impacts on the program performance [0] so it's possible that the unexpected performance decrease is caused by unpredictable changes in the layout of the binary between compilations. I think there is some tool which helps ensure layout is consistent for benchmarking, but I can't remember what it's called.
I profiled this project and there are abundant opportunities for devirtualization. The virtual interface `IHittable` is the hot one. However, the WITH_FINAL define is not sufficient, because the hot call is still virtual. At `hit_object |= _objects[node->object_index()]->hit` I am still seeing ` mov (%rdi),%rax; call *0x18(%rax)`, so the application of final here was not sufficient to do the job. Whatever differences are being measured are caused by bogons.
An interface, like IHittable, can't possibly be made final since its whole purpose is to enable multiple different concrete subclasses that implement it.
As you say, that's the hot one -- and making the concrete subclasses themselves "final" enables no devirtualisations since there are no opportunities for it.
Yeah the practical cases for devirtualization are when you have a base class, a derived class that you actually use, and another derived class that you use in tests. For your release binary the tests aren't visible so that can all be devirtualized.
In cases where you have Dog and Goose that both derive from Animal and then you have std::vector<Animal>, what is the compiler supposed to do?
The compiler simply knows that the actual dynamic type is Animal because it is not a pointer. You need Animal* to trigger all the fun virtual dispatch stuff.
One thing that wasn't mentioned in the article, and that I wish it had covered, is the size of the compiled binary with and without final. The only reason I would expect the final version to be slower is that we are emitting more code because of inlining, resulting in a larger portion of instruction cache misses.
Also, now that I think of it, they should have run the code under perf and compared the stats.
Yeah, it's really unsatisfying that there was no attempt to explain why it might be slower, since final just gives the compiler more information to decide on optimizations, which in theory should only make things faster.
> I created a "large test suite" to be more intensive. On my dev machine it needed to run for 8 hours.
During such long and compute-intensive tests, how are thermal considerations mitigated? Not saying that this was the case here, but I can see how, after saturating all cores for 8 hours, the whole PC might get hot to the point where the CPU starts throttling, so when you reboot to the next OS or start another batch, overall performance could be a bit lower.
Having recently done similar day-and-night-long suites of benchmarks (on a laptop, in heat dissipation conditions worse than on any decent desktop), I've found that there is no correlation between the order the benchmarks are run in and their performance (or energy consumption!). I would therefore assume that a non-overclocked processor would not exhibit the patterns you are thinking of here.
I really wish he'd listed all the flags he used. To add on to the flags already listed by some other commenters, `-mcpu` and related flags are really crucial in these microbenchmarks: over such a small change and such a small set of tight loops, you could just be regressing on coincidences in the microarchitecture scheduler vs. higher-level assumptions.
And he didn't repeat each test case 5 or 9 times, and take the median (or even an average).
There will be operating system noise that can be in the multi-percent range. This is defined as various OS services that run "in the background" taking up cpu time, emptying cache lines (which may be most important), and flushing a few translate lookaside entries.
Once you recognize the variability from run to run, claiming "1%" becomes less credible. Depending on the noise level, of course.
Linux benchmarks like SPECcpu tend to be run in "single-user mode" meaning almost no background processes are running.
1% is nothing to scoff at. But I suspect that the variability of compilation (specifically quirks of instruction selection, register allocation, and function alignment) more than masks any gains.
The clang regression might be explainable by final allowing some additional inlining and clang making a hash of it.
I'm actually more worried about Clang being close to 100% slower than GCC on Linux. That doesn't seem right.
I am prepared to believe that there is some performance difference between the two, varying per case, but I would expect a few percent difference, not twice the run time.
tldr: sprinkled a keyword around in the hopes that it "does something" to speed things up, tested it, got noisy results but no miraculous speedup.
I started skimming this article after a while, because it seemed to be going into the weeds of performance comparison without ever backing up to look at what the change might be doing. Which meant that I couldn't tell if I was going to be looking at the usual random noise of performance testing or something real.
For `final`, I'd want to at least see if it changes the generated code by replacing indirect vtable calls with direct or inlined calls. It might be that the compiler is already figuring it out and the keyword isn't doing anything. It might be that the compiler is changing code, but the target address was already well-predicted and it's perturbing code layout enough that it gets slower (or faster). There could be something interesting here, but I can't tell without at least a little assembly output (or perhaps a relevant portion of some intermediate representation, not that I would know which one to look at).
If it's not changing anything, then perhaps there could be an interesting investigation into the variance of performance testing in this scenario. If it's changing something, then there could be an interesting investigation into when that makes things faster vs slower. As it is, I can't tell what I should be looking for.
>changing the generated code by replacing indirect vtable calls with direct or inlined calls
It can't possibly be doing this, if the raytracing code is like any other raytracer I've ever seen -- since it must be looping through a list of concrete objects that implement some shared interface, calling intersectRay() on each one, and the existence of those derived concrete object types means that that shared interface can't be made final, and that's the only thing that would enable devirtualisation -- it makes no difference whether the concrete derived types themselves are final or not.
This is what I was waiting for too. Especially with the large regression on Clang/Ubuntu. Maybe he uncovered a Clang/LLVM codegen bug, but you’d need to compare the generated assembly to know.
+1. On modern hardware and software systems, performance is effectively stochastic to some degree, as small random perturbations to the input (code, data, environments, etc) can have arbitrary effects for the performance. This means you can't draw a direct causal chain / mechanism from what you changed to the performance change - when it matters, you do need to do a deeper analysis and investigation to find the actual and full causal chain. I.e. a correlation is not a causation, and especially more so on modern hardware and software systems.
Fortran has virtual functions ("type bound procedures"), and supports a NON_OVERRIDABLE attribute on them that is basically "final". (FINAL exists but means something else.). But it also has a means for localizing the non-overridable property.
If a type bound procedure is declared in a module, and is PRIVATE, then overrides in subtypes ("extended derived types") work as usual for subtypes in the same module, but can't be affected by overrides that appear in other modules. This allows a compiler to notice when a type has no subtypes in the same module, and basically infer that it is non-overridable locally, and thus resolve calls at compilation time.
Or it would, if compilers implemented this feature correctly. It's not well described in the standard, and only half of the Fortran compilers in the wild actually support it. So like too many things in the Fortran world, it might be useful, but it's not portable.
It's difficult to discuss this stuff because the impact can be negligible or negative for one person, but large and consistently positive for another. You can only usefully discuss it on a given baseline, and for something like final I would hope that baseline would be a project that already enjoys PGO, LTO, and BOLT.
Each of the test cases measured needs to be run at least 3 times in a row, to warm caches (not just CPU but OS too) and to detect and remove noise.
In fact, I would run the same test repeatedly, keeping track of the k fastest times (k being ~3-7), and only stopping when the first and the kth fastest times are within a certain tolerance (as low as 1%). This ensures repeatability.
One sample of performance data for each test is not enough. This study provides no new insights.
Surely "final" is a conceptual thing... in other words, you don't want anyone else to derive from the class for good reasons. It's for conceptual understanding, surely?
I think it was Chandler Carruth who said "If you're not measuring, then you don't care about performance." I agree, and by that measure, nobody I've ever worked with cares about performance.
The best I'll see is somebody who cooked up a naive microbenchmark to show that style 1 takes fewer wall nanoseconds than style 2 on his laptop.
People I've worked with don't use profilers, claiming that they can't trust it. Really they just can't be bothered to run it and interpret the output.
The truth is, most of us don't write C++ because of performance; we write C++ because that's the language the code is written in.
The performance gained by different C++ techniques seldom matters, and when it does you have to measure. Profiler reports almost always surprise me the first few times -- your mental model of what's going on and what matters is probably wrong.
It matters to some degree. If it's just a simple technique you can file away and repeat as muscle memory, well that means your code is that much better.
From a user perspective it could be the difference between software that's pleasant to use and software that's annoying to use. From a philosophical perspective it's the difference between software that functions vs software that works well.
Of course it depends on your context as to whether this is valued, but I wouldn't dismiss it. One person's micro-optimization is another person's polish.
That's interesting. Maybe final enabled more inlining, and clang is being too aggressive about it for the icache sizes in play here? I'd love to see a comparison of the generated code.
I'm disappointed the author's conclusion is "don't use final", not "something is wrong with clang".
If it does have a noticeable impact, that would be surprising, a bit like going back to the days when 'inline' was supposed to tell the compiler to inline the designated functions (no longer its main use case nowadays).
What should be evaluated is removing indirection and tightly packing your data. I'm sure you'd gain a better performance improvement that way. Virtual calls and shared_ptr are littered throughout the codebase.
In this way: you can avoid the need for the `final` keyword and do the optimization the keyword enables (de-virtualize calls).
>Yes, it is very hacky and I am disgusted by this myself. I would never do this in an actual product
Why? What's with the C++ community and their disgust for macros without any underlying reasoning? It reminds me of everyone blindly saying "Don't use goto; it creates spaghetti code".
Sure, if macros are overly used: it can be hard to read and maintain. But, for something simple like this, you shouldn't be thinking "I would never do this in an actual product".
Macros that are giving you some value can be ok. In this case, once the performance conclusion is reached, the only reason to continue using a macro is if you really need the `final`ity to vary between builds. Otherwise, just delete it or use the actual keyword.
(But I'm worse than the author; if I'm just comparing performance, I'd probably put `final` everywhere applicable and then do separate compiles with `-Dfinal=` and `-Dfinal=final`... I'd be making the assumption that it's something I either always or never want eventually, though.)
In modern C++, macros are a viewed as a code smell because they are strictly worse than alternatives in almost all situations. It is a cultural norm; it is a bit like using "unsafe" in Rust if not strictly required for some trivial case. The C++ language has made a concerted effort to eliminate virtually all use cases for macros since C++11 and replace them with type-safe first-class features in the language. It is a bit of a legacy thing at this point, there are large modern C++ codebases with no macros at all, not even for things like logging. While macros aren't going away, especially in older code, the cultural norm in modern C++ has tended toward macros being a legacy foot-gun and best avoided if at all possible.
The main remaining use case for the old C macro facility I still see in new code is to support conditional compilation of architecture-specific code e.g. ARM vs x86 assembly routines or intrinsics.
Macros are still useful for conditional compilation, as in this case. They've been sunsetted for anything that looks like code generation, which this isn't. I was more commenting on the reflexive "ick" reaction of the author to the use of macros (even when appropriate) because avoiding them has become so engrained in C++ culture. I'm a macro minimalist but I would use them here.
Many people have a similar reaction to the use of "goto", even though it is absolutely the right choice in some contexts.
When things get complex, template error messages are easier to follow. Nobody writes complex macros, but if you tried, you'd see why: template error messages are legendary for a reason, and nested macro errors are worse.
I would say the biggest performance impact would come from `constexpr`, followed by `const`. I wouldn't bet any money on `final`, which in C++ is a guard against inheritance; a virtual call's target address is resolved through the vtable, hence `final` wouldn't change anything. Maybe the author confused it with the `final` keyword in Java.
In my experience the compiler is pretty good at figuring out what is constant so adding const is more documentation for humans, especially in C++, where const is more of a hint than a hard boundary. Devirtualization, as can happen when you add a final, or the optimizations enabled by adding a restrict to a pointer, are on the other hand often essential for performance in hot code.
Since "const" makes things read-only, being const correct makes sure that you don't do funny things with the data you shouldn't mutate, which in turn eliminates tons of data bugs out of the gate.
So, it's an opt-in security feature first, and a compiler hint second.
How does const affect code generation in C/C++? Last time I checked, const was purely informational. Compilers can't eliminate reads of data behind a const pointer, because const_cast exists. Compilers can't eliminate double calls to const methods, because inside the function definition such methods can still legally modify mutable members (and have many other side effects).
What actually may help is __attribute__((pure)) and __attribute__((const)), but I don't see them often in real code (unfortunately).
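For reference, a minimal sketch of the two attributes (GCC/Clang extensions; the functions here are illustrative):

```cpp
int table[256] = {7};  // table[0] is 7, the rest are zero

// __attribute__((const)): result depends only on the arguments, no side
// effects, no memory reads; repeated calls with the same argument may be
// merged into one by the optimizer.
__attribute__((const)) int square(int x) { return x * x; }

// __attribute__((pure)): may read global state but has no side effects,
// so calls can still be reordered or eliminated.
__attribute__((pure)) int lookup(int i) { return table[i & 0xff]; }

int twice_squared(int x) {
    return square(x) + square(x);  // eligible for a single emitted call
}
```

MSVC has no direct equivalent, which is part of why these annotations stay rare in portable code.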
Const affects code generation when used on variables. If you have a `const int i` then the compiler can assume that i never changes.
But you're right that this does not hold true for const pointers or references.
> What actually may help is __attribute__((pure)) and __attribute__((const)), but I don't see them often in real code (unfortunately).
It's disappointing that these haven't been standardized. I'd prefer different semantics, though, e.g. something that allows memoization or other forms of caching that are technically side effects, but where you are still OK with the compiler removing, reordering, or eliminating calls.
Do you have an example where a const on a variable changes codegen? I would be surprised if the compiler couldn't figure out variable constness itself.
Sure, in the following example the compiler is able to propagate the constant to the return statement with const in f1 but needs to load it back from the stack without const in f0:
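The example was presumably along these lines (a reconstruction, with the made-up `opaque_call` standing in for a function defined in another TU; `noinline` keeps the call opaque in this single-file sketch):

```cpp
// Stand-in for a function in another translation unit. The optimizer
// cannot see its body, so it must assume the worst about side effects.
__attribute__((noinline)) void opaque_call(const int& x) { (void)x; }

int f0() {
    int i = 42;
    opaque_call(i);  // i might legally be modified via const_cast,
    return i;        // so the compiler reloads i from the stack
}

int f1() {
    const int i = 42;
    opaque_call(i);  // modifying a const object would be undefined behavior,
    return i;        // so the compiler can fold this to `return 42`
}
```

On godbolt with -O2, f1 typically compiles to a call followed by `mov eax, 42`, while f0 reloads the spilled value.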
In general, whenever you call a function that the compiler cannot inspect (because it is in another TU) and the compiler cannot prove that that function doesn't have any reference to your variable it has to assume that the function might change your variable. Only passing a const reference won't help you here because it is legal to cast away constness and modify the variable unless the original variable was const.
I wish that const meant something on reference or pointers and you had to do something more explicit like a mutable member to allow modifying a variable. But even that would not help if the compiler can't prove that a non-const pointer hasn't escaped somehow. You could add __attribute__((pure)) to the function to help the compiler but that is a lot stricter so can't always be used.
This isn't guaranteed: modifying a const object after casting away its constness is undefined behavior, and the compiler is not required to diagnose it; generally it can't, because it doesn't know what your reference/pointer points to.
> In my experience the compiler is pretty good at figuring out what is constant so adding const is more documentation for humans,
In the same TU, sure. But across TU boundaries the compiler really can't figure out what should be const and what should not, so `const` in parameter or return values allows the compiler to tell the human "You are attempting to make a modification to a value that some other TU put into RO memory.", or issue similar diagnostics.
Const can only ever possibly have a performance impact when used directly on variables. const pointers / references are purely for the benefit of the programmer - the compiler can assume nothing because the variable could be modified elsewhere or through another pointer/reference and const_cast is legal anyway unless the original variable was const.
I do not see how the final keyword would make a difference in performance at all in this case. The compiler should be able to build an inheritance tree and determine by itself which classes are to be treated as final.
Now for libraries, this is a different story. There I can imagine final keyword could have an impact.
But dynamically loaded libraries exist, so even if the compiler knows, through LTO or something, that the class is the most derived version among all classes in the statically linked code, it won't be able to devirtualize the calls without seeing the instantiation site, unless the class is marked final.
>Personally, I'm not turning it on. And would in fact, avoid using it. It doesn't seem consistent.
I feel like we'd have to repeat these tests quite a few times to get to a decent conclusion. Hell, small variations in performance could be caused by all sorts of things outside the actual program.
I'm surprised by this article. The author genuinely believes that a language construct meant to benefit performance was added to the language without anyone ever running any metrics to verify it. "Just trust me bro" is the quote.
That's an insane level of ignorance about how these things are decided by the standards committee.
This seems like a reasonable use of the preprocessor to me. I've seen similar use in high-quality codebases. I wonder why the author is so disgusted by it.
That's the first problem I see with the article. C++ isn't a fast language, as it is. There are far too many issues with e.g. aliasing rules, lack of proper vectorization (for the runtime arch), etc.
If you wish to have relatively good performance for your code, try ISPC, which lets you get great performance with vectorization up to AVX-512 without turning to intrinsics.
> That's the first problem I see with the article. C++ isn't a fast language, as it is. There are far too many issues with e.g. aliasing rules, lack of proper vectorization (for the runtime arch), etc.
That's a bold statement due to the way it heavily contrasts with reality.
C++ is ever present in high performance benchmarks as either the highest performing language or second only to C. It's weird seeing someone claim with a straight face that "C++ isn't a fast language, as it is".
To make matters worse, you go on confusing what a programming language is, and confusing implementation details with language features. It's like claiming that C++ isn't a language for computational graphics just because no C++ standard dedicates a chapter to it.
Just like in every engineering domain, you need deep knowledge of the details to milk the last drop of performance improvement out of a program. Low-latency C++ is a testament to how the smallest details can be critical to performance. But you need to be completely detached from reality to claim that C++ isn't a fast language.
> That's a bold statement due to the way it heavily contrasts with reality.
I'm ready to back this up. And no, I'm not confusing things: I work in HPC (realtime computer vision), and in reality the only thing we'd use C++ for is "glue", i.e. binding together implementations of the actual algorithms written in other languages.
Implementations could be e.g. in CUDA, ISPC, neural-inference via TensorRT, etc.
"We use extreme vectorisation and can't do it in native C++ therefore the language is slow"
You a junior or something? For 99% of use cases C++ autovectorisation does plenty and will outperform the same code written in higher level languages. You are literally in the 1% and conflating your use case for that of the general case...
I've worked in computer vision and real-time image processing. We use C++ extensively in the field due to its high performance. OpenCV is the tool of the trade. Both iOS and Android support C++ modules for performance reasons.
But to add to all the nonsense, you claim otherwise.
Frankly, your comments lack any credibility, which is confirmed by your lame appeal to authority.
Inlining has other requirements as well -- LTO pretty much covers it.
The article doesn't have sufficient data to tell whether the testcase is built in such a way that any of these optimizations can happen or is beneficial.
I think that enabling inlining is just one of the indirect consequences of devirtualization, and perhaps one that is largely irrelevant for performance improvements.
The whole point of devirtualization is eliminating the need to resort to pointer dereferencing when calling virtual members. The defining trait of a virtual class is its use of a vtable, which requires a pointer dereference every time a virtual member is accessed.
In classes with larger inheritance chains, you can easily have more than one pointer dereference taking place before you call a virtual member function.
Once a class is final, none of that is required anymore: when a member function is called, no vtable dereferencing needs to take place.
Devirtualization helps performance because you get to benefit from inheritance without paying a performance penalty for it. Without the final keyword, a performance-oriented project would need to be architected not to use inheritance at all, or at the very least not in hot-path code, because inheritance sneaks gratuitous pointer dereferences all over the place, which cost extra operations and hurt caching.
The whole purpose of the final keyword is that compilers can safely eliminate the pointer dereferencing used by virtual members. What otherwise stops them from applying this optimization is that they have no information on whether the class will be inherited from, with a derived class overriding some of its virtual members.
With the introduction of the final keyword, you are now able to tell the compiler "from thereon, this is exactly what you get" and the compiler can trim out anything loose.
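A minimal illustration of that signal to the compiler (illustrative class names):

```cpp
struct Base {
    virtual int f() const { return 0; }
    virtual ~Base() = default;
};

struct Open : Base { int f() const override { return 1; } };
struct Leaf final : Base { int f() const override { return 2; } };

// `o` may really be some further-derived type, so a vtable load is needed.
int via_open(const Open& o) { return o.f(); }

// Nothing can derive from Leaf, so the compiler may call (and inline)
// Leaf::f directly, with no vtable access at all.
int via_leaf(const Leaf& l) { return l.f(); }
```

Comparing the two functions at -O2 on godbolt shows the difference: `via_leaf` reduces to a constant, while `via_open` keeps the indirect call.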
Inlining is by far the most impactful optimization here, because it can eliminate the call altogether, and thus specialize the called function to the callsite, lifting constants, hoisting loop variables, etc.
I was going to eliminate polymorphism altogether for this object but later figured out how to refactor so that this particular call could be called once a millisecond. Then if more work was needed, it would dispatch a task to a dedicated CPU.
This was an incredibly performant improvement which made a significant difference to my P&L.
In general, if you're manipulating values that fit into registers and working on a platform with a shitty ABI, you need to be very careful about what your function call boundaries look like.
The most obvious example is SIMD programming on Windows x86 32-bit.
My guess is this is why he didn't see any speedup: all the code could fit inside the L2 cache, so he did not have to pay for RAM access on the dereference.
The number of different classes is important, not the number of objects as they have the same small number of vtable pointers.
It might be different for large codebases like Chrome and Firefox.
Of course you have to worry about pointer chasing, when you can easily avoid it. Either via a switch to a single indirection (by passing method pointers around) or inlining with final. Or other compile-time specialization.
Though, that assumes a correct prediction. But modern branch predictors are really good, they can track and correctly predict hundreds (if not thousands) of indirect calls, taking into account the history of the last few branches (so it can even get an idea of what class is currently being executed, and make branch predictions based on that). Modern branch predictors do a really good job at chewing up indirect branches in hot sequences of code.
Virtual functions are probably the most harmful for warm code. We are talking about code that's executed too often to be considered cold code, but not often enough to stick around in the branch predictors' cache, executed only a few hundred times a second. It's a death by a thousand cuts type thing. And that's where devirtualisation will help the most...
As long as you don't go too far with the inlining and start causing icache misses through code bloat. In an ideal world the compiler would inline enough to devirtualise the call, but not necessarily inline the actual function (unless it is small, or only called from one place).
In general it takes a significant amount of nondeterministic pointer chasing to fool modern branch predictors. Decades of research have been put into optimizing the hardware for languages like C++ and Java, both of which exhibit a lot of pointer chasing.
If they cannot be predicted, write your code accordingly.
This is not a thing in C++; vtables are flat, not nested. Function pointers are always 1 dereference away.
That's virtual inheritance. Regular old inheritance does not need or benefit from devirtualization; this is why the CRTP exists.
What you're talking about is dynamic dispatch
CRTP does not exist for that. CRTP was one of the many happy accidents in template metaprogramming that happened to be discovered when doing recursive templates.
Also, you've missed the whole point. CRTP is a way to rearchitect your code to avoid dereferencing pointers to virtual members in inheritance. The whole point is that with final you do not need to pull tricks: just tell the compiler that you don't want the class to be inherited, and the compiler picks up from there and does everything for you.
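For reference, the CRTP trick being discussed looks like this (a minimal sketch with illustrative names):

```cpp
// CRTP: the base template knows the concrete type at compile time,
// so dispatch is static: no vtable, and calls inline freely.
template <class Derived>
struct ShapeBase {
    double area() const {
        return static_cast<const Derived&>(*this).area_impl();
    }
};

struct Square : ShapeBase<Square> {
    double side = 0.0;
    double area_impl() const { return side * side; }
};
```

The cost is that `ShapeBase<Square>` and `ShapeBase<Circle>` are unrelated types, so you lose the ability to hold a heterogeneous collection through a single base pointer, which is exactly the flexibility `final` lets you keep.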
Please read my post. That's not my claim. I think I was very clear.
Is there a theory as to how devirtualisation could hurt performance?
And if the devirtualisation leads to inlining, that results in code bloat which can lower performance though more instruction cache misses, which are not cheap.
Inlining is actually pretty evil. It almost always speeds things up for microbenchmarks, as such benchmarks easily fit in icache. So programmers and modern compilers often go out of their way to do more inlining. But when you apply too much inlining to a whole program, things start to slow down.
But it's not like inlining is universally bad in larger program, inlining can enable further optimisations, mostly because it allows constant propagation to travel across function boundaries.
Basically, compilers need better heuristics about when they should be inlining. If it's just saving the overhead of a lightweight call, then they shouldn't be inlining.
No it's not. Except if you __force_inline__ everything, of course.
Inlining reduces the number of instructions in a lot of cases, especially when things are abstracted and factored, with lots of indirection, into small functions that call other small functions and so on. Consider an 'isEmpty' function, which dissolves into one CPU instruction once inlined, compared with a call/save-registers/compare/return sequence. Highly dynamic code (with most functions being virtual) tends to produce a fest of chained calls, jumping into functions that do very little work. Yes, the stack is usually hot and fast, but spending 80% of the instructions on stack management is still a big waste.
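The `isEmpty` case sketched out (illustrative type; the point is what the optimizer does with it):

```cpp
#include <cstddef>

struct Buffer {
    std::size_t len = 0;
    bool isEmpty() const { return len == 0; }
};

// After inlining, the isEmpty call dissolves into a single compare
// against zero; as an out-of-line call it would cost the full
// call/save/compare/return sequence described above.
bool has_data(const Buffer& b) { return !b.isEmpty(); }
```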
Compilers already have good heuristics about when they should be inlining, chances are they are a lot better at it than you. They don't always inline, and that's not possible anyway.
My experience is that compilers do marvels with inlining decisions when there are lots of small functions they _can_ inline if they want to. It gives the compiler a lot of freedom. Lambdas are great for that as well.
Make sure you make the most possible compile-time information available to the compiler, factor your code, don't have huge functions, and let the compiler do its magic. As a plus, you can have high level abstractions, deep hierarchies, and still get excellent performances.
As you say: “chances are they are a lot better at it than you”. Infrequently they are not.
The first is that when building a code base you don't necessarily know what it's being compiled with. And so even if there were a super-amazing compiler, there's no guarantee that's what will be compiling your code. Making it explicit, so long as you have a reasonably good idea of what you're doing, is generally just a good idea. It also conveys intent to some degree, especially things like final.
The second is that I think the saying 'premature optimization is the root of all evil' is itself the root of all evil, because that mindset has gradually transitioned into being against optimization in general, outside of the most primitive things like not running critical sections in O(N^2) when they could be O(N). And I think it's this mindset that has gradually brought us to where we are today, where we need what would have been a literal supercomputer not that long ago just to run a word processor. It's like death by a thousand cuts, and quite ridiculous.
The greater evil is putting a one-sentence quote out of context:
""" There is no doubt that the grail of efficiency leads to abuse. Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.
Yet we should not pass up our opportunities in that critical 3%. A good programmer will not be lulled into complacency by such reasoning, he will be wise to look carefully at the critical code; but only after that code has been identified. It is often a mistake to make a priori judgments about what parts of a program are really critical, since the universal experience of programmers who have been using measurement tools has been that their intuitive guesses fail. After working with such tools for seven years, I've become convinced that all compilers written from now on should be designed to provide all programmers with feedback indicating what parts of their programs are costing the most; indeed, this feedback should be supplied automatically unless it has been specifically turned off. """
So referencing something in particular from Unreal Engine, they actually created a caching system for converting between a quaternion and a rotator (euler rotation)! Obviously that sort of conversion isn't going to, in a million years, be even close to a bottleneck. That conversion is quite cheap on modern hardware, and so that caching system probably only gives the engine one of those 0.1% boosts in performance. But there are literally thousands of these "small efficiencies" spread all throughout the code. And it yields a final product that runs dramatically better than comparable engines.
In a moderately-sized codebase I regularly work on, I use __attribute__((noinline)) nearly ten times as often as __attribute__((always_inline)). And I use __attribute__((cold)) even more than noinline.
So yeah, I can kind of see why someone would say inlining is 'evil', though I think it's more accurate to say that it's just not possible for compilers to figure out these kinds of details without copious hints (like PGO).
When writing ultra-robust code that has to survive every vaguely plausible contingency in a graceful way, the code is littered with code paths that only exist for astronomically improbable situations. The branch predictor can figure this out but the compiler frequently cannot without explicit instructions to not pollute the i-cache.
Branch predictors actually hash the history of the last few branches taken into the branch prediction query, so the exact same branch within a child function will map to different branch predictor entries depending on which parent function it was called from, and there is no benefit to inlining.
It also means that the branch predictor can learn correlations between branches within a function, like when branches at the top and bottom of a function share conditions, or have inverted conditions.
The main advantages to inlining are (1) avoiding a jump and other function call overhead, (2) the ability to push down optimizations.
If you execute the "same" code (same instructions, different location) in many places that can cause cache evictions and other slowdowns. It's worse if some minor optimizations were applied by the inlining, so you have more types of instructions to unpack.
The question, roughly, is whether the gains exceed the costs. This can be a bit hard to determine because it can depend on the size of the whole program and other non-local parameters, leading to performance cliffs at various stages of complexity. Microbenchmarks will tend to suggest inlining is better in more cases than it actually is.
Over time you get a feel for which functions should be inlined. E.g., very often you'll have guard clauses or whatnot around a trivial amount of work when the caller is expected to be able to prove the guarded information at compile-time. A function call takes space in the generated assembly too, and if you're only guarding a few instructions it's usually worth forcing an inline (even in places where the compiler's heuristics would choose not to because the guard clauses take up too much space), regardless of the potential cache costs.
If you have something like a `while` loop and that loop's instructions fit neatly in a cache line, executing the loop can be quite fast even if you have to jump to different code locations for the internals. However, if you pump more instructions into that loop, you can exceed the cache line, which means more memory loads to do the same work.
It can also create more code. A method that took a `foo(NotFinal& bar)` could be duplicated by the compiler for the specialized cases which would be bad if there's a lot of implementations of `NotFinal` that end up being marshalled into foo. You could end up loading multiple implementations of the same function which may be slower than just keeping the virtual dispatch tables warm.
Guarded devirtualization is also cheaper than virtual calls, even when it has to do a type check before the direct call, or even chain multiple checks at once (with either regular ifs or by emitting a jump table). This technique is heavily used in various forms by the .NET, JVM and JavaScript JIT implementations (other platforms do it too, but these are the major ones).
The first two devirtualize virtual and interface calls (important in Java because all calls default to virtual, important in C# because people like to abuse interfaces and occasionally inheritance, C# delegates are also devirtualized/inlined now). The JS JIT (like V8) performs "inline caching" which is similar where for known object shapes property access is shape type identifier comparison and direct property read instead of keyed lookup which is way more expensive.
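A hand-rolled C++ analogue of such a guarded check (illustrative; real JITs compare the method-table/vtable pointer directly, which is cheaper than `dynamic_cast`, and pick the hot type from profile data):

```cpp
struct Shape {
    virtual double area() const { return 0.0; }
    virtual ~Shape() = default;
};

struct Circle final : Shape {
    double r;
    explicit Circle(double r) : r(r) {}
    double area() const override { return 3.14159 * r * r; }
};

// Guarded devirtualization by hand: one type test, then a direct,
// inlinable call; everything else falls back to the vtable.
double area_of(const Shape& s) {
    if (auto* c = dynamic_cast<const Circle*>(&s))
        return c->area();  // direct call to Circle::area, can inline
    return s.area();       // generic virtual dispatch
}
```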
Also you are correct - virtual calls are not terribly expensive, but they encroach on ever limited* CPU resources like indirect jump and load predictors and, as noted in parent comments, block inlining, which is highly undesirable.
[0] https://github.com/dotnet/runtime/blob/5111fdc0dc464f01647d6...
[1] https://github.com/dotnet/runtime/blob/main/docs/design/core... (mind you, the text was initially written 18 years ago, wow)
* through great effort of our industry to take back whatever performance wins each generation brings with even more abstractions that fail to improve our productivity
You can tell it "I won't do that" though with additional flags, like Clang's -fwhole-program-vtables, and even then it's not that simple. There was an effort in Clang to better support whole program devirtualization, but I haven't been following what kind of progress has been made: https://groups.google.com/g/llvm-dev/c/6LfIiAo9g68?pli=1
Maybe I can set this option at work. Though it's scary because I'd have to be certain.
Lots of code gets slower if it might need to be called from something not currently in the compiler's scope. That's essentially what ABI overhead is. If there isn't already, there should be a compiler flag that says "this is the whole program, have at it" which implies the vtables option.
* It is possible with `dlopen()` to load code objects that violate the assumptions made during compilation.
* The presence of runtime configuration mechanisms and application input can make it impossible to anticipate things like the choice of implementations of an interface.
One can always strive to reduce such situations, but it might simply not be necessary if a JIT is present.
Funny how things work. From working with Julia I've built a good intuition for guessing when functions would be inlined. And yet, I've never heard the word devirtualization until now.
Quick example, I got in an argument with someone a few years ago that claimed in C# that a `switch` was better than an `if(x==1) elseif(x==2)...` because switch was "faster" and rejected my PR. I mentioned that that doesn't appear to be true, we went back and forth until I did a compile-then-decompile of a minimal test with equality-based-ifs, and showed that the compiler actually converts equality-based-ifs to `switch` behind the scenes. The guy accepted my PR after that.
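The same experiment is easy to reproduce with C++ on godbolt (a sketch; the anecdote above was about C#, but the compilers behave analogously):

```cpp
// An equality-chained if and a switch over the same values typically
// lower to identical code: the compiler picks a compare chain or a
// jump table based on the values, not on which construct you wrote.
int dispatch_if(int x) {
    if (x == 1) return 10;
    else if (x == 2) return 20;
    else if (x == 3) return 30;
    return 0;
}

int dispatch_switch(int x) {
    switch (x) {
        case 1: return 10;
        case 2: return 20;
        case 3: return 30;
        default: return 0;
    }
}
```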
But there's tons of this stuff like this in CS, and I kind of blame professors for a lot of it [1]. A large part of becoming a decent engineer [2] for me was learning to stop trusting what professors taught me in college. Most of what they said was fine, but you can't assume that; what they tell you could be out of date, or simply never correct to begin with, and as far as I can tell you have to always test these things.
It doesn't help that a lot of these "it's faster" arguments are often reductive because they only are faster in extremely minimal tests. Sometimes a microbenchmark will show that something is faster, and there's value in that, but I think it's important that that can also be a small percentage of the total program; compilers are obscenely good at optimizing nowadays, it can be difficult to determine when something will be optimized, and your assertion that something is "faster" might not actually be true in a non-trivial program.
This is why I don't really like doing any kind of major optimizations before the program actually works. I try to keep the program in a reasonable Big-O and I try and minimize network calls cuz of latency, but I don't bother with any kind of micro-optimizations in the first draft. I don't mess with bitwise, I don't concern myself on which version of a particular data structure is a millisecond faster, I don't focus too much on whether I can get away with a smaller sized float, etc. Once I know that the program is correct, then I benchmark to see if any kind of micro-optimizations will actually matter, and often they really don't.
[1] That includes me up to about a year ago.
[2] At least I like to pretend I am.
I've solved a lot of arguments with godbolt and simple performance tests. Some topics are recurring themes among software engineers e.g.:
- compilers are almost always better at micro-optimizations than you are
- disk I/O is almost never a bottleneck in competent designs
- brute-force sequential scans are often optimal algorithms
- memory is best treated as a block device
- vectorization can offer large performance gains
- etc...
No one is immune to this. I am sometimes surprised at the extent to which assumptions are no longer true when I revisit optimization work I did 10+ years ago.
Most performance these days is architectural, so getting the initial design right often has a bigger impact than micro-optimizations and localized Big-O tweaks. You can always go back and tweak algorithms or codegen later but architecture is permanent.
(the techniques that used to work were similar to earlier Java versions and overall very dynamic languages with some exceptions, the techniques that still work and now are required today are the same as in C++ or Rust)
There are a few articles on msft devblogs that cover from-netframework migration to older versions (Core 3.1, 5/6/7):
- https://devblogs.microsoft.com/dotnet/bing-ads-campaign-plat...
- https://devblogs.microsoft.com/dotnet/microsoft-graph-dotnet...
- https://devblogs.microsoft.com/dotnet/the-azure-cosmos-db-jo...
- https://devblogs.microsoft.com/dotnet/one-service-journey-to...
- https://devblogs.microsoft.com/dotnet/microsoft-commerce-dot...
The tl;dr is depending on codebase the latency reduction was anywhere from 2x to 6x, varying per percentile, or the RPS was maintained with CPU usage dropping by ~2-6x.
Now, these are codebases of likely above average quality.
If you consider that moving from 6 to 8 yields up to another 15-30% on average through the improved, enabled-by-default DynamicPGO, and that the average codebase is of worse quality than whatever msft has (meaning DPGO-reliant optimizations scale that much better), it is not difficult to see the 10x number.
Keep in mind that while a particular regular piece of enterprise code could have improved within the bounds of "poor netfx codegen" -> "not far from LLVM with FLTO and PGO", the bottlenecks have changed significantly. Previously they could have been in lock contention (within GC or user code), object allocation, object memory copying, e.g. for financial domains anything including possibly complex Regex queries on imported payment reports (these alone now differ by anywhere between 2x and >1000x[0]), and for pretty much every codebase also in interface/virtual dispatch for layers upon layers of "clean architecture" solutions.
The vast majority of performance improvements (both compiler+GC and CoreLib+frameworks), which is difficult to appreciate given it was 8 years, address the above first and foremost. At my previous employer, the migration from NETFX 4.6 to .NET Core 3.1, while also deploying to much more constrained container images compared to beefy Windows Server hosts, reduced latency of most requests by the same factor of >5x (a certain request type went from 2s to 350ms). It was my first wow moment when I decided to stay with .NET rather than move over to Go back then (I was never a fan of Go's syntax, though, and other issues that .NET subsequently fixed but Go still has are not tolerable for me).
[0] Cumulative of
https://devblogs.microsoft.com/dotnet/regex-performance-impr...
https://devblogs.microsoft.com/dotnet/regular-expression-imp...
https://devblogs.microsoft.com/dotnet/performance-improvemen...
All of the 6x performance-improvement cases seem to be related to using the .NET-based Kestrel web server instead of the IIS web server, which required marshalling and interprocess communication. Several of the 2x gains appear to be related to using a different database backend. Claims that regex performance has improved a thousand-fold... seem more troubling than cause for celebration. Were you not precompiling your regexes in the older code? That would be a bug.
Somewhere in there, there might be 30% improvements in .net codegen (it's hard to tell). Profile Guided Optimization (PGO) seems to provide a 35% performance improvement over older versions of .net with PGO disabled. But that's dishonest. PGO was around long before .net Core. And claiming that PGO will provide 10x performance because our code is worse than Microsoft's code insults both our code and our intelligence.
With a Roslyn-based compiler at work I saw a 20% perf improvement just by switching from .NET Core 3.1 to .NET 6. No idea how slow .NET Framework was, though. I probably can't even target it with that codebase anymore.
But for regex, even with precompilation, the compiler got a lot better at transforming the regex into an equivalent regex that performs better (for example, automatic atomic grouping to reduce unnecessary backtracking when it's statically known that backtracking won't create more matches), and it also benefits a lot from the various vectorized implementations of IndexOf, etc. Typically with each improvement of one of those core methods for searching stuff in memory there's a corresponding change that uses them in regex.
So where in .NET Framework a regex might walk through a whole string character by character multiple times with backtracking it might be replaced with effectively an EndsWith and LastIndexOfAny call in newer versions.
(The distinction becomes important for targets serviced by Mono. To outline the difference, Mono is usually named explicitly, while CoreCLR and RyuJIT may not be. It also doesn't help that the JIT, that is, the IL-to-machine-code compiler, also services NativeAOT, so it gets annoying to stay accurate in a conversation without saying the generic ".net compiler"; some people refer to it as JIT/ILC.)
And indeed, on the C# -> IL side there's little that's being actually optimized. Besides collection literals there's also switch statements/expressions over strings, along with certain pattern matching constructs that get improved on that side.
Is it a public project?
Also, IIS hosting through Http.sys is still an option that sees a separate set of improvements, but that's not relevant in most situations given the move to .NET 8 from Framework usually also involves replacing a Windows Server host with a Linux container (though it works perfectly fine on Windows as well).
On Regex, compiled (and now source-generated) automata have seen a lot of work in all recent releases; it is night and day compared to what it was before - just read the articles. Previously, linear scans against heavy internal data structures (matching by hashset) and heavy transient allocations got replaced with bloom-filter-style SIMD search and other state-of-the-art text search algorithms[0], at the completely opposite end of the performance spectrum.
So when you have compiler improvements multiplied by changes to CoreLib internals multiplied by changes to frameworks built on top - it's achievable with relative ease. .NET Framework, while performing adequately, was still that slow compared to what we got today.
[0] https://github.com/dotnet/runtime/tree/main/src/libraries/Sy...
And you have misrepresented the contents of the blogs. The projects discussed in the blogs are typically claiming ~30% improvements (perhaps because they weren't using static PGO in their 4.7.0 incarnation), with two dramatic outliers that seem to be related to migrating from IIS to Kestrel.
It’s also convenient to ignore the rest of the content at the links but it seems you’re more interested in proving your argument so the data I provided doesn’t matter.
When I'm building stuff I try my best to focus on "correctness", and try to come up with an algorithm/design that will encompass all realistic use cases. If I focus on that, it's relatively easy to go back and convert my `decimal` type to a float64, or even convert an if statement into a switch if it's actually faster.
Reminds me of the classic https://stackoverflow.com/questions/24848359/which-is-faster...
But I agree, algorithmic complexity is generally the only thing I focus on, and even then it's almost always a case of "will that actually matter?" If I know that `n` is never going to be more than like `10`, I might not bother trying to optimize an O(n^2) operation.
What I feel often gets ignored in these conversations is latency; people obsess over some "optimization" they learned in college a decade ago, and ignore the 200 HTTP or Redis calls being made ten lines below, despite the fact that the latter will have a substantially higher impact on performance.
My experience is the opposite - a sizeable chain of ifs has more that can go wrong precisely because it is more flexible. If I'm looking at a switch, I immediately know, for instance, that none of the tests modifies anything.
Meanwhile, while a missing break can be a brutal error in a language that allows it, it's usually trivial to set up linting to require either an explicit break or a comment indicating fallthrough.
It turns out the algorithmic complexity of a switch statement and the equivalent series of if-statements is identical. The bijective mapping between them is close to the identity function. Does a naive compiler exist that doesn't emit the same instructions for both, at least outside of toy hobby project compilers written by amateurs with no experience?
If statements are unbounded, unconstrained logic constructs, whereas switch statements are type-checkable. The concern about missing break statements here is irrelevant, where your linter/compiler can warn about missing switch cases they can easily warn about non-terminated (non-explicitly marked as fall-through) cases.
For non-compiled languages (where branch prediction is not possible because the code is not even loaded), switch statements also provide a speed-up: the interpreter can immediately evaluate which branch to execute instead of being forced to evaluate intermediate steps (and the condition of each if statement can produce side effects, e.g. if (checkAndDo()) { ... } else if (checkAndDoB()) { ... } else if (checkAndDoC()) { ... }).
Which, of course, is a potential use of if statements that switches cannot use (although side-effects are usually bad, if you listened to your CS profs)... And again a sort of "static analysis" guarantee that switches can provide that if statements cannot.
Awkward switch syntax aside, the switch is simpler to reason about. Fundamentally we should strive to keep our code simple to understand and verify, not worry about compiler optimizations (on the first pass).
That said, the linear test is often faster due to CPU caches, which is why JITs will often convert switches to if/elses.
IMO, switch is clearer in general and potentially faster (at very least the same speed) so it should be preferred when dealing with 3+ if/elseif statements.
However, the moment you add a side effect or something more complicated like a method call, it becomes really hard for the compiler to know whether that sort of optimization is safe to do.
The benefit of the switch statement is that it's already well positioned for the compiler to optimize as it does not have the "you must run these evaluations in order" requirement. It forces you to write code that is fairly compiler friendly.
All that said, probably a waste of time debating :D. Ideally you have profiled your code and the profiler has told you "this is the slow block" before you get to the point of worrying about how to make it faster.
if statements are dumber, and maybe arguably uglier, but I feel like they're also more clear, and people don't try and be clever with them.
For example, with java there's enhanced switch that looks like this
The C-style switch break stuff is definitely a language mistake. However, most languages have pretty permissive switch statements, just like C.
So no offense, but I would revisit the wider world of language constructs before claiming that switch statements are "all bad". There are plenty of bad languages or languages with poor implementations of syntax, that do not make the fundamental language construct bad.
This is not entirely true either... Measure. There are many cases where the optimiser will vectorise a certain algorithm but not another... In many cases a vectorised O(n^2) may be significantly faster than O(n) or O(n log n) even for very large datasets, depending on your data...
Make your algorithms generic and it won't matter which one you use, if you find that one is slower swap it for the quicker one. Depending on CPU arch and compiler optimisations the fastest algorithm may actually change multiple times in a codebases lifetime even if the usage pattern doesn't change at all.
that said, c++ is usually a language you use when you care about performance, at least to an extent. it's worth understanding features like nrvo and rewriting functions to allow the compiler to pick the optimization if it doesn't hurt readability too much.
Writing well structured readable code is typically far more important than making it twice as fast. And those times can rarely be predicted beforehand, so you should mostly not worry about it until you see real performance problems.
Why should you care about performance?
I can give you my personal experience: I’ve been working on a Java web/application server for the past 15 years and a typical request (only reading, not writing to the db) would take maybe 4-5 ms to execute. That includes HTTP request parsing, JSON parsing, session validation, method execution, JSON serialization, and HTTP response dispatch. Over the past 9 months I have refactored the entire application for performance and a typical request now takes about 0.25 ms or 250 microseconds. The computer is doing so much less work to accomplish the same tasks, it’s almost silly how much work it was doing before. And the result is the machine can handle 20x more requests in the same amount of time. If it could handle 200 requests per second per core before, now it can handle 4000. That means the need to scale is felt 20x less intensely, which means less complexity around scaling.
High performance means reduced scaling requirements.
I'm not saying you completely throw caution to the wind, I'm just saying that there's a finite amount of human resources and it can really vary how you want to allocate them. Sometimes the better path is to just throw money at the problem.
It really depends.
I personally don’t like the idea of throwing compute at a slow solution. I like when the extra effort has been put into something. The good feeling I get from interacting with something that is optimal or excellent is an end in itself and one of the things I live for.
CPUs are ridiculously fast now, and compilers are really really good now too. I'm not going to say that processing speed is a "solved" problem, but I am going to say that in a lot of performance-related cases the CPU processing is probably not your problem. I will admit that this kind of pokes holes in my previous response, because introducing more machines into the mix will almost certainly increase latency, but I think it more or less holds depending on context.
But I think it really is a matter of nuance, which you hinted at. If I'm making an admin screen that's going to have like a dozen users max, then a slow, crappy solution is probably fine; the requests will be served fast enough to where no one will notice anyway, and you can probably even get away with the cheapest machine/VM. If I'm making an FPS game that has 100,000 concurrent users, then it almost certainly will be beneficial to squeeze out as much performance out of the machine as possible, both CPU and latency-wise.
But as I keep repeating everywhere, you have to measure. You cannot assume that your intuition is going to be right, particularly at-scale.
(Ironically, HN is buckling under load right now, or some other issue.)
If your problem can fit on one server, it can massively reduce engineering and infrastructure costs.
For example, much to the annoyance of a lot of people, I don't typically use floating point numbers when I start out. I will use the "decimal" or "money" types of the language, or GMP if I'm using C. When I do that, I can be sure that I won't have to worry about any kind of funky overflow issues or bizarre rounding problems. There might be a performance overhead associated with it, but then I have to ask myself "how often is this actually called?"
If the answer is "a billion times" or "once in every iteration of the event loop" or something, then I will probably eventually go back and figure out if I can use a float or convert it to an integer-based thing, but in a lot of cases the answer is "like ten or twenty times", and at that point I'm not even 100% sure it would be even measurable to change to the "faster" implementations.
What annoys me is that people will act like they really care about speed, do all these annoying micro-optimizations, and then forget that pretty much all of them get wiped out immediately upon hitting the network, since the latency associated with that is obscene.
It is why many language ecosystems suffered from performance issues for a really long time even if completely unwarranted.
Is changing ifs to switch or vice versa, as outlined in the post above, a waste of time? Yes; unless you are writing some encoding algorithm or a parser, it will not matter. The compiler will lower trivial statements to the same codegen, and even if there were a difference it would not impact the resulting performance, given the problem the code was solving.
However, there are things that do cost, like interface spam, abusing lambdas to write needlessly complex workflow-style patterns (which are also less readable and worse in 8 out of 10 instances), not caching objects that always have the same value, etc.
These kinds of issues, for example, plagued the .NET ecosystem until the more recent culture shift where it started to be cool once again to focus on performance. It wasn't helped by the notion of "well-structured code" being just idiotic "clean architecture" and "GoF patterns" style dogma applied to the smallest applications and simplest of business domains.
(It is also the reason why picking slow languages in general is a really bad idea - everything costs more and you have way less leeway, for no productivity win. Ruby, Python, and JS with Node.js are less productive to write in than C#/F#, Kotlin/Java, or Go (under some conditions).)
There are plenty of cases where even the "slow" implementation is more than fast enough, and there are also plenty of cases where the "correct" solution (from a big-O or intuition perspective) is actually slower than the dumb one. Intuition helps, but you have to measure and/or look at the compiled results if you want correct numbers.
An example that really annoys me is how every whiteboard interview ends up being "interesting ways to use a hashmap", which isn't inherently an issue, but they will usually be so small-scoped that an iterative "array of pairs" might actually be cheaper than paying the up-front cost of hashing and potentially dealing with collisions. Interviews almost always ignore constant factors, and that's fair enough, but in reality constant factors can matter, and we're training future employees to ignore that.
I'll say it again: as far as I can tell, you have to measure if you want to know if your result is "faster". "Measuring" might involve memory profilers, or dumb timers, or a mixture of both. Gut instincts are often wrong.
When I was taught about performance, it was all about benchmarking and profiling. I never needed to trust what my professors taught, because they taught me to dig in and find the truth for myself. This was taught alongside the big-O stuff, with several examples where "fast" algorithms are slower on small inputs.
The JVM ecosystem has the IntelliJ IDEA profiler and similar advanced tools (AFAIK).
.NET has VS/Rider/dotnet-trace profilers (they are very detailed) to produce flamegraphs.
Then there are native profilers which can work with any AOT compiled language that produces canonically symbolicated binaries: Rust, C#/F#(AOT mode), Go, Swift, C++, etc.
For example, you can do `samply record ./some_binary`[0] and then explore multi-threaded flamegraph once completed (I use it to profile C#, it's more convenient than dotTrace for preliminary perf work and is usually more than sufficient).
[0] https://github.com/mstange/samply
Also, there is a learning curve to grouping and aggregating the data.
Very true, though there is one case where one can be highly confident that this is the case: code elimination.
You can't get any faster than not doing something in the first place.
Yeah, that's never been true. Old compilers would often compile a switch to __slower__ code because they'd tend to always go to a jump table implementation.
A better reason to use the switch is because it's better style in C-like languages. Using an if statement for that sort of thing looks like Python; it makes the code harder to maintain.
(Also, Python has a switch-like construct now.)
PS: I'm presently revisiting C++14 because it's the most universal statically compiled language for quickly answering interview problems. It would be unfair to impose Rust, Go, Elixir, or Haskell on an interviewing software engineer.
When talking about not assuming optimizations...
32bit float is slower than 64bit float on reasonable modern x86-64.
The reason is that 32bit float is emulated by using 64bit.
Of course if you have several floats you need to optimize against cache.
x86-64 requires the hardware to support SSE2, which has native single-precision and double-precision instructions for floating-point (e.g., scalar multiply is MULSS and MULSD, respectively). Both the single precision and the double precision instructions will take the same time, except for DIVSS/DIVSD, where the 32-bit float version is slightly faster (about 2 cycles latency faster, and reciprocal throughput of 3 versus 5 per Agner's tables).
You might be thinking of x87 floating-point units, where all arithmetic is done internally using 80-bit floating-point types. But all x86 chips in like the last 20 years have had SSE units--which are faster anyways. Even in the days when it was the major floating-point units, it wasn't any slower, since all floating-point operations took the same time independent of format. It might be slower if you insisted that code compilation strictly follow IEEE 754 rules, but the solution everybody did was to not do that and that's why things like Java's strictfp or C's FLT_EVAL_METHOD were born. Even in that case, however, 32-bit floats would likely be faster than 64-bit for the simple fact that 32-bit floats can safely be emulated in 80-bit without fear of double rounding but 64-bit floats cannot.
[0] https://gist.github.com/dosshell/495680f0f768ae84a106eb054f2...
Sorry for the confusion and spreading false information.
SIMD/MIMD will benefit from working on smaller widths. This is true not only because they do more work per clock but because memory is slow. Super slow compared to the CPU. Optimization is a lot about optimizing for cache misses.
(But remember that the cache line is 64 bytes, so reading a single value smaller than that will take the same time. So it does not matter in theory when comparing one f32 against one f64)
Mostly the `final` keyword serves as a compile-time assertion. The compiler (sometimes linker) is perfectly capable of seeing that a class has no derived classes, but what `final` assures is that if you attempt to derive from such a class, you will raise a compile-time error.
This is similar to how `inline` works in practice -- rather than providing a useful hint to the compiler (though the compiler is free to treat it that way) it provides an assertion that if you do non-inlinable operations (e.g. non-tail recursion) then the compiler can flag that.
All of this is to say that `final` can speed up runtimes -- but it does so by forcing you to organize your code such that the guarantees apply. By using `final` classes, in places where dynamic dispatch can be reduced to static dispatch, you force the developer to not introduce patterns that would prevent static dispatch.
It is also an optimization hint, but AFAIK, modern compilers ignore it.
Need a way to make inlining heuristics ignore whether a function is inline https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93008
(The bug saw a few updates recently; that's how I remembered it.)
As a workaround, if you need the linkage aspect of the inline keyword, you currently have to write fake templates instead. Not great.
It's not just "you can have multiple definitions of the same function" but rather a promise that the function doesn't need to be address/pointer equivalent between translation units. This is arguably more important than inlining directly because it means the compiler can fully deduce how the function may be used without any LTO or other cross translation unit optimisation techniques.
Of course you could still technically expose a pointer to the function outside a TU but doing so would be obvious to the compiler and it can fall back to generating a strictly conformant version of the function. Otherwise however it can potentially deduce that some branches in said function are unreachable and eliminate them or otherwise specialise the code for the specific use cases in that TU. So it potentially opens up alternative optimisations even if there's still a function call and it's not inlined directly.
Traditionally you'd use `static` for that use case, wouldn't you?
After all, `inline` can be ignored, `static` can't.
I can see exactly one use for an effect like that: static variables within the function.
Are there any other uses?
No, its purpose was and is still to specify a preference for inlining. The C++ standard itself says this:
> The inline specifier indicates to the implementation that inline substitution of the function body at the point of call is to be preferred to the usual function call mechanism.
https://eel.is/c++draft/dcl.inline
How? The compiler doesn't see the full program.
The linker I'm less sure about. If the class isn't guaranteed to be fully private wouldn't an optimizing linker have to be conservative in case you inject a derived class?
That's incorrect. The optimizer has to assume everything escapes the current optimization unit unless explicitly told otherwise. It needs explicit guarantees about the visibility to figure out the extent of the derivations allowed.
There are two applications, dynamic calls and dynamic casts.
Dynamic casts to final classes don't require checking the whole inheritance chain. Recently did this in styx [0]. The gain may appear marginal, e.g. 3 or 4 dereferences saved, but in OOP-based programs you can easily have *billions* of dynamic casts saved.
[0]: https://gitlab.com/styx-lang/styx/-/commit/62c48e004d5485d4f....
For example, the AWS C++ SDK uses virtual functions for everything. When you subclass their classes, marking your classes as final allows the compiler to devirtualize your own calls to your own functions (GCC does this reliably).
I'm curious to understand better how clang is producing worse code in these cases. The code used for the blog post is a bit too complicated for me to look at, but I would love to see some microbenchmarks. My guess is that there is some kind of icache or code side problem. where inlining more produces worse code.
`final` tells the compiler that nothing extends this class. That means the compiler can theoretically do things like inlining class methods and eliminate virtual method calls (perhaps duplicating the method)?
However, it's quite possible that one of those optimizations makes the code bigger or misaligns things with the cache in unexpected ways. Sometimes a method call can be faster than inlining, especially in hot loops.
All this being said, I'd expect final to offer very little benefit over PGO. Its main value is the constraint it imposes and not the optimization it might enable.
I want to ask, and I sincerely mean no snark, what is the point?
When working with AWS through an SDK your code will spend most of the time waiting on network calls.
What is the point of devirtualizing your function calls to save an indirection when you will be spending several orders of magnitude more time just waiting for the RPC to resolve?
It just doesn't seem like something even worth thinking about at all.
I don't think this really shows what `final` does, not to code generation, not to performance, not to the actual semantics of the program. There is no magic bullet - if putting `final` on every single class would always make it faster, it wouldn't be a keyword, it'd be a compiler optimization.
`final` does one specific thing: It tells a compiler that it can be sure that the given object is not going to have anything derive from it.
"In theory" adding 'final' only gives a compiler more information, so should only result in same or faster code.
In practice, some optimizations improve performance for more expected or important cases (in the compiler writer's estimation), with worse outcomes in other less expected, less important cases. Without a clear understanding the when and how of these 'final' optimizations, it isn't clear without benchmarking after the fact, when to use it, or not.
That makes any given test much less helpful. Since all we know is 'final' was not helpful in this case. We have no basis to know how general these results are.
But it would be deeply strange if 'final' was generally unhelpful. Informationally it does only one purely helpful thing: reduce the number of linking/runtime contexts the compiler needs to worry about.
https://reviews.llvm.org/D16821
...and the compiler can optimize using that information.
(It could also do the same without the keyword, with LTO.)
Why would I expect no performance difference? I haven't looked at the code, but I would expect that for each pixel, it iterates through an array/vector/list etc. of objects that implement some common interface, and calls one or more methods (probably something called intersectRay() or similar) on that interface. By design, that interface cannot be made final, and that's what counts. Whether the concrete derived classes are final or not makes no difference.
In order to make this a good test of "final", the pointer type of that container should be constrained to a concrete object type, like Sphere. Of course, this means the scene is limited to spheres.
The only case where final can make a difference, by devirtualising a call that couldn't otherwise be devirtualised, is when you hold a pointer to that type, and the object it points at was allocated "uncertainly", e.g., by the caller. (If the object was allocated in the same basic block where the method call later occurs, the compiler already knows its runtime type and will devirtualise the call anyway, even without "final".)
That definitely is one of the heuristics in MSVC++.
We have some performance critical code and at one point we noticed a slowdown of around ~4% in a couple of our performance tests. I investigated but the only change to that code base involved fixing up an error message (i.e. no logic difference and not even on the direct code path of the test as it would not hit that error).
Turns out that:
Inlined just fine, but after adding more text to the exception error message it no longer inlined, causing the slow-down. You could either fix it with __forceinline or by moving the exception to a function call.

std::exception does not take a string in its constructor, so most likely you used std::runtime_error. std::runtime_error has a pretty complex constructor if you pass it a long string. If it's a small string then there's no issue because it stores its contents in an internal buffer, but if it's a longer string then it has to use a reference-counting scheme to allow its copy constructor to be noexcept.
That is why you can see different behavior if you use a long string versus a short string. You can also see vastly different codegen with plain std::string as well depending on whether you pass it a short string literal or a long string literal.
You're right, I used it as a short-hand for our internal exception function, forgetting that the std one does not take a string. Our error handling function is a simple static function that takes an std::string and throws a newly constructed object with that string as a field.
But yes, it could very well have been that the string surpassed the short string optimisation threshold or something similar. I did verify the assembly before and after and the function definitely inlined before and no longer inlined after. Moving the 'throw' (and, importantly, the string literal) into a separate function that was called from the same spot ensured it inlined again and the performance was back to normal.
The reason is placement new. It is legal (given that certain invariants are upheld) in C++ to say `new(this) DerivedClass`, and compilers must assume that each method could potentially have done this, changing the vtable pointer of the object.
The `final` keyword somewhat counteracts this, but even GCC still only opportunistically honors it - i.e. it inserts a check if the vtable is the expected value before calling the devirtualized function, falling back on the indirect call.
Having said that "final" on member functions is great, and I like to see that instead of "override".
Changing an existing method's dispatch mechanism (regular, virtual, static), changing visibility, overloading, introducing a name that clashes downstream, introducing a virtual destructor, making a data member non-copyable, ...
C++ largely solves it by having tight encapsulation. As long as you don't change anything that breaks your existing interface, you should be good. And your interface is opt-in, including public members and virtual functions.
It doesn't go away just because private members exist as possible language feature.
Some APIs are aimed towards derived classes, like protected members and virtual functions, but that doesn't make the issue fundamentally different. It's just breaking APIs.
Point is, in C++ you have to opt-in to make these API surfaces, they are not the default.
Kotlin (which uses the equivalent of the Java "final" keyword by default) uses the "open" keyword for that purpose.
But it basically boils down to uniform_real_distribution having a bunch of uninlined calls to 'logl' when compiled with Clang.
Otherwise, Clang beats GCC at least on the configuration I tested.
(I am the author of the issue)
[1] https://gitlab.com/define-private-public/PSRayTracing/-/issu...
Coincidentally, I happened to be playing around yesterday with a small performance test case using uniform_real_distribution, and for some strange reason Clang was 6x slower than GCC.
I put it down to some weird clang bug on my LTS version of Ubuntu. As my installed version was clang-14, I decided it possibly had been noticed and fixed a long time ago.
After reading your message I replaced uniform_real_distribution by uniform_int_distribution, and lo and behold, Clang was indeed faster than GCC, as expected.
Thank you for coming back to me with your findings.
[0]: https://research.facebook.com/publications/bolt-a-practical-...
As you say, that's the hot one -- and making the concrete subclasses themselves "final" enables no devirtualisations since there are no opportunities for it.
In cases where you have Dog and Goose that both derive from Animal and then you have a std::vector<Animal*> (a std::vector<Animal> would slice the objects and never dispatch virtually), what is the compiler supposed to do?
https://godbolt.org/z/7xKj6qTcj
edit: And a case involving inlining:
https://godbolt.org/z/E9qrb3hKM
Also, now that I think of it, they should have run the code under perf and compared the stats.
During such long and compute-intensive tests, how are thermal considerations mitigated? Not saying that this was the case here, but I can see how, after saturating all cores for 8 hours, the whole PC might get hot to the point where the CPU starts throttling, so when you reboot to the next OS or start another batch, overall performance could be a bit lower.
There will be operating-system noise that can be in the multi-percent range. This comes from various OS services that run "in the background", taking up CPU time, evicting cache lines (which may be the most important effect), and flushing a few translation lookaside buffer (TLB) entries.
Once you recognize the variability from run to run, claiming "1%" becomes less credible. Depending on the noise level, of course.
Linux benchmarks like SPECcpu tend to be run in "single-user mode" meaning almost no background processes are running.
The clang regression might be explainable by final allowing some additional inlining and clang making a hash of it.
I am prepared to believe that there is some performance difference between the two, varying case by case, but I would expect a few percent difference, not twice the run time.
See [1] for more information.
[1] https://news.ycombinator.com/item?id=40156196
I started skimming this article after a while, because it seemed to be going into the weeds of performance comparison without ever backing up to look at what the change might be doing. Which meant that I couldn't tell if I was going to be looking at the usual random noise of performance testing or something real.
For `final`, I'd want to at least see if it is changing the generated code by replacing indirect vtable calls with direct or inlined calls. It might be that the compiler is already figuring it out and the keyword isn't doing anything. It might be that the compiler is changing the code, but the target address was already well predicted, and the change perturbs code layout enough that it gets slower (or faster). There could be something interesting here, but I can't tell without at least a little assembly output (or perhaps a relevant portion of some intermediate representation, not that I would know which one to look at).
If it's not changing anything, then perhaps there could be an interesting investigation into the variance of performance testing in this scenario. If it's changing something, then there could be an interesting investigation into when that makes things faster vs slower. As it is, I can't tell what I should be looking for.
It can't possibly be doing this, if the raytracing code is like any other raytracer I've ever seen -- since it must be looping through a list of concrete objects that implement some shared interface, calling intersectRay() on each one, and the existence of those derived concrete object types means that that shared interface can't be made final, and that's the only thing that would enable devirtualisation -- it makes no difference whether the concrete derived types themselves are final or not.
Fortran has virtual functions ("type bound procedures"), and supports a NON_OVERRIDABLE attribute on them that is basically "final". (FINAL exists in Fortran but means something else.) But it also has a means for localizing the non-overridable property.
If a type bound procedure is declared in a module, and is PRIVATE, then overrides in subtypes ("extended derived types") work as usual for subtypes in the same module, but can't be affected by overrides that appear in other modules. This allows a compiler to notice when a type has no subtypes in the same module, and basically infer that it is non-overridable locally, and thus resolve calls at compilation time.
Or it would, if compilers implemented this feature correctly. It's not well described in the standard, and only half of the Fortran compilers in the wild actually support it. So like too many things in the Fortran world, it might be useful, but it's not portable.
In fact, I would run the same test repeatedly, keeping track of the k fastest times (k being ~3-7), and only stopping when the first and the kth fastest times are within a certain tolerance (as low as 1%). This ensures repeatability.
One sample of performance data for each test is not enough. This study provides no new insights.
Performance analyst
The best I'll see is somebody who cooked up a naive microbenchmark to show that style 1 takes fewer wall nanoseconds than style 2 on his laptop.
People I've worked with don't use profilers, claiming that they can't trust it. Really they just can't be bothered to run it and interpret the output.
The truth is, most of us don't write C++ because of performance; we write C++ because that's the language the code is written in.
The performance gained by different C++ techniques seldom matters, and when it does you have to measure. Profiler reports almost always surprise me the first few times -- your mental model of what's going on and what matters is probably wrong.
From a user perspective it could be the difference between software that's pleasant to use and software that's annoying to use. From a philosophical perspective it's the difference between software that functions vs software that works well.
Of course it depends on your context as to whether this is valued, but I wouldn't dismiss it. One person's micro-optimization is another person's polish.
I'm disappointed the author's conclusion is "don't use final", not "something is wrong with clang".
Without a comparison of generated code, it could be anything.
In this way, you can avoid the need for the `final` keyword and still get the optimization the keyword enables (devirtualized calls).
>Yes, it is very hacky and I am disgusted by this myself. I would never do this in an actual product
Why? What's with the C++ community and their disgust for macros without any underlying reasoning? It reminds me of everyone blindly saying "Don't use goto; it creates spaghetti code".
Sure, if macros are overused, code can become hard to read and maintain. But for something simple like this, you shouldn't be thinking "I would never do this in an actual product".
(But I'm worse than the author; if I'm just comparing performance, I'd probably put `final` everywhere applicable and then do separate compiles with `-Dfinal=` and `-Dfinal=final`... I'd be making the assumption that it's something I either always or never want eventually, though.)
The main remaining use case for the old C macro facility I still see in new code is to support conditional compilation of architecture-specific code e.g. ARM vs x86 assembly routines or intrinsics.
Many people have a similar reaction to the use of "goto", even though it is absolutely the right choice in some contexts.
http://boost.org/libs/preprocessor
So, it's an opt-in security feature first, and a compiler hint second.
What actually may help is __attribute__((pure)) and __attribute__((const)), but I don't see them often in real code (unfortunately).
But you're right that this does not hold true for const pointers or references.
> What actually may help is __attribute__((pure)) and __attribute__((const)), but I don't see them often in real code (unfortunately).
It's disappointing that these haven't been standardized. I'd prefer different semantics, though, e.g. something that allows memoization or other forms of caching that are technically side effects, but where you are still OK with allowing the compiler to remove, reorder, or eliminate calls.
https://godbolt.org/z/6ebrbaM7b
In general, whenever you call a function that the compiler cannot inspect (because it is in another TU), and the compiler cannot prove that the function holds no reference to your variable, it has to assume that the function might change your variable. Merely passing a const reference won't help you here, because it is legal to cast away constness and modify the variable unless the original variable itself was const.
I wish that const meant something on references or pointers, and that you had to do something more explicit, like a mutable member, to allow modifying a variable. But even that would not help if the compiler can't prove that a non-const pointer hasn't escaped somehow. You could add __attribute__((pure)) to the function to help the compiler, but that is a lot stricter, so it can't always be used.
Plus, you can’t even compile your code if you try to modify a const variable.
In the same TU, sure. But across TU boundaries the compiler really can't figure out what should be const and what should not, so `const` in parameter or return values allows the compiler to tell the human "You are attempting to make a modification to a value that some other TU put into RO memory.", or issue similar diagnostics.
Const can only ever possibly have a performance impact when used directly on variables. const pointers / references are purely for the benefit of the programmer - the compiler can assume nothing because the variable could be modified elsewhere or through another pointer/reference and const_cast is legal anyway unless the original variable was const.
Now for libraries, this is a different story. There I can imagine final keyword could have an impact.
I feel like we'd have to repeat these tests quite a few times to get to a decent conclusion. Hell, small variations in performance could be caused by all sorts of things outside the actual program.
There are tons of these suggestions, like always using sealed in C# or never using private in Java.
> I would never do this in an actual product
what, why?
It's an insane level of ignorance about how these things are decided by the standards committee.
That's the first problem I see with the article. C++ isn't a fast language, as it is. There are far too many issues with e.g. aliasing rules, lack of proper vectorization (for the runtime arch), etc.
If you wish to get relatively good performance for your code, try ISPC, which still lets you get great performance with vectorization up to AVX-512 without resorting to intrinsics.
That's a bold statement due to the way it heavily contrasts with reality.
C++ is ever present in high performance benchmarks as either the highest performing language or second only to C. It's weird seeing someone claim with a straight face that "C++ isn't a fast language, as it is".
To make matters worse, you go on to confuse what a programming language is, and to confuse implementation details with language features. It's like claiming that C++ isn't a language for computational graphics just because no C++ standard dedicates a chapter to it.
Just like in every engineering domain, you need deep knowledge of the details to milk the last drop of performance out of a program. Low-latency C++ is a testament to how the smallest details can be critical to performance. But you need to be completely detached from reality to claim that C++ isn't a fast language.
I'm ready to back this up. And no, I'm not confusing things - I work in HPC (realtime computer vision) and in reality the only thing we'd use C++ for is "glue", i.e. binding implementations of the actual algorithms implemented in other languages together.
Implementations could be e.g. in CUDA, ISPC, neural-inference via TensorRT, etc.
You a junior or something? For 99% of use cases C++ autovectorisation does plenty and will outperform the same code written in higher-level languages. You are literally in the 1%, conflating your use case with the general case...
But to add to all the nonsense, you claim otherwise.
Frankly, your comments lack any credibility, which is confirmed by your lame appeal to authority.