Ask HN: Insight on subroutine vs. global variable in modern PC

Would anyone be generous enough to share knowledge and insight on the truth of the claimed efficiency/speed difference between subroutine/function (i.e. local) variables and global variables, as implemented in hardware? Some C/C++ experts say local variables live inside the stack (frame) while global variables live on the heap.

They, or some, also argue that what's on the stack gets pulled into the CPU's cache, so it's more efficient, i.e. faster, than data sitting in DRAM.

How true is this really, once C/C++ functions, local variables, and global variables have been compiled to assembly or CPU binary and actually executed?

How would one prove and measure the speed/efficiency advantage? Thanks in advance.

3 points | by mardiyah 692 days ago

1 comment

  • duped 692 days ago
    So it sounds like there are some misconceptions here. PDF warning, chapter 9 here goes into detail on how programs are laid out in memory: https://www3.nd.edu/~dthain/compilerbook/

    Essentially, the entire memory of a program, including code, data (global variables, constants, string literals, etc.), the heap, and the stack, consists of segments of one contiguous block of memory.

    Next, whether something is on the stack or heap or in data is not an indicator of where it is in the memory hierarchy of the machine. Ultimately it can be anywhere from registers to L1 cache, L2, L3 if it exists, or main memory. Eventually all data that gets operated on is moved to registers (insert hand waving) so the question of whether it's faster for data to be in the data segment, on the heap, or on the stack is the question of whether it's faster to load data from one or the other into registers. And like everything, the answer is "it depends."

    The greatest penalty happens when there is a "cache miss." What this means is the program wants some data from its memory chunk, so the CPU looks for it in cache, but it's not there so it has to go out to main memory to retrieve it. The way you generally minimize this is by making sure data that needs to be operated on at the same time is physically closer to each other in memory. That can be on the stack or the heap, not necessarily one or the other.

    The reason the stack is preferable is that oftentimes it's in L1 cache (though sometimes the stack can be bigger than the cache). That's because it is accessed constantly: every function call pushes and pops a frame, and local variables are read and written through it.

    But that's not necessarily faster than data on the heap. For example, a long vector of integers is probably faster to keep on the heap than to push way down the stack: you hold just a pointer, most of the data lives on the same pages, and you pay the cost of the dereference once. After the pointer is in a register, subsequent accesses hit L1 cache the same way stack accesses do through the stack pointer.

    That's why high performance algorithms are designed to maximize data locality, not whether it's on the stack or heap.

    The way you measure this is with tools that count cache misses, i.e. the times the CPU wants to grab something from cache but it's not there. cachegrind is a popular tool for this; it ships with Valgrind and runs as `valgrind --tool=cachegrind ./your-program`.

    So in general you don't optimize for whether stuff is on the stack or heap. You optimize the actual performance and measure the cache performance so you pick the approach that minimizes misses and maximizes hits.

    There's also a bit of nuance over things like atomics and memory fences in parallel computations, and the fact that there can be multiple stacks in programming languages with stackful coroutines for implementing concurrency. In those cases the memory model you're familiar with becomes less useful.

    And finally, no discussion is complete without talking about pointers. The reason you don't shove everything on the heap is that you would always need to dereference pointers to reach it, which is a recipe for cache misses; indirection is slow. So some compilers assume things go on the heap first, but if they can prove an object is safe to place on the stack, they move it there (escape analysis), which is an optimization because, again, the other stuff on the stack is usually close to the things that need to be operated on together.

    TL;DR it's complicated. You have to measure. The stack is sometimes faster for simple stuff, but for bigger things where you need the heap, moving to the stack is just more work. Locality is the thing that matters.

    • mardiyah 691 days ago
I'm truly lucky to have been replied to with such an answer.

      Thanks a billion.. more importantly, may God bless and reward you, ameen!