I like this site very much, its simple and old school.
Take a tour around the site for some other aspects of agners work, there is a good treatise regarding the fundamentals of digital electronics, as well as a proposed archetecture standard [ForwardCom].
some [as in a couple] of the links on agners site are stale or dead but do give you the general idea what to look for.
"llvm-exegesis is a benchmarking tool that uses information available in LLVM to measure host machine instruction characteristics like latency, throughput, or port decomposition.
Given an LLVM opcode name and a benchmarking mode, llvm-exegesis generates a code snippet that makes execution as serial (resp. as parallel) as possible so that we can measure the latency (resp. inverse throughput/uop decomposition) of the instruction. The code snippet is jitted and executed on the host subtarget. The time taken (resp. resource usage) is measured using hardware performance counters. The result is printed out as YAML to the standard output.
The main goal of this tool is to automatically (in)validate the LLVM’s TableDef scheduling models. To that end, we also provide analysis of the results."
I advocated for the term to only be used by folks that can build their own computer, language, OS, and apps. Look up Chuck Moore and Niklaus Wirth for examples focusing on simplicity. For full-stack architect w/ team building it, look up Bob Barton of Burrough B5000. I cant remember if early folks writing Lisp interpreters built the CPU's, too. I know Burger did for a Scheme CPU, though.
Heh, the people who can dabble in Assembly for a living won't be able to (typically) setup a major Web PHP + Database and all that. Assembly optimization is a different skill than web development, or other stuff in the stack.
We all gotta pick our specialties. For me, low level high-performance stuff is more of hobby project level. I got a few cool things I've learned by myself but it isn't really what I do for a living.
I've worked with enough backend developers who complain on how much tooling there is in frontend to know that this isn't true. And being a "fullstack developer" myself I've seen the stuff that sysadmins deal with and I know that thats a job I could never do. Knowing one part of the stack doesn't automatically qualify you for the other parts. Knowing assembly doesn't make you an expert in CSS, Webpack nor configuring IIS servers.
I friend of mine is a senior compiler engineer for a living (in a major player of the console gaming field). He said multiple time that he would not be able to keep up with the incredible mess that is web development nowadays (frontend in particular).
Have you ever implemented optimized microcode for a pipelined, branch-predicting processor that you also had to build? MICMAC is the architecture we used once upon a time. I won all of the course's extra credit for both smallest microcode program and fastest (fewest micro ops) because I Huffman-encoded the macro ops of the test programs and used progressive decoding.
How about a Java to MIPS compiler? In C++ and then again in Java just to be self-hosting.
Or an all software OpenGL-compatible pipeline with painter-algorithm Z-buffering and trapezoid engine? Quaternions and all that state... fun-fun.
Another mandatory project in college was a non-pre-fork()ed select() caching HTTP/1.0 proxy in POSIX C89. IPC was a PITA back then.
There are still bigger fish, but one can be that much less of dilettante by digging deeper and mastering the why, not the what or the current fashion.
"xor %eax, %eax" is even smaller (2 bytes vs. 3) and does the same thing since operations on the 32-bit subregisters are defined to zero the upper half . This is what compilers tend to do when they want to zero a register.
 This is not true for 8- or 16-bit subregisters, causing all kinds of fun issues (see "partial register stalls", also discussed in Agner Fog's materials).
This just came up today. Anybody know the answer? The problem involves a JIT (for opcodes, not scripting).
Chunks of code are generated. Sometimes, a chunk will need to branch to a chunk that isn't yet generated or even allocated. Chunks can be cleared away for various reasons like evictions to make room, so the branches to a chunk might need to not go there anymore.
One strategy is to use indirect branches. When the destination doesn't exist, the branch goes to code that will resolve that. It's a bit like how dynamic linkers work.
Another strategy is to replace the code. The processor may stumble a bit over what is essentially self-modifying code, but then it eventually runs faster. The degree to which the processor will pause is a key consideration here. Anybody know?
BTW, if you like digging into this stuff, we're hiring.
On a multithreaded implementation, you can do it with 2 stub functions and an array/hash of function pointers (hereto refered to as the "table") that itself acts as a spinlock array. Initialize the table with stub 1 and atomic write stub 1 to the table on eviction.
stub 1) upon entry, atomic write stub 2 into the table. Then compile or enqueue onto a compilation thread/process. At the end of compilation, atomic write the newly compiled address to the table.
stub 2) spinlock on function pointer array entry != Stub 2 address. At end of spinlock, call new function pointer. If you care about power consumption or timing attacks, the spinlock can have a backoff corresponding with the speed of your JIT compiler. This function should be able to fit into a single cacheline, so it should usually be hot.
inlined thunk: Given a function, read address from table and call it.
The client uses the thunk to call the function address in the table without knowing if it's compiled or not. On Intel, you shouldn't need a read fence due to MESI rules.
The branch/fetch predictor works by hashing the address of the branch instruction into its own lookup table. So by forcing stub 3 to be inlined, you're relying on the hash to prefetch the correct function address and call it directly.
You should also be able to put the JIT into a separate process that mmaps with the NX bit set, and have the client load without the NX bit set. So it would be difficult to exploit the JIT compiler itself (but the produced code could still be unsafe).
You might want to cachealign the function table or pack similar chunks in the same function table entry to prevent false sharing whenever there's an atomic write. Also, this table is going to thrash your cache by incurring an extra cacheline load for the trampoline that it does every time.
Anyways, that's my back of the napkin design for how to do it.
That is certainly a different way to think about threading. My instinct would be to keep a separate JIT cache for each thread, keeping the threads from stealing resources from each other but also keeping them from sharing the JIT effort.
I think your approach is pretty much how /lib/ld-linux.so.2 deals with the PLT. I didn't see an explanation for stub 3, but I guess you mean the part of the JIT output that does the indirect jump. I take it that you believe the speed advantage of a direct jump is not enough to overcome the occasional hit caused by modifying it, with all the invalidations (icache, trace cache, etc.) that it might cause. Being similar to the dynamic linker is probably well-aligned with what Intel is trying to optimize for upcoming chips. OTOH, the spectre problem might limit what Intel does.
Putting the JIT in a separate process would have to add latency. Upon hitting a missing chunk of code, translation can't really wait. The idea is to run modern stuff, such as a recent desktop OS, via the JIT. We do use more than one mapping on Linux, as required to keep SE Linux happy.
Yes, I'm assuming once code is JIT'ed, it will take a while before being JIT'ed again. But in the meantime the caches will cover some of the trampoline costs.
With regards to latency, if you have a dedicated JIT thread, you can read ahead and JIT the start of all the jump/calls up to the next current ret in parallel to executing the first chunk. Like a super fancy prefetch. Heuristically, I think it's safe to assume that most code is local but I could be wrong.
Also, if you load hwloc on x86-64, you can read your CPU topology at initialization time. If hyperthreading is present (which is somewhat common), you can set affinity for JIT threads to be on the same core as the emulated CPUs. This will minimize read concurrency overhead since you'll be reading/writing to the physical core's L1 almost every time. (This part is crazy, but you might even be able to get away without using atomics due to how cache-associativity works, but I've never tried it and it might not be guaranteed into the future.)
I found an old paper I wrote about my first DBT, and it has a nice figure about how I implemented indirect branch resolution: https://imgur.com/a/dCpY7V2. Feel free to ping me at pag on the Empire Hacking slack if you like talking about this stuff :-D
All JIT code runs into this issue. Modern CPUs are "Harvard" machines which split instruction-fetching and data-fetching. This means you need a memory fence of some kind to synchronize the separate L1 caches.
To properly handle all cases means forcing a memory fence whenever you are done writing code. I haven't thought too much about which fence is proper, but at minimum a write fence needs to ensure all data is written out as expected. You might need a full fence.
EDIT: Maybe a sfence at the location of the "writing" thread, while the JIT Code needs an lfence (to ensure all caches are properly updated). IIRC, fences on both sides are needed in these cases.
The "dynamic linker" idea seems better to my instinct, because it won't require these fences. Of course, testing would be required (Instincts are typically proven wrong in performance questions). IIRC, Intel Skylake and AMD Zen perform branch prediction on dynamic addresses, so you should get okay performance out of the dynamic lookup.
I'm sorry, but I don't think I really should take your bait.
I'm generally willing to hold a technical discussion among peers. But it has to be on even terms with a decent amount of respect given both ways. Your tone in the post above, as well as your posting history, does not give me any hope towards having a good discussion with you.
bait? regardless of my tone, a memory fence is not required. no reading of the relevant documentation supports that, if you want to be wrong you're free to do so. but please don't pretend to educate others with wild uninformed guesses, that seems worse to me.
Section 8.1.3 Handling Self- and Cross-Modifying code
(* OPTION 1 )
Store modified code (as data) into code segment; Jump to new code or an intermediate location; Execute new code;
( OPTION 2 )
Store modified code (as data) into code segment;
Execute a serializing instruction; ( For example, CPUID instruction *) Execute new code;
this doesn't strike me as contentious, there is no guideline that states a fence is necessary. you're possibly making the assumption that fences are "serializing instructions" and, well, again i'd question what documentation backs that interpretation up. is a sfence or lfence "more" serializing?
I can't ignore your tone. Sorry. Maybe other posters can, but I can't and I won't.
EDIT: I had a lot of other stuff here. But my attacks vs you were likely unfair. So I've erased them.
Nonetheless, as I stated in my first response to you: I have no hope to have a positive discussion from this thread with you. I do believe you might be a stronger assembly programmer than I am. Maybe next time we can have a good discussion. But the tone was not set correctly in the initial post, so I do not wish to discuss this matter with you any further.
im sad i missed your edit, apparently filled with unhinged personal attacks, but ive limited my commentary to the technical details and it's amazing that you still claim some higher ground while steadfastly ignoring the subject matter, on which you are incorrect and misinforming others.
im not claiming to be a "better assembler programmer" im literally quoting the only relevant documentation on a tricky area where you were spreading lies charitably dressed up as half-guesses. if you hadn't barged in with some grossly incorrect theory about fences we wouldn't be talking at all. i still think lying and misleading while putting on airs of educating folks is worse than, uh, quoting documentation? are you thinking the documentation is being too harsh on your pet theory?
will you at least agree to stop spreading this dangerous misinformation? neither fence is serializing (like... you wouldn't want them to be...) and someone building a JIT around your guessing is heading towards an erroneous implementation.
> ive limited my commentary to the technical details
> apparently filled with unhinged personal attacks,
> you were spreading lies
> lying and misleading
Too late. You already failed at this within your very own post. The above words are not technical in any means, and I predicted that this would happen. Let me repeat myself one last time:
> I do not wish to discuss this matter with you any further
Perhaps I was not clear? But let me say it one more time. I DO NOT WISH TO DISCUSS THIS MATER WITH YOU ANY FUTHER. Am I clear this time?
EDIT: That does not mean I'm unwilling to discuss things with you in the future. As I stated earlier, it seems like you're an avid assembly programmer, and there aren't many of those in the world. So maybe next time, I'll feel confident to have a discussion with you. But in this thread, at this time? I do not wish to hold any technical discussion with you today.
i legit don't know what you're expecting? your lie was central to the technical discussion and at no point did you acknowledge the error.
if you can't handle being called out for providing false technical information... don't hand out false technical information? seems easy. it's not, as you so desperately want, a personal attack to identify areas where you're misrepresenting the technology.
im curious, how did you want this to go? DT, incorrectly:"fences are required" jv6:"ah, well, hm, er, here's section 8.1.3 which doesn't mention fences at all and specifies two methods (neither of which is a fence) that ARE required to ensure correct operation" like can i even tell people the truth? or must your egregious misrepresentation stand without question for you to find the "tone" acceptable?
Clearly, you won't leave me along after my previous hints. My only conclusion is you probably want something from me. Lemme be direct about it: What do you want from me?
I know one thing you want: you want to talk about assembly language / various low level details. Unfortunately, I don't want to do that with you in this thread. So I will NOT give you that. Is there anything else you might want from me?
As for what I want: I want to say goodbye to you, at least for now. Perhaps you and I will have a more pleasant discussion in the future. But this current discussion was poisoned from the start and I do not wish for this discussion to continue.
you should probably check the Programmer's Reference Manual (PRM)
i mean, they don't try to hide this information? Section 8.1.3 gives two options that are guaranteed to work across future iterations of hardware. the sibling comment that assumes a heavy serializing like CPUID is the ONLY method is shooting from the hip and only really appropriate if you're playing tricks with what LA you're using for the write/read.
you'll probably want to skim "11.6 Self-Modifying Code" as well. SMC is guaranteed to work, not work fast. Most folks want that second one and blasting out icache entries is going to put a damper on that.
There has recently been some work in the Linux kernel to implement some infrastructure to allow hot indirections through function pointers to be replaced by dynamically rewritten branches - this is entirely done for performance reasons, so that's at least a data point in favour of the "rewrite the jump target" option.
(This is particularly motivated by the fact that Spectre mitigation requires ordinary indirect jumps in the kernel to use a "retpoline", which prevents the processor speculating the jump target but is very slow - are you using retpolines for your indirect jumps?)
At the end of the first chunk just ret. When the second chunk is generated, also emit a thunk which calls the first chunk and then calls the second chunk. Should be quite nice and efficient and will leverage the return history buffer. You're right, this stuff is fun.
Also, self modifying code typically requires full pipeline serialization via an instruction like cpuid (x86) or isb (arm). There are comments about this in the architecture optimization manuals, but in general try not to go around munging code all the time. While I'm thinking about it, you should consider an out of proc JIT architecture like Edge has, as JITs are fragile and easily exploited .
Depending on the size of your JIT code, it's possible to be more cache efficient if the JIT is pinned to its own core and uses lockless ring buffers to enqueue one-cacheline-sized chunks back and forth. But this only works if you spend more time instruction fetching to the JIT and back than it takes to atomic queue/dequeue.
In terms of latency, having the JIT in a process or thread is semantics at that point so you might as well process isolate it.
It all depends on how often your JIT instruction fetches outside of L2 due to the size of itself and the code that it generates. You can approximately measure it using cachegrind and then multiply the Data + instruction L1/L2 miss values by whatever they cost on your architecture.