I understood the problem, but I found the page's explanation a little confusing at first. In particular, "lexical differential highlighting" misled me, because the word "differential" made me think that his algorithm was comparing lines or tokens in some way, and it doesn't do that.
Basically, this algorithm tokenizes the source code and tries to color each token so that identical tokens have the same color, but similar-looking tokens have very different colors. When tokenizing, it handles comments and quoted text specially.
That's an interesting approach to countering errors from "it's almost the same but I didn't notice they were different". I wonder - if I were trying to review source code that might be malicious, maybe I could vary the color algorithm using a random source so that the source code writer couldn't make different tokens look similar in color. That might be an interesting countermeasure to some kinds of underhanded code.
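A rough sketch of the randomized variant, not the page's actual algorithm: hash each token together with a reviewer-chosen salt and take the hue from the digest. The regex tokenizer and the names `token_hue` and `colorize` are made up for illustration, and comments and quoted text are not handled. A hash only makes the hue unrelated to the spelling; it does not guarantee that two similar-looking tokens land far apart on the color wheel.

```python
import hashlib
import re

# Hypothetical tokenizer: words (optionally %-prefixed, like %ecx) or single symbols.
TOKEN_RE = re.compile(r"%?\w+|\S")

def token_hue(token: str, salt: bytes = b"") -> int:
    """Map a token to a hue in [0, 360); identical tokens always get the same hue."""
    digest = hashlib.sha256(salt + token.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 360

def colorize(line: str, salt: bytes = b"") -> list[tuple[str, int]]:
    """Return (token, hue) pairs for one line of source text."""
    return [(tok, token_hue(tok, salt)) for tok in TOKEN_RE.findall(line)]
```

A reviewer could pass something like `salt=os.urandom(16)` so that the code's author cannot predict which tokens will end up with close hues.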
That reminds me of something I read in Applied Cryptography when I was young about how one could theoretically pass messages with “ \b” to generate infinite versions of “identical” text to cause collisions.
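A small sketch of that trick as I understand it, assuming a terminal-like display where a backspace moves the cursor left and the next character overwrites. `render` and `variants` are hypothetical helpers, not anything from the book. Each space can optionally be preceded by a space plus backspace, so a message with n spaces has 2^n byte-level variants that all display the same.

```python
import hashlib
from itertools import product

def render(text: str) -> str:
    """Simulate a simple terminal: backspace moves left, other chars overwrite."""
    cells: list[str] = []
    cursor = 0
    for ch in text:
        if ch == "\b":
            cursor = max(0, cursor - 1)
        else:
            if cursor < len(cells):
                cells[cursor] = ch
            else:
                cells.append(ch)
            cursor += 1
    return "".join(cells)

def variants(text: str):
    """Yield every version of text with space+backspace optionally put before each space."""
    parts = text.split(" ")
    for choice in product(("", " \b"), repeat=len(parts) - 1):
        out = parts[0]
        for pad, part in zip(choice, parts[1:]):
            out += pad + " " + part
        yield out

def digest(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()
```

Every variant renders identically but hashes differently, which is what gives an attacker a large pool of "same-looking" messages to search for a collision.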
This idea is related to "rainbow parentheses" (e.g. for Lisp): different levels of parens just get arbitrary different colors. But matching parens are the same color, just like two occurrences of %ecx in the same line are the same.
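For comparison, a minimal sketch of the rainbow-parentheses rule: color depends only on nesting depth, so a matching pair always shares a color. The function names and the palette are invented for this example, and parentheses inside strings or comments are not treated specially.

```python
PALETTE = ["red", "orange", "green", "blue"]  # hypothetical palette, cycled by depth

def paren_depths(source: str) -> list[tuple[int, str, int]]:
    """Return (index, paren, depth) for each parenthesis; matching pairs share a depth."""
    result = []
    depth = 0
    for i, ch in enumerate(source):
        if ch == "(":
            result.append((i, ch, depth))
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                raise ValueError(f"unbalanced ')' at index {i}")
            result.append((i, ch, depth))
    return result

def paren_colors(source: str) -> list[tuple[int, str, str]]:
    """Map each parenthesis to a palette color chosen by its depth."""
    return [(i, ch, PALETTE[d % len(PALETTE)]) for i, ch, d in paren_depths(source)]
```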
This is a lexical highlighter that tries to highlight similar, but different text differently. There's a point in time where there are no new features necessary.
radare2 is a portable reversing framework. I can't think of 2 projects more dissimilar. Perhaps you were thinking that the highlighter actually did something other than color text in an arbitrary way? Can you give an example of something that you would expect to change about it, especially at the rate of multiple times a day?
Complete tangent, but one thing that I've wondered about modernish asm mnemonics is how complex they are, and especially how much type information they encode in a semi-structured way. Taking the author's example of PMULHUW, the core operation is MUL(tiply), P for packed integers, H for high result, U for unsigned, and W for word sized (16 bit). I feel like there must be a better way to express the same thing that wouldn't lead to stuff looking like one-word, all-caps alphabet soup. I don't know exactly what that would be; spelling out everything would probably make assembly way too verbose. So some sort of middle ground would be nice.
> I feel like there must be a better way to express the same thing that wouldn't lead to stuff looking like one-word, all-caps alphabet soup.
Yes, that's called a programming language :^)
Assembly is usually essentially a macro engine over the actual instructions you are emitting for your processor, and the Intel x86 chip manuals or whatever you're targeting use the outrageously long proper names, so your assembly will too. Heck, the author mentions specifically reading assembly too, so knowing what you're reading is 1:1 with the actual instruction stream is helpful, no matter how bad the official names are.
Actual programming languages just abstract away some complex instructions like SSE vectorizing (which have famously terrible names) to some high-level API and intrinsic functions. And you should too.
> the Intel x86 chip manuals or whatever you're targeting use the outrageously long proper names, so your assembly will too.
I don't see why that has to be the case; why must I use Intel-specified mnemonics instead of my own syntax? While not as radical, the AT&T vs. Intel syntax split demonstrates that the vendor syntax is not the only option. As long as the syntax captures all the details of the instructions and is completely unambiguous, it should be perfectly interchangeable.
I specifically do not want a higher level of abstraction, because I want to maintain that 1:1 relation with the actual machine code. Heck, even Intel mnemonics do not truly have a 1:1 relation to machine code, because the instruction (encoding) can depend on operand types.
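To illustrate the operand-type point, here are three common encodings of the one mnemonic `mov`, written as a small lookup table. The bytes assume 32-bit operand-size mode, and an assembler may legitimately pick a different but equivalent encoding (for example `8B C3` for the first line).

```python
# One mnemonic, three different opcodes depending on the operands.
MOV_ENCODINGS = {
    "mov eax, ebx": bytes.fromhex("89d8"),        # 89 /r : MOV r/m32, r32
    "mov al, bl":   bytes.fromhex("88d8"),        # 88 /r : MOV r/m8, r8
    "mov eax, 1":   bytes.fromhex("b801000000"),  # B8+rd : MOV r32, imm32
}

# First byte of each encoding is the opcode.
OPCODES = {asm: encoding[0] for asm, encoding in MOV_ENCODINGS.items()}
```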
Actually, it would be interesting to experiment with coloring all the abbreviations separately. P, then MUL, then H, then U, then W (or UW altogether). Not sure if it works, but it's something worth trying.
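A quick sketch of what that experiment could look like in a terminal. The part table covers only the PMULHUW breakdown given earlier in the thread and is not a general mnemonic decoder; the function names and the choice of ANSI colors are arbitrary.

```python
# (abbreviation, meaning) pairs, in order, for PMULHUW only.
PARTS = [
    ("P", "packed integers"),
    ("MUL", "multiply"),
    ("H", "high result"),
    ("U", "unsigned"),
    ("W", "word (16-bit)"),
]
ANSI_COLORS = [31, 32, 33, 34, 35]  # arbitrary ANSI foreground color codes

def split_mnemonic(mnemonic: str, parts=PARTS) -> list[tuple[str, str]]:
    """Split a mnemonic into the listed abbreviations, left to right."""
    pieces, rest = [], mnemonic
    for text, meaning in parts:
        if not rest.startswith(text):
            raise ValueError(f"{mnemonic!r} does not match the part table at {rest!r}")
        pieces.append((text, meaning))
        rest = rest[len(text):]
    if rest:
        raise ValueError(f"unexpected trailing text {rest!r}")
    return pieces

def colorize_mnemonic(mnemonic: str) -> str:
    """Wrap each abbreviation in its own ANSI color escape."""
    pieces = split_mnemonic(mnemonic)
    return "".join(
        f"\x1b[{color}m{text}\x1b[0m" for (text, _), color in zip(pieces, ANSI_COLORS)
    )
```

Calling `print(colorize_mnemonic("PMULHUW"))` in a color-capable terminal shows the five parts in five colors.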
Edit: Here's a scope-based JS highlighting repo that cites Crockford as the inspiration, but unfortunately he posted the linked description on Google+ so... uh... oops
 was a similar idea where color is determined by the prefix, so for example `currentIndex` and `randomIndex` are distinguished from each other but `currentIndex` and `currentIdx` are not.
I'm not sure about either because i) there are only a handful of mutually distinguishable colors ( does mention the same complication), and ii) we often want to highlight both the similarity and the difference among identifiers, and the cutoff is not clear. For i) we may want to leverage more kinds of formatting; for ii) I really don't have a good solution.
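A minimal sketch of prefix-based coloring, to show where the cutoff problem comes from. The prefix length of 4 is my assumption (the description above does not give one), and `prefix_hue` is a hypothetical name.

```python
import hashlib

PREFIX_LEN = 4  # assumption: how many leading characters decide the color

def prefix_hue(identifier: str, prefix_len: int = PREFIX_LEN) -> int:
    """Map an identifier to a hue in [0, 360) using only its first few characters."""
    prefix = identifier[:prefix_len]
    digest = hashlib.sha256(prefix.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 360
```

With this rule `currentIndex` and `currentIdx` always share a hue because they share the prefix `curr`, while `randomIndex` is hashed from `rand`. Changing `prefix_len` moves the cutoff but never removes it.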
I'm not a fan of this approach in general, but I am a fan of highlighting instructions from different subsets in different colors in asm, and perhaps differentiating the saturation by latency/throughput. I.e. a "heavy" instruction should probably be bright, urgent red, whereas loads, stores, adds, bit ops should probably be more muted.
Something like this is implemented in vscode-clangd. I used it for a bit but it's just too colourful. There are just colours everywhere and it's overwhelming. I went back to normal syntax highlighting.
Curious. I mean, it sounds like relying simply on contrast rather than structure. I know our visual system is insanely good at contrast, and we, as humans, tend to group tokens as a shorthand.
I'll have to remember to load up CSS or a test suite (with lots of framework calls) using this approach.
I really like this idea. I always wanted to try to take this to insane levels. For example, for large code bases have different images associated with different modules. So that your brain has more things to latch on to. e.g.: This function from the banana module is calling the teddy bear module. It seems a bit absurd since there is no correlation between the image and the module functionality but I still want to try it.
Oh my god this would have saved my bacon two days ago. p_value_default is so visually similar to v_value_default that after sitting there with another developer trying to figure out the problem for 30 mins we rewrote the whole method.
Only the next day after the deadline pressure was gone did I spot the problem.
I understand it in the case of assembly, but I don't think it'd work for something like Python better than existing syntax highlighting. So it's nice and I hope things like Radare or IDA adopt it where people even intentionally make syntax highlighting nearly impossible.
In this particular case, the highlighting is a clever workaround for the fact that x86 register naming conventions are awful. RISC architectures tend to number the registers, which makes things significantly easier to read.
which forces you to read everything individually and not miss something.
I prefer less highlighting for this reason. I highlight a few keywords but other than that I don't highlight. I find it helps me _read_ the code rather than skim the code. (and for skimming, I'd grep through it most likely looking for something specific rather than trying to understand it.)
> In 2013 I was working in nuclear power plant automation ... the job required reading a lot of assembly code.
Does anyone else find this terrifying? Nuclear power plant automation should be done in the safest of the safe languages. I would be alarmed at the thought of stuff like this being written in C, never mind in assembly!
Not really. There are plenty of chips out there without even a C compiler. Some aren't even Turing complete. There are even more that were designed and installed before manufacturers started slapping C compilers together for their DSPs, FPGAs, and MCUs.
It would be weird to care about memory safety when your board doesn't even have a heap!
Yes, he said reading assembly, not writing. Whatever they use, I'm glad that someone's having a glance at what the compiler spits out. He could also be talking about microcontrollers, and in an industrial setting PLCs wouldn't be unexpected.