Lexical differential highlighting instead of syntax highlighting

(wordsandbuttons.online)

249 points | by based2 102 days ago

24 comments

  • dwheeler 102 days ago

    I understood the problem, but I found the page's explanation a little confusing at first. In particular, "lexical differential highlighting" misled me, because the word "differential" made me think that his algorithm was comparing lines or tokens in some way, and it doesn't do that.

    Basically, this algorithm tokenizes the source code, and tries to color each token so that identical tokens have the same color, but similar-looking tokens have very different colors. When tokenizing it specially handles comments and quoted text.

    That's an interesting approach to countering errors from "it's almost the same but I didn't notice they were different". I wonder - if I were trying to review source code that were malicious, maybe I could vary the color algorithm using a random source so that the source code writer couldn't make different tokens look similar in color. That might be an interesting countermeasure to some kinds of underhanded code.
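
    A minimal sketch of the randomized variant described above, in Python. It is not the article's algorithm: it assumes a token is just a run of word characters (no special handling of comments or quoted text) and derives a hue from a salted hash. The names `token_hue`, `color_tokens` and the `salt` parameter are hypothetical. A hash makes identical tokens match and makes similar tokens unlikely to share a hue, but it does not guarantee that their hues are far apart.

```python
import hashlib
import re

# Assumption: a token is a run of word characters. The article's tokenizer
# also treats comments and quoted text specially; this sketch does not.
TOKEN_RE = re.compile(r"\w+")


def token_hue(token: str, salt: str = "") -> int:
    """Map a token to a hue in [0, 360) using a salted hash."""
    digest = hashlib.sha256((salt + token).encode("utf-8")).digest()
    return int.from_bytes(digest[:2], "big") % 360


def color_tokens(source: str, salt: str = "") -> dict:
    """Return {token: hue} for every distinct token in the source."""
    return {tok: token_hue(tok, salt) for tok in TOKEN_RE.findall(source)}


if __name__ == "__main__":
    line = "pmulhw xmm0, xmm1 ; pmulhuw xmm0, xmm1"
    # A reviewer would pick the salt; the code's author cannot predict it.
    print(color_tokens(line, salt="reviewer-secret"))
```

    Changing the salt reshuffles every color, so an author cannot pick two identifiers in advance that are known to look alike to the reviewer.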

    • saagarjha 102 days ago

      Yeah, I thought this would do something like highlight all "mov" derivatives the same way and was somewhat surprised at the brevity of the code at the bottom…

      • shipof123 102 days ago

        That reminds me of something I read in applied cryptography when I was young about how one could theoretically pass messages with “ \b” to generate infinite versions of “identical” text to cause collisions

      • kazinator 102 days ago

        This idea is related to "rainbow parentheses" (e.g. for Lisp): different levels of parens just get arbitrary different colors. But matching parens are the same color, just like two occurrences of %ecx in the same line are the same.
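
        The depth-based coloring can be sketched in a few lines of Python. This is an illustration only, not how any particular editor package implements it: `PALETTE` and `color_parens` are hypothetical names, and the sketch does not skip brackets inside strings or comments.

```python
PALETTE = ["red", "green", "blue", "orange"]  # hypothetical color names

PAIRS = {")": "(", "]": "[", "}": "{"}
OPENERS = set(PAIRS.values())


def color_parens(text: str) -> list:
    """Return (index, char, color) for each bracket in text.

    The color depends only on nesting depth, so an opener and its
    matching closer always receive the same color.
    """
    result = []
    depth = 0
    for i, ch in enumerate(text):
        if ch in OPENERS:
            result.append((i, ch, PALETTE[depth % len(PALETTE)]))
            depth += 1
        elif ch in PAIRS:
            depth = max(depth - 1, 0)  # tolerate unbalanced closers
            result.append((i, ch, PALETTE[depth % len(PALETTE)]))
    return result
```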

        • human_banana 101 days ago

          In Emacs there's a package rainbow-delimiters-mode for parentheses, braces, brackets and whatnot, and rainbow-identifier-mode, which gives variable names unique colors.

          • andrepd 102 days ago

            It's legitimately one of the best features of Excel. Does anybody know how I can achieve that in Sublime? The few options I found were subpar.

        • fake-name 102 days ago

          There's a sublime text package that does this for a bunch of different languages: https://github.com/vprimachenko/Sublime-Colorcoder

          I'm not involved in any way, I just ran it for a while at one point.

          • synthc 102 days ago

            There is also an emacs package that does something similar: https://github.com/jacksonrayhamilton/context-coloring

            • sprobertson 102 days ago
              • synthc 102 days ago

                I think DrRacket also has something like this, but it shows lines between identical variables instead of using colors.

                • xvilka 102 days ago

                  Seems dead for many years already.

                  • cjs_2 102 days ago

                    How many updates per month are you expecting for a package like this?

                    • xvilka 102 days ago

                      Multiple times a day, like radare2. Seriously, if there is no activity in 6 months - then the project is dead.

                      • mikekchar 102 days ago

                        This is a lexical highlighter that tries to highlight similar, but different text differently. There's a point in time where there are no new features necessary.

                        radare2 is a portable reversing framework. I can't think of 2 projects more dissimilar. Perhaps you were thinking that the highlighter actually did something other than color text in an arbitrary way? Can you give an example of something that you would expect to change about it, especially at the rate of multiple times a day?

              • guessmyname 102 days ago

                > There's a sublime text package that does this for a bunch of different languages

                You don’t need a package for this, Sublime Text 3 already does this automatically [1].

                [1] https://www.sublimetext.com/docs/3/color_schemes.html#hashed...

                • nh2 102 days ago

                  How can I use it?

                  The simplest way seems to be to use the "Celeste" color scheme which implements this. Is this the only way? I'd like to use a dark theme, like the default Monokai.

                • fake-name 102 days ago

                  Well, neat!

                  I haven't used the plugin since the ST2 days, so I didn't realize it was no longer needed.

                • soulofmischief 102 days ago

                  Webstorm has an option for this and it makes things like dense enclosures or JSON actually parsable.

                  • galaxyLogic 102 days ago

                    Which feature is that? I've been using WebStorm for some time and wishing for a feature that would highlight all matching parentheses (), [] and {}.

                    • _virtu 101 days ago

                      - plugin: rainbow brackets

                      - preference: semantic highlighting

                      • galaxyLogic 101 days ago

                        Thanks. I tried it but it did not quite do what I needed so I uninstalled it. (I'm afraid of plugins in general taking performance away). It worked on JS files but I have HTML documents containing (example) JavaScript etc. code. Seems it did not react to parentheses in them. Also even in plain JS files you may have strings containing parentheses.

                        Standard WebStorm already highlights matching parentheses in JavaScript and does a good job at that.

                        • soulofmischief 101 days ago

                          I don't use rainbow brackets, but I do use semantic highlighting. It's worth seeing if semantic highlighting would still be useful to you. It greatly helps scanning speed.

                  • cylon13 102 days ago

                    What made you decide to stop using it?

                  • gpspake 102 days ago

                    I remember Doug Crockford mentioning the idea of scope based highlighting for JavaScript in a workshop years back and thinking it would be useful. Cool to see it pop back up here.

                    Edit: Here's a scope based js highlighting repo that cites Crockford as the inspiration but unfortunately he posted the linked description on Google+ so... uh... oops https://github.com/azz/vscode-levels

                    • zokier 102 days ago

                      Complete tangent, but one thing that I've wondered about modernish asm mnemonics is how complex they are, and especially how much type information they encode in a semi-structured way. Taking the author's example of PMULHUW, the core operation is MUL(tiply), P for packed integers, H for high result, U for unsigned, and W for word sized (16 bit). I feel like there must be a better way to express the same thing that wouldn't lead to stuff looking like one-word, all-caps alphabet soup. I don't know exactly what that would be; spelling out everything would probably make assembly way too verbose. So some sort of middle ground would be nice.

                      • chc4 102 days ago

                        > I feel like there must be a better way to express the same thing that wouldn't lead stuff looking like one word all caps alphabet soup.

                        Yes, that's called a programming language :^)

                        Assembly is usually essentially a macro engine over the actual instructions you are emitting for your processor, and the Intel x86 chip manuals or whatever you're targeting use the outrageously long proper names, so your assembly will too. Heck, the author mentions specifically reading assembly too, so knowing what you're reading is 1:1 with the actual instruction stream is helpful, no matter how bad the official names are.

                        Actual programming languages just abstract away some complex instructions like SSE vectorizing (which have famously terrible names) to some high-level API and intrinsic functions. And you should too.

                        • zokier 102 days ago

                          > the Intel x86 chip manuals or whatever you're targeting use the outrageously long proper names, so your assembly will too.

                          I don't see why that has to be the case; why must I use Intel-specified mnemonics instead of my own syntax? While not as radical, the AT&T vs Intel syntax split demonstrates that the vendor syntax is not the only option. As long as the syntax captures all the details of the instructions so that it is completely unambiguous, it should be perfectly interchangeable.

                          I specifically do not desire higher level of abstraction because I want to maintain that 1:1 relation with the actual machine code. Heck, even Intel mnemonics do not truly have 1:1 relation to machine code, because the instruction (encoding) can depend on operand types.

                        • breck 102 days ago

                          I’ve done some experiments with tree languages that compile to ASM. I think it’s definitely the way forward.

                          • okaleniuk 101 days ago

                            Actually, it would be interesting to experiment with coloring all the abbreviations separately. P, then MUL, then H, then U, then W (or UW altogether). Not sure if it works, but it's something worth trying.
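
                            A rough Python sketch of that splitting step, using only the parts named in the PMULHUW breakdown earlier in this thread (P, MUL, H, U, W). The tables and the function name `split_mnemonic` are hypothetical and cover nowhere near the real x86 instruction set; each returned part could then be colored on its own.

```python
# Parts taken from the PMULHUW breakdown in this thread. The tables are
# illustrative and far from a complete description of x86 mnemonics.
PREFIXES = {"P": "packed integers"}
CORE_OPS = {"MUL": "multiply"}
SUFFIXES = {"H": "high result", "U": "unsigned", "W": "word (16 bit)"}


def split_mnemonic(mnemonic: str) -> list:
    """Split a mnemonic into (part, meaning) pairs, or raise ValueError."""
    rest = mnemonic.upper()
    parts = []
    # Optional prefix such as P.
    for prefix, meaning in PREFIXES.items():
        if rest.startswith(prefix):
            parts.append((prefix, meaning))
            rest = rest[len(prefix):]
            break
    # Mandatory core operation such as MUL.
    for op, meaning in CORE_OPS.items():
        if rest.startswith(op):
            parts.append((op, meaning))
            rest = rest[len(op):]
            break
    else:
        raise ValueError(f"unknown core operation in {mnemonic!r}")
    # Every remaining letter must be a known one-letter suffix.
    for letter in rest:
        if letter not in SUFFIXES:
            raise ValueError(f"unknown suffix {letter!r} in {mnemonic!r}")
        parts.append((letter, SUFFIXES[letter]))
    return parts
```

                            With this split, pmulhw and pmulhuw differ by exactly one part (the U), which is the difference a per-part coloring would expose.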

                          • lifthrasiir 102 days ago

                            [1] was a similar idea where color is determined by the prefix, so for example `currentIndex` and `randomIndex` are distinguished from each other but `currentIndex` and `currentIdx` are not.

                            I'm not sure about either approach because i) there are only a handful of mutually distinguishable colors ([1] does mention the same complication), and ii) we often want to highlight both the similarity and the difference among identifiers, and the cutoff is not clear. For i) we may want to leverage more kinds of formatting; for ii) I really don't have a good solution.

                            [1] https://medium.com/@evnbr/coding-in-color-3a6db2743a1e
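
                            The prefix behavior is easy to reproduce in a small Python sketch. This is not the scheme from [1]; it simply assumes the color is a hash of the first few characters, with `PREFIX_LEN` and `prefix_hue` as hypothetical names.

```python
import hashlib

PREFIX_LEN = 4  # assumption: how many leading characters decide the color


def prefix_hue(identifier: str, prefix_len: int = PREFIX_LEN) -> int:
    """Map an identifier to a hue in [0, 360) based only on its prefix."""
    prefix = identifier[:prefix_len].lower()
    digest = hashlib.sha256(prefix.encode("utf-8")).digest()
    return int.from_bytes(digest[:2], "big") % 360


if __name__ == "__main__":
    for name in ("currentIndex", "currentIdx", "randomIndex"):
        print(name, prefix_hue(name))
```

                            `currentIndex` and `currentIdx` share the prefix `curr` and therefore always get the same hue, while `randomIndex` is hashed from a different prefix and will usually, though not necessarily, land on a different one.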

                            • css 102 days ago

                              Wow, this actually looks amazing for math (though it seems to be stripping out a lot of the code I pasted in): https://i.imgur.com/Iur9FgK.png

                              How difficult would it be to implement this as a VSCode extension?

                              • petschge 102 days ago

                                This looks pretty good, but notice how it does not split "log(difference_squared" into two tokens. Adding '(' and ')' as delimiters should fix that.
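
                                The effect of the delimiter set can be shown with a small Python sketch. The default delimiters here are an assumption, not the set used by the article's code, and `tokenize` is a hypothetical helper.

```python
import re


def tokenize(text: str, delimiters: str = " \t\n,;") -> list:
    """Split text on any of the delimiter characters, dropping empties."""
    pattern = "[" + re.escape(delimiters) + "]+"
    return [tok for tok in re.split(pattern, text) if tok]


if __name__ == "__main__":
    text = "log(difference_squared"
    print(tokenize(text))                 # parentheses are not delimiters
    print(tokenize(text, " \t\n,;()"))    # parentheses added as delimiters
```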

                              • BenFrantzDale 102 days ago

                                I love that visually I can find usages of, say, `alpha`.

                                I do wish it did some syntax highlighting, but one could easily imagine blending between this and conventional syntax highlighting.

                              • panopticon 102 days ago

                                Tangential, but "Just as every other piece of code on Words and Buttons, it's properly unlicensed." reads like the code is literally unlicensed and not using the Unlicense license.

                                It's a little weird to me because unlicensed code is very different than the Unlicense license.

                                • ChrisSD 102 days ago

                                  And I'd add that CC0 is more "properly unlicensed" than the Unlicense is. Or at least more thoroughly so.

                                • canadaduane 102 days ago

                                  I think this is also called semantic coloring. Visual Studio Code has it on the roadmap to try this year: https://github.com/Microsoft/vscode/wiki/Roadmap#editor

                                  • sixplusone 102 days ago

                                    No, semantic coloring is about the editor having deep knowledge about your code, this is about having very similar looking names or lexemes appear different. FTA:

                                    It's fine that mov doesn't look like eax, but I'd rather prefer pmulhw and pmulhuw to be shown as differently as possible.

                                    • jcelerier 102 days ago

                                      KDevelop pioneered this a decade ago: https://zwabel.wordpress.com/2009/01/08/c-ide-evolution-from...

                                      • gmueckl 102 days ago

                                        Eclipse has also had this for ages at this point. I don't remember when they introduced it, but when you can memorize the meanings of all the colors, it's great.

                                    • m0zg 102 days ago

                                      I'm not a fan of this approach in general, but I am a fan of highlighting instructions from different subsets in different colors in asm, and perhaps differentiating the saturation by latency/throughput. I.e. a "heavy" instruction should probably be bright, urgent red, whereas loads, stores, adds, bit ops should probably be more muted.

                                      • IshKebab 101 days ago

                                        Something like this is implemented in vscode-clangd. I used it for a bit but it's just too colourful. There are just colours everywhere and it's overwhelming. I went back to normal syntax highlighting.

                                        • KuhlMensch 102 days ago

                                          Curious. I mean it sounds like relying simply on contrast rather than the structure. I know our visual system is insane at contrast, and we, as humans tend to group tokens as a shorthand.

                                          What makes me immediately pause is when I reflect on reading JavaScript: How often do I scan past 3+ lines using colour as my "bridge"? As far as I can remember, not often. Maybe I've overestimated colour-to-lead-me-through-structure. Maybe it is often colour-to-give-me-token-rhythm. Curious.

                                          I'll have to remember to load up CSS or a test suite (with lots of framework calls) using this approach.

                                          • SilkySailor 101 days ago

                                            I really like this idea. I always wanted to try to take this to insane levels. For example, for large code bases have different images associated with different modules. So that your brain has more things to latch on to. e.g.: This function from the banana module is calling the teddy bear module. It seems a bit absurd since there is no correlation between the image and the module functionality but I still want to try it.

                                            • stochastimus 102 days ago

                                              This is really cool. It kinda looks like rainbow salad, but who cares? For me at least, it is much easier to visually parse.

                                              • DarmokJalad1701 102 days ago

                                                Nice to see some MASM32 code in there in one of the examples. That's from a WIN32 app if I am not wrong.

                                                Brings back memories.

                                                • FrancisNarwhal 102 days ago

                                                  Oh my god this would have saved my bacon two days ago. p_value_default is so visually similar to v_value_default that after sitting there with another developer trying to figure out the problem for 30 mins we rewrote the whole method.

                                                  Only the next day after the deadline pressure was gone did I spot the problem.

                                                  • Avamander 102 days ago

                                                    I understand it in the case of assembly, but I don't think it'd work for something like Python better than existing syntax highlighting. So it's nice and I hope things like Radare or IDA adopt it where people even intentionally make syntax highlighting nearly impossible.

                                                    • ggm 102 days ago

                                                      I encourage the original author to find a way to talk about assembly coding in the nuclear industry.

                                                      • gcbw2 102 days ago

                                                        what do you expect to be different from your run-of-the-mill maintenance of outdated industrial automation gig?

                                                        • YeGoblynQueenne 102 days ago

                                                          At a guess, an increased probability of causing a criticality accident as a result of getting a program slightly wrong.

                                                          • exDM69 102 days ago

                                                            I'm assuming the "reading assembly" part is verifying compiler output matches what the programmer thinks and signing it off as a "blessed binary".

                                                            Some safety critical areas of software are done this way, in aerospace for example. But run-of-the-mill automation jobs aren't.

                                                            • ggm 102 days ago

                                                              bit flips from surplus neutrons? TMR? Batshit crazy lack of process checks on 'what does this button do'

                                                              war stories.

                                                              actually, I encourage anyone in coding to share run-of-the-mill maintenance of outdated industrial automation, as a gig. I'd read that blog.

                                                          • pcwalton 102 days ago

                                                            In this particular case, the highlighting is a clever workaround for the fact that x86 register naming conventions are awful. RISC architectures tend to number the registers, which makes things significantly easier to read.

                                                            • m463 102 days ago

                                                              Not code, but I'm surprised that email clients don't have better colorization from the getgo.

                                                              I think it would be the single best thing to help a huge amount of people.

                                                              • gnuvince 102 days ago

                                                                There are too many colors in too many places. Everything is highlighted and nothing stands out.

                                                                • galaxyLogic 101 days ago

                                                                  I agree. Rather than rainbow the brackets I think a better solution is to highlight the matching brackets with a temporarily different color as user moves the cursor.

                                                                  Or at least make it easy to turn the rainbows on and off.

                                                                  • Insanity 102 days ago

                                                                    which forces you to read everything individually and not miss something. I prefer less highlighting for this reason. I highlight a few keywords but other than that I don't highlight. I find it helps me _read_ the code rather than skim the code. (and for skimming, I'd grep through it most likely looking for something specific rather than trying to understand it.)

                                                                  • Analemma_ 102 days ago

                                                                    > In 2013 I was working in nuclear power plant automation ... the job required reading a lot of assembly code.

                                                                    Does anyone else find this terrifying? Nuclear power plant automation should be done in the safest of the safe languages. I would be alarmed at the thought of stuff like this being written in C, never mind in assembly!

                                                                    • holy_city 102 days ago

                                                                      Not really. There are plenty of chips out there without even a C compiler. Some don't even support Turing Completeness. There's even more that were designed and installed before manufacturers started slapping C compilers together for their DSPs, FPGAs, and MCUs.

                                                                      It would be weird to care about memory safety when your board doesn't even have a heap!

                                                                      • ARandomerDude 102 days ago

                                                                        To me, it's less terrifying than a complete rewrite in a modern language. Modern languages are great. Rewrites are often littered with bugs.

                                                                        • pvg 102 days ago

                                                                          Systems like that tend to be designed with different kinds of safeties. A mildly silly example - your typical Rails app doesn't have a watchdog timer, your toaster probably does.

                                                                        • sixplusone 102 days ago

                                                                          Yes he said reading assembly, not writing. Whatever they use, I'm glad that someone's having a glance at what the compiler spits out. Also could be talking about microcontrollers, and in an industrial setting PLCs wouldn't be unexpected.

                                                                        • splittingTimes 102 days ago

                                                                          Does something like this exist for Java eclipse?