The rev.ng decompiler goes open source

(rev.ng)

267 points | by quic_bcain 29 days ago

10 comments

  • Fnoord 29 days ago
    Price model:

    > Very briefly:

    > The rev.ng framework is fully open source. You can decompile anything you want from the CLI.
    >
    > The UI will be available in the following forms:
    >
    > - free to use in the cloud for public projects;
    > - available through a subscription in the cloud for private projects;
    > - available at a cost as a fully standalone, fully offline application.

    In comparison, Hopper costs 100 USD with one year of updates [1]. Ghidra and Radare2 are FOSS and completely free to use, while IDA Pro costs a fortune.

    [1] https://www.hopperapp.com/index.html

    • eyegor 29 days ago
      Binary Ninja is another good option. In my experience it's pretty similar to IDA, but I find it more user friendly. It just has a lot of well thought out features that make me more productive. I haven't tried Hopper, but Ghidra and radare2 both had a bad dev experience and produced C that didn't "read well". Granted, it's been a couple of years since I tried either.

      Binja is $300 (or $1500 for commercial, both cheaper for students).

      https://binary.ninja/features

      • aleclm 29 days ago
        Students shouldn't pay a dime. They are poor.

        Our view is: the engine is 100% open source. The UI is available for free in the cloud for anyone experimenting, which we define as "I'm OK with leaving the project public".

        Basically, the decompiler engine is Free Software, extensible and available for automation/scripting, while the UI is available for free for students/researchers, and we can make a living off professionals (i.e., when your company is paying for it).

      • halayli 29 days ago
        They are now offering a free version: https://binary.ninja/2024/02/28/4.0-dorsai.html
        • Nereuxofficial 29 days ago
          Oh, that is awesome! I've used the cloud version previously, but now that the desktop version is free with some small limitations, I think I'll probably use it instead of Ghidra.
    • dvzk 29 days ago
      Decompilation is often the least important (and least reliable) part of IDA/Ghidra, so comparing the two is unfair. That said, the scene is perpetually starved for good C decompilers, so more attempts are always exciting.
      • aleclm 29 days ago
        > Decompilation is often the least important (and least reliable) part of IDA/Ghidra

        This is something all people using decompilers say, and it sort of shows how low the trust in decompilers is. Expectations have always been rather low.

        I've been there, but it doesn't have to be this way: the whole reason we started rev.ng is to prove that expectations can be raised.

        Apart from accuracy, which is hard but is ultimately engineering work, why don't decompilers emit syntactically valid C? Have you ever tried to re-compile code from any decompiler? It's a terrible experience.

        rev.ng only emits valid C code, and we test it with a bunch of -Wall -Wextra:

        https://github.com/revng/revng-c/blob/develop/share/revng-c/...

        Other key topic: data structures. When reversing, I spend half of the time renaming things and half of the time detecting data structures. The help I get from decompilers with the latter is basically none.

        rev.ng, by default, detects data structures on the whole binary, interprocedurally, including arrays. See the linked list example in the blog post. We also have plans to detect enums and other stuff.
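
        For illustration only (names and layout are hypothetical, not actual rev.ng output), this is the kind of shape interprocedural struct recovery aims to reconstruct for a binary that walks a linked list:

            #include <stdint.h>

            /* Hypothetical recovered type: a self-referential pointer at offset 0
             * and a 64-bit field at offset 8, inferred from how every function
             * touching this pointer accesses it. */
            struct node {
                struct node *next;
                uint64_t     value;
            };

            uint64_t sum_list(const struct node *head) {
                uint64_t total = 0;
                for (const struct node *n = head; n != NULL; n = n->next)
                    total += n->value;
                return total;
            }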

        Clearly we're not there yet, we still need to work on robustness, but our goal is to increase confidence in decompilers and actually offer features that save time. Certain tools have made progress in improving the UI and the scripting experience, but there are other things to do beyond that.

        I see this a bit like the transition from the phase in which C developers were using macros to ensure things were being inlined/unrolled to the phase where they stopped doing that because compilers got smart enough to do the right thing, and to do it much more effectively.

        • jcranmer 29 days ago
          Here's my issue with decompilers:

          I don't want to look at assembly code. I'd rather see expression trees, expressed in C-like syntax, than trying to piece together variables from two-address or three-address instructions. Looking at assembly tends to lead to brain farts like "wait, was the first or second operand the output operand?" (really, fuck AT&T syntax) or "wait, does ja implement ugt or sgt?"

          So that means I want to look at something vaguely C-like. But the problem is that the C type system is too powerful for decompilers to robustly lift to, and the resulting code is generally at best filled with distractions of wait-I-can-fix-this excessive casting and at worst just wrong. And when it's wrong, I have to resort to staring at the assembly, which (for Ghidra at least) means throwing away a lot of the notes I've accumulated because they don't correlate back to the underlying assembly.

          So what I really want isn't something that can emit recompilable C code; that's optimizing for something that doesn't help me in the end. What I want is robust decompilation to something that lets me ignore the assembly entirely. I'm a compiler writer; I can handle a language where integers aren't signed but the operands are.

          • aleclm 29 days ago
            I 120% agree with what you're saying, but emitting valid C is kinda part of what you're asking, in design terms.

            Our goal is to omit all the casts that can be omitted without changing the semantics according to C. In fact, we have a PR doing exactly this (still on the old repo, hopefully it will go in soon).

            But how can you expect to be strict about what C allows you to do implicitly if you're not even emitting valid C? For instance, thanks to the fact that we emit valid C, we could test whether the assembly emitted by a compiler is the same before and after removing redundant casts.

            My point is that emitting valid C is kind of a prerequisite for what you're asking: a rather low bar that, in practice, no mainstream decompiler clears. It's pretty obvious the decompiled code will often be redundant and outright wrong if you don't even guarantee it's syntactically valid. Clearly it's not a panacea, but it's an important design criterion and shows the direction we want to go.
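
            As a sketch of what "redundant" means here (hypothetical code, not rev.ng output): a cast you can drop without changing semantics versus one you cannot:

                #include <stdint.h>

                /* The cast on `a` is redundant: the usual arithmetic conversions
                 * already promote it to uint64_t, so dropping it changes nothing. */
                uint64_t add_widened(uint32_t a, uint64_t b) {
                    return (uint64_t) a + b;
                }

                /* This cast is NOT redundant: without it the right shift would be
                 * arithmetic (sign-preserving) instead of logical for negative x. */
                uint32_t logical_shr4(int32_t x) {
                    return (uint32_t) x >> 4;
                }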

            As for comments: we still haven't implemented inline comments, but they will be attached to program addresses, so they will be available both in disassembly and decompiled C. It's not very hard to do, but that needs some love.

            • jcranmer 29 days ago
              One of the blog posts I keep meaning to write but never quite get around to is one arguing that C is not portable assembly. What is necessary is decompilation to a portable C-like assembly, but that target is not C, and I think focusing on creating valid C tends to drag you towards suboptimal decisions, even leaving aside issues like "should SLL decompile to x << y or x << (y % 32)?"

              In my experience with Ghidra, I've just seen far too many times where Ghidra starts with wrong types for something and the result becomes gibberish--even just plain dropping stuff altogether. There are some cases where it's clear it's just poor analysis on Ghidra's part (e.g., it doesn't seem to understand stack slot reuse, and memcpy-via-xmm is very confusing to it). And Ghidra's type system lacks function pointer types, which is very annoying when you're doing vtable-heavy C++ code.

              I do like the appeal of a recompileable target language. But that language need not be C--in fact, I'm actually sketching out the design of such a language for my own purposes in being able to read LLVM IR without going crazy (which means I need to distinguish between, e.g., add nuw and just plain add).

              Analysis necessarily involves multiple levels. Given that a lot of the type analysis today tends to be crap, I'd rather prefer to have the ability to see a more solid first-level analysis that does variable recovery and works out function calling conventions so that it can inform my ability to reverse engineer structures or things like "does this C++ method return a non-trivial struct that is an implicit first parameter?"

              (Also, since I'm largely looking at C++ code in practice, I'd absolutely love to be able to import C++ header files to fill in known structure types.)

              • aleclm 29 days ago
                > should SLL decompile to x << y or x << (y % 32)?

                I think this is a bit of a misguided question. The hardware usually has a precisely defined semantics. QEMU's << behaves similarly to C's (undefined behavior for rhs >= 32), but this means that the lifter (still QEMU) will account for this and emit code preserving the semantics.

                tl;dr: the code we emit should do the right thing depending on what the original instruction did, without making assumptions about what happens in case of C undefined behavior.
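
                As a hedged illustration (not rev.ng's actual emitted code): for a 32-bit shift on an ISA that masks the count to 5 bits, as x86 does, the lifted C can preserve the architectural semantics without ever invoking C's undefined behavior for out-of-range counts:

                    #include <stdint.h>

                    /* Architectural semantics of a 32-bit shift-left whose count is
                     * masked to 5 bits: defined for every input, no C UB involved. */
                    static inline uint32_t sll32(uint32_t value, uint32_t count) {
                        return value << (count & 31);
                    }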

                > Ghidra's type system lacks function pointer types

                Weird limitation, we support those.

                > it doesn't seem to understand stack slot reuse

                That's a tricky one. We're now re-designing certain parts of the pipeline to enable LLVM to promote stack accesses to SSA values, which basically solves the stack slot reuse. This is probably one of the most important features experienced reversers ask for.

                > that language need not be C--

                Making up your own language is a temptation one should resist.

                Anyway, we're rewriting our backend using an MLIR dialect (we call it clift) which targets C but should be good enough to emit something "similar to C but slightly different". It might make sense to have a different backend there. But a "standard C" backend has to be the first use case.

                We thought about emitting C++; it would make our life simpler. But I think targeting non-C as the first and foremost backend would be a mistake.

                Also, a Python backend would be cool.

                > Analysis necessarily involves...

                I would be interested in discussing in more detail what exactly you mean here. Why don't you join our Discord server?

                > I'd absolutely love to be able to import C++ header files to fill in known structure types

                We have a project for importing from header files. Basically, we want to use a compiler to turn them into DWARF debug symbols and then import those. Not too hard.

              • pfez 28 days ago
                > I do like the appeal of a recompileable target language. But that language need not be C.

                Hey! Thanks for the very interesting feedback!

                I also strongly feel the appeal of having a decompiler emit a recompilable language. But I want to stress that it's not just appealing for its own sake. It opens up the possibility of consumption by other tools, which is a great opportunity.

                Basically, as long as the decompiler only emits some half-baked pseudocode that looks like C and that humans can understand, that "language" is only an output format. It's the end of the journey from the binary. You can look at it, you can reason about it, you can even edit it to change types and rename stuff, but its final purpose (and the only purpose of any adjustments you make to it) is human consumption and understanding.

                Don't get me wrong, human understanding is great, but it has shortcomings, and it doesn't scale.

                On the other hand, the very moment a decompiler starts emitting decompiled code in a language that is parsable by other tools, its output stops being the end of the journey. In a way, it becomes yet another intermediate language, at a different level of abstraction, that can be consumed by other tools. Think of any static analysis tool that usually requires access to the source code, except now you can throw the decompiled code at it and get useful information about your binary.

                And this isn't hypothetical. At rev.ng we have a PoC where we detect memory bugs like use-after-free in a binary, without access to the original source code, by running CodeQL or clang-static-analyzer on the decompiled C code. With all the nice reports that usually come with these tools, telling you the conditions that must hold during execution for the bug to be triggered. So, it is entirely possible to use C-based source-level static analysis tools to automate at least some part of the grinding analysis job on a binary.
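
                As a hedged illustration (not the actual PoC), this is the class of pattern such source-level checkers report once you can hand them C, decompiled or not:

                    #include <stdlib.h>
                    #include <string.h>

                    /* Checkers like clang-static-analyzer flag the final read as a
                     * use-after-free, with a trace of the conditions leading to it. */
                    int use_after_free_example(void) {
                        char *buf = malloc(16);
                        if (buf == NULL)
                            return -1;
                        memcpy(buf, "hello", 6);
                        free(buf);
                        return buf[0];  /* read of freed memory */
                    }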

                Take this with a grain of salt. It's a PoC. We haven't released it and it's not production grade yet, even if we're planning to show it around :) Also, I'm definitely not saying it's a silver bullet for every problem, or that it can solve stuff at every level of abstraction. But it makes a point: decompiling to a recompilable language is a great opportunity to tap the potential of the analysis tools available for that language.

                And if that's a direction you want to go, it suddenly becomes very important that the language you decompile to has a large pool of powerful, robust and battle-tested static analysis tools. That's definitely true for C, not so much for a custom language you roll on your own. Which is not to say your custom language isn't good, but AFAIU from your message you are designing it basically to be able to better read LLVM IR yourself without going crazy. So it seems to me to be something designed for your own eyes and mind, not for mass consumption by other analysis tools. And even if it turns out to be good for consumption by other tools, it's hard to beat the amount of engineering effort that has been put into the static analysis tools for C that are already available off the shelf.

                So, all in all, I totally agree with you on the appeal of a recompilable target language. On whether that language should be C or not, I really think it depends on what you're trying to do. If you're trying to improve human understanding of the code, in the right conditions, I can see your point. If the decompiled code is just a starting point for other tools, I still think nothing beats C (yet?).

                > Ghidra's type system lacks function pointer types

                Wow! I think this is really crippling, even without considering C++. I can think of many C codebases where people just do "C-with-classes" with a bunch of structs with function pointer fields.
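
                A minimal sketch of that pattern (names made up): a struct of function pointers acting as a hand-rolled vtable, which a decompiler simply cannot express without function pointer types:

                    #include <stddef.h>

                    /* "C with classes": the struct is the vtable. */
                    struct stream_ops {
                        int    (*open)(void *ctx);
                        size_t (*read)(void *ctx, void *buf, size_t len);
                        void   (*close)(void *ctx);
                    };

                    size_t slurp(void *ctx, const struct stream_ops *ops,
                                 void *buf, size_t len) {
                        if (ops->open(ctx) != 0)
                            return 0;
                        size_t n = ops->read(ctx, buf, len);
                        ops->close(ctx);
                        return n;
                    }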

                > the C type system is too powerful for decompilers to robustly lift to, and the resulting code is generally at best filled with distractions of wait-I-can-fix-this excessive casting and at worst just wrong.

                > I've just seen far too many times where Ghidra starts with wrong types for something and the result becomes gibberish--even just plain dropping stuff altogether.

                Besides the lack of function pointers (I can't stress enough how crippling I think that is), I'd be really interested in knowing more about the specifics of your complaints about plain-wrong type recovery. I second the invite to join our Discord server!

        • j-krieger 29 days ago
          What happens if you put in a binary from a language that compiles to C-like machine code, like Rust (LLVM) or Zig?
          • aleclm 29 days ago
            Languages with a rich standard library that generate a lot of code for you usually need some love to get rid of (or represent idiomatically) common patterns and to detect common data structures.

            We haven't looked into it yet, but the automatic data structure recognition might help.

            Frankly, Rust looks particularly scary: https://media.ccc.de/v/37c3-11684-rust_binary_analysis_featu...

            • tux3 29 days ago
              Oh, very nice! I've dealt with forsaken deeply abstract vtable mazes of hell, but the idea of using a ton of sum types, dynamic dispatch, async everywhere, and long iterator chains would make for some deliciously unreadable binaries!
        • Sesse__ 29 days ago
          > Other key topic: data structures. When reversing, I spend half of the time renaming things and half of the time detecting data structures. The help I get from decompilers with the latter is basically none.

          That's funny, because I've used both Hex-Rays and Ghidra, and gotten lots of help with data structures. The interactivity really helps a bunch with filling in the blanks.

          • aleclm 29 days ago
            In IDA you basically only have detection of the stack frame layout (in a quite confusing fashion) and "create struct out of this pointer", which is something you have to do manually and is intraprocedural.

            Imagine this being done automatically, across the whole binary. If you pass a pointer to another function, the type is still correct, and you build the type from all the functions using it.

            Then obviously the user needs to fix things, but bootstrapping can definitely be hugely improved.

            • Sesse__ 28 days ago
              I'm sure user-defined structs can benefit from combining information from multiple functions, but saying that what you get today is “basically none” is a bit of an overstatement. Also, the special (and important!) case of operating system ABI structs is great, and that information propagates throughout function calls.
        • saagarjha 29 days ago
          Curious what you do when you encounter an instruction you don't model
          • aleclm 29 days ago
            That's unlikely, since we use QEMU as a lifter, which sometimes supports new instructions before they hit silicon.

            However, I think we'll emit a call to some `noreturn` function. Basically we emit a call to `abort`.

            • saagarjha 29 days ago
              Right but you do see how this means that you need to lift code that has semantics that cannot be modeled in C?
              • aleclm 29 days ago
                Sure, in those cases we emit calls to C functions. The only thing we need to know is what registers are taken as input, what registers are output and what registers are preserved.

                In QEMU parlance, these are helper functions, and they have actual implementations. But for decompilation purposes, you don't need to implement them. You just need to know how they interact with the registers.
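
                Hypothetical sketch of how that can surface in the decompiled C (the helper name is invented): an opaque call whose declaration only records which registers go in and out:

                    #include <stdint.h>

                    /* No body is needed for decompilation purposes: the declaration
                     * documents that the helper reads rdi/rsi and produces rax. */
                    extern uint64_t helper_mystery_insn(uint64_t rdi, uint64_t rsi);

                    uint64_t decompiled_fn(uint64_t rdi, uint64_t rsi) {
                        uint64_t rax = helper_mystery_insn(rdi, rsi);
                        return rax + 1;
                    }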

      • vient 29 days ago
        Huh, for me, as a malware analyst previously and a reverse engineer in general, decompilation is the most important part of such tools. It's all about speed: pseudo-C of some kind lets you roughly understand what's going on in a function in seconds. I guess you can become pretty fast with assembly too, but C is just a lot denser.

        Regarding reliability, I would say that Hex-Rays is pretty reliable (at least for x86) if you know its limitations, like throwing away all code in catch blocks. Usually wrong decompilation is caused by either wrong section permissions or a wrong function signature, both of which can be fixed. It can have a bad time when the stack frame size goes "negative" or some complex dynamic stack array logic is involved, but those are usually signs of obfuscation anyway.

        It was less reliable 10 years ago, though. Also, even now Hex-Rays weirdly does not support some simple instructions like movbe.

      • saagarjha 29 days ago
        I hear this a lot, and in my experience people who use Ghidra or IDA and don't use the decompiler are exceptionally rare. Why would you suffer that when you can use something else for what you actually want?
        • dvzk 29 days ago
          I didn't say I never use it, just that it's not always the core feature. This will depend heavily on your field, but in my past work, the features that were way more essential were: scripting (+ IR lifting), xrefs, CFGs, labels/notes (in a persistent DB).

          In my experience decompilers will totally ignore or fail on certain types of malicious code, so they mainly exist to assist disassembly analysis. And for that purpose, they save us an incredible amount of human hours.

          • aleclm 29 days ago
            For scripting, our approach is to give you access to the project file (just a YAML file), and you can make changes from any scripting language you want. Everything the user can customize is in there; all the rest is deterministically produced from that file.

            I really disliked the fact that you usually need to buy into the version of Python that $TOOL requires you to use, or, for that matter, that you need to use a specific language at all.

            Can parse YAML? You're mostly done.

            The "project file" is what we call the model: https://docs.rev.ng/user-manual/model-tutorial/

            For xrefs, CFG and the rest: we have all of that in the UI, but we also produce them in a rich way. For instance, when we emit disassembly and decompiled code, we actually emit plain text + HTML-like markup to provide metainformation for navigation (basically, xrefs) and highlighting. So you can use all that from any language that can parse HTML/XML. It's called PTML: https://docs.rev.ng/references/ptml/

            For lifting: we use LLVM IR as our internal representation. This means that: 1) you don't have to learn an IR that no one else uses, 2) you can use off-the-shelf tools (e.g., KLEE for symbolic execution) as well as all the standard LLVM optimizations and analyses, and 3) you can recompile it, but we're not into the binary translation business anymore.

            • znpy 29 days ago
              > 3) you can recompile it, but we're not into the binary translation business anymore

              How come?

              • aleclm 29 days ago
                Short answer: if you want to execute a program (maybe with some instrumentation, for fuzzing purposes), it's much easier to adopt a dynamic approach (i.e., emulation or virtualization). With static binary translation you can get better performance, but there are a lot of other things you need to get 100% right that, with a dynamic approach, are a given (e.g., the CFG).

                There's much more room for improvement in analyzing code (as opposed to running it), so we're investing our energies there.

                That said, we're strong believers in integrating dynamic and static information; for instance, see PageBuster: https://rev.ng/blog/pagebuster

                But other than that, static binary translation is a feature of rev.ng in maintenance mode.

    • felipefar 29 days ago
      I really like licensing models of one-time payments with a pre-defined duration of updates. But I wonder how they enforce it while not making internet access a requirement for the app.
      • 8organicbits 29 days ago
        I've been planning to use a non-enforcement model for a future project. Some users will always pay, because of corporate policy or ethics. Some will never pay and will reverse engineer out any software license checks. Asking the user if they have a license keeps the honest ones honest and permits ad-hoc free trials, emergency use, and other reasonable "unlicensed use".
        • userbinator 29 days ago
          > Some will never pay and will reverse engineer out any software license checks.

          For a long time (and it might still be; I'm not paying much attention anymore), it was a "rite of passage" in the scene to crack IDA... using itself.

          • mrexodia 29 days ago
            It never was a “rite of passage”, because removing IDA’s license checks has always been trivial…
  • albertzeyer 29 days ago
    Checking the team about: https://rev.ng/about

    And looking at the code contributions: https://github.com/revng/revng/graphs/contributors

    Isn't it a bit weird that the CEO (aleclearmind) has the most commits, even far more than the CTO (pfez)? I often hear complaints from other CEOs that they can't really find any time to code anymore... Even the CTO is usually more on the managing side and less active in actual coding.

    Anyway, if this works, then I guess it's a lot of fun for them.

    Edit: Ah right, I didn't check the timeline.

    • aleclm 29 days ago
      The CTO mostly works on the backend of the decompiler, revng-c, which we just released:

      https://github.com/revng/revng-c/commits/develop/

      Eventually we'll merge the two repos.

      Also, I develop stuff every day. For some reason GitHub is not picking up my user correctly.

      > Anyway, if this works, then I guess it's a lot of fun for them.

      It is!

    • zote 29 days ago
      The CTO has more recent commits; aleclearmind's commits drop to 0 after 2020, so maybe they also have a hard time getting to code.
    • albertzeyer 29 days ago
      I wonder a bit about the downvotes. I didn't mean this as criticism in any way. In fact, I like this very much. I just found it interesting and unlike what I've seen elsewhere.

      So the downvotes are because this is not interesting or not unusual?

      • halayli 29 days ago
        Your observation was spot on and your question was answered by the CEO. People on HN can be oversensitive.
  • nextos 29 days ago
    A cool company fueled by one of the best PLT books out there: https://link.springer.com/book/10.1007/978-3-662-03811-6

    "He also met a partner in crime, Pietro. Romantically enough, he met him thanks to a book which will turn out to be foundational for company."

    https://rev.ng/about

    Congrats on the launch.

    • aleclm 29 days ago
      About the book, here's the full story: I was getting into compilers, but I was really struggling with the theory, the most famous books weren't doing it for me, and I felt really down.

      Then I find this book, which seems very dense, but clear. So I ask my advisor if I can buy it, and he goes "well, first check out the university library". I check, and there's a copy, but... it's taken.

      Working in the only group that was doing research on compilers, I'm like "who dares do compiler stuff outside our group!?"

      I go to the library:

      Me: who has the book?

      Library guy: can't tell you, privacy reasons.

      Me: what's the third letter of their surname?

      Library guy: Z

      Me: what's the second letter of their name?

      Library guy: I

      Me: thanks.

      I go here: https://www.deib.polimi.it/ita/personale-lista-alfabetica and I find him.

      Fast forward, we become friends and we start the company together.

      > Congrats on the launch.

      Thanks! It was a lot of work.

  • londons_explore 29 days ago
    Idea: automatically name variables and members of structs based on how code interacts with them.

    E.g., the next pointer in a linked list should be easy to identify as 'next'.

    That would be done by downloading all of GitHub, then seeing what variables in GitHub code have the most similar layouts and interactions, and then if the confidence is high enough, using those names.

    • aleclm 29 days ago
      In the past we were thinking of doing something like this by hand. For instance, since we detect induction variables, we could rename them to `i`.

      However, nowadays, it seems pretty obvious that the right way to do this kind of thing is with LLMs.

      This said, at this stage, we see ourselves as people building robust infrastructure. Once the infrastructure is there, using some off the shelf model to rename things or add comments is relatively easy.

      Basically: we do the hard decompilation work that needs 100% accuracy, and then we can adopt LLMs for things that are OK to be approximate such as names, comments and the like.

      Anyway, writing a script that renames stuff is pretty easy. Check out the docs: https://docs.rev.ng/user-manual/model-tutorial/

      • londons_explore 29 days ago
        If an LLM is used, it's unclear how best to do it.

        One could try to train one's own LLM from scratch, using an encoder-decoder (translation, a.k.a. seq2seq) architecture trying to predict the correct variable name given the decompiled output.

        One could try to use something like GPT-4 with a carefully designed prompt: "Given this data structure, what might be the name for this field?"

        One could try to use something pretrained like Llama, and then fine-tune it on hundreds of thousands of compiled and decompiled programs.

        • Eisenstein 29 days ago
          Option 4:

          One could take a pretrained model like Llama, train it on only a few thousand compiled and decompiled programs, then feed it compiled programs, have it decompile them, evaluate that output to make a new dataset, and fine-tune it again. Repeat until satisfactory.

    • diggan 29 days ago
      Would be very cool indeed, something like http://jsnice.org/

      Paper that describes what JSNice is doing behind the scenes: https://files.sri.inf.ethz.ch/website/papers/jsnice15.pdf

    • 19h 29 days ago
      Sounds like Sidekick for Binary Ninja.
    • qweqwe14 29 days ago
      Sort of like GitHub Copilot but for reversing?
  • dark-star 29 days ago
    It doesn't work with my ELF file:

        [orchestra] [darkstar@shiina revng]$ ./revng artifact --analyze --progress decompile-to-single-file ../maytag.ko 
        [=======================================] 100% 0.57s Analysis list revng-initial-auto-analysis (5): import-binary
        [===================>                   ]  50% 0.57s Run analyses lists (2): revng-initial-auto-analysis
        [=========>                             ]  25% 0.57s revng-artifact (2): Run analyses
        Only ELF executables and ELF dynamic libraries are supported
        [orchestra] [darkstar@shiina revng]$ file ../maytag.ko 
        ../maytag.ko: ELF 64-bit LSB relocatable, x86-64, version 1 (FreeBSD), not stripped
    
    Does it not support FreeBSD binaries?

    Edit: Ah, I missed that it doesn't support kernel modules. It probably has nothing to do with FreeBSD but with the fact that this is not a simple executable.

    • aleclm 29 days ago
      Can you open an issue on GitHub and attach the binary? I don't think it should be too hard to load that.
      • dark-star 28 days ago
        Can I somehow share the binary privately? It's a proprietary module that I probably shouldn't share publicly (also, it's... rather large).

        I opened issue #366 for it already

  • yakkityyak 29 days ago
    I hope collaborative workflows get a lot of attention. I haven't used IDA teams or anything, but a reverse engineering experience that felt as frictionless as Google Docs would be amazing.
    • aleclm 29 days ago
      That's our goal. We used to use Qt Creator as a basis for the UI; terrible idea.

      Then we switched to VSCode, which happens to be able to run in the browser. So we added some magic Kubernetes sauce and voilà, you get the cloud decompiler with exactly the same user experience as the fully standalone one.

      We still need to perform some QA on collaboration, but it basically works. One daemon, many clients. Very simple architecture.

      I think we got the inspiration to do this from a CTF where we were doing "collaboration" using IDA with multiple windows on an X session on a server with multiple cursors. Very cursed, but effective.

  • flexagoon 27 days ago
    Are there any plans to support type inference? It seems like it currently shows all variables as generic64_t. Would be nice to automatically detect their types like Ghidra does (albeit sometimes incorrectly)
  • fwr00t 29 days ago
    Seems exciting. I'm keen to try the fully standalone version. Is there any news about tentative pricing? Hopefully it's affordable enough for hobbyists as well.
  • JonChesterfield 29 days ago
    Always pleased to see more binary hacking tools. A load of overly-precise suggestions on the chosen packaging format follows because I might want to use this tool myself :)

    > `source ./environment`

    That's a bad omen. I downloaded the tar to find it does indeed set a bunch of environment variables, including PATH, though thankfully not LD_LIBRARY_PATH. Mostly prefixed "HARD_", which is maybe unique (REVNG would be a more obvious choice; colliding with existing environment variables is a bad thing).

    It sets `AWS_EC2_METADATA_DISABLED="true"` which won't break me (I don't use AWS) but in general seems dubious.

        export RPATH_PLACEHOLDER="////////////////////////////////////////////////$ORCHESTRA_ROOT"
        export HARD_FLAGS_CXX_CLANG="-stdlib=libc++"
        ... "-Wl,-rpath,$RPATH_PLACEHOLDER/lib ...
    
    This is suboptimal. The very long PATH setting with mingw32 and gentoo and mips strings in it also looks very fragile.

    I usually bail when the running instructions include "now mangle your environment variables", because that step is really strongly correlated with programs that don't work properly on my non-Ubuntu system. Wiring your application control flow through the launching environment introduces a lot of failure modes - it's not as convenient as it first appears. Much like global variables.

    Clang will burn a lot of this stuff in as defaults when you build it if you ask, e.g. `-DCLANG_DEFAULT_CXX_STDLIB=libc++` would remove the stdlib setting environment variable. DEFAULT_SYSROOT is useful too.

    Using rpath means you're vulnerable to someone running this script with LD_LIBRARY_PATH set, as that environment variable will override your DT_RUNPATH setting in the binaries. The background on this is aggravating. Abbreviating here: '-Wl,-rpath' no longer means rpath, it means 'runpath' which is a similar but much less useful construct. The badly documented invocation you probably want is `-Wl,-rpath -Wl,--disable-new-dtags` to set rpath instead of runpath, at which point the loader consults the rpath before LD_LIBRARY_PATH when looking for libraries.

    There's a good chance you can completely remove the environment mangling through a combination of setting different flags when building clang, static linking and embedding binaries in other binaries.

    Related: your clang-16 binary is dynamically linked, as in it goes looking for things like libLLVMAArch64CodeGen.so.16 at runtime. A lot of failure modes can be removed by LLVM_BUILD_STATIC=ON. E.g., if I run your dynamically linked clang with a module-based HPC toolchain active, your compiler will pick up the libraries from the HPC toolchain and it'll have a bad time. The tools are all linked against glibc as well; pros and cons to that.

    Tools are also linked against libc++.so, which is linked against libc++abi.so and so forth. Worth considering static libc++, but even if you decline that, libc++abi and libunwind can and probably should be statically linked into the libc++. The above rpath rant? Runpath isn't transitive, so dynamic libraries finding other dynamic libraries using runpath (the one you get when you ask for rpath) works really poorly.

    Context for there being so many suggestions above: I am completely out of patience with distributing dynamically linked programs on Linux. I don't want a stray environment variable from some program that had `source ourhack` in the readme, or a "module system", to reach into my application and rewire what libraries it calls at runtime, as the user experience and subsequent bug report overhead is terrible. Static linking is really good in comparison.

    Thanks again for shipping, and I hope some of the above feedback is helpful!

    • aleclm 29 days ago
      I think most of your concerns about messing with the environment are sensible only under the assumption that you actually do `source environment`.

      In truth, we suggest doing that only so you use the GCC we distribute for the demo binary. The actual way this is intended to be used is through the `./revng` script. That way, the environment changes only affect the invocation of `revng`.

      This is documented here: https://docs.rev.ng/user-manual/working-environment/ We should probably add a warning about `source ./environment`.

      Now, let's get to each of your comments :D

      > though thankfully not LD_LIBRARY_PATH

      We spent a lot of time getting a completely self-contained set of binaries where each ELF refers to its dependencies through relative paths. LD_LIBRARY_PATH is evil.

      > Mostly prefixed "HARD_"

      Those are just used by our compiler wrappers; I don't think those environment variables collide with anything in practice.

      > It sets `AWS_EC2_METADATA_DISABLED="true"`

      Original discussion: https://github.com/revng/revng/pull/309#discussion_r12805759...

      I guess we could patch the AWS SDK to avoid this. Anyway, it only matters when rev.ng is running in the cloud.

      > export RPATH_PLACEHOLDER=...
      > export HARD_FLAGS_CXX_CLANG=...

      Those are used when linking binaries translated by revng. If you're not interested in end-to-end binary translation, they don't matter.

      > it means 'runpath' which is a similar but much less useful construct

      We specifically want DT_RUNPATH. DT_RPATH is deprecated, and there might be a use case for overriding our libraries with LD_LIBRARY_PATH.

      > There's a good chance you can completely remove the environment mangling

      I think your observations concerning "mangling the environment" are only valid for non-private environment variables. The following variables are private: RPATH_PLACEHOLDER, HARD_*, REVNG_*. Also, they are all only for binary translation purposes. We could push them down into some smaller-scoped compiler wrappers, but those make sense only if we can get rid of the environment entirely, which we can't because we ship Python.

      > a combination of setting different flags when building clang

      No, the flags also affect the linker, and there are some features of our wrappers that cannot simply be burned in. We can push them into more private places, though.

      > a lot of failure modes can be removed
      > libc++abi and libunwind can and probably should be statically linked into the libc++

      We no longer have issues with that; our build system is pretty reliable in that regard. LLVM is just one of the components; these things need to work robustly in general, and they do (with quite some effort).

      You seem to be wary of dynamic linking. We put some effort into it, and now it works pretty well and always looks things up in the right place, without ever hardcoding absolute paths anywhere, nor any install phase that "patches" the binaries. The unpacked directory can be moved wherever you want.

      > I am completely out of patience with distributing dynamically linked programs on Linux

      You're thinking of some other solution; ours does not use LD_LIBRARY_PATH, and all the binaries reference each other in a robust way using `$ORIGIN`. Try:

          ./root/bin/python ./root/bin/revng artifact --help
      
      It works.

      But again, doing `source environment` is mostly for demo purposes, in the actual use case, you just do `./revng` and your environment is untouched.

      We ship our Python, but you don't have to use it: you're supposed to just do ./revng (or interact over the network in daemon mode).

      Our approach is: use whatever tool you like for scripting as long as it can parse our YAML project file, make changes to it, and then invoke `./revng artifact` (or interact with the daemon): https://docs.rev.ng/user-manual/model-tutorial/

      Result: we get to use our Python version (the latest) and you get to use whatever language you like. Then we'll provide wrappers on PyPI that help you with that and are compatible with a large set of Python versions.

      tl;dr Don't `source ./environment`, use `./revng`.

      > Thanks again for shipping, and I hope some of the above feedback is helpful!

      I'm happy there's someone that cares about this :D

      Our next big iteration of this might involve simplifying things a lot by adopting nix + a mount namespace to make /nix/store available without root.

      Maybe this is not the right place to discuss this; we can chat on our Discord server if you'd like :)

      • JonChesterfield 29 days ago
        Not setting environment variables is indeed solved by not setting environment variables - but `source ./environment` is what's written on the announcement page at the top of this thread. './revng' doesn't appear anywhere on it.

        You haven't set LD_LIBRARY_PATH, but other people will. Also LIBRARY_PATH, and they'll put other stuff on PATH and so forth. Module systems are especially prone to this, but ending up with .bashrc doing it happens too.

        You have granted the user the ability to override parts of the toolchain with environment variables and by moving files to various directories. That's nice. Some compiler devs will appreciate it. Also, it's doing the thing Linux recommends for things installed globally, so that's defensible.

        In exchange, you will get bug reports saying "your product does not work", where the root cause eventually turns out to be "my linker chose a different library to my loader for some internal component". You also lose however many people try the product once, see it immediately fall over and don't take the time to tell you about the experience.

        I think that's a bad trade-off. Static linking is my preferred fix, but generally anything that stops forgotten environment variables breaking your software in confusing ways is worth considering.

        • aleclm 29 days ago
          > `source ./environment` is what's written on the announcement page at the top of this thread. './revng' doesn't appear anywhere on it.

          You're right, but after that there's a link to the docs where we say to use `./revng`. The blog post is for the impatient :) In the long run, the docs are what most people will look at.

          I don't think we want to support use cases that might break system packages too. If you set LD_LIBRARY_PATH to a directory where you have an LLVM installation, that might break any system program using LLVM too... Why should we try to fix that using `DT_RPATH` (which is a deprecated way of doing things) when system components don't do it?

          We might clean LD_LIBRARY_PATH and other stuff out of the environment; that might be a sensible default, yeah. Also, we might add some sanity check that prints a warning if weird libraries are pulled in.

          But it's hard to make a decision without a specific use case in mind. If you have an example, bring it forward and I'm happy to discuss what the right approach there should be.

          • JonChesterfield 28 days ago
            LLVM picking up the wrong libraries from the environment has cost me at least a couple of months over the last decade or so. Maybe twenty instances of customers being broken, ten hours or so in meetings explaining the problem and trying to persuade people that the right thing really is different for the system compiler vs your bespoke thing.

            If you think it's better for your product to find unrelated libraries with the same name at runtime, you go for it.

            Detecting that failure mode would be an interesting exercise - you could crawl your own address space after startup and try to guess whether the libraries you got are the ones you wanted. Probably implementable.

  • costco 29 days ago
    Congrats. Do you have any regrets about outsourcing lifting to the QEMU TCG or has it worked well?
    • aleclm 29 days ago
      Thanks!

      It has been working very well. Two regrets:

      1. Not rebasing our fork of QEMU for years has put us in a bad spot. But just today a member of our team managed to lift stuff with the latest QEMU. And he has also been able to lift Qualcomm Hexagon code, for which we helped add support in QEMU. Eventually we'll have the first proper Hexagon decompiler :)

      2. Focusing too much on QEMU led our frontend to be tightly coupled with it. It will now take some effort to enable support for additional, non-QEMU-based frontends. But it's not impossible: our idea is to let users add support for a new architecture by defining, in C, a struct for the CPU state and a bunch of functions acting on it. That's it. No need to learn any internal representation.
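
      A hedged sketch of that idea (everything here is hypothetical, not the actual interface): describing a new architecture with a plain C struct for the CPU state plus functions that update it:

          #include <stdint.h>

          /* Hypothetical CPU state for a toy architecture. */
          struct toy_cpu_state {
              uint32_t regs[16];
              uint32_t pc;
              uint8_t  zero_flag;
          };

          /* One function per instruction, acting on the state. */
          void toy_add(struct toy_cpu_state *cpu, int dst, int src1, int src2) {
              cpu->regs[dst] = cpu->regs[src1] + cpu->regs[src2];
              cpu->zero_flag = (cpu->regs[dst] == 0);
              cpu->pc += 4;
          }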

      tl;dr QEMU was a great choice; it worked so well that we didn't touch that part of the codebase for a long time, and now there's some technical debt there. But we're addressing it.