Show HN: CLI tool for saving web pages as a single file

(github.com)

640 points | by flatroze 1701 days ago

36 comments

  • FreeHugs 1701 days ago
    One thing I always wonder when I see native software posted here:

    How do you guys handle the security aspect of executing stuff like this on your machines?

    Skimming the repo it has about a thousand lines of code and a bunch of dependencies with hundreds of sub-dependencies. Do you read all that code and evaluate the reputation of all dependencies?

    Do you execute it in a sandboxed environment?

    Do you just hope for the best like in the good old times of the C64?

    • mike-cardwell 1701 days ago
      These are the install instructions the docs say to use:

          $ git clone https://github.com/Y2Z/monolith.git
          $ cd monolith
          $ cargo install
      
      These are the ones I used:

          $ git clone https://github.com/Y2Z/monolith.git
          $ cd monolith
          $ sudo docker run --rm -w "$(pwd)" -v "$(pwd):$(pwd)" -u "$(id -u):$(id -g)" rust cargo install
      
      That isolated the build process. Similar method to isolate the execution of the built project:

          $ cd target/release
          $ sudo docker run --rm -w "$(pwd)" -v "$(pwd):$(pwd)" -u "$(id -u):$(id -g)" rust ./monolith https://www.grepular.com
      
      Slightly related: I released a project last night for easily running node applications inside containers: https://gitlab.com/mikecardwell/safernode - Without even having node or npm installed on the host system, you can still run commands like "npm install" or "npm start" to run node applications safely isolated inside ephemeral containers.
      • jolmg 1700 days ago
        Though I don't know the specifics, isn't it commonly advised to not rely on docker for secure isolation of potential malware?
        • mike-cardwell 1700 days ago
          I've heard this too, but as far as I know it's only because there are potential bugs in the container software that allow the malware to escape.

          To me, this is kind of like saying you should just run stuff as root, because there might be a privilege escalation vulnerability which lets the code run as root anyway.

          Correct me if I'm wrong.

          My goal was to make things more secure, not completely secure.

          Previously, dodgy libs could read (and add) ssh keys into ~/.ssh/, take over my NPM account by fetching ~/.npmrc, grab a copy of my ~/.bitcoin/wallet.dat, and add a keylogger into my ~/.bashrc

          Now, at least it has to break out of docker first.

          • jolmg 1700 days ago
            > To me, this is kind of like saying you should just run stuff as root, because there might be a privilege escalation vulnerability which lets the code run as root anyway.

            But I never said it was preferable to run directly on the host. There are other choices.

            > My goal was to make things more secure, not completely secure.

            There is no such thing as completely secure. The argument against docker is more along the lines of "is it really as secure as people think it is?"

            > I've heard this too, but as far as I know it's only because there are potential bugs in the container software that allow the malware to escape.

            I'm not sure docker was designed for the purpose of secure isolation, so if it fails to securely isolate, I'm not sure it would count as a bug.

            • cookiecaper 1700 days ago
              Linux relies on a concoction of properly-configured kernel subsystems to provide some level of isolation for containerized processes, and systems like LXD and Docker try to patch up the gaps.

              The cgroups interfaces don't offer much security directly -- they're mainly about containing groups of processes within certain resource consumption quotas, and afaik, don't really attempt to contemplate secure isolation directly.

              LXD approaches this by adding a uid/gid translation layer, so that the uid/gid for anything within an unprivileged container will be offset by a specified value, e.g., calls with user ID 1000 in a container are made to present to the host as user ID 1000000. This comes with its own host of issues which LXD tries to hide.

              The short answer is that if security is any type of priority for the system in question and you want to run containerized processes, you should use an OS that implements container security directly in the kernel, like FreeBSD with jails or illumos with Zones, instead of depending on getting exactly the right configuration between all the moving pieces in the Linux container stack.

              • hollerith 1700 days ago
                I have no recent information, but about 10 years ago FreeBSD maintainers were telling people not to rely on jails for security.
          • geggam 1700 days ago
            A full VM seems like somewhat of a better stance on this, no?
            • mike-cardwell 1697 days ago
              Only if you ignore why I'm doing this.

              I could counter your "VM is better than container", with "Separate hardware is better than VM".

        • zymhan 1700 days ago
          It's certainly not as secure as a VM, without a lot of safeguards.
      • wolco 1700 days ago
        For someone having trouble with npm and windows 10, the suggestion couldn't come at a better time.
    • gambler 1700 days ago
      This is a good question. I think you can make it even better by generalizing the problem. How on earth do developers hope to advance general computing forward when simply running programs isn't a solved problem? Most software engineers I know don't run docker on their home PCs. What about people who aren't in IT? Does anyone here even care? The general attitude I see is "plebs don't need to run anything they can't get outside of an app store". It's a horrible attitude.

      I very much like this quote from Alan Kay:

      "It doesn't matter what the computer can do, if it can't be learned by billions of people."

      There is no good technical reason why modern operating systems can't work out some scheme for sandboxing arbitrary programs by default. It is obviously necessary. I imagine something like an "applications" folder where every subfolder automatically becomes an isolated "container". It would have to be designed with security as the primary concern, though, unlike current container solutions.

      • derefr 1700 days ago
        But arbitrary programs are... arbitrary. Especially ones run by software engineers, and especially ones run by software engineers as part of a POSIX-alike “utility bag” ecosystem.

        Who’s to say that the user’s intent by running the program they just downloaded, isn’t to—say—overwrite a system folder? (Oh, wait, that’s exactly what Homebrew does, with the user’s full intent behind it!)

        There are tons of attempts to do what you’re talking about. Canonical’s “snaps” are a good example. As well, every OS sandboxes legacy apps by default (because they’re already virtualizing them, and sandboxing something in a virtualization layer is easy.)

        But none of those solutions really work for the “neat FOSS hack script someone wrote” workflow we’re talking about here, where you build programs from source and run them for their intentional side-effects on your system.

        You might suggest that there could be a shared sandbox for all the POSIX-like utilities to interoperate in. But what if you’re attempting to use those utilities against your real documents? (For example, a bulk metadata auto-tagging and auto-renaming utility, to get TV episodes from torrents loaded into Plex correctly.) How do you draw the line of what such a program can operate on? AFAICT, you just... can’t. Its whole purpose is to silently automate some task. If it requires constant security prompting, the task isn’t automated.

        • gambler 1700 days ago
          >But what if you’re attempting to use those utilities against your real documents?

          You copy or move documents inside the specific sandbox.

          If you want a pipeline, you establish a chain of inbox/outbox folders.

          Obviously, most of this should be done by the OS, not the user.

          The workflow:

          - You click "download" in your browser.

          - When it's done and you click on your download, the OS asks how you want to open it. Instead of an "execute" option you get a "run in a sandbox" option.

          - You type in the name of the sandbox, and the app gets copied to /apps/sandboxName or something of that sort.

          - The system automatically creates /apps/sandboxName/inbox and /apps/sandboxName/outbox.

          - To process a file in some way, you drop it into inbox dir.

          For command line, the only change would be switching from "executable pulls arbitrary files" to "I push specific file to the executable".

            zip -r squash.zip dir1
          becomes

            | -r squash.zip | zip dir1 |
          
          Start pipeline. Push squash.zip as an argument to zip, get the output. Zip would be the container name.
          • derefr 1700 days ago
            Let me put it another way: how would you implement a dotfile management framework (like any of these: https://dotfiles.github.io)? Programmers seem to really like them, judging by how many of them there are. But the whole point of them is to forcefully usurp the assets of literally every other program on the system. They're user-level rootkits, in a sense.

            Or, for a simpler, more obvious example: find(1), grep(1), etc. A set of utilities that can all be asked the equivalent of "read literally every file the VFS has access to and tell me whether they match an arbitrary-code-execution predicate." Do you want to literally copy your entire hard disk into the 'inbox' of these utilities, in order to get them to search it for you? (And before you say "well, we can trust the base utilities that ship with the OS to do more than arbitrary third-party utilities"—there's a whole competition of grep(1) replacements, e.g. ag(1), rg(1), etc. Do you want to make it impossible for people to innovate in this space?)

            Or how about Nix, or GNU Stow, or, uh, Git? These utilities become useless if they have their own sandbox. Does your git worktree live in Git's sandbox? Vi's sandbox? The inability to make this distinction functional is why mobile OSes only have fully-integrated IDEs!

            Or how about shells themselves! (Or, equivalently, any scripting runtime, e.g. Ruby, Python, etc.) Should people not be allowed to install these from third parties?

            Or, the most based example of all: make(1) [and its spiritual descendants], and the GNU autotools built atop it. How does ./configure work if you can't detect true properties of the target system, only of the sandbox you're in?

            • gambler 1700 days ago
              >"Do you want to literally copy your entire hard disk into the 'inbox' of these utilities, in order to get them to search it for you?"

              Well, let's think about the goal here. grep reads files and outputs lines from those files. It needs full read access to everything you want to search. It does not need write access outside of its sandbox. It does not need direct access to network sockets, audio stuff and so on.

              Is it unreasonable to create a readonly "view" of the filesystem inside grep's folder? Is it unreasonable to have "files" representing network access, microphone, audio? It will have visual representation in file manager without the need to create custom UI. It could be manipulated by drag-and-drop OR command line. More importantly: it's easy for users to understand. "This app lives in a box. You can put things in that box for the app to use."

              >Does your git worktree live in Git's sandbox?

              Yes? I mean, I currently have a folder called projects. All my git stuff is in there anyway.

              >Vi's sandbox?

              If you want multiple sandboxes to be able to operate on a directory, you create "views" for that directory (readonly or read/write) in multiple sandboxes. This shouldn't be some sort of mind-bending idea, considering Unix has symlinks, hardlinks, and mounted filesystems of all sorts.

              >Or how about shells themselves! (Or, equivalently, any scripting runtime, e.g. Ruby, Python, etc.) Should people not be allowed to install these from third parties?

              There is no reason why a Ruby executable should have unlimited access to the entire file system. Especially if you're only using it for a specific purpose, like serving a website.

              What I'm describing here isn't some novel, mind-blowing idea. It's simply dependency injection. With a file-based user interface. Every single part of this has been done in various operating systems or programming environments more than once. It's just a matter of combining it all in a sensible way.
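
              (As an aside: a rough sketch of that read-only "view" with today's tools, reusing the hypothetical /apps/sandboxName layout from upthread; note that a bind mount has to be remounted to actually become read-only:)

                  $ sudo mkdir -p /apps/grep-sandbox/view
                  $ sudo mount --bind ~/projects /apps/grep-sandbox/view
                  $ sudo mount -o remount,ro,bind /apps/grep-sandbox/view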

              • JetSpiegel 1699 days ago
                > It does not need write access outside of its sandbox.

                You have SELinux for that, if you like bureaucracy and filing triplicate forms to be able to run scripts with side effects.

        • popey 1697 days ago
          Amusing that you should mention Canonical "snaps". I made a snap of monolith and contributed the yaml upstream. https://snapcraft.io/monolith - it's in the edge channel because upstream haven't done a stable release yet. It's a strictly confined application.
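
          (Installing from the edge channel should then be as simple as the following, assuming the snap keeps the upstream name:)

              $ sudo snap install monolith --edge
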
        • Hackbraten 1700 days ago
          > Oh, wait, that’s exactly what Homebrew does, with the user’s full intent behind it!

          Care to elaborate?

          • derefr 1700 days ago
            Homebrew is designed to take over your OS /usr/local directory. Not that there's much in there by default, but Homebrew's presence greatly changes the semantics of /usr/local, given that it's normally meant as a prefix for the local system administrator to install things as root into, whereas Homebrew re-assigns ownership of the whole directory structure to the user installing Homebrew (who, admittedly, is a member of the "staff" group, but still only one member on a potentially multiuser computer.)
      • skybrian 1700 days ago
        Even with decent containers, making software that's good enough for everyone to use is considerably more difficult than making it for yourself and a few friends.

        It's okay to do things that don't scale at all, and also okay to make a proof of concept and let someone else worry about scaling it up.

        That said, you might want to look at Chromebooks and the Windows and Mac app stores to see what's going on with containerization beyond mobile. (Also, web browsers.)

      • vaer-k 1700 days ago
        > I imagine something like "applications" folder where every subfolder automatically becomes an isolated "container". It would have to be designed with security as the primary concern, though; unlike current container solutions.

        Isn't that basically just Qubes OS? https://www.qubes-os.org/

      • wongarsu 1700 days ago
        > I imagine something like "applications" folder where every subfolder automatically becomes an isolated "container"

        Snap [1] goes pretty far in this direction. Apps are isolated from each other, and with AppArmor isolated from the system (at least on Ubuntu, your distro might vary). Android does much of the same.

        A big problem is that most software exists to manipulate data on the user's machine, so isolating the software from the User folder is impractical. At the same time this data is usually the most valuable thing about the entire computer. That makes it fundamentally very hard to design a system where you can trust arbitrary apps. Android tried to solve this with a "file open" dialog that's controlled by the OS so that there's an easy way to give apps temporary access to single files, but that leads to weird UX.

        1: https://snapcraft.io/

      • Rediscover 1699 days ago
        > sanboxing arbitrary programs by default

        See OLPC Bitfrost

        http://wiki.laptop.org/go/Bitfrost

    • OJFord 1701 days ago
      Some people run ~everything in individual containers, with original commands being aliases for the docker (or whatever) equivalent.

      I've shunned that assuming it'd be a big slow down, but I do keep meaning to at least try it, uh, after I knock it.

      (No, containers like anything else aren't and haven't been completely secure all of the time since and forever, but it'd take a more sophisticated - and certainly deliberately malicious - tool to do any damage to your system, or to files you didn't explicitly allow it access to.)
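
      (A minimal sketch of that alias approach, reusing mike-cardwell's Docker invocation from upthread and assuming you're sitting in monolith's target/release directory:)

          $ alias monolith='sudo docker run --rm -w "$PWD" -v "$PWD:$PWD" -u "$(id -u):$(id -g)" rust ./monolith'
          $ monolith https://www.grepular.com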

      • roblabla 1700 days ago
        At least with containers, if something manages to do damage, it will have to do it through a security vulnerability - which will eventually get patched if caught. The current state of affairs is that malware can do damage without even needing to use a security vulnerability whatsoever. It's "working as designed", and will keep working forever.
      • forgotmypw 1701 days ago
        I do all of my development in a VM, which allows me to take snapshots and have a portable GNU workstation that's decoupled from my desktop and hardware.

        Meanwhile, my desktop remains clean and ready to play media in a native environment with good hardware support.

        I used to think it'd be slow, until I tried it. My computer is 8+ years old, and it works fine. I mostly do text work.

        • OJFord 1700 days ago
          I used to use a VM (motivation being to run Linux on my locked-down corporate-imaged Macbook) and it was usable, but not as fast as it could be if it had free rein over all CPU/RAM.

          But, at least the way I was doing it, it's not adding any security as discussed here, since you're doing everything in the VM, so anything in the VM has access to everything just as if it were on the host anyway.

          • forgotmypw 1700 days ago
            Scripts I run have access to my development environment for a free software project I already keep in a public repo.

            What they don't have access to is my actual desktop, nor yesterday's snapshot of the VM desktop.

            It's obviously not as fast as running GNU on bare metal, but it's fast enough for text work.

            • AnIdiotOnTheNet 1700 days ago
              > it's fast enough for text work.

              I guess this is where we set the bar in 2019.

              • OJFord 1700 days ago
                Isn't the bar wherever you need it to be, on your own system, whatever the year?

                I'm not the GP commenter, but certainly what I do for a living is browse the internet and edit text files.

              • forgotmypw 1700 days ago
                That's where my bar has been for 20+ years.
            • OJFord 1700 days ago
              Ah okay, that's why I thought to include 'at least for me' - because our 'all my development's were different! :)

              For me, the VM was my entire machine. (It wasn't meant to improve security, it was purely because I wanted Linux but couldn't have it on the host.)

              • forgotmypw 1700 days ago
                That's a big part of the motivation for me.

                I wanted GNU tools and environment, and I also wanted to not have to install Linux on the Apple hardware, because I don't have the patience for that.

    • flatroze 1701 days ago
      It's a valid question. It seems to me users tend to trust things which have a certain level of popularity and reputation associated with them.

      I personally prefer to hope for the worst. This way when nothing happens I feel extra lucky, and if bad things do happen, I feel proud of being ready for it.

      • air7 1700 days ago
        I don't. In the age-old debate between security and convenience I sometimes want the latter and am willing to ease up on the former. To me, being prepared for the worst is not fun. I use the reputation of the software creator to manage my risks.
      • sieabahlpark 1701 days ago
        That could be said about nearly everything that you do.
    • chrisked 1701 days ago
      I just clicked on the OP user name first which has the caption Bad boy. Then I read your comment. Now I worry more about the security aspect than before.
      • ErrantX 1701 days ago
        It's funny how we sometimes think about risk; if the caption was 'good egg' would you be more trusting?
        • jolmg 1700 days ago
          People trust by default and only distrust when they spot a signal that believably correlates with maliciousness. Kind of like how people judge each other based on how they dress.
        • chrisked 1701 days ago
          I’d have a different initial response. Somehow it triggered my brain to think more about security and be on alert. In any case this faded away quickly. The mind is beautiful :)
          • flatroze 1701 days ago
            There is nothing either girl or boy, but thinking makes it so.
    • progval 1701 days ago
      > Skimming the repo it has about a thousand lines of code and a bunch of dependencies with hundreds of sub-dependencies. Do you read all that code and evaluate the reputation of all dependencies?

      I usually do a quick check to see if there's any red flag, like URLs or Base64 blobs. I also try to stay away from programs written in languages whose environment I don't know, so that I can check whether any dependency stands out wrt what the program claims to do.

      > Do you execute it in a sandboxed environment?

      Either I run it on an untrusted server, or on my laptop as an unprivileged user (with no access to X/Wayland).
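
      (A minimal sketch of that unprivileged-user approach, with a hypothetical throwaway account and binary path:)

          $ sudo useradd --create-home throwaway
          $ sudo -u throwaway /path/to/monolith https://example.com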

    • sequoia 1700 days ago
      > Do you read all that code and evaluate the reputation of all dependencies?

      Why of course. I do this for every piece of software on my computer, from the device drivers to the OS, I review every patch to firefox & chrome as well. /s

      Running someone else's software inherently means extending them trust. This objection is especially confusing on a piece of software where you can actually inspect all the source if you like (unlike e.g. device drivers and OS code unless you run Gnu/Linux + all Free drivers, as few but RMS do).

      • Inhibit 1699 days ago
        Your tongue-in-cheek points are good.

        But on that last bit I disagree. Many, many systems run nothing but stock open source kernel drivers under Linux. I daresay home systems with closed drivers are more of an exception. All those VMs in the "cloud".

    • guy-brush 1700 days ago
      Related (the idea of having android-like permissions for desktop linux): https://github.com/open-source-ideas/open-source-ideas/issue...
    • jasonvorhe 1701 days ago
      What computer are you using and which operating system is running on that? Have you read the code?
    • Avi-D-coder 1700 days ago
      Before I installed this tool, I checked whether the author was a member of any well-known organization, and since they are not, I skimmed all the code for monolith and any of its dependencies that are not extremely popular, in this case mime-sniffer.
      • flatroze 1700 days ago
        That dependency is now gone, thank you for the review
    • iamvik 1700 days ago
      Largely agree, but it's hardly OP's fault. In my opinion this is one of the things we get from using a 30-year-old operating system and there having been almost no innovation in that field for so long.
    • js4ever 1701 days ago
      It takes 2 commands to create a new LXC container and ssh into it, then I install the software inside the container.
      • masklinn 1701 days ago
        That's more secure than running straight on the machine, but not completely secure; security was never a core goal of Linux containers.

        You want to run the software properly sandboxed; since Linux doesn't have purpose-built OS-level solutions (à la Solaris), that means a VM, e.g. running in QEMU, or biting the bullet and switching to Qubes.

        • Huggernaut 1701 days ago
          I think that's an extremely uncharitable view on containers. There has been a massive amount of work put into securing containers for a variety of use cases using both layers available and by adding to the kernel.
          • masklinn 1701 days ago
            > I think that's an extremely uncharitable view on containers.

            It's an objective one.

            > There has been a massive amount of work put into securing containers for a variety of use cases using both layers available and by adding to the kernel.

            That doesn't change the fact that security was never the primary goal for containers, so secure containers were and are a bunch of tricks, kludges and prayers being built up in the hope that eventually all the holes in the model will be patched.

            The "Making containers safer"[lwn][hn] talk was literally two days ago. Note how it says safer, not safe.

            [lwn] https://lwn.net/SubscriberLink/796700/9bc9daa32a8fe499/

            [hn] https://news.ycombinator.com/item?id=20764355

            • pbhjpbhj 1701 days ago
              Of course it says safer, if someone tells you something is entirely safe you know they're lying.
            • Huggernaut 1696 days ago
              It is not objective by any stretch of the imagination.

              I suppose when security enhancements are made to any other system to make them safer (i.e. everything in the realm of security), you apply the same logic? Subjective.

            • roblabla 1700 days ago
              Can you please quote how containers are not built with security in mind? What would even be the point of user namespacing, network namespaces, filesystem namespaces, etc... if not security?
              • MaxBarraclough 1700 days ago
                Look at how the cloud providers offer support for containers.

                Do they ever offer to run your container in the same VM as those of other customers?

                They never do this. For secure isolation, they only trust VM isolation. It seems unlikely that this will change.

                > What would even be the point of user namespacing, network namespaces, filesystem namespaces, etc... if not security?

                They're for installation/configuration/administration. They allow you to run multiple applications on one Linux VM, and to configure them independently, almost as if you were running multiple VMs (with the advantage of lower overheads - only one instance of the kernel).

                Kubernetes puts this to good use, letting you treat application deployments as commodities across your cluster.

                Containers do not offer secure isolation. They are by nature much leakier than the isolation VMs can offer. The Docker folks still treat isolation-failures as bugs, of course. (Well, ignoring things like the way 'uptime' gives the uptime of the underlying machine, and not of your container.)

                • Huggernaut 1696 days ago
                  There are many services that run applications colocated on VMs in containers.

                  I don't disagree they are leakier abstractions but they can still satisfy a wide variety of workload security needs.

              • yters 1700 days ago
                I recently took a course on how cgroups and namespaces work, and can be combined to create containers, and my impression is that the security is a huge kludge. For example, the capabilities are just a seemingly random assortment of different permissions, with a big dumping ground in the admin capability. It's hard to see how such a system can be reliably secured. Plus, it's all open source with a couple of core contributors. What's to stop some state agency inserting its code into the core? There's no way to review everything, and a suitably clever developer can place a backdoor somewhere in all of the millions of lines of code. So, I must agree that security is not really at the forefront of Linux or container technology.
                • MaxBarraclough 1700 days ago
                  > my impression is security is a huge kludge

                  Docker itself could be called a huge kluge, at least compared to Solaris 'zones' and FreeBSD 'jails'.

                  They're similar to containers, but are supported directly by the kernel, whereas Docker has to pull together different kernel features to create its abstraction. [0]

                  > What's to stop some state agency inserting its code into the core? No way to review everything

                  1. This isn't a point about containers, it's a point about Free and Open Source software in general. Do you avoid all Open Source software when security matters? 2. I'm pretty sure the Linux kernel folks review everything, and I imagine the Docker folks do too. 3. You're implicitly assuming that closed-source software is safe from government pressure. It is not.

                  [0] https://blog.jessfraz.com/post/containers-zones-jails-vms/

                  • yters 1699 days ago
                    Nothing is safe from government pressure. But, at least with local closed source we know it's going to just be our government pressure. Otherwise, it could be any actor, which may be less friendly towards us.
                    • MaxBarraclough 1698 days ago
                      > with local closed source we know it's going to just be our government pressure

                      We don't. Companies that produce proprietary code are not immune from attacks on their repository, and are more vulnerable to, say, bribery. They're also more vulnerable to attacks on their distributed binaries - users do not have the option to compile from source, so you compromise every user this way.

                      Proprietary software is also far more likely to embed 'telemetry' spying, or to use sloppy security practices and rely on security-by-obscurity. Authors of Free and Open Source software know that they (generally at least [0]) cannot get away with this kind of thing.

                      It simply isn't true that proprietary software is more trustworthy than FOSS. If anything, the opposite appears to be true.

                      [0] https://news.ycombinator.com/item?id=14754740

                      • yters 1697 days ago
                        [citation requested]
    • mixmastamyk 1701 days ago
      I often run new things in a vm with snapshots. When I don’t it’s as a limited user.
    • interfixus 1701 days ago
      We may semi-trust our package and repo systems. This tool is readily available through AUR on my Arch machine, I see. Or we may go the whole hog and actually have a peek through the source.
      • laumars 1701 days ago
        AUR are packages that aren't in Arch's repo system. Granted tools like yaourt do make installing AUR packages nearly as easy as pacman but anyone can upload anything to AUR thus you are expected to vet the packages yourself (hence why tools like yaourt repeatedly prompt you to read the build scripts et al before running them).
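
        (The manual vetting route looks roughly like this, assuming the AUR package is simply named monolith:)

            $ git clone https://aur.archlinux.org/monolith.git
            $ cd monolith
            $ less PKGBUILD      # read the build script before trusting it
            $ makepkg -si
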
        • interfixus 1701 days ago
          > AUR are packages that aren't in Arch's repo system

          AUR, Arch User Repository.

          > thus you are expected to vet the packages yourself

          Obviously, as a long time Arch user I didn't know this.

          • psychrometer 1700 days ago
            You emphasized the wrong word.

            It is the Arch User Repository.

            > DISCLAIMER: AUR packages are user produced content. Any use of the provided files is at your own risk.[0]

            [0] https://aur.archlinux.org

  • mikaelmorvan 1701 days ago
    The main problem with your code is that you only handle simple Web 1.0 sites.

    What about JavaScript execution? If you replay your capture, you have no idea what you will see on a typical Web 2.0 website.

    The only way I know to capture a web page properly is to "execute" it in a browser.

    Gildas, the guy behind SingleFile (https://github.com/gildas-lormeau/SingleFile) is well aware of that, and his approach really works every time.

    Try it on a Facebook post, a tweet, ... It just works.

    • lucideer 1701 days ago
      The capture includes JS, so this should work for most JS-dependent sites, with the exception of scripts loading other additional assets.

      Tbh, often those are superfluous, or egregious examples of bad web dev, so it seems a reasonable solution for most cases.

      SingleFile is a different approach, but it's a lot more involved/less convenient than a cli, and loading in something like WebDriver on the cli for this would be overkill, unless you're doing very serious archival work.

      • mikaelmorvan 1700 days ago
        "Superfluous, or egregious examples of bad web dev"?? Do you know what Web 2.0 is? Do you know what React, Angular, and the other JS frameworks are?

        When you create a modern webapp, a lot of data is retrieved from servers as JSON and formatted in the browser in JavaScript. Sometimes even the CSS is generated browser-side. What's more, on webapps where user login is taken into account, the display is modified accordingly.

        That's the web of 2019. The approach consisting of getting remote files and launching them in a browser is really naive.

        Speaking of SingleFile, it has a CLI version and can handle full Web 2.0 webapps without any problem. And of course, Web 1.0 sites work as well.

        • lucideer 1700 days ago
          With the exception of actual XHR requests (which should ideally be for dynamic resources, and as such somewhat outside the remit of saving a webpage), I was referring specifically to JS loading JS, etc. solutions. React, Angular do not recommend/advise you to do this. This isn't a requirement in Web 2.0 or Web 5.0 or anything else.

          In terms of React at least, fetch requests are not a part of the framework in any way, and any present would typically be done in custom code in lifecycle methods. Even Redux is, by default, client-side only. Stores are in-memory; actions populating them would make fetch requests with React/Redux-independent logic.

          Other JS frameworks are, typically, the same. And all of that is just considering dynamic XHR. Loading scripts is much less typical, and never required. The most common application of this I've seen is the GA snippet, which mainly does it to ensure the load is async without relying on developer implementation: it's 100% unnecessary to do it this way.

          So yes, unless you're distributing a tracking snippet that you expect non-devs to be blindly pasting into their wordpress panels and still have it work efficiently, generally speaking use of this method is never necessary, and commonly a red flag for poor architecture.

        • wolco 1700 days ago
          Sometimes we spend too much time in our own ecosystem. In 2019 most sites globally still use PHP and jQuery.

          Web 2.0 refers to the use of Ajax. That goes back to the early 2000s: Sajax, jQuery...

          If you want to separate out Angular, React, and Vue, maybe it's Web 3.0... but wasn't Web 3.0 referred to as mobile?

        • tinsx 1700 days ago
          I think that's exactly what that person means by superfluous and egregious examples of bad web development; SPAs, javascript frameworks of that nature. :p
          • mikaelmorvan 1700 days ago
            Yes, the debate between building an SPA with rich features or old-style Web pages with good SEO is eternal :) We see more and more of a hybrid approach that could be called Web 1.5 :)
      • baby 1700 days ago
        So any ajax and it won't work?
        • lucideer 1700 days ago
          Any ajax won't work offline.

          Ajax will* still work fine with an internet connection as long as those ajax endpoints don't require cookies and don't linkrot.

          * not 100% sure how the tool handles relative URLs embedded in source: if it's not clever enough, though, this is very fixable via PR (as in it's not an architectural limitation)

          • baby 1700 days ago
            What's the point of saving a webpage if it won't work offline?
    • chiefalchemist 1700 days ago
      I hear ya but another way to look at this is...

      The main problem with too many websites is that they've become too much about technology and have left visitors, as well as the spirit and intent of the internet, behind.

      • UI_at_80x24 1700 days ago
        It stopped being about information and started being about entertainment.

        I agree with you. BadSite: "we want you to experience this.."

        GoodSite: "we want you to learn this..."

        • chiefalchemist 1700 days ago
          To clarify myself a bit. Many of the new technologies are great...for building applications; applications where UI and UX are essential.

          But for run-of-the-mill websites? Such tools are not only overkill, they break the spirit of the internet.

    • h1d 1701 days ago
      Is there a CLI version of it?
    • flatroze 1700 days ago
      Everything has its limits
  • Springtime 1701 days ago
    MHTML is pretty good for this already btw (not to take away from this neat project though :)). Similarly stores assets as base64'd data URIs and saves it as a single file. Can be enabled in Blink-based browsers using a settings flag and previously in Firefox using addons (also in the past natively in Opera and IE).
    • flatroze 1701 days ago
      Apparently everybody knew about MHTML but me Ü I'm going to look into that format and see if I could enhance monolith to output proper MHTML, among other additions and improvements. Thank you for the info!
      • masklinn 1701 days ago
        I don't know that it would be a very useful thing to do, at least in the short term: there's a bunch of "web archive" formats out there, and the common thread between them is that they're custom archive formats; you need special clients or support for those formats:

        * mhtml encodes the page as a multipart MIME message (using multipart/related), essentially an email (you're usually able to open them by replacing the .mht extension with .eml)

        * WARC is its own thing with its own spec

        * MAFF is a zipfile, not sure about the specifics

        * webarchive is a binary plist, not sure about the specifics either

        Your tool generates straight HTML which any browser should be able to open. It probably has more limitations, but it doesn't require dedicated client / viewer support.

        Maybe once you've got all the fetching and extracting and linking nailed down it would be a nice extension to add "output filters", but that seems more like a secondary long-term goal, especially as those archive formats are usually semi-proprietary and get dropped as fast as they get created (WARC might be the most long-lived as it descends from the Internet Archive's ARC, is an ISO standard and is recognised as a proper archival format by various national libraries).

        • mftrhu 1701 days ago
          There isn't much to MAFF. Each MAFF file can contain more than one saved page. Each page needs to be contained within its own folder (whose name is usually the timestamp of when the page was saved, but it doesn't matter AFAICT). There can be an `index.rdf` file in there, to specify metadata and which file to open, but otherwise you should look for an `index.SOMETHING` file - usually `index.html`.

          E.g.

            test.maff
            `--  1566561512/
                 |--  index.rdf
                 |--  index.html
                 `--  index_files/
                      `--  ???
          
          When I was messing around with archiving things locally I settled on MAFF, because it's pretty much trivial to create and to use. Even if your browser does not support it, you just need to unpack it to a tempdir and open the index file.
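
          (Since it's just a zip, that fallback is a two-liner; a sketch assuming the test.maff layout above:)

            $ unzip test.maff -d /tmp/maff
            $ xdg-open /tmp/maff/1566561512/index.html
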
      • japanuspus 1700 days ago
        I had never heard about MHTML either. Another use case could be embedding markdown source for the HTML in the document as well. This would allow single-file documents with figures that could be edited as light markup (with some tooling) and be viewed by anyone with a browser. This is something I have been dreaming about for years!

        Tbh, I had arrived at the conclusion that MIME would be great, but it never struck me that someone had already made a "standard" of MIME and HTML.

    • bhl 1701 days ago
      One issue with MHTML is that it does not seem to be currently supported by iframes. The use case I was working on was comparing search results from Google and DuckDuckGo by simply scraping and downloading to later embed. For that, I used a cli tool from an open source library [1]. MHTML seems like a nice format but I'm not sure if there's a library to convert them into stand-alone HTML files.

      [1] https://github.com/gildas-lormeau/SingleFile

      Edit: This question just came to mind. If MHTML saves images using base64, and base64 data-URL images have a size limit, how would you save extremely large photos? Take for example the cover image of this article https://story.californiasunday.com/gone-paradise-fire. When I saved the page in MHTML format, the re-rendered image showed up quite blurry compared to the original. Was the size limit the cause?

      [2] https://stackoverflow.com/questions/12637395/what-is-the-siz...

      • Springtime 1701 days ago
        > MHTML seems like a nice format but I'm not sure if there's a library to convert them into stand-alone HTML files.

        Many years ago I used a program that could convert them, but the name escapes me. A brief search shows a few results that appear to do a similar conversion; potentially they may be of use.

        > This question just came to mind. If MHTML saves images using base64, and base64 dataurl images have a limit size, how would you save extremely large photos?

        From what I've read there's no official limit for base64 encodings, though IE/Edge limit them to 4GB according to caniuse.com. I haven't personally encountered an image or GIF that was too large to be saved in an MHT (I have over 10k MHTML files saved, some image- and GIF-heavy ones up to 200MB each).

        Also a sibling comment corrected me about the data URI use, MHTML uses a separate scheme but nevertheless still uses base64 for encoding.

        For that example article you linked it seems likely to be the way the program you're using is handling the Javascript-loaded images. I saved it in Vivaldi (Blink-based engine) and the main image displayed at full res when opened locally while the other images didn't, while when saved with a pre-Quantum Firefox using the UnMHT addon it saved all the images at their fully loaded resolutions. Some MHTML saving implementations clearly have advantages over others it would seem.

      • app4soft 1701 days ago
        > One issue with MHTML is that it does not seem to be currently supported by iframes.

        There's no need to insert an MHTML page into an iframe!

        If you need to insert MHTML content into an iframe, just convert it to HTML+JS first.

    • masklinn 1701 days ago
      > Similarly stores assets as base64'd data URIs and saves it as a single file.

      Does it? IIRC it stores assets as MIME attachments, hence the "M": the result is not HTML (which I assume this would be), it's a multipart MIME message whose root is an HTML document.

      edit: in fact when downloading mht files osx / safari misrecognises them as exported emails and appends the "eml" extension.

    • Causality1 1701 days ago
      Always struck me as quite odd MHTML fell out of favor. Back in the day when I wanted to preserve a web page it was the logical choice since you could just click "save as archive".
      • da_chicken 1701 days ago
        Blame XMLHttpRequest, Flash, JS, and embedded video. It doesn't make sense to archive a document when the necessary interactive content elements will essentially fail when opened offline.
        • paggle 1701 days ago
          What would be cool is one package with the original HTML file, the fully rendered DOM written out as a second file+assets package, and a PDF just in case the first two get fucked.
        • jordwalke 1701 days ago
          You can first prerender the page with Chrome in headless mode (see my other comment), and then convert it into a single document using an inlining tool (such as the OP's). That way the JS will run and render the page (see my other comments here for an example).
      • flatroze 1701 days ago
        We'll bring it back, don't you worry!
    • jordwalke 1701 days ago
      It appears that (at least) Safari cannot open mhtml files. The benefit of a tool such as what the OP shared is that it can produce plain html pages that are openable by anyone. (also, I tried mhtml in Chrome using the proper flag and it doesn't appear to store/inline/render static assets correctly).
    • jordwalke 1701 days ago
      I'm not aware of a way to save as MHTML from Chrome in headless mode (from the command line). Are you?
  • alpb 1701 days ago
    I think it would be way better to explain in the repository:

    - how do you handle images?

    - does it handle embedded videos?

    - does it handle JS? to what extent?

    - does it handle lazily loaded assets (e.g. images that load only when you scroll down, or JS that loads 3 seconds after the page is loaded)?

    In general, how does this work? The current readme doesn't do a decent job of explaining what exactly the tool is. For all I can tell, it probably just takes a screenshot of the page, encodes it as base64 into the HTML and shows it.

    • flatroze 1701 days ago
      Good points, thank you for the review. I'll work on enhancing the readme file to be more informative.
    • quickthrower2 1701 days ago
      It can’t handle JS completely because we can’t predict a program’s behaviour using static analysis. See the Halting Problem, for example.
      • kuzehanka 1701 days ago
        I saw a tool that handles JS to a limited extent by capturing and replaying network requests to accommodate said JS. It records your session while you interact with a site, and is then able to replay everything it captured.

        This tool was able to capture three.js applications and other interactive sites quite well.

        • bhl 1701 days ago
          Was it webrecorder [1]? I found this project a couple weeks back while looking for web archiving tools.

          [1] https://webrecorder.io/

          • kuzehanka 1701 days ago
            Yep, that's the one! Thanks for reminding me of the name.
      • penagwin 1700 days ago
        Sure, but many websites load resources and such with JS - the initial request might not contain the content, but a few seconds on the page with JS running lets it fill everything in, etc.

        SPA's often (but not always) do this. Content is loaded in via React components and such...

  • mrieck 1700 days ago
    If you only want a portion of a webpage I made a tool called SnipCSS for that:

    https://www.snipcss.com

    The desktop version saves an HTML file, stylesheet and images/fonts locally, and it only contains the HTML of the snippet with the CSS rules that apply to the DOM subtree of the element you select.

    I'm still working out bugs but it would be great if people try it out and let me know how it goes.

    • sansnomme 1700 days ago
      • mrieck 1700 days ago
        I tried SnappySnippet before when looking into the idea - it didn't work well for me and crashed often. I never saw DesignPirate, but just now I tried it and it didn't output any CSS. I'm not sure, but it doesn't look like either of these uses the chrome.debugger API to call DevTools API methods (you get a warning in Chrome if you use that).

        I'm hoping my tool will be better so it's good enough people would be willing to pay for it, but we'll just have to see.

  • jordwalke 1701 days ago
    I really like this concept, and I've been using an npm package called inliner which does this too: https://www.npmjs.com/package/inliner

    I'm glad there's more people taking a look at the use case, and I'd be interested to see a list of similar solutions.

    If you combine this with Chrome's headless mode, you can prerender many pages that use JavaScript to perform the initial render, and then once you're done send it to one of these tools that inlines all the resources as data URLs.

      /Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome ./site/index.md.html --headless --dump-dom --virtual-time-budget=400
    
    The result is that you get pages that load very fast and are a single HTML file with all resources embedded. Allowing the page to prerender before inlining will also allow you to more easily strip all the JavaScript in many cases for pages that aren't highly interactive once rendered.
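
    (As a sketch of gluing the two steps together: this assumes monolith prints its result to stdout and uses Python 3.7+'s http.server as a throwaway local web server to hand the prerendered DOM to monolith, since it expects a URL:)

      $ mkdir -p prerendered
      $ /Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome \
          --headless --dump-dom --virtual-time-budget=400 \
          https://example.com > prerendered/index.html
      $ python3 -m http.server 8000 --directory prerendered &
      $ ./monolith http://localhost:8000/index.html > single-page.html
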
  • mehrdadn 1701 days ago
    This is awesome. One question though: how does it handle the same resource (e.g. image) appearing multiple times? Does it store multiple copies, potentially blowing up the file size? If not, how does it link to them in a single HTML file? Or if so, is there any way to get around it without using MHTML (or have you considered using MHTML in that case)?

    Also, side-question about Rust: how do I get rid of absolute file paths in the executable to avoid information leakage? I feel like I partially figured this out at some point, but I forget.

    • flatroze 1701 days ago
      Thank you! It's pretty straightforward: this program just retrieves assets and converts them into data URLs (data:...), then replaces the original href/src attribute value, so in the case of the same image being linked multiple times, monolith will for sure bloat the output with the same base64 data, correct. I haven't looked into MHTML, ashamed to admit it's the first time I'm hearing about that format. I need to do some research, maybe I could improve monolith to overcome issues related to file size, thank you for the tip!
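
      (To illustrate the data-URL trick in isolation, with a hypothetical image URL and GNU base64:)

          $ b64="$(curl -s https://example.com/logo.png | base64 -w0)"
          $ printf '<img src="data:image/png;base64,%s"/>' "$b64" > inlined-img.html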

      And about Rust: I think you're way ahead of me here as well, this is my first Rust program. If you're talking about it embedding some debug info into the binary which may include things like /home/dataflow then perhaps there's a compiler option for cargo or a way to strip the binary after it's compiled. ¯\_(ツ)_/¯ Sorry, that's the best I can tell at the moment.
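
      (Two knobs that may help here, though I haven't verified them against monolith's build: rustc's --remap-path-prefix flag and plain strip on the release binary:)

          $ RUSTFLAGS="--remap-path-prefix=$HOME=/remapped" cargo build --release
          $ strip target/release/monolith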

      • mehrdadn 1701 days ago
        Okay thanks! That was a pretty quick reply :) Regarding MHTML, it's basically the MIME format emails are in (which are basically inherently self-contained HTML documents). Various browsers have had varying degrees of support for it over the years. Chrome recently made it harder to save in MHTML format; I don't know how long they will be able to read the format, so I can't guarantee that if you go in that direction it'll still be useful for a long time, but at the moment there is still some support for it.
      • dspillett 1701 days ago
        One way to dedupe inline image resources while still using HTML rather than MHTML could be to encode them in CSS once, and transform the image element into something with that class.
        • mehrdadn 1700 days ago
          That'd easily break Javascript though.
          • dspillett 1697 days ago
            Good point. I was thinking in the direction of something I'm tinkering with in a similar area. There, getting a static snapshot of the current DOM or fragment is key (meaning scripts being stripped out is an intentional feature). Tweaking the document contents for efficiency could significantly impact a lot of script work that may be present.
    • chmod775 1701 days ago
      https://github.com/Y2Z/monolith/blob/master/src/html.rs#L94

      Hope this answers your question (it gets converted to a data URI, and there's apparently no de-duplication).

    • gildas 1701 days ago
      SingleFile, which can run on the CLI too, handles this by using CSS variables.
  • fit2rule 1701 days ago
    I've been printing to PDF for decades now, and nothing comes close to the ease of use and versatility of 2 decades' worth of interesting web pages .. I have pretty much every interesting article, including many from HN, from decades of this habit.

    Need to find all articles relating to 'widget'?

        $ ls -l ~/PDFArchive/ | grep -i widget
    
    This has proven so valuable, time and again .. there is a great joy in not having to maintain bookmarks, and in being able to copy the whole directory content to other machines for processing/reference .. and then there's the whole pdf->text situation, which truly has its thorns (some website content is buried in masses of ad-noise), but also has a huge advantage - there's a lot of data to be mined from 50,000 PDF files ..

    Therefore, I'd quite like to know: what does monolith have to offer over this method? I can imagine that it's useful to have all the scripting content packaged up and bundled into a single .html file - but does it still work/run? (This can be either a pro or a con in my opinion..)

    • dredmorbius 1701 days ago
      Having gone this route in part myself, advantages of HTML or other more-structured file formats, if there is appropriate metadata markup:

      - Allow for recording source and author information (PDF ... doesn't always provide this).

      - Allows for full-text search.

      - Allows for editing out annoyances.

      I'll frequently go from HTML to some simplified representation (e.g., Markdown), and then re-generate formats that are useful elsewhere: HTML, PDF, ePub, etc.

      Dumping from HTML to Markdown frequently makes cruft-removal far simpler, and the principal content of most pages is text. In rare instances, images are useful, and even more rarely, any multimedia content (video, audio, programmatic content).

      What's depressing is the number of sites which screw with even basic HTML. E.g., the NY Times rarely use HTML tables for tabular representation, and instead use a homebrew combination of custom markup, CSS, and JS to much the same effect. Pretty, in situ, but brittle and transports exceedingly poorly.

      That's just one of many such cases.

    • mxuribe 1700 days ago
      It's funny: on a rare occasion, I too have saved some content as a PDF - more so for archiving rather than for offline viewing... But I guess I never thought to scale it up to all/most of my bookmarks. It seems so obvious now after reading your comment. However, my experience with PDFs has been negative. From file size to the slow startup of myriad PDF viewers, etc., it just seems like viewing stuff as native HTML/text is better - at least in my experience. Further, my preferred browser - Firefox - leaves much to be desired in this arena of generating proper PDFs, and I end up switching to Chrome (bleh!) just to "PDF something" that I saw/read online. Again, this function in Firefox is not something that I use very often, hence why I stick with FF and haven't gone back to Chrome. However, going back to your approach... I wonder if I can use a tool - either like this monolith or SingleFile, or even Puppeteer, etc. - to snapshot web content, but save it as HTML instead of PDF. I would guess HTML content is still grep-able (as you noted for your PDF local searches). Hmmm... a local cache of my own offline bookmarks... Hmmm, interesting. Thank you for this inspiration!!
    • flatroze 1701 days ago
      I'd say since monolith produces a plaintext document, it lets you edit things more easily if needed.

      JS can be removed from the final document using the -j flag. HTML files can also be grepped for content, unlike PDFs.
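
      (E.g., searching saved pages by their contents rather than just their filenames, assuming a hypothetical ~/WebArchive directory of monolith output:)

          $ grep -ril widget ~/WebArchive/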

  • leshokunin 1701 days ago
    This would be a perfect fit for IPFS. I love the idea of having just one file in a permanent link.
    • flatroze 1701 days ago
      This could also be an interesting alternative to PDF, especially with web fonts embedded as data URLs.
      • turbinerneiter 1701 days ago
        I've been using these kinds of standalone PDFs produced from Markdown with pandoc for a while, and the possibilities are insane.

        Imagine a paper in the form of a single HTML file, which has (a subset of) the data included, the graphs zoomable, the colors changeable (to suit whatever vision problems you have) - maybe even the algorithm to play around with!

        Jupyter Notebooks already go in that direction, only without the single-file, open-in-browser aspect, I think.

  • js8 1701 days ago
    I am using "Save Page WE" Firefox extension for this. Better at saving JS content and less clutter than saving all the images and stuff.
    • Crinus 1700 days ago
      Same, Save Page WE is a great extension :-)
      • ausjke 1699 days ago
        SingleFileZ on Chrome is good too.
  • sametmax 1700 days ago
    Good, but it won't work with heavy JS pages that use Ajax to load their content.

    The Firefox extension seems to do that:

    https://addons.mozilla.org/fr/firefox/addon/single-file/

    • gildas 1700 days ago
      Unfortunately it's not written in Rust so it won't make the first page of HN.
  • gildas 1701 days ago
    Note that SingleFile can easily run on command line too, cf. https://github.com/gildas-lormeau/SingleFile/tree/master/cli.
  • interfixus 1701 days ago
    Nice. I can see some automated uses for this. In ordinary browsing, am currently using a Firefox addon called SingleFile which works surprisingly well. Stuffs everything into (surprise, surprise) one huge single file - html with embedded data, so compatible everywhere.
    • flatroze 1701 days ago
      It sounds like a great add-on; I have to check it out to see what it does to remote assets and how it works with asynchronously loaded assets.
  • mikekchar 1701 days ago
    With respect to the Unlicense, does anybody have any knowledge about how good it is in countries which don't allow you to intentionally pass things into the public domain (most countries that aren't the US)? How does it compare to CC0 in that respect?
    • flatroze 1701 days ago
      Which license would you recommend releasing this software under to get broad adoption while keeping permissive terms, if not the Unlicense?
      • mikekchar 1700 days ago
        I honestly don't know. That's why the question :-) Is CC0 good for software? It seems to be a bit more complete from a non-US view point, but I don't know if there are lurking situations. Possibly MIT is better -- it's pretty darn permissive. I'm really just soliciting opinions.
  • hendry 1701 days ago
    I imagined that https://www.w3.org/TR/widgets/ would be the open container format for saving a Web app to a single file.
  • cr0sh 1700 days ago
    This is interesting - I think any of us who save things off the internet have made something like this (I usually save entire sites or large chunks, though - so I have a different toolset - still, I also do single pages, so I might try out this tool).

    One thing I would propose adding - either as a flag, or by default - is having it parse the URL path and name the output file after the page; that way you can just run "monolith {url}" and not have to worry about it.
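
    As a stopgap, a tiny wrapper can do it from the outside (a sketch; the URL is a placeholder, the sed mangling is purely illustrative, and it assumes the HTML is printed to stdout):

        $ url="https://example.com/some/page"
        $ monolith "$url" > "$(echo "$url" | sed -E -e 's|^https?://||' -e 's|[/?&:]|_|g').html"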

    I am also curious how it handles advertisements, Google tracking, and such; some way to strip out just those scripts (and elements) could be handy.

  • makach 1701 days ago
    Ahh, to me it looks like it creates an amalgamation of the web page+contents.

    How does this work on neverending webpages/forever scroll? How will it behave if you need to authenticate before browsing the page?

    • flatroze 1701 days ago
      That's it in a nutshell!

      It seems to work quite well for basic pages. I think lazy load will work for most pages as long as the JavaScript is embedded (no -j flag provided) and the Internet connection is on. It saves what's there when the page is loaded; the rest is a gamble, since every website implements infinite scroll differently.

      Authentication is another tricky part -- it's different for every browser. I will try to convert it into a web extension of sorts, so that pages could be saved directly from the browser while the user is authenticated.

      • donatzsky 1701 days ago
        For authentication, you could add an option for passing http headers, as well as accept Netscape-style cookie files.

        Whenever I want to download a video with youtube-dl from a site that requires authentication, I first log in using my browser and then export the cookies using an extension.
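
        The youtube-dl half of that looks roughly like this (the URL is a placeholder, and it assumes the extension exports a Netscape-format cookies.txt):

            $ youtube-dl --cookies cookies.txt "https://members.example.com/videos/123"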

        • sah2ed 1700 days ago
          May I ask what extension you use for cookie exporting?
  • jplayer01 1701 days ago
    Ah, I've been thinking about making something like this. You beat me to it. I've been using the SingleFile add-on until now. I'll definitely give this a try.
  • lucasverra 1701 days ago
    Super project! I've been pretty baffled by how hard it is to save a web page in a proper format. I've tried PDF converters, the getPolaroid app, and of course Firefox's screenshot feature for capturing the entire scroll. I will try this for saving purposes.

    I am also interested in cloning/forking sites for modification purposes; I will give you feedback on the results from my consulting gigs.

    • flatroze 1701 days ago
      Thank you for the kind words.

      It will evolve into a reliable tool over the next couple of weeks, and it should eventually embed everything, including things like web fonts and url() references within CSS. If anything doesn't work, please open an issue; I have plenty of time to work on it.

  • sankalp210691 1698 days ago
    This is pretty useful. It would be great to have the option of converting the HTML page to a PDF as well.
  • sergioisidoro 1701 days ago
    This sounds great, but the first thing I thought was how this would be a perfect tool to make automated mass phishing scams.

    If the results are realistic: take a massive list of sites, make a snapshot of each page, replace the POST login URLs with the phisher's own, deploy these individual HTML files, and spread the links through email.

    I wonder how this project handles forms.

    • flatroze 1701 days ago
      Thank you for reminding me, I need to set action="" to be an absolute path when the page is saved.

      upd: Done, now forms get their action="/submit" converted into action="https://website.com/submit" when the page is saved.

    • danfang 1701 days ago
      I'd imagine you can already do that with some basic web scraping tools. This definitely makes it easier.
  • personjerry 1701 days ago
    • flatroze 1701 days ago
      Thank you for the heads up, I'll test it and enhance it to preserve styles better.
  • fouc 1701 days ago
    Sweet idea! I would especially like to be able to capture videos and pictures too.

    I suspect that for saving videos, a good approach would be some sort of proxy + headless browser combination, where the proxy is responsible for saving a copy of all the data the browser requests.

    Thoughts?

    • flatroze 1701 days ago
      Thanks! Pictures should work, I'll check more tags first thing tomorrow when I start working on improving it.

      I use youtube-dl for YouTube and other popular web services myself. Embedding a video source as a data URL could in theory work, but it'd be quite a long base64 line. Also, editing .html files with tens or hundreds of megabytes of base64 in them would perhaps be less than convenient.

  • personjerry 1701 days ago
    `cargo install` installs 237 packages for this?! I don't think that's acceptable.
    • Deukhoofd 1701 days ago
      Probably the reqwest crate. That thing alone uses like 30 crates, not including the dependencies of those crates.
      • flatroze 1701 days ago
        The compile time is rather long as well; I'm looking into ways of reducing the number of dependencies.
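
        One way to see where they all come from is cargo tree (a sketch; cargo tree is built into newer Cargo, while older toolchains need the separate cargo-tree plugin):

            $ cargo tree -i reqwest      # which dependency chains pull reqwest in
            $ cargo tree --duplicates    # dependencies that appear in multiple versions
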
    • codegladiator 1700 days ago
      Then don't accept it. What's the problem with the number of packages or the size?
  • ajxs 1701 days ago
    Very cool. Have you considered incorporating an option for following links within the same domain to a certain depth? I remember using tools such as this in the past to save all the content from certain websites.
    • flatroze 1701 days ago
      Thank you! I'll add it as an issue, since it could definitely be useful for "archiving" certain resources more than 1 level deep. Do you remember the name of that tool by any chance?
      • darrenf 1701 days ago
        • rmetzler 1701 days ago
          IIRC this was the mode Snowden used to bulk-download the NSA data.
      • xtreme 1701 days ago
        I have used HTTrack in the past for this.
      • ajxs 1701 days ago
        I wish I could remember the exact tool, this was over a decade ago. If you just do a quick internet search for such a tool you'll likely find whatever I used, it certainly wasn't anything sophisticated. It was a Windows GUI tool designed specifically for the task. Something makes me think that 'GetRight' tool might have been able to do the same thing, but I can't seem to see the feature on their website.
        • flatroze 1701 days ago
          Ah, I remember using something like that. I thought that tool was saving it into one .html file, but data URLs didn't exist back then, so creating directories alongside the HTML files was the only option to "replicate" a web resource; now I understand exactly what you were talking about. I'll do some more digging around and implement that in the near future. I may need to make all the requests async first to make sure that saving one resource at a decent depth won't take too long.
        • olakeasseillo 1701 days ago
          yes, "teleport pro" for win98, you could scrape a site and duplicate locally or scrape only for specific file type or size, had recursive link follow depth option and created several threads for the requests(sniff)
    • jp_sc 1701 days ago
      SiteSucker for macOS does it.
      • flatroze 1701 days ago
        Well, at least it's not called iSuck.
  • tenken 1701 days ago
    • masklinn 1701 days ago
      It looks like it creates a normal HTML file (embedding assets as data URI) so it should require no special client / support.

      HTMLD, WARC, MHTML, MAFF and webarchive are all "container" formats which bundle assets next to the HTML using various methods (resp. bundle, custom, multipart MIME, zip and binary plist).

      • emerongi 1701 days ago
        The issue with this is that if the website requires some external API for content, it might not work properly.

        https://webrecorder.io/ solves that problem by recording all interactions and then replaying them as needed.

        > Webrecorder takes a new approach to web archiving by “recording” network traffic and processes within the browser while the user interacts with a web page. Unlike conventional crawl-based web archiving methods, this allows even intricate websites, such as those with embedded media, complex Javascript, user-specific content and interactions, and other dynamic elements, to be captured and faithfully restaged.

  • tannhaeuser 1701 days ago
    Well, you've been able to do that for a long time with MHTML, WARC, etc. downloaders, including those available in browsers via "Save Page as", though CSS imports aren't covered by older tools (are they by yours?). Anyway, congrats on completing this as a Rust first-timer project, which certainly speaks to the quality of the Rust ecosystem. For using this approach as an offline browser, of course, the problem is that Ajax-heavy pages using JavaScript for loading content won't work, including every React and Vue site created in the last five years (but you could make the point that those aren't worth your attention as a reader anyway).
    • flatroze 1701 days ago
      CSS imports are covered by converting .css files into data URLs; later I will parse those and embed resources found within stylesheets as well.
  • dtjohnnymonkey 1700 days ago
    Thank you for this. I’ve been looking for something that does this exact thing. I don’t like any of the other HTML archiving formats.
  • dfee 1701 days ago
    If the output were a tar file, couldn’t we also say it was saving web pages as a single file? Wouldn’t that also be easier?
    • flatroze 1701 days ago
      I think there's an issue with opening a tar file, e.g. if sent to someone who needs to view the document but isn't techy.

      It seems to me that having one file that any browser can easily open (without needing an Internet connection to view it) is a big advantage over having a directory with assets alongside the .html file. It may be one of those things that would make life easier, yet nobody really complains about the way pages usually get saved. I hope more browsers add support for saving pages as MHTML in the near future so that we wouldn't need tools like this one.

  • ahub 1701 days ago
    I noticed there is a `-j` argument to remove javascript. A `-i` argument for removing images would be great too.
    • flatroze 1701 days ago
      It is done, option -i in the latest version (2.0.3) now replaces all src="..." attributes with src="<data URL for a transparent PNG pixel>" within IMG tags.
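
      So a lean, text-only snapshot would be something like this (a sketch; the URL is a placeholder, and it assumes the flags can be combined and that output goes to stdout):

          $ monolith -i -j https://example.com/article > article-text-only.html
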
  • ur-whale 1700 days ago
    Does not compile; it fails with some byzantine message about let in const fns being unstable.
    • flatroze 1700 days ago
      Could you please open an issue on GitHub providing the output that you get in the terminal?
      • ur-whale 1700 days ago
        I have closed my GitHub account since the takeover occurred.
        • dspillett 1694 days ago
          The important part of what he said was "providing the output that you get in the terminal". Simply stating "I got an error" and expecting the developer(s) to use clairvoyance to glean further detail is far from a helpful way to report a problem. Perhaps dropping the details in a pastebin site and linking to that would be a possible alternative? Or just including the error message here if it is short enough, though HN shouldn't really be used as a tech support channel.
  • sbmthakur 1701 days ago
    Nice work! I am wondering if Puppeteer can also be used to accomplish the same thing.
    • flatroze 1701 days ago
      It for sure would help with those SPA websites that get their DOM fully generated by JS. A web extension that saves the current DOM tree as HTML would perhaps do a better job, especially when it comes to resources which require some web-based authentication.
  • dvcrn 1701 days ago
    I'm not so experienced but how does this compare to .webarchive?
    • flatroze 1701 days ago
      The idea is almost identical, yet saving as .webarchive is only supported by Safari, and it's also not a plaintext format, hence can't be edited as easily.
  • Exuma 1701 days ago
    Saving this for later
    • dredmorbius 1701 days ago
      FYI: "favorite" is one way of doing that through HN.

      Bookmarks, or downloads, externally.

      • skinnymuch 1701 days ago
        Favorites are limited to a certain number on HN before you start losing the oldest ones.
        • dredmorbius 1701 days ago
          How many specifically?

          I'm at 252 posts, presently, just checked. That seems to be a complete log.

        • sah2ed 1700 days ago
          Is this limitation documented somewhere?

          After how many entries did the HN software start tripping on your favorites?

        • dang 1699 days ago
          That's not true. One user has 46,000. What did you see that made you think this?
  • nessunodoro 1701 days ago
    call me old fashioned, but I still use Ctrl+S
  • VvR-Ox 1701 days ago
    Very cool idea - thank you for this!

    One question: how does it handle those cookie pop-ups, GDPR warnings, etc.?

    • flatroze 1701 days ago
      Oh, thank you kindly.

      That's an interesting question. I think it depends on how the given modal is implemented, but closing them should technically work (unless the page is saved with JavaScript removed [-j flag]). Those notifications can also easily be removed from the saved file using any text editor; it should be pretty easy if you know how to edit HTML code. I don't think removing them would violate anything, since "this website" will no longer really be a website but rather a local document at that point.