Show HN: Paper to HTML Converter

(papertohtml.org)

153 points | by codeviking 9 days ago

15 comments

  • codeviking 9 days ago
    Hi all,

    I’m one of the engineers at AI2 that helped make this happen. We’re excited about this for several reasons, which I’ll explain below.

    Most academic papers are currently inaccessible. This means, for instance, that researchers who are vision impaired can’t access that research. Not only is this unfair, but it probably prevents breakthroughs from happening by limiting opportunities for collaboration.

    We think this is partly because the PDF format isn’t easy to work with, and is therefore hard to make accessible. HTML, on the other hand, has benefited from years of open contributions. There are a lot of accessibility affordances, and they’re well documented and easy to add. In fact, our long-term hope is to use ML to make papers more accessible without (much) effort on the author’s part.

    We’re also excited about distributing papers in their HTML form as we think it’ll allow us to greatly improve the UX of reading papers. We think papers should be easy to read regardless of the device you’re on, and want to provide interactive, ML-provided enhancements to the reading experience like those provided via the Semantic Reader.

    We’re eager to hear what you think, and happy to answer questions.

    • kahon65 9 days ago
      Do you remove the pdf files we send to your servers?

      Edit: https://allenai.org/terms point 5, you own all the uploads! So if by mistake we send a medical PDF, for example, or something else that is under GDPR, we can't ask you to delete it???? Wtfffff

      • codeviking 9 days ago
        We don't retain the uploaded document. We cache the extracted content to make things more efficient.

        See https://papertohtml.org/about:

        > What data do we keep? We cache a copy of the extracted content as well as the extracted images. This allows us to serve the results more quickly when a user uploads the same file again. We do not retain the uploaded files themselves. Cached content is never served to a user who has not provided the exact same document.
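The caching behaviour described in that passage resembles a content-addressed cache. The following is only a minimal sketch of that general idea, not the site's actual implementation: the class name `ExtractionCache`, the `extract` callable, and the choice of SHA-256 are all assumptions made for illustration.

```python
import hashlib
from typing import Callable, Dict


class ExtractionCache:
    """Content-addressed cache: results are keyed by a hash of the uploaded bytes."""

    def __init__(self, extract: Callable[[bytes], str]) -> None:
        self._extract = extract            # hypothetical extraction function
        self._store: Dict[str, str] = {}   # digest -> extracted content
        self.misses = 0                    # how many times extraction actually ran

    def convert(self, pdf_bytes: bytes) -> str:
        # Only someone holding the exact same bytes can produce this key.
        key = hashlib.sha256(pdf_bytes).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._extract(pdf_bytes)
        # The uploaded bytes are never stored, only the extracted result.
        return self._store[key]
```

Because the key is derived from the file contents, a second upload of the same file is a cache hit, while any different file produces a different key and triggers a fresh extraction.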

        Also, we can delete the extracted data on request. Just send a note to accessibility@semanticscholar.org.

        Sorry for the confusion!

        • kahon65 9 days ago
          Ah okay, thank you.

          >Also, we can delete the extracted data on request.

          Just to be 100% clear, you are referring to the cached extracted data, right?

    • Telemakhos 9 days ago
      Is there any thought about presenting the papers as TEI XML with XSLT to display the paper in a browser or screenreader? TEI provides pagination support (needed for citing page numbers, because most of academia still needs that) and extensive semantic markup for things like bibliographic information. It also serves as one data model that can be converted easily with existing tools (XSLT) to provide many representations for humans, while also serving as a machine-parsable text for datamining. Digital humanities has made heavy use of TEI for years, and this project seems like it could benefit from it.
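As a rough illustration of the pagination point, TEI marks page boundaries with `<pb n="..."/>` elements, so page numbers can be recovered alongside the text. The sketch below uses a simplified TEI fragment (no `teiHeader`) invented for this example; the helper name `paragraphs_with_pages` is hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

TEI = "{http://www.tei-c.org/ns/1.0}"  # TEI namespace in ElementTree notation

# Simplified fragment for illustration; a full TEI document also needs a teiHeader.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <pb n="1"/>
    <p>First paragraph.</p>
    <pb n="2"/>
    <p>Second paragraph.</p>
  </body></text>
</TEI>"""


def paragraphs_with_pages(tei_xml: str) -> List[Tuple[str, str]]:
    """Return (page number, paragraph text) pairs in document order."""
    root = ET.fromstring(tei_xml)
    page = ""
    result = []
    for element in root.iter():  # iter() walks the tree in document order
        if element.tag == TEI + "pb":
            page = element.get("n", "")
        elif element.tag == TEI + "p":
            result.append((page, "".join(element.itertext()).strip()))
    return result
```

The same page markers could be used by a stylesheet to emit visible page numbers or anchors in an HTML rendering.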
      • 1vuio0pswjnm7 9 days ago
        "We're eager to hear what you think, ..."

        I think I will stick with pdftohtml, pdftotext, and pdfimages https://en.wikipedia.org/wiki/Poppler_(software). These take seconds not minutes.

        From a user's perspective, I don't understand why you don't release the source code and let people compile a native application. (Did I miss the link to the source code?) Instead it looks like this is just a means of collecting free data (metadata, more training data, data from submitted papers by default) every time someone submits a paper.

        • politelemon 9 days ago
          I've never actually questioned the why, so maybe you could shine some light... why are they usually published as PDFs?
          • ephbit 9 days ago
            I always assumed the main reason for using PDFs is that an author/distributor can be pretty sure that they're rendered almost exactly the same (fonts, layout) no matter which viewer they're viewed with.

            This probably evokes some kind of sense of authenticity. Like some physical paper document it has exactly one appearance.

            • ephbit 9 days ago
              There's also the annotation features in PDFs which allow me to highlight text and add a comment.

              I don't know of any more convenient way to directly attach my thoughts to a specific portion of text. (If there is, I'd genuinely like to know).

              And it even works well across multiple devices: I have a folder on my PC that's synced with my phone via syncthing containing mostly PDFs (saved web pages, papers, books, ..) and the annotations I make in those PDFs on my phone are directly available on my PC ... all without using some cloud bullsh*.

            • kartoshechka 9 days ago
              Unfortunately for my mental health, my thesis was exactly about converting arXiv papers to modern-looking HTML, and there are so many more broken, unjust and ugly things in academia than using PDFs...

              Regarding your question, I'd say that it is a natural continuation of the centuries-long tradition of writing on actual paper. The invention of TeX actually made it easier to produce more papers, then came PDF, and you could produce virtual papers. Also, science journals pretty much have a monopoly on the distribution of scientific knowledge, and they are mostly paper too.

              • codeviking 9 days ago
                Y'know, that's a good question. I'm not sure I know the answer.

                My guess is it's largely for historical reasons. At the time most venues were organized, PDF was probably the best (or only) mechanism for sharing documents for print distribution.

                But we think it's time to change that :).

                • DoreenMichele 9 days ago
                  I have no idea at all but as a wild guess, I would assume it's because you can't edit PDFs. So you know it says the same thing forever and no one went and changed it in response to reading criticism of their paper or something.
                  • temp8964 9 days ago
                    What alternative do you have? Word file?

                    PDF is the only widely supported format that can guarantee an accurate reprint.

                    • nailer 9 days ago
                      HTML. As long as the information is preserved, the layout is not significant and actively harms viewing the content on many devices.

                      Here's the output of this tool on this PDF - https://arxiv.org/pdf/1909.00031.pdf - content is preserved, but text is readable: https://imgur.com/a/EPCaWxP

                      • miohtama 9 days ago
                        Are papers printed anymore?

                        HTML for text.

                        SVGs for diagrams.

                        Equations can be exported as images if needed.

                        • radarsat1 9 days ago
                          > Are papers printed anymore?

                          You know what, they are.

                          I like print format for reading purposes, even if it's on my epaper tablet. The other day when I took a train for 8 hours, I printed out several papers to read on my b&w laserjet. And it's more difficult to read diagrams these days because people make them all in colour, sometimes in ways that are very difficult to read when it's converted to b&w.

                          I find it a real tragedy that all these efforts to turn papers into dynamic content, which I wholeheartedly applaud, ignore the still very relevant use case of printing. Every preview mechanism for camera-ready papers should include a b&w print-preview mode.

                          The other advantage of PDF is that "page count" still means something. There's a reason journals limit page count, and it's not because it adds a few kbs to the download. It's because long-winded papers that don't get to the point need editing.

                          • codeviking 9 days ago
                            That's the idea!

                            If all goes well we won't need this software anymore. In a best case scenario the publishers start accepting HTML, and gone are the days of having to convert PDFs to something better...!

                            • temp8964 9 days ago
                              How do you define pages in HTML?
                              • codeviking 9 days ago
                                We don't. We extract the content and present it as a single document.

                                Page anchors can be used for navigating between sections. We present a table of contents that makes this easy. For instance:

                                https://papertohtml.org/paper?id=6f9fc51102cf49bff4f4e2b3367...
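The idea of a single document with a table of contents that links to section anchors can be sketched as below. This is an illustrative guess at the general technique, not the site's code; `slugify` and `render` are hypothetical helpers, and duplicate section titles are not handled.

```python
import html
import re
from typing import List, Tuple


def slugify(title: str) -> str:
    """Turn a heading into an id usable as a URL fragment."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return slug or "section"


def render(sections: List[Tuple[str, str]]) -> str:
    """Render (title, text) pairs as one HTML page with a linked table of contents."""
    toc = []
    body = []
    for title, text in sections:
        anchor = slugify(title)
        safe_title = html.escape(title)
        # Each TOC entry points at the id of the matching heading.
        toc.append(f'<li><a href="#{anchor}">{safe_title}</a></li>')
        body.append(f'<h2 id="{anchor}">{safe_title}</h2>\n<p>{html.escape(text)}</p>')
    return "<nav><ol>\n" + "\n".join(toc) + "\n</ol></nav>\n" + "\n".join(body)
```

Clicking a TOC entry scrolls to the heading with the matching `id`, which gives section-level navigation without any notion of pages.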

                                • john-doe 9 days ago
                                  Great initiative, HTML is the way to go!

                                  It would be great if you could add some basic CSS rules for print? Right now navigation elements are needlessly repeated on each page, obscuring the content.

                                  Also, you forgot to include bold and italics webfonts, so you have faux-styles for all headings and emphasis.

                      • znpy 9 days ago
                        I'd love to see a way to re-export a paper into a digital-friendly format, say epub/mobi to use on my e-reader.

                        Any plans on that?

                        • kwhitefoot 9 days ago
                          You could give Calibre a try. The result will probably be a long way from perfect for complicated documents but it does work reasonably well for most things. Formulas don't translate well unfortunately.
                        • isaacimagine 9 days ago
                          Looks great! Have you considered linking this up to something like arxiv or other preprint sites?
                          • _delirium 9 days ago
                            There's already this for arXiv: https://www.arxiv-vanity.com/

                            Their job is a little bit easier because arXiv papers have the .tex source available, so you can use one of the various tex2html variants, instead of having to extract the paper's contents from a rendered PDF.

                            • codeviking 9 days ago
                              Yup, we're definitely thinking about this.

                              Our focus right now is on providing a tool folks can run on whatever papers they have access to. For instance, some researchers might have access to documents that aren't available to the public. We want them to be able to run this against those.

                              That said, as we expand the effort I imagine we'll eventually pre-convert things that are publicly available, like those on arXiv, etc.

                          • nanis 9 days ago
                            This seems to be pdf2tohtml combined with GROBID[1].

                            It seems to me the masheen learningz technikz boil down to a generalization of my lightbulb moment here[2].

                            [1]: https://grobid.readthedocs.io/en/latest/

                            [2]: https://www.nu42.com/2014/09/scraping-pdf-documents-without-...

                            • codeviking 9 days ago
                              Yup, right now we use GROBID, do some post processing and combine the output with other extraction techniques. For instance, we use a model to extract document figures[1], so that we can render them in the resulting HTML document.

                              Also, we're working hard on a new extraction mechanism that should allow us to replace GROBID [2].

                              There's a lot of really smart people at AI2 working on this, I'm excited to see the resulting improvements and the cool things (like this) that we build with the results!

                              [1]: https://api.semanticscholar.org/CorpusID:4698432

                              [2]: https://api.semanticscholar.org/CorpusID:235265639

                              • tailspin2019 8 days ago
                                > It seems to me the masheen learningz technikz...

                                Off-topic low-value comment, but I'm now going to be getting a T-shirt made with the caption "i can haz masheen learningz?"

                              • oolonthegreat 9 days ago
                                cool project, though the name was confusing for me: I believe to most people "paper" first means actual paper, so I thought this was some kind of OCR system converting printed material to html?
                                • codeviking 9 days ago
                                  Thanks for the feedback. There's two hard problems n' all that... :)
                                • gregsadetsky 9 days ago
                                  Great site, congrats!

                                  One comment is that the slowest page to load was the Gallery [0] as it loads an ungodly amount of PNG files from what appears to be a single IP (a GCP Compute instance?)

                                  I see 421 requests and 150 MB loaded. As it seems to be mostly thumbnails, have you considered using jpegs instead of pngs, potentially using lazy loading (i.e. not loading images outside of the viewport) and potentially using GCP's (or another provider's) CDN offering?

                                  Once I clicked a thumbnail, loading the article itself (for example [1]) was quite breezy.

                                  The gallery is a great showcase of what your site does -- I think that it'd be worth making it snappier :-)

                                  Cheers and congrats again

                                  P.S. Also, the paper linked below [1] seems to have a few conversion problems -- I see "EQUATION (1): Not extracted; please refer to original document", and also some (formula? Greek?) characters that seem out of place after the words "and the next token is generated by sampling"

                                  [0] https://papertohtml.org/gallery

                                  [1] https://papertohtml.org/paper?id=02f033482b8045c687316ef81ba...
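The lazy-loading suggestion in this comment can be illustrated with the browser-native `loading="lazy"` attribute on `<img>`. This is a minimal sketch only: the function name `gallery_markup`, the thumbnail dimensions, and any URLs passed in are assumptions, not details of the actual gallery page.

```python
import html
from typing import Iterable, Tuple


def gallery_markup(thumbnails: Iterable[Tuple[str, str]], width: int = 200, height: int = 260) -> str:
    """Build one <img> tag per (url, title) pair with native lazy loading."""
    tags = []
    for url, title in thumbnails:
        tags.append(
            # loading="lazy" lets the browser defer images outside the viewport;
            # explicit width/height reserve space so the layout does not jump.
            f'<img src="{html.escape(url)}" alt="{html.escape(title)}" '
            f'loading="lazy" width="{width}" height="{height}">'
        )
    return "\n".join(tags)
```

With markup like this, only the thumbnails near the visible part of the page are fetched up front, which reduces the number of initial requests.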

                                  • codeviking 9 days ago
                                    > One comment is that the slowest page to load was the Gallery [0] as it loads an ungodly amount of PNG files from what appears to be a single IP (a GCP Compute instance?)

                                    Yup. There's no CDN or anything like that right now. We kept things simple to get this out the door. But we definitely intend to make improvements like this as we improve the tool.

                                    The more adoption we see, the more it motivates these types of fixes!

                                    > P.S. Also, the paper linked below [1] seems to have a few conversion problems -- I see "EQUATION (1): Not extracted; please refer to original document", and also some (formula? Greek?) characters that seem out of place after the words "and the next token is generated by sampling"

                                    Thanks for the catch. As you noted there's still a fair number of extraction errors for us to correct!

                                  • Terretta 9 days ago
                                    > have you considered using jpegs instead of pngs

                                    For thumbs of text papers, perhaps a GIF or PNG would be smaller than a JPEG while retaining pixel accurate crispness?

                                  • chrisMyzel 9 days ago
                                    This is amazing! It will finally let my (offline-only) Kindle display scientific papers. I took a random link off arXiv and it worked like a charm, including the TOC. Will this be OS'ed?
                                    • kartoshechka 9 days ago
                                      You may check out https://arxiv-vanity.com as well. It's open source, and conversion rates are close to 70% on random arXiv papers if I'm not mistaken, but it can hardly be called stable.
                                      • There is an offline solution, if you are looking for one: the app is Calibre. It is basically an ebook manager with extras. It can convert the PDF into mobi and is customizable based on your preferences. There is a preset for Kindles. It can also work with DRM'ed files via DeDRM plugins. And Calibre can export directly to your Kindle. A fair warning: don't use Calibre if you have structured your ebook folder. The app will import everything and keep it within its own database folder, thus doubling the space used.
                                        • codeviking 9 days ago
                                          Yay, glad to hear it! If you end up viewing one of these on your Kindle, let us know how well (or not) things work.

                                          We're not sure if it's something that we can distribute as OSS just yet. It relies on a few internal libraries that would also need to be publicly released, so it's not as simple as adjusting a single repository's visibility.

                                          • mintplant 9 days ago
                                            See also KOReader [0], if jailbreaking is an option for you. The built-in column splitter works pretty well for the papers I've used it to read.

                                            [0] https://github.com/koreader/koreader

                                            • Anunayj 8 days ago
                                              I've used KOReader in the past, and it's awesome! Keeping the jailbreak when my Kindle randomly decides to update itself, not so much. (Yes, I followed the instructions to disable updates, but it still somehow managed to update.) At some point it becomes too much of a hassle.

                                              Though OP has their Kindle offline all the time, so it's not an issue for them.

                                              • mintplant 8 days ago
                                                It's gotten a lot better since we entered the KindleBreak era. The community went nuclear, and now instead of applying various hacks to try and prevent updates from being downloaded, the jailbreak package includes a little service that (as I understand it) watches the disk and immediately deletes anything that looks like an update package. The MobileRead "Open Sesame!" thread [0] has all the modern tooling in one place, if you're interested.

                                                [0] https://www.mobileread.com/forums/showthread.php?t=320564

                                            • chrisMyzel 9 days ago
                                              (HTML->Mobi is totally possible)
                                            • p4bl0 9 days ago
                                              I tried that a few days ago with one of my papers (a PDF generated using pdflatex) and it didn't work that well: the text was fine but some section titles were off, and all of the math and code parts were broken.

                                              But clearly it is a nice idea and I can't wait for such tools to work better!

                                              • codeviking 9 days ago
                                                > all of the math and code parts were broken.

                                                Yup, this is a known issue that we're working towards fixing.

                                                > But clearly it is a nice idea and I can't wait for such tools to work better!

                                                Glad to hear it!

                                              • Klasiaster 9 days ago
                                                For non-reflow conversion there is pdf2htmlEX. The original repository, https://github.com/coolwanglu/pdf2htmlEX, is discontinued, but development continues under https://github.com/pdf2htmlEX/pdf2htmlEX

                                                Demo: https://pdf2htmlex.github.io/pdf2htmlEX/doc/tb108wang.html

                                                • kartoshechka 9 days ago
                                                  Looks like exactly the type of crunch work ML would do, but have you considered using brute-force converters like latexml or pandoc where appropriate?
                                                  • NmAmDa 9 days ago
                                                    I tried several physics papers and none of them had any equations extracted. Does it have problems with LaTeX equations by design?
                                                    • codeviking 9 days ago
                                                      Yup, this is a known limitation:

                                                      > What are the limitations? There are several known limitations. Tables are currently extracted from PDFs as images, which are not accessible. Mathematical content is either extracted with low fidelity or not being extracted at all from PDFs. Processing of LaTeX source and PubMed Central XML may lack some of the features implemented for PDF processing. We are working to improve these components, but please let us know if you would like some of these features prioritized over others.

                                                      But we intend to fix this!

                                                    • jimmySixDOF 9 days ago
                                                      I am so amazed at the work you guys are doing at AI2 & the Semantic Scholar project. You guys are really fixing a broken system of research and discovery which suffers from organization design principles based on university library index card filing cabinets as magnified by the exponential content growth.

                                                      Can't wait to see what people do with this . . . .

                                                      • codeviking 9 days ago
                                                        Thanks!

                                                        There's a lot of amazing people here, doing really great work. It's a really inspiring place to be. I feel really lucky to work with such great people on interesting, important problems.

                                                        Also, I should mention...we're hiring!

                                                        https://allenai.org/careers#current-openings

                                                      • weystrom 7 days ago
                                                        When are we, as people, going to ditch PDF? It's an awful format.

                                                        My friend wrote his PhD in LaTeX, but it all ends up being PDFed anyway for what, eye candy?

                                                        It's time to move on. #ditchpdf

                                                        • tailspin2019 8 days ago
                                                          Haven't tried it yet, but a very cool concept.

                                                          As per other recent discussions on HN I think the general accessibility of academic papers is ripe for improvement.

                                                          • Orionos 9 days ago
                                                            Please make it popular in the research field so you can spin up your own Sci-Hub!
                                                            • johnhenry 9 days ago
                                                              Retro mode should be default.
                                                              • codeviking 9 days ago
                                                                I agree!

                                                                Maybe we'll work on vi bindings next...