Show HN: I made a simple PDF text editor

(simpdf.com)

327 points | by shashanoid 1418 days ago

20 comments

  • jfk13 1418 days ago
    So it claims to

    > Edit pdf like a word doc while preserving structure and format.

    While I'm sure there are cases where this works pretty well and can be very useful, it may be worth noting that -- just like every other tool in this space -- there will be many PDFs where it simply can't work. It'll be hugely dependent on exactly how the PDF-generating application/tool went about things.

    One simple example: suppose your PDF uses a specific font (not one of the standards like Times or Helvetica), so the PDF-generating tool embedded the font. (This is common.) Further, suppose the generating tool embedded a re-encoded subset of the font, including only the glyphs that were actually required. (This is also common.)

    Now, suppose the edit you wish to make involves adding a character that was not present in the original document -- let's say you want to change the date from "May" to "June". But the original document contained no occurrences of capital J (in this particular font/style), and so the capital J glyph is not present in the embedded font. No "PDF text editor" can get around this; the best you can hope for is a "J" in some fallback (such as Times) that may look terrible alongside the intended custom font.

    And as for edits that would require reflowing multiple lines of text, maybe inserting a new paragraph in the middle of a page, etc.... not much chance of this working out well.

    Yes, a tool like this can (in many cases) make it possible to make minor changes (perhaps fixing a typo or updating a word here and there). To suggest that it can "edit pdf like a word doc" seems patently false to me.
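
The subsetting problem described above can be sketched in a few lines. This is only an illustration under stated assumptions: the set of embedded characters is derived from a made-up sample string, and `missing_glyphs` is a hypothetical helper, not part of any PDF library.

```python
def missing_glyphs(embedded_chars: set[str], new_text: str) -> list[str]:
    """Return the characters of new_text that the subset font cannot draw,
    in order of first appearance."""
    missing = []
    for ch in new_text:
        if ch not in embedded_chars and ch not in missing:
            missing.append(ch)
    return missing


# Hypothetical subset: only the characters the original page actually used.
original_text = "Invoice dated 12 May 2020"
subset = set(original_text)

print(missing_glyphs(subset, "June"))  # ['J', 'u']
```

In this sample both "J" and "u" are absent from the subset, so an editor would have to draw them from some fallback font.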

    • saint-loup 1418 days ago
      I learned a lot from this article (discovered on HN I think):

      What's so hard about PDF text extraction? https://www.filingdb.com/pdf-text-extraction

      In a nutshell, PDF is fundamentally different than MS Word: it's a standard for visual layout, without notions of paragraphs or even words.

      As OP said, it doesn't mean the tool is useless. It could quite often come in handy for me.
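
To make the "no notion of words" point concrete, here is a rough sketch of how an extractor might rebuild words from positioned glyphs. The `Glyph` type, the coordinates, and the `gap_threshold` value are all invented for illustration; real extractors use more elaborate heuristics.

```python
from typing import NamedTuple


class Glyph(NamedTuple):
    char: str
    x: float      # left edge, in points
    width: float  # advance width, in points


def glyphs_to_text(glyphs: list[Glyph], gap_threshold: float = 1.5) -> str:
    """Rebuild one line of text, inserting a space wherever the horizontal
    gap between consecutive glyphs exceeds gap_threshold."""
    ordered = sorted(glyphs, key=lambda g: g.x)
    out = []
    prev_end = None
    for g in ordered:
        if prev_end is not None and g.x - prev_end > gap_threshold:
            out.append(" ")
        out.append(g.char)
        prev_end = g.x + g.width
    return "".join(out)


# Hypothetical glyph positions: the page never stores a space character,
# only a wider gap between "i" and "y".
line = [
    Glyph("H", 0, 7), Glyph("i", 7, 3),
    Glyph("y", 14, 5), Glyph("o", 19, 5), Glyph("u", 24, 5),
]
print(glyphs_to_text(line))  # Hi you
```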

      • mehrdadn 1418 days ago
        One of the weirdest things I've seen is a PDF where the text is complete gibberish if you copy paste, but perfect if you export it to HTML or Word in Acrobat. Never figured out how or why that might happen.
        • Tsiklon 1417 days ago
          I read somewhere (likely here) that this oddness comes from the idea that PDF is a way of structuring documents for print first, and presentation in a user interface is secondary.

          That the rendering of the document on screen is paramount as opposed to ability to manipulate the text itself. "These characters should be displayed at this position in the document precisely"

          It would make sense that exporting the document as HTML or Word would make this easier - as these document formats have different goals.
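
One plausible cause of gibberish on copy-paste (an assumption here, not something the thread confirms) is a re-encoded subset font whose character codes have no usable mapping back to Unicode. The page stores glyph codes, and a viewer needs a code-to-character table to turn them back into text. The sketch below simulates that with a made-up encoding; it does not parse a real PDF, and it does not explain why an export path might still succeed.

```python
def build_subset_encoding(text: str) -> dict[str, int]:
    """Assign glyph codes in order of first use, as a hypothetical
    subsetting tool might. Codes start at 0x21 to stay printable."""
    code_for = {}
    for ch in text:
        if ch not in code_for:
            code_for[ch] = 0x21 + len(code_for)
    return code_for


def encode(text: str, code_for: dict[str, int]) -> bytes:
    """The bytes that would end up in the page content."""
    return bytes(code_for[ch] for ch in text)


def decode_with_map(codes: bytes, code_for: dict[str, int]) -> str:
    """What a viewer recovers when a code-to-character table is available."""
    to_unicode = {code: ch for ch, code in code_for.items()}
    return "".join(to_unicode[c] for c in codes)


def decode_without_map(codes: bytes) -> str:
    """What a viewer might produce if it simply treats codes as Latin-1."""
    return codes.decode("latin-1")


encoding = build_subset_encoding("Hello")
stored = encode("Hello", encoding)
print(decode_with_map(stored, encoding))  # Hello
print(decode_without_map(stored))         # !"##$
```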

        • _chompsky 1418 days ago
          seen that happen quite often with LaTeX documents!
    • bla3 1418 days ago
      Finding a font that matches the look of letters in the vicinity of the edit on Google Fonts or similar and embedding the missing characters from that font seems doable and should work pretty well.

      Figuring out more global properties like multi-column layouts and reflowing text is a much harder problem.

    • blitmap 1418 days ago
      Would be neat to do font recognition against available web fonts to try to find a match, then convert the web font to TrueType or whatever format the PDF used, and re-embed the font.
  • x32n23nr 1418 days ago
    Congratulations for the launch. Looks nice. For those that are cautious uploading sensitive PDFs, you can always just open them with Inkscape, and start editing pretty much anything in a document.

    PS: I was once forwarded an application for someone who was supposed to replace me, and I had to interview them. The expected salary was hidden by placing a gray rectangle over it. I removed it using inkscape and saw the expected salary was 30% higher than what I made.

    Inkscape Link: https://inkscape.org/

    • lowwave 1418 days ago
      Inkscape to Illustrator is what GIMP is to Photoshop.

      Glad someone brought it up. For most common tasks, is there really a point in even using Photoshop and Illustrator nowadays? Especially with the cloud direction they are moving towards.

      • citizenkeen 1418 days ago
        I've been incredibly happy with the Affinity Suite. Smoother and cleaner than GIMP/Inkscape/Scribus, and a desktop app for $50 ($25 during Covid), which makes it more palatable to me than the Adobe options.
        • alok-g 1416 days ago
          Is this referring to Affinity Designer, to Affinity Publisher, or a bundle of the three apps (Photos, Designer, Publisher)?
      • maaarghk 1418 days ago
        because GIMP sucks, maybe, and also because most people are required to interoperate with the rest of the profession. i am very disappointed that photoshop et al cannot be run under WINE :(
        • emayljames 1418 days ago
          I have done/still do complex composition and editing in both GIMP and Photoshop, and there are really no missing features in GIMP. If you go into using GIMP expecting it to be PShop, you will always be disappointed, but in reality that is only a good thing.
        • AbuAssar 1418 days ago
          I use GIMP regularly and find it more than enough for all my needs.

          I don't agree with you that it sucks.

        • lowwave 1417 days ago
          For sure. If one (myself included) is used to the niceties of Photoshop and Illustrator, GIMP's user interface leaves much to be desired. However, from an open platform perspective, do I really want many hours of editing work to be tied to a cloud platform and a proprietary format?

          Think not just about now or one year from now, but 10 years, 20, 50 years from now, 100 years? Open formats and open source software are much to be desired for creative software. There is the paradox of making money from open source software, and I don't have an answer for that. It seems like some kind of sliding scale payment will be needed for this software. Maybe also some kind of sliding scale payment on eventual money earned from the software, up to a limit.

        • pacamara619 1417 days ago
          It doesn't suck. It has nearly all features you could possibly need, and then some. People who expect GIMP to be identical to Photoshop suck.
    • Flashtoo 1418 days ago
      I'm curious how that interviewing situation worked out for you. Did you use it to negotiate a higher salary there or at your next employer? No pressure if you don't feel like sharing that, of course.
      • x32n23nr 1418 days ago
        Once you put an employee in a position where the only way to get a salary hike is to change employers, there is no way back - you change employers. It worked out well for me, and AFAIK the role is not filled yet in the previous company.
    • flak48 1418 days ago
      But the candidate's expected salary can be anything, right?

      Or is the fact that the company is going ahead with interviewing them, a sign that they may be willing to pay that figure?

    • agumonkey 1418 days ago
      qpdf also helps peeking safely
  • throwawat573635 1418 days ago
    You should provide a sample pdf file so we don't have to hunt around for a (small) pdf file just to see how it works.
  • jessmay 1418 days ago
    Wait wrapping a lib you didn’t write and that hasn’t been updated in 5 years is now called “I made a text extractor”?
  • gnicholas 1418 days ago
    I tried using this with an invoice that I'd created using invoice-generator.com, in the hopes that it would be an easier way to make new invoices. When I tried to replace the To party's name, the text came back partly bold and partly not. There was also a weird overlay on an email address on the bottom that said something about email address protected.

    Would love to have a tool like this that worked for making new invoices, among other things!

    • jfk13 1418 days ago
      > When I tried to replace the To party's name, the text came back partly bold and partly not.

      Most likely, the PDF used a subsetted embedded bold font, so it only worked for letters that happened to be present in the original text; any new letters were missing from the font and you got a fallback.
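
The "partly bold and partly not" result can be reproduced on paper. The sketch below splits a replacement string into runs that the embedded subset can draw and runs that would fall back to another font. The party names are made up, and `split_by_font` is a hypothetical helper rather than anything the tool is known to do.

```python
from itertools import groupby


def split_by_font(embedded_chars: set[str], new_text: str) -> list[tuple[bool, str]]:
    """Split new_text into (drawn_with_embedded_font, run) pairs."""
    return [
        (in_font, "".join(chars))
        for in_font, chars in groupby(new_text, key=lambda ch: ch in embedded_chars)
    ]


# Hypothetical invoice: the bold subset only contains the letters of the old name.
subset = set("Acme Corp")
print(split_by_font(subset, "Apex Corp"))
# [(True, 'Ape'), (False, 'x'), (True, ' Corp')]
```

Only the "x" is missing from the subset, so it alone would be drawn in the fallback font.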

      Just one of many reasons why a tool like this is the wrong way to approach pretty much any document creation/editing task, because PDF is the wrong document format to use for these purposes.

      • gnicholas 1418 days ago
        > Most likely, the PDF used a subsetted embedded bold font, so it only worked for letters that happened to be present in the original text; any new letters were missing from the font and you got a fallback.

        Wow, that seems to be exactly right, based on which letters were bold and which weren't. Very interesting!

        To the folks downthread who asked about the legality of modifying invoices, I was trying to use one invoice as a template for a second invoice. Most things are the same, so it would have been great if I could have made a simple change!

    • punnerud 1418 days ago
      In the EU it is illegal to edit an invoice outside a program that keeps track of all changes, bill numbers, etc.

      Not the same in US?

      • londons_explore 1418 days ago
        Citation needed...

        I don't think laws are ever that prescriptive about what you have to do. As far as I know, the only requirement is when the tax inspector arrives, you need to be able to produce a complete and accurate list of all invoices you have ever sent to customers.

      • killerpopiller 1418 days ago
        at least for Germany: digital invoices shall be processed in a way that manipulation can be ruled out (GoBD). Hence companies use special scanners and document management systems which document the Revisionssicherheit.
        • corty 1418 days ago
          And that is the new, more relaxed situation. At first, digital invoices were required to be signed with a qualified signature, i.e. spend a few hundred quid a year on certificates by a few select CAs (only the usual German "suspects" could qualify due to intentionally onerous requirements).
      • DanBC 1418 days ago
        They aren't editing an invoice, they're using a template to create a new invoice.
      • AnssiH 1418 days ago
        Even before issuance? No such requirement in Finland at least, so clearly not an EU level requirement.

        The Finnish VAT Act does, however, require that companies sending and receiving invoices ensure that the mandatory invoice data is not modified after issuance (209 g §).

      • J_cst 1418 days ago
        Source?
    • BossingAround 1418 days ago
      Try xournal. I had a really positive experience with it.
  • michaelmrose 1418 days ago
    A few notes. It doesn't seem to work with documents that have multiple columns. Pushing text over just overwrites the other column. It doesn't seem to reflow text where the source document obviously had margins possibly because that information is looooong gone. Hitting enter to move text to the next line didn't move other text it just again seemed to overwrite it.
  • shashanoid 1418 days ago
    Hi, I bootstrapped this simple website. Let me know what you think :)
    • mhasbini 1418 days ago
      I tried it with a somewhat complex PDF structure and it worked like a charm +1. Would love to learn more about the underlying techniques/tech.
      • hombre_fatal 1418 days ago
        https://github.com/shashanoid/Simpdf/blob/1557bf838a8debeee1...

        Btw, arbitrary code execution vuln here, OP.

        • parhamn 1418 days ago
          Yeah. Switch to array args & disable the shell. I hope they’re not running that locally as the download script suggests. But then you still have a bunch of other security issues. Shrug.
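
A minimal sketch of the "array args, no shell" suggestion, assuming the server currently builds a shell string from the uploaded filename (the linked source is truncated here, so that is an inference from the comments). `build_convert_command` and `convert` are hypothetical names; only the basic `pdf2htmlEX <input>` invocation is used.

```python
import os
import subprocess


def build_convert_command(pdf_path: str) -> list[str]:
    """Build an argv list so the filename is passed as one argument
    and is never interpreted by a shell."""
    # An absolute path also prevents a name starting with "-" from being
    # mistaken for a command-line option.
    return ["pdf2htmlEX", os.path.abspath(pdf_path)]


def convert(pdf_path: str) -> None:
    """Run the converter without a shell; raises CalledProcessError on failure."""
    subprocess.run(build_convert_command(pdf_path), shell=False, check=True)


# A hostile filename stays a single, inert argument.
print(build_convert_command("; echo 'hi'; #"))
```

With a list and `shell=False`, characters such as `;` and `#` have no special meaning, because no shell ever parses them.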
        • miccah 1418 days ago
          I'm investigating the same. The upload endpoint uses secure_filename to get the filename used in that func. I'm not familiar with it, but the docs say it could return an empty string.
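
If the sanitizer can return an empty string, the caller has to handle that case itself. A small sketch, assuming the sanitized name is passed in from something like `secure_filename`; `storage_name` is a hypothetical helper and the `.pdf` default is an assumption.

```python
import uuid


def storage_name(sanitized: str, extension: str = ".pdf") -> str:
    """Return a non-empty server-side filename.

    `sanitized` is assumed to be the output of a filename sanitizer,
    which may be an empty string for hostile or unusual input.
    """
    if not sanitized:
        # Fall back to a random name instead of using an empty path component.
        return uuid.uuid4().hex + extension
    return sanitized


print(storage_name("report.pdf"))  # report.pdf
print(storage_name(""))            # a random 32-hex-digit name ending in .pdf
```

Generating a random name for every upload, regardless of what the client sent, is a common alternative that also avoids collisions.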
      • dellinspiron 1418 days ago
        It just calls https://github.com/pdf2htmlEX/pdf2htmlEX on the server.
        • chipperyman573 1418 days ago
          You could create a file named ; echo 'hi'; #

          and it appears as if it would probably run anything you put between ; and # (in this case it will echo hi). Unless the filename is sanitized, which it appears to not be.

        • echan00 1418 days ago
          What calls what?
      • shashanoid 1418 days ago
        Hey, there's nothing complex going on. I'm using a tool to convert pdf to html, and it does a phenomenal job.
    • koulvi 1418 days ago
      I used this to delete a few pages from a pdf file. However, I can't download the edited version. The "save and download" option is just resulting in opening another blank webpage.
    • saradhi 1418 days ago
      What's up with the requirements? That's a lot, and I don't see all of them being used. Did you line them up for new features?
  • longtom 1418 days ago
  • smhmd 1418 days ago
    I tried it with a relatively complex pdf and it blew my mind.
  • ray991 1418 days ago
    I really wish the PDF layout was easier to parse. No matter which library you use, you always run into edge cases which make text selection and extraction an issue on certain files. I was recently extracting financial data from a bank which provides only PDFs and every time they changed the format just a little bit I had to change large parts of my code to extract the transactions I wanted.
    • jfk13 1418 days ago
      PDF is designed to present a human-readable document, not to serve as a data interchange format.
    • saradhi 1418 days ago
      I agree with this; it's the same with insurance companies when resolving claims. It feels like they want to make the extraction look complicated for some unknown reason. Not often and not all companies, but there are edge cases.
    • dmoo 1418 days ago
      I’m sure you’ve looked at it but I have a lot of success with pdftotext -layout
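
A rough sketch of the `pdftotext -layout` approach for the bank-statement case. The column layout, the regular expression, and the sample rows are all invented for illustration; a real statement would need its own pattern, which is exactly the fragility described above.

```python
import re
import subprocess
from decimal import Decimal


def pdf_to_layout_text(pdf_path: str) -> str:
    """Run pdftotext in layout mode and return the text ("-" means stdout)."""
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


# Assumed row shape: date, description, amount, separated by 2+ spaces.
ROW = re.compile(r"^\s*(\d{4}-\d{2}-\d{2})\s{2,}(.+?)\s{2,}(-?\d+\.\d{2})\s*$")


def parse_transactions(layout_text: str) -> list[tuple[str, str, Decimal]]:
    """Keep only the lines that look like transaction rows."""
    rows = []
    for line in layout_text.splitlines():
        match = ROW.match(line)
        if match:
            date, description, amount = match.groups()
            rows.append((date, description, Decimal(amount)))
    return rows


# Hypothetical layout-mode output.
sample = """\
Statement of account
2020-05-01    Coffee shop            -3.50
2020-05-02    Salary               2500.00
"""
for row in parse_transactions(sample):
    print(row)
```

Keeping the extraction step and the parsing step separate means only the pattern has to change when the statement format shifts.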
  • illender 1418 days ago
    @op line 105 on https://github.com/shashanoid/shashwatsingh.github.io/blob/m... is missing an 'l' and will cause you to not get your email from your personal web link
  • dheera 1418 days ago
    Now I just wish there were a version that could run locally. All the Linux PDF viewers suck -- they can't even save a fill-in PDF form or insert a signature.
    • mrb 1418 days ago
      I spent an afternoon trying a bunch of them some years ago. I settled on the freeware Master PDF Editor version 4 (version 5 inserts a watermark unless you buy a license). https://code-industry.net/masterpdfeditor/

      It is super lightweight and opens any complex PDF. Can insert signatures, can edit anything in the PDF (without changing the font even if the font is embedded in the PDF—as long as all the glyphs you are typing are present in the font), etc. My only complaint is that it won't edit an encrypted PDF, but a one-liner Ghostscript command can remove the encryption automatically: https://gist.github.com/compleatang/6046249
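
For reference, re-writing a PDF through Ghostscript's `pdfwrite` device can be scripted as below. This is a sketch, not necessarily the exact command in the linked gist, and it assumes the file opens without a user password. `build_gs_command` and `rewrite_pdf` are hypothetical names.

```python
import subprocess


def build_gs_command(src: str, dst: str) -> list[str]:
    """Build a Ghostscript argv that re-writes src into a new PDF at dst."""
    return [
        "gs", "-q", "-dNOPAUSE", "-dBATCH",
        "-sDEVICE=pdfwrite",
        f"-sOutputFile={dst}",
        src,
    ]


def rewrite_pdf(src: str, dst: str) -> None:
    """Run Ghostscript without a shell; raises CalledProcessError on failure."""
    subprocess.run(build_gs_command(src, dst), check=True)


print(build_gs_command("encrypted.pdf", "unencrypted.pdf"))
```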

    • tjbiddle 1418 days ago
      Github is in the footer on the website if you want to run it locally: https://github.com/shashanoid/Simpdf
    • bmn__ 1418 days ago
      http://enwp.org/Okular can.

      Please don't bait for answers with generalisations.

    • michaelmrose 1418 days ago
      There are actually a plethora of choices that are great for what most people are doing namely viewing documents. Forms and annotations are also supported by a number of choices.

      Have you tried Okular for example? It supports filling and saving forms and annotations. You can draw a signature with the freehand annotation. This works great with a touchscreen not so great with a mouse.

      It also supports a plethora of useful features and looks and works great out of the box.

      It actually CAN insert a signature via its Stamp annotation but unfortunately this doesn't produce an annotation that works outside of okular due to a current limitation in poppler which is why all pdf readers based on it won't have that feature.

      Here is the 10 year old bug that nobody is working on that would presumably make it possible for any reader to have this feature.

      https://gitlab.freedesktop.org/poppler/poppler/-/issues/522

      I think it seems like a major missing feature to you, but it perhaps hasn't received much attention because, while you may use this feature a lot if you sign business documents, 99% of users who are consuming documents or exporting them from word processors neither know this feature exists nor need it.

      Maybe people ought to put money towards a bounty for someone to implement it?

    • Scarblac 1418 days ago
      The old advice is still true - the best way to get help with Linux problems online is to claim it can't do something :-)
    • distances 1418 days ago
      Which ones can't? At least the default viewer KDE ships definitely can do this, I use it all the time.
    • jack20 1418 days ago
      The best editor I’ve found is Qoppa PDF Studio. Although not free it is cross-platform (written in Java) and their licenses are perpetual. They do have a free viewer app which can fill interactive forms and this works on Windows, Mac and Linux: Qoppa PDF Studio Viewer.
    • BossingAround 1418 days ago
      I use xournal for exactly that--filling out a PDF form and inserting my signatures.
  • yodaarjun 1418 days ago
    hackernews traffic crashed the host :( Can't wait to try tho!
  • kyawzazaw 1418 days ago
    I love it. How do I save it?
    • jnlar 1418 days ago
      ^ this
      • dheera 1418 days ago
        there's a "Save and Download" option if you mouseover to the left.
        • zoid_ 1418 days ago
          I found this too, didn't seem too obvious at first though.
  • top_kekeroni_m8 1417 days ago
    I'll be honest, I thought the website was called simp df.. lol
  • mamurphy 1418 days ago
    The website isn't loading anything for me right now.
  • schoolornot 1418 days ago
    A little glitchy with complex PDFs but WOW, amazing work!
  • Hoasi 1418 days ago
    Neat, would be useful for students' homework.
  • oxbridge 1418 days ago
    Works, it changes the font type after editing