Show HN: Headless Chrome Crawler

(github.com)

172 points | by yujiosaka 2252 days ago

12 comments

  • ptasker 2252 days ago
    Pretty cool, but I recommend that anyone wanting to do this kind of thing check out the underlying Puppeteer library. You can do some really powerful stuff and build a custom crawler fairly easily (see the sketch below).

    https://github.com/GoogleChrome/puppeteer
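
    For a taste, here's roughly what one hand-rolled crawl step looks like (the URL is a placeholder; a real crawler also needs a queue, dedup, politeness delays, and error handling):

        const puppeteer = require('puppeteer');

        (async () => {
          const browser = await puppeteer.launch();
          const page = await browser.newPage();
          await page.goto('https://example.com', { waitUntil: 'networkidle0' });
          // Collect outgoing links to feed back into the crawl frontier.
          const links = await page.evaluate(() =>
            Array.from(document.querySelectorAll('a[href]'), (a) => a.href)
          );
          console.log(links);
          await browser.close();
        })();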

    • itsjustme2 2252 days ago
      Looks like this is actually built on top of puppeteer. See the "Note" under "Installation": https://github.com/yujiosaka/headless-chrome-crawler/blob/ma...
    • chatmasta 2252 days ago
      Puppeteer has some limitations. You can’t install extensions, for example.

      I haven’t looked into it, but I imagine it has a pretty clear fingerprint as well, so it would be easier to block than stock Chrome running headless.
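
      To illustrate the kind of fingerprint I mean (speculative; I haven't tested this against Puppeteer specifically): headless Chrome announces itself in the user agent, and automation sets navigator.webdriver, so a page can check something like:

          // Naive in-page check; real bot detection goes much deeper than this.
          const looksAutomated =
            /HeadlessChrome/.test(navigator.userAgent) || navigator.webdriver === true;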

    • robk 2252 days ago
      Puppeteer seems needlessly difficult to use on a VPS. I'd prefer an easily dockerized version, but there seems to be nothing robust, and sadly they make it VERY hard to connect to a Docker instance that just runs Chrome and exposes the WebSocket/9222 interface.
      • asadlionpk 2252 days ago
        I recently did this in Docker.

        Let me quickly add instructions here. First, you need to install some dependencies; add the following to your Dockerfile:

          RUN apt-get update && apt-get install -y gconf-service libasound2 libatk1.0-0 libc6 libcairo2 libcups2 libdbus-1-3 libexpat1 libfontconfig1 libgcc1 libgconf-2-4 libgdk-pixbuf2.0-0 libglib2.0-0 libgtk-3-0 libnspr4 libpango-1.0-0 libpangocairo-1.0-0 libstdc++6 libx11-6 libx11-xcb1 libxcb1 libxcomposite1 libxcursor1 libxdamage1 libxext6 libxfixes3 libxi6 libxrandr2 libxrender1 libxss1 libxtst6 ca-certificates fonts-liberation libappindicator1 libnss3 lsb-release xdg-utils wget
        
        Secondly, launch Puppeteer with the --no-sandbox option (Chrome's sandbox won't start under the container's default root user):

          const browser = await puppeteer.launch({
            args: ['--no-sandbox'] /*, headless: false */
          });
        
        That should do it.
      • RossM 2252 days ago
        I've done this recently actually. Take a look at the yukinying/chrome-headless-browser[0] image. You'll need to run it with the SYS_ADMIN capability and up the shm_size to 1024M (you can work around the SYS_ADMIN cap with a seccomp file, but I didn't have much luck with that); the run command is sketched below. Other than that oddness it works pretty well (and, with Puppeteer 1.0, far fewer crashes).

        [0]: https://github.com/yukinying/chrome-headless-browser-docker
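
        A sketch of that docker run invocation (the port assumes Chrome's usual remote-debugging default, 9222):

            # SYS_ADMIN for Chrome's sandbox; a bigger /dev/shm stops tab crashes
            docker run -d \
              --cap-add=SYS_ADMIN \
              --shm-size=1024m \
              -p 9222:9222 \
              yukinying/chrome-headless-browser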

    • isuckatcoding 2252 days ago
      Yeah, I’d really rather people built extensions on top of Puppeteer than a whole new library.
  • codedokode 2252 days ago
    This has been possible for a long time with any browser using Selenium, for example. It has APIs and client libraries for many languages.

    Also, using a real browser brings a lot of problems: high resource consumption, hangs, and it's unclear when a page has finished loading. You have to supervise all the browser processes. And if you use promises, there is a high chance that you will miss error messages, because promises hide them by default.
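
    For that last point, a minimal safety net in Node.js (just a sketch) is to register a global handler so rejected promises are at least logged instead of disappearing:

        // Surface promise rejections that nothing caught or awaited.
        process.on('unhandledRejection', (reason) => {
          console.error('Unhandled promise rejection:', reason);
          process.exitCode = 1; // fail loudly instead of crawling on in a bad state
        });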

  • tesin 2252 days ago
    While as a developer I find this super interesting, as a system administrator it makes me cringe. We don't have a lot of resources for servers, and I end up spending a disproportionate amount of time banning the IPs of bots running poorly configured tools like this, which aren't rate-limited and crush websites.

    I'm grateful that "Obey robots.txt" is listed as part of its standard behavior. If only scrapers cared enough to use it as well.

    • superasn 2252 days ago
      I've found that mod_evasive[1] works particularly well in these situations and helped us a lot (though I'm not a sysadmin, and I'm sure there are better tools for the job). For someone who is just a webmaster, I'd recommend it as a quick and dirty fix for such hassles; a sample configuration is sketched below.

      [1] https://www.digitalocean.com/community/tutorials/how-to-prot...
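
      Something like this is a starting point (thresholds are illustrative; tune them for your traffic):

          <IfModule mod_evasive20.c>
              # Requests for the same page allowed per interval (seconds)
              DOSPageCount        5
              DOSPageInterval     1
              # Total requests per client per interval (seconds)
              DOSSiteCount        100
              DOSSiteInterval     1
              # How long (seconds) an offending IP stays blocked
              DOSBlockingPeriod   60
          </IfModule>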

    • codedokode 2252 days ago
      Such a crawler should not be difficult to ban just by looking at stats: if there are many requests per IP per unit of time, many requests from data-center IPs, or many requests from Linux browsers, it is likely bots and you can ban them (you can ban the whole data center to be sure).
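
      As a sketch of the first heuristic (names and thresholds here are made up for illustration):

          // Count recent requests per IP in a sliding window; flag when over threshold.
          const WINDOW_MS = 60000;
          const MAX_REQUESTS = 120;
          const hits = new Map(); // ip -> timestamps of recent requests

          function shouldBan(ip, now = Date.now()) {
            const recent = (hits.get(ip) || []).filter((t) => now - t < WINDOW_MS);
            recent.push(now);
            hits.set(ip, recent);
            return recent.length > MAX_REQUESTS;
          }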
  • tegansnyder 2252 days ago
    There are a lot of folks reevaluating their crawling engines now that headless Chrome is maturing. To me there are some important considerations in terms of CPU/memory footprint that go into distributing a large headless crawling architecture.

    What we are not seeing open-sourced are the solutions companies are building around trimmed-down, specialized versions of headless browsers like headless Chrome, Servo, and WebKit. People are running distributed fleets of these headless browsers using Apache Mesos, Kubernetes, and Kafka queues.

  • hartator 2252 days ago
    I was stuck the last time I used headless Chrome and needed a proxy with a username and a password. Headless Chrome just doesn't support it. Any change on that?
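
    The workaround I've seen suggested (haven't verified it myself) is to pass only the proxy host via a Chrome flag and answer the auth challenge from Puppeteer with page.authenticate(); host and credentials below are placeholders:

        const puppeteer = require('puppeteer');

        (async () => {
          const browser = await puppeteer.launch({
            args: ['--proxy-server=http://proxy.example.com:8080'] // no credentials here
          });
          const page = await browser.newPage();
          // Answers the proxy's 407 challenge with basic auth.
          await page.authenticate({ username: 'user', password: 'pass' });
          await page.goto('https://example.com');
          await browser.close();
        })();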
  • princehonest 2252 days ago
    I've been considering writing my own Puppeteer Docker image such that one could freeze the image at crawl time, after a page has loaded. This would allow me to rewrite the page-parsing logic after the page layout changes. Has anyone done this already, or does anyone know of other efforts to serialize the Puppeteer page object to handle parsing bugs?
    • radioo75555 2252 days ago
      In large-scale scraping, loading pages is always kept separate from processing the data. The easiest approach would be to wait for the page to load and then save the HTML.
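
      A minimal sketch of that with Puppeteer (file path and wait condition are illustrative):

          const fs = require('fs');
          const puppeteer = require('puppeteer');

          (async () => {
            const browser = await puppeteer.launch();
            const page = await browser.newPage();
            await page.goto('https://example.com', { waitUntil: 'networkidle0' });
            // Persist the fully rendered DOM now; parse it later, offline.
            fs.writeFileSync('snapshot.html', await page.content());
            await browser.close();
          })();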
  • nikisweeting 2252 days ago
    I'm thinking about adding a crawler to Bookmark Archiver to augment the headless Chrome screenshotting and PDFing it already does.

    Wget is also a pretty robust crawler, but people have requested a proxy that archives every site they visit in real time more often than they've requested a crawler.
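
    For comparison, the kind of wget crawl I mean is something like this (flags are standard; tune depth and delays for the target):

        wget --recursive --level=3 --wait=1 --random-wait \
             --convert-links --page-requisites https://example.com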

  • bryanrasmussen 2252 days ago
    I can't tell from the examples: how do I get back individual elements from the body?
  • agotterer 2252 days ago
    Nice job! Can this be scaled and distributed to multiple machines?
  • bryanrasmussen 2252 days ago
    Also, how does this handle pages that load with a small number of links and then use JS to write in a bunch of DOM nodes and links?
    • trevyn 2252 days ago
      I don't know about this project specifically, but typically with headless Chrome, you let it run the JS and then read the DOM.
      • bryanrasmussen 2252 days ago
        Most of the naive code I see looks like the following:

            const page = await crawler.browser.newPage();
            await page.goto(url);
            await page.waitForSelector('a[href]');
            const hrefs = await page.evaluate(() =>
              Array.from(document.body.querySelectorAll('a[href]'), ({ href }) => href)
            );
        
        and then you do something with hrefs.

        However, if a page loads with 4 links defined, runs its script, and ends up with 100+ links, you miss those extra links, because waitForSelector resolves as soon as the first match appears. I notice people often fail to account for this in their crawlers, so I wondered whether this one does; one way to handle it is sketched below.
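
        One way to account for it, sketched under the assumption that "network idle plus a stable link count" is good enough for the page at hand:

            // Wait for the network to go quiet, then poll until the link count stops growing.
            await page.goto(url, { waitUntil: 'networkidle0' });
            await page.waitForFunction(() => {
              const count = document.querySelectorAll('a[href]').length;
              const grew = count > (window.__lastLinkCount || 0);
              window.__lastLinkCount = count;
              return count > 0 && !grew;
            }, { polling: 1000 });
            const hrefs = await page.evaluate(() =>
              Array.from(document.querySelectorAll('a[href]'), ({ href }) => href)
            );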

  • londt8 2252 days ago
    Is it possible to scrape songs from the Spotify web app with this?