ScrapeGraphAI: Web scraping using LLM and direct graph logic

(scrapegraph-doc.onrender.com)

194 points | by ulrischa 12 days ago

13 comments

  • nodoodles 11 days ago
    What I'd love to see is a scraper builder that uses LLMs/'magic' to generate optimised scraping rules for any page, i.e. CSS selectors and processing rules mapped to output keys, so you can run the scraping itself at low cost and high performance - roughly the sketch below.
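
    Roughly what I have in mind, as a sketch (assuming BeautifulSoup for the cheap part; the selector map is made up - in practice the LLM would emit it once from a sample page):

      from bs4 import BeautifulSoup

      # Selector rules the LLM would be asked to generate once, keyed by output field
      # (these selectors are invented for illustration):
      rules = {
          "headline": "h2.article-title a",
          "price": "span.price",
      }

      def scrape(html: str, rules: dict[str, str]) -> dict[str, str | None]:
          # The cheap, repeatable part: plain parsing with the generated rules.
          soup = BeautifulSoup(html, "html.parser")
          out = {}
          for key, selector in rules.items():
              node = soup.select_one(selector)
              out[key] = node.get_text(strip=True) if node else None
          return out
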
    • jumploops 11 days ago
      Agreed!

      Apify's Website Content Crawler[0] does a decent job of this for most websites in my experience. It allows you to "extract" content via different built-in methods (e.g. Extractus [1]).

      We currently use this at Magic Loops[2] and it works _most_ of the time.

      The long-tail is difficult though, and it's not uncommon for users to back out to raw HTML, and then have our tool write some custom logic to parse the content they want from the scraped results (fun fact: before GPT-4 Turbo, the HTML page was often too large for the context window... and sometimes it still is!).

      Would love a dedicated tool for this. I know the folks at Reworkd[3] are working on something similar, but not sure how much is public yet.

      [0] https://apify.com/apify/website-content-crawler

      [1] https://github.com/extractus/article-extractor

      [2] https://magicloops.dev/

      [3] https://reworkd.ai/

    • KhoomeiK 11 days ago
      This is essentially what we're building at https://reworkd.ai (YC S23). We had thousands of users try using AgentGPT (our previous product) for scraping and we learned that using LLMs for web data extraction fundamentally does not work unless you generate code.
      • nodoodles 11 days ago
        Awesome to hear! Looking forward to a launch -- the waitlist form was too long to complete; I'd need another LLM to fill that in :)
      • spxneo 11 days ago
        All-around automation with an LLM thrown on top of it sucks.

        The statistics are not in its favour.

        • visarga 11 days ago
          Code is also hard. You have to generate code that accounts for all possible exceptions and errors. If you want to automate a UI, for example, pushing a button can cause all sorts of feedback, errors, and consequences that need to be known to write the code.
        • KhoomeiK 11 days ago
          Yep, that's true until you generate code. It's harder from a technical POV, but you can get way higher performance and reliability.
    • longgui0318 11 days ago
      Here's a project that uses an LLM to generate crawling rules and then crawl with them, but it looks like it's still in the early stages of research.

      https://github.com/EZ-hwh/AutoCrawler

      • nodoodles 11 days ago
        Thanks, will look into it, looks promising
    • nikcub 11 days ago
      Most of the top LLMs already do this very well, because they've been trained on web data and because they're being used for precisely this task internally to grab data.

      The complicated operational side of scraping is running headless browsers, managing IP ranges, bypassing bot detection, solving captchas, observability, updating selectors, etc. There are a ton of SaaS services that do that part for you.

      • nodoodles 11 days ago
        Agreed there are several complexities, but I'm not sure which ‘this’ you mean - updating selectors specifically is one of the areas I had in mind earlier.
      • selimthegrim 11 days ago
        There was one I remember out of UF/FSU called Intoli that seems to have pivoted into consulting.
    • greggsy 11 days ago
      It also seems obvious that one would want to simply drag a box around the content you want, and have the tool provide some examples to help you refine the rule set.

      Ad blockers have had something very close to this for some time, without any sparkly AI buttons.

      I’m sure someone is working on a subscription-based service using corporate models in the backend, but it's something that could easily be implemented with a very small model.

      • uptown 11 days ago
        Mozenda does something like that. I haven't used it in many years, so I'm not up to date on what it currently offers.
    • geuis 11 days ago
      That's an interesting take. I've been experimenting with reducing the overall rendered HTML to just structure and content, and using the LLM to extract content from that. It works quite well. But I think your approach might be more efficient and faster.
      • nodoodles 11 days ago
        One fun mechanism I've been using for reducing HTML size is diffing (with some leniency) pages from the same domain to exclude common parts (i.e. headers/footers). That preprocessing can be useful for any parsing mechanism.
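
        A sketch of that diff step (Python difflib; the "leniency" and block-level granularity are left out for brevity):

          import difflib

          def strip_common_parts(page_a: str, page_b: str) -> str:
              """Keep only the lines of page_a that don't also appear in page_b,
              which roughly drops shared headers/footers/nav."""
              a_lines = page_a.splitlines()
              b_lines = page_b.splitlines()
              matcher = difflib.SequenceMatcher(None, a_lines, b_lines, autojunk=False)
              unique = []
              for tag, i1, i2, _, _ in matcher.get_opcodes():
                  if tag in ("replace", "delete"):  # present in page_a but not in page_b
                      unique.extend(a_lines[i1:i2])
              return "\n".join(unique)
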
    • cpobuda 11 days ago
      I have been working on this. Feel free to DM me.
    • wraptile 11 days ago
      Parsing HTML is a solved and, frankly, not very interesting problem. Writing XPath/CSS selectors or JSON parsers (for when data is in script variables) is not much of a challenge for anyone.

      The more interesting issue is being able to parse data from the whole page content stack, which includes XHRs and their triggers. In this case an LLM driver would control an indistinguishable web browser to perform all the steps to retrieve the data as a full package (a rough sketch of the capture part is below). Though this is still a low-value proposition, as the models get fumbled by harder tasks and the easier tasks can be done by a human in a couple of hours.

      LLM use in web scraping is still purely educational and assistive, as the biggest problem in scraping is not scraping itself but scraper scaling and blocking, which is becoming extremely common.
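
      For the browser-driven capture part, a minimal sketch assuming Playwright; the URL, the XHR filter, and the complete absence of stealth/anti-blocking handling are all simplifications:

        from playwright.sync_api import sync_playwright

        captured = []  # XHR/fetch responses collected alongside the page

        def on_response(response):
            if response.request.resource_type in ("xhr", "fetch"):
                try:
                    captured.append({"url": response.url, "body": response.text()})
                except Exception:
                    pass  # some bodies (e.g. redirects) aren't retrievable

        with sync_playwright() as p:
            browser = p.chromium.launch(headless=True)
            page = browser.new_page()
            page.on("response", on_response)
            page.goto("https://example.com", wait_until="networkidle")
            html = page.content()  # rendered DOM, handed off together with `captured`
            browser.close()
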

      • illegally 7 days ago
        Exactly - are you aware of any current efforts by people trying to do that?
        • wraptile 6 days ago
          Not anything in open source yet.
  • mariopt 11 days ago
    What is the point of using LLMs for the scraping itself instead of using them to generate the boring code for mimicking HTTP requests, CSS/XPath selectors, etc.?

    I get that it may be interesting for small tasks combined with a browser extension, but for real scraping it just seems overkill and expensive.

    • geuis 11 days ago
      It is potentially expensive, but here's a different take.

      Instead of writing a bunch of selectors that break often, imagine just being able to write a paragraph telling the LLM to fetch the top 10 headlines and their links on a news site, or to fetch the images, titles, and prices off a storefront.

      It abstracts away a lot of manual, fragile work - something like the sketch below.
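
      A minimal sketch of that kind of call, assuming the OpenAI Python client and HTML that has already been reduced to fit the context window (model name and prompt are illustrative):

        import json
        from openai import OpenAI

        client = OpenAI()  # reads OPENAI_API_KEY from the environment

        def extract_headlines(reduced_html: str) -> list[dict]:
            response = client.chat.completions.create(
                model="gpt-4-turbo",
                response_format={"type": "json_object"},
                messages=[
                    {"role": "system",
                     "content": "You extract data from HTML and reply only with JSON."},
                    {"role": "user",
                     "content": "Return the top 10 headlines and their links as "
                                '{"items": [{"headline": ..., "url": ...}]}\n\n'
                                + reduced_html},
                ],
            )
            return json.loads(response.choices[0].message.content)["items"]
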

      • mariopt 11 days ago
        I get that and LLMs are expected to get better.

        Today, would you build a scraper with current LLMs that randomly hallucinate? I wouldn't.

        The idea of an LLM-powered scraper adapting the selectors every time the website owner updates the site is pretty cool.

        • ewild 11 days ago
          At my job we are scraping using LLMs, for a 10M sector of the company. GPT-4 Turbo has never hallucinated, not once out of 1.5 million API requests. However, we use it to parse and interpret data from webpages, which is something you wouldn't be able to do with a regular scraper - not well, at least.
          • what 11 days ago
            Bold claim, did you review all 1.5 million requests?
          • bryanrasmussen 11 days ago
            I guess the claim is based on statistical sampling at a reasonably high level, to be sure that if there were hallucinations you would catch them? Or is there something else you're doing?

            Do you have any workflow tools etc. to find hallucinations? I've got a project in the backlog to build that kind of thing and would be interested in how you sort good results from bad.

            • ewild 11 days ago
              In this case we had 1.5 million ground truths for testing purposes. We have now run it over 10 million, but I didn't want to claim 0 hallucinations on those, as technically we can't say for sure. Considering the hallucination rate was 0% for the 1.5 million compared against ground truths, though, I'm fairly confident.
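
              A minimal sketch of that kind of check - exact match per field is an illustrative simplification:

                def hallucination_rate(extractions, ground_truths):
                    """Fraction of ground-truth fields where the model returned
                    a value that contradicts the known answer."""
                    invented, total = 0, 0
                    for extracted, truth in zip(extractions, ground_truths):
                        for field, expected in truth.items():
                            total += 1
                            got = extracted.get(field)
                            # a missing field is a miss, not a hallucination
                            if got is not None and got != expected:
                                invented += 1
                    return invented / total if total else 0.0
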
          • krainboltgreene 11 days ago
            How do you know that's true?
            • ewild 11 days ago
              The 1.5 million was our test set. We had 1.5 million ground truths, and it didn't make up fake data for a single one.
              • krainboltgreene 9 days ago
                That's not what I asked. I asked "How did you determine that it didn't make up/get information wrong for all 1.5m?"
      • is_true 11 days ago
        I've written thousands of scrapers and trust me, they don't break often.
        • infecto 11 days ago
          Me too, but with adversaries that obfuscate and change their site often to prevent scraping, it can happen. It depends on what you are looking at.
          • is_true 10 days ago
            Well-written scrapers should be able to cope with site changes.
      • suchintan 11 days ago
        https://github.com/Skyvern-AI/skyvern

        This is pretty much what we're building at Skyvern. The only problem is that inference cost is still a little too high for scraping, but we expect that to change in the next year.

  • nextworddev 11 days ago
    Would be nice if the docs had a comparison between traditional scraping (e.g. using headless browsers, BeautifulSoup, etc.) and this approach. Exactly how is the AI used?
    • geuis 11 days ago
      A lot of the larger LLMs have been trained on millions of pages of HTML. They have the ability to understand raw HTML structure and extract content from it. I've been having some success with this using Mixtral 8x7B.
  • nurettin 11 days ago
    A lot of websites (like online shops) won't let you scrape for long unless you are logged in. Some (like real estate sites) won't tolerate you for long even if you are logged in. Some (like newspapers) won't accept a simple request; they will try to detect browser and user behavior. Some will even detect data-center IP blocks to get rid of you.

    I don't believe scraping is such a solved problem that you can slap AI and some cute vector spiders on it and claim that everything works.

  • simonw 12 days ago
    Typo on your homepage: "You just have to implment just some lines of code and the work is done"
  • ushakov 11 days ago
    There’s also llm-scraper in TypeScript

    https://github.com/mishushakov/llm-scraper

  • holoduke 11 days ago
    Writing a few selector queries is probably easier than having an LLM output the selector queries by feeding it the webpage and the desired output. I do scraping for a living. CasperJS/PhantomJS is still my best friend for scraping dynamic websites. Selector queries are the least of the problems.
  • RamblingCTO 11 days ago
    I don't see the benefit. I don't want to say what I built 2markdown.com with (it converts websites to markdown for LLMs), but it has pretty decent performance without any high-cost (and sometimes erroneous) LLMs thrown on top of the scraping.
    • dev213 11 days ago
      The "Integrates Easily" links to Langchain and Pipedream are flipped on your website.
  • sethx 12 days ago
    At jobstash.xyz we have similar tech as part of our generalized scraping infra, and it’s been live for half a year performing optimally.
  • maxrmk 11 days ago
    I'd love to try the demo but there's no way I'm putting my openai key into that site.
  • spaniard89277 11 days ago
    This is completely unrealistic unless you want to burn money.
    • msp26 11 days ago
      In practice it's actually very good for lower volume tasks with non-fixed sources.

      I haven't tried this library but I do use an LLM based scraper in addition to more traditional ones.

    • infecto 11 days ago
      I have not used this specific library, but it's far from unrealistic and hardly a money pit. An LLM can fit in nicely with scraping libraries. Sure, if you are crawling the web like Google it makes no sense, but if you have a hit list, this can be a cost-effective way to avoid spending engineering hours maintaining the crawler.
      • spaniard89277 11 days ago
        Which LLM do you use? Because I can't see a scraper running daily without being very expensive.
        • a_wild_dandan 11 days ago
          Llama-3 70B on my local MacBook works wonderfully for these tasks.
          • spaniard89277 11 days ago
            How's the pipeline? Do you pass all the HTML to the LLM? Isn't the context window a problem?
            • a_wild_dandan 11 days ago
              There are phenomenal web scraping tools to crudely "preprocess" the document a bit, slashing outer HTML fluff while preserving the small subset of actual data. From there, 8k tokens (or whatever) goes really far.
              • LunaSea 11 days ago
                At a very generous 50 tokens per second, doesn't that still leave you with more than two and a half minutes (160 s) of processing time per document?
        • mrbungie 11 days ago
          GPT-3.5/GPT-4 ain't the only LLMs available. Flan-T5/T5 or Llama 2/3 8B models can be fine-tuned for this use case and run much more cheaply.
          • spaniard89277 11 days ago
            How do you handle the context window limit? If you push the entire DOM to the LLM it will exceed the context window by far in most cases, won't it?
            • aleksiy123 11 days ago
              My guess is you do some preprocessing on the DOM to get it down to text that still retains some structure.

              Something like https://github.com/Alir3z4/html2text.

              I'm sure there are other (better?) options as well.

            • msp26 11 days ago
              Trim unwanted HTML elements + convert to markdown. It significantly reduces token counts while retaining structure - something like the sketch below.
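
              A minimal sketch of that step, assuming BeautifulSoup plus html2text; the list of tags to drop is an illustrative choice:

                from bs4 import BeautifulSoup
                import html2text

                def html_to_compact_markdown(html: str) -> str:
                    soup = BeautifulSoup(html, "html.parser")
                    # Drop tags that add tokens but rarely carry the target data.
                    for tag in soup(["script", "style", "noscript", "svg", "iframe", "nav", "footer"]):
                        tag.decompose()
                    converter = html2text.HTML2Text()
                    converter.ignore_images = True
                    converter.body_width = 0  # don't hard-wrap lines
                    return converter.handle(str(soup))
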
        • infecto 11 days ago
          Again, it depends on the volume of the scraping and the value of the data within it. Even GPT-3.5 can be cost-effective for certain workflows and data values.