ScrapeGraphAI: Web scraping using LLM and direct graph logic

(scrapegraph-doc.onrender.com)

194 points | by ulrischa 12 days ago

13 comments

  • nodoodles 11 days ago
    What I'd love to see is a scraper builder that uses LLMs/'magic' to generate optimised scraping rules for any page, i.e. CSS selectors and processing rules mapped to output keys, so you can run the scraping itself at low cost and high performance - roughly the sketch below.
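
    Roughly what I have in mind, as a sketch (assuming BeautifulSoup for the cheap part; the selector map is made up - in practice the LLM would emit it once from a sample page):

      from bs4 import BeautifulSoup

      # Selector rules the LLM would be asked to generate once, keyed by output field
      # (these selectors are invented for illustration):
      rules = {
          "headline": "h2.article-title a",
          "price": "span.price",
      }

      def scrape(html: str, rules: dict[str, str]) -> dict[str, str | None]:
          # The cheap, repeatable part: plain parsing with the generated rules.
          soup = BeautifulSoup(html, "html.parser")
          out = {}
          for key, selector in rules.items():
              node = soup.select_one(selector)
              out[key] = node.get_text(strip=True) if node else None
          return out
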
    • jumploops 11 days ago
      Agreed!

      Apify's Website Content Crawler[0] does a decent job of this for most websites in my experience. It allows you to "extract" content via different built-in methods (e.g. Extractus [1]).

      We currently use this at Magic Loops[2] and it works _most_ of the time.

      The long-tail is difficult though, and it's not uncommon for users to back out to raw HTML, and then have our tool write some custom logic to parse the content they want from the scraped results (fun fact: before GPT-4 Turbo, the HTML page was often too large for the context window... and sometimes it still is!).

      Would love a dedicated tool for this. I know the folks at Reworkd[3] are working on something similar, but not sure how much is public yet.

      [0] https://apify.com/apify/website-content-crawler

      [1] https://github.com/extractus/article-extractor

      [2] https://magicloops.dev/

      [3] https://reworkd.ai/

    • KhoomeiK 11 days ago
      This is essentially what we're building at https://reworkd.ai (YC S23). We had thousands of users try using AgentGPT (our previous product) for scraping and we learned that using LLMs for web data extraction fundamentally does not work unless you generate code.
      • nodoodles 11 days ago
        Awesome to hear! Looking forward to a launch -- the waitlist form was too long to complete; I'd need another LLM to fill that in :)
      • spxneo 11 days ago
        All-around automation with an LLM thrown on top of it sucks.

        The statistics are not in its favour.

        • visarga 11 days ago
          Code is also hard. You have to generate code that accounts for all possible exceptions and errors. If you want to automate a UI, for example, pushing a button can cause all sorts of feedback, errors, and consequences that need to be known to write the code.
        • KhoomeiK 11 days ago
          Yep, that's true until you generate code. It's harder from a technical POV, but you can get way higher performance and reliability.
    • longgui0318 11 days ago
      Here's a project that uses an LLM to generate crawling rules and then crawl with them, but it looks like it's still in the early stages of research.

      https://github.com/EZ-hwh/AutoCrawler

      • nodoodles 11 days ago
        Thanks, will look into it, looks promising
    • nikcub 11 days ago
      Most of the top LLMs already do this very well, because they've been trained on web data and because they're being used for precisely this task internally to grab data.

      The complicated operational side of scraping is running headless browsers, managing IP ranges, bypassing bot detection, solving captchas, observability, updating selectors, etc. There are a ton of SaaS services that do that part for you.

      • nodoodles 11 days ago
        Agreed there are several complexities, but I'm not sure which ‘this’ you mean - updating selectors specifically is one of the areas I had in mind earlier.
      • selimthegrim 11 days ago
        There was one I remember out of UF/FSU called Intoli that seems to have pivoted into consulting.
    • greggsy 11 days ago
      It also seems obvious that one would want to simply drag a box around the content you want, and have the tool provide some examples to help you refine the rule set.

      Ad blockers have had something very close to this for some time, without any sparkly AI buttons.

      I’m sure someone is working on a subscription-based service using corporate models in the backend, but it's something that could easily be implemented with a very small model.

      • uptown 11 days ago
        Mozenda does something like that. I haven't used it in many years, so I'm not up to date on what it currently offers.
    • geuis 11 days ago
      That's an interesting take. I've been experimenting with reducing the overall rendered HTML to just structure and content, and using the LLM to extract content from that. It works quite well. But I think your approach might be more efficient and faster.
      • nodoodles 11 days ago
        One fun mechanism I've been using for reducing HTML size is diffing (with some leniency) pages from the same domain to exclude common parts (i.e. headers/footers). That preprocessing can be useful for any parsing mechanism.
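
        A sketch of that diff step (Python difflib; the "leniency" and block-level granularity are left out for brevity):

          import difflib

          def strip_common_parts(page_a: str, page_b: str) -> str:
              """Keep only the lines of page_a that don't also appear in page_b,
              which roughly drops shared headers/footers/nav."""
              a_lines = page_a.splitlines()
              b_lines = page_b.splitlines()
              matcher = difflib.SequenceMatcher(None, a_lines, b_lines, autojunk=False)
              unique = []
              for tag, i1, i2, _, _ in matcher.get_opcodes():
                  if tag in ("replace", "delete"):  # present in page_a but not in page_b
                      unique.extend(a_lines[i1:i2])
              return "\n".join(unique)
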
    • cpobuda 11 days ago
      I have been working on this. Feel free to DM me.
    • wraptile 11 days ago
      Parsing HTML is a solved and, frankly, not very interesting problem. Writing XPath/CSS selectors or JSON parsers (for when data is in script variables) is not much of a challenge for anyone.

      The more interesting issue is being able to parse data from the whole page content stack, which includes XHRs and their triggers. In this case an LLM driver would control an indistinguishable web browser to perform all the steps to retrieve the data as a full package (a rough sketch of the capture part is below). Though this is still a low-value proposition, as the models get fumbled by harder tasks and the easier tasks can be done by a human in a couple of hours.

      LLM use in web scraping is still purely educational and assistive, as the biggest problem in scraping is not scraping itself but scraper scaling and blocking, which is becoming extremely common.
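
      For the browser-driven capture part, a minimal sketch assuming Playwright; the URL, the XHR filter, and the complete absence of stealth/anti-blocking handling are all simplifications:

        from playwright.sync_api import sync_playwright

        captured = []  # XHR/fetch responses collected alongside the page

        def on_response(response):
            if response.request.resource_type in ("xhr", "fetch"):
                try:
                    captured.append({"url": response.url, "body": response.text()})
                except Exception:
                    pass  # some bodies (e.g. redirects) aren't retrievable

        with sync_playwright() as p:
            browser = p.chromium.launch(headless=True)
            page = browser.new_page()
            page.on("response", on_response)
            page.goto("https://example.com", wait_until="networkidle")
            html = page.content()  # rendered DOM, handed off together with `captured`
            browser.close()
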

      • illegally 7 days ago
        Exactly - are you aware of any current efforts by people trying to do that?
        • wraptile 6 days ago
          Not anything in open source yet.
  • mariopt 11 days ago
    What is the point of using LLMs for the scraping itself instead of using them to generate the boring code for mimicking HTTP requests, CSS/XPath selectors, etc.?

    I get that it may be interesting for small tasks combined with a browser extension, but for real scraping it just seems overkill and expensive.

    • geuis 11 days ago
      It is potentially expensive, but here's a different take.

      Instead of writing a bunch of selectors that break often, imagine just being able to write a paragraph telling the LLM to fetch the top 10 headlines and their links on a news site, or to fetch the images, titles, and prices off a storefront.

      It abstracts away a lot of manual, fragile work - something like the sketch below.
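
      A minimal sketch of that kind of call, assuming the OpenAI Python client and HTML that has already been reduced to fit the context window (model name and prompt are illustrative):

        import json
        from openai import OpenAI

        client = OpenAI()  # reads OPENAI_API_KEY from the environment

        def extract_headlines(reduced_html: str) -> list[dict]:
            response = client.chat.completions.create(
                model="gpt-4-turbo",
                response_format={"type": "json_object"},
                messages=[
                    {"role": "system",
                     "content": "You extract data from HTML and reply only with JSON."},
                    {"role": "user",
                     "content": "Return the top 10 headlines and their links as "
                                '{"items": [{"headline": ..., "url": ...}]}\n\n'
                                + reduced_html},
                ],
            )
            return json.loads(response.choices[0].message.content)["items"]
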

      • mariopt 11 days ago
        I get that and LLMs are expected to get better.

        Today, would you build a scraper with current LLMs that randomly hallucinate? I wouldn't.

        The idea of an LLM-powered scraper adapting the selectors every time the website owner updates the site is pretty cool.

        • ewild 11 days ago
          At my job we are scraping using LLMs, for a 10M sector of the company. GPT-4 Turbo has never hallucinated, not once out of 1.5 million API requests. However, we use it to parse and interpret data from webpages, which is something you wouldn't be able to do with a regular scraper - not well, at least.
          • what 11 days ago
            Bold claim, did you review all 1.5 million requests?
          • bryanrasmussen 11 days ago
            I guess the claim is based on statistical sampling at a reasonably high level, to be sure that if there were hallucinations you would catch them? Or is there something else you're doing?

            Do you have any workflow tools etc. to find hallucinations? I've got a project in the backlog to build that kind of thing and would be interested in how you sort good results from bad.

            • ewild 11 days ago
              In this case we had 1.5 million ground truths for testing purposes. We have now run it over 10 million, but I didn't want to claim 0 hallucinations on those, as technically we can't say for sure. Considering the hallucination rate was 0% for the 1.5 million compared against ground truths, though, I'm fairly confident.
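
              A minimal sketch of that kind of check - exact match per field is an illustrative simplification:

                def hallucination_rate(extractions, ground_truths):
                    """Fraction of ground-truth fields where the model returned
                    a value that contradicts the known answer."""
                    invented, total = 0, 0
                    for extracted, truth in zip(extractions, ground_truths):
                        for field, expected in truth.items():
                            total += 1
                            got = extracted.get(field)
                            # a missing field is a miss, not a hallucination
                            if got is not None and got != expected:
                                invented += 1
                    return invented / total if total else 0.0
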
          • krainboltgreene 11 days ago
            How do you know that's true?
            • ewild 11 days ago
              The 1.5 million was our test set. We had 1.5 million ground truths, and it didn't make up fake data for a single one.
              • krainboltgreene 9 days ago
                That's not what I asked. I asked "How did you determine that it didn't make up/get information wrong for all 1.5m?"
      • is_true 11 days ago
        I've written thousands of scrapers and trust me, they don't break often.
        • infecto 11 days ago
          Me too, but with adversaries that obfuscate and change their site often to prevent scraping, it can happen. It depends on what you are looking at.
          • is_true 10 days ago
            Well-written scrapers should be able to cope with site changes.
      • suchintan 11 days ago
        https://github.com/Skyvern-AI/skyvern

        This is pretty much what we're building at Skyvern. The only problem is that inference cost is still a little too high for scraping, but we expect that to change in the next year.

  • nextworddev 11 days ago
    Would be nice if the docs had a comparison between traditional scraping (e.g. using headless browsers, BeautifulSoup, etc.) and this approach. Exactly how is the AI used?
    • geuis 11 days ago
      A lot of the larger LLMs have been trained on millions of pages of HTML. They have the ability to understand raw HTML structure and extract content from it. I've been having some success with this using Mixtral 8x7B.
  • nurettin 11 days ago
    A lot of websites (like online shops) won't let you scrape for long unless you are logged in. Some (like real estate sites) won't tolerate you for long even if you are logged in. Some (like newspapers) won't accept a simple request; they will try to detect browser and user behavior. Some will even detect data-center IP blocks to get rid of you.

    I don't believe scraping is such a solved problem that you can slap AI and some cute vector spiders on it and claim that everything works.

  • simonw 12 days ago
    Typo on your homepage: "You just have to implment just some lines of code and the work is done"
  • ushakov 11 days ago
    There’s also llm-scraper in TypeScript

    https://github.com/mishushakov/llm-scraper

  • holoduke 11 days ago
    Writing a few selector queries is probably easier than having an LLM output the selector queries by feeding it the webpage and the desired output. I do scraping for a living. CasperJS/PhantomJS is still my best friend for scraping dynamic websites. Selector queries are the least of the problems.
  • RamblingCTO 11 days ago
    I don't see the benefit. I don't want to say what I built 2markdown.com with (it converts websites to markdown for LLMs), but it has pretty decent performance without any high-cost (and sometimes erroneous) LLMs thrown on top of the scraping.
    • dev213 11 days ago
      The "Integrates Easily" links to Langchain and Pipedream are flipped on your website.
  • sethx 12 days ago
    At jobstash.xyz we have similar tech as part of our generalized scraping infra, and it’s been live for half a year performing optimally.
  • maxrmk 11 days ago
    I'd love to try the demo but there's no way I'm putting my openai key into that site.
  • spaniard89277 11 days ago
    This is completely unrealistic unless you want to burn money.
    • msp26 11 days ago
      In practice it's actually very good for lower volume tasks with non-fixed sources.

      I haven't tried this library but I do use an LLM based scraper in addition to more traditional ones.

    • infecto 11 days ago
      I have not used this specific library, but it's far from unrealistic and hardly a money pit. An LLM can fit in nicely with scraping libraries. Sure, if you are crawling the web like Google it makes no sense, but if you have a hit list, this can be a cost-effective way to avoid spending engineering hours maintaining the crawler.
      • spaniard89277 11 days ago
        Which LLM do you use? Because I can't see a scraper running daily without being very expensive.
        • a_wild_dandan 11 days ago
          Llama-3 70B on my local MacBook works wonderfully for these tasks.
          • spaniard89277 11 days ago
            How's the pipeline? Do you pass all the HTML to the LLM? Isn't the context window a problem?
            • a_wild_dandan 11 days ago
              There are phenomenal web scraping tools to crudely "preprocess" the document a bit, slashing outer HTML fluff while preserving the small subset of actual data. From there, 8k tokens (or whatever) goes really far.
              • LunaSea 11 days ago
                At a very generous 50 tokens per second, doesn't that still leave you with more than two and a half minutes (160 s) of processing time per document?
        • mrbungie 11 days ago
          GPT-3.5/GPT-4 ain't the only LLMs available. Flan-T5/T5 or Llama 2/3 8B models can be fine-tuned for this use case and run much more cheaply.
          • spaniard89277 11 days ago
            How do you handle the context window limit? If you push the entire DOM to the LLM it will exceed the context window by far in most cases, won't it?
            • aleksiy123 11 days ago
              My guess is you do some preprocessing on the DOM to get it down to text that still retains some structure.

              Something like https://github.com/Alir3z4/html2text.

              I'm sure there are other (better?) options as well.

            • msp26 11 days ago
              Trim unwanted HTML elements + convert to markdown. It significantly reduces token counts while retaining structure - something like the sketch below.
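
              A minimal sketch of that step, assuming BeautifulSoup plus html2text; the list of tags to drop is an illustrative choice:

                from bs4 import BeautifulSoup
                import html2text

                def html_to_compact_markdown(html: str) -> str:
                    soup = BeautifulSoup(html, "html.parser")
                    # Drop tags that add tokens but rarely carry the target data.
                    for tag in soup(["script", "style", "noscript", "svg", "iframe", "nav", "footer"]):
                        tag.decompose()
                    converter = html2text.HTML2Text()
                    converter.ignore_images = True
                    converter.body_width = 0  # don't hard-wrap lines
                    return converter.handle(str(soup))
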
        • infecto 11 days ago
          Again, it depends on the volume of the scraping and the value of the data within it. Even GPT-3.5 can be cost-effective for certain workflows and data values.