Show HN: Apify SDK – A scalable web crawling and scraping library for JavaScript

(github.com)

78 points | by jancurn 2033 days ago

6 comments

jancurn 2033 days ago
Hey guys, today we’re showing HN a new open-source library that we have been working on for almost a year. It incorporates lessons learned from scraping thousands of websites over the last 4 years. We figured there was no such universal library for JavaScript, while Python, for example, has one (https://scrapy.org/). That wasn’t fair, because JavaScript is THE language of the web :)
Anyway, we hope you’ll give it a shot, and we’re really looking forward to hearing what you think about it. All feedback welcome!
rajangdavis 2032 days ago
I wish I could upvote this more. This solves a huge problem for me, and I will definitely be taking a peek at this over the weekend.
Thank you so much for making and sharing this!
darekkay 2033 days ago
Thanks, this looks solid, with really extensive documentation. I will give it a try for my next crawling/bot project :)
- jancurn 2033 days ago
  Awesome, looking forward to hearing what you think :)
raitom 2032 days ago
This comes just in time, as I needed to replace an old scraper!
Does it have to run on an instance, or can we also use a serverless environment?
- jancurn 2032 days ago
  The SDK runs anywhere you have Node running. And if you can run headless Chrome with Puppeteer there too, then you can use it in the SDK as well. This might require several libraries and configuration settings. If I’m not mistaken, Google Cloud Functions supports Puppeteer by default, while AWS Lambda does not. With any Docker-based serverless platform such as Zeit Now or Apify Cloud, you just need to use the right Docker image.
pdxandi 2032 days ago
I'm a huge fan of Apify and look forward to exploring this new SDK. Thanks y'all.
- jancurn 2031 days ago
  Thank you so much!