Show HN: Transistor, a Python web scraping framework for intelligent use cases

(github.com)

12 points | by bobjordan 1858 days ago

1 comments

lapnitnelav 1857 days ago
Looks interesting, but I'm struggling to see (at a quick glance) what makes it unique or better than the alternatives out there.
- bobjordan 1856 days ago
  Compared to a mature framework like Scrapy, Transistor is a lot lighter, has less magic, and its entire codebase is easier to grok. I wanted a scraping framework with useful classes/abstractions which I could subclass/override, customize to my specific needs, and then run tightly integrated with our gevent-based Flask web app.
  Bottom line: Scrapy's codebase is very big, and it runs on Twisted, which I'm not familiar with. So I kind of threw my hands in the air on that integration and instead decided to take a few weeks to write my own framework with only what I needed, built on gevent. I learned a lot; overall it was a great exercise, and the result will serve us well for a long time.
  The Transistor module itself has about 6,000 LOC, including full support for the Splash headless browser/JavaScript rendering service and the Crawlera service. The base Scrapy framework repo alone has ~30,000 LOC, with further middleware repos required to integrate Splash/Crawlera.
  That said, the current Transistor implementation doesn't really compare with Scrapy as a crawler. Transistor is like a surgical knife for getting the specific data you are after, while Scrapy can be better suited to cataloging a site by following all the links.
  Where Transistor shines right now is spinning up a few hundred workers, each with a scrape task, where each task is a term (like a part number) to be searched on a website. Transistor gets the job done well in this case.
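  A rough sketch of that one-worker-per-search-term pattern, for illustration only. It is not Transistor's API: it swaps gevent for the standard library's `ThreadPoolExecutor` so it runs without extra installs, and `scrape_term`, `run_workers`, and the part numbers are hypothetical names made up for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def scrape_term(term: str) -> dict:
    """Hypothetical scrape task: handle exactly one search term."""
    # A real task would submit `term` to a site's search form and parse
    # the result page; this stand-in only returns a placeholder record.
    return {"term": term, "status": "done"}


def run_workers(terms: list[str], max_workers: int = 200) -> list[dict]:
    """Run one scrape task per term, capped at `max_workers` concurrent workers."""
    if not terms:
        return []
    # Never start more workers than there are tasks.
    workers = min(max_workers, len(terms))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves the input order of `terms` in its results.
        return list(pool.map(scrape_term, terms))


if __name__ == "__main__":
    for record in run_workers(["PN-1001", "PN-1002", "PN-1003"]):
        print(record)
```

  With gevent, the same shape would presumably use a greenlet pool in place of the thread pool, which is what makes a few hundred I/O-bound workers cheap.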
  - lapnitnelav 1855 days ago
    Hey thanks for the reply.
    I've done a few things with Scrapy but never really poked under the hood, so I'll take your word for it.
    If I get you right, it's more targeted towards precise extraction than website crawling?
    Last point: why the tight integration with an app? Monolith approach?