Compared to a mature framework like Scrapy, Transistor is a lot lighter, its entire codebase is easier to grok, and it has less magic. I wanted a scraping framework with useful classes/abstractions which I could subclass/override, customize to my specific needs, and then run tightly integrated with our gevent-based Flask web app.
Bottom line: Scrapy's codebase is big and it runs on Twisted, which I'm not familiar with. So I kind of threw my hands in the air on that integration and instead decided to take a few weeks to write my own framework with only what I needed, built on gevent. I learned a lot; overall it was a great exercise and it will serve us well for a long time.
The Transistor module itself has about 6,000 LOC, including full support for the Splash headless browser/JavaScript rendering service and the Crawlera service, while the base Scrapy framework repo alone has about 30,000 LOC, with further middleware repos required to integrate Splash/Crawlera.
That said, the current Transistor implementation doesn't really compare with Scrapy as a crawler. Transistor is like a surgical knife for getting the specific data you are after, while Scrapy can be better suited to cataloging a site by following all the links.
Where Transistor shines right now is spinning up a few hundred workers, each with a scrape task, with each task being a term (like a part number) which is searched on a website. Transistor gets the job done well in this case.
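To give a feel for that "one worker per search term" pattern, here is a rough sketch. It is not Transistor's actual API: `scrape_term` and `run_workers` are hypothetical names, the task body is a stub rather than a real site search, and I've swapped in the standard library's thread pool in place of gevent so it runs without extra dependencies.

```python
from concurrent.futures import ThreadPoolExecutor


def scrape_term(term):
    """Hypothetical stand-in for one scrape task: look up a single term."""
    # A real task would send the search request and parse the result page.
    # Here we only pretend that terms starting with "PN-" are found.
    return {"term": term, "found": term.startswith("PN-")}


def run_workers(terms, max_workers=200):
    """Run one scrape task per term and return results in input order."""
    if not terms:
        return []
    # Never start more workers than there are terms to search.
    workers = min(max_workers, len(terms))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(scrape_term, terms))


if __name__ == "__main__":
    # Assumed input: a few hundred part numbers, one task each.
    part_numbers = ["PN-%04d" % i for i in range(300)]
    results = run_workers(part_numbers)
    print(len(results), results[0])
```

With gevent the shape would be similar (a pool of greenlets mapped over the terms), but the details of how Transistor wires that up are not shown here.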
I've done a few things with Scrapy but never really poked under the hood, so I'll take your word for it.
If I get you right, it's more targeted towards precise extraction than website crawling?
Last point: why the tight integration with an app? Monolith approach?