1 comments

  • lapnitnelav 1857 days ago
    Looks interesting but I'm struggling to see (at quick glance) what makes it unique / better than the alternatives out there.
    • bobjordan 1856 days ago
      Compared to a mature framework like Scrapy, Transistor is a lot lighter, and its entire codebase is easier to grok, with less magic. I wanted a scraping framework with useful classes/abstractions that I could subclass, override, and customize to my specific needs, and then run tightly integrated with our gevent-based Flask web app.

      Bottom line: Scrapy's codebase is big, and it runs on Twisted, which I'm not familiar with. So I kind of threw my hands in the air on that integration and instead decided to take a few weeks to write my own framework with only what I needed, built on gevent. I learned a lot; overall it was a great exercise, and it will serve us well for a long time.

      The Transistor module itself has ~6,000 LOC, including full support for the Splash headless browser/JavaScript rendering service and the Crawlera service, while the base Scrapy framework repo alone has ~30,000 LOC, with further middleware repos required to integrate Splash/Crawlera.

      That said, the current Transistor implementation doesn't really compare with Scrapy as a crawler: Transistor is like a surgical knife for getting the specific data you are after, while Scrapy is better suited to cataloging and following all the links.

      Where Transistor shines right now is spinning up a few hundred workers, each with a scrape task, with each task being a term (like a part number) which is searched on a website. Transistor gets the job done well in this case.
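
      To make the one-worker-per-term pattern concrete, here is a rough sketch. It is not Transistor's API: it uses the standard library's thread pool as a stand-in for gevent, and `scrape_term`, `run_workers`, and the part numbers are all hypothetical names made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def scrape_term(term):
    """Hypothetical scrape task: look up one search term on a site."""
    # A real task would fetch and parse the site's search results page here.
    return {"term": term, "status": "done"}


def run_workers(terms, max_workers=200):
    """Run one scrape task per term; results come back in input order."""
    if not terms:
        return []
    # Never start more workers than there are terms to search.
    workers = min(max_workers, len(terms))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(scrape_term, terms))


# Made-up part numbers standing in for real search terms.
results = run_workers(["PN-1001", "PN-1002", "PN-1003"])
```

      With gevent the shape would be much the same, with greenlets from a gevent pool in place of threads.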

      • lapnitnelav 1855 days ago
        Hey, thanks for the reply.

        I've done a few things with Scrapy but never really poked under the hood, so I'll take your word for it.

        If I get you right, it's more targeted towards precise extraction than website crawling?

        Last point: why the tight integration with an app? Monolith approach?