Show HN: Web scraping page analyzer

(apify.com)

166 points | by jardah 185 days ago

15 comments

  • tcmb 185 days ago

    I entered a URL and pressed enter, wondering why nothing happened. Only then did I scroll down to find the 'Analyze' button. I wasn't looking for specific attributes, and the strong color contrast of that section made it look like nothing else of interest would come below.

    • jardah 185 days ago

      Oh... Clearly I need to work on my UX skills, I will improve that in the next iteration.

    • jardah 185 days ago

      Just a quick update: Thank you for using it and playing around with it. Looking at the usage and results, I found quite a lot of things to improve. Which is great, since it's hard to develop something like this without real usage data.

      • at_smith 185 days ago

        Awesome tool! How do you handle scraping data that's hiding behind layers of ~fancy~ JS libraries? Is it as simple as triggering click events, pausing for loading, and then grabbing the information?

        • jardah 185 days ago

          This tool basically performs the simplest data loading: it opens the webpage, waits until most XHR requests are done, waits a second (to give JS time to manipulate the DOM) and then loads data from the page. This way, it has what the user sees when they open the page in a browser. So if the data is visible, loaded through XHR, or hidden in a global JS variable, it will see it.

          For more advanced usage (like clicking or submitting a search request) it would need some kind of scenario like: "Click on this" -> "wait till this loads" -> "type something here" -> "scroll to this" -> load data.

          Which is possible with headless chrome, so the trick is to make it general and easy to use (something like recording what user does through chrome plugin). Maybe in future versions :)
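
The scenario idea above can be sketched as a plain list of steps handed to a runner. This is only an illustration and not how the tool works: `Step`, `run_scenario` and the handler names are hypothetical, and the handlers here just record calls where a real version would drive a headless browser.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Step:
    action: str      # e.g. "click", "wait_for", "type", "scroll_to", "extract"
    target: str      # CSS selector the action applies to
    value: str = ""  # optional text, used by "type"


def run_scenario(steps: List[Step],
                 handlers: Dict[str, Callable[[Step], object]]) -> List[object]:
    """Run the steps in order and return what the "extract" steps produced."""
    results = []
    for step in steps:
        if step.action not in handlers:
            raise ValueError(f"unknown action: {step.action}")
        outcome = handlers[step.action](step)
        if step.action == "extract":
            results.append(outcome)
    return results


# Stand-in handler: records each call instead of touching a browser.
log = []


def record(step: Step) -> str:
    log.append((step.action, step.target))
    return f"data from {step.target}"


handlers = {name: record
            for name in ("click", "wait_for", "type", "scroll_to", "extract")}

scenario = [
    Step("click", "#search"),
    Step("wait_for", "#results"),
    Step("type", "input[name=q]", "pizza"),
    Step("scroll_to", "#footer"),
    Step("extract", "#results"),
]
data = run_scenario(scenario, handlers)
```

Keeping the scenario as data rather than code is what would let a recorder (such as the Chrome plugin mentioned above) generate it.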

          • cseelus 185 days ago

            Could be an interesting enhancement. Sounds a little bit like what Capybara, a test framework for Ruby apps, can do [1], things like

              click_link('Link Text')
              fill_in('Password', with: 'Seekrit')
              choose('A Radio Button')
              check('A Checkbox')
              uncheck('Another Checkbox')
              select('Option', from: 'Select Box')
            
            1) https://github.com/teamcapybara/capybara#navigating
            • 185 days ago
              [deleted]
            • razki 185 days ago

              I'd have money on using Horseman with phantomjs in node.

              • bdcravens 185 days ago

                No, it says on the page it's using headless Chrome.

                • oodavid 184 days ago

                  PhantomJS is dead. The arrival of Puppeteer rang the death knell.

              • nreece 185 days ago

                Cool tool!

                * shameless plug * Our little startup, Feedity - https://feedity.com, helps create custom RSS feeds for any webpage, utilizing Chrome for full-rendering and many other tweaks & techniques under the hood for seamless & scalable indexing.

                • cstrat 185 days ago

                  Looks awesome! Does the tool work when trying to access websites behind web application firewalls, e.g. F5 WAF [1]?

                  [1] https://f5.com/glossary/web-application-firewall

                  • jardah 184 days ago

                    Depends on whether we access the website from a proxy that is known to the WAF. But for most websites it's just a single normal request. If it becomes an issue in the future, we could make a browser extension that does the analysis on the page loaded by the user, so that we don't have to use a proxy to connect to it. If you are talking about actually scraping the websites, then that is usually handled case by case. Mostly it works, but sometimes it's a bit harder to get around.

                  • guilamu 185 days ago

                    Not giving me anything useful on this pretty straightforward table:

                    http://www.dsden93.ac-creteil.fr/spip/spip.php?page=ecoles

                    • jardah 185 days ago

                      Yes, that is probably the problem, when I looked for the text it returned:

                      [ 0:{ "selector":".bloc-blanc > p:nth-child(1)" "text":" 0 école(s) correspondent à votre recherche " } ]

                      • jardah 185 days ago

                        Aha! I see, it shows data based on POST request from FORM on this page http://www.dsden93.ac-creteil.fr/spip/spip.php?page=annu1d so if you provide just a link to the results page without the POST data then it will show you nothing. Sadly the tool currently does not allow for sending POST requests to the websites.
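
For illustration, the kind of form POST the results page expects could be built like this. A minimal sketch with Python's standard library: the field name `commune` is invented (the real form fields are not given in this thread), and posting to the results URL is an assumption.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_form_post(url: str, fields: dict) -> Request:
    """Build (but do not send) a POST request carrying URL-encoded form fields."""
    body = urlencode(fields).encode("utf-8")
    return Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


# Hypothetical field name; inspect the form on the search page for the real ones.
request = build_form_post(
    "http://www.dsden93.ac-creteil.fr/spip/spip.php?page=ecoles",
    {"commune": "example"},
)
# Sending it would be: urllib.request.urlopen(request)
```

A plain GET to the same URL carries no form data, which is why the results page reports zero matches.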

                        • guilamu 185 days ago

                          Thanks for your replies, I've successfully been parsing this page with other parsers though.

                          Edit: the page changed and it's not working anymore. Sorry for the false alarm, my bad.

                      • jardah 185 days ago

                        When I open the link in my browser it shows "0 école(s) correspondent à votre recherche" and no table, probably what happens to the analyzer too.

                      • Kikobeats 181 days ago

                        Similar but just for getting normalized metadata: https://microlink.io

                        • jardah 185 days ago

                          I'm still testing it and improving it (there are so many different websites with different responses...), so if you have any comments, I'm looking forward to hearing what you think about it.

                          • JustARandomGuy 185 days ago

                            Suppose I wanted to extract an image that gets loaded async via Javascript (For example, a Pinterest page). How would that work? Looking at your documentation, it looks like I could parse the XHR array you supply. Could you suggest any other ways? I'm calling out Pinterest as an example here because they try to block their images from being easily downloaded, but if you have any other examples I'd like to hear them.

                            It would be great if the page analyzer could supply a list of all the assets loaded with the web page; for example, any asset with a media type of image/* is listed in an images array, and so forth.
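
A rough sketch of the asset-list idea: group the URLs of loaded responses by the top-level part of their media type. The input shape (pairs of URL and Content-Type header) and the name `group_assets` are assumptions, not part of the tool's output format.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_assets(responses: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group asset URLs by top-level media type ("image", "text", ...)."""
    groups = defaultdict(list)
    for url, content_type in responses:
        # Drop parameters such as "; charset=utf-8" and normalise case.
        media_type = content_type.split(";", 1)[0].strip().lower()
        top_level = media_type.split("/", 1)[0] or "unknown"
        groups[top_level].append(url)
    return dict(groups)


assets = group_assets([
    ("https://example.com/a.png", "image/png"),
    ("https://example.com/app.js", "application/javascript; charset=utf-8"),
    ("https://example.com/b.jpg", "IMAGE/JPEG"),
    ("https://example.com/mystery", ""),
])
```

Here `assets["image"]` holds every response served as `image/*`, which is the images array described above.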

                            • jardah 185 days ago

                              Actually the list of assets shouldn't be that hard. Looking at Pinterest, the XHR requests for images are sent immediately when the page is opened, so potentially they are caught in the onRequest function (only right now I'm aborting those requests to save network traffic). I will try it out tomorrow and let you know in a comment.

                              Also, looking at Pinterest, it's server-rendered through ReactJS, so there is an #initial-state script tag with the first few images preloaded as URLs. If you only care about the images at the top without scrolling, this is the safest bet.
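
Reading such an embedded state tag could look roughly like this. A sketch using only Python's standard library; it assumes the script tag contains plain JSON, and the `images` key in the sample is invented for the example, not Pinterest's actual structure.

```python
import json
from html.parser import HTMLParser


class ScriptById(HTMLParser):
    """Collect the text content of the <script> tag with a given id."""

    def __init__(self, script_id: str):
        super().__init__()
        self.script_id = script_id
        self._inside = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("id") == self.script_id:
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.chunks.append(data)


def load_embedded_json(html: str, script_id: str = "initial-state"):
    """Return the parsed JSON from the script tag, or None if it is absent."""
    parser = ScriptById(script_id)
    parser.feed(html)
    parser.close()
    text = "".join(parser.chunks).strip()
    return json.loads(text) if text else None


sample = (
    '<html><body><script id="initial-state" type="application/json">'
    '{"images": ["https://example.com/1.jpg"]}'
    "</script></body></html>"
)
state = load_embedded_json(sample)
```

This works on the raw HTML response, so it needs no JavaScript execution at all.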

                            • _Chief 185 days ago

                              How about caching the default entry (a static URL instead) + attribs, to ease demoing? At the moment it's been analyzing for more than 5 minutes.

                              • jardah 185 days ago

                                Good idea and I would implement that if I used an API from server to get the response. But currently I'm at the same time testing stability of Apify "Actor" solution and proxies, so for my case it's good that there are real requests with real responses, even if it's just from demo.

                                Btw the fact that it's running for 5 minutes is a bug, that I will look at, since there is a timeout of 2 minutes and there are no hanging runs or runs that ended with timeout.

                                • ComputerGuru 185 days ago

                                  You also don’t want to get your server blocked by yelp if they do rate limiting.

                                  • jardah 185 days ago

                                    That's why I'm using proxies: every request is routed through a different proxy address and the application as a whole is rate limited. So hopefully I'm not making too much traffic on Yelp. They are just a perfect example because they use all the types of data I'm looking for. When I find more good examples I will add them and rotate them for every page load.

                                    Btw when it comes to ToS and scraping, this is not much different from accessing their website through a normal browser, only instead of rendered content we show you analyzed data. The page is loaded only once, the same as in a browser.
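
The combination of rotating proxies and a global rate limit can be sketched as below. This is an assumption about the general technique, not the Apify implementation: `RotatingProxyPool` and its parameters are hypothetical, and the clock and sleep functions are injectable so the behaviour can be checked without waiting.

```python
import itertools
import time


class RotatingProxyPool:
    """Hand out proxies round-robin, at most one per `min_interval` seconds."""

    def __init__(self, proxies, min_interval, clock=time.monotonic, sleep=time.sleep):
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._cycle = itertools.cycle(proxies)
        self._min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous hand-out

    def next_proxy(self):
        now = self._clock()
        if self._last is not None:
            wait = self._min_interval - (now - self._last)
            if wait > 0:
                # Application-wide limit: block until the interval has passed.
                self._sleep(wait)
                now = self._clock()
        self._last = now
        return next(self._cycle)
```

A caller would create one shared pool, for example `RotatingProxyPool(["http://proxy-a.example:8000", "http://proxy-b.example:8000"], min_interval=1.0)` with made-up addresses, and call `next_proxy()` before every page load.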

                                    • bpicolo 185 days ago

                                      They have fairly aggressive scraper detection (and this is also against their ToS)

                                • GSGSGS 185 days ago

                                  Are you Jaroslov ? :)

                                  • jardah 185 days ago

                                    Jaroslav, yes, I'm the author. Did you notice any problems or ways how I can improve it?

                                    • GSGSGS 185 days ago

                                      Not that I can see from a surface view. I think the documentation can be improved :). Personally I like the idea of APIFY, saw it a few months ago. Are you guys hiring? :D

                                      • jardah 185 days ago

                                        :D yep the documentation needs a lot of work. It started as a test of an idea, then slowly became a usable tool, and the code was getting incrementally more complex without me even noticing. I only added the readme on GitHub yesterday and there are basically no tests... :(

                                        • jancurn 185 days ago

                                          Yes we are! Please see https://www.apify.com/jobs

                                          • johnnyfived 185 days ago

                                            It's great that you're communicating openly on HN.

                                            I just sent an application for the Junior Web Developer position.

                                            Looking forward to hearing back!

                                    • rmateus 185 days ago

                                      Is it able to deal with digital certificates?

                                  • chadlavi 185 days ago

                                    We use this for some stuff at my office. It's handy.

                                    • dmarlow 185 days ago

                                      Is data shared between accounts if two accounts both want to retrieve information from the same exact URL?

                                      • jardah 185 days ago

                                        Nope, there is no caching now; every run of the tool has a single instance and writes the output into a separate file. I'm using it to test the stability of the cloud when multiple users are using it and to test proxies. It would not be much of a test if one user opened the demo page and then every other user just got the results from a file. But when I'm happy with how it works I will add caching.

                                      • oevi 185 days ago

                                        Nice work! It seems that it only supports microdata and not RDFa at the moment?

                                        • jardah 185 days ago

                                          Yep only microdata. I completely forgot about RDFa. I'm immediately writing RDFa to my todo list. It would be a great addition.

                                          • anomie31 185 days ago

                                            Speaking of which, do you think you could support more ontologies than schema.org? It's easy to use schema.org without understanding the rest of the RDF ecosystem, so I'll elaborate in a minute, but I'm on my phone right now so it's difficult.

                                        • BrandoElFollito 185 days ago

                                          Is there a way to access an authenticated web site?

                                          • jardah 185 days ago

                                            Sadly not for now. Our company has a solution for that (for some websites), but currently this tool does not have this functionality, since I wanted it to be as simple as possible. Maybe in the future.

                                            • lanewinfield 185 days ago

                                              Authentication support on this would make it an instant purchase for me.

                                              • jardah 185 days ago

                                                Some general authentication (like separate input fields for your login credentials on the website) could potentially be done (but it would be very unsafe for the user of the tool, since you would be sending us your credentials as plaintext). But authentication as a whole is sadly not as general as semantic data on the web. Not every website has the same login form (different fields), some use captchas, some use authenticators, and some do robot checking for logins that are too fast.

                                                • lozzo 185 days ago

                                                  Why? And how would you envisage authentication support working? (P.S. awesome tool guy... the clever part is giving users the ability to describe their process via JavaScript. Hence a perfect DSL (domain-specific language).)

                                            • nickthemagicman 185 days ago

                                              Just wanted to say great tool!

                                              • jardah 185 days ago

                                                Thank you, still lots of things to improve (for example 404 handling) but it's great to see positive feedback.

                                              • petagonoral 184 days ago

                                                hmm, no joy getting rating information from an amazon.com product page

                                                • jardah 184 days ago

                                                  Amazon is unfortunately not using any metadata for reviews (probably to prevent easy scraping by competing companies). You can only get it from the HTML (at least from what I can see).

                                                • alexroan 185 days ago

                                                  Love this.