As part of my learning in data science, I need/want to gather data. One relatively easy way to do that is web scraping.
However, I'd like to do that in a respectful way. Here are three things I can think of:
1. Identify my bot with a user agent/info URL, and provide a way to contact me.
2. Don't DoS websites with tons of requests.
3. Respect robots.txt.
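Concretely, something like this minimal Python sketch is what I have in mind (the bot name, contact address, and URLs below are placeholders):

    import time
    import urllib.robotparser

    import requests  # third-party: pip install requests

    # Placeholder identity -- replace with your own details (point 1).
    USER_AGENT = "my-data-science-bot/0.1 (+https://example.com/bot; contact: me@example.com)"
    BASE_URL = "https://example.com"

    # Point 3: read robots.txt before fetching anything.
    robots = urllib.robotparser.RobotFileParser()
    robots.set_url(BASE_URL + "/robots.txt")
    robots.read()

    session = requests.Session()
    session.headers.update({"User-Agent": USER_AGENT})  # identify the bot on every request

    for path in ["/page/1", "/page/2"]:  # placeholder paths
        url = BASE_URL + path
        if not robots.can_fetch(USER_AGENT, url):
            continue  # skip anything robots.txt disallows
        response = session.get(url, timeout=30)
        # ... parse response.text here ...
        time.sleep(1.0)  # point 2: at least one second between requests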
What else would be considered good practice when it comes to web scraping?
Next, put yourself in their shoes and realize they don't usually monitor their traffic that much, or simply don't care as long as you don't slow down their site. It's usually only certain big sites with heavy bot traffic, such as LinkedIn or sneaker shoe sites, that implement bot protections. Most others don't care.
Some websites are built almost as if they want to be scraped. The JSON API used by the frontend is ridiculously clean and accessible. Perhaps they benefit when people see their results and invest in their stock. You never fully know whether a site wants to be scraped or not.
The reality of the scraping industry, as it relates to your question, is this:
1. Scraping companies generally don't use a real user agent such as 'my friendly data science bot'; they hide behind a set of fake ones and/or route their traffic through a proxy network (a rough illustration follows this list). You don't want to get banned that easily by revealing your user agent when you know your competitors don't reveal theirs.
2. This one is obvious. The general rule is to scrape continuously over a long time period and add large delays of at least 1 second between requests. If you go below 1 second, be careful.
3. robots.txt is controversial and doesn't serve its original purpose. It might as well be renamed google_instructions.txt, because site owners use it to steer Googlebot around their site. It is generally ignored by the industry, again because you know your competitors ignore it.
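To illustrate point 1 (not an endorsement), a rotating-user-agent setup roughly looks like the following; the user agent strings and proxy address are invented placeholders:

    import random
    import time

    import requests  # pip install requests

    # Invented examples -- real operations maintain much larger pools.
    FAKE_USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
    ]
    PROXIES = {"https": "http://user:pass@proxy.example.net:8080"}  # placeholder proxy

    session = requests.Session()
    session.proxies.update(PROXIES)

    for url in ["https://example.com/item/1", "https://example.com/item/2"]:
        session.headers["User-Agent"] = random.choice(FAKE_USER_AGENTS)
        response = session.get(url, timeout=30)
        time.sleep(1.0 + random.random())  # point 2: stay at or above one second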
Just remember the rule of 'don't piss off the site owner' and then just go ahead and scrape. Also keep in mind that you are in a free country, and we don't discriminate here, whether for racial or gender reasons or over whether you are a biological or mechanical website visitor.
I simply described the reality of the data science industry around scraping, after several years of being in it. Note that this will probably not be liked by the HN audience, as they are mostly website devs and site owners.
Oh, and for 3.: if you can, apply some heuristics to your reading of the robots.txt. If it's just "deny everything", then ignore it, but you really don't want to be responsible for crawling all of the GET /delete/:id pages of a badly-designed site… (those should definitely be POST, and authenticated, by the way).
Anonymity is part of the right to privacy; IMO, such a right should extend to bots as well. There should be no shame in anonymously accessing a website, whether via automated means or otherwise.
No, it very much shouldn't, but (as you probably meant) it should extend to the person (not, eg, company) using a bot, which amounts to the same thing in this case.
If sensitive data can be scraped, it is not really being stored securely. So I would not worry too much about it; I'd just notify the owner if I noticed it.
Totally agree with the rest. Maybe adapt the "large delay" of 1 second to the kind of website I'm scraping, though.
Thanks for your feedback!
If the bots aren't querying from residential IPs, you could match their IPs to ASNs and then filter on that to separate residential and data center origins.
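Something along these lines, for example; this sketch uses the third-party ipwhois package, and the ASN list is purely illustrative:

    from ipwhois import IPWhois  # pip install ipwhois

    # Illustrative data-center ASNs (e.g. Amazon, Google); build a real list
    # from your own traffic or an ASN database instead.
    DATACENTER_ASNS = {"16509", "14618", "15169"}

    def origin_of(ip):
        result = IPWhois(ip).lookup_rdap(depth=0)
        return "datacenter" if result.get("asn") in DATACENTER_ASNS else "other"

    print(origin_of("203.0.113.7"))  # documentation-range example IP

For any real volume you'd want an offline ASN database (e.g. MaxMind's GeoLite2 ASN) rather than one RDAP lookup per IP.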
https://commoncrawl.org/the-data/
Also not web scraping, but here are a few other public data set sources to check:
https://registry.opendata.aws
https://github.com/awesomedata/awesome-public-datasets
Common Crawl is the data set to master if someone wants to use the fruits of web scraping without actually doing the web scraping.
Some other thoughts:
- Find the most minimal, least expensive (for both you and them) way to get the data you're looking for. Sometimes you can iterate through search results pages and get all you need from there in bulk, rather than iterating through detail pages one at a time.
- Even if they don't have an official/documented API, they may very likely have internal JSON routes, or RSS feeds that you can consume directly, which may be easier for them to accommodate.
- Pay attention to response times. If you get your results back in 50 ms, it was probably trivially easy for them, and you can request a bunch without troubling them too much. On the other hand, if responses are taking 5 s to come back, be gentle. If you are using internal undocumented APIs, you may find that you get faster/cheaper cached results if you stick to the same sets of parameters as the site uses on its own (e.g., when the site's front end makes AJAX calls).
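For instance, a small wrapper that scales its pause to the server's response time; the 10x multiplier is just a made-up rule of thumb:

    import time

    import requests  # pip install requests

    session = requests.Session()
    session.headers.update({"User-Agent": "my-data-science-bot/0.1"})  # placeholder

    def polite_get(url):
        """Fetch a URL, then back off in proportion to how slow the response was."""
        response = session.get(url, timeout=30)
        elapsed = response.elapsed.total_seconds()  # how long the request took
        time.sleep(max(1.0, elapsed * 10))  # at least 1 s; much longer if they're slow
        return response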
Look into If-Modified-Since and If-None-Match/Etag headers as well if you are querying resources that support those headers (RSS feeds, for example, commonly support these, and static resources). They prevent the target site from having to send anything other than a 304, saving bandwidth and possibly compute.
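A conditional GET with the requests library looks roughly like this (the feed URL is a placeholder):

    import requests  # pip install requests

    FEED_URL = "https://example.com/feed.xml"  # placeholder RSS feed

    # First fetch: remember the validators the server hands back.
    first = requests.get(FEED_URL, timeout=30)
    etag = first.headers.get("ETag")
    last_modified = first.headers.get("Last-Modified")

    # Later fetch: send them back so the server can answer with a bare 304.
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified

    second = requests.get(FEED_URL, headers=headers, timeout=30)
    if second.status_code == 304:
        print("Not modified -- reuse the copy you already have")
    else:
        print("Changed -- process the new body")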
And costing them some CPU :) It’s probably a good idea in most cases, agreed, but there are exceptions such as if you are requesting resources in already-compressed formats, like most image/video codecs.
1. You're never causing their server to do anything they didn't configure their server to do. Accept headers are merely information for the server telling them what you can accept: what they return to you is their choice, and they can weigh the tradeoffs themselves.
2. The tradeoff you think is happening isn't even happening in a lot of cases. In a lot of cases they'll be serving that up from a cache of some sort so the CPU work has already been done when someone else requested the page. CPU versus bandwidth isn't an inherent tradeoff.
On the servers that have no purpose in life but to handle caching. I'd much rather browsers and scrapers alike hit my Apache Traffic Server instances with requests that need only a Not Modified than waste the app servers' time.
Knowing what rights you have to use material you're scraping early on could guide you towards seeking out alternative sources in some cases, sparing you trouble down the line.
You have a lot of rights and you can do a lot. Understanding those rights and where they end lets you do more, and with confidence.
Tell that to Aaron Swartz.
Sure, if you think of factual information as an abstract concept. But as soon as you put that abstract concept into a concrete representation, that representation is absolutely copyrightable. And when you scrape data you're not scraping abstract information, you're scraping the representation of that information.
Try publishing PDFs of college textbooks online and see how well your "I'm just publishing factual information" argument works.
I'm not saying I agree with the law on this, and I'm also not saying that the way the law was intended should apply to the situation of scraping.
He wasn't downloading (purely) factual information, as I understood it.
> college textbooks
Not even remotely raw factual information. Heck, a table of numbers with a descriptive label probably is copyrightable, but you can scrape the table itself, yes?
I think the issue here is that I assumed a very narrow idea of what people would scrape; it hadn't crossed my mind to download prose or such, which I think is why we're arriving at different conclusions.
respect robots.txt
Find your data from sitemaps and ensure you query at a slow rate. robots.txt can specify a cool-off period (Crawl-delay). See https://en.wikipedia.org/wiki/Robots_exclusion_standard#Craw...
example: https://www.google.com/robots.txt
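A rough sketch of pulling page URLs out of a site's sitemap; the site and agent name are placeholders, and a real crawl would also need to handle sitemap index files:

    import time
    import xml.etree.ElementTree as ET

    import requests  # pip install requests

    BASE = "https://www.example.com"  # placeholder site
    HEADERS = {"User-Agent": "my-data-science-bot/0.1"}  # placeholder identity

    # Sitemap locations are usually advertised in robots.txt as "Sitemap:" lines.
    robots_txt = requests.get(BASE + "/robots.txt", headers=HEADERS, timeout=30).text
    sitemap_urls = [line.split(":", 1)[1].strip()
                    for line in robots_txt.splitlines()
                    if line.lower().startswith("sitemap:")]

    NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    for sitemap_url in sitemap_urls:
        xml_body = requests.get(sitemap_url, headers=HEADERS, timeout=30).text
        root = ET.fromstring(xml_body)
        for loc in root.findall(".//sm:loc", NS):
            print(loc.text)  # the page URLs to crawl, at a slow rate
        time.sleep(1.0)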
For example, the recent case between LinkedIn and hiQ centered on the latter not respecting the former’s terms of service. But even if they had followed that to the T, what hiQ is doing — scraping people’s profiles and snitching to their employer when it looked like they were job hunting — is incredibly unethical.
Invert power structures. Think about how the information you scrape could be misused. Allow people to opt out.
Do you have a link to an article or something?
> HiQ Labs’ business model involves scraping publicly available LinkedIn data to create corporate analytics tools that could determine when employees might leave for another company, or what trainings companies should invest in for their employees.
> Keeper is the first HCM tool to offer predictive attrition insights about an organization's employees based on publicly available data.
Here is why: while it didn't teach me anything new (in a sense), it did give me a vocabulary to better articulate myself. Having new words to describe certain ideas means you have more analytical tools at your disposal. So you'll be able to examine your own ethical stance better.
It takes some time, but instead of watching Netflix (if that's a thing you do), watch this instead! Although, The Good Place is a pretty good Netflix show sprinkling some basic ethics in there.
[1] https://www.youtube.com/watch?v=kBdfcR-8hEY
The cost benefit analysis part reminds me a lot of some of the comments you see here (and elsewhere) with regards to Covid-19 and the economic shutdown of societies. Quite timely.
Obviously, there may be legal repercussions for scraping, and you should follow such laws, but those laws seem absurd to me.
If the bot has a distinct IP (or distinct user agent), then a good setup can handle this situation automatically. If the crawler switches IPs to circumvent a rate limit or for other reasons, then it often causes trouble in the form of tickets and phone calls to the webmasters. Few care about some gigabytes of traffic, but they do care about overtime.
Some react by blocking whole IP ranges. I have seen sites that blocked every request from the network of Deutsche Telekom (Tier 1 / former state monopoly in Germany) for weeks. So you might affect many on your network.
So:
* Most of the time it does not matter if you scrape all information you need in minutes or overnight. For crawl jobs I try to avoid the time of day I assume high traffic to the site. So I would not crawl restaurant sites at lunch time, but 2 a.m. local time should be fine. If the response time goes up suddenly at this time, this can be due to a backup job. Simply wait a bit.
* The software you choose has an impact: if you use Selenium or headless Chrome, you load images and scripts. If you do not need those, analyzing the source (with, for example, Beautiful Soup) draws less of the server's resources and might be much faster.
* Keep track of your requests. A specific file might be linked from a dozen pages of the site you crawl. Download it just once (see the sketch after this list). This can be tricky if a site uses A/B testing for headlines and changes the URL.
* If you provide contact information, read your emails. This sounds silly, but at my previous job we had problems with a friendly crawler with known owners. It tried to crawl our sites once a quarter and was blocked each time, because they did not react to our friendly requests to change their crawling rate.
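For the duplicate-download point above, a minimal approach is a "seen" set keyed on a normalized URL; stripping A/B-test or tracking parameters is left out here:

    from urllib.parse import urldefrag, urljoin

    seen = set()

    def should_fetch(base_url, href):
        """True only the first time we encounter this URL (ignoring #fragments)."""
        absolute = urljoin(base_url, href)          # resolve relative links
        canonical, _fragment = urldefrag(absolute)  # drop the fragment part
        if canonical in seen:
            return False
        seen.add(canonical)
        return True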
Side note: I happen to work on a Python library for a polite crawler. It is about a week away from stable (one important bug fix and a database schema change for a new feature). In case it is helpful: https://github.com/RuedigerVoigt/exoskeleton
Maybe that can help.
If you just want metadata from the homepage of every domain: we scrape that every month at https://host.io and make the data available over our API: https://host.io/docs
I don't know what kind of data you're looking for, but please verify that there isn't a quicker/easier way of getting the data than scraping first.
In the first case the content wasn't clearly licensed and the site was somewhat small, so I didn't want to break it. I emailed them and they gave us permission, but only if we crawled one page every ten seconds. It took us a weekend, but we got all the data and did so in a way that respected their site.
The second one was just this last week and was part of a personal project. All of the content was under an open license (Creative Commons), and the site was hosted on a platform that can take a ton of traffic. For this one I made sure we weren't hitting it too hard (Scrapy has some great autothrottle options), but otherwise didn't worry about it too much.
Since the second project is personal I open sourced the crawler if you're curious- https://github.com/tedivm/scp_crawler
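For reference, the relevant Scrapy settings look like this; the numbers are illustrative, not the values used in the crawler linked above:

    # settings.py snippet for Scrapy's AutoThrottle extension
    AUTOTHROTTLE_ENABLED = True
    AUTOTHROTTLE_START_DELAY = 5.0         # initial delay per request, in seconds
    AUTOTHROTTLE_MAX_DELAY = 60.0          # ceiling when the server slows down
    AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # aim for roughly one request in flight
    DOWNLOAD_DELAY = 1.0                   # baseline delay even when the site is fast
    ROBOTSTXT_OBEY = True                  # Scrapy can also honor robots.txt for you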
Some sites will have rules or guidelines for attribution already in place. For example, the DMOZ had a Required Attribution page to explain how to credit them: https://dmoz-odp.org/docs/en/license.html. Discogs mentions that use of their data also falls under CC0: https://data.discogs.com/. Other sites may have these details in their Terms of Service, About page, or similar.
The most elegant way would be to ask the site provider whether they allow scraping of their website and which rules you should follow. I was surprised how open some providers were, but some don't even bother replying. If they don't reply, apply the rules you've set yourself and follow the obvious ones, like not overloading their service.
Totally agree with the point on accidental personal data, thanks for pointing that out!
PS: they never released their app...
Re 2 and 3: do you parse/respect the "Crawl-delay" robots.txt directive, and do you ensure that works properly across your fleet of crawlers?
However, I do parse and respect "Crawl-delay" now; thanks for pointing it out!
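In Python this is just urllib.robotparser; crawl_delay() returns None when the directive is absent (the site and agent name are placeholders):

    import urllib.robotparser

    USER_AGENT = "my-data-science-bot/0.1"  # placeholder agent name
    robots = urllib.robotparser.RobotFileParser()
    robots.set_url("https://example.com/robots.txt")  # placeholder site
    robots.read()

    delay = robots.crawl_delay(USER_AGENT) or 1.0  # fall back to one second
    print("sleep", delay, "seconds between requests")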
https://tools.ietf.org/html/rfc6585#page-3
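That RFC defines 429 Too Many Requests. A rough sketch of honoring it, assuming Retry-After is given in seconds (it can also be an HTTP date):

    import time

    import requests  # pip install requests

    def get_with_backoff(session, url, max_tries=5):
        """Retry politely when the server answers 429 Too Many Requests."""
        for attempt in range(max_tries):
            response = session.get(url, timeout=30)
            if response.status_code != 429:
                return response
            # Honor Retry-After if present; otherwise back off exponentially.
            wait = int(response.headers.get("Retry-After", 2 ** attempt))
            time.sleep(wait)
        return response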
That's often how site owners get riled up. They search for some unique phrase on Google, and your site shows up in the search results.
If the file type isn't clear, the response headers would still include the Content-Length for non-chunked downloads, and the Content-Disposition header may contain the file name with extension for assets meant to be downloaded rather than displayed on a page. Response headers can be parsed prior to downloading the entire body.
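With requests, stream=True fetches only the headers up front, so you can decide before pulling the body; the URL and size limit below are placeholders:

    import requests  # pip install requests

    URL = "https://example.com/files/report"  # placeholder asset with no extension

    with requests.get(URL, stream=True, timeout=30) as response:
        length = response.headers.get("Content-Length")
        disposition = response.headers.get("Content-Disposition", "")
        content_type = response.headers.get("Content-Type", "")
        print(content_type, length, disposition)

        # Only read the body if it is small enough and something we actually want.
        if length and int(length) < 10_000_000 and "text" in content_type:
            body = response.text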
For example, do you just mean "legal"? Or perhaps, consistent with current industry norms (which probably includes things you'd consider sleazy)? Or not doing anything that would cause offense to site owners (regardless of how unreasonable they may seem)?
I do think it's laudable that you want to do good. Just pointing out that it's not a simple thing.
I work for a company that does a lot of web scraping, but we have a business contract with every company we scrape from.
Certainly, if your goal is “learning in data science”, and thus not tied to a specific subject, there are enough open datasets to work with, for example from https://data.europa.eu/euodp/en/home or https://www.data.gov/
Why are there entities that are allowed to scrape the web however they want (and who got into their position by scraping the web), while the regular Joe is discouraged from doing so?
As I said, in this case “learning data science” likely doesn’t require web scraping; it just requires some suitable data set.
The OP claimed in another comment that that doesn't exist, but (s)he doesn't say what data (s)he's looking for, so that's impossible to check.
Just because something is legal, by absence of law, doesn't mean it's right or fair for all cases. Just because something is illegal (copyright) doesn't mean it's not right or fair for all cases. What if the information saved a million lives? Would it still be ethical to claim "ownership" of that information?
What if the information caused a target audience to visualize that thing over and over again? Is it right to allow that information out into the public at all?
Google 'disable JavaScript in your browser'.