As part of my learning in data science, I need/want to gather data. One relatively easy way to do that is web scraping.
However, I'd like to do that in a respectful way. Here are three things I can think of:
1. Identify my bot with a user agent/info URL, and provide a way to contact me.
2. Don't DoS websites with tons of requests.
3. Respect robots.txt.
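Concretely, something like this minimal Python sketch is what I have in mind (the bot name, contact address, and URLs below are placeholders):

    import time
    import urllib.robotparser

    import requests  # third-party: pip install requests

    # Placeholder identity -- replace with your own details (point 1).
    USER_AGENT = "my-data-science-bot/0.1 (+https://example.com/bot; contact: me@example.com)"
    BASE_URL = "https://example.com"

    # Point 3: read robots.txt before fetching anything.
    robots = urllib.robotparser.RobotFileParser()
    robots.set_url(BASE_URL + "/robots.txt")
    robots.read()

    session = requests.Session()
    session.headers.update({"User-Agent": USER_AGENT})  # identify the bot on every request

    for path in ["/page/1", "/page/2"]:  # placeholder paths
        url = BASE_URL + path
        if not robots.can_fetch(USER_AGENT, url):
            continue  # skip anything robots.txt disallows
        response = session.get(url, timeout=30)
        # ... parse response.text here ...
        time.sleep(1.0)  # point 2: at least one second between requests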
What else would be considered good practice when it comes to web scraping?
Next, put yourself in their shoes and realize they don't usually monitor their traffic that much, or simply don't care as long as you don't slow down their site. It's usually only certain big sites with heavy bot traffic, such as LinkedIn or sneaker shoe sites, that implement bot protections. Most others don't care.
Some websites are built almost as if they want to be scraped. The JSON API used by the frontend is ridiculously clean and accessible. Perhaps they benefit when people see their results and invest in their stock. You never fully know whether a site wants to be scraped or not.
The reality of the scraping industry, as it relates to your question, is this:
1. Scraping companies generally don't use a real user agent such as 'my friendly data science bot'; they hide behind a set of fake ones and/or route their traffic through a proxy network (a rough illustration follows this list). You don't want to get banned that easily by revealing your user agent when you know your competitors don't reveal theirs.
2. This one is obvious. The general rule is to scrape continuously over a long time period and add large delays of at least 1 second between requests. If you go below 1 second, be careful.
3. robots.txt is controversial and doesn't serve its original purpose. It might as well be renamed google_instructions.txt, because site owners use it to steer Googlebot around their site. It is generally ignored by the industry, again because you know your competitors ignore it.
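To illustrate point 1 (not an endorsement), a rotating-user-agent setup roughly looks like the following; the user agent strings and proxy address are invented placeholders:

    import random
    import time

    import requests  # pip install requests

    # Invented examples -- real operations maintain much larger pools.
    FAKE_USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
    ]
    PROXIES = {"https": "http://user:pass@proxy.example.net:8080"}  # placeholder proxy

    session = requests.Session()
    session.proxies.update(PROXIES)

    for url in ["https://example.com/item/1", "https://example.com/item/2"]:
        session.headers["User-Agent"] = random.choice(FAKE_USER_AGENTS)
        response = session.get(url, timeout=30)
        time.sleep(1.0 + random.random())  # point 2: stay at or above one second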
Just remember the rule of 'don't piss off the site owner' and then just go ahead and scrape. Also keep in mind that you are in a free country, and we don't discriminate here, whether for racial or gender reasons or over whether you are a biological or mechanical website visitor.
I simply described the reality of the data science industry around scraping, after several years of being in it. Note that this will probably not be liked by the HN audience, as they are mostly website devs and site owners.
Oh, and for 3.: if you can, apply some heuristics to your reading of the robots.txt. If it's just "deny everything", then ignore it, but you really don't want to be responsible for crawling all of the GET /delete/:id pages of a badly-designed site… (those should definitely be POST, and authenticated, by the way).
Anonymity is part of the right to privacy; IMO, such a right should extend to bots as well. There should be no shame in anonymously accessing a website, whether via automated means or otherwise.
No, it very much shouldn't, but (as you probably meant) it should extend to the person (not, eg, company) using a bot, which amounts to the same thing in this case.
If sensitive data can be scraped, it is not really being stored securely. So I would not worry too much about it; I'd just notify the owner if I noticed it.
Totally agree with the rest. Maybe adapt the "large delay" of 1 second to the kind of website I'm scraping, though.
Thanks for your feedback!
If the bots aren't querying from residential IPs, you could match their IPs to ASNs and then filter on that to separate residential and data center origins.
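Something along these lines, for example; this sketch uses the third-party ipwhois package, and the ASN list is purely illustrative:

    from ipwhois import IPWhois  # pip install ipwhois

    # Illustrative data-center ASNs (e.g. Amazon, Google); build a real list
    # from your own traffic or an ASN database instead.
    DATACENTER_ASNS = {"16509", "14618", "15169"}

    def origin_of(ip):
        result = IPWhois(ip).lookup_rdap(depth=0)
        return "datacenter" if result.get("asn") in DATACENTER_ASNS else "other"

    print(origin_of("203.0.113.7"))  # documentation-range example IP

For any real volume you'd want an offline ASN database (e.g. MaxMind's GeoLite2 ASN) rather than one RDAP lookup per IP.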
https://commoncrawl.org/the-data/
Also not web scraping, but here are a few other public data set sources to check:
https://registry.opendata.aws
https://github.com/awesomedata/awesome-public-datasets
Common Crawl is the data set to master if someone wants to use the fruits of web scraping without actually doing the web scraping.
Some other thoughts:
- Find the most minimal, least expensive (for both you and them) way to get the data you're looking for. Sometimes you can iterate through search results pages and get all you need from there in bulk, rather than iterating through detail pages one at a time.
- Even if they don't have an official/documented API, they may very likely have internal JSON routes, or RSS feeds that you can consume directly, which may be easier for them to accommodate.
- Pay attention to response times. If you get your results back in 50 ms, it was probably trivially easy for them, and you can request a bunch without troubling them too much. On the other hand, if responses are taking 5 s to come back, be gentle. If you are using internal undocumented APIs, you may find that you get faster/cheaper cached results if you stick to the same sets of parameters as the site uses on its own (e.g., when the site's front end makes AJAX calls).
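For instance, a small wrapper that scales its pause to the server's response time; the 10x multiplier is just a made-up rule of thumb:

    import time

    import requests  # pip install requests

    session = requests.Session()
    session.headers.update({"User-Agent": "my-data-science-bot/0.1"})  # placeholder

    def polite_get(url):
        """Fetch a URL, then back off in proportion to how slow the response was."""
        response = session.get(url, timeout=30)
        elapsed = response.elapsed.total_seconds()  # how long the request took
        time.sleep(max(1.0, elapsed * 10))  # at least 1 s; much longer if they're slow
        return response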
Look into If-Modified-Since and If-None-Match/Etag headers as well if you are querying resources that support those headers (RSS feeds, for example, commonly support these, and static resources). They prevent the target site from having to send anything other than a 304, saving bandwidth and possibly compute.
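A conditional GET with the requests library looks roughly like this (the feed URL is a placeholder):

    import requests  # pip install requests

    FEED_URL = "https://example.com/feed.xml"  # placeholder RSS feed

    # First fetch: remember the validators the server hands back.
    first = requests.get(FEED_URL, timeout=30)
    etag = first.headers.get("ETag")
    last_modified = first.headers.get("Last-Modified")

    # Later fetch: send them back so the server can answer with a bare 304.
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified

    second = requests.get(FEED_URL, headers=headers, timeout=30)
    if second.status_code == 304:
        print("Not modified -- reuse the copy you already have")
    else:
        print("Changed -- process the new body")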
And costing them some CPU :) It’s probably a good idea in most cases, agreed, but there are exceptions such as if you are requesting resources in already-compressed formats, like most image/video codecs.
1. You're never causing their server to do anything they didn't configure their server to do. Accept headers are merely information for the server telling them what you can accept: what they return to you is their choice, and they can weigh the tradeoffs themselves.
2. The tradeoff you think is happening isn't even happening in a lot of cases. In a lot of cases they'll be serving that up from a cache of some sort so the CPU work has already been done when someone else requested the page. CPU versus bandwidth isn't an inherent tradeoff.
On the servers that have no purpose in life but to handle caching. I'd much rather browsers and scrapers alike hit my Apache Traffic Server instances with requests that need only a Not Modified than waste the app servers' time.
Knowing what rights you have to use material you're scraping early on could guide you towards seeking out alternative sources in some cases, sparing you trouble down the line.
You have a lot of rights and you can do a lot. Understanding those rights and where they end lets you do more, and with confidence.
Tell that to Aaron Swartz.
Sure, if you think of factual information as an abstract concept. But as soon as you put that abstract concept into a concrete representation, that representation is absolutely copyrightable. And when you scrape data you're not scraping abstract information, you're scraping the representation of that information.
Try publishing PDFs of college textbooks online and see how well your "I'm just publishing factual information" argument works.
I'm not saying I agree with the law on this, and I'm also not saying that the way the law was intended should apply to the situation of scraping.
He wasn't downloading (purely) factual information, as I understood it.
> college textbooks
Not even remotely raw factual information. Heck, a table of numbers with a descriptive label probably is copyrightable, but you can scrape the table itself, yes?
I think the issue here is that I assumed a very narrow idea of what people would scrape; it hadn't crossed my mind to download prose or such, which I think is why we're arriving at different conclusions.
respect robots.txt
Find your data from sitemaps and ensure you query at a slow rate. robots.txt can specify a cool-off period (Crawl-delay). See https://en.wikipedia.org/wiki/Robots_exclusion_standard#Craw...
example: https://www.google.com/robots.txt
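A rough sketch of pulling page URLs out of a site's sitemap; the site and agent name are placeholders, and a real crawl would also need to handle sitemap index files:

    import time
    import xml.etree.ElementTree as ET

    import requests  # pip install requests

    BASE = "https://www.example.com"  # placeholder site
    HEADERS = {"User-Agent": "my-data-science-bot/0.1"}  # placeholder identity

    # Sitemap locations are usually advertised in robots.txt as "Sitemap:" lines.
    robots_txt = requests.get(BASE + "/robots.txt", headers=HEADERS, timeout=30).text
    sitemap_urls = [line.split(":", 1)[1].strip()
                    for line in robots_txt.splitlines()
                    if line.lower().startswith("sitemap:")]

    NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    for sitemap_url in sitemap_urls:
        xml_body = requests.get(sitemap_url, headers=HEADERS, timeout=30).text
        root = ET.fromstring(xml_body)
        for loc in root.findall(".//sm:loc", NS):
            print(loc.text)  # the page URLs to crawl, at a slow rate
        time.sleep(1.0)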
For example, the recent case between LinkedIn and hiQ centered on the latter not respecting the former’s terms of service. But even if they had followed that to the T, what hiQ is doing — scraping people’s profiles and snitching to their employer when it looked like they were job hunting — is incredibly unethical.
Invert power structures. Think about how the information you scrape could be misused. Allow people to opt out.
Do you have a link to an article or something?
> HiQ Labs’ business model involves scraping publicly available LinkedIn data to create corporate analytics tools that could determine when employees might leave for another company, or what trainings companies should invest in for their employees.
> Keeper is the first HCM tool to offer predictive attrition insights about an organization's employees based on publicly available data.
Here is why: while it didn't teach me anything new (in a sense), it did give me a vocabulary to better articulate myself. Having new words to describe certain ideas means you have more analytical tools at your disposal. So you'll be able to examine your own ethical stance better.
It takes some time, but instead of watching Netflix (if that's a thing you do), watch this instead! Although, The Good Place is a pretty good Netflix show sprinkling some basic ethics in there.
[1] https://www.youtube.com/watch?v=kBdfcR-8hEY
The cost benefit analysis part reminds me a lot of some of the comments you see here (and elsewhere) with regards to Covid-19 and the economic shutdown of societies. Quite timely.
Obviously, there may be legal repercussions for scraping, and you should follow such laws, but those laws seem absurd to me.
If the bot has a distinct IP (or distinct user agent), then a good setup can handle this situation automatically. If the crawler switches IPs to circumvent a rate limit or for other reasons, then it often causes trouble in the form of tickets and phone calls to the webmasters. Few care about some gigabytes of traffic, but they do care about overtime.
Some react by blocking whole IP ranges. I have seen sites that blocked every request from the network of Deutsche Telekom (Tier 1 / former state monopoly in Germany) for weeks. So you might affect many on your network.
So:
* Most of the time it does not matter if you scrape all information you need in minutes or overnight. For crawl jobs I try to avoid the time of day I assume high traffic to the site. So I would not crawl restaurant sites at lunch time, but 2 a.m. local time should be fine. If the response time goes up suddenly at this time, this can be due to a backup job. Simply wait a bit.
* The software you choose has an impact: if you use Selenium or headless Chrome, you load images and scripts. If you do not need those, analyzing the source (with, for example, Beautiful Soup) draws less of the server's resources and might be much faster.
* Keep track of your requests. A specific file might be linked from a dozen pages of the site you crawl. Download it just once (see the sketch after this list). This can be tricky if a site uses A/B testing for headlines and changes the URL.
* If you provide contact information, read your emails. This sounds silly, but at my previous job we had problems with a friendly crawler with known owners. It tried to crawl our sites once a quarter and was blocked each time, because they did not react to our friendly requests to change their crawling rate.
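For the duplicate-download point above, a minimal approach is a "seen" set keyed on a normalized URL; stripping A/B-test or tracking parameters is left out here:

    from urllib.parse import urldefrag, urljoin

    seen = set()

    def should_fetch(base_url, href):
        """True only the first time we encounter this URL (ignoring #fragments)."""
        absolute = urljoin(base_url, href)          # resolve relative links
        canonical, _fragment = urldefrag(absolute)  # drop the fragment part
        if canonical in seen:
            return False
        seen.add(canonical)
        return True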
Side note: I happen to work on a Python library for a polite crawler. It is about a week away from stable (one important bug fix and a database schema change for a new feature). In case it is helpful: https://github.com/RuedigerVoigt/exoskeleton
Maybe that can help.
If you just want metadata from the homepage of every domain: we scrape that every month at https://host.io and make the data available over our API: https://host.io/docs
I don't know what kind of data you're looking for, but please verify that there isn't a quicker/easier way of getting the data than scraping first.
In the first case the content wasn't clearly licensed and the site was somewhat small, so I didn't want to break it. I emailed them and they gave us permission, but only if we crawled one page every ten seconds. It took us a weekend, but we got all the data and did so in a way that respected their site.
The second one was just this last week and was part of a personal project. All of the content was under an open license (Creative Commons), and the site was hosted on a platform that can take a ton of traffic. For this one I made sure we weren't hitting it too hard (Scrapy has some great autothrottle options), but otherwise didn't worry about it too much.
Since the second project is personal I open sourced the crawler if you're curious- https://github.com/tedivm/scp_crawler
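For reference, the relevant Scrapy settings look like this; the numbers are illustrative, not the values used in the crawler linked above:

    # settings.py snippet for Scrapy's AutoThrottle extension
    AUTOTHROTTLE_ENABLED = True
    AUTOTHROTTLE_START_DELAY = 5.0         # initial delay per request, in seconds
    AUTOTHROTTLE_MAX_DELAY = 60.0          # ceiling when the server slows down
    AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # aim for roughly one request in flight
    DOWNLOAD_DELAY = 1.0                   # baseline delay even when the site is fast
    ROBOTSTXT_OBEY = True                  # Scrapy can also honor robots.txt for you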
Some sites will have rules or guidelines for attribution already in place. For example, the DMOZ had a Required Attribution page to explain how to credit them: https://dmoz-odp.org/docs/en/license.html. Discogs mentions that use of their data also falls under CC0: https://data.discogs.com/. Other sites may have these details in their Terms of Service, About page, or similar.
The most elegant way would be to ask the site provider whether they allow scraping of their website and which rules you should follow. I was surprised how open some providers were, but some don't even bother replying. If they don't reply, apply the rules you've set yourself and follow the obvious ones, like not overloading their service.
Totally agree with the point on accidental personal data, thanks for pointing that out!
PS: they never released their app...
Re 2 and 3: do you parse/respect the "Crawl-delay" robots.txt directive, and do you ensure that works properly across your fleet of crawlers?
However, I do parse and respect "Crawl-delay" now; thanks for pointing it out!
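In Python this is just urllib.robotparser; crawl_delay() returns None when the directive is absent (the site and agent name are placeholders):

    import urllib.robotparser

    USER_AGENT = "my-data-science-bot/0.1"  # placeholder agent name
    robots = urllib.robotparser.RobotFileParser()
    robots.set_url("https://example.com/robots.txt")  # placeholder site
    robots.read()

    delay = robots.crawl_delay(USER_AGENT) or 1.0  # fall back to one second
    print("sleep", delay, "seconds between requests")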
https://tools.ietf.org/html/rfc6585#page-3
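That RFC defines 429 Too Many Requests. A rough sketch of honoring it, assuming Retry-After is given in seconds (it can also be an HTTP date):

    import time

    import requests  # pip install requests

    def get_with_backoff(session, url, max_tries=5):
        """Retry politely when the server answers 429 Too Many Requests."""
        for attempt in range(max_tries):
            response = session.get(url, timeout=30)
            if response.status_code != 429:
                return response
            # Honor Retry-After if present; otherwise back off exponentially.
            wait = int(response.headers.get("Retry-After", 2 ** attempt))
            time.sleep(wait)
        return response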
That's often how site owners get riled up. They search for some unique phrase on Google, and your site shows up in the search results.
If the file type isn't clear, the response headers would still include the Content-Length for non-chunked downloads, and the Content-Disposition header may contain the file name with extension for assets meant to be downloaded rather than displayed on a page. Response headers can be parsed prior to downloading the entire body.
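With requests, stream=True fetches only the headers up front, so you can decide before pulling the body; the URL and size limit below are placeholders:

    import requests  # pip install requests

    URL = "https://example.com/files/report"  # placeholder asset with no extension

    with requests.get(URL, stream=True, timeout=30) as response:
        length = response.headers.get("Content-Length")
        disposition = response.headers.get("Content-Disposition", "")
        content_type = response.headers.get("Content-Type", "")
        print(content_type, length, disposition)

        # Only read the body if it is small enough and something we actually want.
        if length and int(length) < 10_000_000 and "text" in content_type:
            body = response.text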
For example, do you just mean "legal"? Or perhaps, consistent with current industry norms (which probably includes things you'd consider sleazy)? Or not doing anything that would cause offense to site owners (regardless of how unreasonable they may seem)?
I do think it's laudable that you want to do good. Just pointing out that it's not a simple thing.
I work for a company that does a lot of web scraping, but we have a business contract with every company we scrape from.
Certainly, if your goal is “learning in data science”, and thus not tied to a specific subject, there are enough open datasets to work with, for example from https://data.europa.eu/euodp/en/home or https://www.data.gov/
Why are there entities that are allowed to scrape the web however they want (and who got into their position by scraping the web), while the regular Joe is discouraged from doing so?
As I said, in this case “learning data science” likely doesn’t require web scraping; it just requires some suitable data set.
The OP claimed in another comment that that doesn't exist, but (s)he doesn't say what data (s)he's looking for, so that's impossible to check.
Just because something is legal, by absence of law, doesn't mean it's right or fair for all cases. Just because something is illegal (copyright) doesn't mean it's not right or fair for all cases. What if the information saved a million lives? Would it still be ethical to claim "ownership" of that information?
What if the information caused a target audience to visualize that thing over and over again? Is it right to allow that information out into the public at all?
Google 'disable JavaScript in your browser'.