Can anyone who knows the law please guide me on this issue? Note that the concern is less about what’s ethical and more about what’s legal. This will also help me in my research because these days some reviewers are raising this concern when they see authors used web scraped data. Online there are a ton of opinion pieces but nobody is clear on the legal side of it. Mostly people oppose scraping because they think it’s unethical.
https://www.eff.org/cases/hiq-v-linkedin
Basically: if it's publicly visible, you can scrape it.
Caveat: the case is still making its way to the Supreme Court.
Edit: There's also Sandvig v. Sessions, which establishes that scraping publicly available data isn't a computer crime:
https://www.eff.org/deeplinks/2018/04/dc-court-accessing-pub...
Edit2: Two extra common sense caveats:
- Don't hammer the site you're scraping, which is to say don't make it look like you're doing a denial of service attack.
- Don't sell or publish the data wholesale, as is -- that's basically guaranteed to attract copyright infringement lawsuits. Consume it, transform it, use it as training data, etc. instead.
- Make sure your scraper has both a reasonable delay (one request per second or slower) and proper backoff. If you start getting errors, back off. We never cared about scrapers until we noticed them, and we only noticed them if they hit us too hard, we told them to back off, and then they didn't.
- Look deep for an API. A ton of people would scrape reddit without realizing we had (an admittedly poorly marketed) API for doing just that.
- Respect robots.txt. That was another way to get noticed quickly -- hitting forbidden URLs. If you hit a forbidden URL too often, you'd start getting 500 errors, and if you didn't back off, you'd get banned from using the site. It was an easy way to tell if someone was not a well behaved scraper.
Response code 429 is your friend: https://en.wikipedia.org/wiki/List_of_HTTP_status_codes#429
And if you want to get really pedantic, 429 didn't exist when we did this. It wasn't approved until April 2012 and the first patches for it didn't show up until around 2014. We could have monkey patched if we really wanted to, but we didn't really want to.
[1] - https://intoli.com/blog/analyzing-one-million-robots-txt-fil...
[2] - https://blog.archive.org/2017/04/17/robots-txt-meant-for-sea...
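Put together, the etiquette above (a fixed delay, backing off on errors, treating 429 as a signal to slow down) looks roughly like this in Python. This is a minimal sketch, not production code; the retry limits and delay values are illustrative:

```python
import time
import urllib.error
import urllib.request

BASE_DELAY = 1.0  # one request per second or slower, as suggested above


def backoff_delay(attempt, base=BASE_DELAY, cap=60.0):
    """Exponential backoff: 1s, 2s, 4s, ... capped at `cap` seconds."""
    return min(base * (2 ** attempt), cap)


def polite_get(url, max_attempts=5):
    """Fetch a URL, pacing requests and backing off on 429/5xx responses."""
    for attempt in range(max_attempts):
        try:
            with urllib.request.urlopen(url) as resp:
                time.sleep(BASE_DELAY)  # pace even successful requests
                return resp.read()
        except urllib.error.HTTPError as e:
            if e.code == 429 or e.code >= 500:
                # The server is signaling distress: wait longer each time.
                time.sleep(backoff_delay(attempt))
            else:
                raise
    raise RuntimeError(f"giving up on {url} after {max_attempts} attempts")
```

The exact numbers matter less than the shape: a floor on inter-request delay, and a retry schedule that grows when the server pushes back.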
…in the USA (possibly even "in DC"). Also, "isn't a computer crime" doesn't imply "isn't a crime". Copyright law likely still applies, for example.
I'm not a lawyer but I did receive a C&D from a Fortune 100 that ultimately shut my project down. I was not selling or exposing any data directly -- it was purely consumed on the back end.
I was not hammering their site, but aggregating and caching requests such that people who used my project ultimately had orders-of-magnitude lower impact than they would've had otherwise.
The data we were sampling was fundamentally non-copyrightable in the US per Feist v. Rural Telephone: just a compendium of places, dates, and times (in the EU, by contrast, raw data without substantial creative components can still be protected). But because it was on their servers, and because we had to extract it from an HTML page that constituted a creative work, the CFAA and the Copyright Act were against us.
I talked to many different lawyers, including lawyers who had successfully defended companies from scraping-related lawsuits, and they all told me, unanimously, that it was hopeless. The law and the legal precedent are 100% in favor of the site being scraped. Essentially, it may not be illegal until they tell you to stop, but after that, it's unquestionably illegal. There is no public right-of-way on the internet.
My case is by no means unusual; it happens to several small companies on a daily basis, and it's a critical component in the ability of BigTechCos to maintain their walled gardens and effectively use legal mechanisms to route around the web's inherent distributed properties. All this "decentralized internet" stuff misses the point that the decentralization is not a technical problem, but a legal and social one.
Eric Goldman's blog [0] is a great resource that has consistently followed law related to scrapers for several years. He discusses hiQ v. LinkedIn at [1].
----
The applicable federal statutes, which are primarily the CFAA and the Copyright Act, don't leave much wiggle room at all on this topic, and neither does the overwhelming majority of case law. Precedents established in the 80s like MAI v. Peak have been consistently misapplied to screen scraping.
There are two particularly onerous prongs of the law here: first, the CFAA's "authorized access" stipulations, and second, interpretations of the Copyright Act holding that RAM copies of data are sufficiently tangible to be potentially infringing.
The CFAA makes it both a crime and a tort to ever access a server in a manner that "exceeds authorized access" -- essentially, as soon as the company indicates that they don't want you to talk to them, if you talk to them again, you're dead meat (craigslist v. 3taps among others).
Most companies include boilerplate in their Terms of Service saying the site cannot be accessed by any automated means. They generally argue, successfully, that you were thereby on notice regarding the extent of your authorized access as soon as you did anything that constitutes enactment of that contract -- which generally means accessing anything beyond the front page of the site ("clickwrap" or "linkwrap"), and almost certainly means anything that involves logging in, submitting forms, etc.
Re: the Copyright Act -- until it's modified to clarify that RAM copies are not independent infringements and to enshrine the rights of users to extract their own copyrighted content from another's copyrighted wrapper, it's going to be a potential infringement every time your software downloads someone's page. The real-world analog of the "RAM Copy doctrine", as it's called, would be that every time your eye reflects the image of a copyrighted work into your brain, you've made a new infringing copy. When it gets to court, that's what scrapers deal with -- and they almost always lose.
On the API front you may be able to argue that a simple JSON structure isn't sufficiently creative to qualify for copyright protection, but that would be blazing a new trail (and still leaves the CFAA to worry about). For something as complex as the JavaScript and HTML you get from $ANYWEBSITE.com, just loading it on an unapproved device is probably an infringement. That each digital load/transform is a potential infringement is how you hear about millions of infringements in file sharing cases, etc. -- the claim is that each time you copied that data from your hard drive into your RAM, it was a new, independent infringing copy.
Seriously, sit down and read the law, and then read the dozens of cases where this has been litigated previously. HiQ v. LinkedIn is a very limited anomaly in this pantheon, still very early in the cycle, and NO ONE should be taking it as a guiding star, at least not until it hits the Supreme Court and they come down reversing all the old precedent around this.
If you are going to build a business that depends on scraping, ONLY do so with the backing of mega-well-funded VCs, etc., who are able and willing to take on the powerful lobbies, and who are funding your company at least as much for its potential to break legal precedent as for its commercial viability.
Final note: expect no help from FAANG et al on this. Without the CFAA, their walled gardens are dead in the water. It is a critical tool used by MegaCos to retain their digital monopolies. "Network effect" means something, but it's only strangling the web to death because there are $1000/hr law firms enforcing it behind the scenes. Without that, we'd have had automatic multiplexed Twitter/G+/FB streams a long time ago. They shut down aggregators because they need to control the direct interface to the user -- if they're relegated to a backend data provider by someone with a better user experience, they're very vulnerable. This realization is what motivated Craigslist's rapid reversal on scraper-friendliness, sunk 3taps, and has been the death of many potentially innovative early-stage companies.
-----
tl;dr The long and short of it is that until Congress passes revisions to the CFAA and the Copyright Act and/or until the Supreme Court comes down with a wide-ranging ironclad reversal of the last 30 years of case law on this topic, it's going to be perilous for anyone whose business depends on scraping.
And all this is at the federal level -- many states have enacted similar statutes so they can get in on the "put hackers in jail" action, and these battles will have to be fought at the state level too.
[0] https://blog.ericgoldman.org/ [1] https://blog.ericgoldman.org/archives/2017/08/linkedin-enjoi...
How long ago was this? It seems like the courts have shifted their position on this over time and only very recently (as in the last year) have they started to take a more permissive stance on scraping.
The paper linked elsewhere in this thread does a great job of summarizing the trend: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3221625
We're in a good spot socially right now, as the tech behemoths are no longer perceived as plucky upstarts and quirky computer whizzes, but instead as creepy 1984-ish overlords. So I think the stage is set for upheaval -- maybe even some Congressional action if someone can tie this to the "deplatforming" thing that has Republicans fired up -- but we're a ways out yet, especially if we're just going to be crossing our fingers for a favorable SCOTUS ruling.
Compare the Aereo case at [0] for what is perhaps a counter-intuitive philosophical divide: the conservative side of the Court dissented from the majority in holding that Aereo should've been in the clear.
[0] https://en.wikipedia.org/wiki/American_Broadcasting_Cos.,_In....
Perhaps aggregation apps should have the client do the scraping, rather than being entirely dependent on server side scraping?
Regardless, the state of copyright and IP law in the US is abysmal. We can't trust these companies (FAANG) to keep their own press releases online for a decade, how can we let them monopolize ideas (which they fail to fully flesh out) and content? They have been shown to be inept stewards to their own content :c
Unfortunately, this is where the RAM Copy Doctrine gets us into trouble. It is not only illegal to "exceed authorized access" to a networked computer, the precedent currently considers loading any copyrighted work into RAM potentially infringing, e.g., if the rightsholder says you're not allowed to use their copyrighted work in that way, you have to present a viable fair use defense.
afaik, no one has brought suit against things like client-side adblockers and browser extensions that modify a page, but if anyone did, the plaintiffs would be likely to prevail under current precedent.
We really need true legal protection for users to select their own user agents and to be free to access information willfully transmitted to them in the way they like, especially in the case of something like Facebook/Twitter, where the site itself is just a wrapper around other peoples' copyrighted content.
That will only happen if someone can convince enough Congresscritters to carve out an exception in the actual law, rather than relying on long-outmoded pre-internet judicial interpretations.
Power Ventures scoped down to extract only your own data out of Facebook and they still ended up owing $3M in damages.
See Ticketmaster v RMG at https://en.wikipedia.org/wiki/Ticketmaster,_LLC_v._RMG_Techn.... , where the argument that alternative user agents should be allowed was shot down. I discussed at some length here: https://news.ycombinator.com/item?id=12352450
> It is generally impermissible to enter into a private home without permission in any circumstances. By contrast, it is presumptively not trespassing to open the unlocked door of a business during daytime hours because "the shared understanding is that shop owners are normally open to potential customers." These norms, moreover govern not only the time of entry but the manner; entering a business through the back window might be a trespass even when entering through the door is not.
[0] https://arstechnica.com/tech-policy/2017/08/court-rejects-li...
[1] https://www.documentcloud.org/documents/3932131-2017-0814-Hi...
I'm not a lawyer either, but making such a frivolous distinction has always bothered me --- HTTP(S) plus HTML is an API, and it's the one the web browser uses. Maybe the "official" API offers better formatting and such, but ultimately you're just getting the same information from the same source. As long as you don't hammer the server to the point that it becomes disruptive to other users, as far as they're concerned you're just another user visiting the site.
IMHO making such a distinction is harmful because it places an artificial barrier to understanding how things actually work. I've had a coworker think that it was impossible to automate retrieving information from a (company internal) site "because it doesn't have an API". It usually takes asking them "then how did you get that information?" and a bit more discussion before they finally realise.
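To make the "HTML is an API" point concrete: extracting data from a page is just parsing a differently formatted response. A toy sketch using only the standard library (the page content here is made up):

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Treat the site's HTML as the 'API response' and pull out the links."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


# A stand-in for a fetched page; in practice this would come over HTTP(S).
page = '<html><body><a href="/item/1">one</a><a href="/item/2">two</a></body></html>'
parser = LinkExtractor()
parser.feed(page)
print(parser.links)  # ['/item/1', '/item/2']
```

Nothing here is conceptually different from calling `json.loads` on an "official" API response; the format is just messier.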
"If you asked a hundred people to go to different pages on a site and tell you what they found, is that legal?"
Offering an HTML interface may be an indication that you also consent to allowing machines to read the data through the HTML - that's the idea behind search engines. But that's where it gets complicated, and that's why there's all sorts of other considerations to the legal question. Things like did you include the pages in question in robots.txt, did you say anything explicitly about scrapers in the ToS, does the scraper offer a way to contact its owner about abuse, has the website actually contacted them, has an IP ban been issued, is the scraping for commercial purposes, does it compete directly with the site, does it interfere with legitimate human use, etc.
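For the robots.txt signal in particular, checking it programmatically is straightforward with the standard library. A sketch against a made-up robots.txt (the bot name and rules are hypothetical):

```python
import urllib.robotparser

# A hypothetical robots.txt; in practice you'd fetch /robots.txt from the site.
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("my-bot", "https://example.com/public/page"))   # True
print(rp.can_fetch("my-bot", "https://example.com/private/page"))  # False
print(rp.crawl_delay("my-bot"))                                    # 2
```

Honoring both the Disallow rules and any Crawl-delay directive addresses two of the considerations above at once: explicit consent signals and not interfering with legitimate use.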
Why would bytes on the wire be any different from printed words on the page here?
You do not own the copyright on the words of the book, and in many of the cases you list, the publisher does have a say in that. If you want to put on a school play based on the book, you need to get permission from the author. (My high school put on an in-house adaptation of Out of the Dust, and we had to write Karen Hesse and get her okay to do so.) If you put the entirety of the book on your website so that readers of your negative review can refer back to it, the publisher can come after you with a cease & desist or, if you ignore it, a lawsuit. If you write fanfiction based on the characters in the book, the publisher can come after you with a C&D. If you want to make a movie based on that book, you need to buy the film rights. (There's currently an interesting situation with Game of Thrones where HBO owns the film rights to the world of Westeros, but the film rights to the characters & story of Dunk & Egg are still owned by GRRM, so if the film rights to the earlier Dunk & Egg stories were ever bought by a studio other than HBO, they would have to be scrubbed of mentions of Targaryens, the Iron Throne, King's Landing, etc.)
In the pre-Internet days, the chance of enforcement was next to nil for many of these cases, because the big studios and publishers all got licenses for any IP, while class discussions, high school plays, and hobbyists never got a wide audience for their work and so the original publisher would probably never know (unless you did something really stupid like send it to them). The Internet's blurred a lot of these boundaries.
But for a concrete example - one of the exclusive rights bestowed by copyright is the right of reproduction. (It's not the only one, BTW: performance is another one specifically enumerated, as is distribution, as is creating derivative works.) What does that mean? Well, courts have ruled that if you take an exact digital copy of a work, as sold to the public, and publish it for free on a torrent site, that's infringement. They've also ruled that there are various "fair use" exceptions that give implicit rights to the general public even when a work is under copyright. If you quote a sentence from a 300-page book to support a point in an academic paper, that's not infringing.
Where's the boundary? Consult a lawyer, because there's lots of case law. I remember that when I was at Google, there was a big debate over how big the snippets (the little summaries of text on the results page) could be. 2 sentences was fine. A paragraph was dodgy. Showing the entire page was a big no-no. Showing the entire page when the user clicks on "cached" was okay when I was there (I don't remember what the justification was for that), but that option has since disappeared, so I wonder if they ran into problems. They got around it with AMP, which requires explicit opt-in from publishers and so has explicit consent.
It's not all that different from regular property rights in that regard. You own land. What does that mean? Well, normally it means that you can build a house on it - but not if you have a conservation easement on the land, or if local zoning codes forbid the type of dwelling you want. It normally means you have the right to keep other people off your land - except that if your property completely surrounds somebody else's property and cuts them off from a public street, you're required to grant them an easement so that they can cross your land to get to their dwelling. There are other sorts of easements you can grant, too, which are all ways of either granting other people some of the rights associated with your property (but not all of them) or restricting yourself from having some of those rights.
Dismissing various legal and social conventions as 'frivolous distinctions' is, in the end, probably a more harmful viewpoint than the inconveniences the 'distinctions' introduce. It's also too easy to apply it in arbitrary and self-serving ways. Scraping data off some website? Frivolous distinction. Someone hoards your personal data? Venal violation of your privacy rights.
Yes! Similar pet peeve about e.g. "You can't use encryption in Gmail." No, nothing stops you from encrypting the message outside of Gmail and pasting the ciphertext in your message's body. It's that e.g. there might not be native support in the web client.
- If the website offers the data publicly (without authentication), it's free to scrape.
- If the data isn't protected by copyright or trademark, (e.g. public data, such as an address of a house), it's free to reuse.
- If you use the data to compete with a big company, they will sue you regardless.
Court resolutions will vary on the court and judge. https://en.wikipedia.org/wiki/Web_scraping#Legal_issues
At the risk of stating the obvious, most data you'll find is protected by copyright. E.g., this comment is written by me, so according to nearly all jurisdictions in the world, I own the copyright (unless HN has a clause that I agreed to when I signed up saying I sign it away, like Stack Overflow has).
Most forums, blogs, essays, articles, news sites, recipes and song lyrics are covered by copyright. I'm pretty sure that a webshop's blurb about why product x is good is covered by copyright.
If you're scraping for more factual information, in some jurisdictions, such as the US, there's a good chance those aren't subject to copyright. Things like addresses, opening hours, prices, inventory (but not a description of the inventory), etc can be very useful to scrape and present in different ways.
Careful with this one. It's possible that it could be copyrighted in both the US and Europe (and also have some beyond copyright protection in Europe--more on the European situation later). In the US, a collection of data might count as a "compilation", defined in 17 USC 101:
> A “compilation” is a work formed by the collection and assembling of preexisting materials or of data that are selected, coordinated, or arranged in such a way that the resulting work as a whole constitutes an original work of authorship. The term “compilation” includes collective works
In the case of a compilation like a collection of house addresses, the important thing is whether the selection and arrangement of the data was sufficiently creative. The big case on this was decided by the Supreme Court in 1991. The cite is Feist Publications, Inc., v. Rural Telephone Service Co., 499 U.S. 340 (1991).
Briefly, the compilation in that case was a book of telephone listings. There was no doubt that it had taken a lot of work to produce, and up to that point copyright law followed the "sweat of the brow" doctrine, which basically means that if you put a lot of time and effort into making something in a category that can be covered by copyright, you could get copyright.
In Feist, the Court said that it is a Constitutional requirement for copyright that the work must actually be creative. It didn't take much creativity to qualify, but there had to be a spark of creativity in there. In the Feist case, they found that the telephone book in question was just an alphabetical list of all phone users in a region, which the telephone company was required by law to make. There was no creativity in either the selection or arrangement of the data, so no compilation copyright.
Based on Feist, then, a list of all house addresses in a region, sorted by address, or owner, or something like that, would probably be up for grabs. If it is a subset of the houses, then it is possible that the selection was sufficiently creative to allow copyright. Same goes for a clever arrangement or presentation of the data, although if what you are using it for doesn't copy the arrangement or presentation, the compilation copyright might not cover your copying.
BTW, in the particular case of address data, if your application doesn't actually need specific house addresses but instead just needs to know all the valid streets in a US state, and the address ranges on those streets, look at how that state handles sales tax. Sales taxes are usually based on street address, and the states make available databases that list all streets and the tax rates for each address range within the street.
If the state is one of the states that have joined the Streamlined Sales Tax arrangement, you can get their data here [1]. All the states part of the SST group (around half of the states) agreed to a common format for the data. I think most non-SST states also make the data available in a reasonable form, so the approach of using tax data to get address information works in them, too, just not as conveniently.
Most of the rest of the world also recognizes some kind of copyright on data collections, similar to the compilation copyright in the US, for data collections that are selected or arranged with sufficient creativity. This is part of the TRIPS trade agreement.
In the case of scraping for academic purposes, fair use might make it OK even if it would otherwise be a copyright violation. If it is a state-owned school, it might not matter anyway because of sovereign immunity, which greatly limits the ability of citizens to sue a state government for violations of Federal laws.
Some places, including most of Europe, also have a sui generis database right that creates a property right separate from copyright in databases, based on the effort to put together the database (i.e., the old "sweat of the brow" theory). I'll just point to Wikipedia for those who want more on the sui generis database right [2].
Oh, I suppose if the house addresses were for houses in Europe, then besides copyright and the sui generis database right, you might also want to consider whether or not scraping and using the data might have GDPR implications for you.
[1] https://www.streamlinedsalestax.org/Shared-Pages/rate-and-bo...
[2] https://en.wikipedia.org/wiki/Sui_generis_database_right
I suspect you'll actually get different answers depending on which lawyer you ask. If you've got deep enough pockets you can probably ensure you get the answer you want, and if you have really deep pockets you can probably ensure the court gets the answer you want. But if you're just a student who doesn't want to end up in court, there are potential minefields there.
[0] https://commoncrawl.org/
[1] https://registry.opendata.aws/
My tips:
- Keep careful control of the rate you scrape. Every time I have ever heard of someone getting negative feedback it is because they have scraped pages at a rate that caused an impact on the website they were scraping. If you don't cause a noticeable increase in traffic/load nobody will check to see what is going on, and generally nobody has a reason to care.
- Some sites are notoriously aggressive at going after people, such as craigslist. I wouldn't try to scrape them.
- Use some kind of proxy!
Rotating through many proxies, in random order, works best.
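A minimal sketch of such a rotation with the standard library -- the proxy addresses here are placeholders, and this assumes you have legitimate access to the proxies you use:

```python
import itertools
import random
import urllib.request

# Hypothetical proxy pool -- substitute proxies you actually control.
PROXIES = [
    "http://proxy-a.example:8080",
    "http://proxy-b.example:8080",
    "http://proxy-c.example:8080",
]

shuffled = PROXIES[:]
random.shuffle(shuffled)  # randomize the order once, per the suggestion above
proxy_pool = itertools.cycle(shuffled)


def next_opener():
    """Return an opener routed through the next proxy in the rotation."""
    proxy = next(proxy_pool)
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    return urllib.request.build_opener(handler), proxy
```

Each call to `next_opener()` yields an opener whose requests go through a different proxy, cycling through the shuffled pool.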
That brings up another curious question: What's the legality of posting a site to something like HN or Slashdot and effectively getting it DDoS'd...?
I imagine there's some reading of the CFAA that could theoretically land you in hot water for this, but this is silly.
Intent is very important. Can one sue or prosecute a popular food critic for writing something about a restaurant, causing lines so long that long-time regulars can't get a seat anymore?
On the other hand, you have things like booter services (essentially, DDoS as a service). Continuing the analogy, I imagine if you hired 100 people to physically block the entrance of a restaurant for some reason, you would be on the hook for damages in civil court and something along the lines of "disturbing the peace" in criminal court.
Andy Sellars [1] published a paper a year ago on the topic titled "Twenty Years of Web Scraping and the Computer Fraud and Abuse Act" [2], which puts the topic in great perspective. Many of the cases are not very clear cut and swing from one direction to another. We are currently in an upswing where courts side with the "crawlers", which may change in a couple of years.
[1] https://twitter.com/andy_sellars
[2] https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3221625
It might be illegal in some jurisdiction; IANAL but I think you can just get out of that jurisdiction and scrape away if that is the case. It might violate some ToS but ToS isn't law; the consequences of violating a ToS are usually on the order of getting your IP banned.
What you do with the stuff you scraped can be ethical or unethical.
Why should I be treated differently than search engine spiders?
If somebody doesn’t want their site scraped then they can let people know with robots.txt. Get off your high horse.
Google is massively scraping the web and is building products on top of the data, e.g. flight/hotel search. Why shouldn't we be allowed to do the same?
As others have pointed out, one should take care about the ToS.
The short answer is because they have the muscle and you don't.
This was litigated in Perfect 10 v. Amazon, and the only way out for the panel deciding the case was to claim Google's use is "fair" due to its unprecedented and "transformative" nature, which basically should be read as "we don't want to face the public scorn of being the judges responsible for shutting down Google Images". Such an advantage is unlikely to be a factor in less-prominent cases.
Even if you believe you can convince a panel of judges that your project specifically meets the four-prong test for fair use, it takes millions of dollars to litigate a case that far, which is well outside the realm of possibility for most independent projects.
Flight and hotel search is a good example. Why can't you find Southwest fares on any aggregator? People try to scrape them all the time, and as soon as they come to Southwest's attention, they get C&D'd and shut down. This is a common and well-established practice and hundreds of companies die by it every year.
The argument that the "transformative" nature of something makes it exceptional and above the law doesn't sound intuitive to me.
Building flight and hotel search by scraping is possible, but will make whoever you scrape very angry very quickly, because they're paying their technology vendors per search, so they watch their look-to-book ratios very closely.
Please reconsider this position. You're teaching the future generation of engineers and scientists. Even if it's not strictly the topic of your course, please don't teach your students that everything that's technically legal to do is fine. Show that being socially conscious matters as well. Everybody will be better off.
First, teachers shouldn't be teaching morals, especially in college and university. The slippery slope from morals to politics is a dangerous one. I'd rather they focus on their actual course materials.
Finally, there is nothing wrong with scraping from an ethical standpoint if you don't DDoS the target services. It gave us search engines, and that's probably one of the most important breakthroughs for humanity in the past few decades.
I don't see where you see politics in how they handle such questions. I'm not advocating they go on an extended lecture about their personal views on the political system that made the laws and what not.
I'm saying that there's a difference between handling these kind of questions with "if you're not sure, maybe you should kindly ask the publisher of the data if they would be ok with you scraping/using it that way" and "if your lawyer says you're in the clear, fuck them and scrape away."
I disagree 100%. "To teach about the human anatomy, we've kidnapped Paul here, and will now cut him apart."
It's great for teaching (how better to observe what happens when you cut open a living person than ... cutting open a living person), but it's unethical (and illegal), and that's an important lesson as well.
That said, I appreciate the distinction between morals and ethics. I understand ethics in the domain of 'what is good' and morals in the domain of 'what is good in society'. Fair Use suggests it's ethical to use published materials for educational purposes, whereas morals ask if what you're actually doing with the data is good for society.
And in some cases scraping is a violation of ToS. (Though who knows whether that’s ever been litigated as enforceable.)
It would be extremely surprising to learn otherwise, for example that there is a jurisdiction in which site users are bound by terms they can only find by actively looking for them on the site.
See Nguyen v. Barnes and Noble at https://en.wikipedia.org/wiki/Nguyen_v._Barnes_%26_Noble,_In.... for a recent example that represented a loosening of precedent by ruling that the ToS was not enforceable because the user did not receive adequate notice. If B&N had placed their disclaimer in a place where the user was more likely to see it, they would've been fine.
You're right that "consent" is the important legal issue, but it's usually implied based on what your site requires re: authentication/authorization, robots.txt, and the controls Google has provided to let you tell them not to index a site.
Scraping is fine if you ask the company and get permission!
This may seem obvious, but so many conversations about scraping seem to start from the position that it is in some fundamental way, not allowed. This is not true.
Conversations also seem to start from the assumption that you need to scrape the whole web, which again is not true.
If you're teaching a machine learning course, perhaps you have a project on classifying... cars. Do you need to scrape the whole web to get a bunch of data about cars? No. Could you get away with scraping just Autotrader or a similar site? Maybe! Why not ask them! If you clearly state that it's for learning, that credit will be given, etc, you may find them quite amenable to it.
I work at a company built significantly around web scraping, and we have contracts with all of our scrape targets that confirm we are allowed to scrape them.
Typically the client will ask you to fill out a questionnaire about how you create or generate the data. There are lots of questions about web scraping.
The general sense is that these firms are more and more sensitive to purchasing data that has been scraped... Especially if it relates to individuals or social media.
1) Copyright 2) Terms of service.
If doing hobby/education projects and not publishing what you create, copyright isn't really relevant.
As for violating terms of service (which is very likely), that's not "illegal" per se; it just opens you up to being sued. And that's very unlikely if no one is making money from it or hurting the service itself.
Legality highly depends on where you are.
In the US, scraping of public data is generally considered fair use and protected by the First Amendment. If you have to sign in to access the data, you may then be bound by the ToS.
In Europe, scraping of public data can run afoul of several laws: notably the GDPR, the new Copyright Directive, and the sui generis database rights established by the EU Database Directive.
I've always really wanted to make a terminal app to keep track of how busy my local places are. Not saying I'd become a customer who would keep the lights on or anything like that, but at the very least it would make a cool demo.
https://www.eff.org/deeplinks/2018/04/dc-court-accessing-pub...
"...does not make it a crime to access information in a manner that the website doesn’t like if you are otherwise entitled to access that same information."
I know this isn’t the answer you are seeking, but it might help you find more examples: the area of copyright and fair use has a longer history with digital images. Here’s a court case showing that, as others have noted, the ruling judge has a great impact on the outcome: “Court Rules Images That Are Found and Used From the Internet Are 'Fair Use'” by Jack Alexander, 2018-07-02 [1]
Maybe your educational institution has already done some legal work related to issues of copyright and educational use?
Here is an example from a university that has done the legal work and constructed "safe harbour" guidelines.
“The use of copyright protected images in student assignments and presentations for university courses is covered by Copyright Act exceptions for fair dealing and educational institution users. [...] In certain circumstances you may be able to use more than a "short excerpt" (e.g. 10%) of a work under fair dealing. SFU's Fair Dealing Policy sets out "safe harbour" limits for working under fair dealing at SFU, but the Copyright Act does not impose specific limits.”
[1]: https://fstoppers.com/business/court-rules-images-are-found-...
[2]: "I want to use another person's images and materials in my assignment or class presentation. What am I able to do under copyright?" (https://www.lib.sfu.ca/help/academic-integrity/copyright/stu...)
Scraping is the entire business model of every search engine.
The same arguably applies to radio streams or Netflix videos: once they're streamed to you, you can record the stream for yourself (though circumventing DRM, as Netflix uses, raises separate DMCA issues).
As journalists, we scrape things to collect information used toward transformative analysis. Not straight-up mirroring. Facts, as stated by an entity. So we've never run into a legal issue doing this as long as we used the scrapes to synthesize results into data. For example, map of restaurant closures by the health department, with statistics and graphs of violation frequencies. Or analysis of lawyer performance by cross-referencing a state judiciary database search with their team member lists for success rates and other stuff.
Most of the sites we scraped were county, state or federal government sites and they contained information available in the interest of the general public. However, we crawled tons of private sites as well and as long as we wore white hats we considered it fair game.
We typically tried to scrape fairly, without causing technical issues, though to be honest we ignored robots.txt directives all the time; we just timed runs to happen during off hours, with backoff mechanisms in case we contributed negatively to computational loads. The typical issues we ran into were overeager system administrators who squashed or interfered with our scraping attempts under their personal interpretation of appropriateness. Sometimes they sicced misguided lawyers on us. Most of the lawyers couldn't tell you what we were doing wrong, let alone how a site is registered, what a glue record is, how DNS differs from IP registry ownership, or how colocated servers work and who owns them. They couldn't prove what we did with any of the data, so they couldn't even imply we violated any copyrights.
So we relied on our legal departments to clear the way in case of issues, but in 15 years of doing this, I've never once had a legal issue come up and put a stop to what we were doing under that operating premise of transforming the information into data. Our legal team never got involved for that sort of thing. There were issues, but they got resolved through communication or by reconfiguring our scrapers. Even when we've also made the raw data available to the public or other researchers, it hasn't come up as a problem.
In one case, a police department figure blocked us because they disliked our coverage. Their pretense was that our geocoding wasn't accurate enough given the information they provided, and rather than circumvent their blocking we had face-to-face meetings to address and mollify those concerns, on the record. They ended up providing us with additional information to achieve that accuracy. In another case, the CEO of a large private company personally threatened us with legal action, claiming we violated the terms of service for their API endpoint. However, their terms of service said nothing about data retention once something became a data point, so we felt we were in the clear; we kept doing it for years and nothing came of it.
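The "be polite, back off on errors" practice described throughout this thread can be sketched as follows. The function names, the one-second base delay, and the 60-second cap are my own illustrative choices, not anyone's production code:

```python
import time

def next_delay(consecutive_errors: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Exponential backoff: double the wait for each consecutive error, up to a cap."""
    return min(base * (2 ** consecutive_errors), cap)

def fetch_politely(urls, fetch, base: float = 1.0):
    """fetch(url) should return an HTTP status code; 429 or 5xx triggers backoff."""
    errors = 0
    fetched = []
    for url in urls:
        status = fetch(url)
        if status == 429 or status >= 500:
            errors += 1            # the server is unhappy: back off harder
        else:
            errors = 0             # a success resets the backoff
            fetched.append(url)
        time.sleep(next_delay(errors, base))  # never faster than the base delay
    return fetched
```

With a one-second base delay this never exceeds one request per second, and repeated 429s stretch the wait toward the cap, which is roughly the behavior that keeps a scraper from ever getting noticed.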