Wow, they've built a distributed Amazon listing scraping system – essentially a botnet.
As someone who has done a lot of web scraping and had to route around a lot of blocking (we have business contracts to allow scraping, but they don't stop over-eager sysadmins), this feels like a dream come true.
But I'd never actually want to use this for scraping and I'm not sure any informed user would agree to use this.
Some companies want you to list their products on your page (usually with some kind of affiliate deal attached) but don't have a tech team to implement a feed or an API. In that case you end up in a situation where you have to scrape the data yourself with permission.
Pretty much exactly this. We're not affiliates as we do our own fulfilment, but essentially we partner with companies to offer their products on our site, and we take a margin from them.
We sometimes do this with feeds, but feeds aren't great for stock change latency, which is important for us (thin stock levels on a wide range of products, more out-of-stock issues). Scraping ensures we at least have the same stock latency that direct customers see, and we manage the risk on how frequently to scrape.
Most of these companies don't have in-house engineering capabilities. For Shopify based merchants we don't have to scrape, we use the Shopify API, but otherwise scraping is the only real solution.
> Unless of course you don’t consider the information collected here personal.
I don’t. The author even goes out of their way to point out that these requests aren’t generated by the user and so there’s no latent interest information there. I agree that they should cover this behavior in the privacy policy explicitly, but there’s a tone of moral outrage in this piece that seems unearned.
I’m really unsure how you would come to this conclusion. Even if you only read the summary at the beginning or only the conclusions section at the end, you should notice that Keepa is doing both. It will extract data from your Amazon visits (personal information) and do its own scraping (merely wasting your bandwidth if implemented correctly which I am unconvinced of).
Thanks for engaging here. Maybe my reading comprehension is poor, but here’s the full quote that I was objecting to. It comes after a long pull quote where Keepa promises to not log the requests that do contain latent interest behavior:
> This refers to some pieces of the Keepa functionality but it once again completely omits the data collection outlined here. It’s reassuring to know that they don’t log product identifiers when showing product history, but they don’t need to if on another channel their extension sends far more detailed data to the server. This makes the first sentence, formatted as bold text, a clear lie. Unless of course you don’t consider the information collected here personal. I’m not a lawyer, maybe in the legal sense it isn’t.
When I was reading, I thought that “data collection outlined here” referred to the scraping behavior you reverse engineered, since the pull quote covered the user-generated request. I agree that they should include the additional scraping behavior here for clarity (we’re arguing about it after all). I disagree that it constitutes a “clear lie”, since I don’t think that data is personal.
“Data collection outlined here” refers to both mechanisms covered by the article. The first one collects information about the products you look at which clearly is personal information. The automated scraping in the background is less problematic from the privacy protection point of view, at least when it is used in the intended way.
I use this extension (and the app) regularly, which activates as soon as I visit Amazon in a container tab. In addition to providing in-depth statistics, features like alerts via Telegram have helped me hunt down bargains. I have noticed the increase in network requests and bandwidth when the tab is active, using basic tracking via Resource Monitor (W10). However, I can easily block it via uMatrix/uBO, if required. In this case, it is a trade-off, which can be justified.
Also, Tracker Control (Android) for the Keepa app reports blocking just two trackers, Google Crashlytics and Google Firebase Analytics -- so it is not as bad as other apps.
I have used CamelCamelCamel in the past, which was more egregious and aggressive in tracking users, but don't know how it fares today.
Unfortunately, it isn’t that easy. You cannot use other extensions to block requests happening on the extension’s background page. Whatever tracking and scraping is going on, you can probably disable part of it via extension’s settings but otherwise there is nothing you can do.
I use Keepa basic and it has saved me a ton of money. I always just assumed it was scraping the prices from pages I visit, but I didn't know it would automatically fetch Amazon pages in the background. Might just sign out of Amazon, and use a separate browser to purchase from it.
Either way, I have some thinking to do on if I should "keepa" it or not (sorry really bad joke). Maybe I should purposely turn a blind eye and just trust they aren't going to do anything evil nor have some privacy risk due to how useful it is.
Isn't this always the trade-off? While I do appreciate useful software, it gets tiring that it's almost always at the expense of a little bit of privacy or tracking. Seems like the death of a thousand cuts of our anonymity online. Although, I don't really harbor illusions that we (at least Americans) haven't been tracked since the invention of the credit card. I guess I'm a little jaded at this point as there doesn't seem to be anything I, personally, can do about it and I get a touch of FOMO when I hear about the capabilities of the latest and greatest apps.
I understand that data collection is inherently necessary for AI, I just don't like who's in charge of it and making the innovations.
I use the Keepa website and never realized before this article that they even have browser plugins. On the website you can set up price alerts that go out via email or Telegram. That works well enough for me.
I can't quite understand this article and its conclusion.
The article says: "[The extension] will collect information about the products you look at and the ones you search for".
Yet, two sentences later it says "The company behind the extension fails to comply with its legal obligations. The privacy policy is misleading in claiming that no personal data is being collected."
So which personal information is exactly included in the data submitted to their servers about the products? Because in that json example I don't see anything that would be even close to personal information.
The remote scraping/execution abilities are not great, I'll give it that. But the rest of it seems like an overblown conclusion and interpretation of how it works.
I’d assume that "products you searched for", even if only implicitly thanks to the results, is personal information. It also is not mentioned in their privacy policy, which only mentions sending on product pages.
The history of all Amazon products you looked at or searched for is personal data, and it can tell a lot about you. Whether it is also personal data in the legal sense is not something I can say for sure. But it definitely has to be properly covered in the privacy policy, for GDPR compliance at the very least.
But it is not personal data that would identify you (PII). If someone was able to determine who I was based solely on my browsing activity on Amazon, then they've already obtained my personal information.
PII is not a term that is used in the GDPR. The person you're replying to is correct that your browsing data is likely to count as personal data given that it's linked to an individual.
No, it isn’t PII in the legal sense, it doesn’t allow identifying you directly. Which doesn’t mean that it cannot be tied to your identity. Just one example: if you regularly post to social media what you bought online, this information could be correlated with the Keepa data to find out which profile is likely yours and what else you looked at.
But GDPR doesn’t merely require you to disclose collection of PII, but rather all data collected. There is a good reason for that.
And not sure if Amazon would agree to this as it essentially threatens the privacy and integrity of their users. Interestingly, Keepa is also an Amazon Affiliate, so they are in a direct business relationship with Amazon.
If the additional Amazon pages are loaded on days when the user hasn’t browsed Amazon, or loaded once a day, that could be cookie stuffing, which is explicitly prohibited by the Amazon Affiliate terms. The Amazon affiliate cookies last 24 hours, so triggering a session when the user doesn’t do it themselves might extend their affiliate window, and that is not right at all.
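To make the window arithmetic concrete, here is a rough sketch. It assumes a deliberately simplified model in which every affiliate-tagged page load starts a fresh 24-hour cookie (the 24-hour figure is the one quoted above). The function name and timestamps are made up for illustration, and nothing here shows that Keepa's background requests actually carry an affiliate tag.

```python
from datetime import datetime, timedelta
from typing import List

# Cookie lifetime as stated in the comment above; treated here as a fixed window.
COOKIE_LIFETIME = timedelta(hours=24)


def is_attributed(purchase: datetime, tagged_loads: List[datetime]) -> bool:
    """Return True if any affiliate-tagged load happened within 24h before the purchase.

    Simplified, hypothetical model: each tagged page load sets a fresh cookie.
    """
    return any(
        timedelta(0) <= purchase - load <= COOKIE_LIFETIME
        for load in tagged_loads
    )


# The user clicks an affiliate link once, then buys 47 hours later.
user_click = datetime(2021, 1, 4, 10, 0)
purchase = datetime(2021, 1, 6, 9, 0)

# Hypothetical tagged load triggered in the background, 21 hours before the purchase.
background_load = datetime(2021, 1, 5, 12, 0)

only_user = is_attributed(purchase, [user_click])
with_background = is_attributed(purchase, [user_click, background_load])
```

Under these assumptions the purchase falls outside the window when only the user's own click counts, and inside it once a background load is added. That is the concern with sessions the user did not start.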
There is a difference between gathering prices and loading extra URLs to gather those prices. From that text, I would not assume they are using my computer as a part of a botnet.
I knew it was doing that as well (the distributed scraping the article talks about). But I cannot figure out where I read it. Maybe they used to have it somewhere on their site, and now it's gone?
What is strange is that people asked for e.g. Amazon.nl support. This isn't implemented because Keepa relies on Amazon (this is their answer in the forums). But if they scrape, why do they still need Amazon?
Nice, I didn’t find this setting and I explicitly went looking for it. So the settings in the “price history” graph don’t merely apply to the way this graph is shown. Now I need to figure out what this setting is doing. Because I didn’t see any conditions in the code which were tied to this setting.
Found it. This is the optOut_crawl setting and its handling is entirely on the server side. So presumably if this setting is set, the server will no longer send the extension any instructions to scrape Amazon pages in background. Mind you, it still could but it probably won’t.
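For illustration only, this is roughly what a server-side check like that could look like. It is a guess at the shape of the logic, not Keepa's actual code: apart from the `optOut_crawl` name, everything here (the `ClientSettings` class, `instructions_for`, the instruction format, the example URLs) is hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ClientSettings:
    # Hypothetical server-side copy of the settings the extension reported.
    optOut_crawl: bool = False


def instructions_for(settings: ClientSettings, pending_urls: List[str]) -> List[Dict[str, str]]:
    """Decide which background scrape instructions to send to one client.

    The extension has no condition tied to the setting; the server
    simply stops handing out work to clients that opted out.
    """
    if settings.optOut_crawl:
        return []
    return [{"action": "fetch", "url": url} for url in pending_urls]


queue = ["https://www.amazon.com/dp/EXAMPLE1", "https://www.amazon.com/dp/EXAMPLE2"]
opted_out = instructions_for(ClientSettings(optOut_crawl=True), queue)
opted_in = instructions_for(ClientSettings(), queue)
```

With a design like this the client still accepts any instruction it is sent, so the opt-out is only as good as the server honoring it, which is the "it still could but it probably won't" caveat.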
Scraping data from pages you visit shouldn’t be affected by this.
Do you remember the time when this weird German startup that publishes an Adblocker tried to start an "Acceptable Ads" program and extort money from Google? Guess what their CTO is up to now.
Exactly. Showing the world the shady business of browser plugins.
https://camelcamelcamel.com/
They moved to the current model of providing an API for Amazon data (which seems to use the extension's users to scrape data).
> Allow the add-on to gather Amazon prices to improve our price data
I thought it was common knowledge that Keepa uses the addon to gather prices. Though with GDPR it probably needs to be more explicitly said.
As well as Honey and Keepa