Data exfiltration in Keepa Price Tracker

(palant.info)

62 points | by taxyovio 990 days ago

11 comments

  • danpalmer 990 days ago
    Wow, they've built a distributed Amazon listing scraping system – essentially a botnet.

    As someone who has done a lot of web scraping and had to route around a lot of blocking (we have business contracts to allow scraping, but they don't stop over-eager sysadmins), this feels like a dream come true.

    But I'd never actually want to use this for scraping and I'm not sure any informed user would agree to use this.

    • voltagex_ 990 days ago
      How do you get contracts to allow scraping? What kind of cost are we talking about?
      • dewey 990 days ago
        Some companies want you to list their products on your page (usually with some kind of affiliate deal attached) but don't have a tech team to implement a feed or an API. In that case you end up in a situation where you have to scrape the data yourself with permission.
        • danpalmer 990 days ago
          Pretty much exactly this. We're not affiliates as we do our own fulfilment, but essentially we partner with companies to offer their products on our site, and we take a margin from them.

          We sometimes do this with feeds, but feeds aren't great for stock change latency, which is important for us (thin stock levels on a wide range of products, more out-of-stock issues). Scraping ensures we at least have the same stock latency that direct customers see, and we manage the risk on how frequently to scrape.

          Most of these companies don't have in-house engineering capabilities. For Shopify based merchants we don't have to scrape, we use the Shopify API, but otherwise scraping is the only real solution.

  • wilde 990 days ago
    > Unless of course you don’t consider the information collected here personal.

    I don’t. The author even goes out of their way to point out that these requests aren’t generated by the user and so there’s no latent interest information there. I agree that they should cover this behavior in the privacy policy explicitly, but there’s a tone of moral outrage in this piece that seems unearned.

    • palant 990 days ago
      Note: I am the author of this article.

      I’m really unsure how you would come to this conclusion. Even if you only read the summary at the beginning or only the conclusions section at the end, you should notice that Keepa is doing both. It will extract data from your Amazon visits (personal information) and do its own scraping (merely wasting your bandwidth if implemented correctly, which I am not convinced it is).

      • wilde 990 days ago
        Thanks for engaging here. Maybe my reading comprehension is poor, but here’s the full quote that I was objecting to. It comes after a long pull quote where Keepa promises to not log the requests that do contain latent interest behavior:

        > This refers to some pieces of the Keepa functionality but it once again completely omits the data collection outlined here. It’s reassuring to know that they don’t log product identifiers when showing product history, but they don’t need to if on another channel their extension sends far more detailed data to the server. This makes the first sentence, formatted as bold text, a clear lie. Unless of course you don’t consider the information collected here personal. I’m not a lawyer, maybe in the legal sense it isn’t.

        When I was reading, I thought that “data collection outlined here” referred to the scraping behavior you reverse engineered, since the pull quote covered the user-generated request. I agree that they should include the additional scraping behavior here for clarity (we’re arguing about it after all). I disagree that it constitutes a “clear lie”, since I don’t think that data is personal.

        • palant 990 days ago
          “Data collection outlined here” refers to both mechanisms covered by the article. The first one collects information about the products you look at which clearly is personal information. The automated scraping in the background is less problematic from the privacy protection point of view, at least when it is used in the intended way.
      • 45ure 990 days ago
        Thanks for the article.

        I use this extension (and the app) regularly, which activates as soon as I visit Amazon in a container tab. In addition to providing in-depth statistics, features like alerts via Telegram have helped me hunt down bargains. I have noticed the increase in network requests and bandwidth when the tab is active, using basic tracking via Resource Monitor (W10). However, I can easily block it via uMatrix/uBO, if required. In this case, it is a trade-off, which can be justified.

        Also, Tracker Control (Android) reports blocking just two trackers in the Keepa app, Google Crashlytics and Google Firebase Analytics, so it is not as bad as other apps.

        I have used CamelCamelCamel in the past, which was more egregious and aggressive in tracking users, but don't know how it fares today.

        https://camelcamelcamel.com/

        • palant 990 days ago
          Unfortunately, it isn’t that easy. You cannot use other extensions to block requests happening on the extension’s background page. Whatever tracking and scraping is going on, you can probably disable part of it via the extension’s settings, but otherwise there is nothing you can do.
  • NazakiAid 990 days ago
    I use Keepa basic and it has saved me a ton of money. I always just assumed it was scraping the prices from pages I visit, but I didn't know it would automatically fetch Amazon pages in the background. Might just sign out of Amazon, and use a separate browser to purchase from it.

    Either way, I have some thinking to do on whether I should "keepa" it or not (sorry, really bad joke). Given how useful it is, maybe I should purposely turn a blind eye and just trust that they aren't going to do anything evil or pose a privacy risk.

    • SCNP 990 days ago
      Isn't this always the trade-off? While I do appreciate useful software, it gets tiring that it's almost always at the expense of a little bit of privacy or tracking. Seems like the death of a thousand cuts of our anonymity online. Although, I don't really harbor illusions that we (at least Americans) haven't been tracked since the invention of the credit card. I guess I'm a little jaded at this point as there doesn't seem to be anything I, personally, can do about it and I get a touch of FOMO when I hear about the capabilities of the latest and greatest apps. I understand that data collection is inherently necessary for AI, I just don't like who's in charge of it and making the innovations.
    • wheels 990 days ago
      I use the Keepa website and never realized before this article that they even have browser plugins. On the website you can set up price alerts that go out via email or Telegram. That works well enough for me.
      • NazakiAid 990 days ago
        I would do that but it's very helpful to also see how often the price changes and goes on sale to know if I am getting "ripped off".
        • wheels 989 days ago
          There's a price graph on their website showing the price development over time.
      • rafaelm 989 days ago
        Same, I just set up alerts via Telegram and they pop up on my phone and desktop client. Didn't know they had a browser extension.
  • a254613e 990 days ago
    I can't quite understand this article and its conclusion.

    The article says: "[The extension] will collect information about the products you look at and the ones you search for".

    Yet, two sentences later it says "The company behind the extension fails to comply with its legal obligations. The privacy policy is misleading in claiming that no personal data is being collected."

    So which personal information is exactly included in the data submitted to their servers about the products? Because in that json example I don't see anything that would be even close to personal information.

    The remote scraping/execution abilities are not great, I'll give it that. But the rest of it seems like an overblown conclusion and interpretation of how it works.

    • Semaphor 990 days ago
      I’d assume that "products you searched for", even if only implicitly thanks to the results, is personal information. It also is not mentioned in their privacy policy, which only mentions sending on product pages.
    • palant 990 days ago
      Note: I am the author of the article above.

      The history of all Amazon products you looked at or searched for is personal data, and it can tell a lot about you. Whether it is also personal data in the legal sense is not something I can say for sure. But it definitely has to be properly covered in the privacy policy, for GDPR compliance at the very least.

      • timdorr 990 days ago
        But it is not personal data that would identify you (PII). If someone was able to determine who I was based solely on my browsing activity on Amazon, then they've already obtained my personal information.
        • iamacyborg 990 days ago
          PII is not a term that is used in the GDPR. The person you're replying to is correct that your browsing data is likely to count as personal data given that it's linked to an individual.
        • palant 990 days ago
          No, it isn’t PII in the legal sense, it doesn’t allow identifying you directly. Which doesn’t mean that it cannot be tied to your identity. Just one example: if you regularly post to social media what you bought online, this information could be correlated with the Keepa data to find out which profile is likely yours and what else you looked at.

          But GDPR doesn’t merely require you to disclose collection of PII, but rather all data collected. There is a good reason for that.

  • mrsaint 990 days ago
    I'm also not sure Amazon would agree to this, as it essentially threatens the privacy and integrity of their users. Interestingly, Keepa is also an Amazon Affiliate, so they are in a direct business relationship with Amazon.
    • patd 990 days ago
      As far as I know, Keepa is not an Amazon affiliate. They used to be and got kicked out like many similar tools around 5 years ago.

      They moved to the current model of providing an API for Amazon data (which seems to use the extension's users to scrape data).

    • avipars 990 days ago
      They actively warned about Honey Security Issues, but haven't mentioned Keepa at all.
  • dzink 990 days ago
    If the additional Amazon pages are loaded on days when the user hasn’t browsed Amazon, or loaded once a day, that could be cookie stuffing, which is explicitly prohibited by the Amazon Affiliate terms. The Amazon affiliate cookies last 24 hours, so triggering a session when the user doesn’t do it might extend their affiliate window, and that is not right at all.
    • liquorice 990 days ago
      Keepa is a data company though, not an Amazon Affiliate, so they shouldn't care about violating that policy
  • bkor 990 days ago
    From the Keepa addon settings:

    > Allow the add-on to gather Amazon prices to improve our price data

    I thought it was common knowledge that Keepa uses the addon to gather prices. Though with GDPR it probably needs to be more explicitly said.

    • Semaphor 990 days ago
      There is a difference between gathering prices and loading extra URLs to gather those prices. From that text, I would not assume they are using my computer as a part of a botnet.
      • bkor 990 days ago
        I knew it was doing that as well (the distributed scraping the article talks about). But I cannot figure out where I read it. Maybe they used to have it somewhere on their site, and now it's gone?

        What is strange is that people asked for e.g. Amazon.nl support. This isn't implemented as Keepa relies on Amazon (this is their answer in the forums). But if they scrape, why do they still need Amazon?

    • palant 990 days ago
      Note: I am the author of the article above.

      Nice, I didn’t find this setting and I explicitly went looking for it. So the settings in the “price history” graph don’t merely apply to the way this graph is shown. Now I need to figure out what this setting is doing. Because I didn’t see any conditions in the code which were tied to this setting.

      • palant 990 days ago
        Found it. This is the optOut_crawl setting and its handling is entirely on the server side. So presumably if this setting is set, the server will no longer send the extension any instructions to scrape Amazon pages in the background. Mind you, it still could, but it probably won’t.

        Scraping data from pages you visit shouldn’t be affected by this.
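
        A minimal sketch of what server-side handling of such a setting could look like. Only the name optOut_crawl comes from the extension; the types, the function instructionsFor and the example URL are hypothetical and not taken from Keepa's code. The point it illustrates is that the extension never checks the setting itself, so whether it is honored depends entirely on what the server chooses to send.

```typescript
// Hypothetical sketch: nothing here is taken from Keepa's actual code.
// Only the setting name optOut_crawl comes from the discussion above.

interface ClientSettings {
  optOut_crawl: boolean; // setting value as reported to the server
}

interface ScrapeInstruction {
  url: string; // page the extension would be asked to load in the background
}

// Server-side decision. The extension has no condition tied to the setting;
// it simply executes whatever instructions it receives.
function instructionsFor(
  settings: ClientSettings,
  pending: ScrapeInstruction[],
): ScrapeInstruction[] {
  // Clients that opted out receive no background scraping work.
  return settings.optOut_crawl ? [] : pending;
}

// Placeholder work queue with an illustrative URL.
const pending: ScrapeInstruction[] = [{ url: "https://example.com/product/1" }];

const optedOut = instructionsFor({ optOut_crawl: true }, pending);
const optedIn = instructionsFor({ optOut_crawl: false }, pending);

console.log(optedOut.length, optedIn.length);
```

        With this kind of design nothing on the client enforces the opt-out, which is why it can only be described as "probably" honored.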

  • robk 990 days ago
    i don't really care - i love the plugin too much to uninstall it. it's saved me a killing.
  • avipars 990 days ago
    thanks! Uninstalled today!

    As well as Honey and Keepa

  • dna_polymerase 990 days ago
    Do you remember the time when this weird German startup that publishes an Adblocker tried to start an "Acceptable Ads" program and extort money from Google? Guess what their CTO is up to now.

    Exactly. Showing the world the shady business of browser plugins.