Ask HN: How to get aggregated user behaviour without tracking an individual user

33 points | by harianus 194 days ago


  • Chrissvo 194 days ago

    That's a nice challenge!

    If you're super distrustful you could argue that you should never store a timestamp with a signup event, because it could potentially reveal a user's identity...

    Here's a crazy thought, what if you would do this:

    1. You fire off a default first event, say "init". On the server you generate a PGP key pair, store the private key with the init event, and return the public key

    2. Second event (first real event) is fired by the website owner and encrypted with the PGP public key from 1

    3. On the server you try to decrypt event #2 with all available active private keys (stored with init events)

    4. Once a solution is found you link the 2nd event to the 1st event, delete the private key of the 1st event, generate a new PGP key pair, store private key with 2nd event, and return the new public key

    5. Third event is encrypted with the public key of 4 and...

    No need to store timestamps, and all traffic is encrypted. Now, how to make step 3 fast?

    • harianus 194 days ago

      Thanks! I like the way you think.

      2. I think PGP encryption is pretty heavy and maybe not great for the performance of a script that loads on a lot of websites.

      3. I'm not sure how fast this will be. Especially on a very busy website with lots of page views per second.

      Basically you could also store a variable with the event and send that variable back. What would be the added value to use PGP encryption?
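
      A minimal sketch of the "store a variable with the event and send it back" idea, assuming an in-memory store. The `EventChain` class and its method names are hypothetical, not part of any existing tool. Each event gets a fresh random one-time token; the next event presents it, the server links the two events with a dictionary lookup instead of trial decryption, and the old token is deleted.

```python
import secrets


class EventChain:
    """In-memory sketch: links consecutive events with a one-time token."""

    def __init__(self):
        self._next_id = 0
        self.events = {}        # event_id -> {"name": str, "prev": event_id or None}
        self._open_tokens = {}  # one-time token -> id of the latest event in its chain

    def record(self, name, token=None):
        """Store an event and return a fresh token for the next event in the chain."""
        # pop() consumes the old token, so it is never kept next to the events
        prev = self._open_tokens.pop(token, None) if token is not None else None
        event_id = self._next_id
        self._next_id += 1
        self.events[event_id] = {"name": name, "prev": prev}
        new_token = secrets.token_urlsafe(16)
        self._open_tokens[new_token] = event_id
        return new_token


chain = EventChain()
t1 = chain.record("init")            # first event, no token yet
t2 = chain.record("pageview", t1)    # linked to "init"; t1 is now invalid
```

      Unknown or already used tokens simply start a new chain. No timestamp is stored here, although the sequential event IDs in this sketch still reveal arrival order.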

    • harianus 194 days ago

      I don't want to use a session cookie with an ID to link all events. I don't want any ID because I could potentially link those IDs together in the back end based on IP (I don't, but I want people not to have to trust me). I want to make sure I don't get any data that my system could misuse.

      • vokep 194 days ago

        I think maybe a good way to go is a compromise. Since you're already making an effort to protect privacy without requiring people to trust you, that's already a good start. But maybe you need some kind of ID to tie behavior together, so you do record one temporarily, until you've processed it into an aggregate (anonymized individual behaviors).

        Basically, train a machine learning model on the data of individuals. You don't want to overfit, or the model could be de-anonymizable, but a slightly underfit model could capture most of the important patterns while throwing out most of the identifying aspects.

        The hard part then becomes finding a way to demonstrate that this is actually happening, so that you can be trusted. Unfortunately I can't think of a provable way, since you pretty much either can track users by IDs or you can't. And if you do, then trust has to be assumed.
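
        The temporary-ID compromise above could look roughly like this, using plain counting instead of a trained model. This is only a sketch: `aggregate_sessions` and the `(temp_id, path)` event shape are assumptions for illustration. The IDs live only inside the function, and only page-to-page transition counts come out.

```python
from collections import Counter


def aggregate_sessions(events):
    """Fold (temp_id, path) events, given in arrival order, into ID-free counts.

    Returns a Counter mapping (from_path, to_path) to the number of times
    that transition was seen across all temporary IDs.
    """
    last_path = {}          # temp_id -> last path seen; discarded on return
    transitions = Counter()
    for temp_id, path in events:
        prev = last_path.get(temp_id)
        if prev is not None:
            transitions[(prev, path)] += 1
        last_path[temp_id] = path
    return transitions


events = [
    ("a", "/"), ("b", "/"),
    ("a", "/pricing"), ("b", "/pricing"),
    ("a", "/signup"),
]
counts = aggregate_sessions(events)
# counts[("/", "/pricing")] == 2 and counts[("/pricing", "/signup")] == 1
```

        Transitions with very low counts can still point at a single visitor, so this does not remove the need for trust by itself.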

        • harianus 194 days ago

          But with my solution in the main Ask HN I don't need any ID. So why should I not do it that way?

        • JadeNB 194 days ago

          I think that the problem is, while you have near-total control over the information you collect, and can carefully consider its interactions, you have no control over the interaction of your information with other publicly available information. For example, the famous AOL de-anonymisation did not (I think it is accurate to say) rely on any metadata attached to the queries, only on the queries themselves.

          • harianus 194 days ago

            While I can understand this being true for AOL:

            > The data includes personal names, addresses, social security numbers and everything else someone might type into a search box.

            I don't think I have a similar issue with the page views of one session on one website. I strip all query params and only save the hostname and path of the URL. I think it's nearly impossible to ever link that to a user. Maybe if you have very few users, but even then you don't get personal info.
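
            Stripping a URL down to hostname and path can be sketched with Python's standard `urllib.parse`. The function name `strip_url` is hypothetical, and this is not necessarily how the tool discussed here does it.

```python
from urllib.parse import urlsplit


def strip_url(url):
    """Keep only hostname and path; drop scheme, credentials, port, query and fragment."""
    parts = urlsplit(url)
    host = parts.hostname or ""   # .hostname is lowercased and excludes user:pass and port
    path = parts.path or "/"      # treat an empty path as the site root
    return host + path


print(strip_url("https://Example.com:8080/pricing/?token=abc#top"))
# prints "example.com/pricing/"
```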

            • chatmasta 192 days ago

              Why do you consider the path of the URL to be any less sensitive than the query parameters? Many websites use dynamic paths that may as well be query parameters.

              • harianus 192 days ago

                We are going way off topic here, too bad there is not even one answer to my question:

                > Do you think this is acceptable from a privacy perspective?

                But back to your point. Query params usually contain tokens, search queries, and IDs. This is not so much the case for paths. I think you agree with that. But indeed, paths can have sensitive information too.

                How would you prevent that data from being sent to my server?

                • chatmasta 192 days ago

                  Maybe you should allow the user to provide some sort of regex mask on URLs, or some sort of rule engine for which parts to keep or strip.
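
                  A rough sketch of such a mask, assuming the site owner supplies an ordered list of (pattern, replacement) rules. The rule set and the name `mask_path` are made up for illustration; in practice the masking would have to run in the tracking script, before anything is sent.

```python
import re

# Hypothetical default rules; a site owner would supply their own.
# Each rule replaces one whole path segment that looks like an identifier.
DEFAULT_RULES = [
    (re.compile(r"/\d+(?=/|$)"), "/:id"),  # purely numeric segments
    (re.compile(
        r"/[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}"
        r"-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}(?=/|$)"), "/:uuid"),
    (re.compile(r"/[^/]*@[^/]*(?=/|$)"), "/:email"),  # segments containing "@"
]


def mask_path(path, rules=DEFAULT_RULES):
    """Apply each (pattern, replacement) rule to the path, in order."""
    for pattern, replacement in rules:
        path = pattern.sub(replacement, path)
    return path


print(mask_path("/users/12345/orders/987"))
# prints "/users/:id/orders/:id"
```

                  Segments that merely start with digits, such as `/blog/2019-recap`, are left alone by these rules.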

        • harianus 193 days ago

          And how would it look from a privacy perspective if I set a cookie for 90 days? I can't link this to any personal information, and my customers will only see my tool, where they can see the conversions (they don't get access to the "link" in the tables above).

          • nartz 194 days ago

            Differential privacy.

            • harianus 194 days ago

              Not really, that is more for when you have sensitive data and want to show that data publicly. I want to have only insensitive data and make sure I don't get sensitive data from the visitors.