Show HN: Export HN Favorites to a CSV File

I wrote a short JavaScript snippet to export HN Favorites to a CSV file.

It runs in your browser, much like a browser extension would. It scrapes the HTML and navigates from page to page.

Setup and usage instructions are in the file.

Check out https://gabrielsroka.github.io/getHNFavorites.js, or view the source code, getHNFavorites.js, at https://github.com/gabrielsroka/gabrielsroka.github.io
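To give a flavor of the CSV half, here's a minimal standard-library Python sketch (the column names and sample rows are mine, purely for illustration; the real snippet is JavaScript and pulls the titles/URLs out of your favorites pages):

```python
import csv
import io

# Sample rows, shaped like what the scraper collects (not real favorites).
favorites = [
    {'title': 'Example story', 'url': 'https://example.com'},
    {'title': 'Another one', 'url': 'https://example.org'},
]

# Serialize them as CSV with a header row.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=['title', 'url'])
writer.writeheader()
writer.writerows(favorites)
print(buf.getvalue())
```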

240 points | by gabrielsroka 1453 days ago

9 comments

  • sbr464 1453 days ago
    • gabrielsroka 1453 days ago
      Ok, now that's really smart!

      Would a client using your API paginate to fetch, say, 50 pages? When I tried it using ?limit=50, I got a 504 error.

      Thanks!

      (Edit, never mind, I see you explain it in the readme.)

      • sbr464 1453 days ago
        I made it pretty quickly, so that limit would be too large: that's 50*30 items. You may need to provide an offset and make a few requests.
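        A sketch of the offset math (`offset` and `limit` here are illustrative parameter names, not necessarily what the API actually uses):

```python
def chunks(total_items, per_request=30):
    """Split one big fetch into (offset, limit) pairs so each
    request stays small, e.g. 50 pages * 30 items -> many requests."""
    return [(offset, per_request)
            for offset in range(0, total_items, per_request)]

# 90 items in chunks of 30 -> three requests
print(chunks(90))  # [(0, 30), (30, 30), (60, 30)]
```

        Each pair then maps to one small request instead of a single huge one that times out.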
    • wildduck 1453 days ago
      Interesting that it's using x-ray. It seems like x-ray is still using PhantomJS as the plugin; isn't PhantomJS deprecated? Could it use Puppeteer instead?
  • jaytaylor 1453 days ago
    This is cool, I love HN metadata, too :)

    Plug for a related golang tool I wrote and use which exports favorites upvotes as structured JSON:

    https://github.com/jaytaylor/hn-utils

    Just

        go get github.com/jaytaylor/hn-utils/...
  • simonw 1453 days ago
    It's a shame favorites aren't exposed in the official HN API: https://github.com/HackerNews/API - this is a smart workaround.
    • dang 1453 days ago
      Our plan is for the next version of HN's API to simply serve a JSON version of every page. I'm hoping to get to that this year.
      • simonw 1453 days ago
        That would be amazing!

        I've been having some fun with the API recently building this tool: https://github.com/dogsheep/hacker-news-to-sqlite

      • gitgud 1453 days ago
        Wow, that would make it really easy to implement an alternative HN client.

        Related Question: Is this the source code for HN? https://github.com/wting/hackernews

        • dang 1453 days ago
          That page says the code is copied from http://arclanguage.org/install, which is indeed an old version of the HN source code. It has changed a lot over the years though.
          • saagarjha 1453 days ago
            Any chance of some newer code being released, or the addition of a place to submit patches?
            • dang 1453 days ago
              I'd love to do that someday. But it would be a lot of work.
      • bhl 1453 days ago
        Are there plans for an export tool, e.g. a user downloading all their comments and upvoted submissions? I tend to use the submission upvote button more than the favorite one, and an export tool wouldn't require a user API key for non-private info.
      • death-by-ppt 1453 days ago
        Hi Dan,

        That's great news! Is there a way to be notified (eg, via email) when this comes out?

        Thanks.

        • dang 1453 days ago
          If you (or anyone) want to be on an alpha-tester list, email hn@ycombinator.com and we can add you. Send a username and make sure it has the email address you want to be notified at.
      • amjd 1453 days ago
        That's great! Is there any plan of exposing authenticated content through the API too? Mainly talking about upvoted stories.
    • gabrielsroka 1453 days ago
      Thanks Simon. I'd originally written this script to export my DVD ratings from Netflix, since there's no API for that either. It was easy to adapt it to HN.

      I wanted to show people that it's possible (and easy) to get to your own data!

  • dvfjsdhgfv 1453 days ago
    This is smart. I'm adding this to my HN favorites.
    • gabrielsroka 1453 days ago
      Thanks dvfjsdhgfv. If there's sufficient interest, I can easily turn it into a Chrome extension.

      (Edit: haha, I see what you just did there. A little recursive humor.)

  • catchmeifyoucan 1453 days ago
    This is great! My biggest problem was that I couldn't search through my upvoted items to find an article I'd liked again, so I used Google Custom Search and cleaned the data into flat URLs.

    https://www.heyraviteja.com/post/projects/deep-search-hn/

    • catchmeifyoucan 1453 days ago
      oops - I didn’t realize that favorites != upvoted.
  • zerop 1452 days ago
    Can this be done with the "Scrape Similar" Chrome plugin?
    • gabrielsroka 1452 days ago
      Thanks for the tip. I gave the "Scraper" extension a try, and 1) I got an error, 2) it only seems to scrape 1 page -- it doesn't paginate (or, did I miss something?).

      I used the jQuery selector `a.storylink`.
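      For the curious, `a.storylink` is a plain CSS selector, so jQuery isn't strictly required. Here's a standard-library Python sketch of the same match against a sample string (no network, purely illustrative):

```python
from html.parser import HTMLParser

class StoryLinks(HTMLParser):
    """Collect hrefs from <a class="storylink"> tags --
    the equivalent of the CSS selector `a.storylink`."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'a' and attrs.get('class') == 'storylink':
            self.links.append(attrs.get('href'))

parser = StoryLinks()
parser.feed('<a class="storylink" href="https://example.com">Example</a>')
print(parser.links)  # ['https://example.com']
```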

  • rtcoms 1451 days ago
    Is there any way to find the most-favorited items on HN?
  • app4soft 1453 days ago
    Could someone convert it to Python-script?
    • gabrielsroka 1453 days ago
      Part of the advantage of running JavaScript in your browser is that you might already be authenticated and it can use your session. But, fetching your HN favorites doesn't require authentication.

        #!/usr/bin/env python3
        import requests
        from bs4 import BeautifulSoup
      
        for p in range(1, 17):
            r = requests.get(f'https://news.ycombinator.com/favorites?id=app4soft&p={p}')
            s = BeautifulSoup(r.text, 'html.parser')
            print([{'title': a.text, 'url': a['href']} for a in s.select('a.storylink')])
      • app4soft 1452 days ago
        Thanks!

        One more question: what is the best way to stop it when it reaches the last page?

        > for p in range(1, 17):

        Actually, p=17 [0] is empty (p=16 is the maximum for now).

        Maybe the script should scrape pages from `1` to `infinity` UNTIL it detects the following message on the page [0]:

        > app4soft hasn't added any favorite submissions yet.

        [0] https://news.ycombinator.com/favorites?id=app4soft&p=17

        • gabrielsroka 1452 days ago
          In Python, `range(1, 17)` produces the numbers 1-16. I hard-coded it just for your favorites.

          A better way to solve it is to look at the `len()` of the results, and stop when it gets to 0:

            p = 1
            while True:
                r = requests.get(f'https://news.ycombinator.com/favorites?id=app4soft&p={p}')
                s = BeautifulSoup(r.text, 'html.parser')
                faves = [{'title': a.text, 'url': a['href']} for a in s.select('a.storylink')]
                if len(faves) == 0:
                    break
                print(faves)
                p += 1
          • app4soft 1452 days ago
            Great!
            • gabrielsroka 1452 days ago
              I think this one is a little cleaner. I used some of the ideas in sbr464's code.

                  path = 'favorites?id=app4soft'
                  while path:
                      r = requests.get('https://news.ycombinator.com/' + path)
                      s = BeautifulSoup(r.text, 'html.parser')
                      print([{'title': a.text, 'url': a['href']} for a in s.select('a.storylink')])
                      more = s.select_one('a.morelink')
                      path = more['href'] if more else None
      • death-by-ppt 1453 days ago
        Can someone convert it to Bel?
    • amjd 1453 days ago
      I had written a Python script to get saved (upvoted) links as JSON / CSV a few years ago: https://github.com/amjd/HN-Saved-Links-Export

      I'm not sure if it still works as it too relied on HTML scraping. Perhaps I should update it to support favorites too.

      Edit: Whoa, it's been 4 years already. I believe HN didn't have the favorites feature at the time; that's why I used upvoting as my bookmarking system and created a script to export that data.

      • gabrielsroka 1453 days ago
        @amjd, thanks for sharing. I upgraded it from Python 2 to Python 3, but when I ran it, I got a 404 error on the `saved` endpoint. Does it work for you?

        Edit: I see from the other PR it's called `upvoted`.

        Edit 2: I changed it to `upvoted` and now I get a 200 OK, but the code crashed right afterwards on `tree.cssselect()`.

        • amjd 1452 days ago
          It's four years old, so the page's HTML structure has likely changed. I'll take a look at it when I'm free and update the script. Thanks for your PR; I'll review and merge the two open PRs as well.

          Btw, good work on your JS solution. It's great that it just works without requiring any download or installation. :)

  • abdullahkhalids 1453 days ago
    What is HN's GDPR compliant way of requesting a copy of all stored data? Email dang?
    • tzs 1453 days ago
      Considering that there is no mention of GDPR in the HN FAQ or on the "legal" page, my guess is that their position is that GDPR does not apply.

      According to Article 3 of the GDPR, it applies to:

      1. Processing that takes place in the context of processors and controllers that are in the Union, regardless of whether or not the processing itself takes place in the Union.

      2. Processing the data of subjects who are in the Union by controllers or processors who are not in the Union if the processing is related to offering goods or services to such subjects in the Union or the processing is related to monitoring the behavior of such subjects that takes place in the Union.

      I don't know how HN is structured, but I've not seen any indication that they are in the Union, so #1 probably does not apply.

      #2 applies if they are doing processing related to "offering goods or services to such subjects in the Union" or "monitoring the behavior of such subjects that takes place in the Union".

      One of the recitals elaborates on the first branch of that:

      > In order to determine whether such a controller or processor is offering goods or services to data subjects who are in the Union, it should be ascertained whether it is apparent that the controller or processor envisages offering services to data subjects in one or more Member States in the Union. Whereas the mere accessibility of the controller’s, processor’s or an intermediary’s website in the Union, of an email address or of other contact details, or the use of a language generally used in the third country where the controller is established, is insufficient to ascertain such intention, factors such as the use of a language or a currency generally used in one or more Member States with the possibility of ordering goods and services in that other language, or the mentioning of customers or users who are in the Union, may make it apparent that the controller envisages offering goods or services to data subjects in the Union.

      Does HN "envisage" offering services to people in the Union? Or are they a site that is merely accessible from the Union without envisaging offering services there?

      There's a recital that elaborates on the second branch, too:

      > In order to determine whether a processing activity can be considered to monitor the behaviour of data subjects, it should be ascertained whether natural persons are tracked on the internet including potential subsequent use of personal data processing techniques which consist of profiling a natural person, particularly in order to take decisions concerning her or him or for analysing or predicting her or his personal preferences, behaviours and attitudes.

      Does the data HN stores about its users satisfy this? And if it does, is the behavior being monitored taking place in the Union?