Network visualization of 50k blogs and links

(graph.henryn.ca)

222 points | by ng-henry 9 days ago

24 comments

PaulHoule 9 days ago
People still upvote hairball graphs every time. Fortunately there is a cure:
https://cambridge-intelligence.com/how-to-fix-hairballs/
- ng-henry 9 days ago
  The hairball was much worse before. I used a lot of techniques from this paper [1] to make it look decent and a bunch of other heuristics based on other papers to make it look informative.
  [1] https://jgaa.info/accepted/2015/NocajOrtmannBrandes2015.19.2...
  - mikk14 9 days ago
    I'd give it a shot to make node embeddings with Node2Vec [1] and then reduce them to 2D with UMAP [2]. I think it could help break apart the hairball, assuming you have a nice clustered structure.
    [1] https://pytorch-geometric.readthedocs.io/en/latest/generated... [2] https://umap-learn.readthedocs.io/en/latest/
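    A minimal sketch of the walk-sampling half of that suggestion, assuming a toy undirected graph stored as an adjacency dict. This is not the PyTorch Geometric API from [1]; `biased_walk`, `p`, and `q` are illustrative names modeled on node2vec's return and in-out parameters. The walks would then go to a skip-gram model to produce embeddings, and UMAP [2] would reduce those to 2D; both steps are left out here.

```python
import random

def biased_walk(adj, start, length, p=1.0, q=1.0, rng=None):
    """Sample one node2vec-style second-order random walk.

    adj: dict mapping node -> set of neighbours (undirected toy graph).
    p: return parameter (higher p makes stepping back to the previous node less likely).
    q: in-out parameter (higher q keeps the walk close to the previous node).
    """
    rng = rng or random.Random()
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        neighbours = sorted(adj[cur])
        if not neighbours:
            break  # dead end
        if len(walk) == 1:
            # First step has no previous node, so pick uniformly.
            walk.append(rng.choice(neighbours))
            continue
        prev = walk[-2]
        weights = []
        for nxt in neighbours:
            if nxt == prev:
                weights.append(1.0 / p)   # step back to the previous node
            elif nxt in adj[prev]:
                weights.append(1.0)       # stay at distance 1 from the previous node
            else:
                weights.append(1.0 / q)   # move outward
        walk.append(rng.choices(neighbours, weights=weights, k=1)[0])
    return walk

# Toy graph (hypothetical): two triangles joined by the edge c-d.
adj = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"},
}
walk = biased_walk(adj, "a", length=8, p=1.0, q=2.0, rng=random.Random(0))
print(walk)
```

    With q > 1 the walks tend to stay inside a cluster, which is the property that would make the resulting embeddings separate clusters in a 2D layout.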
- tauchunfall 9 days ago
  Also edge bundling can help. See the papers by Benjamin Bach et al.
  https://aviz.fr/~bbach/confluentgraphs/
- 3abiton 9 days ago
  Is graph data processing considered a visualization technique? It changes the data, so how can this be considered "visualization"?
  - ng-henry 9 days ago
    We aren't changing the data, just changing how the graph is displayed.
  - omeid2 9 days ago
    The relationship of data is part of the data.
    Your visualisation tool may require it in a specific format, but it is still about properties of your data.
ng-henry 9 days ago
I scraped my favorite blogs and made a graph from the domains that each blog links to.
You can see clusters forming of websites that talk about similar topics, like crypto, rationality, Canada, India, and even postgres!
The visualization was made entirely in webgl with some neat optimizations to render that many lines and circles.
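As a rough illustration of "a graph from the domains that each blog links to", here is a sketch that collapses page-level links into weighted domain-to-domain edges. The input pairs, the `domain_of` and `domain_edges` helpers, and the example URLs are made up for illustration; the actual crawler and its data format are not described in this thread.

```python
from collections import Counter
from urllib.parse import urlparse

def domain_of(url):
    """Return the lowercased host of a URL, without a leading 'www.'."""
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host

def domain_edges(page_links):
    """Collapse (source_page, target_url) pairs into weighted domain edges."""
    edges = Counter()
    for source, target in page_links:
        src, dst = domain_of(source), domain_of(target)
        if src and dst and src != dst:  # skip empty hosts and internal links
            edges[(src, dst)] += 1
    return edges

# Hypothetical crawl output.
page_links = [
    ("https://blog-a.example/post/1", "https://www.blog-b.example/about"),
    ("https://blog-a.example/post/2", "https://blog-b.example/feed"),
    ("https://blog-a.example/post/2", "https://blog-a.example/post/1"),
    ("https://blog-b.example/hello", "https://blog-c.example/"),
]
edges = domain_edges(page_links)
for (src, dst), weight in sorted(edges.items()):
    print(f"{src} -> {dst} (links: {weight})")
```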
- jseliger 9 days ago
  This is very cool but also not accurate, at least for jakeseliger.com. Henryn.ca lists 0 links from jakeseliger.com to nytimes.com, reason.com, and numerous others that simple search demonstrates are linked to, for example: https://jakeseliger.com/?s=nytimes.com&submit=Search
  I put up many links posts, so I probably link to an abnormally large number of sites.
  - ng-henry 9 days ago
    Yep this is only for stuff that we've crawled, so we can't detect all of your links. Because we have limited crawling resources, we rate-limit the crawling by domain so we don't get stuck in spider traps.
    The current visualization only shows the current state of the crawl, so it won't know about all of the posts.
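    One way per-domain rate limiting could look, as a sketch only: the thread does not say whether the real crawler limits by time, by page count, or both. This version assumes a minimum delay between fetches to the same domain; `DomainRateLimiter`, `try_acquire`, and the example URLs are hypothetical names.

```python
from collections import defaultdict
from urllib.parse import urlparse

class DomainRateLimiter:
    """Allow at most one fetch per domain every `min_interval` seconds."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self.next_allowed = defaultdict(float)  # domain -> earliest next fetch time

    def try_acquire(self, url, now):
        """Return True and reserve the slot if `url` may be fetched at time `now`."""
        domain = urlparse(url).hostname or ""
        if now < self.next_allowed[domain]:
            return False  # too soon; the caller should requeue the URL
        self.next_allowed[domain] = now + self.min_interval
        return True

limiter = DomainRateLimiter(min_interval=10.0)
results = [
    limiter.try_acquire("https://trap.example/page?id=1", now=0.0),
    limiter.try_acquire("https://trap.example/page?id=2", now=1.0),
    limiter.try_acquire("https://other.example/", now=1.0),
    limiter.try_acquire("https://trap.example/page?id=3", now=10.0),
]
print(results)
```

    In a real crawler `now` would come from something like `time.monotonic()`. A site that generates endless pages can then only consume one fetch per interval, which is also why a heavily linked blog may look sparse in a partial crawl.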
- TuringNYC 9 days ago
  > I scraped my favorite blogs and made a graph from the domains that each blog links to.
  Nice analysis! However, I'm guessing these aren't your fav blogs as there are tens of thousands of entries! How did you decide which blogs to index? Did you use some central registry of blogs?
  - dredmorbius 9 days ago
    <https://news.ycombinator.com/item?id=40137135>
- nickjj 9 days ago
  Thanks a lot for including my site in your list. It was fun to see where it appeared on the map. It was pretty close to RealPython and GitHub.
- dameyawn 9 days ago
  Very neat! So you wrote the graph visualization UI? I see in a prior project you used cytoscape - any motivation for doing it yourself this time (vs one of the available libraries)?
  - ng-henry 9 days ago
    Yeah I used cytoscape before but it didn't have the full customization that I wanted. Besides the performance issues, there were some problems I couldn't have solved without a custom renderer:
    * if many lines overlap, how should their colors blend?
    * how to render circles so that they look nice both zoomed in / out
    * how to avoid it looking like a hairball graph [1]
    The nice thing about a personal project is that I can do whatever I like with no constraints, so I built one that's suited for this project and fits my tastes.
    [1] https://cambridge-intelligence.com/how-to-fix-hairballs/
- varenc 9 days ago
  This is a really cool project! I'd love to hear more about how you built the front end.
- gala8y 9 days ago
  serendipity heaven, but... how is this map of _your favorite_ blogs?
  - ng-henry 9 days ago
    I started off with my favorite blogs and recursively explored from there based on what they linked to.
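    A sketch of that kind of seed-based exploration, assuming a breadth-first order and a cap on the number of sites (neither is confirmed in the thread). The `outlinks` dict stands in for real crawling, and all site names are hypothetical.

```python
from collections import deque

def explore(seeds, outlinks, max_sites):
    """Breadth-first exploration from seed sites, following outbound links.

    outlinks: dict mapping site -> list of sites it links to (stand-in for a crawl).
    Returns sites in the order they were discovered, capped at max_sites.
    """
    seen = list(dict.fromkeys(seeds))[:max_sites]  # de-duplicate, keep order
    seen_set = set(seen)
    queue = deque(seen)
    while queue and len(seen) < max_sites:
        site = queue.popleft()
        for target in outlinks.get(site, []):
            if target not in seen_set:
                seen_set.add(target)
                seen.append(target)
                queue.append(target)
                if len(seen) >= max_sites:
                    break
    return seen

# Hypothetical link data.
outlinks = {
    "fav1.example": ["a.example", "b.example"],
    "fav2.example": ["b.example", "c.example"],
    "a.example": ["d.example"],
    "d.example": ["fav1.example"],
}
all_sites = explore(["fav1.example", "fav2.example"], outlinks, max_sites=10)
capped = explore(["fav1.example", "fav2.example"], outlinks, max_sites=3)
print(all_sites)
print(capped)
```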
    - gala8y 9 days ago
      ok, got it.
erikig 9 days ago
Reminds me a little of my fav sub-reddit browsing tool - https://anvaka.github.io/map-of-reddit/
One nice feature that would be helpful is the ability to preview the blog.
- imdsm 9 days ago
  Oh dear, I zoomed in to the Island of Debauchery
jmmv 9 days ago
Reminds me of the intern project I worked on at Google back in 2008.
My mentor at the time had a traceroute dataset of the Internet and wanted to render it on top of Google Maps. I implemented a MapReduce algorithm that geolocated the data points and then produced Google Maps tiles at various zoom levels to show how the Internet was connected. It was pretty cool to visualize how the data flowed throughout the world and to be able to "dig deeper" by zooming into the mess of connections. Very similar to what this project does!
The project didn't go anywhere but it was a cool fun experiment and a great learning opportunity for me (S2 geometry is... well, weird, but touching MapReduce and Bigtable were invaluable exercises for my later tenure at the company). Those were very different times. I don't think you would be able to pursue such a "useless" project as an intern at Google these days.
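For readers curious what "producing tiles at various zoom levels" involves, here is a small sketch of the bucketing step: mapping geolocated points to tile coordinates and counting points per tile. It assumes the common Web Mercator XYZ tile scheme; the comment above does not say which tiling the original project used, and `tile_for`, `count_per_tile`, and the sample coordinates are illustrative only.

```python
import math
from collections import Counter

def tile_for(lat_deg, lon_deg, zoom):
    """Return (x, y) of the Web Mercator tile containing a point at a zoom level."""
    n = 2 ** zoom  # number of tiles along each axis
    lat = math.radians(lat_deg)
    x = int((lon_deg + 180.0) / 360.0 * n)
    y = int((1.0 - math.asinh(math.tan(lat)) / math.pi) / 2.0 * n)
    # Clamp so points on the map edges stay in range.
    return min(max(x, 0), n - 1), min(max(y, 0), n - 1)

def count_per_tile(points, zoom):
    """Group (lat, lon) points by tile; roughly the 'map' half of a map/reduce job."""
    return Counter(tile_for(lat, lon, zoom) for lat, lon in points)

# Approximate coordinates for London, New York, and Sydney.
points = [(51.5, -0.12), (40.7, -74.0), (-33.9, 151.2)]
counts = count_per_tile(points, zoom=1)
print(counts)
```

Running the same grouping at each zoom level gives the per-tile data that a renderer would then draw, so zooming in splits one crowded tile into four less crowded ones.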
- pmayrgundter 9 days ago
  Howdy sir! Great memory and fun times :)
  Dataset was something from CAIDA, like this: https://www.caida.org/catalog/datasets/ipv4_prefix_probing_d...
  IIRC we used the LGL algorithm (https://lgl.sourceforge.net/) while pinning any nodes we could get geolocations for, giving a nice hybrid geo/topo layout
  I don't remember exactly how we got the geolocations, but often network routers have 3-letter airport codes in their DNS names, so maybe that? We may also have had a lookup table in el googz somewhere
  Definitely a project whose time should again come! ;)
  - jmmv 9 days ago
    Oh that’s right! We had to use LGL in addition to geolocation to lay out points without data. And yeah, the geolocation came from some internal service.
    Thanks for chiming in. Good times!
bhartzer 9 days ago
This is very similar to Majestic's Link Graph where you can put in any domain name and see all the links, up to tier 5, that link to that domain name.
amadeuspagel 9 days ago
I think it would be cool if the search results were also visualized as a network.
jszymborski 9 days ago
My blog is on here, but as a lonely, lonely node. I link to stuff I promise!
- ng-henry 9 days ago
  We just haven't crawled your site yet! There's a lot of links so we can't crawl them all :(
  - ploum 8 days ago
    Well, the results are quite strange. Mine (ploum.net) is said to have 7 links to youdoblog, which is false: there’s only one link to that website in all my 900 blog posts.
- etimberg 9 days ago
  Same. Honestly a bit surprised, in a good way, that my site made it. I didn’t think anyone read it
  - ploum 9 days ago
    Mine is listed as "personal growth" and I find it quite funny (that’s probably the last category I would think for my blog posts)
jll29 9 days ago
This is a neat idea - however, I think the graphical view of the blog graph trades "coolness" for "utility".
Have you thought of a front end that is basically just text/plain HTML (in normal size) + navigation links to explore the blogs in one frame, and the currently chosen blog in another frame? That way, you could look at the blogs while travelling your crawl graph, a kind of "blog explorer".
montyanderson 9 days ago
Reminds me of my friend's visualisation of tracks on the popular London station NTS https://www.barneyhill.com/pages/nts-tracklists/. Turns out a lot of cool artists like the same tracks... ;)
rcarmo 9 days ago
Hmmm. My site is listed, but I have _way_ more inbound and outbound links than shown.
And I have my own internal links visualization, which might be a bit over the top (GPU recommended): https://taoofmac.com/static/graph
- ng-henry 9 days ago
  See this comment I posted in another thread:
  Yep this is only for stuff that we've crawled, so we can't detect all of your links. Because we have limited crawling resources, we rate-limit the crawling by domain so we don't get stuck in spider traps. The current visualization only shows the current state of the crawl, so it won't know about all of the posts.
Avamander 9 days ago
Awesome, I've always wanted to build something like that on top of YaCy just so that I could properly select new potentially interesting sites to index. (I can't rely on the auto-index unfortunately because it has no option to pre-confirm before indexing.)
CalRobert 9 days ago
This is only tangentially related, but has anyone done similar for HN comments? I'd be curious to know who responds to whom on particular topics, etc....
anfractuosity 9 days ago
Cool, I'm just wondering how come some nodes don't have any lines to/from them, does that mean they came from an initial seed list?
aendruk 9 days ago
Safari on iOS 17.4.1 is consistently crashing with “A problem repeatedly occurred […] Webpage Crashed”
abalaji 9 days ago
This is neat, found my blog in there. Don't think I linked to NatGeo at any point, though.
adithyabalaji.com
- ng-henry 9 days ago
  Nice! I looked through the logs and saw that you linked to it in this article: https://www.adithyabalaji.com/datascience/2021/05/17/Analyzi...
  - abalaji 9 days ago
    ah, you’re right!
ibaikov 9 days ago
Scraped 10k blogs some time ago. Only like 20 of them had an /ideas page, sad :(
- mixedmath 9 days ago
  What does an /ideas page mean to you?
system2 9 days ago
Awesome. I will spend many hours looking at it today. Thanks a lot.
nexuist 9 days ago
This is really awesome work! How did you classify so many links?
- ng-henry 9 days ago
  To get their topics? I used a basic Louvain community detection algorithm, then put all the URLs into GPT with some few-shot prompting tricks to get it to output a particular topic. There are some heuristics in there too to break up giant communities and combine small ones.
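  For context, Louvain works by greedily moving nodes between communities to increase a score called modularity. The sketch below only computes that score for a given partition of a toy graph; it is not the Louvain algorithm itself and not the code used for this project, and it assumes an undirected, unweighted graph without self-loops. The GPT labeling step is omitted.

```python
def modularity(edges, community_of):
    """Modularity Q of a partition of an undirected, unweighted graph.

    edges: list of (u, v) pairs, each undirected edge listed once, no self-loops.
    community_of: dict mapping node -> community label.
    """
    m = len(edges)
    internal = {}  # community -> number of edges with both ends inside it
    degree = {}    # community -> sum of the degrees of its nodes
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree[cu] = degree.get(cu, 0) + 1
        degree[cv] = degree.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    return sum(
        internal.get(c, 0) / m - (degree[c] / (2 * m)) ** 2
        for c in degree
    )

# Toy graph (hypothetical): two triangles joined by the single edge 2-3.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
split = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
lumped = {node: "A" for node in range(6)}

q_split = modularity(edges, split)
q_lumped = modularity(edges, lumped)
print(q_split, q_lumped)
```

  Splitting the two triangles scores higher than lumping everything together, which is the kind of comparison a modularity-based method uses to decide where clusters are.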
  - internetter 9 days ago
    Interesting, I was curious what I would be categorized as and it's "Whistleblowing and Leaks", which I do suppose is what my content has lately been to some extent but it was funny to see that written out.
    My question for you is how can I see what sites link to me, as opposed to what sites I link to?
    - dougunplugged 9 days ago
      Sites like ahrefs.com, semrush.com and majestic.com can show what sites link to your site. Looks like your site boehs.org is pretty popular with thousands of backlinks. Your top linked pages are: https://boehs.org/node/everything-i-know-about-the-xz-backdo... https://boehs.org/node/truth-social https://boehs.org/node/npm-everything
denvaar 9 days ago
Would be interested in seeing an article about creating this.
hanniabu 9 days ago
Surprised to see litprotocol at the same level as etherscan
JohnKemeny 9 days ago
Any chance of getting a copy of the underlying dataset?
nhggfu 9 days ago
some clustering would be nice.
gdelfino01 9 days ago
citizenfreepress.com is missing. That one will add a lot of edges.
nerdl0ve_kr 9 days ago
what's the point of this?
- gverrilla 9 days ago
  This video about a wikipedia graph might give you some ideas: https://www.youtube.com/watch?v=JheGL6uSF-4