I spent three months building a data pipeline that parses and transforms raw data from the GDELT Project and applies a machine learning model to detect breaking news stories. It turned out to be a pretty nifty tool for building visualizations and data mining applications.
So, I built an API around it and released it on Rapid API today.
I'm hoping my pricing isn't too high. It's currently set so that a Tier 1 sub can build the marketing site I linked, and Tier 2 subs can access historical data as well. But I also hope to create a steady income so I can work on other projects in the future.
Hope you all like it, and please send your comments and critiques.
As someone who has built event pipelines from GDELT data, I'd advise folks not to start a GDELT project with high expectations. Not only are there plenty of duplicates, but tagging quality is all over the place, especially for non-Western media.
Agreed, it's very hit or miss in some situations. I'm hoping to keep evolving this to clean it up and eke out more salient information.
I think GDELT is the best source; nothing else comes close in coverage, in my opinion. It's just that all the articles it gathers revolve around world events, as opposed to the millions of topics of discussion and niche interests people post about.
I agree that GDELT is the best source. It's a hard problem, for sure. I just think that their project's mission and description may lead folks to believe it's this perfect data set, which it isn't.
The one nice thing about GDELT is that it translates non-English news, so you can use an English search term and see articles in different languages. It's not perfect (it's machine translation), but I'm not aware of any other service that provides similar functionality.

More info: https://blog.gdeltproject.org/gdelt-translingual-translating...

(unrelated: scrolling on that page is horrible)
What are you using instead of GDELT now?

That was with a previous org, so thankfully not something I have to worry about now.
Our domain was pretty well defined, and we weren't building a real-time system, so we developed an adjudication tool and hired a few part time employees to evaluate and clean up our output.
Asking on behalf of Rhys Darby. https://www.youtube.com/watch?v=HynsTvRVLiI

I don't know what to say except: apologies. It is on there, and you can drag the map around horizontally, though that feature isn't very apparent at the moment.
I've never heard of RapidAPI before, but it seems kind of scammy. As in: nice idea, but just filled with low-quality spam APIs wrapping presumably free or cheaper services. Not that RapidAPI itself is a scam.
But then, Amazon's awash with tat too these days, and I wouldn't be surprised at anyone selling their product there.
Yeah, Rapid API is quite good in itself, but there are a lot of trash APIs on there.
I see it as a good way to test the waters. If people are interested in it, I'll eventually move it off of Rapid API for direct access and use Stripe for subscriptions.
https://autocode.com/lib/url/temporary/

We're ostensibly an automation platform, but our core technology around Connector APIs automatically generates docs for you and does a whole ton of other neat things - like letting your API show up in autocompletion dropdowns for Autocode users. We're a Stripe-backed company; if you haven't heard of us, it's because we rebranded / relaunched in July of this year. Disclaimer: I'm the founder. :)
Hey there, first off: cool stuff. I can see this being plenty useful for various projects. We're exploring GDELT data for our own needs, and I was wondering if you wouldn't mind sharing some of the rough spots or gotchas you hit using the project?
Thanks! There are a lot of issues with GDELT data, but the things that come to mind recently are:
- Missing keywords. Huge topics like "brexit" won't be found, so you need to extract those yourself.
- Dirty titles. You need to extract article titles from the website metadata, and more often than not sites prepend or append their name, like "[title] - DailyStormer", or tack on links like "[title]: Business, Stocks, News", etc. That requires NLP to fix.
- Non-standardized locations. Location tags are all over the place, so "united_states" might appear under many tags like "us", "usa", "america", "the_states", etc. I'm still working on combining these tags in the API.

https://www.splcenter.org/fighting-hate/extremist-files/indi...
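To make the last two points concrete, here's a rough sketch of the kind of title cleanup and tag normalization I mean - the delimiters, the length cutoff, and the alias table are illustrative assumptions, not the actual pipeline code:

```python
def clean_title(raw_title: str) -> str:
    """Strip a trailing site name like '[title] - DailyStormer'."""
    for sep in (" - ", " | "):
        head, _, tail = raw_title.rpartition(sep)
        # Only strip when the tail is short enough to look like a site name.
        if head and len(tail) <= 30 and len(tail) < len(head):
            return head.strip()
    return raw_title.strip()

# Collapse location tag variants onto one canonical tag.
LOCATION_ALIASES = {
    "us": "united_states",
    "usa": "united_states",
    "america": "united_states",
    "the_states": "united_states",
}

def normalize_location(tag: str) -> str:
    tag = tag.lower()
    return LOCATION_ALIASES.get(tag, tag)

clean_title("Markets rally after rate decision - DailyBugle")
# -> "Markets rally after rate decision"
normalize_location("USA")  # -> "united_states"
```

In practice the alias table has to be built by hand (or semi-automatically), which is why combining the tags is still a work in progress.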
Okay so the data is messy and they have a limited set of acknowledged keywords and entity types so will have to work those out ourselves. Got it. That is super helpful information thank you!
For some uses (if accuracy is more important than speed), ICEWS is available as periodic dumps on Dataverse. They use CAMEO for event encoding, which - if imperfect - is at least used elsewhere.

https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi...

https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi...
What are the 1000+ news sources? Is there a list of publications? e.g. New York Times, The Australian, Bangkok Post... ?
I understand your data comes from GDELT. I'm new to that and all I could glean from visiting the GDELT site at first go was that they have access to a lot of historical data sources - but I couldn't see a list of 1000+ current news publications. Is there a list somewhere? Are all the sources typographical, or might some be radio/tv? Thank you.
They're all online written news sources, but I believe GDELT is also getting into TV news so that might be coming in the future.
There is no official list currently, but you could query either GDELT or TidalWaves to see which are available in the data sets.
I could add a page to the marketing website which displays all of them, instead of the top sources for the latest news - would that be something useful for people?
Thanks for replying. I'm not sure what is useful yet - first of all I am trying to understand the scope of the data sources and the methodology for selections displayed in TidalWaves.
I would like to see the list of publications if that is possible.
I am wondering what "trending stories" means in TidalWaves? For example, if the NYT at any point contains 500 stories, and is refreshed say 3 times every 24 hours, with say 100 stories aged off and 100 new stories added each addition, what is a trending story in the NYT? Does the front page headline count more than a page 27 minor traffic report?
How are the top 5 trending stories for the USA derived?
I just did a quick query grouping all sources and number of articles per source in the database (currently about 4 days worth of data): https://pastebin.com/HyktFpML
As you can see, there are already almost 8,000 sources. I'm honestly not sure how GDELT scrapes its data, but they've been doing this for over five years now, and it's backed by Google Cloud processing, so I'm sure it's ever expanding and very thorough.
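The query itself is just a group-and-count; in Python terms it's roughly this (the field names are hypothetical):

```python
from collections import Counter

# Hypothetical rows from the articles table; real records have many more fields.
articles = [
    {"source": "nytimes.com", "title": "..."},
    {"source": "bbc.co.uk", "title": "..."},
    {"source": "nytimes.com", "title": "..."},
]

# Count articles per source, most prolific sources first.
counts = Counter(a["source"] for a in articles)
for source, n in counts.most_common():
    print(f"{source}\t{n}")
```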
The kinds of articles which are discovered revolve around world events, however. So you won't see random blog posts, or traffic reports. Instead you'll see geopolitics, big tech news, and cultural discourse.
Trending stories are found for the entire set of articles in a batch, with no separation for source, category, or location. If the top stories all happen to be in the USA in a given moment, then that's all you'll find. If NYT happens to publish an article that fits an overall story being talked about in the rest of the batch, it'll get added to that story.
Trending popularity is decided by how many individual sources talk about the given topic. This is good enough to identify the Zeitgeist of a moment, but it likely won't catch nuances in evolving stories.

Some GDELT methodology is extolled here:

https://blog.gdeltproject.org/gdelt-2-0-our-global-world-in-...

Which ISO standards is this statement referring to?
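To make the ranking concrete, the idea in code looks roughly like this - the story assignment (clustering articles into stories) is assumed to have already happened, and the field names are made up:

```python
from collections import defaultdict

def rank_stories(articles):
    """Rank stories by the number of *distinct* sources covering them,
    not by raw article count."""
    sources_per_story = defaultdict(set)
    for a in articles:
        sources_per_story[a["story_id"]].add(a["source"])
    return sorted(sources_per_story.items(),
                  key=lambda kv: len(kv[1]), reverse=True)

articles = [
    {"story_id": "lagos-protests", "source": "bbc.co.uk"},
    {"story_id": "lagos-protests", "source": "aljazeera.com"},
    {"story_id": "lagos-protests", "source": "bbc.co.uk"},  # same source again
    {"story_id": "relief-bill", "source": "nytimes.com"},
]
ranking = rank_stories(articles)
# "lagos-protests" ranks first: two distinct sources vs. one.
```

Counting distinct sources rather than articles is what keeps one outlet's republishing from inflating a story.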
Curious, are you getting _all_ your data from GDELT, or are you also scraping from news sites?
I've never heard of GDELT before, but I've found news websites to have incredibly tight terms of service, which prevent you from keeping a database of their articles offline, regardless of whether or not you distribute the articles.
Nice project. I actually did pretty much the same a long time ago: an API around GDELT data.
Your map looks fantastic, btw.
The problem with GDELT is that they don't have that much coverage. I discovered it when we began to build our own News API [1].
I failed to sell this data over an API; I think you'll see much more interest in an application. An API like that is a bit complex, and almost all potential clients who really need GDELT are already parsing this data from BigQuery, Redshift, or GDELT's dump files.
Anyway, check out our solution; maybe we could collaborate somehow. The map is cool! Feel free to reach me at artem [at] newscatcherapi [dot] com.

[1] https://newscatcherapi.com
Thanks! While I agree GDELT doesn't have the kind of coverage some people may be looking for, I think it's the best source for world news insights.
Oh nice, I actually saw Newscatcher on Rapid API but didn't see the marketing page. It looks like a great product, and I think suited for large-scale solutions. TidalWaves will likely find a niche for people making visualizations and personal apps.
What I want to do is make a mature pipeline that handles at least 90% of what amateur and solo users would want to use GDELT for, so hopefully people see the value in that. After all, I could make this visualization using the API - so I'm sure people can think of many other cool uses for it.
Also, thanks for reaching out, I'd definitely like to collaborate in the coming months.
No, but that actually sounds like a good idea. If there are a lot of people using this data along with WikiData, or if there's a kind of standard for "big" encyclopedic data, then I'd definitely like to support it.
If you make the data available as RDF (which for JSON mostly just involves creating a JSON-LD context), it becomes very easy for others to integrate it with data from other sources - that's one of the key challenges Semantic Web technology is designed to solve.
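For a JSON API, that mostly means publishing a context document mapping your field names onto shared vocabulary terms, e.g. something like this against schema.org (the field names on the left are just placeholders for whatever the API actually returns):

```json
{
  "@context": {
    "schema": "https://schema.org/",
    "title": "schema:headline",
    "url": { "@id": "schema:url", "@type": "@id" },
    "source": "schema:publisher",
    "published": { "@id": "schema:datePublished", "@type": "schema:Date" }
  }
}
```

Clients that don't care about RDF can ignore the `@context` key entirely, so it's a low-cost addition.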
Not too interested in the most popular topics, but in the obscure stuff - like why some department in government has changed something related to a topic, where there are a couple of journos who have been monitoring/reporting on that department and so, over time, have a high count related to that topic.
Yep, currently the breaking stories are identified in 15-minute intervals, so the duplicate stories showing up are from different times. If they have the same title, that means I can train the model to combine them.
They could also be press releases, considering the same title was found for multiple articles over the past 30-45 minutes. The data pipeline currently removes duplicate titles, but only inside single batches. So it could be a long, sustained press release... like propaganda.
As of writing it shows three different stories though:
- 12:15 Will there be a coronavirus relief bill with $1,200 checks?
- 12:30 Nigeria protesters break curfew amid gunfire, chaos in Lagos
- 12:30 Google antitrust case to turn on how search engine grew dominant

Maybe what you're seeing is a bug in the website client; I'll investigate it.

Finding duplicate stories that have different texts and titles sounds like something you should use AI for.
I've used trigram matching on article title strings with some success for comparing the similarity of news articles. If the similarity score is greater than 0.6, they are probably about the same topic.
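A minimal version of that, using Jaccard similarity over character trigram sets (the padding and the exact threshold are choices you'd want to tune):

```python
def trigrams(s: str) -> set:
    s = " " + s.lower() + " "  # pad so word boundaries contribute trigrams
    return {s[i:i + 3] for i in range(len(s) - 2)}

def trigram_similarity(a: str, b: str) -> float:
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)  # Jaccard index over the two sets

# Near-duplicate titles score high; unrelated titles score near zero.
same = trigram_similarity(
    "Nigeria protesters break curfew amid gunfire in Lagos",
    "Nigerian protesters break curfew amid gunfire, chaos in Lagos",
)
different = trigram_similarity(
    "Nigeria protesters break curfew amid gunfire in Lagos",
    "Google antitrust case to turn on how search engine grew dominant",
)
```

Postgres users get this for free via the pg_trgm extension, which is handy if the titles are already in a database.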
Most news articles mention multiple locations. It's probable that those pieces talked about both Pakistan/Hungary and Poland, so they will be found under both countries.
The locations are ordered by salience (proximity to start of the article, number of times mentioned), but not completely filtered, so on this specific visualization you will see a lot of matched locations which you wouldn't consider part of the article, but were mentioned in passing.
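As a toy version of that salience scoring (the weighting here is made up for illustration, not the real formula):

```python
def location_salience(text: str, location: str) -> float:
    """Score a location by how early it first appears in the article
    and how often it's mentioned. Illustrative weighting only."""
    text_l, loc_l = text.lower(), location.lower()
    first = text_l.find(loc_l)
    if first == -1:
        return 0.0
    # Earlier first mention -> score closer to 1; each mention adds a bit.
    proximity = 1.0 - first / len(text_l)
    return proximity + 0.1 * text_l.count(loc_l)

article = ("Poland signed the agreement in Warsaw on Tuesday. Officials from "
           "Hungary and Pakistan were mentioned in passing near the end, "
           "while Poland's delegation stayed on.")
# Poland appears first and twice, so it outranks the passing mentions.
```

A score like this orders the locations, but a mentioned-in-passing country still gets a nonzero score, which is exactly why it shows up on the map.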
Trying to understand the content of an article is still really, really difficult, and error prone.
Tone is the author's wording of the article that conveys their feelings towards the subject. The lower the tone/sentiment score, the more negative the article.
These scores are averaged for each spot on the map, showing the associated color.
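In other words, something like this (the field names and score values are assumptions on my part):

```python
from collections import defaultdict

def average_tone(articles):
    """Average the per-article tone scores for each map location."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for a in articles:
        sums[a["location"]] += a["tone"]
        counts[a["location"]] += 1
    return {loc: sums[loc] / counts[loc] for loc in sums}

articles = [
    {"location": "lagos", "tone": -4.0},      # negative coverage
    {"location": "lagos", "tone": -3.0},
    {"location": "washington", "tone": 1.0},  # mildly positive coverage
]
tones = average_tone(articles)
# {"lagos": -3.5, "washington": 1.0} -> color each spot by its average
```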
This is cool @prohobo! May I ask how you classify tone? I have a completely unrelated project (an enterprise app with thousands of documents) that I'd like to apply it to. TIA.
I can't say specifically which algorithms are run on it, but GDELT runs sentiment analysis (https://en.wikipedia.org/wiki/Sentiment_analysis) on each article, and they have a data set called the GKG that's all about emotion mining.