I spent three months building a data pipeline that parses and transforms raw data from the GDELT Project and applies a machine learning model to detect breaking news stories. It turned out to be a pretty nifty tool for building visualizations and data mining applications.
So, I built an API around it and released it on Rapid API today.
I'm hoping my pricing isn't too high. It's currently set so that a Tier 1 sub can build the marketing site I linked, and Tier 2 subs can access historical data as well. But I also hope to create a steady income so I can work on other projects in the future.
Hope you all like it, and please send your comments and critiques.
As someone who has built event pipelines from GDELT data, I'd advise folks not to start a GDELT project with high expectations. Not only are there plenty of duplicates, but tagging quality is all over the place, especially for non-Western media.
Agreed, it's very hit or miss in some situations. I'm hoping to keep evolving this to clean it up and eke out more salient information.
I think GDELT is the best source; nothing else comes close in coverage, in my opinion. It's just that all the articles it gathers revolve around world events, as opposed to the millions of topics of discussion and niche interests people post about.
I agree that GDELT is the best source. It's a hard problem, for sure. I just think that their project's mission and description may lead folks to believe it's this perfect data set, which it isn't.
The one nice thing about GDELT is that it translates non-English news, so you can use an English search term and see articles in different languages. It's not perfect (it's machine translation), but I'm not aware of any other service that provides similar functionality.

More info: https://blog.gdeltproject.org/gdelt-translingual-translating...

(unrelated: scrolling on that page is horrible)
What are you using instead of GDELT now?

That was with a previous org, so thankfully not something I have to worry about now.
Our domain was pretty well defined, and we weren't building a real-time system, so we developed an adjudication tool and hired a few part time employees to evaluate and clean up our output.
Asking on behalf of Rhys Darby. https://www.youtube.com/watch?v=HynsTvRVLiI

I don't know what to say except: apologies. It is on there, and you can drag the map around horizontally, though that feature isn't very apparent at the moment.
I've never heard of RapidAPI before, but it seems kind of scammy. As in: nice idea, but just filled with low-quality spam APIs wrapping presumably free or cheaper services. Not that RapidAPI itself is a scam.
But then, Amazon's awash with tat too these days, and I wouldn't be surprised at anyone selling their product there.
Yeah, Rapid API is quite good in itself, but there are a lot of trash APIs on there.
I see it as a good way to test the waters. If people are interested in it, I'll eventually move it off of Rapid API for direct access and use Stripe for subscriptions.
https://autocode.com/lib/url/temporary/

We're ostensibly an automation platform, but our core technology around Connector APIs automatically generates docs for you and does a whole ton of other neat things - like letting your API show up in autocompletion dropdowns for Autocode users. We're a Stripe-backed company; if you haven't heard of us, it's because we rebranded / relaunched in July of this year. Disclaimer: I'm the founder. :)
Hey there, first off: cool stuff. I can see this being plenty useful for various projects. We're exploring GDELT data for our own needs, and I was wondering if you wouldn't mind sharing some of the rough spots or gotchas you hit using the project?
Thanks! There are a lot of issues with GDELT data, but the things that come to mind recently are:
- Missing keywords. Huge topics like "brexit" won't be found, so you need to extract those yourself.
- Dirty titles. You need to extract article titles from the website metadata, and more often than not sites prepend or append their name, like "[title] - DailyStormer", or tack on links like "[title]: Business, Stocks, News", etc. That requires NLP to fix.
- Non-standardized locations. Location tags are all over the place, so "united_states" might appear under many tags like "us", "usa", "america", "the_states", etc. I'm still working on combining these tags in the API.

https://www.splcenter.org/fighting-hate/extremist-files/indi...
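To make the last two points concrete, here's a rough sketch of the kind of title cleanup and tag normalization I mean - the delimiters, the length cutoff, and the alias table are illustrative assumptions, not the actual pipeline code:

```python
def clean_title(raw_title: str) -> str:
    """Strip a trailing site name like '[title] - DailyStormer'."""
    for sep in (" - ", " | "):
        head, _, tail = raw_title.rpartition(sep)
        # Only strip when the tail is short enough to look like a site name.
        if head and len(tail) <= 30 and len(tail) < len(head):
            return head.strip()
    return raw_title.strip()

# Collapse location tag variants onto one canonical tag.
LOCATION_ALIASES = {
    "us": "united_states",
    "usa": "united_states",
    "america": "united_states",
    "the_states": "united_states",
}

def normalize_location(tag: str) -> str:
    tag = tag.lower()
    return LOCATION_ALIASES.get(tag, tag)

clean_title("Markets rally after rate decision - DailyBugle")
# -> "Markets rally after rate decision"
normalize_location("USA")  # -> "united_states"
```

In practice the alias table has to be built by hand (or semi-automatically), which is why combining the tags is still a work in progress.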
Okay so the data is messy and they have a limited set of acknowledged keywords and entity types so will have to work those out ourselves. Got it. That is super helpful information thank you!
For some uses (if accuracy is more important than speed), ICEWS is available as periodic dumps on Dataverse. They use CAMEO for event encoding, which - if imperfect - is at least used elsewhere.

https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi...

https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi...
What are the 1000+ news sources? Is there a list of publications? e.g. New York Times, The Australian, Bangkok Post... ?
I understand your data comes from GDELT. I'm new to that and all I could glean from visiting the GDELT site at first go was that they have access to a lot of historical data sources - but I couldn't see a list of 1000+ current news publications. Is there a list somewhere? Are all the sources typographical, or might some be radio/tv? Thank you.
They're all online written news sources, but I believe GDELT is also getting into TV news so that might be coming in the future.
There is no official list currently, but you could query either GDELT or TidalWaves to see which are available in the data sets.
I could add a page to the marketing website which displays all of them, instead of the top sources for the latest news - would that be something useful for people?
Thanks for replying. I'm not sure what is useful yet - first of all I am trying to understand the scope of the data sources and the methodology for selections displayed in TidalWaves.
I would like to see the list of publications if that is possible.
I am wondering what "trending stories" means in TidalWaves? For example, if the NYT at any point contains 500 stories, and is refreshed say 3 times every 24 hours, with say 100 stories aged off and 100 new stories added each addition, what is a trending story in the NYT? Does the front page headline count more than a page 27 minor traffic report?
How are the top 5 trending stories for the USA derived?
I just did a quick query grouping all sources and number of articles per source in the database (currently about 4 days worth of data): https://pastebin.com/HyktFpML
As you can see, there are already almost 8,000 sources. I'm honestly not sure how GDELT scrapes its data, but they've been doing this for over five years now, and it's backed by Google Cloud processing, so I'm sure it's ever expanding and very thorough.
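The query itself is just a group-and-count; in Python terms it's roughly this (the field names are hypothetical):

```python
from collections import Counter

# Hypothetical rows from the articles table; real records have many more fields.
articles = [
    {"source": "nytimes.com", "title": "..."},
    {"source": "bbc.co.uk", "title": "..."},
    {"source": "nytimes.com", "title": "..."},
]

# Count articles per source, most prolific sources first.
counts = Counter(a["source"] for a in articles)
for source, n in counts.most_common():
    print(f"{source}\t{n}")
```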
The kinds of articles which are discovered revolve around world events, however. So you won't see random blog posts, or traffic reports. Instead you'll see geopolitics, big tech news, and cultural discourse.
Trending stories are found for the entire set of articles in a batch, with no separation for source, category, or location. If the top stories all happen to be in the USA in a given moment, then that's all you'll find. If NYT happens to publish an article that fits an overall story being talked about in the rest of the batch, it'll get added to that story.
Trending popularity is decided by how many individual sources talk about the given topic. This is good enough to identify the Zeitgeist of a moment, but it likely won't catch nuances in evolving stories.

Some GDELT methodology is extolled here:

https://blog.gdeltproject.org/gdelt-2-0-our-global-world-in-...

Which ISO standards is this statement referring to?
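To make the ranking concrete, the idea in code looks roughly like this - the story assignment (clustering articles into stories) is assumed to have already happened, and the field names are made up:

```python
from collections import defaultdict

def rank_stories(articles):
    """Rank stories by the number of *distinct* sources covering them,
    not by raw article count."""
    sources_per_story = defaultdict(set)
    for a in articles:
        sources_per_story[a["story_id"]].add(a["source"])
    return sorted(sources_per_story.items(),
                  key=lambda kv: len(kv[1]), reverse=True)

articles = [
    {"story_id": "lagos-protests", "source": "bbc.co.uk"},
    {"story_id": "lagos-protests", "source": "aljazeera.com"},
    {"story_id": "lagos-protests", "source": "bbc.co.uk"},  # same source again
    {"story_id": "relief-bill", "source": "nytimes.com"},
]
ranking = rank_stories(articles)
# "lagos-protests" ranks first: two distinct sources vs. one.
```

Counting distinct sources rather than articles is what keeps one outlet's republishing from inflating a story.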
Curious, are you getting _all_ your data from GDELT, or are you also scraping from news sites?
I've never heard of GDELT before, but I've found news websites to have incredibly tight terms of service, which prevent you from keeping a database of their articles offline, regardless of whether or not you distribute the articles.
Nice project. I actually did pretty much the same a long time ago: an API around GDELT data.
Your map looks fantastic, btw.
The problem with GDELT is that they don't have that much coverage. I discovered it when we began to build our own News API [1].
I failed to sell this data over an API; I think you'll see much more interest in an application. An API like that is a bit complex, and almost all potential clients who really need GDELT are already parsing this data from BigQuery, Redshift, or GDELT's dump files.
Anyway, check out our solution; maybe we could collaborate somehow. The map is cool! Feel free to reach me at artem [at] newscatcherapi [dot] com.

[1] https://newscatcherapi.com
Thanks! While I agree GDELT doesn't have the kind of coverage some people may be looking for, I think it's the best source for world news insights.
Oh nice, I actually saw Newscatcher on Rapid API but didn't see the marketing page. It looks like a great product, and I think suited for large-scale solutions. TidalWaves will likely find a niche for people making visualizations and personal apps.
What I want to do is make a mature pipeline that handles at least 90% of what amateur and solo users would want to use GDELT for, so hopefully people see the value in that. After all, I could make this visualization using the API - so I'm sure people can think of many other cool uses for it.
Also, thanks for reaching out, I'd definitely like to collaborate in the coming months.
No, but that actually sounds like a good idea. If there are a lot of people using this data along with WikiData, or if there's a kind of standard for "big" encyclopedic data, then I'd definitely like to support it.
If you make the data available as RDF (which for JSON mostly just involves creating a JSON-LD context), it becomes very easy for others to integrate it with data from other sources - that's one of the key challenges Semantic Web technology is designed to solve.
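For a JSON API, that mostly means publishing a context document mapping your field names onto shared vocabulary terms, e.g. something like this against schema.org (the field names on the left are just placeholders for whatever the API actually returns):

```json
{
  "@context": {
    "schema": "https://schema.org/",
    "title": "schema:headline",
    "url": { "@id": "schema:url", "@type": "@id" },
    "source": "schema:publisher",
    "published": { "@id": "schema:datePublished", "@type": "schema:Date" }
  }
}
```

Clients that don't care about RDF can ignore the `@context` key entirely, so it's a low-cost addition.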
Not too interested in the most popular topics, but in the obscure stuff - like why some department in government has changed something related to a topic, where there are a couple of journos who have been monitoring/reporting on that department and so, over time, have a high count related to that topic.
Yep, currently the breaking stories are identified in 15-minute intervals, so the duplicate stories showing up are from different times. If they have the same title, that means I can train the model to combine them.
They could also be press releases, considering the same title was found for multiple articles over the past 30-45 minutes. The data pipeline currently removes duplicate titles, but only inside single batches. So it could be a long, sustained press release... like propaganda.
As of writing it shows three different stories though:
- 12:15 Will there be a coronavirus relief bill with $1,200 checks?
- 12:30 Nigeria protesters break curfew amid gunfire, chaos in Lagos
- 12:30 Google antitrust case to turn on how search engine grew dominant

Maybe what you're seeing is a bug in the website client; I'll investigate it.

Finding duplicate stories that have different texts and titles sounds like something you should use AI for.
I've used trigram matching on article title strings with some success for comparing the similarity of news articles. If the similarity score is greater than 0.6, they are probably about the same topic.
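A minimal version of that, using Jaccard similarity over character trigram sets (the padding and the exact threshold are choices you'd want to tune):

```python
def trigrams(s: str) -> set:
    s = " " + s.lower() + " "  # pad so word boundaries contribute trigrams
    return {s[i:i + 3] for i in range(len(s) - 2)}

def trigram_similarity(a: str, b: str) -> float:
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)  # Jaccard index over the two sets

# Near-duplicate titles score high; unrelated titles score near zero.
same = trigram_similarity(
    "Nigeria protesters break curfew amid gunfire in Lagos",
    "Nigerian protesters break curfew amid gunfire, chaos in Lagos",
)
different = trigram_similarity(
    "Nigeria protesters break curfew amid gunfire in Lagos",
    "Google antitrust case to turn on how search engine grew dominant",
)
```

Postgres users get this for free via the pg_trgm extension, which is handy if the titles are already in a database.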
Most news articles mention multiple locations. It's probable that those pieces talked about both Pakistan/Hungary and Poland, so they will be found under both countries.
The locations are ordered by salience (proximity to start of the article, number of times mentioned), but not completely filtered, so on this specific visualization you will see a lot of matched locations which you wouldn't consider part of the article, but were mentioned in passing.
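As a toy version of that salience scoring (the weighting here is made up for illustration, not the real formula):

```python
def location_salience(text: str, location: str) -> float:
    """Score a location by how early it first appears in the article
    and how often it's mentioned. Illustrative weighting only."""
    text_l, loc_l = text.lower(), location.lower()
    first = text_l.find(loc_l)
    if first == -1:
        return 0.0
    # Earlier first mention -> score closer to 1; each mention adds a bit.
    proximity = 1.0 - first / len(text_l)
    return proximity + 0.1 * text_l.count(loc_l)

article = ("Poland signed the agreement in Warsaw on Tuesday. Officials from "
           "Hungary and Pakistan were mentioned in passing near the end, "
           "while Poland's delegation stayed on.")
# Poland appears first and twice, so it outranks the passing mentions.
```

A score like this orders the locations, but a mentioned-in-passing country still gets a nonzero score, which is exactly why it shows up on the map.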
Trying to understand the content of an article is still really, really difficult, and error prone.
Tone is the author's wording of the article that conveys their feelings towards the subject. The lower the tone/sentiment score, the more negative the article.
These scores are averaged for each spot on the map, showing the associated color.
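In other words, something like this (the field names and score values are assumptions on my part):

```python
from collections import defaultdict

def average_tone(articles):
    """Average the per-article tone scores for each map location."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for a in articles:
        sums[a["location"]] += a["tone"]
        counts[a["location"]] += 1
    return {loc: sums[loc] / counts[loc] for loc in sums}

articles = [
    {"location": "lagos", "tone": -4.0},      # negative coverage
    {"location": "lagos", "tone": -3.0},
    {"location": "washington", "tone": 1.0},  # mildly positive coverage
]
tones = average_tone(articles)
# {"lagos": -3.5, "washington": 1.0} -> color each spot by its average
```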
This is cool @prohobo! May I ask how you classify tone? I have a completely unrelated project (an enterprise app with thousands of documents) that I'd like to apply it to. TIA.
I can't say specifically which algorithms are run on it, but GDELT runs sentiment analysis (https://en.wikipedia.org/wiki/Sentiment_analysis) on each article, and they have a data set called the GKG that's all about emotion mining.