A lot of comments in here are poking fun at how little data it is relative to a commercial data mining operation. The data they process and what they do with it is worth more to society than any number of petabytes crunched to target ads. Processing petty petabytes is not praiseworthy.
If you focus on the headline, you'll miss the point. The point is they used open source technology to process public data for reporting once the government stopped updating its own tools.
Would it be possible to put the gigabytes back in the title? In general, ETL of gigabytes of data can involve complicated operations, e.g., use of statistical models. And, the utility of data is not determined by the size of data. One has to be a pretty petty person to make fun of this article for data sizes.
You're right on the facts, but that's not the way a forum like this works. Like it or not (and probably no one likes it), the way it works is that a minor irrelevant provocation in the title completely determines the discussion. The solution is to take out the minor provocation, even if in principle we shouldn't need to.
Hear hear! It's a good solution for this domain using reliable, free tools. And other folks without a lot of compute resources or experience setting up Spark clusters or whatever can easily adapt their approach. Hats off.
Not just "open source technology", battle-tested tried-and-true technologies. Articles like these should remind us that we don't need to keep rebuilding tools to solve the same set of problems -- sometimes, all it takes is some familiarity with what exists.
The production databases on the project I build and support are around 100GB in size - tiny. But if they didn't work correctly, ambulances wouldn't arrive at the correct location on time, nor would fire engines or other emergency services workers.
Covering an area of 1.25 million square kilometres, supporting 40,000 first responders helping to protect 8 million people.
Of course the databases are not the only important part of an emergency services network such as ours, but they are a critical component.
I would rather work on a project like this any day than work to prop up some faceless advertising/data collection behemoth such as Facebook or Google.
As someone who has done a lot of data processing in journalism, I've found the engineering issues aren't usually about scale, but involve the harder problems around data cleaning/wrangling/updating. Particularly interoperability with opaque government systems, and transformation/delivery to a variety of users, including ones with a high variation in technical skill (i.e. journalists), and an extremely picky tolerance for public-facing errors in the finished product.
I started ProPublica's Dollars for Docs, and the initial project involved <1M rows. Gathering the data involved writing scrapers for more than a dozen independent sources (i.e. drug companies) using whatever format they felt like publishing (HTML, PDF, Flash applet). This data had to be extracted and collated without error for the public-facing website, as mistakenly listing a doctor had a very high chance of legal action. Hardest part by far was distributing the data and coordinating the research among reporters and interns internally, and also with about 10 different newsrooms. I had to work with outside investigative journalists who, when emailed a CSV text file, thought I had attached a broken Excel file.
Today, the D4D has millions of records, and the government now has its own website for the official dissemination of the standardized data. I have a few shell scripts that can download the official raw data -- about 30GB of text when unzipped -- and import it into a SQLite DB in about 20 minutes. The data for the first D4D investigation probably could've fit in a single Google Sheet, but it still took months to properly wrangle. The bottleneck wasn't the size of the data.
One of the other tricky issues is that data management isn't easy in a newsroom. Devops is not only not a traditional priority, but anyone working with data has to do it fairly fast, and they have to move on almost immediately to another project/domain when done. There's not a lot of incentive or resources to get past a collection of hacky scripts, so it's really cool (to several-years-ago me) to see a guide about how to get things started in a more proper, maintainable way.
edit: for a more technical detailed example of newsroom data issues, check out Adrian Holovaty's (creator of Django) 3-part essay, "Sane Data Updates Are Harder than You Think", which details the ETL process for Chicago crime data:
I suspect that anyone who has worked in tech at more "traditional" non-tech businesses would be far more familiar with the challenges inherent in any ETL undertaking. It's usually critical business data, too, so there's a strong incentive to avoid errors there, too.
The trouble is, despite (or possibly because of) being cognitively difficult and requiring a certain discipline (for lack of a better word), this kind of work doesn't come across as very "sexy" anecdotally.
Even if it does get shared, the part that makes it hard gets overlooked.
It isn't even about tech vs "non-tech". It's about whether you get data in a consistent format or not. Where I used to work, we would get a gigabyte sized file of random XML without any documentation and be told to deal with it, first step being to tell the non technical people what we had. Another delivery would be something totally different. Saying "oh, we deal with petabytes" is missing the point. If there's nothing unexpected or unknown about the data, then it's not a challenge, because you know, computers process stuff automatically.
I guess I was making the assumption that "tech" businesses are more likely to have data that's entirely generated by modern software (e.g. click logs) or at least pre-coerced into a consistent, if not structured, format (e.g. tweets), whereas non-tech businesses are more likely to have data that's free-form human input or comes from disparate/arbitrary machine sources (e.g. scientific instruments, the mainframe or AS/400 worlds).
I'm sure there's a spectrum, but my point was that the vast majority of what the companies we read about on this site ("tech") deal with is going to fall close to the consistent-format edge of the spectrum, hence the prejudice.
> The data they process and what they do with it is worth more to society than any number of petabytes crunched to target ads.
No it isn't. You are getting defensive for no reason. If propublica ceases to exist or never existed, it wouldn't matter a single bit to the world. You could even argue the world would be better off.
> Processing petty petabytes is not praiseworthy.
From a technical point and many other ways, it is.
I don't get why you are getting offended by people making a jab at the scant amount of data. Last I checked, hacker news is a technology oriented site. And from a technology point of view, what pro publica is doing is a joke. It's a toy amount of data.
Why not just say pro publica is not a technology company and hence people shouldn't expect technological feats of wonder?
> The point is they used open source technology to process public data for reporting once the government stopped updating its own tools.
Which is something I could have done on a lazy afternoon all by myself. It isn't anything to be impressed about. But good for them anyways.
ProPublica is not a technology company, it's a non-profit investigative journalism outlet.
They and countless other journalism/civic orgs would likely be happy for you to show them up by whipping up usable ETL scripts relevant in their respective domains. Since it all involves public open data you don't have to wait for anyone's permission.
Minor nitpick about their exit code technique: The command checks if the table exists, but it does not appear to re-run if the source file has been updated. Usually with Make you expect it to re-run the database load if the source file has changed.
It's better to use empty targets to track when the file has last been loaded and re-run if the dependency has been changed.
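To make the empty-target idea concrete, here is a rough Python sketch of the timestamp check such a rule relies on. It is not the article's code: `needs_reload`, `load_if_stale`, and the stamp-file convention are names made up for illustration, and `load` stands in for whatever actually runs the database import. The point is only that an otherwise empty file's modification time records when the load last ran, so a newer source file triggers a reload.

```python
from pathlib import Path
from typing import Callable


def needs_reload(source: Path, stamp: Path) -> bool:
    """True if the load has never run, or the source changed since it did."""
    if not stamp.exists():
        return True
    # Same rule make applies: rebuild when the prerequisite is newer.
    return source.stat().st_mtime > stamp.stat().st_mtime


def load_if_stale(source: Path, stamp: Path, load: Callable[[Path], None]) -> bool:
    """Run load(source) only when needed, then refresh the stamp file.

    `load` is a hypothetical callable (e.g. a wrapper around a DB import).
    Returns True if the load ran.
    """
    if not needs_reload(source, stamp):
        return False
    load(source)
    stamp.touch()  # the empty file's mtime now records the last load
    return True
```

In a Makefile the stamp file would be the rule's target and the source file its prerequisite, with a `touch` of the target as the last line of the recipe.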
> The first is that we use Aria2 to handle FTP duties. Earlier versions of the script used other FTP clients that were either slow as molasses or painful to use. After some trial and error, I found Aria2 did the job better than lftp (which is fast but fussy) or good old ftp (which is both slow and fussy). I also found some incantations that took download times from roughly an hour to less than 20 minutes.
Tangential question: is it possible to use wget for ftp duties? Though there may be additional FTP-specific functionality in `aria2c` of course:
Yes, I should have specified that I was interested in what aria2 provides for FTP in addition to what a more ubiquitous tool like wget seemingly has. u/rasz says aria2 allows multi-connections, so that seems sensible: https://news.ycombinator.com/item?id=17508858
Make is often brought out for data, "single machine ETL" jobs, but for big, complicated (and iterative) workflows it doesn't feel good enough to me.
What do you folks use? Drake, "make for data" https://github.com/Factual/drake seems ok, but doesn't have "batch" jobs, (aka "pattern rules") where you can do every file in a directory matching a pattern.
Others have come up with different swiss army knives but nothing ever sticks for me, it usually ends up as a single Makefile with eg 3 targets that call a bunch of shell scripts.
The whole thing would be configurable to build from scratch, but not well set up to do incremental ETL on a per file basis, after I eg delete some extraneous rows in one file, clean up a column, redownload a folder, or add files to a dataset.
I use Snakemake, a parallel make system for data, designed around pattern-matching rules. The rules are either shell commands or Python 3 code.
I settled on it after originally using make, getting frustrated with the crazy work-arounds I needed to implement because it doesn't understand build steps with multiple outputs, switching to Ninja where you have to construct the dependency tree yourself, and finally ending up on Snakemake which does everything I need.
Thank you for sharing this information about snakemake. I administer a cluster for a group of geneticists. I'll try to get them to use it for their publications to make their results easily reproducible by others.
Is that using traditional (plaintext) FTP? Is it listening on port 21?
~ $ ftp ftp.elections.il.gov
Connected to ftp.elections.il.gov (18.104.22.168).
220-Microsoft FTP Service
Name (ftp.elections.il.gov): ^C
It looks like they are sending their password in plaintext. aria2 supports SFTP, so they should really talk to elections.il.gov about moving to SFTP or any other protocol that doesn't send the password in plaintext.
I imagine there would be other systems (state-owned and private) that use the FTP server, and maybe in a way that changing protocols is inexplicably full of friction. I wonder why the elections server, assuming it only contains records legal to distribute to the public, is even password protected. Maybe it was a policy when govt bandwidth was scarce. California, for example, has campaign finance data on a public webserver: https://www.californiacivicdata.org/
I did something like this a few years ago! I needed to do a bunch of transformations and measurements of data that came in on a regular basis. Make was a perfect fit - I could test the whole process with a single command, cleaning either just the result data, or nuking everything to make sure it pulled stuff in properly.
I spent some time trying to write my own processing system in Python before realizing this was a familiar task...
On the whole debate revolving around gigabytes in the title, I'd like to add:
There's a well-substantiated linguistic theory revolving around "maxims of conversation". Maxims of conversation are so strongly universal among the speakers of a given language that they become part of the implied meaning of a conversational act.
For example, the maxim of cooperativity implies that when a person sitting in a cold room next to a window is told by someone sitting further from the window, "It's a bit chilly, isn't it?", they can take it to mean "Please close the window".
Similarly, there are certain maxims of conversation which are part of the language game inherent in the formulation of the title of a blogpost. They are kind of assumed to be boasting about something. So when somebody says "We figured out a way to load a gigabyte's worth of data into a database in a single day" then the being-boastful-about-something maxim is violated. That's why it triggered so many people.
And pointing out that this is not something to be boastful about is a perfectly valid thing to do to keep certain facts straight.
But, by all means, if you get a thrill out of it, keep downvoting me.
Make has a dependency system. You can tell it how to create file A, and that it depends on file B, and tell it how to create file B. Then if you request file A, it will check whether file B exists, and create it first if it's missing or outdated.
That's very valuable for building things. If you change a file, you only need to re-build the files that could reasonably be affected by the change. And things happen in the right order without micro-managing.
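For anyone who hasn't used make, the behaviour described above can be sketched in a few lines of Python. This is a toy, not how make is implemented: `build`, `mtime`, and the `Rules` mapping are invented for illustration, there is no cycle detection, and a "recipe" is just a callable that is expected to write the target file.

```python
import os
from typing import Callable, Dict, List, Tuple

# target path -> (prerequisite paths, recipe); all names are hypothetical
Rules = Dict[str, Tuple[List[str], Callable[[], None]]]


def mtime(path: str) -> float:
    """Modification time of path, or -1.0 if it doesn't exist."""
    return os.path.getmtime(path) if os.path.exists(path) else -1.0


def build(target: str, rules: Rules) -> bool:
    """Bring target up to date; return True if its recipe ran."""
    if target not in rules:
        if not os.path.exists(target):
            raise FileNotFoundError(f"no rule to make {target!r}")
        return False  # a plain source file: nothing to do
    deps, recipe = rules[target]
    for dep in deps:
        build(dep, rules)  # prerequisites first, recursively
    # Stale if missing, or if any prerequisite is newer than the target.
    stale = mtime(target) < 0 or any(mtime(d) > mtime(target) for d in deps)
    if stale:
        recipe()
    return stale
```

Requesting file A with rules for A (depends on B) and B (depends on some raw input) runs B's recipe first and then A's. A second request does nothing until the raw input changes.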
They can produce the same result but they do it in different ways and require you to express it in different ways.
Make has you describe a graph of outputs and how to produce them. It then traverses the graph to produce the requested output.
Bash is just a regular sequence of commands, with functions and loops if you wish.
If the pipeline you need to run can easily be turned into a dependency graph, I think make is a great fit. It's easy to use, comes with most of what you need built in and has some fun extras, like -jXXX, which allows you to parallelise things and built in caching so you don't regenerate the same asset twice if you don't need to.
You can do all that in bash but you'll have to write it yourself, which takes time you could spend on other things.
In addition, expanding on @Pissompons's note -- make gives you job-level parallelism for free with constructs like:
make -j 24 transform
which will (if possible/allowed by the dependence structure in the Makefile) run 24 jobs at once to bring "transform" up to date.
So for instance, if "transform" depends on a bunch of targets, one for each month across a decade, you get 24-way parallelism for free. It's kind of like gnu "xargs -P", but embedded within the make job-dispatcher.
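As a rough analogy for what `-j 24` buys you in the month-per-target case, here is a small Python sketch. It assumes the targets are fully independent; real make also respects the dependency graph and spawns processes rather than threads. `run_jobs`, the `recipe` callable, and the 2008-2017 date range are all made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List


def run_jobs(
    targets: Iterable[str],
    recipe: Callable[[str], str],
    jobs: int = 24,
) -> List[str]:
    """Run recipe(target) for every target, at most `jobs` at a time."""
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        # map() returns results in input order even though work overlaps.
        return list(pool.map(recipe, targets))


# One hypothetical target per month across a decade: 120 independent jobs.
months = [
    f"{year}-{month:02d}"
    for year in range(2008, 2018)
    for month in range(1, 13)
]
```

Calling `run_jobs(months, transform_month)` with some `transform_month` function of your own would keep up to 24 of the 120 jobs in flight at once. Threads are a reasonable fit when each recipe shells out or waits on I/O.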
You're right that it's /bin/sh, but, since it could be (and is, in some cases) bash, it's not quite right to call it "not bash", either.
I'll grant that the distinction is important, though, in the face of the history of #!/bin/sh Linux scripts with bashisms breaking upon the Debian/Ubuntu switch to dash. Even if you're on a system where /bin/sh is bash, it's safest to set SHELL in your GNU makefiles to bash explicitly, if that's what you're writing in.
Make is an excellent way to automatically decide whether or not to run your bash scripts (and other scripts, shell commands & executables), all depending on whether the existing output is newer than the available input.