I really enjoyed this article, and I think it shows how successfully jq fits into the Unix culture of sed/awk/grep/etc. It seems so rare to find new CLI tools that feel as "classical" as jq. It has helped me do one-off tasks like this several times, but I've really only scratched the surface. Often with newer tools I'm reluctant to invest in going deeper and really learning the features, but with jq I have a lot of confidence that it would pay off for years to come. I don't see any books yet, but this is about the phase where I'd normally buy an O'Reilly volume (hint hint).
Btw I'm surprised you needed -M, since I thought jq would suppress colors if it saw it wasn't writing to a tty.
What the GP means is that the JSON spec doesn't require an implementation to decode JSON integers as arbitrary-precision integers in order to be conformant.
Therefore, you can't assume that if you pass some JSON through an arbitrary pipeline of JSON-manipulating tools, written in various languages, your integer values will be passed through losslessly.
Therefore, you just shouldn't use JSON integer values when you know that the values can potentially be large. This is why e.g. Ethereum's JSON-RPC API uses hex-escaped strings (e.g. "0x0") for representing its "quantity" type.
The problem is made worse on the receiving end (the browser). I've run into this issue when serialization libraries in Java send a 64-bit long value as a sequence of digits: values beyond ~53 bits get silently truncated, you find out about it the hard way, and then you switch to quoted strings.
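For what it's worth, the truncation is easy to demonstrate. Python's own ints are arbitrary-precision, but coercing through an IEEE-754 double reproduces what a JavaScript consumer does to every JSON number (a sketch; the ID value is made up):

```python
import json

big_id = 2**53 + 1               # 9007199254740993, a plausible 64-bit ID
as_double = int(float(big_id))   # what JSON.parse in a browser would yield
assert big_id != as_double       # silently became 9007199254740992

# The usual workaround, as with Ethereum's JSON-RPC: send it as a string.
safe = json.dumps({"id": str(big_id)})
assert json.loads(safe)["id"] == "9007199254740993"   # preserved exactly
```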
That's cool, you made a thing that verifies the two export jobs you wrote have the same data even though they have different output.
I can't help wondering, if you control the code that generates the JSON, why not output in a conservative, consistent format? I'm sure there are pros/cons, but this work would allow something like `diff` to work, and then you don't have to maintain a separate utility.
good question! the analysis that I was doing was really a one-off for switching between these processes. We have unit tests and sanity checks to ensure consistency going forward, but as a final check before flipping the switch we wanted to be as confident as possible that we hadn't introduced any regressions across the full data-set.
The new export process is much more reliable and a _lot_ faster, but as a side effect of doing things in a different way it generated the export file in a different format. Given that the order of objects in an export file and the order of keys/etc in the JSON objects didn't matter for anything except comparing the two processes, I figured it was simpler to put the normalization logic in the one-off tool vs baking it into our export process. But certainly if we were maintaining both exports in an ongoing fashion and validating them against each other, it would make a lot more sense to spend time making sure they generated objects and keys in the same order.
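(For anyone curious, the normalization idea is simple to sketch in Python; the field names here are made up, and the real tool used jq:)

```python
import json

def normalize(line: str) -> str:
    # Re-serialize with sorted keys and fixed whitespace so that two
    # exports of the same object compare byte-for-byte equal.
    return json.dumps(json.loads(line), sort_keys=True, separators=(",", ":"))

old = '{"title": "Song", "artist": "Band"}'
new = '{"artist": "Band", "title": "Song"}'   # same data, different key order
assert normalize(old) == normalize(new)
```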
If you have JSON Lines-formatted stuff (or CSV) and an AWS account, you can do some nice things with Athena and SQL. We have a few simple back-office tools that I've implemented around simple SQL queries on data dumped from various systems in JSON format. Awesome if you want to do some quick selects, joins, etc.
If you are going to process this amount of data, don't load it all into memory; process it line by line. Also do it concurrently if you have more than one CPU core available. I've done this with Ruby, Python, Java, and misc shell tools like jq. Use what you are comfortable with and what gets results quickly.
One neat trick with jq is to use it to convert json objects to csv and to then pipe that into csvkit for some quick and dirty sql querying. Generally gets tedious beyond a few hundred MB. I recommend switching to Athena or something similar if that becomes a regular thing for you.
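A rough stdlib sketch of that objects-to-CSV step (the column names here are hypothetical; the parent does this with jq's @csv):

```python
import csv, io, json

lines = [
    '{"artist": "Band", "title": "Song A", "plays": 10}',
    '{"artist": "Band", "title": "Song B", "plays": 3}',
]

fields = ["artist", "title", "plays"]        # pick the columns you need
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=fields)
writer.writeheader()
for line in lines:
    writer.writerow(json.loads(line))        # one JSON object per CSV row
print(buf.getvalue(), end="")
```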
OP here— good point! We actually use Athena to query these exports in S3 to debug data drift of specific export objects over time. It's quite a useful tool; I was able to go from knowing basically nothing about Athena to querying gzipped newline-delimited JSON files in S3 using SQL in about an hour.
It's worth mentioning that there are much faster JSON parsing libraries than the default in the Ruby stdlib. Even so, I don't think Ruby is the best choice for raw JSON parsing. Last time I had to care about JSON speed we were transforming billions of events, and the Ruby JSON lib was becoming a bottleneck.
Use a streaming JSON parser, and compare the two documents token by token unless/until they diverge, at which point you take whatever action is suitable to identify the delta.
Parsing it into a tree may be necessary if you want to do more complex comparisons (such as sorting child objects etc.), but even then, depending on your requirements, you may well be better off storing offsets into the file rather than materializing the whole structure.
https://github.com/lloyd/yajl is an example of a streaming JSON parser (caveat: I've not benchmarked it at all), but JSON is simple enough that you could write one specifically to handle two streams.
(1) Try a language with fast allocations (C, C++, Rust, maybe Go or Java) -- anything except Python or Ruby
(2) Try using streaming API (I don't know Ruby, but quick google found https://github.com/dgraham/json-stream ).
Note that this method will require you to massively restructure your program -- you want to avoid having all of the data in memory at once.
The streaming API might work better with jq-based preprocessing -- for example, if you want to compare two unsorted sets, it may be faster to sort them using jq, then compare line-by-line using streaming API.
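That sort-then-compare approach might look like this in Python (toy in-memory data standing in for the two export files):

```python
import json

def canon(line: str) -> str:
    # Canonical form: sorted keys, fixed whitespace.
    return json.dumps(json.loads(line), sort_keys=True, separators=(",", ":"))

a = ['{"id": 1, "x": "a"}', '{"id": 2, "x": "b"}']
b = ['{"x": "b", "id": 2}', '{"id": 1, "x": "a"}']   # reshuffled, reordered

# Normalize each record, sort, then walk the two streams in lockstep;
# any unequal pair is a delta. (Real files would be sorted on disk first.)
deltas = [
    (left, right)
    for left, right in zip(sorted(map(canon, a)), sorted(map(canon, b)))
    if left != right
]
assert deltas == []   # the two exports carry identical data
```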
> My first thought was to write a ruby script to parse and compare the two exports, but after spending a little time coding something up I had a program that was starting to get fairly complicated, didn't work correctly, and was too slow—my first cut took well over an hour. Then I thought: is this one of those situations where a simple series of shell commands can replace a complex purpose-built script?
Key takeaway: next time, start with the second thought first and save yourself well over an hour!
I had a similar problem diffing large API responses a few months ago and implemented an automation-friendly JSON schema tool. It's a great way to make a summary of the data, especially when looking for forgotten fields.
As part of an automated Jira upgrade script (well, Makefile) we needed to export changes to the listener port/scheme configuration, which is unfortunately stored "in the code" so to speak (in WEB-INF/web.xml) and which Atlassian doesn't deign to format consistently (indentation is all over the place, as is formatting, character encoding, and more), and they mangle it differently somehow with each point release. So the Makefile calls xmllint to normalize the formatting and whitespace of the untouched source files from the old and new releases, as well as the locally modified (deployed) configuration, then calls diff/patch accordingly (in a three-way merge).
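The same normalize-then-diff trick is available from Python's stdlib too, if xmllint isn't handy (a sketch with made-up markup):

```python
from xml.etree.ElementTree import canonicalize

def normalize_xml(text: str) -> str:
    # Canonical (C14N) serialization with insignificant whitespace
    # stripped; plays the same role as `xmllint --format` before diffing.
    return canonicalize(xml_data=text, strip_text=True)

# Same content, mangled differently (indentation, line breaks):
a = "<web-app><listener><port>8080</port></listener></web-app>"
b = "<web-app>\n  <listener>\n    <port>8080</port>\n  </listener>\n</web-app>"
assert normalize_xml(a) == normalize_xml(b)   # formatting noise is gone
```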
It attempts to address a similar problem (comparing json or subsets of json), but I wanted the structure of what was being compared to be more readable (compared to jq), so I went with graphql syntax. Doubt it would do great on larger datasets though.
I'm not sure why you are comparing the data to the old export instead of against a source of truth, for example what is in the upstream data source. Also, why not verify using unit tests? Who is to say that the original export is valid and not the second export?
In theory, I agree! I hope the new codebase has a set of tests to validate just that.
But, in practice, you have a downstream consumer of this data format (Apple in this case..).. Validating the old and new formats are functionally identical is just as important as validating the new format matches the upstream source of truth :)
I'm curious how well the Crystal language could handle that huge amount of JSON, since most of the Ruby code could be ported over to Crystal.
It has a JSON pull parser that minimizes memory usage, which is useful in memory-constrained environments at the expense of some performance. If the work could be split across forked Crystal processes, I believe it's feasible.
There are stream parsers for JSON in Ruby too, including bindings for C libraries like YAJL. Using the default JSON parser is an awful choice for comparisons like this, given the massive overhead of the sheer number of objects it'll be creating for no good reason.
My problem with this article is the entire strategy for delivering data. A JSON file? That's about the worst way I can think of; a CSV file would have been better. Part of the reason why databases exist is to handle exactly the problem the author is posing. There are already tools in SQL databases that track diffs for terabyte-scale DBs, and the authors could simply export patch files, which would be far easier to analyze for inconsistency.
It really bothers me that the article is like "wow, check out this awesome utility that helps us with a huge problem" instead of really thinking about "how did we get this huge problem, and is there already a solution?"
CSV is hardly an easy format to parse, or really produce. There is no CSV standard, and I bet lyrics contain all kinds of weird characters that makes choosing a separator hard.
Newline-delimited JSON is a fine interchange format for this, and the only advantage I can see for CSV is that you can load it into a database in one command. Which raises the question of why use a database at all for a simple one-off diff, when there are much more lightweight alternatives (a shell command).
So now you are converting your JSON to CSV to load it into a database to run a bunch of database diffs over it to then compare them in some way. Wouldn't that lead to the question "how did we get this huge problem and is there already a solution"?
Seems like you are the one over complicating things.
And I have to say, choosing CSV and then using a database for this task reeks of inexperience. KISS.
Oh, I see: you probably didn't know that CSV can also mean character-separated values, and can actually use non-printing ASCII characters as delimiters. You didn't think I actually meant commas, did you? I guess your experience with character-separated-value files is very limited. But my point was that if record fields are consistent you don't need a heavyweight solution like JSON... and you clearly missed my point.
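For what it's worth, stdlib CSV libraries do accept such delimiters; e.g. in Python (illustrative data):

```python
import csv, io

US = "\x1f"   # ASCII "unit separator", a non-printing delimiter

rows = [["Song, with commas", 'He said "hi"'], ["plain", "fields"]]

buf = io.StringIO()
csv.writer(buf, delimiter=US).writerows(rows)

parsed = list(csv.reader(io.StringIO(buf.getvalue()), delimiter=US))
assert parsed == rows   # commas in the data never collide with the delimiter
```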
> Oh, I see: you probably didn't know that CSV can also mean character-separated values, and can actually use non-printing ASCII characters as delimiters.
That seems unnecessarily condescending. JSON can also mean Janky Serialized Object Notation, but that's not the common case.
> I guess your experience with character separated value files is very limited.
In practice, using something other than a comma is a good solution for some problems, but not others (e.g. transfer corruption, or, you know, the OP's use case).
> a heavyweight solution like JSON.
I've literally never heard that phrase, nor does it make much sense. At best it's 2 more characters for wrapping braces with existing quoted data/numbers and at worst you have to make up a new non-interchangeable format as you run into exceptions from the diff, which can affect past encodings. Sounds more involved than using JSON. shrug
Yes, you're totally correct. Using a heavyweight solution like JSON is beyond the pale, I should use a much more lightweight approach involving a database server.
Your tone is oddly superior in your reply, which is really at odds with the technical content of your messages.
> if record fields are consistent
This is all very confused. The issue is that the JSON fields were not consistent compared to the baseline. So now what? You suggest that instead of investing time making them consistent, you should just switch format entirely and then make them consistent? Or are you suggesting that somehow a line of CSV is easier to compare than a line of JSON? Or that I should now shove a bunch of non-printing ASCII characters in my message and that's somehow better?
I'm currently implementing an "individual-scale data-warehouse" service (i.e. "Hadoop without the Hadoop part"), and I'm currently pondering between the choices of "a tarball of CSVs", an SQLite file, and an Apache Avro file, as input-side wire formats. (And now HDF5 as well; didn't know about that one.)
I'm still leaning toward "a tarball of CSVs", though:
1. it's very easy to allow different devs to write a bunch of single-purpose Extract tools, each in whatever language is best for the job (e.g. Python if it has to use an API where the only available API-client library impl is in Python) to scrape some particular dimension out of an external source. You can write out CSV data in pretty much any language—even a bash script! That's because, even if the language doesn't have a CSV library, a "trivial" CSV dump can be accomplished by just calling printf(3) with a CSV template string. (Trivial = you know all your stringly-typed data is of constrained formats, such that it doesn't require any quoting. CSV files are trivial more often than you'd think!)
2. Presuming your CSV file is column-ordered to have any primary key(s) first, and that it has no embedded header line, you can "reduce" on the output of a bunch of Extract jobs (i.e. merge-sorting + deduping to produce one CSV dataset) by just feeding all the input files to sort(1) with the -u and -n switches passed. `sort -n -t ','` basically behaves as a very simple streaming CSV parser, while also being amazingly-well-optimized at chewing through on-disk files in parallel. sort(1) is to (local on-disk) CSV data as LevelDB's compaction algorithm is to key-value pair data: a solid primitive that scales with your dataset size.
3. Once you've got two sorted+deduped CSV "snapshot" files, you can create a differential snapshot from them just by calling:
comm -1 -3 "$old_csv" "$new_csv" > diff.csv
And then, getting an SQL data-migration file out of it (at least for an SQL DB that has something like Postgres's COPY statement, which can read CSV directly) is as simple as wrapping the diff in a COPY ... FROM stdin block.
You can then throw that file right into, say, Google Cloud SQL's "import" command.
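The merge-sort + dedupe in step 2 can be sketched as a Python stand-in for what `sort -u` does across the Extract outputs (heapq.merge streams the inputs rather than loading them; the rows here are made up):

```python
import heapq

# Each Extract job's output, already sorted by the leading primary key:
part1 = ["1,alice", "3,carol"]
part2 = ["2,bob", "3,carol"]           # note the duplicate key-3 row

merged = []
for row in heapq.merge(part1, part2):  # streaming merge-sort of the parts
    if not merged or merged[-1] != row:  # dedupe, like sort's -u switch
        merged.append(row)

assert merged == ["1,alice", "2,bob", "3,carol"]
```

(One difference from `sort -n`: this compares rows lexicographically, so numeric keys would need zero-padding or a key function.)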
That being said, the other formats are nice for 1. keeping data like numbers in more compact binary forms, 2. being able to sort and de-dup the data slightly more cheaply, without having to parse anything at point-of-sort. (Though this matters less than you'd think; sort(1)'s minimal parser is very fast, and SQLite/Avro can't get any big access-time wins since the data is neither pre-sorted nor column-oriented.)
But in exchange for this, you lose the ability to cheaply merge working datasets together. You can't just concatenate them—you have to ask your storage-format library to serialize data from one of your data files, and then ask the library to parse and import said data into your other data file. Frequently, the overhead of a complete deep parse of the data is the thing we're trying to avoid the expense of in the first place! (Otherwise, why use an ETL pipeline at all, when you could just have your operational data sources do batched SQL inserts directly to your data warehouse?)
You're wrong - JSON has types - it's very, very useful just because of that, and if you pass newline-delimited JSON through gzip you remove basically all of the size cost of the redundant keys...
I think this guy did the right thing for what sounded like essentially a one-off job to test this new export tool. Why would you go to all the trouble to use a SQL database for a one-off thing that can be done using text processing or worst-case writing a small script?
In theory a truly specific program could work better. In practice, the broad scope of jq allows you to discover the operations you need and respond to changes in requirements without being locked into custom code, and any given programmer probably couldn't do the same job better.
jq is a C program, yes, but jq programs are interpreted. Because jq is a dynamically-typed language, it wouldn't be easy to compile jq programs to object code that would run much faster than the bytecode-interpreted version (though it would still be somewhat faster).
As you say, jq's power is that it is an expressive language, and it's much, much easier to write jq programs that work than it is to write C/C++ programs that do the same or similar work.