• pjungwir 255 days ago

    I really enjoyed this article, and I think it shows how successfully jq fits into the Unix culture of sed/awk/grep/etc. It seems so rare to find new CLUI tools that feel as "classical" as jq. It has helped me do one-off tasks like this several times, but I've really only scratched the surface. Often with newer tools I'm reluctant to invest in going deeper into really learning the features, but with jq I have a lot of confidence that it would pay off for years to come. I don't see any books yet, but this is about the phase where I'd normally buy an O'Reilly volume (hint hint).

    Btw I'm surprised you needed -M, since I thought jq would suppress colors if it saw it wasn't writing to a tty.

  • yonatank 255 days ago

    Just a heads-up to anyone using jq: I've previously spent a couple of hours debugging a problem because jq uses float64 to store integers, which can lead to rounding errors/overflows. For example:

      echo 1152921504606846976 | jq                                                            
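The failure mode is inherent to IEEE 754 doubles, which can represent every integer exactly only up to 2^53. A quick illustrative sketch in Python (not jq itself, but the same float64 representation):

```python
# IEEE 754 doubles carry a 53-bit mantissa, so above 2**53 not every
# integer is representable: distinct values collapse together.
a = 1152921504606846976        # 2**60, exactly representable (a power of two)
b = 1152921504606846977        # 2**60 + 1, NOT representable as a double

assert float(a) == float(b)    # two different JSON numbers become one value
assert int(float(b)) == a      # the trailing +1 is silently lost

# The first integer a double cannot hold exactly:
assert int(float(2**53 + 1)) != 2**53 + 1
```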
    • otterley 255 days ago

      This is an artifact of JavaScript, which even as of ES6 uses IEEE 754 double-precision floats for all numeric values. jq likely uses the same implementation internally for compatibility reasons and to avoid surprises of a different kind.

      See https://www.ecma-international.org/ecma-262/6.0/#sec-ecmascr...

    • aditya 255 days ago
      • kibibu 255 days ago

        Twitter had to switch their tweet id representation in the API to handle this - it was numeric and switched to strings.

        • chokma 255 days ago

          Note: the links currently redirect to https://imgur.com/32R3qLv (image of a testicle and derogatory comment on HN)

          • baud147258 255 days ago

            Copy-pasting the link bypasses this. It's using the HN referrer (I think?) to redirect to imgur.

        • mattbillenstein 255 days ago

          Yikes, that's nasty.

          • Drdrdrq 255 days ago

            It is. But it's a problem of JSON itself, not just jq.

            • Groxx 255 days ago

              TIL: JSON has no specified number implementation: http://www.ecma-international.org/publications/files/ECMA-ST...

              >JSON is agnostic about the semantics of numbers ... JSON instead offers only the representation of numbers that humans use: a sequence of digits.

              So... anything is valid, per the spec.

              • iooi 255 days ago

                JSON != JavaScript

                > echo 1152921504606846976 | python -c 'import sys, json; print(json.load(sys.stdin))'


                • pritambaral 255 days ago

                  Python's json package != JSON

                  JSON: https://tools.ietf.org/html/rfc8259#page-8

                  • kemitche 255 days ago

                    The link says that it's up to the implementation, which means it's valid for Python's JSON implementation to support larger numbers.

                    It's less "interoperable" but not strictly invalid, by my read.

                    • derefr 255 days ago

                      What the GP means is that JSON doesn't require an implementation to decode JSON integers as arbitrary-precision integers, to be "conformant JSON."

                      Therefore, you can't assume that if you pass some JSON through an arbitrary pipeline of JSON-manipulating tools, written in various languages, that your integer values will be passed through losslessly.

                      Therefore, you just shouldn't use JSON integer values when you know that the values can potentially be large. This is why e.g. Ethereum's JSON-RPC API uses hex-escaped strings (e.g. "0x0") for representing its "quantity" type.

                    • yonatank 255 days ago

                      It looks like JSON doesn't actually define how numbers should be stored. It just recommends not expecting more precision than IEEE 754 doubles provide.

                      Still interesting to know it's not just a jq quirk.

                    • wolf550e 255 days ago

                      Just because python has bigints and its stdlib json module supports bigints in json doesn't mean that is an interoperable thing to do.

                      • rzzzt 255 days ago

                        The problem is made worse on the receiving end (the browser). I've run into this issue when serialization libraries in Java send a 64-bit long value as a sequence of digits: values over ~50 bits get silently truncated, you find out about it, and then you switch to quoted strings.

                        • hashhar 255 days ago

                          That's Python not adhering to the JSON standard as defined in the RFC.

                  • oftenwrong 256 days ago

                    Another option is to use a tool like 'gron' to convert the JSON into a shell-friendly line-oriented format. This makes the rest straightforward.
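For anyone who hasn't seen gron's output: it flattens JSON into one greppable assignment per leaf. A rough stdlib-Python imitation of the idea (the real tool also emits intermediate container lines; this sketch only emits leaves):

```python
import json

def gronify(value, path="json"):
    """Yield gron-style 'path = value;' lines for every leaf of a JSON document."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from gronify(child, f"{path}.{key}")
    elif isinstance(value, list):
        for i, child in enumerate(value):
            yield from gronify(child, f"{path}[{i}]")
    else:
        yield f"{path} = {json.dumps(value)};"

doc = json.loads('{"album": "Ten", "tracks": [{"title": "Alive"}]}')
print("\n".join(sorted(gronify(doc))))
# json.album = "Ten";
# json.tracks[0].title = "Alive";
```

Once JSON is in this one-line-per-leaf form, plain sort/diff/grep all work on it.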


                    • cryptonector 256 days ago

                      jq can do something like that too! Check out its --stream option.

                      • pimlottc 255 days ago

                        Oh, this sounds handy, though I can’t imagine it’d be as performant for this particular case.

                      • totally 256 days ago

                        That's cool, you made a thing that verifies the two export jobs you wrote have the same data even though they have different output.

                        I can't help wondering, if you control the code that generates the JSON, why not output in a conservative, consistent format? I'm sure there are pros/cons, but this work would allow something like `diff` to work, and then you don't have to maintain a separate utility.

                        • wwarnerandrew 256 days ago

                          good question! the analysis that I was doing was really a one-off for switching between these processes. We have unit tests and sanity checks to ensure consistency going forward, but as a final check before flipping the switch we wanted to be as confident as possible that we hadn't introduced any regressions across the full data-set.

                          The new export process is much more reliable and a _lot_ faster, but as a side effect of doing things in a different way it generated the export file in a different format. Given that the order of objects in an export file and the order of keys/etc in the JSON objects didn't matter for anything except comparing the two processes, I figured it was simpler to put the normalization logic in the one-off tool vs baking it into our export process. But certainly if we were maintaining both exports in an ongoing fashion and validating them against each other, it would make a lot more sense to spend time making sure they generated objects and keys in the same order.

                          • ryanisnan 255 days ago

                            I'm surprised the layout of the JSON doesn't arrange music by artist; any reason why that wasn't done?

                        • jillesvangurp 256 days ago

                          If you have json line formatted stuff (or csv) and an aws account, you can do some nice things with Athena and SQL. We have a few simple backoffice tools that I've implemented around simple sql queries on data dumped from various systems that we have in json format. Awesome, if you want to do some quick selects, joins, etc.

                          If you are going to process this amount of data, don't load it all into memory; process it line by line. Also do that concurrently if you have more than one CPU core available. I've done this with Ruby, Python, Java, and misc shell tools like jq. Use what you are comfortable with and what gets results quickly.

                          One neat trick with jq is to use it to convert json objects to csv and to then pipe that into csvkit for some quick and dirty sql querying. Generally gets tedious beyond a few hundred MB. I recommend switching to Athena or something similar if that becomes a regular thing for you.

                          • wwarnerandrew 256 days ago

                            OP here— good point! We actually use Athena to query these exports in S3 to debug data drift of specific export objects over time. It's quite a useful tool; I was able to go from knowing basically nothing about Athena to querying gzipped newline-delimited JSON files in S3 using SQL in about an hour.

                          • jb3689 256 days ago

                            It's worth mentioning that there are much faster JSON parsing libraries than the default in the Ruby stdlib. I still don't think Ruby is the best choice for doing raw JSON parsing. Last time I had to care about JSON speed, we were transforming billions of events and the Ruby JSON lib was becoming a bottleneck.

                            • paulddraper 256 days ago

                              > It's worth mentioning that there are much faster JSON parsing libraries than the default in Ruby stdlib.

                              I am on the edge of my seat now.

                              Would you mind listing which libraries are much (say, an order of magnitude) faster?

                            • baldeagle 256 days ago

                              What do you think is the best way to do deep JSON comparisons? We work with 2GB JSONs all day, and it is super annoying how long they take to process.

                              • vidarh 256 days ago

                                Not parse them into a tree, to start with.

                                Use a streaming JSON parser, and compare them token by token unless/until they diverge, at which point you take whatever action is suitable to identify the delta.

                                Parsing into a tree may be necessary if you want to do more complex comparisons (such as sorting child objects), but even then, depending on your requirements, you may well be better off storing offsets into the file.

                                https://github.com/lloyd/yajl is an example of a streaming JSON parser (caveat: I've not benchmarked it at all), but JSON is simple enough that you could write one specifically built to handle two streams.
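For the newline-delimited JSON in the article, the same streaming idea fits in a few lines of stdlib Python: compare record by record, holding only one pair of objects in memory at a time (this sketch assumes both files list their records in the same order):

```python
import json
from itertools import zip_longest

def records(path):
    """Stream one parsed object per line; never holds the whole file in memory."""
    with open(path) as f:
        for line in f:
            if line.strip():
                yield json.loads(line)

def first_divergence(path_a, path_b):
    """Return (line_no, record_a, record_b) at the first mismatch, else None.

    Parsed dicts compare equal regardless of key order, so differently
    ordered keys in otherwise identical objects don't count as a divergence."""
    pairs = zip_longest(records(path_a), records(path_b))
    for line_no, (a, b) in enumerate(pairs, start=1):
        if a != b:
            return line_no, a, b
    return None
```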

                              • nicoburns 256 days ago

                                Rust is absolutely wonderful for tasks like this. They don't hit any of the cases where Rust's ownership rules can make things tricky. And the serde library makes deserializing JSON a piece of cake.

                                You end up with code which looks pretty similar to the equivalent JavaScript or Python code, but performs much faster (10x, 100x or even 1000x faster).

                              • Filligree 256 days ago

                                Start with a compiled language, I guess? I don't operate on anywhere near that scale, but json-rust reaches 400 MB/s for me.

                                It doesn't parallelize, and you'd need memory enough for the entire structure, but of course Rust doesn't have GC overhead. You could trivially parse both files in parallel, at least.

                                • theamk 256 days ago

                                  (1) Try a language with fast allocations (C, C++, Rust, maybe Go or Java) -- anything except Python or Ruby


                                  (2) Try using streaming API (I don't know Ruby, but quick google found https://github.com/dgraham/json-stream ). Note that this method will require you to massively restructure your program -- you want to avoid having all of the data in memory at once.

                                  The streaming API might work better with jq-based preprocessing -- for example, if you want to compare two unsorted sets, it may be faster to sort them using jq, then compare line-by-line using streaming API.
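That sort-then-compare suggestion is essentially what comm(1) does. A minimal Python sketch of the merge walk, for anyone who wants the actual mismatches rather than just a verdict (assumes both inputs are already sorted and newline-terminated, e.g. normalized with `jq -cS` and run through `sort`):

```python
def diff_sorted(path_a, path_b):
    """Merge-walk two sorted text files in constant memory, like comm(1):
    yields ('<', line) for lines only in a and ('>', line) for lines only in b."""
    with open(path_a) as fa, open(path_b) as fb:
        a, b = fa.readline(), fb.readline()
        while a and b:
            if a == b:
                a, b = fa.readline(), fb.readline()
            elif a < b:
                yield "<", a.rstrip("\n")
                a = fa.readline()
            else:
                yield ">", b.rstrip("\n")
                b = fb.readline()
        while a:  # whatever remains in a is unique to a
            yield "<", a.rstrip("\n")
            a = fa.readline()
        while b:  # and vice versa for b
            yield ">", b.rstrip("\n")
            b = fb.readline()
```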

                                  • dith3r 256 days ago

                                    Python is fast at parsing JSON; Go had a hard time matching its parsing speed. Additionally, you have PyPy to help.

                                • gyrgtyn 256 days ago

                                  nodejs comes to mind!

                              • wgjordan 256 days ago

                                > My first thought was to write a ruby script to parse and compare the two exports, but after spending a little time coding something up I had a program that was starting to get fairly complicated, didn't work correctly, and was too slow—my first cut took well over an hour. Then I thought: is this one of those situations where a simple series of shell commands can replace a complex purpose-built script?

                                Key takeaway: next time, start with the second thought first and save yourself well over an hour!

                              • g-harel 255 days ago

                                I had a similar problem diffing large API responses a few months ago and implemented an automation-friendly JSON schema tool. It's a great way to make a summary of the data, especially when looking for forgotten fields, for example.


                                • specialist 255 days ago

                                  I love jq.

                                  I replaced a bunch of bespoke ETL code with shell scripts. grep, sed, jq, xsv, psql, etc. Fast, efficient, iterative, inspectable, portable.

                                  Alas, most everyone else insists on python, nodejs, ruby, AWS Lambda, jenkins goo, misc mayfly tech stacks. So my "use the most simple tool that works" advocacy has never gained traction.

                                  • ComputerGuru 255 days ago

                                    As part of an automated Jira upgrade script (well, Makefile) we needed to export changes to the listener port/scheme configuration, which is unfortunately stored “in the code” so to speak (in WEB-INF/web.xml) and which Atlassian doesn’t deign to format consistently (indentation is all over the place, as is formatting, character encoding, and more); they also mangle it differently somehow with each point release. So the Makefile calls xmllint to normalize the formatting and whitespace of the untouched source files from both the old and new releases, as well as of the locally modified (deployed) configuration, then calls diff/patch accordingly (in a three-way merge).

                                    • mattbillenstein 255 days ago
                                      • zoom6628 255 days ago

                                        I have found jq immensely useful for processing ugly, large responses from REST APIs in enterprise systems. It's like an awk for JSON... and I've been an awk fan for 30 years for any text processing.

                                        • evancordell 255 days ago

                                          This is somewhat related to a hack I threw together recently: https://github.com/ecordell/jf

                                          It attempts to address a similar problem (comparing json or subsets of json), but I wanted the structure of what was being compared to be more readable (compared to jq), so I went with graphql syntax. Doubt it would do great on larger datasets though.

                                          • cphoover 255 days ago

                                            I'm not sure why you are comparing the data to the old export instead of against a source of truth, for example what is in the upstream data source. Also, why not verify using unit tests? Who is to say that the original export is valid and not the second export?

                                            • kiallmacinnes 255 days ago

                                              In theory, I agree! I hope the new codebase has a set of tests to validate just that.

                                              But, in practice, you have a downstream consumer of this data format (Apple in this case..).. Validating the old and new formats are functionally identical is just as important as validating the new format matches the upstream source of truth :)

                                            • gorbachev 255 days ago

                                              I'm using jsonassert [1], a Java based JSON unit testing library, for something very similar.

                                              Not sure how it'd handle comparing 5GB files though.

                                              1. http://jsonassert.skyscreamer.org/

                                              • proyb2 256 days ago

                                                I'm curious how well the Crystal language could handle that huge amount of JSON, since most of the Ruby code could be ported over to Crystal.

                                                It has a JSON pull parser that minimizes memory usage, which is useful in memory-constrained environments at the expense of some performance. If the work could be split across forked Crystal processes, I believe it's feasible.

                                                • vidarh 256 days ago

                                                  There are stream parsers for JSON for Ruby too, including bindings for C libraries like YAJL - using the default JSON parser is an awful choice for doing comparisons like that given the massive overhead of the amount of objects it'll be creating for no good reason.

                                              • forgot-my-pw 256 days ago

                                                If speed is essential, why not use protobuf/flatbuffer or one of their variants?

                                                • gerbilly 255 days ago

                                                  I wanted to like jq, but honestly, I can't figure out its crazy syntax.

                                                  • rane 255 days ago

                                                    If you're into JavaScript (or LiveScript) or functional programming, you might find ramda-cli[1] more palatable. Disclaimer: I've created it.

                                                    [1]: https://github.com/raine/ramda-cli

                                                    • Rotten194 255 days ago

                                                      It's a bit weird, but it's perfect for writing quick shell one-liners once you get used to it. No regular scripting language can match it for that.

                                                      • saagarjha 255 days ago

                                                        awk, sed?

                                                        • gerbilly 255 days ago

                                                          The braces make that more difficult. awk and sed can work with YAML more easily, but awk works best on column-oriented data, and YAML and JSON are more row-oriented.

                                                          • pathseeker 255 days ago

                                                            For extracting a value out of json where the keys can arbitrarily re-order and you have nested maps containing the same key names? No thanks.

                                                            • saagarjha 255 days ago

                                                              I was providing them as examples of things "perfect for writing quick shell one liners", not good JSON parsers.

                                                      • nrclark 255 days ago

                                                        +1 for a great tool.

                                                        • johnjohnsmith 256 days ago

                                                          My problem with this article is the entire strategy for delivering data. A JSON file? That's probably the worst way I could think of: a CSV file would have been better. Part of the reason why databases exist is to handle exactly the problem the author is posing. There are already tools in place in SQL databases that track diffs for TB db's, and the authors could simply export patch files which would be far easier to analyze for inconsistency.

                                                          It really bothers me that the article is like "wow, check out this awesome utility that helps us with a huge problem" instead of really thinking about "how did we get this huge problem and is there already a solution."

                                                          It just reeks of inexperience.

                                                          • orf 256 days ago

                                                            CSV is hardly an easy format to parse, or really produce. There is no CSV standard, and I bet lyrics contain all kinds of weird characters that make choosing a separator hard.
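The quoting hazard is easy to demonstrate with Python's stdlib csv module: lyric text with embedded commas, quotes, and newlines survives only if the writer and the reader agree on a dialect.

```python
import csv, io

# A lyrics field with an embedded comma, quotes, and a newline:
row = ["Bohemian Rhapsody", 'He said, "Mama"\nToo late']

# Naive join-with-comma output is ambiguous: the record now appears
# to have three fields and spans two lines.
naive = ",".join(row)
assert naive.count(",") == 2 and "\n" in naive

# A real CSV writer quotes and escapes; the round trip works only when
# both sides implement the same quoting rules (nothing enforces one dialect).
buf = io.StringIO()
csv.writer(buf).writerow(row)
parsed = next(csv.reader(io.StringIO(buf.getvalue())))
assert parsed == row
```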

                                                            Newline JSON is a fine interchange format for this, and the only advantage I can see for CSV is you can load it into a database in one command. Which begs the question as to why use a database at all for a simple one-off diff, when there are much more lightweight alternatives (a shell command).

                                                            So now you are converting your JSON to CSV to load it into a database to run a bunch of database diffs over it to then compare them in some way. Wouldn't that lead to the question "how did we get this huge problem and is there already a solution"?

                                                            Seems like you are the one over complicating things.

                                                            And I have to say, choosing CSV and then using a database for this task reeks of inexperience. KISS.

                                                            • johnjohnsmith 255 days ago

                                                              Oh, I see: you probably didn't know that CSV can also mean character-separated values, and can actually use non-printing ASCII characters as delimiters. You didn't think I actually meant commas, did you? I guess your experience with character-separated value files is very limited. But my point was to illustrate that if record fields are consistent you don't need a heavyweight solution like JSON... and you clearly missed my point.

                                                              • dang 254 days ago

                                                                Did you really manage to turn CSV files into a flamewar topic?

                                                                Please review https://news.ycombinator.com/newsguidelines.html and avoid turning nasty in arguments on Hacker News.

                                                                • ArchTypical 255 days ago

                                                                  > Oh, I see: you probably didn't know that CSV can also mean character-separated values, and can actually use non-printing ASCII characters as delimiters.

                                                                  That seems unnecessarily condescending. JSON can also mean Janky Serialized Object Notation, but that's not the common case.

                                                                  > I guess your experience with character separated value files is very limited.

                                                                  In practice, using something other than a comma is a good solution for some problems, but not others (eg transfer corruption or you know, the OP's use case).

                                                                  > a heavyweight solution like JSON.

                                                                  I've literally never heard that phrase, nor does it make much sense. At best it's 2 more characters for wrapping braces with existing quoted data/numbers and at worst you have to make up a new non-interchangeable format as you run into exceptions from the diff, which can affect past encodings. Sounds more involved than using JSON. shrug

                                                                  • orf 255 days ago

                                                                    Yes, you're totally correct. Using a heavyweight solution like JSON is beyond the pale, I should use a much more lightweight approach involving a database server.

                                                                    Your tone is oddly superior in your reply, which is really at odds with the technical content of your messages.

                                                                    > if record fields are consistent

                                                                    This is all very confused. The issue is that the JSON fields were not consistent compared to the baseline. So now what? You suggest that instead of investing time making them consistent, you should just switch formats entirely and then make them consistent? Or are you suggesting that somehow a line of CSV is easier to compare than a line of JSON? Or that I should now shove a bunch of non-printing ASCII characters in my message and that's now better?

                                                                    • scarejunba 255 days ago

                                                                      Like a SQLite DB? Actually, why don't we just transfer stuff as SQLite DBs? Single file, built-in schema, you can index.

                                                                      I mean, HDF is super-general and stuff, but it looks like SQLite would solve all the trouble with CSVs.

                                                                      • derefr 255 days ago

                                                                        I'm currently implementing an "individual-scale data-warehouse" service (i.e. "Hadoop without the Hadoop part"), and I'm currently pondering between the choices of "a tarball of CSVs", an SQLite file, and an Apache Avro file, as input-side wire formats. (And now HDF5 as well; didn't know about that one.)

                                                                        I'm still leaning toward "a tarball of CSVs", though:

                                                                        1. it's very easy to allow different devs to write a bunch of single-purpose Extract tools, each in whatever language is best for the job (e.g. Python if it has to use an API where the only available API-client library impl is in Python) to scrape some particular dimension out of an external source. You can write out CSV data in pretty much any language—even a bash script! That's because, even if the language doesn't have a CSV library, a "trivial" CSV dump can be accomplished by just calling printf(3) with a CSV template string. (Trivial = you know all your stringly-typed data is of constrained formats, such that it doesn't require any quoting. CSV files are trivial more often than you'd think!)

                                                                        2. Presuming your CSV file is column-ordered to have any primary key(s) first, and that it has no embedded header line, you can "reduce" on the output of a bunch of Extract jobs (i.e. merge-sorting + deduping to produce one CSV dataset) by just feeding all the input files to sort(1) with the -u and -n switches passed. `sort -n -t ','` basically behaves as a very simple streaming CSV parser, while also being amazingly-well-optimized at chewing through on-disk files in parallel. sort(1) is to (local on-disk) CSV data as LevelDB's compaction algorithm is to key-value pair data: a solid primitive that scales with your dataset size.

                                                                        3. Once you've got two sorted+deduped CSV "snapshot" files, you can create a differential snapshot from them just by calling:

                                                                            comm -1 -3 "$old_csv" "$new_csv" > diff.csv
                                                                        And then, getting an SQL data-migration file out of it (at least for an SQL DB that has something like Postgres's COPY statement, which can read CSV directly) is as simple as:

                                                                            cat pre_ddl.sql copy_stmt_begin.sql header.csv diff.csv copy_stmt_end.sql post_ddl.sql > migration.sql
                                                                        You can then throw that file right into, say, Google Cloud SQL's "import" command.

                                                                        That being said, the other formats are nice for 1. keeping data like numbers in more compact binary forms, 2. being able to sort and de-dup the data slightly more cheaply, without having to parse anything at point-of-sort. (Though this matters less than you'd think; sort(1)'s minimal parser is very fast, and SQLite/Avro can't get any big access-time wins since the data is neither pre-sorted nor column-oriented.)

                                                                        But in exchange for this, you lose the ability to cheaply merge working datasets together. You can't just concatenate them—you have to ask your storage-format library to serialize data from one of your data files, and then ask the library to parse and import said data into your other data file. Frequently, the overhead of a complete deep parse of the data is the thing we're trying to avoid the expense of in the first place! (Otherwise, why use an ETL pipeline at all, when you could just have your operational data sources do batched SQL inserts directly to your data warehouse?)

                                                                • mattbillenstein 256 days ago

                                                                  You're wrong - JSON has types - it's very useful just because of that, and if you pass newline-delimited JSON through gzip you basically remove all the size overhead of the redundant keys...

                                                                  I think this guy did the right thing for what sounded like essentially a one-off job to test this new export tool. Why would you go to all the trouble to use a SQL database for a one-off thing that can be done using text processing or worst-case writing a small script?

                                                                • nrdgrl 256 days ago

                                                                  It sounds like Apple requires the data in JSON format - they may not have a choice.

                                                                • CyberMew 256 days ago

                                                                  I wonder if a C/C++ program would perform better?

                                                                  • JulianMorrison 256 days ago

                                                                    jq is a C program.

                                                                    In theory a truly specific program could work better. In practice, the broad scope of jq lets you discover the operations you need and respond to changes in requirements without being locked into custom code, and any given programmer probably couldn't do the same job better.

                                                                    • cryptonector 256 days ago

                                                                      jq is a C program, yes, but jq programs are interpreted. Because jq is a dynamically-typed language, it wouldn't be easy to compile it to object code that would run much faster than the bytecode-interpreted version (though it would still run faster).

                                                                      As you say, jq's power is that it is an expressive language, and it's much much easier to write jq programs that work than it is to write C/C++ programs as needed that do the same or similar work.

                                                                      • JulianMorrison 255 days ago

                                                                        The interpreted language is just the setup phase for a pipeline of compiled-in data transformations.

                                                                        • cryptonector 244 days ago

                                                                          I don't understand this statement. Keep in mind I'm a jq maintainer.

                                                                    • cryptonector 256 days ago

                                                                      Yes, if well-written, anyway.

                                                                    • GordonS 255 days ago

                                                                      > What’s the best way to compare these two 5GB files?

                                                                      A much simpler way to do this is simply to hash the files, for example using sha256sum, which AFAIK ships with just about every Linux distro. Then just compare the hashes.
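A sketch of the same check in stdlib Python, chunked so 5GB files never need to fit in memory. Keep in mind that matching hashes only prove the files are byte-for-byte identical, not that their JSON content is equivalent:

```python
import hashlib

def file_sha256(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 one chunk at a time (like sha256sum)."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Equal digests mean byte-identical files; reordered keys or records
# produce different digests even when the JSON is semantically the same.
```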

                                                                      • hannasanarion 255 days ago

                                                                        Having the same content is not the same as being identical verbatim.