Hm. I'm fully aware that I'm currently turning into a bearded DBA. And I may just be misreading the article, or not understanding it fully.
But, I started being somewhat confused by something:
> Fortunately, I had access to a large-memory (24 T) SGI system with 512-core Intel Xeon (2.5GHz) CPUs. All the IO is memory (/dev/shm) bound ie. the data is read from and written to /dev/shm.
> The total data size is 329GB.
At first glance, that's an awful lot of hardware for a ... decently sized but not awfully large dataset. We're dealing with datasets of that size on 32G or 64G of RAM -- just a wee bit less.
The article presents a lot more AWK knowledge than I have. I'm impressed by that. I acknowledge that.
But I'd probably put all of that into a postgres instance, compute indexes and rely on automated query optimization and parallelization from there. Maybe tinker with PG-Strom to offload huge index operations to a GPU. A lot of the shown scripting would be handled by postgres itself, with parallelization happening automatically based on the indexes, while eliminating the string serialization.
I do agree with the underlying sentiment of "We don't need hadoop". I'm impressed that AWK goes so far. I'd still recommend postgres in this case as a first solution. Maybe I just work with too many silly people at the moment.
Thank you for your comment. I hear you when you say postgres would probably be faster. However, I imagine there would be more work if I chose postgres and the whole development and testing paradigm would change.
First, I would need to figure out the right schema and populate the database.
Second, it would need some creative SQL acrobatics that I would probably not be comfortable with.
Third, it would probably be hard to perform quick tests at a small scale that I can perform easily with text files.
Fourth, the solution would probably be hard to port elsewhere where postgres is not available. Most Unix systems have awk available.
Fifth, programming the postgres db from a higher-level language would require connector API libs, which would be additional effort.
Note that this is not a production work -- it started simply as a hack to see how far I can go without getting into serious rabbitholes and giving up. Surprisingly, with Awk I went really far and never fell into any rabbithole so to speak.
Some of awk's strengths are iterative development, pattern matching and integration with other tools. A database's strengths are data consistency, concurrency (multiple readers/writers), transactions, and efficiency. So the problem itself plays to awk's strengths and doesn't require the strengths of a database. With a database you'd usually do much more analysis upfront: work out the appropriate data types, work out what is nullable, work out the exceptions to the rules, work out your indexes, etc., which isn't great for this type of quick-and-dirty problem. It also looks like most scripts cull the dataset on input instead of at processing time, which is more efficient.
I don't think it's really on display here because everything is run through Swift/T, but awk can also compose much better with other tools; a database frequently won't have CLI tools that integrate well (not sure about postgres). There are 322 source files being globbed, for instance; move that to a makefile and you can automatically re-run without wasting time on data already processed. If it was in a database you'd have to track source files and manage changes somehow.
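The makefile idea above can be sketched roughly like this. The filenames, field layout, and awk step are all made up for illustration; `.RECIPEPREFIX` is a GNU make setting that sidesteps the literal-tab rule for recipes.

```shell
# Hypothetical sketch: process each data/*.csv into out/*.tsv, and let
# make skip any file whose output is already up to date on a re-run.
mkdir -p data out
printf 'a,1\nb,2\n' > data/one.csv
printf 'c,3\n' > data/two.csv

cat > Makefile <<'EOF'
.RECIPEPREFIX := >
SRC := $(wildcard data/*.csv)
OUT := $(patsubst data/%.csv,out/%.tsv,$(SRC))

all: $(OUT)

out/%.tsv: data/%.csv
> awk -F, '{ print $$1 "\t" $$2 }' $< > $@
EOF

make    # first run: processes both files
make    # second run: nothing to do, outputs are up to date
```

Adding a new source file or touching an existing one makes only the affected outputs rebuild, which is exactly the "don't waste time on data already processed" behavior.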
> At first glance, that's an awful lot of hardware for a ... decently sized but not awfully large dataset. We're dealing with datasets that size at 32G or 64G of RAM, just a wee bit less.
Note that it doesn't require that much memory; it's just using it to boost performance because it's available. This could have been developed on a dual-core 4GB laptop outputting to spinning rust, and all that would change is the directory it's output to (and the running time), but the same machine might choke on database queries over that much data.
I think the first line itself (High Performance Computing) reveals that this is a special purpose computer (read - real supercomputer). 512 cores and 24TiB of memory further confirm that. However, this is just a small sized supercomputer (unless that's the limit of resources a user can reserve). I don't know the constraints on this particular SGI machine, but many regular apps cannot be run or are rendered useless in the HPC domain, even when using Linux.
I believe databases like postgres etc. would probably run on this system. I did not choose that route because I wanted to see how far I can go with Awk.
That is a shagload of hardware. I had to work in MSSQL with a ~100GB database and we only had, I think, 12 cores in those servers (or maybe it was just 6 + SMT). Avoiding IO is the key, so having enough memory to cache the working set is essential.
lmao I wish, I was running several-TB databases on servers with less than 128GB of RAM (thankfully we weren't processing the entire set), but that amount of RAM is eye-popping for the dataset; many data warehouse implementations I am familiar with on MSSQL have 10TB datasets with significantly less RAM than this :)
On the opposite end of the spectrum, I have overestimated our hardware requirements with 3x 128GB machines with 32 cores each, and the uncompressed dataset is 200 MB on disk.
Sqlite is single-writer for a single database. I mean. Of course, VHRanger, only your team will need to write to the database, and your team will make sure that only one person on your team writes to it at a time. Eh. I've been there too many times. Oh, but yes, your team will also figure out the fallout if things go wrong. Ah..
Ok. Maybe those are enterprise concerns: Sqlite doesn't scale regarding multiple users reading and writing. Of course it's a read only dataset, but do you know the bouquet of views and derived tables data scientists create around a read-only dataset? Hah. Oh and of course these are not critical, but if they get lost, shit hits the fan because it takes multiple weeks to rebuild them.
I've been in that swamp enough times to just install postgres and stop caring. Takes me 2 more hours now, but avoids weeks of discussions in the future.
> rely on automated query optimization and parallelization from there.
I know you said "automated parallelization" but... in newer versions of Postgres, what does it take to trigger some of the automatic query parallelization?
> what does it take to trigger some of the automatic query parallelization
Table scans, index scans (b-tree & bitmap only I believe), joins, and aggregations. There may be some limitations with joins, such as hash joins duplicating hashes across processes or merges requiring separate sorts.
Does this work outside of a shared-memory environment? I haven't had to fuss with postgres for several years, but last time I used it, this was non-trivial.
Yeah, I work with awk quite a lot and I find it weird that they needed so many resources.
Although I agree that for this problem Postgres looks like a better option, I wouldn't discard awk or other Unix tools. It is way faster than people think it is, easy to use and solves pretty well quite a lot of use cases, such as pre-processing records before database ingestion, aggregations, some simple queries on temporary data...
Just wanted to say that the resources were not needed. I simply had access to them and they were not being used much at the time, so I thought why not try them.
Alternatively, one can load the data into Google BigQuery and start running queries without indexing the data. It's not free, but it can save a lot of time. BigQuery supports importing JSON and CSV files from Google Cloud Storage and can even infer the schema.
I'm sure that with tools like MPI-Bash [1] and more generally libcircle [2] many embarrassingly parallelizable problems can easily be tackled with standard *nix tools.
I have lately used a shell `for` loop that only emits part of the iterated values (not all are eligible for further processing) and then fed the loop directly to the `parallel` tool via the pipe operator.
The results were indeed embarrassingly parallel.
I am a fan of some languages that seem better equipped to utilise our modern many-core machines and I'd still write a longer-living system with better guarantees in those languages -- but many people ignore shell goodness at their own peril.
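The pattern described above (a loop that emits only eligible values, piped to a parallel runner) looks roughly like this. Here `xargs -P` stands in for GNU `parallel` on systems where it isn't installed, and the "eligibility" test is made up for illustration:

```shell
# Emit only the eligible values from the loop (here: even numbers),
# then fan the survivors out to 4 concurrent workers.
for i in $(seq 1 10); do
    [ $((i % 2)) -eq 0 ] && echo "$i"
done | xargs -n 1 -P 4 echo processed > results.txt

# Worker output order is nondeterministic, so sort before inspecting.
sort results.txt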
I've done something very similar recently, for processing hundreds of GB of .csv files across thousands of records. I ended up writing a fast single-threaded Rust program for the file parsing, then fed it into a bash script running GNU Parallel. That thing flies, and watching the system monitor showing every thread on the system get pegged to 100% simultaneously is a neat experience.
I'm sure I could have done the parallelization itself in Rust given enough time, but honestly I found the `parallel` command to be so easy and resulting in so little overhead that it didn't really even seem worth it to spend the time on a language-native solution.
That's certainly true, but it doesn't look like much of a factor in this case. I did the preliminary benchmarking on an 8-thread CPU (the actual HPC cluster has higher counts), and the speedup on the processing was just under 8x; I imagine the shortfall came from OS-level thread management overhead. I still need to check that the pattern holds as the thread count increases, but I'm fairly optimistic it'll work out as long as I don't run into thermal throttling or I/O limits.
This can't be generalised, but in many cases it did help me with CPU-bound tasks.
Obviously there are cases where parallelisation is superfluous. F.ex. if you have 1000 records of something, splitting the work of processing them (and we're talking really light processing, like JSON encoding) among 4-8 threads is actually detrimental to performance.
Big agree. GNU parallel, xargs, and make -j are all very useful basic tools for embarrassingly parallel workloads.
I've been developing simulation software that does Monte Carlo over realizations of a simulated universe, and xargs and (later) parallel were super useful for parts of the workload. All the parallel job instances run the same simulation code, but for a different simulated universe, all controlled by a random number seed, so you can generate an ensemble of simulations with basically:
`head -NumberOfSimsWanted seeds.txt | xargs simulation.py`
Yep! Even as a fan of more parallel-friendly programming languages I still think having single-core algorithms applied to different input parameters in parallel is a very viable alternative approach if your problem space allows for it.
I quite disliked the journey. The UNIX shell legacy shenanigans are real and can be a pain. But you can still bear a lot of fruit without going to the more arcane corners of it.
I have lately been working on a genetic algorithm system, written in single-threaded Perl. When I started off wanting to do, say, 20 simulations, I just used parallel, and it worked great on the 8 threads of my desktop machine. Now I've done a bit more and want to do long-running simulations that might take a few hours, I've taken to just firing it all at AWS Batch and having a script that uploads the results to S3. Now I can do 1-200 instances at the same time, which my own hardware isn't up to in a reasonable timeframe.
I have a pretty strong workstation (iMac Pro) that I currently don't use all the time. And a gaming PC that I barely touch in the last 9-10 months. Hit me up if you want me to lend you some CPU power sometime.
Cheers, but the configuring of it would be more hassle than it's worth, I suspect. As it is, using AWS is very cheap as I can have it use spot instances. So hundreds of CPU-hours used for on the order of US$10-20.
I wound up writing a hybrid bash script with embedded Makefile to re-encode my FLAC collection to Opus. It was nice to be able to use -j16 on that task.
Switching to Opus saved a great deal of space on my phone's SD card.
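One way to sketch that bash-plus-embedded-Makefile hybrid: have make read the Makefile from stdin (`-f -`) via a heredoc, so a single script file carries both. `cp` stands in for the real `opusenc` call so the sketch runs anywhere; `.RECIPEPREFIX` (GNU make) avoids the literal-tab rule.

```shell
# Hypothetical sketch: re-encode every flac/*.flac to opus/*.opus with
# 4 jobs in parallel, driven by a Makefile embedded in the script.
mkdir -p flac opus
touch flac/a.flac flac/b.flac flac/c.flac

make -j 4 -f - <<'EOF'
.RECIPEPREFIX := >
FLAC := $(wildcard flac/*.flac)
OPUS := $(patsubst flac/%.flac,opus/%.opus,$(FLAC))

all: $(OPUS)

opus/%.opus: flac/%.flac
> cp $< $@  # real version would be something like: opusenc $< $@
EOF
```

Bumping `-j 4` to `-j 16` is all it takes to saturate a bigger machine, and interrupted runs resume where they left off since finished outputs are up to date.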
A lot of people do not like to use the features already provided by their environment, and instead take a single element of the original feature set and reimplement the same features within that single element.
Sometimes it improves performance, sometimes not. Sometimes it improves productivity, sometimes not. I am not sure of the global balance.
* The OS provides process isolation and management, as well as various IPCs to deal with their interconnection? Let's use a single process and create threads, and then fibres. More performance? yes, a bit; more headaches? yes, many.
* The display protocol/server provides lots of primitives for managing graphical interfaces, together with many benefits? Let's just use it to get a canvas and reimplement everything ourselves inside that canvas (and thereby turn a snappy interface into something sluggish when the display is not local).
* A GUI toolkit provides easy to use, multiplatform interface elements? Let's take a single program which only uses a single canvas from the toolkit, and reimplement a whole GUI toolkit within it (in one of the worst languages, in a program not made at all for the job, otherwise it wouldn't be fun).
* The OS provides management for multiple users and groups, together with 2 systems to manage resource access rights? Nah, let's run everything as one user, and then try to implement some separation/protection within. Advanced level: run everything not only as a single user, but inside a single program, and then try to reimplement some isolation within it.
I agree with your premise but your examples lack nuance and aren't strictly true.
> The OS provides process isolation and management, as well as various IPCs to deal with their interconnection? Let's use a single process and create threads, and then fibres. More performance? yes, a bit; more headaches? yes, many.
Many one-off CLI programs can't afford the overhead of N process forks (where N = CPU threads); that overhead is measurable. It's objectively much quicker to spawn N threads inside the same process, especially if your separate tasks need to converge and/or exchange messages here and there.
And runtimes like Erlang's BEAM VM have unquestionable benefits (like preemptive green threads / fibers that exchange immutable messages), although they aren't well known for big raw processing muscle.
> The display protocol/server provides lots of primitives for managing graphical interfaces, together with many benefits? Let's just use it to get a canvas and reimplement everything ourselves inside that canvas (and thereby turn a snappy interface into something sluggish when the display is not local).
Which one is that? Linux's X11 is flawed in many regards, with the ability of each program to freely capture keystrokes being one of the most egregious. I can't blame people for trying to escape such hell and implement other platforms for the same thing -- without the drawbacks.
> A GUI toolkit provides easy to use, multiplatform interface elements? Let's take a single program which only uses a single canvas from the toolkit, and reimplement a whole GUI toolkit within it (in one of the worst languages, in a program not made at all for the job, otherwise it wouldn't be fun).
People wanted multi-platform desktop programs. F.ex. in my rather small country, the only reason a piece of software for managing shop stock and customer purchases, backorders, inventory etc. succeeded was that it was written in Java Swing; many shops used old Windows machines, but no small number of them also used Macs, and some even used Ubuntu netbooks. That software would never have sold back in 2007 if it wasn't OS-neutral.
Also, Qt has commercial licensing that requires paying to use. Many, me included, don't want to deal with that. The programming world is complex as it is and if I suddenly find myself having to pay royalties, personally, to Qt, years after I stopped working for the employer I developed a desktop app for, that would be catastrophic. So many just dodge such potential bullets.
> The OS provides management for multiple users and groups, together with 2 systems to manage resource access rights? Nah, let's run everything as one user, and then try to implement some separation/protection within. Advanced level: run everything not only as a single user, but inside a single program, and then try to reimplement some isolation within it.
That's quite fair, however people want much more isolation than the OS usually offers -- hence stuff like FreeBSD jails, LXC containers / Docker, Windows' Sandboxie program etc. It's not enough to just run, say, Chrome, as a different user if you suspect Google is scanning your machine.
---
Again, I agree with your premise, but the examples aren't really accurate. A lot of the modern OSes and their technologies aren't as top-notch as we'd like (the Linux kernel and network infrastructure probably being some very rare exceptions).
> The results were indeed embarrassingly parallel.
The xargs/parallel model can also scale up to many more nodes, if you are willing to introduce ssh into your shell scripts and do a minimum of stdio piping.
I wrote a python mapreduce program named "batchman" a few years ago, which was aimed at running awk on hundreds of machines (& then being able to fire more things based on the results).
zgrep on gzipped logs + awk + a central scheduler is extremely fast, to the point of beating Splunk at looking up exact things like "does this zid have another session which kept logging in while the game was throwing out-of-sequence errors?".
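A toy version of that kind of lookup, with a made-up log format (timestamp, session id, event) and made-up ids:

```shell
# Build a small gzipped log, then answer "which sessions hit an
# out-of-sequence error?" with zgrep + awk.
printf '%s\n' \
  '2019-01-01T10:00:00 sess1 login' \
  '2019-01-01T10:00:05 sess2 login' \
  '2019-01-01T10:00:09 sess2 out-of-sequence' \
  '2019-01-01T10:00:12 sess1 logout' | gzip > game.log.gz

zgrep 'out-of-sequence' game.log.gz | awk '{ print $2 }' | sort -u > bad.txt
cat bad.txt
```

The same pipeline fanned out over hundreds of hosts via ssh, with a scheduler merging the results, is essentially the batchman idea described above.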
> I have to admit I have no clue how to do that, any pointers?
My code is poorly documented and badly written, but it works -- at least the ssh_copy_id one, which I use enough to keep updated[1] (it logs in over ssh with passwords to place the ssh keys).
That's my script, the brains of it is basically an ssh connector.
Originally, I had awk scripts to process ^A-separated logs; I'd read the output stream back and formulate my next script over the logs to drill down in a loop.
Another vote for parallel, but I also wanted to mention using Redis for its atomic operations. It's very straightforward to push a list of things to do onto a Redis queue and then have your scripts, run via parallel, pop one piece of work each and push the finished item to either a success or failure queue.
Unfortunately I no longer have a working example, but as I recall it wasn't difficult to keep n scripts running and popping entries until the queue emptied.
Thanks for mentioning it. I gave up on Redis because I had a lot of trouble sharding it / distributing it years ago and I kind of forgot about it and stopped being interested in its features.
But what you mention is very interesting and I'll have it in mind for the future. Thank you.
It's odd that TFA has this focus on performance but doesn't mention which awk implementation was used; at least I haven't found any mention of it. There are 3-4 implementations in mainstream use: nawk (the one true awk, an ancient version of which is installed on macOS by default), mawk (installed on e.g. Ubuntu by default), gawk (on RHEL by default, last I checked), and busybox awk. Tip: mawk is much faster than the others, and to get performance out of gawk you should use LANG=C (also because some versions of gawk 3 and 4 crash on complex regexps in Unicode locales).
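The locale tip is just an environment prefix. Here's a runnable toy; swap `awk` for `mawk` or `gawk` and wrap each line in `time` to compare implementations on your system (the workload and numbers are made up for illustration):

```shell
# A small regex-heavy workload: count the numbers in 1..100000 whose
# decimal form ends in "99". Under LC_ALL=C, gawk can use byte-wise
# rather than multibyte string/regex handling, which is often faster.
seq 1 100000 > nums.txt
LC_ALL=C awk '/99$/ { n++ } END { print n }' nums.txt > count.txt
cat count.txt
```

The answer is the same either way; only the speed (and, per the crash caveat above, the robustness on some gawk versions) differs with the locale.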
Thank you for your comment. I should have clarified that I use gawk version 4.0.2. I meant to update the post, but somehow am unable to push to the repo or log in to my github.
Yeah, just read in another answer of yours that for your scripts regexp performance probably dominates the results, as there's not much procedural code to make a difference.
Thank you! I was hunting down this link for a coworker a few months ago, but I forgot the keyword “taco bell”, and my Google-fu failed me...
I love these kind of “counterculture” uses of UNIX tools to solve hard problems. The boundary of “where to stop using xargs, awk, and grep and start using Python” is pretty blurry for a surprising number of tasks, especially if you’re willing to invest in mastering those tools!
I've always wanted to build a parallel awk.
And call it pawk.
And have an O'Reilly book about it.
With a chicken on the cover.
pawk, pawk, pawk!
This is a true story, sadly.
250M records isn't big for a lot of data platforms. I've seen plenty of solutions in traditional RDBMSs that would easily scale an order of magnitude larger than this, if not several.
Moving into anything analytically focused, like an OLAP engine, or just a columnstore in an RDBMS easily gives you another order of magnitude or few.
kdb/q is amazing. I have seen some experienced folks make jaw-dropping data computations at the drop of a hat on gigabyte-sized data. Makes everything else look laughably puny.
I suspect if it was open source, it would probably be the most popular big-data storage and computing platform.
gawk has been my go-to text processing program for many years. I have written a number of multi-thousand line programs in it.
I always use the lint option. Catches many of my errors.
One of these had to read a 300,000,000-byte file into a single string so it could be searched. The file was in 32-byte lines. At first, I read each line in and appended it to the result string, but that took way too long since the result string was reallocated each time it was appended to. So instead I read in about 1000 of the 32-byte lines, appending them to a local string. This 32,000-byte string was then appended to the result, so the expensive append was only done about 10,000 times. Worked fine.
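A sketch of that buffering pattern, with the sizes shrunk so it runs quickly (file name and contents made up; the chunking idea is the same):

```shell
# Build a toy file of 32-byte lines (31 chars + newline), then read it
# into one string, flushing a small local buffer into the big result
# every 1000 lines so the big string is reallocated far less often.
awk 'BEGIN { for (i = 0; i < 5000; i++) printf "%031d\n", i }' > lines.txt

awk '
{
    buf = buf $0
    if (++n == 1000) { result = result buf; buf = ""; n = 0 }
}
END {
    result = result buf            # flush the final partial chunk
    print length(result)           # 5000 lines x 31 chars = 155000
}' lines.txt > len.txt
cat len.txt
```

The naive `result = result $0` on every line is O(n²) in total copying; batching the appends through `buf` cuts the number of big-string reallocations by the chunk factor.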
Spending minutes on these tasks on hardware like this is pretty silly. awk is fine if these are just one-off scripts where development time is the priority, otherwise you're wasting tons of compute time.
Querying things like these on such a small dataset should take seconds, not minutes.
Thank you for your comment. Most of the solutions indeed take less than a minute. The solutions to problems 1 and 3 took 25 seconds each, and the one to problem 5 took 26.
The solution that took 9 minutes involved processing the abstract in each record. The abstracts are quite sizeable on some of these publications. Processing millions of them took time.
The solution that took 48 minutes involved a nested loop, effectively reaching an iteration count of 216 years times 256M records, which comes to about 55B.
Hope this clarifies things a bit but I am not claiming this to be the most optimized solution. I am sure there is scope for refinements -- this was my take on it.
I am often impressed by the things that can be done with these old-school UNIX tools. I'm trying to learn a few of them, and the most difficult part is these very implicit syntax constructions. How is the naive observer to know that in bash `$(something)` is command substitution, but in a Makefile `$(something)` is just a normal variable? With `awk`, `sed` and friends it gets even worse, of course.
Is the proper answer 'just learn it'? Are these tools one of these things (like musical instruments or painting) where the initial learning phase is tedious and frustrating, but the potential is basically limitless?
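To make the `$(something)` confusion concrete, here is the same four-character syntax meaning two different things (a small sketch; `$(info ...)` is GNU make's way of printing during parsing):

```shell
# In the shell, $(...) is command substitution: run a command, keep output.
who=$(echo world)
echo "hello $who" > shell.out

# In a Makefile, $(...) is make's own variable/function expansion,
# evaluated by make itself before any shell ever sees the line.
make -s -f - <<'EOF' > make.out
WHO := world
$(info hello $(WHO))
all: ;
EOF

cat shell.out make.out
```

Same characters, two unrelated evaluators, and the only way to know which applies is knowing which program is reading the text. To some extent, "just learn it" really is the answer.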
First, I think it is great that you found a tool that suits your needs. A few weeks ago I was mangling some data too (just about 17 million records) and would like to contribute my experience.
My tools of choice were awk, R, and Go (in that order). Sometimes I could calculate something within a few seconds with awk. But for various calculations, R proved to be a lot faster. At some point, I reached a problem where the simple R implementation I borrowed from Stack Overflow (which was supposed to be much faster than the other posted solutions) did not satisfy my expectations, and I spent 4 hours writing an implementation in Go which was an order of magnitude faster (I think it was about 20 minutes vs. 20 seconds).
So my advice is to broaden your toolset. When you reach the point where a single execution of your awk program takes 48 minutes, it might be worth considering using another tool. However, that doesn't mean awk isn't a good tool, I still use it for simple things, as writing 2 lines in awk is much faster than writing 30 in Go for the same task.
https://mobile.twitter.com/awkdb was a joke account made in frustration by a coworker trying to operate a Hadoop cluster almost a decade ago. Maybe it's time to hand over the account...
Wow. I didn't realize you could run so many CPUs in one address space. This thing is basically one huge computer. Would love to open up Gnome System Monitor and see 512 cores!
Core counts have gone up quite considerably in the last few years, certainly in a compute farm context. 10 years ago, you'd be looking at 2U servers with 8 cores (dual quad core) as high density. These days, 2U servers can pack 4 sockets, and processors are in the high-20s of cores, so if you've got deep pockets you can get over 100 cores in 2U.
Thank you for the comment. One reason to use awk was that I wanted to see how far I could go with Awk. I will check out data.table. Does it offer any kind of parallelism?
But you cannot share the details of the particular use case so someone else can test your claims?
"After "exploring" many options, kdb+ was the clear winner for my use case. I process an average of 1.6TB per day."
See how silly that sounds without any details or credibility? Does "clear winner" mean it was faster? How does a person know if the recommender even tried both solutions? What if someone else's use case is different from the recommender's? How could someone verify that one solution was faster than the other?
The easiest way is to provide sample data and a processing task and let people try the two solutions for themselves.
Slightly off topic, but as a Swift developer (https://swift.org/), the usage of Swift/T in this project really confused me. Is Swift/T in any way related to Apple's Swift language?
The naming conflict makes googling the differences fairly challenging.
The two Swifts are unrelated. I was aware of the potential confusion, which is why I did not use Swift in the title and mention explicitly in the blog that this is not Apple's Swift. Fun story: Apple contacted the Swift/T team before launching Apple Swift and made a mention of it on their page.
Indeed I did try mawk and found it to be faster. However, when I set the locale with LC_ALL=C, the performance of awk and mawk was almost the same. I also left it at awk in favor of portability -- mawk is not available on most systems and needs to be installed.
@ketanmaheshwari, which supercomputer is this?
This is an SGI UV300 system similar to this one: https://www.hpcwire.com/2016/05/11/tgac-installs-largest-sgi...
Those are massive, massive machines.
There'd be much less setup overhead
[1] https://github.com/lanl/MPI-Bash [2] http://hpc.github.io/libcircle/
https://www.gnu.org/software/parallel/
*SAPANS: as simple as possible, and no simpler
Do note that awk and GNU Parallel are tools that use/contain C-like syntax, which is abhorrent to some factions. GNU Parallel is written in Perl.
Obviously there are cases where parallelisation is superfluous. For example, if you have 1000 records of something, splitting the work of processing those (and we're talking really light processing, like JSON encoding) among 4-8 threads is actually detrimental to performance.
I've been developing a simulation software that does Monte Carlo over realizations of a simulated universe, and xargs and (later) parallel were super useful for parts of the workload. All the parallel job instances run the same simulation code, but for a different simulated universe, all controlled by a random number seed, so you can generate an ensemble of simulations with basically:
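A runnable sketch of that pattern, with a stand-in for the real simulation command (hypothetical): each seed gets its own process, and xargs keeps a fixed number of them in flight until the ensemble is done.

```shell
# One simulated universe per seed. "run_sim" is faked inline here so the
# sketch runs anywhere; in practice it would be the simulation binary
# taking the seed as its only varying input.
mkdir -p out
seq 1 8 | xargs -P 4 -I{} sh -c 'seed="$1"; echo "universe $seed" > "out/$seed.dat"' _ {}
# xargs only returns once every job has finished, so the ensemble is
# complete when this line is reached.
```

With GNU parallel the same idea is roughly `seq 1 8 | parallel ./run_sim --seed {}` (assuming a hypothetical `--seed` flag); parallel adds niceties like per-job output serialisation and retries.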
I quite disliked the journey. The UNIX shell legacy shenanigans are real and can be a pain. But you can still bear a lot of fruit without going to the more arcane corners of it.
Thanks for mentioning AWS Batch, I didn't know about it and will look it up.
Sometimes it improves performance, sometimes not. Sometimes it improves productivity, sometimes not. I am not sure of the global balance.
* The OS provides process isolation and management, as well as various IPCs to deal with their interconnection? Let's use a single process and create threads, and then fibres. More performance? yes, a bit; more headaches? yes, many.
* The display protocol/server provides lots of primitives for managing graphical interfaces, together with many benefits? Let's just use it to get a canvas and reimplement everything ourselves inside that canvas (and thereby turn a snappy interface into something sluggish when the display is not local).
* A GUI toolkit provides easy to use, multiplatform interface elements? Let's take a single program which only uses a single canvas from the toolkit, and reimplement a whole GUI toolkit within it (in one of the worst languages, over a program not made at all for the job, otherwise it wouldn't be fun).
* The OS provides management for multiple users and groups, together with 2 systems to manage resource access rights? Nah, let's run everything as one user, and then try to implement some separation/protection within. Advanced level: run everything not only as a single user, but inside a single program, and then try to reimplement some isolation within it.
> The OS provides process isolation and management, as well as various IPCs to deal with their interconnection? Let's use a single process and create threads, and then fibres. More performance? yes, a bit; more headaches? yes, many.
Many one-off CLI programs can't afford to fork N processes (where N = CPU threads); the fork overhead is real. It's objectively much quicker to spawn N threads inside the same process, especially if your separate tasks need to converge and/or exchange messages here and there.
And runtimes like Erlang's BEAM VM have unquestionable benefits (like preemptive green threads / fibers that exchange immutable messages), although they aren't well known for big raw processing muscle.
> The display protocol/server provides lots of primitives for managing graphical interfaces, together with many benefits? Let's just use it to get a canvas and reimplement everything ourselves inside that canvas (and thereby turn a snappy interface into something sluggish when the display is not local).
Which one is that? Linux's X11 is flawed in many regards, with the ability of each program to freely capture keystrokes being one of the most egregious. I can't blame people for trying to escape such hell and implement other platforms for the same thing -- without the drawbacks.
> A GUI toolkit provides easy to use, multiplatform interface elements? Let's take a single program which only use a single canvas from the toolkit, and reimplement a whole GUI toolkit within (in one of the worst languages, over a program not made at all for the job, otherwise it wouldn't be fun).
People wanted multi-platform desktop programs. F.ex. in my rather small country the only reason a software for managing shop stock and customer purchases, backorders, inventory etc. succeeded was because it was written in Java Swing; many shops used old Windows machines but no small amount of them also used Macs and some even used Ubuntu netbooks. That software would never sell back in 2007 if it wasn't OS-neutral.
Also, Qt has commercial licensing that requires paying to use. Many, me included, don't want to deal with that. The programming world is complex as it is and if I suddenly find myself having to pay royalties, personally, to Qt, years after I stopped working for the employer I developed a desktop app for, that would be catastrophic. So many just dodge such potential bullets.
> The OS provides management for multiples users and groups, together with 2 systems to manage resources access rights? Nah, let's run everything as one user, and then try to implement some separation/protection within. Advanced level: run everything not only as a single user, but inside a single program, and then try to reimplement some isolation within it.
That's quite fair, however people want much more isolation than the OS usually offers -- hence stuff like FreeBSD jails, LXC containers / Docker, Windows' Sandboxie program etc. It's not enough to just run, say, Chrome, as a different user if you suspect Google is scanning your machine.
---
Again, I agree with your premise, but the examples aren't really accurate. A lot of the modern OSes and their technologies aren't as top-notch as we'd like (the Linux kernel and network infrastructure probably being some very rare exceptions).
Only if you intend on modifying Qt and distributing it. Otherwise, you're free to build commercial software using Qt under the LGPL license.
The xargs/parallel mode can also scale up to many more nodes, if you are willing to introduce ssh into your shell scripts & do a minimum of stdio piping.
I wrote a python mapreduce program named "batchman" a few years ago, which was aimed at running awk on hundreds of machines (& then being able to fire more things based on the results).
zgrep on gzipped logs + awk + a central scheduler is extremely fast, to the point of beating Splunk at looking up exact things like "does this zid have another session where they kept logging in while the game was throwing out-of-sequence errors?".
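A local, runnable sketch of the per-node half of that query (the "timestamp zid event" log layout is hypothetical); in the distributed version the same pipeline runs over ssh on each node and the central scheduler merges the results.

```shell
# Fake a gzipped log so the sketch is self-contained.
mkdir -p logs
printf '2019-01-01 z42 login\n2019-01-01 z42 out-of-sequence\n2019-01-01 z7 login\n' |
  gzip > logs/game.log.gz

# zgrep narrows to the zid cheaply, awk applies the structured condition.
zgrep 'z42' logs/game.log.gz |
  awk '$3 == "out-of-sequence" { print $2, "had errors" }'
```

The grep-first, awk-second ordering matters at scale: the coarse string match discards most of the (decompressed) stream before the field-level logic ever runs.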
I've sometimes thought of utilising my 4 machines at home for such things, and implementing this with standard tools would be hugely beneficial.
My code is poorly documented and badly written, but it works -- at least the ssh_copy_id one, which I use enough to keep updated[1] (it ssh-logins with passwords to place the ssh keys).
That's my script; the brains of it is basically an ssh connector.
Paramiko is as simple as
Originally, I had awk scripts to process ^A-separated logs; I'd read the output stream back and formulate my next script over the logs to drill down in a loop.

[1] - https://github.com/t3rmin4t0r/batchman/blob/master/ssh_copy_...
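The ^A-separated part looks like this (field meanings hypothetical; note that awk implementations such as gawk and mawk interpret the `\001` escape in `-F`):

```shell
# ^A (0x01) is a handy field separator because it never appears in
# ordinary log text. Hypothetical "user ^A event ^A count" records:
printf 'alice\001login\00142\nbob\001logout\0017\n' |
awk -F '\001' '$2 == "login" { print $1, $3 }'
```

From there, drilling down is just generating the next awk condition from the previous run's output.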
Unfortunately I no longer have a working example, but as I recall it wasn't difficult to keep N scripts running and popping entries until the queue emptied.
But what you mention is very interesting and I'll have it in mind for the future. Thank you.
Edit: Done.
I love these kinds of “counterculture” uses of UNIX tools to solve hard problems. The boundary of “where to stop using xargs, awk, and grep and start using Python” is pretty blurry for a surprising number of tasks, especially if you’re willing to invest in mastering those tools!
https://github.com/alecthomas/pawk
250M records isn't big for a lot of data platforms. I've seen plenty of solutions in traditional RDBMSs that would easily scale an order of magnitude larger than this, if not several.
Moving into anything analytically focused, like an OLAP engine, or just a columnstore in an RDBMS easily gives you another order of magnitude or few.
I suspect that if it were open source, it would probably be the most popular big-data storage and computing platform.
Querying things like these on such a small dataset should take seconds, not minutes.
The solution that took 9 minutes involved processing the abstract in each record. The abstracts are quite sizeable on some of these publications. Processing millions of them took time.
The solution that took 48 minutes involved a nested loop, effectively reaching an iteration count of 216 years times 256M records, which comes to about 55 billion.
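A toy sketch of that loop shape (the 1800-2015 year range and the "year title" field layout are assumptions for illustration): every record is tested against every candidate year, so the work scales as records x years.

```shell
# Each record is checked against all 216 years in turn - the same shape
# that reaches ~55 billion iterations at 256M records.
printf '1910 recA\n2001 recB\n' | awk '
  { for (y = 1800; y <= 2015; y++)   # 216 candidate years per record
      if ($1 == y) count[y]++ }
  END { for (y in count) print y, count[y] }
' | sort
```

An obvious refinement under these assumptions would be indexing straight into the array (`count[$1]++`), turning the inner loop into O(1) per record, but the point above is where the 48 minutes went, not that the solution was optimal.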
Hope this clarifies things a bit but I am not claiming this to be the most optimized solution. I am sure there is scope for refinements -- this was my take on it.
Is the proper answer 'just learn it'? Are these tools one of these things (like musical instruments or painting) where the initial learning phase is tedious and frustrating, but the potential is basically limitless?
https://www.gnu.org/software/recutils/
https://en.wikipedia.org/wiki/Recfiles
My tools of choice were awk, R, and Go (in that order). Sometimes I could calculate something within a few seconds with awk. But for various calculations, R proved to be a lot faster. At some point, I reached a problem where the simple R implementation I borrowed from Stack Overflow (which was supposed to be much faster than the other posted solutions) did not satisfy my expectations, and I spent 4 hours writing an implementation in Go which was an order of magnitude faster (I think it was about 20 minutes vs. 20 seconds).
So my advice is to broaden your toolset. When you reach the point where a single execution of your awk program takes 48 minutes, it might be worth considering using another tool. However, that doesn't mean awk isn't a good tool; I still use it for simple things, as writing 2 lines in awk is much faster than writing 30 in Go for the same task.
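The kind of job where a couple of lines of awk replace a page of Go (column layout hypothetical): summing a value per key in one pass.

```shell
# Group-by-key sum: key in $1, value in $3. Two lines of logic, total.
printf 'a x 10\na y 5\nb z 7\n' |
awk '{ sum[$1] += $3 } END { for (k in sum) print k, sum[k] }' | sort
```

The equivalent Go version needs a scanner, field splitting, integer parsing with error handling, a map, and sorted output - easily 30 lines before it compiles.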
The solution would be much less error prone and most likely much quicker as well.
How do you know this, i.e., can you share a benchmark -- the test data, not the results -- that you ran?
Data.table was the clear winner for my use case.
"After "exploring" many options, kdb+ was the clear winner for my use case. I process an average of 1.6TB per day."
See how silly that sounds without any details or credibility? Does "clear winner" mean it was faster? How does a person know if the recommender even tried both solutions? What if someone else's use case is different from the recommender's? How could someone verify that one solution was faster than the other?
The easiest way is to provide sample data and a processing task and let people try the two solutions for themselves.
The naming conflict makes googling the differences fairly challenging.