Hm. I'm fully aware that I'm currently turning into a bearded DBA. And I may just be misreading the article, or not understanding it fully.
But, I started being somewhat confused by something:
> Fortunately, I had access to a large-memory (24 T) SGI system with 512-core Intel Xeon (2.5GHz) CPUs. All the IO is memory (/dev/shm) bound ie. the data is read from and written to /dev/shm.
> The total data size is 329GB.
At first glance, that's an awful lot of hardware for a ... decently sized but not awfully large dataset. We're dealing with datasets of that size on 32G or 64G of RAM -- just a wee bit less.
The article presents a lot more AWK knowledge than I have. I'm impressed by that. I acknowledge that.
But I'd probably put all of that into a postgres instance, compute indexes and rely on automated query optimization and parallelization from there. Maybe tinker with PG-Strom to offload huge index operations to a GPU. A lot of the shown scripting would be handled by postgres itself, with parallelization happening automatically based on the indexes, while eliminating the string serialization.
I do agree with the underlying sentiment of "We don't need hadoop". I'm impressed that AWK goes so far. I'd still recommend postgres in this case as a first solution. Maybe I just work with too many silly people at the moment.
Thank you for your comment. I hear you when you say postgres would probably be faster. However, I imagine there would be more work if I chose postgres and the whole development and testing paradigm would change.
First, I would need to figure out the right schema and populate the database.
Second, it would need some creative SQL acrobatics that I would probably not be comfortable with.
Third, it would probably be hard to perform quick tests at a small scale that I can perform easily with text files.
Fourth, the solution would probably be hard to port elsewhere where postgres is not available. Most Unix systems have awk available.
Fifth, programming the postgres db from a higher-level language would require connector API libs, which would be additional effort.
Note that this is not a production work -- it started simply as a hack to see how far I can go without getting into serious rabbitholes and giving up. Surprisingly, with Awk I went really far and never fell into any rabbithole so to speak.
Some of awk's strengths are iterative development, pattern matching and integration with other tools. A database's strengths are data consistency, concurrency (multiple readers/writers), transactions, and efficiency. So the problem itself plays to awk's strengths and doesn't require the strengths of a database. With a database you'd usually do much more analysis upfront: work out the appropriate data types, work out what is nullable, work out the exceptions to the rules, work out your indexes, etc., which isn't great for this type of quick-and-dirty problem. It also looks like most scripts cull the dataset on input instead of at processing time, which is more efficient.
I don't think it's really on display here because everything is run through Swift/T, but awk can also compose much better with other tools; a database frequently won't have CLI tools that integrate well (not sure about postgres). There are 322 source files being globbed, for instance; move that to a makefile and you can automatically re-run without wasting time on data already processed. If it was in a database you'd have to track source files and manage changes somehow.
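The makefile idea above can be sketched roughly like this. The filenames, field layout, and awk step are all made up for illustration; `.RECIPEPREFIX` is a GNU make setting that sidesteps the literal-tab rule for recipes.

```shell
# Hypothetical sketch: process each data/*.csv into out/*.tsv, and let
# make skip any file whose output is already up to date on a re-run.
mkdir -p data out
printf 'a,1\nb,2\n' > data/one.csv
printf 'c,3\n' > data/two.csv

cat > Makefile <<'EOF'
.RECIPEPREFIX := >
SRC := $(wildcard data/*.csv)
OUT := $(patsubst data/%.csv,out/%.tsv,$(SRC))

all: $(OUT)

out/%.tsv: data/%.csv
> awk -F, '{ print $$1 "\t" $$2 }' $< > $@
EOF

make    # first run: processes both files
make    # second run: nothing to do, outputs are up to date
```

Adding a new source file or touching an existing one makes only the affected outputs rebuild, which is exactly the "don't waste time on data already processed" behavior.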
> At first glance, that's an awful lot of hardware for a ... decently sized but not awfully large dataset. We're dealing with datasets that size at 32G or 64G of RAM, just a wee bit less.
Note that it doesn't require that much memory; it's just using it to boost performance because it's available. This could have been developed on a dual-core 4GB laptop outputting to spinning rust, and all that would change is the directory it's output to (and the running time), but the same machine might choke on database queries over that much data.
I think the first line itself (High Performance Computing) reveals that this is a special purpose computer (read - real supercomputer). 512 cores and 24TiB of memory further confirm that. However, this is just a small sized supercomputer (unless that's the limit of resources a user can reserve). I don't know the constraints on this particular SGI machine, but many regular apps cannot be run or are rendered useless in the HPC domain, even when using Linux.
I believe databases like postgres etc. would probably run on this system. I did not choose that route because I wanted to see how far I can go with Awk.
That is a shagload of hardware. I had to work in MSSQL with a ~100GB database and we only had, I think, 12 cores in those servers (or maybe it was just 6 + SMT). Avoiding IO is the key, so having enough memory to cache the working set is essential.
lmao I wish, I was running several-TB databases on servers with less than 128GB of RAM (thankfully we weren't processing the entire set), but that amount of RAM is eye-popping for the dataset; many data warehouse implementations I am familiar with on MSSQL have 10TB datasets with significantly less RAM than this :)
On the opposite end of the spectrum, I have overestimated our hardware requirements with 3x 128GB machines with 32 cores each, and the uncompressed dataset is 200 MB on disk.
Sqlite is single-writer for a single database. I mean. Of course, VHRanger, only your team will need to write to the database, and your team will make sure that only one person on your team writes to it at a time. Eh. I've been there too many times. Oh, but yes, your team will also figure out the fallout if things go wrong. Ah..
Ok. Maybe those are enterprise concerns: Sqlite doesn't scale regarding multiple users reading and writing. Of course it's a read only dataset, but do you know the bouquet of views and derived tables data scientists create around a read-only dataset? Hah. Oh and of course these are not critical, but if they get lost, shit hits the fan because it takes multiple weeks to rebuild them.
I've been in that swamp enough times to just install postgres and stop caring. Takes me 2 more hours now, but avoids weeks of discussions in the future.
> rely on automated query optimization and parallelization from there.
I know you said "automated parallelization" but... in newer versions of Postgres, what does it take to trigger some of the automatic query parallelization?
> what does it take to trigger some of the automatic query parallelization
Table scans, index scans (b-tree & bitmap only I believe), joins, and aggregations. There may be some limitations with joins, such as hash joins duplicating hashes across processes or merges requiring separate sorts.
Does this work outside of a shared-memory environment? I haven't had to fuss with postgres for several years, but last time I used it, this was non-trivial.
Yeah, I work with awk quite a lot and I find it weird that they needed so many resources.
Although I agree that for this problem Postgres looks like a better option, I wouldn't discard awk or other Unix tools. It is way faster than people think it is, easy to use and solves pretty well quite a lot of use cases, such as pre-processing records before database ingestion, aggregations, some simple queries on temporary data...
Just wanted to say that the resources were not needed. I simply had access to them and they were not being used much at the time, so I thought why not try them.
Alternatively, one can load the data into Google BigQuery and start running queries without indexing the data. It's not free, but it can save a lot of time. BigQuery supports importing JSON and CSV files from Google Cloud Storage and can even infer the schema.
I'm sure that with tools like MPI-Bash [1] and more generally libcircle [2] many embarrassingly parallelizable problems can easily be tackled with standard *nix tools.
I have lately used a shell `for` loop that only emits part of the iterated values (not all are eligible for further processing) and then fed the loop directly to the `parallel` tool via the pipe operator.
The results were indeed embarrassingly parallel.
I am a fan of some languages that seem better equipped to utilise our modern many-core machines and I'd still write a longer-living system with better guarantees in those languages -- but many people ignore shell goodness at their own peril.
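The pattern described above (a loop that emits only eligible values, piped to a parallel runner) looks roughly like this. Here `xargs -P` stands in for GNU `parallel` on systems where it isn't installed, and the "eligibility" test is made up for illustration:

```shell
# Emit only the eligible values from the loop (here: even numbers),
# then fan the survivors out to 4 concurrent workers.
for i in $(seq 1 10); do
    [ $((i % 2)) -eq 0 ] && echo "$i"
done | xargs -n 1 -P 4 echo processed > results.txt

# Worker output order is nondeterministic, so sort before inspecting.
sort results.txt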
I've done something very similar recently, for processing hundreds of GB of .csv files across thousands of records. I ended up writing a fast single-threaded Rust program for the file parsing, then fed it into a bash script running GNU Parallel. That thing flies, and watching the system monitor showing every thread on the system get pegged to 100% simultaneously is a neat experience.
I'm sure I could have done the parallelization itself in Rust given enough time, but honestly I found the `parallel` command to be so easy and resulting in so little overhead that it didn't really even seem worth it to spend the time on a language-native solution.
That's certainly true, but it doesn't look like much of a factor in this case. I did the preliminary benchmarking on an 8-thread CPU (the actual HPC cluster has higher counts), and the speedup on the processing was just under 8x; I imagine the shortfall came from OS-level thread management overhead. I still need to check that the pattern holds as the thread count increases, but I'm fairly optimistic it'll work out as long as I don't run into thermal throttling or I/O limits.
This can't be generalised, but in many cases it did help me with CPU-bound tasks.
Obviously there are cases where parallelisation is superfluous. F.ex. if you have 1000 records of something, splitting the work of processing them (and we're talking really light processing, like JSON encoding) among 4-8 threads is actually detrimental to performance.
Big agree. GNU parallel, xargs, and make -j are all very useful basic tools for embarrassingly parallel workloads.
I've been developing simulation software that does Monte Carlo over realizations of a simulated universe, and xargs and (later) parallel were super useful for parts of the workload. All the parallel job instances run the same simulation code, but for a different simulated universe, all controlled by a random number seed, so you can generate an ensemble of simulations with basically:
`head -NumberOfSimsWanted seeds.txt | xargs simulation.py`
Yep! Even as a fan of more parallel-friendly programming languages I still think having single-core algorithms applied to different input parameters in parallel is a very viable alternative approach if your problem space allows for it.
I quite disliked the journey. The UNIX shell legacy shenanigans are real and can be a pain. But you can still bear a lot of fruit without going to the more arcane corners of it.
I have lately been working on a genetic algorithm system, written in single-threaded Perl. When I started off wanting to do, say, 20 simulations, I just used parallel, and it worked great on the 8 threads of my desktop machine. Now I've done a bit more and want to do long-running simulations that might take a few hours, I've taken to just firing it all at AWS Batch and having a script that uploads the results to S3. Now I can do 1-200 instances at the same time, which my own hardware isn't up to in a reasonable timeframe.
I have a pretty strong workstation (iMac Pro) that I currently don't use all the time. And a gaming PC that I barely touch in the last 9-10 months. Hit me up if you want me to lend you some CPU power sometime.
Cheers, but the configuring of it would be more hassle than it's worth, I suspect. As it is, using AWS is very cheap as I can have it use spot instances. So hundreds of CPU-hours used for on the order of US$10-20.
I wound up writing a hybrid bash script with embedded Makefile to re-encode my FLAC collection to Opus. It was nice to be able to use -j16 on that task.
Switching to Opus saved a great deal of space on my phone's SD card.
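One way to sketch that bash-plus-embedded-Makefile hybrid: have make read the Makefile from stdin (`-f -`) via a heredoc, so a single script file carries both. `cp` stands in for the real `opusenc` call so the sketch runs anywhere; `.RECIPEPREFIX` (GNU make) avoids the literal-tab rule.

```shell
# Hypothetical sketch: re-encode every flac/*.flac to opus/*.opus with
# 4 jobs in parallel, driven by a Makefile embedded in the script.
mkdir -p flac opus
touch flac/a.flac flac/b.flac flac/c.flac

make -j 4 -f - <<'EOF'
.RECIPEPREFIX := >
FLAC := $(wildcard flac/*.flac)
OPUS := $(patsubst flac/%.flac,opus/%.opus,$(FLAC))

all: $(OPUS)

opus/%.opus: flac/%.flac
> cp $< $@  # real version would be something like: opusenc $< $@
EOF
```

Bumping `-j 4` to `-j 16` is all it takes to saturate a bigger machine, and interrupted runs resume where they left off since finished outputs are up to date.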
A lot of people do not like to use the features already provided by their environment, and instead take a single element of the original feature set and reimplement the same features within that single element.
Sometimes it improves performance, sometimes not. Sometimes it improves productivity, sometimes not. I am not sure of the global balance.
* The OS provides process isolation and management, as well as various IPCs to deal with their interconnection? Let's use a single process and create threads, and then fibres. More performance? yes, a bit; more headaches? yes, many.
* The display protocol/server provides lots of primitives for managing graphical interfaces, together with many benefits? Let's just use it to get a canvas and reimplement everything ourselves inside that canvas (and thereby turn a snappy interface into something sluggish when the display is not local).
* A GUI toolkit provides easy to use, multiplatform interface elements? Let's take a single program which only uses a single canvas from the toolkit, and reimplement a whole GUI toolkit within it (in one of the worst languages, in a program not made at all for the job, otherwise it wouldn't be fun).
* The OS provides management for multiple users and groups, together with 2 systems to manage resource access rights? Nah, let's run everything as one user, and then try to implement some separation/protection within. Advanced level: run everything not only as a single user, but inside a single program, and then try to reimplement some isolation within it.
I agree with your premise but your examples lack nuance and aren't strictly true.
> The OS provides process isolation and management, as well as various IPCs to deal with their interconnection? Let's use a single process and create threads, and then fibres. More performance? yes, a bit; more headaches? yes, many.
Many one-off CLI programs can't afford the overhead of N process forks (where N = CPU threads); that overhead is measurable. It's objectively much quicker to spawn N threads inside the same process, especially if your separate tasks need to converge and/or exchange messages here and there.
And runtimes like Erlang's BEAM VM have unquestionable benefits (like preemptive green threads / fibers that exchange immutable messages), although they aren't well known for big raw processing muscle.
> The display protocol/server provides lots of primitives for managing graphical interfaces, together with many benefits? Let's just use it to get a canvas and reimplement everything ourselves inside that canvas (and thereby turn a snappy interface into something sluggish when the display is not local).
Which one is that? Linux's X11 is flawed in many regards, with the ability of each program to freely capture keystrokes being one of the most egregious. I can't blame people for trying to escape such hell and implement other platforms for the same thing -- without the drawbacks.
> A GUI toolkit provides easy to use, multiplatform interface elements? Let's take a single program which only uses a single canvas from the toolkit, and reimplement a whole GUI toolkit within it (in one of the worst languages, in a program not made at all for the job, otherwise it wouldn't be fun).
People wanted multi-platform desktop programs. F.ex. in my rather small country, the only reason a piece of software for managing shop stock and customer purchases, backorders, inventory etc. succeeded was that it was written in Java Swing; many shops used old Windows machines, but no small number of them also used Macs, and some even used Ubuntu netbooks. That software would never have sold back in 2007 if it wasn't OS-neutral.
Also, Qt has commercial licensing that requires paying to use. Many, me included, don't want to deal with that. The programming world is complex as it is and if I suddenly find myself having to pay royalties, personally, to Qt, years after I stopped working for the employer I developed a desktop app for, that would be catastrophic. So many just dodge such potential bullets.
> The OS provides management for multiple users and groups, together with 2 systems to manage resource access rights? Nah, let's run everything as one user, and then try to implement some separation/protection within. Advanced level: run everything not only as a single user, but inside a single program, and then try to reimplement some isolation within it.
That's quite fair, however people want much more isolation than the OS usually offers -- hence stuff like FreeBSD jails, LXC containers / Docker, Windows' Sandboxie program etc. It's not enough to just run, say, Chrome, as a different user if you suspect Google is scanning your machine.
---
Again, I agree with your premise, but the examples aren't really accurate. A lot of the modern OSes and their technologies aren't as top-notch as we'd like (the Linux kernel and network infrastructure probably being some very rare exceptions).
> The results were indeed embarrassingly parallel.
The xargs/parallel model can also scale up to many more nodes, if you are willing to introduce ssh into your shell scripts and do a minimum of stdio piping.
I wrote a python mapreduce program named "batchman" a few years ago, which was aimed at running awk on hundreds of machines (& then being able to fire more things based on the results).
zgrep on gzipped logs + awk + a central scheduler is extremely fast, to the point of beating Splunk at looking up exact things like "does this zid have another session which kept logging in while the game was throwing out-of-sequence errors?".
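A toy version of that kind of lookup, with a made-up log format (timestamp, session id, event) and made-up ids:

```shell
# Build a small gzipped log, then answer "which sessions hit an
# out-of-sequence error?" with zgrep + awk.
printf '%s\n' \
  '2019-01-01T10:00:00 sess1 login' \
  '2019-01-01T10:00:05 sess2 login' \
  '2019-01-01T10:00:09 sess2 out-of-sequence' \
  '2019-01-01T10:00:12 sess1 logout' | gzip > game.log.gz

zgrep 'out-of-sequence' game.log.gz | awk '{ print $2 }' | sort -u > bad.txt
cat bad.txt
```

The same pipeline fanned out over hundreds of hosts via ssh, with a scheduler merging the results, is essentially the batchman idea described above.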
> I have to admit I have no clue how to do that, any pointers?
My code is poorly documented and badly written, but it works -- at least the ssh_copy_id one, which I use enough to keep updated[1] (it logs in over ssh with passwords to place the ssh keys).
That's my script, the brains of it is basically an ssh connector.
Originally, I had awk scripts to process ^A-separated logs; I'd read the output stream back and formulate my next script over the logs to drill down in a loop.
Another vote for parallel, but I also wanted to mention using Redis for its atomic operations. It's very straightforward to push a list of things to do onto a Redis queue and then have your scripts, run via parallel, pop one piece of work each and push the finished item to either a success or failure queue.
Unfortunately I no longer have a working example, but as I recall it wasn't difficult to keep n scripts running and popping entries until the queue emptied.
Thanks for mentioning it. I gave up on Redis because I had a lot of trouble sharding it / distributing it years ago and I kind of forgot about it and stopped being interested in its features.
But what you mention is very interesting and I'll have it in mind for the future. Thank you.
It's odd that TFA has this focus on performance but doesn't mention which awk implementation was used; at least I haven't found any mention of it. There are 3-4 implementations in mainstream use: nawk (the one true awk, an ancient version of which is installed on macOS by default), mawk (installed on e.g. Ubuntu by default), gawk (on RHEL by default, last I checked), and busybox awk. Tip: mawk is much faster than the others, and to get performance out of gawk you should use LANG=C (also because some versions of gawk 3 and 4 crash on complex regexps in Unicode locales).
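The locale tip is just an environment prefix. Here's a runnable toy; swap `awk` for `mawk` or `gawk` and wrap each line in `time` to compare implementations on your system (the workload and numbers are made up for illustration):

```shell
# A small regex-heavy workload: count the numbers in 1..100000 whose
# decimal form ends in "99". Under LC_ALL=C, gawk can use byte-wise
# rather than multibyte string/regex handling, which is often faster.
seq 1 100000 > nums.txt
LC_ALL=C awk '/99$/ { n++ } END { print n }' nums.txt > count.txt
cat count.txt
```

The answer is the same either way; only the speed (and, per the crash caveat above, the robustness on some gawk versions) differs with the locale.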
Thank you for your comment. I should have clarified that I use gawk version 4.0.2. I meant to update the post, but somehow am unable to push to the repo or log in to my github.
Yeah, just read in another answer of yours that for your scripts regexp performance probably dominates the results, as there's not much procedural code to make a difference.
Thank you! I was hunting down this link for a coworker a few months ago, but I forgot the keyword “taco bell”, and my Google-fu failed me...
I love these kind of “counterculture” uses of UNIX tools to solve hard problems. The boundary of “where to stop using xargs, awk, and grep and start using Python” is pretty blurry for a surprising number of tasks, especially if you’re willing to invest in mastering those tools!
I've always wanted to build a parallel awk.
And call it pawk.
And have an O'Reilly book about it.
With a chicken on the cover.
pawk, pawk, pawk!
This is a true story, sadly.
250M records isn't big for a lot of data platforms. I've seen plenty of solutions in traditional RDBMSs that would easily scale an order of magnitude larger than this, if not several.
Moving into anything analytically focused, like an OLAP engine, or just a columnstore in an RDBMS easily gives you another order of magnitude or few.
kdb/q is amazing. I have seen some experienced folks make jaw-dropping data computations at the drop of a hat on gigabyte-sized data. Makes everything else look laughably puny.
I suspect if it was open source, it would probably be the most popular big-data storage and computing platform.
gawk has been my go-to text processing program for many years. I have written a number of multi-thousand line programs in it.
I always use the lint option. Catches many of my errors.
One of these had to read a 300,000,000-byte file into a single string so it could be searched. The file was in 32-byte lines. At first, I read each line in and appended it to the result string, but that took way too long since the result string was reallocated each time it was appended to. So instead I read in about 1000 of the 32-byte lines, appending them to a local string. This 32,000-byte string was then appended to the result, so the expensive append was only done about 10,000 times. Worked fine.
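A sketch of that buffering pattern, with the sizes shrunk so it runs quickly (file name and contents made up; the chunking idea is the same):

```shell
# Build a toy file of 32-byte lines (31 chars + newline), then read it
# into one string, flushing a small local buffer into the big result
# every 1000 lines so the big string is reallocated far less often.
awk 'BEGIN { for (i = 0; i < 5000; i++) printf "%031d\n", i }' > lines.txt

awk '
{
    buf = buf $0
    if (++n == 1000) { result = result buf; buf = ""; n = 0 }
}
END {
    result = result buf            # flush the final partial chunk
    print length(result)           # 5000 lines x 31 chars = 155000
}' lines.txt > len.txt
cat len.txt
```

The naive `result = result $0` on every line is O(n²) in total copying; batching the appends through `buf` cuts the number of big-string reallocations by the chunk factor.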
Spending minutes on these tasks on hardware like this is pretty silly. awk is fine if these are just one-off scripts where development time is the priority, otherwise you're wasting tons of compute time.
Querying things like these on such a small dataset should take seconds, not minutes.
Thank you for your comment. Most of the solutions indeed take less than a minute. The solutions to problems 1 and 3 took 25 seconds each, and the one to problem 5 took 26.
The solution that took 9 minutes involved processing the abstract in each record. The abstracts are quite sizeable on some of these publications. Processing millions of them took time.
The solution that took 48 minutes involved a nested loop, effectively reaching an iteration count of 216 years times 256M records, which comes to about 55B.
Hope this clarifies things a bit but I am not claiming this to be the most optimized solution. I am sure there is scope for refinements -- this was my take on it.
I am often impressed by the things that can be done with these old-school UNIX tools. I'm trying to learn a few of them, and the most difficult part is these very implicit syntax constructions. How is the naive observer to know that in bash `$(something)` is command substitution, but in a Makefile `$(something)` is just a normal variable? With `awk`, `sed` and friends it gets even worse, of course.
Is the proper answer 'just learn it'? Are these tools one of these things (like musical instruments or painting) where the initial learning phase is tedious and frustrating, but the potential is basically limitless?
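To make the `$(something)` confusion concrete, here is the same four-character syntax meaning two different things (a small sketch; `$(info ...)` is GNU make's way of printing during parsing):

```shell
# In the shell, $(...) is command substitution: run a command, keep output.
who=$(echo world)
echo "hello $who" > shell.out

# In a Makefile, $(...) is make's own variable/function expansion,
# evaluated by make itself before any shell ever sees the line.
make -s -f - <<'EOF' > make.out
WHO := world
$(info hello $(WHO))
all: ;
EOF

cat shell.out make.out
```

Same characters, two unrelated evaluators, and the only way to know which applies is knowing which program is reading the text. To some extent, "just learn it" really is the answer.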
First, I think it is great that you found a tool that suits your needs. A few weeks ago I was mangling some data too (just about 17 million records) and would like to contribute my experience.
My tools of choice were awk, R, and Go (in that order). Sometimes I could calculate something within a few seconds with awk. But for various calculations, R proved to be a lot faster. At some point, I reached a problem where the simple R implementation I borrowed from Stack Overflow (which was supposed to be much faster than the other posted solutions) did not satisfy my expectations, and I spent 4 hours writing an implementation in Go which was an order of magnitude faster (I think it was about 20 minutes vs. 20 seconds).
So my advice is to broaden your toolset. When you reach the point where a single execution of your awk program takes 48 minutes, it might be worth considering using another tool. However, that doesn't mean awk isn't a good tool, I still use it for simple things, as writing 2 lines in awk is much faster than writing 30 in Go for the same task.
https://mobile.twitter.com/awkdb was a joke account made in frustration by a coworker trying to operate a Hadoop cluster almost a decade ago. Maybe it's time to hand over the account...
Wow. I didn't realize you could run so many CPUs in one address space. This thing is basically one huge computer. Would love to open up Gnome System Monitor and see 512 cores!
Core counts have gone up quite considerably in the last few years, certainly in a compute farm context. 10 years ago, you'd be looking at 2U servers with 8 cores (dual quad core) as high density. These days, 2U servers can pack 4 sockets, and processors are in the high-20s of cores, so if you've got deep pockets you can get over 100 cores in 2U.
Thank you for the comment. One reason to use awk was that I wanted to see how far I could go with Awk. I will check out data.table. Does it offer any kind of parallelism?
But you cannot share the details of the particular use case so someone else can test your claims?
"After "exploring" many options, kdb+ was the clear winner for my use case. I process an average of 1.6TB per day."
See how silly that sounds without any details or credibility? Does "clear winner" mean it was faster? How does a person know if the recommender even tried both solutions? What if someone else's use case is different from the recommender's? How could someone verify that one solution was faster than the other?
The easiest way is to provide sample data and a processing task and let people try the two solutions for themselves.
Slightly off topic, but as a Swift developer (https://swift.org/), the usage of Swift/T in this project really confused me. Is Swift/T in any way related to Apple's Swift language?
The naming conflict makes googling the differences fairly challenging.
The two Swifts are unrelated. I was aware of the potential confusion, which is why I did not use Swift in the title and mention explicitly in the blog that this is not Apple's Swift. Fun story: Apple contacted the Swift/T team before launching Apple Swift and made a mention of it on their page.
Indeed I did try mawk and found it to be faster. However, when I set the locale with LC_ALL=C, the performance of awk and mawk was almost the same. I also left it at awk in favor of portability -- mawk is not available on most systems and needs to be installed.
@ketanmaheshwari, which supercomputer is this?
This is an SGI UV300 system similar to this one: https://www.hpcwire.com/2016/05/11/tgac-installs-largest-sgi...
Those are massive, massive machines.
There'd be much less setup overhead
[1] https://github.com/lanl/MPI-Bash [2] http://hpc.github.io/libcircle/
https://www.gnu.org/software/parallel/
*SAPANS: as simple as possible, and no simpler
Do note that awk and GNU Parallel are tools that use/contain C-like syntax, which is abhorrent to some factions. GNU Parallel is written in Perl.
Obviously there are cases where parallelisation is superfluous. For example, if you have 1000 records of something, splitting the work of processing those (and we're talking really light processing, like JSON encoding) among 4-8 threads is actually detrimental to performance.
I've been developing a simulation software that does Monte Carlo over realizations of a simulated universe, and xargs and (later) parallel were super useful for parts of the workload. All the parallel job instances run the same simulation code, but for a different simulated universe, all controlled by a random number seed, so you can generate an ensemble of simulations with basically:
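A runnable sketch of that pattern, with a stand-in for the real simulation command (hypothetical): each seed gets its own process, and xargs keeps a fixed number of them in flight until the ensemble is done.

```shell
# One simulated universe per seed. "run_sim" is faked inline here so the
# sketch runs anywhere; in practice it would be the simulation binary
# taking the seed as its only varying input.
mkdir -p out
seq 1 8 | xargs -P 4 -I{} sh -c 'seed="$1"; echo "universe $seed" > "out/$seed.dat"' _ {}
# xargs only returns once every job has finished, so the ensemble is
# complete when this line is reached.
```

With GNU parallel the same idea is roughly `seq 1 8 | parallel ./run_sim --seed {}` (assuming a hypothetical `--seed` flag); parallel adds niceties like per-job output serialisation and retries.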
I quite disliked the journey. The UNIX shell legacy shenanigans are real and can be a pain. But you can still bear a lot of fruit without going to the more arcane corners of it.
Thanks for mentioning AWS Batch, I didn't know about it and will look it up.
Sometimes it improves performance, sometimes not. Sometimes it improves productivity, sometimes not. I am not sure of the global balance.
* The OS provides process isolation and management, as well as various IPCs to deal with their interconnection? Let's use a single process and create threads, and then fibres. More performance? yes, a bit; more headaches? yes, many.
* The display protocol/server provides lots of primitives for managing graphical interfaces, together with many benefits? Let's just use it to get a canvas and reimplement everything ourselves inside that canvas (and thereby turn a snappy interface into something sluggish when the display is not local).
* A GUI toolkit provides easy to use, multiplatform interface elements? Let's take a single program which only uses a single canvas from the toolkit, and reimplement a whole GUI toolkit within it (in one of the worst languages, over a program not made at all for the job, otherwise it wouldn't be fun).
* The OS provides management for multiple users and groups, together with 2 systems to manage resource access rights? Nah, let's run everything as one user, and then try to implement some separation/protection within. Advanced level: run everything not only as a single user, but inside a single program, and then try to reimplement some isolation within it.
> The OS provides process isolation and management, as well as various IPCs to deal with their interconnection? Let's use a single process and create threads, and then fibres. More performance? yes, a bit; more headaches? yes, many.
Many one-off CLI programs can't afford to fork N processes (where N = CPU threads); the fork overhead is real. It's objectively much quicker to spawn N threads inside the same process, especially if your separate tasks need to converge and/or exchange messages here and there.
And runtimes like Erlang's BEAM VM have unquestionable benefits (like preemptive green threads / fibers that exchange immutable messages), although they aren't well known for big raw processing muscle.
> The display protocol/server provides lots of primitives for managing graphical interfaces, together with many benefits? Let's just use it to get a canvas and reimplement everything ourselves inside that canvas (and thereby turn a snappy interface into something sluggish when the display is not local).
Which one is that? Linux's X11 is flawed in many regards, with the ability of each program to freely capture keystrokes being one of the most egregious. I can't blame people for trying to escape such hell and implement other platforms for the same thing -- without the drawbacks.
> A GUI toolkit provides easy to use, multiplatform interface elements? Let's take a single program which only use a single canvas from the toolkit, and reimplement a whole GUI toolkit within (in one of the worst languages, over a program not made at all for the job, otherwise it wouldn't be fun).
People wanted multi-platform desktop programs. F.ex. in my rather small country the only reason a software for managing shop stock and customer purchases, backorders, inventory etc. succeeded was because it was written in Java Swing; many shops used old Windows machines but no small amount of them also used Macs and some even used Ubuntu netbooks. That software would never sell back in 2007 if it wasn't OS-neutral.
Also, Qt has commercial licensing that requires paying to use. Many, me included, don't want to deal with that. The programming world is complex as it is and if I suddenly find myself having to pay royalties, personally, to Qt, years after I stopped working for the employer I developed a desktop app for, that would be catastrophic. So many just dodge such potential bullets.
> The OS provides management for multiples users and groups, together with 2 systems to manage resources access rights? Nah, let's run everything as one user, and then try to implement some separation/protection within. Advanced level: run everything not only as a single user, but inside a single program, and then try to reimplement some isolation within it.
That's quite fair, however people want much more isolation than the OS usually offers -- hence stuff like FreeBSD jails, LXC containers / Docker, Windows' Sandboxie program etc. It's not enough to just run, say, Chrome, as a different user if you suspect Google is scanning your machine.
---
Again, I agree with your premise, but the examples aren't really accurate. A lot of the modern OSes and their technologies aren't as top-notch as we'd like (the Linux kernel and network infrastructure probably being some very rare exceptions).
Only if you intend on modifying Qt and distributing it. Otherwise, you're free to build commercial software using Qt under the LGPL license.
The xargs/parallel mode can also scale up to many more nodes, if you are willing to introduce ssh into your shell scripts & do a minimum of stdio piping.
I wrote a python mapreduce program named "batchman" a few years ago, which was aimed at running awk on hundreds of machines (& then being able to fire more things based on the results).
zgrep on gzipped logs + awk + a central scheduler is extremely fast, to the point of beating Splunk at looking up exact things like "does this zid have another session where they kept logging in while the game was throwing out-of-sequence errors?".
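A local, runnable sketch of the per-node half of that query (the "timestamp zid event" log layout is hypothetical); in the distributed version the same pipeline runs over ssh on each node and the central scheduler merges the results.

```shell
# Fake a gzipped log so the sketch is self-contained.
mkdir -p logs
printf '2019-01-01 z42 login\n2019-01-01 z42 out-of-sequence\n2019-01-01 z7 login\n' |
  gzip > logs/game.log.gz

# zgrep narrows to the zid cheaply, awk applies the structured condition.
zgrep 'z42' logs/game.log.gz |
  awk '$3 == "out-of-sequence" { print $2, "had errors" }'
```

The grep-first, awk-second ordering matters at scale: the coarse string match discards most of the (decompressed) stream before the field-level logic ever runs.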
I've sometimes thought of utilising my 4 machines at home for such things, and implementing this with standard tools would be hugely beneficial.
My code is poorly documented and badly written, but it works -- at least the ssh_copy_id one, which I use enough to keep updated[1] (it ssh-logins with passwords to place the ssh keys).
That's my script; the brains of it is basically an ssh connector.
Paramiko is as simple as
Originally, I had awk scripts to process ^A-separated logs; I'd read the output stream back and formulate my next script over the logs to drill down in a loop.

[1] - https://github.com/t3rmin4t0r/batchman/blob/master/ssh_copy_...
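The ^A-separated part looks like this (field meanings hypothetical; note that awk implementations such as gawk and mawk interpret the `\001` escape in `-F`):

```shell
# ^A (0x01) is a handy field separator because it never appears in
# ordinary log text. Hypothetical "user ^A event ^A count" records:
printf 'alice\001login\00142\nbob\001logout\0017\n' |
awk -F '\001' '$2 == "login" { print $1, $3 }'
```

From there, drilling down is just generating the next awk condition from the previous run's output.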
Unfortunately I no longer have a working example, but as I recall it wasn't difficult to keep N scripts running and popping entries until the queue emptied.
But what you mention is very interesting and I'll have it in mind for the future. Thank you.
Edit: Done.
I love these kinds of “counterculture” uses of UNIX tools to solve hard problems. The boundary of “where to stop using xargs, awk, and grep and start using Python” is pretty blurry for a surprising number of tasks, especially if you’re willing to invest in mastering those tools!
https://github.com/alecthomas/pawk
250M records isn't big for a lot of data platforms. I've seen plenty of solutions in traditional RDBMSs that would easily scale an order of magnitude larger than this, if not several.
Moving into anything analytically focused, like an OLAP engine, or just a columnstore in an RDBMS easily gives you another order of magnitude or few.
I suspect that if it were open source, it would probably be the most popular big-data storage and computing platform.
Querying things like these on such a small dataset should take seconds, not minutes.
The solution that took 9 minutes involved processing the abstract in each record. The abstracts are quite sizeable on some of these publications. Processing millions of them took time.
The solution that took 48 minutes involved a nested loop, effectively reaching an iteration count of 216 years times 256M records, which comes to about 55 billion.
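A toy sketch of that loop shape (the 1800-2015 year range and the "year title" field layout are assumptions for illustration): every record is tested against every candidate year, so the work scales as records x years.

```shell
# Each record is checked against all 216 years in turn - the same shape
# that reaches ~55 billion iterations at 256M records.
printf '1910 recA\n2001 recB\n' | awk '
  { for (y = 1800; y <= 2015; y++)   # 216 candidate years per record
      if ($1 == y) count[y]++ }
  END { for (y in count) print y, count[y] }
' | sort
```

An obvious refinement under these assumptions would be indexing straight into the array (`count[$1]++`), turning the inner loop into O(1) per record, but the point above is where the 48 minutes went, not that the solution was optimal.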
Hope this clarifies things a bit but I am not claiming this to be the most optimized solution. I am sure there is scope for refinements -- this was my take on it.
Is the proper answer 'just learn it'? Are these tools one of these things (like musical instruments or painting) where the initial learning phase is tedious and frustrating, but the potential is basically limitless?
https://www.gnu.org/software/recutils/
https://en.wikipedia.org/wiki/Recfiles
My tools of choice were awk, R, and Go (in that order). Sometimes I could calculate something within a few seconds with awk. But for various calculations, R proved to be a lot faster. At some point, I reached a problem where the simple R implementation I borrowed from Stack Overflow (which was supposed to be much faster than the other posted solutions) did not satisfy my expectations, and I spent 4 hours writing an implementation in Go which was an order of magnitude faster (I think it was about 20 minutes vs. 20 seconds).
So my advice is to broaden your toolset. When you reach the point where a single execution of your awk program takes 48 minutes, it might be worth considering using another tool. However, that doesn't mean awk isn't a good tool; I still use it for simple things, as writing 2 lines in awk is much faster than writing 30 in Go for the same task.
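The kind of job where a couple of lines of awk replace a page of Go (column layout hypothetical): summing a value per key in one pass.

```shell
# Group-by-key sum: key in $1, value in $3. Two lines of logic, total.
printf 'a x 10\na y 5\nb z 7\n' |
awk '{ sum[$1] += $3 } END { for (k in sum) print k, sum[k] }' | sort
```

The equivalent Go version needs a scanner, field splitting, integer parsing with error handling, a map, and sorted output - easily 30 lines before it compiles.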
The solution would be much less error prone and most likely much quicker as well.
How do you know this, i.e., can you share a benchmark -- the test data, not the results -- that you ran?
Data.table was the clear winner for my use case.
"After "exploring" many options, kdb+ was the clear winner for my use case. I process an average of 1.6TB per day."
See how silly that sounds without any details or credibility? Does "clear winner" mean it was faster? How does a person know if the recommender even tried both solutions? What if someone else's use case is different from the recommender's? How could someone verify that one solution was faster than the other?
The easiest way is to provide sample data and a processing task and let people try the two solutions for themselves.
The naming conflict makes googling the differences fairly challenging.