GoAWK: an AWK interpreter written in Go

(github.com)

101 points | by ngaut 2072 days ago

12 comments

  • vvern 2071 days ago
    I would love to see support for calling out into go functions. The go stdlib so often has good implementations of functionality in places where things like the python stdlib doesn't.

    There are some obvious questions around calling conventions and error handling, method invocation, etc. but nothing there seems totally insurmountable. Having a compliant implementation as a jumping off point is a great start.

    Looking at the interp internals, the representation of function call expressions might need a little bit more structure to pull this off (rather than just a big switch for the awk builtins and user calls as just more awk instructions plopped inline). Furthermore there are questions about how exactly to represent go objects but I suspect with some boxing it could be made relatively ergonomic.

    • benhoyt 2071 days ago
      That's a neat idea! (Though not necessarily in the spirit of AWK as a simple text processing language. :-) I started playing with something like this in a previous language interpreter I made in Go (https://github.com/benhoyt/littlelang/blob/master/interprete...). It's definitely possible to do with the reflect package, though it wouldn't be trivial to do a full implementation.
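
      A minimal sketch of what the reflect-based approach could look like, not GoAWK's actual design: callGo is a hypothetical helper that takes any Go function plus AWK-style string arguments and converts each argument to the parameter type the function expects. It assumes only string, int and float64 parameters and ignores variadic functions, methods and error conventions, which are exactly the open questions raised above.

```go
package main

import (
	"fmt"
	"reflect"
	"strconv"
	"strings"
)

// callGo (hypothetical helper) invokes fn with string arguments, converting
// each one to the parameter type that fn declares. Only string, int and
// float64 parameters are supported in this sketch.
func callGo(fn interface{}, args ...string) ([]interface{}, error) {
	v := reflect.ValueOf(fn)
	if v.Kind() != reflect.Func {
		return nil, fmt.Errorf("not a function: %T", fn)
	}
	t := v.Type()
	if t.IsVariadic() || t.NumIn() != len(args) {
		return nil, fmt.Errorf("want %d fixed args, got %d", t.NumIn(), len(args))
	}
	in := make([]reflect.Value, len(args))
	for i, a := range args {
		pt := t.In(i)
		switch pt.Kind() {
		case reflect.String:
			in[i] = reflect.ValueOf(a).Convert(pt)
		case reflect.Int:
			n, err := strconv.Atoi(a)
			if err != nil {
				return nil, err
			}
			in[i] = reflect.ValueOf(n).Convert(pt)
		case reflect.Float64:
			f, err := strconv.ParseFloat(a, 64)
			if err != nil {
				return nil, err
			}
			in[i] = reflect.ValueOf(f).Convert(pt)
		default:
			return nil, fmt.Errorf("unsupported parameter type %s", pt)
		}
	}
	// Box every result as interface{} so the caller can map it back
	// to an interpreter value.
	out := v.Call(in)
	results := make([]interface{}, len(out))
	for i, o := range out {
		results[i] = o.Interface()
	}
	return results, nil
}

func main() {
	res, err := callGo(strings.Repeat, "ab", "3")
	fmt.Println(res, err)
	res, err = callGo(strings.ToUpper, "awk")
	fmt.Println(res, err)
}
```

      The boxing step at the end is where a real implementation would have to decide how Go values map back to AWK strings and numbers.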
  • kiwidrew 2071 days ago
    This is really cool! It's rare to see new implementations of awk these days. Bonus points for running it through the nawk test suite.

    In my opinion /usr/bin/awk is a thing of beauty. Certainly it's the most usable out of the trifecta of scripting languages that are mandated by POSIX.

    (There's /bin/sh, where merely using variables can quickly turn into a quoting nightmare. But the true nightmare material is /usr/bin/sed, which has actually been shown [1] to be a Turing complete language!)

    [1] http://www.catonmat.net/ftp/sed/turing.txt

  • stevekemp 2071 days ago
    Looks like a really good project, especially with the collection of test-files.

    I had fun fuzzing the "real" awk, finding a couple of trivial segfaults. If you've not already experimented with fuzzing I'd recommend it - I found a few minor issues in my own simple-interpreter, and language, via feeding them malformed scripts.

  • linsomniac 2071 days ago
    I've been playing with writing a Python-based AWK-inspired library over the last couple weeks. This isn't an implementation of AWK like here, but a Python interpretation of "If I wanted to do the sort of things that AWK is good at, in Python, what would that look like?"

    For example, to extract and add line numbers to SQL table definitions:

      t = gawk.Gawk(sys.stdin)
      t.context.data = ''
      @t.range(r'CREATE TABLE', r');')
      def line(context, line):
          context.data += (('line %d:' % context.range.line_number) + line)
          if context.range.is_last_line:
              print(context.data)
              context.data = ''
      t.run()
    
    https://github.com/linsomniac/gawk

    I've used AWK for close to 30 years, but I've never achieved or maintained any level of proficiency at it. I pretty much just use it for "{ print $1, $3 }" in a filter or the like. Every time I try to do something more complicated I spend an hour or more futzing around with it and, more often than not, getting to "almost but not quite" where I want to be. This is, of course, a me failing, not an awk failing.

    But it's left me wanting something that would make doing awk-like processes easy in Python, which I'm very proficient at.

    I ended up using the name "gawk" because it's an English word and nods to the AWK inspiration, but then I remembered GnuAWK so I'll probably rename it.

    • srean 2071 days ago
      Yeah, Gawk is GNU's Awk; it's a well-established name.

      Your code snippet does showcase Awk's utility. It brutally cuts through all the ceremony around reading and iterating over lines.

  • samuell 2071 days ago
    This is so cool! I have been calling out to GNU Awk in like 50% of our SciPipe workflow tasks lately (See e.g. [1]) ... now I should be able to keep it all inside Go/SciPipe.

    [1] https://github.com/pharmbio/ptp-project/blob/master/exp/2018...

  • tomcam 2072 days ago
    Haven’t tried it yet but my favorite part is that it’s embeddable in your own programs.
    • srean 2071 days ago
      Just in case it's useful, you can embed/extend awka or mawk or rather the library they are based on. Gawk allows extending it but the interface is not pretty.
  • hi41 2071 days ago
    I am a big fan of awk. I find it very beautiful. I admire the authors of awk so much. When I read the awk programming book, I find that it has such clarity of thought. Kudos to you for your implementation!
    • another-cuppa 2071 days ago
      I'm so glad I spent about twenty minutes to learn awk several years ago. That's really all it takes to learn it and you get incredible power.
  • helper 2072 days ago
    Ha. I was just thinking how useful it would be to have the awk programming language available in a tool that natively understood csv files. Suddenly that seems a lot more doable!
    • benhoyt 2071 days ago
      Interesting point. I've thought that AWK should have a mode where it does proper quote parsing of CSV files. Maybe I'll add a -csv option for that (or just have it do it automatically when the FS is ',' -- though that wouldn't be backwards compatible).
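
      To illustrate why splitting on ',' is not the same as CSV parsing, here is a small sketch using Go's encoding/csv package. splitCSV is a hypothetical helper name, and this is not how any -csv option is implemented; it only shows the difference in field counts for a record with a quoted comma.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// splitCSV (hypothetical helper) parses a single CSV record, honoring
// double-quoted fields and doubled quotes inside them.
func splitCSV(line string) ([]string, error) {
	return csv.NewReader(strings.NewReader(line)).Read()
}

func main() {
	line := `1,"Smith, John","said ""hi"""`

	// Naive splitting, which is what FS="," amounts to.
	naive := strings.Split(line, ",")
	fmt.Println(len(naive), "naive fields")

	// Quote-aware parsing.
	fields, err := splitCSV(line)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println(len(fields), "CSV fields")
	for i, f := range fields {
		fmt.Printf("$%d = %s\n", i+1, f)
	}
}
```

      The naive split cuts the quoted name in two, while the CSV reader keeps it as a single field and unescapes the doubled quotes.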
      • vram22 2071 days ago
        You probably know it already, but in case not, the CSV format is not fully standardized, and so there are variations. So you might have to handle those to provide better support of the feature, if you implement it.

        https://en.wikipedia.org/wiki/Comma-separated_values

        For example, csv.reader in Python's csv module in the stdlib, has a dialect argument, due to this.

        https://docs.python.org/2/library/csv.html

        https://docs.python.org/3/library/csv.html

      • jrumbut 2071 days ago
        Perhaps deal with quoted strings generally, and then quoted-strings mode with FS=, is CSV mode, and that quoted-string functionality is available elsewhere (and maybe also for output).

        Less intuitive maybe for beginners but more generally useful?

      • helper 2071 days ago
        If you add a -csv I would use it!
    • dbro 2072 days ago
      You might want to check out https://github.com/dbro/csvquote which helps awk and other text tools handle csv files which have quoted strings as values.
      • helper 2071 days ago
        If I'm going to preprocess before invoking awk I think I'd rather switch the separators to use ascii record/unit separator values than to replace the content of the actual fields.
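
        A rough sketch of that preprocessing step, assuming Go's encoding/csv for the parsing: each field is followed by the ASCII unit separator (0x1F) and each record by the ASCII record separator (0x1E). The names csvToSeparators and convertString are made up for this example, and the sketch assumes neither control character occurs in the data.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"io"
	"strings"
)

const (
	unitSep   = "\x1f" // ASCII unit separator, placed between fields
	recordSep = "\x1e" // ASCII record separator, placed after each record
)

// csvToSeparators reads CSV from r and writes the same records to w,
// with fields joined by unitSep and each record terminated by recordSep.
func csvToSeparators(r io.Reader, w io.Writer) error {
	cr := csv.NewReader(r)
	cr.FieldsPerRecord = -1 // accept rows with differing field counts
	for {
		rec, err := cr.Read()
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return err
		}
		if _, err := io.WriteString(w, strings.Join(rec, unitSep)+recordSep); err != nil {
			return err
		}
	}
}

// convertString is a convenience wrapper for in-memory input.
func convertString(s string) (string, error) {
	var out strings.Builder
	err := csvToSeparators(strings.NewReader(s), &out)
	return out.String(), err
}

func main() {
	got, err := convertString("a,\"b,c\"\nd,\"e\nf\"\n")
	fmt.Printf("%q %v\n", got, err)
}
```

        Embedded commas and newlines survive untouched inside the fields, so a downstream awk only needs FS and RS set to those two characters.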
  • srean 2071 days ago
    Ah! This makes me wish for an awk with first class channels and loadable modules. Gawk does allow talking over sockets but this would have been so much sweeter.

    One thing I hope this implementation remedies is the absence of linear-time string concatenation in awk. Awk has split but no join. The only way I know is to iteratively join two strings, which has a quadratic running time.

    • benhoyt 2071 days ago
      Huh, interesting. So one thing I haven't focused on much yet is performance. It is slightly faster than "one true awk" on large inputs with very simple programs, so I'm guessing Go's I/O speed is pretty good, but the actual interpreter itself is significantly slower than awk's as yet -- hoping to work on that soon.

      I haven't looked at linear-time string concat. Interesting point -- I'll put it on my TODO list. Though I think instead of string building you could simply use printf to write output and that would be linear time.

      • kiwidrew 2071 days ago
        Yes, if the "destination" is stdout (or a pipe/file) then the obvious loop works just fine:

          function join(ARRAY,    i) {
            # i is a local; indices start at 1, as split() produces them
            for (i=1; i in ARRAY; i++) printf "%s", ARRAY[i];
          }
        
        But if you need the result back as a string for further processing, the obvious methods are not linear:

          # _s and i are locals; indices start at 1, as split() produces them
          function join(ARRAY,_s,  i) {
            for (i=1; i in ARRAY; i++) _s=_s ARRAY[i];
            return _s;
          }
          
          # try to be clever and join 2 items at the same time
          function join2(ARRAY,_s,  i) {
            for (i=1; i in ARRAY; i+=2)
              _s=sprintf("%s%s%s",_s,ARRAY[i],ARRAY[i+1]);
            return _s;
          }
        
        I for one definitely miss having a join() function, and it seems odd that this natural complement to the split() function was never implemented...
      • srean 2071 days ago
        As they say, make it work and then make it fast. Fair warning, mawk will give you tough competition on speed. Gawk should be possible to beat; its implementation is not very performance focused.

        Regarding concat, printf and sprintf would indeed work in some situations. They're not flexible enough in the following scenario:

        Consider an array of substrings, possibly generated by 'split'. I modify and delete some of them. Then I want to put them back together. Since I do not know ahead of time how many substrings will survive, sprintf is difficult to use.
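
        For comparison, this is roughly what that scenario looks like on the Go side, where an interpreter could back a join builtin with strings.Builder. joinSurvivors and its keep callback are hypothetical names for this sketch; the point is only that the number of surviving pieces need not be known up front and the work stays linear in the total output length.

```go
package main

import (
	"fmt"
	"strings"
)

// joinSurvivors (hypothetical helper) joins the parts accepted by keep,
// separated by sep. strings.Builder appends in amortized constant time per
// byte, so the whole join is linear rather than quadratic.
func joinSurvivors(parts []string, keep func(string) bool, sep string) string {
	var b strings.Builder
	first := true
	for _, p := range parts {
		if !keep(p) {
			continue // this piece was "deleted"
		}
		if !first {
			b.WriteString(sep)
		}
		b.WriteString(p)
		first = false
	}
	return b.String()
}

func main() {
	parts := strings.Split("a:b:c:d", ":")
	notB := func(s string) bool { return s != "b" }
	fmt.Println(joinSurvivors(parts, notB, "-"))
}
```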

  • yjftsjthsd-h 2072 days ago
    This is beautiful. I really love that awk has numerous implementations running around.
  • wenc 2071 days ago
    Great work. Reimplementing a practical and useful mini-language in another language is always a useful exercise.

    I'm curious though, why code the lexer and parser by hand? What's the state of lexing/parsing in the Go world?

    • benhoyt 2071 days ago
      Couple of reasons. 1) Because I'm a fan of few dependencies and lex/yacc are non-Go dependencies. 2) I've never used them and it'd probably take me longer to learn them than hand-write a lexer and parser. 3) Writing a lexer is trivial, and writing a recursive-descent parser is fun and not that hard.

      As to the state of lexing/parsing in the Go world. There's a simple scanner (text/scanner) in the stdlib. I've run across this quite neat parser library that's based off structs and tags: https://github.com/alecthomas/participle ... but I really don't know the landscape very well.

  • theparanoid 2072 days ago
    I feel let down, GAWK would have been a great name.