Getting started with regular expressions: An example

(redhat.com)

65 points | by tcarriga 1627 days ago

6 comments

  • macando 1626 days ago
    For some reason regular expressions have the lowest expiry date in my mind's cache. I had to relearn them at least 10 times. Watching that famous Udemy programming course where regex is explained in great detail with examples and state machine diagrams didn't help.
    • 52-6F-62 1626 days ago
      Same here save for some of the basics.

      Regardless, I love this tool: https://regex101.com/

      If I have to write anything non-trivial, or anything that may have edge cases I want to test, I usually hammer out a prototype there before dropping it into an actual method.

    • kbenson 1626 days ago
      Regular expressions are a powerful tool, but unless you use them often after learning them (and are in a language where it makes sense to do so), it can be hard to make it stick.

      I've had amazing success in using a regex in situations where you wouldn't think it would work as well as other solutions. For example, I've gotten more than an order of magnitude speedup in parsing a well known simple (but fairly large) XML data set using regular expressions instead of the fastest XML parsing libraries I could find at the time. Sometimes the less efficient tool that does only what you need is much faster than the highly optimized tool that handles all the special cases that don't matter for the particular job.

      • macando 1626 days ago
        You use whatever works for your case. When I hear parsing with regex I always think of this legendary Stack Overflow answer :) https://stackoverflow.com/a/1732454
        • kbenson 1626 days ago
          Yeah, that's a classic. But it's also mostly about parsing arbitrary HTML, and that's where "well known" comes in. The data in question looked somewhat like:

            <doc>
              <bool name="foo">1</bool>
              <int name="bar">12345</int>
              <str name="baz">test string</str>
              <float name="quux">1.234</float>
            </doc><doc>
              <bool name="foo">0</bool>
              ...
            </doc>
          
          but with a lot more fields per doc. Pulling out each <doc> as a string (using a regex) in a while loop, and then parsing the doc into a hash of key-value pairs (another regex) that are stored into an array, is less than twenty lines of pretty standard Perl, including exception handling and error reporting:

              use Try::Tiny;  # provides the try/catch blocks below

              my @items;
              my $count = 0;
              while ($item_xml =~ m{<doc>(.*?)</doc>}gsmi) {
                  try {
                      # process <doc>
                      my $item = $1;
                      my $i = {};
                      $i->{$1} = $2 while $item =~ m{<(?:arr|date|str|bool|int|float) name="([^"]+)">([^<]+)</[^>]+>}gsmi;
                      push(@items, $i);
                      $count++;
                  }
                  catch {
                      warn "Error :: $_";
                  };  # Try::Tiny's try/catch is a sub call, so it needs the trailing semicolon
              }
              print "$count items found\n";
          
          And if you think the regex to pull out the field values is hard to read, it could always be written in the extended format like so (this is overly verbose, but you should get the idea):

              my $field_parse_re = qr{
                # Parse opening tag
                < # Start of opening tag
                # Any of the allowed tag types
                (?: arr
                  | date
                  | str
                  | bool
                  | int
                  | float
                )
                \s # space between tag type and name attribute
                name="([^"]+)" # Save tag name as $1, or first returned item
                > # End of opening tag

                # Parse tag contents
                ([^<]+) # One or more characters that are not <, in $2, or second item returned

                # Parse enough of the closing tag to make sure we got all the contents
                </
              }smix;

              # Now use it (qr{} keeps the smix flags when interpolated)
              $i->{$1} = $2 while $item =~ m{$field_parse_re}g;
          
          In any case, that's a trivial amount of work to beat the fastest XML parsing I could find (and I surveyed a few libraries) by something like 14x, IIRC.
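          For the curious, the same two-regex shape is about as short in Python (the sample data and variable names here are illustrative, not the original dataset):

```python
import re

item_xml = """<doc>
  <bool name="foo">1</bool>
  <int name="bar">12345</int>
  <str name="baz">test string</str>
  <float name="quux">1.234</float>
</doc><doc>
  <bool name="foo">0</bool>
  <int name="bar">67890</int>
</doc>"""

# Inner regex: one (name, value) pair per field tag.
field_re = re.compile(
    r'<(?:arr|date|str|bool|int|float) name="([^"]+)">([^<]+)</[^>]+>'
)

# Outer regex: one string per <doc>...</doc> block.
items = []
for doc in re.findall(r"<doc>(.*?)</doc>", item_xml, re.S):
    items.append(dict(field_re.findall(doc)))

print(len(items), "items found")   # 2 items found
```

          The outer findall pulls each <doc> out as a string; the inner one turns its fields into a dict.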
          • macando 1626 days ago
            Being an engineer means finding/building the best tool for the task at hand. Sometimes it feels great to write a short and effective snippet of code instead of examining how some bizarre API works.
  • dmonitor 1626 days ago
    My life can be divided into pre-regular expressions and post-regular expressions. I get excited every time I find an opportunity to use them, and it happens a lot.
    • Zhyl 1626 days ago
      Regular expressions are one thing I would want to teach non-technical people. So many day-to-day tasks, especially in normal office work, could be made trivial.
  • mikece 1626 days ago
    I wonder, knowing everything we collectively do now, whether regular expressions would look any different if we had the chance to re-invent them. Given the use case I think the complex syntax is inescapable, but I can't help thinking this could be implemented a bit more simply without losing any power.
    • dan-robertson 1626 days ago
      Perl 6 (i.e. Raku) did do this. Sort of. They made regexes look a bit more like BNF grammars (or at least the way those grammars are typically written).

      They ignored whitespace (and allowed regexes to be written over multiple lines, with comments), which made them more readable, and they allowed referencing regexes (stored in variables) directly within another regex. I think they also made some of the syntax nicer (e.g. for non-capturing groups or lookaheads), and added notation for common things (e.g. a form of X* where the items are separated by commas) and long names for things like character classes.

      • tyingq 1626 days ago
        The /x modifier in Perl 5 allows for comments and ignores whitespace. Things like [[:upper:]] are also in Perl 5.
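        Python's re module has the same idea via re.VERBOSE (alias re.X), though it doesn't support POSIX classes like [[:upper:]]. A small sketch:

```python
import re

# re.VERBOSE ignores unescaped whitespace and allows # comments,
# much like Perl's /x modifier.
phone = re.compile(r"""
    (\d{3})     # area code
    [-.\s]?     # optional separator
    (\d{4})     # line number
""", re.VERBOSE)

m = phone.search("call 555-0199 today")
print(m.groups())   # ('555', '0199')
```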
    • l_t 1626 days ago
      The biggest point of complexity, IMO, is the representation in which the characters to be matched are mixed in with the syntactical elements in a single string.

      This tends to cause confusion, makes escaping complex, and is just really different from most other ways of expressing logic in programming.

      Even advanced regex features like extended mode and named captures, which I love, don't get around this fundamental issue with the representation.

      It does have advantages though, a big one being terseness, and the ability to be expressed in an implementation-independent way. Another is that for some types of regexes, the expression "sort of looks like" the strings it can match against, which can help intuition.

      I think the way forward is to consider the existing syntax a "terse mode" and to use regex builder libraries [0] as a "verbose mode." The "terse mode" does have merits, though, and I wouldn't want to get rid of it or anything.

      [0]: For example https://github.com/VerbalExpressions/JSVerbalExpressions is a library that lets developers express regexps using a fluent API. This changes things from being syntactic elements of a string to just being traditional function calls.
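      As a toy illustration of the builder idea, here is a minimal sketch in Python (an invented API for illustration, not VerbalExpressions' actual interface):

```python
import re

class RegexBuilder:
    """Toy fluent builder: each call appends a piece of pattern text."""
    def __init__(self):
        self.parts = []

    def literal(self, text):
        self.parts.append(re.escape(text))
        return self

    def digits(self, at_least=1):
        self.parts.append(r"\d{%d,}" % at_least)
        return self

    def any_of(self, *alternatives):
        # Alternation inside a non-capturing group.
        self.parts.append("(?:%s)" % "|".join(re.escape(a) for a in alternatives))
        return self

    def compile(self):
        return re.compile("".join(self.parts))

# "v" + digits + "." + digits + an optional suffix ("" makes it optional)
version = (RegexBuilder()
           .literal("v").digits()
           .literal(".").digits()
           .any_of("-beta", "-rc", "")
           .compile())

print(bool(version.search("release v2.13-rc")))   # True
```

      The "terse mode" equivalent is just v\d{1,}\.\d{1,}(?:\-beta|\-rc|), which shows the terseness trade-off directly.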

      • kbenson 1626 days ago
        > For example https://github.com/VerbalExpressions/JSVerbalExpressions is a library that lets developers express regexps using a fluent API.

        Except from what I can see, it only supports trivial cases. What's the syntax that supports grouping, alternation, a numerical range for the number of matches (e.g. \d{3,10}), and the combining of any number of those? I suspect it quickly devolves to the case where either you use the extended syntax and the comments are really what matters, or the terseness is useful in at least not making things so verbose and meandering that they're hard to comprehend just because the amount of boilerplate needed to usefully express something dwarfs the functional bits.

        Regular expressions work best when matching at a character level. The non-terse mode of the stuff you are grasping at works best when matching at a symbolic level, and it has existed for a very long time. It's called a grammar. A popular form is BNF[1] (and variants), which will probably look somewhat familiar from technical specifications or RFCs, if you've looked at any before.

        If you want something beyond regular expressions and you're willing to use a library for it, just use a library that provides grammar parsing. Or use Raku (fka Perl 6), which has support built in (with hooks for calling code at particular parsing points).[2] Hopefully most languages have a good library for parsing grammars at this point.

        1: http://matt.might.net/articles/grammars-bnf-ebnf/

        2: https://docs.perl6.org/language/grammar_tutorial
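        To make the characters-vs-symbols split concrete, here is a toy hand-rolled recursive-descent parser in Python for a made-up key=value grammar (a sketch of the shape a grammar library automates, not what any real library generates):

```python
# Toy grammar, BNF-style:
#   pairs ::= pair (";" pair)*
#   pair  ::= key "=" value
#   key   ::= [a-z]+
#   value ::= [0-9]+
import re

def parse_pairs(text):
    pos = 0
    result = {}

    def expect(pattern):
        # Small regexes tokenize at the character level...
        nonlocal pos
        m = re.match(pattern, text[pos:])
        if not m:
            raise ValueError(f"parse error at {pos}: expected {pattern!r}")
        pos += m.end()
        return m.group()

    # ...while this loop encodes the grammar's symbolic structure.
    while True:
        key = expect(r"[a-z]+")               # key
        expect(r"=")                          # literal "="
        result[key] = int(expect(r"[0-9]+"))  # value
        if pos == len(text):
            return result
        expect(r";")                          # separator before the next pair

print(parse_pairs("foo=1;bar=23"))   # {'foo': 1, 'bar': 23}
```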

    • mci 1626 days ago
      SNOBOL4 pattern matching may fit the bill, although it is more powerful than regexps [1]. You can play with SNOBOL4 on x86 [2].

      [1] http://www.snobol4.org/docs/burks/tutorial/ch4.htm [2] https://github.com/spitbol/x64

    • Boxxed 1626 days ago
      I think syntax is the big whiff in regular expressions. I bet it would feel much more natural to represent them not as a string but as a tree (which they kind of are). Maybe with s-expressions? Would probably be more verbose but I imagine way more readable.
      • mikece 1626 days ago
        S-expressions make me think of Lisp and functional programming... which I mention because regular expressions, as a means of defining a pattern against which to find matches, have a functional programming feel to them. Perhaps something like LINQ in .NET or map() in JavaScript would work, as you're more or less converting the text in a string/document to an array, running a "does it match or not" function on it, and returning the position when there's a match. So something with a functional programming feel makes sense.
  • austincheney 1626 days ago
    Things I love about regex:

    * The global, or g, flag. The global flag applies a pattern across the entire string, which can be used to return all matches or apply all replacements.

    * In JavaScript, whitespace characters are represented by the \s character class. There are many characters classified as whitespace, so it is very nice to have a grouping for them when performing a search, and to be able to ignore them all if necessary. \s+ will match a run of consecutive characters that may be any combination of space, newline, carriage return, form feed, and various other whitespace characters.

    * Partial matches are helpful. Sometimes you know the pattern you want to search for and you need to replace it with a different pattern without harming specific data inside the match. In this scenario I call the replace method on a string, searching by a regular expression with the g (global) flag, and supply a function as the replacement argument. The regular expression match is passed as a string argument to that function, where I can manipulate it into something else (even using regular expressions again) and return the result, which is inserted back into the target string in place of the match.
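    For comparison, the \s class and replace-with-a-function both have direct analogues in Python's re (re.sub replaces every match by default, like the g flag); the sample strings here are made up:

```python
import re

# \s+ collapses any run of whitespace (spaces, tabs, newlines, ...)
messy = "hello \t world\n\n again"
print(re.sub(r"\s+", " ", messy))   # hello world again

# Passing a function as the replacement: each match object goes in,
# and the returned string is spliced back into the result.
def redact(match):
    return "*" * len(match.group())

print(re.sub(r"\d+", redact, "card 1234, pin 99"))   # card ****, pin **
```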

  • spartas 1626 days ago
  • mekane8 1626 days ago
    I love the real-world example, but I couldn't help wondering whether this article was more about sed and awk than actual regex.