Regexp Ranges and Locales: A Long Sad Story

(gnu.org)

54 points | by js2 2104 days ago

9 comments

  • kijin 2103 days ago
    Leaving the behavior undefined in almost all commonly used locales (i.e. anything involving UTF-8) doesn't seem to be a particularly helpful standard.

    It's unreasonable to expect someone who writes a regexp to anticipate which locales a user will execute his program in. It's just as unreasonable to tell him to stick to an outdated locale. How about we ignore locales altogether and just use code points? Code point order is the only ordering that every locale can agree on. [a-z] should match any character whose code point is between U+0061 and U+007A, regardless of the locale.
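
    The code-point proposal above can be sketched in Python. This is a hypothetical illustration of the idea (the `in_range` helper is invented for this sketch), not how POSIX tools actually behave:

```python
# Sketch of locale-independent, code-point-based range matching,
# as the comment proposes. Hypothetical helper, not POSIX behavior.

def in_range(ch, lo, hi):
    """True iff ch's Unicode code point lies between those of lo and hi."""
    return ord(lo) <= ord(ch) <= ord(hi)

# [a-z] read as pure code points U+0061..U+007A:
assert in_range("m", "a", "z")
assert not in_range("ö", "a", "z")  # U+00F6 falls outside U+0061..U+007A
```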

    • dwheeler 2103 days ago
      I agree that it is terrible that this is undefined, but UTF-8 is not a locale; it is an encoding. The standard works just fine when you use C as the locale and UTF-8 as the encoding. I think they should have just defined ranges in terms of encoding values, because that would make more sense, but that would be exactly the opposite of what the standard previously said.
      • kijin 2103 days ago
        > UTF-8 is not a locale.

        Of course it isn't. What I suggested above is to use neither locales nor UTF-8, but code points. "a" = U+0061 no matter which locale you're in, and no matter which encoding you use. Every locale and every encoding is based on the same universal mapping of characters to code points.

        • dwheeler 2103 days ago
          > What I suggested above is to use neither locales nor UTF-8, but code points.

          I agree, I think that is the right thing to do for ranges. If you want "all alphas in the current locale", then you should use some special indicator like [:alpha:] that is defined to be that.

        • oldmanhorton 2103 days ago
          Sure, but those code points are arbitrary. In German, for instance, you may want a-z to include umlauted vowels, or you may not. That's a locale-specific setting, even though the umlauted characters fall well outside the ASCII range from a to z.
          • mmt 2103 days ago
            Keeping ranges undefined doesn't satisfy these wants, either.

            Using code points would at least allow for ranges to have the possibility of being usable to someone in a standard, predictable fashion, outside of the C/POSIX locale.

            For example, specifying a-z plus each umlauted vowel is still shorter than specifying all letters individually.

            Perhaps there is some wisdom in William S. Burroughs's "If you can't be just, be arbitrary."

      • 1996 2103 days ago
        Funnily enough, after plenty of locale nonsense with ls and the other tools, I decided my locale was C.UTF-8.

        This works at least as well as en_US, and provides just enough internationalization to remove the most annoying Americanocentric things.

    • zokier 2103 days ago
      Codepoint ranges were also suggested in the bugzilla thread linked by rwmj

      https://sourceware.org/bugzilla/show_bug.cgi?id=23393#c13

    • pgeorgi 2103 days ago
      your proposal changes behavior for EBCDIC
      • kijin 2103 days ago
        According to the linked article, recent versions of gawk have already shown the middle finger to EBCDIC by taking the liberty to interpret "undefined" as "we'll just go back to ASCII". It would be nice to expand that to the full UTF-8 range, though.
        • laumars 2103 days ago
          Genuine question: does anyone actually still use EBCDIC (outside of hobbyists maintaining retro hardware)?
          • slavik81 2103 days ago
            From the "IBM comment on preparing for a Trigraph-adverse future in C++17" (Oct 2014):

            > There are real customers who use EBCDIC. We cannot reveal their names due to confidentiality agreements. One key example is some of the major banks in North America who continue to use IBM machines to perform check clearing operations. These high reliability software systems are written on IBM mainframes clearing your checks and because they have been debugged over so many years and are highly critical to daily integrity of the financial industry, they are highly reliable and will never be moved to any other platform. In a way, this makes their code unaffected by this removal of trigraphs.

            > Other EBCDIC users with trigraphs from a world-wide company that can’t get square brackets (i.e. [ ]) everywhere so must use trigraphs to get at them quotes: “In that capacity we have widely distributed deployments, around the world, across Windows, HP-UX, Linux, z/OS and iSeries systems. That's why we need the trigraphs, it doesn't seem you can count on the [ ] characters working everywhere. ... "

            Source: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n421...

            • laumars 2103 days ago
              Interesting read, but even that was four years ago. A lot of European banks have been making major upgrades to their IT infrastructure (for better or worse) in recent years, so it wouldn't surprise me if a few of those EBCDIC systems have been torn out of American banks as well.

              However, even assuming they are still in use (which is a real possibility, knowing how slowly banks upgrade), I would be surprised if they depended on GNU Awk, let alone on undefined behaviour in its regex library.

              I'm not normally someone who says "move fast and break things", but I do think this is one situation where it is OK. EBCDIC is hardly recent, and it wasn't exactly popular even when it was "current tech" (I seem to recall it was a bit of a laughing stock?).

          • dfox 2103 days ago
            This question should probably be rephrased as "was there ever a system that used EBCDIC in its POSIX-compliant interface?", and I strongly believe there was not.
        • pgeorgi 2103 days ago
          gawk, yes. I was referring to the POSIX standard declaring it undefined.
          • kijin 2103 days ago
            If the standard says the behavior is undefined, EBCDIC users can't complain if the standard later defines it in a way that is different from the way they've been doing it.
  • tzs 2103 days ago
    I'm sure POSIX thought about this a lot more than I have, so I'm probably missing something and am about to say something that is actually stupid, but...

    POSIX should overload items of the form "x-y" in ranges where x and y are single characters in the locale in use. It should define specific items of this form as not being ranges but simply shorthand for certain predefined strings. The expression is treated as if those items were replaced by the corresponding predefined strings before the regular expression was parsed.

    In particular, "a-z" => "abcdefghijklmnopqrstuvwxyz", and similarly for uppercase. Include such a definition for each possible substring of length 2 or more of "abcdefghijklmnopqrstuvwxyz". Similarly for "0-9".
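
    The expansion idea can be sketched as a pre-parse substitution pass. This is a hypothetical illustration of the proposal (the `EXPANSIONS` table here covers only lowercase subranges), not an existing POSIX mechanism:

```python
# Hypothetical pre-parse expansion of predefined "x-y" items into literal
# strings, as the comment proposes. Items not in the table are left alone.
import re
import string

letters = string.ascii_lowercase
EXPANSIONS = {}
for i in range(len(letters)):
    for j in range(i + 1, len(letters)):
        # Every substring of length 2+ of "abc...xyz" gets an entry.
        EXPANSIONS[f"{letters[i]}-{letters[j]}"] = letters[i:j + 1]
# (similar tables would cover "A-Z" and "0-9" subranges)

def expand(bracket_body):
    """Replace predefined "x-y" items inside a bracket expression body."""
    return re.sub(r"[a-z]-[a-z]",
                  lambda m: EXPANSIONS.get(m.group(0), m.group(0)),
                  bracket_body)

assert expand("a-fk-z") == "abcdefklmnopqrstuvwxyz"
assert expand("0-9") == "0-9"  # not in this sketch's table, left untouched
```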

    • bewuethr 2103 days ago
      Perl more or less does that [1]:

      > Perl also guarantees that the ranges A-Z, a-z, 0-9, and any subranges of these match what an English-only speaker would expect them to match on any platform.

      [1]: https://perldoc.pl/perlrecharclass#Character-Ranges

    • raverbashing 2103 days ago
      I'm not sure what you're saying, when would that be different from what we have today?

      I don't see why that would be beneficial. You also might want ranges as [a-fk-z] (for example)

      • tzs 2103 days ago
        > I'm not sure what you're saying, when would that be different from what we have today?

        The situation today is that "[a-z]" is undefined by POSIX if you are not in the POSIX locale. I'm suggesting that it, and similar cases, should be defined in all locales that include a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, and z.

        > You also might want ranges as [a-fk-z] (for example)

        Looking back, I wrote unclearly. Where I wrote 'overload items of the form "x-y" in ranges' it would have been better to write 'overload items of the form "x-y" inside bracket expressions'.

        In "[a-fk-z]" there are two items of that form. Under my suggestions, "a-f" would be replaced with "abcdef" in all locales, and "k-z" would be replaced with "klmnopqrstuvwxyz", giving "[abcdefklmnopqrstuvwxyz]".

        • bonzini 2102 days ago
          But would a-z also include for example à and ä, or should 0-9 include ½?

          (The solution that glibc will implement is to un-interleave lowercase and uppercase characters whenever the collation order is like aàAÀbBcC...).

  • rwmj 2103 days ago
  • chrismorgan 2103 days ago
    > ‘["-/]’ is perfectly valid in ASCII, but is not valid in many Unicode locales, such as en_US.UTF-8.

    Why is this the case? Collation sequences, I’m guessing?

    • jwilk 2103 days ago
      Yes, slash sorts before double-quote in this locale:

        $ (echo '"'; echo '/') | LC_ALL=en_US.UTF-8 sort
        /
        "
      • a1369209993 2103 days ago
        Edit: nevermind, I apparently rm'd the offending locale directory last time I encountered a bug like this and sort is silently ignoring LC_ALL:

          bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8): No such file or directory
        
        I can't reproduce this bug:

          $ (echo '"'; echo '/') | LC_ALL=en_US.UTF-8 sort
          "
          /
          $ sort --version
          sort (GNU coreutils) 8.26
        
        What version are you using?
    • raverbashing 2103 days ago
      Well, that's weird, I thought Unicode matched ASCII for codepoints < 127

      Still, this doesn't look like an issue with regexes per se

      • LukeShu 2103 days ago
        The numeric value of Unicode codepoints does match ASCII for codepoints < 127. However, Unicode codepoint numeric order is not collation order. Collation order isn't encoding-specific, it's locale (language)-specific.

        My current locale is en_US.UTF-8; if I changed it to a different locale ending in ".UTF-8", the codepoint values wouldn't change (still Unicode), and their encodings wouldn't change (still UTF-8), but the collation order might.

        One can observe this change even without getting exotic. In the "C" or "POSIX" locales (synonyms), the ASCII letters sort in ASCII numeric order ("ABC...XYZ...abc...xyz"), but in "en_US" they sort "aAbBcC...xXyYzZ" (because that's what a native speaker of American English would expect, if they're not familiar with ASCII).
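
        Python's locale module can show the two orderings side by side. This is a sketch; whether en_US.UTF-8 is installed varies by system, hence the fallback:

```python
# Code-point order vs. locale collation order, a sketch.
# en_US.UTF-8 may not be installed everywhere, hence the try/except.
import locale

letters = list("AaBbZz")

# Default string comparison is code-point order: all uppercase first.
print(sorted(letters))  # ['A', 'B', 'Z', 'a', 'b', 'z']

try:
    locale.setlocale(locale.LC_COLLATE, "en_US.UTF-8")
    # en_US collation typically interleaves case: aAbB...zZ.
    print(sorted(letters, key=locale.strxfrm))
except locale.Error:
    print("en_US.UTF-8 locale not installed on this system")
```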

        • chrismorgan 2102 days ago
          … ouch. So if you’re operating by locale, [a-z] and [A-Z] may do the wrong thing completely, including all but the first or last letter of the wrong case.

          Putting lowercase before uppercase also surprises me; if asked to sort them, I’d definitely put uppercases before lowercases: AaBbCc…

          • LukeShu 2102 days ago
            Yes, that's why people found GNU awk < 4.0 troublesome. Fortunately, modern versions of POSIX make regex range expressions ("[a-z]") unspecified behavior in locales other than "C", so modern versions of GNU awk and other GNU utilities "do the right thing".

            As for the en_US example, for the most part collation is case-insensitive, and case only comes in as a tiebreaker; "aa" < "Aa" < "ab".

  • rspeer 2103 days ago
    I've seen something that sounds related. In grep, in the en_US.UTF-8 locale, sometimes I can match [A-Z]+ and it will match accented uppercase strings such as "SCHÖN". It will not match lowercase letters.

    This is often desirable, except for the part that I don't know what the heck ranges mean anymore. "Ö" is certainly not between "A" and "Z" in codepoint order. It is in collation order, but if it were collation order, it would match lowercase letters. How does this work?

    • bonzini 2102 days ago
      Collation order did not interleave lowercase and uppercase until recently, except in a few oddball locales (e.g. cs_CZ.UTF-8).

      Interleaving was added to all locales recently, and people started complaining that their scripts broke, so it will probably be reverted.

  • gnufx 2103 days ago
    People seem to be missing character classes like [[:upper:]]. If you need some other sort of range in a portable script, say, just make sure you set LC_COLLATE. And if you're testing with GNU sort, use --debug to check what it's actually doing in case you don't have the definition for the current LC_COLLATE, for instance.
  • theothermkn 2103 days ago
    My main problem with the [A-Za-z0-9] notation is that it looks great at first glance. I mean, it looks really, really great. Of course [A-Z] means all capital letters. And then you think about it, and you start to suspect something like the situation described in the article. Suddenly, you're in the familiar but dizzying position of being perched atop a shoddily-built and wobbling tower of abstractions. You feel your familiar nausea soaking in from the periphery of your editor window.

    I just now, minutes ago, got some regexes to mostly sorta work in a project to convert some jai alai score data (converted from wonky PDF files). It's a one-off script. My Python is rusty, but I couldn't find how to get a POSIX character class for 'lowercase letters and uppercase letters' to work. [A-Za-z] happens to work, for now.
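
    For what it's worth, Python's re module interprets ranges in str patterns by code point rather than by locale collation, and it has no POSIX [[:alpha:]]-style classes, so [A-Za-z] is stable there. A quick sketch:

```python
# Python's re interprets ranges by code point, not locale collation,
# so [A-Za-z] means exactly the 52 ASCII letters here.
import re

assert re.fullmatch(r"[A-Za-z]+", "JaiAlai")
assert not re.fullmatch(r"[A-Za-z]+", "SCHÖN")  # Ö (U+00D6) is outside A-Z

# For "any letter" in the Unicode sense, str.isalpha() works instead:
assert "SCHÖN".isalpha()
```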

    I love regexes. I hate regexes.

    • Sharlin 2103 days ago
      Every (Finnish) first grader knows that [A-Z] obviously doesn't mean all capital letters, because the letters Å, Ä, and Ö follow Z in the alphabet ;)
  • jwilk 2103 days ago
    > the 2008 standard had changed the definition of ranges, such that outside the "C" and "POSIX" locales, the meaning of range expressions was undefined

    It was changed earlier. It's undefined in the 2004 edition, too.

    • tzs 2103 days ago
      And no change in the 2017 edition.
  • bjourne 2103 days ago
    Aren't there already half a dozen Linus Torvalds rants on the brain-damaged stupidity of the POSIX specs? The story is long and sad because the developers thought following the spec was more important than writing usable software.