Leaving the behavior undefined in almost all commonly used locales (i.e. anything involving UTF-8) doesn't seem to be a particularly helpful standard.
It's unreasonable to expect someone who writes a regexp to anticipate which locales a user will execute his program in. It's just as unreasonable to tell him to stick to an outdated locale. How about we ignore locales altogether and just use code points? Code point order is the only ordering that every locale can agree on. [a-z] should match any character whose code point is between U+0061 and U+007A, regardless of the locale.
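For what it's worth, Python 3's re module already behaves this way: bracket ranges are interpreted by code point, independent of the process locale. A quick sketch:

```python
import re

# In Python 3's re module, [a-z] means U+0061 through U+007A,
# no matter what locale the process runs under.
assert re.fullmatch('[a-z]+', 'abc') is not None
assert re.fullmatch('[a-z]+', 'äöü') is None  # U+00E4 etc. lie outside the range
```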
I agree that it is terrible that this is undefined, but UTF-8 is not a locale; it's an encoding. The standard works just fine when you use C as the locale and UTF-8 as the encoding. I think they should have just defined ranges as being the encoding values, because that would make more sense, but that would be exactly the opposite of what the standard previously said.
Of course it isn't. What I suggested above is to use neither locales nor UTF-8, but code points. "a" = U+0061 no matter which locale you're in, and no matter which encoding you use. Every locale and every encoding is based on the same universal mapping of characters to code points.
> What I suggested above is to use neither locales nor UTF-8, but code points.
I agree, I think that is the right thing to do for ranges. If you want "all alphas in the current locale", then you should use some special indicator like [:alpha:] that is defined to be that.
Sure, but those code points are arbitrary. For instance in German, you may want a-z to include umlauted vowels, or you may not. That's a locale-specific setting, even though the umlauted characters fall well outside the ASCII range a to z.
Keeping ranges undefined doesn't satisfy these wants, either.
Using code points would at least allow for ranges to have the possibility of being usable to someone in a standard, predictable fashion, outside of the C/POSIX locale.
For example, specifying a-z plus each umlauted vowel is still shorter than specifying all letters individually.
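Concretely, in a code-point-based engine (Python syntax here) the class stays short:

```python
import re

# "a-z" plus the German lowercase extras, spelled out explicitly;
# still far shorter than listing every letter individually.
german_lower = re.compile('[a-zäöüß]+')
assert german_lower.fullmatch('grüße') is not None
assert german_lower.fullmatch('GRÜSSE') is None  # uppercase not included
```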
Perhaps there is some wisdom in William S. Burroughs's "If you can't be just, be arbitrary."
According to the linked article, recent versions of gawk have already shown the middle finger to EBCDIC, by taking the liberty to interpret "undefined" as "we'll just go back to ASCII". It would be nice to expand that to the full UTF-8 range, though.
From the "IBM comment on preparing for a Trigraph-adverse future in C++17" (Oct 2014):

> There are real customers who use EBCDIC. We cannot reveal their names due to confidentiality agreements. One key example is some of the major banks in North America who continue to use IBM machines to perform check clearing operations. These high reliability software systems are written on IBM mainframes clearing your checks and because they have been debugged over so many years and are highly critical to daily integrity of the financial industry, they are highly reliable and will never be moved to any other platform. In a way, this makes their code unaffected by this removal of trigraphs.

> Other EBCDIC users with trigraphs from a world-wide company that can’t get square brackets (i.e. [ ]) everywhere so must use trigraphs to get at them quotes: “In that capacity we have widely distributed deployments, around the world, across Windows, HP-UX, Linux, z/OS and iSeries systems. That's why we need the trigraphs, it doesn't seem you can count on the [ ] characters working everywhere. ... "

Source: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n421...
Interesting read, but even that was 4 years ago. A lot of European banks have been making major upgrades to their IT infrastructure (for better or worse) in recent years, so it wouldn't surprise me if a few of those EBCDIC systems have been torn out of American banks as well.
However, even assuming they are still in use (a real possibility, knowing how slowly banks upgrade), I would be surprised if they depended on GNU Awk, let alone on undefined behaviour in its regex library.
I'm not normally someone who says "move fast and break things", but I do think this is one situation where it's OK. EBCDIC is hardly recent, and it wasn't exactly popular even when it was "current tech" (I seem to recall it was a bit of a laughing stock?).
This question should probably be rephrased as "was there ever a system that used EBCDIC in its POSIX-compliant interface?", which I strongly believe there was not.
If the standard says the behavior is undefined, EBCDIC users can't complain if the behavior is later defined in a way that differs from what they've been doing.
I'm sure POSIX thought about this a lot more than I have, so I'm probably missing something and am about to say something that is actually stupid, but...
The standard should overload items of the form "x-y" in ranges where x and y are single characters in the locale in use. It should define specific items of this form as not being ranges but simply shorthand for certain predefined strings: the expression is treated as if those items were replaced by the corresponding predefined strings before the regular expression is parsed.
In particular "a-z" => "abcdefghijklmnopqrstuvwxyz". Similar for uppercase. Include such a definition to produce each possible substring of length 2 or more of "abcdefghijklmnopqrstuvwxyz". Similar for "0-9".
> Perl also guarantees that the ranges A-Z, a-z, 0-9, and any subranges of these match what an English-only speaker would expect them to match on any platform. [1]

[1]: https://perldoc.pl/perlrecharclass#Character-Ranges
> I'm not sure what you're saying, when would that be different from what we have today?
The situation today is that "[a-z]" is undefined by POSIX if you are not in the POSIX locale. I'm suggesting that it, and similar cases, should be made defined in all locales that include a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, and z.
> You also might want ranges as [a-fk-z] (for example)
Looking back, I wrote unclearly. Where I wrote 'overload items of the form "x-y" in ranges' it would have been better to write 'overload items of the form "x-y" inside bracket expressions'.
In "[a-fk-z]" there are two items of that form. Under my suggestions, "a-f" would be replaced with "abcdef" in all locales, and "k-z" would be replaced with "klmnopqrstuvwxyz", giving "[abcdefklmnopqrstuvwxyz]".
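A sketch of that rewriting step in Python (a hypothetical preprocessor, not part of any standard or implementation):

```python
import string

# Hypothetical helper: inside a bracket expression body, rewrite "x-y"
# items that are subranges of a-z, A-Z, or 0-9 into explicit character
# lists, before the regex proper is parsed. Other items pass through.
def expand_ranges(bracket_body):
    alphabets = (string.ascii_lowercase, string.ascii_uppercase, string.digits)
    out = []
    i = 0
    while i < len(bracket_body):
        if i + 2 < len(bracket_body) and bracket_body[i + 1] == '-':
            lo, hi = bracket_body[i], bracket_body[i + 2]
            for alpha in alphabets:
                if lo in alpha and hi in alpha and alpha.index(lo) < alpha.index(hi):
                    out.append(alpha[alpha.index(lo):alpha.index(hi) + 1])
                    break
            else:
                # Not a recognized subrange; keep the item as written.
                out.append(bracket_body[i:i + 3])
            i += 3
        else:
            out.append(bracket_body[i])
            i += 1
    return ''.join(out)

# expand_ranges('a-fk-z') yields 'abcdefklmnopqrstuvwxyz'
```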
The numeric values of Unicode codepoints match ASCII for codepoints below 128. However, Unicode codepoint numeric order is not collation order. Collation order isn't encoding-specific, it's locale (language)-specific.
My current locale is en_US.UTF-8; if I changed that to a different locale ending with ".UTF-8", the codepoint values haven't changed (still Unicode), and the encodings of them haven't changed (still UTF-8), but the collation order might have.
One can observe this change even without getting exotic. In the "C" or "POSIX" locales (synonyms), the ASCII values sort in ASCII numeric order ("ABC..XYZ...abc...xyz"), but in "en_US", they sort "aAbBcC...xXyYzZ" (because that's what a native American English speaker would expect, if they're not familiar with ASCII).
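The difference is easy to emulate without the real locale tables (a rough sketch; actual en_US collation data is more involved):

```python
letters = list('bBaA')

# "C"/"POSIX": plain code point order -- uppercase block first.
assert ''.join(sorted(letters)) == 'ABab'

# en_US-style: compare case-insensitively, with lowercase winning ties.
dict_order = sorted(letters, key=lambda c: (c.lower(), c.isupper()))
assert ''.join(dict_order) == 'aAbB'
```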
… ouch. So if you’re operating by locale, [a-z] and [A-Z] may do the wrong thing completely, including all but the first or last letter of the wrong case.
Putting lowercase before uppercase also surprises me; if asked to sort them, I’d definitely put uppercases before lowercases: AaBbCc…
Yes, that's why people found GNU awk <4.0 troublesome. Fortunately, modern versions of POSIX make regex range expressions ("[a-z]") unspecified behavior in locales other than "C", so modern versions of GNU awk and other GNU utilities "do the right thing".
As for the en_US example, for the most part collation is case-insensitive, and case only comes in as a tiebreaker; "aa" < "Aa" < "ab".
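That tiebreaker behavior can be mimicked with a two-level sort key (a sketch, not the real collation algorithm):

```python
# Primary key: the case-folded string; secondary key: case pattern,
# with lowercase ordered before uppercase at each position.
key = lambda s: (s.lower(), [c.isupper() for c in s])

words = ['Aa', 'ab', 'aa']
assert sorted(words, key=key) == ['aa', 'Aa', 'ab']
```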
I've seen something that sounds related. In grep, in the en_US.UTF-8 locale, sometimes I can match [A-Z]+ and it will match accented uppercase strings such as "SCHÖN". It will not match lowercase letters.
This is often desirable, except for the part that I don't know what the heck ranges mean anymore. "Ö" is certainly not between "A" and "Z" in codepoint order. It is in collation order, but if it were collation order, it would match lowercase letters. How does this work?
People seem to be missing character classes like [[:upper:]]. If you need some other sort of range in a portable script, just make sure you set LC_COLLATE. And if you're testing with GNU sort, use --debug to check what it's actually doing, for instance in case the definition for the current LC_COLLATE isn't installed.
My main problem with the [A-Za-z0-9] notation is that it looks great at first glance. I mean, it looks really, really great. Of course [A-Z] means all capital letters. And then you think about it, and you start to suspect something like the situation described in the article. Suddenly, you're in the familiar but dizzying position of being perched atop a shoddily-built and wobbling tower of abstractions. You feel your familiar nausea soaking in from the periphery of your editor window.
I just now, minutes ago, got some regexes to mostly sorta work in a project to convert some jai alai score data (converted from wonky PDF files). It's a one-off script. My Python is rusty, but I couldn't find how to get a POSIX class for 'lowercase letters and uppercase letters' to work. [A-Za-z] happens to work, for now.
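For the record, Python's re module doesn't support POSIX classes like [[:alpha:]] at all, but it does have locale-independent near-equivalents:

```python
import re

assert re.fullmatch('[A-Za-z]+', 'JaiAlai') is not None    # ASCII letters only
assert re.fullmatch(r'[^\W\d_]+', 'Schön') is not None     # any Unicode letter
assert re.fullmatch(r'[^\W\d_]+', 'a1') is None            # digits excluded
```

The `[^\W\d_]` trick works by negating non-word characters, digits, and underscore, leaving just letters.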
> the 2008 standard had changed the definition of ranges, such that outside the "C" and "POSIX" locales, the meaning of range expressions was undefined
It was changed earlier. It's undefined in the 2004 edition, too.
Aren't there already half a dozen Linus Torvalds rants on the brain-damaged stupidity of the POSIX specs? The story is long and sad because the developers thought following the spec was more important than writing usable software.
This works at least as well as en-US, and provides me just enough internationalization to remove the most annoying America-centric things.
https://sourceware.org/bugzilla/show_bug.cgi?id=23393#c13
I don't see why that would be beneficial. You also might want ranges as [a-fk-z] (for example)
(The solution that glibc will implement is to un-interleave lowercase and uppercase characters whenever the collation order is like aàAÀbBcC...).
https://sourceware.org/bugzilla/show_bug.cgi?id=23393 (https://news.ycombinator.com/item?id=17557243)
Why is this the case? Collation sequences, I’m guessing?
Still, this doesn't look like an issue with regexes per se.
Interleaving was added to all locales recently, and people started complaining that their scripts broke, so it will probably be reverted.
I love regexes. I hate regexes.