Unicode in five minutes (2013)

(richardjharris.github.io)

279 points | by jstanley 1343 days ago

15 comments

  • necovek 1343 days ago
    I originally laughed at "in five minutes", but even though I do not think the article reads in five minutes, it does a surprisingly good job of covering the basics: so good job!

    I do wonder whether it is clear to people who are unfamiliar with Unicode, though. Can anyone who is mostly unfamiliar with the details the article covers say how comprehensible it is?

    I would also add a mention of the default Unicode collation table (DUCET), which does a passable job for many languages at the same time (the Unicode Collation Algorithm is mentioned, and DUCET is its default table, but I think this property of most UCA implementations is worth highlighting).

    As for the article's gotchas, multilingual text is even more complex once you go past 5 minutes, even for "simple" European scripts. E.g. in Bosnian/Croatian/Serbian in the Roman/Latin alphabet, "nj" will be capitalized to "Nj" or "NJ" depending on the rest of the word — e.g. "Njegoš" or "NJEGOŠ". Confusingly, Unicode also includes digraphs for both capitalization forms (the eternal tension in Unicode between encoding letters, glyphs or characters), even though they are linguistically equivalent. In practice, the digraphs are never used, which makes their inclusion even more perplexing (they are always spelled out using two characters, and there was no historical reason to include them since none of the 8-bit encodings had them)! "nj" will also sometimes be two distinct letters, especially in loanwords like "konjugovan" — this makes things harder when you need to collate text, since the proper order would be "konjugovan", "kontakt", "konj".
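
    Both points above can be demonstrated with Python's stdlib alone; a sketch (the digraph codepoints U+01CA..U+01CC are from the Latin Extended-B block, and the word list is the one from this comment):

```python
import unicodedata

# Unicode encodes the nj digraph in three case forms:
nj, Nj, NJ = '\u01CC', '\u01CB', '\u01CA'
print(unicodedata.name(nj))  # LATIN SMALL LETTER NJ
print(nj.upper() == NJ)      # True: all-caps form "NJ"
print(nj.title() == Nj)      # True: word-initial titlecase form "Nj"

# A naive codepoint sort cannot know that the "nj" in "konj" is one
# letter that collates after "n" + anything, so it misorders the list:
words = ['kontakt', 'konj', 'konjugovan']
print(sorted(words))  # ['konj', 'konjugovan', 'kontakt'] -- wrong for BCS
# The proper BCS order would be: konjugovan, kontakt, konj
```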

    All of this is why I like to joke how Cyrillic script is technically much better for all of these languages, even though it is basically in official use only for the Serbian language — in Cyrillic, there is no conundrum in either of the above examples since nj=њ (or нј), Nj/NJ=Њ, and the order is clear: конјугован, контакт, коњ.

    • pmiller2 1343 days ago
      > I originally laughed at "in five minutes", but even though I do not think the article reads in five minutes, it does a surprisingly good job of covering the basics: so good job!

      Slightly off topic, but just to riff on this a bit: maybe books and articles called "$THING in $NUMBER_OF $TIME_PERIODS" or "Learn $THING in $NUMBER_OF $TIME_PERIODS" should be retitled "$NUMBER_OF $TIME_PERIODS with $THING." It would be more accurate, not imply any sort of mastery, and, on top of that, sound a little more dignified. But, maybe it wouldn't sell as many books, so... ¯\_(ツ)_/¯.

  • Wistar 1343 days ago
    Joel Spolsky's 2003 Joel On Software piece: "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)"

    https://www.joelonsoftware.com/2003/10/08/the-absolute-minim...

    • bmn__ 1342 days ago
      This has meanwhile fallen behind the times. I would recommend the submitted article over Spolsky's.
      • Rels 1333 days ago
        I'm not so sure about that; Spolsky's article seems way better at introducing someone to Unicode if they don't know anything about it. The OP article goes way deeper and has more interesting insights about Unicode itself, though.

        Disclaimer: might be biased because I've discovered Unicode through Spolsky's article.

      • Wistar 1340 days ago
        I didn't post it as an alternative but as a "see also."
  • herodotus 1343 days ago
    This is a really great summary of Unicode. I wish it had been available when I first started getting into the complexities of multilingual string searching and normalization. Ultimately, reading the official documentation (unicode.org) was necessary, but a succinct and clearly written introduction like this would have saved me hours (if not days) of effort.
    • hombre_fatal 1343 days ago
      Yeah, this is the kind of cut-to-the-damn-chase I want 90% of the time as an experienced developer touching technology I don't necessarily deep-dive every day, like an actual example of what NFKC does.

      Even if it's too topical to be actionable in every case, it gives you the general idea and vocabulary to put together useful search queries when you want to know more.
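
      In that spirit, a minimal Python sketch of what the normalization forms actually do, using only the stdlib `unicodedata` module:

```python
import unicodedata

s = '\ufb03'   # U+FB03 LATIN SMALL LIGATURE FFI
e = 'e\u0301'  # 'e' + COMBINING ACUTE ACCENT (two codepoints)

# The compatibility (K) forms decompose "presentation" variants like ligatures:
print(unicodedata.normalize('NFKC', s))  # 'ffi' (three ASCII letters)
print(unicodedata.normalize('NFC', s))   # unchanged: canonical forms leave it alone

# Canonical composition merges base + combining mark into one codepoint:
print(unicodedata.normalize('NFC', e) == '\u00e9')       # True ('é', length 1)
print(len(unicodedata.normalize('NFD', '\u00e9')))       # 2 (decomposed again)
```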

      • andrepd 1341 days ago
        Honestly. It's so frustrating when people go on tangents and say in three paragraphs what you could say in two short sentences. An example: the rust book.
  • wcarss 1343 days ago
    Something complementary -- because this article just takes a moment to talk about the different encoding schemes -- a wonderful, terse, very informative video describing how utf-8 encoding works and why (with a little history) by Tom Scott/Computerphile: https://www.youtube.com/watch?v=MijmeoH9LT4
  • jermier 1343 days ago
    I always loved the whimsy present in Unicode. For nostalgia, here's an HN post from 2010 pointing to the `Unicode Snowman for You` site (which is still up!)

    https://news.ycombinator.com/item?id=2035572

    And the site:

    http://xn--n3h.net/

    • mysterypie 1343 days ago
      According to the HTML source, the original site was:

      http://unicodesnowmanforyou.com/

      I wish I understood what the keepers of Unicode were thinking by including so much bloat in a character set (or character encoding). I realize that Unicode is going to have a huge number of symbols no matter what, if they're going to represent all the world's languages and math and punctuation, but I'd draw the line at emoticons, emojis, playing card symbols, and snowmen.

      • ekidd 1343 days ago
        One of the major goals of Unicode was to support round trip conversion from all the widely-used character sets into Unicode, and then back out again. In particular, supporting popular Japanese character sets was important for technical and commercial reasons.

        There was a lot of weird stuff in the world's character sets.

        Emoji were first used by Japanese cell phone carriers. They were encoded as Shift JIS characters, but in incompatible ways. The Unicode Consortium had no real interest in this until Google and Apple basically said, "If we're going to have to support all these character sets, could we please standardize them?"

        I think it's just the reality of standardizing the world's character sets. A lot of weird legacy stuff will slip in, and other countries will want to standardize things that seem unnecessary. Personally, I'm very thankful that somebody wants to do all the exhausting political work of coming to a consensus. A few snowmen are a small price to pay.

      • Mediterraneo10 1343 days ago
        Playing card symbols are -- like chess symbols -- typeset inline with ordinary text in books that deal with the strategy of those games. So, IMO it makes sense to include them in a character set that a font and typesetting engine will support.
      • Dylan16807 1343 days ago
        Well, right now it's about two percent of Unicode, right?

        And people use them as text, so there's a reason to add them and not much reason to refuse them.

        • mysterypie 1343 days ago
          You might be right, but where are you getting the 2% from? Are you thinking of just emoticons, emojis, playing card symbols, and snowmen? There's more than that I'd question.
          • Dylan16807 1343 days ago
            I looked up how many emoji there were, added some for wingdings, and rounded up a bit.

            What else would you question? Would it be more than 1500 more, which would bump it from 2 to 3 percent?

    • yrro 1342 days ago
      Hm... what's going on here?

          $ host xn--n3h.net
          host: 'xn--n3h.net.' is not a legal IDNA2008 name (string contains a disallowed character), use +noidnout
      
      Looks like emoji were forbidden in IDNA2008... :'(
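
      The `xn--` form is just Punycode with an ACE prefix, and Python's stdlib can round-trip it; a sketch (the IDNA2008 rejection above comes from its stricter codepoint rules in RFC 5892, not from Punycode itself):

```python
# U+2603 SNOWMAN encodes to the 'n3h' seen in xn--n3h.net:
snowman = '\u2603'
print(snowman.encode('punycode'))  # b'n3h'
print(b'n3h'.decode('punycode'))   # the snowman again
```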
  • rurban 1343 days ago
    It misses the security considerations for names (e.g. for filenames or variable names). Almost nobody knows about or implements those.
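
    A crude illustration of the mixed-script spoofing problem those considerations address; a Python sketch that abuses the first word of each character's Unicode name as a stand-in for the real Script property (real implementations use UTS #24 and the confusables data from UTS #39):

```python
import unicodedata

def scripts_used(s):
    # Hypothetical helper: collect the script prefix of each character's name.
    return {unicodedata.name(ch).split()[0] for ch in s}

legit = 'paypal'
spoof = '\u0440\u0430ypal'  # CYRILLIC ER + CYRILLIC A + Latin 'ypal'
print(scripts_used(legit))  # {'LATIN'}
print(scripts_used(spoof))  # mixed scripts: a classic spoofing red flag
```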
  • UncleEntity 1343 days ago
    Unicode is weird...this prints out backwards (including the comma and space) in the python3 repl:

      >>> [chr(0x07c0+i) for i in range(10)]
      ['߀', '߁', '߂', '߃', '߄', '߅', '߆', '߇', '߈', '߉']
    
    0..9 in the N'Ko script BTW...
    • hombre_fatal 1343 days ago
      I don't get what you mean by backwards.

          py3> [chr(0x07c0+i) for i in range(10)]
          ['߀', '߁', '߂', '߃', '߄', '߅', '߆', '߇', '߈', '߉']
      
          
          js> [...Array(10)].map((_,i)=>String.fromCodePoint(0x07c0+i))
          ['߀', '߁', '߂', '߃', '߄', '߅', '߆', '߇', '߈', '߉']
      • diath 1343 days ago
        I believe he's trying to print the 0..9 range by providing the proper starting point for those characters, but instead gets 9..0 (I don't know the script, but I'm basing that on the 0 at the end). For instance, 0x07c0 stands for 0 in the N'Ko script and is his starting point, but the entire sequence ends up reversed. I'm not sure how comparing it to JS helps here, other than pointing out that it also behaves unexpectedly.
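
        The reversal comes from the Unicode bidirectional algorithm: unlike ASCII digits, N'Ko digits carry a right-to-left bidi class, which the renderer honors. A quick stdlib check:

```python
import unicodedata

print(unicodedata.bidirectional('0'))       # 'EN' (European Number)
print(unicodedata.bidirectional('\u07c0'))  # 'R'  (Right-to-Left!)
print(unicodedata.name('\u07c0'))           # NKO DIGIT ZERO
```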
        • hombre_fatal 1343 days ago
          Wait, I just realized the results in my repl (0..9) are reversed from what I pasted into HN (9..0). And if you shrink the width of the browser to force my HN snippets to wrap, it changes the order. And it selects in the reverse order on click and drag.

          I spoke way too soon. Unicode is weird. My apologies to our friend UncleEntity.

          • diath 1343 days ago
            It looks like it depends on how your terminal (or the browser, or anything that renders it) handles Unicode (which I guess just means that Unicode is hard to get right): https://i.imgur.com/8FPNYMP.png
            • necovek 1343 days ago
              It's deciding the directionality (right-to-left or left-to-right) of mixed text that is complicated, and it is always nothing but a heuristic.

              I must admit that I was surprised that the following snippet kept the LTR order in my terminal:

                  >>> [(chr(ord('0')+i), chr(0x07c0+i)) for i in range(10)]
                  [('0', '߀'), ('1', '߁'), ('2', '߂'), ('3', '߃'), ('4', '߄'), ('5', '߅'), ('6', '߆'), ('7', '߇'), ('8', '߈'), ('9', '߉')]
                  >>> [(chr(0x07c0+i), chr(ord('0')+i)) for i in range(10)]
                  [('߀', '0'), ('߁', '1'), ('߂', '2'), ('߃', '3'), ('߄', '4'), ('߅', '5'), ('߆', '6'), ('߇', '7'), ('߈', '8'), ('߉', '9')]

  • wingi 1343 days ago
    The variation selector link is dead, but is archived.

    https://web.archive.org/web/20160417233039/http://babelstone...

    • obelos 1343 days ago
      I've worked with Unicode for years and thought I had a good handle on its mechanics until I discovered this feature of the system last year. I was puzzling out why some symbol code points sometimes render in flat character style and other times as more graphic emoji, even when the same font and same code point is used in each case. Turned out it was a matter of applying VS15 or VS16 as a combining character, and which was the default for a given code point. Incredibly detailed stuff that this archived BabelStone article goes into in much greater depth than the bit I wrote about my exploration: https://khephera.net/posts/a-unicode-woe-solved/
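
      The two selectors are ordinary codepoints you can append yourself; a Python sketch (how the result actually renders still depends on the font stack and the codepoint's default presentation):

```python
import unicodedata

base = '\u2603'          # SNOWMAN
text = base + '\ufe0e'   # VS15 requests text (monochrome) presentation
emoji = base + '\ufe0f'  # VS16 requests emoji (color) presentation

print(unicodedata.name('\ufe0e'))  # VARIATION SELECTOR-15
print(unicodedata.name('\ufe0f'))  # VARIATION SELECTOR-16
print(len(emoji))                  # 2 codepoints, one perceived character
```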
  • jitteriest 1343 days ago
    > it gives a (double-story) and a (single-story) the same codepoint.

    But they did see fit to have ɑ (LATIN SMALL LETTER ALPHA), which is distinct from α (GREEK SMALL LETTER ALPHA).

  • nabla9 1343 days ago
    This is the first short intro to Unicode I have seen where the reader does not leave thinking that one user-perceived character must be just one code point.
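
    The point in one tiny Python sketch: some user-perceived characters are several codepoints, and not all of them can be normalized down to one:

```python
import unicodedata

s = 'e\u0301'  # 'e' + COMBINING ACUTE ACCENT, renders as one character
print(len(s))  # 2 codepoints for one user-perceived character

flag = '\U0001F1F8\U0001F1EA'  # two regional indicators, renders as one flag
# No single codepoint exists for the flag, so normalization cannot collapse it:
print(len(unicodedata.normalize('NFC', flag)))  # still 2
```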
    • a1369209993 1343 days ago
      Although they do mistakenly refer to ffi (U+FB03) as a character. Still better than most intros though.
  • ngcc_hk 1343 days ago
    I was just reading an article about the Japanese Saito surname and how problematic the idea of "uni"code (and in particular Han unification) is in real-life situations. Yes, you may have a codepoint, but that solves only part of the problem, especially where human names are concerned.
  • nopacience 1343 days ago
    Excellent, by a Perl programmer... and so many never try modern Perl.
  • jariel 1343 days ago
    This is a good introduction, unfortunately, Unicode may ultimately be a problem in and of itself.

    To start, consider that the term 'character' used in the article, though 'generally correct' ... is definitely not correct in the broadest sense.

    Western, Cyrillic and Asian scripts boil down to 'characters', with some complexity such as ligatures ('Straße'), but this falls apart quickly for other languages.

    Unfortunately, rather than creating rigorously applied definitions for things, and applying them consistently, even Unicode falls into this bureaucratic trap of vagaries with their own definitions.

    So Unicode works well for most things, but then it falls off a cliff.

    Here is the definitions section [1]

    Just have a look at the definitions of 'Character', 'Grapheme' and 'Grapheme Cluster', and you start to see how quickly confusion sets in.

    Consider that in Unicode ... there isn't really such a thing as a 'character' - it's just an unspecific word we use that has no technical application! (When we say 'character' generally what we mean is 'Grapheme Cluster').
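
    For illustration, a deliberately naive grapheme-cluster approximation in Python; real segmentation is defined by UAX #29 and needs far more than this (Hangul jamo, ZWJ emoji sequences, regional indicator pairs, and so on):

```python
import unicodedata

def naive_clusters(s):
    # Naive rule: attach combining marks (general category M*) to the
    # preceding base character. This is only a small subset of UAX #29.
    clusters = []
    for ch in s:
        if clusters and unicodedata.category(ch).startswith('M'):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

print(naive_clusters('e\u0301x'))  # two clusters from three codepoints
```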

    Language is itself a rabbit hole of complexity, so any standard trying to manage it will be painful - but it feels as though the true corner cases of Unicode are actually unbounded.

    In short, too many pragmatic loose ends. In any scenario where you think you have an algorithm sorted out, there are probably holes in it, if you cared to look for them for a specific language.

    It's not 'bad', but it's not the uber solution; it's frayed at the edges.

    [1] https://unicode.org/glossary/

    • matvore 1342 days ago

        > Consider that in Unicode ... there isn't really such a thing as a 'character' 
      
      This is a really important consideration, since it helps you realize the immense difficulty of rolling your own logic for character-aware handling, unless you are deliberately limiting your scope, like only handling NFC-normalized text in a limited number of languages.