  • SloopJon 132 days ago

    My first thought on seeing this was Bjoern Hoehrmann's UTF-8 decoder, which encodes a finite state machine into a byte array:

    http://bjoern.hoehrmann.de/utf-8/decoder/dfa/

    I happened to come across this decoder again recently in Niels Lohmann's JSON library for C++:

    https://github.com/nlohmann/json

    I see that this is mentioned in a previous post:

    https://lemire.me/blog/2018/05/09/how-quickly-can-you-check-...

    One thing I'd like to check in the new code is whether it's as picky about things like overlong sequences as Bjoern's code is.

    • loeg 132 days ago

      > One thing I'd like to check in the new code is whether it's as picky about things like overlong sequences as Bjoern's code is.

      I believe that's what checkContinuation() is doing, based on its use of the "counts" parameters. I don't understand how it works, but I don't see any other reason for count_nibbles() to compute the '->count' member.

      • kwillets 132 days ago

        The counts are actually used to find the lengths of the runs of continuation bytes, and to mark them so that overlaps or underlaps between the end of one code point and the start of the next can be detected.

      • kwillets 132 days ago

        It checks overlongs. It's not too hard -- a few bits in the first byte, and a few in the second for some lengths.

      • KMag 132 days ago

        With such a high percentage of text on the web in UTF-8, and so many uses for variable-length integers, I hope we'll see single instructions for reading and writing Unicode code points to/from UTF-8, as well as for reading/writing 128-bit variable-length integers (preferably prefix-encoded[0] rather than LEB128).

        A while back, I read that the Chinese Loongson processors were a (subset?) of the MIPS instruction set with added instructions for Unicode handling, but that's all I've heard of processors with Unicode-accelerating instructions, and I'm not sure which encoding(s) was/were accelerated.

        [0] https://news.ycombinator.com/item?id=11263378

        • chrisseaton 132 days ago

          > With such a high percentage of text on the web in UTF-8, and many uses for variable-length integers, I hope that we'd see single instructions for reading and writing Unicode codepoints to/from UTF-8

          If we took this attitude with every new technology, we'd have a large number of instructions that are now useless. At one time it probably seemed a good idea to have custom instructions for parsing XML, and people really were doing custom instructions for interpreting JVM bytecode.

          • mywittyname 132 days ago

            I disagree. It's reasonable to assume UTF-8 is here to stay. It will likely outlive any processor architecture designed today.

            Plus, there are already several instructions on x86 for string manipulation.

            • loeg 132 days ago

              > Plus, there are already several instructions on x86 for string manipulation.

              Also SHA1, AES, CRC32, and other specialized functions that may have less staying power than UTF-8 (SHA1 in particular, but also AES: being a block cipher, it has some nuances that mean it is not always used in new algorithms).

              • GolDDranks 132 days ago

                What irks me is that the CRC32 instruction in the x86 architecture uses a different polynomial than many of the CRC32 implementations in the wild, which makes the instruction unusable for them...
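
                For illustration (a quick sketch, assuming SSE4.2 and zlib are available; not from the article): the x86 instruction computes CRC-32C (the Castagnoli polynomial), while e.g. zlib's crc32() uses the IEEE polynomial, so the two disagree on the standard "123456789" check string:

                    #include <nmmintrin.h>  /* SSE4.2: _mm_crc32_u8 (compile with -msse4.2) */
                    #include <stdint.h>
                    #include <stdio.h>
                    #include <zlib.h>       /* crc32(): IEEE polynomial */

                    int main(void) {
                        const unsigned char data[] = "123456789";
                        uint32_t c = 0xFFFFFFFFu;              /* CRC-32C initial value */
                        for (int i = 0; i < 9; i++)
                            c = _mm_crc32_u8(c, data[i]);      /* hardware Castagnoli CRC */
                        c ^= 0xFFFFFFFFu;                      /* final inversion */
                        printf("CRC-32C: %08x\n", c);                   /* e3069283 */
                        printf("CRC-32:  %08lx\n", crc32(0L, data, 9)); /* cbf43926 */
                        return 0;
                    }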

            • cyphar 132 days ago

              The fact that we have AES-NI (despite the fact that AES constructions are being replaced with EC constructions in quite a few protocols, due to the difficulty of using AES in a safe way) should make it clear that CPU designers don't really care about longevity; all that matters is current performance characteristics.

              Not to mention that I think it's fair to say that UTF-8 will outlast AES (it's actually older than AES). After all, AES was only standardised in 2001 -- and AES-NI was added when it was 7 years old. Unicode was standardised in 1991, with UTF-8 following in 1993. So if we have AES instructions we should already have UTF-8 instructions.

              • rauhl 131 days ago

                > The fact that we have AES-NI (despite the fact that AES constructions are being replaced with EC constructions in quite a few protocols, due to the difficulty of using AES in a safe way)

                Say what? Unless you mean something other than ‘elliptic curve’ by ‘EC,’ that doesn’t make sense to me: AES & EC are completely different kinds of thing, the former being symmetric & the latter being asymmetric. Indeed, it’s quite common to use an EC-derived secret as the key for AES-encrypted material.

              • AceJohnny2 132 days ago

                > and people really were doing custom instructions for interpreting JVM bytecode.

                Ah yes, ARM Jazelle, and the good ol' ARM926EJ-S...

                https://en.wikipedia.org/wiki/Jazelle

                Did the concept die with ARMv8 (64bit)?

                • monocasa 132 days ago

                  It had been essentially dead for a while even before AArch64. That's slightly obscured by the fact that there's a version of Jazelle that's basically a no-op mode, trapping on each bytecode.

                  • sigjuice 132 days ago

                    I don't see Java in the list of features on my Raspberry Pi 3:

                      $ cat /proc/cpuinfo
                      processor       : 0
                      model name      : ARMv7 Processor rev 4 (v7l)
                      BogoMIPS        : 38.40
                      Features        : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae evtstrm crc32 
                      CPU implementer : 0x41
                      CPU architecture: 7
                      CPU variant     : 0x0
                      CPU part        : 0xd03
                      CPU revision    : 4
                    • gsnedders 132 days ago

                      Yes, it's gone in ARMv8.

                    • basementcat 132 days ago

                      Thanks to CPUID and analogous hardware capability registers on other platforms, outdated instruction sets can be deprecated and removed when there is no longer market demand for them.

                    • fao_ 132 days ago

                      > we'd have a large number of instructions that are now useless.

                      We already have a large number of instructions that are useless. Look at the difference between CISC and RISC.

                      • 132 days ago
                        [deleted]
                    • zbjornson 132 days ago

                      I'm confused how the ASCII check works. It's checking if any byte is greater than zero. Wouldn't you want to check if the 8th bit in any byte is set?

                      • aeruder 132 days ago

                        The trick here is that each byte is treated as a signed 8-bit number. When the top bit is set, the number is negative.
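
                        A one-line scalar version of the same test (just a sketch; the name is made up):

                            #include <stdbool.h>
                            #include <stdint.h>

                            /* A byte is ASCII iff its top bit is clear, i.e. iff it is
                               non-negative when reinterpreted as a signed 8-bit integer. */
                            static bool is_ascii_byte(uint8_t b) {
                                return (int8_t)b >= 0;   /* same test as b < 0x80 */
                            }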

                        • zbjornson 132 days ago

                          Oh duh, thanks. (Checking less than zero, read it wrong.)

                          I think it would be faster to OR the entire string with itself, then finally check the 8th bit though. On Skylake that would cut it to 0.33 cycles per 16 bytes (HSW 1 per 16).

                          https://github.com/lemire/fastvalidate-utf-8/pull/2
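
                          Roughly this, I mean (a sketch of the deferred check, assuming SSE2; the function name is mine):

                              #include <emmintrin.h>  /* SSE2 */
                              #include <stdbool.h>
                              #include <stddef.h>
                              #include <stdint.h>

                              /* OR every 16-byte block into one accumulator, then test
                                 the top bit of each byte lane once at the end. */
                              static bool all_ascii(const uint8_t *s, size_t n) {
                                  __m128i acc = _mm_setzero_si128();
                                  size_t i = 0;
                                  for (; i + 16 <= n; i += 16)
                                      acc = _mm_or_si128(acc,
                                          _mm_loadu_si128((const __m128i *)(s + i)));
                                  uint8_t tail = 0;
                                  for (; i < n; i++)
                                      tail |= s[i];    /* leftover bytes */
                                  /* movemask gathers the top bit of all 16 lanes */
                                  return _mm_movemask_epi8(acc) == 0 && (tail & 0x80) == 0;
                              }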

                          • Sharlin 132 days ago

                            Depends on your input. If non-ASCII strings are frequent and likely to contain a non-ASCII character fairly close to the start of the string, then it makes sense to short circuit.

                          • vardump 132 days ago

                            > I think it would be faster to OR the entire string with itself, then finally check the 8th bit though.

                            The string could have NUL (zero) bytes in between.

                            • zbjornson 132 days ago

                              You're right that it changes the behavior relative to the current implementation, but 0x00 is a valid ASCII character.

                              • vardump 132 days ago

                                While you're technically right that NUL is part of the ASCII set, in practice it's rarely wanted in the data.

                            • 132 days ago
                              [deleted]
                        • MichaelGG 132 days ago

                          Fast checking is really useful in things like HTTP/SIP parsing. Rust should expose such a function as well, seeing as its strings must be UTF-8 validated. It's even faster if you can avoid UTF-8 strings entirely and work only on a few known ASCII bytes, though that means you might push garbage further down the line.

                          • masklinn 132 days ago

                            > Rust should expose such a function as well seeing as their strings must be UTF-8 validated.

                            That's more or less what std::str::from_utf8 is: it runs UTF8 validation on the input slice, and just casts it to an &str if it's valid: https://doc.rust-lang.org/src/core/str/mod.rs.html#332-335

                            from_utf8_unchecked is nothing more than an unsafe (C-style) cast: https://doc.rust-lang.org/src/core/str/mod.rs.html#437 and so should be a no-op at runtime.

                            • MichaelGG 132 days ago

                              I meant Rust should have a SIMD-optimised version that assumes mostly ASCII. I'm guessing there is a trade-off involved depending on the content of the string.

                              • scottlamb 132 days ago

                                The linked implementation assumes mostly ASCII. It doesn't use SIMD. SIMD in Rust is a work in progress - the next version (1.27) will stabilize x86-64-specific SIMD intrinsics. There's an rfc (https://github.com/rust-lang/rfcs/pull/2366) for portable SIMD types.

                                • masklinn 132 days ago

                                  SIMD support in Rust was only recently accepted and is still being implemented, so the current code relies on the vectorisation abilities of the compiler (it might get revisited soon-ish, I guess).

                                  As for the assumption of mostly-ascii, the validation function has a "striding" fast path for ascii which checks 2 words at a time (so 128 bits per iteration on 64b platforms) until it finds a non-ascii byte: https://doc.rust-lang.org/src/core/str/mod.rs.html#1541
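
                                  In C terms that fast path is roughly this (my own rendering of the idea, not the actual Rust code; the name is made up):

                                      #include <stdbool.h>
                                      #include <stddef.h>
                                      #include <stdint.h>
                                      #include <string.h>

                                      /* Check two 64-bit words per iteration; any byte with
                                         its top bit set makes the masked OR nonzero. */
                                      static bool ascii_stride(const uint8_t *s, size_t n) {
                                          const uint64_t mask = 0x8080808080808080ULL;
                                          size_t i = 0;
                                          for (; i + 16 <= n; i += 16) {
                                              uint64_t a, b;
                                              memcpy(&a, s + i, 8);
                                              memcpy(&b, s + i + 8, 8);
                                              if ((a | b) & mask)
                                                  return false;   /* non-ASCII byte found */
                                          }
                                          for (; i < n; i++)
                                              if (s[i] & 0x80)
                                                  return false;
                                          return true;
                                      }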

                              • chrismorgan 132 days ago

                                std::str::from_utf8 is that API.

                                Rust’s current implementation of full validation: https://github.com/rust-lang/rust/blob/2a3f5367a23a769a068c3...

                                I have a vague feeling there’s an even faster path for probably-ASCII out there, but I can’t immediately recall where and am going to bed. Doubtless someone else will answer before I get up.

                                The core team will be amenable to replacing this algorithm with something faster presuming it’s still correct.

                                • kzrdude 132 days ago

                                  Given the new simd features, it's probably time to revisit that now

                                  • MichaelGG 132 days ago

                                    There is probably a trade-off depending on the content of the string, right? So the API probably needs a general-purpose and a "this should be all ASCII" version?

                                    • simias 132 days ago

                                      I don't know if it really makes a lot of sense to have such a specialized version of a method in the standard library. Effectively, from_str and from_str_mostlyascii would be functionally identical except for a small performance difference depending on the encoding of the string data, which wouldn't matter in 99% of programs.

                                      Having that as a 3rd party crate would make complete sense however.

                                      • Retra 132 days ago

                                        If there's a canonical, obvious, and low maintenance way to do something that there's a common need for, then there's no reason it shouldn't be in std. If it needs to be hacked at and redesigned every few years, then yeah, leave it out.

                                        • simias 131 days ago

                                          That's not really the Rust way; for instance, things like timekeeping and random number generation are delegated to external crates. Given how trivial it is to add dependencies thanks to cargo, it's not really an issue in practice, even if I do think they go a bit overboard at times (time really ought to be in std IMO, especially since there's already a rather useless std::time module).

                              • nabla9 132 days ago

                                There is validating and "validating"; what is this code doing?

                                For example, valid UTF-8 must always use the shortest possible sequence; otherwise it's invalid. A validator must guard against decoding such invalid sequences.

                                Example of an invalid (overlong) sequence:

                                    0xF0 0x80 0x80 0x8A
                                • kwillets 132 days ago

                                  There are actually a few different parts to the code, and they work somewhat independently.

                                  The first is to classify each byte by type, using the highest 4-5 bits. The types are:

                                  * ASCII

                                  * continuation

                                  * initial byte of a 2-, 3-, or 4-byte sequence
                                  
                                  Given these types, the multibyte sequences are checked for proper lengths, ie each length indicator has to be followed by that exact number of continuation bytes until the next non-continuation. We do this by putting the following byte count where each initial is found, zeroing the rest, and carrying counts right with shift-and-subtract, saturated to 0. This just creates a descending sequence after each initial, eg 4 -> 3,2,1, which should reach zero right at the next initial or "first" byte (ascii or multi). The vector of nonzero following bytes xored with the vector of first bytes should then be all 1's, ie there should be no gaps or overlaps.
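
                                  A scalar version of that length bookkeeping would look something like this (my illustration, not the actual code; the real thing does it with SIMD shifts and saturating subtracts as described):

                                      #include <stdbool.h>
                                      #include <stddef.h>
                                      #include <stdint.h>

                                      static bool check_lengths(const uint8_t *s, size_t n) {
                                          size_t i = 0;
                                          while (i < n) {
                                              uint8_t b = s[i];
                                              size_t len;
                                              if (b < 0x80)              len = 1;  /* ASCII */
                                              else if ((b >> 5) == 0x06) len = 2;  /* 110xxxxx */
                                              else if ((b >> 4) == 0x0E) len = 3;  /* 1110xxxx */
                                              else if ((b >> 3) == 0x1E) len = 4;  /* 11110xxx */
                                              else return false;  /* continuation with no initial */
                                              if (i + len > n)
                                                  return false;   /* sequence runs off the end */
                                              for (size_t j = 1; j < len; j++)
                                                  if ((s[i + j] >> 6) != 0x02)  /* need 10xxxxxx */
                                                      return false;
                                              i += len;
                                          }
                                          return true;
                                      }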

                                  The next task is to check for overlongs. These are just bitfields in the first or second bytes which should be nonzero, and they can be checked at least two ways. One is masking and checking for any nonzero bits, based on a lookup table of masks. Another is comparison, based on the idea that any 1 bits in the bitfield of interest will make the byte larger than all 0's (the higher-order bits are fixed). Both of these methods assign checks on a per-length basis, using a lookup table from the highest 4 bits. Longer sequences also check the second byte in similar fashion.
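
                                  As scalar comparisons the overlong checks boil down to this (again my sketch; the SIMD code drives the same tests from a lookup table on the high nibble):

                                      #include <stdbool.h>
                                      #include <stdint.h>

                                      /* 'first' is the initial byte of a multibyte sequence,
                                         'second' the continuation byte right after it. */
                                      static bool not_overlong(uint8_t first, uint8_t second) {
                                          if ((first >> 5) == 0x06) return first >= 0xC2;   /* C0/C1 always overlong */
                                          if (first == 0xE0)        return second >= 0xA0;  /* 3-byte minimum U+0800 */
                                          if (first == 0xF0)        return second >= 0x90;  /* 4-byte minimum U+10000 */
                                          return true;  /* other initials can't be overlong */
                                      }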

                                  In the end we basically have bitmaps of each type, and we check that they don't intersect and that their union covers the range. The code is a bit complicated by carrying over the previous range so that sequences that span registers will still get checked.

                                  • rspeer 132 days ago

                                    This is built on UTF-8 described as a state machine. The state machine already rejects overlong sequences.

                                    There is no sense of "validator" in which an expression that accepts overlong sequences could be called a UTF-8 validator.

                                    • Someone 132 days ago

                                      ”For example, valid utf-8 must always use the shortest possible sequence or it's invalid.”

                                      nabla9 likely knows this, but for those wondering: utf8 can’t contain overlong encodings (https://en.m.wikipedia.org/wiki/UTF-8#Overlong_encodings), but need not use precomposed characters (https://en.m.wikipedia.org/wiki/Precomposed_character) whenever possible (another way in which a shorter sequence is possible)

                                      • jwilk 132 days ago

                                        The test suite includes these as "bad" sequences:

                                        * \xc0\x9f (overlong U+001F)

                                        * \xed\xa0\x81 (surrogate)

                                        • SloopJon 132 days ago

                                          I was too lazy to look up what these sequences represented, but I did a quick test with the Euro symbol example from Wikipedia, and it did indeed reject the overlong sequence:

                                              #include <stdbool.h>
                                              #include <string.h>
                                              #include <stdio.h>
                                              #include "simdutf8check.h"
                                              
                                              int
                                              main(int argc, char *argv[])
                                              {
                                                  const char euro[] = "\xe2\x82\xac";
                                                  const char eurolong[] = "\xf0\x82\x82\xac";
                                              
                                                  bool valid = validate_utf8_fast(euro, sizeof euro);
                                                  printf("validate_utf8_fast(euro): %d\n", valid);
                                              
                                                  valid = validate_utf8_fast(eurolong, sizeof eurolong);
                                                  printf("validate_utf8_fast(eurolong): %d\n", valid);
                                              
                                                  return 0;
                                              }
                                      • kizer 132 days ago

                                        Should strings be represented as codepoint arrays in memory? In C, for example, should I always decode input and work with that, or convert to utf8?

                                        • masklinn 132 days ago

                                          > Should strings be represented as codepoint arrays in memory?

                                          No. I think the implementor of Factor's Unicode support originally did that, but it turned out not to be useful:

                                          * it blows up memory usage for ASCII and BMP (4 bytes per codepoint versus 1~3)

                                          * this also has impact on CPU caches, lowering the average amount of data you can fit in your cache and work on

                                          * it requires a complete conversion of incoming ascii and utf8 data (which only get more and more common as time goes on) rather than just a validation

                                          * and because Unicode itself is variable-length (combining codepoints) it's not actually helpful when you're trying to properly manipulate unicode data

                                          The only "advantage" of representing strings as codepoint arrays is that you get O(1) access to codepoints, which is a terrible idea you should not encourage.

                                          UTF-32 internal encoding makes some manipulations very slightly easier, but not enough to matter in the long run, and it encourages bad habits. If you don't need the O(1) access thing for backwards compatibility reasons, don't bother.

                                          • kizer 132 days ago

                                            I did not know that Unicode itself was variable-length. That alone nullifies the motivation to use codepoint arrays. Why would they do that??

                                            • masklinn 131 days ago

                                              To limit combinatoric explosion. Without combining codepoints, you would need a version of each base symbol for each combination of modifiers. And while Latin scripts usually limit themselves to one diacritic per base[0], other scripts (Hebrew or Arabic come to mind) can pile up multiple diacritics.

                                              [0] Not that that's always the case; the Navajo alphabet can have both an ogonek and an acute on the same Latin base character.

                                              • kizer 131 days ago

                                                I see... speaks to my ignorance on language.

                                            • bumblebritches5 132 days ago

                                              Have you ever written your own Unicode library, or are you just parroting what you've heard?

                                              I have written my own Unicode library; normalizing strings as UTF-32 is MUCH MUCH MUCH easier than trying to do the same in UTF-8 or UTF-16.

                                              Not to mention casefolding. It also allows you to design better interfaces, so that the actual algorithm only needs to be implemented once.

                                              • greglindahl 132 days ago

                                                Sounds like your library is the poster-child for:

                                                > it blows up memory usage for ASCII and BMP (4 bytes per codepoint versus 1~3)

                                                • bumblebritches5 132 days ago

                                                  For the microseconds it's in the decoded state, sure, but I'm not dealing with massive strings.

                                                  Literally all the strings I'm dealing with are under 256 codepoints.

                                            • Ndymium 132 days ago

                                              That would require an array of 32-bit integers (basically UTF-32). It wastes a lot of space for most characters and doesn't help that much with indexing, since user-perceived characters can still be composed of many codepoints. Different languages take different approaches.

                                              Python 3.3+, I believe, looks at the string and chooses ASCII/UCS-2/UCS-4 based on the widest character it contains [1]. Elixir uses UTF-8 for strings (but you can get the codepoints easily), whereas Erlang uses the list-of-Unicode-integers style.

                                              [1] https://www.b-list.org/weblog/2017/sep/05/how-python-does-un...

                                              • kizer 132 days ago

                                                Thanks for that link. It is very informative.

                                              • steveklabnik 132 days ago

                                                It's always tradeoffs.

                                                 If by "arrays of codepoints" you mean "each element of the array is a single codepoint", then you get O(1) indexing by codepoint, but may use up to 4x the memory of the variable-length encoding.

                                                • bumblebritches5 132 days ago

                                                  Which is fine when you're doing nothing but a simple operation on a string, then reencoding it.

                                                  Seriously, how many strings have you ever run across that are more than 1 KB?

                                                  How many strings have you run across that are more than 256 codepoints?

                                                  • nirvdrum 132 days ago

                                                    People routinely read text files larger than 1 KB into memory. HTTP responses are often larger than 1 KB. Serialized data (JSON, XML, etc.) is often larger than 1 KB.

                                                    • bumblebritches5 132 days ago

                                                      HTTP headers are not anywhere near 1 KB of text.

                                                      The packets themselves are only 1.5 KB.

                                                      • nirvdrum 132 days ago

                                                        The headers are one component of a response. The body is often larger than 1 KB.

                                                        • bumblebritches5 127 days ago

                                                          and your whole packet is text, without any compression?

                                                          I doubt that.

                                                    • mromanuk 132 days ago

                                                      I’m currently working with a 2 MB string representing 150k words, with a structure that is a packed trie stored as an array, for fast access to Unicode scalars in Swift.

                                                      • bumblebritches5 132 days ago

                                                        Why would you use a string to store that in the first place? If you're editing text, make a structure of words, sentences, and paragraphs, with smaller, more manageable strings inside.

                                                        There's no reason for a single string to contain that much data.

                                                      • steveklabnik 132 days ago

                                                        Sure. It's fine sometimes, and not fine other times. That's why it's called a tradeoff.

                                                        I see strings larger than that all the time, and care about using low RSS. YMMV.

                                                    • bumblebritches5 132 days ago

                                                      Yes, it gives you flexibility in using UTF-8 and UTF-16 (which, despite being a shitty format, you WILL have to support), and overlong sequences also stop mattering.

                                                      • slrz 132 days ago

                                                        > which despite being a shitty format, you WILL have to support UTF-16

                                                        Luckily not. If you absolutely must deal with legacy encodings, feed them through a conversion pass beforehand.

                                                        • bumblebritches5 131 days ago

                                                          Can you not read?

                                                          UTF-16 IS the "legacy" format...

                                                      • loeg 132 days ago

                                                        Generally no. You will know when you need it.

                                                      • ozzmotik 132 days ago

                                                        How exactly do you use a fraction of a cycle? I would assume that it's fractional on average, but then again I can't exactly speak from expertise on how instructions map to cycles and whether it's solely an integer mapping.

                                                        • dragontamer 132 days ago

                                                          > I can't exactly speak for my expertise on how instructions map to cycles and if it's solely an integer mapping

                                                          1. Standard ALUs on modern processors are 64 bits wide. So right there, you're 8x faster on a per-byte basis.

                                                          2. He's using vectorized operations, so he can work with 128 bits, 256 bits (or potentially even 512 bits on high-end processors like Skylake-X). So 16, 32, or 64 bytes at a time per operation.

                                                          3. Modern processors are super-scalar with multiple ALUs and multiple execution ports. I forget what the precise theoretical limit is, but Intel AND AMD machines can execute something like 4 or 5 operations per clock, depending on circumstances.

                                                          That assumes that all operations have been decoded into micro-ops (uops), that they fit inside the uop cache (think of a uop cache as an L0 cache: beyond even the L1 cache), that they line up perfectly with the available execution ports, that your data has no dependencies, and a whole host of other conditions. But it's theoretically possible.

                                                          ---------

                                                          In practice, your code will be limited by memory (even L1 cache is slower than the CPU), by decoding speed (not everything fits in the uop cache, and only loops really benefit from it), and by dependencies (a = x + 1; a = a + 2: the second instruction depends on the first executing first, so the two can't be done in parallel / superscalar).

                                                          The CPU is pretty good at trying to figure out how to optimally reorder and out-of-order execute your code. But that means that predicting the performance of the CPU is incredibly difficult: you have to keep in mind all of the parts of the CPU while thinking of the assembly code.
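
                                                          A toy example of the dependency point (my own sketch): the first loop is one long dependency chain, while the second gives the core two independent chains it can overlap:

                                                              #include <stddef.h>
                                                              #include <stdint.h>

                                                              /* Each add waits on the previous one. */
                                                              uint64_t sum_chained(const uint64_t *a, size_t n) {
                                                                  uint64_t s = 0;
                                                                  for (size_t i = 0; i < n; i++)
                                                                      s += a[i];
                                                                  return s;
                                                              }

                                                              /* Two independent accumulators can run in parallel. */
                                                              uint64_t sum_split(const uint64_t *a, size_t n) {
                                                                  uint64_t s0 = 0, s1 = 0;
                                                                  size_t i = 0;
                                                                  for (; i + 2 <= n; i += 2) {
                                                                      s0 += a[i];
                                                                      s1 += a[i + 1];
                                                                  }
                                                                  if (i < n)
                                                                      s0 += a[i];
                                                                  return s0 + s1;
                                                              }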

                                                          • mcguire 132 days ago

                                                            You handle multiple bytes at once; it's fractional on average.

                                                            He's using vectorized operations to handle 16 bytes at a time, basically:

                                                                __m128i has_error = _mm_setzero_si128();
                                                                const __m128i zero = _mm_setzero_si128();
                                                                for (size_t i = 0; i + 16 <= len; i += 16) {
                                                                    /* bytes with the top bit set compare as < 0 */
                                                                    __m128i current_bytes =
                                                                      _mm_loadu_si128((const __m128i *)(src + i));
                                                                    has_error =
                                                                      _mm_or_si128(has_error, _mm_cmpgt_epi8(zero, current_bytes));
                                                                }
                                                                return _mm_testz_si128(has_error, has_error);
                                                            
                                                            Each of the _mm* "functions" becomes 1 (?) vector instruction.
                                                            • kizer 132 days ago

                                                              I believe he means on average, as you said.