> One thing I'd like to check in the new code is whether it's as picky about things like overlong sequences as Bjoern's code is.
I believe that's what checkContinuation() is doing, based on its use of the "counts" parameters. I don't understand how it works, but I don't see any other reason for count_nibbles() to compute the '->count' member.
With such a high percentage of text on the web in UTF-8, and many uses for variable-length integers, I hope that we'd see single instructions for reading and writing Unicode codepoints to/from UTF-8, as well as reading/writing 128-bit variable length integers (preferably prefix-encoded rather than LEB128).
A while back, I read that the Chinese Longson processors were a (subset?) of the MIPS instruction set with added instructions for Unicode handling, but that's all I've heard of processors with Unicode accelerating instructions, and I'm not sure which encoding(s) was/were accelerated.
> With such a high percentage of text on the web in UTF-8, and many uses for variable-length integers, I hope that we'd see single instructions for reading and writing Unicode codepoints to/from UTF-8
If we took this attitude with every new technology, we'd have a large number of instructions that are now useless. At one time it probably seemed a good idea to have custom instructions for parsing XML, and people really were doing custom instructions for interpreting JVM bytecode.
> Plus, there are already several instructions on x86 that for string manipulation.
Also SHA1, AES, CRC32, and other specialized functions that may have less staying power than UTF-8 (SHA1 in particular, but also AES being a block cipher has some nuisances that means it is not always used in new algorithms).
The fact that we have AES-NI (despite the fact that AES constructions are being replaced with EC constructions in quite a few protocols, due to the difficulty of using AES in a safe way) should be clear that CPU designers don't really care about longevity, all that matters is current performance characteristics.
Not to mention that I think it's fair to say that UTF-8 will outlast AES (it's actually older than AES). After all, AES was only standardised in 2001 -- and AES-NI was added when it was 7 years old. Unicode (and UTF-8) were standardised in 1991. So if we have AES instructions we should already have UTF-8 instructions.
> The fact that we have AES-NI (despite the fact that AES constructions are being replaced with EC constructions in quite a few protocols, due to the difficulty of using AES in a safe way)
Say what? Unless you mean something other than ‘elliptic curve’ by ‘EC,’ that doesn’t make sense to me: AES & EC are completely different kinds of thing, the former being symmetric & the latter being asymmetric. Indeed, it’s quite common to use an EC-derived secret as the key for AES-encrypted material.
Fast checking is really useful in things like HTTP/SIP parsing. Rust should expose such a function as well seeing as their strings must be UTF-8 validated. Though it's even faster if you can just avoid utf8 strings and work only on a few known ASCII bytes, it means you might push garbage further down the line.
The linked implementation assumes mostly ASCII. It doesn't use SIMD. SIMD in Rust is a work in progress - the next version (1.27) will stabilize x86-64-specific SIMD intrinsics. There's an rfc (https://github.com/rust-lang/rfcs/pull/2366) for portable SIMD types.
I don't know if it really makes a lot of sense to have such a specialized version of a method in the standard library. Effectively from_str and from_str_mostlyascii would be functionally identical except for a small performance difference depending on the encoding of the string data which wouldn't matter in 99% of programs.
Having that as a 3rd party crate would make complete sense however.
If there's a canonical, obvious, and low maintenance way to do something that there's a common need for, then there's no reason it shouldn't be in std. If it needs to be hacked at and redesigned every few years, then yeah, leave it out.
That's not really the Rust way, for instance things like timekeeping and random number generation are delegated to external crates. Given how trivial it is to add dependencies thanks to cargo it's not really an issue in practice, even if I do think that they go a bit overboard at times (time really ought to be in std IMO, especially since there's already a rather useless std::time module).
There are actually a few different parts to the code, and they work somewhat independently.
The first is to classify each byte by type, using the highest 4-5 bits. The types are:
initial byte of:
Given these types, the multibyte sequences are checked for proper lengths, ie each length indicator has to be followed by that exact number of continuation bytes until the next non-continuation. We do this by putting the following byte count where each initial is found, zeroing the rest, and carrying counts right with shift-and-subtract, saturated to 0. This just creates a descending sequence after each initial, eg 4 -> 3,2,1, which should reach zero right at the next initial or "first" byte (ascii or multi). The vector of nonzero following bytes xored with the vector of first bytes should then be all 1's, ie there should be no gaps or overlaps.
The next task is to check for overlongs. These are just bitfields in the first or second bytes which should be nonzero, and they can be checked at least two ways. One is masking and checking for any nonzero bits, based on a lookup table of masks. Another is comparison, based on the idea that any 1 bits in the bitfield of interest will make the byte larger than all 0's (the higher-order bits are fixed). Both of these methods assign checks on a per-length basis, using a lookup table from the highest 4 bits. Longer sequences also check the second byte in similar fashion.
In the end we basically have bitmaps of each type, and we check that they don't intersect and that their union covers the range. The code is a bit complicated by carrying over the previous range so that sequences that span registers will still get checked.
> Should strings be represented as codepoint arrays in memory?
No. I think the implementor of Factor's unicode support originally did that but it turns out to not be useful:
* it blows up memory usage for ASCII and BMP (4 bytes per codepoint versus 1~3)
* this also has impact on CPU caches, lowering the average amount of data you can fit in your cache and work on
* it requires a complete conversion of incoming ascii and utf8 data (which only get more and more common as time goes on) rather than just a validation
* and because Unicode itself is variable-length (combining codepoints) it's not actually helpful when you're trying to properly manipulate unicode data
The only "advantage" of representing strings as codepoint arrays is that you get O(1) access to codepoints which is a terrible idea you should not encourage.
UTF-32 internal encoding makes some manipulations very slightly easier, but not enough to matter in the long run, and it encourages bad habits. If you don't need the O(1) access thing for backwards compatibility reasons, don't bother.
To limit combinatoric explosion. Without combining codepoints, you need a version of each base symbol for each combination of modifier. And while latin scripts usually limit themselves to one diacritic per base, other scripts (hebrew or arabic come to mind) can pile up multiple diacritics.
 not that that's always the case, the navajo alphabet can have both ogonek and acute on the same latin base character
That would require an array of 32 bit integers (basically UTF-32). It wastes a lot of space for most characters and doesn't help that much with indexing since user perceived characters can still be composed of many codepoints. Different languages take different approaches.
Python 3.3+, I believe, looks at the string and chooses ASCII/UCS-2/UCS-4 based on the character that takes the highest amount of bytes . Elixir uses UTF-8 for strings (but you can get the codepoints easily), whereas Erlang uses the list of unicode ints style.
If by "arrays of codepoints" you mean "each element of the array is a single codepoint", then you get O(1) indexing regarding codepoints, but may use up to 4x the memory as the variable-length encoding.
how exactly do you use a fraction of a cycle? i would assume that it would be fractional on average, but then again I can't exactly speak for my expertise on how instructions map to cycles and if it's solely an integer mapping
> I can't exactly speak for my expertise on how instructions map to cycles and if it's solely an integer mapping
1. Standard ALUs on modern processors are 64-bits at a time. So right there, you're 8x faster on a per-byte basis.
2. He's using vectorized operations, so he can work with 128-bits, 256-bits (or potentially even 512-bits on high-end processors like Skylake-X). So 16x, 32x, or 64x at a time per operation.
3. Modern processors are super-scalar with multiple ALUs and multiple execution ports. I forget what the precise theoretical limit is, but Intel AND AMD machines can execute something like 4 or 5 operations per clock, depending on circumstances.
That assumes that all operations have been decoded into micro-ops (uops), they fit inside the uop cache (think of a uop cache as a L0 cache: beyond even the L1 cache), that they perfectly line up to available execution ports, that your data has no dependencies, and a whole host of other conditions. But its theoretically possible.
In practice: your code will be limited by memory (even L1 cache is slower than the CPU), by decoding speed (not everything fits in the uOp cache, and only loops really benefit from the uop cache), dependencies (a = x + 1. a = a+2. The 2nd instruction depends on the 1st one to execute first, so the two instructions can't be done in parallel / superscalar).
The CPU is pretty good at trying to figure out how to optimally reorder and out-of-order execute your code. But that means that predicting the performance of the CPU is incredibly difficult: you have to keep in mind all of the parts of the CPU while thinking of the assembly code.