C2x is expected to gain the char8_t type and almost certainly will not gain any new string handling routines. In the year 2030 I am expecting to still see more rounds of posts comparing string libraries for C. With more than 4 decades of development, we still don't have a great solution to string handling in C.
Every library like this is incompatible with every other, and most make slightly odd choices; in this library, for instance, ownership of the string is denoted by a bit in the info/size field. Not that that is a bad choice, but it is one reason someone might decline to use it and decide to write their own.
The lack of namespacing in C doesn't help. This library chooses str_ as its prefix, which is likely to collide with other libraries. It also makes it harder to write libraries that allow the string library to be swapped out.
Maybe we don't have a single solution to string handling in C because there's no such thing as a "string". A text editor would need a very different "string" than a typesetting system or an indexing engine.
Note that ISO C reserves /^str[a-z]/ only for <string.h> and <stdlib.h>. POSIX's reservation of /^str_/ (identifiers like str8line are still allowed) is for STREAMS (<stropts.h>) [1], a completely different thing from strings...
IMHO it's better for the C standard to ignore "universal string processing" completely instead of providing a half-assed and over-complicated solution like most other languages do. String processing isn't exactly a core competency of C, and never will be; it's better to use a different language for such things.
I think it is good if everybody implements their own string library.
It builds character and you learn something from it.
It is a rite of passage.
However, don't add to the pile of dependency hell that is already plaguing many open source projects. If you feel uneasy with how C strings work, consider switching programming languages instead. You will probably have an easier time, and there will be fewer unmaintained, incompatible string libraries rotting away on GitHub.
> However, don't add to the pile of dependency hell that is already plaguing many open source projects
If you create small libraries that don't produce shared objects intended to stand on their own with a stable API/ABI, but are simply headers or at most produce a .a from source fully intended to become vendored in-tree, you're not contributing to "dependency hell".
It's hardly a "dependency hell" when it only affects developers, in what are essentially unique collision type situations, and are generally addressable by the developer because the source is all present. And upstream maintainers of such intended-to-be-vendored code should generally be receptive to improving compatibility and build system configurability for such situations. And if they're not receptive/it's abandonware, congratulations your vendored library is now a fork and fix it yourself.
When we refer to "dependency hell", AIUI, it's in reference to unresolvable runtime dependencies creating hell for end-users.
> I think it is good if everybody implements their own string library.
...until someone exploits the bugs in it.
Everyone who did the exercises in K&R should be able to write their own string library, probably with fewer bugs than the standard one. However, I really feel it's much better for everyone to use proven code like bstring.
Whatever implements string.h on your system, plus the other functions dealing with string input. Some of these functions simply shouldn't be used at all. An extreme case is gets(), which was removed from the standard, but many others are no better.
You're kidding, right? We're not talking about the implementation, but the design. If I ever wanted to write a gets() replacement, it would definitely have proper checks in place to prevent buffer overflow. Everyone using strcpy() is playing with fire. You'll get it right 9 times and make a mistake the 10th time. It's not that the people who implemented these functions are stupid, but they were designed in different times for other types of environments.
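A gets()-style reader with proper checks, as described above, can be a thin wrapper over fgets(). A minimal sketch (the name read_line is made up for illustration, not from any library):

```c
#include <stdio.h>
#include <string.h>

/* Bounded line reader: unlike gets(), it takes the buffer size,
   never overruns, and strips the trailing newline if present. */
static char *read_line(char *buf, size_t size, FILE *fp)
{
    if (fgets(buf, (int)size, fp) == NULL)
        return NULL;                  /* EOF or read error */
    buf[strcspn(buf, "\n")] = '\0';   /* cut at the newline, if any */
    return buf;
}
```

If an input line is longer than the buffer, the remainder simply stays in the stream for the next call, which is at least well defined, unlike a strcpy-style overflow.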
Every time I think to use C for something, I re-realize how terrible it is to do anything involving strings. Although this library looks nice, I'll still have to compose and manage strings myself, which is a major headache.
When I review C code, I look for strncpy, etc., and give them special attention. There's always a bug or two in it.
0-terminated strings not only have proven to be a rich source of bugs, they're remarkably inefficient as well [1]. Doing better was a major focus of the initial design of D.
[1] This is because of constantly scanning to get the length (which also necessitates reloading the string contents into the memory cache), and having to make copies of strings instead of just slicing them.
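For illustration, the length/pointer representation can be sketched in C; the names (strview, sv_*) are invented here, not taken from D or from the library under discussion:

```c
#include <stddef.h>
#include <string.h>

/* Fat-pointer string view: the length is stored, never scanned. */
typedef struct {
    const char *ptr;
    size_t      len;
} strview;

static strview sv_from_cstr(const char *s)
{
    strview v = { s, strlen(s) };   /* one scan at the boundary, O(1) after */
    return v;
}

/* Slicing shares the original bytes: no copy, no terminator needed. */
static strview sv_slice(strview v, size_t start, size_t end)
{
    strview out = { v.ptr + start, end - start };
    return out;
}
```

Getting the length is a field read instead of a memory scan, and a slice is just a pointer adjustment, which is exactly the efficiency point being made.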
If you are using strtok and wanted to slice at an arbitrary point, you would either have to lose some data in the original string, move or copy the original string to make space for the extra delimiter/null character, or have set up the string ahead of time to contain the delimiter in the desired position.
There is also the mangle-use-repair choice. I've done that with pathnames for creating nested directories.
C programmers are expected to make the best choice based on the situation. The various choices trade off memory usage, CPU usage, source code readability, and program correctness.
It's not problematic. C programmers are expected to avoid screwing that up. C is a full-power language.
If available, strdupa() would be a fine way to get a suitable local copy of the string. Commonly though, the programmer knows that there will not be threads and can make the string non-constant.
I encountered Hollerith constants in an ancient Fortran codebase I worked on and was thrilled to see folks were doing clever stuff with strings in the 60s.
I wonder how much time was wasted in early computing (maybe not wasted really) because of the fear of incompatibility that is getting smaller and smaller as computing platforms coalesce into standardized-ish things.
Watching the M1 roll out and how it doesn't seem to care much that x86 is a thing and gets along with its life has been fascinating.
D uses "phat pointers" for strings, aka a length/pointer pair. Over the years, this has proven to be simple, efficient, and resistant to errors. It means array bounds checking can be automatically done. It enables efficient slicing.
String literals also have an extra 0 appended, making it transparently easy to still pass strings to C functions like printf.
C++ itself doesn't support it. There are libraries that provide unicode-aware handling of strings/vectors of bytes. It's not always clear that you want unicode-aware code when dealing with unicode, but there are times when it is nice to have.
I gave up trying to teach C after 2 years teaching it at university. You’ve been at it, what? 25+ years? Mad props. Thanks for great C/++ compilers, and double-thanks for D!
I'm not sure about C, but at least in C++ having const on the prototype is meaningless, as you can still have the arguments as non-const in the actual definition. Considering that C is usually less strict with these things, I'd expect that to be the case there too.
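It is indeed the case in C as well: top-level qualifiers on parameters are ignored when determining type compatibility (C11 6.7.6.3), so a const in the prototype and a plain parameter in the definition declare the same function type. A small demonstration (bump is a made-up name):

```c
/* Prototype with const on a by-value parameter... */
static int bump(const int x);

/* ...and a definition without it: these are the same function type,
   because top-level const on a parameter is not part of the type.
   The callee may freely modify its own copy of the argument. */
static int bump(int x)
{
    x += 1;
    return x;
}
```

const on the pointed-to type (e.g. `const char *s`), by contrast, is meaningful to callers: it promises not to write through the pointer.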
It seems that everyone implementing their own string library (including, eh, antirez) thinks masquerading pointers is cute, but in my opinion and experience it's very dangerous because it requires a specific coding convention that can't be checked by compilers. SDS is no exception to this problem:
    sds a = sdsnew("hell");
    sds b = a;
    a = sdscat(a, "o"); // this invalidates b
Masqueraded pointers are inherently linear (or affine if you are pedantic). Any length-changing update to such a pointer can potentially reallocate it, so a value can't be "updated" more than once; values have to be consumed and returned by many operations. No typical C types behave like this: primitive values and structs can be updated by assignment, and pointers can be updated through dereference. C doesn't support linear types, and while normal pointers do need care, masqueraded pointers need much more care to use correctly. Yes, you can replicate the same bug with normal pointers by replacing the third line with `free(a);`, but you wouldn't expect a bug from non-destructive operations. (Put another way, masqueraded pointers make many otherwise non-destructive operations destructive.)
While technically not a string library, this and the strict-aliasing issue for type-generic routines prompted me to write my own small extensible vector library [1] years ago.
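The invalidation itself can be shown without any string library at all; it is just how realloc behaves, which is why every alias must be treated as dead after a growing operation (demo is a made-up illustration, not SDS code):

```c
#include <stdlib.h>
#include <string.h>

/* Mirrors the sds example with plain pointers: after reallocation,
   only the returned pointer is valid; old aliases may dangle. */
static int demo(void)
{
    char *a = malloc(6);
    if (a == NULL) return 0;
    strcpy(a, "hell");
    char *b = a;              /* alias, like `sds b = a;` */
    a = realloc(a, 4096);     /* may move the buffer: b is now unusable */
    if (a == NULL) { free(b); return 0; }
    strcat(a, "o");           /* append through the returned pointer only */
    int ok = strcmp(a, "hello") == 0;
    free(a);
    return ok;
}
```

The difference is that with realloc everyone expects this; with an sds-style API the hazard hides inside calls that look like ordinary string operations.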
It's not about saving a bit, but about saving an entire word. If the length and ownership weren't packed together, you'd need one more field in the str struct for the ownership bit, and due to alignment the minimum size increase would be a word.
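The packing trick can be sketched like this; the field names and exact layout are guesses for illustration, not necessarily the library's actual ones:

```c
#include <stddef.h>

/* One word for both length and ownership: the low bit flags whether
   the struct owns (and must free) the buffer; the rest is the length. */
typedef struct {
    const char *ptr;
    size_t      info;                 /* (len << 1) | owns */
} str;

static str str_make(const char *p, size_t len, int owns)
{
    str s = { p, (len << 1) | (owns ? 1u : 0u) };
    return s;
}

static size_t str_len(str s)  { return s.info >> 1; }
static int    str_owns(str s) { return (int)(s.info & 1u); }
```

The cost is one bit of representable length and a shift on every length read; the saving, once alignment padding is accounted for, is a full word per string.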
strtok_r() destructively modifies its input, so wrapping it works against this library's objective of using const pointers. It is easy enough to reimplement, though, and I've done this for similar sub-string pointer objects that work without NUL termination.
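A non-destructive reimplementation along those lines might look like this: the input stays const, and each token comes back as a (pointer, length) pair into the original buffer. The names are invented for the sketch:

```c
#include <string.h>

typedef struct { const char *ptr; size_t len; } tok;

/* Non-destructive tokenizer: never writes to s; *pos carries the scan
   state between calls. Returns 1 and fills *out while tokens remain. */
static int next_tok(const char *s, const char *delims, size_t *pos, tok *out)
{
    size_t i = *pos;
    while (s[i] && strchr(delims, s[i])) i++;       /* skip delimiters */
    if (!s[i]) { *pos = i; return 0; }              /* no more tokens */
    out->ptr = s + i;
    while (s[i] && !strchr(delims, s[i])) i++;      /* scan the token */
    out->len = (size_t)(s + i - out->ptr);
    *pos = i;
    return 1;
}
```

Because tokens are not NUL-terminated, callers work with the length, e.g. via memcmp or printf's "%.*s", instead of expecting plain C strings.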
The existence of strtok_r() is weird anyway. If we can make errno thread-safe, there is no reason why plain strtok() can't be thread-safe. The idea that somebody wants strtok() state shared across threads is just as weird as the idea that errno should be shared across threads.
Say what? You like to call plain strtok() from different threads, having them all update a shared global state? I think you missed the point here, because that would be some really evil usage of the strtok() function.
We had "int errno" as global state. We fixed it, in a compatible way, to be thread-safe. Platforms without threads can still implement it the old way if desired.
The same kind of compatible fix could have been done with the strtok() function. There was no need to introduce another function.
Simply: the internal state of strtok() shall be distinct for each thread. (which is trivial if the platform only supports a single thread)
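With C11's _Thread_local, that fix is a two-line wrapper over POSIX strtok_r (strtok_tl is a made-up name; strtok_r is POSIX, not ISO C):

```c
#define _POSIX_C_SOURCE 200809L
#include <string.h>

/* strtok with per-thread internal state: each thread gets its own
   save pointer, so concurrent tokenizations don't interfere. */
static _Thread_local char *save;

static char *strtok_tl(char *s, const char *delims)
{
    return strtok_r(s, delims, &save);
}
```

This keeps the classic two-argument interface while making the hidden state thread-local, exactly as was done for errno.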
All the UTF-8 codepoints can be held inside an 8bit char, which is what this library seems to use under the covers.
You might need to add a couple UTF-specific methods if you want number of graphemes rather than number of bytes, but there's nothing to stop you placing UTF8 data inside a char buffer.
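Counting code points (still not graphemes) over a char buffer takes only a few lines, since UTF-8 continuation bytes are exactly those matching 10xxxxxx; a sketch:

```c
#include <stddef.h>

/* Counts Unicode code points (not graphemes) in a UTF-8 buffer by
   counting every byte that is NOT a continuation byte (10xxxxxx). */
static size_t utf8_codepoints(const char *s)
{
    size_t n = 0;
    for (; *s; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    return n;
}
```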
I don’t know what you mean by “all the UTF-8 codepoints can be held inside an 8bit char”. All Unicode codepoints obviously cannot be held in 8-bits. The UTF-8 encoding matches ASCII over the first 7-bits, but that’s not relevant. You can UTF-8 encode Unicode codepoints into a bunch of 8-bit chars, but then you can encode anything you like into a bunch of 8-bit chars; a JPEG file for instance.
> All Unicode codepoints obviously cannot be held in 8-bits.
I mentioned UTF-8 specifically, because the UTF-8 encoding actually does specify this particular feature:
> UTF-8 is capable of encoding all 1,112,064 valid character code points in Unicode using one to four one-byte (8-bit) code units. Code points with lower numerical values, which tend to occur more frequently, are encoded using fewer bytes. [0]
It’s very rare that any data encoding doesn’t come in a multiple of 8-bits. Have you ever seen a 1.5 byte (12-bit) file? UTF-8 isn’t special in that regard. You can put literally anything into a char buffer and decode it how you like.
The issue isn't whether or not your character encoding is always a multiple of 8 bits. It is whether or not you can use standard (octet-focused) parsing functions to deal with those strings. This is what makes UTF-8 "special". No byte of a UTF-8 multibyte sequence will ever have a value below 128, so no byte of a multibyte sequence can be mistaken for an ASCII character. For most "syntactic" parsing problems, that means you can use standard C functions to deal with UTF-8 strings, something that is not true of most other multibyte character encodings.
UTF-8 has an even stronger guarantee: if a byte sequence at any position in a UTF-8 string matches the byte sequence of a UTF-8 encoding of a Unicode code point, then that part of the string represents that code point. This means you can not only use standard C functions like strchr with UTF-8 strings and ASCII characters, but also use e.g. strstr to find UTF-8 substrings in UTF-8 strings.
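A small demonstration of that guarantee, with the UTF-8 bytes written out explicitly (find_utf8 is just a named wrapper for the example):

```c
#include <string.h>

/* Byte-oriented strstr finds UTF-8 substrings correctly: a multibyte
   sequence can never match at a misaligned position inside another
   code point's encoding. Returns the byte offset of the match, or -1. */
static long find_utf8(const char *hay, const char *needle)
{
    const char *p = strstr(hay, needle);
    return p ? (long)(p - hay) : -1;
}
```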
Bytes are bytes. We’re not debating whether it’s easier to write a UTF-8 decoder; I’m asserting that (almost?) any data can be represented as a sequence of bytes and UTF-8 is not special in that regard.
And what good is that? All computer memory is just an array of addressable bytes. So all you've said is you can store UTF-8 strings in memory. You still can't do random access on the string (ie. s[i] will not give you the ith character).
For any definition of a character that is useful to anyone except a text shaping engine, neither will s[i] with UCS-4 (and definitely not with UTF-16).
The most valuable property of UTF-8 from a C-string point of view is that it guarantees there are no embedded NULs in a UTF-8 string.
If you naively put the bytes of UTF-16 or UTF-32 encodings into a buffer, they might contain NUL (zero) byte values. Which, for C strings, means "end of string". UTF-8 makes sure this doesn't happen, which makes it compatible with existing C string functions.
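The difference is easy to see with the bytes written out; "éX" below is encoded by hand in both UTF-16LE and UTF-8 for illustration:

```c
#include <string.h>

/* U+00E9 ("é") in UTF-16LE is the bytes e9 00: the 00 byte looks like
   a C string terminator, so strlen() stops after one byte. The UTF-8
   encoding c3 a9 contains no zero byte, so byte-oriented C string
   functions see the whole string. */
static const char utf16le[] = { '\xe9', '\x00', 'X', '\x00', '\x00', '\x00' };
static const char utf8[]    = "\xc3\xa9X";
```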
> This is the good old C language, not C++ or Rust, so nothing can be enforced on the language level, and certain discipline is required to make sure there is no corrupt or leaked memory resulting from using this library.
I loved this. “This ain’t your fancy schmancy Tesla, it’s granddaddy’s old Ford pickup”.
> Disclaimer: This is the good old C language, not C++ or Rust, so nothing can be enforced on the language level, and certain discipline is required to make sure there is no corrupt or leaked memory resulting from using this library.
I'm a bit surprised that C11 support is needed. When you write a library like this, you usually aim for compatibility. There is a lot of ANSI C code around, including popular projects like SQLite. Yet I don't really see many C11 features in this code, except C++-style comments and inline functions that could be handled with simple #ifdefs.
C++ comments and inline functions arrived with C99, which was 21 years ago, not C11. With C11, itself now an obsolete standard from 9 years ago, we get the _Generic keyword.
Avoiding the _Generic keyword is difficult. One might try using the sizeof operator.
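For comparison, a sketch of both approaches: _Generic dispatches on the static type of its controlling expression, while a sizeof-based fallback can only dispatch on size, so it cannot tell apart same-sized types (e.g. int vs float on common ABIs). The macro names are invented:

```c
/* C11: dispatch on the controlling expression's type. */
#define type_name(x) _Generic((x), \
    int:     "int",                \
    double:  "double",             \
    char *:  "char *",             \
    default: "other")

/* Pre-C11 approximation: only the size is visible to the macro,
   so distinct types of equal size land in the same bucket. */
#define size_class(x) (sizeof(x) == sizeof(char) ? 1 : \
                       sizeof(x) == sizeof(int)  ? 2 : 3)
```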
Ownership is nice, but how can a zero-terminated buffer library still call itself a string library? We have had Unicode for a while now. UTF-8 at minimum, not less. Strings must support Unicode.
This impacts sort and comparisons mostly. But without cmp you cannot search in strings.
UTF-8 strings can be compared with strcmp(), you just can't get alphabetical sorting out of it. Most other str*() functions also work with UTF-8 encoded strings, you just need to know what to expect (e.g. splitting with strtok() works as long as the delimiters are all 7-bit ASCII chars, etc...).
No, they can't? These two UTF-8 byte sequences in a C char pointer,
    c3 a9 00
    65 cc 81 00
Represent the same string, but do not compare equal with strcmp.
And it's not just that; you've noted how strtok will break down. strchr() can't be used w/ a non-ASCII needle, there is no support for code units, etc.
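For anyone who wants to reproduce it, here are the two sequences as C literals; both render as "é" but are different byte strings, which is exactly why byte-wise strcmp cannot decide canonical equivalence (that would require Unicode normalization, which no str*() function performs):

```c
#include <string.h>

/* The same user-visible character in two canonical-equivalent forms. */
static const char nfc[] = "\xc3\xa9";   /* U+00E9 precomposed (NFC): c3 a9 */
static const char nfd[] = "e\xcc\x81";  /* U+0065 U+0301 decomposed (NFD): 65 cc 81 */
```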
AFAIK, comparison is language dependent, but how do you tell the string's language and how do you compare strings from different languages and multilingual strings?
Maybe I'm missing something but it doesn't look like memory allocation was carefully considered in this library. For example, there are no custom allocators?
It looks like it'll be getting strdup and strndup.
[1] https://en.wikipedia.org/wiki/STREAMS
I bet the security industry agrees with my definition.
That's awful and I love it.
> Everyone who did the exercises in K&R should be able to write their own string library, probably with fewer bugs than the standard one.
Fewer bugs than what “standard one”?
That's not how most people use the word "bug".
BUGS: Never use gets().
Modifying the string in place is problematic for thread safety and, depending on the source of the string (it may be a constant), may not be possible at all.
I don't know if that is a typo given you normally call them "fat pointers", but they are "pretty hot and tempting".
It doesn't have the same breadth of features as, say, Python's string class, but it's ok.
See, eg. https://en.cppreference.com/w/cpp/string
1: https://www.boost.org/doc/libs/develop/libs/nowide/doc/html/...
- I'm pretty sure public symbols starting with "str" are reserved by the standard.
- Declaring function arguments as const is pretty silly for value types, imo.
It is also very well documented. And all you need to embed it in your project is: sds.c, sds.h, and sdsalloc.h.
The source code is small and every C99 compiler should deal with it without issues.
While technically not a string library, this and the strict-aliasing issue for type-generic routines prompted me to write my own small extensible vector library [1] years ago.
[1] https://gist.github.com/lifthrasiir/4422136
https://github.com/maxim2266/str/blob/f4e84657b23977ab3c5cd7...
seems unlikely to matter if you have a bunch of strings flying around...
two features I'd love to see implemented:
- wrapping thread safe tokenization using strtok_r so it's pleasant to tokenize a string
- sprintf-like formatting
anything that improves string handling in C is doing God's work
We had "int errno" as global state. We fixed it, in a compatible way, to be thread-safe. Platforms without threads can still implement it the old way if desired.
The same kind of compatible fix could have been done with the strtok() function. There was no need to introduce another function.
Simply: the internal state of strtok() shall be distinct for each thread. (which is trivial if the platform only supports a single thread)
> UTF-8 is capable of encoding all 1,112,064 valid character code points in Unicode using one to four one-byte (8-bit) code units. Code points with lower numerical values, which tend to occur more frequently, are encoded using fewer bytes. [0]
[0] https://en.wikipedia.org/wiki/UTF-8
UTF-8 is a variable-length encoding: some characters need 8 bits, others need up to 32 bits, per your own quote.
And it is designed so that it fits in one-byte units. That was a central goal of the encoding.
So... If you have a char buffer, like we have been talking about, you can toss any valid UTF-8 sequence inside it.
It's up to you to make sense of the data in units of one to four bytes, and before that to declare the buffer with the appropriate size, plus one for the terminator.
Lexicographical sorting over UTF-8 strings is actually the same as lexicographical sorting over the corresponding Unicode code point sequence.
Indeed.