C2x is expected to gain the char8_t type and almost certainly will not gain any new string handling routines. In the year 2030 I am expecting to still see more rounds of posts comparing string libraries for C. With more than 4 decades of development, we still don't have a great solution to string handling in C.
Every library like this is incompatible with every other, and most make slightly odd choices; in this library, for instance, ownership of the string is denoted by a bit in the info/size field. Not that that is a bad choice, but it is one reason someone might decline to use it and decide to write their own.
The lack of namespacing in C doesn't help. This library chooses str_ as its prefix, which is likely to collide with other libraries. It also makes it harder to write libraries that allow the string library to be swapped out.
Maybe we don't have a single solution to string handling in C because there's no such thing as a "string". A text editor would need a very different "string" than a typesetting system or an indexing engine.
Note that ISO C reserves /^str[a-z]/ only for <string.h> and <stdlib.h>. POSIX's reservation of /^str_/ (identifiers like str8line are still allowed) is for STREAMS (<stropts.h>) [1], a completely different thing from strings...
IMHO it's better for the C standard to ignore "universal string processing" completely instead of providing a half-assed and over-complicated solution like most other languages do. String processing isn't exactly a core competency of C, and never will be; it's better to use a different language for such things.
I think it is good if everybody implements their own string library.
It builds character and you learn something from it.
It is a rite of passage.
However, don't add to the pile of dependency hell that is already plaguing many open source projects. If you feel uneasy with how C strings work, consider switching programming languages instead. You will probably have an easier time, and there will be fewer unmaintained, incompatible string libraries rotting away on GitHub.
> However, don't add to the pile of dependency hell that is already plaguing many open source projects
If you create small libraries that don't produce shared objects intended to stand on their own with a stable API/ABI, but are simply headers or at most produce a .a from source fully intended to become vendored in-tree, you're not contributing to "dependency hell".
It's hardly a "dependency hell" when it only affects developers, in what are essentially unique collision type situations, and are generally addressable by the developer because the source is all present. And upstream maintainers of such intended-to-be-vendored code should generally be receptive to improving compatibility and build system configurability for such situations. And if they're not receptive/it's abandonware, congratulations your vendored library is now a fork and fix it yourself.
When we refer to "dependency hell", AIUI, it's in reference to unresolvable runtime dependencies creating hell for end-users.
> I think it is good if everybody implements their own string library.
...until someone exploits the bugs in it.
Everyone who did the exercises in K&R should be able to write their own string library, probably with fewer bugs than the standard one. However, I really feel it's much better for everyone to use proven code like bstring.
Whatever implements string.h on your system, plus the other functions dealing with string input. Some of these functions simply shouldn't be used at all. An extreme case is gets(), which was removed from the standard, but many others are no better.
You're kidding, right? We're not talking about the implementation, but the design. If I ever wanted to write a gets() replacement, it would definitely have proper checks in place to prevent buffer overflow. Everyone using strcpy() is playing with fire. You'll get it right 9 times and make a mistake the 10th time. It's not that the people who implemented these functions are stupid, but they were designed in different times for other types of environments.
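A gets()-style reader with proper checks, as described above, can be a thin wrapper over fgets(). A minimal sketch (the name read_line is made up for illustration, not from any library):

```c
#include <stdio.h>
#include <string.h>

/* Bounded line reader: unlike gets(), it takes the buffer size,
   never overruns, and strips the trailing newline if present. */
static char *read_line(char *buf, size_t size, FILE *fp)
{
    if (fgets(buf, (int)size, fp) == NULL)
        return NULL;                  /* EOF or read error */
    buf[strcspn(buf, "\n")] = '\0';   /* cut at the newline, if any */
    return buf;
}
```

If an input line is longer than the buffer, the remainder simply stays in the stream for the next call, which is at least well defined, unlike a strcpy-style overflow.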
Every time I think to use C for something, I re-realize how terrible it is to do anything involving strings. Although this library looks nice, I'll still have to compose and manage strings myself, which is a major headache.
When I review C code, I look for strncpy, etc., and give them special attention. There's always a bug or two in it.
0-terminated strings not only have proven to be a rich source of bugs, they're remarkably inefficient as well [1]. Doing better was a major focus of the initial design of D.
[1] This is because of constantly scanning to get the length (which also necessitates reloading the string contents into the memory cache), and having to make copies of strings instead of just slicing them.
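For illustration, the length/pointer representation can be sketched in C; the names (strview, sv_*) are invented here, not taken from D or from the library under discussion:

```c
#include <stddef.h>
#include <string.h>

/* Fat-pointer string view: the length is stored, never scanned. */
typedef struct {
    const char *ptr;
    size_t      len;
} strview;

static strview sv_from_cstr(const char *s)
{
    strview v = { s, strlen(s) };   /* one scan at the boundary, O(1) after */
    return v;
}

/* Slicing shares the original bytes: no copy, no terminator needed. */
static strview sv_slice(strview v, size_t start, size_t end)
{
    strview out = { v.ptr + start, end - start };
    return out;
}
```

Getting the length is a field read instead of a memory scan, and a slice is just a pointer adjustment, which is exactly the efficiency point being made.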
If you are using strtok and wanted to slice at an arbitrary point, you would either have to lose some data in the original string, move or copy the original string to make space for the extra delimiter/null character, or have set up the string ahead of time to contain the delimiter in the desired position.
There is also the mangle-use-repair choice. I've done that with pathnames for creating nested directories.
C programmers are expected to make the best choice based on the situation. The various choices trade off memory usage, CPU usage, source code readability, and program correctness.
It's not problematic. C programmers are expected to avoid screwing that up. C is a full-power language.
If available, strdupa() would be a fine way to get a suitable local copy of the string. Commonly though, the programmer knows that there will not be threads and can make the string non-constant.
I encountered Hollerith constants in an ancient Fortran codebase I worked on and was thrilled to see folks were doing clever stuff with strings in the 60s.
I wonder how much time was wasted in early computing (maybe not wasted really) because of the fear of incompatibility that is getting smaller and smaller as computing platforms coalesce into standardized-ish things.
Watching the M1 roll out and how it doesn't seem to care much that x86 is a thing and gets along with its life has been fascinating.
D uses "phat pointers" for strings, aka a length/pointer pair. Over the years, this has proven to be simple, efficient, and resistant to errors. It means array bounds checking can be automatically done. It enables efficient slicing.
String literals also have an extra 0 appended, making it transparently easy to still pass strings to C functions like printf.
C++ itself doesn't support it. There are libraries that provide unicode-aware handling of strings/vectors of bytes. It's not always clear that you want unicode-aware code when dealing with unicode, but there are times when it is nice to have.
I gave up trying to teach C after 2 years teaching it at university. You’ve been at it, what? 25+ years? Mad props. Thanks for great C/++ compilers, and double-thanks for D!
I'm not sure about C, but at least in C++ having const on the prototype is meaningless, as you can still have the arguments as non-const in the actual definition. Considering that C is usually less strict with these things, I'd expect that to be the case there too.
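It is indeed the case in C as well: top-level qualifiers on parameters are ignored when determining type compatibility (C11 6.7.6.3), so a const in the prototype and a plain parameter in the definition declare the same function type. A small demonstration (bump is a made-up name):

```c
/* Prototype with const on a by-value parameter... */
static int bump(const int x);

/* ...and a definition without it: these are the same function type,
   because top-level const on a parameter is not part of the type.
   The callee may freely modify its own copy of the argument. */
static int bump(int x)
{
    x += 1;
    return x;
}
```

const on the pointed-to type (e.g. `const char *s`), by contrast, is meaningful to callers: it promises not to write through the pointer.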
It seems that everyone implementing their own string library (including, eh, antirez) thinks masquerading pointers is cute, but in my opinion and experience it's very dangerous because it requires a specific coding convention that can't be checked by compilers. SDS is no exception to this problem:
    sds a = sdsnew("hell");
    sds b = a;
    a = sdscat(a, "o"); // this invalidates b
Masqueraded pointers are inherently linear (or affine if you are pedantic). Any length-changing update to such a pointer can potentially reallocate it, so a value can't be "updated" more than once; values have to be consumed and returned by many operations. No typical C types behave like this: primitive values and structs can be updated by assignment, and pointers can be updated through dereference. C doesn't support linear types, and while normal pointers do need care, masqueraded pointers need much more care to use correctly. Yes, you can replicate the same bug with normal pointers by replacing the third line with `free(a);`, but you wouldn't expect a bug from non-destructive operations. (Put another way, masqueraded pointers make many otherwise non-destructive operations destructive.)
While technically not a string library, this and the strict-aliasing issue for type-generic routines prompted me to write my own small extensible vector library [1] years ago.
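The invalidation itself can be shown without any string library at all; it is just how realloc behaves, which is why every alias must be treated as dead after a growing operation (demo is a made-up illustration, not SDS code):

```c
#include <stdlib.h>
#include <string.h>

/* Mirrors the sds example with plain pointers: after reallocation,
   only the returned pointer is valid; old aliases may dangle. */
static int demo(void)
{
    char *a = malloc(6);
    if (a == NULL) return 0;
    strcpy(a, "hell");
    char *b = a;              /* alias, like `sds b = a;` */
    a = realloc(a, 4096);     /* may move the buffer: b is now unusable */
    if (a == NULL) { free(b); return 0; }
    strcat(a, "o");           /* append through the returned pointer only */
    int ok = strcmp(a, "hello") == 0;
    free(a);
    return ok;
}
```

The difference is that with realloc everyone expects this; with an sds-style API the hazard hides inside calls that look like ordinary string operations.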
It's not about saving a bit, but about saving an entire word. If the length and ownership weren't packed together, you'd need one more field in the str struct for the ownership bit, and due to alignment the minimum size increase would be a word.
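The packing trick can be sketched like this; the field names and exact layout are guesses for illustration, not necessarily the library's actual ones:

```c
#include <stddef.h>

/* One word for both length and ownership: the low bit flags whether
   the struct owns (and must free) the buffer; the rest is the length. */
typedef struct {
    const char *ptr;
    size_t      info;                 /* (len << 1) | owns */
} str;

static str str_make(const char *p, size_t len, int owns)
{
    str s = { p, (len << 1) | (owns ? 1u : 0u) };
    return s;
}

static size_t str_len(str s)  { return s.info >> 1; }
static int    str_owns(str s) { return (int)(s.info & 1u); }
```

The cost is one bit of representable length and a shift on every length read; the saving, once alignment padding is accounted for, is a full word per string.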
strtok_r() destructively modifies its input, so wrapping it works against this library's objective of using const pointers. It is easy enough to reimplement, though, and I've done this for similar sub-string pointer objects that work without NUL termination.
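A non-destructive reimplementation along those lines might look like this: the input stays const, and each token comes back as a (pointer, length) pair into the original buffer. The names are invented for the sketch:

```c
#include <string.h>

typedef struct { const char *ptr; size_t len; } tok;

/* Non-destructive tokenizer: never writes to s; *pos carries the scan
   state between calls. Returns 1 and fills *out while tokens remain. */
static int next_tok(const char *s, const char *delims, size_t *pos, tok *out)
{
    size_t i = *pos;
    while (s[i] && strchr(delims, s[i])) i++;       /* skip delimiters */
    if (!s[i]) { *pos = i; return 0; }              /* no more tokens */
    out->ptr = s + i;
    while (s[i] && !strchr(delims, s[i])) i++;      /* scan the token */
    out->len = (size_t)(s + i - out->ptr);
    *pos = i;
    return 1;
}
```

Because tokens are not NUL-terminated, callers work with the length, e.g. via memcmp or printf's "%.*s", instead of expecting plain C strings.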
The existence of strtok_r() is weird anyway. If we can make errno thread-safe, there is no reason why plain strtok() can't be thread-safe. The idea that somebody wants strtok() state shared across threads is just as weird as the idea that errno should be shared across threads.
Say what? You like to call plain strtok() from different threads, having them all update a shared global state? I think you missed the point here, because that would be some really evil usage of the strtok() function.
We had "int errno" as global state. We fixed it, in a compatible way, to be thread-safe. Platforms without threads can still implement it the old way if desired.
The same kind of compatible fix could have been done with the strtok() function. There was no need to introduce another function.
Simply: the internal state of strtok() shall be distinct for each thread. (which is trivial if the platform only supports a single thread)
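With C11's _Thread_local, that fix is a two-line wrapper over POSIX strtok_r (strtok_tl is a made-up name; strtok_r is POSIX, not ISO C):

```c
#define _POSIX_C_SOURCE 200809L
#include <string.h>

/* strtok with per-thread internal state: each thread gets its own
   save pointer, so concurrent tokenizations don't interfere. */
static _Thread_local char *save;

static char *strtok_tl(char *s, const char *delims)
{
    return strtok_r(s, delims, &save);
}
```

This keeps the classic two-argument interface while making the hidden state thread-local, exactly as was done for errno.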
All the UTF-8 codepoints can be held inside an 8bit char, which is what this library seems to use under the covers.
You might need to add a couple UTF-specific methods if you want number of graphemes rather than number of bytes, but there's nothing to stop you placing UTF8 data inside a char buffer.
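Counting code points (still not graphemes) over a char buffer takes only a few lines, since UTF-8 continuation bytes are exactly those matching 10xxxxxx; a sketch:

```c
#include <stddef.h>

/* Counts Unicode code points (not graphemes) in a UTF-8 buffer by
   counting every byte that is NOT a continuation byte (10xxxxxx). */
static size_t utf8_codepoints(const char *s)
{
    size_t n = 0;
    for (; *s; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    return n;
}
```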
I don’t know what you mean by “all the UTF-8 codepoints can be held inside an 8bit char”. All Unicode codepoints obviously cannot be held in 8-bits. The UTF-8 encoding matches ASCII over the first 7-bits, but that’s not relevant. You can UTF-8 encode Unicode codepoints into a bunch of 8-bit chars, but then you can encode anything you like into a bunch of 8-bit chars; a JPEG file for instance.
> All Unicode codepoints obviously cannot be held in 8-bits.
I mentioned UTF-8 specifically, because the UTF-8 encoding actually does specify this particular feature:
> UTF-8 is capable of encoding all 1,112,064 valid character code points in Unicode using one to four one-byte (8-bit) code units. Code points with lower numerical values, which tend to occur more frequently, are encoded using fewer bytes. [0]
It’s very rare that any data encoding doesn’t come in a multiple of 8-bits. Have you ever seen a 1.5 byte (12-bit) file? UTF-8 isn’t special in that regard. You can put literally anything into a char buffer and decode it how you like.
The issue isn't whether or not your character encoding is always a multiple of 8 bits. It is whether or not you can use standard (octet-focused) parsing functions to deal with those strings. This is what makes UTF-8 "special". No byte of a UTF-8 multibyte sequence will ever have a value below 128, so no byte of a multibyte sequence can be mistaken for an ASCII character. For most "syntactic" parsing problems, that means you can use standard C functions to deal with UTF-8 strings, something that is not true of most other multibyte character encodings.
UTF-8 has an even stronger guarantee: if a byte sequence at any position in a UTF-8 string matches the byte sequence of a UTF-8 encoding of a Unicode code point, then that part of the string represents that code point. This means you can not only use standard C functions like strchr with UTF-8 strings and ASCII characters, but also use e.g. strstr to find UTF-8 substrings in UTF-8 strings.
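A small demonstration of that guarantee, with the UTF-8 bytes written out explicitly (find_utf8 is just a named wrapper for the example):

```c
#include <string.h>

/* Byte-oriented strstr finds UTF-8 substrings correctly: a multibyte
   sequence can never match at a misaligned position inside another
   code point's encoding. Returns the byte offset of the match, or -1. */
static long find_utf8(const char *hay, const char *needle)
{
    const char *p = strstr(hay, needle);
    return p ? (long)(p - hay) : -1;
}
```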
Bytes are bytes. We’re not debating whether it’s easier to write a UTF-8 decoder; I’m asserting that (almost?) any data can be represented as a sequence of bytes and UTF-8 is not special in that regard.
And what good is that? All computer memory is just an array of addressable bytes. So all you've said is you can store UTF-8 strings in memory. You still can't do random access on the string (ie. s[i] will not give you the ith character).
For any definition of a character that is useful to anyone except a text shaping engine, neither will s[i] with UCS-4 (and definitely not with UTF-16).
The most valuable property of UTF-8 from a C-string point of view is that it guarantees there are no embedded NULs in a UTF-8 string.
If you naively put the bytes of UTF-16 or UTF-32 encodings into a buffer, they might contain NUL (zero) byte values. Which, for C strings, means "end of string". UTF-8 makes sure this doesn't happen, which makes it compatible with existing C string functions.
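The difference is easy to see with the bytes written out; "éX" below is encoded by hand in both UTF-16LE and UTF-8 for illustration:

```c
#include <string.h>

/* U+00E9 ("é") in UTF-16LE is the bytes e9 00: the 00 byte looks like
   a C string terminator, so strlen() stops after one byte. The UTF-8
   encoding c3 a9 contains no zero byte, so byte-oriented C string
   functions see the whole string. */
static const char utf16le[] = { '\xe9', '\x00', 'X', '\x00', '\x00', '\x00' };
static const char utf8[]    = "\xc3\xa9X";
```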
> This is the good old C language, not C++ or Rust, so nothing can be enforced on the language level, and certain discipline is required to make sure there is no corrupt or leaked memory resulting from using this library.
I loved this. “This ain’t your fancy schmancy Tesla, it’s granddaddy’s old Ford pickup”.
> Disclaimer: This is the good old C language, not C++ or Rust, so nothing can be enforced on the language level, and certain discipline is required to make sure there is no corrupt or leaked memory resulting from using this library.
I'm a bit surprised that C11 support is needed. When you write a library like this, you usually aim for compatibility. There is a lot of ANSI C code around, including popular projects like SQLite. Yet I don't really see many C11 features in this code, except C++-style comments and inline functions that could be handled with simple #ifdefs.
C++ comments and inline functions arrived with C99, which was 21 years ago, not C11. With C11, itself now an obsolete standard from 9 years ago, we get the _Generic keyword.
Avoiding the _Generic keyword is difficult. One might try using the sizeof operator.
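For comparison, a sketch of both approaches: _Generic dispatches on the static type of its controlling expression, while a sizeof-based fallback can only dispatch on size, so it cannot tell apart same-sized types (e.g. int vs float on common ABIs). The macro names are invented:

```c
/* C11: dispatch on the controlling expression's type. */
#define type_name(x) _Generic((x), \
    int:     "int",                \
    double:  "double",             \
    char *:  "char *",             \
    default: "other")

/* Pre-C11 approximation: only the size is visible to the macro,
   so distinct types of equal size land in the same bucket. */
#define size_class(x) (sizeof(x) == sizeof(char) ? 1 : \
                       sizeof(x) == sizeof(int)  ? 2 : 3)
```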
Ownership is nice, but how can a zero-terminated buffer library still call itself a string library? We have had Unicode for a while now. UTF-8 at minimum, not less. Strings must support Unicode.
This impacts sort and comparisons mostly. But without cmp you cannot search in strings.
UTF-8 strings can be compared with strcmp(), you just can't get alphabetical sorting out of it. Most other str*() functions also work with UTF-8 encoded strings, you just need to know what to expect (e.g. splitting with strtok() works as long as the delimiters are all 7-bit ASCII chars, etc...).
No, they can't? These two UTF-8 byte sequences in a C char pointer,
    c3 a9 00
    65 cc 81 00
Represent the same string, but do not compare equal with strcmp.
And it's not just that; you've noted how strtok will break down. strchr() can't be used w/ a non-ASCII needle, there is no support for code units, etc.
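For anyone who wants to reproduce it, here are the two sequences as C literals; both render as "é" but are different byte strings, which is exactly why byte-wise strcmp cannot decide canonical equivalence (that would require Unicode normalization, which no str*() function performs):

```c
#include <string.h>

/* The same user-visible character in two canonical-equivalent forms. */
static const char nfc[] = "\xc3\xa9";   /* U+00E9 precomposed (NFC): c3 a9 */
static const char nfd[] = "e\xcc\x81";  /* U+0065 U+0301 decomposed (NFD): 65 cc 81 */
```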
AFAIK, comparison is language dependent, but how do you tell the string's language and how do you compare strings from different languages and multilingual strings?
Maybe I'm missing something but it doesn't look like memory allocation was carefully considered in this library. For example, there are no custom allocators?
It looks like it'll be getting strdup and strndup.
[1] https://en.wikipedia.org/wiki/STREAMS
I bet the security industry agrees with my definition.
That's awful and I love it.
> Everyone who did the exercises in K&R should be able to write their own string library, probably with fewer bugs than the standard one.
Fewer bugs than what “standard one”?
That's not how most people use the word "bug".
BUGS: Never use gets().
Modifying the string in place is problematic for thread safety and, depending on the source of the string (it may be a constant), may not be possible at all.
I don't know if that is a typo given you normally call them "fat pointers", but they are "pretty hot and tempting".
It doesn't have the same breadth of features as, say, Python's string class, but it's ok.
See, eg. https://en.cppreference.com/w/cpp/string
1: https://www.boost.org/doc/libs/develop/libs/nowide/doc/html/...
- I'm pretty sure public symbols starting with "str" are reserved by the standard.
- Declaring function arguments as const is pretty silly for value types, imo.
It is also very well documented. And all you need to embed it in your project is: sds.c, sds.h, and sdsalloc.h.
The source code is small and every C99 compiler should deal with it without issues.
While technically not a string library, this and the strict-aliasing issue for type-generic routines prompted me to write my own small extensible vector library [1] years ago.
[1] https://gist.github.com/lifthrasiir/4422136
https://github.com/maxim2266/str/blob/f4e84657b23977ab3c5cd7...
seems unlikely to matter if you have a bunch of strings flying around...
two features I'd love to see implemented:
- wrapping thread safe tokenization using strtok_r so it's pleasant to tokenize a string
- sprintf-like formatting
anything that improves string handling in C is doing God's work
We had "int errno" as global state. We fixed it, in a compatible way, to be thread-safe. Platforms without threads can still implement it the old way if desired.
The same kind of compatible fix could have been done with the strtok() function. There was no need to introduce another function.
Simply: the internal state of strtok() shall be distinct for each thread. (which is trivial if the platform only supports a single thread)
> UTF-8 is capable of encoding all 1,112,064 valid character code points in Unicode using one to four one-byte (8-bit) code units. Code points with lower numerical values, which tend to occur more frequently, are encoded using fewer bytes. [0]
[0] https://en.wikipedia.org/wiki/UTF-8
UTF-8 is a variable-length encoding: some characters need 8 bits, others need up to 32 bits, per your own quote.
And it is designed so that it fits in one-byte units. That was a central goal of the encoding.
So... If you have a char buffer, like we have been talking about, you can toss any valid UTF-8 sequence inside it.
It's up to you to make sense of the data in units of one to four bytes, and before that to declare the buffer with the appropriate size, plus one for the terminator.
Lexicographical sorting over UTF-8 strings is actually the same as lexicographical sorting over the corresponding Unicode code point sequence.
Indeed.