Cttrie – Compile-time trie-based string matching for C++

(smilingthax.github.io)

82 points | by texec 60 days ago

10 comments

  • joejoint 60 days ago

    Why not just hash the string using constexpr ?

    ====================

    C preprocessor based compile time hash from lolengine:

    http://lolengine.net/blog/2011/12/20/cpp-constant-string-has...

    ====================

    #define H1(s,i,x)(x65599u+(uint8_t)s[(i) <strlen(s)?strlen(s)-1-(i):strlen(s)])

    #define H4(s,i,x) H1(s,i,H1(s,i+1,H1(s,i+2,H1(s,i+3,x))))

    #define H16(s,i,x) H4(s,i,H4(s,i+4,H4(s,i+8,H4(s,i+12,x))))

    #define H64(s,i,x) H16(s,i,H16(s,i+16,H16(s,i+32,H16(s,i+48,x))))

    #define H256(s,i,x) H64(s,i,H64(s,i+64,H64(s,i+128,H64(s,i+192,x))))

    #define HASH(s) ((uint32_t)(H256(s,0,0)^(H256(s,0,0)>>16)))

    template<size_t N> constexpr uint32_t h(char const (&s)[N]) { return HASH(s); }

    constexpr uint32_t h(const char s) { return HASH(s); }

    uint32_t h(const std::string& s) { return h(s.c_str()); }

    int main(int argc, const char argv) {

        auto s = "keep";
        std::string x = "simple";
    
        switch(h(x))
        {
            case h("keep")  : std::cout << "keep"; break;
            case h("it")    : std::cout << "it"; break;
            case h("simple"): std::cout << "simple"; break;
    
        };
    • kllrnohj 60 days ago

      Hash collisions would be one obvious problem with this approach. You could likely always tweak the hash used in response to collisions on the set of constants, but that's not going to be fun the first time it happens.

      Also with that approach since it can't be made into a jump table it's still going to compile into a bunch of if/else combinations. At least with the trie it can early-return if the input doesn't have any chance of matching anything.

      • beached_whale 60 days ago

        This is what I have done. However, you still have the risk of collisions. Depending on the context, like i did a small test and permuted all letters a-z and numbers and there are no collisions. However, changing a-z to A-Z resulted in many(may need to check code). However, this allows for a 5 character command switch.

        • ur-whale 60 days ago

          The hash has to look at all the chars in the string.

          Does the trie have to as well?

          • beached_whale 60 days ago

            The string compare has a lot of branching and looks at both strings. The hash method one checks the hash of the test string(size_t) with the size_t of the other string. The hash itself, say fnv1a, can be very quick(a xor and a multiply I think per character) I think I remember testing this and it was quicker. ill try to find the test, but it is easily done. Either way, it's the only way to use a switch statement this way

            • slededit 60 days ago

              If it’s possible for the string to not be in the set then you have to look at all chars.

              Otherwise there are some cases you can skip characters. I generally don’t use this optimization when writing Tries because it’s tricky and rarely makes a difference.

          • chubot 60 days ago

            If you're OK with textual code generation, you can also use something like re2c:

            http://re2c.org/

            And then you get the power of regexes too. re2c will match an arbitrary set of regexes (including constant strings) by walking through the string byte-by-byte, a single time.

            A trie is basically a special case of a DFA.

            • wmu 60 days ago

              "A trie is basically a special case of a DFA." The best definition of trie I read.

            • acidx 60 days ago

              I like this trick that I use in C. It's limited, as it only matches string prefixes, but it works pretty well in the end: https://tia.mat.br/posts/2018/02/01/more_on_string_switch_in...

              It allows you to write something like this: https://github.com/lpereira/lwan/blob/master/src/lib/lwan-ti... -- which is generated as a binary search by GCC.

              • convivialdingo 60 days ago

                Something we’ve used before is preprocessor string hashing similar to [1]. Basically we hash strings at compile time to ints and use those for our case statements. Word of caution though, there’s the possibility of collisions. It works great on a limited set of strings like XML tags and such.

                1. http://lolengine.net/blog/2011/12/20/cpp-constant-string-has...

                • saagarjha 60 days ago

                  You know, I don't quite understand the point the author is making there. They concede that the compiler is smart enough to elide memcpys, but not smart enough to vectorize their strlen function? What?

                • wyldfire 60 days ago

                  Another approach not mentioned: LLVM's StringSwitch [1]. Unfortunately it's not designed to be used independent of LLVM, so it has a dependency or two. Also note that it's a slightly different use case (can only return different values, is not used for flow control).

                  [1] https://github.com/llvm-mirror/llvm/blob/master/include/llvm...

                  • BeeOnRope 60 days ago

                    It is mentioned right near the end of the slides.

                  • lasagnaphil 60 days ago

                    Although I understand people want to do some crazy stuff with template metaprogramming, I’m more concerned with how it will affect build times (you’re basically running a tree-walk interpreter in a compiler that was not supposed to run a turing complete language... it’s going to be very slow.)

                    For more practical reasons, what about just using runtime interned strings? Each unique string is hashed and allocated in heap memory only once, and to compare the strings you just compare the hashed values (which is just uint_64 comparison). Then you can easily use switch statements to compare your strings... As for performance, interning has an initial runtime cost for each string, but it will probably be miniscule compared to the massive compile time cost induced by crazy constexpr stuff.

                    Example of a string interning library:

                    https://github.com/mattiasgustavsson/libs/blob/master/docs/s...

                    • stargrazer 60 days ago

                      If I recall correctly, https://www.boost.org/doc/libs/1_68_0/libs/spirit/doc/html/i... uses trie's for it's compile time recursive descent parser for writing grammars and lexical analyzers.

                      • ape4 60 days ago

                        This only an issue if you want to have many hardcoded strings in a switch. If you want that, there is a probably a better way to organize the code.

                        • detaro 60 days ago

                          I'd assume that in most cases where this comes up, you're not exactly in a position to not use those strings, e.g. because they're defined in an input format you're processing.

                        • saagarjha 60 days ago

                          Interesting. The way I would have thought of doing something like this is by laying out each "switch" case sequentially in memory, then having the "SWITCH(string)" macro expand out to something that compute the offset necessary, using a similar pre-computed trie, and jump to that offset. That way fallthrough should work as expected.

                          • AndyKelley 59 days ago
                            • 60 days ago
                              [deleted]