Pony Performance Cheatsheet

(ponylang.org)

98 points | by spooneybarger 2441 days ago

6 comments

  • CJefferson 2441 days ago
    I'm disappointed (in pony) that one suggest is to use tuple instead of a class, and in general avoid naking little clases.

    One often overlooked benefit of c++ (and I am sure other langyages) is that I can wrap one or two ints in a class, add some constructors and little member functions, and be fairly confident the whole lot will get inlined and compiled away.

    • trishume 2441 days ago
      Indeed, this is a good document for Pony (and a lot applies to other languages like OCaml as well!) but it shows what programming in a language like Rust or C++ that has many zero-cost abstractions gets you.

      In Rust I can always use wrappers, aliases and union types, and idiomatic error handling. And since the abstractions are either zero-overhead or will reliably be optimized away by LLVM, I don't have to worry about sacrificing performance.

      • pjmlp 2440 days ago
        This is not 100% true, just see the efforts of Google in optimizing string allocation on Chrome.

        Just because a language offers zero-cost abstractions, doesn't mean performance comes for free, one needs to know how to use them properly.

        And be clever to chose libraries that are also designed with performance in mind.

        • nine_k 2440 days ago
          In Rust or C++, you can make the abstractions zero-cost with some effort. I'm many other languages, you can't, no matter the effort.
          • pjmlp 2440 days ago
            Yes, but that "some effort" also has a development cost, which might be worthwhile or not, depending on the use case.

            Going a bit off-thread, Ada, Object Pascal, Active Oberon, D, Nim, Modula-3, Swift also offer such capabilities.

            C# also has on their roadmap plans to adopt more features from System C# and Midori, specially when coupled with .NET Native and CoreRT.

        • srean 2440 days ago
          I would say std::string and std::io are special cases and somewhat atypical. Both have large performance overheads. I think the standardization committee blessed existing implementationsite that had become popular rather than design one with performance in mind. Strings really ought to have been immutable by default.
    • vanderZwan 2440 days ago
      That's a bit of an apples to oranges comparison since class in pony isn't a class in C++. It's an actor, with all of the machinery that this entails.
      • mkishi 2440 days ago
        Classes and Actors are two different things in Pony.
        • vanderZwan 2440 days ago
          My apologies, I just checked and you're right. I remembered wrongly. Sadly it's too late to edit the comment above
  • infradig 2441 days ago
    I only got as far the string concatenation example. That languages like Pony, C++ and no doubt many others, make programmers go through these performance hoops for common-case scenarios is unfortunate.
    • rhencke 2441 days ago
      This is being added as an optimization:

      Some of you probably looked at the String performance enhancement above and thought, “doesn’t the JVM do that for you?” You’d be right. That’s a standard optimization in many languages. It’s an optimization we are adding to Pony. However, the basic pattern applies. Be aware of triggering extra allocations that you don’t need to. Your compiler and runtime can add many optimizations to avoid allocations, but it’s not going to do everything for you. You still need to understand your allocations. They’re called “your allocations” because in the end, you end up owning them and they end up owning your performance.

      • infradig 2441 days ago
        Great. Sorry for not reading further.
    • beagle3 2440 days ago
      The problem is inherent in the imperative style.

      The programmer SPECIFICALLY ASKED for all the partial results, and it turns out that doing what was asked is sub-optimal, to the tune of O(n) allocations of O(n) each and O(n^2) runtime as opposed to the optimal O(1) allocations of O(n), and O(n) runtime.

      It is extremely hard to make the compiler transform the sub-optimal version to the optimal one in the general case. Sure, one can special case strings (or std::string, or whatever) and do that without changing the semantics -- Python, for example, does that with string appends. However, none of Python, C++ or Rust will do that for a user-defined type.

      The right thing is to make sure that idiomatic way to concatenate strings is, in fact, efficient - Python does that by making the idiom "seperator.join(string_list)", which is as efficient as it can be.

      There's a lot of performance gains to be made by transforming unreasonable code to reasonable code, which is why compilers still do it; and unreasonable code just as often results from mountains of abstractions as it does from programmer unawareness. But IMHO our goal should be that the norm is that programmers are expected to write reasonable code, not that compilers are able to optimize unreasonable code.

      • Veedrac 2440 days ago
        Rust will actually optimize a chain of string additions to a chain of `push_str`s. This isn't hardcoded, just how addition is implemented.

            fn add(mut self, other: &str) -> String {
                self.push_str(other);
                self
            }
        
        https://github.com/rust-lang/rust/blob/master/src/liballoc/s...

        Of course, upfront sizing is still advantageous, but you avoid the pathological n² case.

      • wruza 2440 days ago
        >"seperator.join(string_list)", which is as efficient as it can be.

        If you store strings in array, it is not efficient to concatenate it at all. Strings could be implemented as an arrays or strings internally, so none of (insert, delete, append, prepend) series would impact performance so much. Editors actually do that, and your text buffer is not a single string that memmoves every time you type a char.

        Strings as they are today almost everywhere are simply of bad design.

        • beagle3 2440 days ago
          Depends on your use case. If you use a rope or tree representation (like most editors do these days), random access is O(log n) at best, in many implementations typically O(sqrt n) or even O(n), whereas concatenation string is O(1).

          I don't think that it is bad as a general compromise between editor buffers (lots of inserts and deletes and concats) and symbols (immutable after created), the vast majority of which are under 200 chars (file names, urls, actions, text fields).

          Sure, life would be better if every standard library supported a variety of string types for specific uses. But they don't, and I suspect that it is mostly for cognitive overload (that is, it is a PEBKAC, not a technical issue)

          • wruza 2440 days ago
            Much this, use case. Statistically, my use case is 95% cat cat cat cat... (hundreds of them) and then seqwrite of some form, never accessing it randomly. I'm generating json, xml, sql, etc to send it over socket. Sometimes it is utility library, sometimes me (or me writing utility library). My string type is always an array, but this type-shift is not obvious to new coders.

            "Why StringBuilder" was once very popular on SO, not sure for now. I oversimplified editors example, their structures are not for 'everyday' use, but I think that simple linked list in string would wipe the entire obvious class of cat-in-loop performance issues.

            • majewsky 2440 days ago
              > Statistically, my use case is 95% cat cat cat cat... (hundreds of them)

              If you use a linked-list representation like you proposed, your performance would die a death by 1000 allocations. A simple string class backed by an exponentially growing contiguous memory buffer (i.e., what 99% of string classes do) sounds much better from a performance standpoint. Especially if you are able to know (or estimate) the size of your response, and allocate once upfront.

              • wruza 2440 days ago
                I'm surely able to write a C routine that uses custom 1.5x-growing buffer and does everything, including subexpr indentation shift etc, in one pass.

                The point is that I cannot delegate this work to newbie/lowcost who will cat-cat-cat strings allocated by other newbie thousand times with no idea that it is slow and consuming. Even telling them how to do it properly I'll spend more machine cycles than simply sending it in production. Even then, very source strings are not under my control and I'm no world dictator to tell a bunch of maintainers who don't even know me what they should do for my performance case. It is real world, not my cool homegrown toolkit.

                It is not how-to talk, it is why not make already inefficient method slightly more forgiving by appropriate means.

                If you don't like a list, take an array, but I like to see tests against it before taking that "death" argument. Maybe tomorrow I'll test all three (multicat, list, array) of immutable strings here.

                • wruza 2440 days ago
                  https://pastebin.com/APHt8C5V (valid for 1 month)

                  In short, 1M iterations of 1) s=s..x, 2) insert(t,x)+concat(t) 3) l={l,x}+concat(flatten(l)). Time in integer seconds passed.

                    > lua5.1.exe x.lua
                    ..      1850000 bytes, 10000 iterations         -- n / 100, since it never ends with 1M
                    time:   23
                    array   185000000 bytes, 1000000 iterations
                    time:   5
                    list    185000000 bytes, 1000000 iterations
                    time:   6                                       -- seems pretty alive
                  
                  edit: table.concat() seems to eat most of the time here, so lists are effectively instant, as are arrays for 1M. (Full GC included in all test times.)

                  edit2: on n=10M without actual concat(): array takes 10s, list 19s. Conclusion is, malloc() is not too slow compared to 2x realloc step used in Lua tables.

          • fnord123 2440 days ago
            Not to disagree, but you can barely perform random access on a utf8 string. You need to explode it out to utf16 or utf32 which isn't what most languages have built in. Rust and go largely work with utf8 while c and c++ love them byte arrays (not sure I've even seen std::wstring in the wild)
            • burntsushi 2440 days ago
              This is misleading. UTF-16 doesn't actually provide random access, because codepoints outside the basic multilingual plane are encoded with two UTF-16 code units (4 bytes). UTF-32 guarantees 4 bytes for every codepoint, which is quite wasteful, but even then, random access by codepoint is generally a bad idea because codepoints and graphemes aren't synonymous.

              > but you can barely perform random access on a utf8 string.

              This really isn't true, or at least, isn't a problem in practice. If you need indices into UTF-8 strings, then you can record them by decoding the string. This is sufficient for most string related algorithms except for the "give me the first N characters" variety, which actually turns out to be a relative good thing since "give me the first N characters" should require applying Unicode's grapheme algorithm, which is never amenable to random access in UTF-8, UTF-16 or UTF-32.

              • fnord123 2440 days ago
                > random access by codepoint is generally a bad idea because codepoints and graphemes aren't synonymous.

                That's a good point.

                >If you need indices into UTF-8 strings, then you can record them by decoding the string.

                Sure but the context is that I was responding to "If you use a rope or tree representation (like most editors do these days), random access is O(log n) at best, in many implementations typically O(sqrt n) or even O(n), whereas concatenation string is O(1).".

                Decoding the string is O(n); not O(1) (if GP meant "concatenation string" as in a string where the text is concatenated into a single buffer - if GP meant "concatenating a string in a rope or tree is O(1)" then I'm off on a wild tangent).

                • beagle3 2440 days ago
                  But you only need to decode it once (or otherwise receive those indices without even deciding) whereas random access to a rope/tree is always o(log n) or o(n).

                  Use case is everything.

                  • fnord123 2438 days ago
                    >Use case is everything.

                    amen.

            • beagle3 2440 days ago
              Well, depends if you want the random access to be by bytes or code points - you might have the byte indices from a previous access.

              Also, factor (and iirc python too) keeps it at 8 bits if all codepoints fit in 8 bits, 16 if it fits in 16, and 32 otherwise.

        • burntsushi 2440 days ago
          > Strings as they are today almost everywhere are simply of bad design.

          I very much disagree. Eschewing a contiguous memory representation in favor of making other operations cheaper is a monumental trade off, and whether it's good or bad design depends on your use case.

          Others have mentioned that you lose random access (which is incredibly important for some algorithms), but you also lose optimizations that rely on a contiguous memory representation. For example, SIMD optimized routines like memchr live at the heart of any substring search algorithm. If you can't make effective use of SIMD optimized routines in your substring search, then your implementation will likely be left in the dust. If you proclaim contiguous memory representations as "bad design," then users of your strings won't care about design when you tell them that only bad designs have fast substring search implementations.

          • wruza 2440 days ago
            Personally I see no difference between the current statement of

              sep.join(list) is idiomatic
            
            and possible statement of

              s.flatten() before BM/SIMD is idiomatic
            
            except that join is ubiquitous and BM/SIMD searching is never heard of by a regular developer in modern boring language. Not to mention that few of us who aware can implement or use it properly.

            edit: searching/iterating methods could simply flatten() string on demand, because one doesn't search in semi-composed string [to trash performance back again]. If string doesn't have to be searched, but is for socket/file, writing methods could `s.fetch_next(BUFSIZ)` and skip huge concatenation before BUFSIZ-chunking-again at all. What's the reason to concatenate prematurely AND by hand, if it could be done automatically and only if needed?

            • burntsushi 2439 days ago
              > BM/SIMD searching is never heard of by a regular developer in modern boring language

              Of course not, because it's usually implemented in the standard library of whatever language you're using. But if your strings aren't stored in contiguous memory, then having to flatten every string into contiguous memory before searching also presents a significant performance cost.

              > What's the reason to concatenate prematurely AND by hand, if it could be done automatically and only if needed?

              I feel like you're missing my point. My point isn't that "contiguous memory is better than not contiguous memory." My point is that "contiguous memory strings are bad design" is completely unwarranted outside of the context of a specific use case. Outside of specific use cases, the only thing we should be doing is discussing trade offs, and not declaring a monopoly on what constitutes "bad design."

            • beagle3 2439 days ago
              > What's the reason to concatenate prematurely AND by hand, if it could be done automatically and only if needed?

              First, it's simplicity (of implementation, explanation, etc. by far).

              Second, it depends very significantly on your use case.

              If your strings are of an average length of 4 bytes (not uncommon, even much shorter in code that does ''.join('<',tag,'>',body,'</',tag,'>') if bodies are short) then you are very likely to take some x4 to x10 memory compared to what you actually need (e.g. if you have allocations and 64-bit pointers).

              With respect to your socket example, iovecs could be used instead of flattening if you have access to low-level OS primitives, but (a) it would be much slower if the sections are short, and (b) most programming languages and frameworks do not actually give you access to iovec (or equivalent) io calls, for either sockets or files.

              Ropes have been available since forever in C++, going as far back as the original SGI STL in 1995, RogueWave in 1991 IIRC, and countless other independent implementations. for your typical use case, the complexity (in time and space) that they provide is apparently a net negative.

              The "s.flatten() before XYZ is idiomatic" is exactly the kind of leaky abstraction that everyone hates, and it requires an understanding that is lacking among most programmers (guaranteeing half the flatten()s will be useless and half of those needed will be missing).

          • beagle3 2440 days ago
            I agree completely. Back in the 8086 days, Boyer Moore string search delivered significant speed up over linear memory scan, but nowadays nothing practically beats SIMD linear scans except in very specialized situations.
            • burntsushi 2440 days ago
              Indeed. Many Boyer Moore implementations implement the skip loop by using memchr. But in practice, it's often faster to forgo Boyer Moore completely and heuristically select the rarest byte in the string you're searching, and then run memchr on that byte. You can even pick the second rarest byte and use that as a quick way to throw out false positives before doing a full string comparison.

              (The aforementioned optimization is one of the tricks ripgrep uses that occasionally causes it to be faster than GNU grep. GNU grep uses a traditional Boyer Moore substring search with memchr always applied using the last byte in the pattern.)

  • panic 2441 days ago
    They talk about profiling at the end -- it would be nice to see numbers for how much faster or slower each of the code snippets is. It's hard to tell how much you should care about "boxing machine words", for example, especially when the faster version makes your code less maintainable. Is this only something to worry about if you have millions of items in your array?
  • nerdponx 2441 days ago
    This sounds like a good argument for having macros in the language.

    Also, how hard is it for the compiler to optimize these cases? Why are they zero-cost in C++ and not in Pony?

  • bhauer 2441 days ago
    This is a great summery of performance-oriented tips that in many cases are generally applicable, even beyond the specific context of Pony.
  • df388dfsa8f9 2440 days ago
    It's really disappointing that something as simple as a string concatenation expression `a + ":" + b` has to be replaced with a pre-allocating monstrosity. This is the kind of thing that a modern language should be able to optimize at compile time. Surely doesn't make for a good impression of the language.