> Transactional memory is currently one of the hottest topics in computer-science research, having attracted the focus of researchers in programming languages, computer architecture, and parallel programming, as well as the attention of development groups at major software and hardware companies. The fundamental source of the excitement is the belief that by replacing locks and condition variables with transactions we can make it easier to write correct and efficient shared-memory parallel programs.
Has this research come to fruition? What languages/environments use such an approach today?
Well, I'd say it's a bit muddled. Pure STM systems have seen limited success, mostly on the JVM, particularly in Clojure. But as a technology it hasn't spread widely in industry, despite the fundamental advantage of composability. I'd personally attribute this to a sort of gulf created by two common approaches:
1. It's become very common to structure software as a stateless service cluster backed by a stateful database technology. Most of the databases used in this approach would not be described as STM, but the net effect for the application is similar, even if it involves more layers and complexity.
2. People have gotten pretty comfortable with simple mutex patterns connected by queues of various sorts. This is a sort of worse-is-better situation: a mutex-protected whatever in a toy test or microbenchmark is far simpler and more efficient than STM, but a system composed of many of these mutex-protected islands proves a whole different beast indeed. STM has largely been criticized from the former perspective rather than the latter.
Many people have observed that transactional concurrency control, state-of-the-art garbage collection, and even file systems have been converging on similar features. This is driven by common constraints on both sides: what the hardware can do and what humans expect. In particular with persistent memory, I think you'll see all three of these unify into a single design, because systems that attempt to solve these problems separately will be a very poor match for the hardware.
I think trying to scale out compute + mutable state across multiple CPUs on a single box has somewhat fallen by the wayside outside of specialized applications like databases and distributed query engines, as a local maximum avoided in favour of scalability via more boxes.
There are several forces behind this. Applications are more likely to be web-based or services - i.e. inherently distributed - and less likely to be desktop or traditional client/server, where almost all compute happened on a single server. For distributed services, maximizing statelessness and avoiding mutable shared memory is key to solving a lot of problems: scaling out (no shared memory), redundancy (keep a second copy of the application logic running somewhere, no sync required), recovery and idempotence (if something fails, try again until it succeeds - repeated attempts are safe).
Reliable persistent queues are part of that. They let you bring services up and down and switch over without downtime, or restart after a failure and resume where they left off.
The problems of shared mutable state are best kept in specialized applications: databases, queuing systems, distributed caches, consistent key-value stores. Keeping state consistent in a distributed system is a genuinely hard problem, and STM isn't much help, except perhaps as an implementation detail in some of those specialized applications.
For what it's worth, scaling mutable shared state across multiple CPUs on a single box has fallen by the wayside for databases too. Thread-per-core software architecture has become idiomatic, in part because it scales up efficiently on machines with a very large number of cores. Nothing about this precludes scale-out; it just lets you serve 10-100x the load on a single box compared to the more typical naive design.
Web applications aren't special in this regard. We've just normalized being incredibly wasteful of computing resources when designing them, spinning up huge clusters to serve a workload that could be easily satisfied with a single machine (or a few machines for availability).
Premature scale-out, because we've forgotten how to scale-up, is the root of much evil.
This has been my experience as well. I briefly shifted from embedded systems programming to the web-services world. I found that the scale-up approach has been so maligned that it's considered fine to spin up several hundred EC2 instances running a (poorly written) Java application instead of a single multi-core instance running Erlang that performed much better. Some frameworks even ran web services in PHP and spent all their effort on keeping requests from ever reaching the PHP code, since it was doing something like 10 reqs/sec.
Basically, poor STM performance in .NET comes down to too much mutation and the logging that mutation induces. This is why Clojure's STM, which operates over immutable persistent data structures, is much more widely used.
I have particularly deep experience in various Haskell projects, and I must insist that no Haskell project that is (1) relatively big, (2) has a hint of concurrency, and (3) is developed by several people can go without STM.
I might use MVars myself for problem-space discovery. When there are at least two of us, I will use TVars and insist others should too.
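To illustrate why TVars compose where MVars and plain mutexes don't, here is a minimal sketch using the real `stm` combinators (`orElse`, `retry`); the `takeEither` name is my own:

```haskell
import Control.Concurrent.STM

-- Take a job from whichever queue has one, blocking (via the retry
-- inside readTQueue) until either does. `orElse` composes the two
-- blocking reads into a single atomic transaction; there is no
-- lock-ordering discipline to get wrong.
takeEither :: TQueue a -> TQueue a -> STM a
takeEither q1 q2 = readTQueue q1 `orElse` readTQueue q2

main :: IO ()
main = do
  q1 <- newTQueueIO
  q2 <- newTQueueIO
  atomically $ writeTQueue q2 ("job from q2" :: String)
  job <- atomically (takeEither q1 q2)
  putStrLn job
```

Because `orElse` returns just another STM action, `takeEither` can itself be embedded in a larger `atomically` block alongside further reads and writes, which is exactly the kind of composition a lock-based version makes hard.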
I am not free to name the product, but I worked on a product that has done billions of dollars in revenue and that built a bunch of state-management checkpointing on a custom STM implementation for C (yes, C). It made so, so many things straightforward in terms of corner cases; we also extended transactionality to sync over the wire in some cases.
I also think STM-adjacent designs are gaining some traction - for example, FoundationDB looks an awful lot like an STM system from the point of view of applications, much more than it looks like a traditional database.
It’s widely regarded as a dead end. Conceptually, it’s quite appealing so who knows, it might find new life someday. Garbage collection was a niche technology for decades, as was JIT compilation.
Joe Duffy’s write-up on how STM failed at Microsoft with .NET is illustrative:
http://joeduffyblog.com/2010/01/03/a-brief-retrospective-on-...
Right... it's not hard to invent some scheme that scales to a 2-core computer, but it's not crazy to have 64 or more cores these days, and by Amdahl's law, even 2% serial overhead from communication means you are never going to scale past 50x.
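The 50x figure is just Amdahl's law with a 2% serial fraction; a quick sketch:

```haskell
-- Amdahl's law: maximum speedup on n cores when a fraction s of
-- the work is serial and cannot be parallelized.
amdahl :: Double -> Double -> Double
amdahl s n = 1 / (s + (1 - s) / n)

main :: IO ()
main = do
  print (amdahl 0.02 64)   -- roughly 28x on 64 cores
  print (amdahl 0.02 1e9)  -- approaches the 1/s = 50x ceiling
```

Note that at 64 cores you are already paying for serialization: barely 28x out of a theoretical 64x.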
The problems of making a system really scale as opposed to just scale are bad enough that the best we know how to do is give people simpler primitives that work -- each layer you add to the system is another opportunity to add another limitation to it.
Modern garbage collectors are fairly scalable (e.g. they don't break outright as the heap size increases), and any manual memory management scheme would also risk scaling problems, to the point where you might say that garbage collection is more "scalable" when the thing being scaled is the kind of arbitrary complexity that appears in business applications.
What people complain about with garbage collection are: (1) a fixed constant overhead they think is too high and will probably always think is too high, and (2) long pauses that are manageable in practice unless you are timing events with millisecond precision on a microcontroller.
Thus one of these things is mainstream and the other isn't.
> What people complain about with garbage collection are: (1) a fixed constant overhead they think is too high and will probably always think is too high, and (2) long pauses that are manageable in practice unless you are timing events with millisecond precision on a microcontroller.
I keep hearing this about "modern" GC and it keeps being very much not true in practice. I'm running IntelliJ IDEA (Java) on a laptop with 8GB of RAM, and it keeps freezing every so often. I suspect the GC is causing paging. This is something that would obviously be avoided with reference counting or manual memory management.
You'd prefer your data being larger in memory due to refcount fields and slower to access due to atomic operations on those fields?
It's not like tracing GC is without memory overhead...
Slower is a question of priorities. I'd definitely trade slower overall performance (I don't really need my IDE to be that performant... it's not like it's using 100% CPU, or even 10%!) for a huge improvement in latency (i.e. no catastrophic slowdowns/freezes).
There are very few ways of having zero memory overhead, though.
Also, most of the flaws of Java's memory management seem to be concentrated in the part of the language that says that almost everything has to be an object with a header and on the heap. Remove that and your graph of objects to trace shrinks very substantially.
Interestingly, I was running out of memory, so I gave it more RAM (like 20GB) and it got even worse. Apparently it takes a long time to GC that much memory. I’d also rather amortize that pause. The loss of productivity due to a sudden pause is real.
Yeah. As is pointed out down-thread of the linked post, even when it was confined to a niche, garbage collection did what it claimed it would (eliminate the need for error-prone manual memory management) at an acceptable cost for many systems & applications. JITs were similar.
So far, STM and HTM have followed a path more like VLIW instruction sets: Conceptually appealing for their promise of simplification, but in practice only demonstrating success in specialized uses, with the tradeoffs proving unattractive seemingly everywhere else they've been tried.
I think it is a false equivalence to say that STM and HTM are like VLIW (which has been a resounding success for DSP workloads). I am not defending VLIW.
Transactions are extremely powerful conceptually, but being composable without causing flapping requires a higher level of coordination.
It kind of came and went in Clojure, not much code uses it today.
* https://clojure.org/reference/refs
* https://sw1nn.com/blog/2012/04/11/clojure-stm-what-why-how/
In your second link, the linked followup post is also interesting as it captures some misunderstandings one might have when starting out with STM code.
Hardware support is only very recent. Intel added support at various times and then had to remove it due to security and correctness issues, but I think it is now finally available:
> Intel's Transactional Synchronization Extensions (TSX) is available in some of the Skylake processors. It was earlier implemented in Haswell and Broadwell processors as well, but the implementations turned out both times to be defective and support for TSX was disabled. The TSX specification describes the transactional memory API for use by software developers, but withholds details on technical implementation.[11] ARM architecture has a similar extension.[12]
https://en.wikipedia.org/wiki/Transactional_memory
I believe that even where it's available, OSes turn it off because of all the other Intel vulnerabilities, since it makes them even more expensive to work around.
I thought the big win for Transactional Memory over locking was composability. Locks don’t compose. Whenever accessing data, you have to think about all the locks or else you end up with dead(live)lock.
The promise of transactional memory was that it composes and you only have to think about your data access and not have to worry about running into deadlocks.
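In Haskell's STM, for instance, that promise is concrete: two independently written operations can be glued into one atomic transaction with no lock-ordering discipline. (The `Account`/`transfer` names here are my own illustration, not from the thread.)

```haskell
import Control.Concurrent.STM

type Account = TVar Int

-- Blocks (retries) rather than overdrawing; no locks acquired.
transfer :: Account -> Account -> Int -> STM ()
transfer from to amount = do
  balance <- readTVar from
  if balance < amount
    then retry
    else do
      writeTVar from (balance - amount)
      modifyTVar' to (+ amount)

main :: IO ()
main = do
  a <- newTVarIO 100
  b <- newTVarIO 0
  c <- newTVarIO 0
  -- Composition: both transfers commit or neither does. With locks,
  -- this is exactly where you would start worrying about acquisition
  -- order across the three accounts.
  atomically $ do
    transfer a b 30
    transfer b c 10
  mapM_ (\v -> readTVarIO v >>= print) [a, b, c]
```

The second `transfer` sees the first one's effects within the same transaction (`b` holds 30 by then), so the combined operation never exposes an intermediate state to other threads.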
Borrow checkers feel like a bottom up approach to STM to me.
There’s a lot of static analysis in the JVM that reduces locking, speeds up garbage collection, and removes the need to declare the type of a generic variable.
I expect over time we will see inference in borrow checkers reduce the brain cycles you need to spend thinking about these things, and borrow semantics will be “good enough” for a long time.
Or who knows, maybe STM systems will be able to infer borrow semantics and benefit from aggressive lock elision.
Does this all go back to a derivative of the halting problem? With an abstraction, the computer must decide what you might ever want to do; while you may never intend for some set of conditions to occur, the computer cannot 'know' that. And the abstraction doesn't offer a way for you to specify your conditions in a more optimal manner.
Quarkus on the JVM has an interesting implementation of TX Memory, and it's integrated with the JTA global XA manager, which really gives it some cool capabilities.
Memory is allocated in pages, which are split into subpages, etc., and then you throw heaps, queues, linked lists and so on at it.
None of this structure exists in the memory itself. So you need transactions at the page level, which is too crude, or at the byte level, which is too granular.
Hardware is simply the wrong level to solve this problem at. Unless the hardware starts understanding how we use memory way, way, way, way better than it does right now.
You have misunderstood what transactional memory is. Think of it as a database transaction applied to memory operations. Just as a database transaction makes a collection of reads, writes, and computations based on them appear atomic with respect to other transactions, transactional memory generally aims to give a section of instructions transactional semantics for all of the loads, stores, and other instructions in the transaction. Most transactional memory proposals are not page-level granularity (though obviously most proposals have certain limits), and "byte level being too granular" doesn't even make sense.
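A minimal demonstration of those semantics, using Haskell's STM as a stand-in for any transactional-memory implementation: concurrent read-modify-write transactions conflict on individual `TVar`s rather than pages, and no update is ever lost.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.STM
import Control.Monad (forM_, replicateM_, when)

main :: IO ()
main = do
  counter <- newTVarIO (0 :: Int)
  done    <- newTVarIO (0 :: Int)
  forM_ [1 .. 4 :: Int] $ \_ -> forkIO $ do
    -- Each increment is a transaction: read, compute, write, commit;
    -- it is retried automatically if another thread committed a
    -- conflicting write in between.
    replicateM_ 1000 $ atomically $ modifyTVar' counter (+ 1)
    atomically $ modifyTVar' done (+ 1)
  -- Block until all four workers have finished.
  atomically $ do
    d <- readTVar done
    when (d < 4) retry
  readTVarIO counter >>= print  -- always 4000, never a lost update
```

With plain non-atomic loads and stores, the same four threads could interleave and lose increments; the transactional version is race-free by construction.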
PDF available here: https://homes.cs.washington.edu/~djg/papers/analogy_oopsla07...