In the scaling law comparison, I wonder if it is reasonable to compare the number of parameters across Llama, Mamba, RWKV, and xLSTM? Isn't compute time more relevant? E.g., in the figure about scaling laws, replace the number of parameters with compute time.
Specifically, the sLSTM still has recurrence (memory mixing) in it, i.e. you cannot fully parallelize the computation. So scaling up a Transformer could still look better when you look at compute time.
It seems neither the code nor the model params are released. I wonder if that will follow.
Disclaimer: I'm a shared first author of this paper.
As a clarification: the training speed will be on par with FlashAttention-2 when fully optimized and only including the mLSTM. For decoding/inference both are very close to Mamba as xLSTM is a recurrent architecture. The sLSTM has memory mixing, that is, state-tracking capabilities, for problems that Transformers and State Space Models (and any other sequence-parallelizable architecture) fundamentally cannot solve.
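To illustrate with a scalar caricature (my own toy equations, not the paper's exact formulation): when the gates depend only on the current input, the recurrence is linear in the state and can be computed without a step-by-step loop; once a gate sees the previous hidden state (memory mixing), that trick is gone.

```python
import numpy as np

def mlstm_style_sequential(x, f, i):
    # Gates f, i depend only on the input at each step, so the
    # recurrence c_t = f_t * c_{t-1} + i_t * x_t is a linear scan.
    c, out = 0.0, []
    for t in range(len(x)):
        c = f[t] * c + i[t] * x[t]
        out.append(c)
    return np.array(out)

def mlstm_style_parallel(x, f, i):
    # Same recurrence without a step-by-step loop, using the closed
    # form c_t = F_t * sum_{s<=t} i_s * x_s / F_s with F_t = f_1...f_t.
    # Dividing by the cumulative product is numerically fragile for
    # long sequences; real kernels would work chunk-wise instead.
    F = np.cumprod(f)
    contrib = i * x / F
    return F * np.cumsum(contrib)

def slstm_style(x, W, R):
    # Here the gate sees the previous hidden state (memory mixing),
    # so no such closed form exists and the loop is unavoidable.
    h = 0.0
    for t in range(len(x)):
        g = 1.0 / (1.0 + np.exp(-(W * x[t] + R * h)))  # gate uses h_{t-1}
        h = g * h + (1.0 - g) * x[t]
    return h
```

The parallel form is what lets mLSTM-style layers approach attention-like training speed in principle, while the sLSTM-style update stays inherently sequential.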
Can you opine on how the model will fare on hardware that is optimized for transformers? There is so much investment in accelerating the transformer architecture [1][2]. Will xLSTM / sLSTM benefit as well, or will the hardware optimizations give transformers enough of an advantage that it's hard to compete on general-purpose hardware?
1. https://www.etched.com/
2. https://www.embedded.com/ai-chip-features-hardware-support-f...
So does anything do proper state tracking? And don't point to the OP, since purportedly better new architectures very often end up being basically vaporware (like Mamba or RWKV, which still don't have good-quality pretrained models yet).
Surely whether a big model using a certain architecture exists is only a matter of the choices of those with sufficient resources to train it. That reflects their beliefs, not actual model performance.
Congratulations on the paper. That's some very interesting work!
But you would want to include the sLSTM as well to get the best performance, right? How does the speed compare in that case, specifically when scaling up?
Thank you! I can say that it is not really a diminishing factor at the scales reported in the paper. So, xLSTM[7:1] is pretty much on par with xLSTM[1:0] in speed.
We show that it is helpful on toy tasks, and it shows even better sequence extrapolation performance, so yes.
Great work! I'd love to start using the language-model variant of your work. Do you know when/if it will be open-sourced? I'd start using it today if it were available.
You mainly got it right. Usually one does have many scalar 'c' cells that talk to each other via memory mixing. For the sLSTM, you group them into heads, with cells talking only to cells within the same head. The reason we referred to scalar cells here is that these are the fundamental building block. Many of them can be, and usually are, combined, and vector notation is useful in that case.
For the matrix 'C' state, there are also heads/cells in the sense that you have multiple of them, but they don't talk to each other. So yes, you can view that as a 3D tensor. And here, the matrix is the fundamental building block/concept.
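A shape-level numpy sketch of this reading (gates set to 1 and names my own, not the paper's exact gated update): each head carries its own matrix memory, heads never interact, and a key/value pair is stored as an outer product.

```python
import numpy as np

rng = np.random.default_rng(0)
num_heads, d = 4, 8

# One matrix memory per head; since heads never talk to each other,
# the full state is a (num_heads, d, d) tensor.
C = np.zeros((num_heads, d, d))

k = rng.normal(size=(num_heads, d))
k /= np.linalg.norm(k, axis=-1, keepdims=True)   # unit-norm keys
v = rng.normal(size=(num_heads, d))

# Simplified update (forget gate = 1, input gate = 1): store the
# key/value pair as an outer product, independently per head.
C += np.einsum('hd,he->hde', v, k)

# Querying with the stored key retrieves the stored value per head.
out = np.einsum('hde,he->hd', C, k)
```

With unit-norm keys, `out` recovers `v` exactly, which is the associative-memory view of the matrix state.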
> For decoding/inference both are very close to Mamba as xLSTM is a recurrent architecture
Can you explain this statement more if you have time? Are you saying the recurrent architecture of xLSTM enables fast inference on par with Mamba? Or does the xLSTM architecture slow it down so that its inference is as slow as Mamba's?
To clarify, is the sLSTM strictly necessary (to achieve better accuracy than those other architectures), or is the mLSTM good enough? The xLSTM[1:0] model in the paper seemed to do quite well.
Recurrence is less of an issue when training really large models than medium-sized ones. Medium-sized transformer models are generally not trained with sequence parallelism, but sequence parallelism is getting more common in transformer training. And sequence parallelism works the same for a transformer or a recurrent model.
For really large models, it is in fact easier to achieve peak flops, because the computation required scales faster than the memory bandwidth required (square vs. cube).
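One way to make the "square vs cube" claim concrete (my interpretation, not necessarily the parent's): for the dense n x n matmuls that dominate large models, flops grow cubically in n while the data that must move grows only quadratically, so arithmetic intensity rises with scale.

```python
def matmul_arithmetic_intensity(n, bytes_per_element=2):
    # Flops per byte moved for an n x n @ n x n matmul in fp16,
    # ignoring caching and kernel details (idealized model).
    flops = 2 * n ** 3                            # n^2 outputs, ~2n flops each
    bytes_moved = 3 * n ** 2 * bytes_per_element  # read A and B, write C
    return flops / bytes_moved
```

Doubling the matrix dimension doubles the flops available per byte moved, which is why bigger models find it easier to saturate the compute units rather than the memory bus.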
With sequence parallelism, do you mean increasing the batch size, i.e. the number of sequences in a batch?
> Medium sized transformer models are generally not trained with sequence parallelism, but sequence parallelism is getting more common with transformer training
Is there some word missing? You mean it's more common for large-sized Transformers?
> computation required scales faster than memory bandwidth required (square vs cube)
That is an interesting thought. I'm trying to understand what exactly you mean. You mean, computation time is in O(N^2) where N is the sequence length, while required memory bandwidth is in O(N^3)? Why is that?
No, it means dividing the sequence into multiple chunks and processing them one by one, very similar to recurrence. See [1]. Sequence parallelism is needed when the sequence can't fit on a single GPU. Sequence parallelism is the hardest parallelism, but it is required for longer sequences. Many models train with a smaller sequence length for the majority of training and switch to sequence parallelism for the last few percent.
[1]: https://arxiv.org/pdf/2105.13120
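A minimal sketch of that chunk-by-chunk pattern (toy recurrence and hypothetical names, just to show the carried state across chunk boundaries):

```python
def run_chunk(h, chunk, decay=0.9):
    # Toy recurrent layer: process one chunk, return the final state.
    for x in chunk:
        h = decay * h + x
    return h

def run_chunked(seq, chunk_len):
    # Split the long sequence into chunks and process them one by one,
    # carrying the state across chunk boundaries -- the same pattern
    # used when a sequence is too long to fit on a single device.
    h = 0.0
    for start in range(0, len(seq), chunk_len):
        h = run_chunk(h, seq[start:start + chunk_len])
    return h
```

Because the state is threaded through, the chunked run produces exactly the same result as processing the whole sequence at once.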
> Sequence parallelism is the hardest parallelism, but it is required for longer sequences
In terms of difficulty of implementation it's arguably much easier than pipeline parallelism, which I'd argue is the hardest kind (at least to implement it efficiently without bubbles), and takes the most lines of code to implement (especially in Jax, where sequence parallelism is almost trivial).
> Specifically, the sLSTM still has recurrence (memory mixing) in it, i.e. you cannot fully parallelize the computation.
If you mean that you cannot fully parallelize inference, this might be true but also not quite relevant since the computational demands of inference are low. And you can always "parallelize" training to some extent, just by training larger batches.
That was formulated a bit unclearly. It is not possible to parallelize along the sequence dimension during training, as it is for Transformers. Along the batch dimension you can always do it.
For those who don't know, the senior author on this paper (Sepp Hochreiter) was the first author on the original paper with Schmidhuber introducing LSTMs in 1997.
At least in biology, the first author of a paper is more often than not just a pair of gifted hands who did the experiments and plotted the graphs. It doesn't always follow that they become good PIs later (though these papers give them their chance).
I like the color-coded equations; I wish they would become a thing. We have syntax highlighting for programming languages; it's time we had it for math too.
https://betterexplained.com/articles/colorized-math-equation...
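LaTeX can already fake a crude version of this with the xcolor package; a minimal sketch (the role-to-color mapping here is my own invention, not any established convention):

```latex
\documentclass{article}
\usepackage{xcolor}
\begin{document}
% Color by semantic role: state in blue, gates in red, input in green.
\[
  \textcolor{blue}{c_t} =
  \textcolor{red}{f_t}\,\textcolor{blue}{c_{t-1}} +
  \textcolor{red}{i_t}\,\textcolor{green!60!black}{x_t}
\]
\end{document}
```

The hard part is less the typesetting than agreeing on a shared palette, which is what syntax highlighting solved for code.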
The claim is something that will replace the transformer, a technology powering a good chunk of AI companies.
The paper's authors seem to be affiliated either with a public university or with Sepp Hochreiter's private company/lab, nx-ai.com: https://www.nx-ai.com/en/xlstm
Where is the code? What is the license? How are they earning money? Why publish their secret recipe? Will they not be replicated? How will the rewards be commensurate with the value their algorithm brings? Who will get money from this new technology?
Nope, they should not. It is academia, after all. How would you even do that in, say, pure mathematics? Concretely, I would love to know what the business plan/economic consideration behind Gowers's 1998 proof of Szemerédi's theorem using higher-order Fourier analysis would even look like.
Yes, they should. Academia and peer review are so corrupt, gamified, and low-quality that I'd literally trust capitalist parasites more than the current regime of "publish or perish" and citation cartels.
At least capitalists have something to fight over that's worth fighting for (money). Academics will bitterly fight over the dumbest, least important shit. There's a law (Sayre's law) about how the less something matters, the more political the fights over it will be.
I am certainly not going to defend peer review and its inherent flaws. I am also not sure "capitalists" or the market are always as efficient as one might hope or think. But that aside, to my point above: if capitalists were to optimize "money", as you say, how would that fix publishing? Firstly, how would they ascribe a monetary value to Gowers's 1998 paper and the few others that catapulted him to the Fields Medal? Are you saying these subjects do not matter because no one is bidding for these papers? I fear we would not have published Heisenberg's early papers or the discovery of penicillin if so.

And over what horizon would "capitalists" optimize that monetary value (internal IRR)? Governments usually have to step in for long-term IRR projects (e.g. the internet protocols' development was famously funded by DARPA, which also kept "deep learning" alive during the downturns when no one believed in short-term returns). The UK water system, and quite a few train services around the world, bear witness to the fact that even in a "capitalist" society, some long-term common benefits are hard to fund under the short-term IRR considerations even pension funds consider reasonable.

Taking that observation to its perverse conclusion: if you believe in "capitalists", then you could argue that the current imperfect review system is a side effect of capitalist societies' long-term research funding plan (universities, research grants, tax breaks for endowments, student grants, ...).
I just think knowledge sharing is not always compatible with financial interests. And the former, to me, is the public good that academia should attain. But you get no argument from me that peer review is broken. I struggle to think, though, of a better system and doubt "money" is it, tbh.
I don't understand at all what the monetary value of this algorithm should be.
The authors are positioning themselves as a company and not merely academics:
A video of Sepp Hochreiter from 6 months ago hyping xLSTM:
https://youtu.be/hwIt7ezy6t8?feature=shared&t=561 in which he states his intent to raise €300M to build a European alternative to OpenAI's GPT for niche domains, thanks to this new method that will allow training cheaper and better.
He recently (2023) received €35,000 in prize money at the 5th annual German AI Award.
https://www.jku.at/en/festival-university/media/detail/news/...
If you are asking that question, I guess you must have wondered about this for years, right? Nearly a decade, in fact. I mean, why would Google have bought DeepMind, with them publishing in peer-reviewed journals for years after? Same for Meta (formerly Facebook)? I think there is a well-trodden path being followed here, and I am surprised by your surprise.
Acquisitions like DeepMind's are usually a way to hire talent. It can make sense when the technology is new, and getting a few years of lead time on what is going to be a growing market may make some financial sense.
In this specific xLSTM case, the industry has matured; they are just one among many (Mamba, SSMs, transformer variants, ...), and they have already been sitting on it for at least 6 months. I don't see what their play is.
Another interesting case study is the authors of the Adam paper, https://arxiv.org/abs/1412.6980 (awarded "2020: The Adam optimization paper is the world's #1 most cited scientific paper of the past five years"). Probably a few (10? 100?) billions' worth of value created. You can find the authors' bios at http://dpkingma.com/ and https://jimmylba.github.io/
I think there is a huge problem with the capture and sharing of value in the whole deep-learning industry. Academia's naivety plays a role in it: generational-shift technologies are badly rewarded, and incremental-shift technologies aren't rewarded at all.
Powerful technologies go into many hands with low rewards for their creators, while the value they generate keeps going to the same pockets. That's a recipe for disaster.
It will be a fun thing to come back to in a few years and see how it has unfolded.
There is a lot to unpack. But let's start with your first point. If the acquisition of DeepMind was just a talent acquisition, why continue to let them publish?
Your second point: how did you get the impression that this market is "mature"? And, going back to the first point, which market do you actually mean to have matured?
Regarding value creation/capturing/sharing and academic naivety: this industry is no different from any other, nor have basic economics changed. Deep learning is an amazingly powerful new technology that has the potential to change the world. But how you make products/services out of it that we all value and pay for, and that thus provide the basis of employment, is the usual risk/reward cycle ANY business has to subject itself to. More belief in the technology means more investors willing to fund businesses that have negative free cash flow for longer. Yes, the competitive landscape seems stacked against new entrants, but that was no different when today's tech behemoths started. And yes, as with any industry, monopolies are not great and, according to Kara Swisher, maybe tech at large today is an unhealthy monopoly.
Will this technology be encumbered by patents/licenses? I guess it is most likely already patented (or very close), and you will need a license. xLSTM is not open source.
Can someone ELI5 this? Reading the comments, it sounds like it's going to replace the transformers that LLMs are based on? Is it something exponentially better than current tech at scale?
LSTMs are a recurrent architecture for neural networks, meaning that your output depends both on your current input and your previous output. This is similar to how language works, as the next word in your sentence must fit both the idea you're trying to convey (your input) and the words you've said up until now (your previous output).
LSTMs were very popular for a while (I think the first good version of Google Translate used them), but they had two critical downsides: their performance went down with longer outputs, and they were a bit annoying to parallelize, because computing the output for the 10th word required first computing the outputs of the previous 9 words, so there was no way to use 10 parallel computers. The first problem was solved with attention, a scaffolding method that prevented degradation over longer sequences. Eventually someone realized that attention was doing most of the heavy lifting, built an attention-only network that could be easily parallelized (the Transformer), and LSTMs lost the top place.
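The sequential bottleneck is easy to see in code; a toy recurrence with the same structure (not a real LSTM, which uses gates, but the dependency pattern is identical):

```python
import math

def recurrent_outputs(inputs):
    # Each output depends on the previous one, so step t cannot start
    # before step t-1 has finished -- this is exactly why you cannot
    # hand word 10 to a second machine while word 9 is still running.
    h, outputs = 0.0, []
    for x in inputs:
        h = math.tanh(0.5 * h + x)   # toy update mixing old state and input
        outputs.append(h)
    return outputs
```

Because the state threads through every step, even reordering two inputs changes the final state, which is the order-sensitivity that makes the loop irreducible.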
Are xLSTMs better? On paper I'd say they could be - they seem to have a solid theory and good results. Will they dethrone Transformers? My guess is no, as it wouldn't be the first time that the "better" technology ends up losing against whatever is popular. Having said that, it is entirely possible that some inherently recurrent tasks like stock price prediction could get a boost from this technology and they may find their place.
They reference "a GPT-3 model with 356M parameters"
So GPT-3 Medium (from the GPT-3 paper). It feels pretty disingenuous to list it that way, as no one means that model when they say "GPT-3"; they mean the 175B model.
I wasn't aware that a model of that size (356M) was released. What am I missing here?
I also think it's relatively well understood that (with our current methods) transformers have a tipping point in parameter count, and I don't know of any models under ~3B that are useful (arguably 7B).
Phi-3 mini is surprisingly capable given its size. You can teach small transformers to do specific things well; you just can't have good general-purpose small models.
The point still stands: Phi-3 is an excellent model and shows that good models don't need that many parameters.
You should see the work on ReFT coming from Manning's group, showing that you can instruction-finetune models by modifying something like 0.00001% of the parameters. By doing it this way, you significantly mitigate the risk of catastrophic forgetting.
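Roughly, ReFT-style methods freeze the base model and learn a tiny low-rank edit of the hidden representation. A numpy sketch of that flavor of intervention (shapes, names, and the update are my simplified reading of LoReFT, not the paper's exact method):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 768, 4          # hidden size and intervention rank (hypothetical)

# The base model stays frozen; only these tiny parameters would train.
R = np.linalg.qr(rng.normal(size=(d, r)))[0].T   # (r, d), orthonormal rows
W = 0.01 * rng.normal(size=(r, d))
b = np.zeros(r)

def reft_intervene(h):
    # Edit the representation inside an r-dimensional subspace,
    # leaving the remaining d - r directions of h untouched.
    return h + R.T @ (W @ h + b - R @ h)

trainable_params = R.size + W.size + b.size   # vs. billions in the base model
```

Even against a modest 7B-parameter base model, the trainable fraction here is well under a millionth of the total, which is the scale of savings the comment refers to.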
I think it's a fine name. The prefix ensures that people don't confuse it with vanilla LSTMs. Also, I'm fairly certain that they must've considered LSTM++ and LSTM-XL.
Another week, another paper that thinks it can revive recurrent networks. Although this time the father of the LSTM is a co-author, so this paper should not come as a surprise. Sadly, the results seem to indicate that even by employing literally all the tricks of the trade, their architecture can't beat the throughput of FlashAttention (not by a long shot, but that is not surprising for recurrent designs), and, on top of that, it is even slower than Mamba, which offers similar accuracy at lower cost. So my money is on this being another DOA architecture, like all the others we've seen this year already.
To put another perspective on this: many modern advancements in ML/AI, and especially in computer graphics, have come from ideas from the '70s and '80s that were published, forgotten, and revived, because the underlying dependencies change, like the profile of the hardware of the day. So just let the ideas flow; not every paper has to have an immediate payoff.
To be fair, Hochreiter seems pretty confident that this will be a success.
He stated in interviews "Wir werden das blöde GPT einfach wegkicken" (roughly: We will simply kick silly GPT off the pitch) and he just founded a company to secure funding.
Interesting times.
With all due respect for his academic accomplishments, confidence in this domain in the current climate is usually a signal towards potential investors; it can be backed by anything between solid work (as I hope this turns out to be) and a flashy slide deck combined with a questionable character.
Being a researcher at a public university in a country that doesn't exactly splurge on this kind of research he has to get creative to get any meaningful amount of funding.
To say the least. It's a bit unfortunate that there is about zero culture in the EU around moonshot projects, compared to Silicon Valley. I've tried a couple of times to get money from government grants for (yet another...) foundational AI model, neuroscience-inspired, but the grants seem to go almost exclusively to well-developed industrial companies that now want some free money to "leverage" ChatGPT in their existing internal processes. And being still in the research phase, the more risk-averse VCs here are not touching stuff like this either.
So I guess what's left is doing these grand proclamations that you are going to "knock the crown off OpenAI" etc. Though, some sort of vision is good to have for sure :)
The benchmarking done in Table 1 is extremely questionable. Their table basically contradicts the results from multiple peer-reviewed papers, especially for RNNs, which report results much closer to baseline transformers (and conducted much larger experiments, btw).
On page 40 they mention that all models are trained with the same learning rate for comparability.
> That contradicts their own scaling-laws table, which uses a different lr for different models.
> And no, it is not a fair comparison to use the same lr to test all these different models. The benchmarking results just look like they are using hyperparameters tuned for their model, which happen not to work for other models.
Are you saying this is obvious because people have published the exact same benchmarks which are 100% comparable in journals? If so where are they? I have seen quite a few published benchmarks that could not quite be reproduced, tbh. So, again, what makes this "obvious" to you?
I thought it was common knowledge that architecture comparisons in papers aren't worth the paper they're printed on; there are so many ways to deliberately or accidentally structure things to favour one architecture over the others. Ultimately the LMSYS Chatbot Arena will be the final judge.
True, but they normally aren't this far off.
HGRN claims to outperform the transformer for a 1B-parameter model trained on the Pile. HGRN performing 8 ppl worse here suggests this benchmark is useless.
My experience: many are far off, and most of the time the published tables of different papers are hard to compare. If you assert here that these results are flawed, I would like to see more substance (code, reproductions, ...). And for balance, for the same reason, it is hard to verify the accuracy of these results without further insight.
The results of xLSTM are promising but will need larger-scale experiments.
However, they completely messed up the benchmarking experiments for various RNN models, which in their own papers claim comparable and even better performance than the base transformer.
Can you summarise how the model in your paper differs from this implementation of xLSTM ?
https://github.com/huggingface/transformers/issues/27011
Unless you give them chain of thought, in which case they do great.
So in the mLSTM, each unit of the vector c is now a matrix (so a 3D tensor)? And we refer to each matrix as a head? I'm having a bit of trouble understanding this fundamental part.
Or is it just an academic tactic to get more funding? To extract more work from PhD students by making them think they are going to strike it big?
How are they intending to build a moat if they publish their papers? Will this technology be encumbered by patents/licenses?
Compare these benchmarks to, say, the RWKV 5/6 paper https://arxiv.org/abs/2404.05892
The name xLSTM reminds me of the time in the late eighties when my university professor got accepted to give a presentation on WOM: write-only memory.
Someone gathered most of the available information here: https://github.com/AI-Guru/xlstm-resources
RWKV-v6 > RWKV-v5 > RWKV-v4, not the other way round, obviously. HGRN 8 ppl worse than baseline transformers? It's a NeurIPS 2023 spotlight paper, btw.