xLSTM: Extended Long Short-Term Memory

(arxiv.org)

197 points | by mauricesvp 11 days ago

10 comments

  • albertzeyer 11 days ago
    It seems Sepp Hochreiter has been talking about this model since Oct 2023: https://github.com/huggingface/transformers/issues/27011

    In the scaling law comparison, I wonder if it is reasonable to compare the number of parameters between Llama, Mamba, RWKV and xLSTM? Isn't compute time more relevant? E.g. in the figure about scaling laws, replace number of params with compute time.

    Specifically, the sLSTM still has recurrence (memory mixing) in it, i.e. you cannot fully parallelize the computation. So a scaled-up Transformer could still look better when you look at compute time.

    It seems neither the code nor the model params are released. I wonder if that will follow.

    • korbip 11 days ago
      Disclaimer: I'm a shared first author of this paper.

      As a clarification: the training speed will be on par with FlashAttention-2 when fully optimized and only including the mLSTM. For decoding/inference, both are very close to Mamba, as xLSTM is a recurrent architecture. The sLSTM has memory mixing, i.e. state-tracking capabilities, for problems that Transformers and State Space Models (and any other sequence-parallelizable architecture) fundamentally cannot solve.
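
      Roughly, one decoding step of the mLSTM looks like this (a simplified sketch: gate computation, stabilization and the multi-head dimension are omitted). The point is that the state is a fixed-size matrix, so per-token cost does not grow with context length:

      ```python
      import numpy as np

      def mlstm_decode_step(C, n, q, k, v, i_gate, f_gate):
          """One recurrent mLSTM-style step (simplified sketch).
          C: (d, d) matrix memory, n: (d,) normalizer state,
          i_gate/f_gate: scalar input/forget gates."""
          C = f_gate * C + i_gate * np.outer(v, k)   # matrix memory update
          n = f_gate * n + i_gate * k                # normalizer update
          h = (C @ q) / max(abs(n @ q), 1.0)         # normalized read-out
          return C, n, h

      # toy decoding loop: constant work per token, no KV cache growing with length
      d = 8
      C, n = np.zeros((d, d)), np.zeros(d)
      for _ in range(16):
          q, k, v = (np.random.randn(d) for _ in range(3))
          C, n, h = mlstm_decode_step(C, n, q, k, v, i_gate=1.0, f_gate=0.9)
      ```

      For training, the same update can be unrolled into a parallel form over the whole sequence, which is where the FlashAttention-2 comparison applies.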

      • brookst 11 days ago
        Congrats on the paper, very interesting.

        Can you opine on how the model will fare on hardware that is optimized for transformers? There is so much investment in accelerating the transformer arch[1][2], will xLSTM / sLSTM benefit as well, or will the hardware optimizations give transformers enough of an advantage that it’s hard to compete on general purpose hardware?

        1. https://www.etched.com/

        2. https://www.embedded.com/ai-chip-features-hardware-support-f...

      • deepnet 11 days ago
        Fascinating work, very promising.

        Can you summarise how the model in your paper differs from this implementation of xLSTM ?

        https://github.com/huggingface/transformers/issues/27011

        • korbip 7 days ago
          Thanks! I don't see any implementation there. In any case, we are planning a code release soon.
      • WithinReason 11 days ago
        Can you expand on the "cannot solve fundamentally" part?
        • lucidrains 11 days ago
          • Der_Einzige 10 days ago
            So does anything do proper state tracking? And don't point to the OP, since very often purportedly better new architectures end up being basically vaporware (like Mamba or RWKV, which still don't have good quality pre-trained models yet)
            • impossiblefork 10 days ago
              How do you mean vaporware?

              Surely whether a big model using a certain system exists is only a matter of the choices of those with sufficient resources to train it. That's about their beliefs, not about actual model performance.

        • thomasahle 10 days ago
          Transformers and SSMs can't do long computations that are inherently sequential.

          Unless you give them chain of thought. In which case they do great.
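
          The textbook toy case of such a task is parity of a long bit string: the answer after t bits depends on the answer after t-1 bits, so a fixed number of parallel layers can't shortcut the dependency, while a recurrent model only needs to carry one bit of state (the sketch below is just that toy example, nothing from the paper):

          ```python
          def parity(bits):
              """State tracking over a sequence: the state after bit t depends on
              the state after bit t-1, so the loop is inherently sequential."""
              state = 0
              for b in bits:        # each step folds one more bit into the running state
                  state ^= b
              return state

          assert parity([1, 0, 1, 1]) == 1
          ```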

      • albertzeyer 11 days ago
        Congratulations on the paper. That's some very interesting work!

        But you would want to include the sLSTM as well to get the best performance, right? How does the speed compare in that case? Specifically when scaling up.

        • korbip 11 days ago
          Thank you! I can say that it is not really a diminishing factor at the scales reported in the paper. So, xLSTM[7:1] is pretty much on par with xLSTM[1:0] in speed. We show that it is helpful on toy tasks, and it shows even better sequence extrapolation performance, so yes.
      • goldemerald 11 days ago
        Great work! I'd love to start using the language model variant of your work. Do you know when/if it will be open-sourced? I'd start using it today if it were released that soon.
      • hh1 10 days ago
        When you talk about "c" or "scalar memory" in the paper, does that refer to a single unit in the vector usually referred to as c?

        So in mLSTM, each unit of the vector c is now a matrix (so a 3d tensor)? And we refer to each matrix as a head?

        Having a bit of an issue understanding this fundamental part.

        • korbip 7 days ago
          You mainly got it right. Usually one has many scalar 'c' cells that talk to each other via memory mixing. For the sLSTM, you group them into heads, so cells only talk to other cells within the same head. The reason we referred to scalar cells here is that they are the fundamental building block; many of them can be (and usually are) combined, and vector notation is convenient for that.

          For the matrix 'C' state there are also heads/cells in the sense that you have multiple of them, but they don't talk to each other. So yes, you can view that as a 3D tensor. Here, the matrix is the fundamental building block / concept.
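
          With made-up sizes, the shapes look roughly like this (ignoring gates and the rest of the cell update):

          ```python
          import numpy as np

          n_heads, d_head = 4, 16                             # hypothetical sizes

          # sLSTM: each head holds a vector of scalar 'c' cells; memory mixing is a
          # recurrent matrix acting only within a head (block-diagonal overall)
          c_slstm = np.zeros((n_heads, d_head))               # heads x cells
          R_mix = np.random.randn(n_heads, d_head, d_head)    # one mixing matrix per head
          mixed = np.einsum('hij,hj->hi', R_mix, c_slstm)     # cells only see their own head

          # mLSTM: each head's state 'C' is itself a matrix, so the full state is a
          # 3D tensor; the heads never talk to each other (no memory mixing)
          C_mlstm = np.zeros((n_heads, d_head, d_head))
          ```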

      • SpaceManNabs 11 days ago
        > For decoding/inference both are very close to Mamba as xLSTM is a recurrent architecture

        Can you explain this statement more if you have time? Are you saying the recurrent architecture of xLSTM enables fast inference on par with Mamba? Or does the xLSTM architecture slow it down so that its inference is as slow as Mamba?

      • logicchains 11 days ago
        To clarify, is the sLSTM strictly necessary (to achieve better accuracy than those other architectures), or is the mLSTM good enough? The xLSTM[1:0] model in the paper seemed to do quite well.
        • korbip 11 days ago
          For language in general it seems fine. But there might be specific tasks where it is necessary indeed.
    • YetAnotherNick 11 days ago
      Recurrence is less of an issue with really large model training than it is with medium-sized models. Medium-sized transformer models are generally not trained with sequence parallelism, but sequence parallelism is getting more common with transformer training. And sequence parallelism is the same for a transformer or a recurrent model.

      For really large models, it is in fact easier to achieve peak flops because the computation required scales faster than the memory bandwidth required (square vs cube).

      • albertzeyer 11 days ago
        With sequence parallelism, do you mean increasing the batch size, i.e. the number of sequences in a batch?

        > Medium sized transformer models are generally not trained with sequence parallelism, but sequence parallelism is getting more common with transformer training

        Is there some word missing? You mean it's more common for large-sized Transformers?

        > computation required scales faster than memory bandwidth required (square vs cube)

        That is an interesting thought. I'm trying to understand what exactly you mean. You mean, computation time is in O(N^2) where N is the sequence length, while required memory bandwidth is in O(N^3)? Why is that?

        • YetAnotherNick 11 days ago
          No, it means dividing the sequence into multiple chunks and processing them one by one, very similar to recurrence. See [1]. Sequence parallelism is needed when the sequence can't fit in a single GPU. Sequence parallelism is the hardest parallelism, but it is required for longer sequences. Many models just train with a smaller sequence length for the majority of training and switch to sequence parallelism for the last few percent of training.

          [1]: https://arxiv.org/pdf/2105.13120
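
          Roughly the idea, as a toy single-GPU sketch (the systems in [1] distribute these chunks across devices instead of looping over them, but the chunk-by-chunk bookkeeping is similar):

          ```python
          import numpy as np

          def chunked_causal_attention(x, chunk_size):
              """Toy single-head causal attention computed chunk by chunk,
              carrying the keys/values of earlier chunks forward."""
              T, d = x.shape
              past_k, past_v, outputs = [], [], []
              for start in range(0, T, chunk_size):
                  q = x[start:start + chunk_size]
                  past_k.append(q); past_v.append(q)          # identity projections for brevity
                  K, V = np.concatenate(past_k), np.concatenate(past_v)
                  scores = q @ K.T / np.sqrt(d)
                  # position start+i may only attend to positions <= start+i
                  causal = np.tril(np.ones((q.shape[0], K.shape[0])), k=start)
                  scores = np.where(causal > 0, scores, -np.inf)
                  w = np.exp(scores - scores.max(axis=-1, keepdims=True))
                  w /= w.sum(axis=-1, keepdims=True)
                  outputs.append(w @ V)
              return np.concatenate(outputs)

          out = chunked_causal_attention(np.random.randn(1024, 32), chunk_size=256)
          ```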

          • logicchains 11 days ago
            > Sequence parallelism is the hardest parallelism, but it is required for longer sequences

            In terms of implementation difficulty it's arguably much easier than pipeline parallelism, which I'd argue is the hardest kind (at least to implement efficiently without bubbles) and takes the most lines of code, especially in Jax, where sequence parallelism is almost trivial.

    • zozbot234 11 days ago
      > Specifically, the sLSTM has still recurrence (memory mixing) in it, i.e. you cannot fully parallelize the computation.

      If you mean that you cannot fully parallelize inference, this might be true but also not quite relevant since the computational demands of inference are low. And you can always "parallelize" training to some extent, just by training larger batches.

      • korbip 11 days ago
        This was formulated a bit unclearly. For training, it is not possible to parallelize over the sequence dimension the way it is for Transformers. Over the batch dimension you can always do it.
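
        A toy illustration of the difference (a generic RNN update, not the actual sLSTM equations): the loop over time cannot be removed because step t consumes the state from step t-1, but each step is still a single matmul over the whole batch:

        ```python
        import numpy as np

        B, T, d = 32, 128, 64                    # batch size, sequence length, state size
        x = np.random.randn(B, T, d)
        W = np.random.randn(d, d) * 0.1
        R = np.random.randn(d, d) * 0.1          # recurrent ("memory mixing") weights

        h = np.zeros((B, d))
        for t in range(T):                       # sequential over time: needs h from step t-1
            h = np.tanh(x[:, t] @ W + h @ R)     # ...but parallel over the whole batch
        ```
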
  • KhoomeiK 11 days ago
    For those who don't know, the senior author on this paper (Sepp Hochreiter) was the first author on the original paper with Schmidhuber introducing LSTMs in 1997.
    • ramraj07 11 days ago
      At least in biology, the first author of a paper is more often than not just a pair of gifted hands who did the experiments and plotted the graphs. It doesn't always follow that they become good PIs later (though they do get their chances from these papers).
      • cdavid 11 days ago
        In ML, it generally is ordered from most contributor to least contributor, w/ heads of the lab last.
      • querez 10 days ago
        In this specific case, it's fairly well known that Hochreiter was the major brain behind the original LSTM.
  • WithinReason 11 days ago
    I like the color coded equations, I wish they would become a thing. We have syntax highlighting for programming languages, it's time we have it for math too.
  • GistNoesis 11 days ago
    Can someone explain the economics behind this?

    The claim is that this will replace the transformer, a technology powering a good chunk of AI companies.

    The paper's authors seem to be either from a public university or from Sepp Hochreiter's private company/lab, nx-ai.com: https://www.nx-ai.com/en/xlstm

    Where is the code? What is the license? How are they earning money? Why publish their secret recipe? Will they not be replicated? How will the rewards be commensurate with the value their algorithm brings? Who will get money from this new technology?

    • imjonse 11 days ago
      Should all arxiv papers be backed by economic considerations or business plans?
      • AIsore 11 days ago
        Nope, they should not. It is academia after all. How would you even do that in, say, pure mathematics? Concretely, I would love to know what the business plan/economic consideration of Gowers' 1998 proof of Szemerédi's theorem using higher-order Fourier analysis would even look like.
        • queuebert 11 days ago
          Coming soon to an HFT firm near you ...
      • Der_Einzige 10 days ago
        Yes they should. Academia and peer review are so corrupt, gamified, and low quality that I'd literally trust capitalist parasites more than the current regime of "publish or perish" and citation cartels.

        At least capitalists have something to fight over that’s worth fighting for (money). Academics will bitterly fight over the dumbest, least important shit. There’s a law about how the less something matters, the more political the fights over it will be.

        • AIsore 10 days ago
          I am certainly not going to defend peer review and its inherent flaws. I am also not sure "capitalists" or the market are always as efficient as one might hope or think. But that aside, to my point above: if capitalists were to optimize for "money" as you say, how would that fix publishing? Firstly, how would they ascribe a monetary value to Gowers' 1998 paper and the handful of others that catapulted him to the Fields Medal? Are you saying these subjects do not matter because no one is bidding for these papers? I fear we would not have published Heisenberg's early papers or the discovery of penicillin if so.

          And over what horizon would "capitalists" optimize that monetary value (internal IRR)? Governments usually have to step in for long-term IRR projects (e.g. the internet protocol's development was famously funded by DARPA, and they kept "deep learning" alive during the downturns when no one believed in short-term returns ...). The UK water system and quite a few train services around the world bear witness to the fact that even in a "capitalist" society, some long-term common benefits are hard to fund with the short-term IRR considerations even pension funds consider reasonable. Taking that observation to its perverse conclusion, if you believe in "capitalists" then you could argue that the current imperfect review system is a side effect of capitalist societies' long-term research funding plan (universities, research grants, tax breaks for endowments, student grants, ...).

          I just think knowledge sharing is not always compatible with financial interests, and the former, to me, is the public good that academia should attain. But you get no argument from me that peer review is broken. I struggle to think, though, of a better system, and doubt "money" is it, tbh.
      • jampekka 11 days ago
        Or any?
    • refulgentis 11 days ago
      Are you asking how academics make money from giving away knowledge in papers? It's complicated
      • GistNoesis 11 days ago
        I don't understand at all what the monetary value of this algorithm should be.

        The authors are positioning themselves as a company and not merely academics:

        A video of Sepp Hochreiter from 6 months ago hyping xLSTM:

        https://youtu.be/hwIt7ezy6t8?feature=shared&t=561 - in which he states his intent to raise €300M to build a European alternative to OpenAI's GPT for niche domains, thanks to this new method that will allow training cheaper and better.

        In 2023 he received €35,000 in prize money at the 5th annual German AI Award.

        https://www.jku.at/en/festival-university/media/detail/news/...

        Or is it just an academic tactic to get more funding? To extract more work from PhD students by making them think they are going to strike it big?

        How do they intend to build a moat if they publish their papers? Will this technology be encumbered by patents/licenses?

        • AIsore 10 days ago
          If you are asking that question, I guess you must have wondered about this for years, right? In fact nearly a decade? I mean, why would Google have bought DeepMind with them publishing in peer-reviewed journals for years after? Same for Meta (formerly Facebook)? I think there is a well-trodden path being followed here ... and I am surprised by your surprise.
          • GistNoesis 10 days ago
            Acquisitions like DeepMind's are usually a way to hire talent. It can make sense when the technology is new and getting a few years of lead time on what is going to be a growing market may make some financial sense.

            In this specific xLSTM case the industry has matured and they are just one among many (Mamba, SSMs, transformer variants...), and they have already been sitting on it for at least 6 months. I don't see what their play is.

            Another case study that's probably interesting is the authors of the Adam paper, https://arxiv.org/abs/1412.6980 (awarded "2020: The Adam optimization paper is the world's #1 most cited scientific paper of the past five years"). Probably a few (10? 100?) billions' worth of value created. You can find the authors' bios at http://dpkingma.com/ and https://jimmylba.github.io/

            I think there is a huge problem with the capture and sharing of value in the whole deep-learning industry. Academia's naivety plays a role in it: generational-shift technologies are badly rewarded, and incremental-shift technologies aren't rewarded at all.

            Powerful technologies going into many hands with low rewards for their creators, while the value they generate keeps going to the same pockets. That's a recipe for disaster.

            It will be a fun thing to come back to in a few years to see how it has unfolded.

            • AIsore 9 days ago
              There is a lot to unpack, but let's start with your first point. If the acquisition of DeepMind was just a talent acquisition, why continue to let them publish? Your second point: how did you get the impression that this market is "mature"? And, going back to the first point, which market do you actually mean has matured?

              Regarding value creation/capturing/sharing and academic naivety: this industry is no different from any other, nor has basic economics changed. Deep learning is an amazingly powerful new technology that has the potential to change the world. But how you make products/services out of it which we all value, pay for, and which thus provide the basis of employment is the usual risk/reward cycle ANY business has to subject itself to. More belief in the technology = more investors willing to fund businesses that have negative free cash flow for longer.

              Yes, the competitive landscape seems stacked against new entrants, but that is no different from when today's tech behemoths started. And yes, as with any industry, monopolies are not great and, according to Kara Swisher, maybe tech at large today is an unhealthy monopoly.
        • NOCompromisER 10 days ago
          > Will this technology be encumbered by patents/licenses?

          I guess it is most likely already patented (or very close), and you will need a license. xLSTM is not open source.
    • brookst 11 days ago
      I think you’re making a satirical point about how commercial R&D has far outstripped academia, but it’s not 100% clear.
  • smusamashah 11 days ago
    Can someone ELI5 this? Reading the comments it sounds like it's going to replace transformers, which LLMs are based on? Is it something exponentially better than the current tech at scale?
    • probably_wrong 11 days ago
      LSTMs are a recurrent architecture for neural networks, meaning that your output depends both on your current input and your previous output. This is similar to how language works, as the next word in your sentence must fit both the idea you're trying to convey (your input) and the words you've said up until now (your previous output).

      LSTMs were very popular for a while (I think the first good version of Google Translate used them) but they had two critical downsides: their performance went down with longer outputs, and they were a bit annoying to parallelize because computing the output for the 10th word required first computing the output of the previous 9 words - no way to use 10 parallel computers. The first problem was solved with Attention, a scaffolding method that prevented degradation over longer sequences. Eventually someone realized that Attention was doing most of the heavy lifting, built an attention-only network that could be easily parallelized (the Transformer), and LSTMs lost the top place.

      Are xLSTMs better? On paper I'd say they could be - they seem to have a solid theory and good results. Will they dethrone Transformers? My guess is no, as it wouldn't be the first time that the "better" technology ends up losing against whatever is popular. Having said that, it is entirely possible that some inherently recurrent tasks like stock price prediction could get a boost from this technology and they may find their place.
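
      To make the parallelization point concrete, here is a toy sketch (not the real equations of either architecture): with a recurrent cell the 10th output needs the 9th, while attention-style outputs for all positions fall out of a few matrix multiplications at once:

      ```python
      import numpy as np

      T, d = 10, 4
      x = np.random.randn(T, d)
      W, R = np.random.randn(d, d) * 0.1, np.random.randn(d, d) * 0.1

      # recurrent: output t needs output t-1, so the 10 steps cannot run in parallel
      h, recurrent_out = np.zeros(d), []
      for t in range(T):
          h = np.tanh(x[t] @ W + h @ R)
          recurrent_out.append(h)

      # attention-only: every position is computed from the inputs alone,
      # so all 10 outputs come out of a few matmuls that parallelize trivially
      scores = x @ x.T / np.sqrt(d)
      scores = np.where(np.tril(np.ones((T, T))) > 0, scores, -np.inf)   # causal mask
      weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
      weights /= weights.sum(axis=-1, keepdims=True)
      attention_out = weights @ x
      ```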

  • jasonjmcghee 11 days ago
    They reference "a GPT-3 model with 356M parameters"

    So GPT-3 Medium (from the GPT-3 paper) - it feels pretty disingenuous to list it that way, as nobody means that model when they say "GPT-3"; they mean the 175B one.

    I wasn't aware that size of model (356M) was ever released - what am I missing here?

    I also think it's relatively well understood that (with our current methods) transformers have a tipping point with parameter count, and I don't know of any models smaller than ~3B that are useful - arguably 7B.

    Compare these benchmarks to, say, the RWKV 5/6 paper https://arxiv.org/abs/2404.05892

    • CuriouslyC 11 days ago
      Phi-3 mini is surprisingly capable given its size. You can teach small transformers to do specific stuff well, you just can't have good general-purpose small models.
      • jasonjmcghee 11 days ago
        Totally. But they aren't fine-tuning these afaict - they're comparing general-purpose capabilities.
        • Der_Einzige 10 days ago
          The point still stands: Phi-3 is an excellent model and shows that good models don't need that many parameters.

          You should see the work on ReFT coming out of Manning's group, showing that you can instruction fine-tune models by modifying like 0.00001% of the parameters. By doing it this way, you significantly mitigate the risk of catastrophic forgetting.

  • elygre 11 days ago
    I have no idea about what this is, so going off topic:

    The name XLSTM reminds me of the time in the late eighties when my university professor got accepted to hold a presentation on WOM: write-only memory.

    • woadwarrior01 11 days ago
      I think it's a fine name. The prefix ensures that people don't confuse it with vanilla LSTMs. Also, I'm fairly certain that they must've considered LSTM++ and LSTM-XL.
    • pquki4 11 days ago
      I mean, if you look at it another way, XSLT is a real thing that gets used a lot, so I don't mind appending an M there.
  • sigmoid10 11 days ago
    Another week, another paper that thinks it can revive recurrent networks. Although this time the father of the LSTM is a co-author, so this paper should not come as a surprise. Sadly, the results seem to indicate that even by employing literally all the tricks of the trade, their architecture can't beat the throughput of FlashAttention (not by a long shot, but that is not surprising for recurrent designs) and, on top of that, it is even slower than Mamba, which offers similar accuracy at lower cost. So my money is on this being another DOA architecture, like all the others we've seen this year already.
    • l33tman 11 days ago
      To put another perspective on this: lots of modern advances in both ML/AI and especially computer graphics have come from ideas from the 70s-80s that were published, forgotten, and revived, because the underlying dependencies change, like the hardware profile of the day. So just let the ideas flow; not every paper has to have an immediate payoff.
      • KeplerBoy 11 days ago
        To be fair, Hochreiter seems pretty confident that this will be a success.

        He stated in interviews "Wir werden das blöde GPT einfach wegkicken" (roughly: We will simply kick silly GPT off the pitch) and he just founded a company to secure funding. Interesting times.

        Someone gathered most of the available information here: https://github.com/AI-Guru/xlstm-resources

        • imjonse 11 days ago
          With all due respect for his academic accomplishments, confidence in this domain in the current climate is usually a signal towards potential investors; it can be backed by anything between solid work (as I hope this turns out to be) and a flashy slide deck combined with a questionable character.
          • KeplerBoy 11 days ago
            Which is a legitimate stance.

            Being a researcher at a public university in a country that doesn't exactly splurge on this kind of research, he has to get creative to get any meaningful amount of funding.

            • l33tman 11 days ago
              To say the least. It's a bit unfortunate that there is about zero culture in the EU regarding moonshot projects compared to Silicon Valley. I've tried a couple of times to get money from government grants for (yet another..) neuroscience-inspired foundational AI model, but the grants instead seem to go almost exclusively to well-developed industrial companies that now want some free money to "leverage" ChatGPT in their existing internal processes.. and since the work is still in the research phase, the more risk-averse VCs here are not touching stuff like this either.

              So I guess what's left is doing these grand proclamations that you are going to "knock the crown off OpenAI" etc. Though, some sort of vision is good to have for sure :)

    • karalala 11 days ago
      Already seeing major flaws in the paper.

      The benchmarking done in Table 1 is extremely questionable. Their table basically contradicts the results from multiple peer-reviewed papers, especially for RNNs, which report results much closer to baseline transformers (and conducted much larger experiments, btw).

      On page 40 they mention that all models are trained with the same learning rate for comparability.

      > This contradicts their own scaling-laws table, which uses different learning rates for different models.

      > And no, it is not a fair comparison to test all these different models with the same learning rate. The benchmarking results just look like they use hyperparameters tuned for their model that happen not to work for the other models.

      • bingbingbing777 11 days ago
        You should publish a response paper and get them to retract their paper if it has major flaws.
        • karalala 11 days ago
          It's xLSTM contradicting existing peer-reviewed papers lmao. Either xLSTM should fix their benchmarks or the existing peer-reviewed papers should retract.

          RWKV-v6 > RWKV-v5 > RWKV-v4, not the other way round, obviously. HGRN 8 ppl worse than baseline transformers? NeurIPS 2023 spotlight paper, btw.

          • AIsore 10 days ago
            Are you saying this is obvious because people have published the exact same benchmarks, 100% comparable, in journals? If so, where are they? I have seen quite a few published benchmarks that could not quite be reproduced, tbh. So, again, what makes this "obvious" to you?
          • logicchains 11 days ago
            I thought it was common knowledge that architecture comparisons in papers aren't worth the paper they're printed on; there are so many ways to deliberately or accidentally structure things to favour one architecture over the others. Ultimately the LMSYS chatbot arena will be the final judge.
            • karalala 11 days ago
              True, but they normally aren't this far off. HGRN claims to outperform a transformer with a 1B-parameter model trained on the Pile. HGRN performing 8 ppl worse suggests that it's useless.
              • AIsore 10 days ago
                In my experience, many are far off, and most of the time the published tables of different papers are hard to compare. If you assert here that these results are flawed, I would like to see more substance (code, a reproduction, ...). And for balance, for the same reason, it's hard to verify the accuracy of these results without further insight.
      • rrr_oh_man 11 days ago
        Could you explain for a dum-dum?
        • karalala 11 days ago
          The results of xLSTM are promising but will need larger-scale experiments.

          However, they completely messed up the benchmarking experiments for various RNN models, which in their own papers claim comparable or even better performance than a base transformer.

          • AIsore 11 days ago
            These experiments seem pretty large already though, no? How are you so sure they messed up benchmarking? Is the code out already?
  • beAbU 11 days ago
    I thought this was some extension or enhancement to XSLT.