Jamba: Production-grade Mamba-based AI model

(maginative.com)

346 points | by bubblehack3r 30 days ago

23 comments

  • a_wild_dandan 30 days ago
    To those curious about the tradeoffs between transformer and state space model layers, I highly recommend Sasha Rush's video on it: https://www.youtube.com/watch?v=dKJEpOtVgXc
    • az226 29 days ago
      They use less memory for inference but remember the details less well. For instance if you’re implementing code and want edits, it will forget various functions to be part of the script. Even transformers aren’t perfect at this and SSMs are even worse. For many use cases, that ability isn’t needed as much so the memory savings is a bigger lever.
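
      As a rough sketch of where that memory saving comes from (all dimensions below are illustrative, roughly a 7B-class transformer, not any particular model's real config): a transformer's KV cache grows linearly with the number of tokens it has seen, while a Mamba/SSM layer carries a fixed-size state no matter how long the input gets.

        # Hypothetical dimensions, for illustration only.
        def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128, dtype_bytes=2):
            # 2x for keys and values, stored per layer and per token
            return 2 * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes

        def ssm_state_bytes(n_layers=32, d_model=4096, d_state=16, dtype_bytes=2):
            # fixed-size state per layer, independent of sequence length
            return n_layers * d_model * d_state * dtype_bytes

        for n in (4_000, 32_000, 256_000):
            print(f"{n:>7} tokens: KV cache ~{kv_cache_bytes(n) / 2**30:.1f} GiB, "
                  f"SSM state ~{ssm_state_bytes() / 2**20:.0f} MiB (constant)")

      The flip side is exactly what you describe: a constant-size state has to compress the whole history, so details like which functions were in the script can get squeezed out.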
  • eigenvalue 30 days ago
    Has anyone gotten this to work in linux using 1 or 2 4090s? I get stuck on "Loading checkpoint shards: 71%" and then it bails. But weirdly nvidia-smi shows plenty of VRAM available. My machine has 256gb of RAM so I don't think that's the problem either. Really excited to try this one.
  • Reubend 30 days ago
    It's great to see a full production-level model using Mamba. But when it comes to long-context-window benchmarks, I'd love to see performance as well as throughput. I was under the impression that Mamba has huge increases in throughput at the cost of modest losses in accuracy when using long contexts.
    • refulgentis 30 days ago
      I would too -- long context has been such a red herring across providers, Claude 3 is the first I've seen that seems to genuinely have some sort of qualitative leap in noticing things.

      It is worth noting I'm fairly sure there's no inherent theoretical decrease in accuracy at long contexts; the claimed theoretical change is an _increase_ in long-term accuracy at long contexts.

      • tempusalaria 30 days ago
        Every long context sucks right now. All the model providers benchmark on fact recall which is very limited. Actual ability to do anything complicated beyond 16k tokens is not present in any current model I have seen.
        • ukuina 29 days ago
          This is out of date. GPT-4-Turbo (128k) has lossless recall over the first 64k input tokens and produces output indistinguishable from GPT-4 (32k), though both are limited to 4k output tokens.

          Several downsides: Recall accuracy past the first 64k tokens suffers badly; Cost is astronomical; Response latency is too high for most interactive use-cases.

          I would point out the astounding leap in input context in just one year. Should we assume effectively-infinite (RAG-free) context in the near-future?

          • anoncareer0212 29 days ago
            This is grossly untrue, in a way that denotes surface-level familiarity on several fronts.

            You're referring to the needle-in-a-haystack retrieval problem.

            Which the person you're replying to explicitly mentioned is the only benchmark providers are using, for good reason.

            Consider the "translate Moby Dick to comedic zoomer" problem. This does not even come remotely close to working unless I do it in maximum chunks of 5,000 tokens.

            Consider the API output limit of 4096 tokens, across all providers.

            And no, you shouldn't assume effectively infinite (RAG free) context in the near future. This time last year, Anthropic was demonstrating 120,000 token context. It released 200K a few weeks ago. And runtime cost scales with N^2.

      • binalpatel 30 days ago
        Gemini 1.5 Pro is really good at long context in my experience.
        • neverokay 26 days ago
          It’s pretty good at blending the text chunks, though, up to a point. It’s like compression: after a while of passing in chunks, your continued summary is too generalized and you lose resolution.
      • Arthur_ODC 30 days ago
        Long context is great and all, but it sucks that all of these LLMs have really poor output length. If I feed something an entire book and ask for a comprehensive summary, then I'm expecting at least a full 3-page summary. I get that they try to force these things to be "concise" to save on compute, but good lord it's so annoying.
        • pedrovhb 30 days ago
          Have you tried asking it for a specific concrete length, like a number of words? I was also frustrated with concise answers when asking for long ones, but I found that the outputs improved significantly if I asked for e.g. 4000 words specifically. Further than that, have it break it down into sections and write X words per section.
          • Arthur_ODC 29 days ago
            Yes, all the possible length-extending custom instructions you can think of, plus multi-shot example prompts using multiple USER and GPT exchanges to define the format. I can get some reasonable-length responses out of it, but I've never seen them go over one page's worth. It seems like GPT-4 has a hard limit on how much it will output when you click "continue", and Claude Opus never goes over a page either. Another user pointed out using the API, which I have done in the past, but it's been a long while and I can't really justify the cost of using the advanced models via API for my general use.
            • refulgentis 29 days ago
              Everyone's coalescing at a max of 4096 tokens, i.e. 12 "pages", via API (a page being 250 words, which is one 8.5"x11" double-spaced page)

              To your point, it doesn't matter anyway; it's nigh impossible to get over 2K tokens of output with every trick and bit of guidance you can think of. (I got desperate when the 16K-output/48-page model came out and tried to "make it work"; even completely deforming tricks, like making it number each line and write a reminder on each line that it should write 1,000 lines, don't work.)

        • CuriouslyC 30 days ago
          That's a ChatGPT problem; if you hit the API it's not nearly so hard to get good output.
          • refulgentis 30 days ago
            I wouldn't say that; my latest big user story for making sure I'm handling huge inputs was "translate Moby Dick to zoomer". I can't give any service chunks larger than ~5K tokens over the API without it failing.

            (Miserably. Like, I'd be fine if it gave a paragraph back. But at least on this "map" task, there's a critical point where there's so much input that the model ends up imitating the input instead of chatting.)

    • samus 30 days ago
      This one should have you covered :-) One out of every eight layers is a traditional Transformer layer, which should ensure precision, at least over short distances.
      • swyx 29 days ago
        > which should ensure precision, at least over short distances.

        Why? I don't follow. Transformers should provide some attention over -all- distances, no? Why does layering truncate this to "short distances"?

        • samus 29 days ago
          I mean "short" in comparison to the unlimited, but lossy recall that the Mamba blocks provide. Transformers are limited to the context length, while Mamba can carry along state. While it can remember things from a lot farther back, it is limited and must thus eventually drop things and/or lose precision.
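
          A toy version of that recurrence makes the tradeoff concrete (this is a generic linear state-space update, not Mamba's actual selective-scan math, and the sizes are made up): the state is a small fixed vector that gets partially overwritten with every token, so information from arbitrarily far back can survive, but only in compressed form.

            import numpy as np

            d_state, d_in = 16, 8                       # tiny illustrative sizes
            A = np.eye(d_state) * 0.95                  # old information slowly decays
            B = np.random.randn(d_state, d_in) * 0.1    # how each new token writes into the state
            C = np.random.randn(d_in, d_state) * 0.1    # how the state is read out at each step

            h = np.zeros(d_state)
            for x_t in np.random.randn(100_000, d_in):  # 100k "tokens"; the state stays 16 numbers
                h = A @ h + B @ x_t                     # h_t = A h_{t-1} + B x_t
                y_t = C @ h                             # y_t = C h_t

            print(h.shape)                              # (16,) -- constant, unlike a growing KV cache

          A Transformer layer, by contrast, keeps every token's keys and values around: nothing inside the window is lost, and everything outside it is gone.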
  • skybrian 30 days ago
    > Jamba boasts an extensive context window of 256K tokens, equivalent to around 210 pages of text, while fitting up to 140K tokens on a single 80GB GPU.

    I realize this is a big improvement, but it’s striking how inefficient LLMs are, that you need 80GB of GPU memory to analyze less than 1 megabyte of data. That’s a lot of bloat! Hopefully there’s a lot of room for algorithmic improvements.

    • electric_mayhem 30 days ago
      It’s literally simulating a neural network.

      How much of your 5-sense experiential memories and decades of academic book learning are you bringing to understand my reply to your post?

      How many gigabytes do you think that’s equivalent to?

      • skybrian 30 days ago
        Jamba seems to be distributed as 21 5-gigabyte files [1] so I guess that’s another way of looking at it.

        [1] https://huggingface.co/ai21labs/Jamba-v0.1/tree/main

        • imtringued 29 days ago
          So what? I have seen models distributed as 26x 10GB files.
      • richardw 29 days ago
        It’s kind of simulating our brains, but not really. When I attempted to dig more into how neurons work, I realised that there's a massive chasm of difference. Very much worth doing if you haven’t (you might know far better than me; this is for people who don’t yet).

        In terms of results: our brains work with 20 W of power and can be trained to compete with LLMs using a tiny fraction of the world’s data. They also have to keep you breathing and your blood pumping and manage all the dangers of catching a ball near traffic. Or skiing, or poetry, or sunsets. And they remember stuff five minutes later and don’t need a training run that takes months.

        We have SO many opportunities to improve the AI architecture it’s ridiculous. This is a good thing.

        • reissbaker 29 days ago
          To be fair, most of the brain is more like a pretrained model: it isn't being trained at any point after conception to keep your blood pumping or your lungs working; it does that out of the box roughly as soon as you sprout those organs (or the minute you're born, in the case of lungs). The training process was billions of years of evolution. And, well, given fairly persistent cross-cultural cognitive biases, I expect the conscious thought parts are starting from a pretrained model, too, and all we're doing in school is finetuning ;)
        • imtringued 29 days ago
          People don't understand that to simulate a single neuron, you need an entire neural network. So 70 billion parameters might at best be equivalent to a million neurons but that is assuming that your neural network architecture is akin to the connections between neurons. Considering the physical sparsity, you might need even more parameters to model the connections of a biological neural network. So less than a million neurons in practice.
      • _false 30 days ago
        I love both parent posts' perspectives on this.
    • riku_iki 30 days ago
      > that you need 80GB of GPU memory to analyze less than 1 megabyte of data

      80GB is all of compressed human knowledge being applied to that 1MB.

    • pama 29 days ago
      The big (huge?) memory requirement is during training. These LLMs work with high-dimensional vectors, they calculate gradients with respect to those vectors, and they do updates that require the state of the optimizer. If you have 3 particles in 3 dimensions and you need their forces, that creates 3 new 3D vectors, and once you update their positions along the forces they also carry momenta. Now generalize this simple 3-body physics to the typical 60-layer creatures inside the LLM, with vectors of several thousand dimensions, interactions/weights that scale like the squares of these vectors, and a total parameter count in the tens to hundreds of billions; then take derivatives and start keeping track of momenta. It is a feat of modern engineering that some groups can train such models efficiently. I hope we will see more of the training stories becoming public in the near future.
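
      As a rough sketch of the bookkeeping (standard mixed-precision Adam accounting; this ignores activations, which add another large, sequence-length-dependent term):

        def training_bytes_per_param():
            return (2     # fp16 weights
                    + 2   # fp16 gradients
                    + 4   # fp32 master copy of the weights
                    + 4   # fp32 Adam first moment (momentum)
                    + 4)  # fp32 Adam second moment (variance)

        def inference_bytes_per_param(dtype_bytes=2):
            return dtype_bytes        # just the weights, e.g. fp16

        params = 52e9                 # e.g. a Jamba-sized model
        print(f"training:  ~{params * training_bytes_per_param() / 1e12:.2f} TB")
        print(f"inference: ~{params * inference_bytes_per_param() / 1e12:.2f} TB")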
      • nl 29 days ago
        This is wrong. You need big memory during inference too.

        The difference there is you can use tricks like quantisation and offloading to CPU to reduce it somewhat at the cost of accuracy and/or speed.

        • pama 28 days ago
          Not sure what you mean by wrong. I have never yet encountered a case where training an LLM (no matter the architecture) required less memory than inference; I was pointing out that the typical memory requirements for training are much higher than the typical requirements for inference.
        • brrrrrm 29 days ago
          Training takes roughly 3x the memory used by inference, and is usually run at a much larger batch size.
    • nl 29 days ago
      That’s all the world’s knowledge compressed into 80GB. It’s not analysing 1MB of data; it’s analysing all of that knowledge plus an additional 1MB.
    • nostrowski 30 days ago
      Two things I'm curious to know:

      1. How many tokens can 'traditional' models (e.g. Mistral's 8x7B) fit on a single 80GB GPU?
      2. How does quantization affect the single transformer layer in the stack? What are the performance/accuracy trade-offs that happen when so little of the stack depends on this bottleneck?

      • patrakov 30 days ago
        Mixtral 8x7b runs well (i.e., produces the correct output faster than I can read it) on a modern AMD or Intel laptop without any use of a GPU - provided that you have enough RAM and CPU cores. 32 GB of RAM and 16 hyperthreads are enough with 4-bit quantization if you don't ask too much in terms of context.

        P.S. Dell Inspiron 7415 upgraded to 64 GB of RAM here.
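
        For a rough sanity check on why 32 GB is enough (approximate numbers, not exact GGUF accounting):

          params = 46.7e9        # Mixtral 8x7B total parameter count, roughly
          bits_per_weight = 4.5  # about what a 4-bit K-quant averages out to
          print(f"~{params * bits_per_weight / 8 / 1e9:.0f} GB of weights")  # ~26 GB, leaving headroom for context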

    • imtringued 29 days ago
      Compared to the human brain they are shockingly efficient. It's the hardware that isn't, but that is just a matter of time.
  • gautamcgoel 30 days ago
    Why include self-attention layers at all? In other words, why not just alternate SSM and MLP layers?
    • NLPaep 30 days ago
      Mamba is bad with long context. It doesn't remember phone numbers:

      https://www.harvard.edu/kempner-institute/2024/02/05/repeat-...

      • a_wild_dandan 30 days ago
        Good! DNNs unlock semantics (parsing, transforming, producing). That's the basis of general intelligence, not encyclopedic random string recall. Models shouldn't burn ungodly quantities of compute emulating DDR5 with their working memory. We need machines that think better, not memorize well. We already have plenty of those.

        Massive context windows, and their needle tests, are misguided. We won't reach human-level AGI by basically inventing a natural language RDBMS. Our resources should primarily target better reasoning systems for our models, reinforcement learning, etc.

        If we can build a GPT4-level problem solving system that coincidentally also can't remember telephone numbers, I'll consider it major progress.

        • 6gvONxR4sf7o 29 days ago
          Memorization usually refers to training data. It's often useful to have something that can utilize instructions losslessly, which is the distinction between these models.
      • Rodeoclash 30 days ago
        I can't remember phone numbers either but I can use a device suited to remembering them to look them up
        • orra 30 days ago
          Hell, it looks like you forgot you already said that (-:
          • Rodeoclash 29 days ago
            Haha, I blame the Harmonic app :/
        • imtringued 29 days ago
          What if your field of vision were infinite and you were looking at an unrolled telephone book?

          Would you need a device to remember the phone number? You wouldn't. You would need a method or algorithm to find the number, but there is no reason why that algorithm couldn't be part of the attention mechanism. The attention mechanism is akin to reading the entire phone book for every word you are about to say. It would be unreasonable to expect you to not find the right phone number eventually.

      • Rodeoclash 30 days ago
        I can't remember phone numbers either but I can use a device suited to remembering them to look them up.
  • google234123 30 days ago
    I’m pretty sure computational chemists have been combining NNs with Kalman filters for a while now… I recall the issue was that it was slow due to the N^2 size of the covariance matrix.
    • uoaei 30 days ago
      Surprised they hadn't found ways to advance their techniques with e.g. low-rank approximations, etc.
      • theGnuMe 29 days ago
        That’s one strategy. Also flash attention.
  • unraveller 30 days ago
    Jamba-v0.1-hybrid-MoE (16x6B?) is like giving a big NOS boost to a Mixtral-8x7B-tier LLM. If it really delivers 256k context, 3x longer, faster, and cheaper than anything else, it should mean an end to the One Model To Rule Them All mindset for now. The big boys will have to offer some version of it as a separate but close sidekick integration to their hero offering.
  • ninjahatori 30 days ago
    On a side note: working over longer contexts also reminds me of MemGPT (https://github.com/cpacker/MemGPT). I think a similar concept can be applied to Mamba-architecture models too.
  • zelphirkalt 29 days ago
    Is there a Sparabo too?

    It is always funny to see old names associated with totally different new things!

  • toddmorey 29 days ago
    Released with open weights!
  • CGamesPlay 29 days ago
    Does this mean that I can continue a chat without needing to send a full transcript? This feels like it could make inference a lot cheaper for multi-step dialogs.
  • haddr 30 days ago
    Will it be possible to run this model family in ollama?
    • andy99 30 days ago
      Mamba is supported in llama.cpp, so it should be (edit: apparently it's not strictly the Mamba architecture, it's a mix of Mamba and transformers, so it looks like it would have to be ported to llama.cpp).
  • kjkjadksj 29 days ago
    People need to pick better names. Mamba is already a popular python package and internet search tools are on their knees already.
  • moneycantbuy 29 days ago
    Would a 192GB RAM Mac Studio, or even a 7950X with 192GB RAM, be practical for running this model for inference and possibly fine-tuning? Especially if I don't need very low latency, e.g. 1 token per second is fine for inference. I also have two 3090s.
  • kelseyfrog 30 days ago
    I'm glad we're seeing exploration into scaling post-transformer LLM architectures, but I'm disappointed that it has a context window. That was kind of the selling point of Mamba (and SSM models in general), right? Linear scaling, because state + input = next_state + output?
    • spxneo 29 days ago
      256k is huge, dude. That is like half of the average non-fiction novel.

      I think at least 200~300 pages of PDF.

      I'm not complaining here, and it also fits on a GPU.

    • refulgentis 30 days ago
      I'm not sure I follow fully; it is also the case for (handwaves) "traditional" LLMs that state + input = next state + output. It's just that the output grows, so as the output becomes input, eventually state + input / next state + output is greater than the context size.

      Re: linear scaling, that means the runtime cost is O(n) in context size, rather than the traditional transformer's O(n^2).

      • maccam912 30 days ago
        I think kelseyfrog meant that the state for a Mamba model is supposed to "remember" stuff even if it doesn't have the actual tokens to reference any more. It might not be guaranteed to hang on to some information about tokens from a long time ago, but at least in theory it's possible, whereas tokens from before the context window in a traditional LLM may as well never have existed.
        • kelseyfrog 30 days ago
          Yes, you said it better than I did :)
      • visarga 30 days ago
        That is valid for Mamba; this model (Jamba) is a mix of transformer and Mamba layers, so it still has a quadratic memory cost, but divided by 8.
    • a_wild_dandan 30 days ago
      state = context

      The difference between SSMs and GPTs here is how that state/context scales. Per usual in engineering, there are big trade-offs!

      • kelseyfrog 30 days ago
        I'm not following. State is a multi-dimensional vector and context is a list of tokens. State is perturbed by A and Bx(t), while context is appended to by sampling the predicted token distribution.
  • zzzzzzzzzz10 29 days ago
    Where can I download and use it?
  • cs702 30 days ago
    Please link to the original post:

    https://www.ai21.com/blog/announcing-jamba

    Jamba looks fabulous. Good performance for its size and much more efficient than the available open alternatives.

    The key idea: One out of every eight blocks in Jamba applies dot-product attention, with quadratic cost, but the other seven out of eight apply a Mamba layer, with linear cost. And the entire model is a mixture of experts (MoE), so only ~12B parameters are used at once for inference.
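
    A minimal sketch of that interleaving (the layer count, the position of the attention layer within each block of eight, and the 16-expert/top-2 MoE cadence are assumptions on my part, not exact figures from the post):

      n_layers = 32
      for i in range(n_layers):
          mixer = "attention" if i % 8 == 0 else "mamba"                    # 1-in-8 attention; position assumed
          mlp = "MoE (16 experts, top-2)" if i % 2 == 1 else "dense MLP"    # alternating-layer MoE (assumed)
          print(f"layer {i:2d}: {mixer:9} + {mlp}")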

    Thank you to the folks at AI21 for making Jamba available!

    • swyx 29 days ago
      I haven't seen anyone mention this yet, so I'll be the first: what is the comparison vs StripedHyena? https://www.together.ai/blog/stripedhyena-7b
      • cs702 29 days ago
        Mamba came out of the same research group, Hazy Research, led by Chris Ré. This new "Jamba" model incorporating Mamba and dot-product attention layers has ~8x more parameters than the largest open Striped Hyena, and appears to work much better.
  • ipsum2 30 days ago
    @dang this is blogspam for the official post: https://www.ai21.com/blog/announcing-jamba
  • krasin 30 days ago
    The license is a proper open-source one: Apache 2.0. Thanks, AI21 Labs.
    • popalchemist 30 days ago
      In addition to the architectural and performance benefits, this is the big deal here, IMO.
    • spxneo 29 days ago
      I'm so used to seeing AGPLv3.

      Apache 2 is a more generous license.

      • krasin 29 days ago
        AGPLv3 is a fine license too. But most of the models nowadays come with bullshit licenses, like Llama 2 with its "acceptable use policy" enforced by the license: https://ai.meta.com/llama/use-policy/
  • throawayonthe 29 days ago
    [dead]
  • sleepingreset 30 days ago
    god damn
  • htrp 30 days ago
    compute still has cost?
    • samus 30 days ago
      I'm not sure I understood your question.

      This model should have much lower computational cost since only one out of eight layers is a traditional transformer layer with masked self-attention. Additionally, half of the Mamba layers are MoEs.