19 comments

  • DoctorOetker 11 days ago
    This mirrors what I experienced when I enrolled in "free drawing" (no teaching) classes:

    While people have considered me good at drawing since I was a child, I remember just repeating similar detailed drawings I had drawn before, or otherwise taking plenty of time to draw. I believe anyone with time and patience can make a nice drawing of a scene.

    The "free drawing" class had no rules or lectures: you brought the materials you wanted to work with (some brought ink, others pencils, while I brought charcoal). The only thing determined was the timing between poses for the model: for each session the first few poses were very short (say a minute), and then the pose durations would progressively lengthen until say 5 minute poses. At all times you were free to tear your picture up and retry drawing the pose again.

    My drawing skills improved considerably. The short "warmups" actually force you to get proportions and outlines correct on the first tries. Conventional wisdom says haste makes waste, but when learning or refining skills, it seems natural selection has hardcoded the sensation of haste as a stressor prompting attention and learning.

    I am convinced I could have drawn similar-quality drawings before enrolling in those classes, except they would easily have taken me 5 or 10x as long to draw. Being forced not to beat around the bush, and feeling the penalty of a hasty mistake (which further decreases the time left for a second try), does seem to work.

    My only gripe is that the technique is termed "Consistency", whereas I would reserve such a term for an improvement in performance, not inference speed, although I understand that they mean "consistency with what would ultimately have been generated one token at a time". I would rather dub it "Proficiency LLM", where the same output is expected, only without the inhibition of stuttering its way to the same conclusion.

    • snyhlxde 11 days ago
      Hi, we are the CLLM authors, and thanks for sharing your experience and insights! I can see how this drawing-skill-refinement process echoes the training process in CLLM; the only difference is that, at this point, the stressor in CLLM training does not get progressively more demanding.

      For example, while drawing, you could set a very specific time limit for each trial and make it progressively shorter. In CLLM, maybe we can make the learning process more and more difficult by mapping more and more distant states in the Jacobi trajectory to its final state.

      We are using the term "consistency" because we draw a parallel between consistency LLMs and consistency models in diffusion-based image generation, where the training processes are analogous.

      • boroboro4 10 days ago
        Do you use the same dataset to train and evaluate the model? Was the model in the example trained on the GSM8K dataset, for instance?
        • snyhlxde 10 days ago
          Yes, we consider both domain-specific applications (Spider for text-to-SQL, GSM8K for math, CodeSearchNet for Python) as well as open-domain conversational applications (ShareGPT). We use the test set from each application to evaluate CLLMs' performance in our paper.

          On the other hand, CLLM technically works on any kind of query, but the speedup might vary. Feel free to try out our codebase for your use cases!

      • Quarrel 10 days ago
        Is it just me, or does this read like it was written by an LLM ... ?!
        • jasonjmcghee 10 days ago
          It's just much more formal than people generally speak on HN.
        • snyhlxde 10 days ago
          lol I take that as a compliment. Good try but sadly no LLM in this writing :)
    • aamargulies 11 days ago
      I had an interesting experience in an Invertebrate Zoology lab class one summer.

      We students were brought into a lab, given specimens to draw, and the only instructions we received were 'You have 30 minutes to draw this. Go.'

      There was no "here's how to draw. here's what to do and not to do". It was just basically "We don't care about any insecurities you might have. We don't care if you think you can't draw. No excuses, just fucking draw it. Now."

      Not only did we draw, but we (all of us) improved enormously over the course of the class as more animals were brought in and the exercise was repeated over and over and over again throughout the summer.

      What it taught us is that everyone, and I mean everyone, can draw. Our collective attitude shifted from "don't know if this is even possible" to "of course we can do this. this is easy. routine. trivial."

      Highly recommended approach.

      It was the most freeing and amazing class I had in college.

      • Version467 10 days ago
        That sounds like a pretty awesome experience. Thanks for sharing.
    • manmal 11 days ago
      Systems generally become more efficient when under stress. They are also forced into local optima - everything has upsides and downsides.
      • sheepscreek 10 days ago
        Interestingly - this is the idea behind Nassim Taleb’s book “Antifragile” and the concept of “anti-fragility”.

        In essence, it promotes dynamic/evolutionary/always-learning behaviour rather than performing the same set of steps every time, and in the process becoming stronger than before.

        An example he shares is how the breakdown of muscle tissue through exercise leads to more muscle development and an increase in strength. I guess it's similar to LLM training using error/loss-reducing functions (practice makes perfect), but dissimilar in the sense that training is a one-time action.

      • TeMPOraL 10 days ago
        > They are also forced into local optima

        The good ol', "under pressure, you don't rise to the occasion, but sink to the level of your training"?

  • miven 11 days ago
    The authors mention that Jacobi decoding is equivalent to greedy autoregressive decoding, but in practice don't we often want the sampling temperature to be above zero to avoid repetitions and excessively generic responses?

    I'm completely unfamiliar with this decoding strategy so maybe I'm just missing a simple way to account for that.

    • snyhlxde 10 days ago
      Yes, this is a great question! We are actively working on supporting sampling strategies other than greedy sampling. In the context of CLLM training, instead of mapping to a static fixed point obtained from Jacobi decoding as the training objective, we map to what we term a dynamic fixed point. You can keep an eye on our GitHub repo for new progress.
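      For readers unfamiliar with Jacobi decoding, here is a rough sketch of the greedy variant (my own illustration in the style of a Hugging Face causal LM, not the actual CLLM code): start from an arbitrary guess for the next n tokens, refine all n positions in parallel, and stop when the guess no longer changes. The fixed point matches what greedy autoregressive decoding would produce.

        import torch

        def greedy_jacobi_decode(model, prefix_ids, n_tokens, max_iters=64):
            # Illustrative greedy Jacobi decoding. Assumes a causal LM whose
            # forward pass returns logits of shape (1, seq_len, vocab_size).
            guess = torch.zeros(1, n_tokens, dtype=torch.long,
                                device=prefix_ids.device)  # arbitrary init
            for _ in range(max_iters):
                input_ids = torch.cat([prefix_ids, guess], dim=1)
                with torch.no_grad():
                    logits = model(input_ids).logits
                # Greedy prediction for each of the n positions, conditioned on
                # the prefix plus the current (possibly wrong) guess.
                start = prefix_ids.shape[1] - 1
                new_guess = logits[:, start:start + n_tokens, :].argmax(dim=-1)
                if torch.equal(new_guess, guess):  # fixed point reached
                    break
                guess = new_guess
            return guess

      Each iteration is one parallel forward pass; the speedup comes from a model that converges to the fixed point in far fewer iterations than the n sequential steps vanilla autoregressive decoding would take.
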
    • matheist 11 days ago
      Agreed. It's straightforward to check that a token was the argmax, but it seems difficult to check that a token appeared with the probability you wanted it to. You could still do the fine-tuning step I guess, where you train the trajectories to approach n-token completions with the statistics you want, but I can't see how you can replace the "check for a fixed point" step. Maybe "check the result was above this fixed threshold for likelihood".
  • wangii 10 days ago
    I feel it's a pretty dangerous optimization before we REALLY understand what's going on inside the LLM. E.g. people who believe in the geometric interpretation will have something to say, and it would probably hurt if you are using "filler" tokens.

    Besides, the assumption (not a universal fact) that we form complete sentences in mind before articulating them word by word seems to oversimplify what happens in our minds: do we really have a complete plan before we start talking/typing? As a Buddhist, I lean towards it being an illusion. Furthermore, what about simultaneous thoughts? Are we linear thinkers at the sentence level?

    anyway, pretty neat math!

    • renonce 10 days ago
      The optimization does not affect the result of the LLM; it's guaranteed to produce results equivalent to decoding directly. Let's not treat the LLM as some magic that resembles our mind; it's just another program that produces sentences that happen to make sense.
      • naasking 10 days ago
        > Let's not treat the LLM as some magic that resembles our mind; it's just another program that produces sentences that happen to make sense.

        "That happen to make sense" is hiding a lot of magic. It would be statistically impossible to make as much sense as LLMs do in response to prompts if it did not actually make semantic distinctions. If it makes semantic distinctions, then it does resemble the human mind in at least one way.

      • wangii 10 days ago
        According to the original Jacobi decoding paper, it was set in machine translation tasks, with an encoder + decoder, where the parallel algorithm was applied only to the decoder part.
      • sigmoid10 10 days ago
        Let's not treat our mind as something magical. It's just another program that learned to speak by consuming lots of training input. The implementation might look slightly different from the outside, but from a mathematical perspective, artificial neural networks are proven to be at least as capable as the human mind.
        • baq 10 days ago
          The best part is, your comment works both when sarcastic and completely serious.
        • ben-schaaf 9 days ago
          > artificial neural networks are proven to be at least as capable as the human mind

          Do you have a source for this? I know we have models of neural networks designed to act like neurons, but those aren't what're being used.

          • sigmoid10 6 days ago
            See the universal approximation theorem for fully connected perceptrons.
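            For reference, the standard single-hidden-layer statement (paraphrased from memory, so treat it as a sketch): fix any continuous, non-polynomial activation σ; then for any continuous f on a compact set K ⊂ R^n and any ε > 0 there exist N, α_i, w_i, b_i such that

                \sup_{x \in K} \Big| f(x) - \sum_{i=1}^{N} \alpha_i \, \sigma(w_i^\top x + b_i) \Big| < \varepsilon

            Note that it only covers continuous functions on a compact domain, and it says nothing about whether the weights can actually be learned.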
            • ben-schaaf 3 days ago
              That's really nowhere near enough of a proof. You'd need to prove that a human brain is equivalent to a mathematical function, and that that function can be sufficiently approximated by a NN to be functionally identical.

              Additionally UAT doesn't actually prove NNs can approximate any function. Non-continuous functions and infinitely large domains aren't covered.

          • xpe 9 days ago
            Define ‘capable’ and most of the confusion and potential controversy goes away.
    • Etheryte 10 days ago
      That assumption might be useful in this context, but I think it's pretty clearly not true. Ask anyone to tell you about a complex past event with a lot of parallel branches and you'll quickly see them add bits, pieces and tangents midsentence to cover the full range of events. I don't think I've seen the sentence granularity hypothesis in any serious scientific context before.
    • hatthew 10 days ago
      Can't speak for everyone but I definitely don't mentally form complete sentences before talking. Sometimes I grammatically talk myself into a corner in the middle of a sentence and need to use some awkward words/phrases to finish my thought, or simply pause and restart the phrase from the beginning.
      • nomel 9 days ago
        I feel surprisingly disconnected from my speaking self, acting as more of an observer, who is sometimes surprised at what I come up with. It just flows. I feel I have very little need for input.

        But, I also feel fairly disconnected from my thinking self. I point my attention at something and solutions usually just pop out, maybe with some guidance/context forming required, in the form of internal dialog, which is usually of a rubber ducky style format [1], or mental testing of that mostly spontaneous solution.

        I feel the "real" me is the one sensing/observing, which includes the observing of those spontaneous solutions, and what I say.

        [1] Works with any problem space, not just coding "debugging": https://rubberduckdebugging.com/

        • wangii 9 days ago
          Are you practicing any meditation? That's regarded as an "awakened" state in some practices! If you have any method, please share it with me! Thanks!
    • int_19h 10 days ago
      We don't appear to form words sequentially from underlying parts, even though in many languages words break down into smaller units that carry semantic meaning themselves. There doesn't seem to be any clear reason for this to suddenly stop holding at the sentence level.
    • causal 10 days ago
      What is the geometric interpretation?
  • alfalfasprout 11 days ago
    Wow, I'm mindblown this isn't getting more attention. This seems like a clear win for inference. Fine tuning cost for this is reasonable (around 0.01% of the original pre-training cost). And the performance wins seem fairly consistent.
    • WhitneyLand 10 days ago
      Yes, seems like a huge important result for LLM performance.

      I'm not aware of any other paper that has offered to increase LLM inference performance to this degree. Has there ever been one before?

      At least while also:

      - Maintaining output quality. The benchmarks used were somewhat narrow but so far so good.

      - Improving not just query latency but also global throughput

      - Not requiring more compute

      - Having a relatively practical implementation and not adding big challenges and complexity

      You could argue the insight is incremental, as it builds on what’s been done with parallel/jacobi decoding. Those previous results were necessary and important, but this may be the one that finally extracts real world value from the promise of parallel decoding.

    • lopuhin 11 days ago
      Similar or greater inference wins are achieved with speculative decoding which is already widely used, so while this is really interesting (and was tried before with less success AFAIK), it's not yet clear how impactful it would be.
      • WhitneyLand 10 days ago
        I don’t see where similar wins have ever been achieved.

        Speculative decoding can reduce latency, but at the cost of using a lot more compute. The amazing thing here is latency and global throughput improvements would be realized because of the increase in efficiency.

        From what I understand speculative decoding can also come with more challenges insofar as trying to maintain overall output quality.

    • snyhlxde 10 days ago
      Thanks for your interest in our work! Yes, we found that training with the consistency loss + AR loss on even a subset of a dataset results in a significant speedup (at 0.01% of the pre-training cost). Training on more data permits even further speedup: the model is able to learn from more frequently appearing collocations and phrases.

      For more details, please check out our paper, where you can also see that the speedup saturates as the size of the training data grows.
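      Roughly, and as a toy paraphrase rather than the exact implementation (the helper below is made up for illustration, and plain cross-entropy against the fixed-point tokens stands in for the consistency distance), the objective has two parts: a consistency term that pushes the model's parallel predictions on an intermediate Jacobi state toward that trajectory's fixed point, and a standard AR cross-entropy term on ground-truth text:

        import torch
        import torch.nn.functional as F

        def toy_cllm_loss(model, prefix, jacobi_state, fixed_point, gt_ids, w_ar=1.0):
            # jacobi_state: an n-token guess from somewhere along a Jacobi
            # trajectory; fixed_point: the converged n tokens for the same prefix.
            logits = model(torch.cat([prefix, jacobi_state], dim=1)).logits
            vocab = logits.shape[-1]
            n = fixed_point.shape[1]
            start = prefix.shape[1] - 1
            # Consistency term: predictions conditioned on the (possibly wrong)
            # intermediate state should already match the fixed point.
            consistency = F.cross_entropy(
                logits[:, start:start + n].reshape(-1, vocab),
                fixed_point.reshape(-1))
            # Standard autoregressive term on ground-truth text.
            ar_logits = model(gt_ids).logits
            ar = F.cross_entropy(
                ar_logits[:, :-1].reshape(-1, vocab),
                gt_ids[:, 1:].reshape(-1))
            return consistency + w_ar * ar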

  • andy12_ 11 days ago
    At first I thought that this was another Medusa-like paper, simply using more unembedding heads for guessing subsequent tokens, but damn, not at all. This is amazing. And it doesn't even use extra parameters; it's just an auxiliary training loss.
    • snyhlxde 10 days ago
      The only similarity between Medusa and CLLM is that both train and adapt LLMs for fast inference. But they use completely different training and decoding techniques, and as you pointed out, CLLMs don't need extra parameters or an attention mask configured for tree-based verification.
  • nico 11 days ago
    Interesting

    I think soon we are going to realize that we don’t really need training the models

    We just need good indexing and sampling

    Essentially at some level any LLM is equivalent to a DB of the dataset, with a great NLP interface on top

    Both are just different methods of navigating stored data

    • tempusalaria 10 days ago
      LLMs can easily produce data not in training dataset.

      LLMs do not navigate stored data. An LLM is not a DB of the training data.

      • carlthome 10 days ago
        I've had the same thought as above, but unfounded (just a feeling, pretty much), so I'm curious to learn more. Do you have any references I can check out that support these claims?
        • int_19h 10 days ago
          Come up with a novel puzzle that is guaranteed to not be in the training set, and ask GPT-4 to solve it.
          • carlthome 7 days ago
            Controlling for that doesn't seem trivial.
    • sdrg822 11 days ago
      But indexing *is* training. It's just not using end-to-end gradient descent.
    • PeterisP 10 days ago
      The models are multiple orders of magnitude smaller than compressed versions of their training data; they cannot be the equivalent of a DB of it.
      • lainga 10 days ago
        The training data is ideo-semantically compressed? News to me... is it perhaps stored in kanji?
    • nsagent 11 days ago
      You might like the Infinigram paper then. It was discussed recently:

      https://news.ycombinator.com/item?id=40266791

    • JoannaWongs 10 days ago
      [flagged]
  • JKCalhoun 10 days ago
    Anyone know somewhere someone dumb like me can "Ask an AI expert"?

    I want to ask, for example, how is it that an LLM when given the same prompt does not respond in the same deterministic way?

    I guess I want to learn this stuff and should maybe follow one of those "write an LLM in an hour" type videos on YouTube.

    • throwawaymaths 10 days ago
      > how is it that an LLM when given the same prompt does not respond in the same deterministic way?

      In software (not in the model), there's literally a random number generator that picks from a weighted set of "next token" choices that the model spits out. The selection process can have a series of knobs to manipulate the responses. If you want it to be deterministic (and you have direct access to the software), you can tell it to set "top-k = 1" or "temperature = 0.0" (depending on your software) and it will be deterministic.

      Usually the default settings are not for determinism, because for whatever reason the quality of the results tends to not be that good when you go fully deterministic.
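      A minimal sketch of that selection step (my own illustration; the knob names mirror common toolkits, not any particular library's API):

        import numpy as np

        def sample_next_token(logits, temperature=1.0, top_k=0, rng=None):
            # temperature == 0 or top_k == 1 collapses to greedy argmax,
            # i.e. fully deterministic output for a given prompt.
            rng = rng or np.random.default_rng()
            if temperature == 0 or top_k == 1:
                return int(np.argmax(logits))
            scaled = np.asarray(logits, dtype=np.float64) / temperature
            if top_k > 0:
                cutoff = np.sort(scaled)[-top_k]      # keep only the k best
                scaled = np.where(scaled < cutoff, -np.inf, scaled)
            probs = np.exp(scaled - np.max(scaled))   # softmax
            probs /= probs.sum()
            return int(rng.choice(len(probs), p=probs))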

    • 8note 10 days ago
      For that answer, you can refer to the 3blue1brown videos

      The LLM outputs a vector of probabilities for tokens, and the LLM user picks a token from the most likely list using a random number

    • zipfcharge 10 days ago
      It's because an LLM is essentially a probability matrix. You type a prompt, then it calculates the probability of the next word, and so on, eventually forming a sentence. The probabilities it learns are based on the training data.

      Because of the underlying probability model, it's not going to be 100% deterministic. Plus, a model like ChatGPT purposefully has a "temperature" parameter that further adds randomisation to the whole process.
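      Concretely, the temperature T just rescales the raw scores (logits) z_i before the softmax, so higher T flattens the distribution and adds randomness, while T -> 0 approaches a deterministic argmax:

          p(x_i) = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}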

      My answer is based on this paper if you're interested to read more: The Matrix: A Bayesian learning model for LLMs, https://arxiv.org/abs/2402.03175

      • flopriore 10 days ago
        Are there any ways to show the source of the information retrieved by the model? For instance, the LLM forms a sentence and it points to a stackoverflow answer with the same or similar content.
        • JKCalhoun 10 days ago
          As I understand it, I'm pretty sure that is impossible. When it is fed a single datum, sure, trivial. As soon as it is fed a second one, though, the weights are already a kind of blend of the two tokens (so to speak).
          • spmurrayzzz 10 days ago
            It's not impossible, but it's definitely difficult. There is some overlap in the methods used to detect benchmark data contamination, though it's not entirely the same thing. For the detection use case, you already know the text you're looking for and you are just trying to demonstrate that the model has "seen" the data in its training set. The challenge is proving that it is statistically improbable that the model could stochastically generate the same tokens without having seen them during training.

            Some great research exists in this area [1] and I expect much of it may be repurposed for black box attribution in the future (in addition to all the work being done in the mechanistic interpretability field)

            [1] https://arxiv.org/abs/2311.04850

    • zozbot234 10 days ago
      > I want to ask, for example, how is it that an LLM when given the same prompt does not respond in the same deterministic way?

      You can control that in most systems with an inference parameter called "temperature". But setting the temperature as low as possible tends to lead to very low-quality answers - the system can't crawl out of some local optimum and ends up repeating itself over and over. Such answers may be "deterministic" but they're also not good.

    • int_19h 10 days ago
      I found this to be a good start that explains things fairly methodically, but without losing the high-level perspective.

      https://writings.stephenwolfram.com/2023/02/what-is-chatgpt-...

    • rahimnathwani 10 days ago
      For this particular question, ask chatgpt how temperature affects llm softmax sampling.

      For other things, study using Karpathy's videos.

  • renonce 10 days ago
    > ... speculative decoding methods ... incurs extra memory cost during inference time.

    Any detail on this? For speculative decoding you need a smaller model to generate "branches" which are fast but maybe inaccurate, and then verify these branches later with a larger model. However, only memory equivalent to a single token is needed for speculative decoding, and tokens in other branches are simply masked out during inference. With a context size of 1000 and ~30 branches for 5 tokens, the memory overhead would be 3%, which is negligible. If your context size is much smaller compared to the number of branches - would someone who uses a generative LLM with a context window of just 50 tokens care about generation speed?
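    For readers who haven't seen it, a bare-bones greedy draft-and-verify loop looks roughly like this (my own sketch with Hugging Face-style models; it ignores KV caches, batching, and the probabilistic acceptance rule that preserves the target distribution under non-greedy sampling):

      import torch

      def speculative_step(draft_model, target_model, prefix_ids, k=5):
          # One greedy speculative-decoding step: the small draft model proposes
          # k tokens, the large target model checks them in a single forward
          # pass, and we keep the longest agreeing prefix plus one correction.
          ids = prefix_ids
          for _ in range(k):                       # cheap autoregressive draft
              with torch.no_grad():
                  logits = draft_model(ids).logits
              ids = torch.cat([ids, logits[:, -1:].argmax(-1)], dim=1)
          with torch.no_grad():                    # one pass of the big model
              target_logits = target_model(ids).logits
          n = prefix_ids.shape[1]
          accepted = prefix_ids
          for i in range(k):
              predicted = target_logits[:, n + i - 1].argmax(-1, keepdim=True)
              accepted = torch.cat([accepted, predicted], dim=1)
              if predicted.item() != ids[0, n + i].item():
                  break                            # first disagreement: stop here
          return accepted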

    Also, speculative decoding techniques are not restricted to greedy sampling - it's expected to behave exactly the same as the original model and sample with the expected probabilities. Most literature on speculative decoding already reports 2.6x-3.5x speedup. The blog post here reports 2.4x-3.4x generation speed - which isn't that much of an upgrade?

    While I mentioned speculative decoding above, and Medusa2 and Eagle seem to be the techniques that the author compares against, the core problem remains: whatever method you use to predict tokens ahead of time, there is a specific point where the previous tokens are absolutely needed before predicting the next token. It doesn't depend on what your model is or what your techniques are; it's just about what is mathematically achievable. How can you predict 5 tokens at once if the probability distribution of the 5th next token depends heavily on the previous 4 tokens? Speculative decoding, Jacobi decoding, multi-token parallel decoding, whatever.

    If only greedy sampling is supported for this, then I wonder what are the advantages of this method, not to mention that other techniques already achieve the expected speedup. Comparing greedy sampling speedups to random sampling speedups is comparing apples to oranges, and I doubt if the speedup described by the method would remain after this method is adapted to random sampling (due to the core problem mentioned above).

    • cxczz 10 days ago
      > the previous tokens are absolutely needed before predicting the next token

      Maybe this is the key contribution of this paper: demonstrating that, through consistency training, LLMs can predict the next n tokens even if there are incorrect guesses among the previous tokens?

      On the other hand, while mathematically it is true that p(x_t|x_1,...,x_t-1) depends on all x_1 to x_t-1, in practice, it is possible that predicting x_t only requires x_1 to x_t-2, and the attention to x_t-1 is minimal. Thus, predicting x_t with x_1 to x_t-2 and inaccurate x_t-1 is possible.

    • Palmik 10 days ago
      Speculative decoding requires you to load the smaller model into memory and run inference on it.
      • renonce 10 days ago
        I think the smaller model is at least 20 times smaller. If you do speculative decoding on a 70B model, a 1B model would be appropriate.
  • dvt 11 days ago
    There's no free lunch™, so from what I can tell there's some pathway loss here. E.g. some Jacobi trajectories definitionally exclude higher temperature paths. Which might actually be a positive given data retrieval (but a negative if we want to maximize for creativity?).
    • wrsh07 11 days ago
      There are better and worse algorithms. I'm not sure "there is no free lunch" always applies in a particularly meaningful way. Some things aren't on the pareto frontier.
      • factormeta 10 days ago
        Kinda like the AIFF -> MP3 conversion process. A lot of data is lost, but can we humans really tell much of a difference?
        • wrsh07 10 days ago
          There's no reason to think the current next token prediction models are optimal for predicting sentences (they aren't!)

          > An algorithm may outperform another on a problem when neither is specialized to the problem

          https://en.m.wikipedia.org/wiki/No_free_lunch_in_search_and_...

          • stkdump 10 days ago
            I would go even further and say there isn't any indication that we are even close to what is possible. My subjective feeling is that with the current rate of progress it is entirely possible that we will have GPT-4 level performance locally on smartphone hardware within 3-10 years (unless companies decide again that they don't want to give this kind of power away)
            • naasking 10 days ago
              Probably. Advancements in ML algorithms, like this one, have been outpacing advancements in hardware for awhile now, so both are converging on making ML faster and ubiquitous.
  • toxik 11 days ago
    Interesting stuff. I guess the idea has occurred to many but was well written and presented.
    • programjames 10 days ago
      Yep. My roommate and I were talking about this a year ago. You can also do something similar for LLM steering.
  • doctor_eval 11 days ago
    > Our research shows this process – mimicking human cognitive process of forming complete sentences in mind before articulating word by word

    This is not how I work. Is there something wrong with me?

    • jerbear4328 11 days ago
      Nor is it how I work, I think that's normal enough. I do have an idea of what I'm going to say before I say it, I think that's closer to what they meant. I think and speak in increments of ideas, not words.
    • snyhlxde 10 days ago
      In some conversations, maybe it's easier to form complete sentences. In some others, the best we can do is: have a rough draft about what to say in mind and then refine it word by word while speaking.
    • Filligree 11 days ago
      You might not have an internal monologue. A lot of us don't, and the ones that do are equally shocked every time they find out. For what it's worth, I'm in the same boat—can form sentences, but why would I? It'd slow me down.

      People who don't have inner monologues tend to assume that all that stuff is some form of analogy or metaphor. It's not. It's entirely literal.

      • oceanplexian 11 days ago
        Do you mean in a real time conversation?

        Because I definitely don't "have an internal monologue about what I'm going to say" in the 100ms between when someone asks a casual question and I respond to it.

        • int_19h 10 days ago
          Yes, it is possible to maintain an internal monologue in real time conversation. That is one of the reasons why some people usually take longer than 100ms to respond.
    • throwawaymaths 10 days ago
      Are you sure? It might not be the whole sentence, but I would find it hard to believe that in practice the way you speak or write is like

      hello <think> May <think> be <think> I'll <think> go <think> get <think> break <think> fast

    • DrSiemer 11 days ago
      They probably do not mean people form entire sentences before expressing them, I am not aware of anybody doing that. I assume it refers to people first coming up with a global outline of what they want to say before they start speaking.
    • mdp2021 11 days ago
      "Rem tene, verba sequentur" (you hold the matter, then words come) is largely "how it works".

      You form logical ideas as you speak, as you speak your speech develops, so the translation is from ideas to sentences. It is not clear in which phase one would mentally form a complete sentence, nor why it should be relevant. You "see something [that makes sense]", then you describe it - iteratively.

    • causal 10 days ago
      You are probably pretty far from the LLM extreme, though, of thinking one token at a time.
    • giardini 11 days ago
      Probably.
  • programjames 10 days ago
    > Surprisingly, we find such an objective is analogous to that of consistency models

    This is why numerical methods should be part of the ML curriculum.

  • rcarmo 11 days ago
    Can't wait to see something like this merged into ollama (I'm sure there would be plenty of people fine-tuning models for it).
    • Me1000 11 days ago
      Ollama doesn't have their own inference engine, they just wrap llama.cpp. But yes, it will be awesome when it's more generally available.
    • helloericsf 11 days ago
      The lab is tied to the vLLM project. I would say it might get picked up sooner by vLLM than other inference frameworks.
  • snyhlxde 10 days ago
    from CLLM authors:

    Thank you guys for the great questions and insights! We have made a Twitter post with some more details, and we invite you to engage with us on Twitter as well.

    https://twitter.com/haoailab/status/1788269848788869299

  • paulclark 11 days ago
    Is this how Groq (https://groq.com/) is so fast, or are they doing something different?
    • buildbot 11 days ago
      Groq is serving an LLM from (hundreds of chips' worth of) SRAM, so the effective bandwidth, and thus token generation speed, is an order of magnitude higher than with HBM. This would 3.5x their speed as well; it is orthogonal.
    • wrsh07 11 days ago
      My understanding is that theirs is a pure hardware solution. The hardware is flexible enough to model any current NN architecture.

      (Incidentally, there are black-box optimization algorithms, so a system as good as Groq at inference might be useful for training even if it can't support gradient descent.)

    • throwawaymaths 10 days ago
      According to someone I talked to at a Groq event I was invited to (I did not sign an NDA), they are putting ~8 racks of hardware per LLM. Of course, coordinating those racks to have exact timings between them to pull tokens through is definitely "part of the hard part".
  • m3kw9 11 days ago
    They can quickly try with one of the open source models, then show a side by side demo
  • ec109685 11 days ago
    Could someone please explain the intuition behind this technique in more layman's terms?
    • TomatoCo 11 days ago
      For all of these "how can we batch predicting the next n tokens?" the intuition is basically that it takes a buttload of math to predict some of the tokens, but that most tokens are actually easy to guess. For example, if I asked "What was that phone number from that 80's song?" as soon as a model generates 867- it shouldn't take that much math at all to finish predicting 5309.
      • snyhlxde 10 days ago
        A bit more intuition on how training works: in natural language, some phrases/collocations, for example "remind ... of ...", "make a decision", "learn a skill", etc., are used together. We can ask LLMs to learn such collocations & frequently appearing n-grams. After learning, the model can use parallel decoding to predict, in one forward pass, many tokens that frequently appear together.
    • programjames 10 days ago
      "Try to fix all the words in a sentence at once. Keep iterating until you don't think it needs fixing."
  • fermuch 11 days ago
    Would something like this apply to MAMBA/JAMBA too?
    • wrsh07 10 days ago
      I think any next token predictor will benefit. Iiuc mamba is a next token predictor.

      I just skimmed the gradient article, but if their only change is swapping out the transformer block for the mamba block, I don't think it's already using this optimization

  • Linda231 11 days ago
    [dead]