Ask HN: Is RAG the Future of LLMs?

RAG seems to be in vogue as one of the best solutions for reducing the problem of hallucinations in LLMs.

What do you think? Are there any other alternatives or solutions in sight?

125 points | by Gooblebrai 14 days ago

41 comments

  • gandalfgeek 14 days ago
    #1 motivation for RAG: you want to use the LLM to provide answers about a specific domain. You want to not depend on the LLM's "world knowledge" (what was in its training data), either because your domain knowledge is in a private corpus, or because your domain's knowledge has shifted since the LLM was trained.

    The latest connotation of RAG includes mixing in real-time data from tools or RPC calls. E.g. getting data specific to the user issuing the query (their orders, history etc) and adding that to the context.

    So will very large context windows (1M tokens!) "kill RAG"?

    - at the simple end of the app complexity spectrum: when you're spinning up a prototype or your "corpus" is not very large, yes-- you can skip the complexity of RAG and just dump everything into the window.

    - but there are always more complex use-cases that will want to shape the answer by limiting what they put into the context window.

    - cost: filling up a significant fraction of a 1M-token window is expensive, both in money and latency. So at scale, you'll want to retrieve only the relevant info (i.e. RAG) rather than indiscriminately dump everything into the window.

    • waldrews 14 days ago
      We're getting large context windows, but so long as pricing is by the input token, the 'throw everything into the context window' path isn't viable. That pricing model, and the context window limits, are a consequence of the quadratic cost of transformers though, and whatever the big context models like Gemini 1.5 are doing must have an (undisclosed) workaround.

      What needs to happen is a way to cheaply suspend and rehydrate the memory state of the forward pass after you've fed it a lot of tokens.

      That would be a sort of light-weight/flexible/easily modifiable/versionable/real-time-editable alternative to fine tuning.

      It's readily doable with the open-weights LLMs, but none of them (yet) have the context length to make it really worthwhile (some of the coding LLMs have long context windows, but that doesn't solve the 'knowledge base' scenario).

      From a hosting perspective, if fine-tunes are like VMs, such frozen overlays are like Docker containers: many versions can live on the same server, sharing the base model and differing only in the overlay layer.

      (a startup idea? who wants to collaborate on a proof of concept?)
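
      A minimal sketch of the suspend/rehydrate idea using the KV cache in Hugging Face transformers (model name is just an example; persisting and sharing that cached state across requests and servers is exactly the part that doesn't exist off the shelf):

          import torch
          from transformers import AutoModelForCausalLM, AutoTokenizer

          tok = AutoTokenizer.from_pretrained("gpt2")
          model = AutoModelForCausalLM.from_pretrained("gpt2")

          # one-time: run the knowledge base through the model and keep the KV state around
          corpus = tok("...your knowledge base text...", return_tensors="pt")
          with torch.no_grad():
              frozen = model(**corpus, use_cache=True).past_key_values

          # later: "rehydrate" by continuing from the frozen state with just the new question
          question = tok(" Question: what does the corpus say about X?", return_tensors="pt")
          with torch.no_grad():
              out = model(question.input_ids, past_key_values=frozen, use_cache=True)
          next_token = out.logits[:, -1].argmax(-1)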

      • spencerchubb 14 days ago
        When you describe the overlay layer, that sounds similar to the idea of low-rank adaptation (LoRA). LoRA is kind of like finetuning, but instead of updating every parameter, it adds a relatively small number of parameters and finetunes those.
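
        Rough arithmetic for a single weight matrix (numbers are illustrative, not from any particular model):

            d, k, r = 4096, 4096, 8        # layer dims and LoRA rank
            full_finetune = d * k          # ~16.8M params updated if you touch W itself
            lora_added    = d * r + r * k  # ~65K params in the low-rank pair B @ A; W stays frozen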

        Am I understanding what you're describing about the VMs and containers analogy?

        • waldrews 14 days ago
          Yup. I guess LoRA counts as fine tuning. Except I've never seen inference engines where they actually let you take the base model and the LoRA parameters as separate inputs (maybe it exists and I just haven't seen it). Instead, they bake the LoRA part into the bigger tensors as the final step of the fine tune. That makes sense in terms of making inference faster, but prevents the scenario where a host can just run the base model with any finetune you like, maybe switching them mid-conversation. Instead, if you want to host a fine-tuned model, you take the tensor blob and run a separate instance of the inference program on it.

          Incidentally, this is the one place where OpenAI and Azure pricing differs; OpenAI just charges you a big per-token premium for fine-tuned 3.5, and Azure charges you for the server to host the custom model. Likewise, the hosts for the open-weights models will charge you more to run your fine-tuned model than a standard model, even though it's almost the same amount of GPU cycles, just because it needs to run on a separate server that won't be shared by multiple customers; that wouldn't be necessary if overlays were separated.

          I wouldn't be surprised if GPT-4's rumored mixture of many models does something like this overlay management internally.
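
          A sketch of what that scenario looks like at the library level with Hugging Face's peft, which does keep the LoRA weights separate from the base model and lets you swap them at runtime (the adapter names/paths below are made up):

              from transformers import AutoModelForCausalLM
              from peft import PeftModel

              base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

              # the LoRA weights live in a separate, tiny artifact overlaid on the shared base
              model = PeftModel.from_pretrained(base, "acme/support-bot-lora", adapter_name="support")
              model.load_adapter("acme/legal-bot-lora", adapter_name="legal")

              # switch overlays per request (or mid-conversation) on one resident base model
              model.set_adapter("legal")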

      • msp26 14 days ago
        Great post. This exact limitation of web LLMs is why I'm leaning strongly towards local models for the easier stuff. Prompt caching can dramatically speed up fixed tasks.

        But frontier models are just too damn good and convenient, so I don't think it's possible to fully get away from web LLMs.

    • gavmor 14 days ago
      Thanks, this is how I view it, too: there will always be relevant context that was unavailable at training, e.g. because it didn't exist yet, because it doesn't belong to the trainers, or because it wasn't yet known to be relevant.

      One of these will remain true until every person has their own pet model which is fine-tuned, on keyup, on all public data and their own personal data. Still, I struggle to imagine something heinously parametric (like regional weather on some arbitrary date) fitting into a transformer.

      Edit: I can imagine every user getting a LoRA.

    • recursive4 14 days ago
      Here's a question you can ask yourself: "where does my context fall within the distribution of human knowledge?" RAG is increasingly necessary as your context moves towards the tail.
    • muratsu 14 days ago
      In addition to what you've shared, I find RAG to be useful for cases where the LLM has the world knowledge (say it knows how to write JavaScript) but I want it to follow a certain style or dependencies (e.g. use function definitions vs function expressions, newest vs ES6, etc). From what I've heard, it's still cheaper/more performant to feed everything into the context than to finetune models.
    • 2099miles 14 days ago
      Yeah, I'd say cost is the biggest thing. Why doesn't everyone just use GPT-4 for everything, or Gemini Ultra + RAG with all documents in the RAG system and the best embedding model?

      Among other things, because it's way too expensive. Narrowing your scope cuts huge costs and isn't hard to do at a high level.

      • 0x008 14 days ago
        There is also the problem that most of today's LLMs will somehow lose (or ignore) the middle of the context and prefer the beginning and/or end.
  • mark_l_watson 14 days ago
    I wrote a book on LangChain and LlamaIndex about 14 months ago, and at the time I thought that RAG-style applications were great, but now I am viewing them as being more like material for demos. I am also less enthusiastic about LangChain and LlamaIndex; they are still useful, but the libraries are a moving target and often it seems best to just code up what I need by hand. The moving-target issue is huge for me; updating my book frequently has been a major time sink.

    I still think LLMs are the best AI tech/tools since I started getting paid to be an AI practitioner in 1982, but that is a low bar of achievement given that some forms of Symbolic AI failed to ever scale to solve real problems.

    • cl42 13 days ago
      Do you have a problem with LangChain and LlamaIndex due to their changing codebases/APIs/etc., or do you think there's a fundamental issue with RAG itself?
      • mark_l_watson 13 days ago
        Maybe a mixture of both?

        I think both projects were super useful, grateful for them. There is a lot of utility tucked away in both projects.

        A year ago, I started using LLM APIs in non-Python languages and realized that sometimes building from scratch is better.

        • cl42 13 days ago
          Thanks for clarifying. I generally agree with you -- I prefer direct API calls versus weird/poor abstractions. The only abstraction I use is one that standardizes the API calls/data structures between Gemini/GPT-x/Llama 2/Claude.

          That being said, I do use RAG a lot (though generally code most of the implementations myself).

        • shuss 11 days ago
          Fast changing libraries are a huge pain. That's why a no-code approach like Unstract (https://github.com/zipstack/unstract) makes sense.
  • cl42 14 days ago
    RAG will have a place in the LLM world, since it's a way to obtain data/facts/info for relevant queries.

    Since you asked about alternatives...

    (a) "World models" where LLMs structure information into code, structured data, etc. and query those models will likely be a thing. AlphaGeometry uses this[1], and people have tried to abstract this in different ways[2].

    (b) Depending on how you define RAG, knowledge graphs could be a form of RAG, or an alternative to it. Companies like Elemental Cognition[3] are building distinct alternatives to RAG that use such graphs and give LLMs the ability to run queries on said graphs. Another approach here is to build "fact databases" where you structure observations about the world into standalone concepts/ideas/observations and reference those[4]. Again, similar to RAG but not quite RAG as we know it today.

    [1] https://deepmind.google/discover/blog/alphageometry-an-olymp...

    [2] https://arxiv.org/abs/2306.12672

    [3] https://ec.ai/

    [4] https://emergingtrajectories.com/

    • 0x008 14 days ago
      I don't understand why a knowledge graph would be an alternative to RAG. Knowledge graphs can be (and already are) used as part of a RAG pipeline.
      • ac2u 14 days ago
        Not sure either... unless it means a workflow where, instead of chunking your knowledgebase docs and generating embeddings, you do entity and relationship extraction on them and store an index of knowledge-graph serialisations mapped to source documents.
      • kordlessagain 12 days ago
        You can do set overlap on the extracted terms, and that gives similar relevancy to the chunks as their embedding vectors do. It's just a lower-dimensional representation of the relationship between the chunk and the query.
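
        Something like this, where term extraction is whatever you already run on the chunks (a sketch; `extract_terms`, `query` and `chunks` are placeholders):

            def jaccard(a: set, b: set) -> float:
                # set-overlap relevance between query terms and chunk terms
                return len(a & b) / len(a | b) if a and b else 0.0

            query_terms = extract_terms(query)
            ranked = sorted(chunks, key=lambda c: jaccard(query_terms, extract_terms(c)), reverse=True)
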
    • grugagag 14 days ago
      I hope it goes away because I have a real carpet to replace it with
  • supreetgupta 1 day ago
    TrueFoundry has recently introduced a new open-source framework called Cognita, which uses Retrieval-Augmented Generation (RAG) to provide robust, scalable building blocks for deploying AI applications.

    Try it out: https://github.com/truefoundry/cognita

  • darkteflon 14 days ago
    Unless we’re going to paste a whole domain corpus into the context window, we’re going to continue to need some sort of “relevance function” - a means of discriminating what needs to go in from what doesn’t. That could be as simple as “document A goes in, document B doesn’t”.

    That’s RAG. Doesn’t matter that you didn’t use vectors or knowledge graphs or FTS or what have you.

    Then the jump from “this whole document” to “well actually I only need this particular bit” puts you immediately into the territory of needing some sort of semantic map of the document.

    I don’t think it makes conceptual sense to think about using LLMs without some sort of domain relevance function.

    • arunsr1ni 13 days ago
      Recursive retrieval, although it's too early to say, is trying to solve for smaller chunk-sized documents. I still expect the need for a semantic map to persist.
  • mif 14 days ago
    For those of us who don’t know what RAG is (including myself), RAG stands for Retrieval Augmented Generation.

    From the video in this IBM post [0], I understand that it is a way for the LLM to check what its source and latest date of information is. Based on that, it could, in principle, say "I don't know" instead of "hallucinating" an answer. RAG is a way to implement this feature for LLMs.

    [0] https://research.ibm.com/blog/retrieval-augmented-generation...

    • simonw 14 days ago
      The best way to understand RAG is that it's a prompting hack where you increase the chance that a model will answer a question correctly by pasting a bunch of text that might help into the prompt along with their question.

      The art of implementing RAG is deciding what text should be pasted into the prompt in order to get the best possible results.

      A popular way to implement RAG is using similarity search via vector search indexes against embeddings (which I explained at length here: https://simonwillison.net/2023/Oct/23/embeddings/). The idea is to find the content that is semantically most similar to the user's question (or the likely answer to their question) and include extracts from that in the prompt.
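
      A bare-bones version of the embeddings approach (brute-force cosine similarity, no vector index; the embedding model name and the `chunks`/`question` variables are just placeholders):

          import numpy as np
          from openai import OpenAI

          client = OpenAI()

          def embed(text):
              r = client.embeddings.create(model="text-embedding-3-small", input=text)
              return np.array(r.data[0].embedding)

          chunk_vecs = [embed(c) for c in chunks]          # index time: embed every chunk once

          q = embed(question)                              # query time
          cos = [q @ v / (np.linalg.norm(q) * np.linalg.norm(v)) for v in chunk_vecs]
          context = "\n---\n".join(chunks[i] for i in np.argsort(cos)[-3:])
          prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"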

      But you don't actually need vector indexes or embeddings at all to implement RAG.

      Another approach is to take the user's question, extract some search terms from it (often by asking an LLM to invent some searches relating to the question), run those searches against a regular full-text search engine and then paste results from those searches back into the prompt.

      Bing, Perplexity, Google Gemini are all examples of systems that use this trick.
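
      That search-term flavour looks roughly like this (a sketch; `llm` and `fts_search` stand in for whatever model call and full-text index you use):

          # 1. have the LLM invent search queries for the user's question
          queries = llm(f"Suggest three short search queries for: {question}").splitlines()

          # 2. run them against an ordinary full-text search engine (SQLite FTS5, Elasticsearch, ...)
          hits = [hit for q in queries for hit in fts_search(q, limit=3)]

          # 3. paste the results back into the prompt
          prompt = "Context:\n" + "\n".join(hits) + f"\n\nAnswer this question: {question}"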

      • petervandijck 13 days ago
        An additional (and I was quite surprised by this) trick is to ask an LLM to reformulate the user prompt “to be more concise and precise”, then run a vector similarity search against that, which (in our experience) leads to better matches.

        There are many tricks to get better context to send to your LLM, and that’s a large part of making the system give good answers.
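
        In pseudo-Python (the helpers are placeholders for whatever LLM, embedder and vector store you use):

            concise = llm(f"Rewrite this to be more concise and precise: {user_prompt}")
            matches = vector_index.search(embed(concise), top_k=5)   # often better than searching the raw prompt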

  • rldjbpin 14 days ago
    RAG, to me, is just a glued-on solution to try to counter the obvious limitation of LLMs: that they essentially predict the next token in a very convincing fashion.

    I am sure there can be more elegant ways to inject content into the prompt, but for the most part the LLM is either summarizing the injected context or regurgitating it.

    If the output is satisfactory, that is still more convenient than writing custom rules for answers for each kind of question you want to address.

  • spencerchubb 14 days ago
    I believe RAG is a temporary hack until we figure out virtually infinite context.

    I think LLM context is going to be like cache levels. The first level is small but super fast (like working memory). The next level is larger but slower, and so on.

    RAG is basically a bad version of attention mechanisms: it is used to focus the model's attention on relevant documents. The problem is that RAG systems are not trained to minimize loss; retrieval is just a similarity score.

    Obligatory note that I could be wrong and it's just my armchair opinion

    • simonw 14 days ago
      It's not just about context length, it's about performance.

      A 100 million token context that takes an hour to start returning an answer to a prompt isn't very useful for most things.

      As long as there is a relationship between the length of the context and the time it takes to produce an output, there will be a reason to be selective about what goes into that context - aka a reason to use RAG techniques.

      • verdverm 14 days ago
        There's also focus. A long context means the LLM is more easily distracted. Context "gets the right neurons firing," which is why RAG does well. It's also why reranking with RAG improves output: it further increases the quality and/or relevance of what fits into a smaller context.

        Which is all to say, really, that data quality is probably the most distinguishing factor in LLM/RAG systems... to get back to OP's questions.

      • spencerchubb 14 days ago
        Good point! If datasets get that large that is an argument in favor of RAG
    • mikecaulley 14 days ago
      This doesn’t consider compute cost; the RAG model is much more efficient compared to infinite context length.
      • epistasis 14 days ago
        Agreed. I think that RAG implemented via tool-calling, with multiple agents talking to each other, is a much much more likely evolution in the future versus a single unified model.

        I could very well be wrong! But we wouldn't want LLMs to be performing lots of arithmetic calculations via exploiting hidden parts of themselves that do linear regression or whatever, far better to just give them the calculator and get results faster and cheaper. Similarly, we can give them a search engine (RAG) and let them figure it out more efficiently.

    • ttul 14 days ago
      https://arxiv.org/html/2404.07143v1

      “This work introduces an efficient method to scale Transformer-based Large Language Models (LLMs) to infinitely long inputs with bounded memory and computation”

    • darby_eight 14 days ago
      > I believe RAG is a temporary hack until we figure out virtually infinite context.

      I'd assume "large enough" context is the actual goal here, not "virtually infinite".

    • choilive 14 days ago
      I think that "next level" would essentially be a RAG-like referential information system that gets data via search engines or databases. Maybe we will have the "Google search" equivalent built entirely for LLM clients, where all data is stored, searched, and returned via vector embeddings, but it could tap into exabytes of information.
      • 0x008 14 days ago
        That is actually already what is happening, tools like llama_index are trying to provide all the necessary integrations.
    • 2099miles 14 days ago
      As a concept, infinite context will not be discovered. We can spin up huge models with large contexts that cost more, but even a 1M context is (1) expensive and (2) nowhere near 3 trillion tokens (the size of some datasets).

      RAG is not bad attention; RAG is for when the user doesn't know what context to give the LLM.

    • fallingknife 14 days ago
      That's going to be quite a while then because LLMs don't even perform well at anything close to the current context limit. They will miss important details when the context gets over a couple thousand tokens.
  • p1esk 14 days ago
    It's strange: most answers here assume the next-gen models won't be able to perform RAG on their own. IMO, it would be wise to assume the opposite - anything humans currently do to make models smarter will be built in.
    • simonw 14 days ago
      A Large Language Model itself can't perform RAG: a model is a big binary blob of matrices that you run prompts against.

      Anything that can do RAG is, by definition, a system that wraps an LLM with additional code that performs the retrieval.

      It's the difference between ChatGPT (software that wraps a model and adds extra features such as tool usage, Code Interpreter, RAG lookup via Bing etc) and GPT-4 Turbo (a model).

      • p1esk 14 days ago
        Why can’t a model explore its environment (given access) and find and use those tools? Why can’t a model fire up a query (sql, http, or whatever) to do the retrieval if it determines it needs more information?
        • simonw 14 days ago
          I'm talking about Large Language Models - the architectures behind most of the current generative text AI boom.

          In order to use tools they need to be run as part of a system that grants them access to tools, eg via the reAct pattern. https://til.simonwillison.net/llms/python-react-pattern
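
          The pattern is basically a loop like this (a sketch; `llm`, `parse_action` and the tool functions are placeholders for real components):

              tools = {"search": run_search, "calculate": run_calculator}   # placeholder tool registry
              transcript = SYSTEM_PROMPT + f"\nQuestion: {question}\n"
              while True:
                  step = llm(transcript)                 # model emits Thought / Action / Answer text
                  transcript += step
                  if "Answer:" in step:
                      break
                  name, arg = parse_action(step)         # e.g. "Action: search: population of Paris"
                  transcript += f"\nObservation: {tools[name](arg)}\n"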

          • p1esk 14 days ago
            Are you talking about a simple script which executes the model's instructions? It can literally be as simple as connecting the model to my computer's command line. I will tell it what I want done and give it my cc number or website creds.

            Obviously I’m talking about next gen models, like gpt5/6.

            • 0x008 14 days ago
              For that, a model needs to recursively parse its own output. That is not possible with the current architectures; we need agent frameworks for that.
        • dlubarov 14 days ago
          I think it's just a matter of semantics - "model" usually refers to a neural network or some similarly pure, deterministic computation, or so I thought.
        • arunsr1ni 13 days ago
          Though the terminology is incorrect (a model by itself cannot 'perform' anything other than encode/decode), LLM agents are developed to do what you're stating. But when you think about an LLM agent doing all sorts of things like SQL etc., it effectively becomes RAG (which in itself is a process rather than a technology).
          • alfor 13 days ago
            Does the LLM get trained as the database gets built?

            As you train a model, it should build its knowledge database at the same time. I guess that's how it works?

            When an LLM gets trained on next-word guessing, can it search the web to improve its results?

    • esafak 13 days ago
      I want an LLM that can perform self-inspection; attribute each response to its sources. And then I would replace RAG with continuous learning.
    • 2099miles 14 days ago
      I mean you can't build RAG into a model; it's part of a system or application.

      You wouldn’t say a steering wheel is built into an engine. People just build cars with engines.

      • p1esk 14 days ago
        I guess I’m a bit confused - what would be an equivalent to a “steering wheel” in the context of a human brain as an “engine”? Or would this steering wheel be part of the brain? If so, why do we need to separate the two in the context of an LLM?
  • teleforce 13 days ago
    Another potent alternative is perhaps Differentiable Search Index (DSI) based on Transformer Memory:

    Transformer Memory as a Differentiable Search Index:

    https://arxiv.org/abs/2202.06991

  • machinelearning 14 days ago
    Both RAG and infinite contexts in their current states are hacks.

    Both waste compute because you have to re-encode things as text each time and RAG needs a lot of heuristics + a separate embedding model.

    Instead, it makes a lot more sense to pre-compute KV for each document, then compute values for each query. Only surfacing values when the attention score is high enough.

    The challenge here is to encode global position information in the surfaced values and to get them to work with generation. I suspect it can't be done out of the box, but it will work with training.

    This approach has echoes of both infinite context length and RAG but is an intermediate method that can be parallelized and is more efficient than either one.

    • Prosammer 14 days ago
      uh yeah it works out of the box, this is how most RAG systems are designed, just look at pgvector for example.
      • machinelearning 14 days ago
        Nope, that's not how most RAG systems work today. I looked at pgvector and couldn't find anything similar.

        Do you have a link? Or maybe you misunderstood what I was talking about.

        • Prosammer 11 days ago
          Sorry for the late response. I must be misunderstanding your comment. I read your comment as "RAG doesn't pre-compute KV for each document, which is inefficient". With RAG, you convert your text into vectors and then store them in a DB — this is the pre-compute. Then you just need to compute the vector of your query and search for vector similarity. So it seems to me like RAG doesn't suffer from the inefficiency you were saying it suffers from.
          • machinelearning 11 days ago
            No, you've only discussed the Retrieval part of RAG, not the generation part.

            The current workflow is to use the embedding to retrieve documents then dump the text corresponding to the embedding into the LLM context for generation.

            Often, the embedding is from a different model from the LLM and it is not compatible with the generation part.

            So yea, RAG does not pre-compute the KV for each document.

            • Prosammer 11 days ago
              I see what you're saying now, thanks for clarifying.
  • sigmoid10 14 days ago
    The latest research suggests that the best thing you can do is RAG + finetuning on your target domain. Both give roughly equal percentage gains, but they are independent (i.e. they accumulate if you do both). As context windows constantly grow and very recent architectures move more towards linear context complexity, we'll probably see current RAG mechanisms lose importance. I can totally imagine a future where if you have a research level question about physics, you just put a ton of papers and every big graduate physics textbook into the current context instead of searching text snippets using embeddings etc.
  • nimish 14 days ago
    RAG is an easy way to incorporate domain knowledge into a generalized model.

    It's 1000x more efficient to give it a look-aside buffer of info than to try to teach it ab initio.

    Why do more work when the data is already there?

  • cjbprime 14 days ago
    It's hard to imagine what could happen instead. Even with a model with infinite context, where we imagine you could supply e.g. your entire email archive with each message in order to ask questions about one email, the inference time still scales with the number of input tokens.

    So you'd still want to use RAG as a performance optimization, even though today it's being used as more of a "there is no other way to supply enough of your own data to the LLM" must-have.

  • nl 14 days ago
    In the ~2 year timeframe we'll be using RAG.

    Longer term it gets more interesting.

    Assuming we can solve long (approaching infinite) context, and solve the issues with reasoning over long context that LangChain correctly identified[1], then it becomes a cost and performance (speed) issue.

    It is currently very very expensive to run a full scan of all knowledge for every inference call.

    And there are good reasons why databases use indexes instead of table scans (ie, performance).

    But maybe we find a route forward towards adaptive compute over the next two years. Then we could use low compute to find items of interest in the infinite context window, and then use high compute to reason over them. Maybe this could provide a way forward on the cost issues at least.

    Performance is going to remain an issue. It's not clear to me how solvable that is (sure you can imagine ways it could be parallelized but it seems likely there will be a cost penalty on planning that)

    [1] https://blog.langchain.dev/multi-needle-in-a-haystack/

    • 0x008 14 days ago
      Speaking as software developers, we have to see RAG as a kind of caching mechanism.

      Instead of computing every token every time over the whole context, we can grab a cache to take a shortcut. We do the same in software development all the time. Of course it's a performance issue.

  • sc077y 14 days ago
    RAG is a fantastic solution and I think it's here to stay one way or another. Yes, the libs surrounding it are lacking because the field is moving so fast, and yes, I'm mainly talking about LangChain. RAG is just one way of grounding; that being said, I think it's Agent Workflows that will really be the killer here. The idea that you can assist or even perhaps replace an entire task-fulfilling unit, aka a worker, with an LLM assisted by RAG is going to be revolutionary.

    The only issue right now is the cost. You can bet that GPU performance will double every year, or even every 6 months according to Elon. RAG addresses cost issues today as well, by only retrieving relevant context. Once LLMs get cheaper and context windows widen, which they will, RAG will be easier, dare I say trivial.

    I would argue RAG is important today on its own and as a grounding, no pun intended, for agent workflows.

  • haolez 14 days ago
    I don't think so. Token windows are always increasing, and new architectures (LeCun is proposing some interesting stuff with world models) might make it cheaper to add knowledge to the model itself. I think it's more of a necessity of our current state of the art than something that I'd bet on.
    • 0x008 14 days ago
      I think no matter how large the context windows get and no matter how fast inference speeds get, there will always be a break point where the context we have is so large that either cost or inference time is not a good experience in real life, and we have to split up the context.

      We cannot simply state that at some point in time RAG will not be necessary. Like everything in the computer science world it always will depend on our data size and the resource constraints we have.

      Unless of course we can process a corpus the size of the whole internet in <1 second. However, I doubt this can be achieved in the next 20 years.

  • redskyluan 14 days ago
    What I observe: simple RAG is fading, but complex RAG will persist and evolve, involving query rewriting, data cleaning, reflection, vector search, graph search, rerankers, and more intelligent chunking. Large models should not just be knowledge providers, but tool users and process drivers.
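
    Roughly the shape of such a pipeline (every helper here is a placeholder for a real component):

        def complex_rag(question: str) -> str:
            q = rewrite_query(question)                      # query rewriting
            hits = vector_search(q) + graph_search(q)        # hybrid retrieval
            hits = rerank(q, dedupe(hits))[:5]               # cross-encoder reranker
            draft = llm(build_prompt(q, hits))
            return reflect_and_retry(draft, question, hits)  # reflection / self-critique
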
    • danielmarkbruce 14 days ago
      It's becoming so complex that it will stop being called RAG. It's just an application that uses an LLM as one part of it.
      • arunsr1ni 13 days ago
        I see many folks mistaking RAG for a technology. It's just a process. No matter the complexity, if the underlying principle is to augment the LLM with information specific to the question at hand, it is RAG.
        • danielmarkbruce 11 days ago
          Folks are doing query expansion, chain of thought, all kinds of stuff that isn't even about adding extra info and doesn't involve R. They are just complex LLM applications.
    • esafak 13 days ago
      Do you know a good article on this?
  • waldrews 14 days ago
    I work on statistical quality control methods for the hallucination problem. Model how difficult/error prone a query is, and prioritize sending it to humans to verify the LLM's answer if it's high risk. Some form of human control like that is the only way to really cut hallucinations down to something like human-equivalent level (human answers are unreliable too, and should be subject to quality control with reputation scores and incentives as well).

    RAG can augment the LLM with specific knowledge, which may make it more likely to give factually correct answers in those domains, but it is mostly orthogonal to the hallucination problem (except to the extent that LLMs hallucinate when asked questions on a subject they don't know).

  • zamalek 13 days ago
    RAG can't create associations to data that isn't superficially (i.e. found by the indexing strategy) associated with the query. For example, you might query about one presidential candidate and lose out on the context of all other presidential candidates (probably a bad example, but it gets the point across).

    It is "search and summarize." It is not "glean new conclusions." That being said, "search and summarize" is probably good for 80%.

    LoRA is an improvement, but I have seen benchmarks showing that it struggles to make as deep inferences as regular training does.

    There isn't a one-size fits all... Yet.

    • henry-aryn-ai 13 days ago
      Well, theoretically you should be able to replace the "search" part of "search and summarize" with more analytics-y things - counts, aggregations, joins, whatever - and throw some prompt formatting at it and I'll bet you can get some pretty good conclusions out of an LLM. Not sure you can call it RAG, but that can probably cover a good 90% of the remaining 20%
  • throwaway74432 14 days ago
    I suspect we'll discover/invent an IR (intermediate representation) that behaves like RAG in that it primes the LLM to produce a specific bit of knowledge/facts, but the IR is a lot less like normal english, and more like a strange pseudo-english.
  • dragonwriter 13 days ago
    RAG is mostly a hack to address limited context windows, limited use of wide context windows (some models have large windows but don't make good use of content that isn't near the beginning or end), or expensive context windows (LLM-as-a-service typically charges by the token, so RAG can reduce cost).

    “Stuffing relevant data into the context window rather than relying purely on training” is a solution to confabulation, though, just like providing relevant reference information to a person who is being pressured to answer a question is.

  • simonw 14 days ago
    I wouldn't call it the "future of LLMs". I do see it as both the present and future of one of the application areas of LLMs, which is answering questions against a custom collection of content.
  • mehulashah 14 days ago
    We think that RAG is fundamentally limited:

    https://www.aryn.ai/post/rag-is-a-band-aid-we-need-llm-power...

    We do see a world where LLMs are used to answer questions (Luna), but it’s a more complex compound AI system that references a corpus (knowledge source), and uses LLMs to process that data.

    The discussion around context sizes is a red herring. They can't grow as fast as the demand for data.

    • datadrivenangel 13 days ago
      so basically RAG but with a much more sophisticated retrieval system to put information into the response?
  • sc077y 14 days ago
    Thinking back, if LLMs are able to have a memory store and access to it, then RAG becomes useless. RAG is like a system that shoves bits into RAM (the context window) and asks the CPU (the LLM) to compute something. But if you expand the RAM to a ridiculous amount, or you use the HDD, it's no longer necessary to do that. RAG is a suboptimal way of having long-term memory. That being said, today it is useful. And when or if this problem gets solved is not easy to say. In the meantime, RAG is the way to go.
  • lqhl 14 days ago
    LLM applications can benefit from Retrieval-Augmented Generation (RAG) in a similar way that humans benefit from search engines like Google. Therefore, I believe RAG cannot be replaced by prompts or fine-tuning.

    https://myscale.com/blog/prompt-engineering-vs-finetuning-vs...

  • 0x008 14 days ago
    What people are not always considering is that RAG has many more applications than just selecting the relevant context chunks. After all, the R in RAG does not stand for vector search; it stands for "Retrieval".

    With RAG tools that exist today, we can already do things like

    - providing summaries

    - hierarchical summarization

    - generation of questions / more prompts to nudge the model

    - caching

    - using knowledge graphs, function calling, or database connectors for non-semantic data querying

    etc.

  • atleastoptimal 13 days ago
    What will happen as inference cost goes down is RAG will just be one master LLM calling a bunch of smaller LLMs spanning the context window of every document you are querying. Context windows went from 8k to like 128k or more in like a year. In a few years we will have practically unlimited context windows for minimal cost.
  • stainlu 14 days ago
    Yes. So basically RAG is RAM for humans and AI to interact with each other. Doing no RAG is on one side imprecise (for lack of context) and on the other side inefficient (think of non-RAG as having a more general attention).

    Inefficiency (in other words, higher expense) is sometimes even easier for decision-makers to perceive.

  • syndacks 14 days ago
    Are there any best practices for doing RAG over, say, a novel (50k-100k words)? Something that makes this unique compared to, say, RAG over smaller docs or research papers: the ability to return specific sentences/passages about a character while also keeping their arc in mind from the beginning to the end of the story.
  • Lerc 14 days ago
    It doesn't matter how much knowledge augmentation is provided, if it is less than infinite, hallucinations are going to be a problem. This is a mitigation, not a solution.

    The solution is finding a way for models to recognise the absence of knowledge.

  • tslmy 14 days ago
    To make an LLM relevant to you, your intuition might be to fine-tune it with your data, but:

    1. Training an LLM is expensive.

    2. Due to the cost of training, it's hard to update an LLM with the latest information.

    3. Observability is lacking. When you ask an LLM a question, it's not obvious how the LLM arrived at its answer.

    There's a different approach: Retrieval-Augmented Generation (RAG). Instead of asking the LLM to generate an answer immediately, frameworks like LlamaIndex:

    1. retrieve information from your data sources first,

    2. add it to your question as context, and

    3. ask the LLM to answer based on the enriched prompt.

    RAG overcomes all three weaknesses of the fine-tuning approach:

    1. There’s no training involved, so it’s cheap.

    2. Data is fetched only when you ask for it, so it's always up to date.

    3. The framework can show you the retrieved documents, so it’s more trustworthy.

    (https://lmy.medium.com/why-rag-is-big-aa60282693dc)
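
    A minimal sketch of those three steps with LlamaIndex (import paths assume a recent release and may differ in your version; the data directory and query are just examples):

        from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

        docs = SimpleDirectoryReader("./my_data").load_data()        # your data sources
        index = VectorStoreIndex.from_documents(docs)                # 1. build the retrieval index
        engine = index.as_query_engine()

        response = engine.query("What changed in the latest release?")  # 2 + 3. retrieve, enrich, answer
        print(response)
        print(response.source_nodes)   # the retrieved chunks, for observability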

    • choilive 14 days ago
      This is the state of LLMs today - it is likely that we will have models in the future that can do some form of "online" training - or new training methods that aren't nearly as compute intensive. There are many people working on these scaling issues with LLMs today. We already have new attention heads that work around the quadratic time and space complexity of the input prompts.
  • edude03 14 days ago
    I'm bullish on RAG, since you'll always need a way to work with new information without retraining or fine-tuning an LLM. Even as humans, we essentially do RAG with Google.
  • darby_eight 14 days ago
    It's certainly the near future; it's the main option that offers parameterization of behavior outside the initial prompt and training data.
  • thatjoeoverthr 14 days ago
    With this technology, faster chips will solve it _for an application-specific definition of "solved."_

    The models aren't actually capable of taking into account everything in their context window with industrial yields.

    These are stochastic processes, not in the "stochastic parrot" sense, but in the sense of "you are manufacturing emissions and have some measurable rate of success." Like a condom factory.

    When you reduce the amount of information you inject, you both decrease cost and improve yield.

    "RAG" is application specific methods of estimating which information to admit to the context window. In other words, we use domain knowledge and labor to reduce computational load.

    When to do that is a matter of economy.

    The economics of RAG in 2024 differ from 2022, and will differ in 2026.

    So the question that matters is, "given my timeframe, and current pricing, do I need RAG to deliver my application?"

    The second question is, "what's an acceptable yield, and how do I measure it?"

    You can't answer that for 2026, because, frankly, you don't even know what you'll be working on.

  • intended 14 days ago
    The best solution to reducing the problem of Hallucinations is first someone telling us what their Error rates in production are.
    • xpe 14 days ago
      Based on what assumptions can one validly claim this is the ‘best’ solution?

      (In response to “The best solution to reducing the problem of Hallucinations is first someone telling us what their Error rates in production are.”)

      If I were to make an educated guess, one response would be akin to “one cannot correct errors without measuring the errors at prediction time”. This is incorrect; measuring prediction errors is not the only way.

      Some kinds of errors can be reduced or even eliminated before they are generated. Consider the space of models that are structurally incapable of making certain classes of errors.

      To explain it in another way: “turning the crank on a model” isn’t the only way to reason about it.

      For example, in statistical modeling, there are kinds of models that do not suffer from out of sample extrapolation.

      Also, the ideas of trust regions (from RL) might inspire alternative methods worth exploring.

      • intended 14 days ago
        The issue at hand right now, is that the discussion on Hallucinations is missing information on interaction with clients and users.
        • xpe 13 days ago
          I'm not seeing a connection between the question I asked and your response. It doesn't seem to me that you engaged; it feels more like you "bounced off" what I wrote. What did I miss? What did you miss? Communication is rarely as transparent as people think. [1]

          [1] "Expecting Short Inferential Distances" by E. Yudkowsky 2007

  • stuaxo 14 days ago
    I think it's a good complement.

    If LLMs are akin to a "low-resolution JPEG of the internet", RAG allows checking of facts.

  • ayushl 14 days ago
    We'll get token-level RAG, something similar to the routing mechanisms in MoE.
  • hbarka 14 days ago
    Does RAG depend on a vector database?
    • 0x008 14 days ago
      tl;dr

      we only need a vector database if we want to do semantic vector-embedding search AND our dataset is too large for memory

      long answer:

      RAG is just a pattern of working with LLMs, independent of particular database technologies.

      Basically it means you inject some context data or domain specific data into the prompt to achieve the following:

      1. Nudging the model in the right direction so it will use the "correct parts" of its "global knowledge". We want to nudge it into the correct domain or a domain very similar to our problem. We do this to generate better answers and reduce hallucinations, or to adhere to a particular style we want to achieve. This is basically just a prompt-engineering technique.

      2. Achieving in-context learning by providing some context to the model. This is also a prompt engineering technique used to make the model reason about things it does not know from the training and to allow it to reason about a particular text you provide at runtime.

      What I just explained is basically the "AG"-Part of RAG: Augmented Generation. It boils down to putting some external data into the context.

      The R (Retrieval), on the other hand, is about finding the correct data to put in the context.

      If you copy an email and paste into the prompt of ChatGPT, you are basically already doing RAG. The R is copy pasting the Email from Gmail, and the AG is putting it into the Prompt.

      So, no, generally speaking you don't need a vector Database.

      In the real world you will probably have a large context (some files on disk, a PDF file, etc.) and you don't want to manually select the relevant bits of information, of course. We want to automate the Retrieval part of RAG. This is mostly what people mean when they talk about "RAG". Essentially it is an information-retrieval, or search, problem.

      A lot of programmers and projects which come from the LLM community will use RAG with a semantic vector search. Semantic vector search has been shown to be quite good at selecting relevant context chunks from full text, and it also feels natural because it uses embedding models, which are very similar to LLMs. By using vector-embedding search we simulate with a manual step what we think will happen when we put all the data into the LLM context: selection through semantic interpretation.

      In this case we need some form of "vector search", but we do not need a database for that per se, we can also do that in-memory. However, if our dataset is too large to be efficiently processed in-memory, we would need a vector database.

      However, we can also use any other means of information-retrieval techniques or search techniques, like keyword search, full-text search, knowledge graphs, sql queries, and many more, to retrieve the relevant bits of information we want to pass to the model.

      In that case we also don't need a vector database. Maybe some other database, maybe a search tool like elastic.io, maybe no database at all.

      • manishsharan 14 days ago
        Thank you for your insights. As someone who is absolutely new to this, could you clarify the role of embeddings in RAG? Do I always need to use OpenAI's embedding model (text-embedding-ada-002) for embeddings?

        Supposing I do use the text-embedding-ada-002 model and store the index in a vector database, will I be able to use these for RAG with other LLMs such as Claude Haiku etc.? Or does each LLM have its own text embedding model?

        • 0x008 12 days ago
          The embedding model is only involved in the "R" part (retrieval). You need to use the same embedding model for indexing and retrieval, though; that is important.

          The embedding model is independent of the actual LLM used in generation. You can use any embedding model with any LLM. In fact, ada is only one option, and there are a lot of really good embedding models readily available, including ones you can easily run locally and ones that are a lot better than ada. ada is not the best choice in all cases.

        • esafak 13 days ago
          The role of the embedding is enabling similarity search.

          You need to use the same embedding for indexing and retrieval. Beyond that, you want to select an appropriate embedding, optimized for the kind of content you have.

      • uxcolumbo 13 days ago
        This is a great explanation for someone like me, ie total novice.

        You’ve explained it in a way so that each part of the abbreviation makes sense now.

        Thanks!