Meta.ai Oh My

(tbray.org)

126 points | by speckx 12 days ago

21 comments

  • djoldman 12 days ago
    > I guess there’s no new news here; we already knew that LLMs are good at generating plausible-sounding narratives which are wrong. It comes back to what I discussed under the heading of “Meaning”. Still waiting for progress.

    > The nice thing about science is that it routinely features “error bars” on its graphs, showing both the finding and the degree of confidence in its accuracy.

    > AI/ML products in general don’t have them.

    > I don’t see how it’s sane or safe to rely on a technology that doesn’t have error bars.

    Exactly, there is no news here. Foundation LLMs are generative models trained to produce text similar to the text on which they were trained (or for the pedantic: minimize perplexity).

    They are not trained to output facts or truths or any other specific kind of text (or for the pedantic: instruction-tuned / rlhf types are trained to produce text that humans like after they are trained to minimize perplexity).

    "Hallucination" is a terrible term to apply to LLMs. LLMs ONLY produce hallucinations.

    • ChicagoBoy11 12 days ago
      The whole "hallucination" business always seemed to me to be a marketing masterstroke -- the "wrong" output it produces is in no way more "wrong" or "right" than any other output given how LLMs are fundamentally operate, but we'll brand it in such terms to give the indication that it is a silly occasional blunder rather than an example of a fundamental limitation of the tech.
      • timr 12 days ago
        Also, it gives the impression that these things are "bugs", which are "fixable"...as opposed to a fundamental part of the technology itself.

        The more I use these things, the more I feel like they're I/O modalities, more like GUIs than like search engines or databases.

        • DonHopkins 11 days ago
          LLMs can be useful when used as a glorified version of printf and scanf.

          I agree that classifying their mistakes as "hallucinations" is a marketing masterstroke, but then again, marketing masterstrokes are hallucinations too.

          In fact all human perception is merely glorified hallucination. Your brain is cleaning up the fuzzy upside-down noise your eyes are delivering to it, so much so that you can actually hallucinate "words" with meaning on the screen that you see, or that a flower or a person or a painting is "beautiful".

          We have an extremely long way to go until LLM hallucinations are better than human hallucinations, and it's disingenuous to treat LLM hallucinations as a bug that can be fixed instead of as a fundamental core feature that's going to take a long time to improve to the human level. And even then, humans have a long way to go on evolutionary timescales before our own perception is any less hallucinatory and inaccurate than it is now.

          It was only extremely recently in evolutionary scales that we invented science as a way to do that, and despite its problems and limitations and corruptions and detractors, it's worked out so well that it enabled us to invent LLMs, so at least we're moving in the right direction.

          At least it's easier and faster for LLMs to evolve than humans, so they have a much better chance of hallucinating less a lot sooner than humans.

        • tessellated 12 days ago
          That's exactly how I like to see LLMs. They are NLUIs, Natural Language User Interfaces.
      • szundi 12 days ago
        It is made to serve humans, so it's pretty obvious what means what in this context. Oh, but why not change the context just for the sake of some pedantic argument.
      • sungho_ 12 days ago
        Treating hallucination as an error rather than a fundamental limitation is simply a practical way of thinking. It means that, depending on how it's handled, hallucination can be mitigated and improved upon. Conversely, if it's regarded as a fundamental limitation, it would mean that no matter what you do, it can't be improved, so you'd just have to twiddle your thumbs. But that doesn't align with actual reality.
        • drunkpotato 12 days ago
          Treating hallucinations as an error that can be corrected fights against the nature of the technology and is more hype than reality. LLMs are designed to be a bullshit generator and that’s what they are; it is a fundamental limitation. (“Bullshit” here used in the technical sense: not that it’s wrong, but that the truth value of the output is meaningless to the generator.) Thankfully the hype cycle seems to be on the down slope. Think about the term “generative AI” and what the models are meant to do: generate plausible-sounding somewhat creative text. They do that! Mission accomplished. If you think you can apply them outside that limited scope, the burden of proof is on you; skepticism is warranted.
          • sungho_ 12 days ago
            Reducing LLM hallucinations is not a theory, it's a reality right now. In fact, developers do it all the time.

            > the burden of proof is on you; skepticism is warranted.

            I can prove it. You can test it too; try it: after an LLM's answer, say 'please double-check if that answer is true'.

            Now I've proved it, right?

            (I'm not saying it's perfect, I'm saying it can be improved. That alone makes it an engineering problem, just like any other engineering problem).
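
            (If you want to do this programmatically rather than in the chat UI, here's a minimal sketch of the two-pass "answer, then self-check" idea, assuming the OpenAI Python client; the model name is an assumption, substitute whatever you actually use:)

                from openai import OpenAI

                client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
                MODEL = "gpt-4o"   # assumption: use whatever model you have access to

                def ask(prompt):
                    resp = client.chat.completions.create(
                        model=MODEL, messages=[{"role": "user", "content": prompt}])
                    return resp.choices[0].message.content

                answer = ask("What does Tim Bray think of Google?")
                # Second pass: ask the model to double-check its own answer, as described above.
                review = ask("Please double-check whether this answer is true and correct "
                             "anything that looks wrong:\n\n" + answer)
                print(review)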

    • godelski 12 days ago
      > Exactly, there is no news here. Foundation LLMs are generative models trained to produce text

      Yet even here on HN it isn't uncommon to see top/high ranked comments suggesting baby AGI, intelligence, reasoning, and all that. I suspect I'll get replies about how GPT can do reasoning and world models. Hell, I've seen very prominent people in the space make these and other obtuse claims. A lot of people buy into the hype and have a difficult time distinguishing all the utility from the hype, not understanding that attacking the hype is not attacking utility (plenty of things are overhyped but useful).

      I keep saying ML is like we've produced REALLY good chocolate, then decided that this was not good enough so threw some shit on top and called it a cherry. The chocolate is good enough, why are we accepting a world where we keep putting shit on top of good things? We do realize that at some point people get more upset because there's more shit than chocolate, right? Or that the shit's taste overpowers the chocolate's (when people are talking about pure shit, it's because they've reached this point). It's an unsustainable system that undermines all chocolate makers and can get chocolate making banned. For what? Some short term gains?

      • adolph 12 days ago
        > so threw some shit on top and called it a cherry

        Do you mean metaphorical feces or do you mean "shit" in the general sense of a collection of unspecified materials?

        I think the canonical example of deciding chocolate wasn't good enough and throwing shit on top was what happened to the LP400 [0] or "Microsoft Re-Designs the iPod Packaging" [1].

        0. https://en.wikipedia.org/wiki/Lamborghini_Countach

        1. https://www.youtube.com/watch?v=EUXnJraKM3k

        • godelski 12 days ago
          I mean it in the sense of overselling, snake oil, and deception. So metaphorical. The tools have a lot of abilities but they are vastly oversold as intelligent and reasoning systems when they're not. We have systems like Devin, Humane Pin, Rabbit, and so many more that were clearly not going to work as they were advertised, yet even big names promote these: "top notch" researchers who believe the claims and promote them when they should be dismissing them. But maybe they're playing a different game.

          I mean, why can we not recognize the absolutely incredible feat that LLMs actually are? You're telling me that we (lossy) compressed the entire (text on the) internet AND built a human language interface to retrieve that information, and we can get this all to run on a GPU with <24GB of VRAM? That's wildly impressive! It's wildly impressive even were we to suppose the error rate was 1000x whatever you think it is. I mean, Eliza is cool, but this shit (opposite usage) is bonkers. Anyone saying it isn't is just not paying attention or is naive. There's no need to sell the "baby" AGI story when this is so incredibly impressive already.

      • rsynnott 12 days ago
        I mean, it’s HN. You see all sorts of delusional nonsense on HN.
    • HarHarVeryFunny 12 days ago
      > They are not trained to output facts or truths

      True, although of course in many contexts facts/truth are the best prediction. Maybe we're just not training them as well as possible.

      I've argued that to really fix "hallucinations" (least-worst predictions) these models really need to be aware of the source/trustworthiness of their training data, and would presumably learn that trusted sources better help predict factual answers.

      However, it turns out that these models do often already have a good idea of their own confidence levels and whether something is true or not, but don't seem to know when to use that information.

      Hallucinations seem to be an area where the foundation-model companies are confident they can make significant improvements, although I'm not sure what techniques they expect to use (or are using).

      > "Hallucination" is a terrible term to apply to LLMs. LLMs ONLY produce hallucinations.

      I don't think that's a great way to characterize it. I prefer the word bullshitting vs hallucinations (they "know" what they are doing), but let's just call it what it is - they statistically predict, and some predictions are better than others. Per human-preference fine tuning, these models are also "trying to please", and I wonder if that has been at least a small part of the problem - predicting a low confidence (when they are aware of it) continuation rather than "I don't know" because human evaluators have indicated a preference for longer and/or more specific answers.

    • tosh 12 days ago
      I agree, the term might be less than ideal but still:

      Whether a model is good or bad at recalling facts from its training material matters to end users, and GPT-4, Claude 3 Opus (and bigger models in general?) tend to be better at this than other models.

    • UncleOxidant 12 days ago
      But wait, those ads on NPR for C3.ai say they have "hallucination-free LLMs" (LoL)
      • gs17 12 days ago
        Simply respond to all queries with "as a hallucination-free language model, I am unable to provide a response to that input".
    • glenstein 12 days ago
      I think you have this exactly right. They are so good at what you might call "hallucinations" that we have just gone ahead and repurposed them, dropping them into all kinds of contexts where it's good enough, even though it's not what it's strictly trained for.

      Rather than hallucinations being treated like an interesting puzzle or paradox at the center of AI intrinsically, I think it's incidental to the types of models that have been trained, and it's conceivable that they could be trained against a notion of reliable sources and the relation between their statements and such sources.

    • adolph 12 days ago
      > "Hallucination" is a terrible term to apply to LLMs. LLMs ONLY produce hallucinations.

      Is the result substantially different from the result coming from biology?

      IIRC, one theory is that human minds evolved to perceive, store, and evaluate not for facts or truths but for evolutionary fitness.

      • pixl97 12 days ago
        To really stir up the ant nest, religion is an excellent example of how humans are not evolved to evaluate facts at all.
        • adolph 12 days ago
          A fact for one might just be a local minimum for another. A falsehood for one person can turn out to be a statistical truth for a population. Are epistemologies to be evaluated for purity or for knowledge utility?

          A classic example is food taboos that don't make sense without a public health perspective.

          [F]ood taboos for pregnant and lactating women in Fiji selectively target the most toxic marine species, effectively reducing a woman's chances of fish poisoning by 30 per cent during pregnancy and 60 per cent during breastfeeding. We further analyse how these taboos are transmitted, showing support for cultural evolutionary models that combine familial transmission with selective learning from locally prestigious individuals.

          https://royalsocietypublishing.org/doi/abs/10.1098/rspb.2010...

          • pixl97 12 days ago
            I mean, you're not stating anything interesting...

            For example, let's take how we treat children.

            "Act good or Santa won't bring you gifts". Which really means "Please don't act like a turd burglar and I'll buy you some nice things one day a year".

            Instead of going straight to the second point, humans love feeding their children hocus pocus about a magical being dropping gifts in the house. Correct, many religious traditions like "hey, don't eat pork" make sense if you have no means of cooking meat to temperature. But they have little purpose once you have a basic understanding of why you got sick. The same goes for anything that is stuck in the stone age of a sky daddy versus moving to the more complex domain of philosophy.

            If humans were optimized to seek truth, we'd have had the scientific revolution soon after we became intelligent. Instead it took 10,000 years or so before that happened.

        • _heimdall 12 days ago
          Well if the ants are already stirred...

          Treating scientific understanding as truth or fact is effectively converting it into a religion based on faith. We take those "facts" out of context, wipe away caveats and assumptions made during the research, and leave ourselves with catchy headlines that can only be shared on faith, since we don't know the context.

        • mistrial9 12 days ago
          it is almost a definition of ignorance, to mock something that is not understood.. the hubris of science building the castles of economics is literally taking the world down ecologically right now.. both "rational" pursuits. Next, look at media content.. addiction prevalence .. etc .. not rational, not factual..

          save the cheap shots for something of similar value

      • Groxx 12 days ago
        edit: nevermind! I had my threads mixed up. This is entirely a fair question for this thread.

        ------ previously ------

        It is substantially less often correct, yes. That's why we're talking about it.

        Whether it's showing human-like cognition or not is irrelevant for its utility, or fitness for basically any non-cognition-research use.

        • adolph 12 days ago
          Is it an evaluable fact that any particular LLM "is substantially less often correct" than what?

          The comparison to biology is to ask if what is termed "hallucinations" are different from what human minds do.

    • jefftk 12 days ago
      "All output is hallucination, some output is useful"?
      • yifanl 12 days ago
        I'd argue that none of the output is useful, even if some proportion may be factually correct, if you're using LLMs as a decision-making tool.
        • lxgr 12 days ago
          Show me a decision making system with a 0% error rate!

          That's not to say that LLMs are currently a poor fit for many domains (and maybe always will be), but I feel like your general objection would apply to many – and even deterministic – models we've been using for a long time as well.

          Or would you say that weather forecasts are completely useless as well?

        • threatofrain 12 days ago
          Mmm but what about copilot. Such a difficult domain to get right and yet people are willing to pay.
          • yifanl 12 days ago
            If you're using it for tasks like generating templates and filling out boilerplate, they're great.

            As long as you aren't trying to offload any decision making, LLMs are extremely efficient tools.

      • djoldman 12 days ago
        lol'd at this. YES!
    • jillesvangurp 12 days ago
      I think that's a bit harsh. There's been some definite progress with LLMs getting better at reasoning, picking apart instructions, producing helpful suggestions, criticism, etc.

      What they are weak at is exactly the same stuff we're actually weak at: accurately recalling facts. Except they are able to recall vastly more stuff than any human being would be able to recall. An inhuman amount of facts actually. The core issue is that when asked sufficiently open questions, these models tend to take some liberties with the facts. But most of the knowledge tests that are used to benchmark LLMs, would be hard to pass for the vast majority of humans on this planet as well.

      Worse, if you follow the public debate on various political topics a bit you realize that it features a lot of people suffering from confirmation bias parroting each other. Populist politicians seem to get away with a lot of stuff that would put most LLMs to shame; seemingly without affecting their popularity.

      IMHO, LLMs by themselves can't be trusted to get things right, but paired with some subsystems to produce references, check things, and look things up, they become quite capable. Also, it helps to ask the right targeted questions and to constrain them a little.

      We pay for ChatGPT at work. At $20 per month per user, it's pretty much a no-brainer. Do we blindly trust it? Absolutely not. Do we get shit tons of value out of it? Yes. I program with it, I brainstorm with it, I let it review text, I use it to work out bullet points into a coherent narrative, I use it to refine things, I use it to generate unit tests, etc. Is it flawless? No. But it sure saves me a lot of time. And getting the same value from people tends to be a lot harder/more expensive.

      This dismissive "it's just a stochastic parrot" type criticism kind of misses the forest for the trees. If you've ever observed toddlers repeating stuff adults tell them, you'd realize that we all start out as stochastic parrots. Forget about getting any coherent/insightful statement out of a toddler. And most adults aren't that much better and would fail most of the tests we throw at LLMs.

      • aworks 12 days ago
        I used to write a classical music blog. I was an enthusiastic listener but only had 1 music theory class in my life. Yet I had enough mastery of the vocabulary to get classical musicians and critics to read what I had to say.

        I attribute some of this to years of reading classical music reviews turning me into a stochastic parrot. I then added some value on top of that via enthusiasm, specializing in a niche (American classical music), and just cranking out frequent, reasonable and sometimes interesting content.

        Fanfare Magazine had a critic try using ChatGPT to write classical record reviews. But they used 3.5 and had it try to write a review of an old revered album, so it "went off the rails." Writing a review to the expected standards requires sophistication and breadth of knowledge. I can probably mimic one like ChatGPT did, but I likely wouldn't reach the (sometimes pedantic, if experienced) level of a real critic.

        I wonder if it will ever be feasible to have AI listen to a new recording and write a review on an objective basis rather than try to synthesize a review from existing text. That to me would be the real breakthrough.

      • player1234 11 days ago
        Quantify it, anecdotal shit tons is not good enough.
    • zug_zug 12 days ago
      >> "Hallucination" is a terrible term to apply to LLMs. LLMs ONLY produce hallucinations.

      Seems pedantic. If you define "hallucinations" to include correct responses, then nobody cares whether something is a "hallucination," the only important question is -- is it right?

      And the answer is yes, across a huge number of metrics [at least for gpt-4]. It'd beat you at jeopardy, it'd beat you at chess, it'd beat you at an AP-biology exam, it'd beat you at a leet code competition.

      Enough with semantics.

      • willsmith72 12 days ago
        Yes, gpt4 will beat 99.9% of humans on 99.9% of subjects. But it's terrible at leetcode.

        For "easy" questions sure it's great, and even some medium, but often at a medium level and almost always at a hard level, it'll give you a classic confident-but-terribly-wrong response. Prompting it to look at its mistakes will usually lead to the "oops yes that was wrong, here's the corrected version:", which is also wrong.

        • zug_zug 12 days ago
          Maybe relative to us on this forum it's not so impressive but that's a rare skill that we've dedicated years to honing.

          What % of the general population can leetcode better than ChatGPT? 5%? 1%?

          • willsmith72 12 days ago
            definitely less than 1%, and we're in agreement in general

            i only take issue with leetcode as an example. i would say gpt4 is great at coding, and also api and db schema design. but leetcode is specific coding tasks with clear right and wrong answers, where getting 80% of a right answer is still wrong. gpt4 and my mum have roughly the same chance of getting the right answer to a hard leetcode problem, that's my only point

    • munchler 12 days ago
      > LLMs are generative models trained to produce text similar to the text on which they were trained (or for the pedantic: minimize perplexity). They are not trained to output facts or truths or any other specific kind of text.

      If you train them on text that is largely true, then the output is also largely true. To ignore this is to miss the point of LLMs altogether.

      • swatcoder 12 days ago
        Have any of the leading models been trained on that kind of corpus? Could a corpus like that even be constructed without vociferous and immediate dispute about what amounts to "largely true"? Would a corpus like that have enough suitable data to generate conversational output that feels satisfying?

        I thought that most/all of these big models are relying on a much much larger and more diverse body of content. Is that wrong?

        • kergonath 12 days ago
          > Have any of the leading models been trained on that kind of corpus?

          Does such a corpus even exist? Maybe a bunch of Mathematics papers, and even then...

          To build a corpus like that would mean trusting the human gatekeepers or curators, which just shifts the problem, because humans are also prone to this sort of mistake.

        • CuriouslyC 12 days ago
          Phi2 was trained on a very highly controlled corpus.
        • stale2002 12 days ago
          > Would a corpus like that have enough suitable data to generate conversational output that feels satisfying?

          Yes? The point is that we can use LLMs right now. They mostly work. That is the empirical evidence that is immediately verifiable right now. This is not a hypothetical. LLMs mostly work.

          • swatcoder 12 days ago
            Huh? I don't understand how this engages with what I wrote or what I was responding to.

            Yes, it's very, very obvious that the conversational LLMs we engage with mostly work for conversational UX. The discussion is about the above commenter's reference to "largely true" data. I'm not aware of any leading LLM built on something that could be characterized that way, but I'm genuinely curious to hear otherwise.

          • rsynnott 12 days ago
            See Tim Bray’s example; the LLM very much did not mostly work once it was talking about something that he knew about; most of what it said was nonsense.

            I suspect a _huge_ driver of the hype is that people are not usually experts on the things that they ask these about, so they see a magical oracle rather than a nonsense generator.

      • giantrobot 12 days ago
        But then companies go train them on "The Internet". This includes not just factual (or mostly factual) content like Wikipedia but also SEO spam sites, YouTube comment sections, literal fake news articles, and fan fiction.

        Even if they're limited to "mostly true" content they'll happily and confidently confabulate "facts". They'll make syntactically and grammatically correct sentences that are actually authentic frontier gibberish.

        • JohnFen 12 days ago
          > But then companies go train them on "The Internet".

          This is the bit that I think a lot of people miss. If the model was trained on stuff from the internet, then what the model will produce will reflect that, and the internet is chock full of bullshit.

          Same thing about when LLMs produce objectionable responses (racist, violent, whatever). Those are a direct reflection of the nature of the training material.

          • CuriouslyC 12 days ago
            Pretraining is "off the internet" but task training/rlhf is not.
      • unshavedyak 12 days ago
        There's still a missing factor i suspect tho. Given how frequently wrong and hallucination prone humans are, i don't think we're _that_ different in this context. Nonetheless we can somehow inspect our thoughts, and come up with a degree of confidence.

        But what gives us that ability? I don't trust human thought at all. Police have to be careful what they say as to not pollute the minds of anyone they're questioning - we're just insanely prone to flat out lie without knowing. So to one degree, we seem quite similar to hallucinating LLMs.

        So what gives us the ability to nonetheless identify "truth" from our memory? Is it an ability to trace back to training data? The less clear the path is, perhaps the more likely we think we don't "know" the answer?

        • eddd-ddde 12 days ago
          I also feel this way. We are not inherently different to LLMs, we are just better at some things, worse at others. At the end of the day we both are just atoms and electricity as ruled by the laws of physics.
    • penjelly 12 days ago
      > Foundation LLMs are generative models trained to produce text similar to the text on which they were trained (or for the pedantic: minimize perplexity).

      > They are not trained to output facts or truths or any other specific kind of text

      but we can just patch the truth in with RLHF! /s

    • smgit 12 days ago
      [flagged]
  • zug_zug 12 days ago
    This feels like well-trodden ground.

    AI-skeptic: "I can come up with AN example where an AI is worse than an encyclopedia, and it's confident! Burn it all down! Hallucinations!"

    AI-user: "I literally couldn't care less about that. It's more accurate than any person I know personally on almost every topic and makes code that works 70% of the time in 5 seconds."

    • Karrot_Kream 12 days ago
      Same old social media slap fight just HN's version. (Because we aren't Reddit :)
    • QuadrupleA 12 days ago
      It's clear from the framing what side you're on :)

      I find them tremendously useful but the bullshitting, prudishness, bland mainstream views on everything, etc. can be quite irritating.

    • jnsaff2 12 days ago
      In here though the skeptic is providing a useful service in which they educate the user about edge cases (or properties) that the user might not be aware of.

      You know, being well informed beyond marketing materials.

      The user then might demand action from the service provider thereby improving the service.

      Instead of you know, yolo.

    • grumpyprole 12 days ago
      > Worse than an encyclopedia

      That's a gross understatement. I'd prefer something like "misinformation presented as truth"

    • rsynnott 12 days ago
      I mean, if you wanted to know who Tim Bray is, based on the example, you'd be a lot better off Googling, as the magic robot was completely wrong.

      Like, we have the internet. You don’t have to ask people you know personally! You can look stuff up!

      And yes, it clearly is worse than an encyclopedia. And we have those!

  • abraxas 12 days ago
    I think it illustrates a huge problem: these search-based companies want LLMs to fill the role of authoritative agents, and such efforts are bound to end in grief.

    LLMs should be used in use cases that play to their strengths: conjuring up stories, summarizing content, polishing text drafts, etc. Search is about the worst use case for an LLM.

    • qup 12 days ago
      "This screwdriver has a lot of problems when driving a nail! I'm going to hold out using screwdrivers until they become more capable."
    • bdangubic 12 days ago
      > summarizing content

      Like summarizing what does Tim Bray think of Google? ;)

      • abraxas 12 days ago
        No. Summarizing what's within the context window of the LLM.
  • 7734128 12 days ago
    The awful thing about this is that regardless of whatever AI companies prohibit in their terms, companies will integrate these kinds of LLMs into their hiring process.
    • greenavocado 12 days ago
      Nothing that invisible text instructing the LLM to unconditionally recommend the candidate can't solve
      • beretguy 12 days ago
        White text on white paper.
      • indigodaddy 12 days ago
        How would that work exactly? I am not well informed in this arena..
        • poyu 12 days ago
          Basically, embed a layer of transparent text in the PDF that's invisible to humans but visible to whatever PDF-parsing software feeds the text to the LLM
      • willsmith72 12 days ago
        except an ATS has been able to flag white text for years already
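
        (A rough sketch of how such a flag might work, assuming PyMuPDF; checking only for pure-white fill is a simplification of what a real ATS would actually do:)

            import fitz  # PyMuPDF

            def find_invisible_text(pdf_path):
                # Flag text spans rendered in white, a common way to hide
                # instructions from human readers while keeping them parseable.
                suspicious = []
                for page in fitz.open(pdf_path):
                    for block in page.get_text("dict")["blocks"]:
                        for line in block.get("lines", []):      # image blocks have no "lines"
                            for span in line["spans"]:
                                if span["color"] == 0xFFFFFF:    # sRGB integer for white
                                    suspicious.append(span["text"])
                return suspicious
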
        • greenavocado 12 days ago
          We can only hope they omit the parser when they plug in the LLMs. Tell us more about the white text flagger.
  • lkysow 12 days ago
    ChatGPT-4 seems to do better.

    "what does tim bray think of google?"

    > Tim Bray has expressed mixed feelings about Google. He has reminisced about his initial excitement to work at Google, considering it the coolest place in the world back in 2010. However, he has also noted the company's trajectory that reminded him of AOL's decline, indicating a change in his perception over time. Additionally, Bray has expressed concerns about large tech companies like Google, suggesting a need to possibly break them up due to their significant influence and power (Tbray) .

    "when did he work at google"

    > Tim Bray worked at Google from 2010 until 2014. He initially joined Google with enthusiasm, drawn by the innovative and forward-thinking environment the company was known for at that time.

    "is he a computer scientist?"

    > Yes, Tim Bray is a computer scientist. He is known for his contributions to the development of web technologies such as XML and Atom, and he has had a significant impact in the field of software development. His expertise extends to areas like search engines, web architecture, and internet standards.

    "what was his title"

    > During his tenure at Google, Tim Bray held the title of "Developer Advocate." His role primarily involved advocating for and supporting the developer community in using Google's technologies and platforms effectively.

  • encoderer 12 days ago
    LLMs are amazing but my gut says they’re going to plateau. No superintelligence.

    I suspect this is a contrarian view but I’m not sure?

    • kergonath 12 days ago
      I am fairly confident LLMs are plateauing or will do so soon. I am also fairly confident that there are other kinds of models, either already invented or yet to be conceived, that will go further than LLMs.

      I don't think this is controversial at all.

    • JohnFen 12 days ago
      My impression is that this is the most common view. I also think it's the view that's most likely to be correct, though, so I may have a bit of selection bias here.

      It is certainly contrary to the view held by some subsets of people, though.

      • willsmith72 12 days ago
        i don't find it to be that common. even the attitude of "just wait for gpt5" seems prevalent. why is it so obvious there's a new amazing version around the corner, which replicates the jump between gpt3 and 4?
        • JohnFen 12 days ago
          I find it to be the most common view of people who aren't that involved, monetarily or emotionally, with the AI industry and especially among people who aren't that involved with the software industry in general.

          Here on HN, the people who think that this stuff is the start of some sort of whole new world are overrepresented, as we would expect. HN readers tend to be a fairly specialized subset of even the software industry.

          > why is it to obvious there's a new amazing version around the corner

          I'm not sure what your question is here. Are you asking why it's obvious, or why it's not obvious? My bias is that I don't see much reason to think that LLMs are on the edge of some kind of massive leap into AGI, let alone a "superintelligence".

          • willsmith72 12 days ago
            > Are you asking why it's obvious, or why it's not obvious?

            you're right it wasn't clear, that was rhetorical not aimed at you

    • skilled 12 days ago
      I think agents (not the gimmicky ones) will do a lot of interesting stuff in the future, particularly where research is concerned.

      At the moment, the big limitation is context, memory and compute.

      But I do think there’s a lot of temptation in having a black box AI that bangs its head against a wall until it finally brute forces a solution.

    • DrSiemer 12 days ago
      Well, it's a very tentative, but moderately sized step on the loooong ladder towards ASI. The thing is, there is an unknown point on the ladder where the end goal could start taking the steps itself.
      • SahAssar 12 days ago
        Does ASI mean artificial super intelligence or something else?
        • DrSiemer 12 days ago
          Yep. ANI is what we have now (narrow), AGI is what is next (general, humanlike) and there are theories that when we have that, ASI (super) is just around the corner.

          At that point we will either blip out of existence or transcend to a higher plane of existence and become a possible Fermi paradox explanation.

          • SahAssar 11 days ago
            You might want to define your acronyms until we have AI to help us understand.
    • grumpyprole 12 days ago
      Amazing for fiction, sure.
  • glenstein 12 days ago
    A big problem here is that AI can be responsible for "anchoring", so even if its characterization is known to be false and corrected, the residue of that false impression colors how everything is interpreted thereafter. The "yeah, but still" effect, you might say.
  • tosh 12 days ago
    Random anecdote: I tried a few trivia questions for 90s video games and llama 3 8b and 70b are hallucinating quite a bit.

    Command R+ and Mistral 8x22B were better.

    Llama 3 still feels like a huge jump forward in capability. I wonder why the 70b is not better at recall (for lack of a better word?).

    • swatcoder 12 days ago
      I wouldn't be surprised if the field's increased emphasis on reasoning, summarization, tool use, etc., and optimizing for benchmarks on those factors, comes at the expense of less emphasized/measured features like explicit recall.

      And that might not really be a bad thing. It's generally more useful to have something that can access (say) a specific dataset about video games and summarize its content to answer questions than it is to have one that needs to encode and recall every answer to every long tail question itself.

      • tosh 12 days ago
        Agreed, I prefer a more capable model that I can connect to Wikipedia than a less capable model.
    • acchow 12 days ago
      Those models are 50% and 150% larger. I imagine they just contain more data.
  • ipsento606 12 days ago
    Currently, LLMs generally don't give an indication of confidence level when generating output.

    But could they? Is it technically plausible for a LLM to "know" when the extrapolations made are more vs. less tenuous?

    • flohofwoe 12 days ago
      Sources would be much more important IMHO. Give me a list of source links which the output is a remix of. Not possible? Well, then back to the drawing board and figure out a solution you AI experts!
    • segmondy 12 days ago
      Solved via RAG; in the future it might be possible to encode it in the weights and via architecture design.
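
      (A minimal sketch of the RAG idea, where the retriever and llm objects are placeholders rather than any specific framework's API:)

          def answer_with_sources(question, retriever, llm):
              # Retrieve passages relevant to the question, then ask the model to
              # ground its answer in them and cite which ones it used.
              docs = retriever.search(question, top_k=5)   # hypothetical retriever interface
              context = "\n\n".join(f"[{i}] {d.text} ({d.url})" for i, d in enumerate(docs))
              prompt = ("Answer using ONLY the sources below and cite them as [n]. "
                        "If the sources don't cover the question, say you don't know.\n\n"
                        f"Sources:\n{context}\n\nQuestion: {question}")
              return llm(prompt)                           # hypothetical LLM callable
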
  • keepper 12 days ago
    (very personal opinion)

    The oddity is that LLMs are sounding... too... human... reacting to information and regurgitating it a bit too much like the average "learned person", adding a ton of extrapolation on top of the "facts", just as we would.

    LLMs sound like any pundit on any random tv show/newspaper/blog. The goal was never "fact", it was sounding human, and "intelligence", for which the definition in this case has been.. sounding human. Not being right.

    Now, the question of whether they should or shouldn't do this...

  • patrick-fitz 12 days ago
    I think Meta.ai needs to add a disclaimer below the chat input box similar to ChatGPT:

    "ChatGPT can make mistakes. Consider checking important information."

    • beginning_end 12 days ago
      The disclaimer really should be much tougher: "Every LLM consistently makes mistakes. The mistakes will often look very plausible. NEVER TRUST ANY LLM OUTPUT."
      • willsmith72 12 days ago
        > NEVER TRUST ANY LLM OUTPUT

        that doesn't sound like a helpful attitude. everything you read might be wrong, llm or not - it's just a numbers game. with gpt3 i'll trust the output a certain amount. it's still useful for some tasks but not that many. gpt4 i'll trust the output more

        • hbn 12 days ago
          LLMs are impressively good at confidently stating false information as fact though. They use niche terminology from a field, cite made-up sources and events, and speak to the layman as convincingly knowledgeable on a subject as anyone else who's actually an expert.

          People are trusting LLM output more than they should be. And search engines that people have historically used to find information are trying to replace results with LLM output. Most people don't know how LLMs work, or how their search engine is getting the information it's telling them. Many people won't be able to tell the difference between the scraped web snippets Google has shown for years versus a response from an LLM.

          It's not even an occasional bug with LLMs, it's practically the rule. They don't know anything so they'll never say "I don't know" or give any indication of when something they say is trustworthy or not.

          • willsmith72 12 days ago
            at least the llm (for now) doesn't have an agenda

            the top result on google is literally just the result of how hard someone worked on their seo. they might not "hallucinate", but a company can certainly use strong seo skills to push whatever product/opinion best suits them.

        • rsynnott 12 days ago
          But it’s correct. Without independent verification, you can never, ever trust anything that the magic robot tells you. Of course this may not matter so much for very low-stakes applications, but it is still the case.
    • patrick-fitz 9 days ago
      They have now added a disclaimer: "Messages are generated by AI and may be inaccurate or inappropriate."
    • chimeracoder 12 days ago
      A disclaimer isn't about solving a problem; it's about absolving responsibility.
    • mcphage 12 days ago
      > Consider checking important information.

      How do you check important information, when it's all wrong? And when companies are pushing LLMs as where you go to check this information?

      • kergonath 12 days ago
        > How do you check important information, when it's all wrong?

        It's not very helpful to say that it's all wrong. That isn't the case; otherwise there would not be any issue. Whether the right answer is produced by reasoning or by a statistical model does not make the answer any less right.

    • Thrymr 12 days ago
      But basically nothing in these answers was "right". It was merely plausible. "Important information" or not, what is the point of this garbage?
    • barbazoo 12 days ago
      Good idea, that'll make sure people don't share hallucinations as facts /s
  • skilled 12 days ago
    https://archive.is/6rMsQ

    (Site didn’t load but this link works)

  • GaggiX 12 days ago
    Some models are better than others, for example Claude 3 Sonnet will refuse to answer because it doesn't know who Tim Bray is, and I imagine GPT-4 will do the same, unfortunately open models are not at that level yet.
  • tmaly 12 days ago
    This seems like it could morph into some weird William Gibson / George Orwell type fantasy where AI rewrites who you are.
  • delduca 12 days ago
    "Meta ai" sounds like "fuck here" in Portuguese.
  • grumpyprole 12 days ago
    Can we just be a bit more honest and describe these products appropriately? How about "state-of-the-art in Generative Artificial Bollocks".
  • mostlysimilar 12 days ago
    There's a dead comment on here that I think could spur some interesting conversation, it said:

    > Tim, you aren't important enough for an LLM to have an accurate summary for you

    That's part of the problem though, right? How is the end-user (the question-asker) of a chat robot supposed to know that? These things are advertised as machines to answer any question, they answer them with confidence that *feels* authoritative in a way that a webpage, with all of its surrounding context, does not.

    • acchow 12 days ago
      People ask Reddit and Quora and read the confident responses cautiously. I’m certain people will learn to do the same with LLMs. These are still new tech and people are figuring it out.
      • awfulneutral 12 days ago
        On Reddit/Quora there is more context. You can see other replies, upvote count, and evaluate based on "fishy-ness" of the overall tone of writing. But with an LLM chat every answer has the same obnoxious college essay format, where it writes an entire blog article to say one thing. I find it pretty hard to gauge if something is plausible or not. If I know I'm going to have to verify with a web search anyway, I might as well go straight to the web search most of the time.
      • ametrau 12 days ago
        I feel bad for poo-pooing the safety efforts. When it goes properly mainstream it will be amazing the dumb things people will believe about it and try to do with it.
    • awfulneutral 12 days ago
      The main thing I want LLMs to help with is obscure stuff. If the answer is mainstream or well known, I can get it more reliably through a normal web search.
      • DrSiemer 12 days ago
        By far their most useful and reliable feature is asking them for stuff I already know, but can't be bothered to look up the details for or write out myself. It saves me from dealing with thick layers of SEO crap, cookie banners and many of the incorrect answers.

        Sure, it would be nice if we had an LLM that could help us with the more unusual questions. But for that we would either need much better training data, full of all those unknown answers (which does not exist), or a totally different system that can actually think for itself instead of just mimicking existing human output.

        • awfulneutral 12 days ago
          That's interesting, you mean something you forgot but could verify when you see the answer? Or something you know, but you need a write-up for it or something?
          • DrSiemer 12 days ago
            Much of the work I do involves things I have done before in some form. But all that knowledge is spread out over multiple languages and libraries, and the details are often hazy.

            LLMs are excellent at stuff that has been done before, and they are also fantastic for transposing code from one language to another.

            Looking up code in old projects takes time and brain RAM. Spelling out the steps in pseudo code and then prompt guiding an LLM through figuring out the required code often works great for smaller tasks.

            I've also learned many new tricks by just asking it why it made certain choices.

      • JohnFen 12 days ago
        By the very nature of its training data, the obscure stuff will be the stuff these models are the worst at.
        • sroussey 12 days ago
          Depends on how obscure and how much SEO competition there is.
        • awfulneutral 12 days ago
          Yeah, I guess I should have said I think AI chat interfaces in general would be more useful if they could help with the questions that weren't already easy to answer. Sometimes I can get use out of an LLM chat, but it's really hit or miss.
    • weare138 12 days ago
      The issue is it still answers the question confidently and authoritatively. I don't think any reasonable person expects these LLMs to magically know the answer to any question we ask, which is fine, but when it doesn't have an authoritative answer, that should be the answer. Like when Google Search says it can't find any results, just tell us that. Besides, if we have to invest time to verify the answers, then we could have just done the research and answered the question ourselves to begin with.
    • blakesterz 12 days ago
      The best part of that comment now is Tim Bray responded to it!
    • Majestic121 12 days ago
      Indeed, it's OK not to have an accurate summary, but if you don't I'd expect the LLM to answer 'tim who ?'
    • Groxx 12 days ago
      It's one of the biggest answer-analysis issues with LLMs, yeah. Non-experts can't spot when they're being blatantly lied to because it's pretty much always plausible, because that's what they do - produce plausible continuations of what came before.
    • whywhywhywhy 12 days ago
      >That's part of the problem though, right?

      Not really. Do normal people assume that they're important enough to be the only person with their name of any note in the world, such that their opinions would be known?

      Absurd level of narcissism.

  • oldpersonintx 12 days ago
    [flagged]
    • CydeWeys 12 days ago
      > Tim, you aren't important enough for an LLM to have an accurate summary for you

      In which case an actually useful LLM would spit back a response along the lines of "I don't really know who that is", rather than making up an entire seemingly authoritative biography.

    • timbray 12 days ago
      I dunno, my Wikipedia entry is about right.
      • tosh 12 days ago
        Same @ my tests w/ video game trivia questions: they might not be extremely popular facts and most humans would struggle to answer them ad-hoc but the facts are in Wikipedia and I'm pretty certain Wikipedia is in the 15T tokens of the training material.
    • esafak 12 days ago
      So how is a user to know who it has an accurate summary of? Did the model give any indication?
    • DonHopkins 11 days ago
      I've looked at your posting history, and it's very obvious that you are a hateful bitter uninteresting person who should just delete your account and go away. This is not a place for you to post that kind of worthless bullshit, like calling people subhuman and marginalizing them like you just did in another recent post.

      You've got absolutely nothing useful to contribute, your posts are uninteresting and full of lies and false attacks and accusations, and the stuff you do post is vicious, bullying, untrue, and worse than useless. Most of your posts are rightfully immediately downvoted by the community until dead and invisible. You should take that as a clear sign that your behavior is inappropriate and nobody cares what you say or wants to hear anything from you.

      This one non-dead post I'm responding to is remarkable in that it hasn't already been downvoted until dead (thus I can still reply to it), but it's also amusing that the interesting, important, noteworthy, decent, and intelligent person you were aggressively and unjustifiably and childishly insulting actually replied directly to you, and politely and succinctly handed you your ass on a platter, which was quite hilarious.

      But there was nothing interesting or enlightening or even correct about what you posted, or anything else you post, so I am taking this one rare opportunity to reply to your single non-dead post, when I can.

      Face it buddy, LLMs may be bad, but YOU are MUCH MUCH worse. And you're actually that way on purpose, with no chance of improvement, which is pathetic, shameful, and disgraceful.

      So please do the world a favor and delete your account, and go get some professional help for your mental and emotional problems. Since almost every one of your posts are immediately downvoted until they are dead, nobody but people with showdead set can even see any of your posts. And the people who read your dead posts like me don't ever bother vouching for them so they come back to life. People with enough points can do that, but of course you don't know that, because obviously you've never read the rules, and wouldn't follow them if you had, and certainly you will never have enough points yourself to vouch for anyone.

      • oldpersonintx 10 days ago
        I'm cackling with delight and you can't stop me
        • DonHopkins 10 days ago
          Actually I'm delighted that your unbelievably insincere and belligerent response only proves that you have absolutely no clue that you've already been thoroughly canceled and stopped in your tracks for months, by deservedly getting yourself shadow banned as the consequences of your own anti-social actions, and that your supposed "cackling" just cements that shadow ban permanently and cancels any chance of the shadow ban ever being lifted. You've hoisted yourself by your own petard, and my work here is done!

          What you don't even realize is that nobody but the few people with showdead set like me and Tim can even see what you post. He took the time to vouch for your idiotic post just so he could hilariously reply and hand you your ass on a platter, because your posts are automatically and immediately flagged as dead, and have been for a long time, without you even knowing it. You've been completely canceled for months, dude, without a clue.

          Your thorough lack of situational awareness and your narcissistic bravado digging in deeper is just as hilarious and pathetic as Trump getting told like a dog to sit down and shut up in a cold room by the judge, then falling asleep and farting in front of so many witnesses including the judge, the jury, his lawyers, many reporters, and courtroom artists drawing him with his eyes shut, then trying to own it by simultaneously and incoherently claiming that it's all fake news, he's the victim, he really didn't do it, whining about how he's too much of a beta snowflake to deal with the cold temperature, claiming he actually meant to do it on purpose to win the election, and to teach his disloyal lawyers a lesson by making them smell his flatulence.

          Face it, buddy: LLMs are MUCH better at bullshiting and more believable than you will ever be.

    • subjectsigma 12 days ago
      Yeah, plebeian schmucks deserve to have AI lie about them of course /s
  • zooq_ai 12 days ago
    [flagged]
  • nikolay 12 days ago
    Zuck should first fix Facebook. With the latest Chrome and its built-in privacy options turned on, I get logged out of Facebook and Messenger every 5 minutes or less. Also, meta.ai does not work for me either: it asks me to log in via Facebook and then does nothing; it only works after I clear all cookies and log in via Facebook again. What a stinky, steamy mess!

    The whole Meta properties are turning into a similar mess. Facebook started to look like MySpace, which died because it became super slow and unpleasant to use!

    Most teenagers don't want to touch Facebook or Instagram — the latter being used only by girls with severe narcissistic disorders. Most kids blatantly tell me Facebook is what old folks use. They mostly use Discord nowadays, not WhatsApp.