17 comments

  • retrofrost 30 days ago
    This is amazing work, but to me it highlights some of the biggest problems in the current AI zeitgeist: we are not really trying to work on any neuron or ruleset that isn't much different from the perceptron thats just a sumnation function. Is it really that surprising that we just see this same structure repeated in the models? Just because feedforward topologies with single neuron steps are the easiest to train and run on graphics cards does that really make them the actual best at accomplishing tasks? We have all sorts of unique training methods and encoding schemes that don't ever get used because the big libraries don't support them. Until we start seeing real variation in the fundamental rulesets of neural nets, we are always just going to be fighting against the fact that these are just perceptrons with extra steps.
    • visarga 30 days ago
      > Just because feedforward topologies with single neuron steps are the easiest to train and run on graphics cards does that really make them the actual best at accomplishing tasks?

      You are ignoring a mountain of papers trying all conceivable approaches to create models. It is evolution by selection, in the end transformers won.

      • retrofrost 30 days ago
        Just because papers are getting published doesn't mean it's actually gaining any traction. I mean, we have known that the time series of signals received plays a huge role in how bio neurons functionally operate, and yet we have nearly no examples of spiking networks being pushed beyond basic academic exploration. We have known glial cells play a critical role in biological neural networks, and yet you can probably count the number of papers that examine using an abstraction of that activity in a neural net on both your hands and toes. Neuroevolution using genetic algorithms has been basically looking for a big break since NEAT. It's the height of hubris to say that we have peaked with transformers when the entire field is based on not getting trapped in local maxima's. Sorry to be snippy, but there is so much uncovered ground it's not even funny.
        • gwervc 30 days ago
          "We" are not forbidding you to open a computer, start experimenting and publishing some new method. If you're so convinced that "we" are stuck in a local maxima, you can do some of the work you are advocating instead of asking other to do it for you.
          • Kerb_ 30 days ago
            You can think chemotherapy is a local maxima for cancer treatment and hope medical research seeks out other options without having the resources to do it yourself. Not all of us have access to the tools and resources to start experimenting as casually as we wish we could.
            • erisinger 30 days ago
              Not a single one of you bigbrains used the word "maxima" correctly and it's driving me crazy.
              • vlovich123 30 days ago
                As I understand it a local maxima means you’re at a local peak but there may be higher maximums elsewhere. As I read it, transformers are a local maximum in the sense of outperforming all other ML techniques as the AI technique that gets the closest to human intelligence.

                Can you help my little brain understand the problem by elaborating?

                Also you may want to chill with the personal attacks.

                • erisinger 30 days ago
                  Not a personal attack. These posters are smarter than I am, just ribbing them about misusing the terminology.

                  "Maxima" is plural, "maximum" is singular. So you would say "a local maximum," or "several local maxima." Not "a local maxima" or, the one that really got me, "getting trapped in local maxima's."

                  As for the rest of it, carry on. Good discussion.

                  • FeepingCreature 30 days ago
                    A local maxima, that is, /usr/bin/wxmaxima...
                  • gyrovagueGeist 30 days ago
                    While "local maximas" is wrong, I think "a local maxima" is a valid way to say "a member of the set of local maxima" regardless of the number of elements in the set. It could even be a singleton.
                    • dragonwriter 30 days ago
                      No, a member of the set of local maxima is a local maximum, just like a member of the set of people is a person, because it is a definite singular.

                      The plural is also used for indefinite number, so “the set of local maxima” remains correct even if the set has cardinality 1, but a member of the set has definite singular number irrespective of the cardinality of the set.

                    • Tijdreiziger 30 days ago
                      You can't have one maxima in the same way you can't have one pencils. That's just how English works.
                      • pixl97 30 days ago
                        You can't have one local maxima, it would be the global maxima. So by saying local maxima you're assuming the local is just a piece of a larger whole, even if that global state is otherwise undefined.
                        • reverius42 30 days ago
                          No, you can’t have one local maxima, or one global maxima, because it’s plural. You can have one local or global maximum, or two (or more) local or global maxima.
                        • folli 30 days ago
                          "You can't have one local pencils, it would be the global pencils"
              • antonvs 30 days ago
                “Maxima” sounds fancy, making it catnip for people trying to sound smart.
              • tschwimmer 30 days ago
                yeah, not a Nissan in sight
            • mikewarot 30 days ago
              MNIST and other small and easy to train against datasets are widely available. You can try out anything you like even with a cheap laptop these days thanks to a few decades of Moore's law.

              It is definitely NOT out of your reach to try any ideas you have. Kaggle and other sites exist to make it easy.

              Good luck! 8)

              • retrofrost 30 days ago
                My pet project has been trying to use elixir with NEAT or HyperNEAT to try and make a spiking network, then when that's working decently drop in some glial interactions I saw in a paper. It would be kinda bad at purely functional stuff, but idk, seems fun. The biggest problems are time and having to do a lot of both the evolutionary stuff and the network stuff. But yeah, the ubiquity of free datasets does make it easy to train.
            • importantbrian 26 days ago
              Not to mention not everyone can be devoted to doing cancer research. Some Drs. and Nurses are necessary to, you know, actually treat the people who have cancer.
          • haltIncomplete 30 days ago
            All we’re doing is engineering new data compression and retrieval techniques: https://arxiv.org/abs/2309.10668

            Are we sure there’s anything “net new” to find within the same old x86 machines, within the same old axiomatic systems of the past?

            Math is a few operations applied to carving up stuff, and we believe we can do that infinitely in theory. So "all math that abides by our axiomatic underpinnings" is valid regardless of whether we "prove it" or not.

            Physical space we can exist in, a middle ground of reality we evolved just so to exist in, seems to be finite; I can’t just up and move to Titan or Mars. So our computers are coupled to the same constraints of observation and understanding as us.

            What about daily life will be upended by reconfirming decades-old experiments? How is this not living in a sunk cost fallacy?

            When all you have is a hammer…

            I’m reminded of Einstein’s quote about insanity.

            • aldousd666 30 days ago
              Einstein didn't say that about insanity, but... systems exist and are consistently described by particular equations at particular scales. Sure, we can say everything is quantum mechanics; even classical physics can technically be translated as a series of wave functions that explain the same behaviors we observe, if we could measure it... But it's impractical, and some of the concepts we think of as fundamental to certain scales, like nucleons, didn't exist at others, like equations that describe the energy of empty space. So, it's maybe not quite a fallacy to point out that not every concept we find to be useful, like deep learning inference, encapsulates every rule at every scale that we know about down to the electrons, cogently. Because none of our theories do that, and even if they did, we couldn't measure or process all the things needed to check and see if we're even right. So we use models that differ from each other, but that emerge from each other, but only when we cross certain scale thresholds.
            • samus 30 days ago
              If you abstract far enough then yes, everything we are doing is somehow akin to what we have done before. But then that also applies to what Einstein did.
        • typon 30 days ago
          Do you really think that transformers came to us from God? They're built on the corpses of millions of models that never went anywhere. I spent an entire year trying to scale up a stupid RNN back in 2014. Never went anywhere, because it didn't work. I am sure we are stuck in a local minima now - but it's able to solve problems that were previously impossible. So we will use it until we are impossibly stuck again. Currently, however, we have barely begun to scratch the surface of what's possible with these models.
        • leoc 30 days ago
          (The singulars are ‘maximum’ and ‘minimum’, ‘maxima’ and ‘minima’ are the plurals.)
        • samus 30 days ago
          Who said that we peaked with transformers? I sure hope we did not. The current focus on them is just institutional inertia. Worst case another AI winter comes, at the end of which a newer, more promising technology would manage to attract funding anew.
      • nicklecompte 30 days ago
        His point is that "evolution by selection" also includes that transformers are easy to implement with modern linear algebra libraries and cheap to scale on current silicon, both of which are engineering details with no direct relationship to their innate efficacy at learning (though indirectly it means you scale up the training data for more inefficient learning).
        • wanderingbort 30 days ago
          I think it is correct to include practical implementation costs in the selection.

          Theoretical efficacy doesn’t guarantee real world efficacy.

          I accept that this is self reinforcing but I favor real gains today over potentially larger gains in a potentially achievable future.

          I also think we are learning practical lessons on the periphery of any application of AI that will apply if a mold-breaking solution becomes compelling.

      • foobiekr 30 days ago
        "won"

        They barely work for a lot of cases (i.e., anything where accuracy matters, despite the bubble's wishful thinking). It's likely that something will sunset them in the next few years.

        • victorbjorklund 30 days ago
          That is how evolution works. Something wins until something else comes along and wins. And so on forever.
          • Retric 30 days ago
            Evolution generally favors multiple winners in different roles over a single dominant strategy.

            People tend to favor single winners.

            • advael 30 days ago
              I both think this is a really astute and important observation and also think it's an observation that's more true locally than of people broadly. Modern neoliberal business culture generally and the consolidated current incarnation of the tech industry in particular have strong "tunnel vision" and belief in chasing optimality compared to many other cultures, both extant and past
              • imtringued 29 days ago
                In neoclassical economics, there are no local maxima, because it would make the math intractable and expose how much of a load of bullshit most of it is.
                • foobiekr 29 days ago
                  Yep. This. It’s impressive how communication is instantaneous, unimpeded, complete and transparent in economics.

                  Those things aren’t even true in a 500 person company let alone an economy.

        • refulgentis 30 days ago
          It seems cloyingly performative grumpy old man once you're at "it barely works and it's a bubble and blah blah" in response to a discussion about their comparative advantage (yeah, they won, and absolutely convincingly so)
          • wizzwizz4 30 days ago
            That's like saying Bitcoin won cryptography.
        • oldpersonintx 30 days ago
          [dead]
      • antonvs 30 days ago
        I’d say it’s more that transformers are in the lead at the moment, for general applications. There’s no rigorous reason afaik that it should stay that way.
      • jjtheblunt 29 days ago
        > in the end transformers won

        we're at the end?

      • dartos 30 days ago
        I mean RWKV seems promising and isn’t a transformer model.

        Transformers have first mover advantage. They were the first models that scaled to large parameter counts.

        That doesn’t mean they’re the best or that they’ve won, just that they were the first to get big (literally and metaphorically)

        • refulgentis 30 days ago
          It doesn't seem promising; a one-man band has been on a quixotic quest based on intuition and it's gotten ~nowhere, and it's not for lack of interest in alternatives. There's never been a better time to have a different approach - is your metric "times I've seen it on HN with a convincing argument for it being promising?" -- I'm not embarrassed to admit that is/was mine, but alternatively, you're aware of recent breakthroughs I haven't seen.
          • dartos 28 days ago
            RWKV has shown that you can scale RNNs to large parameter counts.

            The fact that one person (initially) was able to do it highlights how much low hanging fruit there is for non transformers.

            Also, the fact that a small number of people designed, trained, and published 5 versions of a perfectly serviceable model (as in, it has decent summarizing ability, the biggest LLM use case) which doesn't have the time complexity of transformers is a big deal.

        • tkellogg 30 days ago
          Yeah, I'd argue that transformers created such capital saturation that there's a ton of opportunity for alternative approaches to emerge.
          • dartos 30 days ago
            Speak of the devil. Jamba just hit the front page.
      • szundi 30 days ago
        “end”
    • ikkiew 30 days ago
      > the perceptron thats just a sumnation[sic] function

      What would you suggest?

      My understanding of part of the whole NP-Complete thing is that any algorithm in the complexity class can be reduced to, among other things, a 'summation function'.

    • ldjkfkdsjnv 30 days ago
      Cannot understand people claiming we are in a local maxima, when we literally had an ai scientific breakthrough only in the last two years.
      • xanderlewis 30 days ago
        Which breakthrough in the last two years are you referring to?
        • 6gvONxR4sf7o 30 days ago
          If you had to reduce it to one thing, it's probably that language models are capable few-shot and zero-shot learners. In other words, by training a model to simply predict the next word on naturally occurring text, you end up with a tool you can use for generic tasks, roughly speaking.
          • xyzzy_plugh 30 days ago
            It turns out a lot of tasks are predictable. Go figure.
        • ldjkfkdsjnv 30 days ago
          the LLM scaling law
    • posix86 30 days ago
      I don't understand enough about the subject to say, but to me it seemed like yes, other models have better metrics with equal model size in terms of number of neurons or asymptotic runtime, but the most important metric will always be accuracy/precision/etc for money spent... or in other words, if GPT requires 10x the number of neurons to reach the same performance, but buying compute & memory for these neurons is cheaper, then GPT is a better means to an end.
    • blueboo 30 days ago
      The bitter lesson, my dude. http://www.incompleteideas.net/IncIdeas/BitterLesson.html

      If you find a simpler, trainable structure you might be onto something

      Attempts to get fancy tried and died

  • derefr 30 days ago
    Help me understand: when they say that the facts are stored as a linear function… are they saying that the LLM has a sort of N-dimensional “fact space” encoded into the model in some manner, where facts are embedded into the space as (points / hyperspheres / Voronoi manifolds / etc); and where recalling a fact is — at least in an abstract sense — the NN computing / remembering a key to use, and then doing a key-value lookup in this space?

    If so: how do you embed a KV-store into an edge-propagated graphical model? Are there even any well-known techniques for doing that “by hand” right now?

    (Also, fun tangent: isn't the "memory palace" memory technique, an example of human brains embedding facts into a linear function for easier retrieval?)

    • jacobn 30 days ago
      The fundamental operation done by the transformer, softmax(Q.K^T).V, is essentially a KV-store lookup.

      The Query is dotted with the Key, then you take the softmax to pick mostly one winning Key (the Key closest to the Query basically), and then use the corresponding Value.

      That is really, really close to a KV lookup, except it's a little soft (i.e. can hit multiple Keys), and it can be optimized using gradient descent style methods to find the suitable QKV mappings.
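
      A minimal NumPy sketch of that soft lookup (toy shapes, one query, ignoring the learned projections and other details):

        import numpy as np

        def softmax(x):
            x = x - x.max()              # numerical stability
            e = np.exp(x)
            return e / e.sum()

        rng = np.random.default_rng(0)
        d = 16
        K = rng.normal(size=(5, d))      # 5 stored "keys"
        V = rng.normal(size=(5, d))      # 5 stored "values"
        q = rng.normal(size=d)           # one "query"

        w = softmax(q @ K.T / np.sqrt(d))  # soft "which key matches?" weights
        out = w @ V                        # weighted mix of values: a soft KV lookup
        # if one entry of w is ~1, this degenerates into an exact key-value fetch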

      • naveen99 30 days ago
        Not sure there is any real lookup happening. Q,K are the same and sometimes even v is the same…
        • toxik 30 days ago
          Q, K, V are not the same. In self-attention, they are all computed by separate linear transformation of the same input (ie the previous layer’s output). In cross-attention even this is not true, then K and V are computed by linear transformation of whatever is cross-attended, and Q is computed by linear transformation of the input as before.
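
          A tiny NumPy sketch of that difference (all shapes and weights made up):

            import numpy as np

            rng = np.random.default_rng(0)
            d = 8
            x = rng.normal(size=(4, d))   # current sequence (4 positions)
            m = rng.normal(size=(6, d))   # cross-attended memory (e.g. encoder output)
            Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))  # separate learned projections

            # self-attention: Q, K, V are different linear transforms of the same input
            Qs, Ks, Vs = x @ Wq, x @ Wk, x @ Wv
            # cross-attention: Q still comes from x, but K and V come from the memory
            Qc, Kc, Vc = x @ Wq, m @ Wk, m @ Wv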
          • ewild 30 days ago
            Yeah, a common misconception: because the input is the same, people forget that there is a pre-attention linear transformation for Q, K and V (using the decoder-only version; obviously V is different with the encoder-decoder, BERT-style setup).
          • naveen99 26 days ago
            It’s still a stretch to call that a look up.
    • bionhoward 30 days ago
      [Layer] Normalization constrains huge vectors representing tokens (input fragments) to positions on a unit ball (I think), and the attention mechanism operates by rotating the unconstrained ones based on the sum of their angles relative to all the others.

      I only skimmed the paper but believe the point here is that there are relatively simple functions hiding in or recoverable from the bigger network which specifically address certain categories of relationships between concepts.

      Since it would, in theory, be possible to optimize such functions more directly if they are possible to isolate, could this enable advances in the way such models are trained? Absolutely.

      After all, one of the best criticisms of “modern” AI is the notion we’re just mixing around a soup of linear algebra. Allowing some sense of modularity (reductionism) could make them less of a black box and more of a component driven approach (in the lagging concept space and not just the leading layer space)

    • thfuran 30 days ago
      >isn't the "memory palace" memory technique, an example of human brains embedding facts into a linear function for easier retrieval?

      I'm not sure I see how that's a linear function.

    • samus 30 days ago
      The memory palace is a hack that works because in an evolutionary sense our brain's purpose is to help us navigate our world and be effective in it. To do that, it has to be really good at remembering locations, to plot paths through and between them, and to translate that into speech or motion.
  • mike_hearn 30 days ago
    This is really cool. My mind goes immediately to what sort of functions are being used to encode programming knowledge, and if they are also simple linear functions whether the standard library or other libraries can be directly uploaded into an LLM's brain as it evolves, without needing to go through a costly training or performance-destroying fine-tune. That's still a sci-fi ability today but it seems to be getting closer.
    • Animats 30 days ago
      That's a good point. It may be possible to directly upload predicate-type info into an LLM. This could be especially useful if you need to encode tabular data. Somewhere, someone probably read this and is thinking about how to export Excel or databases to an LLM.

      It's encouraging to see people looking inside the black box successfully. The other big result in this area was that paper which found a representation of a game board inside an LLM after the LLM had been trained to play a game. Any other good results in that area?

      The authors point out that LLMs are doing more than encoding predicate-type info. That's just part of what they are doing.

      • wongarsu 30 days ago
        The opposite is also exciting: build a loss function that punishes models for storing knowledge. One of the issues of current models is that they seem to favor lookup over reasoning. If we can punish models (during training) for remembering that might cause them to become better at inference and logic instead.
        • qlk1123 30 days ago
          I believe it will add some spice to the model, but you shouldn't go too far in that direction. Any social system has a rule set, which has to be learnt and remembered, not inferred.

          Two examples. (1) Grammars in natural languages. You can just see how another commenter here uses "a local maxima", and then how people react to that. I didn't even notice because English grammar has never been native to me. (2) Mostly, prepositions between two languages, no matter how close they are, don't have a direct mapping. The learner just has to remember them.

        • kossTKR 30 days ago
          Interesting. Reminds me of a sci-fi short I read years ago where AIs "went insane" when they had too much knowledge, because they'd spent too much time looking through data and got a buffer overflow.

          I know some of the smaller models like PHI-2 are trained specifically for reasoning by training on question-answer sets, though this seems like the opposite to me.

        • azinman2 30 days ago
          But how do you do that when pre-training is basically predicting the next token?
      • AaronFriel 30 days ago
        It indeed is. An attention mechanism's key and value matrices grow linearly with context length. With PagedAttention[1], we could imagine an external service providing context. The hard part is the how, of course. We can't load our entire database in every conversation, and I suspect there are also challenges around training (perhaps addressed via LandmarkAttention[2]) and building a service to efficiently retrieve additional key-value matrices.

        The external vector database service may require tight timings to avoid stalling LLMs. To manage 20-50 tokens/sec, answers must arrive within 50 ms (at 20 tokens/sec) to 20 ms (at 50 tokens/sec).

        And we cannot do this in real time: pausing the transformer when a layer produces a query vector stalls the batch, so we need a way to predict queries (or embeddings) several tokens ahead of where they'd be useful, inject the context when it's needed, and know when to page it out.

        [1] https://arxiv.org/abs/2309.06180

        [2] https://arxiv.org/abs/2305.16300

    • politician 30 days ago
      Hah! Maybe Neo was an LLM. "I know kung-fu."
  • mikewarot 30 days ago
    I wonder if this relation still holds with newer models that have even more compute thrown at them?

    My intuition is that the structure inherent to language makes Word2Vec possible. Then training on terabytes of human text encoded with Word2Vec + Positional Encoding makes it possible to predict the next encoding at superhuman levels of cognition (while training!).

    It's my sense that the bag of words (as input/output method) combined with limited context windows (to make Positional Encoding work) is a huge impedance mismatch to the internal cognitive structure.

    Thus I think that given the orders of magnitude more compute thrown at GPT-4 et al, it's entirely possible new forms of representation evolved and remain to be discovered by humans probing through all the weights.

    I also think that MemGPT could, eventually, become an AGI because of the unlimited long term memory. More likely, though, I think it would be like the protagonist in Memento[1].

    [1] https://en.wikipedia.org/wiki/Memento_(film)

    [edit - revise to address question]

    • autokad 30 days ago
      Sorry if I misread your comment, but you seem to be indicating that LLMs such as ChatGPT (which use GPT-3+) are bag-of-words models? They are sequence models.
      • mikewarot 30 days ago
        I edited my response... I hope it helps... my understanding is that the output gives probabilities for all the words, then one is chosen with some randomness thrown in (via the temperature) and then fed back in... which to me seems to equate to bag of words. Perhaps I misunderstood the term.
        • smaddox 30 days ago
          Bag of words models use a context that is a "bag" (i.e. an unordered map from elements to their counts) of words/tokens. GPTs use a context that is a sequence (i.e. an ordered list) of words/tokens.
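
          A toy illustration of the difference:

            from collections import Counter

            text = ["the", "dog", "bit", "the", "man"]
            bag = Counter(text)   # order is gone: {'the': 2, 'dog': 1, 'bit': 1, 'man': 1}
            seq = list(text)      # order is kept: ['the', 'dog', 'bit', 'the', 'man']
            # a bag can't distinguish "the dog bit the man" from "the man bit the dog";
            # a sequence model can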
  • zyklonix 30 days ago
    This reminds me of the famous "King - Man + Woman = Queen" embedding example. The fact that embeddings have semantic properties in them explains why simple linear functions would work as well.
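
    For example, with pretrained vectors (a sketch assuming the gensim package and its downloadable Google News word2vec model; any pretrained embedding set would do):

      import gensim.downloader as api

      wv = api.load("word2vec-google-news-300")   # ~1.6 GB download
      print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
      # 'queen' typically shows up at or near the top of this list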
  • estebarb 30 days ago
    I find this similar to what relation vectors do in word2vec: you can add a vector of "X of" and often get the correct answer. It could be that the principle is still the same, and transformers "just" build a better mapping of entities into the embedding space?
    • PaulHoule 30 days ago
      I think so. It’s hard for me to believe that the decision surfaces inside those models are really curved enough (like the folds of your brain) to really take advantage of FP32 numbers inside vectors: that is I just don’t believe it is

        x = 0 means “fly”
        x = 0.01 means “drive”
        x = 0.02 means “purple”
      
      but rather more like

        x < 1.5 means “cold”
        x > 1.5 means “hot”
      
      which is one reason why quantization (often 1 bit) works. Also it is a reason why you can often get great results feeding text or images through a BERT or CLIP-type model and then applying classical ML models that frequently involve linear decision surfaces.
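
      That last pattern, roughly (a sketch assuming the sentence-transformers and scikit-learn packages; the model name is just a commonly used example):

        from sentence_transformers import SentenceTransformer
        from sklearn.linear_model import LogisticRegression

        texts  = ["great movie", "loved it", "terrible film", "waste of time"]
        labels = [1, 1, 0, 0]

        model = SentenceTransformer("all-MiniLM-L6-v2")               # BERT-type encoder
        clf = LogisticRegression().fit(model.encode(texts), labels)   # linear decision surface
        print(clf.predict(model.encode(["awful plot"])))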
      • taneq 30 days ago
        Are you conflating nonlinear embedding spaces with the physical curvature of the cerebellum? I don't think there's a direct mapping.
        • PaulHoule 30 days ago
          My mental picture is that violently curved decision surfaces could look like the convolutions of the brain even though they have nothing to do with how the brain actually works.

          I think of how tSNE and other algorithms sometimes produce projections that look like that (maybe that's just what you get when you have to bend something complicated to fit into a 2-d space) and frequently show cusps that to me look like a sign of trouble (took me a while in my PhD work to realize how Poincaré sections from 4 or 6 dimensions can look messed up when a part of the energy surface tilts perpendicularly to the projection surface.)

          I still find it hard to believe that dense vectors are the right way to deal with text despite the fact that they work so well. For images it is one thing because changing one pixel a little doesn't change the meaning of an image, but changing a single character of a text can completely change the meaning of the text. Also there's the reality that if you randomly stick together tokens you get something meaningless, so it seems almost all of the representation space covers ill-formed texts and only a low-dimensional manifold holds the well-formed texts. Now the decision surfaces really have to be nonlinear and crumpled overall, but I think there's definitely a limit on how crumpled those surfaces can be.

          • Y_Y 30 days ago
            This is interesting. It makes me think of an "immersion"[0], as in a generalization of the concept of "embedding" in differential geometry.

            I share your uneasiness about mapping words to vectors and agree that it feels as if we're shoehorning some more complex space into a computationally convenient one.

            [0] https://en.wikipedia.org/wiki/Immersion_(mathematics)

  • whatever1 30 days ago
    LLMs seem like a good compression mechanism.

    It blows my mind that I can have a copy of llama locally on my PC and have access to virtually the entire internet

    • krainboltgreene 30 days ago
      > have access to virtually the entire internet

      It isn't even close to 1% of the internet, much less virtually the entire internet. According to the latest dump, Common Crawl has 4.3B pages, but Google in 2016 estimated there are 130T pages. The difference between 130T and 4.3B is about 130T. Even if you narrow it down to Google's searchable text index it's "100's of billions of pages" and roughly 100P compared to CommonCrawl's 400T.

      • fspeech 30 days ago
        130T unique pages? That seems highly unlikely as that averages to over 10000 pages for each human being alive. If gp merely wants texts of interest to self as opposed to an accurate snapshot it seems LLMs should be quite capable, one day.
        • darby_eight 30 days ago
          It doesn't seem that hard to believe given how much automatically generated "content" (mostly garbage) there is.

          I think a more interesting question is how much information there is on the internet, especially after optimal compression. I'm guessing this is a very difficult question to answer, but also much higher than LLMs currently store.

        • cypress66 29 days ago
          Is it? Every user profile in every website is a page. Every single tweet is a page.
          • fspeech 16 days ago
            Tweets count. HN posts also count (actually as high quality texts :). IOT devices reporting status based on the same templates should not qualify as unique pages (count the number of templates if you want). Now if I do a search for some news, many almost verbatim copies would show up. They should only count as one, as we are looking for unique texts!
      • whatever1 29 days ago
        The internet, to me and to most people, is the first 10 search results for the various terms we search for.
    • Culonavirus 30 days ago
      Yea except it's a lossy compression. With the lost part being hallucinated in at inference time.
      • AnotherGoodName 30 days ago
        Lossy and lossless are way more interchangeable than people give them credit for.

        Long-winded explanation, as best as I can manage in an HN comment. Essentially, for state-of-the-art compression both the encoder and the decoder have the same algorithm. They look at the bits encoded/decoded so far, and they both run exactly the same prediction on those bits using some model that predicts based on past data (AI is fantastic for this). If the prediction was 99% likely that the next bit is a '1', the encoder only writes a fraction of a bit to represent that (assuming the prediction is correct), and on the other side the decoder will have the same prediction at that point, and it will either read a larger number of bits to correct a miss or simply write '1' to the output and start on the prediction of the next bit given that now-written '1'.

        Essentially lossy predictions of the next data are great tools to losslessly compress data as those predictions of the next bit/byte/word minimize the data needed to losslessly encode that next bit/byte/word. Likewise you can trivially make a lossy compressor out of a lossless one. Lossy and lossless just aren't that different.
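
        A toy sketch of the idea, just counting the bits an ideal arithmetic coder would spend given an adaptive byte-level model (no actual coder here, and a real compressor would use a far better predictor):

          import math
          from collections import Counter

          def code_length_bits(data):
              counts, total, bits = Counter(), 0, 0.0
              for sym in data:
                  p = (counts[sym] + 1) / (total + 256)  # model's probability for the next byte
                  bits += -math.log2(p)                  # good predictions cost a fraction of a bit
                  counts[sym] += 1                       # the decoder updates the same counts, so it stays in sync
                  total += 1
              return bits

          text = b"abababababababababab"
          print(code_length_bits(text), "bits vs", 8 * len(text), "bits raw")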

        The longstanding Hutter prize for AI in fact judges the AI on how well it can compress data. http://prize.hutter1.net/ This is based on the fact that what we think of as AI and compression are quite interchangeable. There's a whole bunch of papers out on this.

        http://prize.hutter1.net/hfaq.htm#compai

        I have nothing to do with Hutter but I know all about AI and data compression and their relation.

      • Kuinox 30 days ago
        If you've read the article, the LLM hallucinations aren't due to the model not knowing the information but to a function that chooses to remember the wrong thing.
        • sinemetu11 30 days ago
          From the paper:

          > Finally, we use our dataset and LRE-estimating method to build a visualization tool we call an attribute lens. Instead of showing the next token distribution like Logit Lens (nostalgebraist, 2020) the attribute lens shows the object-token distribution at each layer for a given relation. This lets us visualize where and when the LM finishes retrieving knowledge about a specific relation, and can reveal the presence of knowledge about attributes even when that knowledge does not reach the output.

          They're just looking at what lights up in the embedding when they feed something in, and whatever lights up is "knowing" about that topic. The function is an approximation they added on top of the model. It's important to not conflate this with the actual weights of the model.

          You can't separate the hallucinations from the model -- they exist precisely because of the lossy compression.

        • ewild 30 days ago
          Even this place has people not reading the articles. We are doomed.
    • nyrikki 29 days ago
      PAC learning is compression.

      PAC learnable, Finite VC dimensionality, and the following form of compression are fully equivalent.

      https://arxiv.org/abs/1610.03592

      Basically each individual neuron/perceptron just splits a space into two subspaces.
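
      In code, that per-neuron split is just a sign test on an affine function (NumPy sketch):

        import numpy as np

        w, b = np.array([1.0, -2.0]), 0.5        # one neuron's weights and bias
        x = np.array([[0, 0], [3, 1], [-1, 2]])  # a few input points

        side = (x @ w + b) > 0   # which side of the hyperplane w.x + b = 0 each point falls on
        print(side)              # [ True  True False ] -> the two subspaces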

  • MuffinFlavored 30 days ago
    I don't understand how a "CSV file/database/model" of 70,000,000,000 (70B) "parameters" of 4-bit weights (a 4 bit value can be 1 of 16 unique numbers) gets us an interactive LLM/GPT that is near-all-knowledgable on all topics/subjects.

    edit: did research, the 4-bit is just a "compression method", the model ends up seeing f32?

    > Quantization is the process of mapping 32-bit floating-point numbers (which are the weights in the neural network) to a much smaller bit representation, like 4-bit values, for storage and memory efficiency.

    > Dequantization happens when the model is used (during inference or even training, if applicable). The 4-bit quantized weights are converted back into floating-point numbers that the model's computations are actually performed with. This is done using the scale and zero-point determined during the initial quantization, or through more sophisticated mapping functions that aim to preserve as much information as possible despite the reduced precision.
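
    A toy NumPy sketch of that round trip (a simple per-tensor min/max affine scheme; real 4-bit quantizers use per-group scales and smarter mappings):

      import numpy as np

      w = np.random.default_rng(0).normal(size=8).astype(np.float32)  # fp32 weights

      scale = (w.max() - w.min()) / 15          # 4 bits -> 16 levels (0..15)
      zero_point = w.min()
      q = np.round((w - zero_point) / scale).astype(np.uint8)   # stored 4-bit codes

      w_hat = q * scale + zero_point            # dequantized floats used for the actual math
      print(np.abs(w - w_hat).max())            # small reconstruction error from the 16 levels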

    so what is the relationship between "parameters" and the "# of unique tokens the model knows about (vocabulary size)"?

    > At first glance, LLAMa only has a 32,000 vocabulary size and 65B parameters as compared to GPT-3,

    > The 65 billion parameters in a model like LLAMA (or any large language model) essentially function as a highly intricate mapping system that determines how to respond to a given input based on the learned relationships between tokens in its training data.

    • Filligree 30 days ago
      It doesn't, is the simple answer.

      The slightly more complicated one is that a compressed text dump of Wikipedia isn't even 70GB, and this is lossy compression of the internet.

      • ramses0 30 days ago
        Is there some sort of "LLM-on-Wikipedia" competition?

        ie: given "just wikipedia" what's the best score people can get on however these models are evaluated.

        I know that all the commercial ventures have a voracious data-input set, but it seems like there's room for dictionary.llm + wikipedia.llm + linux-kernel.llm and some sort of judging / bake-off for their different performance capabilities.

        Or does the training truly _NEED_ every book ever written + the entire internet + all knowledge ever known by mankind to have an effective outcome?

        • bionhoward 30 days ago
          Yes, that’s known as the Hutter Prize http://prize.hutter1.net/
          • ramses0 30 days ago
            Not exactly, because LLMs seem to be exhibiting value via "lossy knowledge response" vs. "exact reproduction measured in bytes", but close.
            • AnotherGoodName 30 days ago
              Lossy and lossless are more interchangeable in computer science than people give them credit for, so I wouldn't dwell on that too much. You can optimally convert one into the other with arithmetic coding. In fact the actual best-in-class algorithms that have won the Hutter prize are all lossy behind the scenes. They make a prediction on the next data using a model (often AI based), which is a lossy process, and with arithmetic coding they losslessly encode the next data with bits proportional to how correct the prediction was. In fact the reason why the Hutter prize is lossless compression is exactly because converting lossy to lossless with arithmetic coding is a way to score how correct a lossy prediction is.
        • CraigJPerry 30 days ago
          >> Or does the training truly _NEED_ every book ever written + the entire internet + all knowledge ever known by mankind to have an effective outcome?

          I have the same question.

          Peter Norvig’s GOFAI Shakespeare generator example[1] (which is not an LLM) gets impressive results with little input data to go on. Does the leap to LLM preclude that kind of small input approach?

          [1] link should be here because I assumed as I wrote the above that I would just turn it up with a quick google. Alas t’was not to be. Take my word for it, somewhere on t’internet is an excellent write up by Peter Norvig on LLM vs GOFAI (good old fashioned artificial intelligence)

      • MuffinFlavored 30 days ago
        Say the average LLM these days has a unique token (vocabulary) size of ~32,000 (not its context size; the # of unique tokens it can pick between in a response: English words, punctuation, math, code, etc.).

        the 60-70B parameters of models is basically like... just stored patterns of "if these 10 tokens in a row input, then these 10 tokens in a row output score the highest"

        Is that a good summary?

        > The model uses its learned statistical patterns to predict the probability of what comes next in a sequence of text.

        based on what inputs?

        1. previous tokens in the sequence from immediate context

        2. tokens summarizing the overall topic/subject matter from the extended context

        3. scoring of learned patterns from training

        4. what else?

        • wongarsu 30 days ago
          That would be equivalent to a hidden Markov chain. Those have been around for decades, but we have only managed to make them coherent for very short outputs. Even GPT-2 beats any Markov chain, so there has to be more going on.
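
          For reference, the pure "stored patterns of what tends to follow what" picture is basically this toy order-1 Markov chain, which loses the thread almost immediately:

            import random
            from collections import defaultdict

            corpus = "the cat sat on the mat and the dog sat on the rug".split()

            follows = defaultdict(list)              # which words were seen following each word
            for a, b in zip(corpus, corpus[1:]):
                follows[a].append(b)

            word, out = "the", ["the"]
            for _ in range(8):
                word = random.choice(follows[word] or corpus)  # fall back to any word at a dead end
                out.append(word)
            print(" ".join(out))   # locally plausible, no long-range coherence or "world model"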

          Modern LLMs are able to transfer knowledge between different languages, so it's fair to assume that some mapping between human language and a more abstract internal representation happens at the input and output, instead of the model "operating" on English or Chinese or whatever language you talk with it. And once this exists, an internal "world model" (as in: a collection of facts and implications) isn't far, and seems to indeed be something most LLMs do. The reasoning on top of that world model is still very spotty though

        • numeri 30 days ago
          Your suggested scheme (assuming a mapping from 10 tokens to 10 tokens, with each token taking 2 bytes to store) would take (32000 * 20) * 2 bytes = 2.3e78 TiB of storage, or about 250 MiB per atom in the observable universe (1e82), prior to compression.

          I think it's more likely that LLMs are actually learning and understanding concepts as well as memorizing useful facts, than that LLMs have discovered a compression method with that high of a compression ratio, haha.

          • mjburgess 30 days ago
            LLMs cannot determine the physical location of any atoms. They cannot plan movement, and so on.

            LLMs are just completing patterns of text that they have been given before. 'Everything ever written' is both a lot for any individual person to read, but also almost nothing, in that properly describing a table requires more information.

            Text is itself an extremely compressed medium which lacks almost any information about the world; it succeeds in being useful to generate because we have that information and are able to map it back to it.

            • numeri 30 days ago
              I didn't imply that they know anything about where atoms are, I was just pointing out the sheer absurdity of that volume of data.

              I should make it clear that my comparison there is unfair and mostly just funny – you don't need to store every possible combination of 10 tokens, because most of them will be nonsense, so you wouldn't actually need that much storage. That being said, it's been fairly solidly proven that LLMs aren't just lookup tables/stochastic parrots.

              • mjburgess 30 days ago
                > fairly solidly proven that LLMs aren't just lookup tables/stochastic parrots

                Well, I'd strongly disagree. I see no evidence of this; I am quite well acquainted with the literature.

                All empirical statistical AI is just a means of approximating an empirical distribution. The problem with NLP is that there is no empirical function from text tokens to meanings; just as there is no function from sets of 2D images to a 3D structure.

                We know before we start that the distributions of text tokens are only coincidentally related to the distributions of meanings. The question is just how much value that coincidence has in any given task.

                (Consider, eg., that if I ask, "do you like what I'm wearing?" there is no distribution of responses which is correct. I do not want you to say "yes" 99/100, or even 100/100 times, etc. What I want you to say is a word caused by a mental state you have: that of (dis)liking what I'm wearing.

                Since no statistical AI systems generate outputs based on causal features of reality, we know a priori that almost all possible questions that can be asked cannot be answered by LLMs.

                They are only useful where questions have canonical answers; and only because "canonical" means that a text->text function is likely to be coincidentally indistinguishable from the meaning->meaning function we're interested in).

                • tel 30 days ago
                  That suggests that no statistical method could ever recover hidden representations though. And that’s patently untrue. Taken to its greatest extreme you shouldn’t even be able to guess between two mixed distributions even when they have wildly non-overlapping ranges. Or put another way, all of statistical testing in science is flawed.

                  I’m not saying you believe that, but I fail to see how that situation is structurally different from what you claim. If it’s a matter of degree, how do you feel things change as the situation becomes more complex?

                  • mjburgess 29 days ago
                    Yes, I think most statistical testing in science is flawed.

                    But, to be clear, the reason it could ever work at all has nothing to do with the methods or the data itself, it has to do with the properties of the data generating process (ie., reality, ie., what's being measured).

                    You can never build representations from measurement data; this is called inductivism and it's pretty clearly false: no representation is obtained from just characterising measurement data. There are no cases I can think of where this would work -- temperature isn't patterns in thermometers; gravity isn't patterns in the positions of stars; and so on.

                    Rather you can decide between competing representations using stats in a few special cases. Stats never uncovers hidden representations, it can decide between different formal models which include such representations.

                    eg., if you characterise some system as having a power-law data generating process (eg., social network friendships), then you can measure some parameters of that process

                    or, eg., if you arrange all the data to already follow a law you know (eg., F=Gmm/r^2) then you can find G, 'statistically'.

                    This has caused a lot of confusion historically: it seems G is 'induced over cases', but all the representation work has already been done. Stats/induction just plays the role of fine-tuning known representations; it never builds any.
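
                    A toy version of the F=Gmm/r^2 example (synthetic numbers; the statistics only tune the one free parameter once the form of the law is fixed):

                      import numpy as np

                      rng = np.random.default_rng(0)
                      G_true = 6.674e-11

                      # "measurements" that already follow the assumed law, plus noise
                      m1, m2 = rng.uniform(1e3, 1e6, 100), rng.uniform(1e3, 1e6, 100)
                      r = rng.uniform(1.0, 10.0, 100)
                      F = G_true * m1 * m2 / r**2 * (1 + 0.01 * rng.normal(size=100))

                      x = m1 * m2 / r**2
                      G_est = (x @ F) / (x @ x)   # least-squares slope through the origin
                      print(G_est)                # ~6.67e-11: G is recovered; the law itself never is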

                    • tel 29 days ago
                      Okay, I think I follow and agree legalistically with your argument. But I also think it basically only exists philosophically. In practice, we make these determinations all the time. I don't see any reason why a sufficiently sophisticated representation, learned through statistical optimization, is, in practice, different from a semantic model.

                      If there were such a thing, it'd be interesting to propose how our own minds, at least to the degree that they can be seen as statistical learners in their own right, achieve semantics. And how that thing, whatever it might be, is not itself a learned representation driven by statistical impression.

                      • mjburgess 28 days ago
                        We aren't statistical learners. We're abductive learners.

                        We move, and in moving, grow representations in our bodies. These representations are abstracted in cognition, and form the basis for abductive explanations of reality.

                        We leave Plato's cave by building vases of our own, inside the cave, and comparing them to shadows. We do not draw outlines around the shadows.

                        This is all non-experimental 'empirical' statistics is: pencil marks on the cave wall.

                        • tel 28 days ago
                          So we craft experiments.

                          If someone else crafted an experiment, and you were informed of it and then shown the results, if this was done repeatedly enough, would you be incapable of forming any sort of semantic meaning?

                          • mjburgess 27 days ago
                            If they only showed the measures, yes.

                            The meaning of the measures is determined by the experiment, not by the data. "Data" is itself meaningless, and statistics on data is only informative of reality because of how the experimenter creates the measurement-target relationship.

                            • tel 27 days ago
                              Okay I think I buy that. I don’t know if I agree, but trying to argue for a position against it has been sufficiently illuminating that I just need to chew on it more.

                              There’s no doubt in my mind that experimental learning is more efficient. Especially if you can design the experiments against your personal models at the time.

                              At the same time, it’s not clear to me that one could not gain similar value purely by, say, reading scientific journals. Or observing videos of the experiments.

                              At some point the prevalence of "natural experiments" becomes too low for new discovery. We weren't going to accidentally discover an LHC hanging around. We needed giant telescopes to find examples of natural cosmological experiments. Without a doubt, thoughtful investment in experimentation becomes necessary as you push your knowledge frontier forward.

                              But within a realm where tons of experimental data is just available? Seems very likely that a learner asked to predict new experimental results outside of things they’ve directly observed but well within the space of models they’ve observed lots of experimentation around should still find that purely as an act of compression, their statistical knowledge would predict something equivalent to the semantic theory underlying it.

                              We even seemed to observe just this in multimodal GPT-4 where it can theorize about the immediate consequences of novel physical situations depicted in images. I find it to be weak but surprising evidence of this behavior.

                              • mjburgess 26 days ago
                                I'd be interested in the GPT-4 case, if you have a paper (etc.) ?

                                You are correct to observe that science, as we know it, is ending. We're way along the sigmoid of what can be known, and soon enough, will be drifting back into medieval heuristics ("this weed seems to treat this disease").

                                This isn't a matter of efficiency, it's a necessity. Reality is under-determined by measurement; to find out what it is like, we have to have many independent measures whose causal relationship to reality is one we can control (through direct, embodied, action).

                                If we only have observational measures, we're trapped in a madhouse.

                                Let's not mistake science for pseudoscience, even if the future is largely now, pseudoscientific trash.

                                • tel 26 days ago
                                  I thought the examples I was thinking of were in the original GPT-4 Technical Report, but all I found on re-reading were examples of it explaining "what's funny about" a given image. Which is still a decent example of this, I think. GPT-4 demonstrates a semantic model about what entails humor.
                                  • mjburgess 26 days ago
                                    It entails only that the associative model is coincidentally indistinguishable from a semantic one in the cases where it's used.

                                    It is always trivial to take one of these models and expose its failure to operate semantically, but these cases are never in the marketing material.

                                    Consider an associative model of addition, all numbers from -1bn to 1bn, broken down into their digits, so that 1bn = <1, 0, 0, 0, 0, 0, 0, 0, 0, 0>

                                    Using such a model you can get the right answers for more additions than just -1bn to 1bn, but you can also easily find cases where the addition would fail.

                                    It's never adding.

                                    • tel 26 days ago
                                      I think part of what I suspect is going on here too is more computation and finiteness. It seems correct that LLM architectures cannot perform too much computation (unless you unroll it in the context).

                                      On the other hand you can look at statistical model identification in, say, nonlinear control. This can absolutely lead to unboundedly long predictions.

                • richardatlarge 29 days ago
                  Too true. I often point out to others that a transformer like gpt-4 operates wholly on numbers- it knows nothing of meaning in the real world- nothing at all
                  • dragonwriter 29 days ago
                    > I often point out to others that a transformer like gpt-4 operates wholly on numbers- it knows nothing of meaning in the real world- nothing at all

                    This is like saying a brain operates wholly on electrochemical states and knows nothing about meaning in the real world, though; the mechanistic description is accurate, the cognitive conclusion attached to it is, at best, based on unsupported conjecture about the relation of mechanism to understanding.

                    • richardatlarge 29 days ago
                      All the magic in the universe will not allow you to find anything about a human from looking solely at the neural structure of its brain, however sophisticated. The brain is a representation (genome) of lived experience (phenome). The transformer has only ever experienced alphanumeric data input. Second hand experience.
          • pk-protect-ai 30 days ago
            There is something wrong with this arithmetic: "(32000 * 20) * 2 bytes = 2.3e78 TiB of storage" ... The factorial is missing somewhere in there ...
          • idontpost 30 days ago
            [dead]
        • HarHarVeryFunny 30 days ago
          > the 60-70B parameters of models is basically like... just stored patterns of "if these 10 tokens in a row input, then these 10 tokens in a row output score the highest"

          > Is that a good summary?

          No - there's a lot more going on. It's not just mapping input patterns to output patterns.

          A good starting point for understanding it is linguists' sentence-structure trees (and these were the inspiration for the "transformer" design of these LLMs).

          https://www.nltk.org/book/ch08.html

          Note how there are multiple levels of nodes/branches to these trees, from the top node representing the sentence as a whole, to the words themselves which are all the way at the bottom.

          An LLM like ChatGPT is made out of multiple layers (e.g. 96 layers for GPT-3) of transformer blocks, stacked on top of each other. When you feed an input sentence into an LLM, the sentence will first be turned into a sequence of token embeddings, then passed through each of these 96 layers in turn, each of which changes ("transforms") it a little bit, until it comes out the top of the stack as the predicted output sentence (or something that can be decoded into the output sentence). We only use the last word of the output sentence which is the "next word" it has predicted.

          You can think of these 96 transformer layers as a bit like the levels in one of those linguistic sentence-structure trees. At the bottom level/layer are the words themselves, and at each successive higher level/layer are higher-and-higher level representations of the sentence structure.

          In order to understand this a little better, you need to understand what these token "embeddings" are, which is the form in which the sentence is passed through, and transformed by, these stacked transformer layers.

          To keep it simple, think of a token as a word, and say the model has a vocabulary of 32,000 words. You might perhaps expect that each word is represented by a number in the range 1-32000, but that is not the way it works! Instead, each word is mapped (aka "embedded") to a point in a high dimensional space (e.g. 4096-D for LLaMA 7B), meaning that it is represented by a vector of 4096 numbers (cf a point in 3-D space represented as (x,y,z)).

          These 4096-element "embeddings" are what actually pass through the LLM and get transformed by it. Having so many dimensions gives the LLM a huge space in which it can represent a very rich variety of concepts, not just words. At the first layer of the transformer stack these embeddings do just represent words, the same as the nodes do at the bottom layer of the sentence-structure tree, but more information is gradually added to the embeddings by each layer, augmenting and transforming what they mean. For example, maybe the first transformer layer adds "part of speech" information so that each embedded word is now also tagged as a noun or verb, etc. At the next layer up, the words comprising a noun phrase or verb phrase may get additionally tagged as such, and so on as each transformer layer adds more information.

          This just gives a flavor of what is happening, but basically by the time the sentence has reached the top layer of the transformer the model has been able to see the entire tree structure of the sentence, and only then has it "understood" the sentence well enough to predict a grammatically and semantically "correct" continuation word.
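
          A schematic of the shapes involved (toy sizes below; the real numbers mentioned above are a 32,000-token vocabulary, 4096-dimensional embeddings and 96 layers, and the block here is just a placeholder):

            import numpy as np

            vocab_size, d_model, n_layers = 1_000, 64, 4   # toy stand-ins for 32,000 / 4096 / 96
            rng = np.random.default_rng(0)
            embedding_table = rng.normal(size=(vocab_size, d_model))

            def transformer_block(h):
                return h   # placeholder: a real block does attention + MLP, shape stays (seq, d_model)

            tokens = np.array([17, 923, 401, 7])   # made-up token ids for a 4-token sentence
            h = embedding_table[tokens]            # (4, 64): one embedding per token
            for _ in range(n_layers):
                h = transformer_block(h)           # each layer augments/transforms the embeddings

            logits = h @ embedding_table.T         # (4, 1000): a score for every vocabulary token
            next_token = int(logits[-1].argmax())  # the predicted "next word" read off the last position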

          • MichaelZuo 30 days ago
            Thanks for the explanation.

            Since Unicode has well over 64,000 symbols, does that imply models, trained on a large corpus, must necessarily have at least 64,000 ‘branches’ at the bottom layer?

            • HarHarVeryFunny 30 days ago
              The size of the character set (Unicode) doesn't really factor into this. Input words are broken down into multi-character tokens (some words will be one token, some split into two, etc.), then these tokens are mapped to the embedding vectors, which are what the model actually operates on.

              The linguistic sentence-structure tree for any input sentence is a useful way to think about what is happening as the sentence is fed into the model and processed layer by layer, but it doesn't have any direct correspondence to the model itself. The model has a fixed number of layers of fixed max-token width, so nothing about its structure changes according to the sentence passing through it.

              Note that the bottom level of the sentence-structure tree is just the words of the sentence, so the number of branches is just the length of the sentence. The model doesn't actually represent these branches though - just the embeddings corresponding to the input, which are transformed from input to output as they are passed through the model and each layer does its transformer thing.
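
              You can see the word-to-token split directly if you happen to have OpenAI's tiktoken package installed (any BPE-style tokenizer shows the same thing):

                import tiktoken

                enc = tiktoken.get_encoding("gpt2")   # ~50k learned tokens, regardless of Unicode's size

                for text in ["cat", "transformer", "HarHarVeryFunny"]:
                    ids = enc.encode(text)
                    pieces = [enc.decode([i]) for i in ids]
                    print(text, "->", ids, pieces)

              Common words tend to come out as a single token, while rarer strings get split into several multi-character pieces; the character set itself never shows up as 64,000+ separate "branches", only the learned token vocabulary does.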

          • richardatlarge 29 days ago
            Now you tell me… awesome explanation- thanks
    • Acumen321 30 days ago
      Quantization in this context is the precision of each value in the vector or matrix/tensor.

      If the model in question has a token embedding length of 1024, then even at 1-bit quantization each token embedding has 2^1024 possible values.

      If the context length is 32,000 tokens, that gives (2^1024)^32,000 = 2^(1024 × 32,000) possible inputs.
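
      Spelling that arithmetic out (just the counting, nothing model-specific):

        # hypothetical 1-bit, 1024-d embedding, 32,000-token context
        bits_per_dim, dim, context_len = 1, 1024, 32_000

        per_token_bits = bits_per_dim * dim        # each position: 2**1024 possible embeddings
        total_bits = per_token_bits * context_len  # whole input: (2**1024)**32000 = 2**(1024*32000)

        print(f"per position:  2**{per_token_bits} possibilities")
        print(f"whole context: 2**{total_bits} possibilities")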

  • wslh 30 days ago
    Can we roughly say that LLMs, in training mode, automatically produce a lot of IF-THENs from a vast quantity of information (and techniques) that was not available before?
  • robertclaus 30 days ago
    I think this paper is cool and I love that they ran these experiments to validate these ideas. However, I'm having trouble reconciling the novelty of the ideas themselves. Isn't this result expected given that LLMs naturally learn simple statistical trends between words? To me it's way cooler that they clearly demonstrated not all LLM behavior can be explained this simply.
  • uoaei 30 days ago
    This is the "random linear projections as memorization technique" perspective on Transformers. It's not a new idea per se, but nice to see it fleshed out.

    If you dig into this perspective, it does temper any claims of "cognitive behavior" quite strongly, if only because Transformers have such a large capacity for these kinds of "memories".
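
    Here is my reading of that phrase as a toy numpy sketch (not a reference implementation from any particular paper): a single matrix can "memorize" many key->value associations purely linearly, because random high-dimensional keys are nearly orthogonal.

      import numpy as np

      rng = np.random.default_rng(0)
      d, n_facts = 512, 50

      keys = rng.normal(size=(n_facts, d)) / np.sqrt(d)   # random keys are ~orthogonal in high dim
      values = rng.normal(size=(n_facts, d))              # the "memories" to store

      W = values.T @ keys            # W = sum_i v_i k_i^T -- all facts superimposed in one matrix

      recalled = keys @ W.T          # query with every key: W k_i ~= v_i + small cross-talk
      sims = recalled @ values.T
      accuracy = (sims.argmax(axis=1) == np.arange(n_facts)).mean()
      print(accuracy)                # ~1.0 while n_facts << d; degrades as the memory fills up

    Storage and retrieval here are pure linear algebra, which is roughly why this framing tempers "cognitive" readings of fact recall.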

    • tel 30 days ago
      Do you have a reference on “random linear projections as memorization”? I know random projections quite well but haven’t seen that connection.
  • vsnf 30 days ago
    > Linear functions, equations with only two variables and no exponents, capture the straightforward, straight-line relationship between two variables

    Is this definition considering the output to be included in the set of variables? What a strange way to phrase it. Under this definition, I wonder what an equation with one variable is. Is a single constant an equation?

    • hansvm 30 days ago
      It's just a change in perspective. Consider a vertical line. To have an "output" variable you have to switch the ordinary `y=mx+b` formulation to `x=c`. The generalization `ax+by=c` accommodates any shifted line you can draw. Adding more variables increases the dimension of the space in consideration (`ax+by+cz=d` could potentially define a plane). Adding more equations potentially reduces the size of the space in consideration (e.g., if `x+y=1` then also knowing `2x+2y=2` wouldn't reduce the solution space, but `x-y=0` would, and would imply `x=y=1/2`, and further adding `x+2y=12` would imply a lack of solutions).

      Mind you, the "two variable" statement in this news piece is a red herring. The paper describes higher-dimensional linear relationships, of the form `Mv=c` for some constant matrix `M`, some constant vector `c`, and some variable vector `v`.
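
      If it helps to see the `Mv = c` form concretely, here is a quick numpy rendering of the examples above:

        import numpy as np

        # x + y = 1 and x - y = 0, written as M v = c
        M = np.array([[1.0,  1.0],
                      [1.0, -1.0]])
        c = np.array([1.0, 0.0])
        print(np.linalg.solve(M, c))      # -> [0.5 0.5], i.e. x = y = 1/2

        # Adding the inconsistent equation x + 2y = 12 leaves no exact solution;
        # least squares then only returns the closest fit.
        M2 = np.vstack([M, [1.0, 2.0]])
        c2 = np.append(c, 12.0)
        print(np.linalg.lstsq(M2, c2, rcond=None)[0])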

      On some level, the result isn't _that_ surprising. The paper only examines one layer (not the whole network), after the network has done a huge amount of embedding work. In that layer, they find that under half the time they're able to get over 60% of the way there with a linear approximation. Another interpretation is that the single layer does some linear work and shoves it through some nonlinear transformations, and more than half the time that nonlinearity does something very meaningful (and even in the under-half of cases where the linear approximation is "okay", the metrics are still bad).

      I'm not super impressed, but I don't have time to fully parse the thing right now. It is a bit surprising; if memory serves, one of the authors on this paper had a much better result in terms of neural-network fact editing in the last year or two. This looks like a solid research idea and solid work that didn't pan out, and to get it published they heavily overstated the conclusions (and then the university press release obviously bragged as much as it could).

    • ksenzee 30 days ago
      I think they're trying to say "equations in the form y = mx + b" without getting too technical.
    • 01HNNWZ0MV43FF 30 days ago
      Yeah I guess they mean one independent variable and one dependent variable

      It rarely matters because if you have 2 dependent variables, you can just express that as 2 equations, so you might as well assume there's exactly 1 dependent variable and then only discuss the number of independent variables.

    • olejorgenb 30 days ago
      I would think `x = 4` is considered an equation, yes?
      • pessimizer 30 days ago
        And linear at that: x = 0y + 4
    • pb060 30 days ago
      Aren’t functions and equations two different things?
  • i5heu 30 days ago
    So it is entirely possible to decouple the reasoning part from the information part?

    This is like absolutely mind blowing if this is true.

    • learned 30 days ago
      A big caveat mentioned in the article is that this experiment was done with a small set (N=47) of specific questions that they expected to have relatively simple relational answers:

      > The researchers developed a method to estimate these simple functions, and then computed functions for 47 different relations, such as “capital city of a country” and “lead singer of a band.” While there could be an infinite number of possible relations, the researchers chose to study this specific subset because they are representative of the kinds of facts that can be written in this way.

      About 60% of these relations were retrieved using a linear function in the model. The remainder appeared to involve nonlinear retrieval and is still under investigation:

      > Functions retrieved the correct information more than 60 percent of the time, showing that some information in a transformer is encoded and retrieved in this way. “But not everything is linearly encoded. For some facts, even though the model knows them and will predict text that is consistent with these facts, we can’t find linear functions for them. This suggests that the model is doing something more intricate to store that information,” he says.
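
      To make "estimate these simple functions" a bit more concrete, here is a toy sketch with fake data of what fitting one such relation could look like mechanically: a single affine map from the model's subject representation to its object representation. This is only an illustration of the general idea, not the authors' actual procedure.

        import numpy as np

        # Fake stand-ins: S[i] = hidden state for subject i, O[i] = target representation
        # of the correct object under one relation (e.g. "capital city of").
        rng = np.random.default_rng(0)
        d, n = 64, 500
        S = rng.normal(size=(n, d))
        true_W = rng.normal(size=(d, d)) / np.sqrt(d)
        true_b = rng.normal(size=d)
        O = S @ true_W.T + true_b                      # pretend this relation really is linear

        # Fit o ~= W s + b by least squares on some (subject, object) pairs...
        train, test = slice(0, 400), slice(400, None)
        S1 = np.hstack([S[train], np.ones((400, 1))])  # absorb the bias into the design matrix
        coef, *_ = np.linalg.lstsq(S1, O[train], rcond=None)
        W_hat, b_hat = coef[:-1].T, coef[-1]

        # ...and check whether the same affine map keeps working on held-out subjects.
        print(np.allclose(S[test] @ W_hat.T + b_hat, O[test], atol=1e-6))   # True by construction

      The paper's empirical question is essentially how often real relations behave like this synthetic one, and per the quotes above the answer was roughly 60% for the 47 relations tested.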

  • leobg 30 days ago
    > In one experiment, they started with the prompt “Bill Bradley was a” and used the decoding functions for “plays sports” and “attended university” to see if the model knows that Sen. Bradley was a basketball player who attended Princeton.

    Why not just change the prompt?

      Name, University attended, Sport played
      Bill Bradley,
    • numeri 30 days ago
      This is research, trying to understand the fundamentals of how these models work. They weren't actually trying to find out where Bill Bradley went to university.
      • leobg 30 days ago
        Of course. But weren’t they trying to find out whether or not that fact was represented in the model’s parameters?
        • wnoise 30 days ago
          No, they were trying to figure out if they had isolated where facts like that were represented.
  • seydor 30 days ago
    Does this point to a way to compress entire LLMs by selecting a set of relations?
  • aia24Q1 30 days ago
    I thought "fact" means truth.