Llama 3 8B is almost as good as Wizard 2 8x22B

(huggingface.co)

168 points | by tosh 12 days ago

20 comments

  • fnands 12 days ago
    Llama 3 8B basically answers the question of: "What happens if you train a small(-ish) model for a very long time".

    I think we've seen this trend start with some of the Mistral models going beyond the Chinchilla-optimal point, and now Llama 3 goes even further. 15T tokens for an 8B-param model is a lot more than we've seen so far (for context, Llama 2 was 2T tokens), and it seems to be paying off.

    If anything, this release makes me excited for the quality of smaller models going forward.

    • machiaweliczny 12 days ago
      There's still lots of low-hanging fruit in data preparation for these models, it seems.

      I wonder if we will reach insane quality once we are able to let models generate not the exact same text but something similar (thus allowing for more compression). I think math lends itself well to this type of training, as one can easily verify whether a generated result is correct or not.

      We should probably do a similar thing for language, where another model judges whether a generation is OK or not (without requiring exact reproduction). This could probably be done as fine-tuning? Anyone know if this has been tested already?

      • wongarsu 12 days ago
        So something like having a smaller model generate embeddings, and use the embedding distance between the predicted text and the expected text for the loss computation?

        That sounds like it would be incredibly expensive. But maybe it allows you to fine tune on less data, offsetting the cost a bit (and allowing you to fine tune on just your very best data)
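
        A minimal sketch of that scoring idea (purely illustrative: the sentence-transformers model name and the 1 - cosine-similarity "loss" are my assumptions, and as written the signal isn't differentiable w.r.t. the generator, so it would have to drive something like RL-style fine-tuning rather than plain backprop):

          # Hedged sketch, not anyone's actual training code.
          from sentence_transformers import SentenceTransformer, util

          embedder = SentenceTransformer("all-MiniLM-L6-v2")  # small embedding model

          def soft_match_loss(generated: str, reference: str) -> float:
              # Score a generation by semantic distance instead of exact token match.
              emb = embedder.encode([generated, reference], convert_to_tensor=True)
              return 1.0 - util.cos_sim(emb[0], emb[1]).item()

          # Paraphrases score near 0, unrelated text near 1.
          print(soft_match_loss("The cat sat on the mat.", "A cat was sitting on the mat."))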

      • llm_trw 12 days ago
        > There's still lots of low-hanging fruit in data preparation for these models, it seems.

        Anecdotally, I've been using GPT-4 and Claude 3 to generate and rank question-answer pairs for BERT model fine-tuning. I'm getting better performance than anything I've seen in the papers about them, and I can train the models on a desktop.

        • jerojero 12 days ago
          Not gonna tell on you, but I thought you couldn't use OpenAI's models to train other LLMs?
          • llm_trw 12 days ago
            I'm not training an LLM, I'm just training an LM.
          • rspoerri 12 days ago
            I still love the idea that you steal tools to create something and then prohibit others from creating more tools from those products. This idea should be patentable. /s
            • woleium 12 days ago
              Kinda like when William Fox stole the designs for a camera, moved to Hollywood and started Fox Studios, only to chase down unauthorized copies of his work later.
      • threatripper 12 days ago
        Could this be implemented by making the loss zero if the predicted probability for the expected token is above some threshold?

        E.g. with a threshold of 0.1, the model can output 10 likely tokens with 0.1 probability each, and if the expected token is among them the weights are not modified.
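
        In PyTorch terms it might look something like this sketch (the function name and the exact masking rule are mine; the loss is simply zeroed, i.e. no gradient, wherever the expected token already clears the threshold):

          import torch
          import torch.nn.functional as F

          def thresholded_ce_loss(logits, targets, threshold=0.1):
              # logits: (batch, vocab), targets: (batch,) of expected token ids
              probs = F.softmax(logits, dim=-1)
              target_prob = probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
              per_token = F.cross_entropy(logits, targets, reduction="none")
              mask = (target_prob < threshold).float()  # only penalize "bad" predictions
              return (per_token * mask).sum() / mask.sum().clamp(min=1)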

        • wongarsu 12 days ago
          One issue is that this doesn't punish the model if the top prediction produces complete nonsense. I can imagine getting good results when fine-tuning an existing model though.
      • zitterbewegung 12 days ago
        You could derive similar data by looking at the embeddings, synthesize it, and then add it to the training dataset. Right now, fine-tuning a model before synthesizing the similar data could be done.
    • freehorse 12 days ago
      I wonder if quantisation reduces the quality of these models more than it did for the less overtrained ones we were used to? Does anybody know if that could be the case?
    • jasonjmcghee 12 days ago
      Zuck mentioned they also cut entire classes of data from Llama 2's training that they included in Llama 3, as Llama 2 was intended to be used for social / Meta-related services. So things like code were almost entirely skipped. He only mentioned code, but I wouldn't be surprised if math etc. was treated similarly.

      My assumption here is adding these missing subject matter areas had a larger impact than raw token counts.

    • kleiba 12 days ago
      Isn't overfitting the usual result of training a model for a long time?
      • bjornsing 12 days ago
        As I understand it the Chinchilla “optimal point” is severely under-fitted. It’s optimal in the sense that if you only care about training cost it would have been better to make the model bigger and even more under-fitted. But clearly we care about inference cost too (or even primarily), so it makes sense to train for longer.

        Also, these models are trained on trillions of tokens, so I’m not sure an 8B model even can overfit.
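
        For rough scale, using the often-quoted ~20-tokens-per-parameter rule of thumb derived from the Chinchilla paper (an approximation, not an exact figure from the paper):

          params = 8e9                       # Llama 3 8B
          chinchilla_tokens = 20 * params    # ~160B tokens would be "Chinchilla optimal"
          actual_tokens = 15e12              # reported 15T training tokens

          print(chinchilla_tokens / 1e9)            # ~160 (billion tokens)
          print(actual_tokens / chinchilla_tokens)  # ~94x past the Chinchilla point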

        • candiodari 12 days ago
          1) obviously, when the people doing the inferencing are different from the ones doing training, training cost does not matter to inference.

          2) Almost, Chinchilla concerns itself with minimizing cost(training) + cost(inference). The expectation most people have is that there's an insane amount of inference compute, and training is maybe 1% of that. The point of the Chinchilla paper is that that's not true: training uses such insane amounts of compute that, even though all the inference done by half the internet for a year or two is a huge amount of compute, training is still a very decent percentage of it. I believe in one of the examples they pointed out that even the whole internet inferring with a model for years was still only 20% of the cost of training that model.

          People expect it works like compilers, that making a compiler produce 1% faster code is worth 50 highly-paid SWEs because while an individual program run isn't exactly expensive, the time and resources spent running programs is astronomically larger than the time and resources spent developing compilers.

          The thing is most of the optimizations we know don't work during training. You can't quantize, you can't MoE (well, you can, obviously, but it doesn't save any training computation. In fact it increases training cost)

          At 50-50, having a 10% cheaper-to-train model justifies 10% more expensive inference.

          3) Combining both arguments ... at this point people should probably realize that Facebook's Llama is really an attack on Google (which is at least partially working; Elon Musk is tweeting about it).

          If Facebook really doesn't care about AI (or ... cares about as much as, say, Netflix does. Not zero, but as long as they beat Reddit's efforts they feel very comfortable), but Zuck does care about destroying Google, the calculus changes. Zuckerberg may not want the best possible AI; he may want as many scammers as possible trying to game the Google search quality team, to present them with challenges faster than they can adapt. Then the training cost becomes a moot point.

          Hmmm, I should send my CV to meta ...

          • jsnell 12 days ago
            > 2) Almost, Chinchilla concerns itself with minimizing cost(training) + cost(inference).

            Where are you getting that from? As far as I can tell, the Chinchilla paper is purely about getting the highest quality from a fixed training budget. Inference is only mentioned a couple of times in passing as a side effect of smaller models, not as the goal nor as an input to the formula. (And just to be clear: the Chinchilla paper was arguing for smaller models trained for longer, while you seem to be saying that they were arguing for larger models since the inference cost is insignificant.)

            > I believe in one of the examples they pointed out that even the whole internet inferring with a model for years was still only 20% of the cost of training that model.

            I do not see any such example in the paper

            • numeri 12 days ago
              Yeah, this is correct and I'm not sure what paper GP was thinking of – Chinchilla is only about finding the point at which it would be more useful to scale the model rather than training longer.

              Chinchilla optimal scaling is not useful if you want to use the model, just if you want to beat some other model on some metric for the minimal training costs.

              • candiodari 8 days ago
                Well, my point is that "scale the model" is equivalent to upping inference costs.
      • SuchAnonMuchWow 12 days ago
        No, it's the opposite: overfitting is the result of either having too many weights compared to the size of your dataset, or training for a long time while reusing/transforming parts of your dataset to make it last longer.

        Having a huge dataset compared to the size of your network will reduce overfitting.

        • Version467 12 days ago
          We don't actually know how big the dataset is, right? It could be the same dataset used for Llama 2, but trained for more epochs.
          • GaggiX 12 days ago
            The dataset is 7 times bigger than the dataset used for Llama 2 as reported by Meta.
            • baobabKoodaa 9 days ago
              Has Meta disclosed how much of the dataset was repeated? I've only seen the "number of tokens trained" number.
      • segmondy 12 days ago
        In this case no. From the data observed, it still looks like we can train with 500T+ tokens before we start seeing overfitting. We are severely undertraining. I don't think anyone realized how much training was needed until OpenAI; once OpenAI started observing emergent behaviors, they released to show what was possible and almost set a trend for how long to train. Fortunately Meta decided to buck that trend to see what would happen, and what we have seen is smaller models getting smarter.
      • perforator 12 days ago
        If the model is properly regularised, it can be trained indefinitely without overfitting. E.g., you can add adversarial perturbations to images and train a visual model for a very long time.

        I don't know if the current LLM architectures have any explicit regularisation or if it happens to be an intrinsic part.

        • CuriouslyC 12 days ago
          There are a number of forms of regularization used, obviously L1/L2 and also dropout. It's not as effective as scaling/perturbing patches in the image space.
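
          For reference, this is roughly how those two show up in a typical transformer training setup (a generic sketch, not Llama's actual configuration; the layer sizes and hyperparameters are made up):

            import torch
            import torch.nn as nn

            block = nn.Sequential(      # stand-in for one transformer feed-forward sub-layer
                nn.Linear(512, 2048),
                nn.GELU(),
                nn.Dropout(p=0.1),      # dropout: randomly zero activations during training
                nn.Linear(2048, 512),
            )

            # "L2" regularization is usually applied as weight decay in the optimizer
            optimizer = torch.optim.AdamW(block.parameters(), lr=3e-4, weight_decay=0.1)
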
      • novariation 12 days ago
        Overfitting happens when you have more parameters than data points. What's currently happening would be closer to underfitting, but the reality is that we were very far from it; we just didn't want to invest the compute necessary to train smaller models for a long time, because we thought they weren't at the optimal position on the fitting curve.

        I'd say we are just sliding a little bit closer to underfitting, but are still far from it because we still have many parameters to work with.

      • tomp 12 days ago
        This isn't overfitting.

        Chinchilla optimises training compute.

        LLaMa optimises inference compute. Overtraining (according to Chinchilla), not overfitting.

      • moffkalast 12 days ago
        Only if you train for a ludicrous number of epochs on a small dataset, instead of one epoch on an extremely large one.
      • samus 12 days ago
        Depends what you want and whether you have new data or just train for more epochs. More relevant data is probably always a good thing, repeating the same data not necessarily.

        Repeating data is probably good though if you really want the model to learn certain things.

        State-of-the-art training curricula probably train on low-quality data first to teach the model how to handle language. You don't want to repeat that too often; you rather want a lot of diverse input. Data augmentation should help here too, for example deliberately introducing spelling errors in the input and expecting the model to return the original, unchanged input.

        Later stages contain high-quality data and try to improve performance on specific tasks. That probably benefits from repetition.

        The final stage is alignment to chat or instruct prompts and safety engineering.

      • nkozyra 12 days ago
        > Isn't overfitting the usual result of training a model for a long time?

        Overfitting stops when your model includes everything 8)

      • seydor 12 days ago
        What does overfitting do to a language model?
        • fancyfredbot 12 days ago
          An overfit LLM would probably regurgitate common phrases from the training dataset and struggle to output phrases which didn't occur in that dataset.

          I'm not really sure what that looks like when your training data is so huge. Might not be very noticeable?

          • seydor 12 days ago
            also, an LLM is not learning phrases per se but the relationships between tokens
        • samus 12 days ago
          Probably nothing specific. By definition, an overfitted model has only learned to deal with the training input and will perform worse on anything that is different. In training metrics it shows up as a large gap between the performance on the training set vs. on the test set.
          • exe34 12 days ago
            When the training data is the entire internet, I suspect it can still perform well enough in economically profitable situations to be worth having.
            • samus 12 days ago
              Surely it might still work, but training a model into true overfitting territory is a waste of resources. An overtrained model is usually a worse model than an earlier checkpoint. That's the main reason one strives to avoid that.
    • imjonse 12 days ago
      Too bad they did not train or release a Llama 3 2B to see how it fares against Phi and Gemma.
      • cchance 12 days ago
        I'm surprised there's no 14B or 16B.
    • Aissen 12 days ago
      > very long time

      On one of Meta's 24k H100 clusters running at 95% efficiency, it's 2.3 days.

      • WithinReason 12 days ago
        It was closer to 40% efficiency AFAIK
        • Aissen 12 days ago
          I'm curious, do you have any pointers? This is what this article mentions:

          > Those improvements resulted in an overall effective training time of more than 95%

          https://ai.meta.com/blog/meta-llama-3/

          • logicchains 12 days ago
            The known SOTA for GPU flops utilisation for training on that many GPUs is somewhere between 50-60%, e.g. https://github.com/NVIDIA/Megatron-LM . If they really managed to get 95%, it's a huge advance in the state of the art. I guess by training time they meant with respect to downtime of the hardware, not utilisation of the hardware potential.
            • Aissen 12 days ago
              > I guess by training time they meant with respect to downtime of the hardware, not utilisation of the hardware potential.

              That was my understanding as well, those are two different levels of efficiency.

          • WithinReason 12 days ago
            That means that 5% of the time GPUs were not doing any training, e.g. the server was busy with checkpointing so the training paused. 40% is according to Karpathy: https://twitter.com/karpathy/status/1781028605709234613
    • Tenoke 12 days ago
      >If anything, this release makes me excited for the quality of smaller models going forward.

      While there's going to be more that can be milked out of them, we are already clearly fairly deep in the diminishing returns portion on small models.

      • MyFirstSass 12 days ago
        Seems like it for 7B models (but also, wow, things are moving fast!):

        Llama - 1.4 Trillion tokens - feb. 2023,

        Llama 2 - 2 Trillion tokens - july. 2023,

        Mistral - 8 Trillion tokens - sep. 2023 (this was the first big impressive leap where local models really became useful for chat)

        Llama 3 - 15 Trillion tokens

        So we went 1.4x, to 4x, to 2x. I wonder if there's even unused data out there?

        Haven't tried the new model enough to see how much better it is than Mistral, that will be the real SOTA test for now.

        • Escapado 12 days ago
          I wondered about the same thing, but at the same time only about 5% of the training data is non-English, and I would be surprised if the total amount of published text from all non-English languages combined was also just 5% of all published text. So my intuition tells me there are still heaps of data, but what might be tricky is to properly access and assess its quality.

          Also, the RedPajama v2 dataset has 30T tokens and is based on Common Crawl. Now I don't know too much about Common Crawl, but I doubt it has in it all the published scientific books in all the different languages, as these are often not freely crawlable. I remember when I studied physics there were at least 20 different 200-800 page long books on particle physics in German alone in our campus library. That must amount to 5 million tokens by itself, from just one niche of physics. The Hamburg public library hosts about 5 million books and 90 million scientific articles, mostly in English and German. If the average length of a scientific article is 6,000 tokens and the average book about 100,000, then that alone is already 1 trillion tokens. I bet there are significantly larger libraries, and this is before even crawling the internet and looking at other languages or even generating training data.
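
          Spelling out that back-of-envelope (using the numbers above; the per-book and per-article token counts are of course rough guesses):

            books = 5_000_000
            articles = 90_000_000
            tokens_per_book = 100_000
            tokens_per_article = 6_000

            total = books * tokens_per_book + articles * tokens_per_article
            print(total / 1e12)  # ~1.04 trillion tokens from one library's holdings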

          • vidarh 12 days ago
            Deutsche Nationalbibliothek appears to have 43.2 million "items", of which apparently 17.3 million are books. If we assume ~60,000 tokens for an average book (which seems very conservative given average word length in German and a novel typically being considered anything above ~40k words), that's another trillion just for their books, so I'm guessing the total German-language content available in major libraries will be many times that.

            E.g. the Norwegian National Library has somewhere between 3x and 10x as many tokens in Norwegian newspapers as in books (at one point I think a breakdown of GPT-3's training data by language surfaced, and the Norwegian data was a tiny fraction of what was available in the national library, even before trying to estimate online/digital content).

            While I'm sure there's overlap [1] between the languages, a lot of it will help translation, and I think even for smaller languages the ratio of local content seems to dwarf translations. E.g. the "bestsellers" from English, French, and German all get translated to Norwegian, but most of the "long tail" content is local.

            [1] I was tickled to find one of my uncles represented in Deutsche Nationalbibliothek; he was a professor in statistics, so it was a translation of some of his research.

          • sheepscreek 12 days ago
            Still quite impressive to think these models are trained on 15x the content in a large public library. That is insane.
        • londons_explore 12 days ago
          > I wonder if there's even unused data out there?

          One day someone is going to train on the contents of DM's/private conversations/emails. There has to be 50x or more the quantity of that compared to public text.

          I suspect they'll do it via some 'provably private training' regime, and therefore be able to claim it isn't a privacy violation.

          • nolok 12 days ago
            Yes, they're either already getting into it behind a legal facade, or aiming for it as the next El Dorado of data. Facebook has WhatsApp and Messenger, Microsoft has Skype and Outlook and Exchange and MSN Messenger, Google has Gmail and all your text messages and their bazillion chat apps and Usenet and IRC and ..., Apple has iMessage and iCloud email and ... There is so much data there, it dwarfs those token counts.
            • londons_explore 12 days ago
              The players with E2E encryption (iMessage, WhatsApp) would need to do some kind of client-side edge-device training. Possible, but hard to do with nobody knowing, and edge-device training usually involves big quality compromises.
              • nolok 12 days ago
                Didn't both of them have some of their backups in clear text? I know WhatsApp backups on Gmail were.
          • Workaccount2 12 days ago
            Google is sitting on ~20 years of Gmail, but I can imagine the headache of both cleaning the dataset and the likely consumer blowback.

            They also have YouTube, which almost certainly has enough good data to train a powerful model on its own, but also seems daunting to clean up first.

            • seunosewa 9 days ago
              They can't legally be allowed to use email content to train public models. The data leaks that could result would be unbelievable. It could be used to train private LLMs within organizations that own the emails. But YouTube comments work.
        • nolok 12 days ago
          > I wonder if there's even unused data out there?

          I think you're massively underestimating the amount of data out there. The challenge is how to access and categorize that data.

          Every Usenet message, every forum post from old-school BBSes to PHP forums to modern JavaScript abominations and closed Discord boards, every email, every text message, every IRC message, ...

          Those are probably a pain to access due to rules and regulations and yada yada, but that alone dwarfs the 15 trillion, and you've not even started on actual quality content.

          (Not saying these would specifically be good for LLMs, just answering your "unused data" assessment.)

          • MyFirstSass 12 days ago
            That's a good point, though I already thought the OpenAI team had been very aggressive in sweeping both Reddit and Usenet, plus various illegal megatorrents of books, forum dumps, etc. I remember there was some controversy around it on Twitter a few months ago.

            One thing, though, is books/content/media from other language spheres, which could probably at least 10x the size of the data, and as far as I know translation starts to work rather well in these larger models, so it would probably just plug right into the knowledge graph for all languages?

            • vidarh 12 days ago
              There are still vast amounts of data locked up behind login screens etc., though. E.g. as to the foreign-language data, a lot of national libraries around the world are either not even fully digitized yet or have lots of locked-down content. The Norwegian one is pretty open, but there are still huge amounts (like most newspapers newer than a century or so) that are either only available based on geolocation (I have my VPN for genealogy because of that - I'm Norwegian but live in the UK, and it's a nuisance), or only in a physical library in Norway.

              Similarly I was looking for something from the British Library at one point and it was behind a paywall (a copying fee).

              I have no idea how to even start to estimate how much data is locked down like that, and it's harder yet to try to figure out which parts of that it'd be possible to negotiate access to for various players, and what they can circumvent (e.g. say by buying book collections and the like - OpenAI is large enough by market cap it could afford to buy some of the largest extant publishers, for example, if they thought it gave them sufficient benefits).

              • MyFirstSass 12 days ago
                That is incredibly interesting to me, because I've heard historians, linguists and people "just not from the anglosphere" complain about just how isolated and limited our linguistic and cultural perspectives are.

                In other words, if LLMs could somehow bridge that gap across both regions and time, I'm pretty sure something magical could happen, different from the already somewhat tired and conformist "echo chamber" quality of LLMs mostly trained on Reddit, corporate speak, and Anglo pop culture, or even just Western thought in general.

                • vidarh 12 days ago
                  It's been pretty fascinating. ChatGPT clearly has a very small Norwegian corpus, but it can not only translate to and from both Norwegian written languages (they're more like dialects - they're mutually intelligible) but can make a passable attempt at translating into at least some highly localized dialects, and can explain key differences between certain sociolects. And it took only a slight explanation to get it to give a plausible translation into this weird sociolect constructed by the circle around a fringe Maoist group from the 70's that wanted to sound more working class and adopted a bunch of affectations that do not match any "natural" Norwegian dialect (several members ended up as prominent authors, and so it spread wider than the size of the group otherwise would have allowed for).

                  That said, English and Norwegian are pretty close. How well it will handle languages with more significant differences without larger amounts of tokens is another matter. Even for pretty small language groups there ought to be enough, though.

      • CuriouslyC 12 days ago
        Are we though? We haven't even started training 1.58-bit models, synthetic data and "model gyms" are very promising, and new architectures have been coming out that offer real benefits over transformers.
  • pytness 12 days ago
    Seems like a yapper. And it makes some wrong(?) assumptions?

      >>> hi
      Hi! It's nice to meet you. Is there something I can help you with, or would you like to chat?
      
      >>> if tilt my head upwards so my face is looking at the sky and I spit, what will happen?
      That's an interesting question!
      
      If you tilt your head upwards so your face is looking at the sky and then spit, a few things might happen:
      
      1. **Spit will go flying upwards**: Since your face is now pointing towards the sky, the trajectory of your spit
      will be upward too. It might even reach the clouds (if there are any nearby) or disperse in mid-air.
      2. **Some spittle might fall back down**: Depending on air currents, wind direction, and the distance between
      you and the ground, some of the spittle might fall back down towards the earth.
      3. **You might get a bit of a face-full**: If you're not careful, some of that spit might splash back onto your
      own face, which could be... interesting.
      
      Remember to be mindful of where you are and who's around when doing this experiment. It's always good to clean
      up after yourself and respect those nearby!
      
      Now, I'm curious – what inspired this question?
    
    EDIT:

      >>> How would the spit reach the clouds?
      I was getting a bit carried away with the imagination there. In reality, it's highly unlikely that your spit would actually reach the clouds.
    
      Clouds are typically several thousand feet (or meters) above ground level, and even if you were to tilt your head at an extreme angle,
      the spittle would likely be deflected or absorbed by air currents before reaching such heights.
    
      In practice, the spittle would probably fall back down towards the earth due to gravity, and might land on the ground, a nearby surface, or even get blown away by the wind.
    • pennomi 12 days ago
      I've also noticed its "overactive imagination" on some questions.
  • tosh 12 days ago
    I wonder to what degree these results come from more training and emergent capability from that vs these kinds of riddles now being known and well represented in training material?

    Adjacent: one thing that makes llama3 special is that they spent extra effort on deduplicating training data (quality) + increased training data (quantity) and used 4x more code in training material (=> good for reasoning).

    So perhaps we can still see huge capability gains in these small models by spending more effort on training data curation and generation?

    • wrasee 12 days ago
      Exactly this. I only casually follow this space and I recognised many of those questions as essentially ‘classic’ questions by now. Surely any recently trained model taking recent data is going to have a ton of references to these now, especially those that are not officially part of some test set that might have been controlled for to some extent?
  • DalasNoin 12 days ago
    I think it is interesting that Llama 3 still does not use Mixture of Experts. Command R+ is also not using MoE. On paper you get 4x less training and inference compute cost. But maybe there is a good reason not to use it; this release kind of supports that. It might be wasteful to create these huge MoE models if they can't outperform smaller dense models.
    • samus 12 days ago
      Many MoE models like Mixtral were not created from scratch. Somebody could still upcycle that 8B into an 8x8B. It should ideally be done by Meta since they have access to the original training data.
    • CuriouslyC 12 days ago
      MoE models are harder to fine tune, and they don't solve the biggest problem which is GPU memory use.
  • isusmelj 12 days ago
    Don't forget that this is 24k H100. They are getting 10x the compute: https://www.cnbc.com/2024/01/18/mark-zuckerberg-indicates-me...

    So GPT-4-level 8B models running on phones and notebooks seem feasible within the next 5 years. I imagine having (voice) assistants running locally. Crazy how fast we're progressing.

  • UrineSqueegee 12 days ago
    I am using 8B-instruct and I am honestly blown away; this almost feels like early GPT-4 levels of logic.

    I am absolutely pumped for what's in store for small models!

    • kromem 12 days ago
      I dunno, I'm less than dazzled.

      One of my go-tos is a variation of a classic logic puzzle, as models naturally tend to try to provide the original form of the solution because of token similarities. By seeing how well they can identify where steps are wrong, or whether they can solve the variation when token similarities are broken with representative symbols or other techniques like mapping out object relationships in the initial answer before providing a solution, it tends to reveal the underlying capacity for logic beyond simple similarity to the training data.

      Mistral, GPT-4, Claude, Gemini can all solve it with various degrees of competency and hand holding.

      Llama 3 just cannot handle it.

      I think what we're seeing is that the massive training data set is leading to a much more capable stochastic parrot, but not necessarily much more capable world modeling. Or else there's an advantage to MoE for this kind of task that Llama 3 lacks. But it's tracking with my subjective sense of it being somewhere between GPT-3 and 4 in terms of "critical thinking" ability, but not in the same class as more modern models, even though it seems to have a better breadth of training to draw on than most other medium-sized models.

      • pennomi 12 days ago
        Man, if I was training an AI, I’d just create a dataset of Hacker News pet problems just to con everyone into thinking it’s somehow better at reasoning.
        • BoorishBears 12 days ago
          I'm honestly amazed people still think riddles that are essentially asking a fish how well it climbs a tree are relevant to anything.

          Ask it to write code for you, ask it pop culture questions to test recall and hallucinations, literally a million more interesting things you can do.

      • MyFirstSass 12 days ago
        Both GPT-4 and Llama 3 8B still fail at actual reasoning:

        jack has 15 sisters, each sister has 3 brothers. How many brothers does jack have?

        (real answer is 2)

        -------------------

        Llama 3 8b

        Let's break it down step by step:

        Jack has 15 sisters.

        Each of his sisters has 3 brothers.

        To find the total number of brothers, we need to multiply the number of sisters (15) by the number of brothers each sister has (3).

        15 sisters x 3 brothers/sister = 45 brothers

        So, Jack has 45 brothers!

        -------------------

        GPT-4

        Jack has 3 brothers.

        Here's why:

        If each of Jack's sisters has 3 brothers, these brothers are the same for all sisters, including Jack himself. Therefore, Jack and his 3 brothers make up the total count of brothers in the family. Since the problem states that each sister has 3 brothers, we know that Jack himself must be counted as one of these brothers. So, Jack has 3 brothers.

        -------------------

        Lol.

        You can tweak the question's numbers and names to avoid training data (a quick sanity check for such variants is sketched below).

        Can Opus solve this?
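
        For checking tweaked variants by hand, the relation is simple (assuming all siblings share the same parents, which is what the puzzle intends):

          def brothers_of_the_boy(brothers_per_sister: int) -> int:
              # Each sister's brothers are all the boys in the family, including him,
              # so he has one fewer brother than each sister has. The number of
              # sisters doesn't matter (as long as there is at least one).
              return brothers_per_sister - 1

          print(brothers_of_the_boy(3))  # Jack's puzzle -> 2
          print(brothers_of_the_boy(4))  # the "Bob" variant below -> 3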

        • joelthelion 12 days ago
          Llama 3 70B seems to handle it just fine:

          A classic lateral thinking puzzle!

          Let's break it down:

          * Jack has 15 sisters. * Each sister has 3 brothers.

          Now, here's the key insight: if each sister has 3 brothers, that means each sister is counting Jack as one of their 3 brothers!

          So, how many brothers does Jack have? Only 2 The other 2 brothers that each sister is counting are actually Jack himself.

          Answer: Jack has 2 brothers.

          • MyFirstSass 12 days ago
            Wow, that's extremely impressive in light of GPT-4 reportedly being 1800B; so a model 25 times smaller is beating it logic-wise?

            I don't have enough RAM to run the 70B, but I'm very curious about its performance on other logic puzzles!

            • MyFirstSass 12 days ago
              I just tried variations and weirdly it fails many of them like:

              "bob has 7 sisters, each sister has 4 brothers. How many brothers does bob have?"

              -----------------

              Llama 70b instruct (wrong)

              A classic lateral thinking puzzle!

              At first glance, it seems like a straightforward math problem, but wait...

              If each of Bob's 7 sisters has 4 brothers, that would mean each sister has Bob and 3 other brothers, right? But that's not possible, because Bob is the only brother!

              So, the correct answer is: Bob has no brothers. He has 7 sisters, and each of them has him as their only brother.

              Did I get it right?

              -----------------

              GPT-4 (correct)

              To solve this, let's consider the family structure:

              Bob has 7 sisters. Each of these sisters has 4 brothers. Since all the sisters have the same number of brothers, and these brothers include Bob, this implies Bob has 3 other brothers. Therefore, including Bob, there are 4 brothers in total. Hence, Bob has 3 brothers.

        • rvnx 12 days ago
          > User:

          jack has 15 sisters, each sister has 3 brothers. How many brothers does jack have?

          > Claude-3-Opus:

          Let's approach this step-by-step:

          We know that Jack has 15 sisters.

          Each of Jack's sisters has 3 brothers.

          Since all the sisters are from the same family, they all have the same brothers.

          Therefore, the 3 brothers that each sister has are the same 3 brothers.

          Since Jack is a brother to his sisters, he is one of these 3 brothers.

          So, Jack has 2 brothers.

          Therefore, Jack has 2 brothers.

        • ijustlovemath 12 days ago
          For me, Opus got it in two tries:

          >jack has 15 sisters, each sister has 3 brothers. How many brothers does jack have?

          Let's approach this step-by-step:

          1) We know that Jack has 15 sisters.

          2) Each of Jack's sisters has 3 brothers.

          3) Since all the sisters are Jack's siblings, they share the same brothers. In other words, the brothers of one sister are the same as the brothers of any other sister.

          4) Therefore, the number of brothers that Jack has is the same as the number of brothers each of his sisters has.

          5) We are told that each sister has 3 brothers.

          Therefore, Jack has 3 brothers.

          > Nope

          You're right, I made an error in my reasoning. Let me try again:

          1) Jack has 15 sisters.

          2) Each of Jack's sisters has 3 brothers.

          3) Jack is one of the brothers of each of his sisters.

          4) Therefore, the total number of brothers (including Jack) is 3.

          5) To find the number of brothers Jack has, we need to subtract Jack from the total number of brothers.

          6) $\text{Number of Jack's brothers} = \text{Total brothers} - \text{Jack} = 3 - 1 = 2$

          Therefore, Jack has 2 brothers.

      • dotsam 12 days ago
        You can change the Llama 3 system prompt to coax out the right answer. Shame it doesn't work out of the box though.
    • pikseladam 12 days ago
      How did you try it?
  • syntaxing 12 days ago
    The biggest issue with local LLMs is that people "use" them in different ways, which can create different impressions. As a somewhat-PSA: try loading the Q6 KM quant if possible, Q5 KM at a minimum. Anything lower, like ollama's default of Q4_0, will have noticeable quality degradation.
  • nl 12 days ago
    One thing I hadn't really thought through is that the reason these models are trained to point X is that they just decide to release it then.

    But most of these companies _keep training_ the same model. A lot of the time there is no "ok we are done, we will do a new thing now" - it just keeps going.

    Obviously, thinking this through, it explains a lot of the vast spend on GPUs. They need new compute because the existing compute is occupied _and always will be_. Model training will never finish!

    • petesergeant 12 days ago
      > But most of these companies _keep training_ the same model

      Do they? Genuine question. Why do you think so?

    • JKCalhoun 12 days ago
      How many models do they have? Is model availability becoming the bottleneck?
  • idle_zealot 12 days ago
    Say I wanted to run this on my phone today. What's the best path one could take to actually use it as more than a toy? I suspect step 1 is to get llama.cpp running in Termux and a ggml or other model file loaded. That gets you a local web interface for inference. Then... how do you hook that up to your system such that it can do anything useful?
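
    One common pattern (assuming llama.cpp's bundled server, which exposes an OpenAI-style HTTP endpoint on localhost; the port, path and payload below may differ by version) is to script against it locally, e.g.:

      # Minimal sketch: talk to a locally running llama.cpp server from a script.
      import json, urllib.request

      def ask_local_llm(prompt: str) -> str:
          payload = json.dumps({
              "messages": [{"role": "user", "content": prompt}],
              "temperature": 0.2,
          }).encode()
          req = urllib.request.Request(
              "http://127.0.0.1:8080/v1/chat/completions",
              data=payload,
              headers={"Content-Type": "application/json"},
          )
          with urllib.request.urlopen(req) as resp:
              body = json.load(resp)
          return body["choices"][0]["message"]["content"]

      if __name__ == "__main__":
          print(ask_local_llm("Summarize this note: buy milk, call Anna."))
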
    • moffkalast 12 days ago
      MLC might be a decent option, I think they have a way to deploy for mobile and might actually compile for the NPU.
  • saghul 12 days ago
    Kinda surprised the 8B gets this wrong: "what's heavier a kilo of steel or two kilos of feathers?" GPT-3.5 gets it wrong too. The 70B model does get it right, so does GPT-4.
    • pennomi 12 days ago
      My pet question is “Which weighs more, 1000cm^3 of styrofoam or 1cm^3 of tungsten?”

      Most LLMs go through the calculation and find that the styrofoam is heavier, then confidently announce that the tungsten weighs more. Strange, considering they'll say something very nearly like “The styrofoam weighs 50 g and the tungsten weighs 19.3 g, therefore the tungsten is heavier.”
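
      The arithmetic they get right and then contradict (densities are approximate; styrofoam in particular varies a lot):

        styrofoam_density = 0.05  # g/cm^3, rough typical value
        tungsten_density = 19.3   # g/cm^3

        styrofoam_mass = 1000 * styrofoam_density  # ~50 g
        tungsten_mass = 1 * tungsten_density       # 19.3 g
        print(styrofoam_mass > tungsten_mass)      # True: the styrofoam is heavier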

    • fransje26 12 days ago
      That's not how it responded to my query.

      > What's heavier? 1 kg of lead or 2 kg of feathers?

      That's a classic trick question!

      The answer is: 2 kg of feathers.

      Why? Because 2 kg is heavier than 1 kg, regardless of the material. The density of the material doesn't matter in this case, only the weight. So, 2 kg of feathers would weigh more than 1 kg of lead.

      • fransje26 12 days ago
        I stand corrected. I was inadvertently on the 70b model.
    • wongarsu 12 days ago
      That seems to be the general experience. Maybe 8B are just too few parameters to achieve higher level reasoning.
      • brrrrrm 12 days ago
        Maybe depth rather than parameter count.
  • dimask 12 days ago
    > I have 10 apples. I find 3 gold coins in the bottom of a river. The river runs near a big city that has something to do with what I can spend the coins on. I then lose 4 apples but gain a gold coin. Three birds run into my path and drop 6 apples each. I play an online game and win 6 gold coins but I have to share them equally with my 2 teammates. I buy apples for all the coins I have. The price of an apple is 0.5 coins. How many apples do I have? And where is the river?

    > You now have 34 apples!

    Actually, it forgot to add the one coin gained when the person lost 4 apples, and the total should be 34+2=36 apples in the end, so the model got it wrong. I tried on phind.com with their 70B model and it actually got it correct. Still, quite impressive for an 8B model, but it does again show the problem with trusting such puzzles to LLMs: actually finding mistakes is hard because it all sounds correct and is confidently well written.
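
    Working the prompt through step by step (a plain recomputation of the scenario above):

      apples, coins = 10, 0
      coins += 3                   # find 3 gold coins in the river
      apples -= 4                  # lose 4 apples...
      coins += 1                   # ...but gain a coin (the step the 8B model dropped)
      apples += 3 * 6              # three birds drop 6 apples each
      coins += 6 // 3              # win 6 coins, shared equally with 2 teammates -> keep 2
      apples += int(coins / 0.5)   # spend all coins on apples at 0.5 coins each
      print(apples)                # 36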

    • exe34 12 days ago
      I tried chatgpt:

      > Let's break down the events step by step:
      >
      > Starting Apples: 10 apples
      > Find 3 Gold Coins:
      > +3 gold coins
      > Total apples: 10
      > ...

      I keep thinking about how to reliably make it come up with code to do the job correctly.

      edit: oh I think chatgpt did it. https://chat.openai.com/share/f1742187-ab06-46c7-abaf-9a7981...

  • earth2mars 12 days ago
    Except for GPT-4, no other model answered the following correctly. Why?

    Name five countries that starts and end with same letter

    • furyofantares 12 days ago
      They operate on tokens and so are particularly bad at knowing what letters are in words beyond the first letter or so.
    • kennyadam 12 days ago
      Gemini Advanced was able to answer that no problem: https://g.co/gemini/share/fee3410e018a
    • ijustlovemath 12 days ago
      Opus did pretty well with this slight modification:

      write a self contained Python program (no I/O, no dependencies) that prints a list of all countries that start and end in the same letter.

      Here's the output (which looks correct!):

      ['Albania', 'Algeria', 'Andorra', 'Angola', 'Antigua and Barbuda', 'Argentina', 'Armenia', 'Australia', 'Austria', 'Central African Republic', 'Czech Republic', 'Saint Kitts and Nevis', 'Saint Vincent and the Grenadines', 'Seychelles', 'Solomon Islands']
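
      For comparison, a sketch of what such a program might look like (with a small hardcoded country list for illustration rather than an exhaustive one):

        COUNTRIES = [
            "Albania", "Algeria", "Andorra", "Angola", "Argentina", "Armenia",
            "Australia", "Austria", "Czech Republic", "Seychelles",
            "Solomon Islands", "Norway", "Germany", "France",
        ]

        matches = [c for c in COUNTRIES if c[0].lower() == c[-1].lower()]
        print(matches)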

  • 0DayFlaw 11 days ago
    Still lags behind the leaders at article summarization.

    Neither the 70B nor the 8B was able to produce consistent results for the below prompt:

    Prepare a threat intelligence advisory based on below article, provide following sections" summary (maximum 300 words), technologies affected, TTPs (if any) and TA groups (if any), vulnerabilities exploited (if any) and recommendations for mitigation and detection.

    https://www.bleepingcomputer.com/news/security/mitre-says-st...

  • maksimur 12 days ago
    Annoying that I have to wait to get accepted before seeing details about the files. I'm not planning to download them but was just interested in knowing their size.
  • alecco 12 days ago
    I wonder how they got the 15T tokens for training.
    • rvnx 12 days ago
      Illegally acquired content on the internet, like search engines do
  • miven 12 days ago
    Any guesses as to why they bumped the parameter count up from 7B to 8B?
    • matrix2596 12 days ago
      I also wondered the same and checked the model configs. They are using a bigger vocab size, and the intermediate size of the fully connected layer seems to be bigger.
  • toxypol 12 days ago
    It is a huge jump; I've been testing it for the last few hours.
  • andrewstuart 12 days ago
    Weird, all these measures of what makes a good AI.

    The only really great one is OpenAI ChatGPT 3.5