Phi-3 Technical Report

(arxiv.org)

410 points | by varunvummadi 10 days ago

19 comments

  • modeless 10 days ago
    Everyone needs to take these benchmark numbers with a big grain of salt. According to what I've read, Phi-2 was much worse than its benchmark numbers suggested. This model follows the same training strategy. Nobody should be assuming these numbers will translate directly into a high ranking on the LMSYS leaderboard, or usefulness in everyday tasks. Let's not dethrone Llama 3 until some real world testing can be done.

    That said, I don't think it's impossible for a small model to be very good. I see their "synthetic data" as essentially a way of distilling GPT-4 into smaller models. It would be exciting if a large fraction of the performance of huge models could be transferred to small ones! If true, then Chinchilla-optimal training could make sense again, as you could optimally train a ginormous model and then distill it afterward for efficient inference.

    • bt1a 10 days ago
      This won't dethrone Llama 3, but it's equally impressive.

      They mention this model's relative weakness on the TruthfulQA eval, since packing 'knowledge' into a small model is lossier than packing problem-solving skills (which shine on MMLU).

      Regardless - still a very useful thing to have offline and on the fly. Those scores are nothing to scoff at.

      Given that these pipelines are likely harder to imitate than new architectures like Transformers, I assume there has been and will be an intense focus on synthetic data generation and cleansing. Llama 3 used 15T of tokens in its training corpus vs 4.8T in the "scaled-up" version of phi-3. If you made it to the end of this disjointed ramble I'm sorry.

      • Grimblewald 10 days ago
        Even llama3 has its issues. I've been quite impressed so far, but if the context gets a little long it freaks out, gets stuck repeating the same token, or just fails to finish an answer. This is for the full f16 8B model, so it can't be put down to quantization. It also doesn't quite handle complex instructions as well as the benchmarks would imply it should.
        • andai 10 days ago
          Supposedly LLMs (especially smaller ones) are best suited to tasks where the answer is in the text, i.e. summarization, translation, and answering questions.

          Asking it to answer questions on its own is much more prone to hallucination.

          To that end I've been using Llama 3 for summarizing transcripts of YouTube videos. It does a decent job, but... every single time (literally 100% of the time), it will hallucinate a random name for the speaker.* Every time! I thought it might be the system prompt, but there isn't one.

          My own prompt is just "{text}\n\n###\n\nPlease summarize the text above."

          If I ask it to summarize in bullet points, it doesn't do that.

          I'm assuming there was something in the (instruct) training data that strongly encourages that, i.e. a format of summaries beginning with the author's name? Seems sensible enough, but obviously backfires when there's literally no data and it just makes something up...

          *In videos where the speaker's name isn't in the transcript. If it's a popular field, it will often come up with something plausible (e.g. Andrew Ng for an AI talk.) If it's something more obscure, it'll dream up something completely random.

          • kiratp 10 days ago
            The technique to use is to give the model an “out” for the missing/negative case.

            "{text}\n\n###\n\nPlease summarize the text above. The text is a video transcript. It may not have the names of the speakers in it. If you need to refer to an unnamed speaker, call them Speaker_1, Speaker_2 and so on."

          • woodson 10 days ago
            Especially with small models, I had very bad results using them for translation. Even trying all kinds of tricks didn’t help (apparently prompting in the target language helps for some). Encoder-decoder models such as FLAN-T5 or MADLAD-400 seemed far superior at equal or even smaller model size.
            • andai 10 days ago
              I forget which model (LLaMA 3?) but I heard 95% of the training data was English.
          • Grimblewald 9 days ago
            for sure, so my use case for example is

            "using the following documentation to guide you {api documentation}, edit this code {relevant code}, with the following objective: Replace uses of {old API calls} in {some function} with with relevant functions from the supplied documentation"

            It mostly works, but if the context is a little too long, sometimes it will just spam the same umlaut or number (always umlauts or numbers) over and over, for example. Perhaps some fine-tuning of parameters like temp. or repetition penalty might fix it, time will tell.
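
            e.g. a rough llama-cpp-python sketch of the kind of settings I mean (parameter names from memory, not a tested fix):

                from llama_cpp import Llama

                # Hypothetical local setup: full-precision Llama 3 8B Instruct GGUF, 8k context.
                llm = Llama(model_path="llama-3-8b-instruct.f16.gguf", n_ctx=8192)

                prompt = open("prompt.txt").read()   # the documentation + code + objective prompt above

                out = llm(
                    prompt,
                    max_tokens=1024,
                    temperature=0.6,       # a bit lower than default
                    repeat_penalty=1.15,   # nudge against the repeated-token failure mode
                )
                print(out["choices"][0]["text"])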

            • andai 9 days ago
              Are you using ollama? Devs said there was a bug that occurs when context is full, they're working on it.
        • behohippy 10 days ago
          I had this same issue with incomplete answers on longer summarization tasks. If you ask it to "go on" it will produce a better completion, but I haven't seen this behaviour in any other model.
          • Grimblewald 9 days ago
            Neither have I. Still, the answers it does provide - despite a few hiccups - are truly outstanding. I am really impressed with this model, even with its issues. Though I am sure the issues, such as they are, are a month or two away from being fixed. For what it's worth, I haven't played as much with the bigger model, but it seems not to struggle with the same things, though take that with a grain of salt, it runs too slow on my hardware for me to rapidly test things.
      • IvanAchlaqullah 10 days ago
        > TruthfulQA

        Wait, people still use this benchmark? I hear there's a huge flaw in it.

        For example, fine-tuning the model on 4chan makes it score better on TruthfulQA. It becomes very offensive afterwards though, for obvious reasons. See GPT-4chan [1]

        [1] https://www.youtube.com/watch?v=efPrtcLdcdM

        • thomashop 10 days ago
          Couldn't it be that training it on 4chan makes it more truthful for some reason?
          • wongarsu 10 days ago
            Could it be that people who can talk anonymously with no reputation to gain or lose and no repercussions to fear actually score high on truthfulness? Could it be that truthfulness is actually completely unrelated to the offensiveness of the language used to signal in-group status?
            • cptcobalt 10 days ago
              This unironically feels like good research & paper potential.
        • andy99 10 days ago
          Not sure I understand your example? It's not an offensiveness benchmark, in fact I can imagine a model trained to be inoffensive would do worse on a truth benchmark. I wouldn't go so far as to say truthfulQA is actually testing how truthful a model is or its reasoning. But it's one of the least correlated with other benchmarks which makes it one of the most interesting. Much more so than running most other tests that are highly correlated with MMLU performance. https://twitter.com/gblazex/status/1746295870792847562
        • nurumaik 10 days ago
          >scores better

          >very offensive

          Any cons?

        • hoseja 10 days ago
          Looks like a good and useful benchmark.
        • andai 10 days ago
          "Omit that training data..."
    • HarHarVeryFunny 10 days ago
      The Chinchilla paper is about how to design/train a model to optimize use of computing power, which we can equate to cost (FLOPs cost dollars).

      The question that Chinchilla tries to answer is: for a given training budget (which you can think of as dollars or FLOPs), what is the optimal trade off of model size and quantity of training data to get the most performant model? Build a large model and train with less data, or build a smaller one and train with more data?

      However, another consideration is minimizing total lifetime cost of the model: training cost + inference cost. You could train a model for longer (costing more) in order to get a given level of performance from a smaller model that will be cheaper for inference, or vice versa. For any given projected model lifetime inference volume, there is going to be a different answer.

      It's not that Chinchilla-optimal models stopped making sense, but rather that this sort of consideration has people willing to pump more money (tokens) into smaller models to reduce inference cost for that level of capability.
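
      As a back-of-the-envelope illustration (my numbers, using the common compute ≈ 6 x params x tokens approximation, not anything from the Phi-3 paper):

          def training_flops(n_params: float, n_tokens: float) -> float:
              # Rough rule of thumb: ~6 FLOPs per parameter per training token.
              return 6 * n_params * n_tokens

          # Fix a budget equal to training Chinchilla itself (70B params, 1.4T tokens).
          budget = training_flops(70e9, 1.4e12)

          # Chinchilla-optimal sizing is roughly ~20 tokens per parameter,
          # i.e. an 8B model would "optimally" stop at ~0.16T tokens.
          # Spending the whole 70B-sized budget on the 8B model instead means:
          print(budget / (6 * 8e9) / 1e12)   # ~12.25T tokens of "overtraining"

      The overtrained 8B ends up weaker than the compute-optimal 70B at equal training cost, but inference cost scales roughly with parameter count, so it's far cheaper to serve - which is the trade-off described above.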

      • KuriousCat 10 days ago
        Does the paper assume uniform settings throughout the training phase? Or is it the bound no matter what training strategy is used, given the dataset?
        • HarHarVeryFunny 9 days ago
          They only experimented with different cosine learning rate decay schedules, but found results consistent across these, as well as across two different types of experiment where they either varied number of training tokens for a given model size, or varied model size for a given number of training FLOPs.
    • ankit219 10 days ago
      Not trying to disparage them, but their models always give the feeling that they are overfitted on benchmarks, hence they perform so well. On everyday tasks - chat or simple completion - they're much worse.

      Distilling can work and there are papers which suggest it does, but we still do not have a reliable mechanism which can distill knowledge from larger teacher models to smaller student models.

      • moffkalast 10 days ago
        This was the case for Phi-2, it was notoriously rubbish in practical use.
    • spmurrayzzz 10 days ago
      I don't think we can call it distillation, at least not in the conventional ML sense of the word, as you're not interacting with any of the actual model architecture - specifically, not computing the loss between the predictions of the parent model and the distilled target model.

      This is an important distinction when it comes to assessing model collapse risk, a risk I think has probably been overstated enough to this point that it's now being understated.
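
      For reference, distillation in the conventional sense trains the student against the teacher's full output distribution, roughly like this minimal sketch (illustrative PyTorch, not anything Phi actually does):

          import torch.nn.functional as F

          def distillation_loss(student_logits, teacher_logits, temperature=2.0):
              # Soften both distributions, then pull the student toward the
              # teacher's predictive distribution rather than its sampled text.
              t = temperature
              student_log_probs = F.log_softmax(student_logits / t, dim=-1)
              teacher_probs = F.softmax(teacher_logits / t, dim=-1)
              return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (t ** 2)

      The Phi recipe only ever sees GPT-4's sampled text, never its logits, which is why "distillation" is a loose analogy here.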

    • refulgentis 10 days ago
      Phi-2 wasn't chat/instruct tuned, so it didn't do well in chat; it was a base model. But the benchmark #s were real.
      • nl 10 days ago
        I had a lot of issues trying to get Phi-2 to perform as well as the benchmarks indicated on non-chat tasks.

        It felt a lot like it was overfitted to the exact type of tasks in the benchmarks (i.e., not a data leak), but if you tried something a bit off track it didn't know what to do. At the time my hypothesis was that the small model just didn't have the capacity to generalise well enough, but since then Gemma 2B has come out and seems to be ok.

        So now I have no idea why, but yes: the benchmarks for Phi-2 didn't represent how it worked for me on real world tasks where you'd expect it to be ok.

      • irjustin 10 days ago
        I'm pretty naive so please forgive me if it's a stupid question.

        To me, what the parent comment is saying is that even though the benchmarks are cool, it's not super helpful to the every day person. Because if you can't chat with it very well (even for a narrow context) what utility does it have with great benchmarks?

        • svnt 10 days ago
          Both are saying the same thing: in order for the base model that is phi to perform well as a chat agent, it would need to be tuned for that purpose before its benchmark results could have real-world value.
          • imjonse 10 days ago
            From this report (Phi-2 indeed was not instruct tuned):

            "Our models went through post-training with both supervised instruction fine-tuning, and preference tuning with DPO. We have worked on generating and curating various instruction and preference data. This has improved the model chat capabilities, robustness, as well as its safety."

  • oersted 10 days ago
    Incredible, rivals Llama 3 8B with 3.8B parameters after less than a week of release.

    And on LMSYS English, Llama 3 8B is on par with GPT-4 (not GPT-4-Turbo), as well as Mistral-Large.

    Source: https://chat.lmsys.org/?leaderboard (select English in the dropdown)

    So we now have an open-source LLM approximately equivalent in quality to GPT-4 that can run on phones? Kinda? Wild.

    (I'm sure there's a lot of nuance to it, for one these benchmarks are not so hard to game, we'll see how the dust settles, but still...)

    Phi-3-mini 3.8b: 71.2

    Phi-3-small 7b: 74.9

    Phi-3-medium 14b: 78.2

    Phi-2 2.7b: 58.8

    Mistral 7b: 61.0

    Gemma 7b: 62.0

    Llama-3-In 8b: 68.0

    Mixtral 8x7b: 69.9

    GPT-3.5 1106: 75.3

    (these are averages across all tasks for each model, but looking at individual scores shows a similar picture)

    • jxy 10 days ago
      This inductive logic is way overblown.

      > Incredible, beat Llama 3 8B with 3.8B parameters after less than a week of release.

      Judging by a single benchmark? Without even trying it out with real world usage?

      > And on LMSYS English, Llama 3 8B is on par with GPT-4 (not GPT-4-Turbo), as well as Mistral-Large.

      Any potential caveats in such a leaderboard notwithstanding, on that leaderboard alone there is a huge gap between Llama 3 8B and Mistral-Large, let alone any of the GPT-4 models.

      By the way, for beating benchmarks, "Pretraining on the Test Set Is All You Need".

      • oersted 10 days ago
        It's easy to miss: select English in the dropdown. The scores are quite different in Overall and in English for LMSYS.

        As I've stated in other comments, yeah... Agreed, I'm stretching it a bit. It's just that any indication of a 3.8B model being in the vicinity of GPT-4 is huge.

        I'm sure that when things are properly measured by third-parties it will show a more sober picture. But still, with good fine-tunes, we'll probably get close.

        It's a very significant demonstration of what could be possible soon.

        • saretup 10 days ago
          Firstly, English is a highly subjective category.

          Secondly, Llama 3 usually opens with sentences like ‘What a unique question!’ or ‘What an insightful thought’, which might make people like it more than the competition because of the pandering.

          While Llama 3 is singular in terms of size-to-quality ratio, calling the 8B model close to GPT-4 would be a stretch.

          • YetAnotherNick 10 days ago
            Yes, I don't know how people don't realize how much cheap tricks work in Chatbot Arena. A single base model can produce hundreds of Elo points of difference depending on the way it is tuned. And in most cases, instruction tuning even slightly decreases reasoning ability on standard benchmarks. You can see base models scoring better on MMLU/ARC most of the time on the Hugging Face leaderboard.

            Even GPT-4-1106 seems to only sound better than GPT-4-0613 and work for a wider range of prompts. But with a well-defined prompt and follow-up questions, I don't think there is an improvement in reasoning.

            • imtringued 10 days ago
              When I tried Phi-2 it was just bad. I don't know where you got this fantasy that people accept obviously wrong answers because of "pandering".
              • YetAnotherNick 10 days ago
                Obviously a correct answer matters more, but ~100-200 Elo points could be gained just from better writing. The answer itself would account for a range of ~500 Elo in comparison.
                • rgbrgb 10 days ago
                  > just for better writing

                  in my use cases, better writing makes a better answer

    • ignoramous 10 days ago
      > Phi-3-mini 3.8b: 71.2

      Per the paper, phi3-mini (which is English-only) quantised to 4-bit uses 1.8GB of RAM and outputs 1212 tokens/sec (correction: 12 tokens/sec) on iOS.
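
      The 1.8GB figure is roughly what you'd expect for the weights alone at 4 bits per parameter (my arithmetic, not from the paper):

          params = 3.8e9            # phi-3-mini
          bits_per_weight = 4       # 4-bit quantization
          print(params * bits_per_weight / 8 / 2**30)   # ~1.77 GiB, before KV cache etc.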

      A model on par with GPT-3.5 running on phones!

      (weights haven't been released, though)

      • coder543 10 days ago
        > (weights haven't been released, though)

        Phi-1, Phi-1.5, and Phi-2 have all had their weights released, and those weights are available under the MIT License.

        Hopefully Microsoft will continue that trend with Phi-3.

        > outputs 1212 tokens/sec on iOS

        I think you meant "12 tokens/sec", which is still nice, just a little less exciting than a kilotoken/sec.

    • karmasimida 10 days ago
      Where did you get this from?

      > So we now have an open-source LLM approximately equivalent in quality to GPT-4 that can run on phones

      No, not even close... Even Gemini has a huge UX gap compared to GPT-4/Opus; for an 8B model I won't even attempt this argument.

    • alecco 10 days ago
      At a glance, it looks like Phi-3 was trained on an English only, STEM-strong dataset. See how they are not as strong in HumanEval, Trivia, etc. But of course it's very good.
    • crakenzak 10 days ago
      Can’t wait to see some Phi-3 fine tunes! Will be testing this out locally, such a small model that I can run it without quantization.

      Feels incredible to be living in a time with such breakneck innovation. What are chances we’ll have a <100B parameter GPT4/Claude Opus model in the next 5 years?

      • nl 10 days ago
        > What are chances we’ll have a <100B parameter GPT4/Claude Opus model in the next 5 years?

        In 5 years time we'll have adaptive compute and the idea of talking about the parameter count of a model will seem as quaint as talking about the cylinder capacity of a jet engine.

      • regularfry 10 days ago
        It feels like it's going to be closer than that. People always forget that GPT4 and Opus have the advantage of behind-the-curtain tool use that you just can't see, so you don't know how much of a knowledge or reasoning leg-up they're getting from their internal tooling ecosystem. They're not really directly comparable to a raw LLM downloaded from HF.

        What we need is a standardised open harness for open source LLMs to sit in that gives them both access to tools and the ability to write their own, and that's (comparatively speaking) a much easier job than training up another raw frontier LLM: it's just code, and they can write a lot of it.

      • Deverauxi 10 days ago
        5 years? 5 years is a millennium these days.

        We’ll have small local models beating gpt-4/Claude opus in 2024. We already have sub 100b models trading blows with former gpt-4 models, and the future is racing toward us. All these little breakthroughs are piling up.

        • refulgentis 10 days ago
          Absolutely not on the first one. Not even close.
          • ashirviskas 10 days ago
            Why not? There's still 7 months left for breakthroughs.
            • refulgentis 10 days ago
              "Small" leaves wiggle room, but it's extremely unlikely that traditionally small models, <= 7B, will get there this year even on these evals.

              UX matching is a whole different matter and needs a lot of work: I've worked heavily with Llama 8B over the last few days, and Phi-3 today, and the Q&A benchmarks don't tell the full story. E.g. it's nigh impossible to get Llama _70_B to answer in JSON; when Phi sees RAG from search it goes off inventing new RAG material and a new question.

      • bugglebeetle 10 days ago
        We already do. It’s called LLama 3 70B Instruct.
        • vitorgrs 10 days ago
          Llama 3 is awful in non-English. 95% of their training data is in English....

          GPT is still the king when talking about multiple languages/knowledge.

      • stavros 10 days ago
        Is it released?
    • viraptor 10 days ago
      On par in some categories. Phi was intended for reasoning, not storing data, due to its small size. I mean, it's still great, but the smaller it gets, the more facts from outside the prompt's context will not be known at all.
      • candiodari 10 days ago
        I wonder if that's a positive or negative. How does it affect hallucinations?
        • viraptor 10 days ago
          It depends what you want to do. If you want a chat bot that can replace most Google queries, you want as much learned data as possible and the whole Wikipedia consumed. If you want a RAG style system, you want good reasoning about the context and minimal-or-no references to extra information. It's neither positive nor negative without a specific use case.
    • blackeyeblitzar 10 days ago
      It’s not open source, but is open weight - like distributing a precompiled executable. In particular, what makes it open weights rather than just weights-available is that it is licensed under an OSI-approved license (MIT) rather than a restrictive proprietary license.

      I really wish these companies would release the training source, evaluation suites, and code used to curate/filter training data (since safety efforts can lead to biases). Ideally they would also share the training data but that may not be fully possible due to licensing.

    • moralestapia 10 days ago
      >And on LMSYS English, Llama 3 8B is well above GPT-4

      Source?

      • oersted 10 days ago
        Right thanks for the reminder, I added it
        • moralestapia 10 days ago
          Thanks, I don't see them being "well above GPT-4", merely 1 point? Also, no idea why one would want to exclude GPT-4-Turbo, the flagship "GPT-4" model, but w/e.

          I also don't think they "beat Llama 3 8B"; their own abstract says "rivals that of models such as Mixtral 8x7B and GPT-3.5", "rivals" not even "beats".

          Great model, but let's not overplay it.

          • oersted 10 days ago
            In the English category: GPT-4-0314 (ELO 1166), Llama 3 8B Instruct (ELO 1161), Mistral-Large-2402 (ELO 1151), GPT-4-0613 (ELO 1148).

            You are right, I toned down the language, I got a bit overexcited, and I missed the difference in the versions of GPT-4. And LMSYS is a subjective benchmark for what users prefer, which I'm sure has weird inherent biases.

            It's just that any signal of a 3.8B model being anywhere in the vicinity of GPT-4 is huge.

            • moralestapia 10 days ago
              Yeah, GPT3.5, in a phone, at ~1,000 tokens/sec ... nice!
              • mlyle 10 days ago
                > at ~1,000 tokens/sec

                12 tokens per second.

                • moralestapia 10 days ago
                  Whoops, made the same mistake as @ignoramous :P
    • zone411 10 days ago
      > So we now have an open-source LLM approximately equivalent in quality to GPT-4 that can run on phones?

      No, we don't. LMsys is just one, very flawed benchmark.

      • ukuina 10 days ago
        Why is LMsys flawed?

        Many people treat LMsys as gospel because it's the only large-scale, up-to-date qualitative benchmark. All the numeric benchmarks seem to miss real-world applicability.

      • oersted 10 days ago
        Agreed, but it's wild that even one benchmark shows this. Based on what we knew just a few months ago, these models should be so far from each other in every benchmark.
    • infecto 10 days ago
      "But still"? Lets be realistic, all of these benchmark scores are absolute garbage. Yes, the open source community is making great strides, they are getting closer but the gap is still wide when comparing to commercially available models.
  • visarga 10 days ago
    This shows the power of synthetic content - 3.3 trillion tokens! This approach can make a model even smaller and more efficient than organic text training, and it will not be able to regurgitate NYT articles because it hasn't seen any of them. This is how copyright infringement claims can be sidestepped.
  • pkoiralap 10 days ago
    They have started putting some models in huggingface: https://huggingface.co/collections/microsoft/phi-3-6626e15e9...
    • Patrick_Devine 10 days ago
      And of course if you want to try it out locally, `ollama run phi3`.
    • minimaxir 10 days ago
      And with a MIT license!
  • mythz 10 days ago
    I'll believe it when I try it for myself; Phi-2 was the clear worst of the 20 LLMs we evaluated (it was also the smallest, so that was expected).

    But it was slow for its size, generated the longest responses with the most hallucinations, as well as the most empty responses. It was also the model ranked with the lowest quality answers.

  • ein0p 10 days ago
    Tried it: as soon as you ask something outside the head of the likely training data distribution it starts hallucinating like crazy. This isn’t surprising to me as a researcher: you need the associative memories of a larger model to cover the tail with at least something. That said, it’ll likely work well at specific narrow tasks once fine tuned. Just don’t expect it to really “beat GPT-3.5” at the general chat use case
  • brcmthrowaway 10 days ago
    If I was Apple I'd be quaking in my boots. They are getting too far behind to ever catch up. Nokia in 2010 vibes.
    • PedroBatista 10 days ago
      They'll just do what they have been doing for ~20 years: wait, pick the "winner", polish the "user experience", call it Apple magic, and incorporate that into their product cycles.

      Some day their playbook will become so mediocre it won't stick anymore, but I think they are safe on this one, for now...

      • mirekrusin 10 days ago
        Considering that experiments cost tens to hundreds of millions of dollars a pop, this may not be that bad a strategy.
      • fauigerzigerk 10 days ago
        True for hardware, but their record on software is far less convincing.
        • sroussey 10 days ago
          I expect apple to have local LLMs in hardware in five years or less.
      • dkarras 10 days ago
        There is enormous value in polishing the user experience, especially if no one else is doing it (or maybe capable of doing it?). It will never get old as long as they are the only ones doing it.
    • vessenes 10 days ago
      I don't think MS has a special sauce here, just a willingness to publish. To the extent MS has disclosed the bulk of what they are doing with Phi, it's a combination of a really nice initial idea - "use written texts + GPT-4 to generate high quality prompts where we know the answer is great because it's written down" - and engineering.

      To me this is advancing the state of the art as to the impact of data quality, but it doesn't look to me like the Phi series has some magical special sauce otherwise. Data quality and synthetic data creation are not magical moats that Apple can't cross.

      I'll say too that I'm psyched to try Phi-3; the sweet spot for me is a model that can be a local coding assistant and still answer random Q&A questions with some sophistication. I'm skeptical that 3-8B parameter models will bring the high level of sophistication sometimes needed in this cycle; there's still a very large gap with the larger models in daily use, despite some often close benchmark scores.

      Anyway, Apple-Phi-3 is in no way an impossibility.

    • bt1a 10 days ago
      I tore my hair out developing a SwiftUI app that could run llama.cpp and whisper.cpp simultaneously. Was able to run a Q3_K Mistral 7B along with a smaller whisper model eventually, but grinding through Xcode is a nightmare.

      They're working on MLX but it only recently got Swift bindings. They just don't have the DEVELOPERS DEVELOPERS DEVELOPERS coked-out attitude I guess.

    • esafak 10 days ago
      Did they ever claim to be a powerhouse in foundation models? Did your MacBook or iPhone become obsolete or stop working? They use the models, they don't release them because they don't hoard data.
    • WanderPanda 10 days ago
      The opposite is the case: with all the advancements, even by doing nothing, Apple (like everyone, including hobbyists) is moving closer to the frontier. Hopefully this trend stays alive!
    • oersted 10 days ago
      If anything this is good for them. Apple's play here has always been getting their devices ready for running LLMs locally. This makes it way easier.
    • Deverauxi 10 days ago
      They have something like 140 billion dollars in cash.

      They’ll be fine.

    • IncreasePosts 10 days ago
      How exactly does publicized research lead to them not being able to catch up? I don't think anything in this paper is patentable.
    • moralestapia 10 days ago
      I don't recall Nokia being a 3 trillion dollar company. Your vibes may vary, though.
    • astrange 10 days ago
      I think that when people release new interesting software products it's good for hardware companies.
    • thoughtegting 10 days ago
      If I were Apple, I would be developing something in total secrecy and then release something ahead of the rest of the competition when people least expect it. Very big ifs, but Siri can be updated everywhere overnight and I don't see them rushing into anything like this.
      • golergka 10 days ago
        If I were Apple, I would just buy one of the major LLM companies. They have the cash.
        • bingbingbing777 10 days ago
          They've been buying AI companies and have nothing to show for it.
          • golergka 10 days ago
            Showing off work in progress is not really their thing.
          • elbear 10 days ago
            Why do you think that is? Do you think their culture is an obstacle or is it something else?
    • seydor 10 days ago
      Apple's advantage is that their devices are safeguarding people from the dangers of AI
      • talldayo 10 days ago
        That's a very eloquent variation on the word "censorship"

        Are you next going to tell us that the CIA's access to iCloud data protects their users from terrorism too?

      • fauigerzigerk 10 days ago
        How so? And what dangers?
    • ec109685 10 days ago
      Eh, I think it’s showing that this class of model is becoming commoditized given there is a new one launching every week.
  • m3kw9 10 days ago
    Phi-2 was useless for practical purposes, except if you want to show your friends that it can write a poem. Llama 3 8B was slightly better but is still in the same category; it's complete trash at coding vs GPT-4. Llama 3 400B "iS OPen SoURce!", but no, you will need to pay for access because most people cannot practically afford an A100 and set it up properly.

    What I'm trying to say is that user experience is now as key as model smarts, and these models that barely touch GPT-4 cannot beat OpenAI right now as a whole package.

    • azinman2 10 days ago
      I just gave GPT-4 a scrape of a menu page and asked it to reformat it as CSV. It hallucinated several parts and missed others. Llama3-70b hasn't done that. So far it's been more reliable. You can run a quantized version on consumer hardware or pay significantly less ($10 vs $1 in, $30 vs $1 out) on hosted platforms.
  • abidlabs 10 days ago
    Hugging Face Paper Page and Discussion: https://huggingface.co/papers/2404.14219
    • whereismyacc 10 days ago
      they're just spamming "weights or it didn't happen"

      i mean, fair

  • blackoil 10 days ago
    Has anyone used these/similar with fine tune and RAG? How is the performance over a narrow domain for simple queries? Is it good enough for say an informational chat bot?
  • anticensor 9 days ago
    This paper broke ArXiv's HTML generator: https://github.com/arXiv/html_feedback/issues/1090
  • ur-whale 10 days ago
    That's a whole lot of Zhangs!
  • smartmic 10 days ago
    Hm, roundabout 84 authors on one "scientific" paper. I wonder if this says something about (a) the quality of its content, (b) the path academic (?) paper publishing is going down, (c) nothing at all, or (d) something else entirely.
    • lysecret 10 days ago
      It just means you need a big machine and a lot of capital to make advances. Take a look at any paper coming out of CERN.
    • 0cf8612b2e1e 10 days ago
      You should see physics. Stuff involving the Large Hadron Collider can have pages of authors.

      It costs so little to share the credit if someone was an asset.

    • a_bonobo 10 days ago
      I have been on far larger author lists :) There's probably a whole team for the training data generation and assessment, a whole team for the safety assessment (section 4), that stuff adds up.
    • samus 10 days ago
      It's a tech report. Fair enough to include the whole lab.
  • simonw 10 days ago
    I'm getting a bit skeptical of MMLU at this point. As far as I can tell it's a set of multiple choice questions that hasn't been updated since 2020. We have to trust the model providers not to deliberately or accidentally train on it for those scores to be useful.
    • minimaxir 10 days ago
      At the least, there are multiple benchmarks noted in the paper (21!) and the results are consistent across all of them.

      I'd trust Microsoft to do decontamination testing, although the paper doesn't explicitly mention it other than "The prompts and number of shots are part of a Microsoft internal tool to evaluate language models, and in particular we did no optimization to the pipeline for the phi-3 models."

  • Havoc 10 days ago
    Both previous Phis have been epic letdowns when I actually tried them myself, so I have quite low confidence in this being reflective of the real world. Will try it anyway though.
    • imjonse 10 days ago
      Phi-3 is instruct tuned though so hopefully better.
      • Havoc 9 days ago
        Yeah initial testing is looking promising
  • homarp 10 days ago
  • hackerlight 10 days ago
    Fewer tokens than Llama 3 (3.3T vs 15T) yet a better outcome. No doubt more information-dense training data. The interesting thing is the use of synthetic data, which they don't talk about.
    • vessenes 10 days ago
      Actually the original Phi papers did talk about their synthetic data strategy, and it's very cool -- essentially invert high quality textbook text using GPT-4 to create prompts, where the textbooks supply the answers. There may be more undisclosed, but it remains in my mind as one of the best ideas of the last twelve months -- so smart, and interesting, and apparently, it works well.
      • YetAnotherNick 10 days ago
        No, they don't use textbook text at all, despite the paper title. They just asked GPT-4 to generate "textbook quality" content, which doesn't even exactly look like a textbook.
      • astrange 10 days ago
        I feel like literal dictionaries would make good training data; wonder if any of them have done that. LLMs are good at faking so it's hard to tell by asking them.
      • torginus 10 days ago
        Except everything that comes out of an LLM (like GPT4) is highly suspect (at least in my experience).
        • samus 10 days ago
          1. They need it for style and language, not necessarily for the facts

          2. Since GPT-4 is seen as the very best general-purpose LLM in existence, it makes sense to emulate its performance with fewer resources.

          3. Phi models are also trained with other high-quality data

      • xarope 10 days ago
        perhaps that's the best path forward? Text and reference books (hopefully unbiased) for answers, and web scraped data for conversational tone.
    • minimaxir 10 days ago
      Yes, "chinchilla optimal" is a meme, but 15T might turn out to be too many tokens.
      • wrsh07 10 days ago
        My understanding from this tweet thread [1] is that chinchilla probably overspecified some of the hyperparameters to the model

        tl;dr I'm looking forward to having lots of models trained with a wide range of parameters to narrow down "what is actually optimal"

        I think there is an interesting tradeoff of data quality and data volume, though

        (Eg if we train with the highest quality 10% of our data, does the model improve if we use the other 90%? What if we increase our data size by 10x?)

        [1] https://twitter.com/tamaybes/status/1780639257389904013

  • maximsicora 10 days ago
    insane