I'm a complete newb when it comes to AI, and I am getting pretty ashamed of it too. How do I take a model like this and use it in my day to day? Can I somehow use it in, say, VSCode? How do I point it at my code base, and use it to help me write new code?
You run most of these models in something that wraps them in an HTTP API. I use Ollama, which I think is the most popular, but I'm not in a great position to judge. My impression is that it handles running models on CPU better than the alternatives.
So you’d basically install Ollama, download one of the versions of this model off HuggingFace, create a Modelfile since this isn’t in the default Ollama repo, and then Ollama can answer prompts with the model. Modelfiles are very simple, based on Dockerfiles. It takes like 15 seconds to make one if you aren’t messing with the various parameters.
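To make that concrete, a minimal Modelfile is basically just a pointer at the GGUF you downloaded; something like this (the filename and temperature here are just examples):

    # Modelfile: point Ollama at a local GGUF file (filename is illustrative)
    FROM ./granite-8b-code-instruct.Q4_K_M.gguf

    # optional: nudge it toward more deterministic completions
    PARAMETER temperature 0.2

Then ollama create granite-code -f Modelfile registers it, and ollama run granite-code lets you prompt it.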
Once it’s in Ollama, just get one of the various GPT plugins for VSCode and give it the Ollama URL (http://localhost:11434 by default). I use continue.dev but there are many.
Continue takes over the tab autocomplete with the LLM, and has a chat window on the right where you can use keyboard shortcuts to copy code into the prompt and ask it to edit/generate code or ask questions about existing code.
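If you want to sanity-check the server before pointing an editor at it, you can hit the API directly with curl; a quick sketch, assuming you named the model granite-code as above:

    curl http://localhost:11434/api/generate -d '{
      "model": "granite-code",
      "prompt": "Write a binary search in Python",
      "stream": false
    }'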
Thank you so much! That sounds surprisingly straightforward. I expected a lot more fiddling to get going.
Where would I start if I wanted to use a model programmatically? Like, let's say I am building a chat bot. I have a large data set of replies I want the model to mimic, and I'd want to do this in Python. Of course, I'd probably use a different model than Granite.
This is stretching my own knowledge, so if someone else knowledgeable wants to take a stab here I would appreciate a response as well!
Before doing that, I would start basic. Pull llama3 and see what it does with your prompts. You may be surprised how much is already in there, and you may not need to involve your own data at all. If that doesn't work, check HuggingFace to see if someone has already made a model/finetune/LoRA for what you're trying to do. There are many; e.g. I found a Magic: The Gathering rules model the other day.
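As for the Python side of your question: the simplest route I know of is the ollama Python package (pip install ollama), which just wraps the same local HTTP API. A minimal sketch, assuming you've already pulled llama3:

    import ollama

    # Send a one-turn conversation to the local Ollama server and print the reply
    response = ollama.chat(
        model="llama3",
        messages=[{"role": "user", "content": "Reply in a pirate voice: how do I reverse a list in Python?"}],
    )
    print(response["message"]["content"])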
If those fail, or you just want to play with your own data, you'll need to figure out what "mimic" means.
If the model does okay at generating content but the content is factually wrong or missing background, you may be able to just do RAG (retrieval augmented generation). Basically you run your documents through a model that converts them to embeddings (some kind of vector; I don't understand how they work). Then when you run a query, you search for related embeddings and pass them to the model so that it "knows" the content that was in the documents. This is the easiest option; open-webui (the Ollama web chat interface) has some RAG support. Danswer is open source and built from the ground up to do RAG, with built-in support for ingesting from Slack, Drive, etc, etc. OpenAI also offers embeddings as a service.
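To make that concrete, here's a toy sketch of the retrieval half: embed the documents, find the ones nearest the query, and stuff them into the prompt. The embedding model and the final ollama.chat call are assumptions on my part, not how open-webui or Danswer actually do it:

    import numpy as np
    import ollama
    from sentence_transformers import SentenceTransformer

    docs = [
        "Refunds are available within 30 days of purchase.",
        "Support hours are 9am to 5pm EST, Monday through Friday.",
        "Granite models were released by IBM under Apache 2.0.",
    ]

    # Embed every document once, up front
    embedder = SentenceTransformer("all-MiniLM-L6-v2")
    doc_vecs = embedder.encode(docs, normalize_embeddings=True)

    def retrieve(query, k=2):
        # With normalized vectors, cosine similarity is just a dot product
        q = embedder.encode([query], normalize_embeddings=True)[0]
        scores = doc_vecs @ q
        return [docs[i] for i in np.argsort(scores)[::-1][:k]]

    question = "When can I get a refund?"
    context = "\n".join(retrieve(question))

    # Pass the retrieved text to the model so it "knows" the documents
    reply = ollama.chat(model="llama3", messages=[
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ])
    print(reply["message"]["content"])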
A step up from that is making a LoRA. To my novice eyes, LoRAs are basically a diff of the model's parameters or weights. So rather than training a whole new model, you just add deltas to an existing one. These let you "teach" the model something while preserving the base generation capabilities of the underlying model. I.e. you won't have to worry about feeding it enough data that it can speak English properly, because it gets that from the base model; you only have to give it enough data to speak about whatever you're training it on.
If that doesn’t make any sense, go check CivitAI for Stable Diffusion (image model) LoRAs. The effects are way more obvious on image AIs.
Anyways, LoRAs are trained, so you're into training there. I think HuggingFace has tools that make this easy, but I don't know enough to say anything with confidence.
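The relevant HuggingFace tool is the peft library; as far as I understand it, attaching a LoRA looks roughly like this (the base model and target_modules are assumptions and vary per architecture):

    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")

    # Only small low-rank "delta" matrices get trained; the base weights stay frozen
    config = LoraConfig(
        r=8,                                  # rank of the delta matrices
        lora_alpha=16,                        # scaling factor for the deltas
        target_modules=["q_proj", "v_proj"],  # which layers get adapters
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(base, config)
    model.print_trainable_parameters()  # usually well under 1% of the base model

From there you'd run a normal training loop (e.g. transformers' Trainer) over your own data.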
The last option, which you almost certainly don't want, is to train a new base model like llama3. You're starting from 0 there; you have no existing model, so you will have to teach it everything. It will take a ton of data, it will take forever to train, and it will likely be much worse than even a model picked at random on HuggingFace. Meta has spent who knows how much on Llama and it still hallucinates.
If you end up training, you'll probably end up doing it in the cloud unless you have tons of VRAM doing nothing. Prices are pretty reasonable; I think A100s are around $2/hr. I don't know how to gauge how long it needs to train, but I believe it's related to the amount of data you're training on. I believe it's pretty reasonable for LoRAs though; I'm guesstimating the $20-ish range, i.e. around 10 GPU-hours at that $2/hr rate?
Edit: oh, and I’m not affiliated in any way, but I found out last night that Fireworks’ new function calling model is free while it’s in beta, which is a neat/fun thing to play with. https://fireworks.ai/blog/firefunction-v1-gpt-4-level-functi... it’s also open weights if you want to run it locally, but it’s a 40B model so I can’t on my 3060
https://github.com/TabbyML/tabby can run self-hosted AI coding assistants. I tried it a while ago and it worked with Nvim pretty easily. There is a VS code extension too. The extension will just sort of "read" with you and provide suggestions from time to time. Anytime the suggestion is good you can press some key (<TAB> by default) to accept it. It's basically autocomplete on steroids.
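From what I remember of their README, getting Tabby running is a single Docker command, roughly like this (the model choice is just an example, and you can drop --device cuda to run on CPU):

    docker run -it --gpus all -p 8080:8080 -v $HOME/.tabby:/data \
      tabbyml/tabby serve --model StarCoder-1B --device cuda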
If you like Emacs (I use both Emacs and VSCode, for slightly different coding use cases), then the Emacs ellama [1] package is very nice. It is set up out of the box to use Ollama and to use M-x commands for code completion, summarization, and dozens of other useful functions. I love it; your mileage may vary.
[1] https://github.com/s-kostyaev/ellama
Based on their own numbers, the 8B seems decent, but the 34B doesn't seem worth it compared to general-purpose trained models, even on specific tasks. Which is an interesting result.
> Our process to prepare code pretraining data involves several stages. First, we collect a combination of publicly available datasets (e.g., GitHub Code Clean, Starcoder data), public code repositories, and issues from GitHub
I wonder why companies like IBM are jumping on the LLM bandwagon and training/releasing models that have no chance of competing with Llama/Mistral. To me it just looks like a complete waste of $$, because nobody will use them in any serious scenarios.
IBM made $60 billion in revenue last year. Where do you think it all came from? The same companies/governments that buy their overpriced crap are going to buy these new LLMs as well.
Their customers aren't going to build their own RAG and agent frameworks, vector DBs, data ingest pipelines, finetunes, high-scale inference serving solutions, etc, etc. There's an incredible amount of stuff to buy.
Right, but they can just use Llama/Mistral for free, instead of their inferior models, which I'm sure take quite a bit of resources to train in the first place.
Enterprises think differently. They want data provenance, privacy, ability to mitigate/transfer risk etc. If IBM is willing to offer that, there will be enterprises that bite.
IBM goes to great lengths to train models on clean data that has a lower risk of copyright or legal issues attached. Just take a look at the model description.
That data issue is important enough for some companies to pick a mediocre model over Llama or Mistral.
What if I told you that a lot of freely licensed code on GitHub is not clean? That the authors may have read something and rewritten it in a way that wasn’t transformative? So it basically has the same problems.
What if I told you the supposedly clean "The Stack" dataset contains at least one GPL repository inside, just because their license detection tool bugged out?
IBM and other big players are vigilant about these things, and this is what companies pay for.
Their software may not be better on some metrics, but it's cleaner on others, and their support contracts allow people to sleep tight at night.
This is what money buys. Peace of mind and continuity.
Indemnity is moving the goal posts, no? So you’re conceding that their data isn’t clean. But they say it’s clean.
This support contract stuff: what are you talking about? You download these models, you use them. What would you pay for? It's not clean data, they say it's clean: why would I pay liars? Let's game out the indemnity idea. I pay $10k/mo for 12 months. Then OpenAI loses v. NYTimes, it's ruled that LLM training is not fair use and needs express permission, and IBM pulls the models. What the hell did I pay $120k for? And by the way, you can pay a law student one beer to tell you OpenAI is going to lose because of Warhol v. Goldsmith. You can do whatever you want with your money, but I personally would not waste it on worthless indemnity.
First of all, "The Stack" is the dataset that models like StarCoder are trained on. I don't know what the data source for the IBM Granite family is.
I know the Stack is not clean, because they included my fork of GDM's greeter, which is GPL licensed.
My words about IBM were general. I can't tell anything about their models, because I didn't see any mention of "The Stack", and I don't know what their models are based on.
On the other hand, in my experience IBM doesn't like risk, so they would play it way safer than other companies.
If their data is not clean to begin with, then shame on them, and I hope their AI efforts burn to the ground.
BTW, LLM training is not fair use. For a start, fair use's definition automatically excludes "for profit" usage. Just because OpenAI has a non-profit part, and the training was done there, doesn't make them immune to the consequences of their for-profit operations.
Agreed. For example their research lab in Zurich has been absolutely world-leading in things like atomic force microscopy (AFM) for four decades, including the Nobel Prizes in Physics in 1986 (for the scanning tunneling microscope, the AFM's predecessor) and 1987 (high-temperature superconductivity). They also invented things like trellis coding and Token Ring.
PALO ALTO, Calif. – At the Hot Chips trade show, IBM defined a new interface for the 2020 version of its Power9 CPUs. The Open Memory Interface (OMI) will enable packing more main memory onto a server at higher bandwidth than DDR, and as a potential JEDEC standard could rival Gen-Z and Intel's CXL.
OMI basically removes the memory controller from the host, relying instead on a controller on a relatively small DIMM card. Microchip's Microsemi division already has a DDR controller running on cards in IBM's labs. The approach promises to deliver up to 4 TB of memory on a server at about 320 GB/s, or 512 GB at up to 650 GB/s sustained.
IBM doesn't have fabs, but they still do R&D into semiconductors that very much target future commercial processes. They do a fair bit on quantum computing too, to name just a couple of things.
“The South Korean technology giant Samsung Electronics was awarded a total of 6,165 United States patents in 2023, the most of any company. Qualcomm ranked second among companies, with 3,854 U.S. patents granted, followed by the likes of Taiwan Semiconductor Manufacturing Company and IBM.” — https://www.statista.com/statistics/274825/companies-with-th...
Those mainframes are actually pretty modern and interesting.
If IBM split off half of their mainframe division and let some competition get going, I think the segment could actually be something to contend with.
The basic idea of the IBM mainframe is almost perfect for what a lot of companies actually need (massively reliable hardware to support lots of middling software; most work is shunting data around) but everyone knows they're going to get locked into IBM.
On the contrary, the maintenance and continued improvement of an entire ISA and ISA-specific operating systems is exactly my idea of hardcore tech: continuing to pay a chip org to design new chips for said ISA every generation and implement new instructions, and continuing to pay OS and compiler programmers to work those into their OSes and compilers. I'm not sure where we draw the line on maintenance vs. continued development here, but I'm not sure I'd call that purely maintenance.
There really aren't a lot of companies out there that can claim to do similar (and of course besides s390x, an ancient and venerable CISC, IBM also has Power, so they are doing this 2x over). You'll find a lot of IBM employees contributing to what I'd consider "hardcore" tech like LLVM and the Linux kernel as a result, because they genuinely have a large amount of expertise in those and similar areas. And here I'm not even really including Red Hat, but if you include them then they are even more overweight in the hardcore tech category.
If anything, a lot of the rest of the tech industry has left "hardcore tech" behind due to efficiency concerns, as a result of a long-running, industry-wide process of consolidation and commodification that IBM has resisted for obvious reasons. IBM is hardcore to a fault, if anything.
TLDR: I actually think IBM punches above their weight in the "hardcore tech" area, so long as our definition is sufficiently low-level. If it includes, say, cloud services, then fair enough, you can probably fairly say they suck at that.
Here I've also chosen to entirely ignore IBM research.
When they pitch potential clients for their services, the slides on LLMs, AI, ML, etc. must show their own models. Whether they actually use them in the services does not matter. These are like the side projects that service companies release to help them close clients.
Same reason they jumped on the cloud bandwagon: it's the kind of offering you're expected to have when you're a company like that. Huge size, leading research departments, big enterprise customers.
They've been doing "AI" for ages. Notably Watson over the last couple of decades or so.
>I wonder why companies like IBM are jumping on the LLM bandwagon and training/releasing models that have no chance of competing with Llama/Mistral
Did you even read the benchmarks they post on that link? Assuming they're not outright lying, their 8B model is superior to Llama/Mistral models of the same size for coding tasks.
On the other hand, I spend my time wondering why people like you think someone should just throw away their ideas simply because there's already someone in the niche.
If you'd rather skip Ollama, llama.cpp's built-in server is here: https://github.com/ggerganov/llama.cpp/tree/master/examples/...
And you can search for any GGUF on HuggingFace to run with it.
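Launching it looks roughly like this (filename and flags from memory, so treat this as a sketch):

    # serve a local GGUF over HTTP on port 8080
    ./server -m ./granite-8b-code-instruct.Q4_K_M.gguf -c 4096 --port 8080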
> their support contracts allow people to sleep tight at night
And more importantly, IBM will guarantee it in the case that they're wrong. _That's_ what companies pay for.
So will OpenAI, according to Sam Altman [0]. Can they be trusted?
IBM has proven itself in various ways over the years, OpenAI hasn't.
While IBM is a behemoth of a money making machine, they put money where their mouth is. OpenAI does not.
So I'll trust IBM, but not OpenAI.
[0]: https://youtu.be/z8VhNF_0I5c
There will be a market for their services. Maybe a different one, but there will be.
> Nobody ever got fired for buying IBM
The gist is still current, but these days you'd fill in AWS as the uncontroversial choice.
But I’m pretty sure both models have “we’re not responsible” clauses.
> LLM training is not fair use
Citation needed.
All I've seen from them in my professional experience is actually legacy mainframe maintenance... Not shovelware, but very far from hardcore tech.
https://research.ibm.com/blog/albany-semiconductor-research-... etc
https://research.ibm.com/blog/ibm-molecule-generation-experi...
https://www.smithsonianmag.com/smart-news/ibm-engineers-push...
I've not seen any proper evaluations for Granite against, say, Llama or Mistral.
Until we do, it's probably too early to say they can't compete, at least in some areas where others perform poorly.
Previous Granite models were on the level of the first Llama in my benchmarks.
I'm expecting this version to be roughly comparable to Llama 2.