In the rapidly evolving world of artificial intelligence (AI), I find that there is a lot of confusion about the "kinds" of Large Language Models (LLMs) that are available and how model families evolve. Which model is best for your company? Should I use an instruct, chat, base, GGUF, or merged model? What are all these weird names and what do they mean?
To illustrate this confusion, look at the current number of downloads for Meta's LLaMA 3 models on Hugging Face (at the time of writing this blog post):
As you can see, both the "Meta-Llama-3-8B" base model and the "Instruct" version of this model have over 2M downloads. To me, this represents confusion. The base model will perform much worse than the "Instruct" model for most people unless you are fine-tuning, and there just aren't 2M people fine-tuning (i.e., the number of people just doing inference is much, much higher).
If you are helping to bring AI into your organization, it is crucial to understand the current landscape and make informed decisions about which models to integrate into your infrastructure. This post will break down the LLM ecosystem and give you a mental model to help you make sane decisions.
To navigate the LLM landscape effectively, it's important to understand the three main categories of models available:
Each category has its strengths and use cases, and the choice depends on your specific needs, resources, and risk tolerance. You will find models fitting categories #2 and #3 on Hugging Face, similar to how you will find open source code on GitHub. Model releases on Hugging Face include model parameter files, configuration, etc. that allow the models to be utilized and fine-tuned using common tooling (like the transformers package from Hugging Face).
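To make this concrete, here is a toy sketch of the kinds of files a typical model release on Hugging Face contains. The file names below are illustrative (exact names and formats vary by model, and weights may be sharded across multiple files):

```python
# Illustrative only: the kinds of files you will typically see in a
# Hugging Face model release. Exact file names vary by model and format.
TYPICAL_RELEASE_FILES = {
    "config.json",             # model architecture / hyperparameters
    "model.safetensors",       # model parameters (may be sharded)
    "tokenizer.json",          # tokenizer vocabulary and rules
    "tokenizer_config.json",   # tokenizer settings
    "generation_config.json",  # default generation settings
}

def looks_like_transformers_release(files):
    """Rough check: does a file listing contain weights + config + tokenizer?"""
    has_config = "config.json" in files
    has_weights = any(f.endswith((".safetensors", ".bin")) for f in files)
    has_tokenizer = any(f.startswith("tokenizer") for f in files)
    return has_config and has_weights and has_tokenizer

print(looks_like_transformers_release(TYPICAL_RELEASE_FILES))  # True
```

It is exactly this bundle of parameters plus configuration that lets common tooling load and fine-tune the model.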
Also similar to code on GitHub, you will find that open models released on Hugging Face are licensed in a wide variety of ways. Look at the range of licenses here:
Some of these licenses are code licenses (e.g., Apache 2.0), some are data licenses (e.g., CC-*), and some are one-off licenses. This mix is due to the fact that people are still confused about how to license models, given that you need both data (model parameters) and code (inference functions and model classes) to run a model.
In addition to general model categories, you should also know that model "families" evolve after the release of the initial model (e.g., Meta's LLaMA 3). The process normally looks like the following:

1. The releasing entity publishes the initial model (e.g., meta-llama/Meta-Llama-3-8B), which is primarily intended to be used for further fine-tuning (not for off-the-shelf inference). This is sometimes called the "base" or "pre-trained" model.
2. The releasing entity publishes one or more fine-tuned versions of the model (e.g., meta-llama/Meta-Llama-3-8B-Instruct). This version is generally better for most applications unless you plan to do custom fine-tuning. Suffixes vary, but these fine-tunes could include suffixes like "Instruct" (in the case of LLaMA 3), "it" (in the case of Gemma), and "Chat" (in the case of Yi).
3. Third parties fine-tune the model on their own datasets and release those fine-tunes (e.g., NousResearch/Hermes-2-Pro-Llama-3-8B).
4. Third parties release quantized or otherwise optimized versions of these models (e.g., unsloth/llama-3-8b-bnb-4bit).

When exploring models on platforms like Hugging Face, you'll encounter various suffixes and terms. Here's a summary (although not everyone follows these standards):
- 8B, 13B, 70B, etc.: The "size" of the model in terms of number of parameters (usually in increments of billions), where larger models can be more capable generally but are harder to run and scale (without expensive hardware). Many smaller models these days (in the 7-13B parameter range) can fit on a single accelerator and may even perform better than closed, proprietary models on certain tasks.
- Instruct / it: Fine-tuned for following various kinds of instructions. Better for general use.
- Chat: Optimized for conversational tasks.
- Code: Optimized for code generation or tech assistance.
- Tool-Use / Function-Calling: Fine-tuned for the output of structured data, most often used in agentic applications where the LLM is generating JSON (or other output) to call APIs or functions.
- GGUF, 8bit, 4bit, GGML, GPTQ, AWQ: Optimized versions for specific hardware or reduced precision (e.g., for running on MacBooks or CPUs).

Let's look at a few examples to demonstrate the point.
meta-llama/Meta-Llama-3-8B-Instruct
This is a "LLaMA 3" family model trained by the entity (Meta) that originally released the model. It has a size of 8 billion parameters and it is fine-tuned for general instruction following. It has a custom license ("llama3") that is not a widely used code or data license, and, thus, you should review the license to make sure you comply.
HuggingFaceTB/SmolLM-1.7B
This is a "SmolLM" family model trained by the entity (Hugging Face) that originally released the model. It has a size of 1.7 billion parameters and it is NOT fine-tuned for any specific task or tasks (i.e., it is a base model and most useful if you are interested in fine-tuning the model). It has a standard code license ("Apache 2.0") that is permissive and allows commercial use.
TheBloke/OpenHermes-2.5-Mistral-7B-GGUF
This is a "Mistral" family model fine-tuned by an entity (TheBloke) other than the one that released the model (Mistral). It has a size of 7 billion parameters and it is fine-tuned on an open access dataset called Open Hermes 2.5 (which can also be found in Hugging Face under teknium/OpenHermes-2.5
). The model is also optimized using GGUF, a model format that is optimized for quick loading and saving of models. GGUF is designed for use with GGML and other local model executors (llama.cpp). It has a standard code license ("Apache 2.0") that is permissive and allows commercial use.
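The kind of reading-off done in the examples above can be sketched mechanically. Here is a toy parser for model IDs; the heuristics are my own and purely illustrative, since naming on the Hub is not standardized and many real IDs will break these rules:

```python
import re

# Illustrative set of common suffixes; real Hub naming is not standardized.
KNOWN_VARIANTS = {"instruct", "it", "chat", "code", "gguf", "awq", "gptq", "4bit", "8bit"}

def parse_model_id(model_id):
    """Heuristically split an 'org/name' model ID into its parts."""
    org, name = model_id.split("/", 1)
    parts = name.lower().replace("_", "-").split("-")
    # A size token looks like "7b", "8b", "1.7b", etc.
    size = next((p for p in parts if re.fullmatch(r"\d+(\.\d+)?b", p)), None)
    variants = [p for p in parts if p in KNOWN_VARIANTS]
    return {"org": org, "size": size, "variants": variants}

print(parse_model_id("meta-llama/Meta-Llama-3-8B-Instruct"))
# {'org': 'meta-llama', 'size': '8b', 'variants': ['instruct']}
```

Note what the parser cannot tell you: whether the fine-tune came from the original releasing entity or a third party, and what the license is. For those, you still need to read the model card.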
When choosing a model for your organization, keep these trade-offs in mind. For most organizations doing inference, an "Instruct" or otherwise fine-tuned model will outperform a base model. Optimized models (e.g., GGUF versions) can run on less powerful hardware but may sacrifice some performance. For production environments, consider using high-precision models if your infrastructure allows.
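To make the hardware trade-off concrete, here is a back-of-the-envelope estimate of the memory needed just to hold model weights at different precisions (real usage will be higher because of activations, KV cache, and runtime overhead):

```python
def weight_memory_gb(num_params_billion, bits_per_param):
    """Rough memory needed to hold the weights alone, in gigabytes."""
    bytes_total = num_params_billion * 1e9 * bits_per_param / 8
    return bytes_total / 1e9  # using 1 GB = 1e9 bytes for simplicity

for bits, label in [(16, "fp16"), (8, "8-bit"), (4, "4-bit")]:
    print(f"8B model @ {label}: ~{weight_memory_gb(8, bits):.0f} GB")
# 8B model @ fp16: ~16 GB
# 8B model @ 8-bit: ~8 GB
# 8B model @ 4-bit: ~4 GB
```

This is why a 4-bit quantized 8B model fits comfortably on a laptop while the full-precision version may not.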
As you navigate the LLM landscape, remember that the field is rapidly evolving. Stay informed about new releases and community developments. Experiment with different models to find what works best for your specific needs.
Prediction Guard supports multiple model families and a variety of fine-tuned models. These are seamlessly integrated with critical safeguarding functionality to ensure that you can find (and safely integrate) the right model. Please reach out if we can be helpful as you explore the LLM landscape! Book a call here or join our Discord to chat.