In the rapidly evolving world of artificial intelligence (AI), I find that there is a lot of confusion about the "kinds" of Large Language Models (LLMs) that are available and how model families evolve. Which model is best for your company? Should I use an instruct, chat, base, GGUF, or merged model? What are all these weird names and what do they mean?
To illustrate this confusion, look at the current number of downloads for Meta's LLaMA 3 models on Hugging Face (at the time of writing this blog post):
As you can see, both the "Meta-Llama-3-8B" base model and the "Instruct" version of this model have over 2M downloads. To me, this represents confusion. The base model will perform much worse than the "Instruct" model for most people unless you are fine-tuning, and there just aren't 2M people fine-tuning (i.e., the number of people just doing inference is much, much higher).
If you are helping to bring AI into your organization, it is crucial to understand the current landscape and make informed decisions about which models to integrate into your infrastructure. This post will break down the LLM ecosystem and give you a mental model to help you make sane decisions.
To navigate the LLM landscape effectively, it's important to understand the three main categories of models available:
Each category has its strengths and use cases, and the choice depends on your specific needs, resources, and risk tolerance. You will find models fitting categories #2 and #3 on Hugging Face, similar to how you will find open source code on GitHub. Model releases on Hugging Face include model parameter files, configuration, etc. that allow the models to be utilized and fine-tuned using common tooling (like the transformers package from Hugging Face).
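To make this concrete, here is a toy sketch of the kinds of files a typical model release on Hugging Face contains. The file names below are illustrative (exact names and formats vary by model, and weights may be sharded across multiple files):

```python
# Illustrative only: the kinds of files you will typically see in a
# Hugging Face model release. Exact file names vary by model and format.
TYPICAL_RELEASE_FILES = {
    "config.json",             # model architecture / hyperparameters
    "model.safetensors",       # model parameters (may be sharded)
    "tokenizer.json",          # tokenizer vocabulary and rules
    "tokenizer_config.json",   # tokenizer settings
    "generation_config.json",  # default generation settings
}

def looks_like_transformers_release(files):
    """Rough check: does a file listing contain weights + config + tokenizer?"""
    has_config = "config.json" in files
    has_weights = any(f.endswith((".safetensors", ".bin")) for f in files)
    has_tokenizer = any(f.startswith("tokenizer") for f in files)
    return has_config and has_weights and has_tokenizer

print(looks_like_transformers_release(TYPICAL_RELEASE_FILES))  # True
```

It is exactly this bundle of parameters plus configuration that lets common tooling load and fine-tune the model.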
Also similar to code on GitHub, you will find that open models released on Hugging Face are licensed in a wide variety of ways. Look at the range of licenses here:
Some of these licenses are code licenses (e.g., Apache 2.0), some are data licenses (e.g., CC-*), and some are one-off licenses. This mix is due to the fact that people are still confused about how to license models, given that you need both data (model parameters) and code (inference functions and model classes) to run a model.
In addition to general model categories, you should also know that model "families" evolve after the release of the initial model (e.g., Meta's LLaMA 3). The process normally looks like the following:

1. The releasing entity publishes the initial model (e.g., meta-llama/Meta-Llama-3-8B), which is primarily intended to be used for further fine-tuning (not for off-the-shelf inference). This is sometimes called the "base" or "pre-trained" model.
2. The releasing entity publishes one or more fine-tuned versions of the model (e.g., meta-llama/Meta-Llama-3-8B-Instruct). This version is generally better for most applications unless you plan to do custom fine-tuning. Suffixes vary, but these fine-tunes could include suffixes like "Instruct" (in the case of LLaMA 3), "it" (in the case of Gemma), and "Chat" (in the case of Yi).
3. Third parties fine-tune the model on their own datasets and release those fine-tunes (e.g., NousResearch/Hermes-2-Pro-Llama-3-8B).
4. Third parties release quantized or otherwise optimized versions of these models (e.g., unsloth/llama-3-8b-bnb-4bit).

When exploring models on platforms like Hugging Face, you'll encounter various suffixes and terms. Here's a summary (although not everyone follows these standards):
- 8B, 13B, 70B, etc.: The "size" of the model in terms of number of parameters (usually in increments of billions), where larger models can be more capable generally but are harder to run and scale (without expensive hardware). Many smaller models these days (in the 7-13B parameter range) can fit on a single accelerator and may even perform better than closed, proprietary models on certain tasks.
- Instruct / it: Fine-tuned for following various kinds of instructions. Better for general use.
- Chat: Optimized for conversational tasks.
- Code: Optimized for code generation or tech assistance.
- Tool-Use / Function-Calling: Fine-tuned for the output of structured data, most often used in agentic applications where the LLM is generating JSON (or other output) to call APIs or functions.
- GGUF, 8bit, 4bit, GGML, GPTQ, AWQ: Optimized versions for specific hardware or reduced precision (e.g., for running on MacBooks or CPUs).

Let's look at a few examples to demonstrate the point.
meta-llama/Meta-Llama-3-8B-Instruct
This is a "LLaMA 3" family model trained by the entity (Meta) that originally released the model. It has a size of 8 billion parameters and it is fine-tuned for general instruction following. It has a custom license ("llama3") that is not a widely used code or data license, and, thus, you should review the license to make sure you comply.
HuggingFaceTB/SmolLM-1.7B
This is a "SmolLM" family model trained by the entity (Hugging Face) that originally released the model. It has a size of 1.7 billion parameters and it is NOT fine-tuned for any specific task or tasks (i.e., it is a base model and most useful if you are interested in fine-tuning the model). It has a standard code license ("Apache 2.0") that is permissive and allows commercial use.
TheBloke/OpenHermes-2.5-Mistral-7B-GGUF
This is a "Mistral" family model fine-tuned by an entity (TheBloke) other than the one that released the model (Mistral). It has a size of 7 billion parameters and it is fine-tuned on an open access dataset called Open Hermes 2.5 (which can also be found in Hugging Face under teknium/OpenHermes-2.5
). The model is also optimized using GGUF, a model format that is optimized for quick loading and saving of models. GGUF is designed for use with GGML and other local model executors (llama.cpp). It has a standard code license ("Apache 2.0") that is permissive and allows commercial use.
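The kind of reading-off done in the examples above can be sketched mechanically. Here is a toy parser for model IDs; the heuristics are my own and purely illustrative, since naming on the Hub is not standardized and many real IDs will break these rules:

```python
import re

# Illustrative set of common suffixes; real Hub naming is not standardized.
KNOWN_VARIANTS = {"instruct", "it", "chat", "code", "gguf", "awq", "gptq", "4bit", "8bit"}

def parse_model_id(model_id):
    """Heuristically split an 'org/name' model ID into its parts."""
    org, name = model_id.split("/", 1)
    parts = name.lower().replace("_", "-").split("-")
    # A size token looks like "7b", "8b", "1.7b", etc.
    size = next((p for p in parts if re.fullmatch(r"\d+(\.\d+)?b", p)), None)
    variants = [p for p in parts if p in KNOWN_VARIANTS]
    return {"org": org, "size": size, "variants": variants}

print(parse_model_id("meta-llama/Meta-Llama-3-8B-Instruct"))
# {'org': 'meta-llama', 'size': '8b', 'variants': ['instruct']}
```

Note what the parser cannot tell you: whether the fine-tune came from the original releasing entity or a third party, and what the license is. For those, you still need to read the model card.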
When choosing a model for your organization, keep these trade-offs in mind. For most organizations doing inference, an "Instruct" or otherwise fine-tuned model will outperform a base model. Optimized models (e.g., GGUF versions) can run on less powerful hardware but may sacrifice some performance. For production environments, consider using high-precision models if your infrastructure allows.
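To make the hardware trade-off concrete, here is a back-of-the-envelope estimate of the memory needed just to hold model weights at different precisions (real usage will be higher because of activations, KV cache, and runtime overhead):

```python
def weight_memory_gb(num_params_billion, bits_per_param):
    """Rough memory needed to hold the weights alone, in gigabytes."""
    bytes_total = num_params_billion * 1e9 * bits_per_param / 8
    return bytes_total / 1e9  # using 1 GB = 1e9 bytes for simplicity

for bits, label in [(16, "fp16"), (8, "8-bit"), (4, "4-bit")]:
    print(f"8B model @ {label}: ~{weight_memory_gb(8, bits):.0f} GB")
# 8B model @ fp16: ~16 GB
# 8B model @ 8-bit: ~8 GB
# 8B model @ 4-bit: ~4 GB
```

This is why a 4-bit quantized 8B model fits comfortably on a laptop while the full-precision version may not.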
As you navigate the LLM landscape, remember that the field is rapidly evolving. Stay informed about new releases and community developments. Experiment with different models to find what works best for your specific needs.
Prediction Guard supports multiple model families and a variety of fine-tuned models. These are seamlessly integrated with critical safeguarding functionality to ensure that you can find (and safely integrate) the right model. Please reach out if we can be helpful as you explore the LLM landscape! Book a call here or join our Discord to chat.