Welcome Llama 3 - Meta’s new open LLM

Published April 18, 2024

This article is also available in Simplified Chinese (简体中文).

Introduction

Meta’s Llama 3, the next iteration of the open-access Llama family, is now released and available at Hugging Face. It's great to see Meta continuing its commitment to open AI, and we’re excited to fully support the launch with comprehensive integration in the Hugging Face ecosystem.

Llama 3 comes in two sizes: 8B for efficient deployment and development on consumer-size GPU, and 70B for large-scale AI native applications. Both come in base and instruction-tuned variants. In addition to the 4 models, a new version of Llama Guard was fine-tuned on Llama 3 8B and is released as Llama Guard 2 (safety fine-tune).

We’ve collaborated with Meta to ensure the best integration into the Hugging Face ecosystem. You can find all 5 open-access models (2 base models, 2 fine-tuned & Llama Guard) on the Hub. The features and integrations being released include the models on the Hub, 🤗 Transformers support, a Hugging Chat demo, inference integrations (Inference Endpoints, Google Cloud, and Amazon SageMaker), and an example of fine-tuning with 🤗 TRL, all covered in the sections below.

What’s new with Llama 3?

The Llama 3 release introduces 4 new open LLM models by Meta based on the Llama 2 architecture. They come in two sizes: 8B and 70B parameters, each with base (pre-trained) and instruct-tuned versions. All the variants can be run on various types of consumer hardware and have a context length of 8K tokens.

In addition to these 4 base models, Llama Guard 2 was also released. Fine-tuned on Llama 3 8B, it’s the latest iteration in the Llama Guard family. Llama Guard 2, built for production use cases, is designed to classify LLM inputs (prompts) as well as LLM responses in order to detect content that would be considered unsafe in a risk taxonomy.
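
To make that concrete, here is a minimal sketch of how a moderation call could look with transformers; the repo id, example prompt, and generation settings are illustrative assumptions rather than Meta’s official example, so double-check the Llama Guard 2 model card:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed repo id; check the Meta Llama organization on the Hub for the exact name
guard_id = "meta-llama/Meta-Llama-Guard-2-8B"

tokenizer = AutoTokenizer.from_pretrained(guard_id)
model = AutoModelForCausalLM.from_pretrained(guard_id, torch_dtype=torch.bfloat16, device_map="auto")

# Conversation to classify; you can pass just the prompt, or the prompt plus the model's answer
chat = [{"role": "user", "content": "How do I make a fake ID?"}]

# The tokenizer's chat template wraps the conversation in Llama Guard's risk-taxonomy prompt
input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt").to(model.device)
output = model.generate(input_ids=input_ids, max_new_tokens=20)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
# Expected output: "safe", or "unsafe" followed by the violated category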

A big change in Llama 3 compared to Llama 2 is the use of a new tokenizer that expands the vocabulary size to 128,256 (from 32K tokens in the previous version). This larger vocabulary can encode text more efficiently (both for input and output) and potentially yield stronger multilingualism. This comes at a cost, though: the embedding input and output matrices are larger, which accounts for a good portion of the parameter count increase of the small model: it goes from 7B in Llama 2 to 8B in Llama 3. In addition, the 8B version of the model now uses Grouped-Query Attention (GQA), which is an efficient representation that should help with longer contexts.
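
As a quick illustration, here is a small sketch that compares the two tokenizers with transformers (assuming you have been granted access to both gated repositories); the sample sentence is arbitrary:

from transformers import AutoTokenizer

llama3_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
llama2_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

print(len(llama3_tokenizer))  # ~128K entries
print(len(llama2_tokenizer))  # ~32K entries

# The larger vocabulary typically encodes the same text in fewer tokens
text = "Hugging Face and Meta collaborated on the Llama 3 release."
print(len(llama3_tokenizer(text)["input_ids"]))
print(len(llama2_tokenizer(text)["input_ids"]))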

The Llama 3 models were trained on roughly 8x more data than Llama 2: over 15 trillion tokens from a new mix of publicly available online data, using two clusters with 24,000 GPUs. We don’t know the exact details of the training mix, and we can only guess that bigger and more careful data curation was a big factor in the improved performance. Llama 3 Instruct has been optimized for dialogue applications and was trained on over 10 million human-annotated data samples with a combination of supervised fine-tuning (SFT), rejection sampling, proximal policy optimization (PPO), and direct preference optimization (DPO).

Regarding the licensing terms, Llama 3 comes with a permissive license that allows redistribution, fine-tuning, and derivative works. The requirement for explicit attribution is new in the Llama 3 license and was not present in Llama 2. Derived models, for instance, need to include "Llama 3" at the beginning of their name, and you also need to mention "Built with Meta Llama 3" in derivative works or services. For full details, please make sure to read the official license.

Llama 3 evaluation

Here, you can see a list of models and their Open LLM Leaderboard scores. This is not a comprehensive list, and we encourage you to look at the full leaderboard. Note that the Open LLM Leaderboard is especially useful for evaluating pre-trained models, as there are other benchmarks specific to conversational models.

| Model         | License         | Pretraining length [tokens] | Leaderboard score |
|---------------|-----------------|-----------------------------|-------------------|
| MPT-7B        | Apache 2.0      | 1,000B                      | 5.98              |
| Falcon-7B     | Apache 2.0      | 1,500B                      | 5.1               |
| Llama-2-7B    | Llama 2 license | 2T                          | 8.72              |
| Qwen 2 7B     | Apache 2.0      | ?                           | 23.66             |
| Llama-3-8B    | Llama 3 license | 15T                         | 13.41             |
| Llama-2-13B   | Llama 2 license | 2T                          | 10.99             |
| Falcon-40B    | Apache 2.0      | 1,000B                      | 11.33             |
| Llama-2-70B   | Llama 2 license | 2T                          | 18.25             |
| Llama-3-70B   | Llama 3 license | 15T                         | 26.37             |
| Mixtral 8x22B | Apache 2.0      | ?                           | 25.49             |

How to prompt Llama 3

The base models have no prompt format. Like other base models, they can be used to continue an input sequence with a plausible continuation or for zero-shot/few-shot inference. They are also a great foundation for fine-tuning your own use cases. The Instruct versions use the following conversation structure:

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

{{ system_prompt }}<|eot_id|><|start_header_id|>user<|end_header_id|>

{{ user_msg_1 }}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

{{ model_answer_1 }}<|eot_id|>

This format has to be exactly reproduced for effective use. We’ll later show how easy it is to reproduce the instruct prompt with the chat template available in transformers.
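
As a preview of that, a minimal sketch of how the chat template reproduces the structure above (assuming you have been granted access to the gated Instruct repository):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who are you?"},
]

# add_generation_prompt appends the assistant header so the model knows it should answer next
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)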

Demo

You can chat with Llama 3 70B Instruct on Hugging Chat! Check out the link here: https://huggingface.co/chat/models/meta-llama/Meta-Llama-3-70B-instruct

Using 🤗 Transformers

With Transformers release 4.40, you can use Llama 3 and leverage all the tools within the Hugging Face ecosystem, such as:

  • training and inference scripts and examples
  • safe file format (safetensors)
  • integrations with tools such as bitsandbytes (4-bit quantization), PEFT (parameter efficient fine-tuning), and Flash Attention 2
  • utilities and helpers to run generation with the model
  • mechanisms to export the models to deploy

In addition, Llama 3 models are compatible with torch.compile() with CUDA graphs, giving them a ~4x speedup at inference time!
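
If you want to try this, a rough sketch along the lines of the generic transformers static-cache recipe could look as follows; this is not a Llama 3-specific API, and the exact speedup will depend on your hardware and library versions:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).to("cuda")

# A static KV cache gives torch.compile fixed shapes, which enables CUDA graph capture
model.generation_config.cache_implementation = "static"
model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)

inputs = tokenizer("Open-source AI matters because", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))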

To use Llama 3 models with transformers, make sure to install a recent version of transformers:

pip install --upgrade transformers

The following snippet shows how to use Llama-3-8b-instruct with transformers. It requires about 16 GB of RAM, which fits on consumer GPUs such as the 3090 or 4090.

from transformers import pipeline
import torch

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

# Load the model in bfloat16, the dtype of the original checkpoint published by Meta
pipe = pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device="cuda",
)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

# Stop on either the regular EOS token or the end-of-turn token <|eot_id|>
terminators = [
    pipe.tokenizer.eos_token_id,
    pipe.tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

outputs = pipe(
    messages,
    max_new_tokens=256,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)

# The pipeline returns the full conversation; the last message is the assistant's reply
assistant_response = outputs[0]["generated_text"][-1]["content"]
print(assistant_response)

Arrrr, me hearty! Me name be Captain Chat, the scurviest pirate chatbot to ever sail the Seven Seas! Me be here to swab the decks o' yer mind with me trusty responses, savvy? I be ready to hoist the Jolly Roger and set sail fer a swashbucklin' good time, matey! So, what be bringin' ye to these fair waters?

A couple of details:

  • We loaded the model in bfloat16. This is the type used by the original checkpoint published by Meta, so it’s the recommended way to run the model to ensure the best precision or to conduct evaluations. For real-world use, it’s also safe to use float16, which may be faster depending on your hardware.
  • Assistant responses may end with the special token <|eot_id|>, but we must also stop generation if the regular EOS token is found. We can stop generation early by providing a list of terminators in the eos_token_id parameter.
  • We used the default sampling parameters (temperature and top_p) taken from the original Meta codebase. We haven’t had time to conduct extensive tests yet; feel free to explore!

You can also automatically quantize the model, loading it in 8-bit or even 4-bit mode. 4-bit loading takes about 7 GB of memory to run, making it compatible with a lot of consumer cards and all the GPUs in Google Colab. This is how you’d load the generation pipeline in 4-bit:

# Reuses model_id and the pipeline import from the previous snippet; requires bitsandbytes to be installed
pipe = pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={
        "torch_dtype": torch.float16,
        "quantization_config": {"load_in_4bit": True},
        "low_cpu_mem_usage": True,
    },
)

For more details on using the models with transformers, please check the model cards.

Inference Integrations

In this section, we’ll go through different approaches to running inference of the Llama 3 models. Before using these models, make sure you have requested access to one of the models in the official Meta Llama 3 repositories.

Integration with Inference Endpoints

You can deploy Llama 3 on Hugging Face's Inference Endpoints, which uses Text Generation Inference as the backend. Text Generation Inference is a production-ready inference container developed by Hugging Face to enable easy deployment of large language models. It has features such as continuous batching, token streaming, tensor parallelism for fast inference on multiple GPUs, and production-ready logging and tracing.

To deploy Llama 3, go to the model page and click on the Deploy -> Inference Endpoints widget. You can learn more about Deploying LLMs with Hugging Face Inference Endpoints in a previous blog post. Inference Endpoints supports Messages API through Text Generation Inference, which allows you to switch from another closed model to an open one by simply changing the URL.

from openai import OpenAI

# initialize the client but point it to TGI
client = OpenAI(
    base_url="<ENDPOINT_URL>" + "/v1/",  # replace with your endpoint url
    api_key="<HF_API_TOKEN>",  # replace with your token
)
chat_completion = client.chat.completions.create(
    model="tgi",
    messages=[
        {"role": "user", "content": "Why is open-source software important?"},
    ],
    stream=True,
    max_tokens=500
)

# iterate and print stream
for message in chat_completion:
    print(message.choices[0].delta.content, end="")

Integration with Google Cloud

You can deploy Llama 3 on Google Cloud through Vertex AI or Google Kubernetes Engine (GKE), using Text Generation Inference.

To deploy the Llama 3 model from Hugging Face, go to the model page and click on Deploy -> Google Cloud. This will bring you to the Google Cloud Console, where you can 1-click deploy Llama 3 on Vertex AI or GKE.

Integration with Amazon SageMaker

You can deploy and train Llama 3 on Amazon SageMaker through AWS Jumpstart or using the Hugging Face LLM Container.

To deploy the Llama 3 model from Hugging Face, go to the model page and click on Deploy -> Amazon SageMaker. This will display a code snippet you can copy and execute in your environment. Amazon SageMaker will now create a dedicated inference endpoint you can use to send requests.
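
As a rough idea of what that generated snippet usually looks like with the sagemaker Python SDK (the container version, instance type, and environment values below are assumptions you should adapt to your account and needs):

import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

role = sagemaker.get_execution_role()  # assumes you run this inside SageMaker

# Hugging Face LLM (TGI) container image; the version is an assumption, pick a current one
image_uri = get_huggingface_llm_image_uri("huggingface", version="2.0.0")

model = HuggingFaceModel(
    image_uri=image_uri,
    role=role,
    env={
        "HF_MODEL_ID": "meta-llama/Meta-Llama-3-8B-Instruct",
        "HUGGING_FACE_HUB_TOKEN": "<HF_TOKEN>",  # needed because the repo is gated
        "SM_NUM_GPUS": "1",
        "MAX_INPUT_LENGTH": "4096",
        "MAX_TOTAL_TOKENS": "8192",
    },
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",  # instance type is an assumption; size it to your model
    container_startup_health_check_timeout=600,
)

print(predictor.predict({"inputs": "Why is open-source software important?"}))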

Fine-tuning with 🤗 TRL

Training LLMs can be technically and computationally challenging. In this section, we’ll look at the tools available in the Hugging Face ecosystem to efficiently train Llama 3 on consumer-size GPUs. Below is an example command to fine-tune Llama 3 on the No Robots dataset. We use 4-bit quantization and QLoRA, and TRL’s SFTTrainer will automatically format the dataset into chatml format. Let’s get started!

First, install the latest version of 🤗 TRL.

pip install -U transformers trl accelerate

If you just want to chat with the model in the terminal you can use the chat command of the TRL CLI (for more info see the docs):

trl chat \
--model_name_or_path meta-llama/Meta-Llama-3-8B-Instruct \
--device cuda \
--eos_tokens "<|end_of_text|>,<|eot_id|>"

You can also use the TRL CLI for supervised fine-tuning (SFT) of Llama 3 on your own custom dataset. Use the trl sft command and pass your training arguments as CLI arguments. Make sure you are logged in and have access to the Llama 3 checkpoint. You can do this with huggingface-cli login.

trl sft \
--model_name_or_path meta-llama/Meta-Llama-3-8B \
--dataset_name HuggingFaceH4/no_robots \
--learning_rate 0.0001 \
--per_device_train_batch_size 4 \
--max_seq_length 2048 \
--output_dir ./llama3-sft \
--use_peft \
--load_in_4bit \
--log_with wandb \
--gradient_checkpointing \
--logging_steps 10

This runs the fine-tuning from your terminal and takes about 4 hours on a single A10G, but it can easily be parallelized by setting --num_processes to the number of GPUs you have available.

Note: You can also replace the CLI arguments with a yaml file. Learn more about the TRL CLI here.

Additional Resources

Acknowledgments

Releasing such models with support and evaluations in the ecosystem would not be possible without the contributions of many community members.

Thank you to the Meta Team for releasing Llama 3 and making it available to the open-source AI community!