Fine-tune and deploy open LLMs as containers using AIKit - Part 1: Running on a local machine

Community Article · Published June 4, 2024

Welcome to the first part of our three-part series on leveraging AIKit for deploying large language models (LLMs). In this series, we'll explore inference and fine-tuning, automating these processes using GitHub Actions and Kubernetes, and addressing the security implications of deploying LLMs in production environments.

In Part 1, we'll delve into the fundamentals of inference and fine-tuning using AIKit. You'll learn how to set up AIKit on your local machine, configure your environment, and perform both inference and fine-tuning tasks effortlessly. We'll provide step-by-step instructions and practical examples to ensure you're comfortable with the basics before moving on to more advanced topics.

Part 2 will guide you through automating the inference and fine-tuning processes using GitHub Actions and Kubernetes with Workload Identity. This section will cover setting up a workflow that transitions from fine-tuning to deploying models and scaling with Kubernetes clusters. By the end of this part, you'll be equipped to create a robust, automated workflow for your AI projects.

In Part 3, we'll address the security implications of running LLM workloads in production with containers and Kubernetes. We'll discuss how AIKit tackles vulnerabilities (CVEs), follows container runtime security best practices, keeps your models verifiable with cosign-signed images, and supports air-gapped environments with local or remote container registries. This part will give you the tools and best practices to maintain a secure and efficient AI environment.

🙋‍♂️ What is AIKit?

AIKit is a comprehensive cloud-native, vendor-agnostic solution designed for developers looking to fine-tune, build, and deploy LLMs with ease.

🔍 Key Offerings of AIKit

  1. Inference: AIKit offers extensive inference capabilities across various formats. Its compatibility with the OpenAI API through a drop-in replacement REST API allows seamless integration with any OpenAI API-compatible client to interact with open LLMs.
  2. Fine-Tuning: AIKit provides an extensible interface for a fast, memory-efficient, and straightforward fine-tuning experience.

🤔 Why AIKit?

  • 🐳 Operates without the need for GPUs, Internet, or additional tooling beyond Docker for inference!
  • 🤏 Creates containers with minimal image sizes, enhancing security with fewer vulnerabilities and a smaller attack surface, thanks to a custom distroless-based image!
  • 🎵 Offers robust fine-tuning support that's fast and memory efficient!
  • 🚀 Utilizes an easy-to-use declarative configuration for both inference and fine-tuning.
  • ✨ Fully compatible with the OpenAI API, allowing use with any compatible client.
  • 📸 Supports multi-modal models and image generation with Stable Diffusion.
  • 🦙 Compatible with various model formats like GGUF, GPTQ, EXL2, GGML, Mamba, and more.
  • 🚢 Ready for Kubernetes deployment, including a comprehensive Helm chart!
  • 📦 Capable of supporting multiple models within a single image.
  • 🖥️ Facilitates GPU-accelerated inferencing with NVIDIA GPUs.
  • 🌈 Supports air-gapped environments with self-hosted, local, or any remote container registries to store model images for inference on the edge.
  • 🔐 Ensures security with signed images using cosign, and continuous CVE patching with Copacetic.

💪 Try out inference on your local machine!

You can kick off AIKit on your local machine without a GPU with a simple Docker command:

docker run -d --rm -p 8080:8080 ghcr.io/sozercan/llama3:8b
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
    "model": "llama-3-8b-instruct",
    "messages": [{"role": "user", "content": "explain kubernetes in a sentence"}]
  }'

Output should be similar to:

{  
// ...
    "model": "llama-3-8b-instruct",  
    "choices": [  
        {  
            "index": 0,  
            "finish_reason": "stop",  
            "message": {  
                "role": "assistant",  
                "content": "Kubernetes is an open-source container orchestration system that automates the deployment, scaling, and management of applications and services, allowing developers to focus on writing code rather than managing infrastructure."  
            }  
        }  
    ],  
// ...
}

That's it! 🎉 The API is OpenAI-compatible, so this is a drop-in replacement for any OpenAI API-compatible client. AIKit comes with a number of pre-made images to help you get started right away!
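As a quick sanity check of that compatibility, you can also hit other standard OpenAI-style routes. Assuming the server exposes the usual /v1/models endpoint, the following lists the models the container serves:

curl http://localhost:8080/v1/models

The response should include llama-3-8b-instruct, the same model name used in the chat request above.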

While you can run inference on a CPU, the rest of this tutorial requires an NVIDIA GPU because of the fine-tuning requirements. Please see the GPU Acceleration documentation for the NVIDIA runtime requirements.
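Before continuing, it's worth confirming that Docker can actually reach your GPU. A minimal check, assuming the NVIDIA Container Toolkit is installed and using a CUDA base image (the exact tag may differ on your system):

# Should print the nvidia-smi table with your GPU listed
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi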

📻 Fine-Tuning

Fine-tuning adapts a pre-trained model to your domain-specific data. To get started, create a BuildKit builder that can access the host's GPU devices:

docker buildx create --name aikit-builder \
    --use --buildkitd-flags '--allow-insecure-entitlement security.insecure'
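To confirm the builder was created and bootstraps correctly, you can inspect it with standard Buildx commands (nothing AIKit-specific here):

# List available builders; aikit-builder should appear and be marked as in use
docker buildx ls

# Bootstrap and show details for the builder we just created
docker buildx inspect aikit-builder --bootstrap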

AIKit is designed to be extensible, allowing for multiple fine-tuning implementations. For this tutorial, we'll use Unsloth, known for its fast and memory-optimized fine-tuning capabilities, to fine-tune our model.

Create a YAML file with your configuration. For example, a minimal configuration looks like this:

cat <<EOF >aikitfile-finetune.yaml
#syntax=ghcr.io/sozercan/aikit:latest
apiVersion: v1alpha1
baseModel: "unsloth/llama-2-7b-bnb-4bit"
datasets:
  - source: "yahma/alpaca-cleaned"
    type: "alpaca"
config:
  unsloth:
EOF

For the full specification, see AIKit's fine-tuning documentation. The #syntax directive on the first line tells BuildKit to use the AIKit image as the frontend that interprets this file.

Start the fine-tuning process by calling docker build:

docker buildx build --builder aikit-builder --allow security.insecure \
    --file aikitfile-finetune.yaml \
    --output _output --target unsloth --progress plain .

This kicks off the fine-tuning process. Keep in mind that this step does not create a container image; instead, it writes the fine-tuned model file to the folder designated by the --output parameter. Depending on your GPU, base model, and dataset, this process might take some time.

After the fine-tuning process is complete, AIKit automatically converts the fine-tuned model to GGUF format in the output directory. In our case, this is the _output directory. Let's take a look:

$ ls -al _output
-rw-r--r--  1 sozercan sozercan 7161089856 Mar  3 00:19 aikit-model-q4_k_m.gguf

Now that we have our fine-tuned model, let's create a model image similar to the pre-built Llama 3 model we ran earlier.

📦 Creating a model image

We will create a model image from the fine-tuned model we built earlier. Start by creating an aikitfile-inference.yaml with the following structure:

cat <<EOF >aikitfile-inference.yaml
#syntax=ghcr.io/sozercan/aikit:latest
apiVersion: v1alpha1
debug: true
runtime: cuda
models:
  - name: custom
    source: model-q4_k_m.gguf
    promptTemplates:
    - name: instruct
      template: |
        Below is an instruction that describes a task. Write a response that appropriately completes the request.

        ### Instruction:
        {{.Input}}

        ### Response:
config: |
  - name: custom
    backend: llama
    parameters:
      model: model-q4_k_m.gguf
    context_size: 4096
    gpu_layers: 35
    f16: true
    batch: 512
    mmap: true
    template:
      chat: instruct
EOF

Now, we will build the model image using the aikitfile-inference.yaml configuration file, with _output as the build context, and load it into the local Docker image store tagged as myllm:demo.

docker buildx build _output --tag myllm:demo \
    --file aikitfile-inference.yaml \
    --load --progress plain

This process will copy in the model and the necessary runtimes. Let’s examine our image:

docker images

Output will contain the custom model we just built:

REPOSITORY    TAG       IMAGE ID       CREATED          SIZE
myllm         demo      e8a2da8a320a   3 minutes ago    11.6GB
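If you want the image to be available beyond your local machine, for example in a self-hosted registry for air-gapped deployments, you can push it during the build instead of loading it locally. A sketch, assuming you are logged in to your registry and substituting the placeholder with your own registry and namespace:

docker buildx build _output --tag <your-registry>/myllm:demo \
    --file aikitfile-inference.yaml \
    --push --progress plain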

Now, we are ready to run our fine-tuned model!

🏃‍♀️‍➡️ Running the fine-tuned model

As with the pre-built model image we ran earlier, we will call docker run, but this time with GPU access enabled via the --gpus all flag.

docker run -d --rm --gpus all -p 8080:8080 myllm:demo
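Since the container starts detached, the model may take a moment to load before the API responds. One way to watch for readiness is to capture the container ID and follow its logs, a small variation on the command above using standard Docker commands:

# Start the container and keep its ID so we can tail the logs
CONTAINER_ID=$(docker run -d --rm --gpus all -p 8080:8080 myllm:demo)
docker logs -f "$CONTAINER_ID"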

Once the container is running, you can perform an inference by sending a POST request to the AIKit API:

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
    "model": "custom",
    "messages": [{"role": "user", "content": "Compose a haiku about cats"}]
}'

Finally, we get our response back from the LLM:

{
// ...
  "model": "custom",
  "choices": [
    {
      "index": 0,
      "finish_reason": "stop",
      "message": {
        "role": "assistant",
        "content": "Cats are playful and cute,\nThey purr and meow with delight,\nA purrfect companion."
      }
    }
  ],
// ...
}
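Since the endpoint follows the OpenAI chat completions schema, you can also pass common generation parameters such as temperature and max_tokens. Support for individual parameters can vary by backend, but a request shaped like this is a reasonable starting point:

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
    "model": "custom",
    "messages": [{"role": "user", "content": "Compose a haiku about cats"}],
    "temperature": 0.7,
    "max_tokens": 128
  }'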

In this tutorial, we covered how to deploy a pre-made AIKit image, fine-tune a model with a dataset to create a custom model, and run that model to get a response!

🔮 What’s next?

In the next part of this series, we'll dive into automating and scaling this workflow using GitHub Actions and Kubernetes. This will enable you to create a robust, automated pipeline for your AI projects, ensuring seamless transitions from fine-tuning to deployment. Stay tuned!

📚 Additional Resources

For more detailed information, you can refer to the AIKit documentation and the AIKit GitHub repository.

Thank you for following along! If you have any questions or feedback, feel free to open an issue in the AIKit GitHub repository.