
Llama-2-7b-chat-quantized.w8a16

Model Overview

  • Model Architecture: Llama-2
    • Input: Text
    • Output: Text
  • Model Optimizations:
    • Quantized: INT8 weights
  • Release Date: 7/2/2024
  • Version: 1.0
  • Model Developers: Neural Magic

Quantized version of Llama-2-7b-chat. It achieves an average score of 53.37% on version 1 of the OpenLLM benchmark, compared to 53.41% for the unquantized model.

Model Optimizations

This model was obtained by quantizing the weights of Llama-2-7b-chat to the INT8 data type. Only the weights of the linear operators within transformer blocks are quantized. Symmetric per-channel quantization is applied, in which a linear scaling per output dimension maps the INT8 and floating point representations of the quantized weights. AutoGPTQ is used for quantization. This optimization reduces the number of bits per parameter from 16 to 8, reducing the disk size and GPU memory requirements by approximately 50%.
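The symmetric per-channel scheme described above can be sketched in a few lines of NumPy. This is an illustrative toy, not the AutoGPTQ implementation: one scale per output row maps FP weights to INT8, and dequantization multiplies back by the same scale.

```python
import numpy as np

def quantize_per_channel_int8(w):
    # One scale per output channel (row): symmetric, so no zero-point.
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Recover an FP approximation of the original weights.
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 8)).astype(np.float32)  # toy weight matrix
q, s = quantize_per_channel_int8(w)
w_hat = dequantize(q, s)
err = float(np.abs(w - w_hat).max())
```

Because the scheme is symmetric, the worst-case rounding error per element is half a quantization step (scale / 2), while weight storage drops from 16 bits to 8 bits per parameter.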

Evaluation

The model was evaluated with the lm-evaluation-harness using the vLLM engine.
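A hypothetical invocation for one of the tasks below (arc-c, 25-shot) is shown here as a sketch; the flags follow current lm-evaluation-harness conventions and may differ from the exact setup used to produce this card.

```shell
# Sketch only: assumes lm-evaluation-harness and vLLM are installed.
lm_eval \
  --model vllm \
  --model_args pretrained=neuralmagic/Llama-2-7b-chat-quantized.w8a16,dtype=auto \
  --tasks arc_challenge \
  --num_fewshot 25 \
  --batch_size auto
```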

Accuracy

Open LLM Leaderboard evaluation scores

| Benchmark | Llama-2-7b-chat | Llama-2-7b-chat-quantized.w8a16 (this model) |
| --- | --- | --- |
| arc-c (25-shot) | 53.41% | 53.37% |
| hellaswag (10-shot) | 78.65% | 78.53% |
| mmlu (5-shot) | 47.34% | 47.32% |
| truthfulqa (0-shot) | 45.58% | 45.61% |
| winogrande (5-shot) | 72.45% | 72.45% |
| gsm8k (5-shot) | 23.20% | 22.82% |
| **Average Accuracy** | 53.41% | 53.37% |
| **Recovery** | 100% | 99.93% |
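The recovery figure is simply the quantized model's average score expressed as a percentage of the unquantized model's average. Using the averages from the table:

```python
# Averages taken from the evaluation table above.
baseline_avg = 53.41   # unquantized Llama-2-7b-chat
quantized_avg = 53.37  # this model

# Recovery: quantized average as a percentage of the baseline average.
recovery = 100 * quantized_avg / baseline_avg
print(round(recovery, 2))  # 99.93
```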