feihu.hf committed
Commit 4d8ab9d
1 Parent(s): 27bd103

update README.md

Files changed (1)
  1. README.md +6 -0
README.md CHANGED
@@ -128,6 +128,12 @@ Or you can install vLLM from [source](https://github.com/vllm-project/vllm/).
 
 **Note**: Presently, vLLM only supports static YARN, which means the scaling factor remains constant regardless of input length, **potentially impacting performance on shorter texts**. We advise adding the `rope_scaling` configuration only when processing long contexts is required.
 
+ ## Benchmark and Speed
+
+ To compare the generation performance between bfloat16 (bf16) and quantized models such as GPTQ-Int8, GPTQ-Int4, and AWQ, please consult our [Benchmark of Quantized Models](https://qwen.readthedocs.io/en/latest/benchmark/quantization_benchmark.html). This benchmark provides insights into how different quantization techniques affect model performance.
+
+ For those interested in the inference speed and memory consumption when deploying these models with either `transformers` or `vLLM`, we have compiled an extensive [Speed Benchmark](https://qwen.readthedocs.io/en/latest/benchmark/speed_benchmark.html).
+
 ## Citation
 
 If you find our work helpful, feel free to give us a cite.
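
For reference, the `rope_scaling` configuration that the note in this hunk recommends is added to the model's `config.json`. A minimal sketch of its shape, assuming the Qwen2-style YaRN settings; the `factor` and `original_max_position_embeddings` values below are illustrative and should follow the values documented for the specific checkpoint:

```json
{
  "rope_scaling": {
    "type": "yarn",
    "factor": 4.0,
    "original_max_position_embeddings": 32768
  }
}
```

Because vLLM currently applies this scaling statically, the factor is used for every request regardless of length, which is why the note advises enabling it only for long-context workloads.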
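To get a rough, local reading of the kind of numbers the Speed Benchmark reports, here is a minimal timing sketch with `transformers`; the model id, prompt, and generation settings are placeholders rather than the benchmark's actual configuration:

```python
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint -- substitute the exact model you want to measure.
model_id = "Qwen/Qwen2-7B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

inputs = tokenizer(
    "Give me a short introduction to large language models.",
    return_tensors="pt",
).to(model.device)

# Warm-up generation so lazy initialization does not distort the timing.
model.generate(**inputs, max_new_tokens=32)

if torch.cuda.is_available():
    torch.cuda.synchronize()
start = time.time()
output = model.generate(**inputs, max_new_tokens=256)
if torch.cuda.is_available():
    torch.cuda.synchronize()
elapsed = time.time() - start

new_tokens = output.shape[-1] - inputs["input_ids"].shape[-1]
print(f"generation speed: {new_tokens / elapsed:.2f} tokens/s")
if torch.cuda.is_available():
    print(f"peak GPU memory: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GiB")
```

The linked Speed Benchmark covers more settings (context lengths, batch sizes, quantized variants, and vLLM deployment), so treat this only as a quick sanity check on your own hardware.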