My exllamav2-based quantization of Xwin-LM-70B-V0.1, targeted at 48G VRAM, seems to have hit a sweet spot in evaluations.

  • Original model: https://huggingface.co/Xwin-LM/Xwin-LM-70B-V0.1
  • Exllamav2 4.8bpw conversion from https://huggingface.co/firelzrd/Xwin-LM-70B-V0.1-fp16-safetensors.
  • Fits in 48G (2x24G) VRAM with 4k or 8k context, with or without the 8-bit cache enabled.
  • Recommended settings: 6400 context, alpha_value 1.6, gpu_split 20,23.5 (see the loading sketch after this list).
  • alpha_value at or above 1.75 seems to result in an occasional 'stutter', most obvious when the model outputs dates, e.g. "The Sixth Sense (19999)".
  • Seems to have hit some lucky quantization: the 4.800b was better than the 4bit-128g, 4bit-32g, Q4_K_S, 4.650b, 4.900b, and even the 5.000b!
  • Experimentation has shown that alpha_value 1.6 instead of 1.75 seems better at 1.5x context and even at 1.5625x.
  • Maybe obvious to some, but there is no perplexity impact from using an 8-bit cache.
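
If you load the model through text-generation-webui's ExLlamav2 loader (an assumption on my part; the alpha_value / gpu_split / 8-bit cache names above match its options), the recommended settings translate to roughly the following launch command. Flag names may differ between versions, so treat this as a sketch rather than a drop-in line:

# 6400 ctx, NTK alpha 1.6, split across 2x24G, 8-bit cache
python3 server.py --model matatonic_Xwin-LM-70B-V0.1-exl2-4.800b \
 --loader exllamav2 \
 --max_seq_len 6400 \
 --alpha_value 1.6 \
 --gpu-split 20,23.5 \
 --cache_8bit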

Made using exllamav2/convert.py with the following command:

python3 convert.py -i models/firelzrd_Xwin-LM-70B-V0.1-fp16-safetensors/ \
 -cf models/matatonic_Xwin-LM-70B-V0.1-exl2-4.800b \
 -o tmp/ \
 -c parquet/wikitext-test.parquet \
 -b 4.800
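
If you want to try other bitrates yourself (the 4.650b / 4.900b / 5.000b variants mentioned above, for example), convert.py can save its measurement pass and reuse it, so only the final quantization step is repeated per bitrate. A sketch assuming the -om / -m options of convert.py in your exllamav2 checkout (option names may vary by version; the measurement filename and the separate tmp2/ working directory are just illustrative):

# First pass: quantize to 4.800b and save the measurement
python3 convert.py -i models/firelzrd_Xwin-LM-70B-V0.1-fp16-safetensors/ \
 -cf models/matatonic_Xwin-LM-70B-V0.1-exl2-4.800b \
 -o tmp/ \
 -c parquet/wikitext-test.parquet \
 -om xwin70b-measurement.json \
 -b 4.800

# Later passes: reuse the measurement for a different target bitrate
# (fresh working directory so the previous job is not resumed)
python3 convert.py -i models/firelzrd_Xwin-LM-70B-V0.1-fp16-safetensors/ \
 -cf models/Xwin-LM-70B-V0.1-exl2-4.650b \
 -o tmp2/ \
 -c parquet/wikitext-test.parquet \
 -m xwin70b-measurement.json \
 -b 4.650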

Perplexity (wikitext) evaluated as:

| Model | Perplexity | Comment (alpha_value) |
|---|---|---|
| matatonic_Xwin-LM-70B-V0.1-exl2-4.800b | 3.21780776977539 | 4096 ctx |
| matatonic_Xwin-LM-70B-V0.1-exl2-4.900b | 3.2188525199890137 | 4096 ctx (not released) |
| firelzrd_Xwin-LM-70B-V0.1-exl2_5-bpw | 3.22019362449646 | 4096 ctx (8b cache) |
| matatonic_Xwin-LM-70B-V0.1-exl2-4.800b | 3.239454746246338 | 5120 ctx (1.375) |
| LoneStriker_Xwin-LM-70B-V0.1-4.65bpw-h6-exl2 | 3.2419090270996094 | 4096 ctx |
| matatonic_Xwin-LM-70B-V0.1-exl2-4.800b | 3.2434027194976807 | 6400 ctx (1.6) |
| matatonic_Xwin-LM-70B-V0.1-exl2-4.800b | 3.2434027194976807 | 6400 ctx (1.6, 8b cache) |
| xwin-lm-70b-v0.1.Q4_K_S.gguf | 3.2480294704437256 | 4096 ctx |
| matatonic_Xwin-LM-70B-V0.1-exl2-4.800b | 3.253002405166626 | 6144 ctx (1.75) |
| TheBloke_Xwin-LM-70B-V0.1-GPTQ_gptq-4bit-32g-actorder_True | 3.266364574432373 | 4096 ctx |
| matatonic_Xwin-LM-70B-V0.1-exl2-4.800b | 3.278069496154785 | 6656 ctx (1.95) |
| TheBloke_Xwin-LM-70B-V0.1-GPTQ_gptq-4bit-128g-actorder_True | 3.2803425788879395 | 4096 ctx |
| matatonic_Xwin-LM-70B-V0.1-exl2-4.800b | 3.304278612136841 | 7168 ctx (2.125) |
| matatonic_Xwin-LM-70B-V0.1-exl2-4.800b | 3.359946727752685 | 8192 ctx (2.5) |

It should also be better than xwin-lm-70b-v0.1.Q4_K_M.gguf, which reports 4.8bpw, but so far my perplexity evaluation of it has not been successful.
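
To reproduce or sanity-check these numbers, exllamav2's test_inference.py includes a perplexity evaluation over a parquet dataset. Something along these lines should work against the same wikitext file, though exact values depend on the evaluation window and harness, and the flag names shown (-gs, -ed) are from the exllamav2 version I'm assuming and may differ in yours:

python3 test_inference.py -m models/matatonic_Xwin-LM-70B-V0.1-exl2-4.800b \
 -gs 20,23.5 \
 -ed parquet/wikitext-test.parquet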
