OOM when trying to load the Q8 in Oobabooga (I have 98GB VRAM)

#1
by AIGUYCONTENT

As the title says...

for the: llama-3-70B-Instruct-abliterated-Q8_0-00001-of-00003.gguf

I downloaded all three files and placed into /models folder in oobabooga. I did not concatenate.

I have 98GB of VRAM (4090, 4080, 3090, 3090, 3080) in a SuperMicro server with an EPYC 7F52 and 32GB of DDR4-3200 RAM.

I am NOT trying to load the model into RAM or run it on the CPU; GPU only. For some strange reason I'm having issues running EXL2 files and getting an error message with those (pasted at the bottom of this post), which is why I'm trying to fit a huge Q8 into my available VRAM.

I'm getting an OOM error within a second or two of trying to load the file.

I'm also seeing this error message: AttributeError: 'LlamaCppModel' object has no attribute 'model'

This is a new server build. However, everything is updated (CUDA etc) and other GGUF files work.

Tensor_split is: 20,20,7,14,20.

I have the following checked when loading the model:

  • flash_attn

  • tensorcores

  • no-mmap

  • n-gpu-layers: 81

  • n_ctx: 5120

  • n_batch: 512

  • threads: 16

  • threads_batch: 32

  • rope_freq_base: 500000
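
For reference, here is roughly what those settings translate to as a direct llama-cpp-python call, as far as I understand the loader Oobabooga wraps (the path is mine, and the mapping of the checkboxes is my assumption):

```python
# Rough equivalent of the settings above via llama-cpp-python, which
# Oobabooga's llama.cpp loader wraps. The mapping is my best understanding.
from llama_cpp import Llama

llm = Llama(
    # Pointing at the first shard is enough: llama.cpp picks up the
    # -00002/-00003 files on its own, so not concatenating was correct.
    model_path="models/llama-3-70B-Instruct-abliterated-Q8_0-00001-of-00003.gguf",
    n_gpu_layers=81,
    tensor_split=[20, 20, 7, 14, 20],
    n_ctx=5120,
    n_batch=512,
    n_threads=16,
    n_threads_batch=32,
    rope_freq_base=500000,
    flash_attn=True,
    use_mmap=False,   # the "no-mmap" checkbox
    # "tensorcores" selects a different prebuilt wheel rather than a
    # load-time flag, as far as I can tell, so it has no parameter here.
)
```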

Here is the terminal output:

ggml_backend_cuda_buffer_type_alloc_buffer: allocating 17341.25 MiB on device 1: cudaMalloc failed: out of memory
llama_model_load: error loading model: unable to allocate backend buffer
llama_load_model_from_file: failed to load model
13:12:15-377410 ERROR Failed to load the model.


~$ nvidia-smi
Sat Jun 29 13:19:09 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01 Driver Version: 535.183.01 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 3090 Off | 00000000:01:00.0 Off | N/A |
| 81% 46C P0 110W / 350W | 282MiB / 24576MiB | 3% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA GeForce RTX 4090 Off | 00000000:81:00.0 On | Off |
| 79% 38C P2 81W / 450W | 479MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA GeForce RTX 3080 Off | 00000000:82:00.0 Off | N/A |
| 74% 37C P0 98W / 380W | 12MiB / 10240MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA GeForce RTX 4080 Off | 00000000:C1:00.0 Off | N/A |
| 75% 35C P2 51W / 340W | 257MiB / 16376MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 4 NVIDIA GeForce RTX 3090 Off | 00000000:C2:00.0 Off | N/A |
| 74% 43C P0 110W / 350W | 12MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 2467 G /usr/lib/xorg/Xorg 144MiB |
| 0 N/A N/A 2664 G /usr/bin/gnome-shell 43MiB |
| 0 N/A N/A 4983 G /usr/bin/nvidia-settings 0MiB |
| 0 N/A N/A 24058 G ...onEnabled --variations-seed-version 76MiB |
| 1 N/A N/A 2467 G /usr/lib/xorg/Xorg 45MiB |
| 1 N/A N/A 25365 C /usr/NX/bin/nxnode.bin 419MiB |
| 2 N/A N/A 2467 G /usr/lib/xorg/Xorg 4MiB |
| 3 N/A N/A 2467 G /usr/lib/xorg/Xorg 4MiB |
| 3 N/A N/A 2615 C+G ...libexec/gnome-remote-desktop-daemon 239MiB |
| 4 N/A N/A 2467 G /usr/lib/xorg/Xorg 4MiB |
+---------------------------------------------------------------------------------------+

Thanks if you can help.

BTW, the error message I get when trying to load EXL2 files is: UserWarning: do_sample is set to False. However, min_p is set to 0.0 -- this flag is only used in sample-based generation modes. You should set do_sample=True or unset min_p.
warnings.warn(

This is no matter what EXL2 file I try to load.
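
From what I can tell, that's a transformers sampler-settings warning rather than a loader error in itself; a combination like the following shouldn't trigger it (the values are placeholders, and whether this warning is actually what's killing the EXL2 loads I can't say):

```python
# My understanding of the warning: min_p is set while do_sample=False, and
# greedy decoding ignores sampling knobs. Enabling sampling (or unsetting
# min_p) silences it. Values below are placeholders, not recommendations.
from transformers import GenerationConfig

cfg = GenerationConfig(do_sample=True, min_p=0.05, temperature=0.7)
```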

That's a tricky one... your tensor split may be causing the issue. Can you try raising it a bit, like 22,22,8,14,22? As it stands you've only allotted 81GB across the split, whereas the unloaded file is already 75GB; I'm pretty sure there's overhead for splitting, and then of course there's the context. Do you see the same if you go for, say, Q6_K?

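To put rough numbers on it (treating the split values as GiB, which is how I'm reading them above, and using Llama-3-70B's published shape of 80 layers, 8 KV heads, head dim 128; the overhead items are estimates):

```python
# Back-of-the-envelope VRAM budget for the Q8_0 load attempt.
GIB = 1024**3

split = [20, 20, 7, 14, 20]   # GiB granted per GPU
weights_gib = 75              # the three Q8_0 shards on disk

# f16 KV cache: 2 (K and V) * layers * ctx * kv_heads * head_dim * 2 bytes
n_layers, n_ctx, n_kv_heads, head_dim = 80, 5120, 8, 128
kv_gib = 2 * n_layers * n_ctx * n_kv_heads * head_dim * 2 / GIB

print(f"split total: {sum(split)} GiB")   # 81 GiB
print(f"weights:     {weights_gib} GiB")
print(f"kv cache:    {kv_gib:.2f} GiB")   # ~1.56 GiB
print(f"headroom:    {sum(split) - weights_gib - kv_gib:.2f} GiB")  # ~4.4 GiB
# ...and that headroom still has to cover compute buffers, a few hundred MiB
# of CUDA context per card, and whatever the desktop session is holding.
```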

I just downloaded the Q6. I had to lower n_gpu_layers to 55 before I could get it to load, and it runs at 0.96 tokens per second.

For comparison, I was able to run most of your Q5/Q6 quants with all 81/81 layers offloaded using just a 4090 and two 3090s... on my gaming PC, at ~10 tokens per second.

I wonder if this is something to do with Oobabooga and how it deals with multiple GPUs. One of the 3090s shows only 19% of its dedicated memory in use.

I also noticed this message in the terminal the second I loaded the Llama 3 70B Instruct Abliterated or the Higgs GGUF:

"Detected duplicate leading "<|begin_of_text|>" in prompt, this will likely reduce response quality, consider removing it..."

I might just scrap this Oobabooga installation and re-install.

Do you know of a good recommended front end for Aphrodite? I was told it handles multi-GPU much better. I was unable to get it working with Kobold Lite last night (had to get ChatGPT to help).

Thanks for all your hard work!!!

When my system was running well (ironically on my gaming PC, not the purpose-built server I put together last week), your Q5_K_M Llama 3 70B Instruct Abliterated quant was my go-to quant for work!
