Glacially slow on an RTX 4090??

#1
by clevnumb - opened

What am I doing wrong? I have Oobabooga (Win 11), I AM loading 8-bit @ 4096 context and it's unusable due to the response time...

Owner

Huh, I have a 3090; after loading it with exllamav2_HF it takes ~23GB of VRAM. But this model's default context size is 200k, so check whether the context size is still at the default value - that will indeed slow things down - and if so, change the context to 4096 tokens (the max_seq_len slider in ooba).
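For reference, a minimal sketch of what that slider amounts to, assuming the standalone exllamav2 Python API rather than ooba's internals (the model path is hypothetical): the KV cache is sized from config.max_seq_len, so leaving the 200k default allocates a huge cache and easily spills past 24GB of VRAM.

```python
# Sketch only: load an exl2 quant with the context capped at 4096 tokens.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache

config = ExLlamaV2Config()
config.model_dir = "models/Kyllene-57B-v1.0-exl2"  # hypothetical local path
config.prepare()                                   # reads the model's config.json
config.max_seq_len = 4096                          # override the 200k default

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)           # cache size follows max_seq_len
model.load_autosplit(cache)                        # split layers across available VRAM
```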

Yeah, it's set to a 4096 token size and it's amazingly slow... slower than any 70B 2.4bpw model I currently have working through exllamav2_HF... but I will test it more, re-check my settings, and let you know!

OK, another thing that comes to mind - ooba sometimes gets broken. Later today I'll check my version and Python dependencies.

My ooba is updated to this commit; it's not the newest one but it works (git log):

commit 4b25acf58f78ee8821fc5bf325f602583bfa513f (HEAD -> main, origin/main, origin/HEAD)
Merge: 11288d1 c1b99f4
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date: Thu Dec 21 20:22:48 2023 -0300

Merge pull request #5039 from oobabooga/dev

Merge dev branch

My Python venv dependencies are as follows (pip freeze > requirements_dump.txt):
https://pastebin.com/6BM5q9wG
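If you want to compare against my environment quickly, a small hedged check like this (run from inside ooba's venv) prints the packages that usually cause slowdowns and whether CUDA is actually visible; adjust the package list as needed:

```python
# Sanity check for the usual suspects in a slow setup.
import importlib.metadata as md
import torch

for pkg in ("torch", "exllamav2", "transformers"):
    try:
        print(pkg, md.version(pkg))
    except md.PackageNotFoundError:
        print(pkg, "not installed")

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
```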

My Nvidia driver version: 31.0.15.4592 (not the newest one)

I've redownloaded the quant and after running it with ooba I get:

loader options: https://pasteboard.co/LTYDTd3grwuE.png
memory usage: https://pasteboard.co/mxeTV65hGYVt.png
performance: https://pasteboard.co/a8LW5u4cEqXC.png

So if you can't spot a difference in the ooba commit version and pip dependencies, I can only suggest trying out the GGUF quants (maybe running koboldcpp): https://huggingface.co/TeeZee/Kyllene-57B-v1.0-GGUF, https://github.com/LostRuins/koboldcpp.
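If you go the GGUF route and prefer Python over koboldcpp, a rough llama-cpp-python sketch looks like this (the quant filename is a guess - pick whichever file from that repo fits your VRAM):

```python
# Sketch: run a GGUF quant with GPU offload and the context capped at 4096.
from llama_cpp import Llama

llm = Llama(
    model_path="models/Kyllene-57B-v1.0.Q4_K_M.gguf",  # hypothetical quant name
    n_ctx=4096,        # same context cap as in ooba
    n_gpu_layers=-1,   # offload as many layers as fit on the GPU
)

out = llm("### Instruction:\nSay hello.\n\n### Response:\n", max_tokens=64)
print(out["choices"][0]["text"])
```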

I hope you get it sorted; it's probably some dependency issue.

Thanks for the info, good to have that, but I resolved it by making sure to launch CMD_Windows.BAT before launching my Start_Windows.bat and arguments... I guess it wasn't running in the correct environment before, and that was messing something up. It maintains speed for about a dozen responses and then gets markedly slower, but it's much better than before!
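For anyone hitting the same thing, a quick hedged check that the webui is really running from ooba's bundled environment (the expected path is an assumption based on the one-click installer layout):

```python
# Run this from the same prompt you start the webui from.
import sys
import torch

print(sys.executable)  # expect something under text-generation-webui\installer_files\env
print(torch.__version__, torch.version.cuda)
print(torch.cuda.is_available())
```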

What instruction template should I be using? Ooba says "It seems to be an instruction-following model with template "Custom (obtained from model metadata)". In the chat tab, instruct or chat-instruct modes should be used."

With the Alpaca template chosen, every response I get from a character ends with a literal </s> tag.
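A hedged illustration of why the literal tag can show up (not ooba's actual code path, and the repo id is assumed from the GGUF link above): if special tokens aren't skipped when decoding, the end-of-sequence token is printed as text.

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("TeeZee/Kyllene-57B-v1.0")  # assumed repo id
ids = tok.encode("Hello there.") + [tok.eos_token_id]
print(tok.decode(ids, skip_special_tokens=False))  # ...Hello there.</s> (exact EOS string depends on the tokenizer)
print(tok.decode(ids, skip_special_tokens=True))   # ...Hello there.
```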

Thank you.

Owner

I'm glad to hear that. If you use only ooba, the Alpaca format works; in SillyTavern, the Roleplay or Story formats work best. Overall, due to the diverse formats in the upstream models, almost all formats should work (ChatML, Alpaca, LimaRP).
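For reference, the Alpaca-style prompt that gets applied when that template is selected looks roughly like this (the exact wording varies a bit between presets):

```python
def alpaca_prompt(instruction: str) -> str:
    # Build a standard Alpaca-format prompt for a single instruction.
    return (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
        f"### Instruction:\n{instruction}\n\n"
        "### Response:\n"
    )

print(alpaca_prompt("Write a short greeting."))
```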
