RTX 3090 24GB working with extra env var
#4 opened by cktlco
Thanks for the great work!
FYI that I was able to get the README demo script to run on Linux with a RTX 3090 24GB only after using the following env var to avoid a CUDA OOM:
```
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python readme_demo.py
```
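The same option can also be set from Python, as long as it happens before torch is imported; a minimal sketch (the commented-out torch import stands in for the demo script's own imports):

```python
import os

# The CUDA caching allocator reads PYTORCH_CUDA_ALLOC_CONF once, at torch
# initialization, so this must run before any `import torch`.
# expandable_segments:True lets allocations grow existing segments rather
# than reserving new fixed-size blocks, which reduces fragmentation-driven
# OOMs on cards that are close to the model's memory footprint.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

# import torch  # must come only after the variable is set
```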
Awesome
It's running with this change, but it's extremely slow for me on an L4.
Agreed, it's too slow to be usable for anything other than ad hoc tests.
As an alternative for enthusiasts, here is a 4-bit quantized version of molmo-7B which fits in ~12GB VRAM and is much more responsive:
https://huggingface.co/cyan2k/molmo-7B-O-bnb-4bit
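A sketch of loading that pre-quantized checkpoint with transformers; the repo id comes from the link above, but the exact kwargs (e.g. `trust_remote_code`, which Molmo needs for its custom modeling code) are assumptions, not a tested recipe:

```python
MODEL_ID = "cyan2k/molmo-7B-O-bnb-4bit"  # the repo linked above


def load_quantized_molmo(model_id: str = MODEL_ID):
    """Load the 4-bit Molmo checkpoint (assumed ~12GB VRAM).

    The checkpoint already stores bitsandbytes 4-bit weights, so no extra
    quantization_config should be needed. Lazy import keeps this sketch
    importable without transformers installed.
    """
    from transformers import AutoModelForCausalLM, AutoProcessor

    processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        trust_remote_code=True,
        device_map="auto",  # place the quantized weights on the GPU
    )
    return model, processor
```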
Nice! Yeah, the transformers integration is unfortunately extremely slow, as it for-loops through the experts; we need to integrate it into vLLM/SGLang/llama.cpp like the other OLMoE models.
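To illustrate why that loop is slow, here is a toy numpy sketch (not the actual OLMoE code) of per-expert dispatch: each expert gets its own small matmul inside a Python-level loop, instead of one grouped kernel covering all experts at once.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, n_experts = 8, 4, 4
x = rng.standard_normal((n_tokens, d_model))
# One toy weight matrix per expert (real MoE experts are MLPs).
experts = [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]
# Toy top-1 routing: each token is assigned to exactly one expert.
assign = rng.integers(0, n_experts, size=n_tokens)

out = np.zeros_like(x)
for e in range(n_experts):  # the slow part: a Python loop over experts,
    mask = assign == e      # launching one tiny matmul per expert
    if mask.any():
        out[mask] = x[mask] @ experts[e]
```

Fast inference engines instead sort tokens by expert and run grouped/batched GEMMs, which is why the vLLM/SGLang/llama.cpp route is so much faster.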