RTX 3090 24GB working with extra env var


Thanks for the great work!

FYI: I was only able to get the README demo script to run on Linux with an RTX 3090 24GB after setting the following env var to avoid a CUDA OOM:

```bash
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python readme_demo.py
```
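
If you'd rather not prefix the command every time, setting the variable at the top of the script should be equivalent, as long as it happens before PyTorch makes its first CUDA allocation. Just a sketch of that approach:

```python
# Sketch: equivalent to the command-line env var, provided it runs
# before PyTorch initializes CUDA (i.e. before any model loading).
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch  # import torch only after the variable is set, to be safe
```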

It's running with this change, but it's extremely slow for me on an L4.

Agreed, it's too slow to be usable for anything other than ad hoc tests.

As an alternative for enthusiasts, here is a 4-bit quantized version of molmo-7B which fits in ~12GB VRAM and is much more responsive:
https://huggingface.co/cyan2k/molmo-7B-O-bnb-4bit
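
I haven't checked that repo's exact instructions, but a bitsandbytes 4-bit checkpoint like this should load through the usual transformers entry points. The arguments below are just a sketch of the standard pattern, not copied from the repo:

```python
# Sketch of loading a bnb-4bit checkpoint with transformers; needs
# `pip install transformers accelerate bitsandbytes`. Exact arguments
# may differ from the repo's own README.
from transformers import AutoModelForCausalLM, AutoProcessor

repo = "cyan2k/molmo-7B-O-bnb-4bit"

processor = AutoProcessor.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo,
    trust_remote_code=True,  # Molmo ships custom modeling code
    device_map="auto",       # place layers on the available GPU(s)
)
```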

Nice! Yeah, the transformers integration is unfortunately extremely slow because it loops over the experts in a Python for loop; we need to integrate it into vLLM/SGLang/llama.cpp like the other OLMoE models.
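
For anyone curious why that loop hurts so much, here is a rough illustration of the pattern (not the actual OLMoE modeling code): each expert gets its own small matmul launched from Python instead of one fused kernel over all routed tokens.

```python
# Illustrative sketch of a naive MoE feed-forward with a Python loop over
# experts -- the pattern that makes the current path slow. Not the real code.
import torch
import torch.nn as nn

def naive_moe_forward(hidden, router, experts, top_k=2):
    # hidden: [num_tokens, dim]
    probs = router(hidden).softmax(dim=-1)          # [tokens, num_experts]
    weights, indices = probs.topk(top_k, dim=-1)    # per-token expert choices
    out = torch.zeros_like(hidden)
    for e, expert in enumerate(experts):            # serial Python loop
        mask = (indices == e).any(dim=-1)           # tokens routed to expert e
        if mask.any():
            w = weights[mask][indices[mask] == e].unsqueeze(-1)
            out[mask] += w * expert(hidden[mask])   # one small matmul per expert
    return out

# Tiny usage example with random weights
dim, num_experts = 16, 8
router = nn.Linear(dim, num_experts)
experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_experts))
print(naive_moe_forward(torch.randn(4, dim), router, experts).shape)
```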
