RTX 3090 24GB working with extra env var


Thanks for the great work!

FYI: I was only able to get the README demo script to run on Linux with an RTX 3090 24GB after setting the following env var to avoid a CUDA OOM:

```bash
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python readme_demo.py
```
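
If you'd rather not prefix the command every time, setting the variable at the top of the script should be equivalent, as long as it happens before PyTorch makes its first CUDA allocation. Just a sketch of that approach:

```python
# Sketch: equivalent to the command-line env var, provided it runs
# before PyTorch initializes CUDA (i.e. before any model loading).
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch  # import torch only after the variable is set, to be safe
```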

It's running with this change, but it's extremely slow for me on an L4.

Agreed, it's too slow to be usable for anything other than ad hoc tests.

As an alternative for enthusiasts, here is a 4-bit quantized version of molmo-7B which fits in ~12GB VRAM and is much more responsive:
https://huggingface.co/cyan2k/molmo-7B-O-bnb-4bit
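
I haven't checked that repo's exact instructions, but a bitsandbytes 4-bit checkpoint like this should load through the usual transformers entry points. The arguments below are just a sketch of the standard pattern, not copied from the repo:

```python
# Sketch of loading a bnb-4bit checkpoint with transformers; needs
# `pip install transformers accelerate bitsandbytes`. Exact arguments
# may differ from the repo's own README.
from transformers import AutoModelForCausalLM, AutoProcessor

repo = "cyan2k/molmo-7B-O-bnb-4bit"

processor = AutoProcessor.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo,
    trust_remote_code=True,  # Molmo ships custom modeling code
    device_map="auto",       # place layers on the available GPU(s)
)
```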

Nice! Yeah, the transformers integration is unfortunately extremely slow because it loops over the experts in a Python for loop; we need to integrate it into vLLM/SGLang/llama.cpp like the other OLMoE models.
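
For anyone curious why that loop hurts so much, here is a rough illustration of the pattern (not the actual OLMoE modeling code): each expert gets its own small matmul launched from Python instead of one fused kernel over all routed tokens.

```python
# Illustrative sketch of a naive MoE feed-forward with a Python loop over
# experts -- the pattern that makes the current path slow. Not the real code.
import torch
import torch.nn as nn

def naive_moe_forward(hidden, router, experts, top_k=2):
    # hidden: [num_tokens, dim]
    probs = router(hidden).softmax(dim=-1)          # [tokens, num_experts]
    weights, indices = probs.topk(top_k, dim=-1)    # per-token expert choices
    out = torch.zeros_like(hidden)
    for e, expert in enumerate(experts):            # serial Python loop
        mask = (indices == e).any(dim=-1)           # tokens routed to expert e
        if mask.any():
            w = weights[mask][indices[mask] == e].unsqueeze(-1)
            out[mask] += w * expert(hidden[mask])   # one small matmul per expert
    return out

# Tiny usage example with random weights
dim, num_experts = 16, 8
router = nn.Linear(dim, num_experts)
experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_experts))
print(naive_moe_forward(torch.randn(4, dim), router, experts).shape)
```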
