Can the model be sharded over several GPUs?

#13
by silverfisk

Is there some way of sharding this model across my two GPUs?
I have a Quadro RTX 5000 laptop GPU with 16 GB VRAM and an NVIDIA GeForce RTX 3090 eGPU with 24 GB. When loading the model, it seems to try to load everything onto the 3090 only.

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 128.00 MiB (GPU 0; 23.69 GiB total capacity; 22.05 GiB already allocated; 117.69 MiB free; 23.21 GiB reserved in total by PyTorch)

I thought the model would be split automagically, since I've seen other, smaller models load onto both GPUs with the same code. But maybe this is a newbie question; I only got started with this yesterday.

Update:
I got it to load without OOM now. It's very slow, but at least no OOM so far. Adding it here in case it's useful for someone else.
The device_map="auto" argument below did it:

from auto_gptq import AutoGPTQForCausalLM

# device_map="auto" lets accelerate spread the quantized layers across both GPUs
model = AutoGPTQForCausalLM.from_quantized(
    model_id,
    model_basename=model_basename,
    use_safetensors=True,
    trust_remote_code=True,
    device_map="auto",
    use_triton=False,
    quantize_config=None,
)
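
If the split between the two cards looks unbalanced, from_quantized also accepts an accelerate-style max_memory dict that caps what device_map="auto" may place on each GPU. The sketch below is just an illustration of that: the per-GPU budgets and the hf_device_map check at the end are assumptions on my part, not tuned values, so adjust them for your own cards.

from auto_gptq import AutoGPTQForCausalLM

model = AutoGPTQForCausalLM.from_quantized(
    model_id,
    model_basename=model_basename,
    use_safetensors=True,
    trust_remote_code=True,
    device_map="auto",
    # Cap per-GPU usage (accelerate convention: GPU index -> budget).
    # In my setup GPU 0 is the 24 GB 3090 and GPU 1 the 16 GB Quadro;
    # these budgets are illustrative, leave headroom for activations.
    max_memory={0: "22GiB", 1: "14GiB"},
    use_triton=False,
    quantize_config=None,
)

# The underlying transformers model usually records where each layer landed;
# this prints None if this auto_gptq version doesn't attach the attribute.
print(getattr(model.model, "hf_device_map", None))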
