Memory Error While Fine-tuning AYA on 8 H100 GPUs

#23
by ArmanAsq - opened

Hello,

I am currently trying to fine-tune an Aya model on 8 H100 GPUs, but I'm running into an out-of-memory error. My system has 640 GB of total GPU memory (8 × 80 GB), which I assumed would be sufficient for this task. I'm not using PEFT or LoRA, and my batch size is set to 1.
Has anyone run into a similar issue and could offer some guidance? How many GPUs are typically recommended for full fine-tuning? Any help would be greatly appreciated.
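As a rough sanity check, the memory math for full fine-tuning shows why the 640 GB aggregate can still run out: without sharding (ZeRO-3 / FSDP), each GPU holds a complete replica of the weights, gradients, and optimizer states, so the 80 GB on a single H100 is the binding limit. Below is a minimal back-of-the-envelope sketch, assuming a ~13B-parameter checkpoint and mixed-precision AdamW (both assumptions; the exact Aya variant and optimizer aren't stated above), with activation memory ignored:

```python
# Rough memory estimate for full fine-tuning with AdamW in mixed precision.
# Assumptions (not from this thread): ~13B parameters, bf16 weights/grads,
# fp32 master weights and fp32 Adam moments; activations are ignored.

def full_finetune_state_gb(n_params: float) -> float:
    """GB needed for one full copy of model + optimizer state."""
    bytes_per_param = (
        2      # bf16 weights
        + 2    # bf16 gradients
        + 4    # fp32 master weights
        + 4    # fp32 Adam first moment (m)
        + 4    # fp32 Adam second moment (v)
    )
    return n_params * bytes_per_param / 1e9

n_params = 13e9          # assumed parameter count
n_gpus = 8
gpu_mem_gb = 80          # H100 80 GB

total_state = full_finetune_state_gb(n_params)
print(f"Full model + optimizer state: ~{total_state:.0f} GB")

# Plain DDP replicates the full state on every GPU, so each card needs
# the whole ~208 GB and the 80 GB per H100 is exceeded -> OOM.
print(f"Per GPU with DDP:    ~{total_state:.0f} GB (> {gpu_mem_gb} GB)")

# ZeRO stage 3 / FSDP full sharding splits the state across the 8 GPUs,
# leaving headroom for activations on each card.
print(f"Per GPU with ZeRO-3: ~{total_state / n_gpus:.0f} GB (+ activations)")
```

Under these assumptions the state alone is about 208 GB, which no single 80 GB card can hold unless it is sharded across the group, which is likely why the run fails even with batch size 1.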

Thanks in advance!

Cohere For AI org

Hey @ArmanAsq

I think I answered your question on our Discord, so I'm closing this one for now :)

shivi changed discussion status to closed
