Error using multi-gpu support

#26
by bobwhiterabbit - opened

The following code (the example code from the model card), meant to load the model on multiple GPUs (since the model doesn't fit on a single one of my RTX 3090s), doesn't work: the embedding model still gets loaded into CPU RAM.

import torch.nn.functional as F
from transformers import AutoModel
from torch.nn import DataParallel


# Each query needs to be accompanied by a corresponding instruction describing the task.
task_name_to_instruct = {"example": "Given a question, retrieve passages that answer the question",}

query_prefix = "Instruct: "+task_name_to_instruct["example"]+"\nQuery: "
queries = [
    'are judo throws allowed in wrestling?',
    'how to become a radiology technician in michigan?'
    ]

# No instruction needed for retrieval passages
passage_prefix = ""
passages = [
    "Since you're reading this, you are probably someone from a judo background or someone who is just wondering how judo techniques can be applied under wrestling rules. So without further ado, let's get to the question. Are Judo throws allowed in wrestling? Yes, judo throws are allowed in freestyle and folkstyle wrestling. You only need to be careful to follow the slam rules when executing judo throws. In wrestling, a slam is lifting and returning an opponent to the mat with unnecessary force.",
    "Below are the basic steps to becoming a radiologic technologist in Michigan:Earn a high school diploma. As with most careers in health care, a high school education is the first step to finding entry-level employment. Taking classes in math and science, such as anatomy, biology, chemistry, physiology, and physics, can help prepare students for their college studies and future careers.Earn an associate degree. Entry-level radiologic positions typically require at least an Associate of Applied Science. Before enrolling in one of these degree programs, students should make sure it has been properly accredited by the Joint Review Committee on Education in Radiologic Technology (JRCERT).Get licensed or certified in the state of Michigan."
]

# load model with tokenizer
model = AutoModel.from_pretrained("nvidia_nv-embed-v1", trust_remote_code=True)
for module_key, module in model._modules.items():
    model._modules[module_key] = DataParallel(module)

# get the embeddings
max_length = 4096

# get the embeddings with DataLoader (splitting the dataset into multiple mini-batches)
batch_size = 5
query_embeddings = model._do_encode(queries, batch_size=batch_size, instruction=query_prefix, max_length=max_length)
passage_embeddings = model._do_encode(passages, batch_size=batch_size, instruction=passage_prefix, max_length=max_length)

# normalize embeddings
query_embeddings = F.normalize(query_embeddings, p=2, dim=1)
passage_embeddings = F.normalize(passage_embeddings, p=2, dim=1)

scores = (query_embeddings @ passage_embeddings.T) * 100
print(scores.tolist())


I also tried this, but it doesn't run on multiple GPUs; it still runs on just one GPU. Why is that, and how can I make it run on multiple GPUs?

Thanks for the question. The example below shows the full script for a multi-GPU implementation.

import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel
from torch.nn import DataParallel

# Each query needs to be accompanied by a corresponding instruction describing the task.
task_name_to_instruct = {"example": "Given a question, retrieve passages that answer the question",}

query_prefix = "Instruct: "+task_name_to_instruct["example"]+"\nQuery: "
queries = [
    'are judo throws allowed in wrestling?', 
    'how to become a radiology technician in michigan?'
    ]

# No instruction needed for retrieval passages
passage_prefix = ""
passages = [
    "Since you're reading this, you are probably someone from a judo background or someone who is just wondering how judo techniques can be applied under wrestling rules. So without further ado, let's get to the question. Are Judo throws allowed in wrestling? Yes, judo throws are allowed in freestyle and folkstyle wrestling. You only need to be careful to follow the slam rules when executing judo throws. In wrestling, a slam is lifting and returning an opponent to the mat with unnecessary force.",
    "Below are the basic steps to becoming a radiologic technologist in Michigan:Earn a high school diploma. As with most careers in health care, a high school education is the first step to finding entry-level employment. Taking classes in math and science, such as anatomy, biology, chemistry, physiology, and physics, can help prepare students for their college studies and future careers.Earn an associate degree. Entry-level radiologic positions typically require at least an Associate of Applied Science. Before enrolling in one of these degree programs, students should make sure it has been properly accredited by the Joint Review Committee on Education in Radiologic Technology (JRCERT).Get licensed or certified in the state of Michigan."
]

# load model with tokenizer
model = AutoModel.from_pretrained('nvidia/NV-Embed-v1', trust_remote_code=True)
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model.to(device)

for module_key, module in model._modules.items():
    model._modules[module_key] = DataParallel(module)

# get the embeddings
max_length = 4096

# get the embeddings with DataLoader (splitting the dataset into multiple mini-batches)
batch_size = 2
query_embeddings = model._do_encode(queries, batch_size=batch_size, instruction=query_prefix, max_length=max_length, num_workers=32)
passage_embeddings = model._do_encode(passages, batch_size=batch_size, instruction=passage_prefix, max_length=max_length, num_workers=32)

scores = (query_embeddings @ passage_embeddings.T) * 100
print(scores.tolist())

Hi Nada5,

Thanks for providing the sample code above! I have some follow-up questions. I'm new to NLP and to this particular model, so please correct me if I'm not making sense.

  • The code above uses DataParallel for multi-GPU runs. My understanding is that, when encoding in a forward pass, DataParallel loads the model onto all assigned GPUs and splits the input so each GPU handles a portion of it.
  • In my case, I have several 40 GB GPUs. I have observed that loading the model alone occupies about 31,400 MB of GPU memory. Then, encoding even the simplest sentence (for example, a one-liner) results in a GPU out-of-memory error. Thus, what I'm looking for is a way to load the model across multiple small GPUs, something like device_map='auto'. However, NVEmbedModel does not support device_map='auto'.
  • Is there a plan to support device_map='auto'? Or, what are the alternative approaches to load the model across multiple small GPUs? (A rough sketch of what I have in mind is below.)
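
To make the request concrete, here is a rough sketch of the kind of layer-wise sharding I have in mind, using accelerate's infer_auto_device_map and dispatch_model in place of the unsupported device_map='auto'. This is only an assumption on my part: I have not verified that NVEmbedModel's custom remote code and its _do_encode helper tolerate a dispatched (multi-device) model, and the memory caps below are placeholders, not measured values.

import torch
from transformers import AutoModel
from accelerate import infer_auto_device_map, dispatch_model

# hypothetical sketch: shard the model layer-wise across two smaller GPUs
model = AutoModel.from_pretrained('nvidia/NV-Embed-v1', trust_remote_code=True, torch_dtype=torch.float16)

# cap per-GPU memory so the layers get spread over several cards
# (the "20GiB" figures are placeholders, not measured values)
device_map = infer_auto_device_map(model, max_memory={0: "20GiB", 1: "20GiB"})
model = dispatch_model(model, device_map=device_map)

# same encoding call as in the examples above; whether it still works
# after dispatch_model is exactly my question
query_prefix = "Instruct: Given a question, retrieve passages that answer the question\nQuery: "
query_embeddings = model._do_encode(['are judo throws allowed in wrestling?'], batch_size=1, instruction=query_prefix, max_length=4096)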

Thank you in advance

Hi Nada5, thanks for your answer, but the code sample you provided doesn't work for multi-GPU. You allocate the model to the first GPU, and in my case these are 24 GB cards, so I get a CUDA out-of-memory error. Using multiple GPUs really comes down to what Clonylu said: implementing device_map='auto'. Thanks!
