How to load bge-reranker-v2.5-gemma2-lightweight across multiple GPUs

by dingguofeng - opened 19 days ago

Discussion

dingguofeng

19 days ago

Hello, and thank you for this excellent work. How can this model be loaded across multiple GPUs? A single 24 GB 4090 does not seem able to load it in float32.

cfli

18 days ago

You can pass torch_dtype=torch.float16 to AutoModelForCausalLM.from_pretrained when loading the model; it then needs roughly 11 GB of GPU memory to load.
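A minimal sketch of the suggestion above, in case it helps. The repo id `BAAI/bge-reranker-v2.5-gemma2-lightweight`, the `trust_remote_code=True` flag, and the helper names `estimate_weight_gib` and `load_reranker` are assumptions for illustration, not something confirmed in this thread. The estimate only covers the weights themselves (4 bytes per parameter in float32, 2 in float16 or bfloat16), so actual usage will be higher.

```python
# Bytes used per parameter for the dtypes discussed in this thread.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2}


def estimate_weight_gib(num_params, dtype_name):
    """Rough size of the weights alone, in GiB (no activations, no CUDA overhead)."""
    return num_params * BYTES_PER_PARAM[dtype_name] / 1024**3


def load_reranker(model_name="BAAI/bge-reranker-v2.5-gemma2-lightweight",
                  dtype_name="float16", device="cuda:0"):
    """Load the model in the requested dtype on a single device."""
    # Imported lazily so the estimate above works without torch installed.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=getattr(torch, dtype_name),  # e.g. torch.float16
        trust_remote_code=True,  # assumption: the repo ships custom modeling code
    )
    return tokenizer, model.to(device).eval()
```

Calling `load_reranker()` with the defaults corresponds to the fp16 setup recommended here; passing `dtype_name="float32"` doubles the weight footprint.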

dingguofeng

18 days ago

edited 18 days ago

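Thank you for your reply. I have already tried both torch.float16 and torch.bfloat16, and the model does load on a single 24 GB card. However, I found that when I switch from torch.float16 to torch.bfloat16, the scores drop. When I debugged it, I confirmed that the drop is caused by numerical precision. So for my experiments I want to preserve precision and use torch.float32 (since the model's weights are also float32). I also tried device_map="auto"; it does spread the model across multiple cards, but the data does not follow, and it raises the following error:
mask_attention_index = torch.arange(max_final_lengths, device=hidden_states.device).unsqueeze(0) >= final_useful_lengths[:, None]
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0!
So I would still like to know how to load your model across multiple GPUs. Thanks in advance.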
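Reading the traceback: the left-hand side is created on hidden_states.device, so the length tensor on the right-hand side is presumably the one left on the other card. A minimal sketch of the usual remedy for this kind of error follows. The helper name `align_device` is hypothetical, and it is written against the `.device` / `.to()` interface that torch.Tensor exposes rather than against the model's actual code.

```python
def align_device(reference, tensor):
    """Return `tensor` on the same device as `reference`.

    Works with any objects exposing `.device` and `.to(device)`,
    which is the interface torch.Tensor provides.
    """
    if tensor.device == reference.device:
        return tensor  # already co-located, nothing to copy
    return tensor.to(reference.device)
```

Applied to the failing line, this would mean comparing against `align_device(hidden_states, final_useful_lengths)[:, None]`. That is an untested local patch to the model's remote code, not a fix confirmed by the maintainers, and other lines may hit the same error afterwards.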

cfli

17 days ago

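Sorry, we have not tried multi-GPU loading with device_map="auto" before. Loading the model with torch_dtype=torch.float16 speeds up inference without losing much accuracy, so we recommend loading it that way. Our evaluations were also run in fp16.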
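One way to check how much precision matters for a particular workload is to score the same query-passage pairs under two dtypes and compare the results. The sketch below assumes the scores have already been collected as plain Python floats; `compare_scores` is a hypothetical helper, not part of any library.

```python
def compare_scores(reference, candidate):
    """Compare two score lists for the same query-passage pairs.

    Returns (max absolute difference, True if both lists rank the pairs identically).
    """
    if len(reference) != len(candidate):
        raise ValueError("score lists must have the same length")
    max_abs_diff = max((abs(r - c) for r, c in zip(reference, candidate)), default=0.0)

    def ranking(scores):
        # Indices of the pairs, ordered from highest to lowest score.
        return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

    return max_abs_diff, ranking(reference) == ranking(candidate)
```

If the ranking stays identical between fp32 and fp16 on a representative sample, the lower-precision setup is probably sufficient for reranking, even when the raw scores differ slightly.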

dingguofeng

17 days ago

OK, I will use fp16 then. Thank you very much for your reply.
