Change how quantization.py moves the weights to be quantized

#47

by MikuRabbit - opened Mar 21

base: refs/heads/main ← from: refs/pr/47


MikuRabbit

Mar 21

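Removed the logic in the quantize() method that moved the weights to be quantized to torch.cuda.current_device().
Added a check in the __init__ method of QuantizedLinear for whether the weights to be quantized are on a CUDA device.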

Change how quantization.py moves the weights to be quantized 4e17ef37

MikuRabbit

Mar 21

•

edited Mar 21

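The quantize() method in quantization.py moves the weights to be quantized to torch.cuda.current_device(). Unless the current device has been changed (for example with torch.cuda.set_device()), torch.cuda.current_device() returns index 0, which is physical GPU 0 when the CUDA_VISIBLE_DEVICES environment variable is not set.

In a multi-GPU environment this can cause problems that are hard to track down. For example: GPU 0 is busy with another job and does not have enough free memory. The user loads the model, quantizes it, and places it on GPU 1 with model = AutoModel.from_pretrained("THUDM/chatglm3-6b", trust_remote_code=True).quantize(4).to('cuda:1'). quantize() then tries to move the weights to be quantized to GPU 0, and because GPU 0 does not have enough free memory, the program fails with RuntimeError: CUDA error: out of memory.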
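
The difference in device selection described above can be sketched without PyTorch, using plain device strings. This is only an illustration: old_target_device and new_target_device are hypothetical helper names, not functions in quantization.py, and the choice to reject a non-CUDA weight (rather than move it) is an assumption about how the new check behaves.

```python
def old_target_device(weight_device: str, current_device_index: int = 0) -> str:
    """Behaviour before this PR, as described above: the weight's own device
    is ignored and the current CUDA device (index 0 by default) is used."""
    return f"cuda:{current_device_index}"


def new_target_device(weight_device: str) -> str:
    """Behaviour after this PR, as described above: the weight stays where it is.

    Assumption: a weight that is not on a CUDA device is rejected.
    """
    if not weight_device.startswith("cuda"):
        raise ValueError(f"expected a weight on a CUDA device, got {weight_device!r}")
    return weight_device


# A weight that already lives on GPU 1:
print(old_target_device("cuda:1"))  # cuda:0 (lands on the busy GPU 0)
print(new_target_device("cuda:1"))  # cuda:1 (stays on GPU 1)
```

Under that assumption, the model would have to be placed on the intended GPU before quantize() is called, instead of being moved there afterwards.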

zRzRzRzRzRzRzR changed pull request status to merged Mar 25

qiqi657s

May 9

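The unquantized model fits in my CPU memory, but my GPU does not have enough memory to run it.
So until now I loaded the model on the CPU, quantized it, and then moved it to the GPU for inference.
The quantize-and-move code: model = AutoModel.from_pretrained(MODEL_PATH, trust_remote_code=True).quantize(4).half().cuda().eval()

Today I ran git pull on the model repository and it would no longer start, reporting insufficient GPU memory. After looking into it, I found that the model is now loaded onto the GPU in full and only then quantized. If the GPU cannot hold the full model, loading fails and quantization never happens.
In the end I restored quantization.py to the old version, and it works again.
The old version is:
SHA-1: 6d10497ab99fa606ad954e2530106dd8ec361fe0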
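
A rough back-of-the-envelope estimate shows why loading the full model on the GPU can fail even though the quantized model fits. This is a sketch under stated assumptions: 6 billion parameters is a round figure for a "6b" model, not a measured count, and the estimate covers weights only (no quantization scales, no layers left unquantized, no activations or cache).

```python
GIB = 1024 ** 3
NUM_PARAMS = 6_000_000_000  # assumption: round figure for a "6b" model


def weight_bytes(num_params: int, bits_per_param: int) -> int:
    """Bytes needed to store num_params weights at the given bit width."""
    return num_params * bits_per_param // 8


fp16_bytes = weight_bytes(NUM_PARAMS, 16)  # full model in half precision
int4_bytes = weight_bytes(NUM_PARAMS, 4)   # same weights after 4-bit quantization

print(f"FP16 weights: {fp16_bytes / GIB:.1f} GiB")  # FP16 weights: 11.2 GiB
print(f"INT4 weights: {int4_bytes / GIB:.1f} GiB")  # INT4 weights: 2.8 GiB
```

On a GPU with, say, 8 GiB of memory, the 4-bit weights fit but the full FP16 model does not, which matches the failure reported above once the whole model has to be on the GPU before quantization.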
