
vllm fails to load a BitAndBytes-quantized model

Source: https://blog.csdn.net/yuanlulu/article/details/142027931

ValueError: BitAndBytes quantization with TP or PP is not supported yet

When loading the HF model, I used load_in_8bit to quantize it (under the hood this actually calls bitsandbytes):

import argparse
import os

import torch


def parse_arguments():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--model_path',
        help="model and tokenizer path",
        default='/docker_shared/Baichuan2-7B-Chat-test2',
    )
    return parser.parse_args()


def convert_bin2st_from_pretrained(model_path):
    from transformers import AutoModelForCausalLM
    model = AutoModelForCausalLM.from_pretrained(
        pretrained_model_name_or_path=model_path,
        low_cpu_mem_usage=True,
        trust_remote_code=True,
        torch_dtype=torch.float16,
        load_in_8bit=True)
    model.save_pretrained(model_path, safe_serialization=True)


if __name__ == '__main__':
    args = parse_arguments()
    print(f"convert {args.model_path} into safetensor")
    convert_bin2st_from_pretrained(args.model_path)
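Note that save_pretrained is called with safe_serialization=True and the same model_path, so the safetensors shards are written back into the same directory as the original checkpoint, and that directory is what I point vllm at below.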

Then I loaded the quantized model with vllm, and it errored out:

WARNING 09-07 23:25:16 config.py:318] bitsandbytes quantization is not fully optimized yet. The speed can be slower than non-quantized models.
........
File "/usr/local/lib/python3.10/dist-packages/vllm/config.py", line 353, in verify_with_parallel_configraise ValueError(
ValueError: BitAndBytes quantization with TP or PP is not supported yet.
ERROR 09-07 23:25:19 api_server.py:171] RPCServer process died before responding to readiness probe

This means vllm does not support tensor-parallel acceleration for a bitsandbytes-quantized model, i.e. --tensor-parallel-size must not be greater than 1.
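For reference, the same single-GPU constraint expressed through the offline Python API would look roughly like this. This is only a sketch, not the exact launch I used (I started the API server), and the parameter choices are assumptions based on my setup above:

# A sketch only: load the bitsandbytes-quantized checkpoint on a single GPU,
# since the error above says TP/PP > 1 is not supported for bitsandbytes.
from vllm import LLM, SamplingParams

llm = LLM(
    model="/docker_shared/Baichuan2-7B-Chat-test2",  # path from the conversion step
    quantization="bitsandbytes",
    tensor_parallel_size=1,   # must stay at 1 for bitsandbytes
    trust_remote_code=True,
    enforce_eager=True,       # CUDA graphs are not supported with bitsandbytes (see warning below)
)
print(llm.generate(["hello"], SamplingParams(max_tokens=16))[0].outputs[0].text)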


Loading the model with --tensor-parallel-size 1, I ran into another error:

WARNING 09-07 23:44:11 config.py:318] bitsandbytes quantization is not fully optimized yet. The speed can be slower than non-quantized models.
WARNING 09-07 23:44:11 config.py:357] CUDA graph is not supported on BitAndBytes yet, fallback to the eager mode.
.......
File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/baichuan.py", line 405, in load_weightsparam = params_dict[name]
KeyError: 'model.layers.0.mlp.down_proj.SCB'
Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
ERROR 09-07 23:44:19 api_server.py:171] RPCServer process died before responding to readiness probe

It still failed to load.

Which quantization methods does vllm support?

Looking at vllm's help output shows the quantization methods it supports:

--quantization {aqlm,awq,deepspeedfp,tpu_int8,fp8,fbgemm_fp8,marlin,gguf,gptq_marlin_24,gptq_marlin,awq_marlin,gptq,squeezellm,compressed-tensors,bitsandbytes,qqq,experts_int8,None}, -q {aqlm,awq,deepspeedfp,tpu_int8,fp8,fbgemm_fp8,marlin,gguf,gptq_marlin_24,gptq_marlin,awq_marlin,gptq,squeezellm,compressed-tensors,bitsandbytes,qqq,experts_int8,None}
    Method used to quantize the weights. If None, we first check the `quantization_config` attribute in the model config file. If that is None, we assume the model weights are not quantized and use `dtype` to determine the data type of the weights.

These quantizations are not performed by vllm at startup; the model has to be converted in advance. vllm only supports loading models quantized with these methods; the quantization functionality itself is not part of vllm.
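As a quick sanity check, you can look at whether the conversion step actually wrote a quantization_config into the converted model's config.json, since that is the field vllm falls back to when --quantization is not given (a small sketch; the path is from my setup above):

# A small sketch: print the quantization_config (if any) in the converted
# model's config.json. Per the help text above, vllm consults this field when
# --quantization is not specified.
import json
import os

model_path = "/docker_shared/Baichuan2-7B-Chat-test2"  # path from the conversion step
with open(os.path.join(model_path, "config.json")) as f:
    cfg = json.load(f)
print(cfg.get("quantization_config", "no quantization_config found"))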

The documentation on the quantization methods vllm supports is at https://docs.vllm.ai/en/latest/quantization/supported_hardware.html. That page also covers how to use the various quantization methods.
According to that page, bitsandbytes is supported too, so I'm not sure why my load above failed.
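One guess, which I have not verified: the missing key ends in .SCB, which is the int8 scaling buffer that bitsandbytes stores next to 8-bit weights, so my safetensors files contain an already-int8 checkpoint. My reading of the vllm docs is that its bitsandbytes support quantizes on the fly from the original fp16 weights via load_format="bitsandbytes", rather than loading a checkpoint converted offline with load_in_8bit. A sketch of that route; the flag and the unquantized model path are assumptions, not something I have run:

# A sketch, not verified: let vllm do the bitsandbytes quantization itself
# ("in-flight") from the original fp16 checkpoint, instead of loading weights
# that were already converted to int8 offline.
from vllm import LLM

llm = LLM(
    model="/docker_shared/Baichuan2-7B-Chat",  # hypothetical path to the unquantized model
    quantization="bitsandbytes",
    load_format="bitsandbytes",                # assumption: in-flight bnb loading per the vllm docs
    tensor_parallel_size=1,                    # still required to be 1
    trust_remote_code=True,
)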

