An exception is thrown when running cli_demo even though PyTorch 2.3 is installed; what is the cause? #1259
hotcolaava started this conversation in General
Replies: 0 comments
CUDA extension not installed.
CUDA extension not installed.
Try importing flash-attention for faster inference...
Warning: import flash_attn rotary fail, please install FlashAttention rotary to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/rotary
Warning: import flash_attn rms_norm fail, please install FlashAttention layer_norm to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/layer_norm
Warning: import flash_attn fail, please install FlashAttention to get higher efficiency https://github.com/Dao-AILab/flash-attention
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 3.18it/s]
Traceback (most recent call last):
File "D:\project\Qwen\cli_demo.py", line 210, in
main()
File "D:\project\Qwen\cli_demo.py", line 116, in main
model, tokenizer, config = _load_model_tokenizer(args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\project\Qwen\cli_demo.py", line 54, in _load_model_tokenizer
model = AutoModelForCausalLM.from_pretrained(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\ProgramData\miniconda3\envs\qwen\Lib\site-packages\transformers\models\auto\auto_factory.py", line 561, in from_pretrained
return model_class.from_pretrained(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\ProgramData\miniconda3\envs\qwen\Lib\site-packages\transformers\modeling_utils.py", line 3928, in from_pretrained
model = quantizer.post_init_model(model)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\ProgramData\miniconda3\envs\qwen\Lib\site-packages\optimum\gptq\quantizer.py", line 588, in post_init_model
raise ValueError(
ValueError: Found modules on cpu/disk. Using Exllama or Exllamav2 backend requires all the modules to be on GPU.You can deactivate exllama backend by setting disable_exllama=True in the quantization config object
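
The ValueError comes from optimum's GPTQ post-init check: some of the quantized modules ended up on CPU or disk (typically because the model did not fully fit on the GPU and was offloaded, or because no GPU device map was used), while the ExLlama kernels require every module to be on the GPU. As the message itself suggests, one workaround is to disable the ExLlama backend through the quantization config when loading the model. Below is a minimal sketch of that approach; the checkpoint id, device map, and parameter choices are assumptions for illustration, not taken from cli_demo.py. Recent transformers versions expose the switch as use_exllama=False on GPTQConfig, which is equivalent to the disable_exllama=True named in the error.

from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

# Assumed GPTQ-quantized checkpoint; replace with your local path or model id.
checkpoint = "Qwen/Qwen-7B-Chat-Int4"

tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)

model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    device_map="cuda:0",          # keep every module on one GPU if it fits
    trust_remote_code=True,
    # Disable the ExLlama kernels so CPU/disk-offloaded modules are allowed.
    # On older transformers this was GPTQConfig(bits=4, disable_exllama=True).
    quantization_config=GPTQConfig(bits=4, use_exllama=False),
).eval()

If the model does fit entirely in GPU memory, forcing device_map="cuda:0" (with no CPU offload) also resolves the error without giving up the faster ExLlama path. Separately, the repeated "CUDA extension not installed." lines earlier in the log may indicate that auto-gptq was installed without its CUDA kernels, which is worth checking as well since that slows down quantized inference.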