Improve GPU memory usage but slower inference speed? #8

Open
ys-zong opened this issue Apr 16, 2024 · 0 comments

Comments


ys-zong commented Apr 16, 2024

Hi, thanks for the nice work! I tried to use the following code to enable LM-Infinite for Llama, following the README:

import torch
from transformers import LlamaForCausalLM
from models.llama import convert_llama_model

model = LlamaForCausalLM.from_pretrained('meta-llama/Llama-2-7b-chat-hf', torch_dtype=torch.bfloat16, device_map="cuda", low_cpu_mem_usage=True)
model = convert_llama_model(model, 4096, 10)

and then ran inference as usual. GPU memory usage is lower than with regular attention, but inference becomes much slower (roughly 10x). I'm using an A100 GPU, and I checked GPU utilization: it stays very low, around 10%. I wonder if you have any idea why this happens? Many thanks.
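
For context, this is a minimal sketch of how I'm measuring generation time; the prompt and generation settings below are placeholders rather than my actual benchmark inputs:

import time
import torch
from transformers import AutoTokenizer, LlamaForCausalLM
from models.llama import convert_llama_model

tokenizer = AutoTokenizer.from_pretrained('meta-llama/Llama-2-7b-chat-hf')
model = LlamaForCausalLM.from_pretrained('meta-llama/Llama-2-7b-chat-hf',
                                         torch_dtype=torch.bfloat16,
                                         device_map="cuda",
                                         low_cpu_mem_usage=True)
model = convert_llama_model(model, 4096, 10)  # LM-Infinite conversion as in the README

# Placeholder prompt; my real inputs are long documents.
inputs = tokenizer("Some long prompt ...", return_tensors="pt").to(model.device)

torch.cuda.synchronize()
start = time.time()
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)
torch.cuda.synchronize()
print(f"generation time: {time.time() - start:.2f}s")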
