vLLM generating repeated/duplicate responses #12276
saraswatmks
announced in
Q&A
Replies: 2 comments 1 reply
- Facing the same issue - tried the same change of parameters and removing them.
- This is a really serious issue. Have you tried turning off APC using the following option?
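The exact option was not captured above; as a hedged sketch, recent vLLM versions enable automatic prefix caching (APC) by default and expose a flag to turn it off (on older versions APC is off unless --enable-prefix-caching is passed). The model path here is a placeholder:

```bash
# Sketch only: start the OpenAI-compatible server with automatic prefix
# caching disabled. The flag name follows current vLLM docs and may differ
# on older versions; the model path is a placeholder.
vllm serve /models/llama-70b-chat-finetune \
  --max-model-len 4096 \
  --no-enable-prefix-caching
```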
I am trying to serve a LLaMA 70B fine-tuned model using vLLM on an A100 80GB GPU. The model is fine-tuned for a chat-assistant use case. I notice that after a 10-15 message conversation, the LLM responses start to repeat themselves. Below is an example:
User: Hello
Assistant: How are you?
...
// assume 10-15 messages in between
...
User: How bad is world economy?
Assistant: World economy is really bad these days due to war.
User: What about US economy?
Assistant: World economy is really bad these days due to war.
User: Why are some people good and bad?
Assistant: World economy is really bad these days due to war.
User: Why do men earn higher than women?
Assistant: World economy is really bad these days due to war.
Currently, I want the server to handle 10 concurrent requests, each with a fixed prompt of 1350 tokens. Therefore, I set the following arguments:
--max-model-len 4096
--max-seq-len 10
--max-num-batched-tokens 40960
I also tried bitsandbytes quantization, but I still get the repeated responses. I am using the vllm/vllm-openai:latest Docker image. I tried --enable-prefix-caching as well, but it had no effect. I also tried repetition_penalty, frequency_penalty, and presence_penalty, but still no effect.
Note: I don't get this problem when I deploy the model using llama.cpp; everything seems to work fine there.
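For reference, a minimal sketch of how such a deployment is typically launched with the image and flags mentioned above; the model path, port, and volume mount are placeholders, and --max-num-seqs is used on the assumption that it is the intended flag for capping concurrency at 10:

```bash
# Hypothetical launch command; paths, port, and model name are placeholders.
# --max-num-seqs caps the number of sequences processed concurrently.
# Older vLLM versions may also require --load-format bitsandbytes for
# bitsandbytes quantization.
docker run --gpus all -p 8000:8000 \
  -v /models:/models \
  vllm/vllm-openai:latest \
  --model /models/llama-70b-chat-finetune \
  --max-model-len 4096 \
  --max-num-seqs 10 \
  --max-num-batched-tokens 40960 \
  --quantization bitsandbytes
```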
How can I fix this repetition issue? Looking for some guidance here.
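For completeness, a minimal sketch of how those penalties are passed per request to the OpenAI-compatible server; the endpoint, model name, and penalty values are assumptions rather than recommended settings (repetition_penalty is a vLLM-specific extension field, while frequency_penalty and presence_penalty follow the OpenAI schema):

```bash
# Hypothetical request; adjust host, model name, and values for your setup.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "/models/llama-70b-chat-finetune",
        "messages": [{"role": "user", "content": "What about the US economy?"}],
        "temperature": 0.7,
        "repetition_penalty": 1.1,
        "frequency_penalty": 0.5,
        "presence_penalty": 0.5,
        "max_tokens": 256
      }'
```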