vLLM generating repeated/duplicate responses #12276
saraswatmks
announced in
Q&A
Replies: 2 comments 1 reply
- Facing the same issue - tried the same change of parameters and removing them.
- This is a really serious issue. Have you tried turning off APC using the following option?
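The exact option was not captured above; as a hedged sketch, recent vLLM versions enable automatic prefix caching (APC) by default and expose a flag to turn it off (on older versions APC is off unless --enable-prefix-caching is passed). The model path here is a placeholder:

```bash
# Sketch only: start the OpenAI-compatible server with automatic prefix
# caching disabled. The flag name follows current vLLM docs and may differ
# on older versions; the model path is a placeholder.
vllm serve /models/llama-70b-chat-finetune \
  --max-model-len 4096 \
  --no-enable-prefix-caching
```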
I am trying to serve a LLaMA 70B fine-tuned model using vLLM on an A100 80GB GPU. The model is fine-tuned for a chat-assistant use case. I notice that after a 10-15 message conversation, the LLM responses start to repeat themselves. Below is an example:
User: Hello
Assistant: How are you?
...
// assume 10-15 messages in between
...
User: How bad is world economy?
Assistant: World economy is really bad these days due to war.
User: What about US economy?
Assistant: World economy is really bad these days due to war.
User: Why are some people good and bad?
Assistant: World economy is really bad these days due to war.
User: Why do men earn higher than women?
Assistant: World economy is really bad these days due to war.
Currently, I want the server to handle 10 concurrent requests, each with a fixed prompt of 1350 tokens. Therefore, I set the following arguments:
--max-model-len 4096
--max-seq-len 10
--max-num-batched-tokens 40960
I also tried bitsandbytes quantization, but I still get the repeated responses. I am using the vllm/vllm-openai:latest Docker image. I tried --enable-prefix-caching as well, but it had no effect. I also tried repetition_penalty, frequency_penalty, and presence_penalty, but still no effect.
Note: I don't get this problem when I deploy the model using llama.cpp; everything seems to work fine there.
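For reference, a minimal sketch of how such a deployment is typically launched with the image and flags mentioned above; the model path, port, and volume mount are placeholders, and --max-num-seqs is used on the assumption that it is the intended flag for capping concurrency at 10:

```bash
# Hypothetical launch command; paths, port, and model name are placeholders.
# --max-num-seqs caps the number of sequences processed concurrently.
# Older vLLM versions may also require --load-format bitsandbytes for
# bitsandbytes quantization.
docker run --gpus all -p 8000:8000 \
  -v /models:/models \
  vllm/vllm-openai:latest \
  --model /models/llama-70b-chat-finetune \
  --max-model-len 4096 \
  --max-num-seqs 10 \
  --max-num-batched-tokens 40960 \
  --quantization bitsandbytes
```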
How can I fix this repetition issue? Looking for some guidance here.
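For completeness, a minimal sketch of how those penalties are passed per request to the OpenAI-compatible server; the endpoint, model name, and penalty values are assumptions rather than recommended settings (repetition_penalty is a vLLM-specific extension field, while frequency_penalty and presence_penalty follow the OpenAI schema):

```bash
# Hypothetical request; adjust host, model name, and values for your setup.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "/models/llama-70b-chat-finetune",
        "messages": [{"role": "user", "content": "What about the US economy?"}],
        "temperature": 0.7,
        "repetition_penalty": 1.1,
        "frequency_penalty": 0.5,
        "presence_penalty": 0.5,
        "max_tokens": 256
      }'
```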