[Frontend] Abstract prompt and SpeechToTextConfig for transcriptions models #20637
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs will not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
Summary of Changes
Hello @NickLucche, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request continues the ongoing work to generalize the speech-to-text endpoint, making it less dependent on specific model architectures like Whisper. It achieves this by introducing a dedicated configuration class for ASR models and abstracting prompt generation and configuration retrieval into the model interface, allowing different transcription models to integrate more seamlessly.
Highlights
- Centralized ASR Configuration: Introduced a new `SpeechToTextConfig` dataclass in `vllm/config.py` to centralize and standardize configuration parameters for speech-to-text models, such as sample rate, audio chunking behavior, and overlap settings. This moves away from hardcoded constants within the endpoint logic (see the sketch after this list).
- Decoupled Prompt Generation: Refactored the `OpenAISpeechToText` endpoint to delegate the responsibility of constructing transcription prompts to the individual ASR models. This is achieved by replacing the generic `get_decoder_prompt` with a new `get_generation_prompt` method in the `SupportsTranscription` interface, allowing models like Whisper to define their specific prompt formats, including handling of encoder and decoder prompts.
- Model-Specific ASR Settings: Enhanced the `SupportsTranscription` interface with a new `get_speech_to_text_config` method. This enables concrete ASR model implementations (e.g., Whisper) to provide their own default or derived speech-to-text configurations (like sampling rate, hop length, and max audio clip duration) based on their internal feature extractors, promoting greater modularity and extensibility.
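A minimal sketch of how these pieces could fit together is below; it is based only on the summary above, so the field names and signatures are assumptions rather than the PR's literal code.

```python
# Sketch only: names and signatures are assumptions drawn from the summary
# above, not the exact definitions merged in this PR.
from dataclasses import dataclass
from typing import Optional

import numpy as np


@dataclass
class SpeechToTextConfig:
    """Per-model ASR settings, replacing hardcoded constants in the endpoint."""
    sample_rate: float = 16_000
    max_audio_clip_s: int = 30
    overlap_chunk_second: int = 1
    min_energy_split_window_size: Optional[int] = 1600


class SupportsTranscription:
    """Interface implemented by ASR-capable models (e.g. Whisper)."""

    @classmethod
    def get_speech_to_text_config(cls, model_config,
                                  task_type: str) -> SpeechToTextConfig:
        """Return the model's own ASR defaults (sampling rate, clip length, ...)."""
        raise NotImplementedError

    @classmethod
    def get_generation_prompt(cls, audio: np.ndarray,
                              stt_config: SpeechToTextConfig, language: str,
                              task_type: str, request_prompt: str):
        """Build and return any valid PromptType for this audio request."""
        raise NotImplementedError
```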
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either `/gemini <command>` or `@gemini-code-assist <command>`. Below is a summary of the supported commands.
| Feature | Command | Description |
|---|---|---|
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |
Customization
To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a `.gemini/` folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.
You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Footnotes
1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.
Code Review
The code changes introduce an abstract prompt and configuration for transcription models. I've identified one critical issue that will cause a runtime error and one medium-severity issue related to a change in logic for prompt construction. Addressing these will improve the correctness and robustness of the implementation.
vllm/config.py (outdated)

```python
max_audio_clip_s: int = 30
"""Maximum duration in seconds for a single audio clip without chunking.
Audio longer than this will be split into smaller chunks if
`allow_audio_chunking` is enabled, otherwise it will be rejected."""

allow_audio_chunking: bool = False
"""Whether to allow splitting long audio files into smaller chunks.
When True, audio longer than `max_audio_clip_s` will be automatically
split with overlapping segments. When False, long audio will be rejected.
"""
```
Suggested change:

```diff
-max_audio_clip_s: int = 30
+max_audio_clip_s: Optional[int] = 30
 """Maximum duration in seconds for a single audio clip without chunking.
 Audio longer than this will be split into smaller chunks if
 `allow_audio_chunking` is enabled, otherwise it will be rejected."""
-allow_audio_chunking: bool = False
-"""Whether to allow splitting long audio files into smaller chunks.
-When True, audio longer than `max_audio_clip_s` will be automatically
-split with overlapping segments. When False, long audio will be rejected.
-"""
```
Think it's always nicer to try to remove such flags. Couldn't we get the same behavior as `allow_audio_chunking = False` by setting `max_audio_clip_s` to `None`?
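A hypothetical illustration of that suggestion (not the merged code): make the field optional and derive the chunking behavior from it instead of carrying a separate boolean.

```python
# Hypothetical sketch of the reviewer's suggestion:
# `max_audio_clip_s=None` means "never chunk", so the boolean flag goes away.
from dataclasses import dataclass
from typing import Optional


@dataclass
class SpeechToTextConfig:
    max_audio_clip_s: Optional[int] = 30

    @property
    def allow_audio_chunking(self) -> bool:
        # Derived rather than stored: chunking is on only when a limit is set.
        return self.max_audio_clip_s is not None
```

Call sites could keep reading `config.allow_audio_chunking`, while users disable chunking with a single `max_audio_clip_s=None`.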
vllm/config.py (outdated)

```python
splitting long audio. This helps maintain context across chunk boundaries
and improves transcription quality at split points."""

min_energy_split_window_size: int = 1600
```
Also allow `None` to disable?
```python
prompts.append(cast(PromptType, prompt))
# The model has control over the construction, as long as it
# returns a valid PromptType.
prompt = self.model_cls.get_generation_prompt(
```
yeah that'd work!
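For context, here is a rough sketch of what the model-side hook could look like for a Whisper-style encoder/decoder model; the signature and the exact decoder prefix are assumptions, not the PR's literal code.

```python
# Assumed signature; the decoder prefix shown is the usual Whisper-style
# control-token sequence and may differ from the merged implementation.
import numpy as np


class WhisperLikeModel:

    @classmethod
    def get_generation_prompt(cls, audio: np.ndarray, stt_config,
                              language: str, task_type: str,
                              request_prompt: str) -> dict:
        # The endpoint only requires a valid PromptType back; an
        # encoder/decoder model returns an explicit prompt pair.
        return {
            "encoder_prompt": {
                "prompt": "",
                "multi_modal_data": {"audio": (audio, stt_config.sample_rate)},
            },
            "decoder_prompt":
                f"<|startoftranscript|><|{language}|><|{task_type}|>"
                f"<|notimestamps|>{request_prompt}",
        }
```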
```python
self.task_type = task_type

self.asr_config = self.model_cls.get_speech_to_text_config(
```
nice
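And a sketch of the companion hook, reusing the `SpeechToTextConfig` sketched earlier; deriving the values from the HF feature extractor is an assumption about how a Whisper-style model would fill this in.

```python
# Sketch: derive the ASR defaults from the HF feature extractor instead of
# hardcoding them in the endpoint. Assumes the SpeechToTextConfig above.
from transformers import AutoFeatureExtractor


class WhisperLikeModel:

    @classmethod
    def get_speech_to_text_config(cls, model_config,
                                  task_type: str) -> SpeechToTextConfig:
        fe = AutoFeatureExtractor.from_pretrained(model_config.model)
        return SpeechToTextConfig(
            sample_rate=fe.sampling_rate,      # e.g. 16 kHz for Whisper
            max_audio_clip_s=fe.chunk_length,  # e.g. 30 s for Whisper
        )
```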
@patrickvonplaten thanks for the review! 🙏🏻 Looking into the cleanest way to allow models to expose multiple endpoints now. The quickest solution isn't the most general one, so I might get back with a separate PR if things are too involved.
@DarkLight1337 what do you think?
```python
num_prompt_tokens = max(
    len(res.prompt_token_ids) - 4, 0)
```
I guess this part also needs to be part of the model definition?
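One way to do that, purely as an illustration (the attribute name below is made up): let the model expose how many leading control tokens its decoder prompt contributes, and have the endpoint subtract that instead of a hardcoded 4.

```python
# Illustration only; the attribute name is hypothetical.
class WhisperLikeModel:
    # Whisper's decoder prompt starts with 4 control tokens
    # (<|startoftranscript|>, language, task, <|notimestamps|>) that should
    # not be counted as user prompt tokens.
    num_special_prompt_tokens = 4


def count_prompt_tokens(res, model_cls) -> int:
    # Endpoint-side equivalent of the hardcoded `- 4` above.
    return max(len(res.prompt_token_ids) - model_cls.num_special_prompt_tokens, 0)
```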
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from 41802a6 to b74d9af.
Waiting for #20812 to get the green light here. @DarkLight1337 @patrickvonplaten feel free to review again, the current state should be final.
Force-pushed from f0ddc61 to e0f4c64.
Continuing the generalization work started in #20179 to make the endpoint less Whisper-centric.
TODO:
cc @DarkLight1337