[Frontend] Abstract prompt and SpeechToTextConfig for transcriptions models #20637


Merged: 9 commits from the transcriptions-api-generalize branch into vllm-project:main on Jul 12, 2025

Conversation

@NickLucche (Contributor) commented Jul 8, 2025

Continuing the generalization work started in #20179 to make the endpoint less Whisper-centric.

TODO:

  • Allow transcription models to expose other API endpoints if specified

cc @DarkLight1337

github-actions (bot) commented Jul 8, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of tests to catch errors quickly. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

mergify bot added the frontend label on Jul 8, 2025
@gemini-code-assist (bot) left a comment

Summary of Changes

Hello @NickLucche, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request continues the ongoing work to generalize the speech-to-text endpoint, making it less dependent on specific model architectures like Whisper. It achieves this by introducing a dedicated configuration class for ASR models and abstracting prompt generation and configuration retrieval into the model interface, allowing different transcription models to integrate more seamlessly.

Highlights

  • Centralized ASR Configuration: Introduced a new SpeechToTextConfig dataclass in vllm/config.py to centralize and standardize configuration parameters for speech-to-text models, such as sample rate, audio chunking behavior, and overlap settings. This moves away from hardcoded constants within the endpoint logic.
  • Decoupled Prompt Generation: Refactored the OpenAISpeechToText endpoint to delegate the responsibility of constructing transcription prompts to the individual ASR models. This is achieved by replacing the generic get_decoder_prompt with a new get_generation_prompt method in the SupportsTranscription interface, allowing models like Whisper to define their specific prompt formats, including handling of encoder and decoder prompts.
  • Model-Specific ASR Settings: Enhanced the SupportsTranscription interface with a new get_speech_to_text_config method. This enables concrete ASR model implementations (e.g., Whisper) to provide their own default or derived speech-to-text configurations (like sampling rate, hop length, and max audio clip duration) based on their internal feature extractors, promoting greater modularity and extensibility. A rough sketch of the combined interface follows this list.
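Taken together, the pieces named in these highlights could be combined as in the following minimal sketch. This is not the PR's actual code: the class and method names (`SpeechToTextConfig`, `get_generation_prompt`, `get_speech_to_text_config`) come from the summary above, while the field names, defaults, and parameter lists are assumptions made for illustration:

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class SpeechToTextConfig:
    """Centralized ASR settings (fields and defaults are illustrative)."""
    sample_rate: int = 16_000           # expected input sample rate in Hz
    max_audio_clip_s: int = 30          # longest clip handled without chunking
    overlap_chunk_second: int = 1       # overlap between adjacent chunks
    allow_audio_chunking: bool = False  # whether long audio may be split


class SupportsTranscription:
    """Interface mixin that concrete ASR models (e.g. Whisper) override."""

    @classmethod
    def get_speech_to_text_config(cls, model_config,
                                  task_type: str) -> SpeechToTextConfig:
        # A real model would derive these values (sampling rate, hop
        # length, max clip duration) from its feature extractor.
        return SpeechToTextConfig()

    @classmethod
    def get_generation_prompt(cls, audio: np.ndarray,
                              stt_config: SpeechToTextConfig,
                              language: str, task_type: str,
                              request_prompt: str):
        # The model controls prompt construction, as long as it returns
        # a valid PromptType (Whisper would build encoder/decoder
        # prompts here).
        raise NotImplementedError
```

With this split, the endpoint only calls the two classmethods, and all Whisper-specific details stay inside the model implementation.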
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e., a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder at the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  ¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist (bot) left a comment

Code Review

The code changes introduce abstract prompt and config for transcription models. I've identified one critical issue that will cause a runtime error and one medium-severity issue related to a change in logic for prompt construction. Addressing these will improve the correctness and robustness of the implementation.

vllm/config.py (outdated), comment on lines 4954 to 4963:
```python
max_audio_clip_s: int = 30
"""Maximum duration in seconds for a single audio clip without chunking.
Audio longer than this will be split into smaller chunks if
`allow_audio_chunking` is enabled, otherwise it will be rejected."""

allow_audio_chunking: bool = False
"""Whether to allow splitting long audio files into smaller chunks.
When True, audio longer than `max_audio_clip_s` will be automatically
split with overlapping segments. When False, long audio will be rejected.
"""
```
Contributor:

Suggested change:

```diff
-max_audio_clip_s: int = 30
-"""Maximum duration in seconds for a single audio clip without chunking.
-Audio longer than this will be split into smaller chunks if
-`allow_audio_chunking` is enabled, otherwise it will be rejected."""
-
-allow_audio_chunking: bool = False
-"""Whether to allow splitting long audio files into smaller chunks.
-When True, audio longer than `max_audio_clip_s` will be automatically
-split with overlapping segments. When False, long audio will be rejected.
-"""
+max_audio_clip_s: Optional[int] = 30
+"""Maximum duration in seconds for a single audio clip without chunking.
+Audio longer than this will be split into smaller chunks if
+`allow_audio_chunking` is enabled, otherwise it will be rejected."""
```

I think it's always nicer to try to remove such flags. Couldn't we get the same behavior as `allow_audio_chunking = False` by setting `max_audio_clip_s` to `None`?
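A minimal sketch of the behavior this comment proposes, under the assumption that a `None` limit means chunking is disabled; the helper below is hypothetical, not code from the PR:

```python
from typing import Optional


def should_chunk(duration_s: float,
                 max_audio_clip_s: Optional[int]) -> bool:
    # Hypothetical helper: None stands in for allow_audio_chunking=False,
    # so a missing limit means the audio is never split.
    return max_audio_clip_s is not None and duration_s > max_audio_clip_s
```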

vllm/config.py (outdated):

```python
splitting long audio. This helps maintain context across chunk boundaries
and improves transcription quality at split points."""

min_energy_split_window_size: int = 1600
```
Contributor:

Also allow None to disable?

```python
prompts.append(cast(PromptType, prompt))
# The model has control over the construction, as long as it
# returns a valid PromptType.
prompt = self.model_cls.get_generation_prompt(
```
Contributor:

yeah that'd work!

```python
self.task_type = task_type

self.asr_config = self.model_cls.get_speech_to_text_config(
```
Contributor:

nice
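For context, the serving-side wiring these two snippets come from might look roughly like the following; only `model_cls`, `task_type`, and `get_speech_to_text_config` appear in the diff above, the rest is an assumption:

```python
class OpenAISpeechToText:
    """Sketch of the endpoint side of the abstraction (not the PR's code)."""

    def __init__(self, model_cls, model_config, task_type: str):
        self.model_cls = model_cls
        self.task_type = task_type
        # The model class, not the endpoint, now decides its ASR settings.
        self.asr_config = self.model_cls.get_speech_to_text_config(
            model_config, task_type)
```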

@NickLucche (Contributor, Author) commented:

@patrickvonplaten thanks for the review! 🙏🏻

Looking into the cleanest way to allow models to expose multiple endpoints now. The quickest solution isn't the most general one, so I might get back with a separate PR if things are too involved.

@NickLucche changed the title from "abstract promtp and config for transcription models" to "[Frontend] Abstract prompt and SpeechToTextConfig for transcriptions models" on Jul 9, 2025
@NickLucche (Contributor, Author) commented:

@DarkLight1337 what do you think?

Comment on lines -267 to -268:

```python
num_prompt_tokens = max(
    len(res.prompt_token_ids) - 4, 0)
```
Member:

I guess this part also needs to be part of the model definition?
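One way the hardcoded `- 4` (presumably Whisper's four special decoder tokens) could move into the model definition, sketched with a hypothetical hook name, not something this PR necessarily adds:

```python
class SupportsTranscription:
    @classmethod
    def get_num_special_prompt_tokens(cls) -> int:
        # Hypothetical hook: number of control tokens the model prepends
        # to its prompt, to be excluded from usage accounting.
        return 0


class WhisperForConditionalGeneration(SupportsTranscription):
    @classmethod
    def get_num_special_prompt_tokens(cls) -> int:
        # e.g. <|startoftranscript|>, <|en|>, <|transcribe|>, <|notimestamps|>
        return 4
```

The endpoint could then compute `num_prompt_tokens = max(len(res.prompt_token_ids) - model_cls.get_num_special_prompt_tokens(), 0)` without a Whisper-specific constant.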

mergify (bot) commented Jul 10, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @NickLucche.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify bot added the needs-rebase label on Jul 10, 2025
@NickLucche force-pushed the transcriptions-api-generalize branch from 41802a6 to b74d9af on July 11, 2025 at 10:11
mergify bot removed the needs-rebase label on Jul 11, 2025
@NickLucche (Contributor, Author) commented:

Waiting for #20812 to get the green light here.

@DarkLight1337 @patrickvonplaten feel free to review again; the current state should be final.

Signed-off-by: NickLucche <[email protected]>
Signed-off-by: NickLucche <[email protected]>
Signed-off-by: NickLucche <[email protected]>
Signed-off-by: NickLucche <[email protected]>
Signed-off-by: NickLucche <[email protected]>
@NickLucche force-pushed the transcriptions-api-generalize branch from f0ddc61 to e0f4c64 on July 11, 2025 at 12:27
Signed-off-by: NickLucche <[email protected]>
@DarkLight1337 added the ready label (ONLY add when PR is ready to merge/full CI is needed) on Jul 11, 2025
@vllm-bot merged commit 3c7d942 into vllm-project:main on Jul 12, 2025 (76 of 78 checks passed)
github-project-automation bot moved this from In Progress to Done in Multi-modality Core on Jul 12, 2025
Chen-zexi pushed a commit to Chen-zexi/vllm that referenced this pull request Jul 13, 2025
patrickvonplaten pushed a commit to patrickvonplaten/vllm that referenced this pull request Jul 15, 2025
Labels: frontend, ready (ONLY add when PR is ready to merge/full CI is needed)
Projects: Multi-modality Core (Status: Done)
4 participants