[Frontend] Abstract prompt and SpeechToTextConfig for transcriptions models #20637
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs will not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
Summary of Changes
Hello @NickLucche, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request continues the ongoing work to generalize the speech-to-text endpoint, making it less dependent on specific model architectures like Whisper. It achieves this by introducing a dedicated configuration class for ASR models and abstracting prompt generation and configuration retrieval into the model interface, allowing different transcription models to integrate more seamlessly.
Highlights
- Centralized ASR Configuration: Introduced a new `SpeechToTextConfig` dataclass in `vllm/config.py` to centralize and standardize configuration parameters for speech-to-text models, such as sample rate, audio chunking behavior, and overlap settings. This moves away from hardcoded constants within the endpoint logic (see the sketch after this list).
- Decoupled Prompt Generation: Refactored the `OpenAISpeechToText` endpoint to delegate the responsibility of constructing transcription prompts to the individual ASR models. This is achieved by replacing the generic `get_decoder_prompt` with a new `get_generation_prompt` method in the `SupportsTranscription` interface, allowing models like Whisper to define their specific prompt formats, including handling of encoder and decoder prompts.
- Model-Specific ASR Settings: Enhanced the `SupportsTranscription` interface with a new `get_speech_to_text_config` method. This enables concrete ASR model implementations (e.g., Whisper) to provide their own default or derived speech-to-text configurations (like sampling rate, hop length, and max audio clip duration) based on their internal feature extractors, promoting greater modularity and extensibility.
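A minimal sketch of how these pieces could fit together is below; it is based only on the summary above, so the field names and signatures are assumptions rather than the PR's literal code.

```python
# Sketch only: names and signatures are assumptions drawn from the summary
# above, not the exact definitions merged in this PR.
from dataclasses import dataclass
from typing import Optional

import numpy as np


@dataclass
class SpeechToTextConfig:
    """Per-model ASR settings, replacing hardcoded constants in the endpoint."""
    sample_rate: float = 16_000
    max_audio_clip_s: int = 30
    overlap_chunk_second: int = 1
    min_energy_split_window_size: Optional[int] = 1600


class SupportsTranscription:
    """Interface implemented by ASR-capable models (e.g. Whisper)."""

    @classmethod
    def get_speech_to_text_config(cls, model_config,
                                  task_type: str) -> SpeechToTextConfig:
        """Return the model's own ASR defaults (sampling rate, clip length, ...)."""
        raise NotImplementedError

    @classmethod
    def get_generation_prompt(cls, audio: np.ndarray,
                              stt_config: SpeechToTextConfig, language: str,
                              task_type: str, request_prompt: str):
        """Build and return any valid PromptType for this audio request."""
        raise NotImplementedError
```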
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either `/gemini <command>` or `@gemini-code-assist <command>`. Below is a summary of the supported commands.
| Feature | Command | Description |
|---|---|---|
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |
Customization
To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a `.gemini/` folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.
You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Footnotes
1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.
Code Review
The code changes introduce an abstract prompt and configuration for transcription models. I've identified one critical issue that will cause a runtime error and one medium-severity issue related to a change in logic for prompt construction. Addressing these will improve the correctness and robustness of the implementation.
vllm/config.py (outdated)

```python
max_audio_clip_s: int = 30
"""Maximum duration in seconds for a single audio clip without chunking.
Audio longer than this will be split into smaller chunks if
`allow_audio_chunking` is enabled, otherwise it will be rejected."""

allow_audio_chunking: bool = False
"""Whether to allow splitting long audio files into smaller chunks.
When True, audio longer than `max_audio_clip_s` will be automatically
split with overlapping segments. When False, long audio will be rejected.
"""
```
Suggested change:

```diff
-max_audio_clip_s: int = 30
+max_audio_clip_s: Optional[int] = 30
 """Maximum duration in seconds for a single audio clip without chunking.
 Audio longer than this will be split into smaller chunks if
 `allow_audio_chunking` is enabled, otherwise it will be rejected."""
-allow_audio_chunking: bool = False
-"""Whether to allow splitting long audio files into smaller chunks.
-When True, audio longer than `max_audio_clip_s` will be automatically
-split with overlapping segments. When False, long audio will be rejected.
-"""
```
Think it's always nicer to try to remove such flags. Couldn't we get the same behavior as `allow_audio_chunking = False` by setting `max_audio_clip_s` to `None`?
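A hypothetical illustration of that suggestion (not the merged code): make the field optional and derive the chunking behavior from it instead of carrying a separate boolean.

```python
# Hypothetical sketch of the reviewer's suggestion:
# `max_audio_clip_s=None` means "never chunk", so the boolean flag goes away.
from dataclasses import dataclass
from typing import Optional


@dataclass
class SpeechToTextConfig:
    max_audio_clip_s: Optional[int] = 30

    @property
    def allow_audio_chunking(self) -> bool:
        # Derived rather than stored: chunking is on only when a limit is set.
        return self.max_audio_clip_s is not None
```

Call sites could keep reading `config.allow_audio_chunking`, while users disable chunking with a single `max_audio_clip_s=None`.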
vllm/config.py (outdated)

```python
splitting long audio. This helps maintain context across chunk boundaries
and improves transcription quality at split points."""

min_energy_split_window_size: int = 1600
```
Also allow `None` to disable?
```python
prompts.append(cast(PromptType, prompt))
# The model has control over the construction, as long as it
# returns a valid PromptType.
prompt = self.model_cls.get_generation_prompt(
```
yeah that'd work!
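For context, here is a rough sketch of what the model-side hook could look like for a Whisper-style encoder/decoder model; the signature and the exact decoder prefix are assumptions, not the PR's literal code.

```python
# Assumed signature; the decoder prefix shown is the usual Whisper-style
# control-token sequence and may differ from the merged implementation.
import numpy as np


class WhisperLikeModel:

    @classmethod
    def get_generation_prompt(cls, audio: np.ndarray, stt_config,
                              language: str, task_type: str,
                              request_prompt: str) -> dict:
        # The endpoint only requires a valid PromptType back; an
        # encoder/decoder model returns an explicit prompt pair.
        return {
            "encoder_prompt": {
                "prompt": "",
                "multi_modal_data": {"audio": (audio, stt_config.sample_rate)},
            },
            "decoder_prompt":
                f"<|startoftranscript|><|{language}|><|{task_type}|>"
                f"<|notimestamps|>{request_prompt}",
        }
```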
```python
self.task_type = task_type

self.asr_config = self.model_cls.get_speech_to_text_config(
```
nice
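And a sketch of the companion hook, reusing the `SpeechToTextConfig` sketched earlier; deriving the values from the HF feature extractor is an assumption about how a Whisper-style model would fill this in.

```python
# Sketch: derive the ASR defaults from the HF feature extractor instead of
# hardcoding them in the endpoint. Assumes the SpeechToTextConfig above.
from transformers import AutoFeatureExtractor


class WhisperLikeModel:

    @classmethod
    def get_speech_to_text_config(cls, model_config,
                                  task_type: str) -> SpeechToTextConfig:
        fe = AutoFeatureExtractor.from_pretrained(model_config.model)
        return SpeechToTextConfig(
            sample_rate=fe.sampling_rate,      # e.g. 16 kHz for Whisper
            max_audio_clip_s=fe.chunk_length,  # e.g. 30 s for Whisper
        )
```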
@patrickvonplaten thanks for the review! 🙏🏻 Looking into the cleanest way to allow models to expose multiple endpoints now. The quickest solution isn't the most general one, so I might get back with a separate PR if things are too involved.
@DarkLight1337 what do you think?
```python
num_prompt_tokens = max(
    len(res.prompt_token_ids) - 4, 0)
```
I guess this part also needs to be part of the model definition?
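One way to do that, purely as an illustration (the attribute name below is made up): let the model expose how many leading control tokens its decoder prompt contributes, and have the endpoint subtract that instead of a hardcoded 4.

```python
# Illustration only; the attribute name is hypothetical.
class WhisperLikeModel:
    # Whisper's decoder prompt starts with 4 control tokens
    # (<|startoftranscript|>, language, task, <|notimestamps|>) that should
    # not be counted as user prompt tokens.
    num_special_prompt_tokens = 4


def count_prompt_tokens(res, model_cls) -> int:
    # Endpoint-side equivalent of the hardcoded `- 4` above.
    return max(len(res.prompt_token_ids) - model_cls.num_special_prompt_tokens, 0)
```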
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from 41802a6 to b74d9af.
Waiting for #20812 to get the green light here. @DarkLight1337 @patrickvonplaten feel free to review again, the current state should be final.
Force-pushed from f0ddc61 to e0f4c64.
Continuing the generalization work started in #20179 to make the endpoint less Whisper-centric.
TODO:
cc @DarkLight1337