v1: Add Whisper model support (encoder-decoder) #21088
Conversation
This pull request has merge conflicts that must be resolved before it can be merged.
Code Review

This is a significant and well-structured pull request that adds Whisper (encoder-decoder) model support to vLLM's V1 engine. The changes are comprehensive, touching on the attention backend, KV cache management, scheduler, and GPU model runner to accommodate the new architecture.

I've identified one critical issue in `_build_encoder_attn_metadata`, where a missing `else` block could lead to a size mismatch and a runtime error. I've provided a code suggestion to fix this potential bug. Other than that, the implementation looks solid and correctly integrates encoder-decoder support into the existing V1 framework. Great work on this complex feature!
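For readers following the thread, the sketch below shows the general class of "missing else" bug the review is pointing at. It is a hypothetical illustration only, not the actual `_build_encoder_attn_metadata` code; the function name and shapes are made up for the example.

```python
# Hypothetical illustration of the "missing else" class of bug the review
# points at -- not the actual _build_encoder_attn_metadata code.
import torch


def build_encoder_seq_lens(encoder_seq_lens: list[int],
                           padded_batch_size: int) -> torch.Tensor:
    seq_lens = torch.tensor(encoder_seq_lens, dtype=torch.int32)
    if len(encoder_seq_lens) < padded_batch_size:
        # Pad up to the batch size the downstream buffers were built for.
        seq_lens = torch.nn.functional.pad(
            seq_lens, (0, padded_batch_size - len(encoder_seq_lens)))
    # Missing else: when len(encoder_seq_lens) > padded_batch_size nothing
    # trims or rejects the input, so the tensor no longer matches the other
    # metadata buffers and the attention kernel later fails with a size
    # mismatch at runtime.
    return seq_lens
```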
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger a full CI run by default. Instead, only a small and essential subset of CI tests is run to quickly catch errors. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge. 🚀
There is already some work to support encoder-decoder models: #20226.
Can you coordinate with @maxdebayser to avoid duplicate work?
Yeah, I've been talking with @russellb, as there are a few overlapping points in our PRs, for example disabling prefix caching and chunked prefill.
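For context, both branches end up forcing these two features off when an encoder-decoder model is detected. A minimal sketch of that kind of guard is below; the attribute names (`is_encoder_decoder`, `enable_prefix_caching`, `enable_chunked_prefill`) are assumptions for illustration, not the exact vLLM config fields.

```python
# Sketch only: attribute names are assumed for illustration, not the exact
# vLLM config fields.
def apply_encoder_decoder_limits(model_config, cache_config, scheduler_config):
    if not getattr(model_config, "is_encoder_decoder", False):
        return
    # Prefix caching keys blocks by token prefix and doesn't account for the
    # cross-attention KV written from the encoder output, so turn it off.
    cache_config.enable_prefix_caching = False
    # Chunked prefill assumes the prompt can be split across steps, but the
    # encoder has to see its whole input at once, so turn it off too.
    scheduler_config.enable_chunked_prefill = False
```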
prefix_kv_lens=attn_metadata.prefix_kv_lens,
suffix_kv_lens=attn_metadata.suffix_kv_lens,
max_kv_len=attn_metadata.max_seq_len,

def _forward_encoder_attention(
Awesome!
Yep, we're in contact. Did you mean to link something different than #20226? Roughly though, Max had worked on encoder-only support, and I was doing encoder-decoder, which is mostly a superset of encoder-only changes, though I haven't actually tested any encoder-only models with my branch yet.
@@ -552,7 +622,7 @@ def forward(
             seqused_k=seqused_k,
             max_seqlen_k=max_seqlen_k,
             softmax_scale=self.scale,
-            causal=True,
+            causal=FlashAttentionImpl._get_causal_option(attn_type),
nit: can we just add the `causal` flag to `CommonAttentionMetadata` and manipulate the slot-mapping on the `CommonAttentionMetadata`, so we can make more of this backend agnostic? (kinda like: #21093)
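As a rough illustration of this suggestion (a hypothetical sketch, not the real `CommonAttentionMetadata` definition, which carries more fields): the idea is to decide causality once in the model runner and carry it in the shared metadata, so each backend just reads a flag instead of re-deriving it from the attention type.

```python
# Hypothetical sketch of the suggestion; the real CommonAttentionMetadata in
# vLLM has more fields than shown here.
from dataclasses import dataclass

import torch


@dataclass
class CommonAttentionMetadata:
    query_start_loc: torch.Tensor
    seq_lens: torch.Tensor
    slot_mapping: torch.Tensor
    # Decided once by the model runner from the attention type, instead of
    # each backend re-deriving it from AttentionType.
    causal: bool = True


def backend_forward(backend, q, k, v, metadata: CommonAttentionMetadata):
    # The backend only honors the flag (and the pre-manipulated slot_mapping),
    # which keeps the encoder/decoder special-casing backend-agnostic.
    return backend.forward(q, k, v, metadata, causal=metadata.causal)
```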
yes! thanks for the feedback
Follow-up on next steps and collaboration with @maxdebayser: we're going to combine our work and try to land it all in a few stages.

PR 1) Combine parts of his encoder-only PR (#19988) with the encoder-without-kv-cache changes in this branch. That will be a new jointly-authored PR that will cover encoder-only attention.

PR 2) Update this PR with what's left to make Whisper / encoder-decoder work. That includes some Whisper model changes and a bunch of changes to support cross-attention (encoder-decoder type).

PR 3) Add the last parts of Max's original PR, which supports token_type_ids to run the BERT classifier models that need them.
Force-pushed from 96be9ad to 4da8b7c
nice one!
return not (attn_type == AttentionType.ENCODER
            or attn_type == AttentionType.ENCODER_ONLY
            or attn_type == AttentionType.ENCODER_DECODER)
nit: isn't this `attn_type == AttentionType.DECODER`?
true
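In other words, the helper collapses to a single comparison. A minimal sketch of the simplified form, written as a free function for self-containment (the import path is an assumption):

```python
from vllm.attention import AttentionType  # import path assumed


def _get_causal_option(attn_type: str) -> bool:
    # Only decoder self-attention is causal; encoder self-attention,
    # encoder-only attention, and encoder-decoder cross-attention all
    # attend over the full (non-causal) sequence.
    return attn_type == AttentionType.DECODER
```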
        self.use_irope = use_irope
        self.vllm_flash_attn_version = get_flash_attn_version()
        if is_quantized_kv_cache(self.kv_cache_dtype) \
                and not flash_attn_supports_fp8():
            raise NotImplementedError(
                "FlashAttention does not support fp8 kv-cache on this device.")

    @staticmethod
    def _get_causal_option(attn_type: str) -> bool:
nit: `_is_causal_attention`?
if (attn_type in (AttentionType.ENCODER, AttentionType.ENCODER_DECODER,
                  AttentionType.ENCODER_ONLY)
you can re-use `_get_causal_option`
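Roughly like the sketch below, i.e. negate the causal helper instead of enumerating the encoder attention types again. The import path is assumed, and the real condition in the diff has additional clauses beyond the attention type.

```python
# Sketch of the suggested reuse; the import path is assumed and the real
# condition in the diff has additional clauses beyond the attention type.
from vllm.v1.attention.backends.flash_attn import FlashAttentionImpl


def needs_encoder_style_metadata(attn_type: str) -> bool:
    # Rather than listing ENCODER / ENCODER_DECODER / ENCODER_ONLY again,
    # negate the causal helper defined on the backend above.
    return not FlashAttentionImpl._get_causal_option(attn_type)
```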
@@ -61,8 +64,14 @@ def get_num_blocks_to_allocate(
         """
         num_blocks_to_allocate = 0
         for i, manager in enumerate(self.single_type_managers):
-            num_blocks_to_allocate += manager.get_num_blocks_to_allocate(
-                request_id, num_tokens, new_computed_blocks[i])
+            if cross_attn and isinstance(manager, CrossAttentionManager):
should this be an assert when cross_attn is True?
assert isinstance(manager, CrossAttentionManager)
No. In the Whisper case, there are 2 managers in `self.single_type_managers`. The first is a `FullAttentionManager` and corresponds to the KV cache group for decoder self-attention. The second manager is a `CrossAttentionManager` and corresponds to the KV cache group for cross-attention in the decoder.

For some operations we want to interact only with cross-attention, or with everything except cross-attention, so there are checks like this. I think a runtime check like this is a bit more reliable than making assumptions about how KV cache groups may be structured in here.
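A stripped-down sketch of that layout and the runtime filter, using stand-in classes named after the comment above; the bodies are placeholders, not the real single-type-manager implementations.

```python
# Illustrative stand-ins; the real managers live in vLLM's KV cache coordinator.
class FullAttentionManager:        # KV cache group 0: decoder self-attention
    def get_num_blocks_to_allocate(self, request_id, num_tokens, computed):
        return 0  # grows with decoded tokens in the real manager


class CrossAttentionManager:       # KV cache group 1: decoder cross-attention
    def get_num_blocks_to_allocate(self, request_id, num_tokens, computed):
        return 0  # sized once from the encoder length in the real manager


def num_blocks_to_allocate(managers, request_id, num_tokens, computed,
                           cross_attn: bool) -> int:
    total = 0
    for i, manager in enumerate(managers):
        # When cross_attn is True, touch only the CrossAttentionManager;
        # otherwise touch everything except it. The runtime isinstance check
        # avoids hard-coding which KV cache group index holds which manager.
        if cross_attn != isinstance(manager, CrossAttentionManager):
            continue
        total += manager.get_num_blocks_to_allocate(request_id, num_tokens,
                                                    computed[i])
    return total
```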
Force-pushed from 16f557d to a9e3459
I got this caught up with
This brings Whisper support to V1 to close one of the remaining feature gaps with V0. Most of the changes apply to encoder-decoder models generally, though Whisper is the only one explicitly tested and is the only encoder-decoder model updated to support V1.

**Whisper Model Implementation:**
- Remove SupportsV0Only interface constraint to enable V1 compatibility
- Update get_multimodal_embeddings() to return list format required by V1

**Flash Attention Backend:**
- Add encoder attention metadata fields (encoder_seq_start_loc, max_encoder_seq_len, cross_slot_mapping)
- Implement encoder self-attention support without using KV cache
- Add cross-attention support for encoder-decoder models with proper KV cache handling

**KV Cache Manager:**
- Introduce CrossAttentionManager for handling cross-attention KV cache in encoder-decoder models
- Add CrossAttentionSpec for cross-attention cache specification with encoder-based sizing
- Implement allocate_slots_for_cross_attn() for static encoder-length-based allocation
- Add cross-attention block allocation logic separate from decoder token growth

**Scheduler:**
- Disable prefix caching for encoder-decoder models
- Implement cross-attention block allocation during request scheduling
- Add cross-attention block tracking in state management

**GPU Model Runner:**
- Add encoder input extraction for audio features processing
- Implement encoder attention metadata building for both self-attention and cross-attention
- Add cross-attention KV cache group handling with proper slot mapping
- Modify input batch creation to accommodate encoder sequence lengths
- Add encoder input processing in forward pass with proper device/dtype handling
- Update profiling and memory management for encoder-decoder models

The implementation maintains backward compatibility while adding comprehensive encoder-decoder support, with particular focus on Whisper's audio processing pipeline and cross-attention mechanisms between encoder and decoder.

Related to:
- V0 deprecation: vllm-project#18571
- 2025 Q3 roadmap: vllm-project#20336

Signed-off-by: Russell Bryant <[email protected]>
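For intuition, the "static encoder-length-based allocation" mentioned above boils down to sizing cross-attention blocks from the fixed encoder output length rather than the growing decoder length. The sketch below uses assumed names, not the actual allocate_slots_for_cross_attn() signature.

```python
import math


# Sketch only: the function name and arguments are illustrative, not the real
# KV cache manager API. The key point is that the cross-attention KV cache is
# sized from the fixed encoder length, not from the growing decoder length.
def num_cross_attn_blocks(num_encoder_tokens: int, block_size: int) -> int:
    return math.ceil(num_encoder_tokens / block_size)


# Whisper's encoder emits a fixed 1500-token output for a 30 s audio chunk,
# so with 16-token blocks each request needs 94 cross-attention blocks,
# allocated once, no matter how long the generated transcript gets.
assert num_cross_attn_blocks(1500, 16) == 94
```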
This pull request has merge conflicts that must be resolved before it can be merged.
Add support for encoder models such as BERT which don't support a KV cache due to the non-causal attention. Since the KV cache spec is used to build the attention metadata for decoder models, this PR initializes the attention metadata builders for encoder-only models directly from the layers and adds a function to build the attention metadata.

This PR combines elements of PRs vllm-project#21088 and vllm-project#19988.

Summary of changes:

**Flash Attention Backend:**
- Implement encoder self-attention support without using KV cache

**Scheduler:**
- Disable chunked prefill for models without KV cache

**GPU Model Runner:**
- Implement encoder-only attention metadata building for self-attention

Related to:
- V0 deprecation: vllm-project#18571
- 2025 Q3 roadmap: vllm-project#20336

Signed-off-by: Max de Bayser <[email protected]>
Co-authored-by: Russell Bryant <[email protected]>
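The "without using KV cache" part means encoder self-attention just runs a variable-length attention kernel directly over the packed batch, with nothing written to paged blocks. A minimal sketch using the upstream flash-attn varlen API follows; the real backend goes through vLLM's own wrapper and metadata objects, so treat this as an approximation of the idea rather than the PR's code.

```python
import torch
from flash_attn import flash_attn_varlen_func  # upstream flash-attn API


def encoder_self_attention(q, k, v, seq_lens, scale):
    """Bidirectional attention over packed sequences; no paged KV cache.

    q, k, v: (total_tokens, num_heads, head_dim) packed across the batch.
    seq_lens: per-sequence encoder lengths.
    """
    cu_seqlens = torch.zeros(len(seq_lens) + 1, dtype=torch.int32,
                             device=q.device)
    cu_seqlens[1:] = torch.cumsum(
        torch.tensor(seq_lens, dtype=torch.int32, device=q.device), dim=0)
    max_len = max(seq_lens)
    return flash_attn_varlen_func(
        q, k, v,
        cu_seqlens_q=cu_seqlens, cu_seqlens_k=cu_seqlens,
        max_seqlen_q=max_len, max_seqlen_k=max_len,
        softmax_scale=scale,
        causal=False,  # encoder attention sees the whole sequence
    )
```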
This brings Whisper support to V1 to close one of the remaining
feature gaps with V0. Most of the changes apply to encoder-decoder
models generally, though Whisper is the only one explicitly tested
and is the only encoder-decoder model updated to support V1.
Whisper Model Implementation:
- Remove SupportsV0Only interface constraint to enable V1 compatibility
- Update get_multimodal_embeddings() to return list format required by V1

Flash Attention Backend:
- Add encoder attention metadata fields (encoder_seq_start_loc, max_encoder_seq_len, cross_slot_mapping)
- Implement encoder self-attention support without using KV cache
- Add cross-attention support for encoder-decoder models with proper KV cache handling

KV Cache Manager:
- Introduce CrossAttentionManager for handling cross-attention KV cache in encoder-decoder models
- Add CrossAttentionSpec for cross-attention cache specification with encoder-based sizing
- Implement allocate_slots_for_cross_attn() for static encoder-length-based allocation
- Add cross-attention block allocation logic separate from decoder token growth

Scheduler:
- Disable prefix caching for encoder-decoder models
- Implement cross-attention block allocation during request scheduling
- Add cross-attention block tracking in state management

GPU Model Runner:
- Add encoder input extraction for audio features processing
- Implement encoder attention metadata building for both self-attention and cross-attention
- Add cross-attention KV cache group handling with proper slot mapping
- Modify input batch creation to accommodate encoder sequence lengths
- Add encoder input processing in forward pass with proper device/dtype handling
- Update profiling and memory management for encoder-decoder models
The implementation maintains backward compatibility while adding comprehensive
encoder-decoder support, with particular focus on Whisper's audio processing
pipeline and cross-attention mechanisms between encoder and decoder.
Related to:
- V0 deprecation: #18571
- 2025 Q3 roadmap: #20336

Signed-off-by: Russell Bryant <[email protected]>
TODO items: