Enable 4bit bnb prequant MOE #21548
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. 🚀
Code Review
This pull request aims to enable 4-bit BitsAndBytes (bnb) pre-quantized MoE models. The changes involve modifying the validation logic for MoE quantization and updating a model implementation to selectively quantize the MoE router gate.
My review identified a critical issue in the validation logic that could lead to a runtime crash, and a high-severity design issue in the model implementation that impacts maintainability and correctness. I've provided suggestions to fix both issues.
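For context, a minimal sketch of the kind of invocation this PR is meant to enable is shown below (the model name is purely illustrative, and the exact engine arguments may differ across vLLM versions):

```python
from vllm import LLM

# Hypothetical 4-bit BitsAndBytes pre-quantized MoE checkpoint (illustrative name).
llm = LLM(
    model="some-org/some-moe-model-bnb-4bit",
    quantization="bitsandbytes",
    load_format="bitsandbytes",
)
outputs = llm.generate("Hello, my name is")
```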
-        if self.pre_quant:
-            raise ValueError(
-                "Prequant BitsAndBytes models with FusedMoE is not "
-                "supported yet.")
-        if self.load_8bit:
-            raise ValueError(
-                "BitsAndBytes 8bit quantization with FusedMoE is not "
-                "supported yet.")
+        if self.pre_quant and self.load_8bit:
+            raise ValueError(
+                "Prequant BitsAndBytes 8bit models with FusedMoE is not "
+                "supported yet.")
This change incorrectly allows 8-bit MoE quantization when `pre_quant` is false. The `BitsAndBytesMoEMethod` for 8-bit weights is not implemented and raises a `NotImplementedError` (see `vllm/model_executor/layers/quantization/bitsandbytes.py`). This will lead to a runtime crash for a configuration that this check now allows.

To enable 4-bit pre-quantized MoE while correctly disallowing 8-bit MoE, you should only check for `self.load_8bit`. The original check for `self.pre_quant` was what blocked 4-bit pre-quant, and removing it is correct, but the `self.load_8bit` check should remain as it was to prevent using the unimplemented 8-bit MoE path.
        if self.load_8bit:
            raise ValueError(
                "BitsAndBytes 8bit quantization with FusedMoE is not "
                "supported yet.")
         self.gate = ReplicatedLinear(config.hidden_size,
                                      config.num_experts,
                                      bias=False,
-                                     quant_config=None,
+                                     quant_config=self._maybe_ignore_quant_config(quant_config),  # Some quantization methods do not quantize the gate
                                      prefix=f"{prefix}.gate")

+    def _maybe_ignore_quant_config(self, quant_config: QuantizationConfig):
+        if not isinstance(quant_config, (BitsAndBytesConfig)):
+            return None
+        return quant_config
This implementation for deciding whether to quantize the MoE gate has a couple of issues:

- Hardcoded Logic: It specifically checks for `BitsAndBytesConfig`, making it difficult to extend to other quantization methods that might also support gate quantization. This can lead to silent failures where a new quantization method is expected to quantize the gate but doesn't. A more extensible approach would be to have a method on the `QuantizationConfig` itself to indicate this capability (a sketch of that idea follows the suggested change below).
- Incorrect Type Hint: The type hint for `quant_config` in `_maybe_ignore_quant_config` is `QuantizationConfig`, but it can be `None` at the call site.
- Clarity: The function name `_maybe_ignore_quant_config` is a bit ambiguous.
        self.gate = ReplicatedLinear(config.hidden_size,
                                     config.num_experts,
                                     bias=False,
                                     quant_config=self._get_gate_quant_config(quant_config),
                                     prefix=f"{prefix}.gate")

    def _get_gate_quant_config(
            self, quant_config: Optional[QuantizationConfig]
    ) -> Optional[QuantizationConfig]:
        # The gate is only quantized for BitsAndBytes for now.
        if isinstance(quant_config, BitsAndBytesConfig):
            return quant_config
        return None
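As a rough illustration of the more extensible direction mentioned above, the capability could live on the config itself rather than being hardcoded as a type check in the model (the `quantize_moe_gate` attribute below is hypothetical and not part of vLLM's current `QuantizationConfig` API):

```python
from typing import Optional


class QuantizationConfig:  # stand-in for vllm's base class
    # Hypothetical capability flag: methods that also quantize the MoE
    # router gate would override this to True.
    quantize_moe_gate: bool = False


class BitsAndBytesConfig(QuantizationConfig):
    quantize_moe_gate = True


def get_gate_quant_config(
        quant_config: Optional[QuantizationConfig]
) -> Optional[QuantizationConfig]:
    # Model code no longer needs to know about specific config classes.
    if quant_config is not None and quant_config.quantize_moe_gate:
        return quant_config
    return None
```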
Essential Elements of an Effective PR Description Checklist
- (Optional) The necessary documentation update, such as supported_models.md and examples for a new model.

Purpose

Test Plan

Test Result

(Optional) Documentation Update