Commit e088f0f

tlrmchlsmth, chaunceyjiang, youkaichao, xuechendi, and madamczyk-intel authored
Upstream sync (#17)
* Revert "[Misc] Add S3 environment variables for better support of MinIO." (vllm-project#17021)
* [misc] tune some env vars for GB200 (vllm-project#16992) Signed-off-by: youkaichao <[email protected]>
* [INTEL-HPU][v0] Port delayed sampling to upstream (vllm-project#16949) Signed-off-by: Michal Adamczyk <[email protected]> Signed-off-by: Chendi Xue <[email protected]> Co-authored-by: Michal Adamczyk <[email protected]>
* [doc] add download path tips (vllm-project#17013) Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]>
* [Bugfix] Triton FA function takes no keyword arguments (vllm-project#16902) Signed-off-by: vllmellm <[email protected]>
* [V1] Avoid socket errors during shutdown when requests are in in-flight (vllm-project#16807) Signed-off-by: Nick Hill <[email protected]>
* [BugFix] llama4 fa3 fix - RuntimeError: scheduler_metadata must have shape (metadata_size) (vllm-project#16998) Signed-off-by: Lucas Wilkinson <[email protected]>
* [Misc] Improve readability of get_open_port function. (vllm-project#17024) Signed-off-by: gitover22 <[email protected]>
* [Bugfix] Fix AssertionError: skip_special_tokens=False is not supported for Mistral tokenizers (vllm-project#16964) Signed-off-by: chaunceyjiang <[email protected]>
* [CI] Run v1/test_serial_utils.py in CI (vllm-project#16996) Signed-off-by: Russell Bryant <[email protected]>
* Mistral-format support for compressed-tensors (vllm-project#16803) Signed-off-by: mgoin <[email protected]>
* Categorize `tests/kernels/` based on kernel type (vllm-project#16799) Signed-off-by: mgoin <[email protected]>
* [Doc] Add top anchor and a note to quantization/bitblas.md (vllm-project#17042) Signed-off-by: windsonsea <[email protected]>
* Ensure that `pid` passed to `kill_process_tree` is `int` for `mypy` (vllm-project#17051) Signed-off-by: Harry Mellor <[email protected]>
* [CI] Update structured-output label automation (vllm-project#17055) Signed-off-by: Russell Bryant <[email protected]>
* Improve Transformers backend model loading QoL (vllm-project#17039) Signed-off-by: Harry Mellor <[email protected]>
* `CacheConfig.block_size` should always be `int` when used (vllm-project#17052) Signed-off-by: Harry Mellor <[email protected]>
* Use `@property` and private field for `data_parallel_rank_local` (vllm-project#17053) Signed-off-by: Harry Mellor <[email protected]>
* [Frontend] Support guidance:no-additional-properties for compatibility with xgrammar (vllm-project#15949) Signed-off-by: Travis Johnson <[email protected]>
* [BugFix][V1] Fix int32 token index overflow when preparing input ids (vllm-project#16806)
* [V1][Spec Decode] Always use argmax for sampling draft tokens (vllm-project#16899) Signed-off-by: Woosuk Kwon <[email protected]>
* [CI/Build] workaround for CI build failure (vllm-project#17070) Signed-off-by: csy1204 <[email protected]> Co-authored-by: Michael Goin <[email protected]>
* [Quantization]add prefix for commandA quantized model (vllm-project#17017)
* [Minor] Use larger batch sizes for A100/B100/B200/MI300x (vllm-project#17073) Signed-off-by: Woosuk Kwon <[email protected]>
* [Bugfix] Enable V1 usage stats (vllm-project#16986) Signed-off-by: mgoin <[email protected]> Signed-off-by: Nick Hill <[email protected]> Co-authored-by: Nick Hill <[email protected]>
* More informative error when using Transformers backend (vllm-project#16988) Signed-off-by: Harry Mellor <[email protected]>
* Addendum Fix to support FIPS enabled machines with MD5 hashing (vllm-project#17043) Signed-off-by: sydarb <[email protected]>
* [Bugfix][Core] add seq_id_to_seq_group clearing to avoid memory leak when s… (vllm-project#16472) Signed-off-by: 开哲 <[email protected]> Co-authored-by: 开哲 <[email protected]>
* [V1] Update structured output (vllm-project#16812) Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]>
* [doc] update to hyperlink (vllm-project#17096) Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]>
* Add docs for runai_streamer_sharded (vllm-project#17093) Signed-off-by: Omer Dayan (SW-GPU) <[email protected]> Co-authored-by: Cyrus Leung <[email protected]>
* [Chore] Remove Sampler from Model Code (vllm-project#17084) Signed-off-by: Woosuk Kwon <[email protected]>
* Disable enforce_eager for V1 TPU sampler and structured output tests (vllm-project#17016) Signed-off-by: mgoin <[email protected]>
* Simplify `TokenizerGroup` (vllm-project#16790) Signed-off-by: Harry Mellor <[email protected]>
* Fix OOT registration test (vllm-project#17099) Signed-off-by: Harry Mellor <[email protected]>
* [V1][PP] Optimization: continue scheduling prefill chunks (vllm-project#17080) Signed-off-by: Rui Qiao <[email protected]>
* [Misc] Remove OLMo2 config copy (vllm-project#17066) Signed-off-by: Isotr0py <[email protected]>
* Improve static type checking in `LoRAModelRunnerMixin` (vllm-project#17104) Signed-off-by: Harry Mellor <[email protected]>
* [V1][Structured Output] Clear xgrammar compiler object when engine core shut down to avoid nanobind leaked warning (vllm-project#16954) Signed-off-by: shen-shanshan <[email protected]>
* [Frontend] Using matryoshka_dimensions control the allowed output dimensions. (vllm-project#16970)
* Add missing rocm_skinny_gemms kernel test to CI (vllm-project#17060) Signed-off-by: mgoin <[email protected]>
* [Misc] refactor example series - structured outputs (vllm-project#17040) Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]>
* [V1][Spec Decoding] Add num_drafts and num_accepted_tokens_per_position metrics (vllm-project#16665) Signed-off-by: Mark McLoughlin <[email protected]>
* [CI] Add automation for the `tool-calling` github label (vllm-project#17118) Signed-off-by: Russell Bryant <[email protected]>
* Updating builkite job for IBM Power (vllm-project#17111) Signed-off-by: Aaruni Aggarwal <[email protected]>
* existing torch installation pip command fix for docs (vllm-project#17059)
* Molmo Requirements (vllm-project#17026) Signed-off-by: Eyshika Agarwal <[email protected]> Signed-off-by: eyshika <[email protected]>
* Add `:markdownhelp:` to `EngineArgs` docs so markdown docstrings render properly (vllm-project#17124) Signed-off-by: Harry Mellor <[email protected]>
* Improve configs - `LoRAConfig` + `PromptAdapterConfig` (vllm-project#16980) Signed-off-by: Harry Mellor <[email protected]>
* [Docs] Generate correct github links for decorated functions (vllm-project#17125) Signed-off-by: Russell Bryant <[email protected]>
* Add collective_rpc to llm engine (vllm-project#16999) Signed-off-by: Yinghai Lu <[email protected]>
* Add chat template for Llama 4 models (vllm-project#16428) Signed-off-by: Max de Bayser <[email protected]>
* [Misc] Add example to run DeepSeek with Ray Serve LLM (vllm-project#17134) Signed-off-by: Rui Qiao <[email protected]>
* Better error message for missing mistral params.json (vllm-project#17132) Signed-off-by: mgoin <[email protected]>
* Use custom address for listening socket (vllm-project#15988) Signed-off-by: Jens Glaser <[email protected]>
* [FEAT] [ROCm]: AITER Fused MOE V1 Support (vllm-project#16752) Signed-off-by: vllmellm <[email protected]> Co-authored-by: tjtanaa <[email protected]>
* [Attention] FA3 decode perf improvement - single mma warp group support for head dim 128 (vllm-project#16864) Signed-off-by: Lucas Wilkinson <[email protected]>
* fix float16 support for kimi-vl (vllm-project#17156) Co-authored-by: zhouzaida <[email protected]>
* [Doc] V1 : Update LoRA status (vllm-project#17133) Signed-off-by: varun sundar rabindranath <[email protected]> Co-authored-by: varun sundar rabindranath <[email protected]>
* [Docs] Fix True->true in supported_models.md (vllm-project#17141)
* Move missed `SchedulerConfig` args into scheduler config group in `EngineArgs` (vllm-project#17131) Signed-off-by: Harry Mellor <[email protected]>
* [Misc] Clean up redundant code in uniproc_executor.py (vllm-project#16762) Signed-off-by: Lifu Huang <[email protected]>
* [Bugfix][Misc] Use TritonPlaceholderModule to defensively import triton (vllm-project#15099) Signed-off-by: Mengqing Cao <[email protected]>
* [Misc] Benchmark Serving Script Support Appending Results (vllm-project#17028) Signed-off-by: Lucas Wilkinson <[email protected]>
* [Perf]Optimize rotary_emb implementation to use Triton operator for improved inference performance (vllm-project#16457) Signed-off-by: cynthieye <[email protected]> Co-authored-by: MagnetoWang <[email protected]>
* [Bugfix] remove fallback in guided_json (int range, patterns) (vllm-project#16725) Signed-off-by: csy1204 <[email protected]> Co-authored-by: 조상연[플레이스 AI] <[email protected]>
* [Quantization][FP8] Add support for FP8 models with input_scale for output projection and QK quantization (vllm-project#15734) Signed-off-by: Randall Smith <[email protected]> Signed-off-by: Luka Govedič <[email protected]> Co-authored-by: Luka Govedič <[email protected]>
* [Doc] Add headings to improve gptqmodel.md (vllm-project#17164) Signed-off-by: windsonsea <[email protected]>
* Only turn on FastIncrementalDetokenizer when tokenizers >= 0.21.1 (vllm-project#17158)
* [Doc] Add two links to disagg_prefill.md (vllm-project#17168) Signed-off-by: windsonsea <[email protected]>
* [Doc] Move todo out of beam search docstring (vllm-project#17183) Signed-off-by: Alex-Brooks <[email protected]>
* [Bugfix] Fix mistral model tests (vllm-project#17181) Signed-off-by: DarkLight1337 <[email protected]>
* [Bugfix] Fix Mistral ChatCompletionRequest Body Exception (vllm-project#16769) Signed-off-by: Jasmond Loh <[email protected]> Co-authored-by: Cyrus Leung <[email protected]>
* Bump Transformers to 4.51.3 (vllm-project#17116) Signed-off-by: Harry Mellor <[email protected]>
* Use Transformers helper `get_text_config()` instead of checking for `text_config` (vllm-project#17105) Signed-off-by: Harry Mellor <[email protected]>
* [doc] update wrong hf model links (vllm-project#17184) Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]>
* [Misc] Inline Molmo requirements (vllm-project#17190) Signed-off-by: DarkLight1337 <[email protected]>
* [Security] Use safe serialization and fix zmq setup for mooncake pipe (vllm-project#17192) Signed-off-by: Shangming Cai <[email protected]> Co-authored-by: Shangming Cai <[email protected]>
* [V1] Move usage stats to worker and start logging TPU hardware (vllm-project#16211)
* [Bugfix] Fix hybrid model tests (vllm-project#17182) Signed-off-by: DarkLight1337 <[email protected]>
* Fix Python packaging edge cases (vllm-project#17159) Signed-off-by: Christian Heimes <[email protected]>
* [BugFix][Frontend] Fix `LLM.chat()` tokenization (vllm-project#16081) Signed-off-by: Nick Hill <[email protected]>
* [V1][Spec Decode] EAGLE-3 Support (vllm-project#16937) Signed-off-by: Bryan Lu <[email protected]> Signed-off-by: Benjamin Chislett <[email protected]> Co-authored-by: Bryan Lu <[email protected]>
* [Misc] Refine ray_serve_deepseek example (vllm-project#17204) Signed-off-by: Rui Qiao <[email protected]>
* [Bugfix] gemma[2,3] interleaved attention when sliding window is disabled (vllm-project#17180) Signed-off-by: Chen Zhang <[email protected]>
* [AMD][FP8][BugFix] Remove V1 check in arg_utils.py for FP8 since it is not necessary (vllm-project#17215) Signed-off-by: Randall Smith <[email protected]>
* [v1] [P/D] Adding LMCache KV connector for v1 (vllm-project#16625)
* [Bugfix] [pytorch] Patch AOTAutogradCache._get_shape_env (vllm-project#17142) Signed-off-by: James Wu <[email protected]>
* [MISC][AMD] Add unused annotation to rocm kernel file (vllm-project#17097) Signed-off-by: Lu Fang <[email protected]>
* [doc] add Anything LLM integration (vllm-project#17216) Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]>
* [Minor][Spec Decode] Add use_eagle to SpeculativeConfig (vllm-project#17213) Signed-off-by: Woosuk Kwon <[email protected]>
* [Doc] Minor fix for the vLLM TPU setup page (vllm-project#17206) Signed-off-by: Yarong Mu <[email protected]>
* [Minor][Models] Fix Return Types of Llama & Eagle (vllm-project#17220) Signed-off-by: Woosuk Kwon <[email protected]>
* Allocate kv_cache with stride order (vllm-project#16605) Signed-off-by: shuw <[email protected]>
* [ROCm][Misc] Follow-ups for Skinny Gemms on ROCm. (vllm-project#17011) Signed-off-by: charlifu <[email protected]>
* [V1][Metrics] Allow V1 AsyncLLM to use custom logger (vllm-project#14661) Signed-off-by: Zijing Liu <[email protected]> Signed-off-by: Mark McLoughlin <[email protected]> Signed-off-by: Nick Hill <[email protected]> Co-authored-by: Mark McLoughlin <[email protected]> Co-authored-by: Nick Hill <[email protected]>
* [BugFix] Avoid race conditions in zero-copy tensor transmission (vllm-project#17203) Signed-off-by: Nick Hill <[email protected]>
* [CI/test] Fix Eagle Correctness Test (vllm-project#17209) Signed-off-by: Woosuk Kwon <[email protected]>
* [Core] Remove prompt string from engine core data structures (vllm-project#17214) Signed-off-by: Nick Hill <[email protected]>
* [Bugfix] Fix missing int type for `-n` in multi-image example (vllm-project#17223)
* [Bugfix] Fix standard models tests (vllm-project#17217) Signed-off-by: DarkLight1337 <[email protected]>
* [Hardware][Intel-Gaudi] Update hpu-extension and update bucketing system for HPU device (vllm-project#17186) Signed-off-by: Agata Dobrzyniewicz <[email protected]>
* [V1] Add `structural_tag` support using xgrammar (vllm-project#17085)
* [BUGFIX] use random for NONE_HASH only when PYTHONHASHSEED not set (vllm-project#17088) Signed-off-by: Andy Xie <[email protected]>
* [Chore] added stubs for `vllm_flash_attn` during development mode (vllm-project#17228) Signed-off-by: Aaron Pham <[email protected]>
* [Docs] Update structured output doc for V1 (vllm-project#17135) Signed-off-by: Russell Bryant <[email protected]>
* [Bugfix] fix error due to an uninitialized tokenizer when using `skip_tokenizer_init` with `num_scheduler_steps` (vllm-project#9276) Signed-off-by: changjun.lee <[email protected]>
* Disable the torch.compile cache checks when VLLM_DISABLE_COMPILE_CACHE=1 (vllm-project#16573) Signed-off-by: Lu Fang <[email protected]>
* [MISC] rename interval to max_recent_requests (vllm-project#14285)
* [Bugfix] Fix Qwen2.5-Omni M-RoPE position ids generation (vllm-project#16878) Signed-off-by: imkero <[email protected]>
* [Minor] Fix lint error in main branch (vllm-project#17233) Signed-off-by: Woosuk Kwon <[email protected]>
* [CI/Build] remove -t for run-lm-eval-gsm-hf-baseline.sh (vllm-project#16271) Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]>
* Update test_flash_attn.py (vllm-project#17102) Signed-off-by: ShuaibinLi <[email protected]>
* [Kernel][Triton][FP8] Adding fp8 and variable length sequence support to Triton FAv2 kernel (vllm-project#12591) Signed-off-by: Randall Smith <[email protected]>
* [Misc] Make cached tokenizer pickle-compatible (vllm-project#17048) Signed-off-by: DarkLight1337 <[email protected]>
* [Bugfix] Fix QWen2 VL multimodal mapping (vllm-project#17240) Signed-off-by: Jee Jee Li <[email protected]>
* [Bugfix] Get a specific type of layer from forward context (vllm-project#17222) Signed-off-by: Chen Zhang <[email protected]>
* [MISC] Use string annotation types for class definitions (vllm-project#17244) Signed-off-by: Jade Zheng <[email protected]>
* [Misc] Change buckets of histogram_iteration_tokens to [1, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8096] to represent number of tokens (vllm-project#17033) Signed-off-by: sfc-gh-zhwang <[email protected]>
* [Bugfix] Fix Lora Name Parsing (vllm-project#17196) Signed-off-by: Alex-Brooks <[email protected]> Co-authored-by: Jee Jee Li <[email protected]>
* [NVIDIA] Support Cutlass MLA for Blackwell GPUs (vllm-project#16032) Signed-off-by: kaixih <[email protected]>
* [Feature] support sequence parallelism using compilation pass (vllm-project#16155) Signed-off-by: cascade812 <[email protected]> Signed-off-by: Tyler Michael Smith <[email protected]> Co-authored-by: Tyler Michael Smith <[email protected]>
* [doc] Add feature status legend (vllm-project#17257) Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]>
* [Metrics] Fix minor inconsistencies in bucket progression (vllm-project#17262) Signed-off-by: DarkLight1337 <[email protected]>
* [V1][Spec Decode] Make eagle compatible with prefix caching. (vllm-project#17137) Signed-off-by: LiuXiaoxuanPKU <[email protected]>
* [BugFix] Fix vllm_flash_attn install issues (vllm-project#17267) Signed-off-by: Lucas Wilkinson <[email protected]> Co-authored-by: Jee Jee Li <[email protected]> Co-authored-by: Aaron Pham <[email protected]>
* [Bugfix] Fix missing ARG in Dockerfile for arm64 platforms (vllm-project#17261) Signed-off-by: lkm-schulz <[email protected]>
* [Bugfix] Fix cutlass dispatch for fp8/int8 to properly invoke M<=16 c… (vllm-project#16751) Signed-off-by: Ther-LF <[email protected]>
* [Bugfix] Fix Mistral3 spatial merge error (vllm-project#17270) Signed-off-by: mgoin <[email protected]>
* [Doc] Fix wrong github link in LMCache examples (vllm-project#17274) Signed-off-by: KuntaiDu <[email protected]>
* [Doc] small fix (vllm-project#17277) Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]>
* [Misc] Validate `stop_token_ids` contents (vllm-project#17268) Signed-off-by: Nick Hill <[email protected]>
* [Minor][Models] Pass partial_rotary_factor parameter to rope (vllm-project#17266) Signed-off-by: evian <[email protected]> Co-authored-by: evian <[email protected]>
* [Core] Remove legacy input mapper/processor from V0 (vllm-project#15686) Signed-off-by: DarkLight1337 <[email protected]>
* [Model] Add Granite Speech Support (vllm-project#16246) Signed-off-by: Alex-Brooks <[email protected]> Signed-off-by: Alex-Brooks <[email protected]>
* Update tpu_worker.py 's typo (vllm-project#17288)
* Add missing class docstring for `PromptAdapterConfig` (vllm-project#17302) Signed-off-by: Harry Mellor <[email protected]>
* [Bugfix] Add missing `get_language_model` to new MLLMs (vllm-project#17300) Signed-off-by: DarkLight1337 <[email protected]>
* [doc] update wrong model id (vllm-project#17287) Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]>
* [Misc] Minor typo/grammar in `platforms/interface.py` (vllm-project#17307) Signed-off-by: NickLucche <[email protected]>
* [Misc] Clean up Qwen2.5-Omni code (vllm-project#17301) Signed-off-by: DarkLight1337 <[email protected]>
* [Docs] Add a security guide (vllm-project#17230) Signed-off-by: Russell Bryant <[email protected]> Co-authored-by: Cyrus Leung <[email protected]>
* Improve conversion from dataclass configs to argparse arguments (vllm-project#17303) Signed-off-by: Harry Mellor <[email protected]>
* Make name of `compressed-tensors` quant method consistent across vLLM (vllm-project#17255) Signed-off-by: Harry Mellor <[email protected]>
* Explicitly explain quant method override ordering and ensure all overrides are ordered (vllm-project#17256) Signed-off-by: Harry Mellor <[email protected]>
* [Security] Don't bind tcp zmq socket to all interfaces (vllm-project#17197) Signed-off-by: Russell Bryant <[email protected]>
* [Chore] cleanup license indicators in light of SPDX (vllm-project#17259) Signed-off-by: Aaron Pham <[email protected]> Co-authored-by: Russell Bryant <[email protected]>
* [BugFix] Fix cascade attention - RuntimeError: scheduler_metadata must have shape (metadata_size) (vllm-project#17283) Signed-off-by: Lucas Wilkinson <[email protected]>
* [Bugfix] Fix moe weight losing all extra attrs after `process_weights_after_loading`. (vllm-project#16854) Signed-off-by: charlifu <[email protected]>
* [Model] Qwen3 Dense FP8 Compat Fixes (vllm-project#17318) Signed-off-by: simon-mo <[email protected]>
* Support loading transformers models with named parameters (vllm-project#16868) Signed-off-by: Alex <[email protected]>
* [Model] Add tuned triton fused_moe configs for Qwen3Moe (vllm-project#17328) Signed-off-by: mgoin <[email protected]>
* [Benchmark] Add single turn MTBench to Serving Bench (vllm-project#17202)
* [Optim] Compute multimodal hash only once per item (vllm-project#17314) Signed-off-by: DarkLight1337 <[email protected]>
* implement Structural Tag with Guidance backend (vllm-project#17333) Signed-off-by: Michal Moskal <[email protected]>
* [V1][Spec Decode] Make Eagle model arch config driven (vllm-project#17323)
* [model] make llama4 compatible with pure dense layers (vllm-project#17315) Signed-off-by: Lucia Fang <[email protected]>
* [Bugfix] Fix `numel()` downcast in fused_layernorm_dynamic_per_token_quant.cu (vllm-project#17316)
* Ignore `'<string>'` filepath (vllm-project#17330) Signed-off-by: rzou <[email protected]>
* [Bugfix] Add contiguous call inside rope kernel wrapper (vllm-project#17091) Signed-off-by: 苏政渊 <[email protected]> Co-authored-by: 苏政渊 <[email protected]>
* [Misc] Add a Jinja template to support Mistral3 function calling (vllm-project#17195) Signed-off-by: chaunceyjiang <[email protected]>
* [Model] support MiniMax-VL-01 model (vllm-project#16328) Signed-off-by: qingjun <[email protected]>
* [Misc] Move config fields to MultiModalConfig (vllm-project#17343) Signed-off-by: DarkLight1337 <[email protected]>
* [Misc]Use a platform independent interface to obtain the device attributes (vllm-project#17100)
* [Fix] Documentation spacing in compilation config help text (vllm-project#17342) Signed-off-by: Zerohertz <[email protected]>
* [Build][Bugfix] Restrict setuptools version to <80 (vllm-project#17320) Signed-off-by: Gregory Shtrasberg <[email protected]>
* [Model] Ignore rotary embed load for Cohere model (vllm-project#17319)
* Update docs requirements (vllm-project#17379) Signed-off-by: Harry Mellor <[email protected]>
* [Doc] Fix QWen3MOE info (vllm-project#17381) Signed-off-by: Jee Jee Li <[email protected]>
* [Bugfix] Clean up MiniMax-VL and fix processing (vllm-project#17354) Signed-off-by: DarkLight1337 <[email protected]>
* `pre-commit autoupdate` (vllm-project#17380) Signed-off-by: Harry Mellor <[email protected]>
* [Frontend] Support `chat_template_kwargs` in `LLM.chat` (vllm-project#17356) Signed-off-by: DarkLight1337 <[email protected]>
* Transformers backend tweaks (vllm-project#17365) Signed-off-by: Harry Mellor <[email protected]>
* Fix: Spelling of inference (vllm-project#17387)
* Improve literal dataclass field conversion to argparse argument (vllm-project#17391) Signed-off-by: Harry Mellor <[email protected]>
* [V1] Remove num_input_tokens from attn_metadata (vllm-project#17193) Signed-off-by: Chen Zhang <[email protected]>
* [Bugfix] add qwen3 reasoning-parser fix content is None when disable … (vllm-project#17369) Signed-off-by: mofanke <[email protected]>
* fix gemma3 results all zero (vllm-project#17364) Signed-off-by: mayuyuace <[email protected]>
* [Misc][ROCm] Exclude `cutlass_mla_decode` for ROCm build (vllm-project#17289) Signed-off-by: Tianyuan Wu <[email protected]>
* Enabling multi-group kernel tests. (vllm-project#17115) Signed-off-by: Alexei V. Ivanov <[email protected]>
* [Docs] Propose a deprecation policy for the project (vllm-project#17063) Signed-off-by: Russell Bryant <[email protected]> Co-authored-by: Cyrus Leung <[email protected]>
* [Doc][Typo] Fixing label in new model requests link in overview.md (vllm-project#17400)
* [TPU][V1][CI] Replace `python3 setup.py develop` with standard `pip install --e` on TPU (vllm-project#17374) Signed-off-by: NickLucche <[email protected]>
* [CI] Uses Python 3.11 for TPU (vllm-project#17359) Signed-off-by: Aaron Pham <[email protected]>
* [CI/Build] Add retry mechanism for add-apt-repository (vllm-project#17107) Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]>
* [Bugfix] Fix Minicpm-O-int4 GPTQ model inference (vllm-project#17397) Signed-off-by: Isotr0py <[email protected]>
* Simplify (and fix) passing of guided decoding backend options (vllm-project#17008) Signed-off-by: Harry Mellor <[email protected]>
* Remove Falcon3 2x7B from CI (vllm-project#17404) Signed-off-by: Harry Mellor <[email protected]>
* Fix: Python package installation for opentelmetry (vllm-project#17049) Signed-off-by: Dilip Gowda Bhagavan <[email protected]>
* [V1][Spec Decode] Apply torch.compile & cudagraph to EAGLE (vllm-project#17211) Signed-off-by: Bryan Lu <[email protected]>
* Remove Bamba 9B from CI (vllm-project#17407) Signed-off-by: Harry Mellor <[email protected]>
* [V1][Feature] Enable Speculative Decoding with Structured Outputs (vllm-project#14702) Signed-off-by: Benjamin Chislett <[email protected]> Signed-off-by: Benjamin Chislett <[email protected]>
* [release] Always git fetch all to get latest tag on TPU release (vllm-project#17322)
* Truncation control for embedding models (vllm-project#14776) Signed-off-by: Gabriel Marinho <[email protected]> Signed-off-by: Max de Bayser <[email protected]> Co-authored-by: Max de Bayser <[email protected]>
* Update PyTorch to 2.7.0 (vllm-project#16859)
* Improve configs - `ModelConfig` (vllm-project#17130) Signed-off-by: Harry Mellor <[email protected]>
* Fix call to `logger.info_once` (vllm-project#17416) Signed-off-by: Harry Mellor <[email protected]>
* Fix some speculative decode tests with tl.dot (vllm-project#17371) Signed-off-by: Huy Do <[email protected]>
* Support LoRA for Mistral3 (vllm-project#17428) Signed-off-by: mgoin <[email protected]>
* [Intel GPU] [CI]Fix XPU ci, setuptools >=80.0 have build issue (vllm-project#17298) Signed-off-by: Kunshang Ji <[email protected]>
* [Hardware][Intel GPU] Upgrade to torch 2.7 (vllm-project#17444) Signed-off-by: Kunshang Ji <[email protected]> Co-authored-by: Qiming Zhang <[email protected]>
* [Bugfix] Fix AttributeError: 'State' object has no attribute 'engine_client' (vllm-project#17434) Signed-off-by: chaunceyjiang <[email protected]>
* [MODEL ADDITION] Ovis2 Model Addition (vllm-project#15826) Signed-off-by: Marco <[email protected]> Signed-off-by: Isotr0py <[email protected]> Signed-off-by: isotr0py <[email protected]> Co-authored-by: Isotr0py <[email protected]> Co-authored-by: Isotr0py <[email protected]>
* Make the _apply_rotary_emb compatible with dynamo (vllm-project#17435)
* [Misc] Remove deprecated files (vllm-project#17447) Signed-off-by: chaunceyjiang <[email protected]>
* [V1][Bugfix]: vllm v1 verison metric num_gpu_blocks is None (vllm-project#15755) Signed-off-by: rongfu.leng <[email protected]>
* [TPU][V1][CI] Update regression test baseline for v6 CI (vllm-project#17064) Signed-off-by: NickLucche <[email protected]>
* [Core] Prevent side-channel attacks via cache salting (vllm-project#17045) Signed-off-by: Marko Rosenmueller <[email protected]>
* [V1][Metrics] add support for kv event publishing (vllm-project#16750) Signed-off-by: alec-flowers <[email protected]> Signed-off-by: Mark McLoughlin <[email protected]> Co-authored-by: Mark McLoughlin <[email protected]>
* [Feature] The Qwen3 reasoning parser supports guided decoding (vllm-project#17466) Signed-off-by: chaunceyjiang <[email protected]>
* [Docs] Add command for running mypy tests from CI (vllm-project#17475) Signed-off-by: Russell Bryant <[email protected]>
* [Fix] Support passing args to logger (vllm-project#17425) Signed-off-by: Aaron Pham <[email protected]>
* [Bugfix] Fixed mistral tokenizer path when pointing to file (vllm-project#17457) Signed-off-by: Pete Savage <[email protected]>
* [V1] Allow turning off pickle fallback in vllm.v1.serial_utils (vllm-project#17427) Signed-off-by: Russell Bryant <[email protected]> Co-authored-by: Cyrus Leung <[email protected]>
* [Docs] Update optimization.md doc (vllm-project#17482) Signed-off-by: mgoin <[email protected]>
* [BugFix] Fix authorization of openai_transcription_client.py (vllm-project#17321) Signed-off-by: zh Wang <[email protected]>
* [Bugfix][ROCm] Restrict ray version due to a breaking release (vllm-project#17480) Signed-off-by: Gregory Shtrasberg <[email protected]>
* [doc] add install tips (vllm-project#17373) Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]>
* doc: fix bug report Github template formatting (vllm-project#17486) Signed-off-by: David Xia <[email protected]>
* [v1][Spec Decode] Make sliding window compatible with eagle prefix caching (vllm-project#17398) Signed-off-by: Chen Zhang <[email protected]>
* Bump Compressed Tensors version to 0.9.4 (vllm-project#17478) Signed-off-by: Rahul Tuli <[email protected]> Co-authored-by: mgoin <[email protected]>
* [Misc] Rename Audios -> Audio in Qwen2audio Processing (vllm-project#17507) Signed-off-by: Alex-Brooks <[email protected]>
* [CI][TPU] Skip Multimodal test (vllm-project#17488) Signed-off-by: Siyuan Liu <[email protected]>
* [Bugfix][ROCm] Fix import error on ROCm (vllm-project#17495) Signed-off-by: Gregory Shtrasberg <[email protected]>
* [Bugfix] Temporarily disable gptq_bitblas on ROCm (vllm-project#17411) Signed-off-by: Yan Cangang <[email protected]>
* [CI][TPU] Skip structured outputs+spec decode tests on TPU (vllm-project#17510) Signed-off-by: mgoin <[email protected]>
* [CI][Bugfix] Fix failing V1 Test due to missing 'cache_salt' arg (vllm-project#17500) Signed-off-by: mgoin <[email protected]>
* [CI/Build] Reorganize models tests (vllm-project#17459) Signed-off-by: DarkLight1337 <[email protected]>
* FIxing the AMD test failures caused by PR#16457 (vllm-project#17511) Signed-off-by: Alexei V. Ivanov <[email protected]>
* [Build] Require setuptools >= 77.0.3 for PEP 639 (vllm-project#17389) Signed-off-by: Russell Bryant <[email protected]>
* [ROCm] Effort to reduce the number of environment variables in command line (vllm-project#17229) Signed-off-by: Hongxia Yang <[email protected]>
* [BugFix] fix speculative decoding memory leak when speculation is disabled (vllm-project#15506) Signed-off-by: Noah Yoshida <[email protected]>
* [BugFix] Fix mla cpu - missing 3 required positional arguments (vllm-project#17494) Signed-off-by: Lucas Wilkinson <[email protected]>
* Avoid overwriting vllm_compile_cache.py (vllm-project#17418) Signed-off-by: Keyun Tong <[email protected]>
* [Core] Enable IPv6 with vllm.utils.make_zmq_socket() (vllm-project#16506) Signed-off-by: Russell Bryant <[email protected]>
* [Misc] Optimize the Qwen3_ReasoningParser extract_reasoning_content (vllm-project#17515) Signed-off-by: chaunceyjiang <[email protected]>
* Improve configs - `ObservabilityConfig` (vllm-project#17453) Signed-off-by: Harry Mellor <[email protected]>
* [Bugfix][Benchmarks] Allow benchmark of deepspeed-mii backend to select a model (vllm-project#17285) Signed-off-by: Teruaki Ishizaki <[email protected]>
* [Frontend] Show progress bar for adding requests (vllm-project#17525) Signed-off-by: DarkLight1337 <[email protected]>
* [Misc] Clean up test docstrings and names (vllm-project#17521) Signed-off-by: DarkLight1337 <[email protected]>
* [FEAT] [ROCm]: Add Qwen/Qwen3-30B-A3B-FP8 fused moe config for MI300X (vllm-project#17530) Signed-off-by: tjtanaa <[email protected]>
* Fix more broken speculative decode tests (vllm-project#17450) Signed-off-by: Huy Do <[email protected]>
* [doc] add streamlit integration (vllm-project#17522) Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]>
* [FEAT] [ROCm]: Add Qwen/Qwen3-235B-A22B-FP8 TP4 triton fused moe config (vllm-project#17535) Signed-off-by: tjtanaa <[email protected]>
* [Feature][Frontend]: Deprecate --enable-reasoning (vllm-project#17452) Signed-off-by: chaunceyjiang <[email protected]>
* [ROCm] remove unsupported archs from rocm triton flash-attention supported list (vllm-project#17536) Signed-off-by: Hongxia Yang <[email protected]>
* [torch.compile] Add torch inductor pass for fusing silu_and_mul with subsequent scaled_fp8_quant operations (vllm-project#10867) Signed-off-by: Sage Moore <[email protected]>
* [Misc] refactor example - cpu_offload_lmcache (vllm-project#17460) Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]>

--------

Signed-off-by: youkaichao <[email protected]>
Signed-off-by: Michal Adamczyk <[email protected]>
Signed-off-by: Chendi Xue <[email protected]>
Signed-off-by: reidliu41 <[email protected]>
Signed-off-by: vllmellm <[email protected]>
Signed-off-by: Nick Hill <[email protected]>
Signed-off-by: Lucas Wilkinson <[email protected]>
Signed-off-by: gitover22 <[email protected]>
Signed-off-by: chaunceyjiang <[email protected]>
Signed-off-by: Russell Bryant <[email protected]>
Signed-off-by: mgoin <[email protected]>
Signed-off-by: windsonsea <[email protected]>
Signed-off-by: Harry Mellor <[email protected]>
Signed-off-by: Travis Johnson <[email protected]>
Signed-off-by: Woosuk Kwon <[email protected]>
Signed-off-by: csy1204 <[email protected]>
Signed-off-by: sydarb <[email protected]>
Signed-off-by: 开哲 <[email protected]>
Signed-off-by: Omer Dayan (SW-GPU) <[email protected]>
Signed-off-by: Rui Qiao <[email protected]>
Signed-off-by: Isotr0py <[email protected]>
Signed-off-by: shen-shanshan <[email protected]>
Signed-off-by: Mark McLoughlin <[email protected]>
Signed-off-by: Aaruni Aggarwal <[email protected]>
Signed-off-by: Eyshika Agarwal <[email protected]>
Signed-off-by: eyshika <[email protected]>
Signed-off-by: Yinghai Lu <[email protected]>
Signed-off-by: Max de Bayser <[email protected]>
Signed-off-by: Jens Glaser <[email protected]>
Signed-off-by: varun sundar rabindranath <[email protected]>
Signed-off-by: Lifu Huang <[email protected]>
Signed-off-by: Mengqing Cao <[email protected]>
Signed-off-by: cynthieye <[email protected]>
Signed-off-by: Randall Smith <[email protected]>
Signed-off-by: Luka Govedič <[email protected]>
Signed-off-by: Alex-Brooks <[email protected]>
Signed-off-by: DarkLight1337 <[email protected]>
Signed-off-by: Jasmond Loh <[email protected]>
Signed-off-by: Shangming Cai <[email protected]>
Signed-off-by: Christian Heimes <[email protected]>
Signed-off-by: Bryan Lu <[email protected]>
Signed-off-by: Benjamin Chislett <[email protected]>
Signed-off-by: Chen Zhang <[email protected]>
Signed-off-by: James Wu <[email protected]>
Signed-off-by: Lu Fang <[email protected]>
Signed-off-by: Yarong Mu <[email protected]>
Signed-off-by: shuw <[email protected]>
Signed-off-by: charlifu <[email protected]>
Signed-off-by: Zijing Liu <[email protected]>
Signed-off-by: Agata Dobrzyniewicz <[email protected]>
Signed-off-by: Andy Xie <[email protected]>
Signed-off-by: Aaron Pham <[email protected]>
Signed-off-by: changjun.lee <[email protected]>
Signed-off-by: imkero <[email protected]>
Signed-off-by: ShuaibinLi <[email protected]>
Signed-off-by: Jee Jee Li <[email protected]>
Signed-off-by: Jade Zheng <[email protected]>
Signed-off-by: sfc-gh-zhwang <[email protected]>
Signed-off-by: kaixih <[email protected]>
Signed-off-by: cascade812 <[email protected]>
Signed-off-by: Tyler Michael Smith <[email protected]>
Signed-off-by: LiuXiaoxuanPKU <[email protected]>
Signed-off-by: lkm-schulz <[email protected]>
Signed-off-by: Ther-LF <[email protected]>
Signed-off-by: KuntaiDu <[email protected]>
Signed-off-by: evian <[email protected]>
Signed-off-by: Alex-Brooks <[email protected]>
Signed-off-by: NickLucche <[email protected]>
Signed-off-by: simon-mo <[email protected]>
Signed-off-by: Alex <[email protected]>
Signed-off-by: Michal Moskal <[email protected]>
Signed-off-by: Lucia Fang <[email protected]>
Signed-off-by: rzou <[email protected]>
Signed-off-by: 苏政渊 <[email protected]>
Signed-off-by: qingjun <[email protected]>
Signed-off-by: Zerohertz <[email protected]>
Signed-off-by: Gregory Shtrasberg <[email protected]>
Signed-off-by: mofanke <[email protected]>
Signed-off-by: mayuyuace <[email protected]>
Signed-off-by: Tianyuan Wu <[email protected]>
Signed-off-by: Alexei V. Ivanov <[email protected]>
Signed-off-by: Dilip Gowda Bhagavan <[email protected]>
Signed-off-by: Benjamin Chislett <[email protected]>
Signed-off-by: Gabriel Marinho <[email protected]>
Signed-off-by: Huy Do <[email protected]>
Signed-off-by: Kunshang Ji <[email protected]>
Signed-off-by: Marco <[email protected]>
Signed-off-by: isotr0py <[email protected]>
Signed-off-by: rongfu.leng <[email protected]>
Signed-off-by: Marko Rosenmueller <[email protected]>
Signed-off-by: alec-flowers <[email protected]>
Signed-off-by: Pete Savage <[email protected]>
Signed-off-by: zh Wang <[email protected]>
Signed-off-by: David Xia <[email protected]>
Signed-off-by: Rahul Tuli <[email protected]>
Signed-off-by: Siyuan Liu <[email protected]>
Signed-off-by: Yan Cangang <[email protected]>
Signed-off-by: Hongxia Yang <[email protected]>
Signed-off-by: Noah Yoshida <[email protected]>
Signed-off-by: Keyun Tong <[email protected]>
Signed-off-by: Teruaki Ishizaki <[email protected]>
Signed-off-by: tjtanaa <[email protected]>
Signed-off-by: Sage Moore <[email protected]>
Co-authored-by: Chauncey <[email protected]>
Co-authored-by: youkaichao <[email protected]>
Co-authored-by: Chendi.Xue <[email protected]>
Co-authored-by: Michal Adamczyk <[email protected]>
Co-authored-by: Reid <[email protected]>
Co-authored-by: reidliu41 <[email protected]>
Co-authored-by: vllmellm <[email protected]>
Co-authored-by: Nick Hill <[email protected]>
Co-authored-by: Lucas Wilkinson <[email protected]>
Co-authored-by: huafeng <[email protected]>
Co-authored-by: Russell Bryant <[email protected]>
Co-authored-by: Michael Goin <[email protected]>
Co-authored-by: Michael Yao <[email protected]>
Co-authored-by: Harry Mellor <[email protected]>
Co-authored-by: Travis Johnson <[email protected]>
Co-authored-by: Yong Hoon Shin <[email protected]>
Co-authored-by: Woosuk Kwon <[email protected]>
Co-authored-by: Sangyeon Cho <[email protected]>
Co-authored-by: Chen Xia <[email protected]>
Co-authored-by: Areeb Syed <[email protected]>
Co-authored-by: 张宇 <[email protected]>
Co-authored-by: 开哲 <[email protected]>
Co-authored-by: omer-dayan <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>
Co-authored-by: Rui Qiao <[email protected]>
Co-authored-by: Isotr0py <[email protected]>
Co-authored-by: Shanshan Shen <[email protected]>
Co-authored-by: wang.yuqi <[email protected]>
Co-authored-by: Mark McLoughlin <[email protected]>
Co-authored-by: Aaruni Aggarwal <[email protected]>
Co-authored-by: Atilla <[email protected]>
Co-authored-by: Eyshika Agarwal <[email protected]>
Co-authored-by: Yinghai Lu <[email protected]>
Co-authored-by: Maximilien de Bayser <[email protected]>
Co-authored-by: jglaser <[email protected]>
Co-authored-by: tjtanaa <[email protected]>
Co-authored-by: Zaida Zhou <[email protected]>
Co-authored-by: zhouzaida <[email protected]>
Co-authored-by: Varun Sundar Rabindranath <[email protected]>
Co-authored-by: varun sundar rabindranath <[email protected]>
Co-authored-by: Lifu Huang <[email protected]>
Co-authored-by: Mengqing Cao <[email protected]>
Co-authored-by: yexin(叶鑫) <[email protected]>
Co-authored-by: MagnetoWang <[email protected]>
Co-authored-by: 조상연[플레이스 AI] <[email protected]>
Co-authored-by: rasmith <[email protected]>
Co-authored-by: Luka Govedič <[email protected]>
Co-authored-by: Lu Fang <[email protected]>
Co-authored-by: Alex Brooks <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>
Co-authored-by: Jasmond L <[email protected]>
Co-authored-by: Shangming Cai <[email protected]>
Co-authored-by: Daniel Li <[email protected]>
Co-authored-by: Christian Heimes <[email protected]>
Co-authored-by: Benjamin Chislett <[email protected]>
Co-authored-by: Bryan Lu <[email protected]>
Co-authored-by: Chen Zhang <[email protected]>
Co-authored-by: Yihua Cheng <[email protected]>
Co-authored-by: James Wu <[email protected]>
Co-authored-by: yarongmu-google <[email protected]>
Co-authored-by: Shu Wang <[email protected]>
Co-authored-by: Charlie Fu <[email protected]>
Co-authored-by: Zijing Liu <[email protected]>
Co-authored-by: Agata Dobrzyniewicz <[email protected]>
Co-authored-by: Ning Xie <[email protected]>
Co-authored-by: Aaron Pham <[email protected]>
Co-authored-by: changjun.lee <[email protected]>
Co-authored-by: Kero Liang <[email protected]>
Co-authored-by: Happy <[email protected]>
Co-authored-by: Jee Jee Li <[email protected]>
Co-authored-by: Jade Zheng <[email protected]>
Co-authored-by: Flex Wang <[email protected]>
Co-authored-by: Kaixi Hou <[email protected]>
Co-authored-by: cascade <[email protected]>
Co-authored-by: Lily Liu <[email protected]>
Co-authored-by: Lennart K. M. Schulz <[email protected]>
Co-authored-by: TherLF <[email protected]>
Co-authored-by: Kuntai Du <[email protected]>
Co-authored-by: Wanrui Dai <[email protected]>
Co-authored-by: evian <[email protected]>
Co-authored-by: idouba <[email protected]>
Co-authored-by: Nicolò Lucchesi <[email protected]>
Co-authored-by: Aaron Pham <[email protected]>
Co-authored-by: Simon Mo <[email protected]>
Co-authored-by: Alex Wu <[email protected]>
Co-authored-by: Ekagra Ranjan <[email protected]>
Co-authored-by: Michał Moskal <[email protected]>
Co-authored-by: Lucia Fang <[email protected]>
Co-authored-by: Richard Barnes <[email protected]>
Co-authored-by: Richard Zou <[email protected]>
Co-authored-by: Zhengyuan Su (苏政渊) <[email protected]>
Co-authored-by: 苏政渊 <[email protected]>
Co-authored-by: qscqesze <[email protected]>
Co-authored-by: ponix-j <[email protected]>
Co-authored-by: Hyogeun Oh (오효근) <[email protected]>
Co-authored-by: Gregory Shtrasberg <[email protected]>
Co-authored-by: a2q1p <[email protected]>
Co-authored-by: mofanke <[email protected]>
Co-authored-by: Qiming Zhang <[email protected]>
Co-authored-by: TY-AMD <[email protected]>
Co-authored-by: Alexei-V-Ivanov-AMD <[email protected]>
Co-authored-by: casinca <[email protected]>
Co-authored-by: Dilip Gowda Bhagavan <[email protected]>
Co-authored-by: Bryan Lu <[email protected]>
Co-authored-by: Kevin H. Luu <[email protected]>
Co-authored-by: Gabriel Marinho <[email protected]>
Co-authored-by: Huy Do <[email protected]>
Co-authored-by: Kunshang Ji <[email protected]>
Co-authored-by: Marco <[email protected]>
Co-authored-by: Isotr0py <[email protected]>
Co-authored-by: rongfu.leng <[email protected]>
Co-authored-by: Marko Rosenmueller <[email protected]>
Co-authored-by: Alec <[email protected]>
Co-authored-by: Pete Savage <[email protected]>
Co-authored-by: zh Wang <[email protected]>
Co-authored-by: David Xia <[email protected]>
Co-authored-by: Rahul Tuli <[email protected]>
Co-authored-by: Siyuan Liu <[email protected]>
Co-authored-by: NaLan ZeYu <[email protected]>
Co-authored-by: Hongxia Yang <[email protected]>
Co-authored-by: Noah Yoshida <[email protected]>
Co-authored-by: Keyun Tong <[email protected]>
Co-authored-by: Teruaki Ishizaki <[email protected]>
Co-authored-by: Sage Moore <[email protected]>
1 parent a45a694 commit e088f0f

File tree

720 files changed: +29626 −11500 lines


.buildkite/lm-eval-harness/configs/DeepSeek-V2-Lite-Chat.yaml

+1
@@ -1,3 +1,4 @@
+# For vllm script, with -t option (tensor parallel size).
 # bash ./run-lm-eval-gsm-vllm-baseline.sh -m deepseek-ai/DeepSeek-V2-Lite-Chat -b "auto" -l 1000 -f 5 -t 2
 model_name: "deepseek-ai/DeepSeek-V2-Lite-Chat"
 tasks:

.buildkite/lm-eval-harness/configs/Meta-Llama-3-70B-Instruct-FBGEMM-nonuniform.yaml

+1
@@ -1,3 +1,4 @@
+# For hf script, without -t option (tensor parallel size).
 # bash .buildkite/lm-eval-harness/run-lm-eval-gsm-hf-baseline.sh -m nm-testing/Meta-Llama-3-70B-Instruct-FBGEMM-nonuniform -b auto -l 1000 -f 5
 model_name: "nm-testing/Meta-Llama-3-70B-Instruct-FBGEMM-nonuniform"
 tasks:

.buildkite/lm-eval-harness/configs/Meta-Llama-3-70B-Instruct.yaml

+1
@@ -1,3 +1,4 @@
+# For hf script, without -t option (tensor parallel size).
 # bash .buildkite/lm-eval-harness/run-lm-eval-gsm-hf-baseline.sh -m meta-llama/Meta-Llama-3-70B-Instruct -b 32 -l 250 -f 5
 model_name: "meta-llama/Meta-Llama-3-70B-Instruct"
 tasks:

.buildkite/lm-eval-harness/configs/Meta-Llama-3-8B-Instruct-Channelwise-compressed-tensors.yaml

+1
@@ -1,3 +1,4 @@
+# For vllm script, with -t option (tensor parallel size).
 # bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m nm-testing/Meta-Llama-3-8B-Instruct-W8A8-FP8-Channelwise-compressed-tensors -b auto -l 1000 -f 5 -t 1
 model_name: "nm-testing/Meta-Llama-3-8B-Instruct-W8A8-FP8-Channelwise-compressed-tensors"
 tasks:

.buildkite/lm-eval-harness/configs/Meta-Llama-3-8B-Instruct-FBGEMM-nonuniform.yaml

+1
@@ -1,3 +1,4 @@
+# For vllm script, with -t option (tensor parallel size).
 # bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m nm-testing/Meta-Llama-3-8B-Instruct-FBGEMM-nonuniform -b auto -l 1000 -f 5 -t 1
 model_name: "nm-testing/Meta-Llama-3-8B-Instruct-FBGEMM-nonuniform"
 tasks:

.buildkite/lm-eval-harness/configs/Meta-Llama-3-8B-Instruct-FP8-compressed-tensors.yaml

+1
@@ -1,3 +1,4 @@
+# For vllm script, with -t option (tensor parallel size).
 # bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m nm-testing/Meta-Llama-3-8B-FP8-compressed-tensors-test -b 32 -l 1000 -f 5 -t 1
 model_name: "nm-testing/Meta-Llama-3-8B-FP8-compressed-tensors-test"
 tasks:

.buildkite/lm-eval-harness/configs/Meta-Llama-3-8B-Instruct-FP8.yaml

+1
@@ -1,3 +1,4 @@
+# For vllm script, with -t option (tensor parallel size).
 # bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m neuralmagic/Meta-Llama-3-8B-Instruct-FP8 -b 32 -l 250 -f 5 -t 1
 model_name: "neuralmagic/Meta-Llama-3-8B-Instruct-FP8"
 tasks:

.buildkite/lm-eval-harness/configs/Meta-Llama-3-8B-Instruct-INT8-compressed-tensors-asym.yaml

+1
@@ -1,3 +1,4 @@
+# For vllm script, with -t option (tensor parallel size).
 # bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m nm-testing/Meta-Llama-3-8B-Instruct-W8-Channel-A8-Dynamic-Asym-Per-Token-Test -b "auto" -l 250 -f 5 -t 1
 model_name: "nm-testing/Meta-Llama-3-8B-Instruct-W8-Channel-A8-Dynamic-Asym-Per-Token-Test"
 tasks:

.buildkite/lm-eval-harness/configs/Meta-Llama-3-8B-Instruct-INT8-compressed-tensors.yaml

+1
@@ -1,3 +1,4 @@
+# For vllm script, with -t option (tensor parallel size).
 # bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m nm-testing/Meta-Llama-3-8B-Instruct-W8-Channel-A8-Dynamic-Per-Token-Test -b "auto" -l 250 -f 5 -t 1
 model_name: "nm-testing/Meta-Llama-3-8B-Instruct-W8-Channel-A8-Dynamic-Per-Token-Test"
 tasks:

.buildkite/lm-eval-harness/configs/Meta-Llama-3-8B-Instruct-nonuniform-compressed-tensors.yaml

+1
@@ -1,3 +1,4 @@
+# For vllm script, with -t option (tensor parallel size).
 # bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m nm-testing/Meta-Llama-3-8B-Instruct-nonuniform-test -b auto -l 1000 -f 5 -t 1
 model_name: "nm-testing/Meta-Llama-3-8B-Instruct-nonuniform-test"
 tasks:

.buildkite/lm-eval-harness/configs/Meta-Llama-3-8B-Instruct.yaml

+2-1
@@ -1,4 +1,5 @@
-# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-hf-baseline.sh -m meta-llama/Meta-Llama-3-8B-Instruct -b 32 -l 250 -f 5 -t 1
+# For hf script, without -t option (tensor parallel size).
+# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-hf-baseline.sh -m meta-llama/Meta-Llama-3-8B-Instruct -b 32 -l 250 -f 5
 model_name: "meta-llama/Meta-Llama-3-8B-Instruct"
 tasks:
 - name: "gsm8k"

.buildkite/lm-eval-harness/configs/Meta-Llama-3-8B-QQQ.yaml

+1
@@ -1,3 +1,4 @@
+# For vllm script, with -t option (tensor parallel size).
 # bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m HandH1998/QQQ-Llama-3-8b-g128 -b 32 -l 1000 -f 5 -t 1
 model_name: "HandH1998/QQQ-Llama-3-8b-g128"
 tasks:

.buildkite/lm-eval-harness/configs/Meta-Llama-3.2-1B-Instruct-INT8-compressed-tensors.yaml

+1
@@ -1,3 +1,4 @@
+# For vllm script, with -t option (tensor parallel size).
 # bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m neuralmagic/Llama-3.2-1B-Instruct-quantized.w8a8 -b "auto" -l 1000 -f 5 -t 1
 model_name: "neuralmagic/Llama-3.2-1B-Instruct-quantized.w8a8"
 tasks:

.buildkite/lm-eval-harness/configs/Minitron-4B-Base-FP8.yaml

+1
@@ -1,3 +1,4 @@
+# For vllm script, with -t option (tensor parallel size).
 # bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m mgoin/Minitron-4B-Base-FP8 -b auto -l 1000 -f 5 -t 1
 model_name: "mgoin/Minitron-4B-Base-FP8"
 tasks:

.buildkite/lm-eval-harness/configs/Mixtral-8x22B-Instruct-v0.1-FP8-Dynamic.yaml

+1
@@ -1,3 +1,4 @@
+# For vllm script, with -t option (tensor parallel size).
 # bash ./run-lm-eval-gsm-vllm-baseline.sh -m neuralmagic/Mixtral-8x22B-Instruct-v0.1-FP8-dynamic -b "auto" -l 250 -f 5 -t 8
 model_name: "neuralmagic/Mixtral-8x22B-Instruct-v0.1-FP8-dynamic"
 tasks:

.buildkite/lm-eval-harness/configs/Mixtral-8x7B-Instruct-v0.1-FP8.yaml

+1
@@ -1,3 +1,4 @@
+# For vllm script, with -t option (tensor parallel size).
 # bash ./run-lm-eval-gsm-vllm-baseline.sh -m neuralmagic/Mixtral-8x7B-Instruct-v0.1-FP8 -b "auto" -l 250 -f 5 -t 4
 model_name: "neuralmagic/Mixtral-8x7B-Instruct-v0.1-FP8"
 tasks:

.buildkite/lm-eval-harness/configs/Mixtral-8x7B-Instruct-v0.1.yaml

+2-1
@@ -1,4 +1,5 @@
-# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-hf-baseline.sh -m neuralmagic/Mixtral-8x7B-Instruct-v0.1 -b 32 -l 250 -f 5 -t 4
+# For hf script, without -t option (tensor parallel size).
+# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-hf-baseline.sh -m neuralmagic/Mixtral-8x7B-Instruct-v0.1 -b 32 -l 250 -f 5
 model_name: "mistralai/Mixtral-8x7B-Instruct-v0.1"
 tasks:
 - name: "gsm8k"
@@ -1,11 +1,12 @@
+# For vllm script, with -t option (tensor parallel size).
 # bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m nm-testing/Qwen1.5-MoE-A2.7B-Chat-quantized.w4a16 -b auto -l 1319 -f 5 -t 1
 model_name: "nm-testing/Qwen1.5-MoE-A2.7B-Chat-quantized.w4a16"
 tasks:
 - name: "gsm8k"
   metrics:
   - name: "exact_match,strict-match"
-    value: 0.31
+    value: 0.30
   - name: "exact_match,flexible-extract"
-    value: 0.47
+    value: 0.465
 limit: 1319
 num_fewshot: 5

.buildkite/lm-eval-harness/configs/Qwen2-1.5B-Instruct-FP8W8.yaml

+1
@@ -1,3 +1,4 @@
+# For vllm script, with -t option (tensor parallel size).
 # bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m nm-testing/Qwen2-1.5B-Instruct-FP8W8 -b auto -l 1000 -f 5 -t 1
 model_name: "nm-testing/Qwen2-1.5B-Instruct-FP8W8"
 tasks:

.buildkite/lm-eval-harness/configs/Qwen2-1.5B-Instruct-INT8-compressed-tensors.yaml

+1
@@ -1,3 +1,4 @@
+# For vllm script, with -t option (tensor parallel size).
 # bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m neuralmagic/Qwen2-1.5B-Instruct-quantized.w8a8 -b "auto" -l 1000 -f 5 -t 1
 model_name: "neuralmagic/Qwen2-1.5B-Instruct-quantized.w8a8"
 tasks:

.buildkite/lm-eval-harness/configs/Qwen2-1.5B-Instruct-W8A16-compressed-tensors.yaml

+1
@@ -1,3 +1,4 @@
+# For vllm script, with -t option (tensor parallel size).
 # bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m nm-testing/Qwen2-1.5B-Instruct-W8A16-Channelwise -b "auto" -l 1000 -f 5 -t 1
 model_name: "nm-testing/Qwen2-1.5B-Instruct-W8A16-Channelwise"
 tasks:

.buildkite/lm-eval-harness/configs/Qwen2-57B-A14-Instruct.yaml

+1
@@ -1,3 +1,4 @@
+# For vllm script, with -t option (tensor parallel size).
 # bash ./run-lm-eval-gsm-vllm-baseline.sh -m Qwen/Qwen2-57B-A14B-Instruct -b "auto" -l 250 -f 5 -t 4
 model_name: "Qwen/Qwen2-57B-A14B-Instruct"
 tasks:

.buildkite/lm-eval-harness/configs/SparseLlama3.1_2of4_fp8_compressed.yaml

+1
@@ -1,3 +1,4 @@
+# For vllm script, with -t option (tensor parallel size).
 # bash ./run-lm-eval-gsm-vllm-baseline.sh -m nm-testing/SparseLlama-3.1-8B-gsm8k-pruned.2of4-chnl_wts_per_tok_dyn_act_fp8-BitM -b "auto" -t 2
 model_name: "nm-testing/SparseLlama-3.1-8B-gsm8k-pruned.2of4-chnl_wts_per_tok_dyn_act_fp8-BitM"
 tasks:

.buildkite/lm-eval-harness/test_lm_eval_correctness.py

+1-1
@@ -16,7 +16,7 @@
 import pytest
 import yaml
 
-RTOL = 0.05
+RTOL = 0.08
 TEST_DATA_FILE = os.environ.get(
     "LM_EVAL_TEST_DATA_FILE",
     ".buildkite/lm-eval-harness/configs/Meta-Llama-3-8B-Instruct.yaml")

.buildkite/release-pipeline.yaml (+6 -5)

@@ -1,20 +1,20 @@
 steps:
-  - label: "Build wheel - CUDA 12.4"
+  - label: "Build wheel - CUDA 12.8"
     agents:
       queue: cpu_queue_postmerge
     commands:
-      - "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=12.4.0 --tag vllm-ci:build-image --target build --progress plain -f docker/Dockerfile ."
+      - "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=12.8.1 --tag vllm-ci:build-image --target build --progress plain -f docker/Dockerfile ."
       - "mkdir artifacts"
       - "docker run --rm -v $(pwd)/artifacts:/artifacts_host vllm-ci:build-image bash -c 'cp -r dist /artifacts_host && chmod -R a+rw /artifacts_host'"
       - "bash .buildkite/scripts/upload-wheels.sh"
     env:
       DOCKER_BUILDKIT: "1"

-  - label: "Build wheel - CUDA 12.1"
+  - label: "Build wheel - CUDA 12.6"
     agents:
       queue: cpu_queue_postmerge
     commands:
-      - "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=12.1.0 --tag vllm-ci:build-image --target build --progress plain -f docker/Dockerfile ."
+      - "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=12.6.3 --tag vllm-ci:build-image --target build --progress plain -f docker/Dockerfile ."
       - "mkdir artifacts"
       - "docker run --rm -v $(pwd)/artifacts:/artifacts_host vllm-ci:build-image bash -c 'cp -r dist /artifacts_host && chmod -R a+rw /artifacts_host'"
       - "bash .buildkite/scripts/upload-wheels.sh"

@@ -48,7 +48,7 @@ steps:
       queue: cpu_queue_postmerge
     commands:
       - "aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws/q9t5s3a7"
-      - "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=12.4.0 --tag public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT --target vllm-openai --progress plain -f docker/Dockerfile ."
+      - "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=12.8.1 --tag public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT --target vllm-openai --progress plain -f docker/Dockerfile ."
       - "docker push public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT"

   - label: "Build and publish TPU release image"

@@ -57,6 +57,7 @@ steps:
     agents:
       queue: tpu_queue_postmerge
     commands:
+      - "git fetch --all"
       - "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --tag vllm/vllm-tpu:nightly --tag vllm/vllm-tpu:$BUILDKITE_COMMIT --progress plain -f docker/Dockerfile.tpu ."
       - "docker push vllm/vllm-tpu:nightly"
       - "docker push vllm/vllm-tpu:$BUILDKITE_COMMIT"

.buildkite/scripts/hardware_ci/run-amd-test.sh (+44 -23)

@@ -75,30 +75,51 @@ HF_MOUNT="/root/.cache/huggingface"
 commands=$@
 echo "Commands:$commands"
 #ignore certain kernels tests
-if [[ $commands == *" kernels "* ]]; then
+if [[ $commands == *" kernels/core"* ]]; then
   commands="${commands} \
-  --ignore=kernels/test_attention_selector.py \
-  --ignore=kernels/test_blocksparse_attention.py \
-  --ignore=kernels/test_causal_conv1d.py \
-  --ignore=kernels/test_cutlass.py \
-  --ignore=kernels/test_encoder_decoder_attn.py \
-  --ignore=kernels/test_flash_attn.py \
-  --ignore=kernels/test_flashinfer.py \
-  --ignore=kernels/test_int8_quant.py \
-  --ignore=kernels/test_machete_gemm.py \
-  --ignore=kernels/test_mamba_ssm.py \
-  --ignore=kernels/test_marlin_gemm.py \
-  --ignore=kernels/test_moe.py \
-  --ignore=kernels/test_prefix_prefill.py \
-  --ignore=kernels/test_rand.py \
-  --ignore=kernels/test_sampler.py \
-  --ignore=kernels/test_cascade_flash_attn.py \
-  --ignore=kernels/test_mamba_mixer2.py \
-  --ignore=kernels/test_aqlm.py \
-  --ignore=kernels/test_machete_mm.py \
-  --ignore=kernels/test_mha_attn.py \
-  --ignore=kernels/test_block_fp8.py \
-  --ignore=kernels/test_permute_cols.py"
+  --ignore=kernels/core/test_fused_quant_layernorm.py \
+  --ignore=kernels/core/test_permute_cols.py"
+fi
+
+if [[ $commands == *" kernels/attention"* ]]; then
+  commands="${commands} \
+  --ignore=kernels/attention/stest_attention_selector.py \
+  --ignore=kernels/attention/test_blocksparse_attention.py \
+  --ignore=kernels/attention/test_encoder_decoder_attn.py \
+  --ignore=kernels/attention/test_attention_selector.py \
+  --ignore=kernels/attention/test_flash_attn.py \
+  --ignore=kernels/attention/test_flashinfer.py \
+  --ignore=kernels/attention/test_prefix_prefill.py \
+  --ignore=kernels/attention/test_cascade_flash_attn.py \
+  --ignore=kernels/attention/test_mha_attn.py \
+  --ignore=kernels/attention/test_lightning_attn.py \
+  --ignore=kernels/attention/test_attention.py"
+fi
+
+if [[ $commands == *" kernels/quantization"* ]]; then
+  commands="${commands} \
+  --ignore=kernels/quantization/test_int8_quant.py \
+  --ignore=kernels/quantization/test_aqlm.py \
+  --ignore=kernels/quantization/test_machete_mm.py \
+  --ignore=kernels/quantization/test_block_fp8.py \
+  --ignore=kernels/quantization/test_block_int8.py \
+  --ignore=kernels/quantization/test_marlin_gemm.py \
+  --ignore=kernels/quantization/test_cutlass_scaled_mm.py \
+  --ignore=kernels/quantization/test_int8_kernel.py"
+fi
+
+if [[ $commands == *" kernels/mamba"* ]]; then
+  commands="${commands} \
+  --ignore=kernels/mamba/test_mamba_mixer2.py \
+  --ignore=kernels/mamba/test_causal_conv1d.py \
+  --ignore=kernels/mamba/test_mamba_ssm_ssd.py"
+fi
+
+if [[ $commands == *" kernels/moe"* ]]; then
+  commands="${commands} \
+  --ignore=kernels/moe/test_moe.py \
+  --ignore=kernels/moe/test_cutlass_moe.py \
+  --ignore=kernels/moe/test_triton_moe_ptpc_fp8.py"
 fi

 #ignore certain Entrypoints/openai tests
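Each new guard matches its kernel subdirectory as a substring of the full pytest command; the leading space in the pattern keeps e.g. kernels/core from matching inside an unrelated token. A standalone sketch of the dispatch, with a hypothetical command string:

    # The pattern *" kernels/core"* only fires when the directory is preceded by a space.
    commands='pytest -v -s kernels/core'
    if [[ $commands == *" kernels/core"* ]]; then
      echo 'would append the core-kernel --ignore flags'
    fi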

.buildkite/scripts/hardware_ci/run-cpu-test-ppc64le.sh (+11 -4)

@@ -5,25 +5,30 @@
 set -ex

 # Setup cleanup
-remove_docker_container() { podman rm -f cpu-test-ubi9-ppc || true; podman system prune -f; }
+remove_docker_container() {
+  if [[ -n "$container_id" ]]; then
+    podman rm -f "$container_id" || true
+  fi
+  podman system prune -f
+}
 trap remove_docker_container EXIT
 remove_docker_container

 # Try building the docker image
 podman build -t cpu-test-ubi9-ppc -f docker/Dockerfile.ppc64le .

 # Run the image
-podman run -itd --entrypoint /bin/bash -v /tmp/:/root/.cache/huggingface --privileged=true --network host -e HF_TOKEN --name cpu-test-ubi9-ppc cpu-test-ubi9-ppc
+container_id=$(podman run -itd --entrypoint /bin/bash -v /tmp/:/root/.cache/huggingface --privileged=true --network host -e HF_TOKEN cpu-test-ubi9-ppc)

 function cpu_tests() {

   # offline inference
-  podman exec cpu-test-ubi9-ppc bash -c "
+  podman exec -it "$container_id" bash -c "
     set -e
     python3 examples/offline_inference/basic/generate.py --model facebook/opt-125m"

   # Run basic model test
-  podman exec cpu-test-ubi9-ppc bash -c "
+  podman exec -it "$container_id" bash -c "
     set -e
     pip install pytest pytest-asyncio einops peft Pillow soundfile transformers_stream_generator matplotlib
     pip install sentence-transformers datamodel_code_generator

@@ -33,6 +38,8 @@ function cpu_tests() {
 }

 # All of CPU tests are expected to be finished less than 40 mins.
+
+export container_id
 export -f cpu_tests
 timeout 40m bash -c cpu_tests
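Because timeout 40m bash -c cpu_tests runs the function in a fresh child shell, both the function and the new container_id variable must be exported to be visible there; the cleanup trap likewise reads $container_id and skips removal when no container was started. A minimal sketch of the export-across-timeout pattern:

    #!/bin/bash
    # Exported variables and functions survive into the shell spawned by timeout.
    greeting="hello"
    show_greeting() { echo "child shell sees: $greeting"; }
    export greeting
    export -f show_greeting
    timeout 10s bash -c show_greeting   # prints: child shell sees: hello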

.buildkite/scripts/hardware_ci/run-tpu-v1-test.sh (+7 -2)

@@ -17,8 +17,9 @@ source /etc/environment
 docker run --privileged --net host --shm-size=16G -it \
     -e "HF_TOKEN=$HF_TOKEN" --name tpu-test \
     vllm-tpu /bin/bash -c "python3 -m pip install git+https://github.com/thuml/depyf.git \
-    && python3 -m pip install pytest tpu-info \
+    && python3 -m pip install pytest pytest-asyncio tpu-info \
     && python3 -m pip install lm_eval[api]==0.4.4 \
+    && export VLLM_XLA_CACHE_PATH= \
     && export VLLM_USE_V1=1 \
     && export VLLM_XLA_CHECK_RECOMPILATION=1 \
     && echo HARDWARE \

@@ -42,7 +43,11 @@ docker run --privileged --net host --shm-size=16G -it \
     && echo TEST_8 \
     && pytest -s -v /workspace/vllm/tests/v1/tpu/test_topk_topp_sampler.py \
     && echo TEST_9 \
-    && pytest -s -v /workspace/vllm/tests/v1/tpu/test_pallas.py" \
+    && pytest -s -v /workspace/vllm/tests/v1/tpu/test_multimodal.py \
+    && echo TEST_10 \
+    && pytest -s -v /workspace/vllm/tests/v1/tpu/test_pallas.py \
+    && echo TEST_11 \
+    && pytest -s -v /workspace/vllm/tests/v1/entrypoints/llm/test_struct_output_generate.py" \


 # TODO: This test fails because it uses RANDOM_SEED sampling
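Exporting VLLM_XLA_CACHE_PATH= with an empty value appears intended to bypass the persistent XLA compilation cache so that VLLM_XLA_CHECK_RECOMPILATION=1 observes real compilations; that reading is inferred from this diff rather than stated in it. A sketch of the assumed environment for re-running a single TPU test by hand inside the image:

    # Assumed environment, mirroring the script above.
    export VLLM_XLA_CACHE_PATH=          # empty: no on-disk XLA cache
    export VLLM_USE_V1=1                 # run the V1 engine
    export VLLM_XLA_CHECK_RECOMPILATION=1
    pytest -s -v /workspace/vllm/tests/v1/tpu/test_pallas.py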

.buildkite/scripts/upload-wheels.sh (+9 -9)

@@ -50,11 +50,11 @@ aws s3 cp "$normal_wheel" "s3://vllm-wheels/$BUILDKITE_COMMIT/"
 if [[ $normal_wheel == *"cu118"* ]]; then
   # if $normal_wheel matches cu118, do not upload the index.html
   echo "Skipping index files for cu118 wheels"
-elif [[ $normal_wheel == *"cu121"* ]]; then
-  # if $normal_wheel matches cu121, do not upload the index.html
-  echo "Skipping index files for cu121 wheels"
+elif [[ $normal_wheel == *"cu126"* ]]; then
+  # if $normal_wheel matches cu126, do not upload the index.html
+  echo "Skipping index files for cu126 wheels"
 else
-  # only upload index.html for cu124 wheels (default wheels)
+  # only upload index.html for cu128 wheels (default wheels)
   aws s3 cp index.html "s3://vllm-wheels/$BUILDKITE_COMMIT/vllm/index.html"
   aws s3 cp "s3://vllm-wheels/nightly/index.html" "s3://vllm-wheels/$BUILDKITE_COMMIT/index.html"
 fi

@@ -66,12 +66,12 @@ aws s3 cp "$normal_wheel" "s3://vllm-wheels/nightly/"
 if [[ $normal_wheel == *"cu118"* ]]; then
   # if $normal_wheel matches cu118, do not upload the index.html
   echo "Skipping index files for cu118 wheels"
-elif [[ $normal_wheel == *"cu121"* ]]; then
-  # if $normal_wheel matches cu121, do not upload the index.html
-  echo "Skipping index files for cu121 wheels"
+elif [[ $normal_wheel == *"cu126"* ]]; then
+  # if $normal_wheel matches cu126, do not upload the index.html
+  echo "Skipping index files for cu126 wheels"
 else
-  # only upload index.html for cu124 wheels (default wheels)
+  # only upload index.html for cu128 wheels (default wheels)
   aws s3 cp index.html "s3://vllm-wheels/nightly/vllm/index.html"
 fi

-aws s3 cp "$wheel" "s3://vllm-wheels/$version/"
\ No newline at end of file
+aws s3 cp "$wheel" "s3://vllm-wheels/$version/"
