Skip to content

Enable multi-image support benchmarking for serving #21145

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

leopck
Copy link

@leopck leopck commented Jul 17, 2025

Purpose

This PR updates the online benchmark script (benchmarks/benchmark_serving.py) to support multi-image prompts. This capability is necessary for accurately benchmarking multimodal models using dataset that support more than one image e.g. MUIRBENCH.

Test Plan

To validate the changes, run the online benchmark script pointing to a multi-image dataset. The following command was used for testing with the MUIRBENCH dataset:

python3 benchmarks/benchmark_serving.py \
    --backend openai-chat \
    --endpoint /chat/completions \
    --base-url http://127.0.0.1:8688/v1 \
    --model gemma/gemma-3-27b-it \
    --request-rate "inf" \
    --percentile-metrics ttft,tpot,itl,e2el \
    --ignore-eos \
    --num-prompt 128 \
    --port 8688 \
    --dataset-path MUIRBENCH/MUIRBENCH \
    --dataset-name hf \
    --max-concurrency 128 \
    --save-result

Test Result

The results from the test command are provided below:

============ Serving Benchmark Result ============
Successful requests:                     128
Benchmark duration (s):                  116.25
Total input tokens:                      2808
Total generated tokens:                  16384
Request throughput (req/s):              1.10
Output token throughput (tok/s):         140.93
Total Token throughput (tok/s):          165.09
---------------Time to First Token----------------
Mean TTFT (ms):                          73745.15
Median TTFT (ms):                        71047.68
P99 TTFT (ms):                           111490.03
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          194.42
Median TPOT (ms):                        162.71
P99 TPOT (ms):                           425.83
---------------Inter-token Latency----------------
Mean ITL (ms):                           192.97
Median ITL (ms):                         40.98
P99 ITL (ms):                            1227.04
----------------End-to-end Latency----------------
Mean E2EL (ms):                          98436.22
Median E2EL (ms):                        88044.50
P99 E2EL (ms):                           116240.25
==================================================

Allows the online benchmark script to correctly process and send
requests containing multiple images per prompt, which is required
for datasets like MUIRBENCH.

Example on how to run this:

```sh
python3 benchmarks/benchmark_serving.py \
	--backend openai-chat \
	--endpoint /chat/completions \
	--base-url http://127.0.0.1:8688/v1 \
	--model gemma/gemma-3-27b-it \
	--request-rate "inf" \
	--percentile-metrics ttft,tpot,itl,e2el \
	--ignore-eos \
	--num-prompt 128 \
	--port 8688 \
	--dataset-path MUIRBENCH/MUIRBENCH \
	--dataset-name hf \
	--max-concurrency 128 \
	--save-result
```

And here is the results:

```sh
============ Serving Benchmark Result ============
Successful requests:                     128
Benchmark duration (s):                  116.25
Total input tokens:                      2808
Total generated tokens:                  16384
Request throughput (req/s):              1.10
Output token throughput (tok/s):         140.93
Total Token throughput (tok/s):          165.09
---------------Time to First Token----------------
Mean TTFT (ms):                          73745.15
Median TTFT (ms):                        71047.68
P99 TTFT (ms):                           111490.03
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          194.42
Median TPOT (ms):                        162.71
P99 TPOT (ms):                           425.83
---------------Inter-token Latency----------------
Mean ITL (ms):                           192.97
Median ITL (ms):                         40.98
P99 ITL (ms):                            1227.04
----------------End-to-end Latency----------------
Mean E2EL (ms):                          98436.22
Median E2EL (ms):                        88044.50
P99 E2EL (ms):                           116240.25
==================================================
```
Signed-off-by: Stanley Phoong <[email protected]>
Copy link

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces multi-image support for serving benchmarks. It adds the MuirBenchDataset and updates the request handling logic for the OpenAI Chat backend. There are two potential issues that should be addressed to ensure the code functions correctly.

@xuechendi
Copy link
Contributor

@DarkLight1337 , may you help to review this PR?

@DarkLight1337 DarkLight1337 requested review from mgoin and ywang96 July 18, 2025 03:10
@leopck
Copy link
Author

leopck commented Jul 18, 2025

@mgoin and @ywang96 may you help review and approve this PR? appreciate your review on this PR!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Performance-related issues
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants