This repository provides an automated workflow for benchmarking LLM inference using the llm-d stack. It includes tools for deployment, experiment execution, data collection, and teardown across multiple environments and deployment styles. The goal is to provide a single source of automation for repeatable and reproducible experiments and performance evaluation of llm-d.
git clone https://github.com/llm-d/llm-d-benchmark.git
cd llm-d-benchmark
./setup/install_deps.sh
Out of the box: standup an llm-d stack (the default method is llm-d-modelservice, serving the meta-llama/Llama-3.2-1B-Instruct model), run a harness (default inference-perf) with a load profile (default sanity_random), and then teardown the deployed stack.
./e2e.sh
Tip
The penultimate line of the output, starting with "ℹ️ The current work dir is", indicates the path to the generated standup files and the collected performance data.
The same example can be explicitly split into three separate parts.
./setup/standup.sh
./run.sh
./setup/teardown.sh
A user can elect to standup an llm-d stack once, and then run the inference-perf harness with a different load profile (e.g., chatbot_synthetic):
./run.sh --harness inference-perf --workload chatbot_synthetic --methods <a string that matches an inference service or pod>
Tip
./run.sh can be used to run a particular workload against a pre-deployed stack (llm-d or otherwise).
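For example, a hypothetical invocation against an existing deployment might look like the following (the --methods value is illustrative only and should be replaced with a string matching your inference service or pod):
./run.sh --harness inference-perf --workload chatbot_synthetic --methods my-vllm-service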
llm-d-benchmark stands up a stack (currently, both llm-d and "standalone" are supported) with a specific set of Standup Parameters, and then runs a specific harness with a specific set of Run Parameters.
Each benchmark run collects enough information to enable execution on different clusters/environments with minimal setup effort.
Multiple load generators and load profiles are available, in a pluggable architecture that allows for expansion.
Well-defined set of Metrics
Define and measure a representative set of metrics that allows not only meaningful comparisons between different stacks, but also performance characterization for different components.
Relevant collection of Workloads
Define a mix of workloads that express real-world use cases, allowing for llm-d performance characterization, evaluation, and stress investigation.
llm-d-benchmark follows the practice of its parent project (llm-d) by also having its own Northstar design (a work in progress).
Pieces of information identifying a particular cluster. This information includes, but is not limited to, the GPU model, the large language model, and llm-d parameters (an environment file, and optionally a values.yaml file for the modelservice helm charts).
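As a purely illustrative sketch (the variable names below are hypothetical, not the actual ones consumed by llm-d-benchmark), a scenario might capture information such as:
# Hypothetical scenario environment file; variable names are illustrative only
export GPU_MODEL="NVIDIA-H100-80GB"
export LLM_MODEL="meta-llama/Llama-3.2-1B-Instruct"
export MODELSERVICE_VALUES="values.yaml"  # optional values.yaml for the modelservice helm charts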
A "harness" is a load generator (Python code) which drives the benchmark load. Today, llm-d-benchmark supports fmperf, inference-perf, guidellm, the benchmarks found on the benchmarks
folder on vllm, and "no op" (internally designed "nop") for users interested in benchmarking mostly model load times. There are ongoing efforts to consolidate and provide an easier way to support different load generators.
(Workload) Profiles
A (workload) profile is the actual benchmark load specification, which includes the LLM use case to benchmark, the traffic pattern, the input/output distribution, and the dataset. Supported workload profiles can be found under workload/profiles.
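To see which profiles ship with the repository, list that directory from the repository root:
ls workload/profiles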
Important
The triplet <scenario>, <harness>, <(workload) profile>, combined with the standup/teardown capabilities provided by llm-d-infra and llm-d-modelservice, should provide enough information to allow a single experiment to be reproduced.
A file describing a series of parameters - both standup and run - to be executed automatically. This file follows the "Design of Experiments" (DOE) approach, where each parameter (factor) is listed alongside its target values (levels), resulting in a list of combinations (treatments).
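As a hypothetical illustration of the DOE terminology (this is not the actual experiment file format), two factors with two levels each expand into four treatments, i.e., the Cartesian product of all levels:
# Illustrative only: each combination of levels becomes one treatment (one standup/run)
for model in meta-llama/Llama-3.2-1B-Instruct facebook/opt-125m; do
  for workload in sanity_random chatbot_synthetic; do
    echo "treatment: model=${model} workload=${workload}"
  done
done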
- Instructions on how to contribute, including details on our development process and governance.
- We use Slack to discuss development across organizations. Please join: Slack. There is a sig-benchmarking channel there.
- We host a weekly standup for contributors on Thursdays at 13:30 ET. Please join: Meeting Details. The meeting notes can be found here. Joining the llm-d Google group will grant you access.
This project is licensed under Apache License 2.0. See the LICENSE file for details.