This repository provides an automated workflow for benchmarking LLM inference using the llm-d stack. It includes tools for deployment, experiment execution, data collection, and teardown across multiple environments and deployment styles. The goal is to provide a single source of automation for repeatable and reproducible experiments and performance evaluation of llm-d.
git clone https://github.com/llm-d/llm-d-benchmark.git
cd llm-d-benchmark
./setup/install_deps.sh
Out of the box: standup an llm-d stack (the default method is llm-d-modelservice, serving the meta-llama/Llama-3.2-1B-Instruct model), run a harness (default inference-perf) with a load profile (default sanity_random), and then teardown the deployed stack.
./e2e.sh
Tip
The penultimate line of the output, starting with "ℹ️ The current work dir is", indicates the path to the generated standup files and the collected performance data.
The same example can be explicitly split into three separate parts.
./setup/standup.sh
./run.sh
./setup/teardown.sh
A user can elect to standup an llm-d stack once, and then run the inference-perf harness with a different load profile (e.g., chatbot_synthetic):
./run.sh --harness inference-perf --workload chatbot_synthetic --methods <a string that matches an inference service or pod>
Tip
./run.sh can be used to run a particular workload against a pre-deployed stack (llm-d or otherwise).
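For example, a hypothetical invocation against an existing deployment might look like the following (the --methods value is illustrative only and should be replaced with a string matching your inference service or pod):
./run.sh --harness inference-perf --workload chatbot_synthetic --methods my-vllm-service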
llm-d-benchmark stands up a stack (currently, both llm-d and "standalone" are supported) with a specific set of Standup Parameters, and then runs a specific harness with a specific set of Run Parameters.
Each benchmark run collects enough information to enable execution on different clusters/environments with minimal setup effort.
Multiple load generators and load profiles are available, in a pluggable architecture that allows for expansion.
Well-defined set of Metrics
Define and measure a representative set of metrics that allows not only meaningful comparisons between different stacks, but also performance characterization for different components.
Relevant collection of Workloads
Define a mix of workloads that express real-world use cases, allowing for llm-d performance characterization, evaluation, and stress investigation.
llm-d-benchmark follows the practice of its parent project (llm-d) by also having its own Northstar design (a work in progress).
Pieces of information identifying a particular cluster. This information includes, but is not limited to, the GPU model, the large language model, and llm-d parameters (an environment file, and optionally a values.yaml file for the modelservice helm charts).
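As a purely illustrative sketch (the variable names below are hypothetical, not the actual ones consumed by llm-d-benchmark), a scenario might capture information such as:
# Hypothetical scenario environment file; variable names are illustrative only
export GPU_MODEL="NVIDIA-H100-80GB"
export LLM_MODEL="meta-llama/Llama-3.2-1B-Instruct"
export MODELSERVICE_VALUES="values.yaml"  # optional values.yaml for the modelservice helm charts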
A "harness" is a load generator (Python code) which drives the benchmark load. Today, llm-d-benchmark supports fmperf, inference-perf, guidellm, the benchmarks found on the benchmarks
folder on vllm, and "no op" (internally designed "nop") for users interested in benchmarking mostly model load times. There are ongoing efforts to consolidate and provide an easier way to support different load generators.
(Workload) Profiles
A (workload) profile is the actual benchmark load specification, which includes the LLM use case to benchmark, the traffic pattern, the input/output distribution, and the dataset. Supported workload profiles can be found under workload/profiles.
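To see which profiles ship with the repository, list that directory from the repository root:
ls workload/profiles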
Important
The triplet <scenario>, <harness>, <(workload) profile>, combined with the standup/teardown capabilities provided by llm-d-infra and llm-d-modelservice, should provide enough information to allow a single experiment to be reproduced.
A file describing a series of parameters - both standup and run - to be executed automatically. This file follows the "Design of Experiments" (DOE) approach, where each parameter (factor) is listed alongside its target values (levels), resulting in a list of combinations (treatments).
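As a hypothetical illustration of the DOE terminology (this is not the actual experiment file format), two factors with two levels each expand into four treatments, i.e., the Cartesian product of all levels:
# Illustrative only: each combination of levels becomes one treatment (one standup/run)
for model in meta-llama/Llama-3.2-1B-Instruct facebook/opt-125m; do
  for workload in sanity_random chatbot_synthetic; do
    echo "treatment: model=${model} workload=${workload}"
  done
done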
- Instructions on how to contribute, including details on our development process and governance.
- We use Slack to discuss development across organizations. Please join: Slack. There is a sig-benchmarking channel there.
- We host a weekly standup for contributors on Thursdays at 13:30 ET. Please join: Meeting Details. The meeting notes can be found here. Joining the llm-d Google group will grant you access.
This project is licensed under Apache License 2.0. See the LICENSE file for details.