
llm-d-benchmark

This repository provides an automated workflow for benchmarking LLM inference using the llm-d stack. It includes tools for deployment, experiment execution, data collection, and teardown across multiple environments and deployment styles.

Main Goal

Provide a single source of automation for repeatable and reproducible experiments and performance evaluation on llm-d.

📦 Repository Setup

git clone https://github.com/llm-d/llm-d-benchmark.git
cd llm-d-benchmark
./setup/install_deps.sh

Quickstart

Out of the box: stand up an llm-d stack (the default method is llm-d-modelservice, serving the meta-llama/Llama-3.2-1B-Instruct model), run a harness (default inference-perf) with a load profile (default sanity_random), and then tear down the deployed stack.

./e2e.sh

Tip

The penultimate line of the output, starting with "ℹ️ The current work dir is", indicates the path of the generated standup files and collected performance data.
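For example, one way to keep that path for later inspection (the e2e.log file name here is only an illustration; e2e.sh itself does not require it):

./e2e.sh | tee e2e.log
grep "The current work dir is" e2e.log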

The same example above can be explicitly split into three separate parts:

./setup/standup.sh
./run.sh
./setup/teardown.sh

A user can elect to stand up an llm-d stack once and then run the inference-perf harness with a different load profile (e.g., chatbot_synthetic):

./run.sh --harness inference-perf --workload chatbot_synthetic --methods <a string that matches an inference service or pod>

Tip

./run.sh can be used to run a particular workload against a pre-deployed stack (llm-d or otherwise).
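For example, assuming an already-deployed stack exposes an inference service or pod whose name contains "vllm" (a purely hypothetical name used for illustration), a run with the default harness and profile could look like:

./run.sh --harness inference-perf --workload sanity_random --methods vllm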

Architecture

llm-d-benchmark stands up a stack (currently, both llm-d and "standalone" are supported) with a specific set of Standup Parameters, and then runs a specific harness with a specific set of Run Parameters.
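A minimal sketch of that flow, using only the scripts and flags already shown above (the values are the documented defaults; depending on the deployment, ./run.sh may also need --methods to select the target inference service, as in the Quickstart):

./setup/standup.sh                                          # standup: deploy the stack described by the Standup Parameters
./run.sh --harness inference-perf --workload sanity_random  # run: drive the stack with a harness and a (workload) profile
./setup/teardown.sh                                         # teardown: remove the deployed stack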


Goals

Each benchmark run collects enough information to enable execution on different clusters/environments with minimal setup effort.

Multiple load generators and multiple load profiles are available, in a pluggable architecture that allows expansion.

Well-defined set of Metrics

Define and measure a representative set of metrics that allows not only meaningful comparisons between different stacks, but also performance characterization for different components.

Relevant collection of Workloads

Define a mix of workloads that express real-world use cases, allowing for llm-d performance characterization, evaluation, and stress investigation.

Design and Roadmap

llm-d-benchmark follows the practice of its parent project (llm-d) by also having its own Northstar design (a work in progress).

Main concepts (identified by specific directories)

Scenarios

Pieces of information identifying a particular cluster. This information includes, but is not limited to, GPU model, large language model, and llm-d parameters (an environment file, and optionally a values.yaml file for the modelservice helm charts).

A "harness" is a load generator (Python code) which drives the benchmark load. Today, llm-d-benchmark supports fmperf, inference-perf, guidellm, the benchmarks found on the benchmarks folder on vllm, and "no op" (internally designed "nop") for users interested in benchmarking mostly model load times. There are ongoing efforts to consolidate and provide an easier way to support different load generators.

(Workload) Profiles

A (workload) profile is the actual benchmark load specification which includes the LLM use case to benchmark, traffic pattern, input / output distribution, and dataset. Supported workload profiles can be found under workload/profiles.
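The profiles shipped with the repository can be listed directly from that directory; the names found there are presumably what the --workload flag in the Quickstart examples refers to:

ls workload/profiles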

Important

The triplet <scenario>, <harness>, <(workload) profile>, combined with the standup/teardown capabilities provided by llm-d-infra and llm-d-modelservice, should provide enough information to allow a single experiment to be reproduced.

Experiments

A file describing a series of parameters - both standup and run - to be executed automatically. This file follows the "Design of Experiments" (DOE) approach, where each parameter (factor) is listed alongside the target values (levels), resulting in a list of combinations (treatments).

Dependencies

Topics

Contribute

  • Instructions on how to contribute, including details on our development process and governance.
  • We use Slack to discuss development across organizations. Please join: Slack. There is a sig-benchmarking channel there.
  • We host a weekly standup for contributors on Thursdays at 13:30 ET. Please join: Meeting Details. The meeting notes can be found here. Joining the llm-d Google Groups will grant you access.

License

This project is licensed under Apache License 2.0. See the LICENSE file for details.