Commit 5a80f16

llama4 post

Signed-off-by: simon-mo <[email protected]>
1 parent 79dc796

File tree: 2 files changed, +107 -0 lines changed

_posts/2025-04-05-llama4.md (107 additions, 0 deletions)

---
layout: post
title: "Llama 4 in vLLM"
author: "The vLLM Team"
image: /assets/figures/llama4/perf.png
thumbnail-img: /assets/figures/llama4/perf.png
share-img: /assets/figures/llama4/perf.png
---

We're excited to announce that vLLM now supports the [Llama 4 herd of models](https://ai.meta.com/blog/llama-4-multimodal-intelligence/): **Scout** (17B-16E) and **Maverick** (17B-128E). You can run these powerful long-context, natively multi-modal (up to 10 images!), mixture-of-experts models in vLLM today by updating to version v0.8.3 or later:

```
pip install -U vllm
```
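
To confirm the upgrade took effect, you can print the installed version from Python (Llama 4 support requires v0.8.3 or newer):

```python
import vllm

# Llama 4 support landed in vLLM v0.8.3, so anything at or above that works.
print(vllm.__version__)
```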

Below, you'll find sample commands to get started. Alternatively, you can replace the CLI command with docker run ([instructions here](https://docs.vllm.ai/en/latest/deployment/docker.html)) or use [our Pythonic interface](https://docs.vllm.ai/en/latest/getting_started/quickstart.html#offline-batched-inference), the `LLM` class, for local batch inference, as sketched below. We also recommend checking out the [demo from the Meta team](https://github.com/meta-llama/llama-cookbook/blob/main/getting-started/build_with_llama_4.ipynb) showcasing the 1M long context capability with vLLM.

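For local batch inference, a minimal offline-inference sketch with the `LLM` class might look like this (the prompts, sampling settings, and context length are illustrative; scale `tensor_parallel_size` and `max_model_len` to your hardware):

```python
from vllm import LLM, SamplingParams

# Offline batched inference with the Pythonic interface (Scout shown here).
llm = LLM(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    tensor_parallel_size=8,   # e.g. 8x H100
    max_model_len=100_000,    # can be raised toward 1M on 8x H100
)

prompts = [
    "Summarize the main idea behind mixture-of-experts language models.",
    "Write a haiku about very long context windows.",
]
sampling_params = SamplingParams(temperature=0.6, max_tokens=256)

for output in llm.generate(prompts, sampling_params):
    print(output.outputs[0].text)
```
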
## Usage Guide

Here's how you can serve the Llama 4 models on different hardware configurations. After the serving commands, you'll also find a short example of querying the running server.

Using 8x H100 GPUs, vLLM can serve Scout with a 1M-token context window and Maverick with about 430K. See the tips below for boosting performance and making the most of long context.

On 8x H100 GPUs:

* Scout (up to 1M context):

```
vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --tensor-parallel-size 8 \
  --max-model-len 1000000
```

* Maverick (up to ~430K context):

```
vllm serve meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 \
  --tensor-parallel-size 8 \
  --max-model-len 430000
```

On 8x H200 GPUs:

* Scout (up to 3.6M context):

```
vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --tensor-parallel-size 8 \
  --max-model-len 3600000
```

* Maverick (up to 1M context):

```
vllm serve meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 \
  --tensor-parallel-size 8
```
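
Each of the commands above starts an OpenAI-compatible server (on port 8000 by default). As a quick sanity check, you can query it with the official `openai` Python client; the model name should match the one you served, and the image URL below is just a placeholder:

```python
from openai import OpenAI

# Point the OpenAI client at the local vLLM server (default port 8000).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Llama 4 is natively multimodal, so image inputs can be mixed with text.
response = client.chat.completions.create(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in one sentence."},
                {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
            ],
        }
    ],
    max_tokens=128,
)
print(response.choices[0].message.content)
```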

**Performance:**

With the configurations above, we observe the following output tokens/s:

![Output tokens per second for the configurations above](/assets/figures/llama4/perf.png)

While more performance enhancements are on the way, we believe the Llama 4 models' efficient architecture and relatively small size make them practical for scaled usage today.

**Tips for Performance and Long Context:**

* **Boost Performance & Context Length:** Set `--kv-cache-dtype fp8` to potentially double the usable context window and gain a performance boost. We observe little to no accuracy drop in relevant evaluations with this setting. A combined example is sketched after this list.
* **Maximize Context Window (up to 10M):** To fully utilize the maximum context windows (up to 10M for Scout), we recommend serving across multiple nodes using tensor parallelism or pipeline parallelism. Follow our [distributed inference guide](https://docs.vllm.ai/en/latest/serving/distributed_serving.html).
* **Improve Long Context Accuracy (>32K):** We highly recommend adding `--override-generation-config='{"attn_temperature_tuning": true}'` to improve accuracy for contexts longer than 32K tokens.
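
If you prefer the Pythonic interface, the same settings can be passed as engine arguments. This is a rough sketch under the assumption that the `LLM` constructor accepts these engine arguments as keyword arguments, the way it exposes other engine flags; the values are illustrative:

```python
from vllm import LLM

# Sketch: the CLI flags above expressed as engine arguments (values illustrative).
llm = LLM(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    tensor_parallel_size=8,
    max_model_len=1_000_000,
    kv_cache_dtype="fp8",  # roughly doubles the usable KV-cache budget
    # Assumed equivalent of --override-generation-config='{"attn_temperature_tuning": true}',
    # recommended for contexts beyond 32K tokens.
    override_generation_config={"attn_temperature_tuning": True},
)
```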

**Other Hardware Support & Quantizations:**

* A100: We have verified that the bf16 versions of the models work well on A100 GPUs.
* INT4: An INT4-quantized version of the model checkpoint is currently a work in progress. Stay tuned for updates.
* AMD MI300X: You can run Llama 4 on AMD MI300X GPUs by building [vLLM from source](https://docs.vllm.ai/en/latest/getting_started/installation/gpu.html?device=rocm) and using the same commands as above.

**Inference Accuracy Validation:**

We validated inference accuracy against the official Meta report using lm-eval-harness; a sketch for reproducing a similar run follows the results table. Here are the results for [meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8](https://huggingface.co/meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8):

| | MMLU Pro | ChartQA |
|-----------|----------|---------|
| Reported | 80.5 | 90 |
| H100 FP8 | 80.4 | 89.4 |
| AMD BF16 | 80.4 | 89.4 |
| H200 BF16 | 80.2 | 89.3 |
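
The sketch below shows one way such a run could be reproduced with lm-eval-harness's vLLM backend. The task name and `model_args` keys are assumptions that may vary across harness versions (and ChartQA needs the multimodal evaluation path), so treat it as a template rather than the exact command we used:

```python
import lm_eval

# Hypothetical reproduction sketch: evaluate the FP8 Maverick checkpoint on
# MMLU Pro through lm-eval-harness's vLLM backend.
results = lm_eval.simple_evaluate(
    model="vllm",
    model_args=(
        "pretrained=meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8,"
        "tensor_parallel_size=8,max_model_len=16384"
    ),
    tasks=["mmlu_pro"],  # assumed task name
)
print(results["results"]["mmlu_pro"])
```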

## Efficient Architecture and Cluster Scale Serving

Llama 4’s model architecture is particularly well-suited for efficient long-context inference, thanks to features like:

* **Mixture of Experts (MoE):** Scout uses 16 experts (17B activated parameters), and Maverick uses 128 experts (17B activated parameters). Only one expert is activated per token, maintaining efficiency.
* **Interleaved RoPE (iRoPE):** Llama 4 interleaves global attention (without RoPE) with chunked local attention (with RoPE) in a 1:3 ratio. The local attention layers attend to tokens in non-overlapping chunks, significantly reducing the quadratic complexity of attention as context length scales. A toy sketch of this pattern follows this list.
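
To make the iRoPE pattern concrete, here is a toy, non-vLLM sketch of which earlier tokens a query position can attend to. The chunk size and the choice of which layers are global are illustrative assumptions, not the model's actual configuration:

```python
# Toy illustration of interleaved attention: 3 chunked-local layers for every
# global layer, with local layers restricted to non-overlapping chunks.
CHUNK_SIZE = 4  # illustrative only


def visible_keys(query_pos: int, layer_idx: int) -> list[int]:
    """Return the (causal) key positions a query can attend to in this layer."""
    if (layer_idx + 1) % 4 == 0:
        # Global attention layer (1 in every 4, no RoPE): all previous tokens.
        return list(range(query_pos + 1))
    # Chunked local attention layer (with RoPE): only tokens in the same chunk,
    # which is why attention cost stops growing quadratically with context.
    chunk_start = (query_pos // CHUNK_SIZE) * CHUNK_SIZE
    return list(range(chunk_start, query_pos + 1))


for layer in range(4):
    print(f"layer {layer}: query at position 9 attends to {visible_keys(9, layer)}")
```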

vLLM recently launched the [V1 engine](https://blog.vllm.ai/2025/01/27/v1-alpha-release.html), delivering major performance speedups on single nodes, along with native `torch.compile` support. Our [Q2 roadmap](https://github.com/vllm-project/vllm/issues/15735) focuses on enhancing vLLM’s multi-node scaling capabilities, aiming for disaggregated, cluster-scale serving. We are actively adding support for efficient expert parallelism, multi-node data parallelism, and cluster-wide prefill disaggregation.

## Acknowledgement

We extend our sincere thanks to the Meta team for their implementation of the model architecture, extensive accuracy evaluation, and performance benchmarking: [Lucia (Lu) Fang](https://github.com/luccafong), [Ye (Charlotte) Qi](https://github.com/yeqcharlotte), [Lu Fang](https://github.com/houseroad), [Yang Chen](https://github.com/chenyang78), [Zijing Liu](https://github.com/liuzijing2014), [Yong Hoon Shin](https://github.com/sarckk), [Zhewen Li](https://github.com/zhewenl), [Jon Swenson](https://github.com/jmswen), [Kai Wu](https://github.com/wukaixingxp), [Xiaodong Wang](https://github.com/xw285cornell), [Shiyan Deng](https://github.com/842974287), [Wenchen Wang](https://github.com/wangwenchen0407), [Lai Wei](https://github.com/roywei), [Matthias Reso](https://github.com/mreso), [Chris Thi](https://github.com/cthi), [Keyun Tong](https://github.com/youngkent), [Jinho Hwang](https://github.com/jinhohwang-meta), [Driss Guessous](https://github.com/drisspg), [Aston Zhang](https://github.com/astonzhang).

We also thank the AMD team for their support in enabling these models on MI300X: [Hongxia Yang](https://github.com/hongxiayang) and Weijun Jiang.

The vLLM team’s performance benchmarks were run on hardware generously provided by Nebius and NVIDIA.

assets/figures/llama4/perf.png (476 KB)