
Commit e1d7d06

support vllm cpu in backendruntime

Signed-off-by: googs1025 <[email protected]>

1 parent cf3283f

File tree: 4 files changed (+104, -0 lines)

chart/values.global.yaml

Lines changed: 5 additions & 0 deletions

```diff
@@ -26,6 +26,11 @@ backendRuntime:
     image:
       repository: vllm/vllm-openai
       tag: v0.7.3
+  vllmcpu:
+    image:
+      # more image detail: https://gallery.ecr.aws/q9t5s3a7/vllm-cpu-release-repo
+      repository: public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo
+      tag: v0.8.5
 
 leaderWorkerSet:
   enabled: true
```
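If you need a different CPU image than the default above, the values can be overridden at install time. A minimal sketch, assuming the chart is installed from a local checkout as a release named `llmaz` and that `values.global.yaml` is passed explicitly (the release name, chart path, and values-file wiring are assumptions, not part of this commit):

```sh
# Hypothetical install; release name, chart path, and values-file flag are assumptions.
helm install llmaz ./chart \
  -f chart/values.global.yaml \
  --set backendRuntime.vllmcpu.image.repository=public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo \
  --set backendRuntime.vllmcpu.image.tag=v0.8.5
```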

docs/examples/README.md

Lines changed: 6 additions & 0 deletions

```diff
@@ -12,6 +12,7 @@ We provide a set of examples to help you serve large language models, by default
 - [Deploy models via TensorRT-LLM](#deploy-models-via-tensorrt-llm)
 - [Deploy models via text-generation-inference](#deploy-models-via-text-generation-inference)
 - [Deploy models via ollama](#deploy-models-via-ollama)
+- [Deploy models via vLLM CPU](#deploy-models-via-vllm-cpu)
 - [Speculative Decoding with llama.cpp](#speculative-decoding-with-llamacpp)
 - [Speculative Decoding with vLLM](#speculative-decoding-with-vllm)
 - [Multi-Host Inference](#multi-host-inference)
@@ -64,6 +65,11 @@ By default, we use [vLLM](https://github.com/vllm-project/vllm) as the inference
 
 llama.cpp supports speculative decoding to significantly improve inference performance, see [example](./speculative-decoding/llamacpp/) here.
 
+### Deploy models via vLLM CPU
+
+[vLLM](https://github.com/vllm-project/vllm) is an efficient and high-throughput LLM inference engine. It also provides a **CPU version** for environments without GPU support. See [example](./vllm-cpu/) here.
+
+
 ### Speculative Decoding with vLLM
 
 [Speculative Decoding](https://arxiv.org/abs/2211.17192) can improve inference performance efficiently, see [example](./speculative-decoding/vllm/) here.
```
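The new README entry points at `docs/examples/vllm-cpu/`. A minimal sketch of trying it, assuming llmaz and its CRDs are already installed and that the example manifests live in that directory (the exact file name is not shown in this view):

```sh
# Hypothetical usage; assumes the cluster already runs llmaz and its CRDs.
kubectl apply -f docs/examples/vllm-cpu/

# Watch the Playground and its backing pods come up.
kubectl get playground qwen3-0--6b
kubectl get pods -w
```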
Lines changed: 73 additions & 0 deletions (new file: the vllmcpu BackendRuntime template)

```yaml
apiVersion: inference.llmaz.io/v1alpha1
kind: BackendRuntime
metadata:
  labels:
    app.kubernetes.io/name: backendruntime
    app.kubernetes.io/part-of: llmaz
    app.kubernetes.io/created-by: llmaz
  name: vllmcpu
spec:
  image: {{ .Values.backendRuntime.vllmcpu.image.repository }}
  version: {{ .Values.backendRuntime.vllmcpu.image.tag }}
  envs:
    - name: VLLM_CPU_KVCACHE_SPACE
      value: "8"
  lifecycle:
    preStop:
      exec:
        command:
          - /bin/sh
          - -c
          - |
            # Drain before termination: keep the pod alive until vLLM reports
            # no running or waiting requests, then exit so the container can stop.
            while true; do
              # Query the metrics endpoint on the serving port (8080, matching --port below).
              RUNNING=$(curl -s http://localhost:8080/metrics | grep 'vllm:num_requests_running' | grep -v '#' | awk '{print $2}')
              WAITING=$(curl -s http://localhost:8080/metrics | grep 'vllm:num_requests_waiting' | grep -v '#' | awk '{print $2}')
              if [ "$RUNNING" = "0.0" ] && [ "$WAITING" = "0.0" ]; then
                echo "Terminating: No active or waiting requests, safe to terminate" >> /proc/1/fd/1
                exit 0
              else
                echo "Terminating: Running: $RUNNING, Waiting: $WAITING" >> /proc/1/fd/1
                sleep 5
              fi
            done
  # Do not edit the preset argument names unless you know what you're doing.
  # Feel free to add more arguments to fit your requirements.
  recommendedConfigs:
    - name: default
      args:
        - --model
        - "{{`{{ .ModelPath }}`}}"
        - --served-model-name
        - "{{`{{ .ModelName }}`}}"
        - --host
        - "0.0.0.0"
        - --port
        - "8080"
      sharedMemorySize: 2Gi
      resources:
        requests:
          cpu: 10
          memory: 32Gi
        limits:
          cpu: 10
          memory: 32Gi
  startupProbe:
    periodSeconds: 10
    failureThreshold: 30
    httpGet:
      path: /health
      port: 8080
  livenessProbe:
    initialDelaySeconds: 15
    periodSeconds: 10
    failureThreshold: 3
    httpGet:
      path: /health
      port: 8080
  readinessProbe:
    initialDelaySeconds: 5
    periodSeconds: 5
    failureThreshold: 3
    httpGet:
      path: /health
      port: 8080
```
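The `VLLM_CPU_KVCACHE_SPACE=8` env above sets the CPU KV-cache size (in GiB) for vLLM's CPU backend, and the image/version fields are wired to the values added in `chart/values.global.yaml`. To sanity-check that wiring, a minimal sketch of rendering the chart locally (the chart path and values-file flag are assumptions about the repo layout):

```sh
# Hypothetical check; chart path and values-file flag are assumptions.
# With the defaults added in this commit, the rendered BackendRuntime should contain:
#   image: public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo
#   version: v0.8.5
helm template llmaz ./chart -f chart/values.global.yaml | grep -A 1 'vllm-cpu-release-repo'
```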
Lines changed: 20 additions & 0 deletions (new file: the vllm-cpu example manifest)

```yaml
apiVersion: llmaz.io/v1alpha1
kind: OpenModel
metadata:
  name: qwen3-0--6b
spec:
  familyName: qwen3
  source:
    modelHub:
      modelID: Qwen/Qwen3-0.6B
---
apiVersion: inference.llmaz.io/v1alpha1
kind: Playground
metadata:
  name: qwen3-0--6b
spec:
  replicas: 1
  modelClaim:
    modelName: qwen3-0--6b
  backendRuntimeConfig:
    backendName: vllmcpu
```
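Once the Playground's pod is ready, a minimal smoke test is to port-forward it and call vLLM's OpenAI-compatible API on the port the BackendRuntime configures (8080). The pod name below is a placeholder, and the served model name assumes llmaz renders `{{ .ModelName }}` to the model's name, `qwen3-0--6b`:

```sh
# Hypothetical smoke test; replace the placeholder pod name (see `kubectl get pods`).
kubectl port-forward pod/<qwen3-playground-pod> 8080:8080 &

# List served models; the name should match --served-model-name.
curl http://localhost:8080/v1/models

# Send a simple chat completion to the OpenAI-compatible endpoint.
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen3-0--6b", "messages": [{"role": "user", "content": "Hello"}]}'
```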
