
Commit a482905

googs1025 committed: support vllm cpu in backendruntime

Signed-off-by: googs1025 <[email protected]>
1 parent 2bdc376 commit a482905

File tree

3 files changed: +100 -0 lines changed


docs/examples/README.md

Lines changed: 6 additions & 0 deletions
@@ -12,6 +12,7 @@ We provide a set of examples to help you serve large language models, by default
 - [Deploy models via TensorRT-LLM](#deploy-models-via-tensorrt-llm)
 - [Deploy models via text-generation-inference](#deploy-models-via-text-generation-inference)
 - [Deploy models via ollama](#deploy-models-via-ollama)
+- [Deploy models via vLLM CPU](#deploy-models-via-vllm-cpu)
 - [Speculative Decoding with llama.cpp](#speculative-decoding-with-llamacpp)
 - [Speculative Decoding with vLLM](#speculative-decoding-with-vllm)
 - [Multi-Host Inference](#multi-host-inference)
@@ -64,6 +65,11 @@ By default, we use [vLLM](https://github.com/vllm-project/vllm) as the inference
 
 llama.cpp supports speculative decoding to significantly improve inference performance, see [example](./speculative-decoding/llamacpp/) here.
 
+### Deploy models via vLLM CPU
+
+[vLLM](https://github.com/vllm-project/vllm) is an efficient, high-throughput LLM inference engine. It also provides a **CPU version** for environments without GPU support, see [example](./vllm-cpu/) here.
+
+
 ### Speculative Decoding with vLLM
 
 [Speculative Decoding](https://arxiv.org/abs/2211.17192) can improve inference performance efficiently, see [example](./speculative-decoding/vllm/) here.
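
To make the new README entry concrete, a minimal sketch of trying the example looks like the following. The manifest file names are assumptions (this view does not show the paths of the two YAML files added below); the flow is simply to register the CPU BackendRuntime and then create the OpenModel/Playground pair.

```bash
# Sketch only: file names are assumptions, use wherever the two
# manifests from this commit live under docs/examples/vllm-cpu/.
kubectl apply -f vllm-cpu-backendruntime.yaml   # BackendRuntime "vllmcpu"
kubectl apply -f vllm-cpu-playground.yaml       # OpenModel + Playground

# CPU startup is slow; the startupProbe below allows up to
# 30 x 10s = 5 minutes before the container is restarted.
kubectl get pods -w
```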
New file (BackendRuntime for vLLM CPU): 74 additions & 0 deletions
@@ -0,0 +1,74 @@
apiVersion: inference.llmaz.io/v1alpha1
kind: BackendRuntime
metadata:
  labels:
    app.kubernetes.io/name: backendruntime
    app.kubernetes.io/part-of: llmaz
    app.kubernetes.io/created-by: llmaz
  name: vllmcpu
spec:
  # more image detail: https://gallery.ecr.aws/q9t5s3a7/vllm-cpu-release-repo
  image: public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo
  version: v0.8.5
  envs:
    - name: VLLM_CPU_KVCACHE_SPACE
      value: "8"
  lifecycle:
    preStop:
      exec:
        command:
          - /bin/sh
          - -c
          - |
            while true; do
              RUNNING=$(curl -s http://localhost:8080/metrics | grep 'vllm:num_requests_running' | grep -v '#' | awk '{print $2}')
              WAITING=$(curl -s http://localhost:8080/metrics | grep 'vllm:num_requests_waiting' | grep -v '#' | awk '{print $2}')
              if [ "$RUNNING" = "0.0" ] && [ "$WAITING" = "0.0" ]; then
                echo "Terminating: No active or waiting requests, safe to terminate" >> /proc/1/fd/1
                exit 0
              else
                echo "Terminating: Running: $RUNNING, Waiting: $WAITING" >> /proc/1/fd/1
                sleep 5
              fi
            done
  # Do not edit the preset argument name unless you know what you're doing.
  # Feel free to add more arguments to fit your requirements.
  recommendedConfigs:
    - name: default
      args:
        - --model
        - "{{`{{ .ModelPath }}`}}"
        - --served-model-name
        - "{{`{{ .ModelName }}`}}"
        - --host
        - "0.0.0.0"
        - --port
        - "8080"
      sharedMemorySize: 2Gi
      resources:
        requests:
          cpu: 10
          memory: 32Gi
        limits:
          cpu: 10
          memory: 32Gi
  startupProbe:
    periodSeconds: 10
    failureThreshold: 30
    httpGet:
      path: /health
      port: 8080
  livenessProbe:
    initialDelaySeconds: 15
    periodSeconds: 10
    failureThreshold: 3
    httpGet:
      path: /health
      port: 8080
  readinessProbe:
    initialDelaySeconds: 5
    periodSeconds: 5
    failureThreshold: 3
    httpGet:
      path: /health
      port: 8080
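
Two settings above do the heavy lifting for CPU serving: VLLM_CPU_KVCACHE_SPACE sizes vLLM's CPU KV cache in GiB, and the preStop hook keeps the pod alive during shutdown until vLLM's own Prometheus gauges report that no requests are running or waiting. The same signal can be inspected by hand; a rough sketch, assuming the placeholder pod name below and that the metrics endpoint listens on the API port configured in the args (8080):

```bash
# Pod name is a placeholder; pick any serving pod of the Playground.
kubectl port-forward pod/<vllm-cpu-serving-pod> 8080:8080 &

# The gauges polled by the preStop hook; both read 0.0 once the
# engine is idle and it is safe to terminate.
curl -s http://localhost:8080/metrics \
  | grep -E 'vllm:num_requests_(running|waiting)' \
  | grep -v '#'
```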
New file (OpenModel and Playground example): 20 additions & 0 deletions
@@ -0,0 +1,20 @@
apiVersion: llmaz.io/v1alpha1
kind: OpenModel
metadata:
  name: qwen3-0--6b
spec:
  familyName: qwen3
  source:
    modelHub:
      modelID: Qwen/Qwen3-0.6B
---
apiVersion: inference.llmaz.io/v1alpha1
kind: Playground
metadata:
  name: qwen3-0--6b
spec:
  replicas: 1
  modelClaim:
    modelName: qwen3-0--6b
  backendRuntimeConfig:
    backendName: vllmcpu
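
Once the Playground's pod is Ready, the backend serves the standard OpenAI-compatible HTTP API on the port set in the BackendRuntime args, with the model exposed under its served model name (the OpenModel name). A rough smoke test, assuming a Service named after the Playground (check `kubectl get svc` for the actual name in your cluster):

```bash
# Service name is an assumption; substitute what `kubectl get svc` shows.
kubectl port-forward svc/qwen3-0--6b 8080:8080 &

# vLLM's OpenAI-compatible completions endpoint; "model" must match
# the --served-model-name passed by the BackendRuntime (the OpenModel name).
curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen3-0--6b", "prompt": "Hello, my name is", "max_tokens": 32}'
```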
