This repository was archived by the owner on Oct 25, 2024. It is now read-only.

Commit b12a11b: improve SQ/WOQ examples (#1662)

Signed-off-by: changwangss <[email protected]>
1 parent 79277b4 commit b12a11b

File tree

13 files changed: +544, -305 lines changed

examples/huggingface/pytorch/code-generation/quantization/README.md

Lines changed: 54 additions & 44 deletions
````diff
@@ -19,63 +19,46 @@ pip install -r requirements.txt
 
 # Run
 We provide compression technologies such as `MixedPrecision`, `SmoothQuant` and `WeightOnlyQuant` with `Rtn/Awq/Teq/GPTQ/AutoRound` algorithms and `BitsandBytes`, `load_in_4bit` and `load_in_8bit` work on CPU device, the followings are command to show how to use it.
->**Note**:
-> Model type "llama" will default use [ipex.optimize_transformers](https://github.com/intel/intel-extension-for-pytorch/blob/339bd251841e153ad9c34e1033ab8b2d936a1781/docs/tutorials/llm/llm_optimize_transformers.md) to accelerate the inference, but "llama" requests transformers version lower than 4.36.0, "falcon" requests transformers version lower than 4.33.3.
 
-## 1. Performance
+## MixedPrecison and SmoothQuant
+
+### 1. Performance
 ```bash
 export KMP_BLOCKTIME=1
 export KMP_SETTINGS=1
 export KMP_AFFINITY=granularity=fine,compact,1,0
 export LD_PRELOAD=${CONDA_PREFIX}/lib/libiomp5.so
 export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
 # fp32
-OMP_NUM_THREADS=<physical cores num> numactl -m <node N> -C <cpu list> python run_generation.py \
+OMP_NUM_THREADS=<physical cores num> numactl -m <node N> -C <cpu list> python run_generation_sq.py \
     --model bigcode/starcoder \
     --benchmark \
-    --batch_size 1
+    --benchmark_batch_size 1
+
 # mixedprecision
-OMP_NUM_THREADS=<physical cores num> numactl -m <node N> -C <cpu list> python run_generation.py \
+OMP_NUM_THREADS=<physical cores num> numactl -m <node N> -C <cpu list> python run_generation_sq.py \
     --model bigcode/starcoder \
     --mixed_precision \
     --benchmark \
     --batch_size 1
+
 # smoothquant
 # [alternative] --int8 is used for int8 only, --int8_bf16_mixed is used for int8 mixed bfloat16 precision.
-python run_generation.py \
+python run_generation_sq.py \
     --model bigcode/starcoder \
     --output_dir "./saved_results" \
     --sq \
     --alpha 0.7 \
-    --calib_iters 500 \
+    --calib_n_samples 500 \
     --dataset "mbpp"
-    --int8 \
-    --benchmark \
-    --batch_size 1
-# weightonlyquant
-OMP_NUM_THREADS=<physical cores num> numactl -m <node N> -C <cpu list> python run_generation.py \
-    --model bigcode/starcoder \
-    --woq \
-    --benchmark \
-    --batch_size 1
-# load_in_4bit
-OMP_NUM_THREADS=<physical cores num> numactl -m <node N> -C <cpu list> python run_generation.py \
-    --model bigcode/starcoder \
-    --load_in_4bit \
-    --benchmark \
-    --batch_size 1
-# load_in_8bit
-OMP_NUM_THREADS=<physical cores num> numactl -m <node N> -C <cpu list> python run_generation.py \
-    --model bigcode/starcoder \
-    --load_in_8bit \
     --benchmark \
     --batch_size 1
 ```
-## 2. Accuracy
+### 2. Accuracy
 
 ```bash
 # fp32
-python run_generation.py \
+python run_generation_sq.py \
     --model bigcode/starcoder \
     --accuracy \
     --batch_size 20 \
````
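The performance commands above leave `<physical cores num>`, `<node N>`, and `<cpu list>` as placeholders. A minimal sketch of a filled-in run, assuming a hypothetical single-socket machine with 56 physical cores on NUMA node 0 (check your own layout with `lscpu` or `numactl --hardware`):

```bash
# Hypothetical topology: 56 physical cores, all on NUMA node 0.
OMP_NUM_THREADS=56 numactl -m 0 -C 0-55 python run_generation_sq.py \
    --model bigcode/starcoder \
    --benchmark \
    --benchmark_batch_size 1
```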
````diff
@@ -85,7 +68,7 @@ python run_generation.py \
     --do_sample \
     --tasks "humaneval"
 # mixedprecision
-python run_generation.py \
+python run_generation_sq.py \
     --model bigcode/starcoder \
     --mixed_precision \
     --accuracy \
````
````diff
@@ -97,23 +80,53 @@ python run_generation.py \
     --tasks "humaneval"
 # smoothquant
 # [alternative] --int8 is used for int8 only, --int8_bf16_mixed is used for int8 mixed bfloat16 precision.
-python run_generation.py \
+python run_generation_sq.py \
     --model bigcode/starcoder \
     --sq \
     --alpha 1.0 \
-    --int8 \
     --accuracy \
     --batch_size 20 \
     --n_samples 20 \
     --allow_code_execution \
     --temperature 0.2 \
     --do_sample \
     --tasks "humaneval"
+```
+
+## WeightOnlyQuant
+
+1. ### Performance
+
+```bash
+# weightonlyquant
+OMP_NUM_THREADS=<physical cores num> numactl -m <node N> -C <cpu list> python run_generation_cpu_woq.py \
+    --model bigcode/starcoder \
+    --woq \
+    --benchmark \
+    --benchmark_batch_size 1
+# load_in_4bit
+OMP_NUM_THREADS=<physical cores num> numactl -m <node N> -C <cpu list> python run_generation_cpu_woq.py \
+    --model bigcode/starcoder \
+    --load_in_4bit \
+    --benchmark \
+    --benchmark_batch_size 1
+# load_in_8bit
+OMP_NUM_THREADS=<physical cores num> numactl -m <node N> -C <cpu list> python run_generation_cpu_woq.py \
+    --model bigcode/starcoder \
+    --load_in_8bit \
+    --benchmark \
+    --benchmark_batch_size 1
+```
+
+2. ### Accuracy
+
+```bash
+
 # weightonlyquant
-python run_generation.py \
+python run_generation_cpu_woq.py \
     --model bigcode/starcoder \
     --woq \
-    --woq_weight_dtype "nf4" \
+    --weight_dtype "nf4" \
     --accuracy \
     --batch_size 20 \
     --n_samples 20 \
````
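The retained comment above still offers `--int8_bf16_mixed` as an alternative to `--int8`. A sketch of that variant of the SmoothQuant accuracy run, assuming the flag composes with the remaining options unchanged:

```bash
# Assumption: --int8_bf16_mixed swaps in for the removed --int8 flag to get
# int8 weights with bfloat16 mixed-precision compute, per the README comment.
python run_generation_sq.py \
    --model bigcode/starcoder \
    --sq \
    --alpha 1.0 \
    --int8_bf16_mixed \
    --accuracy \
    --batch_size 20 \
    --n_samples 20 \
    --allow_code_execution \
    --temperature 0.2 \
    --do_sample \
    --tasks "humaneval"
```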
````diff
@@ -122,7 +135,7 @@ python run_generation.py \
     --do_sample \
     --tasks "humaneval"
 # load_in_4bit
-python run_generation.py \
+python run_generation_cpu_woq.py \
     --model bigcode/starcoder \
     --load_in_4bit \
     --accuracy \
````
````diff
@@ -133,7 +146,7 @@ python run_generation.py \
     --do_sample \
     --tasks "humaneval"
 # load_in_8bit
-python run_generation.py \
+python run_generation_cpu_woq.py \
     --model bigcode/starcoder \
     --load_in_8bit \
     --accuracy \
````
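The WeightOnlyQuant performance commands use plain `--woq`, while the accuracy commands pin the dtype with `--weight_dtype "nf4"`. A sketch of the benchmark counterpart with the dtype made explicit, assuming the two flags combine the same way in both modes:

```bash
# Assumption: --weight_dtype composes with --benchmark exactly as it does
# with --accuracy in the hunks above.
OMP_NUM_THREADS=<physical cores num> numactl -m <node N> -C <cpu list> python run_generation_cpu_woq.py \
    --model bigcode/starcoder \
    --woq \
    --weight_dtype "nf4" \
    --benchmark \
    --benchmark_batch_size 1
```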
````diff
@@ -166,17 +179,14 @@ This creates an image called `evaluation-harness-multiple`, and runs a test on it.
 Suppose the fp32 model is `starcoder-3b`, saved quantized model in `saved_results` and do evaluation on `multiple-lua` tasks with:
 ```
 docker run -v $(CURDIR):$(CURDIR) -it /bin/bash
-python3 run_generation.py \
+python3 run_generation_sq.py \
     --model $(CURDIR)/starcoder-3b \
-    --quantize \
     --sq \
     --alpha 0.7 \
-    --ipex \
-    --calib_iters 500 \
+    --calib_n_samples 500 \
     --calib_batch_size 1 \
     --dataset "mbpp" \
     --output_dir "$(CURDIR)/saved_results" \
-    --int8 \
     --accuracy \
     --tasks multiple-py \
     --batch_size 20 \
````
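This pass both quantizes and evaluates, leaving the int8 model in `saved_results`. A sketch of re-running only the evaluation from that checkpoint; the assumption that `--model` can point at the saved directory mirrors the run_benchmark.sh change below, which redirects the model path to the tuned checkpoint:

```bash
# Assumption: a saved SmoothQuant checkpoint directory can be reloaded
# directly through --model.
python3 run_generation_sq.py \
    --model "$(CURDIR)/saved_results" \
    --accuracy \
    --tasks multiple-py \
    --batch_size 20 \
    --n_samples 20 \
    --allow_code_execution \
    --do_sample \
    --temperature 0.2
```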
````diff
@@ -191,9 +201,9 @@ python3 run_generation.py \
 To run the container (here from image `evaluation-harness-multiple`) to quantize and evaluate on `CURDIR`, or another file mount it with -v, specify n_samples and allow code execution with --allow_code_execution (and add the number of problems --limit if it was used during generation):
 ```bash
 docker run -v $(CURDIR):$(CURDIR) \
--it $(IMAGE_NAME) python3 run_generation.py --model $(CURDIR)/starcoder-3b --quantize --sq --alpha 0.7 --ipex \
---calib_iters 5 --calib_batch_size 1 --dataset "mbpp" --calib_split "test" --output_dir "$(CURDIR)/saved_results" \
---int8 --accuracy --tasks multiple-py --batch_size 20 --n_samples 20 --allow_code_execution \
+-it $(IMAGE_NAME) python3 run_generation_sq.py --model $(CURDIR)/starcoder-3b --sq --alpha 0.7
+--calib_n_samples 5 --calib_batch_size 1 --dataset "mbpp" --output_dir "$(CURDIR)/saved_results" \
+--accuracy --tasks multiple-py --batch_size 20 --n_samples 20 --allow_code_execution \
 --do_sample --temperature 0.2 --limit 2
 
 ```
````
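`$(CURDIR)` and `$(IMAGE_NAME)` are Makefile variables; run outside a Makefile they need concrete values. An illustrative expansion, assuming the image was built as `evaluation-harness-multiple` and the working tree lives under a hypothetical /home/user/code-eval:

```bash
# Illustrative values only; substitute your own mount path and image name.
WORKDIR=/home/user/code-eval
docker run -v ${WORKDIR}:${WORKDIR} \
    -it evaluation-harness-multiple python3 run_generation_sq.py \
    --model ${WORKDIR}/starcoder-3b --sq --alpha 0.7 \
    --calib_n_samples 5 --calib_batch_size 1 --dataset "mbpp" \
    --output_dir "${WORKDIR}/saved_results" \
    --accuracy --tasks multiple-py --batch_size 20 --n_samples 20 \
    --allow_code_execution --do_sample --temperature 0.2 --limit 2
```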

examples/huggingface/pytorch/code-generation/quantization/run_benchmark.sh

Lines changed: 2 additions & 2 deletions
````diff
@@ -14,7 +14,7 @@ function init_params {
   batch_size=1
   tuned_checkpoint=saved_results
   lm_eval_tasks="humaneval"
-  script="run_generation.py"
+  script="run_generation_sq.py"
   for var in "$@"
   do
     case $var in
````
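With the default switched, run_benchmark.sh drives the SmoothQuant script unless told otherwise. A hypothetical invocation, assuming `init_params` parses the conventional `--var=value` form for the variables visible in this hunk:

```bash
# Hypothetical flags, inferred from the init_params variables shown above.
bash run_benchmark.sh \
    --batch_size=1 \
    --tuned_checkpoint=saved_results \
    --lm_eval_tasks=humaneval
```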
````diff
@@ -85,7 +85,7 @@ function run_benchmark {
 
 
     if [[ ${int8} == "true" ]]; then
-        extra_cmd=$extra_cmd" --int8"
+        model_name_or_path=$tuned_checkpoint
     fi
 
     echo $extra_cmd
````
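The substance of this hunk: an int8 benchmark no longer appends `--int8` to the command line; it points `model_name_or_path` at the quantized checkpoint saved by the SmoothQuant run and lets the script load it. A minimal sketch of that flow (variable names from the script; the final command line is an assumption):

```bash
# Sketch: with int8=true, the benchmark loads the saved quantized checkpoint
# instead of passing an --int8 flag to the generation script.
int8=true
tuned_checkpoint=saved_results
model_name_or_path=bigcode/starcoder
if [[ ${int8} == "true" ]]; then
    model_name_or_path=$tuned_checkpoint
fi
python run_generation_sq.py --model ${model_name_or_path} --benchmark
```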
