This repository was archived by the owner on Oct 25, 2024. It is now read-only.

Commit fcc2671

update doc
Signed-off-by: zhenwei-intel <[email protected]>
1 parent 5612db3 commit fcc2671

File tree: 1 file changed (+4, -91 lines)

docs/weightonlyquant.md

Lines changed: 4 additions & 91 deletions
@@ -147,104 +147,17 @@ loaded_model = AutoModelForCausalLM.from_pretrained(saved_dir)
> Note: For LLM runtime model loading usage, please refer to [neural_speed readme](https://github.com/intel/neural-speed/blob/main/README.md#quick-start-transformer-like-usage)

## Examples For Intel GPU
-Intel-extension-for-transformers implements weight-only quantization for Intel GPU (PVC and ARC) with [Intel-extension-for-pytorch](https://github.com/intel/intel-extension-for-pytorch). Currently, the Linear op kernel for weight-only quantization is implemented in the Intel-extension-for-pytorch branch "dev/QLLM".
+Intel-extension-for-transformers implements weight-only quantization for Intel GPU (PVC/ARC/MTL) with [Intel-extension-for-pytorch](https://github.com/intel/intel-extension-for-pytorch). Currently, the Linear op kernel for weight-only quantization is implemented in the Intel-extension-for-pytorch branch "dev/QLLM".

Now 4-bit/8-bit inference with `RtnConfig`, `AwqConfig`, `GPTQConfig`, and `AutoRoundConfig` is supported on Intel GPU devices.

-We support experimental WOQ inference on Intel GPU (PVC and ARC) by replacing the Linear op in PyTorch. Validated models: Qwen-7B, GPT-J-6B.
-Here is the example code.
-
-#### Prepare Dependency Packages
-1. Install the oneAPI Package
-Weight-only quantization ops exist only in the "dev/QLLM" branch of intel-extension-for-pytorch, which must be compiled with the oneAPI DPC++ compiler. Please follow [the link](https://www.intel.com/content/www/us/en/developer/articles/guide/installation-guide-for-oneapi-toolkits.html) to install oneAPI into the "/opt/intel" folder.
-
-2. Build and Install PyTorch and Intel-extension-for-pytorch
-```bash
-python -m pip install torch==2.1.0a0 -f https://developer.intel.com/ipex-whl-stable-xpu
-
-source /opt/intel/oneapi/setvars.sh
-
-# Build IPEX from Source Code
-git clone https://github.com/intel/intel-extension-for-pytorch.git ipex-gpu
-cd ipex-gpu
-git checkout -b dev/QLLM origin/dev/QLLM
-git submodule update --init --recursive
-export USE_AOT_DEVLIST='pvc,ats-m150'
-export BUILD_WITH_CPU=OFF
-
-pip install -r requirements.txt
-
-python setup.py install
-```
-
-3. Install Intel-extension-for-transformers and Neural-compressor
-```bash
-pip install neural-compressor
-pip install intel-extension-for-transformers
-```
-
-4. Quantize the Model and Run Inference
-```python
-import torch
-import intel_extension_for_pytorch as ipex
-from intel_extension_for_transformers.transformers.modeling import AutoModelForCausalLM
-from transformers import AutoTokenizer
-
-device = "xpu"
-model_name = "Qwen/Qwen-7B"
-tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
-prompt = "Once upon a time, there existed a little girl,"
-inputs = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
-
-qmodel = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True, device_map="xpu", trust_remote_code=True)
-
-# Optimize the model with IPEX to improve performance.
-qmodel = ipex.optimize_transformers(qmodel, inplace=True, dtype=torch.float16, woq=True, device="xpu")
-
-output = qmodel.generate(inputs)
-```
-
-> Note: If your device memory is not enough, quantize and save the model first, then rerun the example and load the model as shown below. If your device memory is enough, skip the instructions below and simply quantize and run inference.
-
-5. Saving and Loading the quantized model
-* First step: Quantize and save the model
-```python
-from intel_extension_for_transformers.transformers.modeling import AutoModelForCausalLM
-
-qmodel = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B", load_in_4bit=True, device_map="xpu", trust_remote_code=True)
-
-# Please note: saving the model must be done before ipex.optimize_transformers is called.
-qmodel.save_pretrained("saved_dir")
-```
-* Second step: Load the model and run inference. (To reduce memory usage, you may need to end the quantization process and rerun the script to load the model.)
-```python
-# Load model
-loaded_model = AutoModelForCausalLM.from_pretrained("saved_dir", trust_remote_code=True)
-
-# Before running the loaded model, you can call ipex.optimize_transformers.
-loaded_model = ipex.optimize_transformers(loaded_model, inplace=True, dtype=torch.float16, quantization_config={}, device="xpu")
-
-output = loaded_model.generate(inputs)
-```
-
-6. You can directly use the [example script](https://github.com/intel/intel-extension-for-transformers/blob/main/examples/huggingface/pytorch/text-generation/quantization/run_generation_gpu_woq.py):
-```bash
-python run_generation_gpu_woq.py --woq --benchmark
-```
-
-> Note:
-> * Saving the quantized model must be done before the optimize_transformers function is called.
-> * The optimize_transformers function is designed to optimize transformer-based models within frontend Python modules, with a particular focus on Large Language Models (LLMs). It provides both model-wise and content-generation-wise optimizations. For details on `optimize_transformers`, please refer to [the link](https://github.com/intel/intel-extension-for-pytorch/blob/xpu-main/docs/tutorials/llm/llm_optimize_transformers.md).
+We support experimental WOQ inference on Intel GPU (PVC/ARC/MTL) by replacing the Linear op in PyTorch. Validated models: Qwen-7B, GPT-J-6B (PVC/ARC only), Llama-7B.

-## Examples For Intel GPU (MTL)
-Intel-extension-for-transformers implements weight-only quantization for Intel GPU (MTL) with [Intel-extension-for-pytorch](https://github.com/intel/intel-extension-for-pytorch). Currently, the Linear op kernel for weight-only quantization is implemented in the Intel-extension-for-pytorch branch "dev/MTL".
-We support experimental WOQ inference on Intel GPU (MTL) by replacing the Linear op in PyTorch. Validated models: Qwen-7B, Llama-7B.
Here is the example code.

#### Prepare Dependency Packages
1. Install the oneAPI Package
-Weight-only quantization ops exist only in the "dev/MTL" branch of intel-extension-for-pytorch, which must be compiled with the oneAPI DPC++ compiler. Please follow [the link](https://www.intel.com/content/www/us/en/developer/articles/guide/installation-guide-for-oneapi-toolkits.html) to install oneAPI into the "/opt/intel" folder.
+The oneAPI DPC++ compiler is needed to compile intel-extension-for-pytorch. Please follow [the link](https://www.intel.com/content/www/us/en/developer/articles/guide/installation-guide-for-oneapi-toolkits.html) to install oneAPI into the "/opt/intel" folder.

2. Build and Install PyTorch and Intel-extension-for-pytorch
```bash
@@ -255,8 +168,8 @@ source /opt/intel/oneapi/setvars.sh
# Build IPEX from Source Code
git clone https://github.com/intel/intel-extension-for-pytorch.git ipex-gpu
cd ipex-gpu
-git checkout -b dev/MTL origin/dev/MTL
git submodule update --init --recursive
+export USE_AOT_DEVLIST='pvc,ats-m150,7d55'
export BUILD_WITH_CPU=OFF

pip install -r requirements.txt
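
For the `RtnConfig` / `AwqConfig` / `GPTQConfig` / `AutoRoundConfig` options mentioned in the section above, here is a minimal usage sketch. It assumes that `RtnConfig` can be imported from `intel_extension_for_transformers.transformers`, that it accepts `bits` and `group_size` arguments, and that `from_pretrained` takes a `quantization_config` parameter; none of this is shown in the diff above, so verify against the current API before relying on it. The other configs would slot into the same `quantization_config` argument.

```python
# Hedged sketch: quantize with an explicit WOQ config instead of load_in_4bit.
# Assumptions (not confirmed by the diff above): the RtnConfig import path,
# its bits/group_size parameters, and the quantization_config argument.
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers.modeling import AutoModelForCausalLM
from intel_extension_for_transformers.transformers import RtnConfig  # assumed import path

model_name = "Qwen/Qwen-7B"
woq_config = RtnConfig(bits=4, group_size=32)  # assumed parameter names

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer("Once upon a time, there existed a little girl,",
                   return_tensors="pt").input_ids.to("xpu")

# Quantize while loading, then apply the same IPEX optimization used elsewhere in this doc.
qmodel = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=woq_config,
    device_map="xpu",
    trust_remote_code=True,
)
qmodel = ipex.optimize_transformers(qmodel, inplace=True, dtype=torch.float16, woq=True, device="xpu")

print(tokenizer.decode(qmodel.generate(inputs)[0]))
```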
