> Note: For LLM runtime model loading, please refer to the [neural_speed readme](https://github.com/intel/neural-speed/blob/main/README.md#quick-start-transformer-like-usage).
## Examples For Intel GPU
Intel-extension-for-transformers implements weight-only quantization for Intel GPUs (PVC/ARC/MTL) with [Intel-extension-for-pytorch](https://github.com/intel/intel-extension-for-pytorch). Currently, the Linear op kernel for weight-only quantization is implemented in the Intel-extension-for-pytorch branch "dev/QLLM".
4-bit and 8-bit inference with `RtnConfig`, `AwqConfig`, `GPTQConfig`, and `AutoRoundConfig` is now supported on Intel GPU devices.
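A minimal sketch of 4-bit inference with `RtnConfig` follows. The `compute_dtype`/`weight_dtype` values are an assumption based on the repository's GPU example script; Qwen-7B is one of the validated models listed below.

```python
import intel_extension_for_pytorch as ipex  # noqa: F401  (registers the XPU device)
from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM, RtnConfig

model_name = "Qwen/Qwen-7B"  # one of the validated models
woq_config = RtnConfig(bits=4, compute_dtype="fp16", weight_dtype="int4_fullrange")

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=woq_config,
    device_map="xpu",
    trust_remote_code=True,
)

# Generate on the XPU device.
input_ids = tokenizer("Once upon a time,", return_tensors="pt").input_ids.to("xpu")
output = model.generate(input_ids, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```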
We support experimental WOQ inference on Intel GPUs (PVC/ARC/MTL) by replacing the Linear op in PyTorch. Validated models: Qwen-7B, GPT-J-6B (PVC/ARC only), and Llama-7B.
Here is the example code.
#### Prepare Dependency Packages
1. Install the oneAPI Package
The oneAPI DPC++ compiler is needed to compile intel-extension-for-pytorch. Please follow [the link](https://www.intel.com/content/www/us/en/developer/articles/guide/installation-guide-for-oneapi-toolkits.html) to install oneAPI into the "/opt/intel" folder.
2. Build and Install PyTorch and Intel-extension-for-pytorch
> Note: If your device memory is not enough, quantize and save the model first, then rerun the example and load the saved model as shown below. If your device memory is sufficient, skip the steps below and run quantization and inference directly.
5. Saving and Loading the quantized model
* First step: Quantize and save the model
```python
from intel_extension_for_transformers.transformers.modeling import AutoModelForCausalLM

# Quantize at load time (4-bit weight-only quantization on the XPU device is
# shown as an illustration; adjust the arguments to your setup).
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B", load_in_4bit=True, device_map="xpu", trust_remote_code=True)

# Please note, saving the model should be executed before the ipex.optimize_transformers function is called.
model.save_pretrained("saved_dir")
```
* Second step: Load the model and run inference. (To reduce memory usage, you may need to end the quantization process and rerun the script to load the saved model.)
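A minimal sketch of this loading step (assuming the model was saved to "saved_dir" as in the first step; `trust_remote_code=True` mirrors the save-side example):

```python
from intel_extension_for_transformers.transformers.modeling import AutoModelForCausalLM

# Load the already-quantized model; no re-quantization happens here.
model = AutoModelForCausalLM.from_pretrained("saved_dir", trust_remote_code=True)
```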
6. You can directly use the [example script](https://github.com/intel/intel-extension-for-transformers/blob/main/examples/huggingface/pytorch/text-generation/quantization/run_generation_gpu_woq.py).
>* Saving the quantized model should be executed before the `optimize_transformers` function is called.
>* The `optimize_transformers` function is designed to optimize transformer-based models within frontend Python modules, with a particular focus on Large Language Models (LLMs). It provides both model-wise and content-generation-wise optimizations. For details of `optimize_transformers`, please refer to [the link](https://github.com/intel/intel-extension-for-pytorch/blob/xpu-main/docs/tutorials/llm/llm_optimize_transformers.md).
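As an illustrative sketch of where this call fits in (the keyword arguments below are an assumption based on common usage of the XPU dev branches; check the linked tutorial for the exact signature of your ipex build):

```python
import torch
import intel_extension_for_pytorch as ipex

# Apply LLM-specific optimizations to the loaded, quantized model before generation.
model = ipex.optimize_transformers(model, inplace=True, dtype=torch.float16, woq=True, device="xpu")
```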