This repository was archived by the owner on Oct 25, 2024. It is now read-only.

Commit fcc2671

update doc
Signed-off-by: zhenwei-intel <[email protected]>
1 parent 5612db3 commit fcc2671

File tree: 1 file changed (+4, -91 lines)

docs/weightonlyquant.md

Lines changed: 4 additions & 91 deletions
@@ -147,104 +147,17 @@ loaded_model = AutoModelForCausalLM.from_pretrained(saved_dir)
> Note: For LLM runtime model loading usage, please refer to [neural_speed readme](https://github.com/intel/neural-speed/blob/main/README.md#quick-start-transformer-like-usage)

## Examples For Intel GPU
-Intel-extension-for-transformers implements weight-only quantization for Intel GPU (PVC and ARC) with [Intel-extension-for-pytorch](https://github.com/intel/intel-extension-for-pytorch). Currently, the Linear op kernel for weight-only quantization is implemented in the Intel-extension-for-pytorch branch "dev/QLLM".
+Intel-extension-for-transformers implements weight-only quantization for Intel GPU (PVC/ARC/MTL) with [Intel-extension-for-pytorch](https://github.com/intel/intel-extension-for-pytorch). Currently, the Linear op kernel for weight-only quantization is implemented in the Intel-extension-for-pytorch branch "dev/QLLM".

Now 4-bit/8-bit inference with `RtnConfig`, `AwqConfig`, `GPTQConfig`, and `AutoRoundConfig` is supported on Intel GPU devices.

-We support experimental WOQ inference on Intel GPU (PVC and ARC) by replacing the Linear op in PyTorch. Validated models: Qwen-7B, GPT-J-6B.
-Here is the example code.
-
-#### Prepare Dependency Packages
-1. Install the oneAPI Package
-Weight-only quantization ops exist only in the "dev/QLLM" branch of intel-extension-for-pytorch, which must be compiled with the oneAPI DPC++ compiler. Please follow [the link](https://www.intel.com/content/www/us/en/developer/articles/guide/installation-guide-for-oneapi-toolkits.html) to install oneAPI into the "/opt/intel" folder.
-
-2. Build and Install PyTorch and Intel-extension-for-pytorch
-```bash
-python -m pip install torch==2.1.0a0 -f https://developer.intel.com/ipex-whl-stable-xpu
-
-source /opt/intel/oneapi/setvars.sh
-
-# Build IPEX from Source Code
-git clone https://github.com/intel/intel-extension-for-pytorch.git ipex-gpu
-cd ipex-gpu
-git checkout -b dev/QLLM origin/dev/QLLM
-git submodule update --init --recursive
-export USE_AOT_DEVLIST='pvc,ats-m150'
-export BUILD_WITH_CPU=OFF
-
-pip install -r requirements.txt
-
-python setup.py install
-```
-
-3. Install Intel-extension-for-transformers and Neural-compressor
-```bash
-pip install neural-compressor
-pip install intel-extension-for-transformers
-```
-
-4. Quantize the Model and Run Inference
-```python
-import torch
-import intel_extension_for_pytorch as ipex
-from intel_extension_for_transformers.transformers.modeling import AutoModelForCausalLM
-from transformers import AutoTokenizer
-
-device = "xpu"
-model_name = "Qwen/Qwen-7B"
-tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
-prompt = "Once upon a time, there existed a little girl,"
-inputs = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
-
-qmodel = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True, device_map="xpu", trust_remote_code=True)
-
-# Optimize the model with IPEX to improve performance.
-qmodel = ipex.optimize_transformers(qmodel, inplace=True, dtype=torch.float16, woq=True, device="xpu")
-
-output = qmodel.generate(inputs)
-```
-
-> Note: If your device memory is not enough, quantize and save the model first, then rerun the example and load the model as shown below. If your device memory is enough, skip the instructions below and simply quantize and run inference.
-
-5. Saving and Loading the quantized model
-* First step: Quantize and save the model
-```python
-from intel_extension_for_transformers.transformers.modeling import AutoModelForCausalLM
-
-qmodel = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B", load_in_4bit=True, device_map="xpu", trust_remote_code=True)
-
-# Please note: saving the model must be done before ipex.optimize_transformers is called.
-qmodel.save_pretrained("saved_dir")
-```
-* Second step: Load the model and run inference. (To reduce memory usage, you may need to end the quantization process and rerun the script to load the model.)
-```python
-# Load model
-loaded_model = AutoModelForCausalLM.from_pretrained("saved_dir", trust_remote_code=True)
-
-# Before running the loaded model, you can call ipex.optimize_transformers.
-loaded_model = ipex.optimize_transformers(loaded_model, inplace=True, dtype=torch.float16, quantization_config={}, device="xpu")
-
-output = loaded_model.generate(inputs)
-```
-
-6. You can directly use the [example script](https://github.com/intel/intel-extension-for-transformers/blob/main/examples/huggingface/pytorch/text-generation/quantization/run_generation_gpu_woq.py):
-```bash
-python run_generation_gpu_woq.py --woq --benchmark
-```
-
-> Note:
-> * Saving the quantized model must be done before the optimize_transformers function is called.
-> * The optimize_transformers function is designed to optimize transformer-based models within frontend Python modules, with a particular focus on Large Language Models (LLMs). It provides both model-wise and content-generation-wise optimizations. For details on `optimize_transformers`, please refer to [the link](https://github.com/intel/intel-extension-for-pytorch/blob/xpu-main/docs/tutorials/llm/llm_optimize_transformers.md).
+We support experimental WOQ inference on Intel GPU (PVC/ARC/MTL) by replacing the Linear op in PyTorch. Validated models: Qwen-7B, GPT-J-6B (PVC/ARC only), Llama-7B.

-## Examples For Intel GPU (MTL)
-Intel-extension-for-transformers implements weight-only quantization for Intel GPU (MTL) with [Intel-extension-for-pytorch](https://github.com/intel/intel-extension-for-pytorch). Currently, the Linear op kernel for weight-only quantization is implemented in the Intel-extension-for-pytorch branch "dev/MTL".
-We support experimental WOQ inference on Intel GPU (MTL) by replacing the Linear op in PyTorch. Validated models: Qwen-7B, Llama-7B.
Here is the example code.

#### Prepare Dependency Packages
1. Install the oneAPI Package
-Weight-only quantization ops exist only in the "dev/MTL" branch of intel-extension-for-pytorch, which must be compiled with the oneAPI DPC++ compiler. Please follow [the link](https://www.intel.com/content/www/us/en/developer/articles/guide/installation-guide-for-oneapi-toolkits.html) to install oneAPI into the "/opt/intel" folder.
+The oneAPI DPC++ compiler is needed to compile intel-extension-for-pytorch. Please follow [the link](https://www.intel.com/content/www/us/en/developer/articles/guide/installation-guide-for-oneapi-toolkits.html) to install oneAPI into the "/opt/intel" folder.

2. Build and Install PyTorch and Intel-extension-for-pytorch
```bash
@@ -255,8 +168,8 @@ source /opt/intel/oneapi/setvars.sh
# Build IPEX from Source Code
git clone https://github.com/intel/intel-extension-for-pytorch.git ipex-gpu
cd ipex-gpu
-git checkout -b dev/MTL origin/dev/MTL
git submodule update --init --recursive
+export USE_AOT_DEVLIST='pvc,ats-m150,7d55'
export BUILD_WITH_CPU=OFF

pip install -r requirements.txt
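
For the `RtnConfig` / `AwqConfig` / `GPTQConfig` / `AutoRoundConfig` options mentioned in the section above, here is a minimal usage sketch. It assumes that `RtnConfig` can be imported from `intel_extension_for_transformers.transformers`, that it accepts `bits` and `group_size` arguments, and that `from_pretrained` takes a `quantization_config` parameter; none of this is shown in the diff above, so verify against the current API before relying on it. The other configs would slot into the same `quantization_config` argument.

```python
# Hedged sketch: quantize with an explicit WOQ config instead of load_in_4bit.
# Assumptions (not confirmed by the diff above): the RtnConfig import path,
# its bits/group_size parameters, and the quantization_config argument.
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers.modeling import AutoModelForCausalLM
from intel_extension_for_transformers.transformers import RtnConfig  # assumed import path

model_name = "Qwen/Qwen-7B"
woq_config = RtnConfig(bits=4, group_size=32)  # assumed parameter names

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer("Once upon a time, there existed a little girl,",
                   return_tensors="pt").input_ids.to("xpu")

# Quantize while loading, then apply the same IPEX optimization used elsewhere in this doc.
qmodel = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=woq_config,
    device_map="xpu",
    trust_remote_code=True,
)
qmodel = ipex.optimize_transformers(qmodel, inplace=True, dtype=torch.float16, woq=True, device="xpu")

print(tokenizer.decode(qmodel.generate(inputs)[0]))
```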
