From e9583de6f2bf08f7523ceecc138e4d894bb7dc5f Mon Sep 17 00:00:00 2001 From: Kaihui-intel Date: Mon, 30 Sep 2024 11:26:07 +0800 Subject: [PATCH 1/8] add transformers-like api doc Signed-off-by: Kaihui-intel --- docs/source/3x/transformers_like_api.md | 226 ++++++++++++++++++++++++ 1 file changed, 226 insertions(+) create mode 100644 docs/source/3x/transformers_like_api.md diff --git a/docs/source/3x/transformers_like_api.md b/docs/source/3x/transformers_like_api.md new file mode 100644 index 00000000000..fd45401be93 --- /dev/null +++ b/docs/source/3x/transformers_like_api.md @@ -0,0 +1,226 @@ +Weight Only Quantization (WOQ) +===== + +1. [Introduction](#introduction) + +2. [Supported Algorithms](#supported-algorithms) + +3. [Usage For Intel CPU](#Usage-for-cpu-and-cuda) + +4. [Usage For Intel GPU](#Usage-for-intel-gpu) + +## Introduction + +Transformers-like API provides seamless user experience of model compressions on Transformer-based models by extending [Hugging Face transformers](https://github.com/huggingface/transformers) APIs and leveraging [Intel® Neural Compressor](https://github.com/intel/neural-compressor). +## Supported Algorithms + +| Support Device | Rtn | Awq | Teq | GPTQ | AutoRound | +|:--------------:|:----------:|:----------:|:----------:|:----:|:----:| +| Intel CPU | ✔ | ✔ | ✔ | ✔ | ✔ | +| Intel GPU | ✔ | stay tuned | stay tuned | ✔ | ✔ | + +> Please refer to [weight-only quant document](./PT_WeightOnlyQuant.md) for more details. + + +## Usage For CPU + +Our motivation is to improve CPU support for weight only quantization. We have extended the `from_pretrained` function so that `quantization_config` can accept [`RtnConfig`](https://github.com/intel/neural-compressor/blob/master/neural_compressor/transformers/utils/quantization_config.py#L243), [`AwqConfig`](https://github.com/intel/neural-compressor/blob/72398b69334d90cdd7664ac12a025cd36695b55c/neural_compressor/transformers/utils/quantization_config.py#L394), [`TeqConfig`](https://github.com/intel/neural-compressor/blob/72398b69334d90cdd7664ac12a025cd36695b55c/neural_compressor/transformers/utils/quantization_config.py#L464), [`GPTQConfig`](https://github.com/intel/neural-compressor/blob/72398b69334d90cdd7664ac12a025cd36695b55c/neural_compressor/transformers/utils/quantization_config.py#L298), [`AutoroundConfig`](https://github.com/intel/neural-compressor/blob/72398b69334d90cdd7664ac12a025cd36695b55c/neural_compressor/transformers/utils/quantization_config.py#L527) to implement conversion on the CPU. + +### Usage examples for CPU device +quantization and inference with `RtnConfig`, `AwqConfig`, `TeqConfig`, `GPTQConfig`, `AutoRoundConfig` on CPU device. 
+```python +# RTN +from neural_compressor.transformers import AutoModelForCausalLM, RtnConfig +model_name_or_path = "MODEL_NAME_OR_PATH" +woq_config = RtnConfig(bits=4) +q_model = AutoModelForCausalLM.from_pretrained( + model_name_or_path, quantization_config=woq_config, + ) + +# AWQ +from neural_compressor.transformers import AutoModelForCausalLM, AwqConfig +model_name_or_path = "MODEL_NAME_OR_PATH" +woq_config = AwqConfig(bits=4) +q_model = AutoModelForCausalLM.from_pretrained( + model_name_or_path, + quantization_config=woq_config, + ) + +# TEQ +from transformers import AutoTokenizer +from neural_compressor.transformers import AutoModelForCausalLM, TeqConfig + +model_name_or_path = "MODEL_NAME_OR_PATH" +tokenizer = AutoTokenizer.from_pretrained(model_name_or_path) +woq_config = TeqConfig(bits=4, tokenizer=tokenizer) +q_model = AutoModelForCausalLM.from_pretrained( + model_name_or_path, + quantization_config=woq_config + ) + +# GPTQ +from transformers import AutoTokenizer +from neural_compressor.transformers import AutoModelForCausalLM, GPTQConfig + +model_name_or_path = "MODEL_NAME_OR_PATH" +tokenizer = AutoTokenizer.from_pretrained(model_name_or_path) +woq_config = GPTQConfig(bits=4, tokenizer=tokenizer) +woq_model = AutoModelForCausalLM.from_pretrained( + model_name_or_path, + quantization_config=woq_config + ) + +# AutoRound +from transformers import AutoTokenizer +from neural_compressor.transformers import AutoModelForCausalLM, AutoRoundConfig + +model_name_or_path = "MODEL_NAME_OR_PATH" +tokenizer = AutoTokenizer.from_pretrained(model_name_or_path) +woq_config = AutoRoundConfig(bits=4, tokenizer=tokenizer) +woq_model = AutoModelForCausalLM.from_pretrained( + model_name_or_path, + quantization_config=woq_config + ) + +# inference +from transformers import AutoTokenizer +prompt = "Once upon a time, a little girl" +tokenizer = AutoTokenizer.from_pretrained(model_name_or_path) + +input_ids = tokenizer(prompt, return_tensors="pt")["input_ids"] +generate_kwargs = dict(do_sample=False, temperature=0.9, num_beams=4) +gen_ids = q_model.generate(input_ids, **generate_kwargs) +gen_text = tokenizer.batch_decode(gen_ids, skip_special_tokens=True) +print(gen_text) +``` + +You can also save and load your quantized low bit model by the below code. + +```python +# quant +from neural_compressor.transformers import AutoModelForCausalLM, RtnConfig +model_name_or_path = "MODEL_NAME_OR_PATH" +woq_config = RtnConfig(bits=4) +q_model = AutoModelForCausalLM.from_pretrained( + model_name_or_path, quantization_config=woq_config, + ) + +# save quant model +saved_dir = "SAVE_DIR" +q_model.save_pretrained(saved_dir) + +# load quant model +loaded_model = AutoModelForCausalLM.from_pretrained(saved_dir) +``` + +## Usage For Intel GPU +Intel® Neural Compressor implement weight-only quantization for intel GPU(PVC/ARC/MTL) with [Intel-extension-for-pytorch](https://github.com/intel/intel-extension-for-pytorch). + +Now 4-bit/8-bit inference with `RtnConfig`, `GPTQConfig`, `AutoRoundConfig` are support on intel GPU device. + +We support experimental woq inference on intel GPU(PVC/ARC/MTL) with replacing Linear op in PyTorch. Validated models: Qwen-7B, Llama-7B, Phi-3. + +Here are the example codes. + +#### Prepare Dependency Packages +1. Install Oneapi Package +The Oneapi DPCPP compiler is required to compile intel-extension-for-pytorch. Please follow [the link](https://www.intel.com/content/www/us/en/developer/articles/guide/installation-guide-for-oneapi-toolkits.html) to install the OneAPI to "/opt/intel folder". 
+ +2. Build and Install PyTorch and Intel-extension-for-pytorch +```python +python -m pip install torch==2.3.1+cxx11.abi --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/ + +# Build IPEX from Source Code +git clone https://github.com/intel/intel-extension-for-pytorch.git ipex-gpu +cd ipex-gpu +git submodule update --init --recursive +export USE_AOT_DEVLIST='pvc,ats-m150' # Comment this line if you are compiling for MTL +export BUILD_WITH_CPU=OFF +export LD_LIBRARY_PATH=${CONDA_PREFIX}/lib/:$LD_LIBRARY_PATH +export OCL_ICD_VENDORS=/etc/OpenCL/vendors +export CCL_ROOT=${CONDA_PREFIX} +source /opt/intel/oneapi/setvars.sh --force +export LLM_ACC_TEST=1 +pip install -r requirements.txt + +python setup.py install +``` + +3. Install Neural-compressor +```pythpon +pip install neural-compressor +``` + +4. Quantization Model and Inference +```python +import intel_extension_for_pytorch as ipex +from neural_compressor.transformers import AutoModelForCausalLM +from transformers import AutoTokenizer +import torch + +model_name_or_path = "Qwen/Qwen-7B-Chat" # MODEL_NAME_OR_PATH +prompt = "Once upon a time, a little girl" +input_ids = tokenizer(prompt, return_tensors="pt")["input_ids"] +tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=True) + +q_model = AutoModelForCausalLM.from_pretrained(model_name_or_path, device_map="xpu", trust_remote_code=True) + +# optimize the model with ipex, it will improve performance. +quantization_config = q_model.quantization_config if hasattr (user_model, "quantization_config") else None +q_model = ipex.optimize_transformers(q_model, inplace=True, dtype=torch.float16, quantization_config=quantizaiton_config, device="xpu") + +output = q_model.generate(input_ids, max_new_tokens=100, do_sample=True) +print(tokenizer.batch_decode(output, skip_special_tokens=True)) +``` + +> Note: If your device memory is not enough, please quantize and save the model first, then rerun the example with loading the model as below, If your device memory is enough, skip below instruction, just quantization and inference. + +5. Saving and Loading quantized model + * First step: Quantize and save model +```python + +from neural_compressor.transformers import AutoModelForCausalLM, RtnConfig +model_name_or_path = "MODEL_NAME_OR_PATH" +woq_config = RtnConfig(bits=4) +q_model = AutoModelForCausalLM.from_pretrained( + model_name_or_path, quantization_config=woq_config, + device_map="xpu", + trust_remote_code=True, + ) + +# Please note, saving model should be executed before ipex.optimize_transformers function is called. +q_model.save_pretrained("saved_dir") +``` + * Second step: Load model and inference(In order to reduce memory usage, you may need to end the quantize process and rerun the script to load the model.) +```python +# Load model +loaded_model = AutoModelForCausalLM.from_pretrained("saved_dir", trust_remote_code=True) + +# Before executed the loaded model, you can call ipex.optimize_transformers function. 
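+# The quantized model carries a quantization_config attribute (checked via hasattr below);
+# passing it to ipex.optimize_transformers lets IPEX apply its weight-only-quantization
+# optimizations for the xpu device (see the optimize_transformers note later in this doc).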
+quantization_config = q_model.quantization_config if hasattr (user_model, "quantization_config") else None +loaded_model = ipex.optimize_transformers(loaded_model, inplace=True, dtype=torch.float16, quantization_config=quantization_config, device="xpu") + +# inference +from transformers import AutoTokenizer +prompt = "Once upon a time, a little girl" +tokenizer = AutoTokenizer.from_pretrained(model_name_or_path) +input_ids = tokenizer(prompt, return_tensors="pt")["input_ids"] +generate_kwargs = dict(do_sample=False, temperature=0.9, num_beams=4) +gen_ids = q_model.generate(input_ids, **generate_kwargs) +gen_text = tokenizer.batch_decode(gen_ids, skip_special_tokens=True) +print(gen_text) + +``` + +6. You can directly use [example script](https://github.com/intel/neural-compressor/blob/master/examples/3.x_api/pytorch/nlp/huggingface_models/language-modeling/quantization/transformers/weight_only/text-generation/run_generation_gpu_woq.py) +```python +python run_generation_gpu_woq.py --woq --benchmark --model save_dir +``` + +>Note: +> * Saving quantized model should be executed before the optimize_transformers function is called. +> * The optimize_transformers function is designed to optimize transformer-based models within frontend Python modules, with a particular focus on Large Language Models (LLMs). It provides optimizations for both model-wise and content-generation-wise. The detail of `optimize_transformers`, please refer to [the link](https://github.com/intel/intel-extension-for-pytorch/blob/xpu-main/docs/tutorials/llm/llm_optimize_transformers.md). + +## Examples + +Users can also refer to [examples](https://github.com/intel/neural-compressor/blob/master/examples/3.x_api/pytorch/nlp/huggingface_models/language-modeling/quantization/transformers/weight_only/text-generation) on how to quantize a model with transformers-like api. \ No newline at end of file From 34bab8523a370ff3c6fbed56c9a0be8b79a8aab5 Mon Sep 17 00:00:00 2001 From: Kaihui-intel Date: Mon, 30 Sep 2024 11:29:14 +0800 Subject: [PATCH 2/8] update toc Signed-off-by: Kaihui-intel --- docs/source/3x/transformers_like_api.md | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/docs/source/3x/transformers_like_api.md b/docs/source/3x/transformers_like_api.md index fd45401be93..6340743345a 100644 --- a/docs/source/3x/transformers_like_api.md +++ b/docs/source/3x/transformers_like_api.md @@ -5,9 +5,11 @@ Weight Only Quantization (WOQ) 2. [Supported Algorithms](#supported-algorithms) -3. [Usage For Intel CPU](#Usage-for-cpu-and-cuda) +3. [Usage For Intel CPU](#usage-for-cpu) -4. [Usage For Intel GPU](#Usage-for-intel-gpu) +4. [Usage For Intel GPU](#usage-for-intel-gpu) + +5. [Examples](#examples) ## Introduction From 6304d17bf948f18e23ab8a4aad601bccf39f809f Mon Sep 17 00:00:00 2001 From: Kaihui-intel Date: Mon, 30 Sep 2024 12:26:28 +0800 Subject: [PATCH 3/8] update title Signed-off-by: Kaihui-intel --- docs/source/3x/transformers_like_api.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/source/3x/transformers_like_api.md b/docs/source/3x/transformers_like_api.md index 6340743345a..291ec763170 100644 --- a/docs/source/3x/transformers_like_api.md +++ b/docs/source/3x/transformers_like_api.md @@ -1,4 +1,4 @@ -Weight Only Quantization (WOQ) +Transformers-like API ===== 1. 
[Introduction](#introduction) @@ -13,7 +13,7 @@ Weight Only Quantization (WOQ) ## Introduction -Transformers-like API provides seamless user experience of model compressions on Transformer-based models by extending [Hugging Face transformers](https://github.com/huggingface/transformers) APIs and leveraging [Intel® Neural Compressor](https://github.com/intel/neural-compressor). +Transformers-like API provides a seamless user experience of model compressions on Transformer-based models by extending [Hugging Face transformers](https://github.com/huggingface/transformers) APIs, leveraging [Intel® Neural Compressor](https://github.com/intel/neural-compressor), and replacing Linear operator with [Intel® Extension for PyTorch](https://github.com/intel/intel-extension-for-pytorch). ## Supported Algorithms | Support Device | Rtn | Awq | Teq | GPTQ | AutoRound | From 51407fd9f813102309b8266d8dbeb3ee1eeae596 Mon Sep 17 00:00:00 2001 From: Kaihui-intel Date: Mon, 30 Sep 2024 12:34:02 +0800 Subject: [PATCH 4/8] minor fix Signed-off-by: Kaihui-intel --- docs/source/3x/transformers_like_api.md | 14 ++++++++------ 1 file changed, 8 insertions(+), 6 deletions(-) diff --git a/docs/source/3x/transformers_like_api.md b/docs/source/3x/transformers_like_api.md index 291ec763170..e0623d444d9 100644 --- a/docs/source/3x/transformers_like_api.md +++ b/docs/source/3x/transformers_like_api.md @@ -21,7 +21,7 @@ Transformers-like API provides a seamless user experience of model compressions | Intel CPU | ✔ | ✔ | ✔ | ✔ | ✔ | | Intel GPU | ✔ | stay tuned | stay tuned | ✔ | ✔ | -> Please refer to [weight-only quant document](./PT_WeightOnlyQuant.md) for more details. +> Please refer to [weight-only quantization document](./PT_WeightOnlyQuant.md) for more details. ## Usage For CPU @@ -36,7 +36,8 @@ from neural_compressor.transformers import AutoModelForCausalLM, RtnConfig model_name_or_path = "MODEL_NAME_OR_PATH" woq_config = RtnConfig(bits=4) q_model = AutoModelForCausalLM.from_pretrained( - model_name_or_path, quantization_config=woq_config, + model_name_or_path, + quantization_config=woq_config, ) # AWQ @@ -57,7 +58,7 @@ tokenizer = AutoTokenizer.from_pretrained(model_name_or_path) woq_config = TeqConfig(bits=4, tokenizer=tokenizer) q_model = AutoModelForCausalLM.from_pretrained( model_name_or_path, - quantization_config=woq_config + quantization_config=woq_config, ) # GPTQ @@ -69,7 +70,7 @@ tokenizer = AutoTokenizer.from_pretrained(model_name_or_path) woq_config = GPTQConfig(bits=4, tokenizer=tokenizer) woq_model = AutoModelForCausalLM.from_pretrained( model_name_or_path, - quantization_config=woq_config + quantization_config=woq_config, ) # AutoRound @@ -81,7 +82,7 @@ tokenizer = AutoTokenizer.from_pretrained(model_name_or_path) woq_config = AutoRoundConfig(bits=4, tokenizer=tokenizer) woq_model = AutoModelForCausalLM.from_pretrained( model_name_or_path, - quantization_config=woq_config + quantization_config=woq_config, ) # inference @@ -104,7 +105,8 @@ from neural_compressor.transformers import AutoModelForCausalLM, RtnConfig model_name_or_path = "MODEL_NAME_OR_PATH" woq_config = RtnConfig(bits=4) q_model = AutoModelForCausalLM.from_pretrained( - model_name_or_path, quantization_config=woq_config, + model_name_or_path, + quantization_config=woq_config, ) # save quant model From c2fb59c5efc03b88ff6f6b950e75250b007471b3 Mon Sep 17 00:00:00 2001 From: "pre-commit-ci[bot]" <66853113+pre-commit-ci[bot]@users.noreply.github.com> Date: Mon, 30 Sep 2024 05:23:12 +0000 Subject: [PATCH 5/8] [pre-commit.ci] auto fixes from 
pre-commit.com hooks for more information, see https://pre-commit.ci --- docs/source/3x/transformers_like_api.md | 44 ++++++++++++++----------- 1 file changed, 25 insertions(+), 19 deletions(-) diff --git a/docs/source/3x/transformers_like_api.md b/docs/source/3x/transformers_like_api.md index e0623d444d9..e1ba8c8b290 100644 --- a/docs/source/3x/transformers_like_api.md +++ b/docs/source/3x/transformers_like_api.md @@ -33,21 +33,23 @@ quantization and inference with `RtnConfig`, `AwqConfig`, `TeqConfig`, `GPTQConf ```python # RTN from neural_compressor.transformers import AutoModelForCausalLM, RtnConfig + model_name_or_path = "MODEL_NAME_OR_PATH" woq_config = RtnConfig(bits=4) q_model = AutoModelForCausalLM.from_pretrained( model_name_or_path, quantization_config=woq_config, - ) +) # AWQ from neural_compressor.transformers import AutoModelForCausalLM, AwqConfig + model_name_or_path = "MODEL_NAME_OR_PATH" woq_config = AwqConfig(bits=4) q_model = AutoModelForCausalLM.from_pretrained( model_name_or_path, quantization_config=woq_config, - ) +) # TEQ from transformers import AutoTokenizer @@ -59,7 +61,7 @@ woq_config = TeqConfig(bits=4, tokenizer=tokenizer) q_model = AutoModelForCausalLM.from_pretrained( model_name_or_path, quantization_config=woq_config, - ) +) # GPTQ from transformers import AutoTokenizer @@ -71,7 +73,7 @@ woq_config = GPTQConfig(bits=4, tokenizer=tokenizer) woq_model = AutoModelForCausalLM.from_pretrained( model_name_or_path, quantization_config=woq_config, - ) +) # AutoRound from transformers import AutoTokenizer @@ -83,10 +85,11 @@ woq_config = AutoRoundConfig(bits=4, tokenizer=tokenizer) woq_model = AutoModelForCausalLM.from_pretrained( model_name_or_path, quantization_config=woq_config, - ) +) # inference from transformers import AutoTokenizer + prompt = "Once upon a time, a little girl" tokenizer = AutoTokenizer.from_pretrained(model_name_or_path) @@ -102,12 +105,13 @@ You can also save and load your quantized low bit model by the below code. ```python # quant from neural_compressor.transformers import AutoModelForCausalLM, RtnConfig + model_name_or_path = "MODEL_NAME_OR_PATH" woq_config = RtnConfig(bits=4) q_model = AutoModelForCausalLM.from_pretrained( model_name_or_path, quantization_config=woq_config, - ) +) # save quant model saved_dir = "SAVE_DIR" @@ -162,7 +166,7 @@ from neural_compressor.transformers import AutoModelForCausalLM from transformers import AutoTokenizer import torch -model_name_or_path = "Qwen/Qwen-7B-Chat" # MODEL_NAME_OR_PATH +model_name_or_path = "Qwen/Qwen-7B-Chat" # MODEL_NAME_OR_PATH prompt = "Once upon a time, a little girl" input_ids = tokenizer(prompt, return_tensors="pt")["input_ids"] tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=True) @@ -170,8 +174,10 @@ tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code= q_model = AutoModelForCausalLM.from_pretrained(model_name_or_path, device_map="xpu", trust_remote_code=True) # optimize the model with ipex, it will improve performance. 
-quantization_config = q_model.quantization_config if hasattr (user_model, "quantization_config") else None -q_model = ipex.optimize_transformers(q_model, inplace=True, dtype=torch.float16, quantization_config=quantizaiton_config, device="xpu") +quantization_config = q_model.quantization_config if hasattr(user_model, "quantization_config") else None +q_model = ipex.optimize_transformers( + q_model, inplace=True, dtype=torch.float16, quantization_config=quantizaiton_config, device="xpu" +) output = q_model.generate(input_ids, max_new_tokens=100, do_sample=True) print(tokenizer.batch_decode(output, skip_special_tokens=True)) @@ -182,17 +188,15 @@ print(tokenizer.batch_decode(output, skip_special_tokens=True)) 5. Saving and Loading quantized model * First step: Quantize and save model ```python - from neural_compressor.transformers import AutoModelForCausalLM, RtnConfig + model_name_or_path = "MODEL_NAME_OR_PATH" woq_config = RtnConfig(bits=4) q_model = AutoModelForCausalLM.from_pretrained( - model_name_or_path, quantization_config=woq_config, - device_map="xpu", - trust_remote_code=True, - ) + model_name_or_path, quantization_config=woq_config, device_map="xpu", trust_remote_code=True, +) -# Please note, saving model should be executed before ipex.optimize_transformers function is called. +# Please note, saving model should be executed before ipex.optimize_transformers function is called. q_model.save_pretrained("saved_dir") ``` * Second step: Load model and inference(In order to reduce memory usage, you may need to end the quantize process and rerun the script to load the model.) @@ -201,11 +205,14 @@ q_model.save_pretrained("saved_dir") loaded_model = AutoModelForCausalLM.from_pretrained("saved_dir", trust_remote_code=True) # Before executed the loaded model, you can call ipex.optimize_transformers function. -quantization_config = q_model.quantization_config if hasattr (user_model, "quantization_config") else None -loaded_model = ipex.optimize_transformers(loaded_model, inplace=True, dtype=torch.float16, quantization_config=quantization_config, device="xpu") +quantization_config = q_model.quantization_config if hasattr(user_model, "quantization_config") else None +loaded_model = ipex.optimize_transformers( + loaded_model, inplace=True, dtype=torch.float16, quantization_config=quantization_config, device="xpu" +) # inference from transformers import AutoTokenizer + prompt = "Once upon a time, a little girl" tokenizer = AutoTokenizer.from_pretrained(model_name_or_path) input_ids = tokenizer(prompt, return_tensors="pt")["input_ids"] @@ -213,7 +220,6 @@ generate_kwargs = dict(do_sample=False, temperature=0.9, num_beams=4) gen_ids = q_model.generate(input_ids, **generate_kwargs) gen_text = tokenizer.batch_decode(gen_ids, skip_special_tokens=True) print(gen_text) - ``` 6. You can directly use [example script](https://github.com/intel/neural-compressor/blob/master/examples/3.x_api/pytorch/nlp/huggingface_models/language-modeling/quantization/transformers/weight_only/text-generation/run_generation_gpu_woq.py) @@ -227,4 +233,4 @@ python run_generation_gpu_woq.py --woq --benchmark --model save_dir ## Examples -Users can also refer to [examples](https://github.com/intel/neural-compressor/blob/master/examples/3.x_api/pytorch/nlp/huggingface_models/language-modeling/quantization/transformers/weight_only/text-generation) on how to quantize a model with transformers-like api. 
\ No newline at end of file +Users can also refer to [examples](https://github.com/intel/neural-compressor/blob/master/examples/3.x_api/pytorch/nlp/huggingface_models/language-modeling/quantization/transformers/weight_only/text-generation) on how to quantize a model with transformers-like api. From 8f63ecca8ca2e6c13eeb12cdd95e27caa2e34c75 Mon Sep 17 00:00:00 2001 From: Kaihui-intel Date: Mon, 30 Sep 2024 12:38:18 +0800 Subject: [PATCH 6/8] update code Signed-off-by: Kaihui-intel --- docs/source/3x/transformers_like_api.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/source/3x/transformers_like_api.md b/docs/source/3x/transformers_like_api.md index e1ba8c8b290..88424000dc6 100644 --- a/docs/source/3x/transformers_like_api.md +++ b/docs/source/3x/transformers_like_api.md @@ -174,7 +174,7 @@ tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code= q_model = AutoModelForCausalLM.from_pretrained(model_name_or_path, device_map="xpu", trust_remote_code=True) # optimize the model with ipex, it will improve performance. -quantization_config = q_model.quantization_config if hasattr(user_model, "quantization_config") else None +quantization_config = q_model.quantization_config if hasattr(q_model, "quantization_config") else None q_model = ipex.optimize_transformers( q_model, inplace=True, dtype=torch.float16, quantization_config=quantizaiton_config, device="xpu" ) @@ -205,7 +205,7 @@ q_model.save_pretrained("saved_dir") loaded_model = AutoModelForCausalLM.from_pretrained("saved_dir", trust_remote_code=True) # Before executed the loaded model, you can call ipex.optimize_transformers function. -quantization_config = q_model.quantization_config if hasattr(user_model, "quantization_config") else None +quantization_config = q_model.quantization_config if hasattr(q_model, "quantization_config") else None loaded_model = ipex.optimize_transformers( loaded_model, inplace=True, dtype=torch.float16, quantization_config=quantization_config, device="xpu" ) @@ -217,7 +217,7 @@ prompt = "Once upon a time, a little girl" tokenizer = AutoTokenizer.from_pretrained(model_name_or_path) input_ids = tokenizer(prompt, return_tensors="pt")["input_ids"] generate_kwargs = dict(do_sample=False, temperature=0.9, num_beams=4) -gen_ids = q_model.generate(input_ids, **generate_kwargs) +gen_ids = loaded_model.generate(input_ids, **generate_kwargs) gen_text = tokenizer.batch_decode(gen_ids, skip_special_tokens=True) print(gen_text) ``` From ec9143f89cba7845a7ca98bab09a1f4c447a54c2 Mon Sep 17 00:00:00 2001 From: Kaihui-intel Date: Mon, 30 Sep 2024 14:41:47 +0800 Subject: [PATCH 7/8] update docs for comments Signed-off-by: Kaihui-intel --- docs/source/3x/transformers_like_api.md | 44 +++++++------------------ 1 file changed, 11 insertions(+), 33 deletions(-) diff --git a/docs/source/3x/transformers_like_api.md b/docs/source/3x/transformers_like_api.md index 88424000dc6..ec6670e94c8 100644 --- a/docs/source/3x/transformers_like_api.md +++ b/docs/source/3x/transformers_like_api.md @@ -13,10 +13,11 @@ Transformers-like API ## Introduction -Transformers-like API provides a seamless user experience of model compressions on Transformer-based models by extending [Hugging Face transformers](https://github.com/huggingface/transformers) APIs, leveraging [Intel® Neural Compressor](https://github.com/intel/neural-compressor), and replacing Linear operator with [Intel® Extension for PyTorch](https://github.com/intel/intel-extension-for-pytorch). 
+Transformers-like API provides a seamless user experience of model compression on Transformer-based models by extending Hugging Face transformers APIs, leveraging Intel® Neural Compressor's existing weight-only quantization capability, and replacing the Linear operator with Intel® Extension for PyTorch.
+
 ## Supported Algorithms
 
-| Support Device | Rtn | Awq | Teq | GPTQ | AutoRound |
+| Support Device | RTN | AWQ | TEQ | GPTQ | AutoRound |
 |:--------------:|:----------:|:----------:|:----------:|:----:|:----:|
 | Intel CPU | ✔ | ✔ | ✔ | ✔ | ✔ |
 | Intel GPU | ✔ | stay tuned | stay tuned | ✔ | ✔ |
 
-> Please refer to [weight-only quant document](./PT_WeightOnlyQuant.md) for more details.
+> Please refer to the [weight-only quantization document](./PT_WeightOnlyQuant.md) for more details.
 
 
 ## Usage For CPU
 
-Our motivation is to improve CPU support for weight only quantization. We have extended the `from_pretrained` function so that `quantization_config` can accept [`RtnConfig`](https://github.com/intel/neural-compressor/blob/master/neural_compressor/transformers/utils/quantization_config.py#L243), [`AwqConfig`](https://github.com/intel/neural-compressor/blob/72398b69334d90cdd7664ac12a025cd36695b55c/neural_compressor/transformers/utils/quantization_config.py#L394), [`TeqConfig`](https://github.com/intel/neural-compressor/blob/72398b69334d90cdd7664ac12a025cd36695b55c/neural_compressor/transformers/utils/quantization_config.py#L464), [`GPTQConfig`](https://github.com/intel/neural-compressor/blob/72398b69334d90cdd7664ac12a025cd36695b55c/neural_compressor/transformers/utils/quantization_config.py#L298), [`AutoroundConfig`](https://github.com/intel/neural-compressor/blob/72398b69334d90cdd7664ac12a025cd36695b55c/neural_compressor/transformers/utils/quantization_config.py#L527) to implement conversion on the CPU.
+Our motivation is to improve CPU support for weight-only quantization. We have extended the `from_pretrained` function so that `quantization_config` can accept [`RtnConfig`](https://github.com/intel/neural-compressor/blob/master/neural_compressor/transformers/utils/quantization_config.py#L243), [`AwqConfig`](https://github.com/intel/neural-compressor/blob/72398b69334d90cdd7664ac12a025cd36695b55c/neural_compressor/transformers/utils/quantization_config.py#L394), [`TeqConfig`](https://github.com/intel/neural-compressor/blob/72398b69334d90cdd7664ac12a025cd36695b55c/neural_compressor/transformers/utils/quantization_config.py#L464), [`GPTQConfig`](https://github.com/intel/neural-compressor/blob/72398b69334d90cdd7664ac12a025cd36695b55c/neural_compressor/transformers/utils/quantization_config.py#L298), [`AutoroundConfig`](https://github.com/intel/neural-compressor/blob/72398b69334d90cdd7664ac12a025cd36695b55c/neural_compressor/transformers/utils/quantization_config.py#L527) to implement the conversion on the CPU.
 
 ### Usage examples for CPU device
 quantization and inference with `RtnConfig`, `AwqConfig`, `TeqConfig`, `GPTQConfig`, `AutoRoundConfig` on CPU device.
@@ -122,11 +123,11 @@ loaded_model = AutoModelForCausalLM.from_pretrained(saved_dir)
 ```
 
 ## Usage For Intel GPU
-Intel® Neural Compressor implement weight-only quantization for intel GPU(PVC/ARC/MTL) with [Intel-extension-for-pytorch](https://github.com/intel/intel-extension-for-pytorch).
+Intel® Neural Compressor implements weight-only quantization for Intel GPUs (PVC/ARC/MTL/LNL) with [intel-extension-for-pytorch](https://github.com/intel/intel-extension-for-pytorch).
 
-Now 4-bit/8-bit inference with `RtnConfig`, `GPTQConfig`, `AutoRoundConfig` are support on intel GPU device.
+Now 4-bit/8-bit inference with `RtnConfig`, `GPTQConfig`, and `AutoRoundConfig` is supported on Intel GPU devices.
 
-We support experimental woq inference on intel GPU(PVC/ARC/MTL) with replacing Linear op in PyTorch. Validated models: Qwen-7B, Llama-7B, Phi-3.
+We support experimental weight-only quantization (WOQ) inference on Intel GPUs (PVC/ARC/MTL/LNL) by replacing the Linear op in PyTorch. Validated models: meta-llama/Meta-Llama-3-8B, meta-llama/Llama-2-7b-hf, Qwen/Qwen-7B-Chat, microsoft/Phi-3-mini-4k-instruct.
 
 Here are the example codes.
 
@@ -134,32 +135,9 @@ Here are the example codes.
 1. Install Oneapi Package
 The Oneapi DPCPP compiler is required to compile intel-extension-for-pytorch. Please follow [the link](https://www.intel.com/content/www/us/en/developer/articles/guide/installation-guide-for-oneapi-toolkits.html) to install the OneAPI to "/opt/intel folder".
 
-2. Build and Install PyTorch and Intel-extension-for-pytorch
-```python
-python -m pip install torch==2.3.1+cxx11.abi --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
-
-# Build IPEX from Source Code
-git clone https://github.com/intel/intel-extension-for-pytorch.git ipex-gpu
-cd ipex-gpu
-git submodule update --init --recursive
-export USE_AOT_DEVLIST='pvc,ats-m150' # Comment this line if you are compiling for MTL
-export BUILD_WITH_CPU=OFF
-export LD_LIBRARY_PATH=${CONDA_PREFIX}/lib/:$LD_LIBRARY_PATH
-export OCL_ICD_VENDORS=/etc/OpenCL/vendors
-export CCL_ROOT=${CONDA_PREFIX}
-source /opt/intel/oneapi/setvars.sh --force
-export LLM_ACC_TEST=1
-pip install -r requirements.txt
-
-python setup.py install
-```
-
-3. Install Neural-compressor
-```pythpon
-pip install neural-compressor
-```
+2. Build and Install PyTorch and intel-extension-for-pytorch. Please follow [the link](https://intel.github.io/intel-extension-for-pytorch/index.html#installation).
 
-4. Quantization Model and Inference
+3. Quantization Model and Inference
 ```python
 import intel_extension_for_pytorch as ipex
 from neural_compressor.transformers import AutoModelForCausalLM
@@ -163,7 +163,7 @@ print(tokenizer.batch_decode(output, skip_special_tokens=True))
 
 > Note: If your device memory is not enough, please quantize and save the model first, then rerun the example with loading the model as below, If your device memory is enough, skip below instruction, just quantization and inference.
 
-5. Saving and Loading quantized model
+4. Saving and Loading quantized model
  * First step: Quantize and save model
 ```python
 from neural_compressor.transformers import AutoModelForCausalLM, RtnConfig
@@ -201,7 +200,7 @@ print(gen_text)
 ```
 
-6. You can directly use [example script](https://github.com/intel/neural-compressor/blob/master/examples/3.x_api/pytorch/nlp/huggingface_models/language-modeling/quantization/transformers/weight_only/text-generation/run_generation_gpu_woq.py)
+5. You can directly use [example script](https://github.com/intel/neural-compressor/blob/master/examples/3.x_api/pytorch/nlp/huggingface_models/language-modeling/quantization/transformers/weight_only/text-generation/run_generation_gpu_woq.py)
 ```python
 python run_generation_gpu_woq.py --woq --benchmark --model save_dir
 ```

From 5ba9a029006e4bbd50c331f44fac65f843ede8e2 Mon Sep 17 00:00:00 2001
From: Kaihui-intel
Date: Mon, 30 Sep 2024 14:43:11 +0800
Subject: [PATCH 8/8] update code type

Signed-off-by: Kaihui-intel
---
 docs/source/3x/transformers_like_api.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/3x/transformers_like_api.md b/docs/source/3x/transformers_like_api.md
index ec6670e94c8..9aafeed5278 100644
--- a/docs/source/3x/transformers_like_api.md
+++ b/docs/source/3x/transformers_like_api.md
@@ -201,7 +201,7 @@ print(gen_text)
 ```
 
 5. You can directly use [example script](https://github.com/intel/neural-compressor/blob/master/examples/3.x_api/pytorch/nlp/huggingface_models/language-modeling/quantization/transformers/weight_only/text-generation/run_generation_gpu_woq.py)
-```python
+```bash
 python run_generation_gpu_woq.py --woq --benchmark --model save_dir
 ```