From e9583de6f2bf08f7523ceecc138e4d894bb7dc5f Mon Sep 17 00:00:00 2001 From: Kaihui-intel Date: Mon, 30 Sep 2024 11:26:07 +0800 Subject: [PATCH 1/8] add transformers-like api doc Signed-off-by: Kaihui-intel --- docs/source/3x/transformers_like_api.md | 226 ++++++++++++++++++++++++ 1 file changed, 226 insertions(+) create mode 100644 docs/source/3x/transformers_like_api.md diff --git a/docs/source/3x/transformers_like_api.md b/docs/source/3x/transformers_like_api.md new file mode 100644 index 00000000000..fd45401be93 --- /dev/null +++ b/docs/source/3x/transformers_like_api.md @@ -0,0 +1,226 @@ +Weight Only Quantization (WOQ) +===== + +1. [Introduction](#introduction) + +2. [Supported Algorithms](#supported-algorithms) + +3. [Usage For Intel CPU](#Usage-for-cpu-and-cuda) + +4. [Usage For Intel GPU](#Usage-for-intel-gpu) + +## Introduction + +Transformers-like API provides seamless user experience of model compressions on Transformer-based models by extending [Hugging Face transformers](https://github.com/huggingface/transformers) APIs and leveraging [Intel® Neural Compressor](https://github.com/intel/neural-compressor). +## Supported Algorithms + +| Support Device | Rtn | Awq | Teq | GPTQ | AutoRound | +|:--------------:|:----------:|:----------:|:----------:|:----:|:----:| +| Intel CPU | ✔ | ✔ | ✔ | ✔ | ✔ | +| Intel GPU | ✔ | stay tuned | stay tuned | ✔ | ✔ | + +> Please refer to [weight-only quant document](./PT_WeightOnlyQuant.md) for more details. + + +## Usage For CPU + +Our motivation is to improve CPU support for weight only quantization. We have extended the `from_pretrained` function so that `quantization_config` can accept [`RtnConfig`](https://github.com/intel/neural-compressor/blob/master/neural_compressor/transformers/utils/quantization_config.py#L243), [`AwqConfig`](https://github.com/intel/neural-compressor/blob/72398b69334d90cdd7664ac12a025cd36695b55c/neural_compressor/transformers/utils/quantization_config.py#L394), [`TeqConfig`](https://github.com/intel/neural-compressor/blob/72398b69334d90cdd7664ac12a025cd36695b55c/neural_compressor/transformers/utils/quantization_config.py#L464), [`GPTQConfig`](https://github.com/intel/neural-compressor/blob/72398b69334d90cdd7664ac12a025cd36695b55c/neural_compressor/transformers/utils/quantization_config.py#L298), [`AutoroundConfig`](https://github.com/intel/neural-compressor/blob/72398b69334d90cdd7664ac12a025cd36695b55c/neural_compressor/transformers/utils/quantization_config.py#L527) to implement conversion on the CPU. + +### Usage examples for CPU device +quantization and inference with `RtnConfig`, `AwqConfig`, `TeqConfig`, `GPTQConfig`, `AutoRoundConfig` on CPU device. 
+```python +# RTN +from neural_compressor.transformers import AutoModelForCausalLM, RtnConfig +model_name_or_path = "MODEL_NAME_OR_PATH" +woq_config = RtnConfig(bits=4) +q_model = AutoModelForCausalLM.from_pretrained( + model_name_or_path, quantization_config=woq_config, + ) + +# AWQ +from neural_compressor.transformers import AutoModelForCausalLM, AwqConfig +model_name_or_path = "MODEL_NAME_OR_PATH" +woq_config = AwqConfig(bits=4) +q_model = AutoModelForCausalLM.from_pretrained( + model_name_or_path, + quantization_config=woq_config, + ) + +# TEQ +from transformers import AutoTokenizer +from neural_compressor.transformers import AutoModelForCausalLM, TeqConfig + +model_name_or_path = "MODEL_NAME_OR_PATH" +tokenizer = AutoTokenizer.from_pretrained(model_name_or_path) +woq_config = TeqConfig(bits=4, tokenizer=tokenizer) +q_model = AutoModelForCausalLM.from_pretrained( + model_name_or_path, + quantization_config=woq_config + ) + +# GPTQ +from transformers import AutoTokenizer +from neural_compressor.transformers import AutoModelForCausalLM, GPTQConfig + +model_name_or_path = "MODEL_NAME_OR_PATH" +tokenizer = AutoTokenizer.from_pretrained(model_name_or_path) +woq_config = GPTQConfig(bits=4, tokenizer=tokenizer) +woq_model = AutoModelForCausalLM.from_pretrained( + model_name_or_path, + quantization_config=woq_config + ) + +# AutoRound +from transformers import AutoTokenizer +from neural_compressor.transformers import AutoModelForCausalLM, AutoRoundConfig + +model_name_or_path = "MODEL_NAME_OR_PATH" +tokenizer = AutoTokenizer.from_pretrained(model_name_or_path) +woq_config = AutoRoundConfig(bits=4, tokenizer=tokenizer) +woq_model = AutoModelForCausalLM.from_pretrained( + model_name_or_path, + quantization_config=woq_config + ) + +# inference +from transformers import AutoTokenizer +prompt = "Once upon a time, a little girl" +tokenizer = AutoTokenizer.from_pretrained(model_name_or_path) + +input_ids = tokenizer(prompt, return_tensors="pt")["input_ids"] +generate_kwargs = dict(do_sample=False, temperature=0.9, num_beams=4) +gen_ids = q_model.generate(input_ids, **generate_kwargs) +gen_text = tokenizer.batch_decode(gen_ids, skip_special_tokens=True) +print(gen_text) +``` + +You can also save and load your quantized low bit model by the below code. + +```python +# quant +from neural_compressor.transformers import AutoModelForCausalLM, RtnConfig +model_name_or_path = "MODEL_NAME_OR_PATH" +woq_config = RtnConfig(bits=4) +q_model = AutoModelForCausalLM.from_pretrained( + model_name_or_path, quantization_config=woq_config, + ) + +# save quant model +saved_dir = "SAVE_DIR" +q_model.save_pretrained(saved_dir) + +# load quant model +loaded_model = AutoModelForCausalLM.from_pretrained(saved_dir) +``` + +## Usage For Intel GPU +Intel® Neural Compressor implement weight-only quantization for intel GPU(PVC/ARC/MTL) with [Intel-extension-for-pytorch](https://github.com/intel/intel-extension-for-pytorch). + +Now 4-bit/8-bit inference with `RtnConfig`, `GPTQConfig`, `AutoRoundConfig` are support on intel GPU device. + +We support experimental woq inference on intel GPU(PVC/ARC/MTL) with replacing Linear op in PyTorch. Validated models: Qwen-7B, Llama-7B, Phi-3. + +Here are the example codes. + +#### Prepare Dependency Packages +1. Install Oneapi Package +The Oneapi DPCPP compiler is required to compile intel-extension-for-pytorch. Please follow [the link](https://www.intel.com/content/www/us/en/developer/articles/guide/installation-guide-for-oneapi-toolkits.html) to install the OneAPI to "/opt/intel folder". 
+ +2. Build and Install PyTorch and Intel-extension-for-pytorch +```python +python -m pip install torch==2.3.1+cxx11.abi --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/ + +# Build IPEX from Source Code +git clone https://github.com/intel/intel-extension-for-pytorch.git ipex-gpu +cd ipex-gpu +git submodule update --init --recursive +export USE_AOT_DEVLIST='pvc,ats-m150' # Comment this line if you are compiling for MTL +export BUILD_WITH_CPU=OFF +export LD_LIBRARY_PATH=${CONDA_PREFIX}/lib/:$LD_LIBRARY_PATH +export OCL_ICD_VENDORS=/etc/OpenCL/vendors +export CCL_ROOT=${CONDA_PREFIX} +source /opt/intel/oneapi/setvars.sh --force +export LLM_ACC_TEST=1 +pip install -r requirements.txt + +python setup.py install +``` + +3. Install Neural-compressor +```pythpon +pip install neural-compressor +``` + +4. Quantization Model and Inference +```python +import intel_extension_for_pytorch as ipex +from neural_compressor.transformers import AutoModelForCausalLM +from transformers import AutoTokenizer +import torch + +model_name_or_path = "Qwen/Qwen-7B-Chat" # MODEL_NAME_OR_PATH +prompt = "Once upon a time, a little girl" +input_ids = tokenizer(prompt, return_tensors="pt")["input_ids"] +tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=True) + +q_model = AutoModelForCausalLM.from_pretrained(model_name_or_path, device_map="xpu", trust_remote_code=True) + +# optimize the model with ipex, it will improve performance. +quantization_config = q_model.quantization_config if hasattr (user_model, "quantization_config") else None +q_model = ipex.optimize_transformers(q_model, inplace=True, dtype=torch.float16, quantization_config=quantizaiton_config, device="xpu") + +output = q_model.generate(input_ids, max_new_tokens=100, do_sample=True) +print(tokenizer.batch_decode(output, skip_special_tokens=True)) +``` + +> Note: If your device memory is not enough, please quantize and save the model first, then rerun the example with loading the model as below, If your device memory is enough, skip below instruction, just quantization and inference. + +5. Saving and Loading quantized model + * First step: Quantize and save model +```python + +from neural_compressor.transformers import AutoModelForCausalLM, RtnConfig +model_name_or_path = "MODEL_NAME_OR_PATH" +woq_config = RtnConfig(bits=4) +q_model = AutoModelForCausalLM.from_pretrained( + model_name_or_path, quantization_config=woq_config, + device_map="xpu", + trust_remote_code=True, + ) + +# Please note, saving model should be executed before ipex.optimize_transformers function is called. +q_model.save_pretrained("saved_dir") +``` + * Second step: Load model and inference(In order to reduce memory usage, you may need to end the quantize process and rerun the script to load the model.) +```python +# Load model +loaded_model = AutoModelForCausalLM.from_pretrained("saved_dir", trust_remote_code=True) + +# Before executed the loaded model, you can call ipex.optimize_transformers function. 
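+# The quantized model carries a quantization_config attribute (checked via hasattr below);
+# passing it to ipex.optimize_transformers lets IPEX apply its weight-only-quantization
+# optimizations for the xpu device (see the optimize_transformers note later in this doc).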
+quantization_config = q_model.quantization_config if hasattr (user_model, "quantization_config") else None +loaded_model = ipex.optimize_transformers(loaded_model, inplace=True, dtype=torch.float16, quantization_config=quantization_config, device="xpu") + +# inference +from transformers import AutoTokenizer +prompt = "Once upon a time, a little girl" +tokenizer = AutoTokenizer.from_pretrained(model_name_or_path) +input_ids = tokenizer(prompt, return_tensors="pt")["input_ids"] +generate_kwargs = dict(do_sample=False, temperature=0.9, num_beams=4) +gen_ids = q_model.generate(input_ids, **generate_kwargs) +gen_text = tokenizer.batch_decode(gen_ids, skip_special_tokens=True) +print(gen_text) + +``` + +6. You can directly use [example script](https://github.com/intel/neural-compressor/blob/master/examples/3.x_api/pytorch/nlp/huggingface_models/language-modeling/quantization/transformers/weight_only/text-generation/run_generation_gpu_woq.py) +```python +python run_generation_gpu_woq.py --woq --benchmark --model save_dir +``` + +>Note: +> * Saving quantized model should be executed before the optimize_transformers function is called. +> * The optimize_transformers function is designed to optimize transformer-based models within frontend Python modules, with a particular focus on Large Language Models (LLMs). It provides optimizations for both model-wise and content-generation-wise. The detail of `optimize_transformers`, please refer to [the link](https://github.com/intel/intel-extension-for-pytorch/blob/xpu-main/docs/tutorials/llm/llm_optimize_transformers.md). + +## Examples + +Users can also refer to [examples](https://github.com/intel/neural-compressor/blob/master/examples/3.x_api/pytorch/nlp/huggingface_models/language-modeling/quantization/transformers/weight_only/text-generation) on how to quantize a model with transformers-like api. \ No newline at end of file From 34bab8523a370ff3c6fbed56c9a0be8b79a8aab5 Mon Sep 17 00:00:00 2001 From: Kaihui-intel Date: Mon, 30 Sep 2024 11:29:14 +0800 Subject: [PATCH 2/8] update toc Signed-off-by: Kaihui-intel --- docs/source/3x/transformers_like_api.md | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/docs/source/3x/transformers_like_api.md b/docs/source/3x/transformers_like_api.md index fd45401be93..6340743345a 100644 --- a/docs/source/3x/transformers_like_api.md +++ b/docs/source/3x/transformers_like_api.md @@ -5,9 +5,11 @@ Weight Only Quantization (WOQ) 2. [Supported Algorithms](#supported-algorithms) -3. [Usage For Intel CPU](#Usage-for-cpu-and-cuda) +3. [Usage For Intel CPU](#usage-for-cpu) -4. [Usage For Intel GPU](#Usage-for-intel-gpu) +4. [Usage For Intel GPU](#usage-for-intel-gpu) + +5. [Examples](#examples) ## Introduction From 6304d17bf948f18e23ab8a4aad601bccf39f809f Mon Sep 17 00:00:00 2001 From: Kaihui-intel Date: Mon, 30 Sep 2024 12:26:28 +0800 Subject: [PATCH 3/8] update title Signed-off-by: Kaihui-intel --- docs/source/3x/transformers_like_api.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/source/3x/transformers_like_api.md b/docs/source/3x/transformers_like_api.md index 6340743345a..291ec763170 100644 --- a/docs/source/3x/transformers_like_api.md +++ b/docs/source/3x/transformers_like_api.md @@ -1,4 +1,4 @@ -Weight Only Quantization (WOQ) +Transformers-like API ===== 1. 
[Introduction](#introduction) @@ -13,7 +13,7 @@ Weight Only Quantization (WOQ) ## Introduction -Transformers-like API provides seamless user experience of model compressions on Transformer-based models by extending [Hugging Face transformers](https://github.com/huggingface/transformers) APIs and leveraging [Intel® Neural Compressor](https://github.com/intel/neural-compressor). +Transformers-like API provides a seamless user experience of model compressions on Transformer-based models by extending [Hugging Face transformers](https://github.com/huggingface/transformers) APIs, leveraging [Intel® Neural Compressor](https://github.com/intel/neural-compressor), and replacing Linear operator with [Intel® Extension for PyTorch](https://github.com/intel/intel-extension-for-pytorch). ## Supported Algorithms | Support Device | Rtn | Awq | Teq | GPTQ | AutoRound | From 51407fd9f813102309b8266d8dbeb3ee1eeae596 Mon Sep 17 00:00:00 2001 From: Kaihui-intel Date: Mon, 30 Sep 2024 12:34:02 +0800 Subject: [PATCH 4/8] minor fix Signed-off-by: Kaihui-intel --- docs/source/3x/transformers_like_api.md | 14 ++++++++------ 1 file changed, 8 insertions(+), 6 deletions(-) diff --git a/docs/source/3x/transformers_like_api.md b/docs/source/3x/transformers_like_api.md index 291ec763170..e0623d444d9 100644 --- a/docs/source/3x/transformers_like_api.md +++ b/docs/source/3x/transformers_like_api.md @@ -21,7 +21,7 @@ Transformers-like API provides a seamless user experience of model compressions | Intel CPU | ✔ | ✔ | ✔ | ✔ | ✔ | | Intel GPU | ✔ | stay tuned | stay tuned | ✔ | ✔ | -> Please refer to [weight-only quant document](./PT_WeightOnlyQuant.md) for more details. +> Please refer to [weight-only quantization document](./PT_WeightOnlyQuant.md) for more details. ## Usage For CPU @@ -36,7 +36,8 @@ from neural_compressor.transformers import AutoModelForCausalLM, RtnConfig model_name_or_path = "MODEL_NAME_OR_PATH" woq_config = RtnConfig(bits=4) q_model = AutoModelForCausalLM.from_pretrained( - model_name_or_path, quantization_config=woq_config, + model_name_or_path, + quantization_config=woq_config, ) # AWQ @@ -57,7 +58,7 @@ tokenizer = AutoTokenizer.from_pretrained(model_name_or_path) woq_config = TeqConfig(bits=4, tokenizer=tokenizer) q_model = AutoModelForCausalLM.from_pretrained( model_name_or_path, - quantization_config=woq_config + quantization_config=woq_config, ) # GPTQ @@ -69,7 +70,7 @@ tokenizer = AutoTokenizer.from_pretrained(model_name_or_path) woq_config = GPTQConfig(bits=4, tokenizer=tokenizer) woq_model = AutoModelForCausalLM.from_pretrained( model_name_or_path, - quantization_config=woq_config + quantization_config=woq_config, ) # AutoRound @@ -81,7 +82,7 @@ tokenizer = AutoTokenizer.from_pretrained(model_name_or_path) woq_config = AutoRoundConfig(bits=4, tokenizer=tokenizer) woq_model = AutoModelForCausalLM.from_pretrained( model_name_or_path, - quantization_config=woq_config + quantization_config=woq_config, ) # inference @@ -104,7 +105,8 @@ from neural_compressor.transformers import AutoModelForCausalLM, RtnConfig model_name_or_path = "MODEL_NAME_OR_PATH" woq_config = RtnConfig(bits=4) q_model = AutoModelForCausalLM.from_pretrained( - model_name_or_path, quantization_config=woq_config, + model_name_or_path, + quantization_config=woq_config, ) # save quant model From c2fb59c5efc03b88ff6f6b950e75250b007471b3 Mon Sep 17 00:00:00 2001 From: "pre-commit-ci[bot]" <66853113+pre-commit-ci[bot]@users.noreply.github.com> Date: Mon, 30 Sep 2024 05:23:12 +0000 Subject: [PATCH 5/8] [pre-commit.ci] auto fixes from 
pre-commit.com hooks for more information, see https://pre-commit.ci --- docs/source/3x/transformers_like_api.md | 44 ++++++++++++++----------- 1 file changed, 25 insertions(+), 19 deletions(-) diff --git a/docs/source/3x/transformers_like_api.md b/docs/source/3x/transformers_like_api.md index e0623d444d9..e1ba8c8b290 100644 --- a/docs/source/3x/transformers_like_api.md +++ b/docs/source/3x/transformers_like_api.md @@ -33,21 +33,23 @@ quantization and inference with `RtnConfig`, `AwqConfig`, `TeqConfig`, `GPTQConf ```python # RTN from neural_compressor.transformers import AutoModelForCausalLM, RtnConfig + model_name_or_path = "MODEL_NAME_OR_PATH" woq_config = RtnConfig(bits=4) q_model = AutoModelForCausalLM.from_pretrained( model_name_or_path, quantization_config=woq_config, - ) +) # AWQ from neural_compressor.transformers import AutoModelForCausalLM, AwqConfig + model_name_or_path = "MODEL_NAME_OR_PATH" woq_config = AwqConfig(bits=4) q_model = AutoModelForCausalLM.from_pretrained( model_name_or_path, quantization_config=woq_config, - ) +) # TEQ from transformers import AutoTokenizer @@ -59,7 +61,7 @@ woq_config = TeqConfig(bits=4, tokenizer=tokenizer) q_model = AutoModelForCausalLM.from_pretrained( model_name_or_path, quantization_config=woq_config, - ) +) # GPTQ from transformers import AutoTokenizer @@ -71,7 +73,7 @@ woq_config = GPTQConfig(bits=4, tokenizer=tokenizer) woq_model = AutoModelForCausalLM.from_pretrained( model_name_or_path, quantization_config=woq_config, - ) +) # AutoRound from transformers import AutoTokenizer @@ -83,10 +85,11 @@ woq_config = AutoRoundConfig(bits=4, tokenizer=tokenizer) woq_model = AutoModelForCausalLM.from_pretrained( model_name_or_path, quantization_config=woq_config, - ) +) # inference from transformers import AutoTokenizer + prompt = "Once upon a time, a little girl" tokenizer = AutoTokenizer.from_pretrained(model_name_or_path) @@ -102,12 +105,13 @@ You can also save and load your quantized low bit model by the below code. ```python # quant from neural_compressor.transformers import AutoModelForCausalLM, RtnConfig + model_name_or_path = "MODEL_NAME_OR_PATH" woq_config = RtnConfig(bits=4) q_model = AutoModelForCausalLM.from_pretrained( model_name_or_path, quantization_config=woq_config, - ) +) # save quant model saved_dir = "SAVE_DIR" @@ -162,7 +166,7 @@ from neural_compressor.transformers import AutoModelForCausalLM from transformers import AutoTokenizer import torch -model_name_or_path = "Qwen/Qwen-7B-Chat" # MODEL_NAME_OR_PATH +model_name_or_path = "Qwen/Qwen-7B-Chat" # MODEL_NAME_OR_PATH prompt = "Once upon a time, a little girl" input_ids = tokenizer(prompt, return_tensors="pt")["input_ids"] tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=True) @@ -170,8 +174,10 @@ tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code= q_model = AutoModelForCausalLM.from_pretrained(model_name_or_path, device_map="xpu", trust_remote_code=True) # optimize the model with ipex, it will improve performance. 
-quantization_config = q_model.quantization_config if hasattr (user_model, "quantization_config") else None -q_model = ipex.optimize_transformers(q_model, inplace=True, dtype=torch.float16, quantization_config=quantizaiton_config, device="xpu") +quantization_config = q_model.quantization_config if hasattr(user_model, "quantization_config") else None +q_model = ipex.optimize_transformers( + q_model, inplace=True, dtype=torch.float16, quantization_config=quantizaiton_config, device="xpu" +) output = q_model.generate(input_ids, max_new_tokens=100, do_sample=True) print(tokenizer.batch_decode(output, skip_special_tokens=True)) @@ -182,17 +188,15 @@ print(tokenizer.batch_decode(output, skip_special_tokens=True)) 5. Saving and Loading quantized model * First step: Quantize and save model ```python - from neural_compressor.transformers import AutoModelForCausalLM, RtnConfig + model_name_or_path = "MODEL_NAME_OR_PATH" woq_config = RtnConfig(bits=4) q_model = AutoModelForCausalLM.from_pretrained( - model_name_or_path, quantization_config=woq_config, - device_map="xpu", - trust_remote_code=True, - ) + model_name_or_path, quantization_config=woq_config, device_map="xpu", trust_remote_code=True, +) -# Please note, saving model should be executed before ipex.optimize_transformers function is called. +# Please note, saving model should be executed before ipex.optimize_transformers function is called. q_model.save_pretrained("saved_dir") ``` * Second step: Load model and inference(In order to reduce memory usage, you may need to end the quantize process and rerun the script to load the model.) @@ -201,11 +205,14 @@ q_model.save_pretrained("saved_dir") loaded_model = AutoModelForCausalLM.from_pretrained("saved_dir", trust_remote_code=True) # Before executed the loaded model, you can call ipex.optimize_transformers function. -quantization_config = q_model.quantization_config if hasattr (user_model, "quantization_config") else None -loaded_model = ipex.optimize_transformers(loaded_model, inplace=True, dtype=torch.float16, quantization_config=quantization_config, device="xpu") +quantization_config = q_model.quantization_config if hasattr(user_model, "quantization_config") else None +loaded_model = ipex.optimize_transformers( + loaded_model, inplace=True, dtype=torch.float16, quantization_config=quantization_config, device="xpu" +) # inference from transformers import AutoTokenizer + prompt = "Once upon a time, a little girl" tokenizer = AutoTokenizer.from_pretrained(model_name_or_path) input_ids = tokenizer(prompt, return_tensors="pt")["input_ids"] @@ -213,7 +220,6 @@ generate_kwargs = dict(do_sample=False, temperature=0.9, num_beams=4) gen_ids = q_model.generate(input_ids, **generate_kwargs) gen_text = tokenizer.batch_decode(gen_ids, skip_special_tokens=True) print(gen_text) - ``` 6. You can directly use [example script](https://github.com/intel/neural-compressor/blob/master/examples/3.x_api/pytorch/nlp/huggingface_models/language-modeling/quantization/transformers/weight_only/text-generation/run_generation_gpu_woq.py) @@ -227,4 +233,4 @@ python run_generation_gpu_woq.py --woq --benchmark --model save_dir ## Examples -Users can also refer to [examples](https://github.com/intel/neural-compressor/blob/master/examples/3.x_api/pytorch/nlp/huggingface_models/language-modeling/quantization/transformers/weight_only/text-generation) on how to quantize a model with transformers-like api. 
\ No newline at end of file +Users can also refer to [examples](https://github.com/intel/neural-compressor/blob/master/examples/3.x_api/pytorch/nlp/huggingface_models/language-modeling/quantization/transformers/weight_only/text-generation) on how to quantize a model with transformers-like api. From 8f63ecca8ca2e6c13eeb12cdd95e27caa2e34c75 Mon Sep 17 00:00:00 2001 From: Kaihui-intel Date: Mon, 30 Sep 2024 12:38:18 +0800 Subject: [PATCH 6/8] update code Signed-off-by: Kaihui-intel --- docs/source/3x/transformers_like_api.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/source/3x/transformers_like_api.md b/docs/source/3x/transformers_like_api.md index e1ba8c8b290..88424000dc6 100644 --- a/docs/source/3x/transformers_like_api.md +++ b/docs/source/3x/transformers_like_api.md @@ -174,7 +174,7 @@ tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code= q_model = AutoModelForCausalLM.from_pretrained(model_name_or_path, device_map="xpu", trust_remote_code=True) # optimize the model with ipex, it will improve performance. -quantization_config = q_model.quantization_config if hasattr(user_model, "quantization_config") else None +quantization_config = q_model.quantization_config if hasattr(q_model, "quantization_config") else None q_model = ipex.optimize_transformers( q_model, inplace=True, dtype=torch.float16, quantization_config=quantizaiton_config, device="xpu" ) @@ -205,7 +205,7 @@ q_model.save_pretrained("saved_dir") loaded_model = AutoModelForCausalLM.from_pretrained("saved_dir", trust_remote_code=True) # Before executed the loaded model, you can call ipex.optimize_transformers function. -quantization_config = q_model.quantization_config if hasattr(user_model, "quantization_config") else None +quantization_config = q_model.quantization_config if hasattr(q_model, "quantization_config") else None loaded_model = ipex.optimize_transformers( loaded_model, inplace=True, dtype=torch.float16, quantization_config=quantization_config, device="xpu" ) @@ -217,7 +217,7 @@ prompt = "Once upon a time, a little girl" tokenizer = AutoTokenizer.from_pretrained(model_name_or_path) input_ids = tokenizer(prompt, return_tensors="pt")["input_ids"] generate_kwargs = dict(do_sample=False, temperature=0.9, num_beams=4) -gen_ids = q_model.generate(input_ids, **generate_kwargs) +gen_ids = loaded_model.generate(input_ids, **generate_kwargs) gen_text = tokenizer.batch_decode(gen_ids, skip_special_tokens=True) print(gen_text) ``` From ec9143f89cba7845a7ca98bab09a1f4c447a54c2 Mon Sep 17 00:00:00 2001 From: Kaihui-intel Date: Mon, 30 Sep 2024 14:41:47 +0800 Subject: [PATCH 7/8] update docs for comments Signed-off-by: Kaihui-intel --- docs/source/3x/transformers_like_api.md | 44 +++++++------------------ 1 file changed, 11 insertions(+), 33 deletions(-) diff --git a/docs/source/3x/transformers_like_api.md b/docs/source/3x/transformers_like_api.md index 88424000dc6..ec6670e94c8 100644 --- a/docs/source/3x/transformers_like_api.md +++ b/docs/source/3x/transformers_like_api.md @@ -13,10 +13,11 @@ Transformers-like API ## Introduction -Transformers-like API provides a seamless user experience of model compressions on Transformer-based models by extending [Hugging Face transformers](https://github.com/huggingface/transformers) APIs, leveraging [Intel® Neural Compressor](https://github.com/intel/neural-compressor), and replacing Linear operator with [Intel® Extension for PyTorch](https://github.com/intel/intel-extension-for-pytorch). 
+Transformers-like API provides a seamless user experience of model compression on Transformer-based models by extending Hugging Face transformers APIs, leveraging Intel® Neural Compressor's existing weight-only quantization capability, and replacing the Linear operator with Intel® Extension for PyTorch.
+
 ## Supported Algorithms
 
-| Support Device | Rtn | Awq | Teq | GPTQ | AutoRound |
+| Support Device | RTN | AWQ | TEQ | GPTQ | AutoRound |
 |:--------------:|:----------:|:----------:|:----------:|:----:|:----:|
 | Intel CPU | ✔ | ✔ | ✔ | ✔ | ✔ |
 | Intel GPU | ✔ | stay tuned | stay tuned | ✔ | ✔ |
 
-> Please refer to [weight-only quant document](./PT_WeightOnlyQuant.md) for more details.
+> Please refer to the [weight-only quantization document](./PT_WeightOnlyQuant.md) for more details.
 
 
 ## Usage For CPU
 
-Our motivation is to improve CPU support for weight only quantization. We have extended the `from_pretrained` function so that `quantization_config` can accept [`RtnConfig`](https://github.com/intel/neural-compressor/blob/master/neural_compressor/transformers/utils/quantization_config.py#L243), [`AwqConfig`](https://github.com/intel/neural-compressor/blob/72398b69334d90cdd7664ac12a025cd36695b55c/neural_compressor/transformers/utils/quantization_config.py#L394), [`TeqConfig`](https://github.com/intel/neural-compressor/blob/72398b69334d90cdd7664ac12a025cd36695b55c/neural_compressor/transformers/utils/quantization_config.py#L464), [`GPTQConfig`](https://github.com/intel/neural-compressor/blob/72398b69334d90cdd7664ac12a025cd36695b55c/neural_compressor/transformers/utils/quantization_config.py#L298), [`AutoroundConfig`](https://github.com/intel/neural-compressor/blob/72398b69334d90cdd7664ac12a025cd36695b55c/neural_compressor/transformers/utils/quantization_config.py#L527) to implement conversion on the CPU.
+Our motivation is to improve CPU support for weight-only quantization. We have extended the `from_pretrained` function so that `quantization_config` can accept [`RtnConfig`](https://github.com/intel/neural-compressor/blob/master/neural_compressor/transformers/utils/quantization_config.py#L243), [`AwqConfig`](https://github.com/intel/neural-compressor/blob/72398b69334d90cdd7664ac12a025cd36695b55c/neural_compressor/transformers/utils/quantization_config.py#L394), [`TeqConfig`](https://github.com/intel/neural-compressor/blob/72398b69334d90cdd7664ac12a025cd36695b55c/neural_compressor/transformers/utils/quantization_config.py#L464), [`GPTQConfig`](https://github.com/intel/neural-compressor/blob/72398b69334d90cdd7664ac12a025cd36695b55c/neural_compressor/transformers/utils/quantization_config.py#L298), [`AutoroundConfig`](https://github.com/intel/neural-compressor/blob/72398b69334d90cdd7664ac12a025cd36695b55c/neural_compressor/transformers/utils/quantization_config.py#L527) to implement the conversion on the CPU.
 
 ### Usage examples for CPU device
 quantization and inference with `RtnConfig`, `AwqConfig`, `TeqConfig`, `GPTQConfig`, `AutoRoundConfig` on CPU device.
@@ -122,11 +123,11 @@ loaded_model = AutoModelForCausalLM.from_pretrained(saved_dir)
 ```
 
 ## Usage For Intel GPU
-Intel® Neural Compressor implement weight-only quantization for intel GPU(PVC/ARC/MTL) with [Intel-extension-for-pytorch](https://github.com/intel/intel-extension-for-pytorch).
+Intel® Neural Compressor implements weight-only quantization for Intel GPUs (PVC/ARC/MTL/LNL) with [intel-extension-for-pytorch](https://github.com/intel/intel-extension-for-pytorch).
 
-Now 4-bit/8-bit inference with `RtnConfig`, `GPTQConfig`, `AutoRoundConfig` are support on intel GPU device.
+Now 4-bit/8-bit inference with `RtnConfig`, `GPTQConfig`, and `AutoRoundConfig` is supported on Intel GPU devices.
 
-We support experimental woq inference on intel GPU(PVC/ARC/MTL) with replacing Linear op in PyTorch. Validated models: Qwen-7B, Llama-7B, Phi-3.
+We support experimental weight-only quantization (WOQ) inference on Intel GPUs (PVC/ARC/MTL/LNL) by replacing the Linear op in PyTorch. Validated models: meta-llama/Meta-Llama-3-8B, meta-llama/Llama-2-7b-hf, Qwen/Qwen-7B-Chat, microsoft/Phi-3-mini-4k-instruct.
 
 Here are the example codes.
 
@@ -134,32 +135,9 @@ Here are the example codes.
 1. Install Oneapi Package
 The Oneapi DPCPP compiler is required to compile intel-extension-for-pytorch. Please follow [the link](https://www.intel.com/content/www/us/en/developer/articles/guide/installation-guide-for-oneapi-toolkits.html) to install the OneAPI to "/opt/intel folder".
 
-2. Build and Install PyTorch and Intel-extension-for-pytorch
-```python
-python -m pip install torch==2.3.1+cxx11.abi --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
-
-# Build IPEX from Source Code
-git clone https://github.com/intel/intel-extension-for-pytorch.git ipex-gpu
-cd ipex-gpu
-git submodule update --init --recursive
-export USE_AOT_DEVLIST='pvc,ats-m150' # Comment this line if you are compiling for MTL
-export BUILD_WITH_CPU=OFF
-export LD_LIBRARY_PATH=${CONDA_PREFIX}/lib/:$LD_LIBRARY_PATH
-export OCL_ICD_VENDORS=/etc/OpenCL/vendors
-export CCL_ROOT=${CONDA_PREFIX}
-source /opt/intel/oneapi/setvars.sh --force
-export LLM_ACC_TEST=1
-pip install -r requirements.txt
-
-python setup.py install
-```
-
-3. Install Neural-compressor
-```pythpon
-pip install neural-compressor
-```
+2. Build and Install PyTorch and intel-extension-for-pytorch. Please follow [the link](https://intel.github.io/intel-extension-for-pytorch/index.html#installation).
 
-4. Quantization Model and Inference
+3. Quantization Model and Inference
 ```python
 import intel_extension_for_pytorch as ipex
 from neural_compressor.transformers import AutoModelForCausalLM
@@ -163,7 +163,7 @@ print(tokenizer.batch_decode(output, skip_special_tokens=True))
 
 > Note: If your device memory is not enough, please quantize and save the model first, then rerun the example with loading the model as below, If your device memory is enough, skip below instruction, just quantization and inference.
 
-5. Saving and Loading quantized model
+4. Saving and Loading quantized model
  * First step: Quantize and save model
 ```python
 from neural_compressor.transformers import AutoModelForCausalLM, RtnConfig
@@ -201,7 +200,7 @@ print(gen_text)
 ```
 
-6. You can directly use [example script](https://github.com/intel/neural-compressor/blob/master/examples/3.x_api/pytorch/nlp/huggingface_models/language-modeling/quantization/transformers/weight_only/text-generation/run_generation_gpu_woq.py)
+5. You can directly use [example script](https://github.com/intel/neural-compressor/blob/master/examples/3.x_api/pytorch/nlp/huggingface_models/language-modeling/quantization/transformers/weight_only/text-generation/run_generation_gpu_woq.py)
 ```python
 python run_generation_gpu_woq.py --woq --benchmark --model save_dir
 ```

From 5ba9a029006e4bbd50c331f44fac65f843ede8e2 Mon Sep 17 00:00:00 2001
From: Kaihui-intel
Date: Mon, 30 Sep 2024 14:43:11 +0800
Subject: [PATCH 8/8] update code type

Signed-off-by: Kaihui-intel
---
 docs/source/3x/transformers_like_api.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/3x/transformers_like_api.md b/docs/source/3x/transformers_like_api.md
index ec6670e94c8..9aafeed5278 100644
--- a/docs/source/3x/transformers_like_api.md
+++ b/docs/source/3x/transformers_like_api.md
@@ -201,7 +201,7 @@ print(gen_text)
 ```
 
 5. You can directly use [example script](https://github.com/intel/neural-compressor/blob/master/examples/3.x_api/pytorch/nlp/huggingface_models/language-modeling/quantization/transformers/weight_only/text-generation/run_generation_gpu_woq.py)
-```python
+```bash
 python run_generation_gpu_woq.py --woq --benchmark --model save_dir
 ```