Add transformers-like api doc #2018
Merged
+214 −0
Changes from 6 commits
Commits (8):
- e9583de add transformers-like api doc (Kaihui-intel)
- 34bab85 update toc (Kaihui-intel)
- 6304d17 update title (Kaihui-intel)
- 51407fd minor fix (Kaihui-intel)
- c2fb59c [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
- 8f63ecc update code (Kaihui-intel)
- ec9143f update docs for comments (Kaihui-intel)
- 5ba9a02 update code type (Kaihui-intel)
@@ -0,0 +1,236 @@
Transformers-like API
=====

1. [Introduction](#introduction)
2. [Supported Algorithms](#supported-algorithms)
3. [Usage For CPU](#usage-for-cpu)
4. [Usage For Intel GPU](#usage-for-intel-gpu)
5. [Examples](#examples)

## Introduction

Transformers-like API provides a seamless user experience for model compression on Transformer-based models by extending [Hugging Face transformers](https://github.com/huggingface/transformers) APIs, leveraging [Intel® Neural Compressor](https://github.com/intel/neural-compressor), and replacing the Linear operator with [Intel® Extension for PyTorch](https://github.com/intel/intel-extension-for-pytorch).
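
As a minimal sketch of what this extension looks like in practice (the model path below is a placeholder), the only visible difference from a plain `transformers` workflow is the import source and the extra `quantization_config` argument; the sections below show each algorithm in detail.

```python
# Minimal sketch; "MODEL_NAME_OR_PATH" is a placeholder model id or local path.
# The model class is imported from neural_compressor.transformers instead of
# transformers, and quantization_config selects the weight-only algorithm.
from neural_compressor.transformers import AutoModelForCausalLM, RtnConfig

q_model = AutoModelForCausalLM.from_pretrained(
    "MODEL_NAME_OR_PATH",
    quantization_config=RtnConfig(bits=4),
)
```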

## Supported Algorithms

| Supported Device | RTN | AWQ | TEQ | GPTQ | AutoRound |
|:----------------:|:---:|:---:|:---:|:----:|:---------:|
| Intel CPU | ✔ | ✔ | ✔ | ✔ | ✔ |
| Intel GPU | ✔ | stay tuned | stay tuned | ✔ | ✔ |

> Please refer to the [weight-only quantization document](./PT_WeightOnlyQuant.md) for more details.
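
As a quick reference, each algorithm in the table corresponds to a config class exported by `neural_compressor.transformers`; the sketch below only shows that mapping, and the sections that follow demonstrate how each config is used.

```python
# Algorithm-to-config-class mapping used throughout this document.
from neural_compressor.transformers import (
    RtnConfig,        # RTN
    AwqConfig,        # AWQ
    TeqConfig,        # TEQ
    GPTQConfig,       # GPTQ
    AutoRoundConfig,  # AutoRound
)
```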

## Usage For CPU

Our motivation is to improve CPU support for weight-only quantization. We have extended the `from_pretrained` function so that `quantization_config` can accept [`RtnConfig`](https://github.com/intel/neural-compressor/blob/master/neural_compressor/transformers/utils/quantization_config.py#L243), [`AwqConfig`](https://github.com/intel/neural-compressor/blob/72398b69334d90cdd7664ac12a025cd36695b55c/neural_compressor/transformers/utils/quantization_config.py#L394), [`TeqConfig`](https://github.com/intel/neural-compressor/blob/72398b69334d90cdd7664ac12a025cd36695b55c/neural_compressor/transformers/utils/quantization_config.py#L464), [`GPTQConfig`](https://github.com/intel/neural-compressor/blob/72398b69334d90cdd7664ac12a025cd36695b55c/neural_compressor/transformers/utils/quantization_config.py#L298), and [`AutoRoundConfig`](https://github.com/intel/neural-compressor/blob/72398b69334d90cdd7664ac12a025cd36695b55c/neural_compressor/transformers/utils/quantization_config.py#L527) to implement conversion on the CPU.

### Usage examples for CPU device
Quantization and inference with `RtnConfig`, `AwqConfig`, `TeqConfig`, `GPTQConfig`, and `AutoRoundConfig` on the CPU device:
```python
# RTN
from neural_compressor.transformers import AutoModelForCausalLM, RtnConfig

model_name_or_path = "MODEL_NAME_OR_PATH"
woq_config = RtnConfig(bits=4)
q_model = AutoModelForCausalLM.from_pretrained(
    model_name_or_path,
    quantization_config=woq_config,
)

# AWQ
from neural_compressor.transformers import AutoModelForCausalLM, AwqConfig

model_name_or_path = "MODEL_NAME_OR_PATH"
woq_config = AwqConfig(bits=4)
q_model = AutoModelForCausalLM.from_pretrained(
    model_name_or_path,
    quantization_config=woq_config,
)

# TEQ
from transformers import AutoTokenizer
from neural_compressor.transformers import AutoModelForCausalLM, TeqConfig

model_name_or_path = "MODEL_NAME_OR_PATH"
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
woq_config = TeqConfig(bits=4, tokenizer=tokenizer)
q_model = AutoModelForCausalLM.from_pretrained(
    model_name_or_path,
    quantization_config=woq_config,
)

# GPTQ
from transformers import AutoTokenizer
from neural_compressor.transformers import AutoModelForCausalLM, GPTQConfig

model_name_or_path = "MODEL_NAME_OR_PATH"
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
woq_config = GPTQConfig(bits=4, tokenizer=tokenizer)
q_model = AutoModelForCausalLM.from_pretrained(
    model_name_or_path,
    quantization_config=woq_config,
)

# AutoRound
from transformers import AutoTokenizer
from neural_compressor.transformers import AutoModelForCausalLM, AutoRoundConfig

model_name_or_path = "MODEL_NAME_OR_PATH"
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
woq_config = AutoRoundConfig(bits=4, tokenizer=tokenizer)
q_model = AutoModelForCausalLM.from_pretrained(
    model_name_or_path,
    quantization_config=woq_config,
)

# inference with the quantized model produced by any of the examples above
from transformers import AutoTokenizer

prompt = "Once upon a time, a little girl"
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)

input_ids = tokenizer(prompt, return_tensors="pt")["input_ids"]
generate_kwargs = dict(do_sample=False, temperature=0.9, num_beams=4)
gen_ids = q_model.generate(input_ids, **generate_kwargs)
gen_text = tokenizer.batch_decode(gen_ids, skip_special_tokens=True)
print(gen_text)
```

You can also save and load the quantized low-bit model with the code below.

```python
# quantize
from neural_compressor.transformers import AutoModelForCausalLM, RtnConfig

model_name_or_path = "MODEL_NAME_OR_PATH"
woq_config = RtnConfig(bits=4)
q_model = AutoModelForCausalLM.from_pretrained(
    model_name_or_path,
    quantization_config=woq_config,
)

# save the quantized model
saved_dir = "SAVE_DIR"
q_model.save_pretrained(saved_dir)

# load the quantized model
loaded_model = AutoModelForCausalLM.from_pretrained(saved_dir)
```

## Usage For Intel GPU

Intel® Neural Compressor implements weight-only quantization for Intel GPUs (PVC/ARC/MTL) with [Intel® Extension for PyTorch](https://github.com/intel/intel-extension-for-pytorch).

Currently, 4-bit/8-bit inference with `RtnConfig`, `GPTQConfig`, and `AutoRoundConfig` is supported on Intel GPU devices.

We support experimental weight-only quantization (WOQ) inference on Intel GPUs (PVC/ARC/MTL) by replacing the Linear op in PyTorch. Validated models: Qwen-7B, Llama-7B, Phi-3.

Here is the example code.

#### Prepare Dependency Packages
1. Install oneAPI Package
The oneAPI DPC++ compiler is required to compile Intel® Extension for PyTorch. Please follow [the link](https://www.intel.com/content/www/us/en/developer/articles/guide/installation-guide-for-oneapi-toolkits.html) to install oneAPI to the `/opt/intel` folder.

2. Build and Install PyTorch and Intel® Extension for PyTorch
```bash
python -m pip install torch==2.3.1+cxx11.abi --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/

# Build IPEX from source code
git clone https://github.com/intel/intel-extension-for-pytorch.git ipex-gpu
cd ipex-gpu
git submodule update --init --recursive
export USE_AOT_DEVLIST='pvc,ats-m150' # Comment this line if you are compiling for MTL
export BUILD_WITH_CPU=OFF
export LD_LIBRARY_PATH=${CONDA_PREFIX}/lib/:$LD_LIBRARY_PATH
export OCL_ICD_VENDORS=/etc/OpenCL/vendors
export CCL_ROOT=${CONDA_PREFIX}
source /opt/intel/oneapi/setvars.sh --force
export LLM_ACC_TEST=1
pip install -r requirements.txt

python setup.py install
```

3. Install Neural Compressor
```bash
pip install neural-compressor
```

4. Quantize the Model and Run Inference
```python
import torch
import intel_extension_for_pytorch as ipex
from neural_compressor.transformers import AutoModelForCausalLM, RtnConfig
from transformers import AutoTokenizer

model_name_or_path = "Qwen/Qwen-7B-Chat"  # MODEL_NAME_OR_PATH
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=True)
prompt = "Once upon a time, a little girl"
input_ids = tokenizer(prompt, return_tensors="pt")["input_ids"].to("xpu")

# quantize the model with 4-bit weight-only quantization and place it on the XPU device
woq_config = RtnConfig(bits=4)
q_model = AutoModelForCausalLM.from_pretrained(
    model_name_or_path, quantization_config=woq_config, device_map="xpu", trust_remote_code=True
)

# optimize the model with ipex; it will improve performance.
quantization_config = q_model.quantization_config if hasattr(q_model, "quantization_config") else None
q_model = ipex.optimize_transformers(
    q_model, inplace=True, dtype=torch.float16, quantization_config=quantization_config, device="xpu"
)

output = q_model.generate(input_ids, max_new_tokens=100, do_sample=True)
print(tokenizer.batch_decode(output, skip_special_tokens=True))
```

> Note: If device memory is not enough, please quantize and save the model first, then rerun the example and load the saved model as shown below. If device memory is enough, skip the steps below and just quantize and run inference.

5. Saving and Loading the Quantized Model
* First step: Quantize and save the model
```python
from neural_compressor.transformers import AutoModelForCausalLM, RtnConfig

model_name_or_path = "MODEL_NAME_OR_PATH"
woq_config = RtnConfig(bits=4)
q_model = AutoModelForCausalLM.from_pretrained(
    model_name_or_path, quantization_config=woq_config, device_map="xpu", trust_remote_code=True,
)

# Please note that saving the model should be executed before the ipex.optimize_transformers function is called.
q_model.save_pretrained("saved_dir")
```
* Second step: Load the model and run inference (to reduce memory usage, you may need to end the quantization process and rerun the script to load the model)
```python
import torch
import intel_extension_for_pytorch as ipex
from neural_compressor.transformers import AutoModelForCausalLM

# Load the quantized model
loaded_model = AutoModelForCausalLM.from_pretrained("saved_dir", trust_remote_code=True)

# Before running the loaded model, you can call the ipex.optimize_transformers function.
quantization_config = loaded_model.quantization_config if hasattr(loaded_model, "quantization_config") else None
loaded_model = ipex.optimize_transformers(
    loaded_model, inplace=True, dtype=torch.float16, quantization_config=quantization_config, device="xpu"
)

# inference
from transformers import AutoTokenizer

model_name_or_path = "MODEL_NAME_OR_PATH"
prompt = "Once upon a time, a little girl"
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
input_ids = tokenizer(prompt, return_tensors="pt")["input_ids"]
generate_kwargs = dict(do_sample=False, temperature=0.9, num_beams=4)
gen_ids = loaded_model.generate(input_ids, **generate_kwargs)
gen_text = tokenizer.batch_decode(gen_ids, skip_special_tokens=True)
print(gen_text)
```

6. You can also directly use the [example script](https://github.com/intel/neural-compressor/blob/master/examples/3.x_api/pytorch/nlp/huggingface_models/language-modeling/quantization/transformers/weight_only/text-generation/run_generation_gpu_woq.py):
```bash
python run_generation_gpu_woq.py --woq --benchmark --model save_dir
```

> Note:
> * Saving the quantized model should be executed before the `optimize_transformers` function is called.
> * The `optimize_transformers` function is designed to optimize transformer-based models within frontend Python modules, with a particular focus on Large Language Models (LLMs). It provides both model-wise and content-generation-wise optimizations. For details on `optimize_transformers`, please refer to [the link](https://github.com/intel/intel-extension-for-pytorch/blob/xpu-main/docs/tutorials/llm/llm_optimize_transformers.md).

## Examples

Users can also refer to the [examples](https://github.com/intel/neural-compressor/blob/master/examples/3.x_api/pytorch/nlp/huggingface_models/language-modeling/quantization/transformers/weight_only/text-generation) for how to quantize a model with the Transformers-like API.