[pipeline] A simple fix for half-precision & 8bit models #21479
Conversation
|
The documentation is not available anymore as the PR was closed or merged. |
|
Thanks for the well thought out issue and proposed fix. I don't particularly like the fix because it depends on some weird internal and still forces users to use Couldn't we just use Imo using |
|
Thanks for the feedback @Narsil !
Makes sense ! |
I would highly advise against it too. There's a limit to magic. Doing half precision on cpu should crash in a lot of places. We shouldn't upcast on behalf of a user that explicitely asked for half precision imo. That's breaking user intent. Does |
|
I see, makes sense!
I am not sure here, maybe @sgugger & @muellerzr knows better |
if not the pipeline could have the simplest heuristic |
|
What do you think: diff --git a/src/transformers/pipelines/base.py b/src/transformers/pipelines/base.py
index 30402b36e..e698f1aa3 100644
--- a/src/transformers/pipelines/base.py
+++ b/src/transformers/pipelines/base.py
@@ -749,7 +749,7 @@ class Pipeline(_ScikitCompat):
framework: Optional[str] = None,
task: str = "",
args_parser: ArgumentHandler = None,
- device: Union[int, str, "torch.device"] = -1,
+ device: Union[int, str, "torch.device"] = None,
torch_dtype: Optional[Union[str, "torch.dtype"]] = None,
binary_output: bool = False,
**kwargs,
@@ -764,6 +764,20 @@ class Pipeline(_ScikitCompat):
self.image_processor = image_processor
self.modelcard = modelcard
self.framework = framework
+
+ # Special handling
+ if self.framework == "pt" and device is not None:
+ self.model = self.model.to(device=device)
+
+ if device is None:
+ # `accelerate` device map
+ hf_device_map = getattr(self.model, "hf_device_map", None)
+ if hf_device_map is not None:
+ # Take the first device used by `accelerate`.
+ device = next(iter(hf_device_map.values()))
+ else:
+ device = -1
+
if is_torch_available() and self.framework == "pt":
if isinstance(device, torch.device):
self.device = device
@@ -775,13 +789,10 @@ class Pipeline(_ScikitCompat):
self.device = torch.device(f"cuda:{device}")
else:
self.device = device
+
self.torch_dtype = torch_dtype
self.binary_output = binary_output
- # Special handling
- if self.framework == "pt" and self.device.type != "cpu":
- self.model = self.model.to(self.device)
-
# Update config with task specific parameters
task_specific_params = self.model.config.task_specific_params
if task_specific_params is not None and task in task_specific_params:Here we just modify the default device when the model uses |
|
I think that would totally work @Narsil ! Happy to change the PR with your proposed changes, let me know! |
|
Sure. Let's update the doc too. |
|
Side notes:
if set(device_map.values()) == {"cpu"} or set(device_map.values()) == {"cpu", "disk"}:
main_device = "cpu"
else:
main_device = [d for d in device_map.values() if d not in ["cpu", "disk"]][0] |
Narsil
left a comment
There was a problem hiding this comment.
Thanks for the update.
I propose to simplify the examples a lot.
Then, regarding the actual code, I think your proposed change is entangling and confusing device/self.device more than it should. At least more than my original modification.
We could go even further in "cleanliness" and handle all that logic within pipeline function and not in Pipeline.__init__ since at that point the model should be a black box (that includes moving the model on device).
Modifying the model on device things will break calls like TranslationPipeline(model=MyModel(), device=0). I don't think we can realistically break that, but we can probably figure out a non breaking change that does handle device_map better.
| if self.framework == "pt" and device is not None: | ||
| self.model = self.model.to(device=self.device) |
There was a problem hiding this comment.
Don't mix self.device and device . This is super error prone.
The proposed change I made was at least explicit about it's default value.
I really think this needs to be changed. Too many opportunities to introduce bugs later on.
- Set the default value (if no value provided)
- Handle
deviceto createself.device. - Use
self.deviceeverywhere.
There was a problem hiding this comment.
Yeah I somehow didn't fully considered your proposition in #21479 (comment) - I think it's wiser to revert my changes with yours!
| hf_device_map = getattr(self.model, "hf_device_map", None) | ||
| if hf_device_map is not None: |
There was a problem hiding this comment.
There's probably a way to structure code where this is written only once.
I think directly in pipeline function we could warn when both device and device_map are set.
Prevents having to guess here. If you're splitting model loading and pipeline loading, then you should be aware of what you do, but we shouldn't actively depend on internals to seek what's going on.
Essentially, when users use pipeline(model=MyModel()) the model is a black box to us, we shouldn't look at it. We're looking at it in my proposed change only when there's no device being sent.
And to be even purer, we could modify the pipeline itself, to check hf_device_map only when we do from_pretrained. That seems even cleaner since we know that this internal map could exist here (where we can't here if user passes in a real object).
| ```py | ||
| # pip install accelerate | ||
| import torch | ||
| from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline | ||
|
|
||
| model = AutoModelForCausalLM.from_pretrained("bigscience/bloom", torch_dtype=torch.bfloat16, device_map="auto") | ||
| tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom") |
There was a problem hiding this comment.
Maybe use a smaller example and say in a note the user can replace it by BLOOM?
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
|
the failing test is indenpendent to our PR! Merging! |
What does this PR do?
Currently on the
mainbranch oftransformersif a user wants to run apipelineusing large models (thus, ideally loaded withdevice_map=...) and in half precision (or in int8), they may encounter some issues when callingpipelinewithtop_p&top_ksampling:Snippet to reproduce & explanations:
This is because the
input_idsare automatically set oncpusince the argumentdeviceis not passed when initializing thepipeline. A model that is loaded withdevice_map=...(i.e. withaccelerate) always sets the output tensor of the model to thedeviceof the input tensor thanks to the forward hooks. Therefore when calling the top_k method, the output tensor is in fp16 (because the model has been loaded in fp16) & oncpuhence the torch error above.Currently a hack to fix this is to add
device=0when initializing thepipelinebut this leads to inconsistent and undesirable behaviours for some cases, for example when loading large models in several GPUs, since the callmodel.to(self.device)will break some internals (the hooks will be still there but the weights will be set on the wrong devices). A snippet to reproduce below:adding this hack also breaks the usage of
pipelinewith int8 models, since thetomethod is blocked for these models:Thus, I propose to fix this by simply checking whether a model has been loaded with
accelerateby looking at the attributehf_device_map, and set the model on the correct device only if it has not been loaded with accelerate as backend. This fixes 3 bugs: usingpipelinewith a fp16 model that has been loaded withacceleratewithout having any error in case of multi-gpu usage, usingpipelinewith a fp16 model waccelerate& sampling strategies, and usingpipelinewith int8 models & sampling strategies.cc @sgugger @Narsil