37 changes: 36 additions & 1 deletion docs/source/en/pipeline_tutorial.mdx
@@ -105,6 +105,8 @@ If the model is too large for a single GPU, you can set `device_map="auto"` to a
generator(model="openai/whisper-large", device_map="auto")
```

Note that if `device_map="auto"` is passed, you should not also pass the argument `device=device` when instantiating your `pipeline`, as doing so may cause unexpected behavior!

### Batch size

By default, pipelines do not batch inference, for reasons explained in detail [here](https://huggingface.co/docs/transformers/main_classes/pipelines#pipeline-batching). In short, batching is not necessarily faster and can actually be considerably slower in some cases.
@@ -257,4 +259,37 @@ sudo apt install -y tesseract-ocr
pip install pytesseract
```

</Tip>
</Tip>

## Using `pipeline` on large models with 🤗 `accelerate`:

You can easily run `pipeline` on large models using 🤗 `accelerate`! First make sure you have installed `accelerate` with `pip install accelerate`.

Let's assume you fulfill the hardware requirements to run a large model such as `bloom` (which has 176B parameters, so ~350GB in `bfloat16`). First, load your model
using `device_map="auto"`:

```py
# pip install accelerate
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model = AutoModelForCausalLM.from_pretrained("bigscience/bloom", torch_dtype=torch.bfloat16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom")

Collaborator

Maybe use a smaller example and say in a note the user can replace it by BLOOM?

Contributor Author

Sure!


pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
output = pipe("This is a cool example!", do_sample=True, top_p=0.95)
```

You can also pass 8-bit loaded models if you install `bitsandbytes` and add the argument `load_in_8bit=True`:

```py
# pip install accelerate bitsandbytes
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model = AutoModelForCausalLM.from_pretrained("bigscience/bloom", device_map="auto", load_in_8bit=True)
tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom")

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
output = pipe("This is a cool example!", do_sample=True, top_p=0.95)
```
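
As a rough sanity check on the figures quoted above, the weight footprint is simply the parameter count multiplied by the bytes per parameter. The sketch below is illustrative only: `estimate_weight_memory_gb` and `BYTES_PER_PARAM` are hypothetical names, not part of `transformers` or `accelerate`, and the estimate ignores activations, generation caches, and any layers that an 8-bit loader might keep in higher precision.

```py
# Hypothetical helper for a back-of-the-envelope estimate; not a library API.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2, "int8": 1}


def estimate_weight_memory_gb(num_params: int, dtype: str) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


bloom_params = 176_000_000_000  # 176B parameters, the figure quoted in the text

print(estimate_weight_memory_gb(bloom_params, "bfloat16"))  # 352.0
print(estimate_weight_memory_gb(bloom_params, "int8"))  # 176.0
```

Under these assumptions, loading in 8-bit roughly halves the memory needed for the weights compared with `bfloat16`.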
37 changes: 30 additions & 7 deletions src/transformers/pipelines/base.py
@@ -749,7 +749,7 @@ def __init__(
framework: Optional[str] = None,
task: str = "",
args_parser: ArgumentHandler = None,
device: Union[int, str, "torch.device"] = -1,
device: Union[int, str, "torch.device"] = None,
torch_dtype: Optional[Union[str, "torch.dtype"]] = None,
binary_output: bool = False,
**kwargs,
@@ -769,18 +769,41 @@ def __init__(
self.device = device
elif isinstance(device, str):
self.device = torch.device(device)
elif device < 0:
elif device is None or device < 0:
self.device = torch.device("cpu")
else:
self.device = torch.device(f"cuda:{device}")
else:
self.device = device
self.device = device if device is not None else -1
self.torch_dtype = torch_dtype
self.binary_output = binary_output

# Special handling
if self.framework == "pt" and self.device.type != "cpu":
self.model = self.model.to(self.device)
if self.framework == "pt" and device is not None:
self.model = self.model.to(device=self.device)

Contributor

Don't mix `self.device` and `device`. This is super error-prone.

The change I proposed was at least explicit about its default value.
I really think this needs to be changed; it leaves too many opportunities to introduce bugs later on.

  • Set the default value (if no value is provided).
  • Handle `device` to create `self.device`.
  • Use `self.device` everywhere.
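
The three steps listed in the comment above could be sketched roughly as follows. This is a minimal, framework-free illustration, not the code from the PR: `normalize_device` and `TinyPipeline` are hypothetical names, and plain strings stand in for `torch.device` objects so that the example runs without PyTorch.

```py
from typing import Optional, Union


def normalize_device(device: Optional[Union[int, str]] = None) -> str:
    """Turn the user-facing `device` argument into one canonical string."""
    if device is None:
        device = -1  # Step 1: set an explicit default (CPU) when nothing is passed.
    if isinstance(device, str):
        return device  # Already a device string such as "cuda:1".
    # Step 2: negative integers mean CPU, other integers are CUDA indices.
    return "cpu" if device < 0 else f"cuda:{device}"


class TinyPipeline:
    """Hypothetical stand-in for a pipeline class."""

    def __init__(self, device: Optional[Union[int, str]] = None):
        self.device = normalize_device(device)

    def describe(self) -> str:
        # Step 3: after __init__, only `self.device` is ever consulted.
        return f"running on {self.device}"


print(TinyPipeline().describe())
print(TinyPipeline(device=0).describe())
```

The point of the design is that the raw `device` argument never escapes the constructor, so later code cannot accidentally branch on the unnormalized value.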

Contributor Author

Yeah, I somehow didn't fully consider your proposal in #21479 (comment). I think it's wiser to replace my changes with yours!


hf_device_map = getattr(self.model, "hf_device_map", None)
if hf_device_map is not None:

@Narsil Narsil Feb 9, 2023

Contributor

There's probably a way to structure the code so that this is written only once.

I think we could warn directly in the `pipeline` function when both `device` and `device_map` are set.
That avoids having to guess here. If you're splitting model loading and pipeline loading, then you should be aware of what you're doing, but we shouldn't actively depend on internals to figure out what's going on.

Essentially, when users call `pipeline(model=MyModel())`, the model is a black box to us and we shouldn't look inside it. In my proposed change, we only look at it when no device is passed.

And to be even purer, we could modify the pipeline itself to check `hf_device_map` only when we call `from_pretrained`. That seems even cleaner, since we know this internal map could exist there (whereas we can't know that here if the user passes in a real object).
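
The suggestion to warn when both arguments are set could look something like the sketch below. It is an assumption-laden illustration rather than the actual `pipeline` code: `check_device_args` is a hypothetical helper, and the warning text is invented for this example.

```py
import warnings
from typing import Optional, Union


def check_device_args(
    device: Optional[Union[int, str]] = None,
    device_map: Optional[Union[str, dict]] = None,
) -> bool:
    """Warn and return True when `device` and `device_map` are both given."""
    if device is not None and device_map is not None:
        warnings.warn(
            "Both `device` and `device_map` were passed; `device_map` already"
            " decides where the model lives, so `device` may conflict with it.",
            UserWarning,
        )
        return True
    return False


# Only the first call triggers the warning.
check_device_args(device=0, device_map="auto")
check_device_args(device_map="auto")
```

Doing the check at the point where both arguments are still visible means no code has to inspect the model object afterwards to guess how it was loaded.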

Contributor Author

💯 on this @Narsil

logger.warning(
"The model has been loaded with `accelerate` using `device_map=xxx` in `from_pretrained`"
" method, you should not pass a device when initializing your pipeline."
)

if device is None and self.framework == "pt":
# `accelerate` device map
hf_device_map = getattr(self.model, "hf_device_map", None)
if hf_device_map is not None:
# Take the main device used by `accelerate`.
# adapted from: https://github.com/huggingface/transformers/pull/21479#issuecomment-1420833512
if set(hf_device_map.values()) == {"cpu"} or set(hf_device_map.values()) == {"cpu", "disk"}:
accelerate_device = torch.device("cpu")
else:
main_device = [d for d in hf_device_map.values() if d not in ["cpu", "disk"]][0]
accelerate_device = torch.device(f"cuda:{main_device}")

self.device = accelerate_device
else:
self.device = torch.device("cpu")

# Update config with task specific parameters
task_specific_params = self.model.config.task_specific_params
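
The `hf_device_map` branch in the hunk above picks a "main" device from the map that `accelerate` attaches to the model. The following framework-free sketch shows the same idea with plain strings instead of `torch.device`. It is not the PR's code: `main_device_from_map` is a hypothetical name, and two behaviors are assumptions of this sketch rather than of the diff, namely that any map without an accelerator entry (including one containing only `"disk"`) falls back to CPU, and that string entries such as `"cuda:1"` are passed through unchanged.

```py
from typing import Dict, Union


def main_device_from_map(hf_device_map: Dict[str, Union[int, str]]) -> str:
    """Return the first accelerator device in the map, or "cpu" if there is none."""
    accelerator_devices = [d for d in hf_device_map.values() if d not in ("cpu", "disk")]
    if not accelerator_devices:
        # Assumption: everything is offloaded to CPU and/or disk, so inputs go to CPU.
        return "cpu"
    first = accelerator_devices[0]  # dicts preserve insertion order (Python 3.7+)
    # Integer entries are treated as CUDA indices; strings are passed through.
    return f"cuda:{first}" if isinstance(first, int) else str(first)


# Made-up module names, used only to exercise the helper.
example_map = {"embeddings": 0, "layers.0": 0, "layers.1": 1, "lm_head": "cpu"}
print(main_device_from_map(example_map))
```

Inputs are then sent to this single device, on the assumption that the first accelerator entry holds the first layers of the model.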
@@ -1048,8 +1071,8 @@ def __call__(self, inputs, *args, num_workers=None, batch_size=None, **kwargs):
self.call_count += 1
if self.call_count > 10 and self.framework == "pt" and self.device.type == "cuda":
warnings.warn(
"You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a"
" dataset",
"You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please"
" use a dataset",
UserWarning,
)

23 changes: 22 additions & 1 deletion tests/pipelines/test_pipelines_text_generation.py
@@ -14,7 +14,14 @@

import unittest

from transformers import MODEL_FOR_CAUSAL_LM_MAPPING, TF_MODEL_FOR_CAUSAL_LM_MAPPING, TextGenerationPipeline, pipeline
from transformers import (
MODEL_FOR_CAUSAL_LM_MAPPING,
TF_MODEL_FOR_CAUSAL_LM_MAPPING,
AutoModelForCausalLM,
AutoTokenizer,
TextGenerationPipeline,
pipeline,
)
from transformers.testing_utils import (
require_accelerate,
require_tf,
@@ -312,3 +319,17 @@ def test_small_model_fp16(self):

pipe = pipeline(model="hf-internal-testing/tiny-random-bloom", device=0, torch_dtype=torch.float16)
pipe("This is a test")

@require_torch
@require_accelerate
@require_torch_gpu
def test_pipeline_accelerate_top_p(self):
import torch

model_id = "hf-internal-testing/tiny-random-bloom"

model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(model_id)

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
pipe("This is a test", do_sample=True, top_p=0.5)