141 changes: 141 additions & 0 deletions docs/source/en/model_doc/chameleon.md
@@ -50,13 +50,18 @@ The original code can be found [here](https://github.com/facebookresearch/chamel

- We advise users to use `padding_side="left"` when running batched generation as it leads to more accurate results. Simply make sure to set `processor.tokenizer.padding_side = "left"` before generating.

- When generating images, we advise users to load the model in `bfloat16` for better results. Simply make sure to set `torch_dtype=torch.bfloat16` when loading the model.

- Note that Chameleon was tuned for safety alignment. If the model is refusing to answer, consider asking a more concrete question instead of an open-ended one.

- Chameleon generates in chat format, which means that the generated text will always be the "assistant's turn". You can enable text completion generation by passing `return_for_text_completion=True` when calling the processor.

> [!NOTE]
> The Chameleon implementation in Transformers uses a special image token to indicate where to merge image embeddings. Instead of adding a new token for this, it reuses one of the reserved tokens: `<reserved08707>`. You have to add `<image>` to your prompt in the place where the image should be embedded for correct generation.

> [!NOTE]
> The official model checkpoint currently only supports text generation. To generate images and interleaved text-image responses, you can use finetuned versions such as [Anole](https://arxiv.org/abs/2407.06135). Note however that Anole has a bias for "empty" or background patches, so it is recommended to use sampling when generating images (i.e. setting `do_sample=True` during generation) to reduce the likelihood of generating a blank image.

## Usage example

### Single image inference
@@ -124,6 +129,142 @@
generate_ids = model.generate(**inputs, max_new_tokens=50)
processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
```

### Text-to-image generation

Chameleon can also generate images. However, the official model checkpoint currently only supports text generation. We need to use finetuned versions such as [Anole](https://arxiv.org/abs/2407.06135) to do image generation. Here is how you can do it:

```python
import torch
from transformers import ChameleonProcessor, ChameleonForConditionalGeneration
from transformers.image_transforms import to_pil_image

processor = ChameleonProcessor.from_pretrained("leloy/Anole-7b-v0.1-hf")
model = ChameleonForConditionalGeneration.from_pretrained(
    "leloy/Anole-7b-v0.1-hf",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# Prepare a prompt
prompt = "Generate an image of a snowman."

# Preprocess the prompt
inputs = processor(prompt, padding=True, return_tensors="pt").to(model.device, dtype=model.dtype)

# Generate discrete image tokens
generate_ids = model.generate(
    **inputs,
    multimodal_generation_mode="image-only",
    # Note: We need to set `max_new_tokens` to 1026 since the model generates the `image_start_token` marker token first, then 1024 image tokens, and finally the `image_end_token` marker token.
    max_new_tokens=1026,
    # This is important because most of the image tokens during training were for "empty" patches, so greedy decoding of image tokens will likely result in a blank image.
    do_sample=True,
)

# Only keep the tokens from the response
response_ids = generate_ids[:, inputs["input_ids"].shape[-1]:]

# Decode the generated image tokens (dropping the start/end marker tokens)
pixel_values = model.decode_image_tokens(response_ids[:, 1:-1])
pixel_values = processor.postprocess_pixel_values(pixel_values)

# Save the image
image = to_pil_image(pixel_values[0].detach().cpu())
image.save("snowman.png")
```

**Review discussion** (inline comments on `device_map="auto"`):

**Contributor:** This example failed in my environment with 4 GPUs, complaining about a device mismatch.

**Contributor (author):** @minostauros can you provide the script you used for this? The complete error message would also help. Thank you!

**Contributor (@minostauros, Aug 6, 2024):** I needed to remove the `device_map="auto"` and manually send the model to a specific CUDA device to properly run the code.

```python
>>> import accelerate
>>> accelerate.__version__
'0.30.1'
>>> import torch
>>> from transformers import ChameleonProcessor, ChameleonForConditionalGeneration
>>> from PIL import Image
>>>
>>> processor = ChameleonProcessor.from_pretrained("leloy/Anole-7b-v0.1-hf")
Some kwargs in processor config are unused and will not have any effect: image_token, image_seq_length.
>>> model = ChameleonForConditionalGeneration.from_pretrained(
...     "leloy/Anole-7b-v0.1-hf",
...     device_map="auto",
...     torch_dtype=torch.bfloat16,
... )
Loading checkpoint shards: 100%|██████████| 3/3 [00:02<00:00,  1.11it/s]
>>> model.device
device(type='cuda', index=0)
>>> url = "https://huggingface.co/microsoft/kosmos-2-patch14-224/resolve/main/snowman.jpg"
>>> image_snowman = Image.open(requests.get(url, stream=True).raw)
>>> prompt = "Generate a variation of this image.<image>"
>>> inputs = processor(
...     prompt,
...     images=[image_snowman],
...     padding=True,
...     return_tensors="pt",
... ).to(model.device, dtype=model.dtype)
>>> generate_ids = model.generate(
...     **inputs,
...     multimodal_generation_mode="image-only",
...     max_new_tokens=1026,
...     do_sample=True,
... )
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/workspace/Github/transformers_anole/src/transformers/models/chameleon/modeling_chameleon.py", line 1821, in generate
    return super().generate(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/workspace/Github/transformers_anole/src/transformers/generation/utils.py", line 1989, in generate
    result = self._sample(
  File "/workspace/Github/transformers_anole/src/transformers/generation/utils.py", line 2932, in _sample
    outputs = self(**model_inputs, return_dict=True)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/workspace/Github/transformers_anole/src/transformers/models/chameleon/modeling_chameleon.py", line 1881, in forward
    outputs = self.model(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/workspace/Github/transformers_anole/src/transformers/models/chameleon/modeling_chameleon.py", line 1491, in forward
    image_tokens = self.get_image_tokens(pixel_values)
  File "/workspace/Github/transformers_anole/src/transformers/models/chameleon/modeling_chameleon.py", line 1427, in get_image_tokens
    return self.img2bpe_mapping_tensor[image_toks]
RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)
>>> inputs.input_ids.device
device(type='cuda', index=0)
>>> inputs.keys()
dict_keys(['input_ids', 'attention_mask', 'pixel_values'])
>>> inputs.pixel_values.device
device(type='cuda', index=0)
>>> model = model.cuda()
You shouldn't move a model that is dispatched using accelerate hooks.
>>> generate_ids = model.generate(
...     **inputs,
...     multimodal_generation_mode="image-only",
...     max_new_tokens=1026,
...     do_sample=True,
... )
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/workspace/Github/transformers_anole/src/transformers/models/chameleon/modeling_chameleon.py", line 1821, in generate
    return super().generate(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/workspace/Github/transformers_anole/src/transformers/generation/utils.py", line 1989, in generate
    result = self._sample(
  File "/workspace/Github/transformers_anole/src/transformers/generation/utils.py", line 2932, in _sample
    outputs = self(**model_inputs, return_dict=True)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/workspace/Github/transformers_anole/src/transformers/models/chameleon/modeling_chameleon.py", line 1881, in forward
    outputs = self.model(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/workspace/Github/transformers_anole/src/transformers/models/chameleon/modeling_chameleon.py", line 1491, in forward
    image_tokens = self.get_image_tokens(pixel_values)
  File "/workspace/Github/transformers_anole/src/transformers/models/chameleon/modeling_chameleon.py", line 1426, in get_image_tokens
    _, _, image_toks = self.vqmodel.encode(pixel_values)
  File "/workspace/Github/transformers_anole/src/transformers/models/chameleon/modeling_chameleon.py", line 1159, in encode
    hidden_states = self.encoder(pixel_values)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/workspace/Github/transformers_anole/src/transformers/models/chameleon/modeling_chameleon.py", line 979, in forward
    hidden_states = [self.conv_in(pixel_values)]
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/conv.py", line 460, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/conv.py", line 456, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:3 and cuda:0! (when checking argument for argument weight in method wrapper_CUDA__cudnn_convolution)
>>> model = ChameleonForConditionalGeneration.from_pretrained(
...     "leloy/Anole-7b-v0.1-hf",
...     torch_dtype=torch.bfloat16,
... ).to(device=0)
Loading checkpoint shards: 100%|██████████| 3/3 [00:01<00:00,  2.02it/s]
>>> generate_ids = model.generate(
...     **inputs,
...     multimodal_generation_mode="image-only",
...     max_new_tokens=1026,
...     do_sample=True,
... )
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
>>> generate_ids.shape
torch.Size([1, 2062])
```

**Contributor:** Updating accelerate to 0.33.0 did not help.

**Contributor (author):** @minostauros does this happen with the base Chameleon model, i.e. without this PR? The issue with `F.conv2d` may be unrelated to this PR, but the issue with `return self.img2bpe_mapping_tensor[image_toks]` definitely is.

**Contributor (author):** Hm, I've never seen this happen before, but I suspect it's because of the `.float()` (IIRC, `to_pil_image` rescales the numpy array if it's of float type). What happens if you remove it or cast the array to `uint8`? BTW, I wouldn't be able to run tests myself for the next few hours as I'm still traveling.

**Contributor:**

> What happens if you remove it or cast the array to uint8?

Great point!

```python
# Decode the generated image tokens
pixel_values = model.decode_image_tokens(response_ids[:, 1:-1])
images = processor.postprocess_pixel_values(pixel_values)

# Save the image
from torchvision.transforms.functional import to_pil_image
images = [to_pil_image(img.detach().cpu()) for img in images]
images[0].save("snowman.png")
```

*(attached: generated snowman image)*

Perhaps just removing the 255 scaling and type casting in `ChameleonImageProcessor.postprocess()` may also support `torchvision.utils.save_image()`.

**Contributor (author):** The output after postprocessing should have the same shape, range, and dtype as the original image, so it's better to keep it this way IMO.

**Contributor (author):** I've also just added a test for model sharding, btw. Please check it out!

**Contributor:** The code now works like a charm! Thanks a lot for your contribution. Besides, the output does not seem as good as the Anole paper states.

Prompt: 'A piece of paper with word like "Anole" written on it, and a drawing of an Anole.'

- from paper: *(image)*
- from "leloy/Anole-7b-v0.1-hf": *(image)*

How may I improve the results?

### Text-image to image generation

We can also interleave text and images in the prompt to generate images. Here is how you can do it:

```python
import requests

import torch
from PIL import Image
from transformers import ChameleonProcessor, ChameleonForConditionalGeneration
from transformers.image_transforms import to_pil_image

processor = ChameleonProcessor.from_pretrained("leloy/Anole-7b-v0.1-hf")
model = ChameleonForConditionalGeneration.from_pretrained(
    "leloy/Anole-7b-v0.1-hf",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# Get image of a snowman
url = "https://huggingface.co/microsoft/kosmos-2-patch14-224/resolve/main/snowman.jpg"
image_snowman = Image.open(requests.get(url, stream=True).raw)

# Prepare a prompt
prompt = "Generate a variation of this image.<image>"

# Preprocess the prompt
inputs = processor(
    images=[image_snowman],
    text=prompt,
    padding=True,
    return_tensors="pt",
).to(model.device, dtype=model.dtype)

# Generate discrete image tokens
generate_ids = model.generate(
    **inputs,
    multimodal_generation_mode="image-only",
    # Note: We need to set `max_new_tokens` to 1026 since the model generates the `image_start_token` marker token first, then 1024 image tokens, and finally the `image_end_token` marker token.
    max_new_tokens=1026,
    # This is important because most of the image tokens during training were for "empty" patches, so greedy decoding of image tokens will likely result in a blank image.
    do_sample=True,
)

# Only keep the tokens from the response
response_ids = generate_ids[:, inputs["input_ids"].shape[-1]:]

# The generated image tokens are wrapped by the `image_start_token` and `image_end_token` tokens. We need to remove them before decoding the image tokens.
image_token_ids = response_ids[:, 1:-1]

# Decode the generated image tokens
pixel_values = model.decode_image_tokens(image_token_ids)
pixel_values = processor.postprocess_pixel_values(pixel_values)

# Save the image
image = to_pil_image(pixel_values[0].detach().cpu())
image.save("snowman.png")
```

### Interleaved text-image generation

We can also generate interleaved text and images in the output. Here is how you can do it:

```python
import torch
from transformers import ChameleonProcessor, ChameleonForConditionalGeneration

processor = ChameleonProcessor.from_pretrained("leloy/Anole-7b-v0.1-hf")
model = ChameleonForConditionalGeneration.from_pretrained(
    "leloy/Anole-7b-v0.1-hf",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# Prepare a prompt
prompt = "Can you draw a snowman and explain how to build one?"

# Preprocess the prompt
inputs = processor(prompt, padding=True, return_tensors="pt").to(model.device, dtype=model.dtype)

# Generate interleaved text and discrete image tokens
generate_ids = model.generate(
    **inputs,
    multimodal_generation_mode="interleaved-text-image",
    # Note: We will need a larger `max_new_tokens` value since we are generating both text and image tokens.
    max_new_tokens=4096,
    # This is important because most of the image tokens during training were for "empty" patches, so greedy decoding of image tokens will likely result in a blank image.
    do_sample=True,
)

# Only keep the tokens from the response
response_ids = generate_ids[:, inputs["input_ids"].shape[-1]:]
```

From here, you can split the response tokens into text and image token segments, decode them separately as shown in the previous examples, and finally render the resulting text and images together. You can also use [MMSG](https://github.com/leloykun/mmsg) to do this more easily.
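As a rough illustration of that splitting step, here is a hedged sketch. The helper and the marker-token ids are illustrative, not an API from this PR; in practice the begin/end-of-image ids would come from the model's vocabulary mapping:

```python
from typing import List, Tuple


def split_interleaved_ids(
    ids: List[int], boi_token_id: int, eoi_token_id: int
) -> List[Tuple[str, List[int]]]:
    """Partition a flat id sequence into ("text", ids) and ("image", ids) segments.

    Image segments are the ids strictly between each begin-of-image and
    end-of-image marker token; everything else is treated as text.
    """
    segments: List[Tuple[str, List[int]]] = []
    current: List[int] = []
    in_image = False
    for tok in ids:
        if tok == boi_token_id:
            if current:
                segments.append(("text", current))
            current, in_image = [], True
        elif tok == eoi_token_id:
            segments.append(("image", current))
            current, in_image = [], False
        else:
            current.append(tok)
    if current:
        segments.append(("image" if in_image else "text", current))
    return segments


# Toy example with made-up marker ids 8197 (begin) and 8196 (end):
segments = split_interleaved_ids([5, 6, 8197, 100, 101, 8196, 7], 8197, 8196)
print(segments)  # [('text', [5, 6]), ('image', [100, 101]), ('text', [7])]
```

Each `("image", ids)` chunk can then go through `model.decode_image_tokens` and each `("text", ids)` chunk through `processor.batch_decode`, as in the examples above.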
**Contributor:** This should be demonstrated - it's not obvious how this should be done.

**Collaborator:** ➕ on this! Let's show a snippet that uses mmsg maybe? 🤗


## Model optimization

### Quantization using Bitsandbytes
135 changes: 135 additions & 0 deletions src/transformers/generation/logits_process.py
@@ -1778,6 +1778,61 @@ def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> to
        return scores_processed


class SuppressTokensInIndexRangeLogitsProcessor(LogitsProcessor):
**Contributor:** The custom logits processors should all be tested.

**Collaborator:** btw why don't we just use `SuppressTokensLogitsProcessor`? It supports passing a list of tokens and does not require adding a new one!

**Contributor:** As per the other logits processors, a code example demo-ing the behaviour should be added here.

    r"""
    [`SuppressTokensInIndexRangeLogitsProcessor`] suppresses a list of tokens from `start_index` to `end_index` (exclusive).

    Args:
        suppress_tokens (`List[int]`):
            List of token ids to suppress during generation.
        start_index (`int`):
            The index at which to start suppressing tokens.
        end_index (`int`, *optional*):
            The index at which to end suppressing tokens. If `None`, it will suppress tokens indefinitely.
        device (`str`, *optional*, defaults to `"cpu"`):
            The device to allocate the tensors.

    Examples:

    ```python
    >>> from transformers import AutoProcessor, ChameleonForConditionalGeneration, LogitsProcessorList
    >>> from transformers.generation.logits_process import SuppressTokensInIndexRangeLogitsProcessor
    >>> import torch

    >>> model = ChameleonForConditionalGeneration.from_pretrained("leloy/Anole-7b-v0.1-hf")
    >>> processor = AutoProcessor.from_pretrained("leloy/Anole-7b-v0.1-hf")

    >>> inputs = processor("Can you draw a snowman?", return_tensors="pt")
    >>> max_length = 1200
    >>> # Don't start generating an image if there isn't enough space for the rest of the image tokens.
    >>> logits_processor = SuppressTokensInIndexRangeLogitsProcessor(
    ...     suppress_tokens=[model.vocabulary_mapping.boi_token_id],
    ...     start_index=max_length - model.model.image_seq_length - 1,
    ...     device=model.device,
    ... )

    >>> outputs = model.generate(**inputs, max_length=max_length, logits_processor=LogitsProcessorList([logits_processor]))
    >>> print(torch.isin(outputs[:, inputs.input_ids.shape[1] + 1 :], model.vocabulary_mapping.image_token_ids).all())
    True
    ```
    """

    def __init__(
        self, suppress_tokens: List[int], start_index: int, end_index: Optional[int] = None, device: str = "cpu"
    ):
        self.suppress_tokens = torch.tensor(suppress_tokens, device=device)
        self.start_index = start_index
        self.end_index = end_index if end_index is not None else math.inf

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:
        current_index = input_ids.shape[1]
        if self.start_index > current_index or current_index > self.end_index:
            return scores
        suppress_tokens_mask = torch.zeros_like(scores, dtype=torch.bool)
        suppress_tokens_mask[:, self.suppress_tokens] = True
        return scores.masked_fill(suppress_tokens_mask, -float("inf"))
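A standalone sanity check of the suppression behaviour. This sketch reimplements the `__call__` logic above with plain `torch` rather than importing the class, so it runs without this PR:

```python
import torch


def suppress_in_index_range(scores, current_index, suppress_tokens, start_index, end_index):
    # Mirror of the __call__ logic above: outside [start_index, end_index]
    # the scores pass through untouched; inside, the listed token ids get -inf.
    if start_index > current_index or current_index > end_index:
        return scores
    mask = torch.zeros_like(scores, dtype=torch.bool)
    mask[:, suppress_tokens] = True
    return scores.masked_fill(mask, -float("inf"))


scores = torch.zeros(1, 5)
out = suppress_in_index_range(
    scores, current_index=3, suppress_tokens=torch.tensor([1, 4]), start_index=2, end_index=10
)
print(out)  # token ids 1 and 4 are now -inf, the rest are unchanged
```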


class SuppressTokensAtBeginLogitsProcessor(LogitsProcessor):
r"""
[`SuppressTokensAtBeginLogitsProcessor`] supresses a list of tokens as soon as the `generate` function starts
@@ -2953,3 +3008,83 @@ def expected_mean_g_value(self, vocab_size: int, coinflip_prob: float = 0.5) ->
            The expected mean g-value for watermarked text.
        """
        return coinflip_prob + coinflip_prob * (1 - coinflip_prob) * (1 - (1 / vocab_size))


class AllowOnlyTokensInRelativeWindowLogitsProcessor(LogitsProcessor):
    r"""
    [`AllowOnlyTokensInRelativeWindowLogitsProcessor`] suppresses the logits of tokens aside from a specific set of tokens
    that can be generated at a relative window from a trigger token (e.g. begin image token). If `exclusive` is set to
    `True`, the set of tokens allowed at this window will not be allowed anywhere else. This is useful for enforcing
    multimodal generation constraints.
**Collaborator:** can you explain why someone would want to do that?

    Originally created for [Chameleon](https://huggingface.co/docs/transformers/model_doc/chameleon).

    Args:
        trigger_token_id (`int`):
            The token id that triggers the window check.
        allowed_token_ids (`List[int]`):
            The list of token ids that are allowed at the specified relative window.
        window_width (`int`):
            The width of the window relative to the trigger token.
        exclusive (`bool`, *optional*, defaults to `False`):
            If `True`, the set of tokens allowed at this window will not be allowed anywhere else.
        device (`str`, *optional*, defaults to `"cpu"`):
            The device to allocate the util tensor on.

    Examples:

**Collaborator** (suggested change to the import line):
`>>> from transformers import AutoProcessor, ChameleonForConditionalGeneration, LogitsProcessorList`

    ```python
    >>> from transformers import AutoProcessor, ChameleonForConditionalGeneration, LogitsProcessorList
    >>> from transformers.generation.logits_process import AllowOnlyTokensInRelativeWindowLogitsProcessor
    >>> import torch

    >>> model = ChameleonForConditionalGeneration.from_pretrained("leloy/Anole-7b-v0.1-hf")
    >>> processor = AutoProcessor.from_pretrained("leloy/Anole-7b-v0.1-hf")

    >>> inputs = processor("Can you draw a snowman?", return_tensors="pt")
    >>> max_length = 1200
    >>> # Generate only image token ids for `image_seq_length` steps when the boi-token is already generated
    >>> logits_processor = AllowOnlyTokensInRelativeWindowLogitsProcessor(
    ...     trigger_token_id=model.vocabulary_mapping.boi_token_id,
    ...     allowed_token_ids=model.vocabulary_mapping.image_token_ids,
    ...     window_width=model.model.image_seq_length,
    ...     exclusive=True,
    ...     device=model.device,
    ... )

    >>> outputs = model.generate(**inputs, max_length=max_length, logits_processor=LogitsProcessorList([logits_processor]))
    ```
    """
"""

    def __init__(
        self,
        trigger_token_id: int,
        allowed_token_ids: List[int],
        window_width: int,
        exclusive: bool = False,
        device: str = "cpu",
    ):
        self.trigger_token_id = trigger_token_id
        self.allowed_token_ids = torch.tensor(allowed_token_ids, device=device).unsqueeze(0)
        self.window_width = window_width
        self.exclusive = exclusive

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:
        if input_ids.shape[1] < self.window_width and not self.exclusive:
            return scores

        window_width = min(self.window_width, input_ids.shape[1])
        trigger_positions = (input_ids[:, -window_width:] == self.trigger_token_id).any(dim=1).unsqueeze(-1)

        disallowed_tokens_mask = torch.ones_like(scores, dtype=torch.bool)
        disallowed_tokens_mask[:, self.allowed_token_ids] = False

        if self.exclusive:
            return scores.masked_fill(
                ~(disallowed_tokens_mask ^ trigger_positions),
                -float("inf"),
            )
        return scores.masked_fill(
            disallowed_tokens_mask & trigger_positions,
            -float("inf"),
        )
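The intended effect, allowing only certain token ids within `window_width` steps of the trigger token, can be checked with a standalone sketch of the non-exclusive branch (plain `torch`, not the class itself):

```python
import torch


def allow_only_in_window(scores, input_ids, trigger_token_id, allowed_token_ids, window_width):
    # Non-exclusive branch of the processor above: if the trigger token occurred
    # in the last `window_width` positions, everything but the allowed ids is -inf.
    if input_ids.shape[1] < window_width:
        return scores
    w = min(window_width, input_ids.shape[1])
    triggered = (input_ids[:, -w:] == trigger_token_id).any(dim=1).unsqueeze(-1)
    disallowed = torch.ones_like(scores, dtype=torch.bool)
    disallowed[:, allowed_token_ids] = False
    return scores.masked_fill(disallowed & triggered, -float("inf"))


input_ids = torch.tensor([[7, 9, 3]])  # 9 is a made-up trigger (begin-image) token id
scores = torch.zeros(1, 6)
out = allow_only_in_window(
    scores, input_ids, trigger_token_id=9, allowed_token_ids=torch.tensor([4, 5]), window_width=3
)
print(out)  # only ids 4 and 5 keep finite scores
```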
39 changes: 38 additions & 1 deletion src/transformers/image_processing_utils.py
@@ -18,7 +18,7 @@
import numpy as np

from .image_processing_base import BatchFeature, ImageProcessingMixin
-from .image_transforms import center_crop, normalize, rescale
+from .image_transforms import center_crop, normalize, rescale, unnormalize
from .image_utils import ChannelDimension
from .utils import logging

@@ -112,6 +112,43 @@ def normalize(
            image, mean=mean, std=std, data_format=data_format, input_data_format=input_data_format, **kwargs
        )

    def unnormalize(
        self,
        image: np.ndarray,
        mean: Union[float, Iterable[float]],
        std: Union[float, Iterable[float]],
        data_format: Optional[Union[str, ChannelDimension]] = None,
        input_data_format: Optional[Union[str, ChannelDimension]] = None,
        **kwargs,
    ) -> np.ndarray:
        """
        Unnormalize an image. image = (image * image_std) + image_mean.

        Args:
            image (`np.ndarray`):
                Image to unnormalize.
            mean (`float` or `Iterable[float]`):
                Image mean to use for unnormalization.
            std (`float` or `Iterable[float]`):
                Image standard deviation to use for unnormalization.
            data_format (`str` or `ChannelDimension`, *optional*):
                The channel dimension format for the output image. If unset, the channel dimension format of the input
                image is used. Can be one of:
                - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
                - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
            input_data_format (`ChannelDimension` or `str`, *optional*):
                The channel dimension format for the input image. If unset, the channel dimension format is inferred
                from the input image. Can be one of:
                - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
                - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.

        Returns:
            `np.ndarray`: The unnormalized image.
        """
        return unnormalize(
            image, mean=mean, std=std, data_format=data_format, input_data_format=input_data_format, **kwargs
        )

    def center_crop(
        self,
        image: np.ndarray,
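The `unnormalize` method above inverts `normalize`. A plain-NumPy sketch of the round-trip (simplified: it ignores the channel-format handling that the mixin delegates to `image_transforms.unnormalize`):

```python
import numpy as np

# Per-channel statistics, broadcast over a channels-first (C, H, W) array.
mean = np.array([0.5, 0.5, 0.5]).reshape(-1, 1, 1)
std = np.array([0.25, 0.25, 0.25]).reshape(-1, 1, 1)

image = np.random.rand(3, 4, 4).astype(np.float32)  # toy (C, H, W) image
normalized = (image - mean) / std                   # what `normalize` computes
restored = normalized * std + mean                  # what `unnormalize` undoes

print(np.allclose(restored, image))  # True
```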