diff --git a/docs/source/en/tasks/semantic_segmentation.mdx b/docs/source/en/tasks/semantic_segmentation.mdx
index f1ab7ee0ea68..074f9e2137c5 100644
--- a/docs/source/en/tasks/semantic_segmentation.mdx
+++ b/docs/source/en/tasks/semantic_segmentation.mdx
@@ -35,7 +35,7 @@ Before you begin, make sure you have all the necessary libraries installed:
 pip install -q datasets transformers evaluate
 ```

-We encourage you to login to your Hugging Face account so you can upload and share your model with the community. When prompted, enter your token to login:
+We encourage you to log in to your Hugging Face account so you can upload and share your model with the community. When prompted, enter your token to log in:

 ```py
 >>> from huggingface_hub import notebook_login
@@ -95,9 +95,13 @@ The next step is to load a SegFormer image processor to prepare the images and a
 ```py
 >>> from transformers import AutoImageProcessor

->>> feature_extractor = AutoImageProcessor.from_pretrained("nvidia/mit-b0", reduce_labels=True)
+>>> checkpoint = "nvidia/mit-b0"
+>>> image_processor = AutoImageProcessor.from_pretrained(checkpoint, reduce_labels=True)
 ```

+
+
+
 It is common to apply some data augmentations to an image dataset to make a model more robust against overfitting. In this guide, you'll use the [`ColorJitter`](https://pytorch.org/vision/stable/generated/torchvision.transforms.ColorJitter.html) function from [torchvision](https://pytorch.org/vision/stable/index.html) to randomly change the color properties of an image, but you can also use any image library you like.

 ```py
@@ -112,14 +116,14 @@ Now create two preprocessing functions to prepare the images and annotations for
 >>> def train_transforms(example_batch):
 ...     images = [jitter(x) for x in example_batch["image"]]
 ...     labels = [x for x in example_batch["annotation"]]
-...     inputs = feature_extractor(images, labels)
+...     inputs = image_processor(images, labels)
 ...     return inputs


 >>> def val_transforms(example_batch):
 ...     images = [x for x in example_batch["image"]]
 ...     labels = [x for x in example_batch["annotation"]]
-...     inputs = feature_extractor(images, labels)
+...     inputs = image_processor(images, labels)
 ...     return inputs
 ```

@@ -130,6 +134,67 @@ To apply the `jitter` over the entire dataset, use the 🤗 Datasets [`~datasets
 >>> test_ds.set_transform(val_transforms)
 ```

+
+
+
+
+
+It is common to apply some data augmentations to an image dataset to make a model more robust against overfitting.
+In this guide, you'll use [`tf.image`](https://www.tensorflow.org/api_docs/python/tf/image) to randomly change the color properties of an image, but you can also use any image
+library you like.
+Define two separate transformation functions:
+- training data transformations that include image augmentation
+- validation data transformations that only transpose the images, since computer vision models in 🤗 Transformers expect channels-first layout
+
+```py
+>>> import tensorflow as tf
+
+
+>>> def aug_transforms(image):
+...     image = tf.keras.utils.img_to_array(image)
+...     image = tf.image.random_brightness(image, 0.25)
+...     image = tf.image.random_contrast(image, 0.5, 2.0)
+...     image = tf.image.random_saturation(image, 0.75, 1.25)
+...     image = tf.image.random_hue(image, 0.1)
+...     image = tf.transpose(image, (2, 0, 1))
+...     return image
+
+
+>>> def transforms(image):
+...     image = tf.keras.utils.img_to_array(image)
+...     image = tf.transpose(image, (2, 0, 1))
+...     return image
+```
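+
+Optionally, you can sanity-check these functions on a single example before wiring them into the dataset. This is just a quick sketch; it assumes you haven't called `set_transform` yet, so the `"image"` column still holds PIL images:
+
+```py
+>>> sample_image = train_ds[0]["image"].convert("RGB")
+>>> augmented = aug_transforms(sample_image)
+>>> print(augmented.shape)  # channels-first: (num_channels, height, width)
+```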
+
+Next, create two preprocessing functions to prepare batches of images and annotations for the model. These functions apply
+the image transformations and use the `image_processor` you loaded earlier to convert the images into `pixel_values` and
+annotations to `labels`. `ImageProcessor` also takes care of resizing and normalizing the images.
+
+```py
+>>> def train_transforms(example_batch):
+...     images = [aug_transforms(x.convert("RGB")) for x in example_batch["image"]]
+...     labels = [x for x in example_batch["annotation"]]
+...     inputs = image_processor(images, labels)
+...     return inputs
+
+
+>>> def val_transforms(example_batch):
+...     images = [transforms(x.convert("RGB")) for x in example_batch["image"]]
+...     labels = [x for x in example_batch["annotation"]]
+...     inputs = image_processor(images, labels)
+...     return inputs
+```
+
+To apply the preprocessing transformations over the entire dataset, use the 🤗 Datasets [`~datasets.Dataset.set_transform`] function.
+The transform is applied on the fly, which is faster and consumes less disk space:
+
+```py
+>>> train_ds.set_transform(train_transforms)
+>>> test_ds.set_transform(val_transforms)
+```
+
+
+
 ## Evaluate

 Including a metric during training is often helpful for evaluating your model's performance. You can quickly load an evaluation method with the 🤗 [Evaluate](https://huggingface.co/docs/evaluate/index) library. For this task, load the [mean Intersection over Union](https://huggingface.co/spaces/evaluate-metric/accuracy) (IoU) metric (see the 🤗 Evaluate [quick tour](https://huggingface.co/docs/evaluate/a_quick_tour) to learn more about how to load and compute a metric):
@@ -140,7 +205,11 @@ Including a metric during training is often helpful for evaluating your model's
 >>> metric = evaluate.load("mean_iou")
 ```

-Then create a function to [`~evaluate.EvaluationModule.compute`] the metrics. Your predictions need to be converted to logits first, and then reshaped to match the size of the labels before you can call [`~evaluate.EvaluationModule.compute`]:
+Then create a function to [`~evaluate.EvaluationModule.compute`] the metrics. Your predictions need to be converted to
+logits first, and then reshaped to match the size of the labels before you can call [`~evaluate.EvaluationModule.compute`]:
+
+
+

 ```py
 >>> def compute_metrics(eval_pred):
@@ -168,10 +237,48 @@ Then create a function to [`~evaluate.EvaluationModule.compute`] the metrics. Yo
 ...     return metrics
 ```

+
+
+
+
+
+
+
+```py
+>>> def compute_metrics(eval_pred):
+...     logits, labels = eval_pred
+...     logits = tf.transpose(logits, perm=[0, 2, 3, 1])
+...     logits_resized = tf.image.resize(
+...         logits,
+...         size=tf.shape(labels)[1:],
+...         method="bilinear",
+...     )
+
+...     pred_labels = tf.argmax(logits_resized, axis=-1)
+...     metrics = metric.compute(
+...         predictions=pred_labels,
+...         references=labels,
+...         num_labels=num_labels,
+...         ignore_index=-1,
+...         reduce_labels=image_processor.do_reduce_labels,
+...     )
+
+...     per_category_accuracy = metrics.pop("per_category_accuracy").tolist()
+...     per_category_iou = metrics.pop("per_category_iou").tolist()
+
+...     metrics.update({f"accuracy_{id2label[i]}": v for i, v in enumerate(per_category_accuracy)})
+...     metrics.update({f"iou_{id2label[i]}": v for i, v in enumerate(per_category_iou)})
+...     return {"val_" + k: v for k, v in metrics.items()}
+```
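+
+The transpose and resize above are needed because SegFormer predicts logits at a lower resolution than the labels (roughly a quarter of the input size), and `tf.image.resize` expects channels-last tensors. The snippet below is only a sketch of the shapes involved; the sizes and the number of labels are illustrative assumptions, not values read from the dataset:
+
+```py
+>>> dummy_logits = tf.random.normal((1, 150, 128, 128))  # (batch_size, num_labels, height / 4, width / 4)
+>>> dummy_labels = tf.zeros((1, 512, 512), dtype=tf.int32)  # (batch_size, height, width)
+>>> resized = tf.image.resize(tf.transpose(dummy_logits, [0, 2, 3, 1]), size=tf.shape(dummy_labels)[1:])
+>>> print(tf.argmax(resized, axis=-1).shape)  # now matches the label resolution
+```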
+
+
+

 Your `compute_metrics` function is ready to go now, and you'll return to it when you set up your training.

 ## Train

-
+
+

 If you aren't familiar with finetuning a model with the [`Trainer`], take a look at the basic tutorial [here](../training#finetune-with-trainer)!

@@ -183,10 +290,7 @@ You're ready to start training your model now! Load SegFormer with [`AutoModelFo
 ```py
 >>> from transformers import AutoModelForSemanticSegmentation, TrainingArguments, Trainer

->>> pretrained_model_name = "nvidia/mit-b0"
->>> model = AutoModelForSemanticSegmentation.from_pretrained(
-...     pretrained_model_name, id2label=id2label, label2id=label2id
-... )
+>>> model = AutoModelForSemanticSegmentation.from_pretrained(checkpoint, id2label=id2label, label2id=label2id)
 ```

 At this point, only three steps remain:
@@ -229,6 +333,112 @@ Once training is completed, share your model to the Hub with the [`~transformers
 ```py
 >>> trainer.push_to_hub()
 ```
+
+
+
+
+
+If you are unfamiliar with fine-tuning a model with Keras, check out the [basic tutorial](./training#train-a-tensorflow-model-with-keras) first!
+
+
+
+To fine-tune a model in TensorFlow, follow these steps:
+1. Define the training hyperparameters, and set up an optimizer and a learning rate schedule.
+2. Instantiate a pretrained model.
+3. Convert a 🤗 Dataset to a `tf.data.Dataset`.
+4. Compile your model.
+5. Add callbacks to calculate metrics and upload your model to the 🤗 Hub.
+6. Use the `fit()` method to run the training.
+
+Start by defining the hyperparameters, optimizer and learning rate schedule:
+
+```py
+>>> from transformers import create_optimizer
+
+>>> batch_size = 2
+>>> num_epochs = 50
+>>> num_train_steps = len(train_ds) * num_epochs
+>>> learning_rate = 6e-5
+>>> weight_decay_rate = 0.01
+
+>>> optimizer, lr_schedule = create_optimizer(
+...     init_lr=learning_rate,
+...     num_train_steps=num_train_steps,
+...     weight_decay_rate=weight_decay_rate,
+...     num_warmup_steps=0,
+... )
+```
+
+Then, load SegFormer with [`TFAutoModelForSemanticSegmentation`] along with the label mappings, and compile it with the
+optimizer:
+
+```py
+>>> from transformers import TFAutoModelForSemanticSegmentation
+
+>>> model = TFAutoModelForSemanticSegmentation.from_pretrained(
+...     checkpoint,
+...     id2label=id2label,
+...     label2id=label2id,
+... )
+>>> model.compile(optimizer=optimizer)
+```
+
+Convert your datasets to the `tf.data.Dataset` format using the [`~datasets.Dataset.to_tf_dataset`] and the [`DefaultDataCollator`]:
+
+```py
+>>> from transformers import DefaultDataCollator
+
+>>> data_collator = DefaultDataCollator(return_tensors="tf")
+
+>>> tf_train_dataset = train_ds.to_tf_dataset(
+...     columns=["pixel_values", "label"],
+...     shuffle=True,
+...     batch_size=batch_size,
+...     collate_fn=data_collator,
+... )
+
+>>> tf_eval_dataset = test_ds.to_tf_dataset(
+...     columns=["pixel_values", "label"],
+...     shuffle=True,
+...     batch_size=batch_size,
+...     collate_fn=data_collator,
+... )
+```
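+
+As a quick, optional sanity check, you can pull a single batch from the resulting `tf.data.Dataset` and inspect its keys and tensor shapes (the exact shapes depend on your `image_processor` settings):
+
+```py
+>>> batch = next(iter(tf_train_dataset))
+>>> for key, value in batch.items():
+...     print(key, value.shape)
+```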
+
+To compute the accuracy from the predictions and push your model to the 🤗 Hub, use [Keras callbacks](./main_classes/keras_callbacks).
+Pass your `compute_metrics` function to [`KerasMetricCallback`],
+and use the [`PushToHubCallback`] to upload the model:
+
+```py
+>>> from transformers.keras_callbacks import KerasMetricCallback, PushToHubCallback
+
+>>> metric_callback = KerasMetricCallback(
+...     metric_fn=compute_metrics, eval_dataset=tf_eval_dataset, batch_size=batch_size, label_cols=["labels"]
+... )
+
+>>> push_to_hub_callback = PushToHubCallback(output_dir="scene_segmentation", tokenizer=image_processor)
+
+>>> callbacks = [metric_callback, push_to_hub_callback]
+```
+
+Finally, you are ready to train your model! Call `fit()` with your training and validation datasets, the number of epochs,
+and your callbacks to fine-tune the model:
+
+```py
+>>> model.fit(
+...     tf_train_dataset,
+...     validation_data=tf_eval_dataset,
+...     callbacks=callbacks,
+...     epochs=num_epochs,
+... )
+```
+
+Congratulations! You have fine-tuned your model and shared it on the 🤗 Hub. You can now use it for inference!
+
+

 ## Inference

@@ -245,6 +455,8 @@ Load an image for inference:

 Image of bedroom

+
+
 The simplest way to try out your finetuned model for inference is to use it in a [`pipeline`]. Instantiate a `pipeline` for image segmentation with your model, and pass your image to it:

 ```py
@@ -285,7 +497,7 @@ You can also manually replicate the results of the `pipeline` if you'd like. Pro

 ```py
 >>> device = torch.device("cuda" if torch.cuda.is_available() else "cpu")  # use GPU if available, otherwise use a CPU
->>> encoding = feature_extractor(image, return_tensors="pt")
+>>> encoding = image_processor(image, return_tensors="pt")
 >>> pixel_values = encoding.pixel_values.to(device)
 ```

@@ -309,10 +521,50 @@ Next, rescale the logits to the original image size:
 >>> pred_seg = upsampled_logits.argmax(dim=1)[0]
 ```

-To visualize the results, load the [dataset color palette](https://github.com/tensorflow/models/blob/3f1ca33afe3c1631b733ea7e40c294273b9e406d/research/deeplab/utils/get_dataset_colormap.py#L51) that maps each class to their RGB values. Then you can combine and plot your image and the predicted segmentation map:
+
+
+
+
+Load an image processor to preprocess the image and return the input as TensorFlow tensors:
+
+```py
+>>> from transformers import AutoImageProcessor
+
+>>> image_processor = AutoImageProcessor.from_pretrained("MariaK/scene_segmentation")
+>>> inputs = image_processor(image, return_tensors="tf")
+```
+
+Pass your input to the model and return the `logits`:
+
+```py
+>>> from transformers import TFAutoModelForSemanticSegmentation
+
+>>> model = TFAutoModelForSemanticSegmentation.from_pretrained("MariaK/scene_segmentation")
+>>> logits = model(**inputs).logits
+```
+
+Next, rescale the logits to the original image size and apply argmax on the class dimension:
+
+```py
+>>> logits = tf.transpose(logits, [0, 2, 3, 1])
+
+>>> upsampled_logits = tf.image.resize(
+...     logits,
+...     # We reverse the shape of `image` because `image.size` returns width and height.
+...     image.size[::-1],
+... )
+
+>>> pred_seg = tf.math.argmax(upsampled_logits, axis=-1)[0]
+```
+
+
+
+
+To visualize the results, load the [dataset color palette](https://github.com/tensorflow/models/blob/3f1ca33afe3c1631b733ea7e40c294273b9e406d/research/deeplab/utils/get_dataset_colormap.py#L51) as `ade_palette()`, which maps each class to its RGB values. Then you can combine and plot your image and the predicted segmentation map:

 ```py
 >>> import matplotlib.pyplot as plt
+>>> import numpy as np

 >>> color_seg = np.zeros((pred_seg.shape[0], pred_seg.shape[1], 3), dtype=np.uint8)
 >>> palette = np.array(ade_palette())
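+>>> # palette holds one RGB color per class id, used to color each predicted class in the segmentation map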