Remove `task` arg in `load_dataset` in image-classification example by regisss · Pull Request #28408 · huggingface/transformers

regisss · 2024-01-09T10:01:51Z

What does this PR do?

The task argument is now deprecated in the datasets.load_dataset method. This PR removes it and adds the renaming logic needed to deal with datasets like Cifar10 (the task attribute of datasets used to help with that).
Internal discussion here: https://huggingface.slack.com/archives/C034N0A7H09/p1704447848692889

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

albertvillanova

Thanks for the fix. Indeed, the tasks were deprecated long ago and we should definitely stop promoting their use.

Just one comment below about a possible edge case.

albertvillanova · 2024-01-09T10:39:19Z

        )

+    # Rename image and label columns if needed (e.g. Cifar10)
+    if "img" in dataset["train"].features:


Note that if data_args.dataset_name is None and data_args.train_dir is None, then the dataset dict will not have a "train" key, and the line above will raise a KeyError.

This could be avoided by replacing the line above with:

Suggested change

-    if "img" in dataset["train"].features:
+    if "img" in list(dataset.column_names.values())[0]:

Or:

Suggested change

-    if "img" in dataset["train"].features:
+    if "img" in next(iter(dataset.column_names.values())):

Maybe more readable?

Suggested change

-    if "img" in dataset["train"].features:
+    if "img" in (dataset["train"].features if "train" in dataset else dataset["validation"].features):

This is also compatible with your suggestion at huggingface/datasets#6571.

Actually, it seems this script always does training, no? In that case you can assume "train" is always present.

Indeed it's probably mostly used for training, but I'm going to add the suggestion anyway in case.
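
The edge case discussed in this thread can be illustrated without loading any data. The following is only a minimal sketch: it assumes a plain dict standing in for `DatasetDict.column_names` (split name mapped to a list of column names), and the helper name `columns_of_first_split` is hypothetical, not part of the PR.

```python
from typing import Dict, List


def columns_of_first_split(column_names: Dict[str, List[str]]) -> List[str]:
    """Return the column names of the first available split.

    `column_names` mimics `DatasetDict.column_names`: split name -> columns.
    Reading the first split avoids a KeyError when there is no "train" split.
    """
    if not column_names:
        raise ValueError("The dataset has no splits.")
    return next(iter(column_names.values()))


# Stand-in for a dataset that only has a validation split (hypothetical data).
validation_only = {"validation": ["img", "label"]}
# Stand-in for a Cifar10-like dataset with train and test splits.
train_and_test = {"train": ["img", "label"], "test": ["img", "label"]}

needs_rename_validation_only = "img" in columns_of_first_split(validation_only)
needs_rename_train_and_test = "img" in columns_of_first_split(train_and_test)
```

Indexing `validation_only["train"]` directly would raise a `KeyError`, which is the failure mode the review comment points out.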

albertvillanova · 2024-01-09T10:40:19Z

+    # Rename image and label columns if needed (e.g. Cifar10)
+    if "img" in dataset["train"].features:
+        dataset = dataset.rename_column("img", "image")
+    if "label" in dataset["train"].features:


The same as above.

albertvillanova · 2024-01-09T10:40:30Z

        # https://huggingface.co/docs/datasets/v2.0.0/en/image_process#imagefolder.

+    # Rename image and label columns if needed (e.g. Cifar10)
+    if "img" in dataset["train"].features:


The same as above.

albertvillanova · 2024-01-09T10:40:42Z

+    # Rename image and label columns if needed (e.g. Cifar10)
+    if "img" in dataset["train"].features:
+        dataset = dataset.rename_column("img", "image")
+    if "label" in dataset["train"].features:


The same as above.

albertvillanova · 2024-01-09T10:46:50Z

Related to this PR, I opened an issue in datasets to improve the user experience when using DatasetDict.column_names:

Make DatasetDict.column_names return a list instead of dict datasets#6571

HuggingFaceDocBuilderDev · 2024-01-09T14:23:06Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

mariosasko · 2024-01-09T14:29:29Z

I think we should also add image_column_name (default image) and label_column_name (default label) to the example's DataTrainingArguments to be consistent with audio-classification as renaming img to image is not general enough.
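
As a rough illustration of this proposal, the two arguments could be declared as dataclass fields with the defaults named above. This is a trimmed-down sketch rather than the script's actual `DataTrainingArguments`; the help strings are assumptions.

```python
from dataclasses import dataclass, field, fields


@dataclass
class DataTrainingArguments:
    """Trimmed-down, illustrative version of the example's data arguments."""

    image_column_name: str = field(
        default="image",
        metadata={"help": "Name of the dataset column that holds the images."},
    )
    label_column_name: str = field(
        default="label",
        metadata={"help": "Name of the dataset column that holds the labels."},
    )


# Defaults cover most image datasets; Cifar10 needs an explicit override.
default_args = DataTrainingArguments()
cifar10_args = DataTrainingArguments(image_column_name="img")
help_texts = {f.name: f.metadata["help"] for f in fields(DataTrainingArguments)}
```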

mariosasko · 2024-01-10T15:47:38Z

+    if data_args.image_column_name != "image" and data_args.image_column_name in (dataset["train"].features if "train" in dataset else dataset["validation"].features):
+        dataset = dataset.rename_column(data_args.image_column_name, "image")
+    if data_args.label_column_name != "labels" and data_args.label_column_name in (dataset["train"].features if "train" in dataset else dataset["validation"].features):
+        dataset = dataset.rename_column(data_args.label_column_name, "labels")


To be consistent with the audio-classification task, I think we should do the following:

Suggested change

-    if data_args.image_column_name != "image" and data_args.image_column_name in (dataset["train"].features if "train" in dataset else dataset["validation"].features):
-        dataset = dataset.rename_column(data_args.image_column_name, "image")
-    if data_args.label_column_name != "labels" and data_args.label_column_name in (dataset["train"].features if "train" in dataset else dataset["validation"].features):
-        dataset = dataset.rename_column(data_args.label_column_name, "labels")
+    dataset_column_names = dataset["train"].column_names if "train" in dataset else dataset["validation"].column_names
+    if data_args.image_column_name not in dataset_column_names:
+        if "img" in dataset_column_names:
+            logger.info(f"Renaming column img to {data_args.image_column_name}")
+            dataset = dataset.rename_column("img", data_args.image_column_name)
+        else:
+            raise ValueError(
+                f"--image_column_name {data_args.image_column_name} not found in dataset '{data_args.dataset_name}'. "
+                "Make sure to set `--image_column_name` to the correct image column - one of "
+                f"{', '.join(dataset_column_names)}."
+            )
+    if data_args.label_column_name not in dataset["train"].column_names:
+        raise ValueError(
+            f"--label_column_name {data_args.label_column_name} not found in dataset '{data_args.dataset_name}'. "
+            "Make sure to set `--label_column_name` to the correct text column - one of "
+            f"{', '.join(dataset_column_names)}."
+        )

(Then, we later need to rename label to labels either via rename_column or inside the train_transforms/val_transforms so that we don't break the script)

If the goal is to have something generic, we should precisely avoid using "img" in the example.
data_args.image_column_name should not be the name of the new column, we already know it is "image" like in the task template. It should be the name of the column to rename so that users can easily run the script if the image column name is not "img" or "image".
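
The validation idea from this thread can be sketched on a plain list of column names, with no special-casing of "img". The helper `check_column_names` is hypothetical and only illustrates the approach; it is not the code that was merged. Both checks use the same column list, so a dataset without a "train" split would not trigger a `KeyError`.

```python
from typing import List, Optional


def check_column_names(
    dataset_column_names: List[str],
    image_column_name: str = "image",
    label_column_name: str = "label",
    dataset_name: Optional[str] = None,
) -> None:
    """Raise a ValueError if a requested column is missing (hypothetical helper)."""
    requested = {
        "--image_column_name": image_column_name,
        "--label_column_name": label_column_name,
    }
    for flag, column in requested.items():
        if column not in dataset_column_names:
            raise ValueError(
                f"{flag} {column} not found in dataset '{dataset_name}'. "
                f"Make sure to set `{flag}` to the correct column - one of "
                f"{', '.join(dataset_column_names)}."
            )


cifar10_columns = ["img", "label"]  # column names as described in the thread

# Passing the actual image column name succeeds silently.
check_column_names(cifar10_columns, image_column_name="img", dataset_name="cifar10")

# Relying on the default "image" fails with an actionable message.
try:
    check_column_names(cifar10_columns, dataset_name="cifar10")
    error_message = None
except ValueError as err:
    error_message = str(err)
```

With this shape, the user passes the name of the existing column (for example `img` for Cifar10), and the script can then map it to whatever name the transforms expect.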

ArthurZucker

LGTM, let's make sure you rebase and the CIs are green

ArthurZucker · 2024-01-15T15:49:58Z


 ```bash
-accelerate launch run_image_classification_trainer.py
+accelerate launch run_image_classification_no_trainer.py --image_column_name img
 ```


The default should not need this, no?

I set the default to image because that's the name of the image column for most image datasets. But Cifar10, which is used in this example, has it named img for some reason.

Got it, thanks.

ArthurZucker · 2024-01-16T07:04:04Z

Thanks @regisss

Remove task arg in load_dataset in image-classification example

9dbf3ff

regisss requested a review from albertvillanova January 9, 2024 10:02

albertvillanova reviewed Jan 9, 2024

Manage case where "train" is not in dataset

9e6157c

regisss requested review from albertvillanova and lhoestq January 9, 2024 14:05

regisss mentioned this pull request Jan 9, 2024

Fix error in run_image_classification.py huggingface/optimum-habana#631

Merged

3 tasks

Add new args to manage image and label column names

53f84f2

regisss requested a review from mariosasko January 10, 2024 09:32

mariosasko reviewed Jan 10, 2024

regisss added 2 commits January 11, 2024 00:18

Similar to audio-classification example

6916845

Fix README

a7f1d8e

regisss requested review from ArthurZucker and mariosasko January 10, 2024 23:35

Merge branch 'main' into remove_task_image_classification_example

b79d5f0

ArthurZucker approved these changes Jan 15, 2024

regisss added 2 commits January 15, 2024 17:00

Merge branch 'main' into remove_task_image_classification_example

db2b580

Update tests

445f30a

ArthurZucker merged commit 0cdcd7a into huggingface:main Jan 16, 2024

regisss deleted the remove_task_image_classification_example branch January 16, 2024 08:57
