Remove task arg in load_dataset in image-classification example#28408
Conversation
albertvillanova
left a comment
There was a problem hiding this comment.
Thanks for the fix. Indeed, the tasks were deprecated long ago and we should definitely stop promoting its use.
Just some comment below about a possible edge case.
| ) | ||
|
|
||
| # Rename image and label columns if needed (e.g. Cifar10) | ||
| if "img" in dataset["train"].features: |
There was a problem hiding this comment.
Note that if data_args.dataset_name is None and data_args.train_dir is None, then the dataset dict will not have a "train" key, and the line above will raise a KeyError.
This could be avoided by replacing the line above with:
| if "img" in dataset["train"].features: | |
| if "img" in list(dataset.column_names.values())[0]: |
There was a problem hiding this comment.
Or:
| if "img" in dataset["train"].features: | |
| if "img" in next(iter(dataset.column_names.values())): |
There was a problem hiding this comment.
maybe more readable ?
| if "img" in dataset["train"].features: | |
| if "img" in (dataset["train"].features if "train" in dataset else dataset["validation"].features): |
and also compatible with your suggestion at huggingface/datasets#6571
There was a problem hiding this comment.
actually it seems this script always do training no ? in this case you can assume "train" is always present
There was a problem hiding this comment.
Indeed it's probably mostly used for training, but I'm going to add the suggestion anyway in case.
| # Rename image and label columns if needed (e.g. Cifar10) | ||
| if "img" in dataset["train"].features: | ||
| dataset = dataset.rename_column("img", "image") | ||
| if "label" in dataset["train"].features: |
| # https://huggingface.co/docs/datasets/v2.0.0/en/image_process#imagefolder. | ||
|
|
||
| # Rename image and label columns if needed (e.g. Cifar10) | ||
| if "img" in dataset["train"].features: |
| # Rename image and label columns if needed (e.g. Cifar10) | ||
| if "img" in dataset["train"].features: | ||
| dataset = dataset.rename_column("img", "image") | ||
| if "label" in dataset["train"].features: |
|
Related to this PR, I opened an issue in |
|
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
|
I think we should also add |
| if data_args.image_column_name != "image" and data_args.image_column_name in (dataset["train"].features if "train" in dataset else dataset["validation"].features): | ||
| dataset = dataset.rename_column(data_args.image_column_name, "image") | ||
| if data_args.label_column_name != "labels" and data_args.label_column_name in (dataset["train"].features if "train" in dataset else dataset["validation"].features): | ||
| dataset = dataset.rename_column(data_args.label_column_name, "labels") |
There was a problem hiding this comment.
To be consistent with the audio-classification task, I think we should do the following:
| if data_args.image_column_name != "image" and data_args.image_column_name in (dataset["train"].features if "train" in dataset else dataset["validation"].features): | |
| dataset = dataset.rename_column(data_args.image_column_name, "image") | |
| if data_args.label_column_name != "labels" and data_args.label_column_name in (dataset["train"].features if "train" in dataset else dataset["validation"].features): | |
| dataset = dataset.rename_column(data_args.label_column_name, "labels") | |
| dataset_column_names = dataset["train"].column_names if "train" in dataset else dataset["validation"].column_names | |
| if data_args.image_column_name not in dataset_column_names: | |
| if "img" in dataset_column_names: | |
| logger.info(f"Renaming column img to {data_args.image_column_name}") | |
| dataset = dataset.rename_column("img", data_args.image_column_name) | |
| else: | |
| raise ValueError( | |
| f"--image_column_name {data_args.image_column_name} not found in dataset '{data_args.dataset_name}'. " | |
| "Make sure to set `--image_column_name` to the correct image column - one of " | |
| f"{', '.join(dataset_column_names)}." | |
| ) | |
| if data_args.label_column_name not in dataset["train"].column_names: | |
| raise ValueError( | |
| f"--label_column_name {data_args.label_column_name} not found in dataset '{data_args.dataset_name}'. " | |
| "Make sure to set `--label_column_name` to the correct text column - one of " | |
| f"{', '.join(dataset_column_names)}." | |
| ) |
(Then, we later need to rename label to labels either via rename_column or inside the train_transforms/val_transforms so that we don't break the script)
There was a problem hiding this comment.
If the goal is to have something generic, we should precisely avoid using "img" in the example.
data_args.image_column_name should not be the name of the new column, we already know it is "image" like in the task template. It should be the name of the column to rename so that users can easily run the script if the image column name is not "img" or "image".
ArthurZucker
left a comment
There was a problem hiding this comment.
LGTM, let's make sure you rebase and the CIs are green
|
|
||
| ```bash | ||
| accelerate launch run_image_classification_trainer.py | ||
| accelerate launch run_image_classification_no_trainer.py --image_column_name img |
There was a problem hiding this comment.
the default should not need this no?
There was a problem hiding this comment.
I set the default to image because that's the name of the image column for most image datasets. But Cifar10, which is used in this example, has it named img for some reason.
|
Thanks @regisss |
What does this PR do?
The
taskargument is now deprecated in thedatasets.load_datasetmethod. This PR removes it and adds the renaming logic needed to deal with datasets like Cifar10 (thetaskattribute of datasets used to help with that).Internal discussion here: https://huggingface.slack.com/archives/C034N0A7H09/p1704447848692889
Before submitting
Pull Request section?
to it if that's the case.
documentation guidelines, and
here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.