Skip to content

Conversation

@sgugger
Copy link
Collaborator

@sgugger sgugger commented Jan 20, 2023

What does this PR do?

This PR cleans up all docstrings following up from #20757 and #21199. It removes the need for the processor_class in TensorFlow and Flax generic examples by setting in the examples like #20757 did for PyTorch then makes a full pass across all models to clean up the docstrings (removing the processor_classin theadd_code_sample` decorator, remove random outputs, use the auto classes for preprocessing).

Note that in some cases we can't use the auto-classes for preprocessing: when linking to the __call__ method of a processor or image processor, we need the actual class (cc @amyeroberts I changed a couple of things you did here).

@HuggingFaceDocBuilderDev
Copy link

HuggingFaceDocBuilderDev commented Jan 20, 2023

The documentation is not available anymore as the PR was closed or merged.

@ydshieh
Copy link
Collaborator

ydshieh commented Jan 23, 2023

Thank you @sgugger for cleaning this up. With all ~250 files, I will trust you instead of look lines by lines, except one question below.

I would definitely prefer to run a doctest first offline before merging this PR - for which I can launch on my side. From previous PRs, it has shown there are always some surprise. I will launch doctest CI when all reviewers give their approval.

So here my question

Note that in some cases we can't use the auto-classes for preprocessing: when linking to the call method of a processor or image processor, we need the actual class (cc @amyeroberts I changed a couple of things you did here).

I see even in such places, we still have

Pixel values can be obtained using [`AutoImageProcessor`].  See [`ConvNextImageProcessor.__call__`] for details.

I don't have much context and prior knowledge, but is it true we want to use AutoImageProcessor but ConvNextImageProcessor.__call__ in such cases?

@amyeroberts amyeroberts mentioned this pull request Jan 23, 2023
5 tasks
@sgugger
Copy link
Collaborator Author

sgugger commented Jan 23, 2023

With all ~250 files, I will trust you instead of look lines by lines.

A review would still be much appreciated, as it could catch accidental typos.

I would definitely prefer to run a doctest first offline before merging this PR - for which I can launch on my side. From previous PRs, it has shown there are always some surprise. I will launch doctest CI when all reviewers give their approval.

Sure, we can wait for that as long as the results are available before the release branch is cut.

I don't have much context and prior knowledge, but is it true we want to use AutoImageProcessor but ConvNextImageProcessor.call in such cases?

Yes.

@ydshieh
Copy link
Collaborator

ydshieh commented Jan 23, 2023

I triggered the doctest CI against the (last) commit (so far) in this PR. Will take a look on the PR changes too :-)

run page

Copy link
Member

@LysandreJik LysandreJik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Impressive PR! Only found a couple of typos.

Comment on lines 540 to 543
tokenizer = AutoTokenizer.from_pretrained(mname) >>> UTTERANCE = "My friends are cool but they eat too many
carbs." >>> print("Human: ", UTTERANCE) >>> inputs = tokenizer([UTTERANCE], return_tensors='tf') >>> reply_ids
= model.generate(**inputs) >>> print("Bot: ", tokenizer.batch_decode(reply_ids, skip_special_tokens=True)[0])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's an issue in this docstring's format

Comment on lines 1497 to 1499
>>> from transformers import AutoTokenizer, FlaxBlenderbotSmallForConditionalGeneration >>> tokenizer =
AutoTokenizer.from_pretrained('facebook/blenderbot_small-90M') >>> TXT = "My friends are <mask> but they eat
too many carbs."
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here

Comment on lines 539 to 540
>>> from transformers import BlenderbotSmallTokenizer, TFBlenderbotSmallForConditionalGeneration >>> mname =
>>> from transformers import AutoTokenizer, TFBlenderbotSmallForConditionalGeneration >>> mname =
'facebook/blenderbot_small-90M' >>> model = BlenderbotSmallForConditionalGeneration.from_pretrained(mname) >>>
tokenizer = BlenderbotSmallTokenizer.from_pretrained(mname)
tokenizer = AutoTokenizer.from_pretrained(mname)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here too

Copy link
Collaborator

@ydshieh ydshieh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leave some minor comments. (one is mentioned by Lysandre already I believe).
Thank you again!

_CHECKPOINT_FOR_QA = "ydshieh/bert-base-cased-squad2"
_QA_EXPECTED_OUTPUT = "'a nice puppet'"
_QA_EXPECTED_LOSS = 7.41
_QA_TARGET_START_INDEX = 14
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently these are not used, so I understand the reason you remove them.

But we should instead keep here and use them in add_code_sample_docstrings for BertForQuestionAnswering.

(without using it, it works currently due to the default values 14 and 15 specified in def add_code_sample_docstrings - this is not the best approach though)


_CONFIG_FOR_DOC = "BlenderbotConfig"
_TOKENIZER_FOR_DOC = "BlenderbotTokenizer"
_CHECKPOINT_FOR_DOC = "facebook/blenderbot-400M-distill"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just wanna pointing out this attribute in pipeline testing (to generate tests)

    if hasattr(module, "_CHECKPOINT_FOR_DOC"):
        return module._CHECKPOINT_FOR_DOC
    else:
        logger.warning(f"Can't retrieve checkpoint from {architecture.__name__}")

I am ok with this change though, as we plan to use new tiny models for pipeline tests (which doesn't need this attribute anymore)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing the _CHECKPOINT_FOR_DOC here is my mistake.

Summarization example:
>>> from transformers import BlenderbotSmallTokenizer, FlaxBlenderbotSmallForConditionalGeneration
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the docstring in this block seems broken style, but this is not introduced in this PR. Good to fix it here if you find time.

BLENDERBOT_SMALL_GENERATION_EXAMPLE = r"""
Conversation example::
>>> from transformers import BlenderbotSmallTokenizer, TFBlenderbotSmallForConditionalGeneration >>> mname =
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doc example broken (not from this PR though)

```python
>>> from PIL import Image
>>> import requests
>>> from transformers import CLIPProcessor, CLIPVisionModel
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(just for the record) For CLIP-like models, we sometimes use the Processor to process text components or vision models, but other times use tokenizer and image processor to do this.


_CHECKPOINT_FOR_DOC = "distilbert-base-uncased"
_CONFIG_FOR_DOC = "DistilBertConfig"
_TOKENIZER_FOR_DOC = "DistilBertTokenizer"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can't comment on the exact line, but this file has

Indices can be obtained using [`BertTokenizer`]. See [`PreTrainedTokenizer.encode`] 

output_type=SequenceClassifierOutputWithPast,
config_class=_CONFIG_FOR_DOC,
expected_output="'LABEL_0'",
expected_loss=5.28,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why we remove expected ouptuts/loss here? Just because we don't want to see LABEL_0?

The checkpoint is real one.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LABEL_0 is not informative at all and the loss seems very high, I don't think this is a model trained for sequence classification.


_CONFIG_FOR_DOC = "TapasConfig"
_TOKENIZER_FOR_DOC = "TapasTokenizer"
_TOKENIZER_FOR_DOC = "google/tapas-base"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_TOKENIZER_FOR_DOC = "google/tapas-base" one line below should be deleted too.

@sgugger sgugger merged commit fd5cdae into main Jan 23, 2023
@sgugger sgugger deleted the models_docstring branch January 23, 2023 19:33
@sgugger sgugger mentioned this pull request Jan 24, 2023
5 tasks
pziecina-nv added a commit to triton-inference-server/pytriton that referenced this pull request Feb 1, 2023
Remove obsolete tokenizer doc from append_call_sample_docstring
to match API changes in huggingface/transformers#21225
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants