Models docstring #21225
Conversation
The documentation is not available anymore as the PR was closed or merged.
Thank you @sgugger for cleaning this up. With all ~250 files, I will trust you instead of looking line by line, except for one question below. I would definitely prefer to run a doctest offline first before merging this PR, which I can launch on my side. Previous PRs have shown there are always some surprises. I will launch the doctest CI when all reviewers give their approval. So here is my question:

I see that even in such places, we still have "Pixel values can be obtained using [`AutoImageProcessor`]. See [`ConvNextImageProcessor.__call__`] for details." I don't have much context or prior knowledge, but is it true we want to use [`ConvNextImageProcessor.__call__`] here rather than the auto class?
A review would still be much appreciated, as it could catch accidental typos.
Sure, we can wait for that as long as the results are available before the release branch is cut.
Yes.
I triggered the doctest CI against the last commit (so far) in this PR. Will take a look at the PR changes too :-)
LysandreJik left a comment
Impressive PR! Only found a couple of typos.
```python
tokenizer = AutoTokenizer.from_pretrained(mname) >>> UTTERANCE = "My friends are cool but they eat too many
carbs." >>> print("Human: ", UTTERANCE) >>> inputs = tokenizer([UTTERANCE], return_tensors='tf') >>> reply_ids
= model.generate(**inputs) >>> print("Bot: ", tokenizer.batch_decode(reply_ids, skip_special_tokens=True)[0])
```
There's an issue in this docstring's format
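For reference, a properly formatted doctest would put each statement on its own line, something like this (a sketch only; `mname` and `model` are assumed to be defined earlier in the same example):

```python
>>> tokenizer = AutoTokenizer.from_pretrained(mname)
>>> UTTERANCE = "My friends are cool but they eat too many carbs."
>>> print("Human: ", UTTERANCE)
>>> inputs = tokenizer([UTTERANCE], return_tensors="tf")
>>> reply_ids = model.generate(**inputs)
>>> print("Bot: ", tokenizer.batch_decode(reply_ids, skip_special_tokens=True)[0])
```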
```python
>>> from transformers import AutoTokenizer, FlaxBlenderbotSmallForConditionalGeneration >>> tokenizer =
AutoTokenizer.from_pretrained('facebook/blenderbot_small-90M') >>> TXT = "My friends are <mask> but they eat
too many carbs."
```
same here
```diff
->>> from transformers import BlenderbotSmallTokenizer, TFBlenderbotSmallForConditionalGeneration >>> mname =
+>>> from transformers import AutoTokenizer, TFBlenderbotSmallForConditionalGeneration >>> mname =
 'facebook/blenderbot_small-90M' >>> model = BlenderbotSmallForConditionalGeneration.from_pretrained(mname) >>>
-tokenizer = BlenderbotSmallTokenizer.from_pretrained(mname)
+tokenizer = AutoTokenizer.from_pretrained(mname)
```
here too
ydshieh left a comment
Leaving some minor comments (one was mentioned by Lysandre already, I believe).
Thank you again!
```diff
-_CHECKPOINT_FOR_QA = "ydshieh/bert-base-cased-squad2"
-_QA_EXPECTED_OUTPUT = "'a nice puppet'"
-_QA_EXPECTED_LOSS = 7.41
-_QA_TARGET_START_INDEX = 14
```
Currently these are not used, so I understand why you removed them.
But we should instead keep them here and use them in `add_code_sample_docstrings` for `BertForQuestionAnswering`.
(Without using them, it currently works due to the default values 14 and 15 specified in `def add_code_sample_docstrings` - this is not the best approach though.)
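Something along these lines (a sketch only; it assumes a `_QA_TARGET_END_INDEX` constant exists alongside the start index, and that `add_code_sample_docstrings` accepts `qa_target_start_index`/`qa_target_end_index` keyword arguments, as its defaults of 14 and 15 suggest):

```python
@add_code_sample_docstrings(
    checkpoint=_CHECKPOINT_FOR_QA,
    output_type=QuestionAnsweringModelOutput,
    config_class=_CONFIG_FOR_DOC,
    qa_target_start_index=_QA_TARGET_START_INDEX,
    qa_target_end_index=_QA_TARGET_END_INDEX,
    expected_output=_QA_EXPECTED_OUTPUT,
    expected_loss=_QA_EXPECTED_LOSS,
)
def forward(self, input_ids=None, attention_mask=None, **kwargs):
    ...
```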
```diff
 _CONFIG_FOR_DOC = "BlenderbotConfig"
-_TOKENIZER_FOR_DOC = "BlenderbotTokenizer"
-_CHECKPOINT_FOR_DOC = "facebook/blenderbot-400M-distill"
```
Just want to point out that this attribute is used in pipeline testing (to generate tests):

```python
if hasattr(module, "_CHECKPOINT_FOR_DOC"):
    return module._CHECKPOINT_FOR_DOC
else:
    logger.warning(f"Can't retrieve checkpoint from {architecture.__name__}")
```

I am OK with this change though, as we plan to use new tiny models for pipeline tests (which don't need this attribute anymore).
Removing the `_CHECKPOINT_FOR_DOC` here is my mistake.
```python
    Summarization example:

    >>> from transformers import BlenderbotSmallTokenizer, FlaxBlenderbotSmallForConditionalGeneration
```
The docstring in this block seems to have broken styling, but that was not introduced in this PR. It would be good to fix it here if you find time.
```python
BLENDERBOT_SMALL_GENERATION_EXAMPLE = r"""
    Conversation example::

    >>> from transformers import BlenderbotSmallTokenizer, TFBlenderbotSmallForConditionalGeneration >>> mname =
```
doc example broken (not from this PR though)
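A fixed version would look something like the following (a sketch only; the truncated `mname` assignment is completed with the checkpoint quoted elsewhere in this thread, and the tokenizer import follows this PR's switch to the auto class):

````python
BLENDERBOT_SMALL_GENERATION_EXAMPLE = r"""
    Conversation example:

    ```python
    >>> from transformers import AutoTokenizer, TFBlenderbotSmallForConditionalGeneration

    >>> mname = "facebook/blenderbot_small-90M"
    >>> model = TFBlenderbotSmallForConditionalGeneration.from_pretrained(mname)
    >>> tokenizer = AutoTokenizer.from_pretrained(mname)
    ```
"""
````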
```python
>>> from PIL import Image
>>> import requests

>>> from transformers import CLIPProcessor, CLIPVisionModel
```
(Just for the record) For CLIP-like models, we sometimes use the processor to preprocess the text or vision components, but other times use the tokenizer and image processor directly.
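Roughly the two styles in question (a sketch; the checkpoint and image URL are the ones commonly used in the CLIP doc examples):

```python
from PIL import Image
import requests

from transformers import CLIPImageProcessor, CLIPProcessor

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Style 1: the combined processor handles both modalities.
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
inputs = processor(images=image, return_tensors="pt")

# Style 2: the image processor alone, when only pixel values are needed.
image_processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")
inputs = image_processor(images=image, return_tensors="pt")
```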
```python
_CHECKPOINT_FOR_DOC = "distilbert-base-uncased"
_CONFIG_FOR_DOC = "DistilBertConfig"
_TOKENIZER_FOR_DOC = "DistilBertTokenizer"
```
Can't comment on the exact line, but this file has: "Indices can be obtained using [`BertTokenizer`]. See [`PreTrainedTokenizer.encode`]"

```diff
 output_type=SequenceClassifierOutputWithPast,
 config_class=_CONFIG_FOR_DOC,
-expected_output="'LABEL_0'",
-expected_loss=5.28,
```
Why do we remove the expected outputs/loss here? Just because we don't want to see `LABEL_0`? The checkpoint is a real one.
`LABEL_0` is not informative at all and the loss seems very high; I don't think this is a model trained for sequence classification.
```diff
 _CONFIG_FOR_DOC = "TapasConfig"
-_TOKENIZER_FOR_DOC = "TapasTokenizer"
 _TOKENIZER_FOR_DOC = "google/tapas-base"
```
The `_TOKENIZER_FOR_DOC = "google/tapas-base"` one line below should be deleted too.
Remove obsolete tokenizer doc from append_call_sample_docstring to match API changes in huggingface/transformers#21225
What does this PR do?
This PR cleans up all docstrings, following up from #20757 and #21199. It removes the need for the `processor_class` in the TensorFlow and Flax generic examples by setting it in the examples like #20757 did for PyTorch, then makes a full pass across all models to clean up the docstrings (removing the `processor_class` in the `add_code_sample` decorator, removing random outputs, and using the auto classes for preprocessing).

Note that in some cases we can't use the auto classes for preprocessing: when linking to the `__call__` method of a processor or image processor, we need the actual class (cc @amyeroberts, I changed a couple of things you did here).
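The decorator change described above looks roughly like this on a typical model (a sketch only; the exact argument lists vary per model):

```diff
 @add_code_sample_docstrings(
-    processor_class=_TOKENIZER_FOR_DOC,
     checkpoint=_CHECKPOINT_FOR_DOC,
     output_type=SequenceClassifierOutput,
     config_class=_CONFIG_FOR_DOC,
 )
 def forward(self, input_ids=None, attention_mask=None, **kwargs):
```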