24 changes: 24 additions & 0 deletions .circleci/config.yml
@@ -3,6 +3,22 @@ orbs:
gcp-gke: circleci/[email protected]
go: circleci/[email protected]

commands:
skip-job-on-doc-only-changes:
description: "Do not continue this job and exit with success for PRs with only doc changes"
steps:

- run:
name: docs-only changes skip check
command: |
if git diff --name-only << pipeline.git.base_revision >>...<< pipeline.git.revision >> | egrep -qv '\.(md|rst)$'
then
echo "Non-docs were modified in this PR, proceeding normally"
else
echo "Only docs were modified in this PR, quitting this job"
circleci step halt
fi

# TPU REFERENCES
references:
checkout_ml_testing: &checkout_ml_testing
@@ -72,6 +88,7 @@ jobs:
parallelism: 1
steps:
- checkout
- skip-job-on-doc-only-changes
- restore_cache:
keys:
- v0.4-torch_and_tf-{{ checksum "setup.py" }}
@@ -98,6 +115,7 @@ jobs:
parallelism: 1
steps:
- checkout
- skip-job-on-doc-only-changes
- restore_cache:
keys:
- v0.4-torch-{{ checksum "setup.py" }}
@@ -124,6 +142,7 @@ jobs:
parallelism: 1
steps:
- checkout
- skip-job-on-doc-only-changes
- restore_cache:
keys:
- v0.4-tf-{{ checksum "setup.py" }}
@@ -150,6 +169,7 @@ jobs:
parallelism: 1
steps:
- checkout
- skip-job-on-doc-only-changes
- restore_cache:
keys:
- v0.4-flax-{{ checksum "setup.py" }}
@@ -176,6 +196,7 @@ jobs:
parallelism: 1
steps:
- checkout
- skip-job-on-doc-only-changes
- restore_cache:
keys:
- v0.4-torch-{{ checksum "setup.py" }}
@@ -202,6 +223,7 @@ jobs:
parallelism: 1
steps:
- checkout
- skip-job-on-doc-only-changes
- restore_cache:
keys:
- v0.4-tf-{{ checksum "setup.py" }}
@@ -226,6 +248,7 @@ jobs:
RUN_CUSTOM_TOKENIZERS: yes
steps:
- checkout
- skip-job-on-doc-only-changes
- restore_cache:
keys:
- v0.4-custom_tokenizers-{{ checksum "setup.py" }}
@@ -253,6 +276,7 @@ jobs:
parallelism: 1
steps:
- checkout
- skip-job-on-doc-only-changes
- restore_cache:
keys:
- v0.4-torch_examples-{{ checksum "setup.py" }}
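The core of this change is the `skip-job-on-doc-only-changes` command above: `egrep -qv '\.(md|rst)$'` succeeds only when at least one changed path does *not* end in `.md` or `.rst`, so the job halts when every changed file is documentation. A hedged Python sketch of that same classification logic (not part of the PR; the function name is made up for illustration):

```python
# Hypothetical illustration of the docs-only test performed by the shell check:
# the job is halted only when every changed path ends in .md or .rst.
import re

def is_docs_only(changed_paths):
    """Return True when all changed files are .md or .rst documents."""
    return all(re.search(r"\.(md|rst)$", path) for path in changed_paths)

print(is_docs_only(["docs/source/quicktour.rst", "README.md"]))  # True  -> halt the job
print(is_docs_only(["docs/source/quicktour.rst", "setup.py"]))   # False -> run the tests
```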
2 changes: 1 addition & 1 deletion CONTRIBUTING.md
@@ -125,7 +125,7 @@ Follow these steps to start contributing:
$ git checkout -b a-descriptive-name-for-my-changes
```

**do not** work on the `master` branch.
**Do not** work on the `master` branch.

4. Set up a development environment by running the following command in a virtual environment:

3 changes: 1 addition & 2 deletions docs/source/preprocessing.rst
@@ -2,7 +2,6 @@ Preprocessing data
=======================================================================================================================

In this tutorial, we'll explore how to preprocess your data using 🤗 Transformers. The main tool for this is what we

call a :doc:`tokenizer <main_classes/tokenizer>`. You can build one using the tokenizer class associated to the model
you would like to use, or directly with the :class:`~transformers.AutoTokenizer` class.

@@ -52,7 +51,7 @@ The tokenizer can decode a list of token ids in a proper sentence:
"[CLS] Hello, I'm a single sentence! [SEP]"

As you can see, the tokenizer automatically added some special tokens that the model expects. Not all models need
special tokens; for instance, if we had used` gtp2-medium` instead of `bert-base-cased` to create our tokenizer, we
special tokens; for instance, if we had used `gpt2-medium` instead of `bert-base-cased` to create our tokenizer, we
would have seen the same sentence as the original one here. You can disable this behavior (which is only advised if you
have added those special tokens yourself) by passing ``add_special_tokens=False``.

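As a quick illustration of the `add_special_tokens=False` behaviour described in the corrected paragraph above, a minimal sketch (not part of this PR; it assumes `bert-base-cased` can be downloaded):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

with_special = tokenizer("Hello, I'm a single sentence!")
without_special = tokenizer("Hello, I'm a single sentence!", add_special_tokens=False)

print(tokenizer.decode(with_special["input_ids"]))     # [CLS] Hello, I'm a single sentence! [SEP]
print(tokenizer.decode(without_special["input_ids"]))  # Hello, I'm a single sentence!
```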
4 changes: 3 additions & 1 deletion docs/source/quicktour.rst
@@ -240,7 +240,9 @@ activations of the model.
[ 0.08181786, -0.04179301]], dtype=float32)>,)

The model can return more than just the final activations, which is why the output is a tuple. Here we only asked for
the final activations, so we get a tuple with one element. .. note::
the final activations, so we get a tuple with one element.

.. note::

All 🤗 Transformers models (PyTorch or TensorFlow) return the activations of the model *before* the final activation
function (like SoftMax) since this final activation function is often fused with the loss.
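Since the note fixed above explains that models return activations *before* the final SoftMax, here is a minimal sketch (not from the PR; the logits are dummy values standing in for `outputs[0]`) of applying that activation yourself:

```python
import tensorflow as tf

# Dummy logits standing in for the `outputs[0]` tensor shown in the quicktour excerpt.
logits = tf.constant([[-4.0833, 4.3364], [0.0818, -0.0418]])
probs = tf.nn.softmax(logits, axis=-1)  # each row now sums to 1
print(probs)
```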
4 changes: 2 additions & 2 deletions docs/source/serialization.rst
@@ -70,8 +70,8 @@ inference.
optimizations afterwards.

.. note::
For more information about the optimizations enabled by ONNXRuntime, please have a look at the (`ONNXRuntime Github
<https://github.com/microsoft/onnxruntime/tree/master/onnxruntime/python/tools/transformers>`_)
For more information about the optimizations enabled by ONNXRuntime, please have a look at the `ONNXRuntime Github
<https://github.com/microsoft/onnxruntime/tree/master/onnxruntime/python/tools/transformers>`_.

Quantization
-----------------------------------------------------------------------------------------------------------------------
19 changes: 9 additions & 10 deletions src/transformers/convert_slow_tokenizer.py
@@ -547,10 +547,12 @@ class BertGenerationConverter(SpmConverter):
class PegasusConverter(SpmConverter):
def vocab(self, proto):
vocab = [
(self.original_tokenizer.pad_token, 0),
(self.original_tokenizer.eos_token, 0),
(self.original_tokenizer.pad_token, 0.0),
(self.original_tokenizer.eos_token, 0.0),
(self.original_tokenizer.mask_token_sent, 0.0),
(self.original_tokenizer.mask_token, 0.0),
]
vocab += [(f"unk_{i}", -100) for i in range(2, 2 + self.original_tokenizer.offset)]
vocab += [(f"<unk_{i}>", -100.0) for i in range(2, self.original_tokenizer.offset)]
vocab += [(piece.piece, piece.score) for piece in proto.pieces[2:]]
return vocab

@@ -559,13 +561,10 @@ def unk_id(self, proto):

def post_processor(self):
eos = self.original_tokenizer.eos_token
return processors.TemplateProcessing(
single=["$A", eos],
pair=["$A", "$B", eos],
special_tokens=[
(eos, self.original_tokenizer.eos_token_id),
],
)
special_tokens = [
(eos, self.original_tokenizer.eos_token_id),
]
return processors.TemplateProcessing(single=["$A", eos], pair=["$A", "$B", eos], special_tokens=special_tokens)


class T5Converter(SpmConverter):
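To make the vocab change above concrete, a hedged sketch (not part of the PR; the token strings and the `offset` value are made up for brevity) of the vocabulary prefix the updated `PegasusConverter.vocab()` builds before appending the SentencePiece pieces:

```python
# Hypothetical stand-ins for the slow tokenizer's attributes.
pad_token, eos_token = "<pad>", "</s>"
mask_token_sent, mask_token = "<mask_1>", "<mask_2>"
offset = 5  # the real value comes from the slow Pegasus tokenizer

vocab = [
    (pad_token, 0.0),
    (eos_token, 0.0),
    (mask_token_sent, 0.0),
    (mask_token, 0.0),
]
vocab += [(f"<unk_{i}>", -100.0) for i in range(2, offset)]
print(vocab)
# [('<pad>', 0.0), ('</s>', 0.0), ('<mask_1>', 0.0), ('<mask_2>', 0.0),
#  ('<unk_2>', -100.0), ('<unk_3>', -100.0), ('<unk_4>', -100.0)]
```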
6 changes: 3 additions & 3 deletions src/transformers/data/data_collator.py
@@ -20,14 +20,14 @@

def default_data_collator(features: List[InputDataClass]) -> Dict[str, torch.Tensor]:
"""
Very simple data collator that simply collates batches of dict-like objects and erforms special handling for
Very simple data collator that simply collates batches of dict-like objects and performs special handling for
potential keys named:

- ``label``: handles a single value (int or float) per object
- ``label_ids``: handles a list of values per object

Des not do any additional preprocessing: property names of the input object will be used as corresponding inputs to
the model. See glue and ner for example of how it's useful.
Does not do any additional preprocessing: property names of the input object will be used as corresponding inputs
to the model. See glue and ner for example of how it's useful.
"""

# In this function we'll make the assumption that all `features` in the batch
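For context on the corrected docstring above, a small usage sketch (not part of the PR) of `default_data_collator`, assuming equal-length `input_ids` so the lists stack cleanly into tensors:

```python
from transformers import default_data_collator

features = [
    {"input_ids": [101, 7592, 102], "label": 0},
    {"input_ids": [101, 2088, 102], "label": 1},
]
batch = default_data_collator(features)

print(batch["input_ids"].shape)  # torch.Size([2, 3])
print(batch["labels"])           # tensor([0, 1]) -- `label` is renamed to `labels`
```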