8 changes: 4 additions & 4 deletions examples/image_classification-tf.ipynb
@@ -1201,7 +1201,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We need to convert our datasets to a format Keras understands. The easiest way to do this is with the `to_tf_dataset()` method. Note that our data collators are designed to work for multiple frameworks, so ensure you set the `return_tensors='tf'` argument to get TensorFlow tensors out - you don't want to accidentally get a load of `torch.Tensor` objects in the middle of your nice TF code!"
"We need to convert our datasets to a format Keras understands. The easiest way to do this is with the `to_tf_dataset()` method. Note that our data collators are designed to work for multiple frameworks, so ensure you set the `return_tensors='np'` argument to get NumPy arrays out - you don't want to accidentally get a load of `torch.Tensor` objects in the middle of your nice TF code! You could also use `return_tensors='tf'` to get TensorFlow tensors, but our `to_tf_dataset` pipeline actually uses a NumPy loader internally, which is wrapped at the end with a `tf.data.Dataset`. As a result, `np` is usually more reliable and performant when you're using it!"
]
},
{
@@ -1219,7 +1219,7 @@
"metadata": {},
"outputs": [],
"source": [
"data_collator = DefaultDataCollator(return_tensors=\"tf\")\n",
"data_collator = DefaultDataCollator(return_tensors=\"np\")\n",
"\n",
"train_set = train_ds.to_tf_dataset(\n",
" columns=[\"pixel_values\", \"label\"],\n",
@@ -3127,7 +3127,7 @@
"hash": "668fb96a716f4e6c0ace6609c578b3593a4af0cc3bba8b3739e6b5cb74dc056a"
},
"kernelspec": {
"display_name": "Python 3.8.12 ('tenv': venv)",
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
@@ -3141,7 +3141,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.12"
"version": "3.10.8"
},
"vscode": {
"interpreter": {
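For context, the pattern the image classification notebook describes above looks roughly like this. This is only a minimal sketch: `train_ds`, the column names, and the batch size stand in for objects defined earlier in that notebook and are not part of this diff.

```python
from transformers import DefaultDataCollator

# A collator that returns NumPy arrays; to_tf_dataset() drives a NumPy loader
# internally and only wraps the result in a tf.data.Dataset at the very end.
data_collator = DefaultDataCollator(return_tensors="np")

# `train_ds` is assumed to be the processed Hugging Face dataset from earlier cells.
train_set = train_ds.to_tf_dataset(
    columns=["pixel_values", "label"],
    shuffle=True,
    batch_size=16,
    collate_fn=data_collator,
)
```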
6 changes: 3 additions & 3 deletions examples/language_modeling-tf.ipynb
@@ -2143,7 +2143,7 @@
"source": [
"Finally, we use a special `data_collator`. The `data_collator` is a function that is responsible for taking the samples and batching them in tensors. In the previous example, we had nothing special to do, so we just used the default for this argument. Here we want to randomly mask tokens. We could do it as a pre-processing step (like the tokenization) but then the tokens would always be masked the same way at each epoch. By doing this step inside the `data_collator`, we ensure this random masking is done in a new way each time we go over the data.\n",
"\n",
"To do this masking for us, the library provides a `DataCollatorForLanguageModeling`. We can adjust the probability of the masking. Note that our data collators are designed to work for multiple frameworks, so ensure you set the `return_tensors='tf'` argument to get Tensorflow tensors out - you don't want to accidentally get a load of `torch.Tensor` objects in the middle of your nice TF code!"
"To do this masking for us, the library provides a `DataCollatorForLanguageModeling`. We can adjust the probability of the masking. Note that our data collators are designed to work for multiple frameworks, so ensure you set the `return_tensors='np'` argument to get NumPy arrays out - you don't want to accidentally get a load of `torch.Tensor` objects in the middle of your nice TF code! You could also use `return_tensors='tf'` to get TensorFlow tensors, but our TF dataset pipeline actually uses a NumPy loader internally, which is wrapped at the end with a `tf.data.Dataset`. As a result, `np` is usually more reliable and performant when you're using it!"
]
},
{
@@ -2157,7 +2157,7 @@
"from transformers import DataCollatorForLanguageModeling\n",
"\n",
"data_collator = DataCollatorForLanguageModeling(\n",
" tokenizer=tokenizer, mlm_probability=0.15, return_tensors=\"tf\"\n",
" tokenizer=tokenizer, mlm_probability=0.15, return_tensors=\"np\"\n",
")"
]
},
@@ -2447,7 +2447,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.4"
"version": "3.10.8"
}
},
"nbformat": 4,
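To illustrate the collator change in the language modeling notebook above, here is a minimal sketch of what the NumPy-returning masking collator produces. The `distilroberta-base` checkpoint and the toy sentences are placeholders for illustration, not taken from this diff.

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")  # placeholder checkpoint

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm_probability=0.15, return_tensors="np"
)

# The collator masks a fresh ~15% of tokens on every call, so each pass over
# the data sees a different masking pattern.
features = [tokenizer("The quick brown fox jumps over the lazy dog.") for _ in range(2)]
batch = data_collator(features)

# NumPy arrays come out; labels are -100 everywhere except the masked positions.
print(type(batch["input_ids"]), batch["labels"])
```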
6 changes: 3 additions & 3 deletions examples/language_modeling_from_scratch-tf.ipynb
@@ -1112,7 +1112,7 @@
"source": [
"Finally, we use a special `data_collator`. The `data_collator` is a function that is responsible for taking the samples and batching them in tensors. In the previous example, we had nothing special to do, so we just used the default for this argument. Here we want to randomly mask tokens. We could do it as a pre-processing step (like the tokenization) but then the tokens would always be masked the same way at each epoch. By doing this step inside the `data_collator`, we ensure this random masking is done in a new way each time we go over the data.\n",
"\n",
"To do this masking for us, the library provides a `DataCollatorForLanguageModeling`. We can adjust the probability of the masking. Make sure to set `return_tensors=\"tf\"` too - the `DataCollator` objects all support multiple frameworks, and we don't want to accidentally get a bunch of `torch.Tensor` objects floating around in our TensorFlow code!"
"To do this masking for us, the library provides a `DataCollatorForLanguageModeling`. We can adjust the probability of the masking. Note that our data collators are designed to work for multiple frameworks, so ensure you set the `return_tensors='np'` argument to get NumPy arrays out - you don't want to accidentally get a load of `torch.Tensor` objects in the middle of your nice TF code! You could also use `return_tensors='tf'` to get TensorFlow tensors, but our TF dataset pipeline actually uses a NumPy loader internally, which is wrapped at the end with a `tf.data.Dataset`. As a result, `np` is usually more reliable and performant when you're using it!"
]
},
{
@@ -1126,7 +1126,7 @@
"from transformers import DataCollatorForLanguageModeling\n",
"\n",
"data_collator = DataCollatorForLanguageModeling(\n",
" tokenizer=tokenizer, mlm_probability=0.15, return_tensors=\"tf\"\n",
" tokenizer=tokenizer, mlm_probability=0.15, return_tensors=\"np\"\n",
")"
]
},
@@ -1316,7 +1316,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.4"
"version": "3.10.8"
}
},
"nbformat": 4,
4 changes: 2 additions & 2 deletions examples/multiple_choice-tf.ipynb
@@ -1038,7 +1038,7 @@
" padding=self.padding,\n",
" max_length=self.max_length,\n",
" pad_to_multiple_of=self.pad_to_multiple_of,\n",
" return_tensors=\"tf\",\n",
" return_tensors=\"np\",\n",
" )\n",
"\n",
" # Un-flatten\n",
@@ -1570,7 +1570,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.4"
"version": "3.10.8"
}
},
"nbformat": 4,
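The multiple choice change above swaps `return_tensors="np"` into a custom collator. A rough sketch of that flatten, pad, un-flatten pattern is below; the feature/label names and the checkpoint are hypothetical, and the real collator lives in the notebook rather than in this diff.

```python
import numpy as np
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder checkpoint


def collate_multiple_choice(features, num_choices=4):
    # Pull the answer index out before padding (hypothetical "label" key).
    labels = [f.pop("label") for f in features]
    # Flatten: each feature holds `num_choices` tokenized candidate sequences.
    flattened = [
        {k: v[i] for k, v in feature.items()}
        for feature in features
        for i in range(num_choices)
    ]
    batch = tokenizer.pad(flattened, padding=True, return_tensors="np")
    # Un-flatten back to (batch_size, num_choices, seq_len).
    batch = {k: v.reshape(len(features), num_choices, -1) for k, v in batch.items()}
    batch["labels"] = np.array(labels, dtype=np.int64)
    return batch
```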
2 changes: 1 addition & 1 deletion examples/protein_language_modeling-tf.ipynb
@@ -2102,7 +2102,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.0"
"version": "3.10.8"
}
},
"nbformat": 4,
2 changes: 1 addition & 1 deletion examples/question_answering-tf.ipynb
@@ -2411,7 +2411,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.4"
"version": "3.10.8"
}
},
"nbformat": 4,
8 changes: 4 additions & 4 deletions examples/summarization-tf.ipynb
@@ -862,7 +862,7 @@
"id": "km3pGVdTIrJc"
},
"source": [
"Then, we need a special kind of data collator, which will not only pad the inputs to the maximum length in the batch, but also the labels. Note that our data collators are multi-framework, so make sure you set `return_tensors='tf'` so you get `tf.Tensor` objects back and not something else!\n",
"Then, we need a special kind of data collator, which will not only pad the inputs to the maximum length in the batch, but also the labels. Note that our data collators are designed to work for multiple frameworks, so ensure you set the `return_tensors='np'` argument to get NumPy arrays out - you don't want to accidentally get a load of `torch.Tensor` objects in the middle of your nice TF code! You could also use `return_tensors='tf'` to get TensorFlow tensors, but our TF dataset pipeline actually uses a NumPy loader internally, which is wrapped at the end with a `tf.data.Dataset`. As a result, `np` is usually more reliable and performant when you're using it!\n",
"\n",
"We also want to compute `ROUGE` metrics, which will require us to generate text from our model. To speed things up, we can compile our generation loop with XLA. This results in a *huge* speedup - up to 100X! The downside of XLA generation, though, is that it doesn't like variable input shapes, because it needs to run a new compilation for each new input shape! To compensate for that, let's use `pad_to_multiple_of` for the dataset we use for text generation. This will reduce the number of unique input shapes a lot, meaning we can get the benefits of XLA generation with only a few compilations."
]
@@ -873,9 +873,9 @@
"metadata": {},
"outputs": [],
"source": [
"data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, return_tensors=\"tf\")\n",
"data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, return_tensors=\"np\")\n",
"\n",
"generation_data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, return_tensors=\"tf\", pad_to_multiple_of=128)"
"generation_data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, return_tensors=\"np\", pad_to_multiple_of=128)"
]
},
{
@@ -1479,7 +1479,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.4"
"version": "3.10.8"
}
},
"nbformat": 4,
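A sketch of the summarization setup described above: two NumPy-returning seq2seq collators (one padded to a multiple of 128 for generation) plus one common way to compile generation with XLA. `tokenizer` and `model` are assumed to be the objects loaded earlier in that notebook.

```python
import tensorflow as tf
from transformers import DataCollatorForSeq2Seq

# `tokenizer` and `model` come from earlier cells in the notebook.
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, return_tensors="np")

# Padding to a multiple of 128 keeps the set of input shapes small, so XLA
# only needs to recompile the generation loop a handful of times.
generation_data_collator = DataCollatorForSeq2Seq(
    tokenizer, model=model, return_tensors="np", pad_to_multiple_of=128
)

# One common pattern for XLA generation: wrap generate() in a jit-compiled tf.function.
xla_generate = tf.function(model.generate, jit_compile=True)
```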
2 changes: 1 addition & 1 deletion examples/text_classification-tf.ipynb
@@ -1471,7 +1471,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.4"
"version": "3.10.8"
}
},
"nbformat": 4,
6 changes: 3 additions & 3 deletions examples/token_classification-tf.ipynb
@@ -1154,7 +1154,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we will need a data collator that will batch our processed examples together while applying padding to make them all the same size (each pad will be padded to the length of its longest example). There is a data collator for this task in the Transformers library, that not only pads the inputs, but also the labels. Note that our data collators support multiple frameworks, so ensure you set `return_tensors='tf'` to get `tf.Tensor` outputs - you don't want to forget it and end up with a pile of `torch.Tensor` messing up your Tensorflow code!"
"Now we will need a data collator that will batch our processed examples together while applying padding to make them all the same size (each pad will be padded to the length of its longest example). There is a data collator for this task in the Transformers library, that not only pads the inputs, but also the labels. Note that our data collators are designed to work for multiple frameworks, so ensure you set the `return_tensors='np'` argument to get NumPy arrays out - you don't want to accidentally get a load of `torch.Tensor` objects in the middle of your nice TF code! You could also use `return_tensors='tf'` to get TensorFlow tensors, but our TF dataset pipeline actually uses a NumPy loader internally, which is wrapped at the end with a `tf.data.Dataset`. As a result, `np` is usually more reliable and performant when you're using it!"
]
},
{
@@ -1165,7 +1165,7 @@
"source": [
"from transformers import DataCollatorForTokenClassification\n",
"\n",
"data_collator = DataCollatorForTokenClassification(tokenizer, return_tensors=\"tf\")"
"data_collator = DataCollatorForTokenClassification(tokenizer, return_tensors=\"np\")"
]
},
{
@@ -1668,7 +1668,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.4"
"version": "3.10.8"
}
},
"nbformat": 4,
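To make the token classification point above concrete, here is a toy batch run through the collator. The checkpoint and input IDs are arbitrary placeholders; the point is that inputs are padded with the tokenizer's pad token while labels are padded with -100 (the default `label_pad_token_id`), so padded positions are ignored by the loss.

```python
from transformers import AutoTokenizer, DataCollatorForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")  # placeholder checkpoint

data_collator = DataCollatorForTokenClassification(tokenizer, return_tensors="np")

features = [
    {"input_ids": [101, 1188, 102], "labels": [-100, 3, -100]},
    {"input_ids": [101, 1188, 1110, 2039, 102], "labels": [-100, 3, 7, 0, -100]},
]
batch = data_collator(features)
print(batch["input_ids"].shape)  # (2, 5) - both examples padded to the longest one
print(batch["labels"][0])        # shorter example's labels padded out with -100
```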
24 changes: 18 additions & 6 deletions examples/translation-tf.ipynb
@@ -880,18 +880,30 @@
"id": "km3pGVdTIrJc"
},
"source": [
"Then, we need a special kind of data collator, which will not only pad the inputs to the maximum length in the batch, but also the labels. Note that our data collators are multi-framework, so make sure you set `return_tensors='tf'` so you get `tf.Tensor` objects back and not something else!"
"Then, we need a special kind of data collator, which will not only pad the inputs to the maximum length in the batch, but also the labels. Note that our data collators are designed to work for multiple frameworks, so ensure you set the `return_tensors='np'` argument to get NumPy arrays out - you don't want to accidentally get a load of `torch.Tensor` objects in the middle of your nice TF code! You could also use `return_tensors='tf'` to get TensorFlow tensors, but our TF dataset pipeline actually uses a NumPy loader internally, which is wrapped at the end with a `tf.data.Dataset`. As a result, `np` is usually more reliable and performant when you're using it!"
]
},
{
"cell_type": "code",
"execution_count": 21,
"execution_count": 1,
"metadata": {},
"outputs": [],
"outputs": [
{
"ename": "NameError",
"evalue": "name 'DataCollatorForSeq2Seq' is not defined",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)",
"Cell \u001b[0;32mIn[1], line 1\u001b[0m\n\u001b[0;32m----> 1\u001b[0m data_collator \u001b[38;5;241m=\u001b[39m \u001b[43mDataCollatorForSeq2Seq\u001b[49m(tokenizer, model\u001b[38;5;241m=\u001b[39mmodel, return_tensors\u001b[38;5;241m=\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mnp\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n\u001b[1;32m 3\u001b[0m generation_data_collator \u001b[38;5;241m=\u001b[39m DataCollatorForSeq2Seq(tokenizer, model\u001b[38;5;241m=\u001b[39mmodel, return_tensors\u001b[38;5;241m=\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mnp\u001b[39m\u001b[38;5;124m\"\u001b[39m, pad_to_multiple_of\u001b[38;5;241m=\u001b[39m\u001b[38;5;241m128\u001b[39m)\n",
"\u001b[0;31mNameError\u001b[0m: name 'DataCollatorForSeq2Seq' is not defined"
]
}
],
"source": [
"data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, return_tensors=\"tf\")\n",
"data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, return_tensors=\"np\")\n",
"\n",
"generation_data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, return_tensors=\"tf\", pad_to_multiple_of=128)"
"generation_data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, return_tensors=\"np\", pad_to_multiple_of=128)"
]
},
{
@@ -1466,7 +1478,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.4"
"version": "3.10.8"
}
},
"nbformat": 4,
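One note on the translation cell above: its committed output shows a `NameError`, which suggests it was re-run in a fresh session where `DataCollatorForSeq2Seq` had not yet been imported. The cell depends on an import along these lines, with `tokenizer` and `model` assumed to come from earlier cells in the notebook:

```python
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, return_tensors="np")

generation_data_collator = DataCollatorForSeq2Seq(
    tokenizer, model=model, return_tensors="np", pad_to_multiple_of=128
)
```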