8 changes: 4 additions & 4 deletions examples/image_classification-tf.ipynb
@@ -1201,7 +1201,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We need to convert our datasets to a format Keras understands. The easiest way to do this is with the `to_tf_dataset()` method. Note that our data collators are designed to work for multiple frameworks, so ensure you set the `return_tensors='tf'` argument to get TensorFlow tensors out - you don't want to accidentally get a load of `torch.Tensor` objects in the middle of your nice TF code!"
"We need to convert our datasets to a format Keras understands. The easiest way to do this is with the `to_tf_dataset()` method. Note that our data collators are designed to work for multiple frameworks, so ensure you set the `return_tensors='np'` argument to get NumPy arrays out - you don't want to accidentally get a load of `torch.Tensor` objects in the middle of your nice TF code! You could also use `return_tensors='tf'` to get TensorFlow tensors, but our `to_tf_dataset` pipeline actually uses a NumPy loader internally, which is wrapped at the end with a `tf.data.Dataset`. As a result, `np` is usually more reliable and performant when you're using it!"
]
},
{
@@ -1219,7 +1219,7 @@
"metadata": {},
"outputs": [],
"source": [
"data_collator = DefaultDataCollator(return_tensors=\"tf\")\n",
"data_collator = DefaultDataCollator(return_tensors=\"np\")\n",
"\n",
"train_set = train_ds.to_tf_dataset(\n",
" columns=[\"pixel_values\", \"label\"],\n",
@@ -3127,7 +3127,7 @@
"hash": "668fb96a716f4e6c0ace6609c578b3593a4af0cc3bba8b3739e6b5cb74dc056a"
},
"kernelspec": {
"display_name": "Python 3.8.12 ('tenv': venv)",
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
@@ -3141,7 +3141,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.12"
"version": "3.10.8"
},
"vscode": {
"interpreter": {
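For context, the pattern the image classification notebook describes above looks roughly like this. This is only a minimal sketch: `train_ds`, the column names, and the batch size stand in for objects defined earlier in that notebook and are not part of this diff.

```python
from transformers import DefaultDataCollator

# A collator that returns NumPy arrays; to_tf_dataset() drives a NumPy loader
# internally and only wraps the result in a tf.data.Dataset at the very end.
data_collator = DefaultDataCollator(return_tensors="np")

# `train_ds` is assumed to be the processed Hugging Face dataset from earlier cells.
train_set = train_ds.to_tf_dataset(
    columns=["pixel_values", "label"],
    shuffle=True,
    batch_size=16,
    collate_fn=data_collator,
)
```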
6 changes: 3 additions & 3 deletions examples/language_modeling-tf.ipynb
@@ -2143,7 +2143,7 @@
"source": [
"Finally, we use a special `data_collator`. The `data_collator` is a function that is responsible for taking the samples and batching them in tensors. In the previous example, we had nothing special to do, so we just used the default for this argument. Here we want to randomly mask tokens. We could do it as a pre-processing step (like the tokenization) but then the tokens would always be masked the same way at each epoch. By doing this step inside the `data_collator`, we ensure this random masking is done in a new way each time we go over the data.\n",
"\n",
"To do this masking for us, the library provides a `DataCollatorForLanguageModeling`. We can adjust the probability of the masking. Note that our data collators are designed to work for multiple frameworks, so ensure you set the `return_tensors='tf'` argument to get Tensorflow tensors out - you don't want to accidentally get a load of `torch.Tensor` objects in the middle of your nice TF code!"
"To do this masking for us, the library provides a `DataCollatorForLanguageModeling`. We can adjust the probability of the masking. Note that our data collators are designed to work for multiple frameworks, so ensure you set the `return_tensors='np'` argument to get NumPy arrays out - you don't want to accidentally get a load of `torch.Tensor` objects in the middle of your nice TF code! You could also use `return_tensors='tf'` to get TensorFlow tensors, but our TF dataset pipeline actually uses a NumPy loader internally, which is wrapped at the end with a `tf.data.Dataset`. As a result, `np` is usually more reliable and performant when you're using it!"
]
},
{
@@ -2157,7 +2157,7 @@
"from transformers import DataCollatorForLanguageModeling\n",
"\n",
"data_collator = DataCollatorForLanguageModeling(\n",
" tokenizer=tokenizer, mlm_probability=0.15, return_tensors=\"tf\"\n",
" tokenizer=tokenizer, mlm_probability=0.15, return_tensors=\"np\"\n",
")"
]
},
@@ -2447,7 +2447,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.4"
"version": "3.10.8"
}
},
"nbformat": 4,
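To illustrate the collator change in the language modeling notebook above, here is a minimal sketch of what the NumPy-returning masking collator produces. The `distilroberta-base` checkpoint and the toy sentences are placeholders for illustration, not taken from this diff.

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")  # placeholder checkpoint

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm_probability=0.15, return_tensors="np"
)

# The collator masks a fresh ~15% of tokens on every call, so each pass over
# the data sees a different masking pattern.
features = [tokenizer("The quick brown fox jumps over the lazy dog.") for _ in range(2)]
batch = data_collator(features)

# NumPy arrays come out; labels are -100 everywhere except the masked positions.
print(type(batch["input_ids"]), batch["labels"])
```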
6 changes: 3 additions & 3 deletions examples/language_modeling_from_scratch-tf.ipynb
@@ -1112,7 +1112,7 @@
"source": [
"Finally, we use a special `data_collator`. The `data_collator` is a function that is responsible for taking the samples and batching them in tensors. In the previous example, we had nothing special to do, so we just used the default for this argument. Here we want to randomly mask tokens. We could do it as a pre-processing step (like the tokenization) but then the tokens would always be masked the same way at each epoch. By doing this step inside the `data_collator`, we ensure this random masking is done in a new way each time we go over the data.\n",
"\n",
"To do this masking for us, the library provides a `DataCollatorForLanguageModeling`. We can adjust the probability of the masking. Make sure to set `return_tensors=\"tf\"` too - the `DataCollator` objects all support multiple frameworks, and we don't want to accidentally get a bunch of `torch.Tensor` objects floating around in our TensorFlow code!"
"To do this masking for us, the library provides a `DataCollatorForLanguageModeling`. We can adjust the probability of the masking. Note that our data collators are designed to work for multiple frameworks, so ensure you set the `return_tensors='np'` argument to get NumPy arrays out - you don't want to accidentally get a load of `torch.Tensor` objects in the middle of your nice TF code! You could also use `return_tensors='tf'` to get TensorFlow tensors, but our TF dataset pipeline actually uses a NumPy loader internally, which is wrapped at the end with a `tf.data.Dataset`. As a result, `np` is usually more reliable and performant when you're using it!"
]
},
{
@@ -1126,7 +1126,7 @@
"from transformers import DataCollatorForLanguageModeling\n",
"\n",
"data_collator = DataCollatorForLanguageModeling(\n",
" tokenizer=tokenizer, mlm_probability=0.15, return_tensors=\"tf\"\n",
" tokenizer=tokenizer, mlm_probability=0.15, return_tensors=\"np\"\n",
")"
]
},
@@ -1316,7 +1316,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.4"
"version": "3.10.8"
}
},
"nbformat": 4,
4 changes: 2 additions & 2 deletions examples/multiple_choice-tf.ipynb
@@ -1038,7 +1038,7 @@
" padding=self.padding,\n",
" max_length=self.max_length,\n",
" pad_to_multiple_of=self.pad_to_multiple_of,\n",
" return_tensors=\"tf\",\n",
" return_tensors=\"np\",\n",
" )\n",
"\n",
" # Un-flatten\n",
@@ -1570,7 +1570,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.4"
"version": "3.10.8"
}
},
"nbformat": 4,
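The multiple choice change above swaps `return_tensors="np"` into a custom collator. A rough sketch of that flatten, pad, un-flatten pattern is below; the feature/label names and the checkpoint are hypothetical, and the real collator lives in the notebook rather than in this diff.

```python
import numpy as np
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder checkpoint


def collate_multiple_choice(features, num_choices=4):
    # Pull the answer index out before padding (hypothetical "label" key).
    labels = [f.pop("label") for f in features]
    # Flatten: each feature holds `num_choices` tokenized candidate sequences.
    flattened = [
        {k: v[i] for k, v in feature.items()}
        for feature in features
        for i in range(num_choices)
    ]
    batch = tokenizer.pad(flattened, padding=True, return_tensors="np")
    # Un-flatten back to (batch_size, num_choices, seq_len).
    batch = {k: v.reshape(len(features), num_choices, -1) for k, v in batch.items()}
    batch["labels"] = np.array(labels, dtype=np.int64)
    return batch
```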
2 changes: 1 addition & 1 deletion examples/protein_language_modeling-tf.ipynb
@@ -2102,7 +2102,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.0"
"version": "3.10.8"
}
},
"nbformat": 4,
2 changes: 1 addition & 1 deletion examples/question_answering-tf.ipynb
@@ -2411,7 +2411,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.4"
"version": "3.10.8"
}
},
"nbformat": 4,
8 changes: 4 additions & 4 deletions examples/summarization-tf.ipynb
@@ -862,7 +862,7 @@
"id": "km3pGVdTIrJc"
},
"source": [
"Then, we need a special kind of data collator, which will not only pad the inputs to the maximum length in the batch, but also the labels. Note that our data collators are multi-framework, so make sure you set `return_tensors='tf'` so you get `tf.Tensor` objects back and not something else!\n",
"Then, we need a special kind of data collator, which will not only pad the inputs to the maximum length in the batch, but also the labels. Note that our data collators are designed to work for multiple frameworks, so ensure you set the `return_tensors='np'` argument to get NumPy arrays out - you don't want to accidentally get a load of `torch.Tensor` objects in the middle of your nice TF code! You could also use `return_tensors='tf'` to get TensorFlow tensors, but our TF dataset pipeline actually uses a NumPy loader internally, which is wrapped at the end with a `tf.data.Dataset`. As a result, `np` is usually more reliable and performant when you're using it!\n",
"\n",
"We also want to compute `ROUGE` metrics, which will require us to generate text from our model. To speed things up, we can compile our generation loop with XLA. This results in a *huge* speedup - up to 100X! The downside of XLA generation, though, is that it doesn't like variable input shapes, because it needs to run a new compilation for each new input shape! To compensate for that, let's use `pad_to_multiple_of` for the dataset we use for text generation. This will reduce the number of unique input shapes a lot, meaning we can get the benefits of XLA generation with only a few compilations."
]
@@ -873,9 +873,9 @@
"metadata": {},
"outputs": [],
"source": [
"data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, return_tensors=\"tf\")\n",
"data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, return_tensors=\"np\")\n",
"\n",
"generation_data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, return_tensors=\"tf\", pad_to_multiple_of=128)"
"generation_data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, return_tensors=\"np\", pad_to_multiple_of=128)"
]
},
{
@@ -1479,7 +1479,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.4"
"version": "3.10.8"
}
},
"nbformat": 4,
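A sketch of the summarization setup described above: two NumPy-returning seq2seq collators (one padded to a multiple of 128 for generation) plus one common way to compile generation with XLA. `tokenizer` and `model` are assumed to be the objects loaded earlier in that notebook.

```python
import tensorflow as tf
from transformers import DataCollatorForSeq2Seq

# `tokenizer` and `model` come from earlier cells in the notebook.
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, return_tensors="np")

# Padding to a multiple of 128 keeps the set of input shapes small, so XLA
# only needs to recompile the generation loop a handful of times.
generation_data_collator = DataCollatorForSeq2Seq(
    tokenizer, model=model, return_tensors="np", pad_to_multiple_of=128
)

# One common pattern for XLA generation: wrap generate() in a jit-compiled tf.function.
xla_generate = tf.function(model.generate, jit_compile=True)
```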
2 changes: 1 addition & 1 deletion examples/text_classification-tf.ipynb
@@ -1471,7 +1471,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.4"
"version": "3.10.8"
}
},
"nbformat": 4,
6 changes: 3 additions & 3 deletions examples/token_classification-tf.ipynb
@@ -1154,7 +1154,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we will need a data collator that will batch our processed examples together while applying padding to make them all the same size (each pad will be padded to the length of its longest example). There is a data collator for this task in the Transformers library, that not only pads the inputs, but also the labels. Note that our data collators support multiple frameworks, so ensure you set `return_tensors='tf'` to get `tf.Tensor` outputs - you don't want to forget it and end up with a pile of `torch.Tensor` messing up your Tensorflow code!"
"Now we will need a data collator that will batch our processed examples together while applying padding to make them all the same size (each pad will be padded to the length of its longest example). There is a data collator for this task in the Transformers library, that not only pads the inputs, but also the labels. Note that our data collators are designed to work for multiple frameworks, so ensure you set the `return_tensors='np'` argument to get NumPy arrays out - you don't want to accidentally get a load of `torch.Tensor` objects in the middle of your nice TF code! You could also use `return_tensors='tf'` to get TensorFlow tensors, but our TF dataset pipeline actually uses a NumPy loader internally, which is wrapped at the end with a `tf.data.Dataset`. As a result, `np` is usually more reliable and performant when you're using it!"
]
},
{
@@ -1165,7 +1165,7 @@
"source": [
"from transformers import DataCollatorForTokenClassification\n",
"\n",
"data_collator = DataCollatorForTokenClassification(tokenizer, return_tensors=\"tf\")"
"data_collator = DataCollatorForTokenClassification(tokenizer, return_tensors=\"np\")"
]
},
{
@@ -1668,7 +1668,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.4"
"version": "3.10.8"
}
},
"nbformat": 4,
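To make the token classification point above concrete, here is a toy batch run through the collator. The checkpoint and input IDs are arbitrary placeholders; the point is that inputs are padded with the tokenizer's pad token while labels are padded with -100 (the default `label_pad_token_id`), so padded positions are ignored by the loss.

```python
from transformers import AutoTokenizer, DataCollatorForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")  # placeholder checkpoint

data_collator = DataCollatorForTokenClassification(tokenizer, return_tensors="np")

features = [
    {"input_ids": [101, 1188, 102], "labels": [-100, 3, -100]},
    {"input_ids": [101, 1188, 1110, 2039, 102], "labels": [-100, 3, 7, 0, -100]},
]
batch = data_collator(features)
print(batch["input_ids"].shape)  # (2, 5) - both examples padded to the longest one
print(batch["labels"][0])        # shorter example's labels padded out with -100
```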
24 changes: 18 additions & 6 deletions examples/translation-tf.ipynb
@@ -880,18 +880,30 @@
"id": "km3pGVdTIrJc"
},
"source": [
"Then, we need a special kind of data collator, which will not only pad the inputs to the maximum length in the batch, but also the labels. Note that our data collators are multi-framework, so make sure you set `return_tensors='tf'` so you get `tf.Tensor` objects back and not something else!"
"Then, we need a special kind of data collator, which will not only pad the inputs to the maximum length in the batch, but also the labels. Note that our data collators are designed to work for multiple frameworks, so ensure you set the `return_tensors='np'` argument to get NumPy arrays out - you don't want to accidentally get a load of `torch.Tensor` objects in the middle of your nice TF code! You could also use `return_tensors='tf'` to get TensorFlow tensors, but our TF dataset pipeline actually uses a NumPy loader internally, which is wrapped at the end with a `tf.data.Dataset`. As a result, `np` is usually more reliable and performant when you're using it!"
]
},
{
"cell_type": "code",
"execution_count": 21,
"execution_count": 1,
"metadata": {},
"outputs": [],
"outputs": [
{
"ename": "NameError",
"evalue": "name 'DataCollatorForSeq2Seq' is not defined",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)",
"Cell \u001b[0;32mIn[1], line 1\u001b[0m\n\u001b[0;32m----> 1\u001b[0m data_collator \u001b[38;5;241m=\u001b[39m \u001b[43mDataCollatorForSeq2Seq\u001b[49m(tokenizer, model\u001b[38;5;241m=\u001b[39mmodel, return_tensors\u001b[38;5;241m=\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mnp\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n\u001b[1;32m 3\u001b[0m generation_data_collator \u001b[38;5;241m=\u001b[39m DataCollatorForSeq2Seq(tokenizer, model\u001b[38;5;241m=\u001b[39mmodel, return_tensors\u001b[38;5;241m=\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mnp\u001b[39m\u001b[38;5;124m\"\u001b[39m, pad_to_multiple_of\u001b[38;5;241m=\u001b[39m\u001b[38;5;241m128\u001b[39m)\n",
"\u001b[0;31mNameError\u001b[0m: name 'DataCollatorForSeq2Seq' is not defined"
]
}
],
"source": [
"data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, return_tensors=\"tf\")\n",
"data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, return_tensors=\"np\")\n",
"\n",
"generation_data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, return_tensors=\"tf\", pad_to_multiple_of=128)"
"generation_data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, return_tensors=\"np\", pad_to_multiple_of=128)"
]
},
{
@@ -1466,7 +1478,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.4"
"version": "3.10.8"
}
},
"nbformat": 4,
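One note on the translation cell above: its committed output shows a `NameError`, which suggests it was re-run in a fresh session where `DataCollatorForSeq2Seq` had not yet been imported. The cell depends on an import along these lines, with `tokenizer` and `model` assumed to come from earlier cells in the notebook:

```python
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, return_tensors="np")

generation_data_collator = DataCollatorForSeq2Seq(
    tokenizer, model=model, return_tensors="np", pad_to_multiple_of=128
)
```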