TAPAS tokenizer & tokenizer tests #8482
| PreTokenizedInput, | ||
| EncodedInput, | ||
| ], | ||
| answer_coordinate: Optional[List[Tuple]] = None, |
Maybe it's more logical to call it answer_coordinates (plural) rather than answer_coordinate, because a table-question pair typically has an answer spanning multiple coordinates. Here's a screenshot of the SQA TSV format (column is called "answer_coordinates"):
This will also require a change in the encode_plus and _encode_plus methods.
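As a side illustration, the `answer_coordinates` column in the SQA TSV arrives as a stringified list of stringified tuples; here is a minimal sketch of parsing it into the `List[Tuple]` shape the tokenizer would expect (the helper name `parse_answer_coordinates` is made up for this example):

```python
import ast

def parse_answer_coordinates(raw: str):
    """Parse an SQA-style answer_coordinates cell, e.g. "['(0, 0)', '(1, 2)']",
    into a list of (row, column) tuples."""
    # The TSV cell is a stringified list of stringified tuples, so we need
    # two rounds of literal_eval.
    return [ast.literal_eval(coord) for coord in ast.literal_eval(raw)]

coords = parse_answer_coordinates("['(0, 0)', '(1, 2)']")
```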
| def _add_numeric_column_ranks(self, column_ids, row_ids, table, features): | ||
| def _get_numeric_column_ranks(self, column_ids, row_ids, table): | ||
| """Adds column ranks for all numeric columns.""" |
To be consistent with _get_numeric_relations and the other methods below, the docstring can be changed to "Returns numeric column rank embeddings for all numeric columns".
| cell_trim_length=cell_trim_length, | ||
| max_column_id=max_column_id, | ||
| max_row_id=max_row_id, | ||
| strip_column_names=strip_column_names, | ||
| update_answer_coordinates=update_answer_coordinates, | ||
| drop_rows_to_fit=drop_rows_to_fit, |
Is this necessary? Didn't know that 😄
This is necessary because all those arguments passed to the super class will then be saved in self.init_kwargs, which will be used when saving the tokenizer. It is important to have those here, so that when saving/reloading the tokenizer, the exact same config is used!
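The mechanism can be sketched with a toy base class (an illustration of the pattern, not the actual `transformers` code):

```python
class SimpleBaseTokenizer:
    """Toy stand-in for PreTrainedTokenizer: keyword arguments forwarded to
    the base class are kept in init_kwargs, which is what gets written to the
    tokenizer config when the tokenizer is saved."""
    def __init__(self, **kwargs):
        self.init_kwargs = kwargs

class SimpleTableTokenizer(SimpleBaseTokenizer):
    def __init__(self, cell_trim_length=-1, drop_rows_to_fit=False, **kwargs):
        # Forwarding these ensures they survive a save/reload round trip.
        super().__init__(
            cell_trim_length=cell_trim_length,
            drop_rows_to_fit=drop_rows_to_fit,
            **kwargs,
        )
        self.cell_trim_length = cell_trim_length
        self.drop_rows_to_fit = drop_rows_to_fit

tok = SimpleTableTokenizer(cell_trim_length=5)
```

An argument that is *not* forwarded would be silently dropped from the saved config, so a reloaded tokenizer would fall back to the default value.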
| def _get_numeric_relations(self, question, column_ids, row_ids, table, columns_to_numeric_values): | ||
| """ | ||
| Adds numeric relation embeddings to 'features' | ||
| Returns numeric relation embeddings |
| segment_ids = self.create_segment_token_type_ids_from_sequences(query_ids, table_data) | ||
| column_ids = self.create_column_token_type_ids_from_sequences(query_ids, table_data) | ||
| row_ids = self.create_row_token_type_ids_from_sequences(query_ids, table_data) | ||
| prev_label_ids = [0] * len(row_ids) |
I see that you set the previous label ids (i.e. which tokens of the table were an answer to the previous question) here. This is OK for a single example, but in case of a batch of examples, then the prev_label_ids must be set to the label_ids of the previous table-question pair in the batch (which are calculated based on the get_answer_ids function). This can be seen here in the original implementation. They use the index of the table-question pair in the batch to determine if it's the first, second, ... question of the batch. However, I'm not entirely sure about whether this is working well, I've submitted an issue to get this resolved.
I see you removed the position_to_label_ids dictionary that implemented this. It's a dictionary that maps a position to the label ids.
You're correct, this is my bad. I will fix this.
Ok, maybe it's better to wait with this until the issue mentioned above is resolved. It seems to me that this can be implemented in a better way than in the original implementation.
Update: after receiving a response from one of the authors (see issue above), I now understand how these are created in the original implementation (I actually implemented it in the wrong way with that dictionary). The correct way to do this (in case of a batch), is to set the prev_label_ids equal to get_answer_ids(queries[index - 1]), with index indicating whether the question is the first, second,... in a batch (note that a batch should always contain questions that refer to the same table). This function in turn calls _get_answer_ids(column_ids, row_ids, question), with the column_ids and row_ids of the current table-question pair. So it's important that before calling get_answer_ids, the column_ids and row_ids are set to those of the current table-question pair.
This will require some changes, also to the signature of several methods (such as get_answer_ids, which is currently accepting a lot more parameters than just question), to reflect the original implementation.
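The batching rule described above could be sketched as follows, with `get_answer_ids` replaced by a hypothetical stub (the real function computes label ids from the question's answer coordinates on the current table layout):

```python
def get_answer_ids_stub(column_ids, row_ids, question):
    # Hypothetical stub: the real get_answer_ids marks which table tokens
    # answer the given question.
    return [1 if i % 2 == 0 else 0 for i in range(len(row_ids))]

def compute_prev_label_ids(queries, column_ids, row_ids, get_answer_ids):
    """For each question in a batch (all referring to the same table), set
    prev_label_ids to the label ids of the *previous* question; the first
    question in the batch has no previous answers."""
    result = []
    for index in range(len(queries)):
        if index == 0:
            result.append([0] * len(row_ids))
        else:
            # column_ids / row_ids must already belong to the *current*
            # table-question pair before this call, as noted above.
            result.append(get_answer_ids(column_ids, row_ids, queries[index - 1]))
    return result

prev = compute_prev_label_ids(["q1", "q2"], [0, 1, 1, 2], [0, 1, 2, 1], get_answer_ids_stub)
```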
| raw_queries: Union[ | ||
| TextInput, | ||
| PreTokenizedInput, | ||
| EncodedInput, |
I assume raw_queries must be a List
| table=table, | ||
| query=query, |
Note that I'm using answer_coordinateS here, with an s (see comment with screenshot)
| table=table, | |
| query=query, | |
| table=table, | |
| query=query, | |
| answer_coordinates=answer_coordinates, | |
| answer_text=answer_text |
| _, _, num_tokens = self._get_table_boundaries(table_tokens) | ||
| table_data = list(self._get_table_values(table_tokens, num_columns, num_rows, num_tokens)) |
Similar to a comment above, I suggest the following:
| _, _, num_tokens = self._get_table_boundaries(table_tokens) | |
| table_data = list(self._get_table_values(table_tokens, num_columns, num_rows, num_tokens)) | |
| _, _, max_num_tokens = self._get_table_boundaries(table_tokens) | |
| if self.cell_trim_length >= 0 and max_num_tokens > self.cell_trim_length: | |
| max_num_tokens = self.cell_trim_length | |
| table_data = list(self._get_table_values(table_tokens, num_columns, num_rows, max_num_tokens)) |
| _, _, num_tokens = self._get_table_boundaries(table_tokens) | ||
| table_data = list(self._get_table_values(table_tokens, num_columns, num_rows, num_tokens)) |
Again, suggestion:
| _, _, num_tokens = self._get_table_boundaries(table_tokens) | |
| table_data = list(self._get_table_values(table_tokens, num_columns, num_rows, num_tokens)) | |
| _, _, max_num_tokens = self._get_table_boundaries(table_tokens) | |
| if self.cell_trim_length >= 0 and max_num_tokens > self.cell_trim_length: | |
| max_num_tokens = self.cell_trim_length | |
| table_data = list(self._get_table_values(table_tokens, num_columns, num_rows, max_num_tokens)) |
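The clamping step in this suggestion can be isolated as a tiny helper for clarity (a sketch; the helper name is made up, and a negative `cell_trim_length` is assumed to mean "no trimming", matching the `>= 0` check above):

```python
def effective_max_num_tokens(max_num_tokens, cell_trim_length):
    """Clamp the maximum per-cell token count when trimming is enabled
    (cell_trim_length >= 0); a negative value disables trimming."""
    if cell_trim_length >= 0 and max_num_tokens > cell_trim_length:
        max_num_tokens = cell_trim_length
    return max_num_tokens
```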
| return self.model_max_length - self._question_encoding_cost(question_tokens) | ||
| def _get_table_values(self, table, num_columns, num_rows, num_tokens): | ||
| def _get_table_values(self, table, num_columns, num_rows, num_tokens) -> Generator[TableValue, None, None]: |
This makes it more clear.
| def _get_table_values(self, table, num_columns, num_rows, num_tokens) -> Generator[TableValue, None, None]: | |
| def _get_table_values(self, table, num_columns, num_rows, max_num_tokens) -> Generator[TableValue, None, None]: |
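To illustrate why the `Generator[TableValue, None, None]` annotation helps, here is a self-contained sketch with a hypothetical `TableValue` (the actual fields in the PR may differ):

```python
from typing import Generator, List, NamedTuple

class TableValue(NamedTuple):
    # Hypothetical fields for illustration only.
    token: str
    column_id: int
    row_id: int

def get_table_values(
    table: List[List[List[str]]], num_columns: int, num_rows: int, max_num_tokens: int
) -> Generator[TableValue, None, None]:
    """Yield one TableValue per kept token. Generator[Y, S, R] spells out the
    yield, send and return types, so callers know what the iterator produces."""
    for row_id in range(num_rows):
        for column_id in range(num_columns):
            # Keep at most max_num_tokens tokens per cell.
            for token in table[row_id][column_id][:max_num_tokens]:
                yield TableValue(token, column_id + 1, row_id + 1)

values = list(get_table_values([[["a", "b"], ["c"]]], num_columns=2, num_rows=1, max_num_tokens=1))
```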
Thank you! ❗ This is a preliminary review; I'm not finished with it. Two important things for now:
SQA: https://colab.research.google.com/drive/1BNxrKkrwpWuE2TthZL5qQlERtcK4ZbIt?usp=sharing Normally, the
| Build model inputs from a sequence or a pair of sequence for sequence classification tasks by concatenating and | ||
| adding special tokens. | ||
| This implementation does not add special tokens and this method should be overridden in a subclass. |
This line can be removed, since it's overridden here.
Thanks a lot for your thorough preliminary review. I've fixed a few of the issues and just pushed a commit. A few of the things you mention definitely need a deeper look; I can do so in the coming days, but I'll let you finish your review first so that I may batch everything. Thank you!
| List[EncodedInput], | ||
| ] | ||
| ] = None, | ||
| answer_coordinates: Optional[List[Tuple]] = None, |
Actually, the answer_coordinates are a List of Lists of Tuples (in case of a batch). Because each table-question pair has a list of tuples as answer coordinates.
In case of a single example, then it's indeed a list of tuples.
I also wonder whether we should include PreTokenizedInput, since we have a NotImplementedError further down stating that "Currently TapasTokenizer only supports questions as strings."
Done (however, what about PreTokenizedInput?)
| ] | ||
| ] = None, | ||
| answer_coordinates: Optional[List[Tuple]] = None, | ||
| answer_texts: Optional[List[TextInput]] = None, |
Also, the answer_texts is a List of List of TextInputs in case of a batch. A single example can have multiple answer texts, corresponding to the different cells, here's an example row from the SQA dev set:
list two pylons that are at most, 80 m in height. table_csv/203-375.csv ['(-1, -1)', '(-1, -1)'] ['Mittersill goods aerial tramway', 'Singapore cable car'] NONE
Maybe it would also be better to rename this to answer_text (singular), to have the same column name as the SQA format. This will also make sure we have the same argument names in both the batch and non-batch methods. This will of course require some changes to the other encoding methods.
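To make the single-example vs. batch shapes concrete (the coordinate values and the second example are made up for illustration):

```python
# Single table-question pair: a list of (row, column) tuples plus a list of
# answer strings, since one answer can span several cells.
answer_coordinates = [(6, 0), (8, 0)]
answer_text = ["Mittersill goods aerial tramway", "Singapore cable car"]

# Batch: one inner list per table-question pair, hence
# List[List[Tuple]] and List[List[TextInput]].
batch_answer_coordinates = [answer_coordinates, [(0, 1)]]
batch_answer_text = [answer_text, ["hypothetical answer"]]
```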
| answer_coordinates: Optional[List[Tuple]] = None, | ||
| answer_texts: Optional[List[TextInput]] = None, |
| answer_coordinates: Optional[List[Tuple]] = None, | |
| answer_texts: Optional[List[TextInput]] = None, | |
| answer_coordinates: Optional[List[List[Tuple]]] = None, | |
| answer_texts: Optional[List[List[TextInput]]] = None, |
| answer_coordinates: Optional[List[Tuple]] = None, | ||
| answer_texts: Optional[List[TextInput]] = None, |
| answer_coordinates: Optional[List[Tuple]] = None, | |
| answer_texts: Optional[List[TextInput]] = None, | |
| answer_coordinates: Optional[List[List[Tuple]]] = None, |
| answer_texts: Optional[List[List[TextInput]]] = None, |
| for query_ids, raw_query, query_tokens, answer_coords, answer_text in zip( | ||
| queries_ids, raw_queries, queries_tokens, answer_coordinates, answer_texts |
If we change the answer_texts parameter in all encoding methods to answer_text, then we have to update this code (and a lot of other places):
| for query_ids, raw_query, query_tokens, answer_coords, answer_text in zip( | |
| queries_ids, raw_queries, queries_tokens, answer_coordinates, answer_texts | |
| for query_ids, raw_query, query_tokens, answer_coords, answer_txt in zip( | |
| queries_ids, raw_queries, queries_tokens, answer_coordinates, answer_text |
| table_data=table_data, | ||
| query_tokens=query_tokens, | ||
| answer_coordinates=answer_coords, | ||
| answer_text=answer_text, |
If we change the answer_texts parameter to answer_text, then this becomes:
| answer_text=answer_text, | |
| answer_text=answer_txt, |
| answer_coordinates: Optional[List[Tuple]] = None, | ||
| answer_texts: Optional[List[TextInput]] = None, |
| answer_coordinates: Optional[List[Tuple]] = None, | |
| answer_texts: Optional[List[TextInput]] = None, | |
| answer_coordinates: Optional[List[List[Tuple]]] = None, |
| answer_texts: Optional[List[List[TextInput]]] = None, |
| label_ids = self.get_answer_ids( | ||
| column_ids, row_ids, table_data, query_tokens, answer_text, answer_coordinates | ||
| ) | ||
| numeric_values = self._get_numeric_values(raw_table, column_ids, row_ids, columns_to_numeric_values) | ||
| numeric_values_scale = self._get_numeric_values_scale(raw_table, column_ids, row_ids) | ||
| encoded_inputs["label_ids"] = label_ids | ||
| encoded_inputs["numeric_values"] = numeric_values | ||
| encoded_inputs["numeric_values_scale"] = numeric_values_scale |
These are created, but it seems that the padding/truncation is not working when self.pad() is applied on them. This currently results in an error (when I run a Colab notebook from this branch):
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/transformers/tokenization_utils_base.py in convert_to_tensors(self, tensor_type, prepend_batch_axis)
607
--> 608 tensor = as_tensor(value)
609
ValueError: expected sequence of length 41 at dim 1 (got 45)
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
6 frames
/usr/local/lib/python3.6/dist-packages/transformers/tokenization_utils_base.py in convert_to_tensors(self, tensor_type, prepend_batch_axis)
623 )
624 raise ValueError(
--> 625 "Unable to create tensor, you should probably activate truncation and/or padding "
626 "with 'padding=True' 'truncation=True' to have batched tensors with the same length."
627 )
ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length.
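A fix would be to pad these extra keys the same way the regular inputs are padded before tensor conversion; a minimal sketch of the idea (not the PR's actual padding code):

```python
def pad_sequences(sequences, pad_value):
    """Pad every sequence in a batch to the length of the longest one, so
    that conversion to a rectangular tensor succeeds."""
    max_length = max(len(seq) for seq in sequences)
    return [seq + [pad_value] * (max_length - len(seq)) for seq in sequences]

# label_ids could be padded with 0; numeric_values might instead use
# float("nan"), for example.
padded_label_ids = pad_sequences([[0, 1, 0], [1, 0, 0, 1, 1]], pad_value=0)
```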
| token_ids_0 (:obj:`List[int]`): The first tokenized sequence. | ||
| token_ids_1 (:obj:`List[int]`, `optional`): The second tokenized sequence. |
| token_ids_0 (:obj:`List[int]`): The first tokenized sequence. | |
| token_ids_1 (:obj:`List[int]`, `optional`): The second tokenized sequence. | |
| token_ids_0 (:obj:`List[int]`): The ids of the question. | |
| token_ids_1 (:obj:`List[int]`, `optional`): The ids of the flattened table. |
| Build model inputs from a sequence or a pair of sequence for sequence classification tasks by concatenating and | ||
| adding special tokens. |
| Build model inputs from a sequence or a pair of sequence for sequence classification tasks by concatenating and | |
| adding special tokens. | |
| Build model inputs from a question and flattened table for sequence classification or question answering tasks by concatenating and | |
| adding special tokens. |
| Args: | ||
| token_ids_0 (:obj:`List[int]`): | ||
| List of IDs. | ||
| token_ids_1 (:obj:`List[int]`, `optional`): | ||
| Optional second list of IDs for sequence pairs. |
| Args: | |
| token_ids_0 (:obj:`List[int]`): | |
| List of IDs. | |
| token_ids_1 (:obj:`List[int]`, `optional`): | |
| Optional second list of IDs for sequence pairs. | |
| Args: | |
| token_ids_0 (:obj:`List[int]`): | |
| List of question IDs. | |
| token_ids_1 (:obj:`List[int]`, `optional`): | |
| List of flattened table IDs. |
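For reference, a BERT-style `build_inputs_with_special_tokens` for a question plus flattened table might look like this (a sketch; the placeholder ids 101/102 and the exact special-token layout are assumptions, not necessarily what TAPAS uses):

```python
def build_inputs_with_special_tokens(token_ids_0, token_ids_1=None,
                                     cls_id=101, sep_id=102):
    """Sketch of a BERT-style layout for question + flattened table:
    [CLS] question [SEP] table. The 101/102 ids are placeholders."""
    if token_ids_1 is None:
        return [cls_id] + token_ids_0 + [sep_id]
    return [cls_id] + token_ids_0 + [sep_id] + token_ids_1

ids = build_inputs_with_special_tokens([5, 6], [7, 8])
```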
| ], | ||
| } | ||
| new_encoded_inputs = tokenizer.encode_plus(table=table, query=queries[0], padding="max_length") |
This test is to prepare inputs for TAPAS for inference. We also need to include tests to prepare inputs for TAPAS for training (fine-tuning). In that case, also answer_coordinates and answer_text need to be provided to the tokenizer, and it should create label_ids, numeric_values and numeric_values_scale.
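Such a training-mode test would check that the label_ids line up with the provided answer coordinates; here is a sketch of how label ids could be derived (the helper and the 1-indexing convention are assumptions modelled on TAPAS-style token type ids, not the PR's actual code):

```python
def make_label_ids(column_ids, row_ids, answer_coordinates):
    """Sketch: a token gets label 1 when its cell's (row, column) pair is in
    answer_coordinates. Assumes the TAPAS-style convention where id 0 marks
    tokens outside the table and table rows/columns are 1-indexed."""
    answer_set = set(answer_coordinates)
    return [
        1 if col > 0 and row > 0 and (row - 1, col - 1) in answer_set else 0
        for col, row in zip(column_ids, row_ids)
    ]

# Two question tokens (ids 0) followed by a 2x2 table, answer in cell (0, 1).
label_ids = make_label_ids([0, 0, 1, 2, 1, 2], [0, 0, 1, 1, 2, 2], [(0, 1)])
```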
@LysandreJik I have finished reviewing; I've added more (mostly documentation-related) comments. Besides this, the other important things are:
This PR aims to implement the tokenizer API for the TAPAS model, as well as the tests. It is based on `tapas-style`, which contains all the changes done by black & isort on top of the `nielsrogge/tapas_v3` branch in #8113.

The API is akin to our other tokenizers': it is based on the `__call__` method, which dispatches to `encode_plus` or `batch_encode_plus` according to the inputs. These two methods then dispatch to `_encode_plus` and `_batch_encode_plus`, which themselves dispatch to `prepare_for_model` and `_batch_prepare_for_model`.

Here are the remaining tasks for the tokenizers, from what I could observe:

- `pd.DataFrame`s: it should be very simple to switch from these to `datasets.Dataset`, which serves the same purpose.

Once this PR is merged, I'll open a PR from `tapas-style` to `nielsrogge/tapas_v3`, as explained in #8113 (comment).