TAPAS tokenizer & tokenizer tests #8482
| PreTokenizedInput, | ||
| EncodedInput, | ||
| ], | ||
| answer_coordinate: Optional[List[Tuple]] = None, |
Maybe it's more logical to call it answer_coordinates (plural) rather than answer_coordinate, because a table-question pair typically has an answer spanning multiple coordinates. Here's a screenshot of the SQA TSV format (column is called "answer_coordinates"):
This will also require a change in the encode_plus and _encode_plus methods.
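As a side illustration, the `answer_coordinates` column in the SQA TSV arrives as a stringified list of stringified tuples; here is a minimal sketch of parsing it into the `List[Tuple]` shape the tokenizer would expect (the helper name `parse_answer_coordinates` is made up for this example):

```python
import ast

def parse_answer_coordinates(raw: str):
    """Parse an SQA-style answer_coordinates cell, e.g. "['(0, 0)', '(1, 2)']",
    into a list of (row, column) tuples."""
    # The TSV cell is a stringified list of stringified tuples, so we need
    # two rounds of literal_eval.
    return [ast.literal_eval(coord) for coord in ast.literal_eval(raw)]

coords = parse_answer_coordinates("['(0, 0)', '(1, 2)']")
```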
| def _add_numeric_column_ranks(self, column_ids, row_ids, table, features): | ||
| def _get_numeric_column_ranks(self, column_ids, row_ids, table): | ||
| """Adds column ranks for all numeric columns.""" |
To be consistent with _get_numeric_relations and the other methods below, the docstring can be changed to "Returns numeric column rank embeddings for all numeric columns".
| cell_trim_length=cell_trim_length, | ||
| max_column_id=max_column_id, | ||
| max_row_id=max_row_id, | ||
| strip_column_names=strip_column_names, | ||
| update_answer_coordinates=update_answer_coordinates, | ||
| drop_rows_to_fit=drop_rows_to_fit, |
Is this necessary? Didn't know that 😄
This is necessary because all those arguments passed to the super class will then be saved in self.init_kwargs, which will be used when saving the tokenizer. It is important to have those here, so that when saving/reloading the tokenizer, the exact same config is used!
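The mechanism can be sketched with a toy base class (an illustration of the pattern, not the actual `transformers` code):

```python
class SimpleBaseTokenizer:
    """Toy stand-in for PreTrainedTokenizer: keyword arguments forwarded to
    the base class are kept in init_kwargs, which is what gets written to the
    tokenizer config when the tokenizer is saved."""
    def __init__(self, **kwargs):
        self.init_kwargs = kwargs

class SimpleTableTokenizer(SimpleBaseTokenizer):
    def __init__(self, cell_trim_length=-1, drop_rows_to_fit=False, **kwargs):
        # Forwarding these ensures they survive a save/reload round trip.
        super().__init__(
            cell_trim_length=cell_trim_length,
            drop_rows_to_fit=drop_rows_to_fit,
            **kwargs,
        )
        self.cell_trim_length = cell_trim_length
        self.drop_rows_to_fit = drop_rows_to_fit

tok = SimpleTableTokenizer(cell_trim_length=5)
```

An argument that is *not* forwarded would be silently dropped from the saved config, so a reloaded tokenizer would fall back to the default value.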
| def _get_numeric_relations(self, question, column_ids, row_ids, table, columns_to_numeric_values): | ||
| """ | ||
| Adds numeric relation embeddings to 'features' | ||
| Returns numeric relation embeddings |
| segment_ids = self.create_segment_token_type_ids_from_sequences(query_ids, table_data) | ||
| column_ids = self.create_column_token_type_ids_from_sequences(query_ids, table_data) | ||
| row_ids = self.create_row_token_type_ids_from_sequences(query_ids, table_data) | ||
| prev_label_ids = [0] * len(row_ids) |
I see that you set the previous label ids (i.e. which tokens of the table were an answer to the previous question) here. This is OK for a single example, but in case of a batch of examples, then the prev_label_ids must be set to the label_ids of the previous table-question pair in the batch (which are calculated based on the get_answer_ids function). This can be seen here in the original implementation. They use the index of the table-question pair in the batch to determine if it's the first, second, ... question of the batch. However, I'm not entirely sure about whether this is working well, I've submitted an issue to get this resolved.
I see you removed the position_to_label_ids dictionary that implemented this. It's a dictionary that maps a position to the label ids.
You're correct, this is my bad. I will fix this.
Ok, maybe it's better to wait with this until the issue mentioned above is resolved. It seems to me that this can be implemented in a better way than in the original implementation.
Update: after receiving a response from one of the authors (see issue above), I now understand how these are created in the original implementation (I actually implemented it in the wrong way with that dictionary). The correct way to do this (in case of a batch), is to set the prev_label_ids equal to get_answer_ids(queries[index - 1]), with index indicating whether the question is the first, second,... in a batch (note that a batch should always contain questions that refer to the same table). This function in turn calls _get_answer_ids(column_ids, row_ids, question), with the column_ids and row_ids of the current table-question pair. So it's important that before calling get_answer_ids, the column_ids and row_ids are set to those of the current table-question pair.
This will require some changes, also to the signature of several methods (such as get_answer_ids, which is currently accepting a lot more parameters than just question), to reflect the original implementation.
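The batching rule described above could be sketched as follows, with `get_answer_ids` replaced by a hypothetical stub (the real function computes label ids from the question's answer coordinates on the current table layout):

```python
def get_answer_ids_stub(column_ids, row_ids, question):
    # Hypothetical stub: the real get_answer_ids marks which table tokens
    # answer the given question.
    return [1 if i % 2 == 0 else 0 for i in range(len(row_ids))]

def compute_prev_label_ids(queries, column_ids, row_ids, get_answer_ids):
    """For each question in a batch (all referring to the same table), set
    prev_label_ids to the label ids of the *previous* question; the first
    question in the batch has no previous answers."""
    result = []
    for index in range(len(queries)):
        if index == 0:
            result.append([0] * len(row_ids))
        else:
            # column_ids / row_ids must already belong to the *current*
            # table-question pair before this call, as noted above.
            result.append(get_answer_ids(column_ids, row_ids, queries[index - 1]))
    return result

prev = compute_prev_label_ids(["q1", "q2"], [0, 1, 1, 2], [0, 1, 2, 1], get_answer_ids_stub)
```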
| raw_queries: Union[ | ||
| TextInput, | ||
| PreTokenizedInput, | ||
| EncodedInput, |
I assume raw_queries must be a List
| table=table, | ||
| query=query, |
Note that I'm using answer_coordinateS here, with an s (see comment with screenshot)
| table=table, | |
| query=query, | |
| table=table, | |
| query=query, | |
| answer_coordinates=answer_coordinates, | |
| answer_text=answer_text |
| _, _, num_tokens = self._get_table_boundaries(table_tokens) | ||
| table_data = list(self._get_table_values(table_tokens, num_columns, num_rows, num_tokens)) |
Similar to a comment above, I suggest the following:
| _, _, num_tokens = self._get_table_boundaries(table_tokens) | |
| table_data = list(self._get_table_values(table_tokens, num_columns, num_rows, num_tokens)) | |
| _, _, max_num_tokens = self._get_table_boundaries(table_tokens) | |
| if self.cell_trim_length >= 0 and max_num_tokens > self.cell_trim_length: | |
| max_num_tokens = self.cell_trim_length | |
| table_data = list(self._get_table_values(table_tokens, num_columns, num_rows, max_num_tokens)) |
| _, _, num_tokens = self._get_table_boundaries(table_tokens) | ||
| table_data = list(self._get_table_values(table_tokens, num_columns, num_rows, num_tokens)) |
Again, suggestion:
| _, _, num_tokens = self._get_table_boundaries(table_tokens) | |
| table_data = list(self._get_table_values(table_tokens, num_columns, num_rows, num_tokens)) | |
| _, _, max_num_tokens = self._get_table_boundaries(table_tokens) | |
| if self.cell_trim_length >= 0 and max_num_tokens > self.cell_trim_length: | |
| max_num_tokens = self.cell_trim_length | |
| table_data = list(self._get_table_values(table_tokens, num_columns, num_rows, max_num_tokens)) |
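The clamping step in this suggestion can be isolated as a tiny helper for clarity (a sketch; the helper name is made up, and a negative `cell_trim_length` is assumed to mean "no trimming", matching the `>= 0` check above):

```python
def effective_max_num_tokens(max_num_tokens, cell_trim_length):
    """Clamp the maximum per-cell token count when trimming is enabled
    (cell_trim_length >= 0); a negative value disables trimming."""
    if cell_trim_length >= 0 and max_num_tokens > cell_trim_length:
        max_num_tokens = cell_trim_length
    return max_num_tokens
```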
| return self.model_max_length - self._question_encoding_cost(question_tokens) | ||
| def _get_table_values(self, table, num_columns, num_rows, num_tokens): | ||
| def _get_table_values(self, table, num_columns, num_rows, num_tokens) -> Generator[TableValue, None, None]: |
This makes it more clear.
| def _get_table_values(self, table, num_columns, num_rows, num_tokens) -> Generator[TableValue, None, None]: | |
| def _get_table_values(self, table, num_columns, num_rows, max_num_tokens) -> Generator[TableValue, None, None]: |
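To illustrate why the `Generator[TableValue, None, None]` annotation helps, here is a self-contained sketch with a hypothetical `TableValue` (the actual fields in the PR may differ):

```python
from typing import Generator, List, NamedTuple

class TableValue(NamedTuple):
    # Hypothetical fields for illustration only.
    token: str
    column_id: int
    row_id: int

def get_table_values(
    table: List[List[List[str]]], num_columns: int, num_rows: int, max_num_tokens: int
) -> Generator[TableValue, None, None]:
    """Yield one TableValue per kept token. Generator[Y, S, R] spells out the
    yield, send and return types, so callers know what the iterator produces."""
    for row_id in range(num_rows):
        for column_id in range(num_columns):
            # Keep at most max_num_tokens tokens per cell.
            for token in table[row_id][column_id][:max_num_tokens]:
                yield TableValue(token, column_id + 1, row_id + 1)

values = list(get_table_values([[["a", "b"], ["c"]]], num_columns=2, num_rows=1, max_num_tokens=1))
```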
Thank you! ❗ This is a preliminary review; I'm not finished with it. Two important things for now:
SQA: https://colab.research.google.com/drive/1BNxrKkrwpWuE2TthZL5qQlERtcK4ZbIt?usp=sharing Normally, the
| Build model inputs from a sequence or a pair of sequence for sequence classification tasks by concatenating and | ||
| adding special tokens. | ||
| This implementation does not add special tokens and this method should be overridden in a subclass. |
This line can be removed, since it's overridden here.
Thanks a lot for your thorough preliminary review. I've fixed a few of the issues and just pushed a commit. A few of the things you mention definitely need a deeper look; I can do so in the coming days, but I'll let you finish your review first so that I may batch everything. Thank you!
| List[EncodedInput], | ||
| ] | ||
| ] = None, | ||
| answer_coordinates: Optional[List[Tuple]] = None, |
Actually, the answer_coordinates are a List of Lists of Tuples (in case of a batch). Because each table-question pair has a list of tuples as answer coordinates.
In case of a single example, then it's indeed a list of tuples.
I also wonder whether we should include PreTokenizedInput, since we have a NotImplementedError further down stating that "Currently TapasTokenizer only supports questions as strings."
Done (however, what about PreTokenizedInput?)
| ] | ||
| ] = None, | ||
| answer_coordinates: Optional[List[Tuple]] = None, | ||
| answer_texts: Optional[List[TextInput]] = None, |
Also, the answer_texts is a List of List of TextInputs in case of a batch. A single example can have multiple answer texts, corresponding to the different cells, here's an example row from the SQA dev set:
list two pylons that are at most, 80 m in height. table_csv/203-375.csv ['(-1, -1)', '(-1, -1)'] ['Mittersill goods aerial tramway', 'Singapore cable car'] NONE
Maybe it would also be better to rename this to answer_text (singular), to have the same column name as the SQA format. This will also make sure we have the same argument names in both the batch and non-batch methods. This will of course require some changes to the other encoding methods.
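To make the single-example vs. batch shapes concrete (the coordinate values and the second example are made up for illustration):

```python
# Single table-question pair: a list of (row, column) tuples plus a list of
# answer strings, since one answer can span several cells.
answer_coordinates = [(6, 0), (8, 0)]
answer_text = ["Mittersill goods aerial tramway", "Singapore cable car"]

# Batch: one inner list per table-question pair, hence
# List[List[Tuple]] and List[List[TextInput]].
batch_answer_coordinates = [answer_coordinates, [(0, 1)]]
batch_answer_text = [answer_text, ["hypothetical answer"]]
```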
| answer_coordinates: Optional[List[Tuple]] = None, | ||
| answer_texts: Optional[List[TextInput]] = None, |
| answer_coordinates: Optional[List[Tuple]] = None, | |
| answer_texts: Optional[List[TextInput]] = None, | |
| answer_coordinates: Optional[List[List[Tuple]]] = None, | |
| answer_texts: Optional[List[List[TextInput]]] = None, |
| answer_coordinates: Optional[List[Tuple]] = None, | ||
| answer_texts: Optional[List[TextInput]] = None, |
| answer_coordinates: Optional[List[Tuple]] = None, | |
| answer_texts: Optional[List[TextInput]] = None, | |
| answer_coordinates: Optional[List[List[Tuple]]] = None, |
| answer_texts: Optional[List[List[TextInput]]] = None, |
| for query_ids, raw_query, query_tokens, answer_coords, answer_text in zip( | ||
| queries_ids, raw_queries, queries_tokens, answer_coordinates, answer_texts |
If we change the answer_texts parameter in all encoding methods to answer_text, then we have to update this code (and a lot of other places):
| for query_ids, raw_query, query_tokens, answer_coords, answer_text in zip( | |
| queries_ids, raw_queries, queries_tokens, answer_coordinates, answer_texts | |
| for query_ids, raw_query, query_tokens, answer_coords, answer_txt in zip( | |
| queries_ids, raw_queries, queries_tokens, answer_coordinates, answer_text |
| table_data=table_data, | ||
| query_tokens=query_tokens, | ||
| answer_coordinates=answer_coords, | ||
| answer_text=answer_text, |
If we change the answer_texts parameter to answer_text, then this becomes:
| answer_text=answer_text, | |
| answer_text=answer_txt, |
| answer_coordinates: Optional[List[Tuple]] = None, | ||
| answer_texts: Optional[List[TextInput]] = None, |
| answer_coordinates: Optional[List[Tuple]] = None, | |
| answer_texts: Optional[List[TextInput]] = None, | |
| answer_coordinates: Optional[List[List[Tuple]]] = None, |
| answer_texts: Optional[List[List[TextInput]]] = None, |
| label_ids = self.get_answer_ids( | ||
| column_ids, row_ids, table_data, query_tokens, answer_text, answer_coordinates | ||
| ) | ||
| numeric_values = self._get_numeric_values(raw_table, column_ids, row_ids, columns_to_numeric_values) | ||
| numeric_values_scale = self._get_numeric_values_scale(raw_table, column_ids, row_ids) | ||
| encoded_inputs["label_ids"] = label_ids | ||
| encoded_inputs["numeric_values"] = numeric_values | ||
| encoded_inputs["numeric_values_scale"] = numeric_values_scale |
These are created, but it seems that the padding/truncation is not working when self.pad() is applied on them. This currently results in an error (when I run a Colab notebook from this branch):
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/transformers/tokenization_utils_base.py in convert_to_tensors(self, tensor_type, prepend_batch_axis)
607
--> 608 tensor = as_tensor(value)
609
ValueError: expected sequence of length 41 at dim 1 (got 45)
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
6 frames
/usr/local/lib/python3.6/dist-packages/transformers/tokenization_utils_base.py in convert_to_tensors(self, tensor_type, prepend_batch_axis)
623 )
624 raise ValueError(
--> 625 "Unable to create tensor, you should probably activate truncation and/or padding "
626 "with 'padding=True' 'truncation=True' to have batched tensors with the same length."
627 )
ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length.
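A fix would be to pad these extra keys the same way the regular inputs are padded before tensor conversion; a minimal sketch of the idea (not the PR's actual padding code):

```python
def pad_sequences(sequences, pad_value):
    """Pad every sequence in a batch to the length of the longest one, so
    that conversion to a rectangular tensor succeeds."""
    max_length = max(len(seq) for seq in sequences)
    return [seq + [pad_value] * (max_length - len(seq)) for seq in sequences]

# label_ids could be padded with 0; numeric_values might instead use
# float("nan"), for example.
padded_label_ids = pad_sequences([[0, 1, 0], [1, 0, 0, 1, 1]], pad_value=0)
```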
| token_ids_0 (:obj:`List[int]`): The first tokenized sequence. | ||
| token_ids_1 (:obj:`List[int]`, `optional`): The second tokenized sequence. |
| token_ids_0 (:obj:`List[int]`): The first tokenized sequence. | |
| token_ids_1 (:obj:`List[int]`, `optional`): The second tokenized sequence. | |
| token_ids_0 (:obj:`List[int]`): The ids of the question. | |
| token_ids_1 (:obj:`List[int]`, `optional`): The ids of the flattened table. |
| Build model inputs from a sequence or a pair of sequence for sequence classification tasks by concatenating and | ||
| adding special tokens. |
| Build model inputs from a sequence or a pair of sequence for sequence classification tasks by concatenating and | |
| adding special tokens. | |
| Build model inputs from a question and flattened table for sequence classification or question answering tasks by concatenating and | |
| adding special tokens. |
| Args: | ||
| token_ids_0 (:obj:`List[int]`): | ||
| List of IDs. | ||
| token_ids_1 (:obj:`List[int]`, `optional`): | ||
| Optional second list of IDs for sequence pairs. |
| Args: | |
| token_ids_0 (:obj:`List[int]`): | |
| List of IDs. | |
| token_ids_1 (:obj:`List[int]`, `optional`): | |
| Optional second list of IDs for sequence pairs. | |
| Args: | |
| token_ids_0 (:obj:`List[int]`): | |
| List of question IDs. | |
| token_ids_1 (:obj:`List[int]`, `optional`): | |
| List of flattened table IDs. |
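For reference, a BERT-style `build_inputs_with_special_tokens` for a question plus flattened table might look like this (a sketch; the placeholder ids 101/102 and the exact special-token layout are assumptions, not necessarily what TAPAS uses):

```python
def build_inputs_with_special_tokens(token_ids_0, token_ids_1=None,
                                     cls_id=101, sep_id=102):
    """Sketch of a BERT-style layout for question + flattened table:
    [CLS] question [SEP] table. The 101/102 ids are placeholders."""
    if token_ids_1 is None:
        return [cls_id] + token_ids_0 + [sep_id]
    return [cls_id] + token_ids_0 + [sep_id] + token_ids_1

ids = build_inputs_with_special_tokens([5, 6], [7, 8])
```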
| ], | ||
| } | ||
| new_encoded_inputs = tokenizer.encode_plus(table=table, query=queries[0], padding="max_length") |
This test is to prepare inputs for TAPAS for inference. We also need to include tests to prepare inputs for TAPAS for training (fine-tuning). In that case, also answer_coordinates and answer_text need to be provided to the tokenizer, and it should create label_ids, numeric_values and numeric_values_scale.
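Such a training-mode test would check that the label_ids line up with the provided answer coordinates; here is a sketch of how label ids could be derived (the helper and the 1-indexing convention are assumptions modelled on TAPAS-style token type ids, not the PR's actual code):

```python
def make_label_ids(column_ids, row_ids, answer_coordinates):
    """Sketch: a token gets label 1 when its cell's (row, column) pair is in
    answer_coordinates. Assumes the TAPAS-style convention where id 0 marks
    tokens outside the table and table rows/columns are 1-indexed."""
    answer_set = set(answer_coordinates)
    return [
        1 if col > 0 and row > 0 and (row - 1, col - 1) in answer_set else 0
        for col, row in zip(column_ids, row_ids)
    ]

# Two question tokens (ids 0) followed by a 2x2 table, answer in cell (0, 1).
label_ids = make_label_ids([0, 0, 1, 2, 1, 2], [0, 0, 1, 1, 2, 2], [(0, 1)])
```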
@LysandreJik I have finished reviewing; I've added more (mostly documentation-related) comments. Besides this, the other important things are:
This PR aims to implement the tokenizer API for the TAPAS model, as well as the tests. It is based on `tapas-style`, which contains all the changes done by black & isort on top of the `nielsrogge/tapas_v3` branch in #8113.

The API is akin to our other tokenizers': it is based on the `__call__` method, which dispatches to `encode_plus` or `batch_encode_plus` according to the inputs. These two methods then dispatch to `_encode_plus` and `_batch_encode_plus`, which themselves dispatch to `prepare_for_model` and `_batch_prepare_for_model`.

Here are the remaining tasks for the tokenizers, from what I could observe:

- `pd.DataFrame`s: it should be very simple to switch from these to `datasets.Dataset`, which serves the same purpose.

Once this PR is merged, I'll open a PR from `tapas-style` to `nielsrogge/tapas_v3`, as explained in #8113 (comment).