diff --git a/docs/source/index.rst b/docs/source/index.rst index e448bbfe26e4..7b68b3ce91bc 100644 --- a/docs/source/index.rst +++ b/docs/source/index.rst @@ -145,8 +145,8 @@ conversion utilities for the following models: 27. :doc:`T5 ` (from Google AI) released with the paper `Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer `__ by Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu. -28. :doc:`TAPAS ` (from Google AI) released with the paper `TAPAS: Weakly Supervised Table Parsing via - Pre-training `__ by Jonathan Herzig, Paweł Krzysztof Nowak, Thomas Müller, +28. :doc:`TAPAS ` (from Google AI) released with the paper `TAPAS: Weakly Supervised Table Parsing via + Pre-training `__ by Jonathan Herzig, Paweł Krzysztof Nowak, Thomas Müller, Francesco Piccinno and Julian Martin Eisenschlos. 29. :doc:`Transformer-XL ` (from Google/CMU) released with the paper `Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context `__ by Zihang Dai*, diff --git a/docs/source/model_doc/tapas.rst b/docs/source/model_doc/tapas.rst index c506ce1d722e..e070e78bac9e 100644 --- a/docs/source/model_doc/tapas.rst +++ b/docs/source/model_doc/tapas.rst @@ -5,85 +5,84 @@ Overview ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The TAPAS model was proposed in `TAPAS: Weakly Supervised Table Parsing via Pre-training -`__ by Jonathan Herzig, Paweł Krzysztof Nowak, Thomas Müller, Francesco Piccinno and -Julian Martin Eisenschlos. -It's a BERT-based model specifically designed (and pre-trained) for answering questions about tabular data. Compared to -BERT, TAPAS uses relative position embeddings and has 7 token types that encode tabular structure. 
TAPAS is pre-trained -on the masked language modeling (MLM) objective on a large dataset comprising millions of tables from English Wikipedia and -corresponding texts. For question answering, TAPAS has 2 heads on top: a cell selection head and an aggregation head, for -(optionally) performing aggregations (such as counting or summing) among selected cells. TAPAS has been fine-tuned on several -datasets: SQA (Sequential Question Answering by Microsoft), WTQ (Wiki Table Questions by Stanford University) and WikiSQL -(by Salesforce). It achieves state-of-the-art on both SQA and WTQ, while having comparable performance to SOTA on WikiSQL, -with a much simpler architecture. +`__ by Jonathan Herzig, Paweł Krzysztof Nowak, Thomas Müller, Francesco Piccinno and +Julian Martin Eisenschlos. It's a BERT-based model specifically designed (and pre-trained) for answering questions +about tabular data. Compared to BERT, TAPAS uses relative position embeddings and has 7 token types that encode tabular +structure. TAPAS is pre-trained on the masked language modeling (MLM) objective on a large dataset comprising millions +of tables from English Wikipedia and corresponding texts. For question answering, TAPAS has 2 heads on top: a cell +selection head and an aggregation head, for (optionally) performing aggregations (such as counting or summing) among +selected cells. TAPAS has been fine-tuned on several datasets: SQA (Sequential Question Answering by Microsoft), WTQ +(Wiki Table Questions by Stanford University) and WikiSQL (by Salesforce). It achieves state-of-the-art on both SQA and +WTQ, while having comparable performance to SOTA on WikiSQL, with a much simpler architecture. The abstract from the paper is the following: -*Answering natural language questions over tables is usually seen as a semantic parsing task. -To alleviate the collection cost of full logical forms, one popular approach focuses on weak -supervision consisting of denotations instead of logical forms. 
However, training semantic parsers -from weak supervision poses difficulties, and in addition, the generated logical forms are only used -as an intermediate step prior to retrieving the denotation. In this paper, we present TAPAS, an -approach to question answering over tables without generating logical forms. TAPAS trains from weak -supervision, and predicts the denotation by selecting table cells and optionally applying a corresponding -aggregation operator to such selection. TAPAS extends BERT's architecture to encode tables as input, -initializes from an effective joint pre-training of text segments and tables crawled from Wikipedia, -and is trained end-to-end. We experiment with three different semantic parsing datasets, and find -that TAPAS outperforms or rivals semantic parsing models by improving state-of-the-art accuracy on -SQA from 55.1 to 67.2 and performing on par with the state-of-the-art on WIKISQL and WIKITQ, but -with a simpler model architecture. We additionally find that transfer learning, which is trivial -in our setting, from WIKISQL to WIKITQ, yields 48.7 accuracy, 4.2 points above the state-of-the-art.* - -In addition, the authors have further pre-trained TAPAS to recognize table entailment, by creating a balanced dataset of millions -of automatically created training examples which are learned in an intermediate step prior to fine-tuning. The authors of TAPAS -call this further pre-training intermediate pre-training (since TAPAS is first pre-trained on MLM, and then on another dataset). -They found that intermediate pre-training further improves performance on SQA, achieving a new state-of-the-art as well as -state-of-the-art on TabFact, a large-scale dataset with 16k Wikipedia tables for table entailment (a binary classification task). -For more details, see their new paper: `Understanding tables with intermediate pre-training `__ -by Julian Martin Eisenschlos, Syrine Krichene and Thomas Müller. 
+*Answering natural language questions over tables is usually seen as a semantic parsing task. To alleviate the +collection cost of full logical forms, one popular approach focuses on weak supervision consisting of denotations +instead of logical forms. However, training semantic parsers from weak supervision poses difficulties, and in addition, +the generated logical forms are only used as an intermediate step prior to retrieving the denotation. In this paper, we +present TAPAS, an approach to question answering over tables without generating logical forms. TAPAS trains from weak +supervision, and predicts the denotation by selecting table cells and optionally applying a corresponding aggregation +operator to such selection. TAPAS extends BERT's architecture to encode tables as input, initializes from an effective +joint pre-training of text segments and tables crawled from Wikipedia, and is trained end-to-end. We experiment with +three different semantic parsing datasets, and find that TAPAS outperforms or rivals semantic parsing models by +improving state-of-the-art accuracy on SQA from 55.1 to 67.2 and performing on par with the state-of-the-art on WIKISQL +and WIKITQ, but with a simpler model architecture. We additionally find that transfer learning, which is trivial in our +setting, from WIKISQL to WIKITQ, yields 48.7 accuracy, 4.2 points above the state-of-the-art.* + +In addition, the authors have further pre-trained TAPAS to recognize table entailment, by creating a balanced dataset +of millions of automatically created training examples which are learned in an intermediate step prior to fine-tuning. +The authors of TAPAS call this further pre-training intermediate pre-training (since TAPAS is first pre-trained on MLM, +and then on another dataset). 
They found that intermediate pre-training further improves performance on SQA, achieving +a new state-of-the-art as well as state-of-the-art on TabFact, a large-scale dataset with 16k Wikipedia tables for +table entailment (a binary classification task). For more details, see their new paper: `Understanding tables with +intermediate pre-training `__ by Julian Martin Eisenschlos, Syrine Krichene and +Thomas Müller. The original code can be found `here `__. Tips: -- TAPAS is a model that uses relative position embeddings by default (restarting the position embeddings at every cell of the table). According to - the authors, this usually results in a slightly better performance, and allows you to encode longer sequences without running out - of embeddings. - If you don't want this, you can set the `reset_position_index_per_cell` parameter of :class:`~transformers.TapasConfig` to False. -- TAPAS has checkpoints fine-tuned on SQA, which are capable of answering questions related to a table in a conversational set-up. This - means that you can ask follow-up questions such as "what is his age?" related to the previous question. Note that the forward pass of - TAPAS is a bit different in case of a conversational set-up: in that case, you have to feed every training example one by one to the - model, such that the `prev_label_ids` token type ids can be overwritten by the predicted `label_ids` of the model to the previous - question. -- TAPAS is similar to BERT and therefore relies on the masked language modeling (MLM) objective. - It is therefore efficient at predicting masked tokens and at NLU in general, but is not optimal for - text generation. Models trained with a causal language modeling (CLM) objective are better in that regard. +- TAPAS is a model that uses relative position embeddings by default (restarting the position embeddings at every cell + of the table). 
According to the authors, this usually results in slightly better performance, and allows you to + encode longer sequences without running out of embeddings. If you don't want this, you can set the + `reset_position_index_per_cell` parameter of :class:`~transformers.TapasConfig` to False. +- TAPAS has checkpoints fine-tuned on SQA, which are capable of answering questions related to a table in a + conversational set-up. This means that you can ask follow-up questions such as "what is his age?" related to the + previous question. Note that the forward pass of TAPAS is a bit different in the case of a conversational set-up: in that + case, you have to feed every training example one by one to the model, such that the `prev_label_ids` token type ids + can be overwritten by the model's predicted `label_ids` for the previous question. +- TAPAS is similar to BERT and therefore relies on the masked language modeling (MLM) objective. It is therefore + efficient at predicting masked tokens and at NLU in general, but is not optimal for text generation. Models trained + with a causal language modeling (CLM) objective are better in that regard. Usage ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -If you just want to perform inference (i.e. making predictions) in a non-conversational setup, you can do the following: +If you just want to perform inference (i.e. making predictions) in a non-conversational setup, you can do the +following: .. 
code-block:: >>> from transformers import TapasTokenizer, TapasForQuestionAnswering >>> import pandas as pd - + >>> model_name = 'tapas-base-finetuned-wtq' >>> model = TapasForQuestionAnswering.from_pretrained(model_name) >>> tokenizer = TapasTokenizer.from_pretrained(model_name) - + >>> data = {'Actors': ["Brad Pitt", "Leonardo Di Caprio", "George Clooney"], 'Number of movies': ["87", "53", "69"]} >>> queries = ["What is the name of the first actor?", "How many movies has George Clooney played in?", "What is the total number of movies?"] >>> table = pd.DataFrame(data) >>> inputs = tokenizer(table, queries, return_tensors='pt') >>> logits, logits_agg = model(**inputs) >>> answer_coordinates_batch, aggregation_predictions = tokenizer.convert_logits_to_predictions(inputs, logits, logits_agg) - + >>> # let's print out the results: >>> id2aggregation = {0: "NONE", 1: "SUM", 2: "AVERAGE", 3:"COUNT"} >>> aggregation_predictions_string = [id2aggregation[x] for x in aggregation_predictions] - + >>> answers = [] >>> for coordinates in answer_coordinates_batch: ... if len(coordinates) == 1: @@ -95,7 +94,7 @@ If you just want to perform inference (i.e. making predictions) in a non-convers ... for coordinate in coordinates: ... cell_values.append(table.iat[coordinate]) ... 
answers.append(", ".join(cell_values)) - + >>> display(table) >>> print("") >>> for query, answer, predicted_agg in zip(queries, answers, aggregation_predictions_string): diff --git a/src/transformers/__init__.py b/src/transformers/__init__.py index 93d6cb92a00b..2122d7579c86 100755 --- a/src/transformers/__init__.py +++ b/src/transformers/__init__.py @@ -562,10 +562,10 @@ ) from .modeling_tapas import ( TAPAS_PRETRAINED_MODEL_ARCHIVE_LIST, - TapasModel, TapasForMaskedLM, TapasForQuestionAnswering, TapasForSequenceClassification, + TapasModel, load_tf_weights_in_tapas, ) from .modeling_transfo_xl import ( diff --git a/src/transformers/configuration_tapas.py b/src/transformers/configuration_tapas.py index 1f6393e41b0a..6b9eb4462f67 100644 --- a/src/transformers/configuration_tapas.py +++ b/src/transformers/configuration_tapas.py @@ -17,20 +17,20 @@ from .configuration_utils import PretrainedConfig + TAPAS_PRETRAINED_CONFIG_ARCHIVE_MAP = {"tapas-base": "", "tapas-large": ""} # to be added # to be added class TapasConfig(PretrainedConfig): r""" - This is the configuration class to store the configuration of a :class:`~transformers.TapasModel`. - It is used to instantiate a TAPAS model according to the specified arguments, defining the model - architecture. Instantiating a configuration with the defaults will yield a similar configuration - to that of the TAPAS `tapas-base-finetuned-sqa` architecture. Configuration objects - inherit from :class:`~transformers.PreTrainedConfig` and can be used to control the model outputs. - Read the documentation from :class:`~transformers.PretrainedConfig` for more information. + This is the configuration class to store the configuration of a :class:`~transformers.TapasModel`. It is used to + instantiate a TAPAS model according to the specified arguments, defining the model architecture. Instantiating a + configuration with the defaults will yield a similar configuration to that of the TAPAS `tapas-base-finetuned-sqa` + architecture. 
Configuration objects inherit from :class:`~transformers.PretrainedConfig` and can be used to control + the model outputs. Read the documentation from :class:`~transformers.PretrainedConfig` for more information. - Hyperparameters additional to BERT are taken from run_task_main.py and hparam_utils.py of the original implementation. - Original implementation available at https://github.com/google-research/tapas/tree/master. + Hyperparameters additional to BERT are taken from run_task_main.py and hparam_utils.py of the original + implementation. Original implementation available at https://github.com/google-research/tapas/tree/master. Args: vocab_size (:obj:`int`, `optional`, defaults to 30522): @@ -87,9 +87,9 @@ class TapasConfig(PretrainedConfig): average_approximation_function: (:obj:`string`, `optional`, defaults to :obj:`"ratio"`): Method to calculate expected average of cells in the relaxed case. cell_selection_preference: (:obj:`float`, `optional`, defaults to None): - Preference for cell selection in ambiguous cases. Only applicable in case of weak supervision for aggregation (WTQ, WikiSQL). - If the total mass of the aggregation probabilities (excluding the "NONE" operator) is higher than this hyperparameter, - then aggregation is predicted for an example. + Preference for cell selection in ambiguous cases. Only applicable in case of weak supervision for + aggregation (WTQ, WikiSQL). If the total mass of the aggregation probabilities (excluding the "NONE" + operator) is higher than this hyperparameter, then aggregation is predicted for an example. answer_loss_cutoff: (:obj:`float`, `optional`, defaults to None): Ignore examples with answer loss larger than cutoff. max_num_rows: (:obj:`int`, `optional`, defaults to 64): @@ -109,7 +109,7 @@ class TapasConfig(PretrainedConfig): disable_per_token_loss: (:obj:`bool`, `optional`, defaults to :obj:`False`): Disable any (strong or weak) supervision on cells. 
span_prediction: (:obj:`string`, `optional`, defaults to :obj:`"none"`): - Span selection mode to use. Currently only "none" is supported. + Span selection mode to use. Currently only "none" is supported. Example:: diff --git a/src/transformers/convert_tapas_original_tf_checkpoint_to_pytorch.py b/src/transformers/convert_tapas_original_tf_checkpoint_to_pytorch.py index 53fade3d012d..3023e9aaf6f2 100644 --- a/src/transformers/convert_tapas_original_tf_checkpoint_to_pytorch.py +++ b/src/transformers/convert_tapas_original_tf_checkpoint_to_pytorch.py @@ -19,7 +19,12 @@ import torch -from transformers import TapasConfig, TapasForQuestionAnswering, TapasForSequenceClassification, load_tf_weights_in_tapas +from transformers import ( + TapasConfig, + TapasForQuestionAnswering, + TapasForSequenceClassification, + load_tf_weights_in_tapas, +) from transformers.utils import logging @@ -41,14 +46,14 @@ def convert_tf_checkpoint_to_pytorch(tf_checkpoint_path, tapas_config_file, pyto # select_one_column = True, # allow_empty_column_selection = False, # temperature = 0.0352513) - + # SQA config config = TapasConfig() - + print("Building PyTorch model from configuration: {}".format(str(config))) # model = TapasForMaskedLM(config) model = TapasForQuestionAnswering(config) - #model = TapasForSequenceClassification(config) + # model = TapasForSequenceClassification(config) # Load weights from tf checkpoint load_tf_weights_in_tapas(model, config, tf_checkpoint_path) diff --git a/src/transformers/file_utils.py b/src/transformers/file_utils.py index c21faacbe981..20834e4550a3 100644 --- a/src/transformers/file_utils.py +++ b/src/transformers/file_utils.py @@ -191,10 +191,10 @@ except ImportError: _tokenizers_available = False - - -try: - import torch_scatter + + +try: + import torch_scatter # Check we're not importing a "torch_scatter" directory somewhere _scatter_available = hasattr(torch_scatter, "__version__") and hasattr(torch_scatter, "scatter") diff --git 
a/src/transformers/modeling_tapas.py b/src/transformers/modeling_tapas.py index e6c8f1a94596..d3b795ded588 100644 --- a/src/transformers/modeling_tapas.py +++ b/src/transformers/modeling_tapas.py @@ -15,11 +15,11 @@ """PyTorch TAPAS model. """ +import enum import math import os from dataclasses import dataclass from typing import Optional, Tuple -import enum import torch import torch.nn as nn @@ -27,19 +27,15 @@ from .activations import ACT2FN from .configuration_tapas import TapasConfig -from .file_utils import (ModelOutput, - add_start_docstrings, - add_start_docstrings_to_model_forward, - replace_return_docstrings, - is_scatter_available, - requires_scatter, -) -from .modeling_outputs import ( - BaseModelOutput, - BaseModelOutputWithPooling, - MaskedLMOutput, - SequenceClassifierOutput, +from .file_utils import ( + ModelOutput, + add_start_docstrings, + add_start_docstrings_to_model_forward, + is_scatter_available, + replace_return_docstrings, + requires_scatter, ) +from .modeling_outputs import BaseModelOutput, BaseModelOutputWithPooling, MaskedLMOutput, SequenceClassifierOutput from .modeling_utils import ( PreTrainedModel, apply_chunking_to_forward, @@ -48,6 +44,7 @@ ) from .utils import logging + # soft dependency if is_scatter_available(): from torch_scatter import scatter @@ -75,20 +72,20 @@ class TableQuestionAnsweringOutput(ModelOutput): Args: loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`label_ids` (and possibly :obj:`answer`, :obj:`aggregation_labels`, :obj:`numeric_values` and :obj:`numeric_values_scale` are provided)): - Total loss as the sum of the hierarchical cell selection log-likelihood loss and (optionally) the semi-supervised regression loss and (optionally) supervised loss for aggregations. + Total loss as the sum of the hierarchical cell selection log-likelihood loss and (optionally) the + semi-supervised regression loss and (optionally) supervised loss for aggregations. 
logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`): Prediction scores of the cell selection head, for every token. logits_aggregation (:obj:`torch.FloatTensor`, `optional`, of shape :obj:`(batch_size, num_aggregation_labels)`): Prediction scores of the aggregation head, for every aggregation operator. hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_hidden_states=True`` is passed or when ``config.output_hidden_states=True``): Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) - of shape :obj:`(batch_size, sequence_length, hidden_size)`. - Hidden-states of the model at the output of each layer plus the initial embedding outputs. + of shape :obj:`(batch_size, sequence_length, hidden_size)`. Hidden-states of the model at the output of + each layer plus the initial embedding outputs. attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``): - Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape - :obj:`(batch_size, num_heads, sequence_length, sequence_length)`. - Attentions weights after the attention softmax, used to compute the weighted average in the self-attention - heads. + Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape :obj:`(batch_size, num_heads, + sequence_length, sequence_length)`. Attentions weights after the attention softmax, used to compute the + weighted average in the self-attention heads. """ loss: Optional[torch.FloatTensor] = None @@ -99,7 +96,9 @@ class TableQuestionAnsweringOutput(ModelOutput): def load_tf_weights_in_tapas(model, config, tf_checkpoint_path): - """Load tf checkpoints in a PyTorch model. This is an adaptation from load_tf_weights_in_bert + """ + Load tf checkpoints in a PyTorch model. 
This is an adaptation from load_tf_weights_in_bert + - add cell selection and aggregation heads - take into account additional token type embedding layers """ @@ -144,19 +143,19 @@ def load_tf_weights_in_tapas(model, config, tf_checkpoint_path): ): logger.info("Skipping {}".format("/".join(name))) continue - # in case the model is TapasForSequenceClassification, we skip output_bias and output_weights + # in case the model is TapasForSequenceClassification, we skip output_bias and output_weights # since these are not used for classification if isinstance(model, TapasForSequenceClassification): if any( - n - in [ - "output_bias", - "output_weights", - ] - for n in name + n + in [ + "output_bias", + "output_weights", + ] + for n in name ): - logger.info("Skipping {}".format("/".join(name))) - continue + logger.info("Skipping {}".format("/".join(name))) + continue # if first scope name starts with "bert", change it to "tapas" if name[0] == "bert": name[0] = "tapas" @@ -233,8 +232,8 @@ def load_tf_weights_in_tapas(model, config, tf_checkpoint_path): class TapasEmbeddings(nn.Module): """ - Construct the embeddings from word, position and token_type embeddings. - Same as BertEmbeddings but with a number of additional token type embeddings to encode tabular structure. + Construct the embeddings from word, position and token_type embeddings. Same as BertEmbeddings but with a number of + additional token type embeddings to encode tabular structure. """ def __init__(self, config): @@ -632,8 +631,9 @@ def forward(self, hidden_states): class TapasPreTrainedModel(PreTrainedModel): - """An abstract class to handle weights initialization and - a simple interface for downloading and loading pretrained models. + """ + An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained + models. 
""" config_class = TapasConfig @@ -658,42 +658,40 @@ def _init_weights(self, module): methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.) - This model is also a PyTorch `torch.nn.Module `__ subclass. - Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general - usage and behavior. + This model is also a PyTorch `torch.nn.Module `__ + subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to + general usage and behavior. Parameters: config (:class:`~transformers.TapasConfig`): Model configuration class with all the parameters of the model. - Initializing with a config file does not load the weights associated with the model, only the configuration. - Check out the :meth:`~transformers.PreTrainedModel.from_pretrained` method to load the model weights. + Initializing with a config file does not load the weights associated with the model, only the + configuration. Check out the :meth:`~transformers.PreTrainedModel.from_pretrained` method to load the model + weights. """ TAPAS_INPUTS_DOCSTRING = r""" Args: input_ids (:obj:`torch.LongTensor` of shape :obj:`({0})`): - Indices of input sequence tokens in the vocabulary. - Indices can be obtained using :class:`~transformers.TapasTokenizer`. - See :meth:`transformers.PreTrainedTokenizer.encode` and - :meth:`transformers.PreTrainedTokenizer.__call__` for details. - `What are input IDs? <../glossary.html#input-ids>`__ + Indices of input sequence tokens in the vocabulary. Indices can be obtained using + :class:`~transformers.TapasTokenizer`. See :meth:`transformers.PreTrainedTokenizer.encode` and + :meth:`transformers.PreTrainedTokenizer.__call__` for details. `What are input IDs? 
+ <../glossary.html#input-ids>`__ attention_mask (:obj:`torch.FloatTensor` of shape :obj:`({0})`, `optional`): - Mask to avoid performing attention on padding token indices. - Mask values selected in ``[0, 1]``: - - 1 for tokens that are **not masked**, - - 0 for tokens that are **masked**. - `What are attention masks? <../glossary.html#attention-mask>`__ + Mask to avoid performing attention on padding token indices. Mask values selected in ``[0, 1]``: - 1 for + tokens that are **not masked**, - 0 for tokens that are **masked**. `What are attention masks? + <../glossary.html#attention-mask>`__ token_type_ids (:obj:`torch.LongTensor` of shape :obj:`({0}, 7)`, `optional`): - Token indices that encode tabular structure. Indices can be obtained using :class:`~transformers.TapasTokenizer`. See this class for more info. - `What are token type IDs? <../glossary.html#token-type-ids>`_ + Token indices that encode tabular structure. Indices can be obtained using + :class:`~transformers.TapasTokenizer`. See this class for more info. `What are token type IDs? + <../glossary.html#token-type-ids>`_ position_ids (:obj:`torch.LongTensor` of shape :obj:`({0})`, `optional`): - Indices of positions of each input sequence tokens in the position embeddings. If ``reset_position_index_per_cell`` of :class:`~transformers.TapasConfig` is set to ``True``, relative position embeddings will be used. - Selected in the range ``[0, config.max_position_embeddings - 1]``. - `What are position IDs? <../glossary.html#position-ids>`_ + Indices of positions of each input sequence tokens in the position embeddings. If + ``reset_position_index_per_cell`` of :class:`~transformers.TapasConfig` is set to ``True``, relative + position embeddings will be used. Selected in the range ``[0, config.max_position_embeddings - 1]``. `What + are position IDs? 
<../glossary.html#position-ids>`_ head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`): - Mask to nullify selected heads of the self-attention modules. - Mask values selected in ``[0, 1]``: - - 1 indicates the head is **not masked**, - - 0 indicates the head is **masked**. + Mask to nullify selected heads of the self-attention modules. Mask values selected in ``[0, 1]``: - 1 + indicates the head is **not masked**, - 0 indicates the head is **masked**. inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`({0}, hidden_size)`, `optional`): Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert :obj:`input_ids` indices into associated @@ -715,19 +713,18 @@ def _init_weights(self, module): ) class TapasModel(TapasPreTrainedModel): """ - This class is a small change compared to :class:`~transformers.BertModel`, taking into account the additional token type ids. - - The model can behave as an encoder (with only self-attention) as well - as a decoder, in which case a layer of cross-attention is added between - the self-attention layers, following the architecture described in `Attention is all you need - `__ by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, - Aidan N. Gomez, Lukasz Kaiser and Illia Polosukhin. - - To behave as an decoder the model needs to be initialized with the - :obj:`is_decoder` argument of the configuration set to :obj:`True`. - To be used in a Seq2Seq model, the model needs to initialized with both :obj:`is_decoder` - argument and :obj:`add_cross_attention` set to :obj:`True`; an - :obj:`encoder_hidden_states` is then expected as an input to the forward pass. + This class is a small change compared to :class:`~transformers.BertModel`, taking into account the additional token + type ids. 
+ + The model can behave as an encoder (with only self-attention) as well as a decoder, in which case a layer of + cross-attention is added between the self-attention layers, following the architecture described in `Attention is + all you need `__ by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, + Llion Jones, Aidan N. Gomez, Lukasz Kaiser and Illia Polosukhin. + + To behave as a decoder the model needs to be initialized with the :obj:`is_decoder` argument of the configuration + set to :obj:`True`. To be used in a Seq2Seq model, the model needs to be initialized with both the :obj:`is_decoder` + argument and :obj:`add_cross_attention` set to :obj:`True`; an :obj:`encoder_hidden_states` is then expected as an + input to the forward pass. """ config_class = TapasConfig @@ -751,9 +748,9 @@ def set_input_embeddings(self, value): self.embeddings.word_embeddings = value def _prune_heads(self, heads_to_prune): - """Prunes heads of the model. - heads_to_prune: dict of {layer_num: list of heads to prune in this layer} - See base class PreTrainedModel + """ + Prunes heads of the model. heads_to_prune: dict of {layer_num: list of heads to prune in this layer}. See base + class PreTrainedModel. """ for layer, heads in heads_to_prune.items(): self.encoder.layer[layer].attention.prune_heads(heads) @@ -903,10 +900,9 @@ def forward( ): r""" labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`): - Labels for computing the masked language modeling loss. - Indices should be in ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring) - Tokens with indices set to ``-100`` are ignored (masked), the loss is only computed for the tokens with labels - in ``[0, ..., config.vocab_size]`` + Labels for computing the masked language modeling loss. 
Indices should be in ``[-100, 0, ..., + config.vocab_size]`` (see ``input_ids`` docstring) Tokens with indices set to ``-100`` are ignored + (masked), the loss is only computed for the tokens with labels in ``[0, ..., config.vocab_size]`` Returns: @@ -923,7 +919,7 @@ def forward( >>> inputs = tokenizer(table, "How many [MASK] has George [MASK] played in?", return_tensors="pt") >>> labels = tokenizer(table, "How many movies has George Clooney played in?", return_tensors="pt")["input_ids"] - + >>> outputs = model(**inputs, labels=labels) >>> logits = outputs.logits """ @@ -989,8 +985,11 @@ def forward(self, features, **kwargs): @add_start_docstrings( - """Tapas Model with a cell selection head and optionally aggregation head on top for question-answering - tasks on tables (linear layers on top of the hidden-states output to compute `logits` and optionally `logits_aggregation`), e.g. for SQA, WTQ or WikiSQL tasks. """, + """ + Tapas Model with a cell selection head and optionally aggregation head on top for question-answering tasks on + tables (linear layers on top of the hidden-states output to compute `logits` and optionally `logits_aggregation`), + e.g. for SQA, WTQ or WikiSQL tasks. + """, TAPAS_START_DOCSTRING, ) class TapasForQuestionAnswering(TapasPreTrainedModel): @@ -1049,27 +1048,29 @@ def forward( ): r""" table_mask (:obj:`torch.LongTensor` of shape :obj:`(batch_size, seq_length)`, `optional`): - Mask for the table. Indicates which tokens belong to the table (1). Question tokens, table headers and padding are 0. + Mask for the table. Indicates which tokens belong to the table (1). Question tokens, table headers and + padding are 0. label_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, seq_length)`, `optional`): - Labels per token for computing the hierarchical cell selection loss. This encodes the positions of the answer appearing in the table. Can be obtained using :class:`~transformers.TapasTokenizer`. 
- - 1 for tokens that are **part of the answer**, - - 0 for tokens that are **not part of the answer**. + Labels per token for computing the hierarchical cell selection loss. This encodes the positions of the + answer appearing in the table. Can be obtained using :class:`~transformers.TapasTokenizer`. - 1 for tokens + that are **part of the answer**, - 0 for tokens that are **not part of the answer**. aggregation_labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, )`, `optional`): - Aggregation function index for every example in the batch for computing the aggregation loss. - Indices should be in :obj:`[0, ..., config.num_aggregation_labels - 1]`. - Only required in case of strong supervision for aggregation (WikiSQL-SUPERVISED). + Aggregation function index for every example in the batch for computing the aggregation loss. Indices + should be in :obj:`[0, ..., config.num_aggregation_labels - 1]`. Only required in case of strong + supervision for aggregation (WikiSQL-SUPERVISED). answer (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, )`, `optional`): - Answer for every example in the batch. NaN if there is no scalar answer. - Only required in case of weak supervision (WTQ, WikiSQL) to calculate the aggregate mask and regression loss. + Answer for every example in the batch. NaN if there is no scalar answer. Only required in case of weak + supervision (WTQ, WikiSQL) to calculate the aggregate mask and regression loss. numeric_values (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, seq_length)`, `optional`): - Numeric values of every token, NaN for tokens which are not numeric values. Can be obtained using :class:`~transformers.TapasTokenizer`. - Only required in case of weak supervision for aggregation (WTQ, WikiSQL) to calculate the regression loss. + Numeric values of every token, NaN for tokens which are not numeric values. Can be obtained using + :class:`~transformers.TapasTokenizer`. 
Only required in case of weak supervision for aggregation (WTQ, + WikiSQL) to calculate the regression loss. numeric_values_scale (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, seq_length)`, `optional`): - Scale of the numeric values of every token. Can be obtained using :class:`~transformers.TapasTokenizer`. - Only required in case of weak supervision for aggregation (WTQ, WikiSQL) to calculate the regression loss. + Scale of the numeric values of every token. Can be obtained using :class:`~transformers.TapasTokenizer`. + Only required in case of weak supervision for aggregation (WTQ, WikiSQL) to calculate the regression loss. Returns: - + Examples:: >>> from transformers import TapasTokenizer, TapasForQuestionAnswering @@ -1160,9 +1161,7 @@ def forward( cell_mask, _ = reduce_mean(input_mask_float, cell_index) # Compute logits per token. These are used to select individual cells. - logits = compute_token_logits( - sequence_output, self.config.temperature, self.output_weights, self.output_bias - ) + logits = compute_token_logits(sequence_output, self.config.temperature, self.output_weights, self.output_bias) # Compute logits per column. These are used to select a column. 
column_logits = None @@ -1208,7 +1207,7 @@ def forward( pooled_output, self.config.cell_selection_preference, label_ids, - self.aggregation_classifier + self.aggregation_classifier, ) else: raise ValueError("You have to specify answers in order to calculate the aggregate mask") @@ -1262,13 +1261,15 @@ def forward( logits_aggregation, aggregate_mask, aggregation_labels, self.config ) else: - raise ValueError("You have to specify aggregation labels in order to calculate the aggregation loss") + raise ValueError( + "You have to specify aggregation labels in order to calculate the aggregation loss" + ) else: # Set aggregation labels to zeros aggregation_labels = torch.zeros(label_ids.shape[0], dtype=torch.long, device=label_ids.device) per_example_additional_loss = _calculate_aggregation_loss( - logits_aggregation, aggregate_mask, aggregation_labels, self.config - ) + logits_aggregation, aggregate_mask, aggregation_labels, self.config + ) if self.config.use_answer_as_supervision: if numeric_values is not None and numeric_values_scale is not None: @@ -1288,7 +1289,9 @@ def forward( # Zero loss for examples with answer_loss > cutoff. per_example_additional_loss *= large_answer_loss_mask else: - raise ValueError("You have to specify numeric values and numeric values scale in order to calculate the regression loss") + raise ValueError( + "You have to specify numeric values and numeric values scale in order to calculate the regression loss" + ) total_loss += torch.mean(per_example_additional_loss) @@ -1310,9 +1313,12 @@ def forward( attentions=outputs.attentions, ) + @add_start_docstrings( - """Tapas Model with a sequence classification head on top (a linear layer on top of - the pooled output), e.g. for TabFact (Chen et al., 2020). """, + """ + Tapas Model with a sequence classification head on top (a linear layer on top of the pooled output), e.g. for + TabFact (Chen et al., 2020). 
+ """, TAPAS_START_DOCSTRING, ) class TapasForSequenceClassification(TapasPreTrainedModel): @@ -1343,14 +1349,13 @@ def forward( ): r""" labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`): - Labels for computing the sequence classification/regression loss. - Indices should be in :obj:`[0, ..., config.num_labels - 1]`. - If :obj:`config.num_labels == 1` a regression loss is computed (Mean-Square loss), - If :obj:`config.num_labels > 1` a classification loss is computed (Cross-Entropy). - Note: this is called "classification_class_index" in the original implementation. + Labels for computing the sequence classification/regression loss. Indices should be in :obj:`[0, ..., + config.num_labels - 1]`. If :obj:`config.num_labels == 1` a regression loss is computed (Mean-Square loss), + If :obj:`config.num_labels > 1` a classification loss is computed (Cross-Entropy). Note: this is called + "classification_class_index" in the original implementation. Returns: - + Examples:: >>> from transformers import TapasTokenizer, TapasForSequenceClassification @@ -1366,13 +1371,13 @@ def forward( >>> inputs = tokenizer(table, queries, return_tensors="pt") >>> labels = torch.tensor([1, 0]) # 1 means entailed, 0 means refuted - + >>> outputs = model(**inputs, labels=labels) >>> loss = outputs.loss >>> logits = outputs.logits """ return_dict = return_dict if return_dict is not None else self.config.use_return_dict - + outputs = self.tapas( input_ids, attention_mask=attention_mask, @@ -1411,8 +1416,10 @@ def forward( attentions=outputs.attentions, ) + """ TAPAS utilities.""" + class AverageApproximationFunction(str, enum.Enum): RATIO = "ratio" FIRST_ORDER = "first_order" @@ -1426,17 +1433,19 @@ class IndexMap(object): """Index grouping entries within a tensor.""" def __init__(self, indices, num_segments, batch_dims=0): - """Creates an index. + """ + Creates an index + Args: indices (:obj:`torch.LongTensor`, same shape as `values`): Tensor containing the indices. 
num_segments (:obj:`torch.LongTensor`): - Scalar tensor, the number of segments. All elements in a batched segmented tensor - must have the same number of segments (although many segments can be empty). + Scalar tensor, the number of segments. All elements in a batched segmented tensor must have the same + number of segments (although many segments can be empty). batch_dims (:obj:`int`, `optional`, defaults to 0): - The number of batch dimensions. The first `batch_dims` dimensions of a SegmentedTensor - are treated as batch dimensions. Segments in different batch elements are always distinct - even if they have the same index. + The number of batch dimensions. The first `batch_dims` dimensions of a SegmentedTensor are treated as + batch dimensions. Segments in different batch elements are always distinct even if they have the same + index. """ self.indices = torch.as_tensor(indices) self.num_segments = torch.as_tensor(num_segments, device=indices.device) @@ -1450,14 +1459,13 @@ class ProductIndexMap(IndexMap): """The product of two indices.""" def __init__(self, outer_index, inner_index): - """Combines indices i and j into pairs (i, j). - The result is an index where each segment (i, j) is the intersection of - segments i and j. For example if the inputs represent table cells indexed by - respectively rows and columns the output will be a table indexed by - (row, column) pairs, i.e. by cell. - The implementation combines indices {0, .., n - 1} and {0, .., m - 1} into - {0, .., nm - 1}. The output has `num_segments` equal to - `outer_index.num_segments` * `inner_index.num_segments`. + """ + Combines indices i and j into pairs (i, j). The result is an index where each segment (i, j) is the + intersection of segments i and j. For example if the inputs represent table cells indexed by respectively rows + and columns the output will be a table indexed by (row, column) pairs, i.e. by cell. 
The implementation + combines indices {0, .., n - 1} and {0, .., m - 1} into {0, .., nm - 1}. The output has `num_segments` equal to + `outer_index.num_segments` * `inner_index.num_segments` + Args: outer_index (:obj:`IndexMap`): IndexMap. @@ -1496,17 +1504,18 @@ def project_inner(self, index): def gather(values, index, name="segmented_gather"): - """Gathers from `values` using the index map. - For each element in the domain of the index map this operation looks up a - value for that index in `values`. Two elements from the same segment always - get assigned the same value. + """ + Gathers from `values` using the index map. For each element in the domain of the index map this operation looks up + a value for that index in `values`. Two elements from the same segment always get assigned the same value + Args: values (:obj:`torch.Tensor` of shape (B1, ..., Bn, num_segments, V1, ...)): Tensor with segment values. index (:obj:`IndexMap` of shape (B1, ..., Bn, I1, ..., Ik)): IndexMap. name (:obj:`str`, `optional`, defaults to 'segmented_gather'): - Name for the operation. Currently not used. + Name for the operation. Currently not used + Returns: :obj:`tuple(torch.Tensor)`: Tensor of shape (B1, ..., Bn, I1, ..., Ik, V1, ...) with the gathered values. """ @@ -1528,16 +1537,18 @@ def gather(values, index, name="segmented_gather"): def flatten(index, name="segmented_flatten"): - """Flattens a batched index map (which is typically of shape batch_size, seq_length) to a 1d index map. - This operation relabels the segments to keep batch elements distinct. The k-th - batch element will have indices shifted by `num_segments` * (k - 1). The - result is a tensor with `num_segments` multiplied by the number of elements - in the batch. + """ + Flattens a batched index map (which is typically of shape batch_size, seq_length) to a 1d index map. This operation + relabels the segments to keep batch elements distinct. 
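The pair-combination arithmetic that `ProductIndexMap` describes can be sketched in plain Python (hypothetical helper name, not part of this diff; the real class operates on torch tensors). A row id `i` in `{0, .., n - 1}` and a column id `j` in `{0, .., m - 1}` combine into the cell id `i * m + j`:

```python
# Sketch of ProductIndexMap's index arithmetic: combining an outer (row)
# index with an inner (column) index yields one flat cell id per token.
def combine_indices(row_ids, col_ids, num_cols):
    """Map (row, column) pairs to flat cell ids in {0, .., n * m - 1}."""
    return [i * num_cols + j for i, j in zip(row_ids, col_ids)]

# Six tokens of a 2x3 table, tagged with their row and column.
rows = [0, 0, 0, 1, 1, 1]
cols = [0, 1, 2, 0, 1, 2]
print(combine_indices(rows, cols, num_cols=3))  # -> [0, 1, 2, 3, 4, 5]
```

Tokens that share both a row id and a column id get the same cell id, which is exactly the "intersection of segments i and j" property the docstring states.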
The k-th batch element will have indices shifted by + `num_segments` * (k - 1). The result is a tensor with `num_segments` multiplied by the number of elements in the + batch + Args: index (:obj:`IndexMap`): IndexMap to flatten. name (:obj:`str`, `optional`, defaults to 'segmented_flatten'): - Name for the operation. Currently not used. + Name for the operation. Currently not used + Returns: (:obj:`IndexMap`): The flattened IndexMap. """ @@ -1555,14 +1566,17 @@ def flatten(index, name="segmented_flatten"): def range_index_map(batch_shape, num_segments, name="range_index_map"): - """Constructs an index map equal to range(num_segments). + """ + Constructs an index map equal to range(num_segments) + Args: batch_shape (:obj:`torch.Size`): Batch shape num_segments (:obj:`int`): Number of segments name (:obj:`str`, `optional`, defaults to 'range_index_map'): - Name for the operation. Currently not used. + Name for the operation. Currently not used + Returns: (:obj:`IndexMap`): IndexMap of shape batch_shape with elements equal to range(num_segments). """ @@ -1593,7 +1607,9 @@ def range_index_map(batch_shape, num_segments, name="range_index_map"): def _segment_reduce(values, index, segment_reduce_fn, name): - """Applies a segment reduction segment-wise. + """ + Applies a segment reduction segment-wise + Args: values (:obj:`torch.Tensor`): Tensor with segment values. @@ -1602,7 +1618,8 @@ def _segment_reduce(values, index, segment_reduce_fn, name): segment_reduce_fn (:obj:`str`): Name for the reduce operation. One of "sum", "mean", "max" or "min". name (:obj:`str`): - Name for the operation. Currently not used. + Name for the operation. Currently not used + Returns: (:obj:`IndexMap`): IndexMap of shape batch_shape with elements equal to range(num_segments). """ @@ -1641,99 +1658,99 @@ def _segment_reduce(values, index, segment_reduce_fn, name): def reduce_sum(values, index, name="segmented_reduce_sum"): - """Sums a tensor over its segments. - Outputs 0 for empty segments. 
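The batch-offset relabeling that `flatten` performs can be sketched torch-free (hypothetical helper name; with 0-indexed batch positions the "k-th element shifted by `num_segments` * (k - 1)" wording corresponds to a shift of `k * num_segments`):

```python
# Sketch of flattening a batched index map: segment ids of batch element k
# are shifted by k * num_segments so segments from different batch elements
# never collide after flattening.
def flatten_index(batched_indices, num_segments):
    flat = []
    for k, row in enumerate(batched_indices):
        flat.extend(idx + k * num_segments for idx in row)
    return flat

# Two batch elements, each with 2 possible segments.
print(flatten_index([[0, 0, 1], [0, 1, 1]], num_segments=2))
# -> [0, 0, 1, 2, 3, 3]
```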
- This operations computes the sum over segments, with support for: + """ + Sums a tensor over its segments. Outputs 0 for empty segments. This operation computes the sum over segments, with + support for: + - Batching using the first dimensions [B1, B2, ..., Bn]. Each element in - a batch can have different indices. - - Vectorization using the last dimension [V1, V2, ...]. If they are present - the output will be a sum of vectors rather than scalars. - Only the middle dimensions [I1, ..., Ik] are reduced by the operation. + a batch can have different indices. - Vectorization using the last dimension [V1, V2, ...]. If they are present + the output will be a sum of vectors rather than scalars. Only the middle dimensions [I1, ..., Ik] are reduced + by the operation + Args: values (:obj:`torch.Tensor` of shape [B1, B2, ..., Bn, I1, .., Ik, V1, V2, ..]): Tensor containing the values of which the sum must be taken segment-wise. index (:obj:`IndexMap`, indices are of shape [B1, B2, ..., Bn, I1, .., Ik].): Index defining the segments. name (:obj:`str`, `optional`, defaults to 'segmented_reduce_sum'): - Name for the operation. Currently not used. + Name for the operation. Currently not used + Returns: - output_values (:obj:`torch.Tensor`of shape [B1, B2, ..., Bn, num_segments, V1, V2, ..]): - Tensor containing the output values. - output_index (:obj:`IndexMap`): - IndexMap with shape [B1, B2, ..., Bn, num_segments]. . + output_values (:obj:`torch.Tensor` of shape [B1, B2, ..., Bn, num_segments, V1, V2, ..]): Tensor containing the + output values. output_index (:obj:`IndexMap`): IndexMap with shape [B1, B2, ..., Bn, num_segments]. """ return _segment_reduce(values, index, "sum", name) def reduce_mean(values, index, name="segmented_reduce_mean"): - """Averages a tensor over its segments. - Outputs 0 for empty segments. - This operations computes the mean over segments, with support for: + """ + Averages a tensor over its segments. Outputs 0 for empty segments.
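The segment-wise reduction semantics ("outputs 0 for empty segments") can be sketched in plain Python (hypothetical helper, shown for sum; mean, max and min are analogous — the real functions dispatch to scatter-based torch ops over `IndexMap`s):

```python
# Sketch of a segment-wise sum: values sharing a segment id are reduced
# together, and segments that receive no value stay at 0.
def segment_sum(values, indices, num_segments):
    out = [0.0] * num_segments          # empty segments remain 0
    for v, i in zip(values, indices):
        out[i] += v
    return out

values = [1.0, 2.0, 3.0, 4.0]
indices = [0, 0, 2, 2]                  # segment 1 is empty
print(segment_sum(values, indices, num_segments=3))  # -> [3.0, 0.0, 7.0]
```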
This operation computes the mean over segments, + with support for: + - Batching using the first dimensions [B1, B2, ..., Bn]. Each element in - a batch can have different indices. - - Vectorization using the last dimension [V1, V2, ...]. If they are present - the output will be a mean of vectors rather than scalars. - Only the middle dimensions [I1, ..., Ik] are reduced by the operation. + a batch can have different indices. - Vectorization using the last dimension [V1, V2, ...]. If they are present + the output will be a mean of vectors rather than scalars. Only the middle dimensions [I1, ..., Ik] are reduced + by the operation + Args: values (:obj:`torch.Tensor` of shape [B1, B2, ..., Bn, I1, .., Ik, V1, V2, ..]): Tensor containing the values of which the mean must be taken segment-wise. index (:obj:`IndexMap`, indices are of shape [B1, B2, ..., Bn, I1, .., Ik].): Index defining the segments. name (:obj:`str`, `optional`, defaults to 'segmented_reduce_sum'): - Name for the operation. Currently not used. + Name for the operation. Currently not used + Returns: - output_values (:obj:`torch.Tensor`of shape [B1, B2, ..., Bn, num_segments, V1, V2, ..]): - Tensor containing the output values. - output_index (:obj:`IndexMap`): - IndexMap with shape [B1, B2, ..., Bn, num_segments]. + output_values (:obj:`torch.Tensor` of shape [B1, B2, ..., Bn, num_segments, V1, V2, ..]): Tensor containing the + output values. output_index (:obj:`IndexMap`): IndexMap with shape [B1, B2, ..., Bn, num_segments]. """ return _segment_reduce(values, index, "mean", name) def reduce_max(values, index, name="segmented_reduce_max"): - """Computes the maximum over segments. - This operations computes the maximum over segments, with support for: + """ + Computes the maximum over segments. This operation computes the maximum over segments, with support for: + - Batching using the first dimensions [B1, B2, ..., Bn]. Each element in - a batch can have different indices.
- - Vectorization using the last dimension [V1, V2, ...]. If they are present - the output will be an element-wise maximum of vectors rather than scalars. - Only the middle dimensions [I1, ..., Ik] are reduced by the operation. + a batch can have different indices. - Vectorization using the last dimension [V1, V2, ...]. If they are present + the output will be an element-wise maximum of vectors rather than scalars. Only the middle dimensions [I1, ..., + Ik] are reduced by the operation + Args: values (:obj:`torch.Tensor` of shape [B1, B2, ..., Bn, I1, .., Ik, V1, V2, ..]): Tensor containing the values of which the max must be taken segment-wise. index (:obj:`IndexMap`, indices are of shape [B1, B2, ..., Bn, I1, .., Ik].): Index defining the segments. name (:obj:`str`, `optional`, defaults to 'segmented_reduce_sum'): - Name for the operation. Currently not used. + Name for the operation. Currently not used + Returns: - output_values (:obj:`torch.Tensor`of shape [B1, B2, ..., Bn, num_segments, V1, V2, ..]): - Tensor containing the output values. - output_index (:obj:`IndexMap`): - IndexMap with shape [B1, B2, ..., Bn, num_segments]. + output_values (:obj:`torch.Tensor` of shape [B1, B2, ..., Bn, num_segments, V1, V2, ..]): Tensor containing the + output values. output_index (:obj:`IndexMap`): IndexMap with shape [B1, B2, ..., Bn, num_segments]. """ return _segment_reduce(values, index, "max", name) def reduce_min(values, index, name="segmented_reduce_min"): - """Computes the minimum over segments. - This operations computes the maximum over segments, with support for: + """ + Computes the minimum over segments. This operation computes the minimum over segments, with support for: + - Batching using the first dimensions [B1, B2, ..., Bn]. Each element in - a batch can have different indices. - - Vectorization using the last dimension [V1, V2, ...]. If they are present - the output will be an element-wise maximum of vectors rather than scalars.
- Only the middle dimensions [I1, ..., Ik] are reduced by the operation. + a batch can have different indices. - Vectorization using the last dimension [V1, V2, ...]. If they are present + the output will be an element-wise minimum of vectors rather than scalars. Only the middle dimensions [I1, ..., + Ik] are reduced by the operation + Args: values (:obj:`torch.Tensor` of shape [B1, B2, ..., Bn, I1, .., Ik, V1, V2, ..]): Tensor containing the values of which the min must be taken segment-wise. index (:obj:`IndexMap`, indices are of shape [B1, B2, ..., Bn, I1, .., Ik].): Index defining the segments. name (:obj:`str`, `optional`, defaults to 'segmented_reduce_sum'): - Name for the operation. Currently not used. + Name for the operation. Currently not used + Returns: - output_values (:obj:`torch.Tensor`of shape [B1, B2, ..., Bn, num_segments, V1, V2, ..]): - Tensor containing the output values. - output_index (:obj:`IndexMap`): - IndexMap with shape [B1, B2, ..., Bn, num_segments]. + output_values (:obj:`torch.Tensor` of shape [B1, B2, ..., Bn, num_segments, V1, V2, ..]): Tensor containing the + output values. output_index (:obj:`IndexMap`): IndexMap with shape [B1, B2, ..., Bn, num_segments]. """ return _segment_reduce(values, index, "min", name) @@ -1744,7 +1761,8 @@ def reduce_min(values, index, name="segmented_reduce_min"): def compute_column_logits( sequence_output, column_output_weights, column_output_bias, cell_index, cell_mask, allow_empty_column_selection ): - """Computes the column logits. + """ + Computes the column logits. Args: sequence_output (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`): @@ -1758,10 +1776,11 @@ def compute_column_logits( cell_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, max_num_rows * max_num_cols)`): Mask for cells that exist in the table (i.e. that are not padding). allow_empty_column_selection (:obj:`bool`): - Whether to allow not to select any column.
+ Whether to allow not to select any column + Returns: - column_logits (:obj:`torch.FloatTensor`of shape :obj:`(batch_size, max_num_cols)`): - Tensor containing the column logits for every example in the batch. + column_logits (:obj:`torch.FloatTensor`of shape :obj:`(batch_size, max_num_cols)`): Tensor containing the + column logits for every example in the batch. """ # First, compute the token logits (batch_size, seq_len) - without temperature @@ -1792,10 +1811,10 @@ def compute_column_logits( def _single_column_cell_selection_loss(token_logits, column_logits, label_ids, cell_index, col_index, cell_mask): - """Computes the loss for cell selection constrained to a single column. - The loss is a hierarchical log-likelihood. The model first predicts a column - and then selects cells within that column (conditioned on the column). Cells - outside the selected column are never selected. + """ + Computes the loss for cell selection constrained to a single column. The loss is a hierarchical log-likelihood. The + model first predicts a column and then selects cells within that column (conditioned on the column). Cells outside + the selected column are never selected. Args: token_logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`): @@ -1812,11 +1831,10 @@ def _single_column_cell_selection_loss(token_logits, column_logits, label_ids, c Mask for cells that exist in the table (i.e. that are not padding). Returns: - selection_loss_per_example (:obj:`torch.FloatTensor` of shape :obj:`(batch_size,)`): - Loss for each example. - logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`): - New logits which are only allowed to select cells in a single column. Logits outside of the most likely - column according to `column_logits` will be set to a very low value (such that the probabilities are 0). + selection_loss_per_example (:obj:`torch.FloatTensor` of shape :obj:`(batch_size,)`): Loss for each example. 
+ logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`): New logits which are only + allowed to select cells in a single column. Logits outside of the most likely column according to + `column_logits` will be set to a very low value (such that the probabilities are 0). """ ## Part 1: column loss @@ -1903,7 +1921,9 @@ def _single_column_cell_selection_loss(token_logits, column_logits, label_ids, c def compute_token_logits(sequence_output, temperature, output_weights, output_bias): - """Computes logits per token. + """ + Computes logits per token + Args: sequence_output (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`): Also known as last_hidden_state. Sequence of hidden-states at the output of the last layer of the model. @@ -1912,10 +1932,10 @@ def compute_token_logits(sequence_output, temperature, output_weights, output_bi output_weights (:obj:`torch.FloatTensor` of shape :obj:`(hidden_size,)`): Weights of the linear layer for cell selection. output_bias (:obj:`torch.FloatTensor` of shape :obj:`()`): - Bias of the linear layer for cell selection. + Bias of the linear layer for cell selection + Returns: - logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`): - Logits per token. + logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`): Logits per token. """ logits = (torch.einsum("bsj,j->bs", sequence_output, output_weights) + output_bias) / temperature @@ -1923,17 +1943,16 @@ def compute_token_logits(sequence_output, temperature, output_weights, output_bi def _calculate_aggregate_mask(answer, pooled_output, cell_selection_preference, label_ids, aggregation_classifier): - """Finds examples where the model should select cells with no aggregation. - - Returns a mask that determines for which examples should the model select - answers directly from the table, without any aggregation function. 
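The token-logit formula shown in this hunk (`torch.einsum("bsj,j->bs", ...)` plus a bias, divided by the temperature) is just a per-token dot product with the cell-selection weights; a plain-Python sketch (hypothetical helper name, torch-free):

```python
# Sketch of compute_token_logits: for each token's hidden state, take the
# dot product with the selection weights, add the bias, and scale by the
# temperature. "bsj,j->bs" contracts the hidden dimension j.
def token_logits(sequence_output, weights, bias, temperature):
    return [
        [(sum(h * w for h, w in zip(token, weights)) + bias) / temperature
         for token in batch]
        for batch in sequence_output
    ]

seq = [[[1.0, 0.0], [0.5, 0.5]]]  # batch of 1, two tokens, hidden size 2
print(token_logits(seq, weights=[2.0, 4.0], bias=1.0, temperature=2.0))
# -> [[1.5, 2.0]]
```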
If the - answer is a piece of text the case is unambiguous as aggregation functions - only apply to numbers. If the answer is a number but does not appear in the - table then we must use some aggregation case. The ambiguous case is when the - answer is a number that also appears in the table. In this case we use the - aggregation function probabilities predicted by the model to decide whether - to select or aggregate. The threshold for this is a hyperparameter - `cell_selection_preference` + """ + Finds examples where the model should select cells with no aggregation. + + Returns a mask that determines for which examples the model should select answers directly from the table, without + any aggregation function. If the answer is a piece of text the case is unambiguous as aggregation functions only + apply to numbers. If the answer is a number but does not appear in the table then we must use some aggregation + case. The ambiguous case is when the answer is a number that also appears in the table. In this case we use the + aggregation function probabilities predicted by the model to decide whether to select or aggregate. The threshold + for this is a hyperparameter `cell_selection_preference` + Args: answer (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, )`): Answer for every example in the batch. Nan if there is no scalar answer. @@ -1942,12 +1961,11 @@ def _calculate_aggregate_mask(answer, pooled_output, cell_selection_preference, cell_selection_preference (:obj:`float`): Preference for cell selection in ambiguous cases. label_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`): - Labels per token. - aggregation_classifier (:obj:`torch.nn.Linear`): - Aggregation head. + Labels per token. aggregation_classifier (:obj:`torch.nn.Linear`): Aggregation head + Returns: - aggregate_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size,)`): - A mask set to 1 for examples that should use aggregation functions.
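The select-or-aggregate decision described in this docstring can be sketched per example (a simplified, hypothetical helper: the real implementation derives the cell-selection probability from the aggregation head's logits and works on whole batches of tensors):

```python
import math

# Sketch of the aggregate-mask decision rule, one example at a time:
# text answers never aggregate; numeric answers aggregate unless the model
# clearly prefers cell selection AND the answer appears in the table.
def aggregate_mask(answer, prob_cell_selection, answer_in_table,
                   cell_selection_preference):
    if math.isnan(answer):
        return 0.0  # text answer: pure cell selection
    prefers_cells = prob_cell_selection > cell_selection_preference
    if prefers_cells and answer_in_table:
        return 0.0  # ambiguous numeric case resolved toward selection
    return 1.0      # otherwise an aggregation function is required

print(aggregate_mask(float("nan"), 0.9, True, 0.5))  # -> 0.0
print(aggregate_mask(42.0, 0.1, False, 0.5))         # -> 1.0
```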
+ aggregate_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size,)`): A mask set to 1 for examples that + should use aggregation functions. """ # torch.FloatTensor(batch_size,) aggregate_mask_init = torch.logical_not(torch.isnan(answer)).type(torch.FloatTensor).to(answer.device) @@ -1976,13 +1994,13 @@ def _calculate_aggregate_mask(answer, pooled_output, cell_selection_preference, def _calculate_aggregation_loss_known(logits_aggregation, aggregate_mask, aggregation_function_id, config): - """Calculates aggregation loss when its type is known during training. + """ + Calculates aggregation loss when its type is known during training. + + In the weakly supervised setting, the only known information is that for cell selection examples, "no aggregation" + should be predicted. For other examples (those that require aggregation), no loss is accumulated. In the setting + where aggregation type is always known, standard cross entropy loss is accumulated for all examples - In the weakly supervised setting, the only known information is that for - cell selection examples, "no aggregation" should be predicted. For other - examples (those that require aggregation), no loss is accumulated. - In the setting where aggregation type is always known, standard cross entropy - loss is accumulated for all examples. Args: logits_aggregation (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, num_aggregation_labels)`): Logits per aggregation operation. @@ -1991,10 +2009,11 @@ def _calculate_aggregation_loss_known(logits_aggregation, aggregate_mask, aggreg aggregation_function_id (:obj:`torch.LongTensor` of shape :obj:`(batch_size, )`): Aggregation function id for every example in the batch. config (:class:`~transformers.TapasConfig`): - Model configuration class with all the parameters of the model. 
+ Model configuration class with all the parameters of the model + Returns: - aggregation_loss_known (:obj:`torch.FloatTensor` of shape :obj:`(batch_size,)`): - Aggregation loss (when its type is known during training) per example. + aggregation_loss_known (:obj:`torch.FloatTensor` of shape :obj:`(batch_size,)`): Aggregation loss (when its + type is known during training) per example. """ if config.use_answer_as_supervision: # Prepare "no aggregation" targets for cell selection examples. @@ -2019,15 +2038,18 @@ def _calculate_aggregation_loss_known(logits_aggregation, aggregate_mask, aggreg def _calculate_aggregation_loss_unknown(logits_aggregation, aggregate_mask): - """Calculates aggregation loss in the case of answer supervision. + """ + Calculates aggregation loss in the case of answer supervision + Args: logits_aggregation (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, num_aggregation_labels)`): Logits per aggregation operation. aggregate_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, )`): - A mask set to 1 for examples that should use aggregation functions. + A mask set to 1 for examples that should use aggregation functions + Returns: - aggregation_loss_unknown (:obj:`torch.FloatTensor` of shape :obj:`(batch_size,)`): - Aggregation loss (in case of answer supervision) per example. + aggregation_loss_unknown (:obj:`torch.FloatTensor` of shape :obj:`(batch_size,)`): Aggregation loss (in case of + answer supervision) per example. """ dist_aggregation = torch.distributions.categorical.Categorical(logits=logits_aggregation) @@ -2041,7 +2063,9 @@ def _calculate_aggregation_loss_unknown(logits_aggregation, aggregate_mask): def _calculate_aggregation_loss(logits_aggregation, aggregate_mask, aggregation_function_id, config): - """Calculates the aggregation loss per example. 
+ """ + Calculates the aggregation loss per example + Args: logits_aggregation (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, num_aggregation_labels)`): Logits per aggregation operation. @@ -2050,10 +2074,10 @@ def _calculate_aggregation_loss(logits_aggregation, aggregate_mask, aggregation_ aggregation_function_id (:obj:`torch.LongTensor` of shape :obj:`(batch_size, )`): Aggregation function id for every example in the batch. config (:class:`~transformers.TapasConfig`): - Model configuration class with all the parameters of the model. + Model configuration class with all the parameters of the model + Returns: - aggregation_loss (:obj:`torch.FloatTensor` of shape :obj:`(batch_size,)`): - Aggregation loss per example. + aggregation_loss (:obj:`torch.FloatTensor` of shape :obj:`(batch_size,)`): Aggregation loss per example. """ per_example_aggregation_loss = _calculate_aggregation_loss_known( logits_aggregation, aggregate_mask, aggregation_function_id, config @@ -2068,7 +2092,9 @@ def _calculate_aggregation_loss(logits_aggregation, aggregate_mask, aggregation_ def _calculate_expected_result( dist_per_cell, numeric_values, numeric_values_scale, input_mask_float, logits_aggregation, config ): - """Calculate the expected result given cell and aggregation probabilities. + """ + Calculate the expected result given cell and aggregation probabilities + Args: dist_per_cell (:obj:`torch.distributions.Bernoulli`): Cell selection distribution for each cell. @@ -2081,10 +2107,10 @@ def _calculate_expected_result( logits_aggregation (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, num_aggregation_labels)`): Logits per aggregation operation. config (:class:`~transformers.TapasConfig`): - Model configuration class with all the parameters of the model. + Model configuration class with all the parameters of the model + Returns: - expected_result (:obj:`torch.FloatTensor` of shape :obj:`(batch_size,)`): - The expected result per example. 
+        expected_result (:obj:`torch.FloatTensor` of shape :obj:`(batch_size,)`): The expected result per example.
     """
     if config.use_gumbel_for_cells:
         gumbel_dist = torch.distributions.RelaxedBernoulli(
@@ -2163,7 +2189,9 @@ def _calculate_regression_loss(
     logits_aggregation,
     config,
 ):
-    """Calculates the regression loss per example.
+    """
+    Calculates the regression loss per example
+
     Args:
         answer (:obj: `torch.FloatTensor` of shape :obj:`(batch_size,)`):
             Answer for every example in the batch. Nan if there is no scalar answer.
@@ -2180,12 +2208,12 @@ def _calculate_regression_loss(
         logits_aggregation (:obj: `torch.FloatTensor` of shape :obj:`(batch_size, num_aggregation_labels)`):
             Logits per aggregation operation.
         config (:class:`~transformers.TapasConfig`):
-            Model configuration class with all the parameters of the model.
+            Model configuration class with all the parameters of the model
+
     Returns:
-        per_example_answer_loss_scaled (:obj:`torch.FloatTensor` of shape :obj:`(batch_size,)`):
-            Scales answer loss for each example in the batch.
-        large_answer_loss_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size,)`):
-            A mask which is 1 for examples for which their answer loss is larger than the answer_loss_cutoff.
+        per_example_answer_loss_scaled (:obj:`torch.FloatTensor` of shape :obj:`(batch_size,)`):
+            Scales answer loss for each example in the batch.
+        large_answer_loss_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size,)`):
+            A mask which is 1 for examples for which their answer loss is larger than the answer_loss_cutoff.
""" # [batch_size] expected_result = _calculate_expected_result( @@ -2219,4 +2247,4 @@ def _calculate_regression_loss( ) per_example_answer_loss_scaled = config.answer_loss_importance * (per_example_answer_loss * aggregate_mask) - return per_example_answer_loss_scaled, large_answer_loss_mask \ No newline at end of file + return per_example_answer_loss_scaled, large_answer_loss_mask diff --git a/src/transformers/tokenization_tapas.py b/src/transformers/tokenization_tapas.py index cd28a505e5d5..1a55980e71e3 100644 --- a/src/transformers/tokenization_tapas.py +++ b/src/transformers/tokenization_tapas.py @@ -15,42 +15,35 @@ """ Tokenization class for TAPAS model.""" - +import ast import collections -import os -import unicodedata -import math import datetime import enum import itertools +import math +import os import re -import ast +import unicodedata +import warnings from dataclasses import dataclass -from typing import Callable, Dict, List, Optional, Text, Tuple, Union +from typing import Callable, Dict, Generator, List, Optional, Text, Tuple, Union +import pandas as pd import torch from .tokenization_utils import PreTrainedTokenizer, _is_control, _is_punctuation, _is_whitespace from .tokenization_utils_base import ( - ENCODE_KWARGS_DOCSTRING, - ENCODE_PLUS_ADDITIONAL_KWARGS_DOCSTRING, - INIT_TOKENIZER_DOCSTRING, - AddedToken, BatchEncoding, EncodedInput, - EncodedInputPair, PaddingStrategy, PreTokenizedInput, - PreTokenizedInputPair, - PreTrainedTokenizerBase, TensorType, TextInput, - TextInputPair, TruncationStrategy, ) - from .utils import logging + logger = logging.get_logger(__name__) @@ -72,6 +65,9 @@ } +TableValue = collections.namedtuple("TokenValue", ["token", "column_id", "row_id"]) + + @dataclass(frozen=True) class TokenCoordinates: column_index: int @@ -119,25 +115,38 @@ def whitespace_tokenize(text): class TapasTokenizer(PreTrainedTokenizer): r""" - Construct a TAPAS tokenizer. Based on WordPiece. 
Flattens a table and one or more related sentences to be used by - TAPAS models. + Construct a TAPAS tokenizer. Based on WordPiece. Flattens a table and one or more related sentences to be used by + TAPAS models. This tokenizer inherits from :class:`~transformers.PreTrainedTokenizer` which contains most of the main methods. - Users should refer to this superclass for more information regarding those methods. :class:`~transformers.TapasTokenizer` - creates several token type ids to encode tabular structure. To be more precise, it adds 7 token type ids, in the following - order: "segment_ids", "column_ids", "row_ids", "prev_label_ids", "column_ranks", "inv_column_ranks" and "numeric_relations": - - - segment_ids: indicate whether a token belongs to the question (0) or the table (1). 0 for special tokens and padding. - - column_ids: indicate to which column of the table a token belongs (starting from 1). Is 0 for all question tokens, special tokens and padding. - - row_ids: indicate to which row of the table a token belongs (starting from 1). Is 0 for all question tokens, special tokens and padding. Tokens of column headers are also 0. - - prev_label_ids: indicate whether a token was (part of) an answer to the previous question (1) or not (0). Useful in a conversational setup (such as SQA). - - column_ranks: indicate the rank of a table token relative to a column, if applicable. For example, if you have a column "number of movies" with values 87, - 53 and 69, then the column ranks of these tokens are 3, 1 and 2 respectively. 0 for all question tokens, special tokens and padding. - - inv_column_ranks: indicate the inverse rank of a table token relative to a column, if applicable. For example, if you have a column "number of movies" with values 87, - 53 and 69, then the inverse column ranks of these tokens are 1, 3 and 2 respectively. 0 for all question tokens, special tokens and padding. 
-    - numeric_relations: indicate numeric relations between the question and the tokens of the table. 0 for all question tokens, special tokens and padding.
-
-    :class:`~transformers.TapasTokenizer` runs end-to-end tokenization on a table and associated sentences: punctuation splitting and wordpiece.
+    Users should refer to this superclass for more information regarding those methods.
+    :class:`~transformers.TapasTokenizer` creates several token type ids to encode tabular structure. To be more
+    precise, it adds 7 token type ids, in the following order: "segment_ids", "column_ids", "row_ids",
+    "prev_label_ids", "column_ranks", "inv_column_ranks" and "numeric_relations":
+
+    - segment_ids: indicate whether a token belongs to the question (0) or the table (1). 0 for special tokens and
+      padding.
+    - column_ids: indicate to which column of the table a token belongs (starting from 1). Is 0 for all question
+      tokens, special tokens and padding.
+    - row_ids: indicate to which row of the table a token belongs (starting from 1). Is 0 for all question tokens,
+      special tokens and padding. Tokens of column headers are also 0.
+    - prev_label_ids: indicate whether a token was (part of) an answer to the previous question (1) or not (0). Useful
+      in a conversational setup (such as SQA).
+    - column_ranks: indicate the rank of a table token relative to a column, if applicable. For example, if you have a
+      column "number of movies" with values 87, 53 and 69, then the column ranks of these tokens are 3, 1 and 2
+      respectively. 0 for all question tokens, special tokens and padding.
+    - inv_column_ranks: indicate the inverse rank of a table token relative to a column, if applicable. For example,
+      if you have a column "number of movies" with values 87, 53 and 69, then the inverse column ranks of these
+      tokens are 1, 3 and 2 respectively. 0 for all question tokens, special tokens and padding.
+ - numeric_relations: indicate numeric relations between the question and the tokens of the table. 0 for all + question tokens, special tokens and padding. + + :class:`~transformers.TapasTokenizer` runs end-to-end tokenization on a table and associated sentences: punctuation + splitting and wordpiece. Args: vocab_file (:obj:`str`): @@ -165,14 +174,14 @@ class TapasTokenizer(PreTrainedTokenizer): The token used for masking values. This is the token used when training this model with masked language modeling. This is the token which the model will try to predict. tokenize_chinese_chars (:obj:`bool`, `optional`, defaults to :obj:`True`): - Whether or not to tokenize Chinese characters. - This should likely be deactivated for Japanese (see this `issue - `__). + Whether or not to tokenize Chinese characters. This should likely be deactivated for Japanese (see this + `issue `__). strip_accents: (:obj:`bool`, `optional`): Whether or not to strip all accents. If this option is not specified, then it will be determined by the value for :obj:`lowercase` (as in the original BERT). cell_trim_length (:obj:`int`, `optional`, defaults to -1): - If > 0: Trim cells so that the length is <= this value. Also disables further cell trimming, should thus be used with 'drop_rows_to_fit' below. + If > 0: Trim cells so that the length is <= this value. Also disables further cell trimming, should thus be + used with 'drop_rows_to_fit' below. max_column_id (:obj:`int`, `optional`, defaults to None): Max column id to extract. 
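The `column_ranks` / `inv_column_ranks` behavior described in the docstring above can be sketched in a few lines of plain Python. This is an illustration of the ranking rule only, not the tokenizer's internal implementation (which also handles dates and ties):

```python
def column_ranks(values):
    # Ascending rank within a column: the smallest value gets rank 1.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks

def inv_column_ranks(values):
    # Descending (inverse) rank: the largest value gets rank 1.
    return column_ranks([-v for v in values])

# The "number of movies" column from the docstring example:
print(column_ranks([87, 53, 69]))      # [3, 1, 2]
print(inv_column_ranks([87, 53, 69]))  # [1, 3, 2]
```

In the real tokenizer these ranks are then broadcast to every wordpiece of the corresponding cell, with 0 for question tokens, special tokens and padding.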
max_row_id (:obj:`int`, `optional`, defaults to None): @@ -223,6 +232,12 @@ def __init__( mask_token=mask_token, tokenize_chinese_chars=tokenize_chinese_chars, strip_accents=strip_accents, + cell_trim_length=cell_trim_length, + max_column_id=max_column_id, + max_row_id=max_row_id, + strip_column_names=strip_column_names, + update_answer_coordinates=update_answer_coordinates, + drop_rows_to_fit=drop_rows_to_fit, **kwargs, ) @@ -290,82 +305,6 @@ def convert_tokens_to_string(self, tokens): return out_string # the code below was also copied from tokenization_bert.py, but should be updated for TAPAS - - # def build_inputs_with_special_tokens( - # self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None - # ) -> List[int]: - # """ - # Build model inputs from a sequence or a pair of sequence for sequence classification tasks by concatenating and - # adding special tokens. A BERT sequence has the following format: - # - single sequence: ``[CLS] X [SEP]`` - # - pair of sequences: ``[CLS] A [SEP] B [SEP]`` - # Args: - # token_ids_0 (:obj:`List[int]`): - # List of IDs to which the special tokens will be added. - # token_ids_1 (:obj:`List[int]`, `optional`): - # Optional second list of IDs for sequence pairs. - # Returns: - # :obj:`List[int]`: List of `input IDs <../glossary.html#input-ids>`__ with the appropriate special tokens. - # """ - # if token_ids_1 is None: - # return [self.cls_token_id] + token_ids_0 + [self.sep_token_id] - # cls = [self.cls_token_id] - # sep = [self.sep_token_id] - # return cls + token_ids_0 + sep + token_ids_1 + sep - - # def get_special_tokens_mask( - # self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False - # ) -> List[int]: - # """ - # Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding - # special tokens using the tokenizer ``prepare_for_model`` method. - # Args: - # token_ids_0 (:obj:`List[int]`): - # List of IDs. 
- # token_ids_1 (:obj:`List[int]`, `optional`): - # Optional second list of IDs for sequence pairs. - # already_has_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`False`): - # Whether or not the token list is already formatted with special tokens for the model. - # Returns: - # :obj:`List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token. - # """ - - # if already_has_special_tokens: - # if token_ids_1 is not None: - # raise ValueError( - # "You should not supply a second sequence if the provided sequence of " - # "ids is already formatted with special tokens for the model." - # ) - # return list(map(lambda x: 1 if x in [self.sep_token_id, self.cls_token_id] else 0, token_ids_0)) - - # if token_ids_1 is not None: - # return [1] + ([0] * len(token_ids_0)) + [1] + ([0] * len(token_ids_1)) + [1] - # return [1] + ([0] * len(token_ids_0)) + [1] - - # def create_token_type_ids_from_sequences( - # self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None - # ) -> List[int]: - # """ - # Create a mask from the two sequences passed to be used in a sequence-pair classification task. A BERT sequence - # pair mask has the following format: - # :: - # 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 - # | first sequence | second sequence | - # If :obj:`token_ids_1` is :obj:`None`, this method only returns the first portion of the mask (0s). - # Args: - # token_ids_0 (:obj:`List[int]`): - # List of IDs. - # token_ids_1 (:obj:`List[int]`, `optional`): - # Optional second list of IDs for sequence pairs. - # Returns: - # :obj:`List[int]`: List of `token type IDs <../glossary.html#token-type-ids>`_ according to the given - # sequence(s). 
- # """ - # sep = [self.sep_token_id] - # cls = [self.cls_token_id] - # if token_ids_1 is None: - # return len(cls + token_ids_0 + sep) * [0] - # return len(cls + token_ids_0 + sep) * [0] + len(token_ids_1 + sep) * [1] def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> Tuple[str]: index = 0 @@ -386,17 +325,834 @@ def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = writer.write(token + "\n") index += 1 return (vocab_file,) - + + def create_attention_mask_from_sequences(self, query_ids: List[int], table_values: List[TableValue]) -> List[int]: + return [1] * (1 + len(query_ids) + 1 + len(table_values)) + + def create_segment_token_type_ids_from_sequences( + self, query_ids: List[int], table_values: List[TableValue] + ) -> List[int]: + table_ids = list(zip(*table_values))[0] if table_values else [] + return [0] * (1 + len(query_ids) + 1) + [1] * len(table_ids) + + def create_column_token_type_ids_from_sequences( + self, query_ids: List[int], table_values: List[TableValue] + ) -> List[int]: + table_column_ids = list(zip(*table_values))[1] if table_values else [] + return [0] * (1 + len(query_ids) + 1) + list(table_column_ids) + + def create_row_token_type_ids_from_sequences( + self, query_ids: List[int], table_values: List[TableValue] + ) -> List[int]: + table_row_ids = list(zip(*table_values))[2] if table_values else [] + return [0] * (1 + len(query_ids) + 1) + list(table_row_ids) + + def build_inputs_with_special_tokens( + self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None + ) -> List[int]: + """ + Build model inputs from a sequence or a pair of sequence for sequence classification tasks by concatenating and + adding special tokens. + + Args: + token_ids_0 (:obj:`List[int]`): The first tokenized sequence. + token_ids_1 (:obj:`List[int]`, `optional`): The second tokenized sequence. + + Returns: + :obj:`List[int]`: The model input with special tokens. 
+ """ + if token_ids_1 is None: + raise ValueError("With TAPAS, you must provide both question IDs and table IDs.") + + return [self.cls_token_id] + token_ids_0 + [self.sep_token_id] + token_ids_1 + + def get_special_tokens_mask( + self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False + ) -> List[int]: + """ + Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding + special tokens using the tokenizer ``prepare_for_model`` method. + + Args: + token_ids_0 (:obj:`List[int]`): + List of IDs. + token_ids_1 (:obj:`List[int]`, `optional`): + Optional second list of IDs for sequence pairs. + already_has_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`False`): + Whether or not the token list is already formatted with special tokens for the model. + + Returns: + :obj:`List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token. + """ + + if already_has_special_tokens: + if token_ids_1 is not None: + raise ValueError( + "You should not supply a second sequence if the provided sequence of " + "ids is already formatted with special tokens for the model." 
+ ) + return list(map(lambda x: 1 if x in [self.sep_token_id, self.cls_token_id] else 0, token_ids_0)) + + if token_ids_1 is not None: + return [1] + ([0] * len(token_ids_0)) + [1] + ([0] * len(token_ids_1)) + return [1] + ([0] * len(token_ids_0)) + [1] + + def __call__( + self, + table: pd.DataFrame, + queries: Optional[ + Union[ + TextInput, + PreTokenizedInput, + EncodedInput, + List[TextInput], + List[PreTokenizedInput], + List[EncodedInput], + ] + ] = None, + answer_coordinates: Optional[List[Tuple]] = None, + answer_texts: Optional[List[TextInput]] = None, + add_special_tokens: bool = True, + padding: Union[bool, str, PaddingStrategy] = False, + truncation: Union[bool, str, TruncationStrategy] = False, + max_length: Optional[int] = None, + stride: int = 0, + is_split_into_words: bool = False, + pad_to_multiple_of: Optional[int] = None, + return_tensors: Optional[Union[str, TensorType]] = None, + return_token_type_ids: Optional[bool] = None, + return_attention_mask: Optional[bool] = None, + return_overflowing_tokens: bool = False, + return_special_tokens_mask: bool = False, + return_offsets_mapping: bool = False, + return_length: bool = False, + verbose: bool = True, + **kwargs + ) -> BatchEncoding: + """ + Main method to tokenize and prepare for the model one or several sequence(s) or one or several pair(s) of + sequences. + + Args: + text (:obj:`str`, :obj:`List[str]`, :obj:`List[List[str]]`): + The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings + (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set + :obj:`is_split_into_words=True` (to lift the ambiguity with a batch of sequences). + text_pair (:obj:`str`, :obj:`List[str]`, :obj:`List[List[str]]`): + The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings + (pretokenized string). 
If the sequences are provided as list of strings (pretokenized), you must set + :obj:`is_split_into_words=True` (to lift the ambiguity with a batch of sequences). + """ + assert isinstance(table, pd.DataFrame), "Table must be of type pd.DataFrame" + + # Input type checking for clearer error + assert ( + queries is None + or isinstance(queries, str) + or ( + isinstance(queries, (list, tuple)) + and ( + len(queries) == 0 + or ( + isinstance(queries[0], str) + or ( + isinstance(queries[0], (list, tuple)) + and (len(queries[0]) == 0 or isinstance(queries[0][0], str)) + ) + ) + ) + ) + ), ( + "queries input must of type `str` (single example), `List[str]` (batch or single pretokenized example) " + "or `List[List[str]]` (batch of pretokenized examples)." + ) + + is_batched = isinstance(queries, (list, tuple)) + + if is_batched: + return self.batch_encode_plus( + table=table, + queries=queries, + answer_coordinates=answer_coordinates, + answer_texts=answer_texts, + add_special_tokens=add_special_tokens, + padding=padding, + truncation=truncation, + max_length=max_length, + stride=stride, + is_split_into_words=is_split_into_words, + pad_to_multiple_of=pad_to_multiple_of, + return_tensors=return_tensors, + return_token_type_ids=return_token_type_ids, + return_attention_mask=return_attention_mask, + return_overflowing_tokens=return_overflowing_tokens, + return_special_tokens_mask=return_special_tokens_mask, + return_offsets_mapping=return_offsets_mapping, + return_length=return_length, + verbose=verbose, + **kwargs, + ) + else: + return self.encode_plus( + table=table, + query=queries, + answer_coordinates=answer_coordinates, + answer_text=answer_texts, + add_special_tokens=add_special_tokens, + padding=padding, + truncation=truncation, + max_length=max_length, + stride=stride, + is_split_into_words=is_split_into_words, + pad_to_multiple_of=pad_to_multiple_of, + return_tensors=return_tensors, + return_token_type_ids=return_token_type_ids, + 
return_attention_mask=return_attention_mask, + return_overflowing_tokens=return_overflowing_tokens, + return_special_tokens_mask=return_special_tokens_mask, + return_offsets_mapping=return_offsets_mapping, + return_length=return_length, + verbose=verbose, + **kwargs, + ) + + def batch_encode_plus( + self, + table: pd.DataFrame, + queries: Optional[ + Union[ + List[TextInput], + List[PreTokenizedInput], + List[EncodedInput], + ] + ] = None, + answer_coordinates: Optional[List[Tuple]] = None, + answer_texts: Optional[List[TextInput]] = None, + add_special_tokens: bool = True, + padding: Union[bool, str, PaddingStrategy] = False, + truncation: Union[bool, str, TruncationStrategy] = False, + max_length: Optional[int] = None, + stride: int = 0, + is_split_into_words: bool = False, + pad_to_multiple_of: Optional[int] = None, + return_tensors: Optional[Union[str, TensorType]] = None, + return_token_type_ids: Optional[bool] = None, + return_attention_mask: Optional[bool] = None, + return_overflowing_tokens: bool = False, + return_special_tokens_mask: bool = False, + return_offsets_mapping: bool = False, + return_length: bool = False, + verbose: bool = True, + **kwargs + ) -> BatchEncoding: + + padding_strategy, truncation_strategy, max_length, kwargs = self._get_padding_truncation_strategies( + padding=padding, + truncation=truncation, + max_length=max_length, + pad_to_multiple_of=pad_to_multiple_of, + verbose=verbose, + **kwargs, + ) + + if return_token_type_ids is not None and not add_special_tokens: + raise ValueError( + "Asking to return token_type_ids while setting add_special_tokens to False " + "results in an undefined behavior. Please set add_special_tokens to True or " + "set return_token_type_ids to None." 
+ ) + + if (answer_coordinates and not answer_texts) or (not answer_coordinates and answer_texts): + raise ValueError("In case you provide answers, both answer_coordinates and answer_text should be provided") + elif answer_coordinates is None and answer_texts is None: + answer_coordinates = answer_texts = [None] * len(queries) + + if "is_split_into_words" in kwargs: + raise NotImplementedError("Currently TapasTokenizer only supports questions as strings.") + + if return_offsets_mapping: + raise NotImplementedError( + "return_offset_mapping is not available when using Python tokenizers." + "To use this feature, change your tokenizer to one deriving from " + "transformers.PreTrainedTokenizerFast." + ) + + if "return_lengths" in kwargs: + if verbose: + warnings.warn( + "The PreTrainedTokenizerBase.prepare_for_model `return_lengths` parameter is deprecated. " + "Please use `return_length` instead.", + FutureWarning, + ) + return_length = kwargs["return_lengths"] + + return self._batch_encode_plus( + table=table, + queries=queries, + answer_coordinates=answer_coordinates, + answer_texts=answer_texts, + add_special_tokens=add_special_tokens, + padding_strategy=padding_strategy, + truncation_strategy=truncation_strategy, + max_length=max_length, + stride=stride, + is_split_into_words=is_split_into_words, + pad_to_multiple_of=pad_to_multiple_of, + return_tensors=return_tensors, + return_token_type_ids=return_token_type_ids, + return_attention_mask=return_attention_mask, + return_overflowing_tokens=return_overflowing_tokens, + return_special_tokens_mask=return_special_tokens_mask, + return_offsets_mapping=return_offsets_mapping, + return_length=return_length, + verbose=verbose, + **kwargs, + ) + + def _batch_encode_plus( + self, + table, + queries: Union[ + List[TextInput], + List[PreTokenizedInput], + List[EncodedInput], + ], + answer_coordinates: Optional[List[Tuple]] = None, + answer_texts: Optional[List[TextInput]] = None, + add_special_tokens: bool = True, + 
padding_strategy: PaddingStrategy = PaddingStrategy.DO_NOT_PAD, + truncation_strategy: TruncationStrategy = TruncationStrategy.DO_NOT_TRUNCATE, + max_length: Optional[int] = None, + stride: int = 0, + is_split_into_words: bool = False, + pad_to_multiple_of: Optional[int] = None, + return_tensors: Optional[Union[str, TensorType]] = None, + return_token_type_ids: Optional[bool] = True, + return_attention_mask: Optional[bool] = None, + return_overflowing_tokens: bool = False, + return_special_tokens_mask: bool = False, + return_offsets_mapping: bool = False, + return_length: bool = False, + verbose: bool = True, + **kwargs + ) -> BatchEncoding: + + table_tokens = self._tokenize_table(table) + + queries_tokens = [] + queries_ids = [] + for query in queries: + query_tokens = self.tokenize(query) + queries_tokens.append(query_tokens) + queries_ids.append(self.convert_tokens_to_ids(query_tokens)) + + num_rows = self._get_num_rows(table, self.drop_rows_to_fit) + num_columns = self._get_num_columns(table) + + _, _, num_tokens = self._get_table_boundaries(table_tokens) + + table_data = list(self._get_table_values(table_tokens, num_columns, num_rows, num_tokens)) + + table_ids = list(zip(*table_data))[0] if len(table_data) > 0 else list(zip(*table_data)) + table_ids = self.convert_tokens_to_ids(list(table_ids)) + + batch_outputs = self._batch_prepare_for_model( + table_ids, + queries_ids, + table, + queries, + table_data=table_data, + queries_tokens=queries_tokens, + answer_coordinates=answer_coordinates, + answer_texts=answer_texts, + add_special_tokens=add_special_tokens, + padding=padding_strategy.value, + truncation=truncation_strategy.value, + max_length=max_length, + stride=stride, + pad_to_multiple_of=pad_to_multiple_of, + return_tensors=return_tensors, + prepend_batch_axis=True, + return_attention_mask=return_attention_mask, + return_token_type_ids=return_token_type_ids, + return_overflowing_tokens=return_overflowing_tokens, + 
return_special_tokens_mask=return_special_tokens_mask, + return_length=return_length, + verbose=verbose, + ) + + return BatchEncoding(batch_outputs) + + def _batch_prepare_for_model( + self, + table_ids: List[int], + queries_ids: List[List[int]], + raw_table: pd.DataFrame, + raw_queries: Union[ + List[TextInput], + List[PreTokenizedInput], + List[EncodedInput], + ], + answer_coordinates: Optional[List[Tuple]] = None, + answer_texts: Optional[List[TextInput]] = None, + add_special_tokens: bool = True, + padding: Union[bool, str, PaddingStrategy] = False, + truncation: Union[bool, str, TruncationStrategy] = False, + max_length: Optional[int] = None, + stride: int = 0, + is_split_into_words: bool = False, + pad_to_multiple_of: Optional[int] = None, + return_tensors: Optional[Union[str, TensorType]] = None, + return_token_type_ids: Optional[bool] = True, + return_attention_mask: Optional[bool] = True, + return_overflowing_tokens: bool = False, + return_special_tokens_mask: bool = False, + return_offsets_mapping: bool = False, + return_length: bool = False, + verbose: bool = True, + prepend_batch_axis: bool = False, + **kwargs + ) -> BatchEncoding: + """ + Prepares a sequence of strings (queries) related to a table so that it can be used by the model. 
It creates + input ids, adds special tokens, truncates the table if overflowing (if the drop_rows_to_fit parameter is set to + True) while taking into account the special tokens and manages a moving window (with user defined stride) for + overflowing tokens + + This function is based on prepare_for_model (but in Tapas, training examples depend on each other, so we + defined it at a batch level) + + Args: + table: Pandas dataframe + queries: List of Strings, containing questions related to the table + """ + batch_outputs = {} + + if "table_data" in kwargs and "queries_tokens" in kwargs: + table_data = kwargs["table_data"] + queries_tokens = kwargs["queries_tokens"] + else: + table_data = None + queries_tokens = [None] * len(queries_ids) + + for query_ids, raw_query, query_tokens, answer_coords, answer_text in zip( + queries_ids, raw_queries, queries_tokens, answer_coordinates, answer_texts + ): + outputs = self.prepare_for_model( + table_ids, + query_ids, + raw_table, + raw_query, + table_data=table_data, + query_tokens=query_tokens, + answer_coordinates=answer_coords, + answer_text=answer_text, + add_special_tokens=add_special_tokens, + padding=PaddingStrategy.DO_NOT_PAD.value, # we pad in batch afterward + truncation=truncation, + max_length=max_length, + stride=stride, + pad_to_multiple_of=None, # we pad in batch afterward + return_attention_mask=False, # we pad in batch afterward + return_token_type_ids=return_token_type_ids, + return_overflowing_tokens=return_overflowing_tokens, + return_special_tokens_mask=return_special_tokens_mask, + return_length=return_length, + return_tensors=None, # We convert the whole batch to tensors at the end + prepend_batch_axis=False, + verbose=verbose, + ) + + for key, value in outputs.items(): + if key not in batch_outputs: + batch_outputs[key] = [] + batch_outputs[key].append(value) + + batch_outputs = self.pad( + batch_outputs, + padding=padding, + max_length=max_length, + pad_to_multiple_of=pad_to_multiple_of, + 
return_attention_mask=return_attention_mask, + ) + + batch_outputs = BatchEncoding(batch_outputs, tensor_type=return_tensors) + + return batch_outputs + + def encode( + self, + table: pd.DataFrame, + query: Optional[ + Union[ + TextInput, + PreTokenizedInput, + EncodedInput, + ] + ] = None, + add_special_tokens: bool = True, + padding: Union[bool, str, PaddingStrategy] = False, + truncation: Union[bool, str, TruncationStrategy] = False, + max_length: Optional[int] = None, + stride: int = 0, + return_tensors: Optional[Union[str, TensorType]] = None, + **kwargs + ) -> List[int]: + encoded_inputs = self.encode_plus( + table, + query=query, + add_special_tokens=add_special_tokens, + padding=padding, + truncation=truncation, + max_length=max_length, + stride=stride, + return_tensors=return_tensors, + **kwargs, + ) + + return encoded_inputs["input_ids"] + + def encode_plus( + self, + table: pd.DataFrame, + query: Optional[ + Union[ + TextInput, + PreTokenizedInput, + EncodedInput, + ] + ] = None, + answer_coordinates: Optional[List[Tuple]] = None, + answer_text: Optional[List[TextInput]] = None, + add_special_tokens: bool = True, + padding: Union[bool, str, PaddingStrategy] = False, + truncation: Union[bool, str, TruncationStrategy] = False, + max_length: Optional[int] = None, + stride: int = 0, + is_split_into_words: bool = False, + pad_to_multiple_of: Optional[int] = None, + return_tensors: Optional[Union[str, TensorType]] = None, + return_token_type_ids: Optional[bool] = None, + return_attention_mask: Optional[bool] = None, + return_overflowing_tokens: bool = False, + return_special_tokens_mask: bool = False, + return_offsets_mapping: bool = False, + return_length: bool = False, + verbose: bool = True, + **kwargs + ) -> BatchEncoding: + padding_strategy, truncation_strategy, max_length, kwargs = self._get_padding_truncation_strategies( + padding=padding, + truncation=truncation, + max_length=max_length, + pad_to_multiple_of=pad_to_multiple_of, + verbose=verbose, + 
**kwargs, + ) + + if return_token_type_ids is not None and not add_special_tokens: + raise ValueError( + "Asking to return token_type_ids while setting add_special_tokens to False " + "results in an undefined behavior. Please set add_special_tokens to True or " + "set return_token_type_ids to None." + ) + + if (answer_coordinates and not answer_text) or (not answer_coordinates and answer_text): + raise ValueError("In case you provide answers, both answer_coordinates and answer_text should be provided") + + if "is_split_into_words" in kwargs: + raise NotImplementedError("Currently TapasTokenizer only supports questions as strings.") + + if return_offsets_mapping: + raise NotImplementedError( + "return_offset_mapping is not available when using Python tokenizers." + "To use this feature, change your tokenizer to one deriving from " + "transformers.PreTrainedTokenizerFast." + ) + + if "return_lengths" in kwargs: + if verbose: + warnings.warn( + "The PreTrainedTokenizerBase.prepare_for_model `return_lengths` parameter is deprecated. 
" + "Please use `return_length` instead.", + FutureWarning, + ) + return_length = kwargs["return_lengths"] + + return self._encode_plus( + table=table, + query=query, + add_special_tokens=add_special_tokens, + padding_strategy=padding_strategy, + truncation_strategy=truncation_strategy, + max_length=max_length, + stride=stride, + is_split_into_words=is_split_into_words, + pad_to_multiple_of=pad_to_multiple_of, + return_tensors=return_tensors, + return_token_type_ids=return_token_type_ids, + return_attention_mask=return_attention_mask, + return_overflowing_tokens=return_overflowing_tokens, + return_special_tokens_mask=return_special_tokens_mask, + return_offsets_mapping=return_offsets_mapping, + return_length=return_length, + verbose=verbose, + **kwargs, + ) + + def _encode_plus( + self, + table: pd.DataFrame, + query: Union[ + TextInput, + PreTokenizedInput, + EncodedInput, + ], + answer_coordinates: Optional[List[Tuple]] = None, + answer_text: Optional[List[TextInput]] = None, + add_special_tokens: bool = True, + padding_strategy: PaddingStrategy = PaddingStrategy.DO_NOT_PAD, + truncation_strategy: TruncationStrategy = TruncationStrategy.DO_NOT_TRUNCATE, + max_length: Optional[int] = None, + stride: int = 0, + is_split_into_words: bool = False, + pad_to_multiple_of: Optional[int] = None, + return_tensors: Optional[Union[str, TensorType]] = None, + return_token_type_ids: Optional[bool] = True, + return_attention_mask: Optional[bool] = True, + return_overflowing_tokens: bool = False, + return_special_tokens_mask: bool = False, + return_offsets_mapping: bool = False, + return_length: bool = False, + verbose: bool = True, + **kwargs + ): + if query is None: + query = "" + logger.warning( + "TAPAS is a question answering model but you have not passed a query. Please be aware that the " + "model will probably not behave correctly." 
+ ) + + table_tokens = self._tokenize_table(table) + query_tokens = self.tokenize(query) + + num_rows = self._get_num_rows(table, self.drop_rows_to_fit) + num_columns = self._get_num_columns(table) + + _, _, num_tokens = self._get_table_boundaries(table_tokens) + + table_data = list(self._get_table_values(table_tokens, num_columns, num_rows, num_tokens)) + + query_ids = self.convert_tokens_to_ids(query_tokens) + table_ids = list(zip(*table_data))[0] if len(table_data) > 0 else list(zip(*table_data)) + table_ids = self.convert_tokens_to_ids(list(table_ids)) + + return self.prepare_for_model( + table_ids, + query_ids, + table, + query, + table_data=table_data, + query_tokens=query_tokens, + answer_coordinates=answer_coordinates, + answer_text=answer_text, + add_special_tokens=add_special_tokens, + padding=padding_strategy.value, + truncation=truncation_strategy.value, + max_length=max_length, + stride=stride, + pad_to_multiple_of=pad_to_multiple_of, + return_tensors=return_tensors, + prepend_batch_axis=True, + return_attention_mask=return_attention_mask, + return_token_type_ids=return_token_type_ids, + return_overflowing_tokens=return_overflowing_tokens, + return_special_tokens_mask=return_special_tokens_mask, + return_length=return_length, + verbose=verbose, + ) + + def prepare_for_model( + self, + table_ids: List[int], + query_ids: List[int], + raw_table: pd.DataFrame, + raw_query: Union[ + TextInput, + PreTokenizedInput, + EncodedInput, + ], + answer_coordinates: Optional[List[Tuple]] = None, + answer_text: Optional[List[TextInput]] = None, + add_special_tokens: bool = True, + padding: Union[bool, str, PaddingStrategy] = False, + truncation: Union[bool, str, TruncationStrategy] = False, + max_length: Optional[int] = None, + stride: int = 0, + is_split_into_words: bool = False, + pad_to_multiple_of: Optional[int] = None, + return_tensors: Optional[Union[str, TensorType]] = None, + return_token_type_ids: Optional[bool] = True, + return_attention_mask: Optional[bool] 
= True, + return_overflowing_tokens: bool = False, + return_special_tokens_mask: bool = False, + return_offsets_mapping: bool = False, + return_length: bool = False, + verbose: bool = True, + prepend_batch_axis: bool = False, + **kwargs + ) -> BatchEncoding: + + # Backward compatibility for 'truncation_strategy', 'pad_to_max_length' + padding_strategy, truncation_strategy, max_length, kwargs = self._get_padding_truncation_strategies( + padding=padding, + truncation=truncation, + max_length=max_length, + pad_to_multiple_of=pad_to_multiple_of, + verbose=verbose, + **kwargs, + ) + + encoded_inputs = {} + + # This can be retrieved from the encoding step, which prevents recomputing. + # We still need to handle recomputing as `prepare_for_model` should be callable on raw IDs/table/query as well. + if ( + "table_data" not in kwargs + or "query_tokens" not in kwargs + or ( + ("table_data" in kwargs and kwargs["table_data"] is None) + and ("query_tokens" in kwargs and kwargs["query_tokens"] is None) + ) + ): + table_tokens = self._tokenize_table(raw_table) + num_rows = self._get_num_rows(raw_table, self.drop_rows_to_fit) + num_columns = self._get_num_columns(raw_table) + _, _, num_tokens = self._get_table_boundaries(table_tokens) + table_data = list(self._get_table_values(table_tokens, num_columns, num_rows, num_tokens)) + query_tokens = self.tokenize(raw_query) + else: + table_data = kwargs["table_data"] + query_tokens = kwargs["query_tokens"] + + total_len = ( + len(query_ids) + len(table_ids) + (self.num_special_tokens_to_add(pair=True) if add_special_tokens else 0) + ) + + overflowing_tokens = [] + if truncation_strategy != TruncationStrategy.DO_NOT_TRUNCATE and max_length and total_len > max_length: + query_ids, table_ids, overflowing_tokens = self.truncate_sequences( + query_ids, + pair_ids=table_ids, + num_tokens_to_remove=total_len - max_length, + truncation_strategy=truncation_strategy, + stride=stride, + ) + + if return_overflowing_tokens: + 
encoded_inputs["overflowing_tokens"] = overflowing_tokens + encoded_inputs["num_truncated_tokens"] = total_len - max_length + + if add_special_tokens: + input_ids = self.build_inputs_with_special_tokens(query_ids, table_ids) + else: + input_ids = query_ids + table_ids + + encoded_inputs["input_ids"] = input_ids + + segment_ids = self.create_segment_token_type_ids_from_sequences(query_ids, table_data) + column_ids = self.create_column_token_type_ids_from_sequences(query_ids, table_data) + row_ids = self.create_row_token_type_ids_from_sequences(query_ids, table_data) + prev_label_ids = [0] * len(row_ids) + + column_ranks, inv_column_ranks, columns_to_numeric_values = self._get_numeric_column_ranks( + column_ids, row_ids, raw_table + ) + numeric_relations = self._get_numeric_relations( + raw_query, column_ids, row_ids, raw_table, columns_to_numeric_values + ) + + # Load from model defaults + if return_token_type_ids is None: + return_token_type_ids = "token_type_ids" in self.model_input_names + if return_attention_mask is None: + return_attention_mask = "attention_mask" in self.model_input_names + + if return_attention_mask: + attention_mask = self.create_attention_mask_from_sequences(query_ids, table_data) + encoded_inputs["attention_mask"] = attention_mask + + if answer_coordinates is not None and answer_text is not None: + label_ids = self.get_answer_ids( + column_ids, row_ids, table_data, query_tokens, answer_text, answer_coordinates + ) + numeric_values = self._get_numeric_values(raw_table, column_ids, row_ids, columns_to_numeric_values) + numeric_values_scale = self._get_numeric_values_scale(raw_table, column_ids, row_ids) + + encoded_inputs["label_ids"] = label_ids + encoded_inputs["numeric_values"] = numeric_values + encoded_inputs["numeric_values_scale"] = numeric_values_scale + + if return_token_type_ids: + token_type_ids = [ + segment_ids, + column_ids, + row_ids, + prev_label_ids, + column_ranks, + inv_column_ranks, + numeric_relations, + ] + + 
token_type_ids = [list(ids) for ids in list(zip(*token_type_ids))] + encoded_inputs["token_type_ids"] = token_type_ids + + if return_special_tokens_mask: + if add_special_tokens: + encoded_inputs["special_tokens_mask"] = self.get_special_tokens_mask(query_ids, table_ids) + else: + encoded_inputs["special_tokens_mask"] = [0] * len(input_ids) + + # Check lengths + if max_length is None and len(encoded_inputs["input_ids"]) > self.model_max_length and verbose: + if not self.deprecation_warnings.get("sequence-length-is-longer-than-the-specified-maximum", False): + logger.warning( + "Token indices sequence length is longer than the specified maximum sequence length " + "for this model ({} > {}). Running this sequence through the model will result in " + "indexing errors".format(len(encoded_inputs["input_ids"]), self.model_max_length) + ) + self.deprecation_warnings["sequence-length-is-longer-than-the-specified-maximum"] = True + + # Padding + if padding_strategy != PaddingStrategy.DO_NOT_PAD or return_attention_mask: + encoded_inputs = self.pad( + encoded_inputs, + max_length=max_length, + padding=padding_strategy.value, + pad_to_multiple_of=pad_to_multiple_of, + return_attention_mask=return_attention_mask, + ) + + if return_length: + encoded_inputs["length"] = len(encoded_inputs["input_ids"]) + + batch_outputs = BatchEncoding( + encoded_inputs, tensor_type=return_tensors, prepend_batch_axis=prepend_batch_axis + ) + + return batch_outputs + def _tokenize_table( self, table=None, ): - """Tokenizes column headers and cell texts of a table. + """ + Tokenizes column headers and cell texts of a table. Args: table (:obj:`pd.Dataframe`): - Table. - Returns: :obj:`TokenizedTable`: TokenizedTable object. + Table. Returns: :obj:`TokenizedTable`: TokenizedTable object. 
""" tokenized_rows = [] tokenized_row = [] @@ -437,17 +1193,18 @@ def _question_encoding_cost(self, question_tokens): return len(question_tokens) + 2 def _get_token_budget(self, question_tokens): - """Computes the number of tokens left for the table after tokenizing a question, - taking into account the max sequence length of the model. + """ + Computes the number of tokens left for the table after tokenizing a question, taking into account the max + sequence length of the model. Args: - question_tokens (:obj:`List[String]`): - List of question tokens. - Returns: :obj:`int`: the number of tokens left for the table, given the model max length. + question_tokens (:obj:`List[String]`): + List of question tokens. Returns: :obj:`int`: the number of tokens left for the table, given the model + max length. """ return self.model_max_length - self._question_encoding_cost(question_tokens) - - def _get_table_values(self, table, num_columns, num_rows, num_tokens): + + def _get_table_values(self, table, num_columns, num_rows, num_tokens) -> Generator[TableValue, None, None]: """Iterates over partial table and returns token, column and row indexes.""" for tc in table.selected_tokens: # First row is header row. @@ -464,7 +1221,7 @@ def _get_table_values(self, table, num_columns, num_rows, num_tokens): word_begin_index -= 1 if word_begin_index >= num_tokens: continue - yield token, tc.column_index + 1, tc.row_index + yield TableValue(token, tc.column_index + 1, tc.row_index) def _get_table_boundaries(self, table): """Return maximal number of rows, columns and tokens.""" @@ -574,11 +1331,11 @@ def _serialize( ) def _get_column_values(self, table_numeric_values): - """This is an adaptation from _get_column_values in tf_example_utils.py of the original implementation. 
- Given table_numeric_values, a dictionary that maps row indices of a certain column - of a Pandas dataframe to either an empty list (no numeric value) or a list containing - a NumericValue object, it returns the same dictionary, but only for the row indices that - have a corresponding NumericValue object. + """ + This is an adaptation from _get_column_values in tf_example_utils.py of the original implementation. Given + table_numeric_values, a dictionary that maps row indices of a certain column of a Pandas dataframe to either an + empty list (no numeric value) or a list containing a NumericValue object, it returns the same dictionary, but + only for the row indices that have a corresponding NumericValue object. """ table_numeric_values_without_empty_lists = {} for row_index, value in table_numeric_values.items(): @@ -591,8 +1348,8 @@ def _get_cell_token_indexes(self, column_ids, row_ids, column_id, row_id): if column_ids[index] - 1 == column_id and row_ids[index] - 1 == row_id: yield index - def _add_numeric_column_ranks(self, column_ids, row_ids, table, features): - """Adds column ranks for all numeric columns.""" + def _get_numeric_column_ranks(self, column_ids, row_ids, table): + """Returns column ranks for all numeric columns.""" ranks = [0] * len(column_ids) inv_ranks = [0] * len(column_ids) @@ -628,18 +1385,17 @@ def _add_numeric_column_ranks(self, column_ids, row_ids, table, features): ranks[index] = rank + 1 inv_ranks[index] = len(unique_values) - rank - features["column_ranks"] = ranks - features["inv_column_ranks"] = inv_ranks - - return features, columns_to_numeric_values + return ranks, inv_ranks, columns_to_numeric_values def _get_numeric_sort_key_fn(self, table_numeric_values, value): - """Returns the sort key function for comparing value to table values. - The function returned will be a suitable input for the key param of the - sort(). See number_annotation_utils._get_numeric_sort_key_fn for details. 
+ """ + Returns the sort key function for comparing value to table values. The function returned will be a suitable + input for the key param of the sort(). See number_annotation_utils._get_numeric_sort_key_fn for details + Args: table_numeric_values: Numeric values of a column - value: Numeric value in the question. + value: Numeric value in the question + Returns: A function key function to compare column and question values. """ @@ -652,14 +1408,15 @@ def _get_numeric_sort_key_fn(self, table_numeric_values, value): except ValueError: return None - def _add_numeric_relations(self, question, column_ids, row_ids, table, features, columns_to_numeric_values): - """Adds numeric relation embeddings to 'features'. + def _get_numeric_relations(self, question, column_ids, row_ids, table, columns_to_numeric_values): + """ + Returns numeric relations embeddings + Args: question: The question, numeric values are used. column_ids: Maps word piece position to column id. row_ids: Maps word piece position to row id. table: The table containing the numeric cell values. - features: Output. columns_to_numeric_values: Dictionary that maps column indices to numeric values. 
""" @@ -692,12 +1449,10 @@ def _add_numeric_relations(self, question, column_ids, row_ids, table, features, for cell_token_index in self._get_cell_token_indexes(column_ids, row_ids, column_index, row_index): numeric_relations[cell_token_index] = relation_set_index - features["numeric_relations"] = numeric_relations - - return features + return numeric_relations - def _add_numeric_values(self, table, token_ids_dict, features, columns_to_numeric_values): - """Adds numeric values for computation of answer loss.""" + def _get_numeric_values(self, table, column_ids, row_ids, columns_to_numeric_values): + """Returns numeric values for computation of answer loss.""" numeric_values = [float("nan")] * self.model_max_length @@ -718,17 +1473,13 @@ def _add_numeric_values(self, table, token_ids_dict, features, columns_to_numeri if float_value == float("inf"): continue - for index in self._get_cell_token_indexes( - token_ids_dict["column_ids"], token_ids_dict["row_ids"], col_index, row_index - ): + for index in self._get_cell_token_indexes(column_ids, row_ids, col_index, row_index): numeric_values[index] = float_value - features["numeric_values"] = numeric_values - - return features + return numeric_values - def _add_numeric_values_scale(self, table, token_ids_dict, features): - """Adds a scale to each token to down weigh the value of long words.""" + def _get_numeric_values_scale(self, table, column_ids, row_ids): + """Returns a scale to each token to down weigh the value of long words.""" numeric_values_scale = [1.0] * self.model_max_length @@ -740,20 +1491,13 @@ def _add_numeric_values_scale(self, table, token_ids_dict, features): for col_index in range(num_columns): for row_index in range(num_rows): - indices = [ - index - for index in self._get_cell_token_indexes( - token_ids_dict["column_ids"], token_ids_dict["row_ids"], col_index, row_index - ) - ] + indices = [index for index in self._get_cell_token_indexes(column_ids, row_ids, col_index, row_index)] num_indices = 
len(indices) if num_indices > 1: for index in indices: numeric_values_scale[index] = float(num_indices) - features["numeric_values_scale"] = numeric_values_scale - - return features + return numeric_values_scale def _pad_to_seq_length(self, inputs): while len(inputs) > self.model_max_length: @@ -761,109 +1505,6 @@ def _pad_to_seq_length(self, inputs): while len(inputs) < self.model_max_length: inputs.append(0) - def _to_features(self, tokens, token_ids_dict, table, question): - """Produces a dict of features. This function creates input ids, attention mask, token type ids - (except the prev label ids), as well as numeric value and numeric value scale. - """ - tokens = list(tokens) - token_ids_dict = {key: list(values) for key, values in token_ids_dict.items()} - - length = len(tokens) - for values in token_ids_dict.values(): - if len(values) != length: - raise ValueError("Inconsistent length") - - # currently the input ids, mask and token type ids are created here - # also, padding and truncation up to max length is done here (see function _pad_to_seq_length) - input_ids = self.convert_tokens_to_ids(tokens) - attention_mask = [1] * len(input_ids) - - self._pad_to_seq_length(input_ids) - self._pad_to_seq_length(attention_mask) - for values in token_ids_dict.values(): - self._pad_to_seq_length(values) - - assert len(input_ids) == self.model_max_length - assert len(attention_mask) == self.model_max_length - for values in token_ids_dict.values(): - assert len(values) == self.model_max_length - - features = {} - features["input_ids"] = input_ids - features["attention_mask"] = attention_mask - for key, values in sorted(token_ids_dict.items()): - features[key] = values - - features, columns_to_numeric_values = self._add_numeric_column_ranks( - token_ids_dict["column_ids"], token_ids_dict["row_ids"], table, features - ) - - features = self._add_numeric_relations( - question, - token_ids_dict["column_ids"], - token_ids_dict["row_ids"], - table, - features, - 
columns_to_numeric_values, - ) - - # finally, add numeric values and numeric values scale (only needed in case of regression loss calculation) - # so they should only be returned in case answer_coordinates + answer_texts are provided - - features = self._add_numeric_values(table, token_ids_dict, features, columns_to_numeric_values) - - features = self._add_numeric_values_scale(table, token_ids_dict, features) - - # we do not add table id and table id hash (was used in the original implementation) - # if table: - # features['table_id'] = create_string_feature([table.table_id.encode('utf8')]) - # features['table_id_hash'] = create_int_feature([fingerprint(table.table_id) % _MAX_INT]) - - return features - - def _to_trimmed_features( - self, - question, - table, - question_tokens, - tokenized_table, - num_columns, - num_rows, - drop_rows_to_fit=False, - ): - """Finds optimal number of table tokens to include and serializes.""" - init_num_rows = num_rows - while True: - num_tokens = self._get_max_num_tokens( - question_tokens, - tokenized_table, - num_rows=num_rows, - num_columns=num_columns, - ) - if num_tokens is not None: - # We could fit the table. - break - if not drop_rows_to_fit or num_rows == 0: - raise ValueError("Sequence too long") - # Try to drop a row to fit the table. 
- num_rows -= 1 - - serialized_example = self._serialize(question_tokens, tokenized_table, num_columns, num_rows, num_tokens) - - assert len(serialized_example.tokens) <= self.model_max_length - - feature_dict = { - "column_ids": serialized_example.column_ids, - "row_ids": serialized_example.row_ids, - "segment_ids": serialized_example.segment_ids, - } - - features = self._to_features(serialized_example.tokens, feature_dict, table=table, question=question) - - return serialized_example, features - - #### Everything related to label ids calculation #### - def _get_all_answer_ids_from_coordinates( self, column_ids, @@ -885,16 +1526,16 @@ def _get_all_answer_ids_from_coordinates( return answer_ids, missing_count def _get_all_answer_ids(self, column_ids, row_ids, question, answer_coordinates): - """Maps lists of questions with answer coordinates to token indexes. - Here, we swap column and row coordinates. In the TSV format, the coordinates - are given as (row, column) tuples. Here, we swap them to (column, row) format. + """ + Maps lists of questions with answer coordinates to token indexes. Here, we swap column and row coordinates. In + the TSV format, the coordinates are given as (row, column) tuples. Here, we swap them to (column, row) format. 
""" - def _to_coordinates(question, answer_coordinates_question): + def _to_coordinates(answer_coordinates_question): return [(coords[1], coords[0]) for coords in answer_coordinates_question] return self._get_all_answer_ids_from_coordinates( - column_ids, row_ids, answers_list=(_to_coordinates(question, answer_coordinates)) + column_ids, row_ids, answers_list=(_to_coordinates(answer_coordinates)) ) def _find_tokens(self, text, segment): @@ -986,329 +1627,78 @@ def get_answer_ids( ) return self._get_answer_ids(column_ids, row_ids, question, answer_coordinates_question) - #### End of everything related to label ids calculation #### - - def batch_encode_plus( - self, - table, - queries: Union[ - List[TextInput], - List[PreTokenizedInput], - List[EncodedInput], - ], - answer_coordinates: Optional[List[Tuple]] = None, - answer_texts: Optional[List[TextInput]] = None, - add_special_tokens: bool = True, - padding_strategy: PaddingStrategy = PaddingStrategy.DO_NOT_PAD, - truncation_strategy: TruncationStrategy = TruncationStrategy.DO_NOT_TRUNCATE, - max_length: Optional[int] = None, - stride: int = 0, - is_split_into_words: bool = False, - pad_to_multiple_of: Optional[int] = None, - return_tensors: Optional[Union[str, TensorType]] = None, - return_token_type_ids: Optional[bool] = True, - return_attention_mask: Optional[bool] = None, - return_overflowing_tokens: bool = False, - return_special_tokens_mask: bool = False, - return_offsets_mapping: bool = False, - return_length: bool = False, - verbose: bool = True, - **kwargs - ) -> BatchEncoding: - """ - Tokenize and prepare for the model a list of one or more sequences related to a table. - .. warning:: - This method is deprecated, ``__call__`` should be used instead. - Args: - queries (:obj:`List[str]`): - Batch of sequences (queries) related to a table to be encoded. - This is a list of string-sequences (see details in ``encode_plus``). 
- """ - - # Backward compatibility for 'truncation_strategy', 'pad_to_max_length' - # padding_strategy, truncation_strategy, max_length, kwargs = self._get_padding_truncation_strategies( - # padding=padding, - # truncation=truncation, - # max_length=max_length, - # pad_to_multiple_of=pad_to_multiple_of, - # verbose=verbose, - # **kwargs, - # ) - - return self._batch_encode_plus( - table=table, - queries=queries, - answer_coordinates=answer_coordinates, - answer_texts=answer_texts, - add_special_tokens=add_special_tokens, - padding_strategy=padding_strategy, - truncation_strategy=truncation_strategy, - max_length=max_length, - stride=stride, - is_split_into_words=is_split_into_words, - pad_to_multiple_of=pad_to_multiple_of, - return_tensors=return_tensors, - return_token_type_ids=return_token_type_ids, - return_attention_mask=return_attention_mask, - return_overflowing_tokens=return_overflowing_tokens, - return_special_tokens_mask=return_special_tokens_mask, - return_offsets_mapping=return_offsets_mapping, - return_length=return_length, - verbose=verbose, - **kwargs, - ) - - def _batch_encode_plus( + def _pad( self, - table, - queries: Union[ - List[TextInput], - List[PreTokenizedInput], - List[EncodedInput], - ], - answer_coordinates: Optional[List[Tuple]] = None, - answer_texts: Optional[List[TextInput]] = None, - add_special_tokens: bool = True, - padding_strategy: PaddingStrategy = PaddingStrategy.DO_NOT_PAD, - truncation_strategy: TruncationStrategy = TruncationStrategy.DO_NOT_TRUNCATE, + encoded_inputs: Union[Dict[str, EncodedInput], BatchEncoding], max_length: Optional[int] = None, - stride: int = 0, - is_split_into_words: bool = False, - pad_to_multiple_of: Optional[int] = None, - return_tensors: Optional[Union[str, TensorType]] = None, - return_token_type_ids: Optional[bool] = True, - return_attention_mask: Optional[bool] = None, - return_overflowing_tokens: bool = False, - return_special_tokens_mask: bool = False, - return_offsets_mapping: bool = False, - 
return_length: bool = False, - verbose: bool = True, - **kwargs - ) -> BatchEncoding: - - if return_offsets_mapping: - raise NotImplementedError( - "return_offset_mapping is not available when using Python tokenizers." - "To use this feature, change your tokenizer to one deriving from " - "transformers.PreTrainedTokenizerFast." - ) - - if "is_pretokenized" in kwargs: - warnings.warn( - "`is_pretokenized` is deprecated and will be removed in a future version, use `is_split_into_words` instead.", - FutureWarning, - ) - - if "is_split_into_words" in kwargs: - raise NotImplementedError("Currently TapasTokenizer only supports questions as strings.") - - batch_outputs = self._batch_prepare_for_model( - table=table, - queries=queries, - answer_coordinates=answer_coordinates, - answer_texts=answer_texts, - add_special_tokens=add_special_tokens, - padding_strategy=padding_strategy, - truncation_strategy=truncation_strategy, - max_length=max_length, - stride=stride, - pad_to_multiple_of=pad_to_multiple_of, - return_attention_mask=return_attention_mask, - return_token_type_ids=return_token_type_ids, - return_overflowing_tokens=return_overflowing_tokens, - return_special_tokens_mask=return_special_tokens_mask, - return_length=return_length, - return_tensors=return_tensors, - verbose=verbose, - ) - - return BatchEncoding(batch_outputs) - - def _batch_prepare_for_model( - self, - table, - queries: Union[ - List[TextInput], - List[PreTokenizedInput], - List[EncodedInput], - ], - answer_coordinates: Optional[List[Tuple]] = None, - answer_texts: Optional[List[TextInput]] = None, - add_special_tokens: bool = True, padding_strategy: PaddingStrategy = PaddingStrategy.DO_NOT_PAD, - truncation_strategy: TruncationStrategy = TruncationStrategy.DO_NOT_TRUNCATE, - max_length: Optional[int] = None, - stride: int = 0, pad_to_multiple_of: Optional[int] = None, - return_tensors: Optional[str] = None, - return_token_type_ids: Optional[bool] = True, return_attention_mask: Optional[bool] = None, 
- return_overflowing_tokens: bool = False, - return_special_tokens_mask: bool = False, - return_length: bool = False, - verbose: bool = True, - **kwargs - ) -> BatchEncoding: + ) -> dict: """ - Prepares a sequence of strings (queries) related to a table so that it can be used by the model. - It creates input ids, adds special tokens, truncates the table if overflowing (if the drop_rows_to_fit - parameter is set to True) while taking into account the special tokens and manages a moving window - (with user defined stride) for overflowing tokens - - This function is based on prepare_for_model (but in Tapas, training examples depend on each other, - so we defined it at a batch level) + Pad encoded inputs (on left/right and up to predefined length or max length in the batch) Args: - table: Pandas dataframe - queries: List of Strings, containing questions related to the table + encoded_inputs: Dictionary of tokenized inputs (`List[int]`) or batch of tokenized inputs (`List[List[int]]`). + max_length: maximum length of the returned list and optionally padding length (see below). + Will truncate by taking into account the special tokens. + padding_strategy: PaddingStrategy to use for padding. + + - PaddingStrategy.LONGEST Pad to the longest sequence in the batch + - PaddingStrategy.MAX_LENGTH: Pad to the max length (default) + - PaddingStrategy.DO_NOT_PAD: Do not pad + The tokenizer padding sides are defined in self.padding_side: + + - 'left': pads on the left of the sequences + - 'right': pads on the right of the sequences + pad_to_multiple_of: (optional) Integer if set will pad the sequence to a multiple of the provided value. + This is especially useful to enable the use of Tensor Core on NVIDIA hardware with compute capability + >= 7.5 (Volta). 
+ return_attention_mask: (optional) Set to False to avoid returning attention mask (default: set to model specifics) """ - - if "return_lengths" in kwargs: - if verbose: - warnings.warn( - "The PreTrainedTokenizerBase.prepare_for_model `return_lengths` parameter is deprecated. " - "Please use `return_length` instead.", - FutureWarning, - ) - return_length = kwargs["return_lengths"] - - # Backward compatibility for 'truncation_strategy', 'pad_to_max_length' - # padding_strategy, truncation_strategy, max_length, kwargs = self._get_padding_truncation_strategies( - # padding=padding, - # truncation=truncation, - # max_length=max_length, - # pad_to_multiple_of=pad_to_multiple_of, - # verbose=verbose, - # **kwargs, - # ) - # Load from model defaults - if return_token_type_ids is None: - return_token_type_ids = "token_type_ids" in self.model_input_names if return_attention_mask is None: return_attention_mask = "attention_mask" in self.model_input_names - encoded_inputs = {} - - if return_overflowing_tokens: - # currently, if drop_rows_to_fit is set to False and a table is too big, a ValueError is thrown - # see function _get_num_rows - raise ValueError("Overflowing tokens is currently not supported") - - if (answer_coordinates and not answer_texts) or (not answer_coordinates and answer_texts): - raise ValueError("In case you provide answers, both answer_coordinates and answer_text should be provided") - - add_loss_variables = None - if answer_coordinates is not None and answer_texts is not None: - assert len(answer_coordinates) == len(answer_texts) == len(queries) - add_loss_variables = True - - # First, tokenize the table and get the number of rows and columns - tokenized_table = self._tokenize_table(table) - num_rows = self._get_num_rows(table, self.drop_rows_to_fit) - num_columns = self._get_num_columns(table) + if padding_strategy == PaddingStrategy.LONGEST: + max_length = len(encoded_inputs["input_ids"]) - # Second, create the input ids for every table + query pair 
(and all the other features). This is a list of lists - features_examples = {} - position_to_label_ids = {} - for position, query in enumerate(queries): - if isinstance(query, str): - text_tokens = self.tokenize(query) - # currently, padding is done within the _to_trimmed_features function - serialized_example, features = self._to_trimmed_features( - question=query, - table=table, - question_tokens=text_tokens, - tokenized_table=tokenized_table, - num_columns=num_columns, - num_rows=num_rows, - drop_rows_to_fit=self.drop_rows_to_fit, - ) + if max_length is not None and pad_to_multiple_of is not None and (max_length % pad_to_multiple_of != 0): + max_length = ((max_length // pad_to_multiple_of) + 1) * pad_to_multiple_of - if add_loss_variables: - column_ids = serialized_example.column_ids - row_ids = serialized_example.row_ids + needs_to_be_padded = ( + padding_strategy != PaddingStrategy.DO_NOT_PAD and len(encoded_inputs["input_ids"]) != max_length + ) - # create label ids from answer texts and coordinates - label_ids = self.get_answer_ids( - column_ids, - row_ids, - tokenized_table, - query, - answer_texts[position], - answer_coordinates[position], + if needs_to_be_padded: + difference = max_length - len(encoded_inputs["input_ids"]) + if self.padding_side == "right": + if return_attention_mask: + encoded_inputs["attention_mask"] = [1] * len(encoded_inputs["input_ids"]) + [0] * difference + if "token_type_ids" in encoded_inputs: + encoded_inputs["token_type_ids"] = ( + encoded_inputs["token_type_ids"] + [[self.pad_token_type_id] * 7] * difference ) - self._pad_to_seq_length(label_ids) - position_to_label_ids[position] = label_ids - features["label_ids"] = label_ids - - if position == 0: - prev_label_ids = [0] * len(features["input_ids"]) - else: - # TO DO: add prev label ids logic (see line 1118 in tf_example_utils.py) - prev_label_ids = position_to_label_ids[position - 1] - self._pad_to_seq_length(prev_label_ids) - features["prev_label_ids"] = prev_label_ids - - 
else: - prev_label_ids = [0] * len(features["input_ids"]) - self._pad_to_seq_length(prev_label_ids) - features["prev_label_ids"] = prev_label_ids - - features_examples[position] = features + if "special_tokens_mask" in encoded_inputs: + encoded_inputs["special_tokens_mask"] = encoded_inputs["special_tokens_mask"] + [1] * difference + encoded_inputs["input_ids"] = encoded_inputs["input_ids"] + [self.pad_token_id] * difference + elif self.padding_side == "left": + if return_attention_mask: + encoded_inputs["attention_mask"] = [0] * difference + [1] * len(encoded_inputs["input_ids"]) + if "token_type_ids" in encoded_inputs: + encoded_inputs["token_type_ids"] = [[self.pad_token_type_id] * 7] * difference + encoded_inputs[ + "token_type_ids" + ] + if "special_tokens_mask" in encoded_inputs: + encoded_inputs["special_tokens_mask"] = [1] * difference + encoded_inputs["special_tokens_mask"] + encoded_inputs["input_ids"] = [self.pad_token_id] * difference + encoded_inputs["input_ids"] else: - raise ValueError("Query is not valid. 
Should be a string.") - - # Build output dictionnary - encoded_inputs["input_ids"] = [features_examples[position]["input_ids"] for position in range(len(queries))] - encoded_inputs["attention_mask"] = [ - features_examples[position]["attention_mask"] for position in range(len(queries)) - ] - - token_types = [ - "segment_ids", - "column_ids", - "row_ids", - "prev_label_ids", - "column_ranks", - "inv_column_ranks", - "numeric_relations", - ] - token_type_ids = [] - for position in range(len(queries)): - token_type_ids_example = [] - for token_idx in range(self.model_max_length): - token_ids = [] - for type in token_types: - token_ids.append(features_examples[position][type][token_idx]) - token_type_ids_example.append(token_ids) - # token_type_ids_example is a list of seq_length elements, each element being a list of 7 elements - token_type_ids.append(token_type_ids_example) - - if return_token_type_ids: - encoded_inputs["token_type_ids"] = token_type_ids - - if add_loss_variables: - encoded_inputs["label_ids"] = [ - features_examples[position]["label_ids"] for position in range(len(queries)) - ] - encoded_inputs["numeric_values"] = [ - features_examples[position]["numeric_values"] for position in range(len(queries)) - ] - encoded_inputs["numeric_values_scale"] = [ - features_examples[position]["numeric_values_scale"] for position in range(len(queries)) - ] - # to do: add aggregation function id, classification class index and answer (or should people prepare this themselves?) 
- - if return_special_tokens_mask: - raise ValueError("Special tokens mask is currently not supported") - - if return_length: - encoded_inputs["length"] = len(encoded_inputs["input_ids"]) - - batch_outputs = BatchEncoding(encoded_inputs, tensor_type=return_tensors) + raise ValueError("Invalid padding strategy:" + str(self.padding_side)) + else: + if return_attention_mask: + encoded_inputs["attention_mask"] = [1] * len(encoded_inputs["input_ids"]) - return batch_outputs + return encoded_inputs #### Everything related to converting logits to predictions #### @@ -1336,11 +1726,13 @@ def _parse_coordinates(self, raw_coordinates): def convert_logits_to_predictions( self, data, logits, logits_agg=None, logits_cls=None, cell_classification_threshold=0.5 ): - """Converts logits to actual predictions. + """ + Converts logits to actual predictions. Args: data (:obj:`dict`): - Dictionary mapping features to actual values. Should be created using :class:`~transformers.TapasTokenizer`. + Dictionary mapping features to actual values. Should be created using + :class:`~transformers.TapasTokenizer`. logits (:obj:`torch.FloatTensor` of shape ``(batch_size, sequence_length)``): Tensor containing the logits at the token level. logits_agg (:obj:`torch.FloatTensor` of shape ``(batch_size, num_aggregation_labels)``, `optional`): @@ -1348,16 +1740,17 @@ def convert_logits_to_predictions( logits_cls (:obj:`torch.FloatTensor` of shape ``(batch_size, num_classification_labels)``, `optional`): Tensor containing the classification logits. cell_classification_threshold (:obj:`float`, `optional`, defaults to 0.5): - Threshold to be used for cell selection. All table cells for which their probability is larger than this threshold will be selected. + Threshold to be used for cell selection. 
All table cells for which their probability is larger than
+            this threshold will be selected.
+
         Returns:
-        :obj:`tuple` comprising various elements depending on the inputs:
-        answer_coordinates_batch (``List[List[[tuple]]``) of length ``batch_size``:
-            Answer coordinates as a list of lists of tuples. Each element in the list contains the predicted answer coordinates of a single example in the batch, as a list of tuples.
-            Each tuple is a cell (row, column pair).
-        aggregation_predictions (`optional`, returned when ``logits_aggregation`` is provided) ``List[int]`` of length ``batch_size``:
-            Prediction indices of the aggregation head.
-        classification_predictions (`optional`, returned when ``logits_cls`` is provided) ``List[int]`` of length ``batch_size``:
-            Prediction indices of the classification head.
+        :obj:`tuple` comprising various elements depending on the inputs: answer_coordinates_batch
+        (``List[List[tuple]]``) of length ``batch_size``: Answer coordinates as a list of lists of tuples. Each
+        element in the list contains the predicted answer coordinates of a single example in the batch, as a list
+        of tuples. Each tuple is a cell (row, column pair). aggregation_predictions (`optional`, returned when
+        ``logits_aggregation`` is provided) ``List[int]`` of length ``batch_size``: Prediction indices of the
+        aggregation head. classification_predictions (`optional`, returned when ``logits_cls`` is provided)
+        ``List[int]`` of length ``batch_size``: Prediction indices of the classification head.
         """
         # compute probabilities from token logits
         dist_per_token = torch.distributions.Bernoulli(logits=logits)
@@ -1428,11 +1821,14 @@ def convert_logits_to_predictions(

 #### End of everything related to converting logits to predictions ####


+""" BasicTokenizer and WordPieceTokenizer (taken from tokenization_bert.py)"""
+
+
 class BasicTokenizer(object):
     """
-    Constructs a BasicTokenizer that will run basic tokenization (punctuation splitting, lower casing, etc.).
+    Constructs a BasicTokenizer that will run basic tokenization (punctuation splitting, lower casing, etc.)
+
     Args:
         do_lower_case (:obj:`bool`, `optional`, defaults to :obj:`True`):
             Whether or not to lowercase the input when tokenizing.
@@ -1440,9 +1836,8 @@ class BasicTokenizer(object):
             Collection of tokens which will never be split during tokenization. Only has an effect when
             :obj:`do_basic_tokenize=True`
         tokenize_chinese_chars (:obj:`bool`, `optional`, defaults to :obj:`True`):
-            Whether or not to tokenize Chinese characters.
-            This should likely be deactivated for Japanese (see this `issue
-            `__).
+            Whether or not to tokenize Chinese characters. This should likely be deactivated for Japanese (see this
+            `issue `__).
         strip_accents: (:obj:`bool`, `optional`):
             Whether or not to strip all accents. If this option is not specified, then it will be determined by the
             value for :obj:`lowercase` (as in the original BERT).
@@ -1459,7 +1854,8 @@ def __init__(self, do_lower_case=True, never_split=None, tokenize_chinese_chars=
     def tokenize(self, text, never_split=None):
         """
         Basic Tokenization of a piece of text. Split on "white spaces" only, for sub-word tokenization, see
-        WordPieceTokenizer.
+        WordPieceTokenizer
+
         Args:
             **never_split**: (`optional`) list of str
                 Kept for backward compatibility purposes. Now implemented directly at the base class level (see
@@ -1587,11 +1983,13 @@ def __init__(self, vocab, unk_token, max_input_chars_per_word=100):
     def tokenize(self, text):
         """
         Tokenizes a piece of text into its word pieces. This uses a greedy longest-match-first algorithm to perform
-        tokenization using the given vocabulary.
-        For example, :obj:`input = "unaffable"` wil return as output :obj:`["un", "##aff", "##able"]`.
+        tokenization using the given vocabulary. For example, :obj:`input = "unaffable"` will return as output
+        :obj:`["un", "##aff", "##able"]`
+
         Args:
            text: A single token or whitespace separated tokens. 
This should have
-        already been passed through `BasicTokenizer`.
+        already been passed through `BasicTokenizer`
+
         Returns:
             A list of wordpiece tokens.
         """
@@ -1630,17 +2028,18 @@ def tokenize(self, text):
         return output_tokens


-""" Below: utilities for TAPAS tokenizer (independent from PyTorch/Tensorflow).
+"""
+    Below: utilities for TAPAS tokenizer (independent from PyTorch/Tensorflow).

-    This includes functions to parse numeric values (dates and numbers) from texts
-    to create the column_ranks, inv_column_ranks, numeric_values, numeric values_scale
-    and numeric_relations.
+    This includes functions to parse numeric values (dates and numbers) from texts to create the column_ranks,
+    inv_column_ranks, numeric_values, numeric_values_scale and numeric_relations.

-    These are meant to be used in an academic setup, for production use cases
-    Gold mine or Aqua should be used.
+    These are meant to be used in an academic setup; for production use cases Gold mine or Aqua should be used.
+
+    Mainly copied from number_utils.py and constants.py (both found under the "utils" directory) of the original
+    implementation.
+
+"""

-    Mainly copied from number_utils.py and constants.py (both found under the "utils" directory)
-    of the original implementation."""

 class Relation(enum.Enum):
     HEADER_TO_CELL = 1  # Connects header to cell.
@@ -1861,8 +2260,8 @@ def normalize_for_match(text):


 def get_all_spans(text, max_ngram_length):
-    """Split a text into all possible ngrams up to 'max_ngram_length'.
-    Split points are white space and punctuation.
+    """
+    Split a text into all possible ngrams up to 'max_ngram_length'. Split points are white space and punctuation.

     Args:
         text: Text to split.
@@ -1882,10 +2281,12 @@ def get_all_spans(text, max_ngram_length):


 def parse_text(text):
-    """Extracts longest number and date spans.
+    """
+    Extracts longest number and date spans.

     Args:
-        text: text to annotate.
+        text: text to annotate
+
     Returns:
         List of longest numeric value spans.
""" @@ -1976,21 +2377,19 @@ def _get_all_types(numeric_values): def get_numeric_sort_key_fn(numeric_values): - """Creates a function that can be used as a sort key or to compare the values. - Maps to primitive types and finds the biggest common subset. - Consider the values "05/05/2010" and "August 2007". - With the corresponding primitive values (2010.,5.,5.) and (2007.,8., None). - These values can be compared by year and date so we map to the sequence - (2010., 5.), (2007., 8.). - If we added a third value "2006" with primitive value (2006., None, None), - we could only compare by the year so we would map to (2010.,), (2007.,) - and (2006.,). + """ + Creates a function that can be used as a sort key or to compare the values. Maps to primitive types and finds the + biggest common subset. Consider the values "05/05/2010" and "August 2007". With the corresponding primitive values + (2010.,5.,5.) and (2007.,8., None). These values can be compared by year and date so we map to the sequence (2010., + 5.), (2007., 8.). If we added a third value "2006" with primitive value (2006., None, None), we could only compare + by the year so we would map to (2010.,), (2007.,) and (2006.,). Args: - numeric_values: Values to compare. + numeric_values: Values to compare + Returns: - A function that can be used as a sort key function (mapping numeric values - to a comparable tuple). + A function that can be used as a sort key function (mapping numeric values to a comparable tuple) + Raises: ValueError if values don't have a common type or are not comparable. """ @@ -2031,7 +2430,8 @@ def _get_numeric_values(text): def _parse_column_values(table, col_index): - """Parses text in column and returns a dict mapping row_index to values. + """ + Parses text in column and returns a dict mapping row_index to values. 
Args: table: Pandas dataframe @@ -2062,4 +2462,4 @@ def get_numeric_relation(value, other_value, sort_key_fn): return Relation.LT if value > other_value: return Relation.GT - return None \ No newline at end of file + return None diff --git a/tests/test_modeling_tapas.py b/tests/test_modeling_tapas.py index 016d1ce92804..a519117857c6 100644 --- a/tests/test_modeling_tapas.py +++ b/tests/test_modeling_tapas.py @@ -14,7 +14,6 @@ # limitations under the License. - import unittest import numpy as np @@ -26,17 +25,16 @@ from .test_modeling_common import ModelTesterMixin, floats_tensor, ids_tensor, random_attention_mask - if is_torch_available(): import torch from transformers import ( TAPAS_PRETRAINED_MODEL_ARCHIVE_LIST, TapasConfig, - TapasModel, TapasForMaskedLM, - TapasForSequenceClassification, TapasForQuestionAnswering, + TapasForSequenceClassification, + TapasModel, ) @@ -143,10 +141,8 @@ def prepare_config_and_inputs(self): input_mask = random_attention_mask([self.batch_size, self.seq_length]) token_type_ids = [] - for type_vocab_size in self.type_vocab_sizes: - token_type_ids.append( - ids_tensor(shape=[self.batch_size, self.seq_length], vocab_size=type_vocab_size) - ) + for type_vocab_size in self.type_vocab_sizes: + token_type_ids.append(ids_tensor(shape=[self.batch_size, self.seq_length], vocab_size=type_vocab_size)) token_type_ids = torch.stack(token_type_ids, dim=2) sequence_labels = None @@ -219,8 +215,18 @@ def prepare_config_and_inputs(self): ) def create_and_check_model( - self, config, input_ids, input_mask, token_type_ids, sequence_labels, token_labels, label_ids, answer, - numeric_values, numeric_values_scale, aggregation_labels + self, + config, + input_ids, + input_mask, + token_type_ids, + sequence_labels, + token_labels, + label_ids, + answer, + numeric_values, + numeric_values_scale, + aggregation_labels, ): model = TapasModel(config=config) model.to(torch_device) @@ -232,8 +238,18 @@ def create_and_check_model( 
self.parent.assertEqual(result.pooler_output.shape, (self.batch_size, self.hidden_size)) def create_and_check_for_masked_lm( - self, config, input_ids, input_mask, token_type_ids, sequence_labels, token_labels, label_ids, answer, - numeric_values, numeric_values_scale, aggregation_labels + self, + config, + input_ids, + input_mask, + token_type_ids, + sequence_labels, + token_labels, + label_ids, + answer, + numeric_values, + numeric_values_scale, + aggregation_labels, ): model = TapasForMaskedLM(config=config) model.to(torch_device) @@ -242,22 +258,48 @@ def create_and_check_for_masked_lm( self.parent.assertEqual(result.logits.shape, (self.batch_size, self.seq_length, self.vocab_size)) def create_and_check_for_question_answering( - self, config, input_ids, input_mask, token_type_ids, sequence_labels, token_labels, label_ids, answer, - numeric_values, numeric_values_scale, aggregation_labels + self, + config, + input_ids, + input_mask, + token_type_ids, + sequence_labels, + token_labels, + label_ids, + answer, + numeric_values, + numeric_values_scale, + aggregation_labels, ): model = TapasForQuestionAnswering(config=config) model.to(torch_device) model.eval() - result = model(input_ids, attention_mask=input_mask, token_type_ids=token_type_ids, - label_ids=label_ids, answer=answer, numeric_values=numeric_values, - numeric_values_scale=numeric_values_scale, aggregation_labels=aggregation_labels + result = model( + input_ids, + attention_mask=input_mask, + token_type_ids=token_type_ids, + label_ids=label_ids, + answer=answer, + numeric_values=numeric_values, + numeric_values_scale=numeric_values_scale, + aggregation_labels=aggregation_labels, ) self.parent.assertEqual(result.logits.shape, (self.batch_size, self.seq_length)) self.parent.assertEqual(result.logits_aggregation.shape, (self.batch_size, self.num_aggregation_labels)) def create_and_check_for_sequence_classification( - self, config, input_ids, input_mask, token_type_ids, sequence_labels, token_labels, 
label_ids, answer, - numeric_values, numeric_values_scale, aggregation_labels + self, + config, + input_ids, + input_mask, + token_type_ids, + sequence_labels, + token_labels, + label_ids, + answer, + numeric_values, + numeric_values_scale, + aggregation_labels, ): config.num_labels = self.num_labels model = TapasForSequenceClassification(config) @@ -375,6 +417,7 @@ def test_for_sequence_classification(self): # def test_large_inputs_in_fp16_dont_cause_overflow(self): # pass + # Below: tests for Tapas utilities, based on segmented_tensor_test.py of the original implementation. # These test the operations on segmented tensors. class TapasUtilitiesTest(unittest.TestCase): diff --git a/tests/test_tokenization_tapas.py b/tests/test_tokenization_tapas.py new file mode 100644 index 000000000000..baae26423406 --- /dev/null +++ b/tests/test_tokenization_tapas.py @@ -0,0 +1,1195 @@ +# coding=utf-8 +# Copyright 2018 The Google AI Language Team Authors. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
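The padding hunk earlier in this diff extends BERT-style padding to TAPAS's 7-dimensional token type ids: every pad position receives a full row of seven `pad_token_type_id` values, appended on the right or prepended on the left depending on `padding_side`. A minimal standalone sketch of that behavior (illustrative only — `pad_encoded_inputs` and the module-level constants are not library names, and the real method additionally handles `special_tokens_mask` and the `return_attention_mask` flag):

```python
# Illustrative sketch of TAPAS-style padding: token_type_ids are 7-dimensional
# (segment, column, row, prev_label, column_rank, inv_column_rank,
# numeric_relation), so each pad position gets a full 7-element row.
PAD_TOKEN_ID = 0
PAD_TOKEN_TYPE_ID = 0


def pad_encoded_inputs(encoded_inputs, max_length, padding_side="right"):
    """Pad input_ids, attention_mask and 7-dim token_type_ids to max_length."""
    difference = max_length - len(encoded_inputs["input_ids"])
    if difference <= 0:
        return encoded_inputs
    pad_rows = [[PAD_TOKEN_TYPE_ID] * 7 for _ in range(difference)]
    if padding_side == "right":
        encoded_inputs["attention_mask"] = [1] * len(encoded_inputs["input_ids"]) + [0] * difference
        encoded_inputs["token_type_ids"] = encoded_inputs["token_type_ids"] + pad_rows
        encoded_inputs["input_ids"] = encoded_inputs["input_ids"] + [PAD_TOKEN_ID] * difference
    elif padding_side == "left":
        encoded_inputs["attention_mask"] = [0] * difference + [1] * len(encoded_inputs["input_ids"])
        encoded_inputs["token_type_ids"] = pad_rows + encoded_inputs["token_type_ids"]
        encoded_inputs["input_ids"] = [PAD_TOKEN_ID] * difference + encoded_inputs["input_ids"]
    else:
        raise ValueError("Invalid padding strategy: " + str(padding_side))
    return encoded_inputs
```

For example, padding `input_ids=[101, 7592, 102]` to length 5 on the right yields an attention mask of `[1, 1, 1, 0, 0]` and two extra all-zero 7-element type rows, which is the shape the `[[0, 0, 0, 0, 0, 0, 0], ...]` expectations in the integration test below rely on.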
+import inspect +import os +import shutil +import tempfile +import unittest +from typing import List, Tuple + +import pandas as pd + +from transformers import AddedToken +from transformers.testing_utils import require_tokenizers, slow +from transformers.tokenization_tapas import ( + VOCAB_FILES_NAMES, + BasicTokenizer, + TapasTokenizer, + WordpieceTokenizer, + _is_control, + _is_punctuation, + _is_whitespace, +) + +from .test_tokenization_common import TokenizerTesterMixin, filter_non_english + + +@require_tokenizers +class TapasTokenizationTest(TokenizerTesterMixin, unittest.TestCase): + + tokenizer_class = TapasTokenizer + test_rust_tokenizer = False + space_between_special_tokens = True + from_pretrained_filter = filter_non_english + + def get_table( + self, + tokenizer: TapasTokenizer, + length=5, + ): + toks = [tokenizer.decode([i], clean_up_tokenization_spaces=False) for i in range(len(tokenizer))] + + if length == 0: + data = {} + else: + data = {toks[0]: [toks[tok] for tok in range(1, length)]} + + table = pd.DataFrame.from_dict(data) + + return table + + def get_table_and_query( + self, + tokenizer: TapasTokenizer, + add_special_tokens: bool = True, + length=5, + ): + toks = [tokenizer.decode([i], clean_up_tokenization_spaces=False) for i in range(len(tokenizer))] + table = self.get_table(tokenizer, length=length - 3) + query = " ".join(toks[:3]) + + return table, query + + def get_clean_sequence( + self, + tokenizer: TapasTokenizer, + with_prefix_space=False, + max_length=20, + min_length=5, + empty_table: bool = False, + add_special_tokens: bool = True, + return_table_and_query: bool = False, + ): + + toks = [tokenizer.decode([i], clean_up_tokenization_spaces=False) for i in range(len(tokenizer))] + + if empty_table: + table = pd.DataFrame.from_dict({}) + query = " ".join(toks[:min_length]) + else: + data = {toks[0]: [toks[tok] for tok in range(1, min_length - 3)]} + table = pd.DataFrame.from_dict(data) + query = " ".join(toks[:3]) + + output_ids = 
tokenizer.encode(table, query, add_special_tokens=add_special_tokens) + output_txt = tokenizer.decode(output_ids) + + assert len(output_ids) >= min_length, "Update the code to generate the sequences so that they are larger" + assert len(output_ids) <= max_length, "Update the code to generate the sequences so that they are smaller" + + if return_table_and_query: + return output_txt, output_ids, table, query + + return output_txt, output_ids + + # def get_clean_sequence(self, tokenizer, with_prefix_space=False, max_length=20, min_length=5) -> Tuple[str, list]: + # data = { + # 'Actors': ["Brad Pitt", "Leonardo Di Caprio", "George Clooney"], + # 'Age': ["56", "45", "59"], + # 'Number of movies': ["87", "53", "69"], + # 'Date of birth': ["18 december 1963", "11 november 1974", "6 may 1961"] + # } + # table = pd.DataFrame.from_dict(data) + # output_ids = tokenizer.encode(table, add_special_tokens=False, max_length=max_length) + # output_txt = tokenizer.decode(output_ids) + # + # return output_txt, output_ids + + def setUp(self): + super().setUp() + + vocab_tokens = [ + "[UNK]", + "[CLS]", + "[SEP]", + "[PAD]", + "[MASK]", + "want", + "##want", + "##ed", + "wa", + "un", + "runn", + "##ing", + ",", + "low", + "lowest", + ] + self.vocab_file = os.path.join(self.tmpdirname, VOCAB_FILES_NAMES["vocab_file"]) + with open(self.vocab_file, "w", encoding="utf-8") as vocab_writer: + vocab_writer.write("".join([x + "\n" for x in vocab_tokens])) + + def get_input_output_texts(self, tokenizer): + input_text = "UNwant\u00E9d,running" + output_text = "unwanted, running" + return input_text, output_text + + def test_full_tokenizer(self): + tokenizer = self.tokenizer_class(self.vocab_file) + + tokens = tokenizer.tokenize("UNwant\u00E9d,running") + self.assertListEqual(tokens, ["un", "##want", "##ed", ",", "runn", "##ing"]) + self.assertListEqual(tokenizer.convert_tokens_to_ids(tokens), [9, 6, 7, 12, 10, 11]) + + def test_rust_and_python_full_tokenizers(self): + if not 
self.test_rust_tokenizer: + return + + tokenizer = self.get_tokenizer() + rust_tokenizer = self.get_rust_tokenizer() + + sequence = "UNwant\u00E9d,running" + + tokens = tokenizer.tokenize(sequence) + rust_tokens = rust_tokenizer.tokenize(sequence) + self.assertListEqual(tokens, rust_tokens) + + ids = tokenizer.encode(sequence, add_special_tokens=False) + rust_ids = rust_tokenizer.encode(sequence, add_special_tokens=False) + self.assertListEqual(ids, rust_ids) + + rust_tokenizer = self.get_rust_tokenizer() + ids = tokenizer.encode(sequence) + rust_ids = rust_tokenizer.encode(sequence) + self.assertListEqual(ids, rust_ids) + + # With lower casing + tokenizer = self.get_tokenizer(do_lower_case=True) + rust_tokenizer = self.get_rust_tokenizer(do_lower_case=True) + + sequence = "UNwant\u00E9d,running" + + tokens = tokenizer.tokenize(sequence) + rust_tokens = rust_tokenizer.tokenize(sequence) + self.assertListEqual(tokens, rust_tokens) + + ids = tokenizer.encode(sequence, add_special_tokens=False) + rust_ids = rust_tokenizer.encode(sequence, add_special_tokens=False) + self.assertListEqual(ids, rust_ids) + + rust_tokenizer = self.get_rust_tokenizer() + ids = tokenizer.encode(sequence) + rust_ids = rust_tokenizer.encode(sequence) + self.assertListEqual(ids, rust_ids) + + def test_chinese(self): + tokenizer = BasicTokenizer() + + self.assertListEqual(tokenizer.tokenize("ah\u535A\u63A8zz"), ["ah", "\u535A", "\u63A8", "zz"]) + + def test_basic_tokenizer_lower(self): + tokenizer = BasicTokenizer(do_lower_case=True) + + self.assertListEqual( + tokenizer.tokenize(" \tHeLLo!how \n Are yoU? "), ["hello", "!", "how", "are", "you", "?"] + ) + self.assertListEqual(tokenizer.tokenize("H\u00E9llo"), ["hello"]) + + def test_basic_tokenizer_lower_strip_accents_false(self): + tokenizer = BasicTokenizer(do_lower_case=True, strip_accents=False) + + self.assertListEqual( + tokenizer.tokenize(" \tHäLLo!how \n Are yoU? 
"), ["hällo", "!", "how", "are", "you", "?"] + ) + self.assertListEqual(tokenizer.tokenize("H\u00E9llo"), ["h\u00E9llo"]) + + def test_basic_tokenizer_lower_strip_accents_true(self): + tokenizer = BasicTokenizer(do_lower_case=True, strip_accents=True) + + self.assertListEqual( + tokenizer.tokenize(" \tHäLLo!how \n Are yoU? "), ["hallo", "!", "how", "are", "you", "?"] + ) + self.assertListEqual(tokenizer.tokenize("H\u00E9llo"), ["hello"]) + + def test_basic_tokenizer_lower_strip_accents_default(self): + tokenizer = BasicTokenizer(do_lower_case=True) + + self.assertListEqual( + tokenizer.tokenize(" \tHäLLo!how \n Are yoU? "), ["hallo", "!", "how", "are", "you", "?"] + ) + self.assertListEqual(tokenizer.tokenize("H\u00E9llo"), ["hello"]) + + def test_basic_tokenizer_no_lower(self): + tokenizer = BasicTokenizer(do_lower_case=False) + + self.assertListEqual( + tokenizer.tokenize(" \tHeLLo!how \n Are yoU? "), ["HeLLo", "!", "how", "Are", "yoU", "?"] + ) + + def test_basic_tokenizer_no_lower_strip_accents_false(self): + tokenizer = BasicTokenizer(do_lower_case=False, strip_accents=False) + + self.assertListEqual( + tokenizer.tokenize(" \tHäLLo!how \n Are yoU? "), ["HäLLo", "!", "how", "Are", "yoU", "?"] + ) + + def test_basic_tokenizer_no_lower_strip_accents_true(self): + tokenizer = BasicTokenizer(do_lower_case=False, strip_accents=True) + + self.assertListEqual( + tokenizer.tokenize(" \tHäLLo!how \n Are yoU? "), ["HaLLo", "!", "how", "Are", "yoU", "?"] + ) + + def test_basic_tokenizer_respects_never_split_tokens(self): + tokenizer = BasicTokenizer(do_lower_case=False, never_split=["[UNK]"]) + + self.assertListEqual( + tokenizer.tokenize(" \tHeLLo!how \n Are yoU? 
[UNK]"), ["HeLLo", "!", "how", "Are", "yoU", "?", "[UNK]"] + ) + + def test_wordpiece_tokenizer(self): + vocab_tokens = ["[UNK]", "[CLS]", "[SEP]", "want", "##want", "##ed", "wa", "un", "runn", "##ing"] + + vocab = {} + for (i, token) in enumerate(vocab_tokens): + vocab[token] = i + tokenizer = WordpieceTokenizer(vocab=vocab, unk_token="[UNK]") + + self.assertListEqual(tokenizer.tokenize(""), []) + + self.assertListEqual(tokenizer.tokenize("unwanted running"), ["un", "##want", "##ed", "runn", "##ing"]) + + self.assertListEqual(tokenizer.tokenize("unwantedX running"), ["[UNK]", "runn", "##ing"]) + + def test_is_whitespace(self): + self.assertTrue(_is_whitespace(" ")) + self.assertTrue(_is_whitespace("\t")) + self.assertTrue(_is_whitespace("\r")) + self.assertTrue(_is_whitespace("\n")) + self.assertTrue(_is_whitespace("\u00A0")) + + self.assertFalse(_is_whitespace("A")) + self.assertFalse(_is_whitespace("-")) + + def test_is_control(self): + self.assertTrue(_is_control("\u0005")) + + self.assertFalse(_is_control("A")) + self.assertFalse(_is_control(" ")) + self.assertFalse(_is_control("\t")) + self.assertFalse(_is_control("\r")) + + def test_is_punctuation(self): + self.assertTrue(_is_punctuation("-")) + self.assertTrue(_is_punctuation("$")) + self.assertTrue(_is_punctuation("`")) + self.assertTrue(_is_punctuation(".")) + + self.assertFalse(_is_punctuation("A")) + self.assertFalse(_is_punctuation(" ")) + + def test_clean_text(self): + tokenizer = self.get_tokenizer() + # rust_tokenizer = self.get_rust_tokenizer() + + # Example taken from the issue https://github.com/huggingface/tokenizers/issues/340 + self.assertListEqual([tokenizer.tokenize(t) for t in ["Test", "\xad", "test"]], [["[UNK]"], [], ["[UNK]"]]) + + # self.assertListEqual( + # [rust_tokenizer.tokenize(t) for t in ["Test", "\xad", "test"]], [["[UNK]"], [], ["[UNK]"]] + # ) + + @slow + def test_sequence_builders(self): + tokenizer = self.tokenizer_class.from_pretrained("tapas-base-uncased") + + text = 
tokenizer.encode("sequence builders", add_special_tokens=False) + text_2 = tokenizer.encode("multi-sequence build", add_special_tokens=False) + + encoded_sentence = tokenizer.build_inputs_with_special_tokens(text) + encoded_pair = tokenizer.build_inputs_with_special_tokens(text, text_2) + + assert encoded_sentence == [101] + text + [102] + assert encoded_pair == [101] + text + [102] + text_2 + [102] + + def test_offsets_with_special_characters(self): + for tokenizer, pretrained_name, kwargs in self.tokenizers_list: + with self.subTest("{} ({})".format(tokenizer.__class__.__name__, pretrained_name)): + tokenizer_r = self.rust_tokenizer_class.from_pretrained(pretrained_name, **kwargs) + + sentence = f"A, naïve {tokenizer_r.mask_token} AllenNLP sentence." + tokens = tokenizer_r.encode_plus( + sentence, + return_attention_mask=False, + return_token_type_ids=False, + return_offsets_mapping=True, + add_special_tokens=True, + ) + + do_lower_case = tokenizer_r.do_lower_case if hasattr(tokenizer_r, "do_lower_case") else False + expected_results = ( + [ + ((0, 0), tokenizer_r.cls_token), + ((0, 1), "A"), + ((1, 2), ","), + ((3, 5), "na"), + ((5, 6), "##ï"), + ((6, 8), "##ve"), + ((9, 15), tokenizer_r.mask_token), + ((16, 21), "Allen"), + ((21, 23), "##NL"), + ((23, 24), "##P"), + ((25, 33), "sentence"), + ((33, 34), "."), + ((0, 0), tokenizer_r.sep_token), + ] + if not do_lower_case + else [ + ((0, 0), tokenizer_r.cls_token), + ((0, 1), "a"), + ((1, 2), ","), + ((3, 8), "naive"), + ((9, 15), tokenizer_r.mask_token), + ((16, 21), "allen"), + ((21, 23), "##nl"), + ((23, 24), "##p"), + ((25, 33), "sentence"), + ((33, 34), "."), + ((0, 0), tokenizer_r.sep_token), + ] + ) + + self.assertEqual( + [e[1] for e in expected_results], tokenizer_r.convert_ids_to_tokens(tokens["input_ids"]) + ) + self.assertEqual([e[0] for e in expected_results], tokens["offset_mapping"]) + + def test_tapas_integration_test(self): + data = { + "Actors": ["Brad Pitt", "Leonardo Di Caprio", "George 
Clooney"], + "Age": ["56", "45", "59"], + "Number of movies": ["87", "53", "69"], + "Date of birth": ["18 december 1963", "11 november 1974", "6 may 1961"], + } + queries = [ + "When was Brad Pitt born?", + "Which actor appeared in the least number of movies?", + "What is the average number of movies?", + ] + table = pd.DataFrame.from_dict(data) + + # TODO: Should update this in the future + tokenizer = TapasTokenizer.from_pretrained("lysandre/tapas-temporary-repo", model_max_length=512) + + expected_results = { + "input_ids": [ + 101, + 2043, + 2001, + 8226, + 15091, + 2141, + 1029, + 102, + 5889, + 2287, + 2193, + 1997, + 5691, + 3058, + 1997, + 4182, + 8226, + 15091, + 5179, + 6584, + 2324, + 2285, + 3699, + 14720, + 4487, + 6178, + 9488, + 3429, + 5187, + 2340, + 2281, + 3326, + 2577, + 18856, + 7828, + 3240, + 5354, + 6353, + 1020, + 2089, + 3777, + ], + "attention_mask": [ + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + ], + "token_type_ids": [ + [0, 0, 0, 0, 0, 0, 0], + [0, 0, 0, 0, 0, 0, 0], + [0, 0, 0, 0, 0, 0, 0], + [0, 0, 0, 0, 0, 0, 0], + [0, 0, 0, 0, 0, 0, 0], + [0, 0, 0, 0, 0, 0, 0], + [0, 0, 0, 0, 0, 0, 0], + [0, 0, 0, 0, 0, 0, 0], + [1, 1, 0, 0, 0, 0, 0], + [1, 2, 0, 0, 0, 0, 0], + [1, 3, 0, 0, 0, 0, 0], + [1, 3, 0, 0, 0, 0, 0], + [1, 3, 0, 0, 0, 0, 0], + [1, 4, 0, 0, 0, 0, 0], + [1, 4, 0, 0, 0, 0, 0], + [1, 4, 0, 0, 0, 0, 0], + [1, 1, 1, 0, 0, 0, 0], + [1, 1, 1, 0, 0, 0, 0], + [1, 2, 1, 0, 2, 2, 0], + [1, 3, 1, 0, 3, 1, 0], + [1, 4, 1, 0, 2, 2, 0], + [1, 4, 1, 0, 2, 2, 0], + [1, 4, 1, 0, 2, 2, 0], + [1, 1, 2, 0, 0, 0, 0], + [1, 1, 2, 0, 0, 0, 0], + [1, 1, 2, 0, 0, 0, 0], + [1, 1, 2, 0, 0, 0, 0], + [1, 2, 2, 0, 1, 3, 0], + [1, 3, 2, 0, 1, 3, 0], + [1, 4, 2, 0, 3, 1, 0], + [1, 4, 2, 0, 3, 1, 0], + [1, 4, 2, 0, 3, 1, 0], + [1, 1, 3, 0, 0, 0, 0], + [1, 1, 3, 0, 0, 0, 0], + [1, 1, 3, 0, 0, 0, 0], 
+                [1, 1, 3, 0, 0, 0, 0],
+                [1, 2, 3, 0, 3, 1, 0],
+                [1, 3, 3, 0, 2, 2, 0],
+                [1, 4, 3, 0, 1, 3, 0],
+                [1, 4, 3, 0, 1, 3, 0],
+                [1, 4, 3, 0, 1, 3, 0],
+            ],
+        }
+
+        new_encoded_inputs = tokenizer.encode_plus(table=table, query=queries[0], padding="max_length")
+
+        self.assertDictEqual(new_encoded_inputs, expected_results)
+
+    def test_add_special_tokens(self):
+        tokenizers: List[TapasTokenizer] = self.get_tokenizers(do_lower_case=False)
+        for tokenizer in tokenizers:
+            with self.subTest(f"{tokenizer.__class__.__name__}"):
+                input_table = self.get_table(tokenizer, length=0)
+
+                special_token = "[SPECIAL_TOKEN]"
+
+                tokenizer.add_special_tokens({"cls_token": special_token})
+                encoded_special_token = tokenizer.encode(input_table, special_token, add_special_tokens=False)
+                self.assertEqual(len(encoded_special_token), 1)
+
+                decoded = tokenizer.decode(encoded_special_token, skip_special_tokens=True)
+                self.assertTrue(special_token not in decoded)
+
+    def test_add_tokens_tokenizer(self):
+        tokenizers: List[TapasTokenizer] = self.get_tokenizers(do_lower_case=False)
+        for tokenizer in tokenizers:
+            with self.subTest(f"{tokenizer.__class__.__name__}"):
+                table = self.get_table(tokenizer, length=0)
+                vocab_size = tokenizer.vocab_size
+                all_size = len(tokenizer)
+
+                self.assertNotEqual(vocab_size, 0)
+
+                # We usually have added tokens from the start in tests because our vocab fixtures are
+                # smaller than the original vocabs - let's not assert this
+                # self.assertEqual(vocab_size, all_size)
+
+                new_toks = ["aaaaa bbbbbb", "cccccccccdddddddd"]
+                added_toks = tokenizer.add_tokens(new_toks)
+                vocab_size_2 = tokenizer.vocab_size
+                all_size_2 = len(tokenizer)
+
+                self.assertNotEqual(vocab_size_2, 0)
+                self.assertEqual(vocab_size, vocab_size_2)
+                self.assertEqual(added_toks, len(new_toks))
+                self.assertEqual(all_size_2, all_size + len(new_toks))
+
+                tokens = tokenizer.encode(table, "aaaaa bbbbbb low cccccccccdddddddd l", add_special_tokens=False)
+
+                self.assertGreaterEqual(len(tokens), 4)
+                self.assertGreater(tokens[0], tokenizer.vocab_size - 1)
+                self.assertGreater(tokens[-2], tokenizer.vocab_size - 1)
+
+                new_toks_2 = {"eos_token": ">>>>|||<||<<|<<", "pad_token": "<<<<<|||>|>>>>|>"}
+                added_toks_2 = tokenizer.add_special_tokens(new_toks_2)
+                vocab_size_3 = tokenizer.vocab_size
+                all_size_3 = len(tokenizer)
+
+                self.assertNotEqual(vocab_size_3, 0)
+                self.assertEqual(vocab_size, vocab_size_3)
+                self.assertEqual(added_toks_2, len(new_toks_2))
+                self.assertEqual(all_size_3, all_size_2 + len(new_toks_2))
+
+                tokens = tokenizer.encode(
+                    table,
+                    ">>>>|||<||<<|<< aaaaabbbbbb low cccccccccdddddddd <<<<<|||>|>>>>|> l",
+                    add_special_tokens=False,
+                )
+
+                self.assertGreaterEqual(len(tokens), 6)
+                self.assertGreater(tokens[0], tokenizer.vocab_size - 1)
+                self.assertGreater(tokens[0], tokens[1])
+                self.assertGreater(tokens[-2], tokenizer.vocab_size - 1)
+                self.assertGreater(tokens[-2], tokens[-3])
+                self.assertEqual(tokens[0], tokenizer.eos_token_id)
+                self.assertEqual(tokens[-2], tokenizer.pad_token_id)
+
+    @require_tokenizers
+    def test_encode_decode_with_spaces(self):
+        tokenizers = self.get_tokenizers(do_lower_case=False)
+        for tokenizer in tokenizers:
+            with self.subTest(f"{tokenizer.__class__.__name__}"):
+                table = self.get_table(tokenizer, length=0)
+
+                # new_toks = ["[ABC]", "[DEF]"]  # TODO(thom) add this one back when Rust toks are ready: , "GHI IHG"]
+                new_toks = [AddedToken("[ABC]", normalized=False), AddedToken("[DEF]", normalized=False)]
+                tokenizer.add_tokens(new_toks)
+                input = "[ABC][DEF][ABC][DEF]"  # TODO(thom) add back cf above: "[ABC] [DEF] [ABC] GHI IHG [DEF]"
+                if self.space_between_special_tokens:
+                    output = "[ABC] [DEF] [ABC] [DEF]"
+                else:
+                    output = input
+                encoded = tokenizer.encode(table, input, add_special_tokens=False)
+                decoded = tokenizer.decode(encoded, spaces_between_special_tokens=self.space_between_special_tokens)
+                self.assertIn(decoded, [output, output.lower()])
+
+    def test_encode_plus_with_padding(self):
+        tokenizers = self.get_tokenizers(do_lower_case=False)
+        for tokenizer in tokenizers:
+            with self.subTest(f"{tokenizer.__class__.__name__}"):
+                table = self.get_table(tokenizer, length=0)
+                sequence = "Sequence"
+
+                # check correct behaviour if no pad_token_id exists and add it eventually
+                self._check_no_pad_token_padding(tokenizer, sequence)
+
+                padding_size = 10
+                padding_idx = tokenizer.pad_token_id
+                token_type_padding_idx = tokenizer.pad_token_type_id
+
+                encoded_sequence = tokenizer.encode_plus(table, sequence, return_special_tokens_mask=True)
+                input_ids = encoded_sequence["input_ids"]
+                special_tokens_mask = encoded_sequence["special_tokens_mask"]
+                sequence_length = len(input_ids)
+
+                # Test 'longest' and 'no_padding' don't do anything
+                tokenizer.padding_side = "right"
+
+                not_padded_sequence = tokenizer.encode_plus(
+                    table,
+                    sequence,
+                    padding=True,
+                    return_special_tokens_mask=True,
+                )
+                not_padded_input_ids = not_padded_sequence["input_ids"]
+
+                not_padded_special_tokens_mask = not_padded_sequence["special_tokens_mask"]
+                not_padded_sequence_length = len(not_padded_input_ids)
+
+                assert sequence_length == not_padded_sequence_length
+                assert input_ids == not_padded_input_ids
+                assert special_tokens_mask == not_padded_special_tokens_mask
+
+                not_padded_sequence = tokenizer.encode_plus(
+                    table,
+                    sequence,
+                    padding=False,
+                    return_special_tokens_mask=True,
+                )
+                not_padded_input_ids = not_padded_sequence["input_ids"]
+
+                not_padded_special_tokens_mask = not_padded_sequence["special_tokens_mask"]
+                not_padded_sequence_length = len(not_padded_input_ids)
+
+                assert sequence_length == not_padded_sequence_length
+                assert input_ids == not_padded_input_ids
+                assert special_tokens_mask == not_padded_special_tokens_mask
+
+                # Test right padding
+                tokenizer.padding_side = "right"
+
+                right_padded_sequence = tokenizer.encode_plus(
+                    table,
+                    sequence,
+                    max_length=sequence_length + padding_size,
+                    padding="max_length",
+                    return_special_tokens_mask=True,
+                )
+                right_padded_input_ids = right_padded_sequence["input_ids"]
+
+                right_padded_special_tokens_mask = right_padded_sequence["special_tokens_mask"]
+                right_padded_sequence_length = len(right_padded_input_ids)
+
+                assert sequence_length + padding_size == right_padded_sequence_length
+                assert input_ids + [padding_idx] * padding_size == right_padded_input_ids
+                assert special_tokens_mask + [1] * padding_size == right_padded_special_tokens_mask
+
+                # Test left padding
+                tokenizer.padding_side = "left"
+                left_padded_sequence = tokenizer.encode_plus(
+                    table,
+                    sequence,
+                    max_length=sequence_length + padding_size,
+                    padding="max_length",
+                    return_special_tokens_mask=True,
+                )
+                left_padded_input_ids = left_padded_sequence["input_ids"]
+                left_padded_special_tokens_mask = left_padded_sequence["special_tokens_mask"]
+                left_padded_sequence_length = len(left_padded_input_ids)
+
+                assert sequence_length + padding_size == left_padded_sequence_length
+                assert [padding_idx] * padding_size + input_ids == left_padded_input_ids
+                assert [1] * padding_size + special_tokens_mask == left_padded_special_tokens_mask
+
+                if "token_type_ids" in tokenizer.model_input_names:
+                    token_type_ids = encoded_sequence["token_type_ids"]
+                    left_padded_token_type_ids = left_padded_sequence["token_type_ids"]
+                    right_padded_token_type_ids = right_padded_sequence["token_type_ids"]
+
+                    assert (
+                        token_type_ids + [[token_type_padding_idx] * 7] * padding_size == right_padded_token_type_ids
+                    )
+                    assert [[token_type_padding_idx] * 7] * padding_size + token_type_ids == left_padded_token_type_ids
+
+                if "attention_mask" in tokenizer.model_input_names:
+                    attention_mask = encoded_sequence["attention_mask"]
+                    right_padded_attention_mask = right_padded_sequence["attention_mask"]
+                    left_padded_attention_mask = left_padded_sequence["attention_mask"]
+
+                    assert attention_mask + [0] * padding_size == right_padded_attention_mask
+                    assert [0] * padding_size + attention_mask == left_padded_attention_mask
+
+    def test_internal_consistency(self):
+        tokenizers = self.get_tokenizers()
+        for tokenizer in tokenizers:
+            with self.subTest(f"{tokenizer.__class__.__name__}"):
+                table = self.get_table(tokenizer, length=0)
+                input_text, output_text = self.get_input_output_texts(tokenizer)
+
+                tokens = tokenizer.tokenize(input_text)
+                ids = tokenizer.convert_tokens_to_ids(tokens)
+                ids_2 = tokenizer.encode(table, input_text, add_special_tokens=False)
+                self.assertListEqual(ids, ids_2)
+
+                tokens_2 = tokenizer.convert_ids_to_tokens(ids)
+                self.assertNotEqual(len(tokens_2), 0)
+                text_2 = tokenizer.decode(ids)
+                self.assertIsInstance(text_2, str)
+
+                self.assertEqual(text_2, output_text)
+
+    def test_mask_output(self):
+        tokenizers = self.get_tokenizers(fast=False, do_lower_case=False)
+        for tokenizer in tokenizers:
+            with self.subTest(f"{tokenizer.__class__.__name__}"):
+                table, query = self.get_table_and_query(tokenizer)
+
+                if (
+                    tokenizer.build_inputs_with_special_tokens.__qualname__.split(".")[0] != "PreTrainedTokenizer"
+                    and "token_type_ids" in tokenizer.model_input_names
+                ):
+                    information = tokenizer.encode_plus(table, query, add_special_tokens=True)
+                    sequences, mask = information["input_ids"], information["token_type_ids"]
+                    self.assertEqual(len(sequences), len(mask))
+
+    @unittest.skip("TAPAS tokenizer only handles two sequences.")
+    def test_maximum_encoding_length_pair_input(self):
+        pass
+
+    @unittest.skip("TAPAS tokenizer only handles two sequences.")
+    def test_maximum_encoding_length_single_input(self):
+        pass
+
+    def test_number_of_added_tokens(self):
+        tokenizers = self.get_tokenizers(do_lower_case=False)
+        for tokenizer in tokenizers:
+            with self.subTest(f"{tokenizer.__class__.__name__}"):
+
+                table, query = self.get_table_and_query(tokenizer)
+
+                sequences = tokenizer.encode(table, query, add_special_tokens=False)
+                attached_sequences = tokenizer.encode(table, query, add_special_tokens=True)
+
+                # Method is implemented (e.g. not GPT-2)
+                if len(attached_sequences) != 2:
+                    self.assertEqual(
+                        tokenizer.num_special_tokens_to_add(pair=True), len(attached_sequences) - len(sequences)
+                    )
+
+    def test_padding_to_max_length(self):
+        """We keep this test for backward compatibility but it should be removed when `pad_to_max_length` is deprecated"""
+        tokenizers = self.get_tokenizers(do_lower_case=False)
+        for tokenizer in tokenizers:
+            with self.subTest(f"{tokenizer.__class__.__name__}"):
+                table = self.get_table(tokenizer)
+                sequence = "Sequence"
+                padding_size = 10
+
+                # check correct behaviour if no pad_token_id exists and add it eventually
+                self._check_no_pad_token_padding(tokenizer, sequence)
+
+                padding_idx = tokenizer.pad_token_id
+
+                # Check that it correctly pads when a maximum length is specified along with the padding flag set to True
+                tokenizer.padding_side = "right"
+                encoded_sequence = tokenizer.encode(table, sequence)
+                sequence_length = len(encoded_sequence)
+                # FIXME: the next line should be padding(max_length) to avoid warning
+                padded_sequence = tokenizer.encode(
+                    table, sequence, max_length=sequence_length + padding_size, pad_to_max_length=True
+                )
+                padded_sequence_length = len(padded_sequence)
+                assert sequence_length + padding_size == padded_sequence_length
+                assert encoded_sequence + [padding_idx] * padding_size == padded_sequence
+
+                # Check that nothing is done when a maximum length is not specified
+                encoded_sequence = tokenizer.encode(table, sequence)
+                sequence_length = len(encoded_sequence)
+
+                tokenizer.padding_side = "right"
+                padded_sequence_right = tokenizer.encode(table, sequence, pad_to_max_length=True)
+                padded_sequence_right_length = len(padded_sequence_right)
+                assert sequence_length == padded_sequence_right_length
+                assert encoded_sequence == padded_sequence_right
+
+    def test_call(self):
+        # Tests that __call__ wraps encode_plus and batch_encode_plus
+        tokenizers = self.get_tokenizers(do_lower_case=False)
+        for tokenizer in tokenizers:
+            with self.subTest(f"{tokenizer.__class__.__name__}"):
+                sequences = [
+                    "Testing batch encode plus",
+                    "Testing batch encode plus with different sequence lengths",
+                    "Testing batch encode plus with different sequence lengths correctly pads",
+                ]
+
+                # Test not batched
+                table = self.get_table(tokenizer, length=0)
+                encoded_sequences_1 = tokenizer.encode_plus(table, sequences[0])
+                encoded_sequences_2 = tokenizer(table, sequences[0])
+                self.assertEqual(encoded_sequences_1, encoded_sequences_2)
+
+                # Test not batched pairs
+                table = self.get_table(tokenizer, length=10)
+                encoded_sequences_1 = tokenizer.encode_plus(table, sequences[1])
+                encoded_sequences_2 = tokenizer(table, sequences[1])
+                self.assertEqual(encoded_sequences_1, encoded_sequences_2)
+
+                # Test batched
+                table = self.get_table(tokenizer, length=0)
+                encoded_sequences_1 = tokenizer.batch_encode_plus(table, sequences)
+                encoded_sequences_2 = tokenizer(table, sequences)
+                self.assertEqual(encoded_sequences_1, encoded_sequences_2)
+
+    def test_batch_encode_plus_batch_sequence_length(self):
+        # Tests that all encoded values have the correct size
+        tokenizers = self.get_tokenizers(do_lower_case=False)
+        for tokenizer in tokenizers:
+            with self.subTest(f"{tokenizer.__class__.__name__}"):
+                table = self.get_table(tokenizer, length=0)
+                sequences = [
+                    "Testing batch encode plus",
+                    "Testing batch encode plus with different sequence lengths",
+                    "Testing batch encode plus with different sequence lengths correctly pads",
+                ]
+
+                encoded_sequences = [tokenizer.encode_plus(table, sequence) for sequence in sequences]
+                encoded_sequences_batch = tokenizer.batch_encode_plus(table, sequences, padding=False)
+                self.assertListEqual(
+                    encoded_sequences, self.convert_batch_encode_plus_format_to_encode_plus(encoded_sequences_batch)
+                )
+
+                maximum_length = len(
+                    max([encoded_sequence["input_ids"] for encoded_sequence in encoded_sequences], key=len)
+                )
+
+                # check correct behaviour if no pad_token_id exists and add it eventually
+                self._check_no_pad_token_padding(tokenizer, sequences)
+
+                encoded_sequences_padded = [
+                    tokenizer.encode_plus(table, sequence, max_length=maximum_length, padding="max_length")
+                    for sequence in sequences
+                ]
+
+                encoded_sequences_batch_padded = tokenizer.batch_encode_plus(table, sequences, padding=True)
+                self.assertListEqual(
+                    encoded_sequences_padded,
+                    self.convert_batch_encode_plus_format_to_encode_plus(encoded_sequences_batch_padded),
+                )
+
+                # check 'longest' is insensitive to a max length
+                encoded_sequences_batch_padded_1 = tokenizer.batch_encode_plus(table, sequences, padding=True)
+                encoded_sequences_batch_padded_2 = tokenizer.batch_encode_plus(
+                    table, sequences, max_length=maximum_length + 10, padding="longest"
+                )
+                for key in encoded_sequences_batch_padded_1.keys():
+                    self.assertListEqual(
+                        encoded_sequences_batch_padded_1[key],
+                        encoded_sequences_batch_padded_2[key],
+                    )
+
+                # check 'no_padding' is insensitive to a max length
+                encoded_sequences_batch_padded_1 = tokenizer.batch_encode_plus(table, sequences, padding=False)
+                encoded_sequences_batch_padded_2 = tokenizer.batch_encode_plus(
+                    table, sequences, max_length=maximum_length + 10, padding=False
+                )
+                for key in encoded_sequences_batch_padded_1.keys():
+                    self.assertListEqual(
+                        encoded_sequences_batch_padded_1[key],
+                        encoded_sequences_batch_padded_2[key],
+                    )
+
+    def test_batch_encode_plus_overflowing_tokens(self):
+        tokenizers = self.get_tokenizers(do_lower_case=False)
+        for tokenizer in tokenizers:
+            table = self.get_table(tokenizer, length=0)
+            string_sequences = ["Testing the prepare_for_model method.", "Test"]
+
+            if tokenizer.pad_token is None:
+                tokenizer.add_special_tokens({"pad_token": "[PAD]"})
+
+            tokenizer.batch_encode_plus(
+                table, string_sequences, return_overflowing_tokens=True, truncation=True, padding=True, max_length=3
+            )
+
+    def test_batch_encode_plus_padding(self):
+        # Test that padded sequences are equivalent between batch_encode_plus and encode_plus
+
+        # Right padding tests
+        tokenizers = self.get_tokenizers(do_lower_case=False)
+        for tokenizer in tokenizers:
+            with self.subTest(f"{tokenizer.__class__.__name__}"):
+                table = self.get_table(tokenizer, length=0)
+                sequences = [
+                    "Testing batch encode plus",
+                    "Testing batch encode plus with different sequence lengths",
+                    "Testing batch encode plus with different sequence lengths correctly pads",
+                ]
+
+                max_length = 100
+
+                # check correct behaviour if no pad_token_id exists and add it eventually
+                self._check_no_pad_token_padding(tokenizer, sequences)
+
+                encoded_sequences = [
+                    tokenizer.encode_plus(table, sequence, max_length=max_length, padding="max_length")
+                    for sequence in sequences
+                ]
+                encoded_sequences_batch = tokenizer.batch_encode_plus(
+                    table, sequences, max_length=max_length, padding="max_length"
+                )
+                self.assertListEqual(
+                    encoded_sequences, self.convert_batch_encode_plus_format_to_encode_plus(encoded_sequences_batch)
+                )
+
+        # Left padding tests
+        tokenizers = self.get_tokenizers(do_lower_case=False)
+        for tokenizer in tokenizers:
+            with self.subTest(f"{tokenizer.__class__.__name__}"):
+                tokenizer.padding_side = "left"
+                table = self.get_table(tokenizer, length=0)
+                sequences = [
+                    "Testing batch encode plus",
+                    "Testing batch encode plus with different sequence lengths",
+                    "Testing batch encode plus with different sequence lengths correctly pads",
+                ]
+
+                max_length = 100
+
+                # check correct behaviour if no pad_token_id exists and add it eventually
+                self._check_no_pad_token_padding(tokenizer, sequences)
+
+                encoded_sequences = [
+                    tokenizer.encode_plus(table, sequence, max_length=max_length, padding="max_length")
+                    for sequence in sequences
+                ]
+                encoded_sequences_batch = tokenizer.batch_encode_plus(
+                    table, sequences, max_length=max_length, padding="max_length"
+                )
+                self.assertListEqual(
+                    encoded_sequences, self.convert_batch_encode_plus_format_to_encode_plus(encoded_sequences_batch)
+                )
+
+    def test_padding_to_multiple_of(self):
+        tokenizers = self.get_tokenizers()
+        for tokenizer in tokenizers:
+            with self.subTest(f"{tokenizer.__class__.__name__}"):
+                table = self.get_table(tokenizer, length=0)
+                if tokenizer.pad_token is None:
+                    self.skipTest("No padding token.")
+                else:
+                    empty_tokens = tokenizer(table, padding=True, pad_to_multiple_of=8)
+                    normal_tokens = tokenizer(table, "This is a sample input", padding=True, pad_to_multiple_of=8)
+                    for key, value in empty_tokens.items():
+                        self.assertEqual(len(value) % 8, 0, "BatchEncoding.{} is not multiple of 8".format(key))
+                    for key, value in normal_tokens.items():
+                        self.assertEqual(len(value) % 8, 0, "BatchEncoding.{} is not multiple of 8".format(key))
+
+                    normal_tokens = tokenizer(table, "This", pad_to_multiple_of=8)
+                    for key, value in normal_tokens.items():
+                        self.assertNotEqual(len(value) % 8, 0, "BatchEncoding.{} is not multiple of 8".format(key))
+
+                    # Should also work with truncation
+                    normal_tokens = tokenizer(table, "This", padding=True, truncation=True, pad_to_multiple_of=8)
+                    for key, value in normal_tokens.items():
+                        self.assertEqual(len(value) % 8, 0, "BatchEncoding.{} is not multiple of 8".format(key))
+
+                    # truncation to something which is not a multiple of pad_to_multiple_of raises an error
+                    self.assertRaises(
+                        ValueError,
+                        tokenizer.__call__,
+                        table,
+                        "This",
+                        padding=True,
+                        truncation=True,
+                        max_length=12,
+                        pad_to_multiple_of=8,
+                    )
+
+    @unittest.skip("TAPAS cannot handle `prepare_for_model` without passing by `encode_plus` or `batch_encode_plus`")
+    def test_prepare_for_model(self):
+        pass
+
+    def test_tokenizer_slow_store_full_signature(self):
+        signature = inspect.signature(self.tokenizer_class.__init__)
+        tokenizer = self.get_tokenizer()
+
+        for parameter_name, parameter in signature.parameters.items():
+            if parameter.default != inspect.Parameter.empty:
+                self.assertIn(parameter_name, tokenizer.init_kwargs)
+
+    def test_special_tokens_mask_input_pairs(self):
+        tokenizers = self.get_tokenizers(do_lower_case=False)
+        for tokenizer in tokenizers:
+            with self.subTest(f"{tokenizer.__class__.__name__}"):
+                sequence_0 = "Encode this."
+                empty_table = self.get_table(tokenizer, length=0)
+                table = self.get_table(tokenizer, length=10)
+                encoded_sequence = tokenizer.encode(empty_table, sequence_0, add_special_tokens=False)
+                encoded_sequence += tokenizer.encode(table, "", add_special_tokens=False)
+                encoded_sequence_dict = tokenizer.encode_plus(
+                    table,
+                    sequence_0,
+                    add_special_tokens=True,
+                    return_special_tokens_mask=True,
+                    # add_prefix_space=False,
+                )
+                encoded_sequence_w_special = encoded_sequence_dict["input_ids"]
+                special_tokens_mask = encoded_sequence_dict["special_tokens_mask"]
+                self.assertEqual(len(special_tokens_mask), len(encoded_sequence_w_special))
+
+                filtered_sequence = [
+                    (x if not special_tokens_mask[i] else None) for i, x in enumerate(encoded_sequence_w_special)
+                ]
+                filtered_sequence = [x for x in filtered_sequence if x is not None]
+                self.assertEqual(encoded_sequence, filtered_sequence)
+
+    def test_special_tokens_mask(self):
+        tokenizers = self.get_tokenizers(do_lower_case=False)
+        for tokenizer in tokenizers:
+            with self.subTest(f"{tokenizer.__class__.__name__}"):
+                table = self.get_table(tokenizer, length=0)
+                sequence_0 = "Encode this."
+                # Testing single inputs
+                encoded_sequence = tokenizer.encode(table, sequence_0, add_special_tokens=False)
+                encoded_sequence_dict = tokenizer.encode_plus(
+                    table, sequence_0, add_special_tokens=True, return_special_tokens_mask=True
+                )
+                encoded_sequence_w_special = encoded_sequence_dict["input_ids"]
+                special_tokens_mask = encoded_sequence_dict["special_tokens_mask"]
+                self.assertEqual(len(special_tokens_mask), len(encoded_sequence_w_special))
+
+                filtered_sequence = [x for i, x in enumerate(encoded_sequence_w_special) if not special_tokens_mask[i]]
+                self.assertEqual(encoded_sequence, filtered_sequence)
+
+    def test_save_and_load_tokenizer(self):
+        # safety check on max_len default value so we are sure the test works
+        tokenizers = self.get_tokenizers()
+        for tokenizer in tokenizers:
+            with self.subTest(f"{tokenizer.__class__.__name__}"):
+                self.assertNotEqual(tokenizer.model_max_length, 42)
+
+        # Now let's start the test
+        tokenizers = self.get_tokenizers()
+        for tokenizer in tokenizers:
+            with self.subTest(f"{tokenizer.__class__.__name__}"):
+                # Isolate this from the other tests because we save additional tokens/etc
+                table = self.get_table(tokenizer, length=0)
+                tmpdirname = tempfile.mkdtemp()
+
+                sample_text = " He is very happy, UNwant\u00E9d,running"
+                before_tokens = tokenizer.encode(table, sample_text, add_special_tokens=False)
+                before_vocab = tokenizer.get_vocab()
+                tokenizer.save_pretrained(tmpdirname)
+
+                after_tokenizer = tokenizer.__class__.from_pretrained(tmpdirname)
+                after_tokens = after_tokenizer.encode(table, sample_text, add_special_tokens=False)
+                after_vocab = after_tokenizer.get_vocab()
+                self.assertListEqual(before_tokens, after_tokens)
+                self.assertDictEqual(before_vocab, after_vocab)
+
+                shutil.rmtree(tmpdirname)
+
+    def test_right_and_left_padding(self):
+        tokenizers = self.get_tokenizers(do_lower_case=False)
+        for tokenizer in tokenizers:
+            with self.subTest(f"{tokenizer.__class__.__name__}"):
+                table = self.get_table(tokenizer, length=0)
+                sequence = "Sequence"
+                padding_size = 10
+
+                # check correct behaviour if no pad_token_id exists and add it eventually
+                self._check_no_pad_token_padding(tokenizer, sequence)
+
+                padding_idx = tokenizer.pad_token_id
+
+                # RIGHT PADDING - Check that it correctly pads when a maximum length is specified along with the padding flag set to True
+                tokenizer.padding_side = "right"
+                encoded_sequence = tokenizer.encode(table, sequence)
+                sequence_length = len(encoded_sequence)
+                padded_sequence = tokenizer.encode(
+                    table, sequence, max_length=sequence_length + padding_size, padding="max_length"
+                )
+                padded_sequence_length = len(padded_sequence)
+                assert sequence_length + padding_size == padded_sequence_length
+                assert encoded_sequence + [padding_idx] * padding_size == padded_sequence
+
+                # LEFT PADDING - Check that it correctly pads when a maximum length is specified along with the padding flag set to True
+                tokenizer.padding_side = "left"
+                encoded_sequence = tokenizer.encode(table, sequence)
+                sequence_length = len(encoded_sequence)
+                padded_sequence = tokenizer.encode(
+                    table, sequence, max_length=sequence_length + padding_size, padding="max_length"
+                )
+                padded_sequence_length = len(padded_sequence)
+                assert sequence_length + padding_size == padded_sequence_length
+                assert [padding_idx] * padding_size + encoded_sequence == padded_sequence
+
+                # RIGHT & LEFT PADDING - Check that nothing is done for 'longest' and 'no_padding'
+                encoded_sequence = tokenizer.encode(table, sequence)
+                sequence_length = len(encoded_sequence)
+
+                tokenizer.padding_side = "right"
+                padded_sequence_right = tokenizer.encode(table, sequence, padding=True)
+                padded_sequence_right_length = len(padded_sequence_right)
+                assert sequence_length == padded_sequence_right_length
+                assert encoded_sequence == padded_sequence_right
+
+                tokenizer.padding_side = "left"
+                padded_sequence_left = tokenizer.encode(table, sequence, padding="longest")
+                padded_sequence_left_length = len(padded_sequence_left)
+                assert sequence_length == padded_sequence_left_length
+                assert encoded_sequence == padded_sequence_left
+
+                tokenizer.padding_side = "right"
+                padded_sequence_right = tokenizer.encode(table, sequence)
+                padded_sequence_right_length = len(padded_sequence_right)
+                assert sequence_length == padded_sequence_right_length
+                assert encoded_sequence == padded_sequence_right
+
+                tokenizer.padding_side = "left"
+                padded_sequence_left = tokenizer.encode(table, sequence, padding=False)
+                padded_sequence_left_length = len(padded_sequence_left)
+                assert sequence_length == padded_sequence_left_length
+                assert encoded_sequence == padded_sequence_left
+
+    @unittest.skip("TAPAS doesn't handle pre-tokenized inputs.")
+    def test_pretokenized_inputs(self):
+        pass
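Reviewer note, outside the diff: the padding tests above repeatedly assert the same invariants (right padding appends pad ids, left padding prepends them, the attention mask is padded with 0s, and TAPAS token type ids are padded with 7-wide zero vectors). A minimal, dependency-free sketch of those invariants, using plain Python lists; the helper name `pad_encoding` and the constant `NUM_TOKEN_TYPES` are assumptions made for this illustration, not part of the TAPAS API:

```python
# Sketch of the padding invariants asserted by the tests above.
# NUM_TOKEN_TYPES = 7 mirrors TAPAS's 7 token type ids per token.
NUM_TOKEN_TYPES = 7


def pad_encoding(input_ids, token_type_ids, attention_mask, max_length, pad_id=0, side="right"):
    """Pad input_ids / token_type_ids / attention_mask to max_length on the given side."""
    diff = max_length - len(input_ids)
    if diff <= 0:
        return input_ids, token_type_ids, attention_mask
    pad_ids = [pad_id] * diff
    pad_types = [[0] * NUM_TOKEN_TYPES] * diff  # each padded position gets a 7-wide zero vector
    pad_mask = [0] * diff                       # padded positions are masked out
    if side == "right":
        return input_ids + pad_ids, token_type_ids + pad_types, attention_mask + pad_mask
    return pad_ids + input_ids, pad_types + token_type_ids, pad_mask + attention_mask


ids = [101, 7592, 102]
types = [[0] * NUM_TOKEN_TYPES for _ in ids]
mask = [1, 1, 1]

# Right padding appends pad ids and mask zeros.
right_ids, right_types, right_mask = pad_encoding(ids, types, mask, max_length=5, side="right")
assert right_ids == [101, 7592, 102, 0, 0]
assert right_mask == [1, 1, 1, 0, 0]
assert len(right_types) == 5 and right_types[-1] == [0] * NUM_TOKEN_TYPES

# Left padding prepends them instead.
left_ids, _, left_mask = pad_encoding(ids, types, mask, max_length=5, side="left")
assert left_ids == [0, 0, 101, 7592, 102]
assert left_mask == [0, 0, 1, 1, 1]
```

The real tokenizer does this internally when `padding="max_length"` is passed; the sketch only makes the list arithmetic in the test assertions explicit.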