diff --git a/README.md b/README.md index 1a898a9f076e..a710d6db6a54 100644 --- a/README.md +++ b/README.md @@ -188,6 +188,7 @@ Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. ultilingual BERT into [DistilmBERT](https://github.com/huggingface/transformers/tree/master/examples/distillation) and a German version of DistilBERT. 1. **[SqueezeBert](https://huggingface.co/transformers/model_doc/squeezebert.html)** released with the paper [SqueezeBERT: What can computer vision teach NLP about efficient neural networks?](https://arxiv.org/abs/2006.11316) by Forrest N. Iandola, Albert E. Shaw, Ravi Krishna, and Kurt W. Keutzer. 1. **[T5](https://huggingface.co/transformers/model_doc/t5.html)** (from Google AI) released with the paper [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/abs/1910.10683) by Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu. +1. **[TAPAS](https://huggingface.co/transformers/master/model_doc/tapas.html)** released with the paper [TAPAS: Weakly Supervised Table Parsing via Pre-training](https://arxiv.org/abs/2004.02349) by Jonathan Herzig, Paweł Krzysztof Nowak, Thomas Müller, Francesco Piccinno and Julian Martin Eisenschlos. 1. **[Transformer-XL](https://huggingface.co/transformers/model_doc/transformerxl.html)** (from Google/CMU) released with the paper [Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](https://arxiv.org/abs/1901.02860) by Zihang Dai*, Zhilin Yang*, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov. 1. **[XLM](https://huggingface.co/transformers/model_doc/xlm.html)** (from Facebook) released together with the paper [Cross-lingual Language Model Pretraining](https://arxiv.org/abs/1901.07291) by Guillaume Lample and Alexis Conneau. 1. **[XLM-ProphetNet](https://huggingface.co/transformers/model_doc/xlmprophetnet.html)** (from Microsoft Research) released with the paper [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou. @@ -222,4 +223,4 @@ We now have a [paper](https://arxiv.org/abs/1910.03771) you can cite for the year={2019}, volume={abs/1910.03771} } -``` +``` \ No newline at end of file diff --git a/docs/source/index.rst b/docs/source/index.rst index 737f562f663e..7b68b3ce91bc 100644 --- a/docs/source/index.rst +++ b/docs/source/index.rst @@ -145,22 +145,25 @@ conversion utilities for the following models: 27. :doc:`T5 ` (from Google AI) released with the paper `Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer `__ by Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu. -28. :doc:`Transformer-XL ` (from Google/CMU) released with the paper `Transformer-XL: +28. :doc:`TAPAS ` (from Google AI) released with the paper `TAPAS: Weakly Supervised Table Parsing via + Pre-training `__ by Jonathan Herzig, Paweł Krzysztof Nowak, Thomas Müller, + Francesco Piccinno and Julian Martin Eisenschlos. +29. :doc:`Transformer-XL ` (from Google/CMU) released with the paper `Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context `__ by Zihang Dai*, Zhilin Yang*, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov. -29. 
:doc:`XLM ` (from Facebook) released together with the paper `Cross-lingual Language Model +30. :doc:`XLM ` (from Facebook) released together with the paper `Cross-lingual Language Model Pretraining `__ by Guillaume Lample and Alexis Conneau. -30. :doc:`XLM-ProphetNet ` (from Microsoft Research) released with the paper `ProphetNet: +31. :doc:`XLM-ProphetNet ` (from Microsoft Research) released with the paper `ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training `__ by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou. -31. :doc:`XLM-RoBERTa ` (from Facebook AI), released together with the paper `Unsupervised +32. :doc:`XLM-RoBERTa ` (from Facebook AI), released together with the paper `Unsupervised Cross-lingual Representation Learning at Scale `__ by Alexis Conneau*, Kartikay Khandelwal*, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer and Veselin Stoyanov. -32. :doc:`XLNet ` (from Google/CMU) released with the paper `​XLNet: Generalized Autoregressive +33. :doc:`XLNet ` (from Google/CMU) released with the paper `​XLNet: Generalized Autoregressive Pretraining for Language Understanding `__ by Zhilin Yang*, Zihang Dai*, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le. -33. `Other community models `__, contributed by the `community +34. `Other community models `__, contributed by the `community `__. .. toctree:: @@ -258,6 +261,7 @@ conversion utilities for the following models: model_doc/roberta model_doc/squeezebert model_doc/t5 + model_doc/tapas model_doc/transformerxl model_doc/xlm model_doc/xlmprophetnet diff --git a/docs/source/model_doc/tapas.rst b/docs/source/model_doc/tapas.rst new file mode 100644 index 000000000000..d46fb25b0dae --- /dev/null +++ b/docs/source/model_doc/tapas.rst @@ -0,0 +1,378 @@ +TAPAS +----------------------------------------------------------------------------------------------------------------------- + +Overview +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The TAPAS model was proposed in `TAPAS: Weakly Supervised Table Parsing via Pre-training +`__ by Jonathan Herzig, Paweł Krzysztof Nowak, Thomas Müller, Francesco Piccinno and +Julian Martin Eisenschlos. It's a BERT-based model specifically designed (and pre-trained) for answering questions +about tabular data. Compared to BERT, TAPAS uses relative position embeddings and has 7 token types that encode tabular +structure. TAPAS is pre-trained on the masked language modeling (MLM) objective on a large dataset comprising millions +of tables from English Wikipedia and corresponding texts. For question answering, TAPAS has 2 heads on top: a cell +selection head and an aggregation head, for (optionally) performing aggregations (such as counting or summing) among +selected cells. TAPAS has been fine-tuned on several datasets: SQA (Sequential Question Answering by Microsoft), WTQ +(Wiki Table Questions by Stanford University) and WikiSQL (by Salesforce). It achieves state-of-the-art on both SQA and +WTQ, while having comparable performance to SOTA on WikiSQL, with a much simpler architecture. + +The abstract from the paper is the following: + +*Answering natural language questions over tables is usually seen as a semantic parsing task. To alleviate the +collection cost of full logical forms, one popular approach focuses on weak supervision consisting of denotations +instead of logical forms. 
However, training semantic parsers from weak supervision poses difficulties, and in addition,
+the generated logical forms are only used as an intermediate step prior to retrieving the denotation. In this paper, we
+present TAPAS, an approach to question answering over tables without generating logical forms. TAPAS trains from weak
+supervision, and predicts the denotation by selecting table cells and optionally applying a corresponding aggregation
+operator to such selection. TAPAS extends BERT's architecture to encode tables as input, initializes from an effective
+joint pre-training of text segments and tables crawled from Wikipedia, and is trained end-to-end. We experiment with
+three different semantic parsing datasets, and find that TAPAS outperforms or rivals semantic parsing models by
+improving state-of-the-art accuracy on SQA from 55.1 to 67.2 and performing on par with the state-of-the-art on WIKISQL
+and WIKITQ, but with a simpler model architecture. We additionally find that transfer learning, which is trivial in our
+setting, from WIKISQL to WIKITQ, yields 48.7 accuracy, 4.2 points above the state-of-the-art.*
+
+In addition, the authors have further pre-trained TAPAS to recognize table entailment, by creating a balanced dataset
+of millions of automatically created training examples which are learned in an intermediate step prior to fine-tuning.
+The authors of TAPAS call this further pre-training intermediate pre-training (since TAPAS is first pre-trained on MLM,
+and then on another dataset). They found that intermediate pre-training further improves performance on SQA, achieving
+a new state-of-the-art as well as state-of-the-art on TabFact, a large-scale dataset with 16k Wikipedia tables for
+table entailment (a binary classification task). For more details, see their follow-up paper: `Understanding tables with
+intermediate pre-training `__ by Julian Martin Eisenschlos, Syrine Krichene and
+Thomas Müller.
+
+The original code can be found `here `__.
+
+Tips:
+
+- TAPAS is a model that uses relative position embeddings by default (restarting the position embeddings at every cell
+  of the table). According to the authors, this usually results in slightly better performance, and allows you to
+  encode longer sequences without running out of embeddings. This is reflected in the ``reset_position_index_per_cell``
+  parameter of :class:`~transformers.TapasConfig`, which is set to ``True`` by default.
+  Pre-trained checkpoints with both absolute and relative position embeddings are available on the `model hub `_.
+  Note that it's usually advised to pad the inputs on the right rather than the left.
+- TAPAS is based on BERT, so ``TAPAS-base`` for example corresponds to a ``BERT-base`` architecture. ``TAPAS-large``
+  yields the best performance (the results reported in the paper are from ``TAPAS-large``). Metrics for the various
+  model sizes are shown in the `original Github repository `_.
+- TAPAS has checkpoints fine-tuned on SQA, which are capable of answering questions related to a table in a
+  conversational set-up. This means that you can ask follow-up questions such as "what is his age?" related to the
+  previous question. Note that the forward pass of TAPAS is a bit different in case of a conversational set-up: in that
+  case, you have to feed every training example one by one to the model, such that the `prev_label_ids` token type ids
+  can be overwritten by the model's predicted `label_ids` for the previous question. See the "Usage" section for more info.
+- TAPAS is similar to BERT and therefore relies on the masked language modeling (MLM) objective. It is therefore
+  efficient at predicting masked tokens and at NLU in general, but is not optimal for text generation. Models trained
+  with a causal language modeling (CLM) objective are better in that regard.
+
+
+Usage: fine-tuning
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Here we explain how you can fine-tune :class:`~transformers.TapasForQuestionAnswering` on your own dataset.
+
+===========================================================================
+STEP 1: Choose one of the 3 ways in which you can use TAPAS - or experiment
+===========================================================================
+
+There are 3 different ways in which one can fine-tune :class:`~transformers.TapasForQuestionAnswering`, corresponding to
+the different datasets on which TAPAS was fine-tuned:
+
+1. SQA: if you're interested in asking follow-up questions related to a table, in a conversational set-up. For example, if you
+   first ask "what's the name of the first actor?", you can then ask a follow-up question such as "how old is he?". Here, questions
+   do not involve any aggregation (all questions are cell selection questions).
+2. WTQ/WikiSQL: if you're not interested in asking questions in a conversational set-up, but rather just in asking questions related
+   to a table, which might involve aggregation, such as counting the number of rows, summing up cell values or averaging cell values.
+   You can then for example ask "what's the total number of goals Cristiano Ronaldo scored in his career?". This case is also called **weak
+   supervision**, since the model itself must learn the appropriate aggregation operator (SUM/COUNT/AVERAGE/NONE) given only the answer
+   to the question as supervision.
+3. WikiSQL-supervised: this dataset is the same as WikiSQL, but here the model is given the ground truth aggregation
+   operator during training. This is also called **strong supervision**. Here, learning the appropriate aggregation operator is much easier.
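+
+Each of these tasks also has a corresponding already fine-tuned checkpoint. As a minimal sketch (the checkpoint names
+below are the ones listed in ``TAPAS_PRETRAINED_CONFIG_ARCHIVE_MAP`` of this PR and may still change before merge):
+
+.. code-block::
+
+    >>> from transformers import TapasForQuestionAnswering
+
+    >>> # conversational (SQA), weak supervision (WTQ) and strong supervision (WikiSQL-supervised), respectively
+    >>> model = TapasForQuestionAnswering.from_pretrained('nielsr/tapas-base-finetuned-sqa')
+    >>> model = TapasForQuestionAnswering.from_pretrained('nielsr/tapas-base-finetuned-wtq')
+    >>> model = TapasForQuestionAnswering.from_pretrained('nielsr/tapas-base-finetuned-wikisql-supervised')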
+
+To summarize:
+
++------------------------------------+----------------------+-------------------------------------------------------------------------------------------------------------------+
+| **Task**                           | **Example datasets** | **Description**                                                                                                     |
++------------------------------------+----------------------+-------------------------------------------------------------------------------------------------------------------+
+| Conversational                     | SQA                  | Conversational, only cell selection questions                                                                       |
++------------------------------------+----------------------+-------------------------------------------------------------------------------------------------------------------+
+| Weak supervision for aggregation   | WTQ, WikiSQL         | Questions might involve aggregation, and the model must learn this given only the answer as supervision            |
++------------------------------------+----------------------+-------------------------------------------------------------------------------------------------------------------+
+| Strong supervision for aggregation | WikiSQL-supervised   | Questions might involve aggregation, and the model must learn this given the gold aggregation operator             |
++------------------------------------+----------------------+-------------------------------------------------------------------------------------------------------------------+
+
+Initializing a model with a pre-trained base and randomly initialized classification heads from the model hub is as easy as:
+
+.. code-block::
+
+    >>> from transformers import TapasForQuestionAnswering
+
+    >>> # for example, the base sized model
+    >>> model = TapasForQuestionAnswering.from_pretrained('google/tapas-base-uncased')
+
+
+Of course, you don't necessarily have to follow one of these three ways in which TAPAS was fine-tuned. You can also experiment by defining any hyperparameters
+you want when initializing :class:`~transformers.TapasConfig`, and then creating a :class:`~transformers.TapasForQuestionAnswering` based on that
+configuration. For example, if you have a dataset that has both conversational questions and questions that might involve aggregation, you can do it
+this way. Here's an example:
+
+.. code-block::
+
+    >>> from transformers import TapasConfig, TapasForQuestionAnswering
+
+    >>> # you can initialize the classification heads any way you want (see docs of TapasConfig)
+    >>> config = TapasConfig(num_aggregation_labels=3, average_logits_per_cell=True, select_one_column=False)
+    >>> # initializing the pre-trained base sized model with our custom classification heads
+    >>> model = TapasForQuestionAnswering.from_pretrained('google/tapas-base-uncased', config=config)
+
+You can also start from an already fine-tuned checkpoint. Note that the checkpoint fine-tuned on WTQ has some issues
+due to the L2 loss, which is somewhat brittle. See `here `__ for more info.
+
+For a list of all pre-trained and fine-tuned TAPAS checkpoints available in the HuggingFace model hub, see `here `__.
+
+===========================================
+STEP 2: Prepare your data in the SQA format
+===========================================
+
+Second, no matter what you picked above, you should prepare your dataset in the `SQA format `__.
+This format is a TSV/CSV file with the following columns:
+
+- ``id``: optional, id of the table-question pair, for bookkeeping purposes.
+- ``annotator``: optional, id of the person who annotated the table-question pair, for bookkeeping purposes.
+- ``position``: integer indicating if the question is the first, second, third,... related to the table. Only required in case of conversational setup (SQA).
+  You don't need this column in case you're going for WTQ/WikiSQL/WikiSQL-supervised.
+- ``question``: string
+- ``table_file``: string, name of a csv file containing the tabular data
+- ``answer_coordinates``: list of one or more tuples (each tuple being a cell coordinate, i.e. a row, column pair that is part of the answer)
+- ``answer_text``: list of one or more strings (each string being a cell value that is part of the answer)
+- ``aggregation_label``: index of the aggregation operator. Only required in case of strong supervision for aggregation (the WikiSQL-supervised case)
+- ``float_answer``: the float answer to the question, if there is one (np.nan if there isn't). Only required in case of weak supervision for aggregation (such as WTQ and WikiSQL)
+
+The tables themselves should be present in a folder, each table being a separate csv file. Note that the authors of the TAPAS algorithm used conversion
+scripts with some automated logic to convert the other datasets (WTQ and WikiSQL) into the SQA format. The author explains this `here `__.
+Interestingly, these conversion scripts are not perfect (the ``answer_coordinates`` and ``float_answer`` fields are populated based on the ``answer_text``),
+meaning that WTQ and WikiSQL results could actually be improved.
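+
+As an illustration, a minimal weak-supervision (WTQ-style) training file could be written with pandas as follows. This
+is only a sketch: the file names and values are made up, and ``to_csv`` stores the coordinate lists as their string
+representation:
+
+.. code-block::
+
+    >>> import pandas as pd
+
+    >>> # a single training example whose answer is the sum of three cells
+    >>> data = {'id': ['example-0'], 'annotator': [0],
+    ...         'question': ["What is the total number of movies?"],
+    ...         'table_file': ['table_0.csv'],
+    ...         'answer_coordinates': [[(0, 1), (1, 1), (2, 1)]],
+    ...         'answer_text': [["209"]],
+    ...         'float_answer': [209.0]}
+    >>> pd.DataFrame(data).to_csv('train.tsv', sep='\t', index=False)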
+==========================================================================================
+STEP 3: Convert your data into PyTorch tensors using :class:`~transformers.TapasTokenizer`
+==========================================================================================
+
+Third, given that you've prepared your data in this TSV/CSV format (and corresponding CSV files containing the tabular data), you can then
+use :class:`~transformers.TapasTokenizer` to convert table-question pairs into :obj:`input_ids`, :obj:`attention_mask`, :obj:`token_type_ids`
+and so on. Again, based on which of the three cases you picked above, :class:`~transformers.TapasForQuestionAnswering` requires different inputs
+to be fine-tuned:
+
++------------------------------------+----------------------------------------------------------------------------------------------+
+| **Task**                           | **Required inputs**                                                                          |
++------------------------------------+----------------------------------------------------------------------------------------------+
+| Conversational                     | ``input_ids``, ``attention_mask``, ``token_type_ids``, ``label_ids``                         |
++------------------------------------+----------------------------------------------------------------------------------------------+
+| Weak supervision for aggregation   | ``input_ids``, ``attention_mask``, ``token_type_ids``, ``label_ids``, ``numeric_values``,    |
+|                                    | ``numeric_values_scale``, ``float_answer``                                                   |
++------------------------------------+----------------------------------------------------------------------------------------------+
+| Strong supervision for aggregation | ``input_ids``, ``attention_mask``, ``token_type_ids``, ``label_ids``, ``aggregation_labels`` |
++------------------------------------+----------------------------------------------------------------------------------------------+
+
+:class:`~transformers.TapasTokenizer` creates the ``label_ids``, ``numeric_values`` and ``numeric_values_scale`` based on the
+``answer_coordinates`` and ``answer_text`` columns of the TSV file. The ``float_answer`` and ``aggregation_labels`` are already in the TSV file of step 2.
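+
+For the weak supervision case, ``float_answer`` can often be derived from ``answer_text`` when the answer is a single
+number. A minimal sketch of such a helper (this function is hypothetical, not part of the library):
+
+.. code-block::
+
+    >>> import numpy as np
+
+    >>> def to_float_answer(answer_text):
+    ...     # return the answer as a float if it is a single numeric cell value, else np.nan
+    ...     if len(answer_text) == 1:
+    ...         try:
+    ...             return float(answer_text[0])
+    ...         except ValueError:
+    ...             pass
+    ...     return np.nan
+
+    >>> to_float_answer(["209"])
+    209.0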
+Here's an example of encoding a table-question pair:
+
+.. code-block::
+
+    >>> from transformers import TapasTokenizer
+    >>> import pandas as pd
+
+    >>> model_name = 'google/tapas-base-uncased'
+    >>> tokenizer = TapasTokenizer.from_pretrained(model_name)
+
+    >>> data = {'Actors': ["Brad Pitt", "Leonardo Di Caprio", "George Clooney"], 'Number of movies': ["87", "53", "69"]}
+    >>> queries = ["What is the name of the first actor?", "How many movies has George Clooney played in?", "What is the total number of movies?"]
+    >>> answer_coordinates = [[(0, 0)], [(2, 1)], [(0, 1), (1, 1), (2, 1)]]
+    >>> answer_text = [["Brad Pitt"], ["69"], ["209"]]
+    >>> table = pd.DataFrame(data)
+    >>> inputs = tokenizer(table=table, queries=queries, answer_coordinates=answer_coordinates, answer_text=answer_text, padding='max_length', return_tensors='pt')
+    >>> inputs
+    {'input_ids': tensor([[ ... ]]), 'attention_mask': tensor([[...]]), 'token_type_ids': tensor([[[...]]]),
+    'numeric_values': tensor([[ ... ]]), 'numeric_values_scale': tensor([[ ... ]]), 'label_ids': tensor([[ ... ]])}
+
+Note that :class:`~transformers.TapasTokenizer` expects the data of the table to be text-only. You can use ``.astype(str)`` on a dataframe to turn it into
+text-only data. Of course, this only shows how to encode a single training example. It is advised to create a PyTorch dataset and a corresponding dataloader:
+
+.. code-block::
+
+    >>> import torch
+    >>> import pandas as pd
+
+    >>> tsv_path = "your_path_to_the_tsv_file"
+    >>> table_csv_path = "your_path_to_a_directory_containing_all_csv_files"
+
+    >>> class TableDataset(torch.utils.data.Dataset):
+    ...     def __init__(self, data, tokenizer):
+    ...         self.data = data
+    ...         self.tokenizer = tokenizer
+    ...
+    ...     def __getitem__(self, idx):
+    ...         item = self.data.iloc[idx]
+    ...         table = pd.read_csv(table_csv_path + item.table_file).astype(str)
+    ...         encoding = self.tokenizer(table=table,
+    ...                                   queries=item.question,
+    ...                                   answer_coordinates=item.answer_coordinates,
+    ...                                   answer_text=item.answer_text,
+    ...                                   padding="max_length",
+    ...                                   return_tensors="pt"
+    ...         )
+    ...         # we add the float_answer which is also required (weak supervision for aggregation)
+    ...         encoding["float_answer"] = torch.tensor(item.float_answer)
+    ...         return encoding
+    ...
+    ...     def __len__(self):
+    ...         return len(self.data)
+
+    >>> data = pd.read_csv(tsv_path, sep='\t')
+    >>> train_dataset = TableDataset(data, tokenizer)
+    >>> train_dataloader = torch.utils.data.DataLoader(train_dataset, batch_size=32)
+
+Note that here, we encode each table-question pair independently. This is fine as long as your dataset is **not conversational**. In case your
+dataset involves conversational questions (such as in SQA), then you should first group together the ``queries``, ``answer_coordinates`` and
+``answer_text`` per table (in the order of their ``position`` index) and batch encode each table with its questions. This will make sure that
+the ``prev_label_ids`` token types (see docs of :class:`~transformers.TapasTokenizer`) are set correctly, as shown in the sketch below.
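+
+A minimal sketch of such grouping, assuming ``data``, ``tokenizer`` and ``table_csv_path`` are defined as above (note
+that ``answer_coordinates`` and ``answer_text`` read back from a TSV are strings, so in practice they must first be
+parsed into Python lists again, e.g. with ``ast.literal_eval``):
+
+.. code-block::
+
+    >>> import pandas as pd
+
+    >>> # batch encode each table together with all of its questions, ordered by their ``position``
+    >>> for table_file, group in data.groupby('table_file'):
+    ...     group = group.sort_values('position')
+    ...     table = pd.read_csv(table_csv_path + table_file).astype(str)
+    ...     encoding = tokenizer(table=table,
+    ...                          queries=group.question.tolist(),
+    ...                          answer_coordinates=group.answer_coordinates.tolist(),
+    ...                          answer_text=group.answer_text.tolist(),
+    ...                          padding="max_length",
+    ...                          return_tensors="pt")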
+===================================================
+STEP 4: Train (fine-tune) TapasForQuestionAnswering
+===================================================
+
+You can then fine-tune :class:`~transformers.TapasForQuestionAnswering` using native PyTorch as follows:
+
+.. code-block::
+
+    >>> from transformers import TapasForQuestionAnswering, AdamW
+
+    >>> model = TapasForQuestionAnswering.from_pretrained("google/tapas-base-uncased")
+    >>> optimizer = AdamW(model.parameters(), lr=5e-5)
+
+    >>> for epoch in range(2):  # loop over the dataset multiple times
+    ...    for idx, batch in enumerate(train_dataloader):
+    ...        # get the inputs
+    ...        input_ids, attention_mask, token_type_ids, label_ids, numeric_values, numeric_values_scale, float_answer = batch
+
+    ...        # zero the parameter gradients
+    ...        optimizer.zero_grad()
+
+    ...        # forward + backward + optimize
+    ...        outputs = model(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids,
+    ...                        label_ids=label_ids, numeric_values=numeric_values, numeric_values_scale=numeric_values_scale,
+    ...                        float_answer=float_answer)
+    ...        loss = outputs.loss
+    ...        loss.backward()
+    ...        optimizer.step()
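+
+After fine-tuning, you would typically save the model so it can be reloaded for inference; a minimal sketch (the
+directory name is just an example):
+
+.. code-block::
+
+    >>> model.save_pretrained("path_to_fine_tuned_model")
+    >>> # later, reload the fine-tuned model
+    >>> model = TapasForQuestionAnswering.from_pretrained("path_to_fine_tuned_model")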
print("Predicted answer: " + predicted_agg + " > " + answer) + When was Brad Pitt born? + Predicted answer: 18 december 1963 + Which actor appeared in the least number of movies? + Predicted answer: Leonardo Di Caprio + What is the average number of movies? + Predicted answer: AVERAGE > 87, 53, 69 + +In case of a conversational set-up, then each table-question pair must be provided **sequentially** to the model, such that +the ``prev_label_ids`` token types can be overwritten by the predicted ``label_ids`` of the previous table-question pair. + + +Tapas specific outputs +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. autoclass:: transformers.modeling_tapas.TableQuestionAnsweringOutput + :members: + + +TapasConfig +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. autoclass:: transformers.TapasConfig + :members: + + +TapasTokenizer +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. autoclass:: transformers.TapasTokenizer + :members: __call__, convert_logits_to_predictions, save_vocabulary + + +TapasModel +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. autoclass:: transformers.TapasModel + :members: + + +TapasForMaskedLM +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. autoclass:: transformers.TapasForMaskedLM + :members: + + +TapasForSequenceClassification +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. autoclass:: transformers.TapasForSequenceClassification + :members: forward + + +TapasForQuestionAnswering +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. 
autoclass:: transformers.TapasForQuestionAnswering + :members: \ No newline at end of file diff --git a/src/transformers/__init__.py b/src/transformers/__init__.py index ee5da4399984..df1c3ee1f2bb 100755 --- a/src/transformers/__init__.py +++ b/src/transformers/__init__.py @@ -61,6 +61,7 @@ from .configuration_roberta import ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP, RobertaConfig from .configuration_squeezebert import SQUEEZEBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, SqueezeBertConfig from .configuration_t5 import T5_PRETRAINED_CONFIG_ARCHIVE_MAP, T5Config +from .configuration_tapas import TAPAS_PRETRAINED_CONFIG_ARCHIVE_MAP, TapasConfig from .configuration_transfo_xl import TRANSFO_XL_PRETRAINED_CONFIG_ARCHIVE_MAP, TransfoXLConfig from .configuration_utils import PretrainedConfig from .configuration_xlm import XLM_PRETRAINED_CONFIG_ARCHIVE_MAP, XLMConfig @@ -190,6 +191,7 @@ from .tokenization_retribert import RetriBertTokenizer from .tokenization_roberta import RobertaTokenizer from .tokenization_squeezebert import SqueezeBertTokenizer +from .tokenization_tapas import TapasTokenizer, TapasTruncationStrategy from .tokenization_transfo_xl import TransfoXLCorpus, TransfoXLTokenizer from .tokenization_utils import PreTrainedTokenizer from .tokenization_utils_base import ( @@ -558,6 +560,14 @@ T5PreTrainedModel, load_tf_weights_in_t5, ) + from .modeling_tapas import ( + TAPAS_PRETRAINED_MODEL_ARCHIVE_LIST, + TapasForMaskedLM, + TapasForQuestionAnswering, + TapasForSequenceClassification, + TapasModel, + load_tf_weights_in_tapas, + ) from .modeling_transfo_xl import ( TRANSFO_XL_PRETRAINED_MODEL_ARCHIVE_LIST, AdaptiveEmbedding, diff --git a/src/transformers/commands/convert.py b/src/transformers/commands/convert.py index 1e054b6a30eb..03ac380cdaa2 100644 --- a/src/transformers/commands/convert.py +++ b/src/transformers/commands/convert.py @@ -130,6 +130,13 @@ def run(self): raise ImportError(IMPORT_ERROR_MESSAGE) convert_gpt2_checkpoint_to_pytorch(self._tf_checkpoint, self._config, self._pytorch_dump_output) + elif self._model_type == "tapas": + try: + from transformers.convert_tapas_original_tf_checkpoint_to_pytorch import ( + convert_tf_checkpoint_to_pytorch, + ) + except ImportError: + raise ImportError(IMPORT_ERROR_MESSAGE) elif self._model_type == "xlnet": try: from transformers.convert_xlnet_original_tf_checkpoint_to_pytorch import ( diff --git a/src/transformers/configuration_auto.py b/src/transformers/configuration_auto.py index 3e411ac37ec7..b6b8e6ca7b56 100644 --- a/src/transformers/configuration_auto.py +++ b/src/transformers/configuration_auto.py @@ -48,6 +48,7 @@ from .configuration_roberta import ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP, RobertaConfig from .configuration_squeezebert import SQUEEZEBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, SqueezeBertConfig from .configuration_t5 import T5_PRETRAINED_CONFIG_ARCHIVE_MAP, T5Config +from .configuration_tapas import TAPAS_PRETRAINED_CONFIG_ARCHIVE_MAP, TapasConfig from .configuration_transfo_xl import TRANSFO_XL_PRETRAINED_CONFIG_ARCHIVE_MAP, TransfoXLConfig from .configuration_utils import PretrainedConfig from .configuration_xlm import XLM_PRETRAINED_CONFIG_ARCHIVE_MAP, XLMConfig @@ -88,6 +89,7 @@ SQUEEZEBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, XLM_PROPHETNET_PRETRAINED_CONFIG_ARCHIVE_MAP, PROPHETNET_PRETRAINED_CONFIG_ARCHIVE_MAP, + TAPAS_PRETRAINED_CONFIG_ARCHIVE_MAP, ] for key, value, in pretrained_map.items() ) @@ -131,6 +133,7 @@ ("dpr", DPRConfig), ("layoutlm", LayoutLMConfig), ("rag", RagConfig), + ("tapas", TapasConfig), ] ) @@ -172,6 +175,7 @@ 
("rag", "RAG"), ("xlm-prophetnet", "XLMProphetNet"), ("prophetnet", "ProphetNet"), + ("tapas", "TAPAS"), ] ) diff --git a/src/transformers/configuration_tapas.py b/src/transformers/configuration_tapas.py new file mode 100644 index 000000000000..844e433ac02d --- /dev/null +++ b/src/transformers/configuration_tapas.py @@ -0,0 +1,209 @@ +# coding=utf-8 +# Copyright 2020 Google Research and The HuggingFace Inc. team. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" TAPAS configuration. Adds additional hyperparameters to the configuration of BERT.""" + + +from .configuration_utils import PretrainedConfig + + +TAPAS_PRETRAINED_CONFIG_ARCHIVE_MAP = {"nielsr/tapas-base-finetuned-sqa": "https://huggingface.co/nielsr/tapas-base-finetuned-sqa/resolve/main/config.json", + "nielsr/tapas-base-finetuned-wtq": "https://huggingface.co/nielsr/tapas-base-finetuned-wtq/resolve/main/config.json", + "nielsr/tapas-base-finetuned-wikisql-supervised": "https://huggingface.co/nielsr/tapas-base-finetuned-wikisql-supervised/resolve/main/config.json", + "nielsr/tapas-base-finetuned-tabfact": "https://huggingface.co/nielsr/tapas-base-finetuned-tabfact/resolve/main/config.json"} + + +class TapasConfig(PretrainedConfig): + r""" + This is the configuration class to store the configuration of a :class:`~transformers.TapasModel`. It is used to + instantiate a TAPAS model according to the specified arguments, defining the model architecture. Instantiating a + configuration with the defaults will yield a similar configuration to that of the TAPAS `tapas-base-finetuned-sqa` + architecture. Configuration objects inherit from :class:`~transformers.PreTrainedConfig` and can be used to control + the model outputs. Read the documentation from :class:`~transformers.PretrainedConfig` for more information. + + Hyperparameters additional to BERT are taken from run_task_main.py and hparam_utils.py of the original + implementation. Original implementation available at https://github.com/google-research/tapas/tree/master. + + Args: + vocab_size (:obj:`int`, `optional`, defaults to 30522): + Vocabulary size of the TAPAS model. Defines the number of different tokens that can be represented by the + :obj:`inputs_ids` passed when calling :class:`~transformers.TapasModel`. + hidden_size (:obj:`int`, `optional`, defaults to 768): + Dimensionality of the encoder layers and the pooler layer. + num_hidden_layers (:obj:`int`, `optional`, defaults to 12): + Number of hidden layers in the Transformer encoder. + num_attention_heads (:obj:`int`, `optional`, defaults to 12): + Number of attention heads for each attention layer in the Transformer encoder. + intermediate_size (:obj:`int`, `optional`, defaults to 3072): + Dimensionality of the "intermediate" (often named feed-forward) layer in the Transformer encoder. + hidden_act (:obj:`str` or :obj:`Callable`, `optional`, defaults to :obj:`"gelu"`): + The non-linear activation function (function or string) in the encoder and pooler. 
If string, + :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported. + hidden_dropout_prob (:obj:`float`, `optional`, defaults to 0.1): + The dropout probability for all fully connected layers in the embeddings, encoder, and pooler. + attention_probs_dropout_prob (:obj:`float`, `optional`, defaults to 0.1): + The dropout ratio for the attention probabilities. + max_position_embeddings (:obj:`int`, `optional`, defaults to 1024): + The maximum sequence length that this model might ever be used with. Typically set this to something large + just in case (e.g., 512 or 1024 or 2048). + type_vocab_sizes (:obj:`List[int]`, `optional`, defaults to [3, 256, 256, 2, 256, 256, 10]): + The vocabulary sizes of the :obj:`token_type_ids` passed when calling :class:`~transformers.TapasModel`. + initializer_range (:obj:`float`, `optional`, defaults to 0.02): + The standard deviation of the truncated_normal_initializer for initializing all weight matrices. + layer_norm_eps (:obj:`float`, `optional`, defaults to 1e-12): + The epsilon used by the layer normalization layers. + gradient_checkpointing (:obj:`bool`, `optional`, defaults to :obj:`False`): + Whether to use gradient checkpointing to save memory at the expense of a slower backward pass. + positive_label_weight (:obj:`float`, `optional`, defaults to 10.0): + Weight for positive labels. + num_aggregation_labels (:obj:`int`, `optional`, defaults to 0): + The number of aggregation operators to predict. + aggregation_loss_weight (:obj:`float`, `optional`, defaults to 1.0): + Importance weight for the aggregation loss. + use_answer_as_supervision (:obj:`bool`, `optional`): + Whether to use the answer as the only supervision for aggregation examples. + answer_loss_importance (:obj:`float`, `optional`, defaults to 1.0): + Importance weight for the regression loss. + use_normalized_answer_loss (:obj:`bool`, `optional`, defaults to :obj:`False`): + Whether to normalize the answer loss by the maximum of the predicted and expected value. + huber_loss_delta: (:obj:`float`, `optional`): + Delta parameter used to calculate the regression loss. + temperature: (:obj:`float`, `optional`, defaults to 1.0): + Value used to control (OR change) the skewness of cell logits probabilities. + aggregation_temperature: (:obj:`float`, `optional`, defaults to 1.0): + Scales aggregation logits to control the skewness of probabilities. + use_gumbel_for_cells: (:obj:`bool`, `optional`, defaults to :obj:`False`): + Whether to apply Gumbel-Softmax to cell selection. + use_gumbel_for_aggregation: (:obj:`bool`, `optional`, defaults to :obj:`False`): + Whether to apply Gumbel-Softmax to aggregation selection. + average_approximation_function: (:obj:`string`, `optional`, defaults to :obj:`"ratio"`): + Method to calculate the expected average of cells in the weak supervision case. One of :obj:`"ratio"`, + :obj:`"first_order"` or :obj:`"second_order"`. + cell_selection_preference: (:obj:`float`, `optional`): + Preference for cell selection in ambiguous cases. Only applicable in case of weak supervision for + aggregation (WTQ, WikiSQL). If the total mass of the aggregation probabilities (excluding the "NONE" + operator) is higher than this hyperparameter, then aggregation is predicted for an example. + answer_loss_cutoff: (:obj:`float`, `optional`): + Ignore examples with answer loss larger than cutoff. + max_num_rows: (:obj:`int`, `optional`, defaults to 64): + Maximum number of rows. 
+ max_num_columns: (:obj:`int`, `optional`, defaults to 32): + Maximum number of columns. + average_logits_per_cell: (:obj:`bool`, `optional`, defaults to :obj:`False`): + Whether to average logits per cell. + select_one_column: (:obj:`bool`, `optional`, defaults to :obj:`True`): + Whether to constrain the model to only select cells from a single column. + allow_empty_column_selection: (:obj:`bool`, `optional`, defaults to :obj:`False`): + Whether to allow not to select any column. + init_cell_selection_weights_to_zero: (:obj:`bool`, `optional`, defaults to :obj:`False`): + Whether to initialize cell selection weights to 0 so that the initial probabilities are 50%. + reset_position_index_per_cell: (:obj:`bool`, `optional`, defaults to :obj:`True`): + Whether to restart position indexes at every cell (i.e. use relative position embeddings). + disable_per_token_loss: (:obj:`bool`, `optional`, defaults to :obj:`False`): + Whether to disable any (strong or weak) supervision on cells. + + Example:: + + >>> from transformers import TapasModel, TapasConfig + >>> # Initializing a Tapas configuration + >>> configuration = TapasConfig() + >>> # Initializing a model from the configuration + >>> model = TapasModel(configuration) + >>> # Accessing the model configuration + >>> configuration = model.config + """ + + model_type = "tapas" + + def __init__( + self, + vocab_size=30522, + hidden_size=768, + num_hidden_layers=12, + num_attention_heads=12, + intermediate_size=3072, + hidden_act="gelu", + hidden_dropout_prob=0.1, + attention_probs_dropout_prob=0.1, + max_position_embeddings=1024, + type_vocab_sizes=[3, 256, 256, 2, 256, 256, 10], + initializer_range=0.02, + layer_norm_eps=1e-12, + pad_token_id=0, + gradient_checkpointing=False, + positive_label_weight=10.0, + num_aggregation_labels=0, + aggregation_loss_weight=1.0, + use_answer_as_supervision=None, + answer_loss_importance=1.0, + use_normalized_answer_loss=False, + huber_loss_delta=None, + temperature=1.0, + aggregation_temperature=1.0, + use_gumbel_for_cells=False, + use_gumbel_for_aggregation=False, + average_approximation_function="ratio", + cell_selection_preference=None, + answer_loss_cutoff=None, + max_num_rows=64, + max_num_columns=32, + average_logits_per_cell=False, + select_one_column=True, + allow_empty_column_selection=False, + init_cell_selection_weights_to_zero=False, + reset_position_index_per_cell=True, + disable_per_token_loss=False, + **kwargs + ): + + super().__init__(pad_token_id=pad_token_id, **kwargs) + + # BERT hyperparameters (with updated max_position_embeddings and type_vocab_sizes) + self.vocab_size = vocab_size + self.hidden_size = hidden_size + self.num_hidden_layers = num_hidden_layers + self.num_attention_heads = num_attention_heads + self.hidden_act = hidden_act + self.intermediate_size = intermediate_size + self.hidden_dropout_prob = hidden_dropout_prob + self.attention_probs_dropout_prob = attention_probs_dropout_prob + self.max_position_embeddings = max_position_embeddings + self.type_vocab_sizes = type_vocab_sizes + self.initializer_range = initializer_range + self.layer_norm_eps = layer_norm_eps + self.gradient_checkpointing = gradient_checkpointing + + # Fine-tuning task hyperparameters + self.positive_label_weight = positive_label_weight + self.num_aggregation_labels = num_aggregation_labels + self.aggregation_loss_weight = aggregation_loss_weight + self.use_answer_as_supervision = use_answer_as_supervision + self.answer_loss_importance = answer_loss_importance + self.use_normalized_answer_loss = 
use_normalized_answer_loss + self.huber_loss_delta = huber_loss_delta + self.temperature = temperature + self.aggregation_temperature = aggregation_temperature + self.use_gumbel_for_cells = use_gumbel_for_cells + self.use_gumbel_for_aggregation = use_gumbel_for_aggregation + self.average_approximation_function = average_approximation_function + self.cell_selection_preference = cell_selection_preference + self.answer_loss_cutoff = answer_loss_cutoff + self.max_num_rows = max_num_rows + self.max_num_columns = max_num_columns + self.average_logits_per_cell = average_logits_per_cell + self.select_one_column = select_one_column + self.allow_empty_column_selection = allow_empty_column_selection + self.init_cell_selection_weights_to_zero = init_cell_selection_weights_to_zero + self.reset_position_index_per_cell = reset_position_index_per_cell + self.disable_per_token_loss = disable_per_token_loss \ No newline at end of file diff --git a/src/transformers/configuration_utils.py b/src/transformers/configuration_utils.py index eb21fa203423..5dc320dfb32c 100755 --- a/src/transformers/configuration_utils.py +++ b/src/transformers/configuration_utils.py @@ -163,7 +163,7 @@ class PretrainedConfig(object): def __init__(self, **kwargs): # Attributes with defaults - self.return_dict = kwargs.pop("return_dict", False) + self.return_dict = kwargs.pop("return_dict", True) self.output_hidden_states = kwargs.pop("output_hidden_states", False) self.output_attentions = kwargs.pop("output_attentions", False) self.use_cache = kwargs.pop("use_cache", True) # Not used by all models diff --git a/src/transformers/convert_tapas_original_tf_checkpoint_to_pytorch.py b/src/transformers/convert_tapas_original_tf_checkpoint_to_pytorch.py new file mode 100644 index 000000000000..e78ec9e7ae88 --- /dev/null +++ b/src/transformers/convert_tapas_original_tf_checkpoint_to_pytorch.py @@ -0,0 +1,120 @@ +# coding=utf-8 +# Copyright 2018 The HuggingFace Inc. team. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Convert TAPAS checkpoint.""" + + +import argparse + +import torch + +from transformers import ( + TapasConfig, + TapasModel, + TapasForQuestionAnswering, + TapasForSequenceClassification, + load_tf_weights_in_tapas, +) +from transformers.utils import logging + + +logging.set_verbosity_info() + + +def convert_tf_checkpoint_to_pytorch(tf_checkpoint_path, tapas_config_file, pytorch_dump_path): + # Initialise PyTorch model. Defaults to TapasForQuestionAnswering with default SQA config. + # Uncomment another config and/or model to change this. If you want to convert a checkpoint + # that has absolute position embeddings, make sure to set reset_position_index_per_cell of + # TapasConfig to False. 
+
+    # WTQ config
+    # config = TapasConfig(
+    #     # run_task_main.py hparams
+    #     num_aggregation_labels = 4,
+    #     use_answer_as_supervision = True,
+    #     # hparam_utils.py hparams
+    #     answer_loss_cutoff = 0.664694,
+    #     cell_selection_preference = 0.207951,
+    #     huber_loss_delta = 0.121194,
+    #     init_cell_selection_weights_to_zero = True,
+    #     select_one_column = True,
+    #     allow_empty_column_selection = False,
+    #     temperature = 0.0352513,
+    # )
+
+    # WikiSQL config
+    # config = TapasConfig(
+    #     # run_task_main.py hparams
+    #     num_aggregation_labels = 4,
+    #     use_answer_as_supervision = True,
+    #     # hparam_utils.py hparams
+    #     answer_loss_cutoff = 0.185567,
+    #     cell_selection_preference = 0.611754,
+    #     huber_loss_delta = 1265.74,
+    #     init_cell_selection_weights_to_zero = False,
+    #     select_one_column = False,
+    #     allow_empty_column_selection = False,
+    #     temperature = 0.107515,
+    # )
+
+    # WikiSQL-supervised config
+    # config = TapasConfig(
+    #     # run_task_main.py hparams
+    #     num_aggregation_labels = 4,
+    #     use_answer_as_supervision = False,
+    #     # hparam_utils.py hparams
+    #     answer_loss_cutoff = 36.4519,
+    #     cell_selection_preference = 0.903421,
+    #     huber_loss_delta = 222.088,
+    #     init_cell_selection_weights_to_zero = True,
+    #     select_one_column = True,
+    #     allow_empty_column_selection = True,
+    #     temperature = 0.763141,
+    # )
+
+    # SQA config
+    config = TapasConfig()
+
+    print("Building PyTorch model from configuration: {}".format(str(config)))
+    model = TapasForQuestionAnswering(config)
+    # model = TapasModel(config)
+    # model = TapasForSequenceClassification(config)
+
+    # Load weights from tf checkpoint
+    load_tf_weights_in_tapas(model, config, tf_checkpoint_path)
+
+    # Save pytorch-model
+    print("Save PyTorch model to {}".format(pytorch_dump_path))
+    torch.save(model.state_dict(), pytorch_dump_path)
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+    # Required parameters
+    parser.add_argument(
+        "--tf_checkpoint_path", default=None, type=str, required=True, help="Path to the TensorFlow checkpoint path."
+    )
+    parser.add_argument(
+        "--tapas_config_file",
+        default=None,
+        type=str,
+        required=True,
+        help="The config json file corresponding to the pre-trained TAPAS model. \n"
+        "This specifies the model architecture.",
+    )
+    parser.add_argument(
+        "--pytorch_dump_path", default=None, type=str, required=True, help="Path to the output PyTorch model."
+ ) + args = parser.parse_args() + convert_tf_checkpoint_to_pytorch(args.tf_checkpoint_path, args.tapas_config_file, args.pytorch_dump_path) \ No newline at end of file diff --git a/src/transformers/file_utils.py b/src/transformers/file_utils.py index d9f2ec0db686..20834e4550a3 100644 --- a/src/transformers/file_utils.py +++ b/src/transformers/file_utils.py @@ -193,6 +193,20 @@ _tokenizers_available = False +try: + import torch_scatter + + # Check we're not importing a "torch_scatter" directory somewhere + _scatter_available = hasattr(torch_scatter, "__version__") and hasattr(torch_scatter, "scatter") + if _scatter_available: + logger.debug(f"Succesfully imported torch-scatter version {torch_scatter.__version__}") + else: + logger.debug("Imported a torch_scatter object but this doesn't seem to be the torch-scatter library.") + +except ImportError: + _scatter_available = False + + default_cache_path = os.path.join(torch_cache_home, "transformers") @@ -289,6 +303,14 @@ def wrapper(*args, **kwargs): # docstyle-ignore +def is_sklearn_available(): + return _has_sklearn + + +def is_scatter_available(): + return _scatter_available + + DATASETS_IMPORT_ERROR = """ {0} requires the 🤗 Datasets library but it was not found in your environment. You can install it with: ``` @@ -368,6 +390,12 @@ def wrapper(*args, **kwargs): installation page: https://github.com/google/flax and follow the ones that match your environment. """ +SCATTER_IMPORT_ERROR = """ +{0} requires the torch-scatter library but it was not found in your environment. You can install it with pip as +explained here: https://github.com/rusty1s/pytorch_scatter. + +""" + def requires_datasets(obj): name = obj.__name__ if hasattr(obj, "__name__") else obj.__class__.__name__ @@ -417,6 +445,12 @@ def requires_sentencepiece(obj): raise ImportError(SENTENCEPIECE_IMPORT_ERROR.format(name)) +def requires_scatter(obj): + name = obj.__name__ if hasattr(obj, "__name__") else obj.__class__.__name__ + if not is_scatter_available(): + raise ImportError(SCATTER_IMPORT_ERROR.format(name)) + + def add_start_docstrings(*docstr): def docstring_decorator(fn): fn.__doc__ = "".join(docstr) + (fn.__doc__ if fn.__doc__ is not None else "") diff --git a/src/transformers/modeling_auto.py b/src/transformers/modeling_auto.py index 3ec971325075..4f0ff52550d5 100644 --- a/src/transformers/modeling_auto.py +++ b/src/transformers/modeling_auto.py @@ -49,6 +49,7 @@ RobertaConfig, SqueezeBertConfig, T5Config, + TapasConfig, TransfoXLConfig, XLMConfig, XLMProphetNetConfig, @@ -188,6 +189,7 @@ SqueezeBertModel, ) from .modeling_t5 import T5ForConditionalGeneration, T5Model +from .modeling_tapas import TapasForMaskedLM, TapasForQuestionAnswering, TapasForSequenceClassification, TapasModel from .modeling_transfo_xl import TransfoXLLMHeadModel, TransfoXLModel from .modeling_xlm import ( XLMForMultipleChoice, @@ -229,6 +231,7 @@ [ (RetriBertConfig, RetriBertModel), (T5Config, T5Model), + (TapasConfig, TapasModel), (DistilBertConfig, DistilBertModel), (AlbertConfig, AlbertModel), (CamembertConfig, CamembertModel), @@ -265,6 +268,7 @@ (LayoutLMConfig, LayoutLMForMaskedLM), (RetriBertConfig, RetriBertModel), (T5Config, T5ForConditionalGeneration), + (TapasConfig, TapasForMaskedLM), (DistilBertConfig, DistilBertForMaskedLM), (AlbertConfig, AlbertForPreTraining), (CamembertConfig, CamembertForMaskedLM), @@ -292,6 +296,7 @@ [ (LayoutLMConfig, LayoutLMForMaskedLM), (T5Config, T5ForConditionalGeneration), + (TapasConfig, TapasForMaskedLM), (DistilBertConfig, DistilBertForMaskedLM), 
(AlbertConfig, AlbertForMaskedLM), (CamembertConfig, CamembertForMaskedLM), @@ -351,6 +356,7 @@ (LongformerConfig, LongformerForMaskedLM), (RobertaConfig, RobertaForMaskedLM), (SqueezeBertConfig, SqueezeBertForMaskedLM), + (TapasConfig, TapasForMaskedLM), (BertConfig, BertForMaskedLM), (MobileBertConfig, MobileBertForMaskedLM), (FlaubertConfig, FlaubertWithLMHeadModel), @@ -396,6 +402,7 @@ (DebertaConfig, DebertaForSequenceClassification), (GPT2Config, GPT2ForSequenceClassification), (OpenAIGPTConfig, OpenAIGPTForSequenceClassification), + (TapasConfig, TapasForSequenceClassification), ] ) @@ -410,6 +417,7 @@ (RobertaConfig, RobertaForQuestionAnswering), (SqueezeBertConfig, SqueezeBertForQuestionAnswering), (BertConfig, BertForQuestionAnswering), + (TapasConfig, TapasForQuestionAnswering), (XLNetConfig, XLNetForQuestionAnsweringSimple), (FlaubertConfig, FlaubertForQuestionAnsweringSimple), (MobileBertConfig, MobileBertForQuestionAnswering), diff --git a/src/transformers/modeling_tapas.py b/src/transformers/modeling_tapas.py new file mode 100644 index 000000000000..963dcb8bfa7f --- /dev/null +++ b/src/transformers/modeling_tapas.py @@ -0,0 +1,2306 @@ +# coding=utf-8 +# Copyright 2020 Google Research and The HuggingFace Inc. team. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""PyTorch TAPAS model. """ + + +import enum +import math +import os +from dataclasses import dataclass +from typing import Optional, Tuple + +import torch +import torch.nn as nn +from torch.nn import CrossEntropyLoss, MSELoss + +from .activations import ACT2FN +from .configuration_tapas import TapasConfig +from .file_utils import ( + ModelOutput, + add_start_docstrings, + add_start_docstrings_to_model_forward, + is_scatter_available, + replace_return_docstrings, + requires_scatter, +) +from .modeling_outputs import BaseModelOutput, BaseModelOutputWithPooling, MaskedLMOutput, SequenceClassifierOutput +from .modeling_utils import ( + PreTrainedModel, + apply_chunking_to_forward, + find_pruneable_heads_and_indices, + prune_linear_layer, +) +from .utils import logging + + +# soft dependency +if is_scatter_available(): + from torch_scatter import scatter + + +logger = logging.get_logger(__name__) + +_CONFIG_FOR_DOC = "TapasConfig" +_TOKENIZER_FOR_DOC = "TapasTokenizer" + +TAPAS_PRETRAINED_MODEL_ARCHIVE_LIST = [ + "nielsr/tapas-base-finetuned-sqa", + "nielsr/tapas-base-finetuned-wtq", + "nielsr/tapas-base-finetuned-wikisql-supervised", + # See all TAPAS models at https://huggingface.co/models?filter=tapas +] + +EPSILON_ZERO_DIVISION = 1e-10 +CLOSE_ENOUGH_TO_LOG_ZERO = -10000.0 + + +@dataclass +class TableQuestionAnsweringOutput(ModelOutput): + """ + Output type of :class:`~transformers.TapasForQuestionAnswering`. 
+
+    Args:
+        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`label_ids` (and possibly :obj:`answer`, :obj:`aggregation_labels`, :obj:`numeric_values` and :obj:`numeric_values_scale`) are provided):
+            Total loss as the sum of the hierarchical cell selection log-likelihood loss and (optionally) the
+            semi-supervised regression loss and (optionally) supervised loss for aggregations.
+        logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`):
+            Prediction scores of the cell selection head, for every token.
+        logits_aggregation (:obj:`torch.FloatTensor`, `optional`, of shape :obj:`(batch_size, num_aggregation_labels)`):
+            Prediction scores of the aggregation head, for every aggregation operator.
+        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_hidden_states=True`` is passed or when ``config.output_hidden_states=True``):
+            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)
+            of shape :obj:`(batch_size, sequence_length, hidden_size)`. Hidden-states of the model at the output of
+            each layer plus the initial embedding outputs.
+        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
+            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape :obj:`(batch_size, num_heads,
+            sequence_length, sequence_length)`. Attention weights after the attention softmax, used to compute the
+            weighted average in the self-attention heads.
+    """
+
+    loss: Optional[torch.FloatTensor] = None
+    logits: torch.FloatTensor = None
+    logits_aggregation: torch.FloatTensor = None
+    hidden_states: Optional[Tuple[torch.FloatTensor]] = None
+    attentions: Optional[Tuple[torch.FloatTensor]] = None
+
+
+def load_tf_weights_in_tapas(model, config, tf_checkpoint_path):
+    """
+    Load tf checkpoints in a PyTorch model. This is an adaptation of load_tf_weights_in_bert:
+
+    - add cell selection and aggregation heads
+    - take into account additional token type embedding layers
+    """
+    try:
+        import re
+
+        import numpy as np
+        import tensorflow as tf
+    except ImportError:
+        logger.error(
+            "Loading a TensorFlow model in PyTorch requires TensorFlow to be installed. Please see "
+            "https://www.tensorflow.org/install/ for installation instructions."
+ ) + raise + tf_path = os.path.abspath(tf_checkpoint_path) + logger.info("Converting TensorFlow checkpoint from {}".format(tf_path)) + # Load weights from TF model + init_vars = tf.train.list_variables(tf_path) + names = [] + arrays = [] + for name, shape in init_vars: + logger.info("Loading TF weight {} with shape {}".format(name, shape)) + array = tf.train.load_variable(tf_path, name) + names.append(name) + arrays.append(array) + + for name, array in zip(names, arrays): + name = name.split("/") + # adam_v and adam_m are variables used in AdamWeightDecayOptimizer to calculate m and v + # which are not required for using pretrained model + if any( + n + in [ + "adam_v", + "adam_m", + "AdamWeightDecayOptimizer", + "AdamWeightDecayOptimizer_1", + "global_step", + "seq_relationship", + ] + for n in name + ): + logger.info("Skipping {}".format("/".join(name))) + continue + # in case the model is TapasForSequenceClassification, we skip output_bias and output_weights + # since these are not used for classification + if isinstance(model, TapasForSequenceClassification): + if any( + n + in [ + "output_bias", + "output_weights", + ] + for n in name + ): + logger.info("Skipping {}".format("/".join(name))) + continue + # in case the model is TapasModel, we skip output_bias, output_weights, output_bias_cls and output_weights_cls + # since this model does not have MLM and NSP heads + if isinstance(model, TapasModel): + if any( + n + in [ + "output_bias", + "output_weights", + "output_bias_cls", + "output_weights_cls", + ] + for n in name + ): + logger.info("Skipping {}".format("/".join(name))) + continue + # if first scope name starts with "bert", change it to "tapas" + if name[0] == "bert": + name[0] = "tapas" + pointer = model + for m_name in name: + if re.fullmatch(r"[A-Za-z]+_\d+", m_name): + scope_names = re.split(r"_(\d+)", m_name) + else: + scope_names = [m_name] + if scope_names[0] == "kernel" or scope_names[0] == "gamma": + pointer = getattr(pointer, "weight") + elif scope_names[0] == "beta": + pointer = getattr(pointer, "bias") + # cell selection heads + elif scope_names[0] == "output_bias": + pointer = getattr(pointer, "output_bias") + elif scope_names[0] == "output_weights": + pointer = getattr(pointer, "output_weights") + elif scope_names[0] == "column_output_bias": + pointer = getattr(pointer, "column_output_bias") + elif scope_names[0] == "column_output_weights": + pointer = getattr(pointer, "column_output_weights") + # aggregation head + elif scope_names[0] == "output_bias_agg": + pointer = getattr(pointer, "aggregation_classifier") + pointer = getattr(pointer, "bias") + elif scope_names[0] == "output_weights_agg": + pointer = getattr(pointer, "aggregation_classifier") + pointer = getattr(pointer, "weight") + # classification head + elif scope_names[0] == "output_bias_cls": + pointer = getattr(pointer, "classifier") + pointer = getattr(pointer, "bias") + elif scope_names[0] == "output_weights_cls": + pointer = getattr(pointer, "classifier") + pointer = getattr(pointer, "weight") + else: + try: + pointer = getattr(pointer, scope_names[0]) + except AttributeError: + logger.info("Skipping {}".format("/".join(name))) + continue + if len(scope_names) >= 2: + num = int(scope_names[1]) + pointer = pointer[num] + if m_name[-11:] == "_embeddings": + pointer = getattr(pointer, "weight") + elif m_name[-13:] in [ + "_embeddings_0", + "_embeddings_1", + "_embeddings_2", + "_embeddings_3", + "_embeddings_4", + "_embeddings_5", + "_embeddings_6", + ]: + pointer = getattr(pointer, "weight") + elif 
m_name == "kernel": + array = np.transpose(array) + try: + assert ( + pointer.shape == array.shape + ), f"Pointer shape {pointer.shape} and array shape {array.shape} mismatched" + except AssertionError as e: + e.args += (pointer.shape, array.shape) + raise + logger.info("Initialize PyTorch weight {}".format(name)) + # added a check to see whether the array is a scalar (because bias terms in Tapas checkpoints can be scalar => should first be converted to numpy arrays) + if np.isscalar(array): + array = np.array(array) + pointer.data = torch.from_numpy(array) + return model + + +class TapasEmbeddings(nn.Module): + """ + Construct the embeddings from word, position and token_type embeddings. Same as BertEmbeddings but with a number of + additional token type embeddings to encode tabular structure. + """ + + def __init__(self, config): + super().__init__() + # we do not include config.disabled_features and config.disable_position_embeddings from the original implementation + # word embeddings + self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, padding_idx=config.pad_token_id) + # position embeddings + self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size) + # token type embeddings + token_type_embedding_name = "token_type_embeddings" + + for i, type_vocab_sizes in enumerate(config.type_vocab_sizes): + name = "%s_%d" % (token_type_embedding_name, i) + setattr(self, name, nn.Embedding(type_vocab_sizes, config.hidden_size)) + + self.number_of_token_type_embeddings = len(config.type_vocab_sizes) + + # self.LayerNorm is not snake-cased to stick with TensorFlow model variable name and be able to load + # any TensorFlow checkpoint file + self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps) + self.dropout = nn.Dropout(config.hidden_dropout_prob) + + self.config = config + + def forward(self, input_ids=None, token_type_ids=None, position_ids=None, inputs_embeds=None): + if input_ids is not None: + input_shape = input_ids.size() + else: + input_shape = inputs_embeds.size()[:-1] + + seq_length = input_shape[1] + device = input_ids.device if input_ids is not None else inputs_embeds.device + + if position_ids is None: + # create absolute position embeddings + position_ids = torch.arange(seq_length, dtype=torch.long, device=device) + position_ids = position_ids.unsqueeze(0).expand(input_shape) + # when self.config.reset_position_index_per_cell is set to True, create relative position embeddings + if self.config.reset_position_index_per_cell: + col_index = IndexMap( + token_type_ids[:, :, 1], self.config.type_vocab_sizes[1], batch_dims=1 + ) # shape (batch_size, seq_len) + row_index = IndexMap( + token_type_ids[:, :, 2], self.config.type_vocab_sizes[2], batch_dims=1 + ) # shape (batch_size, seq_len) + full_index = ProductIndexMap(col_index, row_index) # shape (batch_size, seq_len) + + first_position_per_segment = reduce_min(position_ids, full_index)[ + 0 + ] # shape (max_rows * max_columns,). First absolute position for every cell + first_position = gather( + first_position_per_segment, full_index + ) # ? shape (batch_size, seq_len). 
First absolute position of the cell for every token
+                position = torch.arange(seq_length, dtype=torch.long, device=device).unsqueeze(0)  # shape (1, seq_len)
+                position_ids = torch.min(
+                    torch.as_tensor(self.config.max_position_embeddings - 1, device=device), position - first_position
+                )
+
+        if token_type_ids is None:
+            token_type_ids = torch.zeros(
+                (*input_shape, self.number_of_token_type_embeddings), dtype=torch.long, device=device
+            )
+
+        if inputs_embeds is None:
+            inputs_embeds = self.word_embeddings(input_ids)
+
+        position_embeddings = self.position_embeddings(position_ids)
+
+        embeddings = inputs_embeds + position_embeddings
+
+        token_type_embedding_name = "token_type_embeddings"
+
+        for i in range(self.number_of_token_type_embeddings):
+            name = f"{token_type_embedding_name}_{i}"
+            embeddings += getattr(self, name)(token_type_ids[:, :, i])
+
+        embeddings = self.LayerNorm(embeddings)
+        embeddings = self.dropout(embeddings)
+        return embeddings
+
+
+# Copied from transformers.modeling_bert.BertSelfAttention with Bert->Tapas
+class TapasSelfAttention(nn.Module):
+    def __init__(self, config):
+        super().__init__()
+        if config.hidden_size % config.num_attention_heads != 0 and not hasattr(config, "embedding_size"):
+            raise ValueError(
+                "The hidden size (%d) is not a multiple of the number of attention "
+                "heads (%d)" % (config.hidden_size, config.num_attention_heads)
+            )
+
+        self.num_attention_heads = config.num_attention_heads
+        self.attention_head_size = int(config.hidden_size / config.num_attention_heads)
+        self.all_head_size = self.num_attention_heads * self.attention_head_size
+
+        self.query = nn.Linear(config.hidden_size, self.all_head_size)
+        self.key = nn.Linear(config.hidden_size, self.all_head_size)
+        self.value = nn.Linear(config.hidden_size, self.all_head_size)
+
+        self.dropout = nn.Dropout(config.attention_probs_dropout_prob)
+
+    def transpose_for_scores(self, x):
+        new_x_shape = x.size()[:-1] + (self.num_attention_heads, self.attention_head_size)
+        x = x.view(*new_x_shape)
+        return x.permute(0, 2, 1, 3)
+
+    def forward(
+        self,
+        hidden_states,
+        attention_mask=None,
+        head_mask=None,
+        encoder_hidden_states=None,
+        encoder_attention_mask=None,
+        output_attentions=False,
+    ):
+        mixed_query_layer = self.query(hidden_states)
+
+        # If this is instantiated as a cross-attention module, the keys
+        # and values come from an encoder; the attention mask needs to be
+        # such that the encoder's padding tokens are not attended to.
+        if encoder_hidden_states is not None:
+            mixed_key_layer = self.key(encoder_hidden_states)
+            mixed_value_layer = self.value(encoder_hidden_states)
+            attention_mask = encoder_attention_mask
+        else:
+            mixed_key_layer = self.key(hidden_states)
+            mixed_value_layer = self.value(hidden_states)
+
+        query_layer = self.transpose_for_scores(mixed_query_layer)
+        key_layer = self.transpose_for_scores(mixed_key_layer)
+        value_layer = self.transpose_for_scores(mixed_value_layer)
+
+        # Take the dot product between "query" and "key" to get the raw attention scores.
+        attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
+        attention_scores = attention_scores / math.sqrt(self.attention_head_size)
+        if attention_mask is not None:
+            # Apply the attention mask (precomputed for all layers in TapasModel's forward() function)
+            attention_scores = attention_scores + attention_mask
+
+        # Normalize the attention scores to probabilities.
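+        # (Where a mask was provided, disallowed positions received a large negative additive bias
+        # above, so their post-softmax attention weight is effectively zero.)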
+ attention_probs = nn.Softmax(dim=-1)(attention_scores) + + # This is actually dropping out entire tokens to attend to, which might + # seem a bit unusual, but is taken from the original Transformer paper. + attention_probs = self.dropout(attention_probs) + + # Mask heads if we want to + if head_mask is not None: + attention_probs = attention_probs * head_mask + + context_layer = torch.matmul(attention_probs, value_layer) + + context_layer = context_layer.permute(0, 2, 1, 3).contiguous() + new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,) + context_layer = context_layer.view(*new_context_layer_shape) + + outputs = (context_layer, attention_probs) if output_attentions else (context_layer,) + return outputs + + +# Copied from transformers.modeling_bert.BertSelfOutput +class TapasSelfOutput(nn.Module): + def __init__(self, config): + super().__init__() + self.dense = nn.Linear(config.hidden_size, config.hidden_size) + self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps) + self.dropout = nn.Dropout(config.hidden_dropout_prob) + + def forward(self, hidden_states, input_tensor): + hidden_states = self.dense(hidden_states) + hidden_states = self.dropout(hidden_states) + hidden_states = self.LayerNorm(hidden_states + input_tensor) + return hidden_states + + +# Copied from transformers.modeling_bert.BertAttention with Bert->Tapas +class TapasAttention(nn.Module): + def __init__(self, config): + super().__init__() + self.self = TapasSelfAttention(config) + self.output = TapasSelfOutput(config) + self.pruned_heads = set() + + def prune_heads(self, heads): + if len(heads) == 0: + return + heads, index = find_pruneable_heads_and_indices( + heads, self.self.num_attention_heads, self.self.attention_head_size, self.pruned_heads + ) + + # Prune linear layers + self.self.query = prune_linear_layer(self.self.query, index) + self.self.key = prune_linear_layer(self.self.key, index) + self.self.value = prune_linear_layer(self.self.value, index) + self.output.dense = prune_linear_layer(self.output.dense, index, dim=1) + + # Update hyper params and store pruned heads + self.self.num_attention_heads = self.self.num_attention_heads - len(heads) + self.self.all_head_size = self.self.attention_head_size * self.self.num_attention_heads + self.pruned_heads = self.pruned_heads.union(heads) + + def forward( + self, + hidden_states, + attention_mask=None, + head_mask=None, + encoder_hidden_states=None, + encoder_attention_mask=None, + output_attentions=False, + ): + self_outputs = self.self( + hidden_states, + attention_mask, + head_mask, + encoder_hidden_states, + encoder_attention_mask, + output_attentions, + ) + attention_output = self.output(self_outputs[0], hidden_states) + outputs = (attention_output,) + self_outputs[1:] # add attentions if we output them + return outputs + + +# Copied from transformers.modeling_bert.BertIntermediate +class TapasIntermediate(nn.Module): + def __init__(self, config): + super().__init__() + self.dense = nn.Linear(config.hidden_size, config.intermediate_size) + if isinstance(config.hidden_act, str): + self.intermediate_act_fn = ACT2FN[config.hidden_act] + else: + self.intermediate_act_fn = config.hidden_act + + def forward(self, hidden_states): + hidden_states = self.dense(hidden_states) + hidden_states = self.intermediate_act_fn(hidden_states) + return hidden_states + + +# Copied from transformers.modeling_bert.BertOutput +class TapasOutput(nn.Module): + def __init__(self, config): + super().__init__() + self.dense = 
nn.Linear(config.intermediate_size, config.hidden_size) + self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps) + self.dropout = nn.Dropout(config.hidden_dropout_prob) + + def forward(self, hidden_states, input_tensor): + hidden_states = self.dense(hidden_states) + hidden_states = self.dropout(hidden_states) + hidden_states = self.LayerNorm(hidden_states + input_tensor) + return hidden_states + + +# Copied from transformers.modeling_bert.BertLayer with Bert->Tapas +class TapasLayer(nn.Module): + def __init__(self, config): + super().__init__() + self.chunk_size_feed_forward = config.chunk_size_feed_forward + self.seq_len_dim = 1 + self.attention = TapasAttention(config) + self.is_decoder = config.is_decoder + self.add_cross_attention = config.add_cross_attention + if self.add_cross_attention: + assert self.is_decoder, f"{self} should be used as a decoder model if cross attention is added" + self.crossattention = TapasAttention(config) + self.intermediate = TapasIntermediate(config) + self.output = TapasOutput(config) + + def forward( + self, + hidden_states, + attention_mask=None, + head_mask=None, + encoder_hidden_states=None, + encoder_attention_mask=None, + output_attentions=False, + ): + self_attention_outputs = self.attention( + hidden_states, + attention_mask, + head_mask, + output_attentions=output_attentions, + ) + attention_output = self_attention_outputs[0] + outputs = self_attention_outputs[1:] # add self attentions if we output attention weights + + if self.is_decoder and encoder_hidden_states is not None: + assert hasattr( + self, "crossattention" + ), f"If `encoder_hidden_states` are passed, {self} has to be instantiated with cross-attention layers by setting `config.add_cross_attention=True`" + cross_attention_outputs = self.crossattention( + attention_output, + attention_mask, + head_mask, + encoder_hidden_states, + encoder_attention_mask, + output_attentions, + ) + attention_output = cross_attention_outputs[0] + outputs = outputs + cross_attention_outputs[1:] # add cross attentions if we output attention weights + + layer_output = apply_chunking_to_forward( + self.feed_forward_chunk, self.chunk_size_feed_forward, self.seq_len_dim, attention_output + ) + outputs = (layer_output,) + outputs + return outputs + + def feed_forward_chunk(self, attention_output): + intermediate_output = self.intermediate(attention_output) + layer_output = self.output(intermediate_output, attention_output) + return layer_output + + +# Copied from transformers.modeling_bert.BertEncoder with Bert->Tapas +class TapasEncoder(nn.Module): + def __init__(self, config): + super().__init__() + self.config = config + self.layer = nn.ModuleList([TapasLayer(config) for _ in range(config.num_hidden_layers)]) + + def forward( + self, + hidden_states, + attention_mask=None, + head_mask=None, + encoder_hidden_states=None, + encoder_attention_mask=None, + output_attentions=False, + output_hidden_states=False, + return_dict=True, + ): + all_hidden_states = () if output_hidden_states else None + all_attentions = () if output_attentions else None + for i, layer_module in enumerate(self.layer): + if output_hidden_states: + all_hidden_states = all_hidden_states + (hidden_states,) + + layer_head_mask = head_mask[i] if head_mask is not None else None + + if getattr(self.config, "gradient_checkpointing", False): + + def create_custom_forward(module): + def custom_forward(*inputs): + return module(*inputs, output_attentions) + + return custom_forward + + layer_outputs = 
torch.utils.checkpoint.checkpoint( + create_custom_forward(layer_module), + hidden_states, + attention_mask, + layer_head_mask, + encoder_hidden_states, + encoder_attention_mask, + ) + else: + layer_outputs = layer_module( + hidden_states, + attention_mask, + layer_head_mask, + encoder_hidden_states, + encoder_attention_mask, + output_attentions, + ) + hidden_states = layer_outputs[0] + if output_attentions: + all_attentions = all_attentions + (layer_outputs[1],) + + if output_hidden_states: + all_hidden_states = all_hidden_states + (hidden_states,) + + if not return_dict: + return tuple(v for v in [hidden_states, all_hidden_states, all_attentions] if v is not None) + return BaseModelOutput( + last_hidden_state=hidden_states, hidden_states=all_hidden_states, attentions=all_attentions + ) + + +# Copied from transformers.modeling_bert.BertPooler +class TapasPooler(nn.Module): + def __init__(self, config): + super().__init__() + self.dense = nn.Linear(config.hidden_size, config.hidden_size) + self.activation = nn.Tanh() + + def forward(self, hidden_states): + # We "pool" the model by simply taking the hidden state corresponding + # to the first token. + first_token_tensor = hidden_states[:, 0] + pooled_output = self.dense(first_token_tensor) + pooled_output = self.activation(pooled_output) + return pooled_output + + +class TapasPreTrainedModel(PreTrainedModel): + """ + An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained + models. + """ + + config_class = TapasConfig + base_model_prefix = "tapas" + + # Copied from transformers.modeling_bert.BertPreTrainedModel._init_weights + def _init_weights(self, module): + """ Initialize the weights """ + if isinstance(module, (nn.Linear, nn.Embedding)): + # Slightly different from the TF version which uses truncated_normal for initialization + # cf https://github.com/pytorch/pytorch/pull/5617 + module.weight.data.normal_(mean=0.0, std=self.config.initializer_range) + elif isinstance(module, nn.LayerNorm): + module.bias.data.zero_() + module.weight.data.fill_(1.0) + if isinstance(module, nn.Linear) and module.bias is not None: + module.bias.data.zero_() + + +TAPAS_START_DOCSTRING = r""" + This model inherits from :class:`~transformers.PreTrainedModel`. Check the superclass documentation for the generic + methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, + pruning heads etc.) + + This model is also a PyTorch `torch.nn.Module `__ + subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to + general usage and behavior. + + Parameters: + config (:class:`~transformers.TapasConfig`): Model configuration class with all the parameters of the model. + Initializing with a config file does not load the weights associated with the model, only the + configuration. Check out the :meth:`~transformers.PreTrainedModel.from_pretrained` method to load the model + weights. +""" + +TAPAS_INPUTS_DOCSTRING = r""" + Args: + input_ids (:obj:`torch.LongTensor` of shape :obj:`({0})`): + Indices of input sequence tokens in the vocabulary. + Indices can be obtained using :class:`~transformers.TapasTokenizer`. See + :meth:`transformers.PreTrainedTokenizer.encode` and :meth:`transformers.PreTrainedTokenizer.__call__` for + details. + + `What are input IDs? 
<../glossary.html#input-ids>`__ + attention_mask (:obj:`torch.FloatTensor` of shape :obj:`({0})`, `optional`): + Mask to avoid performing attention on padding token indices. Mask values selected in ``[0, 1]``: + + - 1 for tokens that are **not masked**, + - 0 for tokens that are **masked**. + + `What are attention masks? <../glossary.html#attention-mask>`__ + token_type_ids (:obj:`torch.LongTensor` of shape :obj:`({0}, 7)`, `optional`): + Token indices that encode tabular structure. Indices can be obtained using :class:`~transformers.TapasTokenizer`. + See this class for more info. + + `What are token type IDs? <../glossary.html#token-type-ids>`_ + position_ids (:obj:`torch.LongTensor` of shape :obj:`({0})`, `optional`): + Indices of positions of each input sequence tokens in the position embeddings. If ``reset_position_index_per_cell`` + of :class:`~transformers.TapasConfig` is set to ``True``, relative position embeddings will be used. Selected in the + range ``[0, config.max_position_embeddings - 1]``. + + `What are position IDs? <../glossary.html#position-ids>`_ + head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`): + Mask to nullify selected heads of the self-attention modules. Mask values selected in ``[0, 1]``: + - 1 indicates the head is **not masked**, + - 0 indicates the head is **masked**. + inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`({0}, hidden_size)`, `optional`): + Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation. + This is useful if you want more control over how to convert :obj:`input_ids` indices into associated + vectors than the model's internal embedding lookup matrix. + output_attentions (:obj:`bool`, `optional`): + Whether or not to return the attentions tensors of all attention layers. See ``attentions`` under returned + tensors for more detail. + output_hidden_states (:obj:`bool`, `optional`): + Whether or not to return the hidden states of all layers. See ``hidden_states`` under returned tensors for + more detail. + return_dict (:obj:`bool`, `optional`): + Whether or not to return a :class:`~transformers.file_utils.ModelOutput` instead of a plain tuple. +""" + + +@add_start_docstrings( + "The bare Tapas Model transformer outputting raw hidden-states without any specific head on top.", + TAPAS_START_DOCSTRING, +) +class TapasModel(TapasPreTrainedModel): + """ + This class is a small change compared to :class:`~transformers.BertModel`, taking into account the additional token + type ids. + + The model can behave as an encoder (with only self-attention) as well as a decoder, in which case a layer of + cross-attention is added between the self-attention layers, following the architecture described in `Attention is + all you need `__ by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, + Llion Jones, Aidan N. Gomez, Lukasz Kaiser and Illia Polosukhin. + + """ + + config_class = TapasConfig + base_model_prefix = "tapas" + + def __init__(self, config): + requires_scatter(self) + super().__init__(config) + self.config = config + + self.embeddings = TapasEmbeddings(config) + self.encoder = TapasEncoder(config) + self.pooler = TapasPooler(config) + + self.init_weights() + + def get_input_embeddings(self): + return self.embeddings.word_embeddings + + def set_input_embeddings(self, value): + self.embeddings.word_embeddings = value + + def _prune_heads(self, heads_to_prune): + """ + Prunes heads of the model. 
heads_to_prune: dict of {layer_num: list of heads to prune in this layer} See base + class PreTrainedModel + """ + for layer, heads in heads_to_prune.items(): + self.encoder.layer[layer].attention.prune_heads(heads) + + @add_start_docstrings_to_model_forward(TAPAS_INPUTS_DOCSTRING.format("batch_size, sequence_length")) + @replace_return_docstrings(output_type=BaseModelOutputWithPooling, config_class=_CONFIG_FOR_DOC) + def forward( + self, + input_ids=None, + attention_mask=None, + token_type_ids=None, + position_ids=None, + head_mask=None, + inputs_embeds=None, + encoder_hidden_states=None, + encoder_attention_mask=None, + output_attentions=None, + output_hidden_states=None, + return_dict=None, + ): + r""" + Returns: + + Examples:: + + >>> from transformers import TapasTokenizer, TapasModel + >>> import pandas as pd + + >>> tokenizer = TapasTokenizer.from_pretrained('google/tapas-base-uncased') + >>> model = TapasModel.from_pretrained('google/tapas-base-uncased') + + >>> data = {'Actors': ["Brad Pitt", "Leonardo Di Caprio", "George Clooney"], + ... 'Age': ["56", "45", "59"], + ... 'Number of movies': ["87", "53", "69"] + ... } + >>> table = pd.DataFrame.from_dict(data) + >>> queries = ["How many movies has George Clooney played in?", "How old is Brad Pitt?"] + + >>> inputs = tokenizer(table=table, queries=queries, padding="max_length", return_tensors="pt") + >>> outputs = model(**inputs) + + >>> last_hidden_states = outputs.last_hidden_state + """ + output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions + output_hidden_states = ( + output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states + ) + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + if input_ids is not None and inputs_embeds is not None: + raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time") + elif input_ids is not None: + input_shape = input_ids.size() + elif inputs_embeds is not None: + input_shape = inputs_embeds.size()[:-1] + else: + raise ValueError("You have to specify either input_ids or inputs_embeds") + + device = input_ids.device if input_ids is not None else inputs_embeds.device + + if attention_mask is None: + attention_mask = torch.ones(input_shape, device=device) + if token_type_ids is None: + token_type_ids = torch.zeros( + (*input_shape, len(self.config.type_vocab_sizes)), dtype=torch.long, device=device + ) + + # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length] + # ourselves in which case we just need to make it broadcastable to all heads. 
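+        # get_extended_attention_mask (inherited from PreTrainedModel) reshapes the 2D mask to
+        # [batch_size, 1, 1, seq_length] and maps 1 -> 0.0 and 0 -> a large negative value, so that
+        # adding it to the raw attention scores suppresses the masked positions.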
+        extended_attention_mask: torch.Tensor = self.get_extended_attention_mask(attention_mask, input_shape, device)
+
+        # If a 2D or 3D attention mask is provided for the cross-attention
+        # we need to make it broadcastable to [batch_size, num_heads, seq_length, seq_length]
+        if self.config.is_decoder and encoder_hidden_states is not None:
+            encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()
+            encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)
+            if encoder_attention_mask is None:
+                encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device)
+            encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)
+        else:
+            encoder_extended_attention_mask = None
+
+        # Prepare head mask if needed
+        # 1.0 in head_mask indicates we keep the head
+        # attention_probs has shape bsz x n_heads x N x N
+        # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]
+        # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]
+        head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)
+
+        embedding_output = self.embeddings(
+            input_ids=input_ids, position_ids=position_ids, token_type_ids=token_type_ids, inputs_embeds=inputs_embeds
+        )
+        encoder_outputs = self.encoder(
+            embedding_output,
+            attention_mask=extended_attention_mask,
+            head_mask=head_mask,
+            encoder_hidden_states=encoder_hidden_states,
+            encoder_attention_mask=encoder_extended_attention_mask,
+            output_attentions=output_attentions,
+            output_hidden_states=output_hidden_states,
+            return_dict=return_dict,
+        )
+        sequence_output = encoder_outputs[0]
+        pooled_output = self.pooler(sequence_output)
+
+        if not return_dict:
+            return (sequence_output, pooled_output) + encoder_outputs[1:]
+
+        return BaseModelOutputWithPooling(
+            last_hidden_state=sequence_output,
+            pooler_output=pooled_output,
+            hidden_states=encoder_outputs.hidden_states,
+            attentions=encoder_outputs.attentions,
+        )
+
+
+@add_start_docstrings("""Tapas Model with a `language modeling` head on top. """, TAPAS_START_DOCSTRING)
+class TapasForMaskedLM(TapasPreTrainedModel):
+    config_class = TapasConfig
+    base_model_prefix = "tapas"
+
+    def __init__(self, config):
+        super().__init__(config)
+
+        self.tapas = TapasModel(config)
+        self.lm_head = nn.Linear(config.hidden_size, config.vocab_size)
+
+        self.init_weights()
+
+    def get_output_embeddings(self):
+        return self.lm_head
+
+    @add_start_docstrings_to_model_forward(TAPAS_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
+    @replace_return_docstrings(output_type=MaskedLMOutput, config_class=_CONFIG_FOR_DOC)
+    def forward(
+        self,
+        input_ids=None,
+        attention_mask=None,
+        token_type_ids=None,
+        position_ids=None,
+        head_mask=None,
+        inputs_embeds=None,
+        encoder_hidden_states=None,
+        encoder_attention_mask=None,
+        labels=None,
+        output_attentions=None,
+        output_hidden_states=None,
+        return_dict=None,
+        **kwargs
+    ):
+        r"""
+        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
+            Labels for computing the masked language modeling loss.
Indices should be in ``[-100, 0, ...,
+            config.vocab_size]`` (see ``input_ids`` docstring). Tokens with indices set to ``-100`` are ignored
+            (masked); the loss is only computed for the tokens with labels in ``[0, ..., config.vocab_size]``.
+
+        Returns:
+
+        Examples::
+
+            >>> from transformers import TapasTokenizer, TapasForMaskedLM
+            >>> import pandas as pd
+
+            >>> tokenizer = TapasTokenizer.from_pretrained('google/tapas-base-uncased')
+            >>> model = TapasForMaskedLM.from_pretrained('google/tapas-base-uncased')
+
+            >>> data = {'Actors': ["Brad Pitt", "Leonardo Di Caprio", "George Clooney"],
+            ...         'Age': ["56", "45", "59"],
+            ...         'Number of movies': ["87", "53", "69"]
+            ... }
+            >>> table = pd.DataFrame.from_dict(data)
+
+            >>> inputs = tokenizer(table=table, queries="How many [MASK] has George [MASK] played in?", return_tensors="pt")
+            >>> labels = tokenizer(table=table, queries="How many movies has George Clooney played in?", return_tensors="pt")["input_ids"]
+
+            >>> outputs = model(**inputs, labels=labels)
+            >>> loss = outputs.loss
+            >>> logits = outputs.logits
+        """
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+        outputs = self.tapas(
+            input_ids,
+            attention_mask=attention_mask,
+            token_type_ids=token_type_ids,
+            position_ids=position_ids,
+            head_mask=head_mask,
+            inputs_embeds=inputs_embeds,
+            encoder_hidden_states=encoder_hidden_states,
+            encoder_attention_mask=encoder_attention_mask,
+            output_attentions=output_attentions,
+            output_hidden_states=output_hidden_states,
+            return_dict=return_dict,
+        )
+
+        sequence_output = outputs[0]
+        prediction_scores = self.lm_head(sequence_output)
+
+        masked_lm_loss = None
+        if labels is not None:
+            loss_fct = CrossEntropyLoss()  # -100 index = padding token
+            masked_lm_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size), labels.view(-1))
+
+        if not return_dict:
+            output = (prediction_scores,) + outputs[2:]
+            return ((masked_lm_loss,) + output) if masked_lm_loss is not None else output
+
+        return MaskedLMOutput(
+            loss=masked_lm_loss,
+            logits=prediction_scores,
+            hidden_states=outputs.hidden_states,
+            attentions=outputs.attentions,
+        )
+
+
+# Copied from transformers.modeling_roberta.RobertaLMHead with Roberta->Tapas
+class TapasLMHead(nn.Module):
+    """Tapas Head for masked language modeling."""
+
+    def __init__(self, config):
+        super().__init__()
+        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
+        self.layer_norm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
+
+        self.decoder = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
+        self.bias = nn.Parameter(torch.zeros(config.vocab_size))
+
+        # Need a link between the two variables so that the bias is correctly resized with `resize_token_embeddings`
+        self.decoder.bias = self.bias
+
+    def forward(self, features, **kwargs):
+        x = self.dense(features)
+        # `gelu` is not imported at module level, so go through the ACT2FN mapping instead
+        x = ACT2FN["gelu"](x)
+        x = self.layer_norm(x)
+
+        # project back to size of vocabulary with bias
+        x = self.decoder(x)
+
+        return x
+
+
+@add_start_docstrings(
+    """
+    Tapas Model with a cell selection head and optionally an aggregation head on top for question-answering tasks on
+    tables (linear layers on top of the hidden-states output to compute `logits` and optionally `logits_aggregation`),
+    e.g. for SQA, WTQ or WikiSQL tasks.
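+
+    Whether an aggregation head is added is controlled by :obj:`config.num_aggregation_labels`; when it is 0, only
+    the cell selection head is used.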
+ """, + TAPAS_START_DOCSTRING, +) +class TapasForQuestionAnswering(TapasPreTrainedModel): + def __init__(self, config): + super().__init__(config) + + # base model + self.tapas = TapasModel(config) + + # dropout (only used when training) + self.dropout = nn.Dropout(config.hidden_dropout_prob) + + # cell selection heads + if config.init_cell_selection_weights_to_zero: + # init_cell_selection_weights_to_zero: Whether the initial weights should be + # set to 0. This ensures that all tokens have the same prior probability. + self.output_weights = nn.Parameter(torch.zeros(config.hidden_size)) + self.column_output_weights = nn.Parameter(torch.zeros(config.hidden_size)) + else: + self.output_weights = nn.Parameter(torch.empty(config.hidden_size)) + nn.init.normal_( + self.output_weights, std=0.02 + ) # here, a truncated normal is used in the original implementation + self.column_output_weights = nn.Parameter(torch.empty(config.hidden_size)) + nn.init.normal_( + self.column_output_weights, std=0.02 + ) # here, a truncated normal is used in the original implementation + self.output_bias = nn.Parameter(torch.zeros([])) + self.column_output_bias = nn.Parameter(torch.zeros([])) + + # aggregation head + if config.num_aggregation_labels > 0: + self.aggregation_classifier = nn.Linear(config.hidden_size, config.num_aggregation_labels) + + self.init_weights() + + @add_start_docstrings_to_model_forward(TAPAS_INPUTS_DOCSTRING.format("batch_size, sequence_length")) + @replace_return_docstrings(output_type=TableQuestionAnsweringOutput, config_class=_CONFIG_FOR_DOC) + def forward( + self, + input_ids=None, + attention_mask=None, + token_type_ids=None, + position_ids=None, + head_mask=None, + inputs_embeds=None, + table_mask=None, + label_ids=None, + aggregation_labels=None, + float_answer=None, + numeric_values=None, + numeric_values_scale=None, + output_attentions=None, + output_hidden_states=None, + return_dict=None, + ): + r""" + table_mask (:obj:`torch.LongTensor` of shape :obj:`(batch_size, seq_length)`, `optional`): + Mask for the table. Indicates which tokens belong to the table (1). Question tokens, table headers and + padding are 0. + label_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, seq_length)`, `optional`): + Labels per token for computing the hierarchical cell selection loss. This encodes the positions of the + answer appearing in the table. Can be obtained using :class:`~transformers.TapasTokenizer`. + + - 1 for tokens that are **part of the answer**, + - 0 for tokens that are **not part of the answer**. + + aggregation_labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, )`, `optional`): + Aggregation function index for every example in the batch for computing the aggregation loss. Indices + should be in :obj:`[0, ..., config.num_aggregation_labels - 1]`. Only required in case of strong + supervision for aggregation (WikiSQL-SUPERVISED). + float_answer (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, )`, `optional`): + Float answer for every example in the batch. Set to `float('nan')` for cell selection questions. + Only required in case of weak supervision (WTQ, WikiSQL) to calculate the aggregate mask and regression loss. + numeric_values (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, seq_length)`, `optional`): + Numeric values of every token, NaN for tokens which are not numeric values. Can be obtained using + :class:`~transformers.TapasTokenizer`. Only required in case of weak supervision for aggregation (WTQ, + WikiSQL) to calculate the regression loss. 
+ numeric_values_scale (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, seq_length)`, `optional`): + Scale of the numeric values of every token. Can be obtained using :class:`~transformers.TapasTokenizer`. + Only required in case of weak supervision for aggregation (WTQ, WikiSQL) to calculate the regression loss. + + Returns: + + Examples:: + + >>> from transformers import TapasTokenizer, TapasForQuestionAnswering + >>> import pandas as pd + + >>> tokenizer = TapasTokenizer.from_pretrained('google/tapas-base-uncased-finetuned-wtq') + >>> model = TapasForQuestionAnswering.from_pretrained('google/tapas-base-uncased-finetuned-wtq') + + >>> data = {'Actors': ["Brad Pitt", "Leonardo Di Caprio", "George Clooney"], + ... 'Age': ["56", "45", "59"], + ... 'Number of movies': ["87", "53", "69"] + ... } + >>> table = pd.DataFrame.from_dict(data) + >>> queries = ["How many movies has George Clooney played in?", "How old is Brad Pitt?"] + + >>> inputs = tokenizer(table=table, queries=queries, padding="max_length", return_tensors="pt") + >>> outputs = model(**inputs) + + >>> logits = outputs.logits + >>> logits_aggregation = outputs.logits_aggregation + """ + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + outputs = self.tapas( + input_ids, + attention_mask=attention_mask, + token_type_ids=token_type_ids, + position_ids=position_ids, + head_mask=head_mask, + inputs_embeds=inputs_embeds, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + + sequence_output = outputs[0] + pooled_output = outputs[1] + + sequence_output = self.dropout(sequence_output) + + if input_ids is not None: + input_shape = input_ids.size() + else: + input_shape = inputs_embeds.size()[:-1] + + device = input_ids.device if input_ids is not None else inputs_embeds.device + + # Construct indices for the table. + if token_type_ids is None: + token_type_ids = torch.zeros( + (*input_shape, len(self.config.type_vocab_sizes)), dtype=torch.long, device=device + ) + + token_types = [ + "segment_ids", + "column_ids", + "row_ids", + "prev_label_ids", + "column_ranks", + "inv_column_ranks", + "numeric_relations", + ] + + row_ids = token_type_ids[:, :, token_types.index("row_ids")] + column_ids = token_type_ids[:, :, token_types.index("column_ids")] + + row_index = IndexMap( + indices=torch.min(row_ids, torch.as_tensor(self.config.max_num_rows - 1, device=row_ids.device)), + num_segments=self.config.max_num_rows, + batch_dims=1, + ) + col_index = IndexMap( + indices=torch.min(column_ids, torch.as_tensor(self.config.max_num_columns - 1, device=column_ids.device)), + num_segments=self.config.max_num_columns, + batch_dims=1, + ) + cell_index = ProductIndexMap(row_index, col_index) + + # Masks. + input_shape = input_ids.size() if input_ids is not None else inputs_embeds.size()[:-1] + device = input_ids.device if input_ids is not None else inputs_embeds.device + if attention_mask is None: + attention_mask = torch.ones(input_shape, device=device) + # Table cells only, without question tokens and table headers. + if table_mask is None: + table_mask = torch.where(row_ids > 0, torch.ones_like(row_ids), torch.zeros_like(row_ids)) + # torch.FloatTensor[batch_size, seq_length] + input_mask_float = attention_mask.float().to(device) + table_mask_float = table_mask.float().to(device) + # Mask for cells that exist in the table (i.e. that are not padding). + cell_mask, _ = reduce_mean(input_mask_float, cell_index) + + # Compute logits per token. 
These are used to select individual cells. + logits = compute_token_logits(sequence_output, self.config.temperature, self.output_weights, self.output_bias) + + # Compute logits per column. These are used to select a column. + column_logits = None + if self.config.select_one_column: + column_logits = compute_column_logits( + sequence_output, + self.column_output_weights, + self.column_output_bias, + cell_index, + cell_mask, + self.config.allow_empty_column_selection, + ) + + ########## Aggregation logits ############## + logits_aggregation = None + if self.config.num_aggregation_labels > 0: + logits_aggregation = self.aggregation_classifier(pooled_output) + + # Total loss calculation + total_loss = 0.0 + calculate_loss = False + if label_ids is not None: + calculate_loss = True + is_supervised = not self.config.num_aggregation_labels > 0 or not self.config.use_answer_as_supervision + + ### Semi-supervised cell selection in case of no aggregation + ############################################################# + + # If the answer (the denotation) appears directly in the table we might + # select the answer without applying any aggregation function. There are + # some ambiguous cases, see utils._calculate_aggregate_mask for more info. + # `aggregate_mask` is 1 for examples where we chose to aggregate and 0 + # for examples where we chose to select the answer directly. + # `label_ids` encodes the positions of the answer appearing in the table. + if is_supervised: + aggregate_mask = None + else: + if float_answer is not None: + assert label_ids.shape[0] == float_answer.shape[0], "Make sure the answers are a FloatTensor of shape (batch_size,)" + # [batch_size] + aggregate_mask = _calculate_aggregate_mask( + float_answer, + pooled_output, + self.config.cell_selection_preference, + label_ids, + self.aggregation_classifier, + ) + else: + raise ValueError("You have to specify float answers in order to calculate the aggregate mask") + + ### Cell selection log-likelihood + ################################# + + if self.config.average_logits_per_cell: + logits_per_cell, _ = reduce_mean(logits, cell_index) + logits = gather(logits_per_cell, cell_index) + dist_per_token = torch.distributions.Bernoulli(logits=logits) + + # Compute cell selection loss per example. 
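+            # Both branches below yield one scalar per example: either a weighted Bernoulli negative
+            # log-likelihood over all tokens (answer tokens weighted by config.positive_label_weight,
+            # averaged over the non-padding tokens), or, when select_one_column is set, a hierarchical
+            # loss that first scores columns and then cells within the selected column.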
+            selection_loss_per_example = None
+            if not self.config.select_one_column:
+                weight = torch.where(
+                    label_ids == 0,
+                    torch.ones_like(label_ids, dtype=torch.float32),
+                    self.config.positive_label_weight * torch.ones_like(label_ids, dtype=torch.float32),
+                )
+                selection_loss_per_token = -dist_per_token.log_prob(label_ids) * weight
+                selection_loss_per_example = torch.sum(selection_loss_per_token * input_mask_float, dim=1) / (
+                    torch.sum(input_mask_float, dim=1) + EPSILON_ZERO_DIVISION
+                )
+            else:
+                selection_loss_per_example, logits = _single_column_cell_selection_loss(
+                    logits, column_logits, label_ids, cell_index, col_index, cell_mask
+                )
+                dist_per_token = torch.distributions.Bernoulli(logits=logits)
+
+            ### Supervised cell selection
+            #############################
+            if self.config.disable_per_token_loss:
+                pass
+            elif is_supervised:
+                total_loss += torch.mean(selection_loss_per_example)
+            else:
+                # In the weakly supervised case, only assign the cell selection loss to examples
+                # where no aggregation was chosen
+                total_loss += torch.mean(selection_loss_per_example * (1.0 - aggregate_mask))
+
+            ### Semi-supervised regression loss and supervised loss for aggregations
+            #########################################################################
+            if self.config.num_aggregation_labels > 0:
+                if is_supervised:
+                    # Note that `aggregate_mask` is None if the setting is supervised.
+                    if aggregation_labels is not None:
+                        assert (
+                            label_ids.shape[0] == aggregation_labels.shape[0]
+                        ), "Make sure the aggregation labels are a LongTensor of shape (batch_size,)"
+                        per_example_additional_loss = _calculate_aggregation_loss(
+                            logits_aggregation,
+                            aggregate_mask,
+                            aggregation_labels,
+                            self.config.use_answer_as_supervision,
+                            self.config.num_aggregation_labels,
+                            self.config.aggregation_loss_weight,
+                        )
+                    else:
+                        raise ValueError(
+                            "You have to specify aggregation labels in order to calculate the aggregation loss"
+                        )
+                else:
+                    # Set aggregation labels to zeros
+                    aggregation_labels = torch.zeros(label_ids.shape[0], dtype=torch.long, device=label_ids.device)
+                    per_example_additional_loss = _calculate_aggregation_loss(
+                        logits_aggregation,
+                        aggregate_mask,
+                        aggregation_labels,
+                        self.config.use_answer_as_supervision,
+                        self.config.num_aggregation_labels,
+                        self.config.aggregation_loss_weight,
+                    )
+
+                if self.config.use_answer_as_supervision:
+                    if numeric_values is not None and numeric_values_scale is not None:
+                        assert numeric_values.shape == numeric_values_scale.shape
+                        # Add regression loss for numeric answers which require aggregation.
+                        answer_loss, large_answer_loss_mask = _calculate_regression_loss(
+                            float_answer,
+                            aggregate_mask,
+                            dist_per_token,
+                            numeric_values,
+                            numeric_values_scale,
+                            table_mask_float,
+                            logits_aggregation,
+                            self.config,
+                        )
+                        per_example_additional_loss += answer_loss
+                        # Zero loss for examples with answer_loss > cutoff.
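+                        # large_answer_loss_mask is 1.0 for examples that are kept and 0.0 for examples whose
+                        # answer_loss exceeded the cutoff, so this multiplication removes their contribution.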
+ per_example_additional_loss *= large_answer_loss_mask + else: + raise ValueError( + "You have to specify numeric values and numeric values scale in order to calculate the regression loss" + ) + + total_loss += torch.mean(per_example_additional_loss) + + else: + # if no label ids are provided, set them to zeros in order to properly compute logits + label_ids = torch.zeros_like(logits) + _, logits = _single_column_cell_selection_loss( + logits, column_logits, label_ids, cell_index, col_index, cell_mask + ) + if not return_dict: + output = (logits, logits_aggregation) + outputs[2:] + return ((total_loss,) + output) if calculate_loss else output + + return TableQuestionAnsweringOutput( + loss=total_loss, + logits=logits, + logits_aggregation=logits_aggregation, + hidden_states=outputs.hidden_states, + attentions=outputs.attentions, + ) + + +@add_start_docstrings( + """ + Tapas Model with a sequence classification head on top (a linear layer on top of the pooled output), e.g. for + table entailment tasks, such as TabFact (Chen et al., 2020). + """, + TAPAS_START_DOCSTRING, +) +class TapasForSequenceClassification(TapasPreTrainedModel): + def __init__(self, config): + super().__init__(config) + self.num_labels = config.num_labels + + self.tapas = TapasModel(config) + self.dropout = nn.Dropout(config.hidden_dropout_prob) + self.classifier = nn.Linear(config.hidden_size, config.num_labels) + + self.init_weights() + + @add_start_docstrings_to_model_forward(TAPAS_INPUTS_DOCSTRING.format("batch_size, sequence_length")) + @replace_return_docstrings(output_type=SequenceClassifierOutput, config_class=_CONFIG_FOR_DOC) + def forward( + self, + input_ids=None, + attention_mask=None, + token_type_ids=None, + position_ids=None, + head_mask=None, + inputs_embeds=None, + labels=None, + output_attentions=None, + output_hidden_states=None, + return_dict=None, + ): + r""" + labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`): + Labels for computing the sequence classification/regression loss. Indices should be in :obj:`[0, ..., + config.num_labels - 1]`. If :obj:`config.num_labels == 1` a regression loss is computed (Mean-Square loss), + If :obj:`config.num_labels > 1` a classification loss is computed (Cross-Entropy). Note: this is called + "classification_class_index" in the original implementation. + + Returns: + + Examples:: + + >>> from transformers import TapasTokenizer, TapasForSequenceClassification + >>> import torch + >>> import pandas as pd + + >>> tokenizer = TapasTokenizer.from_pretrained('google/tapas-base-uncased-finetuned-tabfact') + >>> model = TapasForSequenceClassification.from_pretrained('google/tapas-base-uncased-finetuned-tabfact') + + >>> data = {'Actors': ["Brad Pitt", "Leonardo Di Caprio", "George Clooney"], + ... 'Age': ["56", "45", "59"], + ... 'Number of movies': ["87", "53", "69"] + ... 
} + >>> table = pd.DataFrame.from_dict(data) + >>> queries = ["There is only one actor who is 45 years old", "There are 3 actors which played in more than 60 movies"] + + >>> inputs = tokenizer(table=table, queries=queries, padding="max_length", return_tensors="pt") + >>> labels = torch.tensor([1, 0]) # 1 means entailed, 0 means refuted + + >>> outputs = model(**inputs, labels=labels) + >>> loss = outputs.loss + >>> logits = outputs.logits + """ + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + outputs = self.tapas( + input_ids, + attention_mask=attention_mask, + token_type_ids=token_type_ids, + position_ids=position_ids, + head_mask=head_mask, + inputs_embeds=inputs_embeds, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + + pooled_output = outputs[1] + + pooled_output = self.dropout(pooled_output) + logits = self.classifier(pooled_output) + + loss = None + if labels is not None: + if self.num_labels == 1: + # We are doing regression + loss_fct = MSELoss() + loss = loss_fct(logits.view(-1), labels.view(-1)) + else: + loss_fct = CrossEntropyLoss() + loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1)) + + if not return_dict: + output = (logits,) + outputs[2:] + return ((loss,) + output) if loss is not None else output + + return SequenceClassifierOutput( + loss=loss, + logits=logits, + hidden_states=outputs.hidden_states, + attentions=outputs.attentions, + ) + + +""" TAPAS utilities.""" + + +class AverageApproximationFunction(str, enum.Enum): + RATIO = "ratio" + FIRST_ORDER = "first_order" + SECOND_ORDER = "second_order" + + +### Beginning of everything related to segmented tensors ### + + +class IndexMap(object): + """Index grouping entries within a tensor.""" + + def __init__(self, indices, num_segments, batch_dims=0): + """ + Creates an index + + Args: + indices (:obj:`torch.LongTensor`, same shape as a `values` Tensor to which the indices refer): + Tensor containing the indices. + num_segments (:obj:`torch.LongTensor`): + Scalar tensor, the number of segments. All elements in a batched segmented tensor must have the same + number of segments (although many segments can be empty). + batch_dims (:obj:`int`, `optional`, defaults to 0): + The number of batch dimensions. The first `batch_dims` dimensions of a SegmentedTensor are treated as + batch dimensions. Segments in different batch elements are always distinct even if they have the same + index. + """ + self.indices = torch.as_tensor(indices) + self.num_segments = torch.as_tensor(num_segments, device=indices.device) + self.batch_dims = batch_dims + + def batch_shape(self): + return self.indices.size()[: self.batch_dims] # returns a torch.Size object + + +class ProductIndexMap(IndexMap): + """The product of two indices.""" + + def __init__(self, outer_index, inner_index): + """ + Combines indices i and j into pairs (i, j). The result is an index where each segment (i, j) is the + intersection of segments i and j. For example if the inputs represent table cells indexed by respectively rows + and columns the output will be a table indexed by (row, column) pairs, i.e. by cell. The implementation + combines indices {0, .., n - 1} and {0, .., m - 1} into {0, .., nm - 1}. The output has `num_segments` equal to + `outer_index.num_segments` * `inner_index.num_segments` + + Args: + outer_index (:obj:`IndexMap`): + IndexMap. + inner_index (:obj:`IndexMap`): + IndexMap, must have the same shape as `outer_index`. 
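+
+        For example, combining a row index (outer, 3 segments) with a column index (inner, 4 segments)
+        maps the cell at (row=1, column=2) to segment 2 + 1 * 4 = 6.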
+ """ + if outer_index.batch_dims != inner_index.batch_dims: + raise ValueError("outer_index.batch_dims and inner_index.batch_dims must be the same.") + + super(ProductIndexMap, self).__init__( + indices=(inner_index.indices + outer_index.indices * inner_index.num_segments), + num_segments=inner_index.num_segments * outer_index.num_segments, + batch_dims=inner_index.batch_dims, + ) + self.outer_index = outer_index + self.inner_index = inner_index + + def project_outer(self, index): + """Projects an index with the same index set onto the outer components.""" + return IndexMap( + indices=(index.indices // self.inner_index.num_segments).type(torch.float).floor().type(torch.long), + num_segments=self.outer_index.num_segments, + batch_dims=index.batch_dims, + ) + + def project_inner(self, index): + """Projects an index with the same index set onto the inner components.""" + return IndexMap( + indices=torch.fmod(index.indices, self.inner_index.num_segments) + .type(torch.float) + .floor() + .type(torch.long), + num_segments=self.inner_index.num_segments, + batch_dims=index.batch_dims, + ) + + +def gather(values, index, name="segmented_gather"): + """ + Gathers from `values` using the index map. For each element in the domain of the index map this operation looks up + a value for that index in `values`. Two elements from the same segment always get assigned the same value. + + Args: + values (:obj:`torch.Tensor` of shape (B1, ..., Bn, num_segments, V1, ...)): + Tensor with segment values. + index (:obj:`IndexMap` of shape (B1, ..., Bn, I1, ..., Ik)): + IndexMap. + name (:obj:`str`, `optional`, defaults to 'segmented_gather'): + Name for the operation. Currently not used + + Returns: + :obj:`tuple(torch.Tensor)`: Tensor of shape (B1, ..., Bn, I1, ..., Ik, V1, ...) with the gathered values. + """ + indices = index.indices + # first, check whether the indices of the index represent scalar values (i.e. not vectorized) + if len(values.shape[index.batch_dims :]) < 2: + return torch.gather( + values, + index.batch_dims, + indices.view( + values.size()[0], -1 + ), # torch.gather expects index to have the same number of dimensions as values + ).view(indices.size()) + else: + # this means we have a vectorized version + # we have to adjust the index + indices = indices.unsqueeze(-1).expand(values.shape) + return torch.gather(values, index.batch_dims, indices) + + +def flatten(index, name="segmented_flatten"): + """ + Flattens a batched index map (which is typically of shape batch_size, seq_length) to a 1d index map. This operation + relabels the segments to keep batch elements distinct. The k-th batch element will have indices shifted by + `num_segments` * (k - 1). The result is a tensor with `num_segments` multiplied by the number of elements in the + batch. + + Args: + index (:obj:`IndexMap`): + IndexMap to flatten. + name (:obj:`str`, `optional`, defaults to 'segmented_flatten'): + Name for the operation. Currently not used + + Returns: + (:obj:`IndexMap`): The flattened IndexMap. + """ + # first, get batch_size as scalar tensor + batch_size = torch.prod(torch.tensor(list(index.batch_shape()))) + # next, create offset as 1-D tensor of length batch_size, + # and multiply element-wise by num segments (to offset different elements in the batch) e.g. 
if batch size is 2: [0, 64]
+    offset = torch.arange(start=0, end=batch_size, device=index.num_segments.device) * index.num_segments
+    offset = offset.view(index.batch_shape())
+    for _ in range(index.batch_dims, len(index.indices.size())):  # typically range(1,2)
+        offset = offset.unsqueeze(-1)
+
+    indices = offset + index.indices
+    return IndexMap(indices=indices.view(-1), num_segments=index.num_segments * batch_size, batch_dims=0)
+
+
+def range_index_map(batch_shape, num_segments, name="range_index_map"):
+    """
+    Constructs an index map equal to range(num_segments).
+
+    Args:
+        batch_shape (:obj:`torch.Size`):
+            Batch shape
+        num_segments (:obj:`int`):
+            Number of segments
+        name (:obj:`str`, `optional`, defaults to 'range_index_map'):
+            Name for the operation. Currently not used
+
+    Returns:
+        (:obj:`IndexMap`): IndexMap of shape batch_shape with elements equal to range(num_segments).
+    """
+    batch_shape = torch.as_tensor(
+        batch_shape, dtype=torch.long
+    )  # create a rank 1 tensor containing batch_shape (e.g. [2])
+    assert len(batch_shape.size()) == 1
+    num_segments = torch.as_tensor(num_segments)  # create a rank 0 tensor (scalar) containing num_segments (e.g. 64)
+    assert len(num_segments.size()) == 0
+
+    indices = torch.arange(
+        start=0, end=num_segments, device=num_segments.device
+    )  # create a rank 1 vector with num_segments elements
+    new_tensor = torch.cat(
+        [torch.ones_like(batch_shape, dtype=torch.long, device=num_segments.device), num_segments.unsqueeze(dim=0)],
+        dim=0,
+    )
+    # new_tensor is just a vector of [1 64] for example (assuming only 1 batch dimension)
+    new_shape = [int(x) for x in new_tensor.tolist()]
+    indices = indices.view(new_shape)
+
+    multiples = torch.cat([batch_shape, torch.as_tensor([1])], dim=0)
+    indices = indices.repeat(multiples.tolist())
+    # equivalent in NumPy:
+    # indices = torch.as_tensor(np.tile(indices.numpy(), multiples.tolist()))
+
+    return IndexMap(indices=indices, num_segments=num_segments, batch_dims=list(batch_shape.size())[0])
+
+
+def _segment_reduce(values, index, segment_reduce_fn, name):
+    """
+    Applies a segment reduction segment-wise.
+
+    Args:
+        values (:obj:`torch.Tensor`):
+            Tensor with segment values.
+        index (:obj:`IndexMap`):
+            IndexMap.
+        segment_reduce_fn (:obj:`str`):
+            Name for the reduce operation. One of "sum", "mean", "max" or "min".
+        name (:obj:`str`):
+            Name for the operation. Currently not used
+
+    Returns:
+        output_values (:obj:`torch.Tensor` of shape [B1, B2, ..., Bn, num_segments, V1, V2, ..]): Tensor containing
+        the reduced values. output_index (:obj:`IndexMap`): IndexMap with shape [B1, B2, ..., Bn, num_segments].
+    """
+    # Flatten the batch dimensions, as segments ops (scatter) do not support batching.
+    # However if `values` has extra dimensions to the right keep them
+    # unflattened. Segmented ops support vector-valued operations.
+    flat_index = flatten(index)
+    vector_shape = values.size()[len(index.indices.size()) :]  # torch.Size object
+    flattened_shape = torch.cat(
+        [torch.as_tensor([-1], dtype=torch.long), torch.as_tensor(vector_shape, dtype=torch.long)], dim=0
+    )
+    # "reshape" is used here instead of "view" because `values` is not guaranteed to be contiguous at this point
+    flat_values = values.reshape(flattened_shape.tolist())
+
+    segment_means = scatter(
+        src=flat_values,
+        index=flat_index.indices.type(torch.long),
+        dim=0,
+        dim_size=flat_index.num_segments,
+        reduce=segment_reduce_fn,
+    )
+
+    # Unflatten the values.
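+    # The scatter output has shape [batch_size * num_segments, V1, V2, ...]; reshape it back to
+    # [B1, ..., Bn, num_segments, V1, V2, ...] so that every batch element again has its own
+    # num_segments rows.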
+ + # Unflatten the values. + new_shape = torch.cat( + [ + torch.as_tensor(index.batch_shape(), dtype=torch.long), + torch.as_tensor([index.num_segments], dtype=torch.long), + torch.as_tensor(vector_shape, dtype=torch.long), + ], + dim=0, + ) + + output_values = segment_means.view(new_shape.tolist()) + output_index = range_index_map(index.batch_shape(), index.num_segments) + return output_values, output_index + + + def reduce_sum(values, index, name="segmented_reduce_sum"): + """ + Sums a tensor over its segments. + + Outputs 0 for empty segments. + + This operation computes the sum over segments, with support for: + - Batching using the first dimensions [B1, B2, ..., Bn]. Each element in a batch can have different indices. + - Vectorization using the last dimension [V1, V2, ...]. If they are present, the output will be a sum of + vectors rather than scalars. + + Only the middle dimensions [I1, ..., Ik] are reduced by the operation. + + Args: + values (:obj:`torch.Tensor` of shape [B1, B2, ..., Bn, I1, .., Ik, V1, V2, ..]): + Tensor containing the values of which the sum must be taken segment-wise. + index (:obj:`IndexMap`, indices are of shape [B1, B2, ..., Bn, I1, .., Ik].): + Index defining the segments. + name (:obj:`str`, `optional`, defaults to 'segmented_reduce_sum'): + Name for the operation. Currently not used + + Returns: + output_values (:obj:`torch.Tensor` of shape [B1, B2, ..., Bn, num_segments, V1, V2, ..]): Tensor containing the + output values. output_index (:obj:`IndexMap`): IndexMap with shape [B1, B2, ..., Bn, num_segments]. + """ + return _segment_reduce(values, index, "sum", name) + + + def reduce_mean(values, index, name="segmented_reduce_mean"): + """ + Averages a tensor over its segments. + + Outputs 0 for empty segments. + + This operation computes the mean over segments, with support for: + - Batching using the first dimensions [B1, B2, ..., Bn]. Each element in a batch can have different indices. + - Vectorization using the last dimension [V1, V2, ...]. If they are present, the output will be a mean of + vectors rather than scalars. + + Only the middle dimensions [I1, ..., Ik] are reduced by the operation. + + Args: + values (:obj:`torch.Tensor` of shape [B1, B2, ..., Bn, I1, .., Ik, V1, V2, ..]): + Tensor containing the values of which the mean must be taken segment-wise. + index (:obj:`IndexMap`, indices are of shape [B1, B2, ..., Bn, I1, .., Ik].): + Index defining the segments. + name (:obj:`str`, `optional`, defaults to 'segmented_reduce_mean'): + Name for the operation. Currently not used + + Returns: + output_values (:obj:`torch.Tensor` of shape [B1, B2, ..., Bn, num_segments, V1, V2, ..]): Tensor containing the + output values. output_index (:obj:`IndexMap`): IndexMap with shape [B1, B2, ..., Bn, num_segments]. + """ + return _segment_reduce(values, index, "mean", name)
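+ + + # Editor's note: a minimal usage sketch of the segmented reductions (illustrative, not part of the original + # code), assuming a batch of one sequence of four tokens mapped onto two segments, and assuming that + # :class:`IndexMap` wraps ``num_segments`` into a tensor as the helpers above expect: + # + # index = IndexMap(indices=torch.tensor([[0, 0, 1, 1]]), num_segments=2, batch_dims=1) + # values = torch.tensor([[1.0, 2.0, 3.0, 4.0]]) + # sums, out_index = reduce_sum(values, index) # sums == tensor([[3., 7.]]) + # means, _ = reduce_mean(values, index) # means == tensor([[1.5, 3.5]])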
+ + + def reduce_max(values, index, name="segmented_reduce_max"): + """ + Computes the maximum over segments. + + This operation computes the maximum over segments, with support for: + - Batching using the first dimensions [B1, B2, ..., Bn]. Each element in a batch can have different indices. + - Vectorization using the last dimension [V1, V2, ...]. If they are present, the output will be an element-wise + maximum of vectors rather than scalars. + + Only the middle dimensions [I1, ..., Ik] are reduced by the operation. + + Args: + values (:obj:`torch.Tensor` of shape [B1, B2, ..., Bn, I1, .., Ik, V1, V2, ..]): + Tensor containing the values of which the max must be taken segment-wise. + index (:obj:`IndexMap`, indices are of shape [B1, B2, ..., Bn, I1, .., Ik].): + Index defining the segments. + name (:obj:`str`, `optional`, defaults to 'segmented_reduce_max'): + Name for the operation. Currently not used + + Returns: + output_values (:obj:`torch.FloatTensor` of shape [B1, B2, ..., Bn, num_segments, V1, V2, ..]): Tensor containing the + output values. output_index (:obj:`IndexMap`): IndexMap with shape [B1, B2, ..., Bn, num_segments]. + """ + return _segment_reduce(values, index, "max", name) + + + def reduce_min(values, index, name="segmented_reduce_min"): + """ + Computes the minimum over segments. + + This operation computes the minimum over segments, with support for: + - Batching using the first dimensions [B1, B2, ..., Bn]. Each element in a batch can have different indices. + - Vectorization using the last dimension [V1, V2, ...]. If they are present, the output will be an element-wise minimum + of vectors rather than scalars. + + Only the middle dimensions [I1, ..., Ik] are reduced by the operation. + + Args: + values (:obj:`torch.Tensor` of shape [B1, B2, ..., Bn, I1, .., Ik, V1, V2, ..]): + Tensor containing the values of which the min must be taken segment-wise. + index (:obj:`IndexMap`, indices are of shape [B1, B2, ..., Bn, I1, .., Ik].): + Index defining the segments. + name (:obj:`str`, `optional`, defaults to 'segmented_reduce_min'): + Name for the operation. Currently not used + + Returns: + output_values (:obj:`torch.FloatTensor` of shape [B1, B2, ..., Bn, num_segments, V1, V2, ..]): Tensor containing the + output values. output_index (:obj:`IndexMap`): IndexMap with shape [B1, B2, ..., Bn, num_segments]. + """ + return _segment_reduce(values, index, "min", name) + + + ### End of everything related to segmented tensors ### + + + def compute_column_logits( + sequence_output, column_output_weights, column_output_bias, cell_index, cell_mask, allow_empty_column_selection + ): + """ + Computes the column logits. + + Args: + sequence_output (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`): + Also known as last_hidden_state. Sequence of hidden-states at the output of the last layer of the model. + column_output_weights (:obj:`torch.FloatTensor` of shape :obj:`(hidden_size)`): + Weights of the linear layer for column selection. + column_output_bias (:obj:`torch.FloatTensor` of shape :obj:`()`): + Bias of the linear layer for column selection. + cell_index (:obj:`ProductIndexMap`): + Index that groups tokens into cells. + cell_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, max_num_rows * max_num_cols)`): + Mask for cells that exist in the table (i.e. that are not padding). + allow_empty_column_selection (:obj:`bool`): + Whether to allow the model to select no column. + + Returns: + column_logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, max_num_cols)`): Tensor containing the + column logits for every example in the batch.
+ """ + + # First, compute the token logits (batch_size, seq_len) - without temperature + token_logits = torch.einsum("bsj,j->bs", sequence_output, column_output_weights) + column_output_bias + + # Next, average the logits per cell (batch_size, max_num_cols*max_num_rows) + cell_logits, cell_logits_index = reduce_mean(token_logits, cell_index) + + # Finally, average the logits per column (batch_size, max_num_cols) + column_index = cell_index.project_inner(cell_logits_index) + column_logits, out_index = reduce_sum(cell_logits * cell_mask, column_index) + + cell_count, _ = reduce_sum(cell_mask, column_index) + column_logits /= cell_count + EPSILON_ZERO_DIVISION + + # Mask columns that do not appear in the example. + is_padding = torch.logical_and(cell_count < 0.5, ~torch.eq(out_index.indices, 0)) + column_logits += CLOSE_ENOUGH_TO_LOG_ZERO * torch.as_tensor( + is_padding, dtype=torch.float32, device=is_padding.device + ) + + if not allow_empty_column_selection: + column_logits += CLOSE_ENOUGH_TO_LOG_ZERO * torch.as_tensor( + torch.eq(out_index.indices, 0), dtype=torch.float32, device=out_index.indices.device + ) + + return column_logits + + +def _single_column_cell_selection_loss(token_logits, column_logits, label_ids, cell_index, col_index, cell_mask): + """ + Computes the loss for cell selection constrained to a single column. The loss is a hierarchical log-likelihood. The + model first predicts a column and then selects cells within that column (conditioned on the column). Cells outside + the selected column are never selected. + + Args: + token_logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`): + Tensor containing the logits per token. + column_logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, max_num_cols)`): + Tensor containing the logits per column. + label_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`): + Labels per token. + cell_index (:obj:`ProductIndexMap`): + Index that groups tokens into cells. + col_index (:obj:`IndexMap`): + Index that groups tokens into columns. + cell_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, max_num_rows * max_num_cols)`): + Mask for cells that exist in the table (i.e. that are not padding). + + Returns: + selection_loss_per_example (:obj:`torch.FloatTensor` of shape :obj:`(batch_size,)`): Loss for each example. + logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`): New logits which are only + allowed to select cells in a single column. Logits outside of the most likely column according to + `column_logits` will be set to a very low value (such that the probabilities are 0). + """ + ## Part 1: column loss + + # First find the column we should select. We use the column with maximum + # number of selected cells. + labels_per_column, _ = reduce_sum( + torch.as_tensor(label_ids, dtype=torch.float32, device=label_ids.device), col_index + ) + # shape of labels_per_column is (batch_size, max_num_cols). It contains the number of label ids for every column, for every example + column_label = torch.argmax(labels_per_column, dim=-1) # shape (batch_size,) + # Check if there are no selected cells in the column. In that case the model + # should predict the special column id 0, which means "select nothing". + no_cell_selected = torch.eq( + torch.max(labels_per_column, dim=-1)[0], 0 + ) # no_cell_selected is of shape (batch_size,) and equals True + # if an example of the batch has no cells selected (i.e. 
if there are no label_ids set to 1 for that example) + column_label = torch.where( + no_cell_selected.view(column_label.size()), torch.zeros_like(column_label), column_label + ) + + column_dist = torch.distributions.Categorical(logits=column_logits) # shape (batch_size, max_num_cols) + column_loss_per_example = -column_dist.log_prob(column_label) + + ## Part 2: cell loss + + # Reduce the labels and logits to per-cell from per-token. + # logits_per_cell: shape (batch_size, max_num_rows*max_num_cols) i.e. (batch_size, 64*32) + logits_per_cell, _ = reduce_mean(token_logits, cell_index) + # labels_per_cell: shape (batch_size, 64*32), indicating whether each cell should be selected (1) or not (0) + labels_per_cell, labels_index = reduce_max( + torch.as_tensor(label_ids, dtype=torch.long, device=label_ids.device), cell_index + ) + + # Mask for the selected column. + # column_id_for_cells: shape (batch_size, 64*32), indicating to which column each cell belongs + column_id_for_cells = cell_index.project_inner(labels_index).indices + # column_mask: shape (batch_size, 64*32), equal to 1 if cell belongs to column to be selected + column_mask = torch.as_tensor( + torch.eq(column_id_for_cells, torch.unsqueeze(column_label, dim=-1)), + dtype=torch.float32, + device=cell_mask.device, + ) + + # Compute the log-likelihood for cells, but only for the selected column. + cell_dist = torch.distributions.Bernoulli(logits=logits_per_cell) # shape (batch_size, 64*32) + cell_log_prob = cell_dist.log_prob(labels_per_cell.type(torch.float32)) # shape(batch_size, 64*32) + + cell_loss = -torch.sum(cell_log_prob * column_mask * cell_mask, dim=1) + + # We need to normalize the loss by the number of cells in the column. + cell_loss /= torch.sum(column_mask * cell_mask, dim=1) + EPSILON_ZERO_DIVISION + + selection_loss_per_example = column_loss_per_example + selection_loss_per_example += torch.where( + no_cell_selected.view(selection_loss_per_example.size()), + torch.zeros_like(selection_loss_per_example), + cell_loss, + ) + + # Set the probs outside the selected column (selected by the *model*) + # to 0. This ensures backwards compatibility with models that select + # cells from multiple columns. + selected_column_id = torch.as_tensor( + torch.argmax(column_logits, dim=-1), dtype=torch.long, device=column_logits.device + ) # shape (batch_size,) + + # selected_column_mask: shape (batch_size, 64*32), equal to 1 if cell belongs to column selected by the model + selected_column_mask = torch.as_tensor( + torch.eq(column_id_for_cells, torch.unsqueeze(selected_column_id, dim=-1)), + dtype=torch.float32, + device=selected_column_id.device, + ) + + # Never select cells with the special column id 0. + selected_column_mask = torch.where( + torch.eq(column_id_for_cells, 0).view(selected_column_mask.size()), + torch.zeros_like(selected_column_mask), + selected_column_mask, + ) + new_logits_per_cell = logits_per_cell + CLOSE_ENOUGH_TO_LOG_ZERO * (1.0 - cell_mask * selected_column_mask) + logits = gather(new_logits_per_cell, cell_index) + + return selection_loss_per_example, logits + + +def compute_token_logits(sequence_output, temperature, output_weights, output_bias): + """ + Computes logits per token + + Args: + sequence_output (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`): + Also known as last_hidden_state. Sequence of hidden-states at the output of the last layer of the model. + temperature (:obj:`float`): + Temperature for the Bernoulli distribution. 
+ output_weights (:obj:`torch.FloatTensor` of shape :obj:`(hidden_size,)`): + Weights of the linear layer for cell selection. + output_bias (:obj:`torch.FloatTensor` of shape :obj:`()`): + Bias of the linear layer for cell selection. + + Returns: + logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`): Logits per token. + """ + logits = (torch.einsum("bsj,j->bs", sequence_output, output_weights) + output_bias) / temperature + + return logits + + + def _calculate_aggregate_mask(answer, pooled_output, cell_selection_preference, label_ids, aggregation_classifier): + """ + Finds examples where the model should select cells with no aggregation. + + Returns a mask that determines for which examples the model should select answers directly from the table, without + any aggregation function. If the answer is a piece of text, the case is unambiguous, as aggregation functions only + apply to numbers. If the answer is a number but does not appear in the table, then we must use some aggregation + function. The ambiguous case is when the answer is a number that also appears in the table. In this case we use the + aggregation function probabilities predicted by the model to decide whether to select or aggregate. The threshold + for this is the hyperparameter :obj:`cell_selection_preference`. + + Args: + answer (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, )`): + Answer for every example in the batch. NaN if there is no scalar answer. + pooled_output (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, hidden_size)`): + Output of the pooler (BertPooler) on top of the encoder layer. + cell_selection_preference (:obj:`float`): + Preference for cell selection in ambiguous cases. + label_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`): + Labels per token. + aggregation_classifier (:obj:`torch.nn.Linear`): + Aggregation head. + + Returns: + aggregate_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size,)`): A mask set to 1 for examples that + should use aggregation functions. + """ + # torch.FloatTensor(batch_size,) + aggregate_mask_init = torch.logical_not(torch.isnan(answer)).type(torch.FloatTensor).to(answer.device) + logits_aggregation = aggregation_classifier(pooled_output) + dist_aggregation = torch.distributions.categorical.Categorical(logits=logits_aggregation) + # Index 0 corresponds to "no aggregation". + aggregation_ops_total_mass = torch.sum(dist_aggregation.probs[:, 1:], dim=1) + + # Cell selection examples according to current model. + is_pred_cell_selection = aggregation_ops_total_mass <= cell_selection_preference + + # Examples with non-empty cell selection supervision. + is_cell_supervision_available = torch.sum(label_ids, dim=1) > 0 + + # torch.where is not equivalent to tf.where (in tensorflow 1) + # hence the added .view on the condition to match the shape of the first tensor + aggregate_mask = torch.where( + torch.logical_and(is_pred_cell_selection, is_cell_supervision_available).view(aggregate_mask_init.size()), + torch.zeros_like(aggregate_mask_init, dtype=torch.float32), + aggregate_mask_init, + ) + + aggregate_mask = aggregate_mask.detach() + + return aggregate_mask + + + def _calculate_aggregation_loss_known( + logits_aggregation, aggregate_mask, aggregation_labels, use_answer_as_supervision, num_aggregation_labels + ): + """ + Calculates aggregation loss when its type is known during training. + + In the weakly supervised setting, the only known information is that for cell selection examples, "no aggregation" + should be predicted.
For other examples (those that require aggregation), no loss is accumulated. In the setting + where aggregation type is always known, standard cross entropy loss is accumulated for all examples. + + Args: + logits_aggregation (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, num_aggregation_labels)`): + Logits per aggregation operation. + aggregate_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, )`): + A mask set to 1 for examples that should use aggregation functions. + aggregation_labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, )`): + Aggregation function id for every example in the batch. + use_answer_as_supervision (:obj:`bool`, `optional`): + Whether to use the answer as the only supervision for aggregation examples. + num_aggregation_labels (:obj:`int`, `optional`, defaults to 0): + The number of aggregation operators to predict. + + Returns: + aggregation_loss_known (:obj:`torch.FloatTensor` of shape :obj:`(batch_size,)`): Aggregation loss (when its + type is known during training) per example. + """ + if use_answer_as_supervision: + # Prepare "no aggregation" targets for cell selection examples. + target_aggregation = torch.zeros_like(aggregate_mask, dtype=torch.long) + else: + # Use aggregation supervision as the target. + target_aggregation = aggregation_labels + + one_hot_labels = torch.nn.functional.one_hot(target_aggregation, num_classes=num_aggregation_labels).type( + torch.float32 + ) + log_probs = torch.nn.functional.log_softmax(logits_aggregation, dim=-1) + + # torch.FloatTensor[batch_size] + per_example_aggregation_intermediate = -torch.sum(one_hot_labels * log_probs, dim=-1) + if use_answer_as_supervision: + # Accumulate loss only for examples requiring cell selection + # (no aggregation). + return per_example_aggregation_intermediate * (1 - aggregate_mask) + else: + return per_example_aggregation_intermediate + + + def _calculate_aggregation_loss_unknown(logits_aggregation, aggregate_mask): + """ + Calculates aggregation loss in the case of answer supervision. + + Args: + logits_aggregation (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, num_aggregation_labels)`): + Logits per aggregation operation. + aggregate_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, )`): + A mask set to 1 for examples that should use aggregation functions. + + Returns: + aggregation_loss_unknown (:obj:`torch.FloatTensor` of shape :obj:`(batch_size,)`): Aggregation loss (in case of + answer supervision) per example. + """ + dist_aggregation = torch.distributions.categorical.Categorical(logits=logits_aggregation) + # Index 0 corresponds to "no aggregation". + aggregation_ops_total_mass = torch.sum(dist_aggregation.probs[:, 1:], dim=1) + # Predict some aggregation in case of an answer that needs aggregation. + # This increases the probability of all aggregation functions, in a way + # similar to MML, but without considering whether the function gives the + # correct answer. + return -torch.log(aggregation_ops_total_mass) * aggregate_mask
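+ + + # Editor's note: a small worked example (illustrative, not part of the original code). For + # logits_aggregation = torch.tensor([[2.0, 0.0, 0.0, 0.0]]), the probability mass on the aggregation + # operators (indices 1:) is ~0.29, so an example with aggregate_mask == 1 contributes + # -log(0.29) ~= 1.24 to the loss above, pushing probability mass away from the "no aggregation" slot.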
+ + + def _calculate_aggregation_loss( + logits_aggregation, aggregate_mask, aggregation_labels, use_answer_as_supervision, num_aggregation_labels, + aggregation_loss_weight + ): + """ + Calculates the aggregation loss per example. + + Args: + logits_aggregation (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, num_aggregation_labels)`): + Logits per aggregation operation. + aggregate_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, )`): + A mask set to 1 for examples that should use aggregation functions. + aggregation_labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, )`): + Aggregation function id for every example in the batch. + use_answer_as_supervision (:obj:`bool`, `optional`): + Whether to use the answer as the only supervision for aggregation examples. + num_aggregation_labels (:obj:`int`, `optional`, defaults to 0): + The number of aggregation operators to predict. + aggregation_loss_weight (:obj:`float`, `optional`, defaults to 1.0): + Importance weight for the aggregation loss. + + Returns: + aggregation_loss (:obj:`torch.FloatTensor` of shape :obj:`(batch_size,)`): Aggregation loss per example. + """ + per_example_aggregation_loss = _calculate_aggregation_loss_known( + logits_aggregation, aggregate_mask, aggregation_labels, use_answer_as_supervision, num_aggregation_labels + ) + + if use_answer_as_supervision: + # Add aggregation loss for numeric answers that need aggregation. + per_example_aggregation_loss += _calculate_aggregation_loss_unknown(logits_aggregation, aggregate_mask) + return aggregation_loss_weight * per_example_aggregation_loss + + + def _calculate_expected_result( + dist_per_cell, numeric_values, numeric_values_scale, input_mask_float, logits_aggregation, config + ): + """ + Calculates the expected result given cell and aggregation probabilities. + + Args: + dist_per_cell (:obj:`torch.distributions.Bernoulli`): + Cell selection distribution for each cell. + numeric_values (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, seq_length)`): + Numeric values of every token. NaN for tokens that are not numeric values. + numeric_values_scale (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, seq_length)`): + Scale of the numeric values of every token. + input_mask_float (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, seq_length)`): + Mask for the table, without question tokens and table headers. + logits_aggregation (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, num_aggregation_labels)`): + Logits per aggregation operation. + config (:class:`~transformers.TapasConfig`): + Model configuration class with all the hyperparameters of the model. + + Returns: + expected_result (:obj:`torch.FloatTensor` of shape :obj:`(batch_size,)`): The expected result per example. + """ + if config.use_gumbel_for_cells: + gumbel_dist = torch.distributions.RelaxedBernoulli( + # The token logits were already divided by the temperature and used for + # computing cell selection errors, so we need to multiply by the temperature again here + temperature=config.temperature, + logits=dist_per_cell.logits * config.temperature, + ) + scaled_probability_per_cell = gumbel_dist.sample() + else: + scaled_probability_per_cell = dist_per_cell.probs + + # [batch_size, seq_length] + scaled_probability_per_cell = (scaled_probability_per_cell / numeric_values_scale) * input_mask_float + count_result = torch.sum(scaled_probability_per_cell, dim=1) + numeric_values_masked = torch.where( + torch.isnan(numeric_values), torch.zeros_like(numeric_values), numeric_values + ) # Mask non-numeric table values to zero.
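+ # Editor's note: an illustrative sketch (not part of the original code). With cell probabilities + # [0.9, 0.8, 0.1] (scale 1) over cell values [2., 3., 5.], count_result above is 1.8 and sum_result + # below is 0.9*2 + 0.8*3 + 0.1*5 = 4.7; the head then mixes sum, average and count according to the + # predicted aggregation probabilities.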
+ sum_result = torch.sum(scaled_probability_per_cell * numeric_values_masked, dim=1) + avg_approximation = config.average_approximation_function + if avg_approximation == AverageApproximationFunction.RATIO: + average_result = sum_result / (count_result + EPSILON_ZERO_DIVISION) + elif avg_approximation == AverageApproximationFunction.FIRST_ORDER: + # The sum of all probabilities except those that correspond to other cells + ex = torch.sum(scaled_probability_per_cell, dim=1, keepdim=True) - scaled_probability_per_cell + 1 + average_result = torch.sum(numeric_values_masked * scaled_probability_per_cell / ex, dim=1) + elif avg_approximation == AverageApproximationFunction.SECOND_ORDER: + # The sum of all probabilities except those that correspond to other cells + ex = torch.sum(scaled_probability_per_cell, dim=1, keepdim=True) - scaled_probability_per_cell + 1 + pointwise_var = scaled_probability_per_cell * (1 - scaled_probability_per_cell) + var = torch.sum(pointwise_var, dim=1, keepdim=True) - pointwise_var + + multiplier = (var / torch.square(ex) + 1) / ex + average_result = torch.sum(numeric_values_masked * scaled_probability_per_cell * multiplier, dim=1) + else: + raise ValueError(f"Invalid average_approximation_function: {config.average_approximation_function}") + + if config.use_gumbel_for_aggregation: + gumbel_dist = torch.distributions.RelaxedOneHotCategorical( + config.aggregation_temperature, logits=logits_aggregation[:, 1:] + ) + # [batch_size, num_aggregation_labels - 1] + aggregation_op_only_probs = gumbel_dist.sample() + else: + # [batch_size, num_aggregation_labels - 1] + aggregation_op_only_probs = torch.nn.functional.softmax( + logits_aggregation[:, 1:] / config.aggregation_temperature, dim=-1 + ) + + all_results = torch.cat( + [ + torch.unsqueeze(sum_result, dim=1), + torch.unsqueeze(average_result, dim=1), + torch.unsqueeze(count_result, dim=1), + ], + dim=1, + ) + + expected_result = torch.sum(all_results * aggregation_op_only_probs, dim=1) + return expected_result + + + # PyTorch does not currently support Huber loss with a custom delta, so we define it ourselves + def huber_loss(input, target, delta: float = 1.0): + errors = torch.abs(input - target) # shape (batch_size,) + return torch.where(errors < delta, 0.5 * errors ** 2, errors * delta - (0.5 * delta ** 2)) + + + def _calculate_regression_loss( + answer, + aggregate_mask, + dist_per_cell, + numeric_values, + numeric_values_scale, + input_mask_float, + logits_aggregation, + config, + ): + """ + Calculates the regression loss per example. + + Args: + answer (:obj:`torch.FloatTensor` of shape :obj:`(batch_size,)`): + Answer for every example in the batch. NaN if there is no scalar answer. + aggregate_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size,)`): + A mask set to 1 for examples that should use aggregation functions. + dist_per_cell (:obj:`torch.distributions.Bernoulli`): + Cell selection distribution for each cell. + numeric_values (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, seq_length)`): + Numeric values of every token. NaN for tokens that are not numeric values. + numeric_values_scale (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, seq_length)`): + Scale of the numeric values of every token. + input_mask_float (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, seq_length)`): + Mask for the table, without question tokens and table headers. + logits_aggregation (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, num_aggregation_labels)`): + Logits per aggregation operation.
+ config (:class:`~transformers.TapasConfig`): + Model configuration class with all the parameters of the model + + Returns: + per_example_answer_loss_scaled (:obj:`torch.FloatTensor` of shape :obj:`(batch_size,)`): Scales answer loss for + each example in the batch. large_answer_loss_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size,)`): A + mask which is 1 for examples for which their answer loss is larger than the answer_loss_cutoff. + """ + # [batch_size] + expected_result = _calculate_expected_result( + dist_per_cell, numeric_values, numeric_values_scale, input_mask_float, logits_aggregation, config + ) + + # [batch_size] + answer_masked = torch.where(torch.isnan(answer), torch.zeros_like(answer), answer) + + if config.use_normalized_answer_loss: + normalizer = (torch.max(torch.abs(expected_result), torch.abs(answer_masked)) + EPSILON_ZERO_DIVISION).detach() + + normalized_answer_masked = answer_masked / normalizer + normalized_expected_result = expected_result / normalizer + per_example_answer_loss = huber_loss( + normalized_expected_result * aggregate_mask, normalized_answer_masked * aggregate_mask + ) + else: + per_example_answer_loss = huber_loss( + expected_result * aggregate_mask, answer_masked * aggregate_mask, delta=config.huber_loss_delta + ) + + if config.answer_loss_cutoff is None: + large_answer_loss_mask = torch.ones_like(per_example_answer_loss, dtype=torch.float32) + + else: + large_answer_loss_mask = torch.where( + per_example_answer_loss > config.answer_loss_cutoff, + torch.zeros_like(per_example_answer_loss, dtype=torch.float32), + torch.ones_like(per_example_answer_loss, dtype=torch.float32), + ) + per_example_answer_loss_scaled = config.answer_loss_importance * (per_example_answer_loss * aggregate_mask) + + return per_example_answer_loss_scaled, large_answer_loss_mask \ No newline at end of file diff --git a/src/transformers/tokenization_auto.py b/src/transformers/tokenization_auto.py index 9cadfdfb3690..86c42b6e490b 100644 --- a/src/transformers/tokenization_auto.py +++ b/src/transformers/tokenization_auto.py @@ -50,6 +50,7 @@ RobertaConfig, SqueezeBertConfig, T5Config, + TapasConfig, TransfoXLConfig, XLMConfig, XLMProphetNetConfig, @@ -85,6 +86,7 @@ from .tokenization_retribert import RetriBertTokenizer from .tokenization_roberta import RobertaTokenizer from .tokenization_squeezebert import SqueezeBertTokenizer +from .tokenization_tapas import TapasTokenizer from .tokenization_transfo_xl import TransfoXLTokenizer from .tokenization_xlm import XLMTokenizer from .utils import logging @@ -210,6 +212,7 @@ (RagConfig, (RagTokenizer, None)), (XLMProphetNetConfig, (XLMProphetNetTokenizer, None)), (ProphetNetConfig, (ProphetNetTokenizer, None)), + (TapasConfig, (TapasTokenizer, None)), ] ) diff --git a/src/transformers/tokenization_tapas.py b/src/transformers/tokenization_tapas.py new file mode 100644 index 000000000000..cc30f74620d2 --- /dev/null +++ b/src/transformers/tokenization_tapas.py @@ -0,0 +1,2766 @@ +# coding=utf-8 +# Copyright 2020 Google Research and The HuggingFace Inc. team. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. +""" Tokenization class for TAPAS model.""" + + +import ast +import collections +import datetime +import enum +import itertools +import math +import os +import re +import unicodedata +from dataclasses import dataclass +from typing import Callable, Dict, Generator, List, Optional, Text, Tuple, Union + +import pandas as pd +import torch +from transformers import add_end_docstrings + +from .tokenization_utils import PreTrainedTokenizer, _is_control, _is_punctuation, _is_whitespace +from .tokenization_utils_base import ( + BatchEncoding, + EncodedInput, + PaddingStrategy, + PreTokenizedInput, + TensorType, + TextInput, + ExplicitEnum, ENCODE_KWARGS_DOCSTRING, +) +from .utils import logging + + +logger = logging.get_logger(__name__) + + +VOCAB_FILES_NAMES = {"vocab_file": "vocab.txt"} + +PRETRAINED_VOCAB_FILES_MAP = { + "vocab_file": { + "nielsr/tapas-base-finetuned-sqa": "https://huggingface.co/bert-large-uncased/resolve/main/vocab.txt", + "nielsr/tapas-base-finetuned-wtq": "https://huggingface.co/bert-base-uncased/resolve/main/vocab.txt", + "nielsr/tapas-base-finetuned-wikisql-supervised": "https://huggingface.co/bert-large-uncased/resolve/main/vocab.txt", + } +} + +PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = { + "nielsr/tapas-base-finetuned-sqa": 1024, + "nielsr/tapas-base-finetuned-wtq": 1024, + "nielsr/tapas-base-finetuned-wikisql-supervised": 1024, +} + + +PRETRAINED_INIT_CONFIGURATION = { + "nielsr/tapas-base-finetuned-sqa": {"do_lower_case": True}, + "nielsr/tapas-base-finetuned-wtq": {"do_lower_case": True}, + "nielsr/tapas-base-finetuned-wikisql-supervised": {"do_lower_case": True}, +} + + +class TapasTruncationStrategy(ExplicitEnum): + """ + Possible values for the ``truncation`` argument in :meth:`~transformers.TapasTokenizer.__call__`. Useful for + tab-completion in an IDE. + """ + + DROP_ROWS_TO_FIT = "drop_rows_to_fit" + DO_NOT_TRUNCATE = "do_not_truncate" + + +TableValue = collections.namedtuple("TokenValue", ["token", "column_id", "row_id"]) + + +@dataclass(frozen=True) +class TokenCoordinates: + column_index: int + row_index: int + token_index: int + + +@dataclass +class TokenizedTable: + rows: List[List[List[Text]]] + selected_tokens: List[TokenCoordinates] + + +@dataclass(frozen=True) +class SerializedExample: + tokens: List[Text] + column_ids: List[int] + row_ids: List[int] + segment_ids: List[int] + + +def _is_inner_wordpiece(token: Text): + return token.startswith("##") + + +def load_vocab(vocab_file): + """Loads a vocabulary file into a dictionary.""" + vocab = collections.OrderedDict() + with open(vocab_file, "r", encoding="utf-8") as reader: + tokens = reader.readlines() + for index, token in enumerate(tokens): + token = token.rstrip("\n") + vocab[token] = index + return vocab + + +def whitespace_tokenize(text): + """Runs basic whitespace cleaning and splitting on a piece of text.""" + text = text.strip() + if not text: + return [] + tokens = text.split() + return tokens + +TAPAS_ENCODE_PLUS_ADDITIONAL_KWARGS_DOCSTRING = r""" + add_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`True`): + Whether or not to encode the sequences with the special tokens relative to their model. + padding (:obj:`bool`, :obj:`str` or :class:`~transformers.tokenization_utils_base.PaddingStrategy`, `optional`, defaults to :obj:`False`): + Activates and controls padding. 
Accepts the following values: + + * :obj:`True` or :obj:`'longest'`: Pad to the longest sequence in the batch (or no padding if only a + single sequence if provided). + * :obj:`'max_length'`: Pad to a maximum length specified with the argument :obj:`max_length` or to the + maximum acceptable input length for the model if that argument is not provided. + * :obj:`False` or :obj:`'do_not_pad'` (default): No padding (i.e., can output a batch with sequences of + different lengths). + truncation (:obj:`bool`, :obj:`str` or :class:`~transformers.TapasTruncationStrategy`, `optional`, defaults to :obj:`False`): + Activates and controls truncation. Accepts the following values: + + * :obj:`True` or :obj:`'drop_rows_to_fit'`: Truncate to a maximum length specified with the argument + :obj:`max_length` or to the maximum acceptable input length for the model if that argument is not + provided. This will truncate row by row, removing rows from the table. + * :obj:`False` or :obj:`'do_not_truncate'` (default): No truncation (i.e., can output batch with + sequence lengths greater than the model maximum admissible input size). + max_length (:obj:`int`, `optional`): + Controls the maximum length to use by one of the truncation/padding parameters. + + If left unset or set to :obj:`None`, this will use the predefined model maximum length if a maximum + length is required by one of the truncation/padding parameters. If the model has no specific maximum + input length (like XLNet) truncation/padding to a maximum length will be deactivated. + is_split_into_words (:obj:`bool`, `optional`, defaults to :obj:`False`): + Whether or not the input is already pre-tokenized (e.g., split into words), in which case the tokenizer + will skip the pre-tokenization step. This is useful for NER or token classification. + pad_to_multiple_of (:obj:`int`, `optional`): + If set will pad the sequence to a multiple of the provided value. This is especially useful to enable + the use of Tensor Cores on NVIDIA hardware with compute capability >= 7.5 (Volta). + return_tensors (:obj:`str` or :class:`~transformers.tokenization_utils_base.TensorType`, `optional`): + If set, will return tensors instead of list of python integers. Acceptable values are: + + * :obj:`'tf'`: Return TensorFlow :obj:`tf.constant` objects. + * :obj:`'pt'`: Return PyTorch :obj:`torch.Tensor` objects. + * :obj:`'np'`: Return Numpy :obj:`np.ndarray` objects. +""" + + +class TapasTokenizer(PreTrainedTokenizer): + r""" + Construct a TAPAS tokenizer. Based on WordPiece. Flattens a table and one or more related sentences to be used by + TAPAS models. + + This tokenizer inherits from :class:`~transformers.PreTrainedTokenizer` which contains most of the main methods. + Users should refer to this superclass for more information regarding those methods. + :class:`~transformers.TapasTokenizer` creates several token type ids to encode tabular structure. To be more + precise, it adds 7 token type ids, in the following order: :obj:`segment_ids`, :obj:`column_ids`, :obj:`row_ids`, + :obj:`prev_label_ids`, :obj:`column_ranks`, :obj:`inv_column_ranks` and :obj:`numeric_relations`: + + - segment_ids: indicate whether a token belongs to the question (0) or the table (1). 0 for special tokens and + padding. + - column_ids: indicate to which column of the table a token belongs (starting from 1). Is 0 for all question + tokens, special tokens and padding. + - row_ids: indicate to which row of the table a token belongs (starting from 1). 
Is 0 for all question tokens, + special tokens and padding. Tokens of column headers are also 0. + - prev_label_ids: indicate whether a token was (part of) an answer to the previous question (1) or not (0). Useful + in a conversational setup (such as SQA). + - column_ranks: indicate the rank of a table token relative to a column, if applicable. For example, if you have a + column "number of movies" with values 87, 53 and 69, then the column ranks of these tokens are 3, 1 and 2 respectively. + 0 for all question tokens, special tokens and padding. + - inv_column_ranks: indicate the inverse rank of a table token relative to a column, if applicable. For example, if + you have a column "number of movies" with values 87, 53 and 69, then the inverse column ranks of these tokens are 1, 3 and + 2 respectively. 0 for all question tokens, special tokens and padding. + - numeric_relations: indicate numeric relations between the question and the tokens of the table. 0 for all + question tokens, special tokens and padding. + + :class:`~transformers.TapasTokenizer` runs end-to-end tokenization on a table and associated sentences: punctuation + splitting and wordpiece. + + Args: + vocab_file (:obj:`str`): + File containing the vocabulary. + do_lower_case (:obj:`bool`, `optional`, defaults to :obj:`True`): + Whether or not to lowercase the input when tokenizing. + do_basic_tokenize (:obj:`bool`, `optional`, defaults to :obj:`True`): + Whether or not to do basic tokenization before WordPiece. + never_split (:obj:`Iterable`, `optional`): + Collection of tokens which will never be split during tokenization. Only has an effect when + :obj:`do_basic_tokenize=True` + unk_token (:obj:`str`, `optional`, defaults to :obj:`"[UNK]"`): + The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this + token instead. + sep_token (:obj:`str`, `optional`, defaults to :obj:`"[SEP]"`): + The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for + sequence classification or for a text and a question for question answering. It is also used as the last + token of a sequence built with special tokens. + pad_token (:obj:`str`, `optional`, defaults to :obj:`"[PAD]"`): + The token used for padding, for example when batching sequences of different lengths. + cls_token (:obj:`str`, `optional`, defaults to :obj:`"[CLS]"`): + The classifier token which is used when doing sequence classification (classification of the whole sequence + instead of per-token classification). It is the first token of the sequence when built with special tokens. + mask_token (:obj:`str`, `optional`, defaults to :obj:`"[MASK]"`): + The token used for masking values. This is the token used when training this model with masked language + modeling. This is the token which the model will try to predict. + empty_token (:obj:`str`, `optional`, defaults to :obj:`"[EMPTY]"`): + The token used for empty cell values in a table. Empty cell values include "", "n/a", "nan" and "?". + tokenize_chinese_chars (:obj:`bool`, `optional`, defaults to :obj:`True`): + Whether or not to tokenize Chinese characters. This should likely be deactivated for Japanese (see this + `issue `__). + strip_accents: (:obj:`bool`, `optional`): + Whether or not to strip all accents. If this option is not specified, then it will be determined by the + value for :obj:`lowercase` (as in the original BERT). 
+ cell_trim_length (:obj:`int`, `optional`, defaults to -1): + If > 0: Trim cells so that the length is <= this value. Also disables further cell trimming, should thus be + used with 'drop_rows_to_fit' below. + max_column_id (:obj:`int`, `optional`): + Max column id to extract. + max_row_id (:obj:`int`, `optional`): + Max row id to extract. + strip_column_names (:obj:`bool`, `optional`, defaults to :obj:`False`): + Whether to add empty strings instead of column names. + update_answer_coordinates (:obj:`bool`, `optional`, defaults to :obj:`False`): + Whether to recompute the answer coordinates from the answer text. + drop_rows_to_fit (:obj:`bool`, `optional`, defaults to :obj:`False`): + Whether to drop the last rows if a table doesn't fit within max sequence length. + + """ + + vocab_files_names = VOCAB_FILES_NAMES + pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP + max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES + pretrained_init_configuration = PRETRAINED_INIT_CONFIGURATION + + def __init__( + self, + vocab_file, + do_lower_case=True, + do_basic_tokenize=True, + never_split=None, + unk_token="[UNK]", + sep_token="[SEP]", + pad_token="[PAD]", + cls_token="[CLS]", + mask_token="[MASK]", + empty_token="[EMPTY]", + tokenize_chinese_chars=True, + strip_accents=None, + cell_trim_length: int = -1, + max_column_id: int = None, + max_row_id: int = None, + strip_column_names: bool = False, + update_answer_coordinates: bool = False, + drop_rows_to_fit: bool = False, + model_max_length: int = 512, + additional_special_tokens: Optional[List[str]] = None, + **kwargs + ): + if additional_special_tokens is not None: + if empty_token not in additional_special_tokens: + additional_special_tokens.append(empty_token) + else: + additional_special_tokens = [empty_token] + + super().__init__( + do_lower_case=do_lower_case, + do_basic_tokenize=do_basic_tokenize, + never_split=never_split, + unk_token=unk_token, + sep_token=sep_token, + pad_token=pad_token, + cls_token=cls_token, + mask_token=mask_token, + empty_token=empty_token, + tokenize_chinese_chars=tokenize_chinese_chars, + strip_accents=strip_accents, + cell_trim_length=cell_trim_length, + max_column_id=max_column_id, + max_row_id=max_row_id, + strip_column_names=strip_column_names, + update_answer_coordinates=update_answer_coordinates, + drop_rows_to_fit=drop_rows_to_fit, + model_max_length=model_max_length, + additional_special_tokens=additional_special_tokens, + **kwargs, + ) + + if not os.path.isfile(vocab_file): + raise ValueError( + "Can't find a vocabulary file at path '{}'. 
To load the vocabulary from a Google pretrained " + "model use `tokenizer = BertTokenizer.from_pretrained(PRETRAINED_MODEL_NAME)`".format(vocab_file) + ) + self.vocab = load_vocab(vocab_file) + self.ids_to_tokens = collections.OrderedDict([(ids, tok) for tok, ids in self.vocab.items()]) + self.do_basic_tokenize = do_basic_tokenize + if do_basic_tokenize: + self.basic_tokenizer = BasicTokenizer( + do_lower_case=do_lower_case, + never_split=never_split, + tokenize_chinese_chars=tokenize_chinese_chars, + strip_accents=strip_accents, + ) + self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab, unk_token=self.unk_token) + + # Additional properties + self.cell_trim_length = cell_trim_length + self.max_column_id = max_column_id if max_column_id is not None else self.model_max_length + self.max_row_id = max_row_id if max_row_id is not None else self.model_max_length + self.strip_column_names = strip_column_names + self.update_answer_coordinates = update_answer_coordinates + self.drop_rows_to_fit = drop_rows_to_fit + + @property + def do_lower_case(self): + return self.basic_tokenizer.do_lower_case + + @property + def vocab_size(self): + return len(self.vocab) + + def get_vocab(self): + return dict(self.vocab, **self.added_tokens_encoder) + + def _tokenize(self, text): + if format_text(text) == EMPTY_TEXT: + return [self.additional_special_tokens[0]] + split_tokens = [] + if self.do_basic_tokenize: + for token in self.basic_tokenizer.tokenize(text, never_split=self.all_special_tokens): + + # If the token is part of the never_split set + if token in self.basic_tokenizer.never_split: + split_tokens.append(token) + else: + split_tokens += self.wordpiece_tokenizer.tokenize(token) + else: + split_tokens = self.wordpiece_tokenizer.tokenize(text) + return split_tokens + + def _convert_token_to_id(self, token): + """ Converts a token (str) in an id using the vocab. """ + return self.vocab.get(token, self.vocab.get(self.unk_token)) + + def _convert_id_to_token(self, index): + """Converts an index (integer) in a token (str) using the vocab.""" + return self.ids_to_tokens.get(index, self.unk_token) + + def convert_tokens_to_string(self, tokens): + """ Converts a sequence of tokens (string) in a single string. """ + out_string = " ".join(tokens).replace(" ##", "").strip() + return out_string + + def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> Tuple[str]: + index = 0 + if os.path.isdir(save_directory): + vocab_file = os.path.join( + save_directory, (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["vocab_file"] + ) + else: + vocab_file = (filename_prefix + "-" if filename_prefix else "") + save_directory + with open(vocab_file, "w", encoding="utf-8") as writer: + for token, token_index in sorted(self.vocab.items(), key=lambda kv: kv[1]): + if index != token_index: + logger.warning( + f"Saving vocabulary to {vocab_file}: vocabulary indices are not consecutive." + " Please check that the vocabulary is not corrupted!" + ) + index = token_index + writer.write(token + "\n") + index += 1 + return (vocab_file,) + + def create_attention_mask_from_sequences(self, query_ids: List[int], table_values: List[TableValue]) -> List[int]: + """ + Creates the attention mask according to the query token IDs and a list of table values. + + Args: + query_ids (:obj:`List[int]`): list of token IDs corresponding to the ID. 
+ table_values (:obj:`List[TableValue]`): list of table values, which are named tuples containing the + token value, the column ID and the row ID of said token. + + Returns: + :obj:`List[int]`: List of ints containing the attention mask values. + """ + return [1] * (1 + len(query_ids) + 1 + len(table_values)) + + def create_segment_token_type_ids_from_sequences( + self, query_ids: List[int], table_values: List[TableValue] + ) -> List[int]: + """ + Creates the segment token type IDs according to the query token IDs and a list of table values. + + Args: + query_ids (:obj:`List[int]`): list of token IDs corresponding to the query. + table_values (:obj:`List[TableValue]`): list of table values, which are named tuples containing the + token value, the column ID and the row ID of said token. + + Returns: + :obj:`List[int]`: List of ints containing the segment token type ID values. + """ + table_ids = list(zip(*table_values))[0] if table_values else [] + return [0] * (1 + len(query_ids) + 1) + [1] * len(table_ids) + + def create_column_token_type_ids_from_sequences( + self, query_ids: List[int], table_values: List[TableValue] + ) -> List[int]: + """ + Creates the column token type IDs according to the query token IDs and a list of table values. + + Args: + query_ids (:obj:`List[int]`): list of token IDs corresponding to the query. + table_values (:obj:`List[TableValue]`): list of table values, which are named tuples containing the + token value, the column ID and the row ID of said token. + + Returns: + :obj:`List[int]`: List of ints containing the column token type ID values. + """ + table_column_ids = list(zip(*table_values))[1] if table_values else [] + return [0] * (1 + len(query_ids) + 1) + list(table_column_ids) + + def create_row_token_type_ids_from_sequences( + self, query_ids: List[int], table_values: List[TableValue] + ) -> List[int]: + """ + Creates the row token type IDs according to the query token IDs and a list of table values. + + Args: + query_ids (:obj:`List[int]`): list of token IDs corresponding to the query. + table_values (:obj:`List[TableValue]`): list of table values, which are named tuples containing the + token value, the column ID and the row ID of said token. + + Returns: + :obj:`List[int]`: List of ints containing the row token type ID values. + """ + table_row_ids = list(zip(*table_values))[2] if table_values else [] + return [0] * (1 + len(query_ids) + 1) + list(table_row_ids) + + def build_inputs_with_special_tokens( + self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None + ) -> List[int]: + """ + Build model inputs from a question and a flattened table for question answering or sequence classification tasks by concatenating and + adding special tokens. + + Args: + token_ids_0 (:obj:`List[int]`): The ids of the question. + token_ids_1 (:obj:`List[int]`, `optional`): The ids of the flattened table. + + Returns: + :obj:`List[int]`: The model input with special tokens. + """ + if token_ids_1 is None: + raise ValueError("With TAPAS, you must provide both question IDs and table IDs.") + + return [self.cls_token_id] + token_ids_0 + [self.sep_token_id] + token_ids_1 + + def get_special_tokens_mask( + self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False + ) -> List[int]: + """ + Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding + special tokens using the tokenizer ``prepare_for_model`` method.
+ + Args: + token_ids_0 (:obj:`List[int]`): + List of question IDs. + token_ids_1 (:obj:`List[int]`, `optional`): + List of flattened table IDs. + already_has_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`False`): + Whether or not the token list is already formatted with special tokens for the model. + + Returns: + :obj:`List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token. + """ + + if already_has_special_tokens: + if token_ids_1 is not None: + raise ValueError( + "You should not supply a second sequence if the provided sequence of " + "ids is already formatted with special tokens for the model." + ) + return list(map(lambda x: 1 if x in [self.sep_token_id, self.cls_token_id] else 0, token_ids_0)) + + if token_ids_1 is not None: + return [1] + ([0] * len(token_ids_0)) + [1] + ([0] * len(token_ids_1)) + return [1] + ([0] * len(token_ids_0)) + [1] + + @add_end_docstrings(TAPAS_ENCODE_PLUS_ADDITIONAL_KWARGS_DOCSTRING) + def __call__( + self, + table: pd.DataFrame, + queries: Optional[ + Union[ + TextInput, + PreTokenizedInput, + EncodedInput, + List[TextInput], + List[PreTokenizedInput], + List[EncodedInput], + ] + ] = None, + answer_coordinates: Optional[ + Union[ + List[Tuple], + List[List[Tuple]] + ] + ] = None, + answer_text: Optional[ + Union[ + List[TextInput], + List[List[TextInput]] + ] + ] = None, + add_special_tokens: bool = True, + padding: Union[bool, str, PaddingStrategy] = False, + truncation: Union[bool, str, TapasTruncationStrategy] = False, + max_length: Optional[int] = None, + pad_to_multiple_of: Optional[int] = None, + return_tensors: Optional[Union[str, TensorType]] = None, + return_token_type_ids: Optional[bool] = None, + return_attention_mask: Optional[bool] = None, + return_overflowing_tokens: bool = False, + return_special_tokens_mask: bool = False, + return_offsets_mapping: bool = False, + return_length: bool = False, + verbose: bool = True, + **kwargs + ) -> BatchEncoding: + """ + Main method to tokenize and prepare for the model one or several sequence(s) related to a table. + + Args: + table (:obj:`pd.DataFrame`): + Table containing tabular data. Note that all cell values must be text. Use `.astype(str)` on a Pandas dataframe to + convert it to string. + queries (:obj:`str` or :obj:`List[str]`): + Question or batch of questions related to a table to be encoded. Note that + in case of a batch, all questions must refer to the **same** table. + answer_coordinates (:obj:`List[Tuple]` or :obj:`List[List[Tuple]]`, `optional`): + Answer coordinates of each table-question pair in the batch. In case only a single table-question pair + is provided, then the answer_coordinates must be a single list of one or more tuples. Each tuple must be + a (row_index, column_index) pair. The first data row (not the column header row) has index 0. The first column + has index 0. In case a batch of table-question pairs is provided, then the answer_coordinates must be a + list of lists of tuples (each list corresponding to a single table-question pair). + answer_text (:obj:`List[str]` or :obj:`List[List[str]]`, `optional`): + Answer text of each table-question pair in the batch. In case only a single table-question pair + is provided, then the answer_text must be a single list of one or more strings. Each string must be + the answer text of a corresponding answer coordinate. 
In case a batch of table-question pairs is provided, then + the answer_text must be a list of lists of strings (each list corresponding to a single table-question pair). + """ + assert isinstance(table, pd.DataFrame), "Table must be of type pd.DataFrame" + + # Input type checking for clearer error + assert ( + queries is None + or isinstance(queries, str) + or ( + isinstance(queries, (list, tuple)) + and ( + len(queries) == 0 + or ( + isinstance(queries[0], str) + or ( + isinstance(queries[0], (list, tuple)) + and (len(queries[0]) == 0 or isinstance(queries[0][0], str)) + ) + ) + ) + ) + ), ( + "queries input must be of type `str` (single example), `List[str]` (batch or single pretokenized example) " + "or `List[List[str]]` (batch of pretokenized examples)." + ) + + is_batched = isinstance(queries, (list, tuple)) + + if is_batched: + return self.batch_encode_plus( + table=table, + queries=queries, + answer_coordinates=answer_coordinates, + answer_text=answer_text, + add_special_tokens=add_special_tokens, + padding=padding, + truncation=truncation, + max_length=max_length, + pad_to_multiple_of=pad_to_multiple_of, + return_tensors=return_tensors, + return_token_type_ids=return_token_type_ids, + return_attention_mask=return_attention_mask, + return_overflowing_tokens=return_overflowing_tokens, + return_special_tokens_mask=return_special_tokens_mask, + return_offsets_mapping=return_offsets_mapping, + return_length=return_length, + verbose=verbose, + **kwargs, + ) + else: + return self.encode_plus( + table=table, + query=queries, + answer_coordinates=answer_coordinates, + answer_text=answer_text, + add_special_tokens=add_special_tokens, + padding=padding, + truncation=truncation, + max_length=max_length, + pad_to_multiple_of=pad_to_multiple_of, + return_tensors=return_tensors, + return_token_type_ids=return_token_type_ids, + return_attention_mask=return_attention_mask, + return_overflowing_tokens=return_overflowing_tokens, + return_special_tokens_mask=return_special_tokens_mask, + return_offsets_mapping=return_offsets_mapping, + return_length=return_length, + verbose=verbose, + **kwargs, + )
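+ + # Editor's note: a minimal usage sketch of ``__call__`` (illustrative, not part of the original code; + # the checkpoint name is taken from the pretrained vocab map above and is assumed to be available): + # + # import pandas as pd + # from transformers import TapasTokenizer + # tokenizer = TapasTokenizer.from_pretrained("nielsr/tapas-base-finetuned-wtq") + # table = pd.DataFrame({"Actors": ["Brad Pitt", "Leonardo Di Caprio"], "Age": ["56", "45"]}).astype(str) + # inputs = tokenizer(table=table, queries=["How old is Brad Pitt?"], padding="max_length", return_tensors="pt") + # # `inputs` is a BatchEncoding that should contain input_ids, attention_mask and the 7 token type ids + # # described in the class docstring above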
Note that all questions must refer to the **same** table.
+            answer_coordinates (:obj:`List[List[Tuple]]`, `optional`):
+                Answer coordinates of each table-question pair in the batch. Each tuple must be a
+                (row_index, column_index) pair. The first data row (not the column header row) has index 0. The first
+                column has index 0. The answer_coordinates must be a list of lists of tuples (each list corresponding
+                to a single table-question pair).
+            answer_text (:obj:`List[List[str]]`, `optional`):
+                Answer text of each table-question pair in the batch. The answer_text must be a list of lists of
+                strings (each list corresponding to a single table-question pair). Each string must be the answer
+                text of a corresponding answer coordinate.
+        """
+        if return_token_type_ids is not None and not add_special_tokens:
+            raise ValueError(
+                "Asking to return token_type_ids while setting add_special_tokens to False "
+                "results in an undefined behavior. Please set add_special_tokens to True or "
+                "set return_token_type_ids to None."
+            )
+
+        if (answer_coordinates and not answer_text) or (not answer_coordinates and answer_text):
+            raise ValueError("In case you provide answers, both answer_coordinates and answer_text should be provided")
+        elif answer_coordinates is None and answer_text is None:
+            answer_coordinates = answer_text = [None] * len(queries)
+
+        if "is_split_into_words" in kwargs:
+            raise NotImplementedError("Currently TapasTokenizer only supports questions as strings.")
+
+        if return_offsets_mapping:
+            raise NotImplementedError(
+                "return_offset_mapping is not available when using Python tokenizers. "
+                "To use this feature, change your tokenizer to one deriving from "
+                "transformers.PreTrainedTokenizerFast."
+ ) + + return self._batch_encode_plus( + table=table, + queries=queries, + answer_coordinates=answer_coordinates, + answer_text=answer_text, + add_special_tokens=add_special_tokens, + padding=padding, + truncation=truncation, + max_length=max_length, + pad_to_multiple_of=pad_to_multiple_of, + return_tensors=return_tensors, + return_token_type_ids=return_token_type_ids, + return_attention_mask=return_attention_mask, + return_overflowing_tokens=return_overflowing_tokens, + return_special_tokens_mask=return_special_tokens_mask, + return_offsets_mapping=return_offsets_mapping, + return_length=return_length, + verbose=verbose, + **kwargs, + ) + + def _batch_encode_plus( + self, + table, + queries: Union[ + List[TextInput], + List[PreTokenizedInput], + List[EncodedInput], + ], + answer_coordinates: Optional[List[List[Tuple]]] = None, + answer_text: Optional[List[List[TextInput]]] = None, + add_special_tokens: bool = True, + padding: Union[bool, str, PaddingStrategy] = False, + truncation: Union[bool, str, TapasTruncationStrategy] = False, + max_length: Optional[int] = None, + pad_to_multiple_of: Optional[int] = None, + return_tensors: Optional[Union[str, TensorType]] = None, + return_token_type_ids: Optional[bool] = True, + return_attention_mask: Optional[bool] = None, + return_overflowing_tokens: bool = False, + return_special_tokens_mask: bool = False, + return_offsets_mapping: bool = False, + return_length: bool = False, + verbose: bool = True, + **kwargs + ) -> BatchEncoding: + table_tokens = self._tokenize_table(table) + + queries_tokens = [] + for query in queries: + query_tokens = self.tokenize(query) + queries_tokens.append(query_tokens) + + batch_outputs = self._batch_prepare_for_model( + table, + queries, + tokenized_table=table_tokens, + queries_tokens=queries_tokens, + answer_coordinates=answer_coordinates, + padding=padding, + truncation=truncation, + answer_text=answer_text, + add_special_tokens=add_special_tokens, + max_length=max_length, + pad_to_multiple_of=pad_to_multiple_of, + return_tensors=return_tensors, + prepend_batch_axis=True, + return_attention_mask=return_attention_mask, + return_token_type_ids=return_token_type_ids, + return_overflowing_tokens=return_overflowing_tokens, + return_special_tokens_mask=return_special_tokens_mask, + return_length=return_length, + verbose=verbose, + ) + + return BatchEncoding(batch_outputs) + + def _batch_prepare_for_model( + self, + raw_table: pd.DataFrame, + raw_queries: Union[ + List[TextInput], + List[PreTokenizedInput], + List[EncodedInput], + ], + tokenized_table: Optional[TokenizedTable] = None, + queries_tokens: Optional[List[List[str]]] = None, + answer_coordinates: Optional[List[List[Tuple]]] = None, + answer_text: Optional[List[List[TextInput]]] = None, + add_special_tokens: bool = True, + padding: Union[bool, str, PaddingStrategy] = False, + truncation: Union[bool, str, TapasTruncationStrategy] = False, + max_length: Optional[int] = None, + pad_to_multiple_of: Optional[int] = None, + return_tensors: Optional[Union[str, TensorType]] = None, + return_token_type_ids: Optional[bool] = True, + return_attention_mask: Optional[bool] = True, + return_special_tokens_mask: bool = False, + return_offsets_mapping: bool = False, + return_length: bool = False, + verbose: bool = True, + prepend_batch_axis: bool = False, + **kwargs + ) -> BatchEncoding: + batch_outputs = {} + + for index, example in enumerate(zip(raw_queries, queries_tokens, answer_coordinates, answer_text)): + raw_query, query_tokens, answer_coords, answer_txt = example + 
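# Prepare each table-question pair on its own, without padding; padding is
+            # applied over the whole batch afterwards. The previous pair's answer
+            # coordinates/text are forwarded so that prepare_for_model can compute
+            # the prev_label_ids token type used in conversational setups such as SQA.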
+            outputs = self.prepare_for_model(
+                raw_table,
+                raw_query,
+                tokenized_table=tokenized_table,
+                query_tokens=query_tokens,
+                answer_coordinates=answer_coords,
+                answer_text=answer_txt,
+                add_special_tokens=add_special_tokens,
+                padding=PaddingStrategy.DO_NOT_PAD.value,  # we pad in batch afterwards
+                truncation=truncation,
+                max_length=max_length,
+                pad_to_multiple_of=None,  # we pad in batch afterwards
+                return_attention_mask=False,  # we pad in batch afterwards
+                return_token_type_ids=return_token_type_ids,
+                return_special_tokens_mask=return_special_tokens_mask,
+                return_length=return_length,
+                return_tensors=None,  # We convert the whole batch to tensors at the end
+                prepend_batch_axis=False,
+                verbose=verbose,
+                prev_answer_coordinates=answer_coordinates[index - 1] if index != 0 else None,
+                prev_answer_text=answer_text[index - 1] if index != 0 else None,
+            )
+
+            for key, value in outputs.items():
+                if key not in batch_outputs:
+                    batch_outputs[key] = []
+                batch_outputs[key].append(value)
+
+        batch_outputs = self.pad(
+            batch_outputs,
+            padding=padding,
+            max_length=max_length,
+            pad_to_multiple_of=pad_to_multiple_of,
+            return_attention_mask=return_attention_mask,
+        )
+
+        batch_outputs = BatchEncoding(batch_outputs, tensor_type=return_tensors)
+
+        return batch_outputs
+
+    @add_end_docstrings(ENCODE_KWARGS_DOCSTRING)
+    def encode(
+        self,
+        table: pd.DataFrame,
+        query: Optional[
+            Union[
+                TextInput,
+                PreTokenizedInput,
+                EncodedInput,
+            ]
+        ] = None,
+        add_special_tokens: bool = True,
+        padding: Union[bool, str, PaddingStrategy] = False,
+        truncation: Union[bool, str, TapasTruncationStrategy] = False,
+        max_length: Optional[int] = None,
+        return_tensors: Optional[Union[str, TensorType]] = None,
+        **kwargs
+    ) -> List[int]:
+        """
+        Prepare a table and a string for the model. This method does not return token type IDs, attention masks,
+        etc., which are necessary for the model to work correctly. Use this method if you want to build your
+        processing on your own; otherwise, refer to ``__call__``.
+
+        Args:
+            table (:obj:`pd.DataFrame`):
+                Table containing tabular data. Note that all cell values must be text. Use `.astype(str)` on a Pandas
+                dataframe to convert it to string.
+            query (:obj:`str`):
+                Question related to a table to be encoded.
+        """
+        encoded_inputs = self.encode_plus(
+            table,
+            query=query,
+            add_special_tokens=add_special_tokens,
+            padding=padding,
+            truncation=truncation,
+            max_length=max_length,
+            return_tensors=return_tensors,
+            **kwargs,
+        )
+
+        return encoded_inputs["input_ids"]
+
+    @add_end_docstrings(ENCODE_KWARGS_DOCSTRING, TAPAS_ENCODE_PLUS_ADDITIONAL_KWARGS_DOCSTRING)
+    def encode_plus(
+        self,
+        table: pd.DataFrame,
+        query: Optional[
+            Union[
+                TextInput,
+                PreTokenizedInput,
+                EncodedInput,
+            ]
+        ] = None,
+        answer_coordinates: Optional[List[Tuple]] = None,
+        answer_text: Optional[List[TextInput]] = None,
+        add_special_tokens: bool = True,
+        padding: Union[bool, str, PaddingStrategy] = False,
+        truncation: Union[bool, str, TapasTruncationStrategy] = False,
+        max_length: Optional[int] = None,
+        pad_to_multiple_of: Optional[int] = None,
+        return_tensors: Optional[Union[str, TensorType]] = None,
+        return_token_type_ids: Optional[bool] = None,
+        return_attention_mask: Optional[bool] = None,
+        return_special_tokens_mask: bool = False,
+        return_offsets_mapping: bool = False,
+        return_length: bool = False,
+        verbose: bool = True,
+        **kwargs
+    ) -> BatchEncoding:
+        """
+        Prepare a table and a string for the model.
+
+        Args:
+            table (:obj:`pd.DataFrame`):
+                Table containing tabular data. Note that all cell values must be text. Use `.astype(str)` on a Pandas
+                dataframe to convert it to string.
+            query (:obj:`str`):
+                Question related to a table to be encoded.
+            answer_coordinates (:obj:`List[Tuple]`, `optional`):
+                Answer coordinates of the table-question pair. The answer_coordinates must be a single list of one or
+                more tuples. Each tuple must be a (row_index, column_index) pair. The first data row (not the column
+                header row) has index 0. The first column has index 0.
+            answer_text (:obj:`List[str]`, `optional`):
+                Answer text of the table-question pair. The answer_text must be a single list of one or more strings.
+                Each string must be the answer text of a corresponding answer coordinate.
+        """
+        if return_token_type_ids is not None and not add_special_tokens:
+            raise ValueError(
+                "Asking to return token_type_ids while setting add_special_tokens to False "
+                "results in an undefined behavior. Please set add_special_tokens to True or "
+                "set return_token_type_ids to None."
+            )
+
+        if (answer_coordinates and not answer_text) or (not answer_coordinates and answer_text):
+            raise ValueError("In case you provide answers, both answer_coordinates and answer_text should be provided")
+
+        if "is_split_into_words" in kwargs:
+            raise NotImplementedError("Currently TapasTokenizer only supports questions as strings.")
+
+        if return_offsets_mapping:
+            raise NotImplementedError(
+                "return_offset_mapping is not available when using Python tokenizers. "
+                "To use this feature, change your tokenizer to one deriving from "
+                "transformers.PreTrainedTokenizerFast."
+            )
+
+        return self._encode_plus(
+            table=table,
+            query=query,
+            answer_coordinates=answer_coordinates,
+            answer_text=answer_text,
+            add_special_tokens=add_special_tokens,
+            truncation=truncation,
+            padding=padding,
+            max_length=max_length,
+            pad_to_multiple_of=pad_to_multiple_of,
+            return_tensors=return_tensors,
+            return_token_type_ids=return_token_type_ids,
+            return_attention_mask=return_attention_mask,
+            return_special_tokens_mask=return_special_tokens_mask,
+            return_offsets_mapping=return_offsets_mapping,
+            return_length=return_length,
+            verbose=verbose,
+            **kwargs,
+        )
+
+    def _encode_plus(
+        self,
+        table: pd.DataFrame,
+        query: Union[
+            TextInput,
+            PreTokenizedInput,
+            EncodedInput,
+        ],
+        answer_coordinates: Optional[List[Tuple]] = None,
+        answer_text: Optional[List[TextInput]] = None,
+        add_special_tokens: bool = True,
+        padding: Union[bool, str, PaddingStrategy] = False,
+        truncation: Union[bool, str, TapasTruncationStrategy] = False,
+        max_length: Optional[int] = None,
+        pad_to_multiple_of: Optional[int] = None,
+        return_tensors: Optional[Union[str, TensorType]] = None,
+        return_token_type_ids: Optional[bool] = True,
+        return_attention_mask: Optional[bool] = True,
+        return_special_tokens_mask: bool = False,
+        return_offsets_mapping: bool = False,
+        return_length: bool = False,
+        verbose: bool = True,
+        **kwargs
+    ):
+        if query is None:
+            query = ""
+            logger.warning(
+                "TAPAS is a question answering model but you have not passed a query. Please be aware that the "
+                "model will probably not behave correctly."
+            )
+
+        table_tokens = self._tokenize_table(table)
+        query_tokens = self.tokenize(query)
+
+        return self.prepare_for_model(
+            table,
+            query,
+            tokenized_table=table_tokens,
+            query_tokens=query_tokens,
+            answer_coordinates=answer_coordinates,
+            answer_text=answer_text,
+            add_special_tokens=add_special_tokens,
+            truncation=truncation,
+            padding=padding,
+            max_length=max_length,
+            pad_to_multiple_of=pad_to_multiple_of,
+            return_tensors=return_tensors,
+            prepend_batch_axis=True,
+            return_attention_mask=return_attention_mask,
+            return_token_type_ids=return_token_type_ids,
+            return_special_tokens_mask=return_special_tokens_mask,
+            return_length=return_length,
+            verbose=verbose,
+        )
+
+    @add_end_docstrings(ENCODE_KWARGS_DOCSTRING, TAPAS_ENCODE_PLUS_ADDITIONAL_KWARGS_DOCSTRING)
+    def prepare_for_model(
+        self,
+        raw_table: pd.DataFrame,
+        raw_query: Union[
+            TextInput,
+            PreTokenizedInput,
+            EncodedInput,
+        ],
+        tokenized_table: Optional[TokenizedTable] = None,
+        query_tokens: Optional[List[str]] = None,
+        answer_coordinates: Optional[List[Tuple]] = None,
+        answer_text: Optional[List[TextInput]] = None,
+        add_special_tokens: bool = True,
+        padding: Union[bool, str, PaddingStrategy] = False,
+        truncation: Union[bool, str, TapasTruncationStrategy] = False,
+        max_length: Optional[int] = None,
+        pad_to_multiple_of: Optional[int] = None,
+        return_tensors: Optional[Union[str, TensorType]] = None,
+        return_token_type_ids: Optional[bool] = True,
+        return_attention_mask: Optional[bool] = True,
+        return_special_tokens_mask: bool = False,
+        return_offsets_mapping: bool = False,
+        return_length: bool = False,
+        verbose: bool = True,
+        prepend_batch_axis: bool = False,
+        **kwargs
+    ) -> BatchEncoding:
+        """
+        Prepares a sequence of input ids so that it can be used by the model. It adds special tokens and truncates
+        sequences if overflowing, taking the special tokens into account.
+
+        Args:
+            raw_table (:obj:`pd.DataFrame`):
+                The original table before any transformation (like tokenization) was applied to it.
+            raw_query (:obj:`TextInput` or :obj:`PreTokenizedInput` or :obj:`EncodedInput`):
+                The original query before any transformation (like tokenization) was applied to it.
+            tokenized_table (:obj:`TokenizedTable`):
+                The table after tokenization.
+            query_tokens (:obj:`List[str]`):
+                The query after tokenization.
+            answer_coordinates (:obj:`List[Tuple]`, `optional`):
+                Answer coordinates of the table-question pair. The answer_coordinates must be a single list of one or
+                more tuples. Each tuple must be a (row_index, column_index) pair. The first data row (not the column
+                header row) has index 0. The first column has index 0.
+            answer_text (:obj:`List[str]`, `optional`):
+                Answer text of the table-question pair. The answer_text must be a single list of one or more strings.
+                Each string must be the answer text of a corresponding answer coordinate.
+ """ + if isinstance(padding, bool): + if padding and (max_length is not None or pad_to_multiple_of is not None): + padding = PaddingStrategy.MAX_LENGTH + else: + padding = PaddingStrategy.DO_NOT_PAD + elif not isinstance(padding, PaddingStrategy): + padding = PaddingStrategy(padding) + + if isinstance(truncation, bool): + if truncation: + truncation = TapasTruncationStrategy.DROP_ROWS_TO_FIT + else: + truncation = TapasTruncationStrategy.DO_NOT_TRUNCATE + elif not isinstance(truncation, TapasTruncationStrategy): + truncation = TapasTruncationStrategy(truncation) + + encoded_inputs = {} + + is_part_of_batch = False + prev_answer_coordinates, prev_answer_text = None, None + if "prev_answer_coordinates" in kwargs and "prev_answer_text" in kwargs: + is_part_of_batch = True + prev_answer_coordinates = kwargs["prev_answer_coordinates"] + prev_answer_text = kwargs["prev_answer_text"] + + num_rows = self._get_num_rows(raw_table, self.drop_rows_to_fit) + num_columns = self._get_num_columns(raw_table) + _, _, num_tokens = self._get_table_boundaries(tokenized_table) + + if truncation != TapasTruncationStrategy.DO_NOT_TRUNCATE and max_length: + num_rows, num_tokens = self._get_truncated_table_rows(query_tokens, tokenized_table, num_rows, num_columns, + max_length, truncation_strategy=truncation) + table_data = list(self._get_table_values(tokenized_table, num_columns, num_rows, num_tokens)) + + query_ids = self.convert_tokens_to_ids(query_tokens) + table_ids = list(zip(*table_data))[0] if len(table_data) > 0 else list(zip(*table_data)) + table_ids = self.convert_tokens_to_ids(list(table_ids)) + + if "return_overflowing_tokens" in kwargs and kwargs["return_overflowing_tokens"]: + raise ValueError("TAPAS does not return overflowing tokens as it works on tables.") + + if add_special_tokens: + input_ids = self.build_inputs_with_special_tokens(query_ids, table_ids) + else: + input_ids = query_ids + table_ids + + if max_length is not None and len(input_ids) > max_length: + raise ValueError( + "Could not encode the query and table header given the maximum length. 
+                "Could not encode the query and table header given the maximum length. Encoding the query and table "
+                f"header results in a length of {len(input_ids)} which is higher than the max_length of {max_length}"
+            )
+
+        encoded_inputs["input_ids"] = input_ids
+
+        segment_ids = self.create_segment_token_type_ids_from_sequences(query_ids, table_data)
+        column_ids = self.create_column_token_type_ids_from_sequences(query_ids, table_data)
+        row_ids = self.create_row_token_type_ids_from_sequences(query_ids, table_data)
+        if not is_part_of_batch or (prev_answer_coordinates is None and prev_answer_text is None):
+            # simply set the prev_label_ids to zeros
+            prev_label_ids = [0] * len(row_ids)
+        else:
+            prev_label_ids = self.get_answer_ids(
+                column_ids, row_ids, table_data, prev_answer_text, prev_answer_coordinates
+            )
+
+        ### FIRST: parse both the table and question in terms of numeric values
+
+        raw_table = add_numeric_table_values(raw_table)
+        raw_query = add_numeric_values_to_question(raw_query)
+
+        ### SECOND: add numeric-related features (without parsing the numeric values again in these functions):
+
+        column_ranks, inv_column_ranks = self._get_numeric_column_ranks(column_ids, row_ids, raw_table)
+        numeric_relations = self._get_numeric_relations(raw_query, column_ids, row_ids, raw_table)
+
+        # Load from model defaults
+        if return_token_type_ids is None:
+            return_token_type_ids = "token_type_ids" in self.model_input_names
+        if return_attention_mask is None:
+            return_attention_mask = "attention_mask" in self.model_input_names
+
+        if return_attention_mask:
+            attention_mask = self.create_attention_mask_from_sequences(query_ids, table_data)
+            encoded_inputs["attention_mask"] = attention_mask
+
+        if answer_coordinates is not None and answer_text is not None:
+            label_ids = self.get_answer_ids(column_ids, row_ids, table_data, answer_text, answer_coordinates)
+            numeric_values = self._get_numeric_values(raw_table, column_ids, row_ids)
+            numeric_values_scale = self._get_numeric_values_scale(raw_table, column_ids, row_ids)
+
+            encoded_inputs["label_ids"] = label_ids
+            encoded_inputs["numeric_values"] = numeric_values
+            encoded_inputs["numeric_values_scale"] = numeric_values_scale
+
+        if return_token_type_ids:
+            token_type_ids = [
+                segment_ids,
+                column_ids,
+                row_ids,
+                prev_label_ids,
+                column_ranks,
+                inv_column_ranks,
+                numeric_relations,
+            ]
+
+            token_type_ids = [list(ids) for ids in list(zip(*token_type_ids))]
+            encoded_inputs["token_type_ids"] = token_type_ids
+
+        if return_special_tokens_mask:
+            if add_special_tokens:
+                encoded_inputs["special_tokens_mask"] = self.get_special_tokens_mask(query_ids, table_ids)
+            else:
+                encoded_inputs["special_tokens_mask"] = [0] * len(input_ids)
+
+        # Check lengths
+        if max_length is None and len(encoded_inputs["input_ids"]) > self.model_max_length and verbose:
+            if not self.deprecation_warnings.get("sequence-length-is-longer-than-the-specified-maximum", False):
+                logger.warning(
+                    "Token indices sequence length is longer than the specified maximum sequence length "
+                    "for this model ({} > {}). Running this sequence through the model will result in "
+                    "indexing errors".format(len(encoded_inputs["input_ids"]), self.model_max_length)
+                )
+                self.deprecation_warnings["sequence-length-is-longer-than-the-specified-maximum"] = True
+
+        # Padding
+        if padding != PaddingStrategy.DO_NOT_PAD or return_attention_mask:
+            encoded_inputs = self.pad(
+                encoded_inputs,
+                max_length=max_length,
+                padding=padding.value,
+                pad_to_multiple_of=pad_to_multiple_of,
+                return_attention_mask=return_attention_mask,
+            )
+
+        if return_length:
+            encoded_inputs["length"] = len(encoded_inputs["input_ids"])
+
+        batch_outputs = BatchEncoding(
+            encoded_inputs, tensor_type=return_tensors, prepend_batch_axis=prepend_batch_axis
+        )
+
+        return batch_outputs
+
+    def _get_truncated_table_rows(
+        self,
+        query_tokens: List[str],
+        tokenized_table: TokenizedTable,
+        num_rows: int,
+        num_columns: int,
+        max_length: int,
+        truncation_strategy: Union[str, TapasTruncationStrategy],
+    ) -> Tuple[int, int]:
+        """
+        Truncates a sequence pair in-place following the strategy.
+
+        Args:
+            query_tokens (:obj:`List[str]`):
+                List of strings corresponding to the tokenized query.
+            tokenized_table (:obj:`TokenizedTable`):
+                Tokenized table.
+            num_rows (:obj:`int`):
+                Total number of table rows.
+            num_columns (:obj:`int`):
+                Total number of table columns.
+            max_length (:obj:`int`):
+                Total maximum length.
+            truncation_strategy (:obj:`str` or :obj:`~transformers.TapasTruncationStrategy`):
+                Truncation strategy to use. Seeing as this method should only be called when truncating, the only
+                available strategy is the "drop_rows_to_fit" strategy.
+
+        Returns:
+            :obj:`Tuple[int, int]`: tuple containing the number of rows after truncation, and the number of tokens
+            available for each table element.
+        """
+        if not isinstance(truncation_strategy, TapasTruncationStrategy):
+            truncation_strategy = TapasTruncationStrategy(truncation_strategy)
+
+        if truncation_strategy == TapasTruncationStrategy.DROP_ROWS_TO_FIT:
+            while True:
+                num_tokens = self._get_max_num_tokens(
+                    query_tokens,
+                    tokenized_table,
+                    num_rows=num_rows,
+                    num_columns=num_columns,
+                    max_length=max_length,
+                )
+
+                if num_tokens is not None:
+                    # We could fit the table.
+                    break
+
+                # Try to drop a row to fit the table.
+                num_rows -= 1
+
+                if num_rows < 1:
+                    break
+        elif truncation_strategy != TapasTruncationStrategy.DO_NOT_TRUNCATE:
+            raise ValueError(f"Unknown truncation strategy {truncation_strategy}.")
+
+        return num_rows, num_tokens or 1
+
+    def _tokenize_table(
+        self,
+        table=None,
+    ):
+        """
+        Tokenizes column headers and cell texts of a table.
+
+        Args:
+            table (:obj:`pd.DataFrame`):
+                Table to tokenize.
+
+        Returns:
+            :obj:`TokenizedTable`: TokenizedTable object.
+ """ + tokenized_rows = [] + tokenized_row = [] + # tokenize column headers + for column in table: + if self.strip_column_names: + tokenized_row.append(self.tokenize("")) + else: + tokenized_row.append(self.tokenize(column)) + tokenized_rows.append(tokenized_row) + + # tokenize cell values + for idx, row in table.iterrows(): + tokenized_row = [] + for cell in row: + tokenized_row.append(self.tokenize(cell)) + tokenized_rows.append(tokenized_row) + + token_coordinates = [] + for row_index, row in enumerate(tokenized_rows): + for column_index, cell in enumerate(row): + for token_index, _ in enumerate(cell): + token_coordinates.append( + TokenCoordinates( + row_index=row_index, + column_index=column_index, + token_index=token_index, + ) + ) + + return TokenizedTable( + rows=tokenized_rows, + selected_tokens=token_coordinates, + ) + + def _question_encoding_cost(self, question_tokens): + # Two extra spots of SEP and CLS. + return len(question_tokens) + 2 + + def _get_token_budget(self, question_tokens, max_length=None): + """ + Computes the number of tokens left for the table after tokenizing a question, taking into account the max + sequence length of the model. + + Args: + question_tokens (:obj:`List[String]`): + List of question tokens. Returns: :obj:`int`: the number of tokens left for the table, given the model + max length. + """ + return (max_length if max_length is not None else self.model_max_length) - self._question_encoding_cost(question_tokens) + + def _get_table_values(self, table, num_columns, num_rows, num_tokens) -> Generator[TableValue, None, None]: + """Iterates over partial table and returns token, column and row indexes.""" + for tc in table.selected_tokens: + # First row is header row. + if tc.row_index >= num_rows + 1: + continue + if tc.column_index >= num_columns: + continue + cell = table.rows[tc.row_index][tc.column_index] + token = cell[tc.token_index] + word_begin_index = tc.token_index + # Don't add partial words. Find the starting word piece and check if it + # fits in the token budget. 
+            while word_begin_index >= 0 and _is_inner_wordpiece(cell[word_begin_index]):
+                word_begin_index -= 1
+            if word_begin_index >= num_tokens:
+                continue
+            yield TableValue(token, tc.column_index + 1, tc.row_index)
+
+    def _get_table_boundaries(self, table):
+        """Return maximal number of rows, columns and tokens."""
+        max_num_tokens = 0
+        max_num_columns = 0
+        max_num_rows = 0
+        for tc in table.selected_tokens:
+            max_num_columns = max(max_num_columns, tc.column_index + 1)
+            max_num_rows = max(max_num_rows, tc.row_index + 1)
+            max_num_tokens = max(max_num_tokens, tc.token_index + 1)
+        max_num_columns = min(self.max_column_id, max_num_columns)
+        max_num_rows = min(self.max_row_id, max_num_rows)
+        return max_num_rows, max_num_columns, max_num_tokens
+
+    def _get_table_cost(self, table, num_columns, num_rows, num_tokens):
+        return sum(1 for _ in self._get_table_values(table, num_columns, num_rows, num_tokens))
+
+    def _get_max_num_tokens(
+        self,
+        question_tokens,
+        tokenized_table,
+        num_columns,
+        num_rows,
+        max_length,
+    ):
+        """Computes max number of tokens that can be squeezed into the budget."""
+        token_budget = self._get_token_budget(question_tokens, max_length)
+        _, _, max_num_tokens = self._get_table_boundaries(tokenized_table)
+        if self.cell_trim_length >= 0 and max_num_tokens > self.cell_trim_length:
+            max_num_tokens = self.cell_trim_length
+        num_tokens = 0
+        for num_tokens in range(max_num_tokens + 1):
+            cost = self._get_table_cost(tokenized_table, num_columns, num_rows, num_tokens + 1)
+            if cost > token_budget:
+                break
+        if num_tokens < max_num_tokens:
+            if self.cell_trim_length >= 0:
+                # We don't allow dynamic trimming if a cell_trim_length is set.
+                return None
+            if num_tokens == 0:
+                return None
+        return num_tokens
+
+    def _get_num_columns(self, table):
+        num_columns = table.shape[1]
+        if num_columns >= self.max_column_id:
+            raise ValueError("Too many columns")
+        return num_columns
+
+    def _get_num_rows(self, table, drop_rows_to_fit):
+        num_rows = table.shape[0]
+        if num_rows >= self.max_row_id:
+            if drop_rows_to_fit:
+                num_rows = self.max_row_id - 1
+            else:
+                raise ValueError("Too many rows")
+        return num_rows
+
+    def _serialize_text(self, question_tokens):
+        """Serializes texts in index arrays."""
+        tokens = []
+        segment_ids = []
+        column_ids = []
+        row_ids = []
+
+        # add [CLS] token at the beginning
+        tokens.append(self.cls_token)
+        segment_ids.append(0)
+        column_ids.append(0)
+        row_ids.append(0)
+
+        for token in question_tokens:
+            tokens.append(token)
+            segment_ids.append(0)
+            column_ids.append(0)
+            row_ids.append(0)
+
+        return tokens, segment_ids, column_ids, row_ids
+
+    def _serialize(
+        self,
+        question_tokens,
+        table,
+        num_columns,
+        num_rows,
+        num_tokens,
+    ):
+        """Serializes table and text."""
+        tokens, segment_ids, column_ids, row_ids = self._serialize_text(question_tokens)
+
+        # add [SEP] token between question and table tokens
+        tokens.append(self.sep_token)
+        segment_ids.append(0)
+        column_ids.append(0)
+        row_ids.append(0)
+
+        for token, column_id, row_id in self._get_table_values(table, num_columns, num_rows, num_tokens):
+            tokens.append(token)
+            segment_ids.append(1)
+            column_ids.append(column_id)
+            row_ids.append(row_id)
+
+        return SerializedExample(
+            tokens=tokens,
+            segment_ids=segment_ids,
+            column_ids=column_ids,
+            row_ids=row_ids,
+        )
+
+    def _get_column_values(self, table, col_index):
+        table_numeric_values = {}
+        for row_index, row in table.iterrows():
+            cell = row[col_index]
+            if cell.numeric_value is not None:
+                table_numeric_values[row_index] = cell.numeric_value
+        return table_numeric_values
+
+    def _get_cell_token_indexes(self, column_ids, row_ids, column_id, row_id):
+        for index in range(len(column_ids)):
+            if column_ids[index] - 1 == column_id and row_ids[index] - 1 == row_id:
+                yield index
+
+    def _get_numeric_column_ranks(self, column_ids, row_ids, table):
+        """Returns column ranks for all numeric columns."""
+
+        ranks = [0] * len(column_ids)
+        inv_ranks = [0] * len(column_ids)
+
+        # original code from tf_example_utils.py of the original implementation
+        if table is not None:
+            for col_index in range(len(table.columns)):
+                table_numeric_values = self._get_column_values(table, col_index)
+
+                if not table_numeric_values:
+                    continue
+
+                try:
+                    key_fn = get_numeric_sort_key_fn(table_numeric_values.values())
+                except ValueError:
+                    continue
+
+                table_numeric_values = {row_index: key_fn(value) for row_index, value in table_numeric_values.items()}
+
+                table_numeric_values_inv = collections.defaultdict(list)
+                for row_index, value in table_numeric_values.items():
+                    table_numeric_values_inv[value].append(row_index)
+
+                unique_values = sorted(table_numeric_values_inv.keys())
+
+                for rank, value in enumerate(unique_values):
+                    for row_index in table_numeric_values_inv[value]:
+                        for index in self._get_cell_token_indexes(column_ids, row_ids, col_index, row_index):
+                            ranks[index] = rank + 1
+                            inv_ranks[index] = len(unique_values) - rank
+
+        return ranks, inv_ranks
+
+    def _get_numeric_sort_key_fn(self, table_numeric_values, value):
+        """
+        Returns the sort key function for comparing value to table values. The function returned will be a suitable
+        input for the key param of sort(). See number_annotation_utils._get_numeric_sort_key_fn for details.
+
+        Args:
+            table_numeric_values: Numeric values of a column.
+            value: Numeric value in the question.
+
+        Returns:
+            A key function to compare column and question values.
+        """
+        if not table_numeric_values:
+            return None
+        all_values = list(table_numeric_values.values())
+        all_values.append(value)
+        try:
+            return get_numeric_sort_key_fn(all_values)
+        except ValueError:
+            return None
+
+    def _get_numeric_relations(self, question, column_ids, row_ids, table):
+        """
+        Returns numeric relations embeddings.
+
+        Args:
+            question: Question object.
+            column_ids: Maps word piece position to column id.
+            row_ids: Maps word piece position to row id.
+            table: The table containing the numeric cell values.
+        """
+
+        numeric_relations = [0] * len(column_ids)
+
+        # first, we add any numeric value spans to the question:
+        # Create a dictionary that maps a table cell to the set of all relations
+        # this cell has with any value in the question.
+        cell_indices_to_relations = collections.defaultdict(set)
+        if question is not None and table is not None:
+            for numeric_value_span in question.numeric_spans:
+                for value in numeric_value_span.values:
+                    for column_index in range(len(table.columns)):
+                        table_numeric_values = self._get_column_values(table, column_index)
+                        sort_key_fn = self._get_numeric_sort_key_fn(table_numeric_values, value)
+                        if sort_key_fn is None:
+                            continue
+                        for row_index, cell_value in table_numeric_values.items():
+                            relation = get_numeric_relation(value, cell_value, sort_key_fn)
+                            if relation is not None:
+                                cell_indices_to_relations[column_index, row_index].add(relation)
+
+        # For each cell add a special feature for all its word pieces.
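+        # Relations are packed into a bit set per cell: the offset from Relation.EQ
+        # is the bit position, i.e. EQ -> bit 0, LT -> bit 1, GT -> bit 2. For
+        # example, a cell that is EQ to one question value and LT another gets
+        # relation_set_index = 2**0 + 2**1 = 3.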
+        for (column_index, row_index), relations in cell_indices_to_relations.items():
+            relation_set_index = 0
+            for relation in relations:
+                assert relation.value >= Relation.EQ.value
+                relation_set_index += 2 ** (relation.value - Relation.EQ.value)
+            for cell_token_index in self._get_cell_token_indexes(column_ids, row_ids, column_index, row_index):
+                numeric_relations[cell_token_index] = relation_set_index
+
+        return numeric_relations
+
+    def _get_numeric_values(self, table, column_ids, row_ids):
+        """Returns numeric values for computation of answer loss."""
+
+        numeric_values = [float("nan")] * len(column_ids)
+
+        if table is not None:
+            num_rows = table.shape[0]
+            num_columns = table.shape[1]
+
+            for col_index in range(num_columns):
+                for row_index in range(num_rows):
+                    numeric_value = table.iloc[row_index, col_index].numeric_value
+                    if numeric_value is not None:
+                        if numeric_value.float_value is None:
+                            continue
+                        float_value = numeric_value.float_value
+                        if float_value == float("inf"):
+                            continue
+                        for index in self._get_cell_token_indexes(column_ids, row_ids, col_index, row_index):
+                            numeric_values[index] = float_value
+
+        return numeric_values
+
+    def _get_numeric_values_scale(self, table, column_ids, row_ids):
+        """Returns a scale for each token to down-weight the value of long words."""
+
+        numeric_values_scale = [1.0] * len(column_ids)
+
+        if table is None:
+            return numeric_values_scale
+
+        num_rows = table.shape[0]
+        num_columns = table.shape[1]
+
+        for col_index in range(num_columns):
+            for row_index in range(num_rows):
+                indices = list(self._get_cell_token_indexes(column_ids, row_ids, col_index, row_index))
+                num_indices = len(indices)
+                if num_indices > 1:
+                    for index in indices:
+                        numeric_values_scale[index] = float(num_indices)
+
+        return numeric_values_scale
+
+    def _pad_to_seq_length(self, inputs):
+        while len(inputs) > self.model_max_length:
+            inputs.pop()
+        while len(inputs) < self.model_max_length:
+            inputs.append(0)
+
+    def _get_all_answer_ids_from_coordinates(
+        self,
+        column_ids,
+        row_ids,
+        answers_list,
+    ):
+        """Maps lists of answer coordinates to token indexes."""
+        answer_ids = [0] * len(column_ids)
+        found_answers = set()
+        all_answers = set()
+        for answers in answers_list:
+            column_index, row_index = answers
+            all_answers.add((column_index, row_index))
+            for index in self._get_cell_token_indexes(column_ids, row_ids, column_index, row_index):
+                found_answers.add((column_index, row_index))
+                answer_ids[index] = 1
+
+        missing_count = len(all_answers) - len(found_answers)
+        return answer_ids, missing_count
+
+    def _get_all_answer_ids(self, column_ids, row_ids, answer_coordinates):
+        """
+        Maps answer coordinates of a question to token indexes.
+
+        In the SQA format (TSV), the coordinates are given as (row, column) tuples. Here, we first swap them to
+        (column, row) format before calling _get_all_answer_ids_from_coordinates.
+ """ + + def _to_coordinates(answer_coordinates_question): + return [(coords[1], coords[0]) for coords in answer_coordinates_question] + + return self._get_all_answer_ids_from_coordinates( + column_ids, row_ids, answers_list=(_to_coordinates(answer_coordinates)) + ) + + def _find_tokens(self, text, segment): + """Return start index of segment in text or None.""" + logging.info("text: %s %s", text, segment) + for index in range(1 + len(text) - len(segment)): + for seg_index, seg_token in enumerate(segment): + if text[index + seg_index].piece != seg_token.piece: + break + else: + return index + return None + + def _find_answer_coordinates_from_answer_text( + self, + tokenized_table, + answer_text, + ): + """Returns all occurrences of answer_text in the table.""" + logging.info("answer text: %s", answer_text) + for row_index, row in enumerate(tokenized_table.rows): + if row_index == 0: + # We don't search for answers in the header. + continue + for col_index, cell in enumerate(row): + token_index = self._find_tokens(cell, answer_text) + if token_index is not None: + yield TokenCoordinates( + row_index=row_index, + column_index=col_index, + token_index=token_index, + ) + + def _find_answer_ids_from_answer_texts( + self, + column_ids, + row_ids, + tokenized_table, + answer_texts, + ): + """Maps question with answer texts to the first matching token indexes.""" + answer_ids = [0] * len(column_ids) + for answer_text in answer_texts: + for coordinates in self._find_answer_coordinates_from_answer_text( + tokenized_table, + answer_text, + ): + # Maps answer coordinates to indexes this can fail if tokens / rows have + # been pruned. + indexes = list( + self._get_cell_token_indexes( + column_ids, + row_ids, + column_id=coordinates.column_index, + row_id=coordinates.row_index - 1, + ) + ) + indexes.sort() + coordinate_answer_ids = [] + if indexes: + begin_index = coordinates.token_index + indexes[0] + end_index = begin_index + len(answer_text) + for index in indexes: + if index >= begin_index and index < end_index: + coordinate_answer_ids.append(index) + if len(coordinate_answer_ids) == len(answer_text): + for index in coordinate_answer_ids: + answer_ids[index] = 1 + break + return answer_ids + + def _get_answer_ids(self, column_ids, row_ids, answer_coordinates): + """Maps answer coordinates of a question to token indexes.""" + answer_ids, missing_count = self._get_all_answer_ids(column_ids, row_ids, answer_coordinates) + + if missing_count: + raise ValueError("Couldn't find all answers") + return answer_ids + + def get_answer_ids( + self, column_ids, row_ids, tokenized_table, answer_texts_question, answer_coordinates_question + ): + if self.update_answer_coordinates: + return self._find_answer_ids_from_answer_texts( + column_ids, + row_ids, + tokenized_table, + answer_texts=[ + self.tokenize(at) + for at in answer_texts_question + ], + ) + return self._get_answer_ids(column_ids, row_ids, answer_coordinates_question) + + def _pad( + self, + encoded_inputs: Union[Dict[str, EncodedInput], BatchEncoding], + max_length: Optional[int] = None, + padding_strategy: PaddingStrategy = PaddingStrategy.DO_NOT_PAD, + pad_to_multiple_of: Optional[int] = None, + return_attention_mask: Optional[bool] = None, + ) -> dict: + """ + Pad encoded inputs (on left/right and up to predefined length or max length in the batch) + + Args: + encoded_inputs: Dictionary of tokenized inputs (`List[int]`) or batch of tokenized inputs (`List[List[int]]`). 
+            max_length: maximum length of the returned list and optionally padding length (see below).
+                Will truncate by taking into account the special tokens.
+            padding_strategy: PaddingStrategy to use for padding.
+
+                - PaddingStrategy.LONGEST: Pad to the longest sequence in the batch
+                - PaddingStrategy.MAX_LENGTH: Pad to the max length (default)
+                - PaddingStrategy.DO_NOT_PAD: Do not pad
+                The tokenizer padding sides are defined in self.padding_side:
+
+                - 'left': pads on the left of the sequences
+                - 'right': pads on the right of the sequences
+            pad_to_multiple_of: (optional) Integer if set will pad the sequence to a multiple of the provided value.
+                This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability
+                >= 7.5 (Volta).
+            return_attention_mask: (optional) Set to False to avoid returning attention mask (default: set to model specifics)
+        """
+        # Load from model defaults
+        if return_attention_mask is None:
+            return_attention_mask = "attention_mask" in self.model_input_names
+
+        if padding_strategy == PaddingStrategy.LONGEST:
+            max_length = len(encoded_inputs["input_ids"])
+
+        if max_length is not None and pad_to_multiple_of is not None and (max_length % pad_to_multiple_of != 0):
+            max_length = ((max_length // pad_to_multiple_of) + 1) * pad_to_multiple_of
+
+        needs_to_be_padded = (
+            padding_strategy != PaddingStrategy.DO_NOT_PAD and len(encoded_inputs["input_ids"]) != max_length
+        )
+
+        if needs_to_be_padded:
+            difference = max_length - len(encoded_inputs["input_ids"])
+            if self.padding_side == "right":
+                if return_attention_mask:
+                    encoded_inputs["attention_mask"] = [1] * len(encoded_inputs["input_ids"]) + [0] * difference
+                if "token_type_ids" in encoded_inputs:
+                    encoded_inputs["token_type_ids"] = (
+                        encoded_inputs["token_type_ids"] + [[self.pad_token_type_id] * 7] * difference
+                    )
+                if "label_ids" in encoded_inputs:
+                    encoded_inputs["label_ids"] = encoded_inputs["label_ids"] + [0] * difference
+                if "numeric_values" in encoded_inputs:
+                    encoded_inputs["numeric_values"] = encoded_inputs["numeric_values"] + [float("nan")] * difference
+                if "numeric_values_scale" in encoded_inputs:
+                    encoded_inputs["numeric_values_scale"] = (
+                        encoded_inputs["numeric_values_scale"] + [1.0] * difference
+                    )
+                if "special_tokens_mask" in encoded_inputs:
+                    encoded_inputs["special_tokens_mask"] = encoded_inputs["special_tokens_mask"] + [1] * difference
+                encoded_inputs["input_ids"] = encoded_inputs["input_ids"] + [self.pad_token_id] * difference
+            elif self.padding_side == "left":
+                if return_attention_mask:
+                    encoded_inputs["attention_mask"] = [0] * difference + [1] * len(encoded_inputs["input_ids"])
+                if "token_type_ids" in encoded_inputs:
+                    encoded_inputs["token_type_ids"] = (
+                        [[self.pad_token_type_id] * 7] * difference + encoded_inputs["token_type_ids"]
+                    )
+                if "label_ids" in encoded_inputs:
+                    encoded_inputs["label_ids"] = [0] * difference + encoded_inputs["label_ids"]
+                if "numeric_values" in encoded_inputs:
+                    encoded_inputs["numeric_values"] = [float("nan")] * difference + encoded_inputs["numeric_values"]
+                if "numeric_values_scale" in encoded_inputs:
+                    encoded_inputs["numeric_values_scale"] = (
+                        [1.0] * difference + encoded_inputs["numeric_values_scale"]
+                    )
+                if "special_tokens_mask" in encoded_inputs:
+                    encoded_inputs["special_tokens_mask"] = [1] * difference + encoded_inputs["special_tokens_mask"]
+                encoded_inputs["input_ids"] = [self.pad_token_id] * difference + encoded_inputs["input_ids"]
+            else:
+                raise ValueError("Invalid padding strategy: " + str(self.padding_side))
+        else:
+            if return_attention_mask:
+                encoded_inputs["attention_mask"] = [1] * len(encoded_inputs["input_ids"])
+
+        return encoded_inputs
+
+    #### Everything related to converting logits to predictions ####
+
+    def _get_cell_token_probs(self, probabilities, segment_ids, row_ids, column_ids):
+        for i, p in enumerate(probabilities):
+            segment_id = segment_ids[i]
+            col = column_ids[i] - 1
+            row = row_ids[i] - 1
+            if col >= 0 and row >= 0 and segment_id == 1:
+                yield i, p
+
+    def _get_mean_cell_probs(self, probabilities, segment_ids, row_ids, column_ids):
+        """Computes average probability per cell, aggregating over tokens."""
+        coords_to_probs = collections.defaultdict(list)
+        for i, prob in self._get_cell_token_probs(probabilities, segment_ids, row_ids, column_ids):
+            col = column_ids[i] - 1
+            row = row_ids[i] - 1
+            coords_to_probs[(col, row)].append(prob)
+        return {coords: torch.as_tensor(cell_probs).mean() for coords, cell_probs in coords_to_probs.items()}
+
+    def convert_logits_to_predictions(
+        self, data, logits, logits_agg=None, cell_classification_threshold=0.5
+    ):
+        """
+        Converts logits of :class:`~transformers.TapasForQuestionAnswering` to actual predicted answer coordinates
+        and optional aggregation indices.
+
+        Args:
+            data (:obj:`dict`):
+                Dictionary mapping features to actual values. Should be created using
+                :class:`~transformers.TapasTokenizer`.
+            logits (:obj:`torch.FloatTensor` of shape ``(batch_size, sequence_length)``):
+                Tensor containing the logits at the token level.
+            logits_agg (:obj:`torch.FloatTensor` of shape ``(batch_size, num_aggregation_labels)``, `optional`):
+                Tensor containing the aggregation logits.
+            cell_classification_threshold (:obj:`float`, `optional`, defaults to 0.5):
+                Threshold to be used for cell selection. All table cells for which their probability is larger than
+                this threshold will be selected.
+
+        Returns:
+            :obj:`tuple` comprising various elements depending on the inputs:
+                predicted_answer_coordinates (``List[List[Tuple]]`` of length ``batch_size``):
+                    Predicted answer coordinates as a list of lists of tuples. Each element in the list contains the
+                    predicted answer coordinates of a single example in the batch, as a list of tuples. Each tuple is
+                    a cell, i.e. (row index, column index).
+                predicted_aggregation_indices (``List[int]`` of length ``batch_size``, `optional`, returned when ``logits_agg`` is provided):
+                    Predicted aggregation operator indices of the aggregation head.
+        """
+        # compute probabilities from token logits
+        dist_per_token = torch.distributions.Bernoulli(logits=logits)
+        probabilities = dist_per_token.probs * data["attention_mask"].type(torch.float32).to(
+            dist_per_token.probs.device
+        )
+
+        token_types = [
+            "segment_ids",
+            "column_ids",
+            "row_ids",
+            "prev_label_ids",
+            "column_ranks",
+            "inv_column_ranks",
+            "numeric_relations",
+        ]
+
+        # collect input_ids, segment ids, row ids and column ids of batch.
+        # Shape: (batch_size, seq_len).
+        input_ids = data["input_ids"]
+        segment_ids = data["token_type_ids"][:, :, token_types.index("segment_ids")]
+        row_ids = data["token_type_ids"][:, :, token_types.index("row_ids")]
+        column_ids = data["token_type_ids"][:, :, token_types.index("column_ids")]
+
+        # next, get answer coordinates for every example in the batch
+        num_batch = input_ids.shape[0]
+        predicted_answer_coordinates = []
+        for i in range(num_batch):
+            probabilities_example = probabilities[i].tolist()
+            segment_ids_example = segment_ids[i]
+            row_ids_example = row_ids[i]
+            column_ids_example = column_ids[i]
+
+            max_width = column_ids_example.max()
+            max_height = row_ids_example.max()
+
+            if max_width == 0 and max_height == 0:
+                continue
+
+            cell_coords_to_prob = self._get_mean_cell_probs(
+                probabilities_example,
+                segment_ids_example.tolist(),
+                row_ids_example.tolist(),
+                column_ids_example.tolist(),
+            )
+
+            # Select the answers above the classification threshold.
+            answer_coordinates = []
+            for col in range(max_width):
+                for row in range(max_height):
+                    cell_prob = cell_coords_to_prob.get((col, row), None)
+                    if cell_prob is not None:
+                        if cell_prob > cell_classification_threshold:
+                            answer_coordinates.append((row, col))
+            answer_coordinates = sorted(answer_coordinates)
+            predicted_answer_coordinates.append(answer_coordinates)
+
+        output = predicted_answer_coordinates
+
+        if logits_agg is not None:
+            predicted_aggregation_indices = logits_agg.argmax(dim=-1)
+            output = (output, predicted_aggregation_indices.tolist())
+
+        return output
+
+    #### End of everything related to converting logits to predictions ####
+
+
+# Copied from transformers.models.bert.tokenization_bert.BasicTokenizer
+class BasicTokenizer(object):
+    """
+    Constructs a BasicTokenizer that will run basic tokenization (punctuation splitting, lower casing, etc.).
+
+    Args:
+        do_lower_case (:obj:`bool`, `optional`, defaults to :obj:`True`):
+            Whether or not to lowercase the input when tokenizing.
+        never_split (:obj:`Iterable`, `optional`):
+            Collection of tokens which will never be split during tokenization. Only has an effect when
+            :obj:`do_basic_tokenize=True`.
+        tokenize_chinese_chars (:obj:`bool`, `optional`, defaults to :obj:`True`):
+            Whether or not to tokenize Chinese characters. This should likely be deactivated for Japanese (see this
+            `issue <https://github.com/huggingface/transformers/issues/328>`__).
+        strip_accents: (:obj:`bool`, `optional`):
+            Whether or not to strip all accents. If this option is not specified, then it will be determined by the
+            value for :obj:`lowercase` (as in the original BERT).
+    """
+
+    def __init__(self, do_lower_case=True, never_split=None, tokenize_chinese_chars=True, strip_accents=None):
+        if never_split is None:
+            never_split = []
+        self.do_lower_case = do_lower_case
+        self.never_split = set(never_split)
+        self.tokenize_chinese_chars = tokenize_chinese_chars
+        self.strip_accents = strip_accents
+
+    def tokenize(self, text, never_split=None):
+        """
+        Basic Tokenization of a piece of text. Split on "white spaces" only, for sub-word tokenization, see
+        WordPieceTokenizer.
+
+        Args:
+            **never_split**: (`optional`) list of str
+                Kept for backward compatibility purposes. Now implemented directly at the base class level (see
+                :func:`PreTrainedTokenizer.tokenize`). List of tokens not to split.
+        """
+        # union() returns a new set by concatenating the two sets.
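+        # In other words, tokens passed via never_split at call time extend the
+        # instance-level set rather than replacing it, so e.g. both "[CLS]" and a
+        # caller-supplied marker stay unsplit.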
+        never_split = self.never_split.union(set(never_split)) if never_split else self.never_split
+        text = self._clean_text(text)
+
+        # This was added on November 1st, 2018 for the multilingual and Chinese
+        # models. This is also applied to the English models now, but it doesn't
+        # matter since the English models were not trained on any Chinese data
+        # and generally don't have any Chinese data in them (there are Chinese
+        # characters in the vocabulary because Wikipedia does have some Chinese
+        # words in the English Wikipedia.).
+        if self.tokenize_chinese_chars:
+            text = self._tokenize_chinese_chars(text)
+        orig_tokens = whitespace_tokenize(text)
+        split_tokens = []
+        for token in orig_tokens:
+            if token not in never_split:
+                if self.do_lower_case:
+                    token = token.lower()
+                    if self.strip_accents is not False:
+                        token = self._run_strip_accents(token)
+                elif self.strip_accents:
+                    token = self._run_strip_accents(token)
+            split_tokens.extend(self._run_split_on_punc(token, never_split))
+
+        output_tokens = whitespace_tokenize(" ".join(split_tokens))
+        return output_tokens
+
+    def _run_strip_accents(self, text):
+        """Strips accents from a piece of text."""
+        text = unicodedata.normalize("NFD", text)
+        output = []
+        for char in text:
+            cat = unicodedata.category(char)
+            if cat == "Mn":
+                continue
+            output.append(char)
+        return "".join(output)
+
+    def _run_split_on_punc(self, text, never_split=None):
+        """Splits punctuation on a piece of text."""
+        if never_split is not None and text in never_split:
+            return [text]
+        chars = list(text)
+        i = 0
+        start_new_word = True
+        output = []
+        while i < len(chars):
+            char = chars[i]
+            if _is_punctuation(char):
+                output.append([char])
+                start_new_word = True
+            else:
+                if start_new_word:
+                    output.append([])
+                start_new_word = False
+                output[-1].append(char)
+            i += 1
+
+        return ["".join(x) for x in output]
+
+    def _tokenize_chinese_chars(self, text):
+        """Adds whitespace around any CJK character."""
+        output = []
+        for char in text:
+            cp = ord(char)
+            if self._is_chinese_char(cp):
+                output.append(" ")
+                output.append(char)
+                output.append(" ")
+            else:
+                output.append(char)
+        return "".join(output)
+
+    def _is_chinese_char(self, cp):
+        """Checks whether CP is the codepoint of a CJK character."""
+        # This defines a "chinese character" as anything in the CJK Unicode block:
+        #     https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block)
+        #
+        # Note that the CJK Unicode block is NOT all Japanese and Korean characters,
+        # despite its name. The modern Korean Hangul alphabet is a different block,
+        # as is Japanese Hiragana and Katakana. Those alphabets are used to write
+        # space-separated words, so they are not treated specially and handled
+        # like all of the other languages.
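+        # For example, ord("中") == 0x4E2D falls in the base CJK block matched
+        # below, while Hiragana "あ" (0x3042) does not and is handled by the
+        # regular whitespace/punctuation logic.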
+        if (
+            (cp >= 0x4E00 and cp <= 0x9FFF)
+            or (cp >= 0x3400 and cp <= 0x4DBF)  #
+            or (cp >= 0x20000 and cp <= 0x2A6DF)  #
+            or (cp >= 0x2A700 and cp <= 0x2B73F)  #
+            or (cp >= 0x2B740 and cp <= 0x2B81F)  #
+            or (cp >= 0x2B820 and cp <= 0x2CEAF)  #
+            or (cp >= 0xF900 and cp <= 0xFAFF)
+            or (cp >= 0x2F800 and cp <= 0x2FA1F)  #
+        ):  #
+            return True
+
+        return False
+
+    def _clean_text(self, text):
+        """Performs invalid character removal and whitespace cleanup on text."""
+        output = []
+        for char in text:
+            cp = ord(char)
+            if cp == 0 or cp == 0xFFFD or _is_control(char):
+                continue
+            if _is_whitespace(char):
+                output.append(" ")
+            else:
+                output.append(char)
+        return "".join(output)
+
+
+# Copied from transformers.models.bert.tokenization_bert.WordpieceTokenizer
+class WordpieceTokenizer(object):
+    """Runs WordPiece tokenization."""
+
+    def __init__(self, vocab, unk_token, max_input_chars_per_word=100):
+        self.vocab = vocab
+        self.unk_token = unk_token
+        self.max_input_chars_per_word = max_input_chars_per_word
+
+    def tokenize(self, text):
+        """
+        Tokenizes a piece of text into its word pieces. This uses a greedy longest-match-first algorithm to perform
+        tokenization using the given vocabulary. For example, :obj:`input = "unaffable"` will return as output
+        :obj:`["un", "##aff", "##able"]`.
+
+        Args:
+            text: A single token or whitespace separated tokens. This should have
+                already been passed through `BasicTokenizer`.
+
+        Returns:
+            A list of wordpiece tokens.
+        """
+
+        output_tokens = []
+        for token in whitespace_tokenize(text):
+            chars = list(token)
+            if len(chars) > self.max_input_chars_per_word:
+                output_tokens.append(self.unk_token)
+                continue
+
+            is_bad = False
+            start = 0
+            sub_tokens = []
+            while start < len(chars):
+                end = len(chars)
+                cur_substr = None
+                while start < end:
+                    substr = "".join(chars[start:end])
+                    if start > 0:
+                        substr = "##" + substr
+                    if substr in self.vocab:
+                        cur_substr = substr
+                        break
+                    end -= 1
+                if cur_substr is None:
+                    is_bad = True
+                    break
+                sub_tokens.append(cur_substr)
+                start = end
+
+            if is_bad:
+                output_tokens.append(self.unk_token)
+            else:
+                output_tokens.extend(sub_tokens)
+        return output_tokens
+
+
+# Below: utilities for TAPAS tokenizer (independent from PyTorch/Tensorflow).
+# This includes functions to parse numeric values (dates and numbers) from both the table and questions in order
+# to create the column_ranks, inv_column_ranks, numeric_values, numeric_values_scale and numeric_relations in
+# prepare_for_model of TapasTokenizer.
+# These are meant to be used in an academic setup; for production use cases, Gold mine or Aqua should be used.
+
+
+# taken from constants.py of the original implementation
+# URL: https://github.com/google-research/tapas/blob/master/tapas/utils/constants.py
+class Relation(enum.Enum):
+    HEADER_TO_CELL = 1  # Connects header to cell.
+    CELL_TO_HEADER = 2  # Connects cell to header.
+    QUERY_TO_HEADER = 3  # Connects query to headers.
+    QUERY_TO_CELL = 4  # Connects query to cells.
+    ROW_TO_CELL = 5  # Connects row to cells.
+    CELL_TO_ROW = 6  # Connects cells to row.
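+    # The three relations below compare a numeric value parsed from the question
+    # against a numeric cell value; their offsets from EQ serve as bit positions
+    # when packing relation sets in TapasTokenizer._get_numeric_relations.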
+    EQ = 7  # Annotation value is same as cell value.
+    LT = 8  # Annotation value is less than cell value.
+    GT = 9  # Annotation value is greater than cell value.
+
+
+@dataclass
+class Date:
+    year: Optional[int] = None
+    month: Optional[int] = None
+    day: Optional[int] = None
+
+
+@dataclass
+class NumericValue:
+    float_value: Optional[float] = None
+    date: Optional[Date] = None
+
+
+@dataclass
+class NumericValueSpan:
+    begin_index: Optional[int] = None
+    end_index: Optional[int] = None
+    values: Optional[List[NumericValue]] = None
+
+
+@dataclass
+class Cell:
+    text: Text
+    numeric_value: Optional[NumericValue] = None
+
+
+@dataclass
+class Question:
+    original_text: Text  # The original raw question string.
+    text: Text  # The question string after normalization.
+    numeric_spans: Optional[List[NumericValueSpan]] = None
+
+
+# Below: all functions from number_utils.py as well as 2 functions (namely get_all_spans and normalize_for_match)
+# from text_utils.py of the original implementation. URLs:
+# - https://github.com/google-research/tapas/blob/master/tapas/utils/number_utils.py
+# - https://github.com/google-research/tapas/blob/master/tapas/utils/text_utils.py
+
+
+# Constants for parsing date expressions.
+# Masks that specify (by a bool) which of (year, month, day) will be populated.
+_DateMask = collections.namedtuple("_DateMask", ["year", "month", "day"])
+
+_YEAR = _DateMask(True, False, False)
+_YEAR_MONTH = _DateMask(True, True, False)
+_YEAR_MONTH_DAY = _DateMask(True, True, True)
+_MONTH = _DateMask(False, True, False)
+_MONTH_DAY = _DateMask(False, True, True)
+
+# Pairs of patterns to pass to 'datetime.strptime' and masks specifying which
+# fields will be set by the corresponding pattern.
+_DATE_PATTERNS = (
+    ("%B", _MONTH),
+    ("%Y", _YEAR),
+    ("%Ys", _YEAR),
+    ("%b %Y", _YEAR_MONTH),
+    ("%B %Y", _YEAR_MONTH),
+    ("%B %d", _MONTH_DAY),
+    ("%b %d", _MONTH_DAY),
+    ("%d %b", _MONTH_DAY),
+    ("%d %B", _MONTH_DAY),
+    ("%B %d, %Y", _YEAR_MONTH_DAY),
+    ("%d %B %Y", _YEAR_MONTH_DAY),
+    ("%m-%d-%Y", _YEAR_MONTH_DAY),
+    ("%Y-%m-%d", _YEAR_MONTH_DAY),
+    ("%Y-%m", _YEAR_MONTH),
+    ("%B %Y", _YEAR_MONTH),
+    ("%d %b %Y", _YEAR_MONTH_DAY),
+    ("%Y-%m-%d", _YEAR_MONTH_DAY),
+    ("%b %d, %Y", _YEAR_MONTH_DAY),
+    ("%d.%m.%Y", _YEAR_MONTH_DAY),
+    ("%A, %b %d", _MONTH_DAY),
+    ("%A, %B %d", _MONTH_DAY),
+)
+
+# This mapping is used to convert date patterns to regex patterns.
+_FIELD_TO_REGEX = (
+    ("%A", r"\w+"),  # Weekday as locale’s full name.
+    ("%B", r"\w+"),  # Month as locale’s full name.
+    ("%Y", r"\d{4}"),  # Year with century as a decimal number.
+    ("%b", r"\w{3}"),  # Month as locale’s abbreviated name.
+    ("%d", r"\d{1,2}"),  # Day of the month as a zero-padded decimal number.
+    ("%m", r"\d{1,2}"),  # Month as a zero-padded decimal number.
+)
+
+
+def _process_date_pattern(dp):
+    """Compute a regex for each date pattern to use as a prefilter."""
+    pattern, mask = dp
+    regex = pattern
+    regex = regex.replace(".", re.escape("."))
+    regex = regex.replace("-", re.escape("-"))
+    regex = regex.replace(" ", r"\s+")
+    for field, field_regex in _FIELD_TO_REGEX:
+        regex = regex.replace(field, field_regex)
+    # Make sure we didn't miss any of the fields.
+    assert "%" not in regex, regex
+    return pattern, mask, re.compile("^" + regex + "$")
+
+
+def _process_date_patterns():
+    return tuple(_process_date_pattern(dp) for dp in _DATE_PATTERNS)
+
+
+_PROCESSED_DATE_PATTERNS = _process_date_patterns()
+
+_MAX_DATE_NGRAM_SIZE = 5
+
+# Following DynSp:
+# https://github.com/Microsoft/DynSP/blob/master/util.py#L414.
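+# Cardinal words that parse_text maps to their float values (e.g. "three" -> 3.0);
+# the ordinal words below map the same way (e.g. "third" -> 3.0).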
+_NUMBER_WORDS = [
+    "zero",
+    "one",
+    "two",
+    "three",
+    "four",
+    "five",
+    "six",
+    "seven",
+    "eight",
+    "nine",
+    "ten",
+    "eleven",
+    "twelve",
+]
+
+_ORDINAL_WORDS = [
+    "zeroth",
+    "first",
+    "second",
+    "third",
+    "fourth",
+    "fifth",
+    "sixth",
+    "seventh",
+    "eighth",
+    "ninth",
+    "tenth",
+    "eleventh",
+    "twelfth",
+]
+
+_ORDINAL_SUFFIXES = ["st", "nd", "rd", "th"]
+
+_NUMBER_PATTERN = re.compile(r"((^|\s)[+-])?((\.\d+)|(\d+(,\d\d\d)*(\.\d*)?))")
+
+# Following DynSp:
+# https://github.com/Microsoft/DynSP/blob/master/util.py#L293.
+_MIN_YEAR = 1700
+_MAX_YEAR = 2016
+
+_INF = float("INF")
+
+
+def _get_numeric_value_from_date(date, mask):
+    """Converts date (datetime Python object) to a NumericValue object with a Date object value."""
+    if date.year < _MIN_YEAR or date.year > _MAX_YEAR:
+        raise ValueError("Invalid year: %d" % date.year)
+
+    new_date = Date()
+    if mask.year:
+        new_date.year = date.year
+    if mask.month:
+        new_date.month = date.month
+    if mask.day:
+        new_date.day = date.day
+    return NumericValue(date=new_date)
+
+
+def _get_span_length_key(span):
+    """Sorts spans by decreasing length first and increasing first index second."""
+    return span[1] - span[0], -span[0]
+
+
+def _get_numeric_value_from_float(value):
+    """Converts float (Python) to a NumericValue object with a float value."""
+    return NumericValue(float_value=value)
+
+
+# Doesn't parse ordinal expressions such as '18th of february 1655'.
+def _parse_date(text):
+    """Attempts to format a text as a standard date string (yyyy-mm-dd)."""
+    text = re.sub(r"Sept\b", "Sep", text)
+    for in_pattern, mask, regex in _PROCESSED_DATE_PATTERNS:
+        if not regex.match(text):
+            continue
+        try:
+            date = datetime.datetime.strptime(text, in_pattern).date()
+        except ValueError:
+            continue
+        try:
+            return _get_numeric_value_from_date(date, mask)
+        except ValueError:
+            continue
+    return None
+
+
+def _parse_number(text):
+    """Parses simple cardinal and ordinal numbers."""
+    for suffix in _ORDINAL_SUFFIXES:
+        if text.endswith(suffix):
+            text = text[: -len(suffix)]
+            break
+    text = text.replace(",", "")
+    try:
+        value = float(text)
+    except ValueError:
+        return None
+    if math.isnan(value):
+        return None
+    if value == _INF:
+        return None
+    return value
+
+
+def get_all_spans(text, max_ngram_length):
+    """
+    Split a text into all possible ngrams up to 'max_ngram_length'. Split points are white space and punctuation.
+
+    Args:
+        text: Text to split.
+        max_ngram_length: maximal ngram length.
+    Yields:
+        Spans, tuples of begin-end index.
+    """
+    start_indexes = []
+    for index, char in enumerate(text):
+        if not char.isalnum():
+            continue
+        if index == 0 or not text[index - 1].isalnum():
+            start_indexes.append(index)
+        if index + 1 == len(text) or not text[index + 1].isalnum():
+            for start_index in start_indexes[-max_ngram_length:]:
+                yield start_index, index + 1
+
+
+def normalize_for_match(text):
+    return " ".join(text.lower().split())
+
+
+def format_text(text):
+    """Lowercases and strips punctuation."""
+    text = text.lower().strip()
+    if text == "n/a" or text == "?" or text == "nan":
+        text = EMPTY_TEXT
+
+    text = re.sub(r"[^\w\d]+", " ", text).replace("_", " ")
+    text = " ".join(text.split())
+    text = text.strip()
+    if text:
+        return text
+    return EMPTY_TEXT
+
+
+def parse_text(text):
+    """
+    Extracts longest number and date spans.
+
+    Args:
+        text: text to annotate.
+
+    Returns:
+        List of longest numeric value spans.
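+
+    For example (an illustrative call): parse_text("born in august 2007") yields a single
+    NumericValueSpan covering "august 2007", whose value is a NumericValue holding a
+    Date with year=2007 and month=8; the shorter contained spans "august" and "2007" are dropped.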
+ """ + span_dict = collections.defaultdict(list) + for match in _NUMBER_PATTERN.finditer(text): + span_text = text[match.start() : match.end()] + number = _parse_number(span_text) + if number is not None: + span_dict[match.span()].append(_get_numeric_value_from_float(number)) + + for begin_index, end_index in get_all_spans(text, max_ngram_length=1): + if (begin_index, end_index) in span_dict: + continue + span_text = text[begin_index:end_index] + + number = _parse_number(span_text) + if number is not None: + span_dict[begin_index, end_index].append(_get_numeric_value_from_float(number)) + for number, word in enumerate(_NUMBER_WORDS): + if span_text == word: + span_dict[begin_index, end_index].append(_get_numeric_value_from_float(float(number))) + break + for number, word in enumerate(_ORDINAL_WORDS): + if span_text == word: + span_dict[begin_index, end_index].append(_get_numeric_value_from_float(float(number))) + break + + for begin_index, end_index in get_all_spans(text, max_ngram_length=_MAX_DATE_NGRAM_SIZE): + span_text = text[begin_index:end_index] + date = _parse_date(span_text) + if date is not None: + span_dict[begin_index, end_index].append(date) + + spans = sorted(span_dict.items(), key=lambda span_value: _get_span_length_key(span_value[0]), reverse=True) + selected_spans = [] + for span, value in spans: + for selected_span, _ in selected_spans: + if selected_span[0] <= span[0] and span[1] <= selected_span[1]: + break + else: + selected_spans.append((span, value)) + + selected_spans.sort(key=lambda span_value: span_value[0][0]) + + numeric_value_spans = [] + for span, values in selected_spans: + numeric_value_spans.append(NumericValueSpan(begin_index=span[0], end_index=span[1], values=values)) + return numeric_value_spans + + +# Below: all functions from number_annotation_utils.py and 2 functions (namely filter_invalid_unicode +# and filter_invalid_unicode_from_table) from text_utils.py of the original implementation. URL's: +# - https://github.com/google-research/tapas/blob/master/tapas/utils/number_annotation_utils.py +# - https://github.com/google-research/tapas/blob/master/tapas/utils/text_utils.py + + +_PrimitiveNumericValue = Union[float, Tuple[Optional[float], Optional[float], Optional[float]]] +_SortKeyFn = Callable[[NumericValue], Tuple[float, Ellipsis]] + +_DATE_TUPLE_SIZE = 3 + +EMPTY_TEXT = 'EMPTY' + +NUMBER_TYPE = "number" +DATE_TYPE = "date" + + +def _get_value_type(numeric_value): + if numeric_value.float_value is not None: + return NUMBER_TYPE + elif numeric_value.date is not None: + return DATE_TYPE + raise ValueError("Unknown type: %s" % numeric_value) + + +def _get_value_as_primitive_value(numeric_value): + """Maps a NumericValue proto to a float or tuple of float.""" + if numeric_value.float_value is not None: + return numeric_value.float_value + if numeric_value.date is not None: + date = numeric_value.date + value_tuple = [None, None, None] + # All dates fields are cased to float to produce a simple primitive value. + if date.year is not None: + value_tuple[0] = float(date.year) + if date.month is not None: + value_tuple[1] = float(date.month) + if date.day is not None: + value_tuple[2] = float(date.day) + return tuple(value_tuple) + raise ValueError("Unknown type: %s" % numeric_value) + + +def _get_all_types(numeric_values): + return {_get_value_type(value) for value in numeric_values} + + +def get_numeric_sort_key_fn(numeric_values): + """ + Creates a function that can be used as a sort key or to compare the values. 
+    Maps to primitive types and finds the biggest common subset. Consider the values "05/05/2010" and "August 2007",
+    with corresponding primitive values (2010., 5., 5.) and (2007., 8., None). These values can be compared by year
+    and month, so we map to the sequences (2010., 5.) and (2007., 8.). If we added a third value "2006" with primitive
+    value (2006., None, None), we could only compare by the year, so we would map to (2010.,), (2007.,) and (2006.,).
+
+    Args:
+        numeric_values: Values to compare.
+
+    Returns:
+        A function that can be used as a sort key function (mapping numeric values to a comparable tuple).
+
+    Raises:
+        ValueError if values don't have a common type or are not comparable.
+    """
+    value_types = _get_all_types(numeric_values)
+    if len(value_types) != 1:
+        raise ValueError("No common value type in %s" % numeric_values)
+
+    value_type = next(iter(value_types))
+    if value_type == NUMBER_TYPE:
+        # Primitive values are simple floats, nothing to do here.
+        return _get_value_as_primitive_value
+
+    # The type can only be Date at this point which means the primitive type
+    # is a float triple.
+    valid_indexes = set(range(_DATE_TUPLE_SIZE))
+
+    for numeric_value in numeric_values:
+        value = _get_value_as_primitive_value(numeric_value)
+        assert isinstance(value, tuple)
+        for tuple_index, inner_value in enumerate(value):
+            if inner_value is None:
+                valid_indexes.discard(tuple_index)
+
+    if not valid_indexes:
+        raise ValueError("No common value in %s" % numeric_values)
+
+    def _sort_key_fn(numeric_value):
+        value = _get_value_as_primitive_value(numeric_value)
+        return tuple(value[index] for index in valid_indexes)
+
+    return _sort_key_fn
+
+
+def _consolidate_numeric_values(row_index_to_values, min_consolidation_fraction, debug_info):
+    """
+    Finds the most common numeric values in a column and returns them.
+
+    Args:
+        row_index_to_values:
+            For each row index all the values in that cell.
+        min_consolidation_fraction:
+            Fraction of cells that need to have consolidated value.
+        debug_info:
+            Additional information only used for logging.
+
+    Returns:
+        For each row index the first value that matches the most common value. Rows that don't have a matching value
+        are dropped. Empty dictionary if values can't be consolidated.
+    """
+    type_counts = collections.Counter()
+    for numeric_values in row_index_to_values.values():
+        type_counts.update(_get_all_types(numeric_values))
+    if not type_counts:
+        return {}
+    max_count = max(type_counts.values())
+    if max_count < len(row_index_to_values) * min_consolidation_fraction:
+        # logging.log_every_n(logging.INFO, "Can't consolidate types: %s %s %d", 100,
+        #                     debug_info, row_index_to_values, max_count)
+        return {}
+
+    valid_types = set()
+    for value_type, count in type_counts.items():
+        if count == max_count:
+            valid_types.add(value_type)
+    if len(valid_types) > 1:
+        assert DATE_TYPE in valid_types
+        max_type = DATE_TYPE
+    else:
+        max_type = next(iter(valid_types))
+
+    new_row_index_to_value = {}
+    for index, values in row_index_to_values.items():
+        # Extract the first matching value.
+        for value in values:
+            if _get_value_type(value) == max_type:
+                new_row_index_to_value[index] = value
+                break
+
+    return new_row_index_to_value
+
+
+def _get_numeric_values(text):
+    """Parses text and returns numeric values."""
+    numeric_spans = parse_text(text)
+    return itertools.chain(*(span.values for span in numeric_spans))
+
+
+def _get_column_values(table, col_index):
+    """
+    Parses text in a column and returns a dict mapping row_index to values.
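+    For example (illustrative), a column with cell texts ["33", "35"] maps to
+    {0: [NumericValue(float_value=33.0)], 1: [NumericValue(float_value=35.0)]}.
+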
+    This is the _get_column_values function from number_annotation_utils.py of the original implementation.
+
+    Args:
+        table: Pandas dataframe.
+        col_index: integer, indicating the index of the column to get the numeric values of.
+    """
+    index_to_values = {}
+    for row_index, row in table.iterrows():
+        text = normalize_for_match(row[col_index].text)
+        index_to_values[row_index] = list(_get_numeric_values(text))
+    return index_to_values
+
+
+def get_numeric_relation(value, other_value, sort_key_fn):
+    """Compares two values and returns their relation or None."""
+    value = sort_key_fn(value)
+    other_value = sort_key_fn(other_value)
+    if value == other_value:
+        return Relation.EQ
+    if value < other_value:
+        return Relation.LT
+    if value > other_value:
+        return Relation.GT
+    return None
+
+
+def add_numeric_values_to_question(question):
+    """Adds numeric value spans to a question."""
+    original_text = question
+    question = normalize_for_match(question)
+    numeric_spans = parse_text(question)
+    return Question(original_text=original_text, text=question, numeric_spans=numeric_spans)
+
+
+def filter_invalid_unicode(text):
+    """Return an empty string and True if 'text' is in invalid unicode."""
+    return ("", True) if isinstance(text, bytes) else (text, False)
+
+
+def filter_invalid_unicode_from_table(table):
+    """
+    Removes invalid unicode from a table. Checks whether a table cell text contains an invalid unicode encoding. If
+    yes, resets the table cell text to an empty string and logs a warning for each invalid cell.
+
+    Args:
+        table: table to clean.
+    """
+    # TODO: add table id support
+    if not hasattr(table, "table_id"):
+        table.table_id = 0
+
+    for row_index, row in table.iterrows():
+        for col_index, cell in enumerate(row):
+            cell, is_invalid = filter_invalid_unicode(cell)
+            if is_invalid:
+                # Write the cleaned (empty) cell text back into the table, as documented above.
+                table.iloc[row_index, col_index] = cell
+                logging.warning(
+                    "Scrub an invalid table body @ table_id: %s, row_index: %d, col_index: %d",
+                    table.table_id,
+                    row_index,
+                    col_index,
+                )
+    for col_index, column in enumerate(table.columns):
+        column, is_invalid = filter_invalid_unicode(column)
+        if is_invalid:
+            logging.warning(
+                "Scrub an invalid table header @ table_id: %s, col_index: %d",
+                table.table_id,
+                col_index,
+            )
+
+
+def add_numeric_table_values(table, min_consolidation_fraction=0.7, debug_info=None):
+    """
+    Parses text in the table column-wise and adds the consolidated values. Consolidation refers to finding values
+    with a common type (date or number).
+
+    Args:
+        table:
+            Table to annotate.
+        min_consolidation_fraction:
+            Fraction of cells in a column that need to have consolidated value.
+        debug_info:
+            Additional information used for logging.
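+
+    Returns:
+        A copy of the table in which every cell is a Cell object, with its numeric_value attribute set whenever
+        column-wise consolidation succeeded.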
+ """ + table = table.copy() + # First, filter table on invalid unicode + filter_invalid_unicode_from_table(table) + + # Second, replace cell values by Cell objects + for row_index, row in table.iterrows(): + for col_index, cell in enumerate(row): + table.iloc[row_index, col_index] = Cell(text=cell) + + # Third, add numeric_value attributes to these Cell objects + for col_index, column in enumerate(table.columns): + column_values = _consolidate_numeric_values( + _get_column_values(table, col_index), + min_consolidation_fraction=min_consolidation_fraction, + debug_info=(debug_info, column)) + + for row_index, numeric_value in column_values.items(): + table.iloc[row_index, col_index].numeric_value = numeric_value + + return table \ No newline at end of file diff --git a/tests/test_modeling_tapas.py b/tests/test_modeling_tapas.py new file mode 100644 index 000000000000..95bf1614dff0 --- /dev/null +++ b/tests/test_modeling_tapas.py @@ -0,0 +1,863 @@ +# coding=utf-8 +# Copyright 2020 Google Research and The HuggingFace Inc. team. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + + +import unittest + +import copy + +import numpy as np +import pandas as pd + +from transformers import is_torch_available +from transformers.file_utils import cached_property +from transformers.testing_utils import require_torch, slow, torch_device + +from .test_configuration_common import ConfigTester +from .test_modeling_common import ModelTesterMixin, floats_tensor, ids_tensor, random_attention_mask + + +if is_torch_available(): + import torch + + from transformers import ( + TAPAS_PRETRAINED_MODEL_ARCHIVE_LIST, + TapasConfig, + #TapasForMaskedLM, + TapasForQuestionAnswering, + TapasForSequenceClassification, + TapasModel, + ) + + from transformers.modeling_tapas import ( + IndexMap, + ProductIndexMap, + gather, + flatten, + range_index_map, + reduce_sum, + reduce_mean, + reduce_max, + reduce_min, + ) + + +class TapasModelTester: + """You can also import this e.g from .test_modeling_tapas import TapasModelTester """ + + def __init__( + self, + parent, + batch_size=13, + seq_length=7, + is_training=True, + use_input_mask=True, + use_token_type_ids=True, + use_labels=True, + vocab_size=99, + hidden_size=32, + num_hidden_layers=5, + num_attention_heads=4, + intermediate_size=37, + hidden_act="gelu", + hidden_dropout_prob=0.1, + attention_probs_dropout_prob=0.1, + initializer_range=0.02, + max_position_embeddings=512, + type_vocab_sizes=[3, 256, 256, 2, 256, 256, 10], + type_sequence_label_size=2, + positive_weight=10.0, + num_aggregation_labels=4, + num_labels=2, + aggregation_loss_importance=0.8, + use_answer_as_supervision=True, + answer_loss_importance=0.001, + use_normalized_answer_loss=False, + huber_loss_delta=25.0, + temperature=1.0, + agg_temperature=1.0, + use_gumbel_for_cells=False, + use_gumbel_for_agg=False, + average_approximation_function="ratio", + cell_selection_preference=0.5, + answer_loss_cutoff=100, + max_num_rows=64, + max_num_columns=32, + average_logits_per_cell=True, + select_one_column=True, + 
allow_empty_column_selection=False, + init_cell_selection_weights_to_zero=False, + reset_position_index_per_cell=True, + disable_per_token_loss=False, + scope=None, + ): + self.parent = parent + self.batch_size = batch_size + self.seq_length = seq_length + self.is_training = is_training + self.use_input_mask = use_input_mask + self.use_token_type_ids = use_token_type_ids + self.use_labels = use_labels + self.vocab_size = vocab_size + self.hidden_size = hidden_size + self.num_hidden_layers = num_hidden_layers + self.num_attention_heads = num_attention_heads + self.intermediate_size = intermediate_size + self.hidden_act = hidden_act + self.hidden_dropout_prob = hidden_dropout_prob + self.attention_probs_dropout_prob = attention_probs_dropout_prob + self.initializer_range = initializer_range + self.max_position_embeddings = max_position_embeddings + self.type_vocab_sizes = type_vocab_sizes + self.type_sequence_label_size = type_sequence_label_size + self.positive_weight = positive_weight + self.num_aggregation_labels = num_aggregation_labels + self.num_labels = num_labels + self.aggregation_loss_importance = aggregation_loss_importance + self.use_answer_as_supervision = use_answer_as_supervision + self.answer_loss_importance = answer_loss_importance + self.use_normalized_answer_loss = use_normalized_answer_loss + self.huber_loss_delta = huber_loss_delta + self.temperature = temperature + self.agg_temperature = agg_temperature + self.use_gumbel_for_cells = use_gumbel_for_cells + self.use_gumbel_for_agg = use_gumbel_for_agg + self.average_approximation_function = average_approximation_function + self.cell_selection_preference = cell_selection_preference + self.answer_loss_cutoff = answer_loss_cutoff + self.max_num_rows = max_num_rows + self.max_num_columns = max_num_columns + self.average_logits_per_cell = average_logits_per_cell + self.select_one_column = select_one_column + self.allow_empty_column_selection = allow_empty_column_selection + self.init_cell_selection_weights_to_zero = init_cell_selection_weights_to_zero + self.reset_position_index_per_cell = reset_position_index_per_cell + self.disable_per_token_loss = disable_per_token_loss + self.scope = scope + + def prepare_config_and_inputs(self): + input_ids = ids_tensor([self.batch_size, self.seq_length], self.vocab_size) + + input_mask = None + if self.use_input_mask: + input_mask = random_attention_mask([self.batch_size, self.seq_length]) + + token_type_ids = [] + for type_vocab_size in self.type_vocab_sizes: + token_type_ids.append(ids_tensor(shape=[self.batch_size, self.seq_length], vocab_size=type_vocab_size)) + token_type_ids = torch.stack(token_type_ids, dim=2) + + sequence_labels = None + token_labels = None + label_ids = None + answer = None + numeric_values = None + numeric_values_scale = None + float_answer = None + aggregation_labels = None + if self.use_labels: + sequence_labels = ids_tensor([self.batch_size], self.type_sequence_label_size) + token_labels = ids_tensor([self.batch_size, self.seq_length], self.num_labels) + label_ids = ids_tensor([self.batch_size, self.seq_length], vocab_size=2) + numeric_values = floats_tensor([self.batch_size, self.seq_length]) + numeric_values_scale = floats_tensor([self.batch_size, self.seq_length]) + float_answer = floats_tensor([self.batch_size]) + aggregation_labels = ids_tensor([self.batch_size], self.num_aggregation_labels) + + config = TapasConfig( + vocab_size=self.vocab_size, + hidden_size=self.hidden_size, + num_hidden_layers=self.num_hidden_layers, + 
num_attention_heads=self.num_attention_heads, + intermediate_size=self.intermediate_size, + hidden_act=self.hidden_act, + hidden_dropout_prob=self.hidden_dropout_prob, + attention_probs_dropout_prob=self.attention_probs_dropout_prob, + max_position_embeddings=self.max_position_embeddings, + type_vocab_sizes=self.type_vocab_sizes, + initializer_range=self.initializer_range, + positive_weight=self.positive_weight, + num_aggregation_labels=self.num_aggregation_labels, + num_labels=self.num_labels, + aggregation_loss_importance=self.aggregation_loss_importance, + use_answer_as_supervision=self.use_answer_as_supervision, + answer_loss_importance=self.answer_loss_importance, + use_normalized_answer_loss=self.use_normalized_answer_loss, + huber_loss_delta=self.huber_loss_delta, + temperature=self.temperature, + agg_temperature=self.agg_temperature, + use_gumbel_for_cells=self.use_gumbel_for_cells, + use_gumbel_for_agg=self.use_gumbel_for_agg, + average_approximation_function=self.average_approximation_function, + cell_selection_preference=self.cell_selection_preference, + answer_loss_cutoff=self.answer_loss_cutoff, + max_num_rows=self.max_num_rows, + max_num_columns=self.max_num_columns, + average_logits_per_cell=self.average_logits_per_cell, + select_one_column=self.select_one_column, + allow_empty_column_selection=self.allow_empty_column_selection, + init_cell_selection_weights_to_zero=self.init_cell_selection_weights_to_zero, + reset_position_index_per_cell=self.reset_position_index_per_cell, + disable_per_token_loss=self.disable_per_token_loss, + return_dict=True, + ) + + return ( + config, + input_ids, + input_mask, + token_type_ids, + sequence_labels, + token_labels, + label_ids, + numeric_values, + numeric_values_scale, + float_answer, + aggregation_labels, + ) + + def create_and_check_model( + self, + config, + input_ids, + input_mask, + token_type_ids, + sequence_labels, + token_labels, + label_ids, + numeric_values, + numeric_values_scale, + float_answer, + aggregation_labels, + ): + model = TapasModel(config=config) + model.to(torch_device) + model.eval() + result = model(input_ids, attention_mask=input_mask, token_type_ids=token_type_ids) + result = model(input_ids, token_type_ids=token_type_ids) + result = model(input_ids) + self.parent.assertEqual(result.last_hidden_state.shape, (self.batch_size, self.seq_length, self.hidden_size)) + self.parent.assertEqual(result.pooler_output.shape, (self.batch_size, self.hidden_size)) + + # def create_and_check_for_masked_lm( + # self, + # config, + # input_ids, + # input_mask, + # token_type_ids, + # sequence_labels, + # token_labels, + # label_ids, + # numeric_values, + # numeric_values_scale, + # float_answer, + # aggregation_labels, + # ): + # model = TapasForMaskedLM(config=config) + # model.to(torch_device) + # model.eval() + # result = model(input_ids, attention_mask=input_mask, token_type_ids=token_type_ids, labels=token_labels) + # self.parent.assertEqual(result.logits.shape, (self.batch_size, self.seq_length, self.vocab_size)) + + def create_and_check_for_question_answering( + self, + config, + input_ids, + input_mask, + token_type_ids, + sequence_labels, + token_labels, + label_ids, + numeric_values, + numeric_values_scale, + float_answer, + aggregation_labels, + ): + # inference: without aggregation head (SQA). 
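+        # With num_aggregation_labels set to 0 the aggregation head is dropped, so the
+        # model only returns cell-selection logits of shape (batch_size, seq_length);
+        # this mirrors the conversational SQA setup.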
+ sqa_config = copy.copy(config) + sqa_config.num_aggregation_labels = 0 + sqa_config.use_answer_as_supervision = False + model = TapasForQuestionAnswering(config=sqa_config) + model.to(torch_device) + model.eval() + result = model( + input_ids=input_ids, + attention_mask=input_mask, + token_type_ids=token_type_ids, + ) + self.parent.assertEqual(result.logits.shape, (self.batch_size, self.seq_length)) + + # inference: with aggregation head (WTQ, WikiSQL-supervised) + model = TapasForQuestionAnswering(config=config) + model.to(torch_device) + model.eval() + result = model( + input_ids=input_ids, + attention_mask=input_mask, + token_type_ids=token_type_ids, + ) + self.parent.assertEqual(result.logits.shape, (self.batch_size, self.seq_length)) + self.parent.assertEqual(result.logits_aggregation.shape, (self.batch_size, self.num_aggregation_labels)) + + # training: can happen in 3 main ways + # case 1: conversational (SQA) + model = TapasForQuestionAnswering(config=sqa_config) + model.to(torch_device) + model.eval() + result = model( + input_ids, + attention_mask=input_mask, + token_type_ids=token_type_ids, + label_ids=label_ids, + ) + self.parent.assertEqual(result.loss.shape, ()) + self.parent.assertEqual(result.logits.shape, (self.batch_size, self.seq_length)) + + # case 2: weak supervision for aggregation (WTQ) + model = TapasForQuestionAnswering(config=config) + model.to(torch_device) + model.eval() + result = model( + input_ids=input_ids, + attention_mask=input_mask, + token_type_ids=token_type_ids, + label_ids=label_ids, + numeric_values=numeric_values, + numeric_values_scale=numeric_values_scale, + float_answer=float_answer, + ) + self.parent.assertEqual(result.loss.shape, ()) + self.parent.assertEqual(result.logits.shape, (self.batch_size, self.seq_length)) + self.parent.assertEqual(result.logits_aggregation.shape, (self.batch_size, self.num_aggregation_labels)) + + # case 3: strong supervision for aggregation (WikiSQL-supervised) + wikisql_config = copy.copy(config) + wikisql_config.use_answer_as_supervision = False + model = TapasForQuestionAnswering(config=wikisql_config) + model.to(torch_device) + model.eval() + result = model( + input_ids, + attention_mask=input_mask, + token_type_ids=token_type_ids, + label_ids=label_ids, + aggregation_labels=aggregation_labels, + ) + self.parent.assertEqual(result.loss.shape, ()) + self.parent.assertEqual(result.logits.shape, (self.batch_size, self.seq_length)) + self.parent.assertEqual(result.logits_aggregation.shape, (self.batch_size, self.num_aggregation_labels)) + + def create_and_check_for_sequence_classification( + self, + config, + input_ids, + input_mask, + token_type_ids, + sequence_labels, + token_labels, + label_ids, + numeric_values, + numeric_values_scale, + float_answer, + aggregation_labels, + ): + config.num_labels = self.num_labels + model = TapasForSequenceClassification(config) + model.to(torch_device) + model.eval() + result = model(input_ids, attention_mask=input_mask, labels=sequence_labels) + self.parent.assertEqual(result.logits.shape, (self.batch_size, self.num_labels)) + + def prepare_config_and_inputs_for_common(self): + config_and_inputs = self.prepare_config_and_inputs() + ( + config, + input_ids, + input_mask, + token_type_ids, + sequence_labels, + token_labels, + label_ids, + numeric_values, + numeric_values_scale, + float_answer, + aggregation_labels, + ) = config_and_inputs + inputs_dict = {"input_ids": input_ids, "token_type_ids": token_type_ids, "attention_mask": input_mask} + return config, inputs_dict + + 
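+# For quick reference: the three fine-tuning regimes exercised by the tester above
+# differ only in two config flags (summarizing the tester code, not an official API table):
+#
+#   conversational (SQA):                      num_aggregation_labels=0, use_answer_as_supervision=False
+#   weak supervision for aggregation (WTQ):    num_aggregation_labels=4, use_answer_as_supervision=True
+#   strong supervision (WikiSQL-supervised):   num_aggregation_labels=4, use_answer_as_supervision=False
+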
+@require_torch
+class TapasModelTest(ModelTesterMixin, unittest.TestCase):
+
+    all_model_classes = (
+        (
+            TapasModel,
+            # TapasForMaskedLM,
+            TapasForQuestionAnswering,
+            TapasForSequenceClassification,
+        )
+        if is_torch_available()
+        else ()
+    )
+    test_pruning = False
+    test_torchscript = True
+    test_resize_embeddings = True
+    test_head_masking = False
+
+    def setUp(self):
+        self.model_tester = TapasModelTester(self)
+        self.config_tester = ConfigTester(self, config_class=TapasConfig, dim=37)
+
+    def test_config(self):
+        self.config_tester.run_common_tests()
+
+    def test_model(self):
+        config_and_inputs = self.model_tester.prepare_config_and_inputs()
+        self.model_tester.create_and_check_model(*config_and_inputs)
+
+    # def test_for_masked_lm(self):
+    #     config_and_inputs = self.model_tester.prepare_config_and_inputs()
+    #     self.model_tester.create_and_check_for_masked_lm(*config_and_inputs)
+
+    def test_for_question_answering(self):
+        config_and_inputs = self.model_tester.prepare_config_and_inputs()
+        self.model_tester.create_and_check_for_question_answering(*config_and_inputs)
+
+    def test_for_sequence_classification(self):
+        config_and_inputs = self.model_tester.prepare_config_and_inputs()
+        self.model_tester.create_and_check_for_sequence_classification(*config_and_inputs)
+
+
+def prepare_tapas_single_inputs_for_inference():
+    # Here we prepare a single table-question pair to test TAPAS inference on:
+    data = {
+        "Footballer": ["Lionel Messi", "Cristiano Ronaldo"],
+        "Age": ["33", "35"],
+    }
+    queries = "Which footballer is 33 years old?"
+    table = pd.DataFrame.from_dict(data)
+
+    return table, queries
+
+
+def prepare_tapas_batch_inputs_for_inference():
+    # Here we prepare a batch of 2 table-question pairs to test TAPAS inference on:
+    data = {
+        "Footballer": ["Lionel Messi", "Cristiano Ronaldo"],
+        "Age": ["33", "35"],
+        "Number of goals": ["712", "750"],
+    }
+    queries = ["Which footballer is 33 years old?", "How many goals does Ronaldo have?"]
+    table = pd.DataFrame.from_dict(data)
+
+    return table, queries
+
+
+def prepare_tapas_batch_inputs_for_training():
+    # Here we prepare a DIFFERENT batch of 2 table-question pairs to test TAPAS training on:
+    data = {
+        "Footballer": ["Lionel Messi", "Cristiano Ronaldo"],
+        "Age": ["33", "35"],
+        "Number of goals": ["712", "750"],
+    }
+    queries = ["Which footballer is 33 years old?", "What's the total number of goals?"]
+    table = pd.DataFrame.from_dict(data)
+
+    answer_coordinates = [[(0, 0)], [(0, 2), (1, 2)]]
+    answer_text = [["Lionel Messi"], ["1462"]]
+    float_answer = [float("NaN"), float("1462")]
+
+    return table, queries, answer_coordinates, answer_text, float_answer
+
+
+@require_torch
+class TapasModelIntegrationTest(unittest.TestCase):
+    @cached_property
+    def default_tokenizer(self):
+        return TapasTokenizer.from_pretrained("nielsr/tapas-base-finetuned-wtq")
+
+    @slow
+    def test_inference_no_head(self):
+        # ideally we want to test this with the weights of tapas_inter_masklm_base_reset,
+        # but since it's not straightforward to do this with the TF 1 implementation, we test it with
+        # the weights of the WTQ base model (i.e.
+        # tapas_wtq_wikisql_sqa_inter_masklm_base_reset)
+        model = TapasModel.from_pretrained("nielsr/tapas-base-finetuned-wtq")
+
+        tokenizer = self.default_tokenizer
+        table, queries = prepare_tapas_single_inputs_for_inference()
+        inputs = tokenizer(table=table, queries=queries, return_tensors="pt")
+        outputs = model(**inputs)
+        # test the sequence output
+        expected_slice = torch.tensor(
+            [
+                [
+                    [-0.141581565, -0.599805772, 0.747186482],
+                    [-0.143664181, -0.602008104, 0.749218345],
+                    [-0.15169853, -0.603363097, 0.741370678],
+                ]
+            ]
+        )
+
+        self.assertTrue(torch.allclose(outputs.last_hidden_state[:, :3, :3], expected_slice, atol=1e-4))
+
+        # test the pooled output
+        expected_slice = torch.tensor([[0.987518311, -0.970520139, -0.994303405]])
+
+        self.assertTrue(torch.allclose(outputs.pooler_output[:, :3], expected_slice, atol=1e-4))
+
+    @unittest.skip(reason="Model not available yet")
+    def test_inference_masked_lm(self):
+        pass
+
+    # TapasForQuestionAnswering has 3 possible ways of being fine-tuned:
+    # - conversational set-up (SQA)
+    # - weak supervision for aggregation (WTQ, WikiSQL)
+    # - strong supervision for aggregation (WikiSQL-supervised)
+    # We test all of them:
+    @slow
+    def test_inference_question_answering_head_conversational(self):
+        # note that nielsr/tapas-base-finetuned-sqa should correspond to tapas_sqa_inter_masklm_base_reset
+        model = TapasForQuestionAnswering.from_pretrained("nielsr/tapas-base-finetuned-sqa")
+
+        tokenizer = self.default_tokenizer
+        table, queries = prepare_tapas_single_inputs_for_inference()
+        inputs = tokenizer(table=table, queries=queries, return_tensors="pt")
+        outputs = model(**inputs)
+        # test the logits
+        logits = outputs.logits
+        expected_shape = torch.Size((1, 21))
+        self.assertEqual(logits.shape, expected_shape)
+        expected_tensor = torch.tensor(
+            [
+                [
+                    -9997.22461, -9997.22461, -9997.22461, -9997.22461, -9997.22461,
+                    -9997.22461, -9997.22461, -9997.22461, -9997.22461, -16.2628059,
+                    -10004.082, 15.4330549, 15.4330549, 15.4330549, -9990.42,
+                    -16.3270779, -16.3270779, -16.3270779, -16.3270779, -16.3270779, -10004.8506,
+                ]
+            ]
+        )  # ok
+
+        self.assertTrue(torch.allclose(logits, expected_tensor, atol=1e-4))
+
+    @slow
+    def test_inference_question_answering_head_weak_supervision(self):
+        # note that nielsr/tapas-base-finetuned-wtq should correspond to tapas_wtq_wikisql_sqa_inter_masklm_base_reset
+        model = TapasForQuestionAnswering.from_pretrained("nielsr/tapas-base-finetuned-wtq")
+
+        tokenizer = self.default_tokenizer
+        # let's test on a batch
+        table, queries = prepare_tapas_batch_inputs_for_inference()
+        inputs = tokenizer(table=table, queries=queries, padding="longest", return_tensors="pt")
+        outputs = model(**inputs)
+        # test the logits
+        logits = outputs.logits
+        expected_shape = torch.Size((2, 28))
+        self.assertEqual(logits.shape, expected_shape)
+        expected_slice = torch.tensor(
+            [
+                [-160.375504, -160.375504, -160.375504, -10072.3965, -10070.9414, -10094.9736],
+                [-9861.6123, -9861.6123, -9861.6123, -9861.6123, -9891.01172, 146.600677],
+            ]
+        )  # ok (batch size = 2)
+
+        self.assertTrue(torch.allclose(logits[:, -6:], expected_slice, atol=1e-4))
+
+        # test the aggregation logits
+        logits_aggregation = outputs.logits_aggregation
+        expected_shape = torch.Size((2, 4))
+        self.assertEqual(logits_aggregation.shape, expected_shape)
+        expected_tensor = torch.tensor(
+            [
+                [18.8545208, -9.76614857, -6.3128891, -2.93525243],
+                [-4.05782509, 40.0351, -5.35329962, 23.3978653],
+            ]
+        )  # ok (batch size = 2)
+
+        self.assertTrue(torch.allclose(logits_aggregation, expected_tensor, atol=1e-4))
+
+    @slow
+    def test_training_question_answering_head_weak_supervision(self):
+        # note that nielsr/tapas-base-finetuned-wtq should correspond to tapas_wtq_wikisql_sqa_inter_masklm_base_reset
+        model = TapasForQuestionAnswering.from_pretrained("nielsr/tapas-base-finetuned-wtq")
+        model.to(torch_device)
+
+        tokenizer = self.default_tokenizer
+        # let's test on a batch
+        table, queries, answer_coordinates, answer_text, float_answer = prepare_tapas_batch_inputs_for_training()
+        inputs = tokenizer(
+            table=table,
+            queries=queries,
+            answer_coordinates=answer_coordinates,
+            answer_text=answer_text,
+            padding="longest",
+            return_tensors="pt",
+        )
+
+        # prepare data (created by the tokenizer) and move to torch_device
+        input_ids = inputs["input_ids"].to(torch_device)
+        attention_mask = inputs["attention_mask"].to(torch_device)
+        token_type_ids = inputs["token_type_ids"].to(torch_device)
+        label_ids = inputs["label_ids"].to(torch_device)
+        numeric_values = inputs["numeric_values"].to(torch_device)
+        numeric_values_scale = inputs["numeric_values_scale"].to(torch_device)
+
+        # the answer should be prepared by the user
+        float_answer = torch.FloatTensor(float_answer).to(torch_device)
+
+        # forward pass to get loss + logits:
+        outputs = model(
+            input_ids=input_ids,
+            attention_mask=attention_mask,
+            token_type_ids=token_type_ids,
+            label_ids=label_ids,
+            numeric_values=numeric_values,
+            numeric_values_scale=numeric_values_scale,
+            float_answer=float_answer,
+        )
+
+        # test the loss
+        loss = outputs.loss
+        expected_loss = 3.3527612686157227e-08  # ok
+        self.assertAlmostEqual(loss.item(), expected_loss, delta=1e-4)
+
+        # test the logits on the first example
+        logits = outputs.logits
+        expected_shape = torch.Size((2, 28))
+        self.assertEqual(logits.shape, expected_shape)
+        expected_slice = torch.tensor(
+            [
+                -160.0156, -160.0156, -160.0156, -160.0156, -160.0156,
+                -10072.2266, -10070.8896, -10092.6006, -10092.6006,
+            ]
+        )  # ok
+
+        self.assertTrue(torch.allclose(logits[0, -9:], expected_slice, atol=1e-4))
+
+        # test the aggregation logits on the second example
+        logits_aggregation = outputs.logits_aggregation
+        expected_shape = torch.Size((2, 4))
+        self.assertEqual(logits_aggregation.shape, expected_shape)
+        expected_slice = torch.tensor([-4.0538, 40.0304, -5.3554, 23.3965])  # ok
+
+        self.assertTrue(torch.allclose(logits_aggregation[1, -4:], expected_slice, atol=1e-4))
+
+    @slow
+    def test_inference_question_answering_head_strong_supervision(self):
+        # note that nielsr/tapas-base-finetuned-wikisql-supervised should correspond to
+        # tapas_wikisql_sqa_inter_masklm_base_reset
+        model = TapasForQuestionAnswering.from_pretrained("nielsr/tapas-base-finetuned-wikisql-supervised")
+
+        tokenizer = self.default_tokenizer
+        table, queries = prepare_tapas_single_inputs_for_inference()
+        inputs = tokenizer(table=table, queries=queries, return_tensors="pt")
+        outputs = model(**inputs)
+        # test the logits
+        logits = outputs.logits
+        expected_shape = torch.Size((1, 21))
+        self.assertEqual(logits.shape, expected_shape)
+        expected_tensor = torch.tensor(
+            [
+                [
+                    -10011.1084, -10011.1084, -10011.1084, -10011.1084, -10011.1084,
+                    -10011.1084, -10011.1084, -10011.1084, -10011.1084, -18.6185989,
+                    -10008.7969, 17.6355762, 17.6355762, 17.6355762, -10002.4404,
+                    -18.7111301, -18.7111301, -18.7111301, -18.7111301, -18.7111301, -10007.0977,
+                ]
+            ]
+        )  # ok
+
+        self.assertTrue(torch.allclose(logits, expected_tensor, atol=1e-4))
+
+        # test the aggregation logits
+        logits_aggregation = outputs.logits_aggregation
+        expected_shape = torch.Size((1, 4))
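+        # The 4 aggregation logits are assumed to follow the operation ordering of the
+        # original TAPAS code (NONE, SUM, AVERAGE, COUNT).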
+        self.assertEqual(logits_aggregation.shape, expected_shape)
+        expected_tensor = torch.tensor(
+            [[16.5659733, -3.06624889, -2.34152961, -0.970244825]]
+        )  # ok, PyTorch model outputs [[16.5679, -3.0668, -2.3442, -0.9674]]
+
+        self.assertTrue(torch.allclose(logits_aggregation, expected_tensor, atol=1e-4))
+
+    @slow
+    def test_inference_classification_head(self):
+        # note that nielsr/tapas-base-finetuned-tabfact should correspond to tapas_tabfact_inter_masklm_base_reset
+        model = TapasForSequenceClassification.from_pretrained("nielsr/tapas-base-finetuned-tabfact")
+
+        tokenizer = self.default_tokenizer
+        table, queries = prepare_tapas_single_inputs_for_inference()
+        inputs = tokenizer(table=table, queries=queries, return_tensors="pt")
+        outputs = model(**inputs)
+
+        # test the classification logits
+        logits = outputs.logits
+        expected_shape = torch.Size((1, 2))
+        self.assertEqual(logits.shape, expected_shape)
+        expected_tensor = torch.tensor(
+            [[0.795137286, 9.5572]]
+        )  # ok. Note that the PyTorch model outputs [[0.8057, 9.5281]]
+
+        self.assertTrue(torch.allclose(outputs.logits, expected_tensor, atol=1e-4))
+
+
+# Below: tests for Tapas utilities which are defined in modeling_tapas.py.
+# These are based on segmented_tensor_test.py of the original implementation.
+# URL: https://github.com/google-research/tapas/blob/master/tapas/models/segmented_tensor_test.py
+class TapasUtilitiesTest(unittest.TestCase):
+    def _prepare_tables(self):
+        """Prepares two tables, both with three distinct rows.
+        The first table has two columns:
+        1.0, 2.0 | 3.0
+        2.0, 0.0 | 1.0
+        1.0, 3.0 | 4.0
+        The second table has three columns:
+        1.0 | 2.0 | 3.0
+        2.0 | 0.0 | 1.0
+        1.0 | 3.0 | 4.0
+        Returns:
+            SegmentedTensors with the tables.
+        """
+        values = torch.tensor(
+            [
+                [[1.0, 2.0, 3.0], [2.0, 0.0, 1.0], [1.0, 3.0, 4.0]],
+                [[1.0, 2.0, 3.0], [2.0, 0.0, 1.0], [1.0, 3.0, 4.0]],
+            ]
+        )
+        row_index = IndexMap(
+            indices=torch.tensor(
+                [
+                    [[0, 0, 0], [1, 1, 1], [2, 2, 2]],
+                    [[0, 0, 0], [1, 1, 1], [2, 2, 2]],
+                ]
+            ),
+            num_segments=3,
+            batch_dims=1,
+        )
+        col_index = IndexMap(
+            indices=torch.tensor(
+                [
+                    [[0, 0, 1], [0, 0, 1], [0, 0, 1]],
+                    [[0, 1, 2], [0, 1, 2], [0, 1, 2]],
+                ]
+            ),
+            num_segments=3,
+            batch_dims=1,
+        )
+        return values, row_index, col_index
+
+    def test_product_index(self):
+        _, row_index, col_index = self._prepare_tables()
+        cell_index = ProductIndexMap(row_index, col_index)
+        row_index_proj = cell_index.project_outer(cell_index)
+        col_index_proj = cell_index.project_inner(cell_index)
+
+        ind = cell_index.indices
+        self.assertEqual(cell_index.num_segments, 9)
+
+        # Projections should give back the original indices.
+        # We use np.testing.assert_array_equal rather than Tensorflow's assertAllEqual
+        np.testing.assert_array_equal(row_index.indices.numpy(), row_index_proj.indices.numpy())
+        self.assertEqual(row_index.num_segments, row_index_proj.num_segments)
+        self.assertEqual(row_index.batch_dims, row_index_proj.batch_dims)
+        # We use np.testing.assert_array_equal rather than Tensorflow's assertAllEqual
+        np.testing.assert_array_equal(col_index.indices.numpy(), col_index_proj.indices.numpy())
+        self.assertEqual(col_index.batch_dims, col_index_proj.batch_dims)
+
+        # The first and second "column" are identified in the first table.
+        for i in range(3):
+            self.assertEqual(ind[0, i, 0], ind[0, i, 1])
+            self.assertNotEqual(ind[0, i, 0], ind[0, i, 2])
+
+        # All rows are distinct in the first table.
+        for i in range(3):
+            for i_2 in range(3):
+                for j in range(3):
+                    for j_2 in range(3):
+                        if i != i_2 and j != j_2:
+                            self.assertNotEqual(ind[0, i, j], ind[0, i_2, j_2])
+
+        # All cells are distinct in the second table.
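+        # (Each column of the second table has its own column id, so every (row, column)
+        # pair maps to a distinct cell id in the product index.)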
+        for i in range(3):
+            for i_2 in range(3):
+                for j in range(3):
+                    for j_2 in range(3):
+                        if i != i_2 or j != j_2:
+                            self.assertNotEqual(ind[1, i, j], ind[1, i_2, j_2])
+
+    def test_flatten(self):
+        _, row_index, col_index = self._prepare_tables()
+        row_index_flat = flatten(row_index)
+        col_index_flat = flatten(col_index)
+
+        shape = [3, 4, 5]
+        batched_index = IndexMap(indices=torch.zeros(shape).type(torch.LongTensor), num_segments=1, batch_dims=3)
+        batched_index_flat = flatten(batched_index)
+
+        # We use np.testing.assert_array_equal rather than Tensorflow's assertAllEqual
+        np.testing.assert_array_equal(
+            row_index_flat.indices.numpy(), [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5]
+        )
+        np.testing.assert_array_equal(
+            col_index_flat.indices.numpy(), [0, 0, 1, 0, 0, 1, 0, 0, 1, 3, 4, 5, 3, 4, 5, 3, 4, 5]
+        )
+        self.assertEqual(batched_index_flat.num_segments.numpy(), np.prod(shape))
+        np.testing.assert_array_equal(batched_index_flat.indices.numpy(), range(np.prod(shape)))
+
+    def test_range_index_map(self):
+        batch_shape = [3, 4]
+        num_segments = 5
+        index = range_index_map(batch_shape, num_segments)
+
+        self.assertEqual(num_segments, index.num_segments)
+        self.assertEqual(2, index.batch_dims)
+        indices = index.indices
+        # We use np.testing.assert_array_equal rather than Tensorflow's assertAllEqual
+        np.testing.assert_array_equal(list(indices.size()), [3, 4, 5])
+        for i in range(batch_shape[0]):
+            for j in range(batch_shape[1]):
+                # We use np.testing.assert_array_equal rather than Tensorflow's assertAllEqual
+                np.testing.assert_array_equal(indices[i, j, :].numpy(), range(num_segments))
+
+    def test_reduce_sum(self):
+        values, row_index, col_index = self._prepare_tables()
+        cell_index = ProductIndexMap(row_index, col_index)
+        row_sum, _ = reduce_sum(values, row_index)
+        col_sum, _ = reduce_sum(values, col_index)
+        cell_sum, _ = reduce_sum(values, cell_index)
+
+        # We use np.testing.assert_allclose rather than Tensorflow's assertAllClose
+        np.testing.assert_allclose(row_sum.numpy(), [[6.0, 3.0, 8.0], [6.0, 3.0, 8.0]])
+        np.testing.assert_allclose(col_sum.numpy(), [[9.0, 8.0, 0.0], [4.0, 5.0, 8.0]])
+        np.testing.assert_allclose(
+            cell_sum.numpy(),
+            [[3.0, 3.0, 0.0, 2.0, 1.0, 0.0, 4.0, 4.0, 0.0], [1.0, 2.0, 3.0, 2.0, 0.0, 1.0, 1.0, 3.0, 4.0]],
+        )
+
+    def test_reduce_mean(self):
+        values, row_index, col_index = self._prepare_tables()
+        cell_index = ProductIndexMap(row_index, col_index)
+        row_mean, _ = reduce_mean(values, row_index)
+        col_mean, _ = reduce_mean(values, col_index)
+        cell_mean, _ = reduce_mean(values, cell_index)
+
+        # We use np.testing.assert_allclose rather than Tensorflow's assertAllClose
+        np.testing.assert_allclose(
+            row_mean.numpy(), [[6.0 / 3.0, 3.0 / 3.0, 8.0 / 3.0], [6.0 / 3.0, 3.0 / 3.0, 8.0 / 3.0]]
+        )
+        np.testing.assert_allclose(col_mean.numpy(), [[9.0 / 6.0, 8.0 / 3.0, 0.0], [4.0 / 3.0, 5.0 / 3.0, 8.0 / 3.0]])
+        np.testing.assert_allclose(
+            cell_mean.numpy(),
+            [
+                [3.0 / 2.0, 3.0, 0.0, 2.0 / 2.0, 1.0, 0.0, 4.0 / 2.0, 4.0, 0.0],
+                [1.0, 2.0, 3.0, 2.0, 0.0, 1.0, 1.0, 3.0, 4.0],
+            ],
+        )
+
+    def test_reduce_max(self):
+        values = torch.as_tensor([2.0, 1.0, 0.0, 3.0])
+        index = IndexMap(indices=torch.as_tensor([0, 1, 0, 1]), num_segments=2)
+        maximum, _ = reduce_max(values, index)
+
+        # We use np.testing.assert_array_equal rather than Tensorflow's assertAllEqual
+        np.testing.assert_array_equal(maximum.numpy(), [2, 3])
+
+    def test_reduce_sum_vectorized(self):
+        values = torch.as_tensor([[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [3.0, 4.0, 5.0]])
+        index = IndexMap(indices=torch.as_tensor([0, 0, 1]), num_segments=2, batch_dims=0)
+        sums, new_index = reduce_sum(values, index)
+
+        # We use np.testing.assert_allclose rather than Tensorflow's assertAllClose
+        np.testing.assert_allclose(sums.numpy(), [[3.0, 5.0, 7.0], [3.0, 4.0, 5.0]])
+        # We use np.testing.assert_array_equal rather than Tensorflow's assertAllEqual
+        np.testing.assert_array_equal(new_index.indices.numpy(), [0, 1])
+        np.testing.assert_array_equal(new_index.num_segments.numpy(), 2)
+        np.testing.assert_array_equal(new_index.batch_dims, 0)
+
+    def test_gather(self):
+        values, row_index, col_index = self._prepare_tables()
+        cell_index = ProductIndexMap(row_index, col_index)
+
+        # Compute sums and then gather. The result should have the same shape as
+        # the original table and each element should contain the sum of the values
+        # in its cell.
+        sums, _ = reduce_sum(values, cell_index)
+        cell_sum = gather(sums, cell_index)
+        assert cell_sum.size() == values.size()
+
+        # We use np.testing.assert_allclose rather than Tensorflow's assertAllClose
+        np.testing.assert_allclose(
+            cell_sum.numpy(),
+            [[[3.0, 3.0, 3.0], [2.0, 2.0, 1.0], [4.0, 4.0, 4.0]], [[1.0, 2.0, 3.0], [2.0, 0.0, 1.0], [1.0, 3.0, 4.0]]],
+        )
+
+    def test_gather_vectorized(self):
+        values = torch.as_tensor([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])
+        index = IndexMap(indices=torch.as_tensor([[0, 1], [1, 0]]), num_segments=2, batch_dims=1)
+        result = gather(values, index)
+
+        # We use np.testing.assert_array_equal rather than Tensorflow's assertAllEqual
+        np.testing.assert_array_equal(result.numpy(), [[[1, 2], [3, 4]], [[7, 8], [5, 6]]])
\ No newline at end of file
diff --git a/tests/test_tokenization_tapas.py b/tests/test_tokenization_tapas.py
new file mode 100644
index 000000000000..5d8cba376515
--- /dev/null
+++ b/tests/test_tokenization_tapas.py
@@ -0,0 +1,3287 @@
+# coding=utf-8
+# Copyright 2018 The Google AI Language Team Authors.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
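+
+# For orientation before the tests below: a minimal, illustrative call of the tokenizer
+# under test, with a pandas DataFrame as the table (checkpoint name and values are taken
+# from the model tests above and are not the only valid choices):
+#
+#     table = pd.DataFrame({"Footballer": ["Lionel Messi"], "Age": ["33"]})
+#     tokenizer = TapasTokenizer.from_pretrained("nielsr/tapas-base-finetuned-wtq")
+#     inputs = tokenizer(table=table, queries="How old is Messi?", return_tensors="pt")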
+import inspect +import os +import shutil +import tempfile +import unittest +from typing import List, Tuple +import numpy as np + +import pandas as pd + +from transformers import AddedToken +from transformers.testing_utils import require_tokenizers, slow +from transformers.tokenization_tapas import ( + VOCAB_FILES_NAMES, + BasicTokenizer, + TapasTokenizer, + WordpieceTokenizer, + _is_control, + _is_punctuation, + _is_whitespace, +) + +from .test_tokenization_common import TokenizerTesterMixin, filter_non_english + + +@require_tokenizers +class TapasTokenizationTest(TokenizerTesterMixin, unittest.TestCase): + tokenizer_class = TapasTokenizer + test_rust_tokenizer = False + space_between_special_tokens = True + from_pretrained_filter = filter_non_english + + def get_table( + self, + tokenizer: TapasTokenizer, + length=5, + ): + toks = [tokenizer.decode([i], clean_up_tokenization_spaces=False) for i in range(len(tokenizer))] + + if length == 0: + data = {} + else: + data = {toks[0]: [toks[tok] for tok in range(1, length)]} + + table = pd.DataFrame.from_dict(data) + + return table + + def get_table_and_query( + self, + tokenizer: TapasTokenizer, + length=5, + ): + toks = [tokenizer.decode([i], clean_up_tokenization_spaces=False) for i in range(len(tokenizer))] + table = self.get_table(tokenizer, length=length - 3) + query = " ".join(toks[:3]) + + return table, query + + def get_clean_sequence( + self, + tokenizer: TapasTokenizer, + with_prefix_space=False, + max_length=20, + min_length=5, + empty_table: bool = False, + add_special_tokens: bool = True, + return_table_and_query: bool = False, + ): + + toks = [tokenizer.decode([i], clean_up_tokenization_spaces=False) for i in range(len(tokenizer))] + + if empty_table: + table = pd.DataFrame.from_dict({}) + query = " ".join(toks[:min_length]) + else: + data = {toks[0]: [toks[tok] for tok in range(1, min_length - 3)]} + table = pd.DataFrame.from_dict(data) + query = " ".join(toks[:3]) + + output_ids = tokenizer.encode(table, query, add_special_tokens=add_special_tokens) + output_txt = tokenizer.decode(output_ids) + + assert len(output_ids) >= min_length, "Update the code to generate the sequences so that they are larger" + assert len(output_ids) <= max_length, "Update the code to generate the sequences so that they are smaller" + + if return_table_and_query: + return output_txt, output_ids, table, query + + return output_txt, output_ids + + def setUp(self): + super().setUp() + + vocab_tokens = [ + "[UNK]", + "[CLS]", + "[SEP]", + "[PAD]", + "[MASK]", + "want", + "##want", + "##ed", + "wa", + "un", + "runn", + "##ing", + ",", + "low", + "lowest", + ] + self.vocab_file = os.path.join(self.tmpdirname, VOCAB_FILES_NAMES["vocab_file"]) + with open(self.vocab_file, "w", encoding="utf-8") as vocab_writer: + vocab_writer.write("".join([x + "\n" for x in vocab_tokens])) + + def get_input_output_texts(self, tokenizer): + input_text = "UNwant\u00E9d,running" + output_text = "unwanted, running" + return input_text, output_text + + def test_full_tokenizer(self): + tokenizer = self.tokenizer_class(self.vocab_file) + + tokens = tokenizer.tokenize("UNwant\u00E9d,running") + self.assertListEqual(tokens, ["un", "##want", "##ed", ",", "runn", "##ing"]) + self.assertListEqual(tokenizer.convert_tokens_to_ids(tokens), [9, 6, 7, 12, 10, 11]) + + def test_rust_and_python_full_tokenizers(self): + if not self.test_rust_tokenizer: + return + + tokenizer = self.get_tokenizer() + rust_tokenizer = self.get_rust_tokenizer() + + sequence = "UNwant\u00E9d,running" + + tokens = 
tokenizer.tokenize(sequence) + rust_tokens = rust_tokenizer.tokenize(sequence) + self.assertListEqual(tokens, rust_tokens) + + ids = tokenizer.encode(sequence, add_special_tokens=False) + rust_ids = rust_tokenizer.encode(sequence, add_special_tokens=False) + self.assertListEqual(ids, rust_ids) + + rust_tokenizer = self.get_rust_tokenizer() + ids = tokenizer.encode(sequence) + rust_ids = rust_tokenizer.encode(sequence) + self.assertListEqual(ids, rust_ids) + + # With lower casing + tokenizer = self.get_tokenizer(do_lower_case=True) + rust_tokenizer = self.get_rust_tokenizer(do_lower_case=True) + + sequence = "UNwant\u00E9d,running" + + tokens = tokenizer.tokenize(sequence) + rust_tokens = rust_tokenizer.tokenize(sequence) + self.assertListEqual(tokens, rust_tokens) + + ids = tokenizer.encode(sequence, add_special_tokens=False) + rust_ids = rust_tokenizer.encode(sequence, add_special_tokens=False) + self.assertListEqual(ids, rust_ids) + + rust_tokenizer = self.get_rust_tokenizer() + ids = tokenizer.encode(sequence) + rust_ids = rust_tokenizer.encode(sequence) + self.assertListEqual(ids, rust_ids) + + def test_chinese(self): + tokenizer = BasicTokenizer() + + self.assertListEqual(tokenizer.tokenize("ah\u535A\u63A8zz"), ["ah", "\u535A", "\u63A8", "zz"]) + + def test_basic_tokenizer_lower(self): + tokenizer = BasicTokenizer(do_lower_case=True) + + self.assertListEqual( + tokenizer.tokenize(" \tHeLLo!how \n Are yoU? "), ["hello", "!", "how", "are", "you", "?"] + ) + self.assertListEqual(tokenizer.tokenize("H\u00E9llo"), ["hello"]) + + def test_basic_tokenizer_lower_strip_accents_false(self): + tokenizer = BasicTokenizer(do_lower_case=True, strip_accents=False) + + self.assertListEqual( + tokenizer.tokenize(" \tHäLLo!how \n Are yoU? "), ["hällo", "!", "how", "are", "you", "?"] + ) + self.assertListEqual(tokenizer.tokenize("H\u00E9llo"), ["h\u00E9llo"]) + + def test_basic_tokenizer_lower_strip_accents_true(self): + tokenizer = BasicTokenizer(do_lower_case=True, strip_accents=True) + + self.assertListEqual( + tokenizer.tokenize(" \tHäLLo!how \n Are yoU? "), ["hallo", "!", "how", "are", "you", "?"] + ) + self.assertListEqual(tokenizer.tokenize("H\u00E9llo"), ["hello"]) + + def test_basic_tokenizer_lower_strip_accents_default(self): + tokenizer = BasicTokenizer(do_lower_case=True) + + self.assertListEqual( + tokenizer.tokenize(" \tHäLLo!how \n Are yoU? "), ["hallo", "!", "how", "are", "you", "?"] + ) + self.assertListEqual(tokenizer.tokenize("H\u00E9llo"), ["hello"]) + + def test_basic_tokenizer_no_lower(self): + tokenizer = BasicTokenizer(do_lower_case=False) + + self.assertListEqual( + tokenizer.tokenize(" \tHeLLo!how \n Are yoU? "), ["HeLLo", "!", "how", "Are", "yoU", "?"] + ) + + def test_basic_tokenizer_no_lower_strip_accents_false(self): + tokenizer = BasicTokenizer(do_lower_case=False, strip_accents=False) + + self.assertListEqual( + tokenizer.tokenize(" \tHäLLo!how \n Are yoU? "), ["HäLLo", "!", "how", "Are", "yoU", "?"] + ) + + def test_basic_tokenizer_no_lower_strip_accents_true(self): + tokenizer = BasicTokenizer(do_lower_case=False, strip_accents=True) + + self.assertListEqual( + tokenizer.tokenize(" \tHäLLo!how \n Are yoU? "), ["HaLLo", "!", "how", "Are", "yoU", "?"] + ) + + def test_basic_tokenizer_respects_never_split_tokens(self): + tokenizer = BasicTokenizer(do_lower_case=False, never_split=["[UNK]"]) + + self.assertListEqual( + tokenizer.tokenize(" \tHeLLo!how \n Are yoU? 
[UNK]"), ["HeLLo", "!", "how", "Are", "yoU", "?", "[UNK]"] + ) + + def test_wordpiece_tokenizer(self): + vocab_tokens = ["[UNK]", "[CLS]", "[SEP]", "want", "##want", "##ed", "wa", "un", "runn", "##ing"] + + vocab = {} + for (i, token) in enumerate(vocab_tokens): + vocab[token] = i + tokenizer = WordpieceTokenizer(vocab=vocab, unk_token="[UNK]") + + self.assertListEqual(tokenizer.tokenize(""), []) + + self.assertListEqual(tokenizer.tokenize("unwanted running"), ["un", "##want", "##ed", "runn", "##ing"]) + + self.assertListEqual(tokenizer.tokenize("unwantedX running"), ["[UNK]", "runn", "##ing"]) + + def test_is_whitespace(self): + self.assertTrue(_is_whitespace(" ")) + self.assertTrue(_is_whitespace("\t")) + self.assertTrue(_is_whitespace("\r")) + self.assertTrue(_is_whitespace("\n")) + self.assertTrue(_is_whitespace("\u00A0")) + + self.assertFalse(_is_whitespace("A")) + self.assertFalse(_is_whitespace("-")) + + def test_is_control(self): + self.assertTrue(_is_control("\u0005")) + + self.assertFalse(_is_control("A")) + self.assertFalse(_is_control(" ")) + self.assertFalse(_is_control("\t")) + self.assertFalse(_is_control("\r")) + + def test_is_punctuation(self): + self.assertTrue(_is_punctuation("-")) + self.assertTrue(_is_punctuation("$")) + self.assertTrue(_is_punctuation("`")) + self.assertTrue(_is_punctuation(".")) + + self.assertFalse(_is_punctuation("A")) + self.assertFalse(_is_punctuation(" ")) + + def test_clean_text(self): + tokenizer = self.get_tokenizer() + + # Example taken from the issue https://github.com/huggingface/tokenizers/issues/340 + self.assertListEqual([tokenizer.tokenize(t) for t in ["Test", "\xad", "test"]], [["[UNK]"], ["[EMPTY]"], ["[UNK]"]]) + + @slow + def test_sequence_builders(self): + tokenizer = self.tokenizer_class.from_pretrained("tapas-base-uncased") + + text = tokenizer.encode("sequence builders", add_special_tokens=False) + text_2 = tokenizer.encode("multi-sequence build", add_special_tokens=False) + + encoded_sentence = tokenizer.build_inputs_with_special_tokens(text) + encoded_pair = tokenizer.build_inputs_with_special_tokens(text, text_2) + + assert encoded_sentence == [101] + text + [102] + assert encoded_pair == [101] + text + [102] + text_2 + [102] + + def test_offsets_with_special_characters(self): + for tokenizer, pretrained_name, kwargs in self.tokenizers_list: + with self.subTest("{} ({})".format(tokenizer.__class__.__name__, pretrained_name)): + tokenizer_r = self.rust_tokenizer_class.from_pretrained(pretrained_name, **kwargs) + + sentence = f"A, naïve {tokenizer_r.mask_token} AllenNLP sentence." 
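+                # encode_plus with return_offsets_mapping=True also returns, per token, its
+                # (start, end) character span in the raw sentence; offset mappings are only
+                # available on "fast" (Rust-backed) tokenizers.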
+ tokens = tokenizer_r.encode_plus( + sentence, + return_attention_mask=False, + return_token_type_ids=False, + return_offsets_mapping=True, + add_special_tokens=True, + ) + + do_lower_case = tokenizer_r.do_lower_case if hasattr(tokenizer_r, "do_lower_case") else False + expected_results = ( + [ + ((0, 0), tokenizer_r.cls_token), + ((0, 1), "A"), + ((1, 2), ","), + ((3, 5), "na"), + ((5, 6), "##ï"), + ((6, 8), "##ve"), + ((9, 15), tokenizer_r.mask_token), + ((16, 21), "Allen"), + ((21, 23), "##NL"), + ((23, 24), "##P"), + ((25, 33), "sentence"), + ((33, 34), "."), + ((0, 0), tokenizer_r.sep_token), + ] + if not do_lower_case + else [ + ((0, 0), tokenizer_r.cls_token), + ((0, 1), "a"), + ((1, 2), ","), + ((3, 8), "naive"), + ((9, 15), tokenizer_r.mask_token), + ((16, 21), "allen"), + ((21, 23), "##nl"), + ((23, 24), "##p"), + ((25, 33), "sentence"), + ((33, 34), "."), + ((0, 0), tokenizer_r.sep_token), + ] + ) + + self.assertEqual( + [e[1] for e in expected_results], tokenizer_r.convert_ids_to_tokens(tokens["input_ids"]) + ) + self.assertEqual([e[0] for e in expected_results], tokens["offset_mapping"]) + + def test_add_special_tokens(self): + tokenizers: List[TapasTokenizer] = self.get_tokenizers(do_lower_case=False) + for tokenizer in tokenizers: + with self.subTest(f"{tokenizer.__class__.__name__}"): + input_table = self.get_table(tokenizer, length=0) + + special_token = "[SPECIAL_TOKEN]" + + tokenizer.add_special_tokens({"cls_token": special_token}) + encoded_special_token = tokenizer.encode(input_table, special_token, add_special_tokens=False) + self.assertEqual(len(encoded_special_token), 1) + + decoded = tokenizer.decode(encoded_special_token, skip_special_tokens=True) + self.assertTrue(special_token not in decoded) + + def test_add_tokens_tokenizer(self): + tokenizers: List[TapasTokenizer] = self.get_tokenizers(do_lower_case=False) + for tokenizer in tokenizers: + with self.subTest(f"{tokenizer.__class__.__name__}"): + table = self.get_table(tokenizer, length=0) + vocab_size = tokenizer.vocab_size + all_size = len(tokenizer) + + self.assertNotEqual(vocab_size, 0) + + # We usually have added tokens from the start in tests because our vocab fixtures are + # smaller than the original vocabs - let's not assert this + # self.assertEqual(vocab_size, all_size) + + new_toks = ["aaaaa bbbbbb", "cccccccccdddddddd"] + added_toks = tokenizer.add_tokens(new_toks) + vocab_size_2 = tokenizer.vocab_size + all_size_2 = len(tokenizer) + + self.assertNotEqual(vocab_size_2, 0) + self.assertEqual(vocab_size, vocab_size_2) + self.assertEqual(added_toks, len(new_toks)) + self.assertEqual(all_size_2, all_size + len(new_toks)) + + tokens = tokenizer.encode(table, "aaaaa bbbbbb low cccccccccdddddddd l", add_special_tokens=False) + + self.assertGreaterEqual(len(tokens), 4) + self.assertGreater(tokens[0], tokenizer.vocab_size - 1) + self.assertGreater(tokens[-2], tokenizer.vocab_size - 1) + + new_toks_2 = {"eos_token": ">>>>|||<||<<|<<", "pad_token": "<<<<<|||>|>>>>|>"} + added_toks_2 = tokenizer.add_special_tokens(new_toks_2) + vocab_size_3 = tokenizer.vocab_size + all_size_3 = len(tokenizer) + + self.assertNotEqual(vocab_size_3, 0) + self.assertEqual(vocab_size, vocab_size_3) + self.assertEqual(added_toks_2, len(new_toks_2)) + self.assertEqual(all_size_3, all_size_2 + len(new_toks_2)) + + tokens = tokenizer.encode( + table, + ">>>>|||<||<<|<< aaaaabbbbbb low cccccccccdddddddd <<<<<|||>|>>>>|> l", + add_special_tokens=False, + ) + + self.assertGreaterEqual(len(tokens), 6) + self.assertGreater(tokens[0], 
tokenizer.vocab_size - 1) + self.assertGreater(tokens[0], tokens[1]) + self.assertGreater(tokens[-2], tokenizer.vocab_size - 1) + self.assertGreater(tokens[-2], tokens[-3]) + self.assertEqual(tokens[0], tokenizer.eos_token_id) + self.assertEqual(tokens[-2], tokenizer.pad_token_id) + + @require_tokenizers + def test_encode_decode_with_spaces(self): + tokenizers = self.get_tokenizers(do_lower_case=False) + for tokenizer in tokenizers: + with self.subTest(f"{tokenizer.__class__.__name__}"): + table = self.get_table(tokenizer, length=0) + + # new_toks = ["[ABC]", "[DEF]"] # TODO(thom) add this one back when Rust toks are ready: , "GHI IHG"] + new_toks = [AddedToken("[ABC]", normalized=False), AddedToken("[DEF]", normalized=False)] + tokenizer.add_tokens(new_toks) + input = "[ABC][DEF][ABC][DEF]" # TODO(thom) add back cf above: "[ABC] [DEF] [ABC] GHI IHG [DEF]" + if self.space_between_special_tokens: + output = "[ABC] [DEF] [ABC] [DEF]" + else: + output = input + encoded = tokenizer.encode(table, input, add_special_tokens=False) + decoded = tokenizer.decode(encoded, spaces_between_special_tokens=self.space_between_special_tokens) + self.assertIn(decoded, [output, output.lower()]) + + def test_encode_plus_with_padding(self): + tokenizers = self.get_tokenizers(do_lower_case=False) + for tokenizer in tokenizers: + with self.subTest(f"{tokenizer.__class__.__name__}"): + table = self.get_table(tokenizer, length=0) + sequence = "Sequence" + + # check correct behaviour if no pad_token_id exists and add it eventually + self._check_no_pad_token_padding(tokenizer, sequence) + + padding_size = 10 + padding_idx = tokenizer.pad_token_id + token_type_padding_idx = tokenizer.pad_token_type_id + + encoded_sequence = tokenizer.encode_plus(table, sequence, return_special_tokens_mask=True) + input_ids = encoded_sequence["input_ids"] + special_tokens_mask = encoded_sequence["special_tokens_mask"] + sequence_length = len(input_ids) + + # Test 'longest' and 'no_padding' don't do anything + tokenizer.padding_side = "right" + + # padding=True resolves to 'longest', which is a no-op for a single sequence + not_padded_sequence = tokenizer.encode_plus( + table, + sequence, + padding=True, + return_special_tokens_mask=True, + ) + not_padded_input_ids = not_padded_sequence["input_ids"] + + not_padded_special_tokens_mask = not_padded_sequence["special_tokens_mask"] + not_padded_sequence_length = len(not_padded_input_ids) + + assert sequence_length == not_padded_sequence_length + assert input_ids == not_padded_input_ids + assert special_tokens_mask == not_padded_special_tokens_mask + + # padding=False should leave the sequence untouched as well + not_padded_sequence = tokenizer.encode_plus( + table, + sequence, + padding=False, + return_special_tokens_mask=True, + ) + not_padded_input_ids = not_padded_sequence["input_ids"] + + not_padded_special_tokens_mask = not_padded_sequence["special_tokens_mask"] + not_padded_sequence_length = len(not_padded_input_ids) + + assert sequence_length == not_padded_sequence_length + assert input_ids == not_padded_input_ids + assert special_tokens_mask == not_padded_special_tokens_mask + + # Test right padding + tokenizer.padding_side = "right" + + right_padded_sequence = tokenizer.encode_plus( + table, + sequence, + max_length=sequence_length + padding_size, + padding="max_length", + return_special_tokens_mask=True, + ) + right_padded_input_ids = right_padded_sequence["input_ids"] + + right_padded_special_tokens_mask = right_padded_sequence["special_tokens_mask"] + right_padded_sequence_length = len(right_padded_input_ids) + + assert sequence_length + padding_size == right_padded_sequence_length + assert input_ids + [padding_idx]
* padding_size == right_padded_input_ids + assert special_tokens_mask + [1] * padding_size == right_padded_special_tokens_mask + + # Test left padding + tokenizer.padding_side = "left" + left_padded_sequence = tokenizer.encode_plus( + table, + sequence, + max_length=sequence_length + padding_size, + padding="max_length", + return_special_tokens_mask=True, + ) + left_padded_input_ids = left_padded_sequence["input_ids"] + left_padded_special_tokens_mask = left_padded_sequence["special_tokens_mask"] + left_padded_sequence_length = len(left_padded_input_ids) + + assert sequence_length + padding_size == left_padded_sequence_length + assert [padding_idx] * padding_size + input_ids == left_padded_input_ids + assert [1] * padding_size + special_tokens_mask == left_padded_special_tokens_mask + + if "token_type_ids" in tokenizer.model_input_names: + token_type_ids = encoded_sequence["token_type_ids"] + left_padded_token_type_ids = left_padded_sequence["token_type_ids"] + right_padded_token_type_ids = right_padded_sequence["token_type_ids"] + + assert token_type_ids + [[token_type_padding_idx] * 7] * padding_size == right_padded_token_type_ids + assert [[token_type_padding_idx] * 7] * padding_size + token_type_ids == left_padded_token_type_ids + + if "attention_mask" in tokenizer.model_input_names: + attention_mask = encoded_sequence["attention_mask"] + right_padded_attention_mask = right_padded_sequence["attention_mask"] + left_padded_attention_mask = left_padded_sequence["attention_mask"] + + assert attention_mask + [0] * padding_size == right_padded_attention_mask + assert [0] * padding_size + attention_mask == left_padded_attention_mask + + def test_internal_consistency(self): + tokenizers = self.get_tokenizers() + for tokenizer in tokenizers: + with self.subTest(f"{tokenizer.__class__.__name__}"): + table = self.get_table(tokenizer, length=0) + input_text, output_text = self.get_input_output_texts(tokenizer) + + tokens = tokenizer.tokenize(input_text) + ids = tokenizer.convert_tokens_to_ids(tokens) + ids_2 = tokenizer.encode(table, input_text, add_special_tokens=False) + self.assertListEqual(ids, ids_2) + + tokens_2 = tokenizer.convert_ids_to_tokens(ids) + self.assertNotEqual(len(tokens_2), 0) + text_2 = tokenizer.decode(ids) + self.assertIsInstance(text_2, str) + + self.assertEqual(text_2, output_text) + + def test_mask_output(self): + tokenizers = self.get_tokenizers(fast=False, do_lower_case=False) + for tokenizer in tokenizers: + with self.subTest(f"{tokenizer.__class__.__name__}"): + table, query = self.get_table_and_query(tokenizer) + + if ( + tokenizer.build_inputs_with_special_tokens.__qualname__.split(".")[0] != "PreTrainedTokenizer" + and "token_type_ids" in tokenizer.model_input_names + ): + information = tokenizer.encode_plus(table, query, add_special_tokens=True) + sequences, mask = information["input_ids"], information["token_type_ids"] + self.assertEqual(len(sequences), len(mask)) + + @unittest.skip("TAPAS tokenizer only handles two sequences.") + def test_maximum_encoding_length_pair_input(self): + pass + + @unittest.skip("TAPAS tokenizer only handles two sequences.") + def test_maximum_encoding_length_single_input(self): + pass + + def test_number_of_added_tokens(self): + tokenizers = self.get_tokenizers(do_lower_case=False) + for tokenizer in tokenizers: + with self.subTest(f"{tokenizer.__class__.__name__}"): + table, query = self.get_table_and_query(tokenizer) + + sequences = tokenizer.encode(table, query, add_special_tokens=False) + attached_sequences = 
tokenizer.encode(table, query, add_special_tokens=True) + + # Method is implemented (e.g. not GPT-2) + if len(attached_sequences) != 2: + self.assertEqual( + tokenizer.num_special_tokens_to_add(pair=True), len(attached_sequences) - len(sequences) + ) + + def test_padding_to_max_length(self): + """We keep this test for backward compatibility but it should be removed once `pad_to_max_length` is deprecated""" + tokenizers = self.get_tokenizers(do_lower_case=False) + for tokenizer in tokenizers: + with self.subTest(f"{tokenizer.__class__.__name__}"): + table = self.get_table(tokenizer) + sequence = "Sequence" + padding_size = 10 + + # check correct behaviour if no pad_token_id exists and add it eventually + self._check_no_pad_token_padding(tokenizer, sequence) + + padding_idx = tokenizer.pad_token_id + + # Check that it correctly pads when a maximum length is specified along with the padding flag set to True + tokenizer.padding_side = "right" + encoded_sequence = tokenizer.encode(table, sequence) + sequence_length = len(encoded_sequence) + # FIXME: the next line should be padding(max_length) to avoid warning + padded_sequence = tokenizer.encode( + table, sequence, max_length=sequence_length + padding_size, padding=True + ) + padded_sequence_length = len(padded_sequence) + assert sequence_length + padding_size == padded_sequence_length + assert encoded_sequence + [padding_idx] * padding_size == padded_sequence + + # Check that nothing is done when a maximum length is not specified + encoded_sequence = tokenizer.encode(table, sequence) + sequence_length = len(encoded_sequence) + + tokenizer.padding_side = "right" + padded_sequence_right = tokenizer.encode(table, sequence, pad_to_max_length=True) + padded_sequence_right_length = len(padded_sequence_right) + assert sequence_length == padded_sequence_right_length + assert encoded_sequence == padded_sequence_right + + def test_call(self): + # Tests that all calls wrap to encode_plus and batch_encode_plus + tokenizers = self.get_tokenizers(do_lower_case=False) + for tokenizer in tokenizers: + with self.subTest(f"{tokenizer.__class__.__name__}"): + sequences = [ + "Testing batch encode plus", + "Testing 
batch encode plus with different sequence lengths", + "Testing batch encode plus with different sequence lengths correctly pads", + ] + + # Test not batched + table = self.get_table(tokenizer, length=0) + encoded_sequences_1 = tokenizer.encode_plus(table, sequences[0]) + encoded_sequences_2 = tokenizer(table, sequences[0]) + self.assertEqual(encoded_sequences_1, encoded_sequences_2) + + # Test not batched pairs + table = self.get_table(tokenizer, length=10) + encoded_sequences_1 = tokenizer.encode_plus(table, sequences[1]) + encoded_sequences_2 = tokenizer(table, sequences[1]) + self.assertEqual(encoded_sequences_1, encoded_sequences_2) + + # Test batched + table = self.get_table(tokenizer, length=0) + encoded_sequences_1 = tokenizer.batch_encode_plus(table, sequences) + encoded_sequences_2 = tokenizer(table, sequences) + self.assertEqual(encoded_sequences_1, encoded_sequences_2) + + def test_batch_encode_plus_batch_sequence_length(self): + # Tests that all encoded values have the correct size + tokenizers = self.get_tokenizers(do_lower_case=False) + for tokenizer in tokenizers: + with self.subTest(f"{tokenizer.__class__.__name__}"): + table = self.get_table(tokenizer, length=0) + sequences = [ + "Testing batch encode plus", + "Testing batch encode plus with different sequence lengths", + "Testing batch encode plus with different sequence lengths correctly pads", + ] + + encoded_sequences = [tokenizer.encode_plus(table, sequence) for sequence in sequences] + encoded_sequences_batch = tokenizer.batch_encode_plus(table, sequences, padding=False) + self.assertListEqual( + encoded_sequences, self.convert_batch_encode_plus_format_to_encode_plus(encoded_sequences_batch) + ) + + maximum_length = len( + max([encoded_sequence["input_ids"] for encoded_sequence in encoded_sequences], key=len) + ) + + # check correct behaviour if no pad_token_id exists and add it eventually + self._check_no_pad_token_padding(tokenizer, sequences) + + encoded_sequences_padded = [ + tokenizer.encode_plus(table, sequence, max_length=maximum_length, padding="max_length") + for sequence in sequences + ] + + encoded_sequences_batch_padded = tokenizer.batch_encode_plus(table, sequences, padding=True) + self.assertListEqual( + encoded_sequences_padded, + self.convert_batch_encode_plus_format_to_encode_plus(encoded_sequences_batch_padded), + ) + + # check 'longest' is unsensitive to a max length + encoded_sequences_batch_padded_1 = tokenizer.batch_encode_plus(table, sequences, padding=True) + encoded_sequences_batch_padded_2 = tokenizer.batch_encode_plus( + table, sequences, max_length=maximum_length + 10, padding="longest" + ) + for key in encoded_sequences_batch_padded_1.keys(): + self.assertListEqual( + encoded_sequences_batch_padded_1[key], + encoded_sequences_batch_padded_2[key], + ) + + # check 'no_padding' is unsensitive to a max length + encoded_sequences_batch_padded_1 = tokenizer.batch_encode_plus(table, sequences, padding=False) + encoded_sequences_batch_padded_2 = tokenizer.batch_encode_plus( + table, sequences, max_length=maximum_length + 10, padding=False + ) + for key in encoded_sequences_batch_padded_1.keys(): + self.assertListEqual( + encoded_sequences_batch_padded_1[key], + encoded_sequences_batch_padded_2[key], + ) + + @unittest.skip("batch_encode_plus does not handle overflowing tokens.") + def test_batch_encode_plus_overflowing_tokens(self): + pass + + def test_batch_encode_plus_padding(self): + # Test that padded sequences are equivalent between batch_encode_plus and encode_plus + + # Right padding tests 
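+ # (for each padding side, encode_plus and batch_encode_plus are expected to produce + # identical ids and masks when given the same max_length and padding strategy)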
+ tokenizers = self.get_tokenizers(do_lower_case=False) + for tokenizer in tokenizers: + with self.subTest(f"{tokenizer.__class__.__name__}"): + table = self.get_table(tokenizer, length=0) + sequences = [ + "Testing batch encode plus", + "Testing batch encode plus with different sequence lengths", + "Testing batch encode plus with different sequence lengths correctly pads", + ] + + max_length = 100 + + # check correct behaviour if no pad_token_id exists and add it eventually + self._check_no_pad_token_padding(tokenizer, sequences) + + encoded_sequences = [ + tokenizer.encode_plus(table, sequence, max_length=max_length, padding="max_length") + for sequence in sequences + ] + encoded_sequences_batch = tokenizer.batch_encode_plus( + table, sequences, max_length=max_length, padding="max_length" + ) + self.assertListEqual( + encoded_sequences, self.convert_batch_encode_plus_format_to_encode_plus(encoded_sequences_batch) + ) + + # Left padding tests + tokenizers = self.get_tokenizers(do_lower_case=False) + for tokenizer in tokenizers: + with self.subTest(f"{tokenizer.__class__.__name__}"): + tokenizer.padding_side = "left" + # rebuild the table for this tokenizer, as in the right padding tests above + table = self.get_table(tokenizer, length=0) + sequences = [ + "Testing batch encode plus", + "Testing batch encode plus with different sequence lengths", + "Testing batch encode plus with different sequence lengths correctly pads", + ] + + max_length = 100 + + # check correct behaviour if no pad_token_id exists and add it eventually + self._check_no_pad_token_padding(tokenizer, sequences) + + encoded_sequences = [ + tokenizer.encode_plus(table, sequence, max_length=max_length, padding="max_length") + for sequence in sequences + ] + encoded_sequences_batch = tokenizer.batch_encode_plus( + table, sequences, max_length=max_length, padding="max_length" + ) + self.assertListEqual( + encoded_sequences, self.convert_batch_encode_plus_format_to_encode_plus(encoded_sequences_batch) + ) + + def test_padding_to_multiple_of(self): + tokenizers = self.get_tokenizers() + for tokenizer in tokenizers: + with self.subTest(f"{tokenizer.__class__.__name__}"): + table = self.get_table(tokenizer, length=0) + if tokenizer.pad_token is None: + self.skipTest("No padding token.") + else: + empty_tokens = tokenizer(table, padding=True, pad_to_multiple_of=8) + normal_tokens = tokenizer(table, "This is a sample input", padding=True, pad_to_multiple_of=8) + for key, value in empty_tokens.items(): + self.assertEqual(len(value) % 8, 0, "BatchEncoding.{} is not a multiple of 8".format(key)) + for key, value in normal_tokens.items(): + self.assertEqual(len(value) % 8, 0, "BatchEncoding.{} is not a multiple of 8".format(key)) + + normal_tokens = tokenizer(table, "This", pad_to_multiple_of=8) + for key, value in normal_tokens.items(): + self.assertNotEqual(len(value) % 8, 0, "BatchEncoding.{} is not a multiple of 8".format(key)) + + # Should also work with truncation + normal_tokens = tokenizer(table, "This", padding=True, truncation=True, pad_to_multiple_of=8) + for key, value in normal_tokens.items(): + self.assertEqual(len(value) % 8, 0, "BatchEncoding.{} is not a multiple of 8".format(key)) + + @unittest.skip("TAPAS cannot handle `prepare_for_model` without going through `encode_plus` or `batch_encode_plus`") + def test_prepare_for_model(self): + pass + + def test_tokenizer_slow_store_full_signature(self): + signature = inspect.signature(self.tokenizer_class.__init__) + tokenizer = self.get_tokenizer() + + for parameter_name, parameter in signature.parameters.items(): + if parameter.default != inspect.Parameter.empty: + self.assertIn(parameter_name, 
tokenizer.init_kwargs) + + def test_special_tokens_mask_input_pairs(self): + tokenizers = self.get_tokenizers(do_lower_case=False) + for tokenizer in tokenizers: + with self.subTest(f"{tokenizer.__class__.__name__}"): + sequence_0 = "Encode this." + empty_table = self.get_table(tokenizer, length=0) + table = self.get_table(tokenizer, length=10) + encoded_sequence = tokenizer.encode(empty_table, sequence_0, add_special_tokens=False) + encoded_sequence += tokenizer.encode(table, "", add_special_tokens=False) + encoded_sequence_dict = tokenizer.encode_plus( + table, + sequence_0, + add_special_tokens=True, + return_special_tokens_mask=True, + # add_prefix_space=False, + ) + encoded_sequence_w_special = encoded_sequence_dict["input_ids"] + special_tokens_mask = encoded_sequence_dict["special_tokens_mask"] + self.assertEqual(len(special_tokens_mask), len(encoded_sequence_w_special)) + + filtered_sequence = [ + (x if not special_tokens_mask[i] else None) for i, x in enumerate(encoded_sequence_w_special) + ] + filtered_sequence = [x for x in filtered_sequence if x is not None] + self.assertEqual(encoded_sequence, filtered_sequence) + + def test_special_tokens_mask(self): + tokenizers = self.get_tokenizers(do_lower_case=False) + for tokenizer in tokenizers: + with self.subTest(f"{tokenizer.__class__.__name__}"): + table = self.get_table(tokenizer, length=0) + sequence_0 = "Encode this." + # Testing single inputs + encoded_sequence = tokenizer.encode(table, sequence_0, add_special_tokens=False) + encoded_sequence_dict = tokenizer.encode_plus( + table, sequence_0, add_special_tokens=True, return_special_tokens_mask=True + ) + encoded_sequence_w_special = encoded_sequence_dict["input_ids"] + special_tokens_mask = encoded_sequence_dict["special_tokens_mask"] + self.assertEqual(len(special_tokens_mask), len(encoded_sequence_w_special)) + + filtered_sequence = [x for i, x in enumerate(encoded_sequence_w_special) if not special_tokens_mask[i]] + self.assertEqual(encoded_sequence, filtered_sequence) + + def test_save_and_load_tokenizer(self): + # safety check on max_len default value so we are sure the test works + tokenizers = self.get_tokenizers() + for tokenizer in tokenizers: + with self.subTest(f"{tokenizer.__class__.__name__}"): + self.assertNotEqual(tokenizer.model_max_length, 42) + + # Now let's start the test + tokenizers = self.get_tokenizers() + for tokenizer in tokenizers: + with self.subTest(f"{tokenizer.__class__.__name__}"): + # Isolate this from the other tests because we save additional tokens/etc + table = self.get_table(tokenizer, length=0) + tmpdirname = tempfile.mkdtemp() + + sample_text = " He is very happy, UNwant\u00E9d,running" + before_tokens = tokenizer.encode(table, sample_text, add_special_tokens=False) + before_vocab = tokenizer.get_vocab() + tokenizer.save_pretrained(tmpdirname) + + after_tokenizer = tokenizer.__class__.from_pretrained(tmpdirname) + after_tokens = after_tokenizer.encode(table, sample_text, add_special_tokens=False) + after_vocab = after_tokenizer.get_vocab() + self.assertListEqual(before_tokens, after_tokens) + self.assertDictEqual(before_vocab, after_vocab) + + shutil.rmtree(tmpdirname) + + def test_right_and_left_padding(self): + tokenizers = self.get_tokenizers(do_lower_case=False) + for tokenizer in tokenizers: + with self.subTest(f"{tokenizer.__class__.__name__}"): + table = self.get_table(tokenizer, length=0) + sequence = "Sequence" + padding_size = 10 + + # check correct behaviour if no pad_token_id exists and add it eventually + 
self._check_no_pad_token_padding(tokenizer, sequence) + + padding_idx = tokenizer.pad_token_id + + # RIGHT PADDING - Check that it correctly pads when a maximum length is specified along with the padding flag set to True + tokenizer.padding_side = "right" + encoded_sequence = tokenizer.encode(table, sequence) + sequence_length = len(encoded_sequence) + padded_sequence = tokenizer.encode( + table, sequence, max_length=sequence_length + padding_size, padding="max_length" + ) + padded_sequence_length = len(padded_sequence) + assert sequence_length + padding_size == padded_sequence_length + assert encoded_sequence + [padding_idx] * padding_size == padded_sequence + + # LEFT PADDING - Check that it correctly pads when a maximum length is specified along with the padding flag set to True + tokenizer.padding_side = "left" + encoded_sequence = tokenizer.encode(table, sequence) + sequence_length = len(encoded_sequence) + padded_sequence = tokenizer.encode( + table, sequence, max_length=sequence_length + padding_size, padding="max_length" + ) + padded_sequence_length = len(padded_sequence) + assert sequence_length + padding_size == padded_sequence_length + assert [padding_idx] * padding_size + encoded_sequence == padded_sequence + + # RIGHT & LEFT PADDING - Check that nothing is done for 'longest' and 'no_padding' + encoded_sequence = tokenizer.encode(table, sequence) + sequence_length = len(encoded_sequence) + + tokenizer.padding_side = "right" + padded_sequence_right = tokenizer.encode(table, sequence, padding=True) + padded_sequence_right_length = len(padded_sequence_right) + assert sequence_length == padded_sequence_right_length + assert encoded_sequence == padded_sequence_right + + tokenizer.padding_side = "left" + padded_sequence_left = tokenizer.encode(table, sequence, padding="longest") + padded_sequence_left_length = len(padded_sequence_left) + assert sequence_length == padded_sequence_left_length + assert encoded_sequence == padded_sequence_left + + tokenizer.padding_side = "right" + padded_sequence_right = tokenizer.encode(table, sequence) + padded_sequence_right_length = len(padded_sequence_right) + assert sequence_length == padded_sequence_right_length + assert encoded_sequence == padded_sequence_right + + tokenizer.padding_side = "left" + padded_sequence_left = tokenizer.encode(table, sequence, padding=False) + padded_sequence_left_length = len(padded_sequence_left) + assert sequence_length == padded_sequence_left_length + assert encoded_sequence == padded_sequence_left + + @unittest.skip("TAPAS doesn't handle pre-tokenized inputs.") + def test_pretokenized_inputs(self): + pass + + # TODO SET TO SLOW + def test_tapas_truncation_integration_test(self): + data = { + "Actors": ["Brad Pitt", "Leonardo Di Caprio", "George Clooney"], + "Age": ["56", "45", "59"], + "Number of movies": ["87", "53", "69"], + "Date of birth": ["18 december 1963", "11 november 1974", "6 may 1961"], + } + queries = [ + "When was Brad Pitt born?", + "Which actor appeared in the least number of movies?", + "What is the average number of movies?", + ] + table = pd.DataFrame.from_dict(data) + + # TODO: Should update this in the future + tokenizer = TapasTokenizer.from_pretrained("lysandre/tapas-temporary-repo", model_max_length=512) + + for i in range(12): + # The table cannot even encode the headers, so raise an error + with self.assertRaises(ValueError): + tokenizer.encode(table=table, query=queries[0], max_length=i, truncation="drop_rows_to_fit") + + for i in range(12, 512): + new_encoded_inputs = 
tokenizer.encode(table=table, query=queries[0], max_length=i, truncation="drop_rows_to_fit") + + # Ensure that the input IDs are less than the max length defined. + self.assertLessEqual(len(new_encoded_inputs), i) + + # TODO SET TO SLOW + def test_tapas_integration_test(self): + data = { + "Actors": ["Brad Pitt", "Leonardo Di Caprio", "George Clooney"], + "Age": ["56", "45", "59"], + "Number of movies": ["87", "53", "69"], + "Date of birth": ["18 december 1963", "11 november 1974", "6 may 1961"], + } + queries = [ + "When was Brad Pitt born?", + "Which actor appeared in the least number of movies?", + "What is the average number of movies?", + ] + table = pd.DataFrame.from_dict(data) + + # TODO: Should update this in the future + tokenizer = TapasTokenizer.from_pretrained("lysandre/tapas-temporary-repo", model_max_length=512) + + expected_results = { + "input_ids": [ + 101, + 2043, + 2001, + 8226, + 15091, + 2141, + 1029, + 102, + 5889, + 2287, + 2193, + 1997, + 5691, + 3058, + 1997, + 4182, + 8226, + 15091, + 5179, + 6584, + 2324, + 2285, + 3699, + 14720, + 4487, + 6178, + 9488, + 3429, + 5187, + 2340, + 2281, + 3326, + 2577, + 18856, + 7828, + 3240, + 5354, + 6353, + 1020, + 2089, + 3777, + ], + "attention_mask": [ + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + ], + "token_type_ids": [ + [0, 0, 0, 0, 0, 0, 0], + [0, 0, 0, 0, 0, 0, 0], + [0, 0, 0, 0, 0, 0, 0], + [0, 0, 0, 0, 0, 0, 0], + [0, 0, 0, 0, 0, 0, 0], + [0, 0, 0, 0, 0, 0, 0], + [0, 0, 0, 0, 0, 0, 0], + [0, 0, 0, 0, 0, 0, 0], + [1, 1, 0, 0, 0, 0, 0], + [1, 2, 0, 0, 0, 0, 0], + [1, 3, 0, 0, 0, 0, 0], + [1, 3, 0, 0, 0, 0, 0], + [1, 3, 0, 0, 0, 0, 0], + [1, 4, 0, 0, 0, 0, 0], + [1, 4, 0, 0, 0, 0, 0], + [1, 4, 0, 0, 0, 0, 0], + [1, 1, 1, 0, 0, 0, 0], + [1, 1, 1, 0, 0, 0, 0], + [1, 2, 1, 0, 2, 2, 0], + [1, 3, 1, 0, 3, 1, 0], + [1, 4, 1, 0, 2, 2, 0], + [1, 4, 1, 0, 2, 2, 0], + [1, 4, 1, 0, 2, 2, 0], + [1, 1, 2, 0, 0, 0, 0], + [1, 1, 2, 0, 0, 0, 0], + [1, 1, 2, 0, 0, 0, 0], + [1, 1, 2, 0, 0, 0, 0], + [1, 2, 2, 0, 1, 3, 0], + [1, 3, 2, 0, 1, 3, 0], + [1, 4, 2, 0, 3, 1, 0], + [1, 4, 2, 0, 3, 1, 0], + [1, 4, 2, 0, 3, 1, 0], + [1, 1, 3, 0, 0, 0, 0], + [1, 1, 3, 0, 0, 0, 0], + [1, 1, 3, 0, 0, 0, 0], + [1, 1, 3, 0, 0, 0, 0], + [1, 2, 3, 0, 3, 1, 0], + [1, 3, 3, 0, 2, 2, 0], + [1, 4, 3, 0, 1, 3, 0], + [1, 4, 3, 0, 1, 3, 0], + [1, 4, 3, 0, 1, 3, 0], + ], + } + + new_encoded_inputs = tokenizer.encode_plus(table=table, query=queries[0]) + + self.assertDictEqual(dict(new_encoded_inputs), expected_results) + + # TODO SET TO SLOW + def test_full_tokenizer(self): + data = [ + ["Pos", "No", "Driver", "Team", "Laps", "Time/Retired", "Grid", "Points"], + ["1", "32", "Patrick Carpentier", "Team Player's", "87", "1:48:11.023", "1", "22"], + ["2", "1", "Bruno Junqueira", "Newman/Haas Racing", "87", "+0.8 secs", "2", "17"], + ["3", "3", "Paul Tracy", "Team Player's", "87", "+28.6 secs", "3", "14"], + ["4", "9", "Michel Jourdain, Jr.", "Team Rahal", "87", "+40.8 secs", "13", "12"], + ["5", "34", "Mario Haberfeld", "Mi-Jack Conquest Racing", "87", "+42.1 secs", "6", "10"], + ["6", "20", "Oriol Servia", "Patrick Racing", "87", "+1:00.2", "10", "8"], + ["7", "51", "Adrian Fernandez", "Fernandez Racing", "87", "+1:01.4", "5", "6"], + ["8", "12", "Jimmy Vasser", "American Spirit Team Johansson", "87", "+1:01.8", "8", "5"], + ["9", "7", "Tiago Monteiro", "Fittipaldi-Dingman Racing", "86", "+ 1 Lap", "15", 
"4"], + ["10", "55", "Mario Dominguez", "Herdez Competition", "86", "+ 1 Lap", "11", "3"], + ["11", "27", "Bryan Herta", "PK Racing", "86", "+ 1 Lap", "12", "2"], + ["12", "31", "Ryan Hunter-Reay", "American Spirit Team Johansson", "86", "+ 1 Lap", "17", "1"], + ["13", "19", "Joel Camathias", "Dale Coyne Racing", "85", "+ 2 Laps", "18", "0"], + ["14", "33", "Alex Tagliani", "Rocketsports Racing", "85", "+ 2 Laps", "14", "0"], + ["15", "4", "Roberto Moreno", "Herdez Competition", "85", "+ 2 Laps", "9", "0"], + ["16", "11", "Geoff Boss", "Dale Coyne Racing", "83", "Mechanical", "19", "0"], + ["17", "2", "Sebastien Bourdais", "Newman/Haas Racing", "77", "Mechanical", "4", "0"], + ["18", "15", "Darren Manning", "Walker Racing", "12", "Mechanical", "7", "0"], + ["19", "5", "Rodolfo Lavin", "Walker Racing", "10", "Mechanical", "16", "0"], + ] + query = "what were the drivers names?" + table = pd.DataFrame.from_records(data[1:], columns=data[0]) + + # TODO: Should update this in the future + tokenizer = TapasTokenizer.from_pretrained("lysandre/tapas-temporary-repo", model_max_length=512) + model_inputs = tokenizer(table, query, padding="max_length") + + input_ids = model_inputs["input_ids"] + token_type_ids = np.array(model_inputs["token_type_ids"]) + segment_ids = token_type_ids[:, 0] + column_ids = token_type_ids[:, 1] + row_ids = token_type_ids[:, 2] + + expected_results = { + "input_ids": [ + 101, + 2054, + 2020, + 1996, + 6853, + 3415, + 1029, + 102, + 13433, + 2015, + 2053, + 4062, + 2136, + 10876, + 2051, + 1013, + 3394, + 8370, + 2685, + 1015, + 3590, + 4754, + 29267, + 4765, + 3771, + 2136, + 2447, + 1005, + 1055, + 6584, + 1015, + 1024, + 4466, + 1024, + 2340, + 1012, + 6185, + 2509, + 1015, + 2570, + 1016, + 1015, + 10391, + 12022, + 4226, + 7895, + 10625, + 1013, + 22996, + 3868, + 6584, + 1009, + 1014, + 1012, + 1022, + 10819, + 2015, + 1016, + 2459, + 1017, + 1017, + 2703, + 10555, + 2136, + 2447, + 1005, + 1055, + 6584, + 1009, + 2654, + 1012, + 1020, + 10819, + 2015, + 1017, + 2403, + 1018, + 1023, + 8709, + 8183, + 3126, + 21351, + 2078, + 1010, + 3781, + 1012, + 2136, + 10958, + 8865, + 6584, + 1009, + 2871, + 1012, + 1022, + 10819, + 2015, + 2410, + 2260, + 1019, + 4090, + 7986, + 5292, + 5677, + 8151, + 2771, + 1011, + 2990, + 9187, + 3868, + 6584, + 1009, + 4413, + 1012, + 1015, + 10819, + 2015, + 1020, + 2184, + 1020, + 2322, + 2030, + 20282, + 14262, + 9035, + 4754, + 3868, + 6584, + 1009, + 1015, + 1024, + 4002, + 1012, + 1016, + 2184, + 1022, + 1021, + 4868, + 7918, + 12023, + 12023, + 3868, + 6584, + 1009, + 1015, + 1024, + 5890, + 1012, + 1018, + 1019, + 1020, + 1022, + 2260, + 5261, + 12436, + 18116, + 2137, + 4382, + 2136, + 26447, + 6584, + 1009, + 1015, + 1024, + 5890, + 1012, + 1022, + 1022, + 1019, + 1023, + 1021, + 27339, + 3995, + 10125, + 9711, + 4906, + 25101, + 24657, + 1011, + 22033, + 2386, + 3868, + 6564, + 1009, + 1015, + 5001, + 2321, + 1018, + 2184, + 4583, + 7986, + 14383, + 2075, + 29488, + 14906, + 9351, + 2971, + 6564, + 1009, + 1015, + 5001, + 2340, + 1017, + 2340, + 2676, + 8527, + 2014, + 2696, + 1052, + 2243, + 3868, + 6564, + 1009, + 1015, + 5001, + 2260, + 1016, + 2260, + 2861, + 4575, + 4477, + 1011, + 2128, + 4710, + 2137, + 4382, + 2136, + 26447, + 6564, + 1009, + 1015, + 5001, + 2459, + 1015, + 2410, + 2539, + 8963, + 11503, + 25457, + 3022, + 8512, + 2522, + 9654, + 3868, + 5594, + 1009, + 1016, + 10876, + 2324, + 1014, + 2403, + 3943, + 4074, + 6415, + 15204, + 2072, + 12496, + 25378, + 3868, + 5594, + 1009, + 1016, + 10876, + 2403, + 
1014, + 2321, + 1018, + 10704, + 17921, + 14906, + 9351, + 2971, + 5594, + 1009, + 1016, + 10876, + 1023, + 1014, + 2385, + 2340, + 14915, + 5795, + 8512, + 2522, + 9654, + 3868, + 6640, + 6228, + 2539, + 1014, + 2459, + 1016, + 28328, + 8945, + 3126, + 21351, + 2015, + 10625, + 1013, + 22996, + 3868, + 6255, + 6228, + 1018, + 1014, + 2324, + 2321, + 12270, + 11956, + 5232, + 3868, + 2260, + 6228, + 1021, + 1014, + 2539, + 1019, + 8473, + 28027, + 2080, + 2474, + 6371, + 5232, + 3868, + 2184, + 6228, + 2385, + 1014, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + ], + "column_ids": [ + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 1, + 1, + 2, + 3, + 4, + 5, + 6, + 6, + 6, + 7, + 8, + 1, + 2, + 3, + 3, + 3, + 3, + 4, + 4, + 4, + 4, + 5, + 6, + 6, + 6, + 6, + 6, + 6, + 6, + 6, + 7, + 8, + 1, + 2, + 3, + 3, + 3, + 3, + 4, + 4, + 4, + 4, + 5, + 6, + 6, + 6, + 6, + 6, + 6, + 7, + 8, + 1, + 2, + 3, + 3, + 4, + 4, + 4, + 4, + 5, + 6, + 6, + 6, + 6, + 6, + 6, + 7, + 8, + 1, + 2, + 3, + 3, + 3, + 3, + 3, + 3, + 3, + 3, + 4, + 4, + 4, + 5, + 6, + 6, + 6, + 6, + 6, + 6, + 7, + 8, + 1, + 2, + 3, + 3, + 3, + 3, + 4, + 4, + 4, + 4, + 4, + 5, + 6, + 6, + 6, + 6, + 6, + 6, + 7, + 8, + 1, + 2, + 3, + 3, + 3, + 3, + 4, + 4, + 5, + 6, + 6, + 6, + 6, + 6, + 6, + 7, + 8, + 1, + 2, + 3, + 3, + 4, + 4, + 5, + 6, + 6, + 6, + 6, + 6, + 6, + 7, + 8, + 1, + 2, + 3, + 3, + 3, + 4, + 4, + 4, + 4, + 5, + 6, + 6, + 6, + 6, + 6, + 6, + 7, + 8, + 1, + 2, + 3, + 3, + 3, + 3, + 4, + 4, + 4, + 4, + 4, + 4, + 4, + 5, + 6, + 6, + 6, + 7, + 8, + 1, + 2, + 3, + 3, + 3, + 3, + 4, + 4, + 4, + 5, + 6, + 6, + 6, + 7, + 8, + 1, + 2, + 3, + 3, + 3, + 4, + 4, + 4, + 5, + 6, + 6, + 6, + 7, + 8, + 1, + 2, + 3, + 3, + 3, + 3, + 3, + 4, + 4, + 4, + 4, + 5, + 6, + 6, + 6, + 7, + 8, + 1, + 2, + 3, + 3, + 3, + 3, + 4, + 4, + 4, + 4, + 5, + 6, + 6, + 6, + 7, + 8, + 1, + 2, + 3, + 3, + 3, + 3, + 4, + 4, + 4, + 5, + 6, + 6, + 6, + 7, + 8, + 1, + 2, + 3, + 3, + 4, + 4, + 4, + 5, + 6, + 6, + 6, + 7, + 8, + 1, + 2, + 3, + 3, + 4, + 4, + 4, + 4, + 5, + 6, + 7, + 8, + 1, + 2, + 3, + 3, + 3, + 3, + 3, + 4, + 4, + 4, + 4, + 5, + 6, + 7, + 8, + 1, + 2, + 3, + 3, + 4, + 4, + 5, + 6, + 7, + 8, + 1, + 2, + 3, + 3, + 3, + 3, + 3, + 4, + 4, + 5, + 6, + 7, + 8, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, 
+ 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + ], + "row_ids": [ + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 2, + 2, + 2, + 2, + 2, + 2, + 2, + 2, + 2, + 2, + 2, + 2, + 2, + 2, + 2, + 2, + 2, + 2, + 2, + 3, + 3, + 3, + 3, + 3, + 3, + 3, + 3, + 3, + 3, + 3, + 3, + 3, + 3, + 3, + 3, + 3, + 4, + 4, + 4, + 4, + 4, + 4, + 4, + 4, + 4, + 4, + 4, + 4, + 4, + 4, + 4, + 4, + 4, + 4, + 4, + 4, + 4, + 4, + 5, + 5, + 5, + 5, + 5, + 5, + 5, + 5, + 5, + 5, + 5, + 5, + 5, + 5, + 5, + 5, + 5, + 5, + 5, + 5, + 6, + 6, + 6, + 6, + 6, + 6, + 6, + 6, + 6, + 6, + 6, + 6, + 6, + 6, + 6, + 6, + 6, + 7, + 7, + 7, + 7, + 7, + 7, + 7, + 7, + 7, + 7, + 7, + 7, + 7, + 7, + 7, + 8, + 8, + 8, + 8, + 8, + 8, + 8, + 8, + 8, + 8, + 8, + 8, + 8, + 8, + 8, + 8, + 8, + 8, + 9, + 9, + 9, + 9, + 9, + 9, + 9, + 9, + 9, + 9, + 9, + 9, + 9, + 9, + 9, + 9, + 9, + 9, + 9, + 10, + 10, + 10, + 10, + 10, + 10, + 10, + 10, + 10, + 10, + 10, + 10, + 10, + 10, + 10, + 11, + 11, + 11, + 11, + 11, + 11, + 11, + 11, + 11, + 11, + 11, + 11, + 11, + 11, + 12, + 12, + 12, + 12, + 12, + 12, + 12, + 12, + 12, + 12, + 12, + 12, + 12, + 12, + 12, + 12, + 12, + 13, + 13, + 13, + 13, + 13, + 13, + 13, + 13, + 13, + 13, + 13, + 13, + 13, + 13, + 13, + 13, + 14, + 14, + 14, + 14, + 14, + 14, + 14, + 14, + 14, + 14, + 14, + 14, + 14, + 14, + 14, + 15, + 15, + 15, + 15, + 15, + 15, + 15, + 15, + 15, + 15, + 15, + 15, + 15, + 16, + 16, + 16, + 16, + 16, + 16, + 16, + 16, + 16, + 16, + 16, + 16, + 17, + 17, + 17, + 17, + 17, + 17, + 17, + 17, + 17, + 17, + 17, + 17, + 17, + 17, + 17, + 18, + 18, + 18, + 18, + 18, + 18, + 18, + 18, + 18, + 18, + 19, + 19, + 19, + 19, + 19, + 19, + 19, + 19, + 19, + 19, + 19, + 19, + 19, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + ], + "segment_ids": [ + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, 
+ 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + ], + } + + self.assertListEqual(input_ids, expected_results["input_ids"]) + self.assertListEqual(segment_ids.tolist(), expected_results["segment_ids"]) + self.assertListEqual(column_ids.tolist(), expected_results["column_ids"]) + self.assertListEqual(row_ids.tolist(), expected_results["row_ids"])
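+ + # TapasTokenizer packs the table structure into 7 token type dimensions per token + # (segment_ids, column_ids, row_ids, prev_labels, column_ranks, inv_column_ranks, + # numeric_relations); the assertions above check the first three, extracted by the + # slicing of token_type_ids at the top of this test.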