diff --git a/README.md b/README.md index 1a898a9f076e..a710d6db6a54 100644 --- a/README.md +++ b/README.md @@ -188,6 +188,7 @@ Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. ultilingual BERT into [DistilmBERT](https://github.com/huggingface/transformers/tree/master/examples/distillation) and a German version of DistilBERT. 1. **[SqueezeBert](https://huggingface.co/transformers/model_doc/squeezebert.html)** released with the paper [SqueezeBERT: What can computer vision teach NLP about efficient neural networks?](https://arxiv.org/abs/2006.11316) by Forrest N. Iandola, Albert E. Shaw, Ravi Krishna, and Kurt W. Keutzer. 1. **[T5](https://huggingface.co/transformers/model_doc/t5.html)** (from Google AI) released with the paper [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/abs/1910.10683) by Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu. +1. **[TAPAS](https://huggingface.co/transformers/master/model_doc/tapas.html)** released with the paper [TAPAS: Weakly Supervised Table Parsing via Pre-training](https://arxiv.org/abs/2004.02349) by Jonathan Herzig, Paweł Krzysztof Nowak, Thomas Müller, Francesco Piccinno and Julian Martin Eisenschlos. 1. **[Transformer-XL](https://huggingface.co/transformers/model_doc/transformerxl.html)** (from Google/CMU) released with the paper [Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](https://arxiv.org/abs/1901.02860) by Zihang Dai*, Zhilin Yang*, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov. 1. **[XLM](https://huggingface.co/transformers/model_doc/xlm.html)** (from Facebook) released together with the paper [Cross-lingual Language Model Pretraining](https://arxiv.org/abs/1901.07291) by Guillaume Lample and Alexis Conneau. 1. **[XLM-ProphetNet](https://huggingface.co/transformers/model_doc/xlmprophetnet.html)** (from Microsoft Research) released with the paper [ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training](https://arxiv.org/abs/2001.04063) by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou. @@ -222,4 +223,4 @@ We now have a [paper](https://arxiv.org/abs/1910.03771) you can cite for the year={2019}, volume={abs/1910.03771} } -``` +``` \ No newline at end of file diff --git a/docs/source/index.rst b/docs/source/index.rst index 737f562f663e..7b68b3ce91bc 100644 --- a/docs/source/index.rst +++ b/docs/source/index.rst @@ -145,22 +145,25 @@ conversion utilities for the following models: 27. :doc:`T5 ` (from Google AI) released with the paper `Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer `__ by Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu. -28. :doc:`Transformer-XL ` (from Google/CMU) released with the paper `Transformer-XL: +28. :doc:`TAPAS ` (from Google AI) released with the paper `TAPAS: Weakly Supervised Table Parsing via + Pre-training `__ by Jonathan Herzig, Paweł Krzysztof Nowak, Thomas Müller, + Francesco Piccinno and Julian Martin Eisenschlos. +29. :doc:`Transformer-XL ` (from Google/CMU) released with the paper `Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context `__ by Zihang Dai*, Zhilin Yang*, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov. -29. 
:doc:`XLM ` (from Facebook) released together with the paper `Cross-lingual Language Model +30. :doc:`XLM ` (from Facebook) released together with the paper `Cross-lingual Language Model Pretraining `__ by Guillaume Lample and Alexis Conneau. -30. :doc:`XLM-ProphetNet ` (from Microsoft Research) released with the paper `ProphetNet: +31. :doc:`XLM-ProphetNet ` (from Microsoft Research) released with the paper `ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training `__ by Yu Yan, Weizhen Qi, Yeyun Gong, Dayiheng Liu, Nan Duan, Jiusheng Chen, Ruofei Zhang and Ming Zhou. -31. :doc:`XLM-RoBERTa ` (from Facebook AI), released together with the paper `Unsupervised +32. :doc:`XLM-RoBERTa ` (from Facebook AI), released together with the paper `Unsupervised Cross-lingual Representation Learning at Scale `__ by Alexis Conneau*, Kartikay Khandelwal*, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer and Veselin Stoyanov. -32. :doc:`XLNet ` (from Google/CMU) released with the paper `​XLNet: Generalized Autoregressive +33. :doc:`XLNet ` (from Google/CMU) released with the paper `​XLNet: Generalized Autoregressive Pretraining for Language Understanding `__ by Zhilin Yang*, Zihang Dai*, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le. -33. `Other community models `__, contributed by the `community +34. `Other community models `__, contributed by the `community `__. .. toctree:: @@ -258,6 +261,7 @@ conversion utilities for the following models: model_doc/roberta model_doc/squeezebert model_doc/t5 + model_doc/tapas model_doc/transformerxl model_doc/xlm model_doc/xlmprophetnet diff --git a/docs/source/model_doc/tapas.rst b/docs/source/model_doc/tapas.rst new file mode 100644 index 000000000000..d46fb25b0dae --- /dev/null +++ b/docs/source/model_doc/tapas.rst @@ -0,0 +1,378 @@ +TAPAS +----------------------------------------------------------------------------------------------------------------------- + +Overview +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The TAPAS model was proposed in `TAPAS: Weakly Supervised Table Parsing via Pre-training +`__ by Jonathan Herzig, Paweł Krzysztof Nowak, Thomas Müller, Francesco Piccinno and +Julian Martin Eisenschlos. It's a BERT-based model specifically designed (and pre-trained) for answering questions +about tabular data. Compared to BERT, TAPAS uses relative position embeddings and has 7 token types that encode tabular +structure. TAPAS is pre-trained on the masked language modeling (MLM) objective on a large dataset comprising millions +of tables from English Wikipedia and corresponding texts. For question answering, TAPAS has 2 heads on top: a cell +selection head and an aggregation head, for (optionally) performing aggregations (such as counting or summing) among +selected cells. TAPAS has been fine-tuned on several datasets: SQA (Sequential Question Answering by Microsoft), WTQ +(Wiki Table Questions by Stanford University) and WikiSQL (by Salesforce). It achieves state-of-the-art on both SQA and +WTQ, while having comparable performance to SOTA on WikiSQL, with a much simpler architecture. + +The abstract from the paper is the following: + +*Answering natural language questions over tables is usually seen as a semantic parsing task. To alleviate the +collection cost of full logical forms, one popular approach focuses on weak supervision consisting of denotations +instead of logical forms. 
However, training semantic parsers from weak supervision poses difficulties, and in addition,
+the generated logical forms are only used as an intermediate step prior to retrieving the denotation. In this paper, we
+present TAPAS, an approach to question answering over tables without generating logical forms. TAPAS trains from weak
+supervision, and predicts the denotation by selecting table cells and optionally applying a corresponding aggregation
+operator to such selection. TAPAS extends BERT's architecture to encode tables as input, initializes from an effective
+joint pre-training of text segments and tables crawled from Wikipedia, and is trained end-to-end. We experiment with
+three different semantic parsing datasets, and find that TAPAS outperforms or rivals semantic parsing models by
+improving state-of-the-art accuracy on SQA from 55.1 to 67.2 and performing on par with the state-of-the-art on WIKISQL
+and WIKITQ, but with a simpler model architecture. We additionally find that transfer learning, which is trivial in our
+setting, from WIKISQL to WIKITQ, yields 48.7 accuracy, 4.2 points above the state-of-the-art.*
+
+In addition, the authors have further pre-trained TAPAS to recognize table entailment, by creating a balanced dataset
+of millions of automatically created training examples which are learned in an intermediate step prior to fine-tuning.
+The authors of TAPAS call this further pre-training intermediate pre-training (since TAPAS is first pre-trained on MLM,
+and then on another dataset). They found that intermediate pre-training further improves performance on SQA, achieving
+a new state-of-the-art as well as state-of-the-art on TabFact, a large-scale dataset with 16k Wikipedia tables for
+table entailment (a binary classification task). For more details, see their follow-up paper: `Understanding tables with
+intermediate pre-training `__ by Julian Martin Eisenschlos, Syrine Krichene and
+Thomas Müller.
+
+The original code can be found `here `__.
+
+Tips:
+
+- TAPAS is a model that uses relative position embeddings by default (restarting the position embeddings at every cell
+  of the table). According to the authors, this usually results in slightly better performance, and allows you to
+  encode longer sequences without running out of embeddings. This is reflected in the ``reset_position_index_per_cell``
+  parameter of :class:`~transformers.TapasConfig`, which is set to ``True`` by default.
+  Pre-trained checkpoints with both absolute and relative position embeddings are available on the `model hub `_.
+  Note that it's usually advised to pad the inputs on the right rather than the left.
+- TAPAS is based on BERT, so ``TAPAS-base`` for example corresponds to a ``BERT-base`` architecture. ``TAPAS-large``
+  yields the best performance (the results reported in the paper are from ``TAPAS-large``). Metrics for the various
+  model sizes are shown in the `original Github repository `_.
+- TAPAS has checkpoints fine-tuned on SQA, which are capable of answering questions related to a table in a
+  conversational set-up. This means that you can ask follow-up questions such as "what is his age?" related to the
+  previous question. Note that the forward pass of TAPAS is a bit different in case of a conversational set-up: in that
+  case, you have to feed every training example one by one to the model, such that the `prev_label_ids` token type ids
+  can be overwritten by the model's predicted `label_ids` for the previous question. See the "Usage" section for more info.
+- TAPAS is similar to BERT and therefore relies on the masked language modeling (MLM) objective. It is therefore
+  efficient at predicting masked tokens and at NLU in general, but is not optimal for text generation. Models trained
+  with a causal language modeling (CLM) objective are better in that regard.
+
+
+Usage: fine-tuning
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Here we explain how you can fine-tune :class:`~transformers.TapasForQuestionAnswering` on your own dataset.
+
+===========================================================================
+STEP 1: Choose one of the 3 ways in which you can use TAPAS - or experiment
+===========================================================================
+
+There are 3 different ways in which one can fine-tune :class:`~transformers.TapasForQuestionAnswering`, corresponding to
+the different datasets on which TAPAS was fine-tuned:
+
+1. SQA: if you're interested in asking follow-up questions related to a table, in a conversational set-up. For example, if you
+   first ask "what's the name of the first actor?", you can then ask a follow-up question such as "how old is he?". Here, questions
+   do not involve any aggregation (all questions are cell selection questions).
+2. WTQ/WikiSQL: if you're not interested in asking questions in a conversational set-up, but rather just in asking questions related
+   to a table, which might involve aggregation, such as counting the number of rows, summing up cell values or averaging cell values.
+   You can then for example ask "what's the total number of goals Cristiano Ronaldo scored in his career?". This case is also called **weak
+   supervision**, since the model itself must learn the appropriate aggregation operator (SUM/COUNT/AVERAGE/NONE) given only the answer
+   to the question as supervision.
+3. WikiSQL-supervised: this dataset is the same as WikiSQL, but here the model is given the ground truth aggregation
+   operator during training. This is also called **strong supervision**. Here, learning the appropriate aggregation operator is much easier.
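+
+Each of these tasks also has a corresponding already fine-tuned checkpoint. As a minimal sketch (the checkpoint names
+below are the ones listed in ``TAPAS_PRETRAINED_CONFIG_ARCHIVE_MAP`` of this PR and may still change before merge):
+
+.. code-block::
+
+    >>> from transformers import TapasForQuestionAnswering
+
+    >>> # conversational (SQA), weak supervision (WTQ) and strong supervision (WikiSQL-supervised), respectively
+    >>> model = TapasForQuestionAnswering.from_pretrained('nielsr/tapas-base-finetuned-sqa')
+    >>> model = TapasForQuestionAnswering.from_pretrained('nielsr/tapas-base-finetuned-wtq')
+    >>> model = TapasForQuestionAnswering.from_pretrained('nielsr/tapas-base-finetuned-wikisql-supervised')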
+
+To summarize:
+
++------------------------------------+----------------------+-------------------------------------------------------------------------------------------------------------------+
+| **Task**                           | **Example datasets** | **Description**                                                                                                     |
++------------------------------------+----------------------+-------------------------------------------------------------------------------------------------------------------+
+| Conversational                     | SQA                  | Conversational, only cell selection questions                                                                       |
++------------------------------------+----------------------+-------------------------------------------------------------------------------------------------------------------+
+| Weak supervision for aggregation   | WTQ, WikiSQL         | Questions might involve aggregation, and the model must learn this given only the answer as supervision            |
++------------------------------------+----------------------+-------------------------------------------------------------------------------------------------------------------+
+| Strong supervision for aggregation | WikiSQL-supervised   | Questions might involve aggregation, and the model must learn this given the gold aggregation operator             |
++------------------------------------+----------------------+-------------------------------------------------------------------------------------------------------------------+
+
+Initializing a model with a pre-trained base and randomly initialized classification heads from the model hub is as easy as:
+
+.. code-block::
+
+    >>> from transformers import TapasForQuestionAnswering
+
+    >>> # for example, the base sized model
+    >>> model = TapasForQuestionAnswering.from_pretrained('google/tapas-base-uncased')
+
+
+Of course, you don't necessarily have to follow one of these three ways in which TAPAS was fine-tuned. You can also experiment by defining any hyperparameters
+you want when initializing :class:`~transformers.TapasConfig`, and then creating a :class:`~transformers.TapasForQuestionAnswering` based on that
+configuration. For example, if you have a dataset that has both conversational questions and questions that might involve aggregation, you can do it
+this way. Here's an example:
+
+.. code-block::
+
+    >>> from transformers import TapasConfig, TapasForQuestionAnswering
+
+    >>> # you can initialize the classification heads any way you want (see docs of TapasConfig)
+    >>> config = TapasConfig(num_aggregation_labels=3, average_logits_per_cell=True, select_one_column=False)
+    >>> # initializing the pre-trained base sized model with our custom classification heads
+    >>> model = TapasForQuestionAnswering.from_pretrained('google/tapas-base-uncased', config=config)
+
+You can also start from an already fine-tuned checkpoint. Note that the checkpoint fine-tuned on WTQ has some issues
+due to the L2 loss, which is somewhat brittle. See `here `__ for more info.
+
+For a list of all pre-trained and fine-tuned TAPAS checkpoints available in the HuggingFace model hub, see `here `__.
+
+===========================================
+STEP 2: Prepare your data in the SQA format
+===========================================
+
+Second, no matter what you picked above, you should prepare your dataset in the `SQA format `__.
+This format is a TSV/CSV file with the following columns:
+
+- ``id``: optional, id of the table-question pair, for bookkeeping purposes.
+- ``annotator``: optional, id of the person who annotated the table-question pair, for bookkeeping purposes.
+- ``position``: integer indicating if the question is the first, second, third,... related to the table. Only required in case of conversational setup (SQA).
+  You don't need this column in case you're going for WTQ/WikiSQL/WikiSQL-supervised.
+- ``question``: string
+- ``table_file``: string, name of a csv file containing the tabular data
+- ``answer_coordinates``: list of one or more tuples (each tuple being a cell coordinate, i.e. a row, column pair that is part of the answer)
+- ``answer_text``: list of one or more strings (each string being a cell value that is part of the answer)
+- ``aggregation_label``: index of the aggregation operator. Only required in case of strong supervision for aggregation (the WikiSQL-supervised case)
+- ``float_answer``: the float answer to the question, if there is one (np.nan if there isn't). Only required in case of weak supervision for aggregation (such as WTQ and WikiSQL)
+
+The tables themselves should be present in a folder, each table being a separate csv file. Note that the authors of the TAPAS algorithm used conversion
+scripts with some automated logic to convert the other datasets (WTQ and WikiSQL) into the SQA format. The author explains this `here `__.
+Interestingly, these conversion scripts are not perfect (the ``answer_coordinates`` and ``float_answer`` fields are populated based on the ``answer_text``),
+meaning that WTQ and WikiSQL results could actually be improved.
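+
+As an illustration, a minimal weak-supervision (WTQ-style) training file could be written with pandas as follows. This
+is only a sketch: the file names and values are made up, and ``to_csv`` stores the coordinate lists as their string
+representation:
+
+.. code-block::
+
+    >>> import pandas as pd
+
+    >>> # a single training example whose answer is the sum of three cells
+    >>> data = {'id': ['example-0'], 'annotator': [0],
+    ...         'question': ["What is the total number of movies?"],
+    ...         'table_file': ['table_0.csv'],
+    ...         'answer_coordinates': [[(0, 1), (1, 1), (2, 1)]],
+    ...         'answer_text': [["209"]],
+    ...         'float_answer': [209.0]}
+    >>> pd.DataFrame(data).to_csv('train.tsv', sep='\t', index=False)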
+==========================================================================================
+STEP 3: Convert your data into PyTorch tensors using :class:`~transformers.TapasTokenizer`
+==========================================================================================
+
+Third, given that you've prepared your data in this TSV/CSV format (and corresponding CSV files containing the tabular data), you can then
+use :class:`~transformers.TapasTokenizer` to convert table-question pairs into :obj:`input_ids`, :obj:`attention_mask`, :obj:`token_type_ids`
+and so on. Again, based on which of the three cases you picked above, :class:`~transformers.TapasForQuestionAnswering` requires different inputs
+to be fine-tuned:
+
++------------------------------------+----------------------------------------------------------------------------------------------+
+| **Task**                           | **Required inputs**                                                                          |
++------------------------------------+----------------------------------------------------------------------------------------------+
+| Conversational                     | ``input_ids``, ``attention_mask``, ``token_type_ids``, ``label_ids``                         |
++------------------------------------+----------------------------------------------------------------------------------------------+
+| Weak supervision for aggregation   | ``input_ids``, ``attention_mask``, ``token_type_ids``, ``label_ids``, ``numeric_values``,    |
+|                                    | ``numeric_values_scale``, ``float_answer``                                                   |
++------------------------------------+----------------------------------------------------------------------------------------------+
+| Strong supervision for aggregation | ``input_ids``, ``attention_mask``, ``token_type_ids``, ``label_ids``, ``aggregation_labels`` |
++------------------------------------+----------------------------------------------------------------------------------------------+
+
+:class:`~transformers.TapasTokenizer` creates the ``label_ids``, ``numeric_values`` and ``numeric_values_scale`` based on the
+``answer_coordinates`` and ``answer_text`` columns of the TSV file. The ``float_answer`` and ``aggregation_labels`` are already in the TSV file of step 2.
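+
+For the weak supervision case, ``float_answer`` can often be derived from ``answer_text`` when the answer is a single
+number. A minimal sketch of such a helper (this function is hypothetical, not part of the library):
+
+.. code-block::
+
+    >>> import numpy as np
+
+    >>> def to_float_answer(answer_text):
+    ...     # return the answer as a float if it is a single numeric cell value, else np.nan
+    ...     if len(answer_text) == 1:
+    ...         try:
+    ...             return float(answer_text[0])
+    ...         except ValueError:
+    ...             pass
+    ...     return np.nan
+
+    >>> to_float_answer(["209"])
+    209.0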
+Here's an example of encoding a table-question pair:
+
+.. code-block::
+
+    >>> from transformers import TapasTokenizer
+    >>> import pandas as pd
+
+    >>> model_name = 'google/tapas-base-uncased'
+    >>> tokenizer = TapasTokenizer.from_pretrained(model_name)
+
+    >>> data = {'Actors': ["Brad Pitt", "Leonardo Di Caprio", "George Clooney"], 'Number of movies': ["87", "53", "69"]}
+    >>> queries = ["What is the name of the first actor?", "How many movies has George Clooney played in?", "What is the total number of movies?"]
+    >>> answer_coordinates = [[(0, 0)], [(2, 1)], [(0, 1), (1, 1), (2, 1)]]
+    >>> answer_text = [["Brad Pitt"], ["69"], ["209"]]
+    >>> table = pd.DataFrame(data)
+    >>> inputs = tokenizer(table=table, queries=queries, answer_coordinates=answer_coordinates, answer_text=answer_text, padding='max_length', return_tensors='pt')
+    >>> inputs
+    {'input_ids': tensor([[ ... ]]), 'attention_mask': tensor([[...]]), 'token_type_ids': tensor([[[...]]]),
+    'numeric_values': tensor([[ ... ]]), 'numeric_values_scale': tensor([[ ... ]]), 'label_ids': tensor([[ ... ]])}
+
+Note that :class:`~transformers.TapasTokenizer` expects the data of the table to be text-only. You can use ``.astype(str)`` on a dataframe to turn it into
+text-only data. Of course, this only shows how to encode a single training example. It is advised to create a PyTorch dataset and a corresponding dataloader:
+
+.. code-block::
+
+    >>> import torch
+    >>> import pandas as pd
+
+    >>> tsv_path = "your_path_to_the_tsv_file"
+    >>> table_csv_path = "your_path_to_a_directory_containing_all_csv_files"
+
+    >>> class TableDataset(torch.utils.data.Dataset):
+    ...     def __init__(self, data, tokenizer):
+    ...         self.data = data
+    ...         self.tokenizer = tokenizer
+    ...
+    ...     def __getitem__(self, idx):
+    ...         item = self.data.iloc[idx]
+    ...         table = pd.read_csv(table_csv_path + item.table_file).astype(str)
+    ...         encoding = self.tokenizer(table=table,
+    ...                                   queries=item.question,
+    ...                                   answer_coordinates=item.answer_coordinates,
+    ...                                   answer_text=item.answer_text,
+    ...                                   padding="max_length",
+    ...                                   return_tensors="pt"
+    ...         )
+    ...         # we add the float_answer which is also required (weak supervision for aggregation)
+    ...         encoding["float_answer"] = torch.tensor(item.float_answer)
+    ...         return encoding
+    ...
+    ...     def __len__(self):
+    ...         return len(self.data)
+
+    >>> data = pd.read_csv(tsv_path, sep='\t')
+    >>> train_dataset = TableDataset(data, tokenizer)
+    >>> train_dataloader = torch.utils.data.DataLoader(train_dataset, batch_size=32)
+
+Note that here, we encode each table-question pair independently. This is fine as long as your dataset is **not conversational**. In case your
+dataset involves conversational questions (such as in SQA), then you should first group together the ``queries``, ``answer_coordinates`` and
+``answer_text`` per table (in the order of their ``position`` index) and batch encode each table with its questions. This will make sure that
+the ``prev_label_ids`` token types (see docs of :class:`~transformers.TapasTokenizer`) are set correctly, as shown in the sketch below.
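+
+A minimal sketch of such grouping, assuming ``data``, ``tokenizer`` and ``table_csv_path`` are defined as above (note
+that ``answer_coordinates`` and ``answer_text`` read back from a TSV are strings, so in practice they must first be
+parsed into Python lists again, e.g. with ``ast.literal_eval``):
+
+.. code-block::
+
+    >>> import pandas as pd
+
+    >>> # batch encode each table together with all of its questions, ordered by their ``position``
+    >>> for table_file, group in data.groupby('table_file'):
+    ...     group = group.sort_values('position')
+    ...     table = pd.read_csv(table_csv_path + table_file).astype(str)
+    ...     encoding = tokenizer(table=table,
+    ...                          queries=group.question.tolist(),
+    ...                          answer_coordinates=group.answer_coordinates.tolist(),
+    ...                          answer_text=group.answer_text.tolist(),
+    ...                          padding="max_length",
+    ...                          return_tensors="pt")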
+===================================================
+STEP 4: Train (fine-tune) TapasForQuestionAnswering
+===================================================
+
+You can then fine-tune :class:`~transformers.TapasForQuestionAnswering` using native PyTorch as follows:
+
+.. code-block::
+
+    >>> from transformers import TapasForQuestionAnswering, AdamW
+
+    >>> model = TapasForQuestionAnswering.from_pretrained("google/tapas-base-uncased")
+    >>> optimizer = AdamW(model.parameters(), lr=5e-5)
+
+    >>> for epoch in range(2):  # loop over the dataset multiple times
+    ...    for idx, batch in enumerate(train_dataloader):
+    ...        # get the inputs
+    ...        input_ids, attention_mask, token_type_ids, label_ids, numeric_values, numeric_values_scale, float_answer = batch
+
+    ...        # zero the parameter gradients
+    ...        optimizer.zero_grad()
+
+    ...        # forward + backward + optimize
+    ...        outputs = model(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids,
+    ...                        label_ids=label_ids, numeric_values=numeric_values, numeric_values_scale=numeric_values_scale,
+    ...                        float_answer=float_answer)
+    ...        loss = outputs.loss
+    ...        loss.backward()
+    ...        optimizer.step()
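+
+After fine-tuning, you would typically save the model so it can be reloaded for inference; a minimal sketch (the
+directory name is just an example):
+
+.. code-block::
+
+    >>> model.save_pretrained("path_to_fine_tuned_model")
+    >>> # later, reload the fine-tuned model
+    >>> model = TapasForQuestionAnswering.from_pretrained("path_to_fine_tuned_model")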
print("Predicted answer: " + predicted_agg + " > " + answer) + When was Brad Pitt born? + Predicted answer: 18 december 1963 + Which actor appeared in the least number of movies? + Predicted answer: Leonardo Di Caprio + What is the average number of movies? + Predicted answer: AVERAGE > 87, 53, 69 + +In case of a conversational set-up, then each table-question pair must be provided **sequentially** to the model, such that +the ``prev_label_ids`` token types can be overwritten by the predicted ``label_ids`` of the previous table-question pair. + + +Tapas specific outputs +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. autoclass:: transformers.modeling_tapas.TableQuestionAnsweringOutput + :members: + + +TapasConfig +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. autoclass:: transformers.TapasConfig + :members: + + +TapasTokenizer +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. autoclass:: transformers.TapasTokenizer + :members: __call__, convert_logits_to_predictions, save_vocabulary + + +TapasModel +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. autoclass:: transformers.TapasModel + :members: + + +TapasForMaskedLM +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. autoclass:: transformers.TapasForMaskedLM + :members: + + +TapasForSequenceClassification +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. autoclass:: transformers.TapasForSequenceClassification + :members: forward + + +TapasForQuestionAnswering +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. 
autoclass:: transformers.TapasForQuestionAnswering + :members: \ No newline at end of file diff --git a/src/transformers/__init__.py b/src/transformers/__init__.py index ee5da4399984..df1c3ee1f2bb 100755 --- a/src/transformers/__init__.py +++ b/src/transformers/__init__.py @@ -61,6 +61,7 @@ from .configuration_roberta import ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP, RobertaConfig from .configuration_squeezebert import SQUEEZEBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, SqueezeBertConfig from .configuration_t5 import T5_PRETRAINED_CONFIG_ARCHIVE_MAP, T5Config +from .configuration_tapas import TAPAS_PRETRAINED_CONFIG_ARCHIVE_MAP, TapasConfig from .configuration_transfo_xl import TRANSFO_XL_PRETRAINED_CONFIG_ARCHIVE_MAP, TransfoXLConfig from .configuration_utils import PretrainedConfig from .configuration_xlm import XLM_PRETRAINED_CONFIG_ARCHIVE_MAP, XLMConfig @@ -190,6 +191,7 @@ from .tokenization_retribert import RetriBertTokenizer from .tokenization_roberta import RobertaTokenizer from .tokenization_squeezebert import SqueezeBertTokenizer +from .tokenization_tapas import TapasTokenizer, TapasTruncationStrategy from .tokenization_transfo_xl import TransfoXLCorpus, TransfoXLTokenizer from .tokenization_utils import PreTrainedTokenizer from .tokenization_utils_base import ( @@ -558,6 +560,14 @@ T5PreTrainedModel, load_tf_weights_in_t5, ) + from .modeling_tapas import ( + TAPAS_PRETRAINED_MODEL_ARCHIVE_LIST, + TapasForMaskedLM, + TapasForQuestionAnswering, + TapasForSequenceClassification, + TapasModel, + load_tf_weights_in_tapas, + ) from .modeling_transfo_xl import ( TRANSFO_XL_PRETRAINED_MODEL_ARCHIVE_LIST, AdaptiveEmbedding, diff --git a/src/transformers/commands/convert.py b/src/transformers/commands/convert.py index 1e054b6a30eb..03ac380cdaa2 100644 --- a/src/transformers/commands/convert.py +++ b/src/transformers/commands/convert.py @@ -130,6 +130,13 @@ def run(self): raise ImportError(IMPORT_ERROR_MESSAGE) convert_gpt2_checkpoint_to_pytorch(self._tf_checkpoint, self._config, self._pytorch_dump_output) + elif self._model_type == "tapas": + try: + from transformers.convert_tapas_original_tf_checkpoint_to_pytorch import ( + convert_tf_checkpoint_to_pytorch, + ) + except ImportError: + raise ImportError(IMPORT_ERROR_MESSAGE) elif self._model_type == "xlnet": try: from transformers.convert_xlnet_original_tf_checkpoint_to_pytorch import ( diff --git a/src/transformers/configuration_auto.py b/src/transformers/configuration_auto.py index 3e411ac37ec7..b6b8e6ca7b56 100644 --- a/src/transformers/configuration_auto.py +++ b/src/transformers/configuration_auto.py @@ -48,6 +48,7 @@ from .configuration_roberta import ROBERTA_PRETRAINED_CONFIG_ARCHIVE_MAP, RobertaConfig from .configuration_squeezebert import SQUEEZEBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, SqueezeBertConfig from .configuration_t5 import T5_PRETRAINED_CONFIG_ARCHIVE_MAP, T5Config +from .configuration_tapas import TAPAS_PRETRAINED_CONFIG_ARCHIVE_MAP, TapasConfig from .configuration_transfo_xl import TRANSFO_XL_PRETRAINED_CONFIG_ARCHIVE_MAP, TransfoXLConfig from .configuration_utils import PretrainedConfig from .configuration_xlm import XLM_PRETRAINED_CONFIG_ARCHIVE_MAP, XLMConfig @@ -88,6 +89,7 @@ SQUEEZEBERT_PRETRAINED_CONFIG_ARCHIVE_MAP, XLM_PROPHETNET_PRETRAINED_CONFIG_ARCHIVE_MAP, PROPHETNET_PRETRAINED_CONFIG_ARCHIVE_MAP, + TAPAS_PRETRAINED_CONFIG_ARCHIVE_MAP, ] for key, value, in pretrained_map.items() ) @@ -131,6 +133,7 @@ ("dpr", DPRConfig), ("layoutlm", LayoutLMConfig), ("rag", RagConfig), + ("tapas", TapasConfig), ] ) @@ -172,6 +175,7 @@ 
("rag", "RAG"), ("xlm-prophetnet", "XLMProphetNet"), ("prophetnet", "ProphetNet"), + ("tapas", "TAPAS"), ] ) diff --git a/src/transformers/configuration_tapas.py b/src/transformers/configuration_tapas.py new file mode 100644 index 000000000000..844e433ac02d --- /dev/null +++ b/src/transformers/configuration_tapas.py @@ -0,0 +1,209 @@ +# coding=utf-8 +# Copyright 2020 Google Research and The HuggingFace Inc. team. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" TAPAS configuration. Adds additional hyperparameters to the configuration of BERT.""" + + +from .configuration_utils import PretrainedConfig + + +TAPAS_PRETRAINED_CONFIG_ARCHIVE_MAP = {"nielsr/tapas-base-finetuned-sqa": "https://huggingface.co/nielsr/tapas-base-finetuned-sqa/resolve/main/config.json", + "nielsr/tapas-base-finetuned-wtq": "https://huggingface.co/nielsr/tapas-base-finetuned-wtq/resolve/main/config.json", + "nielsr/tapas-base-finetuned-wikisql-supervised": "https://huggingface.co/nielsr/tapas-base-finetuned-wikisql-supervised/resolve/main/config.json", + "nielsr/tapas-base-finetuned-tabfact": "https://huggingface.co/nielsr/tapas-base-finetuned-tabfact/resolve/main/config.json"} + + +class TapasConfig(PretrainedConfig): + r""" + This is the configuration class to store the configuration of a :class:`~transformers.TapasModel`. It is used to + instantiate a TAPAS model according to the specified arguments, defining the model architecture. Instantiating a + configuration with the defaults will yield a similar configuration to that of the TAPAS `tapas-base-finetuned-sqa` + architecture. Configuration objects inherit from :class:`~transformers.PreTrainedConfig` and can be used to control + the model outputs. Read the documentation from :class:`~transformers.PretrainedConfig` for more information. + + Hyperparameters additional to BERT are taken from run_task_main.py and hparam_utils.py of the original + implementation. Original implementation available at https://github.com/google-research/tapas/tree/master. + + Args: + vocab_size (:obj:`int`, `optional`, defaults to 30522): + Vocabulary size of the TAPAS model. Defines the number of different tokens that can be represented by the + :obj:`inputs_ids` passed when calling :class:`~transformers.TapasModel`. + hidden_size (:obj:`int`, `optional`, defaults to 768): + Dimensionality of the encoder layers and the pooler layer. + num_hidden_layers (:obj:`int`, `optional`, defaults to 12): + Number of hidden layers in the Transformer encoder. + num_attention_heads (:obj:`int`, `optional`, defaults to 12): + Number of attention heads for each attention layer in the Transformer encoder. + intermediate_size (:obj:`int`, `optional`, defaults to 3072): + Dimensionality of the "intermediate" (often named feed-forward) layer in the Transformer encoder. + hidden_act (:obj:`str` or :obj:`Callable`, `optional`, defaults to :obj:`"gelu"`): + The non-linear activation function (function or string) in the encoder and pooler. 
If string, + :obj:`"gelu"`, :obj:`"relu"`, :obj:`"swish"` and :obj:`"gelu_new"` are supported. + hidden_dropout_prob (:obj:`float`, `optional`, defaults to 0.1): + The dropout probability for all fully connected layers in the embeddings, encoder, and pooler. + attention_probs_dropout_prob (:obj:`float`, `optional`, defaults to 0.1): + The dropout ratio for the attention probabilities. + max_position_embeddings (:obj:`int`, `optional`, defaults to 1024): + The maximum sequence length that this model might ever be used with. Typically set this to something large + just in case (e.g., 512 or 1024 or 2048). + type_vocab_sizes (:obj:`List[int]`, `optional`, defaults to [3, 256, 256, 2, 256, 256, 10]): + The vocabulary sizes of the :obj:`token_type_ids` passed when calling :class:`~transformers.TapasModel`. + initializer_range (:obj:`float`, `optional`, defaults to 0.02): + The standard deviation of the truncated_normal_initializer for initializing all weight matrices. + layer_norm_eps (:obj:`float`, `optional`, defaults to 1e-12): + The epsilon used by the layer normalization layers. + gradient_checkpointing (:obj:`bool`, `optional`, defaults to :obj:`False`): + Whether to use gradient checkpointing to save memory at the expense of a slower backward pass. + positive_label_weight (:obj:`float`, `optional`, defaults to 10.0): + Weight for positive labels. + num_aggregation_labels (:obj:`int`, `optional`, defaults to 0): + The number of aggregation operators to predict. + aggregation_loss_weight (:obj:`float`, `optional`, defaults to 1.0): + Importance weight for the aggregation loss. + use_answer_as_supervision (:obj:`bool`, `optional`): + Whether to use the answer as the only supervision for aggregation examples. + answer_loss_importance (:obj:`float`, `optional`, defaults to 1.0): + Importance weight for the regression loss. + use_normalized_answer_loss (:obj:`bool`, `optional`, defaults to :obj:`False`): + Whether to normalize the answer loss by the maximum of the predicted and expected value. + huber_loss_delta: (:obj:`float`, `optional`): + Delta parameter used to calculate the regression loss. + temperature: (:obj:`float`, `optional`, defaults to 1.0): + Value used to control (OR change) the skewness of cell logits probabilities. + aggregation_temperature: (:obj:`float`, `optional`, defaults to 1.0): + Scales aggregation logits to control the skewness of probabilities. + use_gumbel_for_cells: (:obj:`bool`, `optional`, defaults to :obj:`False`): + Whether to apply Gumbel-Softmax to cell selection. + use_gumbel_for_aggregation: (:obj:`bool`, `optional`, defaults to :obj:`False`): + Whether to apply Gumbel-Softmax to aggregation selection. + average_approximation_function: (:obj:`string`, `optional`, defaults to :obj:`"ratio"`): + Method to calculate the expected average of cells in the weak supervision case. One of :obj:`"ratio"`, + :obj:`"first_order"` or :obj:`"second_order"`. + cell_selection_preference: (:obj:`float`, `optional`): + Preference for cell selection in ambiguous cases. Only applicable in case of weak supervision for + aggregation (WTQ, WikiSQL). If the total mass of the aggregation probabilities (excluding the "NONE" + operator) is higher than this hyperparameter, then aggregation is predicted for an example. + answer_loss_cutoff: (:obj:`float`, `optional`): + Ignore examples with answer loss larger than cutoff. + max_num_rows: (:obj:`int`, `optional`, defaults to 64): + Maximum number of rows. 
+ max_num_columns: (:obj:`int`, `optional`, defaults to 32): + Maximum number of columns. + average_logits_per_cell: (:obj:`bool`, `optional`, defaults to :obj:`False`): + Whether to average logits per cell. + select_one_column: (:obj:`bool`, `optional`, defaults to :obj:`True`): + Whether to constrain the model to only select cells from a single column. + allow_empty_column_selection: (:obj:`bool`, `optional`, defaults to :obj:`False`): + Whether to allow not to select any column. + init_cell_selection_weights_to_zero: (:obj:`bool`, `optional`, defaults to :obj:`False`): + Whether to initialize cell selection weights to 0 so that the initial probabilities are 50%. + reset_position_index_per_cell: (:obj:`bool`, `optional`, defaults to :obj:`True`): + Whether to restart position indexes at every cell (i.e. use relative position embeddings). + disable_per_token_loss: (:obj:`bool`, `optional`, defaults to :obj:`False`): + Whether to disable any (strong or weak) supervision on cells. + + Example:: + + >>> from transformers import TapasModel, TapasConfig + >>> # Initializing a Tapas configuration + >>> configuration = TapasConfig() + >>> # Initializing a model from the configuration + >>> model = TapasModel(configuration) + >>> # Accessing the model configuration + >>> configuration = model.config + """ + + model_type = "tapas" + + def __init__( + self, + vocab_size=30522, + hidden_size=768, + num_hidden_layers=12, + num_attention_heads=12, + intermediate_size=3072, + hidden_act="gelu", + hidden_dropout_prob=0.1, + attention_probs_dropout_prob=0.1, + max_position_embeddings=1024, + type_vocab_sizes=[3, 256, 256, 2, 256, 256, 10], + initializer_range=0.02, + layer_norm_eps=1e-12, + pad_token_id=0, + gradient_checkpointing=False, + positive_label_weight=10.0, + num_aggregation_labels=0, + aggregation_loss_weight=1.0, + use_answer_as_supervision=None, + answer_loss_importance=1.0, + use_normalized_answer_loss=False, + huber_loss_delta=None, + temperature=1.0, + aggregation_temperature=1.0, + use_gumbel_for_cells=False, + use_gumbel_for_aggregation=False, + average_approximation_function="ratio", + cell_selection_preference=None, + answer_loss_cutoff=None, + max_num_rows=64, + max_num_columns=32, + average_logits_per_cell=False, + select_one_column=True, + allow_empty_column_selection=False, + init_cell_selection_weights_to_zero=False, + reset_position_index_per_cell=True, + disable_per_token_loss=False, + **kwargs + ): + + super().__init__(pad_token_id=pad_token_id, **kwargs) + + # BERT hyperparameters (with updated max_position_embeddings and type_vocab_sizes) + self.vocab_size = vocab_size + self.hidden_size = hidden_size + self.num_hidden_layers = num_hidden_layers + self.num_attention_heads = num_attention_heads + self.hidden_act = hidden_act + self.intermediate_size = intermediate_size + self.hidden_dropout_prob = hidden_dropout_prob + self.attention_probs_dropout_prob = attention_probs_dropout_prob + self.max_position_embeddings = max_position_embeddings + self.type_vocab_sizes = type_vocab_sizes + self.initializer_range = initializer_range + self.layer_norm_eps = layer_norm_eps + self.gradient_checkpointing = gradient_checkpointing + + # Fine-tuning task hyperparameters + self.positive_label_weight = positive_label_weight + self.num_aggregation_labels = num_aggregation_labels + self.aggregation_loss_weight = aggregation_loss_weight + self.use_answer_as_supervision = use_answer_as_supervision + self.answer_loss_importance = answer_loss_importance + self.use_normalized_answer_loss = 
use_normalized_answer_loss + self.huber_loss_delta = huber_loss_delta + self.temperature = temperature + self.aggregation_temperature = aggregation_temperature + self.use_gumbel_for_cells = use_gumbel_for_cells + self.use_gumbel_for_aggregation = use_gumbel_for_aggregation + self.average_approximation_function = average_approximation_function + self.cell_selection_preference = cell_selection_preference + self.answer_loss_cutoff = answer_loss_cutoff + self.max_num_rows = max_num_rows + self.max_num_columns = max_num_columns + self.average_logits_per_cell = average_logits_per_cell + self.select_one_column = select_one_column + self.allow_empty_column_selection = allow_empty_column_selection + self.init_cell_selection_weights_to_zero = init_cell_selection_weights_to_zero + self.reset_position_index_per_cell = reset_position_index_per_cell + self.disable_per_token_loss = disable_per_token_loss \ No newline at end of file diff --git a/src/transformers/configuration_utils.py b/src/transformers/configuration_utils.py index eb21fa203423..5dc320dfb32c 100755 --- a/src/transformers/configuration_utils.py +++ b/src/transformers/configuration_utils.py @@ -163,7 +163,7 @@ class PretrainedConfig(object): def __init__(self, **kwargs): # Attributes with defaults - self.return_dict = kwargs.pop("return_dict", False) + self.return_dict = kwargs.pop("return_dict", True) self.output_hidden_states = kwargs.pop("output_hidden_states", False) self.output_attentions = kwargs.pop("output_attentions", False) self.use_cache = kwargs.pop("use_cache", True) # Not used by all models diff --git a/src/transformers/convert_tapas_original_tf_checkpoint_to_pytorch.py b/src/transformers/convert_tapas_original_tf_checkpoint_to_pytorch.py new file mode 100644 index 000000000000..e78ec9e7ae88 --- /dev/null +++ b/src/transformers/convert_tapas_original_tf_checkpoint_to_pytorch.py @@ -0,0 +1,120 @@ +# coding=utf-8 +# Copyright 2018 The HuggingFace Inc. team. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Convert TAPAS checkpoint.""" + + +import argparse + +import torch + +from transformers import ( + TapasConfig, + TapasModel, + TapasForQuestionAnswering, + TapasForSequenceClassification, + load_tf_weights_in_tapas, +) +from transformers.utils import logging + + +logging.set_verbosity_info() + + +def convert_tf_checkpoint_to_pytorch(tf_checkpoint_path, tapas_config_file, pytorch_dump_path): + # Initialise PyTorch model. Defaults to TapasForQuestionAnswering with default SQA config. + # Uncomment another config and/or model to change this. If you want to convert a checkpoint + # that has absolute position embeddings, make sure to set reset_position_index_per_cell of + # TapasConfig to False. 
+
+    # WTQ config
+    # config = TapasConfig(
+    #     # run_task_main.py hparams
+    #     num_aggregation_labels = 4,
+    #     use_answer_as_supervision = True,
+    #     # hparam_utils.py hparams
+    #     answer_loss_cutoff = 0.664694,
+    #     cell_selection_preference = 0.207951,
+    #     huber_loss_delta = 0.121194,
+    #     init_cell_selection_weights_to_zero = True,
+    #     select_one_column = True,
+    #     allow_empty_column_selection = False,
+    #     temperature = 0.0352513,
+    # )
+
+    # WikiSQL config
+    # config = TapasConfig(
+    #     # run_task_main.py hparams
+    #     num_aggregation_labels = 4,
+    #     use_answer_as_supervision = True,
+    #     # hparam_utils.py hparams
+    #     answer_loss_cutoff = 0.185567,
+    #     cell_selection_preference = 0.611754,
+    #     huber_loss_delta = 1265.74,
+    #     init_cell_selection_weights_to_zero = False,
+    #     select_one_column = False,
+    #     allow_empty_column_selection = False,
+    #     temperature = 0.107515,
+    # )
+
+    # WikiSQL-supervised config
+    # config = TapasConfig(
+    #     # run_task_main.py hparams
+    #     num_aggregation_labels = 4,
+    #     use_answer_as_supervision = False,
+    #     # hparam_utils.py hparams
+    #     answer_loss_cutoff = 36.4519,
+    #     cell_selection_preference = 0.903421,
+    #     huber_loss_delta = 222.088,
+    #     init_cell_selection_weights_to_zero = True,
+    #     select_one_column = True,
+    #     allow_empty_column_selection = True,
+    #     temperature = 0.763141,
+    # )
+
+    # SQA config
+    config = TapasConfig()
+
+    print("Building PyTorch model from configuration: {}".format(str(config)))
+    model = TapasForQuestionAnswering(config)
+    # model = TapasModel(config)
+    # model = TapasForSequenceClassification(config)
+
+    # Load weights from tf checkpoint
+    load_tf_weights_in_tapas(model, config, tf_checkpoint_path)
+
+    # Save pytorch-model
+    print("Save PyTorch model to {}".format(pytorch_dump_path))
+    torch.save(model.state_dict(), pytorch_dump_path)
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+    # Required parameters
+    parser.add_argument(
+        "--tf_checkpoint_path", default=None, type=str, required=True, help="Path to the TensorFlow checkpoint path."
+    )
+    parser.add_argument(
+        "--tapas_config_file",
+        default=None,
+        type=str,
+        required=True,
+        help="The config json file corresponding to the pre-trained TAPAS model. \n"
+        "This specifies the model architecture.",
+    )
+    parser.add_argument(
+        "--pytorch_dump_path", default=None, type=str, required=True, help="Path to the output PyTorch model."
+ ) + args = parser.parse_args() + convert_tf_checkpoint_to_pytorch(args.tf_checkpoint_path, args.tapas_config_file, args.pytorch_dump_path) \ No newline at end of file diff --git a/src/transformers/file_utils.py b/src/transformers/file_utils.py index d9f2ec0db686..20834e4550a3 100644 --- a/src/transformers/file_utils.py +++ b/src/transformers/file_utils.py @@ -193,6 +193,20 @@ _tokenizers_available = False +try: + import torch_scatter + + # Check we're not importing a "torch_scatter" directory somewhere + _scatter_available = hasattr(torch_scatter, "__version__") and hasattr(torch_scatter, "scatter") + if _scatter_available: + logger.debug(f"Succesfully imported torch-scatter version {torch_scatter.__version__}") + else: + logger.debug("Imported a torch_scatter object but this doesn't seem to be the torch-scatter library.") + +except ImportError: + _scatter_available = False + + default_cache_path = os.path.join(torch_cache_home, "transformers") @@ -289,6 +303,14 @@ def wrapper(*args, **kwargs): # docstyle-ignore +def is_sklearn_available(): + return _has_sklearn + + +def is_scatter_available(): + return _scatter_available + + DATASETS_IMPORT_ERROR = """ {0} requires the 🤗 Datasets library but it was not found in your environment. You can install it with: ``` @@ -368,6 +390,12 @@ def wrapper(*args, **kwargs): installation page: https://github.com/google/flax and follow the ones that match your environment. """ +SCATTER_IMPORT_ERROR = """ +{0} requires the torch-scatter library but it was not found in your environment. You can install it with pip as +explained here: https://github.com/rusty1s/pytorch_scatter. + +""" + def requires_datasets(obj): name = obj.__name__ if hasattr(obj, "__name__") else obj.__class__.__name__ @@ -417,6 +445,12 @@ def requires_sentencepiece(obj): raise ImportError(SENTENCEPIECE_IMPORT_ERROR.format(name)) +def requires_scatter(obj): + name = obj.__name__ if hasattr(obj, "__name__") else obj.__class__.__name__ + if not is_scatter_available(): + raise ImportError(SCATTER_IMPORT_ERROR.format(name)) + + def add_start_docstrings(*docstr): def docstring_decorator(fn): fn.__doc__ = "".join(docstr) + (fn.__doc__ if fn.__doc__ is not None else "") diff --git a/src/transformers/modeling_auto.py b/src/transformers/modeling_auto.py index 3ec971325075..4f0ff52550d5 100644 --- a/src/transformers/modeling_auto.py +++ b/src/transformers/modeling_auto.py @@ -49,6 +49,7 @@ RobertaConfig, SqueezeBertConfig, T5Config, + TapasConfig, TransfoXLConfig, XLMConfig, XLMProphetNetConfig, @@ -188,6 +189,7 @@ SqueezeBertModel, ) from .modeling_t5 import T5ForConditionalGeneration, T5Model +from .modeling_tapas import TapasForMaskedLM, TapasForQuestionAnswering, TapasForSequenceClassification, TapasModel from .modeling_transfo_xl import TransfoXLLMHeadModel, TransfoXLModel from .modeling_xlm import ( XLMForMultipleChoice, @@ -229,6 +231,7 @@ [ (RetriBertConfig, RetriBertModel), (T5Config, T5Model), + (TapasConfig, TapasModel), (DistilBertConfig, DistilBertModel), (AlbertConfig, AlbertModel), (CamembertConfig, CamembertModel), @@ -265,6 +268,7 @@ (LayoutLMConfig, LayoutLMForMaskedLM), (RetriBertConfig, RetriBertModel), (T5Config, T5ForConditionalGeneration), + (TapasConfig, TapasForMaskedLM), (DistilBertConfig, DistilBertForMaskedLM), (AlbertConfig, AlbertForPreTraining), (CamembertConfig, CamembertForMaskedLM), @@ -292,6 +296,7 @@ [ (LayoutLMConfig, LayoutLMForMaskedLM), (T5Config, T5ForConditionalGeneration), + (TapasConfig, TapasForMaskedLM), (DistilBertConfig, DistilBertForMaskedLM), 
(AlbertConfig, AlbertForMaskedLM), (CamembertConfig, CamembertForMaskedLM), @@ -351,6 +356,7 @@ (LongformerConfig, LongformerForMaskedLM), (RobertaConfig, RobertaForMaskedLM), (SqueezeBertConfig, SqueezeBertForMaskedLM), + (TapasConfig, TapasForMaskedLM), (BertConfig, BertForMaskedLM), (MobileBertConfig, MobileBertForMaskedLM), (FlaubertConfig, FlaubertWithLMHeadModel), @@ -396,6 +402,7 @@ (DebertaConfig, DebertaForSequenceClassification), (GPT2Config, GPT2ForSequenceClassification), (OpenAIGPTConfig, OpenAIGPTForSequenceClassification), + (TapasConfig, TapasForSequenceClassification), ] ) @@ -410,6 +417,7 @@ (RobertaConfig, RobertaForQuestionAnswering), (SqueezeBertConfig, SqueezeBertForQuestionAnswering), (BertConfig, BertForQuestionAnswering), + (TapasConfig, TapasForQuestionAnswering), (XLNetConfig, XLNetForQuestionAnsweringSimple), (FlaubertConfig, FlaubertForQuestionAnsweringSimple), (MobileBertConfig, MobileBertForQuestionAnswering), diff --git a/src/transformers/modeling_tapas.py b/src/transformers/modeling_tapas.py new file mode 100644 index 000000000000..963dcb8bfa7f --- /dev/null +++ b/src/transformers/modeling_tapas.py @@ -0,0 +1,2306 @@ +# coding=utf-8 +# Copyright 2020 Google Research and The HuggingFace Inc. team. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""PyTorch TAPAS model. """ + + +import enum +import math +import os +from dataclasses import dataclass +from typing import Optional, Tuple + +import torch +import torch.nn as nn +from torch.nn import CrossEntropyLoss, MSELoss + +from .activations import ACT2FN +from .configuration_tapas import TapasConfig +from .file_utils import ( + ModelOutput, + add_start_docstrings, + add_start_docstrings_to_model_forward, + is_scatter_available, + replace_return_docstrings, + requires_scatter, +) +from .modeling_outputs import BaseModelOutput, BaseModelOutputWithPooling, MaskedLMOutput, SequenceClassifierOutput +from .modeling_utils import ( + PreTrainedModel, + apply_chunking_to_forward, + find_pruneable_heads_and_indices, + prune_linear_layer, +) +from .utils import logging + + +# soft dependency +if is_scatter_available(): + from torch_scatter import scatter + + +logger = logging.get_logger(__name__) + +_CONFIG_FOR_DOC = "TapasConfig" +_TOKENIZER_FOR_DOC = "TapasTokenizer" + +TAPAS_PRETRAINED_MODEL_ARCHIVE_LIST = [ + "nielsr/tapas-base-finetuned-sqa", + "nielsr/tapas-base-finetuned-wtq", + "nielsr/tapas-base-finetuned-wikisql-supervised", + # See all TAPAS models at https://huggingface.co/models?filter=tapas +] + +EPSILON_ZERO_DIVISION = 1e-10 +CLOSE_ENOUGH_TO_LOG_ZERO = -10000.0 + + +@dataclass +class TableQuestionAnsweringOutput(ModelOutput): + """ + Output type of :class:`~transformers.TapasForQuestionAnswering`. 
+
+    Args:
+        loss (:obj:`torch.FloatTensor` of shape :obj:`(1,)`, `optional`, returned when :obj:`label_ids` (and possibly :obj:`answer`, :obj:`aggregation_labels`, :obj:`numeric_values` and :obj:`numeric_values_scale`) are provided):
+            Total loss as the sum of the hierarchical cell selection log-likelihood loss and (optionally) the
+            semi-supervised regression loss and (optionally) supervised loss for aggregations.
+        logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`):
+            Prediction scores of the cell selection head, for every token.
+        logits_aggregation (:obj:`torch.FloatTensor`, `optional`, of shape :obj:`(batch_size, num_aggregation_labels)`):
+            Prediction scores of the aggregation head, for every aggregation operator.
+        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_hidden_states=True`` is passed or when ``config.output_hidden_states=True``):
+            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)
+            of shape :obj:`(batch_size, sequence_length, hidden_size)`. Hidden-states of the model at the output of
+            each layer plus the initial embedding outputs.
+        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
+            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape :obj:`(batch_size, num_heads,
+            sequence_length, sequence_length)`. Attention weights after the attention softmax, used to compute the
+            weighted average in the self-attention heads.
+    """
+
+    loss: Optional[torch.FloatTensor] = None
+    logits: torch.FloatTensor = None
+    logits_aggregation: torch.FloatTensor = None
+    hidden_states: Optional[Tuple[torch.FloatTensor]] = None
+    attentions: Optional[Tuple[torch.FloatTensor]] = None
+
+
+def load_tf_weights_in_tapas(model, config, tf_checkpoint_path):
+    """
+    Load tf checkpoints in a PyTorch model. This is an adaptation of load_tf_weights_in_bert:
+
+    - add cell selection and aggregation heads
+    - take into account additional token type embedding layers
+    """
+    try:
+        import re
+
+        import numpy as np
+        import tensorflow as tf
+    except ImportError:
+        logger.error(
+            "Loading a TensorFlow model in PyTorch requires TensorFlow to be installed. Please see "
+            "https://www.tensorflow.org/install/ for installation instructions."
+ ) + raise + tf_path = os.path.abspath(tf_checkpoint_path) + logger.info("Converting TensorFlow checkpoint from {}".format(tf_path)) + # Load weights from TF model + init_vars = tf.train.list_variables(tf_path) + names = [] + arrays = [] + for name, shape in init_vars: + logger.info("Loading TF weight {} with shape {}".format(name, shape)) + array = tf.train.load_variable(tf_path, name) + names.append(name) + arrays.append(array) + + for name, array in zip(names, arrays): + name = name.split("/") + # adam_v and adam_m are variables used in AdamWeightDecayOptimizer to calculate m and v + # which are not required for using pretrained model + if any( + n + in [ + "adam_v", + "adam_m", + "AdamWeightDecayOptimizer", + "AdamWeightDecayOptimizer_1", + "global_step", + "seq_relationship", + ] + for n in name + ): + logger.info("Skipping {}".format("/".join(name))) + continue + # in case the model is TapasForSequenceClassification, we skip output_bias and output_weights + # since these are not used for classification + if isinstance(model, TapasForSequenceClassification): + if any( + n + in [ + "output_bias", + "output_weights", + ] + for n in name + ): + logger.info("Skipping {}".format("/".join(name))) + continue + # in case the model is TapasModel, we skip output_bias, output_weights, output_bias_cls and output_weights_cls + # since this model does not have MLM and NSP heads + if isinstance(model, TapasModel): + if any( + n + in [ + "output_bias", + "output_weights", + "output_bias_cls", + "output_weights_cls", + ] + for n in name + ): + logger.info("Skipping {}".format("/".join(name))) + continue + # if first scope name starts with "bert", change it to "tapas" + if name[0] == "bert": + name[0] = "tapas" + pointer = model + for m_name in name: + if re.fullmatch(r"[A-Za-z]+_\d+", m_name): + scope_names = re.split(r"_(\d+)", m_name) + else: + scope_names = [m_name] + if scope_names[0] == "kernel" or scope_names[0] == "gamma": + pointer = getattr(pointer, "weight") + elif scope_names[0] == "beta": + pointer = getattr(pointer, "bias") + # cell selection heads + elif scope_names[0] == "output_bias": + pointer = getattr(pointer, "output_bias") + elif scope_names[0] == "output_weights": + pointer = getattr(pointer, "output_weights") + elif scope_names[0] == "column_output_bias": + pointer = getattr(pointer, "column_output_bias") + elif scope_names[0] == "column_output_weights": + pointer = getattr(pointer, "column_output_weights") + # aggregation head + elif scope_names[0] == "output_bias_agg": + pointer = getattr(pointer, "aggregation_classifier") + pointer = getattr(pointer, "bias") + elif scope_names[0] == "output_weights_agg": + pointer = getattr(pointer, "aggregation_classifier") + pointer = getattr(pointer, "weight") + # classification head + elif scope_names[0] == "output_bias_cls": + pointer = getattr(pointer, "classifier") + pointer = getattr(pointer, "bias") + elif scope_names[0] == "output_weights_cls": + pointer = getattr(pointer, "classifier") + pointer = getattr(pointer, "weight") + else: + try: + pointer = getattr(pointer, scope_names[0]) + except AttributeError: + logger.info("Skipping {}".format("/".join(name))) + continue + if len(scope_names) >= 2: + num = int(scope_names[1]) + pointer = pointer[num] + if m_name[-11:] == "_embeddings": + pointer = getattr(pointer, "weight") + elif m_name[-13:] in [ + "_embeddings_0", + "_embeddings_1", + "_embeddings_2", + "_embeddings_3", + "_embeddings_4", + "_embeddings_5", + "_embeddings_6", + ]: + pointer = getattr(pointer, "weight") + elif 
m_name == "kernel": + array = np.transpose(array) + try: + assert ( + pointer.shape == array.shape + ), f"Pointer shape {pointer.shape} and array shape {array.shape} mismatched" + except AssertionError as e: + e.args += (pointer.shape, array.shape) + raise + logger.info("Initialize PyTorch weight {}".format(name)) + # added a check to see whether the array is a scalar (because bias terms in Tapas checkpoints can be scalar => should first be converted to numpy arrays) + if np.isscalar(array): + array = np.array(array) + pointer.data = torch.from_numpy(array) + return model + + +class TapasEmbeddings(nn.Module): + """ + Construct the embeddings from word, position and token_type embeddings. Same as BertEmbeddings but with a number of + additional token type embeddings to encode tabular structure. + """ + + def __init__(self, config): + super().__init__() + # we do not include config.disabled_features and config.disable_position_embeddings from the original implementation + # word embeddings + self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, padding_idx=config.pad_token_id) + # position embeddings + self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size) + # token type embeddings + token_type_embedding_name = "token_type_embeddings" + + for i, type_vocab_sizes in enumerate(config.type_vocab_sizes): + name = "%s_%d" % (token_type_embedding_name, i) + setattr(self, name, nn.Embedding(type_vocab_sizes, config.hidden_size)) + + self.number_of_token_type_embeddings = len(config.type_vocab_sizes) + + # self.LayerNorm is not snake-cased to stick with TensorFlow model variable name and be able to load + # any TensorFlow checkpoint file + self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps) + self.dropout = nn.Dropout(config.hidden_dropout_prob) + + self.config = config + + def forward(self, input_ids=None, token_type_ids=None, position_ids=None, inputs_embeds=None): + if input_ids is not None: + input_shape = input_ids.size() + else: + input_shape = inputs_embeds.size()[:-1] + + seq_length = input_shape[1] + device = input_ids.device if input_ids is not None else inputs_embeds.device + + if position_ids is None: + # create absolute position embeddings + position_ids = torch.arange(seq_length, dtype=torch.long, device=device) + position_ids = position_ids.unsqueeze(0).expand(input_shape) + # when self.config.reset_position_index_per_cell is set to True, create relative position embeddings + if self.config.reset_position_index_per_cell: + col_index = IndexMap( + token_type_ids[:, :, 1], self.config.type_vocab_sizes[1], batch_dims=1 + ) # shape (batch_size, seq_len) + row_index = IndexMap( + token_type_ids[:, :, 2], self.config.type_vocab_sizes[2], batch_dims=1 + ) # shape (batch_size, seq_len) + full_index = ProductIndexMap(col_index, row_index) # shape (batch_size, seq_len) + + first_position_per_segment = reduce_min(position_ids, full_index)[ + 0 + ] # shape (max_rows * max_columns,). First absolute position for every cell + first_position = gather( + first_position_per_segment, full_index + ) # ? shape (batch_size, seq_len). 
First absolute position of the cell for every token
+                position = torch.arange(seq_length, dtype=torch.long, device=device).unsqueeze(0)  # shape (1, seq_len)
+                position_ids = torch.min(
+                    torch.as_tensor(self.config.max_position_embeddings - 1, device=device), position - first_position
+                )
+
+        if token_type_ids is None:
+            token_type_ids = torch.zeros(
+                (*input_shape, self.number_of_token_type_embeddings), dtype=torch.long, device=device
+            )
+
+        if inputs_embeds is None:
+            inputs_embeds = self.word_embeddings(input_ids)
+
+        position_embeddings = self.position_embeddings(position_ids)
+
+        embeddings = inputs_embeds + position_embeddings
+
+        token_type_embedding_name = "token_type_embeddings"
+
+        for i in range(self.number_of_token_type_embeddings):
+            name = f"{token_type_embedding_name}_{i}"
+            embeddings += getattr(self, name)(token_type_ids[:, :, i])
+
+        embeddings = self.LayerNorm(embeddings)
+        embeddings = self.dropout(embeddings)
+        return embeddings
+
+
+# Copied from transformers.modeling_bert.BertSelfAttention with Bert->Tapas
+class TapasSelfAttention(nn.Module):
+    def __init__(self, config):
+        super().__init__()
+        if config.hidden_size % config.num_attention_heads != 0 and not hasattr(config, "embedding_size"):
+            raise ValueError(
+                "The hidden size (%d) is not a multiple of the number of attention "
+                "heads (%d)" % (config.hidden_size, config.num_attention_heads)
+            )
+
+        self.num_attention_heads = config.num_attention_heads
+        self.attention_head_size = int(config.hidden_size / config.num_attention_heads)
+        self.all_head_size = self.num_attention_heads * self.attention_head_size
+
+        self.query = nn.Linear(config.hidden_size, self.all_head_size)
+        self.key = nn.Linear(config.hidden_size, self.all_head_size)
+        self.value = nn.Linear(config.hidden_size, self.all_head_size)
+
+        self.dropout = nn.Dropout(config.attention_probs_dropout_prob)
+
+    def transpose_for_scores(self, x):
+        new_x_shape = x.size()[:-1] + (self.num_attention_heads, self.attention_head_size)
+        x = x.view(*new_x_shape)
+        return x.permute(0, 2, 1, 3)
+
+    def forward(
+        self,
+        hidden_states,
+        attention_mask=None,
+        head_mask=None,
+        encoder_hidden_states=None,
+        encoder_attention_mask=None,
+        output_attentions=False,
+    ):
+        mixed_query_layer = self.query(hidden_states)
+
+        # If this is instantiated as a cross-attention module, the keys
+        # and values come from an encoder; the attention mask needs to be
+        # such that the encoder's padding tokens are not attended to.
+        if encoder_hidden_states is not None:
+            mixed_key_layer = self.key(encoder_hidden_states)
+            mixed_value_layer = self.value(encoder_hidden_states)
+            attention_mask = encoder_attention_mask
+        else:
+            mixed_key_layer = self.key(hidden_states)
+            mixed_value_layer = self.value(hidden_states)
+
+        query_layer = self.transpose_for_scores(mixed_query_layer)
+        key_layer = self.transpose_for_scores(mixed_key_layer)
+        value_layer = self.transpose_for_scores(mixed_value_layer)
+
+        # Take the dot product between "query" and "key" to get the raw attention scores.
+        attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
+        attention_scores = attention_scores / math.sqrt(self.attention_head_size)
+        if attention_mask is not None:
+            # Apply the attention mask (precomputed for all layers in TapasModel's forward() function)
+            attention_scores = attention_scores + attention_mask
+
+        # Normalize the attention scores to probabilities.
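+        # (Where a mask was provided, disallowed positions received a large negative additive bias
+        # above, so their post-softmax attention weight is effectively zero.)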
+ attention_probs = nn.Softmax(dim=-1)(attention_scores) + + # This is actually dropping out entire tokens to attend to, which might + # seem a bit unusual, but is taken from the original Transformer paper. + attention_probs = self.dropout(attention_probs) + + # Mask heads if we want to + if head_mask is not None: + attention_probs = attention_probs * head_mask + + context_layer = torch.matmul(attention_probs, value_layer) + + context_layer = context_layer.permute(0, 2, 1, 3).contiguous() + new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,) + context_layer = context_layer.view(*new_context_layer_shape) + + outputs = (context_layer, attention_probs) if output_attentions else (context_layer,) + return outputs + + +# Copied from transformers.modeling_bert.BertSelfOutput +class TapasSelfOutput(nn.Module): + def __init__(self, config): + super().__init__() + self.dense = nn.Linear(config.hidden_size, config.hidden_size) + self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps) + self.dropout = nn.Dropout(config.hidden_dropout_prob) + + def forward(self, hidden_states, input_tensor): + hidden_states = self.dense(hidden_states) + hidden_states = self.dropout(hidden_states) + hidden_states = self.LayerNorm(hidden_states + input_tensor) + return hidden_states + + +# Copied from transformers.modeling_bert.BertAttention with Bert->Tapas +class TapasAttention(nn.Module): + def __init__(self, config): + super().__init__() + self.self = TapasSelfAttention(config) + self.output = TapasSelfOutput(config) + self.pruned_heads = set() + + def prune_heads(self, heads): + if len(heads) == 0: + return + heads, index = find_pruneable_heads_and_indices( + heads, self.self.num_attention_heads, self.self.attention_head_size, self.pruned_heads + ) + + # Prune linear layers + self.self.query = prune_linear_layer(self.self.query, index) + self.self.key = prune_linear_layer(self.self.key, index) + self.self.value = prune_linear_layer(self.self.value, index) + self.output.dense = prune_linear_layer(self.output.dense, index, dim=1) + + # Update hyper params and store pruned heads + self.self.num_attention_heads = self.self.num_attention_heads - len(heads) + self.self.all_head_size = self.self.attention_head_size * self.self.num_attention_heads + self.pruned_heads = self.pruned_heads.union(heads) + + def forward( + self, + hidden_states, + attention_mask=None, + head_mask=None, + encoder_hidden_states=None, + encoder_attention_mask=None, + output_attentions=False, + ): + self_outputs = self.self( + hidden_states, + attention_mask, + head_mask, + encoder_hidden_states, + encoder_attention_mask, + output_attentions, + ) + attention_output = self.output(self_outputs[0], hidden_states) + outputs = (attention_output,) + self_outputs[1:] # add attentions if we output them + return outputs + + +# Copied from transformers.modeling_bert.BertIntermediate +class TapasIntermediate(nn.Module): + def __init__(self, config): + super().__init__() + self.dense = nn.Linear(config.hidden_size, config.intermediate_size) + if isinstance(config.hidden_act, str): + self.intermediate_act_fn = ACT2FN[config.hidden_act] + else: + self.intermediate_act_fn = config.hidden_act + + def forward(self, hidden_states): + hidden_states = self.dense(hidden_states) + hidden_states = self.intermediate_act_fn(hidden_states) + return hidden_states + + +# Copied from transformers.modeling_bert.BertOutput +class TapasOutput(nn.Module): + def __init__(self, config): + super().__init__() + self.dense = 
nn.Linear(config.intermediate_size, config.hidden_size) + self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps) + self.dropout = nn.Dropout(config.hidden_dropout_prob) + + def forward(self, hidden_states, input_tensor): + hidden_states = self.dense(hidden_states) + hidden_states = self.dropout(hidden_states) + hidden_states = self.LayerNorm(hidden_states + input_tensor) + return hidden_states + + +# Copied from transformers.modeling_bert.BertLayer with Bert->Tapas +class TapasLayer(nn.Module): + def __init__(self, config): + super().__init__() + self.chunk_size_feed_forward = config.chunk_size_feed_forward + self.seq_len_dim = 1 + self.attention = TapasAttention(config) + self.is_decoder = config.is_decoder + self.add_cross_attention = config.add_cross_attention + if self.add_cross_attention: + assert self.is_decoder, f"{self} should be used as a decoder model if cross attention is added" + self.crossattention = TapasAttention(config) + self.intermediate = TapasIntermediate(config) + self.output = TapasOutput(config) + + def forward( + self, + hidden_states, + attention_mask=None, + head_mask=None, + encoder_hidden_states=None, + encoder_attention_mask=None, + output_attentions=False, + ): + self_attention_outputs = self.attention( + hidden_states, + attention_mask, + head_mask, + output_attentions=output_attentions, + ) + attention_output = self_attention_outputs[0] + outputs = self_attention_outputs[1:] # add self attentions if we output attention weights + + if self.is_decoder and encoder_hidden_states is not None: + assert hasattr( + self, "crossattention" + ), f"If `encoder_hidden_states` are passed, {self} has to be instantiated with cross-attention layers by setting `config.add_cross_attention=True`" + cross_attention_outputs = self.crossattention( + attention_output, + attention_mask, + head_mask, + encoder_hidden_states, + encoder_attention_mask, + output_attentions, + ) + attention_output = cross_attention_outputs[0] + outputs = outputs + cross_attention_outputs[1:] # add cross attentions if we output attention weights + + layer_output = apply_chunking_to_forward( + self.feed_forward_chunk, self.chunk_size_feed_forward, self.seq_len_dim, attention_output + ) + outputs = (layer_output,) + outputs + return outputs + + def feed_forward_chunk(self, attention_output): + intermediate_output = self.intermediate(attention_output) + layer_output = self.output(intermediate_output, attention_output) + return layer_output + + +# Copied from transformers.modeling_bert.BertEncoder with Bert->Tapas +class TapasEncoder(nn.Module): + def __init__(self, config): + super().__init__() + self.config = config + self.layer = nn.ModuleList([TapasLayer(config) for _ in range(config.num_hidden_layers)]) + + def forward( + self, + hidden_states, + attention_mask=None, + head_mask=None, + encoder_hidden_states=None, + encoder_attention_mask=None, + output_attentions=False, + output_hidden_states=False, + return_dict=True, + ): + all_hidden_states = () if output_hidden_states else None + all_attentions = () if output_attentions else None + for i, layer_module in enumerate(self.layer): + if output_hidden_states: + all_hidden_states = all_hidden_states + (hidden_states,) + + layer_head_mask = head_mask[i] if head_mask is not None else None + + if getattr(self.config, "gradient_checkpointing", False): + + def create_custom_forward(module): + def custom_forward(*inputs): + return module(*inputs, output_attentions) + + return custom_forward + + layer_outputs = 
torch.utils.checkpoint.checkpoint( + create_custom_forward(layer_module), + hidden_states, + attention_mask, + layer_head_mask, + encoder_hidden_states, + encoder_attention_mask, + ) + else: + layer_outputs = layer_module( + hidden_states, + attention_mask, + layer_head_mask, + encoder_hidden_states, + encoder_attention_mask, + output_attentions, + ) + hidden_states = layer_outputs[0] + if output_attentions: + all_attentions = all_attentions + (layer_outputs[1],) + + if output_hidden_states: + all_hidden_states = all_hidden_states + (hidden_states,) + + if not return_dict: + return tuple(v for v in [hidden_states, all_hidden_states, all_attentions] if v is not None) + return BaseModelOutput( + last_hidden_state=hidden_states, hidden_states=all_hidden_states, attentions=all_attentions + ) + + +# Copied from transformers.modeling_bert.BertPooler +class TapasPooler(nn.Module): + def __init__(self, config): + super().__init__() + self.dense = nn.Linear(config.hidden_size, config.hidden_size) + self.activation = nn.Tanh() + + def forward(self, hidden_states): + # We "pool" the model by simply taking the hidden state corresponding + # to the first token. + first_token_tensor = hidden_states[:, 0] + pooled_output = self.dense(first_token_tensor) + pooled_output = self.activation(pooled_output) + return pooled_output + + +class TapasPreTrainedModel(PreTrainedModel): + """ + An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained + models. + """ + + config_class = TapasConfig + base_model_prefix = "tapas" + + # Copied from transformers.modeling_bert.BertPreTrainedModel._init_weights + def _init_weights(self, module): + """ Initialize the weights """ + if isinstance(module, (nn.Linear, nn.Embedding)): + # Slightly different from the TF version which uses truncated_normal for initialization + # cf https://github.com/pytorch/pytorch/pull/5617 + module.weight.data.normal_(mean=0.0, std=self.config.initializer_range) + elif isinstance(module, nn.LayerNorm): + module.bias.data.zero_() + module.weight.data.fill_(1.0) + if isinstance(module, nn.Linear) and module.bias is not None: + module.bias.data.zero_() + + +TAPAS_START_DOCSTRING = r""" + This model inherits from :class:`~transformers.PreTrainedModel`. Check the superclass documentation for the generic + methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, + pruning heads etc.) + + This model is also a PyTorch `torch.nn.Module `__ + subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to + general usage and behavior. + + Parameters: + config (:class:`~transformers.TapasConfig`): Model configuration class with all the parameters of the model. + Initializing with a config file does not load the weights associated with the model, only the + configuration. Check out the :meth:`~transformers.PreTrainedModel.from_pretrained` method to load the model + weights. +""" + +TAPAS_INPUTS_DOCSTRING = r""" + Args: + input_ids (:obj:`torch.LongTensor` of shape :obj:`({0})`): + Indices of input sequence tokens in the vocabulary. + Indices can be obtained using :class:`~transformers.TapasTokenizer`. See + :meth:`transformers.PreTrainedTokenizer.encode` and :meth:`transformers.PreTrainedTokenizer.__call__` for + details. + + `What are input IDs? 
<../glossary.html#input-ids>`__ + attention_mask (:obj:`torch.FloatTensor` of shape :obj:`({0})`, `optional`): + Mask to avoid performing attention on padding token indices. Mask values selected in ``[0, 1]``: + + - 1 for tokens that are **not masked**, + - 0 for tokens that are **masked**. + + `What are attention masks? <../glossary.html#attention-mask>`__ + token_type_ids (:obj:`torch.LongTensor` of shape :obj:`({0}, 7)`, `optional`): + Token indices that encode tabular structure. Indices can be obtained using :class:`~transformers.TapasTokenizer`. + See this class for more info. + + `What are token type IDs? <../glossary.html#token-type-ids>`_ + position_ids (:obj:`torch.LongTensor` of shape :obj:`({0})`, `optional`): + Indices of positions of each input sequence tokens in the position embeddings. If ``reset_position_index_per_cell`` + of :class:`~transformers.TapasConfig` is set to ``True``, relative position embeddings will be used. Selected in the + range ``[0, config.max_position_embeddings - 1]``. + + `What are position IDs? <../glossary.html#position-ids>`_ + head_mask (:obj:`torch.FloatTensor` of shape :obj:`(num_heads,)` or :obj:`(num_layers, num_heads)`, `optional`): + Mask to nullify selected heads of the self-attention modules. Mask values selected in ``[0, 1]``: + - 1 indicates the head is **not masked**, + - 0 indicates the head is **masked**. + inputs_embeds (:obj:`torch.FloatTensor` of shape :obj:`({0}, hidden_size)`, `optional`): + Optionally, instead of passing :obj:`input_ids` you can choose to directly pass an embedded representation. + This is useful if you want more control over how to convert :obj:`input_ids` indices into associated + vectors than the model's internal embedding lookup matrix. + output_attentions (:obj:`bool`, `optional`): + Whether or not to return the attentions tensors of all attention layers. See ``attentions`` under returned + tensors for more detail. + output_hidden_states (:obj:`bool`, `optional`): + Whether or not to return the hidden states of all layers. See ``hidden_states`` under returned tensors for + more detail. + return_dict (:obj:`bool`, `optional`): + Whether or not to return a :class:`~transformers.file_utils.ModelOutput` instead of a plain tuple. +""" + + +@add_start_docstrings( + "The bare Tapas Model transformer outputting raw hidden-states without any specific head on top.", + TAPAS_START_DOCSTRING, +) +class TapasModel(TapasPreTrainedModel): + """ + This class is a small change compared to :class:`~transformers.BertModel`, taking into account the additional token + type ids. + + The model can behave as an encoder (with only self-attention) as well as a decoder, in which case a layer of + cross-attention is added between the self-attention layers, following the architecture described in `Attention is + all you need `__ by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, + Llion Jones, Aidan N. Gomez, Lukasz Kaiser and Illia Polosukhin. + + """ + + config_class = TapasConfig + base_model_prefix = "tapas" + + def __init__(self, config): + requires_scatter(self) + super().__init__(config) + self.config = config + + self.embeddings = TapasEmbeddings(config) + self.encoder = TapasEncoder(config) + self.pooler = TapasPooler(config) + + self.init_weights() + + def get_input_embeddings(self): + return self.embeddings.word_embeddings + + def set_input_embeddings(self, value): + self.embeddings.word_embeddings = value + + def _prune_heads(self, heads_to_prune): + """ + Prunes heads of the model. 
heads_to_prune: dict of {layer_num: list of heads to prune in this layer} See base + class PreTrainedModel + """ + for layer, heads in heads_to_prune.items(): + self.encoder.layer[layer].attention.prune_heads(heads) + + @add_start_docstrings_to_model_forward(TAPAS_INPUTS_DOCSTRING.format("batch_size, sequence_length")) + @replace_return_docstrings(output_type=BaseModelOutputWithPooling, config_class=_CONFIG_FOR_DOC) + def forward( + self, + input_ids=None, + attention_mask=None, + token_type_ids=None, + position_ids=None, + head_mask=None, + inputs_embeds=None, + encoder_hidden_states=None, + encoder_attention_mask=None, + output_attentions=None, + output_hidden_states=None, + return_dict=None, + ): + r""" + Returns: + + Examples:: + + >>> from transformers import TapasTokenizer, TapasModel + >>> import pandas as pd + + >>> tokenizer = TapasTokenizer.from_pretrained('google/tapas-base-uncased') + >>> model = TapasModel.from_pretrained('google/tapas-base-uncased') + + >>> data = {'Actors': ["Brad Pitt", "Leonardo Di Caprio", "George Clooney"], + ... 'Age': ["56", "45", "59"], + ... 'Number of movies': ["87", "53", "69"] + ... } + >>> table = pd.DataFrame.from_dict(data) + >>> queries = ["How many movies has George Clooney played in?", "How old is Brad Pitt?"] + + >>> inputs = tokenizer(table=table, queries=queries, padding="max_length", return_tensors="pt") + >>> outputs = model(**inputs) + + >>> last_hidden_states = outputs.last_hidden_state + """ + output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions + output_hidden_states = ( + output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states + ) + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + if input_ids is not None and inputs_embeds is not None: + raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time") + elif input_ids is not None: + input_shape = input_ids.size() + elif inputs_embeds is not None: + input_shape = inputs_embeds.size()[:-1] + else: + raise ValueError("You have to specify either input_ids or inputs_embeds") + + device = input_ids.device if input_ids is not None else inputs_embeds.device + + if attention_mask is None: + attention_mask = torch.ones(input_shape, device=device) + if token_type_ids is None: + token_type_ids = torch.zeros( + (*input_shape, len(self.config.type_vocab_sizes)), dtype=torch.long, device=device + ) + + # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length] + # ourselves in which case we just need to make it broadcastable to all heads. 
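+        # get_extended_attention_mask (inherited from PreTrainedModel) reshapes the 2D mask to
+        # [batch_size, 1, 1, seq_length] and maps 1 -> 0.0 and 0 -> a large negative value, so that
+        # adding it to the raw attention scores suppresses the masked positions.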
+        extended_attention_mask: torch.Tensor = self.get_extended_attention_mask(attention_mask, input_shape, device)
+
+        # If a 2D or 3D attention mask is provided for the cross-attention
+        # we need to make it broadcastable to [batch_size, num_heads, seq_length, seq_length]
+        if self.config.is_decoder and encoder_hidden_states is not None:
+            encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()
+            encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)
+            if encoder_attention_mask is None:
+                encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device)
+            encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)
+        else:
+            encoder_extended_attention_mask = None
+
+        # Prepare head mask if needed
+        # 1.0 in head_mask indicates we keep the head
+        # attention_probs has shape bsz x n_heads x N x N
+        # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]
+        # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]
+        head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)
+
+        embedding_output = self.embeddings(
+            input_ids=input_ids, position_ids=position_ids, token_type_ids=token_type_ids, inputs_embeds=inputs_embeds
+        )
+        encoder_outputs = self.encoder(
+            embedding_output,
+            attention_mask=extended_attention_mask,
+            head_mask=head_mask,
+            encoder_hidden_states=encoder_hidden_states,
+            encoder_attention_mask=encoder_extended_attention_mask,
+            output_attentions=output_attentions,
+            output_hidden_states=output_hidden_states,
+            return_dict=return_dict,
+        )
+        sequence_output = encoder_outputs[0]
+        pooled_output = self.pooler(sequence_output)
+
+        if not return_dict:
+            return (sequence_output, pooled_output) + encoder_outputs[1:]
+
+        return BaseModelOutputWithPooling(
+            last_hidden_state=sequence_output,
+            pooler_output=pooled_output,
+            hidden_states=encoder_outputs.hidden_states,
+            attentions=encoder_outputs.attentions,
+        )
+
+
+@add_start_docstrings("""Tapas Model with a `language modeling` head on top. """, TAPAS_START_DOCSTRING)
+class TapasForMaskedLM(TapasPreTrainedModel):
+    config_class = TapasConfig
+    base_model_prefix = "tapas"
+
+    def __init__(self, config):
+        super().__init__(config)
+
+        self.tapas = TapasModel(config)
+        self.lm_head = nn.Linear(config.hidden_size, config.vocab_size)
+
+        self.init_weights()
+
+    def get_output_embeddings(self):
+        return self.lm_head
+
+    @add_start_docstrings_to_model_forward(TAPAS_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
+    @replace_return_docstrings(output_type=MaskedLMOutput, config_class=_CONFIG_FOR_DOC)
+    def forward(
+        self,
+        input_ids=None,
+        attention_mask=None,
+        token_type_ids=None,
+        position_ids=None,
+        head_mask=None,
+        inputs_embeds=None,
+        encoder_hidden_states=None,
+        encoder_attention_mask=None,
+        labels=None,
+        output_attentions=None,
+        output_hidden_states=None,
+        return_dict=None,
+        **kwargs
+    ):
+        r"""
+        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
+            Labels for computing the masked language modeling loss.
Indices should be in ``[-100, 0, ...,
+            config.vocab_size]`` (see ``input_ids`` docstring). Tokens with indices set to ``-100`` are ignored
+            (masked); the loss is only computed for the tokens with labels in ``[0, ..., config.vocab_size]``.
+
+        Returns:
+
+        Examples::
+
+            >>> from transformers import TapasTokenizer, TapasForMaskedLM
+            >>> import pandas as pd
+
+            >>> tokenizer = TapasTokenizer.from_pretrained('google/tapas-base-uncased')
+            >>> model = TapasForMaskedLM.from_pretrained('google/tapas-base-uncased')
+
+            >>> data = {'Actors': ["Brad Pitt", "Leonardo Di Caprio", "George Clooney"],
+            ...         'Age': ["56", "45", "59"],
+            ...         'Number of movies': ["87", "53", "69"]
+            ... }
+            >>> table = pd.DataFrame.from_dict(data)
+
+            >>> inputs = tokenizer(table=table, queries="How many [MASK] has George [MASK] played in?", return_tensors="pt")
+            >>> labels = tokenizer(table=table, queries="How many movies has George Clooney played in?", return_tensors="pt")["input_ids"]
+
+            >>> outputs = model(**inputs, labels=labels)
+            >>> loss = outputs.loss
+            >>> logits = outputs.logits
+        """
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+        outputs = self.tapas(
+            input_ids,
+            attention_mask=attention_mask,
+            token_type_ids=token_type_ids,
+            position_ids=position_ids,
+            head_mask=head_mask,
+            inputs_embeds=inputs_embeds,
+            encoder_hidden_states=encoder_hidden_states,
+            encoder_attention_mask=encoder_attention_mask,
+            output_attentions=output_attentions,
+            output_hidden_states=output_hidden_states,
+            return_dict=return_dict,
+        )
+
+        sequence_output = outputs[0]
+        prediction_scores = self.lm_head(sequence_output)
+
+        masked_lm_loss = None
+        if labels is not None:
+            loss_fct = CrossEntropyLoss()  # -100 index = padding token
+            masked_lm_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size), labels.view(-1))
+
+        if not return_dict:
+            output = (prediction_scores,) + outputs[2:]
+            return ((masked_lm_loss,) + output) if masked_lm_loss is not None else output
+
+        return MaskedLMOutput(
+            loss=masked_lm_loss,
+            logits=prediction_scores,
+            hidden_states=outputs.hidden_states,
+            attentions=outputs.attentions,
+        )
+
+
+# Copied from transformers.modeling_roberta.RobertaLMHead with Roberta->Tapas
+class TapasLMHead(nn.Module):
+    """Tapas Head for masked language modeling."""
+
+    def __init__(self, config):
+        super().__init__()
+        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
+        self.layer_norm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
+
+        self.decoder = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
+        self.bias = nn.Parameter(torch.zeros(config.vocab_size))
+
+        # Need a link between the two variables so that the bias is correctly resized with `resize_token_embeddings`
+        self.decoder.bias = self.bias
+
+    def forward(self, features, **kwargs):
+        x = self.dense(features)
+        # `gelu` is not imported at module level, so go through the ACT2FN mapping instead
+        x = ACT2FN["gelu"](x)
+        x = self.layer_norm(x)
+
+        # project back to size of vocabulary with bias
+        x = self.decoder(x)
+
+        return x
+
+
+@add_start_docstrings(
+    """
+    Tapas Model with a cell selection head and optionally an aggregation head on top for question-answering tasks on
+    tables (linear layers on top of the hidden-states output to compute `logits` and optionally `logits_aggregation`),
+    e.g. for SQA, WTQ or WikiSQL tasks.
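+
+    Whether an aggregation head is added is controlled by :obj:`config.num_aggregation_labels`; when it is 0, only
+    the cell selection head is used.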
+ """, + TAPAS_START_DOCSTRING, +) +class TapasForQuestionAnswering(TapasPreTrainedModel): + def __init__(self, config): + super().__init__(config) + + # base model + self.tapas = TapasModel(config) + + # dropout (only used when training) + self.dropout = nn.Dropout(config.hidden_dropout_prob) + + # cell selection heads + if config.init_cell_selection_weights_to_zero: + # init_cell_selection_weights_to_zero: Whether the initial weights should be + # set to 0. This ensures that all tokens have the same prior probability. + self.output_weights = nn.Parameter(torch.zeros(config.hidden_size)) + self.column_output_weights = nn.Parameter(torch.zeros(config.hidden_size)) + else: + self.output_weights = nn.Parameter(torch.empty(config.hidden_size)) + nn.init.normal_( + self.output_weights, std=0.02 + ) # here, a truncated normal is used in the original implementation + self.column_output_weights = nn.Parameter(torch.empty(config.hidden_size)) + nn.init.normal_( + self.column_output_weights, std=0.02 + ) # here, a truncated normal is used in the original implementation + self.output_bias = nn.Parameter(torch.zeros([])) + self.column_output_bias = nn.Parameter(torch.zeros([])) + + # aggregation head + if config.num_aggregation_labels > 0: + self.aggregation_classifier = nn.Linear(config.hidden_size, config.num_aggregation_labels) + + self.init_weights() + + @add_start_docstrings_to_model_forward(TAPAS_INPUTS_DOCSTRING.format("batch_size, sequence_length")) + @replace_return_docstrings(output_type=TableQuestionAnsweringOutput, config_class=_CONFIG_FOR_DOC) + def forward( + self, + input_ids=None, + attention_mask=None, + token_type_ids=None, + position_ids=None, + head_mask=None, + inputs_embeds=None, + table_mask=None, + label_ids=None, + aggregation_labels=None, + float_answer=None, + numeric_values=None, + numeric_values_scale=None, + output_attentions=None, + output_hidden_states=None, + return_dict=None, + ): + r""" + table_mask (:obj:`torch.LongTensor` of shape :obj:`(batch_size, seq_length)`, `optional`): + Mask for the table. Indicates which tokens belong to the table (1). Question tokens, table headers and + padding are 0. + label_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, seq_length)`, `optional`): + Labels per token for computing the hierarchical cell selection loss. This encodes the positions of the + answer appearing in the table. Can be obtained using :class:`~transformers.TapasTokenizer`. + + - 1 for tokens that are **part of the answer**, + - 0 for tokens that are **not part of the answer**. + + aggregation_labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, )`, `optional`): + Aggregation function index for every example in the batch for computing the aggregation loss. Indices + should be in :obj:`[0, ..., config.num_aggregation_labels - 1]`. Only required in case of strong + supervision for aggregation (WikiSQL-SUPERVISED). + float_answer (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, )`, `optional`): + Float answer for every example in the batch. Set to `float('nan')` for cell selection questions. + Only required in case of weak supervision (WTQ, WikiSQL) to calculate the aggregate mask and regression loss. + numeric_values (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, seq_length)`, `optional`): + Numeric values of every token, NaN for tokens which are not numeric values. Can be obtained using + :class:`~transformers.TapasTokenizer`. Only required in case of weak supervision for aggregation (WTQ, + WikiSQL) to calculate the regression loss. 
+ numeric_values_scale (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, seq_length)`, `optional`): + Scale of the numeric values of every token. Can be obtained using :class:`~transformers.TapasTokenizer`. + Only required in case of weak supervision for aggregation (WTQ, WikiSQL) to calculate the regression loss. + + Returns: + + Examples:: + + >>> from transformers import TapasTokenizer, TapasForQuestionAnswering + >>> import pandas as pd + + >>> tokenizer = TapasTokenizer.from_pretrained('google/tapas-base-uncased-finetuned-wtq') + >>> model = TapasForQuestionAnswering.from_pretrained('google/tapas-base-uncased-finetuned-wtq') + + >>> data = {'Actors': ["Brad Pitt", "Leonardo Di Caprio", "George Clooney"], + ... 'Age': ["56", "45", "59"], + ... 'Number of movies': ["87", "53", "69"] + ... } + >>> table = pd.DataFrame.from_dict(data) + >>> queries = ["How many movies has George Clooney played in?", "How old is Brad Pitt?"] + + >>> inputs = tokenizer(table=table, queries=queries, padding="max_length", return_tensors="pt") + >>> outputs = model(**inputs) + + >>> logits = outputs.logits + >>> logits_aggregation = outputs.logits_aggregation + """ + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + outputs = self.tapas( + input_ids, + attention_mask=attention_mask, + token_type_ids=token_type_ids, + position_ids=position_ids, + head_mask=head_mask, + inputs_embeds=inputs_embeds, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + + sequence_output = outputs[0] + pooled_output = outputs[1] + + sequence_output = self.dropout(sequence_output) + + if input_ids is not None: + input_shape = input_ids.size() + else: + input_shape = inputs_embeds.size()[:-1] + + device = input_ids.device if input_ids is not None else inputs_embeds.device + + # Construct indices for the table. + if token_type_ids is None: + token_type_ids = torch.zeros( + (*input_shape, len(self.config.type_vocab_sizes)), dtype=torch.long, device=device + ) + + token_types = [ + "segment_ids", + "column_ids", + "row_ids", + "prev_label_ids", + "column_ranks", + "inv_column_ranks", + "numeric_relations", + ] + + row_ids = token_type_ids[:, :, token_types.index("row_ids")] + column_ids = token_type_ids[:, :, token_types.index("column_ids")] + + row_index = IndexMap( + indices=torch.min(row_ids, torch.as_tensor(self.config.max_num_rows - 1, device=row_ids.device)), + num_segments=self.config.max_num_rows, + batch_dims=1, + ) + col_index = IndexMap( + indices=torch.min(column_ids, torch.as_tensor(self.config.max_num_columns - 1, device=column_ids.device)), + num_segments=self.config.max_num_columns, + batch_dims=1, + ) + cell_index = ProductIndexMap(row_index, col_index) + + # Masks. + input_shape = input_ids.size() if input_ids is not None else inputs_embeds.size()[:-1] + device = input_ids.device if input_ids is not None else inputs_embeds.device + if attention_mask is None: + attention_mask = torch.ones(input_shape, device=device) + # Table cells only, without question tokens and table headers. + if table_mask is None: + table_mask = torch.where(row_ids > 0, torch.ones_like(row_ids), torch.zeros_like(row_ids)) + # torch.FloatTensor[batch_size, seq_length] + input_mask_float = attention_mask.float().to(device) + table_mask_float = table_mask.float().to(device) + # Mask for cells that exist in the table (i.e. that are not padding). + cell_mask, _ = reduce_mean(input_mask_float, cell_index) + + # Compute logits per token. 
These are used to select individual cells. + logits = compute_token_logits(sequence_output, self.config.temperature, self.output_weights, self.output_bias) + + # Compute logits per column. These are used to select a column. + column_logits = None + if self.config.select_one_column: + column_logits = compute_column_logits( + sequence_output, + self.column_output_weights, + self.column_output_bias, + cell_index, + cell_mask, + self.config.allow_empty_column_selection, + ) + + ########## Aggregation logits ############## + logits_aggregation = None + if self.config.num_aggregation_labels > 0: + logits_aggregation = self.aggregation_classifier(pooled_output) + + # Total loss calculation + total_loss = 0.0 + calculate_loss = False + if label_ids is not None: + calculate_loss = True + is_supervised = not self.config.num_aggregation_labels > 0 or not self.config.use_answer_as_supervision + + ### Semi-supervised cell selection in case of no aggregation + ############################################################# + + # If the answer (the denotation) appears directly in the table we might + # select the answer without applying any aggregation function. There are + # some ambiguous cases, see utils._calculate_aggregate_mask for more info. + # `aggregate_mask` is 1 for examples where we chose to aggregate and 0 + # for examples where we chose to select the answer directly. + # `label_ids` encodes the positions of the answer appearing in the table. + if is_supervised: + aggregate_mask = None + else: + if float_answer is not None: + assert label_ids.shape[0] == float_answer.shape[0], "Make sure the answers are a FloatTensor of shape (batch_size,)" + # [batch_size] + aggregate_mask = _calculate_aggregate_mask( + float_answer, + pooled_output, + self.config.cell_selection_preference, + label_ids, + self.aggregation_classifier, + ) + else: + raise ValueError("You have to specify float answers in order to calculate the aggregate mask") + + ### Cell selection log-likelihood + ################################# + + if self.config.average_logits_per_cell: + logits_per_cell, _ = reduce_mean(logits, cell_index) + logits = gather(logits_per_cell, cell_index) + dist_per_token = torch.distributions.Bernoulli(logits=logits) + + # Compute cell selection loss per example. 
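+            # Both branches below yield one scalar per example: either a weighted Bernoulli negative
+            # log-likelihood over all tokens (answer tokens weighted by config.positive_label_weight,
+            # averaged over the non-padding tokens), or, when select_one_column is set, a hierarchical
+            # loss that first scores columns and then cells within the selected column.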
+            selection_loss_per_example = None
+            if not self.config.select_one_column:
+                weight = torch.where(
+                    label_ids == 0,
+                    torch.ones_like(label_ids, dtype=torch.float32),
+                    self.config.positive_label_weight * torch.ones_like(label_ids, dtype=torch.float32),
+                )
+                selection_loss_per_token = -dist_per_token.log_prob(label_ids) * weight
+                selection_loss_per_example = torch.sum(selection_loss_per_token * input_mask_float, dim=1) / (
+                    torch.sum(input_mask_float, dim=1) + EPSILON_ZERO_DIVISION
+                )
+            else:
+                selection_loss_per_example, logits = _single_column_cell_selection_loss(
+                    logits, column_logits, label_ids, cell_index, col_index, cell_mask
+                )
+                dist_per_token = torch.distributions.Bernoulli(logits=logits)
+
+            ### Supervised cell selection
+            #############################
+            if self.config.disable_per_token_loss:
+                pass
+            elif is_supervised:
+                total_loss += torch.mean(selection_loss_per_example)
+            else:
+                # In the weakly supervised case, only assign the cell selection loss to examples
+                # where no aggregation was chosen
+                total_loss += torch.mean(selection_loss_per_example * (1.0 - aggregate_mask))
+
+            ### Semi-supervised regression loss and supervised loss for aggregations
+            #########################################################################
+            if self.config.num_aggregation_labels > 0:
+                if is_supervised:
+                    # Note that `aggregate_mask` is None if the setting is supervised.
+                    if aggregation_labels is not None:
+                        assert (
+                            label_ids.shape[0] == aggregation_labels.shape[0]
+                        ), "Make sure the aggregation labels are a LongTensor of shape (batch_size,)"
+                        per_example_additional_loss = _calculate_aggregation_loss(
+                            logits_aggregation,
+                            aggregate_mask,
+                            aggregation_labels,
+                            self.config.use_answer_as_supervision,
+                            self.config.num_aggregation_labels,
+                            self.config.aggregation_loss_weight,
+                        )
+                    else:
+                        raise ValueError(
+                            "You have to specify aggregation labels in order to calculate the aggregation loss"
+                        )
+                else:
+                    # Set aggregation labels to zeros
+                    aggregation_labels = torch.zeros(label_ids.shape[0], dtype=torch.long, device=label_ids.device)
+                    per_example_additional_loss = _calculate_aggregation_loss(
+                        logits_aggregation,
+                        aggregate_mask,
+                        aggregation_labels,
+                        self.config.use_answer_as_supervision,
+                        self.config.num_aggregation_labels,
+                        self.config.aggregation_loss_weight,
+                    )
+
+                if self.config.use_answer_as_supervision:
+                    if numeric_values is not None and numeric_values_scale is not None:
+                        assert numeric_values.shape == numeric_values_scale.shape
+                        # Add regression loss for numeric answers which require aggregation.
+                        answer_loss, large_answer_loss_mask = _calculate_regression_loss(
+                            float_answer,
+                            aggregate_mask,
+                            dist_per_token,
+                            numeric_values,
+                            numeric_values_scale,
+                            table_mask_float,
+                            logits_aggregation,
+                            self.config,
+                        )
+                        per_example_additional_loss += answer_loss
+                        # Zero loss for examples with answer_loss > cutoff.
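+                        # large_answer_loss_mask is 1.0 for examples that are kept and 0.0 for examples whose
+                        # answer_loss exceeded the cutoff, so this multiplication removes their contribution.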
+ per_example_additional_loss *= large_answer_loss_mask + else: + raise ValueError( + "You have to specify numeric values and numeric values scale in order to calculate the regression loss" + ) + + total_loss += torch.mean(per_example_additional_loss) + + else: + # if no label ids are provided, set them to zeros in order to properly compute logits + label_ids = torch.zeros_like(logits) + _, logits = _single_column_cell_selection_loss( + logits, column_logits, label_ids, cell_index, col_index, cell_mask + ) + if not return_dict: + output = (logits, logits_aggregation) + outputs[2:] + return ((total_loss,) + output) if calculate_loss else output + + return TableQuestionAnsweringOutput( + loss=total_loss, + logits=logits, + logits_aggregation=logits_aggregation, + hidden_states=outputs.hidden_states, + attentions=outputs.attentions, + ) + + +@add_start_docstrings( + """ + Tapas Model with a sequence classification head on top (a linear layer on top of the pooled output), e.g. for + table entailment tasks, such as TabFact (Chen et al., 2020). + """, + TAPAS_START_DOCSTRING, +) +class TapasForSequenceClassification(TapasPreTrainedModel): + def __init__(self, config): + super().__init__(config) + self.num_labels = config.num_labels + + self.tapas = TapasModel(config) + self.dropout = nn.Dropout(config.hidden_dropout_prob) + self.classifier = nn.Linear(config.hidden_size, config.num_labels) + + self.init_weights() + + @add_start_docstrings_to_model_forward(TAPAS_INPUTS_DOCSTRING.format("batch_size, sequence_length")) + @replace_return_docstrings(output_type=SequenceClassifierOutput, config_class=_CONFIG_FOR_DOC) + def forward( + self, + input_ids=None, + attention_mask=None, + token_type_ids=None, + position_ids=None, + head_mask=None, + inputs_embeds=None, + labels=None, + output_attentions=None, + output_hidden_states=None, + return_dict=None, + ): + r""" + labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size,)`, `optional`): + Labels for computing the sequence classification/regression loss. Indices should be in :obj:`[0, ..., + config.num_labels - 1]`. If :obj:`config.num_labels == 1` a regression loss is computed (Mean-Square loss), + If :obj:`config.num_labels > 1` a classification loss is computed (Cross-Entropy). Note: this is called + "classification_class_index" in the original implementation. + + Returns: + + Examples:: + + >>> from transformers import TapasTokenizer, TapasForSequenceClassification + >>> import torch + >>> import pandas as pd + + >>> tokenizer = TapasTokenizer.from_pretrained('google/tapas-base-uncased-finetuned-tabfact') + >>> model = TapasForSequenceClassification.from_pretrained('google/tapas-base-uncased-finetuned-tabfact') + + >>> data = {'Actors': ["Brad Pitt", "Leonardo Di Caprio", "George Clooney"], + ... 'Age': ["56", "45", "59"], + ... 'Number of movies': ["87", "53", "69"] + ... 
} + >>> table = pd.DataFrame.from_dict(data) + >>> queries = ["There is only one actor who is 45 years old", "There are 3 actors which played in more than 60 movies"] + + >>> inputs = tokenizer(table=table, queries=queries, padding="max_length", return_tensors="pt") + >>> labels = torch.tensor([1, 0]) # 1 means entailed, 0 means refuted + + >>> outputs = model(**inputs, labels=labels) + >>> loss = outputs.loss + >>> logits = outputs.logits + """ + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + outputs = self.tapas( + input_ids, + attention_mask=attention_mask, + token_type_ids=token_type_ids, + position_ids=position_ids, + head_mask=head_mask, + inputs_embeds=inputs_embeds, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + + pooled_output = outputs[1] + + pooled_output = self.dropout(pooled_output) + logits = self.classifier(pooled_output) + + loss = None + if labels is not None: + if self.num_labels == 1: + # We are doing regression + loss_fct = MSELoss() + loss = loss_fct(logits.view(-1), labels.view(-1)) + else: + loss_fct = CrossEntropyLoss() + loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1)) + + if not return_dict: + output = (logits,) + outputs[2:] + return ((loss,) + output) if loss is not None else output + + return SequenceClassifierOutput( + loss=loss, + logits=logits, + hidden_states=outputs.hidden_states, + attentions=outputs.attentions, + ) + + +""" TAPAS utilities.""" + + +class AverageApproximationFunction(str, enum.Enum): + RATIO = "ratio" + FIRST_ORDER = "first_order" + SECOND_ORDER = "second_order" + + +### Beginning of everything related to segmented tensors ### + + +class IndexMap(object): + """Index grouping entries within a tensor.""" + + def __init__(self, indices, num_segments, batch_dims=0): + """ + Creates an index + + Args: + indices (:obj:`torch.LongTensor`, same shape as a `values` Tensor to which the indices refer): + Tensor containing the indices. + num_segments (:obj:`torch.LongTensor`): + Scalar tensor, the number of segments. All elements in a batched segmented tensor must have the same + number of segments (although many segments can be empty). + batch_dims (:obj:`int`, `optional`, defaults to 0): + The number of batch dimensions. The first `batch_dims` dimensions of a SegmentedTensor are treated as + batch dimensions. Segments in different batch elements are always distinct even if they have the same + index. + """ + self.indices = torch.as_tensor(indices) + self.num_segments = torch.as_tensor(num_segments, device=indices.device) + self.batch_dims = batch_dims + + def batch_shape(self): + return self.indices.size()[: self.batch_dims] # returns a torch.Size object + + +class ProductIndexMap(IndexMap): + """The product of two indices.""" + + def __init__(self, outer_index, inner_index): + """ + Combines indices i and j into pairs (i, j). The result is an index where each segment (i, j) is the + intersection of segments i and j. For example if the inputs represent table cells indexed by respectively rows + and columns the output will be a table indexed by (row, column) pairs, i.e. by cell. The implementation + combines indices {0, .., n - 1} and {0, .., m - 1} into {0, .., nm - 1}. The output has `num_segments` equal to + `outer_index.num_segments` * `inner_index.num_segments` + + Args: + outer_index (:obj:`IndexMap`): + IndexMap. + inner_index (:obj:`IndexMap`): + IndexMap, must have the same shape as `outer_index`. 
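+
+        For example, combining a row index (outer, 3 segments) with a column index (inner, 4 segments)
+        maps the cell at (row=1, column=2) to segment 2 + 1 * 4 = 6.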
+ """ + if outer_index.batch_dims != inner_index.batch_dims: + raise ValueError("outer_index.batch_dims and inner_index.batch_dims must be the same.") + + super(ProductIndexMap, self).__init__( + indices=(inner_index.indices + outer_index.indices * inner_index.num_segments), + num_segments=inner_index.num_segments * outer_index.num_segments, + batch_dims=inner_index.batch_dims, + ) + self.outer_index = outer_index + self.inner_index = inner_index + + def project_outer(self, index): + """Projects an index with the same index set onto the outer components.""" + return IndexMap( + indices=(index.indices // self.inner_index.num_segments).type(torch.float).floor().type(torch.long), + num_segments=self.outer_index.num_segments, + batch_dims=index.batch_dims, + ) + + def project_inner(self, index): + """Projects an index with the same index set onto the inner components.""" + return IndexMap( + indices=torch.fmod(index.indices, self.inner_index.num_segments) + .type(torch.float) + .floor() + .type(torch.long), + num_segments=self.inner_index.num_segments, + batch_dims=index.batch_dims, + ) + + +def gather(values, index, name="segmented_gather"): + """ + Gathers from `values` using the index map. For each element in the domain of the index map this operation looks up + a value for that index in `values`. Two elements from the same segment always get assigned the same value. + + Args: + values (:obj:`torch.Tensor` of shape (B1, ..., Bn, num_segments, V1, ...)): + Tensor with segment values. + index (:obj:`IndexMap` of shape (B1, ..., Bn, I1, ..., Ik)): + IndexMap. + name (:obj:`str`, `optional`, defaults to 'segmented_gather'): + Name for the operation. Currently not used + + Returns: + :obj:`tuple(torch.Tensor)`: Tensor of shape (B1, ..., Bn, I1, ..., Ik, V1, ...) with the gathered values. + """ + indices = index.indices + # first, check whether the indices of the index represent scalar values (i.e. not vectorized) + if len(values.shape[index.batch_dims :]) < 2: + return torch.gather( + values, + index.batch_dims, + indices.view( + values.size()[0], -1 + ), # torch.gather expects index to have the same number of dimensions as values + ).view(indices.size()) + else: + # this means we have a vectorized version + # we have to adjust the index + indices = indices.unsqueeze(-1).expand(values.shape) + return torch.gather(values, index.batch_dims, indices) + + +def flatten(index, name="segmented_flatten"): + """ + Flattens a batched index map (which is typically of shape batch_size, seq_length) to a 1d index map. This operation + relabels the segments to keep batch elements distinct. The k-th batch element will have indices shifted by + `num_segments` * (k - 1). The result is a tensor with `num_segments` multiplied by the number of elements in the + batch. + + Args: + index (:obj:`IndexMap`): + IndexMap to flatten. + name (:obj:`str`, `optional`, defaults to 'segmented_flatten'): + Name for the operation. Currently not used + + Returns: + (:obj:`IndexMap`): The flattened IndexMap. + """ + # first, get batch_size as scalar tensor + batch_size = torch.prod(torch.tensor(list(index.batch_shape()))) + # next, create offset as 1-D tensor of length batch_size, + # and multiply element-wise by num segments (to offset different elements in the batch) e.g. 
if batch size is 2: [0, 64]
+    offset = torch.arange(start=0, end=batch_size, device=index.num_segments.device) * index.num_segments
+    offset = offset.view(index.batch_shape())
+    for _ in range(index.batch_dims, len(index.indices.size())):  # typically range(1,2)
+        offset = offset.unsqueeze(-1)
+
+    indices = offset + index.indices
+    return IndexMap(indices=indices.view(-1), num_segments=index.num_segments * batch_size, batch_dims=0)
+
+
+def range_index_map(batch_shape, num_segments, name="range_index_map"):
+    """
+    Constructs an index map equal to range(num_segments).
+
+    Args:
+        batch_shape (:obj:`torch.Size`):
+            Batch shape
+        num_segments (:obj:`int`):
+            Number of segments
+        name (:obj:`str`, `optional`, defaults to 'range_index_map'):
+            Name for the operation. Currently not used
+
+    Returns:
+        (:obj:`IndexMap`): IndexMap of shape batch_shape with elements equal to range(num_segments).
+    """
+    batch_shape = torch.as_tensor(
+        batch_shape, dtype=torch.long
+    )  # create a rank 1 tensor containing batch_shape (e.g. [2])
+    assert len(batch_shape.size()) == 1
+    num_segments = torch.as_tensor(num_segments)  # create a rank 0 tensor (scalar) containing num_segments (e.g. 64)
+    assert len(num_segments.size()) == 0
+
+    indices = torch.arange(
+        start=0, end=num_segments, device=num_segments.device
+    )  # create a rank 1 vector with num_segments elements
+    new_tensor = torch.cat(
+        [torch.ones_like(batch_shape, dtype=torch.long, device=num_segments.device), num_segments.unsqueeze(dim=0)],
+        dim=0,
+    )
+    # new_tensor is just a vector of [1 64] for example (assuming only 1 batch dimension)
+    new_shape = [int(x) for x in new_tensor.tolist()]
+    indices = indices.view(new_shape)
+
+    multiples = torch.cat([batch_shape, torch.as_tensor([1])], dim=0)
+    indices = indices.repeat(multiples.tolist())
+    # equivalent in NumPy:
+    # indices = torch.as_tensor(np.tile(indices.numpy(), multiples.tolist()))
+
+    return IndexMap(indices=indices, num_segments=num_segments, batch_dims=list(batch_shape.size())[0])
+
+
+def _segment_reduce(values, index, segment_reduce_fn, name):
+    """
+    Applies a segment reduction segment-wise.
+
+    Args:
+        values (:obj:`torch.Tensor`):
+            Tensor with segment values.
+        index (:obj:`IndexMap`):
+            IndexMap.
+        segment_reduce_fn (:obj:`str`):
+            Name for the reduce operation. One of "sum", "mean", "max" or "min".
+        name (:obj:`str`):
+            Name for the operation. Currently not used
+
+    Returns:
+        output_values (:obj:`torch.Tensor` of shape [B1, B2, ..., Bn, num_segments, V1, V2, ..]): Tensor containing
+        the reduced values. output_index (:obj:`IndexMap`): IndexMap with shape [B1, B2, ..., Bn, num_segments].
+    """
+    # Flatten the batch dimensions, as segments ops (scatter) do not support batching.
+    # However if `values` has extra dimensions to the right keep them
+    # unflattened. Segmented ops support vector-valued operations.
+    flat_index = flatten(index)
+    vector_shape = values.size()[len(index.indices.size()) :]  # torch.Size object
+    flattened_shape = torch.cat(
+        [torch.as_tensor([-1], dtype=torch.long), torch.as_tensor(vector_shape, dtype=torch.long)], dim=0
+    )
+    # "reshape" is used here instead of "view" because `values` is not guaranteed to be contiguous at this point
+    flat_values = values.reshape(flattened_shape.tolist())
+
+    segment_means = scatter(
+        src=flat_values,
+        index=flat_index.indices.type(torch.long),
+        dim=0,
+        dim_size=flat_index.num_segments,
+        reduce=segment_reduce_fn,
+    )
+
+    # Unflatten the values.
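+    # The scatter output has shape [batch_size * num_segments, V1, V2, ...]; reshape it back to
+    # [B1, ..., Bn, num_segments, V1, V2, ...] so that every batch element again has its own
+    # num_segments rows.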
+ + # Unflatten the values. + new_shape = torch.cat( + [ + torch.as_tensor(index.batch_shape(), dtype=torch.long), + torch.as_tensor([index.num_segments], dtype=torch.long), + torch.as_tensor(vector_shape, dtype=torch.long), + ], + dim=0, + ) + + output_values = segment_means.view(new_shape.tolist()) + output_index = range_index_map(index.batch_shape(), index.num_segments) + return output_values, output_index + + + def reduce_sum(values, index, name="segmented_reduce_sum"): + """ + Sums a tensor over its segments. + + Outputs 0 for empty segments. + + This operation computes the sum over segments, with support for: + - Batching using the first dimensions [B1, B2, ..., Bn]. Each element in a batch can have different indices. + - Vectorization using the last dimension [V1, V2, ...]. If they are present, the output will be a sum of + vectors rather than scalars. + + Only the middle dimensions [I1, ..., Ik] are reduced by the operation. + + Args: + values (:obj:`torch.Tensor` of shape [B1, B2, ..., Bn, I1, .., Ik, V1, V2, ..]): + Tensor containing the values of which the sum must be taken segment-wise. + index (:obj:`IndexMap`, indices are of shape [B1, B2, ..., Bn, I1, .., Ik].): + Index defining the segments. + name (:obj:`str`, `optional`, defaults to 'segmented_reduce_sum'): + Name for the operation. Currently not used + + Returns: + output_values (:obj:`torch.Tensor` of shape [B1, B2, ..., Bn, num_segments, V1, V2, ..]): Tensor containing the + output values. output_index (:obj:`IndexMap`): IndexMap with shape [B1, B2, ..., Bn, num_segments]. + """ + return _segment_reduce(values, index, "sum", name) + + + def reduce_mean(values, index, name="segmented_reduce_mean"): + """ + Averages a tensor over its segments. + + Outputs 0 for empty segments. + + This operation computes the mean over segments, with support for: + - Batching using the first dimensions [B1, B2, ..., Bn]. Each element in a batch can have different indices. + - Vectorization using the last dimension [V1, V2, ...]. If they are present, the output will be a mean of + vectors rather than scalars. + + Only the middle dimensions [I1, ..., Ik] are reduced by the operation. + + Args: + values (:obj:`torch.Tensor` of shape [B1, B2, ..., Bn, I1, .., Ik, V1, V2, ..]): + Tensor containing the values of which the mean must be taken segment-wise. + index (:obj:`IndexMap`, indices are of shape [B1, B2, ..., Bn, I1, .., Ik].): + Index defining the segments. + name (:obj:`str`, `optional`, defaults to 'segmented_reduce_mean'): + Name for the operation. Currently not used + + Returns: + output_values (:obj:`torch.Tensor` of shape [B1, B2, ..., Bn, num_segments, V1, V2, ..]): Tensor containing the + output values. output_index (:obj:`IndexMap`): IndexMap with shape [B1, B2, ..., Bn, num_segments]. + """ + return _segment_reduce(values, index, "mean", name)
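+ + + # Editor's note: a minimal usage sketch of the segmented reductions (illustrative, not part of the original + # code), assuming a batch of one sequence of four tokens mapped onto two segments, and assuming that + # :class:`IndexMap` wraps ``num_segments`` into a tensor as the helpers above expect: + # + # index = IndexMap(indices=torch.tensor([[0, 0, 1, 1]]), num_segments=2, batch_dims=1) + # values = torch.tensor([[1.0, 2.0, 3.0, 4.0]]) + # sums, out_index = reduce_sum(values, index) # sums == tensor([[3., 7.]]) + # means, _ = reduce_mean(values, index) # means == tensor([[1.5, 3.5]])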
+ + + def reduce_max(values, index, name="segmented_reduce_max"): + """ + Computes the maximum over segments. + + This operation computes the maximum over segments, with support for: + - Batching using the first dimensions [B1, B2, ..., Bn]. Each element in a batch can have different indices. + - Vectorization using the last dimension [V1, V2, ...]. If they are present, the output will be an element-wise + maximum of vectors rather than scalars. + + Only the middle dimensions [I1, ..., Ik] are reduced by the operation. + + Args: + values (:obj:`torch.Tensor` of shape [B1, B2, ..., Bn, I1, .., Ik, V1, V2, ..]): + Tensor containing the values of which the max must be taken segment-wise. + index (:obj:`IndexMap`, indices are of shape [B1, B2, ..., Bn, I1, .., Ik].): + Index defining the segments. + name (:obj:`str`, `optional`, defaults to 'segmented_reduce_max'): + Name for the operation. Currently not used + + Returns: + output_values (:obj:`torch.FloatTensor` of shape [B1, B2, ..., Bn, num_segments, V1, V2, ..]): Tensor containing the + output values. output_index (:obj:`IndexMap`): IndexMap with shape [B1, B2, ..., Bn, num_segments]. + """ + return _segment_reduce(values, index, "max", name) + + + def reduce_min(values, index, name="segmented_reduce_min"): + """ + Computes the minimum over segments. + + This operation computes the minimum over segments, with support for: + - Batching using the first dimensions [B1, B2, ..., Bn]. Each element in a batch can have different indices. + - Vectorization using the last dimension [V1, V2, ...]. If they are present, the output will be an element-wise minimum + of vectors rather than scalars. + + Only the middle dimensions [I1, ..., Ik] are reduced by the operation. + + Args: + values (:obj:`torch.Tensor` of shape [B1, B2, ..., Bn, I1, .., Ik, V1, V2, ..]): + Tensor containing the values of which the min must be taken segment-wise. + index (:obj:`IndexMap`, indices are of shape [B1, B2, ..., Bn, I1, .., Ik].): + Index defining the segments. + name (:obj:`str`, `optional`, defaults to 'segmented_reduce_min'): + Name for the operation. Currently not used + + Returns: + output_values (:obj:`torch.FloatTensor` of shape [B1, B2, ..., Bn, num_segments, V1, V2, ..]): Tensor containing the + output values. output_index (:obj:`IndexMap`): IndexMap with shape [B1, B2, ..., Bn, num_segments]. + """ + return _segment_reduce(values, index, "min", name) + + + ### End of everything related to segmented tensors ### + + + def compute_column_logits( + sequence_output, column_output_weights, column_output_bias, cell_index, cell_mask, allow_empty_column_selection + ): + """ + Computes the column logits. + + Args: + sequence_output (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`): + Also known as last_hidden_state. Sequence of hidden-states at the output of the last layer of the model. + column_output_weights (:obj:`torch.FloatTensor` of shape :obj:`(hidden_size)`): + Weights of the linear layer for column selection. + column_output_bias (:obj:`torch.FloatTensor` of shape :obj:`()`): + Bias of the linear layer for column selection. + cell_index (:obj:`ProductIndexMap`): + Index that groups tokens into cells. + cell_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, max_num_rows * max_num_cols)`): + Mask for cells that exist in the table (i.e. that are not padding). + allow_empty_column_selection (:obj:`bool`): + Whether to allow the model to select no column. + + Returns: + column_logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, max_num_cols)`): Tensor containing the + column logits for every example in the batch.
+ """ + + # First, compute the token logits (batch_size, seq_len) - without temperature + token_logits = torch.einsum("bsj,j->bs", sequence_output, column_output_weights) + column_output_bias + + # Next, average the logits per cell (batch_size, max_num_cols*max_num_rows) + cell_logits, cell_logits_index = reduce_mean(token_logits, cell_index) + + # Finally, average the logits per column (batch_size, max_num_cols) + column_index = cell_index.project_inner(cell_logits_index) + column_logits, out_index = reduce_sum(cell_logits * cell_mask, column_index) + + cell_count, _ = reduce_sum(cell_mask, column_index) + column_logits /= cell_count + EPSILON_ZERO_DIVISION + + # Mask columns that do not appear in the example. + is_padding = torch.logical_and(cell_count < 0.5, ~torch.eq(out_index.indices, 0)) + column_logits += CLOSE_ENOUGH_TO_LOG_ZERO * torch.as_tensor( + is_padding, dtype=torch.float32, device=is_padding.device + ) + + if not allow_empty_column_selection: + column_logits += CLOSE_ENOUGH_TO_LOG_ZERO * torch.as_tensor( + torch.eq(out_index.indices, 0), dtype=torch.float32, device=out_index.indices.device + ) + + return column_logits + + +def _single_column_cell_selection_loss(token_logits, column_logits, label_ids, cell_index, col_index, cell_mask): + """ + Computes the loss for cell selection constrained to a single column. The loss is a hierarchical log-likelihood. The + model first predicts a column and then selects cells within that column (conditioned on the column). Cells outside + the selected column are never selected. + + Args: + token_logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`): + Tensor containing the logits per token. + column_logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, max_num_cols)`): + Tensor containing the logits per column. + label_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`): + Labels per token. + cell_index (:obj:`ProductIndexMap`): + Index that groups tokens into cells. + col_index (:obj:`IndexMap`): + Index that groups tokens into columns. + cell_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, max_num_rows * max_num_cols)`): + Mask for cells that exist in the table (i.e. that are not padding). + + Returns: + selection_loss_per_example (:obj:`torch.FloatTensor` of shape :obj:`(batch_size,)`): Loss for each example. + logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`): New logits which are only + allowed to select cells in a single column. Logits outside of the most likely column according to + `column_logits` will be set to a very low value (such that the probabilities are 0). + """ + ## Part 1: column loss + + # First find the column we should select. We use the column with maximum + # number of selected cells. + labels_per_column, _ = reduce_sum( + torch.as_tensor(label_ids, dtype=torch.float32, device=label_ids.device), col_index + ) + # shape of labels_per_column is (batch_size, max_num_cols). It contains the number of label ids for every column, for every example + column_label = torch.argmax(labels_per_column, dim=-1) # shape (batch_size,) + # Check if there are no selected cells in the column. In that case the model + # should predict the special column id 0, which means "select nothing". + no_cell_selected = torch.eq( + torch.max(labels_per_column, dim=-1)[0], 0 + ) # no_cell_selected is of shape (batch_size,) and equals True + # if an example of the batch has no cells selected (i.e. 
if there are no label_ids set to 1 for that example) + column_label = torch.where( + no_cell_selected.view(column_label.size()), torch.zeros_like(column_label), column_label + ) + + column_dist = torch.distributions.Categorical(logits=column_logits) # shape (batch_size, max_num_cols) + column_loss_per_example = -column_dist.log_prob(column_label) + + ## Part 2: cell loss + + # Reduce the labels and logits to per-cell from per-token. + # logits_per_cell: shape (batch_size, max_num_rows*max_num_cols) i.e. (batch_size, 64*32) + logits_per_cell, _ = reduce_mean(token_logits, cell_index) + # labels_per_cell: shape (batch_size, 64*32), indicating whether each cell should be selected (1) or not (0) + labels_per_cell, labels_index = reduce_max( + torch.as_tensor(label_ids, dtype=torch.long, device=label_ids.device), cell_index + ) + + # Mask for the selected column. + # column_id_for_cells: shape (batch_size, 64*32), indicating to which column each cell belongs + column_id_for_cells = cell_index.project_inner(labels_index).indices + # column_mask: shape (batch_size, 64*32), equal to 1 if cell belongs to column to be selected + column_mask = torch.as_tensor( + torch.eq(column_id_for_cells, torch.unsqueeze(column_label, dim=-1)), + dtype=torch.float32, + device=cell_mask.device, + ) + + # Compute the log-likelihood for cells, but only for the selected column. + cell_dist = torch.distributions.Bernoulli(logits=logits_per_cell) # shape (batch_size, 64*32) + cell_log_prob = cell_dist.log_prob(labels_per_cell.type(torch.float32)) # shape(batch_size, 64*32) + + cell_loss = -torch.sum(cell_log_prob * column_mask * cell_mask, dim=1) + + # We need to normalize the loss by the number of cells in the column. + cell_loss /= torch.sum(column_mask * cell_mask, dim=1) + EPSILON_ZERO_DIVISION + + selection_loss_per_example = column_loss_per_example + selection_loss_per_example += torch.where( + no_cell_selected.view(selection_loss_per_example.size()), + torch.zeros_like(selection_loss_per_example), + cell_loss, + ) + + # Set the probs outside the selected column (selected by the *model*) + # to 0. This ensures backwards compatibility with models that select + # cells from multiple columns. + selected_column_id = torch.as_tensor( + torch.argmax(column_logits, dim=-1), dtype=torch.long, device=column_logits.device + ) # shape (batch_size,) + + # selected_column_mask: shape (batch_size, 64*32), equal to 1 if cell belongs to column selected by the model + selected_column_mask = torch.as_tensor( + torch.eq(column_id_for_cells, torch.unsqueeze(selected_column_id, dim=-1)), + dtype=torch.float32, + device=selected_column_id.device, + ) + + # Never select cells with the special column id 0. + selected_column_mask = torch.where( + torch.eq(column_id_for_cells, 0).view(selected_column_mask.size()), + torch.zeros_like(selected_column_mask), + selected_column_mask, + ) + new_logits_per_cell = logits_per_cell + CLOSE_ENOUGH_TO_LOG_ZERO * (1.0 - cell_mask * selected_column_mask) + logits = gather(new_logits_per_cell, cell_index) + + return selection_loss_per_example, logits + + +def compute_token_logits(sequence_output, temperature, output_weights, output_bias): + """ + Computes logits per token + + Args: + sequence_output (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`): + Also known as last_hidden_state. Sequence of hidden-states at the output of the last layer of the model. + temperature (:obj:`float`): + Temperature for the Bernoulli distribution. 
+ output_weights (:obj:`torch.FloatTensor` of shape :obj:`(hidden_size,)`): + Weights of the linear layer for cell selection. + output_bias (:obj:`torch.FloatTensor` of shape :obj:`()`): + Bias of the linear layer for cell selection. + + Returns: + logits (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`): Logits per token. + """ + logits = (torch.einsum("bsj,j->bs", sequence_output, output_weights) + output_bias) / temperature + + return logits + + + def _calculate_aggregate_mask(answer, pooled_output, cell_selection_preference, label_ids, aggregation_classifier): + """ + Finds examples where the model should select cells with no aggregation. + + Returns a mask that determines for which examples the model should select answers directly from the table, without + any aggregation function. If the answer is a piece of text, the case is unambiguous, as aggregation functions only + apply to numbers. If the answer is a number but does not appear in the table, then we must use some aggregation + function. The ambiguous case is when the answer is a number that also appears in the table. In this case we use the + aggregation function probabilities predicted by the model to decide whether to select or aggregate. The threshold + for this is the hyperparameter :obj:`cell_selection_preference`. + + Args: + answer (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, )`): + Answer for every example in the batch. NaN if there is no scalar answer. + pooled_output (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, hidden_size)`): + Output of the pooler (BertPooler) on top of the encoder layer. + cell_selection_preference (:obj:`float`): + Preference for cell selection in ambiguous cases. + label_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`): + Labels per token. + aggregation_classifier (:obj:`torch.nn.Linear`): + Aggregation head. + + Returns: + aggregate_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size,)`): A mask set to 1 for examples that + should use aggregation functions. + """ + # torch.FloatTensor(batch_size,) + aggregate_mask_init = torch.logical_not(torch.isnan(answer)).type(torch.FloatTensor).to(answer.device) + logits_aggregation = aggregation_classifier(pooled_output) + dist_aggregation = torch.distributions.categorical.Categorical(logits=logits_aggregation) + # Index 0 corresponds to "no aggregation". + aggregation_ops_total_mass = torch.sum(dist_aggregation.probs[:, 1:], dim=1) + + # Cell selection examples according to current model. + is_pred_cell_selection = aggregation_ops_total_mass <= cell_selection_preference + + # Examples with non-empty cell selection supervision. + is_cell_supervision_available = torch.sum(label_ids, dim=1) > 0 + + # torch.where is not equivalent to tf.where (in tensorflow 1) + # hence the added .view on the condition to match the shape of the first tensor + aggregate_mask = torch.where( + torch.logical_and(is_pred_cell_selection, is_cell_supervision_available).view(aggregate_mask_init.size()), + torch.zeros_like(aggregate_mask_init, dtype=torch.float32), + aggregate_mask_init, + ) + + aggregate_mask = aggregate_mask.detach() + + return aggregate_mask + + + def _calculate_aggregation_loss_known( + logits_aggregation, aggregate_mask, aggregation_labels, use_answer_as_supervision, num_aggregation_labels + ): + """ + Calculates aggregation loss when its type is known during training. + + In the weakly supervised setting, the only known information is that for cell selection examples, "no aggregation" + should be predicted.
For other examples (those that require aggregation), no loss is accumulated. In the setting + where aggregation type is always known, standard cross entropy loss is accumulated for all examples. + + Args: + logits_aggregation (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, num_aggregation_labels)`): + Logits per aggregation operation. + aggregate_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, )`): + A mask set to 1 for examples that should use aggregation functions. + aggregation_labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, )`): + Aggregation function id for every example in the batch. + use_answer_as_supervision (:obj:`bool`, `optional`): + Whether to use the answer as the only supervision for aggregation examples. + num_aggregation_labels (:obj:`int`, `optional`, defaults to 0): + The number of aggregation operators to predict. + + Returns: + aggregation_loss_known (:obj:`torch.FloatTensor` of shape :obj:`(batch_size,)`): Aggregation loss (when its + type is known during training) per example. + """ + if use_answer_as_supervision: + # Prepare "no aggregation" targets for cell selection examples. + target_aggregation = torch.zeros_like(aggregate_mask, dtype=torch.long) + else: + # Use aggregation supervision as the target. + target_aggregation = aggregation_labels + + one_hot_labels = torch.nn.functional.one_hot(target_aggregation, num_classes=num_aggregation_labels).type( + torch.float32 + ) + log_probs = torch.nn.functional.log_softmax(logits_aggregation, dim=-1) + + # torch.FloatTensor[batch_size] + per_example_aggregation_intermediate = -torch.sum(one_hot_labels * log_probs, dim=-1) + if use_answer_as_supervision: + # Accumulate loss only for examples requiring cell selection + # (no aggregation). + return per_example_aggregation_intermediate * (1 - aggregate_mask) + else: + return per_example_aggregation_intermediate + + + def _calculate_aggregation_loss_unknown(logits_aggregation, aggregate_mask): + """ + Calculates aggregation loss in the case of answer supervision. + + Args: + logits_aggregation (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, num_aggregation_labels)`): + Logits per aggregation operation. + aggregate_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, )`): + A mask set to 1 for examples that should use aggregation functions. + + Returns: + aggregation_loss_unknown (:obj:`torch.FloatTensor` of shape :obj:`(batch_size,)`): Aggregation loss (in case of + answer supervision) per example. + """ + dist_aggregation = torch.distributions.categorical.Categorical(logits=logits_aggregation) + # Index 0 corresponds to "no aggregation". + aggregation_ops_total_mass = torch.sum(dist_aggregation.probs[:, 1:], dim=1) + # Predict some aggregation in case of an answer that needs aggregation. + # This increases the probability of all aggregation functions, in a way + # similar to MML, but without considering whether the function gives the + # correct answer. + return -torch.log(aggregation_ops_total_mass) * aggregate_mask
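+ + + # Editor's note: a small worked example (illustrative, not part of the original code). For + # logits_aggregation = torch.tensor([[2.0, 0.0, 0.0, 0.0]]), the probability mass on the aggregation + # operators (indices 1:) is ~0.29, so an example with aggregate_mask == 1 contributes + # -log(0.29) ~= 1.24 to the loss above, pushing probability mass away from the "no aggregation" slot.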
+ + + def _calculate_aggregation_loss( + logits_aggregation, aggregate_mask, aggregation_labels, use_answer_as_supervision, num_aggregation_labels, + aggregation_loss_weight + ): + """ + Calculates the aggregation loss per example. + + Args: + logits_aggregation (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, num_aggregation_labels)`): + Logits per aggregation operation. + aggregate_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, )`): + A mask set to 1 for examples that should use aggregation functions. + aggregation_labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, )`): + Aggregation function id for every example in the batch. + use_answer_as_supervision (:obj:`bool`, `optional`): + Whether to use the answer as the only supervision for aggregation examples. + num_aggregation_labels (:obj:`int`, `optional`, defaults to 0): + The number of aggregation operators to predict. + aggregation_loss_weight (:obj:`float`, `optional`, defaults to 1.0): + Importance weight for the aggregation loss. + + Returns: + aggregation_loss (:obj:`torch.FloatTensor` of shape :obj:`(batch_size,)`): Aggregation loss per example. + """ + per_example_aggregation_loss = _calculate_aggregation_loss_known( + logits_aggregation, aggregate_mask, aggregation_labels, use_answer_as_supervision, num_aggregation_labels + ) + + if use_answer_as_supervision: + # Add aggregation loss for numeric answers that need aggregation. + per_example_aggregation_loss += _calculate_aggregation_loss_unknown(logits_aggregation, aggregate_mask) + return aggregation_loss_weight * per_example_aggregation_loss + + + def _calculate_expected_result( + dist_per_cell, numeric_values, numeric_values_scale, input_mask_float, logits_aggregation, config + ): + """ + Calculates the expected result given cell and aggregation probabilities. + + Args: + dist_per_cell (:obj:`torch.distributions.Bernoulli`): + Cell selection distribution for each cell. + numeric_values (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, seq_length)`): + Numeric values of every token. NaN for tokens that are not numeric values. + numeric_values_scale (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, seq_length)`): + Scale of the numeric values of every token. + input_mask_float (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, seq_length)`): + Mask for the table, without question tokens and table headers. + logits_aggregation (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, num_aggregation_labels)`): + Logits per aggregation operation. + config (:class:`~transformers.TapasConfig`): + Model configuration class with all the hyperparameters of the model. + + Returns: + expected_result (:obj:`torch.FloatTensor` of shape :obj:`(batch_size,)`): The expected result per example. + """ + if config.use_gumbel_for_cells: + gumbel_dist = torch.distributions.RelaxedBernoulli( + # The token logits were already divided by the temperature and used for + # computing cell selection errors, so we need to multiply by the temperature again here + temperature=config.temperature, + logits=dist_per_cell.logits * config.temperature, + ) + scaled_probability_per_cell = gumbel_dist.sample() + else: + scaled_probability_per_cell = dist_per_cell.probs + + # [batch_size, seq_length] + scaled_probability_per_cell = (scaled_probability_per_cell / numeric_values_scale) * input_mask_float + count_result = torch.sum(scaled_probability_per_cell, dim=1) + numeric_values_masked = torch.where( + torch.isnan(numeric_values), torch.zeros_like(numeric_values), numeric_values + ) # Mask non-numeric table values to zero.
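+ # Editor's note: an illustrative sketch (not part of the original code). With cell probabilities + # [0.9, 0.8, 0.1] (scale 1) over cell values [2., 3., 5.], count_result above is 1.8 and sum_result + # below is 0.9*2 + 0.8*3 + 0.1*5 = 4.7; the head then mixes sum, average and count according to the + # predicted aggregation probabilities.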
+ sum_result = torch.sum(scaled_probability_per_cell * numeric_values_masked, dim=1) + avg_approximation = config.average_approximation_function + if avg_approximation == AverageApproximationFunction.RATIO: + average_result = sum_result / (count_result + EPSILON_ZERO_DIVISION) + elif avg_approximation == AverageApproximationFunction.FIRST_ORDER: + # The sum of all probabilities except those that correspond to other cells + ex = torch.sum(scaled_probability_per_cell, dim=1, keepdim=True) - scaled_probability_per_cell + 1 + average_result = torch.sum(numeric_values_masked * scaled_probability_per_cell / ex, dim=1) + elif avg_approximation == AverageApproximationFunction.SECOND_ORDER: + # The sum of all probabilities except those that correspond to other cells + ex = torch.sum(scaled_probability_per_cell, dim=1, keepdim=True) - scaled_probability_per_cell + 1 + pointwise_var = scaled_probability_per_cell * (1 - scaled_probability_per_cell) + var = torch.sum(pointwise_var, dim=1, keepdim=True) - pointwise_var + + multiplier = (var / torch.square(ex) + 1) / ex + average_result = torch.sum(numeric_values_masked * scaled_probability_per_cell * multiplier, dim=1) + else: + raise ValueError(f"Invalid average_approximation_function: {config.average_approximation_function}") + + if config.use_gumbel_for_aggregation: + gumbel_dist = torch.distributions.RelaxedOneHotCategorical( + config.aggregation_temperature, logits=logits_aggregation[:, 1:] + ) + # [batch_size, num_aggregation_labels - 1] + aggregation_op_only_probs = gumbel_dist.sample() + else: + # [batch_size, num_aggregation_labels - 1] + aggregation_op_only_probs = torch.nn.functional.softmax( + logits_aggregation[:, 1:] / config.aggregation_temperature, dim=-1 + ) + + all_results = torch.cat( + [ + torch.unsqueeze(sum_result, dim=1), + torch.unsqueeze(average_result, dim=1), + torch.unsqueeze(count_result, dim=1), + ], + dim=1, + ) + + expected_result = torch.sum(all_results * aggregation_op_only_probs, dim=1) + return expected_result + + + # PyTorch does not currently support Huber loss with a custom delta, so we define it ourselves + def huber_loss(input, target, delta: float = 1.0): + errors = torch.abs(input - target) # shape (batch_size,) + return torch.where(errors < delta, 0.5 * errors ** 2, errors * delta - (0.5 * delta ** 2)) + + + def _calculate_regression_loss( + answer, + aggregate_mask, + dist_per_cell, + numeric_values, + numeric_values_scale, + input_mask_float, + logits_aggregation, + config, + ): + """ + Calculates the regression loss per example. + + Args: + answer (:obj:`torch.FloatTensor` of shape :obj:`(batch_size,)`): + Answer for every example in the batch. NaN if there is no scalar answer. + aggregate_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size,)`): + A mask set to 1 for examples that should use aggregation functions. + dist_per_cell (:obj:`torch.distributions.Bernoulli`): + Cell selection distribution for each cell. + numeric_values (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, seq_length)`): + Numeric values of every token. NaN for tokens that are not numeric values. + numeric_values_scale (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, seq_length)`): + Scale of the numeric values of every token. + input_mask_float (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, seq_length)`): + Mask for the table, without question tokens and table headers. + logits_aggregation (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, num_aggregation_labels)`): + Logits per aggregation operation.
+ config (:class:`~transformers.TapasConfig`): + Model configuration class with all the parameters of the model + + Returns: + per_example_answer_loss_scaled (:obj:`torch.FloatTensor` of shape :obj:`(batch_size,)`): Scales answer loss for + each example in the batch. large_answer_loss_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size,)`): A + mask which is 1 for examples for which their answer loss is larger than the answer_loss_cutoff. + """ + # [batch_size] + expected_result = _calculate_expected_result( + dist_per_cell, numeric_values, numeric_values_scale, input_mask_float, logits_aggregation, config + ) + + # [batch_size] + answer_masked = torch.where(torch.isnan(answer), torch.zeros_like(answer), answer) + + if config.use_normalized_answer_loss: + normalizer = (torch.max(torch.abs(expected_result), torch.abs(answer_masked)) + EPSILON_ZERO_DIVISION).detach() + + normalized_answer_masked = answer_masked / normalizer + normalized_expected_result = expected_result / normalizer + per_example_answer_loss = huber_loss( + normalized_expected_result * aggregate_mask, normalized_answer_masked * aggregate_mask + ) + else: + per_example_answer_loss = huber_loss( + expected_result * aggregate_mask, answer_masked * aggregate_mask, delta=config.huber_loss_delta + ) + + if config.answer_loss_cutoff is None: + large_answer_loss_mask = torch.ones_like(per_example_answer_loss, dtype=torch.float32) + + else: + large_answer_loss_mask = torch.where( + per_example_answer_loss > config.answer_loss_cutoff, + torch.zeros_like(per_example_answer_loss, dtype=torch.float32), + torch.ones_like(per_example_answer_loss, dtype=torch.float32), + ) + per_example_answer_loss_scaled = config.answer_loss_importance * (per_example_answer_loss * aggregate_mask) + + return per_example_answer_loss_scaled, large_answer_loss_mask \ No newline at end of file diff --git a/src/transformers/tokenization_auto.py b/src/transformers/tokenization_auto.py index 9cadfdfb3690..86c42b6e490b 100644 --- a/src/transformers/tokenization_auto.py +++ b/src/transformers/tokenization_auto.py @@ -50,6 +50,7 @@ RobertaConfig, SqueezeBertConfig, T5Config, + TapasConfig, TransfoXLConfig, XLMConfig, XLMProphetNetConfig, @@ -85,6 +86,7 @@ from .tokenization_retribert import RetriBertTokenizer from .tokenization_roberta import RobertaTokenizer from .tokenization_squeezebert import SqueezeBertTokenizer +from .tokenization_tapas import TapasTokenizer from .tokenization_transfo_xl import TransfoXLTokenizer from .tokenization_xlm import XLMTokenizer from .utils import logging @@ -210,6 +212,7 @@ (RagConfig, (RagTokenizer, None)), (XLMProphetNetConfig, (XLMProphetNetTokenizer, None)), (ProphetNetConfig, (ProphetNetTokenizer, None)), + (TapasConfig, (TapasTokenizer, None)), ] ) diff --git a/src/transformers/tokenization_tapas.py b/src/transformers/tokenization_tapas.py new file mode 100644 index 000000000000..cc30f74620d2 --- /dev/null +++ b/src/transformers/tokenization_tapas.py @@ -0,0 +1,2766 @@ +# coding=utf-8 +# Copyright 2020 Google Research and The HuggingFace Inc. team. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. +""" Tokenization class for TAPAS model.""" + + +import ast +import collections +import datetime +import enum +import itertools +import math +import os +import re +import unicodedata +from dataclasses import dataclass +from typing import Callable, Dict, Generator, List, Optional, Text, Tuple, Union + +import pandas as pd +import torch +from transformers import add_end_docstrings + +from .tokenization_utils import PreTrainedTokenizer, _is_control, _is_punctuation, _is_whitespace +from .tokenization_utils_base import ( + BatchEncoding, + EncodedInput, + PaddingStrategy, + PreTokenizedInput, + TensorType, + TextInput, + ExplicitEnum, ENCODE_KWARGS_DOCSTRING, +) +from .utils import logging + + +logger = logging.get_logger(__name__) + + +VOCAB_FILES_NAMES = {"vocab_file": "vocab.txt"} + +PRETRAINED_VOCAB_FILES_MAP = { + "vocab_file": { + "nielsr/tapas-base-finetuned-sqa": "https://huggingface.co/bert-large-uncased/resolve/main/vocab.txt", + "nielsr/tapas-base-finetuned-wtq": "https://huggingface.co/bert-base-uncased/resolve/main/vocab.txt", + "nielsr/tapas-base-finetuned-wikisql-supervised": "https://huggingface.co/bert-large-uncased/resolve/main/vocab.txt", + } +} + +PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = { + "nielsr/tapas-base-finetuned-sqa": 1024, + "nielsr/tapas-base-finetuned-wtq": 1024, + "nielsr/tapas-base-finetuned-wikisql-supervised": 1024, +} + + +PRETRAINED_INIT_CONFIGURATION = { + "nielsr/tapas-base-finetuned-sqa": {"do_lower_case": True}, + "nielsr/tapas-base-finetuned-wtq": {"do_lower_case": True}, + "nielsr/tapas-base-finetuned-wikisql-supervised": {"do_lower_case": True}, +} + + +class TapasTruncationStrategy(ExplicitEnum): + """ + Possible values for the ``truncation`` argument in :meth:`~transformers.TapasTokenizer.__call__`. Useful for + tab-completion in an IDE. + """ + + DROP_ROWS_TO_FIT = "drop_rows_to_fit" + DO_NOT_TRUNCATE = "do_not_truncate" + + +TableValue = collections.namedtuple("TokenValue", ["token", "column_id", "row_id"]) + + +@dataclass(frozen=True) +class TokenCoordinates: + column_index: int + row_index: int + token_index: int + + +@dataclass +class TokenizedTable: + rows: List[List[List[Text]]] + selected_tokens: List[TokenCoordinates] + + +@dataclass(frozen=True) +class SerializedExample: + tokens: List[Text] + column_ids: List[int] + row_ids: List[int] + segment_ids: List[int] + + +def _is_inner_wordpiece(token: Text): + return token.startswith("##") + + +def load_vocab(vocab_file): + """Loads a vocabulary file into a dictionary.""" + vocab = collections.OrderedDict() + with open(vocab_file, "r", encoding="utf-8") as reader: + tokens = reader.readlines() + for index, token in enumerate(tokens): + token = token.rstrip("\n") + vocab[token] = index + return vocab + + +def whitespace_tokenize(text): + """Runs basic whitespace cleaning and splitting on a piece of text.""" + text = text.strip() + if not text: + return [] + tokens = text.split() + return tokens + +TAPAS_ENCODE_PLUS_ADDITIONAL_KWARGS_DOCSTRING = r""" + add_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`True`): + Whether or not to encode the sequences with the special tokens relative to their model. + padding (:obj:`bool`, :obj:`str` or :class:`~transformers.tokenization_utils_base.PaddingStrategy`, `optional`, defaults to :obj:`False`): + Activates and controls padding. 
Accepts the following values: + + * :obj:`True` or :obj:`'longest'`: Pad to the longest sequence in the batch (or no padding if only a + single sequence if provided). + * :obj:`'max_length'`: Pad to a maximum length specified with the argument :obj:`max_length` or to the + maximum acceptable input length for the model if that argument is not provided. + * :obj:`False` or :obj:`'do_not_pad'` (default): No padding (i.e., can output a batch with sequences of + different lengths). + truncation (:obj:`bool`, :obj:`str` or :class:`~transformers.TapasTruncationStrategy`, `optional`, defaults to :obj:`False`): + Activates and controls truncation. Accepts the following values: + + * :obj:`True` or :obj:`'drop_rows_to_fit'`: Truncate to a maximum length specified with the argument + :obj:`max_length` or to the maximum acceptable input length for the model if that argument is not + provided. This will truncate row by row, removing rows from the table. + * :obj:`False` or :obj:`'do_not_truncate'` (default): No truncation (i.e., can output batch with + sequence lengths greater than the model maximum admissible input size). + max_length (:obj:`int`, `optional`): + Controls the maximum length to use by one of the truncation/padding parameters. + + If left unset or set to :obj:`None`, this will use the predefined model maximum length if a maximum + length is required by one of the truncation/padding parameters. If the model has no specific maximum + input length (like XLNet) truncation/padding to a maximum length will be deactivated. + is_split_into_words (:obj:`bool`, `optional`, defaults to :obj:`False`): + Whether or not the input is already pre-tokenized (e.g., split into words), in which case the tokenizer + will skip the pre-tokenization step. This is useful for NER or token classification. + pad_to_multiple_of (:obj:`int`, `optional`): + If set will pad the sequence to a multiple of the provided value. This is especially useful to enable + the use of Tensor Cores on NVIDIA hardware with compute capability >= 7.5 (Volta). + return_tensors (:obj:`str` or :class:`~transformers.tokenization_utils_base.TensorType`, `optional`): + If set, will return tensors instead of list of python integers. Acceptable values are: + + * :obj:`'tf'`: Return TensorFlow :obj:`tf.constant` objects. + * :obj:`'pt'`: Return PyTorch :obj:`torch.Tensor` objects. + * :obj:`'np'`: Return Numpy :obj:`np.ndarray` objects. +""" + + +class TapasTokenizer(PreTrainedTokenizer): + r""" + Construct a TAPAS tokenizer. Based on WordPiece. Flattens a table and one or more related sentences to be used by + TAPAS models. + + This tokenizer inherits from :class:`~transformers.PreTrainedTokenizer` which contains most of the main methods. + Users should refer to this superclass for more information regarding those methods. + :class:`~transformers.TapasTokenizer` creates several token type ids to encode tabular structure. To be more + precise, it adds 7 token type ids, in the following order: :obj:`segment_ids`, :obj:`column_ids`, :obj:`row_ids`, + :obj:`prev_label_ids`, :obj:`column_ranks`, :obj:`inv_column_ranks` and :obj:`numeric_relations`: + + - segment_ids: indicate whether a token belongs to the question (0) or the table (1). 0 for special tokens and + padding. + - column_ids: indicate to which column of the table a token belongs (starting from 1). Is 0 for all question + tokens, special tokens and padding. + - row_ids: indicate to which row of the table a token belongs (starting from 1). 
Is 0 for all question tokens, + special tokens and padding. Tokens of column headers are also 0. + - prev_label_ids: indicate whether a token was (part of) an answer to the previous question (1) or not (0). Useful + in a conversational setup (such as SQA). + - column_ranks: indicate the rank of a table token relative to a column, if applicable. For example, if you have a + column "number of movies" with values 87, 53 and 69, then the column ranks of these tokens are 3, 1 and 2 respectively. + 0 for all question tokens, special tokens and padding. + - inv_column_ranks: indicate the inverse rank of a table token relative to a column, if applicable. For example, if + you have a column "number of movies" with values 87, 53 and 69, then the inverse column ranks of these tokens are 1, 3 and + 2 respectively. 0 for all question tokens, special tokens and padding. + - numeric_relations: indicate numeric relations between the question and the tokens of the table. 0 for all + question tokens, special tokens and padding. + + :class:`~transformers.TapasTokenizer` runs end-to-end tokenization on a table and associated sentences: punctuation + splitting and wordpiece. + + Args: + vocab_file (:obj:`str`): + File containing the vocabulary. + do_lower_case (:obj:`bool`, `optional`, defaults to :obj:`True`): + Whether or not to lowercase the input when tokenizing. + do_basic_tokenize (:obj:`bool`, `optional`, defaults to :obj:`True`): + Whether or not to do basic tokenization before WordPiece. + never_split (:obj:`Iterable`, `optional`): + Collection of tokens which will never be split during tokenization. Only has an effect when + :obj:`do_basic_tokenize=True` + unk_token (:obj:`str`, `optional`, defaults to :obj:`"[UNK]"`): + The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this + token instead. + sep_token (:obj:`str`, `optional`, defaults to :obj:`"[SEP]"`): + The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for + sequence classification or for a text and a question for question answering. It is also used as the last + token of a sequence built with special tokens. + pad_token (:obj:`str`, `optional`, defaults to :obj:`"[PAD]"`): + The token used for padding, for example when batching sequences of different lengths. + cls_token (:obj:`str`, `optional`, defaults to :obj:`"[CLS]"`): + The classifier token which is used when doing sequence classification (classification of the whole sequence + instead of per-token classification). It is the first token of the sequence when built with special tokens. + mask_token (:obj:`str`, `optional`, defaults to :obj:`"[MASK]"`): + The token used for masking values. This is the token used when training this model with masked language + modeling. This is the token which the model will try to predict. + empty_token (:obj:`str`, `optional`, defaults to :obj:`"[EMPTY]"`): + The token used for empty cell values in a table. Empty cell values include "", "n/a", "nan" and "?". + tokenize_chinese_chars (:obj:`bool`, `optional`, defaults to :obj:`True`): + Whether or not to tokenize Chinese characters. This should likely be deactivated for Japanese (see this + `issue `__). + strip_accents: (:obj:`bool`, `optional`): + Whether or not to strip all accents. If this option is not specified, then it will be determined by the + value for :obj:`lowercase` (as in the original BERT). 
+ cell_trim_length (:obj:`int`, `optional`, defaults to -1): + If > 0: Trim cells so that the length is <= this value. Also disables further cell trimming, should thus be + used with 'drop_rows_to_fit' below. + max_column_id (:obj:`int`, `optional`): + Max column id to extract. + max_row_id (:obj:`int`, `optional`): + Max row id to extract. + strip_column_names (:obj:`bool`, `optional`, defaults to :obj:`False`): + Whether to add empty strings instead of column names. + update_answer_coordinates (:obj:`bool`, `optional`, defaults to :obj:`False`): + Whether to recompute the answer coordinates from the answer text. + drop_rows_to_fit (:obj:`bool`, `optional`, defaults to :obj:`False`): + Whether to drop the last rows if a table doesn't fit within max sequence length. + + """ + + vocab_files_names = VOCAB_FILES_NAMES + pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP + max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES + pretrained_init_configuration = PRETRAINED_INIT_CONFIGURATION + + def __init__( + self, + vocab_file, + do_lower_case=True, + do_basic_tokenize=True, + never_split=None, + unk_token="[UNK]", + sep_token="[SEP]", + pad_token="[PAD]", + cls_token="[CLS]", + mask_token="[MASK]", + empty_token="[EMPTY]", + tokenize_chinese_chars=True, + strip_accents=None, + cell_trim_length: int = -1, + max_column_id: int = None, + max_row_id: int = None, + strip_column_names: bool = False, + update_answer_coordinates: bool = False, + drop_rows_to_fit: bool = False, + model_max_length: int = 512, + additional_special_tokens: Optional[List[str]] = None, + **kwargs + ): + if additional_special_tokens is not None: + if empty_token not in additional_special_tokens: + additional_special_tokens.append(empty_token) + else: + additional_special_tokens = [empty_token] + + super().__init__( + do_lower_case=do_lower_case, + do_basic_tokenize=do_basic_tokenize, + never_split=never_split, + unk_token=unk_token, + sep_token=sep_token, + pad_token=pad_token, + cls_token=cls_token, + mask_token=mask_token, + empty_token=empty_token, + tokenize_chinese_chars=tokenize_chinese_chars, + strip_accents=strip_accents, + cell_trim_length=cell_trim_length, + max_column_id=max_column_id, + max_row_id=max_row_id, + strip_column_names=strip_column_names, + update_answer_coordinates=update_answer_coordinates, + drop_rows_to_fit=drop_rows_to_fit, + model_max_length=model_max_length, + additional_special_tokens=additional_special_tokens, + **kwargs, + ) + + if not os.path.isfile(vocab_file): + raise ValueError( + "Can't find a vocabulary file at path '{}'. 
To load the vocabulary from a Google pretrained " + "model use `tokenizer = BertTokenizer.from_pretrained(PRETRAINED_MODEL_NAME)`".format(vocab_file) + ) + self.vocab = load_vocab(vocab_file) + self.ids_to_tokens = collections.OrderedDict([(ids, tok) for tok, ids in self.vocab.items()]) + self.do_basic_tokenize = do_basic_tokenize + if do_basic_tokenize: + self.basic_tokenizer = BasicTokenizer( + do_lower_case=do_lower_case, + never_split=never_split, + tokenize_chinese_chars=tokenize_chinese_chars, + strip_accents=strip_accents, + ) + self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab, unk_token=self.unk_token) + + # Additional properties + self.cell_trim_length = cell_trim_length + self.max_column_id = max_column_id if max_column_id is not None else self.model_max_length + self.max_row_id = max_row_id if max_row_id is not None else self.model_max_length + self.strip_column_names = strip_column_names + self.update_answer_coordinates = update_answer_coordinates + self.drop_rows_to_fit = drop_rows_to_fit + + @property + def do_lower_case(self): + return self.basic_tokenizer.do_lower_case + + @property + def vocab_size(self): + return len(self.vocab) + + def get_vocab(self): + return dict(self.vocab, **self.added_tokens_encoder) + + def _tokenize(self, text): + if format_text(text) == EMPTY_TEXT: + return [self.additional_special_tokens[0]] + split_tokens = [] + if self.do_basic_tokenize: + for token in self.basic_tokenizer.tokenize(text, never_split=self.all_special_tokens): + + # If the token is part of the never_split set + if token in self.basic_tokenizer.never_split: + split_tokens.append(token) + else: + split_tokens += self.wordpiece_tokenizer.tokenize(token) + else: + split_tokens = self.wordpiece_tokenizer.tokenize(text) + return split_tokens + + def _convert_token_to_id(self, token): + """ Converts a token (str) in an id using the vocab. """ + return self.vocab.get(token, self.vocab.get(self.unk_token)) + + def _convert_id_to_token(self, index): + """Converts an index (integer) in a token (str) using the vocab.""" + return self.ids_to_tokens.get(index, self.unk_token) + + def convert_tokens_to_string(self, tokens): + """ Converts a sequence of tokens (string) in a single string. """ + out_string = " ".join(tokens).replace(" ##", "").strip() + return out_string + + def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> Tuple[str]: + index = 0 + if os.path.isdir(save_directory): + vocab_file = os.path.join( + save_directory, (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["vocab_file"] + ) + else: + vocab_file = (filename_prefix + "-" if filename_prefix else "") + save_directory + with open(vocab_file, "w", encoding="utf-8") as writer: + for token, token_index in sorted(self.vocab.items(), key=lambda kv: kv[1]): + if index != token_index: + logger.warning( + f"Saving vocabulary to {vocab_file}: vocabulary indices are not consecutive." + " Please check that the vocabulary is not corrupted!" + ) + index = token_index + writer.write(token + "\n") + index += 1 + return (vocab_file,) + + def create_attention_mask_from_sequences(self, query_ids: List[int], table_values: List[TableValue]) -> List[int]: + """ + Creates the attention mask according to the query token IDs and a list of table values. + + Args: + query_ids (:obj:`List[int]`): list of token IDs corresponding to the ID. 
+ table_values (:obj:`List[TableValue]`): list of table values, which are named tuples containing the + token value, the column ID and the row ID of said token. + + Returns: + :obj:`List[int]`: List of ints containing the attention mask values. + """ + return [1] * (1 + len(query_ids) + 1 + len(table_values)) + + def create_segment_token_type_ids_from_sequences( + self, query_ids: List[int], table_values: List[TableValue] + ) -> List[int]: + """ + Creates the segment token type IDs according to the query token IDs and a list of table values. + + Args: + query_ids (:obj:`List[int]`): list of token IDs corresponding to the query. + table_values (:obj:`List[TableValue]`): list of table values, which are named tuples containing the + token value, the column ID and the row ID of said token. + + Returns: + :obj:`List[int]`: List of ints containing the segment token type ID values. + """ + table_ids = list(zip(*table_values))[0] if table_values else [] + return [0] * (1 + len(query_ids) + 1) + [1] * len(table_ids) + + def create_column_token_type_ids_from_sequences( + self, query_ids: List[int], table_values: List[TableValue] + ) -> List[int]: + """ + Creates the column token type IDs according to the query token IDs and a list of table values. + + Args: + query_ids (:obj:`List[int]`): list of token IDs corresponding to the query. + table_values (:obj:`List[TableValue]`): list of table values, which are named tuples containing the + token value, the column ID and the row ID of said token. + + Returns: + :obj:`List[int]`: List of ints containing the column token type ID values. + """ + table_column_ids = list(zip(*table_values))[1] if table_values else [] + return [0] * (1 + len(query_ids) + 1) + list(table_column_ids) + + def create_row_token_type_ids_from_sequences( + self, query_ids: List[int], table_values: List[TableValue] + ) -> List[int]: + """ + Creates the row token type IDs according to the query token IDs and a list of table values. + + Args: + query_ids (:obj:`List[int]`): list of token IDs corresponding to the query. + table_values (:obj:`List[TableValue]`): list of table values, which are named tuples containing the + token value, the column ID and the row ID of said token. + + Returns: + :obj:`List[int]`: List of ints containing the row token type ID values. + """ + table_row_ids = list(zip(*table_values))[2] if table_values else [] + return [0] * (1 + len(query_ids) + 1) + list(table_row_ids) + + def build_inputs_with_special_tokens( + self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None + ) -> List[int]: + """ + Build model inputs from a question and a flattened table for question answering or sequence classification tasks by concatenating and + adding special tokens. + + Args: + token_ids_0 (:obj:`List[int]`): The ids of the question. + token_ids_1 (:obj:`List[int]`, `optional`): The ids of the flattened table. + + Returns: + :obj:`List[int]`: The model input with special tokens. + """ + if token_ids_1 is None: + raise ValueError("With TAPAS, you must provide both question IDs and table IDs.") + + return [self.cls_token_id] + token_ids_0 + [self.sep_token_id] + token_ids_1 + + def get_special_tokens_mask( + self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False + ) -> List[int]: + """ + Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding + special tokens using the tokenizer ``prepare_for_model`` method.
+ + Args: + token_ids_0 (:obj:`List[int]`): + List of question IDs. + token_ids_1 (:obj:`List[int]`, `optional`): + List of flattened table IDs. + already_has_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`False`): + Whether or not the token list is already formatted with special tokens for the model. + + Returns: + :obj:`List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token. + """ + + if already_has_special_tokens: + if token_ids_1 is not None: + raise ValueError( + "You should not supply a second sequence if the provided sequence of " + "ids is already formatted with special tokens for the model." + ) + return list(map(lambda x: 1 if x in [self.sep_token_id, self.cls_token_id] else 0, token_ids_0)) + + if token_ids_1 is not None: + return [1] + ([0] * len(token_ids_0)) + [1] + ([0] * len(token_ids_1)) + return [1] + ([0] * len(token_ids_0)) + [1] + + @add_end_docstrings(TAPAS_ENCODE_PLUS_ADDITIONAL_KWARGS_DOCSTRING) + def __call__( + self, + table: pd.DataFrame, + queries: Optional[ + Union[ + TextInput, + PreTokenizedInput, + EncodedInput, + List[TextInput], + List[PreTokenizedInput], + List[EncodedInput], + ] + ] = None, + answer_coordinates: Optional[ + Union[ + List[Tuple], + List[List[Tuple]] + ] + ] = None, + answer_text: Optional[ + Union[ + List[TextInput], + List[List[TextInput]] + ] + ] = None, + add_special_tokens: bool = True, + padding: Union[bool, str, PaddingStrategy] = False, + truncation: Union[bool, str, TapasTruncationStrategy] = False, + max_length: Optional[int] = None, + pad_to_multiple_of: Optional[int] = None, + return_tensors: Optional[Union[str, TensorType]] = None, + return_token_type_ids: Optional[bool] = None, + return_attention_mask: Optional[bool] = None, + return_overflowing_tokens: bool = False, + return_special_tokens_mask: bool = False, + return_offsets_mapping: bool = False, + return_length: bool = False, + verbose: bool = True, + **kwargs + ) -> BatchEncoding: + """ + Main method to tokenize and prepare for the model one or several sequence(s) related to a table. + + Args: + table (:obj:`pd.DataFrame`): + Table containing tabular data. Note that all cell values must be text. Use `.astype(str)` on a Pandas dataframe to + convert it to string. + queries (:obj:`str` or :obj:`List[str]`): + Question or batch of questions related to a table to be encoded. Note that + in case of a batch, all questions must refer to the **same** table. + answer_coordinates (:obj:`List[Tuple]` or :obj:`List[List[Tuple]]`, `optional`): + Answer coordinates of each table-question pair in the batch. In case only a single table-question pair + is provided, then the answer_coordinates must be a single list of one or more tuples. Each tuple must be + a (row_index, column_index) pair. The first data row (not the column header row) has index 0. The first column + has index 0. In case a batch of table-question pairs is provided, then the answer_coordinates must be a + list of lists of tuples (each list corresponding to a single table-question pair). + answer_text (:obj:`List[str]` or :obj:`List[List[str]]`, `optional`): + Answer text of each table-question pair in the batch. In case only a single table-question pair + is provided, then the answer_text must be a single list of one or more strings. Each string must be + the answer text of a corresponding answer coordinate. 
In case a batch of table-question pairs is provided, then + the answer_text must be a list of lists of strings (each list corresponding to a single table-question pair). + """ + assert isinstance(table, pd.DataFrame), "Table must be of type pd.DataFrame" + + # Input type checking for clearer error + assert ( + queries is None + or isinstance(queries, str) + or ( + isinstance(queries, (list, tuple)) + and ( + len(queries) == 0 + or ( + isinstance(queries[0], str) + or ( + isinstance(queries[0], (list, tuple)) + and (len(queries[0]) == 0 or isinstance(queries[0][0], str)) + ) + ) + ) + ) + ), ( + "queries input must be of type `str` (single example), `List[str]` (batch or single pretokenized example) " + "or `List[List[str]]` (batch of pretokenized examples)." + ) + + is_batched = isinstance(queries, (list, tuple)) + + if is_batched: + return self.batch_encode_plus( + table=table, + queries=queries, + answer_coordinates=answer_coordinates, + answer_text=answer_text, + add_special_tokens=add_special_tokens, + padding=padding, + truncation=truncation, + max_length=max_length, + pad_to_multiple_of=pad_to_multiple_of, + return_tensors=return_tensors, + return_token_type_ids=return_token_type_ids, + return_attention_mask=return_attention_mask, + return_overflowing_tokens=return_overflowing_tokens, + return_special_tokens_mask=return_special_tokens_mask, + return_offsets_mapping=return_offsets_mapping, + return_length=return_length, + verbose=verbose, + **kwargs, + ) + else: + return self.encode_plus( + table=table, + query=queries, + answer_coordinates=answer_coordinates, + answer_text=answer_text, + add_special_tokens=add_special_tokens, + padding=padding, + truncation=truncation, + max_length=max_length, + pad_to_multiple_of=pad_to_multiple_of, + return_tensors=return_tensors, + return_token_type_ids=return_token_type_ids, + return_attention_mask=return_attention_mask, + return_overflowing_tokens=return_overflowing_tokens, + return_special_tokens_mask=return_special_tokens_mask, + return_offsets_mapping=return_offsets_mapping, + return_length=return_length, + verbose=verbose, + **kwargs, + )
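+ + # Editor's note: a minimal usage sketch of ``__call__`` (illustrative, not part of the original code; + # the checkpoint name is taken from the pretrained vocab map above and is assumed to be available): + # + # import pandas as pd + # from transformers import TapasTokenizer + # tokenizer = TapasTokenizer.from_pretrained("nielsr/tapas-base-finetuned-wtq") + # table = pd.DataFrame({"Actors": ["Brad Pitt", "Leonardo Di Caprio"], "Age": ["56", "45"]}).astype(str) + # inputs = tokenizer(table=table, queries=["How old is Brad Pitt?"], padding="max_length", return_tensors="pt") + # # `inputs` is a BatchEncoding that should contain input_ids, attention_mask and the 7 token type ids + # # described in the class docstring above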
Note that all questions must refer to the **same** table.
+            answer_coordinates (:obj:`List[List[Tuple]]`, `optional`):
+                Answer coordinates of each table-question pair in the batch. Each tuple must be a
+                (row_index, column_index) pair. The first data row (not the column header row) has index 0. The first
+                column has index 0. The answer_coordinates must be a list of lists of tuples (each list corresponding
+                to a single table-question pair).
+            answer_text (:obj:`List[List[str]]`, `optional`):
+                Answer text of each table-question pair in the batch. The answer_text must be a list of lists of
+                strings (each list corresponding to a single table-question pair). Each string must be the answer
+                text of a corresponding answer coordinate.
+        """
+        if return_token_type_ids is not None and not add_special_tokens:
+            raise ValueError(
+                "Asking to return token_type_ids while setting add_special_tokens to False "
+                "results in an undefined behavior. Please set add_special_tokens to True or "
+                "set return_token_type_ids to None."
+            )
+
+        if (answer_coordinates and not answer_text) or (not answer_coordinates and answer_text):
+            raise ValueError("In case you provide answers, both answer_coordinates and answer_text should be provided")
+        elif answer_coordinates is None and answer_text is None:
+            answer_coordinates = answer_text = [None] * len(queries)
+
+        if "is_split_into_words" in kwargs:
+            raise NotImplementedError("Currently TapasTokenizer only supports questions as strings.")
+
+        if return_offsets_mapping:
+            raise NotImplementedError(
+                "return_offset_mapping is not available when using Python tokenizers. "
+                "To use this feature, change your tokenizer to one deriving from "
+                "transformers.PreTrainedTokenizerFast."
+ ) + + return self._batch_encode_plus( + table=table, + queries=queries, + answer_coordinates=answer_coordinates, + answer_text=answer_text, + add_special_tokens=add_special_tokens, + padding=padding, + truncation=truncation, + max_length=max_length, + pad_to_multiple_of=pad_to_multiple_of, + return_tensors=return_tensors, + return_token_type_ids=return_token_type_ids, + return_attention_mask=return_attention_mask, + return_overflowing_tokens=return_overflowing_tokens, + return_special_tokens_mask=return_special_tokens_mask, + return_offsets_mapping=return_offsets_mapping, + return_length=return_length, + verbose=verbose, + **kwargs, + ) + + def _batch_encode_plus( + self, + table, + queries: Union[ + List[TextInput], + List[PreTokenizedInput], + List[EncodedInput], + ], + answer_coordinates: Optional[List[List[Tuple]]] = None, + answer_text: Optional[List[List[TextInput]]] = None, + add_special_tokens: bool = True, + padding: Union[bool, str, PaddingStrategy] = False, + truncation: Union[bool, str, TapasTruncationStrategy] = False, + max_length: Optional[int] = None, + pad_to_multiple_of: Optional[int] = None, + return_tensors: Optional[Union[str, TensorType]] = None, + return_token_type_ids: Optional[bool] = True, + return_attention_mask: Optional[bool] = None, + return_overflowing_tokens: bool = False, + return_special_tokens_mask: bool = False, + return_offsets_mapping: bool = False, + return_length: bool = False, + verbose: bool = True, + **kwargs + ) -> BatchEncoding: + table_tokens = self._tokenize_table(table) + + queries_tokens = [] + for query in queries: + query_tokens = self.tokenize(query) + queries_tokens.append(query_tokens) + + batch_outputs = self._batch_prepare_for_model( + table, + queries, + tokenized_table=table_tokens, + queries_tokens=queries_tokens, + answer_coordinates=answer_coordinates, + padding=padding, + truncation=truncation, + answer_text=answer_text, + add_special_tokens=add_special_tokens, + max_length=max_length, + pad_to_multiple_of=pad_to_multiple_of, + return_tensors=return_tensors, + prepend_batch_axis=True, + return_attention_mask=return_attention_mask, + return_token_type_ids=return_token_type_ids, + return_overflowing_tokens=return_overflowing_tokens, + return_special_tokens_mask=return_special_tokens_mask, + return_length=return_length, + verbose=verbose, + ) + + return BatchEncoding(batch_outputs) + + def _batch_prepare_for_model( + self, + raw_table: pd.DataFrame, + raw_queries: Union[ + List[TextInput], + List[PreTokenizedInput], + List[EncodedInput], + ], + tokenized_table: Optional[TokenizedTable] = None, + queries_tokens: Optional[List[List[str]]] = None, + answer_coordinates: Optional[List[List[Tuple]]] = None, + answer_text: Optional[List[List[TextInput]]] = None, + add_special_tokens: bool = True, + padding: Union[bool, str, PaddingStrategy] = False, + truncation: Union[bool, str, TapasTruncationStrategy] = False, + max_length: Optional[int] = None, + pad_to_multiple_of: Optional[int] = None, + return_tensors: Optional[Union[str, TensorType]] = None, + return_token_type_ids: Optional[bool] = True, + return_attention_mask: Optional[bool] = True, + return_special_tokens_mask: bool = False, + return_offsets_mapping: bool = False, + return_length: bool = False, + verbose: bool = True, + prepend_batch_axis: bool = False, + **kwargs + ) -> BatchEncoding: + batch_outputs = {} + + for index, example in enumerate(zip(raw_queries, queries_tokens, answer_coordinates, answer_text)): + raw_query, query_tokens, answer_coords, answer_txt = example + 
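# Prepare each table-question pair on its own, without padding; padding is
+            # applied over the whole batch afterwards. The previous pair's answer
+            # coordinates/text are forwarded so that prepare_for_model can compute
+            # the prev_label_ids token type used in conversational setups such as SQA.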
+            outputs = self.prepare_for_model(
+                raw_table,
+                raw_query,
+                tokenized_table=tokenized_table,
+                query_tokens=query_tokens,
+                answer_coordinates=answer_coords,
+                answer_text=answer_txt,
+                add_special_tokens=add_special_tokens,
+                padding=PaddingStrategy.DO_NOT_PAD.value,  # we pad in batch afterwards
+                truncation=truncation,
+                max_length=max_length,
+                pad_to_multiple_of=None,  # we pad in batch afterwards
+                return_attention_mask=False,  # we pad in batch afterwards
+                return_token_type_ids=return_token_type_ids,
+                return_special_tokens_mask=return_special_tokens_mask,
+                return_length=return_length,
+                return_tensors=None,  # We convert the whole batch to tensors at the end
+                prepend_batch_axis=False,
+                verbose=verbose,
+                prev_answer_coordinates=answer_coordinates[index - 1] if index != 0 else None,
+                prev_answer_text=answer_text[index - 1] if index != 0 else None,
+            )
+
+            for key, value in outputs.items():
+                if key not in batch_outputs:
+                    batch_outputs[key] = []
+                batch_outputs[key].append(value)
+
+        batch_outputs = self.pad(
+            batch_outputs,
+            padding=padding,
+            max_length=max_length,
+            pad_to_multiple_of=pad_to_multiple_of,
+            return_attention_mask=return_attention_mask,
+        )
+
+        batch_outputs = BatchEncoding(batch_outputs, tensor_type=return_tensors)
+
+        return batch_outputs
+
+    @add_end_docstrings(ENCODE_KWARGS_DOCSTRING)
+    def encode(
+        self,
+        table: pd.DataFrame,
+        query: Optional[
+            Union[
+                TextInput,
+                PreTokenizedInput,
+                EncodedInput,
+            ]
+        ] = None,
+        add_special_tokens: bool = True,
+        padding: Union[bool, str, PaddingStrategy] = False,
+        truncation: Union[bool, str, TapasTruncationStrategy] = False,
+        max_length: Optional[int] = None,
+        return_tensors: Optional[Union[str, TensorType]] = None,
+        **kwargs
+    ) -> List[int]:
+        """
+        Prepare a table and a string for the model. This method does not return token type IDs, attention masks,
+        etc., which are necessary for the model to work correctly. Use this method if you want to build your
+        processing on your own; otherwise, refer to ``__call__``.
+
+        Args:
+            table (:obj:`pd.DataFrame`):
+                Table containing tabular data. Note that all cell values must be text. Use `.astype(str)` on a Pandas
+                dataframe to convert it to string.
+            query (:obj:`str`):
+                Question related to a table to be encoded.
+        """
+        encoded_inputs = self.encode_plus(
+            table,
+            query=query,
+            add_special_tokens=add_special_tokens,
+            padding=padding,
+            truncation=truncation,
+            max_length=max_length,
+            return_tensors=return_tensors,
+            **kwargs,
+        )
+
+        return encoded_inputs["input_ids"]
+
+    @add_end_docstrings(ENCODE_KWARGS_DOCSTRING, TAPAS_ENCODE_PLUS_ADDITIONAL_KWARGS_DOCSTRING)
+    def encode_plus(
+        self,
+        table: pd.DataFrame,
+        query: Optional[
+            Union[
+                TextInput,
+                PreTokenizedInput,
+                EncodedInput,
+            ]
+        ] = None,
+        answer_coordinates: Optional[List[Tuple]] = None,
+        answer_text: Optional[List[TextInput]] = None,
+        add_special_tokens: bool = True,
+        padding: Union[bool, str, PaddingStrategy] = False,
+        truncation: Union[bool, str, TapasTruncationStrategy] = False,
+        max_length: Optional[int] = None,
+        pad_to_multiple_of: Optional[int] = None,
+        return_tensors: Optional[Union[str, TensorType]] = None,
+        return_token_type_ids: Optional[bool] = None,
+        return_attention_mask: Optional[bool] = None,
+        return_special_tokens_mask: bool = False,
+        return_offsets_mapping: bool = False,
+        return_length: bool = False,
+        verbose: bool = True,
+        **kwargs
+    ) -> BatchEncoding:
+        """
+        Prepare a table and a string for the model.
+
+        Args:
+            table (:obj:`pd.DataFrame`):
+                Table containing tabular data. Note that all cell values must be text. Use `.astype(str)` on a Pandas
+                dataframe to convert it to string.
+            query (:obj:`str`):
+                Question related to a table to be encoded.
+            answer_coordinates (:obj:`List[Tuple]`, `optional`):
+                Answer coordinates of the table-question pair. The answer_coordinates must be a single list of one or
+                more tuples. Each tuple must be a (row_index, column_index) pair. The first data row (not the column
+                header row) has index 0. The first column has index 0.
+            answer_text (:obj:`List[str]`, `optional`):
+                Answer text of the table-question pair. The answer_text must be a single list of one or more strings.
+                Each string must be the answer text of a corresponding answer coordinate.
+        """
+        if return_token_type_ids is not None and not add_special_tokens:
+            raise ValueError(
+                "Asking to return token_type_ids while setting add_special_tokens to False "
+                "results in an undefined behavior. Please set add_special_tokens to True or "
+                "set return_token_type_ids to None."
+            )
+
+        if (answer_coordinates and not answer_text) or (not answer_coordinates and answer_text):
+            raise ValueError("In case you provide answers, both answer_coordinates and answer_text should be provided")
+
+        if "is_split_into_words" in kwargs:
+            raise NotImplementedError("Currently TapasTokenizer only supports questions as strings.")
+
+        if return_offsets_mapping:
+            raise NotImplementedError(
+                "return_offset_mapping is not available when using Python tokenizers. "
+                "To use this feature, change your tokenizer to one deriving from "
+                "transformers.PreTrainedTokenizerFast."
+            )
+
+        return self._encode_plus(
+            table=table,
+            query=query,
+            answer_coordinates=answer_coordinates,
+            answer_text=answer_text,
+            add_special_tokens=add_special_tokens,
+            truncation=truncation,
+            padding=padding,
+            max_length=max_length,
+            pad_to_multiple_of=pad_to_multiple_of,
+            return_tensors=return_tensors,
+            return_token_type_ids=return_token_type_ids,
+            return_attention_mask=return_attention_mask,
+            return_special_tokens_mask=return_special_tokens_mask,
+            return_offsets_mapping=return_offsets_mapping,
+            return_length=return_length,
+            verbose=verbose,
+            **kwargs,
+        )
+
+    def _encode_plus(
+        self,
+        table: pd.DataFrame,
+        query: Union[
+            TextInput,
+            PreTokenizedInput,
+            EncodedInput,
+        ],
+        answer_coordinates: Optional[List[Tuple]] = None,
+        answer_text: Optional[List[TextInput]] = None,
+        add_special_tokens: bool = True,
+        padding: Union[bool, str, PaddingStrategy] = False,
+        truncation: Union[bool, str, TapasTruncationStrategy] = False,
+        max_length: Optional[int] = None,
+        pad_to_multiple_of: Optional[int] = None,
+        return_tensors: Optional[Union[str, TensorType]] = None,
+        return_token_type_ids: Optional[bool] = True,
+        return_attention_mask: Optional[bool] = True,
+        return_special_tokens_mask: bool = False,
+        return_offsets_mapping: bool = False,
+        return_length: bool = False,
+        verbose: bool = True,
+        **kwargs
+    ):
+        if query is None:
+            query = ""
+            logger.warning(
+                "TAPAS is a question answering model but you have not passed a query. Please be aware that the "
+                "model will probably not behave correctly."
+            )
+
+        table_tokens = self._tokenize_table(table)
+        query_tokens = self.tokenize(query)
+
+        return self.prepare_for_model(
+            table,
+            query,
+            tokenized_table=table_tokens,
+            query_tokens=query_tokens,
+            answer_coordinates=answer_coordinates,
+            answer_text=answer_text,
+            add_special_tokens=add_special_tokens,
+            truncation=truncation,
+            padding=padding,
+            max_length=max_length,
+            pad_to_multiple_of=pad_to_multiple_of,
+            return_tensors=return_tensors,
+            prepend_batch_axis=True,
+            return_attention_mask=return_attention_mask,
+            return_token_type_ids=return_token_type_ids,
+            return_special_tokens_mask=return_special_tokens_mask,
+            return_length=return_length,
+            verbose=verbose,
+        )
+
+    @add_end_docstrings(ENCODE_KWARGS_DOCSTRING, TAPAS_ENCODE_PLUS_ADDITIONAL_KWARGS_DOCSTRING)
+    def prepare_for_model(
+        self,
+        raw_table: pd.DataFrame,
+        raw_query: Union[
+            TextInput,
+            PreTokenizedInput,
+            EncodedInput,
+        ],
+        tokenized_table: Optional[TokenizedTable] = None,
+        query_tokens: Optional[List[str]] = None,
+        answer_coordinates: Optional[List[Tuple]] = None,
+        answer_text: Optional[List[TextInput]] = None,
+        add_special_tokens: bool = True,
+        padding: Union[bool, str, PaddingStrategy] = False,
+        truncation: Union[bool, str, TapasTruncationStrategy] = False,
+        max_length: Optional[int] = None,
+        pad_to_multiple_of: Optional[int] = None,
+        return_tensors: Optional[Union[str, TensorType]] = None,
+        return_token_type_ids: Optional[bool] = True,
+        return_attention_mask: Optional[bool] = True,
+        return_special_tokens_mask: bool = False,
+        return_offsets_mapping: bool = False,
+        return_length: bool = False,
+        verbose: bool = True,
+        prepend_batch_axis: bool = False,
+        **kwargs
+    ) -> BatchEncoding:
+        """
+        Prepares a sequence of input ids so that it can be used by the model. It adds special tokens and truncates
+        sequences if overflowing, taking the special tokens into account.
+
+        Args:
+            raw_table (:obj:`pd.DataFrame`):
+                The original table before any transformation (like tokenization) was applied to it.
+            raw_query (:obj:`TextInput` or :obj:`PreTokenizedInput` or :obj:`EncodedInput`):
+                The original query before any transformation (like tokenization) was applied to it.
+            tokenized_table (:obj:`TokenizedTable`):
+                The table after tokenization.
+            query_tokens (:obj:`List[str]`):
+                The query after tokenization.
+            answer_coordinates (:obj:`List[Tuple]`, `optional`):
+                Answer coordinates of the table-question pair. The answer_coordinates must be a single list of one or
+                more tuples. Each tuple must be a (row_index, column_index) pair. The first data row (not the column
+                header row) has index 0. The first column has index 0.
+            answer_text (:obj:`List[str]`, `optional`):
+                Answer text of the table-question pair. The answer_text must be a single list of one or more strings.
+                Each string must be the answer text of a corresponding answer coordinate.
+ """ + if isinstance(padding, bool): + if padding and (max_length is not None or pad_to_multiple_of is not None): + padding = PaddingStrategy.MAX_LENGTH + else: + padding = PaddingStrategy.DO_NOT_PAD + elif not isinstance(padding, PaddingStrategy): + padding = PaddingStrategy(padding) + + if isinstance(truncation, bool): + if truncation: + truncation = TapasTruncationStrategy.DROP_ROWS_TO_FIT + else: + truncation = TapasTruncationStrategy.DO_NOT_TRUNCATE + elif not isinstance(truncation, TapasTruncationStrategy): + truncation = TapasTruncationStrategy(truncation) + + encoded_inputs = {} + + is_part_of_batch = False + prev_answer_coordinates, prev_answer_text = None, None + if "prev_answer_coordinates" in kwargs and "prev_answer_text" in kwargs: + is_part_of_batch = True + prev_answer_coordinates = kwargs["prev_answer_coordinates"] + prev_answer_text = kwargs["prev_answer_text"] + + num_rows = self._get_num_rows(raw_table, self.drop_rows_to_fit) + num_columns = self._get_num_columns(raw_table) + _, _, num_tokens = self._get_table_boundaries(tokenized_table) + + if truncation != TapasTruncationStrategy.DO_NOT_TRUNCATE and max_length: + num_rows, num_tokens = self._get_truncated_table_rows(query_tokens, tokenized_table, num_rows, num_columns, + max_length, truncation_strategy=truncation) + table_data = list(self._get_table_values(tokenized_table, num_columns, num_rows, num_tokens)) + + query_ids = self.convert_tokens_to_ids(query_tokens) + table_ids = list(zip(*table_data))[0] if len(table_data) > 0 else list(zip(*table_data)) + table_ids = self.convert_tokens_to_ids(list(table_ids)) + + if "return_overflowing_tokens" in kwargs and kwargs["return_overflowing_tokens"]: + raise ValueError("TAPAS does not return overflowing tokens as it works on tables.") + + if add_special_tokens: + input_ids = self.build_inputs_with_special_tokens(query_ids, table_ids) + else: + input_ids = query_ids + table_ids + + if max_length is not None and len(input_ids) > max_length: + raise ValueError( + "Could not encode the query and table header given the maximum length. 
+                "Could not encode the query and table header given the maximum length. Encoding the query and table "
+                f"header results in a length of {len(input_ids)} which is higher than the max_length of {max_length}"
+            )
+
+        encoded_inputs["input_ids"] = input_ids
+
+        segment_ids = self.create_segment_token_type_ids_from_sequences(query_ids, table_data)
+        column_ids = self.create_column_token_type_ids_from_sequences(query_ids, table_data)
+        row_ids = self.create_row_token_type_ids_from_sequences(query_ids, table_data)
+        if not is_part_of_batch or (prev_answer_coordinates is None and prev_answer_text is None):
+            # simply set the prev_label_ids to zeros
+            prev_label_ids = [0] * len(row_ids)
+        else:
+            prev_label_ids = self.get_answer_ids(
+                column_ids, row_ids, table_data, prev_answer_text, prev_answer_coordinates
+            )
+
+        ### FIRST: parse both the table and question in terms of numeric values
+
+        raw_table = add_numeric_table_values(raw_table)
+        raw_query = add_numeric_values_to_question(raw_query)
+
+        ### SECOND: add numeric-related features (without parsing the numeric values again in these functions):
+
+        column_ranks, inv_column_ranks = self._get_numeric_column_ranks(column_ids, row_ids, raw_table)
+        numeric_relations = self._get_numeric_relations(raw_query, column_ids, row_ids, raw_table)
+
+        # Load from model defaults
+        if return_token_type_ids is None:
+            return_token_type_ids = "token_type_ids" in self.model_input_names
+        if return_attention_mask is None:
+            return_attention_mask = "attention_mask" in self.model_input_names
+
+        if return_attention_mask:
+            attention_mask = self.create_attention_mask_from_sequences(query_ids, table_data)
+            encoded_inputs["attention_mask"] = attention_mask
+
+        if answer_coordinates is not None and answer_text is not None:
+            label_ids = self.get_answer_ids(column_ids, row_ids, table_data, answer_text, answer_coordinates)
+            numeric_values = self._get_numeric_values(raw_table, column_ids, row_ids)
+            numeric_values_scale = self._get_numeric_values_scale(raw_table, column_ids, row_ids)
+
+            encoded_inputs["label_ids"] = label_ids
+            encoded_inputs["numeric_values"] = numeric_values
+            encoded_inputs["numeric_values_scale"] = numeric_values_scale
+
+        if return_token_type_ids:
+            token_type_ids = [
+                segment_ids,
+                column_ids,
+                row_ids,
+                prev_label_ids,
+                column_ranks,
+                inv_column_ranks,
+                numeric_relations,
+            ]
+
+            token_type_ids = [list(ids) for ids in list(zip(*token_type_ids))]
+            encoded_inputs["token_type_ids"] = token_type_ids
+
+        if return_special_tokens_mask:
+            if add_special_tokens:
+                encoded_inputs["special_tokens_mask"] = self.get_special_tokens_mask(query_ids, table_ids)
+            else:
+                encoded_inputs["special_tokens_mask"] = [0] * len(input_ids)
+
+        # Check lengths
+        if max_length is None and len(encoded_inputs["input_ids"]) > self.model_max_length and verbose:
+            if not self.deprecation_warnings.get("sequence-length-is-longer-than-the-specified-maximum", False):
+                logger.warning(
+                    "Token indices sequence length is longer than the specified maximum sequence length "
+                    "for this model ({} > {}). Running this sequence through the model will result in "
+                    "indexing errors".format(len(encoded_inputs["input_ids"]), self.model_max_length)
+                )
+                self.deprecation_warnings["sequence-length-is-longer-than-the-specified-maximum"] = True
+
+        # Padding
+        if padding != PaddingStrategy.DO_NOT_PAD or return_attention_mask:
+            encoded_inputs = self.pad(
+                encoded_inputs,
+                max_length=max_length,
+                padding=padding.value,
+                pad_to_multiple_of=pad_to_multiple_of,
+                return_attention_mask=return_attention_mask,
+            )
+
+        if return_length:
+            encoded_inputs["length"] = len(encoded_inputs["input_ids"])
+
+        batch_outputs = BatchEncoding(
+            encoded_inputs, tensor_type=return_tensors, prepend_batch_axis=prepend_batch_axis
+        )
+
+        return batch_outputs
+
+    def _get_truncated_table_rows(
+        self,
+        query_tokens: List[str],
+        tokenized_table: TokenizedTable,
+        num_rows: int,
+        num_columns: int,
+        max_length: int,
+        truncation_strategy: Union[str, TapasTruncationStrategy],
+    ) -> Tuple[int, int]:
+        """
+        Truncates a sequence pair in-place following the strategy.
+
+        Args:
+            query_tokens (:obj:`List[str]`):
+                List of strings corresponding to the tokenized query.
+            tokenized_table (:obj:`TokenizedTable`):
+                Tokenized table.
+            num_rows (:obj:`int`):
+                Total number of table rows.
+            num_columns (:obj:`int`):
+                Total number of table columns.
+            max_length (:obj:`int`):
+                Total maximum length.
+            truncation_strategy (:obj:`str` or :obj:`~transformers.TapasTruncationStrategy`):
+                Truncation strategy to use. Seeing as this method should only be called when truncating, the only
+                available strategy is the "drop_rows_to_fit" strategy.
+
+        Returns:
+            :obj:`Tuple[int, int]`: tuple containing the number of rows after truncation, and the number of tokens
+            available for each table element.
+        """
+        if not isinstance(truncation_strategy, TapasTruncationStrategy):
+            truncation_strategy = TapasTruncationStrategy(truncation_strategy)
+
+        if truncation_strategy == TapasTruncationStrategy.DROP_ROWS_TO_FIT:
+            while True:
+                num_tokens = self._get_max_num_tokens(
+                    query_tokens,
+                    tokenized_table,
+                    num_rows=num_rows,
+                    num_columns=num_columns,
+                    max_length=max_length,
+                )
+
+                if num_tokens is not None:
+                    # We could fit the table.
+                    break
+
+                # Try to drop a row to fit the table.
+                num_rows -= 1
+
+                if num_rows < 1:
+                    break
+        elif truncation_strategy != TapasTruncationStrategy.DO_NOT_TRUNCATE:
+            raise ValueError(f"Unknown truncation strategy {truncation_strategy}.")
+
+        return num_rows, num_tokens or 1
+
+    def _tokenize_table(
+        self,
+        table=None,
+    ):
+        """
+        Tokenizes column headers and cell texts of a table.
+
+        Args:
+            table (:obj:`pd.DataFrame`):
+                Table to tokenize.
+
+        Returns:
+            :obj:`TokenizedTable`: TokenizedTable object.
+ """ + tokenized_rows = [] + tokenized_row = [] + # tokenize column headers + for column in table: + if self.strip_column_names: + tokenized_row.append(self.tokenize("")) + else: + tokenized_row.append(self.tokenize(column)) + tokenized_rows.append(tokenized_row) + + # tokenize cell values + for idx, row in table.iterrows(): + tokenized_row = [] + for cell in row: + tokenized_row.append(self.tokenize(cell)) + tokenized_rows.append(tokenized_row) + + token_coordinates = [] + for row_index, row in enumerate(tokenized_rows): + for column_index, cell in enumerate(row): + for token_index, _ in enumerate(cell): + token_coordinates.append( + TokenCoordinates( + row_index=row_index, + column_index=column_index, + token_index=token_index, + ) + ) + + return TokenizedTable( + rows=tokenized_rows, + selected_tokens=token_coordinates, + ) + + def _question_encoding_cost(self, question_tokens): + # Two extra spots of SEP and CLS. + return len(question_tokens) + 2 + + def _get_token_budget(self, question_tokens, max_length=None): + """ + Computes the number of tokens left for the table after tokenizing a question, taking into account the max + sequence length of the model. + + Args: + question_tokens (:obj:`List[String]`): + List of question tokens. Returns: :obj:`int`: the number of tokens left for the table, given the model + max length. + """ + return (max_length if max_length is not None else self.model_max_length) - self._question_encoding_cost(question_tokens) + + def _get_table_values(self, table, num_columns, num_rows, num_tokens) -> Generator[TableValue, None, None]: + """Iterates over partial table and returns token, column and row indexes.""" + for tc in table.selected_tokens: + # First row is header row. + if tc.row_index >= num_rows + 1: + continue + if tc.column_index >= num_columns: + continue + cell = table.rows[tc.row_index][tc.column_index] + token = cell[tc.token_index] + word_begin_index = tc.token_index + # Don't add partial words. Find the starting word piece and check if it + # fits in the token budget. 
+            while word_begin_index >= 0 and _is_inner_wordpiece(cell[word_begin_index]):
+                word_begin_index -= 1
+            if word_begin_index >= num_tokens:
+                continue
+            yield TableValue(token, tc.column_index + 1, tc.row_index)
+
+    def _get_table_boundaries(self, table):
+        """Return maximal number of rows, columns and tokens."""
+        max_num_tokens = 0
+        max_num_columns = 0
+        max_num_rows = 0
+        for tc in table.selected_tokens:
+            max_num_columns = max(max_num_columns, tc.column_index + 1)
+            max_num_rows = max(max_num_rows, tc.row_index + 1)
+            max_num_tokens = max(max_num_tokens, tc.token_index + 1)
+        max_num_columns = min(self.max_column_id, max_num_columns)
+        max_num_rows = min(self.max_row_id, max_num_rows)
+        return max_num_rows, max_num_columns, max_num_tokens
+
+    def _get_table_cost(self, table, num_columns, num_rows, num_tokens):
+        return sum(1 for _ in self._get_table_values(table, num_columns, num_rows, num_tokens))
+
+    def _get_max_num_tokens(
+        self,
+        question_tokens,
+        tokenized_table,
+        num_columns,
+        num_rows,
+        max_length,
+    ):
+        """Computes max number of tokens that can be squeezed into the budget."""
+        token_budget = self._get_token_budget(question_tokens, max_length)
+        _, _, max_num_tokens = self._get_table_boundaries(tokenized_table)
+        if self.cell_trim_length >= 0 and max_num_tokens > self.cell_trim_length:
+            max_num_tokens = self.cell_trim_length
+        num_tokens = 0
+        for num_tokens in range(max_num_tokens + 1):
+            cost = self._get_table_cost(tokenized_table, num_columns, num_rows, num_tokens + 1)
+            if cost > token_budget:
+                break
+        if num_tokens < max_num_tokens:
+            if self.cell_trim_length >= 0:
+                # We don't allow dynamic trimming if a cell_trim_length is set.
+                return None
+            if num_tokens == 0:
+                return None
+        return num_tokens
+
+    def _get_num_columns(self, table):
+        num_columns = table.shape[1]
+        if num_columns >= self.max_column_id:
+            raise ValueError("Too many columns")
+        return num_columns
+
+    def _get_num_rows(self, table, drop_rows_to_fit):
+        num_rows = table.shape[0]
+        if num_rows >= self.max_row_id:
+            if drop_rows_to_fit:
+                num_rows = self.max_row_id - 1
+            else:
+                raise ValueError("Too many rows")
+        return num_rows
+
+    def _serialize_text(self, question_tokens):
+        """Serializes texts in index arrays."""
+        tokens = []
+        segment_ids = []
+        column_ids = []
+        row_ids = []
+
+        # add [CLS] token at the beginning
+        tokens.append(self.cls_token)
+        segment_ids.append(0)
+        column_ids.append(0)
+        row_ids.append(0)
+
+        for token in question_tokens:
+            tokens.append(token)
+            segment_ids.append(0)
+            column_ids.append(0)
+            row_ids.append(0)
+
+        return tokens, segment_ids, column_ids, row_ids
+
+    def _serialize(
+        self,
+        question_tokens,
+        table,
+        num_columns,
+        num_rows,
+        num_tokens,
+    ):
+        """Serializes table and text."""
+        tokens, segment_ids, column_ids, row_ids = self._serialize_text(question_tokens)
+
+        # add [SEP] token between question and table tokens
+        tokens.append(self.sep_token)
+        segment_ids.append(0)
+        column_ids.append(0)
+        row_ids.append(0)
+
+        for token, column_id, row_id in self._get_table_values(table, num_columns, num_rows, num_tokens):
+            tokens.append(token)
+            segment_ids.append(1)
+            column_ids.append(column_id)
+            row_ids.append(row_id)
+
+        return SerializedExample(
+            tokens=tokens,
+            segment_ids=segment_ids,
+            column_ids=column_ids,
+            row_ids=row_ids,
+        )
+
+    def _get_column_values(self, table, col_index):
+        table_numeric_values = {}
+        for row_index, row in table.iterrows():
+            cell = row[col_index]
+            if cell.numeric_value is not None:
+                table_numeric_values[row_index] = cell.numeric_value
+        return table_numeric_values
+
+    def _get_cell_token_indexes(self, column_ids, row_ids, column_id, row_id):
+        for index in range(len(column_ids)):
+            if column_ids[index] - 1 == column_id and row_ids[index] - 1 == row_id:
+                yield index
+
+    def _get_numeric_column_ranks(self, column_ids, row_ids, table):
+        """Returns column ranks for all numeric columns."""
+
+        ranks = [0] * len(column_ids)
+        inv_ranks = [0] * len(column_ids)
+
+        # original code from tf_example_utils.py of the original implementation
+        if table is not None:
+            for col_index in range(len(table.columns)):
+                table_numeric_values = self._get_column_values(table, col_index)
+
+                if not table_numeric_values:
+                    continue
+
+                try:
+                    key_fn = get_numeric_sort_key_fn(table_numeric_values.values())
+                except ValueError:
+                    continue
+
+                table_numeric_values = {row_index: key_fn(value) for row_index, value in table_numeric_values.items()}
+
+                table_numeric_values_inv = collections.defaultdict(list)
+                for row_index, value in table_numeric_values.items():
+                    table_numeric_values_inv[value].append(row_index)
+
+                unique_values = sorted(table_numeric_values_inv.keys())
+
+                for rank, value in enumerate(unique_values):
+                    for row_index in table_numeric_values_inv[value]:
+                        for index in self._get_cell_token_indexes(column_ids, row_ids, col_index, row_index):
+                            ranks[index] = rank + 1
+                            inv_ranks[index] = len(unique_values) - rank
+
+        return ranks, inv_ranks
+
+    def _get_numeric_sort_key_fn(self, table_numeric_values, value):
+        """
+        Returns the sort key function for comparing value to table values. The function returned will be a suitable
+        input for the key param of sort(). See number_annotation_utils._get_numeric_sort_key_fn for details.
+
+        Args:
+            table_numeric_values: Numeric values of a column.
+            value: Numeric value in the question.
+
+        Returns:
+            A key function to compare column and question values.
+        """
+        if not table_numeric_values:
+            return None
+        all_values = list(table_numeric_values.values())
+        all_values.append(value)
+        try:
+            return get_numeric_sort_key_fn(all_values)
+        except ValueError:
+            return None
+
+    def _get_numeric_relations(self, question, column_ids, row_ids, table):
+        """
+        Returns numeric relations embeddings.
+
+        Args:
+            question: Question object.
+            column_ids: Maps word piece position to column id.
+            row_ids: Maps word piece position to row id.
+            table: The table containing the numeric cell values.
+        """
+
+        numeric_relations = [0] * len(column_ids)
+
+        # first, we add any numeric value spans to the question:
+        # Create a dictionary that maps a table cell to the set of all relations
+        # this cell has with any value in the question.
+        cell_indices_to_relations = collections.defaultdict(set)
+        if question is not None and table is not None:
+            for numeric_value_span in question.numeric_spans:
+                for value in numeric_value_span.values:
+                    for column_index in range(len(table.columns)):
+                        table_numeric_values = self._get_column_values(table, column_index)
+                        sort_key_fn = self._get_numeric_sort_key_fn(table_numeric_values, value)
+                        if sort_key_fn is None:
+                            continue
+                        for row_index, cell_value in table_numeric_values.items():
+                            relation = get_numeric_relation(value, cell_value, sort_key_fn)
+                            if relation is not None:
+                                cell_indices_to_relations[column_index, row_index].add(relation)
+
+        # For each cell add a special feature for all its word pieces.
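+        # Relations are packed into a bit set per cell: the offset from Relation.EQ
+        # is the bit position, i.e. EQ -> bit 0, LT -> bit 1, GT -> bit 2. For
+        # example, a cell that is EQ to one question value and LT another gets
+        # relation_set_index = 2**0 + 2**1 = 3.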
+        for (column_index, row_index), relations in cell_indices_to_relations.items():
+            relation_set_index = 0
+            for relation in relations:
+                assert relation.value >= Relation.EQ.value
+                relation_set_index += 2 ** (relation.value - Relation.EQ.value)
+            for cell_token_index in self._get_cell_token_indexes(column_ids, row_ids, column_index, row_index):
+                numeric_relations[cell_token_index] = relation_set_index
+
+        return numeric_relations
+
+    def _get_numeric_values(self, table, column_ids, row_ids):
+        """Returns numeric values for computation of answer loss."""
+
+        numeric_values = [float("nan")] * len(column_ids)
+
+        if table is not None:
+            num_rows = table.shape[0]
+            num_columns = table.shape[1]
+
+            for col_index in range(num_columns):
+                for row_index in range(num_rows):
+                    numeric_value = table.iloc[row_index, col_index].numeric_value
+                    if numeric_value is not None:
+                        if numeric_value.float_value is None:
+                            continue
+                        float_value = numeric_value.float_value
+                        if float_value == float("inf"):
+                            continue
+                        for index in self._get_cell_token_indexes(column_ids, row_ids, col_index, row_index):
+                            numeric_values[index] = float_value
+
+        return numeric_values
+
+    def _get_numeric_values_scale(self, table, column_ids, row_ids):
+        """Returns a scale for each token to down-weight the value of long words."""
+
+        numeric_values_scale = [1.0] * len(column_ids)
+
+        if table is None:
+            return numeric_values_scale
+
+        num_rows = table.shape[0]
+        num_columns = table.shape[1]
+
+        for col_index in range(num_columns):
+            for row_index in range(num_rows):
+                indices = list(self._get_cell_token_indexes(column_ids, row_ids, col_index, row_index))
+                num_indices = len(indices)
+                if num_indices > 1:
+                    for index in indices:
+                        numeric_values_scale[index] = float(num_indices)
+
+        return numeric_values_scale
+
+    def _pad_to_seq_length(self, inputs):
+        while len(inputs) > self.model_max_length:
+            inputs.pop()
+        while len(inputs) < self.model_max_length:
+            inputs.append(0)
+
+    def _get_all_answer_ids_from_coordinates(
+        self,
+        column_ids,
+        row_ids,
+        answers_list,
+    ):
+        """Maps lists of answer coordinates to token indexes."""
+        answer_ids = [0] * len(column_ids)
+        found_answers = set()
+        all_answers = set()
+        for answers in answers_list:
+            column_index, row_index = answers
+            all_answers.add((column_index, row_index))
+            for index in self._get_cell_token_indexes(column_ids, row_ids, column_index, row_index):
+                found_answers.add((column_index, row_index))
+                answer_ids[index] = 1
+
+        missing_count = len(all_answers) - len(found_answers)
+        return answer_ids, missing_count
+
+    def _get_all_answer_ids(self, column_ids, row_ids, answer_coordinates):
+        """
+        Maps answer coordinates of a question to token indexes.
+
+        In the SQA format (TSV), the coordinates are given as (row, column) tuples. Here, we first swap them to
+        (column, row) format before calling _get_all_answer_ids_from_coordinates.
+ """ + + def _to_coordinates(answer_coordinates_question): + return [(coords[1], coords[0]) for coords in answer_coordinates_question] + + return self._get_all_answer_ids_from_coordinates( + column_ids, row_ids, answers_list=(_to_coordinates(answer_coordinates)) + ) + + def _find_tokens(self, text, segment): + """Return start index of segment in text or None.""" + logging.info("text: %s %s", text, segment) + for index in range(1 + len(text) - len(segment)): + for seg_index, seg_token in enumerate(segment): + if text[index + seg_index].piece != seg_token.piece: + break + else: + return index + return None + + def _find_answer_coordinates_from_answer_text( + self, + tokenized_table, + answer_text, + ): + """Returns all occurrences of answer_text in the table.""" + logging.info("answer text: %s", answer_text) + for row_index, row in enumerate(tokenized_table.rows): + if row_index == 0: + # We don't search for answers in the header. + continue + for col_index, cell in enumerate(row): + token_index = self._find_tokens(cell, answer_text) + if token_index is not None: + yield TokenCoordinates( + row_index=row_index, + column_index=col_index, + token_index=token_index, + ) + + def _find_answer_ids_from_answer_texts( + self, + column_ids, + row_ids, + tokenized_table, + answer_texts, + ): + """Maps question with answer texts to the first matching token indexes.""" + answer_ids = [0] * len(column_ids) + for answer_text in answer_texts: + for coordinates in self._find_answer_coordinates_from_answer_text( + tokenized_table, + answer_text, + ): + # Maps answer coordinates to indexes this can fail if tokens / rows have + # been pruned. + indexes = list( + self._get_cell_token_indexes( + column_ids, + row_ids, + column_id=coordinates.column_index, + row_id=coordinates.row_index - 1, + ) + ) + indexes.sort() + coordinate_answer_ids = [] + if indexes: + begin_index = coordinates.token_index + indexes[0] + end_index = begin_index + len(answer_text) + for index in indexes: + if index >= begin_index and index < end_index: + coordinate_answer_ids.append(index) + if len(coordinate_answer_ids) == len(answer_text): + for index in coordinate_answer_ids: + answer_ids[index] = 1 + break + return answer_ids + + def _get_answer_ids(self, column_ids, row_ids, answer_coordinates): + """Maps answer coordinates of a question to token indexes.""" + answer_ids, missing_count = self._get_all_answer_ids(column_ids, row_ids, answer_coordinates) + + if missing_count: + raise ValueError("Couldn't find all answers") + return answer_ids + + def get_answer_ids( + self, column_ids, row_ids, tokenized_table, answer_texts_question, answer_coordinates_question + ): + if self.update_answer_coordinates: + return self._find_answer_ids_from_answer_texts( + column_ids, + row_ids, + tokenized_table, + answer_texts=[ + self.tokenize(at) + for at in answer_texts_question + ], + ) + return self._get_answer_ids(column_ids, row_ids, answer_coordinates_question) + + def _pad( + self, + encoded_inputs: Union[Dict[str, EncodedInput], BatchEncoding], + max_length: Optional[int] = None, + padding_strategy: PaddingStrategy = PaddingStrategy.DO_NOT_PAD, + pad_to_multiple_of: Optional[int] = None, + return_attention_mask: Optional[bool] = None, + ) -> dict: + """ + Pad encoded inputs (on left/right and up to predefined length or max length in the batch) + + Args: + encoded_inputs: Dictionary of tokenized inputs (`List[int]`) or batch of tokenized inputs (`List[List[int]]`). 
+            max_length: maximum length of the returned list and optionally padding length (see below).
+                Will truncate by taking into account the special tokens.
+            padding_strategy: PaddingStrategy to use for padding.
+
+                - PaddingStrategy.LONGEST: Pad to the longest sequence in the batch
+                - PaddingStrategy.MAX_LENGTH: Pad to the max length (default)
+                - PaddingStrategy.DO_NOT_PAD: Do not pad
+                The tokenizer padding sides are defined in self.padding_side:
+
+                - 'left': pads on the left of the sequences
+                - 'right': pads on the right of the sequences
+            pad_to_multiple_of: (optional) Integer if set will pad the sequence to a multiple of the provided value.
+                This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability
+                >= 7.5 (Volta).
+            return_attention_mask: (optional) Set to False to avoid returning attention mask (default: set to model specifics)
+        """
+        # Load from model defaults
+        if return_attention_mask is None:
+            return_attention_mask = "attention_mask" in self.model_input_names
+
+        if padding_strategy == PaddingStrategy.LONGEST:
+            max_length = len(encoded_inputs["input_ids"])
+
+        if max_length is not None and pad_to_multiple_of is not None and (max_length % pad_to_multiple_of != 0):
+            max_length = ((max_length // pad_to_multiple_of) + 1) * pad_to_multiple_of
+
+        needs_to_be_padded = (
+            padding_strategy != PaddingStrategy.DO_NOT_PAD and len(encoded_inputs["input_ids"]) != max_length
+        )
+
+        if needs_to_be_padded:
+            difference = max_length - len(encoded_inputs["input_ids"])
+            if self.padding_side == "right":
+                if return_attention_mask:
+                    encoded_inputs["attention_mask"] = [1] * len(encoded_inputs["input_ids"]) + [0] * difference
+                if "token_type_ids" in encoded_inputs:
+                    encoded_inputs["token_type_ids"] = (
+                        encoded_inputs["token_type_ids"] + [[self.pad_token_type_id] * 7] * difference
+                    )
+                if "label_ids" in encoded_inputs:
+                    encoded_inputs["label_ids"] = encoded_inputs["label_ids"] + [0] * difference
+                if "numeric_values" in encoded_inputs:
+                    encoded_inputs["numeric_values"] = encoded_inputs["numeric_values"] + [float("nan")] * difference
+                if "numeric_values_scale" in encoded_inputs:
+                    encoded_inputs["numeric_values_scale"] = (
+                        encoded_inputs["numeric_values_scale"] + [1.0] * difference
+                    )
+                if "special_tokens_mask" in encoded_inputs:
+                    encoded_inputs["special_tokens_mask"] = encoded_inputs["special_tokens_mask"] + [1] * difference
+                encoded_inputs["input_ids"] = encoded_inputs["input_ids"] + [self.pad_token_id] * difference
+            elif self.padding_side == "left":
+                if return_attention_mask:
+                    encoded_inputs["attention_mask"] = [0] * difference + [1] * len(encoded_inputs["input_ids"])
+                if "token_type_ids" in encoded_inputs:
+                    encoded_inputs["token_type_ids"] = (
+                        [[self.pad_token_type_id] * 7] * difference + encoded_inputs["token_type_ids"]
+                    )
+                if "label_ids" in encoded_inputs:
+                    encoded_inputs["label_ids"] = [0] * difference + encoded_inputs["label_ids"]
+                if "numeric_values" in encoded_inputs:
+                    encoded_inputs["numeric_values"] = [float("nan")] * difference + encoded_inputs["numeric_values"]
+                if "numeric_values_scale" in encoded_inputs:
+                    encoded_inputs["numeric_values_scale"] = (
+                        [1.0] * difference + encoded_inputs["numeric_values_scale"]
+                    )
+                if "special_tokens_mask" in encoded_inputs:
+                    encoded_inputs["special_tokens_mask"] = [1] * difference + encoded_inputs["special_tokens_mask"]
+                encoded_inputs["input_ids"] = [self.pad_token_id] * difference + encoded_inputs["input_ids"]
+            else:
+                raise ValueError("Invalid padding strategy: " + str(self.padding_side))
+        else:
+            if return_attention_mask:
+                encoded_inputs["attention_mask"] = [1] * len(encoded_inputs["input_ids"])
+
+        return encoded_inputs
+
+    #### Everything related to converting logits to predictions ####
+
+    def _get_cell_token_probs(self, probabilities, segment_ids, row_ids, column_ids):
+        for i, p in enumerate(probabilities):
+            segment_id = segment_ids[i]
+            col = column_ids[i] - 1
+            row = row_ids[i] - 1
+            if col >= 0 and row >= 0 and segment_id == 1:
+                yield i, p
+
+    def _get_mean_cell_probs(self, probabilities, segment_ids, row_ids, column_ids):
+        """Computes average probability per cell, aggregating over tokens."""
+        coords_to_probs = collections.defaultdict(list)
+        for i, prob in self._get_cell_token_probs(probabilities, segment_ids, row_ids, column_ids):
+            col = column_ids[i] - 1
+            row = row_ids[i] - 1
+            coords_to_probs[(col, row)].append(prob)
+        return {coords: torch.as_tensor(cell_probs).mean() for coords, cell_probs in coords_to_probs.items()}
+
+    def convert_logits_to_predictions(
+        self, data, logits, logits_agg=None, cell_classification_threshold=0.5
+    ):
+        """
+        Converts logits of :class:`~transformers.TapasForQuestionAnswering` to actual predicted answer coordinates
+        and optional aggregation indices.
+
+        Args:
+            data (:obj:`dict`):
+                Dictionary mapping features to actual values. Should be created using
+                :class:`~transformers.TapasTokenizer`.
+            logits (:obj:`torch.FloatTensor` of shape ``(batch_size, sequence_length)``):
+                Tensor containing the logits at the token level.
+            logits_agg (:obj:`torch.FloatTensor` of shape ``(batch_size, num_aggregation_labels)``, `optional`):
+                Tensor containing the aggregation logits.
+            cell_classification_threshold (:obj:`float`, `optional`, defaults to 0.5):
+                Threshold to be used for cell selection. All table cells for which their probability is larger than
+                this threshold will be selected.
+
+        Returns:
+            :obj:`tuple` comprising various elements depending on the inputs:
+                predicted_answer_coordinates (``List[List[Tuple]]`` of length ``batch_size``):
+                    Predicted answer coordinates as a list of lists of tuples. Each element in the list contains the
+                    predicted answer coordinates of a single example in the batch, as a list of tuples. Each tuple is
+                    a cell, i.e. (row index, column index).
+                predicted_aggregation_indices (``List[int]`` of length ``batch_size``, `optional`, returned when ``logits_agg`` is provided):
+                    Predicted aggregation operator indices of the aggregation head.
+        """
+        # compute probabilities from token logits
+        dist_per_token = torch.distributions.Bernoulli(logits=logits)
+        probabilities = dist_per_token.probs * data["attention_mask"].type(torch.float32).to(
+            dist_per_token.probs.device
+        )
+
+        token_types = [
+            "segment_ids",
+            "column_ids",
+            "row_ids",
+            "prev_label_ids",
+            "column_ranks",
+            "inv_column_ranks",
+            "numeric_relations",
+        ]
+
+        # collect input_ids, segment ids, row ids and column ids of batch.
+        # Shape: (batch_size, seq_len).
+        input_ids = data["input_ids"]
+        segment_ids = data["token_type_ids"][:, :, token_types.index("segment_ids")]
+        row_ids = data["token_type_ids"][:, :, token_types.index("row_ids")]
+        column_ids = data["token_type_ids"][:, :, token_types.index("column_ids")]
+
+        # next, get answer coordinates for every example in the batch
+        num_batch = input_ids.shape[0]
+        predicted_answer_coordinates = []
+        for i in range(num_batch):
+            probabilities_example = probabilities[i].tolist()
+            segment_ids_example = segment_ids[i]
+            row_ids_example = row_ids[i]
+            column_ids_example = column_ids[i]
+
+            max_width = column_ids_example.max()
+            max_height = row_ids_example.max()
+
+            if max_width == 0 and max_height == 0:
+                continue
+
+            cell_coords_to_prob = self._get_mean_cell_probs(
+                probabilities_example,
+                segment_ids_example.tolist(),
+                row_ids_example.tolist(),
+                column_ids_example.tolist(),
+            )
+
+            # Select the answers above the classification threshold.
+            answer_coordinates = []
+            for col in range(max_width):
+                for row in range(max_height):
+                    cell_prob = cell_coords_to_prob.get((col, row), None)
+                    if cell_prob is not None:
+                        if cell_prob > cell_classification_threshold:
+                            answer_coordinates.append((row, col))
+            answer_coordinates = sorted(answer_coordinates)
+            predicted_answer_coordinates.append(answer_coordinates)
+
+        output = predicted_answer_coordinates
+
+        if logits_agg is not None:
+            predicted_aggregation_indices = logits_agg.argmax(dim=-1)
+            output = (output, predicted_aggregation_indices.tolist())
+
+        return output
+
+    #### End of everything related to converting logits to predictions ####
+
+
+# Copied from transformers.models.bert.tokenization_bert.BasicTokenizer
+class BasicTokenizer(object):
+    """
+    Constructs a BasicTokenizer that will run basic tokenization (punctuation splitting, lower casing, etc.).
+
+    Args:
+        do_lower_case (:obj:`bool`, `optional`, defaults to :obj:`True`):
+            Whether or not to lowercase the input when tokenizing.
+        never_split (:obj:`Iterable`, `optional`):
+            Collection of tokens which will never be split during tokenization. Only has an effect when
+            :obj:`do_basic_tokenize=True`.
+        tokenize_chinese_chars (:obj:`bool`, `optional`, defaults to :obj:`True`):
+            Whether or not to tokenize Chinese characters. This should likely be deactivated for Japanese (see this
+            `issue <https://github.com/huggingface/transformers/issues/328>`__).
+        strip_accents: (:obj:`bool`, `optional`):
+            Whether or not to strip all accents. If this option is not specified, then it will be determined by the
+            value for :obj:`lowercase` (as in the original BERT).
+    """
+
+    def __init__(self, do_lower_case=True, never_split=None, tokenize_chinese_chars=True, strip_accents=None):
+        if never_split is None:
+            never_split = []
+        self.do_lower_case = do_lower_case
+        self.never_split = set(never_split)
+        self.tokenize_chinese_chars = tokenize_chinese_chars
+        self.strip_accents = strip_accents
+
+    def tokenize(self, text, never_split=None):
+        """
+        Basic Tokenization of a piece of text. Split on "white spaces" only, for sub-word tokenization, see
+        WordPieceTokenizer.
+
+        Args:
+            **never_split**: (`optional`) list of str
+                Kept for backward compatibility purposes. Now implemented directly at the base class level (see
+                :func:`PreTrainedTokenizer.tokenize`). List of tokens not to split.
+        """
+        # union() returns a new set by concatenating the two sets.
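+        # In other words, tokens passed via never_split at call time extend the
+        # instance-level set rather than replacing it, so e.g. both "[CLS]" and a
+        # caller-supplied marker stay unsplit.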
+        never_split = self.never_split.union(set(never_split)) if never_split else self.never_split
+        text = self._clean_text(text)
+
+        # This was added on November 1st, 2018 for the multilingual and Chinese
+        # models. This is also applied to the English models now, but it doesn't
+        # matter since the English models were not trained on any Chinese data
+        # and generally don't have any Chinese data in them (there are Chinese
+        # characters in the vocabulary because Wikipedia does have some Chinese
+        # words in the English Wikipedia.).
+        if self.tokenize_chinese_chars:
+            text = self._tokenize_chinese_chars(text)
+        orig_tokens = whitespace_tokenize(text)
+        split_tokens = []
+        for token in orig_tokens:
+            if token not in never_split:
+                if self.do_lower_case:
+                    token = token.lower()
+                    if self.strip_accents is not False:
+                        token = self._run_strip_accents(token)
+                elif self.strip_accents:
+                    token = self._run_strip_accents(token)
+            split_tokens.extend(self._run_split_on_punc(token, never_split))
+
+        output_tokens = whitespace_tokenize(" ".join(split_tokens))
+        return output_tokens
+
+    def _run_strip_accents(self, text):
+        """Strips accents from a piece of text."""
+        text = unicodedata.normalize("NFD", text)
+        output = []
+        for char in text:
+            cat = unicodedata.category(char)
+            if cat == "Mn":
+                continue
+            output.append(char)
+        return "".join(output)
+
+    def _run_split_on_punc(self, text, never_split=None):
+        """Splits punctuation on a piece of text."""
+        if never_split is not None and text in never_split:
+            return [text]
+        chars = list(text)
+        i = 0
+        start_new_word = True
+        output = []
+        while i < len(chars):
+            char = chars[i]
+            if _is_punctuation(char):
+                output.append([char])
+                start_new_word = True
+            else:
+                if start_new_word:
+                    output.append([])
+                start_new_word = False
+                output[-1].append(char)
+            i += 1
+
+        return ["".join(x) for x in output]
+
+    def _tokenize_chinese_chars(self, text):
+        """Adds whitespace around any CJK character."""
+        output = []
+        for char in text:
+            cp = ord(char)
+            if self._is_chinese_char(cp):
+                output.append(" ")
+                output.append(char)
+                output.append(" ")
+            else:
+                output.append(char)
+        return "".join(output)
+
+    def _is_chinese_char(self, cp):
+        """Checks whether CP is the codepoint of a CJK character."""
+        # This defines a "chinese character" as anything in the CJK Unicode block:
+        #     https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block)
+        #
+        # Note that the CJK Unicode block is NOT all Japanese and Korean characters,
+        # despite its name. The modern Korean Hangul alphabet is a different block,
+        # as is Japanese Hiragana and Katakana. Those alphabets are used to write
+        # space-separated words, so they are not treated specially and handled
+        # like all of the other languages.
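+        # For example, ord("中") == 0x4E2D falls in the base CJK block matched
+        # below, while Hiragana "あ" (0x3042) does not and is handled by the
+        # regular whitespace/punctuation logic.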
+        if (
+            (cp >= 0x4E00 and cp <= 0x9FFF)
+            or (cp >= 0x3400 and cp <= 0x4DBF)  #
+            or (cp >= 0x20000 and cp <= 0x2A6DF)  #
+            or (cp >= 0x2A700 and cp <= 0x2B73F)  #
+            or (cp >= 0x2B740 and cp <= 0x2B81F)  #
+            or (cp >= 0x2B820 and cp <= 0x2CEAF)  #
+            or (cp >= 0xF900 and cp <= 0xFAFF)
+            or (cp >= 0x2F800 and cp <= 0x2FA1F)  #
+        ):  #
+            return True
+
+        return False
+
+    def _clean_text(self, text):
+        """Performs invalid character removal and whitespace cleanup on text."""
+        output = []
+        for char in text:
+            cp = ord(char)
+            if cp == 0 or cp == 0xFFFD or _is_control(char):
+                continue
+            if _is_whitespace(char):
+                output.append(" ")
+            else:
+                output.append(char)
+        return "".join(output)
+
+
+# Copied from transformers.models.bert.tokenization_bert.WordpieceTokenizer
+class WordpieceTokenizer(object):
+    """Runs WordPiece tokenization."""
+
+    def __init__(self, vocab, unk_token, max_input_chars_per_word=100):
+        self.vocab = vocab
+        self.unk_token = unk_token
+        self.max_input_chars_per_word = max_input_chars_per_word
+
+    def tokenize(self, text):
+        """
+        Tokenizes a piece of text into its word pieces. This uses a greedy longest-match-first algorithm to perform
+        tokenization using the given vocabulary. For example, :obj:`input = "unaffable"` will return as output
+        :obj:`["un", "##aff", "##able"]`.
+
+        Args:
+            text: A single token or whitespace separated tokens. This should have
+                already been passed through `BasicTokenizer`.
+
+        Returns:
+            A list of wordpiece tokens.
+        """
+
+        output_tokens = []
+        for token in whitespace_tokenize(text):
+            chars = list(token)
+            if len(chars) > self.max_input_chars_per_word:
+                output_tokens.append(self.unk_token)
+                continue
+
+            is_bad = False
+            start = 0
+            sub_tokens = []
+            while start < len(chars):
+                end = len(chars)
+                cur_substr = None
+                while start < end:
+                    substr = "".join(chars[start:end])
+                    if start > 0:
+                        substr = "##" + substr
+                    if substr in self.vocab:
+                        cur_substr = substr
+                        break
+                    end -= 1
+                if cur_substr is None:
+                    is_bad = True
+                    break
+                sub_tokens.append(cur_substr)
+                start = end
+
+            if is_bad:
+                output_tokens.append(self.unk_token)
+            else:
+                output_tokens.extend(sub_tokens)
+        return output_tokens
+
+
+# Below: utilities for TAPAS tokenizer (independent from PyTorch/Tensorflow).
+# This includes functions to parse numeric values (dates and numbers) from both the table and questions in order
+# to create the column_ranks, inv_column_ranks, numeric_values, numeric_values_scale and numeric_relations in
+# prepare_for_model of TapasTokenizer.
+# These are meant to be used in an academic setup; for production use cases, Gold mine or Aqua should be used.
+
+
+# taken from constants.py of the original implementation
+# URL: https://github.com/google-research/tapas/blob/master/tapas/utils/constants.py
+class Relation(enum.Enum):
+    HEADER_TO_CELL = 1  # Connects header to cell.
+    CELL_TO_HEADER = 2  # Connects cell to header.
+    QUERY_TO_HEADER = 3  # Connects query to headers.
+    QUERY_TO_CELL = 4  # Connects query to cells.
+    ROW_TO_CELL = 5  # Connects row to cells.
+    CELL_TO_ROW = 6  # Connects cells to row.
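+    # The three relations below compare a numeric value parsed from the question
+    # against a numeric cell value; their offsets from EQ serve as bit positions
+    # when packing relation sets in TapasTokenizer._get_numeric_relations.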
+    EQ = 7  # Annotation value is same as cell value.
+    LT = 8  # Annotation value is less than cell value.
+    GT = 9  # Annotation value is greater than cell value.
+
+
+@dataclass
+class Date:
+    year: Optional[int] = None
+    month: Optional[int] = None
+    day: Optional[int] = None
+
+
+@dataclass
+class NumericValue:
+    float_value: Optional[float] = None
+    date: Optional[Date] = None
+
+
+@dataclass
+class NumericValueSpan:
+    begin_index: Optional[int] = None
+    end_index: Optional[int] = None
+    values: Optional[List[NumericValue]] = None
+
+
+@dataclass
+class Cell:
+    text: Text
+    numeric_value: Optional[NumericValue] = None
+
+
+@dataclass
+class Question:
+    original_text: Text  # The original raw question string.
+    text: Text  # The question string after normalization.
+    numeric_spans: Optional[List[NumericValueSpan]] = None
+
+
+# Below: all functions from number_utils.py as well as 2 functions (namely get_all_spans and normalize_for_match)
+# from text_utils.py of the original implementation. URLs:
+# - https://github.com/google-research/tapas/blob/master/tapas/utils/number_utils.py
+# - https://github.com/google-research/tapas/blob/master/tapas/utils/text_utils.py
+
+
+# Constants for parsing date expressions.
+# Masks that specify (by a bool) which of (year, month, day) will be populated.
+_DateMask = collections.namedtuple("_DateMask", ["year", "month", "day"])
+
+_YEAR = _DateMask(True, False, False)
+_YEAR_MONTH = _DateMask(True, True, False)
+_YEAR_MONTH_DAY = _DateMask(True, True, True)
+_MONTH = _DateMask(False, True, False)
+_MONTH_DAY = _DateMask(False, True, True)
+
+# Pairs of patterns to pass to 'datetime.strptime' and masks specifying which
+# fields will be set by the corresponding pattern.
+_DATE_PATTERNS = (
+    ("%B", _MONTH),
+    ("%Y", _YEAR),
+    ("%Ys", _YEAR),
+    ("%b %Y", _YEAR_MONTH),
+    ("%B %Y", _YEAR_MONTH),
+    ("%B %d", _MONTH_DAY),
+    ("%b %d", _MONTH_DAY),
+    ("%d %b", _MONTH_DAY),
+    ("%d %B", _MONTH_DAY),
+    ("%B %d, %Y", _YEAR_MONTH_DAY),
+    ("%d %B %Y", _YEAR_MONTH_DAY),
+    ("%m-%d-%Y", _YEAR_MONTH_DAY),
+    ("%Y-%m-%d", _YEAR_MONTH_DAY),
+    ("%Y-%m", _YEAR_MONTH),
+    ("%B %Y", _YEAR_MONTH),
+    ("%d %b %Y", _YEAR_MONTH_DAY),
+    ("%Y-%m-%d", _YEAR_MONTH_DAY),
+    ("%b %d, %Y", _YEAR_MONTH_DAY),
+    ("%d.%m.%Y", _YEAR_MONTH_DAY),
+    ("%A, %b %d", _MONTH_DAY),
+    ("%A, %B %d", _MONTH_DAY),
+)
+
+# This mapping is used to convert date patterns to regex patterns.
+_FIELD_TO_REGEX = (
+    ("%A", r"\w+"),  # Weekday as locale’s full name.
+    ("%B", r"\w+"),  # Month as locale’s full name.
+    ("%Y", r"\d{4}"),  # Year with century as a decimal number.
+    ("%b", r"\w{3}"),  # Month as locale’s abbreviated name.
+    ("%d", r"\d{1,2}"),  # Day of the month as a zero-padded decimal number.
+    ("%m", r"\d{1,2}"),  # Month as a zero-padded decimal number.
+)
+
+
+def _process_date_pattern(dp):
+    """Compute a regex for each date pattern to use as a prefilter."""
+    pattern, mask = dp
+    regex = pattern
+    regex = regex.replace(".", re.escape("."))
+    regex = regex.replace("-", re.escape("-"))
+    regex = regex.replace(" ", r"\s+")
+    for field, field_regex in _FIELD_TO_REGEX:
+        regex = regex.replace(field, field_regex)
+    # Make sure we didn't miss any of the fields.
+    assert "%" not in regex, regex
+    return pattern, mask, re.compile("^" + regex + "$")
+
+
+def _process_date_patterns():
+    return tuple(_process_date_pattern(dp) for dp in _DATE_PATTERNS)
+
+
+_PROCESSED_DATE_PATTERNS = _process_date_patterns()
+
+_MAX_DATE_NGRAM_SIZE = 5
+
+# Following DynSp:
+# https://github.com/Microsoft/DynSP/blob/master/util.py#L414.
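+# Cardinal words that parse_text maps to their float values (e.g. "three" -> 3.0);
+# the ordinal words below map the same way (e.g. "third" -> 3.0).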
+_NUMBER_WORDS = [
+    "zero",
+    "one",
+    "two",
+    "three",
+    "four",
+    "five",
+    "six",
+    "seven",
+    "eight",
+    "nine",
+    "ten",
+    "eleven",
+    "twelve",
+]
+
+_ORDINAL_WORDS = [
+    "zeroth",
+    "first",
+    "second",
+    "third",
+    "fourth",
+    "fifth",
+    "sixth",
+    "seventh",
+    "eighth",
+    "ninth",
+    "tenth",
+    "eleventh",
+    "twelfth",
+]
+
+_ORDINAL_SUFFIXES = ["st", "nd", "rd", "th"]
+
+_NUMBER_PATTERN = re.compile(r"((^|\s)[+-])?((\.\d+)|(\d+(,\d\d\d)*(\.\d*)?))")
+
+# Following DynSp:
+# https://github.com/Microsoft/DynSP/blob/master/util.py#L293.
+_MIN_YEAR = 1700
+_MAX_YEAR = 2016
+
+_INF = float("INF")
+
+
+def _get_numeric_value_from_date(date, mask):
+    """Converts date (datetime Python object) to a NumericValue object with a Date object value."""
+    if date.year < _MIN_YEAR or date.year > _MAX_YEAR:
+        raise ValueError("Invalid year: %d" % date.year)
+
+    new_date = Date()
+    if mask.year:
+        new_date.year = date.year
+    if mask.month:
+        new_date.month = date.month
+    if mask.day:
+        new_date.day = date.day
+    return NumericValue(date=new_date)
+
+
+def _get_span_length_key(span):
+    """Sorts spans by decreasing length first and increasing first index second."""
+    return span[1] - span[0], -span[0]
+
+
+def _get_numeric_value_from_float(value):
+    """Converts float (Python) to a NumericValue object with a float value."""
+    return NumericValue(float_value=value)
+
+
+# Doesn't parse ordinal expressions such as '18th of february 1655'.
+def _parse_date(text):
+    """Attempts to format a text as a standard date string (yyyy-mm-dd)."""
+    text = re.sub(r"Sept\b", "Sep", text)
+    for in_pattern, mask, regex in _PROCESSED_DATE_PATTERNS:
+        if not regex.match(text):
+            continue
+        try:
+            date = datetime.datetime.strptime(text, in_pattern).date()
+        except ValueError:
+            continue
+        try:
+            return _get_numeric_value_from_date(date, mask)
+        except ValueError:
+            continue
+    return None
+
+
+def _parse_number(text):
+    """Parses simple cardinal and ordinal numbers."""
+    for suffix in _ORDINAL_SUFFIXES:
+        if text.endswith(suffix):
+            text = text[: -len(suffix)]
+            break
+    text = text.replace(",", "")
+    try:
+        value = float(text)
+    except ValueError:
+        return None
+    if math.isnan(value):
+        return None
+    if value == _INF:
+        return None
+    return value
+
+
+def get_all_spans(text, max_ngram_length):
+    """
+    Split a text into all possible ngrams up to 'max_ngram_length'. Split points are white space and punctuation.
+
+    Args:
+        text: Text to split.
+        max_ngram_length: maximal ngram length.
+    Yields:
+        Spans, tuples of begin-end index.
+    """
+    start_indexes = []
+    for index, char in enumerate(text):
+        if not char.isalnum():
+            continue
+        if index == 0 or not text[index - 1].isalnum():
+            start_indexes.append(index)
+        if index + 1 == len(text) or not text[index + 1].isalnum():
+            for start_index in start_indexes[-max_ngram_length:]:
+                yield start_index, index + 1
+
+
+def normalize_for_match(text):
+    return " ".join(text.lower().split())
+
+
+def format_text(text):
+    """Lowercases and strips punctuation."""
+    text = text.lower().strip()
+    if text == "n/a" or text == "?" or text == "nan":
+        text = EMPTY_TEXT
+
+    text = re.sub(r"[^\w\d]+", " ", text).replace("_", " ")
+    text = " ".join(text.split())
+    text = text.strip()
+    if text:
+        return text
+    return EMPTY_TEXT
+
+
+def parse_text(text):
+    """
+    Extracts longest number and date spans.
+
+    Args:
+        text: text to annotate.
+
+    Returns:
+        List of longest numeric value spans.
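+
+    For example (an illustrative call): parse_text("born in august 2007") yields a single
+    NumericValueSpan covering "august 2007", whose value is a NumericValue holding a
+    Date with year=2007 and month=8; the shorter contained spans "august" and "2007" are dropped.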
+ """ + span_dict = collections.defaultdict(list) + for match in _NUMBER_PATTERN.finditer(text): + span_text = text[match.start() : match.end()] + number = _parse_number(span_text) + if number is not None: + span_dict[match.span()].append(_get_numeric_value_from_float(number)) + + for begin_index, end_index in get_all_spans(text, max_ngram_length=1): + if (begin_index, end_index) in span_dict: + continue + span_text = text[begin_index:end_index] + + number = _parse_number(span_text) + if number is not None: + span_dict[begin_index, end_index].append(_get_numeric_value_from_float(number)) + for number, word in enumerate(_NUMBER_WORDS): + if span_text == word: + span_dict[begin_index, end_index].append(_get_numeric_value_from_float(float(number))) + break + for number, word in enumerate(_ORDINAL_WORDS): + if span_text == word: + span_dict[begin_index, end_index].append(_get_numeric_value_from_float(float(number))) + break + + for begin_index, end_index in get_all_spans(text, max_ngram_length=_MAX_DATE_NGRAM_SIZE): + span_text = text[begin_index:end_index] + date = _parse_date(span_text) + if date is not None: + span_dict[begin_index, end_index].append(date) + + spans = sorted(span_dict.items(), key=lambda span_value: _get_span_length_key(span_value[0]), reverse=True) + selected_spans = [] + for span, value in spans: + for selected_span, _ in selected_spans: + if selected_span[0] <= span[0] and span[1] <= selected_span[1]: + break + else: + selected_spans.append((span, value)) + + selected_spans.sort(key=lambda span_value: span_value[0][0]) + + numeric_value_spans = [] + for span, values in selected_spans: + numeric_value_spans.append(NumericValueSpan(begin_index=span[0], end_index=span[1], values=values)) + return numeric_value_spans + + +# Below: all functions from number_annotation_utils.py and 2 functions (namely filter_invalid_unicode +# and filter_invalid_unicode_from_table) from text_utils.py of the original implementation. URL's: +# - https://github.com/google-research/tapas/blob/master/tapas/utils/number_annotation_utils.py +# - https://github.com/google-research/tapas/blob/master/tapas/utils/text_utils.py + + +_PrimitiveNumericValue = Union[float, Tuple[Optional[float], Optional[float], Optional[float]]] +_SortKeyFn = Callable[[NumericValue], Tuple[float, Ellipsis]] + +_DATE_TUPLE_SIZE = 3 + +EMPTY_TEXT = 'EMPTY' + +NUMBER_TYPE = "number" +DATE_TYPE = "date" + + +def _get_value_type(numeric_value): + if numeric_value.float_value is not None: + return NUMBER_TYPE + elif numeric_value.date is not None: + return DATE_TYPE + raise ValueError("Unknown type: %s" % numeric_value) + + +def _get_value_as_primitive_value(numeric_value): + """Maps a NumericValue proto to a float or tuple of float.""" + if numeric_value.float_value is not None: + return numeric_value.float_value + if numeric_value.date is not None: + date = numeric_value.date + value_tuple = [None, None, None] + # All dates fields are cased to float to produce a simple primitive value. + if date.year is not None: + value_tuple[0] = float(date.year) + if date.month is not None: + value_tuple[1] = float(date.month) + if date.day is not None: + value_tuple[2] = float(date.day) + return tuple(value_tuple) + raise ValueError("Unknown type: %s" % numeric_value) + + +def _get_all_types(numeric_values): + return {_get_value_type(value) for value in numeric_values} + + +def get_numeric_sort_key_fn(numeric_values): + """ + Creates a function that can be used as a sort key or to compare the values. 
+    Maps to primitive types and finds the biggest common subset. Consider the values "05/05/2010" and "August 2007",
+    with corresponding primitive values (2010., 5., 5.) and (2007., 8., None). These values can be compared by year
+    and month, so we map to the sequences (2010., 5.) and (2007., 8.). If we added a third value "2006" with primitive
+    value (2006., None, None), we could only compare by the year, so we would map to (2010.,), (2007.,) and (2006.,).
+
+    Args:
+        numeric_values: Values to compare.
+
+    Returns:
+        A function that can be used as a sort key function (mapping numeric values to a comparable tuple).
+
+    Raises:
+        ValueError if values don't have a common type or are not comparable.
+    """
+    value_types = _get_all_types(numeric_values)
+    if len(value_types) != 1:
+        raise ValueError("No common value type in %s" % numeric_values)
+
+    value_type = next(iter(value_types))
+    if value_type == NUMBER_TYPE:
+        # Primitive values are simple floats, nothing to do here.
+        return _get_value_as_primitive_value
+
+    # The type can only be Date at this point which means the primitive type
+    # is a float triple.
+    valid_indexes = set(range(_DATE_TUPLE_SIZE))
+
+    for numeric_value in numeric_values:
+        value = _get_value_as_primitive_value(numeric_value)
+        assert isinstance(value, tuple)
+        for tuple_index, inner_value in enumerate(value):
+            if inner_value is None:
+                valid_indexes.discard(tuple_index)
+
+    if not valid_indexes:
+        raise ValueError("No common value in %s" % numeric_values)
+
+    def _sort_key_fn(numeric_value):
+        value = _get_value_as_primitive_value(numeric_value)
+        return tuple(value[index] for index in valid_indexes)
+
+    return _sort_key_fn
+
+
+def _consolidate_numeric_values(row_index_to_values, min_consolidation_fraction, debug_info):
+    """
+    Finds the most common numeric values in a column and returns them.
+
+    Args:
+        row_index_to_values:
+            For each row index all the values in that cell.
+        min_consolidation_fraction:
+            Fraction of cells that need to have consolidated value.
+        debug_info:
+            Additional information only used for logging.
+
+    Returns:
+        For each row index the first value that matches the most common value. Rows that don't have a matching value
+        are dropped. Empty dictionary if values can't be consolidated.
+    """
+    type_counts = collections.Counter()
+    for numeric_values in row_index_to_values.values():
+        type_counts.update(_get_all_types(numeric_values))
+    if not type_counts:
+        return {}
+    max_count = max(type_counts.values())
+    if max_count < len(row_index_to_values) * min_consolidation_fraction:
+        # logging.log_every_n(logging.INFO, "Can't consolidate types: %s %s %d", 100,
+        #                     debug_info, row_index_to_values, max_count)
+        return {}
+
+    valid_types = set()
+    for value_type, count in type_counts.items():
+        if count == max_count:
+            valid_types.add(value_type)
+    if len(valid_types) > 1:
+        assert DATE_TYPE in valid_types
+        max_type = DATE_TYPE
+    else:
+        max_type = next(iter(valid_types))
+
+    new_row_index_to_value = {}
+    for index, values in row_index_to_values.items():
+        # Extract the first matching value.
+        for value in values:
+            if _get_value_type(value) == max_type:
+                new_row_index_to_value[index] = value
+                break
+
+    return new_row_index_to_value
+
+
+def _get_numeric_values(text):
+    """Parses text and returns numeric values."""
+    numeric_spans = parse_text(text)
+    return itertools.chain(*(span.values for span in numeric_spans))
+
+
+def _get_column_values(table, col_index):
+    """
+    Parses text in a column and returns a dict mapping row_index to values.
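+    For example (illustrative), a column with cell texts ["33", "35"] maps to
+    {0: [NumericValue(float_value=33.0)], 1: [NumericValue(float_value=35.0)]}.
+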
+    This is the _get_column_values function from number_annotation_utils.py of the original implementation.
+
+    Args:
+        table: Pandas dataframe.
+        col_index: integer, indicating the index of the column to get the numeric values of.
+    """
+    index_to_values = {}
+    for row_index, row in table.iterrows():
+        text = normalize_for_match(row[col_index].text)
+        index_to_values[row_index] = list(_get_numeric_values(text))
+    return index_to_values
+
+
+def get_numeric_relation(value, other_value, sort_key_fn):
+    """Compares two values and returns their relation or None."""
+    value = sort_key_fn(value)
+    other_value = sort_key_fn(other_value)
+    if value == other_value:
+        return Relation.EQ
+    if value < other_value:
+        return Relation.LT
+    if value > other_value:
+        return Relation.GT
+    return None
+
+
+def add_numeric_values_to_question(question):
+    """Adds numeric value spans to a question."""
+    original_text = question
+    question = normalize_for_match(question)
+    numeric_spans = parse_text(question)
+    return Question(original_text=original_text, text=question, numeric_spans=numeric_spans)
+
+
+def filter_invalid_unicode(text):
+    """Return an empty string and True if 'text' is in invalid unicode."""
+    return ("", True) if isinstance(text, bytes) else (text, False)
+
+
+def filter_invalid_unicode_from_table(table):
+    """
+    Removes invalid unicode from a table. Checks whether a table cell text contains an invalid unicode encoding. If
+    yes, resets the table cell text to an empty string and logs a warning for each invalid cell.
+
+    Args:
+        table: table to clean.
+    """
+    # TODO: add table id support
+    if not hasattr(table, "table_id"):
+        table.table_id = 0
+
+    for row_index, row in table.iterrows():
+        for col_index, cell in enumerate(row):
+            cell, is_invalid = filter_invalid_unicode(cell)
+            if is_invalid:
+                # Write the cleaned (empty) cell text back into the table, as documented above.
+                table.iloc[row_index, col_index] = cell
+                logging.warning(
+                    "Scrub an invalid table body @ table_id: %s, row_index: %d, col_index: %d",
+                    table.table_id,
+                    row_index,
+                    col_index,
+                )
+    for col_index, column in enumerate(table.columns):
+        column, is_invalid = filter_invalid_unicode(column)
+        if is_invalid:
+            logging.warning(
+                "Scrub an invalid table header @ table_id: %s, col_index: %d",
+                table.table_id,
+                col_index,
+            )
+
+
+def add_numeric_table_values(table, min_consolidation_fraction=0.7, debug_info=None):
+    """
+    Parses text in the table column-wise and adds the consolidated values. Consolidation refers to finding values
+    with a common type (date or number).
+
+    Args:
+        table:
+            Table to annotate.
+        min_consolidation_fraction:
+            Fraction of cells in a column that need to have consolidated value.
+        debug_info:
+            Additional information used for logging.
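+
+    Returns:
+        A copy of the table in which every cell is a Cell object, with its numeric_value attribute set whenever
+        column-wise consolidation succeeded.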
+ """ + table = table.copy() + # First, filter table on invalid unicode + filter_invalid_unicode_from_table(table) + + # Second, replace cell values by Cell objects + for row_index, row in table.iterrows(): + for col_index, cell in enumerate(row): + table.iloc[row_index, col_index] = Cell(text=cell) + + # Third, add numeric_value attributes to these Cell objects + for col_index, column in enumerate(table.columns): + column_values = _consolidate_numeric_values( + _get_column_values(table, col_index), + min_consolidation_fraction=min_consolidation_fraction, + debug_info=(debug_info, column)) + + for row_index, numeric_value in column_values.items(): + table.iloc[row_index, col_index].numeric_value = numeric_value + + return table \ No newline at end of file diff --git a/tests/test_modeling_tapas.py b/tests/test_modeling_tapas.py new file mode 100644 index 000000000000..95bf1614dff0 --- /dev/null +++ b/tests/test_modeling_tapas.py @@ -0,0 +1,863 @@ +# coding=utf-8 +# Copyright 2020 Google Research and The HuggingFace Inc. team. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + + +import unittest + +import copy + +import numpy as np +import pandas as pd + +from transformers import is_torch_available +from transformers.file_utils import cached_property +from transformers.testing_utils import require_torch, slow, torch_device + +from .test_configuration_common import ConfigTester +from .test_modeling_common import ModelTesterMixin, floats_tensor, ids_tensor, random_attention_mask + + +if is_torch_available(): + import torch + + from transformers import ( + TAPAS_PRETRAINED_MODEL_ARCHIVE_LIST, + TapasConfig, + #TapasForMaskedLM, + TapasForQuestionAnswering, + TapasForSequenceClassification, + TapasModel, + ) + + from transformers.modeling_tapas import ( + IndexMap, + ProductIndexMap, + gather, + flatten, + range_index_map, + reduce_sum, + reduce_mean, + reduce_max, + reduce_min, + ) + + +class TapasModelTester: + """You can also import this e.g from .test_modeling_tapas import TapasModelTester """ + + def __init__( + self, + parent, + batch_size=13, + seq_length=7, + is_training=True, + use_input_mask=True, + use_token_type_ids=True, + use_labels=True, + vocab_size=99, + hidden_size=32, + num_hidden_layers=5, + num_attention_heads=4, + intermediate_size=37, + hidden_act="gelu", + hidden_dropout_prob=0.1, + attention_probs_dropout_prob=0.1, + initializer_range=0.02, + max_position_embeddings=512, + type_vocab_sizes=[3, 256, 256, 2, 256, 256, 10], + type_sequence_label_size=2, + positive_weight=10.0, + num_aggregation_labels=4, + num_labels=2, + aggregation_loss_importance=0.8, + use_answer_as_supervision=True, + answer_loss_importance=0.001, + use_normalized_answer_loss=False, + huber_loss_delta=25.0, + temperature=1.0, + agg_temperature=1.0, + use_gumbel_for_cells=False, + use_gumbel_for_agg=False, + average_approximation_function="ratio", + cell_selection_preference=0.5, + answer_loss_cutoff=100, + max_num_rows=64, + max_num_columns=32, + average_logits_per_cell=True, + select_one_column=True, + 
allow_empty_column_selection=False, + init_cell_selection_weights_to_zero=False, + reset_position_index_per_cell=True, + disable_per_token_loss=False, + scope=None, + ): + self.parent = parent + self.batch_size = batch_size + self.seq_length = seq_length + self.is_training = is_training + self.use_input_mask = use_input_mask + self.use_token_type_ids = use_token_type_ids + self.use_labels = use_labels + self.vocab_size = vocab_size + self.hidden_size = hidden_size + self.num_hidden_layers = num_hidden_layers + self.num_attention_heads = num_attention_heads + self.intermediate_size = intermediate_size + self.hidden_act = hidden_act + self.hidden_dropout_prob = hidden_dropout_prob + self.attention_probs_dropout_prob = attention_probs_dropout_prob + self.initializer_range = initializer_range + self.max_position_embeddings = max_position_embeddings + self.type_vocab_sizes = type_vocab_sizes + self.type_sequence_label_size = type_sequence_label_size + self.positive_weight = positive_weight + self.num_aggregation_labels = num_aggregation_labels + self.num_labels = num_labels + self.aggregation_loss_importance = aggregation_loss_importance + self.use_answer_as_supervision = use_answer_as_supervision + self.answer_loss_importance = answer_loss_importance + self.use_normalized_answer_loss = use_normalized_answer_loss + self.huber_loss_delta = huber_loss_delta + self.temperature = temperature + self.agg_temperature = agg_temperature + self.use_gumbel_for_cells = use_gumbel_for_cells + self.use_gumbel_for_agg = use_gumbel_for_agg + self.average_approximation_function = average_approximation_function + self.cell_selection_preference = cell_selection_preference + self.answer_loss_cutoff = answer_loss_cutoff + self.max_num_rows = max_num_rows + self.max_num_columns = max_num_columns + self.average_logits_per_cell = average_logits_per_cell + self.select_one_column = select_one_column + self.allow_empty_column_selection = allow_empty_column_selection + self.init_cell_selection_weights_to_zero = init_cell_selection_weights_to_zero + self.reset_position_index_per_cell = reset_position_index_per_cell + self.disable_per_token_loss = disable_per_token_loss + self.scope = scope + + def prepare_config_and_inputs(self): + input_ids = ids_tensor([self.batch_size, self.seq_length], self.vocab_size) + + input_mask = None + if self.use_input_mask: + input_mask = random_attention_mask([self.batch_size, self.seq_length]) + + token_type_ids = [] + for type_vocab_size in self.type_vocab_sizes: + token_type_ids.append(ids_tensor(shape=[self.batch_size, self.seq_length], vocab_size=type_vocab_size)) + token_type_ids = torch.stack(token_type_ids, dim=2) + + sequence_labels = None + token_labels = None + label_ids = None + answer = None + numeric_values = None + numeric_values_scale = None + float_answer = None + aggregation_labels = None + if self.use_labels: + sequence_labels = ids_tensor([self.batch_size], self.type_sequence_label_size) + token_labels = ids_tensor([self.batch_size, self.seq_length], self.num_labels) + label_ids = ids_tensor([self.batch_size, self.seq_length], vocab_size=2) + numeric_values = floats_tensor([self.batch_size, self.seq_length]) + numeric_values_scale = floats_tensor([self.batch_size, self.seq_length]) + float_answer = floats_tensor([self.batch_size]) + aggregation_labels = ids_tensor([self.batch_size], self.num_aggregation_labels) + + config = TapasConfig( + vocab_size=self.vocab_size, + hidden_size=self.hidden_size, + num_hidden_layers=self.num_hidden_layers, + 
num_attention_heads=self.num_attention_heads, + intermediate_size=self.intermediate_size, + hidden_act=self.hidden_act, + hidden_dropout_prob=self.hidden_dropout_prob, + attention_probs_dropout_prob=self.attention_probs_dropout_prob, + max_position_embeddings=self.max_position_embeddings, + type_vocab_sizes=self.type_vocab_sizes, + initializer_range=self.initializer_range, + positive_weight=self.positive_weight, + num_aggregation_labels=self.num_aggregation_labels, + num_labels=self.num_labels, + aggregation_loss_importance=self.aggregation_loss_importance, + use_answer_as_supervision=self.use_answer_as_supervision, + answer_loss_importance=self.answer_loss_importance, + use_normalized_answer_loss=self.use_normalized_answer_loss, + huber_loss_delta=self.huber_loss_delta, + temperature=self.temperature, + agg_temperature=self.agg_temperature, + use_gumbel_for_cells=self.use_gumbel_for_cells, + use_gumbel_for_agg=self.use_gumbel_for_agg, + average_approximation_function=self.average_approximation_function, + cell_selection_preference=self.cell_selection_preference, + answer_loss_cutoff=self.answer_loss_cutoff, + max_num_rows=self.max_num_rows, + max_num_columns=self.max_num_columns, + average_logits_per_cell=self.average_logits_per_cell, + select_one_column=self.select_one_column, + allow_empty_column_selection=self.allow_empty_column_selection, + init_cell_selection_weights_to_zero=self.init_cell_selection_weights_to_zero, + reset_position_index_per_cell=self.reset_position_index_per_cell, + disable_per_token_loss=self.disable_per_token_loss, + return_dict=True, + ) + + return ( + config, + input_ids, + input_mask, + token_type_ids, + sequence_labels, + token_labels, + label_ids, + numeric_values, + numeric_values_scale, + float_answer, + aggregation_labels, + ) + + def create_and_check_model( + self, + config, + input_ids, + input_mask, + token_type_ids, + sequence_labels, + token_labels, + label_ids, + numeric_values, + numeric_values_scale, + float_answer, + aggregation_labels, + ): + model = TapasModel(config=config) + model.to(torch_device) + model.eval() + result = model(input_ids, attention_mask=input_mask, token_type_ids=token_type_ids) + result = model(input_ids, token_type_ids=token_type_ids) + result = model(input_ids) + self.parent.assertEqual(result.last_hidden_state.shape, (self.batch_size, self.seq_length, self.hidden_size)) + self.parent.assertEqual(result.pooler_output.shape, (self.batch_size, self.hidden_size)) + + # def create_and_check_for_masked_lm( + # self, + # config, + # input_ids, + # input_mask, + # token_type_ids, + # sequence_labels, + # token_labels, + # label_ids, + # numeric_values, + # numeric_values_scale, + # float_answer, + # aggregation_labels, + # ): + # model = TapasForMaskedLM(config=config) + # model.to(torch_device) + # model.eval() + # result = model(input_ids, attention_mask=input_mask, token_type_ids=token_type_ids, labels=token_labels) + # self.parent.assertEqual(result.logits.shape, (self.batch_size, self.seq_length, self.vocab_size)) + + def create_and_check_for_question_answering( + self, + config, + input_ids, + input_mask, + token_type_ids, + sequence_labels, + token_labels, + label_ids, + numeric_values, + numeric_values_scale, + float_answer, + aggregation_labels, + ): + # inference: without aggregation head (SQA). 
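+        # With num_aggregation_labels set to 0 the aggregation head is dropped, so the
+        # model only returns cell-selection logits of shape (batch_size, seq_length);
+        # this mirrors the conversational SQA setup.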
+ sqa_config = copy.copy(config) + sqa_config.num_aggregation_labels = 0 + sqa_config.use_answer_as_supervision = False + model = TapasForQuestionAnswering(config=sqa_config) + model.to(torch_device) + model.eval() + result = model( + input_ids=input_ids, + attention_mask=input_mask, + token_type_ids=token_type_ids, + ) + self.parent.assertEqual(result.logits.shape, (self.batch_size, self.seq_length)) + + # inference: with aggregation head (WTQ, WikiSQL-supervised) + model = TapasForQuestionAnswering(config=config) + model.to(torch_device) + model.eval() + result = model( + input_ids=input_ids, + attention_mask=input_mask, + token_type_ids=token_type_ids, + ) + self.parent.assertEqual(result.logits.shape, (self.batch_size, self.seq_length)) + self.parent.assertEqual(result.logits_aggregation.shape, (self.batch_size, self.num_aggregation_labels)) + + # training: can happen in 3 main ways + # case 1: conversational (SQA) + model = TapasForQuestionAnswering(config=sqa_config) + model.to(torch_device) + model.eval() + result = model( + input_ids, + attention_mask=input_mask, + token_type_ids=token_type_ids, + label_ids=label_ids, + ) + self.parent.assertEqual(result.loss.shape, ()) + self.parent.assertEqual(result.logits.shape, (self.batch_size, self.seq_length)) + + # case 2: weak supervision for aggregation (WTQ) + model = TapasForQuestionAnswering(config=config) + model.to(torch_device) + model.eval() + result = model( + input_ids=input_ids, + attention_mask=input_mask, + token_type_ids=token_type_ids, + label_ids=label_ids, + numeric_values=numeric_values, + numeric_values_scale=numeric_values_scale, + float_answer=float_answer, + ) + self.parent.assertEqual(result.loss.shape, ()) + self.parent.assertEqual(result.logits.shape, (self.batch_size, self.seq_length)) + self.parent.assertEqual(result.logits_aggregation.shape, (self.batch_size, self.num_aggregation_labels)) + + # case 3: strong supervision for aggregation (WikiSQL-supervised) + wikisql_config = copy.copy(config) + wikisql_config.use_answer_as_supervision = False + model = TapasForQuestionAnswering(config=wikisql_config) + model.to(torch_device) + model.eval() + result = model( + input_ids, + attention_mask=input_mask, + token_type_ids=token_type_ids, + label_ids=label_ids, + aggregation_labels=aggregation_labels, + ) + self.parent.assertEqual(result.loss.shape, ()) + self.parent.assertEqual(result.logits.shape, (self.batch_size, self.seq_length)) + self.parent.assertEqual(result.logits_aggregation.shape, (self.batch_size, self.num_aggregation_labels)) + + def create_and_check_for_sequence_classification( + self, + config, + input_ids, + input_mask, + token_type_ids, + sequence_labels, + token_labels, + label_ids, + numeric_values, + numeric_values_scale, + float_answer, + aggregation_labels, + ): + config.num_labels = self.num_labels + model = TapasForSequenceClassification(config) + model.to(torch_device) + model.eval() + result = model(input_ids, attention_mask=input_mask, labels=sequence_labels) + self.parent.assertEqual(result.logits.shape, (self.batch_size, self.num_labels)) + + def prepare_config_and_inputs_for_common(self): + config_and_inputs = self.prepare_config_and_inputs() + ( + config, + input_ids, + input_mask, + token_type_ids, + sequence_labels, + token_labels, + label_ids, + numeric_values, + numeric_values_scale, + float_answer, + aggregation_labels, + ) = config_and_inputs + inputs_dict = {"input_ids": input_ids, "token_type_ids": token_type_ids, "attention_mask": input_mask} + return config, inputs_dict + + 
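+# For quick reference: the three fine-tuning regimes exercised by the tester above
+# differ only in two config flags (summarizing the tester code, not an official API table):
+#
+#   conversational (SQA):                      num_aggregation_labels=0, use_answer_as_supervision=False
+#   weak supervision for aggregation (WTQ):    num_aggregation_labels=4, use_answer_as_supervision=True
+#   strong supervision (WikiSQL-supervised):   num_aggregation_labels=4, use_answer_as_supervision=False
+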
+@require_torch
+class TapasModelTest(ModelTesterMixin, unittest.TestCase):
+
+    all_model_classes = (
+        (
+            TapasModel,
+            # TapasForMaskedLM,
+            TapasForQuestionAnswering,
+            TapasForSequenceClassification,
+        )
+        if is_torch_available()
+        else ()
+    )
+    test_pruning = False
+    test_torchscript = True
+    test_resize_embeddings = True
+    test_head_masking = False
+
+    def setUp(self):
+        self.model_tester = TapasModelTester(self)
+        self.config_tester = ConfigTester(self, config_class=TapasConfig, dim=37)
+
+    def test_config(self):
+        self.config_tester.run_common_tests()
+
+    def test_model(self):
+        config_and_inputs = self.model_tester.prepare_config_and_inputs()
+        self.model_tester.create_and_check_model(*config_and_inputs)
+
+    # def test_for_masked_lm(self):
+    #     config_and_inputs = self.model_tester.prepare_config_and_inputs()
+    #     self.model_tester.create_and_check_for_masked_lm(*config_and_inputs)
+
+    def test_for_question_answering(self):
+        config_and_inputs = self.model_tester.prepare_config_and_inputs()
+        self.model_tester.create_and_check_for_question_answering(*config_and_inputs)
+
+    def test_for_sequence_classification(self):
+        config_and_inputs = self.model_tester.prepare_config_and_inputs()
+        self.model_tester.create_and_check_for_sequence_classification(*config_and_inputs)
+
+
+def prepare_tapas_single_inputs_for_inference():
+    # Here we prepare a single table-question pair to test TAPAS inference on:
+    data = {
+        "Footballer": ["Lionel Messi", "Cristiano Ronaldo"],
+        "Age": ["33", "35"],
+    }
+    queries = "Which footballer is 33 years old?"
+    table = pd.DataFrame.from_dict(data)
+
+    return table, queries
+
+
+def prepare_tapas_batch_inputs_for_inference():
+    # Here we prepare a batch of 2 table-question pairs to test TAPAS inference on:
+    data = {
+        "Footballer": ["Lionel Messi", "Cristiano Ronaldo"],
+        "Age": ["33", "35"],
+        "Number of goals": ["712", "750"],
+    }
+    queries = ["Which footballer is 33 years old?", "How many goals does Ronaldo have?"]
+    table = pd.DataFrame.from_dict(data)
+
+    return table, queries
+
+
+def prepare_tapas_batch_inputs_for_training():
+    # Here we prepare a DIFFERENT batch of 2 table-question pairs to test TAPAS training on:
+    data = {
+        "Footballer": ["Lionel Messi", "Cristiano Ronaldo"],
+        "Age": ["33", "35"],
+        "Number of goals": ["712", "750"],
+    }
+    queries = ["Which footballer is 33 years old?", "What's the total number of goals?"]
+    table = pd.DataFrame.from_dict(data)
+
+    answer_coordinates = [[(0, 0)], [(0, 2), (1, 2)]]
+    answer_text = [["Lionel Messi"], ["1462"]]
+    float_answer = [float("NaN"), float("1462")]
+
+    return table, queries, answer_coordinates, answer_text, float_answer
+
+
+@require_torch
+class TapasModelIntegrationTest(unittest.TestCase):
+    @cached_property
+    def default_tokenizer(self):
+        return TapasTokenizer.from_pretrained("nielsr/tapas-base-finetuned-wtq")
+
+    @slow
+    def test_inference_no_head(self):
+        # ideally we want to test this with the weights of tapas_inter_masklm_base_reset,
+        # but since it's not straightforward to do this with the TF 1 implementation, we test it with
+        # the weights of the WTQ base model (i.e.
+        # tapas_wtq_wikisql_sqa_inter_masklm_base_reset)
+        model = TapasModel.from_pretrained("nielsr/tapas-base-finetuned-wtq")
+
+        tokenizer = self.default_tokenizer
+        table, queries = prepare_tapas_single_inputs_for_inference()
+        inputs = tokenizer(table=table, queries=queries, return_tensors="pt")
+        outputs = model(**inputs)
+        # test the sequence output
+        expected_slice = torch.tensor(
+            [
+                [
+                    [-0.141581565, -0.599805772, 0.747186482],
+                    [-0.143664181, -0.602008104, 0.749218345],
+                    [-0.15169853, -0.603363097, 0.741370678],
+                ]
+            ]
+        )
+
+        self.assertTrue(torch.allclose(outputs.last_hidden_state[:, :3, :3], expected_slice, atol=1e-4))
+
+        # test the pooled output
+        expected_slice = torch.tensor([[0.987518311, -0.970520139, -0.994303405]])
+
+        self.assertTrue(torch.allclose(outputs.pooler_output[:, :3], expected_slice, atol=1e-4))
+
+    @unittest.skip(reason="Model not available yet")
+    def test_inference_masked_lm(self):
+        pass
+
+    # TapasForQuestionAnswering has 3 possible ways of being fine-tuned:
+    # - conversational set-up (SQA)
+    # - weak supervision for aggregation (WTQ, WikiSQL)
+    # - strong supervision for aggregation (WikiSQL-supervised)
+    # We test all of them:
+    @slow
+    def test_inference_question_answering_head_conversational(self):
+        # note that nielsr/tapas-base-finetuned-sqa should correspond to tapas_sqa_inter_masklm_base_reset
+        model = TapasForQuestionAnswering.from_pretrained("nielsr/tapas-base-finetuned-sqa")
+
+        tokenizer = self.default_tokenizer
+        table, queries = prepare_tapas_single_inputs_for_inference()
+        inputs = tokenizer(table=table, queries=queries, return_tensors="pt")
+        outputs = model(**inputs)
+        # test the logits
+        logits = outputs.logits
+        expected_shape = torch.Size((1, 21))
+        self.assertEqual(logits.shape, expected_shape)
+        expected_tensor = torch.tensor(
+            [
+                [
+                    -9997.22461, -9997.22461, -9997.22461, -9997.22461, -9997.22461,
+                    -9997.22461, -9997.22461, -9997.22461, -9997.22461, -16.2628059,
+                    -10004.082, 15.4330549, 15.4330549, 15.4330549, -9990.42,
+                    -16.3270779, -16.3270779, -16.3270779, -16.3270779, -16.3270779, -10004.8506,
+                ]
+            ]
+        )  # ok
+
+        self.assertTrue(torch.allclose(logits, expected_tensor, atol=1e-4))
+
+    @slow
+    def test_inference_question_answering_head_weak_supervision(self):
+        # note that nielsr/tapas-base-finetuned-wtq should correspond to tapas_wtq_wikisql_sqa_inter_masklm_base_reset
+        model = TapasForQuestionAnswering.from_pretrained("nielsr/tapas-base-finetuned-wtq")
+
+        tokenizer = self.default_tokenizer
+        # let's test on a batch
+        table, queries = prepare_tapas_batch_inputs_for_inference()
+        inputs = tokenizer(table=table, queries=queries, padding="longest", return_tensors="pt")
+        outputs = model(**inputs)
+        # test the logits
+        logits = outputs.logits
+        expected_shape = torch.Size((2, 28))
+        self.assertEqual(logits.shape, expected_shape)
+        expected_slice = torch.tensor(
+            [
+                [-160.375504, -160.375504, -160.375504, -10072.3965, -10070.9414, -10094.9736],
+                [-9861.6123, -9861.6123, -9861.6123, -9861.6123, -9891.01172, 146.600677],
+            ]
+        )  # ok (batch size = 2)
+
+        self.assertTrue(torch.allclose(logits[:, -6:], expected_slice, atol=1e-4))
+
+        # test the aggregation logits
+        logits_aggregation = outputs.logits_aggregation
+        expected_shape = torch.Size((2, 4))
+        self.assertEqual(logits_aggregation.shape, expected_shape)
+        expected_tensor = torch.tensor(
+            [
+                [18.8545208, -9.76614857, -6.3128891, -2.93525243],
+                [-4.05782509, 40.0351, -5.35329962, 23.3978653],
+            ]
+        )  # ok (batch size = 2)
+
+        self.assertTrue(torch.allclose(logits_aggregation, expected_tensor, atol=1e-4))
+
+    @slow
+    def test_training_question_answering_head_weak_supervision(self):
+        # note that nielsr/tapas-base-finetuned-wtq should correspond to tapas_wtq_wikisql_sqa_inter_masklm_base_reset
+        model = TapasForQuestionAnswering.from_pretrained("nielsr/tapas-base-finetuned-wtq")
+        model.to(torch_device)
+
+        tokenizer = self.default_tokenizer
+        # let's test on a batch
+        table, queries, answer_coordinates, answer_text, float_answer = prepare_tapas_batch_inputs_for_training()
+        inputs = tokenizer(
+            table=table,
+            queries=queries,
+            answer_coordinates=answer_coordinates,
+            answer_text=answer_text,
+            padding="longest",
+            return_tensors="pt",
+        )
+
+        # prepare data (created by the tokenizer) and move to torch_device
+        input_ids = inputs["input_ids"].to(torch_device)
+        attention_mask = inputs["attention_mask"].to(torch_device)
+        token_type_ids = inputs["token_type_ids"].to(torch_device)
+        label_ids = inputs["label_ids"].to(torch_device)
+        numeric_values = inputs["numeric_values"].to(torch_device)
+        numeric_values_scale = inputs["numeric_values_scale"].to(torch_device)
+
+        # the answer should be prepared by the user
+        float_answer = torch.FloatTensor(float_answer).to(torch_device)
+
+        # forward pass to get loss + logits:
+        outputs = model(
+            input_ids=input_ids,
+            attention_mask=attention_mask,
+            token_type_ids=token_type_ids,
+            label_ids=label_ids,
+            numeric_values=numeric_values,
+            numeric_values_scale=numeric_values_scale,
+            float_answer=float_answer,
+        )
+
+        # test the loss
+        loss = outputs.loss
+        expected_loss = 3.3527612686157227e-08  # ok
+        self.assertAlmostEqual(loss.item(), expected_loss, delta=1e-4)
+
+        # test the logits on the first example
+        logits = outputs.logits
+        expected_shape = torch.Size((2, 28))
+        self.assertEqual(logits.shape, expected_shape)
+        expected_slice = torch.tensor(
+            [
+                -160.0156, -160.0156, -160.0156, -160.0156, -160.0156,
+                -10072.2266, -10070.8896, -10092.6006, -10092.6006,
+            ]
+        )  # ok
+
+        self.assertTrue(torch.allclose(logits[0, -9:], expected_slice, atol=1e-4))
+
+        # test the aggregation logits on the second example
+        logits_aggregation = outputs.logits_aggregation
+        expected_shape = torch.Size((2, 4))
+        self.assertEqual(logits_aggregation.shape, expected_shape)
+        expected_slice = torch.tensor([-4.0538, 40.0304, -5.3554, 23.3965])  # ok
+
+        self.assertTrue(torch.allclose(logits_aggregation[1, -4:], expected_slice, atol=1e-4))
+
+    @slow
+    def test_inference_question_answering_head_strong_supervision(self):
+        # note that nielsr/tapas-base-finetuned-wikisql-supervised should correspond to
+        # tapas_wikisql_sqa_inter_masklm_base_reset
+        model = TapasForQuestionAnswering.from_pretrained("nielsr/tapas-base-finetuned-wikisql-supervised")
+
+        tokenizer = self.default_tokenizer
+        table, queries = prepare_tapas_single_inputs_for_inference()
+        inputs = tokenizer(table=table, queries=queries, return_tensors="pt")
+        outputs = model(**inputs)
+        # test the logits
+        logits = outputs.logits
+        expected_shape = torch.Size((1, 21))
+        self.assertEqual(logits.shape, expected_shape)
+        expected_tensor = torch.tensor(
+            [
+                [
+                    -10011.1084, -10011.1084, -10011.1084, -10011.1084, -10011.1084,
+                    -10011.1084, -10011.1084, -10011.1084, -10011.1084, -18.6185989,
+                    -10008.7969, 17.6355762, 17.6355762, 17.6355762, -10002.4404,
+                    -18.7111301, -18.7111301, -18.7111301, -18.7111301, -18.7111301, -10007.0977,
+                ]
+            ]
+        )  # ok
+
+        self.assertTrue(torch.allclose(logits, expected_tensor, atol=1e-4))
+
+        # test the aggregation logits
+        logits_aggregation = outputs.logits_aggregation
+        expected_shape = torch.Size((1, 4))
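+        # The 4 aggregation logits are assumed to follow the operation ordering of the
+        # original TAPAS code (NONE, SUM, AVERAGE, COUNT).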
+        self.assertEqual(logits_aggregation.shape, expected_shape)
+        expected_tensor = torch.tensor(
+            [[16.5659733, -3.06624889, -2.34152961, -0.970244825]]
+        )  # ok, PyTorch model outputs [[16.5679, -3.0668, -2.3442, -0.9674]]
+
+        self.assertTrue(torch.allclose(logits_aggregation, expected_tensor, atol=1e-4))
+
+    @slow
+    def test_inference_classification_head(self):
+        # note that nielsr/tapas-base-finetuned-tabfact should correspond to tapas_tabfact_inter_masklm_base_reset
+        model = TapasForSequenceClassification.from_pretrained("nielsr/tapas-base-finetuned-tabfact")
+
+        tokenizer = self.default_tokenizer
+        table, queries = prepare_tapas_single_inputs_for_inference()
+        inputs = tokenizer(table=table, queries=queries, return_tensors="pt")
+        outputs = model(**inputs)
+
+        # test the classification logits
+        logits = outputs.logits
+        expected_shape = torch.Size((1, 2))
+        self.assertEqual(logits.shape, expected_shape)
+        expected_tensor = torch.tensor(
+            [[0.795137286, 9.5572]]
+        )  # ok. Note that the PyTorch model outputs [[0.8057, 9.5281]]
+
+        self.assertTrue(torch.allclose(outputs.logits, expected_tensor, atol=1e-4))
+
+
+# Below: tests for Tapas utilities which are defined in modeling_tapas.py.
+# These are based on segmented_tensor_test.py of the original implementation.
+# URL: https://github.com/google-research/tapas/blob/master/tapas/models/segmented_tensor_test.py
+class TapasUtilitiesTest(unittest.TestCase):
+    def _prepare_tables(self):
+        """Prepares two tables, both with three distinct rows.
+        The first table has two columns:
+        1.0, 2.0 | 3.0
+        2.0, 0.0 | 1.0
+        1.0, 3.0 | 4.0
+        The second table has three columns:
+        1.0 | 2.0 | 3.0
+        2.0 | 0.0 | 1.0
+        1.0 | 3.0 | 4.0
+        Returns:
+            SegmentedTensors with the tables.
+        """
+        values = torch.tensor(
+            [
+                [[1.0, 2.0, 3.0], [2.0, 0.0, 1.0], [1.0, 3.0, 4.0]],
+                [[1.0, 2.0, 3.0], [2.0, 0.0, 1.0], [1.0, 3.0, 4.0]],
+            ]
+        )
+        row_index = IndexMap(
+            indices=torch.tensor(
+                [
+                    [[0, 0, 0], [1, 1, 1], [2, 2, 2]],
+                    [[0, 0, 0], [1, 1, 1], [2, 2, 2]],
+                ]
+            ),
+            num_segments=3,
+            batch_dims=1,
+        )
+        col_index = IndexMap(
+            indices=torch.tensor(
+                [
+                    [[0, 0, 1], [0, 0, 1], [0, 0, 1]],
+                    [[0, 1, 2], [0, 1, 2], [0, 1, 2]],
+                ]
+            ),
+            num_segments=3,
+            batch_dims=1,
+        )
+        return values, row_index, col_index
+
+    def test_product_index(self):
+        _, row_index, col_index = self._prepare_tables()
+        cell_index = ProductIndexMap(row_index, col_index)
+        row_index_proj = cell_index.project_outer(cell_index)
+        col_index_proj = cell_index.project_inner(cell_index)
+
+        ind = cell_index.indices
+        self.assertEqual(cell_index.num_segments, 9)
+
+        # Projections should give back the original indices.
+        # We use np.testing.assert_array_equal rather than Tensorflow's assertAllEqual
+        np.testing.assert_array_equal(row_index.indices.numpy(), row_index_proj.indices.numpy())
+        self.assertEqual(row_index.num_segments, row_index_proj.num_segments)
+        self.assertEqual(row_index.batch_dims, row_index_proj.batch_dims)
+        # We use np.testing.assert_array_equal rather than Tensorflow's assertAllEqual
+        np.testing.assert_array_equal(col_index.indices.numpy(), col_index_proj.indices.numpy())
+        self.assertEqual(col_index.batch_dims, col_index_proj.batch_dims)
+
+        # The first and second "column" are identified in the first table.
+        for i in range(3):
+            self.assertEqual(ind[0, i, 0], ind[0, i, 1])
+            self.assertNotEqual(ind[0, i, 0], ind[0, i, 2])
+
+        # All rows are distinct in the first table.
+        for i in range(3):
+            for i_2 in range(3):
+                for j in range(3):
+                    for j_2 in range(3):
+                        if i != i_2 and j != j_2:
+                            self.assertNotEqual(ind[0, i, j], ind[0, i_2, j_2])
+
+        # All cells are distinct in the second table.
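+        # (Each column of the second table has its own column id, so every (row, column)
+        # pair maps to a distinct cell id in the product index.)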
+        for i in range(3):
+            for i_2 in range(3):
+                for j in range(3):
+                    for j_2 in range(3):
+                        if i != i_2 or j != j_2:
+                            self.assertNotEqual(ind[1, i, j], ind[1, i_2, j_2])
+
+    def test_flatten(self):
+        _, row_index, col_index = self._prepare_tables()
+        row_index_flat = flatten(row_index)
+        col_index_flat = flatten(col_index)
+
+        shape = [3, 4, 5]
+        batched_index = IndexMap(indices=torch.zeros(shape).type(torch.LongTensor), num_segments=1, batch_dims=3)
+        batched_index_flat = flatten(batched_index)
+
+        # We use np.testing.assert_array_equal rather than Tensorflow's assertAllEqual
+        np.testing.assert_array_equal(
+            row_index_flat.indices.numpy(), [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5]
+        )
+        np.testing.assert_array_equal(
+            col_index_flat.indices.numpy(), [0, 0, 1, 0, 0, 1, 0, 0, 1, 3, 4, 5, 3, 4, 5, 3, 4, 5]
+        )
+        self.assertEqual(batched_index_flat.num_segments.numpy(), np.prod(shape))
+        np.testing.assert_array_equal(batched_index_flat.indices.numpy(), range(np.prod(shape)))
+
+    def test_range_index_map(self):
+        batch_shape = [3, 4]
+        num_segments = 5
+        index = range_index_map(batch_shape, num_segments)
+
+        self.assertEqual(num_segments, index.num_segments)
+        self.assertEqual(2, index.batch_dims)
+        indices = index.indices
+        # We use np.testing.assert_array_equal rather than Tensorflow's assertAllEqual
+        np.testing.assert_array_equal(list(indices.size()), [3, 4, 5])
+        for i in range(batch_shape[0]):
+            for j in range(batch_shape[1]):
+                # We use np.testing.assert_array_equal rather than Tensorflow's assertAllEqual
+                np.testing.assert_array_equal(indices[i, j, :].numpy(), range(num_segments))
+
+    def test_reduce_sum(self):
+        values, row_index, col_index = self._prepare_tables()
+        cell_index = ProductIndexMap(row_index, col_index)
+        row_sum, _ = reduce_sum(values, row_index)
+        col_sum, _ = reduce_sum(values, col_index)
+        cell_sum, _ = reduce_sum(values, cell_index)
+
+        # We use np.testing.assert_allclose rather than Tensorflow's assertAllClose
+        np.testing.assert_allclose(row_sum.numpy(), [[6.0, 3.0, 8.0], [6.0, 3.0, 8.0]])
+        np.testing.assert_allclose(col_sum.numpy(), [[9.0, 8.0, 0.0], [4.0, 5.0, 8.0]])
+        np.testing.assert_allclose(
+            cell_sum.numpy(),
+            [[3.0, 3.0, 0.0, 2.0, 1.0, 0.0, 4.0, 4.0, 0.0], [1.0, 2.0, 3.0, 2.0, 0.0, 1.0, 1.0, 3.0, 4.0]],
+        )
+
+    def test_reduce_mean(self):
+        values, row_index, col_index = self._prepare_tables()
+        cell_index = ProductIndexMap(row_index, col_index)
+        row_mean, _ = reduce_mean(values, row_index)
+        col_mean, _ = reduce_mean(values, col_index)
+        cell_mean, _ = reduce_mean(values, cell_index)
+
+        # We use np.testing.assert_allclose rather than Tensorflow's assertAllClose
+        np.testing.assert_allclose(
+            row_mean.numpy(), [[6.0 / 3.0, 3.0 / 3.0, 8.0 / 3.0], [6.0 / 3.0, 3.0 / 3.0, 8.0 / 3.0]]
+        )
+        np.testing.assert_allclose(col_mean.numpy(), [[9.0 / 6.0, 8.0 / 3.0, 0.0], [4.0 / 3.0, 5.0 / 3.0, 8.0 / 3.0]])
+        np.testing.assert_allclose(
+            cell_mean.numpy(),
+            [
+                [3.0 / 2.0, 3.0, 0.0, 2.0 / 2.0, 1.0, 0.0, 4.0 / 2.0, 4.0, 0.0],
+                [1.0, 2.0, 3.0, 2.0, 0.0, 1.0, 1.0, 3.0, 4.0],
+            ],
+        )
+
+    def test_reduce_max(self):
+        values = torch.as_tensor([2.0, 1.0, 0.0, 3.0])
+        index = IndexMap(indices=torch.as_tensor([0, 1, 0, 1]), num_segments=2)
+        maximum, _ = reduce_max(values, index)
+
+        # We use np.testing.assert_array_equal rather than Tensorflow's assertAllEqual
+        np.testing.assert_array_equal(maximum.numpy(), [2, 3])
+
+    def test_reduce_sum_vectorized(self):
+        values = torch.as_tensor([[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [3.0, 4.0, 5.0]])
+        index = IndexMap(indices=torch.as_tensor([0, 0, 1]), num_segments=2, batch_dims=0)
+        sums, new_index = reduce_sum(values, index)
+
+        # We use np.testing.assert_allclose rather than Tensorflow's assertAllClose
+        np.testing.assert_allclose(sums.numpy(), [[3.0, 5.0, 7.0], [3.0, 4.0, 5.0]])
+        # We use np.testing.assert_array_equal rather than Tensorflow's assertAllEqual
+        np.testing.assert_array_equal(new_index.indices.numpy(), [0, 1])
+        np.testing.assert_array_equal(new_index.num_segments.numpy(), 2)
+        np.testing.assert_array_equal(new_index.batch_dims, 0)
+
+    def test_gather(self):
+        values, row_index, col_index = self._prepare_tables()
+        cell_index = ProductIndexMap(row_index, col_index)
+
+        # Compute sums and then gather. The result should have the same shape as
+        # the original table and each element should contain the sum of the values
+        # in its cell.
+        sums, _ = reduce_sum(values, cell_index)
+        cell_sum = gather(sums, cell_index)
+        assert cell_sum.size() == values.size()
+
+        # We use np.testing.assert_allclose rather than Tensorflow's assertAllClose
+        np.testing.assert_allclose(
+            cell_sum.numpy(),
+            [[[3.0, 3.0, 3.0], [2.0, 2.0, 1.0], [4.0, 4.0, 4.0]], [[1.0, 2.0, 3.0], [2.0, 0.0, 1.0], [1.0, 3.0, 4.0]]],
+        )
+
+    def test_gather_vectorized(self):
+        values = torch.as_tensor([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])
+        index = IndexMap(indices=torch.as_tensor([[0, 1], [1, 0]]), num_segments=2, batch_dims=1)
+        result = gather(values, index)
+
+        # We use np.testing.assert_array_equal rather than Tensorflow's assertAllEqual
+        np.testing.assert_array_equal(result.numpy(), [[[1, 2], [3, 4]], [[7, 8], [5, 6]]])
\ No newline at end of file
diff --git a/tests/test_tokenization_tapas.py b/tests/test_tokenization_tapas.py
new file mode 100644
index 000000000000..5d8cba376515
--- /dev/null
+++ b/tests/test_tokenization_tapas.py
@@ -0,0 +1,3287 @@
+# coding=utf-8
+# Copyright 2018 The Google AI Language Team Authors.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
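+
+# For orientation before the tests below: a minimal, illustrative call of the tokenizer
+# under test, with a pandas DataFrame as the table (checkpoint name and values are taken
+# from the model tests above and are not the only valid choices):
+#
+#     table = pd.DataFrame({"Footballer": ["Lionel Messi"], "Age": ["33"]})
+#     tokenizer = TapasTokenizer.from_pretrained("nielsr/tapas-base-finetuned-wtq")
+#     inputs = tokenizer(table=table, queries="How old is Messi?", return_tensors="pt")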
+import inspect +import os +import shutil +import tempfile +import unittest +from typing import List, Tuple +import numpy as np + +import pandas as pd + +from transformers import AddedToken +from transformers.testing_utils import require_tokenizers, slow +from transformers.tokenization_tapas import ( + VOCAB_FILES_NAMES, + BasicTokenizer, + TapasTokenizer, + WordpieceTokenizer, + _is_control, + _is_punctuation, + _is_whitespace, +) + +from .test_tokenization_common import TokenizerTesterMixin, filter_non_english + + +@require_tokenizers +class TapasTokenizationTest(TokenizerTesterMixin, unittest.TestCase): + tokenizer_class = TapasTokenizer + test_rust_tokenizer = False + space_between_special_tokens = True + from_pretrained_filter = filter_non_english + + def get_table( + self, + tokenizer: TapasTokenizer, + length=5, + ): + toks = [tokenizer.decode([i], clean_up_tokenization_spaces=False) for i in range(len(tokenizer))] + + if length == 0: + data = {} + else: + data = {toks[0]: [toks[tok] for tok in range(1, length)]} + + table = pd.DataFrame.from_dict(data) + + return table + + def get_table_and_query( + self, + tokenizer: TapasTokenizer, + length=5, + ): + toks = [tokenizer.decode([i], clean_up_tokenization_spaces=False) for i in range(len(tokenizer))] + table = self.get_table(tokenizer, length=length - 3) + query = " ".join(toks[:3]) + + return table, query + + def get_clean_sequence( + self, + tokenizer: TapasTokenizer, + with_prefix_space=False, + max_length=20, + min_length=5, + empty_table: bool = False, + add_special_tokens: bool = True, + return_table_and_query: bool = False, + ): + + toks = [tokenizer.decode([i], clean_up_tokenization_spaces=False) for i in range(len(tokenizer))] + + if empty_table: + table = pd.DataFrame.from_dict({}) + query = " ".join(toks[:min_length]) + else: + data = {toks[0]: [toks[tok] for tok in range(1, min_length - 3)]} + table = pd.DataFrame.from_dict(data) + query = " ".join(toks[:3]) + + output_ids = tokenizer.encode(table, query, add_special_tokens=add_special_tokens) + output_txt = tokenizer.decode(output_ids) + + assert len(output_ids) >= min_length, "Update the code to generate the sequences so that they are larger" + assert len(output_ids) <= max_length, "Update the code to generate the sequences so that they are smaller" + + if return_table_and_query: + return output_txt, output_ids, table, query + + return output_txt, output_ids + + def setUp(self): + super().setUp() + + vocab_tokens = [ + "[UNK]", + "[CLS]", + "[SEP]", + "[PAD]", + "[MASK]", + "want", + "##want", + "##ed", + "wa", + "un", + "runn", + "##ing", + ",", + "low", + "lowest", + ] + self.vocab_file = os.path.join(self.tmpdirname, VOCAB_FILES_NAMES["vocab_file"]) + with open(self.vocab_file, "w", encoding="utf-8") as vocab_writer: + vocab_writer.write("".join([x + "\n" for x in vocab_tokens])) + + def get_input_output_texts(self, tokenizer): + input_text = "UNwant\u00E9d,running" + output_text = "unwanted, running" + return input_text, output_text + + def test_full_tokenizer(self): + tokenizer = self.tokenizer_class(self.vocab_file) + + tokens = tokenizer.tokenize("UNwant\u00E9d,running") + self.assertListEqual(tokens, ["un", "##want", "##ed", ",", "runn", "##ing"]) + self.assertListEqual(tokenizer.convert_tokens_to_ids(tokens), [9, 6, 7, 12, 10, 11]) + + def test_rust_and_python_full_tokenizers(self): + if not self.test_rust_tokenizer: + return + + tokenizer = self.get_tokenizer() + rust_tokenizer = self.get_rust_tokenizer() + + sequence = "UNwant\u00E9d,running" + + tokens = 
tokenizer.tokenize(sequence) + rust_tokens = rust_tokenizer.tokenize(sequence) + self.assertListEqual(tokens, rust_tokens) + + ids = tokenizer.encode(sequence, add_special_tokens=False) + rust_ids = rust_tokenizer.encode(sequence, add_special_tokens=False) + self.assertListEqual(ids, rust_ids) + + rust_tokenizer = self.get_rust_tokenizer() + ids = tokenizer.encode(sequence) + rust_ids = rust_tokenizer.encode(sequence) + self.assertListEqual(ids, rust_ids) + + # With lower casing + tokenizer = self.get_tokenizer(do_lower_case=True) + rust_tokenizer = self.get_rust_tokenizer(do_lower_case=True) + + sequence = "UNwant\u00E9d,running" + + tokens = tokenizer.tokenize(sequence) + rust_tokens = rust_tokenizer.tokenize(sequence) + self.assertListEqual(tokens, rust_tokens) + + ids = tokenizer.encode(sequence, add_special_tokens=False) + rust_ids = rust_tokenizer.encode(sequence, add_special_tokens=False) + self.assertListEqual(ids, rust_ids) + + rust_tokenizer = self.get_rust_tokenizer() + ids = tokenizer.encode(sequence) + rust_ids = rust_tokenizer.encode(sequence) + self.assertListEqual(ids, rust_ids) + + def test_chinese(self): + tokenizer = BasicTokenizer() + + self.assertListEqual(tokenizer.tokenize("ah\u535A\u63A8zz"), ["ah", "\u535A", "\u63A8", "zz"]) + + def test_basic_tokenizer_lower(self): + tokenizer = BasicTokenizer(do_lower_case=True) + + self.assertListEqual( + tokenizer.tokenize(" \tHeLLo!how \n Are yoU? "), ["hello", "!", "how", "are", "you", "?"] + ) + self.assertListEqual(tokenizer.tokenize("H\u00E9llo"), ["hello"]) + + def test_basic_tokenizer_lower_strip_accents_false(self): + tokenizer = BasicTokenizer(do_lower_case=True, strip_accents=False) + + self.assertListEqual( + tokenizer.tokenize(" \tHäLLo!how \n Are yoU? "), ["hällo", "!", "how", "are", "you", "?"] + ) + self.assertListEqual(tokenizer.tokenize("H\u00E9llo"), ["h\u00E9llo"]) + + def test_basic_tokenizer_lower_strip_accents_true(self): + tokenizer = BasicTokenizer(do_lower_case=True, strip_accents=True) + + self.assertListEqual( + tokenizer.tokenize(" \tHäLLo!how \n Are yoU? "), ["hallo", "!", "how", "are", "you", "?"] + ) + self.assertListEqual(tokenizer.tokenize("H\u00E9llo"), ["hello"]) + + def test_basic_tokenizer_lower_strip_accents_default(self): + tokenizer = BasicTokenizer(do_lower_case=True) + + self.assertListEqual( + tokenizer.tokenize(" \tHäLLo!how \n Are yoU? "), ["hallo", "!", "how", "are", "you", "?"] + ) + self.assertListEqual(tokenizer.tokenize("H\u00E9llo"), ["hello"]) + + def test_basic_tokenizer_no_lower(self): + tokenizer = BasicTokenizer(do_lower_case=False) + + self.assertListEqual( + tokenizer.tokenize(" \tHeLLo!how \n Are yoU? "), ["HeLLo", "!", "how", "Are", "yoU", "?"] + ) + + def test_basic_tokenizer_no_lower_strip_accents_false(self): + tokenizer = BasicTokenizer(do_lower_case=False, strip_accents=False) + + self.assertListEqual( + tokenizer.tokenize(" \tHäLLo!how \n Are yoU? "), ["HäLLo", "!", "how", "Are", "yoU", "?"] + ) + + def test_basic_tokenizer_no_lower_strip_accents_true(self): + tokenizer = BasicTokenizer(do_lower_case=False, strip_accents=True) + + self.assertListEqual( + tokenizer.tokenize(" \tHäLLo!how \n Are yoU? "), ["HaLLo", "!", "how", "Are", "yoU", "?"] + ) + + def test_basic_tokenizer_respects_never_split_tokens(self): + tokenizer = BasicTokenizer(do_lower_case=False, never_split=["[UNK]"]) + + self.assertListEqual( + tokenizer.tokenize(" \tHeLLo!how \n Are yoU? 
[UNK]"), ["HeLLo", "!", "how", "Are", "yoU", "?", "[UNK]"] + ) + + def test_wordpiece_tokenizer(self): + vocab_tokens = ["[UNK]", "[CLS]", "[SEP]", "want", "##want", "##ed", "wa", "un", "runn", "##ing"] + + vocab = {} + for (i, token) in enumerate(vocab_tokens): + vocab[token] = i + tokenizer = WordpieceTokenizer(vocab=vocab, unk_token="[UNK]") + + self.assertListEqual(tokenizer.tokenize(""), []) + + self.assertListEqual(tokenizer.tokenize("unwanted running"), ["un", "##want", "##ed", "runn", "##ing"]) + + self.assertListEqual(tokenizer.tokenize("unwantedX running"), ["[UNK]", "runn", "##ing"]) + + def test_is_whitespace(self): + self.assertTrue(_is_whitespace(" ")) + self.assertTrue(_is_whitespace("\t")) + self.assertTrue(_is_whitespace("\r")) + self.assertTrue(_is_whitespace("\n")) + self.assertTrue(_is_whitespace("\u00A0")) + + self.assertFalse(_is_whitespace("A")) + self.assertFalse(_is_whitespace("-")) + + def test_is_control(self): + self.assertTrue(_is_control("\u0005")) + + self.assertFalse(_is_control("A")) + self.assertFalse(_is_control(" ")) + self.assertFalse(_is_control("\t")) + self.assertFalse(_is_control("\r")) + + def test_is_punctuation(self): + self.assertTrue(_is_punctuation("-")) + self.assertTrue(_is_punctuation("$")) + self.assertTrue(_is_punctuation("`")) + self.assertTrue(_is_punctuation(".")) + + self.assertFalse(_is_punctuation("A")) + self.assertFalse(_is_punctuation(" ")) + + def test_clean_text(self): + tokenizer = self.get_tokenizer() + + # Example taken from the issue https://github.com/huggingface/tokenizers/issues/340 + self.assertListEqual([tokenizer.tokenize(t) for t in ["Test", "\xad", "test"]], [["[UNK]"], ["[EMPTY]"], ["[UNK]"]]) + + @slow + def test_sequence_builders(self): + tokenizer = self.tokenizer_class.from_pretrained("tapas-base-uncased") + + text = tokenizer.encode("sequence builders", add_special_tokens=False) + text_2 = tokenizer.encode("multi-sequence build", add_special_tokens=False) + + encoded_sentence = tokenizer.build_inputs_with_special_tokens(text) + encoded_pair = tokenizer.build_inputs_with_special_tokens(text, text_2) + + assert encoded_sentence == [101] + text + [102] + assert encoded_pair == [101] + text + [102] + text_2 + [102] + + def test_offsets_with_special_characters(self): + for tokenizer, pretrained_name, kwargs in self.tokenizers_list: + with self.subTest("{} ({})".format(tokenizer.__class__.__name__, pretrained_name)): + tokenizer_r = self.rust_tokenizer_class.from_pretrained(pretrained_name, **kwargs) + + sentence = f"A, naïve {tokenizer_r.mask_token} AllenNLP sentence." 
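+                # encode_plus with return_offsets_mapping=True also returns, per token, its
+                # (start, end) character span in the raw sentence; offset mappings are only
+                # available on "fast" (Rust-backed) tokenizers.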
+ tokens = tokenizer_r.encode_plus( + sentence, + return_attention_mask=False, + return_token_type_ids=False, + return_offsets_mapping=True, + add_special_tokens=True, + ) + + do_lower_case = tokenizer_r.do_lower_case if hasattr(tokenizer_r, "do_lower_case") else False + expected_results = ( + [ + ((0, 0), tokenizer_r.cls_token), + ((0, 1), "A"), + ((1, 2), ","), + ((3, 5), "na"), + ((5, 6), "##ï"), + ((6, 8), "##ve"), + ((9, 15), tokenizer_r.mask_token), + ((16, 21), "Allen"), + ((21, 23), "##NL"), + ((23, 24), "##P"), + ((25, 33), "sentence"), + ((33, 34), "."), + ((0, 0), tokenizer_r.sep_token), + ] + if not do_lower_case + else [ + ((0, 0), tokenizer_r.cls_token), + ((0, 1), "a"), + ((1, 2), ","), + ((3, 8), "naive"), + ((9, 15), tokenizer_r.mask_token), + ((16, 21), "allen"), + ((21, 23), "##nl"), + ((23, 24), "##p"), + ((25, 33), "sentence"), + ((33, 34), "."), + ((0, 0), tokenizer_r.sep_token), + ] + ) + + self.assertEqual( + [e[1] for e in expected_results], tokenizer_r.convert_ids_to_tokens(tokens["input_ids"]) + ) + self.assertEqual([e[0] for e in expected_results], tokens["offset_mapping"]) + + def test_add_special_tokens(self): + tokenizers: List[TapasTokenizer] = self.get_tokenizers(do_lower_case=False) + for tokenizer in tokenizers: + with self.subTest(f"{tokenizer.__class__.__name__}"): + input_table = self.get_table(tokenizer, length=0) + + special_token = "[SPECIAL_TOKEN]" + + tokenizer.add_special_tokens({"cls_token": special_token}) + encoded_special_token = tokenizer.encode(input_table, special_token, add_special_tokens=False) + self.assertEqual(len(encoded_special_token), 1) + + decoded = tokenizer.decode(encoded_special_token, skip_special_tokens=True) + self.assertTrue(special_token not in decoded) + + def test_add_tokens_tokenizer(self): + tokenizers: List[TapasTokenizer] = self.get_tokenizers(do_lower_case=False) + for tokenizer in tokenizers: + with self.subTest(f"{tokenizer.__class__.__name__}"): + table = self.get_table(tokenizer, length=0) + vocab_size = tokenizer.vocab_size + all_size = len(tokenizer) + + self.assertNotEqual(vocab_size, 0) + + # We usually have added tokens from the start in tests because our vocab fixtures are + # smaller than the original vocabs - let's not assert this + # self.assertEqual(vocab_size, all_size) + + new_toks = ["aaaaa bbbbbb", "cccccccccdddddddd"] + added_toks = tokenizer.add_tokens(new_toks) + vocab_size_2 = tokenizer.vocab_size + all_size_2 = len(tokenizer) + + self.assertNotEqual(vocab_size_2, 0) + self.assertEqual(vocab_size, vocab_size_2) + self.assertEqual(added_toks, len(new_toks)) + self.assertEqual(all_size_2, all_size + len(new_toks)) + + tokens = tokenizer.encode(table, "aaaaa bbbbbb low cccccccccdddddddd l", add_special_tokens=False) + + self.assertGreaterEqual(len(tokens), 4) + self.assertGreater(tokens[0], tokenizer.vocab_size - 1) + self.assertGreater(tokens[-2], tokenizer.vocab_size - 1) + + new_toks_2 = {"eos_token": ">>>>|||<||<<|<<", "pad_token": "<<<<<|||>|>>>>|>"} + added_toks_2 = tokenizer.add_special_tokens(new_toks_2) + vocab_size_3 = tokenizer.vocab_size + all_size_3 = len(tokenizer) + + self.assertNotEqual(vocab_size_3, 0) + self.assertEqual(vocab_size, vocab_size_3) + self.assertEqual(added_toks_2, len(new_toks_2)) + self.assertEqual(all_size_3, all_size_2 + len(new_toks_2)) + + tokens = tokenizer.encode( + table, + ">>>>|||<||<<|<< aaaaabbbbbb low cccccccccdddddddd <<<<<|||>|>>>>|> l", + add_special_tokens=False, + ) + + self.assertGreaterEqual(len(tokens), 6) + self.assertGreater(tokens[0], 
tokenizer.vocab_size - 1) + self.assertGreater(tokens[0], tokens[1]) + self.assertGreater(tokens[-2], tokenizer.vocab_size - 1) + self.assertGreater(tokens[-2], tokens[-3]) + self.assertEqual(tokens[0], tokenizer.eos_token_id) + self.assertEqual(tokens[-2], tokenizer.pad_token_id) + + @require_tokenizers + def test_encode_decode_with_spaces(self): + tokenizers = self.get_tokenizers(do_lower_case=False) + for tokenizer in tokenizers: + with self.subTest(f"{tokenizer.__class__.__name__}"): + table = self.get_table(tokenizer, length=0) + + # new_toks = ["[ABC]", "[DEF]"] # TODO(thom) add this one back when Rust toks are ready: , "GHI IHG"] + new_toks = [AddedToken("[ABC]", normalized=False), AddedToken("[DEF]", normalized=False)] + tokenizer.add_tokens(new_toks) + input = "[ABC][DEF][ABC][DEF]" # TODO(thom) add back cf above: "[ABC] [DEF] [ABC] GHI IHG [DEF]" + if self.space_between_special_tokens: + output = "[ABC] [DEF] [ABC] [DEF]" + else: + output = input + encoded = tokenizer.encode(table, input, add_special_tokens=False) + decoded = tokenizer.decode(encoded, spaces_between_special_tokens=self.space_between_special_tokens) + self.assertIn(decoded, [output, output.lower()]) + + def test_encode_plus_with_padding(self): + tokenizers = self.get_tokenizers(do_lower_case=False) + for tokenizer in tokenizers: + with self.subTest(f"{tokenizer.__class__.__name__}"): + table = self.get_table(tokenizer, length=0) + sequence = "Sequence" + + # check correct behaviour if no pad_token_id exists and add it eventually + self._check_no_pad_token_padding(tokenizer, sequence) + + padding_size = 10 + padding_idx = tokenizer.pad_token_id + token_type_padding_idx = tokenizer.pad_token_type_id + + encoded_sequence = tokenizer.encode_plus(table, sequence, return_special_tokens_mask=True) + input_ids = encoded_sequence["input_ids"] + special_tokens_mask = encoded_sequence["special_tokens_mask"] + sequence_length = len(input_ids) + + # Test 'longest' and 'no_padding' don't do anything + tokenizer.padding_side = "right" + + # padding=True resolves to 'longest', which is a no-op for a single sequence + not_padded_sequence = tokenizer.encode_plus( + table, + sequence, + padding=True, + return_special_tokens_mask=True, + ) + not_padded_input_ids = not_padded_sequence["input_ids"] + + not_padded_special_tokens_mask = not_padded_sequence["special_tokens_mask"] + not_padded_sequence_length = len(not_padded_input_ids) + + assert sequence_length == not_padded_sequence_length + assert input_ids == not_padded_input_ids + assert special_tokens_mask == not_padded_special_tokens_mask + + # padding=False should leave the sequence untouched as well + not_padded_sequence = tokenizer.encode_plus( + table, + sequence, + padding=False, + return_special_tokens_mask=True, + ) + not_padded_input_ids = not_padded_sequence["input_ids"] + + not_padded_special_tokens_mask = not_padded_sequence["special_tokens_mask"] + not_padded_sequence_length = len(not_padded_input_ids) + + assert sequence_length == not_padded_sequence_length + assert input_ids == not_padded_input_ids + assert special_tokens_mask == not_padded_special_tokens_mask + + # Test right padding + tokenizer.padding_side = "right" + + right_padded_sequence = tokenizer.encode_plus( + table, + sequence, + max_length=sequence_length + padding_size, + padding="max_length", + return_special_tokens_mask=True, + ) + right_padded_input_ids = right_padded_sequence["input_ids"] + + right_padded_special_tokens_mask = right_padded_sequence["special_tokens_mask"] + right_padded_sequence_length = len(right_padded_input_ids) + + assert sequence_length + padding_size == right_padded_sequence_length + assert input_ids + [padding_idx]
* padding_size == right_padded_input_ids + assert special_tokens_mask + [1] * padding_size == right_padded_special_tokens_mask + + # Test left padding + tokenizer.padding_side = "left" + left_padded_sequence = tokenizer.encode_plus( + table, + sequence, + max_length=sequence_length + padding_size, + padding="max_length", + return_special_tokens_mask=True, + ) + left_padded_input_ids = left_padded_sequence["input_ids"] + left_padded_special_tokens_mask = left_padded_sequence["special_tokens_mask"] + left_padded_sequence_length = len(left_padded_input_ids) + + assert sequence_length + padding_size == left_padded_sequence_length + assert [padding_idx] * padding_size + input_ids == left_padded_input_ids + assert [1] * padding_size + special_tokens_mask == left_padded_special_tokens_mask + + if "token_type_ids" in tokenizer.model_input_names: + token_type_ids = encoded_sequence["token_type_ids"] + left_padded_token_type_ids = left_padded_sequence["token_type_ids"] + right_padded_token_type_ids = right_padded_sequence["token_type_ids"] + + assert token_type_ids + [[token_type_padding_idx] * 7] * padding_size == right_padded_token_type_ids + assert [[token_type_padding_idx] * 7] * padding_size + token_type_ids == left_padded_token_type_ids + + if "attention_mask" in tokenizer.model_input_names: + attention_mask = encoded_sequence["attention_mask"] + right_padded_attention_mask = right_padded_sequence["attention_mask"] + left_padded_attention_mask = left_padded_sequence["attention_mask"] + + assert attention_mask + [0] * padding_size == right_padded_attention_mask + assert [0] * padding_size + attention_mask == left_padded_attention_mask + + def test_internal_consistency(self): + tokenizers = self.get_tokenizers() + for tokenizer in tokenizers: + with self.subTest(f"{tokenizer.__class__.__name__}"): + table = self.get_table(tokenizer, length=0) + input_text, output_text = self.get_input_output_texts(tokenizer) + + tokens = tokenizer.tokenize(input_text) + ids = tokenizer.convert_tokens_to_ids(tokens) + ids_2 = tokenizer.encode(table, input_text, add_special_tokens=False) + self.assertListEqual(ids, ids_2) + + tokens_2 = tokenizer.convert_ids_to_tokens(ids) + self.assertNotEqual(len(tokens_2), 0) + text_2 = tokenizer.decode(ids) + self.assertIsInstance(text_2, str) + + self.assertEqual(text_2, output_text) + + def test_mask_output(self): + tokenizers = self.get_tokenizers(fast=False, do_lower_case=False) + for tokenizer in tokenizers: + with self.subTest(f"{tokenizer.__class__.__name__}"): + table, query = self.get_table_and_query(tokenizer) + + if ( + tokenizer.build_inputs_with_special_tokens.__qualname__.split(".")[0] != "PreTrainedTokenizer" + and "token_type_ids" in tokenizer.model_input_names + ): + information = tokenizer.encode_plus(table, query, add_special_tokens=True) + sequences, mask = information["input_ids"], information["token_type_ids"] + self.assertEqual(len(sequences), len(mask)) + + @unittest.skip("TAPAS tokenizer only handles two sequences.") + def test_maximum_encoding_length_pair_input(self): + pass + + @unittest.skip("TAPAS tokenizer only handles two sequences.") + def test_maximum_encoding_length_single_input(self): + pass + + def test_number_of_added_tokens(self): + tokenizers = self.get_tokenizers(do_lower_case=False) + for tokenizer in tokenizers: + with self.subTest(f"{tokenizer.__class__.__name__}"): + table, query = self.get_table_and_query(tokenizer) + + sequences = tokenizer.encode(table, query, add_special_tokens=False) + attached_sequences = 
tokenizer.encode(table, query, add_special_tokens=True) + + # Method is implemented (e.g. not GPT-2) + if len(attached_sequences) != 2: + self.assertEqual( + tokenizer.num_special_tokens_to_add(pair=True), len(attached_sequences) - len(sequences) + ) + + def test_padding_to_max_length(self): + """We keep this test for backward compatibility but it should be removed once `pad_to_max_length` is deprecated""" + tokenizers = self.get_tokenizers(do_lower_case=False) + for tokenizer in tokenizers: + with self.subTest(f"{tokenizer.__class__.__name__}"): + table = self.get_table(tokenizer) + sequence = "Sequence" + padding_size = 10 + + # check correct behaviour if no pad_token_id exists and add it eventually + self._check_no_pad_token_padding(tokenizer, sequence) + + padding_idx = tokenizer.pad_token_id + + # Check that it correctly pads when a maximum length is specified along with the padding flag set to True + tokenizer.padding_side = "right" + encoded_sequence = tokenizer.encode(table, sequence) + sequence_length = len(encoded_sequence) + # FIXME: the next line should be padding(max_length) to avoid warning + padded_sequence = tokenizer.encode( + table, sequence, max_length=sequence_length + padding_size, padding=True + ) + padded_sequence_length = len(padded_sequence) + assert sequence_length + padding_size == padded_sequence_length + assert encoded_sequence + [padding_idx] * padding_size == padded_sequence + + # Check that nothing is done when a maximum length is not specified + encoded_sequence = tokenizer.encode(table, sequence) + sequence_length = len(encoded_sequence) + + tokenizer.padding_side = "right" + padded_sequence_right = tokenizer.encode(table, sequence, pad_to_max_length=True) + padded_sequence_right_length = len(padded_sequence_right) + assert sequence_length == padded_sequence_right_length + assert encoded_sequence == padded_sequence_right + + def test_call(self): + # Tests that all calls wrap to encode_plus and batch_encode_plus + tokenizers = self.get_tokenizers(do_lower_case=False) + for tokenizer in tokenizers: + with self.subTest(f"{tokenizer.__class__.__name__}"): + sequences = [ + "Testing batch encode plus", + "Testing 
batch encode plus with different sequence lengths", + "Testing batch encode plus with different sequence lengths correctly pads", + ] + + # Test not batched + table = self.get_table(tokenizer, length=0) + encoded_sequences_1 = tokenizer.encode_plus(table, sequences[0]) + encoded_sequences_2 = tokenizer(table, sequences[0]) + self.assertEqual(encoded_sequences_1, encoded_sequences_2) + + # Test not batched pairs + table = self.get_table(tokenizer, length=10) + encoded_sequences_1 = tokenizer.encode_plus(table, sequences[1]) + encoded_sequences_2 = tokenizer(table, sequences[1]) + self.assertEqual(encoded_sequences_1, encoded_sequences_2) + + # Test batched + table = self.get_table(tokenizer, length=0) + encoded_sequences_1 = tokenizer.batch_encode_plus(table, sequences) + encoded_sequences_2 = tokenizer(table, sequences) + self.assertEqual(encoded_sequences_1, encoded_sequences_2) + + def test_batch_encode_plus_batch_sequence_length(self): + # Tests that all encoded values have the correct size + tokenizers = self.get_tokenizers(do_lower_case=False) + for tokenizer in tokenizers: + with self.subTest(f"{tokenizer.__class__.__name__}"): + table = self.get_table(tokenizer, length=0) + sequences = [ + "Testing batch encode plus", + "Testing batch encode plus with different sequence lengths", + "Testing batch encode plus with different sequence lengths correctly pads", + ] + + encoded_sequences = [tokenizer.encode_plus(table, sequence) for sequence in sequences] + encoded_sequences_batch = tokenizer.batch_encode_plus(table, sequences, padding=False) + self.assertListEqual( + encoded_sequences, self.convert_batch_encode_plus_format_to_encode_plus(encoded_sequences_batch) + ) + + maximum_length = len( + max([encoded_sequence["input_ids"] for encoded_sequence in encoded_sequences], key=len) + ) + + # check correct behaviour if no pad_token_id exists and add it eventually + self._check_no_pad_token_padding(tokenizer, sequences) + + encoded_sequences_padded = [ + tokenizer.encode_plus(table, sequence, max_length=maximum_length, padding="max_length") + for sequence in sequences + ] + + encoded_sequences_batch_padded = tokenizer.batch_encode_plus(table, sequences, padding=True) + self.assertListEqual( + encoded_sequences_padded, + self.convert_batch_encode_plus_format_to_encode_plus(encoded_sequences_batch_padded), + ) + + # check 'longest' is unsensitive to a max length + encoded_sequences_batch_padded_1 = tokenizer.batch_encode_plus(table, sequences, padding=True) + encoded_sequences_batch_padded_2 = tokenizer.batch_encode_plus( + table, sequences, max_length=maximum_length + 10, padding="longest" + ) + for key in encoded_sequences_batch_padded_1.keys(): + self.assertListEqual( + encoded_sequences_batch_padded_1[key], + encoded_sequences_batch_padded_2[key], + ) + + # check 'no_padding' is unsensitive to a max length + encoded_sequences_batch_padded_1 = tokenizer.batch_encode_plus(table, sequences, padding=False) + encoded_sequences_batch_padded_2 = tokenizer.batch_encode_plus( + table, sequences, max_length=maximum_length + 10, padding=False + ) + for key in encoded_sequences_batch_padded_1.keys(): + self.assertListEqual( + encoded_sequences_batch_padded_1[key], + encoded_sequences_batch_padded_2[key], + ) + + @unittest.skip("batch_encode_plus does not handle overflowing tokens.") + def test_batch_encode_plus_overflowing_tokens(self): + pass + + def test_batch_encode_plus_padding(self): + # Test that padded sequences are equivalent between batch_encode_plus and encode_plus + + # Right padding tests 
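+ # (for each padding side, encode_plus and batch_encode_plus are expected to produce + # identical ids and masks when given the same max_length and padding strategy)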
+ tokenizers = self.get_tokenizers(do_lower_case=False) + for tokenizer in tokenizers: + with self.subTest(f"{tokenizer.__class__.__name__}"): + table = self.get_table(tokenizer, length=0) + sequences = [ + "Testing batch encode plus", + "Testing batch encode plus with different sequence lengths", + "Testing batch encode plus with different sequence lengths correctly pads", + ] + + max_length = 100 + + # check correct behaviour if no pad_token_id exists and add it eventually + self._check_no_pad_token_padding(tokenizer, sequences) + + encoded_sequences = [ + tokenizer.encode_plus(table, sequence, max_length=max_length, padding="max_length") + for sequence in sequences + ] + encoded_sequences_batch = tokenizer.batch_encode_plus( + table, sequences, max_length=max_length, padding="max_length" + ) + self.assertListEqual( + encoded_sequences, self.convert_batch_encode_plus_format_to_encode_plus(encoded_sequences_batch) + ) + + # Left padding tests + tokenizers = self.get_tokenizers(do_lower_case=False) + for tokenizer in tokenizers: + with self.subTest(f"{tokenizer.__class__.__name__}"): + tokenizer.padding_side = "left" + # rebuild the table for this tokenizer, as in the right padding tests above + table = self.get_table(tokenizer, length=0) + sequences = [ + "Testing batch encode plus", + "Testing batch encode plus with different sequence lengths", + "Testing batch encode plus with different sequence lengths correctly pads", + ] + + max_length = 100 + + # check correct behaviour if no pad_token_id exists and add it eventually + self._check_no_pad_token_padding(tokenizer, sequences) + + encoded_sequences = [ + tokenizer.encode_plus(table, sequence, max_length=max_length, padding="max_length") + for sequence in sequences + ] + encoded_sequences_batch = tokenizer.batch_encode_plus( + table, sequences, max_length=max_length, padding="max_length" + ) + self.assertListEqual( + encoded_sequences, self.convert_batch_encode_plus_format_to_encode_plus(encoded_sequences_batch) + ) + + def test_padding_to_multiple_of(self): + tokenizers = self.get_tokenizers() + for tokenizer in tokenizers: + with self.subTest(f"{tokenizer.__class__.__name__}"): + table = self.get_table(tokenizer, length=0) + if tokenizer.pad_token is None: + self.skipTest("No padding token.") + else: + empty_tokens = tokenizer(table, padding=True, pad_to_multiple_of=8) + normal_tokens = tokenizer(table, "This is a sample input", padding=True, pad_to_multiple_of=8) + for key, value in empty_tokens.items(): + self.assertEqual(len(value) % 8, 0, "BatchEncoding.{} is not a multiple of 8".format(key)) + for key, value in normal_tokens.items(): + self.assertEqual(len(value) % 8, 0, "BatchEncoding.{} is not a multiple of 8".format(key)) + + normal_tokens = tokenizer(table, "This", pad_to_multiple_of=8) + for key, value in normal_tokens.items(): + self.assertNotEqual(len(value) % 8, 0, "BatchEncoding.{} is not a multiple of 8".format(key)) + + # Should also work with truncation + normal_tokens = tokenizer(table, "This", padding=True, truncation=True, pad_to_multiple_of=8) + for key, value in normal_tokens.items(): + self.assertEqual(len(value) % 8, 0, "BatchEncoding.{} is not a multiple of 8".format(key)) + + @unittest.skip("TAPAS cannot handle `prepare_for_model` without going through `encode_plus` or `batch_encode_plus`") + def test_prepare_for_model(self): + pass + + def test_tokenizer_slow_store_full_signature(self): + signature = inspect.signature(self.tokenizer_class.__init__) + tokenizer = self.get_tokenizer() + + for parameter_name, parameter in signature.parameters.items(): + if parameter.default != inspect.Parameter.empty: + self.assertIn(parameter_name, 
tokenizer.init_kwargs) + + def test_special_tokens_mask_input_pairs(self): + tokenizers = self.get_tokenizers(do_lower_case=False) + for tokenizer in tokenizers: + with self.subTest(f"{tokenizer.__class__.__name__}"): + sequence_0 = "Encode this." + empty_table = self.get_table(tokenizer, length=0) + table = self.get_table(tokenizer, length=10) + encoded_sequence = tokenizer.encode(empty_table, sequence_0, add_special_tokens=False) + encoded_sequence += tokenizer.encode(table, "", add_special_tokens=False) + encoded_sequence_dict = tokenizer.encode_plus( + table, + sequence_0, + add_special_tokens=True, + return_special_tokens_mask=True, + # add_prefix_space=False, + ) + encoded_sequence_w_special = encoded_sequence_dict["input_ids"] + special_tokens_mask = encoded_sequence_dict["special_tokens_mask"] + self.assertEqual(len(special_tokens_mask), len(encoded_sequence_w_special)) + + filtered_sequence = [ + (x if not special_tokens_mask[i] else None) for i, x in enumerate(encoded_sequence_w_special) + ] + filtered_sequence = [x for x in filtered_sequence if x is not None] + self.assertEqual(encoded_sequence, filtered_sequence) + + def test_special_tokens_mask(self): + tokenizers = self.get_tokenizers(do_lower_case=False) + for tokenizer in tokenizers: + with self.subTest(f"{tokenizer.__class__.__name__}"): + table = self.get_table(tokenizer, length=0) + sequence_0 = "Encode this." + # Testing single inputs + encoded_sequence = tokenizer.encode(table, sequence_0, add_special_tokens=False) + encoded_sequence_dict = tokenizer.encode_plus( + table, sequence_0, add_special_tokens=True, return_special_tokens_mask=True + ) + encoded_sequence_w_special = encoded_sequence_dict["input_ids"] + special_tokens_mask = encoded_sequence_dict["special_tokens_mask"] + self.assertEqual(len(special_tokens_mask), len(encoded_sequence_w_special)) + + filtered_sequence = [x for i, x in enumerate(encoded_sequence_w_special) if not special_tokens_mask[i]] + self.assertEqual(encoded_sequence, filtered_sequence) + + def test_save_and_load_tokenizer(self): + # safety check on max_len default value so we are sure the test works + tokenizers = self.get_tokenizers() + for tokenizer in tokenizers: + with self.subTest(f"{tokenizer.__class__.__name__}"): + self.assertNotEqual(tokenizer.model_max_length, 42) + + # Now let's start the test + tokenizers = self.get_tokenizers() + for tokenizer in tokenizers: + with self.subTest(f"{tokenizer.__class__.__name__}"): + # Isolate this from the other tests because we save additional tokens/etc + table = self.get_table(tokenizer, length=0) + tmpdirname = tempfile.mkdtemp() + + sample_text = " He is very happy, UNwant\u00E9d,running" + before_tokens = tokenizer.encode(table, sample_text, add_special_tokens=False) + before_vocab = tokenizer.get_vocab() + tokenizer.save_pretrained(tmpdirname) + + after_tokenizer = tokenizer.__class__.from_pretrained(tmpdirname) + after_tokens = after_tokenizer.encode(table, sample_text, add_special_tokens=False) + after_vocab = after_tokenizer.get_vocab() + self.assertListEqual(before_tokens, after_tokens) + self.assertDictEqual(before_vocab, after_vocab) + + shutil.rmtree(tmpdirname) + + def test_right_and_left_padding(self): + tokenizers = self.get_tokenizers(do_lower_case=False) + for tokenizer in tokenizers: + with self.subTest(f"{tokenizer.__class__.__name__}"): + table = self.get_table(tokenizer, length=0) + sequence = "Sequence" + padding_size = 10 + + # check correct behaviour if no pad_token_id exists and add it eventually + 
self._check_no_pad_token_padding(tokenizer, sequence) + + padding_idx = tokenizer.pad_token_id + + # RIGHT PADDING - Check that it correctly pads when a maximum length is specified along with the padding flag set to True + tokenizer.padding_side = "right" + encoded_sequence = tokenizer.encode(table, sequence) + sequence_length = len(encoded_sequence) + padded_sequence = tokenizer.encode( + table, sequence, max_length=sequence_length + padding_size, padding="max_length" + ) + padded_sequence_length = len(padded_sequence) + assert sequence_length + padding_size == padded_sequence_length + assert encoded_sequence + [padding_idx] * padding_size == padded_sequence + + # LEFT PADDING - Check that it correctly pads when a maximum length is specified along with the padding flag set to True + tokenizer.padding_side = "left" + encoded_sequence = tokenizer.encode(table, sequence) + sequence_length = len(encoded_sequence) + padded_sequence = tokenizer.encode( + table, sequence, max_length=sequence_length + padding_size, padding="max_length" + ) + padded_sequence_length = len(padded_sequence) + assert sequence_length + padding_size == padded_sequence_length + assert [padding_idx] * padding_size + encoded_sequence == padded_sequence + + # RIGHT & LEFT PADDING - Check that nothing is done for 'longest' and 'no_padding' + encoded_sequence = tokenizer.encode(table, sequence) + sequence_length = len(encoded_sequence) + + tokenizer.padding_side = "right" + padded_sequence_right = tokenizer.encode(table, sequence, padding=True) + padded_sequence_right_length = len(padded_sequence_right) + assert sequence_length == padded_sequence_right_length + assert encoded_sequence == padded_sequence_right + + tokenizer.padding_side = "left" + padded_sequence_left = tokenizer.encode(table, sequence, padding="longest") + padded_sequence_left_length = len(padded_sequence_left) + assert sequence_length == padded_sequence_left_length + assert encoded_sequence == padded_sequence_left + + tokenizer.padding_side = "right" + padded_sequence_right = tokenizer.encode(table, sequence) + padded_sequence_right_length = len(padded_sequence_right) + assert sequence_length == padded_sequence_right_length + assert encoded_sequence == padded_sequence_right + + tokenizer.padding_side = "left" + padded_sequence_left = tokenizer.encode(table, sequence, padding=False) + padded_sequence_left_length = len(padded_sequence_left) + assert sequence_length == padded_sequence_left_length + assert encoded_sequence == padded_sequence_left + + @unittest.skip("TAPAS doesn't handle pre-tokenized inputs.") + def test_pretokenized_inputs(self): + pass + + # TODO SET TO SLOW + def test_tapas_truncation_integration_test(self): + data = { + "Actors": ["Brad Pitt", "Leonardo Di Caprio", "George Clooney"], + "Age": ["56", "45", "59"], + "Number of movies": ["87", "53", "69"], + "Date of birth": ["18 december 1963", "11 november 1974", "6 may 1961"], + } + queries = [ + "When was Brad Pitt born?", + "Which actor appeared in the least number of movies?", + "What is the average number of movies?", + ] + table = pd.DataFrame.from_dict(data) + + # TODO: Should update this in the future + tokenizer = TapasTokenizer.from_pretrained("lysandre/tapas-temporary-repo", model_max_length=512) + + for i in range(12): + # The table cannot even encode the headers, so raise an error + with self.assertRaises(ValueError): + tokenizer.encode(table=table, query=queries[0], max_length=i, truncation="drop_rows_to_fit") + + for i in range(12, 512): + new_encoded_inputs = 
tokenizer.encode(table=table, query=queries[0], max_length=i, truncation="drop_rows_to_fit") + + # Ensure that the input IDs are less than the max length defined. + self.assertLessEqual(len(new_encoded_inputs), i) + + # TODO SET TO SLOW + def test_tapas_integration_test(self): + data = { + "Actors": ["Brad Pitt", "Leonardo Di Caprio", "George Clooney"], + "Age": ["56", "45", "59"], + "Number of movies": ["87", "53", "69"], + "Date of birth": ["18 december 1963", "11 november 1974", "6 may 1961"], + } + queries = [ + "When was Brad Pitt born?", + "Which actor appeared in the least number of movies?", + "What is the average number of movies?", + ] + table = pd.DataFrame.from_dict(data) + + # TODO: Should update this in the future + tokenizer = TapasTokenizer.from_pretrained("lysandre/tapas-temporary-repo", model_max_length=512) + + expected_results = { + "input_ids": [ + 101, + 2043, + 2001, + 8226, + 15091, + 2141, + 1029, + 102, + 5889, + 2287, + 2193, + 1997, + 5691, + 3058, + 1997, + 4182, + 8226, + 15091, + 5179, + 6584, + 2324, + 2285, + 3699, + 14720, + 4487, + 6178, + 9488, + 3429, + 5187, + 2340, + 2281, + 3326, + 2577, + 18856, + 7828, + 3240, + 5354, + 6353, + 1020, + 2089, + 3777, + ], + "attention_mask": [ + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + ], + "token_type_ids": [ + [0, 0, 0, 0, 0, 0, 0], + [0, 0, 0, 0, 0, 0, 0], + [0, 0, 0, 0, 0, 0, 0], + [0, 0, 0, 0, 0, 0, 0], + [0, 0, 0, 0, 0, 0, 0], + [0, 0, 0, 0, 0, 0, 0], + [0, 0, 0, 0, 0, 0, 0], + [0, 0, 0, 0, 0, 0, 0], + [1, 1, 0, 0, 0, 0, 0], + [1, 2, 0, 0, 0, 0, 0], + [1, 3, 0, 0, 0, 0, 0], + [1, 3, 0, 0, 0, 0, 0], + [1, 3, 0, 0, 0, 0, 0], + [1, 4, 0, 0, 0, 0, 0], + [1, 4, 0, 0, 0, 0, 0], + [1, 4, 0, 0, 0, 0, 0], + [1, 1, 1, 0, 0, 0, 0], + [1, 1, 1, 0, 0, 0, 0], + [1, 2, 1, 0, 2, 2, 0], + [1, 3, 1, 0, 3, 1, 0], + [1, 4, 1, 0, 2, 2, 0], + [1, 4, 1, 0, 2, 2, 0], + [1, 4, 1, 0, 2, 2, 0], + [1, 1, 2, 0, 0, 0, 0], + [1, 1, 2, 0, 0, 0, 0], + [1, 1, 2, 0, 0, 0, 0], + [1, 1, 2, 0, 0, 0, 0], + [1, 2, 2, 0, 1, 3, 0], + [1, 3, 2, 0, 1, 3, 0], + [1, 4, 2, 0, 3, 1, 0], + [1, 4, 2, 0, 3, 1, 0], + [1, 4, 2, 0, 3, 1, 0], + [1, 1, 3, 0, 0, 0, 0], + [1, 1, 3, 0, 0, 0, 0], + [1, 1, 3, 0, 0, 0, 0], + [1, 1, 3, 0, 0, 0, 0], + [1, 2, 3, 0, 3, 1, 0], + [1, 3, 3, 0, 2, 2, 0], + [1, 4, 3, 0, 1, 3, 0], + [1, 4, 3, 0, 1, 3, 0], + [1, 4, 3, 0, 1, 3, 0], + ], + } + + new_encoded_inputs = tokenizer.encode_plus(table=table, query=queries[0]) + + self.assertDictEqual(dict(new_encoded_inputs), expected_results) + + # TODO SET TO SLOW + def test_full_tokenizer(self): + data = [ + ["Pos", "No", "Driver", "Team", "Laps", "Time/Retired", "Grid", "Points"], + ["1", "32", "Patrick Carpentier", "Team Player's", "87", "1:48:11.023", "1", "22"], + ["2", "1", "Bruno Junqueira", "Newman/Haas Racing", "87", "+0.8 secs", "2", "17"], + ["3", "3", "Paul Tracy", "Team Player's", "87", "+28.6 secs", "3", "14"], + ["4", "9", "Michel Jourdain, Jr.", "Team Rahal", "87", "+40.8 secs", "13", "12"], + ["5", "34", "Mario Haberfeld", "Mi-Jack Conquest Racing", "87", "+42.1 secs", "6", "10"], + ["6", "20", "Oriol Servia", "Patrick Racing", "87", "+1:00.2", "10", "8"], + ["7", "51", "Adrian Fernandez", "Fernandez Racing", "87", "+1:01.4", "5", "6"], + ["8", "12", "Jimmy Vasser", "American Spirit Team Johansson", "87", "+1:01.8", "8", "5"], + ["9", "7", "Tiago Monteiro", "Fittipaldi-Dingman Racing", "86", "+ 1 Lap", "15", 
"4"], + ["10", "55", "Mario Dominguez", "Herdez Competition", "86", "+ 1 Lap", "11", "3"], + ["11", "27", "Bryan Herta", "PK Racing", "86", "+ 1 Lap", "12", "2"], + ["12", "31", "Ryan Hunter-Reay", "American Spirit Team Johansson", "86", "+ 1 Lap", "17", "1"], + ["13", "19", "Joel Camathias", "Dale Coyne Racing", "85", "+ 2 Laps", "18", "0"], + ["14", "33", "Alex Tagliani", "Rocketsports Racing", "85", "+ 2 Laps", "14", "0"], + ["15", "4", "Roberto Moreno", "Herdez Competition", "85", "+ 2 Laps", "9", "0"], + ["16", "11", "Geoff Boss", "Dale Coyne Racing", "83", "Mechanical", "19", "0"], + ["17", "2", "Sebastien Bourdais", "Newman/Haas Racing", "77", "Mechanical", "4", "0"], + ["18", "15", "Darren Manning", "Walker Racing", "12", "Mechanical", "7", "0"], + ["19", "5", "Rodolfo Lavin", "Walker Racing", "10", "Mechanical", "16", "0"], + ] + query = "what were the drivers names?" + table = pd.DataFrame.from_records(data[1:], columns=data[0]) + + # TODO: Should update this in the future + tokenizer = TapasTokenizer.from_pretrained("lysandre/tapas-temporary-repo", model_max_length=512) + model_inputs = tokenizer(table, query, padding="max_length") + + input_ids = model_inputs["input_ids"] + token_type_ids = np.array(model_inputs["token_type_ids"]) + segment_ids = token_type_ids[:, 0] + column_ids = token_type_ids[:, 1] + row_ids = token_type_ids[:, 2] + + expected_results = { + "input_ids": [ + 101, + 2054, + 2020, + 1996, + 6853, + 3415, + 1029, + 102, + 13433, + 2015, + 2053, + 4062, + 2136, + 10876, + 2051, + 1013, + 3394, + 8370, + 2685, + 1015, + 3590, + 4754, + 29267, + 4765, + 3771, + 2136, + 2447, + 1005, + 1055, + 6584, + 1015, + 1024, + 4466, + 1024, + 2340, + 1012, + 6185, + 2509, + 1015, + 2570, + 1016, + 1015, + 10391, + 12022, + 4226, + 7895, + 10625, + 1013, + 22996, + 3868, + 6584, + 1009, + 1014, + 1012, + 1022, + 10819, + 2015, + 1016, + 2459, + 1017, + 1017, + 2703, + 10555, + 2136, + 2447, + 1005, + 1055, + 6584, + 1009, + 2654, + 1012, + 1020, + 10819, + 2015, + 1017, + 2403, + 1018, + 1023, + 8709, + 8183, + 3126, + 21351, + 2078, + 1010, + 3781, + 1012, + 2136, + 10958, + 8865, + 6584, + 1009, + 2871, + 1012, + 1022, + 10819, + 2015, + 2410, + 2260, + 1019, + 4090, + 7986, + 5292, + 5677, + 8151, + 2771, + 1011, + 2990, + 9187, + 3868, + 6584, + 1009, + 4413, + 1012, + 1015, + 10819, + 2015, + 1020, + 2184, + 1020, + 2322, + 2030, + 20282, + 14262, + 9035, + 4754, + 3868, + 6584, + 1009, + 1015, + 1024, + 4002, + 1012, + 1016, + 2184, + 1022, + 1021, + 4868, + 7918, + 12023, + 12023, + 3868, + 6584, + 1009, + 1015, + 1024, + 5890, + 1012, + 1018, + 1019, + 1020, + 1022, + 2260, + 5261, + 12436, + 18116, + 2137, + 4382, + 2136, + 26447, + 6584, + 1009, + 1015, + 1024, + 5890, + 1012, + 1022, + 1022, + 1019, + 1023, + 1021, + 27339, + 3995, + 10125, + 9711, + 4906, + 25101, + 24657, + 1011, + 22033, + 2386, + 3868, + 6564, + 1009, + 1015, + 5001, + 2321, + 1018, + 2184, + 4583, + 7986, + 14383, + 2075, + 29488, + 14906, + 9351, + 2971, + 6564, + 1009, + 1015, + 5001, + 2340, + 1017, + 2340, + 2676, + 8527, + 2014, + 2696, + 1052, + 2243, + 3868, + 6564, + 1009, + 1015, + 5001, + 2260, + 1016, + 2260, + 2861, + 4575, + 4477, + 1011, + 2128, + 4710, + 2137, + 4382, + 2136, + 26447, + 6564, + 1009, + 1015, + 5001, + 2459, + 1015, + 2410, + 2539, + 8963, + 11503, + 25457, + 3022, + 8512, + 2522, + 9654, + 3868, + 5594, + 1009, + 1016, + 10876, + 2324, + 1014, + 2403, + 3943, + 4074, + 6415, + 15204, + 2072, + 12496, + 25378, + 3868, + 5594, + 1009, + 1016, + 10876, + 2403, + 
1014, + 2321, + 1018, + 10704, + 17921, + 14906, + 9351, + 2971, + 5594, + 1009, + 1016, + 10876, + 1023, + 1014, + 2385, + 2340, + 14915, + 5795, + 8512, + 2522, + 9654, + 3868, + 6640, + 6228, + 2539, + 1014, + 2459, + 1016, + 28328, + 8945, + 3126, + 21351, + 2015, + 10625, + 1013, + 22996, + 3868, + 6255, + 6228, + 1018, + 1014, + 2324, + 2321, + 12270, + 11956, + 5232, + 3868, + 2260, + 6228, + 1021, + 1014, + 2539, + 1019, + 8473, + 28027, + 2080, + 2474, + 6371, + 5232, + 3868, + 2184, + 6228, + 2385, + 1014, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + ], + "column_ids": [ + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 1, + 1, + 2, + 3, + 4, + 5, + 6, + 6, + 6, + 7, + 8, + 1, + 2, + 3, + 3, + 3, + 3, + 4, + 4, + 4, + 4, + 5, + 6, + 6, + 6, + 6, + 6, + 6, + 6, + 6, + 7, + 8, + 1, + 2, + 3, + 3, + 3, + 3, + 4, + 4, + 4, + 4, + 5, + 6, + 6, + 6, + 6, + 6, + 6, + 7, + 8, + 1, + 2, + 3, + 3, + 4, + 4, + 4, + 4, + 5, + 6, + 6, + 6, + 6, + 6, + 6, + 7, + 8, + 1, + 2, + 3, + 3, + 3, + 3, + 3, + 3, + 3, + 3, + 4, + 4, + 4, + 5, + 6, + 6, + 6, + 6, + 6, + 6, + 7, + 8, + 1, + 2, + 3, + 3, + 3, + 3, + 4, + 4, + 4, + 4, + 4, + 5, + 6, + 6, + 6, + 6, + 6, + 6, + 7, + 8, + 1, + 2, + 3, + 3, + 3, + 3, + 4, + 4, + 5, + 6, + 6, + 6, + 6, + 6, + 6, + 7, + 8, + 1, + 2, + 3, + 3, + 4, + 4, + 5, + 6, + 6, + 6, + 6, + 6, + 6, + 7, + 8, + 1, + 2, + 3, + 3, + 3, + 4, + 4, + 4, + 4, + 5, + 6, + 6, + 6, + 6, + 6, + 6, + 7, + 8, + 1, + 2, + 3, + 3, + 3, + 3, + 4, + 4, + 4, + 4, + 4, + 4, + 4, + 5, + 6, + 6, + 6, + 7, + 8, + 1, + 2, + 3, + 3, + 3, + 3, + 4, + 4, + 4, + 5, + 6, + 6, + 6, + 7, + 8, + 1, + 2, + 3, + 3, + 3, + 4, + 4, + 4, + 5, + 6, + 6, + 6, + 7, + 8, + 1, + 2, + 3, + 3, + 3, + 3, + 3, + 4, + 4, + 4, + 4, + 5, + 6, + 6, + 6, + 7, + 8, + 1, + 2, + 3, + 3, + 3, + 3, + 4, + 4, + 4, + 4, + 5, + 6, + 6, + 6, + 7, + 8, + 1, + 2, + 3, + 3, + 3, + 3, + 4, + 4, + 4, + 5, + 6, + 6, + 6, + 7, + 8, + 1, + 2, + 3, + 3, + 4, + 4, + 4, + 5, + 6, + 6, + 6, + 7, + 8, + 1, + 2, + 3, + 3, + 4, + 4, + 4, + 4, + 5, + 6, + 7, + 8, + 1, + 2, + 3, + 3, + 3, + 3, + 3, + 4, + 4, + 4, + 4, + 5, + 6, + 7, + 8, + 1, + 2, + 3, + 3, + 4, + 4, + 5, + 6, + 7, + 8, + 1, + 2, + 3, + 3, + 3, + 3, + 3, + 4, + 4, + 5, + 6, + 7, + 8, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, 
+ 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + ], + "row_ids": [ + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 2, + 2, + 2, + 2, + 2, + 2, + 2, + 2, + 2, + 2, + 2, + 2, + 2, + 2, + 2, + 2, + 2, + 2, + 2, + 3, + 3, + 3, + 3, + 3, + 3, + 3, + 3, + 3, + 3, + 3, + 3, + 3, + 3, + 3, + 3, + 3, + 4, + 4, + 4, + 4, + 4, + 4, + 4, + 4, + 4, + 4, + 4, + 4, + 4, + 4, + 4, + 4, + 4, + 4, + 4, + 4, + 4, + 4, + 5, + 5, + 5, + 5, + 5, + 5, + 5, + 5, + 5, + 5, + 5, + 5, + 5, + 5, + 5, + 5, + 5, + 5, + 5, + 5, + 6, + 6, + 6, + 6, + 6, + 6, + 6, + 6, + 6, + 6, + 6, + 6, + 6, + 6, + 6, + 6, + 6, + 7, + 7, + 7, + 7, + 7, + 7, + 7, + 7, + 7, + 7, + 7, + 7, + 7, + 7, + 7, + 8, + 8, + 8, + 8, + 8, + 8, + 8, + 8, + 8, + 8, + 8, + 8, + 8, + 8, + 8, + 8, + 8, + 8, + 9, + 9, + 9, + 9, + 9, + 9, + 9, + 9, + 9, + 9, + 9, + 9, + 9, + 9, + 9, + 9, + 9, + 9, + 9, + 10, + 10, + 10, + 10, + 10, + 10, + 10, + 10, + 10, + 10, + 10, + 10, + 10, + 10, + 10, + 11, + 11, + 11, + 11, + 11, + 11, + 11, + 11, + 11, + 11, + 11, + 11, + 11, + 11, + 12, + 12, + 12, + 12, + 12, + 12, + 12, + 12, + 12, + 12, + 12, + 12, + 12, + 12, + 12, + 12, + 12, + 13, + 13, + 13, + 13, + 13, + 13, + 13, + 13, + 13, + 13, + 13, + 13, + 13, + 13, + 13, + 13, + 14, + 14, + 14, + 14, + 14, + 14, + 14, + 14, + 14, + 14, + 14, + 14, + 14, + 14, + 14, + 15, + 15, + 15, + 15, + 15, + 15, + 15, + 15, + 15, + 15, + 15, + 15, + 15, + 16, + 16, + 16, + 16, + 16, + 16, + 16, + 16, + 16, + 16, + 16, + 16, + 17, + 17, + 17, + 17, + 17, + 17, + 17, + 17, + 17, + 17, + 17, + 17, + 17, + 17, + 17, + 18, + 18, + 18, + 18, + 18, + 18, + 18, + 18, + 18, + 18, + 19, + 19, + 19, + 19, + 19, + 19, + 19, + 19, + 19, + 19, + 19, + 19, + 19, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + ], + "segment_ids": [ + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, 
+ 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 1, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + 0, + ], + } + + self.assertListEqual(input_ids, expected_results["input_ids"]) + self.assertListEqual(segment_ids.tolist(), expected_results["segment_ids"]) + self.assertListEqual(column_ids.tolist(), expected_results["column_ids"]) + self.assertListEqual(row_ids.tolist(), expected_results["row_ids"])
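+ + # TapasTokenizer packs the table structure into 7 token type dimensions per token + # (segment_ids, column_ids, row_ids, prev_labels, column_ranks, inv_column_ranks, + # numeric_relations); the assertions above check the first three, extracted by the + # slicing of token_type_ids at the top of this test.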