4 changes: 2 additions & 2 deletions docs/source/index.rst
@@ -145,8 +145,8 @@ conversion utilities for the following models:
27. :doc:`T5 <model_doc/t5>` (from Google AI) released with the paper `Exploring the Limits of Transfer Learning with a
Unified Text-to-Text Transformer <https://arxiv.org/abs/1910.10683>`__ by Colin Raffel and Noam Shazeer and Adam
Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu.
28. :doc:`TAPAS <model_doc/tapas>` (from Google AI) released with the paper `TAPAS: Weakly Supervised Table Parsing via
Pre-training <https://arxiv.org/abs/2004.02349>`__ by Jonathan Herzig, Paweł Krzysztof Nowak, Thomas Müller,
Francesco Piccinno and Julian Martin Eisenschlos.
29. :doc:`Transformer-XL <model_doc/transformerxl>` (from Google/CMU) released with the paper `Transformer-XL:
Attentive Language Models Beyond a Fixed-Length Context <https://arxiv.org/abs/1901.02860>`__ by Zihang Dai*,
99 changes: 49 additions & 50 deletions docs/source/model_doc/tapas.rst
@@ -5,85 +5,84 @@ Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The TAPAS model was proposed in `TAPAS: Weakly Supervised Table Parsing via Pre-training
<https://arxiv.org/abs/2004.02349>`__ by Jonathan Herzig, Paweł Krzysztof Nowak, Thomas Müller, Francesco Piccinno and
Julian Martin Eisenschlos. It's a BERT-based model specifically designed (and pre-trained) for answering questions
about tabular data. Compared to BERT, TAPAS uses relative position embeddings and has 7 token types that encode tabular
structure. TAPAS is pre-trained on the masked language modeling (MLM) objective on a large dataset comprising millions
of tables from English Wikipedia and corresponding texts. For question answering, TAPAS has 2 heads on top: a cell
selection head and an aggregation head, for (optionally) performing aggregations (such as counting or summing) among
selected cells. TAPAS has been fine-tuned on several datasets: SQA (Sequential Question Answering by Microsoft), WTQ
(Wiki Table Questions by Stanford University) and WikiSQL (by Salesforce). It achieves state-of-the-art on both SQA and
WTQ, while having comparable performance to SOTA on WikiSQL, with a much simpler architecture.
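To make the aggregation step concrete, here is a hedged, standalone sketch (not part of the TAPAS code itself) of how a predicted aggregation operator could be applied to the values of the selected cells; the operator ids follow the NONE/SUM/AVERAGE/COUNT convention used elsewhere in this documentation:

```python
# Illustrative sketch only: applies an aggregation operator, as predicted by
# the aggregation head, to the numeric values of the cells picked by the cell
# selection head. Operator ids: 0 = NONE, 1 = SUM, 2 = AVERAGE, 3 = COUNT.

def apply_aggregation(op_id, cell_values):
    if op_id == 1:  # SUM
        return sum(cell_values)
    if op_id == 2:  # AVERAGE
        return sum(cell_values) / len(cell_values)
    if op_id == 3:  # COUNT
        return len(cell_values)
    return cell_values  # NONE: return the selected cells unchanged

# e.g. "What is the total number of movies?" -> SUM over the selected cells
print(apply_aggregation(1, [87.0, 53.0, 69.0]))  # 209.0
```

In the real model, the operator is not given but predicted from logits, and for operators like SUM the training signal is only the final (weakly supervised) answer, not the cell selection itself.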

The abstract from the paper is the following:

*Answering natural language questions over tables is usually seen as a semantic parsing task. To alleviate the
collection cost of full logical forms, one popular approach focuses on weak supervision consisting of denotations
instead of logical forms. However, training semantic parsers from weak supervision poses difficulties, and in addition,
the generated logical forms are only used as an intermediate step prior to retrieving the denotation. In this paper, we
present TAPAS, an approach to question answering over tables without generating logical forms. TAPAS trains from weak
supervision, and predicts the denotation by selecting table cells and optionally applying a corresponding aggregation
operator to such selection. TAPAS extends BERT's architecture to encode tables as input, initializes from an effective
joint pre-training of text segments and tables crawled from Wikipedia, and is trained end-to-end. We experiment with
three different semantic parsing datasets, and find that TAPAS outperforms or rivals semantic parsing models by
improving state-of-the-art accuracy on SQA from 55.1 to 67.2 and performing on par with the state-of-the-art on WIKISQL
and WIKITQ, but with a simpler model architecture. We additionally find that transfer learning, which is trivial in our
setting, from WIKISQL to WIKITQ, yields 48.7 accuracy, 4.2 points above the state-of-the-art.*

In addition, the authors have further pre-trained TAPAS to recognize table entailment, by creating a balanced dataset
of millions of automatically created training examples which are learned in an intermediate step prior to fine-tuning.
The authors of TAPAS call this further pre-training intermediate pre-training (since TAPAS is first pre-trained on MLM,
and then on another dataset). They found that intermediate pre-training further improves performance on SQA, achieving
a new state-of-the-art as well as state-of-the-art on TabFact, a large-scale dataset with 16k Wikipedia tables for
table entailment (a binary classification task). For more details, see their new paper: `Understanding tables with
intermediate pre-training <https://arxiv.org/abs/2010.00571>`__ by Julian Martin Eisenschlos, Syrine Krichene and
Thomas Müller.

The original code can be found `here <https://github.com/google-research/tapas>`__.

Tips:

- TAPAS is a model that uses relative position embeddings by default (restarting the position embeddings at every cell
of the table). According to the authors, this usually results in a slightly better performance, and allows you to
encode longer sequences without running out of embeddings. If you don't want this, you can set the
`reset_position_index_per_cell` parameter of :class:`~transformers.TapasConfig` to False.
- TAPAS has checkpoints fine-tuned on SQA, which are capable of answering questions related to a table in a
conversational set-up. This means that you can ask follow-up questions such as "what is his age?" related to the
previous question. Note that the forward pass of TAPAS is a bit different in case of a conversational set-up: in that
case, you have to feed every training example one by one to the model, such that the `prev_label_ids` token type ids
can be overwritten by the predicted `label_ids` of the model to the previous question.
- TAPAS is similar to BERT and therefore relies on the masked language modeling (MLM) objective. It is therefore
efficient at predicting masked tokens and at NLU in general, but is not optimal for text generation. Models trained
with a causal language modeling (CLM) objective are better in that regard.
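The first tip can be illustrated with a small standalone sketch (this is not the actual `TapasTokenizer` implementation, only an illustration of the idea): restarting position indices at every cell keeps each index small, so long tables do not exhaust the fixed budget of learned position embeddings.

```python
# Illustrative sketch of per-cell position index resetting. Each inner list
# holds the word pieces of one table cell; position ids restart at 0 for
# every cell instead of increasing monotonically over the whole sequence.

def per_cell_position_ids(cells):
    """Return one position id per token, restarting at 0 for each cell."""
    position_ids = []
    for cell_tokens in cells:
        position_ids.extend(range(len(cell_tokens)))
    return position_ids

cells = [["brad", "pitt"], ["leonardo", "di", "caprio"], ["george", "clooney"]]
print(per_cell_position_ids(cells))  # [0, 1, 0, 1, 2, 0, 1]
```

With global positions the last token here would get index 6; with per-cell resetting no index exceeds the longest cell, which is what allows encoding longer tables without running out of embeddings.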


Usage
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If you just want to perform inference (i.e. making predictions) in a non-conversational setup, you can do the
following:

.. code-block::

>>> from transformers import TapasTokenizer, TapasForQuestionAnswering
>>> import pandas as pd

>>> model_name = 'tapas-base-finetuned-wtq'
>>> model = TapasForQuestionAnswering.from_pretrained(model_name)
>>> tokenizer = TapasTokenizer.from_pretrained(model_name)

>>> data = {'Actors': ["Brad Pitt", "Leonardo Di Caprio", "George Clooney"], 'Number of movies': ["87", "53", "69"]}
>>> queries = ["What is the name of the first actor?", "How many movies has George Clooney played in?", "What is the total number of movies?"]
>>> table = pd.DataFrame(data)
>>> inputs = tokenizer(table, queries, return_tensors='pt')
>>> logits, logits_agg = model(**inputs)
>>> answer_coordinates_batch, aggregation_predictions = tokenizer.convert_logits_to_predictions(inputs, logits, logits_agg)

>>> # let's print out the results:
>>> id2aggregation = {0: "NONE", 1: "SUM", 2: "AVERAGE", 3: "COUNT"}
>>> aggregation_predictions_string = [id2aggregation[x] for x in aggregation_predictions]

>>> answers = []
>>> for coordinates in answer_coordinates_batch:
...     if len(coordinates) == 1:
...         # only a single cell:
...         answers.append(table.iat[coordinates[0]])
...     else:
...         # multiple cells
...         cell_values = []
...         for coordinate in coordinates:
...             cell_values.append(table.iat[coordinate])
...         answers.append(", ".join(cell_values))

>>> display(table)
>>> print("")
>>> for query, answer, predicted_agg in zip(queries, answers, aggregation_predictions_string):
2 changes: 1 addition & 1 deletion src/transformers/__init__.py
@@ -562,10 +562,10 @@
)
from .modeling_tapas import (
TAPAS_PRETRAINED_MODEL_ARCHIVE_LIST,
TapasForMaskedLM,
TapasForQuestionAnswering,
TapasForSequenceClassification,
TapasModel,
load_tf_weights_in_tapas,
)
from .modeling_transfo_xl import (
24 changes: 12 additions & 12 deletions src/transformers/configuration_tapas.py
@@ -17,20 +17,20 @@

from .configuration_utils import PretrainedConfig


TAPAS_PRETRAINED_CONFIG_ARCHIVE_MAP = {"tapas-base": "", "tapas-large": ""}  # to be added


class TapasConfig(PretrainedConfig):
r"""
This is the configuration class to store the configuration of a :class:`~transformers.TapasModel`. It is used to
instantiate a TAPAS model according to the specified arguments, defining the model architecture. Instantiating a
configuration with the defaults will yield a similar configuration to that of the TAPAS `tapas-base-finetuned-sqa`
architecture. Configuration objects inherit from :class:`~transformers.PreTrainedConfig` and can be used to control
the model outputs. Read the documentation from :class:`~transformers.PretrainedConfig` for more information.

Hyperparameters additional to BERT are taken from run_task_main.py and hparam_utils.py of the original
implementation. Original implementation available at https://github.com/google-research/tapas/tree/master.

Args:
vocab_size (:obj:`int`, `optional`, defaults to 30522):
@@ -87,9 +87,9 @@ class TapasConfig(PretrainedConfig):
average_approximation_function: (:obj:`string`, `optional`, defaults to :obj:`"ratio"`):
Method to calculate expected average of cells in the relaxed case.
cell_selection_preference: (:obj:`float`, `optional`, defaults to None):
Preference for cell selection in ambiguous cases. Only applicable in case of weak supervision for
aggregation (WTQ, WikiSQL). If the total mass of the aggregation probabilities (excluding the "NONE"
operator) is higher than this hyperparameter, then aggregation is predicted for an example.
answer_loss_cutoff: (:obj:`float`, `optional`, defaults to None):
Ignore examples with answer loss larger than cutoff.
max_num_rows: (:obj:`int`, `optional`, defaults to 64):
@@ -109,7 +109,7 @@ class TapasConfig(PretrainedConfig):
disable_per_token_loss: (:obj:`bool`, `optional`, defaults to :obj:`False`):
Disable any (strong or weak) supervision on cells.
span_prediction: (:obj:`string`, `optional`, defaults to :obj:`"none"`):
Span selection mode to use. Currently only "none" is supported.

Example::

@@ -19,7 +19,12 @@

import torch

from transformers import (
TapasConfig,
TapasForQuestionAnswering,
TapasForSequenceClassification,
load_tf_weights_in_tapas,
)
from transformers.utils import logging


@@ -41,14 +46,14 @@ def convert_tf_checkpoint_to_pytorch(tf_checkpoint_path, tapas_config_file, pyto
# select_one_column = True,
# allow_empty_column_selection = False,
# temperature = 0.0352513)

# SQA config
config = TapasConfig()

print("Building PyTorch model from configuration: {}".format(str(config)))
# model = TapasForMaskedLM(config)
model = TapasForQuestionAnswering(config)
# model = TapasForSequenceClassification(config)

# Load weights from tf checkpoint
load_tf_weights_in_tapas(model, config, tf_checkpoint_path)
8 changes: 4 additions & 4 deletions src/transformers/file_utils.py
@@ -191,10 +191,10 @@

except ImportError:
_tokenizers_available = False

try:
import torch_scatter

# Check we're not importing a "torch_scatter" directory somewhere
_scatter_available = hasattr(torch_scatter, "__version__") and hasattr(torch_scatter, "scatter")