
[WIP] Add Tapas model #8113

Closed

NielsRogge wants to merge 58 commits into huggingface:master from NielsRogge:tapas_v3

Conversation

@NielsRogge
Contributor

What does this PR do?

Since the beginning of August, I've been working in my free time on incorporating the Tapas algorithm by Google AI into the Transformers library (because this library is awesome and I want to contribute to it!). Tapas is basically a BERT model with some clever modifications for natural language understanding related to tabular data (structured data like tables, or even HTML). Adding this model could foster research in this area 😄

Demos of my current implementation:

  • colab notebook to showcase TapasForQuestionAnswering on WTQ (WikiTable Questions by Stanford University)
  • colab notebook to showcase TapasForQuestionAnswering on SQA (Sequential Question Answering by Microsoft Research)
  • colab notebook to showcase TapasForSequenceClassification on TabFact (Table Fact checking, introduced at ICLR this year)

The model weights are available on the original GitHub repository, and I wrote a conversion script (similar to other models in the Transformers library) to load them into their PyTorch counterpart.
I suggest reading the paper as well as my notes to gain a full understanding of how the model works and how I implemented it. There's also a blog post by Google AI as well as a video by Yannic Kilcher explaining how the algorithm works.

The main classes are TapasConfig, TapasModel, TapasForQuestionAnswering and TapasForSequenceClassification, which can all be found in modeling_tapas.py. I'm quite sure the models are OK: the output is the same as that of the TensorFlow implementation. I added very extensive documentation (docstrings) to all classes, which you can view by running the make html command from the docs directory. Feedback appreciated!

However, there are 2 things for which I need some help/opinions to finish this work:

1. Making TapasTokenizer fully Transformers-compliant

To implement TapasTokenizer, I need some help/opinions. I suggest using Pandas dataframes as the central object for tabular data (as shown in the Colab notebooks above), and letting the API be as follows:

from transformers import TapasTokenizer
import pandas as pd

data = {'Actors': ["Brad Pitt", "Leonardo Di Caprio", "George Clooney"], 
        'Age': ["56", "45", "59"],
        'Number of movies': ["87", "53", "69"],
        'Date of birth': ["18 december 1963", "11 november 1974", "6 may 1961"]}
table = pd.DataFrame.from_dict(data)
queries = ["When was Brad Pitt born?", 
           "Which actor appeared in the least number of movies?",
           "What is the average number of movies?"]

tokenizer = TapasTokenizer.from_pretrained("tapas-base-finetuned-wtq")
inputs = tokenizer(table=table, queries=queries)

Currently I've only implemented the batch_encode_plus method of TapasTokenizer, because it's not really clear to me how to make it fully compatible with the Transformers library: the way data is prepared for the model differs a bit from BERT/RoBERTa/etc. (see also my notes above). It's also not straightforward to make it compatible with the different padding/truncation strategies of Transformers.

Currently, it works as follows: a function _get_token_budget in tokenization_tapas.py calculates the number of tokens left for the flattened table after tokenizing a question. This is currently set to self.model_max_length - (len(question_tokens) + 2) (+ 2 for the CLS and SEP tokens), as in the original implementation. TapasTokenizer has an initialization hyperparameter called drop_rows_to_fit: if set to True, rows of the table are dropped to fit into the token budget; if set to False and a table is too big, a ValueError is thrown indicating 'too many rows'.
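The budget and row-dropping behaviour described above can be sketched roughly like this. This is only an illustration of the described logic, not the actual tokenization_tapas.py code; the function names mirror the description:

```python
# Sketch of the token-budget logic described above (illustrative names,
# not the real TapasTokenizer implementation).

def get_token_budget(question_tokens, model_max_length=512):
    """Tokens left for the flattened table after tokenizing the question.

    Reserves 2 positions for the [CLS] and [SEP] tokens, as in the
    original implementation.
    """
    return model_max_length - (len(question_tokens) + 2)

def fit_table_rows(row_token_counts, budget, drop_rows_to_fit=True):
    """Return the indices of the rows that fit into the token budget.

    If drop_rows_to_fit is False and the table is too big, raise a
    ValueError, matching the behaviour described above.
    """
    kept, used = [], 0
    for i, n in enumerate(row_token_counts):
        if used + n > budget:
            if drop_rows_to_fit:
                break  # silently drop the remaining rows
            raise ValueError("too many rows")
        kept.append(i)
        used += n
    return kept
```

For example, a six-token question leaves a budget of 512 - (6 + 2) = 504 table tokens.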

2. Testing

Currently I've written test_modeling_tapas.py (23 tests passed, 5 failed) and test_modeling_tapas_utilities.py (9 tests passed). However, there are 4 different settings in which TapasForQuestionAnswering can be used (see my notes) and these all need to be tested (currently only 1 setting is tested) - some help here would be great. Besides this, tests should be added to check whether the model can be properly trained, and test_tokenization_tapas.py should be written (which depends on how TapasTokenizer will be implemented).

Fixes the following issues (people requesting to add the model):

Who can review?

I suggest @sgugger @LysandreJik since we already discussed this on the forum here.
tokenizers: @mfuntowicz

DISCLAIMER: this is the first PR of my life; I've never done this before, so hopefully I don't mess anything up (I just got the Pro Git book 😛). I assume I should not use git rebase anymore now that this branch is submitted as a PR, and should only use git add, git commit and git push -u origin tapas_v3? And git pull origin tapas_v3 in case others make commits to my branch?
Is there a Slack channel where people can help me out in case I have git issues?

@sgugger
Collaborator

sgugger commented Oct 28, 2020

I think you might need a rebase on the latest master as your PR seems to have taken master from a while ago (all the modifications in the README should not be there for instance). If it messes the diff, we can always close this PR and open a new one, the branch will be safe :-)

@LysandreJik self-requested a review on October 28, 2020 at 13:41
@NielsRogge
Contributor Author

NielsRogge commented Oct 28, 2020

EDIT: I've always used git fetch upstream and git rebase upstream/master before pushing my local tapas_v3 branch to my fork on GitHub. I didn't know that all models now start with 1. in the README (which is why I had manually changed the numbering); this should be fixed now, and the branch is up to date.

@NielsRogge
Contributor Author

NielsRogge commented Oct 29, 2020

@sgugger should I keep rebasing this branch every day, to keep up with master (as long as the code is not being reviewed)?

Also, is it normal that I have to do a force push every time I perform a rebase and want to push to GitHub? When I simply do git push -u origin tapas_v3, I always get

(env) PS C:\Users\niels.rogge\Documents\Python projecten\transformers> git push -u origin tapas_v3
To https://github.com/NielsRogge/transformers.git
 ! [rejected]          tapas_v3 -> tapas_v3 (non-fast-forward)
error: failed to push some refs to 'https://github.com/NielsRogge/transformers.git'
hint: Updates were rejected because the tip of your current branch is behind
hint: its remote counterpart. Integrate the remote changes (e.g.
hint: 'git pull ...') before pushing again.
hint: See the 'Note about fast-forwards' in 'git push --help' for details.

after a local rebase.

@sgugger
Collaborator

sgugger commented Oct 29, 2020

I'm far from being an expert on git and I don't use the command line anyway, so can't really help you with that.

Member

@LysandreJik left a comment

Hi! Thanks a lot @NielsRogge for implementing this model! It's a great model that definitely deserves its place in the huggingface library, and you did a great job implementing it! I'm reviewing the code below.

Regarding the tokenizer, we do not have a dependency (hard or soft) on pandas, but we will soon have one on datasets. I think it is feasible to have datasets as the central object for tabular data instead of pandas dataframes. It's very easy to use it as a replacement: your example code would simply use a datasets.Dataset instead of a pd.DataFrame:

from transformers import TapasTokenizer
from datasets import Dataset

data = {'Actors': ["Brad Pitt", "Leonardo Di Caprio", "George Clooney"], 
        'Age': ["56", "45", "59"],
        'Number of movies': ["87", "53", "69"],
        'Date of birth': ["18 december 1963", "11 november 1974", "6 may 1961"]}

table = Dataset.from_dict(data)
queries = ["When was Brad Pitt born?", 
           "Which actor appeared in the least number of movies?",
           "What is the average number of movies?"]

tokenizer = TapasTokenizer.from_pretrained("tapas-base-finetuned-wtq")
inputs = tokenizer(table=table, queries=queries)

Putting @lhoestq in cc.

Regarding the configuration and tokenizer, it would be great to uncouple them from their BERT counterparts.

I haven't played around with the model yet, but it seems to be in great shape already. If you can do the modifications in the review here, that would be great; I'll then spend time playing with it and will come back to you regarding the tokenizer API currently set up.

Concerning git, it seems you have just rebased & force-pushed, so that's very cool. Regarding your question, this does mess up the history of users that are currently using your branch, but seeing as we're early in development, there shouldn't be any.

Comment on lines 60 to 65
Member

These checkpoints will need their namespace as well (something like google/tapas-base)

Contributor Author

Yes, some notes about this:

  • I would like to discuss which of the models to upload to the model hub, because the authors did release a large amount of model variants: https://github.com/google-research/tapas. I assume not all of them are that useful, especially the TINY models, and maybe we can only include the ones with relative embeddings (because performance is almost always slightly better with relative position embeddings) and intermediate pre-training (for the same reason).

  • The naming of the models is quite extensive: tapas_wtq_wikisql_sqa_inter_masklm_large_reset for example is the large model with relative position embeddings (reset) that was pretrained on masked language modeling, then intermediate pretraining, then fine-tuned on SQA, then on WikiSQL and finally on WTQ 😅 Is there a naming standard for models in the model hub? Maybe this can be shortened to tapas-large-finetuned-wtq, and provide the details in the model cards.

Member

  • Regarding the models to be added, I don't think it hurts to put all the models in the hub. This way it would be easy for future researchers to replicate the results of this paper, only using the implementation here, without needing to switch from one implementation to another. In any case, the conversion script should handle all of these. We can take care of uploading the models then, as they'll need to be under the google organization namespace.

  • I think your proposal regarding model naming makes total sense! As long as there exists a document in which we can find the old name, renaming to something easier to understand like this is fine.

Contributor

@patrickvonplaten left a comment

This looks like an amazing contribution! I let @LysandreJik handle the next steps and am happy to do a final review at a later stage.

@NielsRogge force-pushed the tapas_v3 branch 2 times, most recently from 9afa6ca to ab2b40d on November 2, 2020 at 08:42
@NielsRogge
Contributor Author

NielsRogge commented Nov 2, 2020

Hi! Thanks a lot @NielsRogge for implementing this model! It's a great model that definitely deserves its place in the huggingface library, and you did a great job implementing it! I'm reviewing the code below.

Thank you! Great to hear :)

@LysandreJik I addressed all of the comments. To summarize:

  • TapasConfig and TapasTokenizer now inherit from PretrainedConfig and PreTrainedTokenizer respectively. For TapasTokenizer, a lot of the code of tokenization_bert.py was copied (such as the BasicTokenizer and WordPieceTokenizer classes), since the tokenization logic of text itself is the same. However, some things are different (see tokenization_tapas.py).
  • modeling_tapas_utilities.py and tokenization_tapas_utilities.py are also gone now; their contents were moved to the bottom of modeling_tapas.py and tokenization_tapas.py respectively.
  • pandas is not a dependency in the code (pandas is not imported in any of the files), but I assume you want to switch to datasets so that people can use Tapas using only the Transformers library? However, some Pandas logic is currently used in tokenization_tapas.py, for example .iterrows() in the _tokenize_table function, so this will require some changes.
  • Concerning git, I assume I should stop rebasing at some point? I can do it as long as I'm the only one committing?
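On the pandas point above: the .iterrows() loop could be replaced by iterating over plain row dicts, which is also the shape you get when iterating over a datasets.Dataset. A minimal sketch of the two styles, assuming string-valued cells as in the example table (this simplifies what _tokenize_table actually does):

```python
import pandas as pd

table = pd.DataFrame({'Actors': ["Brad Pitt", "George Clooney"],
                      'Age': ["56", "59"]})

# pandas-specific: relies on DataFrame.iterrows()
rows_pandas = [[str(v) for v in row.values] for _, row in table.iterrows()]

# backend-agnostic: plain row dicts, which is also what iterating over a
# datasets.Dataset yields, so the same loop would work for both backends
records = table.to_dict('records')
rows_generic = [[str(r[col]) for col in table.columns] for r in records]

assert rows_pandas == rows_generic
```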

Member

@LysandreJik left a comment

I've played around with your implementation, and it works well! It's very cool. Now comes a complicated aspect: ensuring we stay consistent with the existing API, especially for the tokenizers.

This is a complicated part, so please let us know if you would like some help along the way/if you want us to take over from here, in which case we would open PRs against your branch with proposals.

Some of this echoes what you've put in your PR description, I'm putting it here as well so that we don't lose track of what's left to do.

Tokenizers

As you've said, the tokenizer might be a bit hard to use with our existing API, but I think it's possible. Let's list the issues with the current implementation:

Right now, the tokenizer is not behaving as we would like as it doesn't respect much of the flags passed to it. For example, running your example:

from transformers import TapasTokenizer

dir_name = "./tapas_base_finetuned_wtq_with_reset_and_intermediate_pretraining"

tokenizer = TapasTokenizer(vocab_file=dir_name + "/vocab.txt", model_max_length=512)
encoded_inputs = tokenizer.batch_encode_plus(table=table, queries=queries)

This immediately pads to the maximum length. This is the case whether I specify the padding strategy or not. It does not respect the truncation parameter either, nor does it respect the return_attention_mask, but it does respect the return_token_type_ids.

I haven't tried the pad_to_multiple_of, is_split_into_words, return_overflowing_tokens, return_special_tokens_mask, return_length parameters, but having them work similarly to other tokenizers should be a priority.
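For reference, the padding behaviour these flags are expected to produce can be sketched on bare token-id lists. This is a rough illustration of the contract, with illustrative names, not the transformers internals:

```python
# Sketch of flag-respecting padding on lists of token ids
# (illustrative, not the actual tokenizer code).

def pad_batch(batch_ids, padding="longest", max_length=512, pad_id=0,
              return_attention_mask=True):
    """Pad a batch only as requested, instead of always padding to max length."""
    if padding == "longest":
        target = max(len(ids) for ids in batch_ids)
    elif padding == "max_length":
        target = max_length
    else:  # padding=False / "do_not_pad": leave sequences untouched
        target = None

    out = {"input_ids": []}
    if return_attention_mask:
        out["attention_mask"] = []
    for ids in batch_ids:
        t = len(ids) if target is None else target
        out["input_ids"].append(ids + [pad_id] * (t - len(ids)))
        if return_attention_mask:
            out["attention_mask"].append([1] * len(ids) + [0] * (t - len(ids)))
    return out
```

With padding="longest", a batch of lengths 2 and 1 is padded to length 2; with padding=False it is returned unchanged.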

__call__

The __call__ method should be implemented for the tokenizer. The following methods should also be implemented:

⚠️ some of these are already implemented in the superclass, and may not need to be reimplemented. This may be the case for prepare_for_model, which takes care of the truncation/padding on its own.

  • build_inputs_with_special_tokens
  • get_special_tokens_mask
  • create_token_type_ids_from_sequences

It seems you've only implemented the batching methods, but there are methods to be used for single sequences as well:

  • encode_plus
  • prepare_for_model

Testing

Modeling Tapas Utilities

Same as the other utilities, it would be nice to have this in the main file. Moving test_modeling_tapas_utilities.py to be in the same file as test_modeling_tapas.py would be great.

Tokenization tests

The implementation of the tokenization tests would be a good way to monitor what is currently working/failing.

Contributor

@patrickvonplaten left a comment

PR looks very clean already! Great job! I left a couple of comments regarding naming, etc.
I'd really like a slow integration test, and I think we should delete the encoder-decoder-related logic from modeling_tapas.py for now.

Collaborator

@sgugger left a comment

Wow, that's a lot of work! Thanks a lot for putting this together with @LysandreJik's help. This seems on track to be merged, though there is still a bit of cleanup work to do first.

@NielsRogge
Contributor Author

NielsRogge commented Dec 1, 2020

Thanks for the reviews, I've updated the requested changes and marked the ones I did as resolved.

@LysandreJik, could you maybe fix the remaining comments? In short:

  • remove the encoder-decoder logic of TapasModel (only remove it from the API of TapasModel, but leave it in the code that's copied from BERT, which the user won't see or use). I'll let you do this since I don't want to mess up anything.
  • remove some tests and add a slow test as requested above

... then I'll mark these as resolved. Besides these, there's also the truncation of TapasTokenizer which should still be implemented. I copied what was left here:

  • Add support for the drop_rows_to_fit and cell_trim_length attributes of TapasTokenizer, which should reflect the original API (see also my suggestion on how this could be done for cell_trim_length). The original implementation can be found here.
  • Add support for the special [EMPTY] token for empty cells in a table (see the _tokenize method of TapasTokenizer, which now uses the format_text method as in the original implementation). Does this have implications for the add_special_tokens method? I assume not? What about get_special_tokens_mask? To be verified.
  • There was also a small discrepancy between the tokenization of TAPAS and the original implementation, see this GitHub issue. I don't expect this to be too big of an issue, but maybe you know more about it.
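The first two bullets could look roughly like the following sketch; the names are illustrative and the real TapasTokenizer attribute handling is still to be implemented:

```python
# Sketch of cell_trim_length and the [EMPTY] token for empty cells
# (illustrative names, not the actual TapasTokenizer code).

EMPTY_TOKEN = "[EMPTY]"  # special token for empty cells, as in the original implementation

def tokenize_cell(text, tokenize_fn, cell_trim_length=-1):
    """Tokenize one table cell.

    Empty cells map to the special [EMPTY] token; cell_trim_length >= 0
    trims each cell to at most that many tokens, while -1 (mirroring the
    original default) disables trimming.
    """
    tokens = tokenize_fn(text.strip())
    if not tokens:
        return [EMPTY_TOKEN]
    if cell_trim_length >= 0:
        tokens = tokens[:cell_trim_length]
    return tokens
```

Whether [EMPTY] needs explicit handling in add_special_tokens/get_special_tokens_mask is, as noted above, still to be verified.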

And then I assume we're done 👍 (finally)

@LysandreJik
Member

Sure, will do so. Probably tomorrow morning/afternoon!

@NielsRogge NielsRogge mentioned this pull request Dec 8, 2020
@NielsRogge
Contributor Author

Closing this one, as the most up-to-date version is now #9117.

@NielsRogge NielsRogge closed this Dec 15, 2020
