
[WIP] Add Tapas model #8113

Closed

NielsRogge wants to merge 58 commits into huggingface:master from NielsRogge:tapas_v3

Conversation

@NielsRogge
Contributor

What does this PR do?

Since the beginning of August, I've been working in my free time on incorporating the Tapas algorithm by Google AI into the Transformers library (because this library is awesome and I want to contribute to it!). Tapas is basically a BERT model with some clever modifications for natural language understanding related to tabular data (structured data like tables, or even HTML). Adding this model could foster research in this area 😄

Demos of my current implementation:

  • colab notebook to showcase TapasForQuestionAnswering on WTQ (WikiTable Questions by Stanford University)
  • colab notebook to showcase TapasForQuestionAnswering on SQA (Sequential Question Answering by Microsoft Research)
  • colab notebook to showcase TapasForSequenceClassification on TabFact (Table Fact checking, introduced at ICLR this year)

The model weights are available on the original GitHub repository, and I wrote a conversion script (similar to other models in the Transformers library) to load them into their PyTorch counterpart.
I suggest reading the paper as well as my notes to gain a full understanding of how the model works and how I implemented it. There's also a blog post by Google AI as well as a video by Yannic Kilcher explaining how the algorithm works.

The main classes are TapasConfig, TapasModel, TapasForQuestionAnswering and TapasForSequenceClassification, which can all be found in modeling_tapas.py. I'm quite sure the models are OK: the output is the same as that of the TensorFlow implementation. I added very extensive documentation (docstrings) to all classes, which you can view by running the make html command from the docs directory. Feedback appreciated!

However, there are 2 things for which I need some help/opinions to finish this work:

1. Making TapasTokenizer fully Transformers-compliant

To implement TapasTokenizer, I need some help/opinions. I suggest using Pandas dataframes as the central object for tabular data (as shown in the Colab notebooks above), and letting the API be as follows:

from transformers import TapasTokenizer
import pandas as pd

data = {'Actors': ["Brad Pitt", "Leonardo Di Caprio", "George Clooney"], 
        'Age': ["56", "45", "59"],
        'Number of movies': ["87", "53", "69"],
        'Date of birth': ["18 december 1963", "11 november 1974", "6 may 1961"]}
table = pd.DataFrame.from_dict(data)
queries = ["When was Brad Pitt born?", 
           "Which actor appeared in the least number of movies?",
           "What is the average number of movies?"]

tokenizer = TapasTokenizer.from_pretrained("tapas-base-finetuned-wtq")
inputs = tokenizer(table=table, queries=queries)

Currently I've only implemented the batch_encode_plus method of TapasTokenizer, because it's not really clear to me how to make it fully compatible with the Transformers library: the way data is prepared for the model differs a bit from BERT/RoBERTa/etc. (see also my notes above). It's also not straightforward to make it compatible with the different padding/truncation strategies of Transformers.

Currently, it works as follows: a function _get_token_budget in tokenization_tapas.py calculates the number of tokens left for the flattened table after tokenizing a question. This is currently set to self.model_max_length - (len(question_tokens) + 2) (+ 2 for the CLS and SEP tokens), as in the original implementation. TapasTokenizer has an initialization hyperparameter called drop_rows_to_fit: if set to True, rows of the table are dropped to fit into the token budget; if set to False and a table is too big, a ValueError is thrown indicating 'too many rows'.
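The budget and row-dropping behaviour described above can be sketched roughly like this. This is only an illustration of the described logic, not the actual tokenization_tapas.py code; the function names mirror the description:

```python
# Sketch of the token-budget logic described above (illustrative names,
# not the real TapasTokenizer implementation).

def get_token_budget(question_tokens, model_max_length=512):
    """Tokens left for the flattened table after tokenizing the question.

    Reserves 2 positions for the [CLS] and [SEP] tokens, as in the
    original implementation.
    """
    return model_max_length - (len(question_tokens) + 2)

def fit_table_rows(row_token_counts, budget, drop_rows_to_fit=True):
    """Return the indices of the rows that fit into the token budget.

    If drop_rows_to_fit is False and the table is too big, raise a
    ValueError, matching the behaviour described above.
    """
    kept, used = [], 0
    for i, n in enumerate(row_token_counts):
        if used + n > budget:
            if drop_rows_to_fit:
                break  # silently drop the remaining rows
            raise ValueError("too many rows")
        kept.append(i)
        used += n
    return kept
```

For example, a six-token question leaves a budget of 512 - (6 + 2) = 504 table tokens.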

2. Testing

Currently I've written test_modeling_tapas.py (23 tests passed, 5 failed) and test_modeling_tapas_utilities.py (9 tests passed). However, there are 4 different settings in which TapasForQuestionAnswering can be used (see my notes) and these all need to be tested (currently only 1 setting is tested) - some help here would be great. Besides this, tests should be added to check whether the model can be properly trained, and test_tokenization_tapas.py should be written (which depends on how TapasTokenizer will be implemented).

Fixes the following issues (people requesting to add the model):

Who can review?

I suggest @sgugger @LysandreJik since we already discussed this on the forum here.
tokenizers: @mfuntowicz

DISCLAIMER: this is the first PR of my life; I've never done this before, so hopefully I don't mess anything up (I just got the Pro Git book 😛). I assume I should not use git rebase anymore now that this branch is submitted as a PR, and should only use git add, git commit and git push -u origin tapas_v3? And git pull origin tapas_v3 in case others make commits to my branch?
Is there a Slack channel where people can help me out in case I have git issues?

@sgugger
Collaborator

sgugger commented Oct 28, 2020

I think you might need a rebase on the latest master as your PR seems to have taken master from a while ago (all the modifications in the README should not be there for instance). If it messes the diff, we can always close this PR and open a new one, the branch will be safe :-)

@LysandreJik self-requested a review on October 28, 2020 at 13:41
@NielsRogge
Contributor Author

NielsRogge commented Oct 28, 2020

EDIT: I've always used git fetch upstream and git rebase upstream/master before pushing my local tapas_v3 branch to my fork on GitHub. I didn't know that all models now start with 1. in the README (which is why I had manually changed the numbering); this should be fixed now, and the branch is up to date.

@NielsRogge
Contributor Author

NielsRogge commented Oct 29, 2020

@sgugger should I keep rebasing this branch every day, to keep up with master (as long as the code is not being reviewed)?

Also, is it normal that I have to do a force push every time I perform a rebase and want to push to GitHub? When I simply do git push -u origin tapas_v3, I always get

(env) PS C:\Users\niels.rogge\Documents\Python projecten\transformers> git push -u origin tapas_v3
To https://github.com/NielsRogge/transformers.git
 ! [rejected]          tapas_v3 -> tapas_v3 (non-fast-forward)
error: failed to push some refs to 'https://github.com/NielsRogge/transformers.git'
hint: Updates were rejected because the tip of your current branch is behind
hint: its remote counterpart. Integrate the remote changes (e.g.
hint: 'git pull ...') before pushing again.
hint: See the 'Note about fast-forwards' in 'git push --help' for details.

after a local rebase.

@sgugger
Collaborator

sgugger commented Oct 29, 2020

I'm far from being an expert on git and I don't use the command line anyway, so can't really help you with that.

Member

@LysandreJik left a comment

Hi! Thanks a lot @NielsRogge for implementing this model! It's a great model that definitely deserves its place in the huggingface library, and you did a great job implementing it! I'm reviewing the code below.

Regarding the tokenizer, we do not have a dependency (hard or soft) on pandas, but we will soon have one on datasets. I think it is feasible to have datasets as the central object for tabular data instead of pandas dataframes. It's very easy to use it as a replacement: your example code would simply use a datasets.Dataset instead of a pd.DataFrame:

from transformers import TapasTokenizer
from datasets import Dataset

data = {'Actors': ["Brad Pitt", "Leonardo Di Caprio", "George Clooney"], 
        'Age': ["56", "45", "59"],
        'Number of movies': ["87", "53", "69"],
        'Date of birth': ["18 december 1963", "11 november 1974", "6 may 1961"]}

table = Dataset.from_dict(data)
queries = ["When was Brad Pitt born?", 
           "Which actor appeared in the least number of movies?",
           "What is the average number of movies?"]

tokenizer = TapasTokenizer.from_pretrained("tapas-base-finetuned-wtq")
inputs = tokenizer(table=table, queries=queries)

Putting @lhoestq in cc.

Regarding the configuration and tokenizer, it would be great to uncouple them from their BERT counterparts.

I haven't played around with the model yet, but it seems to be in great shape already. If you can do the modifications in the review here, that would be great; I'll then spend time playing with it and will come back to you regarding the tokenizer API currently set up.

Concerning git, it seems you have just rebased & force-pushed, so that's very cool. Regarding your question, this does mess up the history of users that are currently using your branch, but seeing as we're early in development, there shouldn't be any.

Comment on lines 60 to 65
Member

These checkpoints will need their namespace as well (something like google/tapas-base)

Contributor Author

Yes, some notes about this:

  • I would like to discuss which of the models to upload to the model hub, because the authors did release a large amount of model variants: https://github.com/google-research/tapas. I assume not all of them are that useful, especially the TINY models, and maybe we can only include the ones with relative embeddings (because performance is almost always slightly better with relative position embeddings) and intermediate pre-training (for the same reason).

  • The naming of the models is quite extensive: tapas_wtq_wikisql_sqa_inter_masklm_large_reset for example is the large model with relative position embeddings (reset) that was pretrained on masked language modeling, then intermediate pretraining, then fine-tuned on SQA, then on WikiSQL and finally on WTQ 😅 Is there a naming standard for models in the model hub? Maybe this can be shortened to tapas-large-finetuned-wtq, and provide the details in the model cards.

Member

  • Regarding the models to be added, I don't think it hurts to put all the models in the hub. This way it would be easy for future researchers to replicate the results of this paper, only using the implementation here, without needing to switch from one implementation to another. In any case, the conversion script should handle all of these. We can take care of uploading the models then, as they'll need to be under the google organization namespace.

  • I think your proposal regarding model naming makes total sense! As long as there exists a document in which we can find the old name, renaming to something easier to understand like this is fine.

Contributor

@patrickvonplaten left a comment

This looks like an amazing contribution! I let @LysandreJik handle the next steps and am happy to do a final review at a later stage.

@NielsRogge force-pushed the tapas_v3 branch 2 times, most recently from 9afa6ca to ab2b40d on November 2, 2020 at 08:42
@NielsRogge
Contributor Author

NielsRogge commented Nov 2, 2020

Hi! Thanks a lot @NielsRogge for implementing this model! It's a great model that definitely deserves its place in the huggingface library, and you did a great job implementing it! I'm reviewing the code below.

Thank you! Great to hear :)

@LysandreJik I addressed all of the comments. To summarize:

  • TapasConfig and TapasTokenizer now inherit from PretrainedConfig and PreTrainedTokenizer respectively. For TapasTokenizer, a lot of the code of tokenization_bert.py was copied (such as the BasicTokenizer and WordPieceTokenizer classes), since the tokenization logic of text itself is the same. However, some things are different (see tokenization_tapas.py).
  • modeling_tapas_utilities.py and tokenization_tapas_utilities.py are also gone now; their contents were moved to the bottom of modeling_tapas.py and tokenization_tapas.py respectively.
  • pandas is not a dependency in the code (pandas is not imported in any of the files), but I assume you want to switch to datasets so that people can use Tapas using only the Transformers library? However, some Pandas logic is currently used in tokenization_tapas.py, for example .iterrows() in the _tokenize_table function, so this will require some changes.
  • Concerning git, I assume I should stop rebasing at some point? I can do it as long as I'm the only one committing?
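On the pandas point above: the .iterrows() loop could be replaced by iterating over plain row dicts, which is also the shape you get when iterating over a datasets.Dataset. A minimal sketch of the two styles, assuming string-valued cells as in the example table (this simplifies what _tokenize_table actually does):

```python
import pandas as pd

table = pd.DataFrame({'Actors': ["Brad Pitt", "George Clooney"],
                      'Age': ["56", "59"]})

# pandas-specific: relies on DataFrame.iterrows()
rows_pandas = [[str(v) for v in row.values] for _, row in table.iterrows()]

# backend-agnostic: plain row dicts, which is also what iterating over a
# datasets.Dataset yields, so the same loop would work for both backends
records = table.to_dict('records')
rows_generic = [[str(r[col]) for col in table.columns] for r in records]

assert rows_pandas == rows_generic
```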

Member

@LysandreJik left a comment

I've played around with your implementation, and it works well! It's very cool. Now comes a complicated aspect: ensuring we stay consistent with the existing API, especially for the tokenizers.

This is a complicated part, so please let us know if you would like some help along the way/if you want us to take over from here, in which case we would open PRs against your branch with proposals.

Some of this echoes what you've put in your PR description, I'm putting it here as well so that we don't lose track of what's left to do.

Tokenizers

As you've said, the tokenizer might be a bit hard to use with our existing API, but I think it's possible. Let's list the issues with the current implementation:

Right now, the tokenizer is not behaving as we would like as it doesn't respect much of the flags passed to it. For example, running your example:

from transformers import TapasTokenizer

dir_name = "./tapas_base_finetuned_wtq_with_reset_and_intermediate_pretraining"

tokenizer = TapasTokenizer(vocab_file=dir_name + "/vocab.txt", model_max_length=512)
encoded_inputs = tokenizer.batch_encode_plus(table=table, queries=queries)

This immediately pads to the maximum length. This is the case whether I specify the padding strategy or not. It does not respect the truncation parameter either, nor does it respect the return_attention_mask, but it does respect the return_token_type_ids.

I haven't tried the pad_to_multiple_of, is_split_into_words, return_overflowing_tokens, return_special_tokens_mask, return_length parameters, but having them work similarly to other tokenizers should be a priority.
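For reference, the padding behaviour these flags are expected to produce can be sketched on bare token-id lists. This is a rough illustration of the contract, with illustrative names, not the transformers internals:

```python
# Sketch of flag-respecting padding on lists of token ids
# (illustrative, not the actual tokenizer code).

def pad_batch(batch_ids, padding="longest", max_length=512, pad_id=0,
              return_attention_mask=True):
    """Pad a batch only as requested, instead of always padding to max length."""
    if padding == "longest":
        target = max(len(ids) for ids in batch_ids)
    elif padding == "max_length":
        target = max_length
    else:  # padding=False / "do_not_pad": leave sequences untouched
        target = None

    out = {"input_ids": []}
    if return_attention_mask:
        out["attention_mask"] = []
    for ids in batch_ids:
        t = len(ids) if target is None else target
        out["input_ids"].append(ids + [pad_id] * (t - len(ids)))
        if return_attention_mask:
            out["attention_mask"].append([1] * len(ids) + [0] * (t - len(ids)))
    return out
```

With padding="longest", a batch of lengths 2 and 1 is padded to length 2; with padding=False it is returned unchanged.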

__call__

The __call__ method should be implemented for the tokenizer. The following methods should also be implemented:

⚠️ some of these are already implemented in the superclass, and may not need to be reimplemented. This may be the case for prepare_for_model, which takes care of the truncation/padding on its own.

  • build_inputs_with_special_tokens
  • get_special_tokens_mask
  • create_token_type_ids_from_sequences

It seems you've only implemented the batching methods, but there are methods to be used for single sequences as well:

  • encode_plus
  • prepare_for_model

Testing

Modeling Tapas Utilities

Same as the other utilities, it would be nice to have this in the main file. Moving test_modeling_tapas_utilities.py to be in the same file as test_modeling_tapas.py would be great.

Tokenization tests

The implementation of the tokenization tests would be a good way to monitor what is currently working/failing.

Contributor

@patrickvonplaten left a comment

PR looks very clean already! Great job! I left a couple of comments regarding naming, etc.
I'd really like a slow integration test, and I think we should delete the encoder-decoder-related logic from modeling_tapas.py for now.

Collaborator

@sgugger left a comment

Wow, that's a lot of work! Thanks a lot for putting this together with @LysandreJik's help. This seems on track to be merged, though there is still a bit of cleanup work to do first.

@NielsRogge
Contributor Author

NielsRogge commented Dec 1, 2020

Thanks for the reviews, I've updated the requested changes and marked the ones I did as resolved.

@LysandreJik, could you maybe fix the remaining comments? In short:

  • remove the encoder-decoder logic of TapasModel (only remove it from the API of TapasModel, but leave it in the code that's copied from BERT, which the user won't see or use). I'll let you do this since I don't want to mess up anything.
  • remove some tests and add a slow test as requested above

... then I'll mark these as resolved. Besides these, there's also the truncation of TapasTokenizer which should still be implemented. I copied what was left here:

  • Add support for the drop_rows_to_fit and cell_trim_length attributes of TapasTokenizer, which should reflect the original API (see also my suggestion on how this could be done for cell_trim_length). The original implementation can be found here.
  • Add support for the special [EMPTY] token for empty cells in a table (see the _tokenize method of TapasTokenizer, which now uses the format_text method as in the original implementation). Does this have implications for the add_special_tokens method? I assume not? What about get_special_tokens_mask? To be verified.
  • There was also a small discrepancy between the tokenization of TAPAS and the original implementation, see this GitHub issue. I don't expect this to be too big of an issue, but maybe you know more about it.
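The first two bullets could look roughly like the following sketch; the names are illustrative and the real TapasTokenizer attribute handling is still to be implemented:

```python
# Sketch of cell_trim_length and the [EMPTY] token for empty cells
# (illustrative names, not the actual TapasTokenizer code).

EMPTY_TOKEN = "[EMPTY]"  # special token for empty cells, as in the original implementation

def tokenize_cell(text, tokenize_fn, cell_trim_length=-1):
    """Tokenize one table cell.

    Empty cells map to the special [EMPTY] token; cell_trim_length >= 0
    trims each cell to at most that many tokens, while -1 (mirroring the
    original default) disables trimming.
    """
    tokens = tokenize_fn(text.strip())
    if not tokens:
        return [EMPTY_TOKEN]
    if cell_trim_length >= 0:
        tokens = tokens[:cell_trim_length]
    return tokens
```

Whether [EMPTY] needs explicit handling in add_special_tokens/get_special_tokens_mask is, as noted above, still to be verified.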

And then I assume we're done 👍 (finally)

@LysandreJik
Member

Sure, will do so. Probably tomorrow morning/afternoon!

@NielsRogge NielsRogge mentioned this pull request Dec 8, 2020
@NielsRogge
Contributor Author

Closing this one, as the most up-to-date version is now #9117.

@NielsRogge NielsRogge closed this Dec 15, 2020
