Conversation
I think you might need a rebase on the latest master, as your PR seems to have taken master from a while ago (the modifications in the README should not be there, for instance). If it messes up the diff, we can always close this PR and open a new one; the branch will be safe :-)
EDIT: I've always used
@sgugger should I keep rebasing this branch every day to keep up with master (as long as the code is not being reviewed)? Also, is it normal that I have to do a force push every time I rebase and want to push to GitHub? A plain push gets rejected when I try it after a local rebase.
I'm far from being an expert on git and I don't use the command line anyway, so can't really help you with that. |
LysandreJik
left a comment
Hi! Thanks a lot @NielsRogge for implementing this model! It's a great model that definitely deserves its place in the huggingface library, and you did a great job implementing it! I'm reviewing the code below.
Regarding the tokenizer: we do not have a dependency (hard or soft) on pandas, but we will soon have one on `datasets`. I think it is feasible to have `datasets` as the central object for tabular data instead of pandas DataFrames. It's very easy to use it as a replacement; your example code would simply use a `datasets.Dataset` instead of a `pd.DataFrame`:
```python
from transformers import TapasTokenizer
from datasets import Dataset

data = {
    "Actors": ["Brad Pitt", "Leonardo Di Caprio", "George Clooney"],
    "Age": ["56", "45", "59"],
    "Number of movies": ["87", "53", "69"],
    "Date of birth": ["18 december 1963", "11 november 1974", "6 may 1961"],
}
table = Dataset.from_dict(data)

queries = [
    "When was Brad Pitt born?",
    "Which actor appeared in the least number of movies?",
    "What is the average number of movies?",
]

tokenizer = TapasTokenizer.from_pretrained("tapas-base-finetuned-wtq")
inputs = tokenizer(table=table, queries=queries)
```

Putting @lhoestq in cc.
Regarding the configuration and tokenizer, it would be great to uncouple them from their BERT counterparts.
I haven't played around with the model yet, but it seems to be in great shape already. If you can make the modifications in the review here, that would be great; then I'll spend time playing with it and will come back to you regarding the tokenizer API currently set up.
Concerning git, it seems you have just rebased & force-pushed, so that's very cool. Regarding your question, this does mess up the history of users that are currently using your branch, but seeing as we're early in development, there shouldn't be any.
src/transformers/modeling_tapas.py
Outdated
These checkpoints will need their namespace as well (something like `google/tapas-base`).
Yes, some notes about this:
- I would like to discuss which of the models to upload to the model hub, because the authors released a large number of model variants: https://github.com/google-research/tapas. I assume not all of them are that useful, especially the TINY models, and maybe we should only include the ones with relative position embeddings (because performance is almost always slightly better with relative position embeddings) and intermediate pre-training (for the same reason).
- The naming of the models is quite extensive: `tapas_wtq_wikisql_sqa_inter_masklm_large_reset`, for example, is the large model with relative position embeddings (reset) that was pretrained with masked language modeling, then intermediate pretraining, then fine-tuned on SQA, then on WikiSQL and finally on WTQ 😅 Is there a naming standard for models in the model hub? Maybe this can be shortened to `tapas-large-finetuned-wtq`, with the details provided in the model cards.
- Regarding the models to be added, I don't think it hurts to put all the models in the hub. This way it would be easy for future researchers to replicate the results of this paper, only using the implementation here, without needing to switch from one implementation to another. In any case, the conversion script should handle all of these. We can take care of uploading the models then, as they'll need to be under the google organization namespace.
- I think your proposal regarding model naming makes total sense! As long as there exists a document in which we can find the old name, renaming to something easier to understand like this is fine.
patrickvonplaten
left a comment
This looks like an amazing contribution! I let @LysandreJik handle the next steps and am happy to do a final review at a later stage.
Thank you! Great to hear :) @LysandreJik I addressed all of the comments. To summarize:
I've played around with your implementation, and it works well! It's very cool. Now comes a complicated aspect: ensuring we stay consistent with the existing API, especially for the tokenizers.
This is a complicated part, so please let us know if you would like some help along the way/if you want us to take over from here, in which case we would open PRs against your branch with proposals.
Some of this echoes what you've put in your PR description, I'm putting it here as well so that we don't lose track of what's left to do.
Tokenizers
As you've said, the tokenizer might be a bit hard to use with our existing API, but I think it's possible. Let's list the issues with the current implementation:
Right now, the tokenizer does not behave as we would like, as it ignores many of the flags passed to it. For example, running your example:
```python
from transformers import TapasTokenizer

dir_name = "./tapas_base_finetuned_wtq_with_reset_and_intermediate_pretraining"
tokenizer = TapasTokenizer(vocab_file=dir_name + "/vocab.txt", model_max_length=512)
encoded_inputs = tokenizer.batch_encode_plus(table=table, queries=queries)
```

This immediately pads to the maximum length, whether I specify the padding strategy or not. It does not respect the truncation parameter either, nor `return_attention_mask`, but it does respect `return_token_type_ids`.
I haven't tried the `pad_to_multiple_of`, `is_split_into_words`, `return_overflowing_tokens`, `return_special_tokens_mask` and `return_length` parameters, but having them work similarly to other tokenizers should be a priority.
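To make the expectation concrete, here is a hypothetical pure-Python sketch (not the transformers implementation) of how the standard padding flags are supposed to interact; the name `pad_batch` is made up for illustration:

```python
def pad_batch(sequences, padding="longest", max_length=None, pad_id=0, pad_to_multiple_of=None):
    """Pad lists of token ids according to the standard strategies."""
    if padding in (False, "do_not_pad"):
        return sequences
    # "longest" pads to the longest sequence in the batch,
    # "max_length" pads to the user-supplied max_length.
    target = max(len(s) for s in sequences) if padding == "longest" else max_length
    if pad_to_multiple_of is not None:
        target = -(-target // pad_to_multiple_of) * pad_to_multiple_of  # round up
    return [s + [pad_id] * (target - len(s)) for s in sequences]
```

The key point is that the flags compose: `pad_to_multiple_of` rounds up whatever target the padding strategy chose, and `do_not_pad` short-circuits everything.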
`__call__`
The `__call__` method should be implemented for the tokenizer. The following methods should also be implemented:
- `prepare_for_model`, which takes care of the truncation/padding on its own.
- `build_inputs_with_special_tokens`
- `get_special_tokens_mask`
- `create_token_type_ids_from_sequences`
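For reference, a minimal sketch of what `build_inputs_with_special_tokens` and `create_token_type_ids_from_sequences` compute in a BERT-style tokenizer (the ids below are placeholders, not a real vocabulary; TAPAS itself uses richer token types):

```python
CLS_ID, SEP_ID = 101, 102  # placeholder special-token ids

def build_inputs_with_special_tokens(ids_a, ids_b=None):
    # Single sequence: [CLS] A [SEP]; pair of sequences: [CLS] A [SEP] B [SEP]
    out = [CLS_ID] + ids_a + [SEP_ID]
    if ids_b is not None:
        out = out + ids_b + [SEP_ID]
    return out

def create_token_type_ids_from_sequences(ids_a, ids_b=None):
    # Segment 0 covers [CLS] A [SEP]; segment 1 covers B [SEP]
    first = [0] * (len(ids_a) + 2)
    if ids_b is None:
        return first
    return first + [1] * (len(ids_b) + 1)
```

Both methods must agree on lengths, so that `input_ids` and `token_type_ids` always line up.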
It seems you've only implemented the batching methods, but there are methods to be used for single sequences as well:
- `encode_plus`
- `prepare_for_model`
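As an illustration, a hypothetical single-sequence `prepare_for_model` sketch, showing the truncate / add-specials / pad order described above (placeholder ids, not the transformers code):

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # placeholder ids

def prepare_for_model(ids, max_length=8, truncation=True, padding="max_length"):
    # 1. Truncate, leaving room for [CLS] and [SEP].
    if truncation and len(ids) > max_length - 2:
        ids = ids[: max_length - 2]
    # 2. Add the special tokens.
    input_ids = [CLS_ID] + ids + [SEP_ID]
    attention_mask = [1] * len(input_ids)
    # 3. Pad to max_length, extending the attention mask with zeros.
    if padding == "max_length" and len(input_ids) < max_length:
        n = max_length - len(input_ids)
        input_ids += [PAD_ID] * n
        attention_mask += [0] * n
    return {"input_ids": input_ids, "attention_mask": attention_mask}
```

Having this in one method lets `__call__` and `encode_plus` stay thin wrappers around it.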
Testing
Modeling Tapas Utilities
Same as the other utilities, it would be nice to have this in the main file. Moving `test_modeling_tapas_utilities.py` into `test_modeling_tapas.py` would be great.
Tokenization tests
The implementation of the tokenization tests would be a good way to monitor what is currently working/failing.
patrickvonplaten
left a comment
PR looks very clean already! Great job! I left a couple of comments regarding naming, etc.
I'd really like a slow integration test, and I think we should delete the encoder-decoder-related logic from modeling_tapas.py for now.
sgugger
left a comment
Wow, that's a lot of work! Thanks a lot for putting this together with @LysandreJik's help. This seems on track to be merged, though there is still a bit of cleanup work to do before then.
Thanks for the reviews! I've made the requested changes and marked the ones I addressed as resolved. @LysandreJik, could you maybe fix the remaining comments? In short:
... then I'll mark these as resolved. Besides these, there's also the truncation of
And then I assume we're done 👍 (finally)
Sure, will do so. Probably tomorrow morning/afternoon!
Tapas truncation
Add URLs to configuration_tapas.py and remove span_prediction
Closing this one as the most up-to-date version is now #9117.
What does this PR do?
Since the beginning of August, I've been working in my free time on incorporating the Tapas algorithm by Google AI into the Transformers library (because this library is awesome and I want to contribute to it!). Tapas is basically a BERT model with some clever modifications for natural language understanding of tabular data (structured data like tables, or even HTML). Adding this model could foster research in this area 😄
Demos of my current implementation:
- `TapasForQuestionAnswering` on WTQ (WikiTable Questions, by Stanford University)
- `TapasForQuestionAnswering` on SQA (Sequential Question Answering, by Microsoft Research)
- `TapasForSequenceClassification` on TabFact (Table Fact Checking, introduced at ICLR this year)

The model weights are available on the original GitHub repository, and I wrote a conversion script (similar to other models in the Transformers library) to load them into their PyTorch counterparts.
I suggest reading the paper as well as my notes to gain a full understanding of how the model works and how I implemented it. There's also a blog post by Google AI as well as a video by Yannic Kilcher explaining how the algorithm works.
The main classes are `TapasConfig`, `TapasModel`, `TapasForQuestionAnswering` and `TapasForSequenceClassification`, which can all be found in `modeling_tapas.py`. I'm quite sure the models are OK: the output is the same as that of the TensorFlow implementation. I added very extensive documentation (docstrings) to all classes, which you can view by running the `make html` command from the docs directory. Feedback appreciated!

However, there are 2 things for which I need some help/opinions to finish this work:
1. Making TapasTokenizer fully Transformers-compliant
To implement `TapasTokenizer`, I need some help/opinions. I suggest using pandas DataFrames as the central object for tabular data (as shown in the Colab notebooks above) and letting the API be as follows:

Currently I've only implemented the `batch_encode_plus` method of `TapasTokenizer`, because it's not really clear to me how to make it fully compatible with the Transformers library, since the way data is prepared for the model is a bit different compared to BERT/RoBERTa/etc. (see also my notes above). It's also not straightforward to make it compatible with the different padding/truncation strategies of Transformers. Currently, it works as follows: there's a function `_get_token_budget` in `tokenization_tapas.py` that calculates the number of tokens left for the flattened table after tokenizing a question. This is currently set to `self.model_max_length - (len(question_tokens) + 2)` (+2 for the CLS and SEP tokens), as was done in the original implementation. There is a hyperparameter when initializing `TapasTokenizer` called `drop_rows_to_fit` which, if set to `True`, drops rows of the table to fit into the token budget. If it's set to `False` and a table is too big, a `ValueError` is raised indicating 'too many rows'.

2. Testing
Currently I've written `test_modeling_tapas.py` (23 tests passed, 5 failed) and `test_modeling_tapas_utilities.py` (9 tests passed). However, there are 4 different settings in which to use `TapasForQuestionAnswering` (see my notes) and these all need to be tested (currently only 1 setting is tested), so some help here would be great. Besides this, tests should be added to check whether the model can be properly trained, as well as `test_tokenization_tapas.py` (which depends on how `TapasTokenizer` will be implemented).

Fixes the following issues (people requesting to add the model):
Who can review?
I suggest @sgugger @LysandreJik since we already discussed this on the forum here.
tokenizers: @mfuntowicz
DISCLAIMER: this is my first PR ever, I've never done this before, so hopefully I don't mess anything up (just got the Pro Git book 😛). I assume I should not use `git rebase` anymore now that this branch is submitted as a PR, and should only use `git add`, `git commit` and `git push -u origin tapas_v3`? And `git pull origin tapas_v3` in case others make commits to my branch?

Is there a Slack channel where people can help me out in case I have git issues?