
Conversation

@kanpuriyanawab
Collaborator

@kanpuriyanawab kanpuriyanawab commented Feb 5, 2023

Closes #718
Fixes #733

Docstrings are yet to be added; I will add them once the open questions about the implementation are resolved.

@kanpuriyanawab kanpuriyanawab marked this pull request as draft February 5, 2023 17:05
@kanpuriyanawab
Collaborator Author

kanpuriyanawab commented Feb 5, 2023

@kanpuriyanawab kanpuriyanawab changed the title Adding an AlbertMaskedLM task model Adding an AlbertMaskedLM task model and preprocessor Feb 5, 2023
@kanpuriyanawab kanpuriyanawab marked this pull request as ready for review February 7, 2023 18:09
@kanpuriyanawab
Collaborator Author

Ready for review

Member

@mattdangerw mattdangerw left a comment


Thank you! This looks great!

Please go ahead and add docs. It also looks like there are some test failures (this model uses different tokenization and padding than RoBERTa, so the preprocessing tests will need some updates).

@kanpuriyanawab
Collaborator Author

kanpuriyanawab commented Feb 8, 2023

CI is green, ready for review @mattdangerw @abheesht17 @jbischof

@kanpuriyanawab kanpuriyanawab requested review from mattdangerw and removed request for abheesht17 February 8, 2023 19:07
Collaborator

@abheesht17 abheesht17 left a comment


Quick review of the doc-strings.

Member

@mattdangerw mattdangerw left a comment


Thanks! Just a few quick comments

@kanpuriyanawab
Collaborator Author

@mattdangerw CI is green in terms of tests.
Not sure why the cloud build fails in the accelerator test.

@mattdangerw
Member

@shivance thanks! Will take another pass soon. It looks like the accelerator test failure is just a timeout, so it's probably unrelated to the code changes here.

@kanpuriyanawab
Collaborator Author

kanpuriyanawab commented Feb 10, 2023

Good to go @mattdangerw

Member

@mattdangerw mattdangerw left a comment


Just a couple minor comments. I can fix these up as I merge.

sep_token = "[SEP]"
pad_token = "<pad>"
mask_token = "<mask>"
for token in [cls_token, sep_token, pad_token]:
Member


Oh sorry, I missed one thing! We should update our tokenizer to check for the mask token here, which will then mean we have to update a lot of the tests for albert_tokenizer and albert_preprocessor so they actually include the mask token.
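For illustration, the kind of vocabulary check being asked for could look like the sketch below. The `check_special_tokens` helper and the toy vocabulary are hypothetical, not the actual KerasNLP tokenizer implementation:

```python
# Hypothetical sketch of a special-token vocabulary check; the names below
# are illustrative, not KerasNLP code.
def check_special_tokens(vocabulary, special_tokens):
    """Raise if any required special token is missing from the vocabulary."""
    missing = [t for t in special_tokens if t not in vocabulary]
    if missing:
        raise ValueError(f"Vocabulary is missing special tokens: {missing}")

# A toy vocabulary that includes the mask token, so the check passes:
vocab = {"[CLS]": 0, "[SEP]": 1, "<pad>": 2, "<mask>": 3, "air": 4}
check_special_tokens(vocab, ["[CLS]", "[SEP]", "<pad>", "<mask>"])
```

With such a check in place, any test fixture vocabulary that omits `<mask>` would fail at construction time, which is why the albert_tokenizer and albert_preprocessor tests would need updating.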

Member


This PR shows how to add the mask in when training a sentencepiece model. #732

)

def test_preprocess_strings(self):
input_data = " airplane at airport"
Member


One other follow-up as you are adding the check for the mask token: we should also make sure the data used to train the sentencepiece model above matches the data you are passing here.

When all is working, the token id output should look something like [bos, mask, mask, mask, eos, pad, pad, pad] (replacing those symbols with the proper indices from the vocab).
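As a concrete sketch of that expected output shape (the id values for BOS/EOS/PAD/MASK and the `pad_to_length` helper below are made-up illustrations, not KerasNLP code):

```python
# Illustrative only: hypothetical vocabulary indices for the special tokens.
BOS, EOS, PAD, MASK = 1, 2, 0, 4

def pad_to_length(ids, length, pad_id=PAD):
    """Right-pad a token id list to a fixed sequence length."""
    return ids + [pad_id] * (length - len(ids))

# Three input words, all selected for masking, in a sequence of length 8:
masked_ids = pad_to_length([BOS, MASK, MASK, MASK, EOS], 8)
print(masked_ids)  # [1, 4, 4, 4, 2, 0, 0, 0]
```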

@kanpuriyanawab
Collaborator Author

Still WIP

@mattdangerw
Member

Still WIP

No rush! Ping when this is ready for 👀 again!

@kanpuriyanawab
Collaborator Author

The PR is ready for review; waiting on CI, although the tests pass locally.

@kanpuriyanawab
Collaborator Author

kanpuriyanawab commented Feb 15, 2023

CI is green @mattdangerw @jbischof Ready for review !

Member

@mattdangerw mattdangerw left a comment


This looks great! Thank you!

Going to try out a quick training run with it, and if all looks good, will merge this in.

# See the License for the specific language governing permissions and
# limitations under the License.
-"""Tests for BERT classification model."""
+"""Tests for ALBERT classification model."""
Member


Oops good catch :)

@kanpuriyanawab
Collaborator Author

This looks great! Thank you!

Going to try out a quick training run with it, and if all looks good, will merge this in.

@mattdangerw Would you mind sharing the training script once done?

@mattdangerw
Member

@shivance sure! https://colab.research.google.com/gist/mattdangerw/c73a58e20132fd1117161a0f00b23b4b/albert-mlm.ipynb

Things look in the right ballpark to me: 43% guess-the-word accuracy after a single epoch on IMDb. So I'm going to pull this in. Thanks again!
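For readers unfamiliar with the metric, "guess the word" accuracy here is just the fraction of masked positions where the model's predicted token id matches the true id. A toy illustration (the id values below are made up, not from the training run):

```python
# Toy illustration of masked-token prediction accuracy: the fraction of
# masked positions where the predicted id equals the true id.
true_ids = [12, 7, 99, 7, 31]
predicted_ids = [12, 8, 99, 7, 5]  # 3 of 5 positions correct

correct = sum(t == p for t, p in zip(true_ids, predicted_ids))
accuracy = correct / len(true_ids)
print(accuracy)  # 0.6
```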

@mattdangerw mattdangerw merged commit 30cb703 into keras-team:master Feb 17, 2023
@kanpuriyanawab
Collaborator Author

kanpuriyanawab commented Feb 24, 2023

@mattdangerw @jbischof @chenmoneygithub I went through the GSoC ideas list
(screenshot: GSoC ideas list)

I found this one especially intriguing. I think this will add a lot of value to KerasNLP and the Keras ecosystem in general. I referred to the Hugging Face tasks; their pipeline syntax is really useful.
Which of these tasks is KerasNLP planning to support?

(screenshot: Hugging Face tasks)

Apart from supporting these tasks, I think accompanying tutorials would add a lot of value to the GSoC project. I'm already working on #754, and I also opened #1253 in keras-io; these are tutorials for Token Classification and Sentence Similarity respectively.

Which of these tasks are on the priority list, and how many of them would you suggest I include in my proposal (keeping the GSoC timeline in mind)?

@kanpuriyanawab kanpuriyanawab deleted the alberta_lm branch March 10, 2023 06:49
@kanpuriyanawab kanpuriyanawab changed the title Adding an AlbertMaskedLM task model and preprocessor Adding an AlbertMaskedLM task + Fix Projection layer dimension in MaskedLMHead Mar 16, 2023
Successfully merging this pull request may close these issues:

- Mistake in projection layer dimension of MaskedLMHead
- Add an AlbertMaskedLM task model
