
Training T5 from scratch on a new language? #269

Closed
ritvik1512 opened this issue Jun 17, 2020 · 25 comments · Fixed by #274

@ritvik1512

Hi, I was wondering if there are any guidelines or documentation on pre-training T5 from scratch (not just adapting it to a particular downstream task) in a new language?

Also, is it possible to do the same with PyTorch under the current framework?

Please let me know if this is not the right place to discuss this. Thank you!

@ritvik1512 ritvik1512 changed the title Training T5 from scratch on a new language? Pre-training T5 from scratch on a new language? Jun 17, 2020
@ritvik1512 ritvik1512 changed the title Pre-training T5 from scratch on a new language? Training T5 from scratch on a new language? Jun 17, 2020
@ritvik1512
Author

I did refer to issue #172 for this, but that just seems to cover initializing the model for fine-tuning on a specific task?

@agemagician

Check this:
https://github.com/google-research/google-research/tree/master/t5_closed_book_qa

@ritvik1512
Author

I'm sorry if I am missing something, but isn't this training specifically for the QA task?

@agemagician

agemagician commented Jun 17, 2020

Check also this:
#253

@ritvik1512
Author

Hi, thanks for the link. Did you guys perform unsupervised pre-training of these models from scratch on this dataset?

Any ideas on how this would carry over to a new language? (Also, is there any chance I could make it work with PyTorch?)

Sorry about the series of questions, but thanks for the help!

@huseinzol05

huseinzol05 commented Jun 20, 2020

I pretrained T5 base and small on the Malay language (Malaysia); all the steps are here: https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/t5

The steps to generate the SentencePiece vocabulary for this T5 model are here (see step 4): https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/preprocess
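
For readers following along, here is a minimal sketch of training such a SentencePiece vocabulary with the sentencepiece Python package; the corpus path, output prefix, and exact options below are assumptions based on T5's defaults, not necessarily the settings used in the linked repo:

    import sentencepiece as spm

    # Train a unigram SentencePiece model on raw text in the new language.
    # "my_corpus.txt" and the "new_lang_sp" prefix are hypothetical placeholders.
    spm.SentencePieceTrainer.train(
        input="my_corpus.txt",
        model_prefix="new_lang_sp",    # writes new_lang_sp.model and new_lang_sp.vocab
        vocab_size=32000,              # T5's default vocabulary size
        model_type="unigram",          # T5 uses a unigram SentencePiece model
        pad_id=0, eos_id=1, unk_id=2,  # T5 reserves 0=pad, 1=EOS, 2=UNK
        bos_id=-1,                     # T5 does not use a BOS token
    )
    # T5 adds its 100 sentinel tokens (extra_ids) on top of this vocabulary when
    # it loads the SentencePieceVocabulary, so they are not baked into the .model file.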

I increased the input and output lengths to 1024 because our use case is to summarize long texts (https://malaya.readthedocs.io/en/latest/Abstractive.html#load-t5) and to generate long texts given important contexts (https://malaya.readthedocs.io/en/latest/Generator.html#load-t5).
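
As a rough illustration, longer lengths like these are typically passed as sequence_length when constructing the Mesh TensorFlow model; the model directory, batch size, and other arguments here are assumptions, not the linked repo's actual configuration:

    import t5

    # sequence_length sets the maximum input/target lengths used for packing and
    # truncation; 1024/1024 mirrors the long-text use case described above.
    model = t5.models.MtfModel(
        model_dir="gs://my-bucket/t5-new-language",  # hypothetical checkpoint dir
        tpu=None,                                    # or a TPU address when training on TPU
        model_parallelism=1,
        batch_size=64,
        sequence_length={"inputs": 1024, "targets": 1024},
    )
    # With gin-based training the same setting is commonly expressed as
    # utils.run.sequence_length = {"inputs": 1024, "targets": 1024}.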

You can find our T5 model on Hugging Face: https://huggingface.co/huseinzol05/t5-base-bahasa-cased

I have never seen a seq2seq model as powerful as T5.

@ritvik1512
Author

ritvik1512 commented Jun 20, 2020

@huseinzol05 I understand it is in TensorFlow, but it is still extremely helpful and very close to what I am looking for. Thanks for taking the time to share the details!

@traumasv

@ritvik1512 I was able to get the PyTorch model going, and here's my team's notebook:

https://colab.research.google.com/github/jameschartouni/arabic_translation/blob/google-t5/Model_1.ipynb#scrollTo=UL5yLXs4YJw7

but I'm still trying to figure out how operative_config.gin should be adjusted when pre-training the PyTorch model.

I know you have only put out the API for fine-tuning purposes, but is there a way to correctly set up operative_config.gin for pre-training the HfPyTorch model?

@adarob
Collaborator

adarob commented Jun 22, 2020

Hey folks. I'm going to work on setting up the unsupervised task to not require the use of gin.

@adarob
Collaborator

adarob commented Jun 22, 2020

PTAL at #274 and see if it helps.

@ritvik1512
Author

@traumasv thanks for sharing the notebook! If I understand correctly, you are trying to implement translation for Arabic from scratch?

@ritvik1512
Author

Thanks @adarob! If it works well with @traumasv's task, I will try to implement the same.

copybara-service bot pushed a commit that referenced this issue Jun 22, 2020
copybara-service bot pushed a commit that referenced this issue Jun 22, 2020
@traumasv

@traumasv thanks for sharing the notebook! If I understand correctly, you are trying to implement translation for Arabic from scratch?

Yes, that's right.

@traumasv

PTAL at #274 and see if it helps.

@adarob Thank you for such a quick reply and solution!

I tried adding the token_preprocessor functions to my tasks and ran training with the API without the gin file, and it looks like there's a binding missing for 'denoise'?

Could this be specified in TaskRegistry.add() or in model.train()?

@adarob
Collaborator

adarob commented Jun 23, 2020 via email

Which binding is missing? Can you share the error message?

@ritvik1512
Author

@traumasv ah right, I was aiming for a slightly different approach of first pre-training the model on one particular language and then later fine-tuning for downstream tasks, but thanks nonetheless!

@traumasv

traumasv commented Jun 24, 2020 via email

Here's the link to the cell with the error: https://colab.research.google.com/drive/1eOjdqErmzxOED4tbyNddzyCwuqonSvqd#scrollTo=f6f5uUWXWUKw&line=4&uniqifier=1

@ashispapu

Hi, thanks for the link. Did you guys perform unsupervised pre-training of these models from scratch on this dataset?

Any ideas on how this would carry over to a new language? (Also, is there any chance I could make it work with PyTorch?)

Sorry about the series of questions, but thanks for the help!

@ritvik1512 Hey, did you try pre-training on one language and then fine-tuning for downstream tasks? I'm also exploring the same but haven't come across any useful resources. Please let me know if you have made any progress.

@ashispapu

@adarob @traumasv Is there any fix available for this binding issue?
RuntimeError: Required bindings for 'denoise' not provided in config: ['noise_mask_fn']

@craffel
Contributor

craffel commented Jul 25, 2020

denoise has no argument noise_function; it should use the argument noise_mask_fn (see https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1749). The functools.partial call for denoise should be changed to:

      functools.partial(
          preprocessors.denoise,
          inputs_fn=preprocessors.noise_span_to_unique_sentinel,
          targets_fn=preprocessors.nonnoise_span_to_unique_sentinel,
          noise_density=0.15,
          noise_mask_fn=preprocessors.iid_noise_mask
      )
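
For anyone hitting the same binding error, a hedged sketch of where that corrected partial might sit in a task registration, following the 2020-era t5.data API discussed in this thread (newer versions use output_features/seqio instead); the task name, dataset function, and vocab path below are placeholders, not anyone's actual code:

    import functools
    import t5
    from t5.data import preprocessors

    t5.data.TaskRegistry.add(
        "my_language_unsupervised",            # hypothetical task name
        dataset_fn=my_plaintext_dataset_fn,    # placeholder: returns a tf.data.Dataset of {"text": ...}
        splits=["train", "validation"],
        # Map the raw text into the "targets" feature; denoise derives the inputs from it.
        text_preprocessor=functools.partial(
            preprocessors.rekey, key_map={"inputs": None, "targets": "text"}
        ),
        # The corrected partial from above, supplied directly so no gin bindings are needed.
        token_preprocessor=functools.partial(
            preprocessors.denoise,
            inputs_fn=preprocessors.noise_span_to_unique_sentinel,
            targets_fn=preprocessors.nonnoise_span_to_unique_sentinel,
            noise_density=0.15,
            noise_mask_fn=preprocessors.iid_noise_mask,
        ),
        sentencepiece_model_path="new_lang_sp.model",  # your SentencePiece vocabulary
        metric_fns=[],
    )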

@RTKno1

RTKno1 commented Aug 29, 2020

Hi @craffel, I am continuing the work that @traumasv was doing last month. I tried implementing that function in our code and got training to work. However, I then encountered a KeyError: 'input_plaintext' when running eval, similar to #173. I also tried that solution of removing the postprocessor and metrics from the tasks, but then I get KeyError: 'translation_en_msa' (which is a task name) at line 438 in hf_model.py. I saw that batches in there is only computed if the task has metric_fns, so I uncommented that line, but I still get the same error.
I can't seem to figure this out; could you please take a look?
Here is the colab link to the current code; the relevant sections are under English to Arabic Task, Levantine to MSA Task, Maghrib to MSA Task, and train, eval.

Thank you!

@Stellakats

Hi @craffel, I am continuing the work that @traumasv was doing last month. I tried implementing that function in our code and got training to work. However, I then encountered a KeyError: 'input_plaintext' when running eval, similar to #173. I also tried that solution of removing the postprocessor and metrics from the tasks, but then I get KeyError: 'translation_en_msa' (which is a task name) at line 438 in hf_model.py. I saw that batches in there is only computed if the task has metric_fns, so I uncommented that line, but I still get the same error.
I can't seem to figure this out; could you please take a look?
Here is the colab link to the current code; the relevant sections are under English to Arabic Task, Levantine to MSA Task, Maghrib to MSA Task, and train, eval.

Thank you!

Hi! Did you manage to find a solution for this?

@RTKno1

RTKno1 commented Oct 16, 2020

Hi @Stellakats! Yes, I did, though it may not be what you are looking for. I just followed @huseinzol05's task registry setup here for the arguments: https://github.com/huseinzol05/Malaya/blob/master/pretrained-model/t5/prepare/finetune-summarization.ipynb. So I removed the token_preprocessor argument. You can also view the colab link I posted and navigate to the "Arabic to English Task" section to see how we add the task to the TaskRegistry.
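
For later readers, here is a rough sketch of what such a registration without a token_preprocessor might look like; the dataset function, file path, and vocab path are placeholder assumptions rather than the notebook's actual code, and the task name simply mirrors the one mentioned above:

    import functools
    import t5
    import tensorflow.compat.v1 as tf
    from t5.data import preprocessors

    def translation_dataset_fn(split, shuffle_files=False):
        # Hypothetical TSV file with one "source<TAB>target" pair per line.
        return tf.data.TextLineDataset("gs://my-bucket/%s.tsv" % split)

    t5.data.TaskRegistry.add(
        "translation_en_msa",
        dataset_fn=translation_dataset_fn,
        splits=["train", "validation"],
        # parse_tsv turns each line into {"inputs": ..., "targets": ...};
        # no token_preprocessor is supplied, matching the setup described above.
        text_preprocessor=[
            functools.partial(preprocessors.parse_tsv,
                              field_names=["inputs", "targets"]),
        ],
        sentencepiece_model_path="new_lang_sp.model",
        metric_fns=[t5.evaluation.metrics.bleu],
    )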

@hiiamsid

hiiamsid commented Oct 1, 2021

@ritvik1512 were you able to train T5 on a non-English language?

@PiotrNawrot

We've released nanoT5, which reproduces T5 (a model similar to BART) pre-training in PyTorch (not Flax).

You can take a look!

Any suggestions are more than welcome.
