
Training T5 from scratch on a new language? #269

Closed
ritvik1512 opened this issue Jun 17, 2020 · 25 comments · Fixed by #274

@ritvik1512

Hi, I was wondering if there are any guidelines or documentation on pre-training T5 from scratch (not just adapting it to a particular downstream task) in a new language?

Also, is it possible to do the same with PyTorch under the current framework?

Please let me know if this is not the right place to discuss this. Thank you!

@ritvik1512 ritvik1512 changed the title Training T5 from scratch on a new language? Pre-training T5 from scratch on a new language? Jun 17, 2020
@ritvik1512 ritvik1512 changed the title Pre-training T5 from scratch on a new language? Training T5 from scratch on a new language? Jun 17, 2020
@ritvik1512
Author

I did refer to issue #172 for this, but that just seems to cover initializing the model for fine-tuning on a specific task?

@agemagician

Check this:
https://github.com/google-research/google-research/tree/master/t5_closed_book_qa

@ritvik1512
Author

I'm sorry if I am missing something, but isn't this training specifically for the QA task?

@agemagician

agemagician commented Jun 17, 2020

Check also this:
#253

@ritvik1512
Author

Hi, thanks for the link. Did you guys perform unsupervised pre-training of these models from scratch on this dataset?

Any ideas on how this would carry over to a new language? (Also, is there any chance I could make it work with PyTorch?)

Sorry about the series of questions, but thanks for the help!

@huseinzol05

huseinzol05 commented Jun 20, 2020

I pretrained T5 base and small on the Malay language (Malaysia); all the steps are here: https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/t5

The steps to generate the SentencePiece vocabulary for this T5 model are here (see step 4): https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/preprocess
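
For readers following along, here is a minimal sketch of training such a SentencePiece vocabulary with the sentencepiece Python package; the corpus path, output prefix, and exact options below are assumptions based on T5's defaults, not necessarily the settings used in the linked repo:

    import sentencepiece as spm

    # Train a unigram SentencePiece model on raw text in the new language.
    # "my_corpus.txt" and the "new_lang_sp" prefix are hypothetical placeholders.
    spm.SentencePieceTrainer.train(
        input="my_corpus.txt",
        model_prefix="new_lang_sp",    # writes new_lang_sp.model and new_lang_sp.vocab
        vocab_size=32000,              # T5's default vocabulary size
        model_type="unigram",          # T5 uses a unigram SentencePiece model
        pad_id=0, eos_id=1, unk_id=2,  # T5 reserves 0=pad, 1=EOS, 2=UNK
        bos_id=-1,                     # T5 does not use a BOS token
    )
    # T5 adds its 100 sentinel tokens (extra_ids) on top of this vocabulary when
    # it loads the SentencePieceVocabulary, so they are not baked into the .model file.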

I increased the input and output lengths to 1024 because our use case is to summarize long texts (https://malaya.readthedocs.io/en/latest/Abstractive.html#load-t5) and to generate long texts given important contexts (https://malaya.readthedocs.io/en/latest/Generator.html#load-t5).
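
As a rough illustration, longer lengths like these are typically passed as sequence_length when constructing the Mesh TensorFlow model; the model directory, batch size, and other arguments here are assumptions, not the linked repo's actual configuration:

    import t5

    # sequence_length sets the maximum input/target lengths used for packing and
    # truncation; 1024/1024 mirrors the long-text use case described above.
    model = t5.models.MtfModel(
        model_dir="gs://my-bucket/t5-new-language",  # hypothetical checkpoint dir
        tpu=None,                                    # or a TPU address when training on TPU
        model_parallelism=1,
        batch_size=64,
        sequence_length={"inputs": 1024, "targets": 1024},
    )
    # With gin-based training the same setting is commonly expressed as
    # utils.run.sequence_length = {"inputs": 1024, "targets": 1024}.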

You can find our T5 model on Hugging Face: https://huggingface.co/huseinzol05/t5-base-bahasa-cased

I have never seen a seq2seq model as powerful as T5.

@ritvik1512
Author

ritvik1512 commented Jun 20, 2020

@huseinzol05 I understand it is in TensorFlow, but it is still extremely helpful and very close to what I am looking for. Thanks for taking the time to share the details!

@traumasv

@ritvik1512 I was able to get the PyTorch model going, and here's my team's notebook:

https://colab.research.google.com/github/jameschartouni/arabic_translation/blob/google-t5/Model_1.ipynb#scrollTo=UL5yLXs4YJw7

but I'm still trying to figure out how operative_config.gin should be adjusted when pre-training the PyTorch model.

I know you have only put out the API for fine-tuning purposes, but is there a way to correctly set up operative_config.gin for pre-training the HfPyTorch model?

@adarob
Collaborator

adarob commented Jun 22, 2020

Hey folks. I'm going to work on setting up the unsupervised task to not require the use of gin.

@adarob
Collaborator

adarob commented Jun 22, 2020

PTAL at #274 and see if it helps.

@ritvik1512
Author

@traumasv thanks for sharing the notebook! If I understand correctly, you are trying to implement translation for Arabic from scratch?

@ritvik1512
Author

Thanks @adarob! If it works well with @traumasv's task, I will try to implement the same.

copybara-service bot pushed a commit that referenced this issue Jun 22, 2020
copybara-service bot pushed a commit that referenced this issue Jun 22, 2020
@traumasv

@traumasv thanks for sharing the notebook! If I understand correctly, you are trying to implement translation for Arabic from scratch?

Yes, that's right.

@traumasv

PTAL at #274 and see if it helps.

@adarob Thank you for such a quick reply and solution!

I tried adding the token_preprocessor functions to my tasks and ran training with the API without the gin file, and it looks like there's a binding missing for 'denoise'?

Could this be specified in TaskRegistry.add() or in model.train()?

@adarob
Collaborator

adarob commented Jun 23, 2020 via email

Which binding is missing? Can you share the error message?

@ritvik1512
Author

@traumasv ah right, I was aiming for a slightly different approach of first pre-training the model on one particular language and then later fine-tuning for downstream tasks, but thanks nonetheless!

@traumasv

traumasv commented Jun 24, 2020 via email

Here's the link to the cell with the error: https://colab.research.google.com/drive/1eOjdqErmzxOED4tbyNddzyCwuqonSvqd#scrollTo=f6f5uUWXWUKw&line=4&uniqifier=1

@ashispapu

Hi, thanks for the link. Did you guys perform unsupervised pre-training of these models from scratch on this dataset?

Any ideas on how this would carry over to a new language? (Also, is there any chance I could make it work with PyTorch?)

Sorry about the series of questions, but thanks for the help!

@ritvik1512 Hey, did you try pre-training on one language and then fine-tuning for downstream tasks? I'm also exploring the same but haven't come across any useful resources. Please let me know if you have made any progress.

@ashispapu

@adarob @traumasv Is there any fix available for this binding issue?
RuntimeError: Required bindings for 'denoise' not provided in config: ['noise_mask_fn']

@craffel
Contributor

craffel commented Jul 25, 2020

denoise has no argument noise_function; it should use the argument noise_mask_fn (see https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py#L1749). The functools.partial call for denoise should be changed to:

      functools.partial(
          preprocessors.denoise,
          inputs_fn=preprocessors.noise_span_to_unique_sentinel,
          targets_fn=preprocessors.nonnoise_span_to_unique_sentinel,
          noise_density=0.15,
          noise_mask_fn=preprocessors.iid_noise_mask
      )
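
For anyone hitting the same binding error, a hedged sketch of where that corrected partial might sit in a task registration, following the 2020-era t5.data API discussed in this thread (newer versions use output_features/seqio instead); the task name, dataset function, and vocab path below are placeholders, not anyone's actual code:

    import functools
    import t5
    from t5.data import preprocessors

    t5.data.TaskRegistry.add(
        "my_language_unsupervised",            # hypothetical task name
        dataset_fn=my_plaintext_dataset_fn,    # placeholder: returns a tf.data.Dataset of {"text": ...}
        splits=["train", "validation"],
        # Map the raw text into the "targets" feature; denoise derives the inputs from it.
        text_preprocessor=functools.partial(
            preprocessors.rekey, key_map={"inputs": None, "targets": "text"}
        ),
        # The corrected partial from above, supplied directly so no gin bindings are needed.
        token_preprocessor=functools.partial(
            preprocessors.denoise,
            inputs_fn=preprocessors.noise_span_to_unique_sentinel,
            targets_fn=preprocessors.nonnoise_span_to_unique_sentinel,
            noise_density=0.15,
            noise_mask_fn=preprocessors.iid_noise_mask,
        ),
        sentencepiece_model_path="new_lang_sp.model",  # your SentencePiece vocabulary
        metric_fns=[],
    )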

@RTKno1

RTKno1 commented Aug 29, 2020

Hi @craffel, I am continuing the work that @traumasv was doing last month. I tried implementing that function in our code and got training to work. However, I then encountered a KeyError: 'input_plaintext' when running eval, similar to #173. I also tried that solution of removing the postprocessor and metrics from the tasks, but then I get KeyError: 'translation_en_msa' (which is a task name) at line 438 in hf_model.py. I saw that batches in there is only computed if the task has metric_fns, so I uncommented that line, but I still get the same error.
I can't seem to figure this out; could you please take a look?
Here is the colab link to the current code; the relevant sections are under English to Arabic Task, Levantine to MSA Task, Maghrib to MSA Task, and train, eval.

Thank you!

@Stellakats

Hi @craffel, I am continuing the work that @traumasv was doing last month. I tried implementing that function in our code and got training to work. However, I then encountered a KeyError: 'input_plaintext' when running eval, similar to #173. I also tried that solution of removing the postprocessor and metrics from the tasks, but then I get KeyError: 'translation_en_msa' (which is a task name) at line 438 in hf_model.py. I saw that batches in there is only computed if the task has metric_fns, so I uncommented that line, but I still get the same error.
I can't seem to figure this out; could you please take a look?
Here is the colab link to the current code; the relevant sections are under English to Arabic Task, Levantine to MSA Task, Maghrib to MSA Task, and train, eval.

Thank you!

Hi! Did you manage to find a solution for this?

@RTKno1

RTKno1 commented Oct 16, 2020

Hi @Stellakats! Yes, I did, though it may not be what you are looking for. I just followed @huseinzol05's task registry setup here for the arguments: https://github.com/huseinzol05/Malaya/blob/master/pretrained-model/t5/prepare/finetune-summarization.ipynb. So I removed the token_preprocessor argument. You can also view the colab link I posted and navigate to the "Arabic to English Task" section to see how we add the task to the TaskRegistry.
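
For later readers, here is a rough sketch of what such a registration without a token_preprocessor might look like; the dataset function, file path, and vocab path are placeholder assumptions rather than the notebook's actual code, and the task name simply mirrors the one mentioned above:

    import functools
    import t5
    import tensorflow.compat.v1 as tf
    from t5.data import preprocessors

    def translation_dataset_fn(split, shuffle_files=False):
        # Hypothetical TSV file with one "source<TAB>target" pair per line.
        return tf.data.TextLineDataset("gs://my-bucket/%s.tsv" % split)

    t5.data.TaskRegistry.add(
        "translation_en_msa",
        dataset_fn=translation_dataset_fn,
        splits=["train", "validation"],
        # parse_tsv turns each line into {"inputs": ..., "targets": ...};
        # no token_preprocessor is supplied, matching the setup described above.
        text_preprocessor=[
            functools.partial(preprocessors.parse_tsv,
                              field_names=["inputs", "targets"]),
        ],
        sentencepiece_model_path="new_lang_sp.model",
        metric_fns=[t5.evaluation.metrics.bleu],
    )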

@hiiamsid

hiiamsid commented Oct 1, 2021

@ritvik1512 were you able to train T5 on a non-English language?

@PiotrNawrot

We've released nanoT5, which reproduces T5 (a model similar to BART) pre-training in PyTorch (not Flax).

You can take a look!

Any suggestions are more than welcome.
