Training T5 from scratch on a new language? #269
Comments
I did refer to issue #172 for this, but that just seems to be initializing it for fine-tuning on a specific task? |
I'm sorry if I am missing something, but isn't this training specifically for the QA task? |
Check also this: |
Hi, thanks for the link. Did you perform unsupervised pre-training of these models from scratch on this dataset? Any ideas on how this would transfer to a new language? (Also, is there any chance I could make it work with PyTorch?) Sorry about the series of questions, but thanks for the help! |
I pretrained T5 base and small on the Malay language (Malaysia); all the steps are here: https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/t5. The steps to generate the SentencePiece vocabulary for this T5 model are here: https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/preprocess, no. 4. I increased the input and output lengths to 1024 because our use cases are summarizing long texts (https://malaya.readthedocs.io/en/latest/Abstractive.html#load-t5) and generating long texts given important contexts (https://malaya.readthedocs.io/en/latest/Generator.html#load-t5). You can find our T5 model on Hugging Face: https://huggingface.co/huseinzol05/t5-base-bahasa-cased. I have never seen a seq2seq model as powerful as T5. |
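For anyone adapting this to another language, here is a minimal sketch of training a SentencePiece vocabulary that follows T5's token-id conventions (pad=0, eos=1, unk=2, no BOS). The corpus path, model prefix, and vocabulary size are illustrative assumptions, not the exact values used in the Malaya repo.

import sentencepiece as spm

# Train a unigram SentencePiece model on a plain-text corpus,
# one sentence or document per line.
spm.SentencePieceTrainer.train(
    input="my_language_corpus.txt",  # assumed path; replace with your corpus
    model_prefix="my_language.t5",   # writes my_language.t5.model / .vocab
    vocab_size=32000,
    model_type="unigram",
    pad_id=0,   # T5 vocabularies reserve 0 for padding,
    eos_id=1,   # 1 for end-of-sequence,
    unk_id=2,   # 2 for unknown,
    bos_id=-1,  # and use no BOS token
)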
@huseinzol05 I understand it is in TensorFlow, but it is still extremely helpful and very close to what I am looking for. Thanks for taking the time to share the details! |
@ritvik1512 I was able to get the PyTorch model going and here's my team's notebook: but I'm still trying to figure out how operative_config.gin should be adjusted when pre-training the PyTorch model. I know that you have only put out the API for fine-tuning purposes, but is there a way to correctly set up operative_config.gin for pre-training the HfPyTorch model? |
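For context, the HfPyTorch training loop itself can be driven from Python, along the lines of the usage example in the t5 README. This is only a hedged sketch: the task name, model directory, and hyperparameters are placeholders, and the pre-training task is assumed to be registered already.

import functools
import torch
import transformers
import t5

# Pick a device and wrap the Hugging Face checkpoint in the t5 PyTorch API.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = t5.models.HfPyTorchModel("t5-base", "/tmp/t5-pretrain/", device)

# Train on a registered task/mixture without a gin file.
model.train(
    mixture_or_task_name="my_span_corruption_task",  # hypothetical registered task
    steps=10000,
    save_steps=1000,
    sequence_length={"inputs": 512, "targets": 128},
    split="train",
    batch_size=16,
    optimizer=functools.partial(transformers.AdamW, lr=1e-4),
)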
Hey folks. I'm going to work on setting up the unsupervised task to not require the use of gin. |
…configs. Fixes #269. PiperOrigin-RevId: 317647026
PTAL at #274 and see if it helps. |
@traumasv thanks for sharing the notebook! If I get this correctly, you guys are trying to implement translation for Arabic from scratch? |
Yes that's right |
@adarob Thank you for such a quick reply and solution! I tried adding the token_preprocessor functions to my tasks and ran training with the API without the gin file, and it looks like there's a binding missing for 'denoise'? Could this be specified in TaskRegistry.add() or in model.train()? |
Which binding is missing? Can you share the error message? |
@traumasv ah right, I was aiming for a slightly different approach: first pre-training the model on one particular language and then fine-tuning it for downstream tasks later. Thanks nonetheless! |
Here's the link to the cell with the error:
https://colab.research.google.com/drive/1eOjdqErmzxOED4tbyNddzyCwuqonSvqd#scrollTo=f6f5uUWXWUKw&line=4&uniqifier=1
|
@ritvik1512 Hey, did you try pre-training on one language and then fine-tuning for downstream tasks? I'm also exploring the same, but I haven't come across any useful resources. Please let me know if you have made any progress. |
@adarob @traumasv Is there any fix available for this binding issue? |
import functools
from t5.data import preprocessors

# Span-corruption (denoising) token preprocessor: masks tokens i.i.d. at a
# 15% rate and replaces each noise span with a unique sentinel token.
functools.partial(
    preprocessors.denoise,
    inputs_fn=preprocessors.noise_span_to_unique_sentinel,
    targets_fn=preprocessors.nonnoise_span_to_unique_sentinel,
    noise_density=0.15,
    noise_mask_fn=preprocessors.iid_noise_mask,
) |
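For completeness, a hedged sketch of how this token preprocessor could be attached to a registered pre-training task; the task name, dataset function, corpus path, and vocabulary path are placeholders, and the exact TaskRegistry.add keyword arguments have changed across t5 versions.

import functools
import tensorflow.compat.v1 as tf
import t5
from t5.data import preprocessors

def my_dataset_fn(split, shuffle_files=False):
    # Placeholder dataset: raw text in the target language, one example per line.
    del split, shuffle_files
    ds = tf.data.TextLineDataset(["my_language_corpus.txt"])
    return ds.map(lambda line: {"targets": line})

t5.data.TaskRegistry.add(
    "my_language_span_corruption",
    dataset_fn=my_dataset_fn,
    splits=["train"],
    text_preprocessor=[],
    token_preprocessor=functools.partial(
        preprocessors.denoise,
        inputs_fn=preprocessors.noise_span_to_unique_sentinel,
        targets_fn=preprocessors.nonnoise_span_to_unique_sentinel,
        noise_density=0.15,
        noise_mask_fn=preprocessors.iid_noise_mask,
    ),
    sentencepiece_model_path="my_language.t5.model",  # e.g. from the SentencePiece step above
    metric_fns=[],
)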
Hi @craffel, I am continuing the work that @traumasv was doing last month. I tried implementing that function into our code and got training to work. However, I then encountered an error. Thank you! |
Hi! Did you manage to find a solution for this? |
Hi @Stellakats! Yes I did, actually, though it may not be what you are looking for. I just followed @huseinzol05's task registry setup here: https://github.com/huseinzol05/Malaya/blob/master/pretrained-model/t5/prepare/finetune-summarization.ipynb for the arguments, so I removed the token_preprocessor argument. You can also view the Colab link I posted and navigate to the "Arabic to English Task" section to see how we add the task to the TaskRegistry. |
@ritvik1512 were you able to implement T5 on a non-English language? |
We've released nanoT5, which reproduces T5 (a model similar to BART) pre-training in PyTorch (not Flax). You can take a look! Any suggestions are more than welcome. |
Hi, I was wondering if there are any guidelines or documentation on pre-training T5 from scratch (not tied to any particular downstream task) in a new language?
Also, is it possible to do the same with PyTorch under the current framework?
Please let me know if this is not the right place to discuss this. Thank you!