Tips for training the base model from scratch on a smaller amount of data #11
Hi @VictorAtPL. For (1), as explained in Section 2.3 and Appendix A.2 (https://arxiv.org/abs/2111.15664), we sampled words and phrases from Wikipedia. The following links would be helpful to you. To process the dump files, you may consider using WikiExtractor or other relevant tools/scripts. For (2) and (3), you may need to control … I hope this is useful to you :)
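For anyone looking at the dump-processing step, here is a minimal sketch assuming the `wikiextractor` package and its `--json` JSON-lines output; the dump file name, output directory, sample count, and phrase length are all placeholders, not values from the paper:

```python
# Extract plain text from a Wikipedia dump first, e.g.:
#   python -m wikiextractor.WikiExtractor plwiki-latest-pages-articles.xml.bz2 --json -o extracted/
import json
import random
from pathlib import Path

def sample_phrases(extracted_dir, n_samples=100_000, max_words=5):
    """Sample short word spans from WikiExtractor's JSON-lines output."""
    words = []
    for path in Path(extracted_dir).rglob("wiki_*"):
        with open(path, encoding="utf-8") as f:
            for line in f:
                article = json.loads(line)
                words.extend(article["text"].split())
    phrases = []
    for _ in range(n_samples):
        start = random.randrange(len(words))
        length = random.randint(1, max_words)
        phrases.append(" ".join(words[start:start + length]))
    return phrases
```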
One more general tip: to train a model for a new language, you may need to change some code regarding the token vocabulary/tokenizer. For example, see this block. This depends on the alphabet of the target language.
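To illustrate the kind of change meant here, a rough sketch using the plain Hugging Face `transformers` API rather than the exact block in this repo; the `facebook/mbart-large-cc25` checkpoint is only a stand-in (Donut's actual decoder is an Asian-BART variant), and the character list is just an example:

```python
from transformers import MBartForCausalLM, MBartTokenizer

# Stand-in checkpoint for illustration; swap in the checkpoint/tokenizer this repo uses.
tokenizer = MBartTokenizer.from_pretrained("facebook/mbart-large-cc25")
model = MBartForCausalLM.from_pretrained("facebook/mbart-large-cc25")

# Characters of the target language that may be missing from the vocabulary
# (Polish diacritics here, purely as an example).
new_tokens = ["ą", "ć", "ę", "ł", "ń", "ó", "ś", "ź", "ż"]
num_added = tokenizer.add_tokens([t for t in new_tokens if t not in tokenizer.get_vocab()])

# Grow the embedding matrix to match the enlarged vocabulary.
if num_added > 0:
    model.resize_token_embeddings(len(tokenizer))
```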
Hello! Is it a problem that Asian mBART was not trained on the letters I want to add during the "text only" phase (with the text encoder)? If I need to add tokens, is it enough just to do something like this?
+1. If anyone knows how to train a model and tokenizer like Asian mBART for other languages, in a way that can easily replace the current Donut decoder, please share this knowledge with us.
@VictorAtPL Did you try to add tokens like …?
@Vadkoz Haven't tried yet. I think the proper approach is to use the Wikipedia corpus of the languages you care about most and retrain the whole decoder on this corpus. I'm not sure what kind of tokens I should add: just letters, sub-words, or the most common words? I'd rather leave it up to the tokenizer to determine how tokens should be derived from, e.g., a Polish Wikipedia corpus.
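A rough sketch of that "leave it to the tokenizer" idea, assuming a fast (tokenizers-backed) base tokenizer and a preprocessed Polish Wikipedia dataset on the Hub; the dataset name, base checkpoint, and vocab size are placeholders, and I haven't verified that the mBART sentencepiece tokenizer retrains cleanly this way:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Placeholder corpus: any plain-text dataset in the target language works here.
corpus = load_dataset("wikimedia/wikipedia", "20231101.pl", split="train")

def batch_iterator(batch_size=1_000):
    for i in range(0, len(corpus), batch_size):
        yield corpus[i : i + batch_size]["text"]

# train_new_from_iterator requires a "fast" (tokenizers-backed) tokenizer.
old_tokenizer = AutoTokenizer.from_pretrained("facebook/mbart-large-cc25")
new_tokenizer = old_tokenizer.train_new_from_iterator(batch_iterator(), vocab_size=32_000)
new_tokenizer.save_pretrained("mbart-tokenizer-pl")
```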
Hey @Vadkoz, have you figured it out yet? I am looking into how I can use Donut with different languages too.
@balabis It looks like we need to retrain the tokenizer (https://huggingface.co/course/chapter6/2) and then train (m)BART from scratch on the corpora of the target language(s), using e.g.:
Some of the links are for BART, so just for one language. I guess the mBART tokenizer must then be used to prepare training and inference examples with language tokens and in the appropriate format. Another very interesting thing is this quote:
https://github.com/hyunwoongko/asian-bart Maybe the asian-bart that is used as the decoder in Donut isn't trained from scratch, but is a fine-tuned mBART-25 or mBART-50 model with a reduced vocab size? I think the first thing I will try is …
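Regarding the language-token format mentioned above, here is a small sketch with the mBART-50 tokenizer; the language code and sample text are illustrative, and whether Donut's decoder expects exactly this format is not something I've confirmed:

```python
from transformers import MBart50TokenizerFast

tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50")

# mBART-50 marks the language with a special code token at the start of the
# source sequence, so the model knows which language it is reading.
tokenizer.src_lang = "pl_PL"
batch = tokenizer("Przykładowy paragon ze sklepu.", return_tensors="pt")
print(tokenizer.convert_ids_to_tokens(batch["input_ids"][0]))
# Roughly: ['pl_PL', '▁Przy', ..., '</s>'] -- exact subwords depend on the vocab.
```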
We've released nanoT5, which reproduces T5 (a model similar to BART) pre-training in PyTorch (not Flax). You can take a look! Any suggestions are more than welcome.
Hello @gwkrsrch,
I am very excited about this model and the e2e approach it implements.
For my master's thesis, I'd like to run an experiment comparing your method of generating synthetic documents with mine. I am only interested in evaluating the model on the Document Information Extraction downstream task with the CORD dataset and my proprietary one (let's call it PolCORD).
I'd like to train the Donut model on the (Pseudo) Text Reading Task with the following data (a small loading sketch follows the list):
1/ naver-clova-ix/synthdog-en; synthdog-id; synthdog-pl (total 1.5M examples)
2/ my-method-en, my-method-id, my-method-pl (total 1.2M examples)
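In case it helps others reading along, a minimal sketch for inspecting the released English set; the -id/-pl sets above are generated locally and are not assumed to be on the Hub, and the "ground_truth" field name reflects how the released dataset appears to be structured:

```python
from datasets import load_dataset

# The English set is released on the Hugging Face Hub; each example carries an
# image plus a "ground_truth" JSON string used for the text-reading objective.
synthdog_en = load_dataset("naver-clova-ix/synthdog-en", split="train")
print(synthdog_en[0]["ground_truth"])
```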
Could you give me a hand and share your experience: