Tips for training the base model from scratch on a smaller amount of data #11
Hi @VictorAtPL. For (1), as explained in Section 2.3 and Appendix A.2 (https://arxiv.org/abs/2111.15664), we sampled words and phrases from Wikipedia. The following links would be helpful to you. To process the dump files, you may consider using WikiExtractor or other relevant tools/scripts. For (2) and (3), you may need to control … I hope this is useful to you :)
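For anyone looking at the dump-processing step, here is a minimal sketch assuming the `wikiextractor` package and its `--json` JSON-lines output; the dump file name, output directory, sample count, and phrase length are all placeholders, not values from the paper:

```python
# Extract plain text from a Wikipedia dump first, e.g.:
#   python -m wikiextractor.WikiExtractor plwiki-latest-pages-articles.xml.bz2 --json -o extracted/
import json
import random
from pathlib import Path

def sample_phrases(extracted_dir, n_samples=100_000, max_words=5):
    """Sample short word spans from WikiExtractor's JSON-lines output."""
    words = []
    for path in Path(extracted_dir).rglob("wiki_*"):
        with open(path, encoding="utf-8") as f:
            for line in f:
                article = json.loads(line)
                words.extend(article["text"].split())
    phrases = []
    for _ in range(n_samples):
        start = random.randrange(len(words))
        length = random.randint(1, max_words)
        phrases.append(" ".join(words[start:start + length]))
    return phrases
```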
One more general tip: to train a model for a new language, you may need to change some code regarding the token vocabulary/tokenizer. For example, see this block. This depends on the alphabet of the target language.
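To illustrate the kind of change meant here, a rough sketch using the plain Hugging Face `transformers` API rather than the exact block in this repo; the `facebook/mbart-large-cc25` checkpoint is only a stand-in (Donut's actual decoder is an Asian-BART variant), and the character list is just an example:

```python
from transformers import MBartForCausalLM, MBartTokenizer

# Stand-in checkpoint for illustration; swap in the checkpoint/tokenizer this repo uses.
tokenizer = MBartTokenizer.from_pretrained("facebook/mbart-large-cc25")
model = MBartForCausalLM.from_pretrained("facebook/mbart-large-cc25")

# Characters of the target language that may be missing from the vocabulary
# (Polish diacritics here, purely as an example).
new_tokens = ["ą", "ć", "ę", "ł", "ń", "ó", "ś", "ź", "ż"]
num_added = tokenizer.add_tokens([t for t in new_tokens if t not in tokenizer.get_vocab()])

# Grow the embedding matrix to match the enlarged vocabulary.
if num_added > 0:
    model.resize_token_embeddings(len(tokenizer))
```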
Hello! Is it a problem that Asian mBART was not trained on the letters I want to add during the "text only" phase (with the text encoder)? If I need to add tokens, is it enough just to do something like this?
+1. If anyone knows how to train a model and tokenizer like Asian mBART for other languages, in a way that can easily replace the current Donut decoder, please share this knowledge with us.
@VictorAtPL Did you try to add tokens like …?
@Vadkoz Haven't tried yet. I think the proper approach is to use the Wikipedia corpus of the languages you care about most and retrain the whole decoder on this corpus. I'm not sure what kind of tokens I should add: just letters, sub-words, or the most common words? I'd rather leave it up to the tokenizer to determine how tokens should be derived from, e.g., a Polish Wikipedia corpus.
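A rough sketch of that "leave it to the tokenizer" idea, assuming a fast (tokenizers-backed) base tokenizer and a preprocessed Polish Wikipedia dataset on the Hub; the dataset name, base checkpoint, and vocab size are placeholders, and I haven't verified that the mBART sentencepiece tokenizer retrains cleanly this way:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Placeholder corpus: any plain-text dataset in the target language works here.
corpus = load_dataset("wikimedia/wikipedia", "20231101.pl", split="train")

def batch_iterator(batch_size=1_000):
    for i in range(0, len(corpus), batch_size):
        yield corpus[i : i + batch_size]["text"]

# train_new_from_iterator requires a "fast" (tokenizers-backed) tokenizer.
old_tokenizer = AutoTokenizer.from_pretrained("facebook/mbart-large-cc25")
new_tokenizer = old_tokenizer.train_new_from_iterator(batch_iterator(), vocab_size=32_000)
new_tokenizer.save_pretrained("mbart-tokenizer-pl")
```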
Hey @Vadkoz, have you figured it out yet? I am looking into how I can use Donut with different languages too.
@balabis It looks like we need to retrain the tokenizer (https://huggingface.co/course/chapter6/2) and then train (m)BART from scratch on the corpora of the target language(s), using e.g.:
Some of the links are for BART, so just for one language. I guess the mBART tokenizer must then be used to prepare training and inference examples with language tokens and in the appropriate format. Another very interesting thing is this quote:
https://github.com/hyunwoongko/asian-bart Maybe the asian-bart that is used as the decoder in Donut isn't trained from scratch, but is a fine-tuned mBART-25 or mBART-50 model with a reduced vocab size? I think the first thing I will try is …
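Regarding the language-token format mentioned above, here is a small sketch with the mBART-50 tokenizer; the language code and sample text are illustrative, and whether Donut's decoder expects exactly this format is not something I've confirmed:

```python
from transformers import MBart50TokenizerFast

tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50")

# mBART-50 marks the language with a special code token at the start of the
# source sequence, so the model knows which language it is reading.
tokenizer.src_lang = "pl_PL"
batch = tokenizer("Przykładowy paragon ze sklepu.", return_tensors="pt")
print(tokenizer.convert_ids_to_tokens(batch["input_ids"][0]))
# Roughly: ['pl_PL', '▁Przy', ..., '</s>'] -- exact subwords depend on the vocab.
```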
We've released nanoT5, which reproduces T5 (a model similar to BART) pre-training in PyTorch (not Flax). You can take a look! Any suggestions are more than welcome.
Hello @gwkrsrch,
I am very excited about this model and the e2e approach it implements.
For my master's thesis, I'd like to run an experiment comparing your method of generating synthetic documents with mine. I am only interested in evaluating the model on the Document Information Extraction downstream task with the CORD dataset and my proprietary one (let's call it PolCORD).
I'd like to train the Donut model on the (Pseudo) Text Reading Task with the following data (a small loading sketch follows the list):
1/ naver-clova-ix/synthdog-en; synthdog-id; synthdog-pl (total 1.5M examples)
2/ my-method-en, my-method-id, my-method-pl (total 1.2M examples)
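In case it helps others reading along, a minimal sketch for inspecting the released English set; the -id/-pl sets above are generated locally and are not assumed to be on the Hub, and the "ground_truth" field name reflects how the released dataset appears to be structured:

```python
from datasets import load_dataset

# The English set is released on the Hugging Face Hub; each example carries an
# image plus a "ground_truth" JSON string used for the text-reading objective.
synthdog_en = load_dataset("naver-clova-ix/synthdog-en", split="train")
print(synthdog_en[0]["ground_truth"])
```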
Could you give me a hand and share your experience: