examples/flax/language-modeling/README.md
@@ -338,6 +338,98 @@ of 2.36 and 57.0 respectively after 3 epochs on a single TPUv3-8.
This should take around 4.5 hours.
Training statistics can be accessed directly on the 🤗 [hub](https://huggingface.co/patrickvonplaten/t5-base-norwegian/tensorboard).
## BART: Denoising language modeling

In the following, we demonstrate how to train a BART model using the denoising language modeling objective introduced in [BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension](https://arxiv.org/abs/1910.13461). More specifically, we demonstrate how JAX/Flax can be leveraged to pre-train [**`bart-base`**](https://huggingface.co/facebook/bart-base) in Norwegian on a single TPUv3-8 pod.

The example script uses the 🤗 Datasets library. You can easily customize it to your needs if your datasets require extra processing.

To set up all relevant files for training, let's create a directory.
```bash
mkdir ./norwegian-roberta-base
```
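For intuition about the denoising objective used in this section, here is a minimal sketch of text infilling, one of the noising transformations described in the BART paper: random spans of tokens are each replaced by a single mask token, and the model is trained to reconstruct the original sequence. The function `text_infilling`, the helper `sample_poisson`, the `<mask>` string, and the defaults (30% of tokens masked, span lengths drawn from a Poisson distribution with λ = 3, as reported in the paper) are illustrative assumptions. This is not the example script's data collator, and it skips zero-length spans and sentence permutation for brevity.

```python
import math
import random

MASK = "<mask>"  # assumed mask string; a real tokenizer defines its own


def sample_poisson(rng, lam):
    """Draw one Poisson(lam) sample using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def text_infilling(tokens, mask_ratio=0.3, lam=3.0, seed=0):
    """Return (corrupted, target): spans in `tokens` collapsed to one MASK each."""
    rng = random.Random(seed)
    budget = int(round(len(tokens) * mask_ratio))  # tokens still to hide
    spans = {}                                     # span start -> span length
    hidden = [False] * len(tokens)
    attempts = 0
    while budget > 0 and attempts < 100 * len(tokens):
        attempts += 1
        # Simplification: zero-length spans (pure mask insertion) are skipped.
        length = min(max(sample_poisson(rng, lam), 1), budget)
        start = rng.randrange(len(tokens) - length + 1)
        if any(hidden[start:start + length]):
            continue  # overlaps an earlier span, resample
        for i in range(start, start + length):
            hidden[i] = True
        spans[start] = length
        budget -= length

    corrupted, i = [], 0
    while i < len(tokens):
        if i in spans:
            corrupted.append(MASK)  # the whole span becomes one mask token
            i += spans[i]
        else:
            corrupted.append(tokens[i])
            i += 1
    return corrupted, list(tokens)


tokens = "dette er en kort setning på norsk som modellen skal rekonstruere".split()
corrupted, target = text_infilling(tokens)
print(" ".join(corrupted))  # encoder input with masked spans
print(" ".join(target))     # sequence the decoder should reproduce
```

The corrupted sequence plays the role of the encoder input and the untouched sequence is the reconstruction target. Because each span collapses to one mask token, the model also has to infer how many tokens are missing.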
### Train tokenizer

In the first step, we train a tokenizer to efficiently process the text input for the model. Similar to what is shown in [How to train a new language model from scratch using Transformers and Tokenizers](https://huggingface.co/blog/how-to-train), we use a **`ByteLevelBPETokenizer`**. The tokenizer is trained on the complete Norwegian dataset of OSCAR and subsequently saved in the model directory created above. This can take up to 10 minutes depending on your hardware ☕.
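As an aside, the core idea behind byte-level BPE training can be sketched in a few lines of plain Python: every word starts as a sequence of UTF-8 bytes, and the most frequent adjacent pair is merged repeatedly. The function `train_byte_bpe` and the toy corpus are hypothetical and only meant for illustration. The real `ByteLevelBPETokenizer` handles whitespace, special tokens, and the byte-to-character mapping differently and is far faster.

```python
from collections import Counter


def train_byte_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules over UTF-8 bytes (toy version)."""
    # Each whitespace-separated word starts as a tuple of single-byte symbols.
    words = Counter(
        tuple(bytes([b]) for b in word.encode("utf-8"))
        for line in corpus
        for word in line.split()
    )
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in words.items():
            for left, right in zip(symbols, symbols[1:]):
                pairs[(left, right)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Most frequent pair wins; ties are broken by the pair itself.
        best = max(pairs, key=lambda pair: (pairs[pair], pair))
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        merged_words = Counter()
        for symbols, freq in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged_words[tuple(out)] += freq
        words = merged_words
    return merges


for left, right in train_byte_bpe(["ha ha ha", "hei hei", "på på"], num_merges=5):
    print(left, "+", right)
```

Working on bytes rather than characters means that a letter such as `å`, which takes two bytes in UTF-8, is learned as a merge like any other frequent pair, so no input ever falls outside the vocabulary. The actual training script for the Norwegian OSCAR data follows.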
```python
from datasets import load_dataset
from tokenizers import trainers, Tokenizer, normalizers, ByteLevelBPETokenizer