How to pre-train BART model #4151
Comments
We still need to provide a good docstring/notebook for this. It's on our to-do list. :-) Or @sshleifer, is there already something for BART?
Nothing yet, it would be good to add!
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
I have seen the same issue in fairseq BART!
Hi, any news about BART pre-training?
Can anyone tell me how to pre-train BART on my own dataset? I am quite confused.
Maybe this comment can help: #5096 (comment)
Any news on this, please?
Not so far; it would be great to have it. Thanks.
My co-worker and I wrote a demo based on the RoBERTa pretraining demo.
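For readers landing here, below is a minimal, illustrative sketch of such a setup (it is not the demo referenced above). It trains a freshly initialised BartForConditionalGeneration with the Trainer on a simple denoising objective: the encoder receives a corrupted copy of each line and the labels are the original token ids. The file name `train.txt`, the hyperparameters, and the simplified random-token masking (rather than BART's full span infilling and sentence permutation) are assumptions for illustration only.

```python
import random

from torch.utils.data import Dataset
from transformers import (
    BartConfig,
    BartForConditionalGeneration,
    BartTokenizerFast,
    DataCollatorForSeq2Seq,
    Trainer,
    TrainingArguments,
)

tokenizer = BartTokenizerFast.from_pretrained("facebook/bart-base")  # reuse the BART vocabulary
# Fresh, randomly initialised weights. BartConfig defaults to a bart-large-sized
# architecture; pass d_model, encoder_layers, etc. to make it smaller.
config = BartConfig(vocab_size=len(tokenizer))
model = BartForConditionalGeneration(config)


class DenoisingDataset(Dataset):
    """One document per line in a text file (hypothetical `train.txt`); yields noised/clean pairs."""

    def __init__(self, path, max_length=512, mask_prob=0.3):
        with open(path, encoding="utf-8") as f:
            self.lines = [line.strip() for line in f if line.strip()]
        self.max_length = max_length
        self.mask_prob = mask_prob

    def __len__(self):
        return len(self.lines)

    def __getitem__(self, idx):
        ids = tokenizer(self.lines[idx], truncation=True, max_length=self.max_length)["input_ids"]
        corrupted = [
            tokenizer.mask_token_id
            if (random.random() < self.mask_prob and tok not in tokenizer.all_special_ids)
            else tok
            for tok in ids
        ]
        # Encoder sees the corrupted text; the decoder is trained to reconstruct the original.
        return {"input_ids": corrupted, "labels": ids}


args = TrainingArguments(
    output_dir="bart-pretraining",        # assumed output directory
    per_device_train_batch_size=8,        # assumed hyperparameters, not from the thread
    num_train_epochs=1,
    learning_rate=3e-4,
    logging_steps=100,
    save_steps=1000,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=DenoisingDataset("train.txt"),
    # Pads input_ids, and pads labels with -100 so padding is ignored by the loss.
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```

Note that this token-level masking only approximates the paper's objective; BART's reported setup uses text infilling (whole spans collapsed to a single `<mask>`) plus sentence permutation, sketched further down the thread.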
Thanks for the code example. I am also planning on implementing pre-training from scratch, and I have several questions about the code.
For the first question, it is just like this. For the other question, I trained it with 12 GB of GPU memory, but it may also work with smaller GPU memory, and you can adjust the parameters to fit your server environment.
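As an illustrative aside (these numbers are assumptions, not the settings used in the demo above): with the Trainer, the usual knobs for fitting a run into less GPU memory are a smaller per-device batch size combined with gradient accumulation, plus mixed precision.

```python
from transformers import TrainingArguments

# Hypothetical settings for a smaller GPU: a small per-device batch size,
# gradient accumulation to keep the effective batch size at 4 * 8 = 32,
# and fp16 mixed precision to reduce activation memory.
args = TrainingArguments(
    output_dir="bart-pretraining",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    fp16=True,
    learning_rate=3e-4,
)
```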
@myechona Thanks for your code. I have a question about it: there are tasks like text infilling and sentence permutation during the pre-training stage, and I want to know whether "input_ids" holds the masked sentence and "labels" holds the original sentence.
If anyone wants to train their own mBART model, feel free to use this. Contributions are welcome!
Thanks for your code, it really helps.
I'm most interested in sentence infilling, which this script doesn't really seem to address (though my understanding was that BART training generally involves masking and permutation). Is there an additional step I need to add for the infilling functionality?
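For reference, here is a hedged sketch (not code from this thread or from fairseq) of the two noising functions the BART paper combines: text infilling, where spans with lengths drawn from Poisson(λ=3) are each replaced by a single `<mask>` token until roughly 30% of tokens are masked, and sentence permutation. It also shows the pairing asked about a few comments up: `input_ids` holds the corrupted text, `labels` the original. The example sentence and the simplifications (spans may overlap, 0-length spans are skipped) are assumptions.

```python
import random
import re

import numpy as np
from transformers import BartTokenizerFast

tokenizer = BartTokenizerFast.from_pretrained("facebook/bart-base")


def permute_sentences(text: str) -> str:
    """Sentence permutation: split on sentence-final punctuation and shuffle."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    random.shuffle(sentences)
    return " ".join(sentences)


def text_infilling(token_ids, mask_token_id, mask_ratio=0.3, poisson_lambda=3.0):
    """Text infilling: collapse spans (lengths ~ Poisson(lambda)) into a single <mask>
    each, until roughly `mask_ratio` of the tokens have been removed. Simplified: spans
    may overlap earlier masks, and 0-length spans are not inserted."""
    ids = list(token_ids)
    budget = int(round(len(ids) * mask_ratio))
    while budget > 0 and len(ids) > 1:
        span = int(max(1, min(np.random.poisson(poisson_lambda), budget, len(ids) - 1)))
        start = random.randrange(0, len(ids) - span + 1)
        ids[start:start + span] = [mask_token_id]  # whole span collapses to one <mask>
        budget -= span
    return ids


original = "BART is a denoising autoencoder. It is trained to map corrupted text back to the original text."

# Special tokens are added afterwards so they are never masked or shuffled.
clean_ids = tokenizer(original)["input_ids"]
noisy_body = text_infilling(
    tokenizer(permute_sentences(original), add_special_tokens=False)["input_ids"],
    tokenizer.mask_token_id,
)
noisy_ids = [tokenizer.bos_token_id] + noisy_body + [tokenizer.eos_token_id]

example = {"input_ids": noisy_ids, "labels": clean_ids}  # corrupted in, original out
```

In the Trainer sketch earlier in the thread, these two functions could replace the simple random-token masking inside `DenoisingDataset.__getitem__`.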
Hi, any update on this? @vanpelt
I actually decided to jump over to T5 and use the …
Great, I was initially looking at those scripts to get some ideas for the pre-training script, but I had since been hoping the Hugging Face folks might have come up with a resource for this. Apparently, it's still underway! :)
We've released nanoT5, which reproduces T5 (a model similar to BART) pre-training in PyTorch (not Flax). You can take a look! Any suggestions are more than welcome.
How can I pre-train a BART model in an unsupervised manner? Any example?