Describe the bug
When using the BART model with the wmt16-en-de dataset, I get a src/tgt length mismatch, even though I checked the dataset and found that the source and target files have the same number of lines.
How to reproduce
(cmd: run_textbox.py --model=BART --model_path=facebook/bart-base --dataset=wmt16-en-de --src_lang=en_XX --tgt_lang=de_DE)
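
Before digging into the loader, it may help to verify the raw parallel files directly. The sketch below counts total and blank lines per split; the directory matches the data_path from the log below, but the train/valid/test .src/.tgt file names are an assumption about the on-disk layout, so adjust them if your copy differs.

```python
# Minimal check of the raw parallel files: count total and blank lines per split.
# The .src/.tgt file names are an assumption about the on-disk layout;
# the directory comes from data_path in the log.
from pathlib import Path

data_dir = Path("dataset/wmt16-en-de")

for split in ("train", "valid", "test"):
    src_file = data_dir / f"{split}.src"
    tgt_file = data_dir / f"{split}.tgt"
    if not (src_file.exists() and tgt_file.exists()):
        print(f"{split}: file(s) missing, skipped")
        continue
    src_lines = src_file.read_text(encoding="utf-8").splitlines()
    tgt_lines = tgt_file.read_text(encoding="utf-8").splitlines()
    # Blank lines are a common source of mismatches that `wc -l` alone won't reveal.
    src_blank = sum(1 for line in src_lines if not line.strip())
    tgt_blank = sum(1 for line in tgt_lines if not line.strip())
    print(f"{split}: src={len(src_lines)} ({src_blank} blank), "
          f"tgt={len(tgt_lines)} ({tgt_blank} blank)")
```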
Log
General Hyper Parameters:
gpu_id: 0
use_gpu: True
device: cuda
seed: 2020
reproducibility: True
cmd: run_textbox.py --model=BART --model_path=facebook/bart-base --dataset=wmt16-en-de --src_lang=en_XX --tgt_lang=de_DE
filename: BART-wmt16-en-de-2023-May-10_01-27-13
saved_dir: saved/
state: INFO
wandb: offline
Training Hyper Parameters:
do_train: True
do_valid: True
optimizer: adamw
adafactor_kwargs: {'lr': 0.001, 'scale_parameter': False, 'relative_step': False, 'warmup_init': False}
optimizer_kwargs: {}
valid_steps: 1
valid_strategy: epoch
stopping_steps: 2
epochs: 50
learning_rate: 3e-05
train_batch_size: 4
grad_clip: 0.1
accumulation_steps: 48
disable_tqdm: False
resume_training: True
Evaluation Hyper Parameters:
do_test: True
lower_evaluation: True
multiref_strategy: max
bleu_max_ngrams: 4
bleu_type: sacrebleu
smoothing_function: 0
corpus_bleu: False
rouge_max_ngrams: 2
rouge_type: files2rouge
meteor_type: pycocoevalcap
chrf_type: m-popovic
distinct_max_ngrams: 4
inter_distinct: True
unique_max_ngrams: 4
self_bleu_max_ngrams: 4
tgt_lang: de_DE
metrics: ['bleu']
eval_batch_size: 8
corpus_meteor: True
Model Hyper Parameters:
model: BART
model_name: bart
model_path: facebook/bart-base
config_kwargs: {}
tokenizer_kwargs: {'src_lang': 'en_XX', 'tgt_lang': 'de_DE'}
generation_kwargs: {'num_beams': 5, 'no_repeat_ngram_size': 3, 'early_stopping': True}
efficient_kwargs: {}
efficient_methods: []
efficient_unfreeze_model: False
label_smoothing: 0.1
Dataset Hyper Parameters:
dataset: wmt16-en-de
data_path: dataset/wmt16-en-de
src_lang: en_XX
tgt_lang: de_DE
src_len: 1024
tgt_len: 1024
truncate: tail
prefix_prompt: translate English to Germany:
metrics_for_best_model: ['bleu']
Unrecognized Hyper Parameters:
tokenizer_add_tokens: []
load_type: from_pretrained
find_unused_parameters: False
================================================================================
10 May 01:27 INFO Pretrain type: pretrain disabled
Traceback (most recent call last):
File "run_textbox.py", line 12, in
run_textbox(model=args.model, dataset=args.dataset, config_file_list=args.config_files, config_dict={})
File "/hy-tmp/TextBox/textbox/quick_start/quick_start.py", line 20, in run_textbox
experiment = Experiment(model, dataset, config_file_list, config_dict)
File "/hy-tmp/TextBox/textbox/quick_start/experiment.py", line 56, in init
self._init_data(self.get_config(), self.accelerator)
File "/hy-tmp/TextBox/textbox/quick_start/experiment.py", line 82, in _init_data
train_data, valid_data, test_data = data_preparation(config, tokenizer)
File "/hy-tmp/TextBox/textbox/data/utils.py", line 23, in data_preparation
train_dataset = AbstractDataset(config, 'train')
File "/hy-tmp/TextBox/textbox/data/abstract_dataset.py", line 36, in init
assert len(self.source_text) == len(self.target_text)
AssertionError
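
The assertion at textbox/data/abstract_dataset.py line 36 fires because the number of source examples read for the train split differs from the number of target examples. The snippet below is only a guess at how such a mismatch can appear even when the raw files have equal line counts (for example, if empty lines are dropped on one side during loading); it is not TextBox's actual loading code, and the file names are assumptions.

```python
# Standalone sketch, not TextBox's loader: shows how dropping empty lines
# (an assumption about the loading step) can break the length check even
# when `wc -l` reports equal counts. File names below are assumptions.
def load(path, drop_empty=True):
    with open(path, encoding="utf-8") as f:
        lines = [line.rstrip("\n") for line in f]
    return [ln for ln in lines if ln.strip()] if drop_empty else lines

src = load("dataset/wmt16-en-de/train.src")
tgt = load("dataset/wmt16-en-de/train.tgt")
print(f"after filtering: src={len(src)}, tgt={len(tgt)}")
# Same condition that fails in abstract_dataset.py line 36:
assert len(src) == len(tgt)
```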
Please refer to this approach: #346 (comment)