"OOM during optimization" when fine-tuning NLLB #4930
Comments
What were your hyperparameter settings?
@FayZ676 I used default parameters from
All hyperparameters:
For reference, I tried fine-tuning GPT-NeoX-20B on my setup (4x 3090s) and was told by the devs that I needed at least 13 bytes of memory per parameter. The largest model I could successfully fine-tune was the 2B-parameter one. It looks like you're using the config for a 3.3B-parameter model on a single 3090, so you may simply not have enough memory to fine-tune models larger than 600M. I don't know for sure, so if someone can confirm the memory requirements for fairseq, that would be great.
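The back-of-the-envelope math above can be sketched as follows. This is only an illustration, not a confirmed figure for fairseq: the 16 bytes/parameter used here assumes full-precision Adam (fp32 weights, gradients, and two optimizer moments); the 13 bytes/parameter quoted by the GPT-NeoX devs presumably reflects a different (e.g. mixed-precision) setup.

```python
def training_footprint_gb(n_params, bytes_per_param=16):
    """Rough GPU memory needed just for weights + gradients + Adam state.

    16 bytes/param assumes fp32 Adam: 4 (weights) + 4 (gradients)
    + 4 (exp_avg) + 4 (exp_avg_sq). Activations and framework
    overhead come on top of this, so real usage is higher.
    """
    return n_params * bytes_per_param / 1e9

# Under this assumption, a 3.3B model needs ~52.8 GB for optimizer
# state alone, far beyond a single 24 GB RTX 3090, while a 600M
# model needs ~9.6 GB and plausibly fits.
print(f"3.3B model: {training_footprint_gb(3.3e9):.1f} GB")
print(f"600M model: {training_footprint_gb(600e6):.1f} GB")
```

By this estimate the 600M checkpoint is the largest dense NLLB model one could expect to fit on a single 24 GB card without memory-saving tricks (optimizer state sharding, CPU offload, etc.), which matches what the original poster observed.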
@zgerrard Hi, do you perhaps have a step-by-step tutorial on how to fine-tune the 600M model? It would be really helpful for me. Could you share your fine-tuning project via a git repository?
@edvardasast Did you find any git repository for fine-tuning?
unfortunately not :(
@edvardasast Would you please share your full steps for fine-tuning NLLB? Thanks!
I am getting the same error. It seems that preprocessing used the vocabulary built from my own data instead of the vocabulary of the pretrained NLLB model, which gives the model a different number of parameters.
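The vocabulary mismatch described above changes the model's shape: the embedding (and output projection) tables have one row per vocabulary entry, so a dictionary built from your own data produces a different parameter count than the pretrained checkpoint expects. A minimal sketch of the arithmetic, using assumed sizes (the NLLB-200 dictionary size and the 3.3B model's embedding dimension here are illustrative, not verified from this thread):

```python
# Assumption: the pretrained NLLB dictionary has roughly 256k entries,
# while a dictionary built from a small parallel corpus might have ~32k.
NLLB_VOCAB = 256_206   # assumed NLLB-200 dictionary size (illustrative)
MY_VOCAB = 32_000      # example vocab size built from one's own data
EMBED_DIM = 2048       # assumed embedding dim of the dense 3.3B model

def embedding_params(vocab_size, dim):
    """Parameters in one embedding table of shape (vocab_size, dim)."""
    return vocab_size * dim

# The two tables differ by hundreds of millions of parameters, so
# loading the pretrained checkpoint into the mismatched model fails.
print(embedding_params(NLLB_VOCAB, EMBED_DIM))
print(embedding_params(MY_VOCAB, EMBED_DIM))
```

In practice the usual fix is to reuse the pretrained model's dictionary when binarizing your data (fairseq-preprocess accepts `--srcdict`/`--tgtdict` for this) rather than letting preprocessing build a new vocabulary from your corpus.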
Where is the code for fine-tuning the NLLB model? Thanks!
❓ Questions and Help
What is your question?
Hi, I am getting "OOM during optimization, irrecoverable" when trying to fine-tune the 3.3B parameter NLLB model.
Stack trace:
Any ideas? Any help will be greatly appreciated.
What have you tried?
I tried fine-tuning smaller models; only the 600M-parameter (smallest) model didn't trigger the error above.
What's your environment?