Multi-GPU training issue #1798
Comments
How large is your GPU RAM?
24 GB each, on 2 GPUs.
Can you reproduce it?
2024-11-07 13:12:14,793 INFO [train.py:1120] (1/2) Device: cuda:1
Are you able to reproduce it with librispeech?
Yes. This is with librispeech only.
Then why is the data manifest dir data/8k/fbank in your log? Could you tell us what changes you have made?
I am using different data, but the codebase is the same as the librispeech recipe. I made no changes, especially in training.
What is the duration distribution of your data? Are you able to reproduce it with the librispeech dataset?
It is a small experimental dataset for testing codebases under librispeech. The training runs fine on a single GPU.
You might want to increase bucket size and buffer size in the
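To answer the duration question for a small dataset, a rough summary like the one below may help. This is only a sketch in plain Python: `summarize_durations` is a hypothetical helper, not part of icefall or lhotse, and it assumes you have already collected the cut durations (in seconds) into a list. The defaults for `max_duration` and `world_size` are taken from the values in the log posted in this issue. The batch estimate ignores bucketing and `drop_last`, so treat it as an order-of-magnitude figure.

```python
import statistics


def summarize_durations(durations, max_duration=200.0, world_size=2):
    """Summarize cut durations (seconds) and roughly estimate batches per rank.

    Hypothetical helper for illustration; not an icefall or lhotse API.
    """
    if not durations:
        raise ValueError("durations must not be empty")
    ordered = sorted(durations)
    total = sum(ordered)
    # Rough estimate only: assumes every batch is filled up to max_duration
    # seconds and batches are shared evenly across world_size ranks.
    approx_batches_per_rank = int(total // (max_duration * world_size))
    return {
        "count": len(ordered),
        "min": ordered[0],
        "max": ordered[-1],
        "mean": statistics.fmean(ordered),
        "median": statistics.median(ordered),
        "total_seconds": total,
        "approx_batches_per_rank": approx_batches_per_rank,
    }


if __name__ == "__main__":
    # Made-up durations, purely for demonstration.
    print(summarize_durations([3.2, 4.8, 5.1, 7.9, 12.4, 15.0]))
```

If the estimated number of batches per rank comes out very small, that is worth mentioning when reporting the duration distribution.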
2024-11-05 12:55:26,724 INFO [train.py:1231] (0/2) Training will start from epoch : 1
2024-11-05 12:55:26,725 INFO [train.py:1243] (0/2) Training started
2024-11-05 12:55:26,726 INFO [train.py:1253] (0/2) Device: cuda:0
2024-11-05 12:55:26,728 INFO [train.py:1265] (0/2) {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.4', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'ff1d435a8d3c4eaa15828a84a7240678a70539a7', 'k2-git-date': 'Fri Feb 23 01:48:38 2024', 'lhotse-version': '1.24.0.dev+git.866e4a80.clean', 'torch-version': '1.13.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.9', 'icefall-git-branch': 'HEAD', 'icefall-git-sha1': '144163c-clean', 'icefall-git-date': 'Fri Oct 18 14:09:24 2024', 'icefall-path': '/builds/mihup/asr/zipformer/icefall', 'k2-path': '/usr/local/lib/python3.9/dist-packages/k2/init.py', 'lhotse-path': '/workspace/lhotse/lhotse/init.py', 'hostname': 'runner-t2iavcpo-project-47789012-concurrent-0', 'IP address': '172.17.0.3'}, 'world_size': 2, 'master_port': 12354, 'tensorboard': True, 'num_epochs': 40, 'start_epoch': 1, 'start_batch': 0, 'exp_dir': PosixPath('zipformer/exp-Hindi/2024-11-05T10:55:25Z'), 'bpe_model': 'data/2024-11-05T10:55:25Z/lang_bpe_500/bpe.model', 'base_lr': 0.04, 'lr_batches': 7500, 'lr_epochs': 3.5, 'ref_duration': 600, 'context_size': 2, 'prune_range': 5, 'lm_scale': 0.25, 'am_scale': 0.0, 'simple_loss_scale': 0.5, 'ctc_loss_scale': 0.2, 'seed': 42, 'print_diagnostics': False, 'inf_check': False, 'save_every_n': 4000, 'keep_last_k': 30, 'average_period': 200, 'use_fp16': True, 'num_encoder_layers': '2,2,2,2,2,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,768,768,768,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,256,256,256,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,192,192,192,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 
512, 'joiner_dim': 512, 'causal': True, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'use_transducer': True, 'use_ctc': False, 'aws_access_key_id': None, 'aws_secret_access_key': None, 'finetune': None, 'av': 9, 'full_libri': True, 'mini_libri': False, 'manifest_dir': 'data/2024-11-05T10:55:25Z/fbank', 'max_duration': 200.0, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': False, 'input_strategy': 'PrecomputedFeatures', 'blank_id': 0, 'vocab_size': 500}
2024-11-05 12:55:26,728 INFO [train.py:1267] (0/2) About to create model
2024-11-05 12:55:26,733 INFO [train.py:1231] (1/2) Training will start from epoch : 1
2024-11-05 12:55:26,734 INFO [train.py:1243] (1/2) Training started
2024-11-05 12:55:26,734 INFO [train.py:1253] (1/2) Device: cuda:1
2024-11-05 12:55:26,736 INFO [train.py:1265] (1/2) {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.4', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'ff1d435a8d3c4eaa15828a84a7240678a70539a7', 'k2-git-date': 'Fri Feb 23 01:48:38 2024', 'lhotse-version': '1.24.0.dev+git.866e4a80.clean', 'torch-version': '1.13.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.9', 'icefall-git-branch': 'HEAD', 'icefall-git-sha1': '144163c-clean', 'icefall-git-date': 'Fri Oct 18 14:09:24 2024', 'icefall-path': '/builds/mihup/asr/zipformer/icefall', 'k2-path': '/usr/local/lib/python3.9/dist-packages/k2/init.py', 'lhotse-path': '/workspace/lhotse/lhotse/init.py', 'hostname': 'runner-t2iavcpo-project-47789012-concurrent-0', 'IP address': '172.17.0.3'}, 'world_size': 2, 'master_port': 12354, 'tensorboard': True, 'num_epochs': 40, 'start_epoch': 1, 'start_batch': 0, 'exp_dir': PosixPath('zipformer/exp-Hindi/2024-11-05T10:55:25Z'), 'bpe_model': 'data/2024-11-05T10:55:25Z/lang_bpe_500/bpe.model', 'base_lr': 0.04, 'lr_batches': 7500, 'lr_epochs': 3.5, 'ref_duration': 600, 'context_size': 2, 'prune_range': 5, 'lm_scale': 0.25, 'am_scale': 0.0, 'simple_loss_scale': 0.5, 'ctc_loss_scale': 0.2, 'seed': 42, 'print_diagnostics': False, 'inf_check': False, 'save_every_n': 4000, 'keep_last_k': 30, 'average_period': 200, 'use_fp16': True, 'num_encoder_layers': '2,2,2,2,2,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,768,768,768,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,256,256,256,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,192,192,192,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 
512, 'joiner_dim': 512, 'causal': True, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'use_transducer': True, 'use_ctc': False, 'aws_access_key_id': None, 'aws_secret_access_key': None, 'finetune': None, 'av': 9, 'full_libri': True, 'mini_libri': False, 'manifest_dir': 'data/2024-11-05T10:55:25Z/fbank', 'max_duration': 200.0, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': False, 'input_strategy': 'PrecomputedFeatures', 'blank_id': 0, 'vocab_size': 500}
2024-11-05 12:55:26,736 INFO [train.py:1267] (1/2) About to create model
2024-11-05 12:55:26,998 INFO [train.py:1271] (0/2) Number of model parameters: 23627887
2024-11-05 12:55:27,047 INFO [train.py:1271] (1/2) Number of model parameters: 23627887
2024-11-05 12:55:27,934 INFO [train.py:1286] (0/2) Using DDP
2024-11-05 12:55:27,986 INFO [train.py:1286] (1/2) Using DDP
Traceback (most recent call last):
  File "/builds/mihup/asr/zipformer/icefall/egs/librispeech/ASR/./zipformer/train.py", line 1530, in <module>
    main()
  File "/builds/mihup/asr/zipformer/icefall/egs/librispeech/ASR/./zipformer/train.py", line 1521, in main
    mp.spawn(run, args=(world_size, args), nprocs=world_size, join=True)
  File "/usr/local/lib/python3.9/dist-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/usr/local/lib/python3.9/dist-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/usr/local/lib/python3.9/dist-packages/torch/multiprocessing/spawn.py", line 140, in join
    raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 1 terminated with signal SIGTERM
WARNING: script canceled externally (UI, API)
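When reading a two-process log like this, it can help to see the last message each rank printed before the job stopped. The sketch below assumes the line layout seen in this log (`timestamp LEVEL [file:line] (rank/world_size) message`). `LOG_LINE` and `last_message_per_rank` are hypothetical names introduced for illustration, not icefall utilities.

```python
import re

# Matches lines shaped like:
# 2024-11-05 12:55:27,986 INFO [train.py:1286] (1/2) Using DDP
LOG_LINE = re.compile(
    r"^(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"(?P<level>[A-Z]+) "
    r"\[(?P<file>[^:\]]+):(?P<line>\d+)\] "
    r"\((?P<rank>\d+)/(?P<world_size>\d+)\) "
    r"(?P<message>.*)$"
)


def last_message_per_rank(lines):
    """Return {rank: (timestamp, message)} for the last log line of each rank.

    Lines that do not match the expected layout (tracebacks, warnings)
    are skipped.
    """
    last = {}
    for raw in lines:
        match = LOG_LINE.match(raw.strip())
        if match is None:
            continue
        last[int(match["rank"])] = (match["time"], match["message"])
    return last


if __name__ == "__main__":
    sample = [
        "2024-11-05 12:55:26,998 INFO [train.py:1271] (0/2) Number of model parameters: 23627887",
        "2024-11-05 12:55:27,934 INFO [train.py:1286] (0/2) Using DDP",
        "2024-11-05 12:55:27,986 INFO [train.py:1286] (1/2) Using DDP",
        "Traceback (most recent call last):",
    ]
    for rank, (time, message) in sorted(last_message_per_rank(sample).items()):
        print(rank, time, message)
```

On the excerpt posted here, the last message from both ranks is `Using DDP`, so nothing after DDP setup was logged by either process before the SIGTERM.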