Has anyone hit this error during training? torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -8) local_rank: 0 (pid: 69053) of binary: /root/miniconda3/bin/python. It appeared after switching from A100 to H20; nothing else was changed. #749

Open
liangshuangI opened this issue Nov 21, 2024 · 2 comments


@liangshuangI

nohup: ignoring input
/root/miniconda3/lib/python3.10/site-packages/colossalai/pipeline/schedule/_utils.py:19: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
_register_pytree_node(OrderedDict, _odict_flatten, _odict_unflatten)
/root/miniconda3/lib/python3.10/site-packages/torch/utils/_pytree.py:254: UserWarning: <class 'collections.OrderedDict'> is already registered as pytree node. Overwriting the previous registration.
warnings.warn(
[2024-11-21 08:39:44] Experiment directory created at /high_perf_store/surround-view/liangshuang/Test/STDiT/outputs/stage1/010-STDiT3-XL-2
[2024-11-21 08:39:44] Training configuration:
{'adam_eps': 1e-15,
'bucket_config': {'1024': {1: (0.05, 36)},
'1080p': {1: (0.1, 5)},
'144p': {1: (1.0, 475),
51: (1.0, 51),
102: ((1.0, 0.33), 27),
204: ((1.0, 0.1), 13),
408: ((1.0, 0.1), 6)},
'2048': {1: (0.1, 5)},
'240p': {1: (0.3, 297),
51: (0.4, 20),
102: ((0.4, 0.33), 10),
204: ((0.4, 0.1), 5),
408: ((0.4, 0.1), 2)},
'256': {1: (0.4, 297),
51: (0.5, 20),
102: ((0.5, 0.33), 10),
204: ((0.5, 0.1), 5),
408: ((0.5, 0.1), 2)},
'360p': {1: (0.2, 141),
51: (0.15, 8),
102: ((0.15, 0.33), 4),
204: ((0.15, 0.1), 2),
408: ((0.15, 0.1), 1)},
'480p': {1: (0.1, 89)},
'512': {1: (0.1, 141)},
'720p': {1: (0.05, 36)}},
'ckpt_every': 200,
'config': 'configs/opensora-v1-2/train/stage1.py',
'dataset': {'data_path': '/high_perf_store/surround-view/liangshuang/Data/webvid-stage1/stage1.csv',
'transform_name': 'resize_crop',
'type': 'VariableVideoTextDataset'},
'dtype': 'bf16',
'ema_decay': 0.99,
'epochs': 5,
'grad_checkpoint': True,
'grad_clip': 1.0,
'load': None,
'log_every': 10,
'lr': 0.0001,
'mask_ratios': {'image_head': 0.05,
'image_head_tail': 0.025,
'image_random': 0.025,
'image_tail': 0.025,
'intepolate': 0.005,
'quarter_head': 0.005,
'quarter_head_tail': 0.005,
'quarter_random': 0.005,
'quarter_tail': 0.005,
'random': 0.05},
'model': {'enable_flash_attn': True,
'enable_layernorm_kernel': True,
'freeze_y_embedder': True,
'from_pretrained': '/high_perf_store/surround-view/liangshuang/Test/STDiT/outputs/adapt/006-STDiT3-XL-2-lr4-split/epoch4-global_step5700/',
'qk_norm': True,
'type': 'STDiT3-XL/2'},
'num_bucket_build_workers': 16,
'num_workers': 8,
'outputs': '/high_perf_store/surround-view/liangshuang/Test/STDiT/outputs/stage1',
'plugin': 'zero2',
'record_time': False,
'scheduler': {'sample_method': 'logit-normal',
'type': 'rflow',
'use_timestep_transform': True},
'seed': 42,
'start_from_scratch': False,
'text_encoder': {'from_pretrained': '/high_perf_store/surround-view/liangshuang/Open-Sora-weights-1.2/t5-v1_1-xxl',
'model_max_length': 300,
'shardformer': True,
'type': 't5'},
'vae': {'from_pretrained': '/high_perf_store/surround-view/liangshuang/Open-Sora-weights-1.2/OpenSora-VAE-v1.2/model.safetensors',
'micro_batch_size': 4,
'micro_frame_size': 17,
'type': 'OpenSoraVAE_V1_2'},
'wandb': False,
'warmup_steps': 1000}
[2024-11-21 08:39:44] Building dataset...
[2024-11-21 08:39:44] Dataset contains 7552 samples.
[2024-11-21 08:39:44] Number of buckets: 626
INFO: Pandarallel will run on 1 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.
[2024-11-21 08:39:44] Building buckets...
[2024-11-21 08:39:46] Bucket Info:
[2024-11-21 08:39:46] Bucket [#sample, #batch] by aspect ratio:
{'0.52': [131, 9],
'0.53': [468, 39],
'0.54': [1, 0],
'0.56': [3733, 386],
'0.57': [1291, 123],
'0.60': [1, 0],
'0.67': [21, 0],
'0.68': [9, 0],
'0.75': [7, 0],
'0.78': [2, 0]}
[2024-11-21 08:39:46] Image Bucket [#sample, #batch] by HxWxT:
{}
[2024-11-21 08:39:46] Video Bucket [#sample, #batch] by HxWxT:
{('360p', 408): [36, 36],
('360p', 204): [73, 36],
('360p', 102): [226, 55],
('360p', 51): [521, 64],
('240p', 408): [101, 50],
('240p', 204): [153, 30],
('240p', 102): [509, 49],
('240p', 51): [1183, 58],
('256', 408): [65, 32],
('256', 204): [110, 21],
('256', 102): [376, 36],
('256', 51): [883, 43],
('144p', 408): [59, 9],
('144p', 204): [117, 8],
('144p', 102): [409, 14],
('144p', 51): [843, 16]}
[2024-11-21 08:39:46] #training batch: 557, #training sample: 5.53 K, #non empty bucket: 48
[2024-11-21 08:39:46] Building models...

Loading checkpoint shards: 100%|██████████| 2/2 [00:23<00:00, 11.99s/it]
Missing keys: []
Unexpected keys: []
[2024-11-21 08:40:30] Model checkpoint loaded from /high_perf_store/surround-view/liangshuang/Test/STDiT/outputs/adapt/006-STDiT3-XL-2-lr4-split/epoch4-global_step5700/
[2024-11-21 08:40:30] [Diffusion] Trainable model params: 1.12 B, Total model params: 1.12 B
[extension] Compiling the JIT cpu_adam_x86 kernel during runtime now
[extension] Time taken to compile cpu_adam_x86 op: 0.06306171417236328 seconds
[extension] Compiling the JIT fused_optim_cuda kernel during runtime now
[extension] Time taken to compile fused_optim_cuda op: 0.0817575454711914 seconds
[2024-11-21 08:40:31] mask ratios: {'random': 0.05, 'intepolate': 0.005, 'quarter_random': 0.005, 'quarter_head': 0.005, 'quarter_tail': 0.005, 'quarter_head_tail': 0.005, 'image_random': 0.025, 'image_head': 0.05, 'image_tail': 0.025, 'image_head_tail': 0.025, 'identity': 0.8}
[2024-11-21 08:40:31] Preparing for distributed training...
[2024-11-21 08:40:31] Boosting model for distributed training
[2024-11-21 08:40:31] Training for 5 epochs with 557 steps per epoch
[2024-11-21 08:40:31] Beginning epoch 0...

Epoch 0: 0%| | 0/557 [00:00<?, ?it/s]tensor([51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51,
51, 51])
Epoch 0: 0%| | 1/557 [00:47<7:19:39, 47.45s/it]tensor([51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51,
51, 51])

Epoch 0: 0%| | 2/557 [01:13<5:22:05, 34.82s/it]tensor([408])
[2024-11-21 08:42:00,328] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -8) local_rank: 0 (pid: 69053) of binary: /root/miniconda3/bin/python
Traceback (most recent call last):
File "/root/miniconda3/bin/torchrun", line 33, in
sys.exit(load_entry_point('torch==2.2.2', 'console_scripts', 'torchrun')())
File "/root/miniconda3/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 347, in wrapper
return f(*args, **kwargs)
File "/root/miniconda3/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main
run(args)
File "/root/miniconda3/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
elastic_launch(
File "/root/miniconda3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/root/miniconda3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

scripts/train.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2024-11-21_08:42:00
host : localhost
rank : 0 (local_rank: 0)
exitcode : -8 (pid: 69053)
error_file: <N/A>
traceback : Signal 8 (SIGFPE) received by PID 69053
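
A note for anyone debugging this: exitcode -8 means the worker was killed by SIGFPE (an arithmetic fault such as integer division by zero), which almost always originates in native code, i.e. a CUDA/C++ extension, so the launcher traceback above carries no useful frame. A minimal sketch to at least capture the Python-side stack at the moment of the crash is to enable the standard-library faulthandler (it installs handlers for SIGFPE, SIGSEGV, and friends) at the top of scripts/train.py:

```python
# Minimal sketch: dump this worker's Python traceback when a fatal signal
# (SIGFPE, SIGSEGV, SIGABRT, ...) kills it.
import faulthandler

faulthandler.enable(all_threads=True)
```

Setting PYTHONFAULTHANDLER=1 in the environment before launching torchrun enables the same handler without editing the script.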

@liangshuangI (Author)

I hit this problem when I tried to train. Inference on the same setup runs fine. The only change was switching GPUs from A100 to H20.
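
One way to bisect this (an assumption, not a confirmed fix): the stage1 config above enables architecture-specific compiled kernels (enable_flash_attn, enable_layernorm_kernel). A100 is Ampere (compute capability 8.0) while H20 is Hopper (9.0), so a flash-attn or fused-layernorm build that only shipped sm_80 binaries could crash in native code on the new card. A sketch of the change to configs/opensora-v1-2/train/stage1.py that isolates the compiled kernels (field names are taken from the config dump above; only the two flags change):

```python
# Diagnostic sketch: fall back to the plain PyTorch attention/layernorm paths
# so the compiled kernels can be ruled in or out. Not a performance config.
model = dict(
    type="STDiT3-XL/2",
    from_pretrained="/high_perf_store/surround-view/liangshuang/Test/STDiT/outputs/adapt/006-STDiT3-XL-2-lr4-split/epoch4-global_step5700/",
    qk_norm=True,
    freeze_y_embedder=True,
    enable_flash_attn=False,        # was True; rules out the flash-attn kernel
    enable_layernorm_kernel=False,  # was True; rules out the fused layernorm kernel
)
```

If training then survives past step 2, rebuild flash-attn (and the fused layernorm extension) for the H20's architecture and re-enable the flags.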

@Vincentyua

I'm hitting this problem too.
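
A cheap sanity check before digging deeper (a sketch; nothing H20-specific is confirmed here): verify that the installed PyTorch wheel actually ships kernels for the card's architecture.

```python
# Sketch: confirm the torch build in this env knows the GPU architecture.
# H20 is Hopper (compute capability 9.0); A100 is Ampere (8.0).
import torch

print(torch.__version__, torch.version.cuda)  # the log above shows torch==2.2.2
print(torch.cuda.get_device_name(0))
print(torch.cuda.get_device_capability(0))    # expect (9, 0) on an H20
print(torch.cuda.get_arch_list())             # 'sm_90' should be in this list
```

If 'sm_90' is missing, the wheel predates Hopper support and is the first thing to upgrade.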
