I hit this problem when I tried to train the model. Inference ran fine. The only change was switching GPUs from A100 to H20.
I'm running into this problem too.
nohup: ignoring input
/root/miniconda3/lib/python3.10/site-packages/colossalai/pipeline/schedule/_utils.py:19: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
_register_pytree_node(OrderedDict, _odict_flatten, _odict_unflatten)
/root/miniconda3/lib/python3.10/site-packages/torch/utils/_pytree.py:254: UserWarning: <class 'collections.OrderedDict'> is already registered as pytree node. Overwriting the previous registration.
warnings.warn(
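The two warnings above come from colossalai calling a PyTorch API that was renamed. They are harmless for this run, but the "prefer the new name, fall back to the deprecated one" pattern they suggest can be sketched (on stand-in module objects rather than torch itself, so the example is self-contained):

```python
# Torch-free sketch of the fallback pattern. `legacy_api`/`modern_api` are
# stand-ins for torch.utils._pytree before and after the rename of
# _register_pytree_node to register_pytree_node.
import types

legacy_api = types.SimpleNamespace(_register_pytree_node=lambda *a, **kw: "old")
modern_api = types.SimpleNamespace(
    _register_pytree_node=lambda *a, **kw: "old",
    register_pytree_node=lambda *a, **kw: "new",
)

def pick_register(mod):
    """Return the non-deprecated registration function when it exists."""
    return getattr(mod, "register_pytree_node", None) or mod._register_pytree_node

assert pick_register(legacy_api)() == "old"
assert pick_register(modern_api)() == "new"
```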
[2024-11-21 08:39:44] Experiment directory created at /high_perf_store/surround-view/liangshuang/Test/STDiT/outputs/stage1/010-STDiT3-XL-2
[2024-11-21 08:39:44] Training configuration:
{'adam_eps': 1e-15,
'bucket_config': {'1024': {1: (0.05, 36)},
'1080p': {1: (0.1, 5)},
'144p': {1: (1.0, 475),
51: (1.0, 51),
102: ((1.0, 0.33), 27),
204: ((1.0, 0.1), 13),
408: ((1.0, 0.1), 6)},
'2048': {1: (0.1, 5)},
'240p': {1: (0.3, 297),
51: (0.4, 20),
102: ((0.4, 0.33), 10),
204: ((0.4, 0.1), 5),
408: ((0.4, 0.1), 2)},
'256': {1: (0.4, 297),
51: (0.5, 20),
102: ((0.5, 0.33), 10),
204: ((0.5, 0.1), 5),
408: ((0.5, 0.1), 2)},
'360p': {1: (0.2, 141),
51: (0.15, 8),
102: ((0.15, 0.33), 4),
204: ((0.15, 0.1), 2),
408: ((0.15, 0.1), 1)},
'480p': {1: (0.1, 89)},
'512': {1: (0.1, 141)},
'720p': {1: (0.05, 36)}},
'ckpt_every': 200,
'config': 'configs/opensora-v1-2/train/stage1.py',
'dataset': {'data_path': '/high_perf_store/surround-view/liangshuang/Data/webvid-stage1/stage1.csv',
'transform_name': 'resize_crop',
'type': 'VariableVideoTextDataset'},
'dtype': 'bf16',
'ema_decay': 0.99,
'epochs': 5,
'grad_checkpoint': True,
'grad_clip': 1.0,
'load': None,
'log_every': 10,
'lr': 0.0001,
'mask_ratios': {'image_head': 0.05,
'image_head_tail': 0.025,
'image_random': 0.025,
'image_tail': 0.025,
'intepolate': 0.005,
'quarter_head': 0.005,
'quarter_head_tail': 0.005,
'quarter_random': 0.005,
'quarter_tail': 0.005,
'random': 0.05},
'model': {'enable_flash_attn': True,
'enable_layernorm_kernel': True,
'freeze_y_embedder': True,
'from_pretrained': '/high_perf_store/surround-view/liangshuang/Test/STDiT/outputs/adapt/006-STDiT3-XL-2-lr4-split/epoch4-global_step5700/',
'qk_norm': True,
'type': 'STDiT3-XL/2'},
'num_bucket_build_workers': 16,
'num_workers': 8,
'outputs': '/high_perf_store/surround-view/liangshuang/Test/STDiT/outputs/stage1',
'plugin': 'zero2',
'record_time': False,
'scheduler': {'sample_method': 'logit-normal',
'type': 'rflow',
'use_timestep_transform': True},
'seed': 42,
'start_from_scratch': False,
'text_encoder': {'from_pretrained': '/high_perf_store/surround-view/liangshuang/Open-Sora-weights-1.2/t5-v1_1-xxl',
'model_max_length': 300,
'shardformer': True,
'type': 't5'},
'vae': {'from_pretrained': '/high_perf_store/surround-view/liangshuang/Open-Sora-weights-1.2/OpenSora-VAE-v1.2/model.safetensors',
'micro_batch_size': 4,
'micro_frame_size': 17,
'type': 'OpenSoraVAE_V1_2'},
'wandb': False,
'warmup_steps': 1000}
[2024-11-21 08:39:44] Building dataset...
[2024-11-21 08:39:44] Dataset contains 7552 samples.
[2024-11-21 08:39:44] Number of buckets: 626
INFO: Pandarallel will run on 1 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.
[2024-11-21 08:39:44] Building buckets...
[2024-11-21 08:39:46] Bucket Info:
[2024-11-21 08:39:46] Bucket [#sample, #batch] by aspect ratio:
{'0.52': [131, 9],
'0.53': [468, 39],
'0.54': [1, 0],
'0.56': [3733, 386],
'0.57': [1291, 123],
'0.60': [1, 0],
'0.67': [21, 0],
'0.68': [9, 0],
'0.75': [7, 0],
'0.78': [2, 0]}
[2024-11-21 08:39:46] Image Bucket [#sample, #batch] by HxWxT:
{}
[2024-11-21 08:39:46] Video Bucket [#sample, #batch] by HxWxT:
{('360p', 408): [36, 36],
('360p', 204): [73, 36],
('360p', 102): [226, 55],
('360p', 51): [521, 64],
('240p', 408): [101, 50],
('240p', 204): [153, 30],
('240p', 102): [509, 49],
('240p', 51): [1183, 58],
('256', 408): [65, 32],
('256', 204): [110, 21],
('256', 102): [376, 36],
('256', 51): [883, 43],
('144p', 408): [59, 9],
('144p', 204): [117, 8],
('144p', 102): [409, 14],
('144p', 51): [843, 16]}
[2024-11-21 08:39:46] #training batch: 557, #training sample: 5.53 K, #non empty bucket: 48
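As a sanity check on the bucket table above, the #batch counts appear to be #sample floor-divided by the per-bucket batch size from bucket_config (shown here for the num_frames=408 buckets only; the batch sizes are read off the configuration printed earlier):

```python
# Batch sizes for num_frames=408, taken from the bucket_config in the log above.
batch_size_408 = {"360p": 1, "240p": 2, "256": 2, "144p": 6}
# Sample counts for the same buckets, from the "Video Bucket" table.
samples_408 = {"360p": 36, "240p": 101, "256": 65, "144p": 59}

batches = {res: n // batch_size_408[res] for res, n in samples_408.items()}
# Matches the logged [#sample, #batch] pairs for these buckets.
assert batches == {"360p": 36, "240p": 50, "256": 32, "144p": 9}
```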
[2024-11-21 08:39:46] Building models...
Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s]
Loading checkpoint shards: 50%|█████ | 1/2 [00:11<00:11, 11.88s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:23<00:00, 12.01s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:23<00:00, 11.99s/it]
Missing keys: []
Unexpected keys: []
[2024-11-21 08:40:30] Model checkpoint loaded from /high_perf_store/surround-view/liangshuang/Test/STDiT/outputs/adapt/006-STDiT3-XL-2-lr4-split/epoch4-global_step5700/
[2024-11-21 08:40:30] [Diffusion] Trainable model params: 1.12 B, Total model params: 1.12 B
[extension] Compiling the JIT cpu_adam_x86 kernel during runtime now
[extension] Time taken to compile cpu_adam_x86 op: 0.06306171417236328 seconds
[extension] Compiling the JIT fused_optim_cuda kernel during runtime now
[extension] Time taken to compile fused_optim_cuda op: 0.0817575454711914 seconds
[2024-11-21 08:40:31] mask ratios: {'random': 0.05, 'intepolate': 0.005, 'quarter_random': 0.005, 'quarter_head': 0.005, 'quarter_tail': 0.005, 'quarter_head_tail': 0.005, 'image_random': 0.025, 'image_head': 0.05, 'image_tail': 0.025, 'image_head_tail': 0.025, 'identity': 0.8}
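The 'identity' entry in the logged mask ratios is not part of the training configuration; it looks like the remainder after the configured ratios, which sum to 0.2:

```python
# Mask ratios as configured (everything except 'identity' in the log line above).
configured = {
    "random": 0.05, "intepolate": 0.005, "quarter_random": 0.005,
    "quarter_head": 0.005, "quarter_tail": 0.005, "quarter_head_tail": 0.005,
    "image_random": 0.025, "image_head": 0.05, "image_tail": 0.025,
    "image_head_tail": 0.025,
}
identity = 1.0 - sum(configured.values())  # probability of applying no mask
assert abs(identity - 0.8) < 1e-9
```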
[2024-11-21 08:40:31] Preparing for distributed training...
[2024-11-21 08:40:31] Boosting model for distributed training
[2024-11-21 08:40:31] Training for 5 epochs with 557 steps per epoch
[2024-11-21 08:40:31] Beginning epoch 0...
Epoch 0: 0%| | 0/557 [00:00<?, ?it/s]tensor([51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51,
51, 51])
Epoch 0: 0%| | 1/557 [00:47<7:19:39, 47.45s/it]tensor([51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51,
51, 51])
Epoch 0: 0%| | 2/557 [01:13<5:22:05, 34.82s/it]tensor([408])
[2024-11-21 08:42:00,328] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -8) local_rank: 0 (pid: 69053) of binary: /root/miniconda3/bin/python
Traceback (most recent call last):
File "/root/miniconda3/bin/torchrun", line 33, in <module>
sys.exit(load_entry_point('torch==2.2.2', 'console_scripts', 'torchrun')())
File "/root/miniconda3/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
return f(*args, **kwargs)
File "/root/miniconda3/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main
run(args)
File "/root/miniconda3/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
elastic_launch(
File "/root/miniconda3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/root/miniconda3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
scripts/train.py FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2024-11-21_08:42:00
host : localhost
rank : 0 (local_rank: 0)
exitcode : -8 (pid: 69053)
error_file: <N/A>
traceback : Signal 8 (SIGFPE) received by PID 69053
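For what it's worth, the negative exit code reported by torchrun encodes the signal that killed the worker: -8 is SIGFPE, an arithmetic fault such as an integer division by zero. It can be decoded with Python's standard signal module:

```python
# torchrun reports "exitcode: -8"; a negative exit code from a child process
# means it was terminated by that signal number.
import signal

exitcode = -8
sig = signal.Signals(-exitcode)
assert sig is signal.SIGFPE  # signal 8: arithmetic error (e.g. integer div-by-zero)
```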