
Training without Sequence Parallelism but VIDEO_SYNC_GROUP #162

Open
rob-hen opened this issue Nov 5, 2024 · 6 comments

Comments

@rob-hen

rob-hen commented Nov 5, 2024

Hi all,

The provided script train_pyramid_flow.sh does not set the flag use_sequence_parallel. In that case, what is the purpose of VIDEO_SYNC_GROUP=8? Why do we want all workers to use the same video?

rob-hen closed this as completed Nov 5, 2024
rob-hen reopened this Nov 5, 2024
@jy0205
Owner

jy0205 commented Nov 5, 2024

Hi, we do not use sequence parallelism during training. The VIDEO_SYNC_GROUP controls the number of processes that receive the same video batch as input. We find that this trick makes the gradient direction more stable (it optimizes the performance over the whole latent sequence of one video, rather than over single latents from different videos).
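
For illustration, here is a minimal sketch of how such group-shared sampling could be derived from the global rank. The helper name and the seeding scheme are my guesses for clarity, not the repository's actual code:

```python
import torch.distributed as dist

VIDEO_SYNC_GROUP = 8  # number of ranks that share one video batch

def sync_group_seed(base_seed: int) -> int:
    # Ranks 0-7 form group 0, ranks 8-15 form group 1, and so on.
    # All ranks in a group derive the same seed, so their data
    # samplers draw the same video; different groups draw different
    # videos. Requires an initialized process group.
    rank = dist.get_rank()
    group_id = rank // VIDEO_SYNC_GROUP
    return base_seed + group_id
```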

@rob-hen
Author

rob-hen commented Nov 5, 2024

Hi @jy0205,

Thank you for the answer.
So with VIDEO_SYNC_GROUP=8 and GPUS=8, all GPUs get exactly the same videos. However, I don't see the difference between the processes; all of them will use exactly the same latent (the same clip from the videos):

'video': video_latent,

@yjhong89

yjhong89 commented Nov 6, 2024

I think video_sync_group doesn't split the same video latent, but accepts the same video latent without splitting.

  • This part is different from sequence parallelism, which splits the latent along the time axis.
  • Is that right?
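
To make the contrast concrete, a sketch of what the time-axis split under sequence parallelism would look like (illustrative only; the training script discussed here does not take this path):

```python
import torch

def shard_along_time(latent: torch.Tensor, world_size: int, rank: int) -> torch.Tensor:
    # Sequence parallelism shards the latent along the temporal axis,
    # e.g. [B, T, C, H, W] -> world_size shards of [B, T/ws, C, H, W],
    # with each rank keeping only its own shard. Assumes dim 1 is time.
    return latent.chunk(world_size, dim=1)[rank]
```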


@jy0205
Owner

jy0205 commented Nov 6, 2024

I think video_sync_group doesn't split the same video latent, but accepts the same video latent without splitting.

  • This part is different from sequence parallelism, which splits the latent along the time axis.
  • Is that right?

Yes, you are right. The video_sync_group does not split the video. It works because different video ranks load different video lengths. You can see this in the sample_length method.
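
A rough sketch of what per-rank length sampling could look like. sample_length is the method named above, but its body and the length grid here are hypothetical, not the actual implementation:

```python
def sample_length(rank: int, video_sync_group: int,
                  lengths: tuple = (8, 16, 24, 32)) -> int:
    # Hypothetical: ranks inside one sync group share the same video,
    # but each rank samples a different temporal length, so together
    # the group trains on the video's latent sequence at several
    # durations rather than on one identical clip.
    local_id = rank % video_sync_group
    return lengths[local_id % len(lengths)]
```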

@jy0205
Owner

jy0205 commented Nov 6, 2024

All the stages employ uniform sampling. We make the video token sequence length-balanced (the sum of token lengths per batch is kept fixed).
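
A greedy sketch of such token-length balancing, assuming each sample records its token count under a hypothetical "num_tokens" key (not the repository's implementation):

```python
def length_balanced_batches(samples: list, token_budget: int) -> list:
    # Pack samples so each batch's total token length stays near a
    # fixed budget: accumulate until adding the next sample would
    # exceed the budget, then start a new batch.
    batches, batch, total = [], [], 0
    for s in sorted(samples, key=lambda s: s["num_tokens"], reverse=True):
        if batch and total + s["num_tokens"] > token_budget:
            batches.append(batch)
            batch, total = [], 0
        batch.append(s)
        total += s["num_tokens"]
    if batch:
        batches.append(batch)
    return batches
```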
