Training speed is not improved by using a better GPU #1698

Open
SongLi89 opened this issue Jul 20, 2024 · 14 comments

Comments

@SongLi89

Hi,

I just have a question regarding the training speed on different GPUs.
We tested the training speed on an A100 and an H100 (a single GPU in each test) using the same training setup.

Settings for A100 environment:
Driver Version: 525.147.05
CUDA Version: 12.0
pytorch 2.0.1
cuda: 11.8

======
Settings for H100 environment:
Driver Version: 525.54.15
CUDA Version: 12.4
pytorch 2.2.2
cuda: 12.1

To compare the two GPUs fairly, we used the same training parameters on both, listed below. (Of course, the H100 has more memory than the A100, so we could use a larger value for “maximum duration”. This test is only meant to compare the training performance of the two GPUs.)

wenetspeech recipe:
./zipformer/train.py \
  --world-size 1 \
  --num-epochs 30 \
  --use-fp16 1 \
  --max-duration 450 \
  --training-subset L \
  --exp-dir zipformer/exp_causal \
  --causal 1 \
  --num-workers 16

However, we found that the training speed did not improve significantly with the more expensive GPU (H100).
For a better comparison, we plotted the processing time of each key step for every batch: backward propagation, forward pass (zipformer), loss calculation, parameter update, and data loading.
The results are shown in the following two figures.

[Figure: A100ProcessingTime]

[Figure: H100ProcessingTime]

We can see that the time for IO, loss calculation, and parameter updates is relatively low. The main cost of training is the backward propagation and the forward pass (zipformer). It is unclear to me why the H100 shows no time reduction over the A100. Has anyone else had a similar experience? Or is there something we have overlooked?
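A per-phase breakdown like the one described above could be collected roughly as follows. This is only a sketch, not the code used to produce the plots: `PhaseTimer` and the phase names are hypothetical, and the `time.sleep` calls stand in for the real forward and backward passes. The one point that matters on a GPU is the optional `sync` callable (for example `torch.cuda.synchronize`), because CUDA kernels launch asynchronously and unsynchronized timings tend to shift cost from one phase to the next.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class PhaseTimer:
    """Accumulates wall-clock time per named training phase."""

    def __init__(self, sync=None):
        # `sync` is an optional callable that blocks until queued GPU work
        # has finished, e.g. torch.cuda.synchronize (assumed, not required).
        self.sync = sync
        self.totals = defaultdict(float)
        self.counts = defaultdict(int)

    @contextmanager
    def phase(self, name):
        if self.sync is not None:
            self.sync()  # do not bill earlier queued work to this phase
        start = time.perf_counter()
        try:
            yield
        finally:
            if self.sync is not None:
                self.sync()  # wait for this phase's kernels to finish
            self.totals[name] += time.perf_counter() - start
            self.counts[name] += 1

    def mean_seconds(self):
        """Mean duration per phase, in seconds."""
        return {k: self.totals[k] / self.counts[k] for k in self.totals}


timer = PhaseTimer()  # on a GPU: PhaseTimer(sync=torch.cuda.synchronize)
for _ in range(3):
    with timer.phase("forward"):
        time.sleep(0.01)  # stand-in for the model forward pass
    with timer.phase("backward"):
        time.sleep(0.02)  # stand-in for loss.backward()
print(timer.mean_seconds())
```

Synchronizing around every phase slows training slightly, so timings gathered this way are best used for comparing phases rather than as absolute throughput numbers.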

Best,
Li

@XhrLeokk

XhrLeokk commented Jul 20, 2024

Nice plot. It is surprising that the gap is that small.
Seems weird. 🤔

@Ziyi6

Ziyi6 commented Jul 20, 2024

We ran into the same problem. We are also using A100 and H100 servers, and the H100 is not as fast as we expected, which seems abnormal. At least, the price we paid has not brought a significant speed improvement. Perhaps something needs to be set in the training code to use the H100 more efficiently?

@rambowu11

Marking this thread; we plan to buy H100 GPUs.

@yuekaizhang
Collaborator

Hi @SongLi89, thank you for raising this issue. I will help check if there are any performance bottlenecks. Will reply here with any updates.

@yuekaizhang
Collaborator

yuekaizhang commented Jul 23, 2024

However, we found that the training speed did not improve significantly with the more expensive GPU (H100).

@SongLi89
Could you tell me the specific comparison results of the training speed in your tests?

I am trying to reproduce your issue.

On the A100, it takes me about 0.6 seconds per step, and on the H100 about 0.36 seconds per step (according to the log files).

Is this speed ratio similar to yours? (I used the aishell1 dataset, where the sentences are slightly shorter, but the max_audio_duration setting is the same as yours.)
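For reference, the speed ratio implied by those two per-step times can be worked out directly. The sketch below uses only the 0.6 s and 0.36 s figures quoted above; the function name `speedup` is just an illustrative helper.

```python
def speedup(baseline_seconds, candidate_seconds):
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


a100_step = 0.6   # seconds per step on the A100, from the log
h100_step = 0.36  # seconds per step on the H100, from the log

ratio = speedup(a100_step, h100_step)
print(f"H100 speedup over A100: {ratio:.2f}x")  # prints 1.67x
```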

@yuekaizhang
Collaborator

(Of course, the H100 has more memory than the A100, so we could use a larger value for “maximum duration”. This test is only meant to compare the training performance of the two GPUs.)

Also, could you tell me the specific specifications of your GPUs? The A100 80GB and H100 80GB have the same memory size.

@SongLi89
Author

(Of course, the H100 has more memory than the A100, so we could use a larger value for “maximum duration”. This test is only meant to compare the training performance of the two GPUs.)

Also, could you tell me the specific specifications of your GPUs? The A100 80GB and H100 80GB have the same memory size.

Hi yuekai, thanks for the rapid reply.
The A100 has 40 GB of memory, whereas the H100 has 80 GB. Below are two screenshots.
image
image

Which torch/CUDA versions did you use for the test?
With the training settings above (wenetspeech L), one step takes around 0.5 s on both GPUs. The H100 is slightly faster, but 0.36 s is never reached.

@yuekaizhang
Collaborator

yuekaizhang commented Jul 23, 2024

Which torch/CUDA versions did you use for the test? With the training settings above (wenetspeech L), one step takes around 0.5 s on both GPUs. The H100 is slightly faster, but 0.36 s is never reached.

I am using torch 2.3.1 (host driver version 550.54.15, CUDA version 12.4). The Dockerfile is here: https://github.com/modelscope/FunASR/blob/main/runtime/triton_gpu/Dockerfile/Dockerfile.sensevoice

You could use the pre-built image here:

docker pull soar97/triton-sensevoice:24.05
pip install k2==1.24.4.dev20240606+cuda12.1.torch2.3.1 -f https://k2-fsa.github.io/k2/cuda.html
pip install -r icefall/requirements.txt
pip install lhotse

huggingface-cli download  --repo-type dataset --local-dir /your_icefall/egs/aishell/ASR/data yuekai/aishell_icefall_fbank
./zipformer/train.py \
  --world-size 1 \
  --num-epochs 30 \
  --use-fp16 1 \
  --max-duration 450 \
  --training-subset L \
  --exp-dir zipformer/exp_causal \
  --causal 1 \
  --num-workers 16

If you are willing to follow the steps above to try it on aishell 1, it would be very helpful. This way, we can use almost identical environments and datasets. For aishell, you just need to follow the command to download the pre-extracted features I prepared, and you can start training.

Since the wenetspeech dataset is relatively large, reproducing it directly would be time-consuming for me. If you can obtain similar conclusions to mine on aishell 1 and then find that the H100 is slower on wenetspeech, I can try using wenetspeech to test it.

However, don't worry. Even if you achieve the same acceleration ratio as I did, I will still check the performance to see if there are any areas in the overall pipeline that can be further accelerated.

@SongLi89
Author

Thanks a lot, I will try it.

@SongLi89
Author

Hi yuekai, I tried your environment and got a similar acceleration ratio. Thanks a lot. Still, it would be great if the performance could be improved further. If you have ideas for speeding up the training, please let me know. Thanks.

@yuekaizhang
Collaborator

Hi yuekai, I tried your environment and got a similar acceleration ratio. Thanks a lot. Still, it would be great if the performance could be improved further. If you have ideas for speeding up the training, please let me know. Thanks.

Hi @SongLi89, I have profiled the whole pipeline and did not find any significant bottlenecks.

It is worth noting that if you are willing to make some modifications to the attention mechanism of Zipformer, changing it to standard Transformer attention, you could leverage FlashAttention to accelerate both inference and training. However, it is uncertain whether this change would result in any loss of model accuracy.

Alternatively, for the H100, the fastest approach would be to use FP8 for training. However, since Zipformer has some gradient rescaling operations, an FP8 training recipe might require addressing related issues first. This may not be a task that can be completed quickly. If you or anyone else is interested, we can discuss or collaborate to achieve this.

@danpovey
Collaborator

Perhaps you are limited by the latency of individual operations, e.g. loading the kernels? The nsys profile output may give more detailed info. Unfortunately it won't look that pretty or be that easy to understand unless you annotate the code with NVTX ranges. Guys, do we have a branch anywhere that can demonstrate how to add nvtx ranges for profiling? We should make available some code somewhere so that people can easily do this.

I profile using commands like the following. The important part is just prepending "nsys profile"; you then have to transfer the .qdrep file to your desktop and view it using NVIDIA Nsight Systems.

 nsys profile  python3 ./pruned_transducer_stateless7/train.py --master-port 71840 --world-size 2 --num-epochs 30 --full-libri 0 --exp-dir pruned_transducer_stateless7/scaled_adam_exp90_2job --max-duration 300 --use-fp16 True --decoder-dim 512 --joiner-dim 512 --start-epoch 5 --num-workers 2 --exit-after-batch 15 &>> nohup2/scaled_adam_exp90_nvtx_2job.out
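For the NVTX annotation mentioned above, a minimal sketch could look like the following. It is not an existing icefall branch: `nvtx_range` and `train_step` are hypothetical names, and the three callables are placeholders for the real forward pass, `loss.backward()` and `optimizer.step()`. The only real API used is `torch.cuda.nvtx.range_push` / `range_pop`; the helper falls back to a no-op when PyTorch or CUDA is not available.

```python
from contextlib import contextmanager

try:
    import torch
    _HAVE_NVTX = torch.cuda.is_available()
except ImportError:
    torch = None
    _HAVE_NVTX = False


@contextmanager
def nvtx_range(name):
    """Mark a named range on the Nsight Systems timeline (no-op without CUDA)."""
    if _HAVE_NVTX:
        torch.cuda.nvtx.range_push(name)
    try:
        yield
    finally:
        if _HAVE_NVTX:
            torch.cuda.nvtx.range_pop()


def train_step(forward, backward, update):
    # Hypothetical training step: each phase gets its own NVTX range so that
    # it shows up as a labelled bar when the run is recorded with nsys.
    with nvtx_range("forward"):
        loss = forward()
    with nvtx_range("backward"):
        backward(loss)
    with nvtx_range("update"):
        update()
    return loss
```

Ranges added this way only become visible when the script is launched under "nsys profile" as in the command above; otherwise they cost almost nothing.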

Something else I notice is that the time for loading data is quite a lot. You should also check on 'top' whether the data-loader workers are always busy decompressing data (would be 100% CPU) or whether they are waiting for disk access ("D" process state). You do seem to be using a lot of data-loader workers (16) so I'd hope that it wouldn't be waiting on that.
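The same check that 'top' provides can be scripted on Linux by reading /proc/<pid>/stat, where the field after the command name is the process state ("R" running, "D" uninterruptible sleep, usually disk wait). The sketch below is an assumption-laden illustration: `parse_stat` and `child_states` are hypothetical helpers, and it presumes the data-loader workers are direct children of the training process.

```python
import os


def parse_stat(stat_text):
    """Return (command, state, parent_pid) from /proc/<pid>/stat contents."""
    # The command is wrapped in parentheses and may itself contain spaces
    # or ')', so split on the last ')' instead of on whitespace.
    close_idx = stat_text.rindex(")")
    comm = stat_text[stat_text.index("(") + 1:close_idx]
    fields = stat_text[close_idx + 1:].split()
    return comm, fields[0], int(fields[1])


def child_states(parent_pid):
    """Map child pid -> (command, state) for direct children (Linux only)."""
    states = {}
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/stat") as f:
                comm, state, ppid = parse_stat(f.read())
        except OSError:
            continue  # the process exited while we were scanning
        if ppid == parent_pid:
            states[int(entry)] = (comm, state)
    return states
```

Calling `child_states(pid_of_train_py)` repeatedly and seeing mostly "D" would point to disk access; mostly "R" would point to CPU-bound decompression.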

But definitely the time for model forward and backward is still quite a lot.
I do notice that your --max-duration is really quite small: 450. Unless your model is extremely large, I'd be surprised if that was the largest duration you could use even for the smaller GPU. We normally use over 1000, I think; and that's on GPUs with 32GB of memory.
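To see why --max-duration matters so much, here is a simplified sketch of duration-capped batching. It is not the actual sampler used by the recipe (which also buckets utterances by length); `pack_batches` is a hypothetical helper and the 10-second utterances are made-up illustrative data. It only shows that raising the cap from 450 to 1000 seconds puts more than twice as much audio in each step.

```python
def pack_batches(durations, max_duration):
    """Greedily group utterance durations (seconds) into batches whose
    total duration stays at or below max_duration."""
    batches, current, total = [], [], 0.0
    for d in durations:
        if current and total + d > max_duration:
            batches.append(current)
            current, total = [], 0.0
        current.append(d)
        total += d
    if current:
        batches.append(current)
    return batches


utterances = [10.0] * 300  # hypothetical: 300 utterances of 10 s each
for cap in (450, 1000):
    batches = pack_batches(utterances, cap)
    # columns: cap, number of batches, utterances in the first batch
    print(cap, len(batches), len(batches[0]))
# prints:
# 450 7 45
# 1000 3 100
```

Fewer, larger batches mean fewer kernel launches per hour of audio, which is one reason a small cap can hide the difference between two GPUs.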

@galv

galv commented Jul 31, 2024

A quick way to get a sense of which part is slow is to use the ability of the "nvtx" pip package to automatically create an NVTX range for every single Python function call. You can read more here: https://nvtx.readthedocs.io/en/latest/auto.html

There is an example here: NVIDIA/NeMo#9100 (it also shows how to use cudaProfilerStart and cudaProfilerStop properly, as well as emit_nvtx() from PyTorch)

Basically, you can run with and without that enabled to get a sense of what might be slow, without manually putting in NVTX ranges, if you wanted to. Note that enabling automatic NVTX ranges can cause a huge slowdown, which is why it is good to run with and without it and compare the two .nsys-rep files side by side. I do this all the time for NeMo.

@pzelasko
Collaborator

pzelasko commented Jul 31, 2024

But definitely the time for model forward and backward is still quite a lot. I do notice that your --max-duration is really quite small: 450. Unless your model is extremely large, I'd be surprised if that was the largest duration you could use even for the smaller GPU. We normally use over 1000, I think; and that's on GPUs with 32GB of memory.

BTW since you mentioned max_duration: you might be interested in our latest efforts in improved batch size calibration for bucketing. We found we're able to use practically 100% of available compute, improving the mean batch sizes for some of our models by as much as 5x. NVIDIA/NeMo#9763

This could be easily ported to Icefall with DynamicBucketingSampler, plus minor changes in oomptimizer.py to accommodate the training step API of Icefall models.
