Training speed is not improved by using a better GPU #1698
Comments
Nice plot. It is surprising that the gap is that small.
We ran into the same problem. We are also using A100 and H100 servers, and the H100 is not as fast as we expected, which does not seem normal. At least, the price we paid did not bring a significant speed improvement. Perhaps something has to be set in the training code to use the H100 more efficiently?
Mark, we have a plan to buy H100 GPUs.
Hi @SongLi89, thank you for raising this issue. I will help check if there are any performance bottlenecks. Will reply here with any updates.
@SongLi89 I am trying to reproduce your issue. On the A100 it takes me about 0.6 seconds per step, and on the H100 about 0.36 seconds per step (measured from the log file), which is a speedup of roughly 1.7x. Is this speed ratio similar to yours? (I used the aishell1 dataset, where the sentences are slightly shorter, but the max_audio_duration setting is the same as yours.)
Also, could you tell me the exact specifications of your GPUs? The A100 80GB and H100 80GB have the same memory size.
I am using torch 2.3.1 (host driver version 550.54.15, CUDA version 12.4). The Dockerfile is here: https://github.com/modelscope/FunASR/blob/main/runtime/triton_gpu/Dockerfile/Dockerfile.sensevoice You could use the pre-built image here:
If you are willing to follow the steps above and try it on aishell 1, it would be very helpful. This way, we can use almost identical environments and datasets. For aishell, you just need to follow the command to download the pre-extracted features I prepared, and you can start training. Since the wenetspeech dataset is relatively large, reproducing it directly would be time-consuming for me. If you reach similar conclusions to mine on aishell 1 and then find that the H100 is slower on wenetspeech, I can test with wenetspeech as well. Don't worry, though: even if you get the same speedup as I did, I will still profile the overall pipeline to see whether any part of it can be accelerated further.
Thanks a lot, I will try it.
Hi yuekai, I tried your environment and got a similar speedup. Thanks a lot. It would still be great if the performance could be improved further. If you have ideas for speeding up the training, please let me know. Thanks.
Hi @SongLi89, I have profiled the whole pipeline and did not find any significant bottlenecks. It is worth noting that if you are willing to modify the attention mechanism of Zipformer, changing it to standard Transformer attention, you could leverage FlashAttention to accelerate both inference and training. However, it is uncertain whether this change would cost any model accuracy. Alternatively, on the H100 the fastest approach would be to train in FP8. However, since Zipformer has some gradient rescaling operations, an FP8 training recipe might require addressing related issues first. This may not be a task that can be completed quickly. If you or anyone else is interested, we can discuss or collaborate to achieve this.
Perhaps you are limited by the latency of individual operations, e.g. loading the kernels? The nsys profile output may give more detailed info. Unfortunately it won't look that pretty or be that easy to understand unless you annotate the code with NVTX ranges. Guys, do we have a branch anywhere that demonstrates how to add NVTX ranges for profiling? We should make some code available somewhere so that people can easily do this. I profile using commands like the following; the important part is just prepending "nsys profile". Then you have to transfer the .qdrep file to your desktop and view it using NVIDIA Nsight Systems.
Something else I notice is that the time for loading data is quite large. You should also check in 'top' whether the data-loader workers are always busy decompressing data (they would be at 100% CPU) or whether they are waiting for disk access ("D" process state). You do seem to be using a lot of data-loader workers (16), so I'd hope that it wouldn't be waiting on that. But the time for model forward and backward is definitely still quite large.
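A rough way to script the 'top' check described above, assuming a Linux host where /proc/<pid>/stat is readable. The sketch samples the one-letter state of each data-loader worker process and counts how often each state shows up. The helper names (parse_state, sample_states) are made up for illustration and are not part of icefall or lhotse, and the worker PIDs have to be found separately, e.g. from top or ps.

```python
import time
from collections import Counter
from pathlib import Path


def parse_state(stat_line: str) -> str:
    """Return the one-letter process state from a /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so the state is the first field after the
    # LAST closing parenthesis.
    after_comm = stat_line[stat_line.rindex(")") + 1:]
    return after_comm.split()[0]


def process_state(pid: int) -> str:
    """Read the current state of a live process (Linux only)."""
    return parse_state(Path(f"/proc/{pid}/stat").read_text())


def sample_states(pids, samples=50, interval=0.1) -> Counter:
    """Sample the state of each PID repeatedly and count the states seen.

    Mostly "R" suggests the workers are CPU-bound (e.g. decompressing);
    mostly "D" suggests they are blocked on disk access.
    """
    counts = Counter()
    for _ in range(samples):
        for pid in pids:
            try:
                counts[process_state(pid)] += 1
            except OSError:
                # The process exited between samples; skip it.
                pass
        time.sleep(interval)
    return counts


# Hypothetical usage, with worker PIDs taken from top or ps:
#   print(sample_states([12345, 12346], samples=100))
```

A count dominated by "S" (sleeping) would instead suggest the workers are idle and waiting for the training loop to consume batches, which points away from data loading as the bottleneck.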
A quick way to get a sense of which part is slow is to use the ability of the "nvtx" pip package to automatically create an NVTX range for every single Python function call. You can read more here: https://nvtx.readthedocs.io/en/latest/auto.html There is an example here: NVIDIA/NeMo#9100 (it also shows how to use cudaProfilerStart and cudaProfilerStop properly, as well as emit_nvtx() from PyTorch). Basically, you can run with and without that enabled to get a sense of what might be slow, without manually putting in NVTX ranges. Note that enabling automatic NVTX ranges can cause a huge slowdown, which is why it is good to run with and without it and compare the two .nsys-rep files side by side. I do this all the time for NeMo.
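For the manual alternative, here is a minimal sketch of wrapping training phases in named NVTX ranges. It assumes PyTorch's torch.cuda.nvtx.range_push and range_pop; the nvtx_range helper and the phase names are hypothetical and do not come from icefall. The helper degrades to a no-op when PyTorch or CUDA is unavailable, so the script still runs on a CPU-only machine.

```python
import contextlib

try:
    import torch
    _HAS_CUDA = torch.cuda.is_available()
except ImportError:  # PyTorch not installed: ranges become no-ops
    torch = None
    _HAS_CUDA = False


@contextlib.contextmanager
def nvtx_range(name: str):
    """Mark the enclosed block as a named NVTX range (no-op without CUDA)."""
    if _HAS_CUDA:
        torch.cuda.nvtx.range_push(name)
    try:
        yield
    finally:
        # Always pop, even if the block raised, so ranges stay balanced.
        if _HAS_CUDA:
            torch.cuda.nvtx.range_pop()


# Toy loop standing in for a real training step; the bodies are placeholders.
for step in range(3):
    with nvtx_range(f"step_{step}"):
        with nvtx_range("forward"):
            loss = float(step)      # placeholder for model(batch)
        with nvtx_range("backward"):
            grad = 2.0 * loss       # placeholder for loss.backward()
        with nvtx_range("optimizer_step"):
            pass                    # placeholder for optimizer.step()
```

When the script is launched with "nsys profile" prepended, as described earlier in the thread, these names should appear as nested ranges on the timeline, which makes it easier to see which phase the GPU kernels belong to.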
BTW since you mentioned max_duration: you might be interested in our latest efforts in improved batch size calibration for bucketing. We found we're able to use practically 100% of available compute, improving the mean batch sizes for some of our models by as much as 5x. NVIDIA/NeMo#9763 This could be easily ported to Icefall with
Hi,
I just have a question regarding the training speed on different GPUs.
We tested the training speed on an A100 and an H100 (a single GPU in each test) using the same training setup.
Settings for A100 environment:
Driver Version: 525.147.05
CUDA Version: 12.0
pytorch 2.0.1
cuda: 11.8
======
Settings for H100 environment:
Driver Version: 525.54.15
CUDA Version: 12.4
pytorch 2.2.2
cuda: 12.1
To compare the two GPUs fairly, we used the same training parameters for both, as listed below. (Of course, the H100 has more memory than the A100, so we could use a larger value for "maximum duration". This test is only meant to compare the training performance of the two GPUs.)
wenetspeech recipe:
./zipformer/train.py \
  --world-size 1 \
  --num-epochs 30 \
  --use-fp16 1 \
  --max-duration 450 \
  --training-subset L \
  --exp-dir zipformer/exp_causal \
  --causal 1 \
  --num-workers 16
However, we found that the training speed was not significantly improved with the more expensive GPU (H100).
For a better comparison, we plotted the processing time of each key step for each batch: data loading, forward (zipformer), loss calculation, backward propagation, and parameter update.
The results are plotted in the following two figures.
We can see that the time for IO, loss calculation, and parameter update is relatively low. The main time cost in training is the backward propagation and the forward pass (zipformer). It is unclear to me why the H100 shows no time reduction over the A100. Has anyone else had a similar experience? Or is there something we haven't noticed?
Best,
Li
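The per-phase timing described above can be sketched roughly as follows. This is not the script used for the plots; PhaseTimer and the phase names are hypothetical. One caveat worth stating: CUDA kernels are launched asynchronously, so wall-clock timing without synchronization can attribute GPU time to the wrong phase. The sketch therefore accepts an optional sync callable, for which torch.cuda.synchronize would be the natural choice on a GPU (an assumption; here it is left unset so the example runs with the standard library only).

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class PhaseTimer:
    """Accumulate wall-clock time per named phase of a training step."""

    def __init__(self, sync=None):
        # sync: optional callable run before and after each phase,
        # e.g. torch.cuda.synchronize on a GPU (assumption, not required).
        self.sync = sync
        self.totals = defaultdict(float)
        self.counts = defaultdict(int)

    @contextmanager
    def phase(self, name):
        if self.sync is not None:
            self.sync()  # make sure earlier GPU work is not billed here
        start = time.perf_counter()
        try:
            yield
        finally:
            if self.sync is not None:
                self.sync()  # wait for this phase's GPU work to finish
            self.totals[name] += time.perf_counter() - start
            self.counts[name] += 1

    def mean(self, name):
        """Mean seconds per call for one phase."""
        return self.totals[name] / self.counts[name]

    def fractions(self):
        """Share of the total measured time spent in each phase."""
        total = sum(self.totals.values())
        if total == 0.0:
            return {}
        return {name: t / total for name, t in self.totals.items()}


# Toy loop: sleeps stand in for the real work of each phase.
timer = PhaseTimer()
for _ in range(3):
    with timer.phase("load_data"):
        time.sleep(0.001)
    with timer.phase("forward"):
        time.sleep(0.003)
    with timer.phase("backward"):
        time.sleep(0.004)

for name, share in timer.fractions().items():
    print(f"{name}: {timer.mean(name) * 1e3:.2f} ms/step, {share:.0%} of total")
```

Running the same measurement on both GPUs with synchronization enabled would show whether the forward and backward phases themselves scale, or whether the time is dominated by something that does not depend on the GPU.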