The speed up of LCAO GPU is bad #5932

Open
pxlxingliang opened this issue Feb 25, 2025 · 5 comments
Labels: GPU & DCU & HPC (GPU, DCU and HPC related issues)

Comments

pxlxingliang (Collaborator) commented Feb 25, 2025

I have tested the GPU and CPU runtimes for 16/32/64 Fe atoms on Bohrium machines: c12_m92_1 * NVIDIA V100 (price is 4.5 RMB/hour) and c32_m64_cpu (price is 2.56 RMB/hour).

The results are as follows (the first column is the number of Fe atoms; total_time and ave_time are in seconds):

# CPU: c32_m64_cpu, OMP=1, MPI=16
    converge     drho_last  normal_end  scf_steps  energy_per_atom  total_time    ave_time
16      True  9.568929e-07        True         12     -3409.354577     187.321   15.610083
32      True  9.979589e-07        True         13     -3409.353706     568.348   43.719077
64      True  6.391382e-07        True         13     -3409.354057    1862.490  143.268462

# GPU: c12_m92_1 * NVIDIA V100, OMP=12, MPI=1
    converge     drho_last  normal_end  scf_steps  energy_per_atom  total_time    ave_time
16      True  9.991711e-07        True         12     -3409.354577     748.215   62.351250
32      True  9.980243e-07        True         13     -3409.353706    2006.740  154.364615
64      True  6.521218e-07        True         13     -3409.354057    6859.480  527.652308

# GPU: c12_m92_1 * NVIDIA V100, OMP=6, MPI=1
    converge     drho_last  normal_end  scf_steps  energy_per_atom  total_time    ave_time
16      True  9.993026e-07        True         12     -3409.354577      573.12   47.760000
32      True  9.980718e-07        True         13     -3409.353706     1576.05  121.234615

# GPU: c12_m92_1 * NVIDIA V100, OMP=1, MPI=1
    converge     drho_last  normal_end  scf_steps  energy_per_atom  total_time    ave_time
16      True  9.978967e-07        True         12     -3409.354577     432.224   36.018667
32      True  9.980485e-07        True         13     -3409.353706    1086.320   83.563077
64      True  6.496210e-07        True         13     -3409.354057    3471.810  267.062308

The time cost on the GPU is much longer than on the CPU, and it gets even longer when more OpenMP threads are used.

gpu.zip
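For reference, the settings listed above presumably correspond to run commands of roughly the following form. This is a sketch based on the OMP/MPI settings in the table headers; the "device gpu" INPUT keyword is an assumption about how the GPU code path was enabled, not something copied from the attached inputs.

```bash
# CPU runs (c32_m64_cpu): 16 MPI processes, 1 OpenMP thread each
OMP_NUM_THREADS=1 mpirun -n 16 abacus

# GPU runs (c12_m92_1 * NVIDIA V100): 1 MPI process, 1/6/12 OpenMP threads;
# the LCAO grid integration is offloaded to the GPU via the INPUT keyword
# "device gpu" (assumption, see the note above)
OMP_NUM_THREADS=6 mpirun -n 1 abacus
```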

mohanchen added the GPU & DCU & HPC label on Feb 25, 2025
dzzz2001 (Collaborator) commented:

If we only consider the runtime of cal_gint, it seems that the GPU offers better cost-effectiveness. Taking the Fe16 example, running cal_gint with OMP=6 on a V100 GPU took only 17.8 seconds, while it took 60.2 seconds on a CPU. The price of the GPU is less than twice that of the CPU, yet its speed is more than three times faster. However, overall, I believe that the GPU does not offer better cost-effectiveness compared to the CPU in these calculations because the grid integration part occupies a relatively small proportion of the total runtime, whereas the non-grid integration part takes up a larger portion. Moreover, most of the non-grid integration part has not been optimized for GPU usage.
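For concreteness, a back-of-the-envelope cost comparison for the cal_gint part of the Fe16 case, using the node prices quoted in the opening post and the 17.8 s / 60.2 s timings above (a rough sketch, not a precise accounting):

```bash
awk 'BEGIN {
  printf "GPU cal_gint cost: %.4f RMB\n", 17.8 * 4.5  / 3600;  # V100 node, 4.5 RMB/hour
  printf "CPU cal_gint cost: %.4f RMB\n", 60.2 * 2.56 / 3600;  # CPU node, 2.56 RMB/hour
}'
```

For the grid-integration part alone, the V100 node is therefore roughly half the cost of the CPU node; that advantage disappears once the unoptimized non-grid parts dominate the total runtime.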

dzzz2001 (Collaborator) commented:

By the way, I would like to ask about the settings for OpenMP and MPI when running on the CPU.

> The time cost on the GPU is much longer than on the CPU, and it gets even longer when more OpenMP threads are used.

Part of the reason performance degrades with an increasing number of OMP threads is that the physical core count on Bohrium machines is half of what is labeled, so using only half of the listed cores typically yields the highest efficiency. For example, on a c12 machine, setting OMP=6 will be faster than setting OMP=12.
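A quick way to confirm this on a node (assuming a standard Linux image with lscpu available):

```bash
nproc                                   # logical CPUs, e.g. 12 on a c12 machine
lscpu | grep -E '^(CPU\(s\)|Thread\(s\) per core|Core\(s\) per socket)'
# With 2 threads per core, a "c12" node has only 6 physical cores,
# so OMP_NUM_THREADS=6 usually outperforms OMP_NUM_THREADS=12.
```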

dzzz2001 (Collaborator) commented:

I've found that the main reason the GPU version is slower than the CPU version lies in the large difference in time spent in the pdgemm function. A large part of the reason the GPU version gets slower with more OpenMP threads is also this pdgemm call: the more threads are opened, the slower it becomes. However, when running this example on my own machine, pdgemm takes much less time and does not slow down with more threads. I suspect the problem is related to the ScaLAPACK library installed on Bohrium. Have you reproduced this efficiency issue in other testing environments?
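One way to narrow this down would be to check which ScaLAPACK/BLAS the binary in the Bohrium image is actually linked against; a minimal sketch, assuming the binary is named abacus and is on the PATH:

```bash
ldd "$(which abacus)" | grep -iE 'scalapack|blas|mkl'
```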

pxlxingliang (Collaborator, Author) commented:

The OpenMP thread count and MPI process count used for the CPU runs were 1 and 16, respectively.

I used the image "registry.dp.tech/deepmodeling/abacus-cuda:latest" on Bohrium, which is built from "Dockerfile.cuda" (https://github.com/deepmodeling/abacus-develop/blob/develop/Dockerfile.cuda).
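For anyone trying to reproduce the GPU environment outside Bohrium, the same image can be pulled and run locally; a minimal sketch, assuming the NVIDIA container toolkit is installed on the host:

```bash
docker pull registry.dp.tech/deepmodeling/abacus-cuda:latest
docker run --gpus all -it registry.dp.tech/deepmodeling/abacus-cuda:latest bash
```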

dzzz2001 (Collaborator) commented Mar 4, 2025

@pxlxingliang The abnormal time consumption of the pdgemm function and the slowdown with more OpenMP threads are both related to OpenMPI's process binding policy. When the number of processes is set to 1, OpenMPI's default binding policy is bind-to core, which means that no matter how many OpenMP threads are started, they can only run on one core. Therefore, the more OpenMP threads you start, the larger the scheduling overhead and the slower the program runs. OpenMPI's binding policies can be referred to at this link. Hence, when running just one process, it is necessary to pass '--bind-to none' to mpirun, which significantly reduces the runtime. Below are my test results of the pdgemm execution time with and without '--bind-to none'.

run "OMP_NUM_THREADS=6 mpirun -n 1 --bind-to none abacus":
Image

run "OMP_NUM_THREADS=6 mpirun -n 1 abacus":
Image
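To verify which binding is actually in effect on a given machine, OpenMPI's '--report-bindings' option prints the binding of each rank at startup; for example:

```bash
# Default policy with a single process: the rank is bound to one core
OMP_NUM_THREADS=6 mpirun -n 1 --report-bindings abacus

# With binding disabled, the OpenMP threads can spread over all cores
OMP_NUM_THREADS=6 mpirun -n 1 --report-bindings --bind-to none abacus
```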
