The speed up of LCAO GPU is bad #5932

Open
pxlxingliang opened this issue Feb 25, 2025 · 5 comments
Labels: GPU & DCU & HPC (GPU, DCU and HPC related issues)

Comments

pxlxingliang (Collaborator) commented Feb 25, 2025

I have tested the GPU and CPU runtimes for 16/32/64 Fe atoms on Bohrium machines: c12_m92_1 * NVIDIA V100 (price is 4.5 RMB/hour) and c32_m64_cpu (price is 2.56 RMB/hour).

The results are as follows (the first column is the number of Fe atoms; total_time and ave_time are in seconds):

# CPU: c32_m64_cpu, OMP=1, MPI=16
    converge     drho_last  normal_end  scf_steps  energy_per_atom  total_time    ave_time
16      True  9.568929e-07        True         12     -3409.354577     187.321   15.610083
32      True  9.979589e-07        True         13     -3409.353706     568.348   43.719077
64      True  6.391382e-07        True         13     -3409.354057    1862.490  143.268462

# GPU: c12_m92_1 * NVIDIA V100, OMP=12, MPI=1
    converge     drho_last  normal_end  scf_steps  energy_per_atom  total_time    ave_time
16      True  9.991711e-07        True         12     -3409.354577     748.215   62.351250
32      True  9.980243e-07        True         13     -3409.353706    2006.740  154.364615
64      True  6.521218e-07        True         13     -3409.354057    6859.480  527.652308

# GPU: c12_m92_1 * NVIDIA V100, OMP=6, MPI=1
    converge     drho_last  normal_end  scf_steps  energy_per_atom  total_time    ave_time
16      True  9.993026e-07        True         12     -3409.354577      573.12   47.760000
32      True  9.980718e-07        True         13     -3409.353706     1576.05  121.234615

# GPU: c12_m92_1 * NVIDIA V100, OMP=1, MPI=1
    converge     drho_last  normal_end  scf_steps  energy_per_atom  total_time    ave_time
16      True  9.978967e-07        True         12     -3409.354577     432.224   36.018667
32      True  9.980485e-07        True         13     -3409.353706    1086.320   83.563077
64      True  6.496210e-07        True         13     -3409.354057    3471.810  267.062308

The time cost on the GPU is much longer than on the CPU, and it gets even longer when more OpenMP threads are used.

gpu.zip
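For reference, the settings listed above presumably correspond to run commands of roughly the following form. This is a sketch based on the OMP/MPI settings in the table headers; the "device gpu" INPUT keyword is an assumption about how the GPU code path was enabled, not something copied from the attached inputs.

```bash
# CPU runs (c32_m64_cpu): 16 MPI processes, 1 OpenMP thread each
OMP_NUM_THREADS=1 mpirun -n 16 abacus

# GPU runs (c12_m92_1 * NVIDIA V100): 1 MPI process, 1/6/12 OpenMP threads;
# the LCAO grid integration is offloaded to the GPU via the INPUT keyword
# "device gpu" (assumption, see the note above)
OMP_NUM_THREADS=6 mpirun -n 1 abacus
```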

mohanchen added the GPU & DCU & HPC label on Feb 25, 2025
dzzz2001 (Collaborator) commented:

If we only consider the runtime of cal_gint, it seems that the GPU offers better cost-effectiveness. Taking the Fe16 example, running cal_gint with OMP=6 on a V100 GPU took only 17.8 seconds, while it took 60.2 seconds on a CPU. The price of the GPU is less than twice that of the CPU, yet its speed is more than three times faster. However, overall, I believe that the GPU does not offer better cost-effectiveness compared to the CPU in these calculations because the grid integration part occupies a relatively small proportion of the total runtime, whereas the non-grid integration part takes up a larger portion. Moreover, most of the non-grid integration part has not been optimized for GPU usage.
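For concreteness, a back-of-the-envelope cost comparison for the cal_gint part of the Fe16 case, using the node prices quoted in the opening post and the 17.8 s / 60.2 s timings above (a rough sketch, not a precise accounting):

```bash
awk 'BEGIN {
  printf "GPU cal_gint cost: %.4f RMB\n", 17.8 * 4.5  / 3600;  # V100 node, 4.5 RMB/hour
  printf "CPU cal_gint cost: %.4f RMB\n", 60.2 * 2.56 / 3600;  # CPU node, 2.56 RMB/hour
}'
```

For the grid-integration part alone, the V100 node is therefore roughly half the cost of the CPU node; that advantage disappears once the unoptimized non-grid parts dominate the total runtime.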

dzzz2001 (Collaborator) commented:

By the way, I would like to ask about the settings for OpenMP and MPI when running on the CPU.

> The time cost on the GPU is much longer than on the CPU, and it gets even longer when more OpenMP threads are used.

Part of the reason performance degrades with an increasing number of OMP threads is that the physical core count on Bohrium machines is half of what is labeled, so using only half of the listed cores typically yields the highest efficiency. For example, on a c12 machine, setting OMP=6 will be faster than setting OMP=12.
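A quick way to confirm this on a node (assuming a standard Linux image with lscpu available):

```bash
nproc                                   # logical CPUs, e.g. 12 on a c12 machine
lscpu | grep -E '^(CPU\(s\)|Thread\(s\) per core|Core\(s\) per socket)'
# With 2 threads per core, a "c12" node has only 6 physical cores,
# so OMP_NUM_THREADS=6 usually outperforms OMP_NUM_THREADS=12.
```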

dzzz2001 (Collaborator) commented:

I've found that the main reason the GPU version is slower than the CPU version lies in the large difference in time spent in the pdgemm function. A large part of the reason the GPU version gets slower with more OpenMP threads is also this pdgemm call: the more threads are opened, the slower it becomes. However, when running this example on my own machine, pdgemm takes much less time and does not slow down with more threads. I suspect the problem is related to the ScaLAPACK library installed on Bohrium. Have you reproduced this efficiency issue in other testing environments?
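One way to narrow this down would be to check which ScaLAPACK/BLAS the binary in the Bohrium image is actually linked against; a minimal sketch, assuming the binary is named abacus and is on the PATH:

```bash
ldd "$(which abacus)" | grep -iE 'scalapack|blas|mkl'
```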

pxlxingliang (Collaborator, Author) commented:

The OpenMP thread count and MPI process count used for the CPU runs were 1 and 16, respectively.

I used the image "registry.dp.tech/deepmodeling/abacus-cuda:latest" on Bohrium, which is built from "Dockerfile.cuda" (https://github.com/deepmodeling/abacus-develop/blob/develop/Dockerfile.cuda).
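For anyone trying to reproduce the GPU environment outside Bohrium, the same image can be pulled and run locally; a minimal sketch, assuming the NVIDIA container toolkit is installed on the host:

```bash
docker pull registry.dp.tech/deepmodeling/abacus-cuda:latest
docker run --gpus all -it registry.dp.tech/deepmodeling/abacus-cuda:latest bash
```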

dzzz2001 (Collaborator) commented Mar 4, 2025

@pxlxingliang The abnormal time consumption of the pdgemm function and the slowdown with more OpenMP threads are both related to OpenMPI's process binding policy. When the number of processes is set to 1, OpenMPI's default binding policy is bind-to core, which means that no matter how many OpenMP threads are started, they can only run on one core. Therefore, the more OpenMP threads you start, the larger the scheduling overhead and the slower the program runs. OpenMPI's binding policies can be referred to at this link. Hence, when running just one process, it is necessary to pass '--bind-to none' to mpirun, which significantly reduces the runtime. Below are my test results of the pdgemm execution time with and without '--bind-to none'.

run "OMP_NUM_THREADS=6 mpirun -n 1 --bind-to none abacus":
Image

run "OMP_NUM_THREADS=6 mpirun -n 1 abacus":
Image
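To verify which binding is actually in effect on a given machine, OpenMPI's '--report-bindings' option prints the binding of each rank at startup; for example:

```bash
# Default policy with a single process: the rank is bound to one core
OMP_NUM_THREADS=6 mpirun -n 1 --report-bindings abacus

# With binding disabled, the OpenMP threads can spread over all cores
OMP_NUM_THREADS=6 mpirun -n 1 --report-bindings --bind-to none abacus
```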
