RT-TDDFT GPU Acceleration: RT-TD now fully support GPU computation #5773

Merged (45 commits) on Jan 22, 2025


@AsTonyshment (Collaborator) commented on Dec 26, 2024

Phase 1: Rewriting existing code using Tensor (complete)

This is merely a draft and does not represent the final code. Since Tensor can effectively support heterogeneous computing, the goal of the first phase is to rewrite the existing algorithms using Tensor. Currently, all memory is still explicitly allocated on the CPU (the parameter of the Tensor constructor is container::DeviceType::CpuDevice).

Phase 2: Adding needed BLAS and LAPACK support for Tensor on CPU and refactoring linear algebra operations in TDDFT (complete)

Key Changes:

  • Added template structs lapack_getrf and lapack_getri in module_base/module_container/ATen/kernels/lapack.h to support matrix LU factorization (getrf) and matrix inversion (getri) operations for Tensor objects.
  • Fixed original LAPACK function (zgetrf_ and zgetri_) declarations in module_base/lapack_connector.h to comply with standard conventions.
  • Fully implemented CPU-based BLAS and LAPACK support for Tensor operations in TDDFT. The linear algebra operations in the container::kernels module (module_base/module_container/ATen) take a Device parameter, so GPU implementations can be added in later phases without changing the calling code.
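
To make the Device-parameter pattern concrete, here is a minimal sketch of how a factorization kernel and an inversion kernel can be selected by a device tag. It is not the code from this PR: the names CpuTag, GpuTag, lu_factor_sketch, lu_invert_sketch and invert_sketch are hypothetical, and the CPU path uses a hand-written LU with partial pivoting in place of the LAPACK zgetrf_/zgetri_ calls so that the example stays self-contained.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical device tags; stand-ins for the real Device template argument.
struct CpuTag {};
struct GpuTag {};  // intentionally has no specialization in this sketch

using cplx = std::complex<double>;

// Primary templates are only declared; each device supplies a specialization.
template <typename T, typename Device> struct lu_factor_sketch;
template <typename T, typename Device> struct lu_invert_sketch;

// CPU: in-place LU factorization with partial pivoting (the role getrf plays).
// `a` is an n x n row-major matrix; piv[k] is the row swapped with row k.
template <typename T>
struct lu_factor_sketch<T, CpuTag> {
    void operator()(std::size_t n, T* a, std::vector<std::size_t>& piv) const {
        assert(a != nullptr);
        piv.resize(n);
        for (std::size_t k = 0; k < n; ++k) {
            std::size_t p = k;  // row holding the largest pivot candidate
            for (std::size_t i = k + 1; i < n; ++i)
                if (std::abs(a[i * n + k]) > std::abs(a[p * n + k])) p = i;
            if (std::abs(a[p * n + k]) == 0.0) throw std::runtime_error("singular matrix");
            piv[k] = p;
            if (p != k)
                for (std::size_t j = 0; j < n; ++j) std::swap(a[k * n + j], a[p * n + j]);
            for (std::size_t i = k + 1; i < n; ++i) {
                a[i * n + k] /= a[k * n + k];  // multiplier, stored in the L part
                for (std::size_t j = k + 1; j < n; ++j)
                    a[i * n + j] -= a[i * n + k] * a[k * n + j];
            }
        }
    }
};

// CPU: build the inverse from the LU factors (the role getri plays) by
// solving L U x = P e_c for every unit vector e_c.
template <typename T>
struct lu_invert_sketch<T, CpuTag> {
    void operator()(std::size_t n, const T* lu, const std::vector<std::size_t>& piv, T* inv) const {
        std::vector<T> x(n);
        for (std::size_t c = 0; c < n; ++c) {
            for (std::size_t i = 0; i < n; ++i) x[i] = (i == c) ? T(1) : T(0);
            for (std::size_t k = 0; k < n; ++k) std::swap(x[k], x[piv[k]]);  // apply row swaps
            for (std::size_t i = 0; i < n; ++i)  // forward substitution (unit lower triangle)
                for (std::size_t j = 0; j < i; ++j) x[i] -= lu[i * n + j] * x[j];
            for (std::size_t i = n; i-- > 0;) {  // back substitution (upper triangle)
                for (std::size_t j = i + 1; j < n; ++j) x[i] -= lu[i * n + j] * x[j];
                x[i] /= lu[i * n + i];
            }
            for (std::size_t i = 0; i < n; ++i) inv[i * n + c] = x[i];
        }
    }
};

// Device-generic caller: only the template argument changes between devices.
template <typename Device>
std::vector<cplx> invert_sketch(std::size_t n, std::vector<cplx> a) {
    std::vector<std::size_t> piv;
    std::vector<cplx> inv(n * n);
    lu_factor_sketch<cplx, Device>()(n, a.data(), piv);
    lu_invert_sketch<cplx, Device>()(n, a.data(), piv, inv.data());
    return inv;
}
```

Calling invert_sketch<CpuTag>(n, a) runs the CPU path. A GPU path would add specializations for GpuTag (for example backed by a vendor solver library) and leave invert_sketch unchanged, which is the point of carrying the Device parameter through the kernels.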

Phase 3: RT-TDDFT GPU acceleration core algorithm (complete)

  1. Added linear solver interfaces:

    • Implemented CPU-based linear solver (getrs) using LAPACK.
    • Added GPU-based LU decomposition (getrf) and linear solver (getrs) using cuSOLVER.
  2. Refactored RT-TDDFT I/O and parameters:

    • Removed static member variables (td_force_dt, td_vext, td_vext_dire_case, out_dipole, out_efield) from the Evolve_elec class.
    • Unified parameter access through the PARAM.inp input interface to simplify template class usage with Device parameter.
  3. Heterogeneous computing support:

    • Added Device template parameter to RT-TDDFT core algorithm classes and functions.
    • Introduced memory synchronization (e.g., base_device::memory::synchronize_memory_op) to ensure proper data handling across devices.
    • Replaced BlasConnector::copy operations with memory synchronization functions.
  4. GPU acceleration for RT-TDDFT:

    • Implemented core algorithm optimizations for GPU acceleration in RT-TDDFT.
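
As an illustration of the memory synchronization idea in item 3, the sketch below shows a copy functor selected by a (destination device, source device) pair. It is an assumption-laden stand-in, not the signature of base_device::memory::synchronize_memory_op: HostTag, AcceleratorTag, sync_memory_sketch and snapshot_sketch are hypothetical names, and only the host-to-host case is implemented.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical device tags used only for this illustration.
struct HostTag {};
struct AcceleratorTag {};  // no specialization provided in this sketch

// Primary template: one functor per (destination, source) device pair.
template <typename T, typename DeviceOut, typename DeviceIn>
struct sync_memory_sketch;

// Host-to-host specialization: a plain byte copy of `count` elements.
template <typename T>
struct sync_memory_sketch<T, HostTag, HostTag> {
    void operator()(T* dst, const T* src, std::size_t count) const {
        if (count == 0) return;
        assert(dst != nullptr && src != nullptr);
        std::memcpy(dst, src, count * sizeof(T));
    }
};

// Device-generic routine: the copy is expressed through the functor rather
// than through a CPU-only BLAS copy, so the same source line can serve any
// device that provides a specialization.
template <typename T, typename Device>
void snapshot_sketch(const T* src, T* dst, std::size_t count) {
    sync_memory_sketch<T, Device, Device>()(dst, src, count);
}
```

With this shape, a device-to-device or host-to-device copy becomes another specialization of the same functor, and code templated on Device does not need a separate branch for each backend.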

Phase 4: MPI multi-process compatibility (complete)

  1. Removed all ctx parameters in memory synchronization operations.
  2. Added support for outputting intermediate debugging information (e.g., Hamiltonian matrix H, overlap matrix S, and wave function psi_k) when device=gpu.
  3. Added MPI gather and distribute operations across processes, making the GPU version of the RT-TDDFT algorithm compatible with multi-process, multi-threaded (MPI+OpenMP) runs.

@AsTonyshment marked this pull request as draft on December 27, 2024 02:30
@mohanchen added the labels "GPU & DCU & HPC" (GPU and DCU and HPC related any issues) and "Features Needed" (The features are indeed needed, and developers should have sophisticated knowledge) on Dec 31, 2024
@AsTonyshment (Collaborator, Author) commented:

The current program has a bug that causes the data in psi to become all zeros after evolution. Debugging showed that the original deep copy of psi_k to tmp1 in source/module_hamilt_lcao/module_tddft/norm_psi.cpp had been inadvertently replaced with Tensor's CopyFrom.

Useful information:

  1. CopyFrom Method:
    The CopyFrom method currently performs a shallow copy, meaning it shares the underlying data buffer between the source and destination tensors. This can lead to unintended side effects if modifications are made to one tensor, as they will reflect in the other.

  2. Assignment Operator (=):
    The assignment operator (=) is correctly implemented to perform a deep copy, ensuring that the destination tensor gets its own independent copy of the data buffer. This behavior is consistent with expectations for a deep copy operation.
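
The difference between the two copy paths can be reproduced with a small stand-in class. This is a sketch only: MiniTensor, share_from and shares_buffer_with are hypothetical names that mimic the behaviour described above (a buffer-sharing copy method and a deep-copying assignment operator) and are not the Tensor API.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for a tensor with reference-counted storage.
class MiniTensor {
public:
    explicit MiniTensor(std::size_t n)
        : buf_(std::make_shared<std::vector<double>>(n, 0.0)) {}

    // Deep copy: the destination gets its own independent buffer.
    MiniTensor(const MiniTensor& other)
        : buf_(std::make_shared<std::vector<double>>(*other.buf_)) {}
    MiniTensor& operator=(const MiniTensor& other) {
        if (this != &other) buf_ = std::make_shared<std::vector<double>>(*other.buf_);
        return *this;
    }

    // Shallow copy: both objects point at the same buffer afterwards.
    void share_from(const MiniTensor& other) { buf_ = other.buf_; }

    double& operator[](std::size_t i) {
        assert(i < buf_->size());
        return (*buf_)[i];
    }
    double operator[](std::size_t i) const {
        assert(i < buf_->size());
        return (*buf_)[i];
    }
    bool shares_buffer_with(const MiniTensor& other) const { return buf_ == other.buf_; }

private:
    std::shared_ptr<std::vector<double>> buf_;
};
```

If a scratch tensor is filled through share_from and then overwritten, the source changes with it, which is the kind of side effect that can wipe out psi. Using assignment for the scratch copy keeps the source intact.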

@AsTonyshment (Collaborator, Author) commented:

After testing several parallel parameter combinations for Si-2 (small system) and Si-64 (large system), we conclude that the Tensor implementation on the CPU incurs almost no performance loss. In fact, it appears to be slightly faster than the previous implementation, especially for large systems. The test results are as follows:
[image: test results for Si-2 and Si-64]

@AsTonyshment changed the title from "RT-TDDFT GPU Acceleration (Phase 1): Rewriting existing code using Tensor" to "RT-TDDFT GPU Acceleration (Phase 2): Adding needed BLAS and LAPACK support for Tensor on CPU and refactoring linear algebra operations in TDDFT" on Jan 3, 2025
@Critsium-xy (Collaborator) commented:

LGTM👍, a good example showing the possibility of using tensor.

@AsTonyshment changed the title from "RT-TDDFT GPU Acceleration (Phase 1, 2, 3): RT-TD now has preliminary support for GPU computation" to "RT-TDDFT GPU Acceleration: RT-TD now has preliminary support for GPU computation" on Jan 17, 2025
@AsTonyshment marked this pull request as ready for review on January 17, 2025 12:00
@AsTonyshment marked this pull request as draft on January 18, 2025 06:40
@AsTonyshment changed the title from "RT-TDDFT GPU Acceleration: RT-TD now has preliminary support for GPU computation" to "RT-TDDFT GPU Acceleration: RT-TD now fully support GPU computation" on Jan 19, 2025
@AsTonyshment marked this pull request as ready for review on January 19, 2025 07:02
Resolved (outdated) review threads:
  • source/module_hamilt_lcao/module_tddft/upsi.cpp (3 threads)
  • source/module_hamilt_lcao/module_tddft/bandenergy.cpp (3 threads)
  • source/module_hamilt_lcao/module_tddft/bandenergy.h (1 thread)
  • source/module_hamilt_lcao/module_tddft/evolve_psi.cpp (1 thread)
  • source/module_hamilt_lcao/module_tddft/propagator.cpp (1 thread)
@mohanchen merged commit 3f8fe4f into deepmodeling:develop on Jan 22, 2025
14 checks passed
@AsTonyshment deleted the TDDFT_GPU_phase_1 branch on January 22, 2025 04:30
4 participants