POC: neighbor search #61
Conversation
You've structured the kernel so that every thread computes only a single interaction:

```cuda
const int32_t index = blockIdx.x * blockDim.x + threadIdx.x;
if (index >= num_all_pairs) return;

int32_t row = floor((sqrtf(8 * index + 1) + 1) / 2);
if (row * (row - 1) > 2 * index) row--;
const int32_t column = index - row * (row - 1) / 2;

const scalar_t delta_x = positions[row][0] - positions[column][0];
const scalar_t delta_y = positions[row][1] - positions[column][1];
const scalar_t delta_z = positions[row][2] - positions[column][2];
```

Usually it's better to use a smaller number of thread blocks and have each thread loop over interactions. For one thing, there's overhead to each thread block. For another, it allows lots of optimization. In the above code, if you can arrange that each thread will compute multiple pairs all in the same row, then you can skip the row and column computations, and you only need to load the row's position once. A sketch of this idea is shown below.

Of course, it all depends what size you're optimizing for. With 50 atoms, the number of pairs is much too small to fill a large GPU even with only one pair per thread. For larger systems with thousands of atoms and millions of pairs, it will make more of a difference.
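A minimal sketch of what "one thread owns a row" could look like. This is not code from the PR: the flat `positions`/`deltas` layout, the kernel name, and the parameter names are all assumptions made for illustration. Each thread grid-strides over rows, keeps the row's coordinates in registers, and loops over the columns of that row, so the `sqrtf`-based index decoding disappears entirely:

```cuda
#include <cstdint>

// Hypothetical kernel: positions is a flat [num_atoms][3] array,
// deltas is a flat [num_pairs][3] array for pairs (row, column) with column < row.
template <typename scalar_t>
__global__ void pairwise_deltas_by_row(const scalar_t* __restrict__ positions,
                                       scalar_t* __restrict__ deltas,
                                       int32_t num_atoms) {
    // Grid-stride loop over rows: a small, fixed number of blocks is enough,
    // and each thread reuses its row position for every column it visits.
    for (int32_t row = blockIdx.x * blockDim.x + threadIdx.x;
         row < num_atoms;
         row += gridDim.x * blockDim.x) {
        const scalar_t row_x = positions[3 * row + 0];
        const scalar_t row_y = positions[3 * row + 1];
        const scalar_t row_z = positions[3 * row + 2];
        // Linear index of the first pair in this row of the lower triangle.
        const int32_t row_offset = row * (row - 1) / 2;
        for (int32_t column = 0; column < row; column++) {
            const int32_t pair = row_offset + column;
            deltas[3 * pair + 0] = row_x - positions[3 * column + 0];
            deltas[3 * pair + 1] = row_y - positions[3 * column + 1];
            deltas[3 * pair + 2] = row_z - positions[3 * column + 2];
        }
    }
}
```

Note the trade-off: rows have unequal work (row `r` has `r` columns), so warps are imbalanced; a production kernel would tile the triangle more evenly, but the sketch shows how looping within a row removes the per-pair index arithmetic and the repeated loads of the row position.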
This PR is discontinued. The code is being moved to NNPOps (openmm/NNPOps#58)
This is obsolete.
This is a proof-of-concept. DO NOT MERGE!