
Further optimisations for gpu_hist. #4283

Merged: 3 commits merged into dmlc:master on Mar 24, 2019

Conversation

RAMitchell (Member):

  • Fuse final update position functions into a single more efficient kernel

  • Refactor gpu_hist with a more explicit ellpack matrix representation

@RAMitchell (Member, Author):

A quick review of my optimisation work over the last month. My improvements have been:

  • Combine or remove small host/device transfers
  • Use streams to overlap kernels in split evaluation (see the stream-overlap sketch after this list)
  • Use streams to overlap memory transfers in position updating
  • Fuse final position updating kernels into a single kernel
  • Find and remove unnecessary host/device transfers between boosting iterations
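A minimal, self-contained C++/CUDA sketch of the stream-based items above (kernel and variable names are illustrative, not the actual gpu_hist code): independent per-node split-evaluation kernels go on separate streams so they can overlap, and a device-to-host copy into pinned memory overlaps with kernels still running on other streams.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

// Stand-in for a per-node split-evaluation kernel; small kernels like this
// underutilise the GPU when serialised on a single stream.
__global__ void EvaluateSplits(const float* hist, float* best, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) best[i] = hist[i] * 0.5f;  // placeholder for gain computation
}

int main() {
  constexpr int kNodes = 4;      // e.g. nodes expanded at this tree level
  constexpr int kBins = 1 << 20;

  std::vector<cudaStream_t> streams(kNodes);
  for (auto& s : streams) cudaStreamCreate(&s);

  float *hist = nullptr, *best = nullptr;
  cudaMalloc(reinterpret_cast<void**>(&hist), kNodes * kBins * sizeof(float));
  cudaMalloc(reinterpret_cast<void**>(&best), kNodes * kBins * sizeof(float));
  cudaMemset(hist, 0, kNodes * kBins * sizeof(float));  // zero histograms for the demo

  // One stream per node: independent small kernels can then overlap instead
  // of serialising on the default stream.
  for (int node = 0; node < kNodes; ++node) {
    EvaluateSplits<<<(kBins + 255) / 256, 256, 0, streams[node]>>>(
        hist + node * kBins, best + node * kBins, kBins);
  }

  // Overlap a device-to-host transfer with kernels still running on the other
  // streams; the destination must be pinned for the copy to be asynchronous.
  float* host_best = nullptr;
  cudaMallocHost(reinterpret_cast<void**>(&host_best), kBins * sizeof(float));
  cudaMemcpyAsync(host_best, best, kBins * sizeof(float),
                  cudaMemcpyDeviceToHost, streams[0]);

  for (auto& s : streams) { cudaStreamSynchronize(s); cudaStreamDestroy(s); }
  std::printf("best[0] = %f\n", host_best[0]);

  cudaFree(hist);
  cudaFree(best);
  cudaFreeHost(host_best);
  return 0;
}
```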

On a 10M*100 dense input matrix, boosting for 500 iterations, the performance improvement is approximately:
1 GPU: 28.2%
2 GPUs: 33.5%
8 GPUs: 40.9%

In particular, multi-GPU scalability seems to have improved considerably.

@@ -485,10 +485,10 @@ class LearnerImpl : public Learner {
  this->PerformTreeMethodHeuristic(train);

  monitor_.Start("PredictRaw");
- this->PredictRaw(train, &preds_);
+ this->PredictRaw(train, &preds_[train]);
Contributor:

Given that DMatrix pointers are increasingly used as cache indices, the parameter should probably be changed from DMatrix* to shared_ptr<DMatrix> in all those places. We can then use weak_ptr<DMatrix> as the index into the cache.

This can be done in another pull request, however.
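
A rough host-side sketch of that idea, assuming hypothetical PredictionCache/PredictionCacheEntry types (not the actual xgboost API): entries are looked up by the raw pointer, but each entry also keeps a weak_ptr so a destroyed DMatrix can be detected and evicted, rather than a recycled pointer value silently aliasing a stale entry.

```cpp
#include <map>
#include <memory>
#include <vector>

struct DMatrix {};  // stand-in for xgboost::DMatrix; illustrative only

// Hypothetical cache entry: the weak_ptr observes the DMatrix without
// extending its lifetime.
struct PredictionCacheEntry {
  std::weak_ptr<DMatrix> ref;
  std::vector<float> predictions;  // cached raw predictions for this DMatrix
};

class PredictionCache {
 public:
  // Passing shared_ptr<DMatrix> through the API (as suggested above) is what
  // makes the weak_ptr available at insertion time.
  std::vector<float>& Entry(std::shared_ptr<DMatrix> const& m) {
    auto& entry = cache_[m.get()];
    entry.ref = m;
    return entry.predictions;
  }

  // Drop entries whose DMatrix has already been destroyed.
  void EvictExpired() {
    for (auto it = cache_.begin(); it != cache_.end();) {
      it = it->second.ref.expired() ? cache_.erase(it) : ++it;
    }
  }

 private:
  std::map<DMatrix const*, PredictionCacheEntry> cache_;
};
```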

GradientSumT* d_node_hist,
const GradientPair* d_gpair,
size_t segment_begin, size_t n_elements) {
  extern __shared__ char smem[];
  GradientSumT* smem_arr = reinterpret_cast<GradientSumT*>(smem);  // NOLINT
- for (auto i : dh::BlockStrideRange(0, null_gidx_value)) {
+ for (auto i : dh::BlockStrideRange(0, matrix.null_gidx_value)) {
Contributor:

Could you add a function like matrix.BinCount(), just to make the code more readable? null_gidx_value can then be reserved for cases where it means 'no value'.
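
A minimal sketch of what such an accessor could look like, assuming a simplified ellpack-style view struct (field and type names are illustrative, not the actual xgboost types): the histogram loop bound then reads as a bin count, while null_gidx_value stays reserved for its 'no value' meaning.

```cuda
// Hypothetical sketch of the suggested accessor; names are illustrative and
// not the actual xgboost ellpack implementation.
struct EllpackMatrixView {
  int null_gidx_value;  // sentinel gidx meaning "no value" (one past the last bin)
  int row_stride;       // entries per row in the compressed layout
  const int* gidx;      // compressed gradient-index entries, row-major

  // The sentinel is defined as the bin count, so expose the intent behind the
  // histogram loop bound explicitly.
  __host__ __device__ int BinCount() const { return null_gidx_value; }

  __host__ __device__ bool IsMissing(int gidx_value) const {
    return gidx_value == null_gidx_value;
  }
};

// The shared-memory histogram loop above would then read:
//   for (auto i : dh::BlockStrideRange(0, matrix.BinCount())) { ... }
// while comparisons against missing entries keep using matrix.null_gidx_value
// (or matrix.IsMissing(g)).
```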

@RAMitchell RAMitchell merged commit 6d5b34d into dmlc:master Mar 24, 2019
@hcho3 hcho3 mentioned this pull request Apr 21, 2019
@lock lock bot locked as resolved and limited conversation to collaborators Jun 22, 2019