
Further optimisations for gpu_hist. #4283

Merged: 3 commits merged into dmlc:master on Mar 24, 2019

Conversation

RAMitchell (Member):

  • Fuse final update position functions into a single more efficient kernel

  • Refactor gpu_hist with a more explicit ellpack matrix representation

@RAMitchell (Member, Author):

A quick review of my optimisation work over the last month. My improvements have been:

  • Combine or remove small host/device transfers
  • Use streams to overlap kernels in split evaluation (see the stream-overlap sketch after this list)
  • Use streams to overlap memory transfers in position updating
  • Fuse final position updating kernels into a single kernel
  • Find and remove unnecessary host/device transfers between boosting iterations
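A minimal, self-contained C++/CUDA sketch of the stream-based items above (kernel and variable names are illustrative, not the actual gpu_hist code): independent per-node split-evaluation kernels go on separate streams so they can overlap, and a device-to-host copy into pinned memory overlaps with kernels still running on other streams.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

// Stand-in for a per-node split-evaluation kernel; small kernels like this
// underutilise the GPU when serialised on a single stream.
__global__ void EvaluateSplits(const float* hist, float* best, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) best[i] = hist[i] * 0.5f;  // placeholder for gain computation
}

int main() {
  constexpr int kNodes = 4;      // e.g. nodes expanded at this tree level
  constexpr int kBins = 1 << 20;

  std::vector<cudaStream_t> streams(kNodes);
  for (auto& s : streams) cudaStreamCreate(&s);

  float *hist = nullptr, *best = nullptr;
  cudaMalloc(reinterpret_cast<void**>(&hist), kNodes * kBins * sizeof(float));
  cudaMalloc(reinterpret_cast<void**>(&best), kNodes * kBins * sizeof(float));
  cudaMemset(hist, 0, kNodes * kBins * sizeof(float));  // zero histograms for the demo

  // One stream per node: independent small kernels can then overlap instead
  // of serialising on the default stream.
  for (int node = 0; node < kNodes; ++node) {
    EvaluateSplits<<<(kBins + 255) / 256, 256, 0, streams[node]>>>(
        hist + node * kBins, best + node * kBins, kBins);
  }

  // Overlap a device-to-host transfer with kernels still running on the other
  // streams; the destination must be pinned for the copy to be asynchronous.
  float* host_best = nullptr;
  cudaMallocHost(reinterpret_cast<void**>(&host_best), kBins * sizeof(float));
  cudaMemcpyAsync(host_best, best, kBins * sizeof(float),
                  cudaMemcpyDeviceToHost, streams[0]);

  for (auto& s : streams) { cudaStreamSynchronize(s); cudaStreamDestroy(s); }
  std::printf("best[0] = %f\n", host_best[0]);

  cudaFree(hist);
  cudaFree(best);
  cudaFreeHost(host_best);
  return 0;
}
```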

On a 10M*100 dense input matrix, boosting for 500 iterations, the performance improvement is approximately:
1 GPU: 28.2%
2 GPUs: 33.5%
8 GPUs: 40.9%

In particular, multi-GPU scalability seems to have improved considerably.

@@ -485,10 +485,10 @@ class LearnerImpl : public Learner {
  this->PerformTreeMethodHeuristic(train);

  monitor_.Start("PredictRaw");
- this->PredictRaw(train, &preds_);
+ this->PredictRaw(train, &preds_[train]);
Contributor:

Given that DMatrix pointers are increasingly used as cache indices, the parameter should probably be changed from DMatrix* to shared_ptr<DMatrix> in all those places. We can then use weak_ptr<DMatrix> as the index into the cache.

This can be done in another pull request, however.
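
A rough host-side sketch of that idea, assuming hypothetical PredictionCache/PredictionCacheEntry types (not the actual xgboost API): entries are looked up by the raw pointer, but each entry also keeps a weak_ptr so a destroyed DMatrix can be detected and evicted, rather than a recycled pointer value silently aliasing a stale entry.

```cpp
#include <map>
#include <memory>
#include <vector>

struct DMatrix {};  // stand-in for xgboost::DMatrix; illustrative only

// Hypothetical cache entry: the weak_ptr observes the DMatrix without
// extending its lifetime.
struct PredictionCacheEntry {
  std::weak_ptr<DMatrix> ref;
  std::vector<float> predictions;  // cached raw predictions for this DMatrix
};

class PredictionCache {
 public:
  // Passing shared_ptr<DMatrix> through the API (as suggested above) is what
  // makes the weak_ptr available at insertion time.
  std::vector<float>& Entry(std::shared_ptr<DMatrix> const& m) {
    auto& entry = cache_[m.get()];
    entry.ref = m;
    return entry.predictions;
  }

  // Drop entries whose DMatrix has already been destroyed.
  void EvictExpired() {
    for (auto it = cache_.begin(); it != cache_.end();) {
      it = it->second.ref.expired() ? cache_.erase(it) : ++it;
    }
  }

 private:
  std::map<DMatrix const*, PredictionCacheEntry> cache_;
};
```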

GradientSumT* d_node_hist,
const GradientPair* d_gpair,
size_t segment_begin, size_t n_elements) {
  extern __shared__ char smem[];
  GradientSumT* smem_arr = reinterpret_cast<GradientSumT*>(smem);  // NOLINT
- for (auto i : dh::BlockStrideRange(0, null_gidx_value)) {
+ for (auto i : dh::BlockStrideRange(0, matrix.null_gidx_value)) {
Contributor:

Could you add a function like matrix.BinCount(), just to make the code more readable? null_gidx_value can then be reserved for cases where it means 'no value'.
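
A minimal sketch of what such an accessor could look like, assuming a simplified ellpack-style view struct (field and type names are illustrative, not the actual xgboost types): the histogram loop bound then reads as a bin count, while null_gidx_value stays reserved for its 'no value' meaning.

```cuda
// Hypothetical sketch of the suggested accessor; names are illustrative and
// not the actual xgboost ellpack implementation.
struct EllpackMatrixView {
  int null_gidx_value;  // sentinel gidx meaning "no value" (one past the last bin)
  int row_stride;       // entries per row in the compressed layout
  const int* gidx;      // compressed gradient-index entries, row-major

  // The sentinel is defined as the bin count, so expose the intent behind the
  // histogram loop bound explicitly.
  __host__ __device__ int BinCount() const { return null_gidx_value; }

  __host__ __device__ bool IsMissing(int gidx_value) const {
    return gidx_value == null_gidx_value;
  }
};

// The shared-memory histogram loop above would then read:
//   for (auto i : dh::BlockStrideRange(0, matrix.BinCount())) { ... }
// while comparisons against missing entries keep using matrix.null_gidx_value
// (or matrix.IsMissing(g)).
```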

@RAMitchell RAMitchell merged commit 6d5b34d into dmlc:master Mar 24, 2019
@hcho3 hcho3 mentioned this pull request Apr 21, 2019
@lock lock bot locked as resolved and limited conversation to collaborators Jun 22, 2019