Support both row-wise and col-wise multi-threading #2699

Merged · 33 commits merged into master from sparse_bin_clean on Feb 2, 2020
Conversation

@guolinke (Collaborator) commented Jan 20, 2020:

This continues #216.

Before this PR, LightGBM only supported col-wise multi-threading, which can be inefficient for sparse data.
Row-wise multi-threading is efficient for sparse data, but its overhead (extra per-thread histograms and the cost of merging them) grows with num_threads.

Since neither strategy is ideal in all cases, this PR implements both and automatically chooses the faster one at run time.

Two new parameters are added: force_col_wise and force_row_wise.
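For intuition, below is a minimal C++/OpenMP sketch of the two strategies. It is illustrative only, not LightGBM's actual histogram code: the function and variable names are made up, and it only accumulates gradient sums. Col-wise parallelizes over features and needs no merge; row-wise parallelizes over rows with per-thread histogram copies that must be merged afterwards, which is where the num_threads-dependent overhead comes from.

```cpp
#include <omp.h>
#include <cstdint>
#include <vector>

// Col-wise: one thread per feature; each thread owns its feature's histogram,
// so no merge is needed, but sparse features give uneven work per thread.
void ConstructColWise(const std::vector<std::vector<uint8_t>>& bins,   // [feature][row]
                      const std::vector<double>& grad,
                      std::vector<std::vector<double>>* hist) {        // [feature][bin]
  #pragma omp parallel for schedule(static)
  for (int f = 0; f < static_cast<int>(bins.size()); ++f) {
    for (size_t i = 0; i < grad.size(); ++i) {
      (*hist)[f][bins[f][i]] += grad[i];
    }
  }
}

// Row-wise: threads split the rows and each keeps a private copy of all
// histograms; the copies are merged at the end, so memory and merge cost
// grow with num_threads, but sparse rows are handled efficiently.
void ConstructRowWise(const std::vector<std::vector<uint8_t>>& bins,
                      const std::vector<double>& grad,
                      std::vector<std::vector<double>>* hist) {
  const int num_threads = omp_get_max_threads();
  std::vector<std::vector<std::vector<double>>> local(num_threads, *hist);
  #pragma omp parallel
  {
    const int tid = omp_get_thread_num();
    #pragma omp for schedule(static)
    for (int i = 0; i < static_cast<int>(grad.size()); ++i) {
      for (size_t f = 0; f < bins.size(); ++f) {
        local[tid][f][bins[f][i]] += grad[i];
      }
    }
  }
  // Merge step: this is the overhead that the automatic run-time choice weighs
  // against the col-wise strategy.
  for (int t = 0; t < num_threads; ++t) {
    for (size_t f = 0; f < bins.size(); ++f) {
      for (size_t b = 0; b < (*hist)[f].size(); ++b) {
        (*hist)[f][b] += local[t][f][b];
      }
    }
  }
}
```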

Other changes in this PR:

  1. Remove cnt from Histogram to reduce the histogram merge cost.
  2. A better Timer for profiling run-time cost.

Benchmark:

| Data | v2.3.1 | Master | This PR | Speed-up vs. v2.3.1 |
| --- | --- | --- | --- | --- |
| Higgs | 251.237950 s | 192.107919 s | 132.967057 s | 1.88x |
| Yahoo LTR | 163.160723 s | 130.946914 s | 108.800116 s | 1.50x |
| MS LTR | 356.769460 s | 249.381964 s | 186.277296 s | 1.91x |
| Expo | 122.580457 s | 115.084357 s | 68.163928 s | 1.79x |
| Allstate | 294.604991 s | 335.108286 s | 149.482392 s | 1.97x |

Run with 16 threads on an Azure ND24s VM.

Todo:

  • fix tests
  • fix GPU code

Update: the lint fixes will go into a new PR.

@guolinke (Collaborator, Author):
@Laurae2 for more benchmarks.

@guolinke changed the title to "[WIP] Support both row-wise and col-wise multi-threading" on Jan 20, 2020
@@ -437,6 +404,8 @@ R""()
// thread 8, 9, 10, 11, 12, 13, 14, 15 now process feature 0, 1, 2, 3, 4, 5, 6, 7's gradients for example 8, 9, 10, 11, 12, 13, 14, 15
#if CONST_HESSIAN == 0
atomic_local_add_f(gh_hist + addr2, stat2);
#else
atom_inc((uint*)(gh_hist + addr2));
@guolinke (Collaborator, Author):
@huanzhang12 is this addition okay, or should it simply be atomic_local_add_f(gh_hist + addr2, 1.0f)?

@huanzhang12 (Contributor):
If the content at address (gh_hist + addr2) is an integer, we can use atom_inc(). If it is a floating-point number, we must use atomic_local_add_f(gh_hist + addr2, 1.0f). I am not sure whether you want to store a float or an int here?

atomic_local_add_f() can be hundreds of times slower than atom_inc(), since there is no native floating-point atomic support on GPUs. So use atom_inc() if possible (but we must make sure the content is an integer).
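To illustrate why the two operations differ so much in cost, here is a host-side C++ analogy using std::atomic (not the OpenCL kernel code itself): an integer increment maps to a single hardware fetch-and-add, whereas a float add has to be emulated with a compare-and-swap retry loop, which is the same reason atomic_local_add_f() is slow on hardware without native float atomics.

```cpp
#include <atomic>
#include <cstdint>
#include <cstring>

// Integer counter: one native atomic instruction, analogous to atom_inc().
void IncCount(std::atomic<uint32_t>* cnt) {
  cnt->fetch_add(1, std::memory_order_relaxed);
}

// Float accumulator: emulated with a CAS retry loop, analogous to how a
// float atomic add must be implemented when the hardware lacks native
// support. Under contention this loop may retry many times per update.
void AddFloat(std::atomic<uint32_t>* bits, float v) {
  uint32_t old_bits = bits->load(std::memory_order_relaxed);
  for (;;) {
    float old_val;
    std::memcpy(&old_val, &old_bits, sizeof(float));
    float new_val = old_val + v;
    uint32_t new_bits;
    std::memcpy(&new_bits, &new_val, sizeof(float));
    // On failure, compare_exchange_weak reloads old_bits with the current value.
    if (bits->compare_exchange_weak(old_bits, new_bits)) break;
  }
}
```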

@guolinke (Collaborator, Author):
Could we store it as an int first and use atom_inc(), then multiply it by hessians[0] and convert it to float in-place?


// now thread 0 - 7 holds feature 0 - 7's gradient for bin 0 and counter bin 0
// now thread 8 - 15 holds feature 0 - 7's hessian for bin 0 and counter bin 1
// now thread 16- 23 holds feature 0 - 7's gradient for bin 1 and counter bin 2
// now thread 24- 31 holds feature 0 - 7's hessian for bin 1 and counter bin 3
// etc,

// FIXME: correct way to fix hessians
#if CONST_HESSIAN == 1
@guolinke (Collaborator, Author):
@huanzhang12 could you help with fixing the hessians here?

@huanzhang12 (Contributor):
Can you elaborate a bit on what you want to do here? Do we need to add anything new to the histogram?

@guolinke (Collaborator, Author):
Thanks! Since the count is removed, the naive solution is to accumulate into gh_hist as before and drop cnt_hist. However, for const-hessian cases we can accumulate +1 into h_hist first and then multiply all of h_hist by hessians[0], assuming +1 is faster than +hessians[i]. Therefore, here we need to multiply all of h_hist by hessians[0].
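A minimal sketch of the fix-up pass described above, written in plain C++ for readability (not the actual histogram16.cl kernel; the names are illustrative): the hessian histogram is filled with cheap integer +1 increments, and a single pass afterwards turns each count into count * hessians[0], writing the float result back in-place.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// h_hist_bits holds the raw 32-bit slots of the hessian histogram, which were
// accumulated as integer counts (the fast atom_inc() path).
void FixUpConstHessian(std::vector<uint32_t>* h_hist_bits, float const_hessian) {
  for (uint32_t& slot : *h_hist_bits) {
    // count * hessians[0], stored back into the same slot as a float.
    float value = static_cast<float>(slot) * const_hessian;
    std::memcpy(&slot, &value, sizeof(float));
  }
}
```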

@guolinke (Collaborator, Author):
@huanzhang12 I made some changes to histogram16.cl; could you help review them?

@guolinke (Collaborator, Author):
@jameslamb could you help with the R tests?

@StrikerRUS (Collaborator):
Reminder to myself: run the tests against the swapped compilers before merging.

@jameslamb (Collaborator):

> @jameslamb could you help with the R tests?

Sure! Want me to push to this PR or make a separate one?

@guolinke (Collaborator, Author):
@jameslamb you can push directly to this PR.
Ping @huanzhang12 for the GPU part.


#include <cstdint>
#include <cstring>
#include <omp.h>
@StrikerRUS (Collaborator) commented Jan 31, 2020:
@guolinke Shouldn't this include (here and in all other files) be handled via openmp_wrapper so as not to break building the threadless version?
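For context, a wrapper header like this typically guards the include so that builds without OpenMP still compile. The sketch below shows the general pattern only; it is not a copy of LightGBM's openmp_wrapper.h, and the stub list is illustrative.

```cpp
// openmp-wrapper-style guard (illustrative pattern, not the real header):
// include this instead of <omp.h> everywhere, so threadless builds get
// single-thread stubs instead of a missing-header error.
#ifdef _OPENMP
  #include <omp.h>
#else
  inline int omp_get_max_threads() { return 1; }
  inline int omp_get_thread_num()  { return 0; }
  inline int omp_get_num_threads() { return 1; }
#endif
```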

@guolinke (Collaborator, Author) commented Feb 1, 2020:

@StrikerRUS do you know why test_sklearn_integration fails randomly?

Update: it seems -mavx2 causes the failure: 1c24236 — this commit passes.

@guolinke (Collaborator, Author) commented Feb 1, 2020:
Also refer to scikit-learn/scikit-learn#14106.

@guolinke changed the title from "[WIP] Support both row-wise and col-wise multi-threading" to "Support both row-wise and col-wise multi-threading" on Feb 1, 2020
@StrikerRUS (Collaborator):
@guolinke

> do you know why test_sklearn_integration fails randomly?

Nope, I have never seen this test fail before. Is AVX2 critical for this PR?

@guolinke (Collaborator, Author) commented Feb 2, 2020:
@StrikerRUS I benchmarked that; interestingly, disabling AVX2 is slightly faster.

@guolinke (Collaborator, Author) commented Feb 2, 2020:
@StrikerRUS I am going to merge this PR; do we need to swap the compilers back in CI?

@guolinke merged commit 509c2e5 into master on Feb 2, 2020
@guolinke deleted the sparse_bin_clean branch on February 2, 2020 at 04:47
@StrikerRUS (Collaborator):
@guolinke

> I benchmarked that; interestingly, disabling AVX2 is slightly faster.

Hmm, really interesting...

> I am going to merge this PR; do we need to swap the compilers back in CI?

Yes. As it's already merged, I'll do this in a separate PR.
