Support both row-wise and col-wise multi-threading #2699

Merged · 33 commits merged into master from sparse_bin_clean on Feb 2, 2020
Conversation

@guolinke (Collaborator) commented Jan 20, 2020:

This continues #216.

Before this PR, LightGBM only supported col-wise multi-threading, which can be inefficient for sparse data.
Row-wise multi-threading is efficient for sparse data, but its overhead (extra per-thread histograms and the cost of merging them) grows with num_threads.

Since neither strategy is ideal in all cases, this PR implements both and automatically chooses the faster one at run time.

Two new parameters are added: force_col_wise and force_row_wise.
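For intuition, below is a minimal C++/OpenMP sketch of the two strategies. It is illustrative only, not LightGBM's actual histogram code: the function and variable names are made up, and it only accumulates gradient sums. Col-wise parallelizes over features and needs no merge; row-wise parallelizes over rows with per-thread histogram copies that must be merged afterwards, which is where the num_threads-dependent overhead comes from.

```cpp
#include <omp.h>
#include <cstdint>
#include <vector>

// Col-wise: one thread per feature; each thread owns its feature's histogram,
// so no merge is needed, but sparse features give uneven work per thread.
void ConstructColWise(const std::vector<std::vector<uint8_t>>& bins,   // [feature][row]
                      const std::vector<double>& grad,
                      std::vector<std::vector<double>>* hist) {        // [feature][bin]
  #pragma omp parallel for schedule(static)
  for (int f = 0; f < static_cast<int>(bins.size()); ++f) {
    for (size_t i = 0; i < grad.size(); ++i) {
      (*hist)[f][bins[f][i]] += grad[i];
    }
  }
}

// Row-wise: threads split the rows and each keeps a private copy of all
// histograms; the copies are merged at the end, so memory and merge cost
// grow with num_threads, but sparse rows are handled efficiently.
void ConstructRowWise(const std::vector<std::vector<uint8_t>>& bins,
                      const std::vector<double>& grad,
                      std::vector<std::vector<double>>* hist) {
  const int num_threads = omp_get_max_threads();
  std::vector<std::vector<std::vector<double>>> local(num_threads, *hist);
  #pragma omp parallel
  {
    const int tid = omp_get_thread_num();
    #pragma omp for schedule(static)
    for (int i = 0; i < static_cast<int>(grad.size()); ++i) {
      for (size_t f = 0; f < bins.size(); ++f) {
        local[tid][f][bins[f][i]] += grad[i];
      }
    }
  }
  // Merge step: this is the overhead that the automatic run-time choice weighs
  // against the col-wise strategy.
  for (int t = 0; t < num_threads; ++t) {
    for (size_t f = 0; f < bins.size(); ++f) {
      for (size_t b = 0; b < (*hist)[f].size(); ++b) {
        (*hist)[f][b] += local[t][f][b];
      }
    }
  }
}
```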

Other changes in this PR:

  1. Remove cnt from Histogram to reduce the histogram merge cost.
  2. A better Timer for profiling run-time cost.

Benchmark:

| Data | v2.3.1 | Master | This PR | Speed-up vs. v2.3.1 |
| --- | --- | --- | --- | --- |
| Higgs | 251.237950 s | 192.107919 s | 132.967057 s | 1.88x |
| Yahoo LTR | 163.160723 s | 130.946914 s | 108.800116 s | 1.50x |
| MS LTR | 356.769460 s | 249.381964 s | 186.277296 s | 1.91x |
| Expo | 122.580457 s | 115.084357 s | 68.163928 s | 1.79x |
| Allstate | 294.604991 s | 335.108286 s | 149.482392 s | 1.97x |

Run with 16 threads on an Azure ND24s VM.

Todo:

  • fix tests
  • fix GPU code

Update: the lint fixes will go into a new PR.

@guolinke (Collaborator, Author):
@Laurae2 for more benchmarks.

@guolinke changed the title to "[WIP] Support both row-wise and col-wise multi-threading" on Jan 20, 2020
@@ -437,6 +404,8 @@ R""()
// thread 8, 9, 10, 11, 12, 13, 14, 15 now process feature 0, 1, 2, 3, 4, 5, 6, 7's gradients for example 8, 9, 10, 11, 12, 13, 14, 15
#if CONST_HESSIAN == 0
atomic_local_add_f(gh_hist + addr2, stat2);
#else
atom_inc((uint*)(gh_hist + addr2));
@guolinke (Collaborator, Author):
@huanzhang12 is this addition okay, or should it simply be atomic_local_add_f(gh_hist + addr2, 1.0f)?

@huanzhang12 (Contributor):
If the content at address (gh_hist + addr2) is an integer, we can use atom_inc(). If it is a floating-point number, we must use atomic_local_add_f(gh_hist + addr2, 1.0f). I am not sure whether you want to store a float or an int here?

atomic_local_add_f() can be hundreds of times slower than atom_inc(), since there is no native floating-point atomic support on GPUs. So use atom_inc() if possible (but we must make sure the content is an integer).
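To illustrate why the two operations differ so much in cost, here is a host-side C++ analogy using std::atomic (not the OpenCL kernel code itself): an integer increment maps to a single hardware fetch-and-add, whereas a float add has to be emulated with a compare-and-swap retry loop, which is the same reason atomic_local_add_f() is slow on hardware without native float atomics.

```cpp
#include <atomic>
#include <cstdint>
#include <cstring>

// Integer counter: one native atomic instruction, analogous to atom_inc().
void IncCount(std::atomic<uint32_t>* cnt) {
  cnt->fetch_add(1, std::memory_order_relaxed);
}

// Float accumulator: emulated with a CAS retry loop, analogous to how a
// float atomic add must be implemented when the hardware lacks native
// support. Under contention this loop may retry many times per update.
void AddFloat(std::atomic<uint32_t>* bits, float v) {
  uint32_t old_bits = bits->load(std::memory_order_relaxed);
  for (;;) {
    float old_val;
    std::memcpy(&old_val, &old_bits, sizeof(float));
    float new_val = old_val + v;
    uint32_t new_bits;
    std::memcpy(&new_bits, &new_val, sizeof(float));
    // On failure, compare_exchange_weak reloads old_bits with the current value.
    if (bits->compare_exchange_weak(old_bits, new_bits)) break;
  }
}
```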

@guolinke (Collaborator, Author):
Could we store it as an int first and use atom_inc(), then multiply it by hessians[0] and convert it to float in-place?


// now thread 0 - 7 holds feature 0 - 7's gradient for bin 0 and counter bin 0
// now thread 8 - 15 holds feature 0 - 7's hessian for bin 0 and counter bin 1
// now thread 16- 23 holds feature 0 - 7's gradient for bin 1 and counter bin 2
// now thread 24- 31 holds feature 0 - 7's hessian for bin 1 and counter bin 3
// etc,

// FIXME: correct way to fix hessians
#if CONST_HESSIAN == 1
@guolinke (Collaborator, Author):
@huanzhang12 could you help with fixing the hessians here?

@huanzhang12 (Contributor):
Can you elaborate a bit on what you want to do here? Do we need to add anything new to the histogram?

@guolinke (Collaborator, Author):
Thanks! Since the count is removed, the naive solution is to accumulate into gh_hist as before and drop cnt_hist. However, for const-hessian cases we can accumulate +1 into h_hist first and then multiply all of h_hist by hessians[0], assuming +1 is faster than +hessians[i]. Therefore, here we need to multiply all of h_hist by hessians[0].
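A minimal sketch of the fix-up pass described above, written in plain C++ for readability (not the actual histogram16.cl kernel; the names are illustrative): the hessian histogram is filled with cheap integer +1 increments, and a single pass afterwards turns each count into count * hessians[0], writing the float result back in-place.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// h_hist_bits holds the raw 32-bit slots of the hessian histogram, which were
// accumulated as integer counts (the fast atom_inc() path).
void FixUpConstHessian(std::vector<uint32_t>* h_hist_bits, float const_hessian) {
  for (uint32_t& slot : *h_hist_bits) {
    // count * hessians[0], stored back into the same slot as a float.
    float value = static_cast<float>(slot) * const_hessian;
    std::memcpy(&slot, &value, sizeof(float));
  }
}
```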

@guolinke (Collaborator, Author):
@huanzhang12 I made some changes to histogram16.cl; could you help review them?

@guolinke (Collaborator, Author):
@jameslamb could you help with the R tests?

@StrikerRUS (Collaborator):
Reminder to myself: run the tests against the swapped compilers before merging.

@jameslamb (Collaborator):

> @jameslamb could you help with the R tests?

Sure! Want me to push to this PR or make a separate one?

@guolinke (Collaborator, Author):
@jameslamb you can push directly to this PR.
Ping @huanzhang12 for the GPU part.


#include <cstdint>
#include <cstring>
#include <omp.h>
@StrikerRUS (Collaborator) commented Jan 31, 2020:
@guolinke Shouldn't this include (here and in all other files) be handled via openmp_wrapper so as not to break building the threadless version?
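For context, a wrapper header like this typically guards the include so that builds without OpenMP still compile. The sketch below shows the general pattern only; it is not a copy of LightGBM's openmp_wrapper.h, and the stub list is illustrative.

```cpp
// openmp-wrapper-style guard (illustrative pattern, not the real header):
// include this instead of <omp.h> everywhere, so threadless builds get
// single-thread stubs instead of a missing-header error.
#ifdef _OPENMP
  #include <omp.h>
#else
  inline int omp_get_max_threads() { return 1; }
  inline int omp_get_thread_num()  { return 0; }
  inline int omp_get_num_threads() { return 1; }
#endif
```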

@guolinke (Collaborator, Author) commented Feb 1, 2020:

@StrikerRUS do you know why test_sklearn_integration fails randomly?

Update: it seems -mavx2 causes the failure: 1c24236 — this commit passes.

@guolinke (Collaborator, Author) commented Feb 1, 2020:
Also refer to scikit-learn/scikit-learn#14106.

@guolinke changed the title from "[WIP] Support both row-wise and col-wise multi-threading" to "Support both row-wise and col-wise multi-threading" on Feb 1, 2020
@StrikerRUS (Collaborator):
@guolinke

> do you know why test_sklearn_integration fails randomly?

Nope, I have never seen this test fail before. Is AVX2 critical for this PR?

@guolinke (Collaborator, Author) commented Feb 2, 2020:
@StrikerRUS I benchmarked that; interestingly, disabling AVX2 is slightly faster.

@guolinke (Collaborator, Author) commented Feb 2, 2020:
@StrikerRUS I am going to merge this PR; do we need to swap the compilers back in CI?

@guolinke merged commit 509c2e5 into master on Feb 2, 2020
@guolinke deleted the sparse_bin_clean branch on February 2, 2020 at 04:47
@StrikerRUS (Collaborator):
@guolinke

> I benchmarked that; interestingly, disabling AVX2 is slightly faster.

Hmm, really interesting...

> I am going to merge this PR; do we need to swap the compilers back in CI?

Yes. As it's already merged, I'll do this in a separate PR.
