Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] support dataset rows more then max(int32_t) #5454

Closed
wants to merge 6 commits into from

Conversation

junpeng0715
Copy link

@junpeng0715 junpeng0715 commented Aug 31, 2022

Description

This is a WIP PR to allow more than MAX(int32_t) of dataset rows.

  • change data_size_t to int64_t
  • change int32_t to int64_t for related places
  • modify test code to allow int64 dataset.
  • change unmatch type in c_api.h(int) and c_api.cpp(int32_t) for ncol
    LIGHTGBM_C_EXPORT int LGBM_BoosterPredictForMatSingleRow(BoosterHandle handle,
    const void* data,
    int data_type,
    int ncol,
    int is_row_major,
    int predict_type,
    int start_iteration,
    int num_iteration,
    const char* parameter,
    int64_t* out_len,
    double* out_result);
  • change the thread run logic in Threading.h, when nblock = 0, set left_cnt = 0
    left_write_pos_[0] = 0;
    right_write_pos_[0] = 0;
    for (int i = 1; i < nblock; ++i) {
    left_write_pos_[i] = left_write_pos_[i - 1] + left_cnts_[i - 1];
    right_write_pos_[i] = right_write_pos_[i - 1] + right_cnts_[i - 1];
    }
    data_size_t left_cnt = left_write_pos_[nblock - 1] + left_cnts_[nblock - 1];

Additional Comments

  • Need some experienced developers to review this PR, and make some suggestions.
  • When I build in java wrapper(swig), I got two errors,
  ubuntu: 20.04
  SWIG:4.0.2
  gcc: GNU 9.4.0
  G++: GNU 9.4.0
  JVM: 1.8.0_342

It seem that SWIG will convert the parameter which defined as const int64_t* to long long const*, could you give some suggestions for this error?

 [ 96%] Building CXX object CMakeFiles/_lightgbm_swig.dir/java/lightgbmlibJAVA_wrap.cxx.o
/home/ubuntu/projects/LightGBM/build/java/lightgbmlibJAVA_wrap.cxx: In function ‘jint Java_com_microsoft_ml_lightgbm_lightgbmlibJNI_LGBM_1DatasetCreateFromSampledColumn(JNIEnv*, jclass, jlong, jlong, jint, jlong, jlong, jlong, jlong, jstring, jlong)’:
/home/ubuntu/projects/LightGBM/build/java/lightgbmlibJAVA_wrap.cxx:1285:68: error: invalid conversion from ‘const long long int*’ to ‘const int64_t*’ {aka ‘const long int*’} [-fpermissive]
 1285 |   result = (int)LGBM_DatasetCreateFromSampledColumn(arg1,arg2,arg3,(long long const *)arg4,arg5,arg6,arg7,(char const *)arg8,arg9);
      |                                                                    ^~~~~~~~~~~~~~~~~~~~~~~
      |                                                                    |
      |                                                                    const long long int*
In file included from /home/ubuntu/projects/LightGBM/build/java/lightgbmlibJAVA_wrap.cxx:239:
/home/ubuntu/projects/LightGBM/include/../include/LightGBM/c_api.h:130:74: note:   initializing argument 4 of ‘int LGBM_DatasetCreateFromSampledColumn(double**, int**, int32_t, const int64_t*, int64_t, int64_t, int64_t, const char*, void**)’
  130 |                                                           const int64_t* num_per_col,
      |                                                           ~~~~~~~~~~~~~~~^~~~~~~~~~~
/home/ubuntu/projects/LightGBM/build/java/lightgbmlibJAVA_wrap.cxx: In function ‘jint Java_com_microsoft_ml_lightgbm_lightgbmlibJNI_LGBM_1DatasetGetSubset(JNIEnv*, jclass, jlong, jlong, jlong, jstring, jlong)’:
/home/ubuntu/projects/LightGBM/build/java/lightgbmlibJAVA_wrap.cxx:1561:44: error: invalid conversion from ‘const long long int*’ to ‘const int64_t*’ {aka ‘const long int*’} [-fpermissive]
 1561 |   result = (int)LGBM_DatasetGetSubset(arg1,(long long const *)arg2,arg3,(char const *)arg4,arg5);
      |                                            ^~~~~~~~~~~~~~~~~~~~~~~
      |                                            |
      |                                            const long long int*
In file included from /home/ubuntu/projects/LightGBM/build/java/lightgbmlibJAVA_wrap.cxx:239:
/home/ubuntu/projects/LightGBM/include/../include/LightGBM/c_api.h:316:60: note:   initializing argument 2 of ‘int LGBM_DatasetGetSubset(DatasetHandle, const int64_t*, int64_t, const char*, void**)’
  316 |                                             const int64_t* used_row_indices,

@guolinke
Copy link
Collaborator

guolinke commented Sep 1, 2022

will this affect the training speed? any benchmarks?

@junpeng0715
Copy link
Author

junpeng0715 commented Sep 2, 2022

will this affect the training speed? any benchmarks?

@guolinke

  • I found a benchmarks in you repo, boosting_tree_benchmarks, is it fine to confirm training speed?
  • Can you give me some feedback on the question for SWIG that I raised? Or who is the main developer for SWIG part to ask for advice.

@guolinke
Copy link
Collaborator

guolinke commented Sep 2, 2022

yeah, it is fine to benchmark.

@imatiach-msft , @AlbertoEAF, @svotaw can you help for SWIG ?

@StrikerRUS
Copy link
Collaborator

It seem that SWIG will convert the parameter which defined as const int64_t* to long long const*, could you give some suggestions for this error?

I believe this is related: #4091.

int32_t num_local_row,
const int64_t* num_per_col,
int64_t num_sample_row,
int64_t num_local_row,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you need to support int64_t for sample data and data local to 1 machine? Not opposing, just making sure you want to do that. If so, you prob need to change the sample_indices to int64_t as well?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you need to support int64_t for sample data and data local to 1 machine? Not opposing, just making sure you want to do that. If so, you prob need to change the sample_indices to int64_t as well?

@svotaw Very happy to receive feedbacks.

Not just using 1 machine. I'm not sure, int/int32_t is enough for Worker machines, is it enough to Driver machine?

@@ -91,10 +91,10 @@ LIGHTGBM_C_EXPORT int LGBM_GetSampleCount(int32_t num_total_row,
* \param[out] out_len Number of indices
* \return 0 when succeed, -1 when failure happens
*/
LIGHTGBM_C_EXPORT int LGBM_SampleIndices(int32_t num_total_row,
LIGHTGBM_C_EXPORT int LGBM_SampleIndices(int64_t num_total_row,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all these signature changes are breaking changes. Just making sure you consider that.

@junpeng0715
Copy link
Author

junpeng0715 commented Sep 5, 2022

yeah, it is fine to benchmark.

@imatiach-msft , @AlbertoEAF, @svotaw can you help for SWIG ?

@guolinke Tested Higgs Dataset in AWS
Instrance : r5.2xlarge (vCPUs 8/Memory 64GiB)

# Master Branch Modified
Accuracy 310.827840 seconds, iteration 500, AUC:0.845874 427.753441 seconds, iteration 500, AUC:0.845874
Speed 294.647278 seconds, iteration 500 303.744985 seconds, iteration 500

@junpeng0715
Copy link
Author

It seem that SWIG will convert the parameter which defined as const int64_t* to long long const*, could you give some suggestions for this error?

I believe this is related: #4091.

/* Unfortunately, for the time being,
* SWIG has issues generating the overloads to coalesce_to()
* for larger integral types
* so we won't support that for now:
*/
//%template(int64ChunkedArray) ChunkedArray<int64_t>;

Yes @StrikerRUS, I saw this source comment. I'm finding a way to avoid the SWIG issue.

  • Is there any tasks are working on this issue?
  • I use SynapseML to run LightGBM. First build lightgbm JAVA wrapper once, then modify the error in built source from long long const* to long int const*, then rebuild, it will be success. SynapseML can use the built package to run, I am not sure it is a temp solution for test. Could you raise some suggesion for this issue?

@guolinke
Copy link
Collaborator

guolinke commented Sep 5, 2022

@junpeng0715 given the performance loss, I think a better solution is to make the long indices an option, e.g. make it a flag (define in CMake) in the compilation.

@junpeng0715
Copy link
Author

@junpeng0715 given the performance loss, I think a better solution is to make the long indices an option, e.g. make it a flag (define in CMake) in the compilation.

@guolinke Re-ran the Higgs benchmark, and got a new result below.
It is faster this time.It seems the instance is not stable. Anyway, I also agree with you, it's fine to add an option.

# Master Branch Modified
Accuracy 309.241788 seconds, iteration 500, AUC:0.845874 318.976365 seconds, iteration 500, AUC:0.845874
Speed 399.465431 seconds, iteration 500 304.222913 seconds, iteration 500

e.g. make it a flag (define in CMake) in the compilation.

It feels like changing the modified int64_t to data_size_t , then add a CMAKE option to switch data_size_t .
But how to handle the APIs(C, Python, etc)?

@junpeng0715
Copy link
Author

It seem that SWIG will convert the parameter which defined as const int64_t* to long long const*, could you give some suggestions for this error?

I believe this is related: #4091.

/* Unfortunately, for the time being,
* SWIG has issues generating the overloads to coalesce_to()
* for larger integral types
* so we won't support that for now:
*/
//%template(int64ChunkedArray) ChunkedArray<int64_t>;

Yes @StrikerRUS, I saw this source comment. I'm finding a way to avoid the SWIG issue.

  • Is there any tasks are working on this issue?
  • I use SynapseML to run LightGBM. First build lightgbm JAVA wrapper once, then modify the error in built source from long long const* to long int const*, then rebuild, it will be success. SynapseML can use the built package to run, I am not sure it is a temp solution for test. Could you raise some suggesion for this issue?

Hi @imatiach-msft , @AlbertoEAF, @svotaw , could you help for this?

@guolinke
Copy link
Collaborator

guolinke commented Sep 7, 2022

cc @shiyu1994 to confirm the performance test.

@junpeng0715
Copy link
Author

It seem that SWIG will convert the parameter which defined as const int64_t* to long long const*, could you give some suggestions for this error?

I believe this is related: #4091.

/* Unfortunately, for the time being,
* SWIG has issues generating the overloads to coalesce_to()
* for larger integral types
* so we won't support that for now:
*/
//%template(int64ChunkedArray) ChunkedArray<int64_t>;

Yes @StrikerRUS, I saw this source comment. I'm finding a way to avoid the SWIG issue.

  • Is there any tasks are working on this issue?
  • I use SynapseML to run LightGBM. First build lightgbm JAVA wrapper once, then modify the error in built source from long long const* to long int const*, then rebuild, it will be success. SynapseML can use the built package to run, I am not sure it is a temp solution for test. Could you raise some suggesion for this issue?

Hi @imatiach-msft , @AlbertoEAF, @svotaw , could you help for this?

Hi @imatiach-msft , @AlbertoEAF, @svotaw, @guolinke
Could you explain my doubt for SWIG part?

LightGBM is using the SWIG's stdint.i for int64_t,

%include "stdint.i"

the defination for int64_t in SWIG need to enable SWIGWORDSIZE64 when Word size is 64-bit,
https://github.com/swig/swig/blob/ad1688055dea4f07cd4023d5e0301d97f7759151/Lib/stdint.i#L19-L24

So for default, in LightGBM's swig part, int64_t is being long long(SWIGWORDSIZE64 not defined),
Is that expected?Or unsuitable?

@junpeng0715
Copy link
Author

Hi guys, Is there any update on this ticket?

@ghost
Copy link

ghost commented Sep 22, 2022

CLA assistant check
Thank you for your submission, we really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.

❌ junpeng0715 sign now
You have signed the CLA already but the status is still pending? Let us recheck it.

@guolinke
Copy link
Collaborator

guolinke commented Oct 9, 2022

@junpeng0715 sorry for the delay. I check with @shiyu1994 , and confirmed the int64_t would affect the speed of the data partition algorithm. So for now, the solution is "make it a flag (define in CMake) in the compilation.", and users need to recompile if they need the int64_t version.

@junpeng0715
Copy link
Author

@junpeng0715 sorry for the delay. I check with @shiyu1994 , and confirmed the int64_t would affect the speed of the data partition algorithm. So for now, the solution is "make it a flag (define in CMake) in the compilation.", and users need to recompile if they need the int64_t version.

Thank you so mush! Ok, will do that!

junpeng0715 pushed a commit to junpeng0715/LightGBM that referenced this pull request Oct 12, 2022
junpeng0715 pushed a commit to junpeng0715/LightGBM that referenced this pull request Oct 12, 2022
@junpeng0715
Copy link
Author

junpeng0715 commented Oct 13, 2022

I raised a new ticket for "make it a flag (define in CMake) in the compilation." #5540

@jameslamb
Copy link
Collaborator

Based on #5540 (comment), #5540 replaces this PR. I'm going to close this one.

Copy link

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 15, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants