[feature request] accelerate the data loading from file #2788

Closed · guolinke opened this issue Feb 21, 2020 · 10 comments
@guolinke (Collaborator)

Refer to the code in include/utils/text_reader.h, src/io/dataset_loader.cpp, and src/io/parser.hpp.

@StrikerRUS (Collaborator)

Closed in favor of #2302. We decided to keep all feature requests in one place.

You are welcome to contribute this feature! Please re-open this issue (or post a comment if you are not the topic starter) if you are actively working on implementing it.

@cyfdecyf (Contributor)

I see fast_double_parser is now used in the master branch, and I'm going to test its performance improvements.

But I'm more interested in using a binary file format for input data instead of text.

The drawbacks of text files:

  1. Storing a double as text may take more than 8 bytes (sometimes much more)
    • uses more disk space and causes more I/O
    • file-system-level compression makes this less problematic, but it may not be an option for many people
  2. Text parsing overhead
  3. Ignoring some features does not reduce file I/O (because we have to read line by line)

I'm familiar with HDF5 and am considering using it. A few interesting features:

  • HDF5 does not limit the way we store data; either row- or column-oriented layouts can be used
    • I prefer the column-oriented layout because it makes feature selection possible (see the h5py sketch below)
  • Supports multi-dimensional data
  • Supports compression (gzip, LZO, or a separately installed LZ4 plugin)

HDF5 drawbacks:

  • HDF5's C API is awkward to use
    • but the Python library h5py is pleasant to use
  • It's like a black box, and it's difficult to understand its internal behavior

I do not have much experience with other binary data formats; that is the only reason I mention only HDF5.
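
For illustration, a minimal h5py sketch of the column-oriented layout I have in mind (file and dataset names are made up for the example):

```python
import h5py
import numpy as np

# Toy data: 1M rows, 3 feature columns.
X = np.random.rand(1_000_000, 3)

# Write one HDF5 dataset per feature column, with gzip compression.
# A column-oriented layout lets a reader load only the features it needs.
with h5py.File("train.h5", "w") as f:
    for j in range(X.shape[1]):
        f.create_dataset(f"feature_{j}", data=X[:, j], compression="gzip")

# Read back only the selected columns; unselected columns cost no I/O.
with h5py.File("train.h5", "r") as f:
    selected = np.column_stack([f["feature_0"][:], f["feature_2"][:]])
```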

@StrikerRUS (Collaborator)

@cyfdecyf Thanks a lot for sharing your opinion! LightGBM can read/save a Dataset in a binary format: https://lightgbm.readthedocs.io/en/latest/Parameters.html#save_binary.

In case you meant saving a LightGBM model file (but this issue is about reading training data), there was an attempt to use protobuf as a binary serializer: #908.

@cyfdecyf (Contributor)

@StrikerRUS I've been using the save_binary parameter, but currently only for training data; I haven't tried it on validation and testing data.

Suppose I've got a large training dataset and many testing datasets, and I want to try lots of parameters. (Assume the parameters that would change the binary file are fixed.)

In order to save data loading time, I'm considering the following steps:

  1. Preprocess: convert the training and testing datasets to binary using the save_binary feature
  2. Run the train task with different parameters on the saved training data binary, and save each model
  3. Run the predict task for each model on every dataset with the saved testing data binary
    • As predict uses much less memory than train, I'd like to run it as a separate task on our cluster

Is this the right way to do things? If yes, then I think there's not much need to support reading other binary data formats.
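
If it helps, here is a rough sketch of that pipeline with the Python API (file names are hypothetical; the CLI task=train / task=predict flow would be equivalent):

```python
import lightgbm as lgb

# Step 1: preprocess once -- parse the text file and save the binned
# Dataset as a LightGBM binary; later loads skip text parsing entirely.
lgb.Dataset("train.txt").save_binary("train.bin")

# Step 2: train with different parameters, reloading the cheap binary each time.
for lr in [0.1, 0.05]:
    bst = lgb.train({"objective": "regression", "learning_rate": lr},
                    lgb.Dataset("train.bin"))
    bst.save_model(f"model_lr{lr}.txt")

# Step 3: predict as a separate, lower-memory task.
# Note: prediction needs raw feature values, so it reads the original text
# file; as far as I know, the binned Dataset binary is only usable as
# training/validation data, not for predict.
bst = lgb.Booster(model_file="model_lr0.1.txt")
preds = bst.predict("test.txt")
```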

@StrikerRUS (Collaborator)

Is this the right way to do things?

Yep, I think so. Maybe you can just specify the test datasets as validation data; I believe it can help simplify your pipeline.
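
For example (continuing with hypothetical file names), the saved test binaries can be passed as valid_sets so evaluation happens inside the train task:

```python
import lightgbm as lgb

train_data = lgb.Dataset("train.bin")
# Binary Dataset files also work as validation data; metrics are then
# reported for each set during training.
valids = [lgb.Dataset(f, reference=train_data) for f in ["test1.bin", "test2.bin"]]
bst = lgb.train({"objective": "regression"}, train_data, valid_sets=valids)
```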

@cyfdecyf (Contributor) commented Feb 3, 2021

One possible optimization for the two_round option:

If the input file is seekable, use seek to randomly select bin_construct_sample_cnt lines of text. This can avoid reading the whole input file twice.

Problems of this approach:

  • If we generate a random number as the seek position and take the line after the next linebreak as the selected line, this will favor longer lines. But I guess this is acceptable if line width does not vary too much
  • Partitioning data is not easy: used_data_indices requires line indices, which can only be obtained by reading the input text file line by line

Do you have any suggestions on this idea? I'm planning to implement it.
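
A rough Python sketch of the sampling idea, just to pin down the behavior (the real change would live next to SampleTextDataFromFile in C++; the function name here is made up):

```python
import os
import random

def sample_lines_by_seek(path, n_samples, seed=42):
    """Pick roughly n_samples random lines by seeking to random byte offsets.

    As noted above, a line's chance of being chosen is proportional to its
    byte length, so longer lines are favored; the first line of the file
    can never be chosen at all.
    """
    rng = random.Random(seed)
    size = os.path.getsize(path)
    lines = []
    with open(path, "rb") as f:
        for _ in range(n_samples):
            f.seek(rng.randrange(size))
            f.readline()         # discard the (likely partial) current line
            line = f.readline()  # the next complete line is the sample
            if line:             # empty if the seek landed inside the last line
                lines.append(line.rstrip(b"\r\n").decode())
    return lines
```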

@StrikerRUS (Collaborator)

cc @guolinke @shiyu1994 @chivee @btrotta for the seek idea above.

@shiyu1994 (Collaborator)

One possible optimization for the two_round option: if the input file is seekable, use seek to randomly select bin_construct_sample_cnt lines of text. [...] Do you have any suggestions on this idea? I'm planning to implement it.

Good suggestion! Maybe we can simply fall back to the old method without seek if used_data_indices is not nullptr in SampleTextDataFromFile. used_data_indices is useful only when distributed training is enabled and pre_partition=false.

@cyfdecyf (Contributor)

Sorry I didn't give any updates on this issue. I prefer a binary file format for input data and have implemented something similar in PR #4089, so I won't continue with this idea for text input files for now.

@shiyu1994 (Collaborator)

OK. I see that the Sequence interface in #4089 already requires random access to be supported. But we can reserve the seek idea for text files in the CLI version; perhaps I can continue with that. Thanks for your idea!
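
For anyone landing here later, a minimal sketch of the #4089 interface in use, wrapping an HDF5 file as the random-access source (as far as I understand it, a Sequence must implement __getitem__ for both an int and a slice, plus __len__; file and dataset names here are made up):

```python
import h5py
import lightgbm as lgb

class HDF5Sequence(lgb.Sequence):
    """Random-access adapter over an HDF5 dataset of shape (n_rows, n_features)."""

    def __init__(self, path, dataset_name, batch_size=4096):
        self._file = h5py.File(path, "r")  # keep the file handle alive
        self._data = self._file[dataset_name]
        self.batch_size = batch_size       # rows LightGBM fetches per read

    def __getitem__(self, idx):
        # Must handle a single row index as well as a slice of rows.
        return self._data[idx]

    def __len__(self):
        return len(self._data)

# Usage sketch: Dataset accepts a Sequence (or a list of Sequences) as data;
# 'labels' would be loaded separately, e.g. from another HDF5 dataset.
# train_data = lgb.Dataset(HDF5Sequence("train.h5", "features"), label=labels)
```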
