[feature request] accelerate the data loading from file #2788

Closed · guolinke opened this issue Feb 21, 2020 · 10 comments
@guolinke (Collaborator)

Refer to the code in include/utils/text_reader.h, src/io/dataset_loader.cpp, and src/io/parser.hpp.

@StrikerRUS (Collaborator)

Closed in favor of #2302. We decided to keep all feature requests in one place.

You are welcome to contribute this feature! Please re-open this issue (or post a comment if you are not the topic starter) if you are actively working on implementing it.

@cyfdecyf (Contributor)

I see fast_double_parser is now used in the master branch, and I'm going to test its performance improvements.

But I'm more interested in using a binary file format for input data instead of text.

The drawbacks of text files:

  1. Storing a double as text may take more than 8 bytes (sometimes much more)
    • uses more disk space and causes more I/O
    • file-system-level compression makes this less problematic, but it may not be an option for many people
  2. Text parsing overhead
  3. Ignoring some features does not reduce file I/O (because we have to read line by line)

I'm familiar with HDF5 and am considering using it. A few interesting features:

  • HDF5 does not limit the way we store data; either row- or column-oriented layouts can be used
    • I prefer the column-oriented layout because it makes feature selection possible (see the h5py sketch below)
  • Supports multi-dimensional data
  • Supports compression (gzip, LZO, or a separately installed LZ4 plugin)

HDF5 drawbacks:

  • HDF5's C API is awkward to use
    • but the Python library h5py is pleasant to use
  • It's like a black box, and it's difficult to understand its internal behavior

I do not have much experience with other binary data formats; that is the only reason I mention only HDF5.
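
For illustration, a minimal h5py sketch of the column-oriented layout I have in mind (file and dataset names are made up for the example):

```python
import h5py
import numpy as np

# Toy data: 1M rows, 3 feature columns.
X = np.random.rand(1_000_000, 3)

# Write one HDF5 dataset per feature column, with gzip compression.
# A column-oriented layout lets a reader load only the features it needs.
with h5py.File("train.h5", "w") as f:
    for j in range(X.shape[1]):
        f.create_dataset(f"feature_{j}", data=X[:, j], compression="gzip")

# Read back only the selected columns; unselected columns cost no I/O.
with h5py.File("train.h5", "r") as f:
    selected = np.column_stack([f["feature_0"][:], f["feature_2"][:]])
```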

@StrikerRUS (Collaborator)

@cyfdecyf Thanks a lot for sharing your opinion! LightGBM can read/save a Dataset in a binary format: https://lightgbm.readthedocs.io/en/latest/Parameters.html#save_binary.

In case you meant saving a LightGBM model file (but this issue is about reading training data), there was an attempt to use protobuf as a binary serializer: #908.

@cyfdecyf (Contributor)

@StrikerRUS I've been using the save_binary parameter, but currently only for training data; I haven't tried it on validation and testing data.

Suppose I've got a large training dataset and many testing datasets, and I want to try lots of parameters. (Assume the parameters that would change the binary file are fixed.)

In order to save data loading time, I'm considering the following steps:

  1. Preprocess: convert the training and testing datasets to binary using the save_binary feature
  2. Run the train task with different parameters on the saved training data binary, and save each model
  3. Run the predict task for each model on every dataset with the saved testing data binary
    • As predict uses much less memory than train, I'd like to run it as a separate task on our cluster

Is this the right way to do things? If yes, then I think there's not much need to support reading other binary data formats.
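
If it helps, here is a rough sketch of that pipeline with the Python API (file names are hypothetical; the CLI task=train / task=predict flow would be equivalent):

```python
import lightgbm as lgb

# Step 1: preprocess once -- parse the text file and save the binned
# Dataset as a LightGBM binary; later loads skip text parsing entirely.
lgb.Dataset("train.txt").save_binary("train.bin")

# Step 2: train with different parameters, reloading the cheap binary each time.
for lr in [0.1, 0.05]:
    bst = lgb.train({"objective": "regression", "learning_rate": lr},
                    lgb.Dataset("train.bin"))
    bst.save_model(f"model_lr{lr}.txt")

# Step 3: predict as a separate, lower-memory task.
# Note: prediction needs raw feature values, so it reads the original text
# file; as far as I know, the binned Dataset binary is only usable as
# training/validation data, not for predict.
bst = lgb.Booster(model_file="model_lr0.1.txt")
preds = bst.predict("test.txt")
```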

@StrikerRUS (Collaborator)

Is this the right way to do things?

Yep, I think so. Maybe you can just specify the test datasets as validation data; I believe it can help simplify your pipeline.
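
For example (continuing with hypothetical file names), the saved test binaries can be passed as valid_sets so evaluation happens inside the train task:

```python
import lightgbm as lgb

train_data = lgb.Dataset("train.bin")
# Binary Dataset files also work as validation data; metrics are then
# reported for each set during training.
valids = [lgb.Dataset(f, reference=train_data) for f in ["test1.bin", "test2.bin"]]
bst = lgb.train({"objective": "regression"}, train_data, valid_sets=valids)
```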

@cyfdecyf (Contributor) commented Feb 3, 2021

One possible optimization for the two_round option:

If the input file is seekable, use seek to randomly select bin_construct_sample_cnt lines of text. This can avoid reading the whole input file twice.

Problems of this approach:

  • If we generate a random number as the seek position and take the line after the next linebreak as the selected line, this will favor longer lines. But I guess this is acceptable if line width does not vary too much
  • Partitioning data is not easy: used_data_indices requires line indices, which can only be obtained by reading the input text file line by line

Do you have any suggestions on this idea? I'm planning to implement it.
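
A rough Python sketch of the sampling idea, just to pin down the behavior (the real change would live next to SampleTextDataFromFile in C++; the function name here is made up):

```python
import os
import random

def sample_lines_by_seek(path, n_samples, seed=42):
    """Pick roughly n_samples random lines by seeking to random byte offsets.

    As noted above, a line's chance of being chosen is proportional to its
    byte length, so longer lines are favored; the first line of the file
    can never be chosen at all.
    """
    rng = random.Random(seed)
    size = os.path.getsize(path)
    lines = []
    with open(path, "rb") as f:
        for _ in range(n_samples):
            f.seek(rng.randrange(size))
            f.readline()         # discard the (likely partial) current line
            line = f.readline()  # the next complete line is the sample
            if line:             # empty if the seek landed inside the last line
                lines.append(line.rstrip(b"\r\n").decode())
    return lines
```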

@StrikerRUS (Collaborator)

cc @guolinke @shiyu1994 @chivee @btrotta for the seek idea above.

@shiyu1994 (Collaborator)

One possible optimization for the two_round option: if the input file is seekable, use seek to randomly select bin_construct_sample_cnt lines of text. [...] Do you have any suggestions on this idea? I'm planning to implement it.

Good suggestion! Maybe we can simply fall back to the old method without seek if used_data_indices is not nullptr in SampleTextDataFromFile. used_data_indices is useful only when distributed training is enabled and pre_partition=false.

@cyfdecyf (Contributor)

Sorry I didn't give any updates on this issue. I prefer a binary file format for input data and have implemented something similar in PR #4089, so I won't continue with this idea for text input files for now.

@shiyu1994 (Collaborator)

OK. I see that the Sequence interface in #4089 already requires random access to be supported. But we can reserve the seek idea for text files in the CLI version; perhaps I can continue with that. Thanks for your idea!
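
For anyone landing here later, a minimal sketch of the #4089 interface in use, wrapping an HDF5 file as the random-access source (as far as I understand it, a Sequence must implement __getitem__ for both an int and a slice, plus __len__; file and dataset names here are made up):

```python
import h5py
import lightgbm as lgb

class HDF5Sequence(lgb.Sequence):
    """Random-access adapter over an HDF5 dataset of shape (n_rows, n_features)."""

    def __init__(self, path, dataset_name, batch_size=4096):
        self._file = h5py.File(path, "r")  # keep the file handle alive
        self._data = self._file[dataset_name]
        self.batch_size = batch_size       # rows LightGBM fetches per read

    def __getitem__(self, idx):
        # Must handle a single row index as well as a slice of rows.
        return self._data[idx]

    def __len__(self):
        return len(self._data)

# Usage sketch: Dataset accepts a Sequence (or a list of Sequences) as data;
# 'labels' would be loaded separately, e.g. from another HDF5 dataset.
# train_data = lgb.Dataset(HDF5Sequence("train.h5", "features"), label=labels)
```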
