[feature request] accelerate the data loading from file #2788
Comments
Closed in favor of being in #2302. We decided to keep all feature requests in one place. Welcome to contribute this feature! Please re-open this issue (or post a comment if you are not a topic starter) if you are actively working on implementing this feature. |
I see fast_double_parser is now used in the master branch, and I am going to test its performance improvement. But I'm more interested in using a binary file as the input data format instead of text. The drawbacks of text files:
I'm familiar with HDF5 and am considering using it. A few interesting features:
HDF5 drawbacks:
I do not have much experience with other binary data formats, which is the only reason I mention only HDF5. |
@cyfdecyf Thanks a lot for sharing your opinion! LightGBM can read/save In case you meant saving a LightGBM model file (but this issue is about reading training data), there was an attempt to use a |
@StrikerRUS I've been using the Suppose I've got a large training dataset and many testing datasets, and I want to try lots of parameters. (Assume the parameters that would change the binary file are fixed.) To save data loading time, I'm considering the following steps:
Is this the right way to do things? If yes, then I think there's not much need to support reading other binary data formats. |
Yep, I think so. Maybe you can just specify the test datasets as validation data. I believe it can help simplify your pipeline. |
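The pattern discussed in this exchange (parse the text data once, keep a binary copy, and reuse it across many parameter runs) can be illustrated with a generic sketch. This is not LightGBM's binary dataset format or API: the function names `parse_text`, `write_cache`, `read_cache` and `load_cached`, the `.bin` suffix, and the file layout (one 8-byte column count followed by raw doubles) are all hypothetical and chosen only to show the idea.

```python
import array
import os


def parse_text(path):
    """Slow path: parse a comma-separated file of floats into (n_cols, flat values)."""
    values = array.array("d")
    n_cols = 0
    with open(path, "r") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            fields = line.split(",")
            n_cols = len(fields)
            values.extend(float(x) for x in fields)
    return n_cols, values


def write_cache(path, n_cols, values):
    """Hypothetical layout: one 8-byte integer (column count), then raw doubles."""
    with open(path, "wb") as f:
        array.array("q", [n_cols]).tofile(f)
        values.tofile(f)


def read_cache(path):
    """Fast path: read the doubles back with no text parsing at all."""
    n_values = (os.path.getsize(path) - 8) // 8
    with open(path, "rb") as f:
        header = array.array("q")
        header.fromfile(f, 1)
        values = array.array("d")
        values.fromfile(f, n_values)
    return header[0], values


def load_cached(text_path):
    """Parse the text file only when no binary cache exists next to it."""
    bin_path = text_path + ".bin"
    if not os.path.exists(bin_path):
        n_cols, values = parse_text(text_path)
        write_cache(bin_path, n_cols, values)
    return read_cache(bin_path)
```

Calling `load_cached("train.csv")` parses the text only on the first run; later runs read the cached file directly. The sketch never invalidates the cache, so it relies on the same assumption made above: nothing that affects the binary file changes between runs. The values are written in native byte order, so the cache is not portable across machines with different endianness.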
One possible optimization for If input file is seekable, use Problems of this approach:
Do you have any suggestions on this idea? I'm planning to implement it. |
cc @guolinke @shiyu1994 @chivee @btrotta |
Good suggestion! Maybe we can simply fall back to the old method without |
Sorry I didn't give any updates on this issue. I prefer a binary file format for input data and have implemented something similar in PR #4089, so I won't continue with this idea for text input files for now. |
Ok. I see now the |
Refer to the code in include/utils/text_reader.h, src/io/dataset_loader.cpp and src/io/parser.hpp.