Create Dataset from Arrow format #3369

Closed
jameslamb opened this issue Sep 8, 2020 · 10 comments

Comments

@jameslamb
Collaborator

Summary

Apache Arrow is an open source columnar in-memory storage format that's well-suited to tabular data. It offers efficient data loading from files or other input streams, and zero-copy data sharing between languages.

Motivation

I think that this feature could allow for faster data loading, especially from the Parquet and CSV file formats. It would also allow training directly on Arrow tables, so we might be able to avoid some data copying in the language wrappers (e.g. converting to a pandas data frame or R data.frame).

pyarrow offers a fast, efficient Parquet reader. I believe that reading from Parquet files directly into Arrow, then being able to efficiently create a LightGBM Dataset from that pyarrow table, would allow for faster I/O and better memory efficiency by avoiding the need to ever create a pandas data frame: https://arrow.apache.org/docs/python/generated/pyarrow.parquet.read_table.html.

Description

I'm admittedly not very experienced with C++, so maybe others can expand this description. But basically, I think it would involve adding an LGBM_DatasetCreateFromArrow similar to LGBM_DatasetCreateFromCSC:

int LGBM_DatasetCreateFromCSC(const void* col_ptr,
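
As a rough illustration of what such an entry point might consume, here is a minimal sketch based on the Arrow C Data Interface, which describes a column with two plain C structs that a project can copy into its own source. The helper example_copy_f64_column and the function example_usage are hypothetical names made up for this sketch. They are not part of LightGBM, and the sketch only handles a single float64 column.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Struct layouts copied from the Arrow C Data Interface specification.
 * Defining them locally means no Arrow library has to be linked. */
struct ArrowSchema {
  const char* format;
  const char* name;
  const char* metadata;
  int64_t flags;
  int64_t n_children;
  struct ArrowSchema** children;
  struct ArrowSchema* dictionary;
  void (*release)(struct ArrowSchema*);
  void* private_data;
};

struct ArrowArray {
  int64_t length;
  int64_t null_count;
  int64_t offset;
  int64_t n_buffers;
  int64_t n_children;
  const void** buffers;
  struct ArrowArray** children;
  struct ArrowArray* dictionary;
  void (*release)(struct ArrowArray*);
  void* private_data;
};

/* Hypothetical helper (assumption, not a LightGBM function): copies one
 * float64 column into a dense buffer, writing `missing` for null slots.
 * Returns 0 on success and -1 if the column is not float64. */
static int example_copy_f64_column(const struct ArrowSchema* schema,
                                   const struct ArrowArray* array,
                                   double missing,
                                   double* out) {
  assert(schema != NULL && array != NULL && out != NULL);
  if (strcmp(schema->format, "g") != 0) {  /* "g" is Arrow's float64 format */
    return -1;
  }
  /* buffers[0] is the validity bitmap (may be NULL when nothing is null),
   * buffers[1] holds the values. Both are indexed from array->offset. */
  const uint8_t* validity = (const uint8_t*)array->buffers[0];
  const double* values = (const double*)array->buffers[1];
  for (int64_t i = 0; i < array->length; ++i) {
    int64_t j = array->offset + i;
    int valid = (validity == NULL) || ((validity[j / 8] >> (j % 8)) & 1);
    out[i] = valid ? values[j] : missing;
  }
  return 0;
}

/* Usage: a 3-element view (offset 1) over 4 stored values, one of them null. */
static int example_usage(double out[3]) {
  static const double values[4] = {1.5, 2.5, 3.5, 4.5};
  static const uint8_t validity[1] = {0x0B};  /* bits 0, 1, 3 set; slot 2 null */
  const void* buffers[2] = {validity, values};
  struct ArrowSchema schema = {0};
  struct ArrowArray array = {0};
  schema.format = "g";
  array.length = 3;
  array.null_count = 1;
  array.offset = 1;
  array.n_buffers = 2;
  array.buffers = buffers;
  return example_copy_f64_column(&schema, &array, -1.0, out);
}
```

The sketch reads the producer's buffers in place and only copies into the consumer's own layout, which is the kind of copy avoidance the Motivation section is after. A real implementation would also need to handle other value types, chunked columns, and the release callbacks.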

Arrow is a fairly heavy dependency (and pyarrow in Python / {arrow} in R, by extension), so an implementation should also explore how to make these optional for users who do not need the Arrow features.

References

There is an in-progress PR to add this feature to XGBoost: dmlc/xgboost#5667

Spark added support for Arrow as a memory representation in pyspark 3 years ago: https://arrow.apache.org/blog/2017/07/26/spark-arrow/.

@StrikerRUS
Collaborator

Spark added support for Arrow as a memory representation in pyspark 3 years ago:

Does it mean that right now one can use MMLSpark (https://github.com/Azure/mmlspark) for Arrow + LightGBM (similarly to parquet #1286 (comment))?

@jameslamb
Collaborator Author

Possibly! But that definitely does not satisfy this feature request. Spark is a heavy dependency that many users are unlikely to have access to.

@guolinke
Collaborator

guolinke commented Sep 9, 2020

I think @shiyu1994 can help with this. He has recently had some ideas for refining the Dataset class.

@yalwan-iqvia

Spark added support for Arrow as a memory representation in pyspark 3 years ago:

Does it mean that right now one can use MMLSpark (https://github.com/Azure/mmlspark) for Arrow + LightGBM (similarly to parquet #1286 (comment))?

I think it would be the other way round. If LightGBM implemented datasetFromArrow, it would probably be useful for speeding up / improving efficiency from within MMLSpark.

@StrikerRUS
Collaborator

Closed in favor of tracking this in #2302. We decided to keep all feature requests in one place.

Contributions of this feature are welcome! Please re-open this issue (or post a comment if you are not the topic starter) if you are actively working on implementing this feature.

@StrikerRUS
Collaborator

For reference: a Parquet data reader implementation in XGBoost with an optional Arrow dependency at compile time.
dmlc/dmlc-core#653
https://github.com/dmlc/dmlc-core/blob/5eaff7643a88949c81af1e5de11945632920bf96/CMakeLists.txt#L71-L73
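
To make the compile-time approach concrete, here is a minimal sketch of how an optional feature could be guarded in C. The macro EXAMPLE_USE_ARROW and the functions Example_DatasetCreateFromArrow and Example_GetLastError are hypothetical names invented for this sketch. It only assumes the convention LightGBM's C API already follows, where functions return 0 on success and -1 on failure, with a separate call to fetch the error message.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical build flag (assumption): a build system would define
 * EXAMPLE_USE_ARROW only when the user asks for Arrow support. */
#ifdef EXAMPLE_USE_ARROW
#define EXAMPLE_HAS_ARROW 1
#else
#define EXAMPLE_HAS_ARROW 0
#endif

static const char* example_last_error = "";

/* Returns the message recorded by the most recent failing call. */
const char* Example_GetLastError(void) {
  return example_last_error;
}

/* Hypothetical entry point, not part of LightGBM's C API.
 * The symbol always exists, so language wrappers can link against it
 * whether or not Arrow support was compiled in. */
int Example_DatasetCreateFromArrow(void* arrow_array, void** out_handle) {
#if EXAMPLE_HAS_ARROW
  /* Arrow-enabled build: a real implementation would read the columns here.
   * The sketch just hands the input back as a placeholder handle. */
  *out_handle = arrow_array;
  return 0;
#else
  /* Default build: fail cleanly instead of failing to link. */
  (void)arrow_array;
  *out_handle = NULL;
  example_last_error = "This build was compiled without Arrow support";
  return -1;
#endif
}
```

Keeping the symbol present in every build and failing at run time is one possible design. The alternative is to omit the symbol entirely when the flag is off, which moves the failure to link time.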

@jameslamb
Collaborator Author

Linking the eventual XGBoost implementation: dmlc/xgboost#7512

@github-actions

This comment was marked as off-topic.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 15, 2023

@jameslamb
Collaborator Author

Sorry, this was locked accidentally. Just unlocked it.

@microsoft microsoft unlocked this conversation Aug 18, 2023