[python-package] Add support for passing Arrow to LightGBM #6022
@microsoft-github-policy-service agree company="QuantCo"
Thanks for working on this! I just gave this a very quick review and left a few small notes. Will try to give it a more thorough review in the coming days.
Please let us know if you need help with the failing CI jobs. I can say with confidence that most of them are related to the state of this PR, not other flakiness in those tests.
One other question: would you consider limiting the scope of this PR to just accepting Arrow types for the training data, and defer init_score, weight, and being able to predict on Arrow data to follow-up PRs? That'd reduce the scope of this a bit, which should reduce the effort for us to provide a thoughtful review. One way I've seen that work well in the past is to keep a PR like this one with all of the changes up as a draft, to show the end state you want to get to, and submit individual smaller PRs with more focused changesets.
    from pyarrow.cffi import ffi


    @dataclass
This project still supports Python 3.6, so you cannot rely on dataclasses from the standard library being available: #5765 (comment)
For the purpose of this PR, please just make it a normal class and add the bit of __init__() boilerplate like

    class _ArrowCArray:
        def __init__(self, n_chunks, chunks, schema):
            self.n_chunks = n_chunks
            self.chunks = chunks
            self.schema = schema
      ) -> "Dataset":
          """Set property into the Dataset.

          Parameters
          ----------
          field_name : str
              The field name of the information.
    -     data : list, list of lists (for multi-class task), numpy array, pandas Series, pandas DataFrame (for multi-class task), or None
    +     data : list, list of lists (for multi-class task), numpy array, pandas Series, pandas DataFrame (for multi-class task), pyarrow Table, pyarrow Array, or None
    -     data : list, list of lists (for multi-class task), numpy array, pandas Series, pandas DataFrame (for multi-class task), pyarrow Table, pyarrow Array, or None
    +     data : list, list of lists (for multi-class task), numpy array, pandas Series, pandas DataFrame (for multi-class task), pyarrow Table, pyarrow Array, pyarrow ChunkedArray or None
Based on the corresponding change to the type hint, this should also note that a ChunkedArray is possible.
Arrow support would be really nice.
@borchero any plans to move this forward? thanks.
Development is happening in #6034 right now.
This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.
Motivation
This PR adds Arrow support to the Python API of LightGBM and thus (partially) fixes #3369.
Changes
lgb.Dataset.data and booster.predict
lgb.Dataset's label, group, weight, and init_score