[python-package] Add support for passing Arrow to LightGBM #6022

borchero · 2023-08-05T15:51:12Z

Motivation

This PR adds Arrow support to the Python API of LightGBM and, thus, (partially) fixes #3369.

Changes

Allow passing an Arrow table as lgb.Dataset.data and to booster.predict
Allow passing Arrow arrays as lgb.Dataset's label, group, weight, and init_score
Add tests for C++ and Python

borchero · 2023-08-05T16:45:33Z

@microsoft-github-policy-service agree company="QuantCo"

jameslamb

Thanks for working on this! I just gave this a very quick review and left a few small notes. Will try to give it a more thorough review in the coming days.

Please let us know if you need help with the failing CI jobs. I can say with confidence that most of them are related to the state of this PR, not other flakiness in those tests.

One other question... would you consider limiting the scope of this PR to just accepting Arrow types for the training data, and deferring init_score, weight, and being able to predict on Arrow data to follow-up PRs? That'd reduce the scope of this a bit, which should reduce the effort for us to provide a thoughtful review. One way I've seen that work well in the past is to keep a PR like this one with all of the changes up as a draft, to show the end state you want to get to, and submit individual smaller PRs with more focused changesets.

python-package/lightgbm/arrow.py

jameslamb · 2023-08-06T18:23:02Z

python-package/lightgbm/arrow.py

+from pyarrow.cffi import ffi
+
+
+@dataclass


This project still supports Python 3.6, so you cannot rely on dataclasses from the standard library being available: #5765 (comment)

For the purpose of this PR, please just make it a normal class and add the bit of __init__() boilerplate like

    class _ArrowCArray:
        def __init__(self, n_chunks, chunks, schema):
            self.n_chunks = n_chunks
            self.chunks = chunks
            self.schema = schema

Hey sorry @borchero ... now that #6048 has been merged, in this and the other PRs you can use dataclasses freely! We decided to take a Python-3.6-only dependency on the dataclasses backport.
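For readers comparing the two options discussed above, here is a minimal sketch of the plain-class form next to the dataclass form. The class names and the field types are assumptions made for illustration; only the three field names come from the review comment.

```python
from dataclasses import dataclass
from typing import Any


class PlainArrowCArray:
    """Plain-class form with hand-written __init__ boilerplate (hypothetical name)."""

    def __init__(self, n_chunks, chunks, schema):
        self.n_chunks = n_chunks
        self.chunks = chunks
        self.schema = schema


@dataclass
class DataclassArrowCArray:
    """Dataclass form (hypothetical name); the field types are assumptions."""

    n_chunks: int
    chunks: Any
    schema: Any


# Both forms store the same three attributes.
plain = PlainArrowCArray(2, ["chunk0", "chunk1"], "schema")
generated = DataclassArrowCArray(2, ["chunk0", "chunk1"], "schema")
same_fields = vars(plain) == vars(generated)
```

The dataclass additionally generates `__repr__` and `__eq__`, which the plain class would have to define by hand.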

jameslamb · 2023-08-06T18:25:02Z

python-package/lightgbm/basic.py

    ) -> "Dataset":
        """Set property into the Dataset.

        Parameters
        ----------
        field_name : str
            The field name of the information.
-        data : list, list of lists (for multi-class task), numpy array, pandas Series, pandas DataFrame (for multi-class task), or None
+        data : list, list of lists (for multi-class task), numpy array, pandas Series, pandas DataFrame (for multi-class task), pyarrow Table, pyarrow Array, or None


Suggested change

-        data : list, list of lists (for multi-class task), numpy array, pandas Series, pandas DataFrame (for multi-class task), pyarrow Table, pyarrow Array, or None
+        data : list, list of lists (for multi-class task), numpy array, pandas Series, pandas DataFrame (for multi-class task), pyarrow Table, pyarrow Array, pyarrow ChunkedArray or None

Based on the corresponding change to the type hint, this should also note that a ChunkedArray is possible.

lorentzenchr · 2023-08-20T11:03:52Z

Arrow support would be really nice.
Just an idea: How about using nanoarrow? It is still early, but it is meant for exactly cases like this one.

sheldonrong · 2023-10-15T10:58:48Z

@borchero any plans to move this forward? thanks.

jameslamb · 2023-10-15T16:40:43Z

> any plans to move this forward

Development is happening in #6034 right now.

borchero · 2023-10-31T00:15:01Z

Logic of this PR is now fully implemented via #6034, #6163, #6164, #6166, #6167, #6168.

jameslamb · 2023-12-04T19:32:14Z

Now that there are just 2 PRs left in the initial work for this (#6168, #6210), I think this draft PR can be safely closed.

Thanks so much for all your help and patience @borchero , and for splitting this up into smaller and easier-to-review pieces.

github-actions · 2024-12-19T22:28:16Z

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

borchero added 2 commits August 5, 2023 17:42

Add Arrow support to Python API

ab2d5e2

Merge branch 'master' into arrow-support

570ca64

borchero requested review from guolinke, jameslamb, shiyu1994 and jmoralez as code owners August 5, 2023 15:51

borchero mentioned this pull request Aug 5, 2023

Feature Requests & Voting Hub #2302

Open

borchero added 2 commits August 5, 2023 22:39

Fix lint

c21fab4

Fix isort

2cd4302

jameslamb requested changes Aug 6, 2023

View reviewed changes

jameslamb added feature in progress labels Aug 6, 2023

borchero marked this pull request as draft August 12, 2023 16:16

borchero mentioned this pull request Aug 12, 2023

[python-package] Allow to pass Arrow table as training data #6034

Merged

jameslamb mentioned this pull request Aug 18, 2023

[python-package] use dataclass for CallbackEnv #6048

Merged

jameslamb closed this Dec 4, 2023

borchero deleted the arrow-support branch December 4, 2023 21:11

jameslamb removed the in progress label Dec 10, 2023

github-actions bot locked as resolved and limited conversation to collaborators Dec 19, 2024
