Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[python-package] Allow to pass Arrow array as labels #6163

Merged
merged 44 commits into from
Nov 7, 2023
Merged

[python-package] Allow to pass Arrow array as labels #6163

merged 44 commits into from
Nov 7, 2023

Conversation

borchero
Copy link
Collaborator

Motivation

This is part of a series of PRs towards #6022. This is built on #6034 but, unfortunately, I cannot properly specify the PR base here as I can't use my fork's branch as base branch.

For a legible diff, see borchero/LightGBM@arrow-support-training-data...arrow-support-labels.

Copy link
Collaborator

@jameslamb jameslamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I've converted this to a draft PR for now, since it shouldn't be reviewed or merged until #6034 has been merged.

@jameslamb
Copy link
Collaborator

Now that #6034 has been merged, let's work on this one next. @borchero I'd be happy to review this once you merge in master and mark it ready for review.

@borchero
Copy link
Collaborator Author

borchero commented Nov 2, 2023

Yes @jameslamb! Will do some time today (I'm currently on vacation, not spending too much time with my laptop 😄).

@jameslamb
Copy link
Collaborator

I'm currently on vacation

Oh don't worry about it, enjoy your vacation! I'll be here and ready whenever you get back. This is not urgent at all... it's up to you how fast we move through these.

@borchero borchero changed the title WIP: [python-package] Allow to pass Arrow array as labels [python-package] Allow to pass Arrow array as labels Nov 2, 2023
@borchero borchero marked this pull request as ready for review November 2, 2023 21:05
Copy link
Collaborator

@jameslamb jameslamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes on the Python side look great to me... thanks for working within this project's existing patterns for type hints, optional-dependency imports, etc. And the tests look excellent to me!

Changes on the C/C++ side look good to me at a glance, but I'm not fully qualified to approve them, so I'm just leaving a "Comment" review.

@guolinke or @shiyu1994 could you please help with a review?

@@ -67,6 +67,10 @@ def dummy_dataset_params() -> Dict[str, Any]:
}


def assert_arrays_equal(lhs: np.ndarray, rhs: np.ndarray):
assert lhs.dtype == rhs.dtype and np.array_equal(lhs, rhs)
Copy link
Collaborator

@jameslamb jameslamb Nov 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of this approach for testing if two numpy arrays are identical, I'd prefer that we use np.testing.assert_array_equal(..., strict=True). That functionality is also so generically useful in lightgbm's tests that I think it belongs in tests/python_package_test/utils.py.

I've introduced such a function in https://github.com/microsoft/LightGBM/pull/6108/files#diff-da8bd23ee10ae6337467b4d88500a4a695b154d23a17742a4cef514cb1d6cf85R206.

@jmoralez I'm waiting to merge that PR until you have a chance to review #6108 again, specifically #6108 (comment). Will you have time to do that this week?

I know it's kind of involved, so if not I'll pull that new testing function out of that PR into a separate one that'd hopefully require less reviewing effort, so we could re-use it here and in other tests.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jameslamb any chance #6108 is merged soon so I can take advantage of the new function? Or should I just copy the function into this branch so that we can proceed?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't just copy that function into this branch... I want it to be reviewed by someone other than myself.

We can just merge this today and I'll clean up and consolidate the testing functions later.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All right! I was under the impression that #6108 would have already been reviewed by @guolinke 😄

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It has but @jmoralez and I have been collaborating on it (as can be seen in the comment thread I linked to from #6163 (comment)), so I don't want to merge it until he has the time to review.

Pull request reviews here aren't just quality gates like a building or restaurant inspection... they're a chance for maintainers to agree/disagree about the state of this thing we're all maintaining together. "at least one approval" is the minimum standard we enforce but that doesn't mean that every PR should be merged as soon as one person approves.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey. Sorry for the delay, I moved last week and missed the update in #6108. I'm reviewing it now.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No rush at all, don't let me add stress to your life! Seriously, it can happen any time.

src/c_api.cpp Outdated Show resolved Hide resolved
Copy link
Collaborator

@guolinke guolinke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, LGTM

@jameslamb jameslamb self-requested a review November 7, 2023 18:12
Copy link
Collaborator

@jameslamb jameslamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Let's do query groups next (#6166), as I think that'll be easier than init_score or weight since group can only be a 1-dimensional array.

@jameslamb jameslamb merged commit b7f6311 into microsoft:master Nov 7, 2023
41 checks passed
@borchero
Copy link
Collaborator Author

borchero commented Nov 7, 2023

@jameslamb: can't weight be only a one-dimensional input as well? At least that's what I assumed in #6164. If so, I'd prefer to do that one first since all the remaining PRs depend on each other sequentially 👀

@jameslamb
Copy link
Collaborator

Ah yeah you're right

weight : list, numpy 1-D array, pandas Series or None, optional (default=None)
Weight for each instance. Weights should be non-negative.

Sorry, forgot that for multi-class classification we don't support specifying different row weights by class.

Going to #6164 next is fine. I didn't realize they relied on each other sequentially.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants