[python-package] Allow to pass Arrow table and array as init scores #6167

Merged 82 commits on Dec 4, 2023 (diff shown from 79 of 82 commits)

Commits
ab2d5e2
Add Arrow support to Python API
borchero Jul 31, 2023
570ca64
Merge branch 'master' into arrow-support
borchero Aug 5, 2023
c21fab4
Fix lint
borchero Aug 5, 2023
2cd4302
Fix isort
borchero Aug 5, 2023
71957f6
[python-package] Allow to pass Arrow table as training data
borchero Aug 12, 2023
175fb13
Merge branch 'master' into arrow-support-training-data
borchero Aug 12, 2023
32dfb11
Remove change
borchero Aug 12, 2023
b5f0676
Implement JL comments
borchero Aug 12, 2023
cca3b37
Fix isort
borchero Aug 12, 2023
001139a
Remove testcase
borchero Aug 12, 2023
5861ca6
Adjust pyarrow version
borchero Aug 12, 2023
54d171c
Revert gitignore
borchero Aug 21, 2023
a87a15b
Fix lint
borchero Sep 5, 2023
8cda7cd
Merge branch 'master' into arrow-support-training-data
borchero Sep 5, 2023
6b4245a
Increase timeout for bdist_wheel build
borchero Sep 6, 2023
14a9326
Fix layout
borchero Sep 6, 2023
854f306
Add newline
borchero Sep 6, 2023
269582c
Fix typo
borchero Sep 11, 2023
9164040
Merge branch 'master' into arrow-support-training-data
borchero Sep 11, 2023
9a0a18d
Merge branch 'master' into arrow-support-training-data
borchero Sep 15, 2023
e5540cd
Remove arrow.py
borchero Sep 15, 2023
98997bf
Merge branch 'master' into arrow-support-training-data
jameslamb Sep 26, 2023
4a66cba
Merge branch 'master' into arrow-support-training-data
borchero Oct 12, 2023
f44421e
Fix cpp tests
borchero Oct 12, 2023
80b0aa3
Fix tests
borchero Oct 12, 2023
1869cfb
Fix omp parallel
borchero Oct 12, 2023
ba62bcc
Add missing <cmath> header
borchero Oct 12, 2023
db449e1
Fix cpplint
borchero Oct 12, 2023
3dab653
Disable arrow tests
borchero Oct 12, 2023
840cba9
Try fixing memory issue in tests
borchero Oct 13, 2023
19b210b
Try chunking in test
borchero Oct 13, 2023
059419d
Fix lint
borchero Oct 13, 2023
36e7bf4
Merge branch 'master' into arrow-support-training-data
borchero Oct 25, 2023
143a247
Implement review comments
borchero Oct 25, 2023
bb97817
Merge branch 'master' into arrow-support-training-data
jameslamb Oct 30, 2023
62431f2
Uninstall optional dependencies correctly
borchero Oct 30, 2023
34ee108
[python-package] Allow to pass Arrow array as labels
borchero Oct 30, 2023
90a2c1f
Fix lint
borchero Oct 30, 2023
6b65bcf
Fix lint
borchero Oct 30, 2023
ec33f75
WIP: [python-package] Allow to pass Arrow array as weights
borchero Oct 30, 2023
20a23b8
Fix lint
borchero Oct 30, 2023
ccdb0ba
Push
borchero Oct 30, 2023
7dbce53
Remove test
borchero Oct 30, 2023
ce69120
Merge branch 'arrow-support-weights' into arrow-support-groups
borchero Oct 30, 2023
e1593c2
Groups
borchero Oct 30, 2023
0af7a7c
[python-package] Allow to pass Arrow table as training data
borchero Oct 30, 2023
45a67a6
Merge branch 'arrow-support-training-data' into arrow-support-labels
borchero Oct 30, 2023
80c12c0
Merge branch 'arrow-support-labels' into arrow-support-weights
borchero Oct 30, 2023
221cba4
Merge branch 'arrow-support-weights' into arrow-support-groups
borchero Oct 30, 2023
15c8637
Fix isort
borchero Oct 30, 2023
b1d2071
WIP: [python-package] Allow to pass Arrow table and array as init scores
borchero Oct 30, 2023
06bdce2
Merge branch 'master' into arrow-support-labels
borchero Nov 2, 2023
75a980e
Merge branch 'arrow-support-labels' into arrow-support-weights
borchero Nov 2, 2023
a53e8bb
Merge branch 'arrow-support-weights' into arrow-support-init-scores
borchero Nov 2, 2023
3d3ffb1
Merge branch 'arrow-support-weights' into arrow-support-groups
borchero Nov 2, 2023
591fb71
Merge branch 'arrow-support-groups' into arrow-support-init-scores
borchero Nov 2, 2023
f7c67e7
Implement guolinke's review
borchero Nov 7, 2023
91fade9
Merge branch 'master' into arrow-support-labels
jameslamb Nov 7, 2023
09ad33b
Merge branch 'arrow-support-labels' into arrow-support-weights
borchero Nov 7, 2023
33f3e44
Merge branch 'master' into arrow-support-labels
borchero Nov 7, 2023
cd556da
Merge branch 'arrow-support-labels' into arrow-support-weights
borchero Nov 7, 2023
678ae7d
Use np_assert_array_equal
borchero Nov 7, 2023
5331202
Implement jameslamb's review comments
borchero Nov 8, 2023
74910d4
Merge branch 'master' into arrow-support-weights
jameslamb Nov 8, 2023
5041282
Merge branch 'master' into arrow-support-weights
jameslamb Nov 13, 2023
04f0f21
Merge branch 'arrow-support-weights' into arrow-support-groups
borchero Nov 14, 2023
5e2baa1
Fix
borchero Nov 14, 2023
ff5c9f8
Merge branch 'master' into arrow-support-groups
borchero Nov 14, 2023
0f56ea0
Fix and implement review comments
borchero Nov 15, 2023
797cc3a
Fix
borchero Nov 15, 2023
8714625
Fix test
borchero Nov 16, 2023
acd916e
Fix
borchero Nov 16, 2023
c00b841
Merge branch 'master' into arrow-support-groups
borchero Nov 22, 2023
9b07160
Add tests for empty chunks
borchero Nov 22, 2023
79d050b
Fix lint
borchero Nov 22, 2023
caf66ee
Merge branch 'arrow-support-groups' into arrow-support-init-scores
borchero Nov 22, 2023
64d082f
Merge branch 'master' into arrow-support-init-scores
borchero Nov 22, 2023
a662a80
Fix
borchero Nov 22, 2023
e965af2
Fix
borchero Nov 22, 2023
c5acae7
Stricter test
borchero Nov 22, 2023
2e00f84
Merge branch 'master' into arrow-support-init-scores
jameslamb Nov 30, 2023
997705e
Merge branch 'master' into arrow-support-init-scores
jameslamb Dec 1, 2023
5 changes: 3 additions & 2 deletions include/LightGBM/c_api.h
@@ -559,9 +559,10 @@ LIGHTGBM_C_EXPORT int LGBM_DatasetSetField(DatasetHandle handle,
  * \brief Set vector to a content in info.
  * \note
  * - \a group converts input datatype into ``int32``;
- * - \a label and \a weight convert input datatype into ``float32``.
+ * - \a label and \a weight convert input datatype into ``float32``;
+ * - \a init_score converts input datatype into ``float64``.
  * \param handle Handle of dataset
- * \param field_name Field name, can be \a label, \a weight, \a group
+ * \param field_name Field name, can be \a label, \a weight, \a init_score, \a group
  * \param n_chunks The number of Arrow arrays passed to this function
  * \param chunks Pointer to the list of Arrow arrays
  * \param schema Pointer to the schema of all Arrow arrays
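The dtype contract documented in this header can be summarized in a short sketch. The `FIELD_DTYPES` mapping and `coerce_field` helper below are illustrative only, not part of LightGBM's API:

```python
import numpy as np

# Target dtypes that LGBM_DatasetSetField coerces each field to,
# per the doc comment above.
FIELD_DTYPES = {
    "label": np.float32,
    "weight": np.float32,
    "init_score": np.float64,  # double precision, per this PR
    "group": np.int32,
}

def coerce_field(field_name: str, values) -> np.ndarray:
    """Mimic the per-field conversion the C API documents."""
    return np.asarray(values, dtype=FIELD_DTYPES[field_name])

print(coerce_field("init_score", [0.1, 0.2]).dtype)  # float64
```

Note that `init_score` is the only field kept at double precision, since the scores feed directly into the boosting process.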
4 changes: 4 additions & 0 deletions include/LightGBM/dataset.h
@@ -125,6 +125,7 @@ class Metadata {
    * \param init_score Initial scores, this class will manage memory for init_score.
    */
   void SetInitScore(const double* init_score, data_size_t len);
+  void SetInitScore(const ArrowChunkedArray& array);
 
   /*!
@@ -347,6 +348,9 @@ class Metadata {
   void SetWeightsFromIterator(It first, It last);
   /*! \brief Insert initial scores at the given index */
   void InsertInitScores(const double* init_scores, data_size_t start_index, data_size_t len, data_size_t source_size);
+  /*! \brief Set init scores from pointers to the first element and the end of an iterator. */
+  template <typename It>
+  void SetInitScoresFromIterator(It first, It last);
   /*! \brief Insert queries at the given index */
   void InsertQueries(const data_size_t* queries, data_size_t start_index, data_size_t len);
   /*! \brief Set queries from pointers to the first element and the end of an iterator. */
28 changes: 20 additions & 8 deletions python-package/lightgbm/basic.py
@@ -19,8 +19,8 @@
 import scipy.sparse
 
 from .compat import (PANDAS_INSTALLED, PYARROW_INSTALLED, arrow_cffi, arrow_is_floating, arrow_is_integer, concat,
-                     dt_DataTable, pa_Array, pa_ChunkedArray, pa_compute, pa_Table, pd_CategoricalDtype, pd_DataFrame,
-                     pd_Series)
+                     dt_DataTable, pa_Array, pa_chunked_array, pa_ChunkedArray, pa_compute, pa_Table,
+                     pd_CategoricalDtype, pd_DataFrame, pd_Series)
 from .libpath import find_lib_path
 
if TYPE_CHECKING:
@@ -84,6 +84,9 @@
     np.ndarray,
     pd_Series,
     pd_DataFrame,
+    pa_Table,
+    pa_Array,
+    pa_ChunkedArray,
 ]
 _LGBM_TrainDataType = Union[
     str,
@@ -1660,7 +1663,7 @@ def __init__(
             sum(group) = n_samples.
             For example, if you have a 100-document dataset with ``group = [10, 20, 40, 10, 10, 10]``, that means that you have 6 groups,
             where the first 10 records are in the first group, records 11-30 are in the second group, records 31-70 are in the third group, etc.
-        init_score : list, list of lists (for multi-class task), numpy array, pandas Series, pandas DataFrame (for multi-class task), or None, optional (default=None)
+        init_score : list, list of lists (for multi-class task), numpy array, pandas Series, pandas DataFrame (for multi-class task), pyarrow Array, pyarrow ChunkedArray, pyarrow Table (for multi-class task) or None, optional (default=None)
             Init score for Dataset.
         feature_name : list of str, or 'auto', optional (default="auto")
             Feature names.
@@ -2440,7 +2443,7 @@ def create_valid(
             sum(group) = n_samples.
             For example, if you have a 100-document dataset with ``group = [10, 20, 40, 10, 10, 10]``, that means that you have 6 groups,
             where the first 10 records are in the first group, records 11-30 are in the second group, records 31-70 are in the third group, etc.
-        init_score : list, list of lists (for multi-class task), numpy array, pandas Series, pandas DataFrame (for multi-class task), or None, optional (default=None)
+        init_score : list, list of lists (for multi-class task), numpy array, pandas Series, pandas DataFrame (for multi-class task), pyarrow Array, pyarrow ChunkedArray, pyarrow Table (for multi-class task) or None, optional (default=None)
             Init score for Dataset.
         params : dict or None, optional (default=None)
             Other parameters for validation Dataset.
@@ -2547,7 +2550,7 @@ def _reverse_update_params(self) -> "Dataset":
     def set_field(
         self,
         field_name: str,
-        data: Optional[Union[List[List[float]], List[List[int]], List[float], List[int], np.ndarray, pd_Series, pd_DataFrame, pa_Array, pa_ChunkedArray]]
+        data: Optional[Union[List[List[float]], List[List[int]], List[float], List[int], np.ndarray, pd_Series, pd_DataFrame, pa_Table, pa_Array, pa_ChunkedArray]]
     ) -> "Dataset":
         """Set property into the Dataset.
 
@@ -2576,7 +2579,16 @@ def set_field(
             return self
 
         # If the data is a arrow data, we can just pass it to C
-        if _is_pyarrow_array(data):
+        if _is_pyarrow_array(data) or _is_pyarrow_table(data):
+            # If a table is being passed, we concatenate the columns. This is only valid for
+            # 'init_score'.
+            if _is_pyarrow_table(data):
+                if field_name != "init_score":
+                    raise ValueError("pyarrow table provided for field other than init_score")
+                data = pa_chunked_array([
+                    chunk for array in data.columns for chunk in array.chunks  # type: ignore
+                ])
+
             c_array = _export_arrow_to_c(data)
             _safe_call(_LIB.LGBM_DatasetSetFieldFromArrow(
                 self._handle,
@@ -2869,7 +2881,7 @@ def set_init_score(
 
         Parameters
         ----------
-        init_score : list, list of lists (for multi-class task), numpy array, pandas Series, pandas DataFrame (for multi-class task), or None
+        init_score : list, list of lists (for multi-class task), numpy array, pandas Series, pandas DataFrame (for multi-class task), pyarrow Array, pyarrow ChunkedArray, pyarrow Table (for multi-class task) or None
             Init score for Booster.
 
         Returns
@@ -4443,7 +4455,7 @@ def refit(
 
             .. versionadded:: 4.0.0
 
-        init_score : list, list of lists (for multi-class task), numpy array, pandas Series, pandas DataFrame (for multi-class task), or None, optional (default=None)
+        init_score : list, list of lists (for multi-class task), numpy array, pandas Series, pandas DataFrame (for multi-class task), pyarrow Array, pyarrow ChunkedArray, pyarrow Table (for multi-class task) or None, optional (default=None)
             Init score for ``data``.
 
             .. versionadded:: 4.0.0
2 changes: 2 additions & 0 deletions python-package/lightgbm/compat.py
@@ -201,6 +201,7 @@ def __init__(self, *args, **kwargs):
     from pyarrow import Array as pa_Array
     from pyarrow import ChunkedArray as pa_ChunkedArray
     from pyarrow import Table as pa_Table
+    from pyarrow import chunked_array as pa_chunked_array
     from pyarrow.cffi import ffi as arrow_cffi
     from pyarrow.types import is_floating as arrow_is_floating
     from pyarrow.types import is_integer as arrow_is_integer
@@ -243,6 +244,7 @@ class pa_compute:  # type: ignore
         all = None
         equal = None
 
+    pa_chunked_array = None
     arrow_is_integer = None
     arrow_is_floating = None
2 changes: 2 additions & 0 deletions src/io/dataset.cpp
@@ -904,6 +904,8 @@ bool Dataset::SetFieldFromArrow(const char* field_name, const ArrowChunkedArray
     metadata_.SetLabel(ca);
   } else if (name == std::string("weight") || name == std::string("weights")) {
     metadata_.SetWeights(ca);
+  } else if (name == std::string("init_score")) {
+    metadata_.SetInitScore(ca);
   } else if (name == std::string("query") || name == std::string("group")) {
     metadata_.SetQuery(ca);
   } else {
28 changes: 20 additions & 8 deletions src/io/metadata.cpp
@@ -355,32 +355,44 @@ void Metadata::CheckOrPartition(data_size_t num_all_data, const std::vector<data
   }
 }
 
-void Metadata::SetInitScore(const double* init_score, data_size_t len) {
+template <typename It>
+void Metadata::SetInitScoresFromIterator(It first, It last) {
   std::lock_guard<std::mutex> lock(mutex_);
-  // save to nullptr
-  if (init_score == nullptr || len == 0) {
+  // Clear init scores on empty input
+  if (last - first == 0) {
     init_score_.clear();
     num_init_score_ = 0;
     return;
   }
-  if ((len % num_data_) != 0) {
+  if (((last - first) % num_data_) != 0) {
     Log::Fatal("Initial score size doesn't match data size");
   }
-  if (init_score_.empty()) { init_score_.resize(len); }
-  num_init_score_ = len;
+  if (init_score_.empty()) {
+    init_score_.resize(last - first);
+  }
+  num_init_score_ = last - first;
 
 #pragma omp parallel for num_threads(OMP_NUM_THREADS()) schedule(static, 512) if (num_init_score_ >= 1024)
   for (int64_t i = 0; i < num_init_score_; ++i) {
-    init_score_[i] = Common::AvoidInf(init_score[i]);
+    init_score_[i] = Common::AvoidInf(first[i]);
   }
   init_score_load_from_file_ = false;
 
 #ifdef USE_CUDA
   if (cuda_metadata_ != nullptr) {
-    cuda_metadata_->SetInitScore(init_score_.data(), len);
+    cuda_metadata_->SetInitScore(init_score_.data(), init_score_.size());
   }
 #endif  // USE_CUDA
 }
 
+void Metadata::SetInitScore(const double* init_score, data_size_t len) {
+  SetInitScoresFromIterator(init_score, init_score + len);
+}
+
+void Metadata::SetInitScore(const ArrowChunkedArray& array) {
+  SetInitScoresFromIterator(array.begin<double>(), array.end<double>());
+}
+
 void Metadata::InsertInitScores(const double* init_scores, data_size_t start_index, data_size_t len, data_size_t source_size) {
   if (num_init_score_ <= 0) {
     Log::Fatal("Inserting initial score data into dataset with no initial scores");
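In Python terms, the validation in `SetInitScoresFromIterator` behaves roughly like the numpy sketch below. The function name is made up, and the exact clamping threshold is an assumption; `Common::AvoidInf` caps infinities at a large finite value:

```python
import numpy as np

def set_init_scores(scores, num_data: int) -> np.ndarray:
    """Rough Python analogue of Metadata::SetInitScoresFromIterator."""
    scores = np.asarray(scores, dtype=np.float64)
    # Empty input clears the init scores.
    if scores.size == 0:
        return np.empty(0, dtype=np.float64)
    # One score per sample and class: total length must divide evenly by num_data.
    if scores.size % num_data != 0:
        raise ValueError("Initial score size doesn't match data size")
    # Stand-in for Common::AvoidInf: replace infinities with a huge finite value.
    return np.clip(scores, -1e300, 1e300)
```

Because both the `double*` overload and the Arrow overload reduce to iterator pairs, the template runs this same logic over a raw pointer range or over `ArrowChunkedArray::begin<double>()`/`end<double>()` without duplicating code.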
45 changes: 44 additions & 1 deletion tests/python_package_test/test_arrow.py
@@ -178,7 +178,7 @@ def test_dataset_construct_weights_none():
     ["array_type", "weight_data"],
     [(pa.array, [3, 0.7, 1.5, 0.5, 0.1]), (pa.chunked_array, [[3], [0.7, 1.5, 0.5, 0.1]])],
 )
-@pytest.mark.parametrize("arrow_type", [pa.float32(), pa.float64()])
+@pytest.mark.parametrize("arrow_type", _FLOAT_TYPES)
 def test_dataset_construct_weights(array_type, weight_data, arrow_type):
     data = generate_dummy_arrow_table()
     weights = array_type(weight_data, type=arrow_type)
@@ -210,3 +210,46 @@ def test_dataset_construct_groups(array_type, group_data, arrow_type):
 
     expected = np.array([0, 2, 5], dtype=np.int32)
     np_assert_array_equal(expected, dataset.get_field("group"), strict=True)
+
+
+# ----------------------------------------- INIT SCORES ----------------------------------------- #
+
+
+@pytest.mark.parametrize(
+    ["array_type", "init_score_data"],
+    [
+        (pa.array, [0, 1, 2, 3, 3]),
+        (pa.chunked_array, [[0, 1, 2], [3, 3]]),
+        (pa.chunked_array, [[], [0, 1, 2], [3, 3]]),
+        (pa.chunked_array, [[0, 1], [], [], [2], [3, 3], []]),
+    ],
+)
+@pytest.mark.parametrize("arrow_type", _INTEGER_TYPES + _FLOAT_TYPES)
+def test_dataset_construct_init_scores_array(
+    array_type: Any, init_score_data: Any, arrow_type: Any
+):
+    data = generate_dummy_arrow_table()
+    init_scores = array_type(init_score_data, type=arrow_type)
+    dataset = lgb.Dataset(data, init_score=init_scores, params=dummy_dataset_params())
+    dataset.construct()
+
+    expected = np.array([0, 1, 2, 3, 3], dtype=np.float64)
+    np_assert_array_equal(expected, dataset.get_init_score(), strict=True)
+
+
+def test_dataset_construct_init_scores_table():
+    data = generate_dummy_arrow_table()
+    init_scores = pa.Table.from_arrays(
+        [
+            generate_random_arrow_array(5, seed=1),
+            generate_random_arrow_array(5, seed=2),
+            generate_random_arrow_array(5, seed=3),
+        ],
+        names=["a", "b", "c"],
+    )
+    dataset = lgb.Dataset(data, init_score=init_scores, params=dummy_dataset_params())
+    dataset.construct()
+
+    actual = dataset.get_init_score()
+    assert actual.dtype == np.float64
+    assert actual.shape == (5, 3)
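The `(5, 3)` shape assertion in the table test reflects the layout implied by the column flattening in `set_field`: each table column becomes one class's block of scores, so the flat buffer is column-major. A small numpy sketch of that layout (the values are made up for illustration):

```python
import numpy as np

n_samples, n_classes = 5, 3
# Flat buffer as LightGBM receives it after flattening:
# class 0's five scores, then class 1's, then class 2's.
flat = np.arange(n_samples * n_classes, dtype=np.float64)

# Reshape column-major (Fortran order) to recover the (n_samples, n_classes) view.
scores = flat.reshape((n_samples, n_classes), order="F")
print(scores.shape)  # (5, 3)
print(scores[0])     # [ 0.  5. 10.]
```

Row `i` of the reshaped array then holds sample `i`'s init score for each of the three classes, matching what `get_init_score()` returns in the test above.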