[python] Dataset cannot be constructed from sparse Sequence #5207

Open
tony-theorem opened this issue May 10, 2022 · 1 comment
@tony-theorem

Description

A Dataset cannot be constructed from a Sequence that returns sparse data. This is unexpected, since Dataset generally supports sparse data.

Reproducible example

from __future__ import annotations
import numbers
from typing import Iterable

import lightgbm as lgbm
import numpy as np
from scipy import sparse

class SparseSequence(lgbm.Sequence):
    def __init__(self, sparse_data: sparse.csr_array) -> None:
        assert sparse_data.ndim == 2
        self.sparse_data = sparse_data
        
    def __len__(self) -> int:
        return self.sparse_data.shape[0]
    
    def __getitem__(
        self,
        idx: numbers.Integral | slice | Iterable[int],
    ) -> sparse.csr_array:
        if isinstance(idx, numbers.Integral):
            return self._get_row(int(idx))
        elif isinstance(idx, (slice, Iterable)):
            iter_idx = range(len(self))[idx] if isinstance(idx, slice) else idx
            rows = [self._get_row(i) for i in iter_idx]
            return (
                sparse.csr_array(sparse.vstack(rows, format="csr"))
                if len(rows) != 0 else
                sparse.csr_array((0, self.sparse_data.shape[1]))
            )
        else:
            raise TypeError(
                f"Sequence index must be integer, slice or iterable, got {type(idx).__name__}"
            )
    
    def _get_row(self, idx: int) -> sparse.csr_array:
        return self.sparse_data[[idx], :]

sparse_array = sparse.csr_array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=np.float64)
labels = [0, 1, 1, 0]

# Succeeds
lgbm.Dataset(data=sparse_array, label=labels).construct()
# Fails
lgbm.Dataset(data=SparseSequence(sparse_array), label=labels).construct()

Environment info

Python version 3.9.12

lightgbm == 3.3.2
numpy == 1.22.3
scipy == 1.8.0

Additional Comments

The immediate cause of the failure is https://github.com/microsoft/LightGBM/blob/master/python-package/lightgbm/basic.py#L1543: sparse arrays and matrices do not have the flags attribute, which leads to the AttributeError. Even if this check is removed, a TypeError is then raised in https://github.com/microsoft/LightGBM/blob/master/python-package/lightgbm/basic.py#L1545-L1571 when NumPy ufuncs are applied to the sparse data.
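To illustrate the missing attribute without involving LightGBM at all, here is a minimal sketch using only NumPy and SciPy. The helper name `to_dense_row` is hypothetical (it is not part of LightGBM); it shows one possible way a Sequence could hand back dense rows until sparse rows are supported, at the cost of densifying each returned row.

```python
import numpy as np
from scipy import sparse


def to_dense_row(row):
    """Hypothetical helper: return a 1-D float64 NumPy array for a dense or sparse row."""
    if sparse.issparse(row):
        # toarray() yields shape (1, n_features); ravel() flattens it to 1-D.
        return row.toarray().ravel().astype(np.float64, copy=False)
    return np.asarray(row, dtype=np.float64).ravel()


sparse_data = sparse.csr_matrix([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=np.float64)
sparse_row = sparse_data[[1], :]  # still sparse, shape (1, 2)

# Sparse containers expose no `flags` attribute, NumPy arrays do.
has_flags_sparse = hasattr(sparse_row, "flags")
dense_row = to_dense_row(sparse_row)
has_flags_dense = hasattr(dense_row, "flags")

print(has_flags_sparse, has_flags_dense, dense_row)
```

Returning `to_dense_row(...)` from `_get_row` in the example above would probably sidestep the AttributeError, but I have not verified that against lightgbm 3.3.2, and it gives up the memory benefit of sparse storage for the rows being read.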

@jameslamb added the bug label May 23, 2022
@jameslamb
Collaborator

Thanks very much for the thorough report, @tony-theorem !!! We really appreciate the time you put into creating a reproducible example.

Are you interested in working on a contribution to add this support? We'd be happy to help you with that if you are interested. Otherwise, someone else here will try to pick up the work of adding tests and a fix.


In the future, when opening issues here, please do the following:

  1. Use commit-anchored links
    • see https://docs.github.com/en/repositories/working-with-files/using-files/getting-permanent-links-to-files
    • "line 1545 of file basic.py on master" may point to something different a year from now than it did at the time you wrote this
    • here are such links for what you pointed to:
      • yield row if row.flags['OWNDATA'] else row.copy()
      • def __sample(self, seqs: List[Sequence], total_nrow: int) -> Tuple[List[np.ndarray], List[np.ndarray]]:
        """Sample data from seqs.
        Mimics behavior in c_api.cpp:LGBM_DatasetCreateFromMats()
        Returns
        -------
        sampled_rows, sampled_row_indices
        """
        indices = self._create_sample_indices(total_nrow)
        # Select sampled rows, transpose to column order.
        sampled = np.array([row for row in self._yield_row_from_seqlist(seqs, indices)])
        sampled = sampled.T
        filtered = []
        filtered_idx = []
        sampled_row_range = np.arange(len(indices), dtype=np.int32)
        for col in sampled:
            col_predicate = (np.abs(col) > ZERO_THRESHOLD) | np.isnan(col)
            filtered_col = col[col_predicate]
            filtered_row_idx = sampled_row_range[col_predicate]
            filtered.append(filtered_col)
            filtered_idx.append(filtered_row_idx)
        return filtered, filtered_idx
  2. Include the literal text of error messages
    • this allows search engines to find your bug report when others encounter the same issue and search an error message
    • AttributeError: flags not found
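For anyone reading the `__sample` excerpt above, the column filtering step can be sketched standalone on dense data. This is an illustration only, not LightGBM's code: the function name `filter_sampled_columns` is made up here, and the `ZERO_THRESHOLD` value is an assumed placeholder, so check the constant in the LightGBM version you are using.

```python
import numpy as np

# Assumed value for illustration; not confirmed against a specific LightGBM release.
ZERO_THRESHOLD = 1e-35


def filter_sampled_columns(sampled_rows):
    """Keep, per column, the values that are non-zero or NaN, plus their row indices."""
    # Rows in, columns out: transpose so that iteration walks over columns.
    sampled = np.asarray(sampled_rows, dtype=np.float64).T
    row_range = np.arange(sampled.shape[1], dtype=np.int32)
    filtered, filtered_idx = [], []
    for col in sampled:
        keep = (np.abs(col) > ZERO_THRESHOLD) | np.isnan(col)
        filtered.append(col[keep])
        filtered_idx.append(row_range[keep])
    return filtered, filtered_idx


rows = [[0.0, 1.0], [2.0, 0.0], [np.nan, 3.0]]
values, indices = filter_sampled_columns(rows)
for col_values, col_indices in zip(values, indices):
    print(col_values, col_indices)
```

Both `np.abs` and `np.isnan` are NumPy ufuncs applied to each column, so this step expects dense arrays; sparse rows would need to be densified or handled through a separate code path first.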
