[python] `Dataset` cannot be constructed from sparse `Sequence` #5207

tony-theorem · 2022-05-10T18:33:23Z

Description

A `Dataset` cannot be constructed from a `Sequence` that returns sparse data. This is unexpected, since `Dataset` generally supports sparse data.

Reproducible example

from __future__ import annotations
import numbers
from typing import Iterable

import lightgbm as lgbm
import numpy as np
from scipy import sparse

class SparseSequence(lgbm.Sequence):
    def __init__(self, sparse_data: sparse.csr_array) -> None:
        assert sparse_data.ndim == 2
        self.sparse_data = sparse_data
        
    def __len__(self) -> int:
        return self.sparse_data.shape[0]
    
    def __getitem__(
        self,
        idx: numbers.Integral | slice | Iterable[int],
    ) -> sparse.csr_array:
        if isinstance(idx, numbers.Integral):
            return self._get_row(int(idx))
        elif isinstance(idx, (slice, Iterable)):
            iter_idx = range(len(self))[idx] if isinstance(idx, slice) else idx
            rows = [self._get_row(i) for i in iter_idx]
            return (
                sparse.csr_array(sparse.vstack(rows, format="csr"))
                if len(rows) != 0 else
                sparse.csr_array((0, self.sparse_data.shape[1]))
            )
        else:
            raise TypeError(
                f"Sequence index must be integer, slice or iterable, got {type(idx).__name__}"
            )
    
    def _get_row(self, idx: int) -> sparse.csr_array:
        return self.sparse_data[[idx], :]

sparse_array = sparse.csr_array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=np.float64)
labels = [0, 1, 1, 0]

# Succeeds
lgbm.Dataset(data=sparse_array, label=labels).construct()
# Fails
lgbm.Dataset(data=SparseSequence(sparse_array), label=labels).construct()

Environment info

Python version 3.9.12

lightgbm == 3.3.2
numpy == 1.22.3
scipy == 1.8.0

Additional Comments

The immediate cause of the failure is https://github.com/microsoft/LightGBM/blob/master/python-package/lightgbm/basic.py#L1543. Sparse arrays and matrices do not have a `flags` attribute, which leads to the AttributeError. Even if this check is removed, a TypeError is then raised in https://github.com/microsoft/LightGBM/blob/master/python-package/lightgbm/basic.py#L1545-L1571 when numpy ufuncs are applied to the sparse data.
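A minimal sketch of the first failure, assuming only numpy and scipy are installed. It checks for the `flags` attribute on a dense row and on a sparse row, and shows one possible way to hand dense data to code that expects an ndarray. `densify` is a hypothetical helper written for this illustration; it is not part of LightGBM and is not a proposed fix.

```python
import numpy as np
from scipy import sparse

dense_row = np.array([0.0, 1.0])
sparse_row = sparse.csr_matrix([[0.0, 1.0]])

# ndarray exposes .flags; SciPy sparse containers do not, so an
# expression like row.flags['OWNDATA'] fails for a sparse row.
has_flags_dense = hasattr(dense_row, "flags")
has_flags_sparse = hasattr(sparse_row, "flags")


def densify(block):
    """Hypothetical helper: return a dense ndarray for a sparse or dense block."""
    if sparse.issparse(block):
        return block.toarray()
    return np.asarray(block)


# After conversion the row is a plain ndarray again and has .flags.
dense_again = densify(sparse_row)
print(has_flags_dense, has_flags_sparse, type(dense_again).__name__)
```

Converting every row to dense gives up the memory savings of the sparse format, so this would only be reasonable as a stopgap for small batches.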


jameslamb · 2022-05-23T02:25:48Z

Thanks very much for the thorough report, @tony-theorem !!! We really appreciate the time you put into creating a reproducible example.

Are you interested in working on a contribution to add this support? We'd be happy to help you with that if you are interested. Otherwise, someone else here will try to pick up the work of adding tests and a fix.

In the future, when opening issues here, please do the following:

Use commit-anchored links

- see https://docs.github.com/en/repositories/working-with-files/using-files/getting-permanent-links-to-files
- "line 1545 of file basic.py on master" may point to something different a year from now than it did at the time you wrote this

here are such links for what you pointed to:

LightGBM/python-package/lightgbm/basic.py

Line 1543 in c000b8c

yield row if row.flags['OWNDATA'] else row.copy()

LightGBM/python-package/lightgbm/basic.py

Lines 1545 to 1571 in c000b8c

    
def __sample(self, seqs: List[Sequence], total_nrow: int) -> Tuple[List[np.ndarray], List[np.ndarray]]:
    """Sample data from seqs.

    Mimics behavior in c_api.cpp:LGBM_DatasetCreateFromMats()

    Returns
    -------
        sampled_rows, sampled_row_indices
    """
    indices = self._create_sample_indices(total_nrow)

    # Select sampled rows, transpose to column order.
    sampled = np.array([row for row in self._yield_row_from_seqlist(seqs, indices)])
    sampled = sampled.T

    filtered = []
    filtered_idx = []
    sampled_row_range = np.arange(len(indices), dtype=np.int32)
    for col in sampled:
        col_predicate = (np.abs(col) > ZERO_THRESHOLD) | np.isnan(col)
        filtered_col = col[col_predicate]
        filtered_row_idx = sampled_row_range[col_predicate]
        filtered.append(filtered_col)
        filtered_idx.append(filtered_row_idx)
    return filtered, filtered_idx

Include the literal text of error messages
- this allows search engines to find your bug report when others encounter the same issue and search an error message
- AttributeError: flags not found
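As a small worked illustration of the `__sample` snippet quoted above, here is a standalone sketch of the same column filtering applied to dense rows. It shows the shape of input the sampling step appears to assume: rows that numpy can stack into a 2-D float array. `filter_columns` is a hypothetical name for this illustration, and the value of `ZERO_THRESHOLD` here is an assumption made so the example runs on its own.

```python
import numpy as np

# Assumed value, standing in for the ZERO_THRESHOLD constant in the quoted snippet.
ZERO_THRESHOLD = 1e-35


def filter_columns(sampled_rows):
    """Return, per column, the non-zero or NaN values and their row indices."""
    # Stack rows into a 2-D array, then transpose to iterate over columns.
    sampled = np.asarray(sampled_rows, dtype=np.float64).T
    row_range = np.arange(sampled.shape[1], dtype=np.int32)
    filtered, filtered_idx = [], []
    for col in sampled:
        keep = (np.abs(col) > ZERO_THRESHOLD) | np.isnan(col)
        filtered.append(col[keep])
        filtered_idx.append(row_range[keep])
    return filtered, filtered_idx


rows = [[0.0, 1.0], [1.0, float("nan")], [0.0, 0.0]]
values, row_idx = filter_columns(rows)
print([idx.tolist() for idx in row_idx])
```

Each step here (stacking, transposing, boolean masking, `np.isnan`) operates on ndarrays, which is consistent with the report that rows returned as sparse containers do not pass through this path.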

jameslamb added the bug label May 23, 2022
