Data Reconstruction fails when selected features don't include a categorical feature. #36

nikml · 2022-03-14T12:31:23Z

nannyml version: 0.2.0 (from main branch)
Python version: 3.10
Operating System: Fedora 35

Description

Perform Multivariate Drift Analysis with Data Reconstruction on a dataset without categorical features.

What I Did

I tried to run the code from the Data Reconstruction Deep Dive, but it failed with:

# Let's compute univariate drift
rcerror_calculator = nml.DataReconstructionDriftCalculator(model_metadata=metadata, chunk_size=DPP)
rcerror_calculator.fit(reference_data=reference)
# let's compute (and visualize) results across all the dataset.
rcerror_results = rcerror_calculator.calculate(data=data)
rcerror_results.data
# let's create plot with results
figure = rcerror_results.plot()
figure.show()
figure.write_image(file="butterfly-multivariate-drift.svg")
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/ipykernel_10304/2321058283.py in <cell line: 3>()
      1 # Let's compute univariate drift
      2 rcerror_calculator = nml.DataReconstructionDriftCalculator(model_metadata=metadata, chunk_size=DPP)
----> 3 rcerror_calculator.fit(reference_data=reference)
      4 # let's compute (and visualize) results across all the dataset.
      5 rcerror_results = rcerror_calculator.calculate(data=data)

~/Source/nannyml/nannyml/drift/base.py in fit(self, reference_data)
    165                 self.chunker = DefaultChunker(minimum_chunk_size=minimum_chunk_size)
    166 
--> 167         self._fit(reference_data)
    168 
    169     def _fit(self, reference_data: pd.DataFrame):

~/Source/nannyml/nannyml/drift/data_reconstruction/calculator.py in _fit(self, reference_data)
    101         # TODO: We duplicate the reference data 3 times, here. Improve to something more memory efficient?
    102         imputed_reference_data = reference_data.copy(deep=True)
--> 103         imputed_reference_data[selected_categorical_column_names] = self._imputer_categorical.fit_transform(
    104             imputed_reference_data[selected_categorical_column_names]
    105         )

~/.cache/pypoetry/virtualenvs/nannyml-RmJkXFBz-py3.10/lib64/python3.10/site-packages/sklearn/base.py in fit_transform(self, X, y, **fit_params)
    850         if y is None:
    851             # fit method of arity 1 (unsupervised transformation)
--> 852             return self.fit(X, **fit_params).transform(X)
    853         else:
    854             # fit method of arity 2 (supervised transformation)

~/.cache/pypoetry/virtualenvs/nannyml-RmJkXFBz-py3.10/lib64/python3.10/site-packages/sklearn/impute/_base.py in fit(self, X, y)
    317             Fitted estimator.
    318         """
--> 319         X = self._validate_input(X, in_fit=True)
    320 
    321         # default fill_value is 0 for numerical input and "missing_value"

~/.cache/pypoetry/virtualenvs/nannyml-RmJkXFBz-py3.10/lib64/python3.10/site-packages/sklearn/impute/_base.py in _validate_input(self, X, in_fit)
    285                 raise new_ve from None
    286             else:
--> 287                 raise ve
    288 
    289         _check_inputs_dtype(X, self.missing_values)

~/.cache/pypoetry/virtualenvs/nannyml-RmJkXFBz-py3.10/lib64/python3.10/site-packages/sklearn/impute/_base.py in _validate_input(self, X, in_fit)
    268 
    269         try:
--> 270             X = self._validate_data(
    271                 X,
    272                 reset=in_fit,

~/.cache/pypoetry/virtualenvs/nannyml-RmJkXFBz-py3.10/lib64/python3.10/site-packages/sklearn/base.py in _validate_data(self, X, y, reset, validate_separately, **check_params)
    564             raise ValueError("Validation should be done on X, y or both.")
    565         elif not no_val_X and no_val_y:
--> 566             X = check_array(X, **check_params)
    567             out = X
    568         elif no_val_X and not no_val_y:

~/.cache/pypoetry/virtualenvs/nannyml-RmJkXFBz-py3.10/lib64/python3.10/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
    663 
    664         if all(isinstance(dtype, np.dtype) for dtype in dtypes_orig):
--> 665             dtype_orig = np.result_type(*dtypes_orig)
    666 
    667     if dtype_numeric:

<__array_function__ internals> in result_type(*args, **kwargs)

ValueError: at least one array or dtype is required
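
The last two frames suggest why an empty column selection ends in this message. The snippet below is a minimal sketch of that mechanism using only NumPy. It assumes the list of selected categorical columns was empty, so the dtype list seen by `check_array` was empty too; the variable names mirror the traceback but the snippet is illustrative, not the library code.

```python
import numpy as np

# Assumption: no categorical columns were selected, so there are no dtypes to inspect.
dtypes_orig = []

# all() over an empty iterable is True, so the dtype guard passes vacuously.
check_passes = all(isinstance(dtype, np.dtype) for dtype in dtypes_orig)

try:
    # With an empty list this becomes np.result_type(), which has nothing to combine.
    np.result_type(*dtypes_orig)
    raised = False
except ValueError as error:
    raised = True
    print(f"ValueError: {error}")
```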

The current implementation likely cannot handle a feature selection that contains no categorical features, or one that contains no continuous features.
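
One possible way to guard against this, sketched under assumptions: skip the imputation step whenever the corresponding column list is empty. The helper `impute_selected` and the toy `reference` frame are hypothetical and written for illustration; this is not necessarily how the calculator was or should be fixed.

```python
import pandas as pd
from sklearn.impute import SimpleImputer


def impute_selected(frame: pd.DataFrame, columns: list, imputer: SimpleImputer) -> pd.DataFrame:
    """Hypothetical helper: impute `columns` in a copy of `frame`, skipping empty selections."""
    result = frame.copy(deep=True)
    # Only call the imputer when there is at least one column to impute.
    if len(columns) > 0:
        result[columns] = imputer.fit_transform(result[columns])
    return result


# Toy reference data with continuous features only (no categorical columns).
reference = pd.DataFrame({"f1": [1.0, None, 3.0], "f2": [4.0, 5.0, None]})
continuous = ["f1", "f2"]
categorical = []

imputed = impute_selected(reference, continuous, SimpleImputer(strategy="mean"))
# The empty categorical selection is skipped instead of raising.
imputed = impute_selected(imputed, categorical, SimpleImputer(strategy="most_frequent"))
```

The same guard covers the opposite case, a selection with categorical features but no continuous ones.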

nikml added the bug label Mar 14, 2022

nikml mentioned this issue Mar 14, 2022

multivariate drift bugfix and doc update #37

Merged

nikml self-assigned this Mar 14, 2022

nnansters closed this as completed in #37 Mar 15, 2022
