Data Reconstruction fails when selected features don't include a categorical feature. #36

Closed
nikml opened this issue Mar 14, 2022 · 0 comments · Fixed by #37
Labels
bug Something isn't working

Comments

Contributor

nikml commented Mar 14, 2022

  • nannyml version: 0.2.0 (from main branch)
  • Python version: 3.10
  • Operating System: Fedora 35

Description

Performing Multivariate Drift Analysis with Data Reconstruction on a dataset without categorical features fails.

What I Did

I tried to run the code from Data Reconstruction Deep Dive but the code failed with:

# Let's compute univariate drift
rcerror_calculator = nml.DataReconstructionDriftCalculator(model_metadata=metadata, chunk_size=DPP)
rcerror_calculator.fit(reference_data=reference)
# let's compute (and visualize) results across all the dataset.
rcerror_results = rcerror_calculator.calculate(data=data)
rcerror_results.data
# let's create plot with results
figure = rcerror_results.plot()
figure.show()
figure.write_image(file="butterfly-multivariate-drift.svg")
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/ipykernel_10304/2321058283.py in <cell line: 3>()
      1 # Let's compute univariate drift
      2 rcerror_calculator = nml.DataReconstructionDriftCalculator(model_metadata=metadata, chunk_size=DPP)
----> 3 rcerror_calculator.fit(reference_data=reference)
      4 # let's compute (and visualize) results across all the dataset.
      5 rcerror_results = rcerror_calculator.calculate(data=data)

~/Source/nannyml/nannyml/drift/base.py in fit(self, reference_data)
    165                 self.chunker = DefaultChunker(minimum_chunk_size=minimum_chunk_size)
    166 
--> 167         self._fit(reference_data)
    168 
    169     def _fit(self, reference_data: pd.DataFrame):

~/Source/nannyml/nannyml/drift/data_reconstruction/calculator.py in _fit(self, reference_data)
    101         # TODO: We duplicate the reference data 3 times, here. Improve to something more memory efficient?
    102         imputed_reference_data = reference_data.copy(deep=True)
--> 103         imputed_reference_data[selected_categorical_column_names] = self._imputer_categorical.fit_transform(
    104             imputed_reference_data[selected_categorical_column_names]
    105         )

~/.cache/pypoetry/virtualenvs/nannyml-RmJkXFBz-py3.10/lib64/python3.10/site-packages/sklearn/base.py in fit_transform(self, X, y, **fit_params)
    850         if y is None:
    851             # fit method of arity 1 (unsupervised transformation)
--> 852             return self.fit(X, **fit_params).transform(X)
    853         else:
    854             # fit method of arity 2 (supervised transformation)

~/.cache/pypoetry/virtualenvs/nannyml-RmJkXFBz-py3.10/lib64/python3.10/site-packages/sklearn/impute/_base.py in fit(self, X, y)
    317             Fitted estimator.
    318         """
--> 319         X = self._validate_input(X, in_fit=True)
    320 
    321         # default fill_value is 0 for numerical input and "missing_value"

~/.cache/pypoetry/virtualenvs/nannyml-RmJkXFBz-py3.10/lib64/python3.10/site-packages/sklearn/impute/_base.py in _validate_input(self, X, in_fit)
    285                 raise new_ve from None
    286             else:
--> 287                 raise ve
    288 
    289         _check_inputs_dtype(X, self.missing_values)

~/.cache/pypoetry/virtualenvs/nannyml-RmJkXFBz-py3.10/lib64/python3.10/site-packages/sklearn/impute/_base.py in _validate_input(self, X, in_fit)
    268 
    269         try:
--> 270             X = self._validate_data(
    271                 X,
    272                 reset=in_fit,

~/.cache/pypoetry/virtualenvs/nannyml-RmJkXFBz-py3.10/lib64/python3.10/site-packages/sklearn/base.py in _validate_data(self, X, y, reset, validate_separately, **check_params)
    564             raise ValueError("Validation should be done on X, y or both.")
    565         elif not no_val_X and no_val_y:
--> 566             X = check_array(X, **check_params)
    567             out = X
    568         elif no_val_X and not no_val_y:

~/.cache/pypoetry/virtualenvs/nannyml-RmJkXFBz-py3.10/lib64/python3.10/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
    663 
    664         if all(isinstance(dtype, np.dtype) for dtype in dtypes_orig):
--> 665             dtype_orig = np.result_type(*dtypes_orig)
    666 
    667     if dtype_numeric:

<__array_function__ internals> in result_type(*args, **kwargs)

ValueError: at least one array or dtype is required

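The current implementation likely cannot handle a feature selection that contains no categorical features, or one that contains no continuous features. In the traceback above, selected_categorical_column_names is empty, so the categorical imputer is fitted on a DataFrame with zero columns and np.result_type is called with no arguments.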
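
One possible way to avoid the error is to skip an imputer whenever its column list is empty. The snippet below is only a minimal sketch of that idea, not the fix merged for this issue and not nannyml code: `impute_columns`, the toy `reference` frame, and the column lists are hypothetical names made up for illustration, and the imputer strategies are assumptions.

```python
import pandas as pd
from sklearn.impute import SimpleImputer


def impute_columns(df: pd.DataFrame, columns: list, imputer: SimpleImputer) -> pd.DataFrame:
    """Hypothetical helper: impute `columns` on a copy of `df`, skipping empty selections."""
    result = df.copy(deep=True)
    # Only call the imputer when there is at least one column to fit on;
    # fitting on a zero-column DataFrame is what raises the ValueError.
    if len(columns) > 0:
        result[columns] = imputer.fit_transform(result[columns])
    return result


# Toy reference data with continuous features only (assumption for illustration).
reference = pd.DataFrame({"f1": [1.0, None, 3.0], "f2": [0.5, 0.25, None]})
categorical_columns = []  # no categorical features selected
continuous_columns = ["f1", "f2"]

# The categorical step becomes a no-op instead of failing.
imputed = impute_columns(reference, categorical_columns, SimpleImputer(strategy="most_frequent"))
# The continuous step imputes missing values with the column mean.
imputed = impute_columns(imputed, continuous_columns, SimpleImputer(strategy="mean"))
```

The same guard would apply symmetrically when the continuous column list is empty and only categorical features are selected.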

nikml added the bug label Mar 14, 2022
nikml self-assigned this Mar 14, 2022