Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/nominal dtype checks in transform #162

Merged
merged 5 commits into from
Jan 11, 2024

Conversation

TommyMatthews
Copy link
Contributor

@TommyMatthews TommyMatthews commented Jan 11, 2024

Addresses issue #145 in part. I was unable to replicate the error I found in the generic testing where the fit method did not raise an appropriate error for X not being a pandas.DataFrame object.

This bug definitely exists for the transform method so I have fixed that. super.transform was always getting called, but always after the check_mappable_rows method inherited from BaseNominalTransformer, which was making checks that assumed a dataframe. This is a good example of an implementation test (super transform call) not testing for the behaviour we want (error if X not pd dataframe).

In the case of MeanResponseTransformer I just moved the call of super.transform to the start of the transform method, however in OrdinalEncoderTransformer and NominalToIntegerTransformer this led to X not passing the check_mappable_rows check, so I just called the relevant columns_check method from BaseTransformer at the start of the transform method, which raises the appropriate error. I think the fact that I have had to do it like this reflects either a lack of understanding on my part or a need to refactor BaseNominalTransformer. I have added tests for this behaviour (copied from BaseTransformer test suite).

The issue still mentions the possibility of adding nominal type checks (maybe with a warning) for nominal transformers. I think probably best to get the bug fix in, close the issue and open a new issue for this feature.

@davidhopkinson26
Copy link
Contributor

Thanks @TommyMatthews. I agree this is all quite messy and could do with a further reaching refactor. Going forward I would like to see less reliance on super init/fit/transform calls and move towards the error handling being in more specific methods to be called in child classes (like you have done with the columns_check method here)

@davidhopkinson26
Copy link
Contributor

I don't understand why moving the super_transform call to the start broke check_mappable_rows. Can you elaborate on why it failed please?

I'm happy with the solution proposed (i.e. direct call to columns_check) but want to understand to help with future refactoring of BaseNominalTransformer as this seems like odd behaviour.

@TommyMatthews
Copy link
Contributor Author

TommyMatthews commented Jan 11, 2024

@davidhopkinson26 I'm not 100% sure myself. If I run X = super().transform(X), or even X = super(BaseNominalTransformer, self).transform(X) (to explicitly access the BaseTransformer) then the value error BaseNominalTransformer.check_mappable_rows is raised, e.g. ValueError: OrdinalEncoderTransformer: nulls would be introduced into column b from levels not present in mapping

image

So I guess mappable rows is now be calculated incorrectly? (I can't see it changing X.shape[0]?)

@TommyMatthews
Copy link
Contributor Author

TommyMatthews commented Jan 11, 2024

Figured this one out: hadn't fully understood the inheritance set up of the transformer but super.transform is actually calling the transform method of BaseMappingTransformerMixin, which is actually doing the mapping and hence if you call check_mappable_rows afterwards it fails

I think the best solution is still just calling the BaseTransformer.columns_check() as BaseTransformer.transform() is called in BaseMappingTransformerMixin.transform(), so calling this at the start of transform would duplicate a lot of checks

@davidhopkinson26
Copy link
Contributor

Figured this one out: hadn't fully understood the inheritance set up of the transformer but super.transform is actually calling the transform method of BaseMappingTransformerMixin, which is actually doing the mapping and hence if you call check_mappable_rows afterwards it fails

I think the best solution is still just calling the BaseTransformer.columns_check() as BaseTransformer.transform() is called in BaseMappingTransformerMixin.transform(), so calling this at the start of transform would duplicate a lot of checks

That makes sense. I agree that this is the best solution.

@davidhopkinson26 davidhopkinson26 merged commit 2462950 into develop Jan 11, 2024
1 check passed
@davidhopkinson26 davidhopkinson26 deleted the feature/nominal_dtype_checks_in_transform branch January 11, 2024 15:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants