Normalization of NaN is not working as intended #23

Closed
mrckzgl opened this issue Jul 8, 2024 · 2 comments

mrckzgl commented Jul 8, 2024

The data class tries to normalize na / nan values into empty strings.
This is done here:

self.dataset_1 = self.dataset_1.astype(str)
self.dataset_1.fillna("", inplace=True)
if not self.is_dirty_er:
    self.dataset_2 = self.dataset_2.astype(str)
    self.dataset_2.fillna("", inplace=True)

but it does not work as intended.
Casting the DataFrame to str first turns every NaN into the literal string "nan", so the subsequent fillna has nothing left to fill.
See:

>>> pandas.DataFrame.isnull(pandas.DataFrame([numpy.nan]).astype(float))
      0
0  True
>>> pandas.DataFrame.isnull(pandas.DataFrame([numpy.nan]).astype(str))
       0
0  False
>>> 

I do not know the best way to handle the intended conversion, though. One option would be to swap the order: call fillna first and cast to string afterwards. But I don't know what happens when fillna('', inplace=True) is applied to columns whose dtype is not string-compatible.
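
One way to sidestep the dtype question is to record the missing-value positions before the cast and blank them out afterwards, so fillna never has to put a string into a numeric column. This is only a minimal sketch: the helper name `blank_missing_then_stringify` and the sample data are hypothetical and are not part of the project.

```python
import numpy as np
import pandas as pd


def blank_missing_then_stringify(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper (not part of the project): return an all-string
    copy of df in which every missing value is the empty string."""
    # Record where values are missing BEFORE casting, while NaN/None are
    # still recognisable as missing.
    missing = df.isna()
    # Cast everything to str, then overwrite the recorded positions with "".
    return df.astype(str).mask(missing, "")


# Made-up sample data covering three column dtypes.
sample = pd.DataFrame(
    {
        "name": ["alice", None],  # object column with a missing value
        "score": [1.5, np.nan],   # float column with NaN
        "count": [1, 2],          # int column without missing values
    }
)
cleaned = blank_missing_then_stringify(sample)
print(cleaned)
```

Because the mask is computed on the original frame, the result does not depend on how a given column dtype reacts to being filled with a string.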

best

mrckzgl changed the title from "Normalization of NaN" to "Normalization of NaN is not working as intended" on Jul 8, 2024

mrckzgl commented Jul 8, 2024

I also wonder if it is necessary and good practice to convert the dataframe to string, as then there is no distinction between na and empty string anymore ...
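
If keeping that distinction ever mattered, one possible alternative (not what the project does, and only a sketch under that assumption) would be pandas' nullable "string" dtype, which stores text while leaving missing values marked as missing. The frame `raw` below is made-up sample data.

```python
import numpy as np
import pandas as pd

# Made-up sample: an empty string and a missing value in the same column.
raw = pd.DataFrame(
    {
        "name": ["alice", "", None],
        "score": [1.5, 2.0, np.nan],
    }
)

# Cast to the nullable string dtype instead of plain str.
as_text = raw.astype("string")

# Missing values are still detectable after the cast, while the
# empty string remains an ordinary (non-missing) value.
print(as_text.isna())
print(as_text["name"].iloc[1] == "")
```

The trade-off is that downstream steps would then have to cope with missing values themselves rather than assuming every cell is a plain string.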

Nikoletos-K self-assigned this on Jul 17, 2024

Nikoletos-K commented Jul 19, 2024

Hello, and I'm sorry for the late reply.

You're right on both points: with the current order, the NaN handling has no effect. I think swapping the two lines will do the trick:

        # Fill NaN values with empty string
        self.dataset_1.fillna("", inplace=True)
        self.dataset_1 = self.dataset_1.astype(str)
        if not self.is_dirty_er:
            self.dataset_2.fillna("", inplace=True)
            self.dataset_2 = self.dataset_2.astype(str)

As for the str conversion, it is necessary to ensure that later steps only ever handle strings. Other types caused issues in many other steps, which is why we decided to handle it this way.

The above fix will be uploaded in the next release.

Thank you for sharing this with us!

Konstantinos

Nikoletos-K added a commit that referenced this issue Jul 19, 2024