Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Emitting a Dataframe with dtype float16 from a Python-UDF leads to silent data corruption #796

Closed
tkilias opened this issue May 8, 2023 · 0 comments · Fixed by #795
Closed
Assignees
Labels
bug Unwanted / harmful behavior

Comments

@tkilias
Copy link
Collaborator

tkilias commented May 8, 2023

Background

  • float16 columns are currently incorrectly converted from float16 to double.
  • Our protocol only speaks double for that reason, we need to convert it.
  • The current conversion casts the numpy float16 representation into an uint8 and from there into a double.
  • However, the uint8 doesn't get interpreted as double on bit level, instead the uint8 value is used as the double value

https://github.com/exasol/script-languages/blob/4cb6de499dd9c724b5a21db2450a31d36e19c732/exaudfclient/base/python/python3/python_ext_dataframe.cc#L737

Solution

  • We can solve this using astype from numpy to do the conversion

Acceptance Criteria

  • Test for all data types
  • Fix
  • Documentation of the data corruption
@tkilias tkilias added the bug Unwanted / harmful behavior label May 8, 2023
@tkilias tkilias self-assigned this May 26, 2023
tkilias added a commit that referenced this issue Jun 2, 2023
…ata from Python UDFs (#795)

* #796: Fixed silent data corruption when emitting dataframes with float16 dtype columns from Python UDFs
* Replace asscalar with item in test dataframe.py, because asscalar was removed
* Added handleEmitPyFloat to also support float pyarrow dtype columns with NAN and object-dtype columns with float
* Refactored and split Pandas Tests
* Added tests for more dtypes to Pandas Tests
* Documented supported dtypes when emitting dataframes

Co-authored-by: Thomas Ubensee <[email protected]>
tkilias added a commit that referenced this issue Jun 2, 2023
* #792: Updated script-languages-container-tool to 0.17.0 and script-languages-container-ci to 1.1.0 (#799)
* #800: Updated Ubuntu packages (#801)
* #803: Updated Ubuntu packages (#804)
* #814: Updated packages (#815)
* #793: Added support for Pandas 2 pyarrow dtype columns for emitting data from Python UDFs (#795)
* #796: Fixed silent data corruption when emitting dataframes with float16 dtype columns from Python UDFs

---------

Co-authored-by: Thomas Ubensee <[email protected]>
tkilias added a commit that referenced this issue Jun 2, 2023
 - #789: Updated Ubuntu packages
 - #777: Updated redis-py to fix CVE-2023-28859
 - #800: Updated Ubuntu packages
 - #803: Updated Ubuntu packages
 - #815: Updated Ubuntu packages and poetry.lock
 - #793: Added support for Pandas 2 pyarrow dtype columns for emitting data from Python UDFs
 - #792: Fixed Github workflow publish-test-container by updating script-languages-container-tool to 0.17.0 and script-languages-container-ci to 1.1.0
 - #796: Fixed silent data corruption when emitting dataframes with float16 dtype columns from Python UDFs
@tkilias tkilias closed this as completed Jun 5, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Unwanted / harmful behavior
Projects
None yet
1 participant