Skip to content

Handle NaN in Parquet statistics#4467

Merged
findepi merged 1 commit intotrinodb:masterfrom
aalbu:nan-parquet-orc-stats
Jul 17, 2020
Merged

Handle NaN in Parquet statistics#4467
findepi merged 1 commit intotrinodb:masterfrom
aalbu:nan-parquet-orc-stats

Conversation

@aalbu
Copy link
Member

@aalbu aalbu commented Jul 15, 2020

Fixes #4267

@cla-bot cla-bot bot added the cla-signed label Jul 15, 2020
@aalbu
Copy link
Member Author

aalbu commented Jul 16, 2020

@aalbu aalbu requested a review from findepi July 16, 2020 01:52
@aalbu aalbu force-pushed the nan-parquet-orc-stats branch from 30c9e29 to 97e9af8 Compare July 16, 2020 03:59
@aalbu aalbu requested a review from losipiuk July 16, 2020 16:38
Copy link
Member

@findepi findepi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

do we have storage formats smoke tests to tests full write-read path?
this should be run for ORC as well, even though there isn't a problem to fix there

(like io.prestosql.plugin.hive.TestHiveIntegrationSmokeTest#testNaNPartition)

Copy link
Member Author

@aalbu aalbu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we have storage formats smoke tests to tests full write-read path?
this should be run for ORC as well, even though there isn't a problem to fix there

I did write a round-trip test, but I wasn't able to create stats with NaN, so it didn't fail without the fix 🤨 Inserting data through Presto didn't do it for me and I couldn't figure out how to insert NaN through Hive.

@aalbu aalbu force-pushed the nan-parquet-orc-stats branch from 97e9af8 to 1bcbbee Compare July 17, 2020 01:29
@findepi
Copy link
Member

findepi commented Jul 17, 2020

I did write a round-trip test, but I wasn't able to create stats with NaN, so it didn't fail without the fix

Since we use hive's writer, we depend in hive/parquet version we bundle.
Can you check whether the test would fail (before the change) with some old Presto version, like .172 (used by Athena)?

@aalbu
Copy link
Member Author

aalbu commented Jul 17, 2020

Can you check whether the test would fail (before the change) with some old Presto version, like .172 (used by Athena)?

I have generated Parquet files with Athena and Spark, but I have not had any luck creating NaN file stats. I have added the test anyway.

@aalbu aalbu force-pushed the nan-parquet-orc-stats branch from 1bcbbee to 2235b45 Compare July 17, 2020 19:36
@findepi findepi merged commit c457710 into trinodb:master Jul 17, 2020
@findepi
Copy link
Member

findepi commented Jul 17, 2020

Merged, thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Development

Successfully merging this pull request may close these issues.

Query failure when reading ORC/Parquet files containing NaN

3 participants