Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs #406
Comments
Jan Finis / @JFinis: The whole topic of NaN handling in Parquet currently seems to be lacking and somewhat inconsistent, making columns with NaNs mostly unusable for scan pruning. Maybe there should be a redefinition of the semantics in a new version, so that columns with NaNs can be used for indexing like other columns. As mentioned, Iceberg has solved this problem by providing NaN counts.
Xuwei Fu / @mapleFU: In sorting, Iceberg forces that:
I think (1) is bad, because NaN is never equal to NULL. IEEE 754 (https://ieeexplore.ieee.org/document/8766229) and the C++ standard support a "totalOrder" predicate, but regarding min/max as following totalOrder seems strange to me. If I'm wrong, please correct me.
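For reference, here is a minimal C++20 sketch (assuming a compiler with `<compare>` support) of the difference between ordinary IEEE comparisons and the totalOrder predicate, which `std::strong_order` implements for floating-point types:

```cpp
#include <compare>
#include <iostream>
#include <limits>

int main() {
    double nan = std::numeric_limits<double>::quiet_NaN();
    double inf = std::numeric_limits<double>::infinity();

    // Ordinary comparisons treat NaN as unordered: all of these are false.
    std::cout << (nan < inf) << (nan > inf) << (nan == nan) << '\n';  // 000

    // std::strong_order implements IEEE 754 totalOrder, which does order
    // NaN: -qNaN < -inf < finite values < +inf < +qNaN.
    std::cout << (std::strong_order(nan, inf) > 0) << '\n';    // 1
    std::cout << (std::strong_order(-nan, -inf) < 0) << '\n';  // 1 (negation flips the sign bit)
}
```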
Jan Finis / @JFinis: Side point: note that NaN being larger than all other values is also:
Jan Finis / @JFinis: @mapleFU I agree. Treating NaN as larger or smaller than any other value doesn't fit the semantics of all engines. Therefore, my fix would enable both kinds of semantics (and a third, where NaN is treated as neither larger nor smaller than other values) to work. I guess I'll just create a pull request; we can then start the discussion based on it. Sounds good?
Currently, the specification of `ColumnIndex` in `parquet.thrift` is inconsistent, leading to cases where it is impossible to create a Parquet file that conforms to the spec.

The problem concerns double/float columns where a page contains only NaN values. The spec says that NaN values should not be included in min/max bounds, so a page consisting of only NaN values has no defined min/max bound. To quote the spec:
However, the comment on the `null_pages` member of `ColumnIndex` states the following:
For a page with only NaNs, we now have a problem. The page definitely does not contain only null values, so `null_pages` should be `false` for this page. However, in this case the spec requires valid min/max values in `min_values` and `max_values` for this page. As the only value in the page is NaN, the only valid min/max value we could enter here is NaN, but as mentioned before, NaNs should never be written to min/max values.

Thus, no writer can currently create a Parquet file that conforms to this specification as soon as there is an only-NaN column and column indexes are to be written.
I see three possible solutions:

1. A page consisting of only NaNs (or NaNs and nulls) sets its `null_pages` entry to `true`.
2. A page consisting of only NaNs (or NaNs and nulls) leaves its entries in `min_values` and `max_values` empty (`byte[0]`), even though its `null_pages` entry is set to `false`.
3. A page consisting of only NaNs (or NaNs and nulls) writes NaN as its min and max, i.e., the rule that NaNs are never written to min/max is lifted for this case.

None of the solutions is perfect, but I guess solution 3 is the best of them. It gives us valid min/max bounds, makes `null_pages` compatible with this, and gives us a way to determine only-NaN pages (min = max = NaN).
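A hedged sketch of how a writer could implement solution 3 (struct and function names are illustrative, not the actual Thrift definitions):

```cpp
#include <algorithm>
#include <cmath>
#include <limits>
#include <vector>

// Illustrative stand-in for one page's column-index entry.
struct PageStats {
    bool null_page = false;
    double min = 0.0;
    double max = 0.0;
};

PageStats make_page_stats(const std::vector<double>& values, bool all_null) {
    PageStats s;
    if (all_null) {  // only nulls: null_pages = true, bounds unused
        s.null_page = true;
        return s;
    }
    bool saw_real_value = false;
    for (double v : values) {
        if (std::isnan(v)) continue;  // NaNs never widen the bounds
        if (!saw_real_value) {
            s.min = s.max = v;
            saw_real_value = true;
        } else {
            s.min = std::min(s.min, v);
            s.max = std::max(s.max, v);
        }
    }
    if (!saw_real_value) {
        // Solution 3: an only-NaN page (possibly mixed with nulls) writes
        // min = max = NaN, making such pages recognizable to readers.
        s.min = s.max = std::numeric_limits<double>::quiet_NaN();
    }
    return s;
}
```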
As a general note: I would say it is a shortcoming that Parquet doesn't track NaN counts. Iceberg, for example, does track NaN counts and therefore doesn't have this inconsistency. NaN counts could be introduced in a future version, but that doesn't help with backward compatibility, so we do need a solution for now.
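For comparison, a sketch of how an Iceberg-style NaN count would enable pruning (the `nan_count` field is hypothetical for Parquet; all names are illustrative):

```cpp
// Illustrative page metadata carrying an Iceberg-style NaN count.
struct PageMeta {
    long value_count;  // all values in the page
    long null_count;
    long nan_count;    // hypothetical: Parquet does not track this today
    double min, max;   // bounds over non-null, non-NaN values
};

// Can we skip this page for the predicate "x > threshold"?
// NaN compares false against everything, so a page whose values are all
// NaN or null can never satisfy the predicate; otherwise the bounds are
// backed by at least one real value and can be trusted.
bool can_skip_greater_than(const PageMeta& m, double threshold) {
    if (m.nan_count + m.null_count == m.value_count) return true;
    return m.max <= threshold;
}
```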
Any of the solutions is better than the current situation, where engines writing such a page cannot produce a conforming Parquet file and will each pick one of the behaviors arbitrarily.
Thus, my suggestion would be to update `parquet.thrift` to use solution 3, i.e., rewrite the comments saying that NaNs shouldn't be included in min/max bounds by adding a clause stating that "if a page contains only NaNs or a mixture of NaNs and NULLs, then NaN should be written as min & max".
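On the read side, solution 3 would let a scanner recognize only-NaN pages directly from the bounds; a minimal sketch, assuming the writer behavior above:

```cpp
#include <cmath>

// Under solution 3, min = max = NaN marks a page that contains only NaNs
// (and possibly nulls); every other non-null page has ordinary bounds.
bool is_only_nan_page(bool null_page, double min, double max) {
    return !null_page && std::isnan(min) && std::isnan(max);
}
```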
Reporter: Jan Finis / @JFinis
PRs and other links:
Note: This issue was originally created as PARQUET-2249. Please see the migration documentation for further details.