Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: extract floats for parquet data pages #10972

Closed
wants to merge 4 commits into from

Conversation

tshauck
Copy link
Contributor

@tshauck tshauck commented Jun 18, 2024

Which issue does this PR close?

Closes #10951

Rationale for this change

Serves the overall epic #10922

What changes are included in this PR?

  1. Adds min and max DataPageStatsIterator for float32 and float64
  2. Updates the main match in get_data_page_statistics to include f16, f32, and f64
  3. Updates null_counts_page_statistics to handle floats
  4. Updates various test utils to work with floats

Are these changes tested?

Yes, added new tests.

Are there any user-facing changes?

No

@github-actions github-actions bot added the core Core DataFusion crate label Jun 18, 2024
@tshauck tshauck marked this pull request as ready for review June 18, 2024 04:19
@@ -507,9 +507,84 @@ async fn test_multiple_data_pages_nulls_and_negatives() {

/////////////// MORE GENERAL TESTS //////////////////////
// . Many columns in a file
// . Differnet data types
// . Different data types
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

drive by cleanup 🚀

Copy link
Contributor

@marvinlanhenke marvinlanhenke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tshauck
Thanks LGTM. I think only one test case should check both instead of only row_groups?

expected_null_counts: UInt64Array::from(vec![0, 0, 0, 0]),
expected_row_counts: Some(UInt64Array::from(vec![5, 5, 5, 5])),
column_name: "f",
check: Check::RowGroup,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be Check::Both ?

Copy link
Contributor Author

@tshauck tshauck Jun 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I think you're right, but doing that leads to a test assertion failure 🤔 ... gonna move the PR back draft while I look further into it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is due to the difference in handling row_counts for RowGroups and DataPages. #10965

Im working on this - for now you'd need two test cases, one for each check with different expected row_counts.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm stretching a here, but doing a bit of looking I think it's due to the physical type for an f16 being a fixed length byte array. I think column_page_index_per_row_group_per_column is an Index::FIXED_LEN_BYTE_ARRAY which seemingly isn't playing well with some of the conversion, so the values coming through end up being None.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you seem to be right here; I left a comment in the code - where I think you might have missed something.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW #10965 is now fixex

@tshauck tshauck marked this pull request as draft June 18, 2024 04:35
Index::FIXED_LEN_BYTE_ARRAY,
f16
);
make_data_page_stats_iterator!(MinFloat32DataPageStatsIterator, min, Index::FLOAT, f32);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think Float (f32) and Double (f64) would be enough, in terms of naming. This would also be consistent with the row group stats iterators.

@@ -552,6 +552,22 @@ make_data_page_stats_iterator!(MinInt32DataPageStatsIterator, min, Index::INT32,
make_data_page_stats_iterator!(MaxInt32DataPageStatsIterator, max, Index::INT32, i32);
make_data_page_stats_iterator!(MinInt64DataPageStatsIterator, min, Index::INT64, i64);
make_data_page_stats_iterator!(MaxInt64DataPageStatsIterator, max, Index::INT64, i64);
make_data_page_stats_iterator!(
Copy link
Contributor

@marvinlanhenke marvinlanhenke Jun 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tshauck
It seems like you are not using those in the maching down below? that might cause your error on f16. Instead you are using an Float32 iterator which won't match on Index::FIXED_LEN_BYTE_ARRAY and thus return None.

I think you need to switch to the f16 iterator; but perhaps might need to construct an f16 from the bytes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies for the confusion. It pushed this as a WIP. I still need to work through the f16 case. I think I will need to construct an f16 from the bytes, but not sure yet.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's an example of what I mean...

image

Fixed length byte array is collected as parquet::data_type::FixedLenByteArray, which then means this doesn't match. Alternatively, using that data type in the macro causes later move issues when interacting with the calls to min() and max().

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest we take a friendly look at how f16 for row groups work:

DataType::Float16 => Ok(Arc::new(Float16Array::from_iter(
[<$stat_type_prefix FixedLenByteArrayStatsIterator>]::new($iterator).map(|x| x.and_then(|x| {
from_bytes_to_f16(x)
})),
))),

I suspect the statistics in data pages will be encodd the same

It is also possible that since f16 is a less common type, the parquet writer might write data page statistics for it

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hi there, I haven't noticed this PR and accidentally addressed the same issue in here #10982 . Apologies. I think the f32 and f64 cases we solve the ~same. I got the f16 working, though not in a way I'm particularly happy with (ie, the move issues you mention)

I don't have any other commits in that PR in my queue -- so happy to close it if you come up with any other f16 solution...

Copy link
Contributor Author

@tshauck tshauck Jun 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oooff, sorry @tmi, I should've done take on the issue. Taking @alamb's pointer to look at row groups, I might be wrong, but I think there's a couple of differences between it and the data page stats that complicate things...

  • max_bytes/min_bytes, which is used for row groups, does not exist on PageIndex in the parquet crate but does on Statistics, so using it in the macro isn't possible at this point.
  • Using [u8] as the stat_value_type is problematic when combined with the iterator macro in that type Item = Vec<Option<$stat_value_type>>; can't work with [u8] because it's unsized. Changing the item def to type Item = Vec<Option<&'a $stat_value_type>>; works and is similar to the item definition of row group (type Item = Option<&'a $stat_value_type>;).

f16 is less common, so maybe to have a solution that's similar to Statistics and hopefully avoids cloning, we could:

  1. Update this PR to do f32 and f64, get it in to support higher priority types and is mostly innocuous
  2. Update the parquet crate to support max_bytes/min_bytes on PageIndex
  3. Update the data page stats to use a &'a $stat_value_type and add f16 when that's merged

@tmi, maybe you and I could collaborate on 2 & 3 if that seems reasonable? Otherwise, certainly happy to go another route, if there's a better approach.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds like a good plan to me

I think it is perfectly reasonable to leave f16 support as a 'TODO' in datafusion as we add the needed support upstream

Let me know if I can help (e.g. file some tickets, etc

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think #10982 is good to go FWIW

Copy link
Contributor Author

@tshauck tshauck Jun 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, sounds good to me. Maybe a further improvement down the line to avoid the clone and simplify to use a method vs a function. I'll close this one.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

avoid the clone

I think the ideal sequence would be

  1. Tracking Issue for f16 and f128 float types rust-lang/rust#116909 gets merged
  2. parquet replaces FixedLengthByteArray with genuine f16 type
  3. we drop the special handling here and elsewhere

and until then the f16 support in datafusion would be as is... Or alternatively, introduce the half crate to parquet rs first. Because dealing with FixedLengthByteArray on the user side is always going to be messy

@tshauck tshauck closed this Jun 18, 2024
@tshauck tshauck deleted the extract-floats branch June 24, 2024 19:46
This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support extracting Float{16, 32, 64} statistics from Parquet Data Pages
4 participants