Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for StringView and BinaryView statistics in StatisticsConverter #6164

Closed
Tracked by #11752 ...
alamb opened this issue Jul 31, 2024 · 5 comments · Fixed by #6181
Closed
Tracked by #11752 ...

Add support for StringView and BinaryView statistics in StatisticsConverter #6164

alamb opened this issue Jul 31, 2024 · 5 comments · Fixed by #6181
Labels
enhancement Any new improvement worthy of a entry in the changelog good first issue Good for newcomers parquet Changes to the parquet crate

Comments

@alamb
Copy link
Contributor

alamb commented Jul 31, 2024

Part of #6163

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
@efredine recently added support for extracting statistics from parquet files as arrays in #6046 using StatisticsConverter

During development we have also added support for StringViewArray and BinaryViewArray in #5374

Currently there is no way to read StringViewArray and BinaryViewArray statistics and it actually panics if you try to read data page level statistics as I found on apache/datafusion#11723

not implemented
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
External error: query failed: DataFusion error: Join Error
caused by

Describe the solution you'd like

  1. Implement the ability to extract parquet statistics as StringView and BinaryView
  2. Remove the panic caused by unimplemented! at

The code is in https://github.com/apache/arrow-rs/blob/master/parquet/src/arrow/arrow_reader/statistics.rs

Describe alternatives you've considered

You can avoid the panic by following the model of this:

let len = $iterator.count();
// don't know how to extract statistics, so return a null array
Ok(new_null_array($data_type, len))

Then, you can probably write a test followig the model of utf8 and binary

fn roundtrip_utf8() {
Test {
input: utf8_array([
// row group 1
Some("A"),
None,
Some("Q"),
// row group 2
Some("ZZ"),
Some("AA"),
None,
// row group 3
None,
None,
None,
]),
expected_min: utf8_array([Some("A"), Some("AA"), None]),
expected_max: utf8_array([Some("Q"), Some("ZZ"), None]),
}
.run()
}

fn roundtrip_binary() {
Test {
input: Arc::new(BinaryArray::from_opt_vec(vec![
// row group 1
Some(b"A"),
None,
Some(b"Q"),
// row group 2
Some(b"ZZ"),
Some(b"AA"),
None,
// row group 3
None,
None,
None,
])),
expected_min: Arc::new(BinaryArray::from_opt_vec(vec![
Some(b"A"),
Some(b"AA"),
None,
])),
expected_max: Arc::new(BinaryArray::from_opt_vec(vec![
Some(b"Q"),
Some(b"ZZ"),
None,
])),
}
.run()
}

And then implement the missing pieces of code (use StringViewBuilder / BinaryViewBuilder instead of StringBuilder / BinaryBuilder)

I have a hacky version in apache/datafusion#11753 that looks something like

            DataType::Utf8View => {
                let iterator = [<$stat_type_prefix ByteArrayStatsIterator>]::new($iterator);
                let mut builder = StringViewBuilder::new();
                for x in iterator {
                    let Some(x) = x else {
                        builder.append_null(); // no statistics value
                        continue;
                    };

                    let Ok(x) = std::str::from_utf8(x) else {
                        log::debug!("Utf8 statistics is a non-UTF8 value, ignoring it.");
                        builder.append_null();
                        continue;
                    };

                    builder.append_value(x);
                }
                Ok(Arc::new(builder.finish()))
            },

Additional context

@alamb
Copy link
Contributor Author

alamb commented Aug 1, 2024

I think this would be a good first issue as it has a detailed description and existing patterns to follow

@Kev1n8
Copy link
Contributor

Kev1n8 commented Aug 1, 2024

take

@yashathwani
Copy link

Can i work on this issue?

@alamb
Copy link
Contributor Author

alamb commented Aug 2, 2024

Can i work on this issue?

Hi @yashathwani -- it looks like @Kev1n8 has already created a PR: #6181 -- perhaps you can help review it?

@alamb alamb added the parquet Changes to the parquet crate label Aug 31, 2024
@alamb
Copy link
Contributor Author

alamb commented Aug 31, 2024

label_issue.py automatically added labels {'parquet'} from #6187

This issue was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Any new improvement worthy of a entry in the changelog good first issue Good for newcomers parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants