Skip redundant validation checks in RecordBatch#project #8583
Conversation
Some micro benchmark results. The results are in line with my expectations: the more columns you retain, the more work the validation logic needs to do, so you see the most benefit when most columns are retained.
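For readers unfamiliar with the API under discussion, a minimal usage example of `RecordBatch::project` (the schema and values here are made up for illustration):

```rust
use std::sync::Arc;

use arrow_array::{ArrayRef, Int32Array, RecordBatch};
use arrow_schema::{ArrowError, DataType, Field, Schema};

fn main() -> Result<(), ArrowError> {
    let schema = Arc::new(Schema::new(vec![
        Field::new("a", DataType::Int32, false),
        Field::new("b", DataType::Int32, false),
    ]));
    let batch = RecordBatch::try_new(
        schema,
        vec![
            Arc::new(Int32Array::from(vec![1, 2, 3])) as ArrayRef,
            Arc::new(Int32Array::from(vec![4, 5, 6])),
        ],
    )?;

    // Retain column 1 ("b") only. Before this PR, the projected batch was
    // rebuilt through the validating constructor, re-checking lengths,
    // types and nullability that are already known to hold.
    let projected = batch.project(&[1])?;
    assert_eq!(projected.num_columns(), 1);
    Ok(())
}
```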
alamb left a comment
Thank you @pepijnve -- this looks like a great find to me.
arrow-array/Cargo.toml (outdated diff):

```toml
@@ -80,3 +80,7 @@ harness = false
[[bench]]
name = "union_array"
harness = false

[[bench]]
name = "record_batch"
```
Could you please move this benchmark into its own PR (mostly to make it easier to run the benchmark against unmodified code)?
Sure thing.
See #8592.
I've rewound this PR's branch one commit and force-pushed that to remove the benchmark here.
arrow-array/benches/record_batch.rs (outdated diff):

```rust
fn criterion_benchmark(c: &mut Criterion) {
    project_benchmark(c, 100, 100, 1);
    project_benchmark(c, 100, 100, 50);
    project_benchmark(c, 100, 100, 99);
```
I think a row count of 8192 is more realistic.
Also, adding a benchmark with 1000 columns would be useful (and might show off your improvement more).
Row count doesn't really affect project performance, afaict from the code, but it's pretty trivial to add more test cases.
I added the extra cases in #8592
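A sketch of what the expanded case matrix could look like; the exact cases added in #8592 may differ, and the `project_benchmark(c, row_count, column_count, projected_columns)` signature is assumed from the snippet above:

```rust
fn criterion_benchmark(c: &mut Criterion) {
    // The original narrow cases at the more realistic batch size.
    project_benchmark(c, 8192, 100, 1);
    project_benchmark(c, 8192, 100, 50);
    project_benchmark(c, 8192, 100, 99);
    // Wide schemas, where per-column validation cost should dominate and
    // the gain from skipping it should be most visible.
    project_benchmark(c, 8192, 1000, 10);
    project_benchmark(c, 8192, 1000, 999);
}
```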
arrow-array/src/record_batch.rs (diff fragment from the PR):

```diff
-            row_count: Some(self.row_count),
-        },
-    )
+        unsafe {
```
I double-checked the checks that happen as part of creating a RecordBatch and I agree they are entirely redundant when projecting an already valid RecordBatch.
arrow-rs/arrow-array/src/record_batch.rs, lines 307 to 365 in e5e4db9:
```rust
// check that number of fields in schema match column length
if schema.fields().len() != columns.len() {
    return Err(ArrowError::InvalidArgumentError(format!(
        "number of columns({}) must match number of fields({}) in schema",
        columns.len(),
        schema.fields().len(),
    )));
}

let row_count = options
    .row_count
    .or_else(|| columns.first().map(|col| col.len()))
    .ok_or_else(|| {
        ArrowError::InvalidArgumentError(
            "must either specify a row count or at least one column".to_string(),
        )
    })?;

for (c, f) in columns.iter().zip(&schema.fields) {
    if !f.is_nullable() && c.null_count() > 0 {
        return Err(ArrowError::InvalidArgumentError(format!(
            "Column '{}' is declared as non-nullable but contains null values",
            f.name()
        )));
    }
}

// check that all columns have the same row count
if columns.iter().any(|c| c.len() != row_count) {
    let err = match options.row_count {
        Some(_) => "all columns in a record batch must have the specified row count",
        None => "all columns in a record batch must have the same length",
    };
    return Err(ArrowError::InvalidArgumentError(err.to_string()));
}

// function for comparing column type and field type
// return true if 2 types are not matched
let type_not_match = if options.match_field_names {
    |(_, (col_type, field_type)): &(usize, (&DataType, &DataType))| col_type != field_type
} else {
    |(_, (col_type, field_type)): &(usize, (&DataType, &DataType))| {
        !col_type.equals_datatype(field_type)
    }
};

// check that all columns match the schema
let not_match = columns
    .iter()
    .zip(schema.fields().iter())
    .map(|(col, field)| (col.data_type(), field.data_type()))
    .enumerate()
    .find(type_not_match);

if let Some((i, (col_type, field_type))) = not_match {
    return Err(ArrowError::InvalidArgumentError(format!(
        "column types must match schema types, expected {field_type} but found {col_type} at column index {i}"
    )));
}
```
Thanks for verifying!
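For context, a minimal sketch of the shape of the change (a paraphrase of the idea, not the exact merged code, assuming `new_unchecked(schema, columns, row_count)` inside `impl RecordBatch`): since `project` only selects already-validated columns, the result can be assembled without re-running the checks quoted above.

```rust
/// Sketch of the unchecked projection path (paraphrased).
pub fn project(&self, indices: &[usize]) -> Result<RecordBatch, ArrowError> {
    let projected_schema = self.schema.project(indices)?;
    let projected_columns = indices
        .iter()
        .map(|i| {
            self.columns.get(*i).cloned().ok_or_else(|| {
                ArrowError::SchemaError(format!("project index {i} out of bounds"))
            })
        })
        .collect::<Result<Vec<_>, _>>()?;

    // SAFETY: every column comes from `self`, which already satisfies the
    // invariants: lengths equal `self.row_count`, and `Schema::project`
    // keeps fields aligned with the selected columns.
    Ok(unsafe {
        RecordBatch::new_unchecked(
            Arc::new(projected_schema),
            projected_columns,
            self.row_count,
        )
    })
}
```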
Nice catch! Thanks @pepijnve
From the description of #8592:

# Which issue does this PR close?

- Related to #8591.

# Rationale for this change

Add a microbenchmark for `RecordBatch::project` to measure the performance impact of #8583.

# What changes are included in this PR?

Adds an additional micro benchmark to `arrow-rs`.

# Are these changes tested?

Not applicable for benchmark code. Benchmark manually tested.

# Are there any user-facing changes?

No
🤖

Thanks @pepijnve -- I kicked off the benchmarks

🤖: Benchmark completed

That is some pretty nice confirmation.
From the description of the follow-up PR closing #8692:

# Which issue does this PR close?

- Closes #8692.

# Rationale for this change

Explained in the issue.

# What changes are included in this PR?

- Adds `FilterPredicate::filter_record_batch`
- Adapts the free function `filter_record_batch` to use the new function
- Uses `new_unchecked` to create the filtered result. The rationale for this is identical to #8583.

# Are these changes tested?

Covered by existing tests for `filter_record_batch`.

# Are there any user-facing changes?

No

Co-authored-by: Martin Grigorov <[email protected]>
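A hedged sketch of how the new method could look. `FilterPredicate::filter` and `FilterPredicate::count` are existing `arrow-select` APIs; the body of `filter_record_batch` here is an assumption, not the merged code:

```rust
/// Sketch only; written as a method on FilterPredicate.
pub fn filter_record_batch(&self, batch: &RecordBatch) -> Result<RecordBatch, ArrowError> {
    // Apply the same precomputed predicate to every column.
    let filtered: Vec<ArrayRef> = batch
        .columns()
        .iter()
        .map(|column| self.filter(column.as_ref()))
        .collect::<Result<_, _>>()?;

    // SAFETY: filtering preserves each column's data type, and every
    // column is filtered by the same predicate, so all lengths equal
    // `self.count()` -- the same reasoning as #8583.
    Ok(unsafe { RecordBatch::new_unchecked(batch.schema(), filtered, self.count()) })
}
```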
Description of this PR (#8583):

# Which issue does this PR close?

- #8591 (redundant validation in `RecordBatch::project`).

# Rationale for this change

`RecordBatch::project` currently uses the validating factory function. Since `project` starts from a valid `RecordBatch`, these checks are redundant. A small amount of work can be saved by using `new_unchecked` instead.

A change I'm working on for DataFusion uses `RecordBatch#project` in the inner expression evaluation loop to reduce the amount of redundant array filtering `case` expressions need to do. While a micro-optimisation, avoiding redundant work in inner loops seems worthwhile.

# What changes are included in this PR?

- `new_unchecked` instead of `try_new_with_options` in `RecordBatch#project`

# Are these changes tested?

No additional tests added. The performance difference is proven via microbenchmark.

# Are there any user-facing changes?

No