
Conversation

@pepijnve
Contributor

@pepijnve pepijnve commented Oct 10, 2025

Which issue does this PR close?

Rationale for this change

RecordBatch::project currently builds its result with the validating factory function, try_new_with_options. Since project starts from a valid RecordBatch, these checks are redundant. A small amount of work can be saved by using new_unchecked instead.

A change I'm working on for DataFusion uses RecordBatch#project in the inner expression evaluation loop to reduce the amount of redundant array filtering that case expressions need to do. Although this is a micro-optimisation, avoiding redundant work in inner loops seems worthwhile.

What changes are included in this PR?

  • Use new_unchecked instead of try_new_with_options in RecordBatch#project
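
To illustrate the pattern, here is a minimal, std-only sketch. `ToyField`, `ToyBatch`, and all of their methods are hypothetical stand-ins written for this example, not the arrow-rs API; they only mirror the idea of a validating constructor next to an unchecked one, with projection reusing invariants that already hold.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for a schema field (not the arrow-rs `Field`).
#[derive(Clone, Debug, PartialEq)]
struct ToyField {
    name: String,
    nullable: bool,
}

/// Hypothetical stand-in for a record batch: one shared `Vec<Option<i64>>` per column.
#[derive(Debug)]
struct ToyBatch {
    fields: Vec<ToyField>,
    columns: Vec<Arc<Vec<Option<i64>>>>,
    row_count: usize,
}

impl ToyBatch {
    /// Validating constructor: checks column count, column lengths, and nullability.
    fn try_new(
        fields: Vec<ToyField>,
        columns: Vec<Arc<Vec<Option<i64>>>>,
        row_count: usize,
    ) -> Result<Self, String> {
        if fields.len() != columns.len() {
            return Err("number of columns must match number of fields".to_string());
        }
        for (f, c) in fields.iter().zip(&columns) {
            if c.len() != row_count {
                return Err("all columns must have the specified row count".to_string());
            }
            if !f.nullable && c.iter().any(|v| v.is_none()) {
                return Err(format!("column '{}' is non-nullable but contains nulls", f.name));
            }
        }
        Ok(Self { fields, columns, row_count })
    }

    /// Unchecked constructor: performs no validation at all.
    ///
    /// # Safety
    /// The caller must guarantee every invariant that `try_new` checks.
    unsafe fn new_unchecked(
        fields: Vec<ToyField>,
        columns: Vec<Arc<Vec<Option<i64>>>>,
        row_count: usize,
    ) -> Self {
        Self { fields, columns, row_count }
    }

    /// Projection: only the indices need checking, because each field/column
    /// pair is taken from a batch that is already valid.
    fn project(&self, indices: &[usize]) -> Result<Self, String> {
        let mut fields = Vec::with_capacity(indices.len());
        let mut columns = Vec::with_capacity(indices.len());
        for &i in indices {
            let f = self
                .fields
                .get(i)
                .ok_or_else(|| format!("index {i} out of bounds"))?;
            fields.push(f.clone());
            columns.push(Arc::clone(&self.columns[i]));
        }
        // SAFETY: fields and columns are selected pairwise from a valid batch
        // and the row count is unchanged, so all `try_new` invariants still hold.
        Ok(unsafe { Self::new_unchecked(fields, columns, self.row_count) })
    }
}
```

In this sketch, calling `project(&[2, 0])` on a valid three-column batch yields a two-column batch that would also pass `try_new`, which is why re-running the validation adds cost without adding safety.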

Are these changes tested?

No additional tests were added.
The performance difference was measured with a microbenchmark.

Are there any user-facing changes?

No

@github-actions github-actions bot added the arrow Changes to the arrow crate label Oct 10, 2025
@pepijnve pepijnve marked this pull request as ready for review October 11, 2025 09:09
@pepijnve
Contributor Author

pepijnve commented Oct 11, 2025

Some microbenchmark results. They are in line with my expectations: the more columns you retain, the more work the validation logic has to do, so the benefit is largest in cases such as 100x100 -> 99x100.

project/100x100 -> 1x100
                        time:   [85.588 ns 85.747 ns 85.903 ns]
                        change: [-7.3087% -7.0345% -6.7565%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  4 (4.00%) low mild
  1 (1.00%) high mild
  2 (2.00%) high severe

project/100x100 -> 50x100
                        time:   [788.42 ns 789.33 ns 790.23 ns]
                        change: [-24.541% -24.365% -24.202%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

project/100x100 -> 99x100
                        time:   [1.3220 µs 1.3261 µs 1.3303 µs]
                        change: [-25.897% -25.665% -25.434%] (p = 0.00 < 0.05)
                        Performance has improved.

project/5x100 -> 1x100  time:   [84.869 ns 84.963 ns 85.067 ns]
                        change: [-8.3173% -8.0696% -7.8238%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild

project/5x100 -> 2x100  time:   [93.710 ns 93.886 ns 94.111 ns]
                        change: [-14.343% -14.136% -13.921%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  1 (1.00%) high severe

project/5x100 -> 5x100  time:   [176.45 ns 176.73 ns 177.00 ns]
                        change: [-13.424% -13.215% -12.985%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) low severe
  1 (1.00%) high mild
  2 (2.00%) high severe

Contributor

@alamb alamb left a comment

Thank you @pepijnve -- this looks like a great find to me.

@@ -80,3 +80,7 @@ harness = false
[[bench]]
name = "union_array"
harness = false

[[bench]]
name = "record_batch"
Contributor

Could you please move this benchmark into its own PR (mostly to make it easier to run the benchmark against unmodified code)?

Contributor Author

Sure thing.

Contributor Author

See #8592.

Contributor Author

I've rewound this PR's branch by one commit and force-pushed it to remove the benchmark here.

fn criterion_benchmark(c: &mut Criterion) {
    project_benchmark(c, 100, 100, 1);
    project_benchmark(c, 100, 100, 50);
    project_benchmark(c, 100, 100, 99);
Contributor

I think a row count of 8192 is more realistic.

Adding a benchmark with 1000 columns would also be useful (and might show off your improvement more).

Contributor Author

Row count doesn't really affect project performance afaict from the code, but it's pretty trivial to add more test cases.
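
A std-only sketch of why row count should not matter, assuming columns are reference-counted handles (arrow-rs's `ArrayRef` is an `Arc<dyn Array>`). `ToyColumn` and `project_columns` are hypothetical names for this example, not arrow-rs functions.

```rust
use std::sync::Arc;

/// Hypothetical column type: reference-counted, so clones share one buffer.
type ToyColumn = Arc<Vec<i64>>;

/// Select columns by index. Only the `Arc` handles are cloned; no row data
/// is copied, so the cost depends on `indices.len()`, not on the row count.
/// Returns `None` if any index is out of bounds.
fn project_columns(columns: &[ToyColumn], indices: &[usize]) -> Option<Vec<ToyColumn>> {
    indices.iter().map(|&i| columns.get(i).cloned()).collect()
}
```

With three columns of 8192 rows each, `project_columns(&columns, &[2, 0])` returns handles that point at the same allocations as the originals (checkable with `Arc::ptr_eq`), which matches the observation that only the number of retained columns drives the cost.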

Contributor Author

I added the extra cases in #8592

row_count: Some(self.row_count),
},
)
unsafe {
Contributor

I double-checked the validation that happens as part of creating a RecordBatch, and I agree it is entirely redundant when projecting an already valid RecordBatch:

// check that number of fields in schema match column length
if schema.fields().len() != columns.len() {
    return Err(ArrowError::InvalidArgumentError(format!(
        "number of columns({}) must match number of fields({}) in schema",
        columns.len(),
        schema.fields().len(),
    )));
}
let row_count = options
    .row_count
    .or_else(|| columns.first().map(|col| col.len()))
    .ok_or_else(|| {
        ArrowError::InvalidArgumentError(
            "must either specify a row count or at least one column".to_string(),
        )
    })?;
for (c, f) in columns.iter().zip(&schema.fields) {
    if !f.is_nullable() && c.null_count() > 0 {
        return Err(ArrowError::InvalidArgumentError(format!(
            "Column '{}' is declared as non-nullable but contains null values",
            f.name()
        )));
    }
}
// check that all columns have the same row count
if columns.iter().any(|c| c.len() != row_count) {
    let err = match options.row_count {
        Some(_) => "all columns in a record batch must have the specified row count",
        None => "all columns in a record batch must have the same length",
    };
    return Err(ArrowError::InvalidArgumentError(err.to_string()));
}
// function for comparing column type and field type
// return true if 2 types are not matched
let type_not_match = if options.match_field_names {
    |(_, (col_type, field_type)): &(usize, (&DataType, &DataType))| col_type != field_type
} else {
    |(_, (col_type, field_type)): &(usize, (&DataType, &DataType))| {
        !col_type.equals_datatype(field_type)
    }
};
// check that all columns match the schema
let not_match = columns
    .iter()
    .zip(schema.fields().iter())
    .map(|(col, field)| (col.data_type(), field.data_type()))
    .enumerate()
    .find(type_not_match);
if let Some((i, (col_type, field_type))) = not_match {
    return Err(ArrowError::InvalidArgumentError(format!(
        "column types must match schema types, expected {field_type} but found {col_type} at column index {i}"
    )));
}

Contributor Author

Thanks for verifying!

Member

@Weijun-H Weijun-H left a comment

Nice catch! Thanks @pepijnve

alamb pushed a commit that referenced this pull request Oct 14, 2025
# Which issue does this PR close?

- Related to #8591.

# Rationale for this change

Add a microbenchmark for `RecordBatch::project` to measure the
performance impact of #8583

# What changes are included in this PR?

Adds an additional micro benchmark to `arrow-rs`.

# Are these changes tested?

Not applicable for benchmark code. Benchmark manually tested.

# Are there any user-facing changes?

No
@alamb
Contributor

alamb commented Oct 14, 2025

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.14.0-1016-gcp #17~24.04.1-Ubuntu SMP Wed Sep 3 01:55:36 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing rb_project (b8564d3) to 89e9612 diff
BENCH_NAME=record_batch
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench record_batch
BENCH_FILTER=
BENCH_BRANCH_NAME=rb_project
Results will be posted here when complete

@alamb
Contributor

alamb commented Oct 14, 2025

Thanks @pepijnve -- I kicked off the benchmarks

@alamb
Contributor

alamb commented Oct 14, 2025

🤖: Benchmark completed

Details

group                            main                                   rb_project
-----                            ----                                   ----------
project/1000x8192 -> 1x8192      1.06    177.0±0.33ns        ? ?/sec    1.00    166.7±0.22ns        ? ?/sec
project/1000x8192 -> 500x8192    1.26     20.6±0.04µs        ? ?/sec    1.00     16.4±0.02µs        ? ?/sec
project/1000x8192 -> 999x8192    1.27     40.9±0.11µs        ? ?/sec    1.00     32.3±0.05µs        ? ?/sec
project/100x8192 -> 1x8192       1.04    176.9±0.26ns        ? ?/sec    1.00    170.4±0.25ns        ? ?/sec
project/100x8192 -> 50x8192      1.23      2.4±0.00µs        ? ?/sec    1.00   1950.3±2.82ns        ? ?/sec
project/100x8192 -> 99x8192      1.25      4.4±0.01µs        ? ?/sec    1.00      3.5±0.00µs        ? ?/sec
project/10x8192 -> 1x8192        1.06    177.1±0.39ns        ? ?/sec    1.00    166.7±0.70ns        ? ?/sec
project/10x8192 -> 5x8192        1.16    441.8±0.66ns        ? ?/sec    1.00    381.4±1.60ns        ? ?/sec
project/10x8192 -> 9x8192        1.15    695.7±1.44ns        ? ?/sec    1.00    605.1±0.95ns        ? ?/sec
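
The ratio column can be turned into a percentage of time saved with a one-line calculation. The helper below is a hypothetical convenience for reading the table, not part of the benchmark script.

```rust
/// Relative time saved, in percent, when a measurement drops from
/// `main_ns` (baseline) to `branch_ns` (the changed code).
fn percent_saved(main_ns: f64, branch_ns: f64) -> f64 {
    (main_ns - branch_ns) / main_ns * 100.0
}
```

For the 1000x8192 -> 999x8192 row, `percent_saved(40.9, 32.3)` comes to about 21%; for 1000x8192 -> 1x8192, `percent_saved(177.0, 166.7)` comes to about 6%. The units cancel, so both inputs only need to share the same unit.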

@alamb
Contributor

alamb commented Oct 14, 2025

🤖: Benchmark completed

That is some pretty nice confirmation

@alamb alamb merged commit 2f5ae5c into apache:main Oct 14, 2025
26 checks passed
@alamb
Contributor

alamb commented Oct 14, 2025

Thanks @pepijnve and @Weijun-H

@pepijnve pepijnve deleted the rb_project branch October 22, 2025 14:58
alamb pushed a commit that referenced this pull request Oct 24, 2025
# Which issue does this PR close?

- Closes #8692.

# Rationale for this change

Explained in issue.

# What changes are included in this PR?

- Adds `FilterPredicate::filter_record_batch`
- Adapts the free function `filter_record_batch` to use the new function
- Uses `new_unchecked` to create the filtered result. The rationale for
this is identical to #8583

# Are these changes tested?

Covered by existing tests for `filter_record_batch`

# Are there any user-facing changes?

No

---------

Co-authored-by: Martin Grigorov <[email protected]>

Labels

arrow Changes to the arrow crate performance

Development

Successfully merging this pull request may close these issues.

Eliminate redundant validation in RecordBatch::project

3 participants