Skip redundant validation checks in RecordBatch#project #8583

pepijnve · 2025-10-10T06:10:15Z

Which issue does this PR close?

Closes #8591 (Eliminate redundant validation in RecordBatch::project).

Rationale for this change

RecordBatch#project currently builds its result with the validating factory function, try_new_with_options. Since project starts from an already valid RecordBatch, these checks are redundant. A small amount of work can be saved by using new_unchecked instead.

A change I'm working on for DataFusion uses RecordBatch#project in the inner expression evaluation loop to reduce the amount of redundant array filtering that case expressions need to do. While this is a micro-optimisation, avoiding redundant work in inner loops seems worthwhile.

What changes are included in this PR?

Use new_unchecked instead of try_new_with_options in RecordBatch#project
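
The swap can be illustrated with a simplified, self-contained model. The `Batch` type below is a hypothetical stand-in and not the arrow-rs `RecordBatch`; it only assumes that a batch stores field names, reference-counted columns, and a row count. `try_new` plays the role of the validating constructor, `new_unchecked` skips the checks, and `project` relies on the invariants that already hold for `self`.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for a column: a shared, immutable vector of values.
type Column = Arc<Vec<i64>>;

/// Hypothetical, simplified model of a record batch (not the arrow-rs type).
#[derive(Debug, Clone, PartialEq)]
struct Batch {
    fields: Vec<String>,
    columns: Vec<Column>,
    row_count: usize,
}

impl Batch {
    /// Validating constructor: checks every invariant, costing O(columns).
    fn try_new(fields: Vec<String>, columns: Vec<Column>, row_count: usize) -> Result<Self, String> {
        if fields.len() != columns.len() {
            return Err(format!(
                "number of columns({}) must match number of fields({})",
                columns.len(),
                fields.len()
            ));
        }
        if columns.iter().any(|c| c.len() != row_count) {
            return Err("all columns must have the specified row count".to_string());
        }
        Ok(Self { fields, columns, row_count })
    }

    /// Non-validating constructor.
    ///
    /// # Safety
    /// The caller must guarantee the invariants that `try_new` checks.
    unsafe fn new_unchecked(fields: Vec<String>, columns: Vec<Column>, row_count: usize) -> Self {
        Self { fields, columns, row_count }
    }

    /// Select columns by index. Only the indices need checking.
    fn project(&self, indices: &[usize]) -> Result<Self, String> {
        let mut fields = Vec::with_capacity(indices.len());
        let mut columns = Vec::with_capacity(indices.len());
        for &i in indices {
            let field = self.fields.get(i).ok_or_else(|| format!("index {i} out of bounds"))?;
            fields.push(field.clone());
            columns.push(Arc::clone(&self.columns[i]));
        }
        // SAFETY: fields and columns were pushed in pairs, and every column
        // comes from `self`, which already has `row_count` rows per column.
        Ok(unsafe { Self::new_unchecked(fields, columns, self.row_count) })
    }
}

fn main() {
    let col = |v: Vec<i64>| Arc::new(v);
    let batch = Batch::try_new(
        vec!["a".into(), "b".into(), "c".into()],
        vec![col(vec![1, 2]), col(vec![3, 4]), col(vec![5, 6])],
        2,
    )
    .expect("valid batch");
    let projected = batch.project(&[2, 0]).expect("indices in bounds");
    println!("{:?}", projected.fields);
}
```

In this model the only failure mode left in `project` is an out-of-bounds index; the per-column checks run once, when the source batch is first constructed.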

Are these changes tested?

No additional tests added.
Performance difference proven via microbenchmark

Are there any user-facing changes?

No

pepijnve · 2025-10-11T09:56:18Z

Some microbenchmark results, which are in line with my expectations. The more columns you retain, the more work the validation logic needs to do, so the benefit is largest in cases like 100x100 -> 99x100.

project/100x100 -> 1x100
                        time:   [85.588 ns 85.747 ns 85.903 ns]
                        change: [-7.3087% -7.0345% -6.7565%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  4 (4.00%) low mild
  1 (1.00%) high mild
  2 (2.00%) high severe

project/100x100 -> 50x100
                        time:   [788.42 ns 789.33 ns 790.23 ns]
                        change: [-24.541% -24.365% -24.202%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

project/100x100 -> 99x100
                        time:   [1.3220 µs 1.3261 µs 1.3303 µs]
                        change: [-25.897% -25.665% -25.434%] (p = 0.00 < 0.05)
                        Performance has improved.

project/5x100 -> 1x100  time:   [84.869 ns 84.963 ns 85.067 ns]
                        change: [-8.3173% -8.0696% -7.8238%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild

project/5x100 -> 2x100  time:   [93.710 ns 93.886 ns 94.111 ns]
                        change: [-14.343% -14.136% -13.921%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  1 (1.00%) high severe

project/5x100 -> 5x100  time:   [176.45 ns 176.73 ns 177.00 ns]
                        change: [-13.424% -13.215% -12.985%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) low severe
  1 (1.00%) high mild
  2 (2.00%) high severe

alamb

Thank you @pepijnve -- this looks like a great find to me.

alamb · 2025-10-11T12:43:10Z

arrow-array/Cargo.toml

@@ -80,3 +80,7 @@ harness = false
 [[bench]]
 name = "union_array"
 harness = false
+
+[[bench]]
+name = "record_batch"


Could you please move this benchmark into its own PR (mostly to make it easier to run the benchmark against unmodified code)?

Sure thing.

I've rewound this PR's branch by one commit and force-pushed it to remove the benchmark here.

alamb · 2025-10-11T12:45:21Z

arrow-array/benches/record_batch.rs

+fn criterion_benchmark(c: &mut Criterion) {
+    project_benchmark(c, 100, 100, 1);
+    project_benchmark(c, 100, 100, 50);
+    project_benchmark(c, 100, 100, 99);


I think a row count of 8192 is more realistic

Adding a benchmark with 1000 columns would also be useful (and might show off your improvement more)

Row count doesn't really affect project performance afaict from the code, but it's pretty trivial to add more test cases.

I added the extra cases in #8592
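
The point about row count can be illustrated with a small sketch. It assumes, as a simplification, that columns are reference-counted buffers (modelled here with `Arc<Vec<i64>>`, a stand-in rather than the arrow-rs array types), so selecting a column clones a pointer and never touches the rows. The function name `project_columns` is hypothetical.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for a column: a shared, immutable buffer of values.
type Column = Arc<Vec<i64>>;

/// Select columns by index by cloning the `Arc` handles.
/// The work is proportional to `indices.len()`, not to the number of rows.
fn project_columns(columns: &[Column], indices: &[usize]) -> Vec<Column> {
    indices.iter().map(|&i| Arc::clone(&columns[i])).collect()
}

fn main() {
    // Two "batches" with the same column count but very different row counts.
    let small: Vec<Column> = (0..4).map(|_| Arc::new(vec![0; 100])).collect();
    let large: Vec<Column> = (0..4).map(|_| Arc::new(vec![0; 8192])).collect();

    for columns in [&small, &large] {
        let projected = project_columns(columns, &[0, 3]);
        // The projected column shares its buffer with the source column.
        assert!(Arc::ptr_eq(&projected[0], &columns[0]));
        println!(
            "rows = {}, strong refs = {}",
            projected[0].len(),
            Arc::strong_count(&columns[0])
        );
    }
}
```

Under that assumption, both loops do the same amount of work per projected column, which is consistent with the 1-column results staying roughly constant across input sizes.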

alamb · 2025-10-11T12:49:02Z

arrow-array/src/record_batch.rs

-                row_count: Some(self.row_count),
-            },
-        )
+        unsafe {


I double checked the checks that happen as part of creating a RecordBatch and I agree they are entirely redundant when projecting an already valid RecordBatch

arrow-rs/arrow-array/src/record_batch.rs

Lines 307 to 365 in e5e4db9

    // check that number of fields in schema match column length
    if schema.fields().len() != columns.len() {
        return Err(ArrowError::InvalidArgumentError(format!(
            "number of columns({}) must match number of fields({}) in schema",
            columns.len(),
            schema.fields().len(),
        )));
    }

    let row_count = options
        .row_count
        .or_else(|| columns.first().map(|col| col.len()))
        .ok_or_else(|| {
            ArrowError::InvalidArgumentError(
                "must either specify a row count or at least one column".to_string(),
            )
        })?;

    for (c, f) in columns.iter().zip(&schema.fields) {
        if !f.is_nullable() && c.null_count() > 0 {
            return Err(ArrowError::InvalidArgumentError(format!(
                "Column '{}' is declared as non-nullable but contains null values",
                f.name()
            )));
        }
    }

    // check that all columns have the same row count
    if columns.iter().any(|c| c.len() != row_count) {
        let err = match options.row_count {
            Some(_) => "all columns in a record batch must have the specified row count",
            None => "all columns in a record batch must have the same length",
        };
        return Err(ArrowError::InvalidArgumentError(err.to_string()));
    }

    // function for comparing column type and field type
    // return true if 2 types are not matched
    let type_not_match = if options.match_field_names {
        |(_, (col_type, field_type)): &(usize, (&DataType, &DataType))| col_type != field_type
    } else {
        |(_, (col_type, field_type)): &(usize, (&DataType, &DataType))| {
            !col_type.equals_datatype(field_type)
        }
    };

    // check that all columns match the schema
    let not_match = columns
        .iter()
        .zip(schema.fields().iter())
        .map(|(col, field)| (col.data_type(), field.data_type()))
        .enumerate()
        .find(type_not_match);

    if let Some((i, (col_type, field_type))) = not_match {
        return Err(ArrowError::InvalidArgumentError(format!(
            "column types must match schema types, expected {field_type} but found {col_type} at column index {i}"
        )));
    }

Thanks for verifying!

Weijun-H

Nice catch! Thanks @pepijnve

Which issue does this PR close?

Related to #8591.

Rationale for this change

Add a microbenchmark for `RecordBatch::project` to measure the performance impact of #8583

What changes are included in this PR?

Adds an additional micro benchmark to `arrow-rs`.

Are these changes tested?

Not applicable for benchmark code. Benchmark manually tested.

Are there any user-facing changes?

No

alamb · 2025-10-14T15:30:35Z

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.14.0-1016-gcp #17~24.04.1-Ubuntu SMP Wed Sep 3 01:55:36 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing rb_project (b8564d3) to 89e9612 diff
BENCH_NAME=record_batch
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench record_batch
BENCH_FILTER=
BENCH_BRANCH_NAME=rb_project
Results will be posted here when complete

alamb · 2025-10-14T15:31:03Z

Thanks @pepijnve -- I kicked off the benchmarks

alamb · 2025-10-14T15:34:00Z

🤖: Benchmark completed

group                            main                                   rb_project
-----                            ----                                   ----------
project/1000x8192 -> 1x8192      1.06    177.0±0.33ns        ? ?/sec    1.00    166.7±0.22ns        ? ?/sec
project/1000x8192 -> 500x8192    1.26     20.6±0.04µs        ? ?/sec    1.00     16.4±0.02µs        ? ?/sec
project/1000x8192 -> 999x8192    1.27     40.9±0.11µs        ? ?/sec    1.00     32.3±0.05µs        ? ?/sec
project/100x8192 -> 1x8192       1.04    176.9±0.26ns        ? ?/sec    1.00    170.4±0.25ns        ? ?/sec
project/100x8192 -> 50x8192      1.23      2.4±0.00µs        ? ?/sec    1.00   1950.3±2.82ns        ? ?/sec
project/100x8192 -> 99x8192      1.25      4.4±0.01µs        ? ?/sec    1.00      3.5±0.00µs        ? ?/sec
project/10x8192 -> 1x8192        1.06    177.1±0.39ns        ? ?/sec    1.00    166.7±0.70ns        ? ?/sec
project/10x8192 -> 5x8192        1.16    441.8±0.66ns        ? ?/sec    1.00    381.4±1.60ns        ? ?/sec
project/10x8192 -> 9x8192        1.15    695.7±1.44ns        ? ?/sec    1.00    605.1±0.95ns        ? ?/sec
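
The first number on each side of the table is the time relative to the faster branch. As a worked check, the ratio and the percentage of time saved can be recomputed from the reported means. The helper names below are illustrative only, and the inputs are the rounded values shown in the table, so results can differ slightly from the printed ratios.

```rust
/// Ratio of baseline time to new time (values above 1.0 mean the new code is faster).
fn speedup(main_ns: f64, branch_ns: f64) -> f64 {
    main_ns / branch_ns
}

/// Percentage of the baseline time that was saved.
fn percent_saved(main_ns: f64, branch_ns: f64) -> f64 {
    (main_ns - branch_ns) / main_ns * 100.0
}

fn main() {
    // (case, main mean in ns, rb_project mean in ns), copied from the table above.
    let cases = [
        ("1000x8192 -> 1x8192", 177.0, 166.7),
        ("1000x8192 -> 999x8192", 40_900.0, 32_300.0),
        ("100x8192 -> 99x8192", 4_400.0, 3_500.0),
    ];
    for (name, main_ns, branch_ns) in cases {
        println!(
            "{name}: {:.2}x, {:.1}% less time",
            speedup(main_ns, branch_ns),
            percent_saved(main_ns, branch_ns)
        );
    }
}
```

For the 1000x8192 -> 999x8192 case this gives roughly 1.27x, or about 21% less time, which matches the ratio column.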

alamb · 2025-10-14T15:44:21Z

🤖: Benchmark completed

That is some pretty nice confirmation

alamb · 2025-10-14T15:53:03Z

Thanks @pepijnve and @Weijun-H

Which issue does this PR close?

Closes #8692.

Rationale for this change

Explained in issue.

What changes are included in this PR?

- Adds `FilterPredicate::filter_record_batch`
- Adapts the free function `filter_record_batch` to use the new function
- Uses `new_unchecked` to create the filtered result. The rationale for this is identical to #8583

Are these changes tested?

Covered by existing tests for `filter_record_batch`

Are there any user-facing changes?

No

---------

Co-authored-by: Martin Grigorov <[email protected]>

github-actions bot added the arrow Changes to the arrow crate label Oct 10, 2025

Skip redundant validation checks in RecordBatch#project

8cc69ca

pepijnve force-pushed the rb_project branch from d617470 to 8cc69ca Compare October 10, 2025 06:13

pepijnve marked this pull request as ready for review October 11, 2025 09:09

alamb added the performance label Oct 11, 2025

alamb approved these changes Oct 11, 2025

pepijnve mentioned this pull request Oct 11, 2025

Add RecordBatch::project microbenchmark #8592

Merged

pepijnve force-pushed the rb_project branch from e868782 to 8cc69ca Compare October 11, 2025 13:25

Weijun-H approved these changes Oct 12, 2025

Merge branch 'main' into rb_project

b8564d3

alamb merged commit 2f5ae5c into apache:main Oct 14, 2025
26 checks passed

alamb mentioned this pull request Oct 17, 2025

Eliminate redundant validation in RecordBatch::project #8591

Closed

pepijnve deleted the rb_project branch October 22, 2025 14:58

pepijnve mentioned this pull request Oct 23, 2025

Add FilterPredicate::filter_record_batch #8693

Merged

claude bot mentioned this pull request Oct 24, 2025

8693: Add FilterPredicate::filter_record_batch martin-augment/arrow-rs#7

Open

