Skip to content

Conversation

@xudong963
Copy link
Member

Which issue does this PR close?

  • Closes #.

Rationale for this change

In our prod, we found that the optimize_projection is slow when the number of columns is large, and is_projection_unnecessary is a bottleneck. After checking the method, I found we can improve it.

Here is the benchmark:

Result

v1 is the old and v2 is the optimized version

Projection Optimization/v1_one_non_trivial
                        time:   [92.212 µs 93.578 µs 95.428 µs]
Found 16 outliers among 100 measurements (16.00%)
  2 (2.00%) high mild
  14 (14.00%) high severe
Projection Optimization/v2_one_non_trivial
                        time:   [242.52 ns 248.62 ns 256.51 ns]
Found 9 outliers among 100 measurements (9.00%)
  4 (4.00%) high mild
  5 (5.00%) high severe

Benchmark code

1000 exprs in Projection, and one of them is non-trival expr.

use arrow::datatypes::{DataType, Field, Schema};
use criterion::{criterion_group, criterion_main, Criterion};
use datafusion_expr::{binary_expr, col,Operator, Expr, LogicalPlan, table_scan};
use datafusion_optimizer::optimize_projections::{is_projection_unnecessary, is_projection_unnecessary_v2};

pub fn test_table_scan_fields() -> Vec<Field> {
    vec![
        Field::new("a", DataType::UInt32, false),
        Field::new("b", DataType::UInt32, false),
        Field::new("c", DataType::UInt32, false),
    ]
}

pub fn test_table_scan_with_name(name: &str) -> datafusion_common::Result<LogicalPlan> {
    let schema = Schema::new(test_table_scan_fields());
    table_scan(Some(name), &schema, None)?.build()
}

/// some tests share a common table
pub fn test_table_scan() -> datafusion_common::Result<LogicalPlan> {
    test_table_scan_with_name("test")
}
fn create_mixed_expressions(count: usize) -> Vec<Expr> {
    let mut exprs = Vec::with_capacity(count);
        for i in 0..count {
            // Most expressions are just column references
            if i != count / 2 {
                match i % 3 {
                    0 => exprs.push(col("a")),
                    1 => exprs.push(col("b")),
                    _ => exprs.push(col("c")),
                }
            } else {
                // One non-trivial expression (binary operation)
                exprs.push(binary_expr(col("a"), Operator::Plus, col("b")));
            }
        }
        exprs
}

fn benchmark_projections(c: &mut Criterion) {
    let input_plan = test_table_scan().unwrap();
    let one_non_trivial = create_mixed_expressions(1000);

    let mut group = c.benchmark_group("Projection Optimization");

    // Benchmark with one non-trivial expression
    group.bench_function("v1_one_non_trivial", |b| {
        b.iter(|| is_projection_unnecessary(&input_plan, &one_non_trivial));
    });
    group.bench_function("v2_one_non_trivial", |b| {
        b.iter(|| is_projection_unnecessary_v2(&input_plan, &one_non_trivial));
    });

    group.finish();
}

criterion_group!(benches, benchmark_projections);
criterion_main!(benches);

What changes are included in this PR?

Are these changes tested?

Yes

Are there any user-facing changes?

@github-actions github-actions bot added the optimizer Optimizer rules label Apr 18, 2025
@xudong963 xudong963 added the performance Make DataFusion faster label Apr 18, 2025
@xudong963
Copy link
Member Author

thank you @Dandandan

@xudong963 xudong963 merged commit 6ffed55 into apache:main Apr 19, 2025
27 checks passed
@xudong963 xudong963 deleted the improve_project_o branch April 19, 2025 15:56
nirnayroy pushed a commit to nirnayroy/datafusion that referenced this pull request May 2, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

optimizer Optimizer rules performance Make DataFusion faster

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants