Optimizer is slow: Avoid too many string cloning in the optimizer #5157

zeodtr · 2023-02-02T09:35:27Z

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

I'm not sure this is a feature request, but at least this is not a bug (albeit it's a performance problem), so I write this issue as a feature request.

I'm benchmarking the optimizers of DataFusion and Calcite.
I intended to compare the quality of the optimized plans between them, assuming that DataFusion's optimizing speed would be faster (since it's written in Rust).
But to my surprise, I found that Calcite's optimizer is way faster (~ 20x) in some cases.

The case is as follows:

Query statement is very simple: "select column_1 from table_1"
table_1 has about 700 columns (Yes, it has so many columns, that's the problem).

While Calcite finished the optimization in about 7 msec, DataFusion's optimizer took about 120 msec.
At first, the number was worse, but it settled to about 120 msec when I set the global allocator to mimalloc. (I've tried snmalloc and it was somewhat faster - about 100 msec. But somehow snmalloc did not play well with valgrind, I chose mimalloc at least temporarily)

I ran the test program with valgrind / callgrind and drew the call graph. The graph showed that about half of the execution time is being spent on <alloc::string::String as core::clone::Clone>::clone. The call count was 3,930,814.

I ran the optimizer for another table with fewer columns (about 200 columns), and it took much less time - about 12msec.

So, I suspect that the optimizer becomes slow (at least for a table with many columns) because it clones the strings related to the schema of the table too many times.

Describe the solution you'd like

Perhaps removing unnecessary cloning may help. Or, make the fields immutable and manage them with reference counted smart pointers.

Describe alternatives you've considered

No alternatives.

Additional context

The following attachment is the call graph in SVG format. It was created by gprof2dot.py and dot with callgrind's output data. 'batch_test' is the name of my test program. Somewhat contrary to the name, The program only tests one query statement.

The text was updated successfully, but these errors were encountered:

tustvold · 2023-02-02T16:52:22Z

Possibly related - #4680

alamb · 2023-02-02T20:33:48Z

I looked at the trace and here are my observations:

As @tustvold has said, if we can have DFSchema / DFField that don't copy the values #4680 around that would help immensly

A large amount of the allocations come from DFSchema::merge -- see https://github.com/apache/arrow-datafusion/blob/224c682101949da57aebc36e92e5a881ef3040d4/datafusion/common/src/dfschema.rs#L135-L151

And a large part of that is how it ignores errors with .ok() where were quite expensive to produce

It also appears there is copying going on in unwrap_cast_in_comparison and common subexpr eliminiate

alamb · 2023-02-02T20:34:45Z

@zeodtr Would you be willing to contribute your benchmark program -- we have https://github.com/apache/arrow-datafusion/blob/master/datafusion/core/benches/sql_planner.rs but maybe we could have a more end to end plan creation test 🤔

ygf11 · 2023-02-03T02:07:35Z

And a large part of that is how it ignores errors with .ok() where were quite expensive to produce

The result type of field_with_name is Result<&DFField>, it is not clear enough I think.

Giving the name, there are three results:

No such field, return FieldNotFound error.
Find only one field, return this field.
Find one more field, return Ambiguous reference error.

The first and third both will return error, ~~in most times we should distinguish them, but it is not easy.~~ I am not sure if we should distinguish them.

Maybe we can change it to Result<Option<&Field>>.

zeodtr · 2023-02-03T02:42:04Z

@alamb
I've simplified my benchmark program as follows. (I would like to upload the file as an attachment, but it's impossible in my current environment) It is a single file, named main.rs. It is based on datafusion/sql/examples/sql.rs except for the optimizer part. The optimizer part is based on the documentation in https://crates.io/crates/datafusion-optimizer.

use std::{collections::HashMap, sync::Arc};

use datafusion::{
    arrow::datatypes::{DataType, Field, Schema},
    common::Result,
    config::ConfigOptions,
    error::DataFusionError,
    logical_expr::{
        logical_plan::builder::LogicalTableSource, AggregateUDF, LogicalPlan, ScalarUDF,
        TableSource,
    },
    optimizer::{optimizer::Optimizer, OptimizerContext, OptimizerRule},
    sql::{
        planner::{ContextProvider, SqlToRel},
        sqlparser::{dialect::GenericDialect, parser::Parser},
        TableReference,
    },
};

#[global_allocator]
static GLOBAL: mimalloc::MiMalloc = mimalloc::MiMalloc;

fn main() {
    let sql = "select column1 from table1";
    let schema_provider = TestSchemaProvider::new();

    let now = std::time::Instant::now();

    let dialect = GenericDialect {};
    let ast = Parser::parse_sql(&dialect, sql).unwrap();
    let statement = &ast[0];
    let sql_to_rel = SqlToRel::new(&schema_provider);
    let plan = sql_to_rel.sql_statement_to_plan(statement.clone()).unwrap();

    println!(
        "elapsed time after creating a logical plan: {}",
        now.elapsed().as_millis()
    );

    let config = OptimizerContext::default();
    let optimizer = Optimizer::new();
    let optimized_plan = optimizer.optimize(&plan, &config, observe).unwrap();

    println!(
        "elapsed time after optimization: {}\n",
        now.elapsed().as_millis()
    );

    println!("plan:\n{:?}\n", plan);
    println!("optimized plan:\n{:?}", optimized_plan);
}

fn observe(_plan: &LogicalPlan, _rule: &dyn OptimizerRule) {}

struct TestSchemaProvider {
    options: ConfigOptions,
    tables: HashMap<String, Arc<dyn TableSource>>,
}

impl TestSchemaProvider {
    pub fn new() -> Self {
        let mut tables = HashMap::new();
        tables.insert(
            "table1".to_string(),
            create_table_source({
                let mut fields = Vec::new();
                for num in 0..700 {
                    fields.push(Field::new(
                        format!("column{}", num + 1),
                        DataType::Int32,
                        false,
                    ))
                }
                fields
            }),
        );

        Self {
            options: Default::default(),
            tables,
        }
    }
}

fn create_table_source(fields: Vec<Field>) -> Arc<dyn TableSource> {
    Arc::new(LogicalTableSource::new(Arc::new(
        Schema::new_with_metadata(fields, HashMap::new()),
    )))
}

impl ContextProvider for TestSchemaProvider {
    fn get_table_provider(&self, name: TableReference) -> Result<Arc<dyn TableSource>> {
        match self.tables.get(name.table()) {
            Some(table) => Ok(table.clone()),
            _ => Err(DataFusionError::Plan(format!(
                "Table not found: {}",
                name.table()
            ))),
        }
    }

    fn get_function_meta(&self, _name: &str) -> Option<Arc<ScalarUDF>> {
        None
    }

    fn get_aggregate_meta(&self, _name: &str) -> Option<Arc<AggregateUDF>> {
        None
    }

    fn get_variable_type(&self, _variable_names: &[String]) -> Option<DataType> {
        None
    }

    fn options(&self) -> &ConfigOptions {
        &self.options
    }
}

And the content of Cargo.toml is as follows:

[package]
name = "simple_optimizer_test"
version = "0.1.0"
edition = "2021"

# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html

[dependencies]
datafusion = "17.0.0"
mimalloc = "0.1.34"

[profile.release]
debug = 2

BTW, It seems that the performance of the optimizer worsens more than linearly as the number of the columns increases.

For example, when the number of the column is 1000, the total elapsed time is about 200 msec (30 for creating the logical plan, 170 for the optimizer), but when the number of the column is 2000, the total elapsed time is about 800 msec (100 for creating the logical plan, 700 for the optimizer).

alamb · 2023-02-08T18:05:40Z

Thank you @zeodtr -- I intend to port this into the datafusion benchmark suite eventually.

alamb · 2023-02-11T17:35:24Z

Proposed adding in #5256

alamb · 2023-02-16T19:07:53Z

Here is one possible task that would help: #5309

alamb · 2023-03-18T16:14:48Z

Filed #5637 to track making the optimizer faster in general

mslapek · 2023-03-19T14:28:27Z

TBH String interning would be the best ⭐️ - no cloning, quick Eq/Hash...

Instead of String we could use StringId:

#[derive(Hash, PartialEq, Eq, Copy, Clone)]
struct StringId(usize);

And some HashMap<String, StringId> to assign strings to unique IDs.

alamb · 2023-03-20T11:23:04Z

TBH String interning would be the best ⭐️ - no cloning, quick Eq/Hash...

Agreed though then we have to thread the HashMap through

Another possibility would be to use Arc<str> (aka refcounted strings) -- not quite as cheap to clone / create but also don't need any external context

mslapek · 2023-03-20T17:35:14Z

Another possibility would be to use Arc<str> (aka refcounted strings) -- not quite as cheap to clone / create but also don't need any external context

Looks like a good tradeoff.

What do you think about 🎯 precomputed hash + Arc<str>?

constant time cloning
constant time hashing (Add LogicalPlanSignature and use in the optimizer loop #5623 proposes to use LogicalPlan hashing in the optimizer loop)

Performance almost-like string interning.

alamb · 2023-03-20T18:26:31Z

What do you think about 🎯 precomputed hash + Arc?

Seems reasonable to me. Maybe as some intermediate state we could have a StringInterner structure that computed the hash + Arc<str> and tried to reuse any previously known about.

Then we could thread through a StringInterner when possible (e.g. on the optimizer or the sql planner) but also be able to make these strings easily without one (so we could incrementally update the code)

Any such solution, I think, should strive very hard to keep the burden of working in the DataFusion codebase low (e.g. should be easy to work with for anyone used to String, be well documented, etc)

tustvold · 2023-03-26T17:41:08Z

apache/arrow-rs#3955 may also be relevant here, I'm optimistic that we can drastically reduce the overheads without needing to reach for exotic solutions like string interning, at least initially

comphead · 2024-01-30T23:30:16Z

@alamb I'm wondering if #9020 closes this one

alamb · 2024-01-31T16:35:06Z

@alamb I'm wondering if #9020 closes this one

It is a good call @comphead -- It certainly helps 😅.

Speeds up benchmarks in sql_planner.rs by 18-75%.

(BTW these are the benchmarks that @zeodtr contributed ❤️ )

I am also quite excited about @matthewmturner 's work in #8905

@zeodtr / @mslapek I wonder if you have an opinion about this. Shall we keep it open? Shall we close it in favor of tickets that describe specific improvements? Do you have a chance to test with more recent versions of DataFusion?

zeodtr · 2024-01-31T23:51:08Z

@alarmb Since there is an epic issue(#5637), this issue can be closed in favor of more specific issues.
Thanks for your great work.
By the way, at the moment, I'm generally satisfied with the performance of logical planning and optimization routines after applying some optimizations here and there, as I explained in #7698 (comment). It would be nice if these optimizations could be incorporated into the official source code, perhaps with some adjustments if needed.

alamb · 2024-02-01T21:02:02Z

as I explained in #7698 (comment). I

I forgot out great that description was. FYI @matthewmturner for your inspiration

alamb · 2024-06-23T19:24:01Z

@alarmb Since there is an epic issue(#5637), this issue can be closed in favor of more specific issues.

Thanks again @zeodtr

BTW depending on the usecase we have made datafusion planning about 2x faster between 37.0.0 and 40.0.0 (will be released shortly). There were more pronounced gains for schemas with larger numbers of columns

zeodtr added the enhancement New feature or request label Feb 2, 2023

alamb changed the title ~~Optimizer: Avoid too many string cloning in the optimizer~~ Optimizer is slow: Avoid too many string cloning in the optimizer Feb 11, 2023

alamb mentioned this issue Feb 11, 2023

SQL planning benchmarks for very wide tables #5256

Merged

alamb mentioned this issue Feb 16, 2023

Make a faster way to check column existence in optimizer (not is_err()) #5309

Closed

alamb mentioned this issue Mar 18, 2023

[Epic] A collection of issues to improve planning performance / speed / efficiency #5637

Open

15 tasks

alamb mentioned this issue Mar 28, 2023

Add Fields abstraction (#3955) apache/arrow-rs#3965

Merged

alamb mentioned this issue Mar 16, 2024

[EPIC] Stop copying LogicalPlan during OptimizerPasses #9637

Closed

31 tasks

alamb closed this as completed Jun 23, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimizer is slow: Avoid too many string cloning in the optimizer #5157

Optimizer is slow: Avoid too many string cloning in the optimizer #5157

zeodtr commented Feb 2, 2023 •

edited

Loading

tustvold commented Feb 2, 2023

alamb commented Feb 2, 2023

alamb commented Feb 2, 2023

ygf11 commented Feb 3, 2023 •

edited

Loading

zeodtr commented Feb 3, 2023 •

edited

Loading

alamb commented Feb 8, 2023

alamb commented Feb 11, 2023

alamb commented Feb 16, 2023

alamb commented Mar 18, 2023

mslapek commented Mar 19, 2023

alamb commented Mar 20, 2023

mslapek commented Mar 20, 2023

alamb commented Mar 20, 2023

tustvold commented Mar 26, 2023 •

edited

Loading

comphead commented Jan 30, 2024

alamb commented Jan 31, 2024

zeodtr commented Jan 31, 2024 •

edited

Loading

alamb commented Feb 1, 2024

alamb commented Jun 23, 2024

Optimizer is slow: Avoid too many string cloning in the optimizer #5157

Optimizer is slow: Avoid too many string cloning in the optimizer #5157

Comments

zeodtr commented Feb 2, 2023 • edited Loading

tustvold commented Feb 2, 2023

alamb commented Feb 2, 2023

alamb commented Feb 2, 2023

ygf11 commented Feb 3, 2023 • edited Loading

zeodtr commented Feb 3, 2023 • edited Loading

alamb commented Feb 8, 2023

alamb commented Feb 11, 2023

alamb commented Feb 16, 2023

alamb commented Mar 18, 2023

mslapek commented Mar 19, 2023

alamb commented Mar 20, 2023

mslapek commented Mar 20, 2023

alamb commented Mar 20, 2023

tustvold commented Mar 26, 2023 • edited Loading

comphead commented Jan 30, 2024

alamb commented Jan 31, 2024

zeodtr commented Jan 31, 2024 • edited Loading

alamb commented Feb 1, 2024

alamb commented Jun 23, 2024

zeodtr commented Feb 2, 2023 •

edited

Loading

ygf11 commented Feb 3, 2023 •

edited

Loading

zeodtr commented Feb 3, 2023 •

edited

Loading

tustvold commented Mar 26, 2023 •

edited

Loading

zeodtr commented Jan 31, 2024 •

edited

Loading