Add ClickBench queries to DataFusion benchmark runner #7060

Merged
merged 3 commits into from
Jul 27, 2023

alamb
Contributor

@alamb alamb commented Jul 23, 2023

Draft as it builds on #7054

Which issue does this PR close?

closes #6994
closes #6128

Rationale for this change

See #6994 -- tl;dr: to optimize the ClickBench queries, it first needs to be easier to run them.

What changes are included in this PR?

  1. Add new dfbench clickbench command to run ClickBench queries
  2. Update bench.sh to run clickbench queries
  3. Update benchmarks/README.md -- see rendered version https://github.com/alamb/arrow-datafusion/tree/alamb/clickbench_runner/benchmarks

Are these changes tested?

I tested them manually

Run clickbench q1 directly (e.g. for profiling):

cargo run  --bin dfbench -- clickbench --query 1
Running benchmarks with the following options: RunOpt { query: Some(1), common: CommonOpt { iterations: 3, partitions: 2, batch_size: 8192 }, path: "benchmarks/data/hits.parquet", queries_path: "benchmarks/queries/clickbench/queries.sql", output_path: None }
Q1: SELECT COUNT(*) FROM hits;
Query 1 iteration 0 took 305.7 ms and returned 1 rows
Query 1 iteration 1 took 13.6 ms and returned 1 rows
Query 1 iteration 2 took 13.6 ms and returned 1 rows
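
The per-iteration lines above can be produced by a simple timing loop. The following is a minimal sketch using only the standard library, not the actual dfbench code: `time_iterations`, `format_iteration`, and the `run_query` closure are hypothetical names, and the closure stands in for executing one query and returning its row count.

```rust
use std::time::{Duration, Instant};

/// Runs `run_query` `iterations` times and records the elapsed time and
/// row count of each run. `run_query` is a hypothetical stand-in for
/// executing one SQL query and returning the number of rows produced.
fn time_iterations<F>(iterations: usize, mut run_query: F) -> Vec<(Duration, usize)>
where
    F: FnMut() -> usize,
{
    let mut results = Vec::with_capacity(iterations);
    for _ in 0..iterations {
        let start = Instant::now();
        let row_count = run_query();
        results.push((start.elapsed(), row_count));
    }
    results
}

/// Formats one result line in the same style as the log output above.
fn format_iteration(query_id: usize, iteration: usize, elapsed: Duration, rows: usize) -> String {
    let millis = elapsed.as_secs_f64() * 1000.0;
    format!(
        "Query {} iteration {} took {:.1} ms and returned {} rows",
        query_id, iteration, millis, rows
    )
}
```

Reporting each iteration separately, rather than only an average, keeps a slow first run (as in iteration 0 above) visible in the output.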

Run with hits_partitioned (100 parquet files):

cargo run  --bin dfbench -- clickbench --query 1 --path=benchmarks/data/hits_partitioned

Run with bench.sh:

./bench.sh run clickbench_1

See help:

cargo run  --bin dfbench  -- clickbench --help

dfbench-clickbench 27.0.0
Run the clickbench benchmark

The ClickBench[1] benchmarks are widely cited in the industry and
focus on grouping / aggregation / filtering. This runner uses the
scripts and queries from [2].

[1]: https://github.com/ClickHouse/ClickBench
[2]: https://github.com/ClickHouse/ClickBench/tree/main/datafusion

USAGE:
    dfbench clickbench [OPTIONS]

FLAGS:
    -h, --help       
            Prints help information

    -V, --version    
            Prints version information


OPTIONS:
    -s, --batch-size <batch-size>        
            Batch size when reading CSV or Parquet files [default: 8192]

    -i, --iterations <iterations>        
            Number of iterations of each test run [default: 3]

    -o, --output <output-path>           
            If present, write results json here

    -n, --partitions <partitions>        
            Number of partitions to process in parallel [default: 2]

    -p, --path <path>                    
            Path to hits.parquet (single file) or `hits_partitioned` (partitioned, 100 files) [default:
            benchmarks/data/hits.parquet]
    -r, --queries_path <queries-path>    
            Path to queries.sql (single file) [default: benchmarks/queries/clickbench/queries.sql]

    -q, --query <query>                  
            Query number. If not specified, runs all queries

Are there any user-facing changes?

@alamb alamb marked this pull request as ready for review July 25, 2023 12:15
@alamb alamb marked this pull request as draft July 25, 2023 12:15

// Common benchmark options (don't use doc comments otherwise this doc
// shows up in help files)
#[derive(Debug, StructOpt, Clone)]
Contributor Author

Refactored the options shared by all benchmarks into CommonOpt.
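
As a rough illustration of that refactoring, here is a std-only sketch. The field names and defaults come from the `RunOpt` debug output and help text in the PR description; the field types, the `ClickbenchRunOpt` / `TpchRunOpt` names, and the `summarize` helper are assumptions made for this example. The real struct derives `StructOpt`, where embedding a shared options struct is typically done with `#[structopt(flatten)]`.

```rust
/// Options shared by every benchmark (sketch only; the real struct
/// derives StructOpt and is parsed from the command line).
#[derive(Debug, Clone, PartialEq)]
struct CommonOpt {
    /// Number of iterations of each test run
    iterations: usize,
    /// Number of partitions to process in parallel
    partitions: usize,
    /// Batch size when reading CSV or Parquet files
    batch_size: usize,
}

impl Default for CommonOpt {
    /// Defaults taken from the `--help` output in the PR description.
    fn default() -> Self {
        Self { iterations: 3, partitions: 2, batch_size: 8192 }
    }
}

/// Hypothetical clickbench options: benchmark-specific fields plus the
/// shared `common` block.
#[derive(Debug, Clone)]
struct ClickbenchRunOpt {
    query: Option<usize>,
    common: CommonOpt,
    path: String,
}

/// Hypothetical tpch options reusing the same shared block.
#[derive(Debug, Clone)]
struct TpchRunOpt {
    query: Option<usize>,
    debug: bool,
    common: CommonOpt,
}

/// Any runner can report the shared settings the same way.
fn summarize(common: &CommonOpt) -> String {
    format!(
        "iterations={} partitions={} batch_size={}",
        common.iterations, common.partitions, common.batch_size
    )
}
```

With composition like this, a new shared option only has to be added in one place instead of once per benchmark.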

pub use run::{BenchQuery, BenchmarkRun};

Contributor Author

This was dead code that I accidentally introduced in #7054.

The actual entrypoint for the dfbench binary is in https://github.com/apache/arrow-datafusion/blob/main/benchmarks/src/bin/dfbench.rs

/// Run the tpch benchmark
/// Run the tpch benchmark.
///
/// This benchmarks is derived from the [TPC-H][1] version
Contributor Author

I moved the details of what the benchmark does from the README into the binary, which I think is better because it is closer to the code. I don't feel strongly about this and would welcome other opinions.

@@ -53,17 +62,9 @@ pub struct RunOpt {
#[structopt(short, long)]
debug: bool,

/// Number of iterations of each test run
Contributor Author

This is a first down payment on reducing duplication in the benchmark runners (e.g. part of #7052).

@alamb alamb marked this pull request as ready for review July 25, 2023 12:21
@alamb
Contributor Author

alamb commented Jul 26, 2023

@tustvold or @Dandandan do you have time to review this PR? (I hope to use this benchmark runner to drive / test further groupby performance improvements.)

@Dandandan Dandandan changed the title Add ClickBench queries to DataFusion benchmark runer Add ClickBench queries to DataFusion benchmark runner Jul 26, 2023

/// Returns the text of query `query_id`
fn get_query(&self, query_id: usize) -> Result<String> {
if query_id == 0 || query_id > 43 {
Contributor

ClickBench numbers the queries 0-42:

https://benchmark.clickhouse.com/

Contributor Author

Good catch -- fixed in ae4ce7d
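
For illustration, a bounds check that matches the 0-42 numbering could look like the sketch below. This is not the code from ae4ce7d: the `MAX_QUERY_ID` constant, the `String` error type, and passing the file contents as `all_queries` are assumptions for this example, as is the layout of queries.sql with one query per line.

```rust
/// Highest valid ClickBench query number (queries are numbered 0-42).
const MAX_QUERY_ID: usize = 42;

/// Returns the text of query `query_id`.
///
/// Assumes `all_queries` holds the contents of queries.sql with one
/// query per line (an assumption made for this sketch).
fn get_query(all_queries: &str, query_id: usize) -> Result<String, String> {
    // 0 is a valid id, so only the upper bound needs checking.
    if query_id > MAX_QUERY_ID {
        return Err(format!(
            "Invalid query id {}. Must be between 0 and {}",
            query_id, MAX_QUERY_ID
        ));
    }
    all_queries
        .lines()
        .map(str::trim)
        .filter(|line| !line.is_empty())
        .nth(query_id)
        .map(str::to_string)
        .ok_or_else(|| format!("Query {} not found in queries file", query_id))
}
```

Because the ids are zero-based, the id can be used directly as an index into the list of queries, with no off-by-one adjustment.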

@Dandandan
Contributor

One comment about the clickbench numbers, the rest looks good to me! Thanks for driving this forward

@alamb alamb merged commit 11b7b5c into apache:main Jul 27, 2023
21 checks passed
@alamb alamb deleted the alamb/clickbench_runner branch July 27, 2023 11:30
Development

Successfully merging this pull request may close these issues.

Add clickbench queries to tpch and bench.sh
Add a clickbench DataFusion benchmark runner
2 participants