Update ClickBench benchmarks with DataFusion 40 #11567

alamb · 2024-07-20T11:46:43Z

Is your feature request related to a problem or challenge?

DataFusion 40 has been released https://crates.io/crates/datafusion/40.0.0

It would be great to update ClickBench https://benchmark.clickhouse.com/ with runs from the latest version. It looks like we are still reporting numbers for DataFusion 36.

Describe the solution you'd like

Perhaps we can follow the model of ClickHouse/ClickBench#178 (thanks @pmcgleenon ) or ClickHouse/ClickBench#145 (thanks @kmitchener )

Describe alternatives you've considered

No response

Additional context

No response

alamb · 2024-07-20T11:48:14Z

FWIW I think the benchmarks will improve dramatically once we have completed the StringView work that @XiangpengHao is leading (#10918)

xinlifoobar · 2024-07-23T13:35:32Z

take

xinlifoobar · 2024-07-23T13:50:31Z

Sorry @alamb, I just found out that I cannot create an AWS account at this time. Is it fine to use an Azure VM, e.g., F16sv2, instead? If not, please unassign me...

alamb · 2024-07-25T12:14:47Z

I am not sure -- maybe @pmcgleenon could comment? I don't know if this is equivalent to the AWS machines

pmcgleenon · 2024-07-25T12:39:14Z

From what I understand, the previous datafusion ClickBench runs have been on this AWS EC2 instance:

- c6a.4xlarge
- Amazon Linux 2 AMI
- Root 500GB gp2 SSD
- no EBS optimized
- no instance store

If we want to compare datafusion performance across different datafusion versions (and spot any improvements/degradations) then sticking to the same machine spec will allow us to do this.

If you check the ClickBench results, many other databases also publish their results for the c6a.4xlarge AWS EC2 instance, so we can compare datafusion results with DuckDB, ClickHouse, QuestDB and many more.

IMO we should stick with the same AWS instance for future runs.

@alamb @xinlifoobar I'm happy to help out here if required!

pmcgleenon · 2024-07-25T12:50:50Z

By the way, this is what I found comparing AWS c6a.4xlarge and Azure Standard_F16s_v2. It looks like the CPU clock speed is different and there are some differences in the storage performance numbers.

https://cloudprice.net/aws/ec2/instances/c6a.4xlarge
- 16 vCPU
- CPU AMD 3.6 GHz
- 32 GB RAM
https://cloudprice.net/vm/Standard_F16s_v2
- 16 vCPU
- CPU Intel 2.70GHz
- 32 GB RAM

alamb · 2024-07-25T15:46:43Z

> @alamb @xinlifoobar I'm happy to help out here if required!

That would be amazing 🙏

pmcgleenon · 2024-07-28T13:31:40Z

@alamb @xinlifoobar Here are the results for df40 (attached is a file comparing 33, 34, 36 and 40)

Single

Partitioned

df40.zip

Are these results in line with your expectations?

If so, I can create a PR on ClickBench to update the datafusion results.

alamb · 2024-07-29T10:36:33Z

> Are these results in line with your expectations?

I would expect that we wouldn't see much performance difference between 40 and the other versions, as we haven't done much on query performance recently.

It appears that the really low latency queries have gotten slower (perhaps due to increasing overhead in planning or runtime somewhere).

If these results are reproducible, I do think we should publish them to ClickBench.

Thank you @pmcgleenon and @xinlifoobar
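
One way to check whether the low-latency queries really got slower is to compare per-query timings between two runs. Below is a minimal Python sketch, assuming each run is available as a list with one entry per query, where each entry holds that query's run times in seconds (`None` for a failed run). The function names, the 0.1 s "short query" cutoff, and the 1.2x threshold are illustrative assumptions, not something taken from the attached results.

```python
from typing import List, Optional, Tuple

Timings = List[List[Optional[float]]]  # one inner list of run times (seconds) per query


def best_time(runs: List[Optional[float]]) -> Optional[float]:
    """Return the fastest successful run for one query, or None if every run failed."""
    ok = [t for t in runs if t is not None]
    return min(ok) if ok else None


def slower_short_queries(
    old: Timings,
    new: Timings,
    short_cutoff_s: float = 0.1,  # assumption: what counts as a "short" query
    min_ratio: float = 1.2,       # assumption: how much slower counts as a slowdown
) -> List[Tuple[int, float, float, float]]:
    """Return (query_index, old_best, new_best, ratio) for short queries that slowed down.

    Query indexes are 0-based positions in the input lists.
    """
    flagged = []
    for i, (old_runs, new_runs) in enumerate(zip(old, new)):
        o, n = best_time(old_runs), best_time(new_runs)
        if o is None or n is None or o <= 0:
            continue  # nothing meaningful to compare
        if o <= short_cutoff_s and n / o >= min_ratio:
            flagged.append((i, o, n, n / o))
    return flagged


if __name__ == "__main__":
    # Made-up timings purely to exercise the function; NOT real benchmark results.
    old_run = [[0.05, 0.02, 0.02], [1.50, 1.20, 1.10], [0.09, 0.04, 0.04]]
    new_run = [[0.06, 0.03, 0.03], [1.40, 1.15, 1.10], [0.09, 0.04, 0.04]]
    for idx, o, n, ratio in slower_short_queries(old_run, new_run):
        print(f"Q{idx}: {o:.3f}s -> {n:.3f}s ({ratio:.2f}x)")
```

Taking the best of the runs for each query is a deliberate choice here: it reduces noise from cold caches, which matters most for queries that finish in a few milliseconds.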

pmcgleenon · 2024-07-29T18:47:43Z

Thanks @alamb, I'll create a PR on the ClickBench repo to update the results.

pmcgleenon · 2024-07-29T19:44:23Z

I've opened this PR for the DataFusion 40 results: ClickHouse/ClickBench#210

alamb · 2024-07-29T21:22:06Z

> It appears that the really low latency queries have gotten slower (perhaps due to increasing overhead in planning or runtime somewhere).

BTW I plan to spend some time tomorrow organizing / profiling the results to see if I can find some additional improvements to make.

pmcgleenon · 2024-07-29T22:16:52Z

That sounds great!

One benefit of the ClickBench results is that we can easily compare with other projects. Overall, datafusion (partitioned) seems competitive with ClickHouse and DuckDB on similar hardware. In some cases the datafusion performance is better.
The results highlight a couple of scenarios where datafusion is behind (e.g. Q18, Q29, Q39).
It will be exciting to see how the StringView work and any other improvements will change this picture in the future!

alamb · 2024-07-30T10:03:36Z

I expect StringView to help, as well as @korowa's #11627

In case anyone else is interested, I did an analysis of the various benchmark query properties here:
https://docs.google.com/spreadsheets/d/1NZuh_dEs9gX5uEp8AQ3DfvkNfC6bXFayQtFkjeKjNxQ/edit?gid=0#gid=0

(e.g. that is how I determined the relative cardinalities and the types of queries)

I think Q16/Q17/Q18 are all "high cardinality aggregates with multi-column group keys"

alamb · 2024-07-30T10:46:23Z

I looked at some short queries and found one potential improvement #11719

I also looked at Q38

SELECT "URL", COUNT(*) AS PageViews FROM hits WHERE "CounterID" = 62 AND "EventDate"::INT::DATE >= '2013-07-01' AND "EventDate"::INT::DATE <= '2013-07-31' AND "IsRefresh" = 0 AND "IsLink" <> 0 AND "IsDownload" = 0 GROUP BY "URL" ORDER BY PageViews DESC LIMIT 10 OFFSET 1000;

$ cargo run --release --bin dfbench -- clickbench --iterations 100 --path benchmarks/data/hits_partitioned --query 38

More than 50% of the time is spent doing snappy decoding (which we aren't likely to be able to improve)

12% of the time is spent reading string data from parquet (maybe StringView will help)
10% of the time is spent decoding parquet metadata
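
For a rough sense of how much each of these percentages bounds any improvement, Amdahl's law can be applied to the breakdown above. This is a back-of-the-envelope Python sketch: the fractions come from the profile above (taking "more than 50%" as 0.50), while the 2x component speedup is purely hypothetical and not a measured or projected number.

```python
def overall_speedup(fraction: float, component_speedup: float) -> float:
    """Amdahl's law: whole-query speedup when `fraction` of the runtime
    becomes `component_speedup` times faster and the rest is unchanged."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    if component_speedup <= 0:
        raise ValueError("component_speedup must be positive")
    return 1.0 / ((1.0 - fraction) + fraction / component_speedup)


if __name__ == "__main__":
    # Fractions of Q38 runtime as reported in the profile above.
    profile = {
        "snappy decoding": 0.50,
        "reading string data": 0.12,
        "decoding parquet metadata": 0.10,
    }
    hypothetical_speedup = 2.0  # assumption, for illustration only
    for name, fraction in profile.items():
        total = overall_speedup(fraction, hypothetical_speedup)
        print(f"{name}: {hypothetical_speedup:.0f}x faster -> {total:.3f}x overall")
```

Under these assumptions, making string reading twice as fast would speed up the whole query by only about 6% (1 / 0.94), which is why the snappy decoding share dominates what is achievable for this query.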

alamb · 2024-07-30T10:48:47Z

I am pretty sure Q18 would be helped with #9403 -- maybe we'll find a way to do that shortly

SELECT "UserID", extract(minute FROM to_timestamp_seconds("EventTime")) AS m, "SearchPhrase", COUNT(*) FROM hits GROUP BY "UserID", m, "SearchPhrase" ORDER BY COUNT(*) DESC LIMIT 10;

pmcgleenon · 2024-08-02T13:46:04Z

Hi @alamb,

The ClickBench PR has been merged.

The DataFusion version 40 results are now visible on the main ClickBench page https://benchmark.clickhouse.com/

alamb · 2024-08-02T14:02:19Z

Thank you so much @pmcgleenon 🙏 -- I am pretty excited to complete our in-progress work (like StringView and high cardinality aggregates) and run these again with a newer version of DataFusion

alamb added the enhancement New feature or request label Jul 20, 2024

This was referenced Jul 20, 2024

Update ClickBench benchmarks with DataFusion 36 #9404

Closed

Blog post for release 40.0.0 apache/datafusion-site#6

Merged

github-actions bot assigned xinlifoobar Jul 23, 2024

alamb mentioned this issue Jul 29, 2024

DataFusion weekly project plan (Andrew Lamb) - July 29, 2024 #11710

Closed

8 tasks

alamb mentioned this issue Jul 30, 2024

Improve parquet ListingTable speed with parquet metadata (short clickbench queries) #11719

Open

alamb closed this as completed Aug 2, 2024

alamb mentioned this issue Aug 5, 2024

DataFusion weekly project plan (Andrew Lamb) - Aug 5, 2024 #11826

Closed

6 tasks

This issue was closed.