Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add benchmark derived from TPC-H #355

Merged
merged 7 commits into from
May 3, 2023
Merged
Show file tree
Hide file tree
Changes from 6 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions benchmarks/tpch/.gitignore
Original file line number Diff line number Diff line change
@@ -0,0 +1,2 @@
data
results.csv
78 changes: 78 additions & 0 deletions benchmarks/tpch/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,78 @@
<!---
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

# DataFusion Python Benchmarks Derived from TPC-H

## Create Release Build

From repo root:

```bash
maturin develop --release
```

Note that release builds take a really long time, so you may want to temporarily comment out this section of the
root Cargo.toml when frequently building.

```toml
[profile.release]
lto = true
codegen-units = 1
```

## Generate Data

```bash
./tpch-gen.sh 1
```

## Run Benchmarks

```bash
python tpch.py ./data ./queries
```

A summary of the benchmark timings will be written to `results.csv`. For example:

```csv
setup,1.4
q1,2978.6
q2,679.7
q3,2943.7
q4,2894.9
q5,3592.3
q6,1691.4
q7,3003.9
q8,3818.7
q9,4237.9
q10,2344.7
q11,526.1
q12,2284.6
q13,1009.2
q14,1738.4
q15,1942.1
q16,499.8
q17,5178.9
q18,4127.7
q19,2056.6
q20,2162.5
q21,8046.5
q22,754.9
total,58513.2
```
133 changes: 133 additions & 0 deletions benchmarks/tpch/create_tables.sql
Original file line number Diff line number Diff line change
@@ -0,0 +1,133 @@
-- Licensed to the Apache Software Foundation (ASF) under one
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this also contain the disclaimer about being derived from TPC-H since it is performing DDL? Don't know the answer there?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I will go ahead and add this.

-- or more contributor license agreements. See the NOTICE file
-- distributed with this work for additional information
-- regarding copyright ownership. The ASF licenses this file
-- to you under the Apache License, Version 2.0 (the
-- "License"); you may not use this file except in compliance
-- with the License. You may obtain a copy of the License at
--
-- http://www.apache.org/licenses/LICENSE-2.0
--
-- Unless required by applicable law or agreed to in writing,
-- software distributed under the License is distributed on an
-- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-- KIND, either express or implied. See the License for the
-- specific language governing permissions and limitations
-- under the License.

CREATE EXTERNAL TABLE customer (
c_custkey INT NOT NULL,
c_name VARCHAR NOT NULL,
c_address VARCHAR NOT NULL,
c_nationkey INT NOT NULL,
c_phone VARCHAR NOT NULL,
c_acctbal DECIMAL(15, 2) NOT NULL,
c_mktsegment VARCHAR NOT NULL,
c_comment VARCHAR NOT NULL,
c_extra VARCHAR NOT NULL,
)
STORED AS CSV
WITH HEADER ROW DELIMITER '|'
LOCATION '$PATH/customer.csv';

CREATE EXTERNAL TABLE lineitem (
l_orderkey INT NOT NULL,
l_partkey INT NOT NULL,
l_suppkey INT NOT NULL,
l_linenumber INT NOT NULL,
l_quantity DECIMAL(15, 2) NOT NULL,
l_extendedprice DECIMAL(15, 2) NOT NULL,
l_discount DECIMAL(15, 2) NOT NULL,
l_tax DECIMAL(15, 2) NOT NULL,
l_returnflag VARCHAR NOT NULL,
l_linestatus VARCHAR NOT NULL,
l_shipdate DATE NOT NULL,
l_commitdate DATE NOT NULL,
l_receiptdate DATE NOT NULL,
l_shipinstruct VARCHAR NOT NULL,
l_shipmode VARCHAR NOT NULL,
l_comment VARCHAR NOT NULL,
l_extra VARCHAR NOT NULL,
)
STORED AS CSV
WITH HEADER ROW DELIMITER '|'
LOCATION '$PATH/lineitem.csv';

CREATE EXTERNAL TABLE nation (
n_nationkey INT NOT NULL,
n_name VARCHAR NOT NULL,
n_regionkey INT NOT NULL,
n_comment VARCHAR NOT NULL,
n_extra VARCHAR NOT NULL,
)
STORED AS CSV
WITH HEADER ROW DELIMITER '|'
LOCATION '$PATH/nation.csv';

CREATE EXTERNAL TABLE orders (
o_orderkey INT NOT NULL,
o_custkey INT NOT NULL,
o_orderstatus VARCHAR NOT NULL,
o_totalprice DECIMAL(15, 2) NOT NULL,
o_orderdate DATE NOT NULL,
o_orderpriority VARCHAR NOT NULL,
o_clerk VARCHAR NOT NULL,
o_shippriority INT NULL,
o_comment VARCHAR NOT NULL,
o_extra VARCHAR NOT NULL,
)
STORED AS CSV
WITH HEADER ROW DELIMITER '|'
LOCATION '$PATH/orders.csv';

CREATE EXTERNAL TABLE part (
p_partkey INT NOT NULL,
p_name VARCHAR NOT NULL,
p_mfgr VARCHAR NOT NULL,
p_brand VARCHAR NOT NULL,
p_type VARCHAR NOT NULL,
p_size INT NULL,
p_container VARCHAR NOT NULL,
p_retailprice DECIMAL(15, 2) NOT NULL,
p_comment VARCHAR NOT NULL,
p_extra VARCHAR NOT NULL,
)
STORED AS CSV
WITH HEADER ROW DELIMITER '|'
LOCATION '$PATH/part.csv';

CREATE EXTERNAL TABLE partsupp (
ps_partkey INT NOT NULL,
ps_suppkey INT NOT NULL,
ps_availqty INT NOT NULL,
ps_supplycost DECIMAL(15, 2) NOT NULL,
ps_comment VARCHAR NOT NULL,
ps_extra VARCHAR NOT NULL,
)
STORED AS CSV
WITH HEADER ROW DELIMITER '|'
LOCATION '$PATH/partsupp.csv';

CREATE EXTERNAL TABLE region (
r_regionkey INT NOT NULL,
r_name VARCHAR NOT NULL,
r_comment VARCHAR NOT NULL,
r_extra VARCHAR NOT NULL,
)
STORED AS CSV
WITH HEADER ROW DELIMITER '|'
LOCATION '$PATH/region.csv';

CREATE EXTERNAL TABLE supplier (
s_suppkey INT NOT NULL,
s_name VARCHAR NOT NULL,
s_address VARCHAR NOT NULL,
s_nationkey INT NOT NULL,
s_phone VARCHAR NOT NULL,
s_acctbal DECIMAL(15, 2) NOT NULL,
s_comment VARCHAR NOT NULL,
s_extra VARCHAR NOT NULL,
)
STORED AS CSV
WITH HEADER ROW DELIMITER '|'
LOCATION '$PATH/supplier.csv';
23 changes: 23 additions & 0 deletions benchmarks/tpch/queries/q1.sql
Original file line number Diff line number Diff line change
@@ -0,0 +1,23 @@
-- Benchmark Query 1 derived from TPC-H query 1 under the terms of the TPC Fair Use Policy.
-- TPC-H queries are Copyright 1993-2022 Transaction Processing Performance Council.
select
l_returnflag,
l_linestatus,
sum(l_quantity) as sum_qty,
sum(l_extendedprice) as sum_base_price,
sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
avg(l_quantity) as avg_qty,
avg(l_extendedprice) as avg_price,
avg(l_discount) as avg_disc,
count(*) as count_order
from
lineitem
where
l_shipdate <= date '1998-12-01' - interval '68 days'
group by
l_returnflag,
l_linestatus
order by
l_returnflag,
l_linestatus;
33 changes: 33 additions & 0 deletions benchmarks/tpch/queries/q10.sql
Original file line number Diff line number Diff line change
@@ -0,0 +1,33 @@
-- Benchmark Query 10 derived from TPC-H query 10 under the terms of the TPC Fair Use Policy.
-- TPC-H queries are Copyright 1993-2022 Transaction Processing Performance Council.
select
c_custkey,
c_name,
sum(l_extendedprice * (1 - l_discount)) as revenue,
c_acctbal,
n_name,
c_address,
c_phone,
c_comment
from
customer,
orders,
lineitem,
nation
where
c_custkey = o_custkey
and l_orderkey = o_orderkey
and o_orderdate >= date '1993-07-01'
and o_orderdate < date '1993-07-01' + interval '3' month
and l_returnflag = 'R'
and c_nationkey = n_nationkey
group by
c_custkey,
c_name,
c_acctbal,
c_phone,
n_name,
c_address,
c_comment
order by
revenue desc limit 20;
29 changes: 29 additions & 0 deletions benchmarks/tpch/queries/q11.sql
Original file line number Diff line number Diff line change
@@ -0,0 +1,29 @@
-- Benchmark Query 11 derived from TPC-H query 11 under the terms of the TPC Fair Use Policy.
-- TPC-H queries are Copyright 1993-2022 Transaction Processing Performance Council.
select
ps_partkey,
sum(ps_supplycost * ps_availqty) as value
from
partsupp,
supplier,
nation
where
ps_suppkey = s_suppkey
and s_nationkey = n_nationkey
and n_name = 'ALGERIA'
group by
ps_partkey having
sum(ps_supplycost * ps_availqty) > (
select
sum(ps_supplycost * ps_availqty) * 0.0001000000
from
partsupp,
supplier,
nation
where
ps_suppkey = s_suppkey
and s_nationkey = n_nationkey
and n_name = 'ALGERIA'
)
order by
value desc;
30 changes: 30 additions & 0 deletions benchmarks/tpch/queries/q12.sql
Original file line number Diff line number Diff line change
@@ -0,0 +1,30 @@
-- Benchmark Query 12 derived from TPC-H query 12 under the terms of the TPC Fair Use Policy.
-- TPC-H queries are Copyright 1993-2022 Transaction Processing Performance Council.
select
l_shipmode,
sum(case
when o_orderpriority = '1-URGENT'
or o_orderpriority = '2-HIGH'
then 1
else 0
end) as high_line_count,
sum(case
when o_orderpriority <> '1-URGENT'
and o_orderpriority <> '2-HIGH'
then 1
else 0
end) as low_line_count
from
orders,
lineitem
where
o_orderkey = l_orderkey
and l_shipmode in ('FOB', 'SHIP')
and l_commitdate < l_receiptdate
and l_shipdate < l_commitdate
and l_receiptdate >= date '1995-01-01'
and l_receiptdate < date '1995-01-01' + interval '1' year
group by
l_shipmode
order by
l_shipmode;
22 changes: 22 additions & 0 deletions benchmarks/tpch/queries/q13.sql
Original file line number Diff line number Diff line change
@@ -0,0 +1,22 @@
-- Benchmark Query 13 derived from TPC-H query 13 under the terms of the TPC Fair Use Policy.
-- TPC-H queries are Copyright 1993-2022 Transaction Processing Performance Council.
select
c_count,
count(*) as custdist
from
(
select
c_custkey,
count(o_orderkey)
from
customer left outer join orders on
c_custkey = o_custkey
and o_comment not like '%express%requests%'
group by
c_custkey
) as c_orders (c_custkey, c_count)
group by
c_count
order by
custdist desc,
c_count desc;
15 changes: 15 additions & 0 deletions benchmarks/tpch/queries/q14.sql
Original file line number Diff line number Diff line change
@@ -0,0 +1,15 @@
-- Benchmark Query 14 derived from TPC-H query 14 under the terms of the TPC Fair Use Policy.
-- TPC-H queries are Copyright 1993-2022 Transaction Processing Performance Council.
select
100.00 * sum(case
when p_type like 'PROMO%'
then l_extendedprice * (1 - l_discount)
else 0
end) / sum(l_extendedprice * (1 - l_discount)) as promo_revenue
from
lineitem,
part
where
l_partkey = p_partkey
and l_shipdate >= date '1995-02-01'
and l_shipdate < date '1995-02-01' + interval '1' month;
Loading