Conversation
* Move files that are shared between the integration test suite and performance benchmark test suite into a common parent directory.
* Add customized pytest console reporting for benchmarks.
* Add support for JSON benchmark result generation.
* Update the Presto Benchmarking section of the README.
raw_times_dict = benchmark_dict[BenchmarkKeys.RAW_TIMES_KEY]
assert raw_times_dict == {}

failed_queries_dict = benchmark_dict[BenchmarkKeys.FAILED_QUERIES_KEY]
While the failed queries are being identified and recorded in the json file, pytest still reports those queries as having passed. I think we should report them as failed tests if they did not go through.
```
$ ./run_benchmark.sh -b tpch -s sf1 -q 1
1 passed in 0.05s

$ cat benchmark_output/benchmark_result.json
{
  "tpch": {
    "agg_times_ms": {
      "avg": {},
      "min": {},
      "max": {}
    },
    "failed_queries": {
      "Q1": "USER_ERROR: SYNTAX_ERROR"
    }
  }
}
```
This should be fixed in the current PR revision.
Generally looks good to me. I was able to get everything running smoothly (once I fixed some unrelated bugs in our scripts - which I will address separately).
At the time of writing, I believe we are still missing the ability to run warmup queries, compared to the other scripts that are in use.
By my current comparison, using the python interface is slightly faster than using curl, but we do lose some metadata. Do we have any interest in collecting these other pieces like cpu_time?
Can you please expand on this requirement? The current benchmark suite would run the queries multiple times, and users can choose to use the min time if colder queries are to be ignored.
I believe the python client exposes the same metadata, so we can add this if needed. Is any other data besides the elapsed time being used?
We clarified this with Todd in the standup, so I think the current form meets the current expectations now. If there are other tweaks we need we can always add them later.
I think elapsed time is the only one we really care about. If there are others we can add them in later. I have some slight concerns that Presto does not seem to be consistent with what "elapsed time" means when requested via different APIs (a request via curl will yield a different number than via the cli, despite both being recorded in presto metadata as "elapsed_time").
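The repeated-runs approach described above can be sketched as follows. This measures client-side elapsed time with `time.perf_counter`; `run_fn` is a hypothetical stand-in for the actual query execution call, not the PR's API:

```python
import time

def time_query(run_fn, repeats=5):
    """Run the query `repeats` times and return each elapsed time in ms.

    Taking min() over the returned list is what lets users discount
    colder runs, as discussed above. run_fn is a placeholder for the
    actual query call.
    """
    times_ms = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_fn()
        times_ms.append((time.perf_counter() - start) * 1000.0)
    return times_ms
```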
misiugodfrey
left a comment
LGTM. I think this works and it's better to get it in so that folks can start using it and any further features can be added later.
That being said, one potential recommendation I have is to include all requested queries in the output, regardless of whether they fail. Right now I see:
Query ID | Avg(ms) | Min(ms) | Max(ms)
-------------------------------------------
Q1 | 1043.4 | 1014 | 1060
Q2 | 1080.2 | 1056 | 1110
Q3 | 1841.4 | 1771 | 1905
Q4 | 629 | 612 | 654
Q5 | 3627.2 | 3462 | 3919
Q6 | 175.6 | 163 | 188
Q7 | 1880.4 | 1754 | 1979
Q8 | 5455.2 | 5324 | 5707
Q9 | 5066.6 | 5025 | 5109
Q10 | 1846.2 | 1797 | 1922
Q11 | 601 | 552 | 655
Q12 | 783.2 | 700 | 941
Q13 | 738 | 724 | 757
Q14 | 363.6 | 344 | 379
Q15 | 442.4 | 416 | 479
Q16 | 524 | 513 | 539
Q17 | 4779.2 | 4744 | 4801
Q18 | 5679.4 | 5614 | 5799
Q19 | 527 | 499 | 554
Q20 | 1045.6 | 1016 | 1073
Q22 | 470 | 453 | 508
======================================================================
FAILED ../testing/performance_benchmarks/tpch_test.py::test_query[Q21] - prestodb.exceptions.PrestoQueryError: PrestoQueryError(type=INTERNAL_ERROR, name=GENERIC_INTERNAL_ERROR, message=" Operator::getOutput failed for [operator: CudfHashJoinProbe, plan node ID: 572]: std::bad_alloc: out_of_memory: CUDA error (failed to allocate 608240624 b
ytes) at: /presto_native_release_gpu_ON_build/_deps/rmm-src/cpp/incl...
1 failed, 21 passed in 198.08s (0:03:18)
Whereas I would prefer to see Q21 listed in the output but with some kind of null or empty values so that it is not missed:
Q20 | 1045.6 | 1016 | 1073
Q21 | null | null | null
Q22 | 470 | 453 | 508
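One way to implement this is to drive the table from the full list of requested queries and substitute null placeholders for any query with no recorded stats. A sketch; the row layout and `agg_times` shape are assumptions, not the PR's actual report code:

```python
def render_table(all_queries, agg_times):
    """Render the summary table from the full requested-query list,
    so failed queries still appear (with "null" placeholders)."""
    lines = ["Query ID | Avg(ms) | Min(ms) | Max(ms)",
             "-" * 43]
    for qid in all_queries:
        stats = agg_times.get(qid)  # None for failed queries
        avg, mn, mx = stats if stats else ("null", "null", "null")
        lines.append(f"{qid:<9}| {avg:>7} | {mn:>7} | {mx:>7}")
    return "\n".join(lines)
```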
Testing this PR, I ran into some practical issues that may be worth addressing with guardrails or better documentation. I list them in detail below.

I tried to run on a dataset that already existed but was not set up with a schema. Reading the Running the data generation script I ran into this error. The solution is to ensure that

This error appears to kick in when testing anything relatively large with the default config, which limits the server to 2GB. In my case I tested SF100. The solution is to increase the memory limits (manually) for the Presto server and restart.

This error occurs after bumping the memory limits by a factor of 10 (from 2GB to 20GB). It appears we still hit memory-related issues.

As a side note, not related to this PR: increasing the memory limits to something large (like 200GB) on a server seems to fail silently and not start a worker node with the error: Presumably some error is generated somewhere, but neither the benchmark script nor the server (docker) process seems to have any logs.
mattgara
left a comment
Overall looks good to me.
Most of my comments are open-ended and have to do with discussing quality-of-life changes/improvements.
from ..common.fixtures import tpch_queries, tpcds_queries

@pytest.fixture(scope="module")
We might want to consider the scope here. A scope of function may be more appropriate for benchmarking as it would create a new connection for each query being executed.
Hmm, why would we want to create a new connection for each query?
I'm not clear on whether there is a delay/cost to running the first query on a new connection, or whether there is state carried over between queries run on the same connection or cursor (tied to that "session"). If so, using one approach or the other would yield different benchmark timings.
Looking into this to see how other database benchmarks approach this, it looks like pgbench uses one connection per transaction while benchmarking (see here: https://github.com/postgres/postgres/blob/master/src/bin/pgbench/pgbench.c#L267.)
Reading a little more up on this, the transaction is the logical unit pgbench wants to benchmark at (see builtin TPC-B like benchmark here: https://github.com/postgres/postgres/blob/66cdef4425f3295a2cfbce3031b3c0f0b5dffc04/src/bin/pgbench/pgbench.c#L782)
If it's reasonable to follow Postgres' pattern here, we'd probably want to decide what logical unit we are benchmarking: individual queries, or all queries in the benchmark (for example TPC-H). If it's the former, we probably want a new connection or cursor per query; for the latter, I think we are fine reusing.
What's the typical logical unit we want to benchmark?
I believe when it comes to PostgreSQL, client/server connections are stateful, so having to reset the connection for each query makes sense. In this case, client/server interactions happen over HTTP, which is a stateless protocol, and connection times are not included in the benchmark report.
In this case, client/server interactions happen over HTTP, which is a stateless protocol, and connection times are not included in the benchmark report.
Ahh okay, in that case my premise that some state can be carried over is not valid anymore, thanks for the clarification.
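For illustration, the scope trade-off discussed in this thread can be made concrete with a counting fake; `fake_connect` is a hypothetical stand-in for the real client connect call, not the PR's fixture:

```python
import pytest

CONNECT_CALLS = {"count": 0}

def fake_connect():
    """Hypothetical stand-in for the real client connect call,
    counting how often a connection is opened."""
    CONNECT_CALLS["count"] += 1
    return object()

# scope="module": fake_connect runs once and the connection is shared
# by every test in the file. With scope="function", pytest would call
# it again for each query, adding per-query connection setup time.
@pytest.fixture(scope="module")
def connection():
    return fake_connect()
```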
parser.addoption("--tag")

def pytest_terminal_summary(terminalreporter, exitstatus, config):
For the summary report consider,
- Separating the first query in each set of queries from computing averages or any statistics and reporting them separately.
- Reporting the arithmetic and geometric mean (the geometric mean is more robust to outliers.)
- Since we report min and max, median would be good to compute.
- Separating the first query in each set of queries from computing averages or any statistics and reporting them separately.
I don't think this works with the "lukewarm" benchmark requirement/use case?
- Reporting the arithmetic and geometric mean (the geometric mean is more robust to outliers.)
- Since we report min and max, median would be good to compute.
Geometric mean and median have been added.
I don't think this works with the "lukewarm" benchmark requirement/use case?
Hmm right. I wonder if it would be beneficial to have both sets of behaviour implemented for the reporting that can be toggled with a switch?
Alternatively, we can leave it as is, and come back and implement it if there is a user requesting this feature.
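All of the aggregates discussed here (arithmetic mean, geometric mean, median, min, max) are available in the standard library. A sketch; the key names are illustrative, not the PR's actual JSON schema:

```python
import statistics

def summarize(times_ms):
    """Aggregate raw per-query timings into summary statistics.
    Key names are illustrative, not the PR's actual JSON schema."""
    return {
        "avg": statistics.fmean(times_ms),
        "geomean": statistics.geometric_mean(times_ms),
        "median": statistics.median(times_ms),
        "min": min(times_ms),
        "max": max(times_ms),
    }
```

Note that `statistics.geometric_mean` (Python 3.8+) requires strictly positive inputs, which holds for elapsed times.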
The report has been updated to include failed queries.
I have updated the
Can you please expand on this? I believe this check is already done by Presto.
The benchmark script requires an existing and valid schema but has no direct dependency on the
Feel free to add configuration related comments to this PR: #48
I think that's fine. My thinking was to have more of a user-friendly message, but I think the Presto database error is fine for the set of users we expect to be using this tool.
Okay, could you please point me at the documentation that describes we should have
I believe the setup script help description should contain this detail.
Hmm is In that case, should this statement be included in a more user-visible place?
This should be in the output of