Solves problems with expensive inlinestats benchmarks#896
Solves problems with expensive inlinestats benchmarks#896ncordon merged 6 commits intoelastic:masterfrom
Conversation
57ddf24 to
4fc5adc
Compare
| "name": "stats_count_group_by_esql", | ||
| "operation-type": "esql", | ||
| "query" : "FROM nyc_taxis METADATA _id | stats count(passenger_count) by _id | LIMIT 1000" | ||
| "query" : "FROM nyc_taxis_sample METADATA _id | stats count(passenger_count) by _id" |
There was a problem hiding this comment.
Note the LIMIT 1000 was moved after the queries in this other pr: #873
There was a problem hiding this comment.
Having the limit before or after the stats completely changes the behaviour of the query. The other PR moved it after, and this one effectively moves it before again. I'm assuming we will see performance differences with both changes?
There was a problem hiding this comment.
Having said that, I understand the purpose of these queries was to limit before, so the change in this current PR is moving back towards that original purpose.
There was a problem hiding this comment.
The queries haven't even run after we changed the limit to be after, right? So we should be able to revert to the original purpose of these benchmarks that was to evaluate the inline stats more than the fetching part
craigtaverner
left a comment
There was a problem hiding this comment.
Approved, although I wonder if there are pre-existing operations that could be used. At least for deleting the index, I assume there exists a delete-index operation. For the sampling of the index, I doubt there exists an operation for that, so the raw-operation is probably necessary. Would be interesting to get an opinion from es-perf about this.
nyc_taxis/operations/default.json
Outdated
| "body": { | ||
| "source": { | ||
| "index": "nyc_taxis", | ||
| "query": { "match_all": {} } |
There was a problem hiding this comment.
Do we need a query here, isn't match_all implied?
There was a problem hiding this comment.
I assumed you had two ways of doing this, one with a query, and include a size inside that, and the other with the max_docs parameter. Since you used max_docs, I presumed the query was unnecessary.
There was a problem hiding this comment.
Yeah you are right, I thought we still needed the query 👍
nyc_taxis/operations/default.json
Outdated
| }, | ||
| { | ||
| "name": "delete-nyc-taxis-sample-index", | ||
| "operation-type": "raw-request", |
There was a problem hiding this comment.
Isn't there already a delete index operation you could use instead of a raw-request?
There was a problem hiding this comment.
See index property in https://esrally.readthedocs.io/en/stable/track.html#delete-index.
| "name": "stats_count_group_by_esql", | ||
| "operation-type": "esql", | ||
| "query" : "FROM nyc_taxis METADATA _id | stats count(passenger_count) by _id | LIMIT 1000" | ||
| "query" : "FROM nyc_taxis_sample METADATA _id | stats count(passenger_count) by _id" |
There was a problem hiding this comment.
Having the limit before or after the stats completely changes the behaviour of the query. The other PR moved it after, and this one effectively moves it before again. I'm assuming we will see performance differences with both changes?
| "name": "stats_count_group_by_esql", | ||
| "operation-type": "esql", | ||
| "query" : "FROM nyc_taxis METADATA _id | stats count(passenger_count) by _id | LIMIT 1000" | ||
| "query" : "FROM nyc_taxis_sample METADATA _id | stats count(passenger_count) by _id" |
There was a problem hiding this comment.
Having said that, I understand the purpose of these queries was to limit before, so the change in this current PR is moving back towards that original purpose.
gbanasiak
left a comment
There was a problem hiding this comment.
The reindexing approach with document limit does not guarantee the same set of documents in nyc_taxis_sample on each run which may or may not contribute to variability of the benchmark depending on uniformity of the corpus. A conservative approach would be hand-picking 1000 documents, creating a new corpus file, and defining a new index in corpora. Something to consider if results are more noisy than before.
Please see other comments below.
d7d6ee1 to
c5ee865
Compare
|
backported to |
This is an empty commit which records missing backports from manual or squashed backports through "cherry picked from" metadata. CI determines Elasticsearch build arguments #925 (#926) (cherry picked from commit 8c33ff5) Exclude some challenges when testing with ES release builds #922 (#919) (cherry picked from commit f38f8fc) Reduce filtering scope in CI workflow #908 (#919) (cherry picked from commit 8e571a5) Address pytest deprecations #911 (#919) (cherry picked from commit fa81c5e) Solves problems with expensive inlinestats benchmarks #896 (#919) (cherry picked from commit 8209244) Adds better challenges for comparing inlinestats #873 (#919) (cherry picked from commit c41a950)
Adds a smaller index
nyc_taxis_samplewith a 1000 rows of the originalnyc_taxisand runs the benchmarksstats_count_group_by_esqlandinlinestats_count_group_by_esqlon it.Why
Because there are some benchmarks for
inline statsthat are triggering circuit breaker exceptions: