Refactor and improve QueryInfo and StageInfo#11580
Conversation
|
@pettyjamesm can you please enhance commit messages with some motivation and/or short description? |
a636e86 to
2d82faa
Compare
Yep, I was in the process of preparing the commits while the PR was still in a draft state. Commit messages have been elaborated upon, some more commits added, and this should now be ready for review. |
163f0ab to
52394e1
Compare
core/trino-main/src/main/java/io/trino/execution/QueryInfo.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
I wonder if scheduled should be kept in QueryInfo and removed from QueryStats as it seems to be about a query's state rather than statistics.
Will let @sopel39 decide this
There was a problem hiding this comment.
Yeah, I wasn’t totally sure and it’s easy enough to keep the field in both places (just redundant)
There was a problem hiding this comment.
@sopel39 - think you'll have a chance to weigh in on this soon?
There was a problem hiding this comment.
I wonder if scheduled should be kept in QueryInfo and removed from QueryStats as it seems to be about a query's state rather than statistics.
makes sense.
BTW: do we need scheduled since we have state in query info?
There was a problem hiding this comment.
@raunaqmorarka - I took a look at moving scheduled from QueryStats to QueryInfo but unfortunately it's used from within QueryStats#getProgressPercentage which makes extracting it a harder problem, so I propose we leave it in QueryStats since all QueryInfo instances have a QueryStats value they can consult.
BTW: do we need scheduled since we have state in query info?
@sopel39 : We don't "need" it in the sense that you could re-compute it, but the QueryInfo level notion of being scheduled is not a function of the QueryState but rather is set to true when of the query's stages are either done, pending, or running- so it does have semantic meaning on it's own.
core/trino-main/src/main/java/io/trino/server/protocol/Query.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Would it be better if we did globalUniqueNodes.addAll(stageUniqueNodes) after this loop ?
There was a problem hiding this comment.
There’s no particular advantage to bulk inserts on HashSet, the implementation just does another loop with another iterator which is probably slightly slower (extra iterator, iterator is over the Set instead of a List, temporal memory locality of access, etc)
core/trino-main/src/main/java/io/trino/server/protocol/Query.java
Outdated
Show resolved
Hide resolved
6f2cf68 to
4958232
Compare
4958232 to
ed9e499
Compare
|
@raunaqmorarka - let me know if you have another idea for how to handle where the field should live based on my earlier comment. I also realized that the web UI requires the |
ae5b937 to
c459fa9
Compare
|
@raunaqmorarka / @sopel39 - ready to merge? |
|
It lgtm, will let @sopel39 decide on merging it |
|
@sopel39 - any update here? |
5e7b3b0 to
af229d4
Compare
Refactors QueryInfo to consolidate usages of isCompleteInfo() and isFinalInfo() into a single method: isFinalInfo. The removed isCompleteInfo field and method lacked a @JsonProperty annotation and was therefore not being serialized anyway, and the isFinalInfo method which was annotated as a @JsonProperty for serialization had no corresponding constructor field and so also would not survive a serialize / deserialize round-trip and required re-building a list of StageInfo on each call, which is unnecessary overhead. The consolidation of the two related fields ensures that the work of determining whether a QueryInfo is "final" is done once while producing the QueryInfo object and successfully round-trips when serialized an deserialized.
StageInfo methods isCompleteInfo() and isFinalInfo() had identical implementations. Since neither method is serialized in the JSON representation, consolidating usage sites to call isFinalInfo() is safe and allows removing the redundant isCompleteInfo() method.
Refactors the QueryInfo and QueryStats creation logic to call getAllStages once, instead of four times as was being done before this change. Each call to getAllStages will build an ImmutableList with each StageInfo which can be relatively expensive for queries that have a large number of stages. This can be significant, since QueryInfo creation occurs on each batch of query results fetched through the coordinator.
QueryInfo and QueryStats both embed the field "scheduled", but QueryInfo embeds QueryStats- making the separate field redundant. With this change, QueryStats will still contain the boolean field and QueryInfo.isScheduled() can defer to the QueryStats value. Unfortunately, the value must still be serialized in the QueryInfo JSON payload since the web UI requires that BasicQueryInfo and QueryInfo both have this field available.
Avoids unnecessary Optional boxing and lambda allocations inside of getAllStages to reduce the overhead associated with flattening sub-stage info into a list.
Avoids extra allocations and copies associated with Immutable collection building removes an additional recursive child stage traversal associated with producing StatementStats and StageStats in the critical path of producing query results responses.
af229d4 to
da3aad6
Compare
Description
This PR refactors
QueryInfoand it's associated creation methods inQueryStateMachineto avoid unnecessary repeated work, remove redundant fields and methods, and reduce memory allocations- since this logic is performed as part of producing each query results response.Refactor and reduction of unnecessary overhead in the critical path for sending query result responses.
Core engine
No description of this change to non-technical end users and/or system administrators should be necessary.
Documentation
(x) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.
Release notes
(x) No release notes entries required.
( ) Release notes entries required with the following suggested text: