Skip to content

Refactor and improve QueryInfo and StageInfo#11580

Merged
arhimondr merged 8 commits intotrinodb:masterfrom
pettyjamesm:fix-stage-info
Apr 28, 2022
Merged

Refactor and improve QueryInfo and StageInfo#11580
arhimondr merged 8 commits intotrinodb:masterfrom
pettyjamesm:fix-stage-info

Conversation

@pettyjamesm
Copy link
Member

@pettyjamesm pettyjamesm commented Mar 19, 2022

Description

This PR refactors QueryInfo and it's associated creation methods in QueryStateMachine to avoid unnecessary repeated work, remove redundant fields and methods, and reduce memory allocations- since this logic is performed as part of producing each query results response.

Is this change a fix, improvement, new feature, refactoring, or other?

Refactor and reduction of unnecessary overhead in the critical path for sending query result responses.

Is this a change to the core query engine, a connector, client library, or the SPI interfaces? (be specific)

Core engine

How would you describe this change to a non-technical end user or system administrator?

No description of this change to non-technical end users and/or system administrators should be necessary.

Documentation

(x) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.

Release notes

(x) No release notes entries required.
( ) Release notes entries required with the following suggested text:

@cla-bot cla-bot bot added the cla-signed label Mar 19, 2022
@findepi
Copy link
Member

findepi commented Mar 21, 2022

@pettyjamesm can you please enhance commit messages with some motivation and/or short description?

@findepi findepi requested review from kokosing and sopel39 March 21, 2022 12:43
@sopel39 sopel39 requested a review from raunaqmorarka March 21, 2022 12:58
@pettyjamesm pettyjamesm marked this pull request as ready for review March 21, 2022 16:58
@pettyjamesm
Copy link
Member Author

pettyjamesm commented Mar 21, 2022

can you please enhance commit messages with some motivation and/or short description?

Yep, I was in the process of preparing the commits while the PR was still in a draft state. Commit messages have been elaborated upon, some more commits added, and this should now be ready for review.

Copy link
Member

@raunaqmorarka raunaqmorarka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm % minor comments

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if scheduled should be kept in QueryInfo and removed from QueryStats as it seems to be about a query's state rather than statistics.
Will let @sopel39 decide this

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I wasn’t totally sure and it’s easy enough to keep the field in both places (just redundant)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sopel39 - think you'll have a chance to weigh in on this soon?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if scheduled should be kept in QueryInfo and removed from QueryStats as it seems to be about a query's state rather than statistics.

makes sense.

BTW: do we need scheduled since we have state in query info?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@raunaqmorarka - I took a look at moving scheduled from QueryStats to QueryInfo but unfortunately it's used from within QueryStats#getProgressPercentage which makes extracting it a harder problem, so I propose we leave it in QueryStats since all QueryInfo instances have a QueryStats value they can consult.

BTW: do we need scheduled since we have state in query info?

@sopel39 : We don't "need" it in the sense that you could re-compute it, but the QueryInfo level notion of being scheduled is not a function of the QueryState but rather is set to true when of the query's stages are either done, pending, or running- so it does have semantic meaning on it's own.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be better if we did globalUniqueNodes.addAll(stageUniqueNodes) after this loop ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There’s no particular advantage to bulk inserts on HashSet, the implementation just does another loop with another iterator which is probably slightly slower (extra iterator, iterator is over the Set instead of a List, temporal memory locality of access, etc)

@pettyjamesm pettyjamesm force-pushed the fix-stage-info branch 6 times, most recently from 6f2cf68 to 4958232 Compare March 28, 2022 15:39
@sopel39 sopel39 requested a review from raunaqmorarka March 29, 2022 10:51
@pettyjamesm
Copy link
Member Author

@raunaqmorarka - let me know if you have another idea for how to handle where the field should live based on my earlier comment. I also realized that the web UI requires the "scheduled" flag to be present at both QueryInfo and BasicQueryInfo top level serialized form, so the latest revision outputs the serialized value in both QueryInfo and QueryStats, but the in-memory representation is still consolidated.

@pettyjamesm pettyjamesm force-pushed the fix-stage-info branch 4 times, most recently from ae5b937 to c459fa9 Compare April 5, 2022 21:18
@pettyjamesm
Copy link
Member Author

pettyjamesm commented Apr 6, 2022

@raunaqmorarka / @sopel39 - ready to merge?

@raunaqmorarka
Copy link
Member

It lgtm, will let @sopel39 decide on merging it

@pettyjamesm
Copy link
Member Author

@sopel39 - any update here?

@pettyjamesm pettyjamesm requested a review from arhimondr April 26, 2022 15:09
@pettyjamesm pettyjamesm force-pushed the fix-stage-info branch 3 times, most recently from 5e7b3b0 to af229d4 Compare April 27, 2022 17:16
Refactors QueryInfo to consolidate usages of isCompleteInfo() and
isFinalInfo() into a single method: isFinalInfo. The removed
isCompleteInfo field and method lacked a @JsonProperty annotation
and was therefore not being serialized anyway, and the isFinalInfo
method which was annotated as a @JsonProperty for serialization had
no corresponding constructor field and so also would not survive a
serialize / deserialize round-trip and required re-building a list
of StageInfo on each call, which is unnecessary overhead.

The consolidation of the two related fields ensures that the work
of determining whether a QueryInfo is "final" is done once while
producing the QueryInfo object and successfully round-trips when
serialized an deserialized.
StageInfo methods isCompleteInfo() and isFinalInfo() had identical
implementations. Since neither method is serialized in the JSON
representation, consolidating usage sites to call isFinalInfo()
is safe and allows removing the redundant isCompleteInfo() method.
Refactors the QueryInfo and QueryStats creation logic to call
getAllStages once, instead of four times as was being done before
this change. Each call to getAllStages will build an ImmutableList
with each StageInfo which can be relatively expensive for queries
that have a large number of stages.

This can be significant, since QueryInfo creation occurs on each
batch of query results fetched through the coordinator.
QueryInfo and QueryStats both embed the field "scheduled", but
QueryInfo embeds QueryStats- making the separate field redundant.
With this change, QueryStats will still contain the boolean field
and QueryInfo.isScheduled() can defer to the QueryStats value.

Unfortunately, the value must still be serialized in the QueryInfo
JSON payload since the web UI requires that BasicQueryInfo and
QueryInfo both have this field available.
Avoids unnecessary Optional boxing and lambda allocations inside
of getAllStages to reduce the overhead associated with flattening
sub-stage info into a list.
Avoids extra allocations and copies associated with Immutable
collection building removes an additional recursive child stage
traversal associated with producing StatementStats and StageStats
in the critical path of producing query results responses.
@arhimondr arhimondr merged commit 94072fc into trinodb:master Apr 28, 2022
@pettyjamesm pettyjamesm deleted the fix-stage-info branch April 28, 2022 13:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Development

Successfully merging this pull request may close these issues.

5 participants