Skip to content

Conversation

@piotrrzysko
Copy link
Member

@piotrrzysko piotrrzysko commented Nov 27, 2025

Description

Attempt to fix #27082.

Record heartbeats more frequently to prevent query timeouts during large
result sets or when the query state changes more often than the heartbeat
interval.

Additional context and related issues

Release notes

(x) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
( ) Release notes are required, with the following suggested text:

## Section
* Fix some things. ({issue}`issuenumber`)

@cla-bot cla-bot bot added the cla-signed label Nov 27, 2025
@piotrrzysko piotrrzysko force-pushed the fixTestDirectTrinoClient branch from 05ca898 to f33c296 Compare November 27, 2025 15:59
@findepi findepi added the no-release-notes This pull request does not require release notes entry label Nov 27, 2025
@findepi
Copy link
Member

findepi commented Nov 27, 2025

As an experiment, I reverted the bugfix that this test is validating, and the test failed, so the intended behavior is still covered.

Thanks for doing that!

I understand this refers to #26371 where the test was added

that PR desc:

When a query being executed by DirectTrinoClient takes longer than the query.client.timeout, the query is then canceled. This client is used in some cases when executing queries from within the coordinator

this PR desc

Considering that processing the last page took 0.4s, it’s easy to imagine that sometimes the total interval between sending heartbeats exceeds 1s.

This test uses blackhole.test_schema.slow_test_table which produces pages slowly, but actually any other query/connector can produces pages slowly. for example aggregation will read all input before producing a page.
If pages coming slowly to DirectTrinoClient make query cancelled and abandoned, then it sounds like the bug is still there?

@findepi findepi requested a review from lukasz-stec November 27, 2025 16:52
@piotrrzysko
Copy link
Member Author

piotrrzysko commented Nov 27, 2025

Polling pages is non-blocking (if there are no pages, pollPage returns null immediately):

for (Slice serializedPage = exchangeClient.pollPage(); serializedPage != null; serializedPage = exchangeClient.pollPage()) {

The timings I attached mainly reflect deserialization:

Page page = pageDeserializer.deserialize(serializedPage);

My understanding is that if no new page is available, DirectTrinoClient immediately starts waiting on this future: https://github.com/trinodb/trino/blob/master/core/trino-main/src/main/java/io/trino/client/direct/DirectTrinoClient.java#L12. If it doesn’t complete within query.client.timeout / 2, it sends a heartbeat and this repeats until the future completes.

The timeout is set to a very low value (1s), which is fine for tests, but in production it should be higher (the default is 5 minutes) to avoid getting close to the deserialization time.

Copy link
Member

@lukasz-stec lukasz-stec left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. As an alternative, we could increase both timeouts, but that would make the test run longer, so this seems better.

public void testDirectTrinoClientLongQuery()
{
queryRunner.execute(TEST_SESSION, "SELECT * FROM blackhole.test_schema.slow_test_table");
queryRunner.execute(TEST_SESSION, "EXPLAIN ANALYZE SELECT * FROM blackhole.test_schema.slow_test_table");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would SELECT count(*) FROM blackhole.test_schema.slow_test_table have the same effect?

@piotrrzysko piotrrzysko force-pushed the fixTestDirectTrinoClient branch from f33c296 to 5437a88 Compare November 28, 2025 13:46
@piotrrzysko
Copy link
Member Author

@findepi I applied what we discussed offline. The test didn’t fail locally after running it in a loop for an hour - previously it would fail sporadically.

Record heartbeats more frequently to prevent query timeouts during large
result sets or when the query state changes more often than the heartbeat
interval.
@findepi findepi force-pushed the fixTestDirectTrinoClient branch from 5437a88 to b826007 Compare November 28, 2025 14:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cla-signed no-release-notes This pull request does not require release notes entry

Development

Successfully merging this pull request may close these issues.

Flaky TestDirectTrinoClient.testDirectTrinoClientLongQuery

3 participants