fix(): Fix query failure logging race #27479
Merged
spershin merged 1 commit into prestodb:master on Apr 1, 2026
Summary: When a query fails, the dispatching code and the executing code race to log the query failure event. This can result in double-logging in systems that allow it, or in logging without any stats/metrics in systems that allow logging only once. We observe both behaviors in our logging systems.

The fix is to set a flag once the query has been sent for execution and, in that case, not to log from the dispatching side.

Differential Revision: D98954601
Reviewer's Guide

Adds a guard flag in LocalDispatchQuery to distinguish pre-dispatch from post-dispatch failures/cancellations, so that immediate failure events with zeroed stats are logged only if the query was never submitted for execution. Also introduces unit tests covering the race conditions and the expected logging behavior before and after dispatch.

Sequence diagram for query failure before dispatch (immediate failure event emitted)

sequenceDiagram
actor Client
participant LocalDispatchQuery
participant QueryStateMachine
participant QueryMonitor
Client->>LocalDispatchQuery: fail(throwable)
LocalDispatchQuery->>QueryStateMachine: transitionToFailed(throwable)
QueryStateMachine-->>LocalDispatchQuery: true
Note over LocalDispatchQuery,QueryStateMachine: sentForExecution is false (query not submitted)
LocalDispatchQuery->>QueryStateMachine: getBasicQueryInfo(Optional.empty)
QueryStateMachine-->>LocalDispatchQuery: BasicQueryInfo
LocalDispatchQuery->>LocalDispatchQuery: toFailure(throwable)
LocalDispatchQuery->>QueryMonitor: queryImmediateFailureEvent(BasicQueryInfo, ExecutionFailureInfo)
QueryMonitor-->>QueryMonitor: Log failure with zeroed stats

Sequence diagram for query failure after dispatch (completion event only)

sequenceDiagram
actor Client
participant LocalDispatchQuery
participant QueryStateMachine
participant SqlQueryManager
participant QueryMonitor
Client->>LocalDispatchQuery: startExecution(queryExecution, isDispatching=true)
LocalDispatchQuery->>QueryExecution: setResourceGroupQueryLimits(...)
LocalDispatchQuery->>SqlQueryManager: querySubmitter.accept(queryExecution)
SqlQueryManager-->>SqlQueryManager: Register query for execution
LocalDispatchQuery->>LocalDispatchQuery: sentForExecution = true
Client->>LocalDispatchQuery: fail(throwable)
LocalDispatchQuery->>QueryStateMachine: transitionToFailed(throwable)
QueryStateMachine-->>LocalDispatchQuery: true
Note over LocalDispatchQuery: sentForExecution is true
LocalDispatchQuery-->>QueryMonitor: (no call to queryImmediateFailureEvent)
SqlQueryManager-->>QueryMonitor: queryCompletedEvent(QueryInfoWithStats)
QueryMonitor-->>QueryMonitor: Log failure with real execution stats

Class diagram for LocalDispatchQuery logging guard changes

classDiagram
class LocalDispatchQuery {
// Fields
- SettableFuture submitted
- AtomicReference resourceGroupQueryLimits
- boolean retry
- QueryPrerequisites queryPrerequisites
- QueryMonitor queryMonitor
- Consumer querySubmitter
- volatile boolean sentForExecution
// Methods (subset related to this change)
- void startExecution(QueryExecution queryExecution, boolean isDispatching)
- void fail(Throwable throwable)
- void cancel()
}
class QueryExecution {
+ void setResourceGroupQueryLimits(ResourceGroupQueryLimits limits)
}
class QueryMonitor {
+ void queryImmediateFailureEvent(BasicQueryInfo queryInfo, ExecutionFailureInfo failureInfo)
+ void queryCompletedEvent(QueryInfo queryInfo)
}
class QueryStateMachine {
+ boolean transitionToFailed(Throwable throwable)
+ boolean transitionToCanceled()
+ BasicQueryInfo getBasicQueryInfo(Optional optionalToken)
}
class SqlQueryManager {
+ void accept(QueryExecution queryExecution)
+ void finalQueryInfoListener(QueryInfo queryInfo)
}
class BasicQueryInfo {
+ ExecutionFailureInfo getFailureInfo()
}
class ExecutionFailureInfo
class QueryInfo
class ResourceGroupQueryLimits
class QueryPrerequisites
class Optional
class Throwable
LocalDispatchQuery --> QueryExecution : uses
LocalDispatchQuery --> QueryMonitor : logs_via
LocalDispatchQuery --> QueryStateMachine : manages_state_via
LocalDispatchQuery --> SqlQueryManager : submits_via_querySubmitter
LocalDispatchQuery --> QueryPrerequisites : uses
LocalDispatchQuery --> Optional : uses
QueryMonitor --> BasicQueryInfo : parameter
QueryMonitor --> ExecutionFailureInfo : parameter
SqlQueryManager --> QueryMonitor : notifies_via_finalQueryInfoListener
BasicQueryInfo --> ExecutionFailureInfo : contains
QueryStateMachine --> BasicQueryInfo : returns
QueryStateMachine --> Throwable : parameter
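
The guard described in the Reviewer's Guide can be sketched in isolation. This is a minimal, single-threaded illustration and not the merged code: `DispatchGuardSketch`, the `events` list, the `failed` field (a stand-in for `QueryStateMachine.transitionToFailed`), and the `String` query handle are all assumptions made for the example. Only the `sentForExecution` flag and its placement after `querySubmitter.accept` mirror the PR.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical, simplified stand-in for LocalDispatchQuery.
final class DispatchGuardSketch
{
    // Records emitted events so the behavior can be inspected (assumption for this sketch).
    final List<String> events = new ArrayList<>();

    private final Consumer<String> querySubmitter;
    private volatile boolean sentForExecution;
    private boolean failed; // stand-in for the QueryStateMachine state

    DispatchGuardSketch(Consumer<String> querySubmitter)
    {
        this.querySubmitter = querySubmitter;
    }

    void startExecution(String queryExecution)
    {
        try {
            querySubmitter.accept(queryExecution);
            // Mark only after successful submission, as in the PR.
            sentForExecution = true;
        }
        catch (RuntimeException e) {
            // The execution side never saw the query, so fail() must still log.
            fail(e);
        }
    }

    synchronized void fail(Throwable throwable)
    {
        if (failed) {
            return; // models transitionToFailed(...) returning false
        }
        failed = true;
        if (!sentForExecution) {
            // Models queryMonitor.queryImmediateFailureEvent(...)
            events.add("immediateFailure");
        }
    }
}
```

A failure before `startExecution`, or a submitter that throws, records one `immediateFailure` entry; a failure after a successful submission records none, leaving the completion event to the execution side.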
Hey - I've found 1 issue, and left some high level feedback:
- There is still a small race window between `querySubmitter.accept(queryExecution)` completing and `sentForExecution` being set to `true`, during which `fail()`/`cancel()` can still emit a duplicate immediate-failure event; consider setting `sentForExecution` before calling the submitter and clearing it on failure, or otherwise tying this guard to a state already tracked by the `QueryStateMachine`.
- Since `sentForExecution` is guarding correctness against races, you might want to make the lifecycle more explicit (e.g., using an `AtomicBoolean` with compare-and-set or documenting all threads that can write it) to avoid future changes accidentally introducing additional racy writes.
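
One way to make the lifecycle explicit, in the spirit of the compare-and-set suggestion above, is to fold the flag into a single atomic stage so that exactly one code path can claim the right to log. The sketch below is an assumption-laden illustration, not what the PR merged: `SubmissionStageSketch`, the `Stage` enum, the event counter, and the `String` query handle are hypothetical names introduced here.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Consumer;

// Hypothetical sketch: one atomic stage decides who may log the immediate failure.
final class SubmissionStageSketch
{
    enum Stage { NOT_SUBMITTED, SUBMITTED, FAILED_BEFORE_EXECUTION }

    private final AtomicReference<Stage> stage = new AtomicReference<>(Stage.NOT_SUBMITTED);
    private final Consumer<String> querySubmitter;

    // Counts immediate-failure events (stand-in for QueryMonitor).
    final AtomicInteger immediateFailureEvents = new AtomicInteger();

    SubmissionStageSketch(Consumer<String> querySubmitter)
    {
        this.querySubmitter = querySubmitter;
    }

    void startExecution(String queryExecution)
    {
        // Claim the submission before calling the submitter, closing the window.
        if (!stage.compareAndSet(Stage.NOT_SUBMITTED, Stage.SUBMITTED)) {
            return; // the query already failed before dispatch
        }
        try {
            querySubmitter.accept(queryExecution);
        }
        catch (RuntimeException e) {
            // Submission did not complete, so the dispatch side logs.
            if (stage.compareAndSet(Stage.SUBMITTED, Stage.FAILED_BEFORE_EXECUTION)) {
                immediateFailureEvents.incrementAndGet();
            }
        }
    }

    void fail(Throwable throwable)
    {
        // Only a failure that wins the race against submission logs here.
        if (stage.compareAndSet(Stage.NOT_SUBMITTED, Stage.FAILED_BEFORE_EXECUTION)) {
            immediateFailureEvents.incrementAndGet();
        }
    }
}
```

With this shape, a `fail()` that lands while the submitter is running sees `SUBMITTED` and stays silent, and a submitter that throws still produces exactly one immediate-failure event from the `catch` branch. Whether this fits the real `QueryStateMachine` transitions would need to be checked against the actual code.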
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- There is still a small race window between `querySubmitter.accept(queryExecution)` completing and `sentForExecution` being set to `true`, during which `fail()`/`cancel()` can still emit a duplicate immediate-failure event; consider setting `sentForExecution` before calling the submitter and clearing it on failure, or otherwise tying this guard to a state already tracked by the `QueryStateMachine`.
- Since `sentForExecution` is guarding correctness against races, you might want to make the lifecycle more explicit (e.g., using an `AtomicBoolean` with compare-and-set or documenting all threads that can write it) to avoid future changes accidentally introducing additional racy writes.
## Individual Comments
### Comment 1
<location path="presto-main/src/main/java/com/facebook/presto/dispatcher/LocalDispatchQuery.java" line_range="218-221" />
<code_context>
try {
resourceGroupQueryLimits.get().ifPresent(queryExecution::setResourceGroupQueryLimits);
querySubmitter.accept(queryExecution);
+ // Mark only after successful submission. If querySubmitter throws,
+ // SqlQueryManager won't have the query, so we still need
+ // queryImmediateFailureEvent from fail() below.
+ sentForExecution = true;
}
catch (Throwable t) {
</code_context>
<issue_to_address>
**issue (bug_risk):** There is a race where `fail()`/`cancel()` can see `sentForExecution == false` after a successful `querySubmitter.accept`, causing duplicate events.
Because `sentForExecution` is set *after* `querySubmitter.accept`, `fail()`/`cancel()` can run in between and still see `sentForExecution == false` even though the query was successfully submitted. They will then emit `queryImmediateFailureEvent`, and later `SqlQueryManager` will emit its normal completion event, causing duplicates.
To avoid this race, set the flag before submission and clear it only if submission fails, e.g.:
```java
sentForExecution = true;
try {
querySubmitter.accept(queryExecution);
}
catch (Throwable t) {
sentForExecution = false; // submission didn’t complete successfully
// existing catch handling
}
```
</issue_to_address>
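
The window described in this comment can be reproduced deterministically, without threads, by injecting a callback at the point where another thread could interleave. The following is a rough sketch under stated assumptions: `RaceWindowSketch`, the `markBeforeSubmit` switch, the `interleaved` callback, and the event counter are hypothetical and exist only to compare the two orderings of the `sentForExecution` write.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch comparing "flag after submit" with "flag before submit".
final class RaceWindowSketch
{
    private final boolean markBeforeSubmit;
    private volatile boolean sentForExecution;

    // Counts immediate-failure events (stand-in for QueryMonitor).
    final AtomicInteger immediateFailureEvents = new AtomicInteger();

    RaceWindowSketch(boolean markBeforeSubmit)
    {
        this.markBeforeSubmit = markBeforeSubmit;
    }

    // 'interleaved' runs right after a successful submit, standing in for
    // another thread calling fail()/cancel() inside the window.
    void startExecution(Runnable submitter, Runnable interleaved)
    {
        if (markBeforeSubmit) {
            sentForExecution = true;
        }
        try {
            submitter.run();
        }
        catch (RuntimeException e) {
            sentForExecution = false; // submission did not complete successfully
            fail(e);
            return;
        }
        interleaved.run();
        sentForExecution = true; // the "flag after submit" ordering writes here
    }

    void fail(Throwable throwable)
    {
        if (!sentForExecution) {
            immediateFailureEvents.incrementAndGet();
        }
    }
}
```

With `markBeforeSubmit` set to `false`, a `fail()` passed as the interleaved action increments the counter even though submission succeeded, which is the duplicate the comment warns about. With `true`, the same interleaving leaves the counter at zero, and a throwing submitter still yields one event.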
arhimondr approved these changes on Mar 31, 2026
bibith4 pushed a commit to bibith4/presto that referenced this pull request on Apr 1, 2026
Summary:
When a query failed, there is a race between dispatching and executing code to log the query failure event. That can result in double-logging in the systems that allow it or logging without any stats/metrics in the systems which allow logging only once. We observe both behaviors in our logging systems.

The fix is to set a flag that the query went for execution and not log from the dispatching side in this case.

Differential Revision: D98954601

```
== NO RELEASE NOTE ==
```

## Summary by Sourcery

Prevent duplicate zero-stats query failure events for dispatched queries by tracking execution submission and guarding immediate-failure logging.

Bug Fixes:
- Avoid emitting queryImmediateFailureEvent after a query has been submitted for execution, preventing duplicate or zeroed-stat completion events on failures and cancellations.

Tests:
- Add unit coverage to ensure fail() and cancel() do not emit immediate failure events after dispatch, while pre-dispatch failures still emit them.

Co-authored-by: Sergey Pershin <spershin@meta.com>
Summary by Sourcery
Prevent duplicate zero-stats query failure events for dispatched queries by tracking execution submission and guarding immediate-failure logging.
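Bug Fixes:
- Avoid emitting queryImmediateFailureEvent after a query has been submitted for execution, preventing duplicate or zeroed-stat completion events on failures and cancellations.

Tests:
- Add unit coverage to ensure fail() and cancel() do not emit immediate failure events after dispatch, while pre-dispatch failures still emit them.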