-
Notifications
You must be signed in to change notification settings - Fork 25.6k
Graph: fix race condition in timeout #88946
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Hi @nik9000, I've created a changelog YAML for you. |
|
Pinging @elastic/es-analytics-geo (Team:Analytics) |
| ); | ||
| grb.createNextHop(timeoutQuery).addVertexRequest("people").size(100).minDocCount(1); | ||
| // 00s friends of beatles | ||
| grb.createNextHop(QueryBuilders.termQuery("decade", "00s")).addVertexRequest("people").size(100).minDocCount(1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had to move the script query causing the timeout to the hop before the last hop because we no longer check the timeout on the final response. If we get a full response from the query we return it even if we're above the timeout time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I figured that was fine because it should be quite fast.
| * connected terms in a single index. | ||
| */ | ||
| public class TransportGraphExploreAction extends HandledTransportAction<GraphExploreRequest, GraphExploreResponse> { | ||
| private static final Logger logger = LogManager.getLogger(TransportGraphExploreAction.class); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The logger this one was using was deprecated.
| private final ActionListener<GraphExploreResponse> listener; | ||
|
|
||
| private final long startTime; | ||
| private final AtomicBoolean timedOut; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was pretty unnecessary. We only ever set it one time so I just pass it in.
| listener.onResponse(buildResponse()); | ||
| listener.onResponse(buildResponse(false)); | ||
| return; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's the bit where we return the response even if we're over time. If we're done we may as well return anyway, I say.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is fine for me
| client.search(searchRequest, new ActionListener.Delegating<>(listener) { | ||
| @Override | ||
| public void onResponse(SearchResponse searchResponse) { | ||
| // System.out.println(searchResponse); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These just seemed like leftovers. I figure we'd dump these into the trace logs if we want them.
Previously `graph` checked if the request timed out, then spent some time doing work, then passed the timeout on to the next request. Over and over again. It's quite possible that the response may not have timed out for the first check but would have timed out for the second check. This manifests as the timeout being sent to the next hop being a negative number of milliseconds. We don't allow this sort of thing. This fixes this by moving the timeout check to the same spot it is read for setting the timeout on the next request - we just check if its `> 0` to find the timeouts. This does keep the request running slightly longer after it's officially timed out - but it's just long enough to prepare the next layer of request. Usually microseconds. Which should be fine. Closes elastic#55396
|
@elasticmachine run elasticsearch-ci/packaging-tests-unix-sample |
iverase
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Previously
graphchecked if the request timed out, then spent sometime doing work, then passed the timeout on to the next request. Over
and over again. It's quite possible that the response may not have timed
out for the first check but would have timed out for the second check.
This manifests as the timeout being sent to the next hop being a
negative number of milliseconds. We don't allow this sort of thing.
This fixes this by moving the timeout check to the same spot it is read
for setting the timeout on the next request - we just check if its
> 0to find the timeouts.
This does keep the request running slightly longer after it's officially
timed out - but it's just long enough to prepare the next layer of
request. Usually microseconds. Which should be fine.
Closes #55396