QL: retry SQL and EQL requests in a mixed-node (rolling upgrade) cluster#68602
Conversation
Add mixed-node tests to SQL and EQL
Pinging @elastic/es-ql (Team:QL)
costin
left a comment
Left a comment regarding a base class between the two redirect listeners - looks good otherwise.
Thanks for the extensive tests!
@@ -0,0 +1,66 @@
apply plugin: 'elasticsearch.testclusters'

private List<String> getSequencesBulkEntries() {
How about externalizing this and reading each entry line by line?
 * the resulting ES document as a field.
 */
public class QlSourceBuilder {
    public static final Version FIELDS_API_INTRODUCTION_VERSION = Version.V_7_10_0;
The name is not very clear - how about USE_FIELD_API_VERSION or FIELD_API_USAGE_VERSION?
The name is not very clear.
It'd be great if we could "standardise" on a format, given that these version-introduction constants will only multiply (nanos, fields API, unsigned long, arrays, plus more to come).
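For context, such a constant is typically used as a version gate when building the request. A minimal sketch of the kind of check involved (the method and variable names are illustrative, not the actual QlSourceBuilder code):

```java
import org.elasticsearch.Version;

// Illustrative version gate: only emit the "fields" section of the _search request when
// the relevant node version is at least the version that introduced the fields API;
// otherwise fall back to the older source/docvalue extraction. Method name is made up.
boolean useFieldsApi(Version nodeVersion) {
    return nodeVersion.onOrAfter(QlSourceBuilder.FIELDS_API_INTRODUCTION_VERSION);
}
```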
wrap(p -> listener.onResponse(createResponseWithSchema(request, p)), e -> {
    // the search request will likely run on nodes with different versions of ES
    // we will retry on a node with an older version that should generate a backwards compatible _search request
    if (e instanceof SearchPhaseExecutionException
        && ((SearchPhaseExecutionException) e).getCause() instanceof VersionMismatchException) {

        SearchPhaseExecutionException spee = (SearchPhaseExecutionException) e;
        if (log.isTraceEnabled()) {
            log.trace("Caught exception type [{}] with cause [{}].", e.getClass().getName(), e.getCause());
        }
        DiscoveryNode localNode = clusterService.state().nodes().getLocalNode();
        DiscoveryNode candidateNode = null;
        for (DiscoveryNode node : clusterService.state().nodes()) {
            // find the first node that's older than the current node
            if (node != localNode && node.getVersion().before(localNode.getVersion())) {
                candidateNode = node;
                break;
            }
        }
        if (candidateNode != null) {
This class and the one in EQL can be simplified by moving the common code into a base class in QL.
The exception check, the DiscoveryNode selection, the logging and the retry are all the same.
The only differences that I can see are the transportService call and the planExecutor call, which can be passed in as a Runnable or Function in case a property needs to be passed in.
I've moved the common code to a separate method in QL.
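For reference, the shared helper could look something along the lines of the sketch below. The method name matches the executeRequestWithRetryAttempt call visible later in the diff, but the signature, parameters and the VersionMismatchException import path are assumptions, and the "no candidate left, re-run the original request" branch is omitted.

```java
import java.util.function.Consumer;

import org.apache.logging.log4j.Logger;
import org.elasticsearch.action.search.SearchPhaseExecutionException;
import org.elasticsearch.action.search.VersionMismatchException; // package assumed for this sketch
import org.elasticsearch.cluster.node.DiscoveryNode;
import org.elasticsearch.cluster.service.ClusterService;

// Sketch only: illustrates where the duplicated SQL/EQL retry logic could live,
// not the actual QL utility.
final class RetryUtils {

    static void executeRequestWithRetryAttempt(ClusterService clusterService, Logger log,
            Consumer<Exception> onFailure,              // fail the original listener
            Consumer<Consumer<Exception>> queryRunner,  // runs the plan, reporting errors to the passed-in consumer
            Consumer<DiscoveryNode> retryOnOlderNode) { // re-sends the original request over transport

        queryRunner.accept(e -> {
            if (e instanceof SearchPhaseExecutionException && e.getCause() instanceof VersionMismatchException) {
                if (log.isTraceEnabled()) {
                    log.trace("Caught exception type [{}] with cause [{}].", e.getClass().getName(), e.getCause());
                }
                DiscoveryNode localNode = clusterService.state().nodes().getLocalNode();
                DiscoveryNode candidateNode = null;
                for (DiscoveryNode node : clusterService.state().nodes()) {
                    // pick the first node that is older than the local (coordinating) node
                    if (node.equals(localNode) == false && node.getVersion().before(localNode.getVersion())) {
                        candidateNode = node;
                        break;
                    }
                }
                if (candidateNode != null) {
                    retryOnOlderNode.accept(candidateNode);
                    return;
                }
            }
            onFailure.accept(e);
        });
    }
}
```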
bpintea
left a comment
Nice. Only left some small comments and a question on the retrying logic.
        SearchSourceBuilder source,
        boolean includeFrozen,
        String... indices) {
    return client.prepareSearch(indices)
The client function argument can now be removed.
Nicely spotted. Removed.
    listener::onFailure));
Holder<Boolean> retrySecondTime = new Holder<Boolean>(false);
planExecutor.eql(cfg, request.query(), params, wrap(r -> listener.onResponse(createResponse(r, task.getExecutionId())), e -> {
    // the search request will likely run on nodes with different versions of ES
Is this actually true? It is possible, but "likely"?
I think I put "likely" there because the actual search might run on a subset of the shards targeted at the start of the search process. There is an initial, very quick phase in the search process called canMatch that evaluates whether a request should actually be executed on a shard or not, based on a timestamp range query. If I am not mistaken, this canMatch phase is also involved in this version-mismatch scenario.
canMatch can quickly "discard" some shards, so the actual request truly runs on a subset of them. If you want to read more about this, I think #25658 is the initial PR adding it, and there are some other good resources here: https://stackoverflow.com/a/64693782/3498062
Thanks. I think what threw me off is the future tense, in a failure handler :-).
I guess "the search request likely ran on nodes with different versions of ES" might be clearer. Anyways, not super relevant.
    // find the first node that's older than the current node
    if (Objects.equals(node, localNode) == false && node.getVersion().before(localNode.getVersion())) {
        candidateNode = node;
        break;
    }
I guess normally the difference should be the same for all nodes (if not null), but would it not make sense to pick the oldest node, to potentially prevent another redirection?
There's a longer discussion to be had about the improvements here... I would keep the algorithm as is for the moment and, maybe, improve it in future versions if we hit bumps along the way. I'm not sure we support rolling upgrades across multiple versions (more than two). Our documentation says to upgrade nodes one by one until all are upgraded, but this assumes upgrading from version X to Y; there is no Z in the story.
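For what it's worth, selecting the oldest node instead of the first older one would be a small change; a sketch (not part of this PR, method name made up):

```java
import org.elasticsearch.cluster.node.DiscoveryNode;
import org.elasticsearch.cluster.service.ClusterService;

// Sketch only: pick the *oldest* node that is older than the local node,
// instead of the first older one encountered.
static DiscoveryNode oldestOlderNode(ClusterService clusterService) {
    DiscoveryNode localNode = clusterService.state().nodes().getLocalNode();
    DiscoveryNode candidate = null;
    for (DiscoveryNode node : clusterService.state().nodes()) {
        if (node.equals(localNode) == false
                && node.getVersion().before(localNode.getVersion())
                && (candidate == null || node.getVersion().before(candidate.getVersion()))) {
            candidate = node; // keep the lowest version seen so far
        }
    }
    return candidate; // null when no older node is present
}
```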
}));
if (retrySecondTime.get()) {
    if (log.isTraceEnabled()) {
        log.trace("No candidate node found, likely all were upgraded in the meantime. Re-trying the original request.");
I was wondering about the choice of trace logging here (and above) vs., for instance, debug.
(Pico-nit: the repetitive part of the message could maybe simply be extracted as a comment; as a stand-alone message - i.e. without knowing what's going on, which is supposedly the reason for adding the explanation - it is not quite clear what the "candidate" is.)
I based my decision to use trace on the likelihood of this logging actually being used - hopefully rarely to never. Willing to change it for a good reason.
Regarding the trace messages, I think the first one (where retrySecondTime is set) is not necessary.
        listener.onFailure(e);
    }
}));
if (retrySecondTime.get()) {
If there is a higher number of nodes in the cluster, the "old" ones are the most likely to be taken offline. So I guess this means there's a higher chance the request will fail when sent to one "old" node (if there's a race between querying for the versions and sending the request).
Would it make sense to reattempt the transport redirect for as long as there are old nodes in the cluster?
For the moment, I'd like to keep it as is and try to improve the algorithm in a future PR. There are, for sure, things that can be improved.
final List<String> bulkEntries = getSequencesBulkEntries();
StringBuilder builder = new StringBuilder();
for (int i = 1; i < 16; i++) {
I've refactored this to read the bulk entries from a file as per @costin's suggestion.
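A minimal sketch of what reading the bulk entries from a test resource could look like (class name, resource name and helper are hypothetical, not the PR's exact code):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical test helper: reads each bulk entry (one JSON line per entry)
// from a classpath resource instead of building the entries in code.
public class BulkEntriesReader {
    public static List<String> readBulkEntries(String resourceName) throws IOException {
        List<String> entries = new ArrayList<>();
        try (InputStream in = BulkEntriesReader.class.getResourceAsStream(resourceName);
             BufferedReader reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                entries.add(line);
            }
        }
        return entries;
    }
}
```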
@elasticmachine run elasticsearch-ci/default-distro
matriv
left a comment
LGTM, nice stuff!
Left some very minor comments.
TransportEqlSearchAction.operation(planExecutor, task, new EqlSearchRequest().query("foo where blah"), "", "", "node_id",
    new ActionListener<>() {
TransportEqlSearchAction.operation(planExecutor, task, new EqlSearchRequest().query("foo where blah"), "",
    mock(TransportService.class), mockClusterService, new ActionListener<>() {
minor: since mock(TransportService.class) is reused, you could also assign it to a var.
// we will retry on a node with an older version that should generate a backwards compatible _search request
if (e instanceof SearchPhaseExecutionException
    && ((SearchPhaseExecutionException) e).getCause() instanceof VersionMismatchException) {
    if (log.isTraceEnabled()) {
Not familiar with the strategy here, just double-checking whether it should maybe be debug instead?
@@ -0,0 +1,59 @@
          "properties": {
minor: why are these entries indented like this? (many whitespaces from the beginning of the line)
@@ -0,0 +1,35 @@
          "properties": {
Same here, why the leading whitespaces?
…stic/elasticsearch into eql_sql_request_retry
 * the resulting ES document as a field.
 */
public class QlSourceBuilder {
    public static final Version FIELDS_API_USAGE_VERSION = Version.V_7_10_0;
I personally find this naming less evocative ("is the fields api only used in that version?"), but not a biggie.
The upside is that these are internal constants that we can rename in the future if we find better names.
Any suggestions for better naming?
FIELDS_API_MIGRATION_VERSION, MIGRATE_TO_FIELD_API_VERSION, SWITCH_OVER/TO_FIELDS_API_VERSION ?
@@ -0,0 +1,30 @@
{"index":{"_id":1}}

}
executeRequestWithRetryAttempt(clusterService, listener::onFailure,
    onFailure -> planExecutor.eql(cfg, request.query(), params,
        wrap(r -> listener.onResponse(createResponse(r, task.getExecutionId())), onFailure)),
The upside is that these are internal constants that we can rename in the future if we find better names.
Any suggestions for better naming?
FIELDS_API_MIGRATION_VERSION, MIGRATE_TO_FIELD_API_VERSION, SWITCH_OVER/TO_FIELDS_API_VERSION ?
True. I like this one myself. But yes, it can be done later; it's hair splitting.
* Integrate "fields" API into QL (elastic#68467)
* QL: retry SQL and EQL requests in a mixed-node (rolling upgrade) cluster (elastic#68602)
* Adapt nested fields extraction from "fields" API output to the new un-flattened structure (elastic#68745)

(cherry picked from commit ee5cc54)
These changes make use of previous work added in PR #65896 (which adds a minimum compatibility version to search requests in ES): a minimum compatibility version is set when creating a search request against ES, and the request is retried if the search turns out to be executed on at least one incompatible shard.
The retry happens on a node with an older version, the original SQL/EQL request being sent through the transport layer.
The node receiving the retried request re-parses it and generates another query DSL to send to ES.
At the moment this PR was created, the introduction of the "fields" API in QL is exactly such a change that needs this feature.
Testing happens in two new, similar qa projects, one for SQL and one for EQL.
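For illustration, the forwarding step itself, once an older candidate node has been picked, might look roughly like the sketch below. The handler wiring, the response reader and the method name are assumptions based on the standard transport API, not the exact code of this PR.

```java
import org.elasticsearch.action.ActionListener;
import org.elasticsearch.action.ActionListenerResponseHandler;
import org.elasticsearch.cluster.node.DiscoveryNode;
import org.elasticsearch.transport.TransportService;
import org.elasticsearch.xpack.eql.action.EqlSearchAction;
import org.elasticsearch.xpack.eql.action.EqlSearchRequest;
import org.elasticsearch.xpack.eql.action.EqlSearchResponse;

// Sketch: forward the original EQL request received by this node to the older candidate
// node over the transport layer. That node re-parses the request and builds a _search
// request compatible with its own version.
void retryOnOlderNode(TransportService transportService, DiscoveryNode candidateNode,
                      EqlSearchRequest request, ActionListener<EqlSearchResponse> listener) {
    transportService.sendRequest(candidateNode, EqlSearchAction.NAME, request,
        new ActionListenerResponseHandler<>(listener, EqlSearchResponse::new)); // response reader assumed
}
```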