Conversation

@rudolf rudolf commented Nov 21, 2019

Summary

Fixes #49785 #14480

Testing notes:

  1. When starting Kibana against a cluster with an unsupported ES node, Kibana should log:

    Waiting until all Elasticsearch nodes are compatible with Kibana before starting saved objects migrations...

    It should not start saved object migrations until all the nodes are compatible, at which point it should log:

    Starting saved objects migrations

  2. If Kibana has successfully started with a compatible ES cluster, but then an incompatible ES node joins the cluster, Kibana's "elasticsearch" plugin should go into a red state and reloading Kibana in a browser should render the status page.

Release note:

This fix addresses a regression where Kibana would not check that all Elasticsearch nodes are compatible before starting Saved Object migrations.
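The gating behaviour described above can be sketched without Kibana's RxJS machinery as a poll-until-compatible loop. This is a minimal sketch: `CompatibilityCheck`, `waitUntilCompatible`, and the polling interval are illustrative names, not Kibana's actual API.

```typescript
// Hypothetical stand-in for Kibana's esNodesCompatibility$ poller:
// a check that reports whether every ES node is compatible.
type CompatibilityCheck = () => Promise<{ isCompatible: boolean; message?: string }>;

// Block until the cluster reports all nodes compatible, then allow
// saved object migrations to proceed. Mirrors the PR's behaviour of
// waiting indefinitely rather than failing fast.
async function waitUntilCompatible(
  check: CompatibilityCheck,
  intervalMs: number
): Promise<void> {
  console.log(
    'Waiting until all Elasticsearch nodes are compatible with Kibana before starting saved objects migrations...'
  );
  for (;;) {
    const { isCompatible } = await check();
    if (isCompatible) break;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  console.log('Starting saved objects migrations');
}
```

In the PR itself this wait is expressed as an RxJS pipeline over `esNodesCompatibility$` rather than an explicit loop, but the observable behaviour is the same: startup blocks until the first compatible status.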

Checklist

Use strikethroughs to remove checklist items you don't feel are applicable to this PR.

For maintainers

@rudolf rudolf added Team:Core Platform Core services: plugins, logging, config, saved objects, http, ES client, i18n, etc Feature:New Platform v8.0.0 v7.5.0 v7.6.0 labels Nov 21, 2019
@elasticmachine

Pinging @elastic/kibana-platform (Team:Platform)

@elasticmachine

💔 Build Failed

@pgayvallet pgayvallet left a comment

I saw too late that this was only a draft! I've kept only my nits on the current progress.

@rudolf rudolf removed the v7.5.0 label Nov 25, 2019
@joshdover

@rudolf What's the status here?

@rudolf rudolf added v7.7.0 and removed v7.6.0 labels Jan 15, 2020
@rudolf rudolf force-pushed the healthcheck-before-migrations branch from 5c91f29 to 07829f8 Compare January 22, 2020 15:19
@rudolf rudolf force-pushed the healthcheck-before-migrations branch from 07829f8 to bd49618 Compare January 24, 2020 14:21
@rudolf rudolf force-pushed the healthcheck-before-migrations branch from bd49618 to e2d6157 Compare January 24, 2020 14:53
this.logger.debug(
  'Waiting until all Elasticsearch nodes are compatible with Kibana before starting saved objects migrations...'
);
await this.setupDeps!.elasticsearch.esNodesCompatibility$.pipe(
Contributor Author (@rudolf):

This behaviour isn't 100% the same as what it was in legacy. In legacy we would start the status service so even though migrations wouldn't run, there would be a running server which showed that the Elasticsearch plugin was red with the reason which helps surface the underlying problem. Once we have a status service in NP we should aim to create similar behaviour.

@rudolf rudolf marked this pull request as ready for review January 26, 2020 21:18
@rudolf rudolf requested a review from a team as a code owner January 26, 2020 21:18
@rudolf rudolf added the v7.6.1 label Jan 26, 2020
@pgayvallet pgayvallet left a comment

#49785 states

Although the reasons for abandoning the health check still stand, we will have to keep polling to do the version check since new Elasticsearch nodes can join an existing cluster after Kibana has started up.

In the current PR we only wait once for ES to be ready before triggering some actions; further state changes do nothing. Do we know what we plan to do in a scenario like:

  • red -> green (trigger start of SO + legacy's waitUntilReady) -> red (atm do nothing) -> green (atm do nothing)

Comment on lines 24 to 36
esNodesCompatibility$.subscribe(({ isCompatible, message, kibanaVersion, warningNodes }) => {
  if (!isCompatible) {
    esPlugin.status.red(message);
  } else {
    if (message && message.length > 0) {
      logWithMetadata(['warning'], message, {
        kibanaVersion,
        nodes: warningNodes,
      });
    }
    esPlugin.status.green('Ready');
    resolve();
  }
Contributor:

Should we unsubscribe after the first resolve call to avoid wrongly recalling resolve in case of green->red->green ?

Contributor Author (@rudolf):

We want to keep updating the status so we need the subscription. Although resolve() should only be called once, calling it multiple times is a no-op so it won't cause any problems.
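The point about repeated resolve() calls being a no-op can be verified in isolation. This is standard Promise semantics, not Kibana code; `makeGate` is a hypothetical helper.

```typescript
// A promise settles exactly once; later resolve() calls are silently ignored,
// so a long-lived subscription that calls resolve() on every green status is safe.
function makeGate(): { promise: Promise<string>; resolve: (value: string) => void } {
  let resolve!: (value: string) => void;
  const promise = new Promise<string>((r) => {
    resolve = r;
  });
  return { promise, resolve };
}

const gate = makeGate();
gate.resolve('first green');
gate.resolve('second green'); // no-op: the promise already settled
gate.promise.then((value) => console.log(value)); // logs "first green"
```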

Comment on lines +294 to +297
await this.setupDeps!.elasticsearch.esNodesCompatibility$.pipe(
  filter(nodes => nodes.isCompatible),
  take(1)
).toPromise();
Contributor:

  1. Except in the legacy code you adapted, we are not displaying any info message for the user about the fact that we are waiting (maybe indefinitely) for ES to be ready?
  2. Maybe we should add a timeout and throw a fatal after some time? Or are we expecting Kibana to hang indefinitely waiting for this condition?
  3. Should this check be done at a higher level (thinking in the Server)? It seems to me that waiting for ES to be ready is higher responsibility than the SOService should handle.

Contributor Author (@rudolf):

  1. Except in the legacy code you adapted, we are not displaying any info message for the user about the fact that we are waiting (maybe indefinitely) for ES to be ready?

I've changed the log message to an info to indicate that we're waiting for ES and when we're starting migrations.

  2. Maybe we should add a timeout and throw a fatal after some time? Or are we expecting Kibana to hang indefinitely waiting for this condition?

The existing behaviour is to wait indefinitely. It could take a day before a faulty cluster is fixed; in such a case I think it's nice if Kibana just starts working again automatically.

  3. Should this check be done at a higher level (thinking in the Server)? It seems to me that waiting for ES to be ready is a higher responsibility than the SOService should handle.

I don't have a strong opinion, but I think if the SO Service has some dependency on an external condition then the logic to wait for that condition belongs in the SO Service. This is a minor point, but when it comes to the logging tags it might be easier to see that these logs are related if they all have the same tags, rather than some being tagged server and others savedobjects-service.

Contributor:

I would say it might be good to repeat this message on an interval but I wouldn't consider that a blocker to this PR.
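The suggestion above (repeating the waiting message on an interval, which did not land in this PR) could be sketched as follows; `waitWithRepeatedLog` is a hypothetical helper, not Kibana's API.

```typescript
// Re-log a "still waiting" message every `everyMs` until the awaited
// promise settles, then stop the timer.
async function waitWithRepeatedLog<T>(
  wait: Promise<T>,
  everyMs: number,
  log: (message: string) => void
): Promise<T> {
  const timer = setInterval(
    () => log('Still waiting until all Elasticsearch nodes are compatible with Kibana...'),
    everyMs
  );
  try {
    return await wait;
  } finally {
    // Always stop re-logging, whether the wait resolved or rejected.
    clearInterval(timer);
  }
}
```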

Contributor:

I don't have a strong opinion, but I think if the SO Service has some dependency on an external condition then the logic to wait for that condition belongs in the SO Service. This is a minor point, but when it comes to the logging tags it might be easier to see that these logs are related if they all have the same tags, rather than some being tagged server and others savedobjects-service.

I think it's fine we put this in SO service until/if there are other Core services that require this as well.

@LeeDr LeeDr added the blocker label Jan 29, 2020
@legrego legrego left a comment

Spaces changes LGTM - code review only

@joshdover joshdover left a comment

LGTM after a couple of minor changes (and green CI)

const { http, elasticsearch } = await root.setup();

// Mock esNodesCompatibility$ to prevent `root.start()` from blocking on ES version check
elasticsearch.esNodesCompatibility$ = elasticsearchServiceMock.createInternalSetup().esNodesCompatibility$;
Contributor:

Any reason not to do this in KbnTestUtils? Seems like this could break other tests in confusing ways in the future

Contributor Author (@rudolf):

Yes, I find this rather ugly. KbnTestUtils is sometimes used with an ES server, in which case we don't need to fake esNodesCompatibility$, but plugins should never have to run tests against an incompatible ES node, so it's probably safe to always skip this during testing.

Contributor Author (@rudolf):

To avoid a potentially long build/test cycle from changing integration tests, I'd rather do this in a separate PR.

Comment on lines +294 to +297
await this.setupDeps!.elasticsearch.esNodesCompatibility$.pipe(
  filter(nodes => nodes.isCompatible),
  take(1)
).toPromise();
Contributor:

I don't have a strong opinion, but I think if the SO Service has some dependency on an external condition then the logic to wait for that condition belongs in the SO Service. This is a minor point, but when it comes to the logging tags it might be easier to see that these logs are related if they all have the same tags, rather than some being tagged server and others savedobjects-service.

I think it's fine we put this in SO service until/if there are other Core services that require this as well.


rudolf commented Jan 31, 2020

After further testing I realised there were two incorrect behaviours:

  1. pollEsNodesVersion would not start polling immediately; the first poll was scheduled to happen only after esVersionCheckInterval. For a long interval like 30s this delays Kibana's startup by an additional 30s.
  2. Because pollEsNodesVersion used a switchMap, there's a risk that we'll bombard ES with queries if the nodes.info response time > esVersionCheckInterval. Since this is an expensive API call, there's a risk that we'll suddenly make a slow cluster much slower. I've changed the implementation to use an exhaustMap so we won't schedule new requests until the previous request has resolved.
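The switchMap-vs-exhaustMap concern in point 2 can be illustrated without RxJS: an exhaust-style poller ignores scheduled ticks while a request is still in flight, so a slow cluster never accumulates overlapping nodes.info calls. This is a minimal sketch; `pollExhaust` and its parameters are illustrative, not Kibana's implementation.

```typescript
// Poll on a fixed interval, but skip ticks while the previous request is
// still pending (exhaustMap semantics). Resolves with how many requests
// actually started during the window.
function pollExhaust(
  request: () => Promise<void>,
  intervalMs: number,
  stopAfterMs: number
): Promise<number> {
  return new Promise((resolve) => {
    let inFlight = false;
    let started = 0;
    const timer = setInterval(() => {
      if (inFlight) return; // exhaust: drop this tick, a request is pending
      inFlight = true;
      started++;
      request().finally(() => {
        inFlight = false;
      });
    }, intervalMs);
    setTimeout(() => {
      clearInterval(timer);
      resolve(started);
    }, stopAfterMs);
  });
}
```

With switchMap-like behaviour every tick would start a new request regardless of whether the previous one had resolved; the exhaust variant bounds the number of in-flight requests to one.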

@kibanamachine

💚 Build Succeeded


@pgayvallet pgayvallet left a comment

LGTM

let root: Root;
beforeAll(async () => {
-  root = kbnTestServer.createRoot();
+  root = kbnTestServer.createRoot({ migrations: { skip: true } });
Contributor:

NIT: It seems you adapted every call to createRoot to add this. Should we set migrations: { skip: true } as a default in kbnTestServer.createRoot ?

Contributor Author (@rudolf):

Some tests run with esArchiver, and those need migrations to be applied. Ideally we shouldn't disable migrations; instead we should disable the ES version check itself. There is elasticsearch.ignoreVersionMismatch, but it's only available in development and our integration tests run in production mode. We could just make this option available in production, but I think it warrants a bigger discussion, so I created #56505

@tylersmalley tylersmalley merged commit f1068cd into elastic:master Jan 31, 2020
tylersmalley pushed a commit to tylersmalley/kibana that referenced this pull request Jan 31, 2020
…stic#51311)

* Convert parts of Elasticsearch version check to ts
* Move ES version check to NP
* Improve types
* Wait till for compatible ES nodes before SO migrations
* Don't wait for ES compatibility if skipMigrations=true
* Legacy Elasticsearch plugin integration test
* Make ES compatibility check and migrations logging more visible
* Test for isCompatible=false when ES version check throws
* Start pollEsNodesVersion immediately
* Refactor pollEsNodesVersion
@tylersmalley
Copy link
Contributor

@rudolf I have backported to 7.x, but 7.6 is a bit more challenging as the tests rely on saved_objects_service.test.mocks which was added here #55012, which relies on #55156 - both not present in 7.6. I think it would be best for you to handle the conflict resolution here to ensure there are no errors.

rudolf added a commit that referenced this pull request Feb 3, 2020
) (#56600)

* Convert parts of Elasticsearch version check to ts
* Move ES version check to NP
* Improve types
* Wait till for compatible ES nodes before SO migrations
* Don't wait for ES compatibility if skipMigrations=true
* Legacy Elasticsearch plugin integration test
* Make ES compatibility check and migrations logging more visible
* Test for isCompatible=false when ES version check throws
* Start pollEsNodesVersion immediately
* Refactor pollEsNodesVersion
rudolf added a commit that referenced this pull request Feb 3, 2020
) (#56629)

* Convert parts of Elasticsearch version check to ts
* Move ES version check to NP
* Improve types
* Wait till for compatible ES nodes before SO migrations
* Don't wait for ES compatibility if skipMigrations=true
* Legacy Elasticsearch plugin integration test
* Make ES compatibility check and migrations logging more visible
* Test for isCompatible=false when ES version check throws
* Start pollEsNodesVersion immediately
* Refactor pollEsNodesVersion
@rudolf rudolf deleted the healthcheck-before-migrations branch February 4, 2020 10:24

Labels

blocker bug Fixes for quality problems that affect the customer experience Feature:New Platform release_note:fix Team:Core Platform Core services: plugins, logging, config, saved objects, http, ES client, i18n, etc v7.6.0 v7.6.1 v7.7.0 v8.0.0


Development

Successfully merging this pull request may close these issues.

Add Elasticsearch version check to New Platform
