Conversation

@DmitryNekrasov

@DmitryNekrasov DmitryNekrasov commented Jan 24, 2023

This patch fixes two issues when starting Elasticsearch on the Azul Platform Prime JDK.
The first is that the Azul Platform Prime JDK does not have a G1 garbage collector and therefore has no options for configuring it. The fix checks that these options exist and are not null before using them.
The second concerns parsing the output of the -XX:+PrintFlagsFinal option. The parser assumes that every line of that output contains an origin field. This holds for jdk11+, but jdk8 does not have this field. Older versions of the Azul Platform Prime JDK use the jdk8 PrintFlagsFinal format even when they implement later Java versions. The fix adds a null check for the origin field in the JvmOption class.
Relates #91577.

@cla-checker-service

cla-checker-service bot commented Jan 24, 2023

💚 CLA has been signed

@elasticsearchmachine elasticsearchmachine added the v8.7.0, needs:triage, and external-contributor labels Jan 24, 2023
@DmitryNekrasov
Author

We have a confirmation from Elastic's Legal Ops that the CLA is signed.

@DmitryNekrasov DmitryNekrasov force-pushed the dnekrasov/bugfix/handling-missing-options branch from a306df2 to 0a6bcbc Compare January 24, 2023 13:36
@michaelbaamonde michaelbaamonde added the :Core/Infra/Core label and removed the needs:triage label Jan 24, 2023
@elasticsearchmachine elasticsearchmachine added the Team:Core/Infra label Jan 24, 2023
@elasticsearchmachine
Collaborator

Pinging @elastic/es-core-infra (Team:Core/Infra)

@grcevski grcevski self-assigned this Jan 24, 2023
@grcevski
Contributor

Hi @DmitryNekrasov,

Thanks for proposing this PR. We don't support Azul Prime, but I think Elasticsearch should run if a customer decides to use it themselves. My feedback on this particular PR is as follows:

  • Elasticsearch ships with default JVM configuration options that explicitly specify -XX:+UseG1GC. Elasticsearch should fail to start and warn users that the chosen collector isn't supported on the JVM. Does Azul Prime start with its default collector when given -XX:+UseG1GC, or will the process end because of an incorrectly chosen collector?
  • I agree, we should check for nullness of finalJvmOptions.get("UseG1GC") so that if the underlying JVM isn't OpenJDK we don't throw an unexpected NullPointerException.
  • JDK8 support is a contentious issue: we only support JDK17 and onward. Our packages are compiled with JDK17, so I'm surprised the bytecode loaded. Shouldn't the class version be too new? I don't think the 8.x source would compile with JDK8 either, since we use modules and a lot of JDK14+ language features.
  • Since we can't offer support for Azul Prime, I think we need to log an explicit warning on start that we have detected a JVM that's not officially supported by Elastic.
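
As an illustration of the null check discussed in the second bullet, here is a minimal Java sketch. It assumes the final JVM options are available as a plain name-to-value map of strings; the class name FlagLookup and both method names are hypothetical and are not the actual Elasticsearch code, which may store richer option objects.

```java
import java.util.Map;
import java.util.Optional;

/** Hypothetical helper, not Elasticsearch code: reads boolean JVM flags that may be missing. */
final class FlagLookup {

    /** Returns empty when the JVM did not report the flag at all, for example on a JVM without G1. */
    static Optional<Boolean> booleanFlag(Map<String, String> finalJvmOptions, String name) {
        String raw = finalJvmOptions.get(name);
        if (raw == null) {
            return Optional.empty();
        }
        return Optional.of(Boolean.parseBoolean(raw));
    }

    /** G1-specific tuning applies only when UseG1GC is both reported and enabled. */
    static boolean shouldTuneG1(Map<String, String> finalJvmOptions) {
        return booleanFlag(finalJvmOptions, "UseG1GC").orElse(false);
    }
}
```

Returning an Optional keeps "flag not reported" distinct from "flag reported as false", which would also make it straightforward to log a warning in the first case.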

Thanks again and please let me know if this makes sense or not.

@DmitryNekrasov
Author

Hi @grcevski! Thank you for your comment.

Azul Platform Prime JDK supports only its own collector, the C4 Garbage Collector. It does not support other collectors and therefore has no options for configuring them, which is why we propose adding null checks for these options.
As for jdk8, I meant something slightly different. In OpenJDK, the format of the output printed when the -XX:+PrintFlagsFinal option is enabled differs between jdk8 and jdk11+.

PrintFlagsFinal message example for jdk8 in OpenJDK:

 intx ActiveProcessorCount                      = -1                                  {product}
uintx AdaptiveSizeDecrementScaleFactor          = 4                                   {product}
uintx AdaptiveSizeMajorGCDecayTimeScale         = 10                                  {product}

PrintFlagsFinal message example for jdk11 in OpenJDK:

  int ActiveProcessorCount                     = -1                                        {product} {default}
uintx AdaptiveSizeDecrementScaleFactor         = 4                                         {product} {default}
uintx AdaptiveSizeMajorGCDecayTimeScale        = 10                                        {product} {default}

Elasticsearch is tied to the newer jdk11+ format. However, older versions of the Azul Platform Prime JDK use the same format for all jdk versions, and it matches the OpenJDK jdk8 format. This format has no origin field, so we get an NPE when we try to dereference it. A simple null check on this field avoids the problem.
Thus, the small changes proposed in this patch allow Elasticsearch to start correctly on the Azul Platform Prime JDK.
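
To illustrate the difference between the two formats shown above, here is a minimal Java sketch of a parser that treats the trailing origin field as optional. The class name FlagLine, its fields, and the regular expression are assumptions made for this example and are not the actual JvmOption code in Elasticsearch. The "command line" origin string is what OpenJDK 11+ prints for flags set on the command line.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical parser for one line of -XX:+PrintFlagsFinal output; not the Elasticsearch implementation. */
final class FlagLine {
    // Layout: type, name, "=" or ":=", optional value, {kind}, and an optional {origin} (jdk11+ only).
    private static final Pattern PATTERN = Pattern.compile(
        "\\s*\\S+\\s+(?<name>\\S+)\\s+:?=\\s+(?:(?<value>[^\\s{]\\S*)\\s+)?\\{[^}]+\\}(?:\\s+\\{(?<origin>[^}]+)\\})?\\s*"
    );

    final String name;
    final String value;            // empty string when the flag has no value
    final Optional<String> origin; // empty for the jdk8 style format

    private FlagLine(String name, String value, String origin) {
        this.name = name;
        this.value = value == null ? "" : value;
        this.origin = Optional.ofNullable(origin);
    }

    /** Returns empty for lines that are not flag lines, such as the "[Global flags]" header. */
    static Optional<FlagLine> parse(String line) {
        Matcher m = PATTERN.matcher(line);
        if (!m.matches()) {
            return Optional.empty();
        }
        return Optional.of(new FlagLine(m.group("name"), m.group("value"), m.group("origin")));
    }

    /** A missing origin is treated as "not set on the command line" instead of throwing. */
    boolean isCommandLineOrigin() {
        return origin.isPresent() && origin.get().equals("command line");
    }
}
```

With this shape, a jdk8 style line parses to a flag whose origin is empty, so callers never dereference a null origin.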

@DmitryNekrasov DmitryNekrasov force-pushed the dnekrasov/bugfix/handling-missing-options branch from 0a6bcbc to 06c588d Compare January 25, 2023 09:32
@grcevski
Contributor

Thanks for the explanation @DmitryNekrasov! I think adding the null checks makes sense. We should be good to go, except that I think we should add tests in JvmErgonomicsTests that explicitly exercise the problem methods when the jvm properties map has no entries (for G1 or others).

This is exactly what caused the issue, and I want to make sure that future refactoring in the JvmErgonomics class doesn't break things again.

There are plenty of example tests in JvmErgonomicsTests; passing an empty map should expose the null pointer in all of the methods.

Perhaps also add a test with a manually created JvmOption that has a null origin field.

@DmitryNekrasov DmitryNekrasov force-pushed the dnekrasov/bugfix/handling-missing-options branch from 06c588d to 1defde8 Compare January 25, 2023 16:15
@DmitryNekrasov
Author

Hi @grcevski!
I've added tests that check that the G1-related methods work correctly when there are no options in the finalJvmOptions map. I also added a test that calls jvmOption.isCommandLineOrigin() on an option with a null origin field. I also verified that these tests fail without the changes in this patch.
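
The kind of regression check described here can be sketched in plain Java as follows. Everything in it is a stand-in written for illustration: the ErgonomicsSketch class, its nested Option type, the g1Choices method, and the value 25 are hypothetical and do not reproduce the real JvmErgonomics, JvmOption, or JvmErgonomicsTests code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for the code under test; not the Elasticsearch classes. */
final class ErgonomicsSketch {

    /** Stand-in for a parsed JVM option; origin may be null for jdk8 style output. */
    static final class Option {
        final String value;
        final String origin;

        Option(String value, String origin) {
            this.value = value;
            this.origin = origin;
        }

        boolean isCommandLineOrigin() {
            // Constant on the left, so a null origin yields false instead of an NPE.
            return "command line".equals(origin);
        }
    }

    /** Proposes an illustrative G1 setting unless G1 is absent, disabled, or already set by the user. */
    static List<String> g1Choices(Map<String, Option> finalJvmOptions) {
        List<String> choices = new ArrayList<>();
        Option useG1 = finalJvmOptions.get("UseG1GC");
        if (useG1 == null || !Boolean.parseBoolean(useG1.value)) {
            return choices; // no G1 on this JVM, or G1 disabled
        }
        Option reserve = finalJvmOptions.get("G1ReservePercent");
        if (reserve != null && !reserve.isCommandLineOrigin()) {
            choices.add("-XX:G1ReservePercent=25"); // illustrative value only
        }
        return choices;
    }

    public static void main(String[] args) {
        // Empty map: must not throw and must propose nothing.
        if (!g1Choices(new HashMap<>()).isEmpty()) throw new AssertionError("expected no choices");
        // Null origin: must not throw and must not count as command line.
        if (new Option("true", null).isCommandLineOrigin()) throw new AssertionError("null origin");
        System.out.println("checks passed");
    }
}
```

The two checks in main mirror the two cases named above: an options map with no entries, and an option whose origin field is null.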

@grcevski
Contributor

@elasticsearchmachine test this please

Contributor

@grcevski grcevski left a comment

LGTM! The changes look great, thanks for all the iterations.

We need to make the CI green, and it seems two things need to be done:

  • Change area to Infra/Settings as per my comment in the YAML file.
  • You'll need to rebase on main again. We updated our backwards compatibility tests since this was branched, and those tests fail. Simply merge the main branch and push the update.

@DmitryNekrasov DmitryNekrasov force-pushed the dnekrasov/bugfix/handling-missing-options branch 2 times, most recently from 558a8e0 to deda1ed Compare January 25, 2023 19:15
@grcevski
Contributor

@elasticsearchmachine test this please

@DmitryNekrasov
Author

DmitryNekrasov commented Jan 26, 2023

Hi @grcevski, @thecoop!
Do you think we can merge this pull request?
Thank you!

@DmitryNekrasov DmitryNekrasov force-pushed the dnekrasov/bugfix/handling-missing-options branch from deda1ed to 8a5cb60 Compare January 27, 2023 11:50
Member

@thecoop thecoop left a comment

Oh, yes, my comment was just a minor nit

@rjernst
Member

rjernst commented Jan 31, 2023

@DmitryNekrasov Thanks very much for your interest in Elasticsearch.

We choose to support particular JDKs with care. We test the JDKs on our CI with our full suite of tests, we create JDK-specific options tests, and our team ensures that every Elasticsearch release works correctly with those JDKs. Additionally, we need to spend time investigating failures on those JDKs, making sure performance is acceptable, etc.

Because of the above, we generally will not merge PRs targeting an untested and unsupported JDK. We cannot guarantee that the change works and will continue to work in future releases, and we aren't prepared to have code in the repository that we cannot maintain because it's not tested. It would be a very bad experience for our users if Elasticsearch suddenly stopped working on a particular JDK after an upgrade.

This change has the stated goal of allowing Elasticsearch to run on the Azul Prime JDK. While it does not add any Azul-specific logic, it does add leniency in our CLI where none was previously needed. That leniency could mask an actual problem in our parsing logic on our officially supported JDKs.

Given the reasoning above, I will close this PR. However, we do understand the desire to have the flexibility to still make your setup work, regardless of our inability to officially support it. Your PR intended to address two issues. The first is regarding the G1GC settings. These have now been guarded so that Elasticsearch tolerates G1 options not being present (or G1 having been disabled through -XX:-UseG1GC). Our suggestion, though, is to make the Azul Prime JDK more compatible with OpenJDK based JDKs by accepting (and otherwise ignoring) -XX:-UseG1GC. The second part of this PR addresses origin not existing in the final flags. Since we support and test only OpenJDK based JDKs, we expect the format to be consistent with OpenJDK's. We suggest making the Azul final options more in line with those in OpenJDK.

In general, making the Azul Prime JDK more compatible with OpenJDK is more likely to result in Elasticsearch working on it. While we don't intend to support that JDK, we do hope you consider this suggestion. These issues in the CLI are unlikely to be the last given our tight coupling with OpenJDK behavior and internals.

@rjernst rjernst closed this Jan 31, 2023
@Holmistr

Holmistr commented Feb 1, 2023

@rjernst Thank you for the commentary. Obviously, we're not super happy about the decision, as we would really love to be an officially supported JDK, but I also understand your reasoning. We'll try to persuade you to change your decision with some astonishing performance data in the future :)

In the meantime, we have actually implemented the fix on Prime's side, so Elasticsearch will work seamlessly on the newest Prime versions (starting with 22.12, I believe).

One more question: what happens if one of your customers wants to run on Prime (there are already some)? We're obviously happy to support them on our side with JVM issues, but I'm wondering how your support policy will react if the customer is running on Prime and, for example, the issue is obviously unrelated to the JDK. Would it be a big no-no ("hey, you're running on an unsupported JDK, we're closing the ticket"), or "hey, after looking at this issue, we suspect a problem with JDK compatibility. Please go to your Azul support"? The latter would be absolutely fine.

Thanks a lot for your comments.

@rjernst
Member

rjernst commented Feb 8, 2023

It would be the latter. We might first ask the user to try to reproduce the issue on a supported JDK (for example, running the same reproduction steps but with the bundled jdk), but if we can reproduce with OpenJDK, we would keep the issue open and see how best to address it.

@Holmistr

@rjernst Makes perfect sense. Thank you.

Labels

:Core/Infra/Core, external-contributor, >non-issue, Team:Core/Infra, v8.7.0
