Add null check for G1 related options and origin field #93197

DmitryNekrasov · 2023-01-24T13:12:40Z

This patch fixes two issues that occur when starting Elasticsearch on the Azul Platform Prime JDK.
The first is that the Azul Platform Prime JDK does not have a G1 garbage collector and therefore has no options for configuring it. The first fix checks that these options exist and are not null.
The second concerns parsing the output of the -XX:+PrintFlagsFinal option: the parser assumes that every line of that output contains an origin field. This holds for jdk11+, but jdk8 does not print this field. Older versions of the Azul Platform Prime JDK use the jdk8 PrintFlagsFinal format even for later JDK versions. The second fix adds a null check for the origin field in the JvmOption class.
Relates #91577.
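
The G1 guard described above can be sketched roughly as follows. This is a minimal illustration, not the patch itself: G1OptionGuard, Flag, and both method names are hypothetical stand-ins for the real JvmErgonomics and JvmOption code, and the map is assumed to be keyed by flag name as parsed from the -XX:+PrintFlagsFinal output.

```java
import java.util.Map;

// Hypothetical stand-in for the launcher's ergonomics code; not the real Elasticsearch classes.
final class G1OptionGuard {

    // Minimal value holder for one parsed flag (assumed shape).
    static final class Flag {
        final String value;

        Flag(String value) {
            this.value = value;
        }
    }

    // Returns false instead of throwing when the JVM does not report UseG1GC at all.
    static boolean isG1Enabled(Map<String, Flag> finalJvmOptions) {
        Flag useG1 = finalJvmOptions.get("UseG1GC");
        return useG1 != null && Boolean.parseBoolean(useG1.value);
    }

    // Returns -1 when the flag is absent, so callers can skip G1-specific tuning.
    static long g1RegionSize(Map<String, Flag> finalJvmOptions) {
        Flag regionSize = finalJvmOptions.get("G1HeapRegionSize");
        return regionSize == null ? -1L : Long.parseLong(regionSize.value);
    }
}
```

With an empty map, both methods return a "not available" answer rather than dereferencing a missing entry, which is the behavior a JVM without G1 needs.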

cla-checker-service · 2023-01-24T13:12:44Z

💚 CLA has been signed

DmitryNekrasov · 2023-01-24T13:18:07Z

We have a confirmation from Elastic's Legal Ops that the CLA is signed.

elasticsearchmachine · 2023-01-24T21:12:34Z

Pinging @elastic/es-core-infra (Team:Core/Infra)

grcevski · 2023-01-24T22:05:29Z

Hi @DmitryNekrasov,

Thanks for proposing this PR. We don't support Azul Prime, but I think Elasticsearch should still run if a customer decides to use it themselves. My feedback on this particular PR is as follows:

Elasticsearch ships with default JVM configuration options which explicitly specify -XX:+UseG1GC. Elasticsearch should fail to start and warn users that the chosen collector isn't supported on the JVM. Does Azul Prime start with its default collector when given -XX:+UseG1GC, or will the process end because of an incorrectly chosen collector?
I agree, we should check for nullness of finalJvmOptions.get("UseG1GC") so that if the underlying JVM isn't OpenJDK we don't throw an unexpected NullPointerException.
JDK8 support is a contentious issue; we only support JDK17 and onward. Our packages are compiled with JDK17, so I'm surprised the bytecode loaded; shouldn't the class version be too new? I also don't think the 8.x source would compile with JDK8, since we use modules and a lot of JDK14+ language features.
Since we can't offer support for Azul Prime, I think we need to log an explicit warning on start that we have detected a JVM that's not officially supported by Elastic.

Thanks again and please let me know if this makes sense or not.

DmitryNekrasov · 2023-01-25T08:38:31Z

Hi @grcevski! Thank you for your comment.

The Azul Platform Prime JDK only supports its own collector, the C4 Garbage Collector. It does not support other collectors and therefore has no options for configuring them, which is why we propose adding a null check for these options.
As for jdk8, I meant something slightly different. The format of the message that is printed when the -XX:+PrintFlagsFinal option is enabled differs between jdk8 and jdk11+ when it comes to OpenJDK.

PrintFlagsFinal message example for jdk8 in OpenJDK:

 intx ActiveProcessorCount                      = -1                                  {product}
uintx AdaptiveSizeDecrementScaleFactor          = 4                                   {product}
uintx AdaptiveSizeMajorGCDecayTimeScale         = 10                                  {product}

PrintFlagsFinal message example for jdk11 in OpenJDK:

  int ActiveProcessorCount                     = -1                                        {product} {default}
uintx AdaptiveSizeDecrementScaleFactor         = 4                                         {product} {default}
uintx AdaptiveSizeMajorGCDecayTimeScale        = 10                                        {product} {default}

Elasticsearch is tied to the new message format from jdk11+. However, older versions of the Azul Platform Prime JDK use the same message format for all jdk versions, and it matches the jdk8 format from OpenJDK. This format does not have an origin field, so we get an NPE when we try to dereference it. A simple null check on this field avoids the problem.
Thus, the small changes proposed in this patch allow Elasticsearch to start correctly on the Azul Platform Prime JDK.
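
To make the format difference concrete, here is a rough sketch of a parser that accepts both layouts by treating the trailing origin column as optional. FlagLineParser, its nested Flag class, and the regular expression are assumptions made for illustration; they are not the parsing code in Elasticsearch.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical parser for illustration; the class name and regex are assumptions.
final class FlagLineParser {

    // Columns: type, name, "=" or ":=", value (may be empty), {kind}, and an optional {origin}.
    private static final Pattern LINE = Pattern.compile(
        "^\\s*\\S+\\s+(\\S+)\\s+:?=\\s*(\\S*)\\s+\\{([^}]*)\\}(?:\\s+\\{([^}]*)\\})?\\s*$"
    );

    static final class Flag {
        final String name;
        final String value;
        final String origin; // null when the line uses the jdk8-style format

        Flag(String name, String value, String origin) {
            this.name = name;
            this.value = value;
            this.origin = origin;
        }

        // Null-safe: a missing origin is treated as "not set on the command line".
        boolean isCommandLineOrigin() {
            return "command line".equals(origin);
        }
    }

    // Returns null for lines that are not flag lines, such as a header line.
    static Flag parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        // group(4) is null when the optional origin column did not match.
        return new Flag(m.group(1), m.group(2), m.group(4));
    }
}
```

Putting the string literal on the left of equals() is what keeps isCommandLineOrigin() safe when origin is null; an explicit null check would work equally well.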

grcevski · 2023-01-25T14:21:28Z

Thanks for the explanation @DmitryNekrasov! Adding the null checks makes sense, and I think we should be good to go, except that we should add tests in JvmErgonomicsTests which explicitly exercise the problem methods when the JVM properties map has no entries (for G1 or others).

This is exactly what caused the issue, and I want to make sure that future refactoring in the JvmErgonomics class doesn't break things again.

There are plenty of example tests in JvmErgonomicsTests; passing an empty map should expose the null pointer in all of the methods.

Perhaps also add a test with a manually created JvmOption which has a null origin field.

DmitryNekrasov · 2023-01-25T16:21:59Z

Hi @grcevski!
I've added tests that check that methods related to G1 work correctly when there are no options in the finalJvmOptions map. I also added a test to call jvmOption.isCommandLineOrigin() on an option with a null origin field. Also, I made sure that without changes in this patch, these tests fail.

grcevski · 2023-01-25T16:41:10Z

@elasticsearchmachine test this please

grcevski

LGTM! The changes look great, thanks for all the iterations.

We need to make the CI green, and it seems two things need to be done:

Change area to Infra/Settings as per my comment in the YAML file.
You'll need to rebase on main again; we have updated our backwards compatibility tests since this was branched, and those tests fail. Simply merge the main branch and push the update.

grcevski · 2023-01-25T19:35:48Z

@elasticsearchmachine test this please

DmitryNekrasov · 2023-01-26T14:08:15Z

Hi @grcevski, @thecoop!
Do you think we can merge this pull request?
Thank you!

thecoop

Oh, yes, my comment was just a minor nit

rjernst · 2023-01-31T05:27:30Z

@DmitryNekrasov Thanks very much for your interest in Elasticsearch.

We choose to support particular JDKs with care. We test the JDKs on our CI with our full suite of tests, we create JDK specific options tests, and our team ensures that every Elasticsearch release works correctly with those JDKs. Additionally we need to spend time investigating failures on those JDKs, and making sure performance is acceptable, etc.

Because of the above, we generally will not merge PRs targeting an untested and unsupported JDK: we cannot guarantee that the change works and will continue to work in future releases, and we aren't prepared to have code in the repository that we cannot maintain because it's not tested. It would be a very bad experience for our users if Elasticsearch suddenly stopped working on a particular JDK after an upgrade.

This change has the stated goal of allowing Elasticsearch to run on the Azul Prime JDK. While it does not add any Azul specific logic, it does add leniency in our CLI where none was previously needed. That leniency could lead to masking an actual problem in our parsing logic on our officially supported JDKs.

Given the reasoning above, I will close this PR. However, we do understand the desire to have the flexibility to still make your setup work, regardless of our inability to officially support it. Your PR intended to address two issues. The first is regarding the G1GC settings. These have now been guarded so that Elasticsearch will ignore if G1 options are not present (or if G1 has been disabled through -XX:-UseG1GC). Our suggestion though is to make the Azul Prime JDK more compatible with OpenJDK based JDKs by accepting (and otherwise ignoring) -XX:-UseG1GC. The second part of this PR addresses origin not existing in final flags. Since we support and test only OpenJDK based JDKs, we expect that format to be consistent with that. We suggest making the Azul final options more in line with those in OpenJDK.

In general, making the Azul Prime JDK more compatible with OpenJDK is more likely to result in Elasticsearch working on it. While we don't intend to support that JDK, we do hope you consider this suggestion. These issues in the CLI are unlikely to be the last given our tight coupling with OpenJDK behavior and internals.

Holmistr · 2023-02-01T10:18:19Z

@rjernst Thank you for the commentary. Obviously, we're not super happy about the decision, as we would really love to be an officially supported JDK, but I also understand your reasoning. We'll try to persuade you to change your decision with some astonishing performance data in the future :)

In the meantime, we have actually implemented the fix on Prime's side, so Elasticsearch will work seamlessly on the newest Prime versions (starting with 22.12, I believe).

One more question - what happens if your customer wants to run on Prime (there are already some)? We're obviously happy to support them on our side with the JVM issues but I'm wondering how will your support policy react if the customer is running on Prime and e.g. the issue is obviously unrelated to JDK. Would it be a big no-no "hey, you're running on unsupported JDK, we're closing the ticket", or "hey, after looking at this issue, we're suspecting a problem with JDK compatibility. Please go to your Azul support?". The latter would be absolutely fine.

Thanks a lot for your comments.

rjernst · 2023-02-08T04:05:14Z

It would be the latter. We might first ask the user to try to reproduce the issue on a supported JDK (for example, running the same reproduction steps but with the bundled jdk), but if we can reproduce with OpenJDK, we would keep the issue open and see how best to address it.

Holmistr · 2023-02-14T16:09:34Z

@rjernst Makes perfect sense. Thank you.

elasticsearchmachine added the v8.7.0, needs:triage, and external-contributor labels Jan 24, 2023

DmitryNekrasov force-pushed the dnekrasov/bugfix/handling-missing-options branch from a306df2 to 0a6bcbc Compare January 24, 2023 13:36

michaelbaamonde added the :Core/Infra/Core label and removed the needs:triage label Jan 24, 2023

elasticsearchmachine added the Team:Core/Infra label Jan 24, 2023

grcevski self-assigned this Jan 24, 2023

grcevski added the >non-issue label Jan 24, 2023

thecoop reviewed Jan 25, 2023

DmitryNekrasov force-pushed the dnekrasov/bugfix/handling-missing-options branch from 0a6bcbc to 06c588d Compare January 25, 2023 09:32

DmitryNekrasov force-pushed the dnekrasov/bugfix/handling-missing-options branch from 06c588d to 1defde8 Compare January 25, 2023 16:15

grcevski approved these changes Jan 25, 2023

DmitryNekrasov force-pushed the dnekrasov/bugfix/handling-missing-options branch 2 times, most recently from 558a8e0 to deda1ed Compare January 25, 2023 19:15

DmitryNekrasov force-pushed the dnekrasov/bugfix/handling-missing-options branch from deda1ed to 8a5cb60 Compare January 27, 2023 11:50

DmitryNekrasov requested a review from thecoop January 27, 2023 15:18

thecoop approved these changes Jan 27, 2023

rjernst closed this Jan 31, 2023

bhavanisn mentioned this pull request Sep 15, 2023

To accept OpenJDK option +UseG1GC to enable startup of elasticsearch application eclipse-openj9/openj9#18151

Closed

Holmistr mentioned this pull request Feb 12, 2024

Elasticsearch fails to start on Azul Platform Prime JDK (formerly Zing) from 7.11 onwards #91577

Closed
