Skip to content

Allow product tests' framework to use more memory#16231

Merged
electrum merged 1 commit intotrinodb:masterfrom
nineinchnick:product-tests-min-mem
Feb 24, 2023
Merged

Allow product tests' framework to use more memory#16231
electrum merged 1 commit intotrinodb:masterfrom
nineinchnick:product-tests-min-mem

Conversation

@nineinchnick
Copy link
Copy Markdown
Member

Description

After #15833 we saw a big increase in the duration of product tests jobs. Starting tests got stuck for 2–3 minutes before loading the tempto configuration. This only happens on amd64 hosts with 7 GB of memory. I found this section that explains why we set it to 10% before: https://docs.oracle.com/en/java/javase/17/gctuning/factors-affecting-garbage-collection-performance.html#GUID-7FB2D1D5-D75F-4AA1-A3B1-4A17F8FF97D0

I chose to increase MaxHeapFreeRatio to 50% because this is the lowest value when tests didn't get stuck.

Additional context and related issues

Release notes

(x) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
( ) Release notes are required, with the following suggested text:

@hashhar
Copy link
Copy Markdown
Member

hashhar commented Feb 23, 2023

Doesn't this suggest we're starting to make the GC work more either because allocation rate increased or because we actually consume more memory?

cc: @raunaqmorarka have you noticed any changes in recent benchmark runs?

EDIT: Seems it's limited to the PT launcher itself and not the server itself? How did you verify this @nineinchnick ?

@hashhar
Copy link
Copy Markdown
Member

hashhar commented Feb 23, 2023

Great job on investigating and finding a possible fix @nineinchnick.

@nineinchnick
Copy link
Copy Markdown
Member Author

Doesn't this suggest we're starting to make the GC work more either because allocation rate increased or because we actually consume more memory?

cc: @raunaqmorarka have you noticed any changes in recent benchmark runs?

EDIT: Seems it's limited to the PT launcher itself and not the server itself? How did you verify this @nineinchnick ?

I don't have much knowledge around this. I found out the minimum value when the issue is not appearing by trial and error, running tests locally outside the tests' container. I hope we'll see the same results in the CI in this PR.

@nineinchnick
Copy link
Copy Markdown
Member Author

BTW the issue also stopped appearing when I only added the async profiler agent. It must have somehow affected the GC too.

@kokosing
Copy link
Copy Markdown
Member

After #15833

That suggest that product has changed. Should we also change recommended jvm config for Trino?

@nineinchnick
Copy link
Copy Markdown
Member Author

It worked, I'm comparing the oauth2 PTs between this PR (https://github.com/trinodb/trino/actions/runs/4251852279/jobs/7397034977) where it finished in ~9m, same as before, and master (https://github.com/trinodb/trino/actions/runs/4243983476/jobs/7382774818) where it takes 30m.

after:
image

before:
image

@nineinchnick
Copy link
Copy Markdown
Member Author

Can someone retry maven-checks? I don't know why it timed out, maybe it's a flake. Building took longer than usual, and the timeout is relatively low.

@electrum
Copy link
Copy Markdown
Member

We don’t use ParallelGC for Trino. This is a very specific tuning for product tests, that I actually don’t understand. Does anyone know why it was setup this way? The docs indicate that low settings here can impact performance. I wonder if we should remove these settings and use G1.

@martint
Copy link
Copy Markdown
Member

martint commented Feb 23, 2023

There have been some reports in Slack about instability in 407, so it’s plausible there’s a real issue introduced in that version .

@nineinchnick
Copy link
Copy Markdown
Member Author

All green and good to go. Let's merge this to reduce the CI queue, and we can do more investigation later.

@electrum
Copy link
Copy Markdown
Member

Agreed, going to merge this now.

@electrum electrum merged commit ce2464a into trinodb:master Feb 24, 2023
@electrum
Copy link
Copy Markdown
Member

@nineinchnick can you try these experiments:

  • Remove all the GC settings, so that it uses G1 (now the default) with default settings?
  • Remove the explicit tunings for ParallelGC, so that it uses the defaults.

@electrum
Copy link
Copy Markdown
Member

electrum commented Feb 24, 2023

Another idea: simply set a max heap size, for both G1 and ParallelGC. The intent seems to be to use as little memory as possible for the product tests framework, to allow more memory for the containers. As long as we don't have variable usage (i.e. some tests need more framework memory and less container memory, while others need more container memory), then this should be simpler.

@github-actions github-actions bot added this to the 409 milestone Feb 25, 2023
@nineinchnick nineinchnick deleted the product-tests-min-mem branch February 28, 2023 15:19
@nineinchnick
Copy link
Copy Markdown
Member Author

Let's continue in #16306

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Development

Successfully merging this pull request may close these issues.

5 participants