Don't rely on test code execution time span for RemoteSegmentTransferTrackerTests #15187

Merged
merged 1 commit into from
Aug 14, 2024

Conversation

lukas-vlcek
Contributor

Description

The current implementation of the RemoteSegmentTransferTrackerTests.testComputeTimeLagOnUpdate() test relies on assumptions about how fast the test code will finish in the JVM. Moreover, it does not precisely control the boundaries of the time span, specifically the start of the span, because that is determined by the internal implementation of RemoteSegmentTransferTracker.getTimeMsLag(), which indirectly calls System.nanoTime().

This commit loosens the assumption that the test code execution will finish within +/-20ms. Instead, it only assumes that the execution time span won't be shorter than a predefined (and controlled) thread sleep interval; any larger interval value is considered a success.

The point of this test is not to verify execution speed with a defined precision. Instead, the point is that the getTimeMsLag() method returns either 0 (under specific conditions) or a positive number (assuming that remoteRefreshStartTimeMs is not greater than System.nanoTime()).
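The lower-bound idea described above can be sketched in isolation. This is only an illustration under stated assumptions, not the code changed by this pull request: the class LagLowerBoundSketch and the helpers timeMsLag and sleepAtLeast are hypothetical stand-ins, and the 50 ms interval is an arbitrary choice.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical sketch; not the real RemoteSegmentTransferTracker or its test.
final class LagLowerBoundSketch {

    // Stand-in for a lag computation: 0 when nothing is pending, otherwise the
    // whole milliseconds elapsed since startNanos.
    static long timeMsLag(long startNanos, boolean pending) {
        if (!pending) {
            return 0L;
        }
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startNanos);
    }

    // Sleeps until at least `millis` have elapsed on the System.nanoTime()
    // clock, sleeping again if the thread wakes up early.
    static void sleepAtLeast(long millis) {
        final long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(millis);
        long remaining;
        while ((remaining = deadline - System.nanoTime()) > 0) {
            try {
                // Round up so that a sub-millisecond remainder still sleeps.
                Thread.sleep(TimeUnit.NANOSECONDS.toMillis(remaining) + 1);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while sleeping", e);
            }
        }
    }

    public static void main(String[] args) {
        final long sleepMs = 50L; // controlled interval chosen by the test
        final long startNanos = System.nanoTime();
        sleepAtLeast(sleepMs);
        final long lag = timeMsLag(startNanos, true);

        // Only the lower bound is asserted; a slow machine can make the lag
        // arbitrarily larger without failing the check.
        if (lag < sleepMs) {
            throw new AssertionError("lag " + lag + " ms is below " + sleepMs + " ms");
        }
        // With nothing pending, the lag is reported as 0.
        if (timeMsLag(startNanos, false) != 0L) {
            throw new AssertionError("expected 0 when nothing is pending");
        }
        System.out.println("lower bound holds");
    }
}
```

Because there is no upper bound, a GC pause or a busy CI host can only make the measured value larger, which still satisfies the assertion.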

Related Issues

Closes: #14325

Check List

  • [x] Functionality includes testing.
  • [ ] API changes companion pull request created, if applicable.
  • [ ] Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@lukas-vlcek
Contributor Author

Please add skip-changelog label.

@lukas-vlcek
Contributor Author

Idea for possible future improvement:

We could remove all direct calls to System.nanoTime() and similar System time methods from the RemoteSegmentTransferTracker.java class and delegate them to some "TimeProvider" object. Then we could implement tests that make more specific assumptions about the code execution time span, because we would be able to control the TimeProvider precisely in the test.

If there is agreement that this would be beneficial, we can open a new ticket.
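A minimal sketch of what such a "TimeProvider" could look like, assuming the time source is injected as a java.util.function.LongSupplier. The names LagTrackerSketch, FakeNanoClock, markRefreshStarted and markRefreshCompleted are hypothetical and do not exist in OpenSearch; the real tracker keeps more state than this.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.LongSupplier;

// Hypothetical tracker-like class that reads time only through an injected
// supplier instead of calling System.nanoTime() directly.
final class LagTrackerSketch {
    private final LongSupplier nanoClock; // plays the role of the "TimeProvider"
    private long refreshStartNanos;
    private boolean pending;

    LagTrackerSketch(LongSupplier nanoClock) {
        this.nanoClock = nanoClock;
    }

    void markRefreshStarted() {
        refreshStartNanos = nanoClock.getAsLong();
        pending = true;
    }

    void markRefreshCompleted() {
        pending = false;
    }

    // Returns 0 when nothing is pending, else whole milliseconds since start.
    long timeMsLag() {
        if (!pending) {
            return 0L;
        }
        return TimeUnit.NANOSECONDS.toMillis(nanoClock.getAsLong() - refreshStartNanos);
    }
}

// Test-only clock that moves only when the test advances it.
final class FakeNanoClock implements LongSupplier {
    private long nowNanos;

    @Override
    public long getAsLong() {
        return nowNanos;
    }

    void advanceMillis(long millis) {
        nowNanos += TimeUnit.MILLISECONDS.toNanos(millis);
    }
}

final class TimeProviderDemo {
    public static void main(String[] args) {
        FakeNanoClock clock = new FakeNanoClock();
        LagTrackerSketch tracker = new LagTrackerSketch(clock);

        tracker.markRefreshStarted();
        clock.advanceMillis(150);
        System.out.println(tracker.timeMsLag()); // prints 150

        tracker.markRefreshCompleted();
        System.out.println(tracker.timeMsLag()); // prints 0
    }
}
```

Production code could pass System::nanoTime as the supplier, while a test could assert an exact lag value without sleeping at all.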

Contributor

github-actions bot commented Aug 9, 2024

✅ Gradle check result for 025c303: SUCCESS


codecov bot commented Aug 9, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 71.92%. Comparing base (b6c80b1) to head (6e191a3).
Report is 6 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main   #15187      +/-   ##
============================================
+ Coverage     71.90%   71.92%   +0.01%     
- Complexity    63033    63114      +81     
============================================
  Files          5197     5197              
  Lines        295313   295313              
  Branches      42677    42677              
============================================
+ Hits         212354   212390      +36     
- Misses        65552    65607      +55     
+ Partials      17407    17316      -91     


@linuxpi linuxpi assigned lukas-vlcek and unassigned linuxpi Aug 13, 2024
@linuxpi
Collaborator

linuxpi commented Aug 13, 2024

Thanks for raising a fix for this flaky test @lukas-vlcek, seems like a hot one.

Idea for possible future improvement:

We could remove all direct calls to System.nanoTime() and similar System time methods from RemoteSegmentTransferTracker.java class and delegate it to some "TimeProvider" object. Then we could implement tests that have some more specific assumptions about code execution time span, because we would be able to control the TimeProvider in the test precisely.

If there is an agreement that this would be beneficial then we can open a new ticket.

So far we have always relied on calling System.nanoTime() directly, but I think it would be good to have an abstraction like TimeProvider. To reach agreement, we could open a small RFC and gather thoughts from others.

@lukas-vlcek lukas-vlcek force-pushed the 14325 branch 2 times, most recently from 510c03a to 18dbf31 Compare August 13, 2024 11:09
Contributor

❌ Gradle check result for 510c03a: UNSTABLE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Contributor

❌ Gradle check result for 18dbf31: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Contributor

❕ Gradle check result for d1cc5c7: UNSTABLE

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

@linuxpi
Collaborator

linuxpi commented Aug 13, 2024

The changes mostly look good. One thing: we should run multiple iterations locally to make sure we have not regressed.

@lukas-vlcek
Contributor Author

@linuxpi As for TimeProvider RFC, I will check other uses of System.nanoTime() in the code base (once I'm back from 🌴) and I will let you know what I think.

Contributor

❕ Gradle check result for 6e191a3: UNSTABLE

  • TEST FAILURES:
      1 org.opensearch.index.ShardIndexingPressureIT.testShardIndexingPressureTrackingDuringBulkWrites

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

@linuxpi linuxpi changed the title from "Don't rely on test code execution time span" to "Don't rely on test code execution time span for RemoteSegmentTransferTrackerTests" Aug 14, 2024
@linuxpi linuxpi merged commit ef1a79f into opensearch-project:main Aug 14, 2024
36 checks passed
@linuxpi linuxpi added the backport 2.x Backport to 2.x branch label Aug 14, 2024
opensearch-trigger-bot bot pushed a commit that referenced this pull request Aug 14, 2024
…TrackerTests (#15187)

Signed-off-by: Lukáš Vlček <[email protected]>
(cherry picked from commit ef1a79f)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
dblock pushed a commit that referenced this pull request Aug 14, 2024
…TrackerTests (#15187) (#15244)

(cherry picked from commit ef1a79f)

Signed-off-by: Lukáš Vlček <[email protected]>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
@lukas-vlcek lukas-vlcek deleted the 14325 branch August 15, 2024 06:49
wdongyu pushed a commit to wdongyu/OpenSearch that referenced this pull request Aug 22, 2024
akolarkunnu pushed a commit to akolarkunnu/OpenSearch that referenced this pull request Sep 10, 2024
Labels
  • autocut
  • backport 2.x: Backport to 2.x branch
  • flaky-test: Random test failure that succeeds on second run
  • skip-changelog
  • Storage:Remote
  • >test-failure: Test failure from CI, local build, etc.
Projects
Status: ✅ Done
Development

Successfully merging this pull request may close these issues.

[AUTOCUT] Gradle Check Flaky Test Report for RemoteSegmentTransferTrackerTests
4 participants