HDDS-7463. SCM Pipeline scrubber never able to cleanup allocated pipeline. #4093
Conversation
Thanks @sumitagrawl for reporting the problem, @ashishkumar50 for the patch.
- The same instance of MonotonicClock is used in several places, so the same problem may affect other services. We should fix all of them.
- Clock was introduced here to improve the tests (quicker and more deterministic than sleep or waitFor). Reverting to Instant.now() eliminates those advantages. It also breaks TestPipelineStateManagerImpl, which relies on TestClock.
I think we should try to fix MonotonicClock, or replace it with SystemClock for production use.
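Below is a minimal sketch of the pattern the comment refers to: injecting a java.time.Clock lets a test pin "now" instead of relying on sleep or waitFor. The class and method names are hypothetical, not Ozone's actual API, and Clock.fixed stands in here for Ozone's TestClock.

```java
import java.time.Clock;
import java.time.Duration;
import java.time.Instant;
import java.time.ZoneOffset;

// Hypothetical sketch: a scrubber that takes its notion of "now" from an injected Clock.
class AllocatedPipelineScrubber {
  private final Clock clock;
  private final Duration scrubTimeout;

  AllocatedPipelineScrubber(Clock clock, Duration scrubTimeout) {
    this.clock = clock;
    this.scrubTimeout = scrubTimeout;
  }

  boolean shouldScrub(Instant creationTime) {
    // Pipeline is stale if it has stayed ALLOCATED longer than the timeout.
    return creationTime.plus(scrubTimeout).isBefore(clock.instant());
  }
}

class ClockInjectionExample {
  public static void main(String[] args) {
    Instant created = Instant.parse("2022-11-01T00:00:00Z");
    Duration timeout = Duration.ofMinutes(5);

    // Fixed clocks play the role of a test clock: no sleeping or polling needed.
    Clock beforeTimeout = Clock.fixed(created.plus(Duration.ofMinutes(1)), ZoneOffset.UTC);
    Clock afterTimeout = Clock.fixed(created.plus(Duration.ofHours(1)), ZoneOffset.UTC);

    System.out.println(new AllocatedPipelineScrubber(beforeTimeout, timeout).shouldScrub(created)); // false
    System.out.println(new AllocatedPipelineScrubber(afterTimeout, timeout).shouldScrub(created));  // true
  }
}
```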
I don't recall why I didn't use SystemClock and instead created MonotonicClock. Perhaps I was trying to replace some calls to Time.monotonicNow() that were originally there; it looks that way in the original PR. Strange that it has taken so long for this to be noticed, as we have been using it for a long time. I wonder, can we just replace all occurrences of MonotonicClock with java.time's SystemClock to fix this?
As I have checked, SystemClock gives the system time with respect to January 1, 1970 UTC, so using it will fix this problem. MonotonicClock gives time with respect to system start, which is useful for measuring performance or the time taken by a method. @sodonnel @adoroszlai do we need to change the clock instance for this to SystemClock, or replace MonotonicClock in all places?
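As a quick illustration (plain Java, not Ozone code) of why the two time sources are not comparable: a system clock returns wall-clock time since the Unix epoch, while a monotonic source measures elapsed time from an arbitrary origin such as JVM start.

```java
import java.time.Clock;
import java.time.Instant;

public class ClockComparison {
  public static void main(String[] args) {
    // Wall-clock time: milliseconds since 1970-01-01T00:00:00Z, comparable across processes.
    Instant wallClock = Clock.systemUTC().instant();

    // Monotonic time: nanoseconds from an arbitrary origin (typically JVM/system start),
    // only meaningful for measuring elapsed time inside the same JVM.
    long monotonicNanos = System.nanoTime();

    System.out.println("System clock instant : " + wallClock);
    System.out.println("Monotonic reading    : " + monotonicNanos + " ns (arbitrary origin)");
    // Comparing wallClock.toEpochMilli() with monotonicNanos is meaningless, which is why
    // a pipeline created with Instant.now() never looks "old" to a monotonic clock.
  }
}
```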
I guess
We should not use
We need to use SystemClock() as the object inside the method, because the pipeline time is also serialized to the DB. But the clock instance is also used for measuring time taken, depending on the use case. In Recon, it is using Instant.now(). So, either we change the clock instance to SystemClock, OR the current change in the PR is OK, similar to Recon. Please suggest the changes.
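A small illustrative sketch of that point, assuming the persisted value is plain epoch millis (the class and methods below are hypothetical, not the actual Pipeline proto/codec): a timestamp written to the DB has to come from an epoch-based clock to stay meaningful across restarts.

```java
import java.time.Clock;
import java.time.Instant;

// Illustrative only: shows why a persisted creation time must be epoch-based.
public class PipelineRecord {
  private final long creationTimeEpochMillis;

  PipelineRecord(Clock clock) {
    // Epoch millis survive serialization and JVM restarts;
    // a monotonic reading written here would be meaningless after a restart.
    this.creationTimeEpochMillis = clock.instant().toEpochMilli();
  }

  long toDbValue() {
    return creationTimeEpochMillis;
  }

  static Instant fromDbValue(long storedMillis) {
    return Instant.ofEpochMilli(storedMillis);
  }

  public static void main(String[] args) {
    PipelineRecord record = new PipelineRecord(Clock.systemUTC());
    long stored = record.toDbValue();
    System.out.println("stored = " + stored + " -> " + PipelineRecord.fromDbValue(stored));
  }
}
```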
I think we should replace MonotonicClock with SystemClock anywhere it is created and passed into objects to be used. It will achieve the intended goal and be safer to use.
This is because we didn't change all usages of time to the new model. We have been fixing things on a case-by-case basis, and trying to ensure new code follows the new pattern.
The existing change is not OK. It also breaks the tests. Let's just try to replace all occurrences of MonotonicClock in the project with java.time's SystemClock instead. It should be a drop-in replacement. IntelliJ reckons there are 24 occurrences of it through the code base, which isn't too many to fix.
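A hedged sketch of what that drop-in replacement could look like at a construction site. Note that java.time's SystemClock is package-private, so it is obtained through a factory method, and the commented-out MonotonicClock constructor below is an assumption about the existing code rather than a verified call.

```java
import java.time.Clock;
import java.time.ZoneOffset;

public class ClockWiring {
  public static void main(String[] args) {
    // Before (assumed construction site; the MonotonicClock constructor
    // signature is an assumption, not verified against the code base):
    // Clock clock = new MonotonicClock(ZoneOffset.UTC);

    // After: java.time's SystemClock cannot be instantiated directly,
    // so it is obtained via a factory method.
    Clock clock = Clock.system(ZoneOffset.UTC);   // or Clock.systemUTC()

    System.out.println(clock.instant()); // epoch-based, comparable with persisted Instants
  }
}
```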
@sodonnel is
@kerneltime The monotonic clock as it stands can't be used to compare times across JVM restarts. Its starting point is some meaningless instant that can only be used to compare durations within the same JVM instance. I feel this could result in more unexpected bugs like this one. If not SystemClock, then what do you suggest?
Hi @kerneltime @sodonnel, please suggest the change to be done for this issue.
For PipelineManager we need to switch from MonotonicClock to an instance of java.time's SystemClock instead. Going forward with MonotonicClock in other places feels like a risk too, but for now we can fix the immediate problem as above.
sodonnel left a comment
LGTM - we can commit after a green CI run.
* master: (88 commits)
  HDDS-7463. SCM Pipeline scrubber never able to cleanup allocated pipeline. (apache#4093)
  HDDS-7683. EC: ReplicationManager - UnderRep maintenance handler should not request nodes if none needed (apache#4109)
  HDDS-7635. Update failure metrics when allocate block fails in preExecute. (apache#4086)
  HDDS-7565. FSO purge directory for old bucket can update quota for new bucket (apache#4021)
  HDDS-7654. EC: ReplicationManager - merge mis-rep queue into under replicated queue (apache#4099)
  HDDS-7621. Update SCM term in datanode from heartbeat without any commands (apache#4101)
  HDDS-7649. S3 multipart upload EC release space quota wrong for old version (apache#4095)
  HDDS-7399. Enable specifying external root ca (apache#4053)
  HDDS-7398. Tool to remove old certs from the scm db (apache#3972)
  HDDS-6650. S3MultipartUpload support update bucket usedNamespace. (apache#4081)
  HDDS-7605. Improve logging in Container Balancer (apache#4067)
  HDDS-7616. EC: Refactor Unhealthy Replicated Processor (apache#4063)
  HDDS-7426. Add a new acceptance test for Streaming Pipeline. (apache#4019)
  HDDS-7478. [Ozone-Streaming] NPE in when creating a file with o3fs. (apache#3949)
  HDDS-7425. Add documentation for the new Streaming Pipeline feature. (apache#3913)
  HDDS-7438. [Ozone-Streaming] Add a createStreamKey method to OzoneBucket. (apache#3914)
  HDDS-7431. [Ozone-Streaming] Disable data steam by default. (apache#3900)
  HDDS-6955. [Ozone-streaming] Add explicit stream flag in ozone shell (apache#3559)
  HDDS-6867. [Ozone-Streaming] PutKeyHandler should not use streaming to put EC key. (apache#3516)
  HDDS-6842. [Ozone-Streaming] Reduce the number of watch requests in StreamCommitWatcher. (apache#3492)
  ...
HDDS-7463. SCM Pipeline scrubber never able to cleanup allocated pipeline. (apache#4093) (cherry picked from commit c7785fa) Change-Id: I7308a10f8ff61baf6cfd92031f622187a7c14cc7
What changes were proposed in this pull request?
While creating a pipeline, the creation time is recorded using the java.time package (Instant.now()). During the pipeline scrub we should use the same java.time-based clock for a correct comparison, so that pipelines which have stayed in the allocated state for a long time can be removed.
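The following standalone sketch (illustrative values only, not the actual SCM code) shows the mismatch described above: with an epoch-based creation time and a monotonic-clock "now", the staleness check can never become true, so the scrubber never removes the pipeline.

```java
import java.time.Duration;
import java.time.Instant;

public class ScrubberMismatchDemo {
  public static void main(String[] args) {
    // Creation time recorded with an epoch-based clock (Instant.now()): ~1.67e12 ms.
    Instant creationTime = Instant.parse("2022-11-15T10:00:00Z");

    // "Now" taken from a monotonic clock: millis since JVM start, e.g. a few hours.
    Instant monotonicNow = Instant.ofEpochMilli(Duration.ofHours(3).toMillis());

    Duration scrubTimeout = Duration.ofMinutes(5);

    // Staleness check: has the pipeline been allocated longer than the timeout?
    boolean stale = creationTime.plus(scrubTimeout).isBefore(monotonicNow);
    System.out.println("stale = " + stale); // always false -> pipeline is never scrubbed

    // With an epoch-based "now" (e.g. Clock.systemUTC().instant()) the comparison works.
    Instant systemNow = Instant.parse("2022-11-15T12:00:00Z");
    System.out.println("stale = " + creationTime.plus(scrubTimeout).isBefore(systemNow)); // true
  }
}
```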
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-7463
How was this patch tested?
Code has been built locally and manually tested.