HADOOP-17728. HDFS-16033. Fix issue of the StatisticsDataReferenceCleaner cleanUp #3042
Gentle cc @steveloughran: could you help take a look when you have time, please?
while (!Thread.interrupted()) {
  try {
    StatisticsDataReference ref =
        (StatisticsDataReference)STATS_DATA_REF_QUEUE.remove();
For reviewers, here is the relevant code quoted from the JDK's ReferenceQueue:
public Reference<? extends T> remove(long timeout)
    throws IllegalArgumentException, InterruptedException
{
    if (timeout < 0) {
        throw new IllegalArgumentException("Negative timeout value");
    }
    synchronized (lock) {
        Reference<? extends T> r = reallyPoll();
        if (r != null) return r;
        long start = (timeout == 0) ? 0 : System.nanoTime();
        for (;;) {
            lock.wait(timeout);
            r = reallyPoll();
            if (r != null) return r;
            if (timeout != 0) {
                long end = System.nanoTime();
                timeout -= (end - start) / 1000_000;
                if (timeout <= 0) return null;
                start = end;
            }
        }
    }
}

/**
 * Removes the next reference object in this queue, blocking until one
 * becomes available.
 *
 * @return A reference object, blocking until one becomes available
 * @throws InterruptedException If the wait is interrupted
 */
public Reference<? extends T> remove() throws InterruptedException {
    return remove(0);
}
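To illustrate what the quoted JDK code implies, here is a minimal sketch (the class and method names are hypothetical, not part of Hadoop or the JDK). On an empty queue, remove(timeout) with a positive timeout returns null once the timeout elapses, whereas remove() would keep waiting until something is enqueued.

```java
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;

// Hypothetical demo class, for illustration only.
final class TimedRemoveDemo {

    /**
     * Calls remove(timeoutMs) on a fresh, empty queue.
     * Returns the time waited in milliseconds if null came back,
     * or -1 if a reference was unexpectedly returned.
     */
    static long timedRemoveOnEmptyQueue(long timeoutMs) throws InterruptedException {
        ReferenceQueue<Object> queue = new ReferenceQueue<>();
        long start = System.nanoTime();
        // Nothing is ever enqueued, so this can only return by timing out.
        Reference<?> ref = queue.remove(timeoutMs);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        return ref == null ? elapsedMs : -1;
    }

    public static void main(String[] args) throws InterruptedException {
        long waited = timedRemoveOnEmptyQueue(200);
        System.out.println("remove(200) returned null after about " + waited + " ms");
    }
}
```

Calling queue.remove() in the same situation would not return at all, which is the blocking behaviour this PR is concerned with.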
|
💔 -1 overall
This message was automatically generated. |
 * Background action to act on references being removed.
 */
private static class StatisticsDataReferenceCleaner implements Runnable {
  private static int REF_QUEUE_POLL_TIMEOUT = 100;
That's going to be waking every 100 milliseconds, demanding CPU time etc. If there has to be a timeout, it needs to be something less disruptive, like 100 seconds.
What would happen if that was the case?
Yes, we should take this timeout into consideration. If it is too long, many resources will be waiting to be recycled within a given period of time, which may lead to low recycling efficiency.
We can refer to this: referenceQueue timeout
|
I'm going to pull @liuml07 in to help review this, as it's a bit of code they've gone near in the past and weak reference queues are new to me |
|
Thanks for including me, @steveloughran. Let me first understand the problem: unless new reference object is available in the queue (Java code calling To fix the problem, here we propose to call That makes sense to me, if I understand the problem and solution correctly. Let me know, @yikf. As to implementation, I agree 100s might be too stingy for this cleanup (we remove one every time, so essentially 100s to clean up one at best). I'm also wondering if 100ms is too generous here. How many threads do we target here? To the best of my knowledge, 1K is pretty large and close to the upper limit. To clean up everything eventually AND without any help from enqueue events, it takes 10 min if the timeout is 600ms. Is this a reasonable value? I see you refer to Spark settings, but I assume that is targeting many more references, including RDD, shuffle, and broadcast state etc.?
Thanks for the review. In fact, If there are reference objects in reference JDK Code snippet To sum up, we won't have to wait as long as 10 min. FYI
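The point that a non-empty queue does not cost one timeout per reference can be checked with a small sketch. The class and method names below are hypothetical, and the references are enqueued by hand via Reference.enqueue() rather than by the GC so the run is repeatable. Because remove(timeout) polls the queue before it waits, draining 1000 already-enqueued references with a 600 ms timeout should finish far sooner than 1000 x 600 ms = 10 min.

```java
import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;

// Hypothetical demo class, for illustration only.
final class DrainTimingDemo {

    /**
     * Enqueues count references by hand, then removes them with a timed remove.
     * Returns {number removed, elapsed milliseconds}.
     */
    static long[] drain(int count, long timeoutMs) throws InterruptedException {
        ReferenceQueue<Object> queue = new ReferenceQueue<>();
        for (int i = 0; i < count; i++) {
            // enqueue() puts the reference on its registered queue immediately.
            new WeakReference<Object>(new Object(), queue).enqueue();
        }
        long start = System.nanoTime();
        long removed = 0;
        // Stop once everything is drained so we never wait on an empty queue.
        while (removed < count && queue.remove(timeoutMs) != null) {
            removed++;
        }
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        return new long[] {removed, elapsedMs};
    }

    public static void main(String[] args) throws InterruptedException {
        long[] result = drain(1000, 600);
        System.out.println("removed " + result[0] + " references in " + result[1] + " ms");
    }
}
```

The timeout only matters when the queue is empty, so it bounds the delay between a GC enqueueing references and the cleaner noticing them, not the cost per reference.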
|
@liuml07 Oh sorry, I misunderstood what you meant. You are right: in Spark, that timeout targets many more references. For the timeout here, we need to consider the rate at which reference objects are generated. But sorry, I'm not familiar with this area, and I don't know whether 600ms is an appropriate value.
|
I think the code looks good to me. As for the timeout value, I am thinking about something between 100ms and 100s, probably closer to 100ms. What are your thoughts, @steveloughran? Thanks,
|
anything under a few seconds is potentially taking work from others, and, as discussed, once the thread is woken up it's going to be able to run through the entire queue. Which will only happen after a GC event. So we are worrying about
I'm happy with something like a 10s delay from GC to thread live. Make it a constant, with javadoc etc so everyone understands what and why.... |
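A minimal sketch of what such a documented constant and the timed loop could look like. This is not the committed Hadoop code: the class name, the constant name, the constructor, and the counter standing in for the real cleanUp step are all assumptions made for illustration.

```java
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical cleaner, for illustration only.
final class CleanerSketch implements Runnable {

    /**
     * Maximum time, in milliseconds, that the cleaner blocks waiting for a
     * reference before re-checking its interrupt status. References are only
     * enqueued after a GC event, and a non-empty queue is served without
     * waiting, so this value bounds the delay from GC to cleanup without
     * waking the thread needlessly often. The 10s figure follows the review
     * discussion.
     */
    static final long REF_QUEUE_POLL_TIMEOUT_MS = 10_000L;

    private final ReferenceQueue<Object> queue;
    private final long timeoutMs;
    /** Counts processed references; stands in for the real cleanup work. */
    final AtomicInteger cleaned = new AtomicInteger();

    CleanerSketch(ReferenceQueue<Object> queue, long timeoutMs) {
        this.queue = queue;
        this.timeoutMs = timeoutMs;
    }

    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            try {
                // Returns at once if a reference is queued, else waits up to timeoutMs.
                Reference<?> ref = queue.remove(timeoutMs);
                if (ref != null) {
                    cleaned.incrementAndGet();
                }
            } catch (InterruptedException e) {
                // Restore the flag so the loop condition sees it and exits.
                Thread.currentThread().interrupt();
            }
        }
    }
}
```

A caller would typically run this on a daemon thread, for example new Thread(new CleanerSketch(queue, CleanerSketch.REF_QUEUE_POLL_TIMEOUT_MS)), and interrupt it to shut it down.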
Gentle ping @steveloughran @liuml07. Thanks for the review. OK, shall we use 10s as the delay? If so, I'll update later. I am not familiar with the objects that need to be collected; we need to consider the rate at which they will be collected and then choose an appropriate value.
|
@steveloughran @liuml07 updated |
|
gentle ping @steveloughran Could you help me take a look when you have time? |
|
I will commit in 24 hours if no more comments. Thanks, |
…3042) Contributed by kaifeiYi (yikf). Signed-off-by: Mingliang Liu <[email protected]> Reviewed-by: Steve Loughran <[email protected]>
…eanUp (apache#3042)" This reverts commit 4a26a61.
What changes were proposed in this pull request?
In StatisticsDataReferenceCleaner, the cleaner thread blocks when it removes a reference from the ReferenceQueue, unless queue.enqueue is called. As shown below, cleanUp currently calls ReferenceQueue.remove(), with the following call chain:
StatisticsDataReferenceCleaner#queue.remove() -> ReferenceQueue.remove(0) -> lock.wait(0)
lock.wait(0) waits indefinitely unless lock.notify/notifyAll is called. However, lock.notifyAll is called only from queue.enqueue, so the cleaner thread stays blocked.
ThreadDump: