Skip to content

Conversation

@yikf
Copy link
Contributor

@yikf yikf commented May 21, 2021

What changes were proposed in this pull request?

In StatisticsDataReferenceCleaner, Cleaner thread will be blocked if we remove reference from ReferenceQueue unless the queue.enqueue` called but no now.

As shown below, We call ReferenceQueue.remove() now while cleanUp, Call chain as follow:
StatisticsDataReferenceCleaner#queue.remove() -> ReferenceQueue.remove(0) -> lock.wait(0)

lock.wait(0) will waitting perpetual unless lock.notify/notifyAll be called, But, lock.notifyAll is called when queue.enqueue only, so Cleaner thread will be blocked.

ThreadDump:

"Reference Handler" #2 daemon prio=10 os_prio=0 tid=0x00007f7afc088800 nid=0x2119 in Object.wait() [0x00007f7b00230000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x00000000c00c2f58> (a java.lang.ref.Reference$Lock)
        at java.lang.Object.wait(Object.java:502)
        at java.lang.ref.Reference.tryHandlePending(Reference.java:191)
        - locked <0x00000000c00c2f58> (a java.lang.ref.Reference$Lock)
        at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:153)

@yikf
Copy link
Contributor Author

yikf commented May 21, 2021

gentle cc @steveloughran Could you help take a look when you have a time, please.

while (!Thread.interrupted()) {
try {
StatisticsDataReference ref =
(StatisticsDataReference)STATS_DATA_REF_QUEUE.remove();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For reviewer, quote JDK ReferenceQueue#:

    public Reference<? extends T> remove(long timeout)
        throws IllegalArgumentException, InterruptedException
    {
        if (timeout < 0) {
            throw new IllegalArgumentException("Negative timeout value");
        }
        synchronized (lock) {
            Reference<? extends T> r = reallyPoll();
            if (r != null) return r;
            long start = (timeout == 0) ? 0 : System.nanoTime();
            for (;;) {
                lock.wait(timeout);
                r = reallyPoll();
                if (r != null) return r;
                if (timeout != 0) {
                    long end = System.nanoTime();
                    timeout -= (end - start) / 1000_000;
                    if (timeout <= 0) return null;
                    start = end;
                }
            }
        }
    }

    /**
     * Removes the next reference object in this queue, blocking until one
     * becomes available.
     *
     * @return A reference object, blocking until one becomes available
     * @throws  InterruptedException  If the wait is interrupted
     */
    public Reference<? extends T> remove() throws InterruptedException {
        return remove(0);
    }

@hadoop-yetus
Copy link

💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 50s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 1s codespell was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
-1 ❌ test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
_ trunk Compile Tests _
+1 💚 mvninstall 33m 57s trunk passed
+1 💚 compile 20m 48s trunk passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04
+1 💚 compile 18m 0s trunk passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08
+1 💚 checkstyle 1m 7s trunk passed
+1 💚 mvnsite 1m 33s trunk passed
+1 💚 javadoc 1m 6s trunk passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04
+1 💚 javadoc 1m 39s trunk passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08
+1 💚 spotbugs 2m 22s trunk passed
+1 💚 shadedclient 15m 23s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+1 💚 mvninstall 0m 52s the patch passed
+1 💚 compile 20m 8s the patch passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04
+1 💚 javac 20m 8s the patch passed
+1 💚 compile 17m 59s the patch passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08
+1 💚 javac 17m 59s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
-0 ⚠️ checkstyle 1m 5s /results-checkstyle-hadoop-common-project_hadoop-common.txt hadoop-common-project/hadoop-common: The patch generated 2 new + 75 unchanged - 0 fixed = 77 total (was 75)
+1 💚 mvnsite 1m 30s the patch passed
+1 💚 javadoc 1m 3s the patch passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04
+1 💚 javadoc 1m 37s the patch passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08
+1 💚 spotbugs 2m 32s the patch passed
+1 💚 shadedclient 15m 47s patch has no errors when building and testing our client artifacts.
_ Other Tests _
+1 💚 unit 17m 3s hadoop-common in the patch passed.
+1 💚 asflicense 0m 53s The patch does not generate ASF License warnings.
177m 49s
Subsystem Report/Notes
Docker ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3042/1/artifact/out/Dockerfile
GITHUB PR #3042
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell
uname Linux ee328c833ac4 4.15.0-112-generic #113-Ubuntu SMP Thu Jul 9 23:41:39 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / a1ec603
Default Java Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3042/1/testReport/
Max. process+thread count 1280 (vs. ulimit of 5500)
modules C: hadoop-common-project/hadoop-common U: hadoop-common-project/hadoop-common
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3042/1/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0-SNAPSHOT https://yetus.apache.org

This message was automatically generated.

@hadoop-yetus
Copy link

💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 35s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 0s codespell was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
-1 ❌ test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
_ trunk Compile Tests _
+1 💚 mvninstall 34m 11s trunk passed
+1 💚 compile 20m 46s trunk passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04
+1 💚 compile 18m 17s trunk passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08
+1 💚 checkstyle 1m 8s trunk passed
+1 💚 mvnsite 1m 34s trunk passed
+1 💚 javadoc 1m 8s trunk passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04
+1 💚 javadoc 1m 40s trunk passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08
+1 💚 spotbugs 2m 23s trunk passed
+1 💚 shadedclient 15m 41s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+1 💚 mvninstall 0m 51s the patch passed
+1 💚 compile 20m 9s the patch passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04
+1 💚 javac 20m 9s the patch passed
+1 💚 compile 18m 6s the patch passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08
+1 💚 javac 18m 6s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
-0 ⚠️ checkstyle 1m 7s /results-checkstyle-hadoop-common-project_hadoop-common.txt hadoop-common-project/hadoop-common: The patch generated 1 new + 75 unchanged - 0 fixed = 76 total (was 75)
+1 💚 mvnsite 1m 33s the patch passed
+1 💚 javadoc 0m 57s the patch passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04
+1 💚 javadoc 1m 38s the patch passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08
+1 💚 spotbugs 2m 37s the patch passed
+1 💚 shadedclient 15m 57s patch has no errors when building and testing our client artifacts.
_ Other Tests _
+1 💚 unit 16m 59s hadoop-common in the patch passed.
+1 💚 asflicense 0m 56s The patch does not generate ASF License warnings.
178m 41s
Subsystem Report/Notes
Docker ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3042/2/artifact/out/Dockerfile
GITHUB PR #3042
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell
uname Linux 89a115fef54b 4.15.0-58-generic #64-Ubuntu SMP Tue Aug 6 11:12:41 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / 6cf7c77
Default Java Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3042/2/testReport/
Max. process+thread count 3158 (vs. ulimit of 5500)
modules C: hadoop-common-project/hadoop-common U: hadoop-common-project/hadoop-common
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3042/2/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0-SNAPSHOT https://yetus.apache.org

This message was automatically generated.

* Background action to act on references being removed.
*/
private static class StatisticsDataReferenceCleaner implements Runnable {
private static int REF_QUEUE_POLL_TIMEOUT = 100;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's going to be waking every 100 milliseconds, demanding cpu time etc etc. If there has to be a timeout, it needs to be something less disruptive, like 100 seconds.

What would happen if that was the case?

Copy link
Contributor Author

@yikf yikf May 25, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

En, We should take this time into consideration. If it is too long, many resources need to be recycled in a certain period of time, which may lead to low recycling efficiency.

we can refer this: referenceQueue timeput

@steveloughran
Copy link
Contributor

I'm going to pull @liuml07 in to help review this, as it's a bit of code they've gone near in the past and weak reference queues are new to me

@liuml07
Copy link
Member

liuml07 commented May 25, 2021

Thanks for including me, @steveloughran

Let me first understand the problem: unless new reference object is available in the queue (Java code calling enqueue()), those existing references will not be cleaned up forever. That is because when calling remove(), the StatisticsDataReferenceCleaner thread will wait forever in case there is no notify/notifyAll events upon the internal queue lock.

To fix the problem, here we propose to call remove(timeout) version in the StatisticsDataReferenceCleaner thread. Its timeout value will be honored when waiting for internal queue lock. That will give the cleaner thread an opportunity to dequeue periodically - instead of getting blocked forever if no notify event happens to the internal queue lock. Eventually, all reference object in the queue will get cleaned up by cleaner with this mechanism.

That makes sense to me, if I understand the problem and solution correctly. Let me know @yikf

As to implementation, I agree 100s might be too stingy to this cleanup (we remove one every time, so essentially 100s to cleanup one at best). I'm also wondering if 100ms is too generous here. How many threads do we target here? To my best knowledge, 1K is pretty large and close to the upper limit. To cleanup everything eventually AND without any help of enqueue events, it takes 10min to cleanup everything, if the timeout is 600ms. Is this a reasonable value?

I see you refer to Spark settings, but I assume that is targeting much more references including RDD, shuffle, and broadcast state etc?

@yikf
Copy link
Contributor Author

yikf commented May 26, 2021

Thanks for including me, @steveloughran

Let me first understand the problem: unless new reference object is available in the queue (Java code calling enqueue()), those existing references will not be cleaned up forever. That is because when calling remove(), the StatisticsDataReferenceCleaner thread will wait forever in case there is no notify/notifyAll events upon the internal queue lock.

To fix the problem, here we propose to call remove(timeout) version in the StatisticsDataReferenceCleaner thread. Its timeout value will be honored when waiting for internal queue lock. That will give the cleaner thread an opportunity to dequeue periodically - instead of getting blocked forever if no notify event happens to the internal queue lock. Eventually, all reference object in the queue will get cleaned up by cleaner with this mechanism.

That makes sense to me, if I understand the problem and solution correctly. Let me know @yikf

As to implementation, I agree 100s might be too stingy to this cleanup (we remove one every time, so essentially 100s to cleanup one at best). I'm also wondering if 100ms is too generous here. How many threads do we target here? To my best knowledge, 1K is pretty large and close to the upper limit. To cleanup everything eventually AND without any help of enqueue events, it takes 10min to cleanup everything, if the timeout is 600ms. Is this a reasonable value?

I see you refer to Spark settings, but I assume that is targeting much more references including RDD, shuffle, and broadcast state etc?

Thanks for review, In fact, If there are reference objects in ReferenceQueue, ReferenceQueue.remove will be returned directly without wait.

reference JDK Code snippet

public Reference<? extends T> remove(long timeout)
        throws IllegalArgumentException, InterruptedException
    {
        if (timeout < 0) {
            throw new IllegalArgumentException("Negative timeout value");
        }
        synchronized (lock) {
            Reference<? extends T> r = reallyPoll();
            if (r != null) return r;
            long start = (timeout == 0) ? 0 : System.nanoTime();
            for (;;) {
                lock.wait(timeout);
…

To sum up, we won't have to wait very long like 10min, FYI

@yikf
Copy link
Contributor Author

yikf commented May 26, 2021

@liuml07 Oh sorry, I misunderstood what you meant

you are right, In Spark, that is targeting much more references.

For the timeout, we need to consider the rate at which the reference object is generated. But sorry, I'm not familiar with this area, and I don't quite understand whether 600ms is an appropriate value

@liuml07
Copy link
Member

liuml07 commented May 26, 2021

I think the code looks good to me. Just the timeout value I am thinking about something between 100ms to 100s, mostly close to 100ms. What is your thought @steveloughran ? Thanks,

@steveloughran
Copy link
Contributor

anything under a few seconds is potentially taking work from others, and, as discussed, once the thread is woken up it's going to be able to run through the entire queue. Which will only happen after a GC event.

So we are worrying about

  1. How long it takes after a GC event for the thread to wake up and process data
  2. What is the cost to the rest of the system with the thread waking up and going back to sleep.

I'm happy with something like a 10s delay from GC to thread live. Make it a constant, with javadoc etc so everyone understands what and why....

@yikf
Copy link
Contributor Author

yikf commented May 28, 2021

anything under a few seconds is potentially taking work from others, and, as discussed, once the thread is woken up it's going to be able to run through the entire queue. Which will only happen after a GC event.

So we are worrying about

  1. How long it takes after a GC event for the thread to wake up and process data
  2. What is the cost to the rest of the system with the thread waking up and going back to sleep.

I'm happy with something like a 10s delay from GC to thread live. Make it a constant, with javadoc etc so everyone understands what and why....

gentle ping @steveloughran @liuml07 Thanks for review

OK, Will we use 10s as the delay?

If so, I'll update later. I am not familiar with the object that needs to be collected, we need to consider the rate at which the object will be collected and then choose an appropriate value

@yikf
Copy link
Contributor Author

yikf commented May 31, 2021

@steveloughran @liuml07 updated

@hadoop-yetus
Copy link

💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 52s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 1s codespell was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
-1 ❌ test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
_ trunk Compile Tests _
+1 💚 mvninstall 34m 11s trunk passed
+1 💚 compile 20m 57s trunk passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04
+1 💚 compile 18m 15s trunk passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08
+1 💚 checkstyle 1m 9s trunk passed
+1 💚 mvnsite 1m 36s trunk passed
+1 💚 javadoc 1m 9s trunk passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04
+1 💚 javadoc 1m 38s trunk passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08
+1 💚 spotbugs 2m 26s trunk passed
+1 💚 shadedclient 15m 51s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+1 💚 mvninstall 0m 54s the patch passed
+1 💚 compile 20m 3s the patch passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04
+1 💚 javac 20m 3s the patch passed
+1 💚 compile 19m 4s the patch passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08
+1 💚 javac 19m 4s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
+1 💚 checkstyle 1m 8s the patch passed
+1 💚 mvnsite 1m 34s the patch passed
+1 💚 javadoc 1m 4s the patch passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04
+1 💚 javadoc 1m 41s the patch passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08
+1 💚 spotbugs 2m 44s the patch passed
+1 💚 shadedclient 15m 48s patch has no errors when building and testing our client artifacts.
_ Other Tests _
+1 💚 unit 17m 14s hadoop-common in the patch passed.
+1 💚 asflicense 0m 54s The patch does not generate ASF License warnings.
180m 13s
Subsystem Report/Notes
Docker ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3042/3/artifact/out/Dockerfile
GITHUB PR #3042
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell
uname Linux 9eb2123aeb48 4.15.0-112-generic #113-Ubuntu SMP Thu Jul 9 23:41:39 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / 8b59d92
Default Java Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3042/3/testReport/
Max. process+thread count 1469 (vs. ulimit of 5500)
modules C: hadoop-common-project/hadoop-common U: hadoop-common-project/hadoop-common
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3042/3/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0-SNAPSHOT https://yetus.apache.org

This message was automatically generated.

@liuml07 liuml07 self-requested a review June 1, 2021 00:53
@yikf
Copy link
Contributor Author

yikf commented Jun 4, 2021

gentle ping @steveloughran Could you help me take a look when you have time?

@liuml07
Copy link
Member

liuml07 commented Jun 4, 2021

I will commit in 24 hours if no more comments. Thanks,

@liuml07 liuml07 merged commit 4a26a61 into apache:trunk Jun 4, 2021
asfgit pushed a commit that referenced this pull request Jun 4, 2021
…3042)

Contributed by kaifeiYi (yikf).

Signed-off-by: Mingliang Liu <[email protected]>
Reviewed-by: Steve Loughran <[email protected]>
asfgit pushed a commit that referenced this pull request Jun 4, 2021
…3042)

Contributed by kaifeiYi (yikf).

Signed-off-by: Mingliang Liu <[email protected]>
Reviewed-by: Steve Loughran <[email protected]>
zhaomin1423 pushed a commit to zhaomin1423/hadoop that referenced this pull request Jun 8, 2021
…pache#3042)

Contributed by kaifeiYi (yikf).

Signed-off-by: Mingliang Liu <[email protected]>
Reviewed-by: Steve Loughran <[email protected]>
asfgit pushed a commit that referenced this pull request Jun 11, 2021
asfgit pushed a commit that referenced this pull request Jun 11, 2021
asfgit pushed a commit that referenced this pull request Jun 11, 2021
@liuml07 liuml07 changed the title HDFS-16033 Fix issue of the StatisticsDataReferenceCleaner cleanUp HADOOP-17728. HDFS-16033. Fix issue of the StatisticsDataReferenceCleaner cleanUp Jun 11, 2021
kiran-maturi pushed a commit to kiran-maturi/hadoop that referenced this pull request Nov 24, 2021
…pache#3042)

Contributed by kaifeiYi (yikf).

Signed-off-by: Mingliang Liu <[email protected]>
Reviewed-by: Steve Loughran <[email protected]>
kiran-maturi pushed a commit to kiran-maturi/hadoop that referenced this pull request Nov 24, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants