Skip to content

[SUPPORT] Deltastreamer job does not terminate on Kubernetes when hoodie.metrics.on=true #5249

@toninis

Description

@toninis

Describe the problem you faced

We 've noticed that when you enable hoodie JMX metrics Shutdown hook is never called.
We took a thread dump to check with threads remain in running state and keep the JVM up .

"RMI TCP Accept-9889": running
	at [java.net.PlainSocketImpl.socketAccept(Native Method)](https://spotify.github.io/threaddump-analyzer/#java.net.PlainSocketImpl.socketAccept(Native%20Method))
	at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:409)
	at java.net.ServerSocket.implAccept(ServerSocket.java:560)
	at java.net.ServerSocket.accept(ServerSocket.java:528)
	at sun.rmi.transport.tcp.TCPTransport$AcceptLoop.executeAcceptLoop(TCPTransport.java:405)
	at sun.rmi.transport.tcp.TCPTransport$AcceptLoop.run(TCPTransport.java:377)
	at java.lang.Thread.run(Thread.java:750)

To Reproduce

Steps to reproduce the behavior:

  1. Enable JMX hoodie metrics
    hoodie.metrics.on=true
    hoodie.metrics.reporter.type=JMX
  1. Run a Deltastreamer job on Kubernetes

Expected behavior

We expect when Spark Context is stopped also the JVM should stop.

Environment Description

  • Hudi version : 0.9.0

  • Spark version : 3.1.2

  • Hive version : 2.3.7

  • Hadoop version : N/A

  • Storage (HDFS/S3/GCS..) : S3

  • Running on Docker? (yes/no) : yes

  • Java Version :

openjdk version "1.8.0_322"
OpenJDK Runtime Environment (build 1.8.0_322-b06)
OpenJDK 64-Bit Server VM (build 25.322-b06, mixed mode)

&

openjdk version "11.0.14.1" 2022-02-08
OpenJDK Runtime Environment 18.9 (build 11.0.14.1+1)
OpenJDK 64-Bit Server VM 18.9 (build 11.0.14.1+1, mixed mode, sharing)

Additional context

We tested this with both Java 11 and Java 8 and got the same results .
You can find below the full thread dump output once the Spark Context is succefully stoppped .
We got it with kill -3 <pid> .

The only relevant issue I could find is this

Full Thread Dump Output
Full thread dump OpenJDK 64-Bit Server VM (25.322-b06 mixed mode):

"DestroyJavaVM" #73 prio=5 os_prio=0 tid=0x00007f8a7c011000 nid=0x1a waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"RMI TCP Connection(idle)" #67 daemon prio=5 os_prio=0 tid=0x00007f8974003800 nid=0x5d waiting on condition [0x00007f897b8f9000]
   java.lang.Thread.State: TIMED_WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for  <0x00000000c10bc948> (a java.util.concurrent.SynchronousQueue$TransferStack)
    at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
    at java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(SynchronousQueue.java:460)
    at java.util.concurrent.SynchronousQueue$TransferStack.transfer(SynchronousQueue.java:362)
    at java.util.concurrent.SynchronousQueue.poll(SynchronousQueue.java:941)
    at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1073)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)

"RMI RenewClean-[100.97.64.2:9889]" #66 daemon prio=5 os_prio=0 tid=0x00007f896c006800 nid=0x5c in Object.wait() [0x00007f897b9fa000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:144)
    - locked <0x00000000c1064a20> (a java.lang.ref.ReferenceQueue$Lock)
    at sun.rmi.transport.DGCClient$EndpointEntry$RenewCleanThread.run(DGCClient.java:563)
    at java.lang.Thread.run(Thread.java:750)

"RMI Scheduler(0)" #65 daemon prio=5 os_prio=0 tid=0x00007f8a7ddf6000 nid=0x5b waiting on condition [0x00007f897bafb000]
   java.lang.Thread.State: TIMED_WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for  <0x00000000c1064c28> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
    at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1093)
    at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:809)
    at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)

"RMI TCP Connection(idle)" #64 daemon prio=5 os_prio=0 tid=0x00007f8974001800 nid=0x5a waiting on condition [0x00007f897bbfc000]
   java.lang.Thread.State: TIMED_WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for  <0x00000000c10bc948> (a java.util.concurrent.SynchronousQueue$TransferStack)
    at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
    at java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(SynchronousQueue.java:460)
    at java.util.concurrent.SynchronousQueue$TransferStack.transfer(SynchronousQueue.java:362)
    at java.util.concurrent.SynchronousQueue.poll(SynchronousQueue.java:941)
    at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1073)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)

"GC Daemon" #63 daemon prio=2 os_prio=0 tid=0x00007f8a7dde9800 nid=0x59 in Object.wait() [0x00007f897befd000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    at sun.misc.GC$Daemon.run(GC.java:117)
    - locked <0x00000000c1067260> (a sun.misc.GC$LatencyLock)

"RMI Reaper" #62 prio=5 os_prio=0 tid=0x00007f8a7dde7800 nid=0x58 in Object.wait() [0x00007f897bffe000]
   java.lang.Thread.State: WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:144)
    - locked <0x00000000c10673e8> (a java.lang.ref.ReferenceQueue$Lock)
    at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:165)
    at sun.rmi.transport.ObjectTable$Reaper.run(ObjectTable.java:351)
    at java.lang.Thread.run(Thread.java:750)

"RMI TCP Accept-9889" #61 daemon prio=5 os_prio=0 tid=0x00007f8a7c275800 nid=0x57 runnable [0x00007f89d81e5000]
   java.lang.Thread.State: RUNNABLE
    at java.net.PlainSocketImpl.socketAccept(Native Method)
    at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:409)
    at java.net.ServerSocket.implAccept(ServerSocket.java:560)
    at java.net.ServerSocket.accept(ServerSocket.java:528)
    at sun.rmi.transport.tcp.TCPTransport$AcceptLoop.executeAcceptLoop(TCPTransport.java:405)
    at sun.rmi.transport.tcp.TCPTransport$AcceptLoop.run(TCPTransport.java:377)
    at java.lang.Thread.run(Thread.java:750)

"java-sdk-http-connection-reaper" #58 daemon prio=5 os_prio=0 tid=0x00007f8a7da85800 nid=0x54 waiting on condition [0x00007f89d8ae8000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
    at java.lang.Thread.sleep(Native Method)
    at com.amazonaws.http.IdleConnectionReaper.run(IdleConnectionReaper.java:181)

"Timer for 's3a-file-system' metrics system" #57 daemon prio=5 os_prio=0 tid=0x00007f8a7e0fa000 nid=0x53 in Object.wait() [0x00007f89d8fe9000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    at java.util.TimerThread.mainLoop(Timer.java:552)
    - locked <0x00000000d6720a88> (a java.util.TaskQueue)
    at java.util.TimerThread.run(Timer.java:505)

"MutableQuantiles-0" #56 daemon prio=5 os_prio=0 tid=0x00007f8a7e0c2000 nid=0x52 waiting on condition [0x00007f89d90ea000]
   java.lang.Thread.State: TIMED_WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for  <0x00000000d66e87c8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
    at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1093)
    at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:809)
    at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)

"org.apache.hadoop.fs.FileSystem$Statistics$StatisticsDataReferenceCleaner" #55 daemon prio=5 os_prio=0 tid=0x00007f8a7e0bd000 nid=0x51 in Object.wait() [0x00007f89d91eb000]
   java.lang.Thread.State: WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:144)
    - locked <0x00000000d66e8a08> (a java.lang.ref.ReferenceQueue$Lock)
    at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:165)
    at org.apache.hadoop.fs.FileSystem$Statistics$StatisticsDataReferenceCleaner.run(FileSystem.java:3762)
    at java.lang.Thread.run(Thread.java:750)

"Okio Watchdog" #41 daemon prio=5 os_prio=0 tid=0x00007f8a7d1ea800 nid=0x41 in Object.wait() [0x00007f89dbbfe000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    at okio.AsyncTimeout.awaitTimeout(AsyncTimeout.java:347)
    at okio.AsyncTimeout$Watchdog.run(AsyncTimeout.java:312)
    - locked <0x00000000d62d18c8> (a java.lang.Class for okio.AsyncTimeout)

"OkHttp ConnectionPool" #40 daemon prio=5 os_prio=0 tid=0x00007f8a7da40000 nid=0x40 in Object.wait() [0x00007f8a4c69b000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    at java.lang.Object.wait(Object.java:460)
    at okhttp3.ConnectionPool$1.run(ConnectionPool.java:67)
    - locked <0x00000000d5ccb0e8> (a okhttp3.ConnectionPool)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)

"process reaper" #10 daemon prio=10 os_prio=0 tid=0x00007f8a7ec2d800 nid=0x24 waiting on condition [0x00007f8a6c082000]
   java.lang.Thread.State: TIMED_WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for  <0x00000000d5888a60> (a java.util.concurrent.SynchronousQueue$TransferStack)
    at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
    at java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(SynchronousQueue.java:460)
    at java.util.concurrent.SynchronousQueue$TransferStack.transfer(SynchronousQueue.java:362)
    at java.util.concurrent.SynchronousQueue.poll(SynchronousQueue.java:941)
    at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1073)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)

"Service Thread" #7 daemon prio=9 os_prio=0 tid=0x00007f8a7c0c1800 nid=0x21 runnable [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"C1 CompilerThread1" #6 daemon prio=9 os_prio=0 tid=0x00007f8a7c0bc800 nid=0x20 waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"C2 CompilerThread0" #5 daemon prio=9 os_prio=0 tid=0x00007f8a7c0ba800 nid=0x1f waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"Signal Dispatcher" #4 daemon prio=9 os_prio=0 tid=0x00007f8a7c0a6800 nid=0x1e waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"Finalizer" #3 daemon prio=8 os_prio=0 tid=0x00007f8a7c083000 nid=0x1d in Object.wait() [0x00007f8a810e3000]
   java.lang.Thread.State: WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:144)
    - locked <0x00000000d55feb78> (a java.lang.ref.ReferenceQueue$Lock)
    at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:165)
    at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:216)

"Reference Handler" #2 daemon prio=10 os_prio=0 tid=0x00007f8a7c07e800 nid=0x1c in Object.wait() [0x00007f8a811e4000]
   java.lang.Thread.State: WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    at java.lang.Object.wait(Object.java:502)
    at java.lang.ref.Reference.tryHandlePending(Reference.java:191)
    - locked <0x00000000d55fed30> (a java.lang.ref.Reference$Lock)
    at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:153)

"VM Thread" os_prio=0 tid=0x00007f8a7c075000 nid=0x1b runnable

"VM Periodic Task Thread" os_prio=0 tid=0x00007f8a7c0c4800 nid=0x22 waiting on condition

Metadata

Metadata

Labels

area:ingestIngestion into Hudipriority:criticalProduction degraded; pipelines stalledstatus:triagedIssue has been reviewed and categorized

Type

No type

Projects

Status

✅ Done

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions