Skip to content

Regression: Coordinator crashes with OOM in 468 #24572

@guyco33

Description

@guyco33

Coordinator in 468 consumes much more memory than 467 that leads to OS OOM.
What could be the reason for such memory consumption change ?

Listed below the output from free memory stats and gc logs during execution from both coordinator versions.

467
===
** after 15 minutes 

[2024-12-24T11:10:18.671+0000][7691][info][gc,start       ] GC(60) Pause Young (Normal) (G1 Evacuation Pause)
[2024-12-24T11:10:18.672+0000][7691][info][gc,task        ] GC(60) Using 4 workers of 4 for evacuation
[2024-12-24T11:10:18.834+0000][7691][info][gc,phases      ] GC(60)   Pre Evacuate Collection Set: 0.87ms
[2024-12-24T11:10:18.834+0000][7691][info][gc,phases      ] GC(60)   Merge Heap Roots: 0.44ms
[2024-12-24T11:10:18.834+0000][7691][info][gc,phases      ] GC(60)   Evacuate Collection Set: 159.86ms
[2024-12-24T11:10:18.834+0000][7691][info][gc,phases      ] GC(60)   Post Evacuate Collection Set: 1.28ms
[2024-12-24T11:10:18.834+0000][7691][info][gc,phases      ] GC(60)   Other: 0.12ms
[2024-12-24T11:10:18.834+0000][7691][info][gc,heap        ] GC(60) Eden regions: 100->0(89)
[2024-12-24T11:10:18.834+0000][7691][info][gc,heap        ] GC(60) Survivor regions: 8->9(14)
[2024-12-24T11:10:18.834+0000][7691][info][gc,heap        ] GC(60) Old regions: 34->34
[2024-12-24T11:10:18.834+0000][7691][info][gc,heap        ] GC(60) Humongous regions: 0->0
[2024-12-24T11:10:18.834+0000][7691][info][gc,metaspace   ] GC(60) Metaspace: 291741K(295488K)->291741K(295488K) NonClass: 254345K(256448K)->254345K(256448K) Class: 37395K(39040K)->37395K(39040K)
[2024-12-24T11:10:18.834+0000][7691][info][gc             ] GC(60) Pause Young (Normal) (G1 Evacuation Pause) 4449M->1279M(16384M) 162.817ms
[2024-12-24T11:10:18.834+0000][7691][info][gc,cpu         ] GC(60) User=0.43s Sys=0.04s Real=0.17s
               total        used        free      shared  buff/cache   available
Mem:              30           7          19           0           4          23
Swap:              0           0           0
Tue Dec 24 11:10:26 UTC 2024


** after 1:15 hour

[2024-12-24T12:09:59.680+0000][7686][info][gc,marking     ] GC(173) Concurrent Mark From Roots 2367.862ms
[2024-12-24T12:09:59.680+0000][7686][info][gc,marking     ] GC(173) Concurrent Preclean
[2024-12-24T12:09:59.681+0000][7686][info][gc,marking     ] GC(173) Concurrent Preclean 1.161ms
[2024-12-24T12:09:59.681+0000][7691][info][gc,start       ] GC(173) Pause Remark
[2024-12-24T12:09:59.724+0000][7691][info][gc             ] GC(173) Pause Remark 5988M->5691M(16384M) 42.204ms
[2024-12-24T12:09:59.724+0000][7691][info][gc,cpu         ] GC(173) User=0.15s Sys=0.00s Real=0.04s
[2024-12-24T12:09:59.726+0000][7686][info][gc,marking     ] GC(173) Concurrent Mark 2413.972ms
[2024-12-24T12:09:59.726+0000][7686][info][gc,marking     ] GC(173) Concurrent Rebuild Remembered Sets and Scrub Regions
[2024-12-24T12:10:00.210+0000][7686][info][gc,marking     ] GC(173) Concurrent Rebuild Remembered Sets and Scrub Regions 483.491ms
[2024-12-24T12:10:00.210+0000][7691][info][gc,start       ] GC(173) Pause Cleanup
[2024-12-24T12:10:00.210+0000][7691][info][gc             ] GC(173) Pause Cleanup 5725M->5725M(16384M) 0.054ms
[2024-12-24T12:10:00.210+0000][7691][info][gc,cpu         ] GC(173) User=0.00s Sys=0.00s Real=0.00s
[2024-12-24T12:10:00.210+0000][7686][info][gc,marking     ] GC(173) Concurrent Clear Claimed Marks
[2024-12-24T12:10:00.210+0000][7686][info][gc,marking     ] GC(173) Concurrent Clear Claimed Marks 0.427ms
[2024-12-24T12:10:00.210+0000][7686][info][gc,marking     ] GC(173) Concurrent Cleanup for Next Mark
[2024-12-24T12:10:00.228+0000][7686][info][gc,marking     ] GC(173) Concurrent Cleanup for Next Mark 17.547ms
[2024-12-24T12:10:00.228+0000][7686][info][gc             ] GC(173) Concurrent Mark Cycle 3285.446ms
               total        used        free      shared  buff/cache   available
Mem:              30          16          10           0           4          14
Swap:              0           0           0
Tue Dec 24 12:10:41 UTC 2024

===
468
===
** after 15 minutes 

[2024-12-24T11:10:04.019+0000][7679][info][gc,start       ] GC(62) Pause Young (Normal) (G1 Evacuation Pause)
[2024-12-24T11:10:04.019+0000][7679][info][gc,task        ] GC(62) Using 4 workers of 4 for evacuation
[2024-12-24T11:10:04.161+0000][7679][info][gc,phases      ] GC(62)   Pre Evacuate Collection Set: 0.32ms
[2024-12-24T11:10:04.161+0000][7679][info][gc,phases      ] GC(62)   Merge Heap Roots: 0.22ms
[2024-12-24T11:10:04.161+0000][7679][info][gc,phases      ] GC(62)   Evacuate Collection Set: 139.19ms
[2024-12-24T11:10:04.161+0000][7679][info][gc,phases      ] GC(62)   Post Evacuate Collection Set: 2.22ms
[2024-12-24T11:10:04.161+0000][7679][info][gc,phases      ] GC(62)   Other: 0.08ms
[2024-12-24T11:10:04.161+0000][7679][info][gc,heap        ] GC(62) Eden regions: 118->0(104)
[2024-12-24T11:10:04.161+0000][7679][info][gc,heap        ] GC(62) Survivor regions: 8->10(16)
[2024-12-24T11:10:04.161+0000][7679][info][gc,heap        ] GC(62) Old regions: 33->33
[2024-12-24T11:10:04.161+0000][7679][info][gc,heap        ] GC(62) Humongous regions: 0->0
[2024-12-24T11:10:04.161+0000][7679][info][gc,metaspace   ] GC(62) Metaspace: 292209K(296000K)->292209K(296000K) NonClass: 254676K(256768K)->254676K(256768K) Class: 37533K(39232K)->37533K(39232K)
[2024-12-24T11:10:04.161+0000][7679][info][gc             ] GC(62) Pause Young (Normal) (G1 Evacuation Pause) 5009M->1277M(16384M) 142.330ms
[2024-12-24T11:10:04.162+0000][7679][info][gc,cpu         ] GC(62) User=0.44s Sys=0.04s Real=0.14s
               total        used        free      shared  buff/cache   available
Mem:              30          16           9           0           4          13
Swap:              0           0           0
Tue Dec 24 11:10:27 UTC 2024


** after 1:15 hour

[2024-12-24T12:10:12.713+0000][7679][info][gc,start       ] GC(179) Pause Young (Normal) (G1 Evacuation Pause)
[2024-12-24T12:10:12.714+0000][7679][info][gc,task        ] GC(179) Using 4 workers of 4 for evacuation
[2024-12-24T12:10:12.795+0000][7679][info][gc,phases      ] GC(179)   Pre Evacuate Collection Set: 0.34ms
[2024-12-24T12:10:12.795+0000][7679][info][gc,phases      ] GC(179)   Merge Heap Roots: 0.24ms
[2024-12-24T12:10:12.795+0000][7679][info][gc,phases      ] GC(179)   Evacuate Collection Set: 77.50ms
[2024-12-24T12:10:12.795+0000][7679][info][gc,phases      ] GC(179)   Post Evacuate Collection Set: 2.34ms
[2024-12-24T12:10:12.795+0000][7679][info][gc,phases      ] GC(179)   Other: 0.09ms
[2024-12-24T12:10:12.795+0000][7679][info][gc,heap        ] GC(179) Eden regions: 303->0(302)
[2024-12-24T12:10:12.795+0000][7679][info][gc,heap        ] GC(179) Survivor regions: 4->5(39)
[2024-12-24T12:10:12.795+0000][7679][info][gc,heap        ] GC(179) Old regions: 113->113
[2024-12-24T12:10:12.795+0000][7679][info][gc,heap        ] GC(179) Humongous regions: 1->1
[2024-12-24T12:10:12.795+0000][7679][info][gc,metaspace   ] GC(179) Metaspace: 296387K(300352K)->296387K(300352K) NonClass: 258630K(260800K)->258630K(260800K) Class: 37756K(39552K)->37756K(39552K)
[2024-12-24T12:10:12.795+0000][7679][info][gc             ] GC(179) Pause Young (Normal) (G1 Evacuation Pause) 13361M->3699M(16384M) 82.112ms
[2024-12-24T12:10:12.795+0000][7679][info][gc,cpu         ] GC(179) User=0.26s Sys=0.01s Real=0.08s
               total        used        free      shared  buff/cache   available
Mem:              30          29           1           0           0           0
Swap:              0           0           0
Tue Dec 24 12:10:25 UTC 2024


*** syslog OOM output 
grep -i oom /var/log/syslog

2024-12-24T12:14:35.481240+00:00 ip-172-30-20-32 kernel: http-worker-691 invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0
2024-12-24T12:14:35.496433+00:00 ip-172-30-20-32 kernel:  oom_kill_process+0x116/0x270
2024-12-24T12:14:35.496434+00:00 ip-172-30-20-32 kernel:  ? oom_evaluate_task+0x143/0x1e0
2024-12-24T12:14:35.496435+00:00 ip-172-30-20-32 kernel:  __alloc_pages_may_oom+0x10c/0x1d0
2024-12-24T12:14:35.511859+00:00 ip-172-30-20-32 kernel: [  pid  ]   uid  tgid total_vm      rss rss_anon rss_file rss_shmem pgtables_bytes swapents oom_score_adj name
2024-12-24T12:14:35.511912+00:00 ip-172-30-20-32 kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/cloud-final.service,task=trino-server,pid=7634,uid=0
2024-12-24T12:14:35.511912+00:00 ip-172-30-20-32 kernel: Out of memory: Killed process 7634 (trino-server) total-vm:36015048kB, anon-rss:29973656kB, file-rss:2948kB, shmem-rss:0kB, UID:0 pgtables:61856kB oom_score_adj:0
2024-12-24T12:14:35.792898+00:00 ip-172-30-20-32 systemd[1]: cloud-final.service: A process of this unit has been killed by the OOM killer.
2024-12-24T12:14:35.803968+00:00 ip-172-30-20-32 systemd[1]: cloud-final.service: Failed with result 'oom-kill'.
2024-12-24T12:14:35.806272+00:00 ip-172-30-20-32 systemd[1]: system.slice: A process of this unit has been killed by the OOM killer.

Trino coordinator is running on EC2 machine type r6a.xlarge with OpenJDK Runtime Environment Temurin-23.0.1+11 (build 23.0.1+11)

jvm.conf is

-server
-Xmx16G
-Xms16G
-XX:+UseG1GC
-XX:G1HeapRegionSize=32M
-XX:+ExplicitGCInvokesConcurrent
-XX:+ExitOnOutOfMemoryError
-XX:-OmitStackTraceInFastThrow
-Djava.io.tmpdir=/data/tmpdir
-XX:ReservedCodeCacheSize=512M
-XX:PerMethodRecompilationCutoff=10000
-XX:PerBytecodeRecompilationCutoff=10000
-Dfile.encoding=UTF-8
-Dcom.sun.jndi.ldap.object.disableEndpointIdentification=true
-Djdk.attach.allowAttachSelf=true
-Djdk.nio.maxCachedBufferSize=2000000
# Reduce starvation of threads by GClocker, recommend to set about the number of cpu cores (JDK-8192647)
-XX:+UnlockDiagnosticVMOptions
-XX:G1NumCollectionsKeepPinned=10000000
-XX:GCLockerRetryAllocationCount=32
-Xlog:gc*:/data/trino/var/log/gc.log:time,level,tags,tid:filecount=10,filesize=100M

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions