how to avoid nebula OOM killed by OS? #5727

Closed
coldgust opened this issue Sep 25, 2023 · 12 comments
Labels
type/question Type: question about the product

Comments

@coldgust

I have tried setting memory_tracker_limit_ratio=0.3 for graphd and storaged, which are deployed on the same node. When I execute a large query, the process still gets killed due to OOM. Is there a way to kill the query instead of having the process OOM-killed by the OS?

Also, when I execute a query, is there a way to set a timeout parameter to kill slow queries?

Thanks!
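
A minimal nGQL sketch for the slow-query part, assuming the SHOW QUERIES / KILL QUERY statements of NebulaGraph 3.x; this kills a runaway query manually rather than through a timeout flag, and the IDs are placeholders:

# list running queries together with their session and plan IDs
SHOW QUERIES;
# kill one query using the session/plan IDs taken from the output above
KILL QUERY (session=1625471, plan=163);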

@porscheme

@wey-gu

In nebula v3.6.0, the situation is much worse.
The graphd crashes well before the memory allocation (limits set via k8s pod resources) is reached.
Also, we have max_edge_returned_per_vertex set on the storaged.

I wish someone on the product team would respond.

@wey-gu
Contributor

wey-gu commented Sep 26, 2023

I have tried setting memory_tracker_limit_ratio=0.3 for graphd and storaged, which are deployed on the same node. When I execute a large query, the process still gets killed due to OOM. Is there a way to kill the query instead of having the process OOM-killed by the OS?

Also, when I execute a query, is there a way to set a timeout parameter to kill slow queries?

Thanks!

  • Can we confirm there are no other processes consuming memory?
  • Was the process killed by OOM graphd? Do we have memory utilization monitored during the execution?

@wey-gu
Contributor

wey-gu commented Sep 26, 2023

@wey-gu

In nebula v3.6.0, the situation is much worse. The graphd crashes well before the memory allocation (limits set via k8s pod resources) is reached. Also, we have max_edge_returned_per_vertex set on the storaged.

I wish someone on the product team would respond.

Everything else is exactly the same, right? Only the memory limit is hit more often?

@porscheme

@wey-gu

Everything else is exactly the same, right? Only the memory limit is hit more often?

Yes, everything is the same.

My cluster info:
graphd: 3 nodes
metad: 3 nodes
storaged: 7 nodes

Our space has a total vertex count of 2.8 billion and a total edge count of 1 billion.

We don't have memory_tracker_limit_ratio set; how does it help the situation?

@coldgust
Author

I have tried setting memory_tracker_limit_ratio=0.3 for graphd and storaged, which are deployed on the same node. When I execute a large query, the process still gets killed due to OOM. Is there a way to kill the query instead of having the process OOM-killed by the OS?
Also, when I execute a query, is there a way to set a timeout parameter to kill slow queries?
Thanks!

  • Can we confirm there are no other processes consuming memory?
  • Was the process killed by OOM graphd? Do we have memory utilization monitored during the execution?

I found the OOM log in dmesg:

Memory cgroup out of memory: Killed process 425131 (nebula-storaged) total-vm:3496320kB, anon-rss:1970324kB, file-rss:0kB, shmem-rss:0kB
...
Memory cgroup out of memory: Killed process 425092 (nebula-graphd) total-vm:5948832kB, anon-rss:3842228kB, file-rss:0kB, shmem-rss:0kB

storaged and graphd are on the same node with 8 GB of RAM. I used top to watch memory during execution, and storaged and graphd were the main processes using memory.
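
(As a rough back-of-the-envelope check, assuming the tracker's total here is the node's 8 GB and using the budget formula limit = ratio * (total - untracked_reserved) with the default 50 MB reserve: 0.3 * (8192 MB - 50 MB) ≈ 2.4 GB, yet the dmesg line above shows nebula-graphd at anon-rss ≈ 3.8 GB, well past that budget, which suggests the limit was not actually being enforced.)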

@wey-gu
Contributor

wey-gu commented Sep 26, 2023

@wey-gu

Everything else is exactly the same, right? Only the memory limit is hit more often?

Yes, everything is the same.

My cluster info: graphd: 3 nodes, metad: 3 nodes, storaged: 7 nodes

Our space has a total vertex count of 2.8 billion and a total edge count of 1 billion.

We don't have memory_tracker_limit_ratio set; how does it help the situation?

By default the memory tracker is enabled with a default ratio of 80%; refer to https://www.nebula-graph.io/posts/memory-tracker-practices for more details!

What version were you running before 3.6.0, please?

Also, was the memory limit on the node overcommitted?
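
For reference, a hedged sketch of what those defaults look like when written out explicitly in nebula-graphd.conf, using the same flag names that appear later in this thread (0.8 is the default ratio when the flag is absent):

########## memory tracker ##########
# fraction of (total - untracked reserved) memory the tracker may hand out; defaults to 0.8
--memory_tracker_limit_ratio=0.8
# memory reserved for the OS and other processes, in MiB
--memory_tracker_untracked_reserved_memory_mb=50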

@wey-gu
Contributor

wey-gu commented Sep 26, 2023

I have tried setting memory_tracker_limit_ratio=0.3 for graphd and storaged, which are deployed on the same node. When I execute a large query, the process still gets killed due to OOM. Is there a way to kill the query instead of having the process OOM-killed by the OS?
Also, when I execute a query, is there a way to set a timeout parameter to kill slow queries?
Thanks!

  • Can we confirm there are no other processes consuming memory?
  • Was the process killed by OOM graphd? Do we have memory utilization monitored during the execution?

I found the OOM log in dmesg:

Memory cgroup out of memory: Killed process 425131 (nebula-storaged) total-vm:3496320kB, anon-rss:1970324kB, file-rss:0kB, shmem-rss:0kB
...
Memory cgroup out of memory: Killed process 425092 (nebula-graphd) total-vm:5948832kB, anon-rss:3842228kB, file-rss:0kB, shmem-rss:0kB

storaged and graphd are on the same node with 8 GB of RAM. I used top to watch memory during execution, and storaged and graphd were the main processes using memory.

Per https://www.nebula-graph.io/posts/memory-tracker-practices, the mem tracker will stop further memory requests, but the ratio is based on total - untracked (OS or other processes), so it's indeed strange that the OOM report shows way more than the 0.3 ratio.

By memory_tracker_limit_ratio=0.3 do you mean

--memory_tracker_limit_ratio=0.3
or
memory_tracker_limit_ratio=0.3?

@porscheme

porscheme commented Sep 26, 2023

@wey-gu

By default the memory tracker is enabled with a default ratio of 80%; refer to https://www.nebula-graph.io/posts/memory-tracker-practices for more details!

What version were you running before 3.6.0, please?
We were running v3.4.0.

Also, was the memory limit on the node overcommitted?
Yes, more memory was allocated for graphd.

Based on this discussion we added memory_tracker_limit_ratio=0.3. After this, we are not seeing OOMs, but queries are failing with a 'GraphMemoryExceeded: (-2600)' error.

Below is what we have for graphd:

  graphd:
    config:
      "memory_tracker_limit_ratio": "0.3"
      "storage_client_timeout_ms": "300000"
      "max_allowed_query_size": "2097152"
      "session_idle_timeout_secs": "3600"
      "client_idle_timeout_secs": "3600"
      "optimize_appendvertices": "true"
      "enable_authorize": "true"
      "max_job_size": "10"
    resources:
      requests:
        cpu: "4000m"
        memory: "64Gi"
      limits:
        cpu: "6000m"
        memory: "108Gi"

@coldgust
Author

I have tried setting memory_tracker_limit_ratio=0.3 for graphd and storaged, which are deployed on the same node. When I execute a large query, the process still gets killed due to OOM. Is there a way to kill the query instead of having the process OOM-killed by the OS?
Also, when I execute a query, is there a way to set a timeout parameter to kill slow queries?
Thanks!

  • Can we confirm there are no other processes consuming memory?
  • Was the process killed by OOM graphd? Do we have memory utilization monitored during the execution?

I found the OOM log in dmesg:

Memory cgroup out of memory: Killed process 425131 (nebula-storaged) total-vm:3496320kB, anon-rss:1970324kB, file-rss:0kB, shmem-rss:0kB
...
Memory cgroup out of memory: Killed process 425092 (nebula-graphd) total-vm:5948832kB, anon-rss:3842228kB, file-rss:0kB, shmem-rss:0kB

storaged and graphd are on the same node with 8 GB of RAM. I used top to watch memory during execution, and storaged and graphd were the main processes using memory.

Per https://www.nebula-graph.io/posts/memory-tracker-practices, the mem tracker will stop further memory requests, but the ratio is based on total - untracked (OS or other processes), so it's indeed strange that the OOM report shows way more than the 0.3 ratio.

By memory_tracker_limit_ratio=0.3 do you mean

--memory_tracker_limit_ratio=0.3 or memory_tracker_limit_ratio=0.3?

It's in the config file:

########## memory tracker ##########
# trackable memory ratio (trackable_memory / (total_memory - untracked_reserved_memory) )
--memory_tracker_limit_ratio=0.3
# untracked reserved memory in Mib
--memory_tracker_untracked_reserved_memory_mb=50

# enable log memory tracker stats periodically
--memory_tracker_detail_log=true
# log memory tracker stats interval in milliseconds
--memory_tracker_detail_log_interval_ms=6000

And these are the memory-related items from curl host:port/flags:

--memory_purge_enabled=1
--memory_purge_interval_seconds=10
--memory_tracker_available_ratio=0.8
--memory_tracker_detail_log=1
--memory_tracker_detail_log_interval_ms=1000
--memory_tracker_limit_ratio=0.3
--memory_tracker_untracked_reserved_memory_mb=50
--system_memory_high_watermark_ratio=1

@wey-gu
Contributor

wey-gu commented Sep 28, 2023

@coldgust it turns out --system_memory_high_watermark_ratio=1 will prevent the mem tracker from functioning normally; we'll highlight this in the docs later.
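
A hedged sketch of the corresponding change in nebula-graphd.conf, assuming the stock default for the watermark flag is 0.8 (verify against the value your build ships with):

# restore the high-watermark default so the memory tracker can kick in
--system_memory_high_watermark_ratio=0.8
# keep the tracker limit as configured above
--memory_tracker_limit_ratio=0.3

After restarting, the active values can be re-checked the same way as above, e.g. curl host:port/flags.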

@wey-gu
Contributor

wey-gu commented Sep 28, 2023

@wey-gu

By default the memory tracker is enabled with a default ratio of 80%; refer to https://www.nebula-graph.io/posts/memory-tracker-practices for more details!
What version were you running before 3.6.0, please?
We were running v3.4.0.

Also, was the memory limit on the node overcommitted?
Yes, more memory was allocated for graphd.

Based on this discussion we added memory_tracker_limit_ratio=0.3. After this, we are not seeing OOMs, but queries are failing with a 'GraphMemoryExceeded: (-2600)' error.

Below is what we have for graphd:

  graphd:
    config:
      "memory_tracker_limit_ratio": "0.3"
      "storage_client_timeout_ms": "300000"
      "max_allowed_query_size": "2097152"
      "session_idle_timeout_secs": "3600"
      "client_idle_timeout_secs": "3600"
      "optimize_appendvertices": "true"
      "enable_authorize": "true"
      "max_job_size": "10"
    resources:
      requests:
        cpu: "4000m"
        memory: "64Gi"
      limits:
        cpu: "6000m"
        memory: "108Gi"

The mem tracker limits memory based on the untracked memory, the ratio, and the k8s memory limit, so it has to hit the ratio to trigger this exceeded error. We could enlarge the ratio to something larger than 0.3, maybe starting from 0.6?

Also, for the limit, was it an overcommitted value when accumulating all workloads?
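
A minimal sketch of that suggestion applied to the graphd config posted above (only the ratio changes; everything else stays as-is):

  graphd:
    config:
      # raise the tracked budget from 0.3 to 0.6 of the pod's memory limit
      "memory_tracker_limit_ratio": "0.6"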

@QingZ11 QingZ11 added the type/question Type: question about the product label Feb 18, 2024
@QingZ11
Contributor

QingZ11 commented Feb 18, 2024

Hi, I have noticed that the issue you created hasn’t been updated for nearly a month, so I have to close it for now. If you have any new updates, you are welcome to reopen this issue anytime.

Thanks a lot for your contribution anyway 😊

@QingZ11 QingZ11 closed this as completed Feb 18, 2024