how to avoid nebula OOM killed by OS? #5727

Closed
coldgust opened this issue Sep 25, 2023 · 12 comments
Labels
type/question Type: question about the product

Comments

@coldgust

I have tried setting memory_tracker_limit_ratio=0.3 for graphd and storaged, which are deployed on the same node. When I execute a large query, the process still gets killed due to OOM. Is there a way to kill the query instead of having the process OOM-killed by the OS?

Also, when I execute a query, is there a way to set a timeout parameter to kill slow queries?

Thanks!
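
A minimal nGQL sketch for the slow-query part, assuming the SHOW QUERIES / KILL QUERY statements of NebulaGraph 3.x; this kills a runaway query manually rather than through a timeout flag, and the IDs are placeholders:

# list running queries together with their session and plan IDs
SHOW QUERIES;
# kill one query using the session/plan IDs taken from the output above
KILL QUERY (session=1625471, plan=163);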

@porscheme

@wey-gu

In nebula v3.6.0, the situation is much worse.
The graphd crashes well before the memory allocation (limits set via k8s pod resources) is reached.
Also, we have max_edge_returned_per_vertex set on the storaged.

I wish someone on the product team would respond.

@wey-gu
Contributor

wey-gu commented Sep 26, 2023

I have tried setting memory_tracker_limit_ratio=0.3 for graphd and storaged, which are deployed on the same node. When I execute a large query, the process still gets killed due to OOM. Is there a way to kill the query instead of having the process OOM-killed by the OS?

Also, when I execute a query, is there a way to set a timeout parameter to kill slow queries?

Thanks!

  • Can we confirm there are no other processes consuming memory?
  • Was the process killed by OOM graphd? Do we have memory utilization monitored during the execution?

@wey-gu
Contributor

wey-gu commented Sep 26, 2023

@wey-gu

In nebula v3.6.0, the situation is much worse. The graphd crashes well before the memory allocation (limits set via k8s pod resources) is reached. Also, we have max_edge_returned_per_vertex set on the storaged.

I wish someone on the product team would respond.

Everything else is exactly the same, right? Only the memory limit is hit more often?

@porscheme

@wey-gu

Everything else is exactly the same, right? Only the memory limit is hit more often?

Yes, everything is the same.

My cluster info:
graphd: 3 nodes
metad: 3 nodes
storaged: 7 nodes

Our space has a total vertex count of 2.8 billion and a total edge count of 1 billion.

We don't have memory_tracker_limit_ratio set; how does it help the situation?

@coldgust
Author

I have tried setting memory_tracker_limit_ratio=0.3 for graphd and storaged, which are deployed on the same node. When I execute a large query, the process still gets killed due to OOM. Is there a way to kill the query instead of having the process OOM-killed by the OS?
Also, when I execute a query, is there a way to set a timeout parameter to kill slow queries?
Thanks!

  • Can we confirm there are no other processes consuming memory?
  • Was the process killed by OOM graphd? Do we have memory utilization monitored during the execution?

I found the OOM log in dmesg:

Memory cgroup out of memory: Killed process 425131 (nebula-storaged) total-vm:3496320kB, anon-rss:1970324kB, file-rss:0kB, shmem-rss:0kB
...
Memory cgroup out of memory: Killed process 425092 (nebula-graphd) total-vm:5948832kB, anon-rss:3842228kB, file-rss:0kB, shmem-rss:0kB

storaged and graphd are on the same node with 8 GB of RAM. I used top to watch memory during execution, and storaged and graphd were the main processes using memory.
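
(As a rough back-of-the-envelope check, assuming the tracker's total here is the node's 8 GB and using the budget formula limit = ratio * (total - untracked_reserved) with the default 50 MB reserve: 0.3 * (8192 MB - 50 MB) ≈ 2.4 GB, yet the dmesg line above shows nebula-graphd at anon-rss ≈ 3.8 GB, well past that budget, which suggests the limit was not actually being enforced.)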

@wey-gu
Contributor

wey-gu commented Sep 26, 2023

@wey-gu

Everything else is exactly the same, right? Only the memory limit is hit more often?

Yes, everything is the same.

My cluster info: graphd: 3 nodes, metad: 3 nodes, storaged: 7 nodes

Our space has a total vertex count of 2.8 billion and a total edge count of 1 billion.

We don't have memory_tracker_limit_ratio set; how does it help the situation?

By default the memory tracker is enabled with a default ratio of 80%; refer to https://www.nebula-graph.io/posts/memory-tracker-practices for more details!

What version were you running before 3.6.0, please?

Also, was the memory limit on the node overcommitted?
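
For reference, a hedged sketch of what those defaults look like when written out explicitly in nebula-graphd.conf, using the same flag names that appear later in this thread (0.8 is the default ratio when the flag is absent):

########## memory tracker ##########
# fraction of (total - untracked reserved) memory the tracker may hand out; defaults to 0.8
--memory_tracker_limit_ratio=0.8
# memory reserved for the OS and other processes, in MiB
--memory_tracker_untracked_reserved_memory_mb=50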

@wey-gu
Contributor

wey-gu commented Sep 26, 2023

I have tried setting memory_tracker_limit_ratio=0.3 for graphd and storaged, which are deployed on the same node. When I execute a large query, the process still gets killed due to OOM. Is there a way to kill the query instead of having the process OOM-killed by the OS?
Also, when I execute a query, is there a way to set a timeout parameter to kill slow queries?
Thanks!

  • Can we confirm there are no other processes consuming memory?
  • Was the process killed by OOM graphd? Do we have memory utilization monitored during the execution?

I found the OOM log in dmesg:

Memory cgroup out of memory: Killed process 425131 (nebula-storaged) total-vm:3496320kB, anon-rss:1970324kB, file-rss:0kB, shmem-rss:0kB
...
Memory cgroup out of memory: Killed process 425092 (nebula-graphd) total-vm:5948832kB, anon-rss:3842228kB, file-rss:0kB, shmem-rss:0kB

storaged and graphd are on the same node with 8 GB of RAM. I used top to watch memory during execution, and storaged and graphd were the main processes using memory.

Per https://www.nebula-graph.io/posts/memory-tracker-practices, the mem tracker will stop further memory requests, but the ratio is based on total - untracked (OS or other processes), so it's indeed strange that the OOM report shows way more than the 0.3 ratio.

By memory_tracker_limit_ratio=0.3 do you mean

--memory_tracker_limit_ratio=0.3
or
memory_tracker_limit_ratio=0.3?

@porscheme

porscheme commented Sep 26, 2023

@wey-gu

By default the memory tracker is enabled with a default ratio of 80%; refer to https://www.nebula-graph.io/posts/memory-tracker-practices for more details!

What version were you running before 3.6.0, please?
We were running v3.4.0.

Also, was the memory limit on the node overcommitted?
Yes, more memory was allocated for graphd.

Based on this discussion we added memory_tracker_limit_ratio=0.3. After this, we are not seeing OOMs, but queries are failing with a 'GraphMemoryExceeded: (-2600)' error.

Below is what we have for graphd:

  graphd:
    config:
      "memory_tracker_limit_ratio": "0.3"
      "storage_client_timeout_ms": "300000"
      "max_allowed_query_size": "2097152"
      "session_idle_timeout_secs": "3600"
      "client_idle_timeout_secs": "3600"
      "optimize_appendvertices": "true"
      "enable_authorize": "true"
      "max_job_size": "10"
    resources:
      requests:
        cpu: "4000m"
        memory: "64Gi"
      limits:
        cpu: "6000m"
        memory: "108Gi"

@coldgust
Author

I have tried setting memory_tracker_limit_ratio=0.3 for graphd and storaged, which are deployed on the same node. When I execute a large query, the process still gets killed due to OOM. Is there a way to kill the query instead of having the process OOM-killed by the OS?
Also, when I execute a query, is there a way to set a timeout parameter to kill slow queries?
Thanks!

  • Can we confirm there are no other processes consuming memory?
  • Was the process killed by OOM graphd? Do we have memory utilization monitored during the execution?

I found the OOM log in dmesg:

Memory cgroup out of memory: Killed process 425131 (nebula-storaged) total-vm:3496320kB, anon-rss:1970324kB, file-rss:0kB, shmem-rss:0kB
...
Memory cgroup out of memory: Killed process 425092 (nebula-graphd) total-vm:5948832kB, anon-rss:3842228kB, file-rss:0kB, shmem-rss:0kB

storaged and graphd are on the same node with 8 GB of RAM. I used top to watch memory during execution, and storaged and graphd were the main processes using memory.

Per https://www.nebula-graph.io/posts/memory-tracker-practices, the mem tracker will stop further memory requests, but the ratio is based on total - untracked (OS or other processes), so it's indeed strange that the OOM report shows way more than the 0.3 ratio.

By memory_tracker_limit_ratio=0.3 do you mean

--memory_tracker_limit_ratio=0.3 or memory_tracker_limit_ratio=0.3?

It's in the config file:

########## memory tracker ##########
# trackable memory ratio (trackable_memory / (total_memory - untracked_reserved_memory) )
--memory_tracker_limit_ratio=0.3
# untracked reserved memory in Mib
--memory_tracker_untracked_reserved_memory_mb=50

# enable log memory tracker stats periodically
--memory_tracker_detail_log=true
# log memory tracker stats interval in milliseconds
--memory_tracker_detail_log_interval_ms=6000

And these are the memory-related items from curl host:port/flags:

--memory_purge_enabled=1
--memory_purge_interval_seconds=10
--memory_tracker_available_ratio=0.8
--memory_tracker_detail_log=1
--memory_tracker_detail_log_interval_ms=1000
--memory_tracker_limit_ratio=0.3
--memory_tracker_untracked_reserved_memory_mb=50
--system_memory_high_watermark_ratio=1

@wey-gu
Contributor

wey-gu commented Sep 28, 2023

@coldgust it turns out --system_memory_high_watermark_ratio=1 will prevent the mem tracker from functioning normally; we'll highlight this in the docs later.
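
A hedged sketch of the corresponding change in nebula-graphd.conf, assuming the stock default for the watermark flag is 0.8 (verify against the value your build ships with):

# restore the high-watermark default so the memory tracker can kick in
--system_memory_high_watermark_ratio=0.8
# keep the tracker limit as configured above
--memory_tracker_limit_ratio=0.3

After restarting, the active values can be re-checked the same way as above, e.g. curl host:port/flags.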

@wey-gu
Contributor

wey-gu commented Sep 28, 2023

@wey-gu

By default the memory tracker is enabled with a default ratio of 80%; refer to https://www.nebula-graph.io/posts/memory-tracker-practices for more details!
What version were you running before 3.6.0, please?
We were running v3.4.0.

Also, was the memory limit on the node overcommitted?
Yes, more memory was allocated for graphd.

Based on this discussion we added memory_tracker_limit_ratio=0.3. After this, we are not seeing OOMs, but queries are failing with a 'GraphMemoryExceeded: (-2600)' error.

Below is what we have for graphd:

  graphd:
    config:
      "memory_tracker_limit_ratio": "0.3"
      "storage_client_timeout_ms": "300000"
      "max_allowed_query_size": "2097152"
      "session_idle_timeout_secs": "3600"
      "client_idle_timeout_secs": "3600"
      "optimize_appendvertices": "true"
      "enable_authorize": "true"
      "max_job_size": "10"
    resources:
      requests:
        cpu: "4000m"
        memory: "64Gi"
      limits:
        cpu: "6000m"
        memory: "108Gi"

The mem tracker limits memory based on the untracked memory, the ratio, and the k8s memory limit, so it has to hit the ratio to trigger this exceeded error. We could enlarge the ratio to something larger than 0.3, maybe starting from 0.6?

Also, for the limit, was it an overcommitted value when accumulating all workloads?
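
A minimal sketch of that suggestion applied to the graphd config posted above (only the ratio changes; everything else stays as-is):

  graphd:
    config:
      # raise the tracked budget from 0.3 to 0.6 of the pod's memory limit
      "memory_tracker_limit_ratio": "0.6"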

@QingZ11 QingZ11 added the type/question Type: question about the product label Feb 18, 2024
@QingZ11
Contributor

QingZ11 commented Feb 18, 2024

Hi, I have noticed that the issue you created hasn’t been updated for nearly a month, so I have to close it for now. If you have any new updates, you are welcome to reopen this issue anytime.

Thanks a lot for your contribution anyway 😊

@QingZ11 QingZ11 closed this as completed Feb 18, 2024