Envoy with golang filter crashes randomly #37225

Closed
tahmoor opened this issue Nov 18, 2024 · 7 comments · Fixed by #37405

@tahmoor

tahmoor commented Nov 18, 2024

We implemented an Envoy Golang filter on Envoy 1.32.1, and it crashes randomly.
After further investigation, we found that the crash happens during garbage collection.
We also implemented a Golang filter that does nothing and saw the same random crash.
Note also that when we call runtime.GC() manually on each request, the crash rate increases.
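
For readers who want to see what "calling runtime.GC() on each request" amounts to, here is a minimal, standard-library-only sketch. It is not the attached gc.zip plugin and does not use the Envoy Go filter API: `onRequest` and `gcCycles` are hypothetical names, and `onRequest` only stands in for whatever per-request callback a real filter implements.

```go
package main

import (
	"fmt"
	"runtime"
)

// onRequest is a hypothetical stand-in for a per-request filter callback.
// It forces a full, blocking garbage collection on every call, so a stress
// test exercises the collector far more often than automatic GC would.
func onRequest() {
	runtime.GC()
}

// gcCycles reports how many GC cycles the Go runtime has completed so far.
func gcCycles() uint32 {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return m.NumGC
}

func main() {
	before := gcCycles()
	for i := 0; i < 5; i++ {
		onRequest()
	}
	fmt.Printf("GC cycles completed during the loop: %d\n", gcCycles()-before)
}
```

Forcing a collection per request is useful only as a reproduction aid; it is expensive and would not normally be left in a production filter.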

Here is the call stack of the crash:
[2024-11-15 12:11:10.256][32][critical][backtrace] [./source/server/backtrace.h:127] Caught Segmentation fault, suspect faulting address 0x34c19902aab0
[2024-11-15 12:11:10.256][32][critical][backtrace] [./source/server/backtrace.h:111] Backtrace (use tools/stack_decode.py to get line numbers):
[2024-11-15 12:11:10.256][32][critical][backtrace] [./source/server/backtrace.h:112] Envoy version: e3b4a6e9570da15ac1caffdded17a8bebdc7dfc9/1.32.1/Clean/RELEASE/BoringSSL
[2024-11-15 12:11:10.256][32][critical][backtrace] [./source/server/backtrace.h:114] Address mapping: 56501b83c000-56501f3b6000 /usr/local/bin/envoy
[2024-11-15 12:11:10.257][32][critical][backtrace] [./source/server/backtrace.h:119] #0: runtime.sigfwd.abi0 [0x7fe229faa7e0]
[2024-11-15 12:11:10.257][32][critical][backtrace] [./source/server/backtrace.h:119] #1: runtime.sigfwdgo [0x7fe229f833b1]
[2024-11-15 12:11:10.257][32][critical][backtrace] [./source/server/backtrace.h:119] #2: runtime.sigtrampgo [0x7fe229f81d45]
[2024-11-15 12:11:10.257][32][critical][backtrace] [./source/server/backtrace.h:119] #3: runtime.sigtramp.abi0 [0x7fe229faa849]
[2024-11-15 12:11:10.257][32][critical][backtrace] [./source/server/backtrace.h:119] #4: runtime.sigfwd.abi0 [0x7fe1dfbe6920]
[2024-11-15 12:11:10.257][32][critical][backtrace] [./source/server/backtrace.h:119] #5: runtime.sigfwdgo [0x7fe1dfbbf3b1]
[2024-11-15 12:11:10.257][32][critical][backtrace] [./source/server/backtrace.h:119] #6: runtime.sigtrampgo [0x7fe1dfbbdd45]
[2024-11-15 12:11:10.257][32][critical][backtrace] [./source/server/backtrace.h:119] #7: runtime.sigtramp.abi0 [0x7fe1dfbe6989]
[2024-11-15 12:11:10.258][32][critical][backtrace] [./source/server/backtrace.h:119] #8: runtime.sigfwd.abi0 [0x7fe193e2e2e0]
[2024-11-15 12:11:10.259][32][critical][backtrace] [./source/server/backtrace.h:119] #9: runtime.sigfwdgo [0x7fe193e063f1]
[2024-11-15 12:11:10.259][32][critical][backtrace] [./source/server/backtrace.h:119] #10: runtime.sigtrampgo [0x7fe193e04d85]
[2024-11-15 12:11:10.260][32][critical][backtrace] [./source/server/backtrace.h:119] #11: runtime.sigtramp.abi0 [0x7fe193e2e349]
[2024-11-15 12:11:10.260][32][critical][backtrace] [./source/server/backtrace.h:121] #12: [0x7fe22d76c520]
[2024-11-15 12:11:10.260][32][critical][backtrace] [./source/server/backtrace.h:119] #13: envoyGoFilterHttpFinalize [0x56501d838b75]
[2024-11-15 12:11:10.260][32][critical][backtrace] [./source/server/backtrace.h:119] #14: runtime.asmcgocall.abi0 [0x7fe193e2c481]

The source code of our sample golang plugin is attached.
gc.zip

tahmoor added the bug and triage labels Nov 18, 2024
soulxu removed the triage label Nov 19, 2024
@soulxu
Member

soulxu commented Nov 19, 2024

cc @doujiang24

@doujiang24
Member

@tahmoor Thanks for your feedback.
This seems odd to me, since it's a simple case. Please provide more details:

  1. How did you build the Envoy binary? Or can you reproduce it with the official Docker image, envoyproxy/envoy:contrib-v1.32.1?
  2. Which Golang version are you using, and how did you build the Golang .so file? Which glibc version is on your build machine?

@tahmoor
Author

tahmoor commented Nov 19, 2024

@doujiang24 thanks for your response.
We used the official Envoy image: docker.io/envoyproxy/envoy:contrib-v1.32.1
Our plugin source code is compatible with Go 1.22. To build the plugin, we installed the official Go 1.23.1 release from https://go.dev/dl/ on the envoy-contrib image above.
We also built the plugin with the golang:1.22.9-bullseye and golang:1.23.1-bullseye images and saw the same problem.

@doujiang24
Member

@tahmoor I tried envoyproxy/envoy:contrib-v1.32.1 + golang:1.22.9-bullseye, but was not able to reproduce it:

wrk -t 1 -c 100 -d 100 http://localhost:8089
Running 2m test @ http://localhost:8089
  1 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    11.82ms    5.94ms 141.23ms   66.99%
    Req/Sec     8.57k   302.55     9.30k    78.70%
  853045 requests in 1.67m, 124.64MB read
Requests/sec:   8529.42
Transfer/sec:      1.25MB

Here is the full runnable demo: https://github.com/doujiang24/test-golang-segfault

Please provide more information from your side. Feel free to open a PR against the test demo repo so that I can reproduce it.

@tahmoor
Author

tahmoor commented Nov 26, 2024

@doujiang24 thanks for your response.
The crash happens under much higher load, and sometimes only after a few days.
I reproduced the crash on my local machine with the following command:
hey -n 1000000 -c 100 http://localhost:8089/
Note also that, based on our investigation, turning off automatic GC and running GC periodically every 10 seconds seems to reduce the crash rate significantly.

@doujiang24
Member

doujiang24 commented Nov 28, 2024

@tahmoor Thanks for your report. Could you please try this PR: #37405

It has been under pressure testing for half a day now, and so far there has been no crash. Without this patch, it crashes about once every ten minutes.

It's a bug that was introduced in #33377.

@tahmoor
Author

tahmoor commented Dec 9, 2024

@doujiang24, we stress-tested your PR for about one week and it did not crash.
Thanks for your wonderful effort in fixing this issue.
