Envoy crash when validating bootstrap config file #27775
Comments
We found a potential root cause: the tcmalloc version used in Envoy 1.24.4 has a bug where it cannot handle non-sequential online CPUs. The segfaults happen on EC2 instances with Nitro Enclaves enabled, where some CPUs have been hot-unplugged and are offline, as $ lscpu shows. Can you confirm whether that is the actual root cause? |
Can you confirm whether a newer Envoy version (e.g. 1.26.1) uses a tcmalloc version that fixes the bug? |
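To check whether a given host is in the state described above, the kernel's list of online CPUs can be inspected for gaps. The following is a minimal sketch, not something taken from Envoy or tcmalloc: it assumes a Linux host that exposes `/sys/devices/system/cpu/online` in the usual cpulist format (for example `0-1,4-7`), and the helper names `parse_cpu_list` and `has_offline_gaps` are made up for illustration.

```python
from pathlib import Path

# Standard Linux sysfs file listing online CPUs in cpulist format, e.g. "0-1,4-7".
ONLINE_CPUS_PATH = Path("/sys/devices/system/cpu/online")


def parse_cpu_list(text: str) -> list[int]:
    """Expand a cpulist string such as "0-1,4-7" into a sorted list of CPU ids."""
    cpus: list[int] = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-", 1)
            cpus.extend(range(int(first), int(last) + 1))
        else:
            cpus.append(int(part))
    return sorted(cpus)


def has_offline_gaps(cpus: list[int]) -> bool:
    """Return True when the online CPU ids are not exactly 0..N-1."""
    return cpus != list(range(len(cpus)))


if __name__ == "__main__" and ONLINE_CPUS_PATH.exists():
    raw = ONLINE_CPUS_PATH.read_text().strip()
    print(f"online CPUs: {raw}, non-sequential: {has_offline_gaps(parse_cpu_list(raw))}")
```

A host that reports a non-sequential list here matches the condition suspected of triggering the crash; a sequential list does not by itself rule out other causes.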
That certainly sounds like a plausible root cause. It looks like the tcmalloc version was last updated to a revision from 2022-10-24, right after 1.24 was cut; 1.24 itself uses a revision from 2022-08-06. Do you know when, or whether, the described tcmalloc bug was fixed? |
I'm not sure whether this is fixed, so I opened an issue with tcmalloc. |
Looks like this is confirmed as a tcmalloc issue that's not resolved, so there's nothing to do here until it's addressed there. |
I see a commit referencing the tcmalloc issue! Once the tcmalloc side is closed/confirmed we will have to bump the revision on this side to include the fix 🎉 |
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions. |
This is still an issue. |
Looks like Jul 24's commit may be the one to address the problem. Would it be reasonable to patch in an update to envoy's tcmalloc version at your end and see whether it fixes the problem? |
we're working on building envoy with the tcmalloc patch to test it |
We hit compilation errors attempting to update Envoy's dependencies. My coworker Thejas attempted to build Envoy with the patch and encountered a chain of errors.
The errors we found were:
The patch boils down to:
diff --git a/bazel/repository_locations.bzl b/bazel/repository_locations.bzl
index a262e3fe44..c64d708995 100644
--- a/bazel/repository_locations.bzl
+++ b/bazel/repository_locations.bzl
@@ -150,8 +150,8 @@ REPOSITORY_LOCATIONS_SPEC = dict(
         project_name = "Abseil",
         project_desc = "Open source collection of C++ libraries drawn from the most fundamental pieces of Google’s internal codebase",
         project_url = "https://abseil.io/",
-        version = "9bff2a9302a8dbf91712fc215eb2e2cf8ec234e7",
-        sha256 = "ae959138730b55b3fb968d3c357e740e7ffdeab4648dc3eb28843a1e9fa56b57",
+        version = "a3020c763c12bd16bbf00804abe853afa5778174",
+        sha256 = "0734c1d74a75fef0298f8d08c279e092d319b783ea5ff46873af904df0003f81",
         strip_prefix = "abseil-cpp-{version}",
         urls = ["https://github.com/abseil/abseil-cpp/archive/{version}.tar.gz"],
         use_category = ["dataplane_core", "controlplane"],
@@ -336,8 +336,8 @@ REPOSITORY_LOCATIONS_SPEC = dict(
         project_name = "tcmalloc",
         project_desc = "Fast, multi-threaded malloc implementation",
         project_url = "https://github.com/google/tcmalloc",
-        version = "e33c7bc60415127c104006d3301c96902f98d42a",
-        sha256 = "14a2c91b71d6719558768a79671408c9acd8284b418e80386c5888047e2c15aa",
+        version = "cbbe578d8f2822a5f2cefff42ebabfa364b725ab",
+        sha256 = "ceef110ed7ea3fe1a4665b9b5adf38fdca8b026739db78cba4686d1a03224582",
         strip_prefix = "tcmalloc-{version}",
         urls = ["https://github.com/google/tcmalloc/archive/{version}.tar.gz"],
         use_category = ["dataplane_core", "controlplane"], |
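Each entry in the patch pairs a `version` (a commit hash substituted into the `urls` template) with the `sha256` of the downloaded archive, so bumping a dependency means recomputing that digest. Below is a rough sketch of how the two values relate. It is not Envoy's own dependency tooling; the function names are hypothetical, and `fetch_sha256` assumes network access to the archive URL.

```python
import hashlib
import urllib.request


def archive_url(url_template: str, version: str) -> str:
    """Fill the {version} placeholder used by the urls entries in the patch."""
    return url_template.format(version=version)


def sha256_hex(data: bytes) -> str:
    """Return the hex SHA-256 digest, the format of the sha256 fields."""
    return hashlib.sha256(data).hexdigest()


def fetch_sha256(url_template: str, version: str) -> str:
    """Download the archive for a version and hash it (requires network access)."""
    with urllib.request.urlopen(archive_url(url_template, version)) as response:
        return sha256_hex(response.read())


# Example call (not executed here because it downloads the archive):
# fetch_sha256(
#     "https://github.com/google/tcmalloc/archive/{version}.tar.gz",
#     "cbbe578d8f2822a5f2cefff42ebabfa364b725ab",
# )
```

If the digest computed this way differs from the `sha256` value in the file, Bazel would be expected to reject the download, so the two fields have to be updated together.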
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions. |
@ravenblackx This is not stale and still not resolved |
@mattklein123 I don't know what the strategy is for updating conflicting chains of dependencies? |
online CPUs. envoyproxy#27775 Signed-off-by: Can Cecen <[email protected]>
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions. |
This is not stale. |
Description:
Envoy occasionally crashes when validating bootstrap config
Repro steps:
This has only been observed on 0.05% of Stripe hosts that run Envoy. I can log in to those hosts, run the validation manually, and get a segfault roughly once every 2 to 3 runs:
$ sudo /pay/jenkins-artifacts/envoy/1.24.4-stripe1/envoy-stripe --config-path $bootstrap_path --mode validate --service-cluster certhorse --service-zone us-west-2b --service-node qa-certhorse--01b66f3177b5f6cdc
Segmentation fault
Sometimes the validation hangs indefinitely instead.
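Because the failure is intermittent (a segfault in some runs, a hang in others), it can help to run the validation repeatedly and tally the outcomes. The sketch below is one way to do that under stated assumptions: the helper names and the default run count and timeout are illustrative, and the segfault check assumes a POSIX host where a process killed by SIGSEGV reports either a negative return code or, through a wrapper such as a shell, 128 plus the signal number.

```python
import signal
import subprocess
import sys
from collections import Counter


def run_once(cmd: list[str], timeout_s: float) -> str:
    """Run cmd once and classify the outcome as ok, segfault, hang, or exit:<code>."""
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,  # the child is killed when the timeout expires
        )
    except subprocess.TimeoutExpired:
        return "hang"
    if proc.returncode == 0:
        return "ok"
    if proc.returncode in (-signal.SIGSEGV, 128 + signal.SIGSEGV):
        return "segfault"
    return f"exit:{proc.returncode}"


def summarize(cmd: list[str], runs: int = 20, timeout_s: float = 60.0) -> Counter:
    """Repeat the command and count how often each outcome occurs."""
    return Counter(run_once(cmd, timeout_s) for _ in range(runs))


# Usage: pass the validate command from the report as arguments, e.g.
#   python repro.py /path/to/envoy --config-path bootstrap.yaml --mode validate
if __name__ == "__main__" and len(sys.argv) > 1:
    print(dict(summarize(sys.argv[1:])))
```

Comparing the tallies on an affected host before and after a candidate fix gives a rough signal of whether the fix helps, though a small number of runs cannot prove the crash is gone.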
Call stack:
[external/envoy/source/server/backtrace.h:104] Caught Segmentation fault, suspect faulting address 0x0
[external/envoy/source/server/backtrace.h:91] Backtrace (use tools/stack_decode.py to get line numbers):
[external/envoy/source/server/backtrace.h:92] Envoy version: 2f44165e55dd47475c44d2d03018eac3cb8a6264/1.24.4-stripe1/Clean/RELEASE/BoringSSL
[external/envoy/source/server/backtrace.h:96] #0: __restore_rt [0x7f26c19d1420]
[external/envoy/source/server/backtrace.h:96] #1: tcmalloc::tcmalloc_internal::cpu_cache_internal::CpuCache<>::Refill() [0x55cf3f28ce2a]
[external/envoy/source/server/backtrace.h:96] #2: tcmalloc::tcmalloc_internal::cpu_cache_internal::CpuCache<>::Allocate<>()::Helper::Underflow() [0x55cf3f28df77]
[external/envoy/source/server/backtrace.h:96] #3: Envoy::Api::ValidationImpl::allocateDispatcher() [0x55cf3dd5b3de]
[external/envoy/source/server/backtrace.h:96] #4: Envoy::Server::ValidationInstance::ValidationInstance() [0x55cf3dd4b52f]
[external/envoy/source/server/backtrace.h:96] #5: Envoy::Server::validateConfig() [0x55cf3dd4aa65]
[external/envoy/source/server/backtrace.h:96] #6: Envoy::MainCommonBase::run() [0x55cf3dd11370]
[external/envoy/source/server/backtrace.h:96] #7: Envoy::MainCommon::main() [0x55cf3dd11a7d]
[external/envoy/source/server/backtrace.h:96] #8: main [0x55cf3dd0da4a]
[external/envoy/source/server/backtrace.h:96] #9: __libc_start_main [0x7f26c17ef083]
uname -a:
Linux 5.15.0-1036-aws #40 SMP Mon Apr 24 00:21:13 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Envoy version
2f44165e55dd47475c44d2d03018eac3cb8a6264/1.24.4-stripe1/Clean/RELEASE/BoringSSL
2f44165e55dd47475c44d2d03018eac3cb8a6264 is an internal commit in Stripe's Envoy repo, which is based on OSS Envoy 1.24.4.