
e2e: disable gRPC GOAWAY/too_many_pings keepalive errors in e2e CI #19083

Merged
timvaillancourt merged 4 commits into vitessio:main from timvaillancourt:e2e-grpc-keepalive-relax
Jan 6, 2026

@timvaillancourt
Contributor

@timvaillancourt timvaillancourt commented Dec 24, 2025

Description

This PR disables gRPC's too-many-pings protection when running e2e tests with test.go. This is achieved by setting GRPC_ARG_HTTP2_MAX_PINGS_WITHOUT_DATA to 0.

Context from docs:

GRPC_ARG_HTTP2_MAX_PINGS_WITHOUT_DATA
This channel argument controls the maximum number of pings that can be sent when there is no data/header frame to be sent. gRPC Core will not continue sending pings if we run over the limit. Setting it to 0 allows sending pings without such a restriction. (Note that this is an unfortunate setting that does not agree with A8-client-side-keepalive.md. There should ideally be no such restriction on the keepalive ping and we plan to deprecate it in the future.)

and

Why am I receiving a GOAWAY with error code ENHANCE_YOUR_CALM?
A server sends a GOAWAY with ENHANCE_YOUR_CALM if the client sends too many misbehaving pings as described in A8-client-side-keepalive.md. Some scenarios where this can happen are -
  • if a server has GRPC_ARG_KEEPALIVE_PERMIT_WITHOUT_CALLS set to false while the client has set this to true resulting in keepalive pings being sent even when there is no call in flight.
  • if the client's GRPC_ARG_KEEPALIVE_TIME_MS setting is lower than the server's GRPC_ARG_HTTP2_MIN_RECV_PING_INTERVAL_WITHOUT_DATA_MS.

Today, many e2e tests connect and reconnect to the gRPC server too quickly and hit the GOAWAY/too_many_pings error. When this occurs, the victim test has to wait out the GRPC_ARG_HTTP2_MIN_RECV_PING_INTERVAL_WITHOUT_DATA_MS default of 300000 milliseconds (5 minutes) doing nothing 😱. This protection makes sense in a normal deployment, but in an e2e test running entirely on localhost we don't really care.

When tests hit this, we see this error and pause for 5min:

E1218 18:18:06.674435  32993 component.go:44] [transport] Client received GoAway with error code ENHANCE_YOUR_CALM and debug data equal to ASCII "too_many_pings".
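
When triaging slow CI runs, a log line like the one above can be spotted with a simple substring match. This is a minimal sketch with a hypothetical helper name. It assumes only that the line contains the two markers visible in the example, and makes no claim about how Vitess or gRPC format other log lines.

```go
package main

import (
	"fmt"
	"strings"
)

// isTooManyPingsGoAway reports whether a log line looks like the
// GOAWAY/too_many_pings message. It only checks for the two markers
// seen in the example line, so it is a heuristic, not a parser.
func isTooManyPingsGoAway(line string) bool {
	return strings.Contains(line, "ENHANCE_YOUR_CALM") &&
		strings.Contains(line, "too_many_pings")
}

func main() {
	line := `E1218 18:18:06.674435  32993 component.go:44] [transport] Client received GoAway with error code ENHANCE_YOUR_CALM and debug data equal to ASCII "too_many_pings".`
	fmt.Println(isTooManyPingsGoAway(line))
}
```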

Docs: https://github.com/grpc/grpc/blob/master/doc/keepalive.md

Related Issue(s)

Checklist

  • "Backport to:" labels have been added if this change should be back-ported to release branches
  • If this change is to be back-ported to previous releases, a justification is included in the PR description
  • Tests were added or are not required
  • Did the new or modified tests pass consistently locally and on CI?
  • Documentation was added or is not required

Deployment Notes

AI Disclosure

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
@github-actions github-actions bot added this to the v24.0.0 milestone Dec 24, 2025
@timvaillancourt timvaillancourt self-assigned this Dec 24, 2025
@vitess-bot vitess-bot bot added the NeedsWebsiteDocsUpdate, NeedsDescriptionUpdate, NeedsIssue, and NeedsBackportReason labels Dec 24, 2025
@vitess-bot
Contributor

vitess-bot bot commented Dec 24, 2025

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

  • Ensure that the Pull Request has a descriptive title.
  • Ensure there is a link to an issue (except for internal cleanup and flaky test fixes), new features should have an RFC that documents use cases and test cases.

Tests

  • Bug fixes should have at least one unit or end-to-end test, enhancement and new features should have a sufficient number of tests.

Documentation

  • Apply the release notes (needs details) label if users need to know about this change.
  • New features should be documented.
  • There should be some code comments as to why things are implemented the way they are.
  • There should be a comment at the top of each new or modified test to explain what the test does.

New flags

  • Is this flag really necessary?
  • Flag names must be clear and intuitive, use dashes (-), and have a clear help text.

If a workflow is added or modified:

  • Each item in Jobs should be named in order to mark it as required.
  • If the workflow needs to be marked as required, the maintainer team must be notified.

Backward compatibility

  • Protobuf changes should be wire-compatible.
  • Changes to _vt tables and RPCs need to be backward compatible.
  • RPC changes should be compatible with vitess-operator
  • If a flag is removed, then it should also be removed from vitess-operator and arewefastyet, if used there.
  • vtctl command output order should be stable and awk-able.

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
@timvaillancourt timvaillancourt removed the NeedsDescriptionUpdate, NeedsWebsiteDocsUpdate, NeedsIssue, and NeedsBackportReason labels Dec 24, 2025
@timvaillancourt timvaillancourt changed the title from "e2e: disable GOAWAY/too_many_pings errors in e2e CI" to "e2e: disable gRPC GOAWAY/too_many_pings keepalive errors in e2e CI" Dec 24, 2025
@timvaillancourt timvaillancourt marked this pull request as ready for review December 24, 2025 16:07
@timvaillancourt timvaillancourt enabled auto-merge (squash) December 24, 2025 16:14
timvaillancourt and others added 2 commits December 24, 2025 17:25
Co-authored-by: Mohamed Hamza <mhamza@fastmail.com>
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
@codecov

codecov bot commented Dec 24, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 69.90%. Comparing base (7a3acd5) to head (5a41620).
⚠️ Report is 6 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main   #19083   +/-   ##
=======================================
  Coverage   69.90%   69.90%           
=======================================
  Files        1612     1612           
  Lines      215817   215789   -28     
=======================================
- Hits       150865   150849   -16     
+ Misses      64952    64940   -12     


@mattlord
Member

mattlord commented Jan 6, 2026

@timvaillancourt IMO these logs are not helpful, at all, to the end user. Why don't we disable them entirely in Vitess rather than just in the tests?

Member

@mattlord mattlord left a comment


❤️

@timvaillancourt timvaillancourt merged commit b891801 into vitessio:main Jan 6, 2026
103 of 104 checks passed
@timvaillancourt timvaillancourt deleted the e2e-grpc-keepalive-relax branch January 6, 2026 21:23
@timvaillancourt
Contributor Author

timvaillancourt commented Jan 6, 2026

> @timvaillancourt IMO these logs are not helpful, at all, to the end user. Why don't we disable them entirely in Vitess rather than just in the tests?

@mattlord unfortunately the gRPC library can't silence the logging unless you disable the protection entirely, like we're doing here. I don't recommend we do that in a non-test environment.

I think the built-in protection still makes sense in a non-e2e use case, because it's possible for the gRPC port to be abused by attackers. Arguably this change wouldn't have been necessary if we used the gRPC client by the book, but it seems we often launch a single tmclient per Go package in go/test/endtoend (via TestMain) that doesn't close its connections properly. I believe this is the real reason we see this, and hypothetically if we were more careful we wouldn't.

Anecdotally, the only time I've seen this error/wait as a Vitess user in a real deployment was when writing external automation that hits vtctld APIs, possibly without closing connections properly as well, or perhaps doing something too quickly (the details of the in-gRPC protection aren't too clear to me).
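
The connection-handling point above (a shared client created in TestMain that is never closed) can be sketched as follows. This is an illustration only, not code from go/test/endtoend: `fakeClient` and `runThenClose` are hypothetical names, and the real tmclient may expose a different close method. The helper exists because deferred calls do not run if TestMain calls os.Exit directly.

```go
package main

import (
	"fmt"
	"io"
)

// fakeClient stands in for a shared gRPC client owned by a test package.
// Hypothetical type, for illustration only.
type fakeClient struct{ closed bool }

// Close marks the client as closed; a real client would tear down its
// connections here.
func (c *fakeClient) Close() error {
	c.closed = true
	return nil
}

// runThenClose runs the test suite and always closes the shared client
// before handing back the exit code. If the suite passed but Close failed,
// it reports a non-zero code so the leak is not silently ignored.
func runThenClose(run func() int, shared io.Closer) (code int) {
	defer func() {
		if err := shared.Close(); err != nil && code == 0 {
			code = 1
		}
	}()
	return run()
}

func main() {
	client := &fakeClient{}
	// A stand-in for m.Run() that reports success.
	code := runThenClose(func() int { return 0 }, client)
	fmt.Println("exit code:", code, "closed:", client.closed)
}
```

In a test package, the same helper could be called as `os.Exit(runThenClose(m.Run, client))` from `TestMain(m *testing.M)`, so the client is closed even though os.Exit skips deferred functions in TestMain itself.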

@mattlord mattlord mentioned this pull request Jan 7, 2026
5 tasks
mhamza15 pushed a commit to mhamza15/vitess that referenced this pull request Jan 8, 2026
… CI (vitessio#19083)

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Signed-off-by: Mohamed Hamza <mhamza@fastmail.com>

3 participants