
Fix Kubernetes load balancing GOAWAY errors by buffering request body #60695

Closed
rana wants to merge 1 commit into master from rana/kube-retryable-transport

Conversation


@rana rana commented Oct 28, 2025

Kubernetes API servers send HTTP/2 GOAWAY frames to redistribute load across replicas, affecting up to 2% of requests. Setting the request's GetBody function enables the Go HTTP/2 transport to retry these requests automatically.

For HTTP/2 requests, GetBody is set and request bodies are buffered incrementally, accumulating as reads occur. If a GOAWAY error arrives mid-send, body buffering completes before the connection closes so the request can be replayed. HTTP/1.1 protocol upgrades are not buffered, since they cannot receive HTTP/2 GOAWAY errors.

A weighted semaphore limits the total bytes buffered concurrently to prevent OOM. The global memory limit defaults to 500 MiB, and each request body is limited to 50 MiB by default; both limits can be adjusted with environment variables.
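A minimal sketch of the weighted-semaphore idea (illustrative, not Teleport's actual implementation): concurrent buffers reserve bytes against a shared budget and fail fast once it is exhausted.

```go
package main

import (
	"fmt"
	"sync"
)

// byteSemaphore tracks a shared byte budget. Concurrent request buffers
// acquire their body size against it and release on completion.
type byteSemaphore struct {
	mu    sync.Mutex
	cur   int64
	limit int64
}

func newByteSemaphore(limit int64) *byteSemaphore {
	return &byteSemaphore{limit: limit}
}

// tryAcquire reserves n bytes, returning false when the budget is exhausted.
func (s *byteSemaphore) tryAcquire(n int64) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.cur+n > s.limit {
		return false
	}
	s.cur += n
	return true
}

// release returns n bytes to the budget.
func (s *byteSemaphore) release(n int64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.cur -= n
}

func main() {
	// Budget mirroring the PR's default: 500 MiB total.
	const mib = int64(1 << 20)
	sem := newByteSemaphore(500 * mib)

	fmt.Println(sem.tryAcquire(300 * mib)) // true
	fmt.Println(sem.tryAcquire(300 * mib)) // false: would exceed 500 MiB
	sem.release(300 * mib)
	fmt.Println(sem.tryAcquire(300 * mib)) // true again after release
}
```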

In this PR:

  • Added retryableTransport and retryBuffer enabling incremental request body buffering
  • Added a weighted semaphore limiting total concurrent buffer size to 500 MiB by default
  • Added a per-request buffer size limit with a 50 MiB default
  • Added tunable parameters RetryBufferTotal and RetryBufferPerRequest
  • Added environment variable TELEPORT_UNSTABLE_KUBE_RETRY_BUFFER_TOTAL
  • Added environment variable TELEPORT_UNSTABLE_KUBE_RETRY_BUFFER_PER_REQ
  • Added unit tests

Fixes:


Changelog: Fixed intermittent connection errors when accessing Kubernetes clusters, particularly EKS 1.27+


Manual Testing

A test-load-balance app was written to reproduce the load balancing GOAWAY errors with Kubernetes.

Three manual tests were run and compared.

  1. Test run 1 ran directly against Kubernetes without Teleport. Load balancing GOAWAY errors were seen.
  2. Test run 2 ran through Teleport without the bug fix. Load balancing GOAWAY errors were seen.
  3. Test run 3 ran through Teleport with the bug fix applied. No load balancing GOAWAY errors were seen.

Test Runs

| Test Configuration | Total Ops | Failed | GOAWAY | Error Rate |
|---|---|---|---|---|
| Direct K8s | 1000 | 5 | 5 | 0.50% |
| Teleport (no fix) | 1000 | 6 | 6 | 0.60% |
| Teleport (with fix) | 1000 | 0 | 0 | 0.00% |

Latency Comparison

| Metric | Baseline | Without Fix | With Fix | Delta |
|---|---|---|---|---|
| p50 | 199.87ms | 199.81ms | 199.78ms | -0.0% |
| p95 | 202.43ms | 203.07ms | 202.77ms | -0.1% |
| p99 | 208.68ms | 206.76ms | 205.99ms | -0.4% |
Test app

A test-load-balance app exercises Kubernetes operations to surface load balancing goaway-chance errors. It records statistics for display and for comparison across test runs.

package main

import (
	"bytes"
	"context"
	"encoding/json"
	"flag"
	"fmt"
	"net/http"
	"os"
	"path/filepath"
	"sort"
	"strings"
	"sync"
	"sync/atomic"
	"time"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// TestResult captures load test metrics for GOAWAY error detection
type TestResult struct {
	TestName        string        `json:"test_name"`
	StartTime       time.Time     `json:"start_time"`
	EndTime         time.Time     `json:"end_time"`
	Duration        time.Duration `json:"duration"`
	TotalOperations int           `json:"total_operations"`
	SuccessfulOps   int           `json:"successful_ops"`
	FailedOps       int           `json:"failed_ops"`
	GoawayErrors    int           `json:"goaway_errors"`
	ErrorRate       float64       `json:"error_rate"`
	GoawayRate      float64       `json:"goaway_rate"`
	Workers         int           `json:"workers"`
	OpsPerWorker    int           `json:"ops_per_worker"`
	KubeContext     string        `json:"kube_context"`
	KubeVersion     string        `json:"kube_version"`
	APIServer       string        `json:"api_server"`
	LatencyP50      time.Duration `json:"latency_p50"`
	LatencyP95      time.Duration `json:"latency_p95"`
	LatencyP99      time.Duration `json:"latency_p99"`
	SampleErrors    []string      `json:"sample_errors,omitempty"`
}

func main() {
	if len(os.Args) < 2 {
		os.Args = append(os.Args, "run") // Default to 'run' command
	}

	switch os.Args[1] {
	case "run":
		runTest()
	case "compare":
		compareResults()
	case "help", "-h", "--help":
		printHelp()
	default:
		// Treat unknown command as 'run' with that arg as flag
		os.Args = append([]string{os.Args[0], "run"}, os.Args[1:]...)
		runTest()
	}
}

func printHelp() {
	fmt.Println(`test-load-balance - Kubernetes GOAWAY load testing tool

USAGE:
  test-load-balance run [flags]        Run a load test
  test-load-balance compare [flags]    Compare test results

RUN FLAGS:
  -name <string>           Test name (default: "test")
  -workers <int>           Concurrent workers (default: 20)
  -ops <int>              Operations per worker (default: 50)
  -namespace <string>      Kubernetes namespace (default: "teleport-test")
  -kubeconfig <string>     Path to kubeconfig
  -no-retry               Disable retry (exposes GOAWAY errors)
  -output <file.json>      Save results to JSON file
  -quiet                   Minimal output (progress only)

COMPARE FLAGS:
  -files <file1,file2,...> Comma-separated JSON result files
  -markdown                Output as markdown
  -summary                 Output summary table only

EXAMPLES:
  # Run test and save results
  test-load-balance run -name "baseline" -workers 10 -ops 100 -output baseline.json

  # Compare multiple results
  test-load-balance compare -files baseline.json,without-fix.json,with-fix.json -markdown

  # Generate summary table
  test-load-balance compare -files *.json -summary`)
}

func runTest() {
	fs := flag.NewFlagSet("run", flag.ExitOnError)
	var (
		workers      = fs.Int("workers", 20, "Number of concurrent workers")
		opsPerWorker = fs.Int("ops", 50, "Operations per worker")
		namespace    = fs.String("namespace", "teleport-test", "Kubernetes namespace")
		testName     = fs.String("name", "test", "Test run name")
		kubeconfig   = fs.String("kubeconfig", "", "Path to kubeconfig (default: $HOME/.kube/config)")
		noRetry      = fs.Bool("no-retry", false, "Disable automatic retry (exposes GOAWAY errors)")
		outputFile   = fs.String("output", "", "Save results to JSON file")
		quiet        = fs.Bool("quiet", false, "Minimal output")
	)
	fs.Parse(os.Args[2:])

	if *kubeconfig == "" {
		*kubeconfig = filepath.Join(os.Getenv("HOME"), ".kube", "config")
	}

	// Load kubeconfig
	config, err := clientcmd.LoadFromFile(*kubeconfig)
	if err != nil {
		fmt.Fprintf(os.Stderr, "Error loading kubeconfig: %v\n", err)
		os.Exit(1)
	}

	restConfig, err := clientcmd.NewDefaultClientConfig(*config, &clientcmd.ConfigOverrides{}).ClientConfig()
	if err != nil {
		fmt.Fprintf(os.Stderr, "Error creating client config: %v\n", err)
		os.Exit(1)
	}

	// Increase QPS to avoid rate limiting
	restConfig.QPS = 100
	restConfig.Burst = 200

	// Disable retry if requested
	if *noRetry {
		restConfig.Wrap(func(rt http.RoundTripper) http.RoundTripper {
			return &noRetryTransport{inner: rt}
		})
	}

	clientset, err := kubernetes.NewForConfig(restConfig)
	if err != nil {
		fmt.Fprintf(os.Stderr, "Error creating clientset: %v\n", err)
		os.Exit(1)
	}

	// Get server version
	versionInfo, err := clientset.Discovery().ServerVersion()
	if err != nil {
		fmt.Fprintf(os.Stderr, "Error getting server version: %v\n", err)
		os.Exit(1)
	}

	currentContext := config.CurrentContext

	result := TestResult{
		TestName:        *testName,
		StartTime:       time.Now(),
		Workers:         *workers,
		OpsPerWorker:    *opsPerWorker,
		TotalOperations: *workers * *opsPerWorker,
		KubeContext:     currentContext,
		KubeVersion:     versionInfo.String(),
		APIServer:       restConfig.Host,
		SampleErrors:    make([]string, 0),
	}

	if !*quiet {
		fmt.Printf("=== Kubernetes GOAWAY Load Test ===\n")
		fmt.Printf("Test Name:     %s\n", result.TestName)
		fmt.Printf("Context:       %s\n", result.KubeContext)
		fmt.Printf("API Server:    %s\n", result.APIServer)
		fmt.Printf("K8s Version:   %s\n", result.KubeVersion)
		fmt.Printf("Namespace:     %s\n", *namespace)
		fmt.Printf("Workers:       %d\n", result.Workers)
		fmt.Printf("Ops/Worker:    %d\n", result.OpsPerWorker)
		fmt.Printf("Total Ops:     %d\n", result.TotalOperations)
		fmt.Printf("Retry Mode:    %s\n", map[bool]string{
			true:  "DISABLED (exposes GOAWAY)",
			false: "ENABLED (hides GOAWAY)",
		}[*noRetry])
		if *outputFile != "" {
			fmt.Printf("Output File:   %s\n", *outputFile)
		}
		fmt.Printf("Started:       %s\n\n", result.StartTime.Format(time.RFC3339))
	}

	// Counters and latency tracking
	var (
		successCount atomic.Int64
		errorCount   atomic.Int64
		goawayCount  atomic.Int64
		errorsMu     sync.Mutex
		latencies    []time.Duration
		latenciesMu  sync.Mutex
	)

	// Worker function
	worker := func(workerID int) {
		ctx := context.Background()
		for i := 0; i < *opsPerWorker; i++ {
			opStart := time.Now()
			cmName := fmt.Sprintf("test-cm-w%d-i%d-%d", workerID, i, time.Now().UnixNano())

			// Create ConfigMap with 1KB data
			cm := &v1.ConfigMap{
				ObjectMeta: metav1.ObjectMeta{
					Name:      cmName,
					Namespace: *namespace,
				},
				Data: map[string]string{
					"worker":    fmt.Sprintf("%d", workerID),
					"iteration": fmt.Sprintf("%d", i),
					"data":      string(bytes.Repeat([]byte("x"), 1024)),
				},
			}

			_, err := clientset.CoreV1().ConfigMaps(*namespace).Create(ctx, cm, metav1.CreateOptions{})
			if err != nil {
				errorCount.Add(1)
				if isGoawayError(err) {
					goawayCount.Add(1)
					errorsMu.Lock()
					if len(result.SampleErrors) < 10 {
						result.SampleErrors = append(result.SampleErrors, err.Error())
					}
					errorsMu.Unlock()
				}
				continue
			}

			// Delete ConfigMap
			err = clientset.CoreV1().ConfigMaps(*namespace).Delete(ctx, cmName, metav1.DeleteOptions{})
			if err != nil {
				errorCount.Add(1)
				if isGoawayError(err) {
					goawayCount.Add(1)
				}
			} else {
				successCount.Add(1)
				// Record latency only for successful operations
				opLatency := time.Since(opStart)
				latenciesMu.Lock()
				latencies = append(latencies, opLatency)
				latenciesMu.Unlock()
			}

			// Progress indicator
			if !*quiet && (successCount.Load()+errorCount.Load())%100 == 0 {
				fmt.Printf("Progress: %d/%d (errors: %d, GOAWAY: %d)\n",
					successCount.Load()+errorCount.Load(),
					result.TotalOperations,
					errorCount.Load(),
					goawayCount.Load())
			}
		}
	}

	// Run workers
	var wg sync.WaitGroup
	for w := 0; w < *workers; w++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			worker(id)
		}(w)
	}
	wg.Wait()

	// Calculate latency percentiles
	if len(latencies) > 0 {
		sort.Slice(latencies, func(i, j int) bool { return latencies[i] < latencies[j] })
		result.LatencyP50 = latencies[len(latencies)*50/100]
		result.LatencyP95 = latencies[len(latencies)*95/100]
		result.LatencyP99 = latencies[len(latencies)*99/100]
	}

	// Finalize results
	result.EndTime = time.Now()
	result.Duration = result.EndTime.Sub(result.StartTime)
	result.SuccessfulOps = int(successCount.Load())
	result.FailedOps = int(errorCount.Load())
	result.GoawayErrors = int(goawayCount.Load())
	result.ErrorRate = float64(result.FailedOps) / float64(result.TotalOperations) * 100
	result.GoawayRate = float64(result.GoawayErrors) / float64(result.TotalOperations) * 100

	// Display results
	if !*quiet {
		fmt.Printf("\n=== Results ===\n")
		fmt.Printf("Completed:        %s\n", result.EndTime.Format(time.RFC3339))
		fmt.Printf("Duration:         %s\n", result.Duration)
		fmt.Printf("Total Operations: %d\n", result.TotalOperations)
		fmt.Printf("Successful:       %d\n", result.SuccessfulOps)
		fmt.Printf("Failed:           %d\n", result.FailedOps)
		fmt.Printf("GOAWAY Errors:    %d\n", result.GoawayErrors)
		fmt.Printf("Error Rate:       %.2f%%\n", result.ErrorRate)
		fmt.Printf("GOAWAY Rate:      %.2f%%\n", result.GoawayRate)

		if len(latencies) > 0 {
			fmt.Printf("\nLatency Percentiles:\n")
			fmt.Printf("  p50: %s\n", formatDuration(result.LatencyP50))
			fmt.Printf("  p95: %s\n", formatDuration(result.LatencyP95))
			fmt.Printf("  p99: %s\n", formatDuration(result.LatencyP99))
		}

		if len(result.SampleErrors) > 0 {
			fmt.Printf("\nSample Errors:\n")
			for i, err := range result.SampleErrors {
				fmt.Printf("  %d. %s\n", i+1, err)
			}
		}
	}

	// Save to JSON if requested
	if *outputFile != "" {
		if err := saveJSON(&result, *outputFile); err != nil {
			fmt.Fprintf(os.Stderr, "Error saving JSON: %v\n", err)
			os.Exit(1)
		}
		if !*quiet {
			fmt.Printf("\nResults saved to: %s\n", *outputFile)
		}
	}

	// Exit with error if operations failed
	if result.FailedOps > 0 {
		os.Exit(1)
	}
}

func compareResults() {
	fs := flag.NewFlagSet("compare", flag.ExitOnError)
	var (
		filesStr = fs.String("files", "", "Comma-separated list of JSON result files")
		markdown = fs.Bool("markdown", false, "Output as markdown")
		summary  = fs.Bool("summary", false, "Output summary table only")
	)
	fs.Parse(os.Args[2:])

	if *filesStr == "" {
		fmt.Fprintf(os.Stderr, "Error: -files flag is required\n")
		os.Exit(1)
	}

	// Parse file list
	files := strings.Split(*filesStr, ",")
	for i, f := range files {
		files[i] = strings.TrimSpace(f)
	}

	// Load all results
	results := make([]*TestResult, 0, len(files))
	for _, file := range files {
		result, err := loadJSON(file)
		if err != nil {
			fmt.Fprintf(os.Stderr, "Error loading %s: %v\n", file, err)
			os.Exit(1)
		}
		results = append(results, result)
	}

	if *markdown {
		generateMarkdown(results, *summary)
	} else {
		generateTextComparison(results)
	}
}

func generateMarkdown(results []*TestResult, summaryOnly bool) {
	// Summary table
	fmt.Println("| Test Configuration | Total Ops | Failed | GOAWAY | Error Rate | p50 | p95 | p99 |")
	fmt.Println("|-------------------|-----------|--------|--------|------------|-----|-----|-----|")
	for _, r := range results {
		fmt.Printf("| %s | %d | %d | %d | %.2f%% | %s | %s | %s |\n",
			r.TestName,
			r.TotalOperations,
			r.FailedOps,
			r.GoawayErrors,
			r.ErrorRate,
			formatDuration(r.LatencyP50),
			formatDuration(r.LatencyP95),
			formatDuration(r.LatencyP99))
	}

	if summaryOnly {
		return
	}

	// Key findings
	fmt.Println()
	fmt.Println("## Key Findings")
	fmt.Println()

	// Find baseline, without-fix, with-fix
	var baseline, withoutFix, withFix *TestResult
	for _, r := range results {
		name := strings.ToLower(r.TestName)
		if strings.Contains(name, "baseline") || strings.Contains(name, "direct") {
			baseline = r
		} else if strings.Contains(name, "without") || strings.Contains(name, "no fix") {
			withoutFix = r
		} else if strings.Contains(name, "with") {
			withFix = r
		}
	}

	// Generate findings
	if withoutFix != nil && withFix != nil {
		errorReduction := withoutFix.FailedOps - withFix.FailedOps
		fmt.Printf("1. **Problem Reproduced**: %d GOAWAY errors (%.2f%% rate)\n",
			withoutFix.GoawayErrors, withoutFix.GoawayRate)
		fmt.Printf("2. **Fix Validated**: Error rate reduced from %.2f%% to %.2f%%\n",
			withoutFix.ErrorRate, withFix.ErrorRate)
		improvement := 0.0
		if withoutFix.FailedOps > 0 {
			improvement = float64(errorReduction) * 100 / float64(withoutFix.FailedOps)
		}
		fmt.Printf("3. **Error Elimination**: %d errors eliminated (%.0f%% improvement)\n",
			errorReduction, improvement)

		if withoutFix.LatencyP99 > 0 && withFix.LatencyP99 > 0 {
			latencyDelta := float64(withFix.LatencyP99-withoutFix.LatencyP99) * 100 / float64(withoutFix.LatencyP99)
			fmt.Printf("4. **Latency Impact**: p99 increased by %.1f%% (%s)\n",
				latencyDelta,
				formatDuration(withFix.LatencyP99-withoutFix.LatencyP99))
		}
		fmt.Printf("5. **Consistency**: Tested with %d total operations\n",
			withFix.TotalOperations)
	}

	// Latency comparison
	if baseline != nil && withoutFix != nil && withFix != nil {
		fmt.Println()
		fmt.Println("### Latency Comparison")
		fmt.Println()
		fmt.Println("| Metric | Baseline | Without Fix | With Fix | Delta |")
		fmt.Println("|--------|----------|-------------|----------|-------|")

		for _, metric := range []struct {
			name string
			fn   func(*TestResult) time.Duration
		}{
			{"p50", func(r *TestResult) time.Duration { return r.LatencyP50 }},
			{"p95", func(r *TestResult) time.Duration { return r.LatencyP95 }},
			{"p99", func(r *TestResult) time.Duration { return r.LatencyP99 }},
		} {
			baseVal := metric.fn(baseline)
			withoutVal := metric.fn(withoutFix)
			withVal := metric.fn(withFix)
			var delta float64
			if withoutVal > 0 {
				delta = float64(withVal-withoutVal) * 100 / float64(withoutVal)
			}

			fmt.Printf("| %s | %s | %s | %s | %+.1f%% |\n",
				metric.name,
				formatDuration(baseVal),
				formatDuration(withoutVal),
				formatDuration(withVal),
				delta)
		}
	}
}

func generateTextComparison(results []*TestResult) {
	fmt.Println("=== Test Comparison ===")
	fmt.Println()

	for _, r := range results {
		fmt.Printf("Test: %s\n", r.TestName)
		fmt.Printf("  Total Operations: %d\n", r.TotalOperations)
		fmt.Printf("  Failed:           %d\n", r.FailedOps)
		fmt.Printf("  GOAWAY Errors:    %d\n", r.GoawayErrors)
		fmt.Printf("  Error Rate:       %.2f%%\n", r.ErrorRate)
		fmt.Printf("  Latency p50:      %s\n", formatDuration(r.LatencyP50))
		fmt.Printf("  Latency p95:      %s\n", formatDuration(r.LatencyP95))
		fmt.Printf("  Latency p99:      %s\n", formatDuration(r.LatencyP99))
		fmt.Println()
	}
}

func saveJSON(result *TestResult, filename string) error {
	data, err := json.MarshalIndent(result, "", "  ")
	if err != nil {
		return err
	}
	return os.WriteFile(filename, data, 0644)
}

func loadJSON(filename string) (*TestResult, error) {
	data, err := os.ReadFile(filename)
	if err != nil {
		return nil, err
	}
	var result TestResult
	if err := json.Unmarshal(data, &result); err != nil {
		return nil, err
	}
	return &result, nil
}

// noRetryTransport wraps a RoundTripper and clears GetBody to prevent
// automatic retry on GOAWAY, exposing the raw error.
type noRetryTransport struct {
	inner http.RoundTripper
}

func (t *noRetryTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	// Clone before mutating: RoundTrippers must not modify the caller's request.
	req = req.Clone(req.Context())
	req.GetBody = nil
	return t.inner.RoundTrip(req)
}

func isGoawayError(err error) bool {
	if err == nil {
		return false
	}
	errStr := err.Error()
	return strings.Contains(errStr, "cannot retry") ||
		strings.Contains(errStr, "GOAWAY") ||
		strings.Contains(errStr, "graceful shutdown") ||
		strings.Contains(errStr, "http2: server sent GOAWAY") ||
		strings.Contains(errStr, "http2: Transport") ||
		strings.Contains(errStr, "Request.Body was written")
}

func formatDuration(d time.Duration) string {
	if d == 0 {
		return "N/A"
	}
	if d < time.Millisecond {
		return fmt.Sprintf("%.0fµs", float64(d.Microseconds()))
	}
	if d < time.Second {
		return fmt.Sprintf("%.2fms", float64(d.Microseconds())/1000.0)
	}
	return fmt.Sprintf("%.2fs", d.Seconds())
}
Kind config

A kind configuration file sets the load balancing goaway-chance to its maximum allowed value of 2%. Each request then independently has up to a 2% probability of receiving a GOAWAY.

# kind.yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
name: load-balance
nodes:
  - role: control-plane
    kubeadmConfigPatches:
      - |
        kind: ClusterConfiguration
        apiServer:
          extraArgs:
            goaway-chance: "0.02"
            v: "4"
Teleport config

The Teleport config file used during testing.

# teleport.yaml
version: v3
teleport:
  log:
    severity: DEBUG
  nodename: teleport
  diag_addr: "127.0.0.1:3000"

auth_service:
  enabled: true
  cluster_name: "teleport-laptop"
  listen_addr: 0.0.0.0:3025

proxy_service:
  enabled: true
  listen_addr: 0.0.0.0:3023
  web_listen_addr: 0.0.0.0:3080
  public_addr: localhost:3080
  kube_listen_addr: 0.0.0.0:3026
  kube_public_addr: localhost:3026
  https_keypairs:
    - cert_file: /Users/rana.ian/src/wrk/cfg/crt/_wildcard.teleport-laptop+4.pem
      key_file: /Users/rana.ian/src/wrk/cfg/crt/_wildcard.teleport-laptop+4-key.pem

ssh_service:
  enabled: false

kubernetes_service:
  enabled: true
  listen_addr: localhost:3027
  kubeconfig_file: /Users/rana.ian/.kube/config
Teleport role config

A Teleport role config file granting the Kubernetes administrator permissions the test app needs to run.

# teleport-role-kube.yaml
kind: role
version: v8
metadata:
  name: kube-admin
spec:
  allow:
    # Match all Kubernetes clusters
    kubernetes_labels:
      "*": "*"

    # Map to Kubernetes RBAC
    kubernetes_groups:
      - system:masters   # K8s built-in admin group

    kubernetes_users:
      - rana.ian

    # What resources can be accessed at the Teleport level
    kubernetes_resources:
      - kind: "*"
        api_group: "*"
        namespace: "*"
        name: "*"
        verbs: ["*"]
      - kind: namespaces
        name: "*"
        verbs: ["*"]
Manual Testing

Environment: macOS, kind, Kubernetes v1.34.0

1. Kind setup

The intent is to set up a local Kubernetes cluster with the load balancing option --goaway-chance=0.02 enabled.

  • A new Kubernetes cluster was created using kind and the kind.yaml configuration file.
> kind create cluster --config ~/src/wrk/cfg/kind.yaml
Creating cluster "load-balance" ...
 ✓ Ensuring node image (kindest/node:v1.34.0) 🖼
 ✓ Preparing nodes 📦
 ✓ Writing configuration 📜
 ✓ Starting control-plane 🕹️
 ✓ Installing CNI 🔌
 ✓ Installing StorageClass 💾
Set kubectl context to "kind-load-balance"
  • Validated that Kubernetes is running with the --goaway-chance=0.02 flag.
> docker exec load-balance-control-plane \
  ps aux | grep kube-apiserver | grep goaway-chance
root         552 10.6  3.2 1443096 263676 ?      Ssl  20:17   0:03 kube-apiserver --advertise-address=172.19.0.2 --allow-privileged=true --authorization-mode=Node,RBAC --client-ca-file=/etc/kubernetes/pki/ca.crt --enable-admission-plugins=NodeRestriction --enable-bootstrap-token-auth=true --etcd-cafile=/etc/kubernetes/pki/etcd/ca.crt --etcd-certfile=/etc/kubernetes/pki/apiserver-etcd-client.crt --etcd-keyfile=/etc/kubernetes/pki/apiserver-etcd-client.key --etcd-servers=https://127.0.0.1:2379 --goaway-chance=0.02 --kubelet-client-certificate=/etc/kubernetes/pki/apiserver-kubelet-client.crt --kubelet-client-key=/etc/kubernetes/pki/apiserver-kubelet-client.key --kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname --proxy-client-cert-file=/etc/kubernetes/pki/front-proxy-client.crt --proxy-client-key-file=/etc/kubernetes/pki/front-proxy-client.key --requestheader-allowed-names=front-proxy-client --requestheader-client-ca-file=/etc/kubernetes/pki/front-proxy-ca.crt --requestheader-extra-headers-prefix=X-Remote-Extra- --requestheader-group-headers=X-Remote-Group --requestheader-username-headers=X-Remote-User --runtime-config= --secure-port=6443 --service-account-issuer=https://kubernetes.default.svc.cluster.local --service-account-key-file=/etc/kubernetes/pki/sa.pub --service-account-signing-key-file=/etc/kubernetes/pki/sa.key --service-cluster-ip-range=10.96.0.0/16 --tls-cert-file=/etc/kubernetes/pki/apiserver.crt --tls-private-key-file=/etc/kubernetes/pki/apiserver.key --v=4
  • Exercised a Kubernetes operation for smoke testing.
> kubectl create namespace teleport-test
namespace/teleport-test created

2. Test run (no Teleport)

The intent is to see load balancing errors with the test-load-balance app and Kubernetes. This establishes a baseline showing that GOAWAY load balancing errors occur even without Teleport.

  • Ran the test-load-balance app. Load balancing GOAWAY errors are seen.
> ./test-load-balance run -name "Direct K8s" -workers 10 -ops 100 -no-retry -output baseline.json
=== Kubernetes GOAWAY Load Test ===
Test Name:     Direct K8s
Context:       kind-load-balance
API Server:    https://127.0.0.1:63755
K8s Version:   v1.34.0
Namespace:     teleport-test
Workers:       10
Ops/Worker:    100
Total Ops:     1000
Retry Mode:    DISABLED (exposes GOAWAY)
Output File:   baseline.json
Started:       2025-11-02T12:20:29-08:00

Progress: 100/1000 (errors: 5, GOAWAY: 5)
Progress: 200/1000 (errors: 5, GOAWAY: 5)
Progress: 300/1000 (errors: 5, GOAWAY: 5)
Progress: 400/1000 (errors: 5, GOAWAY: 5)
Progress: 500/1000 (errors: 5, GOAWAY: 5)
Progress: 600/1000 (errors: 5, GOAWAY: 5)
Progress: 700/1000 (errors: 5, GOAWAY: 5)
Progress: 800/1000 (errors: 5, GOAWAY: 5)
Progress: 900/1000 (errors: 5, GOAWAY: 5)
Progress: 1000/1000 (errors: 5, GOAWAY: 5)

=== Results ===
Completed:        2025-11-02T12:20:47-08:00
Duration:         17.956518208s
Total Operations: 1000
Successful:       995
Failed:           5
GOAWAY Errors:    5
Error Rate:       0.50%
GOAWAY Rate:      0.50%

Latency Percentiles:
  p50: 199.87ms
  p95: 202.43ms
  p99: 208.68ms

Sample Errors:
  1. Post "https://127.0.0.1:63755/api/v1/namespaces/teleport-test/configmaps": http2: Transport: cannot retry err [http2: Transport received Server's graceful shutdown GOAWAY] after Request.Body was written; define Request.GetBody to avoid this error
  2. Post "https://127.0.0.1:63755/api/v1/namespaces/teleport-test/configmaps": http2: Transport: cannot retry err [http2: Transport received Server's graceful shutdown GOAWAY] after Request.Body was written; define Request.GetBody to avoid this error
  3. Post "https://127.0.0.1:63755/api/v1/namespaces/teleport-test/configmaps": http2: Transport: cannot retry err [http2: Transport received Server's graceful shutdown GOAWAY] after Request.Body was written; define Request.GetBody to avoid this error

3. Test run (Teleport, no fix)

The intent is to see load balancing errors through Teleport without the bug fix, for comparison against a run with the fix applied.

  • Teleport backend database was deleted.
sudo rm -rf /var/lib/teleport
sudo mkdir -p -m0700 /var/lib/teleport
sudo chown $USER /var/lib/teleport
  • Teleport was built and run with the latest master branch. No bug fix present.
> git switch master
Already on 'master'
Your branch is up to date with 'origin/master'.

> make clean && make full
  • Teleport started with the teleport.yaml config file, and is configured with auth + proxy + kube agent.

  • Looking at the Teleport terminal output, Kubernetes health checks show the local kind cluster kind-load-balance is healthy.

2025-11-02T12:26:58.414-08:00 INFO [KUBERNETE] Target became healthy target_name:kind-load-balance target_kind:kube_cluster target_origin: reason:threshold_reached message:1 health check passed healthcheck/worker.go:411
  • Configured Teleport by creating a Teleport role and user. Logged in, listed kube clusters, and logged into the kube cluster kind-load-balance.
> tctl create -f ~/src/wrk/cfg/teleport-role-kube.yaml
role "kube-admin" has been created

> tctl users add $(whoami) --roles=editor,access,kube-admin
User "rana.ian" has been created but requires a password. Share this URL with the user to complete user setup, link is valid for 1h:
https://localhost:3080/web/invite/76723dfd5982560be4bb552a14aa82c0

> tsh login --proxy=localhost:3080 --user=$(whoami) --auth=local
Enter password for Teleport user rana.ian:
Enter an OTP code from a device:
> Profile URL:        https://localhost:3080
  Logged in as:       rana.ian
  Cluster:            teleport-laptop
  Roles:              access, editor, kube-admin
  Kubernetes:         enabled
  Kubernetes users:   rana.ian
  Kubernetes groups:  system:masters
  Valid until:        2025-11-03 00:29:15 -0800 PST [valid for 11h59m]
  Extensions:         login-ip, permit-agent-forwarding, permit-port-forwarding, permit-pty, private-key-policy

> tsh kube ls
Kube Cluster Name Labels Selected
----------------- ------ --------
kind-load-balance

> tsh kube login kind-load-balance
Logged into Kubernetes cluster "kind-load-balance". Try 'kubectl version' to test the connection.

> kubectl version
Client Version: v1.34.1
Kustomize Version: v5.7.1
Server Version: v1.34.0
  • Ran the test-load-balance app. Load balancing GOAWAY errors are seen.
> ./test-load-balance run -name "Teleport (no fix)" -workers 10 -ops 100 -no-retry -output without-fix.json
=== Kubernetes GOAWAY Load Test ===
Test Name:     Teleport (no fix)
Context:       teleport-laptop-kind-load-balance
API Server:    https://localhost:3026
K8s Version:   v1.34.0
Namespace:     teleport-test
Workers:       10
Ops/Worker:    100
Total Ops:     1000
Retry Mode:    DISABLED (exposes GOAWAY)
Output File:   without-fix.json
Started:       2025-11-02T12:30:40-08:00

Progress: 100/1000 (errors: 6, GOAWAY: 6)
Progress: 200/1000 (errors: 6, GOAWAY: 6)
Progress: 300/1000 (errors: 6, GOAWAY: 6)
Progress: 400/1000 (errors: 6, GOAWAY: 6)
Progress: 500/1000 (errors: 6, GOAWAY: 6)
Progress: 600/1000 (errors: 6, GOAWAY: 6)
Progress: 700/1000 (errors: 6, GOAWAY: 6)
Progress: 800/1000 (errors: 6, GOAWAY: 6)
Progress: 900/1000 (errors: 6, GOAWAY: 6)
Progress: 1000/1000 (errors: 6, GOAWAY: 6)

=== Results ===
Completed:        2025-11-02T12:30:58-08:00
Duration:         17.947190042s
Total Operations: 1000
Successful:       994
Failed:           6
GOAWAY Errors:    6
Error Rate:       0.60%
GOAWAY Rate:      0.60%

Latency Percentiles:
  p50: 199.81ms
  p95: 203.07ms
  p99: 206.76ms

Sample Errors:
  1. http2: Transport: cannot retry err [http2: Transport received Server's graceful shutdown GOAWAY] after Request.Body was written; define Request.GetBody to avoid this error
  2. http2: Transport: cannot retry err [http2: Transport received Server's graceful shutdown GOAWAY] after Request.Body was written; define Request.GetBody to avoid this error
  3. http2: Transport: cannot retry err [http2: Transport received Server's graceful shutdown GOAWAY] after Request.Body was written; define Request.GetBody to avoid this error
  • Logged out and shut down Teleport.
> tsh logout
Logged out all users from all proxies.

4. Test run (Teleport, fix applied)

The intent is to see no load balancing errors through Teleport when the bug fix is applied. The existing backend database is reused.

  • Teleport was built and run from the bug fix branch rana/kube-retryable-transport.
> git switch rana/kube-retryable-transport
Switched to branch 'rana/kube-retryable-transport'

> make clean && make full
  • Teleport started with the same teleport.yaml config file.

  • Looking at the Teleport terminal output, Kubernetes health checks show the local kind cluster kind-load-balance is healthy.

2025-11-02T12:38:14.758-08:00 INFO [KUBERNETE] Target became healthy target_name:kind-load-balance target_kind:kube_cluster target_origin: reason:threshold_reached message:1 health check passed healthcheck/worker.go:411
  • Logged in to Teleport, logged into the kube cluster, and validated the kube context.
> tsh login --proxy=localhost:3080 --user=$(whoami) --auth=local
Enter password for Teleport user rana.ian:
Enter an OTP code from a device:
> Profile URL:        https://localhost:3080
  Logged in as:       rana.ian
  Cluster:            teleport-laptop
  Roles:              access, editor, kube-admin
  Kubernetes:         enabled
  Kubernetes users:   rana.ian
  Kubernetes groups:  system:masters
  Valid until:        2025-11-03 00:39:15 -0800 PST [valid for 11h59m]
  Extensions:         login-ip, permit-agent-forwarding, permit-port-forwarding, permit-pty, private-key-policy

> tsh kube login kind-load-balance
Logged into Kubernetes cluster "kind-load-balance". Try 'kubectl version' to test the connection.

> kubectl config current-context
teleport-laptop-kind-load-balance
  • Ran the test-load-balance app 5 times. No load balancing GOAWAY errors were seen. The last run is shown.
> for i in {1..5}; do
  echo "Run $i/5..."
  ./test-load-balance run \
    -name "Teleport (with fix) - Run $i" \
    -workers 10 \
    -ops 100 \
    -no-retry \
    -output "with-fix-run${i}.json"
done
Run 1/5...
Run 2/5...
Run 3/5...
Run 4/5...
Run 5/5...
=== Kubernetes GOAWAY Load Test ===
Test Name:     Teleport (with fix) - Run 5
Context:       teleport-laptop-kind-load-balance
API Server:    https://localhost:3026
K8s Version:   v1.34.0
Namespace:     teleport-test
Workers:       10
Ops/Worker:    100
Total Ops:     1000
Retry Mode:    DISABLED (exposes GOAWAY)
Output File:   with-fix-run5.json
Started:       2025-11-02T12:45:05-08:00

Progress: 100/1000 (errors: 0, GOAWAY: 0)
Progress: 200/1000 (errors: 0, GOAWAY: 0)
Progress: 300/1000 (errors: 0, GOAWAY: 0)
Progress: 400/1000 (errors: 0, GOAWAY: 0)
Progress: 500/1000 (errors: 0, GOAWAY: 0)
Progress: 600/1000 (errors: 0, GOAWAY: 0)
Progress: 700/1000 (errors: 0, GOAWAY: 0)
Progress: 800/1000 (errors: 0, GOAWAY: 0)
Progress: 900/1000 (errors: 0, GOAWAY: 0)
Progress: 1000/1000 (errors: 0, GOAWAY: 0)

=== Results ===
Completed:        2025-11-02T12:45:23-08:00
Duration:         18.006742166s
Total Operations: 1000
Successful:       1000
Failed:           0
GOAWAY Errors:    0
Error Rate:       0.00%
GOAWAY Rate:      0.00%

Latency Percentiles:
  p50: 199.78ms
  p95: 202.77ms
  p99: 205.99ms

Results saved to: with-fix-run5.json
  • Compared test run results.
> ./test-load-balance compare \
  -files baseline.json,without-fix.json,with-fix-run5.json \
  -markdown > test-results.md

@rana rana marked this pull request as ready for review October 28, 2025 21:17
@github-actions github-actions bot requested review from avatus and cthach October 28, 2025 21:18
@rana rana requested review from rosstimothy and tigrato October 28, 2025 21:19
@rana rana force-pushed the rana/kube-retryable-transport branch 2 times, most recently from 84ecb8c to a47cc32 Compare October 29, 2025 03:25
@rana rana requested review from Joerger, capnspacehook, rosstimothy and tigrato and removed request for avatus and cthach October 30, 2025 16:24
@rana rana force-pushed the rana/kube-retryable-transport branch from ce110e7 to 0eff6f5 Compare October 30, 2025 17:08
@rana rana requested review from espadolini and zmb3 October 30, 2025 17:15
@espadolini
Contributor

How much disk space and memory can be consumed at any given time? Is this guaranteed to only ever happen in the agent, or will we also buffer things in the proxy?

Is the temporary path guaranteed to be ephemeral disk storage in the teleport-kube-agent deployments?

@rana rana force-pushed the rana/kube-retryable-transport branch from 4571de1 to 34fd919 Compare October 31, 2025 06:44
@rana
Contributor Author

rana commented Oct 31, 2025

How much disk space and memory can be consumed at any given time? Is this guaranteed to only ever happen in the agent, or will we also buffer things in the proxy?

Is the temporary path guaranteed to be ephemeral disk storage in the teleport-kube-agent deployments?

@espadolini Based on our Slack discussions, buffering to disk has been removed, and a weighted semaphore for in-memory buffering has been added.

@rana rana requested review from rosstimothy and tigrato October 31, 2025 07:04
@rana rana force-pushed the rana/kube-retryable-transport branch from b893409 to d34e1eb Compare November 2, 2025 22:35
@rana rana changed the title Fix Kubernetes GOAWAY retry errors by buffering request bodies Fix Kubernetes load balancing GOAWAY errors by buffering request bodies Nov 3, 2025
@rana rana changed the title Fix Kubernetes load balancing GOAWAY errors by buffering request bodies Fix Kubernetes load balancing GOAWAY errors by buffering request body Nov 3, 2025
Comment on lines +304 to +322
// RetryBufferTotal
if f.RetryBufferTotal <= 0 {
if env := os.Getenv(envRetryBufferTotal); env != "" {
if val, err := strconv.ParseInt(env, 10, 64); err == nil && val > 0 {
f.RetryBufferTotal = val
}
}
}
if f.RetryBufferTotal <= 0 {
f.RetryBufferTotal = defaultRetryBufferTotal
}
// RetryBufferPerRequest
if f.RetryBufferPerRequest <= 0 {
if env := os.Getenv(envRetryBufferPerRequest); env != "" {
if val, err := strconv.ParseInt(env, 10, 64); err == nil && val > 0 {
f.RetryBufferPerRequest = val
}
}
}
Contributor
In what scenarios would users need to alter these values? If we are allowing users to edit these fields, should we also allow them to explicitly opt out of this change?

Contributor Author

@rana rana Nov 3, 2025
RetryBufferTotal would be increased if requests are being blocked at a high rate. RetryBufferPerRequest would be increased if large payloads are not allowed to be transferred, though that scenario seems highly unlikely.

I also added the ability to disable the feature by setting RetryBufferTotal / TELEPORT_UNSTABLE_KUBE_RETRY_BUFFER_TOTAL to zero or a negative value.

After some searching, GCP and Azure don't explicitly mention using the --goaway-chance flag. The issue may be specific to AWS, so disabling may be beneficial for some customers.

@rosstimothy
Contributor

Is it possible to detect that a Kubernetes cluster is configured with a 0% goaway chance and disable this feature for requests to it?

@rana
Contributor Author

rana commented Nov 3, 2025

Is it possible to detect that a Kubernetes cluster is configured with a 0% goaway chance and disable this feature for requests to it?

I don't think so. The Kubernetes --goaway-chance flag is only on the server. And when --goaway-chance is active, kube just sends a close header.

@rana rana requested a review from rosstimothy November 3, 2025 22:33
Comment on lines +239 to +252
// If the connection closed in the middle of sending
// due to a GOAWAY, read remaining data for a retry attempt.
remaining, readErr := io.ReadAll(rb.src)
if readErr != nil && !errors.Is(readErr, io.EOF) && rb.readErr == nil {
rb.readErr = readErr
}
if len(remaining) > 0 {
rb.buf.Write(remaining)
}

// Close source and mark as done.
err := rb.src.Close()
rb.src = nil
rb.cond.Broadcast()
Contributor
We don't need to read the full request here, and we definitely must not block Close on I/O from the original client while we buffer the request, especially because we might not retry at all, depending on why the body is getting closed before reaching the end. The second body can just read from the buffer until the end of the buffer, then read (and buffer) from the source.

Comment thread lib/kube/proxy/transport_retryable.go Outdated
"error", err,
)
}
rt.semaphore.Release(req.ContentLength)
Contributor
This is still happening while Body and GetBody exist and rb.buf is holding on to its buffer.

Contributor Author
Changed to release in runtime finalizer.

Contributor
Using a finalizer (which shouldn't be used because go 1.24+ has cleanups - and cleanups shouldn't be used either) just means that we are going to hold on to the buffer and the memory even while we're streaming the response back to the client - and an arbitrary amount of time after that up to forever, since finalizers and cleanups are outright not guaranteed to actually run.

To solve this and the full read on close problem we should probably have a background goroutine that reads from the original body and fills in the buffer and some refcounting of bodies to close the original body, drop the buffer and release the semaphore when all bodies are closed and the call to the inner RoundTrip returns.

(technically there should also be a guarantee that the original body is guaranteed not to be interacted with anymore after the response body is closed but that is extreme pedantry and we are the only users of this anyway)

Contributor
By the way, this is literally the "output log and streaming" portion of the backend coding challenge, with cleanup at the end and with the constraint of having to use io.ReadClosers for readers.

@rana rana requested a review from espadolini November 4, 2025 08:58
Contributor

@espadolini espadolini left a comment
I don't think it's ok to have this enabled by default in v18, it can be a significant increase in memory usage depending on the workload.

@espadolini
Contributor

For those rare requests that hit the GOAWAY condition, can't we just return a 429 with Retry-After: 1 or a Retry-After with a date in the past or something along those lines, and have the client (which definitely has the full request in a buffer and definitely needs to handle 429 errors already, since it's a kube api client) retry the request instead?

…body

Kubernetes API servers send HTTP/2 GOAWAY errors to redistribute load across replicas for up to 2% of requests. Setting the request's `GetBody` function enables automatic retries.

For HTTP/2 requests, `GetBody` is set, and request bodies are incrementally buffered and accumulated as reads occur. In the case of GOAWAY errors occurring mid-send, body buffering is completed before closing. HTTP/1.1 protocol upgrades are not buffered, since they wouldn't receive HTTP/2 GOAWAY errors.

A weighted semaphore limits total concurrent buffering to prevent OOM. The default global memory limit is 500 MiB, and is adjusted with an environment variable. Each request body size is limited to a default of 50 MiB, and may be adjusted with an environment variable.

Changes:
- Added `retryableTransport` and `retryBuffer` enabling incremental request body buffering
- Added a weighted semaphore limiting total concurrent buffer size to 500 MiB by default
- Added a per-request buffer size limit with a 50 MiB default
- Added tunable parameters `RetryBufferTotal` and `RetryBufferPerRequest`
- Added environment variable `TELEPORT_UNSTABLE_KUBE_RETRY_BUFFER_TOTAL`
- Added environment variable `TELEPORT_UNSTABLE_KUBE_RETRY_BUFFER_PER_REQ`
- Added unit tests

Fixes #57766

Co-authored-by: rosstimothy <39066650+rosstimothy@users.noreply.github.com>
Co-authored-by: Edoardo Spadolini <edoardo.spadolini@goteleport.com>
@rana rana force-pushed the rana/kube-retryable-transport branch from 5f8d054 to bb0a216 Compare November 4, 2025 17:27
@rosstimothy rosstimothy closed this Nov 6, 2025
rosstimothy added a commit that referenced this pull request Nov 6, 2025
This is an attempt to fix #57766.

When a request is terminated because the upstream Kubernetes API
Server GOAWAY chance is exceeded, clients are informed to retry
by replying with a 429 status code and a Retry-After header.

This deviates from the approaches taken in
#57881 and
#60695 to favor
simplicity and avoid buffering request data in a teleport process.
The downside to this approach is that it requires clients to properly
handle retry requests.
rosstimothy added a commit that referenced this pull request Nov 7, 2025
rosstimothy added a commit that referenced this pull request Nov 7, 2025
rosstimothy added a commit that referenced this pull request Nov 10, 2025
rosstimothy added a commit that referenced this pull request Nov 11, 2025
rosstimothy added a commit that referenced this pull request Nov 11, 2025
github-merge-queue bot pushed a commit that referenced this pull request Nov 11, 2025
backport-bot-workflows bot pushed a commit that referenced this pull request Nov 11, 2025
backport-bot-workflows bot pushed a commit that referenced this pull request Nov 11, 2025
github-merge-queue bot pushed a commit that referenced this pull request Nov 17, 2025
* Kubernetes: Handle GOAWAY requests

This is an attempt to address #57766.

When a request is terminated because the upstream Kubernetes API
Server GOAWAY chance is exceeded, clients are informed to retry
by replying with a 429 status code and a Retry-After header.

This deviates from the approaches taken in
#57881 and
#60695 to favor
simplicity and avoid buffering request data in a teleport process.
The downside to this approach is that it requires clients to properly
handle retry requests.

* Populate GOAWAY response body (#61264)

Follow up to #61142 which
sets the response body so that clients which only look at the reason and
not the headers will behave appropriately.
github-merge-queue bot pushed a commit that referenced this pull request Nov 17, 2025