diff --git a/Index.md b/Index.md
index 86a8d66..9ef5f6c 100644
--- a/Index.md
+++ b/Index.md
@@ -11,7 +11,7 @@ permalink: /
 A Kubernetes-native Go microservice framework for building production-grade gRPC services with built-in observability, resilience, and HTTP gateway support. Follows [12-factor](https://12factor.net/) principles out of the box.
 {: .fs-6 .fw-300 }
 
-**Production-proven:** Powers 100+ microservices, handling peaks of ~70k QPS per service at [Gojek](https://www.gojek.com/en-id/).
+**Production-proven:** Powers 100+ microservices, handling peaks of ~70k QPS per service at [Gojek](https://www.gojek.com/en-id/). The full interceptor chain adds [~10–12% overhead](/architecture#interceptor-chain-overhead) vs bare gRPC — the bottleneck is transport, not the framework.
 {: .fs-5 .fw-500 }
 
 [Get Started](/getting-started){: .btn .btn-primary .fs-5 .mb-4 .mb-md-0 .mr-2 }
diff --git a/architecture.md b/architecture.md
index 54a3348..238165b 100644
--- a/architecture.md
+++ b/architecture.md
@@ -211,6 +211,34 @@ OpenTelemetry tracing spans are created by the `otelgrpc` stats handler configur
 {: .note }
 Health checks, ready checks, and gRPC reflection are **excluded by default** via `FilterMethods`. This prevents observability noise from Kubernetes probes. See the [FAQ](/faq) for how to customize this.
 
+### Interceptor Chain Overhead
+
+The full interceptor chain adds **~10–12% overhead** compared to bare gRPC (no interceptors). The bottleneck at high concurrency is gRPC HTTP/2 transport and Go runtime scheduling — not the interceptors themselves.
+
+End-to-end throughput measured on Apple M1 Pro (loopback, [ghz](https://ghz.sh/) load test, simple Echo handler):
+
+| Configuration | RPS @ c=1 | RPS @ c=50 | RPS @ c=200 | Avg @ c=1 | P99 @ c=200 |
+|---------------|-----------|------------|-------------|-----------|-------------|
+| **Default** (all interceptors) | 5,500 | 40,900 | 50,000 | 0.12ms | 7.9ms |
+| **Tuned** (error-only logging, no histograms) | 6,300 | 42,700 | 53,200 | 0.10ms | 7.3ms |
+| **No interceptors** (bare gRPC) | 7,000 | 46,600 | 55,800 | 0.09ms | 7.2ms |
+
+In a micro-benchmark, the full default interceptor chain costs **~4.8µs, ~1.8KB, ~45 allocs** per unary request. Profile with:
+```bash
+go test -run='^$' -bench=BenchmarkDefaultInterceptors -benchmem ./...
+```
+
+The tuned configuration uses `RESPONSE_TIME_LOG_ERROR_ONLY=true` and `ENABLE_PROMETHEUS_GRPC_HISTOGRAM=false`. See the [Configuration Reference](/config-reference#example-high-throughput-production) for the full set of tuning knobs.
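+
+Because these are end-to-end numbers, you can sanity-check them with any gRPC load generator. A minimal sketch using [ghz](https://ghz.sh/), assuming the service listens on `localhost:9090`; `your.package.Service/Echo` and the payload are placeholders for your own proto method:
+
+```bash
+# c=200 matches the highest-concurrency column above; compare the
+# reported Requests/sec across default, tuned, and bare-gRPC builds.
+ghz --insecure --call your.package.Service/Echo \
+  -d '{"msg":"hello"}' -c 200 --duration 30s localhost:9090
+```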
+
 ### Adding Custom Interceptors
 
 You can prepend your own interceptors to the chain:
diff --git a/config-reference.md b/config-reference.md
index 5a6e36a..6f58c3e 100644
--- a/config-reference.md
+++ b/config-reference.md
@@ -88,7 +88,7 @@ cfg := config.GetColdBrewConfig()
 | Variable | Type | Default | Description |
 |----------|------|---------|-------------|
 | `DISABLE_PROMETHEUS` | bool | `false` | Disable Prometheus metrics endpoint at `/metrics` |
-| `ENABLE_PROMETHEUS_GRPC_HISTOGRAM` | bool | `true` | Enable gRPC request latency histograms |
+| `ENABLE_PROMETHEUS_GRPC_HISTOGRAM` | bool | `true` | Enable gRPC request latency histograms (`grpc_server_handling_seconds`). When `false`, latency percentile queries and alerts stop working — only counters remain |
 | `PROMETHEUS_GRPC_HISTOGRAM_BUCKETS` | []float64 | `""` | Custom histogram buckets (comma-separated seconds, e.g., `0.005,0.01,0.025,0.05,0.1,0.25,0.5,1,2.5,5,10`) |
 
 ## New Relic
@@ -183,14 +183,43 @@ For services at 70k+ QPS where observability overhead matters:
 
 ```bash
 export APP_NAME=myservice
 export ENVIRONMENT=production
-export LOG_LEVEL=warn # suppresses info-level response time logs
-# export OTLP_ENDPOINT=your-collector:4317 # uncomment if using OTLP tracing
-export OTLP_SAMPLING_RATIO=0.05 # only applies when OTLP_ENDPOINT is set
-export ENABLE_PROMETHEUS_GRPC_HISTOGRAM=false
+export LOG_LEVEL=warn
+export RESPONSE_TIME_LOG_LEVEL=warn # must be >= LOG_LEVEL to take effect
+export RESPONSE_TIME_LOG_ERROR_ONLY=true # skip per-request logging for successful RPCs
+# export OTLP_ENDPOINT=your-collector:4317 # uncomment if using OTLP tracing
+export OTLP_SAMPLING_RATIO=0.05 # only applies when OTLP_ENDPOINT is set
+export ENABLE_PROMETHEUS_GRPC_HISTOGRAM=false # see warning below
 export DISABLE_NEW_RELIC=true
+export DISABLE_UNIX_GATEWAY=false # Unix socket for HTTP gateway (1.9x faster)
 export HTTP_COMPRESSION_MIN_SIZE=512
 ```
 
+### Measured tuning impact
+
+End-to-end throughput on Apple M1 Pro (loopback, [ghz](https://ghz.sh/) load test, simple Echo handler):
+
+| Configuration | RPS @ c=200 | P99 @ c=200 | Change |
+|---------------|-------------|-------------|--------|
+| **Default** (all interceptors, info logging) | 50,000 | 7.9ms | baseline |
+| **Tuned** (above config) | 53,200 | 7.3ms | +6% RPS |
+| **No interceptors** (bare gRPC) | 55,800 | 7.2ms | +12% RPS |
+
+The full interceptor chain (logging, tracing, metrics, panic recovery) adds ~10–12% overhead. Most of that is I/O-bound (log writes, gRPC transport) rather than CPU-bound, which is why tuning yields a modest but real improvement. See the [Architecture page](/architecture#interceptor-chain-overhead) for the full breakdown.
+
+{: .warning }
+Setting `ENABLE_PROMETHEUS_GRPC_HISTOGRAM=false` removes the `grpc_server_handling_seconds` metric entirely. This means **latency percentile queries and alerts** (e.g., `histogram_quantile(0.99, ...)`) will stop working. You will still have `grpc_server_handled_total` (RPCs completed, by status code) and `grpc_server_started_total` (RPCs started). Only disable histograms if you have an alternative latency signal (e.g., distributed tracing percentiles, or an external load balancer metric).
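+
+Before deciding, check what the histogram currently gives you. A minimal sketch, assuming the HTTP port that serves `/metrics` is `9091`, as in the debugging examples in these docs:
+
+```bash
+# Confirm the histogram series is currently exported
+curl -s localhost:9091/metrics | grep -m1 grpc_server_handling_seconds_bucket
+
+# The kind of p99 query that silently breaks once the histogram is gone:
+#   histogram_quantile(0.99,
+#     sum(rate(grpc_server_handling_seconds_bucket[5m])) by (le))
+```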
+
 ---
 
 Source: [`core/config/config.go`](https://github.com/go-coldbrew/core/blob/main/config/config.go)
diff --git a/howto/Debugging.md b/howto/Debugging.md
index ed17376..43695ff 100644
--- a/howto/Debugging.md
+++ b/howto/Debugging.md
@@ -38,6 +38,62 @@ Showing top 5 nodes out of 45
        0     0%   100%       20ms 66.67%  github.com/ankurs/MyApp/proto.RegisterMySvcHandlerClient.func3
 ```
 
+### Profiling under gRPC load
+
+To understand where CPU time goes under realistic load, combine pprof with a gRPC load generator like [ghz](https://ghz.sh/):
+
+```bash
+# Terminal 1: start your service
+make run
+
+# Terminal 2: generate sustained gRPC load
+ghz --insecure --call your.package.Service/Method \
+  -d '{"msg":"hello"}' -c 100 --duration 25s localhost:9090
+
+# Terminal 3: capture a 15-second CPU profile while load is running
+curl -s "http://localhost:9091/debug/pprof/profile?seconds=15" -o cpu.prof
+go tool pprof -top -cum cpu.prof
+```
+
+Here's what a typical ColdBrew CPU profile looks like under load (Apple M1 Pro, c=100):
+
+| Component | Cumulative CPU | What it does |
+|-----------|----------------|--------------|
+| `ResponseTimeLoggingInterceptor` | ~44% | JSON log serialization + stdout write per request |
+| gRPC HTTP/2 transport | ~15% | Frame reading/writing, buffer flush |
+| Go runtime (scheduling, GC) | ~15% | Goroutine scheduling, garbage collection |
+| `TraceIdInterceptor` | ~6% | Trace ID extraction and propagation |
+| `errors/notifier.SetTraceIdWithValue` | ~5% | Setting trace ID on the error notifier context |
+| `otelgrpc.TagRPC` | ~1% | OpenTelemetry span creation |
+| Prometheus metrics | ~1% | Histogram bucket recording |
+
+{: .important }
+**Cumulative CPU does not equal throughput impact.** Response time logging shows ~44% cumulative CPU because `os.(*File).Write` blocks the goroutine in a syscall — but other goroutines run during that time. Disabling logging yields only ~6% more RPS, because the real bottleneck is gRPC transport I/O and goroutine scheduling. Always validate pprof findings with actual throughput measurements.
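+
+**Validating a profiling finding.** To confirm that a hot frame actually costs throughput, re-run the same load with the suspect component toggled off and compare the `Requests/sec` that ghz reports. A sketch for the logging row above, assuming `make run` passes your shell environment through to the service (the env var is documented in the [Configuration Reference](/config-reference#example-high-throughput-production)):
+
+```bash
+# Run 1: baseline, default config (Terminal 1)
+make run
+
+# Run 2: restart with success-path request logging disabled (Terminal 1)
+RESPONSE_TIME_LOG_ERROR_ONLY=true make run
+
+# Drive identical load against each run (Terminal 2) and compare summaries
+ghz --insecure --call your.package.Service/Method \
+  -d '{"msg":"hello"}' -c 100 --duration 25s localhost:9090
+```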
+
+For allocation profiling under load:
+
+```bash
+# Capture heap profile while load is running
+curl -s "http://localhost:9091/debug/pprof/heap?debug=0" -o heap.prof
+go tool pprof -alloc_objects -top heap.prof
+```
+
+Top allocation sources under load are gRPC metadata copying (~27%), otelgrpc span creation (~13%), and the options context store (~10%). These are largely inherent to gRPC's per-request metadata model.
+
 ### Analyzing profiles
 
 The `go tool pprof` command can also be used to analyze profiles to find the root cause of performance issues. For more information, please refer to the [pprof walkthrough] and the [diagnostics doc].
diff --git a/howto/production.md b/howto/production.md
index dbf19da..d6ea841 100644
--- a/howto/production.md
+++ b/howto/production.md
@@ -227,7 +227,7 @@ ColdBrew exposes these metrics out of the box via gRPC interceptors:
 | Metric | Type | Description |
 |--------|------|-------------|
 | `grpc_server_handled_total` | Counter | Total RPCs completed, by method and status code |
-| `grpc_server_handling_seconds` | Histogram | RPC latency distribution (if `ENABLE_PROMETHEUS_GRPC_HISTOGRAM=true`) |
+| `grpc_server_handling_seconds` | Histogram | RPC latency distribution. **Only available when `ENABLE_PROMETHEUS_GRPC_HISTOGRAM=true`** (the default). Disabling this removes all latency percentile data from Prometheus |
 | `grpc_server_started_total` | Counter | Total RPCs started |
 
 Recommended alerts:
@@ -251,6 +251,9 @@ Recommended alerts:
     for: 5m
 ```
 
+{: .warning }
+The latency alert above requires `ENABLE_PROMETHEUS_GRPC_HISTOGRAM=true` (the default). If you set it to `false` for [throughput tuning](/config-reference#measured-tuning-impact), the `grpc_server_handling_seconds` metric disappears and this alert will silently stop firing. Ensure you have an alternative latency signal (distributed tracing, load balancer metrics) before disabling histograms.
+
 ### Custom histogram buckets
 
 If the default latency buckets don't match your SLOs, customize them:
@@ -435,6 +438,7 @@ env:
 - [ ] Store secrets in Kubernetes Secrets, not environment variable literals
 - [ ] Disable debug endpoints in production if not needed (`DISABLE_DEBUG=true`)
 - [ ] Run `make lint` (includes `govulncheck`) before deploying
+- [ ] For high-QPS services: set `RESPONSE_TIME_LOG_ERROR_ONLY=true` to skip per-request logging on successful RPCs (the single largest source of per-request CPU overhead — see [tuning impact](/config-reference#measured-tuning-impact))
 
 ---
 
 [ColdBrew cookiecutter]: /getting-started