2 changes: 1 addition & 1 deletion Index.md
@@ -11,7 +11,7 @@ permalink: /
A Kubernetes-native Go microservice framework for building production-grade gRPC services with built-in observability, resilience, and HTTP gateway support. Follows [12-factor](https://12factor.net/) principles out of the box.
{: .fs-6 .fw-300 }

**Production-proven:** Powers 100+ microservices, handling peaks of ~70k QPS per service at [Gojek](https://www.gojek.com/en-id/). The full interceptor chain adds [~10–12% overhead](/architecture#interceptor-chain-overhead) over bare gRPC; the bottleneck is transport, not the framework.
{: .fs-5 .fw-500 }

[Get Started](/getting-started){: .btn .btn-primary .fs-5 .mb-4 .mb-md-0 .mr-2 }
19 changes: 19 additions & 0 deletions architecture.md
@@ -211,6 +211,25 @@ OpenTelemetry tracing spans are created by the `otelgrpc` stats handler configur
{: .note }
Health checks, ready checks, and gRPC reflection are **excluded by default** via `FilterMethods`. This prevents observability noise from Kubernetes probes. See the [FAQ](/faq) for how to customize this.

### Interceptor Chain Overhead

The full interceptor chain adds **~10–12% overhead** compared to bare gRPC (no interceptors). The bottleneck at high concurrency is gRPC HTTP/2 transport and Go runtime scheduling — not the interceptors themselves.

End-to-end throughput measured on Apple M1 Pro (loopback, [ghz](https://ghz.sh/) load test, simple Echo handler):

| Configuration | RPS @ c=1 | RPS @ c=50 | RPS @ c=200 | Avg latency @ c=1 | P99 latency @ c=200 |
|---------------|-----------|------------|-------------|-------------------|---------------------|
| **Default** (all interceptors) | 5,500 | 40,900 | 50,000 | 0.12ms | 7.9ms |
| **Tuned** (error-only logging, no histograms) | 6,300 | 42,700 | 53,200 | 0.10ms | 7.3ms |
| **No interceptors** (bare gRPC) | 7,000 | 46,600 | 55,800 | 0.09ms | 7.2ms |

Micro-benchmarking the default interceptor chain in isolation shows **~4.8µs, ~1.8KB, ~45 allocs** of overhead per unary request. Profile with:
```bash
go test -run='^$' -bench=BenchmarkDefaultInterceptors -benchmem ./...
```
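For orientation, here is a minimal sketch of how a chain-level micro-benchmark like this can be structured. This is an illustration, not ColdBrew's actual `BenchmarkDefaultInterceptors`: the pass-through interceptors and method name below are placeholders, and a real run would chain the framework's default logging, metrics, and recovery interceptors instead.

```go
package interceptors_test

import (
	"context"
	"testing"

	grpc_middleware "github.com/grpc-ecosystem/go-grpc-middleware"
	"google.golang.org/grpc"
)

// passThrough stands in for a real interceptor; swap in the framework's
// default chain to measure actual per-request overhead.
func passThrough(ctx context.Context, req interface{}, info *grpc.UnaryServerInfo, handler grpc.UnaryHandler) (interface{}, error) {
	return handler(ctx, req)
}

func BenchmarkInterceptorChain(b *testing.B) {
	chain := grpc_middleware.ChainUnaryServer(passThrough, passThrough, passThrough)
	info := &grpc.UnaryServerInfo{FullMethod: "/echo.Echo/Echo"} // placeholder method
	handler := func(ctx context.Context, req interface{}) (interface{}, error) {
		return req, nil // no-op handler isolates interceptor cost from business logic
	}

	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		if _, err := chain(context.Background(), "ping", info, handler); err != nil {
			b.Fatal(err)
		}
	}
}
```

Invoking the chained interceptor directly, with no listener or network in the path, is what makes the µs/KB/allocs numbers attributable to the interceptors alone.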

The tuned configuration uses `RESPONSE_TIME_LOG_ERROR_ONLY=true` and `ENABLE_PROMETHEUS_GRPC_HISTOGRAM=false`. See the [Configuration Reference](/config-reference#example-high-throughput-production) for the full set of tuning knobs.

### Adding Custom Interceptors

You can prepend your own interceptors to the chain:
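The framework-specific registration call is elided from this diff, so here is a generic sketch using plain gRPC server options. Everything in it (`authInterceptor`, the metadata key) is hypothetical; the point is only that interceptors execute in the order they are listed, so a prepended interceptor runs before the defaults.

```go
package main

import (
	"context"

	"google.golang.org/grpc"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/metadata"
	"google.golang.org/grpc/status"
)

// authInterceptor is a hypothetical custom interceptor that rejects requests
// missing an "authorization" metadata key before the rest of the chain runs.
func authInterceptor(ctx context.Context, req interface{}, info *grpc.UnaryServerInfo, handler grpc.UnaryHandler) (interface{}, error) {
	md, ok := metadata.FromIncomingContext(ctx)
	if !ok || len(md.Get("authorization")) == 0 {
		return nil, status.Error(codes.Unauthenticated, "missing authorization metadata")
	}
	return handler(ctx, req)
}

func main() {
	srv := grpc.NewServer(
		// Interceptors run in the order listed: authInterceptor first,
		// then whatever default chain follows it.
		grpc.ChainUnaryInterceptor(authInterceptor /* , default chain... */),
	)
	_ = srv // register services and serve as usual
}
```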
28 changes: 23 additions & 5 deletions config-reference.md
@@ -88,7 +88,7 @@ cfg := config.GetColdBrewConfig()
| Variable | Type | Default | Description |
|----------|------|---------|-------------|
| `DISABLE_PROMETHEUS` | bool | `false` | Disable Prometheus metrics endpoint at `/metrics` |
| `ENABLE_PROMETHEUS_GRPC_HISTOGRAM` | bool | `true` | Enable gRPC request latency histograms (`grpc_server_handling_seconds`). When `false`, latency percentile queries and alerts stop working — only counters remain |
| `PROMETHEUS_GRPC_HISTOGRAM_BUCKETS` | []float64 | `""` | Custom histogram buckets (comma-separated seconds, e.g., `0.005,0.01,0.025,0.05,0.1,0.25,0.5,1,2.5,5,10`) |

## New Relic
@@ -183,14 +183,32 @@ For services at 70k+ QPS where observability overhead matters:
```bash
export APP_NAME=myservice
export ENVIRONMENT=production
export LOG_LEVEL=warn
export RESPONSE_TIME_LOG_LEVEL=warn # must be >= LOG_LEVEL to take effect
export RESPONSE_TIME_LOG_ERROR_ONLY=true # skip per-request logging for successful RPCs
# export OTLP_ENDPOINT=your-collector:4317 # uncomment if using OTLP tracing
export OTLP_SAMPLING_RATIO=0.05 # only applies when OTLP_ENDPOINT is set
export ENABLE_PROMETHEUS_GRPC_HISTOGRAM=false # see warning below
export DISABLE_NEW_RELIC=true
export DISABLE_UNIX_GATEWAY=false # Unix socket for HTTP gateway (1.9x faster)
export HTTP_COMPRESSION_MIN_SIZE=512
```

### Measured tuning impact

End-to-end throughput on Apple M1 Pro (loopback, [ghz](https://ghz.sh/) load test, simple Echo handler):

| Configuration | RPS @ c=200 | P99 @ c=200 | Change |
|---------------|-------------|-------------|--------|
| **Default** (all interceptors, info logging) | 50,000 | 7.9ms | baseline |
| **Tuned** (above config) | 53,200 | 7.3ms | +6% RPS |
| **No interceptors** (bare gRPC) | 55,800 | 7.2ms | +12% RPS |

The full interceptor chain (logging, tracing, metrics, panic recovery) adds ~10–12% overhead. Most of that is I/O-bound (log writes, gRPC transport) rather than CPU-bound, which is why tuning yields a modest but real improvement. See the [Architecture page](/architecture#interceptor-chain-overhead) for the full breakdown.

{: .warning }
Setting `ENABLE_PROMETHEUS_GRPC_HISTOGRAM=false` removes the `grpc_server_handling_seconds` metric entirely. This means **latency percentile queries and alerts** (e.g., `histogram_quantile(0.99, ...)`) will stop working. You will still have `grpc_server_handled_total` (request count by status code) and `grpc_server_started_total` (request count started). Only disable histograms if you have an alternative latency signal (e.g., distributed tracing percentiles, or an external load balancer metric).

---

Source: [`core/config/config.go`](https://github.com/go-coldbrew/core/blob/main/config/config.go)
42 changes: 42 additions & 0 deletions howto/Debugging.md
@@ -38,6 +38,48 @@ Showing top 5 nodes out of 45
0 0% 100% 20ms 66.67% github.com/ankurs/MyApp/proto.RegisterMySvcHandlerClient.func3
```

### Profiling under gRPC load

To understand where CPU time goes under realistic load, combine pprof with a gRPC load generator like [ghz](https://ghz.sh/):

```bash
# Terminal 1: start your service
make run

# Terminal 2: generate sustained gRPC load
ghz --insecure --call your.package.Service/Method \
-d '{"msg":"hello"}' -c 100 --duration 25s localhost:9090

# Terminal 3: capture a 15-second CPU profile while load is running
curl -s "http://localhost:9091/debug/pprof/profile?seconds=15" -o cpu.prof
go tool pprof -top -cum cpu.prof
```

Here's what a typical ColdBrew CPU profile looks like under load (Apple M1 Pro, c=100):

| Component | Cumulative CPU | What it does |
|-----------|---------------|--------------|
| `ResponseTimeLoggingInterceptor` | ~44% | JSON log serialization + stdout write per request |
| gRPC HTTP/2 transport | ~15% | Frame reading/writing, buffer flush |
| Go runtime (scheduling, GC) | ~15% | Goroutine scheduling, garbage collection |
| `TraceIdInterceptor` | ~6% | Trace ID extraction and propagation |
| `errors/notifier.SetTraceIdWithValue` | ~5% | Setting trace ID on error notifier context |
| `otelgrpc.TagRPC` | ~1% | OpenTelemetry span creation |
| Prometheus metrics | ~1% | Histogram bucket recording |

{: .important }
**Cumulative CPU does not equal throughput impact.** Response time logging shows 44% cumulative CPU because `os.(*File).Write` blocks the goroutine in a syscall — but other goroutines run during that time. Disabling logging yields only ~6% more RPS, because the real bottleneck is gRPC transport I/O and goroutine scheduling. Always validate pprof findings with actual throughput measurements.

For allocation profiling under load:

```bash
# Capture heap profile while load is running
curl -s "http://localhost:9091/debug/pprof/heap?debug=0" -o heap.prof
go tool pprof -alloc_objects -top heap.prof
```

Top allocation sources under load are gRPC metadata copying (~27%), otelgrpc span creation (~13%), and options context store (~10%). These are largely inherent to gRPC's per-request metadata model.

### Analyzing profiles

`go tool pprof` can also be used interactively to dig into these profiles and track down the root cause of performance issues. For more information, refer to the [pprof walkthrough] and the [diagnostics doc].
6 changes: 5 additions & 1 deletion howto/production.md
@@ -227,7 +227,7 @@ ColdBrew exposes these metrics out of the box via gRPC interceptors:
| Metric | Type | Description |
|--------|------|-------------|
| `grpc_server_handled_total` | Counter | Total RPCs completed, by method and status code |
| `grpc_server_handling_seconds` | Histogram | RPC latency distribution. **Only available when `ENABLE_PROMETHEUS_GRPC_HISTOGRAM=true`** (the default). Disabling this removes all latency percentile data from Prometheus |
| `grpc_server_started_total` | Counter | Total RPCs started |

Recommended alerts:
@@ -251,6 +251,9 @@ Recommended alerts:
for: 5m
```

{: .warning }
The latency alert above requires `ENABLE_PROMETHEUS_GRPC_HISTOGRAM=true` (the default). If you set it to `false` for [throughput tuning](/config-reference#measured-tuning-impact), the `grpc_server_handling_seconds` metric disappears and this alert will silently stop firing. Ensure you have an alternative latency signal (distributed tracing, load balancer metrics) before disabling histograms.

### Custom histogram buckets

If the default latency buckets don't match your SLOs, customize them:
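ColdBrew drives this through the `PROMETHEUS_GRPC_HISTOGRAM_BUCKETS` environment variable (comma-separated seconds; see the [Configuration Reference](/config-reference)). As a sketch of what the equivalent direct call looks like, assuming the go-grpc-prometheus library underneath (an assumption, not confirmed by this diff):

```go
package main

import (
	grpc_prometheus "github.com/grpc-ecosystem/go-grpc-prometheus"
)

func init() {
	// Dense resolution below 100ms suits a low-latency SLO. Equivalent env form:
	// PROMETHEUS_GRPC_HISTOGRAM_BUCKETS=0.005,0.01,0.025,0.05,0.1,0.25,0.5,1
	grpc_prometheus.EnableHandlingTimeHistogram(
		grpc_prometheus.WithHistogramBuckets([]float64{0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1}),
	)
}
```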
@@ -435,6 +438,7 @@ env:
- [ ] Store secrets in Kubernetes Secrets, not environment variable literals
- [ ] Disable debug endpoints in production if not needed (`DISABLE_DEBUG=true`)
- [ ] Run `make lint` (includes `govulncheck`) before deploying
- [ ] For high-QPS services: set `RESPONSE_TIME_LOG_ERROR_ONLY=true` to skip per-request logging on successful RPCs (the single largest source of per-request CPU overhead — see [tuning impact](/config-reference#measured-tuning-impact))

---
[ColdBrew cookiecutter]: /getting-started