2 changes: 1 addition & 1 deletion Index.md
@@ -11,7 +11,7 @@ permalink: /
A Kubernetes-native Go microservice framework for building production-grade gRPC services with built-in observability, resilience, and HTTP gateway support. Follows [12-factor](https://12factor.net/) principles out of the box.
{: .fs-6 .fw-300 }

**Production-proven:** Powers 100+ microservices, handling peaks of ~70k QPS per service at [Gojek](https://www.gojek.com/en-id/). The full interceptor chain adds [~10–12% overhead](/architecture#interceptor-chain-overhead) over bare gRPC; the bottleneck is transport, not the framework.
{: .fs-5 .fw-500 }

[Get Started](/getting-started){: .btn .btn-primary .fs-5 .mb-4 .mb-md-0 .mr-2 }
19 changes: 19 additions & 0 deletions architecture.md
@@ -211,6 +211,25 @@ OpenTelemetry tracing spans are created by the `otelgrpc` stats handler configur
{: .note }
Health checks, ready checks, and gRPC reflection are **excluded by default** via `FilterMethods`. This prevents observability noise from Kubernetes probes. See the [FAQ](/faq) for how to customize this.

### Interceptor Chain Overhead

The full interceptor chain adds **~10–12% overhead** compared to bare gRPC (no interceptors). The bottleneck at high concurrency is gRPC HTTP/2 transport and Go runtime scheduling — not the interceptors themselves.

End-to-end throughput measured on Apple M1 Pro (loopback, [ghz](https://ghz.sh/) load test, simple Echo handler):

| Configuration | RPS @ c=1 | RPS @ c=50 | RPS @ c=200 | Avg latency @ c=1 | P99 latency @ c=200 |
|---------------|-----------|------------|-------------|-------------------|---------------------|
| **Default** (all interceptors) | 5,500 | 40,900 | 50,000 | 0.12ms | 7.9ms |
| **Tuned** (error-only logging, no histograms) | 6,300 | 42,700 | 53,200 | 0.10ms | 7.3ms |
| **No interceptors** (bare gRPC) | 7,000 | 46,600 | 55,800 | 0.09ms | 7.2ms |

Micro-benchmarking the default interceptor chain in isolation shows **~4.8µs, ~1.8KB, ~45 allocs** of overhead per unary request. Profile with:
```bash
go test -run='^$' -bench=BenchmarkDefaultInterceptors -benchmem ./...
```
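For orientation, here is a minimal sketch of how a chain-level micro-benchmark like this can be structured. This is an illustration, not ColdBrew's actual `BenchmarkDefaultInterceptors`: the pass-through interceptors and method name below are placeholders, and a real run would chain the framework's default logging, metrics, and recovery interceptors instead.

```go
package interceptors_test

import (
	"context"
	"testing"

	grpc_middleware "github.com/grpc-ecosystem/go-grpc-middleware"
	"google.golang.org/grpc"
)

// passThrough stands in for a real interceptor; swap in the framework's
// default chain to measure actual per-request overhead.
func passThrough(ctx context.Context, req interface{}, info *grpc.UnaryServerInfo, handler grpc.UnaryHandler) (interface{}, error) {
	return handler(ctx, req)
}

func BenchmarkInterceptorChain(b *testing.B) {
	chain := grpc_middleware.ChainUnaryServer(passThrough, passThrough, passThrough)
	info := &grpc.UnaryServerInfo{FullMethod: "/echo.Echo/Echo"} // placeholder method
	handler := func(ctx context.Context, req interface{}) (interface{}, error) {
		return req, nil // no-op handler isolates interceptor cost from business logic
	}

	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		if _, err := chain(context.Background(), "ping", info, handler); err != nil {
			b.Fatal(err)
		}
	}
}
```

Invoking the chained interceptor directly, with no listener or network in the path, is what makes the µs/KB/allocs numbers attributable to the interceptors alone.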

The tuned configuration uses `RESPONSE_TIME_LOG_ERROR_ONLY=true` and `ENABLE_PROMETHEUS_GRPC_HISTOGRAM=false`. See the [Configuration Reference](/config-reference#example-high-throughput-production) for the full set of tuning knobs.

### Adding Custom Interceptors

You can prepend your own interceptors to the chain:
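The framework-specific registration call is elided from this diff, so here is a generic sketch using plain gRPC server options. Everything in it (`authInterceptor`, the metadata key) is hypothetical; the point is only that interceptors execute in the order they are listed, so a prepended interceptor runs before the defaults.

```go
package main

import (
	"context"

	"google.golang.org/grpc"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/metadata"
	"google.golang.org/grpc/status"
)

// authInterceptor is a hypothetical custom interceptor that rejects requests
// missing an "authorization" metadata key before the rest of the chain runs.
func authInterceptor(ctx context.Context, req interface{}, info *grpc.UnaryServerInfo, handler grpc.UnaryHandler) (interface{}, error) {
	md, ok := metadata.FromIncomingContext(ctx)
	if !ok || len(md.Get("authorization")) == 0 {
		return nil, status.Error(codes.Unauthenticated, "missing authorization metadata")
	}
	return handler(ctx, req)
}

func main() {
	srv := grpc.NewServer(
		// Interceptors run in the order listed: authInterceptor first,
		// then whatever default chain follows it.
		grpc.ChainUnaryInterceptor(authInterceptor /* , default chain... */),
	)
	_ = srv // register services and serve as usual
}
```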
28 changes: 23 additions & 5 deletions config-reference.md
@@ -88,7 +88,7 @@ cfg := config.GetColdBrewConfig()
| Variable | Type | Default | Description |
|----------|------|---------|-------------|
| `DISABLE_PROMETHEUS` | bool | `false` | Disable Prometheus metrics endpoint at `/metrics` |
| `ENABLE_PROMETHEUS_GRPC_HISTOGRAM` | bool | `true` | Enable gRPC request latency histograms (`grpc_server_handling_seconds`). When `false`, latency percentile queries and alerts stop working — only counters remain |
| `PROMETHEUS_GRPC_HISTOGRAM_BUCKETS` | []float64 | `""` | Custom histogram buckets (comma-separated seconds, e.g., `0.005,0.01,0.025,0.05,0.1,0.25,0.5,1,2.5,5,10`) |

## New Relic
@@ -183,14 +183,32 @@ For services at 70k+ QPS where observability overhead matters:
```bash
export APP_NAME=myservice
export ENVIRONMENT=production
export LOG_LEVEL=warn
export RESPONSE_TIME_LOG_LEVEL=warn # must be >= LOG_LEVEL to take effect
export RESPONSE_TIME_LOG_ERROR_ONLY=true # skip per-request logging for successful RPCs
# export OTLP_ENDPOINT=your-collector:4317 # uncomment if using OTLP tracing
export OTLP_SAMPLING_RATIO=0.05 # only applies when OTLP_ENDPOINT is set
export ENABLE_PROMETHEUS_GRPC_HISTOGRAM=false # see warning below
export DISABLE_NEW_RELIC=true
export DISABLE_UNIX_GATEWAY=false # Unix socket for HTTP gateway (1.9x faster)
export HTTP_COMPRESSION_MIN_SIZE=512
```

### Measured tuning impact

End-to-end throughput on Apple M1 Pro (loopback, [ghz](https://ghz.sh/) load test, simple Echo handler):

| Configuration | RPS @ c=200 | P99 @ c=200 | Change |
|---------------|-------------|-------------|--------|
| **Default** (all interceptors, info logging) | 50,000 | 7.9ms | baseline |
| **Tuned** (above config) | 53,200 | 7.3ms | +6% RPS |
| **No interceptors** (bare gRPC) | 55,800 | 7.2ms | +12% RPS |

The full interceptor chain (logging, tracing, metrics, panic recovery) adds ~10–12% overhead. Most of that is I/O-bound (log writes, gRPC transport) rather than CPU-bound, which is why tuning yields a modest but real improvement. See the [Architecture page](/architecture#interceptor-chain-overhead) for the full breakdown.

{: .warning }
Setting `ENABLE_PROMETHEUS_GRPC_HISTOGRAM=false` removes the `grpc_server_handling_seconds` metric entirely. This means **latency percentile queries and alerts** (e.g., `histogram_quantile(0.99, ...)`) will stop working. You will still have `grpc_server_handled_total` (request count by status code) and `grpc_server_started_total` (request count started). Only disable histograms if you have an alternative latency signal (e.g., distributed tracing percentiles, or an external load balancer metric).

---

Source: [`core/config/config.go`](https://github.com/go-coldbrew/core/blob/main/config/config.go)
42 changes: 42 additions & 0 deletions howto/Debugging.md
@@ -38,6 +38,48 @@ Showing top 5 nodes out of 45
0 0% 100% 20ms 66.67% github.com/ankurs/MyApp/proto.RegisterMySvcHandlerClient.func3
```

### Profiling under gRPC load

To understand where CPU time goes under realistic load, combine pprof with a gRPC load generator like [ghz](https://ghz.sh/):

```bash
# Terminal 1: start your service
make run

# Terminal 2: generate sustained gRPC load
ghz --insecure --call your.package.Service/Method \
-d '{"msg":"hello"}' -c 100 --duration 25s localhost:9090

# Terminal 3: capture a 15-second CPU profile while load is running
curl -s "http://localhost:9091/debug/pprof/profile?seconds=15" -o cpu.prof
go tool pprof -top -cum cpu.prof
```

Here's what a typical ColdBrew CPU profile looks like under load (Apple M1 Pro, c=100):

| Component | Cumulative CPU | What it does |
|-----------|---------------|--------------|
| `ResponseTimeLoggingInterceptor` | ~44% | JSON log serialization + stdout write per request |
| gRPC HTTP/2 transport | ~15% | Frame reading/writing, buffer flush |
| Go runtime (scheduling, GC) | ~15% | Goroutine scheduling, garbage collection |
| `TraceIdInterceptor` | ~6% | Trace ID extraction and propagation |
| `errors/notifier.SetTraceIdWithValue` | ~5% | Setting trace ID on error notifier context |
| `otelgrpc.TagRPC` | ~1% | OpenTelemetry span creation |
| Prometheus metrics | ~1% | Histogram bucket recording |

{: .important }
**Cumulative CPU does not equal throughput impact.** Response time logging shows 44% cumulative CPU because `os.(*File).Write` blocks the goroutine in a syscall — but other goroutines run during that time. Disabling logging yields only ~6% more RPS, because the real bottleneck is gRPC transport I/O and goroutine scheduling. Always validate pprof findings with actual throughput measurements.

For allocation profiling under load:

```bash
# Capture heap profile while load is running
curl -s "http://localhost:9091/debug/pprof/heap?debug=0" -o heap.prof
go tool pprof -alloc_objects -top heap.prof
```

Top allocation sources under load are gRPC metadata copying (~27%), otelgrpc span creation (~13%), and options context store (~10%). These are largely inherent to gRPC's per-request metadata model.

### Analyzing profiles

`go tool pprof` can also be used interactively to dig into these profiles and track down the root cause of performance issues. For more information, refer to the [pprof walkthrough] and the [diagnostics doc].
6 changes: 5 additions & 1 deletion howto/production.md
@@ -227,7 +227,7 @@ ColdBrew exposes these metrics out of the box via gRPC interceptors:
| Metric | Type | Description |
|--------|------|-------------|
| `grpc_server_handled_total` | Counter | Total RPCs completed, by method and status code |
| `grpc_server_handling_seconds` | Histogram | RPC latency distribution. **Only available when `ENABLE_PROMETHEUS_GRPC_HISTOGRAM=true`** (the default). Disabling this removes all latency percentile data from Prometheus |
| `grpc_server_started_total` | Counter | Total RPCs started |

Recommended alerts:
@@ -251,6 +251,9 @@ Recommended alerts:
for: 5m
```

{: .warning }
The latency alert above requires `ENABLE_PROMETHEUS_GRPC_HISTOGRAM=true` (the default). If you set it to `false` for [throughput tuning](/config-reference#measured-tuning-impact), the `grpc_server_handling_seconds` metric disappears and this alert will silently stop firing. Ensure you have an alternative latency signal (distributed tracing, load balancer metrics) before disabling histograms.

### Custom histogram buckets

If the default latency buckets don't match your SLOs, customize them:
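ColdBrew drives this through the `PROMETHEUS_GRPC_HISTOGRAM_BUCKETS` environment variable (comma-separated seconds; see the [Configuration Reference](/config-reference)). As a sketch of what the equivalent direct call looks like, assuming the go-grpc-prometheus library underneath (an assumption, not confirmed by this diff):

```go
package main

import (
	grpc_prometheus "github.com/grpc-ecosystem/go-grpc-prometheus"
)

func init() {
	// Dense resolution below 100ms suits a low-latency SLO. Equivalent env form:
	// PROMETHEUS_GRPC_HISTOGRAM_BUCKETS=0.005,0.01,0.025,0.05,0.1,0.25,0.5,1
	grpc_prometheus.EnableHandlingTimeHistogram(
		grpc_prometheus.WithHistogramBuckets([]float64{0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1}),
	)
}
```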
@@ -435,6 +438,7 @@ env:
- [ ] Store secrets in Kubernetes Secrets, not environment variable literals
- [ ] Disable debug endpoints in production if not needed (`DISABLE_DEBUG=true`)
- [ ] Run `make lint` (includes `govulncheck`) before deploying
- [ ] For high-QPS services: set `RESPONSE_TIME_LOG_ERROR_ONLY=true` to skip per-request logging on successful RPCs (the single largest source of per-request CPU overhead — see [tuning impact](/config-reference#measured-tuning-impact))

---
[ColdBrew cookiecutter]: /getting-started