[Cosmos] [Don't Review] add azcosmos_perf — Go SDK performance testing tool#26764

Draft
tvaron3 wants to merge 2 commits into
Azure:mainfrom
tvaron3:tvaron3/cosmos-perf-pr

Conversation

Member

@tvaron3 tvaron3 commented May 13, 2026

What this PR does

Adds a new Go performance testing tool (sdk/data/azcosmos_perf) that mirrors the Rust azure_data_cosmos_perf crate, intended for steady-state benchmarking of the Go Cosmos DB SDK on VMs / AKS with metrics emitted to Azure Data Explorer / Grafana.

Features

  • 6 operations: ReadItem, QueryItems, ReadManyItems, UpsertItem, CreateItem, ChangeFeedItems
  • Per-operation latency histograms (HDR), throughput, error counters, and backend x-ms-request-duration-ms accounting
  • Per-iteration recover() keeps the process alive on a panic in any single op and writes the failure (with full goroutine stack trace in source_message) to the ADX ErrorResults table for post-mortem
  • Deterministic seeded item pool with thread-safe sample / push
  • Configurable concurrency, region exclusions, change-feed MaxItemCount, ReadMany batch size
  • Cosmos-backed result schema (PerfResults, ErrorResults) with per-interval upserts so dashboards stay live
  • Drop-in entrypoint.sh and run_perf.sh for the AKS deployment used in our internal runs
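
As an illustration of the "deterministic seeded item pool with thread-safe sample / push" feature, the following is a minimal sketch and not the tool's actual code. The `itemPool` type, its methods, and the `item-N` ID format are all hypothetical names chosen for this example. It assumes a single mutex guarding both the ID slice and the seeded RNG, since `*rand.Rand` is not safe for concurrent use on its own.

```go
package main

import (
	"fmt"
	"math/rand"
	"sync"
)

// itemPool is a hypothetical stand-in for the perf tool's item pool:
// a slice of item IDs plus a seeded RNG, both guarded by one mutex.
type itemPool struct {
	mu  sync.Mutex
	rng *rand.Rand
	ids []string
}

// newItemPool pre-populates the pool with size IDs and seeds the RNG so
// that a single-goroutine run with the same seed samples the same sequence.
func newItemPool(seed int64, size int) *itemPool {
	p := &itemPool{rng: rand.New(rand.NewSource(seed))}
	for i := 0; i < size; i++ {
		p.ids = append(p.ids, fmt.Sprintf("item-%d", i))
	}
	return p
}

// Sample returns a pseudo-random ID, or false when the pool is empty.
func (p *itemPool) Sample() (string, bool) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if len(p.ids) == 0 {
		return "", false
	}
	return p.ids[p.rng.Intn(len(p.ids))], true
}

// Push adds a newly created ID so later reads can target it.
func (p *itemPool) Push(id string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.ids = append(p.ids, id)
}

// Len reports the current number of IDs in the pool.
func (p *itemPool) Len() int {
	p.mu.Lock()
	defer p.mu.Unlock()
	return len(p.ids)
}

func main() {
	a, b := newItemPool(42, 1000), newItemPool(42, 1000)
	for i := 0; i < 3; i++ {
		x, _ := a.Sample()
		y, _ := b.Sample()
		fmt.Println(x == y) // same seed, same sequence
	}

	// Concurrent pushes are safe because every access takes the mutex.
	var wg sync.WaitGroup
	for i := 0; i < 50; i++ {
		wg.Add(1)
		go func(n int) {
			defer wg.Done()
			a.Push(fmt.Sprintf("created-%d", n))
		}(i)
	}
	wg.Wait()
	fmt.Println(a.Len())
}
```

Note that the sampling order is only reproducible per seed when calls are made in a fixed order; once many goroutines sample concurrently, the interleaving depends on scheduling even though the pool contents stay consistent.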

Commits

  1. feat(azcosmos_perf): add Go performance testing tool with ChangeFeed and ReadMany ops — the package itself
  2. feat(azcosmos_perf): persist panic stack trace to ADX error_message source — the recover handler now persists the stack trace, not just the panic message

What this PR deliberately does not include

  • The getPartitionKeyRanges nil-deref fix in cosmos_container.go that this perf tool uncovered in production at concurrency=200. That exact same fix is already present in @simorenoh's open PR #26723 ("[Cosmos] add container cache and pk range cache"), so there is no need to duplicate it or risk a conflict.
  • Pyroscope integration — pure deployment concern, lives only in our internal image.
  • Defense-in-depth newResponse(nil) hardening in cosmos_response.go — best as its own small SDK PR.

Production validation

The tool has been running on AKS (2 pods, concurrency=50) backed by ADX/Grafana. After applying the panic-handler stack-trace fix and (locally) the pk-range fix from #26723, it recorded 0 panics and 0 errors across all 6 operations at ~135K ops/op/5min sustained throughput. Earlier runs at concurrency=200 reproduced the panic 69 times in 4 hours, exclusively in ReadManyItems and ChangeFeedItems, which both call getPartitionKeyRanges.

Draft because

  • Awaiting a decision on whether to ship before or after #26723 ("[Cosmos] add container cache and pk range cache"). There is no functional dependency, but keeping the order clean avoids reviewer churn.
  • Open to renaming the module / repackaging if there's a preferred layout (e.g., sdk/data/azcosmos/perf vs top-level azcosmos_perf)

tvaron3 and others added 2 commits May 13, 2026 10:02
…and ReadMany ops

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ource

When an operation panics and the recover handler in runIteration catches it,
also pass the captured goroutine stack trace through to the ErrorResults
document via a new UpsertErrorWithSource helper. Previously only
fmt.Errorf("panic: %v", r) reached ADX and the stack went solely to
container stderr, making post-mortem investigation impossible once the pod
recycled.

UpsertError remains an unchanged convenience wrapper for non-panic call sites.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
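
The recover-and-persist flow described in the commit message above can be sketched roughly as follows. This is an assumption-laden illustration, not the code in this PR: the names `runIteration`, `UpsertError`, and `UpsertErrorWithSource` come from the commit message, but their signatures, the `errorSink` and `errorRecord` types, and the in-memory slice that stands in for the ErrorResults store are all hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"runtime/debug"
	"sync"
)

// errorRecord is a hypothetical shape for one ErrorResults document.
type errorRecord struct {
	Operation     string
	ErrorMessage  string
	SourceMessage string // goroutine stack trace for panics, empty otherwise
}

// errorSink stands in for the results store; here it only keeps records in memory.
type errorSink struct {
	mu      sync.Mutex
	records []errorRecord
}

// UpsertErrorWithSource records an error together with extra source detail.
func (s *errorSink) UpsertErrorWithSource(op string, err error, source string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.records = append(s.records, errorRecord{
		Operation:     op,
		ErrorMessage:  err.Error(),
		SourceMessage: source,
	})
}

// UpsertError is the convenience wrapper for non-panic call sites.
func (s *errorSink) UpsertError(op string, err error) {
	s.UpsertErrorWithSource(op, err, "")
}

// runIteration runs one operation. A panic is recovered so the process keeps
// running, and the stack captured inside the deferred function is persisted
// alongside the panic message.
func runIteration(s *errorSink, op string, fn func() error) {
	defer func() {
		if r := recover(); r != nil {
			s.UpsertErrorWithSource(op, fmt.Errorf("panic: %v", r), string(debug.Stack()))
		}
	}()
	if err := fn(); err != nil {
		s.UpsertError(op, err)
	}
}

// errSimulated is an illustrative non-panic failure.
var errSimulated = errors.New("simulated timeout")

func main() {
	sink := &errorSink{}
	runIteration(sink, "ReadManyItems", func() error { panic("boom") })
	runIteration(sink, "ReadItem", func() error { return errSimulated })

	for _, r := range sink.records {
		fmt.Printf("%s: %s (stack captured: %t)\n",
			r.Operation, r.ErrorMessage, r.SourceMessage != "")
	}
}
```

Calling `debug.Stack()` inside the deferred function matters: at that point the panicking frames are still on the goroutine's stack, so the trace includes the call path that led to the panic and not just the recover handler.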
@tvaron3 tvaron3 changed the title [Cosmos] add azcosmos_perf — Go SDK performance testing tool [Cosmos] [Don't Review] add azcosmos_perf — Go SDK performance testing tool May 14, 2026