Jaeger cleanup: much fewer spans, but with more stats - and more stats for meta section #1380
Conversation
I like the idea of simplifying the traces. On the implementation, instead of having the numerous getSeries* calls pass back RenderStats (which you will need to aggregate at each step, though the current code is missing this), I would just pass a single *RenderStats through the call pipeline. Each func can then just increment the counters as the data is processed, e.g. as sketched below.
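A minimal, self-contained sketch of that shape (the names and fields here are illustrative stand-ins, not the actual metrictank call chain or stats type):

package main

import (
	"fmt"
	"sync/atomic"
)

// RenderStats is a stand-in for the real stats type; only the shape matters here.
type RenderStats struct {
	ChunksFromCache uint32
	ChunksFromStore uint32
}

// getSeries and getSeriesCachedStore mimic the call pipeline: each layer
// receives the same *RenderStats and increments counters in place, so no
// caller has to collect and aggregate per-call results.
func getSeries(key string, rs *RenderStats) {
	getSeriesCachedStore(key, rs)
}

func getSeriesCachedStore(key string, rs *RenderStats) {
	// pretend the cache missed and we fetched 3 chunks from the store
	atomic.AddUint32(&rs.ChunksFromStore, 3)
}

func main() {
	var rs RenderStats
	for _, key := range []string{"series.a", "series.b"} {
		getSeries(key, &rs)
	}
	fmt.Println("chunks from store:", atomic.LoadUint32(&rs.ChunksFromStore)) // 6
}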
* don't trace on a per-series level, we don't want that many spans
* consistent error logging via logger and tracer, in every callsite of getSeriesCachedStore, meaning we can remove the logs and tracing stuff out of getSeriesCachedStore and the store.Search functions
along these call trees:

executePlan   <- records to span + reports stats into api response
  getTargets
    getTargetsRemote
    getTargetsLocal
      getTarget
        getSeriesFixed
          getSeries
            getSeriesCachedStore

prometheus.querier.Select   <-- we ignore the stats here
  getTargets
    ....

Server.getData   <- records to span + reports stats into rpc response
  getTargetsLocal
    getTarget
      getSeriesFixed
        getSeries
          getSeriesCachedStore

Note that:
* GetDataRespV0 and GetDataRespV1 responses can be used interchangeably in clusters
* we blend the different sources of stats into a unified presentation for the user; this required making the json marshaling slightly more complicated.
* in jaeger, the stats for executePlan are the sum of the (already reported) stats of the getData calls.
Extended version of #1344. I think @shanson7 will like this. Note that the stats are collected across RPC boundaries and summed across all peers; otherwise we would only have the total and the ones for other peers.
cc @tomwilkie
Looks good!
api/models/storagestats.go
Outdated
	atomic.AddUint32(&ss.ChunksFromStore, n)
}

// Add adds a to ss. Note that a is presumed to be "done" (read unsafely)
While it would only make sense to try and add a once it is 'done', it would be unwise/unsafe to presume that is always true. We should just use atomic operations to read a, to avoid the possibility of a data race, e.g. atomic.AddUint32(&ss.CacheHit, atomic.LoadUint32(&a.CacheHit)). The call to Add() itself won't be atomic (i.e. individual counters might increase between when the first counter from a is read and when the last one is), but that won't have any impact on our use case.
I have pondered the same thing. I considered it the programmer's responsibility to make sure a is "done". I agree using atomics for a is safer, but at the same time, it would mask issues should they appear: it doesn't make sense to compute the aggregate stats if we're not done yet generating or collecting the stats, similar to how it also wouldn't make sense to build up our series slice if the inputs haven't been decoded yet.
I think this is in line with best practices everywhere in Go code: whenever calling a function, it's the programmer's responsibility to make sure there is no unsafe data access. This is true for the stdlib and many 3rd party libraries.
Note that Add in its current form is also not atomic; I agree there is no need for it to be.
"it would mask issues should they appear"

The main issue we need to avoid is panics. Panics are bad and we should avoid them wherever possible. Wrapping the reads in atomic.LoadUint32 avoids panics. It doesn't avoid other issues that may result from users adding stats that are being updated, and that is fine. The comment about a being done should still stay.

"I think this is in line with best practices everywhere in go code"

As a general statement that is true. But a more specific "best practice" is "when using atomic, all reads and writes need to use atomic". We have run into lots of race issues because this simple rule hasn't been followed. So we either need to always use atomic operations or not use them at all.
Makes sense. Actually, the way the memory model works is that you shouldn't mix atomic access with non-atomic access, even when you know you're not accessing concurrently. Go's memory model provides no guarantees if you don't use atomics, and the compiler/runtime actually take advantage of this to do memory optimizations behind the scenes.
It's so easy to forget this. Funny, because I recently pointed it out somewhere else as well.
I'll fix.
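For reference, a minimal sketch of what the fix could look like, assuming the uint32 counters that appear elsewhere in this diff:

package models

import "sync/atomic"

// StorageStats as assumed from the counters shown in this diff.
type StorageStats struct {
	CacheMiss       uint32
	CacheHitPartial uint32
	CacheHit        uint32
	ChunksFromTank  uint32
	ChunksFromCache uint32
	ChunksFromStore uint32
}

// Add adds a to ss. a should still be "done", but all reads and writes go
// through sync/atomic so a concurrent update can no longer cause a data race.
// Note that Add as a whole is still not one atomic operation.
func (ss *StorageStats) Add(a *StorageStats) {
	atomic.AddUint32(&ss.CacheMiss, atomic.LoadUint32(&a.CacheMiss))
	atomic.AddUint32(&ss.CacheHitPartial, atomic.LoadUint32(&a.CacheHitPartial))
	atomic.AddUint32(&ss.CacheHit, atomic.LoadUint32(&a.CacheHit))
	atomic.AddUint32(&ss.ChunksFromTank, atomic.LoadUint32(&a.ChunksFromTank))
	atomic.AddUint32(&ss.ChunksFromCache, atomic.LoadUint32(&a.ChunksFromCache))
	atomic.AddUint32(&ss.ChunksFromStore, atomic.LoadUint32(&a.ChunksFromStore))
}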
api/cluster.go
Outdated
if err != nil {
	// the only errors returned are from us catching panics, so we should treat them
	// all as internalServerErrors
	log.Errorf("HTTP getData() %s", err.Error())
	response.Write(ctx, response.WrapError(err))
	return
}
response.Write(ctx, response.NewMsgp(200, &models.GetDataResp{Series: series}))
ss.Trace(opentracing.SpanFromContext(ctx.Req.Context()))
I don't think this is needed here. ss.Trace() is already called at the end of s.getTargetsLocal().
makes sense. we only need it in executePlan (for individual nodes, both local and cluster peers), and executePlan (for aggregated-across-peers per-response stats)
api/models/storagestats.go
Outdated

func (ss StorageStats) MarshalJSONFastRaw(b []byte) ([]byte, error) {
	b = append(b, `"executeplan.cache-miss.count":`...)
	b = strconv.AppendUint(b, uint64(ss.CacheMiss), 10)
needs atomic.LoadUint32()
api/models/storagestats.go
Outdated
	b = append(b, `"executeplan.cache-miss.count":`...)
	b = strconv.AppendUint(b, uint64(ss.CacheMiss), 10)
	b = append(b, `,"executeplan.cache-hit-partial.count":`...)
	b = strconv.AppendUint(b, uint64(ss.CacheHitPartial), 10)
needs atomic.LoadUint32()
api/models/storagestats.go
Outdated
	b = append(b, `,"executeplan.cache-hit-partial.count":`...)
	b = strconv.AppendUint(b, uint64(ss.CacheHitPartial), 10)
	b = append(b, `,"executeplan.cache-hit.count":`...)
	b = strconv.AppendUint(b, uint64(ss.CacheHit), 10)
needs atomic.LoadUint32()
api/models/storagestats.go
Outdated
	b = append(b, `,"executeplan.cache-hit.count":`...)
	b = strconv.AppendUint(b, uint64(ss.CacheHit), 10)
	b = append(b, `,"executeplan.chunks-from-tank.count":`...)
	b = strconv.AppendUint(b, uint64(ss.ChunksFromTank), 10)
needs atomic.LoadUint32()
api/models/storagestats.go
Outdated
	b = append(b, `,"executeplan.chunks-from-tank.count":`...)
	b = strconv.AppendUint(b, uint64(ss.ChunksFromTank), 10)
	b = append(b, `,"executeplan.chunks-from-cache.count":`...)
	b = strconv.AppendUint(b, uint64(ss.ChunksFromCache), 10)
needs atomic.LoadUint32()
api/models/storagestats.go
Outdated
	b = append(b, `,"executeplan.chunks-from-cache.count":`...)
	b = strconv.AppendUint(b, uint64(ss.ChunksFromCache), 10)
	b = append(b, `,"executeplan.chunks-from-store.count":`...)
	b = strconv.AppendUint(b, uint64(ss.ChunksFromStore), 10)
needs atomic.LoadUint32()
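Putting the above comments together, a rough sketch of what the marshaler could look like with atomic reads. Key names and counters are taken from the diff; the pointer receiver, the closing return and the imports (strconv, sync/atomic) are assumptions of this sketch, not necessarily the final code.

// assumes: import "strconv" and "sync/atomic", plus the StorageStats type sketched earlier.
// pointer receiver so we load from the shared struct, not from a copy made with a plain read.
func (ss *StorageStats) MarshalJSONFastRaw(b []byte) ([]byte, error) {
	b = append(b, `"executeplan.cache-miss.count":`...)
	b = strconv.AppendUint(b, uint64(atomic.LoadUint32(&ss.CacheMiss)), 10)
	b = append(b, `,"executeplan.cache-hit-partial.count":`...)
	b = strconv.AppendUint(b, uint64(atomic.LoadUint32(&ss.CacheHitPartial)), 10)
	b = append(b, `,"executeplan.cache-hit.count":`...)
	b = strconv.AppendUint(b, uint64(atomic.LoadUint32(&ss.CacheHit)), 10)
	b = append(b, `,"executeplan.chunks-from-tank.count":`...)
	b = strconv.AppendUint(b, uint64(atomic.LoadUint32(&ss.ChunksFromTank)), 10)
	b = append(b, `,"executeplan.chunks-from-cache.count":`...)
	b = strconv.AppendUint(b, uint64(atomic.LoadUint32(&ss.ChunksFromCache)), 10)
	b = append(b, `,"executeplan.chunks-from-store.count":`...)
	b = strconv.AppendUint(b, uint64(atomic.LoadUint32(&ss.ChunksFromStore)), 10)
	return b, nil
}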
api/models/storagestats.go
Outdated
}

func (ss StorageStats) Trace(span opentracing.Span) {
	span.SetTag("cache-miss", ss.CacheMiss)
needs atomic.LoadUint32()
api/models/storagestats.go
Outdated

func (ss StorageStats) Trace(span opentracing.Span) {
	span.SetTag("cache-miss", ss.CacheMiss)
	span.SetTag("cache-hit-partial", ss.CacheHitPartial)
needs atomic.LoadUint32()
api/models/storagestats.go
Outdated
func (ss StorageStats) Trace(span opentracing.Span) {
	span.SetTag("cache-miss", ss.CacheMiss)
	span.SetTag("cache-hit-partial", ss.CacheHitPartial)
	span.SetTag("cache-hit", ss.CacheHit)
needs atomic.LoadUint32()
api/models/storagestats.go
Outdated
	span.SetTag("cache-miss", ss.CacheMiss)
	span.SetTag("cache-hit-partial", ss.CacheHitPartial)
	span.SetTag("cache-hit", ss.CacheHit)
	span.SetTag("chunks-from-tank", ss.ChunksFromTank)
needs atomic.LoadUint32()
api/models/storagestats.go
Outdated
	span.SetTag("cache-hit-partial", ss.CacheHitPartial)
	span.SetTag("cache-hit", ss.CacheHit)
	span.SetTag("chunks-from-tank", ss.ChunksFromTank)
	span.SetTag("chunks-from-cache", ss.ChunksFromCache)
needs atomic.LoadUint32()
api/models/storagestats.go
Outdated
	span.SetTag("cache-hit", ss.CacheHit)
	span.SetTag("chunks-from-tank", ss.ChunksFromTank)
	span.SetTag("chunks-from-cache", ss.ChunksFromCache)
	span.SetTag("chunks-from-store", ss.ChunksFromStore)
needs atomic.LoadUint32()
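And likewise a rough sketch for Trace, with the same caveats: the pointer receiver is an assumption, and it relies on sync/atomic plus the opentracing-go Span type (github.com/opentracing/opentracing-go).

// every tag value is read via atomic.LoadUint32 so Trace can run while the
// counters are still being updated elsewhere.
func (ss *StorageStats) Trace(span opentracing.Span) {
	span.SetTag("cache-miss", atomic.LoadUint32(&ss.CacheMiss))
	span.SetTag("cache-hit-partial", atomic.LoadUint32(&ss.CacheHitPartial))
	span.SetTag("cache-hit", atomic.LoadUint32(&ss.CacheHit))
	span.SetTag("chunks-from-tank", atomic.LoadUint32(&ss.ChunksFromTank))
	span.SetTag("chunks-from-cache", atomic.LoadUint32(&ss.ChunksFromCache))
	span.SetTag("chunks-from-store", atomic.LoadUint32(&ss.ChunksFromStore))
}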
@@ -23,12 +23,18 @@ func (rwm ResponseWithMeta) MarshalJSONFast(b []byte) ([]byte, error) {
// RenderMeta holds metadata about a render request/response
type RenderMeta struct {
	Stats stats
	StorageStats
This just seems hacky. Let's clean it up and use Stats and StorageStats consistently, e.g.

type RenderMeta struct {
	RenderStats
	StorageStats
}
Addressed all feedback. PTAL
Goals:
too soon for in-depth review, but curious to hear anyone's thoughts?