Aggregation support for InfluxDBStore. #136

emidoots · 2016-04-16T02:09:59Z

Details

Adds aggregation support to: /dashboard & /aggregate endpoints.
Adds re-usable functions:
- appdash.AggregateEvent.SlowestRawQuery() string
- appdash.Trace.TimespanEvent() (TimespanEvent, error)
Extends interface:
- appdash.Traces() ([]*Trace, error) ---> Traces(opts TracesOpts) ([]*Trace, error)
  - So we are able to pass aggregation information (e.g. start/end time) through TracesOpts.
Removes AggregateStore in favor of InfluxDBStore.
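
As a rough illustration of the interface change above, the sketch below shows a store whose Traces method accepts an options struct carrying a time range. Everything here is a simplified stand-in written for this example: the Trace and Timespan types, the single Timespan field on TracesOpts, the memStore type and the april helper are assumptions, not the actual Appdash definitions.

```go
package main

import (
	"fmt"
	"time"
)

// Trace is a simplified stand-in for appdash.Trace (assumption).
type Trace struct {
	Name  string
	Start time.Time
}

// Timespan is a hypothetical inclusive time range.
type Timespan struct {
	S, E time.Time
}

// TracesOpts bundles query options; the field shown is an assumption.
type TracesOpts struct {
	Timespan Timespan
}

// memStore is a hypothetical in-memory store used only for this sketch.
type memStore struct {
	traces []*Trace
}

// Traces returns the traces whose start time falls within opts.Timespan.
// A zero Timespan disables filtering and returns every trace.
func (s *memStore) Traces(opts TracesOpts) ([]*Trace, error) {
	ts := opts.Timespan
	if ts.S.IsZero() && ts.E.IsZero() {
		return s.traces, nil
	}
	var out []*Trace
	for _, t := range s.traces {
		if !t.Start.Before(ts.S) && !t.Start.After(ts.E) {
			out = append(out, t)
		}
	}
	return out, nil
}

// april returns midnight UTC on the given day of April 2016 (example helper).
func april(day int) time.Time {
	return time.Date(2016, time.April, day, 0, 0, 0, 0, time.UTC)
}

func main() {
	s := &memStore{traces: []*Trace{
		{Name: "GET /a", Start: april(10)},
		{Name: "GET /b", Start: april(16)},
	}}
	recent, err := s.Traces(TracesOpts{Timespan: Timespan{S: april(15), E: april(17)}})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, t := range recent {
		fmt.Println(t.Name)
	}
}
```

Passing an options struct rather than positional arguments means further filters can be added later without changing the method signature again.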

Closes #137

emidoots · 2016-04-16T02:12:21Z

CHANGELOG.md

+### Features
+
+- [#99](https://github.com/sourcegraph/appdash/pull/99): New store engine backed by [InfluxDB](https://github.com/influxdata/influxdb).
+- [#110](https://github.com/sourcegraph/appdash/pull/110): Implementation of the OpenTracing API.


I think we could use a better format for the changelog; in particular, one that is date-based, in case the Release Notes and Features sections become lengthy in the future.

Personal preference also tells me to shy away from version numbers somewhat.

Fixed in 97111e6

emidoots · 2016-04-16T03:27:04Z

(Previous PR by @chris-ramon (I rebased against master) at #127)

I've given an in-depth overview of why removing AggregateStore is such an important move for Appdash as a whole over at issue 137, and will update the changelog to link to this issue as well.

emidoots · 2016-04-16T04:50:57Z

traceapp/dashboard.go

+	})
+	if err != nil {
+		return err
+	}


There is a rather fatal issue here. InfluxDBStore.Traces only returns a limited number of traces at a time (e.g. 10) due to pagination. So right now it returns "traces in the last 72 hr, maximum 10", but ideally the dashboard would show us the N slowest traces over the past 72 hr (out of all traces, not just 10).

I think this also makes the reported numbers on the dashboard less accurate, because they would be the average time of 10 traces instead of the average time of all traces in the selected timeline.

I tried lifting the limit on InfluxDBStore.Traces so it returns all traces, but if there are many, the query becomes extremely slow (unsurprisingly). We will need to use InfluxDB's built-in aggregation features (average, etc.) instead.
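
A minimal sketch of what pushing the aggregation into InfluxDB might look like: instead of fetching traces and averaging them client-side, the store would send a single InfluxQL statement and let the database compute the statistics. The measurement name "spans", the field "duration", the tag "name" and the helper names below are hypothetical, chosen only for this example.

```go
package main

import (
	"fmt"
	"time"
)

// exampleEnd is a fixed end time used by main below (example value).
var exampleEnd = time.Date(2016, time.April, 17, 0, 0, 0, 0, time.UTC)

// aggregateQuery builds an InfluxQL statement that computes per-name
// statistics on the server. The measurement "spans", the field "duration"
// and the tag "name" are hypothetical names for this sketch.
func aggregateQuery(start, end time.Time) string {
	return fmt.Sprintf(
		`SELECT MEAN("duration"), MIN("duration"), MAX("duration"), STDDEV("duration"), COUNT("duration") `+
			`FROM "spans" WHERE time >= '%s' AND time <= '%s' GROUP BY "name"`,
		start.UTC().Format(time.RFC3339),
		end.UTC().Format(time.RFC3339),
	)
}

func main() {
	// Aggregate over the 72 hours leading up to exampleEnd.
	start := exampleEnd.Add(-72 * time.Hour)
	fmt.Println(aggregateQuery(start, exampleEnd))
}
```

Because the database returns one row of statistics per name, the result size no longer depends on how many traces fall inside the time range, which sidesteps the pagination limit.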

emidoots · 2016-04-17T07:12:18Z

I've looked into using InfluxDB's aggregation functions like MEAN, MIN, MAX, STDDEV and COUNT against a simple measurement (table) with both 'trace name' and 'trace time' columns. This turns out to be surprisingly straightforward and works pretty well, but it surfaces a few critical issues with Appdash's architecture:

There is no concept of when a trace 'ends': we assume that data can be put into the appdash.Store at any point in time. This is valuable because it's hard to say when a trace ends in distributed systems altogether! Who decides? Ideally you would say the web browser knows both when the request (trace) starts and ends, but this is hard to achieve in reality for a number of reasons.

But if we look at why we need 'trace time' in the first place (for the dashboard), we can reconsider the problem from the top down: is displaying 'trace name' and 'trace time' the most valuable information for the dashboard? Would it be more valuable to just display operations/spans on the dashboard instead? This would mean:

- Finer-grained data (getting an overview of all operations in a trace, instead of an overview of just traces at the top level).
- The ability to see "What was the average time for spans named XYZ over the past X hours?".
- We always know the start and end time of a span (unlike a trace)!
- We can even use the same storage that we already use for storing spans in InfluxDB, instead of a different measurement/table!
- Con: Calculating the average for all spans is more complex than calculating the average for all traces; there are many more spans than there are traces.

Aside from the one con, which means we would need sampling and/or a different measurement/table so that dashboard data can be sampled at a different rate, this has a lot of cool implications, and could provide much more valuable insight to users on the dashboard who are asking the question "What does the system look like overall?"

There is one implementation complexity not mentioned here, which is that we don't know a span's name and duration at exactly the same time. Luckily, the work @chris-ramon is doing with httptrace will in fact enable this.

This change fixes the InfluxDBStore aggregation support (previously, it would calculate the average of only "up to 10 traces" in the selected time range, which produced counter-intuitive results). Instead, to make use of InfluxDB in a more consistent manner, switch the Dashboard to a span-based display rather than a trace-based display. This means users will see individual operations instead of a summarization of all operations within traces. For more details about the motivation for this change, see #136 (comment)

emidoots · 2016-04-18T05:47:44Z

@chris-ramon can you review my follow-up commits on this branch?

chris-ramon · 2016-04-18T06:46:34Z

influxdb_store.go

+		mean, min, max, stddev, count := v[1], v[2], v[3], v[4], v[5]
+		results[i] = &AggregatedResult{
+			RootSpanName: row.Tags["name"],
+			Average:      time.Duration(mustJSONFloat64(mean) * float64(time.Second)),


Could we replace float64(time.Second) with a variable instead, declared and assigned around line 209, to avoid repeating the same calculation?
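
One way the suggestion could look, as a sketch only: name the conversion factor once and reuse it. The names nanosPerSecond and secondsToDuration are hypothetical and not taken from the PR. Since time.Second is a constant, float64(time.Second) is itself a constant expression in Go, so naming it is mainly a readability gain rather than a runtime saving.

```go
package main

import (
	"fmt"
	"time"
)

// nanosPerSecond names float64(time.Second) once. Converting the constant
// time.Second yields a constant, so this is evaluated at compile time.
const nanosPerSecond = float64(time.Second)

// secondsToDuration converts a value expressed in seconds (as returned by
// the aggregate query in the diff above) into a time.Duration.
// Hypothetical helper for this sketch.
func secondsToDuration(sec float64) time.Duration {
	return time.Duration(sec * nanosPerSecond)
}

func main() {
	fmt.Println(secondsToDuration(1.5))
	fmt.Println(secondsToDuration(0.25))
}
```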

chris-ramon · 2016-04-18T07:36:05Z

Hi @slimsag! Great work on improving the aggregation support for the InfluxDBStore. I've left a few in-line comments; other than that, this is really looking good. 👍

emidoots mentioned this pull request Apr 16, 2016

Aggregation support for InfluxDBStore. #127

Closed

4 tasks

emidoots reviewed Apr 16, 2016

chris-ramon and others added 13 commits April 17, 2016 14:49

initial aggregation support for InfluxDBStore

a8e5351

updates InfluxDBStore.Traces(...) to support time range traces filtering now `/dashboard` timeline filter works for InfluxDBStore

3a9d290

removes AggregateStore in favor of InfluxDBStore

c065587

updates InfluxDBStore.Traces(...) to support filtering by a set of span ids now `/traces?show=span_ids...` works for `InfluxDBStore`

1ddd075

adds initial CHANGELOG.md

2fa0477

CHANGELOG: use better general changelog format

daf43b0

- Make it date-based such that the two columns will not become too lengthy in the future.
- Shy away from version numbers for now.

CHANGELOG: link to issue 137 / "Replace AggregateStore with InfluxDBStore"

4dbd6c0

traceapp: remove AggregateStore support code from the dashboard

af53b2a

correct typo in TracesOpts.Timespan docstring

dbb4f3a

remove remaining AggregateStore support code

f484bba

CHANGELOG: mention removal of AggregateStore support code

65d9492

Fix InfluxDBStore.Traces method when filtering by trace IDs

8b57bf4

The user interface tries to filter based on _trace_ IDs whereas here we tried to filter based on _span IDs_, this caused the filter to not show correct information.

emidoots force-pushed the influxdb-aggregate-feature branch from 1005858 to 8b57bf4 Compare April 17, 2016 21:52

emidoots added 2 commits April 17, 2016 18:22

InfluxDBStore: handle nil std deviation as zero instead of panic

ed37632

stddev can be nil when there are <= 1 entries and a standard deviation cannot be calculated at all.
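
A minimal sketch of treating a missing standard deviation as zero rather than panicking. It assumes the decoded value may arrive as nil, a json.Number or a float64; the helper name floatOrZero is hypothetical and is not the function used in the PR.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// floatOrZero converts a decoded JSON value to float64, treating nil as
// zero instead of failing. Hypothetical helper for this sketch.
func floatOrZero(v interface{}) (float64, error) {
	switch n := v.(type) {
	case nil:
		// No value: for example, STDDEV over a single entry.
		return 0, nil
	case json.Number:
		return n.Float64()
	case float64:
		return n, nil
	default:
		return 0, fmt.Errorf("unexpected value type %T", v)
	}
}

func main() {
	for _, v := range []interface{}{nil, json.Number("0.75"), 2.5, "oops"} {
		f, err := floatOrZero(v)
		fmt.Println(f, err)
	}
}
```

Returning an error for unexpected types keeps the nil case as the only one that is silently mapped to zero.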

InfluxDBStore: workaround issue with tags and newlines

4e8ef32

chris-ramon reviewed Apr 18, 2016

InfluxDBStore: properly check by column name instead of relying on a fixed integer

094b223

This is more resilient to change / future-proof.
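
To illustrate looking values up by column name rather than by a fixed position, here is a small sketch. The column names and the helpers columnIndex and valueByName are hypothetical; the real result layout depends on the query that was issued.

```go
package main

import "fmt"

// columnIndex maps each column name to its position in a result row.
func columnIndex(columns []string) map[string]int {
	idx := make(map[string]int, len(columns))
	for i, c := range columns {
		idx[c] = i
	}
	return idx
}

// valueByName returns the value stored under the named column, or an error
// if the column is absent or the row is too short.
func valueByName(idx map[string]int, row []interface{}, name string) (interface{}, error) {
	i, ok := idx[name]
	if !ok || i >= len(row) {
		return nil, fmt.Errorf("column %q not found", name)
	}
	return row[i], nil
}

func main() {
	// Hypothetical column layout for an aggregate result row.
	columns := []string{"time", "mean", "min", "max", "stddev", "count"}
	row := []interface{}{"2016-04-17T00:00:00Z", 0.25, 0.1, 0.9, nil, 1}

	idx := columnIndex(columns)
	mean, err := valueByName(idx, row, "mean")
	fmt.Println(mean, err)
}
```

If the query later gains or reorders columns, lookups by name keep working, whereas positional access such as v[1] would silently read the wrong value.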

emidoots merged commit 9dd479d into master Apr 20, 2016

emidoots deleted the influxdb-aggregate-feature branch April 20, 2016 07:29

dmitshur mentioned this pull request May 8, 2016

cmd/appdash: actually send spans to collector #164

Merged
