This repository has been archived by the owner on Oct 29, 2021. It is now read-only.

Aggregation support for InfluxDBStore. #136

Merged
merged 16 commits into from
Apr 20, 2016

Conversation

emidoots
Member

@emidoots emidoots commented Apr 16, 2016

Details

  • Adds aggregation support to: /dashboard & /aggregate endpoints.
  • Adds re-usable functions:
    • appdash.AggregateEvent.SlowestRawQuery() string
    • appdash.Trace.TimespanEvent() (TimespanEvent, error)
  • Extends interface:
    • appdash.Traces() ([]*Trace, error) ---> Traces(opts TracesOpts) ([]*Trace, error)
      • So we are able to pass aggregation information (e.g. start/end time) through TracesOpts.
  • Removes AggregateStore in favor of InfluxDBStore.

Closes #137

### Features

- [#99](https://github.com/sourcegraph/appdash/pull/99): New store engine backed by [InfluxDB](https://github.com/influxdata/influxdb).
- [#110](https://github.com/sourcegraph/appdash/pull/110): Implementation of the OpenTracing API.
Member Author

I think we can perhaps use a better format for the changelog, in particular one that is date-based, in case the Release Notes and Features sections become lengthy in the future.

Personal preference also tells me to shy away from version numbers somewhat.

Member Author

Fixed in 97111e6

@emidoots
Member Author

emidoots commented Apr 16, 2016

(Previous PR by @chris-ramon (I rebased against master) at #127)

I've given an in-depth overview of why removing AggregateStore is such an important move for Appdash as a whole over at issue 137, and will update the changelog to link to this issue as well.

})
if err != nil {
	return err
}
Member Author

There is a rather fatal issue here. InfluxDBStore.Traces only returns e.g. 10 traces at a time due to pagination. So right now it returns "traces in the last 72 hr, maximum 10", but ideally the dashboard would show us the N slowest traces over the past 72 hr (all traces, not just 10).

I think this also makes the reported numbers on the dashboard less accurate, because it would be the average time of 10 traces, instead of the average time of all traces in the selected timeline.

I tried lifting the limit on InfluxDBStore.Traces so it returns all traces, but if there are many the query becomes extremely slow (obviously). We will need to use InfluxDB's average etc features instead.

@emidoots
Member Author

I've looked into using InfluxDB's calculation features like MEAN, MIN, MAX, STDDEV and COUNT against a simple measurement (table) with both 'trace name' and 'trace time' columns. This turns out to be surprisingly straightforward and works pretty well, but it exposes a few critical issues with the Appdash architecture:

There is no concept of when a trace 'ends': we assume that data can be put into the appdash.Store at any point in time. This is valuable because it's hard to say when a trace ends in distributed systems altogether! Who decides? Ideally you would say the web browser knows both when the request (trace) starts and ends, but this is harder to achieve in reality for a number of reasons.

But, if we look at why we need 'trace time' in the first place (for the dashboard), we can reconsider the problem from the top-down: Is displaying 'trace name' and 'trace time' the most valuable information from the Dashboard? Would it be more valuable to just display operations/spans on the dashboard instead? This would mean..

  • Finer grained data (getting an overview of all operations in a trace, instead of an overview of just traces at a top-level).
  • Ability to see "What was the average time for spans named XYZ over the past X hours?".
  • We always know the start and end time of a span (unlike a trace)!
  • We can even use the same storage that we use for storing spans in InfluxDB already, instead of a different measurement/table!
  • Con: Calculating the average for all spans is more complex than calculating the average for all traces; there are many more spans than there are traces.

Aside from the one con, which means we would need sampling and/or a different measurement/table for dashboard data sampled at a different rate, this has a lot of cool implications, and could provide much more valuable insight to users on the dashboard who are asking the question "what does the system look like overall?"

There is one implementation complexity not mentioned here, which is that we don't know a span's name and duration at exactly the same time. Luckily, the work @chris-ramon is doing with httptrace will in fact enable this.

emidoots added a commit that referenced this pull request Apr 17, 2016
This change fixes the InfluxDBStore aggregation support (previously, it
would calculate only the average of the selected timerange "up to 10
traces" which produced counter-intuitive results). Instead, to make use
of InfluxDB in a more consistent manner, switch the Dashboard to a span-based
display rather than a trace-based display. This means users will see individual
operations instead of a summarization of all operations within traces.

For more details about the motivation for this change, see #136 (comment)
chris-ramon and others added 13 commits April 17, 2016 14:49
…ering

now `/dashboard` timeline filter works for InfluxDBStore
…span ids

now `/traces?show=span_ids...` works for `InfluxDBStore`
- Make it date-based such that the two columns will not become too lengthy
  in the future.
- Shy away from version numbers for now.
This change fixes the InfluxDBStore aggregation support (previously, it
would calculate only the average of the selected timerange "up to 10
traces" which produced counter-intuitive results). Instead, to make use
of InfluxDB in a more consistent manner, switch the Dashboard to a span-based
display rather than a trace-based display. This means users will see individual
operations instead of a summarization of all operations within traces.

For more details about the motivation for this change, see #136 (comment)
The user interface tries to filter based on _trace_ IDs whereas here
we tried to filter based on _span IDs_; this caused the filter to not
show correct information.
@emidoots emidoots force-pushed the influxdb-aggregate-feature branch from 1005858 to 8b57bf4 Compare April 17, 2016 21:52
stddev can be nil when there are <= 1 entries and a standard deviation
cannot be calculated at all.
@emidoots
Member Author

@chris-ramon can you review my follow-up commits on this branch?

mean, min, max, stddev, count := v[1], v[2], v[3], v[4], v[5]
results[i] = &AggregatedResult{
	RootSpanName: row.Tags["name"],
	Average: time.Duration(mustJSONFloat64(mean) * float64(time.Second)),
Contributor

Could we replace float64(time.Second) with a variable instead, declared and assigned around line 209, to avoid repeating the same calculation?

@chris-ramon
Contributor

Hi @slimsag! Great work on improving the aggregation support for the InfluxDBStore. I've left a few inline comments; other than that, this is really looking good. 👍

…fixed integer

This is more resilient to change / future-proof.
@emidoots emidoots merged commit 9dd479d into master Apr 20, 2016
@emidoots emidoots deleted the influxdb-aggregate-feature branch April 20, 2016 07:29