This repository has been archived by the owner on Dec 1, 2018. It is now read-only.

Proposal: Introduce Oldtimer
This is a proposal for Oldtimer, the Heapster historical metrics
access component.  Oldtimer was originally proposed in the vision
statement, but was not previously specified in any detail.
DirectXMan12 committed Apr 11, 2016
1 parent de510e4 commit 2cd3494
Showing 1 changed file with 144 additions and 0 deletions.
144 changes: 144 additions & 0 deletions docs/proposals/old-timer.md
@@ -0,0 +1,144 @@
# Heapster Oldtimer

## Overview

Prior to the Heapster refactor, the Heapster model presented aggregations of
metrics over certain time periods (the last hour and day). Post-refactor, the
concern of presenting an interface for historical metrics was to be split into
a separate Heapster component: Oldtimer.

Oldtimer will present common interfaces for retrieving historical metrics over
longer periods of time than the Heapster model, and will allow fetching
aggregations of metrics (e.g. averages, 95th percentile, etc.) over different
periods of time. It will do this by querying the sink to which Heapster is
storing metrics.

Note: even though we are retrieving metrics, this document refers to the
metrics storage locations as "sinks" to be consistent with the rest
of Heapster.

## Motivation

There are two major motivations for exposing historical metrics information:

1. Using aggregated historical data to make size-related decisions
(for example, idling requires looking for traffic over a long time period)

2. Providing a common interface for users to view historical metrics

Before the Heapster refactoring (see the "Heapster Long Term Vision" proposal),
Heapster supported querying metrics aggregated over certain extended time
periods (the last hour and day) via the Heapster model.

However, since the Heapster model is stored in-memory, and not persisted to
disk, this historical data would be "lost" whenever Heapster was restarted.
This made it unreliable for use by system components which need a historical
view.

Since we already persist metrics into a sink, it does not make sense for
Heapster to persist long-term metrics to disk as well. Instead, we can
just query the sink directly.

## Design

### API

Oldtimer will present an API somewhat similar to the normal Heapster model.
The URLs will take the following forms:

`/api/v1/old-timer{prefix}/metrics/`: Returns a list of all available metrics.

`/api/v1/old-timer{prefix}/metrics/{metric-name}?start=X&end=Y`: Returns a set
of (Timestamp, Value) pairs for the requested {prefix}-level metric, over the
given time range.

`/api/v1/old-timer{prefix}/metrics/{metric-name}/{aggregation-name}?start=X&end=Y&bucket=B`:
Returns the requested {prefix}-level metric, aggregated with the given
aggregation over the requested time period (potentially split into several
buckets of duration `B`). `{aggregation-name}` may be a comma-separated
list of aggregations, to retrieve several at once.

Where `{prefix}` is either empty (cluster-level), `/namespaces/{namespace}`
(namespace-level), `/namespaces/{namespace}/pods/{pod-name}` (pod-level),
`/namespaces/{namespace}/pod-list/{pod-list}` (multi-pod-level), or
`/namespaces/{namespace}/pods/{pod-name}/containers/{container-name}`
(container-level).

In addition, when `{prefix}` is not empty, there will be a URL of the form
`/api/v1/old-timer{prefix-without-final-element}`, which allows fetching the
list of available nodes/namespaces/pods/containers.
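As a rough illustration of how the prefix levels compose into a request path, the sketch below builds a metrics path from optional namespace, pod, and container names. The helper `metricsPath` and the sample names are hypothetical and not part of Heapster; the multi-pod (`pod-list`) form and URL escaping are left out for brevity.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// metricsPath is a hypothetical helper (not a Heapster function) that builds
// the path for a single metric. Empty strings mean "not set"; leaving
// namespace, pod, and container all empty yields the cluster-level path.
// Names are inserted verbatim: no URL escaping is attempted in this sketch.
func metricsPath(namespace, pod, container, metric string) (string, error) {
	if pod != "" && namespace == "" {
		return "", errors.New("a pod requires a namespace")
	}
	if container != "" && pod == "" {
		return "", errors.New("a container requires a pod")
	}
	var b strings.Builder
	b.WriteString("/api/v1/old-timer")
	if namespace != "" {
		b.WriteString("/namespaces/" + namespace)
	}
	if pod != "" {
		b.WriteString("/pods/" + pod)
	}
	if container != "" {
		b.WriteString("/containers/" + container)
	}
	b.WriteString("/metrics/" + metric)
	return b.String(), nil
}

func main() {
	// "default", "web-1", and "uptime" are sample values chosen for illustration.
	for _, args := range [][4]string{
		{"", "", "", "uptime"},
		{"default", "", "", "uptime"},
		{"default", "web-1", "nginx", "uptime"},
	} {
		p, err := metricsPath(args[0], args[1], args[2], args[3])
		fmt.Println(p, err)
	}
}
```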

The `start` and `end` parameters are defined the same way as for the model.
The `bucket` (bucket duration) parameter is a number followed by any of the
following suffixes:

- `ms`: milliseconds
- `s`: seconds
- `m`: minutes
- `h`: hours
- `d`: days
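Go's `time.ParseDuration` does not accept a `d` suffix, so a small custom parser is one way to handle the `bucket` parameter. The following is a minimal sketch, assuming the number is a non-negative integer (the proposal only says "a number"); `parseBucket` is a hypothetical name, and overflow for very large values is not checked.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"time"
)

// bucketUnits lists the accepted suffixes. "ms" comes before "s" and "m" so
// that "5ms" is not mistaken for a value in seconds.
var bucketUnits = []struct {
	suffix string
	unit   time.Duration
}{
	{"ms", time.Millisecond},
	{"s", time.Second},
	{"m", time.Minute},
	{"h", time.Hour},
	{"d", 24 * time.Hour},
}

// parseBucket converts a bucket string such as "30m" or "1d" into a
// time.Duration. It is a hypothetical helper, not part of Heapster.
func parseBucket(s string) (time.Duration, error) {
	for _, u := range bucketUnits {
		if !strings.HasSuffix(s, u.suffix) {
			continue
		}
		n, err := strconv.ParseUint(strings.TrimSuffix(s, u.suffix), 10, 64)
		if err != nil {
			return 0, fmt.Errorf("invalid bucket %q: %v", s, err)
		}
		return time.Duration(n) * u.unit, nil
	}
	return 0, fmt.Errorf("invalid bucket %q: unknown suffix", s)
}

func main() {
	for _, s := range []string{"500ms", "30m", "1d", "10x"} {
		d, err := parseBucket(s)
		fmt.Println(s, d, err)
	}
}
```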

### Functionality

When Oldtimer receives a request at one of the given URLs, it will compose a
query to the configured metrics sink, execute that query, and return the
results. The return format for normal requests will be the same as that
returned by the Heapster model.

In the case of aggregations, the normal `MetricResult` and `MetricResultList`
types are wrapped in order to differentiate between different aggregations.
Each metric point represents one bucket (if no buckets are requested, only one
point is returned). The timestamp in the case of aggregations is the timestamp
of the start of that bucket.

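```go
// MetricAggregationResult wraps MetricResult, with one field per aggregation.
type MetricAggregationResult struct {
	Average *MetricResult
	Maximum *MetricResult
	Minimum *MetricResult
	Median  *MetricResult
	Count   *MetricResult
	// Percentiles maps a percentile number (the {number} in
	// `{number}-perc`) to the result for that percentile.
	Percentiles map[uint64]MetricResult
}

// MetricListAggregationResult is the MetricResultList counterpart of
// MetricAggregationResult.
type MetricListAggregationResult struct {
	Average *MetricResultList
	Maximum *MetricResultList
	Minimum *MetricResultList
	Median  *MetricResultList
	Count   *MetricResultList
	// Percentiles maps a percentile number to the result list for it.
	Percentiles map[uint64]MetricResultList
}
```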
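To make the bucket semantics concrete, here is a small sketch of an average computed per bucket, with each resulting point stamped with the start of its bucket. It is illustrative only: the proposal has the sink perform aggregations, not Oldtimer. The `point` type and `averageByBucket` function are hypothetical, and skipping empty buckets is an assumption, since the proposal does not say how they are reported.

```go
package main

import (
	"fmt"
	"time"
)

// point stands in for a (Timestamp, Value) pair; it is not a Heapster type.
type point struct {
	Timestamp time.Time
	Value     float64
}

// averageByBucket averages the samples that fall in [start, end), grouped
// into buckets of the given width counted from start. It returns one point
// per non-empty bucket, stamped with the start time of that bucket.
func averageByBucket(samples []point, start, end time.Time, bucket time.Duration) []point {
	if bucket <= 0 || !end.After(start) {
		return nil
	}
	n := int((end.Sub(start) + bucket - 1) / bucket) // bucket count, rounded up
	sums := make([]float64, n)
	counts := make([]int, n)
	for _, s := range samples {
		if s.Timestamp.Before(start) || !s.Timestamp.Before(end) {
			continue // outside the requested range
		}
		i := int(s.Timestamp.Sub(start) / bucket)
		sums[i] += s.Value
		counts[i]++
	}
	var out []point
	for i := range sums {
		if counts[i] == 0 {
			continue // assumption: empty buckets are omitted
		}
		out = append(out, point{
			Timestamp: start.Add(time.Duration(i) * bucket),
			Value:     sums[i] / float64(counts[i]),
		})
	}
	return out
}

func main() {
	start := time.Date(2016, time.April, 11, 0, 0, 0, 0, time.UTC)
	samples := []point{
		{start, 1},
		{start.Add(30 * time.Second), 3},
		{start.Add(90 * time.Second), 5},
	}
	// Two one-minute buckets: the first averages 1 and 3, the second holds 5.
	for _, p := range averageByBucket(samples, start, start.Add(2*time.Minute), time.Minute) {
		fmt.Println(p.Timestamp.Format(time.RFC3339), p.Value)
	}
}
```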

### Aggregations

Several different aggregations will be supported. Aggregations should be
performed in the metrics sink. If more aggregations later become supported
across all metrics sinks, the list can be expanded (and the API version
should probably be bumped, since the supported aggregations should be part of
the API).

- Average (arithmetic mean): `/{metric-name}/average`
- Maximum: `/{metric-name}/max`
- Minimum: `/{metric-name}/min`
- Percentile: `/{metric-name}/{number}-perc`
- Median: `/{metric-name}/median`
- Count: `/{metric-name}/count`
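The aggregation path segment could be parsed along the following lines. This is a sketch under assumptions: `aggregation` and `parseAggregations` are hypothetical names, the segment is taken to be the comma-separated list described in the API section, and no upper bound is placed on the percentile number, since the acceptable values are still an open question.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// aggregation is a hypothetical parsed form of one aggregation name.
type aggregation struct {
	Name       string // "average", "max", "min", "median", "count", or "percentile"
	Percentile uint64 // only meaningful when Name is "percentile"
}

// parseAggregations splits a comma-separated aggregation segment such as
// "average,95-perc" and validates each entry against the supported names.
func parseAggregations(segment string) ([]aggregation, error) {
	var out []aggregation
	for _, name := range strings.Split(segment, ",") {
		switch name {
		case "average", "max", "min", "median", "count":
			out = append(out, aggregation{Name: name})
		default:
			if !strings.HasSuffix(name, "-perc") {
				return nil, fmt.Errorf("unknown aggregation %q", name)
			}
			n, err := strconv.ParseUint(strings.TrimSuffix(name, "-perc"), 10, 64)
			if err != nil {
				return nil, fmt.Errorf("invalid percentile %q: %v", name, err)
			}
			out = append(out, aggregation{Name: "percentile", Percentile: n})
		}
	}
	return out, nil
}

func main() {
	aggs, err := parseAggregations("average,max,95-perc")
	fmt.Println(aggs, err)
}
```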

## Scaling and Performance Considerations

Since Oldtimer itself does not store any data, it should be fairly easy to
deploy multiple replicas of Oldtimer. The metrics sinks themselves should
already have clustering support, and thus can be scaled as well. Since
Oldtimer queries the metrics sinks themselves, response latency should
depend mainly on how quickly the sinks can respond to queries.

## Open Questions

- Does the choice of percentiles need to be limited? InfluxDB and Hawkular
appear to support arbitrary percentile values in queries, while GCM v3 appears
to support 99, 95, 50, 5, and OpenTSDB appears to support 50, 75, 90, 95, 99,
and 999 (meaning the common values would be 50, 95, and 99).
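The common values quoted above are the intersection of the restricted sets. A short sketch of that computation follows, using the percentile lists as given in this question; `commonPercentiles` is a hypothetical helper. Sinks that accept arbitrary percentiles (InfluxDB and Hawkular, per the text above) impose no restriction, so they are left out.

```go
package main

import (
	"fmt"
	"sort"
)

// commonPercentiles returns, in ascending order, the values that appear in
// every one of the given sets.
func commonPercentiles(sets ...[]uint64) []uint64 {
	counts := map[uint64]int{}
	for _, set := range sets {
		seen := map[uint64]bool{}
		for _, p := range set {
			if !seen[p] { // count each value at most once per set
				seen[p] = true
				counts[p]++
			}
		}
	}
	var common []uint64
	for p, c := range counts {
		if c == len(sets) {
			common = append(common, p)
		}
	}
	sort.Slice(common, func(i, j int) bool { return common[i] < common[j] })
	return common
}

func main() {
	gcm := []uint64{99, 95, 50, 5}
	openTSDB := []uint64{50, 75, 90, 95, 99, 999}
	fmt.Println(commonPercentiles(gcm, openTSDB)) // prints [50 95 99]
}
```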

