This repository has been archived by the owner on Dec 1, 2018. It is now read-only.
This is a proposal for Oldtimer, the Heapster historical metrics access component. Oldtimer was originally proposed in the vision statement, but was not previously specified in any detail.
1 parent de510e4, commit 2cd3494
Showing 1 changed file with 144 additions and 0 deletions.
@@ -0,0 +1,144 @@
# Heapster Oldtimer

## Overview

Prior to the Heapster refactor, the Heapster model presented aggregations of
metrics over certain time periods (the last hour and day). Post-refactor, the
concern of presenting an interface for historical metrics was to be split into
a separate Heapster component: Oldtimer.

Oldtimer will present common interfaces for retrieving historical metrics over
longer periods of time than the Heapster model, and will allow fetching
aggregations of metrics (e.g. averages, 95th percentile, etc.) over different
periods of time. It will do this by querying the sink in which Heapster stores
metrics.

Note: even though we are retrieving metrics, this document refers to the
metrics storage locations as "sinks" to be consistent with the rest
of Heapster.

## Motivation

There are two major motivations for exposing historical metrics information:

1. Using aggregated historical data to make size-related decisions
   (for example, idling requires looking for traffic over a long time period)

2. Providing a common interface for users to view historical metrics

Before the Heapster refactoring (see the "Heapster Long Term Vision" proposal),
Heapster supported querying metrics aggregated over certain extended time
periods (the last hour and day) via the Heapster model.

However, since the Heapster model is stored in memory and not persisted to
disk, this historical data would be "lost" whenever Heapster was restarted.
This made it unreliable for use by system components which need a historical
view.

Since we already persist metrics into a sink, it does not make sense for
Heapster itself to persist long-term metrics to disk as well. Instead, we can
just query the sink directly.

## Design

### API

Oldtimer will present an API somewhat similar to the normal Heapster model.
The URLs will take the forms:

`/api/v1/old-timer{prefix}/metrics/`: Returns a list of all available metrics.

`/api/v1/old-timer{prefix}/metrics/{metric-name}?start=X&end=Y`: Returns a set
of (Timestamp, Value) pairs for the requested {prefix}-level metric, over the
given time range.

`/api/v1/old-timer{prefix}/metrics/{metric-name}/{aggregation-name}?start=X&end=Y&bucket=B`:
Returns the requested {prefix}-level metric, aggregated with the given
aggregation over the requested time period (potentially split into several
buckets of duration `B`). `{aggregation-name}` may be a comma-separated
list of aggregations, to retrieve several at once.

Here `{prefix}` is either empty (cluster-level), `/namespaces/{namespace}`
(namespace-level), `/namespaces/{namespace}/pods/{pod-name}` (pod-level),
`/namespaces/{namespace}/pod-list/{pod-list}` (multi-pod-level), or
`/namespaces/{namespace}/pods/{pod-name}/containers/{container-name}`
(container-level).

In addition, when `{prefix}` is not empty, there will be a URL of the form
`/api/v1/old-timer{prefix-without-final-element}`, which allows fetching the
list of available nodes/namespaces/pods/containers.

The `start` and `end` parameters are defined the same way as for the model.
The `bucket` (bucket duration) parameter is a number followed by one of the
following suffixes:

- `ms`: milliseconds
- `s`: seconds
- `m`: minutes
- `h`: hours
- `d`: days

### Functionality

When Oldtimer receives a request at one of the given URLs, it will compose a
query to the configured metrics sink, execute that query, and return the
results. The return format for normal requests will be the same as that
returned by the Heapster model.

In the case of aggregations, the normal `MetricResult` and `MetricResultList`
are wrapped in order to differentiate between different aggregations. Each
metric point represents one bucket (if no buckets are requested, only one point
is returned). The timestamp in the case of aggregations is the timestamp of
the start of that bucket.

```go
// MetricAggregationResult wraps one MetricResult per aggregation.
type MetricAggregationResult struct {
	Average *MetricResult
	Maximum *MetricResult
	Minimum *MetricResult
	Median  *MetricResult
	Count   *MetricResult
	// Percentiles is keyed by the percentile number (e.g. 95).
	Percentiles map[uint64]MetricResult
}

// MetricListAggregationResult is the list counterpart of
// MetricAggregationResult.
type MetricListAggregationResult struct {
	Average *MetricResultList
	Maximum *MetricResultList
	Minimum *MetricResultList
	Median  *MetricResultList
	Count   *MetricResultList
	// Percentiles is keyed by the percentile number (e.g. 95).
	Percentiles map[uint64]MetricResultList
}
```

### Aggregations

Several different aggregations will be supported. Aggregations should be
performed in the metrics sink. If more aggregations later become supported
across all metrics sinks, the list can be expanded (and the API version
should probably be bumped, since the supported aggregations should be part of
the API).

- Average (arithmetic mean): `/{metric-name}/average`
- Maximum: `/{metric-name}/max`
- Minimum: `/{metric-name}/min`
- Percentile: `/{metric-name}/{number}-perc`
- Median: `/{metric-name}/median`
- Count: `/{metric-name}/count`
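
Since the aggregation segment of the URL may hold a comma-separated list, a request handler needs to split and validate it. The following is a minimal sketch of that step. `aggregationSpec` and `parseAggregations` are hypothetical names, and treating the percentile as an unsigned integer is an assumption based on the `map[uint64]` keys in the result types.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// aggregationSpec is a hypothetical parsed form of one aggregation name.
type aggregationSpec struct {
	Name       string // "average", "max", "min", "median", "count" or "percentile"
	Percentile uint64 // only meaningful when Name is "percentile"
}

// parseAggregations splits a comma-separated aggregation list and validates
// each entry against the names listed in the proposal.
func parseAggregations(list string) ([]aggregationSpec, error) {
	var specs []aggregationSpec
	for _, name := range strings.Split(list, ",") {
		switch name {
		case "average", "max", "min", "median", "count":
			specs = append(specs, aggregationSpec{Name: name})
		default:
			// Anything else must have the form "{number}-perc".
			number := strings.TrimSuffix(name, "-perc")
			if number == name {
				return nil, fmt.Errorf("unknown aggregation %q", name)
			}
			p, err := strconv.ParseUint(number, 10, 64)
			if err != nil {
				return nil, fmt.Errorf("invalid percentile in %q: %v", name, err)
			}
			specs = append(specs, aggregationSpec{Name: "percentile", Percentile: p})
		}
	}
	return specs, nil
}

func main() {
	specs, err := parseAggregations("average,95-perc,count")
	fmt.Println(specs, err)
}
```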

## Scaling and Performance Considerations

Since Oldtimer itself does not store any data, it should be fairly easy to
deploy multiple replicas of Oldtimer. The metrics sinks themselves should
already have clustering support, and thus can be scaled as well. Since
Oldtimer queries the metrics sinks themselves, response latency should
depend mainly on how quickly the sinks can respond to queries.

## Open Questions

- Does the choice of percentiles need to be limited? InfluxDB and Hawkular
  appear to support arbitrary percentile values in queries, while GCM v3 appears
  to support 99, 95, 50, 5, and OpenTSDB appears to support 50, 75, 90, 95, 99,
  and 999 (meaning the common values would be 50, 95, and 99).