[ML] adding support for composite aggs in anomaly detection (#69970)
This commit allows for composite aggregations in datafeeds.
Composite aggregations provide a much better solution for having influencers, partitions, etc. on high-volume data. Instead of worrying about long scrolls in the datafeed, the calculation is distributed across the cluster via the aggregations.
The restrictions for this support are as follows:
- The composite aggregation must have EXACTLY one `date_histogram` source
- The sub-aggs of the composite aggregation must have a `max` aggregation on the SAME timefield as the aforementioned `date_histogram` source
- The composite agg must be the ONLY top level agg and it cannot have a `composite` or `date_histogram` sub-agg
- If using a `date_histogram` to bucket time, it cannot have a `composite` sub-agg.
- The top-level `composite` agg cannot have a sibling pipeline agg. Pipeline aggregations are supported as a sub-agg (thus a pipeline agg INSIDE the bucket).
Some key user interaction differences:
- Speed and the resources used by the cluster are controlled by the `size` parameter in the `composite` aggregation. Previously, we recommended a specific `chunking_config` when using aggs. With composite, that is not necessary.
- Users really shouldn't use nested `terms` aggs any longer. While this is still a "valid" configuration and MAY be desirable for some users (only wanting the top 10 of certain terms), typically when users want influencers, partition fields, etc. they want the ENTIRE population. Previously, this really wasn't possible with aggs; with `composite` it is.
- I cannot really think of a typical use case that SHOULD ever use a multi-bucket aggregation that is NOT supported by composite.
experimental::[Support for composite aggregations inside datafeeds is currently experimental.]
[discrete]
[[aggs-limits-dfeeds]]
== Requirements and limitations

There are some limitations to using aggregations in {dfeeds}.

Your aggregation must include a `date_histogram` aggregation or a top level
`composite` aggregation, which in turn must contain a `max` aggregation on the
time field. This requirement ensures that the aggregated data is a time series
and the timestamp of each bucket is the time of the last record in the bucket.

IMPORTANT: The name of the aggregation and the name of the field that it
operates on need to match, otherwise the aggregation doesn't work. For example,
if you use a `max` aggregation on a time field called `responsetime`, the name
of the aggregation must also be `responsetime`.

You must consider the interval of the `date_histogram` or `composite`
aggregation carefully. The bucket span of your {anomaly-job} must be divisible
by the value of the `calendar_interval` or `fixed_interval` in your aggregation
(with no remainder). If you specify a `frequency` for your {dfeed},
it must also be divisible by this interval. {anomaly-jobs-cap} cannot use
`date_histogram` or `composite` aggregations with an interval measured in months
because the length of the month is not fixed; they can use weeks or smaller units.
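As a concrete illustration of the divisibility rule, consider the following pairings (a sketch only; the values are hypothetical and chosen purely for the arithmetic):

```js
// Sketch, not a complete request: how the job bucket span and the
// aggregation interval must line up. All values are illustrative.
//
// job:      "bucket_span": "60m"       -> 3600 seconds
// datafeed: "fixed_interval": "360s"   -> 3600 / 360 = 10, no remainder: allowed
// datafeed: "fixed_interval": "420s"   -> 3600 / 420 has a remainder: rejected
// datafeed: "frequency": "30m"         -> 1800 / 360 = 5, divisible: allowed
```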

TIP: As a rule of thumb, if your detectors use <<ml-metric-functions,metric>> or
<<ml-sum-functions,sum>> analytical functions, set the `date_histogram` or `composite`
aggregation interval to a tenth of the bucket span. This suggestion creates
finer, more granular time buckets, which are ideal for this type of analysis. If
your detectors use <<ml-count-functions,count>> or <<ml-rare-functions,rare>>
functions, set the interval to the same value as the bucket span.

If your <<aggs-dfeeds,{dfeed} uses aggregations with nested `terms` aggs>> and
model plot is not enabled for the {anomaly-job}, neither the **Single Metric
Viewer** nor the **Anomaly Explorer** can plot and display an anomaly
chart for the job. In these cases, the charts are not visible and an explanatory
message is shown.

When the aggregation interval of the {dfeed} and the bucket span of the job
don't match, the values of the chart plotted in both the **Single Metric
Viewer** and the **Anomaly Explorer** differ from the actual values of the job.
To avoid this behavior, make sure that the aggregation interval in the {dfeed}
configuration and the bucket span in the {anomaly-job} configuration have the
same values.

Your {dfeed} can contain multiple aggregations, but only the ones with names
that match values in the job configuration are fed to the job.
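For example (the `unused_metric` and `other_field` names below are hypothetical, added only for illustration), if the job's `analysis_config` references only `time` and `responsetime`, a sibling aggregation with any other name is computed by the cluster but never reaches the job:

```js
// Sketch of sibling aggregations inside a datafeed; names are illustrative.
"aggregations": {
  "time": { "max": { "field": "time" } },                   // fed to the job
  "responsetime": { "avg": { "field": "responsetime" } },   // fed to the job
  "unused_metric": { "avg": { "field": "other_field" } }    // ignored by the job
}
```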

[discrete]
[[aggs-using-date-histogram]]
=== Including aggregations in {anomaly-jobs}

When you create or update an {anomaly-job}, you can include the names of
aggregations, for example:
@@ -86,8 +90,8 @@ PUT _ml/anomaly_detectors/farequote
----------------------------------
// TEST[skip:setup:farequote_data]

<1> The `airline`, `responsetime`, and `time` fields are aggregations. Only the
aggregated fields defined in the `analysis_config` object are analyzed by the
{anomaly-job}.

NOTE: When the `summary_count_field_name` property is set to a non-null value,
@@ -134,25 +138,135 @@ PUT _ml/datafeeds/datafeed-farequote
----------------------------------
// TEST[skip:setup:farequote_job]

<1> The aggregations have names that match the fields that they operate on. The
`max` aggregation is named `time` and its field also needs to be `time`.
<2> The `term` aggregation is named `airline` and its field is also named
`airline`.
<3> The `avg` aggregation is named `responsetime` and its field is also named
`responsetime`.

TIP: If you are using a `term` aggregation to gather influencer or partition
field information, consider using a `composite` aggregation. It performs
better than a `date_histogram` with a nested `term` aggregation and also includes
all the values of the field instead of the top values per bucket.

[discrete]
[[aggs-using-composite]]
=== Using composite aggregations in {anomaly-jobs}

experimental::[]

For `composite` aggregation support, there must be exactly one `date_histogram` value
source. That value source must not be sorted in descending order. Additional
`composite` aggregation value sources are allowed, such as `terms`.
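For instance, a `sources` array that satisfies these rules could look like the following (a sketch only; the `time_bucket` and `airline` names are illustrative). Adding `"order": "desc"` to the `date_histogram` source, or adding a second `date_histogram` source, would make it invalid:

```js
// Sketch: exactly one date_histogram value source, in the default
// ascending order, plus an optional terms source.
"sources": [
  { "time_bucket": { "date_histogram": { "field": "time", "fixed_interval": "360s" } } },
  { "airline": { "terms": { "field": "airline" } } }
]
```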

NOTE: A {dfeed} that uses composite aggregations may not be as performant as
datafeeds that use scrolling or date histogram aggregations. Composite
aggregations are optimized for queries that are either `match_all` or `range`
filters. Other types of queries may cause the `composite` aggregation to be
inefficient.

Here is an example that uses a `composite` aggregation instead of a
`date_histogram`. It assumes the same job configuration as above:

[source,console]
----------------------------------
PUT _ml/anomaly_detectors/farequote-composite
{
  "analysis_config": {
    "bucket_span": "60m",
    "detectors": [{
      "function": "mean",
      "field_name": "responsetime",
      "by_field_name": "airline"
    }],
    "summary_count_field_name": "doc_count"
  },
  "data_description": {
    "time_field": "time"
  }
}
----------------------------------
// TEST[skip:setup:farequote_data]

This is an example of a datafeed that uses a `composite` aggregation to bucket
the metrics based on time and terms:

[source,console]
----------------------------------
PUT _ml/datafeeds/datafeed-farequote-composite
{
  "job_id": "farequote-composite",
  "indices": [
    "farequote"
  ],
  "aggregations": {
    "buckets": {
      "composite": {
        "size": 1000, <1>
        "sources": [
          {
            "time_bucket": { <2>
              "date_histogram": {
                "field": "time",
                "fixed_interval": "360s",
                "time_zone": "UTC"
              }
            }
          },
          {
            "airline": { <3>
              "terms": {
                "field": "airline"
              }
            }
          }
        ]
      },
      "aggregations": {
        "time": { <4>
          "max": {
            "field": "time"
          }
        },
        "responsetime": { <5>
          "avg": {
            "field": "responsetime"
          }
        }
      }
    }
  }
}
----------------------------------
// TEST[skip:setup:farequote_job]

<1> Provide the `size` to the composite agg to control how many resources
are used when aggregating the data. A larger `size` means a faster datafeed but
more cluster resources are used when searching.
<2> The required `date_histogram` composite aggregation source. Make sure it
is named differently from your desired time field.
<3> Instead of using a regular `term` aggregation, adding a composite
aggregation `term` source with the name `airline` works. Note that its name
is the same as the field.
<4> The required `max` aggregation whose name is the time field in the
job analysis config.
<5> The `avg` aggregation is named `responsetime` and its field is also named
`responsetime`.

[discrete]
[[aggs-dfeeds]]
== Nested aggregations in {dfeeds}

{dfeeds-cap} support complex nested aggregations. This example uses the
`derivative` pipeline aggregation to find the first order derivative of the
counter `system.network.out.bytes` for each value of the field `beat.name`.

NOTE: `derivative` or other pipeline aggregations may not work within `composite`
aggregations. See
{ref}/search-aggregations-bucket-composite-aggregation.html#search-aggregations-bucket-composite-aggregation-pipeline-aggregations[composite aggregations and pipeline aggregations].

[source,js]
----------------------------------
"aggregations": {
@@ -247,8 +361,9 @@ number of unique entries for the `error` field.
[[aggs-define-dfeeds]]
== Defining aggregations in {dfeeds}

When you define an aggregation in a {dfeed}, it must have one of the following forms:

When using a `date_histogram` aggregation to bucket by time:

[source,js]
----------------------------------
"aggregations": {
@@ -282,36 +397,75 @@ When you define an aggregation in a {dfeed}, it must have the following form:
----------------------------------
// NOTCONSOLE

The top level aggregation must be either a
{ref}/search-aggregations-bucket.html[bucket aggregation] containing a single
sub-aggregation that is a `date_histogram` or the top level aggregation is the
required `date_histogram`. There must be exactly one `date_histogram`