Commit c8415a7

[ML] adding support for composite aggs in anomaly detection (#69970)
This commit allows for composite aggregations in datafeeds. Composite aggs provide a much better solution for having influencers, partitions, etc. on high volume data. Instead of worrying about long scrolls in the datafeed, the calculation is distributed across the cluster via the aggregations.

The restrictions for this support are as follows:

- The composite aggregation must have EXACTLY one `date_histogram` source.
- The sub-aggs of the composite aggregation must have a `max` aggregation on the SAME time field as the aforementioned `date_histogram` source.
- The composite agg must be the ONLY top level agg and it cannot have a `composite` or `date_histogram` sub-agg.
- If using a `date_histogram` to bucket time, it cannot have a `composite` sub-agg.
- The top-level `composite` agg cannot have a sibling pipeline agg. Pipeline aggregations are supported as a sub-agg (thus a pipeline agg INSIDE the bucket).

Some key user interaction differences:

- Speed and resources used by the cluster should be controlled by the `size` parameter in the `composite` aggregation. Previously, we said if you are using aggs, use a specific `chunking_config`. But with composite, that is not necessary.
- Users really shouldn't use nested `terms` aggs any longer. While this is still a "valid" configuration and MAY be desirable for some users (only wanting the top 10 of certain terms), typically when users want influencers, partition fields, etc. they want the ENTIRE population. Previously, this really wasn't possible with aggs; with `composite` it is.
- I cannot really think of a typical use case that SHOULD ever use a multi-bucket aggregation that is NOT supported by composite.
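For illustration, here is a minimal sketch of a datafeed aggregation that satisfies these restrictions: a single top-level `composite` aggregation with exactly one `date_histogram` source and a `max` sub-aggregation on the same time field. The job, index, and field names (`my-job`, `my-index`, `@timestamp`, `host`) are hypothetical placeholders and are not part of this commit; the `farequote` examples in the documentation changes below are the ones the commit actually adds.

[source,console]
----------------------------------
PUT _ml/datafeeds/datafeed-my-job
{
  "job_id": "my-job",
  "indices": [ "my-index" ],
  "aggregations": {
    "buckets": {
      "composite": {
        "size": 1000, <1>
        "sources": [
          {
            "time_bucket": { <2>
              "date_histogram": {
                "field": "@timestamp",
                "fixed_interval": "60s",
                "time_zone": "UTC"
              }
            }
          },
          {
            "host": { <3>
              "terms": {
                "field": "host"
              }
            }
          }
        ]
      },
      "aggregations": {
        "@timestamp": { <4>
          "max": {
            "field": "@timestamp"
          }
        }
      }
    }
  }
}
----------------------------------
<1> `size` controls speed and cluster resource usage, replacing `chunking_config` tuning.
<2> The single required `date_histogram` source; it must not be sorted in descending order.
<3> An additional `terms` source for a partition or influencer field (hypothetical field name).
<4> The required `max` aggregation on the same time field as the `date_histogram` source.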
1 parent 15f8638 commit c8415a7

File tree

31 files changed, +2070 -298 lines changed


docs/reference/aggregations/bucket/composite-aggregation.asciidoc

Lines changed: 1 addition & 0 deletions
@@ -848,6 +848,7 @@ GET /_search
--------------------------------------------------
// TESTRESPONSE[s/\.\.\.//]

+[[search-aggregations-bucket-composite-aggregation-pipeline-aggregations]]
==== Pipeline aggregations

The composite agg is not currently compatible with pipeline aggregations, nor does it make sense in most cases.

docs/reference/ml/anomaly-detection/ml-configuring-aggregations.asciidoc

Lines changed: 210 additions & 55 deletions
@@ -11,57 +11,61 @@ distributes these calculations across your cluster. You can then feed this
aggregated data into the {ml-features} instead of raw results, which
reduces the volume of data that must be considered while detecting anomalies.

-TIP: If you use a terms aggregation and the cardinality of a term is high, the
-aggregation might not be effective and you might want to just use the default
-search and scroll behavior.
+TIP: If you use a terms aggregation and the cardinality of a term is high but
+still significantly less than your total number of documents, use
+{ref}/search-aggregations-bucket-composite-aggregation.html[composite aggregations]
+experimental:[Support for composite aggregations inside datafeeds is currently experimental].

[discrete]
[[aggs-limits-dfeeds]]
== Requirements and limitations

-There are some limitations to using aggregations in {dfeeds}. Your aggregation
-must include a `date_histogram` aggregation, which in turn must contain a `max`
-aggregation on the time field. This requirement ensures that the aggregated data
-is a time series and the timestamp of each bucket is the time of the last record
-in the bucket.
+There are some limitations to using aggregations in {dfeeds}.

-IMPORTANT: The name of the aggregation and the name of the field that the agg
-operates on need to match, otherwise the aggregation doesn't work. For example,
-if you use a `max` aggregation on a time field called `responsetime`, the name
+Your aggregation must include a `date_histogram` aggregation or a top level `composite` aggregation,
+which in turn must contain a `max` aggregation on the time field.
+This requirement ensures that the aggregated data is a time series and the timestamp
+of each bucket is the time of the last record in the bucket.
+
+IMPORTANT: The name of the aggregation and the name of the field that it
+operates on need to match, otherwise the aggregation doesn't work. For example,
+if you use a `max` aggregation on a time field called `responsetime`, the name
of the aggregation must be also `responsetime`.

-You must also consider the interval of the date histogram aggregation carefully.
-The bucket span of your {anomaly-job} must be divisible by the value of the
-`calendar_interval` or `fixed_interval` in your aggregation (with no remainder).
-If you specify a `frequency` for your {dfeed}, it must also be divisible by this
-interval. {anomaly-jobs-cap} cannot use date histograms with an interval
-measured in months because the length of the month is not fixed. {dfeeds-cap}
-tolerate weeks or smaller units.
+You must consider the interval of the `date_histogram` or `composite`
+aggregation carefully. The bucket span of your {anomaly-job} must be divisible
+by the value of the `calendar_interval` or `fixed_interval` in your aggregation
+(with no remainder). If you specify a `frequency` for your {dfeed},
+it must also be divisible by this interval. {anomaly-jobs-cap} cannot use
+`date_histogram` or `composite` aggregations with an interval measured in months
+because the length of the month is not fixed; they can use weeks or smaller units.

TIP: As a rule of thumb, if your detectors use <<ml-metric-functions,metric>> or
-<<ml-sum-functions,sum>> analytical functions, set the date histogram
+<<ml-sum-functions,sum>> analytical functions, set the `date_histogram` or `composite`
aggregation interval to a tenth of the bucket span. This suggestion creates
finer, more granular time buckets, which are ideal for this type of analysis. If
your detectors use <<ml-count-functions,count>> or <<ml-rare-functions,rare>>
functions, set the interval to the same value as the bucket span.

-If your <<aggs-dfeeds,{dfeed} uses aggregations with nested `terms` aggs>> and
-model plot is not enabled for the {anomaly-job}, neither the **Single Metric
-Viewer** nor the **Anomaly Explorer** can plot and display an anomaly
-chart for the job. In these cases, the charts are not visible and an explanatory
+If your <<aggs-dfeeds,{dfeed} uses aggregations with nested `terms` aggs>> and
+model plot is not enabled for the {anomaly-job}, neither the **Single Metric
+Viewer** nor the **Anomaly Explorer** can plot and display an anomaly
+chart for the job. In these cases, the charts are not visible and an explanatory
message is shown.

-When the aggregation interval of the {dfeed} and the bucket span of the job
-don't match, the values of the chart plotted in both the **Single Metric
-Viewer** and the **Anomaly Explorer** differ from the actual values of the job.
-To avoid this behavior, make sure that the aggregation interval in the {dfeed}
-configuration and the bucket span in the {anomaly-job} configuration have the
+When the aggregation interval of the {dfeed} and the bucket span of the job
+don't match, the values of the chart plotted in both the **Single Metric
+Viewer** and the **Anomaly Explorer** differ from the actual values of the job.
+To avoid this behavior, make sure that the aggregation interval in the {dfeed}
+configuration and the bucket span in the {anomaly-job} configuration have the
same values.

+Your {dfeed} can contain multiple aggregations, but only the ones with names
+that match values in the job configuration are fed to the job.

[discrete]
-[[aggs-include-jobs]]
-== Including aggregations in {anomaly-jobs}
+[[aggs-using-date-histogram]]
+=== Including aggregations in {anomaly-jobs}

When you create or update an {anomaly-job}, you can include the names of
aggregations, for example:
@@ -86,8 +90,8 @@ PUT _ml/anomaly_detectors/farequote
----------------------------------
// TEST[skip:setup:farequote_data]

-<1> The `airline`, `responsetime`, and `time` fields are aggregations. Only the
-aggregated fields defined in the `analysis_config` object are analyzed by the
+<1> The `airline`, `responsetime`, and `time` fields are aggregations. Only the
+aggregated fields defined in the `analysis_config` object are analyzed by the
{anomaly-job}.

NOTE: When the `summary_count_field_name` property is set to a non-null value,
@@ -134,25 +138,135 @@ PUT _ml/datafeeds/datafeed-farequote
----------------------------------
// TEST[skip:setup:farequote_job]

-<1> The aggregations have names that match the fields that they operate on. The
+<1> The aggregations have names that match the fields that they operate on. The
`max` aggregation is named `time` and its field also needs to be `time`.
-<2> The `term` aggregation is named `airline` and its field is also named
+<2> The `term` aggregation is named `airline` and its field is also named
`airline`.
-<3> The `avg` aggregation is named `responsetime` and its field is also named
+<3> The `avg` aggregation is named `responsetime` and its field is also named
`responsetime`.

-Your {dfeed} can contain multiple aggregations, but only the ones with names
-that match values in the job configuration are fed to the job.
+TIP: If you are using a `term` aggregation to gather influencer or partition
+field information, consider using a `composite` aggregation. It performs
+better than a `date_histogram` with a nested `term` aggregation and also includes
+all the values of the field instead of the top values per bucket.
+
+[discrete]
+[[aggs-using-composite]]
+=== Using composite aggregations in {anomaly-jobs}
+
+experimental::[]
+
+For `composite` aggregation support, there must be exactly one `date_histogram` value
+source. That value source must not be sorted in descending order. Additional
+`composite` aggregation value sources are allowed, such as `terms`.
+
+NOTE: A {dfeed} that uses composite aggregations may not be as performant as datafeeds that use scrolling or
+date histogram aggregations. Composite aggregations are optimized
+for queries that are either `match_all` or `range` filters. Other types of
+queries may cause the `composite` aggregation to be inefficient.
+
+Here is an example that uses a `composite` aggregation instead of a
+`date_histogram`.
+
+Assuming the same job configuration as above.
+
+[source,console]
+----------------------------------
+PUT _ml/anomaly_detectors/farequote-composite
+{
+  "analysis_config": {
+    "bucket_span": "60m",
+    "detectors": [{
+      "function": "mean",
+      "field_name": "responsetime",
+      "by_field_name": "airline"
+    }],
+    "summary_count_field_name": "doc_count"
+  },
+  "data_description": {
+    "time_field":"time"
+  }
+}
+----------------------------------
+// TEST[skip:setup:farequote_data]
+
+This is an example of a datafeed that uses a `composite` aggregation to bucket
+the metrics based on time and terms:
+
+[source,console]
+----------------------------------
+PUT _ml/datafeeds/datafeed-farequote-composite
+{
+  "job_id": "farequote-composite",
+  "indices": [
+    "farequote"
+  ],
+  "aggregations": {
+    "buckets": {
+      "composite": {
+        "size": 1000, <1>
+        "sources": [
+          {
+            "time_bucket": { <2>
+              "date_histogram": {
+                "field": "time",
+                "fixed_interval": "360s",
+                "time_zone": "UTC"
+              }
+            }
+          },
+          {
+            "airline": { <3>
+              "terms": {
+                "field": "airline"
+              }
+            }
+          }
+        ]
+      },
+      "aggregations": {
+        "time": { <4>
+          "max": {
+            "field": "time"
+          }
+        },
+        "responsetime": { <5>
+          "avg": {
+            "field": "responsetime"
+          }
+        }
+      }
+    }
+  }
+}
+----------------------------------
+// TEST[skip:setup:farequote_job]

+<1> Provide the `size` to the composite agg to control how many resources
+are used when aggregating the data. A larger `size` means a faster datafeed but
+more cluster resources are used when searching.
+<2> The required `date_histogram` composite aggregation source. Make sure it
+is named differently than your desired time field.
+<3> Instead of using a regular `term` aggregation, adding a composite
+aggregation `term` source with the name `airline` works. Note its name
+is the same as the field.
+<4> The required `max` aggregation whose name is the time field in the
+job analysis config.
+<5> The `avg` aggregation is named `responsetime` and its field is also named
+`responsetime`.

[discrete]
[[aggs-dfeeds]]
== Nested aggregations in {dfeeds}

-{dfeeds-cap} support complex nested aggregations. This example uses the
-`derivative` pipeline aggregation to find the first order derivative of the
+{dfeeds-cap} support complex nested aggregations. This example uses the
+`derivative` pipeline aggregation to find the first order derivative of the
counter `system.network.out.bytes` for each value of the field `beat.name`.

+NOTE: `derivative` or other pipeline aggregations may not work within `composite`
+aggregations. See
+{ref}/search-aggregations-bucket-composite-aggregation.html#search-aggregations-bucket-composite-aggregation-pipeline-aggregations[composite aggregations and pipeline aggregations].
+
[source,js]
----------------------------------
"aggregations": {
@@ -247,8 +361,9 @@ number of unique entries for the `error` field.
[[aggs-define-dfeeds]]
== Defining aggregations in {dfeeds}

-When you define an aggregation in a {dfeed}, it must have the following form:
+When you define an aggregation in a {dfeed}, it must have one of the following forms:

+When using a `date_histogram` aggregation to bucket by time:
[source,js]
----------------------------------
"aggregations": {
@@ -282,36 +397,75 @@ When you define an aggregation in a {dfeed}, it must have the following form:
----------------------------------
// NOTCONSOLE

-The top level aggregation must be either a
-{ref}/search-aggregations-bucket.html[bucket aggregation] containing as single
-sub-aggregation that is a `date_histogram` or the top level aggregation is the
-required `date_histogram`. There must be exactly one `date_histogram`
-aggregation. For more information, see
-{ref}/search-aggregations-bucket-datehistogram-aggregation.html[Date histogram aggregation].
+When using a `composite` aggregation:
+
+[source,js]
+----------------------------------
+"aggregations": {
+  "composite_agg": {
+    "sources": [
+      {
+        "date_histogram_agg": {
+          "field": "time",
+          ...settings...
+        }
+      },
+      ...other valid sources...
+    ],
+    ...composite agg settings...,
+    "aggregations": {
+      "timestamp": {
+        "max": {
+          "field": "time"
+        }
+      },
+      ...other aggregations...
+      [
+      [,"aggregations" : {
+        [<sub_aggregation>]+
+      } ]
+      }]
+    }
+  }
+}
+----------------------------------
+// NOTCONSOLE
+
+The top level aggregation must be exclusively one of the following:
+* A {ref}/search-aggregations-bucket.html[bucket aggregation] containing a single
+sub-aggregation that is a `date_histogram`
+* A top level aggregation that is a `date_histogram`
+* A top level aggregation that is a `composite` aggregation
+
+There must be exactly one `date_histogram` or `composite` aggregation. For more information, see
+{ref}/search-aggregations-bucket-datehistogram-aggregation.html[Date histogram aggregation] and
+{ref}/search-aggregations-bucket-composite-aggregation.html[Composite aggregation].

NOTE: The `time_zone` parameter in the date histogram aggregation must be set to
`UTC`, which is the default value.

-Each histogram bucket has a key, which is the bucket start time. This key cannot
-be used for aggregations in {dfeeds}, however, because they need to know the
-time of the latest record within a bucket. Otherwise, when you restart a
-{dfeed}, it continues from the start time of the histogram bucket and possibly
-fetches the same data twice. The max aggregation for the time field is therefore
-necessary to provide the time of the latest record within a bucket.
+Each histogram or composite bucket has a key, which is the bucket start time.
+This key cannot be used for aggregations in {dfeeds}, however, because
+they need to know the time of the latest record within a bucket.
+Otherwise, when you restart a {dfeed}, it continues from the start time of the
+histogram or composite bucket and possibly fetches the same data twice.
+The max aggregation for the time field is therefore necessary to provide
+the time of the latest record within a bucket.

You can optionally specify a terms aggregation, which creates buckets for
different values of a field.

IMPORTANT: If you use a terms aggregation, by default it returns buckets for
the top ten terms. Thus if the cardinality of the term is greater than 10, not
-all terms are analyzed.
+all terms are analyzed. In this case, consider using `composite` aggregations
+experimental:[Support for composite aggregations inside datafeeds is currently experimental].

You can change this behavior by setting the `size` parameter. To
determine the cardinality of your data, you can run searches such as:

[source,js]
--------------------------------------------------
-GET .../_search
+GET .../_search
{
  "aggs": {
    "service_cardinality": {
@@ -324,10 +478,11 @@ GET .../_search
--------------------------------------------------
// NOTCONSOLE

+
By default, {es} limits the maximum number of terms returned to 10000. For high
cardinality fields, the query might not run. It might return errors related to
circuit breaking exceptions that indicate that the data is too large. In such
-cases, do not use aggregations in your {dfeed}. For more information, see
+cases, use `composite` aggregations in your {dfeed}. For more information, see
{ref}/search-aggregations-bucket-terms-aggregation.html[Terms aggregation].

You can also optionally specify multiple sub-aggregations. The sub-aggregations

server/src/main/java/org/elasticsearch/search/aggregations/bucket/composite/DateHistogramValuesSourceBuilder.java

Lines changed: 8 additions & 2 deletions
@@ -212,15 +212,21 @@ public DateHistogramValuesSourceBuilder fixedInterval(DateHistogramInterval inte
 * {@code null} then it means that the interval is expressed as a fixed
 * {@link TimeValue} and may be accessed via {@link #getIntervalAsFixed()} ()}. */
public DateHistogramInterval getIntervalAsCalendar() {
-    return dateHistogramInterval.getAsCalendarInterval();
+    if (dateHistogramInterval.getIntervalType().equals(DateIntervalWrapper.IntervalTypeEnum.CALENDAR)) {
+        return dateHistogramInterval.getAsCalendarInterval();
+    }
+    return null;
}

/**
 * Get the interval as a {@link TimeValue}, regardless of how it was configured. Returns null if
 * the interval cannot be parsed as a fixed time.
 */
public DateHistogramInterval getIntervalAsFixed() {
-    return dateHistogramInterval.getAsFixedInterval();
+    if (dateHistogramInterval.getIntervalType().equals(DateIntervalWrapper.IntervalTypeEnum.FIXED)) {
+        return dateHistogramInterval.getAsFixedInterval();
+    }
+    return null;
}

/**
