@@ -17,56 +17,14 @@ NOTE: If you have considerable memory allocated to your JVM but are receiving ci
<<search-aggregations-bucket-diversified-sampler-aggregation,diversified sampler>>, or
<<search-aggregations-random-sampler-aggregation,random sampler>> to explore the created categories.

+ NOTE: The algorithm used for categorization was completely changed in version 8.3.0. As a result, this aggregation
+ will not work in a mixed version cluster where some nodes are on version 8.3.0 or higher and others are
+ on a version older than 8.3.0. Upgrade all nodes in your cluster to the same version if you experience
+ an error related to this change.
+
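As an illustration of the sampling approach mentioned above, here is a minimal sketch that wraps
`categorize_text` in a <<search-aggregations-random-sampler-aggregation,random sampler>> aggregation so that
only a fraction of the matching documents is re-analyzed. The `log-messages` index and `message` field are
borrowed from the examples below; the `probability` value is an arbitrary assumption.

[source,console]
--------------------------------------------------
POST log-messages/_search?filter_path=aggregations
{
  "aggs": {
    "sample": {
      "random_sampler": {
        "probability": 0.1        <1>
      },
      "aggs": {
        "categories": {
          "categorize_text": {
            "field": "message"    <2>
          }
        }
      }
    }
  }
}
--------------------------------------------------
<1> Roughly 10% of the matching documents are sampled and fed to the sub-aggregation.
<2> The semi-structured text field to categorize.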
[[bucket-categorize-text-agg-syntax]]
==== Parameters

- `field`::
- (Required, string)
- The semi-structured text field to categorize.
-
- `max_unique_tokens`::
- (Optional, integer, default: `50`)
- The maximum number of unique tokens at any position up to `max_matched_tokens`.
- Must be larger than 1. Smaller values use less memory and create fewer categories.
- Larger values will use more memory and create narrower categories.
- Max allowed value is `100`.
-
- `max_matched_tokens`::
- (Optional, integer, default: `5`)
- The maximum number of token positions to match on before attempting to merge categories.
- Larger values will use more memory and create narrower categories.
- Max allowed value is `100`.
-
- Example:
- `max_matched_tokens` of 2 would disallow merging of the categories
- [`foo` `bar` `baz`]
- [`foo` `baz` `bozo`]
- As the first 2 tokens are required to match for the category.
-
- NOTE: Once `max_unique_tokens` is reached at a given position, a new `*` token is
- added and all new tokens at that position are matched by the `*` token.
-
- `similarity_threshold`::
- (Optional, integer, default: `50`)
- The minimum percentage of tokens that must match for text to be added to the
- category bucket.
- Must be between 1 and 100. The larger the value the narrower the categories.
- Larger values will increase memory usage and create narrower categories.
-
- `categorization_filters`::
- (Optional, array of strings)
- This property expects an array of regular expressions. The expressions
- are used to filter out matching sequences from the categorization field values.
- You can use this functionality to fine tune the categorization by excluding
- sequences from consideration when categories are defined. For example, you can
- exclude SQL statements that appear in your log files. This
- property cannot be used at the same time as `categorization_analyzer`. If you
- only want to define simple regular expression filters that are applied prior to
- tokenization, setting this property is the easiest method. If you also want to
- customize the tokenizer or post-tokenization filtering, use the
- `categorization_analyzer` property instead and include the filters as
- `pattern_replace` character filters.
-
`categorization_analyzer`::
(Optional, object or string)
The categorization analyzer specifies how the text is analyzed and tokenized before
@@ -95,14 +53,33 @@ include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=tokenizer]
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=filter]
=====
- `shard_size`::
+ `categorization_filters`::
+ (Optional, array of strings)
+ This property expects an array of regular expressions. The expressions
+ are used to filter out matching sequences from the categorization field values.
+ You can use this functionality to fine tune the categorization by excluding
+ sequences from consideration when categories are defined. For example, you can
+ exclude SQL statements that appear in your log files. This
+ property cannot be used at the same time as `categorization_analyzer`. If you
+ only want to define simple regular expression filters that are applied prior to
+ tokenization, setting this property is the easiest method. If you also want to
+ customize the tokenizer or post-tokenization filtering, use the
+ `categorization_analyzer` property instead and include the filters as
+ `pattern_replace` character filters (see the sketch after this parameter list).
+
+ `field`::
+ (Required, string)
+ The semi-structured text field to categorize.
+
+ `max_matched_tokens`::
(Optional, integer)
- The number of categorization buckets to return from each shard before merging
- all the results.
+ This parameter does nothing now, but is permitted for compatibility with the original
+ pre-8.3.0 implementation.

- `size`::
- (Optional, integer, default: `10`)
- The number of buckets to return.
+ `max_unique_tokens`::
+ (Optional, integer)
+ This parameter does nothing now, but is permitted for compatibility with the original
+ pre-8.3.0 implementation.

`min_doc_count`::
(Optional, integer)
@@ -113,8 +90,23 @@ The minimum number of documents for a bucket to be returned to the results.
The minimum number of documents for a bucket to be returned from the shard before
merging.

- ==== Basic use
+ `shard_size`::
+ (Optional, integer)
+ The number of categorization buckets to return from each shard before merging
+ all the results.
+
+ `similarity_threshold`::
+ (Optional, integer, default: `70`)
+ The minimum percentage of token weight that must match for text to be added to the
+ category bucket.
+ Must be between 1 and 100. The larger the value, the narrower the categories.
+ Larger values will increase memory usage and create narrower categories.

+ `size`::
+ (Optional, integer, default: `10`)
+ The number of buckets to return.
+
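As a sketch of the `categorization_analyzer` approach recommended above (rather than a documented, tested
example), the request below moves a filter into a `pattern_replace` character filter. The index name, field
name, pattern, and the `lowercase` token filter are assumptions for illustration only.

[source,console]
--------------------------------------------------
POST log-messages/_search?filter_path=aggregations
{
  "aggs": {
    "categories": {
      "categorize_text": {
        "field": "message",
        "categorization_analyzer": {
          "char_filter": [
            {
              "type": "pattern_replace",
              "pattern": "\\w+\\_\\d{3}"    <1>
            }
          ],
          "tokenizer": "ml_standard",       <2>
          "filter": [ "lowercase" ]
        }
      }
    }
  }
}
--------------------------------------------------
<1> Strips sequences such as `foo_123` before tokenization, equivalent to passing the same
regular expression in `categorization_filters`.
<2> The tokenizer used by the default categorization analyzer.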
+ ==== Basic use
WARNING: Re-analyzing _large_ result sets will require a lot of time and memory. This aggregation should be
used in conjunction with <<async-search, Async search>>. Additionally, you may consider
@@ -149,27 +141,30 @@ Response:
"buckets" : [
{
"doc_count" : 3,
- "key" : "Node shutting down"
+ "key" : "Node shutting down",
+ "max_matching_length" : 49
},
{
"doc_count" : 1,
- "key" : "Node starting up"
+ "key" : "Node starting up",
+ "max_matching_length" : 47
},
{
"doc_count" : 1,
- "key" : "User foo_325 logging on"
+ "key" : "User foo_325 logging on",
+ "max_matching_length" : 52
},
{
"doc_count" : 1,
- "key" : "User foo_864 logged off"
+ "key" : "User foo_864 logged off",
+ "max_matching_length" : 52
}
]
}
}
}
--------------------------------------------------
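As noted in the warning above, large result sets are better handled through <<async-search, Async search>>.
A minimal sketch of submitting the same aggregation asynchronously (the timeout value is an arbitrary
assumption) could look like this:

[source,console]
--------------------------------------------------
POST log-messages/_async_search?wait_for_completion_timeout=2s
{
  "aggs": {
    "categories": {
      "categorize_text": {
        "field": "message"
      }
    }
  }
}
--------------------------------------------------

The response contains an `id` that can be polled with the get async search API until the aggregation
completes.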
-
Here is an example using `categorization_filters`

[source,console]
@@ -202,19 +197,23 @@ category results
"buckets" : [
{
"doc_count" : 3,
- "key" : "Node shutting down"
+ "key" : "Node shutting down",
+ "max_matching_length" : 49
},
{
"doc_count" : 1,
- "key" : "Node starting up"
+ "key" : "Node starting up",
+ "max_matching_length" : 47
},
{
"doc_count" : 1,
- "key" : "User logged off"
+ "key" : "User logged off",
+ "max_matching_length" : 52
},
{
"doc_count" : 1,
- "key" : "User logging on"
+ "key" : "User logging on",
+ "max_matching_length" : 52
}
]
}
@@ -223,11 +222,15 @@ category results
--------------------------------------------------
Here is an example using `categorization_filters`.
- The default analyzer is a whitespace analyzer with a custom token filter
- which filters out tokens that start with any number.
+ The default analyzer uses the `ml_standard` tokenizer, which is similar to a whitespace tokenizer
+ but filters out tokens that could be interpreted as hexadecimal numbers. The default analyzer
+ also uses the `first_line_with_letters` character filter, so that only the first meaningful line
+ of multi-line messages is considered.

But, it may be that a token is a known highly-variable token (formatted usernames, emails, etc.). In that case, it is good to supply
- custom `categorization_filters` to filter out those tokens for better categories. These filters will also reduce memory usage as fewer
- tokens are held in memory for the categories.
+ custom `categorization_filters` to filter out those tokens for better categories. These filters may also reduce memory usage as fewer
+ tokens are held in memory for the categories. (If there are sufficient examples of different usernames, emails, etc., then
+ categories will form that naturally discard them as variables, but for small input data where only one example exists, this won't
+ happen.)

[source,console]
--------------------------------------------------
@@ -238,8 +241,7 @@ POST log-messages/_search?filter_path=aggregations
"categorize_text": {
"field": "message",
"categorization_filters": ["\\w+\\_\\d{3}"], <1>
- "max_matched_tokens": 2, <2>
- "similarity_threshold": 30 <3>
+ "similarity_threshold": 11 <2>
}
}
}
@@ -248,12 +250,12 @@ POST log-messages/_search?filter_path=aggregations
// TEST[setup:categorize_text]
<1> The filters to apply to the analyzed tokens. It filters
out tokens like `bar_123`.
- <2> Require at least 2 tokens before the log categories attempt to merge together
- <3> Require 30% of the tokens to match before expanding a log categories
- to add a new log entry
+ <2> Require 11% of token weight to match before adding a message to an
+ existing category rather than creating a new one.

- The resulting categories are now broad, matching the first token
- and merging the log groups.
+ The resulting categories are now very broad, merging the log groups.
+ (A `similarity_threshold` of 11% is generally too low. Settings over
+ 50% are usually better.)
[source,console-result]
--------------------------------------------------
@@ -263,11 +265,13 @@ and merging the log groups.
"buckets" : [
{
"doc_count" : 4,
- "key" : "Node *"
+ "key" : "Node",
+ "max_matching_length" : 49
},
{
"doc_count" : 2,
- "key" : "User *"
+ "key" : "User",
+ "max_matching_length" : 52
}
]
}
@@ -326,6 +330,7 @@ POST log-messages/_search?filter_path=aggregations
{
"doc_count" : 2,
"key" : "Node shutting down",
+ "max_matching_length" : 49,
"hit" : {
"hits" : {
"total" : {
@@ -352,6 +357,7 @@ POST log-messages/_search?filter_path=aggregations
{
"doc_count" : 1,
"key" : "Node starting up",
+ "max_matching_length" : 47,
"hit" : {
"hits" : {
"total" : {
@@ -387,6 +393,7 @@ POST log-messages/_search?filter_path=aggregations
{
"doc_count" : 1,
"key" : "Node shutting down",
+ "max_matching_length" : 49,
"hit" : {
"hits" : {
"total" : {
@@ -413,6 +420,7 @@ POST log-messages/_search?filter_path=aggregations
{
"doc_count" : 1,
"key" : "User logged off",
+ "max_matching_length" : 52,
"hit" : {
"hits" : {
"total" : {
@@ -439,6 +447,7 @@ POST log-messages/_search?filter_path=aggregations
{
"doc_count" : 1,
"key" : "User logging on",
+ "max_matching_length" : 52,
"hit" : {
"hits" : {
"total" : {