
Commit 93bc2e3

[ML] Replace the implementation of the categorize_text aggregation (#85872)
This replaces the implementation of the categorize_text aggregation with the new algorithm that was added in #80867. The new algorithm works in the same way as the ML C++ code used for categorization jobs (and now includes the fixes of elastic/ml-cpp#2277). The docs are updated to reflect the workings of the new implementation.
1 parent 79990fa commit 93bc2e3

File tree

45 files changed: +667 −3461 lines changed


docs/changelog/85872.yaml

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
+pr: 85872
+summary: Replace the implementation of the `categorize_text` aggregation
+area: Machine Learning
+type: enhancement
+issues: []

docs/reference/aggregations/bucket/categorize-text-aggregation.asciidoc

Lines changed: 85 additions & 76 deletions
@@ -17,56 +17,14 @@ NOTE: If you have considerable memory allocated to your JVM but are receiving ci
 <<search-aggregations-bucket-diversified-sampler-aggregation,diversified sampler>>, or
 <<search-aggregations-random-sampler-aggregation,random sampler>> to explore the created categories.

+NOTE: The algorithm used for categorization was completely changed in version 8.3.0. As a result this aggregation
+will not work in a mixed version cluster where some nodes are on version 8.3.0 or higher and others are
+on a version older than 8.3.0. Upgrade all nodes in your cluster to the same version if you experience
+an error related to this change.
+
 [[bucket-categorize-text-agg-syntax]]
 ==== Parameters

-`field`::
-(Required, string)
-The semi-structured text field to categorize.
-
-`max_unique_tokens`::
-(Optional, integer, default: `50`)
-The maximum number of unique tokens at any position up to `max_matched_tokens`.
-Must be larger than 1. Smaller values use less memory and create fewer categories.
-Larger values will use more memory and create narrower categories.
-Max allowed value is `100`.
-
-`max_matched_tokens`::
-(Optional, integer, default: `5`)
-The maximum number of token positions to match on before attempting to merge categories.
-Larger values will use more memory and create narrower categories.
-Max allowed value is `100`.
-
-Example:
-`max_matched_tokens` of 2 would disallow merging of the categories
-[`foo` `bar` `baz`]
-[`foo` `baz` `bozo`]
-As the first 2 tokens are required to match for the category.
-
-NOTE: Once `max_unique_tokens` is reached at a given position, a new `*` token is
-added and all new tokens at that position are matched by the `*` token.
-
-`similarity_threshold`::
-(Optional, integer, default: `50`)
-The minimum percentage of tokens that must match for text to be added to the
-category bucket.
-Must be between 1 and 100. The larger the value the narrower the categories.
-Larger values will increase memory usage and create narrower categories.
-
-`categorization_filters`::
-(Optional, array of strings)
-This property expects an array of regular expressions. The expressions
-are used to filter out matching sequences from the categorization field values.
-You can use this functionality to fine tune the categorization by excluding
-sequences from consideration when categories are defined. For example, you can
-exclude SQL statements that appear in your log files. This
-property cannot be used at the same time as `categorization_analyzer`. If you
-only want to define simple regular expression filters that are applied prior to
-tokenization, setting this property is the easiest method. If you also want to
-customize the tokenizer or post-tokenization filtering, use the
-`categorization_analyzer` property instead and include the filters as
-`pattern_replace` character filters.
-
 `categorization_analyzer`::
 (Optional, object or string)
 The categorization analyzer specifies how the text is analyzed and tokenized before
@@ -95,14 +53,33 @@ include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=tokenizer]
 include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=filter]
 =====

-`shard_size`::
+`categorization_filters`::
+(Optional, array of strings)
+This property expects an array of regular expressions. The expressions
+are used to filter out matching sequences from the categorization field values.
+You can use this functionality to fine tune the categorization by excluding
+sequences from consideration when categories are defined. For example, you can
+exclude SQL statements that appear in your log files. This
+property cannot be used at the same time as `categorization_analyzer`. If you
+only want to define simple regular expression filters that are applied prior to
+tokenization, setting this property is the easiest method. If you also want to
+customize the tokenizer or post-tokenization filtering, use the
+`categorization_analyzer` property instead and include the filters as
+`pattern_replace` character filters.
+
+`field`::
+(Required, string)
+The semi-structured text field to categorize.
+
+`max_matched_tokens`::
 (Optional, integer)
-The number of categorization buckets to return from each shard before merging
-all the results.
+This parameter does nothing now, but is permitted for compatibility with the original
+pre-8.3.0 implementation.

-`size`::
-(Optional, integer, default: `10`)
-The number of buckets to return.
+`max_unique_tokens`::
+(Optional, integer)
+This parameter does nothing now, but is permitted for compatibility with the original
+pre-8.3.0 implementation.

 `min_doc_count`::
 (Optional, integer)
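The `categorization_filters` property described in the hunk above strips regex matches from the field value before tokenization, the way a `pattern_replace` character filter would. A minimal sketch of that pre-tokenization step (illustrative only — `apply_categorization_filters` is a hypothetical helper, not an Elasticsearch API), using the same `\w+_\d{3}` pattern that appears later in these docs:

```python
import re

# Illustrative sketch: categorization_filters remove matching sequences
# from the field value BEFORE tokenization. The pattern below mirrors the
# docs' "\\w+\\_\\d{3}" example, which drops tokens like foo_325 or bar_123.
def apply_categorization_filters(text, filters):
    for pattern in filters:
        text = re.sub(pattern, "", text)
    return text

# The variable usernames are removed, so both messages produce the same
# token sequence and would fall into the same category.
a = apply_categorization_filters("User foo_325 logging on", [r"\w+_\d{3}"])
b = apply_categorization_filters("User bar_864 logging on", [r"\w+_\d{3}"])
# a.split() == b.split() == ["User", "logging", "on"]
```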
@@ -113,8 +90,23 @@ The minimum number of documents for a bucket to be returned to the results.
 The minimum number of documents for a bucket to be returned from the shard before
 merging.

-==== Basic use
+`shard_size`::
+(Optional, integer)
+The number of categorization buckets to return from each shard before merging
+all the results.
+
+`similarity_threshold`::
+(Optional, integer, default: `70`)
+The minimum percentage of token weight that must match for text to be added to the
+category bucket.
+Must be between 1 and 100. The larger the value the narrower the categories.
+Larger values will increase memory usage and create narrower categories.

+`size`::
+(Optional, integer, default: `10`)
+The number of buckets to return.
+
+==== Basic use

 WARNING: Re-analyzing _large_ result sets will require a lot of time and memory. This aggregation should be
 used in conjunction with <<async-search, Async search>>. Additionally, you may consider
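The `similarity_threshold` parameter documented above controls how much token weight must overlap before a message joins an existing category. A toy sketch of the idea (a deliberate simplification: every token gets equal weight here, whereas the real categorizer weights tokens; `categorize` and `similarity` are hypothetical names, not Elasticsearch APIs):

```python
def similarity(category_tokens, message_tokens):
    """Percentage of (equally weighted) tokens shared by category and message."""
    common = len(set(category_tokens) & set(message_tokens))
    return 100 * common / max(len(set(category_tokens)), len(set(message_tokens)))

def categorize(messages, similarity_threshold=70):
    """Greedy grouping: a message joins the first category scoring at or above the threshold."""
    categories = []
    for msg in messages:
        tokens = msg.split()
        for cat in categories:
            if similarity(cat["tokens"], tokens) >= similarity_threshold:
                cat["count"] += 1
                # The category key keeps only the tokens common to all members.
                cat["tokens"] = [t for t in cat["tokens"] if t in tokens]
                break
        else:
            categories.append({"tokens": tokens, "count": 1})
    return categories

buckets = categorize([
    "Node node-1 shutting down",
    "Node node-2 shutting down",
    "User foo logged on",
])
# The two "shutting down" messages share 3 of 4 tokens (75% >= 70), so
# they merge into one category keyed "Node shutting down"; the "User"
# message shares nothing with it and starts a category of its own.
```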
@@ -149,27 +141,30 @@ Response:
 "buckets" : [
 {
 "doc_count" : 3,
-"key" : "Node shutting down"
+"key" : "Node shutting down",
+"max_matching_length" : 49
 },
 {
 "doc_count" : 1,
-"key" : "Node starting up"
+"key" : "Node starting up",
+"max_matching_length" : 47
 },
 {
 "doc_count" : 1,
-"key" : "User foo_325 logging on"
+"key" : "User foo_325 logging on",
+"max_matching_length" : 52
 },
 {
 "doc_count" : 1,
-"key" : "User foo_864 logged off"
+"key" : "User foo_864 logged off",
+"max_matching_length" : 52
 }
 ]
 }
 }
 }
 --------------------------------------------------

-
 Here is an example using `categorization_filters`

 [source,console]
@@ -202,19 +197,23 @@ category results
 "buckets" : [
 {
 "doc_count" : 3,
-"key" : "Node shutting down"
+"key" : "Node shutting down",
+"max_matching_length" : 49
 },
 {
 "doc_count" : 1,
-"key" : "Node starting up"
+"key" : "Node starting up",
+"max_matching_length" : 47
 },
 {
 "doc_count" : 1,
-"key" : "User logged off"
+"key" : "User logged off",
+"max_matching_length" : 52
 },
 {
 "doc_count" : 1,
-"key" : "User logging on"
+"key" : "User logging on",
+"max_matching_length" : 52
 }
 ]
 }
@@ -223,11 +222,15 @@ category results
 --------------------------------------------------

 Here is an example using `categorization_filters`.
-The default analyzer is a whitespace analyzer with a custom token filter
-which filters out tokens that start with any number.
+The default analyzer uses the `ml_standard` tokenizer which is similar to a whitespace tokenizer
+but filters out tokens that could be interpreted as hexadecimal numbers. The default analyzer
+also uses the `first_line_with_letters` character filter, so that only the first meaningful line
+of multi-line messages is considered.
 But, it may be that a token is a known highly-variable token (formatted usernames, emails, etc.). In that case, it is good to supply
-custom `categorization_filters` to filter out those tokens for better categories. These filters will also reduce memory usage as fewer
-tokens are held in memory for the categories.
+custom `categorization_filters` to filter out those tokens for better categories. These filters may also reduce memory usage as fewer
+tokens are held in memory for the categories. (If there are sufficient examples of different usernames, emails, etc., then
+categories will form that naturally discard them as variables, but for small input data where only one example exists this won't
+happen.)

 [source,console]
 --------------------------------------------------
@@ -238,8 +241,7 @@ POST log-messages/_search?filter_path=aggregations
 "categorize_text": {
 "field": "message",
 "categorization_filters": ["\\w+\\_\\d{3}"], <1>
-"max_matched_tokens": 2, <2>
-"similarity_threshold": 30 <3>
+"similarity_threshold": 11 <2>
 }
 }
 }
@@ -248,12 +250,12 @@ POST log-messages/_search?filter_path=aggregations
 // TEST[setup:categorize_text]
 <1> The filters to apply to the analyzed tokens. It filters
 out tokens like `bar_123`.
-<2> Require at least 2 tokens before the log categories attempt to merge together
-<3> Require 30% of the tokens to match before expanding a log categories
-to add a new log entry
+<2> Require 11% of token weight to match before adding a message to an
+existing category rather than creating a new one.

-The resulting categories are now broad, matching the first token
-and merging the log groups.
+The resulting categories are now very broad, merging the log groups.
+(A `similarity_threshold` of 11% is generally too low. Settings over
+50% are usually better.)

 [source,console-result]
 --------------------------------------------------
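Why 11% merges these groups while the default 70% keeps them apart can be seen with a quick token-overlap calculation (again a simplification that weights every token equally, unlike the real implementation; `pct_common_tokens` is a hypothetical helper):

```python
# Toy overlap score between two of the sample log messages.
def pct_common_tokens(a, b):
    ta, tb = set(a.split()), set(b.split())
    return 100 * len(ta & tb) / max(len(ta), len(tb))

score = pct_common_tokens("Node starting up", "Node shutting down")
# Only "Node" is shared: 1 of 3 tokens, roughly 33%. That clears a
# threshold of 11 (the messages merge into the broad "Node" category)
# but not the default 70 (they would stay in separate categories).
```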
@@ -263,11 +265,13 @@ and merging the log groups.
 "buckets" : [
 {
 "doc_count" : 4,
-"key" : "Node *"
+"key" : "Node",
+"max_matching_length" : 49
 },
 {
 "doc_count" : 2,
-"key" : "User *"
+"key" : "User",
+"max_matching_length" : 52
 }
 ]
 }
@@ -326,6 +330,7 @@ POST log-messages/_search?filter_path=aggregations
 {
 "doc_count" : 2,
 "key" : "Node shutting down",
+"max_matching_length" : 49,
 "hit" : {
 "hits" : {
 "total" : {
@@ -352,6 +357,7 @@ POST log-messages/_search?filter_path=aggregations
 {
 "doc_count" : 1,
 "key" : "Node starting up",
+"max_matching_length" : 47,
 "hit" : {
 "hits" : {
 "total" : {
@@ -387,6 +393,7 @@ POST log-messages/_search?filter_path=aggregations
 {
 "doc_count" : 1,
 "key" : "Node shutting down",
+"max_matching_length" : 49,
 "hit" : {
 "hits" : {
 "total" : {
@@ -413,6 +420,7 @@ POST log-messages/_search?filter_path=aggregations
 {
 "doc_count" : 1,
 "key" : "User logged off",
+"max_matching_length" : 52,
 "hit" : {
 "hits" : {
 "total" : {
@@ -439,6 +447,7 @@ POST log-messages/_search?filter_path=aggregations
 {
 "doc_count" : 1,
 "key" : "User logging on",
+"max_matching_length" : 52,
 "hit" : {
 "hits" : {
 "total" : {

x-pack/plugin/build.gradle

Lines changed: 16 additions & 4 deletions
@@ -118,16 +118,28 @@ tasks.named("yamlRestTestV7CompatTransform").configure { task ->
 "ml/datafeeds_crud/Test update datafeed to point to job already attached to another datafeed",
 "behaviour change #44752 - not allowing to update datafeed job_id"
 )
+task.skipTest(
+"ml/trained_model_cat_apis/Test cat trained models",
+"A type field was added to cat.ml_trained_models #73660, this is a backwards compatible change. Still this is a cat api, and we don't support them with rest api compatibility. (the test would be very hard to transform too)"
+)
+task.skipTest(
+"ml/categorization_agg/Test categorization agg simple",
+"categorize_text was changed in 8.3, but experimental prior to the change"
+)
+task.skipTest(
+"ml/categorization_agg/Test categorization aggregation against unsupported field",
+"categorize_text was changed in 8.3, but experimental prior to the change"
+)
+task.skipTest(
+"ml/categorization_agg/Test categorization aggregation with poor settings",
+"categorize_text was changed in 8.3, but experimental prior to the change"
+)
 task.skipTest("rollup/delete_job/Test basic delete_job", "rollup was an experimental feature, also see #41227")
 task.skipTest("rollup/delete_job/Test delete job twice", "rollup was an experimental feature, also see #41227")
 task.skipTest("rollup/delete_job/Test delete running job", "rollup was an experimental feature, also see #41227")
 task.skipTest("rollup/get_jobs/Test basic get_jobs", "rollup was an experimental feature, also see #41227")
 task.skipTest("rollup/put_job/Test basic put_job", "rollup was an experimental feature, also see #41227")
 task.skipTest("rollup/start_job/Test start job twice", "rollup was an experimental feature, also see #41227")
-task.skipTest(
-"ml/trained_model_cat_apis/Test cat trained models",
-"A type field was added to cat.ml_trained_models #73660, this is a backwards compatible change. Still this is a cat api, and we don't support them with rest api compatibility. (the test would be very hard to transform too)"
-)
 task.skipTest("indices.freeze/30_usage/Usage stats on frozen indices", "#70192 -- the freeze index API is removed from 8.0")
 task.skipTest("indices.freeze/20_stats/Translog stats on frozen indices", "#70192 -- the freeze index API is removed from 8.0")
 task.skipTest("indices.freeze/10_basic/Basic", "#70192 -- the freeze index API is removed from 8.0")
Lines changed: 4 additions & 4 deletions
@@ -27,7 +27,7 @@
 import static org.hamcrest.Matchers.not;
 import static org.hamcrest.Matchers.notANumber;

-public class CategorizationAggregationIT extends BaseMlIntegTestCase {
+public class CategorizeTextAggregationIT extends BaseMlIntegTestCase {

 private static final String DATA_INDEX = "categorization-agg-data";

@@ -77,17 +77,17 @@ public void testAggregationWithBroadCategories() {
 .setSize(0)
 .setTrackTotalHits(false)
 .addAggregation(
+// Overriding the similarity threshold to just 11% (default is 70%) results in the
+// "Node started" and "Node stopped" messages being grouped in the same category
 new CategorizeTextAggregationBuilder("categorize", "msg").setSimilarityThreshold(11)
-.setMaxUniqueTokens(2)
-.setMaxMatchedTokens(1)
 .subAggregation(AggregationBuilders.max("max").field("time"))
 .subAggregation(AggregationBuilders.min("min").field("time"))
 )
 .get();
 InternalCategorizationAggregation agg = response.getAggregations().get("categorize");
 assertThat(agg.getBuckets(), hasSize(2));

-assertCategorizationBucket(agg.getBuckets().get(0), "Node *", 4);
+assertCategorizationBucket(agg.getBuckets().get(0), "Node", 4);
 assertCategorizationBucket(agg.getBuckets().get(1), "Failed to shutdown error org.aaaa.bbbb.Cccc line caused by foo exception", 2);
 }

x-pack/plugin/ml/src/internalClusterTest/java/org/elasticsearch/xpack/ml/integration/CategorizeTextDistributedIT.java

Lines changed: 2 additions & 2 deletions
@@ -16,8 +16,8 @@
 import org.elasticsearch.cluster.metadata.IndexMetadata;
 import org.elasticsearch.cluster.routing.ShardRouting;
 import org.elasticsearch.common.settings.Settings;
-import org.elasticsearch.xpack.ml.aggs.categorization2.CategorizeTextAggregationBuilder;
-import org.elasticsearch.xpack.ml.aggs.categorization2.InternalCategorizationAggregation;
+import org.elasticsearch.xpack.ml.aggs.categorization.CategorizeTextAggregationBuilder;
+import org.elasticsearch.xpack.ml.aggs.categorization.InternalCategorizationAggregation;
 import org.elasticsearch.xpack.ml.support.BaseMlIntegTestCase;

 import java.util.Arrays;

x-pack/plugin/ml/src/main/java/org/elasticsearch/xpack/ml/MachineLearning.java

Lines changed: 1 addition & 10 deletions
@@ -1431,16 +1431,7 @@ public List<AggregationSpec> getAggregations() {
 CategorizeTextAggregationBuilder::new,
 CategorizeTextAggregationBuilder.PARSER
 ).addResultReader(InternalCategorizationAggregation::new)
-.setAggregatorRegistrar(s -> s.registerUsage(CategorizeTextAggregationBuilder.NAME)),
-// TODO: in the long term only keep one or other of these categorization aggregations
-new AggregationSpec(
-org.elasticsearch.xpack.ml.aggs.categorization2.CategorizeTextAggregationBuilder.NAME,
-org.elasticsearch.xpack.ml.aggs.categorization2.CategorizeTextAggregationBuilder::new,
-org.elasticsearch.xpack.ml.aggs.categorization2.CategorizeTextAggregationBuilder.PARSER
-).addResultReader(org.elasticsearch.xpack.ml.aggs.categorization2.InternalCategorizationAggregation::new)
-.setAggregatorRegistrar(
-s -> s.registerUsage(org.elasticsearch.xpack.ml.aggs.categorization2.CategorizeTextAggregationBuilder.NAME)
-)
+.setAggregatorRegistrar(s -> s.registerUsage(CategorizeTextAggregationBuilder.NAME))
 );
 }
