diff --git a/docs/en/stack/ml/anomaly-detection/categorization-data.asciidoc b/docs/en/stack/ml/anomaly-detection/categorization-data.asciidoc
new file mode 100644
index 000000000..8f18c5fa1
--- /dev/null
+++ b/docs/en/stack/ml/anomaly-detection/categorization-data.asciidoc
@@ -0,0 +1,22 @@
+[role="xpack"]
+[[ml-datatypes-categorization]]
+=== Data types and categorization
+
+Categorization is a {ml} process that tokenizes a field, clusters similar data
+together, and classifies it into categories. However, categorization doesn't
+work equally well on all data types. It works best on machine-written messages
+and application output, typically on data that consists of repeated elements,
+for example, log messages used for system troubleshooting. Log categorization
+groups unstructured log messages into categories; you can then use
+{anomaly-detect} to model and identify rare or unusual counts of log message
+categories. For more information about the process, see
+{ml-docs}/ml-configuring-categories.html[Categorizing log messages].
+
+Categorization is tuned to work best on data like log messages by taking token
+order into account, not considering synonyms, and including stop words in its
+analysis. Complete sentences in human communication or literary text (for
+example, emails, wiki pages, prose, or other human-generated content) can be
+extremely diverse in structure. Because categorization is tuned for machine
+data, it gives poor results on human-generated data. For example, a
+categorization job might create too many categories to handle effectively.
+Categorization is _not_ natural language processing (NLP).
diff --git a/docs/en/stack/ml/anomaly-detection/overview.asciidoc b/docs/en/stack/ml/anomaly-detection/overview.asciidoc
index ce86b7a35..25a9d37b5 100644
--- a/docs/en/stack/ml/anomaly-detection/overview.asciidoc
+++ b/docs/en/stack/ml/anomaly-detection/overview.asciidoc
@@ -6,3 +6,4 @@
 include::analyzing.asciidoc[]
 include::forecasting.asciidoc[]
+include::categorization-data.asciidoc[]
\ No newline at end of file