diff --git a/docs/en/stack/ml/dataframes.asciidoc b/docs/en/stack/ml/dataframes.asciidoc
new file mode 100644
index 000000000..837f014fe
--- /dev/null
+++ b/docs/en/stack/ml/dataframes.asciidoc
@@ -0,0 +1,58 @@
+[[ml-dataframes]]
+=== {dataframes-cap}
+
+beta[]
+
+A _{dataframe}_ is a transformation of a dataset by certain rules. You can think
+of it like a spreadsheet or a data table that makes your data ready to be analyzed
+and organized.
+
+A lot of {es} datasets are organized as a stream of events: each event is an individual
+document, for example a single item purchase. {dataframe-transforms-cap} enable
+you to summarize this data, bringing it into an organized, more analysis-friendly
+format. For example, you can summarize all the purchases of a single customer (see
+the example below).
+
+The {dataframe} feature enables you to define a _pivot_, which is a set of features
+that transform the dataset into a different, more digestible format. Pivoting
+results in a summary of your dataset (which is the {dataframe} itself).
+
+Defining a pivot consists of two main parts. First, you select one or more fields
+that your dataset will be grouped by. Principally, you can select categorical
+fields (terms) for grouping. You can also select numerical fields; in this case,
+the field values are bucketed using an interval you specify.
+
+The second step is deciding how you want to aggregate the grouped data. When
+using aggregations, you are essentially asking questions about the dataset. There are
+different types of aggregations, each with its own purpose and output. To learn
+more about the supported aggregations and group-by fields, see
+{ref}/data-frame-transform-pivot.html[Pivot resources].
+
+As an optional step, it's also possible to add a query to further limit the
+scope of the aggregation.
+
+IMPORTANT: In 7.2, you can build {dataframes} on top of a static dataset.
+When new data comes into the index, you have to perform the transformation again
+on the altered data.
+
+.Example
+
+Imagine that you run a webshop that sells clothes. Every order creates a
+document that contains a unique order ID, the name and the category of the
+ordered product, its price, the ordered quantity, the exact date of the order,
+and some customer information (name, gender, location, etc.). Your dataset
+contains all the transactions from last year.
+
+If you want to check the sales in the different categories in your last fiscal year,
+define a {dataframe} that is grouped by the product categories (women's shoes, men's
+clothing, etc.) and the order date with the interval of the last year, then set
+a sum aggregation on the ordered quantity. The result is a {dataframe} pivot that
+shows the number of items sold in every product category in the last year.
+
+[role="screenshot"]
+image::ml/images/ml-dataframepivot.jpg["Example of a data frame pivot in {kib}"]
+
+IMPORTANT: Creating a {dataframe} leaves your source index intact. A new index
+dedicated to the {dataframe} is created.
+
+TIP: Using {dataframes} does not require {dfeeds}.
\ No newline at end of file
diff --git a/docs/en/stack/ml/images/ml-dataframepivot.jpg b/docs/en/stack/ml/images/ml-dataframepivot.jpg
new file mode 100644
index 000000000..c0c7946cf
Binary files /dev/null and b/docs/en/stack/ml/images/ml-dataframepivot.jpg differ
diff --git a/docs/en/stack/ml/overview.asciidoc b/docs/en/stack/ml/overview.asciidoc
index e52377d9d..92f5ba526 100644
--- a/docs/en/stack/ml/overview.asciidoc
+++ b/docs/en/stack/ml/overview.asciidoc
@@ -11,3 +11,4 @@ include::buckets.asciidoc[]
 include::calendars.asciidoc[]
 include::rules.asciidoc[]
 include::architecture.asciidoc[]
+include::dataframes.asciidoc[]
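
The pivot that the new dataframes.asciidoc page describes (group the documents by a field, then aggregate each group) can be sketched in a few lines of plain Python. This is only a conceptual illustration, not the {es} implementation or its API; the `orders` records, their field names, and the `pivot_sum` helper are all hypothetical.

```python
from collections import defaultdict

# Hypothetical order documents, one per purchase event.
# The field names are assumptions made for this sketch.
orders = [
    {"order_id": 1, "category": "women's shoes", "quantity": 2},
    {"order_id": 2, "category": "men's clothing", "quantity": 1},
    {"order_id": 3, "category": "women's shoes", "quantity": 3},
]


def pivot_sum(docs, group_by, value_field):
    """Group docs by one field and sum another field within each group."""
    totals = defaultdict(int)
    for doc in docs:
        totals[doc[group_by]] += doc[value_field]
    return dict(totals)


# Group by product category, then apply a sum aggregation on the quantity.
summary = pivot_sum(orders, group_by="category", value_field="quantity")
print(summary)
```

Each key of `summary` corresponds to one row of the summarized dataset, and the `orders` list is left unchanged, which mirrors how creating a {dataframe} leaves the source index intact.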