57 changes: 57 additions & 0 deletions docs/en/stack/ml/dataframes.asciidoc
@@ -0,0 +1,57 @@
[[ml-dataframes]]
=== {dataframes-cap}

The {dataframes} feature is available in 7.2 and later.

A _{dataframe}_ is a transformation of a dataset according to rules that you
define when you create the {dataframe}. You can think of it as a spreadsheet or
a data table that organizes your data and makes it ready to be analyzed.

{es} datasets consist of individual documents, each with its own fields and
values. This architecture makes search fast, but it makes it hard to run
analyses that require the fields of the dataset to be reorganized or
summarized. {ml-cap} analyses need clean, transformed data, and this is where
{dataframes} come into play.

To transform the data into a {dataframe}, you define a _pivot_. During
pivoting, you create a set of features that transform the dataset into a
different, more digestible format for running calculations on your data. The
result of the pivot is a summary of your dataset, which is the {dataframe}
itself.

Defining a pivot consists of two main parts. First, you select one or more
fields to group your dataset by. Typically, you select categorical fields
(terms) for grouping. You can also select numerical fields; in that case, the
field values are bucketed using an interval that you specify. The calculation
then runs against every bucket that is created this way.
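
For instance, assuming the JSON layout of the create {dataframe} transform API
in this release, the `group_by` part of a pivot might look like the following
sketch. The group names and source fields (`category`, `order_date`) are purely
illustrative.

[source,js]
----
// Illustrative fragment only: the field names are hypothetical and the exact
// request layout may differ from the API documentation.
"group_by": {
  "category": {
    "terms": { "field": "category.keyword" }
  },
  "order_date": {
    "date_histogram": { "field": "order_date", "interval": "1y" }
  }
}
----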

The second step is selecting one or more aggregations that perform calculations
over the dataset. In effect, aggregations are how you ask questions about the
dataset. There are different types of aggregations, each with its own purpose
and output. You can learn more about the supported aggregations and group-by
fields here (!add a link!).
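
Continuing the sketch above, an `aggregations` section could define a sum over
a numeric field. The aggregation name (`total_quantity`) and the field
(`quantity`) are again made up for illustration.

[source,js]
----
// Illustrative fragment only: the names are hypothetical.
"aggregations": {
  "total_quantity": {
    "sum": { "field": "quantity" }
  }
}
----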

Optionally, you can also add a query to further limit the scope of the
aggregation.
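
Assuming the create {dataframe} transform API, such a query goes under the
`source` section of the request. The index name and the query below are
invented for illustration.

[source,js]
----
// Sketch only: the index name and query values are hypothetical.
"source": {
  "index": "webshop-orders",
  "query": {
    "term": { "currency": { "value": "EUR" } }
  }
}
----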

IMPORTANT: In 7.2, you can build {dataframes} on top of a static dataset. When
new data comes into the index, you must perform the transformation again on the
altered data. Using {dataframes} does not require {dfeeds}.
{con-dataframes-cap} will be introduced in a later version.

.Example

Suppose you run a webshop that sells clothes. Every order creates a document
that contains a unique order ID, the name and the category of the ordered
product, its price, the ordered quantity, the exact date of the order, and some
customer information (name, gender, location, and so on). Your dataset contains
the documents of all the transactions from last year.
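
For illustration only, such an order document might look like the sketch below.
None of these field names or values are prescribed; they simply stand in for
your own mappings.

[source,js]
----
// Hypothetical example document; the field names are not prescribed.
{
  "order_id": "577183",
  "product_name": "Trail running shoes",
  "category": "Women's Shoes",
  "price": 74.99,
  "quantity": 1,
  "order_date": "2018-11-09T12:04:00Z",
  "customer": {
    "name": "Jane Doe",
    "gender": "female",
    "location": "Berlin"
  }
}
----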

If you want to check the sales in the different categories in your last fiscal
year, define a {dataframe} that groups the data by the product categories
(women's shoes, men's clothing, and so on) and by a date histogram on the order
date with an interval that covers the last year, then add a sum aggregation on
the ordered quantity. The result is a {dataframe} pivot that shows the number
of items sold in each product category in the last year.
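
Putting the pieces together, a request for this example might look roughly like
the following sketch. The endpoint, transform ID, index names, and field names
are assumptions and may differ from the final API and from your own mappings.

[source,js]
----
// Sketch only: endpoint, transform ID, index names, and field names are assumptions.
PUT _data_frame/transforms/webshop_sales_by_category
{
  "source": { "index": "webshop-orders" },
  "dest": { "index": "webshop-sales-summary" },
  "pivot": {
    "group_by": {
      "category": {
        "terms": { "field": "category.keyword" }
      },
      "order_date": {
        "date_histogram": { "field": "order_date", "interval": "1y" }
      }
    },
    "aggregations": {
      "total_quantity": {
        "sum": { "field": "quantity" }
      }
    }
  }
}
----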

IMPORTANT: Creating a {dataframe} leaves your source index intact. A new index,
dedicated to the {dataframe}, is created.
1 change: 1 addition & 0 deletions docs/en/stack/ml/overview.asciidoc
@@ -11,3 +11,4 @@ include::buckets.asciidoc[]
include::calendars.asciidoc[]
include::rules.asciidoc[]
include::architecture.asciidoc[]
include::dataframes.asciidoc[]