1 change: 1 addition & 0 deletions docs/en/stack/ml/df-analytics/index.asciidoc
@@ -3,6 +3,7 @@ include::ml-dfanalytics.asciidoc[]
include::ml-dfa-overview.asciidoc[leveloffset=+1]
include::ml-supervised-workflow.asciidoc[leveloffset=+2]
include::ml-dfa-phases.asciidoc[leveloffset=+2]
include::ml-dfa-scale.asciidoc[leveloffset=+2]

include::ml-dfa-concepts.asciidoc[leveloffset=+1]
include::dfa-outlier-detection.asciidoc[leveloffset=+2]
153 changes: 153 additions & 0 deletions docs/en/stack/ml/df-analytics/ml-dfa-scale.asciidoc
@@ -0,0 +1,153 @@
[role="xpack"]
[[ml-dfa-scale]]
= Working with {dfanalytics} at scale

A {dfanalytics-job} has numerous configuration options. Some of them may have a
significant effect on the time taken to train a model. The training time depends
on various factors, such as the statistical characteristics of your data, the
number of provided hyperparameters, the number of features included in the
analysis, the hardware you use, and so on. This guide contains a list of
considerations to help you plan for training {dfanalytics} models at scale and
optimize training time.

In this guide, you’ll learn how to:

* Understand the impact of configuration options on the time taken to train
models for {dfanalytics-jobs}.


Prerequisites:
This guide assumes you’re already familiar with:

* How to create {dfanalytics-jobs}. If not, refer to <<ml-dfa-overview>>.

* How {dfanalytics-jobs} work. If not, refer to <<ml-dfa-phases>>.

Training time, model complexity, the size of the data, and the quality of the
analysis results are all correlated. Improvements in quality, however, are not
linear with the amount of training data; for very large source data, it might
take hours to train a model for only small gains in quality. When you work at
scale with {dfanalytics}, decide what quality of results is acceptable for your
use case. Once you have determined your acceptance criteria, you have a clearer
picture of the factors you can trade off while still achieving your goal.


The following recommendations are not sequential; the numbers simply make it
easier to refer to individual items. You can act on one or more of them in any
order.


[discrete]
[[rapid-iteration]]
== 0. Start small and iterate rapidly

Training is an iterative process. Experiment with different settings and
configuration options (including but not limited to hyperparameters and feature
importance), then evaluate the results and decide whether they are good enough
or need further experimentation.

Every iteration takes time, so it is useful to start with a small set of data:
you can iterate rapidly and then build up from there.


[discrete]
[[small-training-percent]]
== 1. Set a small training percent

(This step only applies to {regression} and {classification} jobs.)

The number of documents used to train a model affects the training time: a
higher training percent means a longer training time.

Consider starting with a small percentage of training data so you can complete
iterations more quickly. Once you are happy with your configuration, increase
the training percent. As a rule of thumb, if you have a data set with more than
100,000 data points, start with a training percent of 5 or 10.
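
For example, the following sketch creates a {regression} job that trains on only
10% of the eligible documents. The job, index, and field names
(`my-scale-test-job`, `my-source-index`, `my-dest-index`,
`my_dependent_variable`) are placeholders, not values from your deployment:

[source,console]
----
PUT _ml/data_frame/analytics/my-scale-test-job
{
  "source": { "index": "my-source-index" },
  "dest": { "index": "my-dest-index" },
  "analysis": {
    "regression": {
      "dependent_variable": "my_dependent_variable",
      "training_percent": 10 <1>
    }
  }
}
----
<1> Only 10% of the eligible documents are used for training; the rest are
available for evaluating the model. Increase this value once you are satisfied
with your configuration.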


[discrete]
[[disable-feature-importance]]
== 2. Disable {feat-imp} calculation

(This step only applies to {regression} and {classification} jobs.)

<<ml-feature-importance>> indicates which fields had the biggest impact on each
prediction that is generated by the analysis. Depending on the size of the data
set, {feat-imp} can take a long time to compute.

For a shorter runtime, consider disabling {feat-imp} for some or all iterations
if you do not require it.
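
In {regression} and {classification} jobs, the `num_top_feature_importance_values`
property of the analysis object controls how many {feat-imp} values are returned
for each document; its default value of `0` disables the calculation. A minimal
sketch with placeholder job, index, and field names:

[source,console]
----
PUT _ml/data_frame/analytics/my-fast-iteration-job
{
  "source": { "index": "my-source-index" },
  "dest": { "index": "my-dest-index" },
  "analysis": {
    "classification": {
      "dependent_variable": "my_label_field",
      "num_top_feature_importance_values": 0 <1>
    }
  }
}
----
<1> `0` is the default and skips the {feat-imp} calculation. Set a positive
value only in the iterations where you need {feat-imp}.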


[discrete]
[[optimize-included-fields]]
== 3. Optimize the number of included fields

You can speed up runtime by only analyzing relevant fields.

By default, all the fields that are supported by the analysis type are included
in the analysis. In general, analyzing more fields requires more resources and
results in longer training times, including the time taken for automatic feature
selection. To reduce training time, consider limiting the scope of the analysis
to the fields that are relevant for the prediction. You can do this either by
excluding irrelevant fields or by explicitly including the relevant ones.

NOTE: {feat-imp-cap} can help you determine the fields that contribute most to
the prediction. However, as calculating {feat-imp} increases training time, this
is a trade-off that can be evaluated during an iterative training process.
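
You can control the scope of the analysis with the `analyzed_fields` object of
the job configuration. The following minimal sketch uses placeholder job, index,
and field names:

[source,console]
----
PUT _ml/data_frame/analytics/my-focused-job
{
  "source": { "index": "my-source-index" },
  "dest": { "index": "my-dest-index" },
  "analysis": {
    "regression": { "dependent_variable": "my_dependent_variable" }
  },
  "analyzed_fields": {
    "includes": [], <1>
    "excludes": [ "raw_message", "session_id" ] <2>
  }
}
----
<1> An empty `includes` array means that all supported fields are considered,
except the ones listed in `excludes`.
<2> Placeholder field names; list the fields that you know do not contribute to
the prediction.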


[discrete]
[[increase-threads]]
== 4. Increase the maximum number of threads

You can set the maximum number of threads that are used during the analysis. The
default value of `max_num_threads` is 1. Depending on the characteristics of the
data, using more threads may decrease the training time at the cost of increased
CPU usage. Note that trying to use more threads than the number of CPU cores has
no advantage.

Hyperparameter optimization and calculating {feat-imp} benefit the most from the
increased number of threads; you can observe this in the
`coarse_parameter_search`, `fine_tuning_parameters`, and `writing_results`
phases. The rest of the phases are not affected by the number of threads.

To learn more about the individual phases, refer to <<ml-dfa-phases>>.

NOTE: If your {ml} nodes run concurrent jobs (either {anomaly-detect} or
{dfanalytics}), you may want to keep the maximum number of threads set to a low
number (for example, the default of 1) to prevent jobs from competing for
resources.
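
`max_num_threads` is a top-level property of the job configuration. A minimal
sketch with placeholder job, index, and field names, assuming the {ml} node has
spare CPU capacity:

[source,console]
----
PUT _ml/data_frame/analytics/my-multi-threaded-job
{
  "source": { "index": "my-source-index" },
  "dest": { "index": "my-dest-index" },
  "analysis": {
    "classification": { "dependent_variable": "my_label_field" }
  },
  "max_num_threads": 4 <1>
}
----
<1> The analysis uses up to four threads. Do not set this value higher than the
number of CPU cores of the {ml} node, and keep it low (for example, the default
of `1`) if the node runs concurrent jobs.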


[discrete]
[[optimize-source-index]]
== 5. Optimize the size of the source index

Even if the training percent is low, reindexing the source index – which is a
mandatory step in the job creation process – may take a long time. During
reindexing, the documents from the source index or indices are copied to the
destination index, so you have a static copy of the analyzed data.

If your data is large and you do not need to test and train on the whole source
index or indices, reduce the cost of reindexing by using a subset of your source
data. You can do this either by defining a query filter for the source in the
{dfanalytics-job} configuration or by manually reindexing a subset of the data
to use as an alternate source index.
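
The `source.query` property accepts regular {es} query DSL, so you can restrict
the analyzed data without reindexing it yourself. A minimal sketch with
placeholder job and index names and an illustrative time filter:

[source,console]
----
PUT _ml/data_frame/analytics/my-subset-job
{
  "source": {
    "index": "my-source-index",
    "query": {
      "range": {
        "@timestamp": { "gte": "now-2w" } <1>
      }
    }
  },
  "dest": { "index": "my-dest-index" },
  "analysis": {
    "outlier_detection": {}
  }
}
----
<1> Only documents from the last two weeks are reindexed into the destination
index and analyzed. The time field and the range are examples; adapt the query
to your data.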


[discrete]
[[configure-hyperparameters]]
== 6. Configure hyperparameters

(This step only applies to {regression} and {classification} jobs.)

<<hyperparameters,Hyperparameter optimization>> is the most complicated
mathematical process during model training and can take a long time.

By default, optimized hyperparameter values are chosen automatically. You can
reduce the time spent on this step by configuring hyperparameters manually,
provided you fully understand their purpose and have sensible values for some or
all of them. Every hyperparameter you set manually shrinks the search space of
the optimization process, which reduces the computing load and therefore
decreases training time.
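
Hyperparameters such as `max_trees`, `eta`, and `feature_bag_fraction` are set
inside the `regression` or `classification` object of the analysis. The
following minimal sketch uses placeholder job, index, and field names, and the
values are illustrative rather than recommended:

[source,console]
----
PUT _ml/data_frame/analytics/my-tuned-job
{
  "source": { "index": "my-source-index" },
  "dest": { "index": "my-dest-index" },
  "analysis": {
    "regression": {
      "dependent_variable": "my_dependent_variable",
      "max_trees": 100, <1>
      "eta": 0.05, <2>
      "feature_bag_fraction": 0.7 <3>
    }
  }
}
----
<1> Caps the number of decision trees in the forest, which limits both model
size and training time.
<2> The shrinkage (learning rate) applied to each tree.
<3> The fraction of features that is used when selecting a random bag for each
candidate split.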