
Commit abb005b

[DOCS] Adds data frame analytics at scale page to the ML DFA docs (#1394) (#1395)
1 parent 91457e2 commit abb005b

File tree

docs/en/stack/ml/df-analytics/index.asciidoc
docs/en/stack/ml/df-analytics/ml-dfa-scale.asciidoc

2 files changed (+155 -0 lines)

docs/en/stack/ml/df-analytics/index.asciidoc

Lines changed: 1 addition & 0 deletions

@@ -3,6 +3,7 @@ include::ml-dfanalytics.asciidoc[]
include::ml-dfa-overview.asciidoc[leveloffset=+1]
include::ml-supervised-workflow.asciidoc[leveloffset=+2]
include::ml-dfa-phases.asciidoc[leveloffset=+2]
+include::ml-dfa-scale.asciidoc[leveloffset=+2]

include::ml-dfa-concepts.asciidoc[leveloffset=+1]
include::dfa-outlier-detection.asciidoc[leveloffset=+2]
docs/en/stack/ml/df-analytics/ml-dfa-scale.asciidoc (new file)

Lines changed: 154 additions & 0 deletions

@@ -0,0 +1,154 @@
[role="xpack"]
[[ml-dfa-scale]]
= Working with {dfanalytics} at scale

A {dfanalytics-job} has numerous configuration options. Some of them may have a
significant effect on the time taken to train a model. The training time depends
on various factors, like the statistical characteristics of your data, the
number of provided hyperparameters, the number of features included in the
analysis, the hardware you use, and so on. This guide contains a list of
considerations to help you plan for training {dfanalytics} models at scale and
to optimize training time.

In this guide, you’ll learn how to:

* Understand the impact of configuration options on the time taken to train
models for {dfanalytics-jobs}.

**Prerequisites:**

This guide assumes you’re already familiar with:

* How to create data frame analytics jobs. If not, refer to <<ml-dfa-overview>>.
* How data frame analytics jobs work. If not, refer to <<ml-dfa-phases>>.

It is important to note that there is a correlation between the training time,
the complexity of the model, the size of the data, and the quality of the
analysis results. Improvements in quality, however, are not linear with the
amount of training data; for very large source data, it might take hours to
train a model for very small gains in quality. When you work at scale with
{dfanalytics}, you need to decide what quality of results is acceptable for your
use case. When you have determined your acceptance criteria, you have a better
picture of the factors you can trade off while still achieving your goal.

The following recommendations are not sequential – the numbers just help to
navigate between the list items; you can take action on one or more of them in
any order.

[discrete]
[[rapid-iteration]]
== 0. Start small and iterate rapidly

Training is an iterative process. Experiment with different settings and
configuration options (including but not limited to hyperparameters and feature
importance), then evaluate the results and decide whether they are good enough
or need further experimentation.

Every iteration takes time, so it is useful to start with a small set of data so
you can iterate rapidly and then build up from there.

[discrete]
[[small-training-percent]]
== 1. Set a small training percent

(This step only applies to {regression} and {classification} jobs.)

The number of documents used for training a model has an effect on the training
time. A higher training percent means a longer training time.

Consider starting with a small percentage of training data so you can complete
iterations more quickly. Once you are happy with your configuration, increase
the training percent. As a rule of thumb, if you have a data set with more than
100,000 data points, start with a training percent of 5 or 10.
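
For illustration, this is a minimal sketch of a {classification} job that trains
on 10% of the documents; the index, job, and field names are hypothetical, and
only `training_percent` is the setting this step is about:

[source,console]
----
PUT _ml/data_frame/analytics/my-classification-job
{
  "source": { "index": "my-source-index" },
  "dest": { "index": "my-dest-index" },
  "analysis": {
    "classification": {
      "dependent_variable": "my-label-field",
      "training_percent": 10
    }
  }
}
----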

[discrete]
[[disable-feature-importance]]
== 2. Disable {feat-imp} calculation

(This step only applies to {regression} and {classification} jobs.)

<<ml-feature-importance>> indicates which fields had the biggest impact on each
prediction that is generated by the analysis. Depending on the size of the data
set, {feat-imp} can take a long time to compute.

For a shorter runtime, consider disabling {feat-imp} for some or all iterations
if you do not require it.
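
As a sketch (hypothetical index, job, and field names again), {feat-imp} stays
disabled as long as `num_top_feature_importance_values` is `0`, which is also
its default value:

[source,console]
----
PUT _ml/data_frame/analytics/my-classification-job
{
  "source": { "index": "my-source-index" },
  "dest": { "index": "my-dest-index" },
  "analysis": {
    "classification": {
      "dependent_variable": "my-label-field",
      "num_top_feature_importance_values": 0
    }
  }
}
----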

[discrete]
[[optimize-included-fields]]
== 3. Optimize the number of included fields

You can speed up runtime by only analyzing relevant fields.

By default, all the fields that are supported by the analysis type are included
in the analysis. In general, analyzing more fields requires more resources and
leads to longer training times, including the time taken for automatic feature
selection. To reduce training time, consider limiting the scope of the analysis
to the relevant fields that contribute to the prediction. You can do this either
by excluding irrelevant fields or by explicitly including relevant ones.

NOTE: {feat-imp-cap} can help you determine the fields that contribute most to
the prediction. However, as calculating {feat-imp} increases training time, this
is a trade-off that can be evaluated during an iterative training process.
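
Field selection lives in the `analyzed_fields` object of the job configuration.
A minimal sketch with hypothetical index, job, and field names:

[source,console]
----
PUT _ml/data_frame/analytics/my-classification-job
{
  "source": { "index": "my-source-index" },
  "dest": { "index": "my-dest-index" },
  "analysis": {
    "classification": { "dependent_variable": "my-label-field" }
  },
  "analyzed_fields": {
    "includes": [ "my-label-field", "relevant-field-1", "relevant-field-2" ],
    "excludes": []
  }
}
----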

[discrete]
[[increase-threads]]
== 4. Increase the maximum number of threads

You can set the maximum number of threads that are used during the analysis. The
default value of `max_num_threads` is 1. Depending on the characteristics of the
data, using more threads may decrease the training time at the cost of increased
CPU usage. Note that trying to use more threads than the number of CPU cores has
no advantage.

Hyperparameter optimization and calculating {feat-imp} benefit the most from the
increased number of threads, which can be seen in the `coarse_parameter_search`,
`fine_tuning_parameters`, and `writing_results` phases. The rest of the phases
are not affected by the increased number of threads.

To learn more about the individual phases, refer to <<ml-dfa-phases>>.

NOTE: If your {ml} nodes are running concurrent {anomaly-detect} or
{dfanalytics-jobs}, then you may want to keep the maximum number of threads set
to a low number – for example the default 1 – to prevent jobs competing for
resources.
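
`max_num_threads` is a top-level setting of the job configuration. A minimal
sketch with hypothetical index, job, and field names, assuming the {ml} node has
spare CPU capacity for four threads:

[source,console]
----
PUT _ml/data_frame/analytics/my-classification-job
{
  "source": { "index": "my-source-index" },
  "dest": { "index": "my-dest-index" },
  "analysis": {
    "classification": { "dependent_variable": "my-label-field" }
  },
  "max_num_threads": 4
}
----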

[discrete]
[[optimize-source-index]]
== 5. Optimize the size of the source index

Even if the training percent is low, reindexing the source index – which is a
mandatory step in the job creation process – may take a long time. During
reindexing, the documents from the source index or indices are copied to the
destination index, so you have a static copy of the analyzed data.

If your data is large and you do not need to test and train on the whole source
index or indices, then reduce the cost of reindexing by using a subset of your
source data. This can be done by either defining a filter for the source index
in the {dfanalytics-job} configuration, or by manually reindexing a subset of
this data to use as an alternate source index.
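
For example, a query in the `source` object of the job configuration restricts
the analysis to a subset of the documents; the index, job, and field names and
the two-week range filter below are hypothetical:

[source,console]
----
PUT _ml/data_frame/analytics/my-classification-job
{
  "source": {
    "index": "my-source-index",
    "query": {
      "range": { "timestamp": { "gte": "now-2w" } }
    }
  },
  "dest": { "index": "my-dest-index" },
  "analysis": {
    "classification": { "dependent_variable": "my-label-field" }
  }
}
----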

[discrete]
[[configure-hyperparameters]]
== 6. Configure hyperparameters

(This step only applies to {regression} and {classification} jobs.)

<<hyperparameters>> is the most complicated mathematical process during model
training and may take a long time.

By default, optimized hyperparameter values are chosen automatically. It is
possible to reduce the time taken at this step by manually configuring
hyperparameters – if you fully understand the purpose of the hyperparameters and
have a sensible value for any or all of them. This reduces the computing load
and therefore decreases training time.
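
As an illustrative sketch only (the values are not recommendations and the
index, job, and field names are hypothetical), hyperparameters such as
`max_trees`, `eta`, and `feature_bag_fraction` can be set explicitly in the
analysis object, which removes them from the automatic optimization:

[source,console]
----
PUT _ml/data_frame/analytics/my-regression-job
{
  "source": { "index": "my-source-index" },
  "dest": { "index": "my-dest-index" },
  "analysis": {
    "regression": {
      "dependent_variable": "my-target-field",
      "max_trees": 100,
      "eta": 0.05,
      "feature_bag_fraction": 0.6
    }
  }
}
----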
