[DOCS] Adds data frame analytics at scale page to the ML DFA docs #1394

[role="xpack"]
[[ml-dfa-scale]]
= Working with {dfanalytics} at scale

A {dfanalytics-job} has numerous configuration options. Some of them may have a
significant effect on the time taken to train a model. The training time depends
on various factors, such as the statistical characteristics of your data, the
number of provided hyperparameters, the number of features included in the
analysis, the hardware you use, and so on. This guide contains a list of
considerations to help you plan for training {dfanalytics} models at scale and
for optimizing training time.

In this guide, you’ll learn how to:

* Understand the impact of configuration options on the time taken to train
models for {dfanalytics-jobs}.

Prerequisites:
This guide assumes you’re already familiar with:

* How to create {dfanalytics-jobs}. If not, refer to <<ml-dfa-overview>>.

* How {dfanalytics-jobs} work. If not, refer to <<ml-dfa-phases>>.

It is important to note that there is a correlation between the training time,
the complexity of the model, the size of the data, and the quality of the
analysis results. Improvements in quality, however, are not linear with the
amount of training data; for very large source data, it might take hours to
train a model for very small gains in quality. When you work at scale with
{dfanalytics}, you need to decide what quality of results is acceptable for your
use case. When you have determined your acceptance criteria, you have a better
picture of the factors you can trade off while still achieving your goal.

The following recommendations are not sequential – the numbers just help you
navigate between the list items; you can take action on one or more of them in
any order.

[discrete]
[[rapid-iteration]]
== 0. Start small and iterate rapidly

Training is an iterative process. Experiment with different settings and
configuration options (including but not limited to hyperparameters and
{feat-imp}), then evaluate the results and decide whether they are good enough
or need further experimentation.

Every iteration takes time, so it is useful to start with a small set of data
so that you can iterate rapidly, then build up from there.
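
After each iteration, you can measure how well the model performs. For
{classification} and {regression} jobs, one way to do this is the evaluate
{dfanalytics} API. A minimal sketch for a {classification} job (the index and
field names below are hypothetical examples, not part of this guide):

[source,console]
----
POST _ml/data_frame/_evaluate
{
  "index": "model-flight-delays-dest", <1>
  "evaluation": {
    "classification": {
      "actual_field": "FlightDelay",
      "predicted_field": "ml.FlightDelay_prediction",
      "metrics": { "multiclass_confusion_matrix": {} }
    }
  }
}
----
<1> The destination index of the job, which contains both the ground truth and
the predictions.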

[discrete]
[[small-training-percent]]
== 1. Set a small training percent

(This step only applies to {regression} and {classification} jobs.)

The number of documents used for training a model has an effect on the training
time. A higher training percent means a longer training time.

Consider starting with a small percentage of training data so you can complete
iterations more quickly. Once you are happy with your configuration, increase
the training percent. As a rule of thumb, if you have a data set with more than
100,000 data points, start with a training percent of 5 or 10.
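
For example, the following sketch creates a {classification} job that trains on
only 10 percent of the eligible documents; the job ID, indices, and
`dependent_variable` are hypothetical:

[source,console]
----
PUT _ml/data_frame/analytics/model-flight-delays
{
  "source": { "index": "flight-delays" },
  "dest": { "index": "model-flight-delays-dest" },
  "analysis": {
    "classification": {
      "dependent_variable": "FlightDelay",
      "training_percent": 10 <1>
    }
  }
}
----
<1> Increase this value once you are happy with the rest of the configuration.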

[discrete]
[[disable-feature-importance]]
== 2. Disable {feat-imp} calculation

(This step only applies to {regression} and {classification} jobs.)

<<ml-feature-importance>> indicates which fields had the biggest impact on each
prediction that is generated by the analysis. Depending on the size of the data
set, {feat-imp} can take a long time to compute.

For a shorter runtime, consider disabling {feat-imp} for some or all iterations
if you do not require it.
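
{feat-imp-cap} is controlled by the `num_top_feature_importance_values` option
of the analysis object; a value of `0` (the default) disables the calculation.
A minimal sketch with hypothetical names:

[source,console]
----
PUT _ml/data_frame/analytics/model-flight-delays
{
  "source": { "index": "flight-delays" },
  "dest": { "index": "model-flight-delays-dest" },
  "analysis": {
    "classification": {
      "dependent_variable": "FlightDelay",
      "num_top_feature_importance_values": 0 <1>
    }
  }
}
----
<1> No {feat-imp} values are computed; set a positive number in a later
iteration if you need them.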

[discrete]
[[optimize-included-fields]]
== 3. Optimize the number of included fields

You can speed up runtime by only analyzing relevant fields.

By default, all the fields that are supported by the analysis type are included
in the analysis. In general, analyzing more fields requires more resources and
results in longer training times, including the time taken for automatic
feature selection. To reduce training time, consider limiting the scope of the
analysis to the fields that contribute to the prediction. You can do this
either by excluding irrelevant fields or by explicitly including relevant ones.
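
Both approaches use the `analyzed_fields` object of the job configuration. A
sketch with hypothetical field names:

[source,console]
----
PUT _ml/data_frame/analytics/model-flight-delays
{
  "source": { "index": "flight-delays" },
  "dest": { "index": "model-flight-delays-dest" },
  "analysis": {
    "classification": { "dependent_variable": "FlightDelay" }
  },
  "analyzed_fields": {
    "includes": ["FlightDelay", "DistanceKilometers", "OriginWeather"], <1>
    "excludes": [] <2>
  }
}
----
<1> Only the listed fields are included in the analysis.
<2> Alternatively, leave `includes` empty and list the irrelevant fields here.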

NOTE: {feat-imp-cap} can help you determine the fields that contribute most to
the prediction. However, as calculating {feat-imp} increases training time, this
is a trade-off that can be evaluated during an iterative training process.

[discrete]
[[increase-threads]]
== 4. Increase the maximum number of threads

You can set the maximum number of threads that are used during the analysis. The
default value of `max_num_threads` is 1. Depending on the characteristics of the
data, using more threads may decrease the training time at the cost of increased
CPU usage. Note that trying to use more threads than the number of CPU cores has
no advantage.
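
`max_num_threads` is a top-level option of the job configuration. For example,
assuming a node with at least four available CPU cores (the job and index names
are hypothetical):

[source,console]
----
PUT _ml/data_frame/analytics/model-flight-delays
{
  "source": { "index": "flight-delays" },
  "dest": { "index": "model-flight-delays-dest" },
  "analysis": {
    "classification": { "dependent_variable": "FlightDelay" }
  },
  "max_num_threads": 4 <1>
}
----
<1> Up to 4 threads can be used during the parallelizable parts of the
analysis.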

Hyperparameter optimization and calculating {feat-imp} gain the most benefit
from the increased number of threads. This can be seen in the
`coarse_parameter_search`, `fine_tuning_parameters`, and `writing_results`
phases. The rest of the phases are not affected by the increased number of
threads.

To learn more about the individual phases, refer to <<ml-dfa-phases>>.
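
You can follow how much time each phase takes with the get {dfanalytics-jobs}
statistics API; the `progress` array in the response reports the per-phase
progress. For example, with the hypothetical job ID used above:

[source,console]
----
GET _ml/data_frame/analytics/model-flight-delays/_stats
----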

NOTE: If your {ml} nodes are running concurrent jobs (either {anomaly-detect} or
{dfanalytics}), then you may want to keep the maximum number of threads set to a
low number – for example the default 1 – to prevent jobs from competing for
resources.

[discrete]
[[optimize-source-index]]
== 5. Optimize the size of the source index

Even if the training percent is low, reindexing the source index – which is a
mandatory step in the job creation process – may take a long time. During
reindexing, the documents from the source index or indices are copied to the
destination index, so you have a static copy of the analyzed data.

If your data is large and you do not need to test and train on the whole source
index or indices, then reduce the cost of reindexing by using a subset of your
source data. This can be done by either defining a filter for the source index
in the {dfanalytics-job} configuration, or by manually reindexing a subset of
this data to use as an alternate source index.
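
To filter the source, define a query in the `source` object of the job
configuration. A sketch that restricts the analysis to the last month of data,
assuming a hypothetical `timestamp` field:

[source,console]
----
PUT _ml/data_frame/analytics/model-flight-delays
{
  "source": {
    "index": "flight-delays",
    "query": { <1>
      "range": { "timestamp": { "gte": "now-1M" } }
    }
  },
  "dest": { "index": "model-flight-delays-dest" },
  "analysis": {
    "classification": { "dependent_variable": "FlightDelay" }
  }
}
----
<1> Only the documents that match this query are reindexed into the destination
index and analyzed.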

[discrete]
[[configure-hyperparameters]]
== 6. Configure hyperparameters

(This step only applies to {regression} and {classification} jobs.)

<<hyperparameters,Hyperparameter optimization>> is the most complicated
mathematical process during model training and may take a long time.

By default, optimized hyperparameter values are chosen automatically. You can
reduce the time taken at this step by manually configuring hyperparameters – if
you fully understand the purpose of the hyperparameters and have a sensible
value for any or all of them. This reduces the computing load and therefore
decreases training time.
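
For example, the following sketch sets three of the hyperparameters explicitly,
which removes them from the optimization process; the values shown are
illustrative, not recommendations:

[source,console]
----
PUT _ml/data_frame/analytics/model-flight-delays
{
  "source": { "index": "flight-delays" },
  "dest": { "index": "model-flight-delays-dest" },
  "analysis": {
    "classification": {
      "dependent_variable": "FlightDelay",
      "max_trees": 100, <1>
      "eta": 0.05, <2>
      "feature_bag_fraction": 0.7 <3>
    }
  }
}
----
<1> The maximum number of decision trees in the forest.
<2> The shrinkage applied to the weights, also known as the learning rate.
<3> The fraction of features that is used when selecting a random bag for each
candidate split.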