
Commit 55c0e88

[DOCS] Synchs and links hyperparameter descriptions (#56137)
1 parent e59b099 commit 55c0e88

3 files changed (+83 -90 lines changed)


docs/reference/ml/df-analytics/apis/get-dfanalytics-stats.asciidoc

Lines changed: 17 additions & 21 deletions
@@ -99,32 +99,31 @@ include::{docdir}/ml/ml-shared.asciidoc[tag=dfas-alpha]
 
 `class_assignment_objective`::::
 (string)
-Defines whether class assignment maximizes the accuracy or the minimum recall
-metric. Possible values are `maximize_accuracy` and `maximize_minimum_recall`.
+include::{docdir}/ml/ml-shared.asciidoc[tag=class-assignment-objective]
 
 `downsample_factor`::::
 (double)
 include::{docdir}/ml/ml-shared.asciidoc[tag=dfas-downsample-factor]
 
 `eta`::::
 (double)
-include::{docdir}/ml/ml-shared.asciidoc[tag=dfas-eta]
+include::{docdir}/ml/ml-shared.asciidoc[tag=eta]
 
 `eta_growth_rate_per_tree`::::
 (double)
 include::{docdir}/ml/ml-shared.asciidoc[tag=dfas-eta-growth]
 
 `feature_bag_fraction`::::
 (double)
-include::{docdir}/ml/ml-shared.asciidoc[tag=dfas-feature-bag-fraction]
+include::{docdir}/ml/ml-shared.asciidoc[tag=feature-bag-fraction]
 
 `gamma`::::
 (double)
-include::{docdir}/ml/ml-shared.asciidoc[tag=dfas-gamma]
+include::{docdir}/ml/ml-shared.asciidoc[tag=gamma]
 
 `lambda`::::
 (double)
-include::{docdir}/ml/ml-shared.asciidoc[tag=dfas-lambda]
+include::{docdir}/ml/ml-shared.asciidoc[tag=lambda]
 
 `max_attempts_to_add_tree`::::
 (integer)
@@ -136,7 +135,7 @@ include::{docdir}/ml/ml-shared.asciidoc[tag=dfas-max-optimization-rounds]
 
 `max_trees`::::
 (integer)
-include::{docdir}/ml/ml-shared.asciidoc[tag=dfas-max-trees]
+include::{docdir}/ml/ml-shared.asciidoc[tag=max-trees]
 
 `num_folds`::::
 (integer)
@@ -221,32 +220,29 @@ heuristics.
 =======
 `compute_feature_influence`::::
 (boolean)
-If true, feature influence calculation is enabled.
+include::{docdir}/ml/ml-shared.asciidoc[tag=compute-feature-influence]
 
 `feature_influence_threshold`::::
 (double)
-The minimum {olscore} that a document needs to have to calculate its feature
-influence score.
+include::{docdir}/ml/ml-shared.asciidoc[tag=feature-influence-threshold]
 
 `method`::::
 (string)
-The method that {oldetection} uses. Possible values are `lof`, `ldof`,
-`distance_kth_nn`, `distance_knn`, and `ensemble`.
+include::{docdir}/ml/ml-shared.asciidoc[tag=method]
 
 `n_neighbors`::::
 (integer)
-The value for how many nearest neighbors each method of {oldetection} uses to
-calculate its outlier score.
+include::{docdir}/ml/ml-shared.asciidoc[tag=n-neighbors]
 
 `outlier_fraction`::::
 (double)
+include::{docdir}/ml/ml-shared.asciidoc[tag=outlier-fraction]
 The proportion of the data set that is assumed to be outlying prior to
 {oldetection}.
 
 `standardization_enabled`::::
 (boolean)
-If true, then the following operation is performed on the columns before
-computing {olscores}: (x_i - mean(x_i)) / sd(x_i).
+include::{docdir}/ml/ml-shared.asciidoc[tag=standardization-enabled]
 =======
 //End parameters
 
@@ -296,23 +292,23 @@ include::{docdir}/ml/ml-shared.asciidoc[tag=dfas-downsample-factor]
 
 `eta`::::
 (double)
-include::{docdir}/ml/ml-shared.asciidoc[tag=dfas-eta]
+include::{docdir}/ml/ml-shared.asciidoc[tag=eta]
 
 `eta_growth_rate_per_tree`::::
 (double)
 include::{docdir}/ml/ml-shared.asciidoc[tag=dfas-eta-growth]
 
 `feature_bag_fraction`::::
 (double)
-include::{docdir}/ml/ml-shared.asciidoc[tag=dfas-feature-bag-fraction]
+include::{docdir}/ml/ml-shared.asciidoc[tag=feature-bag-fraction]
 
 `gamma`::::
 (double)
-include::{docdir}/ml/ml-shared.asciidoc[tag=dfas-gamma]
+include::{docdir}/ml/ml-shared.asciidoc[tag=gamma]
 
 `lambda`::::
 (double)
-include::{docdir}/ml/ml-shared.asciidoc[tag=dfas-lambda]
+include::{docdir}/ml/ml-shared.asciidoc[tag=lambda]
 
 `max_attempts_to_add_tree`::::
 (integer)
@@ -324,7 +320,7 @@ include::{docdir}/ml/ml-shared.asciidoc[tag=dfas-max-optimization-rounds]
 
 `max_trees`::::
 (integer)
-include::{docdir}/ml/ml-shared.asciidoc[tag=dfas-max-trees]
+include::{docdir}/ml/ml-shared.asciidoc[tag=max-trees]
 
 `num_folds`::::
 (integer)
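
Reader's note on `eta_growth_rate_per_tree`, which this file includes from `ml-shared.asciidoc` ("a rate of `1.05` increases `eta` by 5%"): the following is a minimal Python sketch of that statement, not the actual implementation. It assumes the rate compounds multiplicatively with each tree added to the forest, which the shared description does not spell out; the function name `eta_for_tree` is hypothetical.

```python
def eta_for_tree(initial_eta: float, growth_rate: float, tree_index: int) -> float:
    """Return the eta used for the tree at `tree_index` (0-based).

    Assumption: eta is multiplied by `growth_rate` once per added tree,
    so a rate of 1.05 raises eta by 5% from one tree to the next.
    """
    if tree_index < 0:
        raise ValueError("tree_index must be non-negative")
    return initial_eta * growth_rate ** tree_index


if __name__ == "__main__":
    # Show how eta would evolve over the first few trees under this assumption.
    for i in range(4):
        print(i, round(eta_for_tree(0.1, 1.05, i), 6))
```

Under this assumption, a growth rate of `1.0` keeps `eta` constant across the whole forest.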

docs/reference/ml/df-analytics/apis/put-dfanalytics.asciidoc

Lines changed: 9 additions & 42 deletions
@@ -42,24 +42,9 @@ indices and stores the outcome in a destination index.
 If the destination index does not exist, it is created automatically when you
 start the job. See <<start-dfanalytics>>.
 
-[[ml-hyperparam-optimization]]
 If you supply only a subset of the {regression} or {classification} parameters,
-_hyperparameter optimization_ occurs. It determines a value for each of the
-undefined parameters.
-
-////
-The starting point is calculated for data dependent parameters by examining the
-loss on the training data. Subject to the size constraint, this operation
-provides an upper bound on the improvement in validation loss.
-
-The optimization starts with random search, then
-Bayesian optimization is performed that is targeting maximum expected
-improvement. If you override any parameters by explicitely setting it, the
-optimization calculates the value of the remaining parameters accordingly and
-uses the value you provided for the overridden parameter. The number of rounds
-are reduced respectively. The validation error is estimated in each round by
-using 4-fold cross validation.
-////
+{ml-docs}/hyperparameters.html[hyperparameter optimization] occurs.
+It determines a value for each of the undefined parameters.
 
 [[ml-put-dfanalytics-path-params]]
 ==== {api-path-parms-title}
@@ -108,11 +93,7 @@ understand the function of these parameters.
 =====
 `class_assignment_objective`::::
 (Optional, string)
-Defines the objective to optimize when assigning class labels:
-`maximize_accuracy` or `maximize_minimum_recall`. When maximizing accuracy,
-class labels are chosen to maximize the number of correct predictions. When
-maximizing minimum recall, labels are chosen to maximize the minimum recall
-for any class. Defaults to `maximize_minimum_recall`.
+include::{docdir}/ml/ml-shared.asciidoc[tag=class-assignment-objective]
 
 `dependent_variable`::::
 (Required, string)
@@ -179,41 +160,27 @@ The configuration information necessary to perform
 =====
 `compute_feature_influence`::::
 (Optional, boolean)
-If `true`, the feature influence calculation is enabled. Defaults to `true`.
+include::{docdir}/ml/ml-shared.asciidoc[tag=compute-feature-influence]
 
 `feature_influence_threshold`::::
 (Optional, double)
-The minimum {olscore} that a document needs to have in order to calculate its
-{fiscore}. Value range: 0-1 (`0.1` by default).
+include::{docdir}/ml/ml-shared.asciidoc[tag=feature-influence-threshold]
 
 `method`::::
 (Optional, string)
-Sets the method that {oldetection} uses. If the method is not set {oldetection}
-uses an ensemble of different methods and normalises and combines their
-individual {olscores} to obtain the overall {olscore}. We recommend to use the
-ensemble method. Available methods are `lof`, `ldof`, `distance_kth_nn`,
-`distance_knn`.
+include::{docdir}/ml/ml-shared.asciidoc[tag=method]
 
 `n_neighbors`::::
 (Optional, integer)
-Defines the value for how many nearest neighbors each method of
-{oldetection} will use to calculate its {olscore}. When the value is not set,
-different values will be used for different ensemble members. This helps
-improve diversity in the ensemble. Therefore, only override this if you are
-confident that the value you choose is appropriate for the data set.
+include::{docdir}/ml/ml-shared.asciidoc[tag=n-neighbors]
 
 `outlier_fraction`::::
 (Optional, double)
-Sets the proportion of the data set that is assumed to be outlying prior to
-{oldetection}. For example, 0.05 means it is assumed that 5% of values are real
-outliers and 95% are inliers.
+include::{docdir}/ml/ml-shared.asciidoc[tag=outlier-fraction]
 
 `standardization_enabled`::::
 (Optional, boolean)
-If `true`, then the following operation is performed on the columns before
-computing outlier scores: (x_i - mean(x_i)) / sd(x_i). Defaults to `true`. For
-more information, see
-https://en.wikipedia.org/wiki/Feature_scaling#Standardization_(Z-score_Normalization)[this wiki page about standardization].
+include::{docdir}/ml/ml-shared.asciidoc[tag=standardization-enabled]
 //End outlier_detection
 =====
 //Begin regression
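
Reader's note on `class_assignment_objective`: the two objectives named in this file can disagree, which the description only implies. The Python sketch below is illustrative and not taken from the commit. It computes both metrics for two sets of predictions; the helper names `accuracy` and `minimum_recall` and the sample labels are hypothetical.

```python
from collections import Counter


def accuracy(actual, predicted):
    """Fraction of documents whose predicted label equals the actual label."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


def minimum_recall(actual, predicted):
    """Smallest per-class recall: correct predictions for a class / its true count."""
    totals = Counter(actual)
    hits = Counter(a for a, p in zip(actual, predicted) if a == p)
    return min(hits[label] / totals[label] for label in totals)


# Imbalanced toy data: 8 documents of class "a", 2 of class "b".
actual = ["a"] * 8 + ["b"] * 2
always_a = ["a"] * 10                      # ignores the minority class entirely
balanced = ["a"] * 6 + ["b"] * 2 + ["b"] * 2  # misses two "a", finds both "b"

if __name__ == "__main__":
    for name, pred in (("always_a", always_a), ("balanced", balanced)):
        print(name, accuracy(actual, pred), minimum_recall(actual, pred))
```

Both prediction sets reach the same accuracy here, but only the second has a non-zero minimum recall, which is the kind of case where `maximize_minimum_recall` and `maximize_accuracy` would favor different label assignments.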

docs/reference/ml/ml-shared.asciidoc

Lines changed: 57 additions & 27 deletions
@@ -278,6 +278,19 @@ include::{docdir}/ml/ml-shared.asciidoc[tag=time-span]
 ====
 end::chunking-config[]
 
+tag::class-assignment-objective[]
+Defines the objective to optimize when assigning class labels:
+`maximize_accuracy` or `maximize_minimum_recall`. When maximizing accuracy,
+class labels are chosen to maximize the number of correct predictions. When
+maximizing minimum recall, labels are chosen to maximize the minimum recall
+for any class. Defaults to `maximize_minimum_recall`.
+end::class-assignment-objective[]
+
+tag::compute-feature-influence[]
+Specifies whether the feature influence calculation is enabled. Defaults to
+`true`.
+end::compute-feature-influence[]
+
 tag::custom-rules[]
 An array of custom rule objects, which enable you to customize the way detectors
 operate. For example, a rule may dictate to the detector conditions under which
@@ -479,32 +492,15 @@ tag::dfas-downsample-factor[]
 The value of the downsample factor.
 end::dfas-downsample-factor[]
 
-tag::dfas-eta[]
-The value of the eta hyperparameter.
-end::dfas-eta[]
-
 tag::dfas-eta-growth[]
 Specifies the rate at which the `eta` increases for each new tree that is added to the
 forest. For example, a rate of `1.05` increases `eta` by 5%.
 end::dfas-eta-growth[]
 
-tag::dfas-feature-bag-fraction[]
-The fraction of features that is used when selecting a random bag for each
-candidate split.
-end::dfas-feature-bag-fraction[]
-
-tag::dfas-gamma[]
-Regularization factor to penalize trees with large numbers of nodes.
-end::dfas-gamma[]
-
 tag::dfas-iteration[]
 The number of iterations on the analysis.
 end::dfas-iteration[]
 
-tag::dfas-lambda[]
-Regularization factor to penalize large leaf weights.
-end::dfas-lambda[]
-
 tag::dfas-max-attempts[]
 If the algorithm fails to determine a non-trivial tree (more than a single
 leaf), this parameter determines how many of such consecutive failures are
@@ -519,10 +515,6 @@ The maximum number of steps is determined based on the number of undefined hyper
 times the maximum optimization rounds per hyperparameter.
 end::dfas-max-optimization-rounds[]
 
-tag::dfas-max-trees[]
-The maximum number of trees in the forest.
-end::dfas-max-trees[]
-
 tag::dfas-num-folds[]
 The maximum number of folds for the cross-validation procedure.
 end::dfas-num-folds[]
@@ -584,9 +576,9 @@ end::empty-bucket-count[]
 
 tag::eta[]
 Advanced configuration option. The shrinkage applied to the weights. Smaller
-values result in larger forests which have a better generalization error. However,
-the smaller the value the longer the training will take. For more information,
-about shrinkage, see
+values result in larger forests which have a better generalization error.
+However, the smaller the value the longer the training will take. For more
+information, about shrinkage, see
 https://en.wikipedia.org/wiki/Gradient_boosting#Shrinkage[this wiki article].
 end::eta[]
 
@@ -605,9 +597,15 @@ end::exclude-interim-results[]
 
 tag::feature-bag-fraction[]
 Advanced configuration option. Defines the fraction of features that will be
-used when selecting a random bag for each candidate split.
+used when selecting a random bag for each candidate split. By default, this
+value is calculated during hyperparameter optimization.
 end::feature-bag-fraction[]
 
+tag::feature-influence-threshold[]
+The minimum {olscore} that a document needs to have in order to calculate its
+{fiscore}. Value range: 0-1 (`0.1` by default).
+end::feature-influence-threshold[]
+
 tag::filter[]
 One or more <<analysis-tokenfilters,token filters>>. In addition to the built-in
 token filters, other plugins can provide more token filters. This property is
@@ -653,7 +651,8 @@ Advanced configuration option. Regularization parameter to prevent overfitting
 on the training data set. Multiplies a linear penalty associated with the size of
 individual trees in the forest. The higher the value the more training will
 prefer smaller trees. The smaller this parameter the larger individual trees
-will be and the longer training will take.
+will be and the longer training will take. By default, this value is calculated
+during hyperparameter optimization.
 end::gamma[]
 
 tag::groups[]
@@ -785,6 +784,7 @@ more training will attempt to keep leaf weights small. This makes the prediction
 function smoother at the expense of potentially not being able to capture
 relevant relationships between the features and the {depvar}. The smaller this
 parameter the larger individual trees will be and the longer training will take.
+By default, this value is calculated during hyperparameter optimization.
 end::lambda[]
 
 tag::last-data-time[]
@@ -828,9 +828,18 @@ end::max-empty-searches[]
 
 tag::max-trees[]
 Advanced configuration option. Defines the maximum number of trees the forest is
-allowed to contain. The maximum value is 2000.
+allowed to contain. The maximum value is 2000. By default, this value is
+calculated during hyperparameter optimization.
 end::max-trees[]
 
+tag::method[]
+The method that {oldetection} uses. Available methods are `lof`, `ldof`,
+`distance_kth_nn`, `distance_knn`, and `ensemble`. The default value is
+`ensemble`, which means that {oldetection} uses an ensemble of different methods
+and normalises and combines their individual {olscores} to obtain the overall
+{olscore}.
+end::method[]
+
 tag::missing-field-count[]
 The number of input documents that are missing a field that the {anomaly-job} is
 configured to analyze. Input documents with missing fields are still processed
@@ -973,6 +982,14 @@ NOTE: To use the `multivariate_by_fields` property, you must also specify
 --
 end::multivariate-by-fields[]
 
+tag::n-neighbors[]
+Defines the value for how many nearest neighbors each method of {oldetection}
+uses to calculate its {olscore}. When the value is not set, different values are
+used for different ensemble members. This default behavior helps improve the
+diversity in the ensemble; only override it if you are confident that the value
+you choose is appropriate for the data set.
+end::n-neighbors[]
+
 tag::node-address[]
 The network address of the node.
 end::node-address[]
@@ -1015,6 +1032,12 @@ order documents are discarded, since jobs require time series data to be in
 ascending chronological order.
 end::out-of-order-timestamp-count[]
 
+tag::outlier-fraction[]
+The proportion of the data set that is assumed to be outlying prior to
+{oldetection}. For example, 0.05 means it is assumed that 5% of values are real
+outliers and 95% are inliers.
+end::outlier-fraction[]
+
 tag::over-field-name[]
 The field used to split the data. In particular, this property is used for
 analyzing the splits with respect to the history of all splits. It is used for
@@ -1143,6 +1166,13 @@ number of data points. If your data contains many sparse buckets, consider using
 a longer `bucket_span`.
 end::sparse-bucket-count[]
 
+tag::standardization-enabled[]
+If `true`, the following operation is performed on the columns before computing
+{olscores}: (x_i - mean(x_i)) / sd(x_i). Defaults to `true`. For more
+information about this concept, see
+https://en.wikipedia.org/wiki/Feature_scaling#Standardization_(Z-score_Normalization)[Wikipedia].
+end::standardization-enabled[]
+
 tag::state-anomaly-job[]
 The status of the {anomaly-job}, which can be one of the following values:
 +
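
Reader's note on the `standardization-enabled` tag added in this file: the formula `(x_i - mean(x_i)) / sd(x_i)` can be sketched in a few lines of Python. This is an illustration, not the actual implementation. It assumes `sd` is the population standard deviation and that a constant column maps to zeros; the description specifies neither. The function name `standardize_column` is hypothetical.

```python
from statistics import mean, pstdev


def standardize_column(values):
    """Apply (x - mean) / sd to one column of numeric values.

    Assumptions (not stated in the docs): population standard deviation,
    and a zero-variance column is returned as all zeros to avoid dividing by 0.
    """
    mu = mean(values)
    sd = pstdev(values)
    if sd == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sd for v in values]


if __name__ == "__main__":
    # After standardization the column has mean 0 and standard deviation 1.
    print(standardize_column([1.0, 2.0, 3.0]))
```

Each column is standardized on its own, so features measured on very different scales contribute comparably to distance-based methods such as `distance_knn`.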
