docs/mllib-clustering.md
LDA can be thought of as a clustering algorithm as follows:

* Rather than estimating a clustering using a traditional distance, LDA uses a function based
on a statistical model of how text documents are generated.

LDA supports different inference algorithms via the `setOptimizer` function. `EMLDAOptimizer`
learns clustering using [expectation-maximization](http://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm)
on the likelihood function and yields comprehensive results, while `OnlineLDAOptimizer` uses
iterative mini-batch sampling for [online variational inference](https://www.cs.princeton.edu/~blei/papers/HoffmanBleiBach2010b.pdf)
and is generally memory friendly.

LDA takes in a collection of documents as vectors of word counts and the following parameters:

* `k`: Number of topics (i.e., cluster centers)
* `LDAOptimizer`: Optimizer to use for learning the LDA model, either
  `EMLDAOptimizer` or `OnlineLDAOptimizer`
* `docConcentration`: Dirichlet parameter for prior over documents'
  distributions over topics. Larger values encourage smoother inferred
  distributions.
* `topicConcentration`: Dirichlet parameter for prior over topics'
  distributions over terms (words). Larger values encourage smoother
  inferred distributions.
* `maxIterations`: Limit on the number of iterations.
* `checkpointInterval`: If using checkpointing (set in the Spark
  configuration), this parameter specifies the frequency with which
  checkpoints will be created. If `maxIterations` is large, using
  checkpointing can help reduce shuffle file sizes on disk and help with
  failure recovery.
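
As a quick illustration, here is a minimal sketch of wiring these parameters together.
It assumes a running `SparkContext` named `sc`; the tiny corpus is made up for the example:

```scala
import org.apache.spark.mllib.clustering.{LDA, OnlineLDAOptimizer}
import org.apache.spark.mllib.linalg.Vectors

// Tiny word-count corpus: (document ID, term-count vector over a
// 4-word vocabulary). A real corpus would come from a tokenizer.
val corpus = sc.parallelize(Seq(
  (0L, Vectors.dense(1.0, 2.0, 0.0, 5.0)),
  (1L, Vectors.dense(0.0, 1.0, 3.0, 1.0)),
  (2L, Vectors.dense(4.0, 0.0, 1.0, 2.0))))

val ldaModel = new LDA()
  .setK(2)                                // number of topics
  .setOptimizer(new OnlineLDAOptimizer()) // or new EMLDAOptimizer()
  .setDocConcentration(-1)                // -1 selects the optimizer's default prior
  .setTopicConcentration(-1)
  .setMaxIterations(20)
  .setCheckpointInterval(10)
  .run(corpus)
```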

All of MLlib's LDA models support:

* `describeTopics(n: Int)`: Returns the inferred topics, each of which is
  a probability distribution over terms (words), limited to the `n`
  highest-weight terms per topic.
* `topicsMatrix`: Returns the inferred topics as a `vocabSize`-by-`k`
  matrix, in which each column is one topic.
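
For example, either kind of fitted model can be summarized with a small helper along
these lines (a sketch; the helper name `summarizeTopics` is invented for illustration):

```scala
import org.apache.spark.mllib.clustering.LDAModel

// Print each topic's `n` highest-weight terms, plus the dimensions of
// the full topics matrix (vocabSize rows, k columns).
def summarizeTopics(model: LDAModel, n: Int): Unit = {
  model.describeTopics(n).zipWithIndex.foreach {
    case ((termIndices, termWeights), topic) =>
      val terms = termIndices.zip(termWeights)
        .map { case (term, w) => f"$term:$w%.3f" }
        .mkString(", ")
      println(s"Topic $topic: $terms")
  }
  val m = model.topicsMatrix
  println(s"topicsMatrix is ${m.numRows} x ${m.numCols}")
}
```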

*Note*: LDA is still an experimental feature under active development.
As a result, certain features are only available in one of the two
optimizers / models generated by the optimizer. The following
discussion describes each optimizer/model pair separately.

**EMLDAOptimizer and DistributedLDAModel**

For the parameters provided to `LDA`:

* `docConcentration`: Only symmetric priors are supported, so all values
  in the provided `k`-dimensional vector must be identical. All values
  must also be $> 1.0$. Providing `Vector(-1)` results in default behavior
  (a uniform `k`-dimensional vector with value $(50 / k) + 1$).
* `topicConcentration`: Only symmetric priors are supported. Values must be
  $> 1.0$. Providing `-1` results in defaulting to a value of $0.1 + 1$.
* `maxIterations`: Interpreted as the maximum number of EM iterations.

`EMLDAOptimizer` produces a `DistributedLDAModel`, which stores not only
the inferred topics but also the full training corpus and topic
distributions for each document in the training corpus. A
`DistributedLDAModel` supports:

* `topTopicsPerDocument(k)`: The top `k` topics and their weights for
  each document in the training corpus. Note that for empty documents,
  we don't create the topic distributions.
* `topDocumentsPerTopic(k)`: The top `k` documents for each topic and
  the corresponding weight of the topic in those documents.
* `logPrior`: Log probability of the estimated topics and
  document-topic distributions given the hyperparameters
  `docConcentration` and `topicConcentration`.
* `logLikelihood`: Log likelihood of the training corpus, given the
  inferred topics and document-topic distributions.
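
Putting this together, a sketch of EM training and the `DistributedLDAModel`
queries (the cast reflects that the EM optimizer always returns the distributed
model; `corpus` is an `RDD[(Long, Vector)]` as in the earlier sketch):

```scala
import org.apache.spark.mllib.clustering.{DistributedLDAModel, LDA}

// Train with EM; the result can safely be cast to DistributedLDAModel.
val distModel = new LDA()
  .setK(3)
  .setOptimizer("em")   // string form; equivalent to new EMLDAOptimizer()
  .setMaxIterations(50) // maximum EM iterations
  .run(corpus)
  .asInstanceOf[DistributedLDAModel]

val topTopics = distModel.topTopicsPerDocument(2) // RDD of (docID, topic indices, weights)
val topDocs = distModel.topDocumentsPerTopic(2)   // per-topic top documents and weights
println(s"logPrior = ${distModel.logPrior}")
println(s"logLikelihood = ${distModel.logLikelihood}")
```
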
**OnlineLDAOptimizer and LocalLDAModel**

For the parameters provided to `LDA`:

* `docConcentration`: Asymmetric priors can be used by passing in a
  vector with values equal to the Dirichlet parameter in each of the `k`
  dimensions. Values should be $\geq 0$. Providing `Vector(-1)` results in
  default behavior (a uniform `k`-dimensional vector with value $(1.0 / k)$).
* `topicConcentration`: Only symmetric priors are supported. Values must be
  $\geq 0$. Providing `-1` results in defaulting to a value of $(1.0 / k)$.
* `maxIterations`: Interpreted as the maximum number of minibatches to
  submit.
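
For instance, an asymmetric document prior for `k = 3` might be supplied as a
vector (a sketch; the values are arbitrary and chosen only for illustration):

```scala
import org.apache.spark.mllib.clustering.LDA
import org.apache.spark.mllib.linalg.Vectors

// Each entry is the Dirichlet parameter for one of the k = 3 topics;
// the larger first entry nudges documents toward topic 0.
val lda = new LDA()
  .setK(3)
  .setOptimizer("online")
  .setDocConcentration(Vectors.dense(10.0, 1.0, 1.0))
```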

In addition, `OnlineLDAOptimizer` accepts the following parameters:

* `miniBatchFraction`: Fraction of the corpus sampled and used at each
  iteration.
* `optimizeAlpha`: If set to true, performs maximum-likelihood
  estimation of the hyperparameter `alpha` (a.k.a. `docConcentration`)
  after each minibatch and returns the optimized `alpha` in the resulting
  `LDAModel`.
* `tau0` and `kappa`: Used for learning-rate decay, which is computed as
  $(\tau_0 + iter)^{-\kappa}$, where $iter$ is the current number of iterations.

`OnlineLDAOptimizer` produces a `LocalLDAModel`, which only stores the
inferred topics. A `LocalLDAModel` supports:

* `logLikelihood(documents)`: Calculates a lower bound on the log
  likelihood of the provided `documents` given the inferred topics.
* `logPerplexity(documents)`: Calculates an upper bound on the
  perplexity of the provided `documents` given the inferred topics.
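
A sketch of online training plus held-out evaluation, assuming `corpus` and
`heldOut` are `RDD[(Long, Vector)]` word-count datasets as in the earlier sketches:

```scala
import org.apache.spark.mllib.clustering.{LDA, LocalLDAModel, OnlineLDAOptimizer}

val onlineOptimizer = new OnlineLDAOptimizer()
  .setMiniBatchFraction(0.05) // 5% of the corpus per minibatch
  .setTau0(1024.0)            // decay: (tau0 + iter)^(-kappa)
  .setKappa(0.51)

val localModel = new LDA()
  .setK(10)
  .setOptimizer(onlineOptimizer)
  .setMaxIterations(100) // number of minibatches submitted
  .run(corpus)
  .asInstanceOf[LocalLDAModel]

// Bounds on held-out quality: a higher log likelihood and a lower
// perplexity indicate a better fit.
println(s"logLikelihood bound = ${localModel.logLikelihood(heldOut)}")
println(s"logPerplexity bound = ${localModel.logPerplexity(heldOut)}")
```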