
Commit b8b9f9a

Author: Feynman Liang (committed)
Adds new LDA features to user guide
1 parent b265e28 · commit b8b9f9a

File tree: 3 files changed, +98 −15 lines


docs/mllib-clustering.md

Lines changed: 97 additions & 14 deletions
```diff
@@ -443,23 +443,106 @@ LDA can be thought of as a clustering algorithm as follows:
 * Rather than estimating a clustering using a traditional distance, LDA uses a function based
 on a statistical model of how text documents are generated.
 
-LDA takes in a collection of documents as vectors of word counts.
-It supports different inference algorithms via `setOptimizer` function. EMLDAOptimizer learns clustering using [expectation-maximization](http://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm)
-on the likelihood function and yields comprehensive results, while OnlineLDAOptimizer uses iterative mini-batch sampling for [online variational inference](https://www.cs.princeton.edu/~blei/papers/HoffmanBleiBach2010b.pdf) and is generally memory friendly. After fitting on the documents, LDA provides:
-
-* Topics: Inferred topics, each of which is a probability distribution over terms (words).
-* Topic distributions for documents: For each non empty document in the training set, LDA gives a probability distribution over topics. (EM only). Note that for empty documents, we don't create the topic distributions. (EM only)
+LDA supports different inference algorithms via the `setOptimizer`
+function. `EMLDAOptimizer` learns clustering using
+[expectation-maximization](http://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm)
+on the likelihood function and yields comprehensive results, while
+`OnlineLDAOptimizer` uses iterative mini-batch sampling for [online
+variational
+inference](https://www.cs.princeton.edu/~blei/papers/HoffmanBleiBach2010b.pdf)
+and is generally memory-friendly.
 
-LDA takes the following parameters:
+LDA takes in a collection of documents as vectors of word counts and the
+following parameters:
 
 * `k`: Number of topics (i.e., cluster centers)
-* `maxIterations`: Limit on the number of iterations of EM used for learning
-* `docConcentration`: Hyperparameter for prior over documents' distributions over topics. Currently must be > 1, where larger values encourage smoother inferred distributions.
-* `topicConcentration`: Hyperparameter for prior over topics' distributions over terms (words). Currently must be > 1, where larger values encourage smoother inferred distributions.
-* `checkpointInterval`: If using checkpointing (set in the Spark configuration), this parameter specifies the frequency with which checkpoints will be created. If `maxIterations` is large, using checkpointing can help reduce shuffle file sizes on disk and help with failure recovery.
-
-*Note*: LDA is a new feature with some missing functionality. In particular, it does not yet
-support prediction on new documents, and it does not have a Python API. These will be added in the future.
+* `LDAOptimizer`: Optimizer to use for learning the LDA model, either
+`EMLDAOptimizer` or `OnlineLDAOptimizer`
+* `docConcentration`: Dirichlet parameter for the prior over documents'
+distributions over topics. Larger values encourage smoother inferred
+distributions.
+* `topicConcentration`: Dirichlet parameter for the prior over topics'
+distributions over terms (words). Larger values encourage smoother
+inferred distributions.
+* `maxIterations`: Limit on the number of iterations.
+* `checkpointInterval`: If using checkpointing (set in the Spark
+configuration), this parameter specifies the frequency with which
+checkpoints will be created. If `maxIterations` is large, using
+checkpointing can help reduce shuffle file sizes on disk and help with
+failure recovery.
+
+All of MLlib's LDA models support:
+
+* `describeTopics(n: Int)`: Returns the inferred topics, each of which
+is a probability distribution over terms (words), described by the top
+`n` terms and their weights.
+* `topicsMatrix`: Returns the inferred topics as a `vocabSize` by `k`
+matrix, where each column is a topic.
+
+*Note*: LDA is still an experimental feature under active development.
+As a result, certain features are only available in one of the two
+optimizers / models generated by the optimizer. The following
+discussion describes each optimizer/model pair separately.
+
+**EMLDAOptimizer and DistributedLDAModel**
+
+For the parameters provided to `LDA`:
+
+* `docConcentration`: Only symmetric priors are supported, so all values
+in the provided `k`-dimensional vector must be identical. All values
+must also be $> 1.0$. Providing `Vector(-1)` results in default behavior
+(a uniform `k`-dimensional vector with value $(50 / k) + 1$).
+* `topicConcentration`: Only symmetric priors are supported. Values must
+be $> 1.0$. Providing `-1` results in defaulting to a value of $0.1 + 1$.
+* `maxIterations`: Interpreted as the maximum number of EM iterations.
+
+`EMLDAOptimizer` produces a `DistributedLDAModel`, which stores not only
+the inferred topics but also the full training corpus and topic
+distributions for each document in the training corpus. A
+`DistributedLDAModel` supports:
+
+* `topTopicsPerDocument(k)`: The top `k` topics and their weights for
+each document in the training corpus.
+* `topDocumentsPerTopic(k)`: The top `k` documents for each topic and
+the corresponding weight of the topic in the documents.
+* `logPrior`: Log probability of the estimated topics and
+document-topic distributions given the hyperparameters
+`docConcentration` and `topicConcentration`.
+* `logLikelihood`: Log likelihood of the training corpus, given the
+inferred topics and document-topic distributions.
+
+**OnlineLDAOptimizer and LocalLDAModel**
+
+For the parameters provided to `LDA`:
+
+* `docConcentration`: Asymmetric priors can be used by passing in a
+vector with values equal to the Dirichlet parameter in each of the `k`
+dimensions. Values should be $\geq 0$. Providing `Vector(-1)` results in
+default behavior (a uniform `k`-dimensional vector with value $1.0 / k$).
+* `topicConcentration`: Only symmetric priors are supported. Values must
+be $\geq 0$. Providing `-1` results in defaulting to a value of $1.0 / k$.
+* `maxIterations`: Interpreted as the maximum number of minibatches to
+submit.
+
+In addition, `OnlineLDAOptimizer` accepts the following parameters:
+
+* `miniBatchFraction`: Fraction of the corpus sampled and used at each
+iteration.
+* `optimizeAlpha`: If set to true, performs maximum-likelihood
+estimation of the hyperparameter `alpha` (aka `docConcentration`)
+after each minibatch and returns the optimized `alpha` in the resulting
+`LDAModel`.
+* `tau0` and `kappa`: Used for learning-rate decay, which is computed by
+$(\tau_0 + iter)^{-\kappa}$ where $iter$ is the current number of
+iterations.
+
+`OnlineLDAOptimizer` produces a `LocalLDAModel`, which only stores the
+inferred topics. A `LocalLDAModel` supports:
+
+* `logLikelihood(documents)`: Calculates a lower bound on the log
+likelihood of the provided `documents`, given the inferred topics.
+* `logPerplexity(documents)`: Calculates an upper bound on the
+perplexity of the provided `documents`, given the inferred topics.
 
 **Examples**
 
```

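As a quick illustration of the common parameters documented above, here is a minimal Scala sketch of configuring `LDA`. It assumes a live `SparkContext` named `sc` and a hypothetical toy corpus `corpus` of `(documentId, wordCountVector)` pairs; the document IDs and counts are illustrative only.

```scala
import org.apache.spark.mllib.clustering.LDA
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

// Hypothetical toy corpus: each document is an (id, word-count vector) pair.
val corpus: RDD[(Long, Vector)] = sc.parallelize(Seq(
  (0L, Vectors.dense(1.0, 2.0, 6.0, 0.0)),
  (1L, Vectors.dense(0.0, 3.0, 1.0, 4.0)),
  (2L, Vectors.dense(5.0, 0.0, 2.0, 3.0))
))

// Common parameters shared by both optimizers, as described in the guide.
val lda = new LDA()
  .setK(3)                    // number of topics
  .setDocConcentration(1.1)   // prior over documents' topic distributions
  .setTopicConcentration(1.1) // prior over topics' term distributions
  .setMaxIterations(50)       // interpretation depends on the optimizer
  .setCheckpointInterval(10)  // takes effect only if checkpointing is configured
```
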
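Pairing this with `EMLDAOptimizer` exposes the `DistributedLDAModel` queries listed in the diff. A minimal sketch, reusing the hypothetical `corpus` from the previous snippet; the `asInstanceOf` cast reflects that `run` is typed to return the base `LDAModel`.

```scala
import org.apache.spark.mllib.clustering.{DistributedLDAModel, EMLDAOptimizer, LDA}

val distributedModel = new LDA()
  .setK(3)
  .setOptimizer(new EMLDAOptimizer)
  .setMaxIterations(20) // maximum number of EM iterations
  .run(corpus)
  .asInstanceOf[DistributedLDAModel]

// Top 2 topics (with weights) per training document, and top 2 documents per topic.
val topTopics = distributedModel.topTopicsPerDocument(2)
val topDocs = distributedModel.topDocumentsPerTopic(2)

// Fit diagnostics under the docConcentration / topicConcentration hyperparameters.
println(s"logPrior = ${distributedModel.logPrior}")
println(s"logLikelihood = ${distributedModel.logLikelihood}")
```
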
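Likewise for `OnlineLDAOptimizer` and `LocalLDAModel`, again assuming the hypothetical `corpus`. The bound methods are evaluated on the training documents here only for brevity; held-out documents would be the more typical input.

```scala
import org.apache.spark.mllib.clustering.{LDA, LocalLDAModel, OnlineLDAOptimizer}

val onlineOptimizer = new OnlineLDAOptimizer()
  .setMiniBatchFraction(0.1) // fraction of the corpus sampled per iteration
  .setTau0(1024.0)           // learning-rate decay: (tau0 + iter)^(-kappa)
  .setKappa(0.51)

val localModel = new LDA()
  .setK(3)
  .setOptimizer(onlineOptimizer)
  .setMaxIterations(100) // maximum number of minibatches to submit
  .run(corpus)
  .asInstanceOf[LocalLDAModel]

// Lower bound on log likelihood / upper bound on perplexity of the documents.
println(s"logLikelihood bound = ${localModel.logLikelihood(corpus)}")
println(s"logPerplexity bound = ${localModel.logPerplexity(corpus)}")
```
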
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAModel.scala

Lines changed: 0 additions & 1 deletion
```diff
@@ -420,7 +420,6 @@ object LocalLDAModel extends Loader[LocalLDAModel] {
       }
       val topicsMat = Matrices.fromBreeze(brzTopics)
 
-      // TODO: initialize with docConcentration, topicConcentration, and gammaShape after SPARK-9940
       new LocalLDAModel(topicsMat, docConcentration, topicConcentration, gammaShape)
     }
   }
```

mllib/src/test/scala/org/apache/spark/mllib/clustering/LDASuite.scala

Lines changed: 1 addition & 0 deletions
```diff
@@ -68,6 +68,7 @@ class LDASuite extends SparkFunSuite with MLlibTestSparkContext {
     // Train a model
     val lda = new LDA()
     lda.setK(k)
+      .setOptimizer(new EMLDAOptimizer)
       .setDocConcentration(topicSmoothing)
       .setTopicConcentration(termSmoothing)
       .setMaxIterations(5)
```
