Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LDA ordination with feature loadings computation #598

Merged
merged 35 commits into from
Jul 23, 2024

Conversation

thpralas
Copy link
Collaborator

@thpralas thpralas commented Jul 5, 2024

This PR adds two new methods getLDA and addLDA that perform Latent Dirichlet Allocation.

getLDA returns the ordination matrix with the feature loadings matrix stored as attribute.
addLDA adds this ordination matrix to reducedDims of the TreeSummarizedExperiment.

This PR is related to issue #596.

@antagomir
Copy link
Member

Great! I will be first looking fwd to comments from @TuomasBorman and then let's see if I have anything to add.

@antagomir
Copy link
Member

Conflicts to resolve?

Copy link
Contributor

@TuomasBorman TuomasBorman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Even though the code is rather short, structure it by adding tests and comments

R/runLDA.R Outdated Show resolved Hide resolved
R/runLDA.R Outdated Show resolved Hide resolved
R/runLDA.R Outdated Show resolved Hide resolved
R/runLDA.R Outdated Show resolved Hide resolved
R/runLDA.R Outdated Show resolved Hide resolved
R/runLDA.R Outdated Show resolved Hide resolved
R/runLDA.R Outdated Show resolved Hide resolved
R/runLDA.R Outdated Show resolved Hide resolved
R/runLDA.R Outdated Show resolved Hide resolved
@TuomasBorman
Copy link
Contributor

TuomasBorman commented Jul 8, 2024

Couple things to discuss:

  1. LDA has features from both ordination and clustering methods, however, it is more clustering method. That should be taken into account in documentation.
  2. As this is clustering method, should we add clusters to rowData/colData? In theory, LDA can be applied also to rows, which makes reducedDim not suitable for storing the result. (Instead of ordination methods, addCluster() should be used as an example)
  3. And finally, should this be implemented in bluster package (the Bioconductor clustering package)? We could make a PR as we did last year https://github.com/LTLA/bluster/blob/master/R/DmmParam.R

@antagomir
Copy link
Member

Intuitively, reducedDim is a logical place given that this would be usually done for samples. But it could be in principle applied on both rows and columns. The function generates both feature matrix and feature loading matrix, both should be stored.

LDA is a matrix decomposition method, and I would not recommend to link it with clustering so directly.

But it could in principle go to the bluster package. Not sure if they would approve but it could be asked by opening an issue. That might take more time, however.. therefore I would meanwhile proceed with implementing these in mia, and later we can move them elsewhere if necessary? LDA and NMF have clear uses, it would be useful to have them easily accessible.

@TuomasBorman
Copy link
Contributor

TuomasBorman commented Jul 8, 2024

With these explanations, I am fine with mia implementation and reducedDim()

Copy link

codecov bot commented Jul 9, 2024

Codecov Report

Attention: Patch coverage is 66.66667% with 16 lines in your changes missing coverage. Please review.

Please upload report for BASE (devel@dea9d92). Learn more about missing BASE report.

Files Patch % Lines
R/utils.R 31.57% 13 Missing ⚠️
R/addLDA.R 89.65% 3 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff            @@
##             devel     #598   +/-   ##
========================================
  Coverage         ?   68.08%           
========================================
  Files            ?       44           
  Lines            ?     5273           
  Branches         ?        0           
========================================
  Hits             ?     3590           
  Misses           ?     1683           
  Partials         ?        0           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Contributor

@TuomasBorman TuomasBorman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor things

R/addLDA.R Outdated Show resolved Hide resolved
R/addLDA.R Outdated Show resolved Hide resolved
R/addLDA.R Outdated Show resolved Hide resolved
R/addLDA.R Outdated Show resolved Hide resolved
R/utils.R Outdated Show resolved Hide resolved
R/utils.R Outdated Show resolved Hide resolved
R/addLDA.R Outdated Show resolved Hide resolved
R/addLDA.R Outdated Show resolved Hide resolved
R/addLDA.R Outdated Show resolved Hide resolved
@antagomir
Copy link
Member

Ok to me to add that.

However, if it feels a lot to add now, I would focus on finalizing the present one (with a single fixed k) first, and then adding that multi-k optimization as a separate PR.

@TuomasBorman
Copy link
Contributor

TuomasBorman commented Jul 11, 2024

That would require something like this

addLDA <- function(eval.metric = "perplexity"){
    if(eval.metric == "coherence" && !require("topicdoc") ){
        stop("'topicmodel' package must be installed in order to calculate coherence.")
    }
    models <- lapply(k, function(x){
        LDA(k = x)
    }
    
    metric <- .calc_metrics(models)
    
    # Get the model with lowest eval.metric (dfault value perplexity).
    
    # Get scores
    # Add loadings, model, and metrics data.frame to attributes

}

.calc_metrics <- function(x){
    # Loop through models
    # Calculate perplexity with topicmodels
    # Calculate coherence (I believe you have to use topicdoc package). Calculate it if that package is available
    # Create df from both metrics
}

R/utils.R Outdated Show resolved Hide resolved
R/addLDA.R Outdated Show resolved Hide resolved
R/addLDA.R Outdated Show resolved Hide resolved
@TuomasBorman
Copy link
Contributor

NMF would go with this same template I think. However, there k (rank) can be numeric vector and the function finds optimal k by itself

@TuomasBorman
Copy link
Contributor

For LDA we could have another PR for multiple k values, if you are not able to implement it now

The idea was this:

data("GlobalPatterns")
tse <- GlobalPatterns
tse <- agglomerateByPrevalence(tse, rank="Phylum")

eval.metric <- "perplexity"
k <- c(2, 3, 4, 5, 6, 7, 8, 9, 10)
k <- 5
assay.type <- "counts"

mat <- assay(tse, assay.type)

df <- as.data.frame(t(mat))




if( eval.metric == "coherence" ){
    .require_package("topicdoc")
}

models <- lapply(k, function(i){
    topicmodels::LDA(df, i)
})

metrics <- .calculate_lda_metrics(models, k)

lda_model <- models[[ which.max(metrics[[eval.metric]]) ]]

posteriors <- topicmodels::posterior(lda_model, df)
scores <- t(as.data.frame(posteriors$topics))
attr(scores, "loadings") <- loadings
attr(scores, "model") <- lda_model
attr(scores, "eval_metric") <- metrics 

.calculate_lda_metrics <- function(models, k){
    metrics <- lapply(models, function(model){
        res <- topicmodels::perplexity(model)
        names(res) <- "perplexity"
        if( require("topicdoc") ){
            coherence <- topicdoc::topic_coherence(model, df)
            names(coherence) <- paste0("coherence_", seq_len(length(coherence)))
            coherence <- c(coherence, coherence_mean = mean(coherence))
            res <- c(res, coherence)
        }
        return(res)
    })
    metrics <- as.data.frame( dplyr::bind_rows(metrics) )
    metrics[["k"]] <- k
    return(metrics)
}

#################################
df <- attr(scores, "eval_metric")

library(ggplot2)

ggplot(df) +
    geom_point(aes(x = k, y = perplexity), color = "red")
ggplot(df) +
    geom_point(aes(x = k, y = mean_coherence), color = "blue")


Copy link
Contributor

@TuomasBorman TuomasBorman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this PR ready? If the modification on multiple k values do not come to this PR, I think this is good to go

R/utils.R Outdated Show resolved Hide resolved
@TuomasBorman TuomasBorman merged commit 50bd2ea into devel Jul 23, 2024
3 checks passed
@TuomasBorman TuomasBorman deleted the feature_loadings_methods branch July 23, 2024 20:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants