Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dbrda() doesn't return species scores and the documentation is unclear if it should #254

Closed
gavinsimpson opened this issue Oct 3, 2017 · 17 comments

Comments

@gavinsimpson
Copy link
Contributor

gavinsimpson commented Oct 3, 2017

Unlike capscale(), dbrda() doesn't return species scores:

library('vegan')
data(varespec, varechem)
ord <- dbrda(varespec ~ N + P + K + Condition(Al), varechem, dist='bray')
> head(scores(ord, display = 'species'))
     dbRDA1 dbRDA2
spe1     NA     NA

It's not immediately clear if dbrda() should return these — the documentation doesn't distinguish between capscale() and dbrda() on this point.

Given the similarity between these two methods, I suspect we can just add species scores using the same code from capscale().

Either way we should either

  • clarify and update the documentation in ?dbrda to be clearer if there are differences between this function and capscale(),

or

  • add species scores to dbrda() models.

This issue was originally raised on StackOverlfow

@jarioksa
Copy link
Contributor

jarioksa commented Oct 3, 2017

In vegan_2.5-0 (master at github) we do document missing species scores in ?cca.object. A more obvious place would be ?dbrda. However, the decision on documentation and implementation of species scores will also depend on the decision on issue #217: should we deprecate capscale with its species scores?

I am not quite sure if species scores should be implemented, because I don't know how to do this correctly (this also applies to capscale). In capscale we pretend that the analysis is like RDA, and we add species scores like we would do in RDA. If we used Euclidean distances in capscale, the scores will be identical to RDA species scores and this is all OK. But what is the relationship between species and ordination if we use non-Euclidean dissimilarities? In principle, both dbrda and capscale handle dissimilarities like they were Euclidean distances and this is quite OK for sampling units. In species scores we need a leap from species to dissimilarities and from there a leap to believing the dissimilarities were Euclidean. My faith failed here when I implemented dbrda, although I was still strong when I implemented capscale a decade earlier.

So this can be done, but should we?

@jarioksa
Copy link
Contributor

jarioksa commented Oct 4, 2017

I think I need to explain the reasons behind my hesitations of species scores in dbrda(). We do have them in capscale(), and we can easily add them to dbrda(), either directly or as a separate function like I outlined in StackOverlfow.

capscale() is written like it would use Euclidean distances, and if it really does so, its results are identical to RDA. Naturally, it does not make sense to use Euclidean distances, because direct use of RDA would be more efficient. However, the dissimilarities are analysed like they were Euclidean (or metric). If the dissimilarities are semimetric, we will have some negative eigenvalues and imaginary components. These are just ignored in capscale() that maps the dissimilarities into Euclidean space before analysis, but they are directly handled in dbrda(). These imaginary components do not worry me: their effect usually appears only in the last axes that are ignored any way, and the first axes and usually all constrained axes are real even in semimetric indices. The species scores are found by mapping the original community data onto solution and this is a bigger leap. It is strictly correct only with Euclidean distances and can be misleading with other distances (even metric ones). In capscale() we do it anyway, but not (yet) in dbrda(). The capscale() logic is that this is absolutely OK if you use Euclidean distances (which you should not use), but if you use something else, you take a conscious risk of erring. However, users may not be so conscious, but they take the existence of species scores as our endorsement.

There are numerous problems with species scores. The way we add them needs metric distances, but not necessarily standard Euclidean. If we have a distance that can be expressed as Euclidean distances of transformed data, then transformed data can be used to find species scores. That is, if our dissimilarity can be expressed dist(transf(x)), then transf(x) can be used to find species scores. Currently capscale() does not do so, but it uses x to find species scores. For instance, the classic Chord Distance can be expressed as dist(decostand(x, "normalize")), and then decostand(x, "normalize") can be used to find correct species scores, but capscale() uses x. (NB., if somebody uses Chord Distance, she probably used some other way of calculation and so we cannot even know which indices can be expressed exactly as Euclidean distances.)

The commonly used (semimetric) community dissimilarities cannot be expressed exactly as Euclidean distances. They differ in two major ways: they are often bound to range 0...1, and they use differences instead of squared differences of Euclidean metrics. However, if we can find a transformation that makes them approximately similar to Euclidean distances, we should use data with that transformation to get species scores. For closed range, we can transform SUs to unit total, and we can use sqrt(x) to reduce the effect of squared differences of Euclidean distances (this does not remove the effect, but it can make it smaller). The problem is that there is no obvious way dbrda() knows how to transform data before finding species scores (apart from exhaustive search with some choices). Users may know how to do this, or try to find a way of doing it, or just ignore the whole issue, but then we should leave this decision to users and provide a function that adds species scores like they want. An acceptable transformation is not a constant property of dissimilarity measure, but also depends on the data set.

As an example, let us have a look at BCI data with Bray-Curtis dissimilarity. If direct species scores of capscale() are OK, then the Bray-Curtis dissimilarity of the analysis (vegdist(BCI)) and implied Euclidean distance (dist(BCI)) of calculating species scores are pretty similar. Obviously, this is not the case and species scores can be misleading:
bciraw
However, transformation sqrt(decostand(BCI, "total")/2) gives Euclidean distances that are linearly related to Bray-Curtis dissimilarities, although not quite equal (red line) and have some scatter:
bcitrans

This may be a more acceptable transformation for getting species scores, but dbrda() cannot know this. Therefore it does not add those scores. capscale() neither knows this, but it does not care, but adds the scores with untransformed data anyway.

We have three obvious alternatives:

  1. Current situation of not adding species scores in dbrda() and phasing out capscale() that did so: we just give up and say we do not know how to add the species scores.
  2. Making dbrda() similar to current capscale: add species scores, ignore the issue.
  3. Not adding species scores in dbrda(), but providing function to add them allowing users to make informed and justified decisions.

@philiphaupt
Copy link

Dear Jari and vegan developers,

I would prefer option three, with some reference to the issue that you so clearly described here in the documentation of the function; And some reference to species scores in the documentation of dbrda. I think it negates the "endorsement" issue you mention, and forces people to make an effort to understand and consider their data.

Another option I was wondering about, which is probably out of grave misunderstanding of multiple facets of your incredible work, is: Are their other ways of calculating "species scores" that may circumvent these problems? e.g. Pearson or Spearman rank correlation of "species" to "axis", as used in Primer PERMANOVA. https://stackoverflow.com/questions/46563475/what-is-the-difference-between-r-vegan-species-scores-and-primer-spearman-rank

@jarioksa
Copy link
Contributor

jarioksa commented Oct 4, 2017

I don't know PRIMER and I can't comment on its choices. However, having correlations of "species" to "axis" is something that I exactly don't want to have near vegan. The species scores we have are not for the axes, but they show the direction in the space to which the species increases most rapidly: it is not along axes, but somewhere in the space. These scores are rotation invariant: if you change the orientation of the axis, all points and all scores move together and are unchanged with respect to each other, although their relationship to axes would change. So how would rank correlation change if you rotate an axis? For comparable species scores, you should find a direction to which the species has the best rank correlation, and I don't know how to do that. We can do it in Euclidean space using Euclidean geometry like we do now, but I have no idea about rank-order metric.

@gavinsimpson
Copy link
Contributor Author

@jarioksa what about envfit()-like species scores/arrows for these distance-based ordination methods? Or is that effectively the same thing (but for two dimensions only) that is currently done in capscale() and the functions you showed in your SO answer? These arrows show the direction and magnitude if we scale by r.

If well documented, the function you showed on SO would be fine to include in vegan.

@jarioksa
Copy link
Contributor

jarioksa commented Oct 4, 2017

Yes, indeed: envfit() implements the same method, but it defaults to using weighted averages scores(), and won't handle partial models in the same way (nor unconstrained scores after constraints). The mathematics are a bit simpler in getting species scores, because we know that the scores used are orthonormal. The proposed function will also amend the ordination result, and permanently add species score there instead of adding ad hoc scores over existing plots.

Several people in SO have used envfit() to add species arrows to ordination, including NMDS were they really should not be used. Technically this is correct.

@gavinsimpson
Copy link
Contributor Author

@jarioksa

Several people in SO have used envfit() to add species arrows to ordination, including NMDS were they really should not be used. Technically this is correct.

(Is that a typo in the last sentence? Should it be "Technically this is incorrect"?)

What is the concern with using envfit() for NMDS (sensu metaMDS())? Is it the assumption of linear change — hence ordisurf() — or something more specific than that?

@jarioksa
Copy link
Contributor

jarioksa commented Oct 5, 2017

No, it is not a typo. This is technically correct:

library(vegan)
data(varespec)
mod <- rda(varespec)
biplot(mod)
plot(envfit(mod ~ Pleuschr + Cladarbu + Cladrang + Cladstel, varespec)) # correct
ordisurf(mod ~ Cladstel, varespec, knots=1, add=TRUE) # linear change

The point with NMDS really is that the arrows imply a linear species response in the ordination space. We have those in PCA/RDA, but not in NMDS.

@jarioksa
Copy link
Contributor

jarioksa commented Oct 5, 2017

RFC: implementation of adding species scores to ordination objects

  1. Species scores are implemented only for methods that normally have them, but where they may be missing because input data were dissimilarities instead of raw community data. This means constrained ordination methods dbrda, capscale (until it is removed) and unconstrained metaMDS. These result objects have a standard list object that contains the species scores, but that object can be empty or NA when species scores were not calculated.

  2. Species scores are not added to ordination objects that do not have a standard location for species scores, even if the methods are similar to those in point 1, and the scores could be technically added. The following methods will not have species scores: wcmdscale, monoMDS, isomap (are there other?).

  3. Adding species scores is only implemented for vegan objects, and we do not provide this method for other packages, even when these are similar to methods in point 1.

  4. Species scores are added using the native method for the ordination objects. For dbrda (and capscale) this is linear model similar to biplot arrows, and for metaMDS this is wascores. There is no option to choose non-native method.

  5. If the ordination object already has species scores, these are over-written. This concerns capscale and metaMDS that can have species scores initially. When these are over-written, we give either a warning or a message (which?). The dbrda object has species scores only if they were added previously, and these will be over-written silently.

  6. The function will be generic, but methods are implemented only for classes "dbrda", "capscale" and "metaMDS". The "default" method will give an error message.

  7. The first implementation of the function in StackOverflow called this function specscores, but I am happy to use a better name if one is suggested.

jarioksa pushed a commit that referenced this issue Oct 10, 2017
Together with previous commits in this branch, this implements the
functions outlined in github issue #254 for adding species scores
to ordination results. Still undocumented and non-exported.
@jarioksa
Copy link
Contributor

I have now function to add (or update) species scores in dbrda, capscale and metaMDS in branch issue-#254. The functions are still undocumented and non-exported, but they can be used after source(). I am not satisfied with the name of the function: specscores(). This is ugly, but more importantly, it rather sounds a name of function to extract scores than to add them. I now consider something like addspscores() which is not much prettier but more revealing on the nature of the function. After making up my mind on naming, I'll document, export and merge these functions. English speakers may comment on the name (hint, @gavinsimpson).

The existence of species scores was the main difference between capscale() and dbrda(), and that made me hesitate deprecating capscale (issue #217). There are still implementation differences and the results do differ between dbrda() and capscale(), but intend to move on with capscale deprecation after merging this function. Comments are welcome, though.

@gavinsimpson
Copy link
Contributor Author

Would it make sense to have:

  1. sppscores(ord, ...) — extractor function wrapping scores(ord, ..., display = 'species'), and
  2. a replacement function version `sppscores<-`(ord, comm) such that we can do sppscores(ord) <- comm

?

Also, for consistency we should probably call this specscores() even though I prefer spp for 'species' over spec.

I'm not a fan of addspscores; why sp and not spp? three words without separators is two too many. Same would go for addspecscores or addsppscores.

@gavinsimpson
Copy link
Contributor Author

I'm not sure about this line

warning("function overwrites old species scores")

  • It should be a message, not a warning as if we ever write a test for this code we'll have to explain to CRAN why we're throwing warnings.

  • It should also be in past tense; "Existing species scores were overwritten"

  • Do we even need a message here? I don't really want R shouting stuff at me just for using the function as it was intended?

Ditto

warning("function overwrites old species scores")

Also, I'm now wondering if we really need this function at all. Can't we just have each of these functions just add species scores regardless? It seems like we're being too protective of users.

Would it not be better to add a sppscores argument with default FALSE to all three functions, such that we don't generate species scores even if comm is supplied, unless the user specifically requests them via sppscores = TRUE.

Perhaps if we did keep the proposed function it would allow a user to add species scores to be added later if they are doing something very non-standard or bespoke. But we should bake this functionality into the functions that do the ordination so a user needs only to activate a switch when fitting the ordination rather than the two-step process.

Thoughts?

@jarioksa
Copy link
Contributor

Why have this (or similar) function? It seems that people want to have it. This only concerns distance-based methods (metaMDS, capscale and now dbrda) which have no native information about species or the rectangular community data in general. One of the most common vegan questions in StackOverflow is about adding species scores (usually as correlation vectors) onto ordination plots. So there is a need. It seems that people are increasingly using metaMDS and capscale using dissimilarities as input. Natively we have species scores in metaMDS and capscale if the function was called with community data and dissimilarities were calculated within the function (actually, metaMDS gained ability to handle dissimilarity input first in version 1.15-4 in 2009). So we have a list item for these scores but they may be missing. I had my doubts about adding species scores as projections in dbrda and that was a contributing reason of not implementing them (another reason was probably laziness). However, they do exist in capscale and all other constrained ordination methods. People ask them, and instead of repeating answers in StackOverflow, we could provide the tool.

Then how to do this?

I think having wascores is rather unproblematic, and currently we use the same transformations as for the data in metaMDS, and also print out information on prior transformation in deriving species scores. There is no comm= argument in metaMDS and the scores are always missing if dissimilarities only were supplied. The projections in constrained ordination are more problematic. In capscale the logic is that the scores are correct if Euclidean distances were used. This means that they are often wrong. We also have a comm= argument which allows calculations of species scores even when dissimilarities were supplied, allowing better or worse user choices than the inbuilt one. In dbrda we have nothing, but could have.

The suggested change was based on the idea of amending a solution with species scores. This would only concern methods with may have species scores, but happens not to have them, and this amendment would permanently change the result object (and current version saves no permanent information on the way this amendment was done: which data, how transformed). Another alternative is to supply functions that make easier to add scores in plots without saving them in the result object. For wascores this could be as simple as this untested change:

--- a/R/wascores.R
+++ b/R/wascores.R
@@ -1,8 +1,10 @@
 `wascores` <-
-    function (x, w, expand = FALSE) 
+    function (x, w, expand = FALSE, display = "sites", ...) 
 {
     if(any(w < 0) || sum(w) == 0)
         stop("weights must be non-negative and not all zero")
+    if (is.list(x))
+        x <- scores(x, display = display, ...)
     x <- as.matrix(x)
     w <- as.matrix(w)
     nc <- ncol(x)

For projections we would need a new function. People use now envfit, but that fits correlation arrows and ignores the effect of conditions terms on input data, and gives different results in partial ordination. Obviously we also would need an option to get the correlation arrows, though. The main point would be that the original ordination result will remain unchanged, but items can be added to graph or saved and analysed.

@jarioksa
Copy link
Contributor

If we have something to solve the scores problem, it should be light. Here some ideas and the reasons why I don't like them:

  • Having just function to return scores for plotting (my suggestion above) may work for wascores, but we run in the trouble with projection scores with scaling and const: The plotting function should know how the original plot was called and work consistently. We could have something similar as plot.envfit, but that will not honour the scaling in the original plot. Basically, if we plot a solution with species scores, and then use the new projection plot to display the same scores, these should be identical. We can have that for wascores, but it is far too tricky for projection scores, and would need new functions with several options.

  • Having a specific function to extract just species scores and its sister to replace species scores is too heavy. Actually, in an email I suggested scores<- function for adding species scores years ago, but you rejected the ideas as "non-Rish". If we have such a function, it should either refuse to over-write existing scores as this can ruin the object, or only work with some score types of some ordination methods. Having a new generic that most often refuses to work is not nice.

  • The situation really concerns only metaMDS and capscale which often have species scores, and dbrda whose kin have species scores and users may expect them. If having species scores is made dependent on sppscores = TRUE argument, we should re-fit the whole analysis to add the scores. This is especially impractical in metaMDS which may be very slow in large data sets, and which may not converge to the same solution in two runs. Species scores are supplementary information in distance-based ordination, and it should be possible to supplement this information after the analysis. Running distance-based methods with dissimilarities should not prevent from having this supplementary information, and the easiest way is to add those scores to the result. The only way to handle species scores consistently is to have them in the same object.

  • There are at least three reasons I did not implement species scores in dbrda: (1) the scores are often misleading with non-Euclidean dissimilarities, (2) I was lazy, (3) the needed code is ugly and I would like to get rid of that even in capscale. The basic code is OK, but dealing with subsets and missing values makes it ugly.

  • All information we need for projection scores of species are ordination of sampling units and eigenvalues. We actually have all relevant information also in vegan::wcmdscale and vegan::isomap and in stats::cmdscale as points and eig. Should we have something for these methods, if we have the supplementary service for dbrda and capscale? The complications are that wcmdscale can have non-equal weights which needs two more if(){} blocks, and stats::cmdscale has no class (it is just a list).

@jarioksa
Copy link
Contributor

jarioksa commented Oct 13, 2017

@gavinsimpson I started to think that a renamed replacement function makes sense for adding species scores. However, this seems to stretch the R concept of replacement a bit. For instance, the right-hand-side must be called value hinting that R thinks replacement is just putting unchanged value somewhere, but we use value to calculate the replacement. The synopsis would read sppscores(object) <- value. This may be confusing. Do you think this OK?

The function works, though, and it is clearer that we add something to ordination result than in the previous version. The replacement functions can only have two arguments: the left one being the object (usually called x in R functions), and the right one being the value. So we cannot modify the calculations, like we had first for metaMDS.

I'll deal with other aspects later (warnings, file names, documentation etc.). I'll familiarize with this approach first.

@jarioksa jarioksa mentioned this issue Oct 16, 2017
@jarioksa
Copy link
Contributor

jarioksa commented Oct 16, 2017

@gavinsimpson @philiphaupt With respect to the discussion here, PR #256 does the following:

  • Function is called -- after all -- sppscores. The Polish group of consonants (csc) in the middle of specscores was too much for me, and automatic command completion needed writing up to that group.
  • There is now a replacement function, but there is no extractor function, or there is a token extractor function that just tells that it is not implemented.
  • We do stretch a bit the concept of replacement function because sppscores(object) <- value does not add the value as such, but a function (projection, weighted averages) of value with the information from object.
  • Even if we had extractor function, the following would fail or give wrong results: sppscores(object) <- sppscores(object), because the right-hand-side result (value) is not directly added to object.
  • Changing sppscores<-() to sppscores() would change the idiom to object <- sppscores(object, value).
  • The method is implemented only for dbrda, capscale and metaMDS. It could be implemented for wcmdscale, monoMDS, isomap and in principle also to cmdscale, but these functions do not have support functions that could handle species scores. Except that ordiplot and friends can handle them.
  • I haven't changed wascores() to accept ordination object, although this seems to work. However, text(wascores(object, comm)) would fail because base::text() does not find the labels from row names, making this shortcut much more long-winding (text(wascores(object, comm), labels=colnames(comm)))
  • There are a couple of documentation fixes that update the information for (potentially) missing species scores in db-RDA. These should be cherry-picked even if Issue #254 #256 is not merged.

jarioksa pushed a commit that referenced this issue Oct 19, 2017
@jarioksa
Copy link
Contributor

Closed with PR #256

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants