Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal: Include 'combinatorial' argument to t-test functions #9860

Closed
Rambatino opened this issue Feb 21, 2019 · 4 comments
Closed

Proposal: Include 'combinatorial' argument to t-test functions #9860

Rambatino opened this issue Feb 21, 2019 · 4 comments
Labels
needs-decision Items that need further discussion before they are merged or closed scipy.stats

Comments

@Rambatino
Copy link

Proposal

Add a new argument - combinatorial - in order for all combinations of the input ndarray to be compared to one another:

stats.ttest_ind(ndarray, combinatorial=True,...)

If the ndarray contains 5 vectors (np.array([array1, array2...array5)]) then the function would calculate 1vs2, 1vs3, 1vs4...2vs3, 2vs4,...4vs5.

Current Limitations

Currently the ttest_ind function requires that you pass in two array like variables, a & b, and compares one vector to the next:

>>> import numpy as np
>>> from scipy import stats
>>> array1 = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> array2 = np.array([1, 1, 1, 1, 1, 1, 1, 1, 1])
>>> stats.ttest_ind(array1, array2)
Ttest_indResult(statistic=4.3817804600413295, pvalue=0.00046449025177634345)

or as an ndarray

>>> stats.ttest_ind(np.array([array1, array2]).T, np.array([array2, array1]).T)
Ttest_indResult(statistic=array([ 4.38178046, -4.38178046]), pvalue=array([0.00046449, 0.00046449]))

Due to the structure of the arguments it means that only 1 comparison can be made: Thus requiring some combination, potentially using the itertools library to compare multiple groups (i.e. 1 vs 2, 2 vs 3 & 1 vs 3):

>>> import itertools as it
>>> array3 = np.array([7, 7, 7, 7, 7, 7, 7, 7, 7])
>>> ndarray = np.array([array1, array2, array3])
>>> [ stats.ttest_ind(ndarray[x[0], :], ndarray[x[1], :]) for x in it.combinations(range(0, ndarray.shape[0]), 2)]
[Ttest_indResult(statistic=4.3817804600413295, pvalue=0.00046449025177634345),
Ttest_indResult(statistic=-2.1908902300206647, pvalue=0.04360954625650814),
Ttest_indResult(statistic=-inf, pvalue=0.0)]

or this way:

>>> stats.ttest_ind(np.array([array1, array2, array1]).T, np.array([array2, array3, array3]).T)
Ttest_indResult(statistic=array([ 4.38178046,        -inf, -2.19089023]), pvalue=array([0.00046449, 0.        , 0.04360955]))

However, it's not immediately obvious how to construct the input variables when doing combinations and suppose the dimensions become very large: It would require very large input ndarrays to be constucted, with large amounts of duplicated data. It therefore seems sensible that the arguments would support a general combinatorial case.

Internally I wrote a series of functions for the different t-tests that do support this input using matrix broadcasting and thus removing the need for python looping external to the function or constucting large input ndarrays:

>>> p_values, score = t_test_(ndarray, None, {'paired': False, 'equal_variance': True})
>>> p_values
array([[1.00000000e+00, 4.64490252e-04, 4.36095463e-02],
       [4.64490252e-04,            nan, 0.00000000e+00],
       [4.36095463e-02, 0.00000000e+00,            nan]])
>>> score
array([[ 0.        ,  4.38178046, -2.19089023],
       [-4.38178046,         nan,        -inf],
       [ 2.19089023,         inf,         nan]])

Internal function:

def t_test_(v, weights, opts):
    base = np.nansum(~np.isnan(v)*1, axis=1)
    mean = np.nanmean(v, axis=1)
    t_nom = mean[:, np.newaxis] - mean
    s = np.nanvar(v, axis=1, ddof=1)
    if opts['equal_variance']:
        sp = ((base - 1) * s + (base[:, np.newaxis] - 1) * s[:, np.newaxis]) / (base + base[:, np.newaxis] - 2)
        score = t_nom / (np.sqrt(sp) * np.sqrt(1 / base + 1 / base[:, np.newaxis]))
        dof = base + base[:, np.newaxis] - 2
    else:
        s_n = s / base
        score = t_nom / np.sqrt(s_n + s_n[:, np.newaxis])
        dof_denom = (s_n**2 / (base - 1)) + (s_n[:, np.newaxis]**2 / (base[:, np.newaxis] - 1))
        dof = (s_n + s_n[:, np.newaxis])**2 / dof_denom
    return stats.t.sf(np.abs(score), dof) * 2, score

Benchmark

>>> array4 = np.array([11, 3, 2, 1, 5, 2, 1, 3, 6])
>>> array5 = np.array([11, 3, 2, 1, 5, 2, 1, 6, 6])
>>> ndarray = np.array([array1, array2, array3, array4, array5])
>>> def test_current():
...     return [ stats.ttest_ind(ndarray[x[0], :], ndarray[x[1], :], nan_policy='omit') for x in it.combinations(range(0, ndarray.shape[0]), 2)]
>>> def test_new():
...     return standard_t_test_(ndarray, None, {'paired': False, 'equal_variance': True})
>>> print(timeit.timeit("test_current()", "from __main__ import test_current", number=100000))
202.31119344299987
>>> print(timeit.timeit("test_new()", "from __main__ import test_new", number=100000))
20.02507360100026

This demonstrates that the new function is around 10 times faster when comparing 5 groups with one another when comparing to the python looping way.

Summary

What are the implications for adding a feature such as this? Statistically running multiple tests would affect the p-value performance as it would permit p-hacking/mining - however I'd argue that that decision rests with the statistician, not the statistical library.

I'd like to know people's thoughts on a feature such as - or similar to - this.

@rgommers rgommers added scipy.stats needs-decision Items that need further discussion before they are merged or closed labels Feb 21, 2019
@rgommers
Copy link
Member

stats.ttest_ind(ndarray, combinatorial=True,...)

that changes the function signature, which breaks backwards compatibility - so that's a no go. the way to do it would be stats.ttest_ind(a, b=None, ..., combinatorial=True), where a can then be a single array or a 2-D array or a sequence of arrays. It's a bit ugly.

Statistically running multiple tests would affect the p-value performance as it would permit p-hacking/mining - however I'd argue that that decision rests with the statistician, not the statistical library.

It would be a bit weird to return a p-value from this function, and then tell the user that they need to modify it (or, not use 0.05/0.95). So either we support it properly, with the return values needed to let the user easily do the right thing without increasing type I errors, or we just point them to f_oneway (ANOVA is the better way to do this I think).

@Rambatino
Copy link
Author

Hi @rgommers. yes, it is ugly because you effectively halve the desired required inputs and accommodate for that by passing a new argument set to true. Furthermore the type of ‘a’ can’t then be a single array if you have combinatorial set to true, which further confuses the arguments to the function.

I couldn’t think of a better way to do it, however, save for defining a new method. I’ll have a look elsewhere in the library and see if there are potentially some indicators of how better to potentially handle this use-case (of mutually antagonistic argument sets)

Wrt ANOVA, yes, that would be the proper way to test if there are any differences between the groups but it doesn’t tell you what those differences are which would require further investigation (an anova would be the first step, potentially).

return values needed to let the user easily do the right thing without increasing type I errors

Do we need to modify p values (or return other modified p values) for the multiple comparisons? If you pass ndarrays as a & b, the function doesn’t aim to alter the p values in that case does it? I wouldn’t assume it would be the job of the function to do this, as we are aware that we need to apply a correction (usually bonferonni) after running these multiple comparisons.

In short, do you see a benefit to incorporating this methodology into scipy (or publishing it elsewhere) or should I just keep it internally?

@chrisb83
Copy link
Member

The combination of the keyword combinatorial and axis will be very confusing in my view. While combinatorial would speed up the computation compared to looping over the different combinations, I am not sure whether adding this functionality is a good choice.

@Rambatino
Copy link
Author

Agreed!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs-decision Items that need further discussion before they are merged or closed scipy.stats
Projects
None yet
Development

No branches or pull requests

3 participants