More Stats Functions #1732

ds26gte · 2024-04-12T20:31:11Z

Issue brownplt/code.pyret.org#520 filed by @schanzer

We've had a few teachers ask if Pyret supports various stats functions:

Getting these implemented as a Pyret program would be great, but implementing them as part of Pyret's stats library would be much better.

(In keeping with the other stats functions, these should all operate on lists. I'll wrap them to work with tables in the DS teachpack.)

shriram · 2024-04-13T00:55:59Z

Thanks, @ds26gte! Can you add some tests, please?

schanzer · 2024-04-13T19:03:16Z

@ds26gte awesome to see this progress! I'm still hoping we can add a z-test function as well (see checklist in the issue).

ds26gte · 2024-04-13T19:03:27Z

@team, the z-test seems to require, in addition to the two samples, the population (rather than the sample) variances. Please add what you think are the right arguments for the z-test and for the other functions that I've already added.

schanzer · 2024-04-23T19:13:25Z

@ds26gte waiting to hear back about the desired contract from one of the teachers who requested these functions, which should give me a sense for whether these are close enough to what they need that I couldn't bridge the gap in a teachpack. Will wait to hear back.

schanzer · 2024-04-26T19:43:59Z

@ds26gte I spoke with Nancy Pfenning today, who gave the following descriptions of what the inputs to various functions should be:

z-test: list of numbers, stddev, hypothesized mean
t-test: list of numbers, mean
2-sample t-test: 2 lists of numbers (can be different sizes), "tail-ness" (boolean operator? >,<, ≠?)
paired t-test: 2 lists of numbers (error if different sizes, order matters), "tail-ness" (boolean operator? >,<, ≠?)
pooled t-test: 2 lists of numbers, "tail-ness" (boolean operator? >,<, ≠?)
chi-squared: 2 lists of numbers (assumes pre-summarized data)

I think this is all in line with what you have, with the exception of the z-test. Can you double-check your implementation, and let me know why it has two lists of numbers?
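A minimal sketch of the one-sample z-score implied by the description above (list of numbers, stddev, hypothesized mean). It is written in Python rather than Pyret purely for illustration, and the name `z_score` and its argument names are hypothetical, not the names used in statistics.arr.

```python
import math

def z_score(sample, pop_stddev, hypothesized_mean):
    """One-sample z statistic: (sample mean - mu0) / (sigma / sqrt(n))."""
    n = len(sample)
    if n == 0:
        raise ValueError("sample must be non-empty")
    if pop_stddev <= 0:
        raise ValueError("population standard deviation must be positive")
    sample_mean = sum(sample) / n
    # Standard error of the mean when the population stddev is known.
    standard_error = pop_stddev / math.sqrt(n)
    return (sample_mean - hypothesized_mean) / standard_error
```

Note that this takes a single list plus two numbers, which is the shape Nancy describes; a two-list variant would need the two population variances instead.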

…list of known x-inputs, and a list of the corresponding known y-outputs, and returns a predictor function that takes a list of x-inputs and returns its estimated y-output brownplt#1732
- js/trove/multiple-regression.js contains the JS implementation of multiple-regression and all its matrix subroutines
- tests/test-statistics.arr: added a basic test (can add more from curriculum examples, when these are added)

schanzer · 2024-05-29T20:29:15Z

@ds26gte Sorry for the delay on this! I was hoping to hear back from the teacher who was requesting them, but they're overwhelmed with end-of-year stuff so I hopped on Zoom with Joy instead. :)

Below are the contract and purpose statements for the various functions that Bootstrap would export:

sample-variance :: Table, Column -> Number

pop-variance :: Table, Column -> Number

t-test-1-sample :: Table, String, Number -> Number

t-test-2-sample :: Table, String, String -> Number  #  this is the same as t-test-independent, so as long as one is implemented we're fine

t-test-paired :: Table, String, String -> Number 

t-test-pooled :: Table, Column1, Column2 -> Number

chi-sqr :: Table -> p-value # consumes a 2-way Table of observed counts

chi-sqr-gof :: Table, Table -> p-value # consumes a 1-col Table of observed counts, and a 1-col Table of expected counts

You'll want to replace Table in most of the contracts above with List, but for chi-sqr I'm assuming you want a list of lists? I'll wrap the functions in our library to keep everything in Table-land.
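To make the "list of lists" shape for chi-sqr concrete, here is a rough sketch in Python (not Pyret, for illustration only). It computes only the Pearson chi-square statistic for a 2-way table of observed counts. It does not produce the p-value named in the contract, since that needs the chi-square distribution. The function name is hypothetical.

```python
def chi_square_statistic(observed):
    """Pearson chi-square statistic for a 2-way table given as a list of rows."""
    if not observed or any(len(row) != len(observed[0]) for row in observed):
        raise ValueError("observed must be a non-empty rectangular list of lists")
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    if 0 in row_totals or 0 in col_totals:
        raise ValueError("every row and column needs a non-zero total")
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, count in enumerate(row):
            # Expected count under independence of the two variables.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (count - expected) ** 2 / expected
    return statistic
```

For example, `chi_square_statistic([[10, 20], [20, 10]])` has an expected count of 15 in every cell, so the statistic is 4 × 25/15.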

ds26gte · 2024-06-05T14:15:06Z

(BTW, our naming needs to move away from contrasting linear against multiple. They are both linear -- it's actually single vs multiple.)

schanzer · 2024-06-05T17:30:02Z

@ds26gte good call. I propose linear-regression and multiple-linear-regression, possibly also with single-linear-regression as an alias for the first.

ds26gte · 2024-06-06T13:34:40Z

Looks like at least the googleable literature also contrasts linear against multiple. To be sure, multiple-regression describes an n-dimensional plane, which is not, in a geometric sense, linear. On the other hand, even in the single-dimensional case, we can contrast linear against quadratic and other higher powers, which we don't use.

Essentially, our code and curriculum only deal with predictor functions that operate on one or multiple independent variables, but in both cases only take the first power of the independent variable(s). We want names that capture this and also don't mislead.

ds26gte · 2024-06-10T16:21:53Z

OK, apropos the various z-tests and t-tests, I don't think the things we're implementing are tests. Did we just want scores? If so, specifying the "tailness" as an argument makes no sense. The tailness is something you use along with the score in a subsequent (complicated) step for which we currently do not have code. This subsequent step could be automated, but it requires more coding.

The score gives us an abscissa to associate with our sample. The confidence level identifies one or two contiguous areas under the probability density function (normal, t, F, etc). The tailness is additional input that helps us identify this area. We then find the terminus abscissa associated with this area. Finally, we check if our own sample's abscissa is on the correct side of this terminus abscissa. So the test's result is a boolean.

As a coding task, what we need is the ability to find an abscissa given an area.

At a lower level, this means finding the root ("zero") of the difference between the integral of the function (with one integration bound varying) and a known area. This requires me to implement a suitable numerical integration function and a Newton-Raphson root-finding function. I can do both, but it is a big undertaking, so...

Do we want to do this?

Could you check with Nancy or against our curriculum goals? (The current texts don't mention anything, but maybe I'm not grepping expertly.)
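The "find an abscissa given an area" step can be sketched roughly as follows, in Python for illustration rather than Pyret, and for the standard normal density only. Composite Simpson's rule stands in for the numerical integration and Newton-Raphson finds the root of cdf(x) − area. All function names are hypothetical, and the t and chi-square densities would additionally need Γ.

```python
import math

def normal_pdf(x):
    """Standard normal probability density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def simpson(f, a, b, n=200):
    """Composite Simpson's rule on [a, b] with an even number n of subintervals."""
    if n % 2:
        n += 1
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3.0

def normal_cdf(x):
    # By symmetry, P(Z <= x) = 0.5 + integral of the pdf from 0 to x.
    return 0.5 + simpson(normal_pdf, 0.0, x)

def abscissa_for_area(area, tol=1e-10, max_iter=50):
    """Newton-Raphson on g(x) = cdf(x) - area, whose derivative is the pdf."""
    if not 0.0 < area < 1.0:
        raise ValueError("area must be strictly between 0 and 1")
    x = 0.0
    for _ in range(max_iter):
        step = (normal_cdf(x) - area) / normal_pdf(x)
        x -= step
        if abs(step) < tol:
            return x
    raise RuntimeError("Newton-Raphson did not converge")

def rejects_null_upper_tail(z, confidence=0.95):
    """Boolean test result: is the sample's z-score beyond the terminus abscissa?"""
    return z > abscissa_for_area(confidence)
```

With this in place, the score functions stay as they are and the tailness only affects which area is passed in and which side of the terminus is checked.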

ds26gte · 2024-06-14T17:21:09Z

Latest changes to z-, t- and chi- functions in commit 207d18b.

Using test in the function names as spec'd. However, please consider changing it to score or value, since these give an x-value for the related probability density function.

Note: if the original spec setter did mean test, i.e., a boolean output is desired, then we need to add libraries for numerical integration, Γ, Newton-Raphson, and various prob density functions, as outlined above. This can be done and if anyone wants to review my prototype in Lua, do lmk. (Γ is an improper integral, but the numerical-integration routine can be adapted for it.)

Important: there is a non-glaring typo on the Investopedia website in its formula for the pooled t-test. So I've checked all the t-test-* functions against a paper textbook.
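For reference, the textbook pooled two-sample t statistic can be sketched as below. This is Python for illustration, not the Pyret code in statistics.arr, and the function names are hypothetical. The pooled variance weights each sample variance by its degrees of freedom.

```python
import math

def sample_variance(xs):
    """Unbiased sample variance (divides by n - 1)."""
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

def t_score_pooled(xs, ys):
    """Pooled t statistic: (mean1 - mean2) / sqrt(sp2 * (1/n1 + 1/n2))."""
    n1, n2 = len(xs), len(ys)
    # Pooled variance: sample variances weighted by degrees of freedom.
    pooled_var = ((n1 - 1) * sample_variance(xs)
                  + (n2 - 1) * sample_variance(ys)) / (n1 + n2 - 2)
    if pooled_var == 0:
        raise ValueError("both samples have zero variance")
    mean_diff = sum(xs) / n1 - sum(ys) / n2
    return mean_diff / math.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
```

The two samples may have different sizes; the result is a score with n1 + n2 − 2 degrees of freedom, not a boolean test outcome.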

schanzer · 2024-07-23T15:05:34Z

@ds26gte In Bootstrap:DS, everything is done via tables. The previous domain of linear-regression consumed a list of xs and a list of ys, which worked perfectly with Pyret's column->list machinery.

The current domain of multiple-regression, however, is a list of pairs, which requires a lot of munging in Pyret to convert a list-of-table-columns into a list-of-pairs. This munging obviously has to happen somewhere, but this feels like it should happen in JS-land, not Pyret-land.

Can we bring the domains of LR and MR into alignment, so that both consume a list of values on different axes?

blerner · 2024-07-23T19:42:07Z

This is a 1-line wrapper for you, e.g.

t = table: x, y
  row: 1, 1
  row: 2, 4
  row: 3, 7
end

map2({(x, y): {x; y}}, t.get-column("x"), t.get-column("y"))

schanzer · 2024-07-24T02:55:33Z

But that only works for 2 lists. What about 10? 20?

blerner · 2024-07-24T03:31:58Z

Do you actually have any such scenarios in BS:DS?

schanzer · 2024-07-24T13:33:41Z

If we never needed more than x and y, we'd be happy to stay with linear regression. The whole point of adding multiple regression is to allow for such scenarios, right? And if 10 is extreme, how about 5? 4? At some point relying on map<n> will break.

I have a solution that does what I want already, but I have real concerns about doing all this list munging in Pyret instead of JS. For a table with 5k rows, even a 3 column MR will require a pretty huge number of swaps in memory.

blerner · 2024-07-24T14:03:47Z

First of all, no, the signature for multiple regression does not currently use tuples at all:

pyret-lang/src/arr/trove/statistics.arr, line 227 in 448cfdf:

fun multiple-regression(x_s_s :: List<List<Number>>, y_s :: List<Number>) -> (List<Number> -> Number):

It's a list of lists of numbers, where each inner list is an individual sample of the data. You want the transpose of this, if you're trying to extract columns and do it that way.

Second, @ds26gte , the easiest way for you to support this is to implement

fun multiple-regression-table(x_s_s :: Table, y_s :: List<Number>) -> (List<Number> -> Number)

that does the same thing as multiple-regression, but uses MX.table-to-matrix instead of MX.lists-to-matrix. (You should obviously extract a common, internal helper function for both of these, with signature

fun multiple-regression-matrix(m_xss :: Matrix, y_s :: List<Number>) -> (List<Number> -> Number)

that has converted x_s_s to a matrix already, and then does the rest of the math.)

Third, @schanzer , you should use this API via table.select-columns to extract a sub-table, rather than repeatedly using table.column to extract lists.
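The structure suggested here (one internal helper doing the math, thin wrappers per input shape) can be sketched as follows. This is Python for illustration only, not the code in multiple-regression.js. It solves the normal equations by Gaussian elimination rather than an explicit matrix inversion, and every function name is hypothetical. The column-oriented wrapper shows that the transpose works for any number of columns, not just two.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system: predictors are linearly dependent")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x

def regression_from_rows(rows, ys):
    """Shared helper: each row is one sample's list of predictor values."""
    if len(rows) != len(ys):
        raise ValueError("need one y value per sample")
    design = [[1.0] + list(row) for row in rows]  # leading 1.0 is the intercept term
    k = len(design[0])
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(design, ys)) for i in range(k)]
    beta = solve_linear_system(xtx, xty)

    def predict(sample):
        return beta[0] + sum(b * v for b, v in zip(beta[1:], sample))
    return predict

def regression_from_columns(columns, ys):
    """Wrapper for column-oriented input: transpose columns into per-sample rows."""
    rows = [list(sample) for sample in zip(*columns)]
    return regression_from_rows(rows, ys)
```

For data generated by y = 1 + 2·x1 + 3·x2, both entry points return a predictor that gives 13 for the sample [3, 2].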

schanzer · 2024-07-24T14:26:59Z

I know MR doesn't use tuples - the issue is having to transpose tens of thousands of cells into the list format MR needs, and having to do it all in Pyret when it feels like this is a task for JS. Having this supported in the stats library as you propose would be fantastic.

@ds26gte is this something you can add? If so, I'll use the proper select-columns API to pass you the right table

schanzer · 2024-07-24T15:02:49Z

@ds26gte never mind -- Ben and I spoke by phone, and he explained that I'm worrying about the wrong performance hit. If it's going to be slow anywhere, it'll happen in the matrix inversion.

I'm ready to sign off on this as-is, and if we find a real dataset for which this is a problem we can revisit the issue. Thanks for all your work on this!

ds26gte self-assigned this Apr 12, 2024

ds26gte added a commit to ds26gte/pyret-lang that referenced this issue Apr 12, 2024

a52ef2d: statistics.arr: added variance, variance-sample, t-test-paired, t-test-pooled, t-test-independent, chi-square brownplt#1732

ds26gte added a commit to ds26gte/pyret-lang that referenced this issue Apr 13, 2024

2ba6dd1: - test-statistics.arr: added tests for the new stats functions brownplt#1732 - statistics.arr: added exceptions for t-test-{pooled, independent}

ds26gte added a commit to ds26gte/pyret-lang that referenced this issue Apr 15, 2024

363038c: Added z-test impl & test brownplt#1732

schanzer mentioned this issue Apr 26, 2024

More Stats Functions brownplt/code.pyret.org#520 (Closed, 5 tasks)

ds26gte added a commit to ds26gte/pyret-lang that referenced this issue May 17, 2024

7cf32b9: multiple-regression(): first arg is now a list of N-tuples (each N-tuple representing one input (setting of indep vars to values). The returned predictor fn also takes an N-tuple brownplt#1732

ds26gte added a commit to ds26gte/pyret-lang that referenced this issue May 17, 2024

7501262: statistics.arr: Correct type of multiple-regression() brownplt#1732; multiple-regression.js: clean-up w/ better row/col indexing names

ds26gte added a commit to ds26gte/pyret-lang that referenced this issue May 17, 2024

d341e33: test-statistics.arr: brownplt#1732 - check mulreg test on 1 var matches our linreg on same var - add mulreg test for 2 vars statistics.arr: add pointers to docs for formulas used

ds26gte added a commit to ds26gte/pyret-lang that referenced this issue May 17, 2024

6f09230: multiple-regression.js: gen'd predictor function shd check that all its arg tuple's elts are numbers brownplt#1732

ds26gte added a commit to ds26gte/pyret-lang that referenced this issue May 17, 2024

cb49a3f: multiple-regression.js: better exception msgs brownplt#1732

ds26gte added a commit to ds26gte/pyret-lang that referenced this issue May 19, 2024

b9ea62d: redefine linear-regression() as a special case of multiple-regression() brownplt#1732
