More Stats Functions #1732

Open
6 tasks done
ds26gte opened this issue Apr 12, 2024 · 19 comments
@ds26gte
Contributor

ds26gte commented Apr 12, 2024

Issue brownplt/code.pyret.org#520 filed by @schanzer

We've had a few teachers ask if Pyret supports various stats functions:

  • population variance
  • sample variance
  • t-test (multiple kinds)
  • z-test
  • chi-squared test
  • multiple regression

Getting these implemented as a Pyret program would be great, but implementing them as part of Pyret's stats library would be much better.

(In keeping with the other stats functions, these should all operate on lists. I'll wrap them to work with tables in the DS teachpack.)
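For reference, the difference between the first two items on the list is only the divisor. Below is a minimal sketch in Python (not Pyret, and not the code in the stats library); the function names are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def population_variance(xs: Sequence[float]) -> float:
    """Mean squared deviation from the mean, dividing by n."""
    if len(xs) == 0:
        raise ValueError("population_variance needs at least one value")
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)


def sample_variance(xs: Sequence[float]) -> float:
    """Same sum of squares, but dividing by n - 1 (Bessel's correction)."""
    if len(xs) < 2:
        raise ValueError("sample_variance needs at least two values")
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / (len(xs) - 1)
```

Both functions take a plain list, matching the request that the library versions operate on lists and be wrapped for tables elsewhere.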

@ds26gte ds26gte self-assigned this Apr 12, 2024
ds26gte added a commit to ds26gte/pyret-lang that referenced this issue Apr 12, 2024
@shriram
Member

shriram commented Apr 13, 2024

Thanks, @ds26gte! Can you add some tests, please?

ds26gte added a commit to ds26gte/pyret-lang that referenced this issue Apr 13, 2024
…lt#1732

- statistics.arr: added exceptions for t-test-{pooled, independent}
@schanzer

@ds26gte awesome to see this progress! I'm still hoping we can add a z-test function as well (see checklist in the issue).

@ds26gte
Contributor Author

ds26gte commented Apr 13, 2024

@team, the z-test seems to require, in addition to the two samples, also the population (rather than the sample) variances. Please add what you think are the right arguments for the z-test and the other functions that I've already added.

ds26gte added a commit to ds26gte/pyret-lang that referenced this issue Apr 15, 2024
@schanzer

@ds26gte waiting to hear back about the desired contract from one of the teachers who requested these functions, which should give me a sense for whether these are close enough to what they need that I couldn't bridge the gap in a teachpack. Will wait to hear back.

@schanzer

@ds26gte I spoke with Nancy Pfenning today, who gave the following descriptions of what the inputs to various functions should be:

z-test: list of numbers, stddev, hypothesized mean
t-test: list of numbers, mean
2-sample t-test: 2 lists of numbers (can be different sizes), "tail-ness" (boolean operator? >,<, ≠?)
paired t-test: 2 lists of numbers (error if different sizes, order matters), "tail-ness" (boolean operator? >,<, ≠?)
pooled t-test: 2 lists of numbers, "tail-ness" (boolean operator? >,<, ≠?)
chi-squared: 2 lists of numbers (assumes pre-summarized data)

I think this is all in line with what you have, with the exception of the z-test. Can you double-check your implementation, and let me know why it has two lists of numbers?
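To make the one-sample inputs concrete, here is a hedged Python sketch (illustrative only, not the Pyret implementation) of the z and t computations with the argument lists described above. The function names are hypothetical, and the functions return the score rather than a test verdict.

```python
import math
from statistics import mean, stdev


def z_score_1_sample(xs, pop_stddev, hypothesized_mean):
    """z = (sample mean - hypothesized mean) / (population stddev / sqrt(n))."""
    n = len(xs)
    return (mean(xs) - hypothesized_mean) / (pop_stddev / math.sqrt(n))


def t_score_1_sample(xs, hypothesized_mean):
    """Same shape as the z-score, but the spread is estimated from the sample.

    statistics.stdev is the sample standard deviation (divides by n - 1).
    """
    n = len(xs)
    return (mean(xs) - hypothesized_mean) / (stdev(xs) / math.sqrt(n))
```

The only difference between the two is where the standard deviation comes from: the z version needs it as an argument, while the t version derives it from the list.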

ds26gte added a commit to ds26gte/pyret-lang that referenced this issue May 16, 2024
  list of known x-inputs, and a list of the corresponding known
  y-outputs, and returns a predictor function that takes a list of
  x-inputs and returns its estimated y-output brownplt#1732
- js/trove/multiple-regression.js contains the JS implementation of
  multiple-regression and all its matrix subroutines
- tests/test-statistics.arr: added a basic test (can add more from
  curriculum examples, when these are added)
ds26gte added a commit to ds26gte/pyret-lang that referenced this issue May 17, 2024
representing one input (setting of indep vars to values). The returned
predictor fn also takes an N-tuple brownplt#1732
ds26gte added a commit to ds26gte/pyret-lang that referenced this issue May 17, 2024
multiple-regression.js: clean-up w/ better row/col indexing names
ds26gte added a commit to ds26gte/pyret-lang that referenced this issue May 17, 2024
- check mulreg test on 1 var matches our linreg on same var
- add mulreg test for 2 vars
statistics.arr: add pointers to docs for formulas used
ds26gte added a commit to ds26gte/pyret-lang that referenced this issue May 17, 2024
ds26gte added a commit to ds26gte/pyret-lang that referenced this issue May 17, 2024
@schanzer

@ds26gte Sorry for the delay on this! I was hoping to hear back from the teacher who was requesting them, but they're overwhelmed with end-of-year stuff so I hopped on Zoom with Joy instead. :)

Below are the contract and purpose statements for the various functions that Bootstrap would export:

sample-variance :: Table, Column -> Number

pop-variance :: Table, Column -> Number

t-test-1-sample :: Table, String, Number -> Number

t-test-2-sample :: Table, String, String -> Number  #  this is the same as t-test-independent, so as long as one is implemented we're fine

t-test-paired :: Table, String, String -> Number 

t-test-pooled :: Table, Column1, Column2 -> Number

chi-sqr :: Table -> p-value # consumes a 2-way Table of observed counts

chi-sqr-gof :: Table, Table -> p-value # consumes a 1-col Table of observed counts, and a 1-col Table of expected counts

You'll want to replace Table in most of the contracts above with List, but for chi-sqr I'm assuming you want a list of lists? I'll wrap the functions in our library to keep everything in Table-land
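As a rough illustration of the list-of-lists idea for chi-sqr, the Python sketch below (not Pyret) computes the chi-squared statistic for a 2-way table of observed counts, plus the goodness-of-fit variant. The names are hypothetical, and the functions return the statistic only; turning it into a p-value would additionally need the chi-squared distribution, which is not shown here.

```python
def chi_sqr_statistic(observed):
    """Chi-squared statistic for a 2-way table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            # Expected count under independence of rows and columns.
            e = row_totals[i] * col_totals[j] / total
            stat += (o - e) ** 2 / e
    return stat


def chi_sqr_gof_statistic(observed, expected):
    """Goodness-of-fit statistic for two equal-length lists of counts."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))
```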

@ds26gte
Contributor Author

ds26gte commented Jun 5, 2024

(BTW, our naming needs to move away from contrasting linear against multiple. They are both linear -- it's actually single vs multiple.)

@schanzer

schanzer commented Jun 5, 2024

@ds26gte good call. I propose linear-regression and multiple-linear-regression, possibly also with single-linear-regression as an alias for the first.

@ds26gte
Contributor Author

ds26gte commented Jun 6, 2024

Looks like at least the googleable literature also contrasts linear against multiple. To be sure, multiple-regression describes an n-dimensional plane, which is not, in a geometric sense, linear. On the other hand, even in the single-dimensional case, we can contrast linear against quadratic and other higher powers, which we don't use.

Essentially, our code and curriculum only deal with predictor functions that operate on one or multiple independent variables, but in both cases only take the first power of the independent variable(s). We want names that capture this and also don't mislead.

@ds26gte
Contributor Author

ds26gte commented Jun 10, 2024

OK, apropos the various z-tests and t-tests, I don't think the things we're implementing are tests. Did we just want scores? If so, specifying the "tailness" as an argument makes no sense. The tailness is something you use along with the score in a subsequent (complicated) step for which we currently do not have code. This subsequent step could be automated, but it requires more coding.

The score gives us an abscissa to associate with our sample. The confidence level identifies one or two contiguous areas under the probability density function (normal, t, F, etc). The tailness is additional input that helps us identify this area. We then find the terminus abscissa associated with this area. Finally, we check if our own sample's abscissa is on the correct side of this terminus abscissa. So the test's result is a boolean.

As a coding task, what we need is the ability to find an abscissa given an area.

At a lower level, this means finding the root ("zero") of the difference between the integral of the function (with one integration bound varying) and a known area. This requires me to implement a suitable numerical integration function and a Newton-Raphson root-finding function. I can do both, but it is a big undertaking, so...

Do we want to do this?

Could you check with Nancy or against our curriculum goals? (The current texts don't mention anything, but maybe I'm not grepping expertly.)
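The "abscissa given an area" step can be sketched as follows. This is a minimal Python illustration (not Pyret, and not the Lua prototype mentioned elsewhere in this thread), restricted to the standard normal density as an assumption; all function names are hypothetical. It uses composite Simpson's rule for the integral and Newton-Raphson for the root.

```python
import math


def normal_pdf(x):
    """Standard normal probability density."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)


def simpson(f, a, b, n=200):
    """Composite Simpson's rule with n (even) subintervals."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3


def normal_cdf(x):
    """Area under the density to the left of x (half the area lies left of 0)."""
    return 0.5 + simpson(normal_pdf, 0.0, x)


def abscissa_for_area(area, tol=1e-10, max_iter=50):
    """Find x such that normal_cdf(x) == area, by Newton-Raphson.

    The derivative of the CDF is the density itself, so no numerical
    differentiation is needed.
    """
    if not 0 < area < 1:
        raise ValueError("area must be strictly between 0 and 1")
    x = 0.0
    for _ in range(max_iter):
        step = (normal_cdf(x) - area) / normal_pdf(x)
        x -= step
        if abs(step) < tol:
            return x
    raise RuntimeError("Newton-Raphson did not converge")
```

With this in hand, a right-tailed test at a given confidence level reduces to comparing the sample's score against `abscissa_for_area(confidence)`. The t and chi-squared densities would need their own density functions (and Γ), which this sketch does not cover.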

@ds26gte
Contributor Author

ds26gte commented Jun 14, 2024

Latest changes to z-, t- and chi- functions in commit 207d18b.

Using test in the function names as spec'd. However, please consider changing it to score or value, since these give an x-value for the related probability density function.

Note: if the original spec setter did mean test, i.e., a boolean output is desired, then we need to add libraries for numerical integration, Γ, Newton-Raphson, and various prob density functions, as outlined above. This can be done and if anyone wants to review my prototype in Lua, do lmk. (Γ is an improper integral, but the numerical-integration routine can be adapted for it.)

Important: there is a non-glaring typo on the Investopedia website in its formula for the pooled t-test. So I've checked all the t-test-* functions against a paper textbook.
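For readers who want to check the pooled formula themselves, here is a hedged Python sketch (not the Pyret code from the commit) of the standard textbook pooled two-sample t-score. The function name and the exact error conditions are assumptions made for illustration.

```python
import math
from statistics import mean, variance


def t_score_pooled(xs, ys):
    """Pooled two-sample t-score.

    The pooled variance weights each sample variance by its degrees of
    freedom; statistics.variance is the sample variance (divides by n - 1).
    """
    n1, n2 = len(xs), len(ys)
    if n1 < 2 or n2 < 2:
        raise ValueError("each sample needs at least two values")
    pooled_var = ((n1 - 1) * variance(xs) + (n2 - 1) * variance(ys)) / (n1 + n2 - 2)
    return (mean(xs) - mean(ys)) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))
```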

@schanzer

@ds26gte In Bootstrap:DS, everything is done via tables. The previous domain of linear-regression consumed a list of xs and a list of ys, which worked perfectly with Pyret's column->list machinery.

The current domain of multiple-regression, however, is a list of pairs, which requires a lot of munging in Pyret to convert a list-of-table-columns into a list-of-pairs. This munging obviously has to happen somewhere, but this feels like it should happen in JS-land, not Pyret-land.

Can we bring the domains of LR and MR into alignment, so that both consume a list of values on different axes?

@blerner
Member

blerner commented Jul 23, 2024

This is a 1-line wrapper for you, e.g.

t = table: x, y
  row: 1, 1
  row: 2, 4
  row: 3, 7
end

map2({(x, y): {x; y}}, t.get-column("x"), t.get-column("y"))

@schanzer

But that only works for 2 lists. What about 10? 20?

@blerner
Member

blerner commented Jul 24, 2024

Do you actually have any such scenarios in BS:DS?

@schanzer

If we never needed more than x and y, we'd be happy to stay with linear regression. The whole point of adding multiple regression is to allow for such scenarios, right? And if 10 is extreme, how about 5? 4? At some point relying on map<n> will break.

I have a solution that does what I want already, but I have real concerns about doing all this list munging in Pyret instead of JS. For a table with 5k rows, even a 3 column MR will require a pretty huge number of swaps in memory.

@blerner
Member

blerner commented Jul 24, 2024

First of all, no, the signature for multiple regression does not currently use tuples at all:

fun multiple-regression(x_s_s :: List<List<Number>>, y_s :: List<Number>) -> (List<Number> -> Number):

It's a list of lists of numbers, where each inner list is an individual sample of the data. You want the transpose of this, if you're trying to extract columns and do it that way.
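The transpose in question works for any number of columns, not just two. A minimal sketch in Python (illustrative only; the Pyret equivalent is not shown, and the function name is hypothetical):

```python
def columns_to_rows(columns):
    """Turn a list of equal-length columns into a list of per-sample rows."""
    lengths = {len(c) for c in columns}
    if len(lengths) > 1:
        raise ValueError("all columns must have the same length")
    # zip(*columns) pairs up the i-th entry of every column.
    return [list(row) for row in zip(*columns)]
```

For the two-column table earlier in the thread, `columns_to_rows([[1, 2, 3], [1, 4, 7]])` gives `[[1, 1], [2, 4], [3, 7]]`, and the same call handles 5 or 20 columns without a fixed-arity `map`.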

Second, @ds26gte , the easiest way for you to support this is to implement

fun multiple-regression-table(x_s_s :: Table, y_s :: List<Number>) -> (List<Number> -> Number)

that does the same thing as multiple-regression, but uses MX.table-to-matrix instead of MX.lists-to-matrix. (You should obviously extract a common, internal helper function for both of these, with signature

fun multiple-regression-matrix(m_xss :: Matrix, y_s :: List<Number>) -> (List<Number> -> Number)

that has converted x_s_s to a matrix already, and then does the rest of the math.)
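The "two thin wrappers over one shared helper" shape can be sketched like this. It is a Python illustration under stated assumptions, not the code in js/trove/multiple-regression.js: it assumes an intercept term, solves the normal equations by Gaussian elimination rather than an explicit matrix inversion, and uses a list of columns as a stand-in for a table. All names are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a*x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: predictors are collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def regression_from_rows(rows, ys):
    """Shared helper: least-squares fit, returning a predictor function."""
    if len(rows) == 0 or len(rows) != len(ys):
        raise ValueError("need one y per sample, and at least one sample")
    design = [[1.0] + list(r) for r in rows]  # leading 1 is the intercept
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(k)]
    beta = solve_linear_system(xtx, xty)

    def predict(x):
        return beta[0] + sum(b * v for b, v in zip(beta[1:], x))

    return predict


def multiple_regression(x_s_s, y_s):
    """Wrapper for data already given as one inner list per sample."""
    return regression_from_rows(x_s_s, y_s)


def multiple_regression_columns(columns, y_s):
    """Wrapper for column-oriented data (a stand-in for a table here)."""
    return regression_from_rows([list(r) for r in zip(*columns)], y_s)
```

Both wrappers only adapt their input shape; the numerical work lives in one place, so a fix or optimization there applies to both entry points.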

Third, @schanzer , you should use this API via table.select-columns to extract a sub-table, rather than repeatedly using table.column to extract lists.

@schanzer

I know MR doesn't use tuples - the issue is having to transpose tens of thousands of cells into the list format MR needs, and having to do it all in Pyret when it feels like this is a task for JS. Having this supported in the stats library as you propose would be fantastic.

@ds26gte is this something you can add? If so, I'll use the proper select-columns API to pass you the right table

@schanzer

@ds26gte never mind -- Ben and I spoke by phone, and he explained that I'm worrying about the wrong performance hit. If it's going to be slow anywhere, it'll happen in the matrix inversion.

I'm ready to sign off on this as-is, and if we find a real dataset for which this is a problem we can revisit the issue. Thanks for all your work on this!

4 participants