Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add regr_slope() aggregate function #7135

Merged
merged 1 commit into from
Aug 1, 2023
Merged

Conversation

2010YOUY01
Copy link
Contributor

Which issue does this PR close?

NA

Rationale for this change

I'm planning to add PostgreSQL linear regression aggregate functions. Their implementations are similar, so I'd like to implement regr_slope(y, x) first, and the remaining functions should be easy to extend.

What changes are included in this PR?

Adding regr_slope(y, x) aggregate function, mainly implementing Accumulator for RegrSlopeAccumulator
The linear regression is calculated using COVAR_POP(x,y) / VAR_POP(x)
One-pass and parallel algorithm for covar/var reference: https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Welford's_online_algorithm

Are these changes tested?

All end-to-end tests produced the same result compared to Postgres

Are there any user-facing changes?

No

@github-actions github-actions bot added logical-expr Logical plan and expressions physical-expr Physical Expressions core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) labels Jul 30, 2023
@alamb
Copy link
Contributor

alamb commented Jul 30, 2023

Thank you @2010YOUY01 -- I'll check this out carefully over the next few days.

One thing I wonder about is how large the datafusion library is getting. I was thinking maybe it would be useful to start splitting some of the more advanced / special purpose functionality into their own crates, as partly described in #7110. Curious what you think about this

@alamb
Copy link
Contributor

alamb commented Jul 31, 2023

I am sorry I ran out of time today to review this. It is on my list for tomorrow

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @2010YOUY01 -- I found this PR easy to read and well tested and documented. I think it is a very nice contribution

Looks like it is implemented in several other databases:

https://www.vertica.com/docs/9.3.x/HTML/Content/Authoring/SQLReferenceManual/Functions/Aggregate/REGR_SLOPE.htm

https://docs.snowflake.com/en/sql-reference/functions/regr_slope

@@ -384,6 +385,22 @@ var_samp(expression)
- **expression**: Expression to operate on.
Can be a constant, column, or function, and any combination of arithmetic operators.

### `regr_slope`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@alamb
Copy link
Contributor

alamb commented Aug 1, 2023

I don't really have any suggestions or comments for this PR so merging it in

@alamb alamb merged commit a9561a0 into apache:main Aug 1, 2023
21 checks passed
This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate logical-expr Logical plan and expressions physical-expr Physical Expressions sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants