[Epic] A collection of issues for extending the Aggregation function #12254

Open
1 of 7 tasks
Weijun-H opened this issue Aug 30, 2024 · 8 comments
Labels
enhancement New feature or request

Comments

@Weijun-H
Member

Weijun-H commented Aug 30, 2024

Is your feature request related to a problem or challenge?

DataFusion now supports several aggregation functions, but it still lacks some common ones that are essential for a broader range of data processing tasks. To make DataFusion more versatile and capable of handling diverse workloads, it should include additional aggregation functions commonly used in data analysis, such as mode and max_by.
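For readers unfamiliar with the two functions named above, here is a minimal, std-only Rust sketch of what `mode` and `max_by` compute. It does not use DataFusion's API. The function signatures and the tie-breaking rules (smallest value wins for `mode`, first row wins for `max_by`) are assumptions made for illustration only.

```rust
use std::collections::HashMap;
use std::hash::Hash;

/// Returns the most frequent value in `values`, or `None` for empty input.
/// Assumption of this sketch: on a tie, the smallest value wins.
pub fn mode<T: Eq + Hash + Ord + Clone>(values: &[T]) -> Option<T> {
    let mut counts: HashMap<&T, usize> = HashMap::new();
    for v in values {
        *counts.entry(v).or_insert(0) += 1;
    }
    counts
        .into_iter()
        // Higher count wins; on equal counts the smaller value wins.
        .max_by(|(a, ca), (b, cb)| ca.cmp(cb).then_with(|| b.cmp(a)))
        .map(|(v, _)| v.clone())
}

/// Returns the `x` of the row whose `y` is largest, or `None` for empty input.
/// Assumption of this sketch: on a tie, the first such row wins.
pub fn max_by<X: Clone, Y: Ord>(rows: &[(X, Y)]) -> Option<X> {
    let mut best: Option<&(X, Y)> = None;
    for row in rows {
        match best {
            // Keep the current best unless this row is strictly larger.
            Some(b) if row.1 <= b.1 => {}
            _ => best = Some(row),
        }
    }
    best.map(|(x, _)| x.clone())
}
```

For example, `mode(&[1, 2, 2, 3])` yields `Some(2)`, and `max_by(&[("a", 1), ("b", 5)])` yields `Some("b")`.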

Describe the solution you'd like

Describe alternatives you've considered

No response

Additional context

No response

@Weijun-H Weijun-H added the enhancement New feature or request label Aug 30, 2024
@Weijun-H Weijun-H changed the title [Epic] Extend the Aggregation function [Epic] A collection of issues for extending the Aggregation function Aug 30, 2024
@alamb
Contributor

alamb commented Sep 5, 2024

I wonder if we should consider where to draw the line on which aggregate functions to include in the core (i.e. should we include all these new functions?)

Now that all aggregate functions use the same API, we could potentially keep more specialized functions such as those listed here outside the core -- either in their own crate or even their own repo -- and then have other code integrate them -- e.g. #11979
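To illustrate why a shared API makes this split possible, here is a rough, std-only sketch. The names `Accumulator`, `Registry`, `with_builtins`, and `register_extra` are hypothetical stand-ins for this example and are not DataFusion's actual types or functions. The idea is only that a function living outside the core can plug in through the same interface the built-ins use.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the single interface every aggregate implements.
pub trait Accumulator {
    fn update(&mut self, value: i64);
    fn evaluate(&self) -> Option<i64>;
}

/// A factory creates a fresh accumulator for each aggregation.
pub type Factory = fn() -> Box<dyn Accumulator>;

/// Hypothetical function registry owned by the "core".
pub struct Registry {
    factories: HashMap<String, Factory>,
}

impl Registry {
    /// The core ships only a small set of built-ins (here: `sum`).
    pub fn with_builtins() -> Self {
        let mut registry = Registry { factories: HashMap::new() };
        registry.register("sum", new_sum);
        registry
    }

    pub fn register(&mut self, name: &str, factory: Factory) {
        self.factories.insert(name.to_string(), factory);
    }

    /// Runs the named aggregate over `values`; `None` if the name is unknown.
    pub fn aggregate(&self, name: &str, values: &[i64]) -> Option<i64> {
        let factory = *self.factories.get(name)?;
        let mut acc = factory();
        for &v in values {
            acc.update(v);
        }
        acc.evaluate()
    }
}

// ---- built-in, lives in the core ----
struct SumAcc {
    total: Option<i64>,
}

impl Accumulator for SumAcc {
    fn update(&mut self, value: i64) {
        self.total = Some(self.total.unwrap_or(0) + value);
    }
    fn evaluate(&self) -> Option<i64> {
        self.total
    }
}

fn new_sum() -> Box<dyn Accumulator> {
    Box::new(SumAcc { total: None })
}

// ---- extension, could live in a separate crate or repo ----
#[derive(Default)]
struct ModeAcc {
    counts: HashMap<i64, u64>,
}

impl Accumulator for ModeAcc {
    fn update(&mut self, value: i64) {
        *self.counts.entry(value).or_insert(0) += 1;
    }
    fn evaluate(&self) -> Option<i64> {
        self.counts
            .iter()
            // Highest count wins; ties go to the smaller value (an assumption).
            .max_by(|(a, ca), (b, cb)| ca.cmp(cb).then_with(|| b.cmp(a)))
            .map(|(v, _)| *v)
    }
}

fn new_mode() -> Box<dyn Accumulator> {
    Box::new(ModeAcc::default())
}

/// The single entry point an extension crate would expose.
pub fn register_extra(registry: &mut Registry) {
    registry.register("mode", new_mode);
}
```

In this sketch the core never mentions `mode`; a downstream application opts in by calling `register_extra` on its registry.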

@alamb
Contributor

alamb commented Sep 6, 2024

I started a discussion about whether we should add all these functions directly to the core here: #12357

@Weijun-H
Member Author

Weijun-H commented Sep 6, 2024

> I wonder if we should consider where to draw the line on which aggregate functions to include in the core (i.e. should we include all these new functions?)
>
> Now that all aggregate functions use the same API, we could potentially keep more specialized functions such as those listed here outside the core -- either in their own crate or even their own repo -- and then have other code integrate them -- e.g. #11979

I like this idea! 🚀

@alamb
Contributor

alamb commented Sep 16, 2024

@Weijun-H and @dmitrybugakov and @dharanad -- what do you think about creating a datafusion-functions-duckdb repo in datafusion-contrib, similar to https://github.com/datafusion-contrib/datafusion-functions-json for JSON from @samuelcolvin and co.?

It would be a pretty neat way to help build out the function library in DataFusion and would show off its extensibility

I could then try to integrate it into dft, which @matthewmturner and I have been working on: https://github.com/datafusion-contrib/datafusion-dft which would make it easier to use

Originally from: #12476 (comment)

@austin362667
Contributor

Thank you @alamb for proposing this initiative. I like this idea. What do others think?
It clearly draws a line between the core and the extensions, and we can still use those functions as extensions in dft.

@dmitrybugakov
Contributor

@alamb
I’m generally in favor of moving additional features from the core to separate subprojects. However, could we consider using a more general name than datafusion-functions-duckdb? What are your thoughts on using something like datafusion-functions-sql or another broader name?

@alamb
Contributor

alamb commented Sep 17, 2024

> @alamb I’m generally in favor of moving additional features from the core to separate subprojects. However, could we consider using a more general name than datafusion-functions-duckdb? What are your thoughts on using something like datafusion-functions-sql or another broader name?

I do not have a strong preference -- I think it likely depends on the use case:

  • trying to migrate spark workloads
  • trying to migrate duckdb workloads
  • trying to make the most useful sql system you can

However, let's not get too carried away with details at the moment.

I created https://github.com/datafusion-contrib/datafusion-functions-extra and added @dmitrybugakov and @austin362667 as admins. If anyone else wants to help let me know and we can add you too.

@dmitrybugakov or @austin362667 would you be willing to set up the basic skeleton of the repo?

Perhaps you could follow the model of https://github.com/datafusion-contrib/datafusion-functions-json for the README and the registration function

And then try to put the mode function there: #12385

@alamb
Contributor

alamb commented Sep 17, 2024

Update: I also added @Lordworms (a longtime DataFusion contributor) per #12284 (comment)
