-
Notifications
You must be signed in to change notification settings - Fork 294
feat: basic generator udf #5036
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Greptile Summary
This PR introduces generator UDF functionality to Daft via the @daft.func.gen
decorator, enabling users to create functions that yield multiple values per input row (1-to-N transformations). The implementation adds a new GeneratorUdf
class in daft/udf/gen.py
that wraps Python generator functions and converts them into Daft expressions. Currently, this is implemented as a temporary workaround using existing row-wise UDF infrastructure combined with explode operations rather than native generator support.
The changes include several architectural improvements: refactoring the existing RowWiseUdf
class to use shared utility functions from the new daft/udf/_internal.py
file, removing the deprecated Expression._row_wise_udf
method, and integrating with Rust-based UDF execution. The new generator UDFs support automatic type inference from Iterator
and Generator
type hints from both typing
and collections.abc
modules.
The feature maintains API consistency with the existing @daft.func
decorator pattern by adding .gen
as a static method, making it intuitive for users familiar with Daft's UDF system. This addresses use cases like tokenization, data expansion, and sequence generation where single input rows need to produce multiple output rows.
Confidence score: 3/5
- This PR introduces significant new functionality with some implementation concerns that need attention
- Score lowered due to syntax errors in tests, empty placeholder file, documentation inconsistencies, and potential edge case handling issues
- Pay close attention to
tests/udf/test_generator_udf.py
for syntax errors anddaft/udf/generator.py
which appears to be an empty placeholder file
7 files reviewed, 4 comments
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #5036 +/- ##
==========================================
+ Coverage 76.04% 76.13% +0.08%
==========================================
Files 945 947 +2
Lines 129743 129791 +48
==========================================
+ Hits 98669 98814 +145
+ Misses 31074 30977 -97
🚀 New features to boost your workflow:
|
daft/udf/__init__.py
Outdated
def gen(fn: Callable[P, Iterator[T]], *, return_dtype: DataTypeLike | None = None) -> GeneratorUdf[P, T]: ... | ||
|
||
@staticmethod | ||
def gen( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I think we need to hide this from the user similar to what we do with async udfs.
The user doesnt need to specify anything in the decorator
@daft.func
async def foo():
pass
We just infer what kind of function it is behind the scenes. And we can do the same thing for generators:
@daft.func
def generator_func() -> Generator[str, ...]:
yield 'hello'
yield 'world'
We should be able to follow a very similar pattern to what we do to distinguish between async and serial functions
we can use inspect.isgeneratorfunction
to determine if it's a generator
import inspect
print(inspect.isgeneratorfunction(generator_func))
If we follow this same pattern we did for async, it would greatly simplify the implementation, as well as the public api.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
one caveat that you'll need to keep in mind is that the explode function is declared later in the dependency chain, and is not directly available on Series
. So you need to go through the FUNCTION_REGISTRY
which is a bit unruly:
let result_series = FUNCTION_REGISTRY
.read()
.unwrap()
.get("explode")
.unwrap()
.get_function(FunctionArgs::empty(), &Schema::empty())
.unwrap()
.call(FunctionArgs::new_unnamed(vec![result_series]))?;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I think we need to hide this from the user similar to what we do with async udfs.
Agreed. I think I was overthinking it when I implemented it this way. There were several considerations I had but I think they are all solveable:
- With the explicit decorator, users can also implement them via a normal function that returns an iterator. But there is no need to give them that flexibility, they can always do
yield from
. - I was hoping that the explicit
daft.func.gen
could be more discoverable. But I don't think it would be any easier to discover than if we specified in the docstring todaft.func
that you can use a generator. - I didn't want to overcrowd the doc strings for
daft.func
. My solution to this is to slim it down more. I don't think it's a great user experience to stuff all the info into these doc strings anyway, the better alternative is to beef up our user guide and provide more color there when users need more detail, and just use the doc strings as quick reference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
one caveat that you'll need to keep in mind is that the explode function is declared later in the dependency chain, and is not directly available on Series. So you need to go through the FUNCTION_REGISTRY which is a bit unruly:
With my current implementation of calling .explode()
on the Python side, would I still run into this issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With my current implementation of calling .explode() on the Python side, would I still run into this issue?
Your current approach works functionally, but it spreads the UDF conditional logic across multiple layers, making the code harder to reason about. I'd prefer consolidating this logic inside RowWisePyFn::call
in python_udf.rs
.
Something like this:
let is_async: bool = Python::with_gil(|py| {
py.import(pyo3::intern!(py, "asyncio"))?
.getattr(pyo3::intern!(py, "iscoroutinefunction"))?
.call1((self.inner.as_ref(),))?
.extract()
})?;
let is_generator: bool = Python::with_gil(|py| {
py.import(pyo3::intern!(py, "inspect"))?
.getattr(pyo3::intern!(py, "isgeneratorfunction"))?
.call1((self.inner.as_ref(),))?
.extract()
})?;
enum FunctionType {
Async,
Generator,
Scalar
}
let function_type = match (is_async, is_generator) {
(true, true) => err!("async generators not yet supported"),
(true, false) => FunctionType::Async,
(false, true) => FunctionType::Generator,
(false, false) => FunctionType::Scalar,
};
match function_type {
FunctionType::Async => self.call_async(args, num_rows),
FunctionType::Generator => self.call_generator(args, num_rows),
FunctionType::Scalar => self.call_serial(args, num_rows),
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For async this could make sense but for generator UDFs I actually want to represent them as a different variant in the Rust side. Similar to how we have Expr::ScalarFn
that has a Python
variant, I'd like to have Expr::GeneratorFn
that can be implemented with either Python or Rust (e.g. explode).
I'm implementing it this way right now so that it's simple (the core of the implementation just lives in generator.py) and easy to rip out once we fully implement it, since I don't introduce anything new in Rust at the moment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@universalmind303 what do you think about merging this in for now and then making the above changes in the future?
## Changes Made Adds the ability to create a generator UDF via `@daft.func.gen` decorator. Currently implemented internally as a list-type row-wise UDF + explode. ## Related Issues <!-- Link to related GitHub issues, e.g., "Closes Eventual-Inc#123" --> ## Checklist - [x] Documented in API Docs (if applicable) - [x] Documented in User Guide (if applicable) - [x] If adding a new documentation page, doc is added to `docs/mkdocs.yml` navigation - [x] Documentation builds and is formatted properly (tag @/ccmao1130 for docs review)
Changes Made
Adds the ability to create a generator UDF via
@daft.func.gen
decorator.Currently implemented internally as a list-type row-wise UDF + explode.
Related Issues
Checklist
docs/mkdocs.yml
navigation