Skip to content

Conversation

kevinzwang
Copy link
Member

Changes Made

Adds the ability to create a generator UDF via @daft.func.gen decorator.

Currently implemented internally as a list-type row-wise UDF + explode.

Related Issues

Checklist

  • Documented in API Docs (if applicable)
  • Documented in User Guide (if applicable)
  • If adding a new documentation page, doc is added to docs/mkdocs.yml navigation
  • Documentation builds and is formatted properly (tag @/ccmao1130 for docs review)

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Greptile Summary

This PR introduces generator UDF functionality to Daft via the @daft.func.gen decorator, enabling users to create functions that yield multiple values per input row (1-to-N transformations). The implementation adds a new GeneratorUdf class in daft/udf/gen.py that wraps Python generator functions and converts them into Daft expressions. Currently, this is implemented as a temporary workaround using existing row-wise UDF infrastructure combined with explode operations rather than native generator support.

The changes include several architectural improvements: refactoring the existing RowWiseUdf class to use shared utility functions from the new daft/udf/_internal.py file, removing the deprecated Expression._row_wise_udf method, and integrating with Rust-based UDF execution. The new generator UDFs support automatic type inference from Iterator and Generator type hints from both typing and collections.abc modules.

The feature maintains API consistency with the existing @daft.func decorator pattern by adding .gen as a static method, making it intuitive for users familiar with Daft's UDF system. This addresses use cases like tokenization, data expansion, and sequence generation where single input rows need to produce multiple output rows.

Confidence score: 3/5

  • This PR introduces significant new functionality with some implementation concerns that need attention
  • Score lowered due to syntax errors in tests, empty placeholder file, documentation inconsistencies, and potential edge case handling issues
  • Pay close attention to tests/udf/test_generator_udf.py for syntax errors and daft/udf/generator.py which appears to be an empty placeholder file

7 files reviewed, 4 comments

Edit Code Review Bot Settings | Greptile

Copy link

codecov bot commented Aug 23, 2025

Codecov Report

❌ Patch coverage is 93.75000% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 76.13%. Comparing base (d2e9c04) to head (342913f).
⚠️ Report is 4 commits behind head on main.

Files with missing lines Patch % Lines
daft/udf/generator.py 88.88% 4 Missing ⚠️
daft/udf/_internal.py 94.44% 1 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #5036      +/-   ##
==========================================
+ Coverage   76.04%   76.13%   +0.08%     
==========================================
  Files         945      947       +2     
  Lines      129743   129791      +48     
==========================================
+ Hits        98669    98814     +145     
+ Misses      31074    30977      -97     
Files with missing lines Coverage Δ
daft/expressions/expressions.py 95.79% <ø> (-0.02%) ⬇️
daft/udf/__init__.py 96.00% <100.00%> (+1.88%) ⬆️
daft/udf/row_wise.py 86.53% <100.00%> (+1.05%) ⬆️
daft/udf/_internal.py 94.44% <94.44%> (ø)
daft/udf/generator.py 88.88% <88.88%> (ø)

... and 12 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

def gen(fn: Callable[P, Iterator[T]], *, return_dtype: DataTypeLike | None = None) -> GeneratorUdf[P, T]: ...

@staticmethod
def gen(
Copy link
Contributor

@universalmind303 universalmind303 Aug 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I think we need to hide this from the user similar to what we do with async udfs.

The user doesnt need to specify anything in the decorator

@daft.func
async def foo():
  pass

We just infer what kind of function it is behind the scenes. And we can do the same thing for generators:

@daft.func
def generator_func() -> Generator[str, ...]:
  yield 'hello'
  yield 'world'

We should be able to follow a very similar pattern to what we do to distinguish between async and serial functions

https://github.com/Eventual-Inc/Daft/blob/main/src/daft-dsl/src/python_udf.rs#L131

we can use inspect.isgeneratorfunction to determine if it's a generator

import inspect
print(inspect.isgeneratorfunction(generator_func))

If we follow this same pattern we did for async, it would greatly simplify the implementation, as well as the public api.

Copy link
Contributor

@universalmind303 universalmind303 Aug 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one caveat that you'll need to keep in mind is that the explode function is declared later in the dependency chain, and is not directly available on Series. So you need to go through the FUNCTION_REGISTRY which is a bit unruly:

            let result_series = FUNCTION_REGISTRY
                .read()
                .unwrap()
                .get("explode")
                .unwrap()
                .get_function(FunctionArgs::empty(), &Schema::empty())
                .unwrap()
                .call(FunctionArgs::new_unnamed(vec![result_series]))?;

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I think we need to hide this from the user similar to what we do with async udfs.

Agreed. I think I was overthinking it when I implemented it this way. There were several considerations I had but I think they are all solveable:

  • With the explicit decorator, users can also implement them via a normal function that returns an iterator. But there is no need to give them that flexibility, they can always do yield from.
  • I was hoping that the explicit daft.func.gen could be more discoverable. But I don't think it would be any easier to discover than if we specified in the docstring to daft.func that you can use a generator.
  • I didn't want to overcrowd the doc strings for daft.func. My solution to this is to slim it down more. I don't think it's a great user experience to stuff all the info into these doc strings anyway, the better alternative is to beef up our user guide and provide more color there when users need more detail, and just use the doc strings as quick reference.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one caveat that you'll need to keep in mind is that the explode function is declared later in the dependency chain, and is not directly available on Series. So you need to go through the FUNCTION_REGISTRY which is a bit unruly:

With my current implementation of calling .explode() on the Python side, would I still run into this issue?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With my current implementation of calling .explode() on the Python side, would I still run into this issue?

Your current approach works functionally, but it spreads the UDF conditional logic across multiple layers, making the code harder to reason about. I'd prefer consolidating this logic inside RowWisePyFn::call in python_udf.rs.

Something like this:

let is_async: bool = Python::with_gil(|py| {
    py.import(pyo3::intern!(py, "asyncio"))?
        .getattr(pyo3::intern!(py, "iscoroutinefunction"))?
        .call1((self.inner.as_ref(),))?
        .extract()
})?;

let is_generator: bool = Python::with_gil(|py| {
    py.import(pyo3::intern!(py, "inspect"))?
        .getattr(pyo3::intern!(py, "isgeneratorfunction"))?
        .call1((self.inner.as_ref(),))?
        .extract()
})?;

enum FunctionType {
    Async,
    Generator,
    Scalar
}

let function_type = match (is_async, is_generator) {
    (true, true) => err!("async generators not yet supported"),
    (true, false) => FunctionType::Async,
    (false, true) => FunctionType::Generator,
    (false, false) => FunctionType::Scalar,
};


match function_type {
    FunctionType::Async => self.call_async(args, num_rows),
    FunctionType::Generator => self.call_generator(args, num_rows),
    FunctionType::Scalar => self.call_serial(args, num_rows),
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For async this could make sense but for generator UDFs I actually want to represent them as a different variant in the Rust side. Similar to how we have Expr::ScalarFn that has a Python variant, I'd like to have Expr::GeneratorFn that can be implemented with either Python or Rust (e.g. explode).

I'm implementing it this way right now so that it's simple (the core of the implementation just lives in generator.py) and easy to rip out once we fully implement it, since I don't introduce anything new in Rust at the moment.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@universalmind303 what do you think about merging this in for now and then making the above changes in the future?

@kevinzwang kevinzwang enabled auto-merge (squash) August 25, 2025 23:32
@kevinzwang kevinzwang merged commit dce283a into main Aug 26, 2025
54 checks passed
@kevinzwang kevinzwang deleted the kevin/generator-udf branch August 26, 2025 00:21
venkateshdb pushed a commit to venkateshdb/Daft that referenced this pull request Sep 6, 2025
## Changes Made

Adds the ability to create a generator UDF via `@daft.func.gen`
decorator.

Currently implemented internally as a list-type row-wise UDF + explode.

## Related Issues

<!-- Link to related GitHub issues, e.g., "Closes Eventual-Inc#123" -->

## Checklist

- [x] Documented in API Docs (if applicable)
- [x] Documented in User Guide (if applicable)
- [x] If adding a new documentation page, doc is added to
`docs/mkdocs.yml` navigation
- [x] Documentation builds and is formatted properly (tag @/ccmao1130
for docs review)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants