universalmind303
Contributor

Changes Made

Adds basic support and tests for async @daft.func.

import daft

# An async function decorated with @daft.func
@daft.func
async def my_udf(text: str) -> str:
    return text.upper()

df = daft.from_pydict({
    "text": ["hello", "world"]
})

print(df.select(my_udf(df["text"])).collect())

Related Issues

Checklist

  • Documented in API Docs (if applicable)
  • Documented in User Guide (if applicable)
  • If adding a new documentation page, doc is added to docs/mkdocs.yml navigation
  • Documentation builds and is formatted properly (tag @/ccmao1130 for docs review)

@github-actions github-actions bot added the feat label Aug 5, 2025
Contributor

@greptile-apps greptile-apps bot left a comment

Greptile Summary

This PR introduces async function support for Daft's @daft.func decorator, allowing users to write asynchronous user-defined functions (UDFs) that can perform I/O-bound operations efficiently. The implementation adds three main components:

  1. Detection of async functions: The Rust layer now uses asyncio.iscoroutinefunction() to identify when a decorated function is async and routes it to a specialized handler.

  2. Concurrent async execution: A new Python function call_async_batch_with_evaluated_exprs() executes all async UDF calls concurrently using asyncio.gather(), providing significant performance benefits for I/O-bound operations compared to sequential execution.

  3. Modified execution path: Async functions follow a different code path than synchronous functions, collecting all evaluated arguments upfront and processing them together rather than using the chunked parallel processing approach used for sync functions.
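
The three steps above can be sketched in plain Python. This is a simplified standard-library illustration, not Daft's implementation: `call_batch` and `shout` are hypothetical names, the real detection lives in the Rust layer, and `inspect.iscoroutinefunction()` is used here as the standard-library equivalent of the `asyncio.iscoroutinefunction()` check named in the summary.

```python
import asyncio
import inspect


def call_batch(fn, rows):
    """Call fn once per row of arguments; coroutine functions run concurrently.

    Hypothetical helper for illustration only.
    """
    if inspect.iscoroutinefunction(fn):
        async def run_all():
            # gather() schedules every call at once and returns the
            # results in the same order as the inputs.
            return await asyncio.gather(*(fn(*args) for args in rows))

        # asyncio.run() creates a fresh event loop; it raises RuntimeError
        # if the calling thread is already running a loop.
        return asyncio.run(run_all())

    # Synchronous functions are simply called row by row.
    return [fn(*args) for args in rows]


async def shout(text: str) -> str:
    await asyncio.sleep(0)  # stand-in for real async I/O
    return text.upper()


print(call_batch(shout, [("hello",), ("world",)]))  # ['HELLO', 'WORLD']
```

All arguments are collected before any call starts, which is what lets `gather()` overlap the waits; the cost is that the whole batch is held in memory at once.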

The feature integrates cleanly with Daft's existing UDF system - users can simply add async to their function definition and await async operations inside, while the framework handles the complexity of concurrent execution. This is particularly valuable for UDFs that need to make HTTP requests, query databases, or perform other async I/O operations. A comprehensive test validates that async UDFs produce identical results to their synchronous counterparts while benefiting from concurrent execution.
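
The claim that an async function matches its synchronous counterpart while overlapping I/O waits can be checked without Daft. The sketch below uses only the standard library; `fetch_sync`, `fetch_async`, and the 50 ms sleep are made-up stand-ins for a real network call, so the timings are illustrative rather than a benchmark of Daft.

```python
import asyncio
import time


def fetch_sync(x: int) -> int:
    time.sleep(0.05)  # simulated blocking I/O
    return x * 2


async def fetch_async(x: int) -> int:
    await asyncio.sleep(0.05)  # simulated non-blocking I/O
    return x * 2


async def run_concurrently(values):
    # All coroutines wait at the same time, so the total is roughly one sleep.
    return await asyncio.gather(*(fetch_async(v) for v in values))


values = list(range(10))

start = time.perf_counter()
sync_results = [fetch_sync(v) for v in values]  # waits add up
sync_elapsed = time.perf_counter() - start

start = time.perf_counter()
async_results = asyncio.run(run_concurrently(values))  # waits overlap
async_elapsed = time.perf_counter() - start

print(sync_results == async_results)
print(f"sequential: {sync_elapsed:.2f}s, concurrent: {async_elapsed:.2f}s")
```

The speedup only appears when the function actually awaits something; a CPU-bound body such as `text.upper()` gains nothing from running under `gather()`.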

Confidence score: 3/5

  • This PR introduces complex async functionality that could have subtle runtime issues
  • Score reflects concerns about event loop management, error handling, and argument processing edge cases
  • Pay close attention to daft/udf/row_wise.py and src/daft-dsl/src/python_udf.rs for potential async-related issues

3 files reviewed, 2 comments

Comment on lines +90 to +95
try:
    # try to use existing event loop
    event_loop = asyncio.get_running_loop()
    outputs = asyncio.run_coroutine_threadsafe(run_tasks(), event_loop).result()
except RuntimeError:
    outputs = asyncio.run(run_tasks())
Contributor Author

This seemed like a pretty reasonable initial approach to me. As an optimization, we could create a global event loop that we reuse instead of potentially creating a new loop every time an async UDF is called.

Contributor

Makes sense to me. I was also thinking of using the Tokio event loop and making eval_expression_list async to avoid blocking entirely.
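
One way the reusable-loop idea from this thread could look is sketched below. It is an assumption about a possible design, not what the PR merged: `get_background_loop`, `run_blocking`, and `upper_all` are hypothetical names. The loop runs in its own daemon thread because `asyncio.run_coroutine_threadsafe()` is meant to be called from a different thread than the one running the loop; blocking on `.result()` from the loop's own thread would stall the loop that has to run the coroutine.

```python
import asyncio
import threading

_loop = None
_loop_lock = threading.Lock()


def get_background_loop() -> asyncio.AbstractEventLoop:
    """Return a process-wide event loop running in a daemon thread."""
    global _loop
    with _loop_lock:
        if _loop is None:
            _loop = asyncio.new_event_loop()
            # The loop lives for the rest of the process and is reused
            # by every later call instead of being recreated each time.
            threading.Thread(target=_loop.run_forever, daemon=True).start()
        return _loop


def run_blocking(coro):
    """Submit coro to the shared loop and block the caller until it finishes."""
    # Only safe when the caller is NOT the loop's own thread.
    future = asyncio.run_coroutine_threadsafe(coro, get_background_loop())
    return future.result()


async def upper_all(texts):
    async def upper(text):
        await asyncio.sleep(0)  # stand-in for real async I/O
        return text.upper()

    return await asyncio.gather(*(upper(t) for t in texts))


print(run_blocking(upper_all(["hello", "world"])))  # ['HELLO', 'WORLD']
```

The calling thread still blocks while the batch runs, so this addresses only the cost of creating loops; making the caller itself async, as suggested in the reply, is a separate change.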


codecov bot commented Aug 5, 2025

Codecov Report

❌ Patch coverage is 70.76923% with 19 lines in your changes missing coverage. Please review.
✅ Project coverage is 79.26%. Comparing base (7d76af5) to head (78b9f4d).
⚠️ Report is 1 commit behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| daft/udf/row_wise.py | 13.63% | 19 Missing ⚠️ |

@@            Coverage Diff             @@
##             main    #4908      +/-   ##
==========================================
+ Coverage   77.84%   79.26%   +1.41%     
==========================================
  Files         906      906              
  Lines      127107   126048    -1059     
==========================================
+ Hits        98952    99913     +961     
+ Misses      28155    26135    -2020     
| Files with missing lines | Coverage Δ |
|---|---|
| src/daft-dsl/src/python_udf.rs | 86.98% <100.00%> (+4.57%) ⬆️ |
| daft/udf/row_wise.py | 60.00% <13.63%> (-22.93%) ⬇️ |

... and 44 files with indirect coverage changes

mkdocs.yml Outdated
Contributor Author

@universalmind303 universalmind303 Aug 7, 2025

The pre-commit autoformatter updated this.

Contributor

@srilman srilman left a comment

LGTM, thanks @universalmind303!

@@ -125,14 +125,14 @@ impl FunctionEvaluator for LegacyPythonUDF {
         }
     }

-    fn evaluate(&self, inputs: &[Series], _: &FunctionExpr) -> DaftResult<Series> {
+    fn evaluate(&self, _inputs: &[Series], _: &FunctionExpr) -> DaftResult<Series> {
Contributor

Nit: Change _inputs back to inputs since it's used.

@universalmind303 universalmind303 enabled auto-merge (squash) August 7, 2025 16:58

codspeed-hq bot commented Aug 7, 2025

CodSpeed Performance Report

Merging #4908 will degrade performance by 99.29%

Comparing cory/async-udf (78b9f4d) with main (7d76af5)

Summary

❌ 1 regression
✅ 23 untouched benchmarks

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Benchmarks breakdown

| Benchmark | BASE | HEAD | Change |
|---|---|---|---|
| test_show[1 Small File] | 12.5 ms | 1,751.3 ms | -99.29% |
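
The reported change can be related back to the two timings with a short calculation. The formula below is an assumption that happens to reproduce the figure in the table; this thread does not state how CodSpeed defines the metric.

```python
base_ms = 12.5     # BASE timing from the table
head_ms = 1751.3   # HEAD timing from the table

# Assumed formula: relative change in speed, where speed is 1 / time.
change_pct = (base_ms / head_ms - 1) * 100

# The same regression expressed as a slowdown factor.
slowdown = head_ms / base_ms

print(f"{change_pct:.2f}%")        # -99.29%
print(f"{slowdown:.1f}x slower")   # 140.1x slower
```

Read this way, a change close to -100% means the benchmark took roughly 140 times longer on the branch than on main.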

@universalmind303 universalmind303 merged commit f84d188 into main Aug 7, 2025
47 of 48 checks passed
@universalmind303 universalmind303 deleted the cory/async-udf branch August 7, 2025 17:34
Contributor

jaychia commented Aug 7, 2025

Plz add this into docs! This seems super cool and we should 100% be shouting it out where possible.

@kevinzwang to advise where best to advertise in the new docs

kevinzwang pushed a commit that referenced this pull request Aug 7, 2025