Implement general purpose async functions #1
Conversation
Force-pushed from 5cb6e4e to 90bd4f6
```diff
- /// A scalar UDF that will be bypassed when planning logical plan.
- /// This is used to register the remote function to the context. The function should not be
- /// invoked by DataFusion. It's only used to generate the logical plan and unparsed them to SQL.
+ /// A scalar UDF that can invoke using async methods
```
Here is the new API. At a high level it is meant to mimic `ScalarUDFImpl`, except that it has an async invoke function.
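To make the shape concrete, here is a compilable sketch of such a trait. Everything in it is a mock: the `Result` alias, the argument/return types (`Vec<String>` / `Vec<bool>` standing in for Arrow's `RecordBatch` / `ArrayRef`), the `AskLlm` struct, and the `poll_ready` driver are illustrations, not the PR's actual code.

```rust
use std::future::Future;
use std::pin::pin;
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

// Mock stand-ins so the sketch is self-contained; the real API uses
// Arrow's `RecordBatch` / `ArrayRef` and DataFusion's `Result`.
type Result<T> = std::result::Result<T, String>;

/// Sketch of the new trait: it mirrors `ScalarUDFImpl`, but `invoke_async`
/// is an `async fn` (stable in traits since Rust 1.75).
trait AsyncScalarUDFImpl {
    fn name(&self) -> &str;
    async fn invoke_async(&self, args: &[String]) -> Result<Vec<bool>>;
}

/// A toy "remote" function; a real one would `.await` an HTTP client here.
struct AskLlm;

impl AsyncScalarUDFImpl for AskLlm {
    fn name(&self) -> &str {
        "ask_llm"
    }

    async fn invoke_async(&self, args: &[String]) -> Result<Vec<bool>> {
        // Pretend each row's answer came back from a network call.
        Ok(args.iter().map(|s| s.contains("yes")).collect())
    }
}

// Minimal no-op waker so we can drive a future that never suspends
// (demo only; a real plan node would run on DataFusion's tokio runtime).
fn vt_clone(_: *const ()) -> RawWaker {
    raw_waker()
}
fn vt_noop(_: *const ()) {}
static VTABLE: RawWakerVTable = RawWakerVTable::new(vt_clone, vt_noop, vt_noop, vt_noop);
fn raw_waker() -> RawWaker {
    RawWaker::new(std::ptr::null(), &VTABLE)
}

fn poll_ready<F: Future>(fut: F) -> F::Output {
    let waker = unsafe { Waker::from_raw(raw_waker()) };
    match pin!(fut).poll(&mut Context::from_waker(&waker)) {
        Poll::Ready(v) => v,
        Poll::Pending => unreachable!("sketch future never suspends"),
    }
}

fn main() {
    let udf = AskLlm;
    let args = vec!["yes please".to_string(), "no".to_string()];
    let out = poll_ready(udf.invoke_async(&args)).unwrap();
    println!("{out:?}");
}
```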
```rust
pub struct AsyncFuncRule {}

impl PhysicalOptimizerRule for AsyncFuncRule {
    /// Insert an AsyncFunctionNode in front of this projection if there are any async functions in it
```
Here is the high level design: add a new node before a ProjectionExec that does the actual async calls
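Conceptually, the rewrite looks like the sketch below. All of the types are toy stand-ins (the real rule works on DataFusion `ExecutionPlan` nodes, and `AsyncFunc` stands in for the new async node):

```rust
// Mock expression and plan types to illustrate the rewrite.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    AsyncUdf(String),
}

#[derive(Debug, PartialEq)]
enum Plan {
    Projection { exprs: Vec<Expr>, input: Box<Plan> },
    AsyncFunc { funcs: Vec<Expr>, input: Box<Plan> },
    Scan,
}

/// If a projection references any async functions, insert a node below it
/// that performs the async calls first (conceptually what the rule does).
fn insert_async_node(plan: Plan) -> Plan {
    match plan {
        Plan::Projection { exprs, input } => {
            let funcs: Vec<Expr> = exprs
                .iter()
                .filter(|e| matches!(e, Expr::AsyncUdf(_)))
                .cloned()
                .collect();
            if funcs.is_empty() {
                // No async work: leave the projection alone.
                Plan::Projection { exprs, input }
            } else {
                // Async work: evaluate it in a dedicated node below.
                Plan::Projection {
                    exprs,
                    input: Box::new(Plan::AsyncFunc { funcs, input }),
                }
            }
        }
        other => other,
    }
}

fn main() {
    let plan = Plan::Projection {
        exprs: vec![Expr::Column("id".into()), Expr::AsyncUdf("ask_llm".into())],
        input: Box::new(Plan::Scan),
    };
    println!("{:?}", insert_async_node(plan));
}
```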
```rust
.with_optimizer_rules(vec![])
.with_query_planner(Arc::new(LLMQueryPlanner {}))
.with_physical_optimizer_rules(vec![])
.with_physical_optimizer_rule(Arc::new(AsyncFuncRule {}))
```
Here is how to use the new code: add the new optimizer rule:
```rust
    Ok(())
}

/// This is a simple example of a UDF that takes a string, invokes a (remote) LLM function
```
Now, define a struct that implements `AsyncScalarUDFImpl`:
```rust
    Ok(DataType::Boolean)
}

async fn invoke_async(&self, args: &RecordBatch) -> Result<ArrayRef> {
```
Here is the function that is invoked (it is async); it should be able to make network and any other calls.
More discussion here:
```rust
fn return_type(&self, _arg_types: &[DataType]) -> Result<DataType>;

/// Invoke the function asynchronously with the async arguments
async fn invoke_async(&self, args: &RecordBatch) -> Result<ArrayRef>;
```
I wonder whether this should return a Stream of ArrayRef, so that internally you can batch the calls to an external system with the right batch size? In the LLM case there might also be a problem with the context, I suppose...
That is also an excellent question -- the current situation is that DataFusion handles the batching (aka target_size) -- so normally it will pass 8k rows or whatever to the function.
I think we could potentially make the API something like:

```rust
fn invoke_async_stream(&self, input: SendableRecordBatchStream) -> Result<SendableRecordBatchStream>;
```

but I think that might be trickier to code / get right.
In terms of LLM context, this particular PR only adds async scalar functions. I think we could likely do something similar with window and aggregate functions, which might more naturally map to context 🤔
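One point worth illustrating: even with the batch-at-a-time signature, an implementation can re-chunk the rows it is given to match an external system's preferred request size. The sketch below uses hypothetical stand-ins throughout (`call_remote`, `MAX_ROWS_PER_CALL`, string rows instead of a `RecordBatch`, and a toy `poll_ready` driver for futures that never actually suspend).

```rust
use std::future::Future;
use std::pin::pin;
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

type Result<T> = std::result::Result<T, String>;

/// Hypothetical per-request limit of the external system.
const MAX_ROWS_PER_CALL: usize = 3;

/// Pretend remote call (a real one would await an HTTP request).
async fn call_remote(rows: &[String]) -> Result<Vec<bool>> {
    if rows.len() > MAX_ROWS_PER_CALL {
        return Err("too many rows in one request".to_string());
    }
    Ok(rows.iter().map(|r| r.len() > 3).collect())
}

/// Even if DataFusion hands the UDF one large batch (target size, e.g. 8k
/// rows), the implementation can re-chunk it without any API change.
async fn invoke_async(rows: &[String]) -> Result<Vec<bool>> {
    let mut out = Vec::with_capacity(rows.len());
    for chunk in rows.chunks(MAX_ROWS_PER_CALL) {
        out.extend(call_remote(chunk).await?);
    }
    Ok(out)
}

// Minimal no-op waker + driver (demo only; all awaits resolve immediately).
fn vt_clone(_: *const ()) -> RawWaker {
    raw_waker()
}
fn vt_noop(_: *const ()) {}
static VTABLE: RawWakerVTable = RawWakerVTable::new(vt_clone, vt_noop, vt_noop, vt_noop);
fn raw_waker() -> RawWaker {
    RawWaker::new(std::ptr::null(), &VTABLE)
}

fn poll_ready<F: Future>(fut: F) -> F::Output {
    let waker = unsafe { Waker::from_raw(raw_waker()) };
    match pin!(fut).poll(&mut Context::from_waker(&waker)) {
        Poll::Ready(v) => v,
        Poll::Pending => unreachable!("sketch future never suspends"),
    }
}

fn main() {
    let rows: Vec<String> = (0..8).map(|i| format!("row-{i}")).collect();
    let out = poll_ready(invoke_async(&rows)).unwrap();
    println!("{out:?}");
}
```

A streaming `invoke_async_stream` variant would push this chunking decision into the trait itself rather than leaving it to each implementation.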
```rust
let schema_captured = schema_captured.clone();

async move {
    let batch = batch?;
```
minor: would moving this invocation of the `?` operator save spawning a task in case of an error?
yes, you are right -- that would be an improvement 👍
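The suggested improvement can be sketched as follows, with mock types throughout (`std::thread::spawn` standing in for the async task, `Vec<u64>` for the record batch): applying `?` before spawning means no task is created when the input is already an error.

```rust
use std::thread;

type Result<T> = std::result::Result<T, String>;

/// Stand-in for the per-batch work: in the PR the spawned work is an async
/// task; a std thread is used here so the sketch is self-contained.
fn process(batch: Result<Vec<u64>>) -> Result<u64> {
    // Propagate an input error *before* spawning, so no task/thread is
    // created just to immediately fail. (The alternative is running
    // `let batch = batch?;` inside the spawned task.)
    let batch = batch?;
    let handle = thread::spawn(move || batch.iter().sum::<u64>());
    handle.join().map_err(|_| "task panicked".to_string())
}

fn main() {
    assert_eq!(process(Ok(vec![1, 2, 3])), Ok(6));
    assert!(process(Err("upstream error".to_string())).is_err());
    println!("ok");
}
```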
Welcome back!
This PR implements what I think is a general purpose framework for implementing async user defined functions.
The high level design is to handle async functions with a special new execution plan
I will comment more inline about the design.
When run with `cargo run`, this program shows: