@firestarman firestarman commented Nov 30, 2023

fix #9938

The Python UDF code has accumulated more and more shims, which has led to a lot of duplicated code. Some shim class names are also easy to confuse, e.g. GpuArrowPythonRunnerShims and GpuPythonArrowShims. Together these make maintenance difficult, especially whenever the cuDF API changes.

So this PR tries to:

  • extract the cuDF-related code from the shims and move it into common classes (mainly GpuArrowReader and GpuArrowWriter), so it is not duplicated in every shim.
  • keep the class names (GpuArrowPythonOutput, GpuArrowPythonRunner) unchanged across the shims to avoid confusion.
  • replace ShimBasePythonRunner with GpuBasePythonRunner, and reduce the number of shims by no longer handling the reader class inside it. Instead, different versions of GpuArrowPythonOutput are created to support the different shims.
  • make other small refactors.
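
The first and third bullets amount to a common-trait layout. Below is a minimal sketch of that shape, assuming hypothetical names (`ArrowWriterSketch`, `RunnerFor330`, `RunnerFor341`) and a toy string encoding in place of real cuDF calls. It is not the plugin's actual code.

```scala
// Hypothetical sketch: none of these names exist in the plugin.
// Version-independent logic (standing in for the cuDF-related code) is written once.
trait ArrowWriterSketch {
  // Shared encoding step, reused by every shim.
  protected def encode(batch: Seq[Int]): String = batch.mkString("[", ",", "]")

  // The only piece each shim has to supply.
  def write(batch: Seq[Int]): String
}

// Two per-version shims that differ only in a thin, version-specific wrapper.
class RunnerFor330 extends ArrowWriterSketch {
  override def write(batch: Seq[Int]): String = "v330:" + encode(batch)
}

class RunnerFor341 extends ArrowWriterSketch {
  override def write(batch: Seq[Int]): String = "v341:" + encode(batch)
}
```

With this layout, a change to the shared step touches `encode` only, and each shim keeps just its version-specific part.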

@firestarman firestarman changed the title Some refactor for the Python UDF code Some refactor for the Python UDF code[databricks] Nov 30, 2023
@firestarman

build

@gerashegalov gerashegalov left a comment

LGTM

@firestarman firestarman merged commit 22b11e8 into NVIDIA:branch-24.02 Dec 13, 2023
@firestarman firestarman deleted the py-udf-refactor branch December 13, 2023 02:51
razajafri pushed a commit to razajafri/spark-rapids that referenced this pull request Jan 25, 2024
razajafri added a commit that referenced this pull request Jan 26, 2024
* Download Maven from apache.org archives (#10225)

Fixes #10224 

Replace broken install using apt by downloading Maven from apache.org.

Signed-off-by: Gera Shegalov <[email protected]>

* Fix a hang for Pandas UDFs on DB 13.3[databricks] (#9833)

fix #9493
fix #9844

The Python runner uses two separate threads to write data to and read data from the Python processes.
On DB 13.3, however, it becomes single-threaded, which means reading and writing run on the same thread.
The first read now always happens before the first write, but the original BatchQueue blocks the
first read until the first write is done, so it waits forever.

Changes made:

- Update the BatchQueue to support asking for a batch instead of waiting until one is inserted into the queue.
   This eliminates the ordering requirement between reading and writing.
- Introduce a new class named BatchProducer that works with the new BatchQueue to support peeking at the
   row count on demand for the reading.
- Apply the new BatchQueue to the relevant plans.
- Update the Python runners to support writing one batch at a time for the single-threaded model.
- Found an issue with PythonUDAF and RunningWindowFunctionExec, which may be a bug specific to DB 13.3,
   and added a test (test_window_aggregate_udf_on_cpu) for it.
- Other small refactors
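
The "ask for a batch instead of waiting" idea in the first item can be sketched as follows. This is only an illustration under stated assumptions: `OnDemandBatchQueue`, `nextForWriter`, and `nextForReader` are hypothetical names, batches are plain values rather than GPU batches, and row-count peeking is left out.

```scala
import scala.collection.mutable

// Hypothetical sketch: class and method names are illustrative, not the plugin's API.
final class OnDemandBatchQueue[T](input: Iterator[T]) {
  // Batches the reader pulled early that the writer still has to send.
  private val pendingForWriter = mutable.Queue.empty[T]
  // Batches the writer already sent that the reader has not consumed yet.
  private val pendingForReader = mutable.Queue.empty[T]

  // Writer side: next batch to send to the Python process, if any.
  def nextForWriter(): Option[T] = this.synchronized {
    take(pendingForWriter, pendingForReader)
  }

  // Reader side: the input batch matching the next result; never blocks.
  def nextForReader(): Option[T] = this.synchronized {
    take(pendingForReader, pendingForWriter)
  }

  // Serve from `mine` if the other side ran ahead; otherwise pull a fresh batch
  // from the input and remember it for the other side.
  private def take(mine: mutable.Queue[T], other: mutable.Queue[T]): Option[T] = {
    if (mine.nonEmpty) {
      Some(mine.dequeue())
    } else if (input.hasNext) {
      val batch = input.next()
      other.enqueue(batch)
      Some(batch)
    } else {
      None
    }
  }
}
```

Because neither side ever waits for the other, a read that runs before the first write simply pulls the batch itself, and the later write picks up that same batch in the original order.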
---------

Signed-off-by: Firestarman <[email protected]>

* Fix a potential data corruption for Pandas UDF (#9942)

This PR moves the BatchQueue into the DataProducer so that it shares the same lock as the output iterator
returned by asIterator, and makes moving a batch from the input iterator to the batch queue
an atomic operation, eliminating the race when appending batches to the queue.
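
A minimal sketch of the shared-lock idea, assuming a hypothetical `LockedProducerSketch` class with a `poll` method. Only the name `asIterator` comes from the description above, and the real DataProducer may be structured differently.

```scala
import scala.collection.mutable

// Hypothetical sketch: names are illustrative, not the plugin's actual classes.
final class LockedProducerSketch[T](input: Iterator[T]) {
  // The queue lives inside the producer so both paths below use the same lock.
  private val queue = mutable.Queue.empty[T]

  // Iterator handed to the writer. Pulling a batch and appending it to the
  // queue happen inside one synchronized block, so no other thread can see
  // the batch as taken from the input but missing from the queue.
  def asIterator: Iterator[T] = new Iterator[T] {
    override def hasNext: Boolean = LockedProducerSketch.this.synchronized {
      input.hasNext
    }
    override def next(): T = LockedProducerSketch.this.synchronized {
      val batch = input.next()
      queue.enqueue(batch)
      batch
    }
  }

  // Reader side: remove the oldest queued batch under the same lock.
  def poll(): Option[T] = this.synchronized {
    if (queue.nonEmpty) Some(queue.dequeue()) else None
  }
}
```

If the queue had its own separate lock, another thread could run between `input.next()` and `enqueue`, which is the kind of race the change is meant to remove.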

* Do some refactor for the Python UDF code to try to reduce duplicate code. (#9902)

Signed-off-by: Firestarman <[email protected]>

* Fixed 330db Shims to Adopt the PythonRunner Changes [databricks] (#10232)

This PR removes the old 330db shims in favor of the new Shims, similar to the one in 341db. 

**Tests:**
Ran udf_test.py on Databricks 11.3 and all tests passed.

fixes #10228 

---------

Signed-off-by: raza jafri <[email protected]>

---------

Signed-off-by: Gera Shegalov <[email protected]>
Signed-off-by: Firestarman <[email protected]>
Signed-off-by: raza jafri <[email protected]>
Co-authored-by: Gera Shegalov <[email protected]>
Co-authored-by: Liangcai Li <[email protected]>
Development

Successfully merging this pull request may close these issues.

Better to do some refactor for the Python UDF code

3 participants