[Feature] Grow core library and transforms to enable easy launching of a transform in a runtime from a .py file #931

daw3rd · 2025-01-09T16:15:47Z

Search before asking

I searched the issues and found no similar issues.

Component

Library/core, Transforms/Other

Feature

@touma-I has started an effort to simplify how transforms are run inside a launcher with pdf2parquet, here for python and here for ray. This is a good start and in particular

removes the user manipulation of sys.argv
removes the complexity of AST-based CLI args.

but I'd like to see the following:

Support for all parameters supported by the launchers (many are not supported now)
A super-class that does most of the work/implementation, with sub-classes only providing names of transform and launcher classes.
Other transforms begin to use this new framework.

Some discussion on class naming. The current first step uses Pdf2Parquet as the the name. This could be Pdf2ParquetLauncher instead since it is really launching the transform in a runtime. However, after discussion with @touma-I we agreed to stick with the current naming convention.

Note also that the python and ray classes have the same name. Should we instead follow other conventions and name the ray one Pdf2ParquetRay and also pdf2ParquetSpark?

And, a nice-to-have would be a 3 liner of one of these new classes in the top-level readme.

Are you willing to submit a PR?

Yes I am willing to submit a PR!

shahrokhDaijavad · 2025-01-09T17:36:16Z

If we agree that using a Google Colab notebook is the easiest first thing for the user to try, we can make a couple of small modifications to this notebook, https://github.com/IBM/data-prep-kit/blob/dev/transforms/language/pdf2parquet/pdf2parquet-python.ipynb (to handle non-local input/output) and use it in place of the current more complex https://github.com/IBM/data-prep-kit/blob/dev/examples/notebooks/Run_your_first_transform_colab.ipynb that was written before Maroun's simplifications. This is in addition to having a 3-liner in the README suggested here.

touma-I · 2025-01-09T17:51:16Z

@daw3rd Thank you for taking on this much needed work.

Is it possible to cover the default values for the framework? i.e. In the case of ray, the current code requires that the caller specifies mem and num_cpus. Is there a way to add default values that are required by the framework when they are not specified by the calling code ?

I don't think the re-naming Pdf2ParquetRay and PDf2parquetSpark is needed as it is already scoped by the submodule name ray.* or spark.* . That said: I don't feel too strongly about enforce this for future transform. This naming convention makes it easier for us to maintain/review the code but I feel there will be developers who want to call their class something else and we will want to accommodate and deviate from the convention if the request is reasonable as long as the API is properly documented in an example notebook or readme.md. For example. if tomorrow I come up with a transform that I can foo_bar_aplha_delta but I want to call my api fbad(...).transform(), I think I should be able to allow that. no ?

shahrokhDaijavad · 2025-01-09T20:04:09Z

@daw3rd Your question about using non-local data in Google Colab, without cloning the repo:
We only have to worry about the input file (you can create local output directory on Colab)

When in Google Colab, the easiest way to get the input file, as you have suggested, is to use wget in the notebook and get it from the repo itself, so for the pdf2parquet example:
!mkdir -p 'input'
!wget -O 'input/redp5110-ch1.pdf' 'https://raw.githubusercontent.com/IBM/data-prep-kit/dev/transforms/language/pdf2parquet/test-data/input/redp5110-ch1.pdf'

daw3rd added the enhancement New feature or request label Jan 9, 2025

daw3rd self-assigned this Jan 9, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Feature] Grow core library and transforms to enable easy launching of a transform in a runtime from a .py file #931

[Feature] Grow core library and transforms to enable easy launching of a transform in a runtime from a .py file #931

daw3rd commented Jan 9, 2025

shahrokhDaijavad commented Jan 9, 2025

touma-I commented Jan 9, 2025 •

edited

Loading

shahrokhDaijavad commented Jan 9, 2025

[Feature] Grow core library and transforms to enable easy launching of a transform in a runtime from a .py file #931

[Feature] Grow core library and transforms to enable easy launching of a transform in a runtime from a .py file #931

Comments

daw3rd commented Jan 9, 2025

Search before asking

Component

Feature

Are you willing to submit a PR?

shahrokhDaijavad commented Jan 9, 2025

touma-I commented Jan 9, 2025 • edited Loading

shahrokhDaijavad commented Jan 9, 2025

touma-I commented Jan 9, 2025 •

edited

Loading