Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature] Grow core library and transforms to enable easy launching of a transform in a runtime from a .py file #931

Open
2 tasks done
daw3rd opened this issue Jan 9, 2025 · 3 comments
Assignees
Labels
enhancement New feature or request

Comments

@daw3rd
Copy link
Member

daw3rd commented Jan 9, 2025

Search before asking

  • I searched the issues and found no similar issues.

Component

Library/core, Transforms/Other

Feature

@touma-I has started an effort to simplify how transforms are run inside a launcher with pdf2parquet, here for python and here for ray. This is a good start and in particular

  1. removes the user manipulation of sys.argv
  2. removes the complexity of AST-based CLI args.

but I'd like to see the following:

  1. Support for all parameters supported by the launchers (many are not supported now)
  2. A super-class that does most of the work/implementation, with sub-classes only providing names of transform and launcher classes.
  3. Other transforms begin to use this new framework.

Some discussion on class naming. The current first step uses Pdf2Parquet as the the name. This could be Pdf2ParquetLauncher instead since it is really launching the transform in a runtime. However, after discussion with @touma-I we agreed to stick with the current naming convention.

Note also that the python and ray classes have the same name. Should we instead follow other conventions and name the ray one Pdf2ParquetRay and also pdf2ParquetSpark?

And, a nice-to-have would be a 3 liner of one of these new classes in the top-level readme.

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
@daw3rd daw3rd added the enhancement New feature or request label Jan 9, 2025
@daw3rd daw3rd self-assigned this Jan 9, 2025
@shahrokhDaijavad
Copy link
Member

If we agree that using a Google Colab notebook is the easiest first thing for the user to try, we can make a couple of small modifications to this notebook, https://github.com/IBM/data-prep-kit/blob/dev/transforms/language/pdf2parquet/pdf2parquet-python.ipynb (to handle non-local input/output) and use it in place of the current more complex https://github.com/IBM/data-prep-kit/blob/dev/examples/notebooks/Run_your_first_transform_colab.ipynb that was written before Maroun's simplifications. This is in addition to having a 3-liner in the README suggested here.

@touma-I
Copy link
Collaborator

touma-I commented Jan 9, 2025

@daw3rd Thank you for taking on this much needed work.

Is it possible to cover the default values for the framework? i.e. In the case of ray, the current code requires that the caller specifies mem and num_cpus. Is there a way to add default values that are required by the framework when they are not specified by the calling code ?

I don't think the re-naming Pdf2ParquetRay and PDf2parquetSpark is needed as it is already scoped by the submodule name ray.* or spark.* . That said: I don't feel too strongly about enforce this for future transform. This naming convention makes it easier for us to maintain/review the code but I feel there will be developers who want to call their class something else and we will want to accommodate and deviate from the convention if the request is reasonable as long as the API is properly documented in an example notebook or readme.md. For example. if tomorrow I come up with a transform that I can foo_bar_aplha_delta but I want to call my api fbad(...).transform(), I think I should be able to allow that. no ?

@shahrokhDaijavad
Copy link
Member

@daw3rd Your question about using non-local data in Google Colab, without cloning the repo:
We only have to worry about the input file (you can create local output directory on Colab)

When in Google Colab, the easiest way to get the input file, as you have suggested, is to use wget in the notebook and get it from the repo itself, so for the pdf2parquet example:
!mkdir -p 'input'
!wget -O 'input/redp5110-ch1.pdf' 'https://raw.githubusercontent.com/IBM/data-prep-kit/dev/transforms/language/pdf2parquet/test-data/input/redp5110-ch1.pdf'

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants