You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I searched the issues and found no similar issues.
Component
Library/core, Transforms/Other
Feature
@touma-I has started an effort to simplify how transforms are run inside a launcher with pdf2parquet, here for python and here for ray. This is a good start and in particular
removes the user manipulation of sys.argv
removes the complexity of AST-based CLI args.
but I'd like to see the following:
Support for all parameters supported by the launchers (many are not supported now)
A super-class that does most of the work/implementation, with sub-classes only providing names of transform and launcher classes.
Other transforms begin to use this new framework.
Some discussion on class naming. The current first step uses Pdf2Parquet as the the name. This could be Pdf2ParquetLauncher instead since it is really launching the transform in a runtime. However, after discussion with @touma-I we agreed to stick with the current naming convention.
Note also that the python and ray classes have the same name. Should we instead follow other conventions and name the ray one Pdf2ParquetRay and also pdf2ParquetSpark?
And, a nice-to-have would be a 3 liner of one of these new classes in the top-level readme.
Are you willing to submit a PR?
Yes I am willing to submit a PR!
The text was updated successfully, but these errors were encountered:
@daw3rd Thank you for taking on this much needed work.
Is it possible to cover the default values for the framework? i.e. In the case of ray, the current code requires that the caller specifies mem and num_cpus. Is there a way to add default values that are required by the framework when they are not specified by the calling code ?
I don't think the re-naming Pdf2ParquetRay and PDf2parquetSpark is needed as it is already scoped by the submodule name ray.* or spark.* . That said: I don't feel too strongly about enforce this for future transform. This naming convention makes it easier for us to maintain/review the code but I feel there will be developers who want to call their class something else and we will want to accommodate and deviate from the convention if the request is reasonable as long as the API is properly documented in an example notebook or readme.md. For example. if tomorrow I come up with a transform that I can foo_bar_aplha_delta but I want to call my api fbad(...).transform(), I think I should be able to allow that. no ?
@daw3rd Your question about using non-local data in Google Colab, without cloning the repo:
We only have to worry about the input file (you can create local output directory on Colab)
When in Google Colab, the easiest way to get the input file, as you have suggested, is to use wget in the notebook and get it from the repo itself, so for the pdf2parquet example: !mkdir -p 'input' !wget -O 'input/redp5110-ch1.pdf' 'https://raw.githubusercontent.com/IBM/data-prep-kit/dev/transforms/language/pdf2parquet/test-data/input/redp5110-ch1.pdf'
Search before asking
Component
Library/core, Transforms/Other
Feature
@touma-I has started an effort to simplify how transforms are run inside a launcher with pdf2parquet, here for python and here for ray. This is a good start and in particular
but I'd like to see the following:
Some discussion on class naming. The current first step uses Pdf2Parquet as the the name. This could be Pdf2ParquetLauncher instead since it is really launching the transform in a runtime. However, after discussion with @touma-I we agreed to stick with the current naming convention.
Note also that the python and ray classes have the same name. Should we instead follow other conventions and name the ray one Pdf2ParquetRay and also pdf2ParquetSpark?
And, a nice-to-have would be a 3 liner of one of these new classes in the top-level readme.
Are you willing to submit a PR?
The text was updated successfully, but these errors were encountered: