[Feature] Restructure transforms as their own modules #774

Open
2 tasks done
touma-I opened this issue Nov 5, 2024 · 1 comment
Labels
enhancement New feature or request simplify-DPK

Comments

@touma-I
Collaborator

touma-I commented Nov 5, 2024

Search before asking

  • I searched the issues and found no similar issues.

Component

Transforms/Other

Feature

Restructure the repo to reduce the burden/constraints on contributors:

  1. Each transform will have its own module named dpk_ (e.g. dpk_pdf2Parquet, dpk_doc_quality, etc.). Since all of our contributions are currently Python code, the module name is also the name of the folder where contributors add their code.
  2. The Python, Ray, and Spark folders will no longer be needed. Contributors are free to decide whether Ray and Spark are submodules (i.e. dpk_/ray) or are delivered as separate modules (dpk__ray).
  3. The src subfolder is no longer required. This is also left to the contributor, as we each have our own style.
  4. pyproject.toml will no longer be required. Developers can opt to have one (using setuptools, poetry, hatch, etc.), but it is no longer a requirement. We still need a requirements.txt file that lists all the dependencies of the code being contributed.
  5. Reduce (if not eliminate) the number of Makefiles for each transform. Right now we are down from 3 to 1. In any case, contributors will not be required to maintain the Makefile, and we have simplified it quite a bit to reduce complexity. (Contributors can still create their own Makefile and use make within their folder any way they want to automate simple tasks.)
  6. Both Dockerfiles (Ray and Python) will be removed. We will use a standard Docker template for those transforms that need to be integrated with KFP. I also see this area evolving over time, as more and more of our users are running their pipelines and transforms inside self-contained Docker containers (even without KFP). So stay tuned for some evolution in this area, but for all intents and purposes, contributors of transforms will not have to deal with Dockerfiles unless they want to for their own needs.
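
As a rough illustration of the two things the list above still requires of a contribution (a folder whose name starts with dpk_ and a requirements.txt inside it), here is a minimal sketch of a layout check. The function name `check_transform_layout` and the idea of running such a check at all are assumptions made for illustration; nothing like it is described in this issue.

```python
from pathlib import Path


def check_transform_layout(transform_dir: Path) -> list[str]:
    """Return a list of problems; an empty list means the folder matches
    the two requirements named in the proposal (hypothetical helper)."""
    problems = []

    # Item 1: the module name (and therefore the folder name) starts with dpk_.
    if not transform_dir.name.startswith("dpk_"):
        problems.append(f"folder name '{transform_dir.name}' does not start with 'dpk_'")

    # Item 4: requirements.txt is still required, pyproject.toml is optional.
    if not (transform_dir / "requirements.txt").is_file():
        problems.append("missing requirements.txt")

    # Items 2, 3, 5 and 6 are deliberately not checked: ray/spark submodules,
    # a src subfolder, a Makefile and a Dockerfile are all left to the contributor.
    return problems


if __name__ == "__main__":
    import sys

    # Usage: python check_layout.py path/to/dpk_some_transform
    for arg in sys.argv[1:]:
        issues = check_transform_layout(Path(arg))
        print(arg, "OK" if not issues else issues)
```

The check only covers what the proposal keeps mandatory; everything the proposal makes optional is ignored rather than flagged.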

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
@touma-I
Collaborator Author

touma-I commented Nov 18, 2024

For a short time, we will need to support both layouts (old and new). During that time, the CI/CD will get a bit more complicated, but we expect to do a cleanup pass once all the transforms have been moved to the new layout. Also, the packaging will get a bit trickier if we decide to roll out new transforms as their own modules while the other transforms are still using a flat structure.
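
One way CI could tell the two layouts apart during the transition is sketched below. This is only an assumption about how such a check might look: the helper `detect_layout`, the lowercase folder names `python`, `ray` and `spark`, and the `"new"` / `"old"` / `"unknown"` labels are all hypothetical and not taken from the repository.

```python
from pathlib import Path

# Assumed names of the per-runtime folders used by the old layout.
LEGACY_RUNTIME_DIRS = ("python", "ray", "spark")


def detect_layout(transform_dir: Path) -> str:
    """Classify a transform folder as 'new', 'old' or 'unknown' (hypothetical helper)."""
    # New layout: a dpk_-prefixed module folder with its own requirements.txt.
    # Checked first, because a new-style module may still contain a ray or
    # spark submodule folder.
    if transform_dir.name.startswith("dpk_") and (transform_dir / "requirements.txt").is_file():
        return "new"

    # Old layout: separate per-runtime folders under the transform folder.
    if any((transform_dir / name).is_dir() for name in LEGACY_RUNTIME_DIRS):
        return "old"

    return "unknown"


def split_by_layout(transform_dirs: list[Path]) -> dict[str, list[Path]]:
    """Group transform folders by detected layout so each group can be
    built and tested with the matching CI steps."""
    groups: dict[str, list[Path]] = {"new": [], "old": [], "unknown": []}
    for directory in transform_dirs:
        groups[detect_layout(directory)].append(directory)
    return groups
```

Once every transform lands in the "new" group, the "old" branch and the grouping step could be dropped as part of the cleanup pass mentioned above.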
