Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add pdf2parquet transform #416

Merged
merged 35 commits into from
Jul 25, 2024
Merged

Add pdf2parquet transform #416

merged 35 commits into from
Jul 25, 2024

Conversation

dolfim-ibm
Copy link
Member

Why are these changes needed?

This PR adds a transform for converting PDF files to Markdown

.make.defaults Outdated Show resolved Hide resolved
.make.defaults Outdated Show resolved Hide resolved
transforms/universal/pdf2md/README.md Outdated Show resolved Hide resolved
transforms/universal/pdf2md/README.md Outdated Show resolved Hide resolved
transforms/universal/pdf2md/kfp_ray/noop_multiple_wf.py Outdated Show resolved Hide resolved
transforms/universal/pdf2md/kfp_ray/noop_multiple_wf.py Outdated Show resolved Hide resolved
transforms/universal/pdf2md/python/src/pdf2md_transform.py Outdated Show resolved Hide resolved
transforms/universal/pdf2md/python/test/test_pdf2md.py Outdated Show resolved Hide resolved
transforms/universal/pdf2md/ray/README.md Outdated Show resolved Hide resolved
transforms/universal/pdf2md/ray/README.md Outdated Show resolved Hide resolved
transforms/universal/pdf2md/kfp_ray/README.md Outdated Show resolved Hide resolved
transforms/universal/pdf2md/python/README.md Outdated Show resolved Hide resolved

shortname = "pdf2md"
cli_prefix = f"{shortname}_"
pdf2md_modelsdir_key = f"{shortname}_modelsdir"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why?

transforms/universal/pdf2md/python/src/pdf2md_transform.py Outdated Show resolved Hide resolved
transforms/universal/pdf2md/python/src/pdf2md_transform.py Outdated Show resolved Hide resolved
transforms/universal/pdf2md/python/src/pdf2md_transform.py Outdated Show resolved Hide resolved
transforms/universal/pdf2md/python/src/pdf2md_transform.py Outdated Show resolved Hide resolved
transforms/universal/pdf2md/python/src/pdf2md_transform.py Outdated Show resolved Hide resolved
transforms/universal/pdf2md/python/src/pdf2md_transform.py Outdated Show resolved Hide resolved

| Parameter | Default | Description |
|------------|----------|--------------|
| `--pdf2md_modelsdir` | `./artifacts` | The location where the models are located or downloaded to |
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are input parameters, so they are fine

Copy link
Member

@daw3rd daw3rd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Separately, we are you using markdown and not raw text in the content column?


| Parameter | Default | Description |
|------------|----------|--------------|
| `--pdf2md_modelsdir` | `./artifacts` | The location where the models are located or downloaded to |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@blublinsky no, we separately have config parameters to the transform and CLI parameters. this is well-established pattern in other transforms. My comment stands.

transforms/universal/pdf2md/python/src/pdf2md_transform.py Outdated Show resolved Hide resolved
transforms/universal/pdf2md/python/src/pdf2md_transform.py Outdated Show resolved Hide resolved
@daw3rd daw3rd changed the title Add Docling pdf2md Add pdf2parquet transform Jul 18, 2024

| Parameter | Default | Description |
|------------|----------|--------------|
| `--pdf2md_modelsdir` | `./artifacts` | The location where the models are located or downloaded to |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, we should add the non-CLI config params that do not include pdf2md_ prefix to the README, then the CLI --pdf2md* params can just refer to that documentation. The point here is that there is configuration of the transform that is independent of the CLI. the prefix is used in the CLI to help better distinguish these within the very large set of params that are presented in the cloud/kfp gui to run at scale.

transforms/universal/pdf2md/python/Dockerfile Outdated Show resolved Hide resolved
transforms/universal/pdf2md/python/Makefile Outdated Show resolved Hide resolved
transforms/universal/pdf2md/python/.dockerignore Outdated Show resolved Hide resolved
transforms/universal/pdf2md/python/.gitignore Outdated Show resolved Hide resolved
transforms/universal/pdf2parquet/python/Dockerfile Outdated Show resolved Hide resolved
transforms/universal/pdf2parquet/ray/Dockerfile Outdated Show resolved Hide resolved
@dolfim-ibm dolfim-ibm force-pushed the docling-pdf2md branch 2 times, most recently from 7b0ca63 to 8c4d0bf Compare July 24, 2024 19:06
Signed-off-by: Michele Dolfi <[email protected]>
Signed-off-by: Michele Dolfi <[email protected]>
Signed-off-by: Michele Dolfi <[email protected]>
Signed-off-by: Michele Dolfi <[email protected]>
Signed-off-by: Michele Dolfi <[email protected]>
Signed-off-by: Michele Dolfi <[email protected]>
Signed-off-by: Michele Dolfi <[email protected]>
Signed-off-by: Michele Dolfi <[email protected]>
Signed-off-by: Michele Dolfi <[email protected]>
Signed-off-by: Michele Dolfi <[email protected]>
Signed-off-by: Michele Dolfi <[email protected]>
Signed-off-by: Michele Dolfi <[email protected]>
Signed-off-by: Michele Dolfi <[email protected]>
Signed-off-by: Michele Dolfi <[email protected]>
Signed-off-by: Michele Dolfi <[email protected]>
Signed-off-by: Michele Dolfi <[email protected]>
Copy link
Member

@daw3rd daw3rd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks!

@daw3rd daw3rd merged commit 8360e18 into IBM:dev Jul 25, 2024
21 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants