Add pdf2parquet transform #416

Merged on Jul 25, 2024 (35 commits)

Commits on Jul 25, 2024

  1. copy noop as skeleton

    Signed-off-by: Michele Dolfi <[email protected]>
    dolfim-ibm committed Jul 25, 2024
    a0d485e
  2. add docling for converting pdf documents to md

    Signed-off-by: Michele Dolfi <[email protected]>
    dolfim-ibm committed Jul 25, 2024
    f6bed21
  3. add DOCKER_PLATFORM variable

    Signed-off-by: Michele Dolfi <[email protected]>
    dolfim-ibm committed Jul 25, 2024
    81ee45e
  4. fix Makefile, improve dockerfile and address feedback

    Signed-off-by: Michele Dolfi <[email protected]>
    dolfim-ibm committed Jul 25, 2024
    f92010d
  5. add ray, simplify CLI params parsing, use Pdf2Md as name

    Signed-off-by: Michele Dolfi <[email protected]>
    dolfim-ibm committed Jul 25, 2024
    58aa5d8
  6. fix dockerfile and noop

    Signed-off-by: Michele Dolfi <[email protected]>
    dolfim-ibm committed Jul 25, 2024
    9031fdf
  7. add /src in root dir and avoid double COPY

    Signed-off-by: Michele Dolfi <[email protected]>
    dolfim-ibm committed Jul 25, 2024
    d5dcc08
  8. add transform_class in TransformConfiguration

    Signed-off-by: Michele Dolfi <[email protected]>
    dolfim-ibm committed Jul 25, 2024
    e938f88
  9. transform_class arg for ray, simplify cli args, remove download models (done automatically), cleanup prints
    
    Signed-off-by: Michele Dolfi <[email protected]>
    dolfim-ibm committed Jul 25, 2024
    bca0aaa
  10. add kfp

    Signed-off-by: Michele Dolfi <[email protected]>
    dolfim-ibm committed Jul 25, 2024
    b378ad1
  11. fixed for tests

    Signed-off-by: Michele Dolfi <[email protected]>
    dolfim-ibm committed Jul 25, 2024
    32a0f40
  12. fix test_pdf2md.py input/output

    Signed-off-by: Michele Dolfi <[email protected]>
    dolfim-ibm committed Jul 25, 2024
    134c48c
  13. fix reference data

    Signed-off-by: Michele Dolfi <[email protected]>
    dolfim-ibm committed Jul 25, 2024
    6a5a466
  14. ignore verbose contents since we check its content hash

    Signed-off-by: Michele Dolfi <[email protected]>
    dolfim-ibm committed Jul 25, 2024
    526484a
  15. rename to pdf2parquet

    Signed-off-by: Michele Dolfi <[email protected]>
    dolfim-ibm committed Jul 25, 2024
    a6a0950
  16. update expected results to match container versions

    Signed-off-by: Michele Dolfi <[email protected]>
    dolfim-ibm committed Jul 25, 2024
    2a386e9
  17. 4d65403
  18. add run_locally

    Signed-off-by: Michele Dolfi <[email protected]>
    dolfim-ibm committed Jul 25, 2024
    52df94f
  19. typo and expand README

    Signed-off-by: Michele Dolfi <[email protected]>
    dolfim-ibm committed Jul 25, 2024
    9efae7d
  20. raise final failures

    Signed-off-by: Michele Dolfi <[email protected]>
    dolfim-ibm committed Jul 25, 2024
    4531cc3
  21. add free up space for test-universal

    Signed-off-by: Michele Dolfi <[email protected]>
    dolfim-ibm committed Jul 25, 2024
    7e1624a
  22. keep jvm

    Signed-off-by: Michele Dolfi <[email protected]>
    dolfim-ibm committed Jul 25, 2024
    ec2b469
  23. remove temporary build in CI

    Signed-off-by: Michele Dolfi <[email protected]>
    dolfim-ibm committed Jul 25, 2024
    172760d
  24. add option for json output

    Signed-off-by: Michele Dolfi <[email protected]>
    dolfim-ibm committed Jul 25, 2024
    6d27e73
  25. add docs

    Signed-off-by: Michele Dolfi <[email protected]>
    dolfim-ibm committed Jul 25, 2024
    7c5efd7
  26. py 3.10 compatibility

    Signed-off-by: Michele Dolfi <[email protected]>
    dolfim-ibm committed Jul 25, 2024
    0e12d6e
  27. fix test results

    Signed-off-by: Michele Dolfi <[email protected]>
    dolfim-ibm committed Jul 25, 2024
    e31318b
  28. restore proper test

    Signed-off-by: Michele Dolfi <[email protected]>
    dolfim-ibm committed Jul 25, 2024
    b07f46e
  29. remove blacklisting of contents

    Signed-off-by: Michele Dolfi <[email protected]>
    dolfim-ibm committed Jul 25, 2024
    916eb1b
  30. apply changes as in IBM#440

    Signed-off-by: Michele Dolfi <[email protected]>
    dolfim-ibm committed Jul 25, 2024
    8c57e1a
  31. move pdf2parquet to language

    Signed-off-by: Michele Dolfi <[email protected]>
    dolfim-ibm committed Jul 25, 2024
    c07d01e
  32. add ocr extra

    Signed-off-by: Michele Dolfi <[email protected]>
    dolfim-ibm committed Jul 25, 2024
    340a0cc
  33. add more tests for do_ocr and pdf2parquet_do_table_structure

    Signed-off-by: Michele Dolfi <[email protected]>
    dolfim-ibm committed Jul 25, 2024
    2fdbf83
  34. add pdf2parquet to the README

    Signed-off-by: Michele Dolfi <[email protected]>
    dolfim-ibm committed Jul 25, 2024
    4d9a985
  35. extend timeout for the language images

    Signed-off-by: Michele Dolfi <[email protected]>
    dolfim-ibm committed Jul 25, 2024
    8e06993