Feature/add unstructured docker #824

Merged 3 commits on Aug 1, 2024

@emrgnt-cmplxty emrgnt-cmplxty commented Aug 1, 2024

🚀 This description was created by Ellipsis for commit 5e73e8e

Summary:

Add Dockerfile for 'unstructured' parsing provider, update dependencies, enhance 'UnstructuredParsingProvider' class, and update tests.

Key points:

  • Added Dockerfile.unstructured to build a Docker image with 'unstructured' parsing provider.
  • Updated pyproject.toml to include poppler-utils dependency.
  • Modified r2r.toml to set parsing provider to 'unstructured'.
  • Enhanced r2r/providers/parsing/unstructured_parsing.py to yield Extraction objects and log parsing details.
  • Updated tests/test_config.py to reflect changes in configuration.
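
A minimal sketch of what "yield Extraction objects and log parsing details" could look like. The `Extraction` dataclass, its fields, and `split_into_elements` are hypothetical stand-ins for illustration, not the actual R2R or unstructured APIs:

```python
import logging
from dataclasses import dataclass, field
from io import BytesIO
from typing import Any, Iterator, Union

logger = logging.getLogger(__name__)


@dataclass
class Extraction:
    """Hypothetical stand-in for the project's Extraction model."""

    document_id: str
    data: str
    metadata: dict[str, Any] = field(default_factory=dict)


def split_into_elements(stream: BytesIO) -> list[str]:
    """Placeholder for the real partitioning call; splits text on blank lines."""
    text = stream.read().decode("utf-8", errors="replace")
    return [chunk.strip() for chunk in text.split("\n\n") if chunk.strip()]


def parse(document_id: str, data: Union[bytes, BytesIO]) -> Iterator[Extraction]:
    """Yield one Extraction per parsed element and log how many were found."""
    if isinstance(data, bytes):
        data = BytesIO(data)
    elements = split_into_elements(data)
    logger.info("Parsed %d elements from document %s", len(elements), document_id)
    for index, element in enumerate(elements):
        yield Extraction(
            document_id=document_id,
            data=element,
            metadata={"chunk_order": index},  # assumed metadata key
        )
```

Because `parse` is a generator, callers can consume extractions one at a time instead of holding every element in memory.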

Generated with ❤️ by ellipsis.dev

@emrgnt-cmplxty emrgnt-cmplxty marked this pull request as ready for review August 1, 2024 23:51

@ellipsis-dev ellipsis-dev bot left a comment

❌ Changes requested. Reviewed everything up to 01cba3e in 56 seconds

More details
  • Looked at 164 lines of code in 4 files
  • Skipped 1 file when reviewing.
  • Skipped posting 0 drafted comments based on config settings.

Workflow ID: wflow_uqtEeugTORq19AQZ


RUN apt-get update && apt-get install -y --no-install-recommends \
    gcc g++ musl-dev curl libffi-dev gfortran libopenblas-dev \
    tesseract-ocr libtesseract-dev libleptonica-dev pkg-config \
    poppler-utils libmagic1 \

The poppler-utils package is installed twice, once in the builder stage and once in the final image. Consider removing it from the builder stage to optimize the build process.


# Install system dependencies (including those needed for Unstructured and OpenCV)
RUN apt-get update && apt-get install -y --no-install-recommends \
    gcc g++ musl-dev curl libffi-dev gfortran libopenblas-dev \

The curl package is installed twice, once in the builder stage and once in the final image. Consider removing it from the builder stage to optimize the build process.

    gcc g++ musl-dev curl libffi-dev gfortran libopenblas-dev \
    tesseract-ocr libtesseract-dev libleptonica-dev pkg-config \
    poppler-utils libmagic1 \
    libgl1-mesa-glx libglib2.0-0 \

The libgl1-mesa-glx and libglib2.0-0 packages are installed twice, once in the builder stage and once in the final image. Consider removing them from the builder stage to optimize the build process.

# Install system dependencies (including those needed for Unstructured and OpenCV)
RUN apt-get update && apt-get install -y --no-install-recommends \
    gcc g++ musl-dev curl libffi-dev gfortran libopenblas-dev \
    tesseract-ocr libtesseract-dev libleptonica-dev pkg-config \

The tesseract-ocr package is installed twice, once in the builder stage and once in the final image. Consider removing it from the builder stage to optimize the build process.
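
The duplicate-install comments above can be checked mechanically. The following is a rough sketch, not part of the PR: it scans a Dockerfile's text for `apt-get install` commands and reports packages that appear in more than one build stage. The function names are hypothetical, and the `FROM` matching is a simple heuristic that does not handle flags such as `--platform`.

```python
import re


def packages_by_stage(dockerfile_text: str) -> dict[str, set[str]]:
    """Map each build stage to the packages passed to `apt-get install`."""
    # Join backslash-continued lines so each instruction sits on one line.
    joined = re.sub(r"\\[ \t]*\n", " ", dockerfile_text)
    stages: dict[str, set[str]] = {}
    stage = "stage-0"
    stage_count = 0
    for raw_line in joined.splitlines():
        line = raw_line.strip()
        from_match = re.match(r"FROM\s+\S+(?:\s+AS\s+(\S+))?", line, re.IGNORECASE)
        if from_match:
            # Unnamed stages get a positional name such as "stage-1".
            stage = from_match.group(1) or f"stage-{stage_count}"
            stage_count += 1
            continue
        if not line.upper().startswith("RUN "):
            continue
        # A single RUN may chain several shell commands with && or ;
        for command in re.split(r"&&|;", line[4:]):
            tokens = command.split()
            if tokens[:2] == ["apt-get", "install"]:
                packages = {t for t in tokens[2:] if not t.startswith("-")}
                stages.setdefault(stage, set()).update(packages)
    return stages


def duplicated_packages(dockerfile_text: str) -> dict[str, list[str]]:
    """Return packages installed in more than one stage, with the stage names."""
    seen: dict[str, list[str]] = {}
    for stage, packages in packages_by_stage(dockerfile_text).items():
        for package in sorted(packages):
            seen.setdefault(package, []).append(stage)
    return {pkg: names for pkg, names in seen.items() if len(names) > 1}
```

A duplicate is not automatically a mistake: a package may be needed at build time in one stage and at run time in another, so the output is a list of candidates to review rather than a list of lines to delete.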

@ellipsis-dev ellipsis-dev bot left a comment

❌ Changes requested. Incremental review on 5e73e8e in 38 seconds

More details
  • Looked at 45 lines of code in 2 files
  • Skipped 0 files when reviewing.
  • Skipped posting 0 drafted comments based on config settings.

Workflow ID: wflow_D8JeqLeLKitd5lXf


if isinstance(data, bytes):
    data = BytesIO(data)

# TODO - Include check on excluded parsers here.

The TODO comment suggests adding a check for excluded parsers, which is not implemented. This is crucial for ensuring that the parsing process respects the configuration settings to exclude certain parsers.

Suggested change
# TODO - Include check on excluded parsers here.
# Implement the check for excluded parsers as indicated by the TODO.
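
One way the missing check could be filled in, as a sketch under stated assumptions: the `ParsingConfig` class, its `excluded_parsers` field, and the `document_type` argument are hypothetical names chosen for illustration, and the real provider may identify parsers differently.

```python
from dataclasses import dataclass, field
from io import BytesIO
from typing import Iterator, Union


@dataclass
class ParsingConfig:
    """Hypothetical config; the field name `excluded_parsers` is an assumption."""

    excluded_parsers: list[str] = field(default_factory=list)


def is_excluded(document_type: str, config: ParsingConfig) -> bool:
    """Compare case-insensitively and ignore a leading dot ('.PNG' == 'png')."""
    normalized = document_type.lower().lstrip(".")
    excluded = {name.lower().lstrip(".") for name in config.excluded_parsers}
    return normalized in excluded


def parse(
    data: Union[bytes, BytesIO], document_type: str, config: ParsingConfig
) -> Iterator[str]:
    if isinstance(data, bytes):
        data = BytesIO(data)

    # The check the TODO asks for: skip types the configuration excludes.
    if is_excluded(document_type, config):
        return  # the generator ends without yielding anything

    # Placeholder for the real parsing step.
    yield data.read().decode("utf-8", errors="replace")
```

Returning early from the generator keeps the caller's loop unchanged: an excluded document simply produces no extractions. Raising an error instead would be a reasonable alternative if silent skips are undesirable.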

@emrgnt-cmplxty emrgnt-cmplxty merged commit ccfa8fe into dev Aug 1, 2024
1 of 3 checks passed