Feature/add unstructured docker #824
❌ Changes requested. Reviewed everything up to 01cba3e in 56 seconds
More details
- Looked at 164 lines of code in 4 files
- Skipped 1 file when reviewing
- Skipped posting 0 drafted comments based on config settings
Workflow ID: wflow_uqtEeugTORq19AQZ
Want Ellipsis to fix these issues? Tag @ellipsis-dev in a comment. You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.
RUN apt-get update && apt-get install -y --no-install-recommends \
    gcc g++ musl-dev curl libffi-dev gfortran libopenblas-dev \
    tesseract-ocr libtesseract-dev libleptonica-dev pkg-config \
    poppler-utils libmagic1 \
The `poppler-utils` package is installed twice, once in the builder stage and once in the final image. Consider removing it from the builder stage to optimize the build process.
# Install system dependencies (including those needed for Unstructured and OpenCV)
RUN apt-get update && apt-get install -y --no-install-recommends \
    gcc g++ musl-dev curl libffi-dev gfortran libopenblas-dev \
The `curl` package is installed twice, once in the builder stage and once in the final image. Consider removing it from the builder stage to optimize the build process.
    gcc g++ musl-dev curl libffi-dev gfortran libopenblas-dev \
    tesseract-ocr libtesseract-dev libleptonica-dev pkg-config \
    poppler-utils libmagic1 \
    libgl1-mesa-glx libglib2.0-0 \
The `libgl1-mesa-glx` and `libglib2.0-0` packages are installed twice, once in the builder stage and once in the final image. Consider removing them from the builder stage to optimize the build process.
# Install system dependencies (including those needed for Unstructured and OpenCV)
RUN apt-get update && apt-get install -y --no-install-recommends \
    gcc g++ musl-dev curl libffi-dev gfortran libopenblas-dev \
    tesseract-ocr libtesseract-dev libleptonica-dev pkg-config \
The `tesseract-ocr` package is installed twice, once in the builder stage and once in the final image. Consider removing it from the builder stage to optimize the build process.
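The duplicate-package comments above could be checked mechanically. The following is a minimal sketch, not part of the PR: it scans a Dockerfile for `apt-get install` commands and reports packages that appear in more than one build stage. The helper names (`packages_by_stage`, `duplicated_packages`), the example Dockerfile, and its base image are all illustrative assumptions; only the package names are taken from the review comments.

```python
import re
from collections import defaultdict

# Illustrative Dockerfile only; it is NOT the file under review.
EXAMPLE_DOCKERFILE = r"""
FROM python:3.10-slim AS builder
RUN apt-get update && apt-get install -y --no-install-recommends \
    gcc g++ curl poppler-utils tesseract-ocr \
    && rm -rf /var/lib/apt/lists/*

FROM python:3.10-slim AS final
RUN apt-get update && apt-get install -y --no-install-recommends \
    curl poppler-utils tesseract-ocr libgl1-mesa-glx \
    && rm -rf /var/lib/apt/lists/*
"""


def packages_by_stage(dockerfile_text: str) -> dict[str, set[str]]:
    """Map each build stage to the apt packages installed in it."""
    # Join backslash-continued lines so every instruction sits on one line.
    joined = re.sub(r"\\\s*\n", " ", dockerfile_text)
    stages: dict[str, set[str]] = {}
    stage = "<none>"
    for raw_line in joined.splitlines():
        line = raw_line.strip()
        from_match = re.match(r"FROM\s+\S+(?:\s+AS\s+(\S+))?", line, re.IGNORECASE)
        if from_match:
            # Unnamed stages get a positional placeholder name.
            stage = from_match.group(1) or f"stage{len(stages)}"
            stages.setdefault(stage, set())
            continue
        if not line.upper().startswith("RUN"):
            continue
        # A RUN line may chain several shell commands.
        for command in re.split(r"&&|;", line[3:]):
            tokens = command.split()
            if tokens[:2] == ["apt-get", "install"]:
                packages = {t for t in tokens[2:] if not t.startswith("-")}
                stages.setdefault(stage, set()).update(packages)
    return stages


def duplicated_packages(stages: dict[str, set[str]]) -> dict[str, list[str]]:
    """Return packages installed in more than one stage, with those stages."""
    seen: dict[str, list[str]] = defaultdict(list)
    for stage, packages in stages.items():
        for package in packages:
            seen[package].append(stage)
    return {pkg: names for pkg, names in seen.items() if len(names) > 1}


if __name__ == "__main__":
    for pkg, names in sorted(duplicated_packages(packages_by_stage(EXAMPLE_DOCKERFILE)).items()):
        print(f"{pkg}: installed in {', '.join(names)}")
```

A duplicate reported this way is only a candidate for removal. In a multi-stage build, packages installed in the builder stage are not carried into the final image, so anything needed at runtime has to stay in the final stage, and anything needed only to compile wheels belongs in the builder.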
❌ Changes requested. Incremental review on 5e73e8e in 38 seconds
More details
- Looked at 45 lines of code in 2 files
- Skipped 0 files when reviewing
- Skipped posting 0 drafted comments based on config settings
Workflow ID: wflow_D8JeqLeLKitd5lXf
if isinstance(data, bytes):
    data = BytesIO(data)

# TODO - Include check on excluded parsers here.
The TODO comment suggests adding a check for excluded parsers, which is not implemented. This is crucial for ensuring that the parsing process respects the configuration settings to exclude certain parsers.
Suggested change:
- # TODO - Include check on excluded parsers here.
+ # Implement the check for excluded parsers as indicated by the TODO.
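As a rough illustration of what the missing check might look like, here is a minimal sketch that skips parsing when the document type is configured as excluded. It is not the PR's implementation: the `Extraction` dataclass is a simplified stand-in for the real class, and the `parse` signature, the `parser` callable, and the `excluded_parsers` argument are hypothetical names chosen for this example.

```python
import logging
from dataclasses import dataclass
from io import BytesIO
from typing import Callable, Collection, Iterable, Iterator, Union

logger = logging.getLogger(__name__)


@dataclass
class Extraction:
    """Simplified stand-in for the provider's Extraction object (assumption)."""

    document_type: str
    text: str


def parse(
    data: Union[bytes, BytesIO],
    document_type: str,
    parser: Callable[[BytesIO], Iterable[str]],
    excluded_parsers: Collection[str],
) -> Iterator[Extraction]:
    """Yield one Extraction per parsed element, unless the type is excluded."""
    if isinstance(data, bytes):
        data = BytesIO(data)

    # The check the TODO asks for: respect the configured exclusions.
    if document_type in excluded_parsers:
        logger.info("Skipping excluded parser for type %r", document_type)
        return

    count = 0
    for text in parser(data):
        count += 1
        yield Extraction(document_type=document_type, text=text)
    logger.info("Parsed %d element(s) of type %r", count, document_type)


def whole_file_parser(stream: BytesIO) -> Iterable[str]:
    """Hypothetical parser that returns the decoded stream as one element."""
    return [stream.read().decode("utf-8")]


if __name__ == "__main__":
    print(list(parse(b"hello", "txt", whole_file_parser, excluded_parsers={"pdf"})))
    print(list(parse(b"hello", "pdf", whole_file_parser, excluded_parsers={"pdf"})))
```

Returning early from the generator means an excluded type produces an empty iterator rather than an error, so callers that loop over the results need no special handling. Whether exclusion should instead raise or log at a different level is a design choice for the maintainers.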
Summary:
Add Dockerfile for 'unstructured' parsing provider, update dependencies, enhance 'UnstructuredParsingProvider' class, and update tests.
Key points:
- `Dockerfile.unstructured`: build a Docker image with the 'unstructured' parsing provider.
- `pyproject.toml`: include the `poppler-utils` dependency.
- `r2r.toml`: set the parsing provider to 'unstructured'.
- `r2r/providers/parsing/unstructured_parsing.py`: yield `Extraction` objects and log parsing details.
- `tests/test_config.py`: reflect changes in configuration.

Generated with ❤️ by ellipsis.dev