-
Notifications
You must be signed in to change notification settings - Fork 4.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: Dependency update for unstructured component #3552
Conversation
Pull Request Validation ReportThis comment is automatically generated by Conventional PR Whitelist Report
Result Pull request does not satisfy any enabled whitelist criteria. Pull request will be validated. Validation Report
Result Pull request satisfies all enabled pull request rules. Last Modified at 26 Aug 24 16:46 UTC |
This pull request is automatically being deployed by Amplify Hosting (learn more). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
e7541f0
to
95b24b2
Compare
…try.scripts] section for better organization and readability
caf6300
to
a8f6e1b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you explain how the new script nltk_setup
is intended to be used ?
the unstructured component, and specifically the pdf parsing tools, use nltk behind the scenes. But they assume that these three packages, punkt etc, are already available on the host - if they aren’t, the component will error and ask you to download them first. so the basic idea is just that every time we deploy a release that we are sure those nltk packages are on the machine and accessible for the nltk library - does that make sense? Maybe there’s another way to go about it, I.,e, the dockerfile orchestrates it? But that’s the idea |
@erichare could we make it part of the The current format is not that helpful in fix/troubleshoot the problem |
@nicoloboschi when you get a chance, could you look at the latest commit? I'm not sure whether i put it exactly in the most logical place, but the utility function for downloading the resources i put in setup.py, then the run script imports that function and calls it at startup. If the packages are already available, it'll skip downloading them. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
* FIX: Dependency update for unstructured component * [autofix.ci] apply automated fixes * Add nltk download script * chore(pyproject.toml): move nltk_setup script definition to [tool.poetry.scripts] section for better organization and readability * [autofix.ci] apply automated fixes * Make nltk resource downloading part of run * [autofix.ci] apply automated fixes * [autofix.ci] apply automated fixes * Use logger for nltk messages * Move the download resources function * [autofix.ci] apply automated fixes * Move nltk resource download * [autofix.ci] apply automated fixes * Update main.py * Update poetry lock * Update main.py --------- Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com> Co-authored-by: Gabriel Luiz Freitas Almeida <[email protected]>
* FIX: Dependency update for unstructured component * [autofix.ci] apply automated fixes * Add nltk download script * chore(pyproject.toml): move nltk_setup script definition to [tool.poetry.scripts] section for better organization and readability * [autofix.ci] apply automated fixes * Make nltk resource downloading part of run * [autofix.ci] apply automated fixes * [autofix.ci] apply automated fixes * Use logger for nltk messages * Move the download resources function * [autofix.ci] apply automated fixes * Move nltk resource download * [autofix.ci] apply automated fixes * Update main.py * Update poetry lock * Update main.py --------- Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com> Co-authored-by: Gabriel Luiz Freitas Almeida <[email protected]>
This PR makes two changes to the dependencies of the project:
pdf
extra package for unstructured, for processing PDFshuggingface-hub
which is required for Unstructured.