fix: Dependency update for unstructured component #3552

Merged
ogabrielluiz merged 19 commits into langflow-ai:main from bugfix-unstructured on Aug 30, 2024

Conversation

erichare
Collaborator

This PR makes two changes to the dependencies of the project:

  1. It adds the pdf extra package for unstructured, for processing PDFs
  2. It bumps the minimum version of huggingface-hub, which Unstructured requires.

@dosubot dosubot bot added the size:XS (This PR changes 0-9 lines, ignoring generated files) and python (Pull requests that update Python code) labels Aug 26, 2024
Contributor

Pull Request Validation Report

This comment is automatically generated by Conventional PR

Whitelist Report

Whitelist Active Result
Pull request is a draft and should be ignored
Pull request is made by a whitelisted user and should be ignored
Pull request is submitted by a bot and should be ignored
Pull request is submitted by administrators and should be ignored

Result

Pull request does not satisfy any enabled whitelist criteria. Pull request will be validated.

Validation Report

Validation Active Result
All commits in this pull request have valid messages
Pull request does not introduce too many changes
Pull request has a valid title
Pull request has mentioned issues
Pull request has valid branch name
Pull request should have a non-empty body

Result

Pull request satisfies all enabled pull request rules.

Last Modified at 26 Aug 24 16:46 UTC

This pull request is automatically being deployed by Amplify Hosting (learn more).

Access this pull request here: https://pr-3552.dmtpw4p5recq1.amplifyapp.com

@dosubot dosubot bot added the size:S (This PR changes 10-29 lines, ignoring generated files) label and removed the size:XS (This PR changes 0-9 lines, ignoring generated files) label Aug 26, 2024
@ogabrielluiz ogabrielluiz changed the title from "FIX: Dependency update for unstructured component" to "fix: Dependency update for unstructured component" Aug 26, 2024
Contributor

@ogabrielluiz ogabrielluiz left a comment

LGTM

@dosubot dosubot bot added the lgtm (This PR has been approved by a maintainer) label Aug 26, 2024
@github-actions github-actions bot added the bug (Something isn't working) label Aug 26, 2024
Contributor

@nicoloboschi nicoloboschi left a comment

Could you explain how the new script nltk_setup is intended to be used?

@dosubot dosubot bot removed the lgtm (This PR has been approved by a maintainer) label Aug 27, 2024
@erichare
Collaborator Author

Could you explain how the new script nltk_setup is intended to be used?

Hi @nicoloboschi,

The unstructured component, and specifically its PDF parsing tools, uses nltk behind the scenes. It assumes that three nltk packages (punkt, etc.) are already available on the host; if they aren't, the component errors and asks you to download them first.

So the basic idea is that every time we deploy a release, we make sure those nltk packages are on the machine and accessible to the nltk library. Does that make sense? Maybe there's another way to go about it, e.g. having the Dockerfile orchestrate it, but that's the idea.
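
The failure described above can be detected before the component runs. The following is a minimal sketch, not the code from this PR: it assumes nltk's `nltk.data.find`, which raises `LookupError` when a resource is not on the search path. The helper names `missing_resources` and `missing_nltk_resources` and the injected `find` parameter are hypothetical, and only `punkt` is named in this thread, so the other required packages are left out.

```python
from typing import Callable, Iterable, List


def missing_resources(paths: Iterable[str], find: Callable[[str], object]) -> List[str]:
    """Return the resource paths for which `find` raises LookupError."""
    missing = []
    for path in paths:
        try:
            find(path)
        except LookupError:
            missing.append(path)
    return missing


def missing_nltk_resources() -> List[str]:
    # Lazy import: nltk is only needed when checking a real installation.
    import nltk

    # Only "punkt" is named in the thread; "tokenizers/" is the category nltk
    # stores it under. The other required packages are not listed there.
    return missing_resources(["tokenizers/punkt"], nltk.data.find)
```

Passing `find` as a parameter keeps the check testable on a machine without nltk; an empty return value means every listed resource was found.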

@nicoloboschi
Contributor

@erichare could we make it part of the langflow run script, so that at bootstrap it verifies the needed files and downloads them if they are missing?

The current format is not that helpful for fixing or troubleshooting the problem.

@erichare
Collaborator Author

feat: allow drive loader to read a folder recursively #3572

@nicoloboschi when you get a chance, could you look at the latest commit? I'm not sure whether I put it in the most logical place: the utility function for downloading the resources is in setup.py, and the run script imports that function and calls it at startup. If the packages are already available, it skips downloading them.
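
The download-if-missing behaviour described in this comment could look roughly like the sketch below. It is an illustration under stated assumptions, not the function merged in this PR: `ensure_resources` and `ensure_nltk_resources` are hypothetical names, the resource list contains only `punkt` because that is the only package named in the thread, and the nltk calls used are `nltk.data.find` (raises `LookupError` when a resource is absent) and `nltk.download`.

```python
import logging
from typing import Callable, Iterable, List, Tuple

logger = logging.getLogger(__name__)


def ensure_resources(
    resources: Iterable[Tuple[str, str]],
    find: Callable[[str], object],
    download: Callable[[str], object],
) -> List[str]:
    """Download each (category, name) resource that `find` cannot locate.

    Returns the names that had to be downloaded.
    """
    downloaded = []
    for category, name in resources:
        try:
            find(f"{category}/{name}")
            logger.debug("NLTK resource %s already present, skipping", name)
        except LookupError:
            logger.info("Downloading missing NLTK resource %s", name)
            download(name)
            downloaded.append(name)
    return downloaded


def ensure_nltk_resources() -> List[str]:
    # Lazy import so the helper above can be exercised without nltk installed.
    import nltk

    # Hypothetical resource list: only "punkt" is named in the thread.
    return ensure_resources(
        [("tokenizers", "punkt")],
        find=nltk.data.find,
        download=lambda name: nltk.download(name, quiet=True),
    )
```

A run script would call `ensure_nltk_resources()` once before starting the server. Resources that are already present take the `find` path only, so repeated startups do not download anything again.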

@dosubot dosubot bot added the lgtm (This PR has been approved by a maintainer) label Aug 30, 2024
Contributor

@nicoloboschi nicoloboschi left a comment

LGTM

@ogabrielluiz ogabrielluiz merged commit 1ed5ce8 into langflow-ai:main Aug 30, 2024
23 of 28 checks passed
zzzming pushed a commit to datastax/ragstack-ai-langflow that referenced this pull request Aug 30, 2024
* FIX: Dependency update for unstructured component

* [autofix.ci] apply automated fixes

* Add nltk download script

* chore(pyproject.toml): move nltk_setup script definition to [tool.poetry.scripts] section for better organization and readability

* [autofix.ci] apply automated fixes

* Make nltk resource downloading part of run

* [autofix.ci] apply automated fixes

* [autofix.ci] apply automated fixes

* Use logger for nltk messages

* Move the download resources function

* [autofix.ci] apply automated fixes

* Move nltk resource download

* [autofix.ci] apply automated fixes

* Update main.py

* Update poetry lock

* Update main.py

---------

Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>
Co-authored-by: Gabriel Luiz Freitas Almeida <[email protected]>
carlosrcoelho pushed a commit that referenced this pull request Sep 2, 2024
@erichare erichare deleted the bugfix-unstructured branch September 3, 2024 16:59
Labels
bug (Something isn't working), lgtm (This PR has been approved by a maintainer), python (Pull requests that update Python code), size:S (This PR changes 10-29 lines, ignoring generated files)
4 participants