Update unstructured document loaders #2213

Merged: 13 commits, Apr 21, 2024

Contributor

@QuinnGT QuinnGT commented Apr 19, 2024

Description:

  • Updated the unstructured file loader for the latest API changes
  • Updated the unstructured folder loader for the latest API changes
  • Updated the S3File loader for the latest unstructured API changes
  • Updated the S3File loader to use the new region lookup method
  • Updated the S3File loader to use the updated credential function
  • Added support for ocrLanguages; this should be renamed to just languages once the langchain unstructured integration gets an update
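
To illustrate the ocrLanguages to languages transition mentioned in the last item, here is a minimal sketch of how a loader might accept both names during the changeover. The `LanguageOptions` interface and `resolveLanguages` helper are hypothetical and are not taken from this PR, Flowise, or langchainjs; only the two option names come from the description above.

```typescript
// Hypothetical option shape; not the actual Flowise or langchainjs types.
interface LanguageOptions {
    ocrLanguages?: string[] // name currently exposed, per the description above
    languages?: string[] // name the description expects to replace it
}

// Returns the language codes to send, preferring `languages` when both are set.
function resolveLanguages(options: LanguageOptions): string[] {
    const chosen = options.languages ?? options.ocrLanguages ?? []
    return chosen
        .map((code) => code.trim()) // tolerate stray whitespace from UI input
        .filter((code) => code.length > 0) // drop empty entries
        .filter((code, index, all) => all.indexOf(code) === index) // de-duplicate, keep order
}
```

With a helper like this, existing flows that still pass ocrLanguages would keep working after the rename, and the old name could be dropped later in one place.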

Note:
The unstructured folder loader currently doesn't work with some file types, such as .png and .heic, due to problems with their API. A fix is in progress for a future release.

Also, I would like to add an S3Folder loader, but I'll do that in a separate PR since langchainjs doesn't support it yet and I'll have to go directly to the SDK.

Issues this addresses:

@HenryHengZJ
Contributor

Thanks @QuinnGT this is so good!

There are a few other parameters we can add in the future:

  • multiPageSections
  • combineUnderNChars
  • newAfterNChars
  • maxCharacters
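
As a rough sketch of how these four parameters might eventually be passed through, the snippet below maps camelCase loader options to request form fields. The `ChunkingOptions` interface, the `toFormFields` helper, and the snake_case field names are assumptions made for illustration; they should be checked against the Unstructured API and langchainjs before use.

```typescript
// Hypothetical option shape mirroring the parameter names listed above.
interface ChunkingOptions {
    multiPageSections?: boolean
    combineUnderNChars?: number
    newAfterNChars?: number
    maxCharacters?: number
}

// Assumed snake_case form field names; verify against the Unstructured API docs.
const FIELD_NAMES: Record<keyof ChunkingOptions, string> = {
    multiPageSections: 'multipage_sections',
    combineUnderNChars: 'combine_under_n_chars',
    newAfterNChars: 'new_after_n_chars',
    maxCharacters: 'max_characters'
}

// Builds string form fields, skipping any option the user left unset.
function toFormFields(options: ChunkingOptions): Record<string, string> {
    const fields: Record<string, string> = {}
    for (const key of Object.keys(FIELD_NAMES) as (keyof ChunkingOptions)[]) {
        const value = options[key]
        if (value !== undefined) {
            fields[FIELD_NAMES[key]] = String(value)
        }
    }
    return fields
}
```

Leaving unset options out of the request, rather than sending empty strings, lets the service apply its own defaults.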

@QuinnGT
Contributor Author

QuinnGT commented Apr 19, 2024

Thank you! I completely agree. Looking forward to the langchain upgrade so we can bring those additional params in.

@HenryHengZJ
Contributor

> Thank you! I completely agree. Looking forward to the langchain upgrade so we can bring those additional params in.

merged!

Contributor

@HenryHengZJ HenryHengZJ left a comment

looks good! are you still planning to add more changes or is it good to merge @QuinnGT ?

@QuinnGT
Contributor Author

QuinnGT commented Apr 20, 2024 via email

…alues, and support for null on multiOptions.
@QuinnGT
Contributor Author

QuinnGT commented Apr 21, 2024

Hey @HenryHengZJ take a look at the latest changes. Just tested on multiple docs, fixed some errors, and got it stable.

@HenryHengZJ
Contributor

> Hey @HenryHengZJ take a look at the latest changes. Just tested on multiple docs, fixed some errors, and got it stable.

awesome thank you so much!

@HenryHengZJ HenryHengZJ merged commit 4c2ba10 into FlowiseAI:main Apr 21, 2024
2 checks passed
@QuinnGT QuinnGT deleted the Feature/update-unstructured branch April 21, 2024 18:57

2 participants