feat: add url kwarg to partititon #470

Merged
merged 6 commits into main from feat/other-content-types on Apr 12, 2023

MthwRobinson
Contributor

Summary

Closes #464 and addresses user requests in langchain#979. Adds a url kwarg to partition so that users can process remote documents directly. If the user passes in content_type, partition processes the document using content_type as the MIME type. Otherwise, it uses the Content-Type header of the HTTP response to determine the MIME type.
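The precedence rule described above can be sketched as follows. This is an illustration only, not the code merged in this PR: resolve_content_type is a hypothetical helper, and stripping header parameters such as the charset is an assumption about how a raw header value would need to be normalized.

```python
from typing import Optional


def resolve_content_type(
    header_value: Optional[str],
    content_type: Optional[str] = None,
) -> Optional[str]:
    """Hypothetical helper: choose the MIME type for a remote document.

    An explicit content_type always wins; otherwise fall back to the
    Content-Type header value returned by the server.
    """
    if content_type is not None:
        return content_type
    if not header_value:
        # No override and no header: the caller has to decide what to do.
        return None
    # Drop parameters such as "; charset=utf-8" and normalize the case.
    return header_value.split(";")[0].strip().lower()


# The header is used when no override is given.
print(resolve_content_type("text/html; charset=utf-8"))
# An explicit override takes precedence over the header.
print(resolve_content_type("text/plain; charset=utf-8", content_type="text/markdown"))
```

The override matters when a server reports a generic type for a more specific format, for example a Markdown file served as plain text, which is the case the content_type kwarg is meant to cover.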

Testing

  from unstructured.partition.auto import partition

  # A markdown file
  url = "https://raw.githubusercontent.com/Unstructured-IO/unstructured/main/LICENSE.md"
  elements = partition(url=url, content_type="text/markdown")

  # An HTML file
  url = "https://www.understandingwar.org/backgrounder/russian-offensive-campaign-assessment-april-11-2023"
  elements = partition(url=url)

  # A PDF file
  url = "https://www.understandingwar.org/sites/default/files/Russian%20Offensive%20Campaign%20Assessment%2C%20April%2011%2C%202023.pdf"
  elements = partition(url=url, strategy="fast")

@MthwRobinson MthwRobinson requested a review from qued April 12, 2023 14:56
@MthwRobinson MthwRobinson changed the title from "efeat: add url kwarg to partititon" to "feat: add url kwarg to partititon" Apr 12, 2023
Contributor

@qued qued left a comment

LGTM!

@MthwRobinson MthwRobinson enabled auto-merge (squash) April 12, 2023 18:01
@MthwRobinson MthwRobinson merged commit e2e473d into main Apr 12, 2023
@MthwRobinson MthwRobinson deleted the feat/other-content-types branch April 12, 2023 18:31
Development

Successfully merging this pull request may close these issues.

Allow support for partitioning plain text and markdown from URLs
2 participants