feat: adds UnstructuredURLLoader for loading data from urls #979

Merged: 3 commits, Feb 10, 2023

MthwRobinson

Summary

Adds an UnstructuredURLLoader that supports loading data from a list of URLs.

Testing

from langchain.document_loaders import UnstructuredURLLoader

urls = [
    "https://www.understandingwar.org/backgrounder/russian-offensive-campaign-assessment-february-8-2023",
    "https://www.understandingwar.org/backgrounder/russian-offensive-campaign-assessment-february-9-2023"
]
loader = UnstructuredURLLoader(urls=urls)
raw_documents = loader.load()

hwchase17 left a comment

looks awesome! thanks

hwchase17 merged commit 07a407d into langchain-ai:master on Feb 10, 2023
dongreenberg pushed a commit to dongreenberg/langchain that referenced this pull request Feb 17, 2023
blob42 mentioned this pull request on Feb 21, 2023
zachschillaci27 pushed a commit to zachschillaci27/langchain that referenced this pull request Mar 8, 2023

Boscop commented Apr 9, 2023

@MthwRobinson Can you please adjust UnstructuredURLLoader to allow loading text/plain files from URLs? (Ideally also other mime types like text/markdown.)
Currently I'm getting Error fetching or processing https://raw.githubusercontent.com/[...], exeption: Expected content type text/html. Got text/plain; charset=utf-8.

@MthwRobinson

@Boscop - Thanks for flagging. We added an issue in the unstructured library to support other MIME types and will pick it up as soon as we can.

@IqraShahid-dev

@MthwRobinson I got an error for a PDF while fetching it from an S3 bucket:

raise ValueError(f"Expected content type text/html. Got {content_type}.")
ValueError: Expected content type text/html. Got application/pdf.
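
Both reports above fail at the same point: the response's Content-Type is compared against text/html. A minimal sketch of that kind of check, using only the standard language features, is shown here. The function name `check_content_type` and the choice to ignore parameters such as `charset` are assumptions for illustration; this is not the unstructured library's actual code.

```python
def check_content_type(content_type: str, expected: str = "text/html") -> None:
    """Raise ValueError if the header's media type differs from `expected`.

    Hypothetical helper: parameters after ';' (e.g. charset) are ignored
    when comparing, which is an assumption of this sketch.
    """
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type != expected:
        raise ValueError(f"Expected content type {expected}. Got {content_type}.")


# Try the check against the two header values reported in this thread,
# plus an HTML header for comparison.
for header in (
    "text/html; charset=utf-8",
    "text/plain; charset=utf-8",
    "application/pdf",
):
    try:
        check_content_type(header)
        print(f"{header!r}: accepted")
    except ValueError as err:
        print(f"{header!r}: {err}")
```

Under this sketch, plain-text and PDF responses are rejected with the same message shape as the errors quoted above, which is why both a raw GitHub file and an S3-hosted PDF hit it.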

@MthwRobinson

#2793 will allow the URL loader to process non-HTML resources. @Boscop - you can pass the content_type kwarg into the loader to force it to use a specific MIME type.
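
One way to choose a value for the content_type kwarg mentioned above is to guess it from the URL's file extension. The sketch below uses Python's standard `mimetypes` module; the helper name `guess_content_type`, the example URL, and the fallback to text/html are assumptions for illustration. The commented-out loader call only mirrors the comment above and has not been checked against a specific langchain release.

```python
import mimetypes


def guess_content_type(url: str, default: str = "text/html") -> str:
    """Guess a MIME type from the URL's extension, falling back to `default`.

    Hypothetical helper: it looks only at the extension, not at the
    Content-Type header the server actually sends.
    """
    guessed, _encoding = mimetypes.guess_type(url)
    return guessed or default


url = "https://example.com/notes.txt"  # hypothetical URL
content_type = guess_content_type(url)
print(content_type)

# Per the comment above, the value would then be passed to the loader as a
# keyword argument (assumed usage, requires langchain and network access):
# from langchain.document_loaders import UnstructuredURLLoader
# loader = UnstructuredURLLoader(urls=[url], content_type=content_type)
# raw_documents = loader.load()
```

Guessing from the extension is only a heuristic; URLs without an extension fall back to the default here.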
