feat: adds `UnstructuredURLLoader` for loading data from urls #979

MthwRobinson · 2023-02-10T17:05:14Z

Summary

Adds an `UnstructuredURLLoader` that supports loading data from a list of URLs.

Testing

from langchain.document_loaders import UnstructuredURLLoader

urls = [
    "https://www.understandingwar.org/backgrounder/russian-offensive-campaign-assessment-february-8-2023",
    "https://www.understandingwar.org/backgrounder/russian-offensive-campaign-assessment-february-9-2023"
]
loader = UnstructuredURLLoader(urls=urls)
raw_documents = loader.load()

hwchase17

looks awesome! thanks


Boscop · 2023-04-09T02:21:19Z

@MthwRobinson Can you please adjust `UnstructuredURLLoader` to allow loading `text/plain` files from URLs? (Ideally also other MIME types like `text/markdown`.)
Currently I'm getting: `Error fetching or processing https://raw.githubusercontent.com/[...], exeption: Expected content type text/html. Got text/plain; charset=utf-8.`

MthwRobinson · 2023-04-10T16:02:16Z

@Boscop - Thanks for flagging, we added an issue in the unstructured library to support other MIME types and will pick it up as soon as we can.

IqraShahid-dev · 2023-04-12T10:43:13Z

@MthwRobinson I got an error for a PDF while fetching from an S3 bucket:

raise ValueError(f"Expected content type text/html. Got {content_type}.")
ValueError: Expected content type text/html. Got application/pdf.

MthwRobinson · 2023-04-12T20:19:39Z

#2793 will allow for processing non-HTML resources with the URL loader. @Boscop - you can pass the `content_type` kwarg into the loader to force it to use a specific MIME type.
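
For readers following the workaround above, here is a minimal sketch of one way to choose a value for `content_type` before constructing the loader. The helper `guess_content_type` and the `example.com` URLs are hypothetical and not part of langchain or unstructured; the helper only uses the standard library to guess a MIME type from the file extension. The commented-out loader call assumes the `content_type` kwarg behaves as described in the comment above and has not been checked against a specific release.

```python
import mimetypes
from urllib.parse import urlparse


def guess_content_type(url: str, default: str = "text/html") -> str:
    """Guess a MIME type from the URL path's file extension (hypothetical helper)."""
    # Use only the path so query strings and fragments do not confuse the lookup.
    path = urlparse(url).path
    guessed, _encoding = mimetypes.guess_type(path)
    # Fall back to HTML, the type the loader expected in the errors above.
    return guessed or default


if __name__ == "__main__":
    for url in (
        "https://example.com/notes.txt",
        "https://example.com/report.pdf?version=2",
        "https://example.com/article",
    ):
        print(url, "->", guess_content_type(url))

    # Assumed usage, following the comment above (requires langchain and
    # unstructured to be installed; the kwarg name comes from that comment):
    # from langchain.document_loaders import UnstructuredURLLoader
    # loader = UnstructuredURLLoader(urls=[url], content_type=guess_content_type(url))
    # raw_documents = loader.load()
```

Guessing from the extension is only a heuristic: a server may report a different `Content-Type` header than the extension suggests, so an explicit value is safer when the type is known in advance.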

MthwRobinson added 3 commits February 10, 2023 11:29

added url loader

4505397

added example notebook for urls

3429118

export UnstructuredURLLoader

f25d794

hwchase17 approved these changes Feb 10, 2023


hwchase17 merged commit 07a407d into langchain-ai:master Feb 10, 2023

blob42 mentioned this pull request Feb 21, 2023

fix searx blob42/langchain#1

Closed

MthwRobinson mentioned this pull request Apr 10, 2023

Allow support for partitioning plain text and markdown from URLs Unstructured-IO/unstructured#464

Closed

MthwRobinson mentioned this pull request Apr 12, 2023

feat: add url kwarg to partititon Unstructured-IO/unstructured#470

Merged
