
feat: add Spider Web Scraper & Crawler #2439

Merged · 14 commits merged into langflow-ai:main from spider-integration on Aug 8, 2024

Conversation

@WilliamEspegren (Contributor)

Add Spider, the fastest open source scraper & crawler that returns LLM-ready data.

Twitter: @WilliamEspegren

Review thread on this snippet from the component:

    except Exception as e:
        raise Exception(f"Error: {str(e)}")

    records = []
@WilliamEspegren (Contributor, Author)
This returns a linting error:

[screenshot of the linting error]

But I would argue this is the right return type, because the panel that displays what the tool returned is large and allows for multiple entries.

@ogabrielluiz (Contributor)

You should change this:

    ) -> Data:

To this:

    ) -> list[Data]:
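For reference, the full corrected signature (parameter list exactly as in the component code posted later in this conversation):

    def build(
        self,
        spider_api_key: str,
        url: str,
        mode: str,
        limit: Optional[int] = 0,
        depth: Optional[int] = 0,
        blacklist: Optional[str] = None,
        whitelist: Optional[str] = None,
        use_readability: Optional[bool] = False,
        request_timeout: Optional[int] = 30,
        metadata: Optional[bool] = False,
        params: Optional[Data] = None,
    ) -> list[Data]:

This matches what the method actually returns: records, a list of Data objects.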

@WilliamEspegren (Contributor, Author)

Screencast showing how it works

Screencast.from.2024-06-29.14-50-38.webm

@WilliamEspegren marked this pull request as ready for review on June 29, 2024 20:54
@dosubot added labels: size:L (changes 100-499 lines, ignoring generated files), enhancement, javascript, python (Jun 29, 2024)
@ogabrielluiz (Contributor)

Wow. This is nice, @WilliamEspegren

Thank you! I'll approve it ASAP.

@ogabrielluiz (Contributor)

Have you seen our new way of building components?

@WilliamEspegren (Contributor, Author)

> Have you seen our new way of building components?

No, I assume this is not the way, since you are mentioning it?

@ogabrielluiz (Contributor)

Take a look at the OpenAIModelComponent:

    class OpenAIModelComponent(LCModelComponent):
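For readers outside this thread: in the newer pattern, a component inherits from Component, declares inputs and outputs as class attributes, and reads input values from self. A minimal sketch, assuming Output and MessageTextInput are importable from langflow.io (the class and field names below are illustrative, not from this PR):

    from langflow.custom import Component
    from langflow.io import MessageTextInput, Output
    from langflow.schema import Data


    class ExampleComponent(Component):
        display_name = "Example"
        description = "Minimal new-style component."

        inputs = [
            MessageTextInput(name="text", display_name="Text"),
        ]

        outputs = [
            # Each Output is wired to a method by name.
            Output(display_name="Data", name="data", method="build_output"),
        ]

        def build_output(self) -> Data:
            # Input values are read from self.<input name>,
            # not passed in as build() arguments.
            return Data(data={"text": self.text})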

@WilliamEspegren (Contributor, Author)

@ogabrielluiz I actually tried that first (bbd7136), but I got errors.

@ogabrielluiz changed the title from "Add Spider Web Scraper & Crawler" to "feat: add Spider Web Scraper & Crawler" on Jul 1, 2024
@WilliamEspegren (Contributor, Author)

@ogabrielluiz

This:

from typing import Optional
from spider.spider import Spider
from langflow.custom import CustomComponent
from langflow.schema import Data
from langflow.base.langchain_utilities.spider_constants import MODES
from langflow.inputs import (
    SecretStrInput,
    StrInput,
    DropdownInput,
    IntInput,
    BoolInput,
    DictInput,
)

class SpiderTool(CustomComponent):
    display_name: str = "Spider Web Crawler & Scraper"
    description: str = "Spider API for web crawling and scraping."
    output_types: list[str] = ["Document"]
    documentation: str = "https://spider.cloud/docs/api"

    inputs = [
        SecretStrInput(
            name="spider_api_key",
            display_name="Spider API Key",
            required=True,
            password=True,
            info="The Spider API Key, get it from https://spider.cloud",
        ),
        StrInput(
            name="url",
            display_name="URL",
            required=True,
            info="The URL to scrape or crawl",
        ),
        DropdownInput(
            name="mode",
            display_name="Mode",
            required=True,
            options=MODES,
            value=MODES[0],
            info="The mode of operation: scrape or crawl",
        ),
        IntInput(
            name="limit",
            display_name="Limit",
            info="The maximum amount of pages allowed to crawl per website. Set to 0 to crawl all pages.",
            advanced=True,
        ),
        IntInput(
            name="depth",
            display_name="Depth",
            info="The crawl limit for maximum depth. If 0, no limit will be applied.",
            advanced=True,
        ),
        StrInput(
            name="blacklist",
            display_name="Blacklist",
            info="Blacklist paths that you do not want to crawl. Use Regex patterns.",
            advanced=True,
        ),
        StrInput(
            name="whitelist",
            display_name="Whitelist",
            info="Whitelist paths that you want to crawl, ignoring all other routes. Use Regex patterns.",
            advanced=True,
        ),
        BoolInput(
            name="use_readability",
            display_name="Use Readability",
            info="Use readability to pre-process the content for reading.",
            advanced=True,
        ),
        IntInput(
            name="request_timeout",
            display_name="Request Timeout",
            info="Timeout for the request in seconds.",
            advanced=True,
        ),
        BoolInput(
            name="metadata",
            display_name="Metadata",
            info="Include metadata in the response.",
            advanced=True,
        ),
        DictInput(
            name="params",
            display_name="Additional Parameters",
            info="Additional parameters to pass to the API. If provided, other inputs will be ignored.",
        ),
    ]

    def build(
        self,
        spider_api_key: str,
        url: str,
        mode: str,
        limit: Optional[int] = 0,
        depth: Optional[int] = 0,
        blacklist: Optional[str] = None,
        whitelist: Optional[str] = None,
        use_readability: Optional[bool] = False,
        request_timeout: Optional[int] = 30,
        metadata: Optional[bool] = False,
        params: Optional[Data] = None,
    ) -> Data:
        if params:
            parameters = params.__dict__['data']
        else:
            parameters = {
                "limit": limit,
                "depth": depth,
                "blacklist": blacklist,
                "whitelist": whitelist,
                "use_readability": use_readability,
                "request_timeout": request_timeout,
                "metadata": metadata,
                "return_format": "markdown",
            }

        app = Spider(api_key=spider_api_key)
        try:
            if mode == "scrape":
                parameters["limit"] = 1
                result = app.scrape_url(url, parameters)
            elif mode == "crawl":
                result = app.crawl_url(url, parameters)
            else:
                raise ValueError(f"Invalid mode: {mode}. Must be 'scrape' or 'crawl'.")
        except Exception as e:
            raise Exception(f"Error: {str(e)}")

        records = []

        for record in result:
            records.append(Data(data={"content": record["content"], "url": record["url"]}))
        return records

When trying to use the component, I get the following error:

"Error building Component Spider Web Crawler & Scraper: Base type component not found."

@WilliamEspegren (Contributor, Author)

@ogabrielluiz any idea why the code in the comment above fails?

@ogabrielluiz (Contributor)

@ogabrielluiz

This:

from typing import Optional
from spider.spider import Spider
from langflow.custom import CustomComponent
from langflow.schema import Data
from langflow.base.langchain_utilities.spider_constants import MODES
from langflow.inputs import (
    SecretStrInput,
    StrInput,
    DropdownInput,
    IntInput,
    BoolInput,
    DictInput,
)

class SpiderTool(CustomComponent):
    display_name: str = "Spider Web Crawler & Scraper"
    description: str = "Spider API for web crawling and scraping."
    output_types: list[str] = ["Document"]
    documentation: str = "https://spider.cloud/docs/api"

    inputs = [
        SecretStrInput(
            name="spider_api_key",
            display_name="Spider API Key",
            required=True,
            password=True,
            info="The Spider API Key, get it from https://spider.cloud",
        ),
        StrInput(
            name="url",
            display_name="URL",
            required=True,
            info="The URL to scrape or crawl",
        ),
        DropdownInput(
            name="mode",
            display_name="Mode",
            required=True,
            options=MODES,
            value=MODES[0],
            info="The mode of operation: scrape or crawl",
        ),
        IntInput(
            name="limit",
            display_name="Limit",
            info="The maximum amount of pages allowed to crawl per website. Set to 0 to crawl all pages.",
            advanced=True,
        ),
        IntInput(
            name="depth",
            display_name="Depth",
            info="The crawl limit for maximum depth. If 0, no limit will be applied.",
            advanced=True,
        ),
        StrInput(
            name="blacklist",
            display_name="Blacklist",
            info="Blacklist paths that you do not want to crawl. Use Regex patterns.",
            advanced=True,
        ),
        StrInput(
            name="whitelist",
            display_name="Whitelist",
            info="Whitelist paths that you want to crawl, ignoring all other routes. Use Regex patterns.",
            advanced=True,
        ),
        BoolInput(
            name="use_readability",
            display_name="Use Readability",
            info="Use readability to pre-process the content for reading.",
            advanced=True,
        ),
        IntInput(
            name="request_timeout",
            display_name="Request Timeout",
            info="Timeout for the request in seconds.",
            advanced=True,
        ),
        BoolInput(
            name="metadata",
            display_name="Metadata",
            info="Include metadata in the response.",
            advanced=True,
        ),
        DictInput(
            name="params",
            display_name="Additional Parameters",
            info="Additional parameters to pass to the API. If provided, other inputs will be ignored.",
        ),
    ]

    def build(
        self,
        spider_api_key: str,
        url: str,
        mode: str,
        limit: Optional[int] = 0,
        depth: Optional[int] = 0,
        blacklist: Optional[str] = None,
        whitelist: Optional[str] = None,
        use_readability: Optional[bool] = False,
        request_timeout: Optional[int] = 30,
        metadata: Optional[bool] = False,
        params: Optional[Data] = None,
    ) -> Data:
        if params:
            parameters = params.__dict__['data']
        else:
            parameters = {
                "limit": limit,
                "depth": depth,
                "blacklist": blacklist,
                "whitelist": whitelist,
                "use_readability": use_readability,
                "request_timeout": request_timeout,
                "metadata": metadata,
                "return_format": "markdown",
            }

        app = Spider(api_key=spider_api_key)
        try:
            if mode == "scrape":
                parameters["limit"] = 1
                result = app.scrape_url(url, parameters)
            elif mode == "crawl":
                result = app.crawl_url(url, parameters)
            else:
                raise ValueError(f"Invalid mode: {mode}. Must be 'scrape' or 'crawl'.")
        except Exception as e:
            raise Exception(f"Error: {str(e)}")

        records = []

        for record in result:
            records.append(Data(data={"content": record["content"], "url": record["url"]}))
        return records

When trying to use the component, I get the following error:

"Error building Component Spider Web Crawler & Scraper: Base type component not found."

You should inherit from Component. The import should be:

    from langflow.custom import Component
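Concretely, a minimal sketch of the two lines that change:

    from langflow.custom import Component  # was: CustomComponent

    class SpiderTool(Component):  # was: SpiderTool(CustomComponent)
        ...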

@WilliamEspegren (Contributor, Author)

Thank you! The component now builds. The problem now is that the component produces no output. I have looked at how the OpenAI component does it in build() and tried to replicate that, but nothing has worked :(

@WilliamEspegren (Contributor, Author)

@ogabrielluiz just bringing this to your attention :)

@ogabrielluiz (Contributor)

> @ogabrielluiz just bringing this to your attention :)

Hey @WilliamEspegren

You have to set the outputs as well. Check here for an example:

@WilliamEspegren (Contributor, Author)

Screencast.from.2024-07-09.20-40-55.webm

    outputs = [
        Output(display_name="Markdown", name="content", method="build"),
        Output(display_name="URL", name="url", method="build"),
    ]

When I have the outputs above, the component doesn't even show up. When I comment them out, the component shows up, but there are no outputs on the node. @ogabrielluiz
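One plausible fix, assuming each Output must be wired to its own method and that Output comes from langflow.io (the single-output shape and the method name below are assumptions, not something confirmed in this PR):

    from langflow.io import Output  # assumed import path

    class SpiderTool(Component):
        # ... inputs as in the earlier code ...

        outputs = [
            Output(display_name="Data", name="content", method="crawl"),
        ]

        def crawl(self) -> list[Data]:
            # Hypothetical single-output method reusing this PR's Spider calls;
            # under the new API, input values are read from self.<input name>.
            app = Spider(api_key=self.spider_api_key)
            result = app.crawl_url(self.url, {"return_format": "markdown"})
            return [Data(data={"content": r["content"], "url": r["url"]}) for r in result]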

@ogabrielluiz (Contributor) left a review:

LGTM


@dosubot added the lgtm label (approved by a maintainer) on Jul 11, 2024
@WilliamEspegren (Contributor, Author)

Should I resolve the merge conflicts?

@WilliamEspegren (Contributor, Author)

@ogabrielluiz just pinging for attention

@ogabrielluiz enabled auto-merge (squash) on July 22, 2024 14:59
@ogabrielluiz merged commit 7a36cc9 into langflow-ai:main on Aug 8, 2024
46 of 50 checks passed
@WilliamEspegren deleted the spider-integration branch on August 9, 2024 18:01