
feat: add Spider Web Scraper & Crawler #2439

Merged · 14 commits merged into langflow-ai:main from spider-integration on Aug 8, 2024

Conversation

@WilliamEspegren (Contributor)

Add Spider, the fastest open source scraper & crawler that returns LLM-ready data.

Twitter: @WilliamEspegren

Review thread on this snippet from the component:

    except Exception as e:
        raise Exception(f"Error: {str(e)}")

    records = []
@WilliamEspegren (Contributor, Author)
This returns a linting error:

[screenshot of the linting error]

But I would argue this is the right return type, because the panel that displays what the tool returned is large and allows for multiple entries.

@ogabrielluiz (Contributor)

You should change this:

    ) -> Data:

To this:

    ) -> list[Data]:
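For reference, the full corrected signature (parameter list exactly as in the component code posted later in this conversation):

    def build(
        self,
        spider_api_key: str,
        url: str,
        mode: str,
        limit: Optional[int] = 0,
        depth: Optional[int] = 0,
        blacklist: Optional[str] = None,
        whitelist: Optional[str] = None,
        use_readability: Optional[bool] = False,
        request_timeout: Optional[int] = 30,
        metadata: Optional[bool] = False,
        params: Optional[Data] = None,
    ) -> list[Data]:

This matches what the method actually returns: records, a list of Data objects.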

@WilliamEspegren (Contributor, Author)

Screencast showing how it works

Screencast.from.2024-06-29.14-50-38.webm

@WilliamEspegren marked this pull request as ready for review on June 29, 2024 20:54
@dosubot added labels: size:L (changes 100-499 lines, ignoring generated files), enhancement, javascript, python (Jun 29, 2024)
@ogabrielluiz (Contributor)

Wow. This is nice, @WilliamEspegren

Thank you! I'll approve it ASAP.

@ogabrielluiz (Contributor)

Have you seen our new way of building components?

@WilliamEspegren (Contributor, Author)

> Have you seen our new way of building components?

No, I assume this is not the way, since you are mentioning it?

@ogabrielluiz (Contributor)

Take a look at the OpenAIModelComponent:

    class OpenAIModelComponent(LCModelComponent):
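For readers outside this thread: in the newer pattern, a component inherits from Component, declares inputs and outputs as class attributes, and reads input values from self. A minimal sketch, assuming Output and MessageTextInput are importable from langflow.io (the class and field names below are illustrative, not from this PR):

    from langflow.custom import Component
    from langflow.io import MessageTextInput, Output
    from langflow.schema import Data


    class ExampleComponent(Component):
        display_name = "Example"
        description = "Minimal new-style component."

        inputs = [
            MessageTextInput(name="text", display_name="Text"),
        ]

        outputs = [
            # Each Output is wired to a method by name.
            Output(display_name="Data", name="data", method="build_output"),
        ]

        def build_output(self) -> Data:
            # Input values are read from self.<input name>,
            # not passed in as build() arguments.
            return Data(data={"text": self.text})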

@WilliamEspegren (Contributor, Author)

@ogabrielluiz I actually tried that first (bbd7136), but I got errors.

@ogabrielluiz changed the title from "Add Spider Web Scraper & Crawler" to "feat: add Spider Web Scraper & Crawler" on Jul 1, 2024
@WilliamEspegren (Contributor, Author)

@ogabrielluiz

This:

from typing import Optional
from spider.spider import Spider
from langflow.custom import CustomComponent
from langflow.schema import Data
from langflow.base.langchain_utilities.spider_constants import MODES
from langflow.inputs import (
    SecretStrInput,
    StrInput,
    DropdownInput,
    IntInput,
    BoolInput,
    DictInput,
)

class SpiderTool(CustomComponent):
    display_name: str = "Spider Web Crawler & Scraper"
    description: str = "Spider API for web crawling and scraping."
    output_types: list[str] = ["Document"]
    documentation: str = "https://spider.cloud/docs/api"

    inputs = [
        SecretStrInput(
            name="spider_api_key",
            display_name="Spider API Key",
            required=True,
            password=True,
            info="The Spider API Key, get it from https://spider.cloud",
        ),
        StrInput(
            name="url",
            display_name="URL",
            required=True,
            info="The URL to scrape or crawl",
        ),
        DropdownInput(
            name="mode",
            display_name="Mode",
            required=True,
            options=MODES,
            value=MODES[0],
            info="The mode of operation: scrape or crawl",
        ),
        IntInput(
            name="limit",
            display_name="Limit",
            info="The maximum amount of pages allowed to crawl per website. Set to 0 to crawl all pages.",
            advanced=True,
        ),
        IntInput(
            name="depth",
            display_name="Depth",
            info="The crawl limit for maximum depth. If 0, no limit will be applied.",
            advanced=True,
        ),
        StrInput(
            name="blacklist",
            display_name="Blacklist",
            info="Blacklist paths that you do not want to crawl. Use Regex patterns.",
            advanced=True,
        ),
        StrInput(
            name="whitelist",
            display_name="Whitelist",
            info="Whitelist paths that you want to crawl, ignoring all other routes. Use Regex patterns.",
            advanced=True,
        ),
        BoolInput(
            name="use_readability",
            display_name="Use Readability",
            info="Use readability to pre-process the content for reading.",
            advanced=True,
        ),
        IntInput(
            name="request_timeout",
            display_name="Request Timeout",
            info="Timeout for the request in seconds.",
            advanced=True,
        ),
        BoolInput(
            name="metadata",
            display_name="Metadata",
            info="Include metadata in the response.",
            advanced=True,
        ),
        DictInput(
            name="params",
            display_name="Additional Parameters",
            info="Additional parameters to pass to the API. If provided, other inputs will be ignored.",
        ),
    ]

    def build(
        self,
        spider_api_key: str,
        url: str,
        mode: str,
        limit: Optional[int] = 0,
        depth: Optional[int] = 0,
        blacklist: Optional[str] = None,
        whitelist: Optional[str] = None,
        use_readability: Optional[bool] = False,
        request_timeout: Optional[int] = 30,
        metadata: Optional[bool] = False,
        params: Optional[Data] = None,
    ) -> Data:
        if params:
            parameters = params.__dict__['data']
        else:
            parameters = {
                "limit": limit,
                "depth": depth,
                "blacklist": blacklist,
                "whitelist": whitelist,
                "use_readability": use_readability,
                "request_timeout": request_timeout,
                "metadata": metadata,
                "return_format": "markdown",
            }

        app = Spider(api_key=spider_api_key)
        try:
            if mode == "scrape":
                parameters["limit"] = 1
                result = app.scrape_url(url, parameters)
            elif mode == "crawl":
                result = app.crawl_url(url, parameters)
            else:
                raise ValueError(f"Invalid mode: {mode}. Must be 'scrape' or 'crawl'.")
        except Exception as e:
            raise Exception(f"Error: {str(e)}")

        records = []

        for record in result:
            records.append(Data(data={"content": record["content"], "url": record["url"]}))
        return records

When trying to use the component, I get the following error:

"Error building Component Spider Web Crawler & Scraper: Base type component not found."

@WilliamEspegren (Contributor, Author)

@ogabrielluiz any idea why the code in the comment above fails?

@ogabrielluiz (Contributor)

@ogabrielluiz

This:

from typing import Optional
from spider.spider import Spider
from langflow.custom import CustomComponent
from langflow.schema import Data
from langflow.base.langchain_utilities.spider_constants import MODES
from langflow.inputs import (
    SecretStrInput,
    StrInput,
    DropdownInput,
    IntInput,
    BoolInput,
    DictInput,
)

class SpiderTool(CustomComponent):
    display_name: str = "Spider Web Crawler & Scraper"
    description: str = "Spider API for web crawling and scraping."
    output_types: list[str] = ["Document"]
    documentation: str = "https://spider.cloud/docs/api"

    inputs = [
        SecretStrInput(
            name="spider_api_key",
            display_name="Spider API Key",
            required=True,
            password=True,
            info="The Spider API Key, get it from https://spider.cloud",
        ),
        StrInput(
            name="url",
            display_name="URL",
            required=True,
            info="The URL to scrape or crawl",
        ),
        DropdownInput(
            name="mode",
            display_name="Mode",
            required=True,
            options=MODES,
            value=MODES[0],
            info="The mode of operation: scrape or crawl",
        ),
        IntInput(
            name="limit",
            display_name="Limit",
            info="The maximum amount of pages allowed to crawl per website. Set to 0 to crawl all pages.",
            advanced=True,
        ),
        IntInput(
            name="depth",
            display_name="Depth",
            info="The crawl limit for maximum depth. If 0, no limit will be applied.",
            advanced=True,
        ),
        StrInput(
            name="blacklist",
            display_name="Blacklist",
            info="Blacklist paths that you do not want to crawl. Use Regex patterns.",
            advanced=True,
        ),
        StrInput(
            name="whitelist",
            display_name="Whitelist",
            info="Whitelist paths that you want to crawl, ignoring all other routes. Use Regex patterns.",
            advanced=True,
        ),
        BoolInput(
            name="use_readability",
            display_name="Use Readability",
            info="Use readability to pre-process the content for reading.",
            advanced=True,
        ),
        IntInput(
            name="request_timeout",
            display_name="Request Timeout",
            info="Timeout for the request in seconds.",
            advanced=True,
        ),
        BoolInput(
            name="metadata",
            display_name="Metadata",
            info="Include metadata in the response.",
            advanced=True,
        ),
        DictInput(
            name="params",
            display_name="Additional Parameters",
            info="Additional parameters to pass to the API. If provided, other inputs will be ignored.",
        ),
    ]

    def build(
        self,
        spider_api_key: str,
        url: str,
        mode: str,
        limit: Optional[int] = 0,
        depth: Optional[int] = 0,
        blacklist: Optional[str] = None,
        whitelist: Optional[str] = None,
        use_readability: Optional[bool] = False,
        request_timeout: Optional[int] = 30,
        metadata: Optional[bool] = False,
        params: Optional[Data] = None,
    ) -> Data:
        if params:
            parameters = params.__dict__['data']
        else:
            parameters = {
                "limit": limit,
                "depth": depth,
                "blacklist": blacklist,
                "whitelist": whitelist,
                "use_readability": use_readability,
                "request_timeout": request_timeout,
                "metadata": metadata,
                "return_format": "markdown",
            }

        app = Spider(api_key=spider_api_key)
        try:
            if mode == "scrape":
                parameters["limit"] = 1
                result = app.scrape_url(url, parameters)
            elif mode == "crawl":
                result = app.crawl_url(url, parameters)
            else:
                raise ValueError(f"Invalid mode: {mode}. Must be 'scrape' or 'crawl'.")
        except Exception as e:
            raise Exception(f"Error: {str(e)}")

        records = []

        for record in result:
            records.append(Data(data={"content": record["content"], "url": record["url"]}))
        return records

When trying to use the component, I get the following error:

"Error building Component Spider Web Crawler & Scraper: Base type component not found."

You should inherit from Component. The import should be:

    from langflow.custom import Component
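Concretely, a minimal sketch of the two lines that change:

    from langflow.custom import Component  # was: CustomComponent

    class SpiderTool(Component):  # was: SpiderTool(CustomComponent)
        ...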

@WilliamEspegren (Contributor, Author)

Thank you! The component now builds. The problem now is that the component produces no output. I have looked at how the OpenAI component does it in build() and tried to replicate that, but nothing has worked :(

@WilliamEspegren (Contributor, Author)

@ogabrielluiz just bringing this to your attention :)

@ogabrielluiz (Contributor)

> @ogabrielluiz just bringing this to your attention :)

Hey @WilliamEspegren

You have to set the outputs as well. Check here for an example:

@WilliamEspegren (Contributor, Author)

Screencast.from.2024-07-09.20-40-55.webm

    outputs = [
        Output(display_name="Markdown", name="content", method="build"),
        Output(display_name="URL", name="url", method="build"),
    ]

When I have the outputs above, the component doesn't even show up. When I comment them out, the component shows up, but there are no outputs on the node. @ogabrielluiz
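One plausible fix, assuming each Output must be wired to its own method and that Output comes from langflow.io (the single-output shape and the method name below are assumptions, not something confirmed in this PR):

    from langflow.io import Output  # assumed import path

    class SpiderTool(Component):
        # ... inputs as in the earlier code ...

        outputs = [
            Output(display_name="Data", name="content", method="crawl"),
        ]

        def crawl(self) -> list[Data]:
            # Hypothetical single-output method reusing this PR's Spider calls;
            # under the new API, input values are read from self.<input name>.
            app = Spider(api_key=self.spider_api_key)
            result = app.crawl_url(self.url, {"return_format": "markdown"})
            return [Data(data={"content": r["content"], "url": r["url"]}) for r in result]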

@ogabrielluiz (Contributor) left a review:

LGTM


@dosubot added the lgtm label (approved by a maintainer) on Jul 11, 2024
@WilliamEspegren (Contributor, Author)

Should I resolve the merge conflicts?

@WilliamEspegren (Contributor, Author)

@ogabrielluiz just pinging for attention

@ogabrielluiz enabled auto-merge (squash) on July 22, 2024 14:59
@ogabrielluiz merged commit 7a36cc9 into langflow-ai:main on Aug 8, 2024
46 of 50 checks passed
@WilliamEspegren deleted the spider-integration branch on August 9, 2024 18:01