FetchNodeLevelK is not filtering out mailto: links. #861

Closed
Kilowhisky opened this issue Jan 2, 2025 · 3 comments

@Kilowhisky

Kilowhisky commented Jan 2, 2025

I'm trying to use the new DepthSearchGraph, and it appears to misfire by trying to follow email links found in hrefs. It should filter out non-web links, such as:

  • mailto:
  • tel:

See other schemes that should probably be filtered out: https://www.w3.org/wiki/UriSchemes

https://github.com/ScrapeGraphAI/Scrapegraph-ai/blob/96064f20ee8a849a2548f293419cf9028386c47b/scrapegraphai/nodes/fetch_node_level_k.py#L155C16-L158

EDIT:

It also goes boom navigating javascript links.

- navigating to "javascript:void(0)", waiting until "domcontentloaded"
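
A minimal sketch of the kind of scheme filter requested above, assuming the crawler has the page URL and a list of raw `href` values. The helper name `filter_web_links` and the `ALLOWED_SCHEMES` set are hypothetical illustrations, not ScrapeGraphAI's actual code or the fix that was later committed. It uses an allowlist (`http`, `https`) rather than a blocklist, so `mailto:`, `tel:`, `javascript:` and any other non-web scheme are dropped without having to enumerate them.

```python
from urllib.parse import urljoin, urlparse

# Assumption: only http(s) links are worth following when crawling.
ALLOWED_SCHEMES = {"http", "https"}


def filter_web_links(base_url, hrefs):
    """Resolve each href against base_url and keep only http(s) links.

    Hypothetical helper for illustration; not part of ScrapeGraphAI.
    """
    kept = []
    for href in hrefs:
        # Relative links inherit the scheme of the page they appear on;
        # links with their own scheme (mailto:, tel:, ...) are left as-is.
        absolute = urljoin(base_url, href.strip())
        if urlparse(absolute).scheme in ALLOWED_SCHEMES:
            kept.append(absolute)
    return kept


links = filter_web_links(
    "https://example.com/page",
    [
        "/about",
        "mailto:someone@example.com",
        "tel:+15551234",
        "javascript:void(0)",
        "https://other.example/x",
    ],
)
print(links)
```

An allowlist is the safer default here because the list of registered URI schemes is long and keeps growing, as the W3C wiki page linked above shows.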
@VinciGit00
Collaborator

Which model are you using?

@Kilowhisky
Author

4o-mini.
Not sure whether it needs an embedding model when used with GPT.

PeriniM added a commit that referenced this issue Jan 6, 2025
github-actions bot pushed a commit that referenced this issue Jan 6, 2025
## [1.34.0-beta.16](v1.34.0-beta.15...v1.34.0-beta.16) (2025-01-06)

### Bug Fixes

* add back poethepoet for pylint ([a82af04](a82af04))
* better playwright installation handling ([f6009d1](f6009d1))
* disallow mailto: ([#861](#861)) ([8d9c909](8d9c909))
* removed requirements files ([25861b0](25861b0))
* selenium import in ChromiumLoader ([e374e05](e374e05))

### chore

* chromium browser asnc handling ([5be7c49](5be7c49))
* made some libs optional ([5cdf055](5cdf055))
* pandas package is now optional ([54c69a2](54c69a2))
github-actions bot pushed a commit that referenced this issue Jan 6, 2025
## [1.34.2-beta.1](v1.34.1...v1.34.2-beta.1) (2025-01-06)

### Bug Fixes

* add back poethepoet for pylint ([a82af04](a82af04))
* better playwright installation handling ([f6009d1](f6009d1))
* disallow mailto: ([#861](#861)) ([8d9c909](8d9c909))
* removed requirements files ([25861b0](25861b0))
* search graph ([d4b2679](d4b2679))
* selenium import in ChromiumLoader ([e374e05](e374e05))

### chore

* chromium browser asnc handling ([5be7c49](5be7c49))
* made some libs optional ([5cdf055](5cdf055))
* pandas package is now optional ([54c69a2](54c69a2))

### CI

* **release:** 1.34.0-beta.15 [skip ci] ([bc7ae85](bc7ae85))
* **release:** 1.34.0-beta.16 [skip ci] ([a0efb09](a0efb09)), closes [#861](#861)
github-actions bot pushed a commit that referenced this issue Jan 6, 2025
## [1.34.2](v1.34.1...v1.34.2) (2025-01-06)

### Bug Fixes

* add back poethepoet for pylint ([a82af04](a82af04))
* better playwright installation handling ([f6009d1](f6009d1))
* disallow mailto: ([#861](#861)) ([8d9c909](8d9c909))
* removed requirements files ([25861b0](25861b0))
* search graph ([d4b2679](d4b2679))
* selenium import in ChromiumLoader ([e374e05](e374e05))

### chore

* chromium browser asnc handling ([5be7c49](5be7c49))
* made some libs optional ([5cdf055](5cdf055))
* pandas package is now optional ([54c69a2](54c69a2))

### CI

* **release:** 1.34.0-beta.15 [skip ci] ([bc7ae85](bc7ae85))
* **release:** 1.34.0-beta.16 [skip ci] ([a0efb09](a0efb09)), closes [#861](#861)
@PeriniM
Collaborator

PeriniM commented Jan 6, 2025

Hey @Kilowhisky, it is fixed in the new release!

@PeriniM PeriniM closed this as completed Jan 6, 2025
github-actions bot pushed a commit that referenced this issue Jan 6, 2025
## [1.34.3-beta.1](v1.34.2...v1.34.3-beta.1) (2025-01-06)

### Bug Fixes

* browserbase integration ([752a885](752a885))

### CI

* **release:** 1.34.2-beta.1 [skip ci] ([f383e72](f383e72)), closes [#861](#861) [#861](#861)
* **release:** 1.34.2-beta.2 [skip ci] ([93fd9d2](93fd9d2))
github-actions bot pushed a commit that referenced this issue Jan 6, 2025
## [1.35.0](v1.34.2...v1.35.0) (2025-01-06)

### Features

* ⏰added graph timeout and fixed model_tokens param ([#810](#810) [#856](#856) [#853](#853)) ([01a331a](01a331a))
* ⛏️ enhanced contribution and precommit added ([fcbfe78](fcbfe78))
* add codequality workflow ([4380afb](4380afb))
* add timeout and retry_limit in loader_kwargs ([#865](#865) [#831](#831)) ([21147c4](21147c4))
* serper api search ([1c0141f](1c0141f))

### Bug Fixes

* browserbase integration ([752a885](752a885))
* local html handling ([2a15581](2a15581))

### CI

* **release:** 1.34.2-beta.1 [skip ci] ([f383e72](f383e72)), closes [#861](#861) [#861](#861)
* **release:** 1.34.2-beta.2 [skip ci] ([93fd9d2](93fd9d2))
* **release:** 1.34.3-beta.1 [skip ci] ([013a196](013a196)), closes [#861](#861) [#861](#861)
* **release:** 1.35.0-beta.1 [skip ci] ([c5630ce](c5630ce)), closes [#865](#865) [#831](#831)
* **release:** 1.35.0-beta.2 [skip ci] ([f21c586](f21c586))
* **release:** 1.35.0-beta.3 [skip ci] ([cb54d5b](cb54d5b))
* **release:** 1.35.0-beta.4 [skip ci] ([6e375f5](6e375f5)), closes [#810](#810) [#856](#856) [#853](#853)