
Scraper uc #496

Merged
35 commits merged from scraper-uc into unclecode:feature/scraper on Jan 20, 2025

Conversation

aravindkarnam
Collaborator

No description provided.

aravindkarnam and others added 30 commits September 9, 2024 13:13
…Simplified URL validation and normalisation - respecting Robots.txt
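As an aside, the standard-library way to honour robots.txt looks roughly like the sketch below; the function name and the allow-on-error fallback are assumptions for illustration, not the PR's actual code.

```python
# Hypothetical robots.txt politeness check using only the standard library;
# the PR's real implementation may differ.
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def can_fetch(url: str, user_agent: str = "*") -> bool:
    """Return True if the host's robots.txt permits fetching this URL."""
    parts = urlparse(url)
    parser = RobotFileParser(f"{parts.scheme}://{parts.netloc}/robots.txt")
    try:
        parser.read()  # download and parse robots.txt
    except OSError:
        # A later commit mentions handling unfetchable robots.txt;
        # here we simply default to allowing the crawl.
        return True
    return parser.can_fetch(user_agent, url)
```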
Merging latest changes from main branch
…ial and concurrent processing

2. Introduced a dictionary for depth tracking across various tasks
3. Removed the redundant crawled_urls variable; instead, the returned object now exposes the visited set as a list.
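For illustration, the depth map and visited set from points 2 and 3 could be shaped like this; CrawlState and its method are hypothetical names, not the PR's:

```python
# Illustrative only: depth tracking plus a visited set; the scraper's
# actual data structures may differ.
from dataclasses import dataclass, field

@dataclass
class CrawlState:
    depths: dict[str, int] = field(default_factory=dict)  # URL -> depth first seen at
    visited: set[str] = field(default_factory=set)

    def try_enqueue(self, url: str, depth: int, max_depth: int) -> bool:
        """Record the URL and report whether it should be queued."""
        if depth > max_depth or url in self.visited:
            return False
        self.visited.add(url)   # marked before queueing, per a later commit
        self.depths[url] = depth
        return True
```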
… with the final scraper result as another option

2. Removed the ascrape_many method, as I'm not currently focusing on it in the first cut of the scraper
3. Added some error handling for cases where robots.txt cannot be fetched or parsed.
…lded just as they are ready, rather than in batches

2. Moved the visited.add(url) call to before the task is put in the queue, rather than after the crawl is completed. This ensures duplicate crawls don't happen when the same URL is found at a different depth and gets queued again because the first crawl hasn't completed and the visited set isn't updated yet.
3. Renamed the yield_results attribute to stream, since that name is commonly used for intermediate results in other AI libraries.
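The as-ready streaming behaviour described above can be sketched with asyncio.as_completed; crawl_page, stream_results, and the result shape are illustrative stand-ins, not the PR's API.

```python
# Sketch of yielding each result as soon as its crawl finishes, rather
# than waiting for a whole batch; crawl_page is a hypothetical stand-in.
import asyncio
from typing import AsyncIterator

async def crawl_page(url: str) -> dict:
    await asyncio.sleep(0.1)  # placeholder for the real fetch
    return {"url": url, "ok": True}

async def stream_results(urls: list[str]) -> AsyncIterator[dict]:
    tasks = [asyncio.create_task(crawl_page(u)) for u in urls]
    for finished in asyncio.as_completed(tasks):
        yield await finished  # emit immediately, in completion order

async def main() -> None:
    async for result in stream_results(["https://a.example", "https://b.example"]):
        print(result["url"])

asyncio.run(main())
```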
…updated as soon as the URL was queued. Removed add_to_retry_queue(url), since retry with exponential backoff via tenacity will take care of it.
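For reference, exponential backoff with tenacity typically looks like the following; fetch_url, the attempt limit, and the wait bounds are illustrative assumptions, not the PR's settings.

```python
# Illustrative tenacity usage; with multiplier=1 the waits grow 1s, 2s,
# 4s, ... capped at 10s, and the call is abandoned after three attempts.
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=1, max=10),
)
async def fetch_url(url: str) -> str:
    ...  # perform the HTTP request; a raised exception triggers a retry
```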
…d progress tracking capabilities. Introduced a ScrapingProgress data class to monitor processed and failed URLs. Enhanced scraping methods to log errors and track stats throughout the scraping process.
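A plausible shape for the ScrapingProgress data class mentioned here, assuming simple counters; the field and method names are guesses, not the PR's definition.

```python
# Hypothetical reconstruction; the real ScrapingProgress may differ.
from dataclasses import dataclass, field

@dataclass
class ScrapingProgress:
    processed: int = 0
    failed: int = 0
    failed_urls: list[str] = field(default_factory=list)

    def record_success(self) -> None:
        self.processed += 1

    def record_failure(self, url: str) -> None:
        self.failed += 1
        self.failed_urls.append(url)
```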
…figuration

- Added CrawlStats for comprehensive crawl monitoring
- Implemented proper resource cleanup with shutdown mechanism
- Enhanced URL processing with better validation and politeness controls
- Added configuration options (max_concurrent, timeout, external_links); see the sketch after this list
- Improved error handling with retry logic
- Added domain-specific queues for better performance
- Created comprehensive documentation

Note: URL normalization needs review - potential duplicate processing
with core crawler for internal links. Currently commented out pending
further investigation of edge cases.
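The configuration options named in the list above might map onto a dataclass along these lines; the defaults are invented for illustration, and only the option names come from the commit message.

```python
# Only the field names are from the commit; defaults are assumptions.
from dataclasses import dataclass

@dataclass
class ScraperConfig:
    max_concurrent: int = 10      # cap on simultaneous crawl tasks
    timeout: float = 30.0         # per-request timeout in seconds
    external_links: bool = False  # whether to follow off-domain links
```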
Implement comprehensive URL filtering and scoring capabilities:

Filters:
- Add URLPatternFilter with glob/regex support
- Implement ContentTypeFilter with MIME type checking
- Add DomainFilter for domain control
- Create FilterChain with stats tracking

Scorers:
- Complete KeywordRelevanceScorer implementation
- Add PathDepthScorer for URL structure scoring
- Implement ContentTypeScorer for file type priorities
- Add FreshnessScorer for date-based scoring
- Add DomainAuthorityScorer for domain weighting
- Create CompositeScorer for combined strategies

Features:
- Add statistics tracking for both filters and scorers
- Implement logging support throughout
- Add resource cleanup methods
- Create comprehensive documentation
- Include performance optimizations

Tests and docs included.
Note: Review URL normalization overlap with recent crawler changes.
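As a rough illustration of how the filters and scorers listed above could compose, here are stripped-down versions; every class body below is an assumption, and the PR's URLPatternFilter, FilterChain, KeywordRelevanceScorer, and CompositeScorer are certainly richer.

```python
# Hypothetical minimal filter chain and composite scorer.
import fnmatch

class URLPatternFilter:
    def __init__(self, pattern: str) -> None:
        self.pattern = pattern

    def apply(self, url: str) -> bool:
        return fnmatch.fnmatch(url, self.pattern)  # glob-style match

class FilterChain:
    def __init__(self, filters: list) -> None:
        self.filters = filters
        self.rejected = 0  # simple stats tracking

    def apply(self, url: str) -> bool:
        for f in self.filters:
            if not f.apply(url):
                self.rejected += 1
                return False
        return True

class KeywordRelevanceScorer:
    def __init__(self, keywords: list[str], weight: float = 1.0) -> None:
        self.keywords = [k.lower() for k in keywords]
        self.weight = weight

    def score(self, url: str) -> float:
        hits = sum(k in url.lower() for k in self.keywords)
        return self.weight * hits / max(len(self.keywords), 1)

class CompositeScorer:
    def __init__(self, scorers: list) -> None:
        self.scorers = scorers

    def score(self, url: str) -> float:
        return sum(s.score(url) for s in self.scorers)
```

In this shape, FilterChain.apply(url) gates whether a URL enters the queue at all, while CompositeScorer.score(url) decides its priority once admitted.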

- Quick Start guide created and added
pulling the main branch into scraper-uc
…ad of timeout to support python versions < 3.11
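If this refers to replacing asyncio.timeout() (added in Python 3.11) with asyncio.wait_for(), the backwards-compatible pattern is as follows; fetch is a placeholder coroutine.

```python
# asyncio.wait_for() works on Python < 3.11, unlike asyncio.timeout();
# both cancel the awaited coroutine once the deadline passes.
import asyncio

async def fetch(url: str) -> str:
    await asyncio.sleep(0.1)  # placeholder for real I/O
    return url

async def main() -> None:
    try:
        print(await asyncio.wait_for(fetch("https://example.com"), timeout=5.0))
    except asyncio.TimeoutError:
        print("timed out")

asyncio.run(main())
```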
…is passed as None, renamed the scrapper folder to scraper.
2. Removed a few unused imports
3. Removed the separate URL normalisation for external links, as it won't be necessary
…ain only when depth is not zero. This way

the filter chain is skipped for the start URL, while the other validations still apply to it
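In pseudocode terms, the described gate is roughly the following; all names here are hypothetical.

```python
# Hypothetical gate: the user-supplied filter chain is skipped for the
# start URL (depth 0), while baseline validation runs on every URL.
def should_crawl(url: str, depth: int, filter_chain, validate) -> bool:
    if not validate(url):  # always-on checks (scheme, robots.txt, ...)
        return False
    if depth > 0 and not filter_chain.apply(url):
        return False       # filters only apply beyond the start URL
    return True
```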
… created in the correct event loop

- Explicitly retrieve and use the correct event loop when creating tasks to avoid cross-loop issues.
- Ensures proper task scheduling in environments with multiple event loops.
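The usual fix for cross-loop issues like the ones described is to bind tasks to the currently running loop, along these lines (illustrative):

```python
# Creating tasks on the loop that is actually running avoids
# "attached to a different loop" errors when multiple loops exist.
import asyncio

async def schedule(coro) -> asyncio.Task:
    loop = asyncio.get_running_loop()  # the loop executing this coroutine
    return loop.create_task(coro)      # the task is bound to that loop
```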
@aravindkarnam aravindkarnam merged commit a677c2b into unclecode:feature/scraper Jan 20, 2025