
Scraper uc #496

Merged
35 commits merged from scraper-uc into unclecode:feature/scraper on Jan 20, 2025

Conversation

aravindkarnam
Collaborator

No description provided.

aravindkarnam and others added 30 commits September 9, 2024 13:13
…Simplified URL validation and normalisation - respecting Robots.txt
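As an aside, the standard-library way to honour robots.txt looks roughly like the sketch below; the function name and the allow-on-error fallback are assumptions for illustration, not the PR's actual code.

```python
# Hypothetical robots.txt politeness check using only the standard library;
# the PR's real implementation may differ.
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def can_fetch(url: str, user_agent: str = "*") -> bool:
    """Return True if the host's robots.txt permits fetching this URL."""
    parts = urlparse(url)
    parser = RobotFileParser(f"{parts.scheme}://{parts.netloc}/robots.txt")
    try:
        parser.read()  # download and parse robots.txt
    except OSError:
        # A later commit mentions handling unfetchable robots.txt;
        # here we simply default to allowing the crawl.
        return True
    return parser.can_fetch(user_agent, url)
```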
Merging latest changes from main branch
…ial and concurrent processing

2. Introduced a dictionary for depth tracking across various tasks
3. Removed the redundant crawled_urls variable; instead, the returned object now exposes the visited set as a list.
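For illustration, the depth map and visited set from points 2 and 3 could be shaped like this; CrawlState and its method are hypothetical names, not the PR's:

```python
# Illustrative only: depth tracking plus a visited set; the scraper's
# actual data structures may differ.
from dataclasses import dataclass, field

@dataclass
class CrawlState:
    depths: dict[str, int] = field(default_factory=dict)  # URL -> depth first seen at
    visited: set[str] = field(default_factory=set)

    def try_enqueue(self, url: str, depth: int, max_depth: int) -> bool:
        """Record the URL and report whether it should be queued."""
        if depth > max_depth or url in self.visited:
            return False
        self.visited.add(url)   # marked before queueing, per a later commit
        self.depths[url] = depth
        return True
```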
… with the final scraper result as another option

2. Removed the ascrape_many method, as I'm not currently focusing on it in the first cut of the scraper
3. Added some error handling for cases where robots.txt cannot be fetched or parsed.
…lded just as they are ready, rather than in batches

2. Moved the visited.add(url) call to before the task is put in the queue, rather than after the crawl is completed. This ensures duplicate crawls don't happen when the same URL is found at a different depth and gets queued again because the first crawl hasn't completed and the visited set isn't updated yet.
3. Renamed the yield_results attribute to stream, since that name is commonly used for intermediate results in other AI libraries.
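The as-ready streaming behaviour described above can be sketched with asyncio.as_completed; crawl_page, stream_results, and the result shape are illustrative stand-ins, not the PR's API.

```python
# Sketch of yielding each result as soon as its crawl finishes, rather
# than waiting for a whole batch; crawl_page is a hypothetical stand-in.
import asyncio
from typing import AsyncIterator

async def crawl_page(url: str) -> dict:
    await asyncio.sleep(0.1)  # placeholder for the real fetch
    return {"url": url, "ok": True}

async def stream_results(urls: list[str]) -> AsyncIterator[dict]:
    tasks = [asyncio.create_task(crawl_page(u)) for u in urls]
    for finished in asyncio.as_completed(tasks):
        yield await finished  # emit immediately, in completion order

async def main() -> None:
    async for result in stream_results(["https://a.example", "https://b.example"]):
        print(result["url"])

asyncio.run(main())
```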
…updated as soon as the URL was queued. Removed add_to_retry_queue(url), since retry with exponential backoff via tenacity will take care of it.
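For reference, exponential backoff with tenacity typically looks like the following; fetch_url, the attempt limit, and the wait bounds are illustrative assumptions, not the PR's settings.

```python
# Illustrative tenacity usage; with multiplier=1 the waits grow 1s, 2s,
# 4s, ... capped at 10s, and the call is abandoned after three attempts.
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=1, max=10),
)
async def fetch_url(url: str) -> str:
    ...  # perform the HTTP request; a raised exception triggers a retry
```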
…d progress tracking capabilities. Introduced a ScrapingProgress data class to monitor processed and failed URLs. Enhanced scraping methods to log errors and track stats throughout the scraping process.
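A plausible shape for the ScrapingProgress data class mentioned here, assuming simple counters; the field and method names are guesses, not the PR's definition.

```python
# Hypothetical reconstruction; the real ScrapingProgress may differ.
from dataclasses import dataclass, field

@dataclass
class ScrapingProgress:
    processed: int = 0
    failed: int = 0
    failed_urls: list[str] = field(default_factory=list)

    def record_success(self) -> None:
        self.processed += 1

    def record_failure(self, url: str) -> None:
        self.failed += 1
        self.failed_urls.append(url)
```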
…figuration

- Added CrawlStats for comprehensive crawl monitoring
- Implemented proper resource cleanup with shutdown mechanism
- Enhanced URL processing with better validation and politeness controls
- Added configuration options (max_concurrent, timeout, external_links); see the sketch after this list
- Improved error handling with retry logic
- Added domain-specific queues for better performance
- Created comprehensive documentation

Note: URL normalization needs review - potential duplicate processing
with core crawler for internal links. Currently commented out pending
further investigation of edge cases.
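The configuration options named in the list above might map onto a dataclass along these lines; the defaults are invented for illustration, and only the option names come from the commit message.

```python
# Only the field names are from the commit; defaults are assumptions.
from dataclasses import dataclass

@dataclass
class ScraperConfig:
    max_concurrent: int = 10      # cap on simultaneous crawl tasks
    timeout: float = 30.0         # per-request timeout in seconds
    external_links: bool = False  # whether to follow off-domain links
```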
Implement comprehensive URL filtering and scoring capabilities:

Filters:
- Add URLPatternFilter with glob/regex support
- Implement ContentTypeFilter with MIME type checking
- Add DomainFilter for domain control
- Create FilterChain with stats tracking

Scorers:
- Complete KeywordRelevanceScorer implementation
- Add PathDepthScorer for URL structure scoring
- Implement ContentTypeScorer for file type priorities
- Add FreshnessScorer for date-based scoring
- Add DomainAuthorityScorer for domain weighting
- Create CompositeScorer for combined strategies

Features:
- Add statistics tracking for both filters and scorers
- Implement logging support throughout
- Add resource cleanup methods
- Create comprehensive documentation
- Include performance optimizations

Tests and docs included.
Note: Review URL normalization overlap with recent crawler changes.
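As a rough illustration of how the filters and scorers listed above could compose, here are stripped-down versions; every class body below is an assumption, and the PR's URLPatternFilter, FilterChain, KeywordRelevanceScorer, and CompositeScorer are certainly richer.

```python
# Hypothetical minimal filter chain and composite scorer.
import fnmatch

class URLPatternFilter:
    def __init__(self, pattern: str) -> None:
        self.pattern = pattern

    def apply(self, url: str) -> bool:
        return fnmatch.fnmatch(url, self.pattern)  # glob-style match

class FilterChain:
    def __init__(self, filters: list) -> None:
        self.filters = filters
        self.rejected = 0  # simple stats tracking

    def apply(self, url: str) -> bool:
        for f in self.filters:
            if not f.apply(url):
                self.rejected += 1
                return False
        return True

class KeywordRelevanceScorer:
    def __init__(self, keywords: list[str], weight: float = 1.0) -> None:
        self.keywords = [k.lower() for k in keywords]
        self.weight = weight

    def score(self, url: str) -> float:
        hits = sum(k in url.lower() for k in self.keywords)
        return self.weight * hits / max(len(self.keywords), 1)

class CompositeScorer:
    def __init__(self, scorers: list) -> None:
        self.scorers = scorers

    def score(self, url: str) -> float:
        return sum(s.score(url) for s in self.scorers)
```

In this shape, FilterChain.apply(url) gates whether a URL enters the queue at all, while CompositeScorer.score(url) decides its priority once admitted.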

- Quick Start guide created and added
pulling the main branch into scraper-uc
…ad of timeout to support python versions < 3.11
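If this refers to replacing asyncio.timeout() (added in Python 3.11) with asyncio.wait_for(), the backwards-compatible pattern is as follows; fetch is a placeholder coroutine.

```python
# asyncio.wait_for() works on Python < 3.11, unlike asyncio.timeout();
# both cancel the awaited coroutine once the deadline passes.
import asyncio

async def fetch(url: str) -> str:
    await asyncio.sleep(0.1)  # placeholder for real I/O
    return url

async def main() -> None:
    try:
        print(await asyncio.wait_for(fetch("https://example.com"), timeout=5.0))
    except asyncio.TimeoutError:
        print("timed out")

asyncio.run(main())
```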
…is passed as None, renamed the scrapper folder to scraper.
2. Removed a few unused imports
3. Removed the separate URL normalisation for external links, as it won't be necessary
…ain only when depth is not zero. This way

the filter chain is skipped for the start URL, while the other validations still apply to it
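In pseudocode terms, the described gate is roughly the following; all names here are hypothetical.

```python
# Hypothetical gate: the user-supplied filter chain is skipped for the
# start URL (depth 0), while baseline validation runs on every URL.
def should_crawl(url: str, depth: int, filter_chain, validate) -> bool:
    if not validate(url):  # always-on checks (scheme, robots.txt, ...)
        return False
    if depth > 0 and not filter_chain.apply(url):
        return False       # filters only apply beyond the start URL
    return True
```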
… created in the correct event loop

- Explicitly retrieve and use the correct event loop when creating tasks to avoid cross-loop issues.
- Ensures proper task scheduling in environments with multiple event loops.
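The usual fix for cross-loop issues like the ones described is to bind tasks to the currently running loop, along these lines (illustrative):

```python
# Creating tasks on the loop that is actually running avoids
# "attached to a different loop" errors when multiple loops exist.
import asyncio

async def schedule(coro) -> asyncio.Task:
    loop = asyncio.get_running_loop()  # the loop executing this coroutine
    return loop.create_task(coro)      # the task is bound to that loop
```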
@aravindkarnam aravindkarnam merged commit a677c2b into unclecode:feature/scraper Jan 20, 2025