[Feature Request]: URL support: Capable of web crawling and the corresponding content extraction. #315

Closed
1 task done
Umpire2018 opened this issue Apr 11, 2024 · 0 comments

Is there an existing issue for the same feature request?

  • I have checked the existing issues.

Is your feature request related to a problem?

No response

Describe the feature you'd like

The feature should crawl user-specified URLs, parse the fetched pages, and extract the content that matches user-defined criteria. Ideally it would support several content types (text, images, and tables) and make the extracted data easy to transform and store.
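
To make the request concrete, here is a minimal sketch of the extraction step for static HTML, using only the Python standard library. The names `ContentExtractor`, `extract`, and `SAMPLE` are hypothetical and not part of any existing API in this project. The sketch assumes flat (non-nested) tables and does not render JavaScript.

```python
from html.parser import HTMLParser


class ContentExtractor(HTMLParser):
    """Collects text, image URLs, and table rows from one HTML page."""

    def __init__(self):
        super().__init__()
        self.text = []     # text chunks found outside table cells
        self.images = []   # values of <img src="...">
        self.tables = []   # each table is a list of rows, each row a list of cells
        self._row = None   # row currently being filled, if any
        self._cell = None  # text parts of the cell being filled, if any
        self._skip = 0     # nesting depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.images.append(src)
        elif tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1
        elif tag in ("td", "th") and self._cell is not None:
            self._row.append(" ".join(self._cell))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None

    def handle_data(self, data):
        chunk = data.strip()
        if not chunk or self._skip:
            return  # ignore whitespace and script/style bodies
        if self._cell is not None:
            self._cell.append(chunk)
        else:
            self.text.append(chunk)


def extract(html):
    """Parse an HTML string and return text, images, and tables."""
    parser = ContentExtractor()
    parser.feed(html)
    parser.close()
    return {"text": parser.text, "images": parser.images, "tables": parser.tables}


# Hypothetical sample page, used only for illustration.
SAMPLE = """
<html><body>
<h1>Release notes</h1>
<p>Version 1.2 adds <b>URL</b> support.</p>
<img src="/static/arch.png" alt="architecture">
<table>
<tr><th>Name</th><th>Status</th></tr>
<tr><td>crawler</td><td>planned</td></tr>
</table>
<script>var x = 1;</script>
</body></html>
"""

result = extract(SAMPLE)
```

User-defined criteria could then be applied as plain filters over `result["text"]`, `result["images"]`, and `result["tables"]`. Dynamic pages would still need a browser-based fetch before this step.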

Describe implementation you've considered

Reference: QAnything

QAnything Architecture

  1. Task Management

    • Deploy a task manager to handle the distribution of crawling jobs.
    • Ensure tasks are evenly distributed across available resources to prevent bottlenecks.
    • Use a robust queue system to prioritize tasks, manage retries, and monitor the crawling process.
  2. Content Extraction with Playwright-Python and OCR

    • Employ Playwright for Python to automate and control browser environments for scraping dynamic web pages that rely on JavaScript.
    • Integrate OCR technology to recognize and extract text from images and other irregular content types that cannot be easily selected.
  3. Page Classification

    • Analyze the structure of each stored page and classify the page accordingly.
    • Use machine learning or heuristic methods to categorize pages for targeted data extraction.
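
The task-management and page-classification steps above could be prototyped roughly as follows. This is a single-process sketch, not QAnything's implementation. `CrawlTask`, `TaskManager`, `classify_page`, `fake_fetch`, and the `example.com` URLs are hypothetical, the fetch function is a stub standing in for a real browser-based fetch, and the classifier is a crude tag-counting heuristic rather than a trained model.

```python
import heapq
import itertools
import re
from dataclasses import dataclass, field


@dataclass(order=True)
class CrawlTask:
    priority: int                      # lower value is served first
    seq: int                           # tie-breaker that keeps FIFO order
    url: str = field(compare=False)
    attempts: int = field(default=0, compare=False)


class TaskManager:
    """Priority queue of crawl tasks with a bounded number of retries."""

    def __init__(self, fetch, max_retries=2):
        self._fetch = fetch            # callable(url) -> html; raises on failure
        self._max_retries = max_retries
        self._heap = []
        self._counter = itertools.count()
        self.results = {}              # url -> fetched HTML
        self.failed = {}               # url -> description of the last error

    def submit(self, url, priority=10):
        heapq.heappush(self._heap, CrawlTask(priority, next(self._counter), url))

    def run(self):
        while self._heap:
            task = heapq.heappop(self._heap)
            try:
                self.results[task.url] = self._fetch(task.url)
            except Exception as exc:
                task.attempts += 1
                if task.attempts <= self._max_retries:
                    # Re-queue behind fresh tasks so retries do not starve them.
                    task.priority += 1
                    task.seq = next(self._counter)
                    heapq.heappush(self._heap, task)
                else:
                    self.failed[task.url] = repr(exc)


def classify_page(html):
    """Crude heuristic: label a page by which kind of tag dominates."""
    lowered = html.lower()
    if lowered.count("<tr") >= 3:
        return "table"
    if lowered.count("<img") > len(re.findall(r"<p[\s>]", lowered)):
        return "image-heavy"
    return "article"


# Hypothetical pages and a stub fetcher that simulates failures.
PAGES = {
    "https://example.com/docs": "<h1>Guide</h1><p>Intro</p><p>More</p>",
    "https://example.com/flaky": (
        "<table><tr><td>a</td></tr><tr><td>b</td></tr><tr><td>c</td></tr></table>"
    ),
}
calls = {}


def fake_fetch(url):
    calls[url] = calls.get(url, 0) + 1
    if url.endswith("/flaky") and calls[url] == 1:
        raise TimeoutError("simulated timeout")
    if url.endswith("/dead"):
        raise ConnectionError("simulated failure")
    return PAGES[url]


manager = TaskManager(fake_fetch, max_retries=2)
manager.submit("https://example.com/dead", priority=1)
manager.submit("https://example.com/flaky", priority=5)
manager.submit("https://example.com/docs", priority=5)
manager.run()

labels = {url: classify_page(html) for url, html in manager.results.items()}
```

Distributing work evenly across several workers, and monitoring them, would need a real job queue on top of this. The sketch only shows the priority and retry logic, with classification applied to whatever was fetched successfully.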

Documentation, adoption, use case

No response

Additional information

BCEmbedding: Bilingual and Crosslingual Embedding for RAG

Umpire2018 mentioned this issue on Apr 11, 2024 (17 tasks)
Umpire2018 closed this as not planned on Apr 15, 2024