Question: What is the best way to use website crawler in a workspace? #605

Open
azaylamba opened this issue Nov 15, 2024 · 4 comments

@azaylamba
Contributor

The website crawler is a great feature for ingesting web pages into a workspace. What is the best way to update the workspace when some of those pages have changed after the initial crawl?
I believe we would need to crawl the website again. Would that result in duplicate documents in the workspace and the vector database?
How can we avoid the duplication and update the workspace with the updated pages?
Should we create a new workspace and crawl the website again? That doesn't seem scalable when the website content is updated frequently.

What is the best approach in this situation?

@charles-marion
Collaborator

Bedrock Knowledge Bases supports crawling websites and has an API to sync the data:
https://docs.aws.amazon.com/bedrock/latest/userguide/kb-data-source-sync-ingest.html

You might be able to set up EventBridge to periodically call the Bedrock StartIngestionJob API.
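
A minimal sketch of that idea, assuming a Lambda function invoked by an EventBridge schedule. The environment variable names `KNOWLEDGE_BASE_ID` and `DATA_SOURCE_ID` and the helper `start_kb_sync` are illustrative assumptions, not part of this project; the only AWS call used is `start_ingestion_job` on the boto3 `bedrock-agent` client.

```python
import os
from typing import Any, Optional


def start_kb_sync(knowledge_base_id: str, data_source_id: str,
                  client: Optional[Any] = None) -> str:
    """Start a Knowledge Base ingestion (sync) job and return its job ID."""
    if client is None:
        # Imported lazily so the helper can be exercised with a stub client.
        import boto3
        client = boto3.client("bedrock-agent")
    response = client.start_ingestion_job(
        knowledgeBaseId=knowledge_base_id,
        dataSourceId=data_source_id,
    )
    return response["ingestionJob"]["ingestionJobId"]


def handler(event: dict, context: Any) -> dict:
    """Lambda entry point, meant to be triggered by an EventBridge schedule."""
    # Assumed environment variables; set them on the Lambda function.
    job_id = start_kb_sync(
        os.environ["KNOWLEDGE_BASE_ID"],
        os.environ["DATA_SOURCE_ID"],
    )
    return {"ingestionJobId": job_id}
```

The Lambda role would need permission to start ingestion jobs on the knowledge base, and the schedule interval should probably match how often the site actually changes.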

An alternative is to periodically remove the website from the workspace and add it back. The integration test has an example where it adds an RSS feed and then removes it (document):
https://github.com/aws-samples/aws-genai-llm-chatbot/blob/main/integtests/chatbot-api/aurora_workspace_test.py#L62
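
As a rough sketch of the remove-and-re-add approach: the `WorkspaceClient` interface below, its method names, and the `id`/`path` document fields are hypothetical placeholders and not the chatbot's real API. The linked integration test shows the actual client calls.

```python
from typing import Dict, List, Protocol


class WorkspaceClient(Protocol):
    """Hypothetical interface; swap in the project's real API client."""

    def list_documents(self, workspace_id: str) -> List[Dict[str, str]]: ...

    def delete_document(self, workspace_id: str, document_id: str) -> None: ...

    def add_website(self, workspace_id: str, url: str) -> str: ...


def refresh_website(client: WorkspaceClient, workspace_id: str, url: str) -> str:
    """Delete existing documents for `url`, then add the website again.

    Returns the ID of the newly added document.
    """
    for doc in client.list_documents(workspace_id):
        # Assumes each document records the address it was crawled from.
        if doc["path"] == url:
            client.delete_document(workspace_id, doc["id"])
    return client.add_website(workspace_id, url)
```

Between the delete and the completion of the new crawl, the workspace has no content for that site, so this is better suited to sites where a short gap is acceptable.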

@azaylamba
Contributor Author

@charles-marion Thanks for the links, I will have a look.

@azaylamba
Contributor Author

@charles-marion Currently I am using OpenSearch vector storage and mostly uploading PDF documents to the workspace through the file upload option. I am considering the website crawler so that I don't have to upload the documents manually, since the same documents are also published as web pages on the website.
Setting up a Bedrock knowledge base would require a whole new setup around the workspace and vector storage. So I am exploring whether website crawling can be used with the existing workspace and OpenSearch vector storage without additional setup.
I think the second approach you suggested, periodically removing the website from the workspace and adding it back, is worth exploring. My concerns are the complexity and the downtime of the workspace while this happens. We would probably need to make sure that documents are deleted from everywhere, including S3, OpenSearch, DynamoDB, etc., which makes the periodic removal and re-addition more complex to set up.


This issue is stale because it has been open for 60 days with no activity.

@github-actions github-actions bot added the stale label Jan 15, 2025