clean html extractor #237
Currently, the retry for HTTP calls is missing here, unlike in airflow/include/tasks/extract/utils/html_helpers.py. I was thinking of doing it in a separate PR.
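For reference, a minimal sketch of what such a retry could look like, assuming simple exponential backoff. This is not the helper from html_helpers.py; `call_with_retry` and all of its parameters are hypothetical names used only for illustration.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retry(
    func: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    retry_on: Tuple[Type[BaseException], ...] = (OSError,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying on the given exceptions with exponential backoff.

    Hypothetical helper: the name, defaults, and backoff policy are assumptions,
    not the project's actual implementation.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retry_on:
            # Out of attempts: let the last error propagate to the caller.
            if attempt == attempts:
                raise
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

With the standard library this could wrap a fetch such as `call_with_retry(lambda: urllib.request.urlopen(url, timeout=10).read())`, since `urllib.error.URLError` is a subclass of `OSError`. Passing `sleep` as a parameter keeps the helper easy to test without real delays.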
Left a few minor suggestions, but I think we're almost good to merge it.
@pankajastro Can you run this again with the new weaviate provider by deleting the weaviate index and reingesting, please?
The code in the main branch uses the latest weaviate provider. You might want to rebase and re-run the DAG after deleting the weaviate index. First run the bulk ingest, then run the other DAGs.
203a61d to 8c65471
@pankajastro tested this; the only observation was that the source count has increased. However, as discussed with @sunank200, this was related to PR, so we should be OK to close this.
b518821 to 7780e21
closes: #164
Currently, we have some duplicate code in the HTML extractor. This PR aims to remove the duplicate code and reuse it from html_utils.