implement subdomain focus feature in data-prep-connector #725
Signed-off-by: Hiroya Matsubara <[email protected]>
@Qiragg Please confirm that you can see this PR and comment on it. You can also tag me once you approve it. I am also soliciting input from the broader community on this one. I know we did this before in the first part of the year, and I want to make sure we capture lessons learned from the previous implementation (what worked and what did not). |
@hmtbr It would be great if you could review my comments and let me know your thoughts on how this should work.
@Qiragg @yuanchi2807 Please review and comment as you see fit. @Qiragg It may help if you can elaborate further on the related issue based on previous experience with similar functionality in bluecrawl. |
I am not knowledgeable enough about crawling requirements to comment. |
In bluecrawl, we can apply subdomain_focus automatically based on the input seed URL, but we cannot focus on multiple subdomains per job. For crawling, it makes sense to be able to launch a single job that focuses on multiple subdomains, which is what this feature would provide. This functionality was missing in DPK-connector and is much needed for launching certain targeted crawls. There are a couple of points to discuss here:
Ideally, we only want to crawl The PR looks good to me for now. I think if we get feedback regarding a different design choice that the user wants, we can think about it at that point. |
@Qiragg Thanks for the analysis. Good stuff. Can you click on "Files Changed", then click the green "Review changes" button, check "Approve", and click Submit to indicate your approval? |
Need to update pyproject.toml to reflect 0.2.2.dev1. (This will be the work-in-progress tag until it is released to PyPI.)
Signed-off-by: Hiroya Matsubara <[email protected]>
Currently there is no support for path focus with subdomain. It will produce pages from domain |
@hmtbr @Qiragg @shahrokhDaijavad Is this then a bug in the logic that needs to be fixed? If the application specifies www.ibm.com/docs with path focus, it should not receive anything from research.ibm.com/docs. No? |
@touma-I Thanks for your approval. These cases don't actually produce intuitive results. I will address them in another issue and PR soon. This PR itself won't introduce any regression, so I will merge it now to unblock our work. |
Thanks @hmtbr. Agreed. Now that this has matured to this point, it is appropriate to consider usability by outsiders who are not intimately familiar with our internal processes. I do think that if path focus is done right, we won't need subdomain focus, and it will be appropriate to do a second iteration. Thanks again. |
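To make the expectation in the question above concrete, here is a minimal sketch of a combined host and path check. It is not the connector's code: `matches_seed` is a hypothetical helper, and it assumes seeds are given as full URLs with a scheme and that path focus means a segment-wise prefix match.

```python
from urllib.parse import urlparse


def matches_seed(url: str, seed: str) -> bool:
    """Hypothetical check: same host as the seed AND under the seed's path."""
    s, u = urlparse(seed), urlparse(url)
    # Subdomain focus: the hostname must match the seed's hostname exactly.
    if u.hostname != s.hostname:
        return False
    # Path focus: the URL path must equal the seed path or sit below it.
    prefix = s.path.rstrip("/")
    return u.path == prefix or u.path.startswith(prefix + "/")


seed = "https://www.ibm.com/docs"
print(matches_seed("https://www.ibm.com/docs/en", seed))
print(matches_seed("https://research.ibm.com/docs", seed))
```

Under these assumptions the first call accepts the page and the second rejects it, which is the behavior the question describes.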
Why are these changes needed?
If the user provides https://research.example.com/ as a seed URL for the data-prep-connector, subdomain focus should be applied automatically so that we do not crawl subdomains of example.com other than research.
This PR implements the subdomain focus feature.
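As an illustration only, the idea can be sketched as follows. The helper names `focused_hosts` and `in_subdomain_focus` are hypothetical and are not the connector's API; the sketch also assumes that focus means an exact hostname match against the seed URLs, which may differ from what the PR actually does.

```python
from urllib.parse import urlparse


def focused_hosts(seed_urls):
    """Collect the hostnames of all seed URLs (hypothetical helper)."""
    hosts = set()
    for seed in seed_urls:
        host = urlparse(seed).hostname  # already lowercased by urlparse
        if host:
            hosts.add(host)
    return hosts


def in_subdomain_focus(url, hosts):
    """Return True if the URL's hostname is one of the focused hosts."""
    host = urlparse(url).hostname
    return host is not None and host in hosts


# Several seeds in one job give several focused subdomains.
hosts = focused_hosts(["https://research.example.com/", "https://docs.example.com/start"])
print(in_subdomain_focus("https://research.example.com/papers", hosts))
print(in_subdomain_focus("https://www.example.com/", hosts))
```

With these assumptions, a link to www.example.com found on a research.example.com page would be skipped, while links within either seed's subdomain would be followed.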
Related issue number (if any).
#724