implement subdomain focus feature in data-prep-connector #725

hmtbr · 2024-10-18T06:48:28Z

Why are these changes needed?

If the user provides https://research.example.com/ as a seed URL for the data-prep-connector, they may want subdomain focus to be applied automatically, so that the crawl covers only the research subdomain of example.com and no other subdomains.

This PR implements the subdomain focus feature.
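
As a rough illustration of the intended behavior (not the code in this PR), subdomain focus can be thought of as restricting the crawl to the exact hostnames of the seed URLs. The helper names `allowed_hosts` and `in_subdomain_focus` are hypothetical and are not part of dpk_connector; the exact-hostname match is also an assumption of this sketch.

```python
from urllib.parse import urlparse


def allowed_hosts(seed_urls):
    """Collect the exact hostnames of the seed URLs (hypothetical helper)."""
    hosts = set()
    for seed in seed_urls:
        host = urlparse(seed).hostname  # already lowercased by urlparse
        if host:
            hosts.add(host)
    return hosts


def in_subdomain_focus(url, hosts):
    """Return True only if the URL's hostname is one of the seed hostnames."""
    host = urlparse(url).hostname
    return host is not None and host in hosts


hosts = allowed_hosts(["https://research.example.com/"])
print(in_subdomain_focus("https://research.example.com/blog/", hosts))  # True
print(in_subdomain_focus("https://www.example.com/", hosts))            # False
```

Because the allowed hosts are kept in a set, several seeds on different subdomains can be focused on within a single job.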

Related issue number (if any).

#724

Signed-off-by: Hiroya Matsubara <[email protected]>

matouma · 2024-10-18T11:12:25Z

@Qiragg Please confirm that you can see this PR and comment on it. You can also tag me once you approve it. I am also soliciting input from the broader community on this one. I know we did this before in the first part of the year, and I want to make sure we capture the lessons learned from the previous implementation (what worked and what did not).

matouma

@hmtbr It would be great if you could review my comments and let me know your thoughts on how this should work.

data-connector-lib/src/dpk_connector/core/crawler.py

data-connector-lib/src/dpk_connector/core/spiders/sitemap.py

touma-I · 2024-10-18T11:29:11Z

@Qiragg @yuanchi2807 Please review and comment as you see fit. @qqirag it may help if you can elaborate further on the related issue based on previous experience with similar functionality in bluecrawl.

yuanchi2807 · 2024-10-18T17:30:22Z

@Qiragg @yuanchi2807 Please review and comment as you see fit. @qqirag it may help if you can elaborate further on the related issue based on previous experience with similar functionality in bluecrawl.

I am not knowledgeable enough about crawling requirements to comment.

Qiragg · 2024-10-19T02:26:51Z

@Qiragg @yuanchi2807 Please review and comment as you see fit. @qqirag it may help if you can elaborate further on the related issue based on previous experience with similar functionality in bluecrawl.

In bluecrawl, we provide the ability to apply subdomain_focus automatically based on the input seed URL, but we cannot focus on multiple subdomains per job. For crawling, it makes sense to be able to launch a single job that focuses on multiple subdomains, which is what this feature provides. This functionality was missing in DPK-connector and is much needed for launching certain targeted crawls.

There are a couple of points to discuss here:

It is more intuitive to set the default subdomain_focus to true, but we will not do that here because we do not want to depart from the default crawler behavior of the earlier version.
We expect the user not to provide ibm.com and research.ibm.com as seed URLs at the same time if they want subdomain_focus to be applied to research. It is not clear whether we should reject jobs that are improperly configured in this way or allow the user to rectify their mistake. While the mistake is obvious in this case, in rare cases the user may want to apply path_focus and subdomain_focus at the same time by providing research.ibm.com/help/ and ibm.com/support/ as seeds. In such a case, my understanding is that they will end up crawling both ibm.com/help/ and ibm.com/support/.

Ideally, we only want to crawl research.ibm.com/help/ and ibm.com/support/ for such a user-defined case.
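
The ideal behavior described above could be sketched as follows, assuming each seed is turned into a (hostname, path prefix) rule and a URL is accepted only when both parts match the same seed. This is an illustration only, not the behavior of this PR; `seed_rules` and `allowed` are hypothetical names.

```python
from urllib.parse import urlparse


def _dir_path(path):
    """Normalize a path so that prefix checks respect segment boundaries."""
    path = path or "/"
    return path if path.endswith("/") else path + "/"


def seed_rules(seed_urls):
    """Turn each seed into a (hostname, path prefix) pair (hypothetical helper)."""
    rules = []
    for seed in seed_urls:
        # Seeds in the discussion are written without a scheme; add one for parsing.
        parsed = urlparse(seed if "://" in seed else "https://" + seed)
        rules.append((parsed.hostname, _dir_path(parsed.path)))
    return rules


def allowed(url, rules):
    """Accept a URL only if its host AND path prefix match the same seed."""
    parsed = urlparse(url)
    path = _dir_path(parsed.path)
    return any(
        parsed.hostname == host and path.startswith(prefix)
        for host, prefix in rules
    )


rules = seed_rules(["research.ibm.com/help/", "ibm.com/support/"])
print(allowed("https://research.ibm.com/help/faq.html", rules))  # True
print(allowed("https://ibm.com/help/faq.html", rules))           # False
```

Keeping the host and the path prefix together in one rule is what prevents the path of one seed from being combined with the host of another.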

The PR looks good to me for now. I think if we get feedback regarding a different design choice that the user wants, we can think about it at that point.

touma-I · 2024-10-20T23:10:15Z

@Qiragg Thanks for the analysis. Good stuff. Can you click on "Files Changed", then click the green "Review changes" button, check "Approve", and click Submit to indicate your approval?
@hmtbr:
1- What happens if we use the seed URL http://www.research.ibm.com/ and set path focus to true (i.e. path focus on the root, /)? Will that produce the same result as subdomain focus?
2- Also, what happens if the user specifies the "unexpected" case that @Qiragg identified above (i.e. two seed URLs, ibm.com and research.ibm.com, with subdomain focus set to true)? Will the code default to the less restrictive or the more restrictive case?

touma-I

Need to update pyproject.toml to reflect 0.2.2.dev1. (This will be the work-in-progress tag until it is released to pypi)

Signed-off-by: Hiroya Matsubara <[email protected]>

hmtbr · 2024-10-21T00:26:59Z

1- What happens if we use the seed URL http://www.research.ibm.com/ and set path focus to true (i.e. path focus on the root, /)? Will that produce the same result as subdomain focus?

Currently there is no support for path focus with subdomain. It will produce pages from domain *.ibm.com.

2- Also, what happens if the user specifies the "unexpected" case that @Qiragg identified above (i.e. two seed URLs, ibm.com and research.ibm.com, with subdomain focus set to true)? Will the code default to the less restrictive or the more restrictive case?

It will produce pages from domain *.ibm.com.
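
One way to surface this "unexpected" configuration before a crawl starts is to check whether any seed hostname is a parent domain of another seed hostname. The sketch below is only an assumption about how such a check might look; `overlapping_seeds` is a hypothetical helper, not something this PR adds.

```python
from urllib.parse import urlparse


def overlapping_seeds(seed_urls):
    """Return (parent, child) hostname pairs where one seed host contains another."""
    hosts = sorted(
        {urlparse(seed).hostname for seed in seed_urls if urlparse(seed).hostname}
    )
    pairs = []
    for parent in hosts:
        for child in hosts:
            # "research.ibm.com" is a child of "ibm.com" but not of "bm.com".
            if child != parent and child.endswith("." + parent):
                pairs.append((parent, child))
    return pairs


print(overlapping_seeds(["https://ibm.com/", "https://research.ibm.com/"]))
# [('ibm.com', 'research.ibm.com')]
```

A caller could then reject the job or log a warning when the returned list is non-empty; which of the two is preferable is the open question raised earlier in this thread.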

touma-I · 2024-10-21T19:16:21Z

1- What happens if we use the seed URL http://www.research.ibm.com/ and set path focus to true (i.e. path focus on the root, /)? Will that produce the same result as subdomain focus?

Currently there is no support for path focus with subdomain. It will produce pages from domain *.ibm.com.

@hmtbr @Qiragg @shahrokhDaijavad Is this a bug in the logic, then, that needs to be fixed? If the application specifies www.ibm.com/docs with path focus, it should not receive anything from research.ibm.com/docs. No?

hmtbr · 2024-10-22T03:29:00Z

@touma-I Thanks for your approval. Those cases indeed don't produce intuitive results. I will address them in another issue and PR soon. This PR itself won't introduce any regression, so I will merge it now to unblock our work.

touma-I · 2024-10-22T11:50:36Z

@touma-I Thanks for your approval. Those cases indeed don't produce intuitive results. I will address them in another issue and PR soon. This PR itself won't introduce any regression, so I will merge it now to unblock our work.

Thanks @hmtbr. Agreed. Now that this has matured to this point, it is appropriate to consider usability by outsiders who are not intimately familiar with our internal processes. I do think that if path focus is done right, we won't need subdomain focus, and it will be appropriate to do a second iteration. Thanks again.

implement subdomain focus feature in data-prep-connector

477da0f

Signed-off-by: Hiroya Matsubara <[email protected]>

hmtbr marked this pull request as ready for review October 18, 2024 06:50

refactoring

1722e36

Signed-off-by: Hiroya Matsubara <[email protected]>

hmtbr requested a review from touma-I October 18, 2024 08:09

matouma reviewed Oct 18, 2024

touma-I requested changes Oct 20, 2024

bump version

1dfe7ea

Signed-off-by: Hiroya Matsubara <[email protected]>

touma-I requested a review from shivdeep-singh-ibm October 21, 2024 12:23

Qiragg approved these changes Oct 21, 2024

touma-I approved these changes Oct 21, 2024

hmtbr merged commit b297156 into dev Oct 22, 2024
4 checks passed

hmtbr deleted the subdomain-focus branch October 22, 2024 03:29
