
Search feature #370

Open · wants to merge 2 commits into master

Conversation

dariuschira

Basic search feature for shub!

There are a million ways to expand this using Scrapinghub's API for filtering jobs and requests, but I think the core of it (and what is most useful in my case) is finding a project key starting from a URL, so I thought I'd start with that.

Let me know what you guys think :)

@dariuschira dariuschira mentioned this pull request Sep 16, 2019
@click.option(
    '--start_date',
    default='6 months ago',
    help='date to start searching from, defaults to 6 months ago')
Author
I considered 6 months to be a reasonable default. My main concern is that people will use this without setting a spider or start date, which will be slow for them and generate a lot of requests to Scrapinghub cloud, but making any of this mandatory might be too restrictive. What do you think?
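For illustration, here is a minimal stdlib-only sketch of what a relative default like '6 months ago' resolves to. This is hypothetical code, not the PR's implementation: the PR delegates this parsing to the dateparser library, which handles real calendar arithmetic, locales, and far more expressions.

```python
import re
from datetime import datetime, timedelta

# Rough day counts for illustration only; dateparser does proper
# calendar math instead of fixed day counts.
_UNIT_DAYS = {'day': 1, 'week': 7, 'month': 30, 'year': 365}

def parse_relative(expr, now=None):
    """Parse simple 'N <unit> ago' expressions into a datetime."""
    now = now or datetime.utcnow()
    match = re.match(r'(\d+)\s+(day|week|month|year)s?\s+ago', expr.strip())
    if not match:
        raise ValueError('unsupported expression: %r' % expr)
    count, unit = int(match.group(1)), match.group(2)
    return now - timedelta(days=count * _UNIT_DAYS[unit])
```

With a fixed reference time, `parse_relative('2 weeks ago', datetime(2019, 9, 16))` yields `datetime(2019, 9, 2)`.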

Contributor

@vshlapakov vshlapakov Sep 27, 2019

I'd say it's too much: we have 120 days of data retention for the professional plan, so it doesn't make sense to have it larger than 4 months anyway. And from the usage perspective, how often do you search for a job that you ran a few months ago? I would set the option's default even lower, say a week or two.
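The retention argument above could also be enforced in code. The constant and helper below are purely illustrative (neither is part of the PR): they clamp a requested start date to the 120-day retention horizon mentioned in the review.

```python
from datetime import datetime, timedelta

# 120 days of data retention on the professional plan, per the review
# comment above; this constant and helper are hypothetical.
RETENTION_DAYS = 120

def clamp_start_date(start, now=None):
    """Never search earlier than the data-retention horizon."""
    now = now or datetime.utcnow()
    horizon = now - timedelta(days=RETENTION_DAYS)
    return max(start, horizon)
```

A start date older than the horizon is silently pulled forward to it, so a generous default cannot trigger queries into data that no longer exists.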

@@ -39,6 +39,7 @@
     'six>=1.7.0',
     'tqdm',
     'toml',
+    'dateparser'
Author
This could be avoided by using a specific date format and parsing it ourselves, but of course dateparser offers more flexibility. Should I remove this?
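The strict-format alternative mentioned here could be as small as a stdlib sketch like this (the function name and format choice are hypothetical):

```python
from datetime import datetime

def parse_strict(value, fmt='%Y-%m-%d'):
    """Accept only one explicit format instead of free-form text.

    Trades dateparser's flexibility ('6 months ago', '2 hours ago')
    for zero extra dependencies and unambiguous input.
    """
    return datetime.strptime(value, fmt)
```

The trade-off is exactly the one the comment raises: no extra dependency, but users lose relative expressions like "6 months ago".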

Contributor
dateparser is a nice and very convenient library; I don't mind adding it, and we could use it in the other commands later if needed 👍

@vshlapakov
Contributor

@mirceachira The feature is really nice and in great demand, I believe. The core changes also look good in general, good work! However, it requires some additional changes and performance tuning on the server side before adding the feature to the official client, so I have to put the PR on hold temporarily while we're discussing what could be done. I'll post an update here or ping you privately when we're ready to proceed. Hopefully, it won't take long 🤞

In the meantime, could you please address the Flake8 complaints to fix the build and add some basic tests for the new command? Thanks in advance!

Member

@starrify starrify left a comment

Hi @mirceachira, thanks a lot for the help! Please find a few of my random thoughts below.

1. Naming and ambiguity

1.a. The term "search" is rather broad while the functionality is very specific (searching for job IDs by filtering URLs in job requests).
1.b. The start_date / end_date options can surely accept values like "2 hours ago" since they're parsed by dateparser, and finer-than-date granularity (e.g. "from 14:00 UTC to 16:00 UTC last Monday") may be a common use case. Thus "time" may be a better choice than "date" here.
1.c. I was still unclear (or, more precisely, not too sure) about the functionality even after reading all the help texts, even though I've performed such operations quite a few times before. I thought it might be job requests when reading "fetch job IDs based on URLs" but was far from sure. Try this: "URLs of a job" vs "URLs of a job's requests".

2. Considering more basic functionalities instead (?)

There have been some basic (a.k.a. potentially commonly needed) functionalities missing from shub, like fetching a job's metadata / stats (see also #277). Here are two examples related to this PR:

2.a. To return job IDs (and some other basic and quick info?) based on some or no condition (job start time, tags, etc.).
2.b. To support filters for existing requests/items/logs queries.

The newly added feature in this PR would look too specific compared to the two above.

Once these two missing features above are ready, one may simply do something like this to achieve the same (or even better) functionality as in this PR:

$ shub list-jobs --start-time="1 hour ago" --only-job-ids | parallel 'if [[ $(shub requests {} --max-results=1 --filter '\''["url","contains",["foobar"]]'\'') ]]; then echo {}; fi'

And the good part here is one may actually perform any supported query (not only case-sensitive substring test of the URL field) on any supported resource (not only job requests).
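The `--filter` expression in the pipeline above follows a `[field, operator, arguments]` shape. To make its semantics concrete, here is a hypothetical local evaluator for just the `contains` operator (the hosted API supports more operators; this sketch is not part of shub or the PR):

```python
def matches(record, filt):
    """Evaluate a [field, operator, arguments] filter against a dict.

    Only 'contains' (a case-sensitive substring test, matching any of
    the given arguments) is sketched here for illustration.
    """
    field, op, args = filt
    value = record.get(field, '')
    if op == 'contains':
        return any(needle in value for needle in args)
    raise ValueError('unsupported operator: %r' % op)
```

For example, `matches({'url': 'http://foobar.example/x'}, ['url', 'contains', ['foobar']])` is true, while a URL without the substring is rejected.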

@starrify
Member

starrify commented Oct 4, 2019

Implemented job resource filtering in #372

@rafaelcapucho rafaelcapucho force-pushed the master branch 2 times, most recently from a290a34 to b0e4614 on January 15, 2021 02:28
@apalala
Contributor

apalala commented Sep 8, 2021

fixes #366
