Search feature #370
base: master
Conversation
@click.option(
    '--start_date',
    default='6 months ago',
    help='date to start searching from, defaults to 6 months ago'
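For context, here is a hedged sketch of how the excerpt above might plug into a complete command. The `search` command name and its body are assumptions for illustration, not the PR's actual code:

```python
import click


@click.command()
@click.option(
    '--start_date',
    default='6 months ago',
    help='date to start searching from, defaults to 6 months ago')
def search(start_date):
    # Echo the raw option value; real parsing would happen downstream.
    click.echo('searching from: {}'.format(start_date))


if __name__ == '__main__':
    search()
```

Because the default is a plain string, the command works with no arguments at all, which is exactly the "people will use this without setting a start date" scenario discussed below.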
I considered 6 months to be a reasonable default. My main concern is that people will use this without setting a spider and start date, which will be slow for them and generate a lot of requests to Scrapinghub Cloud. But making any of this mandatory might be too restrictive. What do you think?
I'd say it's too much: we have 120 days of data retention on the professional plan, so it doesn't make sense to make it larger than 4 months anyway. And from the usage perspective, how often do you search for a job that you ran a few months ago? I would set the option's default even lower, say a week or two.
@@ -39,6 +39,7 @@
     'six>=1.7.0',
     'tqdm',
     'toml',
+    'dateparser'
This could be avoided by requiring a specific date format and parsing that, but of course this offers more flexibility. Should I remove this?
`dateparser` is a nice and very convenient library; I don't mind adding it, and we could use it in the other commands later if needed 👍
@mirceachira The feature is really nice and in great demand, I believe. The core changes also look good in general, good work! However, it requires some additional changes and performance tuning on the server side before adding the feature to the official client, so I have to put the PR on hold temporarily while we're discussing what could be done. I'll post an update here or ping you privately when we're ready to proceed. Hopefully, it won't take long 🤞 In the meantime, could you please address the Flake8 complaints to fix the build and add some basic tests for the new command? Thanks in advance!
Hi @mirceachira, thanks a lot for the help! Please find a few of my random thoughts below.
1. Naming and ambiguity
1.a. The term "search" is rather broad while the functionality is very specific (searching for job IDs by filtering URLs in job requests).
1.b. The `start_date`/`end_date` options could surely accept values like "2 hours ago" since they're parsed by `dateparser`. And using a finer-than-date granularity (e.g. "from 14:00 UTC to 16:00 UTC last Monday") may be a common use case. Thus "time" may be a better choice than "date" here.
1.c. I was still unclear (or, more precisely, not quite sure) about the functionality even after reading all the help texts, and I've performed such operations quite a few times before. I thought it might be about job requests when reading "fetch job IDs based on URLs", but was far from sure. Try this: "URLs of a job" vs "URLs of a job's requests".
2. Considering more basic functionalities instead (?)
There have been some basic (a.k.a. potentially commonly needed) functionalities missing from `shub`, like fetching a job's metadata/stats (see also #277). Here are two examples related to this PR:
2.a. To return job IDs (and some other basic and quick info?) based on some or no condition (job start time, tags, etc.).
2.b. To support filters for existing requests/items/logs queries.
The newly added feature in this PR looks too specific compared to these two above.
Once these two missing features are ready, one could simply do something like this to achieve the same (or even better) functionality as in this PR:
$ shub list-jobs --start-time="1 hour ago" --only-job-ids | parallel 'if [[ $(shub requests {} --max-results=1 --filter '\''["url","contains",["foobar"]]'\'') ]]; then echo {}; fi'
And the good part here is that one could actually perform any supported query (not only a case-sensitive substring test on the URL field) on any supported resource (not only job requests).
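To make the intent of that pipeline concrete, here is an illustrative pure-Python sketch of the same idea operating on already-fetched data. The dict shapes (jobs with a `key` field, requests with a `url` field) loosely mirror Scrapy Cloud's job/request records but are assumptions for illustration, not the real client API:

```python
def matching_job_keys(jobs, requests_by_job, url_substring):
    """Yield keys of jobs with at least one request whose URL contains
    url_substring (a case-sensitive substring test, like the "contains"
    filter in the pipeline above)."""
    for job in jobs:
        key = job['key']
        if any(url_substring in r['url'] for r in requests_by_job.get(key, ())):
            yield key


# Tiny worked example with made-up job keys:
jobs = [{'key': '123/1/1'}, {'key': '123/1/2'}]
requests_by_job = {
    '123/1/1': [{'url': 'https://example.com/foobar'}],
    '123/1/2': [{'url': 'https://example.com/other'}],
}
```

The pipeline's advantage over a hard-coded loop like this is that the filter expression stays open-ended: any supported predicate on any supported resource.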
Implemented job resource filtering in #372
Force-pushed from a290a34 to b0e4614
fixes #366
Basic search feature for shub!
There are a million ways to expand this using Scrapinghub's API for filtering jobs and requests, but I think the core of it (and what is most useful in my case) is to find a job key starting from a URL, so I thought I'd start with that.
Let me know what you guys think :)