Search feature #370
base: master
Conversation
@click.option(
    '--start_date',
    default='6 months ago',
    help='date to start searching from, defaults to 6 months ago'
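For context, here is a hedged sketch of how the excerpt above might plug into a complete command. The `search` command name and its body are assumptions for illustration, not the PR's actual code:

```python
import click


@click.command()
@click.option(
    '--start_date',
    default='6 months ago',
    help='date to start searching from, defaults to 6 months ago')
def search(start_date):
    # Echo the raw option value; real parsing would happen downstream.
    click.echo('searching from: {}'.format(start_date))


if __name__ == '__main__':
    search()
```

Because the default is a plain string, the command works with no arguments at all, which is exactly the "people will use this without setting a start date" scenario discussed below.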
I considered 6 months to be a reasonable default. My main concern is that people will use this without setting a spider and start date, which will be slow for them and generate a lot of requests to Scrapinghub Cloud. But making any of this mandatory might be too restrictive. What do you think?
I'd say it's too much: we have 120 days of data retention on the professional plan, so it doesn't make sense to make it larger than 4 months anyway. And from the usage perspective, how often do you search for a job that you ran a few months ago? I would set the option's default even lower, say a week or two.
@@ -39,6 +39,7 @@
     'six>=1.7.0',
     'tqdm',
     'toml',
+    'dateparser'
This could be avoided by requiring a specific date format and parsing that, but of course this offers more flexibility. Should I remove this?
`dateparser` is a nice and very convenient library; I don't mind adding it, and we could use it in the other commands later if needed 👍
@mirceachira The feature is really nice and in great demand, I believe. The core changes also look good in general, good work! However, it requires some additional changes and performance tuning on the server side before adding the feature to the official client, so I have to put the PR on hold temporarily while we're discussing what could be done. I'll post an update here or ping you privately when we're ready to proceed. Hopefully, it won't take long 🤞 In the meantime, could you please address the Flake8 complaints to fix the build and add some basic tests for the new command? Thanks in advance!
Hi @mirceachira, thanks a lot for the help! Please find a few of my random thoughts below.
1. Naming and ambiguity
1.a. The term "search" is rather broad while the functionality is very specific (searching for job IDs by filtering URLs in job requests).
1.b. The `start_date`/`end_date` options could surely accept values like "2 hours ago" since they're parsed by `dateparser`. And using a finer-than-date granularity (e.g. "from 14:00 UTC to 16:00 UTC last Monday") may be a common use case. Thus "time" may be a better choice than "date" here.
1.c. I was still unclear (or, more precisely, not quite sure) about the functionality even after reading all the help texts, and I've performed such operations quite a few times before. I thought it might be about job requests when reading "fetch job IDs based on URLs", but was far from sure. Try this: "URLs of a job" vs "URLs of a job's requests".
2. Considering more basic functionalities instead (?)
There have been some basic (a.k.a. potentially commonly needed) functionalities missing from `shub`, like fetching a job's metadata/stats (see also #277). Here are two examples related to this PR:
2.a. To return job IDs (and some other basic and quick info?) based on some or no condition (job start time, tags, etc.).
2.b. To support filters for existing requests/items/logs queries.
The newly added feature in this PR looks too specific compared to these two above.
Once these two missing features are ready, one could simply do something like this to achieve the same (or even better) functionality as in this PR:
$ shub list-jobs --start-time="1 hour ago" --only-job-ids | parallel 'if [[ $(shub requests {} --max-results=1 --filter '\''["url","contains",["foobar"]]'\'') ]]; then echo {}; fi'
And the good part here is that one could actually perform any supported query (not only a case-sensitive substring test on the URL field) on any supported resource (not only job requests).
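To make the intent of that pipeline concrete, here is an illustrative pure-Python sketch of the same idea operating on already-fetched data. The dict shapes (jobs with a `key` field, requests with a `url` field) loosely mirror Scrapy Cloud's job/request records but are assumptions for illustration, not the real client API:

```python
def matching_job_keys(jobs, requests_by_job, url_substring):
    """Yield keys of jobs with at least one request whose URL contains
    url_substring (a case-sensitive substring test, like the "contains"
    filter in the pipeline above)."""
    for job in jobs:
        key = job['key']
        if any(url_substring in r['url'] for r in requests_by_job.get(key, ())):
            yield key


# Tiny worked example with made-up job keys:
jobs = [{'key': '123/1/1'}, {'key': '123/1/2'}]
requests_by_job = {
    '123/1/1': [{'url': 'https://example.com/foobar'}],
    '123/1/2': [{'url': 'https://example.com/other'}],
}
```

The pipeline's advantage over a hard-coded loop like this is that the filter expression stays open-ended: any supported predicate on any supported resource.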
Implemented job resource filtering in #372
Force-pushed from a290a34 to b0e4614
fixes #366
Basic search feature for shub!
There are a million ways to expand this using Scrapinghub's API for filtering jobs and requests, but I think the core of it (and what is most useful in my case) is to find a job key starting from a URL, so I thought I'd start with that.
Let me know what you guys think :)