[For discussion] Add args from #392

Open · wants to merge 3 commits into master
28 changes: 25 additions & 3 deletions shub/schedule.py
@@ -34,6 +34,13 @@
Similarly, job-specific settings can be supplied through the -s option:

shub schedule myspider -s SETTING=VALUE -s LOG_LEVEL=DEBUG

Also, the spider can be run with all arguments taken from another job, using the -f option:
Contributor:

Should we extend the logic to optionally inherit job-level settings as well (with a different option, of course)? That could be done separately; here I'm more concerned about a consistent API. Say we reserve -f/--args-from for arguments and -s/--settings-from for settings. What do you think?

Author:

I think this makes total sense, thank you :). For settings we already have -s, so we could use -g/--settings-from, for example.

Apart from the settings, there are other parameters we could take from the "parent" job:

  • --environment (-n, --env-from)
  • --priority
  • --units
  • --tag

I'm not sure about priority, units, and tag, but what do you think about the environment?

Contributor:

To be honest, the short flag in -g/--settings-from looks a bit misleading 🤔 Also, inheriting different parameters from different jobs (say, tags from job A and settings from job B) doesn't look useful in real life; you'd usually inherit parameters from a single job.

My suggestion here would be:

  • add a single key-value param specifying a job to use as the base one, like -i/--inherit-from
  • add boolean flags for the different parameters, like --inherit-args, --inherit-settings, --inherit-environment, --inherit-tags
  • inheriting priority/units doesn't seem useful, since each is just one number, but I might be wrong there
  • using --inherit-from without specifying what exactly to inherit should fail with a warning

Your thoughts?

Author (@PyExplorer, Feb 26, 2021):

The argument names are a bit long (--inherit-environment), but they are clear, so using them should be no problem. The approach with a single --inherit-from argument looks good and quite flexible.

About "using --inherit-from without specifying what exactly to inherit should fail with a warning": what do you think about inheriting all parameters (args, settings, environment, tags, priority, units) at once instead of raising a warning? In most cases that is exactly the behavior we need. Alternatively, we could add a new argument like --inherit-all (in addition to --inherit-args, etc.).

Contributor:

> instead of raising a warning it will inherit all parameters

Looks good to me, let's go this way 👍🏻
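For concreteness, a minimal Click sketch of the interface agreed above. The option names come from this thread; the callback body and its output are illustrative only and not part of this PR:

    import click

    @click.command()
    @click.option('-i', '--inherit-from',
                  help='project/spider/job to inherit parameters from')
    @click.option('--inherit-args', is_flag=True, help='inherit spider arguments')
    @click.option('--inherit-settings', is_flag=True, help='inherit job settings')
    @click.option('--inherit-environment', is_flag=True, help='inherit environment')
    @click.option('--inherit-tags', is_flag=True, help='inherit tags')
    def cli(inherit_from, inherit_args, inherit_settings,
            inherit_environment, inherit_tags):
        flags = {'args': inherit_args, 'settings': inherit_settings,
                 'environment': inherit_environment, 'tags': inherit_tags}
        if inherit_from:
            # As agreed above: no explicit --inherit-* flag means
            # "inherit everything at once" rather than failing.
            selected = [k for k, v in flags.items() if v] or list(flags)
            click.echo('inheriting %s from %s'
                       % (', '.join(selected), inherit_from))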


shub schedule myspider -f 123/21/134

Arguments passed with -a will override the copied ones.

"""

SHORT_HELP = "Schedule a spider to run on Scrapy Cloud"
@@ -54,15 +61,17 @@
help='Amount of Scrapy Cloud units (-u number)')
@click.option('-t', '--tag',
help='Job tags (-t tag)', multiple=True)
def cli(spider, argument, set, environment, priority, units, tag):
@click.option('-f', '--args_from',
Contributor:

For consistency I'd suggest using a hyphen here; Click converts it to args_from automatically (a small illustration follows).

Suggested change:
-@click.option('-f', '--args_from',
+@click.option('-f', '--args-from',
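A standalone sketch, not from the PR, showing that Click behavior: the hyphenated long option arrives as an underscored Python parameter.

    import click

    @click.command()
    @click.option('-f', '--args-from', help='job key to copy arguments from')
    def demo(args_from):  # '--args-from' maps to the 'args_from' parameter
        click.echo(args_from)

    if __name__ == '__main__':
        demo()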

help='project/spider/job for copying arguments (-f 123/321/456)')
def cli(spider, argument, set, environment, priority, units, tag, args_from):
try:
target, spider = spider.rsplit('/', 1)
except ValueError:
target = 'default'
targetconf = get_target_conf(target)
job_key = schedule_spider(targetconf.project_id, targetconf.endpoint,
targetconf.apikey, spider, argument, set,
priority, units, tag, environment)
priority, units, tag, environment, args_from)
watch_url = urljoin(
targetconf.endpoint,
'../p/{}/{}/{}'.format(*job_key.split('/')),
@@ -78,11 +87,13 @@ def cli(spider, argument, set, environment, priority, units, tag):


def schedule_spider(project, endpoint, apikey, spider, arguments=(), settings=(),
priority=DEFAULT_PRIORITY, units=None, tag=(), environment=()):
priority=DEFAULT_PRIORITY, units=None, tag=(), environment=(),
args_from=None):
client = ScrapinghubClient(apikey, dash_endpoint=endpoint)
try:
project = client.get_project(project)
args = dict(x.split('=', 1) for x in arguments)
args = add_args_from_job(client, args, args_from)
cmd_args = args.pop('cmd_args', None)
meta = args.pop('meta', None)
job = project.jobs.run(
Expand All @@ -99,3 +110,14 @@ def schedule_spider(project, endpoint, apikey, spider, arguments=(), settings=()
return job.key
except ScrapinghubAPIError as e:
raise RemoteErrorException(str(e))

def add_args_from_job(client, base_args, args_from):
if not args_from:
return base_args
job_args = get_args_from_parent_job(client, args_from).copy()
Contributor:

Should we use copy.deepcopy here instead?

job_args.update(base_args)
return job_args

def get_args_from_parent_job(client, args_from):
job = client.get_job(args_from)
return job.metadata.get("spider_args") or {}
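
Following up on the deepcopy question in the review above, a hedged sketch of that variant (it reuses get_args_from_parent_job as defined above; whether spider_args can actually contain nested mutable values depends on what the API returns):

    import copy

    def add_args_from_job(client, base_args, args_from):
        if not args_from:
            return base_args
        # deepcopy also duplicates nested mutable values; a shallow .copy()
        # only duplicates the top-level dict, leaving nested values shared
        # with the metadata returned by the client.
        job_args = copy.deepcopy(get_args_from_parent_job(client, args_from))
        job_args.update(base_args)  # -a arguments still override inherited ones
        return job_args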
43 changes: 39 additions & 4 deletions tests/test_schedule.py
@@ -24,20 +24,20 @@ def test_schedules_job_if_input_is_ok(self, mock_schedule):
# Default
self.runner.invoke(schedule.cli, ['spider'])
mock_schedule.assert_called_with(
proj, endpoint, apikey, 'spider', (), (), 2, None, (), ())
proj, endpoint, apikey, 'spider', (), (), 2, None, (), (), None)
# Other project
self.runner.invoke(schedule.cli, ['123/spider'])
mock_schedule.assert_called_with(
123, endpoint, apikey, 'spider', (), (), 2, None, (), ())
123, endpoint, apikey, 'spider', (), (), 2, None, (), (), None)
# Other endpoint
proj, endpoint, apikey = self.conf.get_target('vagrant')
self.runner.invoke(schedule.cli, ['vagrant/spider'])
mock_schedule.assert_called_with(
proj, endpoint, apikey, 'spider', (), (), 2, None, (), ())
proj, endpoint, apikey, 'spider', (), (), 2, None, (), (), None)
# Other project at other endpoint
self.runner.invoke(schedule.cli, ['vagrant/456/spider'])
mock_schedule.assert_called_with(
456, endpoint, apikey, 'spider', (), (), 2, None, (), ())
456, endpoint, apikey, 'spider', (), (), 2, None, (), (), None)

@mock.patch('shub.schedule.ScrapinghubClient', autospec=True)
def test_schedule_invalid_spider(self, mock_client):
@@ -73,6 +73,41 @@ def test_forwards_args_and_settings(self, mock_client):
job_settings,
)

@mock.patch('shub.schedule.ScrapinghubClient', autospec=True)
def test_forwards_args_from(self, mock_client):
mock_proj = mock_client.return_value.get_project.return_value
with mock.patch('shub.schedule.get_args_from_parent_job') as mock_get_afpj:
mock_get_afpj.return_value = {"job_arg": "test"}

self.runner.invoke(
schedule.cli,
"testspider -a ARG=val1 --args_from 123/321/44".split(' '),
)
job_args = mock_proj.jobs.run.call_args[1]['job_args']
self.assertDictContainsSubset(
{'ARG': 'val1', "job_arg": "test"},
job_args,
)

@mock.patch('shub.schedule.ScrapinghubClient', autospec=True)
def test_forwards_overriding_args_and_args_from(self, mock_client):
# override args from another job by args set in cmd
mock_proj = mock_client.return_value.get_project.return_value
with mock.patch('shub.schedule.get_args_from_parent_job') as mock_get_afpj:
mock_get_afpj.return_value = {"job_arg": "test", 'ARG': 'val_test_job'}

self.runner.invoke(
schedule.cli,
"testspider -a ARG=val1 --args_from 123/321/44 "
"--argument ARGWITHEQUAL=val2=val2".split(' '),
)
job_args = mock_proj.jobs.run.call_args[1]['job_args']
self.assertDictContainsSubset(
{'ARG': 'val_test_job', 'ARGWITHEQUAL': 'val2=val2', "job_arg": "test"},
job_args,
)


@mock.patch('shub.schedule.ScrapinghubClient', autospec=True)
def test_forwards_tags(self, mock_client):
mock_proj = mock_client.return_value.get_project.return_value