Improve kedro run as a package #1423

antonymilne · 2022-04-07T10:55:08Z

⚠️ This is a huge description because I spent ages figuring this out and I thought we should have it all spelt out somewhere. Haven't done tests or docs yet - just opening this to get initial feedback.

The actual change to code is very small, but a big improvement I think: you can now do

from spaceflights.__main__ import main
outputs = main(pipeline="ds")
print(outputs)

and it will Just Work, even in IPython/Juptyer. All the command line ways of invoking main work exactly as before, i.e. python -m spaceflights --pipeline=ds etc. We now have consistency between main and CLI ways of launching kedro.

Description

A user can run their kedro project in several ways. Note that the run command executed can be defined on the framework side or overridden in turn by a plugin or a project cli.py (done by _find_run_command).

Kedro CLI: kedro run. This is the only route that doesn't go through the project __main__.py; instead it goes through kedro.framework.main, which builds the CLI tree and does something like _find_run_command
CLI but not kedro CLI: python -m spaceflights. This hits the project __main__.py and will call main
CLI but not kedro CLI: python src/spaceflights. This is just a more unusual way of doing 2
Inside your own python script: from spaceflights.__main__ import main; main(), then run the script using python
Inside IPython/Jupyter with no kedro ipython extension: same as 4 but you run it in the notebook
Inside IPython/Jupyter with kedro ipython extension: same as 5 but you can also do session.run (which is what we have advertised as the way to do a kedro run in the past)

All the above must be run from within the project root or kedro won't be able to find your conf. Options 2 onwards needs you to have pip installed your project or to have src in your PYTHONPATH. Note that having the project pip installed could mean first doing kedro package and then pip install the resulting .whl file or it could mean just pip install ./src from your project root; it doesn't make a difference.

Current problems

When using 1, 2, 3 and 4 you specify arguments using the CLI syntax --pipeline ds. With main() you do main(["--pipeline", "ds"]), i.e. the CLI syntax shoehorned into a Python function call. When using session.run you do a pure Python function call as session.run(pipeline_name="ds") - but note the argument name is different!!
Methods 5 and 6 run ok but generate a horrible big mysterious ipython exception after completing
Method 4 isn't actually useful if you want to run a kedro project with anything else since (big problem) no code after main() will actually execute...
... and (smaller problem) main() doesn't return anything like session.run does, so you couldn't use any kedro outputs downstream anyway
Method 6 using main doesn't work unless you call session.close first because you'll get some error about there already being an active session, same as %run_viz line magic requires session.close first kedro-viz#811

This PR fixes 1, 2, 3, 4. We should have separate ticket(s) to fix 5 and also the following:

move the horrible code out of the project __main__.py to somewhere framework side, because no one wants to see that in their project... Possible locations might be kedro.framework.cli.utils or kedro.framework.project
figure out if we can get a user to do a nicer looking import than from spaceflights.__main__ import main, which doesn't really look like it should be done
fix a bug where commands defined in the project cli.py aren't given their full path when you do kedro but instead Project specific commands from cli. I think this is a bug anyway and it didn't used to be like this...
consider changing the behaviour when the project cli.py exists but is empty/all commented out. I found the current behaviour a bit annoying when playing around with this
consider whether we should recommend main over session.run in IPython or even stop exposing session altogether. Mooted briefly in Technical design decision record for KedroSession #1335 but needs more discussion. main is a higher-level function than session.run and respects the hierarchy of framework < plugins < project cli.py. The arguments to main also match the usual kedro run arguments, which those in session.run do not

Development notes

There's basically two things that have changed:

add return to the kedro run command. session.run, and hence main, returns the free outputs of the pipeline after the kedro run. This fixes problem 4
change how main calls the run command. This fixes problems 1, 2, 3

Adventures with click

I opened an issue on the click repo (which didn't go down well... 😅) describing the original source of problems 1, 2, 3. In short, when you run a click command it always exits afterwards with an Exception, even if execution was successful. This means it's impossible to run any code after that main in a Python script and also massively confuses ipython, which blows up.

For the record, some of the things that don't quite fix this are:

run(standalone_mode=True) like suggested on the click repo. This solves problem 2 and 3 but not 1
run.callback. This nearly gets you there but doesn't fill in the default arguments for you so you need to explicitly give all the arguments in the function call, which is horrible
run.main is equivalent to what we were doing before (run.__call__) so doesn't help
run.invoke is really close to doing it, but unlike run.forward doesn't fills in default arguments and won't notice CLI arguments when called with python -m spaceflights --pipeline=ds

Overall, the code here is I think the unique thing that does exactly what we want both on the CLI and in scripts/ipython 🎉

Checklist

Read the contributing guidelines
Opened this PR as a 'Draft Pull Request' if it is work-in-progress
Updated the documentation to reflect the code changes
Added a description of this change in the RELEASE.md file
Added tests to cover my changes

Signed-off-by: Antony Milne <[email protected]>

antonymilne · 2022-04-07T14:39:32Z

P.S. technically this is as breaking change since the way you call main with arguments has changed. But given that the main(["--pipeline", "ds"]) syntax is so new and not well-known I think we should just go for this in 0.18.1. It would be really nice to have this working for databricks, etc.

noklam · 2022-04-08T16:43:38Z

This is a nice summary, also reminds me I have some challenges to make kedro pipeline works with other python code. I think this would help integrate kedro with any downstream tasks.

#795. I am not sure if the discussion still holds or things can be done more easily now.

I think this doc is also useful to lay out how people can run kedro pipeline.

Method 4 isn't actually useful if you want to run a kedro project with anything else since (big problem) no code after main() will actually execute...

Why is this happening?

from spaceflights.main import main

It did look a bit weird, but I think it is because we want to execute python -m spaceflights thus we need to put it in a __main__.py file? Although I haven't see any project actually use this before.

consider changing the behaviour when the project cli.py exists but is empty/all commented out. I found the current behaviour a bit annoying when playing around with this

I am not super familiar with click and how cli.py and plugins get pick up by kedro, so can't comment on this now.

consider whether we should recommend main over session.run in IPython or even stop exposing session altogether. Mooted briefly in #1335 but needs more discussion. main is a higher-level function than session.run and respects the hierarchy of framework < plugins < project cli.py. The arguments to main also match the usual kedro run arguments, which those in session.run do not

I do like the idea of a higher-level entry point,

Misc:
This is not specific to kedro run, but I notice when I do kedro run I would get a dependencies error (Pip's error, and it's buried in the log that are not obvious at all), complaining that it doesn't match with the plugins, even I am not running the plugin at all.

Signed-off-by: Antony Milne <[email protected]>

noklam · 2022-06-09T09:48:38Z

Just an idea that pops up, do we only expect kedro run as a self-contained package? It is possible a kedro pipeline behaves just like a normal function where it takes an argument instead of catalogue only?

noklam · 2022-08-26T21:02:08Z

@AntonyMilneQB

In #1807 this would work for packaged kedro project. What's the benefit of calling the CLI instead of just using KedroSession, which is the preferable way of calling kedro pipeline programmatically?

One thing that I can think of is when you have a custom cli.py, but I don't think that's something we promote and isn't a common use case.

from kedro.framework.session import KedroSession
from kedro.framework.project import configure_project
package_name = <your_package_name>
configure_project(package_name)

with KedroSession.create(package_name) as session:
    session.run()

antonymilne · 2022-09-08T13:37:00Z

@noklam sorry for the incredibly slow response - finally got around to this issue on my big list of things to respond to... 😅

There's a few problems with doing session.run like that:

python -m spaceflights should be possible if you just want to run a packaged pipeline by itself. You should be able to trigger (part of) a pipeline by itself from command line without needing to write a Python script to do it
in the context of a Python script, your code works fine but doing configure_project and a context manager for a KedroSession feels overly complex, fiddly and too low-level. There should be a more user-friendly way that wraps all that into a single function call. It should be more obvious how to run a kedro pipeline programmatically. The code you wrote is fine for our "under the hood" implementation but I feel like there should be more accessible higher-level function to do it
as you say, it doesn't respect the cli.py run command > plugin run command command > kedro run command hierarchy. In the future I hope this will become increasingly irrelevant. As you've noted before, the cli.py route is not well used (more of a historical leftover I think) so maybe it should be removed. The use of a plugin to override run will hopefully become less common if we do something like Universal Kedro deployment (Part 3) - Add the ability to extend and distribute the project running logic #1041 also. But for now (0.18.x), I think maintaining this hierarchy is worth doing

noklam · 2022-09-08T14:07:34Z

Thanks @AntonyMilneQB, please see my comments.

kedro/kedro/framework/cli/project.py

Lines 344 to 376 in 3c1d6a7

    
           def run( 
        
               tag, 
        
               env, 
        
               runner, 
        
               is_async, 
        
               node_names, 
        
               to_nodes, 
        
               from_nodes, 
        
               from_inputs, 
        
               to_outputs, 
        
               load_version, 
        
               pipeline, 
        
               config, 
        
               params, 
        
           ): 
        
               """Run the pipeline.""" 
        
               runner = load_obj(runner or "SequentialRunner", "kedro.runner") 
        
               tag = _get_values_as_tuple(tag) if tag else tag 
        
               node_names = _get_values_as_tuple(node_names) if node_names else node_names 
        
               with KedroSession.create(env=env, extra_params=params) as session: 
        
                   session.run( 
        
                       tags=tag, 
        
                       runner=runner(is_async=is_async), 
        
                       node_names=node_names, 
        
                       from_nodes=from_nodes, 
        
                       to_nodes=to_nodes, 
        
                       from_inputs=from_inputs, 
        
                       to_outputs=to_outputs, 
        
                       load_versions=load_version, 
        
                       pipeline_name=pipeline, 
        
                   )

python -m spaceflights should be possible if you just want to run a packaged pipeline by itself. You should be able to trigger (part of) a pipeline by itself from the command line without needing to write a Python script to do it

I understand this need, but I also doubt if there are any people doing this for a real project, executing a module entirely with python -m isn't a common thing that I do. With a kedro project it's likely you have more than one pipeline, and will need different parameters, so it feels quite limited. If package_name is a useful argument in KedroSession (See point 2), then it can be just group in kedro run --package_name=xxx and all other arguments, which is much more consistent and powerful. It's not a good reason to keep yet one more CLI entrypoint when kedro run can easily do the same thing.

in the context of a Python script, your code works fine but doing configure_project and a context manager for a KedroSession feels overly complex, fiddly and too low-level. There should be a more user-friendly way that wraps all that into a single function call. It should be more obvious how to run a kedro pipeline programmatically. The code you wrote is fine for our "under the hood" implementation but I feel like there should be more accessible higher-level function to do it

I agree that configure_project is too low-level as a user-facing function. I think the problem here is what KedroSession.create(package_name=xxx) does, it actually did nothing. The only reason why it works in a kedro project is we did this configure_project steps behind the scene, either through cli or reload_kedro, what you pass into KedroSession doesn't matter at all.

I am not sure if KedroSession and context managers are too low level though, given this is the standard way to work with the notebook (Databricks) or any Python API. Currently, I view session.run as the closest Python API to kedro run.
If the package_name argument is doing its job, then it should be the same for both project or package mode. This is also quite consistent with the current way. kedro run is just using KedroSession.run and the only job of the cli is parsing the argument.

Why do we need a context manager? Is that a good reason to keep them at all? Note that both extra_params and env are in the context manager because we are saving them to session_store, but it doesn't add anything at the moment. In the past, it has some responsibility of activating/deactivating sessions, but that was removed for a while already. In reality, "closing" a session doesn't do anything. So in theory there can be a kedro_run function which basically just do "Create a Session and run a session".

as you say, it doesn't respect the cli.py run command > plugin run command command > kedro run command hierarchy. In the future, I hope this will become increasingly irrelevant. As you've noted before, the cli.py route is not well used (more of a historical leftover I think) so maybe it should be removed. The use of a plugin to override run will hopefully become less common if we do something like Universal Kedro deployment (Part 3) - Add the ability to extend and distribute the project running logic #1041 also. But for now (0.18.x), I think maintaining this hierarchy is worth doing

I think this is the only compelling reason to me for keeping backward compatibility. But as you said, it's not something we want to keep and this is also a very uncommon usage.

Ideally, there should be only 1 CLI and 1 way of doing it in Python API, either python scripts or IPython. And I think this isn't hard to achieve as we tidy up some of the KedroSession leftover.

antonymilne · 2022-09-09T17:08:27Z

python -m spaceflights should be possible if you just want to run a packaged pipeline by itself. You should be able to trigger (part of) a pipeline by itself from the command line without needing to write a Python script to do it

I understand this need, but I also doubt if there are any people doing this for a real project, executing a module entirely with python -m isn't a common thing that I do. With a kedro project it's likely you have more than one pipeline, and will need different parameters, so it feels quite limited. If package_name is a useful argument in KedroSession (See point 2), then it can be just group in kedro run --package_name=xxx and all other arguments, which is much more consistent and powerful. It's not a good reason to keep yet one more CLI entrypoint when kedro run can easily do the same thing.

This might not be done much currently but it is a useful feature and something we should promote more (especially when it's fixed 😀).

one way which I didn't put in the original list is that you can just run spaceflights if you've first done kedro package and then pip installed the .whl file. This is essentially the same as doing to python -m spaceflights but not quite as safe because you can sometimes get confused Python environments (in the same way that python -m pip is recommended over pip)
perhaps surprisingly, and completely undocumented I think, this method allows arguments to be passed exactly as in kedro run, e.g. spaceflights --pipeline=data_science
kedro package ➡️ share .whl file ➡️ pip install .whl file ➡️ run python -m spaceflights is the simplest way to reliably share and deploy a project. Think of my kedro project as a Python package here that I'm trying to distribute. e.g. I have a kedro project I want to send to you so you can run it. I could zip up the code, send it to you, ask you to pip install the requirements and then execute kedro run. But the way to share Python packages is to distribute them (e.g. as a whl file). This way you get all the advantages of distributing something as an actual package, e.g. package versioning, automatic dependency installation when you pip install my spaceflights.whl. Philosophically it's also significant that you don't need to care that it's using kedro underneath. It's a black box to you - just a command you can run like black or git or whatever
the above point actually won't work as smoothly as that in reality because of the issue we have with not packaging conf, so I need to send that to you separately also... But you get the idea. In principle python -m spaceflights/just spaceflights is the correct way to execute a packaged project

I agree that configure_project is too low-level as a user-facing function. I think the problem here is what KedroSession.create(package_name=xxx) does, it actually did nothing. The only reason why it works in a kedro project is we did this configure_project steps behind the scene, either through cli or reload_kedro, what you pass into KedroSession doesn't matter at all.

I am not sure if KedroSession and context managers are too low level though, given this is the standard way to work with the notebook (Databricks) or any Python API. Currently, I view session.run as the closest Python API to kedro run. If the package_name argument is doing its job, then it should be the same for both project or package mode. This is also quite consistent with the current way. kedro run is just using KedroSession.run and the only job of the cli is parsing the argument.

Why do we need a context manager? Is that a good reason to keep them at all? Note that both extra_params and env are in the context manager because we are saving them to session_store, but it doesn't add anything at the moment. In the past, it has some responsibility of activating/deactivating sessions, but that was removed for a while already. In reality, "closing" a session doesn't do anything. So in theory there can be a kedro_run function which basically just do "Create a Session and run a session".

Very fair points all round here I think. I'm not sure why we have this configure_project step - I'd guess it was introduced when we started doing settings.py and lazy-loading of pipelines etc. If I remember correctly, the original method way to handle KedroSession did not have the context manager, and I'm not sure why it was changed to that. I'd guess that, like you suggest, it's because in the past we used to handle other cases which we don't care about any more (e.g. multiple runs per session, activating/deactivating sessions). Possibly this could be simplified quite a lot now, very nice idea! 👍 Of course it would be quite a breaking change, but might help clean things up more. We'd need to look back and understand what the original motivation for configure_project etc. was, since I'm not sure. Probably @idanov has a better understanding of all this.

noklam · 2022-09-27T13:25:55Z

...lates/project/{{ cookiecutter.repo_name }}/src/{{ cookiecutter.python_package }}/__main__.py

-    run = _find_run_command(package_name)
-    run(*args, **kwargs)
+    if kwargs:
+        run(**kwargs)


I guess this is unfinished? as I can see in the other __main__.py the result of session.run is return but here it doesn't return the output

merelcht · 2022-09-28T15:02:40Z

We discussed this in Technical Design and talked about the following:

The package_name argument provided in KedroSession.create() doesn't do anything, which is confusing to users. This argument should be removed, but since that is a breaking change it will happen in 0.19
The workaround to run a run within a session is valid, but it's important to note that bootstrap_project is only meant for configuring a project that hasn't been packaged. When working with a packaged project configure_project should be used instead.
The problem with the main() method is that it currently returns an exit code, so downstream processes can't do anything with it.

Some steps to take to improve the situation and explain it better to users:

noklam · 2022-09-28T15:19:14Z

The problem with the main() method is that it currently returns an exit code, so downstream processes can't do anything with it.

Clarify this one a bit. I think these are 2 problems

main should return the session.run() - so downstream processing is possible
click generate a sys.exit() by default, and since main wrap around the CLI so it creates some issue on Databricks or even just IPython - this link explains more.

merelcht · 2023-02-22T13:07:26Z

This task isn't done, and is definitely something we want to work on. @AntonyMilneQB will never forget about this, and will make sure it gets discussed properly 😄 This PR will be closed because it has diverged a lot from the main branch and isn't mergeable.

antonymilne added 3 commits April 7, 2022 11:04

Update template main

dce0a1d

Signed-off-by: Antony Milne <[email protected]>

Return session.run in run command

4f54f67

Signed-off-by: Antony Milne <[email protected]>

Add comments

1ead446

Signed-off-by: Antony Milne <[email protected]>

antonymilne requested a review from noklam April 7, 2022 14:39

antonymilne mentioned this pull request Apr 13, 2022

Design auto-registration of pipelines #1284

Closed

idanov added this to the 0.18+ milestone Apr 20, 2022

antonymilne added 3 commits April 21, 2022 10:36

Try out on databricks

4eee9dc

Signed-off-by: Antony Milne <[email protected]>

Add type hint

71e08de

Signed-off-by: Antony Milne <[email protected]>

Databricks debug

b89e446

Signed-off-by: Antony Milne <[email protected]>

merelcht removed this from the 0.18+ milestone Apr 25, 2022

Tidy

dd1273f

Signed-off-by: Antony Milne <[email protected]>

antonymilne mentioned this pull request Jul 21, 2022

Investigate KedroSession.create: ModuleNotFoundError: No module named 'None' #1583

Closed

noklam mentioned this pull request Aug 23, 2022

Should we change the output of session.run? #1802

Closed

1 task

antonymilne mentioned this pull request Aug 24, 2022

Packaged kedro pipeline does not work well on Databricks #1807

Closed

antonymilne added this to the Improve kedro run entrypoints milestone Sep 8, 2022

This was referenced Sep 23, 2022

Extract insights from configuration user interviews #1847

Closed

KedroSession not working on packaged kedro project #1869

Closed

noklam reviewed Sep 27, 2022

View reviewed changes

noklam mentioned this pull request Feb 1, 2023

Allow injecting data into a KedroSession run #2169

Open

This was referenced Feb 13, 2023

Add plural flags --nodes, --tags, and --load-versions to kedro run #2301

Merged

Investigate further inconsistencies between the CLI and Kedro API #2247

Open

merelcht closed this Feb 22, 2023

antonymilne mentioned this pull request Mar 2, 2023

Consider removal of project cli.py #2384

Open

merelcht deleted the feature/improve-package-main branch May 4, 2023 13:17

This was referenced Jun 14, 2023

package Kedro project should return session.run #2681

Closed

Consider using run(standalone_mode=True) to fix packaged Kedro project getting a sys.exit #2682

Closed

idanov mentioned this pull request Oct 18, 2023

Make default project entrypoint interactive (Databricks) friendly #3191

Closed

7 tasks

noklam mentioned this pull request Oct 27, 2023

Improve Kedro run as a Package (2023) #3237

Closed

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve kedro run as a package #1423

Improve kedro run as a package #1423

antonymilne commented Apr 7, 2022 •

edited

Loading

antonymilne commented Apr 7, 2022

noklam commented Apr 8, 2022

noklam commented Jun 9, 2022

noklam commented Aug 26, 2022 •

edited

Loading

antonymilne commented Sep 8, 2022 •

edited

Loading

noklam commented Sep 8, 2022 •

edited

Loading

antonymilne commented Sep 9, 2022

noklam Sep 27, 2022

merelcht commented Sep 28, 2022 •

edited

Loading

noklam commented Sep 28, 2022

merelcht commented Feb 22, 2023

Improve kedro run as a package #1423

Improve kedro run as a package #1423

Conversation

antonymilne commented Apr 7, 2022 • edited Loading

Description

Current problems

Development notes

Adventures with click

Checklist

antonymilne commented Apr 7, 2022

noklam commented Apr 8, 2022

noklam commented Jun 9, 2022

noklam commented Aug 26, 2022 • edited Loading

antonymilne commented Sep 8, 2022 • edited Loading

noklam commented Sep 8, 2022 • edited Loading

antonymilne commented Sep 9, 2022

noklam Sep 27, 2022

Choose a reason for hiding this comment

merelcht commented Sep 28, 2022 • edited Loading

noklam commented Sep 28, 2022

merelcht commented Feb 22, 2023

antonymilne commented Apr 7, 2022 •

edited

Loading

noklam commented Aug 26, 2022 •

edited

Loading

antonymilne commented Sep 8, 2022 •

edited

Loading

noklam commented Sep 8, 2022 •

edited

Loading

merelcht commented Sep 28, 2022 •

edited

Loading