Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Revise cookiecutter json in starters #205

Merged
merged 7 commits into from
Jan 24, 2024

Conversation

DimedS
Copy link
Contributor

@DimedS DimedS commented Jan 5, 2024

Motivation and Context

Removing the default values of the tools and example_pipeline variables from cookiecutter.json in the four starters is currently not feasible. Variables utilised in Cookiecutter must be declared in cookiecutter.json, even if they are consistently set during the cookiecutter() function call (which overrides values from the file). Omitting these declarations results in Cookiecutter raising an UndefinedVariableInTemplate error.

However, for clarity, it might be beneficial to leave the values blank or indicate their status with a note, such as:

tools: "protected internal variable - do not modify"
example_pipeline: "protected internal variable - do not modify"

This approach informs users of the special nature of these variables, highlighting that they are internally managed and should not be altered.

Checklist

  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Assigned myself to the PR
  • Added tests to cover my changes

@DimedS DimedS linked an issue Jan 5, 2024 that may be closed by this pull request
@DimedS DimedS changed the title Revise cookiecutter json in starters fix: Revise cookiecutter json in starters Jan 5, 2024
Signed-off-by: Dmitry Sorokin <[email protected]>
@DimedS DimedS force-pushed the 3483-revise-cookiecutterjson-starters branch from 5a60867 to bb8de1e Compare January 16, 2024 16:37
@DimedS DimedS marked this pull request as ready for review January 17, 2024 09:59
@AhdraMeraliQB
Copy link
Contributor

AhdraMeraliQB commented Jan 17, 2024

Summarising our discussion offline for posterity:

In Kedro's implementation, cookiecutter (and hence pyproject.toml) will always receive values that overwrite the defaults specified in cookiecutter.json. However, this knowledge is not available to cookiecutter, and so not supplying a default value results in an error.

I agree that indicating this is important for the understandability of the code, though I don't think empty strings are informative but perhaps something like:

tools: "default value required by cookiecutter - overwritten by Kedro implementation"
example_pipeline: "default value required by cookiecutter - overwritten by Kedro implementation"

Which informs why this default is there and it's lack of effect on the rest of the project generation process.

CC: @merelcht

@DimedS
Copy link
Contributor Author

DimedS commented Jan 18, 2024

It appears that in previous instances, a recursive link was utilized to indicate that the value would be derived from the codebase when executing the cookiecutter() function, like:
"kedro_version": "{{ cookiecutter.kedro_version }}",
so we can keep the same or discuss any more straight meaning text

@SajidAlamQB
Copy link
Contributor

SajidAlamQB commented Jan 18, 2024

Just so I understand are you saying that even if we are getting "kedro_version": "{{ cookiecutter.kedro_version }}" from the codebase it doesn't matter as we just replace it when we come to create the project?

If thats the case I have no issue replacing it with some descriptive placeholder for future contributors.

Copy link
Contributor

@AhdraMeraliQB AhdraMeraliQB left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TL;DR I'd prefer informative text (option 4) instead of using a reference like "kedro_version": "{{ cookiecutter.kedro_version }}" (option 3), on the simple premise that we didn't immediately realise this value had no effect on the actual generated project because it looked like it served a purpose.

Problem Context:

Note

These cookiecutter.json files will only be used when going through kedro new --starter=<starter name>; when using kedro new --example=yes the cookiecutter.json file in the Kedro framework template project is used instead.

The function of cookiecutter.json (as understood) is to serve as a place to specify the default values of different variables used during project generation (provides the defaults for cookiecutter prompts and populates placeholders in files like pyproject.toml ). Currently we specify six attributes in the cookiecutter.json files: project_name, repo_name, python_package, version, tools, and example_pipeline.

This only applies to spaceflights-pandas, spaceflights-pandas-viz, spaceflights-pyspark, and spaceflights-pyspark-viz because they are accessed through both kedro new --starter=<starter name> and kedro new --example=yes

When creating a project from these starter templates, the only information users must provide is a project name (either through the command flag --name/-n, a config file, or cookiecutter prompts) and as such project_name, repo_name, and python_package need to exist in cookiecutter.json (values of the latter two are dependent on project_name).

Furthermore, it is not simply the case that version, tools, and example_pipeline are optional to provide, but instead, the user is unable to provide these at all.

If users can't provide this information, why do we need to have defaults for them in cookiecutter.json?

Short answer: because of kedro new --example=yes.

As mentioned above, the defaults in cookiecutter.json are used in two places:

  1. Cookiecutter prompts for user input (optional to provide default values)
  2. Cookiecutter placeholders in the template project's files (must provide default values)

For the other templates, neither of these apply because we know this template can only be accessed through kedro new --starter=<starter name>, and so we know that no tools or example pipeline has been selected. As a result, we don't create any placeholders to capture tools and example_pipeline information (version still has a placeholder).

However, for these specific starters, when we create the project template we can't know how it will be used, and so we don't know if the user can provide information about their tools and example pipeline selection.

We want to store this information in pyproject.toml when it's provided but we don't know when that is (and two identical templates with different pyproject.toml files is definitely overkill), so we use cookiecutter placeholders to capture the information when it's provided.

Version is only slightly different; we have the placeholder to capture the information but we know in all cases the user cannot provide this information. Instead this is passed to cookiecutter through Kedro to fill the placeholder.

However, cookiecutter can only see placeholders that it doesn't have values for - without default values what is it expected to populate the placeholders with? It asks us this by throwing an error, and so we include default values in cookiecutter.json

If we need the default values to populate the placeholders, why are we changing them?

Short answer: because Kedro provides its own default values.

cookiecutter.json is not the only way to provide information to cookiecutter, the default values can be overwritten by providing extra_config to cookiecutter when calling the project generation process. What we know, that cookiecutter doesn't, is that Kedro will always provide values for version, tools, and example_pipeline, even when unspecified by the user. This means that the values for these attributes in cookiecutter.json will always be overwritten. This allows us to put whatever values we like in cookiecutter.json as they will never be used (with the current implementation).

What should we put in cookiecutter.json?

There are several options:

  1. The same default values as for the Kedro template cookiecutter.json (current implementation)

Pro: Consistent with Kedro template project
Con: Uninformative, lead to this investigation. Would this comment be discoverable enough to ensure this won't be raised again at a future point? I personally don't believe so.

  1. Empty strings

Pro: Suggests placeholders are populated with some other values (if I remember correctly .toml files complain about empty strings; if projects are created without error some other value must be provided instead)
Con: Better but still not very informative

  1. "kedro_version": "{{ cookiecutter.kedro_version }}", "tools": "{{ cookiecutter.tools }}"...

Pro: Suggests placeholders are populated with values directly passed to cookiecutter - if not passed through this would create infinite recursion
Con: Not informative without knowledge of cookiecutter

  1. Use as a comment to explain the purpose of the attributes, e.g. "default value required by cookiecutter - overwritten by Kedro implementation"

Pro: Explicitly informative, even without prior knowledge of how Kedro and cookiecutter work
Con: Likely unconventional

@DimedS
Copy link
Contributor Author

DimedS commented Jan 19, 2024

Thank you, @AhdraMeraliQB, for the comprehensive description of the issue and the potential solutions.
I agree with both @AhdraMeraliQB and @SajidAlamQB and have modified the variable names in accordance with @AhdraMeraliQB 's proposal. Let's wait until next week; we might receive additional opinions by then.

Copy link
Contributor

@AhdraMeraliQB AhdraMeraliQB left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! Would love to see some more opinions on this choice

@merelcht
Copy link
Member

I personally find the comment a bit messy and would prefer to just go for the "kedro_version": "{{ cookiecutter.kedro_version }}" style for all these fields. IMO we should document the behaviour for ourselves rather than put explanations in user facing files.

It would be good to hear what the rest of the team thinks @ankatiyar @noklam @lrcouto

In any case, whatever decision is made this should be applied across all starters for consistency.

@noklam
Copy link
Contributor

noklam commented Jan 22, 2024

I tend to agree @merelcht on this one. It's not something that user can change. It is still a good note that we should keep, maybe somewhere in the codebase or the develop docs if applicable.

Signed-off-by: Ahdra Merali <[email protected]>
@AhdraMeraliQB AhdraMeraliQB merged commit 866324d into main Jan 24, 2024
19 checks passed
@AhdraMeraliQB AhdraMeraliQB deleted the 3483-revise-cookiecutterjson-starters branch January 24, 2024 10:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Revise cookiecutter.json in starters
5 participants