Extension Mechanism Implementation #1833

Merged: 147 commits, Aug 11, 2023
Conversation

costrouc
Member

@costrouc costrouc commented Jun 13, 2023

Closes nebari-dev/governance#35
Closes #865
Closes #1046

Introduction

This is an initial POC of the extension mechanism. The goal is to provide a public API, built on a pluggy interface, for managing stages and subcommands. With this work, the aim is that the majority of Nebari itself becomes a collection of extensions.

Nebari can finally be oblivious to the stages that are being run and have a much smaller core.

Emphasis on pydantic schema for nebari-config.yaml

The following is now a valid nebari-config.yaml. Significant work has been put into nebari.schema, which will provide much better error messages than before. The hope is that nebari.schema becomes THE way to know what configuration is valid.

project_name: mycluster
domain: example.nebari.dev
provider: aws
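
As a rough illustration of this schema-first direction (not the confirmed public API; the top-level model name Main and the exact import here are guesses), validating such a file could look something like:

import yaml
import pydantic

from nebari import schema

with open("nebari-config.yaml") as f:
    raw_config = yaml.safe_load(f)

try:
    # "Main" is a placeholder for whatever top-level pydantic model nebari.schema exposes
    config = schema.Main(**raw_config)
except pydantic.ValidationError as e:
    # pydantic reports every invalid or missing field at once,
    # which is where the improved error messages come from
    print(e)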

How will developers use the extension mechanism?

There are several ways developers will extend Nebari.

Via a pip-installed package that declares a nebari entry point, which auto-registers the plugins in the referenced module:

[project.entry-points.nebari]
mynebariextension = "mynebari.pluggy.plugin"
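
For context, the module referenced by that entry point might look roughly like the stage example further down; a minimal, hypothetical sketch (module path, stage name, and priority are illustrative):

# mynebari/pluggy/plugin.py -- the module named in the entry point above
from typing import List

from nebari.hookspecs import NebariStage, hookimpl


class MyStage(NebariStage):
    name = "90-my-extension"  # illustrative stage name
    priority = 90             # illustrative priority


@hookimpl
def nebari_stage() -> List[NebariStage]:
    # auto-registered once the package is pip installed,
    # thanks to the [project.entry-points.nebari] entry above
    return [MyStage]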

Via a command-line option to load a plugin module; multiple --import-plugin flags can be used:

nebari --import-plugin=mynebari.pluggy.plugin --import-plugin=./path/to/plugin.py deploy ...

Currently we have two plugin hooks, mentioned below and worth reviewing: nebari.hookspecs.NebariStage and nebari.hookspecs.Subcommand. Subcommand is fairly straightforward and you can see several examples in _nebari.subcommands. The stages are a bit more complicated, with many examples in _nebari.stages.*. The important thing is that Nebari now fully embraces extensions, and we use them internally ourselves.

You can easily replace a stage by specifying a stage with the same name but a higher priority, e.g.:

from typing import List

from nebari.hookspecs import NebariStage, hookimpl


class MyNewInfrastructureStage(NebariStage):
    name = "02-infrastructure"
    priority = 21


@hookimpl
def nebari_stage() -> List[NebariStage]:
    return [MyNewInfrastructureStage]

Currently there are the classes NebariStage and NebariTerraformStage, but it would be easy to add additional base classes. The goal is to make extending the stages as easy as possible. There has already been conversation about a NebariTerragruntStage.
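
As a rough sketch of what a Terraform-backed stage might look like (input_vars is mentioned later in this thread; the exact signature and the other details here are guesses, not the confirmed interface):

from typing import Any, Dict, List

from nebari.hookspecs import hookimpl
from _nebari.stages.base import NebariTerraformStage


class MyTerraformStage(NebariTerraformStage):
    name = "08-my-terraform-stage"  # illustrative
    priority = 80                   # illustrative

    def input_vars(self, stage_outputs: Dict[str, Any]) -> Dict[str, Any]:
        # values handed to Terraform as variables; validating these with
        # pydantic is one of the follow-ups discussed below
        return {"project_name": self.config.project_name}


@hookimpl
def nebari_stage() -> List[NebariTerraformStage]:
    return [MyTerraformStage]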

Key interfaces to Review

  • nebari.schema is now our public view and source of truth for the Nebari configuration
  • nebari.hookspecs.NebariStage is the base class for all stages
  • nebari.hookspecs.Subcommand exposes a hook for adding arbitrary Typer subcommands; we follow a pattern similar to datasette and conda here
  • _nebari.stages.base.NebariTerraformStage is a subclass of NebariStage which implements convenience features that allow Terraform stages to be more concise. A majority of the stages use this class.

Render, deploy, and destroy logic is significantly simpler. For example, here is the deploy logic now:

    stage_outputs = {}
    with contextlib.ExitStack() as stack:
        for stage in get_available_stages():
            s = stage(output_directory=pathlib.Path("."), config=config)
            # each stage's deploy() is a context manager; ExitStack keeps them
            # all open and exits them in reverse order when the block ends
            stack.enter_context(s.deploy(stage_outputs))

            if not disable_checks:
                s.check(stage_outputs)

Progress

Progress towards prior functionality

  • cli command init
  • cli command guided init
  • cli command dev
  • cli command validate
  • cli command keycloak
  • cli command support
  • cli command upgrade
  • working local provider render, deploy, destroy
  • working existing provider render, deploy, destroy
  • working gcp provider render, deploy, destroy
  • working azure provider render, deploy, destroy
  • working aws provider render, deploy, destroy
  • working do provider render, deploy, destroy

Feature ideas

  • All stages refactored into nebari.hookspecs.NebariStage
  • All commands refactored into nebari.hookspecs.Subcommand
  • Stages can be ordered via priority
  • Stages can be replaced via name (currently if a stage has higher priority and same name it will replace the lower priority stages)
  • schema enforcement throughout and heavy use of defaults
  • input vars in NebariTerraformStages should be validated via pydantic
  • schema descriptions for all fields
  • plugins can be loaded via entrypoints
  • plugin modules can be loaded via the cli option --import-plugin (e.g. _nebari.stages.bootstrap)
  • easy cli way of filtering stages (e.g. include stages, exclude stages etc.)
  • pre_* and post_* plugins to allow for arbitrary code before and after render, deploy, destroy
  • stages that only apply for certain configuration conditions e.g. provider = aws
  • model works such that z2jh, nebari, and tljh could all be deployed with nebari
  • disable prompt and kubernetes ingress/load balancer flag in stage

@costrouc costrouc requested a review from iameskild June 13, 2023 23:33
@costrouc
Member Author

costrouc commented Jun 14, 2023

From discussion with @iameskild we need significant control over how and when certain stages are chosen. For example here are some use cases that need to be considered:

  • How does a user replace a given stage(s)? (What happens when 3+ plugin hooks all say they want to replace the stage)
  • How at runtime do we allow certain stages to be excluded from the run?

A proposal (from me) is to use priority/name.

  • pytest uses tags/markers (maybe a good idea?)
  • dependency DAG? Not sure how this would be implemented

@iameskild is suggesting that a plugin hook be added which will modify the ordering of the hooks.

There should be a pre/post hook for check; it might also be nice to have them for render/deploy/destroy.

def pre_check(stage_name: str):
    ...  # run this before the stage

There should be a way to have a schema for that given stage as input and output.

Schemas everywhere. @iameskild mentioned that the NebariTerraformStage should have validation on internal methods e.g. input_vars.

Some next steps:

Subcommand built in to read an arbitrary Python module: nebari --import-module asd.qwer.mystage render should be possible.

We can give back to the JupyterHub / Zero to JupyterHub community by offering an easy z2jh install. There should be a way to explicitly override the given stages. This shows the importance of --include-stages and --exclude-stages.

Devtools to help us see what is going on before and after stages. How could we do this?

def post_check(name='infrastructure'):
    breakpoint()

@pavithraes pavithraes added type: enhancement 💅🏼 New feature or request status: in progress 🏗 This task is currently being worked on impact: high 🟥 This issue affects most of the nebari users or is a critical issue labels Jun 14, 2023
@iameskild
Member

@costrouc here is a basic POC of what an outer priority manager plugin object might do. If you change the priority level of MyStage1 relative to YourStage1, the order will switch.

From here we can implement more complex logic that handles replacing plugins for the same stage, resolves conflicts if the priority level is the same, etc.
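
A minimal sketch of the kind of resolution logic being discussed (not the actual implementation): keep the highest-priority stage for each name, then run stages in ascending priority order.

from typing import Dict, List, Type

from nebari.hookspecs import NebariStage


def resolve_stages(stages: List[Type[NebariStage]]) -> List[Type[NebariStage]]:
    # a stage with the same name but a higher priority replaces the lower one
    selected: Dict[str, Type[NebariStage]] = {}
    for stage in stages:
        existing = selected.get(stage.name)
        if existing is None or stage.priority > existing.priority:
            selected[stage.name] = stage
    # stages run in ascending priority order
    return sorted(selected.values(), key=lambda stage: stage.priority)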

@costrouc
Member Author

Thanks @iameskild, check out https://github.com/nebari-dev/nebari/blob/extension-mechanism/src/_nebari/stages/base.py#L89-L106 for the basic implementation here. I don't have plans to make any additions to this function, so I definitely think we should add some of the capabilities you mentioned, such as being able to filter stages.

I've done the initial work (still somewhat broken) of moving all commands to a nebari_subcommand hookspec.

Also the stages seem to sorta work. I've been able to successfully run the render and validate command.

Also spent a lot of time working on the schema (still needs a lot more work) and removing all config['...'].get(...) and replacing with config.value.key since with schema validation we have guarantees on the schema.

@costrouc
Member Author

Further work: I was able to get about midway through a local deployment (up to the Kubernetes ingress).

@iameskild
Member

iameskild commented Jun 15, 2023

As I've been working on the schema (the main schema and the ones specific to the InputVars), it got me thinking that it might make more logical sense to make stages (plugins) that are specific to cloud providers as well.

At the moment, the infrastructure stage is one stage that passes different input_vars based on the cloud provider. When I see this kind of conditional, it makes me think that these need to be broken out into separate stages, one per cloud provider.

Another example is the cluster_autoscaler, which is specific to AWS but is created within a stage that also creates Traefik resources, NVIDIA installers, and container registries; it might make more sense to have each of these be a separate plugin.

Although more work upfront, I see the benefits in the future when others want to (natively) support new cloud providers. Another possible downside to this approach is that the number of current plugins will likely grow quite quickly which might reduce the speed with which we can deploy the whole cluster (given the overhead of each terraform init, terraform plan required for each plugin). That said, I see the benefits being that each stage (plugin) only performs one specific job and can be more easily swapped out should we need or want to in the future.

The tricky part of this entire extension-mechanism work seems to be how we handle prioritizing and swapping out plugins.

cc @costrouc

@costrouc
Member Author

At the moment, the infrastructure stage is one stage that passes different input_vars based on the cloud provider. When I see this kind of conditional, it makes me think that these need to be broken out into separate stages, one per cloud provider.

Another example is the cluster_autoscaler, which is specific to AWS but is created within a stage that also creates Traefik resources, NVIDIA installers, and container registries; it might make more sense to have each of these be a separate plugin.

@iameskild I think these are great ideas. This work will make "stages" much cheaper to create, and I think it absolutely makes sense. The question I have here is about stages and how they get selected. In the case you are suggesting, we would have certain stages that only apply when a certain config value is set. How do we want to expose that?

@iameskild
Member

More thoughts on the schema.

If and when we break out the Kubernetes services (i.e. Argo Workflows, conda-store, etc.) and cloud providers into their own plugins, the main schema will need to be modified (namely, removing their sections from the main schema). We might want to make the main schema extensible based on the plugins that are installed. This would require a dynamic schema; pydantic has just the tool for this: create_model.

This gist gives a taste of how this might be accomplished. In the long run this would reduce the size and scope of the main schema to only those components that are used throughout the deployment (name, domain, etc.), and everything else would be the responsibility of the plugin to manage. The InputVars schema we've been including would ultimately become a new section in the config itself.
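
A small sketch of the create_model idea under discussion (the plugin and field names are made up): each installed plugin contributes a pydantic model for its own section, and the main schema is assembled dynamically.

from pydantic import BaseModel, create_model


class CoreConfig(BaseModel):
    # components used throughout the deployment
    project_name: str
    domain: str


class ArgoWorkflowsConfig(BaseModel):  # contributed by a hypothetical plugin
    enabled: bool = True


# plugin section name -> schema for that section of nebari-config.yaml
plugin_schemas = {"argo_workflows": ArgoWorkflowsConfig}

ExtendedConfig = create_model(
    "ExtendedConfig",
    __base__=CoreConfig,
    **{name: (model, model()) for name, model in plugin_schemas.items()},
)

config = ExtendedConfig(project_name="mycluster", domain="example.nebari.dev")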

cc @costrouc

@costrouc
Member Author

costrouc commented Jun 20, 2023

More thoughts on the schema.

If and when we break out the Kubernetes services (i.e. Argo Workflows, conda-store, etc.) and cloud providers into their own plugins, the main schema will need to be modified (namely, removing their sections from the main schema). We might want to make the main schema extensible based on the plugins that are installed. This would require a dynamic schema; pydantic has just the tool for this: create_model.

Thank you for thinking about this! YES, this is something that needs to be addressed. I think it is important that we get this part right from the start. I would really appreciate you doing some more research on that.

My thought was that each stage would have a "reserved" part of the schema, but this looks and feels too verbose.

stages:
  mystage: {}  # arbitrary dict for the given stage name; the stage could pass its own schema

Or each stage could "reserve" several keys at the root level.

mystage: {}
othervalue: 1

And the mystage extension would claim mystage and othervalue as keys that the stage manages. At runtime we check that no stages collide on the keys they manage. This approach more closely matches what we have currently, so it is sort of a merge at the root level.

I'm struggling to think of how the stages could collaborate on what the schema should look like. The last example is the best I could come up with. Totally agree that create_model would be a good way to construct this "global" schema.
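
A quick sketch of the collision check described above, assuming each stage could declare the root-level keys it manages (the managed_keys attribute is hypothetical):

from typing import Dict, List, Type

from nebari.hookspecs import NebariStage


def check_managed_keys(stages: List[Type[NebariStage]]) -> None:
    # fail fast if two stages claim the same root-level config key
    owners: Dict[str, str] = {}
    for stage in stages:
        for key in getattr(stage, "managed_keys", []):  # hypothetical attribute
            if key in owners:
                raise ValueError(
                    f"config key '{key}' is managed by both "
                    f"'{owners[key]}' and '{stage.name}'"
                )
            owners[key] = stage.name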

@dharhas
Member

dharhas commented Jun 20, 2023

How does this work with conda packages?

@dharhas
Member

dharhas commented Jun 20, 2023

And the mystage extension would claim mystage and othervalue as keys that the stage manages. At runtime we check that no stages collide on the keys they manage. This approach more closely matches what we have currently, so it is sort of a merge at the root level.

Wouldn't it make more sense to namespace the keys by the stage name so they can never clash?

@costrouc
Member Author

How does this work with conda packages?

@dharhas there shouldn't be any issue; this uses Python entry points.

Wouldn't it make more sense to namespace the keys by the stage name so they can never clash?

I'm split on this.

Pros:

  • Yes this would guarantee no clashes.
  • a much simpler, more explicit stage <-> schema mapping

Cons:

  • our current schema doesn't do this so we'd have to make a special exception for our current stages
  • more verbose

Backwards compatibility is the main concern for me.

@dharhas
Member

dharhas commented Jun 21, 2023

Backwards compatibility isn't a huge issue for me as long as we have clear messaging and an upgrade path. Our install base is small enough to be manageable and we have relationships with the folks using it.

@costrouc
Member Author

@dharhas I'm fine with us making this move. I think this schema-per-stage change will need to happen in another PR, then. I'd like this PR to focus on not changing anything and locking down the current schema. We can create a PR which makes this move, and I think it will be extremely beneficial.

As of now I have this PR working on all clouds, so now I'd say this PR needs cleanup and checking that the prior features are preserved.

@costrouc costrouc mentioned this pull request Jun 21, 2023
@costrouc costrouc changed the title [WIP] Extension Mechanism Implementation Extension Mechanism Implementation Jun 23, 2023
@costrouc costrouc added status: in review 👀 This PR is currently being reviewed by the team and removed status: in progress 🏗 This task is currently being worked on labels Jun 23, 2023
@pmeier
Member

pmeier commented Jul 10, 2023

I've been thinking quite a bit about this since the initial presentation in one of our syncs. My biggest concern is that the extension mechanism is too flexible. While this is usually a good thing for power users, on the flip side it easily becomes detrimental for regular ones. Since Nebari is about making things easy, I would err on the side of the regular users.

I would propose something that hopefully still covers all use cases, but is a little simpler:

  • Each stage as we currently have it has a fixed input and output schema. It can be replaced by at most one extension that abides by the input and output contracts. If multiple extensions try to replace the same stage, we refuse to deploy and error out.
  • Between each stage, as well as before the first and after the last, we have some hooks. They will be called with the output of the previous stage (the raw Nebari config for the very first) and can (maybe?) modify the values, but not the structure. In contrast to the stage-replacing hooks, we can have an unlimited number of these, i.e. multiple extensions can hook into the same thing.

As it stands, this would limit Nebari to k8s. However, we don't need the full flexibility proposed by this PR to remove that limit again: since the number of deployment targets is fairly low (right now I can think of k8s, HPC, and local), we could just build sets of fixed stages for all the targets we want to support and, within them, go with the scheme I proposed above.
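
One way to picture the second bullet, as a rough sketch (the hook registry and names are invented for illustration): between-stage hooks receive the previous stage's outputs and may tweak values, but adding or removing keys is rejected.

import copy
from typing import Any, Callable, Dict, List

# hypothetical registry: stage name -> hooks contributed by any number of extensions
post_stage_hooks: Dict[str, List[Callable[[Dict[str, Any]], None]]] = {}


def run_post_stage_hooks(stage_name: str, stage_outputs: Dict[str, Any]) -> Dict[str, Any]:
    outputs = copy.deepcopy(stage_outputs)
    keys_before = set(outputs)
    for hook in post_stage_hooks.get(stage_name, []):
        hook(outputs)  # hooks may modify values in place
        if set(outputs) != keys_before:
            # a shallow stand-in for "values but not structure"
            raise RuntimeError(f"hook for '{stage_name}' changed the output structure")
    return outputs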

@costrouc
Member Author

costrouc commented Aug 3, 2023

This took a long time to rebase properly since I had to incorporate #1832 and #1773. The only errors that exist are the ones we currently see in the develop branch: https://github.com/nebari-dev/nebari/actions/runs/5742626558/job/15565133390

@iameskild
Member

The final (hopefully) rebase has been completed.

The Digital Ocean provider tests are failing, and this is expected given that they recently deprecated Kubernetes version 1.24.
The Kubernetes tests are failing, and when I deployed Nebari to AWS, everything worked except for the DNS auto-provisioning.

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Chris Ostrouchov <[email protected]>
Co-authored-by: eskild <[email protected]>
Co-authored-by: Scott Blair <[email protected]>
Member

@iameskild iameskild left a comment

Based on our discussions with the team, we agreed to merge this PR and work through bugs and enhancements as they come up. This issue tracks the current list of enhancements: #1894

@costrouc
Member Author

@aktech I'm going to merge this PR but it will require some work to get it back to running the test_deployment code you wrote.
