Extension Mechanism Implementation #1833

Merged: 147 commits, Aug 11, 2023
Conversation

costrouc
Member

@costrouc costrouc commented Jun 13, 2023

Closes nebari-dev/governance#35
Closes #865
Closes #1046

Introduction

This is an initial POC of the extension mechanism. The goal is to provide a public API, built on a pluggy interface, for managing stages and subcommands. With this work, the aim is that the majority of Nebari itself becomes a collection of extensions.

Nebari can finally be oblivious to the stages that are being run and have a much smaller core.

Emphasis on pydantic schema for nebari-config.yaml

The following is now a valid nebari-config.yaml. Significant work has been put into nebari.schema, which will provide much better error messages than before. The hope is that nebari.schema becomes THE way to know what configuration is valid.

project_name: mycluster
domain: example.nebari.dev
provider: aws
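
As a rough illustration of this schema-first direction (not the confirmed public API; the top-level model name Main and the exact import here are guesses), validating such a file could look something like:

import yaml
import pydantic

from nebari import schema

with open("nebari-config.yaml") as f:
    raw_config = yaml.safe_load(f)

try:
    # "Main" is a placeholder for whatever top-level pydantic model nebari.schema exposes
    config = schema.Main(**raw_config)
except pydantic.ValidationError as e:
    # pydantic reports every invalid or missing field at once,
    # which is where the improved error messages come from
    print(e)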

How will developers use the extension mechanism?

There are several ways developers will extend Nebari.

Via a pip-installed package that declares a nebari entry point, which auto-registers the plugins in the referenced module:

[project.entry-points.nebari]
mynebariextension = "mynebari.pluggy.plugin"
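
For context, the module referenced by that entry point might look roughly like the stage example further down; a minimal, hypothetical sketch (module path, stage name, and priority are illustrative):

# mynebari/pluggy/plugin.py -- the module named in the entry point above
from typing import List

from nebari.hookspecs import NebariStage, hookimpl


class MyStage(NebariStage):
    name = "90-my-extension"  # illustrative stage name
    priority = 90             # illustrative priority


@hookimpl
def nebari_stage() -> List[NebariStage]:
    # auto-registered once the package is pip installed,
    # thanks to the [project.entry-points.nebari] entry above
    return [MyStage]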

Via a command-line option to load a plugin module; multiple --import-plugin flags can be used:

nebari --import-plugin=mynebari.pluggy.plugin --import-plugin=./path/to/plugin.py deploy ...

Currently we have two plugin hooks, mentioned below and worth reviewing: nebari.hookspecs.NebariStage and nebari.hookspecs.Subcommand. Subcommand is fairly straightforward and you can see several examples in _nebari.subcommands. The stages are a bit more complicated, with many examples in _nebari.stages.*. The important thing is that Nebari now fully embraces extensions, and we use them internally ourselves.

You can easily replace a stage by specifying a stage with the same name but a higher priority, e.g.:

from typing import List

from nebari.hookspecs import NebariStage, hookimpl


class MyNewInfrastructureStage(NebariStage):
    name = "02-infrastructure"
    priority = 21


@hookimpl
def nebari_stage() -> List[NebariStage]:
    return [MyNewInfrastructureStage]

Currently there are the classes NebariStage and NebariTerraformStage, but it would be easy to add additional base classes. The goal is to make extending the stages as easy as possible. There has already been conversation about a NebariTerragruntStage.
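
As a rough sketch of what a Terraform-backed stage might look like (input_vars is mentioned later in this thread; the exact signature and the other details here are guesses, not the confirmed interface):

from typing import Any, Dict, List

from nebari.hookspecs import hookimpl
from _nebari.stages.base import NebariTerraformStage


class MyTerraformStage(NebariTerraformStage):
    name = "08-my-terraform-stage"  # illustrative
    priority = 80                   # illustrative

    def input_vars(self, stage_outputs: Dict[str, Any]) -> Dict[str, Any]:
        # values handed to Terraform as variables; validating these with
        # pydantic is one of the follow-ups discussed below
        return {"project_name": self.config.project_name}


@hookimpl
def nebari_stage() -> List[NebariTerraformStage]:
    return [MyTerraformStage]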

Key interfaces to Review

  • nebari.schema is now our public view and source of truth for the Nebari configuration
  • nebari.hookspecs.NebariStage is the base class for all stages
  • nebari.hookspecs.Subcommand exposes a hook for adding arbitrary Typer subcommands; we follow a pattern similar to datasette and conda here
  • _nebari.stages.base.NebariTerraformStage is a subclass of NebariStage which implements convenience features that allow Terraform stages to be more concise. A majority of the stages use this class.

Render, deploy, and destroy logic is significantly simpler. For example, here is the deploy logic now:

    stage_outputs = {}
    with contextlib.ExitStack() as stack:
        for stage in get_available_stages():
            s = stage(output_directory=pathlib.Path("."), config=config)
            # each stage's deploy() is a context manager; ExitStack keeps them
            # all open and exits them in reverse order when the block ends
            stack.enter_context(s.deploy(stage_outputs))

            if not disable_checks:
                s.check(stage_outputs)

Progress

Progress towards prior functionality

  • cli command init
  • cli command guided init
  • cli command dev
  • cli command validate
  • cli command keycloak
  • cli command support
  • cli command upgrade
  • working local provider render, deploy, destroy
  • working existing provider render, deploy, destroy
  • working gcp provider render, deploy, destroy
  • working azure provider render, deploy, destroy
  • working aws provider render, deploy, destroy
  • working do provider render, deploy, destroy

Feature ideas

  • All stages refactored into nebari.hookspecs.NebariStage
  • All commands refactored into nebari.hookspecs.Subcommand
  • Stages can be ordered via priority
  • Stages can be replaced via name (currently if a stage has higher priority and same name it will replace the lower priority stages)
  • schema enforcement throughout and heavy use of defaults
  • input vars in NebariTerraformStages should be validated via pydantic
  • schema descriptions for all fields
  • plugins can be loaded via entrypoints
  • plugin modules can be loaded via the cli option --import-plugin (e.g. _nebari.stages.bootstrap)
  • easy cli way of filtering stages (e.g. include stages, exclude stages etc.)
  • pre_* and post_* plugins to allow for arbitrary code before and after render, deploy, destroy
  • stages that only apply for certain configuration conditions e.g. provider = aws
  • model works such that z2jh, nebari, and tljh could all be deployed with nebari
  • disable prompt and kubernetes ingress/load balancer flag in stage

@costrouc costrouc requested a review from iameskild June 13, 2023 23:33
@costrouc
Member Author

costrouc commented Jun 14, 2023

From discussion with @iameskild we need significant control over how and when certain stages are chosen. For example here are some use cases that need to be considered:

  • How does a user replace a given stage(s)? (What happens when 3+ plugin hooks all say they want to replace the stage)
  • How at runtime do we allow certain stages to be excluded from the run?

A proposal (from me) is to use priority/name.

  • pytest uses tags/markers (maybe a good idea?)
  • dependency DAG? Not sure how this would be implemented

@iameskild is suggesting that a plugin hook be added which will modify the ordering of the hooks.

There should be a pre/post hook for check; it might also be nice to have them for render/deploy/destroy.

def pre_check(stage_name: str):
    ...  # run this before the stage

There should be a way to have a schema for that given stage as input and output.

Schemas everywhere. @iameskild mentioned that the NebariTerraformStage should have validation on internal methods e.g. input_vars.

Some next steps:

Subcommand built in to read an arbitrary Python module: nebari --import-module asd.qwer.mystage render should be possible.

We can give back to the JupyterHub / Zero to JupyterHub community by offering an easy z2jh install. There should be a way to explicitly override the given stages. This shows the importance of --include-stages and --exclude-stages.

Devtools to help us see what is going on before and after stages. How could we do this?

def post_check(name='infrastructure'):
    breakpoint()

@pavithraes pavithraes added type: enhancement 💅🏼 New feature or request status: in progress 🏗 This task is currently being worked on impact: high 🟥 This issue affects most of the nebari users or is a critical issue labels Jun 14, 2023
@iameskild
Member

@costrouc here is a basic POC of what an outer priority manager plugin object might do. If you change the priority level of MyStage1 relative to YourStage1, the order will switch.

From here we can implement more complex logic that handles replacing plugins for the same stage, resolves conflicts if the priority level is the same, etc.
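
A minimal sketch of the kind of resolution logic being discussed (not the actual implementation): keep the highest-priority stage for each name, then run stages in ascending priority order.

from typing import Dict, List, Type

from nebari.hookspecs import NebariStage


def resolve_stages(stages: List[Type[NebariStage]]) -> List[Type[NebariStage]]:
    # a stage with the same name but a higher priority replaces the lower one
    selected: Dict[str, Type[NebariStage]] = {}
    for stage in stages:
        existing = selected.get(stage.name)
        if existing is None or stage.priority > existing.priority:
            selected[stage.name] = stage
    # stages run in ascending priority order
    return sorted(selected.values(), key=lambda stage: stage.priority)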

@costrouc
Member Author

Thanks @iameskild, check out https://github.com/nebari-dev/nebari/blob/extension-mechanism/src/_nebari/stages/base.py#L89-L106 for the basic implementation here. I don't have plans to make any additions to this function, so I definitely think we should add some of the capabilities you mentioned, such as being able to filter stages.

I've done the initial work (still somewhat broken) of moving all commands to a nebari_subcommand hookspec.

Also the stages seem to sorta work. I've been able to successfully run the render and validate command.

Also spent a lot of time working on the schema (still needs a lot more work) and removing all config['...'].get(...) and replacing with config.value.key since with schema validation we have guarantees on the schema.

@costrouc
Member Author

Further work: I was able to get about midway through a local deployment (up to the Kubernetes ingress).

@iameskild
Member

iameskild commented Jun 15, 2023

As I've been working on the schema (the main schema and the ones specific to the InputVars), it got me thinking that it might make more logical sense to make stages (plugins) that are specific to cloud providers as well.

At the moment, the infrastructure stage is one stage that passes different input_vars based on the cloud provider. When I see this kind of conditional, it makes me think that these need to be broken out into separate stages, one per cloud provider.

Another example is the cluster_autoscaler, which is specific to AWS but is created within a stage that also creates Traefik resources, NVIDIA installers, and container registries; it might make more sense to have each of these be a separate plugin.

Although more work upfront, I see the benefits in the future when others want to (natively) support new cloud providers. Another possible downside to this approach is that the number of current plugins will likely grow quite quickly which might reduce the speed with which we can deploy the whole cluster (given the overhead of each terraform init, terraform plan required for each plugin). That said, I see the benefits being that each stage (plugin) only performs one specific job and can be more easily swapped out should we need or want to in the future.

The tricky part of this entire extension-mechanism work seems to be how we handle prioritizing and swapping out plugins.

cc @costrouc

@costrouc
Member Author

At the moment, the infrastructure stage is one stage that passes different input_vars based on the cloud provider. When I see this kind of conditional, it makes me think that these need to be broken out into separate stages, one per cloud provider.

Another example is the cluster_autoscaler, which is specific to AWS but is created within a stage that also creates Traefik resources, NVIDIA installers, and container registries; it might make more sense to have each of these be a separate plugin.

@iameskild I think these are great ideas. This work will make "stages" much cheaper to create, and I think it absolutely makes sense. The question I have here is about stages and how they get selected. In the case you are suggesting, we would have certain stages that only apply when a certain config value is set. How do we want to expose that?

@iameskild
Member

More thoughts on the schema.

If and when we break out the Kubernetes services (i.e. Argo Workflows, conda-store, etc.) and cloud providers into their own plugins, the main schema will need to be modified (namely, removing their sections from the main schema). We might want to make the main schema extensible based on the plugins that are installed. This would require a dynamic schema; pydantic has just the tool for this: create_model.

This gist gives a taste of how this might be accomplished. In the long run this would reduce the size and scope of the main schema to only those components that are used throughout the deployment (name, domain, etc.), and everything else would be the responsibility of the plugin to manage. The InputVars schema we've been including would ultimately become a new section in the config itself.
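
A small sketch of the create_model idea under discussion (the plugin and field names are made up): each installed plugin contributes a pydantic model for its own section, and the main schema is assembled dynamically.

from pydantic import BaseModel, create_model


class CoreConfig(BaseModel):
    # components used throughout the deployment
    project_name: str
    domain: str


class ArgoWorkflowsConfig(BaseModel):  # contributed by a hypothetical plugin
    enabled: bool = True


# plugin section name -> schema for that section of nebari-config.yaml
plugin_schemas = {"argo_workflows": ArgoWorkflowsConfig}

ExtendedConfig = create_model(
    "ExtendedConfig",
    __base__=CoreConfig,
    **{name: (model, model()) for name, model in plugin_schemas.items()},
)

config = ExtendedConfig(project_name="mycluster", domain="example.nebari.dev")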

cc @costrouc

@costrouc
Member Author

costrouc commented Jun 20, 2023

More thoughts on the schema.

If and when we break out the Kubernetes services (i.e. Argo Workflows, conda-store, etc.) and cloud providers into their own plugins, the main schema will need to be modified (namely, removing their sections from the main schema). We might want to make the main schema extensible based on the plugins that are installed. This would require a dynamic schema; pydantic has just the tool for this: create_model.

Thank you for thinking about this! YES, this is something that needs to be addressed. I think it is important that we get this part right from the start. I would really appreciate you doing some more research on that.

My thought was that each stage would have a "reserved" part of the schema, but this looks and feels too verbose.

stages:
  mystage: {}  # arbitrary dict for the given stage name; the stage could pass its own schema

Or each stage could "reserve" several keys at the root level.

mystage: {}
othervalue: 1

And the mystage extension would claim mystage and othervalue as keys that the stage manages. At runtime we check that no stages collide on the keys they manage. This approach more closely matches what we have currently, so it is sort of a merge at the root level.

I'm struggling to think of how the stages could collaborate on what the schema should look like. The last example is the best I could come up with. Totally agree that create_model would be a good way to construct this "global" schema.
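
A quick sketch of the collision check described above, assuming each stage could declare the root-level keys it manages (the managed_keys attribute is hypothetical):

from typing import Dict, List, Type

from nebari.hookspecs import NebariStage


def check_managed_keys(stages: List[Type[NebariStage]]) -> None:
    # fail fast if two stages claim the same root-level config key
    owners: Dict[str, str] = {}
    for stage in stages:
        for key in getattr(stage, "managed_keys", []):  # hypothetical attribute
            if key in owners:
                raise ValueError(
                    f"config key '{key}' is managed by both "
                    f"'{owners[key]}' and '{stage.name}'"
                )
            owners[key] = stage.name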

@dharhas
Member

dharhas commented Jun 20, 2023

How does this work with conda packages?

@dharhas
Member

dharhas commented Jun 20, 2023

And the mystage extension would claim mystage and othervalue as keys that the stage manages. At runtime we check that no stages collide on the keys they manage. This approach more closely matches what we have currently, so it is sort of a merge at the root level.

Wouldn't it make more sense to namespace the keys by the stage name so they can never clash?

@costrouc
Member Author

How does this work with conda packages?

@dharhas there shouldn't be any issue; this uses Python entry points.

Wouldn't it make more sense to namespace the keys by the stage name so they can never clash?

I'm split on this.

Pros:

  • Yes this would guarantee no clashes.
  • a much simpler, more explicit stage <-> schema mapping

Cons:

  • our current schema doesn't do this so we'd have to make a special exception for our current stages
  • more verbose

Backwards compatibility is the main concern for me.

@dharhas
Member

dharhas commented Jun 21, 2023

Backwards compatibility isn't a huge issue for me as long as we have clear messaging and an upgrade path. Our install base is small enough to be manageable and we have relationships with the folks using it.

@costrouc
Member Author

@dharhas I'm fine with us making this move. I think this schema-per-stage change will need to happen in another PR, then. I'd like this PR to focus on not changing anything and locking down the current schema. We can create a PR which makes this move, and I think it will be extremely beneficial.

As of now I have this PR working on all clouds, so now I'd say this PR needs cleanup and checking that the prior features are preserved.

@costrouc costrouc mentioned this pull request Jun 21, 2023
@costrouc costrouc changed the title [WIP] Extension Mechanism Implementation Extension Mechanism Implementation Jun 23, 2023
@costrouc costrouc added status: in review 👀 This PR is currently being reviewed by the team and removed status: in progress 🏗 This task is currently being worked on labels Jun 23, 2023
@pmeier
Member

pmeier commented Jul 10, 2023

I've been thinking quite a bit about this since the initial presentation in one of our syncs. My biggest concern is that the extension mechanism is too flexible. While this is usually a good thing for power users, on the flip side it easily becomes detrimental for regular ones. Since Nebari is about making things easy, I would err on the side of the regular users.

I would propose something that hopefully still covers all use cases, but is a little simpler:

  • Each stage as we currently have it has a fixed input and output schema. It can be replaced by at most one extension that abides by the input and output contracts. If multiple extensions try to replace the same stage, we refuse to deploy and error out.
  • Between each stage, as well as before the first and after the last, we have some hooks. They will be called with the output of the previous stage (the raw Nebari config for the very first) and can (maybe?) modify the values, but not the structure. In contrast to the stage-replacing hooks, we can have an unlimited number of these, i.e. multiple extensions can hook into the same thing.

As it stands, this would limit Nebari to k8s. However, we don't need the full flexibility proposed by this PR to remove that limit again: since the number of deployment targets is fairly low (right now I can think of k8s, HPC, and local), we could just build sets of fixed stages for all the targets we want to support and, within them, go with the scheme I proposed above.
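
One way to picture the second bullet, as a rough sketch (the hook registry and names are invented for illustration): between-stage hooks receive the previous stage's outputs and may tweak values, but adding or removing keys is rejected.

import copy
from typing import Any, Callable, Dict, List

# hypothetical registry: stage name -> hooks contributed by any number of extensions
post_stage_hooks: Dict[str, List[Callable[[Dict[str, Any]], None]]] = {}


def run_post_stage_hooks(stage_name: str, stage_outputs: Dict[str, Any]) -> Dict[str, Any]:
    outputs = copy.deepcopy(stage_outputs)
    keys_before = set(outputs)
    for hook in post_stage_hooks.get(stage_name, []):
        hook(outputs)  # hooks may modify values in place
        if set(outputs) != keys_before:
            # a shallow stand-in for "values but not structure"
            raise RuntimeError(f"hook for '{stage_name}' changed the output structure")
    return outputs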

@costrouc
Member Author

costrouc commented Aug 3, 2023

This took a long time to rebase properly since I had to incorporate #1832 and #1773. The only errors that exist are the ones we currently see in the develop branch: https://github.com/nebari-dev/nebari/actions/runs/5742626558/job/15565133390

@iameskild
Member

The final (hopefully) rebase has been completed.

The Digital Ocean provider tests are failing, and this is expected given that they recently deprecated Kubernetes version 1.24.
The Kubernetes tests are failing, and when I deployed Nebari to AWS, everything worked except for the DNS auto-provisioning.

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Chris Ostrouchov <[email protected]>
Co-authored-by: eskild <[email protected]>
Co-authored-by: Scott Blair <[email protected]>
Member

@iameskild iameskild left a comment

Based on our discussions with the team, we agreed to merge this PR and work through bugs and enhancements as they come up. This issue tracks the current list of enhancements: #1894

@costrouc
Member Author

@aktech I'm going to merge this PR but it will require some work to get it back to running the test_deployment code you wrote.
