Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CLI] add support for cluster management #13835

Merged
merged 51 commits into from
Aug 2, 2022

Conversation

nicolai86
Copy link
Contributor

@nicolai86 nicolai86 commented Jul 25, 2022

What does this PR do?

This PR adds support to manage Bring-Your-Own-Credentials lightning compute clusters.
The initial implementation exposes cluster creation, cluster listing and cluster deletion.

The feature is currently only available as closed preview and requires allow listing by a Lightning AI employees.
I've added two end-to-end tests and a couple of unit tests to ensure the CLI works as expected.

The general flow is aligned with how the lighting CLI works today:

Create clusters

This command creates a BYOC compute cluster. The provider arguments are specific to the only cloud provider we support today, AWS.

lightning create cluster <cluster-name> <cloud-provider-parameters>

list clusters

This command does not take any arguments. It lists clusters available to the current user

lightning list clusters

Delete cluster

The command deletes a BYOC compute cluster.

lightning delete cluster <cluster-name>

Does your PR introduce any breaking changes? If yes, please list them.

No breaking changes

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you list all the breaking changes introduced by this pull request?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or minor internal changes/refactors)

PR review

Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the review guidelines. In short, see the following bullet-list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

nicolai86 added 11 commits July 25, 2022 09:46
I can run this with

```
python -m pytest tests/tests_clusters/test_cluster_lifecycle.py
```

test fails ofc because both the assertions don't work yet
supported commands:
- lightning clusters create
- lightning clusters list
- lightning clusters delete
verified e2e via

```
python -m pytest tests/tests_clusters/test_cluster_lifecycle.py::test_cluster_list
```
I can't validate this just yet because I don't have the feature flag enabled in prod.
@github-actions github-actions bot added the app (removed) Generic label for Lightning App package label Jul 25, 2022
src/lightning_app/cli/cmd_clusters.py Outdated Show resolved Hide resolved
Copy link
Member

@Borda Borda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious if the failing cloud tests are related... @manskx?

manskx
manskx previously approved these changes Jul 29, 2022
@manskx manskx dismissed their stale review July 29, 2022 11:41

by mistake

@manskx
Copy link
Contributor

manskx commented Jul 29, 2022

Curious if the failing cloud tests are related... @manskx?

Seems the tests are failing because it couldn't access the secrets. @Borda

Copy link
Contributor

@manskx manskx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice !!

the e2e tests are not running because the PR is from a fork so the secrets are not shared. TODO

instead of providing a list of default we'll populate this on the backend
@Felonious-Spellfire
Copy link
Contributor

LGTM!

@mergify mergify bot added the ready PRs ready to be merged label Jul 29, 2022
@awaelchli awaelchli added this to the app:0.6 milestone Aug 2, 2022
@tchaton tchaton merged commit 2919dcf into Lightning-AI:master Aug 2, 2022
os.environ.get("LIGHTNING_BYOC_EXTERNAL_ID") is None,
reason="missing LIGHTNING_BYOC_EXTERNAL_ID environment variable",
)
def test_cluster_lifecycle() -> None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR introduced this new tests_clusters directory. But which testing job runs it?

Copy link
Contributor Author

@nicolai86 nicolai86 Aug 11, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These aren't hooked up to our CI today because during manual verification cluster creation took more than one hour. There's an internal ticket tracking hooking this up to a CI - GRID-9821.

I think we can add them to our CI when we add path filtering, so they only run when the cluster CLI is changed to not slow down PR checks. Or alternatively, mark it as optional and just keep it running.
See the ticket as there are other concerns around this as well.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason why these tests are maintained outside the app folder?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@awaelchli there is - they don't execute any apps and don't depend on any apps.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest removing this file for now until they actually get hooked into CI. Not only they can silently break, they also give a false sense of security to unaware contributors.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let me talk to @Borda next week about this; as this is part of our cloud offering we might be able to shift them to the non-public part of our codebase.

@nicolai86 nicolai86 deleted the nicolai86/cluster-management branch August 11, 2022 19:53
@Borda Borda mentioned this pull request Aug 19, 2022
12 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
app (removed) Generic label for Lightning App package priority: 0 High priority task ready PRs ready to be merged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants