[CLI] add support for cluster management #13835
I can run this with ``` python -m pytest tests/tests_clusters/test_cluster_lifecycle.py ```. The test fails, of course, because neither of the assertions works yet.
supported commands:
- lightning clusters create
- lightning clusters list
- lightning clusters delete
verified e2e via ``` python -m pytest tests/tests_clusters/test_cluster_lifecycle.py::test_cluster_list ```
I can't validate this just yet because I don't have the feature flag enabled in prod.
Co-authored-by: Laverne Henderson <[email protected]>
for more information, see https://pre-commit.ci
Curious if the failing cloud tests are related... @manskx?
Nice !!
the e2e tests are not running because the PR is from a fork so the secrets are not shared. TODO
instead of providing a list of defaults, we'll populate this on the backend
LGTM!
    os.environ.get("LIGHTNING_BYOC_EXTERNAL_ID") is None,
    reason="missing LIGHTNING_BYOC_EXTERNAL_ID environment variable",
)
def test_cluster_lifecycle() -> None:
This PR introduced this new tests_clusters directory. But which testing job runs it?
These aren't hooked up to our CI today because during manual verification cluster creation took more than one hour. There's an internal ticket tracking hooking this up to a CI - GRID-9821.
I think we can add them to our CI when we add path filtering, so they only run when the cluster CLI is changed to not slow down PR checks. Or alternatively, mark it as optional and just keep it running.
See the ticket as there are other concerns around this as well.
Is there a reason why these tests are maintained outside the app folder?
@awaelchli there is - they don't execute any apps and don't depend on any apps.
I would suggest removing this file for now until the tests actually get hooked into CI. Not only can they silently break, they also give a false sense of security to unaware contributors.
let me talk to @Borda next week about this; as this is part of our cloud offering we might be able to shift them to the non-public part of our codebase.
What does this PR do?
This PR adds support for managing Bring-Your-Own-Credentials (BYOC) Lightning compute clusters.
The initial implementation exposes cluster creation, cluster listing, and cluster deletion.
The feature is currently available only as a closed preview and requires allowlisting by a Lightning AI employee.
I've added two end-to-end tests and a couple of unit tests to ensure the CLI works as expected.
The general flow is aligned with how the `lightning` CLI works today:
Create clusters
This command creates a BYOC compute cluster. The provider arguments are specific to the only cloud provider we support today, AWS.
List clusters
This command does not take any arguments. It lists the clusters available to the current user.
Delete cluster
The command deletes a BYOC compute cluster.
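As a rough illustration of the command layout described above, here is a minimal sketch built with the standard library's `argparse`. It is not the implementation in this PR: the framework is swapped in for illustration, and the `cluster_name` positional and the `--provider` flag are hypothetical names invented for this example. Only the `lightning clusters create/list/delete` command names come from the PR.

```python
import argparse
from typing import Sequence


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors `lightning clusters {create,list,delete}`."""
    parser = argparse.ArgumentParser(prog="lightning")
    groups = parser.add_subparsers(dest="group", required=True)

    clusters = groups.add_parser("clusters", help="manage BYOC compute clusters")
    actions = clusters.add_subparsers(dest="action", required=True)

    # create: a cluster name plus provider-specific arguments.
    # Both argument names below are hypothetical, not taken from the PR.
    create = actions.add_parser("create", help="create a BYOC compute cluster")
    create.add_argument("cluster_name")
    create.add_argument("--provider", choices=["aws"], default="aws")

    # list: takes no arguments, as described in the PR.
    actions.add_parser("list", help="list clusters available to the current user")

    # delete: identified here by a hypothetical cluster name positional.
    delete = actions.add_parser("delete", help="delete a BYOC compute cluster")
    delete.add_argument("cluster_name")

    return parser


def describe(argv: Sequence[str]) -> str:
    """Parse argv and return a human-readable summary of the requested action."""
    args = build_parser().parse_args(list(argv))
    if args.action == "create":
        return f"create cluster {args.cluster_name!r} on {args.provider}"
    if args.action == "list":
        return "list clusters"
    return f"delete cluster {args.cluster_name!r}"


if __name__ == "__main__":
    print(describe(["clusters", "create", "demo"]))
    print(describe(["clusters", "list"]))
    print(describe(["clusters", "delete", "demo"]))
```

Nesting the three actions under a single `clusters` group keeps cluster management separate from the rest of the CLI, and restricting the provider choice to `aws` reflects that AWS is the only provider supported today.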
Does your PR introduce any breaking changes? If yes, please list them.
No breaking changes
Before submitting
PR review
Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the review guidelines.
Did you have fun?
Make sure you had fun coding 🙃