Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cluster cleanup #246

Merged
merged 4 commits into from
Jun 5, 2024
Merged

Cluster cleanup #246

merged 4 commits into from
Jun 5, 2024

Conversation

approxit
Copy link
Contributor

@approxit approxit commented Jun 4, 2024

What I've done:

  • Made that empty Cluster instead of being kept in the memory, it could be auto-removed after a specific time or be overridden by a new cluster before that timeout. This helps with consecutive ray down and ray up.
  • Reworked on_stop events in Cluster, ClusterNode, and `ClusterNodeSidecar. They spawn separate asyncio tasks by default.
  • Moved monitor check failed logic from particular sidecar to its parent to better separation of concerns.
  • Fixed default head node priority_subnet_tag not being correctly handled.
  • Moved site packages installed from the system level to be installed in virtualenv that is copied over to golem volume. It can take around 2 minutes for some providers, but it is necessary to have pip working while we wait for more functionality in golem volumes.
  • Bumped total_budget from 1 to 5.
  • Removed some leftovers from workflows

Notable remarks:

  • Consecutive ray up with the same name but with different configs results in ray-specific errors, we decided that currently, we will not implement any additional warnings/guards about that.

@approxit approxit changed the base branch from main to release/0.11 June 4, 2024 09:54
golem-cluster.yaml Show resolved Hide resolved
async def _on_cluster_stop(self, cluster: Cluster) -> None:
cluster_name = str(cluster)

logger.info(f"Cluster `%s` stopped, it will be removed after 30 seconds", cluster_name)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please use the defined timeout constant instead of hardcoding the text

Copy link
Contributor

@shadeofblue shadeofblue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. please apply my comment, though

@approxit approxit merged commit dc80425 into release/0.11 Jun 5, 2024
8 checks passed
@approxit approxit deleted the approxit/cluster-cleanup branch June 5, 2024 15:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants