Add autoscaling context #42284

Closed
GeneDer wants to merge 13 commits.

Conversation


@GeneDer GeneDer commented Jan 9, 2024

Why are these changes needed?

Create AutoscalingContext to encapsulate the context that an autoscaling policy call runs with (a rough sketch of the intended shape is included below).

  • Also renamed all occurrences of curr_target_num_replicas to current_target_num_replicas and should_autoscale to is_autoscaling_policy_enabled.
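
For reference, a minimal sketch of the kind of context object this PR introduces. The field names are pieced together from the rename above and the review threads below, so treat this as an approximation rather than the exact final API.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional

from ray.serve.config import AutoscalingConfig


@dataclass
class AutoscalingContext:
    """The context for an autoscaling policy call (sketch, not the exact PR code)."""

    # The AutoscalingConfig which the deployment started with.
    config: AutoscalingConfig
    # Identifiers a custom policy can use to look up metrics for this deployment.
    app_name: Optional[str] = None
    deployment_name: Optional[str] = None
    # The target number of replicas from the previous policy decision
    # (renamed from curr_target_num_replicas in this PR).
    current_target_num_replicas: int = 0
    # State of the policy to be used during the call.
    policy_state: Dict[str, Any] = field(default_factory=dict)
    # The timestamp of the last scaling event; None if the deployment has not scaled.
    last_scale_time: Optional[float] = None
```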

Related issue number

Third PR for #41135

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@GeneDer GeneDer self-assigned this Jan 9, 2024
@GeneDer GeneDer marked this pull request as ready for review January 10, 2024 18:38
@GeneDer GeneDer requested a review from a team January 10, 2024 18:38

GeneDer commented Jan 10, 2024

@edoakes @alexeykudinkin I included those naming change suggestions from the previous PR here. Let me know if you prefer to split those out for easier review, but the main focus of this PR should just be adding the AutoscalingContext class.

Comment on lines 40 to 41
app_name: Optional[str] = None,
deployment_name: Optional[str] = None,
Contributor

what're these used for?

Contributor Author

Those are used when we implement the latency-based policy to get the metrics. Users can also use them as IDs to look up other custom metrics.
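
To make that concrete, here is a hedged sketch of how a custom latency-based policy might use these identifiers. get_custom_metric is a hypothetical helper standing in for whatever metrics backend the user queries; it is not part of this PR.

```python
def get_custom_metric(app_name: str, deployment_name: str, metric_name: str) -> float:
    """Hypothetical helper: look up a user-exported metric keyed by app/deployment."""
    # In a real policy this would query Prometheus, a custom exporter, etc.
    return 0.0


def latency_based_policy(ctx: "AutoscalingContext") -> int:
    """Toy policy: scale up by one replica when p99 latency exceeds a threshold."""
    p99_ms = get_custom_metric(ctx.app_name, ctx.deployment_name, "p99_latency_ms")
    if p99_ms > 200.0:
        return ctx.current_target_num_replicas + 1
    return ctx.current_target_num_replicas
```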

"""The context for an autoscaling policy call."""

# The AutoscalingConfig which the deployment started with.
config: AutoscalingConfig
Contributor

I think this should just be a dictionary, right? Other policies will have their own custom config fields.

Contributor Author

Right now the design doesn't allow custom config fields to be passed in through AutoscalingConfig; this is just passing the existing AutoscalingConfig. We can cast it to a dictionary, but I'm not sure how useful that would be.

I think it shouldn't be too bad to add custom configs though, we'd just need to do some kind of setattr on the AutoscalingConfig object and make sure the protobuf serde continues to work. Can be a follow-up if you think this is needed.
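
For context, a small sketch of the "cast it to a dictionary" option discussed here, assuming AutoscalingConfig behaves like a pydantic-style model (as it does in ray.serve.config today); the extra field is purely illustrative and not something the current config supports.

```python
from ray.serve.config import AutoscalingConfig

config = AutoscalingConfig(min_replicas=1, max_replicas=10)

# Hand the policy a plain dict instead of the typed config, so future
# policies could merge in their own custom fields.
policy_config = config.dict()
policy_config["target_p99_latency_ms"] = 200  # illustrative custom field, not a real option
```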

Comment on lines +110 to +111
# State of the policy to be used during the call
policy_state: Dict[str, Any] = field(default_factory=dict)
Contributor

(more general concern, not about this PR specifically)

Note that this field will be lost if the controller crashes (e.g., the head node goes down).

We should either (1) fix this by checkpointing anytime the policy state changes or (2) clearly document this "gotcha"/limitation

Contributor Author

Makes sense, I will add a comment here and make sure to note it in the custom autoscaling doc. Can be a follow-up feature if users require this to be checkpointed.

Contributor

Why can't we persist state in the GCS?

We should totally persist the state of the policy (eventually) so that a head node restart doesn't reset its state and potentially send it haywire.

Contributor Author

Not that we can't, it just feels like a separable component and doing it as a follow-up seemed easier. Will update this PR to include it.
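
As a rough illustration of what persisting this state could look like (not the controller's actual persistence API), here is a sketch assuming a simple durable key-value interface such as a GCS-backed store:

```python
import json
from typing import Any, Dict


class DurableKVStore:
    """Stand-in for whatever durable store the controller checkpoints to (e.g. GCS-backed)."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}

    def put(self, key: str, value: bytes) -> None:
        self._data[key] = value

    def get(self, key: str) -> bytes:
        return self._data.get(key, b"{}")


def checkpoint_policy_state(store: DurableKVStore, deployment_id: str, state: Dict[str, Any]) -> None:
    # Persist the policy's state whenever it changes so a controller restart
    # (e.g. the head node going down) doesn't silently reset it.
    store.put(f"autoscaling_policy_state:{deployment_id}", json.dumps(state).encode())


def restore_policy_state(store: DurableKVStore, deployment_id: str) -> Dict[str, Any]:
    return json.loads(store.get(f"autoscaling_policy_state:{deployment_id}"))
```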

Comment on lines 112 to 113
# The timestamp of the last scaling event. Will be None if the deployment has not scaled.
last_scale_time: Optional[float] = None
Contributor

any reason this shouldn't just be tracked as part of policy_state?

Contributor Author

policy_state was more of an afterthought, will move last_scale_time into policy_state.

Contributor

I think policy_state should be composed of 2 pieces (a rough sketch follows below):

  • A free-form portion (KV store) for the policy to persist arbitrary state
  • Well-defined meta information (last updated time, a bounded history of previous decisions, etc.)
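
A minimal sketch of what that two-piece split could look like; the names here are illustrative, not taken from this PR.

```python
import time
from collections import deque
from dataclasses import dataclass, field
from typing import Any, Deque, Dict, Optional


@dataclass
class PolicyState:
    # Free-form portion: a KV bag the policy can use to persist arbitrary state.
    custom: Dict[str, Any] = field(default_factory=dict)
    # Well-defined meta information maintained by the framework.
    last_updated: Optional[float] = None
    # Bounded history of previous scaling decisions (target replica counts).
    decision_history: Deque[int] = field(default_factory=lambda: deque(maxlen=10))

    def record_decision(self, target_num_replicas: int) -> None:
        self.last_updated = time.time()
        self.decision_history.append(target_num_replicas)
```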

@GeneDer GeneDer added the ray-2.10 and P0 (Issues that should be fixed in short order) labels Jan 25, 2024
@GeneDer GeneDer added the P1 (Issue that should be fixed within a few weeks) label and removed the P0 (Issues that should be fixed in short order) label Feb 6, 2024

GeneDer commented Feb 20, 2024

Closing for now until we figure out a redesign.

@GeneDer GeneDer closed this Feb 20, 2024