Add autoscaling context #42284

Closed
GeneDer wants to merge 13 commits.

Conversation


@GeneDer GeneDer commented Jan 9, 2024

Why are these changes needed?

Create AutoscalingContext to encapsulate the context that an autoscaling policy call runs with (a rough sketch of the intended shape is included below).

  • Also renamed all occurrences of curr_target_num_replicas to current_target_num_replicas and should_autoscale to is_autoscaling_policy_enabled.
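
For reference, a minimal sketch of the kind of context object this PR introduces. The field names are pieced together from the rename above and the review threads below, so treat this as an approximation rather than the exact final API.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional

from ray.serve.config import AutoscalingConfig


@dataclass
class AutoscalingContext:
    """The context for an autoscaling policy call (sketch, not the exact PR code)."""

    # The AutoscalingConfig which the deployment started with.
    config: AutoscalingConfig
    # Identifiers a custom policy can use to look up metrics for this deployment.
    app_name: Optional[str] = None
    deployment_name: Optional[str] = None
    # The target number of replicas from the previous policy decision
    # (renamed from curr_target_num_replicas in this PR).
    current_target_num_replicas: int = 0
    # State of the policy to be used during the call.
    policy_state: Dict[str, Any] = field(default_factory=dict)
    # The timestamp of the last scaling event; None if the deployment has not scaled.
    last_scale_time: Optional[float] = None
```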

Related issue number

Third PR for #41135

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@GeneDer GeneDer self-assigned this Jan 9, 2024
@GeneDer GeneDer marked this pull request as ready for review January 10, 2024 18:38
@GeneDer GeneDer requested a review from a team January 10, 2024 18:38

GeneDer commented Jan 10, 2024

@edoakes @alexeykudinkin I included those naming change suggestions from the previous PR here. Let me know if you prefer to split those out for easier review, but the main focus of this PR should just be adding the AutoscalingContext class.

Comment on lines 40 to 41
app_name: Optional[str] = None,
deployment_name: Optional[str] = None,
Contributor

what're these used for?

Contributor Author

Those are used when we implement the latency-based policy to get the metrics. Users can also use them as IDs to look up other custom metrics.
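
To make that concrete, here is a hedged sketch of how a custom latency-based policy might use these identifiers. get_custom_metric is a hypothetical helper standing in for whatever metrics backend the user queries; it is not part of this PR.

```python
def get_custom_metric(app_name: str, deployment_name: str, metric_name: str) -> float:
    """Hypothetical helper: look up a user-exported metric keyed by app/deployment."""
    # In a real policy this would query Prometheus, a custom exporter, etc.
    return 0.0


def latency_based_policy(ctx: "AutoscalingContext") -> int:
    """Toy policy: scale up by one replica when p99 latency exceeds a threshold."""
    p99_ms = get_custom_metric(ctx.app_name, ctx.deployment_name, "p99_latency_ms")
    if p99_ms > 200.0:
        return ctx.current_target_num_replicas + 1
    return ctx.current_target_num_replicas
```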

"""The context for an autoscaling policy call."""

# The AutoscalingConfig which the deployment started with.
config: AutoscalingConfig
Contributor

I think this should just be a dictionary, right? Other policies will have their own custom config fields.

Contributor Author

Right now the design doesn't allow custom config fields to be passed in through AutoscalingConfig; this is just passing the existing AutoscalingConfig. We can cast it to a dictionary, but I'm not sure how useful that would be.

I think it shouldn't be too bad to add custom configs though, we'd just need to do some kind of setattr on the AutoscalingConfig object and make sure the protobuf serde continues to work. Can be a follow-up if you think this is needed.
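
For context, a small sketch of the "cast it to a dictionary" option discussed here, assuming AutoscalingConfig behaves like a pydantic-style model (as it does in ray.serve.config today); the extra field is purely illustrative and not something the current config supports.

```python
from ray.serve.config import AutoscalingConfig

config = AutoscalingConfig(min_replicas=1, max_replicas=10)

# Hand the policy a plain dict instead of the typed config, so future
# policies could merge in their own custom fields.
policy_config = config.dict()
policy_config["target_p99_latency_ms"] = 200  # illustrative custom field, not a real option
```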

Comment on lines +110 to +111
# State of the policy to be used during the call
policy_state: Dict[str, Any] = field(default_factory=dict)
Contributor

(more general concern, not about this PR specifically)

Note that this field will be lost if the controller crashes (e.g., the head node goes down).

We should either (1) fix this by checkpointing anytime the policy state changes or (2) clearly document this "gotcha"/limitation

Contributor Author

Makes sense, I will add a comment here and make sure to note it in the custom autoscaling doc. Can be a follow-up feature if users require this to be checkpointed.

Contributor

Why can't we persist state in the GCS?

We should totally persist the state of the policy (eventually) so that a head node restart doesn't reset its state and potentially send it haywire.

Contributor Author

Not that we can't, it just feels like a separable component and doing it as a follow-up seemed easier. Will update this PR to include it.
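
As a rough illustration of what persisting this state could look like (not the controller's actual persistence API), here is a sketch assuming a simple durable key-value interface such as a GCS-backed store:

```python
import json
from typing import Any, Dict


class DurableKVStore:
    """Stand-in for whatever durable store the controller checkpoints to (e.g. GCS-backed)."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}

    def put(self, key: str, value: bytes) -> None:
        self._data[key] = value

    def get(self, key: str) -> bytes:
        return self._data.get(key, b"{}")


def checkpoint_policy_state(store: DurableKVStore, deployment_id: str, state: Dict[str, Any]) -> None:
    # Persist the policy's state whenever it changes so a controller restart
    # (e.g. the head node going down) doesn't silently reset it.
    store.put(f"autoscaling_policy_state:{deployment_id}", json.dumps(state).encode())


def restore_policy_state(store: DurableKVStore, deployment_id: str) -> Dict[str, Any]:
    return json.loads(store.get(f"autoscaling_policy_state:{deployment_id}"))
```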

Comment on lines 112 to 113
# The timestamp of the last scaling event. Will be None if the deployment has not scaled.
last_scale_time: Optional[float] = None
Contributor

any reason this shouldn't just be tracked as part of policy_state?

Contributor Author

policy_state was more of an afterthought, will move last_scale_time into policy_state.

Contributor

I think policy_state should be composed of 2 pieces (a rough sketch follows below):

  • A free-form portion (KV store) for the policy to persist arbitrary state
  • Well-defined meta information (last updated time, a bounded history of previous decisions, etc.)
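
A minimal sketch of what that two-piece split could look like; the names here are illustrative, not taken from this PR.

```python
import time
from collections import deque
from dataclasses import dataclass, field
from typing import Any, Deque, Dict, Optional


@dataclass
class PolicyState:
    # Free-form portion: a KV bag the policy can use to persist arbitrary state.
    custom: Dict[str, Any] = field(default_factory=dict)
    # Well-defined meta information maintained by the framework.
    last_updated: Optional[float] = None
    # Bounded history of previous scaling decisions (target replica counts).
    decision_history: Deque[int] = field(default_factory=lambda: deque(maxlen=10))

    def record_decision(self, target_num_replicas: int) -> None:
        self.last_updated = time.time()
        self.decision_history.append(target_num_replicas)
```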

@GeneDer GeneDer added the ray-2.10 and P0 (Issues that should be fixed in short order) labels Jan 25, 2024
@GeneDer GeneDer added the P1 (Issue that should be fixed within a few weeks) label and removed the P0 (Issues that should be fixed in short order) label Feb 6, 2024

GeneDer commented Feb 20, 2024

Closing for now until we figure out a redesign.

@GeneDer GeneDer closed this Feb 20, 2024