Add autoscaling context #42284
Signed-off-by: Gene Su <[email protected]>
@edoakes @alexeykudinkin I included the naming-change suggestions from the previous PR here. Let me know if you prefer to split those out for easier review. But the main focus of this PR should just be adding the |
app_name: Optional[str] = None,
deployment_name: Optional[str] = None,
What are these used for?
These will be used to fetch metrics once we implement the latency-based policy. Users can also use them as IDs to fetch other custom metrics.
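As a rough illustration of the reply above, the two names could serve as a lookup key for per-deployment metrics. This is only a sketch: `ContextSketch`, `LATENCY_MS`, and `lookup_latency_ms` are hypothetical names made up for this example and are not part of the PR or of Ray Serve.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class ContextSketch:
    # Hypothetical stand-in for the context; only the two fields discussed here.
    app_name: Optional[str] = None
    deployment_name: Optional[str] = None


# Hypothetical custom metrics store keyed by (app_name, deployment_name).
LATENCY_MS: Dict[Tuple[Optional[str], Optional[str]], float] = {
    ("shop", "checkout"): 180.0,
}


def lookup_latency_ms(ctx: ContextSketch, default: float = 0.0) -> float:
    """Return the latency recorded for the deployment the context identifies."""
    return LATENCY_MS.get((ctx.app_name, ctx.deployment_name), default)
```

A policy would call `lookup_latency_ms(ctx)` with the context it receives. Unknown deployments fall back to `default` instead of raising.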
"""The context for an autoscaling policy call."""
# The AutoscalingConfig which the deployment started with.
config: AutoscalingConfig
I think this should just be a dictionary, right? Other policies will have their own custom config fields.
Right now the design doesn't allow custom config fields to be passed in on AutoscalingConfig. This config is just the existing AutoscalingConfig passed through. We can cast it to a dictionary, but I'm not sure how useful that would be.
I think it shouldn't be too bad to add custom configs though: we'd just need to do some kind of setattr on the AutoscalingConfig object and make sure the protobuf serdes continues to work. This can be a follow-up if you think it's needed.
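For reference, one way the custom-config idea could look is a typed config carrying a free-form dictionary that survives a serialization round trip. This is a sketch under assumptions, not the PR's design: `ConfigSketch` and its fields are hypothetical, it uses a dictionary field rather than the setattr approach mentioned above, and JSON stands in for the protobuf serdes.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict


@dataclass
class ConfigSketch:
    # Hypothetical stand-in for AutoscalingConfig with two illustrative typed fields.
    min_replicas: int = 1
    max_replicas: int = 1
    # Free-form fields for custom policies (an assumption, not in the PR).
    custom: Dict[str, Any] = field(default_factory=dict)

    def to_json(self) -> str:
        """Serialize all fields, including the custom ones."""
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, payload: str) -> "ConfigSketch":
        """Rebuild the config from its serialized form."""
        return cls(**json.loads(payload))
```

The round trip only works here because the custom values are JSON-serializable. A protobuf-backed version would need an equivalent constraint on what custom fields may hold.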
# State of the policy to be used during the call
policy_state: Dict[str, Any] = field(default_factory=dict)
(more general concern, not about this PR specifically)
Note that this field will be lost if the controller crashes (e.g., the head node goes down).
We should either (1) fix this by checkpointing whenever the policy state changes or (2) clearly document this "gotcha"/limitation.
Makes sense. I will add a comment here and make sure to note it in the custom autoscaling doc. This can be a follow-up feature if users need the state to be checkpointed.
Why can't we persist state in the GCS?
We should totally persist the state of the policy (eventually) so that a head node restart doesn't reset its state, potentially sending it haywire.
It's not that we can't. I just felt it was a separable component that would be easier to do as a follow-up. I will update this PR to include it.
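The checkpointing option raised in this thread could be sketched roughly as follows: write the state to a durable store whenever it changes, and read it back on startup. All names here are hypothetical. `InMemoryKV` merely stands in for a durable store such as the GCS, and the sketch assumes the state is JSON-serializable.

```python
import json
from typing import Any, Dict, Optional


class InMemoryKV:
    """Hypothetical stand-in for a durable key-value store such as the GCS."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}

    def put(self, key: str, value: bytes) -> None:
        self._data[key] = value

    def get(self, key: str) -> Optional[bytes]:
        return self._data.get(key)


class CheckpointedPolicyState:
    """Writes the policy state to the store whenever it changes."""

    def __init__(self, kv: InMemoryKV, key: str) -> None:
        self._kv = kv
        self._key = key
        raw = kv.get(key)
        # Recover the previous state after a restart, or start empty.
        self._state: Dict[str, Any] = json.loads(raw) if raw is not None else {}

    def get(self) -> Dict[str, Any]:
        return dict(self._state)  # shallow copy so callers cannot skip the checkpoint

    def update(self, new_state: Dict[str, Any]) -> None:
        if new_state == self._state:
            return  # nothing changed, so skip the write
        self._state = dict(new_state)
        self._kv.put(self._key, json.dumps(self._state).encode("utf-8"))
```

Constructing a second `CheckpointedPolicyState` over the same store and key simulates a controller restart: it picks up the last written state instead of starting from an empty dictionary.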
# The timestamp of the last scaling event. Will be None if the deployment has not scaled.
last_scale_time: Optional[float] = None
Any reason this shouldn't just be tracked as part of policy_state?
policy_state was more of an afterthought. I will move last_scale_time into policy_state.
I think the policy state should consist of two pieces:
- Free form portion (KV store) for policy to persist arbitrary state
- Well-defined meta information (last updated time, bounded history of previous decisions, etc)
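The two-piece split proposed above could be sketched like this. The class and field names are hypothetical, and the history bound of 10 is an arbitrary choice for illustration; nothing here is taken from the PR.

```python
import time
from collections import deque
from dataclasses import dataclass, field
from typing import Any, Deque, Dict, Optional


@dataclass
class PolicyMeta:
    # Well-defined meta information maintained on the policy's behalf.
    last_updated_time: Optional[float] = None
    # Bounded history: deque(maxlen=...) drops the oldest entry when full.
    decisions: Deque[int] = field(default_factory=lambda: deque(maxlen=10))


@dataclass
class PolicyStateSketch:
    # Free-form portion where a policy can keep arbitrary state.
    kv: Dict[str, Any] = field(default_factory=dict)
    meta: PolicyMeta = field(default_factory=PolicyMeta)

    def record_decision(self, num_replicas: int, now: Optional[float] = None) -> None:
        """Append a scaling decision and stamp the update time."""
        self.meta.decisions.append(num_replicas)
        self.meta.last_updated_time = time.time() if now is None else now
```

Keeping the metadata in its own typed structure means a policy can write anything into `kv` without being able to clobber the bookkeeping fields by accident.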
Closing for now until we figure out a redesign.
Why are these changes needed?
- Create AutoscalingContext to enclose the context in which the autoscaling policy runs.
- Rename curr_target_num_replicas to current_target_num_replicas and should_autoscale to is_autoscaling_policy_enabled.
Related issue number
Third PR for #41135