feat: support autoscaling metrics when deploying models #1197
Conversation
@sararob I see that you have added the do-not-merge label to all open PRs in the repo. Let me know if there's anything specific that you want addressed in this PR. My team needs this feature added to the Python SDK to use Vertex AI.

@sasha-gitg Could you take a look?
Thanks for this contribution! Could you add unit tests in `tests/unit/aiplatform/test_endpoints.py` for these additions? You can add the following tests:
- `test_deploy_with_autoscaling_target_cpu_utilization`
- `test_deploy_with_autoscaling_target_accelerator_duty_cycle`
- `test_deploy_with_autoscaling_target_accelerator_duty_cycle_and_no_accelerator_type_or_count_raises` (to ensure this raises a `ValueError`)

Let me know if you need any help adding these.
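As a quick illustration of what these tests exercise, here is a minimal sketch of how the new `deploy` parameters are meant to be used once this PR lands; the project, endpoint, and model resource names are illustrative assumptions, not values from this thread:

```python
from google.cloud import aiplatform

# Illustrative resource names; substitute real project, endpoint, and model IDs.
endpoint = aiplatform.Endpoint(
    "projects/my-project/locations/us-central1/endpoints/1234567890"
)
model = aiplatform.Model(
    "projects/my-project/locations/us-central1/models/0987654321"
)

endpoint.deploy(
    model=model,
    machine_type="n1-standard-4",
    min_replica_count=1,
    max_replica_count=5,
    # New in this PR: target CPU utilization (percent) for autoscaling.
    autoscaling_target_cpu_utilization=70,
)
```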
google/cloud/aiplatform/models.py (Outdated)

```diff
@@ -980,6 +1011,12 @@ def _deploy_call(
                 "Both `accelerator_type` and `accelerator_count` should be specified or None."
             )

         if not accelerator_type or not accelerator_count and autoscaling_target_accelerator_duty_cycle:
```
I don't think this check is working as expected (I tried deploying a model with only `autoscaling_target_cpu_utilization` set and it raised this error). You can change it to something like:

```python
if autoscaling_target_accelerator_duty_cycle is not None and (
    not accelerator_type or not accelerator_count
):
```
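The failure mode here is Python operator precedence: `and` binds tighter than `or`, so the original condition is truthy whenever `accelerator_type` is unset, even when no duty-cycle target was requested. A self-contained sketch of the difference:

```python
accelerator_type = None
accelerator_count = None
autoscaling_target_accelerator_duty_cycle = None  # no duty-cycle target requested

# The original check parses as:
#   (not accelerator_type) or ((not accelerator_count) and duty_cycle)
# so it is truthy here even though no duty cycle was set.
buggy = not accelerator_type or (
    not accelerator_count and autoscaling_target_accelerator_duty_cycle
)

# The suggested check only fires when a duty-cycle target is actually provided.
fixed = autoscaling_target_accelerator_duty_cycle is not None and (
    not accelerator_type or not accelerator_count
)

print(bool(buggy), fixed)  # True False
```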
Fixed in 9501eb0
```diff
@@ -840,6 +853,13 @@ def _deploy(
                 be immediately returned and synced when the Future has completed.
             deploy_request_timeout (float):
                 Optional. The timeout for the deploy request in seconds.
             autoscaling_target_cpu_utilization (int):
                 Target CPU Utilization to use for Autoscaling Replicas.
```
Could you add "Optional" to the beginning of this docstring? Same for `autoscaling_target_accelerator_duty_cycle`.
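A sketch of the requested wording, assuming the `Args:` conventions used elsewhere in `models.py` (the merged phrasing may differ):

```python
# In the Args section of the _deploy docstring:
#
#     autoscaling_target_cpu_utilization (int):
#         Optional. Target CPU Utilization to use for Autoscaling Replicas.
#     autoscaling_target_accelerator_duty_cycle (int):
#         Optional. Target Accelerator Duty Cycle to use for Autoscaling Replicas.
#         Requires `accelerator_type` and `accelerator_count` to be specified.
```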
Fixed in 7a601c7
@sararob Thank you for the review. I have added the tests. PTAL.
Thanks for adding tests. I left a few comments; once these are addressed and you merge the latest from `main` into your branch, this should be ready to merge after the presubmit tests pass.
```python
        deploy_request_timeout=None,
        autoscaling_target_accelerator_duty_cycle=70
    )
```
Could you add the following after the `deploy` call so the test passes with `sync=False`?

```python
if not sync:
    test_endpoint.wait()
```
None of the other tests that raise errors from the deploy operation wait for the operation to complete, except for one: `test_deploy_raise_error_traffic_80`. Can you confirm whether you need me to add the wait call?
Yes, this test needs the wait call. The ones that don't need it are testing invalid parameter values (negative numbers and values > 100 for `traffic_percentage`), so the LRO will never start in those cases.
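To see why the `wait()` call matters, here is an analogy using the standard library rather than the SDK internals (a sketch, not the SDK's actual mechanism): with `sync=False`, `deploy` hands the work to a background thread and returns immediately, so an exception raised during deployment only surfaces when the caller blocks on the result, which is what `wait()` does in the test.

```python
from concurrent.futures import ThreadPoolExecutor

def deploy():
    # Stands in for the validation that runs inside the async deploy path.
    raise ValueError("`accelerator_type` and `accelerator_count` must be set")

with ThreadPoolExecutor(max_workers=1) as executor:
    future = executor.submit(deploy)  # returns immediately; no exception yet
    try:
        # Blocking on the result re-raises the worker's exception,
        # analogous to calling test_endpoint.wait() in the test.
        future.result()
    except ValueError as exc:
        print(f"surfaced on result(): {exc}")
```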
Fixed in 8a28d7a
```python
    )
    test_endpoint.deploy(
        model=test_model,
        machine_type=_TEST_MACHINE_TYPE,
```
You can remove the `machine_type`, `service_account`, and `deploy_request_timeout` parameters from this call.
Fixed in fbceb92
```python
    )

    expected_autoscaling_metric_spec = gca_machine_resources.AutoscalingMetricSpec(
        metric_name="aiplatform.googleapis.com/prediction/online/cpu/utilization",
```
Could you save this `metric_name` string to a variable at the top of the file, something like `_TEST_METRIC_NAME_CPU_UTILIZATION`?
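A sketch of the module-level constants this suggests; the GPU constant name matches the one listed in the final commit summary below:

```python
# At the top of tests/unit/aiplatform/test_endpoints.py:
_TEST_METRIC_NAME_CPU_UTILIZATION = (
    "aiplatform.googleapis.com/prediction/online/cpu/utilization"
)
_TEST_METRIC_NAME_GPU_UTILIZATION = (
    "aiplatform.googleapis.com/prediction/online/accelerator/duty_cycle"
)
```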
Fixed in e14d1ae
```python
    )

    expected_autoscaling_metric_spec = gca_machine_resources.AutoscalingMetricSpec(
        metric_name="aiplatform.googleapis.com/prediction/online/accelerator/duty_cycle",
```
Same comment as above re: the `metric_name` string.
Fixed in e14d1ae
Thank you for reviewing the changes. I don't have enough bandwidth this week; I will send patches early next week (however, feel free to take over the PR if required).

@sararob I have addressed the review comments. PTAL.
After adding the `wait()` call in `test_deploy_with_autoscaling_target_accelerator_duty_cycle_and_no_accelerator_type_or_count_raises`, this should be good to go.
@sararob Thank you for the review and clarifications. I have updated the PR, PTAL.
The unit tests passed; you just need to run the linter. You can do this in the root of the repo with:

```sh
pip3 install nox && nox -s lint
```

Then push a commit with the changes. Or, if you give me push access to your branch, I can run it and merge the PR. Thanks!
@sararob Thank you for the review. I have fixed the linting error, and the lint check passes locally. Feel free to take over the PR if there are any other issues.
Squashed commit message:
* feat: support autoscaling metrics when deploying models
* feat: support model deploy to endpoint with autoscaling metrics
* fix autoscaling_target_accelerator_duty_cycle check
* fix docstring: specify that autoscaling_params are optional
* bug fix: add autoscaling_target_cpu_utilization to custom_resource_spec
* add tests
* add _TEST_METRIC_NAME_CPU_UTILIZATION and _TEST_METRIC_NAME_GPU_UTILIZATION
* remove not required arguments in tests
* fix tests: wait for LRO to complete even if not sync
* fix lint: run black

Co-authored-by: Sara Robinson <[email protected]>
Fixes #1198