
feat: support autoscaling metrics when deploying models #1197

Merged
merged 14 commits into googleapis:main from munagekar:feat/autoscaling-endpoint on May 23, 2022

Conversation

munagekar
Contributor

@munagekar munagekar commented May 5, 2022

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

  • Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
  • Ensure the tests and linter pass
  • Code coverage does not decrease (if any source code was changed)
  • Appropriate docs were updated (if necessary)

Fixes #1198
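
For context, this PR adds two optional autoscaling parameters to Endpoint.deploy(). A minimal usage sketch (the project, endpoint, and model IDs are hypothetical; the parameter name is taken from this PR):

    from google.cloud import aiplatform

    aiplatform.init(project="my-project", location="us-central1")

    endpoint = aiplatform.Endpoint("1234567890")  # hypothetical endpoint ID
    model = aiplatform.Model("0987654321")        # hypothetical model ID

    # Autoscale replicas toward 70% average CPU utilization.
    endpoint.deploy(
        model=model,
        machine_type="n1-standard-4",
        min_replica_count=1,
        max_replica_count=4,
        autoscaling_target_cpu_utilization=70,
    )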

@product-auto-label product-auto-label bot added size: m Pull request size is medium. api: vertex-ai Issues related to the googleapis/python-aiplatform API. labels May 5, 2022
@munagekar munagekar changed the title support autoscaling metrics when deploying models feat: support autoscaling metrics when deploying models May 5, 2022
@munagekar munagekar marked this pull request as ready for review May 5, 2022 03:38
@sararob sararob added the do not merge Indicates a pull request not ready for merge, due to either quality or timing. label May 5, 2022
@munagekar
Contributor Author

munagekar commented May 5, 2022

@sararob I see that you have added the do-not-merge label to all open PRs in the repo. Let me know if there's anything specific that you want addressed in this PR.

My team needs this feature added to the Python SDK in order to use Vertex AI.

@munagekar
Contributor Author

@sasha-gitg Could you take a look?

@sararob sararob removed the do not merge Indicates a pull request not ready for merge, due to either quality or timing. label May 10, 2022
@sasha-gitg sasha-gitg requested a review from sararob May 10, 2022 18:42
Contributor

@sararob sararob left a comment

Thanks for this contribution! Could you add unit tests in tests/unit/aiplatform/test_endpoints.py for these additions? You can add the following tests:

  • test_deploy_with_autoscaling_target_cpu_utilization
  • test_deploy_with_autoscaling_target_accelerator_duty_cycle
  • test_deploy_with_autoscaling_target_accelerator_duty_cycle_and_no_accelerator_type_or_count_raises (to ensure this raises a ValueError)

Let me know if you need any help adding these.
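
For example, a skeleton of the ValueError test might look like the sketch below (names like test_endpoint and test_model mirror the snippets quoted later in this thread; the surrounding fixtures and mocks are omitted):

    @pytest.mark.parametrize("sync", [True, False])
    def test_deploy_with_autoscaling_target_accelerator_duty_cycle_and_no_accelerator_type_or_count_raises(
        self, sync
    ):
        # Requesting a duty-cycle target without accelerator_type/count
        # must raise a ValueError.
        with pytest.raises(ValueError):
            test_endpoint.deploy(
                model=test_model,
                autoscaling_target_accelerator_duty_cycle=70,
                sync=sync,
            )
            # With sync=False the error only surfaces once the LRO is awaited.
            if not sync:
                test_endpoint.wait()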

@@ -980,6 +1011,12 @@ def _deploy_call(
"Both `accelerator_type` and `accelerator_count` should be specified or None."
)

if not accelerator_type or not accelerator_count and autoscaling_target_accelerator_duty_cycle:
Contributor

I don't think this check is working as expected (I tried deploying a model with only autoscaling_target_cpu_utilization set and it raised this error). You can change it to something like:

if autoscaling_target_accelerator_duty_cycle is not None and (not accelerator_type or not accelerator_count):

Contributor Author

Fixed in 9501eb0
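
For background, a small runnable demo of the precedence bug (the values mirror the failing case the reviewer hit, where only the CPU target was set):

    # `and` binds tighter than `or`, so the original check parsed as
    # `(not accelerator_type) or ((not accelerator_count) and duty_cycle)`.
    accelerator_type = None
    accelerator_count = None
    autoscaling_target_accelerator_duty_cycle = None  # user set only the CPU target

    buggy = (
        not accelerator_type
        or not accelerator_count
        and autoscaling_target_accelerator_duty_cycle
    )
    fixed = autoscaling_target_accelerator_duty_cycle is not None and (
        not accelerator_type or not accelerator_count
    )
    print(buggy)  # True  -- raised even though no duty-cycle target was given
    print(fixed)  # False -- raises only when the duty-cycle target is set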

@@ -840,6 +853,13 @@ def _deploy(
be immediately returned and synced when the Future has completed.
deploy_request_timeout (float):
Optional. The timeout for the deploy request in seconds.
autoscaling_target_cpu_utilization (int):
Target CPU Utilization to use for Autoscaling Replicas.
Contributor

Could you add "Optional" to the beginning of this docstring? Same for autoscaling_target_accelerator_duty_cycle.

Contributor Author

Fixed in 7a601c7
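
For reference, the amended docstrings read along these lines (a sketch extrapolated from the hunk above and this review comment, not the exact merged text):

    autoscaling_target_cpu_utilization (int):
        Optional. Target CPU Utilization to use for Autoscaling Replicas.
    autoscaling_target_accelerator_duty_cycle (int):
        Optional. Target Accelerator Duty Cycle to use for Autoscaling Replicas.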

@munagekar munagekar requested a review from sararob May 11, 2022 09:56
@munagekar
Contributor Author

munagekar commented May 11, 2022

Could you add unit tests in tests/unit/aiplatform/test_endpoints.py for these additions?

@sararob Thank you for the review. I have added the tests. PTAL.

Contributor

@sararob sararob left a comment

Thanks for adding tests. Left a few comments; once these are addressed and you merge the latest from main into your branch, this should be ready to merge after presubmit tests pass.

deploy_request_timeout=None,
autoscaling_target_accelerator_duty_cycle=70
)

Contributor

Could you add the following after the deploy call so the test passes with sync=False?

if not sync:
    test_endpoint.wait()

Contributor Author

@munagekar munagekar May 17, 2022

None of the other tests that raise errors from the deploy operation wait for the operation to complete, except for one:

def test_deploy_raise_error_traffic_80

Can you confirm whether you need me to add the wait call?

Contributor

Yes, this test needs the wait call. The ones that don't need it are testing invalid parameter values (negative numbers and > 100 for traffic_percentage), so the LRO will never start in those cases.

Contributor Author

Fixed in 8a28d7a

)
test_endpoint.deploy(
model=test_model,
machine_type=_TEST_MACHINE_TYPE,
Contributor

You can remove the machine_type, service_account, and deploy_request_timeout parameters from this call.

Contributor Author

Fixed in fbceb92

)

expected_autoscaling_metric_spec = gca_machine_resources.AutoscalingMetricSpec(
metric_name="aiplatform.googleapis.com/prediction/online/cpu/utilization",
Contributor

Could you save this metric_name string to a variable at the top of the file, something like _TEST_METRIC_NAME_CPU_UTILIZATION?

Contributor Author

Fixed in e14d1ae

)

expected_autoscaling_metric_spec = gca_machine_resources.AutoscalingMetricSpec(
metric_name="aiplatform.googleapis.com/prediction/online/accelerator/duty_cycle",
Contributor

Same comment as above re: metric_name string.

Contributor Author

Fixed in e14d1ae
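
Per the commit messages in the squashed history below, e14d1ae added module-level constants for both strings. A sketch of what they would look like (values taken verbatim from the two assertions above; the GPU constant's name comes from the commit message):

    _TEST_METRIC_NAME_CPU_UTILIZATION = (
        "aiplatform.googleapis.com/prediction/online/cpu/utilization"
    )
    _TEST_METRIC_NAME_GPU_UTILIZATION = (
        "aiplatform.googleapis.com/prediction/online/accelerator/duty_cycle"
    )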

@munagekar
Contributor Author

munagekar commented May 13, 2022

Thank you for reviewing the changes. I don't have enough bandwidth this week; I will send in patches early next week (feel free to take over the PR if required).

@munagekar munagekar requested a review from sararob May 17, 2022 10:23
@munagekar
Contributor Author

munagekar commented May 17, 2022

@sararob I have addressed the review comments. PTAL.

Contributor

@sararob sararob left a comment

After adding the wait() call in test_deploy_with_autoscaling_target_accelerator_duty_cycle_and_no_accelerator_type_or_count_raises, this should be good to go.

@munagekar munagekar requested a review from sararob May 17, 2022 23:39
@munagekar
Contributor Author

@sararob Thank you for the review and clarifications. Updated the PR, PTAL.

@sararob sararob added the kokoro:force-run Add this label to force Kokoro to re-run the tests. label May 19, 2022
@yoshi-kokoro yoshi-kokoro removed the kokoro:force-run Add this label to force Kokoro to re-run the tests. label May 19, 2022
Contributor

@sararob sararob left a comment

The unit tests passed; you just need to run the linter. You can do this in the root of the repo with:

pip3 install nox && nox -s lint

And then push a commit with the changes. Or if you give me push access to your branch I can run it and merge the PR. Thanks!

@munagekar
Contributor Author

@sararob Thank you for the review. I have fixed the linting error; the lint check passes locally.

  • I have already allowed edits by maintainers of googleapis/python-aiplatform for this PR.
  • Additionally, I have sent you a collaborator request for munagekar/python-aiplatform.

Feel free to take over the PR if there are any other issues.

@munagekar munagekar requested a review from sararob May 20, 2022 13:48
@sararob sararob added the kokoro:force-run Add this label to force Kokoro to re-run the tests. label May 20, 2022
@yoshi-kokoro yoshi-kokoro removed the kokoro:force-run Add this label to force Kokoro to re-run the tests. label May 20, 2022
@sararob sararob added the kokoro:force-run Add this label to force Kokoro to re-run the tests. label May 23, 2022
@yoshi-kokoro yoshi-kokoro removed the kokoro:force-run Add this label to force Kokoro to re-run the tests. label May 23, 2022
@sararob sararob merged commit 095717c into googleapis:main May 23, 2022
@munagekar munagekar deleted the feat/autoscaling-endpoint branch May 24, 2022 03:55
SinaChavoshi pushed a commit to SinaChavoshi/python-aiplatform that referenced this pull request Jun 23, 2022

* feat: support autoscaling metrics when deploying models

* feat: support model deploy to endpoint with autoscaling metrics

* fix autoscaling_target_accelerator_duty_cycle check

* fix docstring: specify that autoscaling_params are optional

* bug fix: add autoscaling_target_cpu_utilization to custom_resource_spec

* add tests

* add _TEST_METRIC_NAME_CPU_UTILIZATION and _TEST_METRIC_NAME_GPU_UTILIZATION

* remove not required arguments in tests

* fix tests: wait for LRO to complete even if not sync

* fix lint: run black

Co-authored-by: Sara Robinson <[email protected]>