-
Notifications
You must be signed in to change notification settings - Fork 350
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Added default AutoMLTabularTrainingJob column transformations #357
feat: Added default AutoMLTabularTrainingJob column transformations #357
Conversation
chore: merge main into dev
40f354e
to
435a98d
Compare
435a98d
to
3300faa
Compare
@@ -86,9 +86,9 @@ def __init__( | |||
raise ValueError("One of gcs_source or bq_source must be set.") | |||
|
|||
if gcs_source: | |||
dataset_metadata = {"input_config": {"gcs_source": {"uri": gcs_source}}} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These were wrong before
get_dataset_mock.return_value = gca_dataset.Dataset( | ||
display_name=_TEST_DISPLAY_NAME, | ||
metadata_schema_uri=_TEST_METADATA_SCHEMA_URI_TABULAR, | ||
metadata=_TEST_METADATA_TABULAR_GCS, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mock with GCS source
get_dataset_mock.return_value = gca_dataset.Dataset( | ||
display_name=_TEST_DISPLAY_NAME, | ||
metadata_schema_uri=_TEST_METADATA_SCHEMA_URI_TABULAR, | ||
metadata=None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mock with no metadata
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great. Tested column_names and it works nicely.
Left a few requested changes. Thanks!
header_line = line[:first_new_line_index] | ||
|
||
# Split to make it an iterable | ||
header_line = header_line.split("\n") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It may be safer to only include the first line header_line.split("\n")[:1]
to avoid possible parsing errors down stream.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, will do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
@@ -2918,10 +2917,19 @@ def _run( | |||
|
|||
training_task_definition = schema.training_job.definition.automl_tabular | |||
|
|||
if self._column_transformations is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please log here we are defaulting to auto for all columns as column_transformations was not provided.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense, will add.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
INFO:google.cloud.aiplatform.training_jobs:No column transformations provided, so now retrieving columns from dataset in order to set default column transformations.
INFO:google.cloud.aiplatform.training_jobs:The column transformation of type 'auto' was set for the following columns': ['station_number', 'wban_number', 'year', 'month', 'day', 'num_mean_temp_samples', 'mean_dew_point', 'num_mean_dew_point_samples', 'mean_sealevel_pressure', 'num_mean_sealevel_pressure_samples', 'mean_station_pressure', 'num_mean_station_pressure_samples', 'mean_visibility', 'num_mean_visibility_samples', 'mean_wind_speed', 'num_mean_wind_speed_samples', 'max_sustained_wind_speed', 'max_gust_wind_speed', 'max_temperature', 'max_temperature_explicit', 'min_temperature', 'min_temperature_explicit', 'total_precipitation', 'snow_depth', 'fog', 'rain', 'snow', 'hail', 'thunder', 'tornado'].
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sasha-gitg Does this look okay or is it too verbose?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought it would be nice to show the names so the user can verify the columns.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
8937abc
to
9e66508
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great. Added a few comments. Thanks Ivan!
line = "" | ||
|
||
try: | ||
logging.disable(logging.CRITICAL) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Preference to be more precise by filtering the module logs we're trying to suppress. Like so:
python-aiplatform/google/cloud/aiplatform/initializer.py
Lines 172 to 176 in 857f63d
logger = logging.getLogger("google.auth._default") | |
logging_warning_filter = utils.LoggingWarningFilter() | |
logger.addFilter(logging_warning_filter) | |
credentials, _ = google.auth.default() | |
logger.removeFilter(logging_warning_filter) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
from google.cloud.aiplatform import datasets | ||
from google.cloud.aiplatform.datasets import _datasources | ||
from google.cloud.aiplatform import initializer | ||
from google.cloud.aiplatform import schema | ||
from google.cloud.aiplatform import utils | ||
|
||
from typing import List | ||
import logging |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Import should be up top with stdlib imports.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
This PR (#357) introduced a mock that uses google.cloud.storage.blob.Blob.download_as_bytes. This method was introduced in 1.32.0.
* Bump google-cloud-storage min version to 1.32.0 This PR (#357) introduced a mock that uses google.cloud.storage.blob.Blob.download_as_bytes. This method was introduced in 1.32.0. * Bumped version in constraints-3.6.txt
Testing code
fixes: https://b.corp.google.com/issues/181042526