feat: Added default AutoMLTabularTrainingJob column transformations #357

Merged

@ivanmkc ivanmkc commented Apr 29, 2021

  • Functionality to default to all columns for GCS sources
  • Functionality to default to all columns for BQ sources
  • Unit tests
  • Check if target column needs to be omitted
  • Test with live BQ datasources
  • Test with live GCS datasources
  • Test with end-to-end CUJ

Testing code

# Train
from google.cloud import aiplatform
from google.cloud.aiplatform import training_jobs

ds = aiplatform.TabularDataset.create(display_name="sample-gsod", bq_source="bq://bigquery-public-data.samples.gsod")

job = training_jobs.AutoMLTabularTrainingJob(
    display_name="tabular_regression_job_test",
    optimization_objective="minimize-rmse",
    optimization_prediction_type="regression",
    column_transformations=None,
    optimization_objective_recall_value=None,
    optimization_objective_precision_value=None,
)

model_from_job = job.run(
    dataset=ds,
    target_column="mean_temp",
    model_display_name="tabular_model",
    sync=True,
)

fixes: https://b.corp.google.com/issues/181042526

@ivanmkc ivanmkc requested a review from a team as a code owner April 29, 2021 22:15
@product-auto-label product-auto-label bot added the api: aiplatform Issues related to the AI Platform API. label Apr 29, 2021
@google-cla google-cla bot added the cla: yes This human has signed the Contributor License Agreement. label Apr 29, 2021
@ivanmkc ivanmkc force-pushed the imkc--automl-tabular-default-transformations branch 2 times, most recently from 40f354e to 435a98d Compare April 29, 2021 22:33
@ivanmkc ivanmkc force-pushed the imkc--automl-tabular-default-transformations branch from 435a98d to 3300faa Compare April 29, 2021 22:35
@ivanmkc ivanmkc changed the title Added default AutoMLTabularTrainingJob default transformations fix: Added default AutoMLTabularTrainingJob default transformations Apr 29, 2021
@ivanmkc ivanmkc changed the title fix: Added default AutoMLTabularTrainingJob default transformations feat: Added default AutoMLTabularTrainingJob default transformations Apr 29, 2021
@@ -86,9 +86,9 @@ def __init__(
    raise ValueError("One of gcs_source or bq_source must be set.")

if gcs_source:
    dataset_metadata = {"input_config": {"gcs_source": {"uri": gcs_source}}}
Contributor Author

These were wrong before

get_dataset_mock.return_value = gca_dataset.Dataset(
    display_name=_TEST_DISPLAY_NAME,
    metadata_schema_uri=_TEST_METADATA_SCHEMA_URI_TABULAR,
    metadata=_TEST_METADATA_TABULAR_GCS,
Contributor Author

Mock with GCS source

get_dataset_mock.return_value = gca_dataset.Dataset(
    display_name=_TEST_DISPLAY_NAME,
    metadata_schema_uri=_TEST_METADATA_SCHEMA_URI_TABULAR,
    metadata=None,
Contributor Author

Mock with no metadata

@ivanmkc ivanmkc changed the title feat: Added default AutoMLTabularTrainingJob default transformations [WIP] feat: Added default AutoMLTabularTrainingJob default transformations Apr 29, 2021
@ivanmkc ivanmkc changed the title [WIP] feat: Added default AutoMLTabularTrainingJob default transformations feat: Added default AutoMLTabularTrainingJob default transformations Apr 30, 2021
@ivanmkc ivanmkc changed the title feat: Added default AutoMLTabularTrainingJob default transformations feat: Added default AutoMLTabularTrainingJob column transformations Apr 30, 2021
Member

@sasha-gitg sasha-gitg left a comment

Looks great. Tested column_names and it works nicely.

Left a few requested changes. Thanks!

header_line = line[:first_new_line_index]

# Split to make it an iterable
header_line = header_line.split("\n")
Member

It may be safer to include only the first line, header_line.split("\n")[:1], to avoid possible parsing errors downstream.

Contributor Author

Sure, will do.

Contributor Author

Done
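
The header-parsing suggestion in this thread can be sketched as follows. This is a minimal illustration, not the code merged in this PR: `parse_header_columns` is a hypothetical helper name, and it assumes the first chunk of the file is UTF-8 text whose header fits on a single line.

```python
import csv
from typing import List


def parse_header_columns(first_chunk: bytes, encoding: str = "utf-8") -> List[str]:
    """Return the column names found on the first line of a CSV payload.

    Hypothetical helper for illustration; not part of the SDK.
    """
    text = first_chunk.decode(encoding)

    # Keep only the first line so that data rows cannot leak into the parse.
    header_lines = text.splitlines()[:1]

    # csv.reader accepts any iterable of strings and handles quoted fields.
    rows = list(csv.reader(header_lines))
    if not rows:
        raise ValueError("CSV payload has no header line.")
    return rows[0]


if __name__ == "__main__":
    sample = b'station_number,"mean,temp",year\n1,2.5,1999\n'
    print(parse_header_columns(sample))
```

Passing a one-element list to `csv.reader` is what the `[:1]` slice buys: even if the downloaded chunk contains several rows, only the header is parsed.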

google/cloud/aiplatform/datasets/tabular_dataset.py (outdated, resolved)
@@ -2918,10 +2917,19 @@ def _run(

training_task_definition = schema.training_job.definition.automl_tabular

if self._column_transformations is None:
Member

Please log here that we are defaulting to auto for all columns because column_transformations was not provided.

Contributor Author

Makes sense, will add.

Contributor Author

Done

Contributor Author

@ivanmkc ivanmkc May 4, 2021

INFO:google.cloud.aiplatform.training_jobs:No column transformations provided, so now retrieving columns from dataset in order to set default column transformations.
INFO:google.cloud.aiplatform.training_jobs:The column transformation of type 'auto' was set for the following columns': ['station_number', 'wban_number', 'year', 'month', 'day', 'num_mean_temp_samples', 'mean_dew_point', 'num_mean_dew_point_samples', 'mean_sealevel_pressure', 'num_mean_sealevel_pressure_samples', 'mean_station_pressure', 'num_mean_station_pressure_samples', 'mean_visibility', 'num_mean_visibility_samples', 'mean_wind_speed', 'num_mean_wind_speed_samples', 'max_sustained_wind_speed', 'max_gust_wind_speed', 'max_temperature', 'max_temperature_explicit', 'min_temperature', 'min_temperature_explicit', 'total_precipitation', 'snow_depth', 'fog', 'rain', 'snow', 'hail', 'thunder', 'tornado'].

Contributor Author

@sasha-gitg Does this look okay or is it too verbose?

Contributor Author

I thought it would be nice to show the names so the user can verify the columns.

Member

LGTM
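
The defaulting behaviour discussed in this thread can be sketched roughly as below. This is an illustration under assumptions, not the merged implementation: `resolve_column_transformations` is a hypothetical function name, the `{"auto": {"column_name": ...}}` dictionary shape is assumed from the SDK's column transformation format, and the log wording is paraphrased. The target column is left out, which matches the log output above (it lists no `mean_temp` entry).

```python
import logging
from typing import Dict, List, Optional

_LOGGER = logging.getLogger(__name__)

# Assumed shape of one transformation: {"auto": {"column_name": "year"}}
ColumnTransformation = Dict[str, Dict[str, str]]


def resolve_column_transformations(
    column_transformations: Optional[List[ColumnTransformation]],
    column_names: List[str],
    target_column: str,
) -> List[ColumnTransformation]:
    """Return the user's transformations, or 'auto' for every non-target column.

    Hypothetical helper for illustration; not part of the SDK.
    """
    if column_transformations is not None:
        # The caller was explicit, so nothing is defaulted.
        return column_transformations

    _LOGGER.info(
        "No column transformations provided; defaulting to 'auto' for all columns."
    )

    # The target column is what the model predicts, so it is not transformed.
    feature_columns = [name for name in column_names if name != target_column]

    _LOGGER.info("Transformation 'auto' set for columns: %s", feature_columns)
    return [{"auto": {"column_name": name}} for name in feature_columns]


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    print(resolve_column_transformations(None, ["year", "month", "mean_temp"], "mean_temp"))
```

Logging the resolved column names, as agreed above, lets the user confirm which columns were picked up without inspecting the dataset.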

@ivanmkc ivanmkc force-pushed the imkc--automl-tabular-default-transformations branch from 8937abc to 9e66508 Compare May 4, 2021 19:50
@ivanmkc ivanmkc requested a review from a team as a code owner May 5, 2021 00:55
Member

@sasha-gitg sasha-gitg left a comment

Looks great. Added a few comments. Thanks Ivan!

google/cloud/aiplatform/datasets/tabular_dataset.py (outdated, resolved)
line = ""

try:
    logging.disable(logging.CRITICAL)
Member

I'd prefer to be more precise and filter only the module logs we're trying to suppress, like so:

logger = logging.getLogger("google.auth._default")
logging_warning_filter = utils.LoggingWarningFilter()
logger.addFilter(logging_warning_filter)
credentials, _ = google.auth.default()
logger.removeFilter(logging_warning_filter)

Contributor Author

Fixed
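
For readers unfamiliar with the pattern suggested above, here is a self-contained sketch using only the standard library. `WarningFilter` and `suppress_warnings` are hypothetical names and only stand in for `utils.LoggingWarningFilter`, whose actual behaviour is not shown in this thread. The sketch assumes the goal is to drop WARNING records from one named logger while leaving every other logger and level untouched.

```python
import contextlib
import logging
from typing import Iterator


class WarningFilter(logging.Filter):
    """Drop WARNING-level records and let every other level through."""

    def filter(self, record: logging.LogRecord) -> bool:
        return record.levelno != logging.WARNING


@contextlib.contextmanager
def suppress_warnings(logger_name: str) -> Iterator[None]:
    """Temporarily silence warnings emitted directly on one named logger."""
    logger = logging.getLogger(logger_name)
    warning_filter = WarningFilter()
    logger.addFilter(warning_filter)
    try:
        yield
    finally:
        # Always detach the filter, even if the wrapped code raises.
        logger.removeFilter(warning_filter)


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    noisy = logging.getLogger("demo.noisy")
    with suppress_warnings("demo.noisy"):
        noisy.warning("this warning is filtered out")
        noisy.error("errors still get through")
    noisy.warning("warnings are visible again")
```

Compared with `logging.disable(logging.CRITICAL)`, which silences every logger in the process, a filter attached to a single logger keeps unrelated errors visible. The `try`/`finally` guarantees the filter is removed if the wrapped call fails.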

from google.cloud.aiplatform import datasets
from google.cloud.aiplatform.datasets import _datasources
from google.cloud.aiplatform import initializer
from google.cloud.aiplatform import schema
from google.cloud.aiplatform import utils

from typing import List
import logging
Member

Import should be up top with stdlib imports.

Contributor Author

Fixed

@ivanmkc ivanmkc merged commit 4fce8c4 into googleapis:master May 5, 2021
ivanmkc added a commit that referenced this pull request May 5, 2021
This PR (#357) introduced a mock that uses google.cloud.storage.blob.Blob.download_as_bytes.

This method was introduced in 1.32.0.
dizcology pushed a commit that referenced this pull request May 6, 2021
* Bump google-cloud-storage min version to 1.32.0

This PR (#357) introduced a mock that uses google.cloud.storage.blob.Blob.download_as_bytes.

This method was introduced in 1.32.0.

* Bumped version in constraints-3.6.txt