feat: Added default AutoMLTabularTrainingJob column transformations #357

ivanmkc · 2021-04-29T22:15:38Z

Functionality to default to all columns for GCS sources
Functionality to default to all columns for BQ sources
Unit tests
Check if target column needs to be omitted
Test with live BQ datasources
Test with live GCS datasources
Test with end-to-end CUJ

Testing code

# Train
from google.cloud import aiplatform
from google.cloud.aiplatform import training_jobs

ds = aiplatform.TabularDataset.create(display_name="sample-gsod", bq_source="bq://bigquery-public-data.samples.gsod")

job = training_jobs.AutoMLTabularTrainingJob(
    display_name="tabular_regression_job_test",
    optimization_objective="minimize-rmse",
    optimization_prediction_type="regression",
    column_transformations=None,
    optimization_objective_recall_value=None,
    optimization_objective_precision_value=None,
)

model_from_job = job.run(
    dataset=ds,
    target_column="mean_temp",
    model_display_name="tabular_model",
    sync=True,
)

fixes: https://b.corp.google.com/issues/181042526

chore: merge main into dev

ivanmkc · 2021-04-29T22:38:38Z

google/cloud/aiplatform/datasets/_datasources.py

@@ -86,9 +86,9 @@ def __init__(
            raise ValueError("One of gcs_source or bq_source must be set.")

        if gcs_source:
-            dataset_metadata = {"input_config": {"gcs_source": {"uri": gcs_source}}}


These were wrong before

ivanmkc · 2021-04-29T22:39:04Z

tests/unit/aiplatform/test_datasets.py

+        get_dataset_mock.return_value = gca_dataset.Dataset(
+            display_name=_TEST_DISPLAY_NAME,
+            metadata_schema_uri=_TEST_METADATA_SCHEMA_URI_TABULAR,
+            metadata=_TEST_METADATA_TABULAR_GCS,


Mock with GCS source

ivanmkc · 2021-04-29T22:39:14Z

tests/unit/aiplatform/test_datasets.py

+        get_dataset_mock.return_value = gca_dataset.Dataset(
+            display_name=_TEST_DISPLAY_NAME,
+            metadata_schema_uri=_TEST_METADATA_SCHEMA_URI_TABULAR,
+            metadata=None,


Mock with no metadata

google/cloud/aiplatform/training_jobs.py

google/cloud/aiplatform/datasets/tabular_dataset.py

sasha-gitg

Looks great. Tested column_names and it works nicely.

Left a few requested changes. Thanks!

google/cloud/aiplatform/datasets/tabular_dataset.py

sasha-gitg · 2021-05-03T15:57:19Z

google/cloud/aiplatform/datasets/tabular_dataset.py

+            header_line = line[:first_new_line_index]
+
+            # Split to make it an iterable
+            header_line = header_line.split("\n")


It may be safer to include only the first line, header_line.split("\n")[:1], to avoid possible parsing errors downstream.

Sure, will do.
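
The suggestion above can be sketched as follows, assuming the first chunk of the CSV has already been downloaded as a string. The helper name `parse_header_columns` is hypothetical and is not the code in this PR; it only illustrates keeping the first line before handing it to `csv.reader`.

```python
import csv
from typing import List


def parse_header_columns(chunk: str) -> List[str]:
    """Return column names from the first line of a CSV text chunk.

    Hypothetical helper for illustration; not the library's implementation.
    """
    # Keep only the first line so partial data rows in the chunk are never parsed.
    header_lines = chunk.split("\n")[:1]
    # csv.reader accepts any iterable of strings and handles quoted fields.
    rows = list(csv.reader(header_lines))
    return rows[0] if rows else []


columns = parse_header_columns('station,"mean, temp",year\n1,2.5,19')
print(columns)
```

Slicing with `[:1]` still yields a list, so the result stays an iterable that `csv.reader` can consume, while any truncated data row after the header is discarded.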

google/cloud/aiplatform/datasets/tabular_dataset.py

sasha-gitg · 2021-05-03T16:33:08Z

google/cloud/aiplatform/training_jobs.py

@@ -2918,10 +2917,19 @@ def _run(

        training_task_definition = schema.training_job.definition.automl_tabular

+        if self._column_transformations is None:


Please log here that we are defaulting to auto for all columns because column_transformations was not provided.

Makes sense, will add.

INFO:google.cloud.aiplatform.training_jobs:No column transformations provided, so now retrieving columns from dataset in order to set default column transformations.
INFO:google.cloud.aiplatform.training_jobs:The column transformation of type 'auto' was set for the following columns': ['station_number', 'wban_number', 'year', 'month', 'day', 'num_mean_temp_samples', 'mean_dew_point', 'num_mean_dew_point_samples', 'mean_sealevel_pressure', 'num_mean_sealevel_pressure_samples', 'mean_station_pressure', 'num_mean_station_pressure_samples', 'mean_visibility', 'num_mean_visibility_samples', 'mean_wind_speed', 'num_mean_wind_speed_samples', 'max_sustained_wind_speed', 'max_gust_wind_speed', 'max_temperature', 'max_temperature_explicit', 'min_temperature', 'min_temperature_explicit', 'total_precipitation', 'snow_depth', 'fog', 'rain', 'snow', 'hail', 'thunder', 'tornado'].

@sasha-gitg Does this look okay or is it too verbose?

I thought it would be nice to show the names so the user can verify the columns.
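
For readers following along, here is a rough sketch of how such a default could be built from the dataset's column names. The function name `default_auto_transformations` is hypothetical, and the `{"auto": {"column_name": ...}}` shape is an assumption about the column_transformations format rather than something shown in this thread. The target column is skipped, which matches the log above (it lists no `mean_temp` entry).

```python
from typing import Dict, List


def default_auto_transformations(
    column_names: List[str], target_column: str
) -> List[Dict[str, Dict[str, str]]]:
    """Build one 'auto' transformation per column, skipping the target.

    Hypothetical helper; the dict shape is an assumed format.
    """
    return [
        {"auto": {"column_name": name}}
        for name in column_names
        if name != target_column
    ]


transformations = default_auto_transformations(
    ["year", "month", "mean_temp", "fog"], target_column="mean_temp"
)
print(transformations)
```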

sasha-gitg

Looks great. Added a few comments. Thanks Ivan!

google/cloud/aiplatform/datasets/tabular_dataset.py

sasha-gitg · 2021-05-05T17:38:04Z

google/cloud/aiplatform/datasets/tabular_dataset.py

+        line = ""
+
+        try:
+            logging.disable(logging.CRITICAL)


I'd prefer to be more precise by filtering only the module logs we're trying to suppress, like so:

python-aiplatform/google/cloud/aiplatform/initializer.py

Lines 172 to 176 in 857f63d

logger = logging.getLogger("google.auth._default")
logging_warning_filter = utils.LoggingWarningFilter()
logger.addFilter(logging_warning_filter)
credentials, _ = google.auth.default()
logger.removeFilter(logging_warning_filter)
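
A self-contained sketch of the same pattern using only the standard library. `SuppressWarningsFilter` and `call_quietly` are hypothetical names; the filter is a stand-in for `utils.LoggingWarningFilter`, whose exact behavior is not shown here and is assumed to drop WARNING records.

```python
import logging


class SuppressWarningsFilter(logging.Filter):
    """Drop WARNING records; let every other level through.

    Hypothetical stand-in for utils.LoggingWarningFilter.
    """

    def filter(self, record: logging.LogRecord) -> bool:
        return record.levelno != logging.WARNING


def call_quietly(logger_name: str, func):
    """Run func with WARNING records from one named logger suppressed."""
    logger = logging.getLogger(logger_name)
    warning_filter = SuppressWarningsFilter()
    logger.addFilter(warning_filter)
    try:
        return func()
    finally:
        # Always detach the filter so later warnings are not hidden.
        logger.removeFilter(warning_filter)
```

Unlike `logging.disable(logging.CRITICAL)`, which silences every logger in the process, this touches a single named logger, and the `finally` clause restores it even if the wrapped call raises.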

sasha-gitg · 2021-05-05T17:38:27Z

google/cloud/aiplatform/datasets/tabular_dataset.py

 from google.cloud.aiplatform import datasets
 from google.cloud.aiplatform.datasets import _datasources
 from google.cloud.aiplatform import initializer
 from google.cloud.aiplatform import schema
 from google.cloud.aiplatform import utils

+from typing import List
+import logging


Import should be up top with stdlib imports.

* Bump google-cloud-storage min version to 1.32.0. This PR (#357) introduced a mock that uses google.cloud.storage.blob.Blob.download_as_bytes. This method was introduced in 1.32.0.
* Bumped version in constraints-3.6.txt

sasha-gitg and others added 8 commits April 19, 2021 18:08

Merge pull request googleapis#339 from sasha-gitg/dev

c2caaa6

chore: merge main into dev

Added default column_transformation code

29bcc70

Added docstrings

5ce67e2

Added tests and moved code to tabular_dataset

4b96837

Switched to using BigQuery.Table instead of custom SQL query

b68e58c

Fixed bigquery unit test

6a0ac30

Added GCS test

af0b990

Fixed issues with incorrect input config parameter

ea5ef12

ivanmkc requested a review from a team as a code owner April 29, 2021 22:15

product-auto-label bot added the api: aiplatform Issues related to the AI Platform API. label Apr 29, 2021

google-cla bot added the cla: yes This human has signed the Contributor License Agreement. label Apr 29, 2021

ivanmkc force-pushed the imkc--automl-tabular-default-transformations branch 2 times, most recently from 40f354e to 435a98d April 29, 2021 22:33

Added test for AutoMLTabularTrainingJob for no transformations

3300faa

ivanmkc force-pushed the imkc--automl-tabular-default-transformations branch from 435a98d to 3300faa April 29, 2021 22:35

ivanmkc changed the title ~~Added default AutoMLTabularTrainingJob default transformations~~ fix: Added default AutoMLTabularTrainingJob default transformations Apr 29, 2021

ivanmkc changed the title ~~fix: Added default AutoMLTabularTrainingJob default transformations~~ feat: Added default AutoMLTabularTrainingJob default transformations Apr 29, 2021

ivanmkc commented Apr 29, 2021

Added comment

783d2ea

ivanmkc changed the title ~~feat: Added default AutoMLTabularTrainingJob default transformations~~ [WIP] feat: Added default AutoMLTabularTrainingJob default transformations Apr 29, 2021

ivanmkc commented Apr 30, 2021

google/cloud/aiplatform/training_jobs.py

ivanmkc commented Apr 30, 2021

google/cloud/aiplatform/datasets/tabular_dataset.py

ivanmkc commented Apr 30, 2021

google/cloud/aiplatform/datasets/tabular_dataset.py

Fixed test

ae5dfa1

ivanmkc commented Apr 30, 2021

google/cloud/aiplatform/datasets/tabular_dataset.py

ivanmkc changed the title ~~[WIP] feat: Added default AutoMLTabularTrainingJob default transformations~~ feat: Added default AutoMLTabularTrainingJob default transformations Apr 30, 2021

ivanmkc changed the title ~~feat: Added default AutoMLTabularTrainingJob default transformations~~ feat: Added default AutoMLTabularTrainingJob column transformations Apr 30, 2021

Ran linter

332e3e2

sasha-gitg requested changes May 3, 2021

Switched from classmethod to staticmethod where applicable and logged column names

9e66508

ivanmkc force-pushed the imkc--automl-tabular-default-transformations branch from 8937abc to 9e66508 May 4, 2021 19:50

ivanmkc added 3 commits May 4, 2021 17:56

Added extra dataset tests

17e9f37

Added logging suppression

c4f9d6a

Fixed lint errors

819cda8

ivanmkc requested a review from a team as a code owner May 5, 2021 00:55

sasha-gitg approved these changes May 5, 2021

Switched logging filter method

c2ece02

ivanmkc merged commit 4fce8c4 into googleapis:master May 5, 2021

ivanmkc added a commit that referenced this pull request May 5, 2021

Bump google-cloud-storage min version to 1.32.0

9ae2c6e

This PR (#357) introduced a mock that uses google.cloud.storage.blob.Blob.download_as_bytes. This method was introduced in 1.32.0.

ivanmkc mentioned this pull request May 5, 2021

fix: Bump google-cloud-storage min version to 1.32.0 #371

Merged
