feat: Added column_specs, training_encryption_spec_key_name, model_encryption_spec_key_name to AutoMLForecastingTrainingJob.init and various split methods to AutoMLForecastingTrainingJob.run #647

ivanmkc · 2021-08-20T05:52:56Z

Added support for column_specs, training_encryption_spec_key_name, model_encryption_spec_key_name to AutoMLForecastingTrainingJob.init.
Added support for training_fraction_split, validation_fraction_split, test_fraction_split, predefined_split_column_name, timestamp_split_column_name to AutoMLForecastingTrainingJob.run.
Refactored to allow for code reuse.
Added unit tests

Looking at Forecasting vs Tabular, there are many similarities, but since each has an independent source of truth in its own YAML file, I don't think subclassing is the way to go.

There is no guarantee that they will not diverge in the future.

Therefore, I think the right approach to reuse code is to use composition over inheritance.
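
A minimal sketch of what composition could look like here, assuming a hypothetical `ColumnNamesHelper` that both job classes hold as an attribute rather than inherit from. All class and method names below are illustrative and are not the SDK's actual API:

```python
from typing import Iterable, List


class ColumnNamesHelper:
    """Hypothetical shared helper that knows about a dataset's columns."""

    def __init__(self, column_names: Iterable[str]):
        self._column_names = list(column_names)

    def feature_columns(self, target_column: str) -> List[str]:
        # Every column except the target is treated as a feature.
        return [name for name in self._column_names if name != target_column]


class TabularJobSketch:
    """Stand-in for a tabular job; it holds the helper instead of subclassing."""

    def __init__(self, columns: ColumnNamesHelper):
        self._columns = columns

    def feature_columns(self, target_column: str) -> List[str]:
        return self._columns.feature_columns(target_column)


class ForecastingJobSketch:
    """Stand-in for a forecasting job; free to diverge from the tabular one."""

    def __init__(self, columns: ColumnNamesHelper):
        self._columns = columns

    def feature_columns(self, target_column: str) -> List[str]:
        return self._columns.feature_columns(target_column)
```

Because neither job class inherits from the other, a change to one job's arguments does not ripple into the other; only the helper is shared.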

tests/unit/aiplatform/test_automl_tabular_training_jobs.py

ivanmkc · 2021-08-20T14:24:01Z

google/cloud/aiplatform/training_jobs.py

+            Predefined splits:
+            Assigns input data to training, validation, and test sets based on the value of a provided key.
+            If using predefined splits, ``predefined_split_column_name`` must be provided.
+            Supported only for tabular Datasets.


Maybe this should include time-series Datasets.

@thehardikv can you confirm that fraction splits, predefined splits and timestamp splits are all supported by AutoMLForecasting?

Confirmed. But we don't really support timestamp splits. Our fraction splits are already time ordered. So internally, timestamp splits are converted into fraction splits. But setting timestamp splits won't throw an error, which I believe addresses the concern behind the question.

Okay, will leave as-is.

@thehardikv To confuse users less, I can just always set timestamp_split_column_name to None then, right?

Do you mean as the default value?

Yes, basically remove timestamp_split_column_name from being set by the user, and internally always pass it as None to the service.

Should we remove the timestamp_split_column_name arg from run based on this conversation?

Yes, I will do that.
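
The outcome of this thread can be sketched as follows: the run method would accept only the fraction and predefined split arguments, and the timestamp split column would always be sent as None. The function name `build_split_arguments` and the returned dictionary layout are assumptions for illustration, not the SDK's internals:

```python
from typing import Any, Dict, Optional


def build_split_arguments(
    training_fraction_split: Optional[float] = None,
    validation_fraction_split: Optional[float] = None,
    test_fraction_split: Optional[float] = None,
    predefined_split_column_name: Optional[str] = None,
) -> Dict[str, Any]:
    """Collects split settings for a forecasting run (hypothetical helper)."""
    return {
        "training_fraction_split": training_fraction_split,
        "validation_fraction_split": validation_fraction_split,
        "test_fraction_split": test_fraction_split,
        "predefined_split_column_name": predefined_split_column_name,
        # Not user-settable: per the discussion, fraction splits are already
        # time ordered, so the timestamp column is always passed as None.
        "timestamp_split_column_name": None,
    }
```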

google/cloud/aiplatform/training_jobs.py

google/cloud/aiplatform/datasets/column_names_dataset.py

google/cloud/aiplatform/training_jobs.py

ivanmkc · 2021-08-20T17:02:19Z

I'm not sure why the unit test requirements have changed, resulting in a bunch of lint errors on the Presubmit - Unit Tests check.

ivanmkc · 2021-08-20T17:06:51Z

google/cloud/aiplatform/datasets/column_names_dataset.py

@@ -0,0 +1,344 @@
+# -*- coding: utf-8 -*-


This is mostly copied from AutoMLTabularDataset

ivanmkc · 2021-08-20T17:08:03Z

google/cloud/aiplatform/datasets/column_names_dataset.py

+            )
+        }
+
+    def _get_default_column_transformations(


Unsure if this should be "private" or not.

In Java, it would be "protected", since a subclass accesses it.

It seems like this should be a method on the training job.

It was originally. I think you're right, Dataset doesn't need to know about target_column.

I'll move it back.

It just means we'll have to have a ColumnNamesTrainingJob or something to hold this.
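
Wherever it ends up living, the behavior being moved can be sketched on its own. This assumes the default is an `auto` transformation for every column other than the target; the function name and signature are hypothetical:

```python
from typing import Dict, List


def default_column_transformations(
    column_names: List[str], target_column: str
) -> List[Dict[str, Dict[str, str]]]:
    """Builds one 'auto' transformation per non-target column."""
    return [
        {"auto": {"column_name": name}}
        for name in column_names
        if name != target_column  # the target column gets no transformation
    ]
```

Written this way, the function needs only the column names and the target, so the Dataset does not have to know about target_column.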

google/cloud/aiplatform/datasets/column_names_dataset.py

sasha-gitg · 2021-08-20T18:01:02Z

google/cloud/aiplatform/datasets/column_names_dataset.py

+
+from google.cloud.aiplatform import utils
+
+from typing import Dict, List, Optional, Tuple


Some of these are imported twice.

google/cloud/aiplatform/datasets/column_names_dataset.py

sasha-gitg · 2021-08-20T18:13:22Z

google/cloud/aiplatform/datasets/column_names_dataset.py

+
+    def _get_default_column_transformations(
+        self, target_column: str,
+    ) -> Tuple[Dict, List[str]]:


What's the purpose of returning both the column names and the column transformations? It seems like this should just return the column transformations.

Suggested change

) -> Tuple[Dict, List[str]]:

) -> Tuple[Dict[str, Dict[str, Union[bool, str]]], List[str]]:

I debated with myself on this. It'd be cleaner just to return the transformations, but the column names are used for the log message at the callsite.

An alternative is to have the callsite extract the column names from the transformations to do the logging.
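
That alternative might look like the sketch below, assuming each transformation is a single-key dict of the form `{transformation_type: {"column_name": ...}}`. The helper name is hypothetical:

```python
from typing import Dict, List


def column_names_from_transformations(
    column_transformations: List[Dict[str, Dict[str, str]]]
) -> List[str]:
    """Recovers column names from transformations, e.g. for a log message."""
    names = []
    for transformation in column_transformations:
        # Each dict maps a transformation type to its parameters.
        for params in transformation.values():
            names.append(params["column_name"])
    return names
```

With this, the method could return only the transformations and the callsite could derive the names it logs.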

google/cloud/aiplatform/datasets/column_names_dataset.py

sasha-gitg · 2021-08-20T18:17:56Z

google/cloud/aiplatform/datasets/column_names_dataset.py

+    @staticmethod
+    def _validate_and_get_column_transformations(
+        column_specs: Optional[Dict[str, str]],
+        column_transformations: Optional[Union[Dict, List[Dict]]],


The column_transformations type hint can be qualified further, as in the comment above.
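
A sketch of the validation this method implies, assuming `column_specs` maps a column name to a transformation type and that the two arguments are mutually exclusive, as the docstring states. This is illustrative and not the code under review:

```python
from typing import Dict, List, Optional

Transformations = List[Dict[str, Dict[str, str]]]


def validate_and_get_column_transformations(
    column_specs: Optional[Dict[str, str]] = None,
    column_transformations: Optional[Transformations] = None,
) -> Optional[Transformations]:
    """Returns transformations from whichever argument was given, if any."""
    if column_specs is not None and column_transformations is not None:
        raise ValueError(
            "Only one of column_transformations or column_specs should be passed."
        )
    if column_transformations is not None:
        return column_transformations
    if column_specs is not None:
        # Convert {column_name: type} into [{type: {"column_name": column_name}}].
        return [
            {transformation: {"column_name": name}}
            for name, transformation in column_specs.items()
        ]
    return None
```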

sasha-gitg · 2021-08-20T18:22:46Z

google/cloud/aiplatform/training_jobs.py

        predefined_split_column_name: Optional[str] = None,
+        timestamp_split_column_name: Optional[str] = None,


What was the behavior before these split columns were added? Did it split on the time_series_identifier_column?

@thehardikv do you know?

It defaults to fraction split.

sasha-gitg · 2021-08-20T18:30:13Z

google/cloud/aiplatform/datasets/column_names_dataset.py

+            )
+        }
+
+    def _get_default_column_transformations(


It seems like this should be a method on the training job.

google/cloud/aiplatform/datasets/__init__.py

ivanmkc · 2021-08-31T00:16:24Z

google/cloud/aiplatform/training_jobs.py

            predefined_split_column_name (str):
                Optional. The key is a name of one of the Dataset's data
                columns. The value of the key (either the label's value or
-                value in the column) must be one of {``TRAIN``,
-                ``VALIDATE``, ``TEST``}, and it defines to which set the
+                value in the column) must be one of {``training``,


Fixed to match proto at google/cloud/aiplatform_v1/types/training_pipeline.py

google/cloud/aiplatform/training_jobs.py

sasha-gitg · 2021-10-04T19:50:41Z

google/cloud/aiplatform/training_jobs.py

@@ -3070,7 +3070,7 @@ def __init__(
                ignored by the training, except for the targetColumn, which should have
                no transformations defined on.
                Only one of column_transformations or column_specs should be passed.
-            column_transformations (Union[Dict, List[Dict]]):
+            column_transformations (List[Dict[str, Dict[str, str]]]):


This type doesn't seem to match the type hint in the function signature.

google/cloud/aiplatform/datasets/column_names_dataset.py

google/cloud/aiplatform/training_jobs.py

sasha-gitg · 2021-10-04T19:59:25Z

google/cloud/aiplatform/training_jobs.py

+            Predefined splits:
+            Assigns input data to training, validation, and test sets based on the value of a provided key.
+            If using predefined splits, ``predefined_split_column_name`` must be provided.
+            Supported only for tabular Datasets.


Should we remove the timestamp_split_column_name arg from run based on this conversation?

google/cloud/aiplatform/utils/column_transformations_utils.py

ivanmkc · 2021-10-04T23:35:36Z

@sasha-gitg Fixed all issues and then some.

ivanmkc requested a review from a team as a code owner August 20, 2021 05:52

product-auto-label bot added the api: aiplatform Issues related to the AI Platform API. label Aug 20, 2021

ivanmkc changed the title ~~Refactored AutoMLForecastingTrainingJob and AutoMLTabularTrainingJob~~ feat: Refactored AutoMLForecastingTrainingJob and AutoMLTabularTrainingJob Aug 20, 2021

google-cla bot added the cla: yes This human has signed the Contributor License Agreement. label Aug 20, 2021

ivanmkc changed the title ~~feat: Refactored AutoMLForecastingTrainingJob and AutoMLTabularTrainingJob~~ [WIP] feat: Refactored AutoMLForecastingTrainingJob and AutoMLTabularTrainingJob Aug 20, 2021

ivanmkc changed the title ~~[WIP] feat: Refactored AutoMLForecastingTrainingJob and AutoMLTabularTrainingJob~~ feat: Refactored AutoMLForecastingTrainingJob and AutoMLTabularTrainingJob Aug 20, 2021

ivanmkc changed the title ~~feat: Refactored AutoMLForecastingTrainingJob and AutoMLTabularTrainingJob~~ feat: Refactored/Improved AutoMLForecastingTrainingJob and AutoMLTabularTrainingJob Aug 20, 2021

sasha-gitg requested changes Aug 20, 2021

ivanmkc force-pushed the imkc--automlforecast-refactor2 branch 2 times, most recently from 49673fe to 8184673 Compare August 30, 2021 22:09

ivanmkc force-pushed the imkc--automlforecast-refactor2 branch from 517a3e9 to 97a958e Compare August 31, 2021 00:09

ghost approved these changes Aug 31, 2021

ivanmkc force-pushed the imkc--automlforecast-refactor2 branch from 97a958e to 29873eb Compare September 2, 2021 18:17

ivanmkc force-pushed the imkc--automlforecast-refactor2 branch from f06505b to de5beb9 Compare September 9, 2021 21:24

sasha-gitg reviewed Oct 4, 2021

ivanmkc mentioned this pull request Oct 4, 2021

refactor: forecasting training job #597

Closed

ivanmkc added 3 commits October 4, 2021 17:00

Extracted column_names code from AutoMLTabularTrainingJob for reuse

ef3df08

Added missing parameters to AutoMLForecast

7466eae

Fixed tests and added encryption spec

0bc4b20

ivanmkc added 8 commits October 4, 2021 17:00

Added missing docstrings

a2812a2

Made _ColumnNamesDataset subclass _Dataset

4f40adb

Fixed docstrings

1d94c27

Moved transformations code out of column_names_dataset

a4582a1

Minor fixes

ce9b3ea

Cleanup

f2ddedf

Ran linter

b6c3648

Fix lint issue

f4def56

ivanmkc force-pushed the imkc--automlforecast-refactor2 branch from 0d9c694 to f4def56 Compare October 4, 2021 21:00

ivanmkc added 6 commits October 4, 2021 17:05

Removed timestamp_split_column_name from AutoMLForecasting

1d80363

Cleaned up typehints

1780a72

Fixed test

7d98acc

Ran linter

250ff1d

Ran lint

e6deae9

Added more docstrings for raising exceptions

80adc7c

sasha-gitg approved these changes Oct 5, 2021

Merge branch 'main' into imkc--automlforecast-refactor2

249f3f8

ivanmkc merged commit 7cb6976 into googleapis:main Oct 6, 2021
