Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Added column_specs, training_encryption_spec_key_name, model_encryption_spec_key_name to AutoMLForecastingTrainingJob.init and various split methods to AutoMLForecastingTrainingJob.run #647

Merged
merged 18 commits into from
Oct 6, 2021

Conversation

ivanmkc
Copy link
Contributor

@ivanmkc ivanmkc commented Aug 20, 2021

  • Added support for column_specs, training_encryption_spec_key_name, model_encryption_spec_key_name to AutoMLForecastingTrainingJob.init.
  • Added support for training_fraction_split, validation_fraction_split, test_fraction_split, predefined_split_column_name, timestamp_split_column_name to AutoMLForecastingTrainingJob.run.
  • Refactored to allow for code reuse.
  • Added unit tests

Looking at Forecasting vs Tabular, they are many similarities, but since each has an independent source of truth in their own respective YAML files, I don't think subclassing it the way to go.

There is no guarantee that they will not diverge in the future.

Therefore, I think the right approach to reuse code is to use composition over inheritance.

@ivanmkc ivanmkc requested a review from a team as a code owner August 20, 2021 05:52
@product-auto-label product-auto-label bot added the api: aiplatform Issues related to the AI Platform API. label Aug 20, 2021
@ivanmkc ivanmkc changed the title Refactored AutoMLForecastingTrainingJob and AutoMLTabularTrainingJob feat: Refactored AutoMLForecastingTrainingJob and AutoMLTabularTrainingJob Aug 20, 2021
@google-cla google-cla bot added the cla: yes This human has signed the Contributor License Agreement. label Aug 20, 2021
Predefined splits:
Assigns input data to training, validation, and test sets based on the value of a provided key.
If using predefined splits, ``predefined_split_column_name`` must be provided.
Supported only for tabular Datasets.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this should include time-series Datasets.

@thehardikv can you confirm that fraction splits, predefined splits and timestamp splits are all supported by AutoMLForecasting?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Confirmed. But we don't really support timestamp splits. Our fraction splits are already time ordered. So internally, timestamp splits are converted into fraction splits. But setting timestamp splits won't throw an error, which I believe addresses the concern behind the question.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, will leave as-is.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@thehardikv to confuse users less, I can just always set timestamp_split_column_name to None then right?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean as the default value?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, basically remove timestamp_split_column_name from being set by the user, and internally always pass it as None to the service.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we remove the timestamp_split_column_name arg from run based on this conversation?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I will do that.

@ivanmkc ivanmkc changed the title feat: Refactored AutoMLForecastingTrainingJob and AutoMLTabularTrainingJob [WIP] feat: Refactored AutoMLForecastingTrainingJob and AutoMLTabularTrainingJob Aug 20, 2021
@ivanmkc ivanmkc changed the title [WIP] feat: Refactored AutoMLForecastingTrainingJob and AutoMLTabularTrainingJob feat: Refactored AutoMLForecastingTrainingJob and AutoMLTabularTrainingJob Aug 20, 2021
@ivanmkc
Copy link
Contributor Author

ivanmkc commented Aug 20, 2021

I'm not sure why the unit tests requirements have changed, resulting in a bunch of lint errors on the Presubmit - Unit Tests check.

@ivanmkc ivanmkc changed the title feat: Refactored AutoMLForecastingTrainingJob and AutoMLTabularTrainingJob feat: Refactored/Improved AutoMLForecastingTrainingJob and AutoMLTabularTrainingJob Aug 20, 2021
@@ -0,0 +1,344 @@
# -*- coding: utf-8 -*-
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is mostly copied from AutoMLTabularDataset

)
}

def _get_default_column_transformations(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unsure if this should be "private" or not.

In Java, it would be "protected", since a subclass accesses it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like this should be a method on the training job.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was originally. I think you're right, Dataset doesn't need to know about target_column.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll move it back.

It just means we'll have to have a ColumnNamesTrainingJob or something to hold this.

google/cloud/aiplatform/datasets/column_names_dataset.py Outdated Show resolved Hide resolved

from google.cloud.aiplatform import utils

from typing import Dict, List, Optional, Tuple
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some of these are imported twice.


def _get_default_column_transformations(
self, target_column: str,
) -> Tuple[Dict, List[str]]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the purpose of returning the column names and the column transformations? It seem like this should just return the column transformations.

Suggested change
) -> Tuple[Dict, List[str]]:
) -> Tuple[Dict[str, Dict[str, Union[bool, str]], List[str]]:

Copy link
Contributor Author

@ivanmkc ivanmkc Aug 20, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I debated with myself on this. It'd be cleaner just to return the transformations, but the column names are used for the log message at the callsite.

An alternative is to have the callsite extract the column names from the transformations to do the logging.

google/cloud/aiplatform/datasets/column_names_dataset.py Outdated Show resolved Hide resolved
@staticmethod
def _validate_and_get_column_transformations(
column_specs: Optional[Dict[str, str]],
column_transformations: Optional[Union[Dict, List[Dict]]],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Column transformation can be qualified further like the comment above.

predefined_split_column_name: Optional[str] = None,
timestamp_split_column_name: Optional[str] = None,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What was the previous behavior before adding these split columns? Did it split on the time_series_identifier_column?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@thehardikv do you know?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It defaults to fraction split.

)
}

def _get_default_column_transformations(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like this should be a method on the training job.

google/cloud/aiplatform/datasets/__init__.py Show resolved Hide resolved
@ivanmkc ivanmkc force-pushed the imkc--automlforecast-refactor2 branch 2 times, most recently from 49673fe to 8184673 Compare August 30, 2021 22:09
@ivanmkc ivanmkc changed the title feat: Refactored/Improved AutoMLForecastingTrainingJob and AutoMLTabularTrainingJob feat: Added column_specs, training_encryption_spec_key_name, model_encryption_spec_key_name to AutoMLForecastingTrainingJob.init and training_fraction_split, validation_fraction_split, test_fraction_split, predefined_split_column_name, timestamp_split_column_name to AutoMLForecastingTrainingJob.run Aug 31, 2021
@ivanmkc ivanmkc force-pushed the imkc--automlforecast-refactor2 branch from 517a3e9 to 97a958e Compare August 31, 2021 00:09
predefined_split_column_name (str):
Optional. The key is a name of one of the Dataset's data
columns. The value of the key (either the label's value or
value in the column) must be one of {``TRAIN``,
``VALIDATE``, ``TEST``}, and it defines to which set the
value in the column) must be one of {``training``,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed to match proto at google/cloud/aiplatform_v1/types/training_pipeline.py

google/cloud/aiplatform/training_jobs.py Outdated Show resolved Hide resolved
@ivanmkc ivanmkc force-pushed the imkc--automlforecast-refactor2 branch from 97a958e to 29873eb Compare September 2, 2021 18:17
@ivanmkc ivanmkc changed the title feat: Added column_specs, training_encryption_spec_key_name, model_encryption_spec_key_name to AutoMLForecastingTrainingJob.init and training_fraction_split, validation_fraction_split, test_fraction_split, predefined_split_column_name, timestamp_split_column_name to AutoMLForecastingTrainingJob.run feat: Added column_specs, training_encryption_spec_key_name, model_encryption_spec_key_name to AutoMLForecastingTrainingJob.init and various split methods to AutoMLForecastingTrainingJob.run Sep 3, 2021
@ivanmkc ivanmkc force-pushed the imkc--automlforecast-refactor2 branch from f06505b to de5beb9 Compare September 9, 2021 21:24
@@ -3070,7 +3070,7 @@ def __init__(
ignored by the training, except for the targetColumn, which should have
no transformations defined on.
Only one of column_transformations or column_specs should be passed.
column_transformations (Union[Dict, List[Dict]]):
column_transformations (List[Dict[str, Dict[str, str]]]):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This type doesn't seem to match the type hint in the function signature.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

google/cloud/aiplatform/datasets/column_names_dataset.py Outdated Show resolved Hide resolved
google/cloud/aiplatform/training_jobs.py Show resolved Hide resolved
Predefined splits:
Assigns input data to training, validation, and test sets based on the value of a provided key.
If using predefined splits, ``predefined_split_column_name`` must be provided.
Supported only for tabular Datasets.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we remove the timestamp_split_column_name arg from run based on this conversation?

@ivanmkc ivanmkc force-pushed the imkc--automlforecast-refactor2 branch from 0d9c694 to f4def56 Compare October 4, 2021 21:00
@ivanmkc
Copy link
Contributor Author

ivanmkc commented Oct 4, 2021

@sasha-gitg Fixed all issues and then some.

@ivanmkc ivanmkc merged commit 7cb6976 into googleapis:main Oct 6, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: aiplatform Issues related to the AI Platform API. cla: yes This human has signed the Contributor License Agreement.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants