-
Notifications
You must be signed in to change notification settings - Fork 354
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Added column_specs, training_encryption_spec_key_name, model_encryption_spec_key_name to AutoMLForecastingTrainingJob.init and various split methods to AutoMLForecastingTrainingJob.run #647
Conversation
Predefined splits: | ||
Assigns input data to training, validation, and test sets based on the value of a provided key. | ||
If using predefined splits, ``predefined_split_column_name`` must be provided. | ||
Supported only for tabular Datasets. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe this should include time-series Datasets.
@thehardikv can you confirm that fraction splits, predefined splits and timestamp splits are all supported by AutoMLForecasting?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Confirmed. But we don't really support timestamp splits. Our fraction splits are already time ordered. So internally, timestamp splits are converted into fraction splits. But setting timestamp splits won't throw an error, which I believe addresses the concern behind the question.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, will leave as-is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@thehardikv to confuse users less, I can just always set timestamp_split_column_name
to None then right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean as the default value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, basically remove timestamp_split_column_name
from being set by the user, and internally always pass it as None to the service.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we remove the timestamp_split_column_name
arg from run
based on this conversation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I will do that.
I'm not sure why the unit tests requirements have changed, resulting in a bunch of lint errors on the |
@@ -0,0 +1,344 @@ | |||
# -*- coding: utf-8 -*- |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is mostly copied from AutoMLTabularDataset
) | ||
} | ||
|
||
def _get_default_column_transformations( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unsure if this should be "private" or not.
In Java, it would be "protected", since a subclass accesses it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like this should be a method on the training job.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was originally. I think you're right, Dataset doesn't need to know about target_column.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll move it back.
It just means we'll have to have a ColumnNamesTrainingJob or something to hold this.
|
||
from google.cloud.aiplatform import utils | ||
|
||
from typing import Dict, List, Optional, Tuple |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some of these are imported twice.
|
||
def _get_default_column_transformations( | ||
self, target_column: str, | ||
) -> Tuple[Dict, List[str]]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the purpose of returning the column names and the column transformations? It seem like this should just return the column transformations.
) -> Tuple[Dict, List[str]]: | |
) -> Tuple[Dict[str, Dict[str, Union[bool, str]], List[str]]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I debated with myself on this. It'd be cleaner just to return the transformations, but the column names are used for the log message at the callsite.
An alternative is to have the callsite extract the column names from the transformations to do the logging.
@staticmethod | ||
def _validate_and_get_column_transformations( | ||
column_specs: Optional[Dict[str, str]], | ||
column_transformations: Optional[Union[Dict, List[Dict]]], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Column transformation can be qualified further like the comment above.
predefined_split_column_name: Optional[str] = None, | ||
timestamp_split_column_name: Optional[str] = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What was the previous behavior before adding these split columns? Did it split on the time_series_identifier_column?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@thehardikv do you know?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It defaults to fraction split.
) | ||
} | ||
|
||
def _get_default_column_transformations( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like this should be a method on the training job.
49673fe
to
8184673
Compare
517a3e9
to
97a958e
Compare
predefined_split_column_name (str): | ||
Optional. The key is a name of one of the Dataset's data | ||
columns. The value of the key (either the label's value or | ||
value in the column) must be one of {``TRAIN``, | ||
``VALIDATE``, ``TEST``}, and it defines to which set the | ||
value in the column) must be one of {``training``, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed to match proto at google/cloud/aiplatform_v1/types/training_pipeline.py
97a958e
to
29873eb
Compare
f06505b
to
de5beb9
Compare
@@ -3070,7 +3070,7 @@ def __init__( | |||
ignored by the training, except for the targetColumn, which should have | |||
no transformations defined on. | |||
Only one of column_transformations or column_specs should be passed. | |||
column_transformations (Union[Dict, List[Dict]]): | |||
column_transformations (List[Dict[str, Dict[str, str]]]): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This type doesn't seem to match the type hint in the function signature.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
Predefined splits: | ||
Assigns input data to training, validation, and test sets based on the value of a provided key. | ||
If using predefined splits, ``predefined_split_column_name`` must be provided. | ||
Supported only for tabular Datasets. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we remove the timestamp_split_column_name
arg from run
based on this conversation?
0d9c694
to
f4def56
Compare
@sasha-gitg Fixed all issues and then some. |
Looking at Forecasting vs Tabular, they are many similarities, but since each has an independent source of truth in their own respective YAML files, I don't think subclassing it the way to go.
There is no guarantee that they will not diverge in the future.
Therefore, I think the right approach to reuse code is to use composition over inheritance.