-
Notifications
You must be signed in to change notification settings - Fork 354
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add filter and timestamp splits #549
Conversation
From the docstring, it's unclear to me what happens when None is passed for Also, can we remove the defaults from private methods and only include on public methods? Seems cleaner. |
I have no objections to either. @sasha-gitg what are your thoughts? |
key=timestamp_split_column_name, | ||
) | ||
|
||
fraction_split = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with the comment that *_fraction_splits need to default to None throughout because we will not be able to differentiate when the user explicitly set those values and when they have been defaulted which is important as we need to determine if we received an invalid set of arguments.
If we agree on that then we need to also update the documentation that if no fraction_spit, timestamp_split, filter_split, or predfined_split is set then we default to a 0.8, 0.1, 0.1 fraction_split.
There needs to be additional logic at the end here when if all splits are None then fallback to a 0.8, 0.1, 0.1 fraction split.
I don't think this is a breaking change as we moved to setting the argument as Optional.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
everything is now none and falls back to 0.8, 0.1, 0.1. with the exception of video, where 0.8, none, 0.2 has been used. we'll have to update the docs to reflect all of this.
test_fraction_split is None, | ||
] | ||
): | ||
raise ValueError( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't seem to be the case: https://github.com/googleapis/python-aiplatform/blob/master/google/cloud/aiplatform_v1/types/training_pipeline.py#L319
If seems like this entire fraction split block can be simplified to:
if any([training_fraction_split, validation_fraction_split, test_fraction_split]):
fraction_split = gca_training_pipeline.FractionSplit(
training_fraction=training_fraction_split,
validation_fraction=validation_fraction_split,
test_fraction=test_fraction_split,
)
I think it's also cleaner to create the default fraction split at the very end instead of sharing the state between fraction_split creation and timeseries_split creation . ie::
if split_configs_count > 1:
raise ValueError(
"""Can only specify one of:
1. training_filter_split, validation_filter_split, test_filter_split OR
2. predefined_split_column_name OR
3. timestamp_split_column_name, training_fraction_split, validation_fraction_split, test_fraction_split OR
4. training_fraction_split, validation_fraction_split, test_fraction_split"""
)
# default
if split_configs_count == 0:
fraction_split = gca_training_pipeline.FractionSplit(
training_fraction=0.8,
validation_fraction=0.1,
test_fraction=0.1,
)
@@ -1653,14 +1829,35 @@ def run( | |||
accelerator_count (int): | |||
The number of accelerators to attach to a worker replica. | |||
training_fraction_split (float): | |||
The fraction of the input data that is to be | |||
used to train the Model. This is ignored if Dataset is not provided. | |||
Optional. The fraction of the input data that is to be used to train |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There should be additional documentation at the top of all public run
method docstrings describing the different splits. Right now, It' just the fraction split that has that treatment.
Optional. The fraction of the input data that is to be used to evaluate | ||
the Model. This is ignored if Dataset is not provided. | ||
training_filter_split (str): | ||
Optional. A filter on DataItems of the Dataset. DataItems that match |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be worth going back through the protos and ensuring any additional information that is not argument specific but split specific is surfaced like the minus sign usage here. This can only be done on public run methods.: https://github.com/googleapis/python-aiplatform/blob/master/google/cloud/aiplatform_v1/types/training_pipeline.py#L348
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. You can put it under where the Data fraction splits
info is in the run
method docstring.
I'll make the final changes sometime over the next few days. |
Closing in favor of #627 |
Changes to training splits:
Future:
succeeds #210