Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add filter and timestamp splits #549

Closed
wants to merge 16 commits into from

Conversation

sirtorry
Copy link
Contributor

@sirtorry sirtorry commented Jul 16, 2021

Changes to training splits:

  • Modifies default behaviour of fraction split
  • Raises error when multiple split types passed
  • Adds filter splits (b/172365904) to image, text and video
  • Adds timestamp split (b/172368070) to tabular

Future:

  • next PR will change forecasting training jobs to extend tabular; therefore, adding timestamp split to forecasting

succeeds #210

@sirtorry sirtorry added the do not merge Indicates a pull request not ready for merge, due to either quality or timing. label Jul 16, 2021
@sirtorry sirtorry self-assigned this Jul 16, 2021
@product-auto-label product-auto-label bot added the api: aiplatform Issues related to the AI Platform API. label Jul 16, 2021
@google-cla google-cla bot added the cla: yes This human has signed the Contributor License Agreement. label Jul 16, 2021
@sirtorry sirtorry marked this pull request as draft July 16, 2021 03:34
@sirtorry sirtorry changed the title feat: add filter and timestamp splits for custom training feat: add filter and timestamp splits Jul 16, 2021
@ivanmkc
Copy link
Contributor

ivanmkc commented Jul 16, 2021

From the docstring, it's unclear to me what happens when None is passed for training_fraction_split.
Followup question, can we default to None?

Also, can we remove the defaults from private methods and only include on public methods? Seems cleaner.

@sirtorry
Copy link
Contributor Author

From the docstring, it's unclear to me what happens when None is passed for training_fraction_split.
Followup question, can we default to None?

Also, can we remove the defaults from private methods and only include on public methods? Seems cleaner.

I have no objections to either. @sasha-gitg what are your thoughts?

@sirtorry sirtorry removed the do not merge Indicates a pull request not ready for merge, due to either quality or timing. label Jul 17, 2021
@sirtorry sirtorry marked this pull request as ready for review July 17, 2021 00:13
google/cloud/aiplatform/training_jobs.py Outdated Show resolved Hide resolved
google/cloud/aiplatform/training_jobs.py Outdated Show resolved Hide resolved
google/cloud/aiplatform/training_jobs.py Outdated Show resolved Hide resolved
google/cloud/aiplatform/training_jobs.py Outdated Show resolved Hide resolved
google/cloud/aiplatform/training_jobs.py Outdated Show resolved Hide resolved
google/cloud/aiplatform/training_jobs.py Outdated Show resolved Hide resolved
google/cloud/aiplatform/training_jobs.py Outdated Show resolved Hide resolved
key=timestamp_split_column_name,
)

fraction_split = None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with the comment that *_fraction_splits need to default to None throughout because we will not be able to differentiate when the user explicitly set those values and when they have been defaulted which is important as we need to determine if we received an invalid set of arguments.

If we agree on that then we need to also update the documentation that if no fraction_spit, timestamp_split, filter_split, or predfined_split is set then we default to a 0.8, 0.1, 0.1 fraction_split.

There needs to be additional logic at the end here when if all splits are None then fallback to a 0.8, 0.1, 0.1 fraction split.

I don't think this is a breaking change as we moved to setting the argument as Optional.

Copy link
Contributor Author

@sirtorry sirtorry Jul 23, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

everything is now none and falls back to 0.8, 0.1, 0.1. with the exception of video, where 0.8, none, 0.2 has been used. we'll have to update the docs to reflect all of this.

google/cloud/aiplatform/training_jobs.py Show resolved Hide resolved
@sirtorry sirtorry requested a review from sasha-gitg July 29, 2021 21:03
test_fraction_split is None,
]
):
raise ValueError(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't seem to be the case: https://github.com/googleapis/python-aiplatform/blob/master/google/cloud/aiplatform_v1/types/training_pipeline.py#L319

If seems like this entire fraction split block can be simplified to:

if any([training_fraction_split, validation_fraction_split, test_fraction_split]):
  fraction_split = gca_training_pipeline.FractionSplit(
                    training_fraction=training_fraction_split,
                    validation_fraction=validation_fraction_split,
                    test_fraction=test_fraction_split,
                )

I think it's also cleaner to create the default fraction split at the very end instead of sharing the state between fraction_split creation and timeseries_split creation . ie::

if split_configs_count > 1:
                raise ValueError(
                    """Can only specify one of:
                        1. training_filter_split, validation_filter_split, test_filter_split OR
                        2. predefined_split_column_name OR
                        3. timestamp_split_column_name, training_fraction_split, validation_fraction_split, test_fraction_split OR
                        4. training_fraction_split, validation_fraction_split, test_fraction_split"""
                )

# default
if split_configs_count == 0:
  fraction_split = gca_training_pipeline.FractionSplit(
                    training_fraction=0.8,
                    validation_fraction=0.1,
                    test_fraction=0.1,
                )          

@@ -1653,14 +1829,35 @@ def run(
accelerator_count (int):
The number of accelerators to attach to a worker replica.
training_fraction_split (float):
The fraction of the input data that is to be
used to train the Model. This is ignored if Dataset is not provided.
Optional. The fraction of the input data that is to be used to train
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There should be additional documentation at the top of all public run method docstrings describing the different splits. Right now, It' just the fraction split that has that treatment.

Optional. The fraction of the input data that is to be used to evaluate
the Model. This is ignored if Dataset is not provided.
training_filter_split (str):
Optional. A filter on DataItems of the Dataset. DataItems that match
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be worth going back through the protos and ensuring any additional information that is not argument specific but split specific is surfaced like the minus sign usage here. This can only be done on public run methods.: https://github.com/googleapis/python-aiplatform/blob/master/google/cloud/aiplatform_v1/types/training_pipeline.py#L348

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. You can put it under where the Data fraction splits info is in the run method docstring.

@sirtorry
Copy link
Contributor Author

sirtorry commented Aug 7, 2021

I'll make the final changes sometime over the next few days.

@sasha-gitg
Copy link
Member

Closing in favor of #627

@sasha-gitg sasha-gitg closed this Aug 18, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: aiplatform Issues related to the AI Platform API. cla: yes This human has signed the Contributor License Agreement.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants